Partition Reassignment in Seconds

Partitions are the core resource model in Apache Kafka®, serving as the primary conduit for client read and write traffic. As the number of partitions in a cluster grows, uneven partition distribution and hotspot partitions can occur. Consequently, partition reassignment is a frequent operation in the day-to-day maintenance of Apache Kafka®.

In Apache Kafka®'s shared-nothing architecture, partition reassignment involves replicating large amounts of data. For example, a Kafka partition written at 100 MiB/s generates approximately 8.2 TiB of data in one day. If this partition needs to be reassigned to another broker, the entire dataset must be replicated; even on nodes with 1 GB/s of network bandwidth, the copy takes hours, leaving Apache Kafka® clusters with virtually no real-time elasticity.
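The arithmetic behind these figures can be sketched as follows (the 100 MiB/s traffic rate and 1 GB/s node bandwidth are the example numbers from above):

```python
# Back-of-the-envelope math for the shared-nothing reassignment cost.
MIB = 1024 ** 2
TIB = 1024 ** 4

write_rate_bps = 100 * MIB        # 100 MiB/s of partition write traffic
seconds_per_day = 24 * 60 * 60

data_per_day = write_rate_bps * seconds_per_day
print(f"data per day: {data_per_day / TIB:.1f} TiB")  # -> data per day: 8.2 TiB

# Copying that dataset at 1 GB/s of node bandwidth:
node_bandwidth_bps = 1_000_000_000  # 1 GB/s
copy_hours = data_per_day / node_bandwidth_bps / 3600
print(f"copy time: {copy_hours:.1f} hours")           # -> copy time: 2.5 hours
```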

Thanks to AutoMQ's shared storage architecture, partition reassignments require syncing only a small amount of data, reducing reassignment time to seconds.

How AutoMQ Achieves Second-Level Partition Reassignments

In AutoMQ's shared storage architecture, almost all data is stored in object storage, with minimal data temporarily residing in WAL storage. During a partition reassignment, only the data that has not yet been uploaded to object storage and is temporarily held in WAL needs to be forcibly uploaded. This process allows the partition to be safely transferred to another node, typically within about 1.5 seconds.
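As a rough illustration of why the forced upload finishes so quickly, assume a few hundred MiB of WAL-resident data and a burst uplink in the low GiB/s range; both figures are illustrative assumptions, not AutoMQ-measured values:

```python
MIB = 1024 ** 2

# Assumed residual WAL data not yet uploaded to object storage (illustrative).
wal_backlog_bytes = 300 * MIB

# Assumed burst network bandwidth to object storage (illustrative).
burst_bandwidth_bps = 2 * 1024 * MIB  # 2 GiB/s

upload_seconds = wal_backlog_bytes / burst_bandwidth_bps
print(f"forced upload: ~{upload_seconds:.2f} s")  # -> forced upload: ~0.15 s
```

At these assumed rates the upload itself is sub-second, which is consistent with an end-to-end reassignment of around 1.5 seconds.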

The specific reassignment process is illustrated in the figure above, using the example of reassigning partition P1 from Broker-0 to Broker-1:

  1. When the KRaft Controller receives a partition reassignment command, it constructs the corresponding PartitionChangeRecord and commits it to the KRaft Log layer. This action removes Broker-0 from the Leader Replica list and adds Broker-1 to the Follower Replica list. Broker-0, upon syncing the KRaft Log and detecting the P1 partition change, initiates the partition shutdown process.

  2. During the partition shutdown, if P1 contains data that has not yet been uploaded to object storage, a forced upload is triggered. In a stably running cluster, this data typically amounts to a few hundred megabytes. Given the burst network bandwidth offered by current cloud providers, this upload usually completes in seconds. Once P1's data upload is complete, the partition can be safely closed and deleted from Broker-0.

  3. After Broker-0 finishes closing the partition, it proactively triggers a leader election. At this point, Broker-1, as the sole remaining replica, is promoted to leader of P1, and the partition recovery process begins.

  4. During partition recovery, P1's metadata is fetched from object storage to restore its checkpoints. Depending on P1's shutdown state (whether it was a clean shutdown), the corresponding data recovery is performed.

  5. At this point, the partition reassignment is complete.
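The five steps above can be condensed into a minimal sketch. All class and function names here are invented for illustration; this models the sequence of the protocol, not AutoMQ's actual internals:

```python
# Hypothetical model of the reassignment flow described above.
class Partition:
    def __init__(self, name, leader, wal_backlog_bytes):
        self.name = name
        self.leader = leader
        self.wal_backlog_bytes = wal_backlog_bytes  # data not yet in object storage
        self.object_storage_bytes = 0               # data durably uploaded

    def force_upload(self):
        # Step 2: flush WAL-resident data to object storage before closing.
        self.object_storage_bytes += self.wal_backlog_bytes
        self.wal_backlog_bytes = 0

def reassign(partition, target_broker):
    # Step 1: the controller records the replica change (modeled implicitly).
    # Step 2: the old leader force-uploads residual WAL data and closes.
    partition.force_upload()
    # Step 3: the sole remaining replica is elected leader.
    partition.leader = target_broker
    # Step 4: the new leader restores checkpoints from object-storage metadata.
    recovered_bytes = partition.object_storage_bytes
    # Step 5: reassignment complete.
    return recovered_bytes

p1 = Partition("P1", leader="Broker-0", wal_backlog_bytes=300 * 1024**2)
recovered = reassign(p1, "Broker-1")
print(p1.leader, p1.wal_backlog_bytes, recovered)
```

The key property the sketch captures is that only the WAL backlog moves during the reassignment; the bulk of the partition's data never leaves object storage.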

Significance of Second-Level Partition Reassignment

In a production environment, a Kafka cluster typically serves multiple applications. Fluctuations in application traffic and partition distribution can cause cluster capacity issues or machine hotspots. Kafka operations personnel need to expand the cluster and reassign hotspot partitions to idle nodes to ensure the availability of cluster services.

The time taken for partition reassignment determines the efficiency of emergency response and maintenance:

  • The shorter the partition reassignment time, the shorter the duration from cluster expansion to meeting capacity demands, and the shorter the service disruption time.

  • Faster partition reassignment results in shorter observation times for operations personnel, enabling quicker operational feedback and decision-making for subsequent actions.

In AutoMQ's architecture, second-level partition reassignment is the foundation for many automation capabilities, including automatic scale-out, scale-in, and continuous self-balancing of traffic.

Quick Experience

Refer to Example: Partition Reassignment in Seconds to try AutoMQ's second-level partition reassignment for yourself.