In a production Apache Kafka® cluster, traffic fluctuations, topic creation and deletion, and broker failures and restarts occur constantly. These changes can leave traffic unevenly distributed across cluster nodes, wasting resources and undermining business stability. Partitions must therefore be proactively reassigned across nodes to keep traffic and data balanced.
Challenges Faced by Open Source Solutions
Apache Kafka has long struggled to achieve data self-balancing. The community offers two solutions:
- The official Apache Kafka partition reassignment tool requires operators to devise the reassignment plan themselves. For clusters with hundreds or thousands of nodes, manually monitoring cluster state and producing a comprehensive reassignment plan is practically impossible.
- The community also offers third-party external plugins such as Cruise Control[1] to help generate reassignment plans. However, Apache Kafka's self-balancing involves many variables (replica distribution, leader traffic distribution, node resource utilization, etc.), and the data synchronization it triggers causes resource contention and can run for hours or days. Existing solutions are therefore complex and slow to act on: applying a self-balancing plan still depends on operator review and continuous monitoring, so they never truly solve Apache Kafka's data self-balancing problem.
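For context, the manual workflow with the official tool looks roughly like the following. The topic name, broker IDs, and file names are illustrative, and the commands assume a reasonably recent Kafka distribution:

```shell
# List the topics whose partitions should be moved (hypothetical topic "orders").
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "orders"}]}
EOF

# Ask the tool to generate a candidate plan targeting brokers 1, 2, and 3.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate

# Save the proposed plan to reassignment.json, review it manually, then execute it.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# Poll until the underlying data copy completes; on large partitions this
# inter-broker replication is what can take hours.
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify
```

Note that generating the plan is the easy part; the operator still has to judge whether the proposed replica placement is safe, and the execute step copies partition data between brokers, which is the source of the resource contention described above.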
AutoMQ’s Architectural Advantages
Thanks to AutoMQ's deep integration with cloud-native capabilities, we have reimplemented Apache Kafka's underlying storage entirely on cloud object storage, upgrading from a Shared Nothing architecture to a Shared Storage architecture. This enables second-level partition reassignment and greatly simplifies the factors a reassignment plan must consider:
- There is no need to consider node disk resources.
- No need to consider the leader distribution and replica distribution of partitions.
- The reassignment of partitions does not involve data synchronization and copying.
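With those constraints gone, planning reduces to spreading per-partition traffic evenly across nodes. The toy greedy balancer below illustrates how small that problem becomes; it is a sketch of the general idea, not AutoMQ's actual algorithm, and the partition names and traffic figures are made up:

```python
from heapq import heappop, heappush


def balance(partition_traffic: dict[str, float], brokers: list[int]) -> dict[str, int]:
    """Assign each partition (heaviest traffic first) to the least-loaded broker.

    With no disk, leader, or replica constraints, a min-heap of broker loads
    is all the state a planner needs.
    """
    heap: list[tuple[float, int]] = [(0.0, b) for b in brokers]
    assignment: dict[str, int] = {}
    for partition, mbps in sorted(partition_traffic.items(), key=lambda kv: -kv[1]):
        load, broker = heappop(heap)          # broker with the least traffic so far
        assignment[partition] = broker
        heappush(heap, (load + mbps, broker))  # account for the traffic it just took on
    return assignment


# Example: four partitions with known throughput (MB/s) across three brokers.
plan = balance({"orders-0": 30.0, "orders-1": 20.0, "clicks-0": 20.0, "clicks-1": 10.0},
               [0, 1, 2])
```

Because moving a partition under Shared Storage is a metadata operation rather than a data copy, a plan like this can be applied in seconds and recomputed frequently as traffic shifts.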
AutoMQ Data Self-Balancing Implementation

AutoMQ Data Self-Balancing Example

NOTE: In this scenario, the partition reassignment cooldown time has been deliberately extended to make the process easier to observe. With the default configuration, traffic balancing completes in roughly one minute.