Continuous Self-Balancing

The term "AutoMQ Kafka" in this article refers specifically to the source-available project automq-for-kafka under the GitHub AutoMQ organization, maintained by AutoMQ CO., LTD.

What Is Data Self-balancing?

In a production Apache Kafka cluster, traffic fluctuations, Topic creation and deletion, and Brokers joining and leaving happen all the time. These changes can leave traffic unevenly distributed across the nodes in the cluster, causing wasted resources and unstable workloads. When this happens, Topic partitions must be actively reassigned across nodes to rebalance traffic and data.
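As a toy illustration of why reassignment helps (the partition names, broker IDs, and traffic figures below are made up for this example), moving a single hot partition can bring per-broker traffic much closer together:

```python
def broker_traffic(assignment, partition_rates):
    """Sum partition throughput (MB/s) per broker."""
    totals = {}
    for partition, broker in assignment.items():
        totals[broker] = totals.get(broker, 0) + partition_rates[partition]
    return totals

# Hypothetical traffic per partition, in MB/s.
rates = {"p0": 50, "p1": 45, "p2": 5, "p3": 20}

# Broker 0 carries 95 MB/s while broker 1 carries only 25 MB/s.
before = {"p0": 0, "p1": 0, "p2": 1, "p3": 1}

# Moving p1 to broker 1 narrows the gap to 50 MB/s vs 70 MB/s.
after = {"p0": 0, "p1": 1, "p2": 1, "p3": 1}
```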

Challenges Faced by Open-source Solutions

Currently, Apache Kafka only provides a partition reassignment tool; the actual reassignment plan must be devised by operations personnel. For a Kafka cluster with hundreds or even thousands of nodes, it is practically impossible for humans to monitor the cluster state and formulate an ideal reassignment plan by hand. The community therefore offers third-party plugins such as Cruise Control for Apache Kafka to help generate reassignment plans. However, the decision process involves a large number of variables (replica distribution, Leader traffic distribution, node resource utilization, etc.), and executing a plan requires data synchronization that contends for cluster resources and can take hours or even days. As a result, existing solutions make complex, slow decisions and still depend on review and continuous monitoring by operations personnel; they cannot truly solve the problems of data self-balancing in Apache Kafka.

Architectural Advantages of AutoMQ for Kafka

Thanks to its deep use of cloud-native capabilities, AutoMQ Kafka completely re-implements Apache Kafka's underlying storage on top of cloud object storage, which brings the following advantages:

  • Complete separation of storage and compute. A Broker only needs to retain a small amount of block storage for the Write-Ahead Log; the rest of the data sinks to object storage and is accessible to every node in the cluster.

  • High availability can be achieved with only a single replica of the partition.

    These advantages greatly simplify the decision factors for a partition reassignment plan:

  • No need to consider the disk resources of the node.

  • No need to consider the distribution of the partition's Leader and replicas.

  • Partition reassignment does not involve data synchronization and copying.

    Therefore, we have the opportunity to implement a built-in, lightweight data self-balancing component in AutoMQ Kafka that continuously monitors the cluster state and automatically executes partition reassignment.
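Because a partition has a single replica and its data lives in shared object storage, a partition move reduces to a metadata change. The sketch below illustrates that idea only; it is not AutoMQ's actual implementation:

```python
# Sketch (assumed semantics): with compute-storage separation and
# single-replica partitions, "moving" a partition means rewriting its
# partition-to-broker mapping. No log segments are copied between nodes.
def reassign(assignment, partition, target_broker):
    """Return a new assignment with one partition moved to target_broker."""
    new_assignment = dict(assignment)
    new_assignment[partition] = target_broker
    return new_assignment  # no data synchronization needed
```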

Implementation of Data Self-balancing in AutoMQ for Kafka

Architecture Diagram

Data Self-balancing Process

Metrics Collection

The AutoMQ Kafka Broker implements Apache Kafka's MetricsReporter interface to listen to all of Kafka's built-in metrics. It periodically samples the metrics of interest (such as network ingress/egress traffic and CPU utilization), pre-aggregates them on the Broker side, serializes the aggregated metrics into Kafka messages, and sends them to a designated internal Topic.
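A minimal sketch of this pipeline, assuming a simple mean aggregation over the sampling window and a JSON payload (the real AutoMQ metric schema and serialization format are not shown here, and the metric names are illustrative):

```python
import json
import statistics

def aggregate_window(samples):
    """Collapse the raw samples of each metric into one mean per window."""
    return {name: statistics.fmean(values) for name, values in samples.items()}

def to_message(broker_id, window):
    """Serialize aggregated metrics as the value of a Kafka message,
    ready to be produced to the internal metrics Topic."""
    return json.dumps({"broker": broker_id, "metrics": window}).encode("utf-8")
```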

The AutoMQ Kafka Controller maintains an in-memory cluster state model describing the current cluster state, including Broker status, Broker resource capacity, and the traffic of the Topic-Partitions hosted by each Broker. By listening to events from the KRaft log, the cluster state model promptly detects state changes of Brokers and Topic-Partitions and updates itself, staying consistent with the real cluster state.

At the same time, the metrics collector in the AutoMQ Kafka Controller consumes messages from the internal Topic in real time, deserializes them into concrete metrics, and applies them to the cluster state model, thereby assembling all the information needed for data self-balancing.
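The state model described above might be sketched roughly as follows; the field names and update hooks are illustrative assumptions, not AutoMQ's actual classes:

```python
# Illustrative sketch of the in-memory cluster state model: it tracks broker
# capacity and per-partition traffic, and is updated from two sources --
# KRaft log events (membership, assignment) and consumed metric messages.
class ClusterModel:
    def __init__(self):
        self.brokers = {}            # broker_id -> capacity (MB/s), from KRaft events
        self.assignment = {}         # (topic, partition) -> broker_id, from KRaft events
        self.partition_traffic = {}  # (topic, partition) -> MB/s, from metric messages

    def on_broker_registered(self, broker_id, capacity_mbps):
        self.brokers[broker_id] = capacity_mbps

    def on_metrics(self, topic_partition, mbps):
        self.partition_traffic[topic_partition] = mbps

    def broker_load(self, broker_id):
        """Aggregate traffic of all partitions currently hosted by a broker."""
        return sum(mbps for tp, mbps in self.partition_traffic.items()
                   if self.assignment.get(tp) == broker_id)
```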

Decision Analysis

The scheduler in the AutoMQ Kafka Controller periodically takes a snapshot of the cluster state model and, based on a set of predefined "goals", identifies Brokers with excessively high or low traffic and attempts to move or swap partitions to rebalance traffic.
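A hedged sketch of a single scheduling pass under one "traffic balance" goal (the real scheduler evaluates multiple goals; the function names, the tolerance value, and the partition-picking heuristic are all assumptions for illustration):

```python
def rebalance_step(assignment, partition_traffic, tolerance=0.10):
    """One scheduling pass: if the most loaded broker exceeds the cluster
    average by more than `tolerance`, move one of its partitions to the
    least loaded broker. Returns the new assignment, or None if balanced.
    Assumes every broker hosts at least one partition."""
    loads = {}
    for tp, broker in assignment.items():
        loads[broker] = loads.get(broker, 0.0) + partition_traffic[tp]
    avg = sum(loads.values()) / len(loads)
    hot = max(loads, key=loads.get)
    cold = min(loads, key=loads.get)
    if loads[hot] <= avg * (1 + tolerance):
        return None  # already within the balanced range
    # Pick the hot broker's partition whose traffic best fills half the gap.
    candidates = [tp for tp, b in assignment.items() if b == hot]
    gap = (loads[hot] - loads[cold]) / 2
    tp = min(candidates, key=lambda t: abs(partition_traffic[t] - gap))
    new_assignment = dict(assignment)
    new_assignment[tp] = cold
    return new_assignment
```

Because partition moves involve no data copy, such a pass can run frequently with a short cooldown, which is what makes the balancing "continuous".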

For the detailed decision process, please refer to

Example of AutoMQ for Kafka Data Self-balancing

Cluster Configuration

Machine type: AWS r6i.large (2C16G, baseline bandwidth 0.781Gbps, burst bandwidth 12.5Gbps)

Cluster: 1 Controller, 3 Brokers

Load Details

  1. In the first phase, distribute traffic evenly across the 3 Brokers, about 40 MB/s each
  2. In the second phase, increase the traffic of Broker-0 to about 80 MB/s and Broker-1 to about 120 MB/s, keeping Broker-2 at 40 MB/s
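A quick sanity check of where phase two should converge: since balancing only redistributes traffic (no data is copied), each Broker should end up near the cluster-wide average.

```python
# Phase-two per-broker traffic from the load details above, in MB/s.
phase_two = {"Broker-0": 80, "Broker-1": 120, "Broker-2": 40}

# Total 240 MB/s spread across 3 Brokers -> about 80 MB/s each.
target = sum(phase_two.values()) / len(phase_two)
print(target)  # prints 80.0
```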

Traffic Curve

As shown in the figure, after the traffic of Broker-0 and Broker-1 increases, automatic load balancing is triggered, and the traffic of the three Brokers gradually converges to a balanced range.

NOTE: In this scenario, the partition scheduling cooldown time was deliberately increased to make the process easier to observe. Under the default configuration, traffic rebalances in about 1 minute.

Quick Experience

Refer to the "Automatic Data Self-balancing" section in Quick Start▸.

Main Configuration Items for S3 Data Self-balancing

Refer to the "Continuous Data Self-balancing Configuration" section in Configuration▸.