Scale-out/in in Minutes

The term "AutoMQ Kafka" in this article refers specifically to the source-available project automq-for-kafka, published under the AutoMQ organization on GitHub by AutoMQ CO., LTD.

Introduction

In the cloud-native era, the node elasticity offered by cloud vendors, such as AWS Auto Scaling Groups or Alibaba Cloud ECS Auto Scaling Groups, makes it possible to scale an online cluster at the node level. Apache Kafka clusters, however, often cannot use this elasticity directly, because scaling involves migrating business traffic: operators have to move the traffic manually, which usually takes hours. For online clusters whose traffic changes frequently, scaling on demand is almost impossible. To keep the cluster stable, operators have little choice but to provision for peak capacity in advance, so that they are not caught unable to scale in time when a traffic peak arrives. This leads to significant resource waste.

How Does AutoMQ for Kafka Achieve Smooth Scaling Within Minutes?

Architecture

With the continuous data self-balancing capability of AutoMQ Kafka (see Continuous Self-Balancing), traffic can be rescheduled within minutes whenever the nodes of an online cluster change.

Expansion

Taking AWS Auto Scaling Groups (ASG) as an example: with traffic threshold monitoring configured, the ASG automatically launches new Broker nodes when cluster traffic reaches the expansion threshold. The Controller then detects that traffic is unbalanced and automatically moves partitions to the newly created Brokers, completing the traffic rebalancing.
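The scale-out flow described above can be illustrated with a small, self-contained sketch. It is not AutoMQ's Controller logic: the `Partition` class, `should_scale_out`, `rebalance_onto`, and the greedy "move the largest partition that fits from the most loaded Broker" rule are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    traffic: float  # observed throughput, e.g. MiB/s (illustrative unit)
    broker: int     # id of the Broker currently hosting the partition


def should_scale_out(partitions, broker_count, threshold_per_broker):
    """Hypothetical threshold check: average traffic per Broker exceeds the limit."""
    total = sum(p.traffic for p in partitions)
    return total / broker_count > threshold_per_broker


def rebalance_onto(partitions, new_broker):
    """Greedy sketch: fill the new Broker up to the cluster-wide average load.

    Repeatedly takes the largest partition that still fits from the currently
    most loaded Broker. Returns the list of (partition, source, target) moves.
    """
    brokers = sorted({p.broker for p in partitions} | {new_broker})
    loads = {b: 0.0 for b in brokers}
    for p in partitions:
        loads[p.broker] += p.traffic
    target = sum(loads.values()) / len(brokers)

    moves = []
    while loads[new_broker] < target:
        source = max((b for b in brokers if b != new_broker), key=lambda b: loads[b])
        room = target - loads[new_broker]
        candidates = [
            p for p in partitions
            if p.broker == source and p.traffic <= room + 1e-9
        ]
        if not candidates:
            break  # nothing on the hottest Broker fits; stop rather than overshoot
        chosen = max(candidates, key=lambda c: c.traffic)
        loads[source] -= chosen.traffic
        loads[new_broker] += chosen.traffic
        moves.append((chosen.name, source, new_broker))
        chosen.broker = new_broker
    return moves
```

For example, with six partitions of 10 MiB/s spread evenly over Brokers 0 and 1, adding Broker 2 yields two moves and leaves every Broker at 20 MiB/s. A real balancer would weigh more signals than a single traffic figure; the point here is only the shape of the detect-then-move loop.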

Example

The following figure shows how the number of Brokers in an AutoMQ Kafka cluster changes as traffic increases. As traffic increases linearly, Brokers are dynamically created and join the cluster to balance the load.

The following figure shows the change in traffic of each Broker node during the increase in traffic. It can be seen that the newly created Broker completes the traffic rebalancing within minutes.

Shrinkage

Again taking AWS Auto Scaling Groups as an example: when cluster traffic falls to the contraction threshold, the Broker node selected for removal is taken offline. The partitions on that Broker are then reassigned to the remaining Brokers in a round-robin manner within a few seconds (for the implementation, see Partition Reassignment in Seconds), completing the graceful shutdown of the Broker and the traffic transfer.
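The round-robin hand-off can be sketched as follows. This is a minimal illustration under stated assumptions, not the AutoMQ implementation: `drain_broker` is a hypothetical helper, the assignment is modeled as a plain dictionary from partition name to Broker id, and partitions are visited in sorted order only to make the result deterministic.

```python
from itertools import cycle


def drain_broker(assignment, brokers, leaving):
    """Reassign every partition on `leaving` to the remaining Brokers in turn.

    assignment: dict mapping partition name -> Broker id (hypothetical model)
    brokers:    all Broker ids currently in the cluster
    leaving:    id of the Broker being scaled in
    Returns a new assignment dict; the input is left unmodified.
    """
    remaining = sorted(b for b in brokers if b != leaving)
    if not remaining:
        raise ValueError("cannot drain the last Broker in the cluster")

    targets = cycle(remaining)  # round-robin over the surviving Brokers
    new_assignment = dict(assignment)
    for partition in sorted(assignment):
        if assignment[partition] == leaving:
            new_assignment[partition] = next(targets)
    return new_assignment
```

For example, draining Broker 2 from a three-Broker cluster where it hosts `t-2`, `t-3`, and `t-4` sends those partitions to Brokers 0, 1, and 0 respectively. Round-robin ignores per-partition traffic, so some imbalance may remain after the hand-off; in the flow described in this article, continuous self-balancing is what evens out traffic afterwards.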

Example

The following figure shows how the number of Brokers in an AutoMQ Kafka cluster changes as traffic decreases. As traffic decreases linearly, Brokers are dynamically taken offline to save resources.

The following figure shows the traffic of each Broker node as overall traffic decreases. The load of each offline Broker is transferred to the remaining Brokers: each time a Broker goes offline, the traffic on the remaining Brokers rises noticeably.

In the examples above, to make the behavior easier to observe, we deliberately lengthened the ASG node scaling cooldown and added delay to process startup and termination.