Scale-out/in in Seconds

In the cloud-native era, the elasticity offered by cloud providers lets us scale a cluster at the node level efficiently with AWS Auto Scaling Groups [1] or Alibaba Cloud ESS scaling groups [2]. However, because traffic must be reassigned, Apache Kafka® clusters usually cannot use this elasticity directly: operations personnel have to move traffic manually, which typically takes hours. For online clusters with frequent traffic fluctuations, this makes scaling on demand almost impossible. To keep the cluster stable, operations personnel have to provision for peak capacity in advance to avoid the risks of scaling too late during traffic peaks, which leads to significant resource waste.

How AutoMQ Achieves Smooth Scaling in Seconds

AutoMQ's ability to scale in seconds relies on a core feature: second-level partition reassignment (refer to Partition Reassignment in Seconds▸).

After scaling out nodes with Auto Scaling Groups (ASG) or the Kubernetes Horizontal Pod Autoscaler (HPA) [3], you only need to bulk-reassign some of the cluster's partitions to the new nodes to restore balance (refer to Continuous Self-Balancing▸). This typically completes within ten seconds. A sketch of the Kubernetes path follows below.
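The snippet below is a minimal sketch of that Kubernetes path, not an official AutoMQ manifest: the Deployment name automq-broker, the automq namespace, the replica bounds, and the CPU target are all illustrative assumptions. It uses the official kubernetes Python client (a recent version that exposes the stable autoscaling/v2 API) to create an HPA for the broker pods; once the HPA adds pods, AutoMQ's self-balancing moves partitions onto them.

```python
# Minimal sketch: create an autoscaling/v2 HPA for a hypothetical
# "automq-broker" Deployment. Names and thresholds are illustrative,
# not AutoMQ defaults. Requires the `kubernetes` package and a kubeconfig.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V2HorizontalPodAutoscaler(
    api_version="autoscaling/v2",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="automq-broker-hpa", namespace="automq"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="automq-broker"
        ),
        min_replicas=3,
        max_replicas=12,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=60
                    ),
                ),
            )
        ],
    ),
)

# Create the HPA; Kubernetes then adds or removes broker pods as load changes,
# and AutoMQ's self-balancing reassigns partitions onto the new pods.
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="automq", body=hpa
)
```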

Triggering Scale-Out

Taking AWS Auto Scaling Groups (ASG) as an example: with traffic-threshold monitoring configured, new broker nodes are launched automatically once cluster traffic reaches the scale-out threshold. The Controller then detects the traffic imbalance and automatically moves partitions onto the newly created brokers, completing the traffic redistribution.
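The trigger itself can be as simple as a target-tracking policy attached to the ASG. The following is a minimal sketch using boto3; the ASG name, policy name, and target value are illustrative assumptions rather than recommended settings. It asks the ASG to keep average outbound network traffic per instance near a target, launching additional broker instances when the metric stays above it.

```python
# Minimal sketch: attach a target-tracking scaling policy to a hypothetical
# ASG named "automq-broker-asg". Requires boto3 and AWS credentials with
# Auto Scaling permissions.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="automq-broker-asg",
    PolicyName="broker-network-out-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Track average outbound network bytes per instance and add/remove
        # instances to keep it near the target. The value is illustrative;
        # it must match how CloudWatch aggregates NetworkOut in your setup.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageNetworkOut"
        },
        "TargetValue": 5 * 1024 ** 3,  # ~5 GiB per aggregation period
        "DisableScaleIn": False,       # the same policy may also scale in
    },
)
```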

The figure below shows the change in the number of brokers in an AutoMQ Kafka cluster as the traffic increases. It can be seen that brokers are dynamically created and added to the cluster to balance the load as traffic increases linearly.

The figure below shows the traffic changes across broker nodes during traffic increases. It can be seen that the newly created brokers complete traffic rebalancing within ten seconds.

Triggering Scale-In

Taking AWS Auto Scaling Groups as an example again: when cluster traffic drops to the scale-in threshold, the broker node selected for removal goes through a graceful shutdown. During shutdown, its partitions are reassigned in a round-robin manner to the remaining brokers within seconds, completing the graceful shutdown and the traffic transfer.
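The round-robin redistribution is conceptually simple. The sketch below is purely illustrative and is not AutoMQ's internal implementation; it only shows how the partitions of a broker that is shutting down can be spread evenly across the remaining brokers.

```python
# Illustrative sketch of the round-robin idea (not AutoMQ's internal code):
# each partition hosted on the departing broker is mapped to one of the
# remaining brokers in turn, keeping per-broker load even.
from typing import Dict, List


def round_robin_reassign(
    partitions_on_leaving_broker: List[str],
    remaining_brokers: List[int],
) -> Dict[str, int]:
    """Assign each partition of the leaving broker to one of the remaining
    brokers in round-robin order."""
    return {
        partition: remaining_brokers[i % len(remaining_brokers)]
        for i, partition in enumerate(partitions_on_leaving_broker)
    }


# Example: broker 3 is scaling in; its four partitions move to brokers 0-2.
print(round_robin_reassign(["orders-0", "orders-3", "payments-1", "logs-2"], [0, 1, 2]))
# {'orders-0': 0, 'orders-3': 1, 'payments-1': 2, 'logs-2': 0}
```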

The figure below shows the change in the number of brokers in an AutoMQ Kafka cluster as traffic decreases. You can observe that as traffic decreases linearly, brokers are dynamically shut down to save resources.

The figure below illustrates the traffic changes across broker nodes as traffic declines. The load of each broker being shut down is transferred to the remaining brokers (whenever a broker shuts down, the traffic on the remaining brokers increases noticeably).

In the above examples, to make the behavior easier to observe, the ASG's scale-out and scale-in cooldown times were deliberately lengthened, and delays were added to process startup and termination.

Advantages of Automatic Scaling

AutoMQ's shared storage architecture inherently supports rapid automatic scaling, which is also the foundation for achieving Serverless. The automatic scaling capabilities of AutoMQ provide at least the following advantages:

  • Cost advantages: There is no need to provision resources for peak demand. Resources scale automatically with business traffic, handling tidal and bursty workloads effectively; you pay only for what you use, with no idle resources wasted.

  • Stability advantages: Scaling is seamless and puts no additional traffic pressure on the cluster, so it remains lossless even at high traffic watermarks. In contrast, scaling an Apache Kafka® cluster is a high-risk operation that can only be performed safely at low watermarks.

  • Multi-tenant advantages: With automatic scaling, there is no need to co-locate multiple businesses on one cluster just to improve resource utilization. Each business can run on its own independent cluster that scales according to its own traffic model, preserving the cost advantage while ensuring that a problem in one business cannot affect the others.

References

[1]. AWS Auto Scaling Groups: https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html

[2]. Alibaba Cloud ESS (Auto Scaling): https://www.aliyun.com/product/ecs/ess

[3]. Kubernetes Horizontal Pod Autoscaler (HPA): https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/