Cost-Effective: AutoMQ vs. Apache Kafka
Apache Kafka clusters incur costs primarily from compute and storage. Compute costs cover the servers required to run Kafka Brokers, such as AWS EC2, while storage costs pertain to the storage devices needed to retain data, such as AWS EBS.
AutoMQ's new cloud-native architecture significantly optimizes both compute and storage, reducing total cluster costs to one-tenth of the original for the same traffic volume.
This report details AutoMQ's cost optimizations in both storage and compute and quantifies the potential savings. Finally, we deploy an AutoMQ cluster in a typical scenario and compare its costs with Apache Kafka (versions below 3.6.0, which lack tiered storage).
Unless otherwise specified, the following pricing is based on Amazon Web Services' Ningxia region (cn-northwest-1) as of October 31, 2023.
Storage Solution: Fully Leverage High-Reliability, Low-Cost Cloud Storage
In the era of cloud computing, major cloud providers offer highly reliable cloud storage solutions. Depending on different use cases, users can choose products that meet their needs, such as AWS EBS, AWS S3, and AWS EFS.
To maximize the use of cloud storage, AutoMQ primarily stores data in object storage, using only a small amount of block storage as a buffer. This approach ensures both performance and reliability while significantly reducing data storage costs.
Based on EBS with No Replica Redundancy
Cloud providers already offer EBS services with high durability. Building additional replicas on top of cloud disks improves durability only marginally while multiplying storage costs.
AutoMQ uses only small-capacity cloud disks as a durable buffer for data before it is uploaded to object storage, eliminating the need for additional data replication. At the same time, AutoMQ ensures high reliability in various scenarios.
Fully Utilizing Object Storage
Object storage is one of the most cost-effective and virtually unlimited storage options in cloud services. By offloading the majority of data to object storage, AutoMQ significantly reduces storage costs. This method is particularly effective for scenarios involving large amounts of data storage.
For example, AWS S3 Standard storage is priced at 0.1755 CNY/(GiB*month), while AWS EBS gp3 is priced at 0.5312 CNY/(GiB*month); using S3 therefore saves 67.0% on storage.
Comparing the storage costs of AutoMQ and Apache Kafka directly: with AutoMQ, storing 10 TiB of data per month in S3 Standard storage costs:
10 TiB * 1024 GiB/TiB * 0.1755 CNY/(GiB*month) = 1797.12 CNY/month
Whereas 3-replica Apache Kafka using EBS gp3 for storing 10 TiB of data per month costs (assuming disk utilization at 80%):
10 TiB * 1024 GiB/TiB ÷ 80% * 3 * 0.5312 CNY/(GiB*month) = 20398.08 CNY/month
The storage cost of Apache Kafka is theoretically 11.4 times that of AutoMQ (20398.08 / 1797.12).
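This comparison can be reproduced in a few lines (a minimal sketch in Python; the prices, 80% disk utilization, and 3 replicas are the figures stated above):

```python
# Monthly cost of storing 10 TiB: AutoMQ on S3 vs. 3-replica Kafka on EBS gp3.
# Prices are the AWS Ningxia (cn-northwest-1) list prices quoted above.
S3_PRICE = 0.1755   # CNY per GiB-month, S3 Standard
EBS_PRICE = 0.5312  # CNY per GiB-month, EBS gp3

data_gib = 10 * 1024                      # 10 TiB of retained data
automq = data_gib * S3_PRICE              # single copy on S3
kafka = data_gib / 0.80 * 3 * EBS_PRICE   # 80% disk utilization, 3 replicas

print(f"AutoMQ: {automq:.2f} CNY/month")  # 1797.12
print(f"Kafka:  {kafka:.2f} CNY/month")   # 20398.08
print(f"ratio:  {kafka / automq:.1f}x")   # 11.4x
```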
Computing Solution: Maximizing the Benefits of On-Demand Billing and Elastic Cloud Computing
In the era of cloud computing, cloud service providers offer highly elastic cloud computing services. Users can purchase or release cloud servers as needed, paying only for what they use. This allows users to release idle servers to save costs.
AutoMQ adopts a storage-compute separation architecture, fully leveraging the elasticity of cloud computing. Whether it's reassignment of partitions or scaling, AutoMQ can complete these tasks within minutes.
On-demand Scaling, No Idling
AutoMQ achieves second-level partition reassignment and traffic self-balancing, so scaling in or out completes within minutes.
This rapid scaling lets AutoMQ adjust cluster capacity in step with real-time traffic and avoid idle resources. In contrast, Apache Kafka must be provisioned for the maximum estimated traffic to avoid impacting the business. In scenarios with large traffic swings, this elasticity can reduce costs substantially.
For a cluster whose peak-to-trough traffic ratio is 10:1, with the peak lasting 4 hours a day, the theoretical ratio of instance-hours required by AutoMQ versus 3-replica Apache Kafka (which must stay provisioned for the peak around the clock) is as follows.
(1 * (24 - 4) + 10 * 4) : (10 * 24 * 3) = 1 : 12
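In code (a sketch; capacity is measured in units of "instances needed for off-peak traffic", per the assumptions above):

```python
# Instance-hours per day for a 10:1 peak lasting 4 hours.
peak_ratio, peak_hours = 10, 4

automq = 1 * (24 - peak_hours) + peak_ratio * peak_hours  # scales with traffic
kafka = peak_ratio * 24 * 3  # provisioned for peak, all day, 3 replicas

print(automq, ":", kafka)              # 60 : 720
print("ratio = 1 :", kafka // automq)  # 1 : 12
```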
Spot Instances, Flexible and Cost-effective
Current mainstream cloud providers offer spot instances (also known as "preemptible instances"), which have the following characteristics compared to on-demand instances:
Lower cost. For example, AWS spot instances can offer up to 90% discounts, and Alibaba Cloud preemptible instances can also offer up to 90% discounts.
Uncontrollable lifecycle. Spot instances are forcibly terminated when the bid price falls below the market price.
Spot instances can be reclaimed at any time, which makes them harder to manage than on-demand instances. AutoMQ addresses this effectively.
When receiving a signal indicating that an instance is about to be terminated, AutoMQ can quickly reassign the partitions on that Broker to other Brokers, enabling a graceful shutdown.
Even in extreme cases where the instance is terminated before the partitions are fully reassigned, AutoMQ can still recover and upload data from the instance's data disk, ensuring no data loss.
In an AutoMQ cluster, all Brokers can be spot instances, significantly reducing costs.
For example, using the AWS r6i.large instance type, the on-demand price is 0.88313 CNY/hour, whereas the spot price is 0.2067 CNY/hour. Utilizing spot instances can save 76.6% in costs.
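The discount works out as follows (a trivial check using the prices just quoted):

```python
on_demand = 0.88313  # CNY/hour, r6i.large on-demand
spot = 0.2067        # CNY/hour, r6i.large spot

saving = 1 - spot / on_demand
print(f"{saving:.1%}")  # 76.6%
```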
Fully Utilizing Network Bandwidth
For different instance types, cloud providers set network bandwidth limits, which can affect the throughput a single Kafka Broker can handle. Optimizing the network bandwidth utilization of Brokers helps increase the single-machine throughput limit and reduce costs.
When the production-consumption ratio in a cluster is 1:1, the traffic limit handled by an AutoMQ Broker is 1.5 times that of Apache Kafka®.
When an AutoMQ Broker receives one unit of produce traffic, its outbound traffic is one unit sent to Consumers plus one unit uploaded to object storage, two units in total.
By contrast, when a 3-replica Apache Kafka® Broker receives one unit of produce traffic, its outbound traffic is one unit sent to Consumers plus two units of inter-replica replication, three units in total.
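Treating outbound bandwidth as the bottleneck (as the reasoning above implicitly does), the per-broker produce limits can be sketched as follows; the 100 MiB/s NIC figure is the r6i.large baseline used later in this report:

```python
# Per-broker produce limit when outbound bandwidth is the bottleneck.
# Per unit produced, AutoMQ sends out 2 units (Consumer + S3 upload);
# 3-replica Kafka sends out 3 units (Consumer + 2 replication copies).
def max_produce(nic_mib_s: float, outbound_units: int) -> float:
    return nic_mib_s / outbound_units

nic = 100  # MiB/s, r6i.large baseline bandwidth
print(f"AutoMQ: {max_produce(nic, 2):.2f} MiB/s")  # 50.00
print(f"Kafka:  {max_produce(nic, 3):.2f} MiB/s")  # 33.33 -> AutoMQ is 1.5x
```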
Test Data: Real Online Scenario
To validate AutoMQ's cost advantage, we deployed an AutoMQ cluster on AWS, simulated message production and consumption, and used AWS CloudWatch and AWS Cost Explorer to track cluster capacity and costs over time, finally comparing the results with Apache Kafka®. This gives a clearer picture of AutoMQ's cost-efficiency advantages.
Test Plan
Refer to Cluster Deployment on Linux▸ for quick deployment of an AutoMQ cluster.
Use the OpenMessaging Benchmark Framework to send and receive messages on the cluster continuously for 24 hours.
Monitor the number of Brokers in the cluster, the traffic per Broker, and the overall cluster traffic using AWS CloudWatch.
Use AWS Cost Explorer to obtain hourly costs of various cloud products within the cluster.
To simulate real-world cluster traffic, slight modifications were made to the OpenMessaging Benchmark Framework to support varying traffic for message sending and receiving over specified time periods. The traffic pattern used during testing was:
Normal traffic at 80 MiB/s.
Traffic increases to 800 MiB/s from 00:00 to 01:00, decreases to 400 MiB/s by 02:00, and returns to 80 MiB/s by 03:00.
Traffic increases to 800 MiB/s from 13:00 to 13:45 and returns to 80 MiB/s by 14:30.
Traffic increases to 1200 MiB/s from 18:00 to 19:00 and returns to 80 MiB/s by 20:00. (The sketch below shows how these targets map to the benchmark's message rates.)
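With the 4 KiB messages configured in Appendix 1 (messageSize: 4096), the MiB/s targets convert to messages per second as follows (a sketch of the conversion):

```python
# Map the MiB/s targets above to messages/s for 4 KiB messages.
for mib_s in (80, 400, 800, 1200):
    print(f"{mib_s:>4} MiB/s -> {mib_s * 1024 // 4} msg/s")
# 80 -> 20480, 400 -> 102400, 800 -> 204800, 1200 -> 307200,
# matching the producerRateList entries in Appendix 1.
```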
For other configurations, refer to Appendix 1.
Results
Running the above load in the AWS Ningxia region (cn-northwest-1) for 24 hours yielded the following results.
Dynamic Scaling
Using AWS CloudWatch, you can obtain the relationship between the number of brokers and the total cluster traffic over time. As shown below:
Explanation:
The blue curve in the figure represents the total traffic of messages produced in the cluster (i.e., the total size of messages produced per second). Since the production-consumption ratio is 1:1, this is also the total traffic of consumed messages. The unit is byte/s, and the left Y-axis uses decimal (base-10) prefixes, e.g., 1M = 1,000,000 and 1G = 1,000,000,000.
Because of rebalancing within the AWS Auto Scaling group and the reclamation of spot instances, the number of Brokers may fluctuate briefly even when cluster traffic is stable.
From the figure, it can be seen:
AutoMQ scales in and out in real time as traffic changes, with only minute-level delay, which saves a significant amount of compute cost.
During the scaling process, AutoMQ only causes short-term and minor traffic fluctuations, without affecting cluster availability.
Cost Composition
Through AWS Cost Explorer, you can obtain the total cost of the cluster and its composition at the corresponding time. See the table below:
| Type | Cost (CNY) | Proportion |
| --- | --- | --- |
| EC2 - On-Demand Instances | 64.450 | 34.3% |
| EC2 - Spot Instances | 19.446 | 10.4% |
| S3 - Standard Storage Fees | 93.715 | 49.9% |
| S3 - API Request Fees | 7.692 | 4.1% |
| EBS - gp3 | 2.431 | 1.3% |
| Total | 187.734 | 100.0% |
Note:
The cost statistics listed in the table cover the period from 2023-11-01 00:00 to 2023-11-01 23:59, totaling 24 hours.
To ensure cluster stability, AutoMQ uses 3 on-demand instances as Controllers (which also serve as Brokers, handling a small portion of the traffic), costing 0.88313 CNY/hour * 3 * 24 hours = 63.59 CNY, which roughly matches the "EC2 - On-Demand Instances" item in the table.
Due to AWS Cost Explorer's delay in calculating "S3 - Standard Storage Fees," the cost listed is estimated as 0.1755 CNY/(GiB*month) * 16242 GiB / 730 hours/month * 24 hours = 93.715 CNY, where 16242 GiB is the data volume generated by the aforementioned traffic over 24 hours.
"S3 - API Call Fees" include costs incurred by invoking the following APIs: GetObject, PutObject, InitiateMultipartUpload, UploadPart, CopyPart, and CompleteMultipartUpload.
Costs below 0.001 CNY, such as those from CreateBucket, ListBucket, and DeleteBucket API calls, are not listed in the table.
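Both estimates in the notes can be sanity-checked quickly (a sketch; 730 hours/month is the billing convention used in the note above):

```python
# Controller (on-demand) cost and estimated S3 storage cost over 24 hours.
controllers = 0.88313 * 3 * 24          # CNY/hour * 3 instances * 24 hours
s3_storage = 0.1755 * 16242 / 730 * 24  # CNY/(GiB*month) * GiB / (h/month) * h

print(f"controllers: {controllers:.2f} CNY")  # 63.59
print(f"S3 storage:  {s3_storage:.2f} CNY")   # 93.71 (93.715 in the table)
```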
From the table, it is evident that:
Because all Brokers in AutoMQ use spot instances and these instances scale up or down as needed, there is a significant reduction in computational costs.
AutoMQ stores the vast majority of data in object storage (S3) and only a small portion in block storage (EBS) used as a buffer, significantly reducing storage costs.
Comparison with Apache Kafka
We also estimated the cost of Apache Kafka (versions below 3.6.0, which lack tiered storage) under the same scenario.
The following assumptions were made for the Apache Kafka cluster:
Purchase on-demand instances sized for the cluster's peak traffic (1200 MiB/s), using the same r6i.large instance type (on-demand price 0.88313 CNY/hour, spot price 0.2067 CNY/hour, baseline bandwidth 100 MiB/s) with a network watermark of 80%; additionally, purchase 3 more on-demand instances as Controllers.
Purchase block storage based on the cluster's total storage (16242 GiB), using gp3 (priced at 0.5312 CNY per GiB per month), with 3x replication storage and a storage watermark of 80%.
The estimation is as follows:
Single broker traffic capacity: 100 MiB/s * 80% / (1 + 2) = 26.67 MiB/s
Number of brokers in the cluster: 1200 MiB/s ÷ 26.67 MiB/s = 45
Required number of instances: 45 + 3 = 48
Daily cost calculation: 48 * 24 hours * 0.88313 CNY/hour = 1017.366 CNY
Required storage size: 16242 GiB * 3 / 80% = 60907.5 GiB
Daily storage cost: 60907.5 GiB * 0.5312 CNY/(GiB*month) / 730 hour/month * 24 hour = 1063.695 CNY
Total cost: 1017.366 CNY + 1063.695 CNY = 2081.061 CNY
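The same estimate, reproduced in code (a sketch that follows the assumptions listed above):

```python
import math

# Estimated daily cost of a 3-replica Apache Kafka cluster sized for the peak.
NIC = 100            # MiB/s, r6i.large baseline network bandwidth
WATERMARK = 0.80     # network and storage watermark
ON_DEMAND = 0.88313  # CNY/hour, r6i.large on-demand price
GP3 = 0.5312         # CNY/(GiB*month), EBS gp3 price
PEAK = 1200          # MiB/s peak produce traffic
DATA = 16242         # GiB produced over 24 hours

# Each unit produced causes 3 units of outbound traffic (Consumer + 2 replicas),
# so one broker's produce capacity is NIC * WATERMARK / 3 = 26.67 MiB/s.
brokers = math.ceil(PEAK * 3 / (NIC * WATERMARK))  # 45
compute = (brokers + 3) * 24 * ON_DEMAND           # +3 controllers -> 1017.366

storage_gib = DATA * 3 / WATERMARK                 # 3 replicas -> 60907.5 GiB
storage = storage_gib * GP3 / 730 * 24             # -> 1063.695 CNY

print(f"total: {compute + storage:.3f} CNY/day")   # 2081.061
```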
Comparison with AutoMQ:
| Cost Category | Apache Kafka (CNY) | AutoMQ (CNY) | Multiplier |
| --- | --- | --- | --- |
| Compute | 1017.366 | 83.896 | 12.13 |
| Storage | 1063.695 | 103.838 | 10.24 |
| Total | 2081.061 | 187.734 | 11.09 |
By leveraging the elasticity of the cloud and object storage, AutoMQ cuts both compute and storage costs to less than one-tenth of Apache Kafka's.
Appendix 1: Test Configuration
Configuration file for AutoMQ Installer, kos-config.yaml:
kos:
  installID: xxxx
  vpcID: vpc-xxxxxx
  cidr: 10.0.1.0/24
  zoneNameList: cn-northwest-1b
  kafka:
    controllerCount: 3
    heapOpts: "-Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:MaxDirectMemorySize=6g"
    controllerSettings:
      - autobalancer.reporter.network.in.capacity=60000
      - autobalancer.reporter.network.out.capacity=60000
    brokerSettings:
      - autobalancer.reporter.network.in.capacity=100000
      - autobalancer.reporter.network.out.capacity=100000
    commonSettings:
      - metric.reporters=kafka.autobalancer.metricsreporter.AutoBalancerMetricsReporter,org.apache.kafka.server.metrics.s3stream.KafkaS3MetricsLoggerReporter
      - s3.metrics.logger.interval.ms=60000
      - autobalancer.topic.num.partitions=1
      - autobalancer.controller.enable=true
      - autobalancer.controller.anomaly.detect.interval.ms=60000
      - autobalancer.controller.metrics.delay.ms=20000
      - autobalancer.controller.network.in.distribution.detect.threshold=0.2
      - autobalancer.controller.network.in.distribution.detect.avg.deviation=0.05
      - autobalancer.controller.network.out.distribution.detect.threshold=0.2
      - autobalancer.controller.network.out.distribution.detect.avg.deviation=0.05
      - autobalancer.controller.network.in.utilization.threshold=0.8
      - autobalancer.controller.network.out.utilization.threshold=0.8
      - autobalancer.controller.execution.interval.ms=100
      - autobalancer.controller.execution.steps=1024
      - autobalancer.controller.load.aggregation=true
      - autobalancer.controller.exclude.topics=__consumer_offsets
      - autobalancer.reporter.metrics.reporting.interval.ms=5000
      - s3.network.baseline.bandwidth=104824045
      - s3.wal.capacity=4294967296
      - s3.wal.cache.size=2147483648
      - s3.wal.object.size=536870912
      - s3.stream.object.split.size=8388608
      - s3.object.block.size=16777216
      - s3.object.part.size=33554432
      - s3.block.cache.size=1073741824
      - s3.object.compaction.cache.size=536870912
  scaling:
    cooldown: 10
    alarmPeriod: 60
    scalingAlarmEvaluationTimes: 1
    fallbackAlarmEvaluationTimes: 2
    scalingNetworkUpBoundRatio: 0.8
    scalingNetworkLowerBoundRatio: 0.8
  ec2:
    instanceType: r6i.large
    controllerSpotEnabled: false
    keyPairName: kafka_on_s3_benchmark_key-xxxx
    enablePublic: true
    enableDetailedMonitor: true
    accessKey: xxxxxx
    secretKey: xxxxxx
Description:
The instance type used is uniformly r6i.large, with a network baseline bandwidth of 0.781 Gbps. Therefore, s3.network.baseline.bandwidth is set to 104,824,045 bytes/s (see the derivation after this list).
To simulate a production environment, the number of controllers is set to 3, and the controllers use on-demand instances.
To quickly sense traffic changes and adjust capacity in a timely manner, AWS EC2 detailed monitoring is enabled. Additionally, kos.scaling.cooldown is set to 10 seconds, and kos.scaling.alarmPeriod is set to 60 seconds.
To fully leverage the elasticity of AutoMQ, both kos.scaling.scalingNetworkUpBoundRatio and kos.scaling.scalingNetworkLowerBoundRatio are set to 0.8.
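The baseline-bandwidth setting, for instance, can be derived as follows (a sketch; it assumes the 0.781 Gbps figure is interpreted with binary prefixes, which is what reproduces the value above):

```python
# r6i.large baseline: 0.781 Gbps -> bytes/s for s3.network.baseline.bandwidth,
# treating the prefix as binary (1 "G" = 2**30 bits).
print(int(0.781 * 2**30 / 8))  # 104824045
```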
The OpenMessaging Benchmark Framework configuration is as follows:
driver.yaml:
name: AutoMQ
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver
# Kafka Client-specific Configuration
replicationFactor: 3
reset: false
topicConfig: |
  min.insync.replicas=2
commonConfig: |
  bootstrap.servers=10.0.1.134:9092,10.0.1.132:9092,10.0.1.133:9092
producerConfig: |
  acks=all
  linger.ms=0
  batch.size=131072
  send.buffer.bytes=1048576
  receive.buffer.bytes=1048576
consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
  auto.commit.interval.ms=0
  max.partition.fetch.bytes=131072
  send.buffer.bytes=1048576
  receive.buffer.bytes=1048576
workload.yaml:
name: 1-topic-128-partitions-4kb-4p4c-dynamic
topics: 1
partitionsPerTopic: 128
messageSize: 4096
payloadFile: "payload/payload-4Kb.data"
subscriptionsPerTopic: 1
consumerPerSubscription: 4
producersPerTopic: 4
producerRate: 19200
producerRateList:
- [16, 0, 20480]
- [17, 0, 204800]
- [18, 0, 102400]
- [19, 0, 20480]
- [ 5, 0, 20480]
- [ 5, 45, 204800]
- [ 6, 30, 20480]
- [10, 0, 20480]
- [11, 0, 307200]
- [12, 0, 20480]
consumerBacklogSizeGB: 0
warmupDurationMinutes: 0
testDurationMinutes: 2100
Additionally, two c6in.2xlarge instances were used as workers; each has a baseline network bandwidth of 12.5 Gbps (1600 MiB/s), which is sufficient for sending and receiving messages at peak traffic.