Benchmark: AutoMQ vs. Apache Kafka
Comparative Conclusions
100x Efficiency Improvement
300 Times Faster Partition Reassignment Compared to Apache Kafka®: AutoMQ's partition reassignment speed is approximately 300 times faster than Apache Kafka®. AutoMQ transforms Kafka's high-risk routine maintenance tasks into automated, low-risk operations that are almost imperceptible.
4-Minute Elasticity from Zero to 1 GiB/s: AutoMQ clusters can elastically scale from 0 MiB/s to 1 GiB/s in just 4 minutes, allowing the system to quickly expand and respond to sudden traffic surges.
200 Times More Efficient Cold Reads Compared to Apache Kafka®: With read-write separation, AutoMQ reduces send latency during catch-up reads by a factor of 200 and increases catch-up throughput 5-fold compared to Apache Kafka®. AutoMQ easily handles both online message peak-smoothing and offline batch processing scenarios.
10x Cost Savings
2 Times the Throughput Limit Compared to Apache Kafka®: With the same hardware specifications, AutoMQ's maximum throughput is twice that of Apache Kafka®, and the P999 latency is one-quarter of Apache Kafka®. In real-time stream computing scenarios, using AutoMQ allows for lower costs and faster computation results.
1/11th the Billing Costs Compared to Apache Kafka®: By fully leveraging Auto Scaling and object storage, AutoMQ achieves an 11-fold cost reduction compared to Apache Kafka®. With AutoMQ, there is no need to provision capacity for peak loads, realizing true pay-as-you-go billing for both compute and storage.
Testing Preparation
This benchmark extends the Linux Foundation's OpenMessaging Benchmark with dynamic workloads that simulate real user scenarios. All test scenarios, including configurations and loads, can be found in the GitHub repository.
Configuration Parameters
By default, AutoMQ flushes data to disk before responding, with the following configuration:
acks=all
flush.messages=1
AutoMQ ensures high data durability through the multi-replica mechanism of EBS, eliminating the need for multi-replica configuration on the Kafka side.
Apache Kafka uses version 3.6.0 and, following Confluent's recommendation, does not set flush.messages=1. Instead, it relies on three-replica asynchronous flushing from memory to ensure data durability (a data-center power failure can still cause data loss), with the following configuration:
acks=all
replicationFactor=3
min.insync.replicas=2
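For illustration only, the settings above could be applied with the confluent-kafka Python client roughly as sketched below. The benchmark itself drives traffic through OpenMessaging Benchmark, so the client choice, bootstrap address, topic names, and partition counts here are assumptions, not part of the test setup.

```python
# Illustrative sketch only: applying the durability settings above with the
# confluent-kafka Python client. Broker address, topic names, and partition
# counts are placeholders; in the benchmark these are two separate clusters.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAP = "localhost:9092"   # assumption: replace with the benchmark cluster
admin = AdminClient({"bootstrap.servers": BOOTSTRAP})

# Apache Kafka side: three replicas, min.insync.replicas=2, no forced flush
# (flush.messages left at its default, per Confluent's recommendation).
kafka_topic = NewTopic("bench-topic-kafka", num_partitions=4,
                       replication_factor=3,
                       config={"min.insync.replicas": "2"})

# AutoMQ side: single replica (EBS provides durability), flush every message.
automq_topic = NewTopic("bench-topic-automq", num_partitions=4,
                        replication_factor=1,
                        config={"flush.messages": "1"})

admin.create_topics([kafka_topic, automq_topic])

# Producers in both setups wait for all in-sync replicas to acknowledge.
producer = Producer({"bootstrap.servers": BOOTSTRAP, "acks": "all"})
```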
Machine Specifications
In terms of cost-effectiveness, smaller instance types combined with EBS are more advantageous than larger instances with SSDs.
Using small r6in.large + EBS vs. large i3en.2xlarge + SSD as an example:
i3en.2xlarge, 8 cores, 64 GB memory, network baseline bandwidth 8.4 Gbps, comes with two 2.5 TB NVMe SSDs, maximum disk throughput 600 MB/s; price $0.9040/h.
r6in.large × 5 + 5 TB EBS: 10 cores, 80 GB memory, network baseline bandwidth 15.625 Gbps, EBS baseline bandwidth 625 MB/s; price (compute) $0.1743 × 5 + (storage) $0.08 × 5 × 1024 / 24 / 60 = $1.156/h.
At first glance, the price and performance of the two options seem comparable. Considering that in actual production environments, data needs to be retained for longer periods, using i3en.2xlarge would require horizontally scaling compute nodes to increase the cluster's storage space, wasting compute resources. If using r6in.large + EBS, only the EBS capacity needs to be adjusted.
Therefore, from a cost and performance perspective, both the AutoMQ and Apache Kafka setups use r6in.large as the smallest elastic unit for Brokers, with gp3 EBS and S3 Standard for storage.
r6in.large: 2 cores, 16 GB memory, network baseline bandwidth 3.125 Gbps, EBS baseline bandwidth 156.25 MB/s; price $0.1743/h.
GP3 EBS: free tier of 3,000 IOPS and 125 MB/s bandwidth; pricing: storage $0.08 per GB-month, additional bandwidth $0.040 per MB/s-month, additional IOPS $0.005 per IOPS-month.
AutoMQ and Apache Kafka have different positions regarding EBS:
AutoMQ uses EBS as a write buffer, so EBS only needs to be configured with 3 GB of storage, and the free tier can be used for IOPS and bandwidth.
Apache Kafka® stores data on EBS, and the EBS capacity required depends on the traffic and retention period of each test scenario. An additional 31 MB/s of EBS bandwidth can be purchased to further increase the throughput per unit cost, as the cost sketch below illustrates.
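To make the gp3 pricing above concrete, here is a small sketch of the monthly cost of one broker data volume of the kind used in the reassignment test below (320 GB with an extra 31 MB/s purchased on top of the 125 MB/s free tier). The function name and rounding are mine; the unit prices are the ones quoted above.

```python
# Sketch: monthly cost of one gp3 data volume per Apache Kafka broker,
# using the unit prices quoted above.
GP3_STORAGE_PER_GB_MONTH = 0.08       # $ per GB-month
GP3_EXTRA_BW_PER_MBS_MONTH = 0.04     # $ per MB/s-month beyond the free tier
GP3_FREE_TIER_BW_MBS = 125            # MB/s included for free

def gp3_monthly_cost(size_gb: float, bandwidth_mbs: float) -> float:
    extra_bw = max(0.0, bandwidth_mbs - GP3_FREE_TIER_BW_MBS)
    return size_gb * GP3_STORAGE_PER_GB_MONTH + extra_bw * GP3_EXTRA_BW_PER_MBS_MONTH

# The 320 GB / 156 MB/s volume used in the partition reassignment test:
print(round(gp3_monthly_cost(320, 156), 2))   # 25.60 storage + 1.24 bandwidth = 26.84
```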
100x Efficiency Improvement
Second-Level Partition Reassignment
In a production environment, a Kafka cluster typically serves multiple business units. Traffic fluctuations and partition distribution may cause cluster capacity shortages or machine hotspots. Kafka operators need to expand the cluster and reassign hotspot partitions to idle nodes to ensure the cluster's service availability.
The time required for partition reassignment determines the emergency response and operational efficiency:
The shorter the partition reassignment time, the shorter the duration from cluster expansion to meeting capacity demands, and the less time services are affected.
The faster the partition reassignment, the shorter the observation time for operators, enabling quicker operational feedback and subsequent decisions.
300x efficiency improvement: AutoMQ reduces the time to reassign a 30 GiB partition from 12 minutes to 2.2 seconds compared to Apache Kafka.
Testing
This test measures the reassignment time and impact for migrating a 30 GiB partition to a node where no replica of that partition exists, under regular send and consume traffic scenarios, using AutoMQ and Apache Kafka. The specific test scenario is as follows:
Two r6in.large instances acting as brokers, on which we create:
A single-partition, single-replica Topic A, with a continuous read/write throughput of 40 MiB/s.
A four-partition, single-replica Topic B, with a continuous read/write throughput of 10 MiB/s, serving as background traffic.
After 13 minutes, the only partition of Topic A is reassigned to another node, with a reassignment throughput limit of 100 MiB/s.
Each Apache Kafka broker additionally mounts a 320 GB gp3 EBS volume (156 MiB/s) for data storage.
Driver files: apache-kafka-driver.yaml, automq-for-kafka-driver.yaml
Load files: partition-reassignment.yaml
AutoMQ installation configuration file: partition-reassignment.yaml
Comparison Item | AutoMQ | Apache Kafka |
---|---|---|
Reassignment Time | 2.2s | 12min |
Reassignment Impact | Max send latency 2.2s | Continuous send latency jitter between 1ms and 90ms within 12min |
Analysis
AutoMQ partition reassignment only needs to upload the data buffered on EBS to S3 before the partition can be safely opened on the new node; typically, 500 MiB of data can be uploaded within 2 seconds. The reassignment time of an AutoMQ partition is independent of the amount of data in the partition, averaging around 1.5 seconds. During the reassignment, AutoMQ partitions return a NOT_LEADER_OR_FOLLOWER error code to the client. After the reassignment completes, the client refreshes its topic metadata and internally retries the send against the new node, so send latency for that partition rises temporarily and returns to normal once the reassignment finishes.
Apache Kafka® partition reassignment requires copying the partition replicas to new nodes. While copying historical data, it must also catch up with newly written data. The reassignment duration is calculated as the partition data size divided by the (reassignment throughput limit - partition write throughput). In actual production environments, partition reassignment typically takes hours. In this test, a 30 GiB partition reassignment took 12 minutes. Besides the long duration, Apache Kafka reassignment requires reading cold data from the hard disk, which, even with throttling set, can cause latency spikes due to page cache contention, affecting service quality. This is illustrated by the green curve fluctuations in the figure.
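The duration estimate for Apache Kafka described above can be written out as a short sketch; the function below is illustrative, and the measured 12 minutes in this test sits somewhat above the raw copy estimate it produces.

```python
# Sketch of the Apache Kafka reassignment-duration estimate described above:
# duration ≈ partition size / (reassignment throughput limit - partition write rate).
def kafka_reassignment_minutes(partition_gib: float,
                               reassign_limit_mib_s: float,
                               write_mib_s: float) -> float:
    net_copy_rate = reassign_limit_mib_s - write_mib_s   # MiB/s left to drain history
    return partition_gib * 1024 / net_copy_rate / 60

# Test scenario: 30 GiB partition, 100 MiB/s reassignment limit, 40 MiB/s writes.
print(round(kafka_reassignment_minutes(30, 100, 40), 1))   # ≈ 8.5 min of raw copying
# The 12 min measured in this test is somewhat higher than this lower bound.
```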
0 -> 1 GiB/s Extreme Elasticity
Kafka administrators usually plan cluster capacity based on historical experience, but unexpected hot events and activities can cause a sudden spike in cluster traffic. At such times, it is necessary to quickly scale the cluster and self-balance the partitions to handle the traffic surge.
Extreme Elasticity: AutoMQ clusters can automatically scale from 0 MiB/s to 1 GiB/s in just 4 minutes.
Test
The purpose of this test is to measure the emergency Auto Scaling elasticity feature of AutoMQ, specifically the speed of scaling from 0 MiB/s to 1 GiB/s. The test scenario is as follows:
The cluster initially has only one Broker, with Auto Scaling emergency elasticity capacity set to 1 GiB/s, and a Topic with 1000 partitions is created.
Using OpenMessaging, the sending traffic is directly set to 1 GiB/s.
Driver files: apache-kafka-driver.yaml, automq-for-kafka-driver.yaml
Load file: emergency-scaling.yaml
AutoMQ installation configuration file: emergency-scaling.yaml
Analysis Item | Monitoring and Alerts | Batch Scaling | Auto Balancing | Total |
---|---|---|---|---|
0 -> 1 GiB/s Elastic Time | 70s | 80s | 90s | 4min |
Analysis
The cluster capacity of AutoMQ is typically kept at 80% utilization through the Auto Scaling target tracking policy. When traffic spikes unexpectedly, the target tracking policy may not meet capacity demands in time, so Auto Scaling also provides an emergency policy that scales the cluster directly to the target capacity once cluster utilization exceeds 90%.
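As a rough, hypothetical sketch of the two policies just described (target tracking toward 80% utilization, plus an emergency jump to the target capacity above 90%): only the two thresholds come from this section; the function, capacity model, and names are assumptions, not AutoMQ's implementation.

```python
import math

TARGET_UTILIZATION = 0.80     # target-tracking level described above
EMERGENCY_THRESHOLD = 0.90    # emergency scale-out trigger described above

def desired_broker_count(current_brokers: int,
                         current_traffic_mib_s: float,
                         per_broker_capacity_mib_s: float,
                         emergency_target_mib_s: float) -> int:
    """Hypothetical sketch of the scaling decision, not AutoMQ's implementation."""
    utilization = current_traffic_mib_s / (current_brokers * per_broker_capacity_mib_s)
    if utilization > EMERGENCY_THRESHOLD:
        # Emergency policy: size the cluster for the configured target capacity at once.
        demand = emergency_target_mib_s
    else:
        # Target tracking: size the cluster so utilization settles back around 80%.
        demand = current_traffic_mib_s
    return math.ceil(demand / (per_broker_capacity_mib_s * TARGET_UTILIZATION))
```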
In this test, the Auto Scaling emergency strategy scaled the cluster capacity to the target capacity within 4 minutes:
70s: AWS CloudWatch metrics have a minimum granularity of 1 minute; once a data point shows cluster utilization above 90%, an alarm is triggered.
80s: AWS scales out nodes in batches to the target capacity, and the Brokers complete node registration.
90s: AutoMQ's Auto Balancing detects the traffic imbalance between nodes and rebalances traffic automatically.
The cluster capacity then meets the 1 GiB/s demand, and send latency returns to its baseline.
Catch-up Read
Catch-up read is a common scenario in messaging and stream systems:
For messaging, queues are typically used to decouple business processes and to smooth peaks and troughs. Smoothing requires the message queue to absorb the backlog sent by upstream producers so that downstream consumers can drain it at their own pace; the downstream is then catching up on cold data that is no longer in memory.
For streams, periodic batch processing tasks need to scan and compute data from several hours or even a day ago.
A consumer may be down for several hours before coming back online, or historical data may need to be reconsumed after a bug in the consumer logic has been fixed.
Catch-up reading primarily focuses on two aspects:
Catch-up read speed: The faster the catch-up read speed, the quicker the consumer can recover from failure, and the faster batch processing tasks can produce analytical results.
Isolation of read and write operations: Catch-up reading should ideally not impact the production rate and latency.
200x efficiency improvement: in catch-up read scenarios, AutoMQ's read-write separation reduces send latency from 800 ms to 3 ms and cuts the catch-up time from 215 minutes to 42 minutes compared to Apache Kafka®.
Test
This test measures the catch-up read performance of AutoMQ and Apache Kafka® in clusters of the same size. The test scenario is as follows:
Deploy a cluster of 20 Brokers, creating a Topic with 1000 partitions.
Continuously send at 800 MiB/s throughput.
After sending 4 TiB of data, start the consumer to consume from the earliest offset.
Apache Kafka® brokers are each mounted with a 1,000 GB gp3 EBS volume (156 MiB/s) for data storage.
Driver files: apache-kafka-driver.yaml, automq-for-kafka-driver.yaml
Workload file: catch-up-read.yaml
AutoMQ installation configuration file: catch-up-read.yaml
Comparison Item | Send Latency During Catch-up Read | Impact on Send Throughput During Catch-up Read | Peak Throughput During Catch-up Read |
---|---|---|---|
AutoMQ | Less than 3ms | Read-Write Isolation, Maintains 800 MiB/s | 2500 ~ 2700 MiB/s |
Apache Kafka | Approximately 800ms | Interference, dropped to 150 MiB/s | 2600 to 3000 MiB/s (at the cost of writes) |
Analysis
With the same cluster size, during catch-up reads, AutoMQ's send throughput remained unaffected, whereas Apache Kafka's send throughput dropped by 80%. This is because Apache Kafka reads from the disk during catch-up reads and does not implement IO isolation, which consumes AWS EBS read/write bandwidth, leading to reduced disk write bandwidth and decreased send throughput. In contrast, AutoMQ separates reads and writes, utilizing object storage for reads during catch-up, thereby not consuming disk read/write bandwidth and not affecting send throughput.
With the same cluster size, during catch-up reads, AutoMQ's average send latency increased by approximately 0.4 ms compared to just sending, while Apache Kafka's latency surged by approximately 800 ms. The increase in Apache Kafka's send latency can be attributed to two factors: firstly, as mentioned earlier, catch-up reads consume AWS EBS read/write bandwidth, reducing write throughput and increasing latency; secondly, reading cold data from the disk during catch-up pollutes the page cache, which also leads to increased write latency.
Notably, when catching up on 4 TiB of data, AutoMQ took 42 minutes, whereas Apache Kafka took 29 minutes. The shorter duration for Apache Kafka can be attributed to two reasons:
During catch-up reads, Apache Kafka's send throughput decreased by 80%, reducing the amount of data it needed to catch up on.
Apache Kafka did not implement IO isolation, sacrificing send rate to improve read rate.
If we assume Apache Kafka had IO isolation, i.e., it keeps the send rate as high as possible while reading, the calculation is as follows (a worked sketch follows this list):
Assuming Apache Kafka® has a send rate of 700 MiB/s during catch-up reads, considering three replicas, the EBS bandwidth consumption would be 700 MiB/s * 3 = 2100 MiB/s.
The total EBS bandwidth in the cluster is 156.25 MiB/s * 20 = 3125 MiB/s.
The available bandwidth for reads is 3125 MiB/s - 2100 MiB/s = 1025 MiB/s.
In a catch-up read scenario where reading occurs concurrently with sending, catching up on 4 TiB of data would take 4 TiB × 1024 GiB/TiB × 1024 MiB/GiB / (1025 MiB/s - 700 MiB/s) / 60 s/min = 215 min.
Apache Kafka® needs 215 minutes to catch up and read 4 TiB of data without significantly affecting the sending rate, which is 5 times longer than AutoMQ.
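The back-of-the-envelope estimate in the list above works out as follows; the 700 MiB/s assumed send rate and the cluster figures are the ones stated there.

```python
# Reproducing the "Apache Kafka with IO isolation" estimate from the list above.
SEND_MIB_S = 700                 # assumed send rate during catch-up
REPLICAS = 3
BROKERS = 20
EBS_BW_PER_BROKER_MIB_S = 156.25

write_bw = SEND_MIB_S * REPLICAS                      # 2100 MiB/s consumed by writes
total_bw = EBS_BW_PER_BROKER_MIB_S * BROKERS          # 3125 MiB/s across the cluster
read_bw = total_bw - write_bw                         # 1025 MiB/s left for catch-up

backlog_mib = 4 * 1024 * 1024                         # 4 TiB backlog
# New data keeps arriving at 700 MiB/s while catching up, so the backlog only
# shrinks at (read_bw - SEND_MIB_S) MiB/s.
catch_up_minutes = backlog_mib / (read_bw - SEND_MIB_S) / 60
print(round(catch_up_minutes))                        # ≈ 215 minutes
```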
10x Cost Savings
Kafka's costs are primarily driven by compute and storage. AutoMQ optimizes both dimensions to approach the theoretical minimum cost in the cloud, resulting in a 10x cost saving compared to Apache Kafka®:
Compute
Spot instances save up to 90%: AutoMQ benefits from stateless brokers, enabling the use of Spot instances at scale to reduce single-node compute costs.
EBS multi-replica high durability saves up to 66%: AutoMQ uses EBS multi-replica to ensure high data durability. Compared to ISR triple-replica, a single compute instance can replace up to three instances.
Auto Scaling: AutoMQ employs a target tracking policy to dynamically scale the cluster based on real-time traffic.
Storage
Object storage saves up to 93%: AutoMQ stores nearly all data in object storage, which can save up to 93% in storage costs compared to triple-replica EBS.
Fixed Scale
AutoMQ's peak throughput is 2x that of Apache Kafka, with P999 latency being 1/4 of Apache Kafka's.
Testing
This test measures the performance and throughput limits of AutoMQ and Apache Kafka at different traffic scales on clusters of the same size. The test scenarios are as follows:
Deploy a cluster with 23 brokers and create a Topic with 1000 partitions.
Launch read/write traffic at 500 MiB/s and 1 GiB/s respectively at a 1:1 ratio; additionally, test the maximum throughput of both (AutoMQ 2200 MiB/s, Apache Kafka 1100 MiB/s).
Apache Kafka mounts an additional 500 GB gp3 EBS volume (156 MiB/s) per broker for data storage.
Driver file: apache-kafka-driver.yaml, automq-for-kafka-driver.yaml
Workload file: tail-read-500m.yaml, tail-read-1024m.yaml, tail-read-1100m.yaml, tail-read-2200m.yaml
AutoMQ installation configuration file: tail-read.yaml
Comparison Item | Maximum Throughput | 500 MiB/s Send Latency P999 | 1 GiB/s Send Latency P999 |
---|---|---|---|
AutoMQ | 2200 MiB/s | 13.829 ms | 25.492 ms |
Apache Kafka | 1100 MiB/s | 55.401 ms | 119.033 ms |
Detailed send latency and end-to-end latency data:
Pub Latency (ms) | AutoMQ 500 MiB/s | Apache Kafka 500 MiB/s | AutoMQ 1 GiB/s | Apache Kafka 1 GiB/s | AutoMQ 2200 MiB/s (limit) | Apache Kafka 1100 MiB/s (limit) |
---|---|---|---|---|---|---|
AVG | 2.116 | 1.832 | 2.431 | 3.901 | 5.058 | 4.591 |
P50 | 1.953 | 1.380 | 2.118 | 2.137 | 3.034 | 2.380 |
P75 | 2.271 | 1.618 | 2.503 | 3.095 | 3.906 | 3.637 |
P95 | 2.997 | 2.618 | 3.859 | 8.254 | 9.555 | 10.951 |
P99 | 6.368 | 12.274 | 8.968 | 50.762 | 37.373 | 60.207 |
P999 | 13.829 | 55.401 | 25.492 | 119.033 | 331.729 | 134.814 |
P9999 | 32.467 | 76.304 | 65.24 | 233.89 | 813.415 | 220.280 |
E2E Latency (ms) | AutoMQ 1 GiB/s | Apache Kafka 1 GiB/s | AutoMQ 2200 MiB/s (limit) | Apache Kafka 1100 MiB/s (limit) |
---|---|---|---|---|
AVG | 4.405 | 4.786 | 6.858 | 5.477 |
P50 | 3.282 | 3.044 | 4.828 | 3.318 |
P75 | 3.948 | 4.108 | 6.270 | 4.678 |
P95 | 10.921 | 9.514 | 12.510 | 11.946 |
P99 | 26.610 | 51.531 | 34.350 | 60.272 |
P999 | 53.407 | 118.142 | 345.055 | 133.056 |
P9999 | 119.254 | 226.945 | 825.883 | 217.076 |
Analysis
- Under the same cluster size, the peak throughput of AutoMQ is twice that of Apache Kafka.
AutoMQ ensures high data durability through multiple replicas based on EBS, without additional replication at the upper layer, while Apache Kafka ensures data durability through ISR triple replicas. Without considering CPU and network bottlenecks, both AutoMQ and Apache Kafka max out disk bandwidth. The theoretical throughput limit of AutoMQ is three times that of Apache Kafka.
In this test, AutoMQ has higher CPU usage compared to Apache Kafka because it needs to upload data to S3, leading to AutoMQ hitting the CPU bottleneck first. With 23 r6in.large instances, the total disk bandwidth limit is 3588 MB/s. The theoretical peak sending limit for Apache Kafka with triple replicas is 1196 MB/s, with Apache Kafka hitting the disk bottleneck first. Ultimately, the peak throughput achieved in the stress test shows AutoMQ being twice that of Apache Kafka.
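The disk-bandwidth reasoning above, written out as a sketch (the 156 MB/s per-broker figure is the gp3 baseline used in this report, rounded):

```python
# Sketch of the disk-bandwidth ceiling reasoning above for the 23-broker cluster.
BROKERS = 23
EBS_BW_PER_BROKER_MB_S = 156          # gp3 baseline per r6in.large, rounded

total_disk_bw = BROKERS * EBS_BW_PER_BROKER_MB_S   # 3588 MB/s across the cluster

# Apache Kafka writes every byte three times (ISR triple replication), so its
# theoretical send ceiling is one third of the total disk bandwidth.
kafka_disk_ceiling = total_disk_bw / 3             # ≈ 1196 MB/s

# AutoMQ writes one copy to EBS, so its disk ceiling is ~3x higher; in this test
# it hits the CPU first (S3 upload work) at roughly 2200 MiB/s.
print(total_disk_bw, round(kafka_disk_ceiling))
```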
- Under the same cluster size and traffic (500 MiB/s), AutoMQ's P999 send latency is one-quarter of Apache Kafka's. Even when AutoMQ carries twice the traffic (1024 MiB/s vs. 500 MiB/s), its P999 send latency is still half that of Apache Kafka.
AutoMQ uses Direct IO to bypass the file system and write directly to EBS raw devices, eliminating file system overhead and achieving more stable send latency.
Apache Kafka, on the other hand, uses Buffered IO to write data to the page cache. Once data is written to the page cache, it returns success, and the operating system flushes dirty pages to the hard drive in the background. The overhead of the file system, cold read consumption, and page cache misses can all cause jitter in send latency.
Normalized to a throughput of 1 GiB/s, AutoMQ offers up to 20x compute cost savings and 10x storage cost savings compared to Apache Kafka.
Compute: AutoMQ uses EBS solely as a buffer for writing to S3, uploading data to S3 during shutdown, which completes within 30 seconds. This allows AutoMQ to fully utilize Spot instances, which are up to 90% cheaper than On-Demand instances. Coupled with AutoMQ's single-node throughput being twice that of Apache Kafka, AutoMQ can achieve up to 20x the compute cost savings compared to Apache Kafka.
Storage: Almost all of AutoMQ's data is stored in S3, which charges for the data actually stored. Apache Kafka stores data on disks with triple replication and typically reserves at least 20% extra disk space in production. Per GB, AutoMQ can achieve up to 13x storage cost savings compared to Apache Kafka, calculated as 1 / (S3 unit price $0.023 / (3 replicas × $0.08 EBS unit price / 0.8 disk utilization)). Including the cost of S3 API calls, AutoMQ ultimately achieves up to 10x storage cost savings compared to Apache Kafka.
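The 13x per-GB figure above follows directly from the quoted unit prices; here it is as a sketch (the final ~10x figure additionally accounts for S3 API call costs, which are not modeled here).

```python
# Sketch of the per-GB monthly storage cost ratio quoted above.
S3_PER_GB_MONTH = 0.023       # S3 Standard storage unit price
EBS_PER_GB_MONTH = 0.08       # gp3 unit price
REPLICAS = 3                  # Apache Kafka ISR replication factor
DISK_UTILIZATION = 0.8        # production disk utilization

kafka_per_gb = REPLICAS * EBS_PER_GB_MONTH / DISK_UTILIZATION   # $0.30 per GB-month
automq_per_gb = S3_PER_GB_MONTH                                 # $0.023 per GB-month

print(round(kafka_per_gb / automq_per_gb, 1))   # ≈ 13.0x before S3 API call costs
```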
Additionally, Apache Kafka's throughput-limit test already saturates disk bandwidth. In real production environments, disk bandwidth must be reserved for partition reassignment and cold-data catch-up reads, so the write watermark is set lower. In contrast, AutoMQ serves catch-up reads from S3 over the network, separating read and write bandwidth; disk bandwidth can therefore be devoted entirely to writes, and the production watermark can stay consistent with the level achieved in this test.
Auto Scaling
11x extreme cost savings: by fully leveraging Auto Scaling and object storage, AutoMQ achieves true pay-as-you-go for both compute and storage.
Testing
This test simulates production peak and valley loads to measure the cost and performance of AutoMQ under Auto Scaling target tracking policies. The test scenario is as follows:
Install the cluster on AWS using the AutoMQ Installer.
Create a Topic with 256 partitions and a retention time of 24 hours.
Perform a stress test with dynamic traffic at a 1:1 read-write ratio as follows:
Normal traffic is 40 MiB/s.
From 12:00 to 12:30, traffic increases to 800 MiB/s and returns to 40 MiB/s by 13:00.
From 18:00 to 18:30, traffic increases to 1200 MiB/s and returns to 40 MiB/s by 19:00.
Driver files: apache-kafka-driver.yaml, automq-for-kafka-driver.yaml
Load file: auto-scaling.yaml
AutoMQ installation configuration file: auto-scaling.yaml
Cost comparison of AutoMQ and Apache Kafka under the same workload:
Cost Category | Apache Kafka (USD / month) | AutoMQ (USD / month) | Multiplier |
---|---|---|---|
Compute | 3,054.26 | 201.46 | 15.2 |
Storage | 2,095.71 | 257.38 | 8.1 |
Total | 5,149.97 | 458.84 | 11.2 |
Analysis
This test was conducted in the AWS US East region, where both compute and storage were billed on a pay-as-you-go basis:
Compute: under Auto Scaling's target tracking policy, compute nodes scale in and out dynamically with cluster traffic, holding cluster utilization around 80% at minute granularity (AWS monitoring and alerting operate at minute granularity, and monitoring delays put the tracking accuracy at roughly 2 minutes).
Storage: Most of the storage data is on S3, with storage costs mainly comprising S3 storage fees and S3 API call fees. S3 storage costs are correlated with the data write volume and retention time, while S3 API call costs are correlated with the write volume. Both S3 storage fees and S3 API call fees are billed on a pay-as-you-go basis.
Setting up a three-replica Apache Kafka® cluster to handle a daily peak traffic of 1 GiB/s under the same traffic model would cost at least the following per month (a worked sketch follows the list):
Compute: because dynamic scaling is difficult, capacity must be provisioned for the peak. Broker cost: r6in.large unit price $0.17433 per hour × 730 hours per month × 23 instances = $2,927.00.
Storage:
Storage cost: (total data volume 6,890.625 GB × 3 replicas / 80% disk utilization) × EBS unit price $0.08 per GB-month = $2,067.19.
Storage bandwidth cost: bandwidth unit price $0.04 per MB/s-month × 31 MB/s additionally purchased bandwidth × 23 EBS volumes = $28.52.
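The Apache Kafka estimate above can be reproduced as a short sketch from the figures in this list (the AutoMQ column in the table below comes from actual pay-as-you-go billing and is not recomputed here):

```python
# Reproducing the Apache Kafka monthly cost estimate from the list above.
BROKERS = 23
R6IN_LARGE_PER_HOUR = 0.17433
HOURS_PER_MONTH = 730

DATA_GB = 6890.625            # total data retained under this traffic model
REPLICAS = 3
DISK_UTILIZATION = 0.8
EBS_PER_GB_MONTH = 0.08
EXTRA_BW_MB_S = 31            # extra gp3 bandwidth purchased per volume
EXTRA_BW_PER_MB_S_MONTH = 0.04

broker_cost = R6IN_LARGE_PER_HOUR * HOURS_PER_MONTH * BROKERS           # ≈ $2,927.00
ebs_storage = DATA_GB * REPLICAS / DISK_UTILIZATION * EBS_PER_GB_MONTH  # ≈ $2,067.19
ebs_bandwidth = EXTRA_BW_PER_MB_S_MONTH * EXTRA_BW_MB_S * BROKERS       # ≈ $28.52

print(round(broker_cost, 2), round(ebs_storage, 2), round(ebs_bandwidth, 2))
```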
Cost Category | Item | Apache Kafka (USD / month) | AutoMQ (USD / month) | Multiplier |
---|---|---|---|---|
Compute | Controller | 127.26 | 127.26 | 1.0 |
Compute | Broker | 2,927.00 | 74.20 | 39.4 |
Compute | Total | 3,054.26 | 201.46 | 15.2 |
Storage | EBS Storage | 2,067.19 | 0.43 | - |
Storage | EBS Throughput | 28.52 | - | - |
Storage | S3 Storage | - | 158.48 | - |
Storage | S3 API | - | 98.47 | - |
Storage | Total | 2,095.71 | 257.38 | 8.1 |
Total | | 5,149.97 | 458.84 | 11.2 |
Summary
This benchmark shows that AutoMQ, having re-architected Kafka on a cloud-native foundation, achieves significant efficiency gains and cost savings compared to Apache Kafka:
100x Efficiency Improvement:
In the partition reassignment scenario, AutoMQ reduces the time to reassign a 30 GiB partition from Apache Kafka's 12 minutes to just 2.2 seconds, a 300x efficiency improvement.
Extreme elasticity: AutoMQ can automatically scale out from 0 to 1 GiB/s in just 4 minutes to meet target capacity.
In the catch-up read scenario for historical data, AutoMQ's read-write separation reduces average send latency roughly 200-fold, from 800 ms to 3 ms, and delivers 5x the catch-up throughput of Apache Kafka.
10x Cost Savings:
At a fixed scale, AutoMQ's throughput limit reaches up to 2200 MiB/s, which is 2 times that of Apache Kafka. The P999 send latency is only 1/4 of Apache Kafka's.
In the dynamic load scenario of 40 MiB/s to 1200 MiB/s, AutoMQ's Auto Scaling automatically adjusts capacity during low-peak periods, significantly saving computational resources. The actual measurement shows that AutoMQ achieves 11x cost savings compared to Apache Kafka.