Cost Analysis Report

Info

The term "AutoMQ Kafka" mentioned in this text specifically refers to the open-source project called automq-for-kafka, which is hosted under the GitHub organization AutoMQ, operated by AutoMQ CO.,LTD.

Introduction

For an Apache Kafka® cluster, the cost consists mainly of two parts: computing cost and storage cost. The former mainly covers the servers that run Kafka Brokers (such as AWS EC2), and the latter mainly covers the storage devices that hold the data (such as AWS EBS).

AutoMQ Kafka has been heavily optimized for both computing and storage. Under the same traffic, the total cluster cost can be reduced to about 1/10 of the original.

In this report, we describe the cost optimizations of AutoMQ Kafka for storage and computing in turn and calculate the theoretical cost savings. Finally, we ran an AutoMQ Kafka cluster under a workload modeled on common production scenarios and compared its cost with that of Apache Kafka® (versions below 3.6.0, without tiered storage).

Note

Unless otherwise specified, the prices of the cloud products mentioned below are those of the AWS China (Ningxia) Region (cn-northwest-1) on October 31, 2023.

Storage: Make Full Use of Highly Reliable, Low-cost Cloud Storage

In the cloud era, cloud providers offer storage with very high reliability, along with a range of storage products suited to different scenarios, such as AWS EBS, AWS S3, and AWS EFS. To make full use of cloud storage, AutoMQ Kafka offloads the vast majority of data to Object storage and uses only a very small amount of Block storage (hundreds of MB to several GB) as a buffer. This greatly reduces data storage costs while ensuring performance and reliability.

Single Copy, High Reliability

Cloud disks, one of the most widely used cloud storage products, already provide very high storage reliability. Building multi-replica storage on top of cloud disks adds little reliability but multiplies storage costs.

In AutoMQ Kafka, a small-capacity, single-replica cloud disk is used as a buffer before data is uploaded to Object storage, with no additional data replication. At the same time, AutoMQ Kafka uses a series of mechanisms (see Single-copy High Availability▸) to ensure that the single-replica cloud disk remains highly available in various scenarios.

Therefore, AutoMQ Kafka can achieve the same storage reliability and availability with a single-replica cloud disk as Apache Kafka® does with 3 replicas.

Affordable Object Storage

Object storage is one of the cheapest storage products on the cloud, with extremely low prices and nearly unlimited capacity. AutoMQ Kafka significantly reduces storage costs by offloading the vast majority of data to Object storage (see [Main Storage▸](/docs/automq-s3kafka/Q8fNwoCDGiBOV6k8CDSccKKRn9d# Main-storage)).

Taking AWS as an example, the unit price of AWS S3 Standard Storage is 0.1755 CNY/(GiB*month) and the unit price of AWS EBS gp3 is 0.5312 CNY/(GiB*month), so using S3 saves 67.0% in storage costs.

Let's use AWS EBS and S3 as an example to compare the storage costs of AutoMQ Kafka and Apache Kafka®.

The monthly cost of AutoMQ Kafka using S3 Standard Storage to store 10 TiB of data is:


10 TiB * 1024 GiB/TiB * 0.1755 CNY/(GiB*month) = 1797.12 CNY/month

The monthly cost of 3-replica Apache Kafka® using EBS gp3 to store 10 TiB of data is (assuming a disk utilization of 80%):


10 TiB * 1024 GiB/TiB ÷ 80% * 3 * 0.5312 CNY/(GiB*month) = 20398.08 CNY/month

The storage cost of Apache Kafka® is theoretically 20398.08 / 1797.12 ≈ 11.4 times that of AutoMQ Kafka.
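
The arithmetic above can be reproduced with a short script. This is only a sketch of the calculation in this section: the function and variable names are illustrative, and the prices are the ones quoted in the text.

```python
GIB_PER_TIB = 1024

S3_PRICE = 0.1755   # CNY/(GiB*month), S3 Standard Storage, from the text
GP3_PRICE = 0.5312  # CNY/(GiB*month), EBS gp3, from the text


def monthly_storage_cost(data_tib, price_per_gib_month, replicas=1, utilization=1.0):
    """Monthly cost in CNY of storing `data_tib` TiB of data.

    `replicas` multiplies the stored volume; `utilization` is the fraction of
    provisioned capacity that is actually filled (the "water level").
    """
    provisioned_gib = data_tib * GIB_PER_TIB * replicas / utilization
    return provisioned_gib * price_per_gib_month


automq = monthly_storage_cost(10, S3_PRICE)
kafka = monthly_storage_cost(10, GP3_PRICE, replicas=3, utilization=0.8)

print(f"Unit price saving of S3 over gp3: {1 - S3_PRICE / GP3_PRICE:.1%}")
print(f"AutoMQ Kafka (S3, single copy):   {automq:.2f} CNY/month")
print(f"Apache Kafka (gp3, 3 replicas):   {kafka:.2f} CNY/month")
print(f"Ratio:                            {kafka / automq:.1f}x")
```

Changing `data_tib`, `replicas`, or `utilization` shows how the ratio depends on the replication factor and disk headroom rather than on the data volume itself.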

Computing: Make Full Use of Pay-as-you-go, Arbitrarily Scalable Cloud Computing

In the cloud era, cloud providers offer highly elastic computing services. Users can purchase or release cloud servers at any time as needed. These servers are billed on a pay-as-you-go basis, so users can release them when they are idle to save costs.

AutoMQ Kafka's storage-computing separation architecture takes natural advantage of this elasticity: whether for partition migration or for cluster scaling, AutoMQ Kafka completes the operation within minutes.

Scale According to Volume, No Idle Time

AutoMQ Kafka supports second-level partition migration▸ and continuous data rebalancing▸, so both scale-in and scale-out can be completed within minutes (see Minute-level smooth expansion and contraction▸).

This ability to scale quickly lets AutoMQ Kafka adjust cluster capacity in real time to follow cluster traffic and avoid wasting computing resources. Apache Kafka®, by contrast, has to be provisioned for the estimated peak traffic so that the business is not harmed by a failure to scale out in time when the peak arrives. The difference saves substantial cost in scenarios with pronounced traffic peaks and valleys.

Assuming that the peak-to-valley traffic ratio of a cluster is 10:1 and that the daily peak lasts for 4 hours, the theoretical ratio of the instances (measured in instance-hours per day) required by AutoMQ Kafka and by 3-replica Apache Kafka® is:


(1 * (24 - 4) + 10 * 4) : (10 * 24 * 3) = **1 : 12**
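
The same ratio can be written as a small function. This is a sketch under the assumptions stated above (one instance-unit covers valley traffic, capacity scales linearly with traffic, and the fixed cluster is sized for the peak and multiplied by the replica count as in the formula); the function name is illustrative.

```python
def instance_hours(peak_to_valley, peak_hours, replicas=3, hours_per_day=24):
    """Return (elastic, fixed) daily instance-hours, in units of one valley-sized cluster.

    elastic: capacity follows the traffic (valley most of the day, peak for `peak_hours`).
    fixed:   capacity is provisioned for the peak all day, times the replica factor.
    """
    elastic = 1 * (hours_per_day - peak_hours) + peak_to_valley * peak_hours
    fixed = peak_to_valley * hours_per_day * replicas
    return elastic, fixed


elastic, fixed = instance_hours(peak_to_valley=10, peak_hours=4)
print(f"{elastic} : {fixed} = 1 : {fixed / elastic:g}")
```

With a flatter traffic profile (for example `peak_to_valley=2`) the advantage shrinks, which is why the text limits the claim to workloads with obvious peaks and valleys.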

Spot Instances, Flexible and Cheap

Currently, all major cloud providers offer spot instances (also known as "preemptible instances"). Compared with on-demand instances, they are considerably cheaper, but the provider can reclaim them.

Spot instances may be forcibly released at any time, making them more difficult to utilize than on-demand instances, but AutoMQ Kafka can completely solve this problem:

  • When it receives a signal that the instance is about to be released, AutoMQ Kafka can quickly migrate the partitions on that Broker to other Brokers (see Second-level partition migration▸) and then take the Broker offline gracefully.
  • In extreme cases, when the instance is released before its partitions have been completely migrated, AutoMQ Kafka can still recover and upload the data from the data disk of the instance (see [Abnormal downtime▸](/docs/automq-s3kafka/ KRMqwQBysionzukazS6cnP2Hnmh# Abnormal Downtime)) to avoid data loss.

In an AutoMQ Kafka cluster, all Brokers can be spot instances, thereby significantly saving costs.

Taking the AWS r6i.large model as an example, the on-demand price is 0.88313 CNY/hour and the spot price is 0.2067 CNY/hour, so using spot instances saves 76.6% of the cost.
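
For reference, the saving is simply one minus the price ratio. The snippet below is a minimal sketch using the two r6i.large prices quoted above; the names are illustrative.

```python
ON_DEMAND_PRICE = 0.88313  # CNY/hour, r6i.large on-demand price from the text
SPOT_PRICE = 0.2067        # CNY/hour, r6i.large spot price from the text


def savings_fraction(on_demand, spot):
    """Fraction of the hourly cost saved by running on spot instead of on-demand."""
    return 1 - spot / on_demand


print(f"Spot saving: {savings_fraction(ON_DEMAND_PRICE, SPOT_PRICE):.1%}")
```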

Use Bandwidth Wisely

Every cloud provider sets a network bandwidth limit for each instance type (incoming and outgoing traffic are counted separately). This limit caps the traffic that a single Broker can carry. Conversely, if a Broker uses less network bandwidth per message, the traffic a single machine can carry rises, which saves costs.

Below we compare the traffic usage of AutoMQ Kafka and Apache Kafka® when the cluster's production-consumption ratio is 1:1:

  • For AutoMQ Kafka, when a Broker receives 1 unit of message traffic, its outgoing traffic consists of 1 unit sent to consumers and 1 unit uploaded to Object storage, 2 units in total.
  • For 3-replica Apache Kafka®, when a Broker receives 1 unit of message traffic, its outgoing traffic consists of 1 unit sent to consumers and 2 units replicated to the other replicas, 3 units in total.

It follows that, when the production-consumption ratio is 1:1 and outgoing bandwidth is the limiting factor, the traffic ceiling of an AutoMQ Kafka Broker is 1.5 times that of an Apache Kafka® Broker.
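
The 1.5x figure can be checked with a small model. This sketch assumes that outgoing bandwidth is the only bottleneck and that every unit of incoming traffic is consumed `consumer_fanout` times; the function names and the fanout parameter are illustrative generalizations, not part of AutoMQ Kafka.

```python
def egress_per_unit_ingress(consumer_fanout, extra_copies):
    """Outgoing traffic a Broker emits for each unit of traffic it receives.

    extra_copies: 1 for AutoMQ Kafka (one upload to Object storage),
                  replication_factor - 1 for Apache Kafka (follower replicas).
    """
    return consumer_fanout + extra_copies


def max_ingress(out_bandwidth, consumer_fanout, extra_copies):
    """Highest ingress a Broker can sustain before saturating outgoing bandwidth."""
    return out_bandwidth / egress_per_unit_ingress(consumer_fanout, extra_copies)


automq_limit = max_ingress(out_bandwidth=1.0, consumer_fanout=1, extra_copies=1)
kafka_limit = max_ingress(out_bandwidth=1.0, consumer_fanout=1, extra_copies=2)
print(f"AutoMQ Kafka / Apache Kafka ingress ceiling: {automq_limit / kafka_limit:.2f}")
```

Under the same model the advantage narrows as the consumer fanout grows, since consumer traffic then dominates the outgoing bandwidth for both systems.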

Real-World Scenario Test

To verify the cost advantages of AutoMQ Kafka, we built an AutoMQ Kafka cluster on AWS and simulated common scenarios of sending and receiving messages. We then obtained the cluster capacity curve and the cost breakdown through AWS CloudWatch and AWS Cost Explorer, and compared the cost of AutoMQ Kafka with that of Apache Kafka®.

Test Plan

It is worth mentioning that, to simulate cluster traffic in real scenarios, we modified the OpenMessaging Benchmark Framework so that the send and receive traffic can change within specified time periods. The traffic curve used in the test is:

  • The normal traffic is 80 MiB/s.
  • From 00:00 to 01:00 the traffic rises to 800 MiB/s, by 02:00 the traffic drops to 400 MiB/s, and by 03:00 the traffic returns to 80 MiB/s.
  • From 13:00 to 13:45 the traffic rises to 800 MiB/s, and by 14:30 the traffic returns to 80 MiB/s.
  • From 18:00 to 19:00 the traffic rises to 1200 MiB/s, and by 20:00 the traffic returns to 80 MiB/s.

See Appendix 1 for other configuration details.
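
As a cross-check of the traffic curve, the sketch below integrates it over 24 hours. It assumes that traffic changes linearly between the time points listed above (the text only says that traffic "rises" and "returns"); under that assumption the total comes out at about 16242 GiB, the data volume used in the cost notes later in this report. All names are illustrative.

```python
# (hour of day, traffic in MiB/s) breakpoints taken from the list above.
BREAKPOINTS = [
    (0.0, 80), (1.0, 800), (2.0, 400), (3.0, 80),
    (13.0, 80), (13.75, 800), (14.5, 80),
    (18.0, 80), (19.0, 1200), (20.0, 80),
    (24.0, 80),
]


def total_gib(points):
    """Integrate a piecewise-linear MiB/s curve over time using the trapezoid rule."""
    total_mib = 0.0
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        seconds = (t1 - t0) * 3600
        total_mib += (r0 + r1) / 2 * seconds
    return total_mib / 1024


print(f"Data produced in 24 hours: {total_gib(BREAKPOINTS):.1f} GiB")
```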

Test Results

We ran the above load in the AWS Ningxia Region (cn-northwest-1) for 24 hours and obtained the following results.

Dynamic Expansion and Contraction

AWS CloudWatch shows how the number of Brokers and the total cluster traffic changed over the test period, as shown below:

Note:

  • The blue curve in the figure is the total traffic of messages produced in the cluster (that is, the total size of messages produced per second). Since the production-consumption ratio is 1:1, this is also the total traffic of consumed messages. The unit is byte/s, and the labels on the left Y-axis use decimal (base-10) prefixes, for example 1M = 1,000,000 and 1G = 1,000,000,000.
  • Because of rebalancing and the release of spot instances, the number of Brokers may still rise or fall briefly even when cluster traffic is stable.

It can be seen from the figure:

  • AutoMQ Kafka scales out and in in real time as traffic rises and falls, with only minute-level delays, which saves a lot of computing cost.
  • Scaling AutoMQ Kafka out or in causes only brief, small traffic fluctuations and does not affect cluster availability.

Cost Components

The total cluster cost and its components during the corresponding time can be obtained through AWS Cost Explorer. See table below:

| Type | Cost (CNY) | Proportion |
| --- | --- | --- |
| EC2 - On-Demand Instances | 64.450 | 34.3% |
| EC2 - Spot Instances | 19.446 | 10.4% |
| S3 - Standard Storage Fee | 93.715 | 49.9% |
| S3 - API call fee | 7.692 | 4.1% |
| EBS - gp3 | 2.431 | 1.3% |
| Total | 187.734 | 100.0% |

Note:

  • The cost accounting period listed in the table is from 2023-11-01 00:00 to 2023-11-01 23:59, a total of 24 hours.
  • To ensure cluster stability, AutoMQ Kafka uses 3 on-demand instances as Controllers (they also act as Brokers and carry a small share of the traffic). Their cost is 0.88313 CNY/hour * 3 * 24 hour = 63.59 CNY, which is basically consistent with the "EC2 - On-Demand Instances" item in the table.
  • Because AWS Cost Explorer reports the "S3 - Standard Storage Fee" with a delay, the cost listed in the table is an estimate: 0.1755 CNY/(GiB*month) * 16242 GiB / 730 hour/month * 24 hour = 93.715 CNY. 16242 GiB is the amount of data generated by the aforementioned traffic in 24 hours.
  • The "S3 - API call fee" includes charges for calling the following APIs: GetObject, PutObject, InitiateMultipartUpload, UploadPart, CopyPart and CompleteMultipartUpload.
  • The table does not list items that cost less than 0.001 CNY, such as the API call fees for CreateBucket, ListBucket, and DeleteBucket.
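
The table and the two estimates in the notes can be re-derived as follows. This is a sketch that only re-adds the numbers already given; the variable names are illustrative. The S3 storage estimate computed from the rounded 16242 GiB figure agrees with the table value to within rounding.

```python
costs = {
    "EC2 - On-Demand Instances": 64.450,
    "EC2 - Spot Instances": 19.446,
    "S3 - Standard Storage Fee": 93.715,
    "S3 - API call fee": 7.692,
    "EBS - gp3": 2.431,
}

total = sum(costs.values())
for name, cost in costs.items():
    print(f"{name:<28}{cost:>9.3f}{cost / total:>8.1%}")
print(f"{'Total':<28}{total:>9.3f}")

# Estimates quoted in the notes above.
controller_cost = 0.88313 * 3 * 24               # 3 on-demand Controllers for 24 hours
s3_storage_estimate = 0.1755 * 16242 / 730 * 24  # 16242 GiB held for 24 of 730 hours
print(f"Controller cost:     {controller_cost:.2f} CNY")
print(f"S3 storage estimate: {s3_storage_estimate:.2f} CNY")
```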

As can be seen from the table:

  • Since all Brokers in AutoMQ Kafka use spot instances and the instances are scaled out and in on demand, computing costs are significantly reduced.
  • AutoMQ Kafka stores most of the data in Object storage (S3), and only a small amount of data is kept in Block storage (EBS) as a buffer, which greatly reduces storage costs.

In summary, AutoMQ Kafka's minute-level smooth expansion and contraction▸ and its use of Object storage (see S3 Stream▸) let it take full advantage of the cloud, significantly reduce costs, and be truly cloud native.

Comparison with Apache Kafka®

We also estimated the cost of running Apache Kafka® (versions below 3.6.0, without tiered storage) in the same scenario.

We make the following assumptions for this Apache Kafka® cluster:

  • On-demand instances are purchased for the peak cluster traffic (1200 MiB/s), using the same r6i.large model (on-demand price 0.88313 CNY/hour, spot price 0.2067 CNY/hour, baseline bandwidth 100 MiB/s) at a network utilization of 80%. In addition, 3 more on-demand instances are purchased as Controllers.
  • Block storage is purchased for the total data volume of the cluster (16242 GiB), using gp3 (0.5312 CNY per GiB per month) with 3 replicas at a storage utilization of 80%.

The estimate is as follows:


The upper limit of traffic carried by a single broker: 100 MiB/s * 80% / (1 + 2) = 26.67 MiB/s
Number of brokers in the cluster: 1200 MiB/s ÷ 26.67 MiB/s = 45
Number of instances required: 45 + 3 = 48
Daily computing cost: 48 * 24 hour * 0.88313 CNY/hour = 1017.366 CNY

Required storage size: 16242 GiB * 3 / 80% = 60907.5 GiB
Daily storage cost: 60907.5 GiB * 0.5312 CNY/(GiB*month) / 730 hour/month * 24 hour = 1063.695 CNY

Total cost: 1017.366 CNY + 1063.695 CNY = 2081.061 CNY
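
The estimate above can be parameterized so that the assumptions (peak traffic, utilization, replica count, prices) are easy to vary. This is a sketch of the same arithmetic rather than a sizing tool; the function and parameter names are illustrative, and the defaults are the values used in this section.

```python
import math


def kafka_daily_cost(
    peak_mib_s=1200,         # peak cluster traffic
    data_gib=16242,          # data produced in 24 hours
    baseline_mib_s=100,      # r6i.large baseline bandwidth
    network_util=0.8,        # network utilization ("water level")
    disk_util=0.8,           # storage utilization ("water level")
    replicas=3,
    controllers=3,
    instance_price=0.88313,  # CNY/hour, on-demand
    disk_price=0.5312,       # CNY/(GiB*month), gp3
    hours_per_month=730,
):
    # Each unit of ingress costs 1 unit of consumer egress plus (replicas - 1) units of replication.
    egress_per_ingress = 1 + (replicas - 1)
    # round() guards against floating-point noise before taking the ceiling.
    brokers = math.ceil(
        round(peak_mib_s * egress_per_ingress / (baseline_mib_s * network_util), 6)
    )
    compute = (brokers + controllers) * 24 * instance_price
    disk_gib = data_gib * replicas / disk_util
    storage = disk_gib * disk_price / hours_per_month * 24
    return {"brokers": brokers, "compute": compute, "storage": storage,
            "total": compute + storage}


estimate = kafka_daily_cost()
print(f"Brokers:        {estimate['brokers']}")
print(f"Computing cost: {estimate['compute']:.3f} CNY")
print(f"Storage cost:   {estimate['storage']:.3f} CNY")
print(f"Total cost:     {estimate['total']:.3f} CNY")
```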

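Compare this with AutoMQ Kafka:

| Cost Category | Apache Kafka® (CNY) | AutoMQ Kafka (CNY) | Ratio (Apache Kafka® / AutoMQ Kafka) |
| --- | --- | --- | --- |
| Computing | 1017.366 | 83.896 | 12.13 |
| Storage | 1063.695 | 103.838 | 10.24 |
| Total | 2081.061 | 187.734 | 11.09 |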

It can be seen that AutoMQ Kafka takes full advantage of the elasticity of the cloud and makes full use of Object storage. Compared with Apache Kafka®, it significantly reduces both computing and storage costs, ultimately bringing the total cost down by a factor of more than 10.

Appendix 1: Test Configuration

The configuration file kos-config.yaml of the AutoMQ for Kafka installer:


kos:
  installID: xxxx
  vpcID: vpc-xxxxxx
  cidr: 10.0.1.0/24
  zoneNameList: cn-northwest-1b
  kafka:
    controllerCount: 3
    heapOpts: "-Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:MaxDirectMemorySize=6g"
    controllerSettings:
      - autobalancer.reporter.network.in.capacity=60000
      - autobalancer.reporter.network.out.capacity=60000
    brokerSettings:
      - autobalancer.reporter.network.in.capacity=100000
      - autobalancer.reporter.network.out.capacity=100000
    commonSettings:
      - metric.reporters=kafka.autobalancer.metricsreporter.AutoBalancerMetricsReporter,org.apache.kafka.server.metrics.s3stream.KafkaS3MetricsLoggerReporter
      - s3.metrics.logger.interval.ms=60000
      - autobalancer.topic.num.partitions=1
      - autobalancer.controller.enable=true
      - autobalancer.controller.anomaly.detect.interval.ms=60000
      - autobalancer.controller.metrics.delay.ms=20000
      - autobalancer.controller.network.in.distribution.detect.threshold=0.2
      - autobalancer.controller.network.in.distribution.detect.avg.deviation=0.05
      - autobalancer.controller.network.out.distribution.detect.threshold=0.2
      - autobalancer.controller.network.out.distribution.detect.avg.deviation=0.05
      - autobalancer.controller.network.in.utilization.threshold=0.8
      - autobalancer.controller.network.out.utilization.threshold=0.8
      - autobalancer.controller.execution.interval.ms=100
      - autobalancer.controller.execution.steps=1024
      - autobalancer.controller.load.aggregation=true
      - autobalancer.controller.exclude.topics=__consumer_offsets
      - autobalancer.reporter.metrics.reporting.interval.ms=5000
      - s3.network.baseline.bandwidth=104824045
      - s3.wal.capacity=4294967296
      - s3.wal.cache.size=2147483648
      - s3.wal.object.size=536870912
      - s3.stream.object.split.size=8388608
      - s3.object.block.size=16777216
      - s3.object.part.size=33554432
      - s3.block.cache.size=1073741824
      - s3.object.compaction.cache.size=536870912
  scaling:
    cooldown: 10
    alarmPeriod: 60
    scalingAlarmEvaluationTimes: 1
    fallbackAlarmEvaluationTimes: 2
    scalingNetworkUpBoundRatio: 0.8
    scalingNetworkLowerBoundRatio: 0.8
  ec2:
    instanceType: r6i.large
    controllerSpotEnabled: false
    keyPairName: kafka_on_s3_benchmark_key-xxxx
    enablePublic: true
    enableDetailedMonitor: true
  accessKey: xxxxxx
  secretKey: xxxxxx

Some notes:

  • All instances use the r6i.large model, whose network baseline bandwidth is 0.781 Gbps, so s3.network.baseline.bandwidth is set to 104824045 (Byte).
  • In order to simulate the production scenario, the number of Controllers is configured to 3, and the Controllers use on-demand instances.
  • In order to sense traffic changes quickly and scale out and in promptly, AWS EC2 detailed monitoring is enabled, kos.scaling.cooldown is set to 10 (s), and kos.scaling.alarmPeriod is set to 60 (s).
  • In order to take full advantage of the elasticity of AutoMQ Kafka, both kos.scaling.scalingNetworkUpBoundRatio and kos.scaling.scalingNetworkLowerBoundRatio are set to 0.8.
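
The value 104824045 follows from the 0.781 Gbps baseline if "Gbps" is read with a binary prefix (2^30 bit/s). That reading is an assumption inferred from the numbers in this appendix, and the helper name below is illustrative.

```python
def gbps_to_bytes_per_second(gbps):
    """Convert Gbps to bytes per second, assuming 1 Gbps = 2**30 bit/s (binary prefix)."""
    return int(gbps * 2**30 / 8)


bandwidth = gbps_to_bytes_per_second(0.781)
print(bandwidth)                           # value used for s3.network.baseline.bandwidth
print(f"{bandwidth / 2**20:.1f} MiB/s")    # roughly the 100 MiB/s used in the Apache Kafka estimate
```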

OpenMessaging Benchmark Framework is configured as follows:

driver.yaml:


name: AutoMQ for Kafka
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

# Kafka Client-specific Configuration
replicationFactor: 3
reset: false

topicConfig: |
  min.insync.replicas=2

commonConfig: |
  bootstrap.servers=10.0.1.134:9092,10.0.1.132:9092,10.0.1.133:9092

producerConfig: |
  acks=all
  linger.ms=0
  batch.size=131072
  send.buffer.bytes=1048576
  receive.buffer.bytes=1048576

consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
  auto.commit.interval.ms=0
  max.partition.fetch.bytes=131072
  send.buffer.bytes=1048576
  receive.buffer.bytes=1048576

workload.yaml:


name: 1-topic-128-partitions-4kb-4p4c-dynamic

topics: 1
partitionsPerTopic: 128
messageSize: 4096
payloadFile: "payload/payload-4Kb.data"
subscriptionsPerTopic: 1
consumerPerSubscription: 4
producersPerTopic: 4
producerRate: 19200
producerRateList:
- [16, 0, 20480]
- [17, 0, 204800]
- [18, 0, 102400]
- [19, 0, 20480]
- [ 5, 0, 20480]
- [ 5, 45, 204800]
- [ 6, 30, 20480]
- [10, 0, 20480]
- [11, 0, 307200]
- [12, 0, 20480]
consumerBacklogSizeGB: 0
warmupDurationMinutes: 0
testDurationMinutes: 2100
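
The entries of producerRateList can be mapped back to the traffic curve described in the test section. The sketch below assumes that each entry is [hour, minute, messages per second] with the time expressed in UTC, and that the test description uses UTC+8; both are inferences from the numbers, not documented behavior of the modified benchmark.

```python
MESSAGE_SIZE = 4096  # bytes, from messageSize in workload.yaml

PRODUCER_RATE_LIST = [
    [16, 0, 20480], [17, 0, 204800], [18, 0, 102400], [19, 0, 20480],
    [5, 0, 20480], [5, 45, 204800], [6, 30, 20480],
    [10, 0, 20480], [11, 0, 307200], [12, 0, 20480],
]


def to_local_schedule(rate_list, utc_offset_hours=8):
    """Convert [hour, minute, msg/s] entries to (local "HH:MM", MiB/s) pairs."""
    schedule = []
    for hour, minute, msgs_per_s in rate_list:
        local_hour = (hour + utc_offset_hours) % 24
        mib_per_s = msgs_per_s * MESSAGE_SIZE / 2**20
        schedule.append((f"{local_hour:02d}:{minute:02d}", mib_per_s))
    return schedule


for time_of_day, mib_per_s in to_local_schedule(PRODUCER_RATE_LIST):
    print(f"{time_of_day}  {mib_per_s:7.1f} MiB/s")
```

Under these assumptions the converted schedule lines up with the 80, 400, 800 and 1200 MiB/s points listed in the test section.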

In addition, two c6in.2xlarge instances are used as workers. Their network baseline bandwidth is 12.5 Gbps (i.e. 1600 MiB/s), which can meet the needs of sending and receiving messages during peak traffic.