5x Catch-up Read Efficiency

Apache Kafka users have long been plagued by KAFKA-7504[1], a performance issue that remains unresolved to this day. When a cold read occurs in an Apache Kafka cluster and is not resolved quickly, more and more tail reads slow down as well, and the impact gradually spreads to write traffic.

Apache Kafka Cold Read Issue

Apache Kafka's read and write paths rely on two key technologies: the Page Cache[2] and the zero-copy SendFile[3] system call.

  • The Page Cache relieves Kafka of most of its memory management burden, which is handled entirely by the kernel. However, the Page Cache cannot separate hot data from cold data. If a service performs cold reads continuously, the cold data competes with hot data for memory, and tail read performance steadily declines.

  • SendFile is the key to Kafka's zero-copy feature, but the call runs in Kafka's network thread pool. If SendFile has to fetch data from disk (the cold read scenario), it blocks part of this thread pool. Because the same thread pool handles Kafka requests, including write requests, a blocked SendFile call significantly degrades Kafka's write operations.
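The blocking behaviour described for SendFile can be seen in a minimal sketch that uses Python's `os.sendfile` wrapper around the same system call. This is an illustration only and is not Kafka's code (Kafka is written in Java and Scala); the helper names `send_file_zero_copy` and `demo` are hypothetical, and the sketch assumes a Linux host.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock, path):
    """Send the whole file at `path` over `sock` using sendfile(2)."""
    sent_total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # The kernel moves bytes from the Page Cache to the socket
            # without copying them through user space. If the pages are
            # not cached (a cold read), this call waits for disk I/O on
            # the calling thread.
            sent = os.sendfile(sock.fileno(), f.fileno(),
                               sent_total, size - sent_total)
            if sent == 0:
                break
            sent_total += sent
    return sent_total


def demo():
    """Write a small temp file, send it over a socket pair, return the bytes received."""
    payload = b"log-segment-bytes" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        send_file_zero_copy(sender, path)
        sender.close()  # signal end of stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)


if __name__ == "__main__":
    print(len(demo()), "bytes received")
```

The thread that calls `os.sendfile` cannot serve any other request until the call returns, which is why a slow disk read inside a shared network thread pool also delays unrelated write requests.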

AutoMQ Cold and Hot Data Isolation Architecture

From the very first day of its design, AutoMQ fully considered the cold and hot data isolation issue faced by asynchronous message middleware. In AutoMQ's architecture, there are three key data paths:

  • Write Path: Data is persisted to WAL storage using Direct IO, without relying on the Page Cache. A write returns success to the client only after the data has been written through to the WAL, which keeps the write path completely separate from the read path.

  • Tail Read Path

  • Cold Read Path
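The write-through behaviour of the write path can be sketched as an append-only log whose `append` returns only after the data is durable. This is a minimal sketch under stated assumptions, not AutoMQ's implementation: the class `SimpleWal` and the helper functions are hypothetical, the record format (a 4-byte length prefix) is invented for the example, and a buffered write followed by `os.fsync` stands in for Direct IO, which would additionally require aligned buffers and filesystem support.

```python
import os
import struct
import tempfile


class SimpleWal:
    """Hypothetical append-only log: append() returns only once data is on disk."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append(self, payload):
        # Length-prefixed record so a reader can find record boundaries.
        record = struct.pack(">I", len(payload)) + payload
        offset = os.lseek(self._fd, 0, os.SEEK_END)
        view = memoryview(record)
        while len(view) > 0:
            written = os.write(self._fd, view)
            view = view[written:]
        # Assumption: fsync stands in for Direct IO here. Success is
        # reported to the caller only after this returns.
        os.fsync(self._fd)
        return offset

    def close(self):
        os.close(self._fd)


def read_all(path):
    """Read back every record written by SimpleWal."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            records.append(f.read(length))
    return records


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = SimpleWal(path)
    offsets = [wal.append(b"hello"), wal.append(b"cold-read")]
    wal.close()
    return offsets, read_all(path)


if __name__ == "__main__":
    print(demo())
```

Because acknowledgement depends only on the log write, a slow reader elsewhere in the system has no way to delay it in this model.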

Cold Read Performance Evaluation of AutoMQ

The results in the following table come from an actual benchmark of AutoMQ against Kafka (Benchmark: AutoMQ vs. Apache Kafka). They show that, under the same load and on the same machine types, AutoMQ delivers cold read performance at least on par with Kafka without affecting write throughput or latency.

| Comparison Item | Send Latency During Cold Read | Impact on Send Traffic During Cold Read | Cold Read Efficiency (Time to Read 4 TiB of Data) |
| --- | --- | --- | --- |
| AutoMQ | Less than 3 ms | Read-write isolation, maintains 800 MiB/s | 42 minutes |
| Apache Kafka | Approximately 800 ms | Mutual impact, drops to 150 MiB/s | 215 minutes |

The results show that cold reads have no noticeable impact on AutoMQ's send latency, which stays below 3 ms, whereas Apache Kafka's send latency deteriorates to approximately 800 ms and its write traffic drops to 150 MiB/s. In terms of cold read efficiency, AutoMQ reads 4 TiB of data in 42 minutes versus 215 minutes for Apache Kafka, roughly a 5-fold improvement.
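As a quick check of the figures in the table, the short calculation below derives the speedup and the average read throughput implied by the reported times. The function name `read_throughput_mib_s` is introduced here for illustration, and the throughput values are derived from the table rather than reported by the benchmark.

```python
TIB_IN_MIB = 1024 * 1024  # 1 TiB expressed in MiB


def read_throughput_mib_s(size_tib, minutes):
    """Average throughput in MiB/s for reading `size_tib` TiB in `minutes` minutes."""
    return size_tib * TIB_IN_MIB / (minutes * 60)


automq = read_throughput_mib_s(4, 42)    # AutoMQ: 4 TiB in 42 minutes
kafka = read_throughput_mib_s(4, 215)    # Apache Kafka: 4 TiB in 215 minutes
speedup = 215 / 42                       # ratio of the two read times

if __name__ == "__main__":
    print(f"AutoMQ: {automq:.0f} MiB/s, Kafka: {kafka:.0f} MiB/s, speedup: {speedup:.1f}x")
```

The ratio comes out slightly above 5, with implied averages of about 1,664 MiB/s for AutoMQ and 325 MiB/s for Apache Kafka.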

Advantages of Hot and Cold Data Isolation

With the performance issues around hot and cold data isolation resolved, AutoMQ's architecture becomes friendlier to multi-tenant use. Although AutoMQ can scale within seconds, it is still recommended to configure a separate cluster for each business rather than mix workloads, thereby reducing costs. For businesses with very low traffic, however, mixing workloads within a single cluster can effectively reduce the number of clusters. In that case, AutoMQ's hot and cold data isolation addresses the stability issues that arise in multi-tenant scenarios.

Additionally, AutoMQ fully leverages the high throughput characteristics of object storage, achieving a 5-fold increase in cold read efficiency. This allows for quick data replay to meet business needs promptly.

References

[1] Kafka Cold Read Performance Issue: https://issues.apache.org/jira/browse/KAFKA-7504

[2] Linux Page Cache: https://en.wikipedia.org/wiki/Page_cache

[3] Linux SendFile: https://man7.org/linux/man-pages/man2/sendfile.2.html