5x Catch-up Read Efficiency

Apache Kafka users have long been troubled by KAFKA-7504[1], an unresolved performance issue. When a cold read in an Apache Kafka® cluster does not finish quickly, a growing number of tailing reads are dragged into the cold read state as well, and write traffic is eventually affected significantly.

Apache Kafka Cold Read Problem

The read/write path in Apache Kafka® introduces two key technologies: Page Cache[2] and the zero-copy SendFile[3] system call.

  • The Page Cache greatly simplifies memory management for Kafka, because caching is handled entirely by the kernel. However, it cannot separate hot data from cold data. If a service performs cold reads continuously, those reads compete with hot data for memory, and tailing read performance declines steadily.

  • SendFile is the key to zero-copy transfers in Kafka, but the call runs inside Kafka's network thread pool. If SendFile has to read data from disk (the cold read scenario), it blocks threads in that pool. Because the same pool handles every incoming Kafka request, including write requests, a blocked SendFile call significantly impacts Kafka's write operations.
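The blocking behavior described above can be seen in a minimal sketch of a sendfile-based transfer. Kafka itself runs on the JVM and reaches sendfile through `FileChannel.transferTo`; Python's `socket.sendfile()` is used here only for brevity. The helper names `send_file_range` and `demo`, and the temporary file standing in for a log segment, are illustrative assumptions.

```python
import os
import socket
import tempfile


def send_file_range(sock: socket.socket, path: str, offset: int, count: int) -> int:
    """Send `count` bytes of `path`, starting at `offset`, over `sock`.

    socket.sendfile() uses os.sendfile() where the platform supports it, so
    bytes move from the Page Cache to the socket inside the kernel. If the
    requested range is not in the Page Cache, the calling thread waits for
    the disk read, which is the cold read stall described above.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)


def demo() -> bytes:
    # Hypothetical stand-in for a log segment file.
    payload = bytes(range(256)) * 16  # 4 KiB of sample data
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    sender, receiver = socket.socketpair()
    try:
        sent = send_file_range(sender, path, offset=1024, count=2048)
        sender.close()  # signal end of stream to the receiver
        received = b""
        while len(received) < sent:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
```

In this sketch the file was just written, so its pages are already cached and the call returns quickly. The same call against a range that has been evicted from the Page Cache would hold the calling thread until the disk read completes.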

AutoMQ Cold and Hot Data Isolation Architecture

From day one, AutoMQ has fully considered the cold and hot data isolation problem faced by asynchronous messaging middleware. The AutoMQ architecture includes three critical data paths:

  • Write Path: Data is persistently written to WAL storage using Direct IO, avoiding reliance on the Page Cache. Once data is written through to WAL, it is immediately acknowledged to the client, completely separating it from the read path.

  • Tail Read Path: In tail read scenarios, data is read directly from AutoMQ's own cache. This cache component plays a role similar to the Page Cache, but it keeps hot and cold data in separate partitions, and its eviction policy takes consumer interest into account to improve memory efficiency.

  • Cold Read Path: In cold read scenarios, data is read directly from S3 storage, and a prefetch strategy builds a cold read cache. Thanks to the very high throughput of object storage and the hot-cold isolation mechanism, AutoMQ reads cold data several times faster than Apache Kafka.
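The isolation between the tail read and cold read paths can be illustrated with a toy model. This is a sketch under stated assumptions, not AutoMQ's implementation: the class names `LruCache` and `IsolatedReadCache`, the plain LRU eviction, the fixed prefetch window, and the dict standing in for the WAL and S3 are all hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Small LRU cache; evicts the least recently used entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)


class IsolatedReadCache:
    """Toy model: hot (tail) and cold data live in separate caches."""

    def __init__(self, tail_capacity: int, cold_capacity: int, prefetch: int = 2):
        self.object_store = {}  # stand-in for durable storage (WAL + S3)
        self.tail = LruCache(tail_capacity)
        self.cold = LruCache(cold_capacity)
        self.prefetch = prefetch

    def append(self, offset: int, record: bytes):
        # Write path: persist first, then keep the newest records hot.
        self.object_store[offset] = record
        self.tail.put(offset, record)

    def read(self, offset: int):
        record = self.tail.get(offset)
        if record is not None:
            return record, "tail"
        record = self.cold.get(offset)
        if record is not None:
            return record, "cold-cache"
        # Cold miss: fetch the record plus a prefetch window from the store.
        # Only the cold cache is filled, so hot data is never evicted.
        for o in range(offset, offset + 1 + self.prefetch):
            if o in self.object_store:
                self.cold.put(o, self.object_store[o])
        return self.object_store[offset], "object-store"
```

Because a cold miss only ever fills `cold`, a consumer replaying old offsets cannot push recent records out of `tail`, which is the property the three data paths above are designed to provide.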

AutoMQ Cold Read Performance Evaluation

The following table presents results from a benchmark comparing AutoMQ and Apache Kafka (Benchmark: AutoMQ vs. Apache Kafka). Under the same load and machine configuration, AutoMQ completes cold reads about 5x faster than Kafka, while its write throughput and latency remain unaffected.

| Comparison Item | Cold Read Transmission Latency | Impact on Transmission Traffic During Cold Read | Cold Read Efficiency (Time to Read 4 TiB of Data) |
|---|---|---|---|
| AutoMQ | Less than 3 ms | Read-write isolation, maintains 800 MiB/s | 42 minutes |
| Apache Kafka | Approximately 800 ms | Mutual interference, drops to 150 MiB/s | 215 minutes |

The results show that cold reads on AutoMQ do not affect transmission latency, whereas on Apache Kafka transmission latency degrades to nearly one second and write throughput drops to 150 MiB/s. In terms of cold read efficiency, AutoMQ is about 5x faster when reading 4 TiB of data (42 minutes versus 215 minutes).
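The 5x figure and the implied read throughput follow directly from the reported times. The short calculation below is a sketch that assumes 1 TiB = 1024 × 1024 MiB and that each reported time covers the full 4 TiB; the function name is illustrative.

```python
MIB_PER_TIB = 1024 * 1024


def effective_throughput_mib_s(data_tib: float, minutes: float) -> float:
    """Average read throughput in MiB/s for `data_tib` read in `minutes`."""
    return data_tib * MIB_PER_TIB / (minutes * 60)


automq_mib_s = effective_throughput_mib_s(4, 42)   # roughly 1664 MiB/s
kafka_mib_s = effective_throughput_mib_s(4, 215)   # roughly 325 MiB/s
speedup = 215 / 42                                  # roughly 5.1x
```

Under these assumptions, the averaged cold read rate is about 1.6 GiB/s for AutoMQ versus about 0.3 GiB/s for Apache Kafka.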

Advantages of Hot and Cold Data Isolation

With the performance problems of hot and cold data isolation addressed, AutoMQ's architecture becomes friendlier to multi-tenant use. AutoMQ scales in seconds, so it recommends a separate cluster for each business rather than relying on mixed deployment to cut costs. Even so, hot and cold data isolation within a cluster lets many low-traffic businesses share a cluster, which effectively reduces the number of clusters to operate. This isolation addresses the stability issues of multi-tenant scenarios.

Additionally, AutoMQ takes full advantage of the high throughput of object storage, which yields a 5x improvement in cold read efficiency and lets businesses replay data quickly when they need to.

References

[1] Kafka cold read performance issue: https://issues.apache.org/jira/browse/KAFKA-7504

[2] Linux Page Cache: https://en.wikipedia.org/wiki/Page_cache

[3] Linux SendFile: https://man7.org/linux/man-pages/man2/sendfile.2.html