Overview

S3Stream is a core streaming storage component of AutoMQ. Following AutoMQ's design philosophy of separating storage from compute, it offloads Apache Kafka's native ISR-based log storage layer to cloud storage services such as EBS and object storage.

S3Stream is a streaming storage library, not a distributed storage service. AutoMQ implements a set of core streaming storage APIs on top of object storage, covering position (offset) management, Append, Fetch, and Trim. The code snippet below shows the key interfaces of these APIs.

import java.util.concurrent.CompletableFuture;

public interface Stream {
    /**
     * Get stream id
     */
    long streamId();

    /**
     * Get stream start offset.
     */
    long startOffset();

    /**
     * Get the offset that the next appended record will receive.
     */
    long nextOffset();

    /**
     * Append RecordBatch to stream.
     */
    CompletableFuture<AppendResult> append(RecordBatch recordBatch);

    /**
     * Fetch a list of RecordBatch from the stream within the given offset range.
     */
    CompletableFuture<FetchResult> fetch(long startOffset, long endOffset, int maxBytesHint);

    /**
     * Trim the stream so that it starts at newStartOffset.
     */
    CompletableFuture<Void> trim(long newStartOffset);
}
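
To make the offset semantics of these calls concrete, here is a minimal in-memory sketch. It is not the S3Stream implementation: InMemoryStream is a hypothetical class, payloads are plain byte arrays instead of RecordBatch, each record occupies exactly one offset, the futures complete immediately, and the end offset of fetch is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Demo entry point: append three records, fetch a range, then trim. */
class StreamSemanticsDemo {
    public static void main(String[] args) {
        InMemoryStream stream = new InMemoryStream(42L);
        stream.append("a".getBytes()).join();
        stream.append("b".getBytes()).join();
        stream.append("c".getBytes()).join();
        System.out.println("nextOffset=" + stream.nextOffset());
        System.out.println("fetched=" + stream.fetch(1, 3).join().size());
        stream.trim(2).join();
        System.out.println("startOffset=" + stream.startOffset());
    }
}

/** Hypothetical in-memory stand-in for a stream (assumption: one record per offset). */
class InMemoryStream {
    private final long streamId;
    // records.get(i) holds the record at offset startOffset + i
    private final List<byte[]> records = new ArrayList<>();
    private long startOffset = 0;

    InMemoryStream(long streamId) { this.streamId = streamId; }

    long streamId() { return streamId; }

    synchronized long startOffset() { return startOffset; }

    synchronized long nextOffset() { return startOffset + records.size(); }

    /** Appends one record and completes with the offset assigned to it. */
    synchronized CompletableFuture<Long> append(byte[] payload) {
        long offset = nextOffset();
        records.add(payload.clone());
        return CompletableFuture.completedFuture(offset);
    }

    /** Returns the records in [from, to); the exclusive end is an assumption of this sketch. */
    synchronized CompletableFuture<List<byte[]>> fetch(long from, long to) {
        if (from < startOffset || to > nextOffset() || from > to) {
            return CompletableFuture.failedFuture(
                new IllegalArgumentException("range out of bounds: [" + from + ", " + to + ")"));
        }
        int begin = (int) (from - startOffset);
        int end = (int) (to - startOffset);
        return CompletableFuture.completedFuture(new ArrayList<>(records.subList(begin, end)));
    }

    /** Drops every record before newStartOffset and advances the start offset. */
    synchronized CompletableFuture<Void> trim(long newStartOffset) {
        if (newStartOffset > nextOffset()) {
            return CompletableFuture.failedFuture(
                new IllegalArgumentException("cannot trim past nextOffset"));
        }
        if (newStartOffset > startOffset) {
            records.subList(0, (int) (newStartOffset - startOffset)).clear();
            startOffset = newStartOffset;
        }
        return CompletableFuture.completedFuture(null);
    }
}
```

In this sketch, trimming never changes nextOffset: offsets keep increasing over the life of the stream, and only the readable range shrinks from the front.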

Core Features of Streaming Storage

Most data on the internet is generated as a continuous stream, and it is stored and processed in the same way so that the business value of real-time data can be extracted. Streaming data therefore places at least the following requirements on storage:

Low Latency: The greatest value of streaming data lies in its freshness. In advertising and recommendation businesses, for example, the demand for real-time processing is very high. The sooner data is stored and computed on, the more value it yields.
High Throughput: Because data arrives as a continuous stream, streaming storage must sustain extremely high throughput. Many businesses demand bandwidth of at least GiB/s.
Low Cost: Massive amounts of streaming data imply high storage costs. Many businesses also need to replay and re-compute data, so retaining streaming data for days is common practice.
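
A quick calculation shows how throughput and retention combine into storage volume. The sketch below is plain arithmetic: the sustained 1 GiB/s ingest rate and the retention periods are illustrative assumptions, and the result is raw data size before any replication or compression.

```java
/** Back-of-the-envelope sizing: raw bytes retained at a sustained ingest rate. */
class RetentionSizing {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Returns TiB accumulated over retentionDays at ingestGiBPerSecond (no replication, no compression). */
    static double retainedTiB(double ingestGiBPerSecond, int retentionDays) {
        double gib = ingestGiBPerSecond * SECONDS_PER_DAY * retentionDays;
        return gib / 1024.0; // 1 TiB = 1024 GiB
    }

    public static void main(String[] args) {
        System.out.printf("1 GiB/s for 1 day:  %.1f TiB%n", retainedTiB(1.0, 1));
        System.out.printf("1 GiB/s for 7 days: %.1f TiB%n", retainedTiB(1.0, 7));
    }
}
```

At a sustained 1 GiB/s, a single day of retention already amounts to roughly 84 TiB, which is why the per-GiB price of the storage medium dominates the cost of streaming storage.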

With the rapid development of big data, the demands on streaming storage in terms of cost, latency, and throughput have increased accordingly. However, no single storage service offered by cloud providers meets all of these requirements at once:

Block storage offers low latency but comes with high costs.
Object storage is cost-effective, but each API call incurs latency on the order of 100 ms.
File storage is billed by bandwidth, which makes it unsuitable for high-throughput streaming storage scenarios.

AutoMQ combines block storage (EBS) with object storage, drawing on the strengths of both to provide streaming storage with low latency, high throughput, low cost, and nearly "infinite" capacity.

S3Stream Architecture

At the core of the S3Stream architecture, data is first durably written to the WAL and then uploaded to S3 storage in near real time. To serve both the Tailing Read and Catch-up Read patterns efficiently, S3Stream also includes a built-in Message Cache component that accelerates reads.

WAL Storage: Uses a low-latency storage medium, typically cloud block storage such as EBS. Each WAL disk needs only a few GiB of space.
S3 Storage: Uses object storage, the largest-scale storage service offered by cloud providers, as high-throughput, cost-effective primary data storage.
Data Caching: Both hot data and pre-fetched cold data are kept in the cache to accelerate reads. An eviction strategy driven by consumer focus (which data consumers are currently reading) improves memory utilization.
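
The interplay of the three components can be sketched in a few lines. This is a toy model under stated assumptions, not S3Stream code: WalFirstStore, READ_AHEAD, and every method name are hypothetical, all three tiers are in-memory collections, and the upload is triggered by hand rather than by a background task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Demo entry point: append, upload, evict, then perform a catch-up read. */
class WalFirstDemo {
    public static void main(String[] args) {
        WalFirstStore store = new WalFirstStore();
        for (int i = 0; i < 5; i++) {
            store.append(new byte[] {(byte) i});
        }
        System.out.println("uploaded=" + store.uploadWal());
        store.evictBefore(5);
        store.read(0); // cache miss: served from the object store, with read-ahead
        System.out.println("offset 2 cached=" + store.isCached(2));
    }
}

/** Hypothetical model of the WAL, object storage, and cache tiers. */
class WalFirstStore {
    static final int READ_AHEAD = 2; // assumed prefetch depth on a cache miss

    private final List<byte[]> wal = new ArrayList<>();                // stand-in for the EBS WAL
    private final TreeMap<Long, byte[]> objectStore = new TreeMap<>(); // stand-in for S3
    private final TreeMap<Long, byte[]> cache = new TreeMap<>();       // stand-in for the message cache
    private long walBaseOffset = 0; // offset of wal.get(0)
    private long nextOffset = 0;

    /** Write path, step 1: persist to the WAL and keep a hot copy in the cache. */
    synchronized long append(byte[] payload) {
        long offset = nextOffset++;
        wal.add(payload);
        cache.put(offset, payload);
        return offset;
    }

    /** Write path, step 2: move WAL contents to the object store, freeing WAL space. */
    synchronized int uploadWal() {
        int uploaded = wal.size();
        for (int i = 0; i < uploaded; i++) {
            objectStore.put(walBaseOffset + i, wal.get(i));
        }
        walBaseOffset += uploaded;
        wal.clear();
        return uploaded;
    }

    /** Read path: a cache hit models a tailing read, a miss models a catch-up read. */
    synchronized byte[] read(long offset) {
        byte[] hot = cache.get(offset);
        if (hot != null) {
            return hot;
        }
        if (!objectStore.containsKey(offset)) {
            throw new IllegalArgumentException("offset not found: " + offset);
        }
        // Prefetch the requested record plus the next READ_AHEAD records into the cache.
        cache.putAll(objectStore.subMap(offset, true, offset + READ_AHEAD, true));
        return cache.get(offset);
    }

    /** Evicts cached records below the given offset, but never records that are not yet uploaded. */
    synchronized void evictBefore(long offset) {
        cache.headMap(Math.min(offset, walBaseOffset)).clear();
    }

    synchronized boolean isCached(long offset) { return cache.containsKey(offset); }

    synchronized int walSize() { return wal.size(); }
}
```

Because the WAL is emptied after every upload, it only ever holds the data written since the last upload, which is consistent with a WAL disk needing just a few GiB. The eviction rule here is a simple offset cutoff; it does not attempt to model the consumer-focus strategy described above.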
