Fully Utilize Cloud Storage

The term "AutoMQ Kafka" mentioned in this article specifically refers to the open-source project automq-for-kafka under the GitHub AutoMQ organization by AutoMQ CO., LTD.

Introduction

In the IDC era, local disks were unreliable, so storage products had to rely on RAID or their own replication to guarantee data reliability and availability. In the cloud era, however, cloud disks, one of the most widely used forms of cloud storage, already provide four nines of service availability and nine nines of data durability through multi-replica and erasure-coding mechanisms. If a storage application keeps building multiple replicas on top of cloud disks, the reliability gain is negligible, while the storage cost doubles.

AutoMQ for Kafka (AutoMQ Kafka) uses a cloud disk as the storage for its Delta WAL, relying on the cloud disk itself to guarantee data durability. It runs the cloud disk in single-replica mode and does not make extra copies above it. Eliminating replica copying not only saves disk cost, but also frees network bandwidth that can instead serve consumption fan-out throughput.
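
The sketch below illustrates this layering. All class and method names are hypothetical stand-ins (not the actual automq-for-kafka or S3Stream APIs) and only show the shape of the write path: acknowledge once the cloud disk has persisted the record, then later upload accumulated WAL data to object storage:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical sketch of the single-replica write path: records are
 * acknowledged once they are durable on the cloud-disk-backed Delta WAL,
 * and a background task later uploads WAL data as objects to object storage.
 * Names are illustrative, not the real AutoMQ APIs.
 */
public class DeltaWalWritePath {
    private final DeltaWal wal;              // backed by a single cloud disk
    private final ObjectStorage objectStore; // e.g. S3-compatible storage

    public DeltaWalWritePath(DeltaWal wal, ObjectStorage objectStore) {
        this.wal = wal;
        this.objectStore = objectStore;
    }

    /** Ack the producer as soon as the cloud disk has persisted the record. */
    public CompletableFuture<Void> append(ByteBuffer record) {
        return wal.append(record); // no inter-broker replication happens here
    }

    /** Periodically (or on shutdown) upload accumulated WAL data as a WAL object. */
    public CompletableFuture<Void> uploadDeltaWal() {
        return wal.snapshot()
                  .thenCompose(objectStore::putWalObject)
                  .thenCompose(ignored -> wal.markCleared());
    }

    // Hypothetical collaborator interfaces, shown only to keep the sketch self-contained.
    interface DeltaWal {
        CompletableFuture<Void> append(ByteBuffer record);
        CompletableFuture<ByteBuffer> snapshot();
        CompletableFuture<Void> markCleared();
    }
    interface ObjectStorage {
        CompletableFuture<Void> putWalObject(ByteBuffer walData);
    }
}
```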

A single-replica cloud disk solves data reliability, but not availability: during routine operations such as releases and restarts, the node to which the cloud disk is attached goes offline, and while it is down the data on that disk can be neither read nor written. The unavailable window runs from the moment the node shuts down until the operation finishes and recovery completes, which usually takes several minutes; if the node crashes, recovery takes even longer.

So how does AutoMQ Kafka keep a single-replica cloud disk highly available, both during daily operations and after a crash?

Daily Operations

When AutoMQ Kafka performs an operation that involves shutting down a node, such as an upgrade or a restart, the shutdown proceeds as follows (a simplified sketch of this path follows the list):

  1. The Broker receives the SIGTERM (kill -15) shutdown signal and begins a graceful shutdown;

  2. The Broker forcibly uploads the data remaining in the Delta WAL on the cloud disk to object storage as a WAL object, and then marks the Delta WAL as cleared. The Delta WAL is capped at 500 MB, so an instance with 100 MB/s baseline bandwidth can complete the upload in about 2 seconds using burst network performance;

  3. Once the Delta WAL has been uploaded to object storage, the data of all Partitions on this Broker can be accessed by any other node in the cluster through object storage. The Controller then reassigns this Broker's Partitions to other surviving nodes within seconds;

  4. After all Partitions are transferred, the Broker completes the graceful shutdown.

    The entire shutdown process of AutoMQ Kafka completes in about one minute. By uploading the Delta WAL to object storage and reassigning Partitions when a Broker shuts down, the average Partition unavailability is kept to about 1.5 seconds. During daily operations, the sending and consumption behavior of AutoMQ Kafka clients is almost unaffected: clients retry failed requests internally, and the Partitions recover well within the request timeout.
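
A minimal sketch of this graceful-shutdown sequence, with hypothetical collaborator interfaces standing in for the real shutdown hook in automq-for-kafka, could look like this:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical graceful-shutdown sequence for a single Broker.
 * Mirrors the steps above: flush the Delta WAL to object storage,
 * let the Controller move Partitions away, then finish shutting down.
 */
public class GracefulShutdownHook implements Runnable {
    private final DeltaWalUploader walUploader;       // uploads WAL data as a WAL object
    private final PartitionReassignment reassignment; // signals/waits on the Controller

    public GracefulShutdownHook(DeltaWalUploader walUploader, PartitionReassignment reassignment) {
        this.walUploader = walUploader;
        this.reassignment = reassignment;
    }

    @Override
    public void run() {
        // Step 2: force-upload the (<= ~500 MB) Delta WAL and mark it cleared.
        walUploader.forceUploadAndClear().join();

        // Step 3: once the WAL object is durable in object storage, every other
        // node can read this Broker's data, so Partitions can be moved in seconds.
        reassignment.requestReassignAll();

        // Step 4: wait until all Partitions have been taken over, then let the
        // process exit and complete the graceful shutdown.
        reassignment.awaitAllMoved(Duration.ofMinutes(1));
    }

    // Hypothetical collaborators, declared only so the sketch is self-contained.
    interface DeltaWalUploader {
        CompletableFuture<Void> forceUploadAndClear();
    }
    interface PartitionReassignment {
        void requestReassignAll();
        void awaitAllMoved(Duration timeout);
    }
}
```

In practice such a hook would be wired into the SIGTERM handling path, for example by registering it via Runtime.getRuntime().addShutdownHook, which corresponds to step 1 above.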

Abnormal Crashes

In its current version, AutoMQ Kafka still has to wait for a crashed Broker to restart and recover before its Delta WAL can be uploaded after an abnormal crash. The second-level disaster recovery plan described below is planned for subsequent versions.

When a Broker crashes abnormally, its Delta WAL cannot be uploaded to object storage gracefully. The crash problem therefore becomes: how can the crashed node's Delta WAL be uploaded to object storage safely? Here "safely" means the original Broker must be fenced off from further writes before the upload (the Broker may only appear to be dead), and "uploaded" means that another node must be able to read the Delta WAL.
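
AutoMQ performs this fencing with NVMe reservations on a multi-attached cloud disk, as described in the steps below. For illustration only, the sketch shells out to nvme-cli from Java to register a reservation key and acquire a Write Exclusive reservation; the device path, key, and exact nvme-cli flags are assumptions to verify against your nvme-cli version, and AutoMQ's real implementation may drive NVMe reservations differently:

```java
import java.io.IOException;

/**
 * Illustrative NVMe reservation fencing via nvme-cli. Once this node holds a
 * Write Exclusive reservation on the multi-attached cloud disk, writes from the
 * (possibly only apparently dead) original Broker are rejected by the device.
 * Device path, key, and flags below are assumptions for the sketch.
 */
public class NvmeFencing {
    public static void fence(String device, long reservationKey) throws IOException, InterruptedException {
        // Register this host's reservation key on the namespace.
        run("nvme", "resv-register", device,
                "--namespace-id=1", "--nrkey=" + reservationKey, "--rrega=0");
        // Acquire a Write Exclusive reservation (rtype=1), fencing other writers.
        run("nvme", "resv-acquire", device,
                "--namespace-id=1", "--crkey=" + reservationKey, "--rtype=1", "--racqa=0");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }
}
```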

To allow another node to safely upload the Delta WAL of a crashed node, AutoMQ Kafka uses the Multi-Attach capability of cloud disks. The disaster recovery process after an abnormal crash is as follows (a simplified orchestration sketch follows the list):

  1. The Controller detects that the Broker session has timed out, marks this Broker as crashed, and assigns a surviving node in the cluster as the disaster recovery responsible node;

  2. The disaster recovery responsible node first attaches the crashed node's cloud disk via Multi-Attach, calls NVMe Reservation Acquire to fence off the crashed node's write permission, and thereby gains read and write access to the cloud disk;

  3. Read Delta WAL and upload it to object storage;

  4. Shut down the S3Streams corresponding to the Partitions handled by the crashed node, triggering Partition reassignment;

  5. The Partition recovers and reopens from an unclean shutdown, completing the disaster recovery after the crash.

    Steps 1 - 4 take approximately 20 seconds in total. Step 5 takes several minutes, because the index of the last LogSegment has to be rebuilt after the Kafka Partition's unclean shutdown.
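
Putting the steps together, a simplified orchestration of this crash-recovery path could look like the sketch below. Every type and method name here is hypothetical rather than a real automq-for-kafka API; the sketch only encodes the ordering above (attach and fence, upload the WAL, then trigger reassignment and unclean-shutdown recovery):

```java
/**
 * Hypothetical crash-recovery flow run by the disaster recovery responsible
 * node after the Controller has marked a Broker as crashed (step 1).
 * Interfaces are illustrative stand-ins, not real automq-for-kafka APIs.
 */
public class CrashRecovery {
    private final CloudDiskClient diskClient;   // cloud provider API for Multi-Attach
    private final WalReader walReader;          // reads the Delta WAL from the attached disk
    private final ObjectStorage objectStore;    // S3-compatible object storage
    private final Controller controller;        // triggers Partition reassignment

    public CrashRecovery(CloudDiskClient diskClient, WalReader walReader,
                         ObjectStorage objectStore, Controller controller) {
        this.diskClient = diskClient;
        this.walReader = walReader;
        this.objectStore = objectStore;
        this.controller = controller;
    }

    public void recover(String crashedBrokerId, String cloudDiskId) throws Exception {
        // Step 2: attach the crashed node's disk via Multi-Attach and fence the
        // old writer with an NVMe reservation (see the fencing sketch above).
        String device = diskClient.multiAttach(cloudDiskId);
        NvmeFencing.fence(device, /* reservationKey */ 0xA07D1L);

        // Step 3: read the Delta WAL from the disk and upload it as a WAL object.
        byte[] walData = walReader.readAll(device);
        objectStore.putWalObject(crashedBrokerId, walData);

        // Step 4: close the crashed Broker's S3Streams and reassign its Partitions.
        controller.reassignPartitionsOf(crashedBrokerId);

        // Step 5 happens on the new owners: each Partition reopens from an
        // unclean shutdown and rebuilds the index of its last LogSegment.
    }

    // Hypothetical collaborators, declared only to keep the sketch self-contained.
    interface CloudDiskClient { String multiAttach(String cloudDiskId); }
    interface WalReader { byte[] readAll(String device); }
    interface ObjectStorage { void putWalObject(String brokerId, byte[] walData); }
    interface Controller { void reassignPartitionsOf(String brokerId); }
}
```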