Best Practices

Resource Allocation Management

Recommendation 1: Properly Set the Number of Topic Partitions to Avoid Throughput Bottlenecks and Waste

The number of Kafka Topic partitions affects the production and consumption throughput that a Topic can support. Messages with the same partition key are sent to the same partition, which preserves their order, and each partition can be consumed by only one consumer within a Consumer Group.

AutoMQ is built on object storage. Compared with Apache Kafka's local-file architecture, it can support several times as many partitions at the same cluster size, and AutoMQ does not charge for partitions.

When allocating partitions with AutoMQ, there is no need to consider costs. You only need to evaluate the appropriate number of partitions based on the estimated business throughput (the write throughput limit for a single AutoMQ partition is 4MB/s), avoiding the need to expand partitions later as business volume increases. The number of partitions does not need to be a multiple of the number of nodes, as AutoMQ will automatically balance the partitions.

Example:

An application estimates that the data write throughput of a Topic will be approximately 100MB/s in the next year, and each downstream consumer can process a maximum of 2MB of data per second.

Therefore, when allocating partitions, consider:

Based on the producer's perspective, 100MB/s ÷ 4MB/s = 25 partitions are needed.

Based on the consumer's perspective, 100MB/s ÷ 2MB/s = 50 consumers, and 50 consumers require at least 50 partitions.

Therefore, at least 50 partitions need to be assigned to meet the application's requirements.
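
For reference, the following is a minimal sketch of creating such a Topic with 50 partitions using the Kafka AdminClient; the bootstrap address, topic name, and replication factor are placeholders and should be replaced with values appropriate for your cluster.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Replace with your cluster's bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 50 partitions cover both the producer-side estimate (100MB/s ÷ 4MB/s = 25)
            // and the consumer-side requirement of 50 consumers.
            // The replication factor is a placeholder; set it according to your cluster.
            NewTopic topic = new NewTopic("orders-events", 50, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}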

Recommendation 2: Enable ACLs in Platform Scenarios to Achieve Fine-Grained Resource Control

For platform-based scenarios, such as real-time computing platforms, it is recommended to enable ACLs when using Kafka to achieve fine-grained resource control. Once ACLs are enabled, Kafka clients must authenticate before they can access specific Topics and Consumer Groups (a sketch of granting such permissions follows the list below). The benefits of this approach include:

  • Preventing business units from arbitrarily creating new Topics and other resources, which leads to resource misuse and governance challenges.

  • Avoiding the confusion caused by different business units mixing Consumer Groups, which can disrupt consumption load balancing.

  • By identifying Topics and Consumer Groups, you can easily find related subscribers and upstream/downstream business units, facilitating business governance.
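
As an illustration, the following is a hedged sketch of granting one principal read access to a specific Topic and Consumer Group through the Kafka AdminClient; the principal name, topic, group, and bootstrap address are placeholders, and the admin connection must itself be authenticated according to your cluster's security setup.

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantAclExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // SASL/SSL settings for an administrative identity go here as well.

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the authenticated user "order-service" to read one specific Topic ...
            AclBinding topicRead = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "orders-events", PatternType.LITERAL),
                    new AccessControlEntry("User:order-service", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
            // ... and to consume and commit offsets with one specific Consumer Group.
            AclBinding groupRead = new AclBinding(
                    new ResourcePattern(ResourceType.GROUP, "orders-consumer", PatternType.LITERAL),
                    new AccessControlEntry("User:order-service", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(Arrays.asList(topicRead, groupRead)).all().get();
        }
    }
}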

Producer Application

Recommendation 1: For Kafka Producer Clients with Versions Lower than 2.1, It Is Advised to Set the Retry Count.

Producer applications need to check the SDK version. If the current version is lower than 2.1, manually set the retry count so that message sending is automatically retried on failure. This helps avoid send failures caused by transient events such as server maintenance or automatic partition reassignment.

Retry parameters can be set as follows:


// Maximum number of retries
retries=Integer.MAX_VALUE

// Initial backoff delay time
retry.backoff.ms=100

// Maximum backoff delay time
retry.backoff.max.ms=1000

// RPC timeout for each send request
request.timeout.ms=30000

// The total timeout duration for the entire send call. If this time is exceeded, retries will cease, and a failure exception will be returned to the caller.
delivery.timeout.ms=120000

Suggestion 2: Optimize Producer Batch Parameters to Avoid Excessive QPS Consumption by Fragmented Requests.

Kafka is a stream storage system designed for high-throughput scenarios, and the most effective way to use it is to batch messages to improve transmission efficiency and throughput. When sending messages, the Producer needs to set batching parameters appropriately to avoid the following situations:

  • Avoid sending only one message per request: If the producer sends only one message at a time, it generates a large number of Produce requests, which consumes server CPU and degrades cluster performance.

  • Avoid setting excessively long batch wait times: The producer needs to set a reasonable wait time so that, in low-traffic scenarios where a batch cannot fill up, messages are not delayed excessively.

Refer to the following for specific batch parameter settings:


// Batch size, up to 128KB of data can be accumulated at one time.
batch.size=131072
// Batch time, for the producer, if the batch limit is not reached within the specified time, sending will still be triggered. This time represents the maximum delay for sending.
linger.ms=10
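
As a complement to the configuration above, the following is a minimal sketch of sending asynchronously with a callback so that batching can take effect; the bootstrap address and topic name are placeholders. Calling send(...).get() on every record forces one request per message and defeats batching.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchedSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072);   // 128KB batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);        // wait up to 10ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                // Asynchronous send: records accumulate into batches in the background.
                // Avoid calling send(...).get() per record, which serializes requests.
                producer.send(new ProducerRecord<>("orders-events", "key-" + i, "value-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            }
                        });
            }
            // flush() blocks until all buffered records have been sent.
            producer.flush();
        }
    }
}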

Consumer Application

Recommendation 1: Upgrade the Consumer SDK to 3.2 or Above When Consuming in Assign Mode

When Kafka Consumer applications consume messages in Assign mode, where partitions are assigned manually rather than balanced by the group coordinator, it is crucial to upgrade the SDK to version 3.2 or above. Earlier versions have a defect in Assign mode that prevents consumer offsets from being committed and updated in time. For detailed defect records, refer to Kafka Issue-12563.

Recommendation 2: Reasonably Control the Consumer Heartbeat Timeout and the Number of Messages Polled Per Poll to Avoid Frequent Rebalance.

Kafka Consumers are grouped and load-balanced using the same Group Id. If a single consumer experiences a heartbeat timeout, it will be evicted from the consumer group, and the remaining consumers will perform a Rebalance and reassign the partitions.

In production scenarios, avoid parameter misconfigurations that lead to unexpected heartbeat timeouts and Rebalances. Otherwise, the consumer group membership will keep changing and messages cannot be consumed steadily.

Causes of triggering unexpected Rebalance:

  • Clients before version 0.10.2: The Consumer did not have a dedicated heartbeat thread; heartbeats were coupled with the poll interface. As a result, if message processing lags, the Consumer's heartbeat times out and a Rebalance is triggered.

  • Clients from version 0.10.2 onwards: If consumption is too slow and the interval between two poll calls exceeds the limit defined by the max.poll.interval.ms parameter (default 5 minutes), the client proactively leaves the consumer group, triggering a Rebalance.

Optimization Suggestions:

  • Upgrade the client to a version higher than 0.10.2.

  • Adjust the heartbeat and poll parameters by referring to the configuration sketch after this list.

  • Increase the client's consumption speed as much as possible to avoid processing delays.

  • Avoid having a single Consumer Group subscribe to a large number of Topics; aim for a dedicated Consumer Group for each Topic.
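
The following is a hedged configuration sketch for the parameters involved; the values are illustrative and should be tuned to how long one batch of messages takes to process in your application.

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTuningExample {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Heartbeats are sent from a background thread; the session timeout bounds
        // how long a silent consumer remains in the group.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "25000");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "5000");
        // Maximum allowed gap between two poll() calls before the client leaves the group.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        // Cap the records returned per poll so that one batch can always be processed
        // well within max.poll.interval.ms.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");
        return props;
    }
}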

Suggestion 3: Commit Consumer Offsets at a Reasonable Frequency

When consuming messages with Kafka, whether using the standard Kafka Consumer SDK or frameworks like Flink Connector, it is advisable to commit consumer offsets. This allows you to monitor consumption lag and mitigate risks.

There are two ways to commit offsets: automatic and manual. The controlling parameters are:

  • enable.auto.commit: Indicates whether to use the automatic offset commit mechanism. The default value is true, meaning automatic commit is used by default.

  • auto.commit.interval.ms: The time interval for automatic offset commits. The default value is 1000, which is 1 second.

The combination of these two parameters results in the client checking the time since the last offset commit before each poll of data. If the time exceeds the interval specified by auto.commit.interval.ms, the client will initiate an offset commit.

Therefore, if enable.auto.commit is set to true, ensure that the data from the previous poll has been fully processed before each new poll; otherwise, offsets may be committed for messages that have not actually been processed yet.

If you want to control offset commits manually, set enable.auto.commit to false and use the commitSync()/commitAsync() methods (optionally passing explicit offsets) to manage offset commits.

In production environments, it is advisable to commit offsets, but not too frequently. Frequent offset commits create request hotspots on the partitions of the internal compacted topic that stores consumer offsets.

In production, it is generally recommended to commit offsets every 5-10 seconds or manually based on your needs, rather than committing offsets after every message consumption.
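
The following is a minimal sketch of manual, periodic offset commits; the bootstrap address, group id, topic name, and the 5-second commit interval are illustrative placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Disable auto commit and take over offset management.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders-events"));
            long lastCommit = System.currentTimeMillis();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process each record before its offset is committed.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Commit at most every 5 seconds rather than after every record.
                if (System.currentTimeMillis() - lastCommit >= 5000) {
                    consumer.commitSync();
                    lastCommit = System.currentTimeMillis();
                }
            }
        }
    }
}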

Suggestion 4: Avoid Consumption Blockage Leading to Backlog in Production Scenarios

Kafka consumers process messages in partition order. If a message causes a consumption blockage due to business logic, it will affect the consumption of subsequent messages in the current partition. Therefore, in production environments, it is essential to ensure that the consumption logic does not cause a permanent blockage.

If unexpected blockage occurs, it is recommended to follow the process below:

  • If the abnormal message can be skipped: Stop the consumer, go to the AutoMQ console, and reset the consumption offset to the next message so that the abnormal message is not processed.

  • If the abnormal message cannot be skipped: Fix the consumption logic so that it can handle the abnormal message.

Suggestion 5: Upgrade the Flink Connector to 1.17.2 or Above When Consuming with Flink

When using the Flink Connector to consume Kafka messages, pay attention to the Flink Connector version; it is recommended to upgrade to 1.17.2 or above. Earlier Flink Connector versions depended on older Kafka SDKs that could not retry when committing the consumption offset failed, so the Consumer Group of the Flink job could not accurately report its backlog metrics in Kafka. For detailed defect records, refer to Kafka Issue-12563.