Resource Allocation Management
Recommendation 1: Set the Topic partition count appropriately to avoid throughput bottlenecks and wastage
The number of partitions in a Kafka Topic determines the production and consumption throughput that Topic can support. To preserve message order, messages with the same partition key are sent to the same partition, and each partition can be processed by only one consumer within a group. AutoMQ is built on object storage and, compared with Apache Kafka's local-file architecture, can support several times the partition performance at the same cluster scale. AutoMQ also does not charge for partitions, so cost is not a factor when allocating them. You only need to size the partition count against the expected throughput (AutoMQ's single-partition write throughput limit is 4 MB/s) so that you do not have to expand partitions later as business volume grows. For example, a Topic expected to ingest 100 MB/s needs at least 100 / 4 = 25 partitions, plus headroom. The partition count does not need to be a multiple of the node count, because AutoMQ balances partitions automatically.

Recommendation 2: For platform-based application scenarios, enable ACL to achieve strict upstream and downstream access control
In platform-based scenarios, such as real-time computing platforms, it is recommended to enable ACL when using Kafka to achieve fine-grained resource control. Once access control is enabled, Kafka clients must authenticate before accessing specific Topics and Groups. The benefits of this approach are as follows:
- It prevents business parties from arbitrarily creating new Topics and other resources, which can lead to resource abuse and governance problems.
- It helps avoid subscription chaos and impacts on load balancing caused by different business parties sharing Consumer Groups.
- Identifying Topics and Consumer Groups makes it easier to locate the related subscribers and upstream and downstream business parties, facilitating business governance.
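Once ACL is enabled, each business party's client must present credentials before touching its Topics and Groups. A minimal sketch of the client-side settings involved, using librdkafka-style (confluent-kafka) property names; the address, username, and password are placeholders, not real values:

```python
# Client configuration a business party would need once ACL is enforced.
# Keys follow librdkafka/confluent-kafka naming; all values are placeholders.
client_config = {
    "bootstrap.servers": "broker.example.com:9092",  # placeholder address
    "security.protocol": "SASL_PLAINTEXT",           # authenticate over SASL
    "sasl.mechanism": "PLAIN",
    "sasl.username": "order-service",                # identity granted ACLs
    "sasl.password": "secret",                       # placeholder credential
}

# With ACLs in place, this identity can only access the Topics and
# Consumer Groups it has been explicitly granted.
```

Java clients express the same credentials through `sasl.jaas.config` instead of separate username/password keys.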
Producer Application
Recommendation 1: For Kafka Producer clients version 2.1 or lower, set the retry count
Producer applications should check the SDK version in use. If the version is lower than 2.1, the retry count must be set manually to ensure messages are retried automatically when sending fails; this avoids failures caused by server maintenance, self-balancing, and similar events. (In client version 2.1 and later, retries defaults to a very large value, so sends are retried automatically.) The relevant parameters are retries and retry.backoff.ms.

Recommendation 2: Optimize the batch parameters of the Producer to avoid excessive QPS consumption by fragmented requests
Kafka is a stream storage system designed for high-throughput scenarios, and its most typical usage improves the efficiency and throughput of data transmission through batching. When sending messages, the Producer should set the batching parameters appropriately to avoid the following situations:
- Sending only one message per request: if the Producer sends only one message per request, it generates a large number of Produce requests, consuming server-side CPU and degrading cluster performance.
- Avoid setting excessively long batching wait times: When setting batch parameters, the Producer needs to set a reasonable wait time to avoid delays in sending messages due to incomplete batching in low-traffic scenarios.
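The retry and batching recommendations above can be sketched as a single set of producer properties. The keys are standard Kafka producer configuration names; the values are illustrative starting points to tune per workload, not recommended defaults:

```python
# Producer settings covering the retry and batching recommendations above.
# Keys are standard Kafka producer configuration; values are illustrative.
producer_config = {
    # Retries: set explicitly on clients older than 2.1, where the default is 0.
    "retries": 3,             # retry failed sends instead of failing immediately
    "retry.backoff.ms": 100,  # wait between retry attempts

    # Batching: group messages into larger requests instead of one per request.
    "batch.size": 131072,     # bytes per batch (128 KB)
    "linger.ms": 10,          # short wait to fill a batch; keep this small so
                              # low-traffic periods do not add visible latency
}

# A batch is sent when it reaches batch.size or when linger.ms elapses,
# whichever comes first.
```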
Recommendation 3: Set acks=all to ensure message durability before responding
Producers can adjust the acks parameter to balance data durability against send latency:
- acks=all (default value): the server responds to the client only after the data has been persisted to cloud storage. In the event of a server crash, successfully acknowledged messages will not be lost.
- acks=1: in alignment with Apache Kafka®, the AutoMQ server responds to the client as soon as the message is received in memory. If the server crashes, messages that have not yet been persisted are lost.
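A minimal sketch of the durability/latency trade-off behind the two settings above; "acks" is the standard Kafka producer configuration key, while the helper function itself is illustrative and not part of any client API:

```python
def choose_acks(require_durability: bool) -> dict:
    """Pick an acks setting for a producer config (illustrative helper)."""
    if require_durability:
        # Broker responds only after the record is persisted to cloud
        # storage; acknowledged records survive a broker crash.
        return {"acks": "all"}
    # Broker responds as soon as the record is in memory; records that
    # were not yet persisted are lost if the broker crashes.
    return {"acks": "1"}
```

For production workloads the durable setting is almost always the right choice; the latency saved by acks=1 rarely justifies losing acknowledged messages.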
Consumer Application
Recommendation 1: For consumers using the Assign mode, upgrade to version 3.2 or above to ensure offset commit success
When a Kafka Consumer application consumes messages in Assign mode, i.e., with self-assigned partitions rather than group-managed load balancing, make sure the SDK is upgraded to version 3.2 or above. Earlier versions have a defect in Assign mode that prevents committed consumer offsets from being updated promptly, so consumers fail to commit offsets in a timely manner. For details, see Kafka issue KAFKA-13563.

Recommendation 2: Control the Consumer heartbeat timeout and messages polled per poll to avoid frequent rebalancing
Kafka Consumers that share the same Group Id are grouped together and load-balanced. If a single consumer's heartbeat times out, it is expelled from the consumer group, and the remaining consumers rebalance to reassign partitions. In production, avoid parameter mistakes that cause unexpected heartbeat timeouts and rebalancing; otherwise the consumer group keeps changing and cannot consume messages reliably. Causes of unexpected rebalancing:
- Client versions before v0.10.2: consumers did not have an independent heartbeat thread; heartbeats were coupled to the poll interface. If message processing was slow, the consumer's heartbeat timed out, triggering a rebalance.
- Client versions v0.10.2 and later: if processing is too slow and the consumer does not poll again within a certain period (set by max.poll.interval.ms, default 5 minutes), the client voluntarily leaves the group, triggering a rebalance.
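The heartbeat and poll controls above map onto a handful of standard Kafka consumer configuration keys. A sketch with illustrative values (not tuned defaults); the usual rule of thumb is to keep the heartbeat interval at a third or less of the session timeout:

```python
# Consumer settings that guard against unexpected rebalances.
# Keys are standard Kafka consumer configuration; values are illustrative.
consumer_config = {
    "session.timeout.ms": 45000,     # heartbeat timeout before eviction
    "heartbeat.interval.ms": 15000,  # keep at <= session.timeout.ms / 3
    "max.poll.interval.ms": 300000,  # max gap between poll() calls (5 min)
    "max.poll.records": 100,         # fewer records per poll, so each batch
                                     # finishes well inside the poll interval
}
```

If per-message processing is slow, lower max.poll.records or raise max.poll.interval.ms so that one poll's worth of work always completes before the interval expires.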
Recommendation 3: In production scenarios, commit the Consumer offset, but avoid committing offsets too frequently
When consuming messages with Kafka, whether through the standard Kafka Consumer SDK or frameworks such as Flink Connector, it is advisable to commit consumer offsets so that consumption lag can be monitored and risks mitigated. Offsets can be committed automatically or manually; the controlling parameters are:
- enable.auto.commit: whether to use the automatic offset commit mechanism. The default value is true, meaning automatic commit is used by default.
- auto.commit.interval.ms: The interval for auto-committing offsets. The default value is 1000, which is 1 second.
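When committing manually, the advice above cuts both ways: commit regularly enough to track lag, but not on every message. A sketch of a throttle that commits after a batch of messages or a time interval, whichever comes first; the commit callback is a stand-in for your client's commit call (e.g. consumer.commit()):

```python
import time

def make_commit_throttle(commit, min_interval_s=5.0, max_uncommitted=500):
    """Return a callback that invokes commit() at most every min_interval_s
    seconds, or sooner once max_uncommitted messages have been processed."""
    state = {"pending": 0, "last": time.monotonic()}

    def record_processed():
        state["pending"] += 1
        now = time.monotonic()
        if (state["pending"] >= max_uncommitted
                or now - state["last"] >= min_interval_s):
            commit()                       # stand-in for consumer.commit()
            state["pending"] = 0
            state["last"] = now

    return record_processed
```

Call record_processed() once per handled message; the throttle batches the actual offset commits so the broker is not hit with one commit request per message.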
Recommendation 4: Avoid consumption blocking and accumulation in production scenarios
Kafka consumers process messages within a partition in order. If a message blocks consumption because of a business-logic issue, it delays all subsequent messages in that partition. Therefore, in production environments, ensure that the consumption logic cannot block permanently. If unexpected blocking does occur, proceed as follows:
- If the abnormal message can be skipped: stop the consumer first, then reset the consumption offset to the next message in the AutoMQ Console to skip it.
- If it cannot be skipped: fix the consumption logic so that it handles the abnormal message.
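The steps above can be sketched as a processing wrapper that keeps one bad message from blocking a partition: retry a few times, then park the message for later repair instead of looping forever. The process function and dead-letter sink are placeholders for your own logic, not any client API:

```python
def consume_with_skip(messages, process, max_retries=3, dead_letter=None):
    """Process messages in partition order; divert poison messages to a
    dead-letter list instead of blocking subsequent messages forever."""
    dead_letter = dead_letter if dead_letter is not None else []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                process(msg)               # placeholder for business logic
                break
            except Exception:
                if attempt == max_retries - 1:
                    # Give up on this message and move on, mirroring an
                    # offset reset to the next message in the console.
                    dead_letter.append(msg)
    return dead_letter
```

Messages collected in the dead-letter list can be replayed once the consumption logic has been fixed.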