Kafka Client Config Tuning
Resource Allocation Management
Recommendation 1: Set the Topic Partition Count Appropriately to Avoid Throughput Bottlenecks and Resource Waste.
The number of partitions in a Kafka Topic affects the production and consumption throughput the Topic can support. To preserve message order, messages with the same partition key are sent to the same partition, and within a consumer group each partition can be consumed by only one consumer.
AutoMQ is built on object storage. Compared with Apache Kafka's local-file architecture, it can support several times as many partitions at the same cluster scale, and it does not charge based on partition count.
When allocating partitions on AutoMQ, cost is therefore not a factor. You only need to estimate the appropriate partition count from the expected per-partition throughput (AutoMQ limits single-partition write throughput to 4 MB/s), so that you do not have to expand partitions later as business volume grows. The partition count does not need to be a multiple of the node count, because AutoMQ balances partitions across nodes automatically.
Example:
An application estimates that the Topic data write throughput will be about 100MB/s within the next year, while the downstream consumers can each handle a maximum of 2MB of data per second.
When allocating partitions, you need to consider:
Based on the producer's perspective, 100MB/s ÷ 4MB/s = 25 partitions are required.
Based on the consumer's perspective, 100MB/s ÷ 2MB/s = 50 consumers are needed, and 50 consumers require at least 50 partitions.
Therefore, at least 50 partitions need to be allocated to meet the application's requirements.
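The same sizing arithmetic can be written out as a quick check. The sketch below only restates the example above; the 100 MB/s, 4 MB/s, and 2 MB/s figures are the assumptions already stated there.
// A quick sketch that reproduces the partition sizing arithmetic of this example.
public class PartitionSizing {
    public static void main(String[] args) {
        int topicWriteMBps = 100;    // estimated Topic write throughput
        int partitionLimitMBps = 4;  // AutoMQ single-partition write throughput limit
        int consumerMBps = 2;        // processing capacity of a single consumer

        // Producer view: ceil(100 / 4) = 25 partitions to absorb the write traffic.
        int fromProducer = (topicWriteMBps + partitionLimitMBps - 1) / partitionLimitMBps;
        // Consumer view: ceil(100 / 2) = 50 consumers, each needing its own partition.
        int fromConsumer = (topicWriteMBps + consumerMBps - 1) / consumerMBps;

        // Allocate the larger of the two: 50 partitions in this example.
        System.out.println("partitions = " + Math.max(fromProducer, fromConsumer));
    }
}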
Recommendation 2: For Platform-Based Application Scenarios, Enable ACLs to Enforce Strict Upstream and Downstream Access Control and Simplify Later Mapping of the Link Topology.
In platform-based scenarios such as real-time computing platforms, it is recommended to enable ACLs when using Kafka to achieve fine-grained resource control. Once ACLs are enabled, Kafka clients must authenticate before they can access specific Topics and Groups. The benefits of this approach are as follows:
It prevents business parties from arbitrarily creating new Topics and other resources, which can lead to resource abuse and governance issues.
It helps avoid subscription chaos and impacts on load balancing caused by different business parties sharing Consumer Groups.
Because every Topic and Consumer Group is accessed by an authenticated identity, it becomes easier to trace subscribers and the upstream and downstream business owners, which facilitates business governance.
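As a rough illustration of what this fine-grained control looks like once ACLs are enabled, the sketch below uses the Kafka AdminClient to grant read access on one Topic and one Consumer Group; the principal, Topic, Group, and bootstrap address are placeholders, not values from this document.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.List;
import java.util.Properties;

public class GrantReadAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        try (Admin admin = Admin.create(props)) {
            // Allow the placeholder principal "User:order-service" to read the Topic "orders" ...
            AclBinding topicRead = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "orders", PatternType.LITERAL),
                    new AccessControlEntry("User:order-service", "*", AclOperation.READ, AclPermissionType.ALLOW));
            // ... and to use the Consumer Group "orders-consumer".
            AclBinding groupRead = new AclBinding(
                    new ResourcePattern(ResourceType.GROUP, "orders-consumer", PatternType.LITERAL),
                    new AccessControlEntry("User:order-service", "*", AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(topicRead, groupRead)).all().get();
        }
    }
}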
Producer Application
Recommendation 1: For Kafka Producer Clients Below Version 2.1, Explicitly Set the Retry Count.
Producer applications should check their SDK version. If it is below 2.1, the retry count must be set manually so that failed sends are retried automatically; this avoids send failures caused by server maintenance, partition self-balancing, and similar transient events.
Retry parameter settings can be referenced as follows:
// Maximum retry count
retries=Integer.MAX_VALUE
// Initial backoff delay time
retry.backoff.ms=100
// Maximum backoff delay time
retry.backoff.max.ms=1000
// RPC timeout for each send request
request.timeout.ms=30000
// The total timeout for the entire send call. If this time is exceeded, no further retries will be made, and an exception will be returned to the caller.
delivery.timeout.ms=120000
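Applied to a Java producer, the settings above look roughly like the following sketch; the bootstrap address, Topic name, and serializers are placeholders, and retry.backoff.max.ms is only recognized by newer clients. With these settings, the send callback only reports a failure after the retries and delivery.timeout.ms have been exhausted.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("retries", Integer.MAX_VALUE);  // maximum retry count
        props.put("retry.backoff.ms", 100);       // initial backoff delay
        props.put("retry.backoff.max.ms", 1000);  // maximum backoff delay (newer clients only)
        props.put("request.timeout.ms", 30000);   // RPC timeout per send request
        props.put("delivery.timeout.ms", 120000); // total timeout for the whole send call

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                // The exception only surfaces after retries / delivery.timeout.ms are exhausted.
                if (exception != null) {
                    exception.printStackTrace();
                }
            });
        }
    }
}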
Recommendation 2: Optimize the Producer's Batching Parameters to Avoid Excessive QPS from Fragmented Requests.
Kafka is a streaming storage system designed for high-throughput scenarios, and batching is the main way it achieves efficient, high-throughput data transfer. When sending messages, the Producer's batching parameters should be set carefully to avoid the following situations:
Avoid sending only one message per request: doing so generates a large number of Produce requests, which consumes server-side CPU and degrades cluster performance.
Avoid setting excessively long batching wait times: the Producer should use a reasonable linger time so that, in low-traffic scenarios, messages are not delayed waiting for a batch that never fills.
Specific batch parameter settings are as follows:
// Batch size: up to 128 KB of data can be accumulated per batch.
batch.size=131072
// Batch linger time: if the batch size limit is not reached within this time, the Producer sends anyway, so this is the maximum extra sending delay.
linger.ms=10
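To get a feel for why fragmented requests are expensive, a back-of-the-envelope comparison helps. The message rate and size below are purely illustrative assumptions, not measurements.
// Rough illustration of how batching reduces the Produce request rate.
public class BatchingEffect {
    public static void main(String[] args) {
        int messagesPerSecond = 10_000; // assumed message rate
        int messageSizeBytes = 1_024;   // assumed message size (1 KB)
        int batchSizeBytes = 131_072;   // batch.size = 128 KB

        // Without batching: one Produce request per message.
        int requestsWithoutBatching = messagesPerSecond;
        // With full 128 KB batches: about 128 messages per request.
        int messagesPerBatch = batchSizeBytes / messageSizeBytes;
        int requestsWithBatching = (messagesPerSecond + messagesPerBatch - 1) / messagesPerBatch;

        System.out.println(requestsWithoutBatching + " requests/s without batching vs. "
                + requestsWithBatching + " requests/s with batching");
    }
}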
Recommendation 3: Set acks=all to Ensure Message Durability Before Responding
Producers can adjust the acks parameter to balance data durability against send latency:
acks=all (default value): The server responds to the client only after the data has been persisted to cloud storage. If the server crashes, successfully acknowledged messages will not be lost.
acks=1: In alignment with Apache Kafka®, the AutoMQ server responds to the client as soon as the message is received in memory. If the server crashes, messages that have not yet been persisted will be lost.
It is recommended that producers keep the default configuration of acks=all to ensure data reliability.
Because AutoMQ achieves message durability through object storage rather than ISR replicas, acks=all only requires synchronously persisting a single copy of the data and adds no replication traffic.
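A minimal sketch of keeping the default acks=all and confirming persistence synchronously is shown below; the bootstrap address and Topic name are placeholders. In high-throughput paths you would normally use the asynchronous callback form instead of blocking on get().
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // respond only after the data has been durably persisted

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the server acknowledges the write, i.e. after persistence with acks=all.
            RecordMetadata metadata = producer.send(new ProducerRecord<>("my-topic", "value")).get();
            System.out.println("persisted at offset " + metadata.offset());
        }
    }
}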
Consumer Application
Recommendation 1: For Consumers Using Assign Mode, Upgrade to Version 3.2 or Above to Ensure Offset Commits Succeed.
When a Kafka Consumer application consumes messages in Assign mode, i.e., the consumer assigns its own partitions instead of relying on group rebalancing, make sure the SDK is upgraded to version 3.2 or above. Earlier versions have a defect in Assign mode where committed consumer offsets are not updated promptly, so consumers can fail to commit and update offsets in time. For the detailed defect record, refer to Kafka Issue-12563.
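For reference, a minimal Assign-mode consumer looks roughly like the sketch below (the Topic, partition, Group, and bootstrap address are placeholders). The defect described above concerns the offset commits made in this pattern on older clients.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AssignModeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("group.id", "my-group"); // still required so committed offsets belong to a Group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign mode: the application picks its own partitions; no group rebalancing is involved.
            consumer.assign(List.of(new TopicPartition("my-topic", 0)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record)
                }
                consumer.commitSync(); // commit the offsets of the records polled so far
            }
        }
    }
}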
Recommendation 2: Control the Consumer Heartbeat Timeout and the Number of Messages per Poll to Avoid Frequent Rebalancing.
Kafka Consumers that share the same Group Id form a consumer group and balance the load among themselves. If a consumer's heartbeat times out, it is expelled from the group, and the remaining consumers rebalance to take over its partitions.
In production, avoid misconfigured parameters that cause unexpected heartbeat timeouts and rebalances; otherwise the consumer group will rebalance continuously and fail to make progress consuming messages.
Causes of unexpected rebalancing:
Client versions before v0.10.2: Consumers did not have a dedicated heartbeat thread; heartbeats were coupled to the poll interface. If the application consumed too slowly, the Consumer's heartbeat timed out and a rebalance was triggered.
Client versions v0.10.2 and later: If consumption is too slow and no new poll happens within a certain period (set by max.poll.interval.ms, default 5 minutes), the client voluntarily leaves the consumer group, triggering a rebalance.
Optimization Recommendations:
Upgrade the client to version 0.10.2 or above.
Adjust the parameter values according to the following guidelines; a hedged configuration sketch follows this list.
Increase the client's consumption speed as much as possible to avoid lag.
Try to avoid having a single Consumer Group subscribe to a large number of Topics. Aim for a dedicated Consumer Group to subscribe to a dedicated Topic.
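A hedged example of such an adjustment is sketched below; all values are illustrative assumptions and should be sized against your actual per-record processing time (the Topic, Group, and bootstrap address are placeholders).
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SteadyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Illustrative values; tune them so that a full poll batch always finishes in time.
        props.put("session.timeout.ms", 30000);    // heartbeat timeout before the consumer is evicted
        props.put("heartbeat.interval.ms", 3000);  // heartbeat frequency, typically ~1/3 of the session timeout
        props.put("max.poll.interval.ms", 300000); // maximum allowed gap between two poll() calls
        props.put("max.poll.records", 200);        // cap the batch size so processing fits within that gap

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Keep per-record processing fast; the whole batch must finish within max.poll.interval.ms.
                }
            }
        }
    }
}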
Recommendation 3: In Production Scenarios, Commit Consumer Offsets, but Avoid Committing Them Too Frequently.
When consuming messages using Kafka, whether you use the standard Kafka Consumer SDK or frameworks like Flink Connector, it is advisable to commit consumer offsets. This allows you to monitor consumption lag and mitigate risks.
There are two ways to commit consumer offsets, automatic and manual, controlled by the following parameters:
enable.auto.commit: Whether to use the automatic offset commit mechanism. The default value is true, indicating that the automatic commit mechanism is used by default.
auto.commit.interval.ms: The interval for auto-committing offsets. The default value is 1000, which is 1 second.
The combination of these two parameters means that before each poll, the client will check the time since the last offset commit. If the elapsed time exceeds the value set by auto.commit.interval.ms, the client will initiate an offset commit action.
Therefore, if enable.auto.commit is set to true, make sure the data returned by the previous poll has been fully processed before polling again. Otherwise, the auto-commit performed before the next poll may commit offsets for messages that have not actually been processed yet, and those messages can be skipped if the consumer later fails.
If you prefer to control offset commits manually, set enable.auto.commit to false and call commitSync(offsets) or commitAsync(offsets) to commit.
In production scenarios, it is advisable to commit offsets, but not too frequently. Frequent commits create request hotspots on the partitions of the internal compacted Topic that stores offset data.
For production scenarios, it is generally recommended to commit offsets every 5-10 seconds, or manually as needed, rather than after every single message.
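A minimal sketch of manual, time-throttled commits in line with the 5-10 second guidance above; the Topic, Group, bootstrap address, and the 5-second interval are assumptions for illustration.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ThrottledCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually instead

        long commitIntervalMs = 5000; // commit roughly every 5 seconds
        long lastCommit = System.currentTimeMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) -- finish processing before its offset gets committed
                }
                if (System.currentTimeMillis() - lastCommit >= commitIntervalMs) {
                    consumer.commitSync(); // commits the offsets of everything polled so far
                    lastCommit = System.currentTimeMillis();
                }
            }
        }
    }
}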
Recommendation 4: Avoid Consumption Blocking and Accumulation in Production Scenarios
Kafka consumers process messages in the order of partitions. If a message causes a consumption block due to business logic, it will affect the consumption of subsequent messages in the current partition. Therefore, in production environments, ensure that the consumption logic does not lead to permanent blocking.
If unexpected blocking occurs, it is recommended to follow the steps below:
If the abnormal message can be skipped: stop the consumer first, then reset the consumption offset to the next message in the AutoMQ console (or in code; see the sketch after this list) to skip it.
If it cannot be skipped: fix the consumption logic so that it can handle the abnormal message.
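If you prefer to skip the message in code rather than through the console, a minimal sketch is shown below; the Topic, partition, Group, bootstrap address, and the offset of the problematic message are assumptions you must determine yourself, and the normal consumers should be stopped first, as described above.
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SkipPoisonMessage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<your-bootstrap-servers>");
        props.put("group.id", "my-group"); // the Consumer Group whose offset will be reset
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition partition = new TopicPartition("my-topic", 0); // partition holding the bad message
        long badOffset = 12345L;                                      // offset of the message to skip

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(partition));
            // Commit the offset of the *next* message so the group resumes just past the bad one.
            consumer.commitSync(Map.of(partition, new OffsetAndMetadata(badOffset + 1)));
        }
    }
}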
Flink Task
Recommendation 1: Upgrade the Flink Connector to Version 1.17.2 or Above to Ensure Offset Commits Succeed.
When using the Flink Connector to consume Kafka messages, pay attention to the connector version and upgrade to 1.17.2 or above. Earlier versions of the Flink Connector depend on older Kafka SDKs that do not retry failed consumer offset commits, which makes it impossible to accurately observe the Consumer Group's backlog metrics in Kafka. For the detailed defect record, refer to Kafka Issue-12563.