Metrics
In this article, the term AutoMQ Kafka specifically refers to the open-source automq-for-kafka project by AutoMQ HK Limited, available through the AutoMQ organization on GitHub.
This article introduces the monitoring metrics provided by AutoMQ for Kafka. Metrics are shown in Prometheus format.
General Metrics
Kafka_server_connection_count
The current number of connections established by the node.
- Type: Gauge
Kafka_network_threads_idle_rate
The idle rate of Kafka SocketServer network threads, in the range [0, 1.0].
- Type: Gauge
Kafka_io_threads_idle_time_nanoseconds_total
The cumulative idle time of Kafka request handler threads, in nanoseconds. This metric is the cumulative form of the native Apache Kafka metric RequestHandlerAvgIdlePercent. Taking the rate of change of this value over time (idle nanoseconds per nanosecond of wall-clock time) gives the thread idle rate. Note that on a combined node (acting as both Controller and Broker), the Controller and the Broker each have their own request handlers, so this metric is the sum of both and the derived idle rate can be as high as 2.0.
- Type: Counter
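The derivation described above can be sketched as follows. This is a minimal illustration, not part of AutoMQ: the function name, the sample values, and the counter-reset handling are assumptions made for the example.

```python
NANOS_PER_SECOND = 1_000_000_000


def idle_rate(prev_idle_ns, curr_idle_ns, prev_ts_s, curr_ts_s):
    """Return the thread idle rate between two samples of the idle-time counter.

    prev_idle_ns / curr_idle_ns: counter values (nanoseconds) at each sample.
    prev_ts_s / curr_ts_s: sample timestamps in seconds.
    """
    elapsed_s = curr_ts_s - prev_ts_s
    if elapsed_s <= 0:
        raise ValueError("samples must be in increasing time order")
    idle_ns = curr_idle_ns - prev_idle_ns
    if idle_ns < 0:
        # Assumed handling of a counter reset (for example a node restart):
        # treat the current value as the increase since the reset.
        idle_ns = curr_idle_ns
    return idle_ns / (elapsed_s * NANOS_PER_SECOND)


# Two made-up samples taken 60 seconds apart.
print(idle_rate(120_000_000_000, 165_000_000_000, 1000.0, 1060.0))  # 0.75
```

On a combined node the same calculation can return values up to 2.0, because the counter sums the Controller and Broker request handlers.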
Controller Metrics
Kafka_controller_active_count
Indicates whether the current Controller node is the active Controller. A metric value of 1 means it is active, while 0 means it is not active.
- Type: Gauge
Kafka_broker_active_count
The number of active Brokers in the current cluster.
- Type: Gauge
Kafka_broker_fenced_count
The number of fenced Brokers in the current cluster.
- Type: Gauge
Kafka_topic_count
The total number of Topics in the current cluster.
- Type: Gauge
Kafka_partition_total_count
The total number of partitions in the current cluster.
- Type: Gauge
Kafka_partition_offline_count
The total number of offline partitions in the current cluster.
- Type: Gauge
Kafka_stream_auto_balancer_metrics_time_delay_milliseconds
The delay in reporting AutoBalancer monitoring metrics by each Broker node in the cluster. If the delay exceeds a certain threshold, the Broker node will be considered out-of-sync by AutoBalancer and will no longer participate in AutoBalancer's partition reassignment.
- Type: Gauge
- Labels:
  - node_id: The ID of the node reporting AutoBalancer monitoring metrics.
Kafka_stream_s3_object_count
The total number of objects uploaded to object storage by the current cluster, categorized by object state.
- Type: Gauge
- Labels:
  - state: The object state, one of three values:
    - prepared: Objects that have not yet been fully written and committed.
    - committed: Objects that have been fully written and committed.
    - mark_destroyed: Objects marked for deletion, which will be removed from object storage after a certain delay.
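As an illustration of how the state label can be consumed, the sketch below parses Prometheus text-format samples and groups the object count by state. The parser is deliberately simplified (it does not handle escaped quotes or timestamps), and the sample values are made up for the example; the metric name is written as it appears in this article.

```python
import re

# Simplified patterns for lines such as: name{label="value",...} 123
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from Prometheus text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def object_count_by_state(text, metric="Kafka_stream_s3_object_count"):
    """Sum the object-count gauge per value of the state label."""
    counts = {}
    for name, labels, value in parse_samples(text):
        if name == metric and "state" in labels:
            state = labels["state"]
            counts[state] = counts.get(state, 0.0) + value
    return counts


# Made-up sample output for illustration only.
EXAMPLE = """
# TYPE Kafka_stream_s3_object_count gauge
Kafka_stream_s3_object_count{state="prepared"} 3
Kafka_stream_s3_object_count{state="committed"} 1520
Kafka_stream_s3_object_count{state="mark_destroyed"} 42
"""
print(object_count_by_state(EXAMPLE))
```

A steadily growing mark_destroyed count in such a breakdown would suggest that deletions are lagging behind, although the acceptable level depends on the deletion delay configured for the cluster.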
Kafka_stream_s3_object_size_bytes
The total size of objects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_object_num
The number of StreamObjects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_set_object_num
The number of StreamSetObjects uploaded to object storage by each Broker in the current cluster.
- Type: Gauge
- Labels:
  - node_id: The ID of the corresponding Broker node.
Broker Metrics
Kafka_message_count_total
The total number of messages received by the Broker node. The rate of increase of this counter over time gives the message throughput.
- Type: Counter
- Labels:
  - topic
Kafka_network_io_bytes_total
The total size of messages received and sent by the Broker node. The rate of increase of this counter over time gives the byte throughput.
- Type: Counter
- Labels:
  - topic
  - partition
  - direction:
    - in: Messages received.
    - out: Messages sent.
Kafka_topic_request_count_total
The total number of requests received by each Topic on the Broker node, including only produce and fetch types of requests.
- Type: Counter
- Labels:
  - topic
  - type: Request type
    - produce
    - fetch
Kafka_topic_request_failed_total
The total number of failed requests for each Topic on the Broker node, including only produce and fetch types of requests.
- Type: Counter
- Labels:
  - topic
  - type: Request type
    - produce
    - fetch
Kafka_request_count_total
Total number of requests received by the Broker node.
- Type: Counter
- Labels:
  - type: Request type
  - version: API version of the request type
Kafka_request_error_count_total
Total number of requests handled by the Broker node, broken down by error code. Despite the metric name, successful requests are also counted; they are reported with an error code of NONE.
- Type: Counter
- Labels:
  - type: Request type
  - error: Error code, where NONE indicates the request was successful
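Because successful requests are included under the NONE error code, a failure ratio has to exclude that label value explicitly. A minimal sketch, assuming the per-error counts passed in are the counter increases over the window of interest (the function name and the sample figures are hypothetical):

```python
def failure_ratio(counts_by_error):
    """Return the share of requests whose error code is not NONE.

    counts_by_error maps the `error` label value to the number of requests
    observed with that code during the window being examined.
    """
    total = sum(counts_by_error.values())
    if total == 0:
        return 0.0  # no requests in the window
    failed = sum(
        count for code, count in counts_by_error.items() if code != "NONE"
    )
    return failed / total


# Made-up counts: 990 successful requests and 10 failures.
print(failure_ratio({"NONE": 990, "NOT_LEADER_OR_FOLLOWER": 10}))  # 0.01
```

Summing every series of this metric without filtering on the error label gives the total request count, not the failure count.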
Kafka_request_size_bytes_total
Total size of requests received by the Broker node.
- Type: Counter
- Labels:
  - type: Request type
Kafka_request_size_50p(99p/mean/max)_bytes
The size of the requests received by the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.
- Type: Gauge
- Labels:
  - type: Request type
Kafka_request_time_milliseconds_total
The total time taken by the Broker node to process the requests.
- Type: Counter
- Labels:
  - type: Request type
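One possible use of this counter is to estimate the mean processing time per request over a window by dividing its increase by the increase of Kafka_request_count_total. The sketch below assumes both counters cover the same set of requests for a given request type, which this article does not state explicitly; the function name and sample values are hypothetical.

```python
def mean_request_time_ms(prev, curr):
    """Estimate the mean request time (ms) between two samples.

    prev and curr are (request_count_total, request_time_ms_total) pairs
    for the same request type, sampled at two points in time.
    Returns None when no requests were observed in the window.
    """
    delta_count = curr[0] - prev[0]
    delta_time_ms = curr[1] - prev[1]
    if delta_count <= 0:
        return None  # no new requests, or a counter reset
    return delta_time_ms / delta_count


# Made-up samples: 500 requests took 2500 ms in total during the window.
print(mean_request_time_ms((1000, 5000.0), (1500, 7500.0)))  # 5.0
```

Unlike the percentile gauges, a mean computed this way applies to exactly the window between the two samples.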
Kafka_request_time_50p(99p/mean/max)_milliseconds
The time taken by the Broker node to process requests, reported as the 50th percentile, 99th percentile, mean, and maximum.
- Type: Gauge
- Labels:
  - type: Request type
Kafka_request_queue_time_milliseconds_total
The total request queue time on the Broker node, which increases when Kafka IO threads are busy.
- Type: Counter
- Labels:
  - type: Request type
Kafka_request_queue_time_50p(99p/mean/max)_milliseconds
The request queue time on the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.
- Type: Gauge
- Labels:
  - type: Request type
Kafka_response_queue_time_milliseconds_total
The total response queue time on the Broker node, which increases when Kafka network threads are busy.
- Type: Counter
- Labels:
  - type: Request type
Kafka_response_queue_time_50p(99p/mean/max)_milliseconds
The response queue time on the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.
- Type: Gauge
- Labels:
  - type: Request type
Kafka_request_queue_size
The size of the request queue on the Broker node.
- Type: Gauge
Kafka_response_queue_size
The size of the response queue on the Broker node.
- Type: Gauge
Kafka_purgatory_size
The number of requests in the produce or fetch purgatory on the Broker node.
- Type: Gauge
- Labels:
  - type:
    - Produce
    - Fetch
Kafka_partition_count
The number of partitions currently assigned to the Broker node.
- Type: Gauge
Kafka_logs_flush_time_50p(99p/mean/max)_milliseconds
The log flush time on the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum. In AutoMQ for Kafka, this is the flush time of the Delta WAL.
- Type: Gauge
Kafka_log_end_offset
The maximum logical offset of each partition on the Broker node.
- Type: Gauge
- Labels:
  - topic
  - partition
Kafka_log_size
The message size of each partition on the Broker node.
- Type: Gauge
- Labels:
  - topic
  - partition
Kafka_group_commit_offset
The consumption offsets of each Consumer Group on the corresponding partitions. Note that this metric is reported by the Broker where the Group Coordinator of each Consumer Group resides.
- Type: Gauge
- Labels:
  - consumer_group
  - topic
  - partition
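A common use of this metric is to estimate consumer lag by subtracting the committed offset from Kafka_log_end_offset for the same topic and partition. The sketch below is an assumption about how the two metrics can be combined, not a documented AutoMQ feature; the function name and sample values are hypothetical.

```python
def consumer_lag(log_end_offsets, commit_offsets):
    """Estimate lag per (consumer_group, topic, partition).

    log_end_offsets: {(topic, partition): log end offset}
    commit_offsets:  {(consumer_group, topic, partition): committed offset}
    Partitions without a known log end offset are skipped.
    """
    lag = {}
    for (group, topic, partition), committed in commit_offsets.items():
        end_offset = log_end_offsets.get((topic, partition))
        if end_offset is None:
            continue
        # The two values are sampled independently, so clamp at zero.
        lag[(group, topic, partition)] = max(end_offset - committed, 0)
    return lag


# Made-up sample values.
ends = {("orders", 0): 120, ("orders", 1): 80}
commits = {("group-a", "orders", 0): 100, ("group-a", "orders", 1): 80}
print(consumer_lag(ends, commits))
```

Since the commit offset is reported by the Broker hosting the Group Coordinator and the log end offset by the Broker hosting the partition, the two series usually come from different nodes and must be joined on topic and partition rather than on node.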
Kafka_group_count
The number of Consumer Groups managed by the Group Coordinator on the Broker node.
- Type: Gauge
Kafka_group_preparing_rebalance_count
Number of Consumer Groups currently preparing for rebalance.
- Type: Gauge
Kafka_group_completing_rebalance_count
Number of Consumer Groups waiting for state assignment from the group leader.
- Type: Gauge
Kafka_group_stable_count
Number of Consumer Groups in Stable state.
- Type: Gauge
Kafka_group_empty_count
Number of Consumer Groups with no members but not yet expired.
- Type: Gauge
Kafka_group_dead_count
Number of Consumer Groups that have no members and whose metadata is being removed.
- Type: Gauge
Kafka_stream_upload_size_bytes_total
Total data size uploaded to object storage by broker nodes.
- Type: Counter
Kafka_stream_download_size_bytes_total
Total data size downloaded from object storage by broker nodes.
- Type: Counter
Kafka_stream_network_inbound_usage_bytes_total
The total inbound bandwidth usage of the Broker node, including received messages and data downloaded from object storage. The rate of increase of this counter over time gives the inbound throughput.
- Type: Counter
- Labels:
  - type:
    - bypass: Inbound bandwidth usage that is not throttled, equivalent to the message ingress traffic of the Broker node.
    - catchup: Inbound traffic generated by cold reads, i.e., the ingress traffic from reading data from S3 due to cache misses or prefetch strategies.
    - compaction: Inbound traffic generated by Stream Set Object Compaction, i.e., the ingress traffic from reading data from S3 during compaction.
Kafka_stream_network_outbound_usage_bytes_total
The total outbound bandwidth usage of the Broker node, including messages delivered to consumers and data uploaded to object storage. The rate of increase of this counter over time gives the outbound throughput.
- Type: Counter
- Labels:
  - type:
    - bypass: Outbound bandwidth usage that is not throttled, such as the message egress traffic of the Broker node when consuming hot data, or the outbound traffic when the Broker uploads Delta WAL to S3.
    - catchup: Outbound traffic generated by cold reads, equivalent to the egress traffic of the Broker node when consuming cold data.
    - compaction: Outbound traffic generated by Stream Set Object Compaction, i.e., the outbound traffic when uploading data to S3 during compaction.
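To see how bandwidth is split between the three traffic types, the counter can be converted into a per-type throughput over a sampling window. This is a minimal sketch that applies equally to the inbound and outbound counters; the function name, the reset handling, and the sample values are assumptions made for the example.

```python
def throughput_by_type(prev, curr, elapsed_s):
    """Return bytes per second for each traffic type between two samples.

    prev and curr map the `type` label (bypass, catchup, compaction) to the
    counter value at the start and end of the window.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    rates = {}
    for traffic_type, value in curr.items():
        # A type missing from prev is treated as starting from zero.
        delta = value - prev.get(traffic_type, 0)
        if delta < 0:
            # Assumed handling of a counter reset: count from zero again.
            delta = value
        rates[traffic_type] = delta / elapsed_s
    return rates


# Made-up samples taken 30 seconds apart.
before = {"bypass": 1000, "catchup": 500, "compaction": 0}
after = {"bypass": 7000, "catchup": 500, "compaction": 3000}
print(throughput_by_type(before, after, 30))
```

Comparing the catchup and compaction rates with the bypass rate shows how much of the node's bandwidth is taken by throttled traffic rather than regular message traffic.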
Kafka_stream_network_inbound_available_bandwidth_bytes
The inbound traffic throughput reserved by the Broker node for cold reads and compaction. When this value is less than the inbound traffic demand for cold reads and compaction, the corresponding requests will be placed in the throttling queue, and the normal message send/receive traffic will not be affected by this throttling. Note that this metric represents only an instantaneous value at the time of sampling and should be used for reference only due to the limitations of sampling intervals and the specific implementation of the throttling strategy.
- Type: Gauge
Kafka_stream_network_outbound_available_bandwidth_bytes
The outbound traffic throughput reserved for cold reads and compaction on the Broker nodes. When this value is less than the outbound traffic demand for cold reads and compaction, the corresponding requests will be placed in a throttling queue. Normal message traffic is not affected by this throttling. Note that this metric only represents an instantaneous value at the time of sampling and, due to sampling intervals and throttling strategy implementations, this metric is for reference only.
- Type: Gauge
Kafka_stream_network_inbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
The queuing time in the throttling queue when inbound traffic requests for cold reads and compaction are being processed.
- Type: Gauge
Kafka_stream_network_outbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
The queuing time in the throttling queue when outbound traffic requests for cold reads and compaction are being processed.
- Type: Gauge
Kafka_stream_operation_latency_50p(99p/max/sum)_nanoseconds
The operation latency at each stage of the AutoMQ for Kafka S3Stream module.
- Type: Gauge
- Labels:
  - operation_type
  - operation_name