Metrics
In this article, all mentions of AutoMQ Kafka terms specifically refer to the open-source automq-for-kafka project by AutoMQ HK Limited, available on GitHub under the AutoMQ organization.
This article describes the monitoring metrics that AutoMQ for Kafka exposes in Prometheus format.
General Metrics
Kafka_server_connection_count
The current number of connections established by the node.
- Type: Gauge
Kafka_network_threads_idle_rate
The idle rate of the Kafka SocketServer network threads, in the range [0, 1.0].
- Type: Gauge
Kafka_io_threads_idle_time_nanoseconds_total
The cumulative idle time of the Kafka request handler (IO) threads, measured in nanoseconds. This metric is the cumulative counterpart of the native Apache Kafka metric RequestHandlerAvgIdlePercent. The idle rate can be derived by taking the rate of change of this value over time (idle nanoseconds per elapsed nanosecond). Note that when a node serves as a combined node (both Controller and Broker), each role has its own request handler. In this case, the metric is the aggregate value for both roles, so the derived idle rate has a maximum of 2.0.
- Type: Counter
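The derivation described above can be sketched as follows. This is a minimal illustration, assuming two samples of the counter taken `interval_s` seconds apart; the function name `idle_rate` and the way a counter reset is handled are assumptions made for this example, not part of AutoMQ.

```python
def idle_rate(prev_ns: float, curr_ns: float, interval_s: float) -> float:
    """Idle rate over one interval, from two cumulative idle-time samples (ns)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_ns = curr_ns - prev_ns
    if delta_ns < 0:
        # Assumption: a decrease means the counter was reset (e.g. node restart),
        # so the current value is taken as the increase since the reset.
        delta_ns = curr_ns
    # Idle nanoseconds divided by elapsed nanoseconds
    return delta_ns / (interval_s * 1e9)


# Example: 12 s of idle time accumulated during a 15 s interval
print(idle_rate(3.0e9, 15.0e9, 15.0))
```

On a combined node the result of this calculation can reach 2.0, because the counter aggregates the Controller and Broker request handlers.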
Controller Metrics
Kafka_controller_active_count
Indicates whether the current Controller node is an active Controller. A metric value of 1 signifies active, while 0 signifies inactive.
- Type: Gauge
Kafka_broker_active_count
The number of active Brokers in the current cluster.
- Type: Gauge
Kafka_broker_fenced_count
The number of fenced Brokers in the current cluster.
- Type: Gauge
Kafka_topic_count
The total number of Topics in the current cluster.
- Type: Gauge
Kafka_partition_total_count
The total number of partitions in the current cluster.
- Type: Gauge
Kafka_partition_offline_count
The total number of partitions without leaders in the current cluster.
- Type: Gauge
Kafka_stream_auto_balancer_metrics_time_delay_milliseconds
The latency in reporting AutoBalancer monitoring metrics for each Broker node in the cluster. If the latency exceeds a certain threshold, the Broker node is considered an out-of-sync node by AutoBalancer and will no longer participate in AutoBalancer's partition reassignment.
- Type: Gauge
- Labels:
  - node_id: The ID of the node reporting AutoBalancer monitoring metrics
Kafka_stream_s3_object_count
The total number of Objects uploaded to object storage in the current cluster, categorized by Object status.
- Type: Gauge
- Labels:
  - state: The status of an Object, one of three categories:
    - prepared: Objects that have not yet been fully written and are not committed.
    - committed: Objects that have been fully written and committed.
    - mark_destroyed: Objects marked for deletion, which will be removed from object storage after a certain delay.
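As an illustration of how the `state` label can be consumed, the sketch below groups object counts by state from Prometheus text-format output. It is a simplified parser written for this example only: it assumes label values contain no `}` or escaped quotes, and the default metric name is copied from this article, so adjust it to match the name your endpoint actually exposes. The sample text and its values are made up.

```python
import re

# Simplified patterns for one Prometheus text-format sample line (assumption:
# no '}' or escaped quotes inside label values).
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def object_count_by_state(text, metric="Kafka_stream_s3_object_count"):
    """Sum the object-count metric per value of the 'state' label."""
    counts = {}
    for name, labels, value in parse_samples(text):
        if name == metric and "state" in labels:
            counts[labels["state"]] = counts.get(labels["state"], 0.0) + value
    return counts


# Hypothetical scrape output, for illustration only
example = """
Kafka_stream_s3_object_count{state="prepared"} 3
Kafka_stream_s3_object_count{state="committed"} 120
Kafka_stream_s3_object_count{state="mark_destroyed"} 7
"""
print(object_count_by_state(example))
```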
Kafka_stream_s3_object_size_bytes
The total size of Objects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_object_num
The number of StreamObjects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_set_object_num
The number of StreamSetObjects uploaded to object storage by each Broker in the current cluster.
- Type: Gauge
- Labels:
  - node_id: The corresponding Broker node ID.
Broker Metrics
Kafka_message_count_total
The total number of messages received by the Broker node. The message throughput can be calculated by taking the derivative over time.
- Type: Counter
- Labels:
  - topic
Kafka_network_io_bytes_total
The total size of messages received and sent by the Broker node. The message size throughput can be calculated by taking the derivative over time.
- Type: Counter
- Labels:
  - topic
  - partition
  - direction:
    - in: indicates receiving messages
    - out: indicates sending messages
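A throughput figure per direction can be obtained by differencing two snapshots of this counter and aggregating over the `topic` and `partition` labels. The sketch below assumes each snapshot is a dictionary keyed by `(topic, partition, direction)`; that layout, the function name, and the choice to skip new or reset series are assumptions made for illustration.

```python
from collections import defaultdict


def throughput_by_direction(prev, curr, interval_s):
    """Bytes per second for each direction, from two counter snapshots.

    prev, curr: {(topic, partition, direction): cumulative bytes}
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    totals = defaultdict(float)
    for key, curr_value in curr.items():
        if key not in prev:
            continue  # series first seen in this snapshot: no increase to compute yet
        increase = curr_value - prev[key]
        if increase < 0:
            continue  # counter reset: skip this interval for the series
        direction = key[2]
        totals[direction] += increase / interval_s
    return dict(totals)


# Hypothetical snapshots taken 10 seconds apart
before = {("orders", "0", "in"): 1000.0, ("orders", "0", "out"): 500.0}
after = {("orders", "0", "in"): 4000.0, ("orders", "0", "out"): 2000.0}
print(throughput_by_direction(before, after, 10.0))
```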
Kafka_topic_request_count_total
The total number of produce and fetch requests received by each Topic on the Broker node.
- Type: Counter
- Labels:
  - topic
  - type: The request type
    - produce
    - fetch
Kafka_topic_request_failed_total
The total number of failed produce and fetch requests for each Topic on the Broker node.
- Type: Counter
- Labels:
  - topic
  - type: The request type
    - produce
    - fetch
Kafka_request_count_total
Total number of requests received by Broker nodes.
- Type: Counter
- Labels:
  - type: The request type
  - version: The API version of the request type
Kafka_request_error_count_total
Total number of requests at the Broker node, broken down by error code. Despite its name, this metric also counts successful requests, which are reported with the error code NONE.
- Type: Counter
- Labels:
  - type: The request type
  - error: The error code, where NONE indicates that the request succeeded
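Because successful requests are counted under the error code NONE, a failure ratio has to exclude that label value explicitly. The sketch below assumes the per-error increases over some interval have already been computed and are passed in as a dictionary; the function name is hypothetical and the error code used in the example is only a sample value.

```python
def error_ratio(increase_by_error):
    """Fraction of requests that failed, given {error_code: request increase}."""
    total = sum(increase_by_error.values())
    if total == 0:
        return 0.0  # no requests in the interval
    # Everything except the NONE code counts as a failure
    failed = sum(v for code, v in increase_by_error.items() if code != "NONE")
    return failed / total


# Hypothetical increases for one request type over one interval
print(error_ratio({"NONE": 990, "NOT_LEADER_OR_FOLLOWER": 10}))
```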
Kafka_request_size_bytes_total
Total size of requests received by Broker nodes.
- Type: Counter
- Labels:
  - type: The request type
Kafka_request_size_50p(99p/mean/max)_bytes
The size of the requests received by the Broker node, reported as the 50th and 99th percentiles, mean, and maximum.
- Type: Gauge
- Labels:
  - type: The request type
Kafka_request_time_milliseconds_total
The total time taken by the Broker node to process requests.
- Type: Counter
- Labels:
  - type: The request type
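One possible use of this counter is to estimate the mean processing time per request over an interval by dividing its increase by the increase of Kafka_request_count_total. The sketch below rests on the assumption that both counters cover the same set of requests, which this article does not state; the function name and snapshot layout are also assumptions for illustration. Request counts are summed across the `version` label before dividing.

```python
def mean_request_time_ms(time_prev, time_curr, count_prev, count_curr):
    """Mean processing time (ms) per request type over one interval.

    time_*  : {request_type: cumulative milliseconds}
    count_* : {(request_type, version): cumulative request count}
    """
    # Sum request-count increases across API versions for each request type
    count_increase = {}
    for (req_type, version), curr_value in count_curr.items():
        prev_value = count_prev.get((req_type, version))
        if prev_value is None or curr_value < prev_value:
            continue  # new series or counter reset: skip for this interval
        count_increase[req_type] = (
            count_increase.get(req_type, 0.0) + (curr_value - prev_value)
        )

    means = {}
    for req_type, curr_value in time_curr.items():
        prev_value = time_prev.get(req_type)
        requests = count_increase.get(req_type, 0.0)
        if prev_value is None or curr_value < prev_value or requests == 0:
            continue  # nothing reliable to divide by
        means[req_type] = (curr_value - prev_value) / requests
    return means
```

For an interval in which a request type accumulated 600 ms of processing time across 100 requests, the function returns 6.0 for that type.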
Kafka_request_time_50p(99p/mean/max)_milliseconds
The time taken by the Broker node to process requests, reported as the 50th and 99th percentiles, mean, and maximum.
- Type: Gauge
- Labels:
  - type: The request type
Kafka_request_queue_time_milliseconds_total
The total queuing time of requests at the Broker node. When the Kafka IO thread is busy, the request queuing time increases.
- Type: Counter
- Labels:
  - type: The request type
Kafka_request_queue_time_50p(99p/mean/max)_milliseconds
The queuing time of requests at the Broker node, reported as the 50th and 99th percentiles, mean, and maximum.
- Type: Gauge
- Labels:
  - type: The request type
Kafka_response_queue_time_milliseconds_total
The total time that responses spend in the response queue at the Broker node. When the Kafka network threads are busy, the response queuing time increases.
- Type: Counter
- Labels:
  - type: The request type
Kafka_response_queue_time_50p(99p/mean/max)_milliseconds
The response queuing time at the Broker node, reported as the 50th and 99th percentiles, mean, and maximum.
- Type: Gauge
- Labels:
  - type: The request type
Kafka_request_queue_size
The size of the request queue on Broker nodes.
- Type: Gauge
Kafka_response_queue_size
The size of the response queue for the Broker node.
- Type: Gauge
Kafka_purgatory_size
The number of requests waiting in the Produce or Fetch purgatory on the Broker node.
- Type: Gauge
- Labels:
  - type:
    - Produce
    - Fetch
Kafka_partition_count
The number of partitions currently assigned to the Broker node.
- Type: Gauge
Kafka_logs_flush_time_50p(99p/mean/max)_milliseconds
The log flush time of the Broker node, reported as the 50th and 99th percentiles, mean, and maximum. In AutoMQ for Kafka, this is the flush time of the Delta WAL.
- Type: Gauge
Kafka_log_end_offset
The maximum logical offset of each partition on the Broker node.
- Type: Gauge
- Labels:
  - topic
  - partition
Kafka_log_size
The size of the messages for each partition on the Broker node.
- Type: Gauge
- Labels:
  - topic
  - partition
Kafka_group_commit_offset
The consumption offsets of each Consumer Group in the corresponding partitions. Note that this metric is reported by the Broker where the Group Coordinator for each Consumer Group is located.
- Type: Gauge
- Labels:
  - consumer_group
  - topic
  - partition
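Combining this metric with Kafka_log_end_offset gives an approximate consumer lag per group and partition. The sketch below assumes the values of both metrics have already been collected from all Brokers, since the log end offset and the committed offset of a partition may be reported by different nodes. The function name and dictionary layout are assumptions for illustration, and the result is an estimate because the two metrics are sampled at different moments.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Approximate lag per (consumer_group, topic, partition).

    log_end_offsets   : {(topic, partition): offset}
    committed_offsets : {(consumer_group, topic, partition): offset}
    """
    lag = {}
    for (group, topic, partition), committed in committed_offsets.items():
        end = log_end_offsets.get((topic, partition))
        if end is None:
            continue  # no log end offset collected for this partition
        # Clamp at zero: samples are not taken atomically, so the committed
        # offset may briefly appear ahead of the sampled log end offset.
        lag[(group, topic, partition)] = max(end - committed, 0)
    return lag


# Hypothetical values, for illustration only
ends = {("orders", "0"): 1500, ("orders", "1"): 900}
commits = {("billing", "orders", "0"): 1200, ("billing", "orders", "1"): 900}
print(consumer_lag(ends, commits))
```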
Kafka_group_count
The number of Consumer Groups managed by the Group Coordinator located on the Broker node.
- Type: Gauge
Kafka_group_preparing_rebalance_count
Number of Consumer Groups preparing for rebalance.
- Type: Gauge
Kafka_group_completing_rebalance_count
Number of Consumer Groups waiting for the group leader to send the state assignment.
- Type: Gauge
Kafka_group_stable_count
Number of Consumer Groups in a stable state.
- Type: Gauge
Kafka_group_empty_count
Number of Consumer Groups with no members but not expired.
- Type: Gauge
Kafka_group_dead_count
Number of Consumer Groups with no members and metadata removed.
- Type: Gauge
Kafka_stream_upload_size_bytes_total
Total data size uploaded by Broker nodes to object storage.
- Type: Counter
Kafka_stream_download_size_bytes_total
Total data size downloaded by Broker nodes from object storage.
- Type: Counter
Kafka_stream_network_inbound_usage_bytes_total
The total inbound bandwidth usage of the Broker node, including received messages and data downloaded from object storage. Taking the derivative over time gives the inbound throughput.
- Type: Counter
- Labels:
  - type:
    - bypass: Inbound bandwidth usage that is not rate-limited, equivalent to the message inflow at the Broker node.
    - catchup: The inflow generated by cold reads, i.e., the inflow from reading data from S3 due to cache misses or prefetch strategies.
    - compaction: The inflow generated by Stream Set Object Compaction, i.e., the inflow from reading data from S3 during compaction.
Kafka_stream_network_outbound_usage_bytes_total
The total outbound bandwidth usage of the Broker node, including consumed messages and data uploaded to object storage. Taking the derivative over time gives the outbound throughput.
- Type: Counter
- Labels:
  - type:
    - bypass: Outbound bandwidth usage that is not rate-limited, such as the message outflow at the Broker node when consuming hot data, or the outflow when the Broker uploads Delta WAL to S3.
    - catchup: The outflow generated by cold reads, equivalent to the outflow at the Broker node when consuming cold data.
    - compaction: The outflow generated by Stream Set Object Compaction, i.e., the outflow when uploading data to S3 during compaction.
Kafka_stream_network_inbound_available_bandwidth_bytes
The inbound throughput reserved for cold reads and compaction at the Broker node. When this value is lower than the inbound throughput required by cold reads and compaction, the corresponding requests are placed in a throttling queue. Normal message send and receive traffic is not affected by this throttling. Note that this metric is an instantaneous value taken at sampling time and depends on the sampling interval and on the specific implementation of the throttling strategy, so it is for reference only.
- Type: Gauge
Kafka_stream_network_outbound_available_bandwidth_bytes
The outbound throughput reserved for cold reads and compaction at the Broker node. When this value is lower than the outbound throughput required by cold reads and compaction, the corresponding requests are placed in a throttling queue. Normal message send and receive traffic is not affected by this throttling. Note that this metric is an instantaneous value taken at sampling time and depends on the sampling interval and on the specific implementation of the throttling strategy, so it is for reference only.
- Type: Gauge
Kafka_stream_network_inbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
The time that inbound traffic requests for cold reads and compaction spend in the throttling queue, reported as the 50th and 99th percentiles, maximum, and sum.
- Type: Gauge
Kafka_stream_network_outbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
The time that outbound traffic requests for cold reads and compaction spend in the throttling queue, reported as the 50th and 99th percentiles, maximum, and sum.
- Type: Gauge
Kafka_stream_operation_latency_50p(99p/max/sum)_nanoseconds
The operation time of each stage in the AutoMQ for Kafka S3Stream module.
- Type: Gauge
- Labels:
  - operation_type
  - operation_name