Metrics

In this article, all mentions of AutoMQ Kafka terms specifically refer to the open-source automq-for-kafka project by AutoMQ HK Limited, available on GitHub under the AutoMQ organization.

This article describes the monitoring metrics that AutoMQ for Kafka exposes in Prometheus format.

General Metrics

Kafka_server_connection_count

The current number of connections established by the node.

  • Type: Gauge

Kafka_network_threads_idle_rate

The idle rate of the Kafka SocketServer network threads, in the range [0, 1.0].

  • Type: Gauge

Kafka_io_threads_idle_time_nanoseconds_total

The cumulative idle time of the Kafka request handler threads, in nanoseconds. This metric is the cumulative form of the native Apache Kafka metric RequestHandlerAvgIdlePercent. The idle rate is the increase in this counter over an interval divided by the length of that interval, both in nanoseconds. Note that when the node is a combined node (both Controller and Broker), each role has its own request handlers. In this case, the metric is the sum for both roles, so the derived idle rate can reach 2.0.

  • Type: Counter
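
The derivation described above can be sketched in Python. This is a minimal illustration rather than part of AutoMQ: the function name `idle_rate`, its parameters, and the idea of passing two scraped samples with their timestamps are assumptions made for the example.

```python
def idle_rate(prev_idle_ns, curr_idle_ns, prev_ts_seconds, curr_ts_seconds):
    """Derive the request handler idle rate from two samples of the counter.

    prev_idle_ns / curr_idle_ns: counter values (nanoseconds) at two scrapes.
    prev_ts_seconds / curr_ts_seconds: scrape timestamps in seconds.
    The result is expected in [0, 1.0] for a single-role node and
    in [0, 2.0] for a combined Controller + Broker node.
    """
    elapsed_ns = (curr_ts_seconds - prev_ts_seconds) * 1_000_000_000
    if elapsed_ns <= 0:
        raise ValueError("samples must be ordered in time")
    return (curr_idle_ns - prev_idle_ns) / elapsed_ns


# Hypothetical samples: 7.5 s of idle time accumulated over a 10 s window.
rate = idle_rate(0, 7_500_000_000, 0.0, 10.0)
print(f"idle rate: {rate:.2f}")
```

For a combined node, dividing the result by 2 gives an average per role, which may be easier to compare against single-role nodes.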

Controller Metrics

Kafka_controller_active_count

Indicates whether the current Controller node is an active Controller. A metric value of 1 signifies active, while 0 signifies inactive.

  • Type: Gauge

Kafka_broker_active_count

The number of active Brokers in the current cluster.

  • Type: Gauge

Kafka_broker_fenced_count

The number of fenced Brokers in the current cluster.

  • Type: Gauge

Kafka_topic_count

The total number of Topics in the current cluster.

  • Type: Gauge

Kafka_partition_total_count

The total number of partitions in the current cluster.

  • Type: Gauge

Kafka_partition_offline_count

The total number of partitions without leaders in the current cluster.

  • Type: Gauge

Kafka_stream_auto_balancer_metrics_time_delay_milliseconds

The latency in reporting AutoBalancer monitoring metrics for each Broker node in the cluster. If the latency exceeds a certain threshold, the Broker node is considered an out-of-sync node by AutoBalancer and will no longer participate in AutoBalancer's partition reassignment.

  • Type: Gauge

  • Labels:

    • node_id: The ID of the node reporting AutoBalancer monitoring metrics

Kafka_stream_s3_object_count

The total number of Objects uploaded to object storage in the current cluster, categorized by Object status.

  • Type: Gauge

  • Labels:

    • state: The status of an Object is classified into three categories:

      • prepared: Objects that have not yet been fully written and are not committed.

      • committed: Objects that have been fully written and committed.

      • mark_destroyed: Objects marked for deletion, which will be removed from object storage after a certain delay.

Kafka_stream_s3_object_size_bytes

The total size of Objects uploaded to object storage by the current cluster.

  • Type: Gauge

Kafka_stream_stream_object_num

The number of StreamObjects uploaded to object storage by the current cluster.

  • Type: Gauge

Kafka_stream_stream_set_object_num

The number of StreamSetObjects uploaded to object storage by each Broker in the current cluster.

  • Type: Gauge

  • Labels:

    • node_id: The corresponding Broker node ID.

Broker Metrics

Kafka_message_count_total

The total number of messages received by the Broker node. The message throughput can be calculated by taking the derivative over time.

  • Type: Counter

  • Labels:

    • topic

Kafka_network_io_bytes_total

The total size, in bytes, of messages received and sent by the Broker node. The byte throughput can be calculated by taking the derivative over time.

  • Type: Counter

  • Labels:

    • topic

    • partition

    • direction:

      • in: indicates receiving messages

      • out: indicates sending messages
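
As a rough sketch of "taking the derivative over time" for a Counter such as this one, the Python function below computes a per-second rate from two samples. The function name and sample values are hypothetical. Treating a decrease as a counter reset (for example after a Broker restart) is an assumption that mirrors how Prometheus-style rate calculations usually behave.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second rate of a monotonically increasing counter.

    If the counter went backwards, assume it was reset to zero between
    the two samples and count only the growth since the reset.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value  # counter reset: restart from zero
    return delta / elapsed_seconds


# Hypothetical samples for one topic/partition with direction="in".
print(counter_rate(1_000, 4_000, 30))   # normal growth
print(counter_rate(5_000, 600, 30))     # counter reset in between
```

The same calculation applies to the other Counter metrics in this article, such as the message count and the request totals.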

Kafka_topic_request_count_total

The total number of produce and fetch requests received by each Topic on the Broker node.

  • Type: Counter

  • Labels:

    • topic

    • type: request type

      • produce

      • fetch

Kafka_topic_request_failed_total

The total number of failed produce and fetch requests for each Topic on the Broker node.

  • Type: Counter

  • Labels:

    • topic

    • type: Request Type

      • produce

      • fetch

Kafka_request_count_total

Total number of requests received by Broker nodes.

  • Type: Counter

  • Labels:

    • type: Request Type

    • version: API version of the request type

Kafka_request_error_count_total

Total number of requests at Broker nodes, broken down by error code. Note that successful requests are also counted in this metric, under the error code NONE.

  • Type: Counter

  • Labels:

    • type: Request Type

    • error: Error code, where NONE indicates the request was successful
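
Because successful requests are counted under the NONE error code, a failure ratio has to exclude that label value explicitly. The Python sketch below illustrates this. The function name, the dictionary layout, and the sample error codes are assumptions for the example and are not taken from AutoMQ.

```python
def error_ratio(counts_by_error):
    """Fraction of requests that failed, given counts keyed by error code.

    counts_by_error maps the value of the `error` label to the number of
    requests observed in some window. "NONE" marks successful requests.
    """
    total = sum(counts_by_error.values())
    if total == 0:
        return 0.0  # no traffic in the window
    failed = total - counts_by_error.get("NONE", 0)
    return failed / total


# Hypothetical window of counter increases for one request type.
window = {"NONE": 980, "NOT_LEADER_OR_FOLLOWER": 15, "REQUEST_TIMED_OUT": 5}
print(f"error ratio: {error_ratio(window):.3f}")
```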

Kafka_request_size_bytes_total

Total size of requests received by Broker nodes.

  • Type: Counter

  • Labels:

    • type: Request Type

Kafka_request_size_50p(99p/mean/max)_bytes

The size of the requests received by the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.

  • Type: Gauge

  • Labels:

    • type: Request Type

Kafka_request_time_milliseconds_total

The total time taken by the Broker node to process requests.

  • Type: Counter

  • Labels:

    • type: Request Type

Kafka_request_time_50p(99p/mean/max)_milliseconds

The time taken by the Broker node to process requests, reported as the 50th percentile, 99th percentile, mean, and maximum.

  • Type: Gauge

  • Labels:

    • type: Request Type

Kafka_request_queue_time_milliseconds_total

The total queuing time of requests at the Broker node. When the Kafka IO thread is busy, the request queuing time increases.

  • Type: Counter

  • Labels:

    • type: Request Type

Kafka_request_queue_time_50p(99p/mean/max)_milliseconds

The queuing time of requests at the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.

  • Type: Gauge

  • Labels:

    • type: Request Type

Kafka_response_queue_time_milliseconds_total

The total time responses spend in the response queue at the Broker node. When the Kafka network threads are busy, the response queuing time increases.

  • Type: Counter

  • Labels:

    • type: Request Type

Kafka_response_queue_time_50p(99p/mean/max)_milliseconds

The response queue time at the Broker node, reported as the 50th percentile, 99th percentile, mean, and maximum.

  • Type: Gauge

  • Labels:

    • type: Request Type

Kafka_request_queue_size

The size of the request queue on Broker nodes.

  • Type: Gauge

Kafka_response_queue_size

The size of the response queue for the Broker node.

  • Type: Gauge

Kafka_purgatory_size

The number of requests waiting in the produce or fetch purgatory on the Broker node.

  • Type: Gauge

  • Labels:

    • type:

      • Produce

      • Fetch

Kafka_partition_count

The number of partitions currently assigned to the Broker node.

  • Type: Gauge

Kafka_logs_flush_time_50p(99p/mean/max)_milliseconds

The log flush time of the Broker node. In AutoMQ for Kafka, this is the flush time of the Delta WAL, reported as the 50th percentile, 99th percentile, mean, and maximum.

  • Type: Gauge

Kafka_log_end_offset

The log end offset (the maximum logical offset) of each partition on the Broker node.

  • Type: Gauge

  • Labels:

    • topic

    • partition

Kafka_log_size

The size of the messages for each partition on the Broker node.

  • Type: Gauge

  • Labels:

    • topic

    • partition

Kafka_group_commit_offset

The consumption offsets of each Consumer Group in the corresponding partitions. Note that this metric is reported by the Broker where the Group Coordinator for each Consumer Group is located.

  • Type: Gauge

  • Labels:

    • consumer_group

    • topic

    • partition
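
One common use of this metric is to estimate consumer lag by comparing it with the log end offset of the same partition. The Python sketch below shows the idea under stated assumptions: the function name and the dictionary layout are hypothetical, and the two inputs are assumed to have been collected across the whole cluster, since the committed offset is reported by the Group Coordinator's Broker while the log end offset is reported by the Broker hosting the partition.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Estimate lag per (consumer_group, topic, partition).

    log_end_offsets:   {(topic, partition): log end offset}
    committed_offsets: {(consumer_group, topic, partition): committed offset}
    Partitions without a known log end offset are skipped. Lag is clamped
    at zero because the two samples are not taken at the same instant.
    """
    lag = {}
    for (group, topic, partition), committed in committed_offsets.items():
        end = log_end_offsets.get((topic, partition))
        if end is None:
            continue
        lag[(group, topic, partition)] = max(end - committed, 0)
    return lag


# Hypothetical samples.
ends = {("orders", 0): 1200, ("orders", 1): 900}
commits = {
    ("billing", "orders", 0): 1150,
    ("billing", "orders", 1): 900,
    ("billing", "payments", 0): 10,  # no log end offset sample available
}
for key, value in sorted(consumer_lag(ends, commits).items()):
    print(key, value)
```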

Kafka_group_count

The number of Consumer Groups managed by the Group Coordinator on the Broker node.

  • Type: Gauge

Kafka_group_preparing_rebalance_count

Number of Consumer Groups preparing for rebalance.

  • Type: Gauge

Kafka_group_completing_rebalance_count

Number of Consumer Groups waiting for the group leader to send the assignment.

  • Type: Gauge

Kafka_group_stable_count

Number of Consumer Groups in a stable state.

  • Type: Gauge

Kafka_group_empty_count

Number of Consumer Groups with no members but not expired.

  • Type: Gauge

Kafka_group_dead_count

Number of Consumer Groups with no members and metadata removed.

  • Type: Gauge

Kafka_stream_upload_size_bytes_total

Total data size uploaded by Broker nodes to object storage.

  • Type: Counter

Kafka_stream_download_size_bytes_total

Total data size downloaded by Broker nodes from object storage.

  • Type: Counter

Kafka_stream_network_inbound_usage_bytes_total

Total inbound bandwidth usage of Broker nodes, including received messages and data downloaded from object storage. Taking the derivative over time gives the inbound throughput.

  • Type: Counter

  • Labels:

    • type:

      • bypass: Refers to the inbound bandwidth usage that is not rate-limited, equivalent to the message inflow at the Broker node.

      • catchup: Refers to the inflow generated by cold reads, i.e., the inflow from reading data from S3 due to cache misses or prefetch strategies.

      • compaction: Refers to the inflow generated by Stream Set Object Compaction, i.e., the inflow from reading data from S3 during compaction.

Kafka_stream_network_outbound_usage_bytes_total

Total outbound bandwidth usage of Broker nodes, including consumed messages and data uploaded to object storage. Taking the derivative over time gives the outbound throughput.

  • Type: Counter

  • Labels:

    • type:

      • bypass: Refers to the outbound bandwidth usage that is not rate-limited, such as the message outflow at the Broker node when consuming hot data, or the outflow when the Broker uploads Delta WAL to S3.

      • catchup: Refers to the outflow generated by cold reads, equivalent to the outflow at the Broker node when consuming cold data.

      • compaction: Refers to the outflow generated by Stream Set Object Compaction, i.e., the outflow when uploading data to S3 during compaction.
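
To see how bandwidth is split between the bypass, catchup, and compaction types, the counter increases over a window can be turned into shares. The following Python sketch is illustrative only: the function name and the sample numbers are assumptions, and the input is assumed to hold the per-type increase of the counter over the same window.

```python
def traffic_share(bytes_by_type):
    """Share of traffic per `type` label value over one window.

    bytes_by_type maps a type (bypass, catchup, compaction) to the number
    of bytes the counter grew by in the window. Returns fractions that
    sum to 1.0, or all zeros when there was no traffic.
    """
    total = sum(bytes_by_type.values())
    if total == 0:
        return {name: 0.0 for name in bytes_by_type}
    return {name: value / total for name, value in bytes_by_type.items()}


# Hypothetical outbound byte increases over one window.
shares = traffic_share({"bypass": 600, "catchup": 300, "compaction": 100})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

A growing catchup share would suggest that more consumers are reading cold data from object storage, while a growing compaction share points to background compaction activity.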

Kafka_stream_network_inbound_available_bandwidth_bytes

The inbound throughput reserved for cold reads and compaction at the Broker node. When this value is less than the inbound throughput required for cold reads and compaction, the corresponding requests are placed in a throttling queue. Normal message send and receive traffic is not affected by this throttling. Note that this metric is an instantaneous value at the time of sampling and depends on the sampling interval and on how the throttling strategy is implemented. This metric is for reference only.

  • Type: Gauge

Kafka_stream_network_outbound_available_bandwidth_bytes

The outbound throughput reserved for cold reads and compaction at the Broker node. When this value is less than the outbound throughput required for cold reads and compaction, the corresponding requests are placed in a throttling queue. Normal message send and receive traffic is not affected by this throttling. Note that this metric is an instantaneous value at the time of sampling and depends on the sampling interval and on how the throttling strategy is implemented. This metric is for reference only.

  • Type: Gauge

Kafka_stream_network_inbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds

The time that inbound traffic requests for cold reads and compaction spend in the throttling queue, reported as the 50th percentile, 99th percentile, maximum, and sum.

  • Type: Gauge

Kafka_stream_network_outbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds

The time that outbound traffic requests for cold reads and compaction spend in the throttling queue, reported as the 50th percentile, 99th percentile, maximum, and sum.

  • Type: Gauge

Kafka_stream_operation_latency_50p(99p/max/sum)_nanoseconds

The latency of each operation stage in the AutoMQ for Kafka S3Stream module, reported as the 50th percentile, 99th percentile, maximum, and sum.

  • Type: Gauge

  • Labels:

    • operation_type

    • operation_name