Metrics
This article specifically addresses AutoMQ Kafka terminology as used in the open-source automq-for-kafka project, which "AutoMQ CO." hosts on GitHub under AutoMQ.
This article describes the monitoring metrics provided by AutoMQ for Kafka, which are exposed in Prometheus format.
General Metrics
Kafka_server_connection_count
Current number of established connections per node.
- Type: Gauge
Kafka_network_threads_idle_rate
Idle rate of Kafka SocketServer network threads, ranging from [0, 1.0].
- Type: Gauge
Kafka_io_threads_idle_time_nanoseconds_total
Idle time of Kafka request handler threads, measured in nanoseconds. This metric is the cumulative counterpart of the native Apache Kafka® metric RequestHandlerAvgIdlePercent: taking its derivative over time (in nanoseconds) yields the idle rate of the threads. Note that when the node runs as a combined node (serving as both Controller and Broker), the metric is the combined value from the Controller and the Broker, so the derived idle rate can be as high as 2.0.
- Type: Counter
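As a rough illustration of the derivative described above, the following Python sketch turns two samples of this counter into an idle rate. The function name `idle_ratio` and its parameters are hypothetical, and the sketch assumes the counter did not reset between the two samples.

```python
NANOS_PER_SECOND = 1_000_000_000


def idle_ratio(prev_idle_ns: int, curr_idle_ns: int, interval_seconds: float) -> float:
    """Return idle nanoseconds accumulated per nanosecond of elapsed time.

    prev_idle_ns and curr_idle_ns are two samples of the cumulative counter;
    interval_seconds is the wall-clock time between them (an assumption of
    this sketch: the caller tracks the sampling interval).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_ns = curr_idle_ns - prev_idle_ns
    if delta_ns < 0:
        # A counter should never decrease; a drop usually means a restart.
        raise ValueError("counter decreased between samples (possible reset)")
    return delta_ns / (interval_seconds * NANOS_PER_SECOND)
```

For example, two samples taken 15 seconds apart whose values differ by 12,000,000,000 ns give an idle rate of 0.8. On a combined node the same calculation can yield values up to 2.0.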
Controller Metrics
Kafka_controller_active_count
Shows whether the current Controller node is an active Controller, with a metric value of 1 indicating active status, and 0 indicating non-active.
- Type: Gauge
Kafka_broker_active_count
Current active broker count in the cluster.
- Type: Gauge
Kafka_broker_fenced_count
Number of brokers in the cluster that are fenced.
- Type: Gauge
Kafka_topic_count
Total number of topics in the cluster.
- Type: Gauge
Kafka_partition_total_count
Total number of partitions in the cluster.
- Type: Gauge
Kafka_partition_offline_count
Total number of leaderless partitions in the cluster.
- Type: Gauge
Kafka_stream_auto_balancer_metrics_time_delay_milliseconds
The latency of each broker node in reporting AutoBalancer monitoring metrics; if the latency exceeds a specific threshold, the broker node is deemed out-of-sync and is excluded from the AutoBalancer's partition reassignments.
Type: Gauge
Labels:
- node_id: Node ID reporting the AutoBalancer monitoring metrics
Kafka_stream_s3_object_count
Total number of objects uploaded to object storage in the current cluster, categorized by the status of the objects.
Type: Gauge
Labels:
- state: Object status, categorized into three types:
  - prepared: Objects that are still being written and have not been committed
  - committed: Objects that have finished writing and have been committed
  - mark_destroyed: Objects designated for deletion, to be removed from object storage after a certain delay
Kafka_stream_s3_object_size_bytes
Total size of Objects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_object_num
Number of StreamObjects uploaded to object storage by the current cluster.
- Type: Gauge
Kafka_stream_stream_set_object_num
Number of StreamSetObjects uploaded to object storage by each Broker in the current cluster.
Type: Gauge
Labels:
- node_id: The corresponding Broker node id
Broker Metrics
Kafka_message_count_total
The total number of messages received by Broker nodes. Take its derivative over time to obtain the message count throughput.
Type: Counter
Labels:
- topic
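The derivative mentioned above can be sketched in Python as a per-topic rate computed from two samples of the counter. The function name `per_topic_rate` and the dictionary layout (topic name mapped to counter value) are assumptions made for this illustration, not part of AutoMQ.

```python
def per_topic_rate(
    prev: dict[str, int],
    curr: dict[str, int],
    interval_seconds: float,
) -> dict[str, float]:
    """Return messages per second for each topic present in both samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates: dict[str, float] = {}
    for topic, curr_value in curr.items():
        prev_value = prev.get(topic)
        if prev_value is None or curr_value < prev_value:
            # New series or counter reset: no reliable delta for this interval.
            continue
        rates[topic] = (curr_value - prev_value) / interval_seconds
    return rates
```

Skipping series that are new or have decreased is one possible design choice; a monitoring system may instead treat a decrease as a reset and count from zero.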
Kafka_network_io_bytes_total
The total volume of messages received and sent by Broker nodes. Take its derivative over time to obtain the message size throughput.
Type: Counter
Labels:
- topic
- partition
- direction:
  - in: indicates messages received
  - out: indicates messages sent
Kafka_topic_request_count_total
The total number of requests received by each Topic on Broker nodes, specifically including only produce and fetch request types.
Type: Counter
Labels:
- topic
- type: Request Type
  - produce
  - fetch
Kafka_topic_request_failed_total
The total number of failed requests for each Topic on Broker nodes, specifically including only produce and fetch request types.
Type: Counter
Labels:
- topic
- type: Request Type
  - produce
  - fetch
Kafka_request_count_total
Total number of requests received by Broker nodes.
Type: Counter
Labels:
- type: Request Type
- version: API version of the request type
Kafka_request_error_count_total
Total number of request failures on Broker nodes. Note that successful requests are also counted in this metric, with an error code of NONE.
Type: Counter
Labels:
- type: Request Type
- error: error code, NONE indicates the request was successful
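Because successful requests are counted here under the error code NONE, a failure ratio has to exclude that series from the numerator. The Python sketch below shows one way to do this; the function name `error_ratio` is hypothetical, and the input is assumed to be the increase of the counter over some window, keyed by the value of the error label.

```python
def error_ratio(increase_by_error: dict[str, float]) -> float:
    """Return the share of requests whose error code is not NONE.

    increase_by_error maps an error code to the number of requests counted
    under that code during the window being examined.
    """
    total = sum(increase_by_error.values())
    if total == 0:
        # No requests in the window, so there is nothing to report as failed.
        return 0.0
    failed = sum(
        count for code, count in increase_by_error.items() if code != "NONE"
    )
    return failed / total
```

For instance, 90 requests under NONE and 10 under some other error code give a ratio of 0.1.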
Kafka_request_size_bytes_total
Total size of requests received by Broker nodes.
Type: Counter
Labels:
- type: Request Type
Kafka_request_size_50p(99p/mean/max)_bytes
Size of requests received by Broker nodes, represented by different percentiles.
Type: Gauge
Labels:
- type: Request Type
Kafka_request_time_milliseconds_total
Total time spent by Broker nodes in processing requests.
Type: Counter
Labels:
- type: Request Type
Kafka_request_time_50p(99p/mean/max)_milliseconds
Time spent by Broker nodes in processing requests, represented by different percentiles.
Type: Gauge
Labels:
- type: Request Type
Kafka_request_queue_time_milliseconds_total
Total queuing time of requests at Broker nodes, which increases when Kafka IO threads are busy.
Type: Counter
Labels:
- type: Request Type
Kafka_request_queue_time_50p(99p/mean/max)_milliseconds
Broker node request queuing time, represented by different percentiles.
Type: Gauge
Labels:
- type: Request Type
Kafka_response_queue_time_milliseconds_total
Total response queuing time at Broker nodes, which can increase when Apache Kafka® network threads are busy.
Type: Counter
Labels:
- type: Request Type
Kafka_response_queue_time_50p(99p/mean/max)_milliseconds
Broker node response queuing time, represented by different percentiles.
Type: Gauge
Labels:
- type: Request Type
Kafka_request_queue_size
Broker node request queue size.
- Type: Gauge
Kafka_response_queue_size
Size of the response queue for broker nodes.
- Type: Gauge
Kafka_purgatory_size
Number of requests pending on broker nodes in either the produce or fetch purgatory.
Type: Gauge
Labels:
- type:
  - Produce
  - Fetch
Kafka_partition_count
Current count of partitions allocated to broker nodes.
- Type: Gauge
Kafka_logs_flush_time_50p(99p/mean/max)_milliseconds
Log flush time for broker nodes in AutoMQ for Kafka, depicted through the flush time of Delta WAL across various percentiles.
- Type: Gauge
Kafka_log_end_offset
Maximum logical offset for each partition on broker nodes.
Type: Gauge
Labels:
- topic
- partition
Kafka_log_size
Total size of the messages in each partition on broker nodes.
Type: Gauge
Labels:
- topic
- partition
Kafka_group_commit_offset
Consumption offset for each Consumer Group on the respective partition. Note that this metric is reported for each Consumer Group by the broker that hosts its Group Coordinator.
Type: Gauge
Labels:
- consumer_group
- topic
- partition
Kafka_group_count
Number of Consumer Groups overseen by each Group Coordinator's broker node.
- Type: Gauge
Kafka_group_preparing_rebalance_count
Number of Consumer Groups preparing to rebalance.
- Type: Gauge
Kafka_group_completing_rebalance_count
Number of Consumer Groups awaiting state assignments from the Leader.
- Type: Gauge
Kafka_group_stable_count
Number of Consumer Groups in a Stable state.
- Type: Gauge
Kafka_group_empty_count
Number of Consumer Groups without any members but not yet expired.
- Type: Gauge
Kafka_group_dead_count
Number of Consumer Groups without any members and with metadata removed.
- Type: Gauge
Kafka_stream_upload_size_bytes_total
Total size of data uploaded to Object storage by Broker nodes.
- Type: Counter
Kafka_stream_download_size_bytes_total
Total size of data downloaded from Object storage by Broker nodes.
- Type: Counter
Kafka_stream_network_inbound_usage_bytes_total
Total inbound bandwidth usage of Broker nodes, including message reception and data downloads from object storage. Take its derivative over time to obtain the inbound throughput.
Type: Counter
Labels:
- type:
  - bypass: refers to the inbound bandwidth usage that is not subject to rate limiting, equivalent to the message inflow of a Broker node.
  - catchup: represents the inbound traffic generated by cold reads, that is, due to cache misses or prefetching strategies from S3.
  - compaction: indicates the inbound traffic generated by Stream Set Object Compaction, i.e., data read from S3 during compaction.
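To show how the type label might be used, the Python sketch below converts per-type byte increases into throughput and reports what share of the traffic falls under the rate-limited types (catchup and compaction). The function name `inbound_breakdown` and its input layout are assumptions made for this illustration.

```python
# Types other than "bypass" are subject to rate limiting per the list above.
RATE_LIMITED_TYPES = ("catchup", "compaction")


def inbound_breakdown(
    delta_bytes_by_type: dict[str, float],
    interval_seconds: float,
) -> tuple[dict[str, float], float]:
    """Return (bytes per second for each type, rate-limited share of the total).

    delta_bytes_by_type maps a type label value to the increase of the
    counter over the sampling interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    throughput = {
        kind: delta / interval_seconds
        for kind, delta in delta_bytes_by_type.items()
    }
    total = sum(throughput.values())
    limited = sum(throughput.get(kind, 0.0) for kind in RATE_LIMITED_TYPES)
    share = limited / total if total > 0 else 0.0
    return throughput, share
```

A high rate-limited share suggests that cold reads or compaction, rather than regular message inflow, account for most of the inbound bandwidth in that interval.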
Kafka_stream_network_outbound_usage_bytes_total
The total outbound bandwidth usage of a Broker node, including message consumption and data uploads to object storage. Take its derivative over time to obtain the outbound throughput.
Type: Counter
Labels:
- type:
  - bypass: represents the outbound bandwidth usage that is not subject to rate limiting, such as the Broker node's message outflow when consuming hot data or when uploading Delta WAL to S3.
  - catchup: represents the outbound traffic generated by cold reads, equivalent to the Broker node's outflow when consuming cold data.
  - compaction: indicates the outbound traffic generated by Stream Set Object Compaction, i.e., data uploaded to S3 during compaction.
Kafka_stream_network_inbound_available_bandwidth_bytes
Broker node's reserved inbound throughput for cold reads and compaction. When this value is less than the inbound traffic demanded by cold reads and compaction, the corresponding requests are placed in a rate limiting queue to wait. Note that this metric only represents the instantaneous value at the time of sampling and depends on the sampling interval and the specific implementation of the rate limiting policy, so it should be used for reference only.
- Type: Gauge
Kafka_stream_network_outbound_available_bandwidth_bytes
Broker nodes allocate outbound throughput for cold reads and compaction; if this throughput falls short of the requirements for cold reads and compaction, the corresponding requests will be placed in the rate limiting queue, though normal message transmission remains unaffected. It's important to note that this metric only captures the instantaneous value at the time of sampling and depends on the specific sampling interval and rate limiting policy; therefore, it should be considered for reference purposes only.
- Type: Gauge
Kafka_stream_network_inbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
Time that cold read and compaction inbound traffic requests spend waiting in the rate limiting queue.
- Type: Gauge
Kafka_stream_network_outbound_limiter_queue_time_50p(99p/max/sum)_nanoseconds
Time that cold read and compaction outbound traffic requests spend waiting in the rate limiting queue.
- Type: Gauge
Kafka_stream_operation_latency_50p(99p/max/sum)_nanoseconds
Duration of the operations in each phase of the AutoMQ for Kafka S3Stream module.
Type: Gauge
Labels:
- operation_type
- operation_name