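Prometheus Monitoring & Alerting

Metrics are essential data for system observability. AutoMQ exposes many of the native Apache Kafka metrics through Prometheus. This article describes the metrics that AutoMQ exposes and how to use them.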

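How Metrics Are Collected and Used

AutoMQ's internal components collect the Kafka Server metrics, but the current commercial edition does not yet include a built-in metrics dashboard or monitoring and alerting capability. Users can build custom metrics monitoring and analysis on top of the integration features that AutoMQ provides. The overall architecture is shown in the figure below:

With reference to the figure above, collecting and analyzing metrics takes the following steps:

  1. Use the metrics integration feature (see Manage Integrations) to forward metrics data to your own Prometheus service.

  2. Use the Grafana templates provided by AutoMQ to quickly set up dashboards in your Grafana cluster.

  3. Use the Prometheus alert templates provided by AutoMQ to quickly configure monitoring alerts. For the alert templates, see the linked page.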

Prometheus Metrics Definitions

For detailed definitions of the metrics exposed through the integration above, see AutoMQ for Kafka Metrics.

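Grafana Dashboard Examples

As noted above, AutoMQ Cloud does not yet provide a managed Grafana dashboard service. Users can set up dashboards quickly from the Grafana templates that AutoMQ provides. The dashboard templates can be downloaded from the linked page.

The preset Grafana dashboard templates cover metrics along several dimensions:

  • Cluster Overview: cluster-level monitoring, including node count, data size, and cluster traffic, plus an overview of Topic, Group, and Broker metrics with drill-down links to the corresponding detail dashboards
  • Broker Metrics: broker-level monitoring, including connection count, partition count, node traffic, and node requests
  • Topic Metrics: topic-level monitoring, including message throughput, total data volume, partition count, and consumer lag
  • Group Metrics: group-level monitoring, including consume rate and consumer lag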
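
The dashboard templates presumably query the metrics that were forwarded to Prometheus, so Grafana needs a Prometheus data source before a template is imported. The following is a minimal provisioning sketch, not an AutoMQ-supplied file. The data source name and the URL `http://localhost:9090` are placeholders chosen for illustration, and the data source name that the AutoMQ templates expect is not stated in this document.

```yaml
# Grafana data source provisioning file, placed under provisioning/datasources/.
# The name and url values are assumptions; replace them with your own.
apiVersion: 1
datasources:
  - name: Prometheus            # hypothetical name
    type: prometheus
    access: proxy               # Grafana's backend proxies the queries
    url: http://localhost:9090  # hypothetical Prometheus address
    isDefault: true
```

If the imported dashboards show no data, the first thing to check is whether the dashboard's data source variable points at this data source.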

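Business Monitoring and Alerting

AutoMQ integrates with Prometheus. Once metrics data is pushed to Prometheus, users can configure custom alerting rules in Prometheus to detect abnormal conditions such as unusual business load levels.

Alert Templates

AutoMQ has distilled a set of alert templates from the metrics most frequently used in production. Users can choose which of these alert rules to configure based on their needs.

The alert rule templates are listed below: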

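HighTopicBytesInPerSec
  • Alert rule: topic write throughput too high
  • Purpose and scenario: for a specified topic, checks whether the bytes written per second exceed the threshold. Typically used to detect bursts of write traffic.
HighTopicBytesInPerSecDayToDayChange
  • Alert rule: day-over-day increase in topic write throughput too high
  • Purpose and scenario: for a specified topic, checks whether the day-over-day growth ratio of bytes written per second exceeds the threshold. Typically used to detect bursts of write traffic.
LowTopicBytesInPerSec
  • Alert rule: topic write throughput too low
  • Purpose and scenario: for a specified topic, checks whether the bytes written per second fall below the threshold. Typically used to detect a broken write path where traffic drops to zero.
LowTopicBytesInPerSecDayToDayChange
  • Alert rule: day-over-day drop in topic write throughput
  • Purpose and scenario: for a specified topic, checks whether the day-over-day drop ratio of bytes written per second exceeds the threshold. Typically used to detect a degraded write path or business fluctuations.
HighTopicBytesOutPerSec
  • Alert rule: topic read throughput too high
  • Purpose and scenario: for a specified topic, checks whether the bytes read per second exceed the threshold. Typically used to detect bursts of read and fan-out traffic.
HighTopicBytesOutPerSecDayToDayChange
  • Alert rule: day-over-day increase in topic read throughput too high
  • Purpose and scenario: for a specified topic, checks whether the day-over-day growth ratio of bytes read per second exceeds the threshold. Typically used to detect bursts of read traffic.
LowTopicBytesOutPerSec
  • Alert rule: topic read throughput too low
  • Purpose and scenario: for a specified topic, checks whether the bytes read per second fall below the threshold. Typically used to detect a broken read path where traffic drops to zero.
LowTopicBytesOutPerSecDayToDayChange
  • Alert rule: day-over-day drop in topic read throughput exceeds the threshold
  • Purpose and scenario: for a specified topic, checks whether the day-over-day drop ratio of bytes read per second exceeds the threshold. Typically used to detect degraded read traffic or business fluctuations.
HighGroupConsumeRatePerTopic
  • Alert rule: Consumer Group consume rate too high
  • Purpose and scenario: for a specified Consumer Group and topic, checks whether the number of messages consumed per second is above the threshold. Typically used to detect read anomalies.
LowGroupConsumeRatePerTopic
  • Alert rule: Consumer Group consume rate too low
  • Purpose and scenario: for a specified Consumer Group and topic, checks whether the number of messages consumed per second is below the threshold. Typically used to detect read anomalies.
HighGroupConsumerLag
  • Alert rule: Consumer Group consumer lag too high
  • Purpose and scenario: for a specified Consumer Group, checks whether the number of messages it has not yet consumed is above the threshold. Typically used to detect read anomalies.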
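
The DayToDayChange rules compare the current per-second rate with the rate 24 hours earlier. As a worked sketch, write r(t) for the topic's throughput at time t (the symbols r and Δ are introduced here for illustration only); the ratio the rules evaluate is then:

```latex
\[
\Delta(t) = \frac{r(t) - r(t - 24\,\mathrm{h})}{r(t - 24\,\mathrm{h})}
\]
```

The template uses 0.2 and -0.2 as example thresholds. For instance, if a topic was written at 48 MiB/s at this time yesterday and is written at 60 MiB/s now, Δ = (60 - 48) / 48 = 0.25, which exceeds 0.2, so the High rule would fire once the condition has held for the configured period. At 36 MiB/s, Δ = (36 - 48) / 48 = -0.25, which is below -0.2, so the Low rule would fire instead.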

The complete alert template YAML file is shown below and can be copied and imported directly.


# This is the alert rules template for AutoMQ, please modify the alert thresholds and period per your needs
# before applying it to your production environment.
groups:
  - name: kafka_alerts
    rules:
      - alert: HighTopicBytesInPerSec
        expr: sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) > 50 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

      - alert: LowTopicBytesInPerSec
        expr: sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) < 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low inbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

      - alert: HighTopicBytesOutPerSec
        expr: sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) > 50 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High outbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

      - alert: LowTopicBytesOutPerSec
        expr: sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) < 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low outbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

      - alert: HighGroupConsumeRatePerTopic
        expr: sum(max(rate(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}[1m])) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High group consume rate {{ printf \"%0.2f\" $value }} msg/s for consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The consume rate of consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

      - alert: LowGroupConsumeRatePerTopic
        expr: sum(max(rate(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}[1m])) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low group consume rate {{ printf \"%0.2f\" $value }} msg/s for consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The consume rate of consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

      - alert: HighTopicBytesInPerSecDayToDayChange
        expr: |
          (sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
          - sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
          / sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The increase of inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

      - alert: LowTopicBytesInPerSecDayToDayChange
        expr: |
          (sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
          - sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
          / sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) < -0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low inbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The decrease of inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

      - alert: HighTopicBytesOutPerSecDayToDayChange
        expr: |
          (sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
          - sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
          / sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High outbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The increase of outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

      - alert: LowTopicBytesOutPerSecDayToDayChange
        expr: |
          (sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
          - sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
          / sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) < -0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low outbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
          description: "The decrease of outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

      - alert: HighGroupConsumerLag
        expr: |
          sum(max(kafka_log_end_offset{topic="example_topic"}) by (job, topic, partition)) by (job, topic)
          - on (topic) group_left (consumer_group) sum(max(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) > 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High group consumer lag {{ printf \"%0.f\" $value }} for consumer group {{ $labels.consumer_group }} in cluster {{ $labels.job }} on topic {{ $labels.topic }}."
          description: "The consumer lag of consumer group {{ $labels.consumer_group }} in cluster {{ $labels.job }} on topic {{ $labels.topic }} is exceeding threshold."

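Configuration Steps

AutoMQ provides the Prometheus alert template above. Users can import it into their existing Prometheus cluster (instance) and then configure custom alert rules based on it.

The following uses Alibaba Cloud Prometheus as an example to show how alert rules are configured. If you run a self-hosted Prometheus, adapt the steps accordingly.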
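
For a self-hosted Prometheus, the usual way to load such a template is through `rule_files` in the server configuration. The fragment below is a minimal sketch under stated assumptions: the file name `rules/automq_alerts.yml` and the Alertmanager address `localhost:9093` are hypothetical and do not come from AutoMQ.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m          # how often alert rules are evaluated

rule_files:
  - "rules/automq_alerts.yml"      # hypothetical path: the template saved to a file

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # hypothetical Alertmanager address
```

Running `promtool check rules rules/automq_alerts.yml` before reloading Prometheus catches indentation and expression errors in the rule file.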

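Configuring Alerts in Alibaba Cloud Prometheus

  1. Import the alert template. Open the Alibaba Cloud Prometheus console, go to "Alert Rule Templates", and choose "Batch Import Templates".
  2. Copy the contents of the template file above and import them.
  3. Apply the alert template. After the import completes, select a specific alert template and click "Apply Template" to apply each template you want to enable to the corresponding Prometheus instance.
  4. Configure the alert rule. Taking the consumer lag alert (HighGroupConsumerLag) as an example: click "Apply Template" and select the target Prometheus instance. Once the template is applied, the enabled rule appears under "Alert Rule List" on the left.
  5. Click "Edit" to open the rule's edit page. Change "example_topic" and "example_group" to the topic and consumer group you want to monitor, and set the alert threshold (10000 in the template) to the desired value.
  6. Configure the notification policy. After editing, select an existing notification policy or click "Create Notification Policy" to create one.
  7. Copy alert rules quickly (optional). To monitor multiple topics or consumer groups, click "Copy" to create additional alert rules.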
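
When several topics share the same threshold, one alternative to copying a rule is a regular-expression label matcher, which is standard PromQL. The sketch below adapts the HighTopicBytesInPerSec rule from the template; the topic names `orders` and `payments` and the group name are hypothetical. Because the expression aggregates by (job, topic), each matching topic should produce its own alert.

```yaml
groups:
  - name: kafka_alerts_multi_topic        # hypothetical group name
    rules:
      - alert: HighTopicBytesInPerSec
        # topic=~"orders|payments" matches either hypothetical topic
        expr: sum(max(rate(kafka_network_io_bytes_total{direction="in", topic=~"orders|payments"}[1m])) by (job, topic, partition)) by (job, topic) > 50 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High inbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
```

Topics that need different thresholds still call for separate rules, which is where copying the rule remains the simpler route.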