Skip to main content

Prometheus 监控&报警

Metrics 是系统可观测非常重要的分析数据。AutoMQ 支持通过 Prometheus 透出原生 Apache Kafka 的多项 Metrics 数据。本文介绍 AutoMQ 透出的 Metrics 明细。

Metrics 采集、应用原理

AutoMQ 内部组件支持收集 Kafka Server 各项 Metrics 数据,但当前商业版本暂不提供内置的 Metrics 仪表盘和监控告警能力,用户可基于 AutoMQ 提供的集成功能实现自定义 Metrics 监控和分析,整体的架构参考下图:

参考上图,应用 Metrics 采集和分析的步骤如下:

  1. 使用 Metrics 集成功能管理集成▸ 将 Metrics 数据转发到自定义的 Prometheus 服务。

  2. 使用 AutoMQ 提供的 Grafana 模板在用户的 Grafana 集群中快速配置仪表盘。

  3. 使用 AutoMQ 提供的 Prometheus 报警模板快速配置监控告警,相关报警模板参考链接

Prometheus Metrics 定义

上述集成中透出的相关 Metrics 的详细定义请参考监控指标。

Grafana 仪表盘示例

如上文所述, AutoMQ Cloud 暂不提供托管的 Grafana 仪表盘服务,用户参考AutoMQ 提供的 Grafana 模板快速配置仪表盘,相关仪表盘模板请前往此处链接下载。

预置的 Grafana 大盘模板提供了不同维度的指标监控:

  • Cluster Overview: 提供了集群维度的监控,包括节点数量、数据大小、集群流量等,以及Topic、Group、Broker 维度的指标概览,并提供了下钻功能,可跳转至对应的详情监控
  • Broker Metrics: 提供了 Broker 维度的监控,包括连接数量、分区数量、节点流量、节点请求等
  • Topic Metrics: 提供了 Topic 维度的指标监控,包括消息吞吐、数据总量、分区数量、消费延迟等
  • Group Metrics: 提供了 Group 维度的指标监控,包括消费速率和消费延迟

业务监控报警

AutoMQ 基于 Prometheus 集成,将 Metrics 数据推送到 Prometheus 后,用户可以使用 Prometheus 配置自定义报警规则,用于监控业务压力水位等异常情况。

报警模板

AutoMQ 基于生产环境高频使用的 Metrics 沉淀了一系列报警模板,用户可根据实际需求选择配置其中的报警规则。

报警规则模板列表如下:

报警规则项
作用和场景
HighTopicBytesInPerSec
  • 报警规则:Topic 数据写入吞吐过高报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒写入数据量是否超过阈值,一般用于监控突发写入流量。
HighTopicBytesInPerSecDayToDayChange
  • 报警规则:Topic 数据写入吞吐日同比增长过高报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒写入数据量日同比增长比例超过阈值,一般用于监控突发写入流量。
LowTopicBytesInPerSec
  • 报警规则:Topic 数据写入吞吐过低报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒写入数据量是否低于阈值,一般用于监控是否出现写入链路受损跌零。
LowTopicBytesInPerSecDayToDayChange
  • 报警规则:Topic 数据写入吞吐日同比下跌报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒写入数据量日同比下跌比例超过阈值,一般用于监控写入链路受损,或者业务波动。
HighTopicBytesOutPerSec
  • 报警规则:Topic 数据读取吞吐过高报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒读取数据量是否超过阈值,一般用于监控突发读取和 Fanout 流量。
HighTopicBytesOutPerSecDayToDayChange
  • 报警规则:Topic 数据读取吞吐日同比增长过高报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒读取数据量日同比增长比例超过阈值,一般用于监控突发读取流量。
LowTopicBytesOutPerSec
  • 报警规则:Topic 数据读取吞吐过低报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒读取数据量是否低于阈值,一般用于监控是否出现读取链路受损跌零。
LowTopicBytesOutPerSecDayToDayChange
  • 报警规则:Topic 数据读取吞吐日同比下跌超过阈值报警
  • 作用和场景:针对指定 topic 监控该 topic 每秒读取数据量日同比下跌比例超过阈值,一般用于监控读取流量受损或者业务波动。
HighGroupConsumeRatePerTopic
  • 报警规则: Consumer Group 消费速率过高报警
  • 作用和场景:针对指定 Consumer Group 监控该 Group 每秒消费消息数量是否高于阈值,一般用于监控是否出现读取异常。
LowGroupConsumeRatePerTopic
  • 报警规则: Consumer Group 消费速率过低报警
  • 作用和场景:针对指定 Consumer Group 监控该 Group 每秒消费消息数量是否低于阈值,一般用于监控是否出现读取异常。
HighGroupConsumerLag
  • 报警规则: Consumer Group 消费堆积告警
  • 作用和场景:针对指定 Consumer Group 监控该 Group 消费堆积数量是否高于阈值,一般用于监控是否出现读取异常。

完整的报警模板 yaml 文件参考下方,可以用于复制导入。


# This is the alert rules template for AutoMQ, please modify the alert thresholds and period per your needs
# before applying it to your production environment.
groups:
- name: kafka_alerts
rules:
- alert: HighTopicBytesInPerSec
expr: sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) > 50 * 1024 * 1024
for: 5m
labels:
severity: warning
annotations:
summary: "High inbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

- alert: LowTopicBytesInPerSec
expr: sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) < 1024
for: 5m
labels:
severity: warning
annotations:
summary: "Low inbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

- alert: HighTopicBytesOutPerSec
expr: sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) > 50 * 1024 * 1024
for: 5m
labels:
severity: warning
annotations:
summary: "High outbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

- alert: LowTopicBytesOutPerSec
expr: sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic) < 1024
for: 5m
labels:
severity: warning
annotations:
summary: "Low outbound network throughput {{ printf \"%0.2f\" $value }} Bytes/s for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

- alert: HighGroupConsumeRatePerTopic
expr: sum(max(rate(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}[1m])) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High group consume rate {{ printf \"%0.2f\" $value }} msg/s for consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The consume rate of consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }} is exceeding threshold."

- alert: LowGroupConsumeRatePerTopic
expr: sum(max(rate(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}[1m])) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Low group consume rate {{ printf \"%0.2f\" $value }} msg/s for consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The consume rate of consumer group {{ $labels.consumer_group }} on topic {{ $labels.topic }} in cluster {{ $labels.job }} is below threshold."

- alert: HighTopicBytesInPerSecDayToDayChange
expr: (sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
- sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
/ sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "High inbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The increase of inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

- alert: LowTopicBytesInPerSecDayToDayChange
expr: (sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
- sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
/ sum(max(rate(kafka_network_io_bytes_total{direction="in", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) < -0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Low inbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The decrease of inbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

- alert: HighTopicBytesOutPerSecDayToDayChange
expr: (sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
- sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
/ sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "High outbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The increase of outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

- alert: LowTopicBytesOutPerSecDayToDayChange
expr: (sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m])) by (job, topic, partition)) by (job, topic)
- sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic))
/ sum(max(rate(kafka_network_io_bytes_total{direction="out", topic="example_topic"}[1m] offset 24h)) by (job, topic, partition)) by (job, topic) < -0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Low outbound network throughput change {{ printf \"%0.2f\" $value }} for topic {{ $labels.topic }} in cluster {{ $labels.job }}"
description: "The decrease of outbound bytes per second produced by topic {{ $labels.topic }} in cluster {{ $labels.job }} compared to 24h ago is exceeding threshold"

- alert: HighGroupConsumerLag
expr: sum(max(kafka_log_end_offset{topic="example_topic"}) by (job, topic, partition)) by (job, topic)
- on (topic) group_left (consumer_group) sum(max(kafka_group_commit_offset{consumer_group="example_group", topic="example_topic"}) by (job, consumer_group, topic, partition)) by (job, consumer_group, topic) > 10000
for: 1m
labels:
severity: warning
annotations:
summary: "High group consumer lag {{ printf \"%0.f\" $value }} for consumer group {{ $labels.consumer_group }} in cluster {{ $labels.job }} on topic {{ $labels.topic }}."
description: "The consumer lag of consumer group {{ $labels.consumer_group }} in cluster {{ $labels.job }} on topic {{ $labels.topic }} is exceeding threshold."

配置步骤

AutoMQ 提供了上述 Prometheus 报警模板,用户可将该报警模板导入当前的 Prometheus 集群(实例)中,然后基于报警模板配置自定义报警规则。

下文以阿里云 Prometheus 为例,演示报警规则配置方法,如果当前是自建 Prometheus,自行参考调整。

在阿里云 Prometheus 配置报警

  1. 导入报警模板: 打开阿里云 Prometheus 控制台,进入告警规则模板 ,选择批量导入模板
  1. 复制上述模板文件内容,并导入。
  1. 应用报警模板: 导入模板完成后,选择特定的报警模板,点击应用模板, 将想要开启的报警模板应用到对应的 Prometheus 实例。
  1. 配置监控告警规则。 以消费堆积报警 (HighGroupConsumerLag) 为例,点击「应用模板」后,选择对应的 Prometheus 实例,应用成功后,可以在左侧「告警规则列表」中看到已经启用的报警规则。
  1. 点击「编辑」,进入到报警规则的编辑页面,将 "example_topic" 和 "example_group" 更改为想要监控的 topic 和 consumer group,并将报警阈值(下图中的 10000)修改为期望的值。
  1. 配置通知策略。 编辑完成后,选择已有的通知策略,或点击「新建通知策略」进行创建。

  2. 快速复制报警策略(可选)。 如果想要对多个 Topic 或 Consumer Group 进行监控,可点击「复制」 创建多条报警规则。