
Introduction

In modern enterprises, as demand for data processing grows, AutoMQ has gradually become a key component for real-time data processing thanks to its efficiency and cost-effectiveness as a stream processing system. However, as cluster size expands and business complexity increases, ensuring the stability, high availability, and performance of an AutoMQ cluster becomes crucial. Integrating a robust, comprehensive monitoring system is therefore essential for keeping the cluster healthy. The Nightingale Monitoring System, with its efficient data collection, flexible alert management, and rich visualization capabilities, is an ideal choice for enterprises monitoring AutoMQ clusters. With Nightingale, enterprises can observe the operational status of an AutoMQ cluster in real time, promptly identify and resolve potential issues, optimize system performance, and ensure business continuity and stability.

Overview of AutoMQ

AutoMQ is a stream processing system redesigned for the cloud that maintains 100% compatibility with Apache Kafka while significantly improving cost efficiency and elasticity by offloading storage to object storage. Specifically, AutoMQ builds the S3Stream storage library on shared cloud storage such as EBS and S3 provided by cloud vendors, offering low-cost, low-latency, highly available, highly reliable, and virtually unlimited stream storage. Compared with the traditional shared-nothing architecture, AutoMQ adopts a shared-storage architecture, significantly reducing storage and operational complexity while enhancing system elasticity and reliability.

The design philosophy and technical advantages of AutoMQ make it an ideal choice for replacing existing Kafka clusters in enterprises. By adopting AutoMQ, enterprises can significantly reduce storage costs, simplify operations, and achieve automatic cluster scaling and traffic self-balancing, thereby more efficiently responding to changes in business demands. Additionally, AutoMQ’s architecture supports efficient cold read operations and zero service interruption, ensuring stable system operation under high load and sudden traffic conditions. Its storage structure is as follows:

Overview of Nightingale

The Nightingale Monitoring System (Nightingale) is an open-source, cloud-native observability and analysis tool that adopts an All-in-One design philosophy, integrating data collection, visualization, monitoring alerts, and data analysis into one platform. Its main advantages include efficient data collection capabilities, flexible alert strategies, and rich visualization features. Nightingale is closely integrated with various cloud-native ecosystems, supporting multiple data sources and storage backends, and providing low-latency, high-reliability monitoring services. By using Nightingale, enterprises can achieve comprehensive monitoring and management of complex distributed systems, quickly identify and resolve issues, thereby optimizing system performance and enhancing business continuity.

Prerequisites

To monitor the status of the cluster, you will need the following setup:

  • Deploy an available AutoMQ node/cluster and open the Metrics collection port

  • Deploy Nightingale monitoring and its dependencies

  • Deploy Prometheus to collect Metrics data

Deploy AutoMQ, Prometheus, and Nightingale Monitoring

Deploy AutoMQ

Refer to the AutoMQ documentation: Cluster Deployment | AutoMQ [5]. Before starting the deployment, add the following configuration parameters to enable the Prometheus pull interface. After starting the AutoMQ cluster with these parameters, each node additionally exposes an HTTP endpoint from which AutoMQ monitoring metrics can be pulled. The metrics follow the Prometheus exposition format.


bin/kafka-server-start.sh ... \
--override s3.telemetry.metrics.exporter.type=prometheus \
--override s3.metrics.exporter.prom.host=0.0.0.0 \
--override s3.metrics.exporter.prom.port=8890 \
...

Once AutoMQ monitoring metrics are enabled, you can pull Prometheus-format metrics over HTTP from any node at http://{node_ip}:8890. An example response is as follows:


....
kafka_request_time_mean_milliseconds{otel_scope_name="io.opentelemetry.jmx",type="DescribeDelegationToken"} 0.0 1720520709290
kafka_request_time_mean_milliseconds{otel_scope_name="io.opentelemetry.jmx",type="CreatePartitions"} 0.0 1720520709290
...

Regarding the introduction to metrics, you can refer to the AutoMQ official documentation: Metrics | AutoMQ [6].
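Since the endpoint returns plain Prometheus exposition text, a few lines of code are enough to spot-check it. Below is a minimal, illustrative parser for the simple `name{labels} value timestamp` shape shown above; it is a sketch, not a full exposition-format parser (no escape handling, HELP/TYPE lines, or histogram buckets):

```python
import re

def parse_prometheus_line(line: str):
    """Parse one sample line into (metric name, labels dict, value).

    Only handles the simple 'name{k="v",...} value [timestamp]' shape;
    label values containing commas or escapes are not supported.
    """
    m = re.match(
        r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
        r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)',
        line,
    )
    if m is None:
        return None
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            k, _, v = pair.partition("=")
            labels[k.strip()] = v.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

# Sample line taken from the example response above
sample = ('kafka_request_time_mean_milliseconds'
          '{otel_scope_name="io.opentelemetry.jmx",type="CreatePartitions"}'
          ' 0.0 1720520709290')
name, labels, value = parse_prometheus_line(sample)
print(name, labels["type"], value)
```

In practice you would feed it the body of an HTTP GET against the node's metrics port; Prometheus itself, configured below, does this scraping for you.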

Deploy Prometheus

Prometheus can be deployed by downloading the binary package or using Docker. Below is an introduction to these two deployment methods.

Binary Deployment

For ease of use, you can create a new script and modify the Prometheus download version as needed. Then, execute the script to complete the deployment. First, create a new script:


cd /home
vim install_prometheus.sh
# !!! Paste the script content below, then save and exit
# Grant execute permission
chmod +x install_prometheus.sh
# Run the script
./install_prometheus.sh


#!/bin/bash
version=2.45.3
filename=prometheus-${version}.linux-amd64
mkdir -p /opt/prometheus
wget https://github.com/prometheus/prometheus/releases/download/v${version}/${filename}.tar.gz
tar xf ${filename}.tar.gz
cp -far ${filename}/* /opt/prometheus/

# Config as a Service
cat <<EOF >/etc/systemd/system/prometheus.service
[Unit]
Description="prometheus"
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple

ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/opt/prometheus/data --web.enable-lifecycle --web.enable-remote-write-receiver

Restart=on-failure
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus


[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable prometheus
systemctl restart prometheus
systemctl status prometheus

Subsequently, modify the Prometheus configuration file to add a task to collect observability data from AutoMQ and restart Prometheus by executing the following command:


# Edit the configuration file and add the content below
vim /opt/prometheus/prometheus.yml
# Restart Prometheus
systemctl restart prometheus

Refer to the following configuration file content. Please change client_ip to the address exposing observability data from AutoMQ:


# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "automq"
    static_configs:
      - targets: ["{client_ip}:8890"]

After deployment, you can visit http://{client_ip}:9090/targets in a browser to verify that Prometheus is indeed collecting AutoMQ's metrics data:
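The same check can be done from a script instead of a browser, since Prometheus exposes its scrape targets via the HTTP API at /api/v1/targets. A small sketch (the base URL and the `automq` job name match the configuration above; adjust to your environment):

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # replace localhost with your {client_ip}

def automq_targets(api_response: dict) -> list:
    """Extract (scrape URL, health) for every target in the 'automq' scrape job."""
    return [
        (t["scrapeUrl"], t["health"])
        for t in api_response["data"]["activeTargets"]
        if t["labels"].get("job") == "automq"
    ]

def fetch_targets(base_url: str = PROMETHEUS_URL) -> dict:
    """Query the Prometheus targets API; requires network access to Prometheus."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)

# On a host that can reach Prometheus, uncomment:
# for url, health in automq_targets(fetch_targets()):
#     print(url, health)  # expect health == "up" for each AutoMQ node
```

A target reported as anything other than "up" usually means the AutoMQ metrics port (8890 above) is unreachable from the Prometheus host.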

Docker Deployment

If you already have a running Prometheus Docker container, please execute the following command to remove the container:


docker stop prometheus
docker rm prometheus

Create a new configuration file and mount it during Docker startup:


mkdir -p /opt/prometheus
vim /opt/prometheus/prometheus.yml
# Use the configuration shown in the "Binary Deployment" section above

Start the Docker container:


docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -m 500m \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=otlp-write-receiver \
  --web.enable-remote-write-receiver

This gives you a Prometheus service that collects AutoMQ Metrics. For more information on integrating AutoMQ Metrics with Prometheus, refer to: Integrating Metrics into Prometheus | AutoMQ [7].

Deploy Nightingale Monitoring

Nightingale Monitoring can be deployed in the following three ways. For more detailed deployment instructions, refer to the official documentation [8]:

  • Deploy using Docker Compose

  • Deploy using binary

  • Helm Deployment

Next, I will walk through deployment using the binary method.

Download Nightingale

Select the appropriate version from the Nightingale GitHub releases page [9]. Here, we will use version v7.0.0-beta.14. On an amd64 (x86-64) machine, you can execute the following commands directly:


cd /home
# Download the release package
wget https://github.com/ccfos/nightingale/releases/download/v7.0.0-beta.14/n9e-v7.0.0-beta.14-linux-amd64.tar.gz
mkdir -p /home/flashcat
# Extract the files into the /home/flashcat folder
tar -xzf /home/n9e-v7.0.0-beta.14-linux-amd64.tar.gz -C /home/flashcat
# Navigate to the installation directory
cd /home/flashcat

Configure the Dependency Environment

Nightingale depends on MySQL and Redis, so you need to install them first. You can deploy them via Docker or with commands like the following:


# Install MySQL (MariaDB)
yum -y install mariadb*
systemctl enable mariadb
systemctl restart mariadb
mysql -e "SET PASSWORD FOR 'root'@'localhost' = PASSWORD('1234');"

# Install Redis
yum install -y redis
systemctl enable redis
systemctl restart redis

Here, Redis is configured without a password, and the MySQL root password is set to 1234. If you use a different password, you must update the Nightingale configuration file accordingly so Nightingale can connect to your database. Modify the Nightingale configuration file:


vim /home/flashcat/etc/config.toml

Modify the username and password under [DB]:

[DB]
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
# postgres: DSN="host=127.0.0.1 port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
# sqlite: DSN="/path/to/filename.db"
DSN = "{username}:{password}@tcp(127.0.0.1:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
# enable debug mode or not

Import Database Tables

Execute the following command from the /home/flashcat directory, where n9e.sql was extracted:


mysql -uroot -p1234 < n9e.sql

Use a database tool to check if the database tables were successfully imported:


> show databases;
+--------------------+
| Database           |
+--------------------+
| n9e_v6             |
+--------------------+

> show tables;
+-----------------------+
| Tables_in_n9e_v6      |
+-----------------------+
| alert_aggr_view       |
| alert_cur_event       |
| alert_his_event       |
| alert_mute            |
| alert_rule            |
| alert_subscribe       |
| alerting_engines      |
| board                 |
| board_busigroup       |
| board_payload         |
| builtin_cate          |
| builtin_components    |
| builtin_metrics       |
......

Modify the Nightingale Configuration File

You need to modify the Nightingale configuration file to set up the Prometheus data source:


vim /home/flashcat/etc/config.toml

# Modify the [[Pushgw.Writers]] Section to
[[Pushgw.Writers]]
# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
Url = "http://{client_ip}:9090/api/v1/write"

Start Nightingale

In the root directory of Nightingale, /home/flashcat, execute: ./n9e. Once successfully started, you can access it in your browser at http://{client_ip}:17000. The default login credentials are:

  • Username: root

  • Password: root.2020

Integrate Prometheus Data Source

In the left sidebar, select Integration -> Data Source -> Prometheus.

At this point, our Nightingale monitoring setup is complete.

Nightingale Monitoring AutoMQ Cluster Status

Next, I will introduce some of the features provided by Nightingale Monitoring to help you better understand the available functionalities when integrating Nightingale with AutoMQ.

Instant Query

Select built-in AutoMQ metrics:

You can try querying some data, such as the 50th-percentile (median) request processing time, kafka_request_time_50p_milliseconds:

Additionally, you can customize some metrics and use expressions to aggregate these metrics:
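For example, hypothetical aggregation expressions over the request-time metrics shown earlier might look like the following (they assume the `type` label seen in the sample output; adjust to the labels present in your own metrics):

```
# Median request latency per request type, averaged across broker instances
avg by (type) (kafka_request_time_50p_milliseconds)

# The five request types with the highest mean latency
topk(5, kafka_request_time_mean_milliseconds)
```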

Alerting Feature

In the left sidebar, select Alert -> Alert Rules -> New Rule. For example, we can set an alert on kafka_network_io_bytes_total. This metric indicates the total number of bytes sent or received over the network by a Kafka broker node. With an expression over this metric, you can compute the inbound network I/O rate of the broker nodes. The expression is:


sum by(job, instance) (rate(kafka_network_io_bytes_total{direction="in"}[1m]))
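To build intuition for what this expression alerts on: `rate(...[1m])` estimates the per-second increase of a counter over the time window (real PromQL additionally handles counter resets and extrapolates to the window boundaries, which this sketch omits). With two illustrative, hand-made samples:

```python
# Simplified view of rate(kafka_network_io_bytes_total{direction="in"}[1m]):
# the per-second increase of the counter across the window.
samples = [
    (1720520700, 10_000_000),  # (unix seconds, cumulative bytes received)
    (1720520760, 70_000_000),
]
(t0, v0), (t1, v1) = samples
rate_bytes_per_sec = (v1 - v0) / (t1 - t0)
print(rate_bytes_per_sec)  # 1000000.0 bytes/s over the 60 s window
```

The outer `sum by(job, instance)` then adds these per-series rates together per broker, so the alert threshold applies to each node's total inbound throughput.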

Setting Alert Rules:

Data preview:

You can also configure groups to be notified when an alert occurs:

After creating the alert, let's simulate a high-concurrency message processing scenario: a total of 2,500,000 messages sent to the AutoMQ nodes in a short period. I use the Kafka Java client to send messages to 50 topics, 500 messages per topic per round, repeated for 100 rounds. Here is an example:


import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class KafkaTest {

    private static final String BOOTSTRAP_SERVERS = "{broker_ip}:9092"; // your AutoMQ broker address (host:port, no http:// prefix)
    private static final int NUM_TOPICS = 50;
    private static final int NUM_MESSAGES = 500;

    public static void main(String[] args) throws Exception {
        KafkaTest test = new KafkaTest();
        // test.createTopics(); // create the 50 topics on the first run
        for (int i = 0; i < 100; i++) {
            test.sendMessages(); // each round sends 500 messages to each of the 50 topics (25,000 in total)
        }
    }

    public void createTopics() {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);

        try (AdminClient adminClient = AdminClient.create(props)) {
            List<NewTopic> topics = new ArrayList<>();
            for (int i = 1; i <= NUM_TOPICS; i++) {
                topics.add(new NewTopic("Topic-" + i, 1, (short) 1));
            }
            adminClient.createTopics(topics).all().get();
            System.out.println("Topics created successfully");
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
    }

    public void sendMessages() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 1; i <= NUM_TOPICS; i++) {
                String topic = "Topic-" + i;
                for (int j = 1; j <= NUM_MESSAGES; j++) {
                    String key = "key-" + j;
                    String value = "{\"userId\": " + j + ", \"action\": \"visit\", \"timestamp\": " + System.currentTimeMillis() + "}";
                    ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
                    producer.send(record, (RecordMetadata metadata, Exception exception) -> {
                        if (exception == null) {
                            System.out.printf("Sent message to topic %s partition %d with offset %d%n", metadata.topic(), metadata.partition(), metadata.offset());
                        } else {
                            exception.printStackTrace();
                        }
                    });
                }
            }
            System.out.println("Messages sent successfully");
        }
    }
}

We can then see the alert information in the Nightingale console:

Alert Details:

Dashboard

First, we can use known metrics to build our own dashboard. Below is a dashboard showing statistics on AutoMQ message processing time, total messages, and network IO bits:

Additionally, we can use the official built-in dashboards for monitoring. In the left sidebar, go to Aggregation -> Template Center:

After choosing AutoMQ, you will see several dashboards available:

We select the Topic Metrics dashboard, displaying the following content:

This dashboard shows the AutoMQ cluster's message input and output utilization, message input and request rates, and message sizes over a recent period. These metrics help you monitor and optimize the performance and stability of the cluster:

  • Message input and output utilization: assess the load on producers and consumers, ensuring the cluster can handle the message flow properly.

  • Message input rate: monitor in real time the rate at which producers send messages, identifying potential bottlenecks or sudden traffic spikes.

  • Request rate: understand the frequency of client requests in order to optimize resource allocation and processing capacity.

  • Message size: analyze the average message size to tune configurations for storage and network transmission efficiency.

Monitoring these metrics allows you to detect and resolve performance issues promptly, ensuring the efficient and stable operation of the AutoMQ cluster.

This concludes our integration process. For more usage methods, you can refer to Nightingale's official documentation [10] for further experience.

Summary

Through this article, we have detailed how to comprehensively monitor an AutoMQ cluster using the Nightingale monitoring system. We started with the basic concepts of AutoMQ and Nightingale and gradually explained how to deploy AutoMQ, Prometheus, and Nightingale, and configure monitoring and alerting rules. With this integration, enterprises can monitor the running status of the AutoMQ cluster in real-time, promptly discover and resolve potential issues, optimize system performance, and ensure business continuity and stability. Nightingale's powerful data collection capabilities, flexible alerting mechanisms, and rich visualization features make it an ideal choice for monitoring complex distributed systems. We hope this article provides valuable references for your practical applications, helping your system operations become more efficient and stable.

References

[1] AutoMQ: https://www.automq.com/zh

[2] Nightingale Monitoring: https://flashcat.cloud/docs/content/flashcat-monitor/nightingale-v7/introduction/

[3] Nightingale Architecture: https://flashcat.cloud/docs/content/flashcat-monitor/nightingale-v7/introduction/

[4] Prometheus: https://prometheus.io/docs/prometheus/latest/getting_started/

[5] Cluster Deployment | AutoMQ: https://docs.automq.com/automq/getting-started/cluster-deployment-on-linux

[6] Metrics | AutoMQ: https://docs.automq.com/automq/observability/metrics

[7] Integrating Metrics into Prometheus | AutoMQ: https://docs.automq.com/automq/observability/integrating-metrics-with-prometheus

[8] Deployment Instructions: https://flashcat.cloud/docs/content/flashcat-monitor/nightingale-v7/install/intro/

[9] Nightingale GitHub Releases: https://github.com/ccfos/nightingale

[10] Nightingale Official Documentation: https://flashcat.cloud/docs/content/flashcat-monitor/nightingale-v7/overview/