S. No | Panel | Panel Description | Query | Query Description | Metrics Used | Metric Description | Expression Operating Value | Threshold Values
---|---|---|---|---|---|---|---|---
1 | Nodes Up | Number of nodes in the cluster. | count(redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"}) | Counts the instances reporting the uptime metric to determine the number of nodes in the cluster. | redpanda_application_uptime_seconds_total | Total time (in seconds) the Redpanda application has been running. | 5 | Alert when value < 5
2 | Storage Used | Percentage of storage used by Redpanda across all 5 nodes. | (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100 | Calculates the percentage of used disk space by subtracting the free-space ratio from 1. | redpanda_storage_disk_free_bytes, redpanda_storage_disk_total_bytes | Free and total disk space available to Redpanda, in bytes. | 0-100% | Alert when value > 85%
3 | Node Uptime | Uptime of each node since it last started or restarted. | redpanda_application_uptime_seconds_total / 86400 | Converts application uptime from seconds to days. | redpanda_application_uptime_seconds_total | Total uptime of the application, in seconds. | Varies per node | Alert when value < 300 s
4 | Topics | Number of topics created in the Redpanda cluster. | sum(redpanda_cluster_topics{job=~"$job"}) by ([[aggr_criteria]]) | Aggregates the total number of topics in the cluster by the specified criteria (e.g., instance, exported instance). | redpanda_cluster_topics | Total number of topics in the cluster. | Varies (e.g., 14) | Alert when value deviates from the expected count (14)
5 | Partitions | Number of partitions created across all topics. | sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="$job"})) | Sums the minimum partition value per topic across all topics. | redpanda_kafka_partitions | Total number of partitions in the cluster. | 96 | Alert when value deviates from the expected count (96)
6 | Storage Health | Monitors available free disk space to assess storage health. | max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"}) | Retrieves the maximum value of the free-disk-space alert metric. | redpanda_storage_disk_free_space_alert | Health status of storage based on available disk space. | 0: OK, 1: Low, 2: Degraded | Alert when value > 0
7 | Throughput | Rate of data transfer or request processing within Redpanda. | sum by ([[aggr_criteria]])(rate(redpanda_kafka_request_bytes_total[1m])) | Calculates data throughput (bytes/sec) over a 1-minute window, grouped by the specified criteria. | redpanda_kafka_request_bytes_total | Total bytes processed per second. | Min: 0, Max: 48.7 B/s | 20 ≤ x ≤ 80
8 | CPU Utilization | Percentage of CPU resources actively used by Redpanda. | avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[$__rate_interval])) | Measures the average CPU busy time per shard, adjusted dynamically by Grafana’s rate interval. | redpanda_cpu_busy_seconds_total | Total time (in seconds) CPU cores spend actively processing Redpanda tasks. | 0-100%, Current: 3.4% | N/A
9 | Allocated Memory | Percentage of total memory currently allocated by Redpanda. | sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_allocated_memory) + sum(redpanda_memory_free_memory)) | Calculates the ratio of allocated memory to total memory (allocated + free). | redpanda_memory_allocated_memory, redpanda_memory_free_memory | Allocated and free memory available for Redpanda operations. | 0-100%, Current: 40.5% | N/A
10 | Kafka RPC Active Connections | Number of active RPC (Remote Procedure Call) connections in the Redpanda Kafka service. | sum(redpanda_rpc_active_connections{job="redpanda_pssb_cluster_exporter"}) by ([[aggr_criteria]]) | Totals active RPC connections, grouped by the specified criteria. | redpanda_rpc_active_connections | Number of active Kafka RPC connections. | Min: 107, Max: 338 | 230 ≤ x ≤ 280
15 | Produce Latency | Latency of Kafka “produce” requests processed by Redpanda, focusing on 99th-percentile (p99) and 95th-percentile (p95) latency; shows current and previous-day latency for comparison. | histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",redpanda_request="produce",job="redpanda_pssb_cluster_exporter"}[5m])) by (le, [[aggr_criteria]])) | histogram_quantile computes the p99 (and p95) latency, aggregated over a 5-minute window; le is the upper bound of each latency bucket (e.g., 0.1 s, 0.5 s, 1 s). | redpanda_kafka_request_latency_seconds_bucket{redpanda_request="produce"} | | |
16 | Fetch Latency | Latency of Kafka “consume” requests processed by Redpanda, focusing on 99th-percentile (p99) and 95th-percentile (p95) latency; shows current and previous-day latency for comparison. | histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",job="redpanda_pssb_cluster_exporter",redpanda_request="consume"}[5m])) by (le, [[aggr_criteria]])) | histogram_quantile(0.99, ...) computes the latency below which 99% of “consume” requests fall; le is the upper bound of each latency bucket. | redpanda_kafka_request_latency_seconds_bucket{redpanda_request="consume"} | | |
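The thresholds above can be encoded as Prometheus alerting rules. A minimal sketch for the first two panels follows; the group name, alert names, and `for:` durations are assumptions, while the expressions are taken verbatim from the table:

```yaml
groups:
  - name: redpanda-pssb            # assumed group name
    rules:
      - alert: RedpandaNodesDown   # fires when fewer than 5 nodes report uptime
        expr: count(redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"}) < 5
        for: 5m
      - alert: RedpandaStorageHigh # fires when cluster-wide disk usage exceeds 85%
        expr: (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100 > 85
        for: 10m
```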
Condition: The number of active Redpanda application nodes does not match the expected count of 5. This may indicate that one or more nodes are down, unreachable, or misconfigured.

1. Check the service status on each node:
systemctl status redpanda
2. Query the uptime metric in Grafana or Prometheus to see which instances are reporting:
redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"}
3. Use the rpk tool to inspect cluster state:
rpk cluster health
rpk cluster info
4. Collect recent logs from the affected node:
journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
5. Verify the redpanda_pssb_cluster_exporter is running to ensure the metrics are being collected and sent correctly.
6. Check network reachability between nodes:
ping <node_ip>
telnet <node_ip> 9092
7. If a node is down, restart the service:
systemctl restart redpanda
8. Confirm Redpanda is listening on its expected ports:
netstat -tlnp | grep "redpanda"
tcp 0 0 172.21.0.63:9644 0.0.0.0:* LISTEN 751/redpanda
tcp 0 0 172.21.0.63:33145 0.0.0.0:* LISTEN 751/redpanda
tcp 0 0 172.21.0.63:9092 0.0.0.0:* LISTEN 751/redpanda
tcp 0 0 0.0.0.0:8082 0.0.0.0:* LISTEN 751/redpanda
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 751/redpanda
9. Review /etc/redpanda/redpanda.yaml for misconfigurations.
10. If the logs report an insufficient fs.aio-max-nr, update the system configuration to increase the asynchronous I/O limit: edit /etc/sysctl.conf and add or update:
fs.aio-max-nr = 1048576
Apply with sudo sysctl -p, verify with sysctl fs.aio-max-nr, and restart the Redpanda service:
systemctl restart redpanda
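The aio-max-nr check in step 10 can be scripted. This is a read-only sketch that compares the current limit against the value configured above and prints the remediation command instead of changing anything:

```shell
#!/bin/sh
# Compare the live asynchronous I/O limit against the value Redpanda needs.
required=1048576
current=$(cat /proc/sys/fs/aio-max-nr 2>/dev/null || echo 0)
if [ "$current" -lt "$required" ]; then
    result="fs.aio-max-nr=$current below $required; run: sudo sysctl -w fs.aio-max-nr=$required"
else
    result="fs.aio-max-nr=$current meets the required limit"
fi
echo "$result"
```

Remember that a value set with sysctl -w does not survive reboot; persist it in /etc/sysctl.conf as described above.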
Condition: Storage utilization exceeds 80%.
Query:
1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))
Path: /data/ms/rpc/redpanda/data
1. Check disk usage on the Redpanda data path:
df -h /data/ms/rpc/redpanda/data
2. Cross-check with the free-space ratio in Prometheus:
sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"})
3. Collect recent logs:
journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
4. Check cluster state:
rpk cluster health
rpk cluster info
5. Ensure /data/ms/rpc/redpanda/data is correctly mounted and has sufficient space. If attaching a new disk for /data:
mkfs.ext4 /dev/<new-disk>
mount /dev/<new-disk> /data
6. Restart the service:
systemctl restart redpanda
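The dashboard's used-space formula, (1 - free/total) * 100, can be mirrored locally with df as a quick cross-check. A sketch; it defaults to the runbook's data path and falls back to / so it runs anywhere:

```shell
#!/bin/sh
# Compute used-disk percentage the same way the Grafana panel does.
MOUNT=/data/ms/rpc/redpanda/data
[ -d "$MOUNT" ] || MOUNT=/   # fall back when the Redpanda path is absent
# df -Pk columns: filesystem, total-KB ($2), used, available ($4), use%, mount
used_pct=$(df -Pk "$MOUNT" | awk 'NR==2 { printf "%.1f", (1 - $4 / $2) * 100 }')
echo "$MOUNT used: $used_pct%"
if awk -v p="$used_pct" 'BEGIN { exit !(p > 85) }'; then
    echo "ALERT: above 85% threshold"
else
    echo "OK"
fi
```

Note that df's "available" column accounts for the reserved-blocks percentage, so the result can differ slightly from the Prometheus ratio.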
Condition: A Redpanda node has an uptime below 300 seconds, indicating a recent restart or stability issues.

1. Check when the service last started:
systemctl show redpanda -p ActiveEnterTimestamp
2. Inspect redpanda_application_uptime_seconds_total for the specific node in Grafana or Prometheus.
3. Check the service status:
systemctl status redpanda
4. Collect recent logs:
journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
5. If needed, restart the service:
systemctl restart redpanda
6. Use top or htop to monitor processes, and review /etc/redpanda/redpanda.yaml for misconfigurations.
7. If the asynchronous I/O limit is too low, raise it:
sysctl fs.aio-max-nr=1048576
sysctl -p
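The panel divides uptime seconds by 86400 to get days; the same conversion and the 300 s threshold can be applied to a raw metric sample. The uptime_s value below is an assumed example; substitute the live reading for the node:

```shell
#!/bin/sh
# Convert an uptime-seconds sample to days and apply the 300 s threshold.
uptime_s=259200   # assumed sample: 3 days
days=$(awk -v s="$uptime_s" 'BEGIN { printf "%.1f", s / 86400 }')
echo "uptime: $days days"
if [ "$uptime_s" -lt 300 ]; then
    echo "ALERT: node restarted within the last 5 minutes"
else
    echo "OK"
fi
```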
Condition: The number of topics in the Redpanda cluster is outside the expected range (0 to 14). This may indicate misconfigurations, topic deletions, or unexpected additions.

1. Check the topic count per cluster:
sum(redpanda_cluster_topics{job="redpanda_pssb_cluster_exporter"}) by (cluster)
2. List the topics and inspect the cluster:
rpk topic list
rpk cluster info
3. Review the service logs:
journalctl -xeu redpanda
4. To recreate a missing topic, use rpk (Redpanda’s command-line tool):
rpk topic create <topic_name>
For example, to create a topic named my_topic, the command would be:
rpk topic create my_topic
To set partition and replica counts at creation time:
rpk topic create <topic_name> --partitions <num_partitions> --replicas <num_replicas>
rpk topic create my_topic --partitions 3 --replicas 2
This creates my_topic with 3 partitions and 2 replicas.

Condition: The total number of Kafka partitions across topics does not match the expected count of 96.
Verify the Current Partition Count:
sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="redpanda_pssb_cluster_exporter"}))
List All Topics and Partition Information:
Use the rpk command to fetch the list of topics and their partition details:
rpk topic list
Cross-Check Partition Counts per Topic:
rpk cluster partitions list --all
Add Missing Partitions:
To add partitions to an existing topic, use rpk:
rpk topic add-partitions <topic_name> --num <additional_partitions>
To create a new topic with specific partition and replica counts:
rpk topic create <topic_name> --partitions <partition_count> --replicas <replica_count>
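When cross-checking, the 96-partition total is just the sum of a per-topic partition column. The sketch below sums such a column with awk; the table is a made-up sample in `rpk topic list` style, so pipe the real command's output through the awk filter instead of the heredoc:

```shell
#!/bin/sh
# Sum the second (PARTITIONS) column, skipping the header row.
total=$(awk 'NR > 1 { sum += $2 } END { print sum }' <<'EOF'
NAME       PARTITIONS  REPLICAS
my_topic   3           2
orders     32          3
events     61          3
EOF
)
echo "total partitions: $total"
if [ "$total" -eq 96 ]; then
    echo "matches expected count"
else
    echo "ALERT: expected 96"
fi
```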
Monitors the available free disk space on Redpanda storage, indicating whether disk space is within healthy limits.

Query:
max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"})
Values: 0: OK (sufficient disk space), 1: Low (disk space is running low), 2: Degraded (critical disk space issue).

1. Check disk usage on the data path:
df -h /data/ms/rpc/redpanda/data
2. Collect recent logs:
journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
3. Use the rpk tool to get a high-level view of the cluster’s health:
rpk cluster health
4. If space must be reclaimed, remove stale log files from the /data/ms/rpc/redpanda/data directory. Caution: Redpanda stores partition segments as .log files, so verify what will be removed before deleting:
find /data/ms/rpc/redpanda/data -type f -name "*.log" -delete
5. Restart the service:
systemctl restart redpanda
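A safer variant of the cleanup step is to list candidate files for review before adding -delete. The sketch below demonstrates this on a throwaway temp directory (so it is harmless to run as-is); point DATA_DIR at the real path in production and only append -delete after reviewing the list:

```shell
#!/bin/sh
# List *.log files older than 7 days; delete only after manual review.
DATA_DIR=$(mktemp -d)                  # stand-in for /data/ms/rpc/redpanda/data
touch "$DATA_DIR/recent.log"           # fresh file: should NOT be listed
touch -t 202001010000 "$DATA_DIR/stale.log"   # backdated file: should be listed
candidates=$(find "$DATA_DIR" -type f -name '*.log' -mtime +7)
echo "candidates for deletion:"
echo "$candidates"
# find "$DATA_DIR" -type f -name '*.log' -mtime +7 -delete   # run after review
rm -rf "$DATA_DIR"
```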
Query:
sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}) / (sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"}) + sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}))
Component queries:
sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"})
1. Check system memory with free -h, noting the used, free, and available columns.
2. Identify memory-heavy processes:
top -o %MEM
3. Collect logs for review:
journalctl -u redpanda > /tmp/redpanda_memory_logs_$(hostname).log
4. If needed, restart the service:
systemctl restart redpanda
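The ratio allocated / (allocated + free) from the query above is easy to sanity-check by hand. A sketch with assumed byte counts (chosen so the result lands on the panel's 40.5% reading):

```shell
#!/bin/sh
# Reproduce the Grafana allocated-memory ratio with sample values.
allocated=6815744    # assumed sample for sum(redpanda_memory_allocated_memory)
free_mem=10027008    # assumed sample for sum(redpanda_memory_free_memory)
pct=$(awk -v a="$allocated" -v f="$free_mem" 'BEGIN { printf "%.1f", 100 * a / (a + f) }')
echo "allocated: $pct%"
```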
Query:
avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[5m]))
1. Monitor live CPU usage:
top
2. Review recent service logs:
journalctl -u redpanda.service | tail -n 50
3. Identify the top CPU consumers:
ps aux --sort=-%cpu | head -10
4. If needed, restart the service:
systemctl restart redpanda
5. Review the logs provided by the C3 team and try to resolve any issues found.
6. If CPU pressure persists, consider:
Upgrade the Virtual Machine Resources:
Implement Horizontal Scaling:
Restart the Cluster:
sudo systemctl restart redpanda
Condition: When the Redpanda service on a node restarts, the number of active RPC connections rises above 300.

Query:
sum(redpanda_rpc_active_connections{job='redpanda_pssb_cluster_exporter'}) by (cluster)
Collect logs for review:
journalctl -u redpanda > /tmp/redpanda_memory_logs_$(hostname).log
Leaderless partitions occur when no broker in the cluster assumes the leader role for a partition, potentially causing availability and consistency issues.

1. Check Node Availability:
rpk cluster health
If a broker is down, restart it:
sudo systemctl restart redpanda
Verify network reachability:
ping <broker_ip>
Review recent logs:
journalctl -u redpanda -n 50
2. Inspect Logs for Leadership Issues:
rpk cluster logdirs describe
3. Check Resource Pressure:
free -h
df -h
top
4. Check Maintenance Mode:
rpk cluster maintenance status
rpk cluster maintenance enable <broker_id>
5. Restart and Verify:
sudo systemctl restart redpanda
sudo systemctl status redpanda
rpk cluster partitions list --all
6. If a broker must be removed, place it in maintenance mode and decommission it:
rpk cluster maintenance status
rpk cluster maintenance enable <nodeid>
rpk redpanda admin brokers decommission <node-id>
rpk redpanda admin brokers decommission-status <node-id>
If the decommission stalls, it can be forced:
rpk redpanda admin brokers decommission <node-id> --force
Confirm the broker list and cluster state:
rpk redpanda admin brokers list
rpk cluster info
To monitor the decommissioning status of a broker after initiating the decommission process, you can use the rpk redpanda admin brokers decommission-status
command. This command provides real-time information about the progress of the decommissioning operation for a specific broker.
Decommission the Broker:
rpk redpanda admin brokers decommission <broker-id>
This initiates the decommissioning process for the specified broker; the examples below use broker ID 3.
Monitor the Decommission Status: While the decommissioning is in progress, you can monitor its status using:
rpk redpanda admin brokers decommission-status <broker-id>
This command displays the current state of the decommissioning process for the broker, including details such as:
Repeat Monitoring (if necessary):
You can run the decommission-status command periodically to track progress until the broker is fully decommissioned. Example output while decommissioning is in progress:
$ rpk redpanda admin brokers decommission-status 3
Broker ID: 3
Status: In Progress
Data Redistribution: 65% complete
Remaining Tasks: 3
Errors: None
Once the broker is fully decommissioned, the output will indicate that the process is complete:
$ rpk redpanda admin brokers decommission-status 3
Broker ID: 3
Status: Complete
Data Redistribution: 100% complete
Remaining Tasks: 0
Errors: None
Replace <broker-id> in the commands above with the ID of the broker being decommissioned. Incorporating the decommission-status command into your workflow gives better visibility and control over the decommissioning process, ensuring a smooth and successful operation.
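The repeat-monitoring step above naturally becomes a polling loop. This is only a sketch of the loop's shape: a stub function stands in for the real `rpk redpanda admin brokers decommission-status` call so it can run without a live cluster; replace the stub body with the real command:

```shell
#!/bin/sh
# Poll decommission status until completion (stubbed for illustration).
status_cmd() {
    # Stub: always reports completion. Real call:
    #   rpk redpanda admin brokers decommission-status "$1"
    echo "Status: Complete"
}
broker_id=3
state=""
for attempt in 1 2 3 4 5; do
    out=$(status_cmd "$broker_id")
    echo "attempt $attempt: $out"
    case "$out" in
        *Complete*) state=done; echo "broker $broker_id decommissioned"; break ;;
    esac
    sleep 1   # use a longer interval (e.g. 30 s) against a real cluster
done
```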
When you’re running a monolithic (non-Kubernetes) Redpanda cluster, joining a new node to the existing cluster involves several critical steps and precautions to ensure smooth operation and avoid data inconsistency or cluster instability.
1. Version Compatibility
2. Node Preparation
3. Configure the New Node’s redpanda.yaml
In your Redpanda config (/etc/redpanda/redpanda.yaml), ensure:
Seed servers point to one or more existing nodes:
seed_servers:
- host:
address: <existing-node1-ip>
port: 33145
- host:
address: <existing-node2-ip>
port: 33145
Advertised addresses are correctly set:
advertised_rpc_api:
address: <new-node-ip>
port: 33145
advertised_kafka_api:
address: <new-node-ip>
port: 9092
4. Assign a Unique Node ID
Give each node a unique node_id. If node_id is left unset (e.g., -1), Redpanda may assign it dynamically, which is risky for cluster integrity.
5. Bootstrap the Node Properly
If the node previously held data, clear its data directory before the first start:
rm -rf /var/lib/redpanda/data/*
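Before the first start, it is worth sanity-checking that node_id is set and non-negative. A sketch; the sample redpanda.yaml fragment is inlined purely for illustration, so point the awk line at /etc/redpanda/redpanda.yaml on the new node instead:

```shell
#!/bin/sh
# Extract node_id from a (sample) redpanda.yaml and validate it.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
redpanda:
  node_id: 5
  data_directory: /var/lib/redpanda/data
EOF
id=$(awk '$1 == "node_id:" { print $2 }' "$cfg")
if [ -z "$id" ] || [ "$id" -lt 0 ] 2>/dev/null; then
    echo "node_id unset or invalid; assign a unique non-negative node_id"
else
    echo "node_id=$id looks valid"
fi
rm -f "$cfg"
```

This catches the unset/-1 case described above but does not verify uniqueness across the cluster; compare against the output of rpk redpanda admin brokers list for that.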
6. Validate Cluster Health Before Adding
Run rpk cluster info on an existing node and confirm the cluster is healthy before adding the new node.
7. Start the New Node
Once the config is set, start the new Redpanda node:
sudo systemctl start redpanda
Then confirm it joined the cluster:
rpk cluster info
8. Monitor Node Integration
rpk cluster metadata
rpk topic describe <topic>
rpk partition describe <topic> -p <partition>
9. Rebalance the Cluster (Optional but Recommended)
If the new node doesn’t automatically get assigned partition replicas:
rpk cluster rebalance
This spreads partition replicas across the nodes.
10. Set Up Monitoring and Alerting
11. Backup Configuration