Redpanda Observability

Redpanda Monitoring Panel Documentation

Each panel is documented below with its query, the metrics it uses, its typical operating value, and its alert threshold.

1. Nodes Up
  • Panel Description: Number of nodes in the cluster.
  • Query: count(redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"})
  • Query Description: Counts the per-node uptime series reported by the Redpanda application to determine the number of nodes in the cluster.
  • Metrics Used: redpanda_application_uptime_seconds_total
  • Metric Description: Total time (in seconds) that the Redpanda application has been running.
  • Operating Value: 5
  • Threshold: Less than 5 raises an alert.

2. Storage Used
  • Panel Description: Storage used by Redpanda across all nodes (5 nodes).
  • Query: (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100
  • Query Description: Calculates the percentage of used disk space by subtracting the free-space ratio from 1.
  • Metrics Used: redpanda_storage_disk_free_bytes, redpanda_storage_disk_total_bytes
  • Metric Description: Free and total disk space available to Redpanda, in bytes.
  • Operating Value: 0-100%
  • Threshold: 85%

3. Node Uptime
  • Panel Description: Uptime for each node since its last restart.
  • Query: redpanda_application_uptime_seconds_total / 86400
  • Query Description: Converts the application uptime from seconds to days.
  • Metrics Used: redpanda_application_uptime_seconds_total
  • Metric Description: Total uptime of the application.
  • Operating Value: Varies for each node
  • Threshold: Value > 300 s

4. Topics
  • Panel Description: Number of topics created in the Redpanda cluster.
  • Query: sum(redpanda_cluster_topics{job=~"$job"}) by ([[aggr_criteria]])
  • Query Description: Aggregates the total number of topics in the cluster by the selected criteria (e.g., instance, exported instance).
  • Metrics Used: redpanda_cluster_topics
  • Metric Description: Total number of topics in the cluster.
  • Operating Value: Varies (e.g., 14)
  • Threshold: Alert when the value moves outside the expected count (currently 14).

5. Partitions
  • Panel Description: Number of partitions created across all topics.
  • Query: sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="$job"}))
  • Query Description: Sums the minimum partition count reported per topic across all topics.
  • Metrics Used: redpanda_kafka_partitions
  • Metric Description: Total number of partitions in the cluster.
  • Operating Value: 96
  • Threshold: Alert when the value deviates from the expected count of 96.

6. Storage Health
  • Panel Description: Monitors available free disk space to assess storage health.
  • Query: max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"})
  • Query Description: Retrieves the maximum value of the free-disk-space alert metric across nodes.
  • Metrics Used: redpanda_storage_disk_free_space_alert
  • Metric Description: Health status of storage based on available disk space.
  • Operating Value: 0: OK, 1: Low, 2: Degraded
  • Threshold: Value > 0

7. Throughput
  • Panel Description: Rate of data transfer or request processing within Redpanda.
  • Query: sum by ([[aggr_criteria]])(rate(redpanda_kafka_request_bytes_total[1m]))
  • Query Description: Calculates data throughput (bytes/sec) over a 1-minute window, grouped by the selected criteria.
  • Metrics Used: redpanda_kafka_request_bytes_total
  • Metric Description: Total bytes processed per second.
  • Operating Value: Min: 0, Max: 48.7 B/s
  • Threshold: 20 ≤ x ≤ 80

8. CPU Utilization
  • Panel Description: Percentage of CPU resources actively used by Redpanda.
  • Query: avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[$__rate_interval]))
  • Query Description: Measures the average CPU busy time per shard, adjusted dynamically by Grafana’s rate interval.
  • Metrics Used: redpanda_cpu_busy_seconds_total
  • Metric Description: Total time (in seconds) CPU cores spend actively processing Redpanda tasks.
  • Operating Value: 0-100%, current: 3.4%
  • Threshold: N/A

9. Allocated Memory
  • Panel Description: Percentage of total memory currently allocated by Redpanda.
  • Query: sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_allocated_memory) + sum(redpanda_memory_free_memory))
  • Query Description: Calculates the ratio of allocated memory to total memory (allocated + free).
  • Metrics Used: redpanda_memory_allocated_memory, redpanda_memory_free_memory
  • Metric Description: Allocated and free memory available for Redpanda operations.
  • Operating Value: 0-100%, current: 40.5%
  • Threshold: N/A

10. Kafka RPC Active Connections
  • Panel Description: Number of active RPC (Remote Procedure Call) connections in the Redpanda Kafka service.
  • Query: sum(redpanda_rpc_active_connections{job="redpanda_pssb_cluster_exporter"}) by ([[aggr_criteria]])
  • Query Description: Calculates the total number of active RPC connections, grouped by the selected criteria.
  • Metrics Used: redpanda_rpc_active_connections
  • Metric Description: Number of active Kafka RPC connections.
  • Operating Value: Min: 107, Max: 338
  • Threshold: 230 ≤ x ≤ 280

15. Produce Latency
  • Panel Description: Latency of Kafka “produce” requests processed by Redpanda, focusing on the 99th percentile (p99) and 95th percentile (p95) latency. Shows current and previous-day latency for comparison.
  • Query: histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",redpanda_request="produce",job="redpanda_pssb_cluster_exporter"}[5m])) by (le, [[aggr_criteria]]))
  • Query Description: The histogram_quantile function calculates the 99th percentile (and, with 0.95, the 95th percentile) of latency, aggregated over a 5-minute window. le is the upper bound of each latency bucket (e.g., 0.1 s, 0.5 s, 1 s).
  • Metrics Used: redpanda_kafka_request_latency_seconds_bucket {request="produce"}

16. Fetch Latency
  • Panel Description: Latency of Kafka “consume” requests processed by Redpanda, focusing on the 99th percentile (p99) and 95th percentile (p95) latency. Shows current and previous-day latency for comparison.
  • Query: histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",job="redpanda_pssb_cluster_exporter",redpanda_request="consume"}[5m])) by (le, [[aggr_criteria]]))
  • Query Description: histogram_quantile(0.99, ...) computes the latency of the slowest 1% of Kafka “consume” requests. le is the upper bound of each latency bucket, capturing requests below specific thresholds.
  • Metrics Used: redpanda_kafka_request_latency_seconds_bucket {request="consume"}

Alerts and C3 Procedures


Nodes Up

Alert Name: Nodes Up

The number of active Redpanda application nodes does not match the expected count of 5. This may indicate one or more nodes being down, unavailable, or misconfigured.

C3 Data Collection

1. Node-Level Information

  1. Check if the Redpanda Service is Running on All Nodes:
    • Use the following command to check the status of the Redpanda service on all nodes:
      systemctl status redpanda
      
  2. Uptime Metrics
    • Query the Redpanda Application Uptime:
      To monitor the uptime of the Redpanda application, use the following PromQL query in Prometheus or Grafana for specific nodes:
      redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"}
      

2. Cluster-Level Data

  1. Cluster Health
    • Check the cluster health status using Redpanda’s built-in tools or the Grafana dashboard.
      If using the rpk tool:
      rpk cluster health
      
  2. Node Membership
    • Verify the list of nodes registered in the cluster:
      rpk cluster info
      

3. Logs

  1. Redpanda Logs
    • Gather logs from all nodes to identify errors or failures:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      
  2. Exporter Logs
    • Check logs for the redpanda_pssb_cluster_exporter to ensure the metrics are being collected and sent correctly.
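    • As a sketch (the exporter’s systemd unit name is assumed to match the job name; adjust to your deployment), the exporter status and recent logs can be checked with:
      systemctl status redpanda_pssb_cluster_exporter
      journalctl -u redpanda_pssb_cluster_exporter --since "30 minutes ago"
      
    • Redpanda also serves its metrics on the admin API port, so scraping can be verified directly on a node (the endpoint is assumed to be the default /public_metrics on port 9644):
      curl -s http://localhost:9644/public_metrics | grep redpanda_application_uptime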

4. Network Diagnostics

  1. Connectivity Checks
    • Verify network connectivity between nodes:
      ping <node_ip>
      
      or
      telnet <node_ip> 9092
      
  2. Firewall Rules
    • Ensure no firewall or security rules are blocking communication.
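    • A minimal sketch, assuming firewalld or plain iptables manages these hosts (adapt to your environment), to confirm the Redpanda ports are not being filtered:
      firewall-cmd --list-all
      iptables -L -n | grep -E "9092|9644|33145"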

C3 Remedy to Solve

  1. Restart the Redpanda Service on Down Nodes
    • Restart the Redpanda service on any down nodes:
      systemctl restart redpanda
      
  2. Resolve Network Issues
    • Ensure nodes can communicate with each other.
    • Adjust firewall rules if necessary.
  3. Check if All Redpanda Ports are Listening
    • Verify that all necessary Redpanda ports are listening. Use the following command to check the status of the Redpanda ports:
      netstat -tlnp | grep "redpanda"
      
    • Example output:
      tcp        0      0 172.21.0.63:9644        0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 172.21.0.63:33145       0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 172.21.0.63:9092        0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 0.0.0.0:8082            0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      751/redpanda 
      
    • Ensure all the relevant ports (9644, 33145, 9092, 8082, 8081) are in the LISTEN state.

DevOps Remedy

  1. Deeper Analysis
    • Review the data collected by C3, focusing on:
      • Logs: Check for errors or warnings in Redpanda logs and exporter logs.
      • Grafana Metrics: Look for signs of service crashes, high resource usage, or network interruptions.
  2. Reconfigure Nodes
    • Correct any misconfigurations in the node settings file:
      /etc/redpanda/redpanda.yaml
      
  3. Adjust Asynchronous I/O Limit and Restart Redpanda:
    • If the Redpanda logs indicate issues related to fs.aio-max-nr, update the system configuration to increase the asynchronous I/O limit:
      • Edit /etc/sysctl.conf and add or update:
        fs.aio-max-nr = 1048576
        
      • Apply the changes using sudo sysctl -p, verify with sysctl fs.aio-max-nr, and restart the Redpanda service:
        systemctl restart redpanda
        

Redpanda Storage Usage Alert

Alert Name: Storage Used

Condition: Storage utilization exceeds 80%.
Query:

1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))

Redpanda Data Directory

Path: /data/ms/rpc/redpanda/data

C3 Data Collection

  1. Check Storage Usage:
    • Run the following command to check disk usage of the Redpanda data directory:
      df -h /data/ms/rpc/redpanda/data
      
  2. Collect Storage Metrics:
    • Query the storage metrics using PromQL:
      sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"})
      
  3. Redpanda Logs:
    • Gather logs for disk-related errors:
      journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
      
  4. Cluster Health:
    • Verify the cluster’s health and info:
      rpk cluster health
      rpk cluster info
      

C3 Remedy to Solve

  1. Free Up Disk Space:
    • Remove unnecessary files or temporary data from /data/ms/rpc/redpanda/data (see the example after this list for identifying the largest files first).
  2. Verify Mount Points:
    • Ensure /data/ms/rpc/redpanda/data is correctly mounted and has sufficient space.
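
  To support step 1 above, a quick sketch for locating the largest files and directories under the Redpanda data path before deciding what can safely be removed (paths are those already used in this runbook):
    du -sh /data/ms/rpc/redpanda/data/*
    du -ah /data/ms/rpc/redpanda/data | sort -rh | head -20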

DevOps Remedy

  1. Add or Expand Storage:
    • Attach new storage, format it, and mount it to /data:
      mkfs.ext4 /dev/<new-disk>
      mount /dev/<new-disk> /data
      
  2. Optimize Disk Space Usage:
    • Archive or compress older data within the directory to reclaim space.
  3. Restart Redpanda Service:
    • Restart Redpanda after making changes:
      systemctl restart redpanda
      
  4. Plan for Future Scaling:
    • Set up Grafana alerts and monitoring for predictive scaling to avoid storage saturation.
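
  A minimal sketch of a Prometheus alerting rule for this threshold, reusing the panel query from this runbook (the group name, alert name, severity label, and 10m hold time are assumptions to adapt to your alerting setup):
    groups:
      - name: redpanda-storage
        rules:
          - alert: RedpandaStorageUsedHigh
            expr: |
              (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"})
                  / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Redpanda storage usage above 80%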

Redpanda Node Uptime Below Threshold

Alert Name: Node Uptime

The Redpanda nodes have an uptime below 300 seconds, indicating potential restarts or stability issues.

C3 Data Collection

  1. Node-Level Information:
    • To check when the Redpanda service last started (its effective uptime at the OS level), use:
    systemctl show redpanda -p ActiveEnterTimestamp
    
    • To check the uptime metric reported by the Redpanda exporter (from the monitoring stack), query redpanda_application_uptime_seconds_total for the specific node in Grafana or Prometheus.
  2. Service Status:
    • Verify if the Redpanda service is running:
      systemctl status redpanda
      
  3. Logs:
    • Collect recent Redpanda logs from the affected nodes:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      

C3 Remedy

  1. Restart Redpanda Service:
    • Restart the Redpanda service on the affected nodes:
      systemctl restart redpanda
      
  2. Verify Uptime Metrics:
    • Ensure the uptime metrics are updated and confirm the node is stable.
  3. Resolve Node-Specific Issues:
    • Investigate for resource exhaustion (CPU, memory, disk).
    • Use top or htop to monitor processes.
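
  A quick, non-interactive resource snapshot can be captured before (or instead of) an interactive top/htop session; this is a sketch using standard tools and the data path already referenced in this runbook:
    uptime
    free -h
    df -h /data/ms/rpc/redpanda/data
    top -b -n 1 | head -20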

DevOps Remedy

  1. Deeper Analysis of Logs:
    • Analyze Redpanda logs for recurring issues or errors.
  2. System Configuration Check:
    • Verify /etc/redpanda/redpanda.yaml for misconfigurations.
  3. Cluster Stability:
    • Ensure other nodes in the cluster are stable to prevent cascading failures.
  4. Optimize System Limits (if needed):
    • Adjust asynchronous I/O limits:
      sysctl fs.aio-max-nr=1048576
      sysctl -p
      
  5. Update Redpanda if Required:
    • If the issue persists, consider upgrading to a more stable version.

Redpanda Cluster Topic Count Out of Range

Alert Name: Topics

The number of topics in the Redpanda cluster is outside the expected range (0 to 14). This may indicate misconfigurations, topic deletions, or unexpected additions.

C3 Data Collection

Cluster Topic Count

  1. Query the Current Topic Count in Redpanda Cluster
    • Use the following PromQL query to get the number of topics in the cluster:
    sum(redpanda_cluster_topics{job="redpanda_pssb_cluster_exporter"}) by (cluster)
    
  2. Check Redpanda Topics
    • To verify if the topics in the cluster are being properly created and are available, use the following command to list the topics in Redpanda:
    rpk topic list
    
    The defined topics and their names can also be listed with:
    rpk cluster info
    

DevOps Remedy

  1. Check Redpanda Logs
    • Review the Redpanda logs on all nodes to identify when topics were created or deleted.
    journalctl -xeu redpanda
    
  2. Check topics
    • Analyze with the Development team to determine if any existing topics have been deleted or if new topics have been created.
    • If the required topic is created, update the alert to reflect the change.
    • If any existing topics have been deleted, ensure that the necessary topic is recreated. To create a topic in Redpanda, you can use the following command with rpk (Redpanda’s command-line tool):
    rpk topic create <topic_name>
    
    For example, if you want to create a topic called my_topic, the command would be:
    rpk topic create my_topic
    
    You can also specify additional parameters like partition count and replication factor:
    rpk topic create <topic_name> --partitions <num_partitions> --replicas <num_replicas>
    
    Example:
    rpk topic create my_topic --partitions 3 --replicas 2
    
    This creates a topic my_topic with 3 partitions and 2 replicas.

Topic Partition Count Mismatch

Alert Name: Partition

The total number of Kafka partitions across topics does not match the expected count of 96.

C3 Data Collection

  1. Verify the Current Partition Count:

    • Use the following PromQL query in Prometheus or Grafana to check the current partition count:
      sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="redpanda_pssb_cluster_exporter"}))
      
    • Note the current value to compare it with the expected count of 96.
  2. List All Topics and Partition Information:

    • Use the rpk command to fetch the list of topics and their partition details:
      rpk topic list
      
  3. Cross-Check Partition Counts per Topic:

    • For a detailed view of partitions for a specific topic, use:
      rpk cluster partitions list --all
      

DevOps Remedy

  1. Analyze Partition Count Discrepancy:
    • If the current partition count does not equal 96, work with the Development team to understand if new topics were added or existing ones modified.
  2. Reconfigure Partitions as Necessary:
    • If partitions are missing, recreate or adjust them using the rpk command:
      • Add partitions to an existing topic:
        rpk topic add-partitions <topic_name> --num <partitions_to_add>
        
  3. Validate Topic Requirements:
    • Confirm if the expected partition count (96) matches business requirements. If a different partition count is intended, update the alert threshold to reflect the correct value.
  4. Restore Deleted Topics (if applicable):
    • If a required topic was accidentally deleted, recreate it:
      rpk topic create <topic_name> --partitions <partition_count> --replicas <replica_count>
      
  5. Monitor Metrics Post-Fix:
    • Ensure the partition count is correctly reflected in Grafana after adjustments by re-checking the alert query.

Storage Health

Alert Name: Storage Health

Monitors the available free disk space on Redpanda storage, indicating whether the disk space is within healthy limits.

C3 Data Collection

  1. Storage Metrics
  • Check Disk Free Space Status:
    Use the following PromQL query to verify the disk free space status:
    max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"})
    
  • Status meanings:
    • 0: OK (sufficient disk space)
    • 1: Low (disk space is running low)
    • 2: Degraded (critical disk space issue)
  2. Node-Level Inspection
    • Check Disk Space on Affected Nodes:
      Use the following command to check disk usage on Redpanda nodes:
      df -h /data/ms/rpc/redpanda/data
      
    • Redpanda Logs:
      Collect logs for insights into errors related to storage:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      
  3. Redpanda Cluster Metrics
    • Cluster Health Check:
      Use the rpk tool to get a high-level view of the cluster’s health:
      rpk cluster health
      

DevOps Remedy

  1. Free Up Disk Space
    • Remove Unnecessary Files:
      Identify and delete unused files or logs in the /data/ms/rpc/redpanda/data directory.
      find /data/ms/rpc/redpanda/data -type f -name "*.log" -delete
      
  2. Expand Storage Capacity
    • Add more storage to the affected nodes. Update the disk volume attached to the Redpanda data directory.
  3. Monitor and Configure Alerts
    • Ensure that Redpanda disk usage alerts are properly set up to notify admins before reaching critical thresholds.
  4. Adjust Redpanda Storage Policies
    • Reconfigure Redpanda storage policies to optimize disk usage, such as enabling log compaction or reducing retention periods for topics if appropriate (see the retention example after this list).
  5. Restart Redpanda (if required)
    • After addressing storage issues, restart Redpanda services on affected nodes:
      systemctl restart redpanda
      
  6. Engage Development Teams
    • Collaborate with the development team to verify whether high disk utilization is caused by specific topics or workload spikes.
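
  For step 4 above, a hedged sketch of adjusting retention or enabling compaction on a topic with rpk (the topic name and retention value are placeholders; confirm the flags against the rpk version in use):
    rpk topic alter-config <topic_name> --set retention.ms=86400000
    rpk topic alter-config <topic_name> --set cleanup.policy=compact
    rpk topic describe <topic_name>
    
  Retention changes only affect the targeted topic; agree on the value with the development team before applying it.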

Memory Allocated for Redpanda Increased

Alert Name: Allocated Memory

C3 Data Collection

  1. Memory Metrics
    • Memory Utilization Query:
      Use the PromQL query to check the memory allocation ratio:
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}) / 
      (sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"}) + 
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}))
      
      • Record the value and confirm it is ≥ 0.85, as this indicates high memory usage.
    • Allocated vs Free Memory:
      Collect metrics for allocated and free memory separately:
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"})
      
  2. Node-Level Inspection
    • Check Total and Available Memory on the Node:
      Run the following command on Redpanda nodes:
      free -h
      
      • Note the used, free, and available memory.
    • System Load and Top Memory Consumers:
      Identify processes consuming the most memory:
      top -o %MEM
      
  3. Redpanda Logs
    • Analyze Memory-Related Logs:
      Collect logs for memory issues:
      journalctl -u redpanda > /tmp/redpanda_memory_logs_$(hostname).log
      

DevOps Remedy

  1. Tune Redpanda Memory Settings
    • Adjust Redpanda configurations to manage memory allocation efficiently.
  2. Expand Node Resources
    • Add More Memory to Nodes:
      Provision additional memory if available resources are insufficient.
  3. Scaling Redpanda Cluster
    • Add more nodes to the Redpanda cluster to distribute memory usage.
  4. Restart High Memory-Consuming Processes
    • Identify memory-hogging processes and restart them if they are unrelated to Redpanda; if Redpanda itself is consuming excessive memory, restart the service:
      systemctl restart redpanda
      
  5. Monitor Trends
    • Ensure memory usage alerts are properly configured for early warning. Use Grafana to monitor the ratio trends.
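
  To see which broker is driving the ratio, the same calculation can be grouped per instance (a sketch built from the metrics already used above):
    sum by (instance) (redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      /
    (sum by (instance) (redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      + sum by (instance) (redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"}))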

High CPU Utilization Detected in Redpanda Cluster

Alert Name: CPU Utilization

Scenarios Triggering the Alert

  1. Increased data throughput, unbalanced partitions, or suboptimal Redpanda configurations.
  2. Insufficient vCPUs or resource contention in virtualized environments.
  3. Data compaction, garbage collection, or excessive I/O operations.
  4. Faulty producers/consumers or inefficient network settings.
  5. Outdated Redpanda/OS versions, kernel misconfigurations, or overloaded processes.

C3 Data Collection

  1. Query CPU Usage
    • Use the Prometheus expression to get real-time data (a per-node breakdown is shown after this list):
      avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[5m]))
      
  2. Check Node CPU Utilization
    • Verify node-level CPU usage:
      top
      
  3. Analyze Logs
    • Review system logs for anomalies:
      journalctl -u redpanda.service | tail -n 50
      
  4. Check Redpanda Process
    • Check if the Redpanda process is consuming excessive CPU:
      ps aux --sort=-%cpu | head -10
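
  The per-node breakdown referenced in step 1 (a sketch using the same metric and job label, grouped by instance and expressed as a percentage):
    avg by (instance) (rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[5m])) * 100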

C3 Remedy

  1. Restart Redpanda Service:
    • If CPU usage is critically high and affecting application availability:
    systemctl restart redpanda
    
  2. Verify Redpanda Configuration:
    • Review Redpanda’s configuration; any recent configuration changes may be contributing to the high CPU usage.
  3. Inform DevOps:
    • If no immediate resolution is possible, escalate to DevOps.

DevOps Remedy

  1. Review the logs provided by the C3 team and try to resolve any issues identified.

  2. Upgrade the Virtual Machine Resources:

    • If scaling the cluster isn’t enough, you may need to manually increase the virtual machine resources (e.g., CPU, memory) to accommodate higher loads.
  3. Implement Horizontal Scaling:

    • Add more nodes to the cluster if the system consistently faces high CPU utilization. Horizontal scaling helps with better load distribution, which can alleviate CPU spikes.
  4. Restart the Cluster:

    • If the issue is still unresolved, restarting the entire Redpanda cluster might help free up resources and reset the system state:
    sudo systemctl restart redpanda
    

Active RPC Connections Outside Expected Range

Alert Name: Kafka RPC: Currently active connections

When the Redpanda service on a node restarts, the number of active connections can increase to more than 300.

Alert Scenarios:

  1. Spike in Client Traffic: Unexpected load from producers/consumers.
  2. Dropped Connections: Network issues or resource exhaustion on nodes.
  3. Cluster Imbalance: Uneven distribution of connections across brokers.
  4. Misconfigured Applications: Faulty retries or excessive reconnection attempts by clients.

  Key metrics:

  • redpanda_rpc_active_connections: Number of active RPC connections.
  • redpanda_rpc_connection_established: Tracks newly established connections.
  • redpanda_rpc_connection_duration: Average duration of RPC connections.

C3 Data Collection:

  1. Check Current Active Connections (a per-instance breakdown is shown after this list):
    sum(redpanda_rpc_active_connections{job='redpanda_pssb_cluster_exporter'}) by (cluster)
    
  2. Verify Recent Trends:
    Inspect the Grafana dashboard to confirm fluctuations in RPC connections over time.
  3. Redpanda Logs
    • Analyze Connection-Related Logs:
      Collect logs for RPC/connection errors:
      journalctl -u redpanda > /tmp/redpanda_rpc_logs_$(hostname).log
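
  In addition to the cluster-wide total above, the connection count can be broken down per instance to spot imbalance or a single node driving the spike (a sketch; the second query assumes the redpanda_rpc_connection_established counter listed above is exposed by the same job):
    sum by (instance) (redpanda_rpc_active_connections{job="redpanda_pssb_cluster_exporter"})
    sum by (instance) (rate(redpanda_rpc_connection_established{job="redpanda_pssb_cluster_exporter"}[5m]))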
      

C3 Remedy:

  1. Scale Resources:
    • If the connections are persistently high, consider scaling the cluster by adding more brokers or improving node capacity.
  2. Investigate Client Behavior:
    • Identify misconfigured producers/consumers causing spikes or drops in RPC connections. Communicate adjustments to the client application teams.

DevOps Remedy:

  1. Redistribute Connections:
    • Redistribute connections among nodes by ensuring DNS round-robin or load balancers are functioning correctly.
  2. Adjust Node Resources:
    • Upgrade node vCPUs or memory allocation to handle the increased RPC connection load.
  3. Update Network Policies:
    • Check firewall rules and ensure no limits or restrictions are causing dropped or unstable connections.
  4. Redpanda Updates and Patches:
    • Update Redpanda to the latest stable version if there are known bugs affecting RPC performance.

Leaderless Partitions Detected

Alert Name: Leaderless Partitions

Leaderless partitions occur when no broker in the cluster assumes the leader role for a partition, potentially causing availability and consistency issues.

  • redpanda_cluster_unavailable_partitions: Number of partitions without a leader.
  • redpanda_kafka_under_replicated_replicas: Partitions that lack sufficient replicas.
  • redpanda_kafka_partitions: Total partitions in the cluster.

C3 Data Collection

  1. Check Node Availability

    • Ensure all brokers in the cluster are operational:
      rpk cluster health
      
    • If a broker is down, restart the service:
      sudo systemctl restart redpanda
      
    • Verify the broker’s network connectivity and logs:
      ping <broker_ip>
      journalctl -u redpanda -n 50
      
  2. Inspect Logs for Leadership Issues:

    rpk cluster logdirs describe
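
  In addition to the rpk checks above, the metrics listed for this alert can be queried directly in Prometheus or Grafana (a sketch reusing the job label used elsewhere in this runbook):
    max(redpanda_cluster_unavailable_partitions{job="redpanda_pssb_cluster_exporter"})
    sum by (instance) (redpanda_kafka_under_replicated_replicas{job="redpanda_pssb_cluster_exporter"})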
    

DevOps Remedy

Reference: Redpanda Cluster Balancing documentation (link).

  1. Investigate and Fix Configuration or Resource Issues
    • Leaderless partitions may occur due to insufficient resources or configuration errors:
      • Check system resources:
        free -h
        df -h
        top
        
  2. Enable Maintenance Mode
    • Enable maintenance mode on the broker, restart the Redpanda service, disable maintenance mode, and repeat for all nodes (see the disable step after this sequence).
    • Check the maintenance status initially:
    rpk cluster maintenance status
    
    • Use rpk to enable maintenance mode on the broker. This will drain the broker gracefully:
    rpk cluster maintenance enable <broker_id>
    
    • Restart the Redpanda service on the node:
    sudo systemctl restart redpanda
    
    • Check the service status to ensure Redpanda has restarted successfully:
    sudo systemctl status redpanda
    
    • Check the maintenance status again.
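    • Once the status shows the broker is no longer draining and the node is healthy, disable maintenance mode before moving to the next node (a sketch; <broker_id> is the same placeholder as above):
    rpk cluster maintenance disable <broker_id>
    rpk cluster maintenance status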
  3. Verify that partitions now have leaders:
    rpk cluster partitions list --all
    
  4. Decommission a broker using the Decommission controller or manually using rpk.
    • To decommission the node using rpk, follow the steps mentioned below:
      • Check the node maintenance status:
      rpk cluster maintenance status
      
      • Enable maintenance mode for the node:
      rpk cluster maintenance enable <nodeid>
      
      • Decommission the node: after enabling maintenance mode, run:
      rpk redpanda admin brokers decommission <node-id>
      
      • Monitor the status: check the decommission status for the node:
      rpk redpanda admin brokers decommission-status <node-id>
      
      • If there are any allocation failures or errors, you can force the decommission:
      rpk redpanda admin brokers decommission <node-id> --force
      
      • Verify removal: after the node has been successfully decommissioned, confirm its removal from the cluster by listing the brokers:
      rpk redpanda admin brokers list
      
      • Also, verify the cluster status:
      rpk cluster info
      
  5. For a better understanding and to fully resolve the issue, refer to the Redpanda Cluster Balancing documentation linked above.

Remove Node From the Redpanda Cluster

To monitor the decommissioning status of a broker after initiating the decommission process, you can use the rpk redpanda admin brokers decommission-status command. This command provides real-time information about the progress of the decommissioning operation for a specific broker.

Updated Workflow with Decommission Status Monitoring

  1. Decommission the Broker:

    rpk redpanda admin brokers decommission <broker-id>
    

    This initiates the decommissioning process for the specified broker.

  2. Monitor the Decommission Status: While the decommissioning is in progress, you can monitor its status using:

    rpk redpanda admin brokers decommission-status <broker-id>
    

    This command will display the current state of the decommissioning process for the specified broker, including details such as:

    • Whether the broker is still transferring data.
    • The percentage of data redistributed to other brokers.
    • Any errors or warnings encountered during the process.
  3. Repeat Monitoring (if necessary): You can run the decommission-status command periodically to track the progress until the broker is fully decommissioned.

Example Output of decommission-status

$ rpk redpanda admin brokers decommission-status 3

Broker ID: 3
Status: In Progress
Data Redistribution: 65% complete
Remaining Tasks: 3
Errors: None

Once the broker is fully decommissioned, the output will indicate that the process is complete:

$ rpk redpanda admin brokers decommission-status 3

Broker ID: 3
Status: Complete
Data Redistribution: 100% complete
Remaining Tasks: 0
Errors: None


Benefits of Monitoring Decommission Status

  • Track Progress: Ensure the decommissioning process is proceeding as expected.
  • Identify Issues Early: Detect and address any errors or bottlenecks during the data redistribution phase.
  • Plan Next Steps: Once the decommissioning is complete, you can safely remove the broker from the cluster or repurpose it.

By incorporating the decommission-status command into your workflow, you gain better visibility and control over the decommissioning process, ensuring a smooth and successful operation.

Adding New Node to Existing Redpanda Cluster

When you’re running a monolithic (non-Kubernetes) Redpanda cluster, joining a new node to the existing cluster involves several critical steps and precautions to ensure smooth operation and avoid data inconsistency or cluster instability.

1. Version Compatibility

  • Ensure the new node runs the same version of Redpanda as the existing cluster. Mismatched versions can cause incompatibilities during cluster operations or replication.

2. Node Preparation

  • System Requirements: Make sure the hardware (CPU, memory, disk) matches or aligns with the other cluster nodes.
  • Networking:
    • Ensure the node can reach all other Redpanda nodes via both internal (Kafka API, admin API) and raft/replication ports.
    • Open the following ports (default):
      • Kafka API: 9092
      • Admin API: 9644
      • RPC/Raft: 33145
    • Update any firewall or SELinux rules accordingly.

3. Configure the New Node’s redpanda.yaml

In your Redpanda config (/etc/redpanda/redpanda.yaml), ensure:

  • Seed servers point to one or more existing nodes:

    seed_servers:
      - host:
          address: <existing-node1-ip>
          port: 33145
      - host:
          address: <existing-node2-ip>
          port: 33145
    
  • Advertised addresses are correctly set:

    advertised_rpc_api:
      address: <new-node-ip>
      port: 33145
    advertised_kafka_api:
      address: <new-node-ip>
      port: 9092
    

4. Assign a Unique Node ID

  • Ensure that each node in the cluster has a unique and persistent node_id. If node_id is left unset (e.g., -1), Redpanda may assign it dynamically, which is risky for cluster integrity.
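
  A minimal sketch of the relevant fragment in /etc/redpanda/redpanda.yaml (the value 5 is only an example; pick an ID not already used by any broker in the cluster):

    redpanda:
      node_id: 5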

5. Bootstrap the Node Properly

  • Do not start the new node with a data directory left over from a previous cluster or trial run. Always clean the data directory unless you’re intentionally restoring it:
    rm -rf /var/lib/redpanda/data/*
    

6. Validate Cluster Health Before Adding

  • Run rpk cluster info and ensure:
    • All existing nodes are up and in sync.
    • No ongoing leadership elections or replication lag.
  • Adding a node to a degraded cluster may worsen issues.

7. Start the New Node

Once the config is set, start the new Redpanda node:

sudo systemctl start redpanda

Then confirm it joined the cluster:

rpk cluster info

8. Monitor Node Integration

  • Use:
    • rpk cluster metadata
    • rpk topic describe <topic>
    • rpk partition describe <topic> -p <partition>
  • Check if the node starts receiving partition leadership or replicas.

9. Rebalance the Cluster (Optional but Recommended)

If the new node doesn’t automatically get assigned partition replicas:

rpk cluster rebalance

This spreads partition replicas across the nodes.

10. Set Up Monitoring and Alerting

  • Ensure Prometheus or Grafana is scraping metrics from the new node.
  • Validate that it is visible in your observability dashboards.
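
  A sketch of the corresponding Prometheus scrape configuration, assuming the existing job name from this runbook and that metrics are scraped from the Redpanda admin API endpoint (adjust the target, port, and metrics path to match how the other nodes are already scraped):

    scrape_configs:
      - job_name: redpanda_pssb_cluster_exporter
        metrics_path: /public_metrics
        static_configs:
          - targets:
              - "<new-node-ip>:9644"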

11. Backup Configuration

  • Back up the updated cluster configuration, including any load balancer or DNS changes pointing to the new node.