Redpanda Observability

Redpanda Monitoring Panel Documentation

Each panel is documented below with its query, the metrics it uses, its typical operating value, and its alert threshold.

1. Nodes Up
  • Panel Description: Number of nodes in the cluster.
  • Query: count(redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"})
  • Query Description: Counts the per-node uptime series reported by the Redpanda application to determine the number of nodes in the cluster.
  • Metrics Used: redpanda_application_uptime_seconds_total
  • Metric Description: Total time (in seconds) that the Redpanda application has been running.
  • Operating Value: 5
  • Threshold: Less than 5 raises an alert.

2. Storage Used
  • Panel Description: Storage used by Redpanda across all nodes (5 nodes).
  • Query: (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100
  • Query Description: Calculates the percentage of used disk space by subtracting the free-space ratio from 1.
  • Metrics Used: redpanda_storage_disk_free_bytes, redpanda_storage_disk_total_bytes
  • Metric Description: Free and total disk space available to Redpanda, in bytes.
  • Operating Value: 0-100%
  • Threshold: 85%

3. Node Uptime
  • Panel Description: Uptime for each node since its last restart.
  • Query: redpanda_application_uptime_seconds_total / 86400
  • Query Description: Converts the application uptime from seconds to days.
  • Metrics Used: redpanda_application_uptime_seconds_total
  • Metric Description: Total uptime of the application.
  • Operating Value: Varies for each node
  • Threshold: Value > 300 s

4. Topics
  • Panel Description: Number of topics created in the Redpanda cluster.
  • Query: sum(redpanda_cluster_topics{job=~"$job"}) by ([[aggr_criteria]])
  • Query Description: Aggregates the total number of topics in the cluster by the selected criteria (e.g., instance, exported instance).
  • Metrics Used: redpanda_cluster_topics
  • Metric Description: Total number of topics in the cluster.
  • Operating Value: Varies (e.g., 14)
  • Threshold: Alert when the value moves outside the expected count (currently 14).

5. Partitions
  • Panel Description: Number of partitions created across all topics.
  • Query: sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="$job"}))
  • Query Description: Sums the minimum partition count reported per topic across all topics.
  • Metrics Used: redpanda_kafka_partitions
  • Metric Description: Total number of partitions in the cluster.
  • Operating Value: 96
  • Threshold: Alert when the value deviates from the expected count of 96.

6. Storage Health
  • Panel Description: Monitors available free disk space to assess storage health.
  • Query: max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"})
  • Query Description: Retrieves the maximum value of the free-disk-space alert metric across nodes.
  • Metrics Used: redpanda_storage_disk_free_space_alert
  • Metric Description: Health status of storage based on available disk space.
  • Operating Value: 0: OK, 1: Low, 2: Degraded
  • Threshold: Value > 0

7. Throughput
  • Panel Description: Rate of data transfer or request processing within Redpanda.
  • Query: sum by ([[aggr_criteria]])(rate(redpanda_kafka_request_bytes_total[1m]))
  • Query Description: Calculates data throughput (bytes/sec) over a 1-minute window, grouped by the selected criteria.
  • Metrics Used: redpanda_kafka_request_bytes_total
  • Metric Description: Total bytes processed per second.
  • Operating Value: Min: 0, Max: 48.7 B/s
  • Threshold: 20 ≤ x ≤ 80

8. CPU Utilization
  • Panel Description: Percentage of CPU resources actively used by Redpanda.
  • Query: avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[$__rate_interval]))
  • Query Description: Measures the average CPU busy time per shard, adjusted dynamically by Grafana’s rate interval.
  • Metrics Used: redpanda_cpu_busy_seconds_total
  • Metric Description: Total time (in seconds) CPU cores spend actively processing Redpanda tasks.
  • Operating Value: 0-100%, current: 3.4%
  • Threshold: N/A

9. Allocated Memory
  • Panel Description: Percentage of total memory currently allocated by Redpanda.
  • Query: sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_allocated_memory) + sum(redpanda_memory_free_memory))
  • Query Description: Calculates the ratio of allocated memory to total memory (allocated + free).
  • Metrics Used: redpanda_memory_allocated_memory, redpanda_memory_free_memory
  • Metric Description: Allocated and free memory available for Redpanda operations.
  • Operating Value: 0-100%, current: 40.5%
  • Threshold: N/A

10. Kafka RPC Active Connections
  • Panel Description: Number of active RPC (Remote Procedure Call) connections in the Redpanda Kafka service.
  • Query: sum(redpanda_rpc_active_connections{job="redpanda_pssb_cluster_exporter"}) by ([[aggr_criteria]])
  • Query Description: Calculates the total number of active RPC connections, grouped by the selected criteria.
  • Metrics Used: redpanda_rpc_active_connections
  • Metric Description: Number of active Kafka RPC connections.
  • Operating Value: Min: 107, Max: 338
  • Threshold: 230 ≤ x ≤ 280

15. Produce Latency
  • Panel Description: Latency of Kafka “produce” requests processed by Redpanda, focusing on the 99th percentile (p99) and 95th percentile (p95) latency. Shows current and previous-day latency for comparison.
  • Query: histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",redpanda_request="produce",job="redpanda_pssb_cluster_exporter"}[5m])) by (le, [[aggr_criteria]]))
  • Query Description: The histogram_quantile function calculates the 99th percentile (and, with 0.95, the 95th percentile) of latency, aggregated over a 5-minute window. le is the upper bound of each latency bucket (e.g., 0.1 s, 0.5 s, 1 s).
  • Metrics Used: redpanda_kafka_request_latency_seconds_bucket {request="produce"}

16. Fetch Latency
  • Panel Description: Latency of Kafka “consume” requests processed by Redpanda, focusing on the 99th percentile (p99) and 95th percentile (p95) latency. Shows current and previous-day latency for comparison.
  • Query: histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{instance=~"$node",job="redpanda_pssb_cluster_exporter",redpanda_request="consume"}[5m])) by (le, [[aggr_criteria]]))
  • Query Description: histogram_quantile(0.99, ...) computes the latency of the slowest 1% of Kafka “consume” requests. le is the upper bound of each latency bucket, capturing requests below specific thresholds.
  • Metrics Used: redpanda_kafka_request_latency_seconds_bucket {request="consume"}

Alerts and C3 Procedures


Nodes Up

Alert Name: Nodes Up

The number of active Redpanda application nodes does not match the expected count of 5. This may indicate one or more nodes being down, unavailable, or misconfigured.

C3 Data Collection

1. Node-Level Information

  1. Check if the Redpanda Service is Running on All Nodes:
    • Use the following command to check the status of the Redpanda service on all nodes:
      systemctl status redpanda
      
  2. Uptime Metrics
    • Query the Redpanda Application Uptime:
      To monitor the uptime of the Redpanda application, use the following PromQL query in Prometheus or Grafana for specific nodes:
      redpanda_application_uptime_seconds_total{job="redpanda_pssb_cluster_exporter"}
      

2. Cluster-Level Data

  1. Cluster Health
    • Check the cluster health status using Redpanda’s built-in tools or the Grafana dashboard.
      If using the rpk tool:
      rpk cluster health
      
  2. Node Membership
    • Verify the list of nodes registered in the cluster:
      rpk cluster info
      

3. Logs

  1. Redpanda Logs
    • Gather logs from all nodes to identify errors or failures:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      
  2. Exporter Logs
    • Check logs for the redpanda_pssb_cluster_exporter to ensure the metrics are being collected and sent correctly.
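    • As a sketch (the exporter’s systemd unit name is assumed to match the job name; adjust to your deployment), the exporter status and recent logs can be checked with:
      systemctl status redpanda_pssb_cluster_exporter
      journalctl -u redpanda_pssb_cluster_exporter --since "30 minutes ago"
      
    • Redpanda also serves its metrics on the admin API port, so scraping can be verified directly on a node (the endpoint is assumed to be the default /public_metrics on port 9644):
      curl -s http://localhost:9644/public_metrics | grep redpanda_application_uptime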

4. Network Diagnostics

  1. Connectivity Checks
    • Verify network connectivity between nodes:
      ping <node_ip>
      
      or
      telnet <node_ip> 9092
      
  2. Firewall Rules
    • Ensure no firewall or security rules are blocking communication.
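    • A minimal sketch, assuming firewalld or plain iptables manages these hosts (adapt to your environment), to confirm the Redpanda ports are not being filtered:
      firewall-cmd --list-all
      iptables -L -n | grep -E "9092|9644|33145"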

C3 Remedy to Solve

  1. Restart the Redpanda Service on Down Nodes
    • Restart the Redpanda service on any down nodes:
      systemctl restart redpanda
      
  2. Resolve Network Issues
    • Ensure nodes can communicate with each other.
    • Adjust firewall rules if necessary.
  3. Check if All Redpanda Ports are Listening
    • Verify that all necessary Redpanda ports are listening. Use the following command to check the status of the Redpanda ports:
      netstat -tlnp | grep "redpanda"
      
    • Example output:
      tcp        0      0 172.21.0.63:9644        0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 172.21.0.63:33145       0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 172.21.0.63:9092        0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 0.0.0.0:8082            0.0.0.0:*               LISTEN      751/redpanda        
      tcp        0      0 0.0.0.0:8081            0.0.0.0:*               LISTEN      751/redpanda 
      
    • Ensure all the relevant ports (9644, 33145, 9092, 8082, 8081) are in the LISTEN state.

DevOps Remedy

  1. Deeper Analysis
    • Review the data collected by C3, focusing on:
      • Logs: Check for errors or warnings in Redpanda logs and exporter logs.
      • Grafana Metrics: Look for signs of service crashes, high resource usage, or network interruptions.
  2. Reconfigure Nodes
    • Correct any misconfigurations in the node settings file:
      /etc/redpanda/redpanda.yaml
      
  3. Adjust Asynchronous I/O Limit and Restart Redpanda:
    • If the Redpanda logs indicate issues related to fs.aio-max-nr, update the system configuration to increase the asynchronous I/O limit:
      • Edit /etc/sysctl.conf and add or update:
        fs.aio-max-nr = 1048576
        
      • Apply the changes using sudo sysctl -p, verify with sysctl fs.aio-max-nr, and restart the Redpanda service:
        systemctl restart redpanda
        

Redpanda Storage Usage Alert

Alert Name: Storage Used

Condition: Storage utilization exceeds 80%.
Query:

1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))

Redpanda Data Directory

Path: /data/ms/rpc/redpanda/data

C3 Data Collection

  1. Check Storage Usage:
    • Run the following command to check disk usage of the Redpanda data directory:
      df -h /data/ms/rpc/redpanda/data
      
  2. Collect Storage Metrics:
    • Query the storage metrics using PromQL:
      sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"}) / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"})
      
  3. Redpanda Logs:
    • Gather logs for disk-related errors:
      journalctl -u redpanda --since "30 minutes ago" > /tmp/redpanda_logs_$(hostname).log
      
  4. Cluster Health:
    • Verify the cluster’s health and info:
      rpk cluster health
      rpk cluster info
      

C3 Remedy to Solve

  1. Free Up Disk Space:
    • Remove unnecessary files or temporary data from /data/ms/rpc/redpanda/data (see the example after this list for identifying the largest files first).
  2. Verify Mount Points:
    • Ensure /data/ms/rpc/redpanda/data is correctly mounted and has sufficient space.
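
  To support step 1 above, a quick sketch for locating the largest files and directories under the Redpanda data path before deciding what can safely be removed (paths are those already used in this runbook):
    du -sh /data/ms/rpc/redpanda/data/*
    du -ah /data/ms/rpc/redpanda/data | sort -rh | head -20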

DevOps Remedy

  1. Add or Expand Storage:
    • Attach new storage, format it, and mount it to /data:
      mkfs.ext4 /dev/<new-disk>
      mount /dev/<new-disk> /data
      
  2. Optimize Disk Space Usage:
    • Archive or compress older data within the directory to reclaim space.
  3. Restart Redpanda Service:
    • Restart Redpanda after making changes:
      systemctl restart redpanda
      
  4. Plan for Future Scaling:
    • Set up Grafana alerts and monitoring for predictive scaling to avoid storage saturation.
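
  A minimal sketch of a Prometheus alerting rule for this threshold, reusing the panel query from this runbook (the group name, alert name, severity label, and 10m hold time are assumptions to adapt to your alerting setup):
    groups:
      - name: redpanda-storage
        rules:
          - alert: RedpandaStorageUsedHigh
            expr: |
              (1 - (sum(redpanda_storage_disk_free_bytes{job="redpanda_pssb_cluster_exporter"})
                  / sum(redpanda_storage_disk_total_bytes{job="redpanda_pssb_cluster_exporter"}))) * 100 > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Redpanda storage usage above 80%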

Redpanda Node Uptime Below Threshold

Alert Name: Node Uptime

The Redpanda nodes have an uptime below 300 seconds, indicating potential restarts or stability issues.

C3 Data Collection

  1. Node-Level Information:
    • To check when the Redpanda service last started (its effective uptime at the OS level), use:
    systemctl show redpanda -p ActiveEnterTimestamp
    
    • To check the uptime metric reported by the Redpanda exporter (from the monitoring stack), query redpanda_application_uptime_seconds_total for the specific node in Grafana or Prometheus.
  2. Service Status:
    • Verify if the Redpanda service is running:
      systemctl status redpanda
      
  3. Logs:
    • Collect recent Redpanda logs from the affected nodes:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      

C3 Remedy

  1. Restart Redpanda Service:
    • Restart the Redpanda service on the affected nodes:
      systemctl restart redpanda
      
  2. Verify Uptime Metrics:
    • Ensure the uptime metrics are updated and confirm the node is stable.
  3. Resolve Node-Specific Issues:
    • Investigate for resource exhaustion (CPU, memory, disk).
    • Use top or htop to monitor processes.
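
  A quick, non-interactive resource snapshot can be captured before (or instead of) an interactive top/htop session; this is a sketch using standard tools and the data path already referenced in this runbook:
    uptime
    free -h
    df -h /data/ms/rpc/redpanda/data
    top -b -n 1 | head -20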

DevOps Remedy

  1. Deeper Analysis of Logs:
    • Analyze Redpanda logs for recurring issues or errors.
  2. System Configuration Check:
    • Verify /etc/redpanda/redpanda.yaml for misconfigurations.
  3. Cluster Stability:
    • Ensure other nodes in the cluster are stable to prevent cascading failures.
  4. Optimize System Limits (if needed):
    • Adjust asynchronous I/O limits:
      sysctl fs.aio-max-nr=1048576
      sysctl -p
      
  5. Update Redpanda if Required:
    • If the issue persists, consider upgrading to a more stable version.

Redpanda Cluster Topic Count Out of Range

Alert Name: Topics

The number of topics in the Redpanda cluster is outside the expected range (0 to 14). This may indicate misconfigurations, topic deletions, or unexpected additions.

C3 Data Collection

Cluster Topic Count

  1. Query the Current Topic Count in Redpanda Cluster
    • Use the following PromQL query to get the number of topics in the cluster:
    sum(redpanda_cluster_topics{job="redpanda_pssb_cluster_exporter"}) by (cluster)
    
  2. Check Redpanda Topics
    • To verify if the topics in the cluster are being properly created and are available, use the following command to list the topics in Redpanda:
    rpk topic list
    
    The defined topics and their names can also be listed with:
    rpk cluster info
    

DevOps Remedy

  1. Check Redpanda Logs
    • Review the Redpanda logs on all nodes to identify when topics were created or deleted.
    journalctl -xeu redpanda
    
  2. Check topics
    • Analyze with the Development team to determine if any existing topics have been deleted or if new topics have been created.
    • If the required topic is created, update the alert to reflect the change.
    • If any existing topics have been deleted, ensure that the necessary topic is recreated. To create a topic in Redpanda, you can use the following command with rpk (Redpanda’s command-line tool):
    rpk topic create <topic_name>
    
    For example, if you want to create a topic called my_topic, the command would be:
    rpk topic create my_topic
    
    You can also specify additional parameters like partition count and replication factor:
    rpk topic create <topic_name> --partitions <num_partitions> --replicas <num_replicas>
    
    Example:
    rpk topic create my_topic --partitions 3 --replicas 2
    
    This creates a topic my_topic with 3 partitions and 2 replicas.

Topic Partition Count Mismatch

Alert Name: Partition

The total number of Kafka partitions across topics does not match the expected count of 96.

C3 Data Collection

  1. Verify the Current Partition Count:

    • Use the following PromQL query in Prometheus or Grafana to check the current partition count:
      sum(min by (redpanda_topic) (redpanda_kafka_partitions{job="redpanda_pssb_cluster_exporter"}))
      
    • Note the current value to compare it with the expected count of 96.
  2. List All Topics and Partition Information:

    • Use the rpk command to fetch the list of topics and their partition details:
      rpk topic list
      
  3. Cross-Check Partition Counts per Topic:

    • For a detailed view of partitions for a specific topic, use:
      rpk cluster partitions list --all
      

DevOps Remedy

  1. Analyze Partition Count Discrepancy:
    • If the current partition count does not equal 96, work with the Development team to understand if new topics were added or existing ones modified.
  2. Reconfigure Partitions as Necessary:
    • If partitions are missing, recreate or adjust them using the rpk command:
      • Add partitions to an existing topic:
        rpk topic add-partitions <topic_name> --num <partitions_to_add>
        
  3. Validate Topic Requirements:
    • Confirm if the expected partition count (96) matches business requirements. If a different partition count is intended, update the alert threshold to reflect the correct value.
  4. Restore Deleted Topics (if applicable):
    • If a required topic was accidentally deleted, recreate it:
      rpk topic create <topic_name> --partitions <partition_count> --replicas <replica_count>
      
  5. Monitor Metrics Post-Fix:
    • Ensure the partition count is correctly reflected in Grafana after adjustments by re-checking the alert query.

Storage Health

Alert Name: Storage Health

Monitors the available free disk space on Redpanda storage, indicating whether the disk space is within healthy limits.

C3 Data Collection

  1. Storage Metrics
  • Check Disk Free Space Status:
    Use the following PromQL query to verify the disk free space status:
    max(redpanda_storage_disk_free_space_alert{job="redpanda_pssb_cluster_exporter"})
    
  • Status meanings:
    • 0: OK (sufficient disk space)
    • 1: Low (disk space is running low)
    • 2: Degraded (critical disk space issue)
  2. Node-Level Inspection
    • Check Disk Space on Affected Nodes:
      Use the following command to check disk usage on Redpanda nodes:
      df -h /data/ms/rpc/redpanda/data
      
    • Redpanda Logs:
      Collect logs for insights into errors related to storage:
      journalctl -u redpanda --since "30 minutes ago"  > /tmp/redpanda_logs_$(hostname).log
      
  3. Redpanda Cluster Metrics
    • Cluster Health Check:
      Use the rpk tool to get a high-level view of the cluster’s health:
      rpk cluster health
      

DevOps Remedy

  1. Free Up Disk Space
    • Remove Unnecessary Files:
      Identify and delete unused files or logs in the /data/ms/rpc/redpanda/data directory.
      find /data/ms/rpc/redpanda/data -type f -name "*.log" -delete
      
  2. Expand Storage Capacity
    • Add more storage to the affected nodes. Update the disk volume attached to the Redpanda data directory.
  3. Monitor and Configure Alerts
    • Ensure that Redpanda disk usage alerts are properly set up to notify admins before reaching critical thresholds.
  4. Adjust Redpanda Storage Policies
    • Reconfigure Redpanda storage policies to optimize disk usage, such as enabling log compaction or reducing retention periods for topics if appropriate (see the retention example after this list).
  5. Restart Redpanda (if required)
    • After addressing storage issues, restart Redpanda services on affected nodes:
      systemctl restart redpanda
      
  6. Engage Development Teams
    • Collaborate with the development team to verify whether high disk utilization is caused by specific topics or workload spikes.
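
  For step 4 above, a hedged sketch of adjusting retention or enabling compaction on a topic with rpk (the topic name and retention value are placeholders; confirm the flags against the rpk version in use):
    rpk topic alter-config <topic_name> --set retention.ms=86400000
    rpk topic alter-config <topic_name> --set cleanup.policy=compact
    rpk topic describe <topic_name>
    
  Retention changes only affect the targeted topic; agree on the value with the development team before applying it.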

Memory Allocated for Redpanda Increased

Alert Name: Allocated Memory

C3 Data Collection

  1. Memory Metrics
    • Memory Utilization Query:
      Use the PromQL query to check the memory allocation ratio:
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}) / 
      (sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"}) + 
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"}))
      
      • Record the value and confirm it is ≥ 0.85, as this indicates high memory usage.
    • Allocated vs Free Memory:
      Collect metrics for allocated and free memory separately:
      sum(redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      sum(redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"})
      
  2. Node-Level Inspection
    • Check Total and Available Memory on the Node:
      Run the following command on Redpanda nodes:
      free -h
      
      • Note the used, free, and available memory.
    • System Load and Top Memory Consumers:
      Identify processes consuming the most memory:
      top -o %MEM
      
  3. Redpanda Logs
    • Analyze Memory-Related Logs:
      Collect logs for memory issues:
      journalctl -u redpanda > /tmp/redpanda_memory_logs_$(hostname).log
      

DevOps Remedy

  1. Tune Redpanda Memory Settings
    • Adjust Redpanda configurations to manage memory allocation efficiently.
  2. Expand Node Resources
    • Add More Memory to Nodes:
      Provision additional memory if available resources are insufficient.
  3. Scaling Redpanda Cluster
    • Add more nodes to the Redpanda cluster to distribute memory usage.
  4. Restart High Memory-Consuming Processes
    • Identify memory-hogging processes and restart them if they are unrelated to Redpanda; if Redpanda itself is consuming excessive memory, restart the service:
      systemctl restart redpanda
      
  5. Monitor Trends
    • Ensure memory usage alerts are properly configured for early warning. Use Grafana to monitor the ratio trends.
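
  To see which broker is driving the ratio, the same calculation can be grouped per instance (a sketch built from the metrics already used above):
    sum by (instance) (redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      /
    (sum by (instance) (redpanda_memory_allocated_memory{job="redpanda_pssb_cluster_exporter"})
      + sum by (instance) (redpanda_memory_free_memory{job="redpanda_pssb_cluster_exporter"}))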

High CPU Utilization Detected in Redpanda Cluster

Alert Name: CPU Utilization

Scenarios Triggering the Alert

  1. Increased data throughput, unbalanced partitions, or suboptimal Redpanda configurations.
  2. Insufficient vCPUs or resource contention in virtualized environments.
  3. Data compaction, garbage collection, or excessive I/O operations.
  4. Faulty producers/consumers or inefficient network settings.
  5. Outdated Redpanda/OS versions, kernel misconfigurations, or overloaded processes.

C3 Data Collection

  1. Query CPU Usage
    • Use the Prometheus expression to get real-time data (a per-node breakdown is shown after this list):
      avg(rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[5m]))
      
  2. Check Node CPU Utilization
    • Verify node-level CPU usage:
      top
      
  3. Analyze Logs
    • Review system logs for anomalies:
      journalctl -u redpanda.service | tail -n 50
      
  4. Check Redpanda Process
    • Check if the Redpanda process is consuming excessive CPU:
      ps aux --sort=-%cpu | head -10
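
  The per-node breakdown referenced in step 1 (a sketch using the same metric and job label, grouped by instance and expressed as a percentage):
    avg by (instance) (rate(redpanda_cpu_busy_seconds_total{job="redpanda_pssb_cluster_exporter"}[5m])) * 100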

C3 Remedy

  1. Restart Redpanda Service:
    • If CPU usage is critically high and affecting application availability:
    systemctl restart redpanda
    
  2. Verify Redpanda Configuration:
    • Review Redpanda’s configuration; any recent configuration changes may be contributing to the high CPU usage.
  3. Inform DevOps:
    • If no immediate resolution is possible, escalate to DevOps.

DevOps Remedy

  1. Review the logs provided by the C3 team and try to resolve any issues identified.

  2. Upgrade the Virtual Machine Resources:

    • If scaling the cluster isn’t enough, you may need to manually increase the virtual machine resources (e.g., CPU, memory) to accommodate higher loads.
  3. Implement Horizontal Scaling:

    • Add more nodes to the cluster if the system consistently faces high CPU utilization. Horizontal scaling helps with better load distribution, which can alleviate CPU spikes.
  4. Restart the Cluster:

    • If the issue is still unresolved, restarting the entire Redpanda cluster might help free up resources and reset the system state:
    sudo systemctl restart redpanda
    

Active RPC Connections Outside Expected Range

Alert Name: Kafka RPC: Currently active connections

When the Redpanda service on a node restarts, the number of active connections can increase to more than 300.

Alert Scenarios:

  1. Spike in Client Traffic: Unexpected load from producers/consumers.
  2. Dropped Connections: Network issues or resource exhaustion on nodes.
  3. Cluster Imbalance: Uneven distribution of connections across brokers.
  4. Misconfigured Applications: Faulty retries or excessive reconnection attempts by clients.

  Key metrics:

  • redpanda_rpc_active_connections: Number of active RPC connections.
  • redpanda_rpc_connection_established: Tracks newly established connections.
  • redpanda_rpc_connection_duration: Average duration of RPC connections.

C3 Data Collection:

  1. Check Current Active Connections (a per-instance breakdown is shown after this list):
    sum(redpanda_rpc_active_connections{job='redpanda_pssb_cluster_exporter'}) by (cluster)
    
  2. Verify Recent Trends:
    Inspect the Grafana dashboard to confirm fluctuations in RPC connections over time.
  3. Redpanda Logs
    • Analyze Connection-Related Logs:
      Collect logs for RPC/connection errors:
      journalctl -u redpanda > /tmp/redpanda_rpc_logs_$(hostname).log
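
  In addition to the cluster-wide total above, the connection count can be broken down per instance to spot imbalance or a single node driving the spike (a sketch; the second query assumes the redpanda_rpc_connection_established counter listed above is exposed by the same job):
    sum by (instance) (redpanda_rpc_active_connections{job="redpanda_pssb_cluster_exporter"})
    sum by (instance) (rate(redpanda_rpc_connection_established{job="redpanda_pssb_cluster_exporter"}[5m]))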
      

C3 Remedy:

  1. Scale Resources:
    • If the connections are persistently high, consider scaling the cluster by adding more brokers or improving node capacity.
  2. Investigate Client Behavior:
    • Identify misconfigured producers/consumers causing spikes or drops in RPC connections. Communicate adjustments to the client application teams.

DevOps Remedy:

  1. Redistribute Connections:
    • Redistribute connections among nodes by ensuring DNS round-robin or load balancers are functioning correctly.
  2. Adjust Node Resources:
    • Upgrade node vCPUs or memory allocation to handle the increased RPC connection load.
  3. Update Network Policies:
    • Check firewall rules and ensure no limits or restrictions are causing dropped or unstable connections.
  4. Redpanda Updates and Patches:
    • Update Redpanda to the latest stable version if there are known bugs affecting RPC performance.

Leaderless Partitions Detected

Alert Name: Leaderless Partitions

Leaderless partitions occur when no broker in the cluster assumes the leader role for a partition, potentially causing availability and consistency issues.

  • redpanda_cluster_unavailable_partitions: Number of partitions without a leader.
  • redpanda_kafka_under_replicated_replicas: Partitions that lack sufficient replicas.
  • redpanda_kafka_partitions: Total partitions in the cluster.

C3 Data Collection

  1. Check Node Availability

    • Ensure all brokers in the cluster are operational:
      rpk cluster health
      
    • If a broker is down, restart the service:
      sudo systemctl restart redpanda
      
    • Verify the broker’s network connectivity and logs:
      ping <broker_ip>
      journalctl -u redpanda -n 50
      
  2. Inspect Logs for Leadership Issues:

    rpk cluster logdirs describe
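
  In addition to the rpk checks above, the metrics listed for this alert can be queried directly in Prometheus or Grafana (a sketch reusing the job label used elsewhere in this runbook):
    max(redpanda_cluster_unavailable_partitions{job="redpanda_pssb_cluster_exporter"})
    sum by (instance) (redpanda_kafka_under_replicated_replicas{job="redpanda_pssb_cluster_exporter"})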
    

DevOps Remedy

Reference: Redpanda Cluster Balancing documentation (link).

  1. Investigate and Fix Configuration or Resource Issues
    • Leaderless partitions may occur due to insufficient resources or configuration errors:
      • Check system resources:
        free -h
        df -h
        top
        
  2. Enable Maintenance Mode
    • Enable maintenance mode on the broker, restart the Redpanda service, disable maintenance mode, and repeat for all nodes (see the disable step after this sequence).
    • Check the maintenance status initially:
    rpk cluster maintenance status
    
    • Use rpk to enable maintenance mode on the broker. This will drain the broker gracefully:
    rpk cluster maintenance enable <broker_id>
    
    • Restart the Redpanda service on the node:
    sudo systemctl restart redpanda
    
    • Check the service status to ensure Redpanda has restarted successfully:
    sudo systemctl status redpanda
    
    • Check the maintenance status again.
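    • Once the status shows the broker is no longer draining and the node is healthy, disable maintenance mode before moving to the next node (a sketch; <broker_id> is the same placeholder as above):
    rpk cluster maintenance disable <broker_id>
    rpk cluster maintenance status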
  3. Verify that partitions now have leaders:
    rpk cluster partitions list --all
    
  4. Decommission a broker using the Decommission controller or manually using rpk.
    • To decommission the node using rpk, follow the steps mentioned below:
      • Check the node maintenance status:
      rpk cluster maintenance status
      
      • Enable maintenance mode for the node:
      rpk cluster maintenance enable <nodeid>
      
      • Decommission the node: after enabling maintenance mode, run:
      rpk redpanda admin brokers decommission <node-id>
      
      • Monitor the status: check the decommission status for the node:
      rpk redpanda admin brokers decommission-status <node-id>
      
      • If there are any allocation failures or errors, you can force the decommission:
      rpk redpanda admin brokers decommission <node-id> --force
      
      • Verify removal: after the node has been successfully decommissioned, confirm its removal from the cluster by listing the brokers:
      rpk redpanda admin brokers list
      
      • Also, verify the cluster status:
      rpk cluster info
      
  5. For a better understanding and to fully resolve the issue, refer to the Redpanda Cluster Balancing documentation linked above.

Remove Node From the Redpanda Cluster

To monitor the decommissioning status of a broker after initiating the decommission process, you can use the rpk redpanda admin brokers decommission-status command. This command provides real-time information about the progress of the decommissioning operation for a specific broker.

Updated Workflow with Decommission Status Monitoring

  1. Decommission the Broker:

    rpk redpanda admin brokers decommission <broker-id>
    

    This initiates the decommissioning process for the specified broker.

  2. Monitor the Decommission Status: While the decommissioning is in progress, you can monitor its status using:

    rpk redpanda admin brokers decommission-status <broker-id>
    

    This command will display the current state of the decommissioning process for the specified broker, including details such as:

    • Whether the broker is still transferring data.
    • The percentage of data redistributed to other brokers.
    • Any errors or warnings encountered during the process.
  3. Repeat Monitoring (if necessary): You can run the decommission-status command periodically to track the progress until the broker is fully decommissioned.

Example Output of decommission-status

$ rpk redpanda admin brokers decommission-status 3

Broker ID: 3
Status: In Progress
Data Redistribution: 65% complete
Remaining Tasks: 3
Errors: None

Once the broker is fully decommissioned, the output will indicate that the process is complete:

$ rpk redpanda admin brokers decommission-status 3

Broker ID: 3
Status: Complete
Data Redistribution: 100% complete
Remaining Tasks: 0
Errors: None


Benefits of Monitoring Decommission Status

  • Track Progress: Ensure the decommissioning process is proceeding as expected.
  • Identify Issues Early: Detect and address any errors or bottlenecks during the data redistribution phase.
  • Plan Next Steps: Once the decommissioning is complete, you can safely remove the broker from the cluster or repurpose it.

By incorporating the decommission-status command into your workflow, you gain better visibility and control over the decommissioning process, ensuring a smooth and successful operation.

Adding New Node to Existing Redpanda Cluster

When you’re running a monolithic (non-Kubernetes) Redpanda cluster, joining a new node to the existing cluster involves several critical steps and precautions to ensure smooth operation and avoid data inconsistency or cluster instability.

1. Version Compatibility

  • Ensure the new node runs the same version of Redpanda as the existing cluster. Mismatched versions can cause incompatibilities during cluster operations or replication.

2. Node Preparation

  • System Requirements: Make sure the hardware (CPU, memory, disk) matches or aligns with the other cluster nodes.
  • Networking:
    • Ensure the node can reach all other Redpanda nodes via both internal (Kafka API, admin API) and raft/replication ports.
    • Open the following ports (default):
      • Kafka API: 9092
      • Admin API: 9644
      • RPC/Raft: 33145
    • Update any firewall or SELinux rules accordingly.

3. Configure the New Node’s redpanda.yaml

In your Redpanda config (/etc/redpanda/redpanda.yaml), ensure:

  • Seed servers point to one or more existing nodes:

    seed_servers:
      - host:
          address: <existing-node1-ip>
          port: 33145
      - host:
          address: <existing-node2-ip>
          port: 33145
    
  • Advertised addresses are correctly set:

    advertised_rpc_api:
      address: <new-node-ip>
      port: 33145
    advertised_kafka_api:
      address: <new-node-ip>
      port: 9092
    

4. Assign a Unique Node ID

  • Ensure that each node in the cluster has a unique and persistent node_id. If node_id is left unset (e.g., -1), Redpanda may assign it dynamically, which is risky for cluster integrity.
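
  A minimal sketch of the relevant fragment in /etc/redpanda/redpanda.yaml (the value 5 is only an example; pick an ID not already used by any broker in the cluster):

    redpanda:
      node_id: 5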

5. Bootstrap the Node Properly

  • Do not start the new node with a data directory left over from a previous cluster or trial run. Always clean the data directory unless you’re intentionally restoring it:
    rm -rf /var/lib/redpanda/data/*
    

6. Validate Cluster Health Before Adding

  • Run rpk cluster info and ensure:
    • All existing nodes are up and in sync.
    • No ongoing leadership elections or replication lag.
  • Adding a node to a degraded cluster may worsen issues.

7. Start the New Node

Once the config is set, start the new Redpanda node:

sudo systemctl start redpanda

Then confirm it joined the cluster:

rpk cluster info

8. Monitor Node Integration

  • Use:
    • rpk cluster metadata
    • rpk topic describe <topic>
    • rpk partition describe <topic> -p <partition>
  • Check if the node starts receiving partition leadership or replicas.

9. Rebalance the Cluster (Optional but Recommended)

If the new node doesn’t automatically get assigned partition replicas:

rpk cluster rebalance

This spreads partition replicas across the nodes.

10. Set Up Monitoring and Alerting

  • Ensure Prometheus or Grafana is scraping metrics from the new node.
  • Validate that it is visible in your observability dashboards.
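
  A sketch of the corresponding Prometheus scrape configuration, assuming the existing job name from this runbook and that metrics are scraped from the Redpanda admin API endpoint (adjust the target, port, and metrics path to match how the other nodes are already scraped):

    scrape_configs:
      - job_name: redpanda_pssb_cluster_exporter
        metrics_path: /public_metrics
        static_configs:
          - targets:
              - "<new-node-ip>:9644"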

11. Backup Configuration

  • Back up the updated cluster configuration, including any load balancer or DNS changes pointing to the new node.