HAProxy Observability

Prerequisites

Please read the following documents before addressing the issues to become familiar with the HAProxy architecture.
Configuration manual
Haproxy monitoring guide
Errors reference

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.

Alert Handling Procedure

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.

Restart detected

Alertname: HaproxyRestartedRecently

The expression (time() - haproxy_process_start_time_seconds) / 60 calculates the uptime of the HAProxy process in minutes. This metric shows how long the HAProxy process has been running since its last restart, which is useful for monitoring process stability and identifying unexpected restarts. This alert indicates that HAProxy was restarted very recently. The C3 team should check the logs to identify potential causes of the restart, such as crashes or system reboots, and confirm with the DevOps team whether the restart was intentional. If the restart was not planned, all relevant data, including logs and recent configuration changes, should be provided to the DevOps team for further investigation.
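
To evaluate the expression ad hoc (outside Grafana), it can be run against the Prometheus HTTP API. A minimal sketch, assuming Prometheus listens on its default port 9090; <prometheus-host> is a placeholder for the monitoring server and $host stands for the alerting instance label, as in the queries below:

    # Query the HAProxy process uptime in minutes for the alerting instance.
    curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
      --data-urlencode 'query=(time() - haproxy_process_start_time_seconds{instance="$host"}) / 60'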

C3 Data Collection for HAProxy Restart Alert:

  1. Instance name, IP address, and alert age:

    • Collect the instance name and IP address where the alert is firing.
    • Record the total time the alert has been in the firing state.
  2. HAProxy process uptime:

    • Use the expression (time() - haproxy_process_start_time_seconds) / 60 to determine the process uptime in minutes.
  3. Service status:

    • Check the HAProxy service status using systemctl status haproxy and note whether it is active, restarting, or failed.
  4. Process status:

    • Use ps aux | grep haproxy to confirm if the HAProxy process is running and collect the process details.
  5. Configuration validation:

    • Run haproxy -c -f /etc/haproxy/haproxy.cfg to check for configuration errors and collect the output.
  6. Log messages:

    • Collect recent log entries around the restart event using:
     tail -100f /var/log/haproxy/haproxy_notice.log
     tail -100f /var/log/haproxy/haproxy_debug.log
     tail -100f /var/log/haproxy/haproxy_info.log
    
  7. Port status:

    • Verify HAProxy is listening on the expected ports using netstat -tlnp | grep haproxy and collect the output.
  8. Resource availability:

    • Collect CPU, memory, and disk usage metrics using top, free -h, and df -h to ensure resource availability.
  9. Firewall status:

    • If the firewall is active, use ufw status && ufw status | grep -E "(80|443)" to confirm whether any ports are blocked, and collect the output.
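
The data points above can be gathered in one pass and attached to the escalation. A minimal collection sketch, assuming the log and configuration paths listed in this section (plain tail is used instead of tail -f so the script does not block):

    # Collect restart-related evidence into a single file for escalation.
    {
      date
      systemctl status haproxy --no-pager
      ps aux | grep '[h]aproxy'
      haproxy -c -f /etc/haproxy/haproxy.cfg
      netstat -tlnp | grep haproxy
      free -h; df -h
      tail -100 /var/log/haproxy/haproxy_notice.log
    } > "haproxy_restart_evidence_$(hostname)_$(date +%F_%H%M).txt" 2>&1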

Dependent Metrics:

When the HAProxy process uptime indicates a recent restart, please check the following metrics/data:

  1. HAProxy Backend Availability:

    • Use haproxy_backend_up{instance="$host"} to verify the status of backend servers connected to the HAProxy instance.
  2. Frontend Request Rates:

    • Monitor rate(haproxy_frontend_http_requests_total{instance="$host"}[<time interval>]) to observe any differences in request patterns after the restart.
  3. Error Rates:

    • Check haproxy_frontend_request_errors_total{instance="$host"} to identify if error rates have spiked following the restart.
  4. Queue Metrics:

    • Verify haproxy_server_queue_size{instance="$host"} to ensure backend servers are not overwhelmed and queues are being processed efficiently.
  5. Response Times:

    • Analyze haproxy_server_http_response_time_average_seconds to detect unusual latency patterns.

By examining these metrics, the team can diagnose the cause of the restart and address underlying issues.

C3 Remedy

Follow these steps for basic troubleshooting after a recent haproxy restart. Do not perform any configuration changes or actions that might impact the overall functionality. For unresolved issues, escalate to the DevOps team.

  1. Verify HAProxy Service Status

    • Check if the HAProxy service is running:
      Use:
    systemctl status haproxy
    

    Sample Output:

    ● haproxy.service - HAProxy Load Balancer
    Loaded: loaded (/lib/systemd/system/haproxy.service; enabled; vendor preset: enabled)
    Active: active (running) since Mon 2024-12-02 10:00:00 UTC; 2m ago
        Docs: man:haproxy(1)
    Main PID: 2345 (haproxy)
    Tasks: 4 (limit: 9375)
    Memory: 2.0M
    CGroup: /system.slice/haproxy.service
            ├─2345 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg
            └─2346 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg
    
    • If the service is inactive or in a failed state, escalate the issue to DevOps without restarting.
  2. Check Recent Logs

    • Analyze HAProxy logs for errors or unusual events:
      Use:
      journalctl -u haproxy | tail -n 50
      
      Look for log entries related to configuration issues or errors.
  3. Verify HAProxy Listener Ports

    • Check if HAProxy listener ports are open:
      Use:
      netstat -tlnp | grep haproxy
      
      Sample Output:
      tcp        0      0 0.0.0.0:80            0.0.0.0:*               LISTEN      2345/haproxy
      tcp        0      0 0.0.0.0:443           0.0.0.0:*               LISTEN      2345/haproxy
      
      Ensure the necessary ports (e.g., 80 and 443) are in a LISTEN state.
  4. Inspect Resource Usage

    • Check for potential resource exhaustion (e.g., memory or CPU):
      Use:
      top -b -n1 | grep haproxy
      
      Look for excessive resource usage that might cause instability right after restarts.
  5. Basic Network Checks

    • Ensure the network is reachable for backend servers:
      Use:
      ping -c 3 <backend_server>
      
      Confirm the backend servers are accessible without significant packet loss or latency.
  6. Check Backend Health Metrics

    • Verify the health status of backend servers monitored by HAProxy:
      Use:
    curl -I http://172.21.0.61:8182/apm/mon/health
    

    Sample output:

    HTTP/1.1 200 
    Cache-Control: no-cache, no-store, max-age=0, must-revalidate
    Pragma: no-cache
    Expires: 0
    Strict-Transport-Security: max-age=31536000 ; includeSubDomains
    X-XSS-Protection: 1; mode=block
    X-Frame-Options: DENY
    X-Content-Type-Options: nosniff
    vary: accept-encoding
    Content-Length: 0
    Date: Thu, 05 Dec 2024 13:17:48 GMT
    
    • Verify the HAProxy stats at http://172.21.0.20:9500/stats. Look for health check failures or other issues. Credentials for the stats page:
    User: haproxy
    Password: 1!Qhaproxy
    
  7. Frontend Connection Check

    • Simulate a client request to ensure HAProxy is responding correctly:
      Use:
    curl -I http://172.21.0.61:8182
    

    Sample output:

    HTTP/1.1 200 
    Set-Cookie: pssbtom=56884264BFCD00D15F0ED47ED976D524; Path=/; Secure; HttpOnly
    Set-Cookie: pssbhz=HZ5608D764C6224BD397B566FA57BC3CEB; Path=/
    Cache-Control: no-cache, no-store, max-age=0, must-revalidate
    Pragma: no-cache
    Expires: 0
    Strict-Transport-Security: max-age=31536000 ; includeSubDomains
    X-XSS-Protection: 1; mode=block
    X-Frame-Options: DENY
    X-Content-Type-Options: nosniff
    vary: accept-encoding
    Content-Type: text/html;charset=UTF-8
    Transfer-Encoding: chunked
    Date: Thu, 05 Dec 2024 13:14:24 GMT
    

Escalate to the DevOps team for further troubleshooting and configuration changes.

DevOps Remedies

In case the restart of HAProxy was not intentional, perform the following remedies to identify and resolve the root cause:

  1. Check for OOM Killer Involvement

    • Investigate if the system’s Out-of-Memory (OOM) killer terminated HAProxy:
    dmesg | grep -i "Out of memory"  
    dmesg | grep -i haproxy
    

    Sample output:

    [12345.678910] Out of memory: Kill process 5432 (haproxy) score 950 or sacrifice child
    [12345.678912] Killed process 5432 (haproxy) total-vm:1048576kB, anon-rss:512000kB, file-rss:1024kB, shmem-rss:0kB
    
    • If OOM killing is detected:
      • Increase memory allocation or optimize HAProxy’s configuration to reduce memory usage.
      • Set vm.overcommit_memory to a safe value (2) in /etc/sysctl.conf to prevent aggressive memory overcommitment. After setting the parameter, apply it with: sudo sysctl -p
        Setting the value to 2 disables overcommit: the kernel only grants memory it can actually back, which improves stability and reduces the chance of the OOM killer terminating HAProxy.
  2. Analyze Resource Usage

    • Check if the server faced CPU, memory, or disk bottlenecks:

      top -b -n 1 | grep haproxy
      free -h
      df -h
      

      Make sure that sufficient memory and disk space are available.

    • Check whether the limits for the HAProxy process are too low for the load by inspecting the proc filesystem for the process. To get the main PID:

    systemctl status haproxy | grep "Main PID"
    

    Sample output:

    Main PID: 2793613 (haproxy)
    

    To inspect the limits enforced:

     cat /proc/2793613/limits
    

    Sample output:

    Limit                     Soft Limit           Hard Limit           Units     
    Max cpu time              unlimited            unlimited            seconds   
    Max file size             unlimited            unlimited            bytes     
    Max data size             unlimited            unlimited            bytes     
    Max stack size            8388608              unlimited            bytes     
    Max core file size        0                    unlimited            bytes     
    Max resident set          unlimited            unlimited            bytes     
    Max processes             772996               772996               processes 
    Max open files            80334                524288               files     
    Max locked memory         65536                65536                bytes     
    Max address space         unlimited            unlimited            bytes     
    Max file locks            unlimited            unlimited            locks     
    Max pending signals       772996               772996               signals   
    Max msgqueue size         819200               819200               bytes     
    Max nice priority         0                    0                    
    Max realtime priority     0                    0                    
    Max realtime timeout      unlimited            unlimited            us 
    
    • Increase resource limits in the service file if necessary for the load:
      [Service]
      LimitNOFILE=102400
      LimitNPROC=102400
      
      And perform systemctl daemon-reload && systemctl restart haproxy

    Discuss with the DevOps team and finalize the new values before adjusting the above parameters.

  3. Investigate Connections Overload

    • Check if HAProxy experienced a surge in connections that exceeded the maxconn limit:
    echo "show info" | socat /var/run/haproxy.stat stdio | grep -i maxconn
    

    Sample output:

    Maxconn: 40100
    Hard_maxconn: 40100
    MaxConnRate: 16
    

    MaxConnRate should always be less than Maxconn or Hard_maxconn.

    • If applicable, increase the maxconn value in the HAProxy configuration:
    global
    maxconn 80200
    
    • Review logs for connection-related errors:
    grep -ir "Connection error" /var/log/haproxy/*
    
    • Add rate-limiting rules to prevent abuse or overload.
  4. Validate HAProxy Configuration before restarting with the above changes

    • Ensure the configuration is error-free:

      haproxy -c -f /etc/haproxy/haproxy.cfg
      

      Sample output:

      Configuration file is valid
      

      You might also see configuration warnings in the output; these can be ignored if there is a valid reason for them.

    • Correct any detected syntax errors and restart.

     systemctl restart haproxy
    

High frontend requests errors

Alertname: HighFrontendRequestsErrorsWarn,HighFrontendRequestsErrorsCritical

The expression (increase(haproxy_frontend_request_errors_total[1m]) / increase(haproxy_frontend_http_requests_total[1m])) * 100 calculates the percentage of frontend request errors in HAProxy over a one-minute period. This metric helps monitor the health of the HAProxy frontend by showing the ratio of failed requests to total requests. A high error rate may indicate problems such as misconfigurations, server overloads, or network issues. This alert triggers when the error rate exceeds the threshold (>25% for warning, >50% for critical), signaling a potential issue that requires attention. The C3 team should perform the data collection and remedies below before escalating the issue to the DevOps team.
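
As a worked example (illustrative numbers): if increase(haproxy_frontend_http_requests_total[1m]) returns 1,000 and increase(haproxy_frontend_request_errors_total[1m]) returns 300 for the same frontend, the expression evaluates to 30, which crosses the warning threshold (>25) but not the critical one (>50).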

C3 data collection:

  1. Instance name, IP address, and alert age:

    • Collect the instance name and IP address where the alert is firing.
    • Record the total time the alert has been in the firing state.
  2. HAProxy process uptime:

    • Use the expression (time() - haproxy_process_start_time_seconds) / 60 to determine the process uptime in minutes.
  3. Service status:

    • Check the HAProxy service status using systemctl status haproxy and note whether it is active, restarting, or failed.
  4. Process status:

    • Use ps aux | grep haproxy to confirm if the HAProxy process is running and collect the process details.
  5. Configuration validation:

    • Run haproxy -c -f /etc/haproxy/haproxy.cfg to check for configuration errors and collect the output, including any warnings.
  6. Log messages:

    • Collect recent log entries from around the time the errors occurred using:
     tail -100f /var/log/haproxy/haproxy_notice.log
     tail -100f /var/log/haproxy/haproxy_debug.log
     tail -100f /var/log/haproxy/haproxy_info.log
    
  7. Resource availability:

    • Collect CPU, memory, and disk usage metrics using top, free -h, and df -h to ensure resource availability.

Dependent metrics

  1. Active backends: Use haproxy_backend_status and check if all backends are up.

  2. Active frontends: Use haproxy_frontend_status and check if all frontends are in active status.

  3. Queue status: Check the current and max queues of servers to see if requests are failing because the servers are receiving more requests than they can handle.

Use: haproxy_server_current_queue and haproxy_server_max_queue to find out if maximum queue limit is reached.
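
The same queue figures can also be read directly from the HAProxy admin socket. A quick sketch, assuming the socket path used elsewhere in this runbook (/var/run/haproxy.stat); in the CSV output the first four columns are pxname, svname, qcur and qmax:

    # Show the current and maximum queue per proxy/server from the stats socket.
    echo "show stat" | socat /var/run/haproxy.stat stdio | cut -d',' -f1-4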

C3 Remedy

  1. Backend Connectivity
  • Identify Backends:

    • List all backends associated with the particular proxy by checking HAProxy metrics such as haproxy_backend_status.
  • Ping Backends:

    • From the HAProxy node, ping each backend server to verify connectivity:
      ping -c 4 <backend-IP>
      
    • If a backend does not respond or has high latency, record the issue for escalation.
  • Monitor Backend Health Check Status:

    • Review the health_check_status codes returned by backends to diagnose the specific service causing the issue (a probe sketch is provided after this remedy list).
    • Health check status codes and their associated services are:
      • MySQL: 520, 521, 522
      • Cassandra: 526, 527, 528
      • Hazelcast Cache: 531, 532, 533
      • Redpanda: 536, 537, 538
      • Gluster: 541, 542, 543
  • Interpret Status Codes:

    • If a backend is returning one of these codes:
      • MySQL (520–522): Investigate database service issues (e.g., query failures, connection limits).
      • Cassandra (526–528): Verify node status and cluster health for issues like node unavailability or misconfiguration.
      • Hazelcast Cache (531–533): Check cache cluster connectivity or memory exhaustion.
      • Redpanda (536–538): Diagnose message queue status for failures or resource bottlenecks.
      • Gluster (541–543): Inspect storage cluster health, network latency, or volume availability.
  • Escalate as Necessary:

    • Document the backend name, IP, and observed health check status codes.
    • If the issue cannot be resolved at the C3 level, inform the DevOps team with these details for further investigation.
  2. Backend Up/Down Changes
  • Check Backend Status Changes:

    • Use Prometheus metrics like haproxy_backend_status to compare the number of backends up and down over the last hour.
      changes(haproxy_backend_up[1h])
      
    • Record any significant fluctuations in backend statuses.
  • Analyze Network Issues:

    • If multiple backends are flapping (going up and down frequently), suspect network instability.
    • Note the affected backends and associated proxies for escalation.
  • Escalation:

    • Provide the DevOps team with the findings, including backend status changes and potential indications of network problems.
  3. Check Logs for Errors
  • Review HAProxy Logs:

    • Analyze recent log entries for errors or anomalies using:
      tail -100 /var/log/haproxy/haproxy_notice.log
      tail -100 /var/log/haproxy/haproxy_info.log
      tail -100 /var/log/haproxy/haproxy_debug.log
      
    • Look for patterns such as frequent retries, timeouts, or failed connections.
  • Report Findings:

    • Share the log analysis with the DevOps team, summarizing key observations and potential causes.
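
For the health check review in the “Backend Connectivity” step above, the backend health endpoints can be probed directly and the returned codes mapped against the service code table. A minimal sketch; the IP list and the /apm/mon/health path are taken from examples elsewhere in this runbook and should be adjusted to the proxy that is alerting:

    # Probe each backend's health endpoint and print the HTTP status code.
    # 520-522 MySQL, 526-528 Cassandra, 531-533 Hazelcast, 536-538 Redpanda, 541-543 Gluster.
    for ip in 172.21.0.61 172.21.0.62 172.21.0.63 172.21.0.64 172.21.0.65; do
      code=$(curl -s -o /dev/null -w '%{http_code}' "http://${ip}:8182/apm/mon/health")
      echo "${ip}: ${code}"
    done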

DevOps Team Remedies

Upon receiving escalation from the C3 team regarding backend connectivity or health check failures, the DevOps team should proceed with targeted troubleshooting for the identified service(s).


  1. General Troubleshooting for Backend Services

  2. Validate Connectivity to Backends:

    • Confirm that the HAProxy node can reach the backend service:
      telnet <backend-IP> <service-port>
      
    • If connectivity fails, investigate network routes and firewall rules, and fix them.
  3. Check Backend Logs:

    • Access the backend server and review service logs for error messages.
      for MySQL:
      tail -100 /var/log/mysql/error.log
      
  4. Verify Resource Availability:

    • Check for CPU, memory, and disk usage that might be causing service disruptions.
      top  
      free -h  
      df -h  
      

Service-Specific Remedies

MySQL (Health Check Status Codes: 520, 521, 522)
  • 520-522
    • Verify that MySQL is running:
      systemctl status mysql  
      
    • Restart the service if necessary:
      systemctl restart mysql  
      
      • Test MySQL login and run a test insert or update query, as used by the health checks
Cassandra/Scylla (Health Check Status Codes: 526, 527, 528)
  • 526-528
    • Check the Cassandra/Scylla node status:
      nodetool status  
      
    • Restart the affected node:
      systemctl restart scylla-server  
      
Redpanda (Health Check Status Codes: 536, 537, 538)
  • 536-538:
    • Confirm the broker process is running:
      systemctl status redpanda  
      
    • Restart the broker if necessary:
      systemctl restart redpanda  
      
GlusterFS (Health Check Status Codes: 541, 542, 543)
  • 541-543

    • Check the status of bricks:
      gluster volume status  
      
    • Restart the failed brick:
      systemctl restart glusterd  
      
  • Check for Split-Brain

    • Identify split-brain files:
      gluster volume heal <volume-name> info split-brain  
      
    • Resolve using manual conflict resolution or force a heal:
      gluster volume heal <volume-name> full  
      
  • Resolve Volume Inconsistencies

    • Trigger a volume heal:
      gluster volume heal <volume-name>  
      
    • Monitor the healing process and check for errors.

High response times for HAProxy

Alertname: HAProxyBackendHighReponseTime

The expression haproxy_server_http_response_time_average_seconds reports the average response time of a backend server in seconds. This metric provides insight into the latency of responses from HAProxy backend servers, which can indicate performance issues. This alert triggers when the average response time exceeds 5 seconds, suggesting that backend servers may be experiencing delays or are under heavy load. The C3 team should investigate the affected backend servers, review server logs, and check resource utilization (e.g., CPU, memory, or network) on both the HAProxy node and the backend nodes.
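
A quick way to see whether the delay is introduced on the HAProxy node or by the backend itself is to compare request timing through the proxy with timing directly against a backend. A rough sketch, using addresses from the examples in this runbook (adjust to the alerting proxy/backend; -k is only needed if the frontend serves a self-signed certificate):

    # Time a request through the HAProxy frontend and one sent directly to a backend.
    curl -sk -o /dev/null -w 'via haproxy: %{time_total}s\n' https://172.21.0.20/
    curl -s  -o /dev/null -w 'direct:      %{time_total}s\n' http://172.21.0.61:8182/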

C3 Data Collection:

  1. Instance name, IP address, and alert age:

    • Collect the instance name and IP address where the alert is firing.
    • Record the total duration the alert has been in the firing state.
  2. Response time details:

    • Use the metric haproxy_server_http_response_time_average_seconds to evaluate the average response time of backend servers.
  3. Recent log messages:

    • Collect logs for HAProxy to investigate potential issues causing high response times:
      tail -100 /var/log/haproxy/haproxy_info.log
      tail -100 /var/log/haproxy/haproxy_debug.log
      tail -100 /var/log/haproxy/haproxy_notice.log
      
  4. Resource availability:

    • Assess system resource utilization:
      • CPU and memory: Use top or htop.
      • Disk usage: Use df -h to check for sufficient space.
  5. Collect HTTP responses:

    • Use the query rate(haproxy_backend_http_responses_total{instance="$host"}[<time_interval>]) to observe changes in backend response rates, especially errors like 5xx, 4xx.
  6. Collect response rates from last few minutes:

    • Use query increase(haproxy_backend_http_responses_total[<time interval>]) to check if there are sudden increase in number of responses the backend is serving.
Dependent metrics
  1. Frontend Session Rate:

    • Query rate(haproxy_frontend_current_session_rate[<time interval>]) to identify spikes in session creation rates.
  2. Frontend Connections:

    • Use rate(haproxy_frontend_connections_total[<time interval>]) to monitor increases in frontend connections over time.
  3. View average queue time:

    • Use haproxy_backend_http_queue_time_average_seconds

Collecting and analyzing these metrics helps in understanding the root cause of the issue.

C3 remedies:

  1. Investigate High Response Times:

    • Use:
      haproxy_server_http_response_time_average_seconds{instance="$host"}
      
      to identify the backend servers with high response times.
    • Login to the specific backend server:
      ssh devopsadmin@backend-server-ip
      
    • Check resource usage on the backend using:
      top -n 1
      free -h
      iostat
      
      Collect and analyze CPU, memory, and disk I/O statistics.
  2. Clear Buffer and Cache:

    • Run the following commands to clear buffer and cache:
      sync; echo 3 > /proc/sys/vm/drop_caches
      
      Confirm memory availability after clearing:
      free -h
      
  3. Validate Backend Health:

    • Use Prometheus query:
      haproxy_backend_up{instance="$host"}
      
      to confirm backend servers are marked as healthy.
    • Login to the backend and check application logs for issues:
      tail -f /var/log/tomcat/catalina.out
      tail -f /var/log/mysql/error.log
      
  4. Analyze Bandwidth and Traffic:

    • On HAProxy server, monitor bandwidth usage by interfaces/connections using iftop:
      Use iftop -i virbr20 to monitor traffic between the backends and the HAProxy node.
      Sample output:
    syhydsrv001:48456                                                          => 172.21.0.42                                                                 9.03Kb  1.81Kb   462b
                                                                           <=                                                                             1.43Mb   294Kb  73.4Kb
    

    Use iftop -i testbr to monitor traffic between the frontends and client connections.
    Sample output:

    syhydsrv001                                                                => 183.82.7.33.actcorp.in                                                      32.4Kb  38.1Kb  47.6Kb
                                                                           <=                                                                             19.3Kb  21.2Kb  30.6Kb
    

    syhydsrv001                                                                => 223.182.53.215                                                              57.9Kb  23.2Kb  8.91Kb
                                                                           <=                                                                             12.7Kb  5.33Kb  2.05Kb

  • Capture traffic using tcpdump for detailed packet analysis:

    tcpdump -i testbr host <backend-server-ip> -w haproxy_traffic.pcap
    

    Use Wireshark to open .pcap files for deep inspection.

  • On the backend servers, monitor bandwidth:

    iftop -i eth0
    
  • Capture incoming traffic to the backend:

    tcpdump -i virbr20 port 3306 -w backend_traffic.pcap
    
  5. Inspect Network Performance:

    • On the HAProxy server, test network latency:
      ping -c 4 backend-server-ip
      traceroute backend-server-ip
      
      Check for latencies.
    • If network issues are identified, involve the network team for resolution.
  6. Monitor Backend Queues:

    • Use metrics to check the current queue size for the backend:
      haproxy_server_current_queue{instance="$host"}
    • Report continuous high queue usage to the DevOps team so that resources can be increased.
  7. Inspect Logs for Errors:

    • Check HAProxy logs for specific errors:
      tail -100f /var/log/haproxy/haproxy_debug.log
      
    • Look for repeated backend errors (503, 502) and address root causes.

By following these remedies, the team can systematically diagnose and resolve high response times in HAProxy and its backend servers, ensuring optimal performance.

DevOps remedies:

  1. Tune HAProxy Timeouts:

    • Adjust timeout settings in /etc/haproxy/haproxy.cfg:
      timeout connect 5s
      timeout client 50s
      timeout server 50s
      
    • Reload HAProxy configuration after changes:
      haproxy -c -f /etc/haproxy/haproxy.cfg
      systemctl reload haproxy
      
  2. Increase resources for tomcat and other services as required

For tomcat:

  1. Locate the Tomcat Service File:
  • Open the Tomcat systemd service file for editing:
    sudo vi /etc/systemd/system/<tomcat_name>.service
    
  2. Modify the Memory Allocation:
  • Locate the Environment line containing CATALINA_OPTS.
  • Increase the memory limits (-Xms and -Xmx) to suit the application’s requirements. For example, to increase to 1 GB minimum and 4 GB maximum:
    Environment="CATALINA_OPTS=-Xms1024M -Xmx4096M -server -XX:+UseParallelGC -javaagent:/opt/tomcat-exporter/jmx_prometheus_javaagent-1.0.1.jar=9115:/opt/tomcat-exporter/config.yaml"
    
  3. Save and Reload the Service File:
  • Save the changes and exit the editor.
  • Reload the systemd daemon to apply changes:
    sudo systemctl daemon-reload
    
  4. Restart Tomcat Service:
  • Restart the Tomcat service to apply the new memory settings:
    sudo systemctl restart tomcat
    
  5. Verify Changes:
  • Check if Tomcat is running with updated memory limits:
    ps aux | grep tomcat
    
    Look for the -Xms and -Xmx values in the process command line.
  6. Monitor Resource Usage:
  • Use top or htop to monitor memory usage and ensure the changes have stabilized the application:
    top -p $(pgrep -d',' -f tomcat)
    

Redpanda:

  1. Locate the Configuration File:

    • Open the Redpanda configuration file for editing:
      sudo vi /etc/default/redpanda
      
  2. Modify the Memory Setting:

    • Locate the START_ARGS line and adjust the --memory value. For example, to allocate 2048 MB:
      START_ARGS=--check=true --memory 2048M
      
  3. Save and Apply Changes:

    • Save the file and exit the editor.
    • Reload the systemd daemon to apply changes:
      sudo systemctl daemon-reload
      
  4. Restart Redpanda Service:

    • Restart the Redpanda service:
      sudo systemctl restart redpanda
      
  5. Verify Changes:

    • Check the running process for updated memory settings:
      ps aux | grep redpanda
      
      Look for the --memory parameter in the command line.
  6. Monitor Resource Usage:

    • Use top or htop to observe Redpanda’s memory usage:
      top -p $(pgrep -d',' -f redpanda)
      

MySQL Buffer Pool Size Adjustment

  1. Locate the Configuration File:

    • Open the MySQL configuration file for editing:
      sudo vi /etc/mysql/mysql.conf.d/mysqld.cnf
      
  2. Modify the Buffer Pool Size:

    • Locate the innodb_buffer_pool_size line and adjust the value. For example, to set it to 4096 MB:
      innodb_buffer_pool_size=4096M
      
  3. Save and Apply Changes:

    • Save the file and exit the editor.
    • Restart the MySQL service:
      sudo systemctl restart mysql
      
  4. Verify Changes:

    • Check the buffer pool size configuration:
      mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
      
  5. Monitor Resource Usage:

    • Use top or htop to monitor MySQL’s memory usage:
      top -p $(pgrep -d',' -f mysql)
      

Scylla Memory Adjustment

  1. Locate the Configuration File:

    • Open the Scylla configuration file for editing:
      sudo vi /etc/default/scylla-server
      
  2. Modify the Memory Setting:

    • Locate the SCYLLA_ARGS line and adjust the --memory value. For example, to allocate 4096 MB:
      SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --memory 4096M"
      
  3. Save and Apply Changes:

    • Save the file and exit the editor.
    • Reload the systemd daemon to apply changes:
      sudo systemctl daemon-reload
      
  4. Restart Scylla Service:

    • Restart the Scylla service:
      sudo systemctl restart scylla-server
      
  5. Verify Changes:

    • Check the running process for updated memory settings:
      ps aux | grep scylla
      
      Look for the --memory parameter in the command line.
  6. Monitor Resource Usage:

    • Use top or htop to observe Scylla’s memory usage:
      top -p $(pgrep -d',' -f scylla-server)
      

By following these steps, you can adjust and verify memory allocation for Redpanda, MySQL, and Scylla effectively, ensuring they run optimally based on application demands.

For the backend nodes:

  1. Increase Resources if Necessary for the nodes:

    • If memory usage is still high or performance issues persist, consider scaling system resources (CPU or RAM).
    • Check the current system resource allocation:
      free -h
      lscpu
      
    • Add more resources (e.g., increasing VM memory or scaling physical hardware).
  2. If response latencies do not improve with any of the above remedies, check with the management/IT team about replacing the LAN connections with higher-bandwidth hardware.

Backend Weight Mismatch for PSSB Webservers

Alert Name: Backend Server(PSSB)

The backend weight for the pssb_webservers proxy is not equal to the expected value of 5

Alert Query

haproxy_backend_weight{proxy=~"(pssb_webservers)", instance="172.21.0.20"}

C3 Data Collection:

  1. Backend Configuration Data:
    • The configuration of the pssb_webservers backend in HAProxy (/etc/haproxy/haproxy.cfg).
    backend pssb_webservers
    server pssb1avm001 172.21.0.61:8182 maxconn 5000 check inter 55s fall 3 rise 3 cookie pssb1avm001 observe layer4  error-limit 9  on-error mark-down
    server pssb1avm002 172.21.0.62:8182 maxconn 5000 check inter 55s fall 3 rise 3 cookie pssb1avm002 observe layer4  error-limit 9  on-error mark-down
    server pssb1abm003 172.21.0.63:8182 maxconn 5000 check inter 55s fall 3 rise 3 cookie pssb1abm003 observe layer4  error-limit 9  on-error mark-down
    server pssb1avm004 172.21.0.64:8182 maxconn 5000 check inter 55s fall 3 rise 3 cookie pssb1avm004 observe layer4  error-limit 9  on-error mark-down
    server pssb1avm005 172.21.0.65:8182 maxconn 5000 check inter 55s fall 3 rise 3 cookie pssb1avm005 observe layer4  error-limit 9  on-error mark-down
    
  2. Metrics from Prometheus:
    • Historical trends of haproxy_backend_weight for pssb_webservers.

      To view stats of haproxy : Haproxy stats

    • Related metrics values
      • Check active server of proxy
      haproxy_backend_active_servers{instance="172.21.0.20", job="haproxy_exporter", proxy="pssb_webservers"}
      
      • Check backend status up or down state
      haproxy_backend_status{proxy="pssb_webservers"}
      
      • Check for each node in backend and their state[“UP”,“DOWN”,“DRAIN”,“NOLB”]
      haproxy_server_status{proxy="pssb_webservers"}
      
      • Review the total downtime of the proxy for checking logs
      haproxy_backend_downtime_seconds_total{instance="172.21.0.20", job="haproxy_exporter", proxy="pssb_webservers"}
      
  3. Logs:
    • HAProxy logs for any recent changes or issues related to server availability or backend configuration.
    tail -100f /var/log/haproxy/haproxy_notice.log
    tail -100f /var/log/haproxy/haproxy_debug.log
    tail -100f /var/log/haproxy/haproxy_info.log
    
    • Check for events that might indicate a server being marked as down or weight adjustments due to health check failures
  4. Health Check Results:
    • Results of health checks performed on the servers in the backend to see if any server is unhealthy.
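
To see which servers contribute to the reported backend weight, the per-server weight can be compared against the backend total. A minimal sketch against the Prometheus HTTP API, assuming the exporter also exposes haproxy_server_weight alongside the other haproxy_server_* metrics used in this runbook (<prometheus-host> is a placeholder for the monitoring server):

    # Per-server weights and the aggregate backend weight for pssb_webservers.
    curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
      --data-urlencode 'query=haproxy_server_weight{proxy="pssb_webservers", instance="172.21.0.20"}'
    curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
      --data-urlencode 'query=haproxy_backend_weight{proxy="pssb_webservers", instance="172.21.0.20"}'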

DevOps Remedies:

  1. Verify and Correct the Configuration:
    • Ensure the backend configuration (haproxy.cfg) specifies the correct weight for all servers in the pssb_webservers backend.
    • Reload HAProxy to apply configuration changes if needed:
    systemctl reload haproxy
    
  2. Check Health Status:
    • If any backend server is failing health checks, troubleshoot the server:
      • Ensure the application is running on the server.
      • Verify the server responds correctly to health check requests.
      • Fix any network issues between HAProxy and the backend server.
  3. Update Grafana Rules:
    • If the alert is due to intentional changes in weight (e.g., during maintenance), update the alert threshold or silence the alert temporarily in Grafana.
  4. Resource Optimization:
    • If resource constraints are causing servers to be marked down, optimize system resources on the backend servers.

Backend Status Down for Multiple Services

Alert Name: Back - Status

The backend status for one or more specified backends is not equal to 1.

Alert Query

haproxy_backend_status{proxy=~"(artifacts_panchayatseva|elitical_api|elitical_ui|geomaps|jenkins|mdm_sb_node|monitor|panchayatseva_bot|portereu_api|portereu_ui|portereu_webservers|ps_jacoco|ps_surefire|ps_swagger|psorbit_webservers|pssb_webservers|sonar|stats|survey_artifacts|techdoc)",instance="172.21.0.20",state="UP"}

C3 Data Collection

  1. Backend Configuration Data:
    • Review the HAProxy configuration file (/etc/haproxy/haproxy.cfg) for the specified backends and their respective weight configurations.
    • Confirm the default and dynamic weight values for each backend.
  2. Prometheus Metrics:
    • Gather historical data for haproxy_backend_status for the affected backends to identify trends or anomalies.
    • Check related metrics such as haproxy_backend_active_servers and haproxy_server_status for correlations.
  3. Logs:
    • Inspect HAProxy logs for errors or recent changes related to the affected backends.
    tail -100f /var/log/haproxy/haproxy_notice.log
    tail -100f /var/log/haproxy/haproxy_debug.log
    tail -100f /var/log/haproxy/haproxy_info.log
    
  4. System Resource Utilization:
    • Monitor resource usage (CPU, memory, network) on the instance 172.21.0.20.
    • Identify if system-level constraints are affecting HAProxy or backend servers.
  5. Recent Changes:
    • Check if there were any recent updates to the HAProxy configuration, deployment changes, or maintenance activities affecting the servers.

DevOps Remedies

  1. Verify Backend Configuration:
    • Ensure each backend in haproxy.cfg is correctly configured for respective servers.
    • Reload HAProxy to apply configuration changes
    systemctl reload haproxy
    
  2. Fix Health Check Issues:
    • Troubleshoot failing health checks for any server:
      • Verify the application is running and accessible.
      • Confirm the health check URL or port is correct and responsive.
  3. Investigate Recent Changes:
    • Roll back recent changes or deployments affecting the backends if they introduced unintended weight discrepancies.
  4. Resource Optimization:
    • Address any resource constraints on 172.21.0.20 or backend servers to stabilize their availability.
  5. Review Alert Rule:
    • Update the alert rule if the set of monitored backends or the expected status needs adjustment.
    • Temporarily silence the alert in Grafana if the discrepancy is due to planned maintenance.

High Rate of Backend Health Check State Changes

Alert Name: Backend - UP to DOWN transitions

The rate of backend health check up/down state changes exceeds the threshold.

Alert Query

rate(haproxy_backend_check_up_down_total{proxy=~"(artifacts_panchayatseva|elitical_api|elitical_ui|geomaps|jenkins|mdm_sb_node|monitor|panchayatseva_bot|portereu_api|portereu_ui|portereu_webservers|ps_jacoco|ps_surefire|ps_swagger|psorbit_webservers|pssb_webservers|sonar|stats|survey_artifacts|techdoc)", instance="172.21.0.20"}[$__rate_interval]) > 5

C3 Data Collections

  1. Health Check Logs:
    • Collect HAProxy logs for detailed information about backend health check transitions.
    • Look for patterns, such as repeated failures or specific errors causing state changes.
  2. Backend Configuration Data:
    • Review the health check settings in haproxy.cfg for the affected backends:
      • Check intervals (inter), timeouts (timeout check), and retry counts (rise and fall parameters).
      • Check server status (UP/DOWN/MAINT) and their associated counters.
  3. Prometheus Metrics:
    • Examine related metrics such as:
      • haproxy_server_check_status to monitor the state of health checks.
      • haproxy_server_connection related metrics
    • Historical data for haproxy_backend_check_up_down_total to identify trends or patterns.
  4. System Resource Metrics:
    • Collect resource usage (CPU, memory, disk I/O) on both 172.21.0.20 and backend servers.
    • Check network latency and packet drops between HAProxy and backend servers.
  5. Application Logs:
    • Gather logs from the backend servers for errors or issues that might cause them to fail health checks intermittently.
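
State transitions are also written to the HAProxy logs as “Server <backend>/<server> is DOWN/UP” messages, which makes it easy to see which server is flapping. A quick sketch, assuming these messages land in the notice log used throughout this runbook:

    # List recent state-change messages and count DOWN events per server.
    grep -E "Server .* is (UP|DOWN)" /var/log/haproxy/haproxy_notice.log | tail -50
    grep -oE "Server [^ ]+ is DOWN" /var/log/haproxy/haproxy_notice.log | sort | uniq -c | sort -rn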

DevOps Remedies

  1. Verify Backend Health:
    • Identify and troubleshoot the unstable servers:
      • Check application logs and performance metrics for errors or crashes.

        To view stats of haproxy : Haproxy stats

      • Verify that the application is running and responding correctly on the health check endpoint

        <backend_ip>:<port> example : 172.21.0.94:8182

  2. Adjust Health Check Parameters:
    • Modify the health check settings in haproxy.cfg to reduce sensitivity to transient issues:
      • Increase the inter (interval between health checks).
      • Adjust the rise and fall values to require more consecutive successes or failures before marking a server UP or DOWN.
    • Reload HAProxy to apply the changes:
    systemctl reload haproxy
    
  3. Optimize Backend Performance:
    • Address performance bottlenecks on the backend servers (e.g., high CPU usage, memory leaks).
    • Scale the number of backend servers if load is causing frequent failures.
  4. Network Troubleshooting:
    • Check for network issues such as high latency, packet loss, or misconfigured firewalls between HAProxy and the backend servers.
    • Resolve any issues to ensure stable connectivity.
  5. Analyze Patterns:
    • Use Prometheus graphs to identify specific timeframes or events that correlate with the increased state transitions.
    • Investigate recent deployments or configuration changes that might have triggered the issue.

Frontend Status Down for HTTPS or Stats

Alert Name: Frontend - Status

The frontend for https_443_frontend or stats is not in the UP state.

Alert Query

haproxy_frontend_status{proxy=~"(https_443_frontend|stats)",instance="172.21.0.20",state="UP"}

C3 Data Collection

  1. HAProxy Logs:
    • Collect logs for the https_443_frontend and stats frontends to understand why they transitioned out of the UP state.
    tail -100f /var/log/haproxy/haproxy_notice.log
    tail -100f /var/log/haproxy/haproxy_debug.log
    tail -100f /var/log/haproxy/haproxy_info.log
    
    • Look for errors or warnings related to binding issues, certificate problems, or misconfigurations.
  2. Prometheus Metrics:
    • Check related metrics, such as:
      • Start with the haproxy_frontend traffic and session metrics (connections_total, current_sessions, bytes_in/out_total, http_requests_total):
      haproxy_frontend_connections_total
      haproxy_frontend_current_sessions
      haproxy_frontend_current_session_rate
      haproxy_frontend_requests_denied_total
      haproxy_frontend_sessions_total
      haproxy_frontend_http_requests_total
      haproxy_frontend_bytes_in_total and haproxy_frontend_bytes_out_total
      haproxy_frontend_http_responses_total
      
  3. Configuration File Review:
    • Inspect haproxy.cfg for:
      • Proper frontend definitions (bind statements, SSL configuration, ACLs, etc.).
      • Conflicting port or address bindings.
      • Syntax or logical errors.
  4. Network Diagnostics:
    • Check the status of port 443 (for https_443_frontend) and the admin stats port (9500 in this setup, or as configured) using netstat or ss:
    netstat -tuln | grep <port>
    
    • Verify that these ports are open and not blocked by firewalls.
  5. SSL Certificate Validation:
    • For https_443_frontend, validate the SSL certificate configuration and ensure the certificate files exist and are readable.
  6. Resource Usage:
    • Monitor CPU, memory, and disk usage on the HAProxy instance to ensure the system has sufficient resources.
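
For the SSL certificate validation step above, the certificate that HAProxy actually presents on port 443 can be inspected from the node itself. A minimal sketch, using the instance address from the alert query (add -servername <domain> to the s_client call if SNI is in use):

    # Show the subject and validity dates of the certificate served on port 443.
    echo | openssl s_client -connect 172.21.0.20:443 2>/dev/null | openssl x509 -noout -subject -dates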

DevOps Remedies

  1. Restart HAProxy:
    • If the issue is transient, restarting HAProxy may resolve it:
    systemctl restart haproxy
    
  2. Verify Frontend Configuration:
    • Check haproxy.cfg:
      • Ensure proper bind statements for the affected frontends.
      • For SSL, verify that the certificate and key files are correctly specified and accessible.
      • Validate the configuration syntax:
      haproxy -c -f /etc/haproxy/haproxy.cfg
      
  3. Check Port Availability:
    • Ensure the ports (443 for HTTPS, and the stats port) are not in use by another process.
    • If another process is using the port, stop it or reassign HAProxy to a different port.
  4. SSL Certificate Issues:
    • If the SSL certificate is invalid, expired, or misconfigured:
      • Replace or renew the certificate.
      • Update the path in the haproxy.cfg.
    • Reload HAProxy after updates:
    systemctl reload haproxy
    
  5. Resolve Network Issues:
    • Ensure that the ports are open and accessible via firewall settings:
    ufw allow <port>
    
  6. Enable Logging for Further Analysis:
    • Increase the logging level in haproxy.cfg for detailed diagnostics:
    global
        log 127.0.0.1 local0 debug
    

Backend Weight Mismatch for PSOrbit Webservers

Alert Name: Backend Server(PS Orbit)

The backend weight for the psorbit_webservers proxy is not equal to the expected value of 3.

Alert Query:

haproxy_backend_weight{proxy=~"(psorbit_webservers)", instance="172.21.0.20"} != 3

C3 Data Collection:

  1. Metric Values
    • Collect the actual weight of the affected backend (psorbit_webservers).
    • Verify if the weight is below or above the expected value (3).
    • Related metrics values
      • Check active server of proxy
      haproxy_backend_active_servers{instance="172.21.0.20", job="haproxy_exporter", proxy="psorbit_webservers"}
      
      • Check backend status up or down state
      haproxy_backend_status{proxy="psorbit_webservers"}
      
      • Check for each node in backend and their state[“UP”,“DOWN”,“DRAIN”,“NOLB”]
      haproxy_server_status{proxy="psorbit_webservers"}
      
      • Review the total downtime of the proxy for checking logs
      haproxy_backend_downtime_seconds_total{instance="172.21.0.20", job="haproxy_exporter", proxy="psorbit_webservers"}
      

DevOps Remedies:

  1. Verify and Correct the Configuration:
    • Ensure the backend configuration (haproxy.cfg) specifies the correct weight for all servers in the psorbit_webservers backend.
    • Reload HAProxy to apply configuration changes if needed:
    systemctl reload haproxy
    
  2. Check Health Status:
    • If any backend server is failing health checks, troubleshoot the server:
      • Ensure the application is running on the server.
      • Verify the server responds correctly to health check requests.
      • Fix any network issues between HAProxy and the backend server.
  3. Update Grafana Rules:
    • If the alert is due to intentional changes in weight (e.g., during maintenance), update the alert threshold or silence the alert temporarily in Grafana.
  4. Resource Optimization:
    • If resource constraints are causing servers to be marked down, optimize system resources on the backend servers.

Backend Weight Mismatch for Multiple Services

Alert Name: Backend - Server Weight

The backend weight for one or more specified servers is not equal to 1.

Alert Query:

haproxy_backend_weight{proxy=~"(artifacts_panchayatseva|elitical_api|elitical_ui|geomaps|jenkins|mdm_sb_node|monitor|panchayatseva_bot|portereu_api|portereu_ui|portereu_webservers|ps_jacoco|ps_surefire|ps_swagger|sonar|stats|survey_artifacts|techdoc)", instance="172.21.0.20"} != 1

C3 Data Collections

  1. Metric Values
    • haproxy_backend_weight:
      • Retrieve the weight for all listed backends (e.g., artifacts_panchayatseva, elitical_api, etc.).
      • Confirm which backend weight is not equal to 1.
  2. Logs:
    • HAProxy logs for any recent changes or issues related to server availability or backend configuration.
    tail -100f /var/log/haproxy/haproxy_notice.log
    tail -100f /var/log/haproxy/haproxy_debug.log
    tail -100f /var/log/haproxy/haproxy_info.log
    
    • Check for events that might indicate a server being marked as down or weight adjustments due to health check failures
  3. Metrics from Prometheus:
    • Historical trends of haproxy_backend_weight for the affected backends.

      To view stats of haproxy : Haproxy stats

    • Verify the related metrics (haproxy_backend_*) mentioned above.

DevOps Remedies

  1. Verify Backend Configuration:
    • Ensure each backend in haproxy.cfg is correctly configured for respective servers.
    • Reload HAProxy to apply configuration changes
    systemctl reload haproxy
    
  2. Check Health Status:
    • If any backend server is failing health checks, troubleshoot the server:
      • Ensure the application is running on the server.
      • Verify the server responds correctly to health check requests.
      • Fix any network issues between HAProxy and the backend server.
  3. Update Grafana Rules:
    • If the alert is due to intentional changes in weight (e.g., during maintenance), update the alert threshold or silence the alert temporarily in Grafana.
  4. Resource Optimization:
    • If resource constraints are causing servers to be marked down, optimize system resources on the backend servers.

Backend Session Limit Violation

Alert Name: Backend - Maximum observed number of active sessions limit exceed

The difference between backend session limits and max sessions is below the threshold.

Alert Query:

(haproxy_backend_limit_sessions{proxy=~"(artifacts_panchayatseva|elitical_api|elitical_ui|geomaps|jenkins|mdm_sb_node|monitor|panchayatseva_bot|portereu_api|portereu_ui|portereu_webservers|ps_jacoco|ps_surefire|ps_swagger|psorbit_webservers|pssb_webservers|sonar|stats|survey_artifacts|techdoc)",instance="172.21.0.20"} - haproxy_backend_max_sessions{proxy=~"(artifacts_panchayatseva|elitical_api|elitical_ui|geomaps|jenkins|mdm_sb_node|monitor|panchayatseva_bot|portereu_api|portereu_ui|portereu_webservers|ps_jacoco|ps_surefire|ps_swagger|psorbit_webservers|pssb_webservers|sonar|stats|survey_artifacts|techdoc)",instance="172.21.0.20"}) < 3500
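
As a worked example (illustrative numbers): if haproxy_backend_limit_sessions for a backend is 4000 and haproxy_backend_max_sessions has reached 900, the difference is 3100, which is below the 3500 threshold and fires the alert. The closer the observed maximum gets to the configured limit, the smaller this difference becomes.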

C3 Data Collection

  1. Key Metrics in Use
    • haproxy_backend_limit_sessions:
      • The maximum number of concurrent sessions allowed for the backend.
    • haproxy_backend_max_sessions:
      • The highest number of sessions that have been concurrently active in the backend.
  2. Traffic Metrics:
    • Analyze:
      • haproxy_backend_bytes_in_total
      • haproxy_backend_bytes_out_total
      • haproxy_backend_connections_total
  3. Logs:
    • Review /var/log/haproxy.log for:
      • Session rejection messages.
      • Errors indicating session limit breaches.
  4. Configuration:
    • Inspect the HAProxy configuration for:
    • Backend session limits (fullconn on the backend and maxconn on its servers).

DevOps Remedies

  1. Increase Backend Session Limits:
    • Adjust the session limits in haproxy.cfg (fullconn for the backend, maxconn for its servers) for backends nearing capacity:
    backend <backend_name>
        server server1 192.168.x.x:80 check maxconn 10000
    
  2. Optimize Persistent Connections:
    • Reduce session duration or idle timeouts for long-lived connections.
  3. Add More Backend Servers:
    • Scale out the backend pool to distribute sessions across more servers.
  4. Monitor Traffic Patterns:
    • Identify and mitigate unexpected traffic surges.

Frontend Session Limit Violation

Alert Name: Frontend - Maximum observed number of active sessions limit exceeded

The difference between haproxy_frontend_limit_sessions and haproxy_frontend_max_sessions is below the threshold.

Alert Query:

haproxy_frontend_limit_sessions{proxy=~"(https_443_frontend|stats)",instance="172.21.0.20"} - haproxy_frontend_max_sessions{proxy=~"(https_443_frontend|stats)",instance="172.21.0.20"} < 40000

C3 Data collection

  1. Current Metrics:
    • Check the values of haproxy_frontend_limit_sessions and haproxy_frontend_max_sessions

      To view stats of haproxy : Haproxy stats

  2. Traffic Patterns:
    • Analyze:
      • haproxy_frontend_bytes_in_total
      • haproxy_frontend_bytes_out_total
      • haproxy_frontend_connections_total
  3. Logs:
    • Review the HAProxy logs below for any connection rejections and session-related errors.
    tail -100f /var/log/haproxy/haproxy_notice.log
    tail -100f /var/log/haproxy/haproxy_debug.log
    tail -100f /var/log/haproxy/haproxy_info.log
    
  4. Frontend Configuration:
    • Inspect HAProxy frontend settings in haproxy.cfg for:
      • maxconn limits.
      • Rate-limiting configurations.