JMX Alerts

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.

Alert Handling Procedure

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.


Alerts, Thresholds and Priorities

Dashboard & Row: 1.1.2
Alert Name: Tomcat Status Check
Panel: Status
Panel Description: Checks whether the Tomcat exporter or application instance is up and running.
Query: up{job="$job", instance="$app$node"}
Query Description: Verifies if the application or exporter instance is running.
Query Operating Range: Boolean (0/1)
Metrics: up
Metric Description: Instance status (1 = up, 0 = down).
Metric Operating Range: 0 or 1
Severity Thresholds: CRITICAL = 0 | WARNING = N/A | OK = 1

Dashboard & Row: 1.3.4
Alert Name: JVM Memory Usage
Panel: JVM Memory Usage [heap]
Panel Description: Monitors the percentage of heap memory used by the JVM relative to the total physical memory available.
Query: jvm_memory_used_bytes{area="$memarea",job="$job", instance="$app$node"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{job="$job", instance="$app$node"} * 100
Query Description: Tracks the percentage of heap memory used by the JVM.
Query Operating Range: 0–100%
Metrics: jvm_memory_used_bytes, java_lang_OperatingSystem_TotalPhysicalMemorySize
Metric Description: Heap memory used and total physical memory.
Metric Operating Range: 0–100%
Severity Thresholds: CRITICAL = >90% | WARNING = 80–90% | OK = <80%

Dashboard & Row: 2.3.2
Alert Name: High Tomcat Request Processing Time
Panel: Average Processing Time
Panel Description: Monitors the average processing time (in milliseconds) for servlet requests over 5 minutes.
Query: avg(rate(Catalina_GlobalRequestProcessor_processingTime{instance=~"$instance"}[5m]))
Query Description: Measures the average time taken to process servlet requests.
Query Operating Range: Milliseconds (ms)
Metrics: Catalina_GlobalRequestProcessor_processingTime
Metric Description: Measures how long Tomcat servlets take to process requests.
Metric Operating Range: Positive values
Severity Thresholds: CRITICAL = >9 sec | WARNING = 3–9 sec | OK = <3 sec

Dashboard & Row: 1.1.4
Alert Name: High Tomcat CPU Utilization
Panel: CPU Utilization
Panel Description: Tracks the percentage of CPU utilization on the node over a 5-minute window.
Query: rate(process_cpu_seconds_total{job="tomcat_exporter_pssb",instance=~"pssb.*"}[5m]) * 100
Query Description: Tracks the CPU usage percentage of Tomcat processes.
Query Operating Range: 0–100%
Metrics: process_cpu_seconds_total
Metric Description: Percentage of CPU used by Tomcat processes.
Metric Operating Range: 0–100%
Severity Thresholds: CRITICAL = >50% | WARNING = N/A | OK = <50%

Tomcat Status Check

Explanation

This alert monitors whether the Tomcat exporter or application instance is up and running. If the instance is down, it could indicate server failures, network issues, or a problem with the application itself.
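
As a quick cross-check, the current value of the up metric can be read directly from the Prometheus HTTP API. This is a minimal sketch: the Prometheus host is a placeholder, 9090 is the default Prometheus port, and the tomcat_exporter_pssb job label is taken from the table above, so the exact values may differ in this environment.

  # Returns 1 per instance that is up, 0 for instances that are down
  curl -G --data-urlencode 'query=up{job="tomcat_exporter_pssb"}' \
    http://<prometheus-host>:9090/api/v1/query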

Scenarios Triggering the Alert

  1. The tomcat_exporter service is stopped or crashed.
  2. Network connectivity issues between the Prometheus server and the exporter.
  3. Firewall or IP restrictions blocking Prometheus from scraping the exporter.
  4. Misconfigured or missing exporter settings in Prometheus.

Thresholds

Severity Description
Critical Instance status is 0 (down).
Warning Not applicable for this alert.
OK Instance status is 1 (up).

C3 Data Collection

  1. Check the Status of the Exporter:

    • Log in to the server hosting the exporter.
    • Run the following command to verify if the service is active:
      systemctl status tomcat-pssb
      
      OR
      ps aux | grep tomcat
      
    • Note any errors in the service logs using:
      journalctl -u tomcat-pssb
      
  2. Ping the Exporter:

    • Test connectivity to the exporter using cURL:

      curl http://<exporter-host>:9115/metrics
      
    • If you see metrics data, the exporter is up and reachable. Otherwise, record the error message.

    • If the configuration is incorrect or missing, notify DevOps.

  3. Network Accessibility:

    • Use ping or telnet to verify the server’s network connection and that the exporter port (9115, as used in the cURL check above) is reachable:
      ping <exporter-host>
      telnet <exporter-host> 9115
      

C3 Remedy

  1. Restart the Exporter:

    • If the exporter is down, restart it:
      systemctl restart tomcat-pssb
      
    • Verify it is running:
      systemctl status tomcat-pssb
      
  2. Resolve Network Issues:

    • Check if the server is reachable from Prometheus.
    • Coordinate with the network team to resolve connectivity or firewall issues.
  3. Verify Configurations:

    • Ensure the correct scrape target and port are specified in the Prometheus config (see the sketch below).
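
A minimal sketch of this check, assuming the default configuration path /etc/prometheus/prometheus.yml and the tomcat_exporter_pssb job name from the table above (both may differ in this environment):

  # Show the scrape job definition and its targets
  grep -A 10 'tomcat_exporter_pssb' /etc/prometheus/prometheus.yml
  # Validate the configuration file syntax
  promtool check config /etc/prometheus/prometheus.yml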

DevOps Remedy

If the above remedies fail:

  1. Investigate Logs:

    • Analyze the exporter logs for errors or misconfigurations.
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  2. Prometheus Server Debugging:

    • Review Prometheus scrape logs to identify errors:
      journalctl -xeu prometheus
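
    • The state of every scrape target can also be checked from the Prometheus HTTP API (a minimal sketch; the host placeholder and default port 9090 are assumptions for this environment):
      curl http://<prometheus-host>:9090/api/v1/targets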
      

High JVM Memory Usage

Explanation

This alert monitors the percentage of heap memory used by the JVM relative to the total physical memory available. High memory usage could indicate potential memory leaks, inefficient application code, or insufficient resource allocation.
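
In addition to the Prometheus query, heap usage can be cross-checked on the host itself with standard JDK tooling. This is a minimal sketch; it assumes jstat is available on the server, and <PID> is the Tomcat process ID.

  # Find the Tomcat JVM process ID
  ps aux | grep tomcat
  # Print heap and GC statistics once per second, five times
  jstat -gc <PID> 1000 5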

Scenarios Triggering the Alert

  1. Application memory leaks or inefficient memory usage.
  2. Insufficient heap size allocated to the JVM.
  3. High traffic leading to increased memory usage for session management or caching.
  4. Inefficient or unoptimized garbage collection settings.

Thresholds

Severity Description
Critical Heap memory usage exceeds 90%.
Warning Heap memory usage is 80–90%.
OK Heap memory usage is below 80%.

C3 Data Collection

  1. Identify the Affected Instance:

    • Use the alert details to get the instance with high memory usage.
  2. Check Memory Utilization Trends:

    • Query the heap memory usage percentage (graph it over the last hour in Prometheus to identify the trend):
      jvm_memory_used_bytes{area="heap",instance="$instance"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{instance="$instance"} * 100
      
  3. Garbage Collection Activity:

    • Analyze garbage collection frequency and duration:
      rate(jvm_gc_collection_seconds_sum{instance="$instance"}[5m])
      
  4. Thread Utilization:

    • Check for thread pool saturation:
      Catalina_ThreadPool_currentThreadsBusy{instance="$instance"}
      
  5. Logs:

    • Collect application logs for issues:
       tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
       tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
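
If a memory leak is suspected, a live-object histogram can complement the data above. This is a minimal sketch using standard JDK tooling; note that jmap -histo:live forces a full GC, so coordinate with DevOps before running it on a loaded instance. <PID> is the Tomcat process ID.

  # Show the 20 object types occupying the most heap space
  jmap -histo:live <PID> | head -n 20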
      

Dependent Metrics

  • jvm_gc_collection_seconds_sum: For GC activity and impact on memory.
  • jvm_memory_max_bytes: To check the maximum allocated JVM memory.
  • Catalina_ThreadPool_currentThreadsBusy: To correlate memory usage with thread activity.

C3 Remedy

  1. Restart the Tomcat Service:

    • If memory usage is critically high and causing application instability:
      systemctl restart tomcat-pssb
      
  2. Analyze Logs for Memory Leaks:

    • Look for “OutOfMemoryError” or similar warnings in logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  3. Inform DevOps:

    • Notify DevOps if memory allocation changes or application profiling is required.

DevOps Remedy

  1. Optimize Memory Allocation:
    • Review and adjust the JVM heap size in the CATALINA_OPTS configuration (a sketch of where these options are typically set follows this list). For example:
      -Xms1024M -Xmx4300M
      
    • Ensure the heap size is appropriate for the workload.
  2. Application Debugging:
    • Work with the development team to identify memory leaks or inefficiencies in the code. Tools like VisualVM or JProfiler can help.
  3. Garbage Collector Tuning:
    • If GC activity is inefficient, tune the garbage collector settings (e.g., switch to G1GC or other suitable GC for the workload).
  4. Scaling Resources:
    • Add additional resources (memory, CPU) to the server or scale out horizontally by adding more nodes.
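
A minimal sketch of where such JVM options are commonly set, assuming a standard Tomcat layout in which catalina.sh picks up bin/setenv.sh (the exact path and mechanism used by the tomcat-pssb service may differ):

  # $CATALINA_HOME/bin/setenv.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Xms1024M -Xmx4300M"

Restart the service afterwards (systemctl restart tomcat-pssb) for the new options to take effect.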

High Tomcat Request Processing Time

Explanation

The query calculates the average servlet request processing time (in milliseconds) as a rate over a 5-minute window.
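
As a quick manual cross-check of end-to-end latency, a timed request can be sent to the application. This is a minimal sketch; the host, port, and context path are placeholders for this deployment, and the measured time includes network overhead in addition to Tomcat processing time.

  # Print only the total request time, in seconds
  curl -o /dev/null -s -w '%{time_total}\n' http://<app-host>:<port>/<context-path>/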

Scenarios Triggering the Alert

  1. The Tomcat server is under heavy load due to a large number of incoming requests.
  2. Problems in the application code, such as inefficient algorithms or threads competing for resources.
  3. Delays caused by external systems like databases or APIs.
  4. Server resource limitations, such as high CPU usage, insufficient memory, or slow disk operations.

Thresholds

Severity Description
Critical Average processing time exceeds 9000ms.
Warning Average processing time is 3000–9000ms.
OK Average processing time is below 3000ms.

C3 Data Collection

  1. Identify the Affected Instance:

    • Check the alert message for the instance and processingTime details.
  2. Gather Metrics via Prometheus:

    • Query the processing time for the past 1 hour to identify trends:
      rate(Catalina_GlobalRequestProcessor_processingTime{instance="$instance"}[1h])
      
  3. Analyze Traffic Volume:

    • Query the request count to see if high traffic correlates:
      rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
      
  4. Check Resource Utilization:

    • Validate CPU and memory usage using these queries:
      node_cpu_seconds_total{mode="idle", instance="$instance"}
      jvm_memory_used_bytes{instance="$instance"}
      
  5. Tomcat Logs:

    • Collect logs to identify specific errors or delays:
       tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
       tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
    • Look for long-running requests or threads.

Dependent Metrics

  • Catalina_GlobalRequestProcessor_requestCount: To identify traffic spikes.
  • jvm_memory_used_bytes: To check JVM memory utilization.
  • node_cpu_seconds_total: To analyze system CPU usage.

C3 Remedy

  1. Restart Tomcat (if needed):

    • If the issue is severe and unresolvable quickly, restart the Tomcat service:
      systemctl restart tomcat-pssb
      
  2. Review Application Logs:

    • Check for slow requests or errors in application logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
    
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
    
  3. Notify DevOps:

    • If traffic spikes are identified, notify DevOps to investigate and implement load balancing or scaling.

DevOps Remedy

If the C3 team cannot resolve the issue:

  1. Increase Resources:

    • Scale up the instance resources (CPU/memory).
    • Consider horizontal scaling by adding more instances.
  2. Optimize Application:

    • Work with developers to profile and optimize slow code paths.
  3. Load Balancing:

    • Implement load balancing strategies to distribute traffic across multiple instances.

High Tomcat CPU Utilization

Explanation

This alert monitors the CPU usage of the Tomcat process over a 5-minute interval. A consistently high CPU usage indicates excessive load on the server.
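
CPU usage of the Tomcat process can also be confirmed directly on the host. This is a minimal sketch; <PID> is the Tomcat process ID.

  # Find the Tomcat JVM process ID
  ps aux | grep tomcat
  # Show a single snapshot of CPU and memory usage for that process
  top -b -n 1 -p <PID>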

Scenarios Triggering the Alert

  1. High request volume or unexpected traffic spikes.
  2. Inefficient application code causing excessive CPU usage.
  3. Resource-intensive background tasks or scheduled jobs.
  4. High garbage collection (GC) frequency or long GC pauses.
  5. Insufficient CPU resources for the load.

C3 Data Collection

  1. Identify the Affected Instance:

    • Use the alert details to find the instance with high CPU usage.
  2. Analyze CPU Trends:

    • Query CPU usage for the last hour in Prometheus:
      rate(process_cpu_seconds_total{instance="$instance"}[1h]) * 100
      
  3. Correlate with Request Metrics:

    • Check for high request rates:
      rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
      
    • High request rates can overload the Tomcat server, leading to increased CPU utilization.
  4. Logs:

    • Review recent Tomcat logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      

Dependent Metrics

  • Catalina_GlobalRequestProcessor_requestCount: For correlation with request volume.
  • Catalina_ThreadPool_currentThreadsBusy: For thread pool saturation.
  • jvm_gc_collection_seconds_sum: For GC activity impact on CPU.

C3 Remedy

  1. Restart Tomcat Service:

    • If CPU usage is critically high and affecting application availability:
      systemctl restart tomcat-pssb
      
  2. Check Logs for Errors or Loops:

    • Analyze application logs for issues causing high CPU:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  3. Identify Traffic Spike:

    • Correlate with incoming requests using requestCount metrics.
  4. Inform DevOps:

    • If no immediate resolution is possible, escalate to DevOps.

DevOps Remedy

If the C3 team cannot resolve the issue:

  1. Application Profiling:

    • Use tools like jstack to identify threads consuming high CPU (a sketch for matching hot OS threads to the dump follows this list):
      jstack <PID> > /tmp/thread_dump.txt
      
  2. Optimize Code:

    • Investigate and fix inefficient application code or long-running queries.
  3. Garbage Collector Optimization:

    • Tune GC settings if frequent GC is causing high CPU usage:
      -XX:+UseG1GC -XX:MaxGCPauseMillis=200
      
  4. Scale Resources:

    • Add more CPU cores or scale horizontally if the current setup cannot handle the load.
  5. Performance Monitoring:

    • Implement request filtering or rate limiting if malicious traffic is suspected.
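
To match a hot OS thread to an entry in the jstack dump, the native thread ID can be converted to hex and looked up in the dump. This is a minimal sketch using standard Linux and JDK tools; <PID>, <TID>, and <hex-tid> are placeholders.

  # List the busiest threads of the Tomcat process (the TID column)
  top -H -b -n 1 -p <PID>
  # Convert the thread ID to hex; jstack reports native thread IDs as nid=0x...
  printf '%x\n' <TID>
  # Find that thread in the dump taken earlier
  grep -A 20 'nid=0x<hex-tid>' /tmp/thread_dump.txt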