When alerts are triggered, the C3 team receives notifications via email and is expected to follow the procedures outlined below.
Data Collection: When an alert fires, the C3 team should first gather the relevant data to understand the source of the issue (a starting-point example follows this overview).
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
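As a first data-collection step, the currently firing alerts can be listed from the Prometheus HTTP API; the host and port below are placeholders/defaults and are not confirmed by this runbook:
curl -s 'http://<prometheus-host>:9090/api/v1/alerts'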
Dashboard & Row | Alert Name | Panel | Panel Description | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.1.2 | Tomcat Status Check | Status | Checks whether the Tomcat exporter or application instance is up and running. | up{job="$job", instance="$app$node"} | Verifies if the application or exporter instance is running. | Boolean (0/1) | up | Instance status (1 = up, 0 = down). | 0 or 1 | 0 | N/A | 1 |
1.3.4 | JVM Memory Usage | JVM Memory Usage [heap] | Monitors the percentage of heap memory used by the JVM relative to the total physical memory available. | jvm_memory_used_bytes{area="$memarea",job="$job", instance="$app$node"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{job="$job", instance="$app$node"} *100 | Tracks the percentage of heap memory used by the JVM. | 0–100% | jvm_memory_used_bytes, java_lang_OperatingSystem_TotalPhysicalMemorySize | Heap memory used and total physical memory. | 0–100% | >90% | 80–90% | <80% |
2.3.2 | High Tomcat Request Processing Time | Average Processing Time | Monitors the average processing time (in milliseconds) for servlet requests over 5 minutes. | avg(rate(Catalina_GlobalRequestProcessor_processingTime{instance=~"$instance"}[5m])) | Measures the average time taken to process servlet requests. | Milliseconds (ms) | Catalina_GlobalRequestProcessor_processingTime | Measures how long Tomcat servlets take to process requests. | Positive values | >9000 ms | 3000–9000 ms | <3000 ms |
1.1.4 | High Tomcat CPU Utilization | CPU Utilization | Tracks the percentage of CPU utilization on the node over a 5-minute window. | rate(process_cpu_seconds_total{job="tomcat_exporter_pssb",instance=~"pssb.*"}[5m])*100 | Tracks the CPU usage percentage of Tomcat processes. | 0–100% | process_cpu_seconds_total | Percentage of CPU used by Tomcat processes. | 0–100% | >50% | N/A | <50% |
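Any query from the table above can also be evaluated ad hoc against the Prometheus HTTP API to confirm the firing condition. A minimal example using the status-check query (host and port are assumed defaults):
curl -s -G 'http://<prometheus-host>:9090/api/v1/query' --data-urlencode 'query=up{job="tomcat_exporter_pssb"}'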
This alert monitors whether the Tomcat exporter or application instance is up and running. If the instance is down, it could indicate server failures, network issues, or a problem with the application itself.
A common cause is that the tomcat_exporter service is stopped or has crashed.
Severity | Description |
---|---|
Critical | Instance status is 0 (down). |
Warning | Not applicable for this alert. |
OK | Instance status is 1 (up). |
Check the Status of the Exporter:
systemctl status tomcat-pssb
ps aux | grep tomcat
journalctl -u tomcat-pssb
Ping the Exporter:
Test connectivity to the exporter using cURL:
curl http://<exporter-host>:9115/metrics
If you see metrics data, the exporter is up and reachable. Otherwise, record the error message.
If the configuration is incorrect or missing, notify DevOps.
Network Accessibility:
Use ping or telnet to verify the server's network connection:
ping <exporter-host>
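If ICMP is blocked, a TCP-level check against the exporter port used in the cURL step above (9115) can also confirm reachability:
telnet <exporter-host> 9115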
Restart the Exporter:
systemctl restart tomcat-pssb
systemctl status tomcat-pssb
Resolve Network Issues:
Verify Configurations:
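One way to verify the scrape configuration is to confirm the job is still defined on the Prometheus server; the config path below is an assumption and should be adjusted to the environment:
grep -B 2 -A 8 'tomcat_exporter_pssb' /etc/prometheus/prometheus.yml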
If the above remedies fail:
Investigate Logs:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Prometheus Server Debugging:
journalctl -xeu prometheus
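If Prometheus itself is running but the instance is still reported as down, the targets endpoint shows the last scrape error for each target (host and port are assumed defaults):
curl -s 'http://<prometheus-host>:9090/api/v1/targets'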
This alert monitors the percentage of heap memory used by the JVM relative to the total physical memory available. High memory usage could indicate potential memory leaks, inefficient application code, or insufficient resource allocation.
Severity | Description |
---|---|
Critical | Heap memory usage exceeds 90%. |
Warning | Heap memory usage is 80–90%. |
OK | Heap memory usage is below 80%. |
Identify the Affected Instance:
Note the instance label reporting high memory usage.
Check Memory Utilization Trends:
jvm_memory_used_bytes{area="heap",instance="$instance"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{instance="$instance"} * 100
Garbage Collection Activity:
rate(jvm_gc_collection_seconds_sum{instance="$instance"}[5m])
Thread Utilization:
Catalina_ThreadPool_currentThreadsBusy{instance="$instance"}
Logs:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
jvm_gc_collection_seconds_sum: For GC activity and its impact on memory.
jvm_memory_max_bytes: To check the maximum allocated JVM memory.
Catalina_ThreadPool_currentThreadsBusy: To correlate memory usage with thread activity.
Restart the Tomcat Service:
systemctl restart tomcat-pssb
Analyze Logs for Memory Leaks:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Inform DevOps:
If heap sizing needs adjustment, ask DevOps to review the CATALINA_OPTS configuration. For example:
-Xms1024M -Xmx4300M
Also consider tuning the garbage collector (e.g., G1GC or another GC suitable for the workload).
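A minimal sketch of how such a change might look, assuming CATALINA_OPTS is set in setenv.sh under the Tomcat installation referenced by the log paths in this runbook; the exact file and values should be confirmed with DevOps:
# /opt/ps/tomcat/bin/setenv.sh (assumed location)
export CATALINA_OPTS="-Xms1024M -Xmx4300M -XX:+UseG1GC"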
This alert's query calculates the average rate of processing time (in milliseconds) for servlet requests over a 5-minute window.
Severity | Description |
---|---|
Critical | Average processing time exceeds 9000ms. |
Warning | Average processing time is 3000–9000ms. |
OK | Average processing time is below 3000ms. |
Identify the Affected Instance:
Note the instance and its processingTime details.
Gather Metrics via Prometheus:
rate(Catalina_GlobalRequestProcessor_processingTime{instance="$instance"}[1h])
Analyze Traffic Volume:
rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
Check Resource Utilization:
node_cpu_seconds_total{mode="idle", instance="$instance"}
jvm_memory_used_bytes{instance="$instance"}
Tomcat Logs:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Catalina_GlobalRequestProcessor_requestCount: To identify traffic spikes.
jvm_memory_used_bytes: To check JVM memory utilization.
node_cpu_seconds_total: To analyze system CPU usage.
Restart Tomcat (if needed):
systemctl restart tomcat-pssb
Review Application Logs:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Notify DevOps:
If the C3 team cannot resolve the issue:
Increase Resources:
Optimize Application:
Load Balancing:
This alert monitors the CPU usage of the Tomcat process over a 5-minute interval. A consistently high CPU usage indicates excessive load on the server.
Identify the Affected Instance:
Note the instance reporting high CPU usage.
Analyze CPU Trends:
rate(process_cpu_seconds_total{instance="$instance"}[1h]) * 100
Correlate with Request Metrics:
rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
Logs:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Catalina_GlobalRequestProcessor_requestCount: For correlation with request volume.
catalina_threads_busy: For thread pool saturation.
jvm_gc_collection_seconds_sum: For GC activity impact on CPU.
Restart Tomcat Service:
systemctl restart tomcat-pssb
Check Logs for Errors or Loops:
tail -n 100 /opt/ps/tomcat/logs/catalina.out
tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
Identify Traffic Spike:
Check the requestCount metrics for a corresponding increase in traffic.
Inform DevOps:
If the C3 team cannot resolve the issue:
Application Profiling:
Use jstack to identify threads consuming high CPU:
jstack <PID> > /tmp/thread_dump.txt
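A common companion step (a general JVM troubleshooting technique, not specific to this runbook) is to list the busiest threads and convert their IDs to hex so they can be matched against the nid= values in the thread dump:
top -H -p <PID>
printf '%x\n' <TID>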
Optimize Code:
Garbage Collector Optimization:
-XX:+UseG1GC -XX:MaxGCPauseMillis=200
Scale Resources:
Performance Monitoring: