JMX Alerts

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.

Alert Handling Procedure

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.


Alerts, Thresholds and Priorities

Dashboard & Row: 1.1.2
Alert Name: Tomcat Status Check
Panel: Status
Panel Description: Checks whether the Tomcat exporter or application instance is up and running.
Query: up{job="$job", instance="$app$node"}
Query Description: Verifies if the application or exporter instance is running.
Query Operating Range: Boolean (0/1)
Metrics: up
Metric Description: Instance status (1 = up, 0 = down).
Metric Operating Range: 0 or 1
Severity Thresholds: CRITICAL = 0 | WARNING = N/A | OK = 1

Dashboard & Row: 1.3.4
Alert Name: JVM Memory Usage
Panel: JVM Memory Usage [heap]
Panel Description: Monitors the percentage of heap memory used by the JVM relative to the total physical memory available.
Query: jvm_memory_used_bytes{area="$memarea",job="$job", instance="$app$node"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{job="$job", instance="$app$node"} * 100
Query Description: Tracks the percentage of heap memory used by the JVM.
Query Operating Range: 0–100%
Metrics: jvm_memory_used_bytes, java_lang_OperatingSystem_TotalPhysicalMemorySize
Metric Description: Heap memory used and total physical memory.
Metric Operating Range: 0–100%
Severity Thresholds: CRITICAL = >90% | WARNING = 80–90% | OK = <80%

Dashboard & Row: 2.3.2
Alert Name: High Tomcat Request Processing Time
Panel: Average Processing Time
Panel Description: Monitors the average processing time (in milliseconds) for servlet requests over 5 minutes.
Query: avg(rate(Catalina_GlobalRequestProcessor_processingTime{instance=~"$instance"}[5m]))
Query Description: Measures the average time taken to process servlet requests.
Query Operating Range: Milliseconds (ms)
Metrics: Catalina_GlobalRequestProcessor_processingTime
Metric Description: Measures how long Tomcat servlets take to process requests.
Metric Operating Range: Positive values
Severity Thresholds: CRITICAL = >9 sec | WARNING = 3–9 sec | OK = <3 sec

Dashboard & Row: 1.1.4
Alert Name: High Tomcat CPU Utilization
Panel: CPU Utilization
Panel Description: Tracks the percentage of CPU utilization on the node over a 5-minute window.
Query: rate(process_cpu_seconds_total{job="tomcat_exporter_pssb",instance=~"pssb.*"}[5m]) * 100
Query Description: Tracks the CPU usage percentage of Tomcat processes.
Query Operating Range: 0–100%
Metrics: process_cpu_seconds_total
Metric Description: Percentage of CPU used by Tomcat processes.
Metric Operating Range: 0–100%
Severity Thresholds: CRITICAL = >50% | WARNING = N/A | OK = <50%

Tomcat Status Check

Explanation

This alert monitors whether the Tomcat exporter or application instance is up and running. If the instance is down, it could indicate server failures, network issues, or a problem with the application itself.
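
As a quick cross-check, the current value of the up metric can be read directly from the Prometheus HTTP API. This is a minimal sketch: the Prometheus host is a placeholder, 9090 is the default Prometheus port, and the tomcat_exporter_pssb job label is taken from the table above, so the exact values may differ in this environment.

  # Returns 1 per instance that is up, 0 for instances that are down
  curl -G --data-urlencode 'query=up{job="tomcat_exporter_pssb"}' \
    http://<prometheus-host>:9090/api/v1/query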

Scenarios Triggering the Alert

  1. The tomcat_exporter service is stopped or crashed.
  2. Network connectivity issues between the Prometheus server and the exporter.
  3. Firewall or IP restrictions blocking Prometheus from scraping the exporter.
  4. Misconfigured or missing exporter settings in Prometheus.

Thresholds

Severity Description
Critical Instance status is 0 (down).
Warning Not applicable for this alert.
OK Instance status is 1 (up).

C3 Data Collection

  1. Check the Status of the Exporter:

    • Log in to the server hosting the exporter.
    • Run the following command to verify if the service is active:
      systemctl status tomcat-pssb
      
      OR
      ps aux | grep tomcat
      
    • Note any errors in the service logs using:
      journalctl -u tomcat-pssb
      
  2. Ping the Exporter:

    • Test connectivity to the exporter using cURL:

      curl http://<exporter-host>:9115/metrics
      
    • If you see metrics data, the exporter is up and reachable. Otherwise, record the error message.

    • If the configuration is incorrect or missing, notify DevOps.

  3. Network Accessibility:

    • Use ping or telnet to verify the server’s network connection and that the exporter port (9115, as used in the cURL check above) is reachable:
      ping <exporter-host>
      telnet <exporter-host> 9115
      

C3 Remedy

  1. Restart the Exporter:

    • If the exporter is down, restart it:
      systemctl restart tomcat-pssb
      
    • Verify it is running:
      systemctl status tomcat-pssb
      
  2. Resolve Network Issues:

    • Check if the server is reachable from Prometheus.
    • Coordinate with the network team to resolve connectivity or firewall issues.
  3. Verify Configurations:

    • Ensure the correct scrape target and port are specified in the Prometheus config (see the sketch below).
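
A minimal sketch of this check, assuming the default configuration path /etc/prometheus/prometheus.yml and the tomcat_exporter_pssb job name from the table above (both may differ in this environment):

  # Show the scrape job definition and its targets
  grep -A 10 'tomcat_exporter_pssb' /etc/prometheus/prometheus.yml
  # Validate the configuration file syntax
  promtool check config /etc/prometheus/prometheus.yml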

DevOps Remedy

If the above remedies fail:

  1. Investigate Logs:

    • Analyze the exporter logs for errors or misconfigurations.
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  2. Prometheus Server Debugging:

    • Review Prometheus scrape logs to identify errors:
      journalctl -xeu prometheus
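
    • The state of every scrape target can also be checked from the Prometheus HTTP API (a minimal sketch; the host placeholder and default port 9090 are assumptions for this environment):
      curl http://<prometheus-host>:9090/api/v1/targets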
      

High JVM Memory Usage

Explanation

This alert monitors the percentage of heap memory used by the JVM relative to the total physical memory available. High memory usage could indicate potential memory leaks, inefficient application code, or insufficient resource allocation.
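
In addition to the Prometheus query, heap usage can be cross-checked on the host itself with standard JDK tooling. This is a minimal sketch; it assumes jstat is available on the server, and <PID> is the Tomcat process ID.

  # Find the Tomcat JVM process ID
  ps aux | grep tomcat
  # Print heap and GC statistics once per second, five times
  jstat -gc <PID> 1000 5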

Scenarios Triggering the Alert

  1. Application memory leaks or inefficient memory usage.
  2. Insufficient heap size allocated to the JVM.
  3. High traffic leading to increased memory usage for session management or caching.
  4. Inefficient or unoptimized garbage collection settings.

Thresholds

Severity Description
Critical Heap memory usage exceeds 90%.
Warning Heap memory usage is 80–90%.
OK Heap memory usage is below 80%.

C3 Data Collection

  1. Identify the Affected Instance:

    • Use the alert details to get the instance with high memory usage.
  2. Check Memory Utilization Trends:

    • Query the heap memory usage percentage (graph it over the last hour in Prometheus to identify the trend):
      jvm_memory_used_bytes{area="heap",instance="$instance"} / ignoring(area) group_left java_lang_OperatingSystem_TotalPhysicalMemorySize{instance="$instance"} * 100
      
  3. Garbage Collection Activity:

    • Analyze garbage collection frequency and duration:
      rate(jvm_gc_collection_seconds_sum{instance="$instance"}[5m])
      
  4. Thread Utilization:

    • Check for thread pool saturation:
      Catalina_ThreadPool_currentThreadsBusy{instance="$instance"}
      
  5. Logs:

    • Collect application logs for issues:
       tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
       tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
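
If a memory leak is suspected, a live-object histogram can complement the data above. This is a minimal sketch using standard JDK tooling; note that jmap -histo:live forces a full GC, so coordinate with DevOps before running it on a loaded instance. <PID> is the Tomcat process ID.

  # Show the 20 object types occupying the most heap space
  jmap -histo:live <PID> | head -n 20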
      

Dependent Metrics

  • jvm_gc_collection_seconds_sum: For GC activity and impact on memory.
  • jvm_memory_max_bytes: To check the maximum allocated JVM memory.
  • Catalina_ThreadPool_currentThreadsBusy: To correlate memory usage with thread activity.

C3 Remedy

  1. Restart the Tomcat Service:

    • If memory usage is critically high and causing application instability:
      systemctl restart tomcat-pssb
      
  2. Analyze Logs for Memory Leaks:

    • Look for “OutOfMemoryError” or similar warnings in logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  3. Inform DevOps:

    • Notify DevOps if memory allocation changes or application profiling is required.

DevOps Remedy

  1. Optimize Memory Allocation:
    • Review and adjust the JVM heap size in the CATALINA_OPTS configuration (a sketch of where these options are typically set follows this list). For example:
      -Xms1024M -Xmx4300M
      
    • Ensure the heap size is appropriate for the workload.
  2. Application Debugging:
    • Work with the development team to identify memory leaks or inefficiencies in the code. Tools like VisualVM or JProfiler can help.
  3. Garbage Collector Tuning:
    • If GC activity is inefficient, tune the garbage collector settings (e.g., switch to G1GC or other suitable GC for the workload).
  4. Scaling Resources:
    • Add additional resources (memory, CPU) to the server or scale out horizontally by adding more nodes.
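
A minimal sketch of where such JVM options are commonly set, assuming a standard Tomcat layout in which catalina.sh picks up bin/setenv.sh (the exact path and mechanism used by the tomcat-pssb service may differ):

  # $CATALINA_HOME/bin/setenv.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Xms1024M -Xmx4300M"

Restart the service afterwards (systemctl restart tomcat-pssb) for the new options to take effect.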

High Tomcat Request Processing Time

Explanation

The query calculates the average servlet request processing time (in milliseconds) as a rate over a 5-minute window.
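
As a quick manual cross-check of end-to-end latency, a timed request can be sent to the application. This is a minimal sketch; the host, port, and context path are placeholders for this deployment, and the measured time includes network overhead in addition to Tomcat processing time.

  # Print only the total request time, in seconds
  curl -o /dev/null -s -w '%{time_total}\n' http://<app-host>:<port>/<context-path>/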

Scenarios Triggering the Alert

  1. The Tomcat server is under heavy load due to a large number of incoming requests.
  2. Problems in the application code, such as inefficient algorithms or threads competing for resources.
  3. Delays caused by external systems like databases or APIs.
  4. Server resource limitations, such as high CPU usage, insufficient memory, or slow disk operations.

Thresholds

Severity Description
Critical Average processing time exceeds 9000ms.
Warning Average processing time is 3000–9000ms.
OK Average processing time is below 3000ms.

C3 Data Collection

  1. Identify the Affected Instance:

    • Check the alert message for the instance and processingTime details.
  2. Gather Metrics via Prometheus:

    • Query the processing time for the past 1 hour to identify trends:
      rate(Catalina_GlobalRequestProcessor_processingTime{instance="$instance"}[1h])
      
  3. Analyze Traffic Volume:

    • Query the request count to see if high traffic correlates:
      rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
      
  4. Check Resource Utilization:

    • Validate CPU and memory usage using these queries:
      node_cpu_seconds_total{mode="idle", instance="$instance"}
      jvm_memory_used_bytes{instance="$instance"}
      
  5. Tomcat Logs:

    • Collect logs to identify specific errors or delays:
       tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
       tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
    • Look for long-running requests or threads.

Dependent Metrics

  • Catalina_GlobalRequestProcessor_requestCount: To identify traffic spikes.
  • jvm_memory_used_bytes: To check JVM memory utilization.
  • node_cpu_seconds_total: To analyze system CPU usage.

C3 Remedy

  1. Restart Tomcat (if needed):

    • If the issue is severe and unresolvable quickly, restart the Tomcat service:
      systemctl restart tomcat-pssb
      
  2. Review Application Logs:

    • Check for slow requests or errors in application logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
    
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
    
  3. Notify DevOps:

    • If traffic spikes are identified, notify DevOps to investigate and implement load balancing or scaling.

DevOps Remedy

If the C3 team cannot resolve the issue:

  1. Increase Resources:

    • Scale up the instance resources (CPU/memory).
    • Consider horizontal scaling by adding more instances.
  2. Optimize Application:

    • Work with developers to profile and optimize slow code paths.
  3. Load Balancing:

    • Implement load balancing strategies to distribute traffic across multiple instances.

High Tomcat CPU Utilization

Explanation

This alert monitors the CPU usage of the Tomcat process over a 5-minute interval. A consistently high CPU usage indicates excessive load on the server.
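
CPU usage of the Tomcat process can also be confirmed directly on the host. This is a minimal sketch; <PID> is the Tomcat process ID.

  # Find the Tomcat JVM process ID
  ps aux | grep tomcat
  # Show a single snapshot of CPU and memory usage for that process
  top -b -n 1 -p <PID>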

Scenarios Triggering the Alert

  1. High request volume or unexpected traffic spikes.
  2. Inefficient application code causing excessive CPU usage.
  3. Resource-intensive background tasks or scheduled jobs.
  4. High garbage collection (GC) frequency or long GC pauses.
  5. Insufficient CPU resources for the load.

C3 Data Collection

  1. Identify the Affected Instance:

    • Use the alert details to find the instance with high CPU usage.
  2. Analyze CPU Trends:

    • Query CPU usage for the last hour in Prometheus:
      rate(process_cpu_seconds_total{instance="$instance"}[1h]) * 100
      
  3. Correlate with Request Metrics:

    • Check for high request rates:
      rate(Catalina_GlobalRequestProcessor_requestCount{instance="$instance"}[5m])
      
    • High request rates can overload the Tomcat server, leading to increased CPU utilization.
  4. Logs:

    • Review recent Tomcat logs:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      

Dependent Metrics

  • Catalina_GlobalRequestProcessor_requestCount: For correlation with request volume.
  • Catalina_ThreadPool_currentThreadsBusy: For thread pool saturation.
  • jvm_gc_collection_seconds_sum: For GC activity impact on CPU.

C3 Remedy

  1. Restart Tomcat Service:

    • If CPU usage is critically high and affecting application availability:
      systemctl restart tomcat-pssb
      
  2. Check Logs for Errors or Loops:

    • Analyze application logs for issues causing high CPU:
      tail -n 100 /opt/ps/tomcat/logs/catalina.out
      
      tail -n 100 /opt/ps/tomcat/logs/tomcat-logger.log
      
  3. Identify Traffic Spike:

    • Correlate with incoming requests using requestCount metrics.
  4. Inform DevOps:

    • If no immediate resolution is possible, escalate to DevOps.

DevOps Remedy

If the C3 team cannot resolve the issue:

  1. Application Profiling:

    • Use tools like jstack to identify threads consuming high CPU (a sketch for matching hot OS threads to the dump follows this list):
      jstack <PID> > /tmp/thread_dump.txt
      
  2. Optimize Code:

    • Investigate and fix inefficient application code or long-running queries.
  3. Garbage Collector Optimization:

    • Tune GC settings if frequent GC is causing high CPU usage:
      -XX:+UseG1GC -XX:MaxGCPauseMillis=200
      
  4. Scale Resources:

    • Add more CPU cores or scale horizontally if the current setup cannot handle the load.
  5. Performance Monitoring:

    • Implement request filtering or rate limiting if malicious traffic is suspected.
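
To match a hot OS thread to an entry in the jstack dump, the native thread ID can be converted to hex and looked up in the dump. This is a minimal sketch using standard Linux and JDK tools; <PID>, <TID>, and <hex-tid> are placeholders.

  # List the busiest threads of the Tomcat process (the TID column)
  top -H -b -n 1 -p <PID>
  # Convert the thread ID to hex; jstack reports native thread IDs as nid=0x...
  printf '%x\n' <TID>
  # Find that thread in the dump taken earlier
  grep -A 20 'nid=0x<hex-tid>' /tmp/thread_dump.txt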