Hazelcast Monitoring
Hazelcast is a distributed in-memory data grid platform for Java. Its architecture supports high scalability and data distribution in a clustered environment, with auto-discovery of nodes and intelligent synchronization. Hazelcast offers features such as distributed data structures, distributed compute, and distributed queries.
Reference Links
- Introduction to Hazelcast Link
- Understand Hazelcast Link
- Further References Link
Alert Reference Links
- Grafana dashboards for Scylla Link
- Alerts Link
| Row | Panel | Query | Query Description | Query Operating Range | Metrics | Metric Description | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.4 | Member Count | com_hazelcast_Metrics_size{job="tomcat_exporter_pssb"} | Shows the up status of Hazelcast members | 5 | com_hazelcast_Metrics_size | | < 6 | - | equal to 6 |
| 1.5 | HZ Low cache hits | sum by (instance) (com_hazelcast_Metrics_hits{job="tomcat_exporter_pssb",exported_instance=~"ps_sb_wsm"}) | Shows the cache hits for each instance | positive values | com_hazelcast_Metrics_hits | | 0 < hits < 10 | 10 < hits < 100 | 100 < hits < 35000 |
| 1.5 | High HZ Cache Hits | sum by (instance) (com_hazelcast_Metrics_hits{job="tomcat_exporter_pssb",exported_instance=~"ps_sb_wsm"}) | Shows the cache hits for each instance | positive values | com_hazelcast_Metrics_hits | | hits > 50000 | 35000 < hits < 50000 | - |
| 1.3 | HZHighConnections | com_hazelcast_Metrics_activeCount{instance=~"pssb.*"} | Shows active connections to Hazelcast | positive values | com_hazelcast_Metrics_activeCount | | > 12 | 10 < connections < 12 | 2 - 10 |
| 1.3 | HZLowConnections | com_hazelcast_Metrics_activeCount{instance=~"pssb.*"} | Shows active connections to Hazelcast | positive values | com_hazelcast_Metrics_activeCount | | 0 | 0 < connections < 2 | 2 - 10 |
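To spot-check any of the queries in the table outside Grafana, they can be run directly against the Prometheus HTTP API. A minimal sketch for the Member Count row, assuming Prometheus is reachable at http://prometheus:9090 (the hostname is an assumption; substitute your own server) and that curl and jq are installed:

```bash
#!/usr/bin/env bash
# Evaluate the member-count query from the table against the Prometheus HTTP API.
# http://prometheus:9090 is an assumed address; replace with your Prometheus server.
PROM="http://prometheus:9090"
QUERY='com_hazelcast_Metrics_size{job="tomcat_exporter_pssb"}'

# --data-urlencode handles the braces and quotes in the PromQL expression.
curl -s "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'
# A value below 6 corresponds to the CRITICAL threshold in the table above.
```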
Alerts and C3 Procedures
When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.
- Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
- Severity-Based Actions:
  - Low-Priority Alerts: If the priority level is low and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
  - Escalation to DevOps: If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
- Severity-Specific Notifications:
  - Warning Alerts: For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
  - Critical Alerts: For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.
Preliminary Steps
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
Hazelcast Member Count Changed
Alertname: HazelcastMemberCountChanged
A change in the Hazelcast member count means that nodes (or members) have either joined or left the Hazelcast cluster. These changes can have significant implications for the cluster’s stability, performance, and resource distribution.
C3 Data Collection
To analyze changes in the Hazelcast cluster size:
- Review Grafana Panel
  - Access the relevant Grafana dashboard panel.
  - Focus on the data for the last 5 minutes to identify any recent changes in member count or anomalies in node activity.
- Verify Node Visibility Across the Cluster
  - Ensure that each of the 6 nodes in the cluster can detect the other 5 nodes.
  - Use metrics and logs to verify node connectivity and synchronization.
- Check Member Group Size Metric
  - Use the following metric to confirm the cluster size: com_hazelcast_Metrics_memberGroupsSize
  - Analyze the metric data in Grafana to ensure that all nodes report a total cluster size of 6 (see the per-member health-check sketch after this list).
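If the Hazelcast HTTP health-check endpoint is enabled on the members (an assumption about this deployment; it must be switched on in the Hazelcast configuration), each node's own view of the cluster size can also be read directly. The node names below are placeholders, and port 54321 is taken from the firewall rules later on this page:

```bash
#!/usr/bin/env bash
# Ask each member for its own view of the cluster size via the Hazelcast
# health-check endpoint. Node names are placeholders and port 54321 comes
# from the firewall section of this runbook; adjust both to the deployment.
for node in pssb-node1 pssb-node2 pssb-node3 pssb-node4 pssb-node5 pssb-node6; do
  size=$(curl -s "http://${node}:54321/hazelcast/health" | jq -r '.clusterSize')
  echo "${node}: clusterSize=${size}"   # every node should report 6
done
```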
Dependent Metrics
com_hazelcast_Metrics_memberGroupsSize
- This metric tracks the size of each member group in the Hazelcast cluster.
- Each node in the cluster should report the same group size, which should match the total number of nodes in the cluster. For instance, in a 6-node cluster, this metric should consistently show a value of 6 across all nodes.
- If the value deviates, it indicates that one or more nodes have left or joined the cluster, potentially disrupting cluster consistency and workload distribution.
C3 Remedy
The C3 team should focus on basic troubleshooting and operational checks for Hazelcast cluster member count issues. Follow these steps:
- Verify Network Communication
  - Use the ping command to test connectivity between nodes and confirm that network communication is functioning as expected (see the connectivity-check sketch after this list).
- Check Firewall Rules
  - Ensure that the required Hazelcast ports are open and allowed in the firewall:
    sudo ufw allow 54321/tcp
    sudo ufw allow 6000:6999/tcp
    sudo ufw reload
- Restart Tomcat Service
  - If specific nodes are problematic, restart the tomcat-pssb service on those nodes:
    sudo systemctl restart tomcat-pssb
- Review Node Health
  - Check the basic health of the affected nodes, including CPU, memory, and disk usage, to rule out resource-related issues.
- Escalate if Necessary
  - If the issue persists after performing the above checks, or if the cluster remains unstable, escalate the problem to the DevOps team for further investigation and resolution.
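A minimal connectivity-check sketch combining the ping and firewall steps above. The node names are placeholders, and the port comes from the firewall rules; both are assumptions to adapt:

```bash
#!/usr/bin/env bash
# Basic network checks from this host to the other cluster members.
# Node names are placeholders; the port matches the ufw rules above.
NODES="pssb-node1 pssb-node2 pssb-node3 pssb-node4 pssb-node5 pssb-node6"
PORT=54321

for node in ${NODES}; do
  # ICMP reachability
  if ping -c 2 -W 2 "${node}" > /dev/null 2>&1; then
    echo "${node}: ping OK"
  else
    echo "${node}: ping FAILED"
  fi
  # TCP reachability on the Hazelcast port
  if nc -z -w 2 "${node}" "${PORT}"; then
    echo "${node}: port ${PORT} open"
  else
    echo "${node}: port ${PORT} closed or filtered"
  fi
done
```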
DevOps Remedy
- Inspect Configuration Generation Code
  - If the cluster is dynamically configured, review the configuration generation code to ensure it adapts properly to network changes or node additions.
  - Verify that settings such as network interfaces, port configurations, and node discovery mechanisms align with the current infrastructure.
- Collaborate with Development Team
  - Discuss any necessary changes to the configuration generation process with the development team. This may involve adjusting settings to better suit the cluster’s network architecture or resolving issues related to node discovery or load balancing.
Hazelcast High Cache Hits
Alertname: HZHighCacheHits
High cache hits in Hazelcast indicate that data requests are being successfully served from the cache rather than fetching from an underlying data store or performing computations. This is generally a positive sign of efficient caching, but excessive cache hits in specific scenarios can have implications:
Positive Effects:
- Reduced Latency: Requests are served faster since accessing the cache is quicker than querying the database or performing I/O operations.
- Improved Throughput: The system can handle more requests as less time is spent fetching data.
Potential Concerns:
- Increased Memory Pressure: Serving high cache hits may indicate that a large amount of data is being held in memory, which could lead to memory contention or out-of-memory (OOM) issues if resources are insufficient.
- Unbalanced Workload: High cache hits on specific nodes could mean uneven traffic distribution or hotspots in the cluster.
- Data Staleness Risks: If the cache isn’t updated frequently, users might receive outdated information.
C3 Data Collection
- Check Proper Load Balancing:
  - Ensure traffic is evenly distributed across all Hazelcast nodes to prevent hotspots or resource contention on specific nodes.
- Monitor Node Memory Usage:
  - Log in to the OS dashboard in Grafana and review the RAM usage of each node to ensure adequate memory resources are available.
  - Nodes with high cache usage might indicate uneven workload distribution or memory pressure.
- Analyze Database Response Times:
  - Slow database response times can cause applications to over-rely on cached data.
  - Verify the database query response times and identify any increases or trends in latency using relevant Grafana panels or database monitoring tools.
By systematically collecting this data, you can assess whether the high cache hits result from inefficient configurations, workload imbalances, or external system dependencies, and proceed with corrective actions effectively.
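One way to quantify “high” hits is to look at the rate of change rather than the raw counter. A sketch using the Prometheus HTTP API, assuming Prometheus at http://prometheus:9090 (an assumed address) and that com_hazelcast_Metrics_hits behaves as a monotonically increasing counter (also an assumption):

```bash
#!/usr/bin/env bash
# Per-instance cache-hit rate over the last 5 minutes.
# http://prometheus:9090 is an assumed address; the inner selector mirrors
# the HZ cache-hits panel from the table at the top of this page.
PROM="http://prometheus:9090"
QUERY='sum by (instance) (rate(com_hazelcast_Metrics_hits{job="tomcat_exporter_pssb",exported_instance=~"ps_sb_wsm"}[5m]))'

curl -s "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[] | "\(.metric.instance) \(.value[1]) hits/s"'
```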
Dependent Metrics
- Query Latency Metrics:
  - rlatencya: Measures the average response time of database queries in milliseconds.
  - scylla_storage_proxy_coordinator_read_latency: Monitors the read latency on ScyllaDB.
  - scylla_storage_proxy_coordinator_write_latency: Monitors the write latency on ScyllaDB.
- Node Memory Usage Metrics:
  - node_memory_MemTotal_bytes: Total available memory on the node.
  - node_memory_MemAvailable_bytes: Memory available for new processes without swapping.
  - node_memory_Cached_bytes: Memory used for file system caching.
  - node_memory_SwapFree_bytes: Amount of unused swap space.
- Cache Metrics:
  - com_hazelcast_Metrics_freeSpace: Indicates the total free disk space available on the Hazelcast node.
  - com_hazelcast_Metrics_usableSpace: Represents the usable disk space available for Hazelcast operations, factoring in file system and user permissions.
- HAProxy Metrics (see the admin-socket sketch after this list):
  - haproxy_server_current_sessions: Number of active sessions handled by each backend server.
  - haproxy_server_bytes_in_total: Total bytes received by each backend server.
  - haproxy_server_bytes_out_total: Total bytes sent by each backend server.
  - haproxy_backend_up: Indicates whether backend servers are up or down.
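To verify session distribution directly on the load balancer, independent of Prometheus, the HAProxy admin socket can be queried. The socket path below is a common default and an assumption about this setup; check the `stats socket` line in haproxy.cfg for the actual location:

```bash
# Show per-server session counts from the HAProxy admin socket.
# /var/run/haproxy.sock is an assumed path. CSV fields 1, 2, and 5 are
# the backend name, server name, and current session count (scur).
echo "show stat" | socat stdio /var/run/haproxy.sock \
  | cut -d, -f1,2,5 | column -s, -t
```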
C3 Remedy
When handling scenarios related to increased cache hits or improper load balancing in a Hazelcast cluster, follow these steps to resolve the issues effectively:
- Address Load Balancing Issues
  If HAProxy is not distributing traffic evenly across nodes in a specific project:
  - Verify the status of HAProxy to ensure it is running properly:
    systemctl status haproxy
  - If the status indicates issues, restart the HAProxy service to restore functionality:
    systemctl restart haproxy
  - Monitor HAProxy logs and metrics to ensure proper traffic distribution after restarting.
- Handle Increased Cache Usage
  If a node’s cache usage has increased significantly and is fully consumed:
  - Check the health and performance of the underlying database to determine if slow database response times are causing reliance on cache.
  - Evaluate the node’s memory usage via the Grafana OS dashboard.
  - If necessary, consider escalating the issue to the DevOps team for deeper investigation of the database and system resources.
- Restart the Application Server if Required
  - If the above actions do not resolve the issue, restart the tomcat-pssb service on the affected node to refresh the application and its connection to Hazelcast:
    systemctl restart tomcat-pssb
  - After restarting, monitor the cache usage and load distribution metrics to ensure normal operations are restored.
For persistent issues or those affecting multiple nodes, escalate to the DevOps team for a comprehensive investigation and resolution.
Hazelcast Low Cache Hits
Alertname: HZLowCacheHits
Low cache hits in Hazelcast indicate that a significant portion of data requests are not being served from the cache, but instead are likely being fetched from a database or computed on demand. This can negatively impact performance, leading to increased latency and higher load on underlying data stores.
Possible Causes for Low Cache Hits:
- Tomcat Connectivity Issues: If Tomcat is not connecting properly to the Hazelcast cache, it may cause low cache hits as data is being fetched from the database.
- Load Balancer Configuration: Improper load balancing could result in requests being routed to the wrong nodes, leading to an imbalance in cache usage.
- Traffic Volume: Low cache hits might be observed during off-peak hours when there is minimal traffic to the application, such as during midnight or other periods of low user activity.
- Cache Expiration: Frequent cache evictions or improper cache expiration policies may result in a low hit ratio if cached data is frequently removed before it can be reused.
- Database Dependency: If the database is performing well but the cache isn’t being used effectively, it could indicate that the application or load balancer isn’t interacting with the cache as expected.
C3 Data Collection:
To effectively troubleshoot and resolve issues related to low cache hits in Hazelcast, follow the steps below:
- Check Load Balancer and Tomcat Configuration:
  - Ensure that the load balancer is correctly distributing traffic across Hazelcast nodes.
  - Verify Tomcat is properly connected to Hazelcast and interacting with the cache as expected.
  - If the traffic to the application is low, check whether cache misses are normal during off-peak hours.
- Monitor Cache Usage:
  - Check the Grafana dashboard for com_hazelcast_Metrics_freeSpace and com_hazelcast_Metrics_usableSpace to assess the available cache capacity.
  - If cache usage is low, verify whether the application is making proper use of Hazelcast caching or whether there is an issue with cache configurations.
- Monitor Database Response Time:
  - If the application is falling back on the database due to low cache hits, check the database response times and ensure that queries are not causing delays (see the latency sketch after this list).
  - Look for any increase in database latency that might indicate poor application/database interaction.
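A sketch for the database-latency check above, reading the ScyllaDB metrics named in the next section from Prometheus. The Prometheus address is an assumption, as is treating these metrics as counters suitable for rate():

```bash
#!/usr/bin/env bash
# Compare read and write latency trends for ScyllaDB over the last 5 minutes.
# http://prometheus:9090 is an assumed address; adjust as needed.
PROM="http://prometheus:9090"
for metric in scylla_storage_proxy_coordinator_read_latency \
              scylla_storage_proxy_coordinator_write_latency; do
  echo "== ${metric} =="
  curl -s "${PROM}/api/v1/query" \
    --data-urlencode "query=rate(${metric}[5m])" \
    | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'
done
```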
Dependent Metrics:
- Metrics Related to Cache Usage:
  - com_hazelcast_Metrics_freeSpace: Indicates free space available in the Hazelcast cache.
  - com_hazelcast_Metrics_usableSpace: Represents the usable space on Hazelcast nodes for caching purposes.
- Metrics Related to Database Response Times:
  - scylla_storage_proxy_coordinator_read_latency: Monitors the read latency for ScyllaDB, indicating if database reads are slower than expected.
  - scylla_storage_proxy_coordinator_write_latency: Measures the write latency for ScyllaDB.
- Metrics Related to Tomcat Performance:
  - Monitor Tomcat logs for connection-related issues and any errors related to connectivity with Hazelcast.
- Metrics Related to Load Balancer (HAProxy) Performance:
  - haproxy_backend_up: Indicates whether backend servers are up, ensuring proper load balancing.
  - haproxy_server_current_sessions: The number of active sessions handled by each Hazelcast backend server, verifying whether traffic is being evenly distributed.
C3 Remedy:
If the issue is identified as low cache hits, perform the following actions:
- Check Proper Load Balancing:
  - Ensure that the load balancer (e.g., HAProxy) is distributing traffic evenly across all Hazelcast nodes. If necessary, restart the load balancer to correct any traffic imbalances:
    systemctl restart haproxy
- Ensure Tomcat Connection to Hazelcast:
  - Ensure Tomcat is connected to Hazelcast and properly caching data. If needed, restart Tomcat to re-establish the connection:
    systemctl restart tomcat-pssb
- Monitor Cache Usage and Database Latency:
  - Check whether the application is falling back on the database due to low cache hits and whether database slowness is adding latency. Use the scylla_storage_proxy_coordinator_read_latency and scylla_storage_proxy_coordinator_write_latency metrics to identify issues with database performance.
- Identify Off-Peak Activity:
  - Low cache hits during off-peak hours (e.g., midnight) may be normal if user traffic is low. Monitor trends over a longer period to confirm whether the low cache hits are due to reduced activity.
DevOps Remedy:
- Optimize Cache Configuration:
  - If necessary, increase the cache size in Scylla and MySQL to retain more frequently accessed data in memory, reducing the dependency on database queries.
- Address Load Balancing Issues:
  - Ensure that HAProxy is correctly routing traffic to all nodes in the Hazelcast cluster. If the load balancing issues persist, optimize the configuration or restart the load balancer as needed.
- Increase Resources:
  - If the application continues to rely on the database rather than the cache, increase resources such as RAM for the cache or adjust eviction policies to retain more data in memory.
- Troubleshoot with Application Team:
  - If cache hits are consistently low and the issue is not related to infrastructure or configuration, collaborate with the application team to investigate any potential application-level issues that could be affecting cache usage.
**Hazelcast High/Low Active Connections**
AlertName: HZHighActiveConnections
AlertName: HZLowActiveConnections
High active connections in Hazelcast indicate that a large number of clients or services are maintaining active connections to the Hazelcast cluster. This can result in increased resource consumption on the affected nodes, including CPU, memory, and network usage. While a higher number of active connections is often expected in high-traffic systems, persistent or sudden spikes in active connections may point to underlying issues that can affect cluster stability and performance.
Possible Causes for High Active Connections:
- High Traffic or Load: A surge in traffic or excessive requests from clients can lead to a higher number of active connections.
- Improper Load Balancing: If the load balancer is not properly distributing connections to nodes in the Hazelcast cluster, some nodes might experience higher traffic and active connections.
- Connection Leaks: Inadequate connection management (e.g., failing to close connections after use) can cause an increase in active connections over time.
- Application Misconfiguration: Misconfigured client applications or services continuously attempting to connect to the cluster could lead to an unusually high number of active connections.
- Cluster Instability: If one or more nodes in the cluster are slow to respond, clients may retry and increase the number of active connections.
What if High Active Connections are on Only One Node?
If high active connections are observed on only one node in the Hazelcast cluster, it may indicate an issue specific to that node. Possible causes include:
- Resource Exhaustion on One Node: The node might be experiencing resource bottlenecks (e.g., CPU, memory, or network bandwidth), causing it to handle more active connections than other nodes in the cluster.
- Uneven Load Distribution: A misconfigured load balancer might be directing more connections to this node, resulting in a higher number of active connections compared to others.
- Faulty Node Behavior: If one node is behaving abnormally, it may become a bottleneck for client connections.
Immediate Actions:
- Check Resource Usage on the Affected Node: Monitor the CPU, memory, and network usage on the node with high active connections. If resources are exhausted, consider scaling the node or adding more resources to handle the load.
- Verify Load Balancer Configuration: Ensure the load balancer is distributing connections evenly across all nodes in the Hazelcast cluster. If the load is skewed toward one node, adjust the configuration or restart the load balancer.
- Investigate Node Performance Issues: If one node consistently handles more active connections, check the node’s performance metrics and logs for any potential issues (e.g., slow database response times, network issues, or application misconfigurations).
C3 Data Collection:
To effectively troubleshoot and resolve issues related to high active connections in Hazelcast, follow these steps:
- Monitor Active Connections:
  - Monitor the number of active connections on each Hazelcast node using the com_hazelcast_Metrics_activeConnections metric. A sudden increase in this metric indicates high traffic or connection buildup.
  - Review Grafana dashboards to check for spikes in active connections over time.
- Check Load Balancer Configuration:
  - Ensure that the load balancer (e.g., HAProxy) is correctly distributing client connections to all Hazelcast nodes. If a specific node is receiving a disproportionate number of connections, the load balancer might need optimization.
- Monitor System Resources:
  - Check the system resources (CPU, memory, and network bandwidth) on affected nodes. High active connections can cause resource exhaustion, leading to slower response times or crashes.
  - Review the node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes, and cpu_usage_percent metrics for any resource saturation.
- Inspect Application Behavior:
  - Ensure that client applications are not leaking connections or creating unnecessary connections to Hazelcast. Connection pooling mechanisms and proper connection management should be in place.
- Check for Connection Leaks:
  - Investigate whether connections are being properly closed after use. Unclosed connections or connection leaks can accumulate over time, leading to high active connection counts (see the socket-count sketch after this list).
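To check for connection buildup at the OS level on a given node, established TCP connections to the Hazelcast port can be counted directly. Port 54321 is taken from the firewall rules earlier on this page and remains an assumption about this deployment:

```bash
# Count established TCP connections to the Hazelcast port on this node.
# Port 54321 comes from the firewall section above; adjust if different.
ss -tan state established "( sport = :54321 or dport = :54321 )" \
  | tail -n +2 | wc -l

# Break incoming connections down by remote peer to spot a single client
# holding an unusual number of connections (a possible leak).
ss -tan state established "( sport = :54321 )" \
  | awk 'NR>1 {print $4}' | sort | uniq -c | sort -rn | head
```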
Dependent Metrics:
- Metrics Related to Active Connections:
  - com_hazelcast_Metrics_activeConnections: Tracks the number of active connections to the Hazelcast node. This metric helps you monitor how many clients are connected to a specific node.
  - com_hazelcast_Metrics_totalConnections: Represents the total number of connections over time. This can help identify sudden spikes in connections.
- Metrics Related to Node Memory and CPU Usage:
  - node_memory_MemTotal_bytes: Total available memory on the node.
  - node_memory_MemAvailable_bytes: Memory available for new processes without swapping.
  - node_cpu_seconds_total: Cumulative CPU time consumed on each node; its rate reflects CPU usage. High CPU usage could indicate the node is under heavy load due to high active connections.
- Metrics Related to Load Balancer Performance:
  - haproxy_server_current_sessions: Number of active sessions handled by each backend server. Ensures that the load balancer is distributing traffic evenly across nodes.
  - haproxy_backend_up: Indicates whether backend servers are up and available for handling requests.
C3 Remedy:
If the high active connections alert is triggered, follow these steps to investigate and mitigate the issue:
- Ensure Proper Load Balancing:
  - Verify that the load balancer is correctly distributing traffic across all Hazelcast nodes. If one node is overloaded, restart or reconfigure the load balancer to ensure an even distribution of active connections:
    systemctl restart haproxy
- Monitor and Scale Resources:
  - Check the resource usage on affected nodes (memory, CPU, network). If resources are near capacity, consider scaling the system horizontally by adding more Hazelcast nodes or vertically by increasing the resources (e.g., CPU, RAM) on existing nodes.
  - Monitor the node_memory_MemAvailable_bytes, cpu_usage_percent, and network metrics to identify any potential bottlenecks.
- Investigate Application Configuration:
  - Verify that client applications are properly closing connections and not leaking them over time. Ensure connection pooling is used where appropriate.
  - Review application logs for any errors or unexpected behavior related to connection handling.
- Scale or Optimize Connections:
  - If the system requires handling more connections, consider optimizing the connection settings in Hazelcast to allow for more simultaneous connections or to better manage existing connections.
DevOps Remedy:
- Optimize Node Resources:
  - If high active connections are causing resource saturation, increase the resources (CPU, RAM) on the affected Hazelcast nodes. If needed, adjust the node’s configuration to allow for more simultaneous connections, or optimize the garbage collection settings to prevent resource exhaustion.
- Review and Adjust Load Balancer Settings:
  - Investigate HAProxy to ensure traffic is evenly distributed across all Hazelcast nodes. Fine-tune the load balancing algorithm (e.g., round-robin, least-connections) to optimize traffic flow.
- Optimize Database and Cache Interactions:
  - Ensure that high active connections are not caused by slow database queries or inefficient cache usage. Improve database performance by scaling it or optimizing queries. If necessary, increase the cache size for Hazelcast to reduce dependency on database queries.
- Investigate Connection Management:
  - Ensure that all client applications use proper connection pooling and that connections are efficiently managed. Investigate any potential connection leaks that might be increasing the number of active connections.
- Restart Nodes if Necessary:
  - If a specific node is heavily impacted and unable to handle the load, consider restarting it to refresh its connections and resources. Restart members one at a time and verify cluster health between restarts to avoid destabilizing the cluster (see the rolling-restart sketch below).
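A cautious rolling-restart sketch for that last step: restart one member at a time and wait for the cluster to report its full size of 6 before moving on. The node names, SSH access, and the health-check endpoint/port are assumptions (the endpoint must be enabled, as noted in the member-count section above):

```bash
#!/usr/bin/env bash
# Rolling restart of the tomcat-pssb service, one node at a time.
# Node names and the health-check endpoint/port are assumptions; adapt
# them to the deployment. Requires SSH access to each node.
set -e
NODES="pssb-node1 pssb-node2 pssb-node3 pssb-node4 pssb-node5 pssb-node6"
EXPECTED=6

for node in ${NODES}; do
  echo "Restarting tomcat-pssb on ${node}..."
  ssh "${node}" sudo systemctl restart tomcat-pssb

  # Wait until the restarted member rejoins and the cluster is whole again.
  until size=$(curl -sf "http://${node}:54321/hazelcast/health" | jq -r '.clusterSize') \
        && [ "${size}" = "${EXPECTED}" ]; do
    echo "  waiting: ${node} reports clusterSize=${size:-unreachable}"
    sleep 10
  done
  echo "  ${node} back in a ${EXPECTED}-member cluster."
done
```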