scylla_scylladb_current_version: This metric represents the current version of ScyllaDB deployed across nodes. It is crucial for tracking software updates and checking version compatibility between nodes.
scylla_node_operation_mode: This metric shows the operational mode of ScyllaDB nodes (e.g., normal, decommissioning). It provides insights into node status transitions and operational states.
scylla_reactor_utilization: This metric represents the average CPU utilization by the ScyllaDB process. It aids in capacity planning and performance monitoring.
scylla_reactor_aio_errors: This metric shows the sum of asynchronous I/O errors. It indicates potential disk performance issues affecting database operations.
scylla_cql:non_system_prepared1m: This metric shows the count of non-system prepared queries executed per minute. It helps to understand query workload patterns.
scylla_cql_reverse_queries: This metric represents the rate of reverse queries executed. It provides insights into query patterns and database access patterns.
scylla_compaction_manager_compactions: This metric shows the active compactions managed by Scylla’s compaction manager. It is critical for optimizing database performance.
scylla_cache_row_hits: This metric represents the rate of cache hits. It indicates the effectiveness of caching mechanisms in reducing read latency and improving performance.
scylla_storage_proxy_coordinator_write_latency_count: This metric counts the write operations handled by the coordinator; its rate gives the write throughput. It is the operation count behind the write latency measurements, not a latency value itself.
scylla_storage_proxy_coordinator_read_latency_count: This metric counts the read operations handled by the coordinator; its rate gives the read throughput. It is the operation count behind the read latency measurements, not a latency value itself.
wlatencyp95: This metric shows the average 95th percentile write latency. 95% of write operations complete within this time.
rlatencyp95: This metric represents the average 95th percentile read latency. 95% of read operations complete within this time.
rlatencyp99: This metric shows the average 99th percentile read latency. 99% of read operations complete within this time, so it reflects the slowest reads.
scylla_manager_task_active_count: This metric represents the active count of tasks managed by Scylla’s manager (e.g., repairs, backups). It provides insights into ongoing maintenance activities.
scylla_manager_repair_progress: This metric shows the average repair progress across tasks managed by Scylla’s manager. It indicates the status of repair operations.
node_netstat_Tcp_CurrEstab: This metric represents the current number of established TCP connections. It is essential for monitoring network activity and connection health.
node_netstat_Tcp_ActiveOpens: This metric shows the rate of active TCP connection openings. It indicates network connection establishment dynamics.
node_filesystem_avail_bytes: This metric represents the available bytes on the filesystem. It is critical for monitoring storage capacity and resource utilization.
node_filesystem_size_bytes: This metric shows the total size of the filesystem. It provides context for storage capacity and usage metrics.
node_power_supply_online: This metric represents whether the power supply is online. It provides information on hardware health and reliability.
node_cooling_device_cur_state: This metric shows the current state of cooling devices. It offers insights into hardware cooling system operation and health.
node_time_seconds: This metric represents the current time on the node. It is essential for timestamping metrics and monitoring system uptime.
node_boot_time_seconds: This metric shows the timestamp of the node’s last boot time. It is useful for tracking system restarts and uptime metrics.
Nodes Panel
It displays the count of nodes running ScyllaDB with the current version.
This panel provides an overview of the number of nodes in the ScyllaDB cluster and their current software version.
Unreachable Panel
This panel indicates nodes that are unreachable or not successfully scraped.
This panel helps monitor nodes that are not responding or are unreachable.
Inactive Panel
It counts nodes that are not in the normal operational mode.
This panel tracks nodes that are not actively participating in the cluster’s operations, helping to identify any inactive nodes.
Manager Panel
It shows the active tasks run by Scylla Manager, such as repairs and backups.
This panel provides an overview of ongoing management tasks (repairs, backups) and the current version of ScyllaDB servers managed by the manager nodes. It helps to monitor ongoing tasks and the operational status of the manager server (offline or online).
Progress Panel
This panel tracks the progress of repair and backup tasks run by Scylla Manager.
This panel calculates and displays the progress of ongoing repair and backup tasks in the ScyllaDB cluster, helping monitor the completion status of these operations.
Avg Write Panel
This panel shows the average time it takes for write operations to be completed in the ScyllaDB cluster.
This panel helps to monitor how quickly data is being written to the database. If the average write time increases, it could indicate potential performance issues that need to be addressed.
Avg Read Panel
This panel shows the average time it takes for read operations to be completed in the ScyllaDB cluster.
This panel helps to monitor how quickly data is being read from the database. If the average read time increases, it could indicate potential performance issues that need to be addressed.
99% Read Panel
This panel shows the 99th percentile read latency: 99% of read operations complete within this time, and only the slowest 1% take longer.
This panel helps to identify outlier read latencies. If this metric increases, it means that a small percentage of read operations are experiencing significantly higher latency, which could be a sign of underlying problems.
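To see why a tail percentile can jump while the average moves far less, here is a small sketch using the nearest-rank percentile definition. The sample latencies and the helper name `percentile` are made up for illustration, and this is not necessarily how the dashboard computes its percentiles.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of all
    samples at or below it (p is an integer between 1 and 100)."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical read latencies: 95 fast reads and 5 slow outliers (arbitrary units).
latencies = [1_000] * 95 + [100_000] * 5

avg = mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(avg, p95, p99)
```

In this sample the 95th percentile still looks healthy, the average rises moderately, and only the 99th percentile exposes the slow reads directly.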
Load Panel
This panel shows the average utilization of the CPU reactor, which indicates how much of the CPU’s capacity is being used.
This panel helps to monitor how much of the CPU resources are being consumed. High utilization can indicate that the system is under heavy load and might need scaling or optimization.
Requests/s Panel
This panel shows the total number of requests being handled by the ScyllaDB cluster per second.
This panel helps to monitor the request load on the database. A high number of requests per second indicates heavy usage, and tracking this can help in capacity planning and identifying potential overloads.
Timeouts Panel
This panel shows the number of write operations that timed out in the ScyllaDB cluster.
Metrics Used: scylla_storage_proxy_coordinator_write_timeouts
This panel helps to monitor the reliability of write operations. If the number of timeouts increases, it could indicate network issues, configuration problems, or resource constraints.
Writes Panel
This panel shows the rate at which write operations are being performed in the ScyllaDB cluster.
This panel helps to monitor the volume of write operations. Understanding the rate of writes can help in analyzing write load patterns and ensuring the database can handle the incoming write traffic efficiently.
Read Panel
This panel shows the rate at which read operations are being performed in the ScyllaDB cluster.
This panel helps to monitor the volume of read operations. Understanding the rate of reads can help in analyzing read load patterns and ensuring the database can handle the incoming read traffic efficiently.
Active Alerts Panel
The Active Alerts Panel provides an overview of the current alerts that are active within the system.
BlackboxProbeFailed: This alert is triggered when a blackbox probe (which simulates end-user actions to test external services) fails. It indicates that a monitored service is not responding as expected.
BlackboxSlowProbe: This alert is triggered when a blackbox probe experiences slow response times. It indicates that a monitored service is responding slower than expected.
BlackboxProbeHttpFailure: This alert is triggered when a blackbox probe fails to perform an HTTP request successfully. It indicates that a monitored web service is not responding correctly to HTTP requests.
HaproxyHasNoAliveBackends: This alert is triggered when HAProxy, a high availability load balancer, detects no alive backend servers. It indicates that all backend servers configured in HAProxy are down or unresponsive.
MysqlDown: This alert is triggered when a MySQL database instance is down. It indicates that the monitored MySQL service is not accessible.
HostOutOfDiskSpace: This alert is triggered when a host is running out of disk space. It indicates that the available disk space on the monitored host is critically low.
HostSystemdServiceCrashed: This alert is triggered when a systemd service has crashed. It indicates that a critical system service managed by systemd is not running as expected.
Low Memory: This alert is triggered when the system memory is critically low. It indicates that the monitored host is running out of available memory.
Memory is predicted to be less than 1 GiB in one hour: This alert is triggered when a prediction model indicates that the available memory will fall below 1 GiB within the next hour. It helps in preemptively addressing memory issues before they become critical.
DiskFull: This alert is triggered when a disk is full. It indicates that the monitored disk has reached its capacity and can no longer store additional data.
Load Panel
This panel displays the average reactor utilization for each node in the specified cluster and data center. It provides insight into how much CPU resources are being used by the ScyllaDB reactor.
Running Compactions Panel
This panel shows the current number of running compactions across nodes in the cluster and data center. Compactions are critical for maintaining database performance.
Cache Hits/Misses Panel
This panel provides the rate of cache row hits, indicating how often requested rows are found in the cache, which improves read performance.
Writes Panel
This panel shows the rate of the write latency count, that is, the number of write operations per second, helping to monitor write traffic across the nodes.
Reads Panel
This panel displays the rate of the read latency count, that is, the number of read operations per second. It includes current values, as well as values offset by one day and one week, for comparison.
Nodes Panel
This panel checks for current ScyllaDB versions and operational modes across all instances, ensuring consistency and operational status. The panel also tracks average reactor utilization to gauge CPU resource efficiency.
alert: cqlNonPrepared
cql:non_prepared > 0
alert: cql:non_paged_no_system
cql:non_paged > 0
alert: cqlNoTokenAware
cql:non_token_aware > 0
alert: cqlAllowFiltering
cql:allow_filtering > 0
This alert is triggered when queries use the ALLOW FILTERING clause, which can significantly degrade performance due to inefficient filtering.
alert: cqlCLAny
cql:any_queries > 0
This alert is triggered when queries use consistency level ANY, which may result in weaker guarantees for read and write operations.
alert: cqlCLAll
cql:all_queries > 0
This alert is triggered when queries use consistency level ALL, which ensures strong consistency but can impact performance and availability.
alert: nonBalancedcqlTraffic
abs(rate(scylla_cql_updates{conditional="no"}[1m]) - scalar(avg(rate(scylla_cql_updates{conditional="no"}[1m]))))/scalar(stddev(rate(scylla_cql_updates{conditional="no"}[1m]))+100) > 2
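The nonBalancedcqlTraffic expression is a damped deviation score: each series' update rate is compared with the mean of all series, and the difference is divided by the standard deviation plus 100. The sketch below reproduces that arithmetic on made-up numbers. The function name, parameter names, and node labels are hypothetical, and reading the +100 term as a guard against noise at low traffic is an interpretation, not something the rule states.

```python
from statistics import mean, pstdev


def unbalanced(rates, damping=100.0, threshold=2.0):
    """Return the labels whose rate deviates from the mean by more than
    `threshold` damped standard deviations."""
    values = list(rates.values())
    avg = mean(values)
    # PromQL stddev() is the population standard deviation, hence pstdev.
    spread = pstdev(values) + damping
    return [label for label, rate in rates.items()
            if abs(rate - avg) / spread > threshold]


# Hypothetical per-node update rates: one hot node among ten.
busy = {f"node{i}": 1000.0 for i in range(1, 10)}
busy["node10"] = 10000.0

# The same shape at very low traffic: the +100 damping keeps the score small.
idle = {label: rate / 1000 for label, rate in busy.items()}

print(unbalanced(busy))
print(unbalanced(idle))
```

With the busy sample only the hot node exceeds the threshold; with the idle sample nothing is flagged even though the relative imbalance is identical.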
alert: nodeLocalErrors
sum(errors:local_failed) by (cluster, instance) > 0
alert: nodeIOErrors
sum(rate(scylla_reactor_aio_errors[60s])) by (cluster, instance) > 0
alert: nodeCLErrors
sum(errors:operation_unavailable) by (cluster) > 0
alert: preparedCacheEviction
sum(rate(scylla_cql_prepared_cache_evictions[2m])) by (cluster) + sum(rate(scylla_cql_authorized_prepared_statements_cache_evictions[2m])) by (cluster) > 100
alert: heavyCompaction
max(scylla_scheduler_shares{group="compaction"}) by (cluster) >= 1000
alert: shedRequests
max(sum(rate(scylla_transport_requests_shed[60s])) by (instance,cluster)/sum(rate(scylla_transport_requests_served{}[60s])) by (instance, cluster)) by(cluster) > 0.01
alert: cappedTombstone
changes(scylla_sstables_capped_tombstone_deletion_time[1h]) > 0
alert: InstanceDown (5 minutes)
up{job="scylla"} == 0
alert: InstanceDown (10 minutes)
sum(up{job="scylla"}>0)by(instance) unless sum(scylla_transport_requests_served{shard="0"}) by(instance)
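The 10-minute InstanceDown expression relies on the PromQL `unless` operator, which keeps the left-hand series that have no match on the right. In other words, it selects instances that Prometheus can scrape but that export no served-requests series for shard 0. A minimal sketch of that set logic, with hypothetical addresses and a hypothetical function name:

```python
def up_but_silent(up, served):
    """Mimic `up > 0 ... unless served`: instances that are scraped
    successfully but have no served-requests series."""
    scraped = {instance for instance, value in up.items() if value > 0}
    return sorted(scraped - set(served))


# Hypothetical scrape results keyed by instance.
up = {"10.0.0.1": 1, "10.0.0.2": 1, "10.0.0.3": 0}
served = {"10.0.0.1": 5321.0}  # only one instance exports the metric

print(up_but_silent(up, served))
```

The instance with `up` equal to 0 is not reported here; it is left to the alert that checks `up` directly.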
alert: InstanceDown (operation mode)
scylla_node_operation_mode > 3
alert: DiskFull (Warning)
node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 35
This alert is triggered when the available disk space on /var/lib/scylla is less than 35%. It indicates that the disk space is getting low.
alert: DiskFull (Error)
node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 25
This alert is triggered when the available disk space on /var/lib/scylla is less than 25%. It indicates a critical disk space issue that needs immediate attention.
alert: DiskFull (Critical)
node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 15
This alert is triggered when the available disk space on /var/lib/scylla is less than 15%. It indicates a severe disk space shortage that could affect system performance and stability.
alert: DiskFull (Root Partition)
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 20
This alert is triggered when the available disk space on the root partition (/) is less than 20%. It indicates that the root partition is running low on space, which could impact the overall system functionality.
alert: NoCql
scylla_manager_healthcheck_cql_status == -1
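The DiskFull alerts for /var/lib/scylla listed earlier all use the same free-space percentage with different thresholds (35, 25 and 15). The sketch below shows that calculation and returns the most severe tier that applies. The function name and the byte counts are hypothetical. In Prometheus each rule is evaluated on its own, so a disk below 15% would match all three rules at once.

```python
def disk_severity(avail_bytes, size_bytes):
    """Classify free space using the DiskFull thresholds for /var/lib/scylla."""
    free_pct = avail_bytes / size_bytes * 100
    if free_pct < 15:
        return "critical"
    if free_pct < 25:
        return "error"
    if free_pct < 35:
        return "warning"
    return None  # no DiskFull alert applies


GIB = 1024 ** 3
for avail in (500 * GIB, 300 * GIB, 200 * GIB, 100 * GIB):
    # Hypothetical 1000 GiB data volume.
    print(avail // GIB, disk_severity(avail, 1000 * GIB))
```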
alert: HighLatencies (Write)
wlatencyp95{by="instance"} > 100000
alert: HighLatencies (Write, Average)
wlatencya{by="instance"} > 10000
alert: HighLatencies (Read)
rlatencyp95{by="instance"} > 100000
alert: HighLatencies (Read, Average)
rlatencya{by="instance"} > 10000
alert: BackupFailed
(sum(scylla_manager_scheduler_run_total{type=~"backup", status="ERROR"}) or vector(0)) - (sum(scylla_manager_scheduler_run_total{type=~"backup", status="ERROR"} offset 3m) or vector(0)) > 0
alert: RepairFailed
(sum(scylla_manager_scheduler_run_total{type=~"repair", status="ERROR"}) or vector(0)) - (sum(scylla_manager_scheduler_run_total{type=~"repair", status="ERROR"} offset 3m) or vector(0)) > 0
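BackupFailed and RepairFailed both compare the current error total with its value three minutes earlier, and `or vector(0)` substitutes 0 when no matching series exists. That way the first failure ever recorded still counts as an increase. A minimal sketch of the comparison, using `None` to stand for a missing series (the function name is hypothetical):

```python
def new_errors(total_now, total_3m_ago):
    """True when the error total grew over the window, treating a missing
    series (None) as 0, like `or vector(0)` in the alert expression."""
    now = total_now if total_now is not None else 0
    before = total_3m_ago if total_3m_ago is not None else 0
    return now - before > 0


print(new_errors(1, None))     # first failure: the series did not exist before
print(new_errors(3, 3))        # no new failures in the window
print(new_errors(None, None))  # no failures recorded at all
```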
alert: restart
resets(scylla_gossip_heart_beat[1h]) > 0
alert: oomKill
changes(node_vmstat_oom_kill[1h]) > 0
alert: tooManyFiles
(node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 20000
This alert is triggered when the number of files in the /var/lib/scylla directory exceeds 20,000 per shard on the Scylla instance. It indicates a potential issue with file system resource usage.
alert: tooManyFiles
(node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 30000
This alert is triggered when the number of files in the /var/lib/scylla directory exceeds 30,000 per shard, which can impact performance.
alert: tooManyFiles
(node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 40000
This alert is triggered when the number of files in the /var/lib/scylla directory exceeds 40,000 per shard, indicating a potentially critical issue impacting system stability.
alert: nodeInJoinMode
scylla_node_operation_mode == 2
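The tooManyFiles alerts listed earlier divide the number of used inodes on /var/lib/scylla (total minus free) by the number of shards, then compare the result with 20,000, 30,000 and 40,000. The sketch below repeats that arithmetic. It assumes that `count(scylla_reactor_cpu_busy_ms) by (instance)` yields one series per shard; the function name and sample figures are hypothetical.

```python
def exceeded_thresholds(inodes_total, inodes_free, shards,
                        thresholds=(20_000, 30_000, 40_000)):
    """Return the tooManyFiles thresholds exceeded by files-per-shard."""
    per_shard = (inodes_total - inodes_free) / shards
    return [limit for limit in thresholds if per_shard > limit]


# Hypothetical node: 600,000 used inodes spread over 16 shards (37,500 each).
print(exceeded_thresholds(1_000_000, 400_000, 16))
```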
alert: splitBrain
sum(scylla_gossip_live) < (count(scylla_node_operation_mode==3)-1) * count(scylla_gossip_live)
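The splitBrain expression can be read as a bookkeeping check. If every node in normal mode sees all of its peers, the live-peer counts reported by the nodes should add up to (N - 1) for each reporting node. The sketch below assumes that `scylla_gossip_live` reports how many live peers each node currently sees; that reading, the function name, and the sample values are assumptions.

```python
def split_brain(live_peers_seen, normal_nodes):
    """live_peers_seen: one live-peer count per reporting node.
    normal_nodes: number of nodes in normal operation mode."""
    expected_total = (normal_nodes - 1) * len(live_peers_seen)
    return sum(live_peers_seen) < expected_total


healthy = [2, 2, 2]      # three nodes, each sees both peers
partitioned = [0, 1, 1]  # one node is isolated from the other two

print(split_brain(healthy, normal_nodes=3))
print(split_brain(partitioned, normal_nodes=3))
```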
alert: bloomFilterSize
scylla_sstables_bloom_filter_memory_size/scylla_memory_total_memory > 0.2