Scylla

Metrics

Database Version and Node Status Metrics

  • scylla_scylladb_current_version: This metric represents the current version of ScyllaDB deployed across nodes. It is crucial for tracking software updates and ensuring compatibility checks.

  • scylla_node_operation_mode: This metric shows the operational mode of ScyllaDB nodes (e.g., normal, decommissioning). It provides insights into node status transitions and operational states.

System Resource Utilization Metrics

  • scylla_reactor_utilization: This metric reports how busy each ScyllaDB reactor (shard) is, expressed as a percentage; averaged across shards, it approximates the CPU utilization of the ScyllaDB process. It aids in capacity planning and performance monitoring.

  • scylla_reactor_aio_errors: This metric shows the sum of asynchronous I/O errors. It indicates potential disk performance issues affecting database operations.

Error and Operational Metrics

  • scylla_errors:nodes_total: This metric represents the total count of errors reported by ScyllaDB nodes. It offers insights into operational issues and system stability.

Query Workload Metrics

  • scylla_cql:non_system_prepared1m: This metric shows the count of non-system prepared queries executed per minute. It helps to understand query workload patterns.

  • scylla_cql_reverse_queries: This metric represents the rate of reverse queries executed. It provides insights into query patterns and database access patterns.

Compaction and Cache Metrics

  • scylla_compaction_manager_compactions: This metric shows the number of compactions currently running under Scylla’s compaction manager. Monitoring it helps spot compaction backlogs that can degrade database performance.

  • scylla_cache_row_hits: This metric represents the rate of cache hits. It indicates the effectiveness of caching mechanisms in reducing read latency and improving performance.
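
Combined with the corresponding miss counter, the row-hit metric can be turned into a cache hit ratio. A minimal PromQL sketch (assuming the companion scylla_cache_row_misses series is also scraped; exact label handling may differ per deployment):

    # Fraction of row lookups served from the cache over the last 5 minutes
    sum(rate(scylla_cache_row_hits[5m]))
      /
    (sum(rate(scylla_cache_row_hits[5m])) + sum(rate(scylla_cache_row_misses[5m])))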

Latency Metrics

  • scylla_storage_proxy_coordinator_write_latency_count: This metric counts write operations handled by the coordinator. Its rate gives the write throughput, and it is the count component of the coordinator write-latency measurement.

  • scylla_storage_proxy_coordinator_read_latency_count: This metric counts read operations handled by the coordinator. Its rate gives the read throughput, and it is the count component of the coordinator read-latency measurement.

  • wlatencyp95: This metric shows the average 95th percentile write latency, i.e. the latency below which 95% of write operations complete.

  • rlatencyp95: This metric represents the average 95th percentile read latency, i.e. the latency below which 95% of read operations complete.

  • rlatencyp99: This metric shows the average 99th percentile read latency, capturing the latency experienced by the slowest 1% of read operations.
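
The wlatency*/rlatency* series above are recording rules derived from the coordinator latency metrics rather than raw ScyllaDB counters. As a minimal sketch of how such rules could be defined (this assumes the coordinator latency series also expose _sum and _bucket components; the rule definitions that ship with the Scylla Monitoring Stack may differ):

    groups:
      - name: scylla_latency_rules       # hypothetical group name for illustration
        rules:
          # Average write latency per instance: total latency divided by operation count
          - record: wlatencya
            labels:
              by: instance
            expr: >
              sum(rate(scylla_storage_proxy_coordinator_write_latency_sum[1m])) by (instance)
              /
              sum(rate(scylla_storage_proxy_coordinator_write_latency_count[1m])) by (instance)
          # 95th-percentile write latency per instance, estimated from the histogram buckets
          - record: wlatencyp95
            labels:
              by: instance
            expr: >
              histogram_quantile(0.95,
                sum(rate(scylla_storage_proxy_coordinator_write_latency_bucket[1m])) by (instance, le))

The read-side rules (rlatencya, rlatencyp95, rlatencyp99) would follow the same pattern using the coordinator read-latency series.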

Manager and Task Metrics

  • scylla_manager_task_active_count: This metric represents the number of active tasks managed by Scylla Manager (e.g., repairs, backups). It provides insights into ongoing maintenance activities.

  • scylla_manager_repair_progress: This metric shows the average repair progress across tasks managed by Scylla Manager. It indicates the status of repair operations.

Network Metrics

  • node_netstat_Tcp_CurrEstab: This metric represents the current number of established TCP connections. It is essential for monitoring network activity and connection health.

  • node_netstat_Tcp_ActiveOpens: This metric shows the rate of active TCP connection openings. It indicates network connection establishment dynamics.

Filesystem and Hardware Metrics

  • node_filesystem_avail_bytes: This metric represents the available bytes on the filesystem. It is critical for monitoring storage capacity and resource utilization.

  • node_filesystem_size_bytes: This metric shows the total size of the filesystem. It provides context for storage capacity and usage metrics.

  • node_power_supply_online: This metric represents whether the power supply is online. It provides information on hardware health and reliability.

  • node_cooling_device_cur_state: This metric shows the current state of cooling devices. It offers insights into hardware cooling system operation and health.

System Time Metrics

  • node_time_seconds: This metric represents the current time on the node. It is essential for timestamping metrics and monitoring system uptime.

  • node_boot_time_seconds: This metric shows the timestamp of the node’s last boot time. It is useful for tracking system restarts and uptime metrics.
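
Because both of these metrics are Unix timestamps, subtracting them yields the node’s uptime directly. For example:

    # Node uptime in seconds: current time minus the last boot timestamp
    node_time_seconds - node_boot_time_seconds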


Grafana Dashboard Panels

Cluster Overview Row

Nodes Panel

This panel displays the count of nodes running ScyllaDB, together with the version they are running.

  • Metrics Used: scylla_scylladb_current_version

This panel provides an overview of the number of nodes in the ScyllaDB cluster and their current software version.
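
As a sketch, a query behind such a stat panel might count reporting nodes grouped by version (assuming the build version is attached as a version label on the metric; the dashboard’s exact expression may differ):

    # Number of nodes reporting each ScyllaDB version
    count(scylla_scylladb_current_version) by (version)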

Unreachable Panel

This panel indicates nodes that are unreachable or not successfully scraped.

  • Metrics Used: scrape_samples_scraped

This panel helps monitor nodes that are not responding or are unreachable.
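
A possible query for this panel, assuming the ScyllaDB targets share the job label "scylla" (the dashboard’s exact expression may differ):

    # Targets whose last scrape returned no samples, i.e. likely unreachable nodes
    count(scrape_samples_scraped{job="scylla"} == 0) or vector(0)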

Inactive Panel

This panel counts nodes that are not in normal operation mode.

  • Metrics Used: scylla_node_operation_mode

This panel tracks nodes that are not actively participating in the cluster’s operations, helping to identify any inactive nodes.
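
As a sketch, such a panel can count nodes whose operation mode is not 3, since 3 corresponds to normal operation (the same encoding used by the nodeInJoinMode and splitBrain alert rules later in this document):

    # Nodes that are not in normal (mode 3) operation
    count(scylla_node_operation_mode != 3) or vector(0)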

Manager Panel

This panel shows the active tasks managed by Scylla Manager, such as repairs and backups.

  • Metrics Used: scylla_manager_task_active_count, scylla_manager_server_current_version

This panel provides an overview of ongoing management tasks (repairs, backups) and the current version of ScyllaDB servers managed by the manager nodes. It helps to monitor ongoing tasks and the operational status of the manager server (offline or online).

Progress Panel

This panel tracks the progress of repair and backup tasks managed by Scylla Manager.

  • Metrics Used: scylla_manager_repair_progress, scylla_manager_backup_percent_progress, scylla_manager_task_active_count.

This panel calculates and displays the progress of ongoing repair and backup tasks in the ScyllaDB cluster, helping monitor the completion status of these operations.

Avg Write Panel

This panel shows the average time it takes for write operations to be completed in the ScyllaDB cluster.

  • Metrics Used: wlatencya

This panel helps to monitor how quickly data is being written to the database. If the average write time increases, it could indicate potential performance issues that need to be addressed.

Avg Read Panel

This panel shows the average time it takes for read operations to be completed in the ScyllaDB cluster.

  • Metrics Used: rlatencya

This panel helps to monitor how quickly data is being read from the database. If the average read time increases, it could indicate potential performance issues that need to be addressed.

99% Read Panel

This panel shows the 99th percentile read latency, which means the read time for the slowest 1% of read operations.

  • Metrics Used: rlatencyp99

This panel helps to identify outlier read latencies. If this metric increases, it means that a small percentage of read operations are experiencing significantly higher latency, which could be a sign of underlying problems.

Load Panel

This panel shows the average reactor utilization, which indicates how much of the CPU’s capacity is being used by ScyllaDB.

  • Metrics Used: scylla_reactor_utilization

This panel helps to monitor how much of the CPU resources are being consumed. High utilization can indicate that the system is under heavy load and might need scaling or optimization.
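
A minimal sketch of the kind of query behind this panel (the dashboard may aggregate across shards or data centers differently):

    # Average reactor (CPU) utilization per node, across all shards
    avg(scylla_reactor_utilization) by (instance)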

Requests/s Panel

This panel shows the total number of requests being handled by the ScyllaDB cluster per second.

  • Metrics Used: scylla_transport_requests_served, scylla_thrift_served

This panel helps to monitor the request load on the database. A high number of requests per second indicates heavy usage, and tracking this can help in capacity planning and identifying potential overloads.
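
A sketch of how the two counters can be combined into a cluster-wide requests-per-second figure (assuming both the CQL transport and Thrift counters are scraped; the panel’s exact expression may differ):

    # Total requests served per second across the cluster (CQL transport + Thrift)
    sum(rate(scylla_transport_requests_served[1m])) + sum(rate(scylla_thrift_served[1m]))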

Timeouts Panel

This panel shows the number of write operations that timed out in the ScyllaDB cluster.

  • Metrics Used: scylla_storage_proxy_coordinator_write_timeouts

This panel helps to monitor the reliability of write operations. If the number of timeouts increases, it could indicate network issues, configuration problems, or resource constraints.

Writes Panel

This panel shows the rate at which write operations are being performed in the ScyllaDB cluster.

  • Metrics Used: scylla_storage_proxy_coordinator_write_latency_count, scylla_storage_proxy_coordinator_write_latency_summary_count

This panel helps to monitor the volume of write operations. Understanding the rate of writes can help in analyzing write load patterns and ensuring the database can handle the incoming write traffic efficiently.

Read Panel

This panel shows the rate at which read operations are being performed in the ScyllaDB cluster.

  • Metrics Used: scylla_storage_proxy_coordinator_read_latency_count, scylla_storage_proxy_coordinator_read_latency_summary_count

This panel helps to monitor the volume of read operations. Understanding the rate of reads can help in analyzing read load patterns and ensuring the database can handle the incoming read traffic efficiently.

Active Alerts Panel

The Active Alerts Panel provides an overview of the current alerts that are active within the system.

  • BlackboxProbeFailed: This alert is triggered when a blackbox probe (which simulates end-user actions to test external services) fails. It indicates that a monitored service is not responding as expected.

  • BlackboxSlowProbe: This alert is triggered when a blackbox probe experiences slow response times. It indicates that a monitored service is responding slower than expected.

  • BlackboxProbeHttpFailure: This alert is triggered when a blackbox probe fails to perform an HTTP request successfully. It indicates that a monitored web service is not responding correctly to HTTP requests.

  • HaproxyHasNoAliveBackends: This alert is triggered when HAProxy, a high availability load balancer, detects no alive backend servers. It indicates that all backend servers configured in HAProxy are down or unresponsive.

  • MysqlDown: This alert is triggered when a MySQL database instance is down. It indicates that the monitored MySQL service is not accessible.

  • HostOutOfDiskSpace: This alert is triggered when a host is running out of disk space. It indicates that the available disk space on the monitored host is critically low.

  • HostSystemdServiceCrashed: This alert is triggered when a systemd service has crashed. It indicates that a critical system service managed by systemd is not running as expected.

  • Low Memory: This alert is triggered when the system memory is critically low. It indicates that the monitored host is running out of available memory.

  • Memory is predicted to be less than 1 GiB in one hour: This alert is triggered when a prediction model indicates that the available memory will fall below 1 GiB within the next hour. It helps in preemptively addressing memory issues before they become critical.

  • DiskFull: This alert is triggered when a disk is full. It indicates that the monitored disk has reached its capacity and can no longer store additional data.


Information for dc1 Row

Load Panel

This panel displays the average reactor utilization for each node in the specified cluster and data center. It provides insight into how much CPU resources are being used by the ScyllaDB reactor.

  • Metric Used: scylla_reactor_utilization

Running Compactions Panel

This panel shows the current number of running compactions across nodes in the cluster and data center. Compactions are critical for maintaining database performance.

  • Metric Used: scylla_compaction_manager_compactions

Cache Hits/Misses Panel

This panel provides the rate of cache row hits, indicating how often requested rows are found in the cache, which improves read performance.

  • Metric Used: scylla_cache_row_hits

Writes Panel

This panel shows the rate of coordinated write operations (derived from the write-latency counters), helping to monitor the throughput and performance of write operations across the nodes.

  • Metrics Used: scylla_storage_proxy_coordinator_write_latency_count, scylla_storage_proxy_coordinator_write_latency_summary_count

Reads Panel

This panel displays the rate of coordinated read operations (derived from the read-latency counters), which is crucial for understanding read throughput and performance. It includes current metrics, as well as metrics offset by one day and one week, for comparison.

  • Metrics Used: scylla_storage_proxy_coordinator_read_latency_count, scylla_storage_proxy_coordinator_read_latency_summary_count
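
The day-over-day and week-over-week comparison described above is typically done with PromQL offset modifiers. A sketch, assuming the panel plots the current read rate alongside the same series shifted back in time:

    # Current coordinated read rate per node
    sum(rate(scylla_storage_proxy_coordinator_read_latency_count[1m])) by (instance)

    # The same rate 24 hours earlier, for day-over-day comparison
    sum(rate(scylla_storage_proxy_coordinator_read_latency_count[1m] offset 1d)) by (instance)

    # The same rate one week earlier
    sum(rate(scylla_storage_proxy_coordinator_read_latency_count[1m] offset 1w)) by (instance)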

Nodes Panel

This panel reports the current ScyllaDB version and operation mode for every instance, making it easy to spot version drift or nodes outside normal operation. It also tracks average reactor utilization to gauge CPU resource usage.

  • Metrics Used: scylla_scylladb_current_version, scylla_node_operation_mode, scylla_reactor_utilization, scylla_reactor_aio_errors, scylla_errors:nodes_total, scylla_cql:non_system_prepared1m, scylla_cql_reverse_queries

Alert rules
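
Each entry below corresponds to a Prometheus alerting rule: the expr field is evaluated continuously, and the description and summary become annotations on the firing alert. As a minimal sketch of how one of these rules could be written in a rule file (the for: duration and severity label here are illustrative assumptions, not values taken from the actual Scylla Monitoring rule file):

    groups:
      - name: scylla_alerts              # hypothetical rule group for illustration
        rules:
          - alert: DiskFull
            # Fire when less than 25% of the Scylla data filesystem is free
            expr: >
              node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"}
              / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 25
            for: 5m                      # illustrative hold duration
            labels:
              severity: error            # illustrative severity
            annotations:
              description: 'Instance {{ $labels.instance }} has less than 25% free disk space.'
              summary: Instance low disk space.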

  • alert: cqlNonPrepared

    • expr: cql:non_prepared > 0
    • description: Some queries are non-prepared.
    • summary: Non-prepared statements.
    • Explanation: This alert is triggered when one or more CQL queries are not prepared, indicating inefficient query execution.
  • alert: cql:non_paged_no_system

    • expr: cql:non_paged > 0
    • description: Some SELECT queries are non-paged.
    • summary: Non-paged statements.
    • Explanation: This alert triggers when SELECT queries are non-paged, which may lead to performance issues due to inefficient data retrieval.
  • alert: cqlNoTokenAware

    • expr: cql:non_token_aware > 0
    • description: Some queries are not token-aware.
    • summary: Non-token-aware statements.
    • Explanation: This alert indicates that some queries are not token-aware, which can result in inefficient data distribution and retrieval.
  • alert: cqlAllowFiltering

    • expr: cql:allow_filtering > 0
    • description: Some queries use ALLOW FILTERING.
    • summary: Allow filtering queries.
    • Explanation: This alert triggers when queries use the ALLOW FILTERING clause, which can significantly degrade performance due to inefficient filtering.
  • alert: cqlCLAny

    • expr: cql:any_queries > 0
    • description: Some queries use Consistency Level: ANY.
    • summary: Consistency Level ANY queries.
    • Explanation: This alert is triggered when queries are using the consistency level ANY, which may result in weaker guarantees for read and write operations.
  • alert: cqlCLAll

    • expr: cql:all_queries > 0
    • description: Some queries use Consistency Level: ALL.
    • summary: Consistency Level ALL queries.
    • Explanation: This alert triggers when queries use the consistency level ALL, which ensures strong consistency but can impact performance and availability.
  • alert: nonBalancedcqlTraffic

    • expr: abs(rate(scylla_cql_updates{conditional="no"}[1m]) - scalar(avg(rate(scylla_cql_updates{conditional="no"}[1m]))))/scalar(stddev(rate(scylla_cql_updates{conditional="no"}[1m]))+100) > 2
    • description: CQL queries are not balanced among shards {{ $labels.instance }} shard {{ $labels.shard }}.
    • summary: CQL queries are not balanced.
    • Explanation: This alert indicates that CQL queries are not evenly distributed among shards, leading to potential performance bottlenecks.
  • alert: nodeLocalErrors

    • expr: sum(errors:local_failed) by (cluster, instance) > 0
    • description: Some operations failed at the replica side.
    • summary: Replica side level error.
    • Explanation: This alert triggers when there are errors on the replica side, indicating issues with data replication or query processing.
  • alert: nodeIOErrors

    • expr: sum(rate(scylla_reactor_aio_errors[60s])) by (cluster, instance) > 0
    • description: IO Errors can indicate a node with a faulty disk {{ $labels.instance }}.
    • summary: IO Disk Error.
    • Explanation: This alert triggers when there are IO errors, which might mean a faulty disk on the node.
  • alert: nodeCLErrors

    • expr: sum(errors:operation_unavailable) by (cluster) > 0
    • description: Some operations failed due to consistency level.
    • summary: Consistency Level Error.
    • Explanation: This alert triggers when operations fail because the desired consistency level wasn’t met, which can be a sign of cluster issues.
  • alert: preparedCacheEviction

    • expr: sum(rate(scylla_cql_prepared_cache_evictions[2m])) by (cluster) + sum(rate(scylla_cql_authorized_prepared_statements_cache_evictions[2m])) by (cluster) > 100
    • description: The prepared-statement cache is being continuously evicted, which could indicate a problem in your prepared-statement usage logic.
    • summary: Prepared Cache Eviction.
    • Explanation: This alert triggers when the cache for prepared statements is frequently cleared, which might mean there’s a problem with how prepared statements are used.
  • alert: heavyCompaction

    • expr: max(scylla_scheduler_shares{group="compaction"}) by (cluster) >= 1000
    • description: Compaction load has increased to a level where it can interfere with system behaviour. If this persists, set the compaction share to a static level.
    • summary: Heavy Compaction Load.
    • Explanation: This alert triggers when compaction processes use too much of the system’s resources, which can disrupt normal operations.
  • alert: shedRequests

    • expr: max(sum(rate(scylla_transport_requests_shed[60s])) by (instance,cluster)/sum(rate(scylla_transport_requests_served{}[60s])) by (instance, cluster)) by(cluster) > 0.01
    • description: More than 1% of the requests were shed; this is an indication of an overload. Consider resizing the system.
    • summary: System is overloaded.
    • Explanation: This alert triggers when more than 1% of requests are shed, indicating the system might be overloaded and might need resizing.
  • alert: cappedTombstone

    • expr: changes(scylla_sstables_capped_tombstone_deletion_time[1h]) > 0
    • description: Tombstone delete time was set too far in the future and was capped.
    • summary: Tombstone delete time is capped.
    • Explanation: This alert triggers when the tombstone delete time is adjusted because it was set too far in the future.
  • alert: InstanceDown (5 minutes)

    • expr: up{job="scylla"} == 0
    • description: Instance has been down for more than 5 minutes.
    • summary: Instance down.
    • Explanation: This alert triggers when a Scylla instance is down for more than 5 minutes.
  • alert: InstanceDown (10 minutes)

    • expr: sum(up{job="scylla"}>0)by(instance) unless sum(scylla_transport_requests_served{shard="0"}) by(instance)
    • description: Instance has been down for more than 10 minutes.
    • summary: Instance down.
    • Explanation: This alert triggers when a Scylla instance is still being scraped but has not reported serving any requests for more than 10 minutes, catching nodes that appear up but are not actually serving traffic.
  • alert: InstanceDown (operation mode)

    • expr: scylla_node_operation_mode > 3
    • description: Instance has been down for more than 5 minutes.
    • summary: Instance down.
    • Explanation: This alert triggers when the Scylla node operation mode is greater than 3 for more than 5 minutes, i.e. the node has left normal operation (for example it is leaving, draining, or decommissioned).
  • alert: DiskFull (Warning)

    • expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 35
    • description: ‘Instance has less than 35% free disk space.’
    • summary: Instance low disk space.
    • Explanation: This alert triggers when the available disk space in /var/lib/scylla is less than 35%. It indicates that the disk space is getting low.
  • alert: DiskFull (Error)

    • expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 25
    • description: ‘Instance has less than 25% free disk space.’
    • summary: Instance low disk space.
    • Explanation: This alert triggers when the available disk space in /var/lib/scylla is less than 25%. It indicates a critical disk space issue that needs immediate attention.
  • alert: DiskFull (Critical)

    • expr: node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} / node_filesystem_size_bytes{mountpoint="/var/lib/scylla"} * 100 < 15
    • description: ‘Instance has less than 15% free disk space.’
    • summary: Instance low disk space.
    • Explanation: This alert triggers when the available disk space in /var/lib/scylla is less than 15%. It indicates a severe disk space shortage that could affect system performance and stability.
  • alert: DiskFull (Root Partition)

    • expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 20
    • description: ‘Instance has less than 20% free disk space on the root partition.’
    • summary: Instance low disk space.
    • Explanation: This alert triggers when the available disk space on the root partition (/) is less than 20%. It indicates that the root partition is running low on space, which could impact the overall system functionality.
  • alert: NoCql

    • expr: scylla_manager_healthcheck_cql_status == -1
    • description: ‘Instance has denied CQL connection for more than 30 seconds.’
    • summary: Instance no CQL connection.
    • Explanation: This alert triggers when the Scylla instance denies CQL connections for more than 30 seconds. It indicates a potential issue with CQL connectivity.
  • alert: HighLatencies (Write)

    • expr: wlatencyp95{by="instance"} > 100000
    • description: ‘Instance has 95% high latency for more than 5 minutes.’
    • summary: Instance High Write Latency.
    • Explanation: This alert triggers when the 95th percentile of write latencies for the instance exceeds 100,000 microseconds (µs) for more than 5 minutes. It indicates high write latencies, which could affect write performance.
  • alert: HighLatencies (Write, Average)

    • expr: wlatencya{by="instance"} > 10000
    • description: ‘Instance has average high latency for more than 5 minutes.’
    • summary: Instance High Write Latency.
    • Explanation: This alert triggers when the average write latency for the instance exceeds 10,000 microseconds (µs) for more than 5 minutes. It indicates sustained high write latencies.
  • alert: HighLatencies (Read)

    • expr: rlatencyp95{by="instance"} > 100000
    • description: ‘Instance has 95% high latency for more than 5 minutes.’
    • summary: Instance High Read Latency.
    • Explanation: This alert triggers when the 95th percentile of read latencies for the instance exceeds 100,000 microseconds (µs) for more than 5 minutes. It indicates high read latencies, which could affect read performance.
  • alert: HighLatencies (Read, Average)

    • expr: rlatencya{by="instance"} > 10000
    • description: ‘Instance has average high latency for more than 5 minutes.’
    • summary: Instance High Read Latency.
    • Explanation: This alert triggers when the average read latency for the instance exceeds 10,000 microseconds (µs) for more than 5 minutes. It indicates sustained high read latencies.
  • alert: BackupFailed

    • expr: (sum(scylla_manager_scheduler_run_total{type=~"backup", status="ERROR"}) or vector(0)) - (sum(scylla_manager_scheduler_run_total{type=~"backup", status="ERROR"} offset 3m) or vector(0)) > 0
    • description: ‘Backup failed’
    • summary: Backup task failed.
    • Explanation: This alert triggers when the count of failed backup tasks has increased compared with its value 3 minutes earlier. It indicates that one or more backup operations have failed recently.
  • alert: RepairFailed

    • expr: (sum(scylla_manager_scheduler_run_total{type=~"repair", status="ERROR"}) or vector(0)) - (sum(scylla_manager_scheduler_run_total{type=~"repair", status="ERROR"} offset 3m) or vector(0)) > 0
    • description: ‘Repair failed’
    • summary: Repair task failed.
    • Explanation: This alert triggers when the count of failed repair tasks has increased compared with its value 3 minutes earlier. It indicates that one or more repair operations have failed recently.
  • alert: restart

    • expr: resets(scylla_gossip_heart_beat[1h]) > 0
    • description: ‘Node restarted’
    • summary: Instance restarted.
    • Explanation: This alert triggers when the gossip heartbeat counter resets within the last hour, which is evidence of a node restart. It indicates that the Scylla instance has recently been restarted.
  • alert: oomKill

    • expr: changes(node_vmstat_oom_kill[1h]) > 0
    • description: ‘OOM Kill on instance’
    • summary: A process was terminated on Instance.
    • Explanation: This alert triggers when an Out-Of-Memory (OOM) kill occurs on the Scylla instance. It indicates that a process was terminated due to memory constraints.
  • alert: tooManyFiles

    • expr: (node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 20000
    • description: Over 20k open files in /var/lib/scylla per shard.
    • summary: There are over 20K open files per shard on Instance.
    • Explanation: This alert triggers when the number of open files in the /var/lib/scylla directory exceeds 20,000 per shard on the Scylla instance. It indicates a potential issue with file system resource usage.
  • alert: tooManyFiles

    • expr: (node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 30000
    • description: Over 30k open files in /var/lib/scylla per shard.
    • summary: There are over 30K open files per shard on Instance.
    • Explanation: This alert triggers a warning when the number of open files in the /var/lib/scylla directory exceeds 30,000 per shard on the instance, which can impact performance.
  • alert: tooManyFiles

    • expr: (node_filesystem_files{mountpoint="/var/lib/scylla"} - node_filesystem_files_free{mountpoint="/var/lib/scylla"}) / on(instance) group_left count(scylla_reactor_cpu_busy_ms) by (instance) > 40000
    • description: Over 40k open files in /var/lib/scylla per shard.
    • summary: There are over 40K open files per shard on Instance.
    • Explanation: This alert triggers an error when the number of open files in the /var/lib/scylla directory exceeds 40,000 per shard on the instance, indicating a potentially critical issue impacting system stability.
  • alert: nodeInJoinMode

    • expr: scylla_node_operation_mode == 2
    • description: Node in Joining mode for 5 hours (or 1 day for the warning alert).
    • summary: Node in Joining mode for 5 hours (or 1 day for the warning alert).
    • Explanation: This alert notifies when a Scylla node has been in Joining mode for an extended period, which might indicate issues with cluster membership or node initialization.
  • alert: splitBrain

    • expr: sum(scylla_gossip_live) < (count(scylla_node_operation_mode==3)-1) * count(scylla_gossip_live)
    • description: Cluster in a split-brain mode.
    • summary: Some nodes in the cluster do not see all of the other live nodes.
    • Explanation: This alert triggers when the Scylla cluster experiences split-brain syndrome, where nodes are unable to communicate with each other properly, potentially leading to data consistency issues.
  • alert: bloomFilterSize

    • expr: scylla_sstables_bloom_filter_memory_size/scylla_memory_total_memory > 0.2
    • description: Bloom filter size in node.
    • summary: The bloom filter takes too much memory, update bloom_filter_fp_chance.
    • Explanation: This alert triggers when the bloom filter on a Scylla node exceeds 20% of the total available memory, which can impact performance and may require adjusting the bloom_filter_fp_chance parameter to optimize memory usage.