Before initiating the monitoring of the Scylla server, it is essential for the C3 team to understand the scylla-server’s role and significance. For information, refer to the links provided below.
Dashboard & Row | Panel | Panel Description | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
---|---|---|---|---|---|---|---|---|---|---|---|
1.1 | Nodes | Number of nodes present in the cluster | count(scylla_scylladb_current_version{cluster="pssb-ds-sdc"}) | This metric is exposed on every Scylla node when Scylla is up and running; we count the series by cluster name | 5 | scylla_scylladb_current_version | Shows the current version of Scylla | We only count the series; the result is independent of this metric's values | != 5 | - | nodes == 5 |
1.1 | InActive | The number of nodes that are up but not actively part of the cluster, either because they are still joining or because they are leaving. | 5-count(scylla_node_operation_mode{cluster=~"pssb-ds-sdc"}==3) | Counts the nodes that are in normal mode in the cluster named pssb-ds-sdc | 0 | scylla_node_operation_mode | unknown=0, starting=1, joining=2, normal=3, leaving=4, decommissioned=5 | 3 | != 0 | - | inactive nodes == 0 |
1.1 | UnReachable | The number of unreachable nodes, usually because a machine is down or unreachable. | (count(scrape_samples_scraped{job="scylla", cluster=~"pssb-ds-sdc"}==0) OR vector(0)) | Checks whether Prometheus is failing to scrape metrics from any Scylla instance(s) in the pssb-ds-sdc cluster | 0 | scrape_samples_scraped | Number of samples scraped by Prometheus | scraped samples > 0 | value != 0 | - | unreachable nodes == 0 |
1.1 | Split Brain Mode | Split-brain mode is when different parts of the cluster operate independently, unaware of each other. | sum(scylla_gossip_live{job="scylla",instance=~"pssb.*"}) | Each node sees the other 4 nodes in the cluster, so the sum of all metric values is 20 (4 x 5 nodes) | 20 | scylla_gossip_live | How many other nodes this node sees as live | 4 | Yes | - | No |
1.1 | Scylla Reactor Utilization | Shows the percentage of CPU time each reactor (core) spends handling Scylla tasks | sum(scylla_reactor_utilization{cluster="pssb-ds-sdc"}) by (instance) | Shows how busy all CPU cores on each instance are with Scylla's tasks | 0.1 to (100 x no. of CPUs) | scylla_reactor_utilization | Shows how busy each CPU core is with Scylla's tasks | 0.1 - 100 per core | > 20 per instance | 10 - 20 per instance | 0 - 10 per instance |
1.2 | Scylla Reads for Instance per second | Scylla reads performed by each instance per second, based on the last 5 minutes | sum(rate(scylla_database_total_reads{cluster="pssb-ds-sdc"}[5m])) by (instance) | Sums the reads performed by each shard on each instance | positive values | scylla_database_total_reads | Total reads performed by the Scylla database so far | positive values | > 30 reads/sec | 20 - 30 reads/sec | < 20 reads/sec |
1.2 | Scylla Writes for Instance per second | Writes performed by each instance per second, based on the last 5 minutes | sum(rate(scylla_database_total_writes{cluster="pssb-ds-sdc"}[5m])) by (instance) | Sums the writes performed by each shard on a single instance | positive values | scylla_database_total_writes | Total writes performed by the Scylla database so far | positive values | > 5 writes/sec | 2 - 5 writes/sec | 0 - 1 writes/sec |
1.2 | Scylla Memory Usage | Scylla memory usage percentage relative to Scylla's dedicated memory | 100 - (((sum(scylla_memory_free_memory{cluster="pssb-ds-sdc"}) by (instance)) / (sum(scylla_memory_allocated_memory{cluster="pssb-ds-sdc"}) by (instance))) * 100) | Scylla memory usage percentage based on the Scylla dedicated memory limit | positive values | scylla_memory_free_memory, scylla_memory_allocated_memory | Free memory and allocated memory for Scylla, in bytes | positive values | > 95% | 90% - 95% | < 90% |
2.1 | Scylla High Inserts per Second | Shows the number of inserts per second in the Scylla database | sum(rate(scylla_cql_inserts{cluster="pssb-ds-sdc"}[5m])) by (instance) | Scylla inserts performed per second on each instance | positive values | scylla_cql_inserts | Number of inserts on each shard of the Scylla database | positive values | > 1 | 0.5 - 1 | 0 - 0.5 |
2.1 | Scylla CQL Connections by Instance | Shows the number of CQL connections currently established | sum(scylla_transport_current_connections{cluster="pssb-ds-sdc"}) by (instance) | Sums all CQL connections across each shard on each instance | positive values | scylla_transport_current_connections | Number of CQL connections per shard right now | number_of_connections <= 10 | > 10 | 8.1 - 10 | 0 - 8.0 |
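For reference, a minimal Prometheus alerting-rule sketch for the first table row could look like the following; the rule name reuses the ScyllaNodeDown alert discussed later, while the 5m hold time and file placement are assumptions to adapt to your own rule files:
groups:
  - name: scylla-cluster
    rules:
      - alert: ScyllaNodeDown
        # CRITICAL when the number of nodes reporting a version metric is not the expected 5
        expr: count(scylla_scylladb_current_version{cluster="pssb-ds-sdc"}) != 5
        for: 5m        # assumed hold time before the alert fires
        labels:
          severity: critical
        annotations:
          summary: "Scylla cluster pssb-ds-sdc does not have 5 nodes reporting"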
When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.
Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
For the referenced alerts, utilize the provided PromQL queries. Ensure you adjust the labels and time ranges in the queries to match the specific node or cluster to collect data. Analyze the results thoroughly, comparing trends and patterns across relevant time periods. To draw accurate conclusions, experiment with additional custom PromQL queries to cross-validate findings and gather comprehensive data insights.
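For example, any query from the table above can be run directly against the Prometheus HTTP API from a shell; the endpoint below is a placeholder, so substitute your monitoring server's address and adjust the cluster label and time window as needed:
PROM_URL="http://prometheus.example.local:9090"   # placeholder endpoint
# Count the nodes reporting a Scylla version in cluster pssb-ds-sdc over the last hour
curl -sG "${PROM_URL}/api/v1/query_range" \
  --data-urlencode 'query=count(scylla_scylladb_current_version{cluster="pssb-ds-sdc"})' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode "step=60"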
A Scylla Node Down indicates that the node is not operational or unreachable in the cluster. This can be caused by hardware failure, network issues, misconfiguration, or a crash. The node’s unavailability could lead to degraded performance, reduced fault tolerance, and potential data inconsistency depending on the replication factor and cluster topology.
Node Information:
Collect the following details about the affected node:
Cluster Status:
To verify the cluster’s status and the affected node, log in to any node within the cluster and run the following command:
nodetool status
Example Output
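Since the original screenshot is not reproduced here, the following is an illustrative sample of what nodetool status prints; the addresses, loads, and host IDs are placeholders modeled on the examples later in this document, not values from the actual cluster:
Datacenter: pssb-ds-sdc
Status=Up/Down
State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns    Host ID                               Rack
UN  172.21.0.61   120.5 GB  256     33.3%   8d5ed9f4-7764-4dbd-bad8-43fddce94b7c  rack1
UN  172.21.0.62   118.2 GB  256     33.3%   125ed9f4-7777-1dbn-mac8-43fddce9123e  rack1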
Example Explanation for UN:
The second and third lines of the nodetool status output header (Status=Up/Down and State=Normal/Leaving/Joining/Moving) explain the two-letter code shown in the first column of the example output above (e.g. UN).
Status Explanation
Status Code | Description |
---|---|
U | The node is up. |
D | The node is down. |
State Explanation
State Code | Description |
---|---|
N | The node is in a normal state. |
L | The node is leaving the cluster. |
J | The node is joining the cluster. |
M | The node is moving within the cluster. |
If any node is found in a state other than UN in the Scylla cluster, note down that node's details, log in to the node, and perform the following operations:
Scylla Service Status:
Check the Scylla service status and recent logs on the affected node:
systemctl status scylla-server
journalctl -xeu scylla-server
sudo journalctl -u scylla-server --since "1 hour ago" --no-pager
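If the log volume is large, the same journal can be narrowed down to likely failure lines, for example:
sudo journalctl -u scylla-server --since "1 hour ago" --no-pager | grep -iE "error|fail|abort|oom"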
Handling Inactive or Failed Scylla Service
When the Scylla service is found to be inactive or in a failed state, follow these steps to investigate and resolve the issue:
If the Scylla service is inactive:
$ uptime
16:49:45 up 13 days, 22:45, 7 users, load average: 0.99, 0.62, 0.52
If the Scylla service is in a failed state:
Use the journalctl commands shown above and try to find the reason for the service failure.
When troubleshooting a Scylla Node Down alert, the following affected metrics can help identify the root cause. These metrics focus on system resources, Scylla-specific performance, and hardware/network issues:
1. System Resource Metrics
uptime
node_memory_MemAvailable_bytes, node_cpu_seconds_total
Disk Space Used Basic Grafana Dashboard Panel in the OS folder (Link)
mount -a (confirm all filesystems in /etc/fstab are mounted)
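A quick way to snapshot these resources on the affected node with standard Linux tools (the data directory path follows the default /var/lib/scylla layout used elsewhere in this document):
uptime                      # load averages and time since last reboot
free -h                     # used vs. available memory
df -h /var/lib/scylla       # disk space on the Scylla data filesystem
iostat -x 1 3               # per-device I/O utilization, three one-second samples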
Cluster Impact:
A down node can seriously affect the cluster in the following ways:
When troubleshooting a ScyllaNodeDown alert, follow these steps based on the situation:
Proceed with the following steps only when a single scylla-server is down; otherwise, inform the DevOps team and share the collected data.
If Only One System Has Recently Rebooted:
sudo systemctl restart scylla-server
If More Than One Node Got Rebooted:
If the Error is Related to Memory, Disk, or CPU:
If Failing to Achieve Quorum Level:
When handling a ScyllaNodeDown alert, the DevOps team should follow these steps to ensure proper resolution:
If Scylla is in an Inactive or Failed State Due to CPU, Memory, or Disk Issues:
sudo systemctl restart scylla-server
If the Cluster is Failing to Achieve Quorum Level:
Run the nodetool status command to identify nodes that are down or in an abnormal state.
nodetool status
nodetool repair
A ScyllaNodesInactive alert indicates that one or more nodes in the Scylla cluster are in a state where they are up but are not actively participating in the cluster. This status is often observed during cluster transitions, such as when nodes are:
Inactive nodes can cause temporary disruptions in cluster performance, replication, and fault tolerance, depending on the cluster topology and data distribution.
Node Information:
Collect the following details about the affected nodes:
Cluster Status:
To identify nodes that are inactive, log in to any active node in the cluster and run the following command:
nodetool status
Refer to the first alert to understand the output of nodetool status.
Service Status:
Verify the Scylla service on the inactive node to ensure it is running:
systemctl status scylla-server
If the service is active, check the logs to determine the progress of the node’s transition (e.g., joining, leaving). Use the following command to check recent logs:
sudo journalctl -u scylla-server --since "1 hour ago" --no-pager
The following metrics can help identify the cause of the inactive nodes:
When a Scylla node is in joining or leaving mode in a Scylla cluster, it is undergoing a transitional state where its membership in the cluster is being updated. During these states, it is essential to monitor specific dependent metrics to ensure cluster health, consistency, and performance. Here’s a detailed list of key metrics to track during these modes:
Joining or Leaving Mode:
In this state, the node is being added to the cluster and is receiving data through streaming.
Streaming Metrics:
scylla_streaming_total_incoming_bytes and scylla_streaming_total_outgoing_bytes: Track the total amount of data to be streamed to/from the joining node.
scylla_streaming_finished_percentage: Tracks the percentage of data already streamed.
Node Status:
nodetool status: Ensures the node is correctly listed as J (Joining). A node in leaving mode shows the status L.
Streaming Progress:
Use nodetool netstats
to monitor the progress of streaming operations (data synchronization between nodes).
Look for the completion of repair operations initiated during node joining and make sure no messages remain in a pending state.
nodetool netstats
Example Output
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 17107
Mismatch (Blocking): 7
Mismatch (Background): 13
Pool Name Active Pending Completed Dropped
Large messages n/a 0 25062982 0
Small messages n/a 0 29354568 0
Gossip messages n/a 0 0 0
journalctl -u scylla-server --since "1 hour ago" --no-pager
When responding to a ScyllaNodesInactive alert that has been active for a long time, follow these steps:
If the Node is Joining the Cluster:
Run nodetool netstats to monitor the progress of data synchronization.
If the Node is Leaving the Cluster:
nodetool decommission status
If System Resource Issues are Detected:
If the Node Remains Inactive Beyond Expected Time:
sudo systemctl restart scylla-server
If System Resource Bottlenecks Cause Inactivity:
If the Cluster Configuration is Affected:
nodetool status
If Node Cannot Rejoin Automatically:
nodetool netstats
A Scylla Split Brain Mode occurs when two or more subsets of nodes in the cluster form separate groups, each thinking they are the entire cluster. This situation can cause conflicting writes between subsets, an inability to achieve quorum or partition data correctly, increased latency, and degraded throughput.
Cluster Status:
Use the following command to check the overall cluster state:
nodetool status
Look for nodes that are unreachable (D
state) or nodes forming separate groups (nodes not appearing as part of the same cluster).
Ring View Comparison:
Run nodetool ring on different nodes to confirm whether there is a mismatch in the cluster's ring structure.
Run this command on every node of the Scylla cluster:
nodetool ring
Example Output:
Node 1 (172.21.0.61):
Address DC Rack Status State Load Owns Token
pssb1avm001 pssb-ds-sdc rack1 Up Normal 120 GB 33.33% 0
pssb1avm002 pssb-ds-sdc rack1 Down Leaving 130 GB 33.33% 28035242
Node 2 (172.21.0.62):
Address DC Rack Status State Load Owns Token
pssb1avm001 pssb-ds-sdc rack1 Up Normal 120 GB 33.33% 0
pssb1avm002 pssb-ds-sdc rack1 Up Joining 130 GB 33.33% 43345242
In the two outputs above, we can observe that the status of the pssb1avm002 node is Down and Leaving in Node 1's view, but Up and Joining in Node 2's view.
Explanation:
Conflicting states for the same node (Leaving on one node, Joining on another) indicate potential split-brain behavior.
Gossip State:
Inspect gossip state to identify communication issues between nodes:
nodetool gossipinfo
nodetool statusgossip
Example Output:
/10.0.0.1
generation:1617123457
heartbeat:1081
STATUS:16:NORMAL,-1
LOAD:120.45
SCHEMA:84:a3a2e354-8a5c-39b4-809c-2125dfb8f123
DC:datacenter1
RACK:rack1
/10.0.0.2
generation:1617123456
heartbeat:2
STATUS:16:LEAVING,-1
LOAD:0.00
SCHEMA:84:a3a2e354-8a5c-39b4-809c-2125dfb8f123
DC:datacenter1
RACK:rack1
Look for discrepancies such as nodes in DOWN
state or inconsistent HOST_ID
.
In this example, Node 2 has a much lower heartbeat and is in a LEAVING
state, indicating it might be part of a split cluster.
System Logs:
Review Scylla logs for errors related to network partitions or split-brain scenarios:
journalctl -u scylla-server --since "1 hour ago" --no-pager
Metrics to Monitor:
scylla_gossip_live: Represents the number of nodes that a specific Scylla node recognizes as live in the cluster.
For instance, in a five-node cluster, each node should report 4 other nodes as live. Any deviation indicates potential cluster connectivity issues.
scylla_gossip_unreachable: Tracks the number of nodes that are unreachable by the given Scylla node.
The expected value for this metric should always be 0. Non-zero values may indicate network issues or node failures.
For detailed visualization and analysis, refer to the Grafana dashboard panel: Scylla Gossip Metrics.
Verify Node Network Connectivity:
Check inter-node network reachability using ping
or telnet
.
ping <node_ip>
telnet <node_ip> 9042
Make sure all the ports listed below are allowed through the firewall on the partitioned node (see the reachability sketch after the port list):
9042/tcp
9142/tcp
7000/tcp
7001/tcp
7199/tcp
10000/tcp
9180/tcp
9100/tcp
9160/tcp
19042/tcp
19142/tcp
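As a quick reachability sketch, each listed port can be probed from another cluster node using bash's built-in /dev/tcp; the IP below is a placeholder for the partitioned node:
NODE_IP=172.21.0.62   # placeholder: replace with the partitioned node's IP
for port in 9042 9142 7000 7001 7199 10000 9180 9100 9160 19042 19142; do
  # /dev/tcp probe with a 3-second timeout; no extra tools required
  if timeout 3 bash -c "</dev/tcp/${NODE_IP}/${port}" 2>/dev/null; then
    echo "port ${port} reachable"
  else
    echo "port ${port} blocked or closed"
  fi
done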
Restart Gossip Service:
If any nodes are unable to synchronize gossip state, restart the gossip service in that node:
nodetool disablegossip
nodetool enablegossip
Validate Node Configuration:
Ensure all nodes are using the same cluster_name and seed_node configuration in scylla.yaml.
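A minimal per-node check of these settings (the key names follow the scylla.yaml greps used later in this document; the seeds entry corresponds to the seed configuration referred to above):
grep -E '^cluster_name' /etc/scylla/scylla.yaml
grep 'seeds:' /etc/scylla/scylla.yaml
grep -E '^endpoint_snitch' /etc/scylla/scylla.yaml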
Monitor Cluster Quorum:
Use Grafana or monitoring tools to observe quorum consistency.
Node Rejoin:
For nodes isolated due to split-brain, rejoin them using:
nodetool repair
Resolve Network Partition:
Reform the Cluster:
Run nodetool repair as needed to stabilize the cluster.
Run nodetool removenode <host-id> when the node is in the DN state to restore cluster stability, then rejoin that node to the cluster.
Cluster-wide Configuration Check:
Ensure all nodes use the same cluster_name.
Verify the seed_node configuration.
If the error still persists, try rebuilding the cluster again without deleting any data.
The Scylla High Reactor Utilization alert indicates excessive reactor thread utilization in one or more ScyllaDB nodes. Reactor threads are critical for processing requests and performing I/O operations, and high utilization can lead to degraded performance, increased latencies, and a potential bottleneck in the node’s operations. This typically occurs due to unbalanced workloads, high traffic routed to a single node, or resource constraints such as CPU or memory.
To troubleshoot the alert, perform the following steps:
Node Status: Run the following command to check the load on the node:
nodetool status
Ensure that the load is manageable compared to other nodes in the cluster. High load on this node might explain elevated reactor utilization.
Reactor Utilization Metric:
Check the scylla_reactor_utilization
metric in Prometheus or Grafana. This metric shows per-CPU reactor utilization as a percentage between 0 and 100. Analyze recent changes to identify patterns or spikes.
Scylla Dashboards: Review the Scylla Dashboards folder for any major changes in key panels, such as CPU usage, latency, or throughput. Dashboard Link
CPU Load: Verify the CPU load on the affected node:
uptime
top
High CPU utilization can exacerbate reactor thread bottlenecks.
Pending Reactor Tasks:
Check the scylla_reactor_tasks_pending
metric in Prometheus. Increased pending tasks can indicate a backlog of work for the reactor threads, potentially caused by high traffic or slow query processing.
Scylla Connections: Investigate if all incoming connections are routed to this specific node. High connection concentration can cause elevated reactor utilization. Check connection distribution and rebalance traffic if necessary.
Monitor the following metrics to identify the root cause of the high reactor utilization:
Pending Tasks:
scylla_reactor_tasks_pending
A rise in this metric indicates a backlog of work for the reactor threads.
Reactor Utilization:
scylla_reactor_utilization
Shows reactor thread usage per CPU. Higher values indicate increased utilization.
Reactor Stalls:
rate(scylla_reactor_stalls_count{instance="$instance"}[5m])
Tracks stalls in reactor operations. A rising trend in the graph suggests delays in processing.
Threaded I/O Fallbacks:
rate(scylla_reactor_io_threaded_fallbacks{instance="$instance"}[5m])
Tracks instances where the reactor falls back to threaded I/O. Increased values suggest the reactor cannot handle I/O efficiently.
Dashboard Panel:
Refer to the panel in the Grafana dashboard to visualize these metrics and correlate them with the observed issue.
Dashboard Panel Link
Monitor CPU Load:
Observe if the CPU load starts stabilizing or decreasing naturally over time. If the node remains operational and queries are being processed, the system might self-recover.
Check Request Processing:
Ensure that incoming requests are being processed without significant delays or errors.
Escalate if Needed:
If the reactor utilization continues to rise, latency values increase, and the node becomes difficult to operate, treat the issue as critical and inform the DevOps team immediately.
Resource Scaling: If the system is underprovisioned, add resources such as additional CPU cores or scale the cluster horizontally by adding more nodes.
Load Balancing: Review and adjust traffic distribution across the cluster. Ensure no single node is receiving disproportionate traffic, which could lead to elevated reactor utilization.
Restart Scylla Service: If the issue persists after balancing and scaling, consider restarting the Scylla service to clear any transient states:
sudo systemctl restart scylla-server
The ScyllaHighWrites alert is triggered when a node experiences an unusually high volume of write requests. This condition can result in increased latencies, pressure on commit logs, unbalanced load distribution, and potential node instability. Resolving this alert ensures consistent cluster performance and prevents write-related failures.
To identify and troubleshoot the cause of the high write load, perform the following steps:
Verify Node Load with nodetool status:
Run nodetool status to check the load distribution across nodes.
Monitor Write Latency:
Use the metric scylla_storage_proxy_coordinator_write_latency
in Grafana or Prometheus.
High latency suggests that the node is struggling to handle the volume of write requests.
Analyze Write Attempts:
sum(rate(scylla_storage_proxy_coordinator_total_write_attempts_local_node{instance="psorbit-node01"}[5m])) by (instance)
sum(rate(scylla_storage_proxy_coordinator_total_write_attempts_remote_node{instance="psorbit-node01"}[5m])) by (instance)
Check Database Write Volume:
sum(rate(scylla_database_total_writes{cluster="pssb-ds-sdc"}[5m])) by (instance)
An increase in this value confirms a spike in write requests across the database.
Review Grafana Dashboard:
Access the relevant panels to examine write-related metrics and trends for anomalies.
Grafana Dashboard Link
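As a sketch, the average coordinator write latency per instance can also be pulled from the Prometheus HTTP API, assuming the latency metric is exported with _sum and _count series (verify the exact series names in your Prometheus; the endpoint is a placeholder):
PROM_URL="http://prometheus.example.local:9090"   # placeholder endpoint
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=sum(rate(scylla_storage_proxy_coordinator_write_latency_sum{cluster="pssb-ds-sdc"}[5m])) by (instance) / sum(rate(scylla_storage_proxy_coordinator_write_latency_count{cluster="pssb-ds-sdc"}[5m])) by (instance)'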
The following metrics provide deeper insights into the issue and help correlate the root cause:
Write Latency:
scylla_storage_proxy_coordinator_write_latency
: A high value indicates write performance degradation.
Local and Remote Write Attempts:
scylla_storage_proxy_coordinator_total_write_attempts_local_node
: Tracks local write operations.
scylla_storage_proxy_coordinator_total_write_attempts_remote_node
: Tracks remote write operations.
Commit Log Flush Rate:
rate(scylla_commitlog_flush{instance="pssb1avm001"}[1m])
: Increased flush rates suggest high write activity and pressure on commit logs.
The C3 team has limited interaction with this alert; communicate with DevOps if it becomes critical.
Wait for Stabilization: If the write load is due to a temporary surge (e.g., batch processing), monitor the system to ensure it stabilizes over time.
Observe Load Distribution: Check other nodes to confirm that write requests are evenly distributed across the cluster.
Escalate if Critical: If the affected node becomes unresponsive or unmanageable, escalate the issue to the DevOps team immediately. Treat this as a critical priority.
Increase Resources: Allocate additional resources (CPU, memory, or disk capacity) to the affected nodes to handle the increased write workload.
Investigate Root Cause: Analyze the source of the high write load, such as unbalanced traffic, unoptimized workloads, or application anomalies. Address these issues to prevent future occurrences.
Manage Token Distribution: Review and adjust token assignments to ensure an even distribution of write loads across the cluster.
Update Configuration: If necessary, tune ScyllaDB configurations to handle a specific number of writes more efficiently (e.g., commitlog thresholds or write batch limits).
The HighScyllaReads alert indicates a node or cluster is experiencing an unusually high volume of read requests. This condition can lead to increased latency, elevated resource utilization, and potential performance degradation if not managed promptly. Properly diagnosing and addressing this alert ensures consistent query performance and balanced cluster operation.
Follow these steps to investigate and gather data related to the high read activity:
Check Node Load with nodetool status:
Run nodetool status to inspect the load distribution across nodes.
A disproportionately high load on a specific node indicates uneven read traffic.
Monitor Read Latency:
Use the metric scylla_storage_proxy_coordinator_read_latency
in Grafana or Prometheus.
High latency suggests the node is struggling to process read requests efficiently.
Analyze Read Operations:
sum(rate(scylla_storage_proxy_coordinator_total_read_attempts_local_node{instance="psorbit-node01"}[5m])) by (instance)
sum(rate(scylla_storage_proxy_coordinator_total_read_attempts_remote_node{instance="psorbit-node01"}[5m])) by (instance)
Evaluate Read Volume:
sum(rate(scylla_database_total_reads{cluster="pssb-ds-sdc"}[5m])) by (instance)
Review Grafana Dashboard:
Key metrics to monitor for correlating and diagnosing the issue include:
Read Latency:
scylla_storage_proxy_coordinator_read_latency: Elevated values signal a delay in processing read requests.
Local and Remote Read Attempts:
scylla_storage_proxy_coordinator_total_read_attempts_local_node: Measures local read operations.
scylla_storage_proxy_coordinator_total_read_attempts_remote_node: Tracks remote read operations.
Database Read Volume:
rate(scylla_database_total_reads{instance="pssb1avm001"}[1m]): Indicates the total read volume processed by the node.
Cache Hits:
scylla_cache_hits and scylla_cache_misses: Evaluate cache efficiency. A low cache hit rate may increase read latency.
Monitor and Wait for Stabilization:
Review Load Distribution:
Escalate if Critical:
Increase Node Resources:
Review Cache Efficiency:
Balance Read Traffic:
Scale the Cluster:
Restart Scylla Service if Necessary:
sudo systemctl restart scylla-server
The ScyllaHighMemory alert is triggered when ScyllaDB’s memory usage exceeds normal thresholds, potentially leading to degraded performance or system instability. High memory usage can stem from memory-intensive operations like handling large datasets, compactions, or excessive read/write workloads. Addressing this alert promptly ensures smooth cluster operations and prevents critical failures.
To investigate the ScyllaHighMemory alert, perform the following steps:
Check Read/Write Requests on the Node:
nodetool status
to verify the node’s load and traffic distribution.Monitor Memory-Intensive Operations:
journalctl -u scylla-server --since "1 hour ago" --no-pager | grep 'compaction'
Review Key Metrics:
sum(rate(scylla_database_total_reads{cluster="pssb-ds-sdc"}[5m])) by (instance)
Monitor Swap Usage:
Review the following metrics to understand the underlying cause of high memory usage
Scylla Metrics Documentation Link
Latency Metrics:
scylla_storage_proxy_coordinator_write_latency_summary
wlatencyp99
wlatencya
scylla_storage_proxy_coordinator_read_latency_summary
rlatencya
rlatency995
Scylla Memory Metrics to Monitor:
scylla_memory_allocated_memory
scylla_memory_free_memory
scylla_lsa_memory_allocated
scylla_lsa_memory_freed
Reactor Stalls:
Check scylla_reactor_stalls_count for increasing trends, indicating memory pressure affecting Scylla's event loop.
Kernel Interventions:
For addressing high memory usage as part of C3-level actions:
Increase Available RAM:
Monitor and Stabilize:
For persistent memory issues requiring deeper intervention, DevOps can take the following actions:
Update Scylla Configuration:
commitlog_total_space_in_mb: 8192
compaction_strategy: IncrementalCompactionStrategy
Analyze and Distribute Load:
num_tokens: 256
Horizontal Scaling:
Restart Scylla Services:
sudo systemctl restart scylla-server
Description:
This covers two alerts that fire when the number of CQL connections goes out of range. They trigger when the number of CQL connections in a Scylla cluster exceeds the expected threshold, indicating a potential overload or misconfiguration of the system. A high number of CQL connections can lead to resource exhaustion (e.g., CPU, memory), increased latency, and degraded performance due to network or disk bottlenecks.
Rate of CQL Requests:
rate(scylla_transport_cql_requests_count{cluster="pssb-ds-sdc"}[1m])
System-Level Metrics for Connection Analysis:
Active Connections: Use system tools to check open connections to ScyllaDB’s CQL port (default is 9042):
netstat -an | grep :9042 | wc -l
Core Resource Metrics:
Network Activity: Tools like iftop
or tcpdump
can be used to monitor traffic volume on port 9042 to ensure there is no congestion or unusually high traffic.
Client Misbehavior: Identify clients making excessive connections by checking connection attempts from each client using netstat:
netstat -an | grep :9042 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr
Example Output
4 172.21.0.63
2 172.21.0.65
2 172.21.0.64
2 172.21.0.62
2 172.21.0.61
1 172.21.0.1
Journal Logs:
journalctl -u scylla-server --since "1 hour ago" | grep CQL
Monitor the following metrics to diagnose potential issues with high CQL connections:
scylla_transport_cql_requests_count
: The total count of CQL requests, which helps in identifying increased connection activity.
scylla_transport_cql_connections
: The current number of active CQL connections in the cluster.
scylla_reactor_stalls_count
: The number of reactor stalls, which may indicate resource contention or high connection load.
scylla_reactor_utilization
: The percentage of reactor thread usage; high values suggest the system may be overwhelmed.
Load Imbalance: Some nodes show signs of high resource usage (e.g., CPU, memory), while others appear idle or underutilized. High CQL connection volumes may lead to uneven load distribution across Scylla nodes. If clients are not evenly distributing their connections, some nodes could experience high CPU/memory usage, while others remain underutilized. This imbalance can lead to resource contention and degrade cluster performance.
As a C3 team member, your role for this alert is primarily observational, with minimal direct action. Follow these guidelines to address the situation.
Evaluate the Cluster’s Connection Status:
Assess Node-Specific Connection Issues:
sudo systemctl restart scylla-server
Avoid Direct Configuration Changes:
Restart a single ScyllaDB node to test if the issue resolves and connections resume:
sudo systemctl restart scylla-server
Monitor the node after the restart. If connections are restored, proceed to restart the other nodes in the cluster, one at a time, ensuring the cluster remains consistent and operational.
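A rolling-restart sketch of that procedure, assuming SSH access with passwordless sudo and using the node IPs that appear earlier in this document as placeholders:
# Restart one node at a time; verify it is back in UN state before moving on.
for node in 172.21.0.61 172.21.0.62 172.21.0.63 172.21.0.64 172.21.0.65; do
  ssh "$node" 'sudo systemctl restart scylla-server'
  sleep 120                              # allow the node to rejoin gossip
  nodetool status | grep "$node"         # expect the line to start with UN
done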
Cluster Validation:
nodetool status
nodetool repair
nodetool repair
[2024-11-27 18:29:16,603] Starting repair command #1, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
[2024-11-27 18:29:17,697] Repair session 1
[2024-11-27 18:29:17,697] Repair session 1 finished
[2024-11-27 18:29:17,717] Starting repair command #2, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
[2024-11-27 18:29:17,819] Repair session 2
[2024-11-27 18:29:17,820] Repair session 2 finished
[2024-11-27 18:29:17,848] Starting repair command #3, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2024-11-27 18:29:18,950] Repair session 3
[2024-11-27 18:29:18,950] Repair session 3 finished
Check Tomcat and Haproxy logs to identify excessive traffic or uneven load distribution, and adjust configurations if needed. Rebalance the load if necessary by updating the Haproxy configuration to distribute connections evenly across the cluster nodes.
Examine Connection Behavior:
netstat -an | grep 9042
The ScyllaHighInserts alert means that a high volume of insert statements is being executed on the Scylla database on the respective node; addressing it promptly helps ensure smooth operations and cluster stability. High insert activity can significantly impact cluster performance, disk I/O, memory, and network utilization.
The ScyllaHighInserts alert indicates that a high volume of insert operations is being performed on the ScyllaDB cluster. This can have several effects:
To collect relevant data when this alert fires, follow these steps:
Verify Cluster Health:
Run nodetool status and confirm all nodes are in the UN (Up and Normal) state and that the Load column shows balanced data distribution.
Monitor Disk and Memory Usage: Look at the OS Grafana dashboard for further examination of the alert and its effects.
iostat -x 1
free -m
Check Latency Metrics:
scylla_storage_proxy_coordinator_write_latency
scylla_storage_proxy_coordinator_read_latency
Analyze Network Traffic:
iftop
Verify Insert Statements:
nodetool status
Monitor the following metrics to assess the system’s state during high insert activity:
scylla_memory_allocated_memory
scylla_memory_free_memory
scylla_storage_proxy_coordinator_write_latency
scylla_storage_proxy_coordinator_read_latency
scylla_database_total_writes
scylla_commitlog_flush
scylla_commitlog_pending_flushes
rate(scylla_compaction_manager_pending_compactions{cluster="pssb-ds-sdc"}[5m])
scylla_streaming_total_incoming_bytes and scylla_streaming_total_outgoing_bytes will increase.
The C3 team's role in resolving this alert is minimal, focusing on monitoring and data collection. Here's what C3 should do:
Monitor Resource Utilization:
Check Load Distribution:
Restart Scylla on Affected Nodes (if necessary):
sudo systemctl restart scylla-server
Report Critical Issues to DevOps:
The DevOps team is responsible for resolving high insert activity issues and ensuring cluster stability. Follow these steps:
Scale Resources Temporarily:
commitlog_total_space_in_mb: 8192
cache_size_in_mb: 30% of RAM
Optimize Node Memory:
SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --memory 1230M"
Proceed with this step only if increasing the memory for scylla-server will not disturb the resources required by other services on the node.
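The SCYLLA_ARGS line shown above normally lives in the scylla-server defaults file; the exact path below is an assumption that varies by distribution, so verify it on your nodes:
# RHEL/CentOS-based systems typically use /etc/sysconfig/scylla-server,
# Debian/Ubuntu-based systems /etc/default/scylla-server.
grep SCYLLA_ARGS /etc/sysconfig/scylla-server
# After editing the --memory value, restart the service to apply it:
sudo systemctl restart scylla-server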
Check and Tune Disk and Network I/O:
Investigate Application Queries:
Monitor and Tune Inter-Node Traffic:
nodetool move <new_token>
Plan for Long-Term Resource Scaling:
Removing a Running Node
Steps:
Check Node Status
Run the nodetool status
command to verify the status of all nodes in the cluster.
Example output:
Datacenter: DC1
Status=Up/Down
State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
Ensure the node you want to remove is in the Up Normal (UN) state.
Decommission the Node
Run the nodetool decommission command to remove the node you are connected to. This ensures that the data on the node being removed is streamed to the remaining nodes in the cluster.
nodetool decommission
Monitor Progress
Use the nodetool netstats command to monitor the progress of token reallocation and data streaming.
Verify Removal
Run nodetool status again to confirm that the node has been removed from the cluster.
Manually Remove Data
sudo rm -rf /var/lib/scylla/data
sudo find /var/lib/scylla/commitlog -type f -delete
sudo find /var/lib/scylla/hints -type f -delete
sudo find /var/lib/scylla/view_hints -type f -delete
Steps:
Attempt to Restore the Node
nodetool decommission
command to remove it (refer to the "Removing a Running Node" section).
Fallback Procedure: Remove Permanently Down Node
Use the nodetool removenode command with the Host ID of the node:
nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
Precautions
Ensure all other nodes in the cluster are in the Up Normal (UN) state.
Run a Full Cluster Repair before executing nodetool removenode
to ensure all replicas have the most up-to-date data.
nodetool repair -full
The -full option ensures a complete repair of all data ranges owned by the node.
If the operation fails due to node failures, re-run the repair and then retry nodetool removenode.
Warning
Do not run nodetool removenode on a running node that is reachable by other nodes in the cluster.
Scenario: A node gets stuck in the Joining (UJ) state and never transitions to Up Normal (UN).
Drain the Node
Run the nodetool drain command to stop the node from listening to client and peer connections.
nodetool drain
Stop the Node
sudo systemctl stop scylla-server
Clean the Data
sudo rm -rf /var/lib/scylla/data
sudo find /var/lib/scylla/commitlog -type f -delete
sudo find /var/lib/scylla/hints -type f -delete
sudo find /var/lib/scylla/view_hints -type f -delete
Restart the Node
sudo systemctl start scylla-server
Disk Space Utilization
Data Consistency
Run a full cluster repair before nodetool removenode to ensure data consistency across replicas.
Run the nodetool repair -full command on each node in the cluster to perform a full repair.
Repair Based Node Operations (RBNO)
If RBNO is enabled for removenode, re-running repairs after node failures is not required.
Avoid Misuse of Commands
Do not run nodetool removenode on a running node that is reachable by other nodes in the cluster. This can lead to data loss or inconsistency.
Adding a New Node to an Existing ScyllaDB Cluster (Scale Out)
Adding a node to a ScyllaDB cluster is known as bootstrapping, during which the new node receives data streamed from existing nodes. This process can be time-consuming depending on the data size and network bandwidth. In multi-availability zone deployments, ensure AZ balance is maintained.
nodetool status
Config | Command |
---|---|
Cluster Name | grep cluster_name /etc/scylla/scylla.yaml |
Seeds | grep seeds: /etc/scylla/scylla.yaml |
Endpoint Snitch | grep endpoint_snitch /etc/scylla/scylla.yaml |
ScyllaDB Version | scylla --version |
Authenticator | grep authenticator /etc/scylla/scylla.yaml |
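The same details can be collected in one pass on any existing node, for example:
# Gather the cluster settings needed before adding the new node
grep -E '^(cluster_name|endpoint_snitch|authenticator)' /etc/scylla/scylla.yaml
grep 'seeds:' /etc/scylla/scylla.yaml
scylla --version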
Examples:
# For Scylla Enterprise
sudo yum install scylla-enterprise-2018.1.9
# For Scylla Open Source
sudo yum install scylla-3.0.3
⚠️ Do not use a different version or patch release. This may break compatibility.
Do not start the node before completing configuration.
Edit /etc/scylla/scylla.yaml
:
Key | Description |
---|---|
cluster_name | Must match existing cluster name |
listen_address | IP of the new node |
rpc_address | IP for CQL client connections |
endpoint_snitch | Match with existing cluster setting |
seeds | Comma-separated IPs of existing seed nodes |
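An illustrative scylla.yaml fragment for the new node; all values are placeholders taken from the examples in this document, so copy the real values collected from the existing cluster, and note that the seeds list sits under seed_provider:
cluster_name: 'pssb-ds-sdc'
listen_address: 192.168.1.203
rpc_address: 192.168.1.203
endpoint_snitch: GossipingPropertyFileSnitch   # assumed value; must match the existing cluster
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.201,192.168.1.202"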
Start the Scylla service:
sudo systemctl start scylla-server
The node will join the cluster and begin bootstrapping. Check its status:
nodetool status
You should see the new node in UJ (Up/Joining) state:
UJ 192.168.1.203 ... Rack: B1
Wait until it changes to:
UN 192.168.1.203 ... Rack: B1
Once the new node is UN (Up Normal):
nodetool cleanup
⚠️ Cleanup removes old data that is now owned by the new node. It’s important to prevent data resurrection.
Step | Done |
---|---|
All existing nodes are up | ✅ |
Cluster details collected | ✅ |
ScyllaDB version matched | ✅ |
scylla.yaml configured properly | ✅ |
I/O scheduler files copied | ✅ |
Node started | ✅ |
Node reached UN state | ✅ |
Cleanup run on other nodes | ✅ |
Monitoring/Manager updated | ✅ |