Scylla Monitoring

Before starting to monitor the Scylla server, it is essential that the C3 team understands the scylla-server's role and significance. For background information, refer to the links provided below.

  • Introduction on Scylla-Server Link
  • Scylla Architecture Link
  • Scylla Components Link
  • Nodetool Utility Link
  • Grafana dashboards for Scylla link
  • Alerts link

Monitoring Panel Severity Matrix

Each panel below lists its dashboard and row number, panel name and description, the PromQL query with its description and operating range, the underlying metric(s) with description and operating range, and the severity thresholds (CRITICAL / WARNING / OK).

Dashboard & Row: 1.1 | Panel: Nodes
  Description: Number of nodes present in the cluster.
  Query: count(scylla_scylladb_current_version{cluster="pssb-ds-sdc"})
  Query Description: This metric is exposed by every Scylla node while Scylla is up and running; we count it per cluster name. Query operating range: 5.
  Metric: scylla_scylladb_current_version (shows the current version of Scylla; only the count matters, so the result is independent of the metric's values).
  Severity: CRITICAL != 5 | WARNING - | OK nodes == 5

Dashboard & Row: 1.1 | Panel: Inactive
  Description: The number of nodes that are up but not actively part of the cluster, either because they are still joining or because they are leaving.
  Query: 5 - count(scylla_node_operation_mode{cluster=~"pssb-ds-sdc"}==3)
  Query Description: Counts the nodes in normal mode in the cluster named pssb-ds-sdc and subtracts that count from the expected node count. Query operating range: 0.
  Metric: scylla_node_operation_mode (unknown=0, starting=1, joining=2, normal=3, leaving=4, decommissioned=5). Metric operating range: 3.
  Severity: CRITICAL != 0 | WARNING - | OK inactive nodes == 0

Dashboard & Row: 1.1 | Panel: Unreachable
  Description: The number of unreachable nodes, usually because a machine is down or unreachable.
  Query: (count(scrape_samples_scraped{job="scylla", cluster=~"pssb-ds-sdc"}==0) OR vector(0))
  Query Description: Checks whether Prometheus is failing to scrape metrics from any Scylla instance(s) in the pssb-ds-sdc cluster. Query operating range: 0.
  Metric: scrape_samples_scraped (the number of samples Prometheus scraped from each target). Metric operating range: scraped samples > 0.
  Severity: CRITICAL value != 0 | WARNING - | OK unreachable nodes == 0

Dashboard & Row: 1.1 | Panel: Split Brain Mode
  Description: Split-brain mode is when different parts of the cluster operate independently, unaware of each other.
  Query: sum(scylla_gossip_live{job="scylla",instance=~"pssb.*"})
  Query Description: Each node sees the other 4 nodes in the cluster, so the sum of all metric values is 20 (4 x 5 nodes). Query operating range: 20.
  Metric: scylla_gossip_live (how many other nodes this node sees as live). Metric operating range: 4.
  Severity: CRITICAL Yes | WARNING - | OK No

Dashboard & Row: 1.1 | Panel: Scylla Reactor Utilization
  Description: Shows the percentage of CPU time each reactor (core) spends handling Scylla tasks.
  Query: sum(scylla_reactor_utilization{cluster="pssb-ds-sdc"}) by (instance)
  Query Description: Shows how busy all CPU cores on each instance are with Scylla's tasks. Query operating range: 0.1 to (100 x number of CPUs).
  Metric: scylla_reactor_utilization (how busy each core is with Scylla's tasks). Metric operating range: 0.1 - 100 per core.
  Severity: CRITICAL > 20 per instance | WARNING 10 - 20 per instance | OK 0 - 10 per instance

Dashboard & Row: 1.2 | Panel: Scylla Reads for Instance per Second
  Description: Reads performed by each instance per second, based on the last 5 minutes.
  Query: sum(rate(scylla_database_total_reads{cluster="pssb-ds-sdc"}[5m])) by (instance)
  Query Description: Sums the reads performed by every shard on each instance. Query operating range: positive values.
  Metric: scylla_database_total_reads (total reads performed by the Scylla database so far). Metric operating range: positive values.
  Severity: CRITICAL > 30 reads/sec | WARNING 20 - 30 reads/sec | OK < 20 reads/sec

Dashboard & Row: 1.2 | Panel: Scylla Writes for Instance per Second
  Description: Writes performed by each instance per second, based on the last 5 minutes.
  Query: sum(rate(scylla_database_total_writes{cluster="pssb-ds-sdc"}[5m])) by (instance)
  Query Description: Sums the writes performed by every shard on a single instance. Query operating range: positive values.
  Metric: scylla_database_total_writes (total writes performed by the Scylla database so far). Metric operating range: positive values.
  Severity: CRITICAL > 5 writes/sec | WARNING 2 - 5 writes/sec | OK 0 - 1 writes/sec

Dashboard & Row: 1.2 | Panel: Scylla Memory Usage
  Description: Scylla memory usage percentage relative to the memory dedicated to Scylla.
  Query: 100 - (((sum(scylla_memory_free_memory{cluster="pssb-ds-sdc"}) by (instance)) / (sum(scylla_memory_allocated_memory{cluster="pssb-ds-sdc"}) by (instance))) * 100)
  Query Description: Scylla memory usage percentage based on Scylla's dedicated memory limit. Query operating range: positive values.
  Metrics: scylla_memory_free_memory, scylla_memory_allocated_memory (free and allocated memory for Scylla, in bytes). Metric operating range: positive values.
  Severity: CRITICAL > 95% | WARNING 90% - 95% | OK < 90%

Dashboard & Row: 2.1 | Panel: Scylla High Inserts per Second
  Description: Shows the number of inserts per second in the Scylla database.
  Query: sum(rate(scylla_cql_inserts{cluster="pssb-ds-sdc"}[5m])) by (instance)
  Query Description: Inserts performed per second on each instance. Query operating range: positive values.
  Metric: scylla_cql_inserts (number of inserts on each shard of the Scylla database). Metric operating range: positive values.
  Severity: CRITICAL > 1 | WARNING 0.5 - 1 | OK 0 - 0.5

Dashboard & Row: 2.1 | Panel: Scylla CQL Connections by Instance
  Description: Shows the number of CQL connections currently established.
  Query: sum(scylla_transport_current_connections{cluster="pssb-ds-sdc"}) by (instance)
  Query Description: Sums the CQL connections of every shard on each instance. Query operating range: positive values.
  Metric: scylla_transport_current_connections (number of CQL connections per shard right now). Metric operating range: 0 - 10 expected.
  Severity: CRITICAL > 10 | WARNING 8.1 - 10 | OK 0 - 8.0

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.

PromQL Query Analysis Guidelines:

For the referenced alerts, utilize the provided PromQL queries. Ensure you adjust the labels and time ranges in the queries to match the specific node or cluster you are collecting data for. Analyze the results thoroughly, comparing trends and patterns across relevant time periods. To draw accurate conclusions, experiment with additional custom PromQL queries to cross-validate findings and gather comprehensive data insights.
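
As a hedged illustration, the sketch below shows one way to run such a query ad hoc against the Prometheus HTTP API from a shell. The Prometheus address is a placeholder and must be replaced with the monitoring server used for this cluster; the query itself is the node-count query from the severity matrix above.

    # Placeholder Prometheus address; adjust to your monitoring setup.
    PROM=http://prometheus.example:9090

    # Count live Scylla nodes over the last 6 hours at 5-minute resolution.
    curl -sG "$PROM/api/v1/query_range" \
      --data-urlencode 'query=count(scylla_scylladb_current_version{cluster="pssb-ds-sdc"})' \
      --data-urlencode "start=$(date -d '6 hours ago' +%s)" \
      --data-urlencode "end=$(date +%s)" \
      --data-urlencode 'step=300'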

Scylla Node Down

Alertname: ScyllaNodeDown

A Scylla Node Down indicates that the node is not operational or unreachable in the cluster. This can be caused by hardware failure, network issues, misconfiguration, or a crash. The node’s unavailability could lead to degraded performance, reduced fault tolerance, and potential data inconsistency depending on the replication factor and cluster topology.

C3 Data Collection

  1. Node Information:
    Collect the following details about the affected node:

    • Node name and IP address
    • Time duration the alert has been in a firing state
  2. Cluster Status:

    To verify the cluster’s status and the affected node, log in to any node within the cluster and run the following command:

    nodetool status
    

    Example Output nodetool_status

    Example Explanation for UN:

    • U: The node is up.
    • N: The node is in a normal state.

    The second and third lines of the nodetool status output (the Status and State legend) explain the first column shown in the picture above.

    Status Explanation

    Status Code Description
    U The node is up.
    D The node is down.

    State Explanation

    State Code Description
    N The node is in a normal state.
    L The node is leaving the cluster.
    J The node is joining the cluster.
    M The node is moving within the cluster.

    If any node in the Scylla cluster is found in a state other than UN, note down that node's details, log in to the node, and perform the following operations, starting with checking the Scylla service status.

  3. Scylla Service Status:

    • Check if the Scylla service is running:
      systemctl status scylla-server  
      
    • If the service is inactive, inspect the journalctl logs:
      journalctl -xeu scylla-server
      
      Then review the Scylla logs for the past hour:
      sudo journalctl -u scylla-server --since "1 hour ago" --no-pager
      
  4. Handling Inactive or Failed Scylla Service

    When the Scylla service is found to be inactive or in a failed state, follow these steps to investigate and resolve the issue:

    If the Scylla service is inactive:

    • If the service shows inactive (dead), the node may simply have been rebooted and the service needs to be restarted. Check the system uptime to confirm whether the node was rebooted (see the reboot-check sketch at the end of this section).
      • At this point, it is important to check whether the system was rebooted unexpectedly. You can check the system uptime with:
        $ uptime
        16:49:45 up 13 days, 22:45,  7 users,  load average: 0.99, 0.62, 0.52
        
      • If the system uptime is unexpectedly low (indicating a recent reboot), investigate why the system might have rebooted. This could be due to hardware issues, kernel panics, or resource exhaustion (e.g., out of memory or disk space).

    If the Scylla service is in Failed State:

    • Check the logs for failure details using the journalctl commands shown in the steps above and try to identify the reason for the service failure.
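
To support the reboot check above, here is a minimal sketch (assuming systemd journald and the wtmp log are available on the node) for confirming whether and why the node rebooted:

    # Recent reboots recorded in wtmp
    last reboot | head -n 5

    # Boots known to journald; the current boot has offset 0
    journalctl --list-boots

    # Tail of the previous boot's log, filtered for panics and OOM kills
    journalctl -b -1 -n 200 --no-pager | grep -iE 'panic|out of memory|oom'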

Dependent Metrics

When troubleshooting a Scylla Node Down alert, the following affected metrics can help identify the root cause. These metrics focus on system resources, Scylla-specific performance, and hardware/network issues:

1. System Resource Metrics

  • CPU Usage: High CPU usage can cause Scylla to become unresponsive or crash.
    • Check with the uptime command and the node_cpu_seconds_total metric.
  • Memory Usage: Excessive memory or CPU usage may lead to Scylla service crashes.
    • Metrics to monitor: node_memory_MemAvailable_bytes, node_cpu_seconds_total
    • Check the node where Scylla is in a failed or inactive state in the Grafana dashboard Link
  • Disk Usage: Running out of disk space or encountering disk I/O issues can lead to node failure.
    • Look at the Disk Space Used Basic Grafana Dashboard Panel in OS Folder. Link
    • Ensure all partitions are mounted successfully; any failed mount point can lead to Scylla server malfunction or failure. Run the following command to verify that all fstab mount points mount without errors (a quick resource spot-check sketch follows this section):
      mount -a
      
  2. Cluster Impact:

    A down node can seriously affect the cluster in the following ways:

    • It can degrade performance, especially if the node holds critical data or handles a significant portion of the workload.
    • It may reduce the fault tolerance of the cluster, causing issues with data replication or availability, depending on the cluster’s replication factor and topology.
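
To spot-check the system resources listed above directly on the node, a minimal shell sketch (the data path below is the default Scylla location and may differ in your installation):

    uptime                      # load averages and time since boot
    free -m                     # available memory in MiB
    df -h /var/lib/scylla       # disk space on the Scylla data directory (assumed default path)
    iostat -x 1 3               # per-device I/O utilization, three one-second samples (needs sysstat)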

C3 Remedy

When troubleshooting a ScyllaNodeDown alert, follow these steps based on the situation:

Proceed with the following steps only when a single scylla-server is down; otherwise, inform the DevOps team and share the collected data.

  1. If Only One System Has Recently Rebooted:

    • If only one node has rebooted recently and the system is back up, try restarting the Scylla service on that node:
      sudo systemctl restart scylla-server
      
  2. If More Than One Node Got Rebooted:

    • If multiple nodes have rebooted, do not restart Scylla. Treat this as a critical issue and immediately inform the DevOps team for further investigation.
  3. If the Error is Related to Memory, Disk, or CPU:

    • If system resources such as memory, disk, or CPU are the cause of the issue, wait until all resources are restored to normal conditions and then attempt to restart the scylla-server.
  4. If Failing to Achieve Quorum Level:

    • If the system is unable to achieve the quorum level (and Scylla is still running), inform the DevOps team. This may indicate a more serious underlying issue with the cluster’s health or replication setup.

DevOps Remedy

When handling a ScyllaNodeDown alert, the DevOps team should follow these steps to ensure proper resolution:

  1. If Scylla is in an Inactive or Failed State Due to CPU, Memory, or Disk Issues:

    • Analyze the Root Cause:
      • Check system metrics (CPU, memory, disk usage) and identify the cause of the resource exhaustion or failure.
      • Ensure that:
        • CPU usage has returned to normal levels.
        • Sufficient memory is available.
        • Disk space and I/O are no longer an issue.
    • Restart the Scylla Server:
      • After addressing all resource-related issues, restart the Scylla service:
        sudo systemctl restart scylla-server  
        
  2. If the Cluster is Failing to Achieve Quorum Level:

    • Review Cluster Configuration:
      • Ensure that all nodes in the cluster are properly configured and can communicate with each other.
    • Verify Cluster Status:
      • Use the nodetool status command to identify nodes that are down or in an abnormal state.
        nodetool status  
        
    • Reform the Scylla Cluster:
      • If the quorum level cannot be achieved, it may be necessary to re-add or repair nodes to restore the cluster to a healthy state. After the Scylla cluster is rebuilt, run the command below.
      • Use Scylla commands to re-establish the cluster:
        nodetool repair  
        

Scylla Nodes Inactive

Alertname: ScyllaNodesInactive

  • Scylla Node Joining Mode Alert URL: Link
  • Scylla Node Leaving Mode Alert URL: Link
  • Grafana Dashboard URL: Link
  • Inactive Panel URL: Link

A ScyllaNodesInactive alert indicates that one or more nodes in the Scylla cluster are in a state where they are up but are not actively participating in the cluster. This status is often observed during cluster transitions, such as when nodes are:

  • Joining the cluster (e.g., newly added nodes still synchronizing data).
  • Leaving the cluster (e.g., nodes being decommissioned or removed).

Inactive nodes can cause temporary disruptions in cluster performance, replication, and fault tolerance, depending on the cluster topology and data distribution.

C3 Data Collection

  1. Node Information:
    Collect the following details about the affected nodes:

    • Node name and IP address
    • Duration of inactivity
  2. Cluster Status:

    To identify nodes that are inactive, log in to any active node in the cluster and run the following command:

    nodetool status  
    

    Refer to the first alert (ScyllaNodeDown) to understand the output of nodetool status.

  3. Service Status:
    Verify the Scylla service on the inactive node to ensure it is running:

    systemctl status scylla-server  
    

    If the service is active, check the logs to determine the progress of the node’s transition (e.g., joining, leaving). Use the following command to check recent logs:

    sudo journalctl -u scylla-server --since "1 hour ago" --no-pager
    

Dependent Metrics

The following metrics can help identify the cause of the inactive nodes:

When a Scylla node is in joining or leaving mode in a Scylla cluster, it is undergoing a transitional state where its membership in the cluster is being updated. During these states, it is essential to monitor specific dependent metrics to ensure cluster health, consistency, and performance. Here’s a detailed list of key metrics to track during these modes:

Joining or Leaving Mode:

In this state, the node is being added to the cluster and is receiving data through streaming.

  1. Streaming Metrics:

    • Streaming progress:
      • scylla_streaming_total_incoming_bytes and scylla_streaming_total_outgoing_bytes: Tracks the total amount of data to be streamed to/from the joining node.
      • scylla_streaming_finished_percentage: Tracks the percentage of data already streamed.
  2. Node Status:

    • nodetool status: Ensures the node is correctly listed as J (Joining). A node that is in leaving mode shows the state L.
  3. Streaming Progress:

    Use nodetool netstats to monitor the progress of streaming operations (data synchronization between nodes). Look for the completion of repair operations initiated during node joining and make sure no messages remain in a pending state.

    nodetool netstats
    

    Example Output

        Mode: NORMAL
        Not sending any streams.
        Read Repair Statistics:
        Attempted: 17107
        Mismatch (Blocking): 7
        Mismatch (Background): 13
        Pool Name                    Active   Pending      Completed   Dropped
        Large messages                  n/a         0       25062982         0
        Small messages                  n/a         0       29354568         0
        Gossip messages                 n/a         0              0         0
    
    • Data Synchronization Logs: Analyze Scylla logs to track repair and streaming activities using:
      journalctl -u scylla-server --since "1 hour ago" --no-pager
      
    • Look at the Grafana Panels related
      • Inactive Dashboard Panel Link

C3 Remedy

When a ScyllaNodesInactive alert has been firing for a long time, follow these steps:

  1. If the Node is Joining the Cluster:

    • Use nodetool netstats to monitor the progress of data synchronization.
    • Check system metrics (CPU, memory, disk) and ensure they are within acceptable limits.
    • Allow time for the node to finish joining if resources are adequate.
  2. If the Node is Leaving the Cluster:

    • Confirm the node is in the process of being decommissioned.
    • Verify the decommission progress with:
      nodetool decommission status  
      
    • Ensure the node has sufficient resources to complete the process.
  3. If System Resource Issues are Detected:

    • Address high memory, CPU, or disk usage issues.
    • Once resources are restored to normal, allow the process to continue.
  4. If the Node Remains Inactive Beyond Expected Time:

    • Look at the scylla-server logs and watch the repair status of each table. Restart the Scylla service if system resources are normal and the node shows no activity.
      sudo systemctl restart scylla-server  
      
    • If the issue persists, inform the DevOps team for further action.

DevOps Remedy

  1. If System Resource Bottlenecks Cause Inactivity:

    • Analyze and resolve CPU, memory, or disk issues on the affected node.
  2. If the Cluster Configuration is Affected:

    • Check for consistency in the cluster using nodetool status.
    • Re-build the cluster if nodes fail to rejoin after resource stabilization.
  3. If Node Cannot Rejoin Automatically:

    • Manually bootstrap or re-add the node to the cluster.
    • Verify data synchronization post-recovery using:
      nodetool netstats  
      

Scylla Split Brain Mode

Alertname: ScyllaSplitBrainMode

A Scylla Split Brain Mode occurs when two or more subsets of nodes in the cluster form separate groups, each thinking they are the entire cluster. This situation can cause conflicting writes between subsets, an inability to achieve quorum or correct partitioning, and increased latency and degraded throughput.

C3 Data Collection

  1. Cluster Status:
    Use the following command to check the overall cluster state:

    nodetool status
    

    Look for nodes that are unreachable (D state) or nodes forming separate groups (nodes not appearing as part of the same cluster).

  2. Ring View Comparison:

    • Run nodetool ring on different nodes to confirm if there is a mismatch in the cluster’s ring structure.
    • Check for discrepancies in the token allocation or missing nodes.

    Run this command on every node of the Scylla cluster:

    nodetool ring
    

    Example Output:
    Node 1 (172.21.0.61):

    Address       DC          Rack        Status    State     Load       Owns    Token
    pssb1avm001      pssb-ds-sdc rack1       Up        Normal    120 GB     33.33%  0
    pssb1avm002     pssb-ds-sdc rack1       Down      Leaving   130 GB     33.33%  28035242
    

    Node 2 (172.21.0.62):

    Address       DC          Rack        Status    State     Load       Owns    Token
    pssb1avm001      pssb-ds-sdc rack1       Up        Normal    120 GB     33.33%  0
    pssb1avm002      pssb-ds-sdc rack1       Up        Joining   130 GB     33.33%  43345242
    

    In the two outputs above, the pssb1avm002 node appears as Down and Leaving in Node 1's view but Up and Joining in Node 2's view. Explanation:

    • The mismatched State (Leaving on one node, Joining on another) indicates potential split-brain behavior.
  3. Gossip State:
    Inspect gossip state to identify communication issues between nodes:

    nodetool gossipinfo
    
    nodetool statusgossip
    

    Example Output:

    /10.0.0.1
    generation:1617123457
    heartbeat:1081
    STATUS:16:NORMAL,-1
    LOAD:120.45
    SCHEMA:84:a3a2e354-8a5c-39b4-809c-2125dfb8f123
    DC:datacenter1
    RACK:rack1
    
    /10.0.0.2
    generation:1617123456
    heartbeat:2
    STATUS:16:LEAVING,-1
    LOAD:0.00
    SCHEMA:84:a3a2e354-8a5c-39b4-809c-2125dfb8f123
    DC:datacenter1
    RACK:rack1
    

    Look for discrepancies such as nodes in DOWN state or inconsistent HOST_ID.

    In this example, Node 2 has a much lower heartbeat and is in a LEAVING state, indicating it might be part of a split cluster.

  4. System Logs:
    Review Scylla logs for errors related to network partitions or split-brain scenarios:

    journalctl -u scylla-server --since "1 hour ago" --no-pager
    

Dependent Metrics

Metrics to Monitor:

  • scylla_gossip_live: Represents the number of nodes that a specific Scylla node recognizes as live in the cluster.
    For instance, in a five-node cluster, each node should report 4 other nodes as live. Any deviation indicates potential cluster connectivity issues.

  • scylla_gossip_unreachable: Tracks the number of nodes that are unreachable by the given Scylla node.
    The expected value for this metric should always be 0. Non-zero values may indicate network issues or node failures.

For detailed visualization and analysis, refer to the Grafana dashboard panel: Scylla Gossip Metrics.
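
As a hedged illustration, both gossip metrics can be pulled straight from Prometheus; the Prometheus address below is a placeholder:

    PROM=http://prometheus.example:9090   # placeholder, adjust to your setup

    # Per-node count of live peers; in a healthy 5-node cluster every instance should report 4
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=scylla_gossip_live{job="scylla",instance=~"pssb.*"}'

    # Any non-zero value here points at unreachable peers
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=scylla_gossip_unreachable{job="scylla",instance=~"pssb.*"}'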

C3 Remedy

  1. Verify Node Network Connectivity:
    Check inter-node network reachability using ping or telnet.

    ping <node_ip>
    telnet <node_ip> 9042
    

    Make sure all the ports listed below are allowed through the firewall on the partitioned node (see the port-check sketch after this list):

    9042/tcp  
    9142/tcp  
    7000/tcp  
    7001/tcp  
    7199/tcp  
    10000/tcp  
    9180/tcp  
    9100/tcp  
    9160/tcp  
    19042/tcp  
    19142/tcp  
    
    • Resolve any network partition or firewall issues.
  2. Restart Gossip Service:
    If a node is unable to synchronize its gossip state, restart the gossip service on that node:

    nodetool disablegossip
    nodetool enablegossip
    
  3. Validate Node Configuration:
    Ensure all nodes are using the same cluster_name and seed_node configuration in scylla.yaml.

  4. Monitor Cluster Quorum:
    Use Grafana or monitoring tools to observe quorum consistency.

  5. Node Rejoin:
    For nodes isolated due to split-brain, rejoin them using:

    nodetool repair
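
Referring to step 1 above, a minimal port-check sketch, assuming a netcat variant that supports -z is installed; the peer IP is a placeholder taken from the examples in this document:

    peer=172.21.0.62   # placeholder, replace with the node you suspect is partitioned
    for port in 9042 9142 7000 7001 7199 10000 9180 9100 9160 19042 19142; do
      nc -z -w 3 "$peer" "$port" && echo "port $port open" || echo "port $port BLOCKED"
    done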
    

DevOps Remedy

  1. Resolve Network Partition:

    • Diagnose and fix the root cause of the split-brain, such as misconfigured network settings or faulty switches.
  2. Reform the Cluster:

    • Identify which subset of nodes holds the correct data.
    • Use nodetool repair as needed to stabilize the cluster.
    • Use nodetool removenode <host-id> when the node is in the DN state to restore cluster stability, then rejoin that node to the cluster afterwards.
  3. Cluster-wide Configuration Check:

    • Confirm that all nodes are configured with the same cluster settings, including:
      • cluster_name.
      • seed_node.
  4. If the error still persists, try rebuilding the cluster again without deleting any data.



Scylla High Reactor Utilization

Alert Name: ScyllaHighReactorUtilization

Overview:

The Scylla High Reactor Utilization alert indicates excessive reactor thread utilization in one or more ScyllaDB nodes. Reactor threads are critical for processing requests and performing I/O operations, and high utilization can lead to degraded performance, increased latencies, and a potential bottleneck in the node’s operations. This typically occurs due to unbalanced workloads, high traffic routed to a single node, or resource constraints such as CPU or memory.

C3 Data Collection

To troubleshoot the alert, perform the following steps:

  1. Node Status: Run the following command to check the load on the node:

    nodetool status
    

    Ensure that the load is manageable compared to other nodes in the cluster. High load on this node might explain elevated reactor utilization.

  2. Reactor Utilization Metric: Check the scylla_reactor_utilization metric in Prometheus or Grafana. This metric shows per-CPU reactor utilization as a percentage (0 to 100 per core). Analyze recent changes to identify patterns or spikes (see the query sketch after this list).

  3. Scylla Dashboards: Review the Scylla Dashboards folder for any major changes in key panels, such as CPU usage, latency, or throughput. Dashboard Link

  4. CPU Load: Verify the CPU load on the affected node:

    uptime
    
     top
    

    High CPU utilization can exacerbate reactor thread bottlenecks.

  5. Pending Reactor Tasks: Check the scylla_reactor_tasks_pending metric in Prometheus. Increased pending tasks can indicate a backlog of work for the reactor threads, potentially caused by high traffic or slow query processing.

  6. Scylla Connections: Investigate if all incoming connections are routed to this specific node. High connection concentration can cause elevated reactor utilization. Check connection distribution and rebalance traffic if necessary.
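
For steps 2 and 5 above, a hedged sketch of pulling the reactor metrics per instance from Prometheus (the Prometheus address is a placeholder):

    PROM=http://prometheus.example:9090   # placeholder, adjust to your setup

    # Reactor utilization per instance, averaged across shards
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=avg(scylla_reactor_utilization{cluster="pssb-ds-sdc"}) by (instance)'

    # Pending reactor tasks per instance
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=sum(scylla_reactor_tasks_pending{cluster="pssb-ds-sdc"}) by (instance)'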

Dependent Metrics

Monitor the following metrics to identify the root cause of the high reactor utilization:

  1. Pending Tasks: scylla_reactor_tasks_pending A rise in this metric indicates a backlog of work for the reactor threads.

  2. Reactor Utilization: scylla_reactor_utilization Shows reactor thread usage per CPU. Higher values indicate increased utilization.

  3. Reactor Stalls: rate(scylla_reactor_stalls_count{instance="$instance"}[5m]) Tracks stalls in reactor operations. A rising trend in the graph suggests delays in processing.

  4. Threaded I/O Fallbacks: rate(scylla_reactor_io_threaded_fallbacks{instance="$instance"}[5m]) Tracks instances where the reactor falls back to threaded I/O. Increased values suggest the reactor cannot handle I/O efficiently.

  5. Dashboard Panel: Refer to the panel in the Grafana dashboard to visualize these metrics and correlate them with the observed issue.
    Dashboard Panel Link

C3 Remedy

  1. Monitor CPU Load:
    Observe if the CPU load starts stabilizing or decreasing naturally over time. If the node remains operational and queries are being processed, the system might self-recover.

  2. Check Request Processing:
    Ensure that incoming requests are being processed without significant delays or errors.

  3. Escalate if Needed:
    If the reactor utilization continues to rise, latency values increase, and the node becomes difficult to operate, treat the issue as critical and inform the DevOps team immediately.

DevOps Remedy

  1. Resource Scaling: If the system is underprovisioned, add resources such as additional CPU cores or scale the cluster horizontally by adding more nodes.

  2. Load Balancing: Review and adjust traffic distribution across the cluster. Ensure no single node is receiving disproportionate traffic, which could lead to elevated reactor utilization.

  3. Restart Scylla Service: If the issue persists after balancing and scaling, consider restarting the Scylla service to clear any transient states:

    sudo systemctl restart scylla-server
    



Scylla High Writes

Overview

The ScyllaHighWrites alert is triggered when a node experiences an unusually high volume of write requests. This condition can result in increased latencies, pressure on commit logs, unbalanced load distribution, and potential node instability. Resolving this alert ensures consistent cluster performance and prevents write-related failures.

C3 Data Collection

To identify and troubleshoot the cause of the high write load, perform the following steps:

  1. Verify Node Load with nodetool status:

    • Run nodetool status to check the load distribution across nodes.
    • A higher load on the affected node indicates a disproportionate number of write requests being directed to it.
  2. Monitor Write Latency: Use the metric scylla_storage_proxy_coordinator_write_latency in Grafana or Prometheus. High latency suggests that the node is struggling to handle the volume of write requests.

  3. Analyze Write Attempts:

    • Here, "remote node" means the other nodes in the cluster, while the "local node" is the node that receives the request. Use the following PromQL queries to monitor write attempts on the local and remote nodes: Local Write Attempts:
      sum(rate(scylla_storage_proxy_coordinator_total_write_attempts_local_node{instance="psorbit-node01"}[5m])) by (instance)
      
      Remote Write Attempts:
      sum(rate(scylla_storage_proxy_coordinator_total_write_attempts_remote_node{instance="psorbit-node01"}[5m])) by (instance)
      
  4. Check Database Write Volume:

    • Monitor the metric:
      sum(rate(scylla_database_total_writes{cluster="pssb-ds-sdc"}[5m])) by (instance)
      

    An increase in this value confirms a spike in write requests across the database.

  5. Review Grafana Dashboard: Access the relevant panels to examine write-related metrics and trends for anomalies.
    Grafana Dashboard Link

Dependent Metrics

The following metrics provide deeper insights into the issue and help correlate the root cause:

  1. Write Latency: scylla_storage_proxy_coordinator_write_latency: A high value indicates write performance degradation.

  2. Local and Remote Write Attempts: scylla_storage_proxy_coordinator_total_write_attempts_local_node: Tracks local write operations.
    scylla_storage_proxy_coordinator_total_write_attempts_remote_node: Tracks remote write operations.

  3. Commit Log Flush Rate: rate(scylla_commitlog_flush{instance="pssb1avm001"}[1m]): Increased flush rates suggest high write activity and pressure on commit logs.

C3 Remedy

The C3 team has limited direct involvement with this alert; communicate with the DevOps team if it becomes critical.

  1. Wait for Stabilization: If the write load is due to a temporary surge (e.g., batch processing), monitor the system to ensure it stabilizes over time.

  2. Observe Load Distribution: Check other nodes to confirm that write requests are evenly distributed across the cluster.

  3. Escalate if Critical: If the affected node becomes unresponsive or unmanageable, escalate the issue to the DevOps team immediately. Treat this as a critical priority.

DevOps Remedy

  1. Increase Resources: Allocate additional resources (CPU, memory, or disk capacity) to the affected nodes to handle the increased write workload.

  2. Investigate Root Cause: Analyze the source of the high write load, such as unbalanced traffic, unoptimized workloads, or application anomalies. Address these issues to prevent future occurrences.

  3. Manage Token Distribution: Review and adjust token assignments to ensure an even distribution of write loads across the cluster.

  4. Update Configuration: If necessary, tune ScyllaDB configurations to handle a specific number of writes more efficiently (e.g., commitlog thresholds or write batch limits).


HighScyllaReads

Overview

The HighScyllaReads alert indicates a node or cluster is experiencing an unusually high volume of read requests. This condition can lead to increased latency, elevated resource utilization, and potential performance degradation if not managed promptly. Properly diagnosing and addressing this alert ensures consistent query performance and balanced cluster operation.

C3 Data Collection

Follow these steps to investigate and gather data related to the high read activity:

  1. Check Node Load with nodetool status:
    Run nodetool status to inspect the load distribution across nodes.
    A disproportionately high load on a specific node indicates uneven read traffic.

  2. Monitor Read Latency:
    Use the metric scylla_storage_proxy_coordinator_read_latency in Grafana or Prometheus.
    High latency suggests the node is struggling to process read requests efficiently.

  3. Analyze Read Operations:

    • Use the following PromQL queries to monitor read attempts at local and remote nodes:
      • Local Read Attempts:
        sum(rate(scylla_storage_proxy_coordinator_total_read_attempts_local_node{instance="psorbit-node01"}[5m])) by (instance)
        
      • Remote Read Attempts:
        sum(rate(scylla_storage_proxy_coordinator_total_read_attempts_remote_node{instance="psorbit-node01"}[5m])) by (instance)
        
  4. Evaluate Read Volume:

    • Use the metric:
      sum(rate(scylla_database_total_reads{cluster="pssb-ds-sdc"}[5m])) by (instance)
      
    • An increase in this value confirms a spike in read requests across the cluster.
  5. Review Grafana Dashboard:

    • Check the relevant panels to investigate read-related metrics and patterns for irregularities.
      Grafana Dashboard Link

Dependent Metrics

Key metrics to monitor for correlating and diagnosing the issue include:

  1. Read Latency:

    • scylla_storage_proxy_coordinator_read_latency: Elevated values signal a delay in processing read requests.
  2. Local and Remote Read Attempts:

    • scylla_storage_proxy_coordinator_total_read_attempts_local_node: Measures local read operations.
    • scylla_storage_proxy_coordinator_total_read_attempts_remote_node: Tracks remote read operations.
  3. Database Read Volume:

    • rate(scylla_database_total_reads{instance="pssb1avm001"}[1m]): Indicates the total read volume processed by the node.
  4. Cache Hits:

    • scylla_cache_hits and scylla_cache_misses: Evaluate cache efficiency. A low cache hit rate may increase read latency.
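
A hedged sketch for estimating the cache hit rate mentioned above from the two cache counters (the Prometheus address is a placeholder; a ratio close to 1 means most reads are served from cache):

    PROM=http://prometheus.example:9090   # placeholder, adjust to your setup
    q='sum(rate(scylla_cache_hits{cluster="pssb-ds-sdc"}[5m])) by (instance)
       / (sum(rate(scylla_cache_hits{cluster="pssb-ds-sdc"}[5m])) by (instance)
          + sum(rate(scylla_cache_misses{cluster="pssb-ds-sdc"}[5m])) by (instance))'
    curl -sG "$PROM/api/v1/query" --data-urlencode "query=$q"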

C3 Remedy

  1. Monitor and Wait for Stabilization:

    • If the high read activity is due to temporary operations (e.g., batch queries or analytics), monitor the system for stabilization over time.
  2. Review Load Distribution:

    • Ensure that read traffic is evenly distributed across the cluster.
  3. Escalate if Critical:

    • If the affected node becomes unresponsive or performance degrades severely, escalate the issue to the DevOps team immediately.

DevOps Remedy

  1. Increase Node Resources:

    • Add more resources (CPU, memory, or disk IOPS) to the affected node to better handle the read workload.
  2. Review Cache Efficiency:

    • Optimize ScyllaDB cache settings to improve read performance. Ensure adequate memory is allocated for caching frequently accessed data.
  3. Balance Read Traffic:

    • Verify and adjust token assignments to evenly distribute the read load across nodes.
  4. Scale the Cluster:

    • If the system cannot handle the sustained read load, consider horizontal scaling by adding more nodes to the cluster.
  5. Restart Scylla Service if Necessary:

    • If the node remains unstable despite other remedies, restart the Scylla service:
      sudo systemctl restart scylla-server
      



Scylla High Memory Usage

The ScyllaHighMemory alert is triggered when ScyllaDB’s memory usage exceeds normal thresholds, potentially leading to degraded performance or system instability. High memory usage can stem from memory-intensive operations like handling large datasets, compactions, or excessive read/write workloads. Addressing this alert promptly ensures smooth cluster operations and prevents critical failures.

C3 Data Collection

To investigate the ScyllaHighMemory alert, perform the following steps:

  1. Check Read/Write Requests on the Node:

    • Analyze if high memory usage correlates with heavy read/write workloads. Use nodetool status to verify the node’s load and traffic distribution.
  2. Monitor Memory-Intensive Operations:

    • Identify if ongoing compaction processes are contributing to memory contention. Use the following command to check recent compaction logs:
      journalctl -u scylla-server --since "1 hour ago" --no-pager | grep 'compaction'
      
  3. Review Key Metrics:

    • See the latency and memory metrics listed under Dependent Metrics below.

  4. Monitor Swap Usage:

    • If memory exceeds available RAM, the operating system may swap memory to disk. Check the swap usage in the Linux OS Grafana panel and directly on the node (see the sketch below).
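
To support step 4 above, a minimal sketch for checking swap usage directly on the node:

    free -h            # the "Swap:" row shows total, used, and free swap
    swapon --show      # lists active swap devices and their usage
    vmstat 1 5         # sustained non-zero si/so columns indicate active swapping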

Dependent Metrics

Review the following metrics to understand the underlying cause of high memory usage

Scylla Metrics Documentation Link

  1. Latency Metrics:

    • Write latency Metrics:
      • scylla_storage_proxy_coordinator_write_latency_summary
      • wlatencyp99
      • wlatencya
    • Read latency Metrics:
      • scylla_storage_proxy_coordinator_read_latency_summary
      • rlatencya
      • rlatency995
  2. Scylla Memory Metrics to Monitor:

    • scylla_memory_allocated_memory
    • scylla_memory_free_memory
    • scylla_lsa_memory_allocated
    • scylla_lsa_memory_freed
  3. Reactor Stalls:

    • Monitor scylla_reactor_stalls_count for increasing trends, indicating memory pressure affecting Scylla’s event loop.
  4. Kernel Interventions:

    • If memory usage grows beyond system limits, the kernel might terminate ScyllaDB or other processes to free up memory.

C3 Remedy

For addressing high memory usage as part of C3-level actions:

  1. Increase Available RAM:

    • Add more RAM to the affected node if memory pressure is a recurring issue. In some cases, RAM can be increased dynamically without downtime.
  2. Monitor and Stabilize:

    • Continuously monitor the system for stabilization. Ensure that the load is distributed evenly across all nodes.
    • If the node does not recover from this issue, contact the DevOps team immediately to resolve it.

DevOps Remedy

For persistent memory issues requiring deeper intervention, DevOps can take the following actions:

  1. Update Scylla Configuration:

    • Optimize the ScyllaDB configuration to handle high memory usage efficiently. Below are known examples of settings to update in the Scylla config file:
      commitlog_total_space_in_mb: 8192
      compaction_strategy: IncrementalCompactionStrategy
      
  2. Analyze and Distribute Load:

    • Ensure uniform traffic distribution across all nodes in the cluster. Adjust token allocation and workload balancing strategies as needed.
    num_tokens: 256
    
  3. Horizontal Scaling:

    • Add additional nodes to the cluster to distribute load and reduce memory pressure on individual nodes.
  4. Restart Scylla Services:

    • If memory usage remains persistently high despite optimization, consider restarting the Scylla service as a last resort:
      sudo systemctl restart scylla-server
      

Scylla CQL Connections Out of Range

Alert Name: ScyllaHighCQLConnections

Alert Name: ScyllaLowCQLConnections

Description:
This covers two alerts that fire when the number of CQL connections goes out of the expected range. They trigger when the number of CQL connections in a Scylla cluster moves outside the expected threshold, indicating a potential overload or misconfiguration of the system. A high number of CQL connections can lead to resource exhaustion (e.g., CPU, memory), increased latency, and degraded performance due to network or disk bottlenecks.

C3 Data Collection

  1. Rate of CQL Requests:

    • Check the rate of CQL requests to monitor connection activity:
      rate(scylla_transport_cql_requests_count{cluster="pssb-ds-sdc"}[1m])
      
  2. System-Level Metrics for Connection Analysis:

    • Active Connections: Use system tools to check open connections to ScyllaDB’s CQL port (default is 9042):

      netstat -an | grep :9042 | wc -l
      
    • Core Resource Metrics:

      • Network Activity: Tools like iftop or tcpdump can be used to monitor traffic volume on port 9042 to ensure there is no congestion or unusually high traffic.

      • Client Misbehavior: Identify clients making excessive connections by checking connection attempts from each client using netstat:

        netstat -an | grep :9042 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr
        

        Example Output

              4 172.21.0.63
              2 172.21.0.65
              2 172.21.0.64
              2 172.21.0.62
              2 172.21.0.61
              1 172.21.0.1
        
  3. Journal Logs:

    • To gather further insights into CQL-related activities in the past hour:
      journalctl -u scylla-server --since "1 hour ago" | grep CQL
      

Dependent Metrics

Monitor the following metrics to diagnose potential issues with high CQL connections:

  • scylla_transport_cql_requests_count: The total count of CQL requests, which helps in identifying increased connection activity.

  • scylla_transport_cql_connections: The current number of active CQL connections in the cluster.

  • scylla_reactor_stalls_count: The number of reactor stalls, which may indicate resource contention or high connection load.

  • scylla_reactor_utilization: The percentage of reactor thread usage; high values suggest the system may be overwhelmed.

  • Load Imbalance: Some nodes show signs of high resource usage (e.g., CPU, memory), while others appear idle or underutilized. High CQL connection volumes may lead to uneven load distribution across Scylla nodes. If clients are not evenly distributing their connections, some nodes could experience high CPU/memory usage, while others remain underutilized. This imbalance can lead to resource contention and degrade cluster performance.
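
As a hedged illustration, the per-instance connection counts can be compared across the cluster using the connection metric from the severity matrix at the top of this document (the Prometheus address is a placeholder):

    PROM=http://prometheus.example:9090   # placeholder, adjust to your setup
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=sum(scylla_transport_current_connections{cluster="pssb-ds-sdc"}) by (instance)'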

C3 Remedy

As a C3 team member, your role for this alert is primarily observational, with minimal direct action. Follow these guidelines to address the situation.

  1. Evaluate the Cluster’s Connection Status:

    • Check if the alert pertains to a single node or multiple nodes.
    • If all nodes are experiencing issues with incoming connections, escalate the matter immediately to the DevOps team without attempting any direct actions.
  2. Assess Node-Specific Connection Issues:

    • If only one node is not receiving connections while others are functioning normally:
      • Verify the node’s status to ensure it is not overloaded or unreachable.
      • Restart the ScyllaDB service on the affected node to attempt a resolution:
        sudo systemctl restart scylla-server
        
      • Observe the node post-restart to confirm if connections are restored.
  3. Avoid Direct Configuration Changes:

    • Do not modify connection-related settings or perform network troubleshooting directly. Such tasks should be handled by the DevOps team to avoid unintended consequences.

DevOps Remedy

  1. If All Nodes in the Cluster Are Not Receiving CQL Connections
  • Restart One Node in the Cluster:
    • Restart a single ScyllaDB node to test if the issue resolves and connections resume:

      sudo systemctl restart scylla-server
      
    • Monitor the node after the restart. If connections are restored, proceed to restart the other nodes in the cluster, one at a time, ensuring the cluster remains consistent and operational.

    • Cluster Validation:

      • After restarting all nodes, verify that the cluster is fully operational using:
        nodetool status
        
        Ensure all nodes are marked as Up (U) and Normal (N), and check for any inconsistencies using:
        nodetool repair
        
        Example output of nodetool repair
        [2024-11-27 18:29:16,603] Starting repair command #1, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
         [2024-11-27 18:29:17,697] Repair session 1 
         [2024-11-27 18:29:17,697] Repair session 1 finished
         [2024-11-27 18:29:17,717] Starting repair command #2, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
         [2024-11-27 18:29:17,819] Repair session 2 
         [2024-11-27 18:29:17,820] Repair session 2 finished
         [2024-11-27 18:29:17,848] Starting repair command #3, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
         [2024-11-27 18:29:18,950] Repair session 3 
         [2024-11-27 18:29:18,950] Repair session 3 finished
        
  2. If All Nodes in the Cluster Are Receiving High CQL Connections
    • Check Tomcat and HAProxy logs to identify excessive traffic or uneven load distribution, and adjust configurations if needed. Rebalance the load if necessary by updating the HAProxy configuration to distribute connections evenly across the cluster nodes.

    • Examine Connection Behavior:

      • Identify persistent or idle connections:
        netstat -an | grep 9042
        
      • If many idle connections persist, consider adjusting timeouts in both HAProxy and Tomcat.

ScyllaHighInserts

Alert Name: Scylla High Inserts

The ScyllaHighInserts alert means a high number of insert statements is being executed on the Scylla database on the respective node; handling it promptly helps ensure smooth operations and cluster stability. High insert operations can significantly impact cluster performance, disk I/O, memory, and network utilization.

Overview

The ScyllaHighInserts alert indicates that a high volume of insert operations is being performed on the ScyllaDB cluster. This can have several effects:

  1. Increased Disk I/O: Heavy write operations put pressure on the storage system.
  2. High Latencies: Read and write queries may experience slower response times.
  3. High Memory Usage: Large portions of system memory are consumed by caching, the Scylla compaction process, and memtables.
  4. Cluster Instability: High insert loads can cause synchronization issues between nodes, potentially leading to cluster breakdowns.

C3 Data Collection

To collect relevant data when this alert fires, follow these steps:

  1. Verify Cluster Health:

    • Check the cluster’s health using nodetool status:
      nodetool status
      
    • Ensure all nodes are in the UN (Up and Normal) state and that the Load column shows balanced data distribution.
  2. Monitor Disk and Memory Usage: Look at the OS Grafana Dashboard for further examination of the alert and its effects.

    • Verify if disk I/O is under stress:
      • Check I/O operations:
        iostat -x 1
        
    • Review memory usage to determine if Scylla is consuming excessive resources:
      free -m
      
  3. Check Latency Metrics:

    • Monitor read and write latencies using the following Grafana metrics:
      • scylla_storage_proxy_coordinator_write_latency
      • scylla_storage_proxy_coordinator_read_latency
  4. Analyze Network Traffic:

    • High insert operations can increase network traffic due to replication:
      iftop
      
    • Confirm if inter-node synchronization is causing high network I/O.
  5. Verify Insert Statements:

    • Confirm that insert operations are executing correctly across the cluster. High insert workloads can create uneven distribution:
      nodetool status
      
    • Review application logs for problematic insert queries.

Dependent Metrics

Monitor the following metrics to assess the system’s state during high insert activity:

  1. Memory Usage Metrics:
    • scylla_memory_allocated_memory
    • scylla_memory_free_memory
  2. Latency Metrics:
    • scylla_storage_proxy_coordinator_write_latency
    • scylla_storage_proxy_coordinator_read_latency
  3. Insert Metrics:
    • scylla_database_total_writes
  4. Disk I/O and Commit Logs:
    • scylla_commitlog_flush
    • scylla_commitlog_pending_flushes
    • Scylla compaction activity will increase and a rise will be visible in the chart for:
    rate(scylla_compaction_manager_pending_compactions{cluster="pssb-ds-sdc"}[5m])
    
  5. Network I/O Metrics:
    • scylla_streaming_total_incoming_bytes and scylla_streaming_total_outgoing_bytes will increase
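
A hedged sketch for pulling the insert and total write rates per instance from Prometheus (the Prometheus address is a placeholder):

    PROM=http://prometheus.example:9090   # placeholder, adjust to your setup

    # CQL inserts per second, per instance
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=sum(rate(scylla_cql_inserts{cluster="pssb-ds-sdc"}[5m])) by (instance)'

    # Total database writes per second, per instance
    curl -sG "$PROM/api/v1/query" \
      --data-urlencode 'query=sum(rate(scylla_database_total_writes{cluster="pssb-ds-sdc"}[5m])) by (instance)'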

C3 Remedy

C3 team’s role in resolving this alert is minimal, focusing on monitoring and data collection. Here’s what C3 should do:

  1. Monitor Resource Utilization:

    • Check disk I/O, RAM usage, and network bandwidth across the cluster nodes.
  2. Check Load Distribution:

    • Verify if all nodes are receiving load evenly. If some nodes are underloaded, investigate possible issues.
  3. Restart Scylla on Affected Nodes (if necessary):

    • If one or more nodes show unresponsive or abnormal behavior, restart the Scylla service on those nodes:
      sudo systemctl restart scylla-server
      
  4. Report Critical Issues to DevOps:

    • If the load is excessively high or cluster-wide inconsistencies are detected, escalate the issue to the DevOps team immediately.

DevOps Remedy

The DevOps team is responsible for resolving high insert activity issues and ensuring cluster stability. Follow these steps:

  1. Scale Resources Temporarily:

    • Allocate more CPU, RAM, or disk space to affected nodes.
    • Update Scylla configurations to handle higher insert volumes:
      commitlog_total_space_in_mb: 8192
      cache_size_in_mb: 30% of RAM
      
  2. Optimize Node Memory:

    • Increase memory usage limits for the Scylla process:
      SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --memory 1230M"
      

Proceed with this step only if increasing the memory for scylla-server will not starve the resources of any other services on the node.

  3. Check and Tune Disk and Network I/O:

    • Terminate unnecessary processes on affected nodes to free resources.
    • Upgrade network bandwidth if sustained high loads are expected.
  4. Investigate Application Queries:

    • Review application logs to identify and optimize inefficient insert statements. Work with the application team to reduce unnecessary queries.
  5. Monitor and Tune Inter-Node Traffic:

    • Confirm that replication traffic is evenly distributed and adjust token allocation if required:
      nodetool move <new_token>
      
  6. Plan for Long-Term Resource Scaling:

    • If high insert activity is expected to persist, consider adding more nodes to the cluster or upgrading hardware.

Remove Node From Scylla Cluster


1. Removing a Running Node

Steps:

  1. Check Node Status

    • Run the nodetool status command to verify the status of all nodes in the cluster.

    • Example output:

      Datacenter: DC1
         Status=Up/Down
         State=Normal/Leaving/Joining/Moving
         --  Address        Load       Tokens  Owns (effective)                         Host ID         Rack
         UN  192.168.1.201  112.82 KB  256     32.7%             8d5ed9f4-7764-4dbd-bad8-43fddce94b7c   B1
         UN  192.168.1.202  91.11 KB   256     32.9%             125ed9f4-7777-1dbn-mac8-43fddce9123e   B1
         UN  192.168.1.203  124.42 KB  256     32.6%             675ed9f4-6564-6dbd-can8-43fddce952gy   B1
      
    • Ensure the node you want to remove is in the Up Normal (UN) state.

  2. Decommission the Node

    • Use the nodetool decommission command to remove the node you are connected to. This ensures that the data on the node being removed is streamed to the remaining nodes in the cluster.
    • Example:
      nodetool decommission
      
  3. Monitor Progress

    • Use the nodetool netstats command to monitor the progress of token reallocation and data streaming (see the monitoring sketch after this list).
  4. Verify Removal

    • Run nodetool status again to confirm that the node has been removed from the cluster.
  5. Manually Remove Data

    • After the node is removed, manually delete its data and commit logs to ensure they are no longer counted against the load.
    • Commands:
      sudo rm -rf /var/lib/scylla/data
      sudo find /var/lib/scylla/commitlog -type f -delete
      sudo find /var/lib/scylla/hints -type f -delete
      sudo find /var/lib/scylla/view_hints -type f -delete
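
To support step 3 above, a minimal sketch for keeping an eye on decommission progress from the node being removed; the log path is a placeholder:

    # Refresh nodetool netstats every 30 seconds (Ctrl+C to stop)
    watch -n 30 nodetool netstats

    # Alternatively, keep a timestamped log of the progress
    while true; do date; nodetool netstats; sleep 60; done | tee -a /tmp/decommission-progress.log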
      

2. Removing an Unavailable Node

Steps:

  1. Attempt to Restore the Node

    • If the node is in the Down Normal (DN) state, try to restore it first.
    • Once restored, use the nodetool decommission command to remove it (refer to the “Removing a Running Node” section).
  2. Fallback Procedure: Remove Permanently Down Node

    • If the node cannot be restored and is permanently down, use the nodetool removenode command with the Host ID of the node.
    • Example:
      nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
      
  3. Precautions

    • Ensure all other nodes in the cluster are in the Up Normal (UN) state.

    • Run a Full Cluster Repair before executing nodetool removenode to ensure all replicas have the most up-to-date data.

      • To run a full cluster repair, execute the following command on each node in the cluster:
        nodetool repair -full
        
      • The -full option ensures a complete repair of all data ranges owned by the node.
    • If the operation fails due to node failures, re-run the repair and then retry nodetool removenode.

  4. Warning

    • Never use nodetool removenode on a running node that is reachable by other nodes in the cluster.

3. Safely Removing a Joining Node

Scenario: A node gets stuck in the Joining (UJ) state and never transitions to Up Normal (UN).

Steps:
  1. Drain the Node

    • Run the nodetool drain command to stop the node from listening to client and peer connections.
    • Example:
      nodetool drain
      
  2. Stop the Node

    • Stop the ScyllaDB service on the node.
    • Command:
      sudo systemctl stop scylla-server
      
  3. Clean the Data

    • Delete the node’s data, commit logs, hints, and view hints.
    • Commands:
      sudo rm -rf /var/lib/scylla/data
      sudo find /var/lib/scylla/commitlog -type f -delete
      sudo find /var/lib/scylla/hints -type f -delete
      sudo find /var/lib/scylla/view_hints -type f -delete
      
  4. Restart the Node

    • Start the ScyllaDB service if you plan to add the node back to the cluster later.
    • Command:
      sudo systemctl start scylla-server
      

Important Notes and Warnings

  1. Disk Space Utilization

    • Before starting the removal process, review the disk space utilization on the remaining nodes. Ensure there is enough space to accommodate the data streamed from the node being removed. Add more storage if necessary.
  2. Data Consistency

    • Always run a full cluster repair before using nodetool removenode to ensure data consistency across replicas.
      • Use the nodetool repair -full command on each node in the cluster to perform a full repair.
  3. Repair Based Node Operations (RBNO)

    • When RBNO for removenode is enabled, re-running repairs after node failures is not required.
  4. Avoid Misuse of Commands

    • Do not use nodetool removenode on a running node that is reachable by other nodes in the cluster. This can lead to data loss or inconsistency.



Adding a New Node to an Existing ScyllaDB Cluster (Out Scale)

Adding a node to a ScyllaDB cluster is known as bootstrapping, during which the new node receives data streamed from existing nodes. This process can be time-consuming depending on the data size and network bandwidth. In multi-availability zone deployments, ensure AZ balance is maintained.

Prerequisites

  1. Verify All Nodes Are Healthy: You cannot add a new node if any existing node is down.
  • Use the following command to check:
    nodetool status
    
  2. Collect Cluster Information: Log in to any existing node and gather the following:

    • Cluster Name: grep cluster_name /etc/scylla/scylla.yaml
    • Seeds: grep seeds: /etc/scylla/scylla.yaml
    • Endpoint Snitch: grep endpoint_snitch /etc/scylla/scylla.yaml
    • ScyllaDB Version: scylla --version
    • Authenticator: grep authenticator /etc/scylla/scylla.yaml

Procedure to Add New Node

  1. Install ScyllaDB: Install the exact same version (including patch release) of ScyllaDB as used in the current cluster.

Examples:

# For Scylla Enterprise
sudo yum install scylla-enterprise-2018.1.9

# For Scylla Open Source
sudo yum install scylla-3.0.3

⚠️ Do not use a different version or patch release. This may break compatibility.


  2. Configure the New Node

Do not start the node before completing configuration.

Edit /etc/scylla/scylla.yaml:

  • cluster_name: Must match the existing cluster name
  • listen_address: IP of the new node
  • rpc_address: IP for CQL client connections
  • endpoint_snitch: Must match the existing cluster setting
  • seeds: Comma-separated IPs of the existing seed nodes
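
A hedged example of what the resulting /etc/scylla/scylla.yaml fragment might look like; every value below is a placeholder (the addresses and snitch are assumptions) and must be replaced with your cluster's real settings:

    cluster_name: 'pssb-ds-sdc'              # must match the existing cluster name
    listen_address: 192.168.1.204            # IP of the new node (placeholder)
    rpc_address: 192.168.1.204               # IP used for CQL client connections (placeholder)
    endpoint_snitch: GossipingPropertyFileSnitch   # must match the existing cluster setting (assumed value)
    seed_provider:
      - class_name: org.apache.cassandra.locator.SimpleSeedProvider
        parameters:
          - seeds: "192.168.1.201,192.168.1.202"   # existing seed nodes (placeholders)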

  3. Start the New Node

Start the Scylla service:

sudo systemctl start scylla-server

The node will join the cluster and begin bootstrapping. Check its status:

nodetool status

You should see the new node in UJ (Up/Joining) state:

UJ  192.168.1.203  ...  Rack: B1

Wait until it changes to:

UN  192.168.1.203  ...  Rack: B1

Post-Join Cleanup

Once the new node is UN (Up Normal):

  • Run cleanup on all other nodes, not the new one:
    nodetool cleanup
    

⚠️ Cleanup removes old data that is now owned by the new node. It’s important to prevent data resurrection.

Cleanup Tips

  • Multiple nodes: Run cleanup on the old cluster nodes after all new nodes are added (see the sketch after this list).
  • Performance: Run cleanup during low-traffic hours.
  • Impact control: Run cleanup on one node at a time.
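
A hedged sketch for running cleanup one node at a time across the pre-existing nodes; the host list and SSH access are assumptions and must be adapted to your environment:

    # Placeholder host list: the old cluster nodes, excluding the newly added one
    for host in pssb1avm001 pssb1avm002 pssb1avm003 pssb1avm004 pssb1avm005; do
      echo "Running cleanup on $host"
      ssh "$host" 'nodetool cleanup'
    done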

✅ Summary Checklist

  [ ] All existing nodes are up
  [ ] Cluster details collected
  [ ] ScyllaDB version matched
  [ ] scylla.yaml configured properly
  [ ] I/O scheduler files copied
  [ ] Node started
  [ ] Node reached UN state
  [ ] Cleanup run on other nodes
  [ ] Monitoring/Manager updated