OS Observability

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.

Reference Documentation

  • For reference dashboard panels, metrics, operating ranges, and severity levels, refer to the provided Google Sheets link.

  • For Grafana dashboard panels and their respective explanations, refer to the Link.

Alert Handling Procedure

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.

Alerts, Thresholds and Priorities

| Row Panel | Panel | Panel Description | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quick CPU / Mem / Disk | ResourcePressure | The percentage of process time spent waiting for resources in the last X time interval | irate(node_pressure_cpu_waiting_seconds_total) | How fast the waiting time is increasing over a very recent period due to CPU, memory, and IO congestion | 0-15% | node_pressure_cpu_waiting_seconds_total | | +ve values | 10% | 5% | < 5% |
| | | | irate(node_pressure_memory_waiting_seconds_total) | | 0-5% | node_pressure_memory_waiting_seconds_total | | | > 5% | 0-5% | 0-1% |
| | | | irate(node_pressure_io_waiting_seconds_total) | | 0-17% | node_pressure_io_waiting_seconds_total | | | > 10% | 3% - 10% | 0 - 3% |
| | CPU Busy | Percentage of CPU time spent not idle | 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))) | Average percentage of each CPU spent not idle | 0-80% | node_cpu_seconds_total | Counts the total time in seconds each CPU core has spent in idle mode | +ve values | > 90% | 80% - 90% | 0 - 80% |
| | System Load | System load over all CPU cores together | scalar(node_load1) * 100 / count(count(node_cpu_seconds_total) by (cpu)) | Upcoming system load average per CPU | 0 to more than 100% | node_load1 | Load average shows how system performance is evolving through different time ranges | Dependent on CPU count (0-3.0) | > 90% | 80% - 90% | 0 - 80% |
| | RAM Used | Used memory in percentage | ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100 | The percentage of used memory | 0-100% | node_memory_MemTotal_bytes | Total bytes in memory | +ve values | > 90% | 75% - 90% | 0% - 75% |
| | | | | | | node_memory_MemFree_bytes | Available bytes in memory | +ve values | | | |
| | SWAP Used | Used swap in percentage | ((node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes) * 100 | The percentage of swap used | 0% - 100% | node_memory_SwapTotal_bytes | Swap total bytes | +ve values | > 25% | 10% - 25% | 0% - 10% |
| | | | | | | node_memory_SwapFree_bytes | Swap bytes available | +ve values | | | |
| | Root FS Used | Root file system used percentage | 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) | The percentage of the root file system (mounted on /) that is used | 0-100% | node_filesystem_avail_bytes | Available bytes in file system | +ve values | Dependent and > 90% | 80% - 90% | 0% - 80% |
| | | | | | | node_filesystem_size_bytes | Total bytes in file system | +ve values | | | |
| | CPU Cores, Root FS Total, Uptime, SWAP Total, RAM | Informational (INFO) panels; no thresholds defined | | | | | | | | | |
| System Misc | Context Switches / Interrupts | The rate of context switches and hardware interrupts, indicating CPU multitasking efficiency and hardware activity | irate(node_context_switches_total[$__rate_interval]) | Per-second rate of context switches on a node | 4K - 9K switches per second | node_context_switches_total | Total number of context switches performed by the CPU | +ve values | Dependent and > 3k for a single CPU | 2k for a single CPU | 0-2k |
| Node Status | Node Down | Node down | sum(up{instance=~"pssb.*"}) < 5 | Fires when fewer than the expected number of instances report up (up = 1 while running) | 1 | up | The value of the metric is 1 when the node is in the running state | 0 or 1 | 0 | NA | 1 |
| Storage Disk | Disk Space Used Basic | Disk space used across all mounted file systems | 100 - ((node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'}) | The percentage of used storage for each mount | 1-100% | node_filesystem_avail_bytes | Available space in each mounted file system | Dependent | Dependent and > 90% | Dependent and 80% - 90% | < 80% |
| | | | | | | node_filesystem_size_bytes | Total size of the filesystem in bytes, i.e. the total storage capacity of the disk or partition | | | | |
| | DISK IOPS Completed | The number of IO requests (after merges) completed per second for the device | irate(node_disk_reads_completed_total[$__rate_interval]) | Rate of disk reads completed per second on a node over the defined interval | Dependent | node_disk_reads_completed_total | Count of disk read operations completed on a disk | Dependent | Dependent and > 200 reads/sec | Dependent and 150 - 200 reads/sec | Dependent and < 150 reads/sec |
| | Disk R/W Data | The number of bytes written to or read from the storage device per second | irate(node_disk_written_bytes_total[$__rate_interval]) | Per-second rate of bytes written to disk, helping to detect spikes in write activity | Dependent | node_disk_written_bytes_total | Total number of bytes written to disk over time | 0 to +ve | > 200 iops/sec | 150 - 200 iops/sec | 0 - 150 iops/sec |
| | | | irate(node_disk_read_bytes_total[$__rate_interval]) | Per-second rate of bytes read from disk, useful for identifying sudden increases in read demand | Dependent | node_disk_read_bytes_total | Total number of bytes read from disk over time | 0 to +ve | | | |
| Storage File System | File Descriptors | Count of open file descriptors on the system, used to monitor resource usage | node_filefd_maximum | Maximum number of file descriptors available to the system, indicating the limit on simultaneously open files | Dependent | node_filefd_maximum | - | 0 to +ve | > 4500 | 4000 - 4500 | < 4000 |
| | | | node_filefd_allocated | Current count of file descriptors in use, to track resource consumption and spot when limits are near | Dependent | node_filefd_allocated | - | 0 to +ve | | | |
| | File System in Read Only / Error | Indicates whether a filesystem has switched to a read-only state or encountered errors, pointing to potential disk failures or permission issues impacting writes | node_filesystem_readonly{device!~'rootfs'} | Whether a filesystem is mounted read-only, typically due to disk errors or file system issues | 0 | | | | | | |

CPU Busy Increased

When 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="$node"}[$__rate_interval]))) shows increased CPU busy time, it indicates that the system is spending more time processing tasks rather than being idle. High CPU busy time can lead to system slowdowns and affect performance.

C3 Data Collect

  1. CPU Utilization: Check the overall CPU usage to determine if it is near full utilization.
    • Use top, htop, or uptime to check the current CPU load.
  2. Load on Specific Cores: Identify if the load is distributed evenly across CPU cores or concentrated on specific ones.
    • Use mpstat -P ALL to see if any particular CPU core is being overwhelmed.
  3. Running Processes: Check which processes are consuming the most CPU time.
    • Use ps aux --sort=-%cpu or top to view the CPU usage of processes.
  4. System Load: Check system load averages (node_load1, node_load5, node_load15), which can help correlate the CPU usage with overall system load.
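
The checks above can be run together as a quick triage pass; a minimal sketch (mpstat comes from the sysstat package, and nothing here is a prescribed threshold):

    # Snapshot of load averages and uptime
    uptime
    # Per-core utilization, to spot a single overloaded core
    mpstat -P ALL 1 3
    # Top 10 processes by CPU usage
    ps aux --sort=-%cpu | head -n 11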

Dependent Metrics

When CPU busy time is high, the following metrics can provide additional context:

  • CPU Pressure: irate(node_pressure_cpu_waiting_seconds_total) helps determine if there is CPU contention, which could explain the increased CPU busy time.
  • Process Metrics: node_procs_running, node_procs_blocked can show how many processes are running or waiting for CPU resources.
  • Disk I/O: Metrics like node_disk_io_time_seconds_total and node_disk_reads_completed_total can indicate if high CPU usage is due to I/O-bound tasks.
  • Memory Pressure: node_pressure_memory_waiting_seconds_total could indicate if high CPU usage is a result of memory contention.

C3 Remedy

  1. Identify High CPU Processes:
    • Use top, htop, or ps aux to find out which processes are consuming the most CPU.
  2. Optimize or Terminate Processes:
    • If certain processes are consuming too much CPU, consider optimizing their configurations, or terminating them if they are non-essential.
  3. Investigate System Load:
    • Check if the system load is high (i.e., node_load1, node_load5, node_load15) to determine if high CPU usage is affecting the overall system performance.
  4. Check for Resource Saturation:
    • Investigate if there’s a resource bottleneck elsewhere in the system (e.g., disk I/O or memory).

If C3 and DevOps actions do not resolve the issue, further analysis of system performance or application behavior may be needed to uncover deeper root causes.

DevOps Remedy

If the C3 team has already followed the data collection and remediation steps but CPU pressure persists, the DevOps team can take additional actions:

  1. Add More Resources:

    • If the system is constantly under high CPU pressure and it’s impacting performance, consider upgrading the server with more CPU cores or a higher clock speed.
  2. Optimize Application Code:

    • If specific applications or services are causing high CPU usage, optimize the code to make it more efficient or allocate more resources to handle the load.
  3. Rebalance Workloads:

    • Distribute workloads more evenly across the available CPUs to prevent overburdening any single core. This can be done by tuning application settings or using load balancing techniques.

System Load Increased

  • Confused about the difference between system load and CPU busy?

    System load (or CPU load) is measured from the processes waiting for the CPU to be allocated, averaged over the last 1, 5, and 15 minutes. CPU busy is the percentage of CPU time that is not idle, i.e. time the CPU spends executing processes. Both can be calculated from the single metric node_cpu_seconds_total.

When the query scalar(node_load1{instance="$node",job="$job"}) * 100 / count(count(node_cpu_seconds_total{instance="$node",job="$job"}) by (cpu)) is used, it calculates the system load as a percentage. It reflects how much CPU is needed to handle the load, considering the number of CPUs available. An increased value suggests that the system is under heavy load, potentially resulting in CPU contention or resource saturation.

  • Load is simply a count of the number of processes using or waiting for the CPU at a single point in time
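
To make the panel's calculation concrete, the same load-as-a-percentage figure can be approximated on the host itself; a minimal sketch using the 1-minute load average and the online CPU count:

    # 1-minute load average expressed as a percentage of available cores
    load1=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
    cores=$(nproc)                          # number of online CPUs
    awk -v l="$load1" -v c="$cores" 'BEGIN { printf "load1 = %.0f%% of %d cores\n", l * 100 / c, c }'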

C3 Data Collect

  • Update system apps and drivers: Outdated drivers and apps can also cause high CPU load because they can’t effectively perform the I/O operations. The best way to avoid this issue is to ensure the entire system is up to date.

  • Idle time: Idle time is inversely related to CPU load: when idle time increases, CPU load decreases, and vice versa.
  • User time and system time: These directly indicate CPU load. The sum of user time, system time, and idle time equals 100% of CPU time, so higher user and system time values mean a higher CPU load.
  • Wait or I/O wait time: I/O wait is time when the CPU is idle, waiting for an I/O operation to complete. It increases CPU load, as more processes queue for the CPU while it waits on I/O.
  • Steal time: The percentage of time a virtual CPU involuntarily waits for a real CPU while the hypervisor is servicing another virtual CPU.

  1. Check System Load:

    • Use uptime or top to review the system load averages over 1 minute, 5 minutes, and 15 minutes. The load values should ideally be less than the number of CPU cores available.
    • Check the load from Prometheus: node_load1, node_load5, node_load15.
  2. CPU Utilization:

    • Check the CPU usage using top, htop, or mpstat -P ALL to ensure that CPU resources are not fully utilized.
  3. System Performance:

    • Investigate any performance degradation by checking logs or using tools like dmesg or journalctl for any hardware or software failures.

Dependent Metrics

When system load is high, consider reviewing the following metrics for additional context:

  • CPU Pressure: irate(node_pressure_cpu_waiting_seconds_total) to check if processes are waiting for CPU time.
  • Memory Pressure: irate(node_pressure_memory_waiting_seconds_total) to see if memory constraints are contributing to the load.
  • Disk I/O: node_disk_io_time_seconds_total and node_disk_reads_completed_total to check if disk I/O is a bottleneck affecting CPU performance.
  • Procs Running: node_procs_running to check how many processes are running and if they are contributing to system load.

C3 Remedy

  1. Monitor Running Processes:

    • Use top, htop, or ps aux to identify processes that are consuming significant CPU time.
  2. Terminate or Optimize Processes:

    • If certain processes are contributing to the high load, consider terminating or optimizing them.
  3. System Load Analysis:

    • Check if the load exceeds the number of CPUs, indicating resource saturation. This can be managed by reducing the number of processes running or upgrading CPU resources.

High RAM Usage Alert

C3 Data Collect

When RAM usage is high, the C3 team needs to collect the following data by logging into the server to gather more context:

  1. Check Memory Usage:

    • Run the following command to view the total, used, free, and available memory:
      free -m
      
    • This will provide an overview of how much memory is being used and how much is available.
  2. Check Active Processes:

    • Identify memory-consuming processes with the following command:
      top
      
    • Look for processes consuming a significant amount of memory. Alternatively, you can use htop for a more interactive interface.
  3. Check Swap Usage:

    • Check if the system is using swap space, which may indicate memory pressure:
      swapon -s
      
    • If swap is being used heavily, it can be a sign that the system is running out of RAM and using swap as overflow space.
  4. Check System Logs:

    • View system logs for memory-related issues. The following command can help identify any memory errors or warnings in the logs:
      dmesg | grep -i memory
      
    • You can also check logs via journalctl if you’re using a systemd-based distribution:
      journalctl -xe | grep -i memory
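
The data-collection steps above can be gathered in a single pass; a minimal sketch (not an exhaustive check):

    # Memory and swap overview
    free -m
    swapon -s
    # Top 10 processes by resident memory
    ps aux --sort=-%mem | head -n 11
    # Recent kernel messages mentioning memory (e.g., OOM killer activity)
    dmesg | grep -iE 'memory|oom' | tail -n 20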
      

C3 Remedy

Once the C3 team has gathered the data, they can take the following steps to remedy high RAM usage:

  1. Identify Memory-Hungry Processes:

    • Use the top or htop command to check which processes are consuming the most memory. If any processes are using more memory than expected, consider stopping or restarting them:
      kill <PID>
      
      Replace <PID> with the process ID of the memory-hungry process. Be cautious and ensure that killing the process will not disrupt critical operations.
  2. Free Up Cached Memory:

    • If the system is using too much memory for caching, it may help to clear the cache. This can be done by running:
      sync; echo 3 > /proc/sys/vm/drop_caches
      
    • This will free up memory used by the cache without affecting the running processes. However, it may cause temporary performance degradation as cached data will need to be reloaded.
  3. Check Swap Usage:

    • If swap usage is high, this can indicate that the system is running out of physical memory. Try to reduce the system’s reliance on swap by terminating memory-heavy processes or freeing up memory as described above.
  4. Restart Services or Applications:

    • Sometimes, restarting specific services or applications that are consuming excessive memory can resolve the issue. For example:
      systemctl restart <service-name>
      
    • Replace <service-name> with the name of the service, like apache2, mysql, or nginx.
  5. Increase Swap Space (if necessary):

    • If the system continues to run out of memory, consider adding more swap space by creating a swap file. Inform the DevOps team so they can carry this out.
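
      For reference, a minimal sketch of how the DevOps team might add a swap file; the 2 GB size and /swapfile path are illustrative assumptions, not a prescribed standard:

          # Create and enable a 2 GB swap file (illustrative size and path)
          sudo fallocate -l 2G /swapfile          # allocate the file
          sudo chmod 600 /swapfile                # restrict permissions
          sudo mkswap /swapfile                   # format it as swap
          sudo swapon /swapfile                   # enable it immediately
          echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots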
      

Dependent Metrics

In the event of high RAM usage, the following actions should also be monitored:

  1. Check System Load:

    • High memory usage can lead to increased system load. Check the system load averages using:
      uptime
      
    • If the load averages are high, this indicates that the system is struggling to handle the processes and may be an indication of a bottleneck.
  2. Check CPU Usage:

    • High RAM usage often correlates with high CPU usage. You can check CPU usage by running:
      top
      
    • If both CPU and memory usage are high, it may indicate that the system is overwhelmed with processes and needs optimization.
  3. Check Disk I/O:

    • High memory usage may lead to increased disk I/O as the system swaps memory pages to the disk. You can check disk I/O usage using:
      iostat
      

DevOps Remedy

If C3 actions do not resolve the high RAM usage, the DevOps team should take the following steps:

  1. Increase Physical Memory:

    • If the system consistently faces high memory usage, consider adding more RAM to the server.
  2. Review and Adjust System Configuration:

    • Adjust system memory settings, such as vm.swappiness, which controls the kernel’s tendency to swap, to reduce swap usage when physical memory is running low.
  3. Scale Resources:

    • If the high memory usage is due to a growing workload, consider scaling the infrastructure vertically (adding more RAM) or horizontally (adding more servers to distribute the load).
  4. Check for Memory Leaks:

    • If a process or application is using more memory over time, investigate possible memory leaks and fix them in the application’s code.
  5. Optimize Memory Usage at the Application Level:

    • Review the application’s memory usage patterns and optimize the application code to better handle memory.

High Swap Usage Alert

Swap is a portion of your instance’s disk that is reserved for the operating system to use when the available RAM has been utilized. As it uses the disk, Swap is slower to access and is generally used as a last resort.

Swap can be used even if your instance has plenty of RAM.

High Swap Usage Warning

High swap usage is concerning if your instance is also running low on available RAM (i.e., consistently using more than 75% of it).

Affects:

  • Slow query responses
  • Degraded performance due to swapping regularly between RAM and disk.
  • Higher Disk I/O due to swapping regularly.

C3 Data Collect

When swap usage is high, it indicates that the system is running out of physical memory and is resorting to swap space, which can negatively impact system performance. The C3 team should perform the following actions via SSH login to the server to gather more information:

  1. Check Current Swap Usage:

    • Run the following command to see the current swap usage:
      swapon -s
      
    • This will show the swap devices and their usage. Check the Used column to see if swap usage is high.
  2. Check Overall Memory Usage:

    • View total, used, and free memory with:
      free -m
      
    • If free memory is low and swap is being used heavily, it may indicate a memory pressure situation.
  3. Check Swap Usage Percentage:

    • Check the swap usage percentage to confirm if it’s over a threshold that could be causing system performance issues. Run:
      vmstat 1 5
      
    • The si (swap in) and so (swap out) columns will show how much data is being swapped in and out, which indicates swap activity.
  4. Check System Logs for Swap-Related Issues:

    • Look for any warnings or errors related to memory or swap in the system logs:
      dmesg | grep -i swap
      
    • You can also use journalctl to search for swap-related messages:
      journalctl -xe | grep -i swap
      
  5. Identify High Memory Processes:

    • Use top or htop to identify processes that are consuming excessive memory. Processes with high memory usage can contribute to swap usage.
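
To see which processes are actually occupying swap (rather than just RAM), the per-process VmSwap field can be read from /proc; a rough sketch, assuming procfs is available:

    # Top 10 processes by swap usage (values in kB), read from /proc/<pid>/status
    for f in /proc/[0-9]*/status; do
      awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f"
    done | sort -rn | head -n 10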

C3 Remedy

Once the C3 team has collected the data, they can take the following steps to resolve high swap usage:

  1. Identify Memory-Hungry Processes:

    • Run top or htop to identify processes consuming excessive memory. If certain processes are found, consider terminating or restarting them:
      kill <PID>
      
    • Replace <PID> with the process ID of the offending process. Be cautious about terminating critical processes.
  2. Free Up Cached Memory:

    • If the system is using swap because cached memory is taking up too much space, try clearing the cache to free up memory:
      sync; echo 3 > /proc/sys/vm/drop_caches
      
    • This will free up memory used for caching. However, this may cause a temporary performance hit as cached data is cleared.
  3. Restart Memory-Hungry Applications:

    • Restart applications or services that are using a significant amount of memory. For example:
      systemctl restart <service-name>
      
    • Replace <service-name> with the name of the service, such as mysql, nginx, or apache2.
  4. Increase Physical Memory (if possible):

    • If swap usage is high and memory usage is consistently high, adding more physical RAM may help alleviate the problem.
  5. Adjust Swappiness:

    • The vm.swappiness parameter controls the kernel’s preference for swapping. By default, it is set to 60, which is a balanced approach. You can decrease this value to make the kernel less likely to use swap:
      sysctl vm.swappiness=30
      
    • To make the change permanent, add the following line to /etc/sysctl.conf:
      vm.swappiness = 30
      

Dependent Metrics

When swap usage is high, the following metrics should also be monitored to understand the impact:

  1. System Load:

    • High swap usage often leads to increased system load. Check the load averages using:
      uptime
      
    • If the load is high, it indicates the system is struggling, and swap usage could be one of the contributing factors.
  2. Memory Usage:

    • Monitor the free -m output for memory usage. If memory is low and swap usage is high, it indicates that the system has insufficient RAM for its workload.
  3. CPU Usage:

    • Check CPU usage (top or htop) to determine if the system is spending too much time swapping (i.e., CPU waiting for I/O operations). High CPU usage coupled with high swap usage may suggest disk I/O bottlenecks.
  4. Disk I/O:

    • Swap usage is often accompanied by high disk I/O as the system writes and reads memory pages from swap space. Check disk I/O performance with:
      iostat
      
  5. Page Faults:

    • Page faults can be an indication of the system attempting to load data from swap. You can check for page faults using:
      vmstat 1 5
      
    • Look for high values in the si (swap in) and so (swap out) columns; the cumulative pgpgin (pages paged in) and pgpgout (pages paged out) counters are available in /proc/vmstat.

DevOps Remedy

If C3 actions do not resolve the high swap usage issue, the DevOps team should consider the following:

  1. Increase Physical Memory:

    • Add more RAM to the server to reduce the reliance on swap space.
  2. Optimize System Configuration:

    • Adjust system parameters related to memory management to optimize performance, such as tweaking the vm.swappiness value or adjusting resource limits for applications.
  3. Scale Infrastructure:

    • If the server’s workload is too high, consider scaling the infrastructure either vertically (by adding more memory) or horizontally (by adding more servers to distribute the load).
  4. Investigate Memory Leaks:

    • If certain applications or processes are using more memory over time, investigate and fix possible memory leaks in the application’s code.
  5. Monitor Disk I/O Performance:

    • Ensure that disk I/O performance is adequate for handling swap operations. If disk I/O is slow, it can further exacerbate the performance impact of swap usage.

Node Down Alert

A Node Down Alert indicates that a server or node is no longer accessible within the network, which can affect application availability, data accessibility, and service continuity.

Affects:

  • Disruption in application services relying on the node.
  • Potential data unavailability if the node was part of a storage or database cluster.
  • Load imbalance as traffic or processing shifts to remaining nodes.

C3 Data Collection

When a Node Down Alert is triggered, the C3 team should follow these steps to collect information on the status of the node:

  1. Verify Node Reachability:

    • Attempt to ping the node’s IP address:
      ping <node_ip>
      
    • If the node doesn’t respond, this indicates a potential network or power issue.
  2. Check SSH Connectivity:

    • Attempt an SSH login to the node:
      ssh <user>@<node_ip>
      
    • If SSH is inaccessible, check for network or firewall issues.
    • Try logging in to the server through console access (e.g., virsh console) or the serial console offered by public cloud providers.
  3. Check Application Monitoring System:

    • Review monitoring tools like Prometheus, Grafana, or any internal monitoring system for further insights on the node’s status and to see the last known metrics.
  4. Analyze Network and Power Status:

    • Check with the network team for any known outages or recent changes that could impact connectivity.
    • Ensure the node’s physical or virtual power status is intact and operational.
  5. Identify Last Log Entries:

    • Check the last logs or alerts for signs of abnormal activity, errors, or processes that may indicate why the node went down:
      journalctl -xe --since "5 minutes ago"
      
    • This provides context for system events leading up to the downtime.
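
A quick reachability check from an accessible host might look like this; a sketch where <node_ip> and <user> are the placeholders from the steps above:

    # ICMP reachability (4 probes)
    ping -c 4 <node_ip>
    # SSH reachability with a short timeout; BatchMode fails fast instead of prompting for a password
    ssh -o ConnectTimeout=5 -o BatchMode=yes <user>@<node_ip> 'uptime'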

C3 Remedy Steps

  1. Restart Node Services (If SSH is accessible):

    • If the node is partially responsive, try restarting critical services that may restore access:
      sudo systemctl restart <service-name>
      
    • Replace <service-name> with the main service running on the node (e.g., apache2, nginx).
  2. Reboot the Node:

    • If the node is unresponsive to SSH, a restart via management consoles (e.g., AWS, Azure, VMware) may be required. This can restore functionality if the node has become unresponsive due to a software issue.
  3. Escalate to Network and Hardware Teams:

    • If the node remains unreachable, escalate to the DevOps team to check for possible physical or network failures.

Dependent Metrics and Checks

When a node is down, review these metrics to assess impact:

  1. System Load on Neighboring Nodes:

    • Check the load averages on nodes handling additional traffic due to the downed node:
      uptime
      
  2. Database Cluster Status:

    • If the node is part of a cluster (e.g., Redpanda, Scylla, MySQL), review cluster health to ensure quorum and availability.
  3. Network and Firewall Logs:

    • Examine network logs to rule out network disruptions, configuration changes, or firewall issues affecting node reachability.
  4. Storage Cluster Health:

    • If the node is part of a distributed storage setup (e.g., GlusterFS), verify that replication and failover mechanisms are functioning, preventing data unavailability.

DevOps Remediation Steps

If C3 actions don’t resolve the Node Down Alert, the DevOps team should consider these options:

  1. Reallocate Workloads:

    • Temporarily shift workloads to other nodes, if possible, to maintain service continuity.
  2. Perform Hardware Diagnostics:

    • If accessible, perform hardware diagnostics (e.g., memory, disk checks) to identify potential hardware issues.
  3. Check Virtualization Platform:

    • If the node is a VM, investigate host-level issues that may affect virtual machine availability.
  4. Plan Node Replacement:

    • If the node cannot be restored, start preparations for its replacement, including provisioning a new node and updating configurations.
  5. Update Alerting and Documentation:

    • Review and update alert thresholds and documentation for faster troubleshooting in future incidents. Ensure all relevant teams are aware of the resolution process.

Root File System Filling Alert

A Root File System Filling Alert is triggered when the root (/) file system reaches a high usage threshold, potentially leading to critical performance degradation, service disruptions, or inability to store necessary data.

Affects:

  • Limited ability to write logs or data to disk.
  • Possible application crashes or performance issues due to insufficient space.
  • Risk of operating system instability if system files cannot be updated or managed.

C3 Data Collection

When a Root File System Filling Alert is triggered, the C3 team should perform the following steps to gather information on disk usage:

  1. Check Disk Usage:

    • Use the df command to check the usage of the root file system:
      df -h /
      
    • Look at the % Used column to determine how close it is to the threshold (e.g., 80%, 90%).
  2. Identify Largest Files and Directories:

    • Locate files or directories consuming the most space in the root file system:
      du -ahx / | sort -rh | head -20
      
    • This command displays the top 20 largest files and directories, helping identify potential space hogs.
  3. Examine Log Files:

    • Log files in /var/log can often grow unexpectedly. Use this command to identify large logs:
      du -sh /var/log/*
      
  4. Check Inode Usage:

    • If the file system is full due to excessive small files, check inode usage:
      df -i /
      
    • If inodes are fully utilized, the system won’t be able to create new files, even if there is available disk space.
  5. Review System Logs for Errors:

    • Check for recent errors related to disk space issues:
      dmesg | grep -i "disk full"
      
    • You can also use journalctl to search for disk-related warnings or errors:
      journalctl -xe | grep -i "disk full"
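
The commands above can be combined into a single disk-usage triage; a minimal sketch limited to the root file system:

    # Usage and inode consumption for /
    df -h /
    df -i /
    # Largest files and directories on the root file system (-x stays on one filesystem)
    du -ahx / 2>/dev/null | sort -rh | head -n 20
    # Size of each entry under /var/log
    du -sh /var/log/* 2>/dev/null | sort -rh | head -n 10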
      

C3 Remedy Steps

Once the C3 team has collected the necessary data, follow these steps to clear disk space:

  1. Clear Log Files:

    • Rotate or clear large log files, especially in /var/log. If logs are critical, compress them to save space:
      truncate -s 0 /var/log/<large-log-file>
      
    • Prefer deleting rotated log files (e.g., auth.log.1) rather than truncating the active log (e.g., auth.log).
    • Replace <large-log-file> with the filename. Be careful not to remove logs without a proper backup.
  2. Delete Unnecessary Files:

    • Remove temporary or unnecessary files in /tmp or user-specific directories:
      rm -rf /tmp/*
      
  3. Clear Package Cache:

    • Package caches can consume space. Clear unused caches for package managers, such as:
      sudo apt-get clean
      
  4. Archive and Move Old Data:

    • If there are old data files that aren’t actively needed, consider archiving and moving them to a different storage location:
      tar -czf /path/to/backup.tar.gz /path/to/old-files
      
    • Replace the paths as needed.

Dependent Metrics and Checks

When the root file system is filling up, monitor the following metrics to understand the potential impact:

  1. Disk Usage on Other Filesystems:

    • Ensure other critical filesystems (e.g., /home, /var, /tmp) are not approaching capacity:
      df -h
      
  2. Application Logs for Errors:

    • Review logs in applications that rely on disk access (e.g., databases, web servers) for potential disk space issues.
  3. System Performance Metrics:

    • Monitor CPU and memory usage, as high disk usage can also impact system performance:
      top
      
  4. I/O Performance:

    • If the root file system is heavily used, it could lead to I/O bottlenecks. Monitor I/O performance:
      iostat
      

DevOps Remediation Steps

If C3 actions do not sufficiently reduce disk usage, the DevOps team should consider the following:

  1. Add More Disk Space:

    • Expand the root volume if using a virtualized environment or cloud provider, or attach additional storage if feasible.
  2. Implement Log Rotation:

    • Set up automatic log rotation for critical services to avoid log files filling up the disk:
      sudo nano /etc/logrotate.conf
      
    • Configure log rotation settings based on log growth and retention requirements (see the sketch after this list).
  3. Optimize Application Configurations:

    • Configure applications to store temporary or cache files on a non-root partition if they generate large data files frequently.
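
As an illustration of the log-rotation step above, a per-service drop-in under /etc/logrotate.d is usually preferable to editing logrotate.conf directly; the service name, log path, and retention values below are hypothetical examples, not a required configuration:

    # Hypothetical rotation policy for an application log
    sudo tee /etc/logrotate.d/myapp >/dev/null <<'EOF'
    /var/log/myapp/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
    }
    EOF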

Node Reboot Alert

A Node Reboot Alert is triggered when a monitored server has unexpectedly restarted. This can be due to various issues, including system crashes, power failures, or kernel panics. A node reboot can lead to temporary service interruptions, application downtime, and potential data inconsistencies.

Affects:

  • Temporary downtime of services hosted on the node.
  • Potential data loss or inconsistency if services were not gracefully stopped.
  • Increased load on other nodes if the rebooted node is part of a cluster.

C3 Data Collection

When a Node Reboot Alert is triggered, the C3 team should perform the following checks and gather relevant information:

  1. Confirm the Reboot:

    • Verify the reboot time and details with the following command:
      last -x | grep reboot | head -n 1
      
    • This shows the most recent reboot event with the timestamp.
  2. Check System Uptime:

    • Determine how long the system has been up since the reboot:
      uptime
      
    • This provides information on when the node came back online.
  3. Review System Logs for Reboot Cause:

    • Check for any errors or messages that may indicate why the node rebooted:
      journalctl -b -1
      
    • This command shows logs from the previous boot, helping identify any issues before the reboot.
  4. Examine Kernel Logs:

    • Look for kernel messages or errors that could indicate a crash or panic:
      dmesg | grep -i "panic\|error\|fatal"
      
    • This helps identify hardware or kernel-related issues.
  5. Check Power Management and UPS Logs (if applicable):

    • For nodes with uninterruptible power supplies (UPS), verify if a power issue caused the reboot:
      cat /var/log/ups.log | grep -i "power failure"
      
    • This command might vary based on the UPS configuration.
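
A minimal post-reboot investigation can combine the commands above, for example:

    # Most recent reboot/shutdown records and the current boot time
    last -x | grep -E 'reboot|shutdown' | head -n 5
    who -b
    # Boots known to the journal, and errors from the previous boot
    journalctl --list-boots | tail -n 3
    journalctl -b -1 -p err --no-pager | tail -n 50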

C3 Remedy Steps

Once the C3 team has collected data, they can follow these steps to mitigate any issues caused by the unexpected reboot:

  1. Restart Critical Services:

    • Confirm that essential services are running. If any services were stopped by the reboot, start them manually:
      systemctl start <service-name>
      
    • Replace <service-name> with the specific service (e.g., apache2, mysql).
  2. Verify System Health:

    • Perform basic system health checks to ensure stability:
      top -b -n 1 | head -n 10
      
    • This provides a snapshot of CPU and memory usage to confirm no high resource issues.
  3. Check Application Logs:

    • Review logs for applications hosted on the node to identify any issues from the restart:
      tail -n 50 /var/log/<application-log>
      
    • Replace <application-log> with the relevant log file for the service.

Dependent Metrics and Checks

After a node reboot, monitor these metrics to ensure continued stability:

  1. Service Status:

    • Confirm that all critical services have started successfully and are functioning normally.
  2. System Load:

    • Monitor system load to ensure it stabilizes after the reboot. Check load averages:
      uptime
      
  3. Memory Usage:

    • Confirm no unusual memory consumption post-reboot:
      free -m
      
  4. Disk Health:

    • Run a disk health check to rule out issues with disk storage that could cause reboots:
      smartctl -H /dev/sda
      
    • Replace /dev/sda with the relevant disk device name.
  5. Network Connectivity:

    • Ensure network interfaces are up and configured correctly:
      ip a
      

DevOps Remediation Steps

If the cause of the reboot is not apparent or is due to a recurring issue, the DevOps team should consider these additional actions:

  1. Check for Kernel Updates:

    • If the reboot was caused by a kernel issue, consider updating the kernel to the latest stable version:
      sudo apt update && sudo apt install linux-generic
      
    • Ensure that updates are tested in a non-production environment first.
  2. Monitor for Hardware Failures:

    • Review hardware diagnostics and replace faulty hardware if needed. For example:
      lshw -C memory
      
    • Look for hardware errors or warnings in the output.
  3. Document Findings:

    • Record any root cause analysis and actions taken for future reference, and share findings with the relevant teams.

Disk Space is Less Than 10% Available Alert

The Disk Space is Less Than 10% Available alert is triggered when the available disk space on a server falls below 10%. This can lead to various issues such as slow performance, application crashes, and failed system operations. Immediate attention is required to prevent critical system failures.

Affects:

  • Performance Degradation: Lack of available disk space can severely impact system performance and response times.
  • Application Crashes: Applications or services that rely on disk space may fail to operate properly or crash.
  • Data Corruption: In extreme cases, a full disk can cause data corruption if the system cannot write essential data.
  • System Failures: Essential system processes may fail or the operating system may become unstable.

C3 Data Collection

When the Disk Space is Less Than 10% Available alert is triggered, the C3 team should perform the following checks and gather relevant information:

  1. Check Available Disk Space:

    • Verify the disk usage on the server:
      df -h
      
    • This will show the disk usage and available space in human-readable format (-h flag).
  2. Identify Specific Disk/Partition:

    • Identify which specific disk or partition is running out of space. For example:
      df -h /dev/sda1
      
  3. List Large Files and Directories:

    • Find the largest files or directories occupying disk space:
      du -ahx / | sort -rh | head -n 20
      
    • This will list the top 20 largest files and directories in the root (/) directory. Modify the path if you need to check specific directories.
  4. Check System Logs:

    • Check the system logs to ensure no critical warnings or errors are related to disk space:
      dmesg | grep -i "disk"
      
    • Review any disk-related messages that may indicate failing disks or other issues.
  5. Examine Log Files:

    • Ensure that log files are not growing uncontrollably. Look for large log files in /var/log or other directories where logs are stored:
      ls -lh /var/log
      
  6. Check for Unused or Temporary Files:

    • Check for unused or temporary files that could be safely deleted. For example:
      find /tmp -type f -exec ls -lh {} \; | sort -rh | head -n 10
      

C3 Remedy Steps

Once the C3 team has collected data, they can follow these steps to resolve the issue:

  1. Clear Temporary and Cache Files:

    • Remove unnecessary or outdated files, such as temporary files, old logs, or cache files:
      rm -rf /tmp/*
      
  2. Rotate and Clean Up Logs:

    • Rotate logs and delete old log files to free up space:
      logrotate -f /etc/logrotate.conf
      
    • Consider setting up log rotation to prevent logs from consuming excessive disk space.
  3. Delete Unnecessary Files or Backups:

    • Remove any old backups, outdated files, or unused data to free up space. For example:
      rm -rf /path/to/old-backups/*
      
  4. Move Data to Another Disk or Partition:

    • If possible, move data to another disk or partition with more available space:
      mv /path/to/large-data /path/to/another-disk
      
  5. Extend Disk Space (if applicable):

    • If the disk is full and there is no option to free up space, consider adding additional storage or expanding the existing disk/partition:
      • For virtual machines: Increase the disk size through the virtualization platform.
      • For physical servers: Add additional drives and extend partitions using tools like gparted.
  6. Check Disk for Errors:

    • If there are any signs of a failing disk, check for disk errors:
      sudo smartctl -a /dev/sda
      
    • Replace any faulty disks if necessary.

Dependent Metrics and Checks

When the disk space is low, monitor the following metrics to assess the impact and take preventive measures:

  1. System Load:

    • Low disk space can lead to high system load. Check the load averages:
      uptime
      
  2. Disk I/O Performance:

    • Monitor disk I/O to detect any abnormal slowdowns or bottlenecks caused by low disk space:
      iostat
      
  3. Memory Usage:

    • Check if the system is under memory pressure and swapping due to insufficient disk space:
      free -m
      
  4. Log File Sizes:

    • Keep track of log file sizes to avoid uncontrolled growth that could fill the disk. For example:
      du -sh /var/log/*
      
  5. Backup and Archive Jobs:

    • Ensure that backup or archive jobs are not consuming excessive disk space. Check the last run times and size of backups:
      du -sh /path/to/backups
      

DevOps Remediation Steps

If C3 team actions do not resolve the disk space issue, the DevOps team should consider the following additional actions:

  1. Configure Disk Quotas:

    • Implement disk quotas for users and services to prevent any one user or service from consuming all available disk space.
  2. Upgrade Hardware (if needed):

    • For physical systems with frequent disk space issues, consider upgrading to a larger disk or a more reliable storage solution.
  3. Implement Data Archiving Solutions:

    • For systems with large amounts of data, implement an archiving solution to move infrequently accessed data off to external storage, reducing the load on the primary disk.

High Disk Reads or Writes Alert

The High Disk Reads alert is triggered when the disk read activity on a server exceeds a threshold that could indicate potential issues, such as slow disk performance or heavy I/O operations. High disk reads can lead to system slowdowns, especially when coupled with high disk writes or excessive I/O wait times.

Affects:

  • System Performance Degradation: High disk reads can lead to slower response times, especially when the system is waiting for data from the disk.
  • I/O Bottlenecks: Excessive disk reads can create bottlenecks in the I/O subsystem, affecting all services and applications relying on the disk.
  • Application Slowness: Applications that perform a lot of disk reads may experience delays in processing data or retrieving information.
  • Increased CPU Usage: Heavy disk read activity can increase CPU usage as the system spends more time reading data from disk and processing it.

C3 Data Collection

When the High Disk Reads alert is triggered, the C3 team should gather the following information to understand the underlying cause and to assist with remediation:

  1. Check Disk I/O Activity:

    • Use iostat to check for any abnormalities in disk I/O performance. It will show disk read and write statistics:
      iostat -dx 1 5
      
    • Look at the r/s (reads per second) and w/s (writes per second) columns to identify high read activity.
  2. Examine Disk Usage:

    • Verify the disk usage and overall health of the disks:
      df -h
      
    • Check if the disk space is nearing capacity, as this can cause slow disk read performance.
  3. Monitor Disk Reads Over Time:

    • Use vmstat to monitor disk read activity over time:
      vmstat 1 5
      
    • The bi (blocks in) column will indicate the number of blocks read from disk.
  4. Check for High-Read Processes:

    • Identify processes that are consuming excessive disk I/O resources by running iotop or top:
      sudo iotop -o
      
    • This will show processes that are performing high I/O operations, specifically disk reads.
  5. Review System Logs:

    • Check the system logs for any disk-related warnings or errors:
      dmesg | grep -i "disk"
      
    • Review any disk errors or slow read warnings that might explain the high disk read activity.
  6. Check for Disk Errors:

    • Check the health of the disk using smartctl to identify any possible errors or issues:
      sudo smartctl -a /dev/sda
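
The I/O checks above can be run together; a rough sketch (iotop and pidstat come from the iotop and sysstat packages):

    # Extended per-device I/O statistics, 5 samples at 1-second intervals
    iostat -dx 1 5
    # Per-process I/O in batch mode, only processes doing I/O, 3 samples
    sudo iotop -b -o -n 3
    # Per-process read/write rates
    pidstat -d 1 5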
      

C3 Remedy Steps

Once the C3 team has gathered the necessary information, the following steps should be taken to address the high disk read activity:

  1. Identify and Resolve High-Read Processes:

    • Use iotop or top to identify processes that are performing excessive disk reads. If the process is non-essential or consuming an excessive amount of I/O resources, consider stopping or restarting it:
      kill <PID>
      
    • Be careful when terminating processes, as some processes may be critical to the system.
  2. Optimize Database Queries:

    • If high disk reads are caused by database queries (e.g., MySQL, PostgreSQL), ask the development teams to optimize the queries to reduce the number of disk reads. Consider adding indexes, using caching, or reviewing the query execution plans.
  3. Clear Cache or Temporary Files:

    • If cached or temporary files are taking up a lot of space and causing disk reads, clear them to free up space:
      rm -rf /tmp/*
      
  4. Increase Disk I/O Throughput:

    • If possible, consider increasing the throughput of the disk by upgrading to a faster disk (e.g., SSD) or optimizing the disk setup (e.g., RAID configurations).

Dependent Metrics and Checks

When high disk reads are observed, the following metrics and system performance indicators should be monitored to assess the impact and prevent further issues:

  1. Disk Write Activity:

    • High disk reads often accompany high disk writes. Use iostat to monitor disk writes (w/s column) and check if both reads and writes are high.
  2. CPU Usage:

    • High disk read activity may lead to increased CPU usage as the system waits for data to be read. Check the CPU usage with top or htop.
  3. Disk Queue Length:

    • A high disk queue length (avgqu-sz in iostat) can indicate that the disk is struggling to handle read requests, leading to slower I/O operations.
  4. System Load:

    • High disk reads can lead to increased system load. Check the system load averages to see if they are higher than expected:
      uptime
      
  5. Disk Latency:

    • High disk read activity may be accompanied by high disk latency. Monitor disk latency using tools like iostat or check for errors in system logs related to disk timeouts or delays.
  6. Page Faults:

    • Check for page faults to determine if the system is waiting for data from swap. High page faults can indicate that the system is swapping heavily due to memory pressure, which could exacerbate the high disk reads:
      vmstat 1 5
      

DevOps Remediation Steps

If C3 team actions do not resolve the high disk reads issue, the DevOps team should consider the following steps:

  1. Scale Disk Resources:

    • If high disk reads are a recurring issue, consider scaling the disk resources by adding additional disks or upgrading to faster storage solutions like SSDs.
  2. Optimize System Configurations:

    • Ensure that system parameters related to disk I/O and memory management are optimized. For example, reducing the system’s reliance on swap by tuning the vm.swappiness parameter or increasing available memory.
  3. Database Optimization:

    • If the high disk reads are caused by database activity, consider scaling the database, optimizing queries, or adding caching layers to reduce the number of reads.
  4. Use Distributed File Systems:

    • If disk reads are a bottleneck in a distributed system, consider using distributed file systems (e.g., GlusterFS, Ceph) to improve scalability and reduce the load on individual disks.
  5. Implement I/O Queuing:

    • If disk reads are being caused by high I/O contention, implement I/O queuing mechanisms to smooth out read operations and ensure that they are processed more efficiently.
  6. Review Application-Level Disk Usage:

    • Review the application logic and ensure that it is not performing redundant or inefficient disk reads. Consider implementing a caching layer or optimizing file access patterns.

Time Synchronization Drift Detected Alert

The Time Synchronization Drift alert is triggered when the system clock is found to be out of sync with the configured time source, such as a Network Time Protocol (NTP) server. Time synchronization is critical for ensuring accurate timestamps, smooth system operations, and consistency in logs and data across distributed systems.

Affects:

  • Inconsistent Logs: Time drift can cause logs to appear out of order or lead to confusion in diagnosing issues.
  • Distributed Systems Issues: In distributed systems, time drift can lead to inconsistency in timestamps, which may affect coordination between services.
  • Authentication Failures: Services that rely on time-based authentication mechanisms (e.g., Kerberos, JWT tokens) can experience failures due to time differences.
  • Scheduled Task Failures: Scheduled tasks (cron jobs, backups, etc.) may fail or be delayed if the system clock is out of sync.

C3 Data Collection

When the Time Synchronization Drift alert is triggered, the C3 team should gather the following information to determine the cause of the time drift:

  1. Check System Time:

    • Run the following command to view the current system time:
      date
      
    • Compare the output with the correct time or the time of your time server to see if there is a significant drift.
  2. Verify NTP Service Status:

    • Check if the NTP service is running to synchronize the system time:
      systemctl status ntp
      
    • If the NTP service is not active, start it with:
      sudo systemctl start ntp
      
  3. Check NTP Synchronization:

    • Check if the system is synchronized with an NTP server using the following command:
      ntpq -p
      
    • This will show the NTP servers that the system is connected to, and whether synchronization is successful.
  4. Check for Time Drift Logs:

    • Review system logs for any messages indicating time synchronization issues:
      journalctl -xe | grep ntp
      
    • This will provide logs related to NTP synchronization errors, time drift, or related warnings.
  5. Check Hardware Clock (RTC):

    • Sometimes the time drift is due to a malfunctioning hardware clock. Check the current hardware clock time:
      sudo hwclock --show
      
  6. Check for Timezone Mismatches:

    • Ensure that the system timezone is set correctly:
      timedatectl
      
    • Mismatched timezones can lead to apparent time drift even if the system time is correct.
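
Which commands apply depends on the time daemon in use (ntpd, chrony, or systemd-timesyncd); a sketch covering the common cases:

    # Overall clock, timezone, and synchronization status
    timedatectl
    # If ntpd is in use: peer status
    ntpq -p
    # If chrony is in use: tracking and offset details
    chronyc tracking
    # Compare the hardware clock with the system clock
    sudo hwclock --show; date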

C3 Remedy Steps

Once the C3 team has gathered the necessary information, the following actions should be taken to resolve the time synchronization drift:

  1. Restart NTP Service:

    • If the NTP service is not running or synchronization has failed, restart the NTP service to resynchronize the system time:
      sudo systemctl restart ntp
      
  2. Manually Sync Time:

    • If NTP is not functioning correctly, manually sync the system time with an NTP server:
      sudo ntpdate <NTP-server>
      
    • Replace <NTP-server> with the address of a reliable NTP server.
  3. Check for Time Configuration Conflicts:

    • Ensure that no conflicting time synchronization services (e.g., chrony, ntpd, systemd-timesyncd) are running simultaneously. Disable any unused services:
      sudo systemctl stop chronyd
      sudo systemctl disable chronyd
      
  4. Ensure Correct Timezone:

    • If the system’s timezone is incorrect, adjust it using the following command:
      sudo timedatectl set-timezone <Timezone>
      
    • Replace <Timezone> with the correct timezone, such as UTC or Asia/Kolkata.
  5. Check for Hardware Clock Issues:

    • If the hardware clock is incorrect, update it using the following command:
      sudo hwclock --systohc
      
    • This will set the hardware clock to match the system time.
  6. Synchronize NTP Servers:

    • If the NTP servers are not providing reliable synchronization, consider switching to different NTP servers. For example, use the public NTP servers:
      sudo nano /etc/ntp.conf
      
    • Replace the current server entries with public NTP servers such as:
      server 0.pool.ntp.org
      server 1.pool.ntp.org
      

Dependent Metrics and Checks

When time synchronization drift is detected, the following metrics and system settings should also be monitored to identify underlying causes and ensure proper synchronization:

  1. System Load and Performance:

    • Time drift can sometimes be caused by high system load or performance issues that affect NTP synchronization. Monitor system performance with top, htop, or uptime.
  2. NTP Synchronization Status:

    • Monitor the synchronization status continuously using ntpq -p or chronyc sources to ensure that the system stays synchronized over time.
  3. System Logs:

    • Always check for recurring time synchronization errors in the system logs (journalctl -xe | grep ntp) to identify potential issues with the NTP service or time configuration.
  4. Hardware Clock (RTC) Drift:

    • Check for hardware clock drift by comparing the hardware clock time with the system time periodically using hwclock.
  5. Network Connectivity:

    • Ensure that the server has consistent network connectivity to communicate with the NTP servers. Network issues may prevent time synchronization from occurring.

DevOps Remediation Steps

If C3 team actions do not resolve the time synchronization drift issue, the DevOps team should consider the following steps:

  1. Check Network and Firewall Settings:

    • Ensure that network configurations or firewalls are not blocking NTP communication (UDP port 123) between the server and the NTP server.
  2. Scale Time Synchronization Infrastructure:

    • If the server’s time synchronization is heavily reliant on a single NTP server, consider adding multiple time sources for redundancy.
  3. Use Hardware Time Synchronization:

    • If possible, use hardware-based time synchronization mechanisms (e.g., GPS time servers) for more accurate time synchronization, especially for critical systems.
  4. Investigate NTP Service or Software Bugs:

    • Investigate and apply any patches for bugs in the NTP service or related software that may be causing synchronization issues. Keep the NTP software up-to-date.
  5. Monitor Server Time Continuously:

    • Set up monitoring to track the system’s time and NTP synchronization status continuously, ensuring that any time drift is detected and addressed proactively.

High Context Switching Alert

The High Context Switching alert is triggered when the number of context switches in the system exceeds a certain threshold. A context switch occurs when the CPU switches from one process or thread to another, which can happen frequently due to multitasking.

Excessive context switching can indicate issues like high CPU utilization, process contention, or inefficient use of system resources. It may lead to performance degradation due to overhead caused by the switching process.

Affects:

  • Increased CPU Load: High context switching can cause the CPU to spend too much time switching between processes, leading to increased CPU utilization and poor performance.
  • Slow System Response: Systems may become less responsive as resources are constantly allocated and deallocated between processes.
  • Inefficient Resource Usage: High context switching may indicate that system resources are being inefficiently allocated, causing unnecessary overhead.
  • Decreased Throughput: Processes may experience delays in execution as the system spends time switching contexts rather than executing tasks.

C3 Data Collection

When the High Context Switching alert is triggered, the C3 team should collect the following information to diagnose the root cause of the issue:

  1. Check System Load:

    • Start by checking the system load using the uptime or top command. High load averages combined with high context switching can indicate resource contention.
      uptime
      
  2. Check Context Switch Count:

    • Use the vmstat command to check for high context switches. Look for high values in the cs (context switches) column:
      vmstat 1 5
      
    • This will display context switch data along with system performance metrics.
  3. Monitor Process-Specific Context Switching:

    • Use pidstat to monitor context switches per process and identify which processes are causing excessive context switching:
      pidstat -w 1
      
    • The -w option reports per-process voluntary (cswch/s) and non-voluntary (nvcswch/s) context switch rates; the raw per-process counters can also be read from /proc, as shown after this list.
  4. Check CPU Usage and Load:

    • Monitor CPU utilization to check if high context switching correlates with high CPU usage. Use the top command to identify processes that are consuming high CPU:
      top
      
  5. Check for I/O Wait:

    • High context switching can be caused by I/O-bound processes. Use the iostat command to monitor disk and network I/O:
      iostat -xz 1 5
      
    • Look for high values in the await column (split into r_await and w_await in newer sysstat releases), which indicates the average time requests spend waiting for I/O.
  6. Check for System Resource Contention:

    • Look for resource contention, where multiple processes are competing for CPU or I/O. The dstat command can provide insight into system resource usage:
      dstat -c -d -n
      
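As a supplement to the pidstat output in step 3, the kernel exposes cumulative per-process context switch counters in procfs; replace <PID> with a process identified above. Voluntary switches usually indicate waiting on I/O or locks, while non-voluntary switches indicate CPU contention.
      grep ctxt /proc/<PID>/status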

C3 Remedy Steps

Once the C3 team has collected the necessary information, the following actions should be taken to resolve high context switching:

  1. Identify Resource-Hungry Processes:

    • Use top, htop, or pidstat to identify processes consuming excessive resources, which may cause high context switching. Consider adjusting or terminating these processes if necessary.
      kill <PID>
      
    • Replace <PID> with the process ID of the offending process.
  2. Optimize Application Code:

    • Review the application code for any inefficiencies that may lead to excessive context switching, such as tight loops, inefficient thread management, or poor process synchronization.
  3. Increase CPU Resources:

    • If the system is under heavy load, consider scaling the infrastructure by adding more CPU resources or distributing the workload across more servers.
  4. Review System Resource Limits:

    • Check whether limits on the number of processes, threads, or open files are being hit. Use ulimit -a to view the current session’s resource limits and adjust if necessary.
  5. Optimize I/O Operations:

    • If the high context switching is due to I/O-bound processes, optimize the I/O operations. This could involve tuning the disk subsystem, improving network configurations, or optimizing database queries.
  6. Reduce Process and Thread Creation:

    • High context switching often occurs when many short-lived processes or threads are created. Consider pooling threads or optimizing process creation to reduce overhead.
  7. Review System Scheduler Configuration:

    • The kernel scheduler might be handling context switching inefficiently in some cases. Review the scheduler configuration (e.g., CFS scheduler) to ensure it is appropriately tuned for the workload.
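For step 7, a minimal, read-only way to inspect scheduling behavior is to list the scheduler tunables exposed by the running kernel and to check the scheduling policy of a suspect process with chrt (from util-linux). Which tunables exist varies by kernel version, and policy changes should only be made after careful testing; <PID> is a placeholder for a process identified during data collection.
      sysctl -a 2>/dev/null | grep '^kernel.sched'
      chrt -p <PID>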

Dependent Metrics and Checks

When high context switching is detected, the following metrics and system settings should also be monitored:

  1. CPU Usage:

    • High context switching is often correlated with high CPU usage. Continuously monitor CPU utilization using top, htop, or vmstat.
  2. Disk I/O:

    • High context switching may be caused by excessive disk I/O. Monitor disk performance using iostat and look for signs of I/O bottlenecks or high wait times.
  3. Process Behavior:

    • Monitor the number of processes and threads being created and destroyed in the system. Excessive creation of threads or processes can lead to high context switching.
  4. System Load:

    • High system load can increase the likelihood of high context switching. Use uptime or top to monitor load averages and CPU utilization.
  5. Memory Usage:

    • Check memory usage and swap usage, as memory contention can also lead to high context switching. Use free -m and vmstat to monitor memory status.

DevOps Remediation Steps

If C3 team actions do not resolve the high context switching issue, the DevOps team should consider the following steps:

  1. Scale the Infrastructure:

    • If the server is under heavy load, consider scaling up by adding more CPU resources or scaling out by distributing the workload across multiple servers.
  2. Optimize System Configuration:

    • Review and optimize system configuration settings, such as vm.swappiness, cgroup CPU shares (cpu.shares in cgroup v1, cpu.weight in cgroup v2), or process limits, to ensure efficient resource allocation (a vm.swappiness example follows this list).
  3. Investigate Application-Level Issues:

    • If the issue is application-specific, investigate the application’s thread or process management. Consider code optimization to reduce contention and improve thread management.
  4. Tune the Kernel Scheduler:

    • The kernel scheduler controls how processes are assigned CPU time. Depending on the workload, it may be beneficial to tweak scheduling settings for optimal context switching behavior.
  5. Monitor Long-Term Trends:

    • Set up continuous monitoring to track context switching, CPU usage, disk I/O, and load over time to detect any recurring patterns and prevent future issues.
  6. Optimize Database Queries:

    • If the context switching is related to database queries, optimize the queries or database configuration to reduce resource contention and improve performance.
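As an illustration of step 2, vm.swappiness controls how aggressively the kernel swaps application memory. A minimal sketch of inspecting the current value and lowering it follows; 10 is only an example value and should be validated against the workload, and the setting must be added to /etc/sysctl.conf or a file under /etc/sysctl.d/ to persist across reboots.
      sysctl vm.swappiness
      sudo sysctl -w vm.swappiness=10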

High Open File Descriptors Alert

The High Open File Descriptors alert is triggered when the number of open file descriptors on the system exceeds a specified threshold. File descriptors are used by the operating system to manage files, network sockets, and other input/output resources. An unusually high number of open file descriptors can indicate resource leaks or excessive file/socket usage by applications.

Excessive open file descriptors can lead to resource exhaustion, causing the system or application to become unresponsive or fail to open new files or network connections.

Affects:

  • Application Failures: If an application exceeds the maximum number of allowed file descriptors, it may fail to open files, sockets, or other I/O resources.
  • System Resource Exhaustion: Excessive file descriptors can consume system resources, leading to slower performance, crashes, or unexpected behavior.
  • Degraded System Performance: High file descriptor usage can lead to increased memory consumption and may result in higher system load.

C3 Data Collection

When the High Open File Descriptors alert is triggered, the C3 team should collect the following information to diagnose the issue:

  1. Check Current Open File Descriptors:

    • Use the lsof command to list open file descriptors across the system. Treat the total as an approximation, since lsof also lists memory-mapped files and can repeat entries per thread:
      lsof | wc -l
      
    • Alternatively, check the open file descriptors for specific processes:
      lsof -p <PID>
      
    • Replace <PID> with the process ID to check open file descriptors for a specific process.
  2. Check Open File Descriptor Limits:

    • Check the open file descriptor limit for the current shell session by running:
      ulimit -n
      
    • This shows the per-process soft limit for the current user session rather than a true system-wide value; the kernel-wide maximum is defined in /proc/sys/fs/file-max (see the procfs check after this list). If the limit is low, it may need to be increased.
  3. Check Per-Process Open File Descriptor Count:

    • To identify which processes are consuming a large number of file descriptors, use the following command:
      lsof | awk '{print $1}' | sort | uniq -c | sort -n
      
    • This will give you a list of processes and the number of file descriptors they are using, ordered by the number of file descriptors.
  4. Check System Logs for Errors:

    • Look for system log entries related to file descriptor exhaustion. The kernel and most applications report this as “Too many open files”; search for it with dmesg or journalctl:
      dmesg | grep -i "too many open files"
      
    • Or:
      journalctl -xe | grep -i "too many open files"
      
  5. Check Resource Usage by Application:

    • If specific applications are causing the high file descriptor count, monitor their resource usage with top or htop to see if they are consuming excessive CPU or memory resources.
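To complement step 2, the kernel's own accounting of allocated versus maximum file handles can be read directly from procfs. The first field of file-nr is the number of handles currently allocated; compare it with the limit reported by file-max:
      cat /proc/sys/fs/file-nr
      cat /proc/sys/fs/file-max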

C3 Remedy Steps

Once the C3 team has collected the necessary data, they can take the following steps to resolve high open file descriptor usage:

  1. Identify Processes with Excessive File Descriptors:

    • Use lsof or the command mentioned earlier to identify processes consuming an unusually high number of file descriptors. If any processes are found, investigate their behavior to determine if they are leaking file descriptors or using resources inefficiently.
  2. Increase Open File Descriptor Limit:

    • If the limit is being reached, consider raising it. For the current shell session (and processes started from it), run:
      ulimit -n <new_limit>
      
    • Replace <new_limit> with the desired maximum number of open file descriptors. Note that this change is temporary and does not affect already-running services; persistent, system-wide changes are covered in the DevOps remediation steps.
  3. Check for File Descriptor Leaks:

    • Investigate the application code or system processes for file descriptor leaks. Ensure that file descriptors are being properly closed when no longer needed.
  4. Restart Problematic Applications:

    • If a specific application is consuming excessive file descriptors, restarting the application can help free up unused file descriptors:
      systemctl restart <service-name>
      
    • Replace <service-name> with the service name, such as mysql, nginx, or apache2. After the restart, confirm that the process picked up the expected limit (see the check after this list).
  5. Review Application Configuration:

    • Review the configuration of the application consuming excessive file descriptors. Consider optimizing the application to use fewer file descriptors or increase connection pooling to manage resources more efficiently.
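After raising a limit (step 2) or restarting a service (step 4), confirm the limit the running process actually has, since ulimit changes do not apply to already-running daemons; replace <PID> with the process ID of the service:
      grep "Max open files" /proc/<PID>/limits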

Dependent Metrics and Checks

When high open file descriptors are detected, the following metrics and system checks should also be monitored:

  1. System Load:

    • High open file descriptors can indicate resource contention, leading to increased system load. Use uptime or top to monitor the system load.
  2. Memory Usage:

    • If a process is using a large number of file descriptors, it may also be consuming excessive memory. Monitor memory usage using free -m or vmstat.
  3. CPU Usage:

    • Check for high CPU usage in the processes that are using excessive file descriptors. Use top or htop to monitor CPU utilization.
  4. Disk I/O:

    • File descriptors are associated with I/O operations. Monitor disk I/O using iostat or dstat to identify any performance bottlenecks related to disk access.
  5. Application Resource Usage:

    • Monitor the resource usage of applications that are opening large numbers of file descriptors. This can be done using top, htop, or custom application monitoring tools.

DevOps Remediation Steps

If C3 team actions do not resolve the high open file descriptor issue, the DevOps team should consider the following steps:

  1. Scale Infrastructure:

    • If the application is under heavy load and requires more file descriptors, consider scaling the infrastructure by adding more resources or distributing the load across additional servers.
  2. Optimize Application Code:

    • Work with the development team to optimize the application to ensure that file descriptors are being closed properly and that the number of open file descriptors is minimized.
  3. Increase File Descriptor Limits:

    • If the system requires more open file descriptors, consider permanently increasing the limits by modifying /etc/security/limits.conf to raise the soft and hard nofile values (these apply to PAM login sessions; services managed by systemd are covered in the note after this list):
      * soft nofile 65536
      * hard nofile 65536
      
  4. Check for System Configuration Issues:

    • Ensure that system-level settings related to file descriptors (such as limits and quotas) are configured appropriately for the expected workload.
  5. Monitor Long-Term Trends:

    • Set up long-term monitoring for open file descriptors to track trends and proactively address issues before they cause performance degradation or system failures.
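Note for step 3: limits set in /etc/security/limits.conf apply to PAM login sessions and do not affect services started by systemd. For a systemd-managed service, the limit is typically raised with a unit override; the value below simply mirrors the example above, and <service-name> is a placeholder:
      sudo systemctl edit <service-name>
      # add under the [Service] section:
      #   LimitNOFILE=65536
      sudo systemctl restart <service-name>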

Filesystem Read-Only State Alert

The Filesystem Read-Only State alert is triggered when the system’s filesystem has entered a read-only state. This usually occurs when there is a serious issue with the filesystem, such as corruption, hardware failure, or a kernel panic. A filesystem in read-only mode cannot accept write operations, meaning that any applications or processes that need to write to the disk will fail.

This alert indicates that the system is unable to perform critical operations such as writing logs, saving data, or updating files, which can significantly impact system performance and reliability.

Affects:

  • Application Failures: Applications that require write access to the disk will fail to function properly, potentially causing downtime or data loss.
  • System Instability: A read-only filesystem can prevent system updates, logging, and configuration changes, leading to a compromised system state.
  • Data Corruption: The remount to read-only is usually the kernel’s response to detected errors, so data written around the time of the failure may already be corrupted, and there is a further risk of corruption while restoring the filesystem.

C3 Data Collection

When the Filesystem Read-Only State alert is triggered, the C3 team should collect the following information to diagnose the issue:

  1. Check the Filesystem State:

    • Use the mount command to check whether any filesystem is mounted in read-only mode:
      mount | grep "ro,"
      
    • This lists filesystems whose mount options contain ro. Note that the simple pattern can also match harmless options such as errors=remount-ro, so confirm that ro appears as a standalone mount option (for example, in the options column of /proc/mounts). If the root filesystem or another important filesystem is genuinely read-only, it confirms the issue.
  2. Check System Logs for Filesystem Errors:

    • Look at the system logs to find any error messages related to the filesystem or disk issues. Run:
      dmesg | grep -i error
      
    • Or check the journal logs for relevant error messages:
      journalctl -xe | grep -i "read-only"
      
  3. Check Disk Health and SMART Status:

    • Use smartctl (from the smartmontools package) to check the health of the disk, as physical disk failures are a common cause of read-only filesystems:
      sudo smartctl -a /dev/sda
      
    • Replace /dev/sda with the appropriate disk device.
  4. Check Filesystem for Errors:

    • Use fsck (filesystem check) to scan and repair filesystem errors. Run it only against an unmounted partition or from rescue/single-user mode; running fsck on a mounted, in-use filesystem can cause further damage. For example:
      sudo fsck /dev/sda1
      
    • Replace /dev/sda1 with the appropriate partition.
  5. Check for Disk Space Issues:

    • Verify if the disk is full, which could cause the system to mount the filesystem as read-only:
      df -h
      
  6. Check for Pending Kernel or System Errors:

    • If the filesystem was mounted as read-only due to kernel panics or crashes, the kernel logs may provide insights:
      dmesg | tail -n 50
      

C3 Remedy Steps

Once the C3 team has collected the necessary data, they can take the following steps to resolve the filesystem read-only state:

  • Treat this as a critical issue and notify the DevOps team immediately.

Dependent Metrics and Checks

When the Filesystem Read-Only State alert is triggered, the following metrics and system checks should also be monitored:

  1. Disk I/O:

    • A read-only filesystem can impact disk I/O operations. Monitor disk I/O performance using tools like iostat, iotop, or dstat to identify any disk bottlenecks.
  2. System Load:

    • High system load can result from failed applications or processes unable to write to the disk. Check the system load with uptime or top.
  3. Disk Space:

    • Verify disk space usage with df -h to ensure the filesystem is not full, which could have caused the filesystem to become read-only.
  4. Memory Usage:

    • Monitor memory usage to ensure that the system has sufficient memory and that excessive swapping is not causing issues related to the filesystem.
  5. System Logs:

    • Continue to monitor system logs (/var/log/syslog, /var/log/messages, or journalctl) for additional filesystem or disk errors.

DevOps Remediation Steps

If C3 team actions do not resolve the filesystem read-only issue, the DevOps team should consider the following steps:

  1. Replace the Failing Disk:

    • If disk health issues are detected, replace the failing disk and restore the data from backups. Ensure that the new disk is properly configured and integrated into the system.
  2. Scale Infrastructure:

    • If the disk is full due to increased usage, consider scaling the infrastructure to add more disk capacity or migrate data to another server or storage system.
  3. Reconfigure Disk and Filesystem:

    • If the read-only state is due to filesystem misconfiguration, the DevOps team should reconfigure the filesystem or consider a more robust filesystem (e.g., XFS or ext4) if necessary. Once the underlying fault is fixed, the filesystem can be remounted read-write (see the example after this list).
  4. Review System and Kernel Configurations:

    • Ensure that the kernel is properly configured to handle disk errors and filesystem corruption. Review system configurations related to disk mounting and recovery.
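Once the underlying fault has been corrected (disk replaced, filesystem repaired, or space freed), the affected filesystem can usually be returned to service either by rebooting or by remounting it read-write. The root mount point / is used below only as an example; attempt the remount only after the errors that triggered the read-only state have been resolved, otherwise the kernel may immediately switch it back to read-only.
      sudo mount -o remount,rw /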