When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.
For reference on dashboard panels, metrics, operating ranges, and severity levels, refer to the provided Google Sheets link.
For Grafana dashboard panels and their respective explanations, refer to the Link.
Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
| Dashboard Row | Panel | Panel Description | Query | Query Description | Query Operating Range | Metric | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Quick CPU / Mem / Disk | ResourcePressure | The percentage of process time spent waiting for resources in the last X time interval | `irate(node_pressure_cpu_waiting_seconds_total)` | How fast the waiting time is increasing over a very recent period due to CPU, memory, and IO congestion | 0-15% | `node_pressure_cpu_waiting_seconds_total` | | +ve values | 10% | 5% | < 5% |
| | | | `irate(node_pressure_memory_waiting_seconds_total)` | | 0-5% | `node_pressure_memory_waiting_seconds_total` | | | > 5% | 0-5% | 0-1% |
| | | | `irate(node_pressure_io_waiting_seconds_total)` | | 0-17% | `node_pressure_io_waiting_seconds_total` | | | > 10% | 3% - 10% | 0 - 3% |
| | CPU Busy | Percentage of CPU time spent not idle | `100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])))` | Percentage of the average time each CPU spent not idle | 0-80% | `node_cpu_seconds_total` | Counts the total time in seconds each CPU core has spent in each mode | +ve values | > 90% | 80% - 90% | 0 - 80% |
| | System Load | System load over all CPU cores together | `scalar(node_load1) * 100 / count(count(node_cpu_seconds_total) by (cpu))` | System load average per CPU, expressed as a percentage | 0 to > 100% | `node_load1` | Load average shows how system performance is evolving through different time ranges | Dependent on CPU count (0-3.0) | > 90% | 80% - 90% | 0 - 80% |
| | RAM Used | Used memory in percentage | `((node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100` | The percentage of used memory | 0-100% | `node_memory_MemTotal_bytes` | Total bytes of memory | +ve values | > 90% | 75% - 90% | 0% - 75% |
| | | | | | | `node_memory_MemFree_bytes` | Available (free) bytes of memory | +ve values | | | |
| | SWAP Used | Used swap in percentage | `((node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes) * 100` | The percentage of swap used | 0% - 100% | `node_memory_SwapTotal_bytes` | Total swap bytes | +ve values | > 25% | 10% - 25% | 0% - 10% |
| | | | | | | `node_memory_SwapFree_bytes` | Available swap bytes | +ve values | | | |
| | Root FS Used | Root file system used percentage | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})` | The percentage of the root file system (mounted on /) that is used | 0-100% | `node_filesystem_avail_bytes` | Available bytes in the file system | +ve values | Dependent, > 90% | 80% - 90% | 0% - 80% |
| | | | | | | `node_filesystem_size_bytes` | Total bytes in the file system | +ve values | | | |
| | CPU Cores, Root FS Total, Uptime, SWAP Total, RAM | —— INFO panels (no alerting thresholds) —— | | | | | | | | | |
| System Misc | Context Switches / Interrupts | The rate of context switches and hardware interrupts, indicating CPU multitasking efficiency and hardware activity | `irate(node_context_switches_total[$__rate_interval])` | Calculates the per-second rate of context switches on a node | 4K - 9K switches per second | `node_context_switches_total` | Measures the total number of context switches performed by the CPU | +ve values | Dependent, > 3k for a single CPU | 2k for a single CPU | 0 - 2k |
| Node Status | Node Down | The node is down | `sum(up{instance=~"pssb.*"}) < 5` | Fires when the number of nodes in the running state drops below the expected count | 1 | `up` | The metric value is 1 when the node is in the running state | 0 or 1 | 0 | NA | 1 |
| Storage Disk | Disk Space Used Basic | Disk space used on all mounted file systems | `100 - ((node_filesystem_avail_bytes{device!~"rootfs"} * 100) / node_filesystem_size_bytes{device!~"rootfs"})` | Calculates the percentage of used storage for each mount | 1-100% | `node_filesystem_avail_bytes` | Available space in each mounted file system | Dependent | Dependent, > 90% | Dependent, 80% - 90% | < 80% |
| | | | | | | `node_filesystem_size_bytes` | Total size of the file system in bytes, i.e. the total storage capacity of the disk or partition | | | | |
| | Disk IOps Completed | The number of IO requests (after merges) completed per second for the device | `irate(node_disk_reads_completed_total[$__rate_interval])` | Calculates the rate of disk reads completed per second on a specific node over a defined interval | Dependent | `node_disk_reads_completed_total` | Count of disk read operations completed on a disk | Dependent | Dependent, > 200 reads/sec | Dependent, 150 - 200 reads/sec | Dependent, < 150 reads/sec |
| | Disk R/W Data | The number of bytes written to or read from the storage device per second | `irate(node_disk_written_bytes_total[$__rate_interval])` | Measures the per-second rate of bytes written to disk, helping to detect spikes in write activity | Dependent | `node_disk_written_bytes_total` | Total number of bytes written to disk over time | 0 to +ve | > 200 IOPS | 150 - 200 IOPS | 0 - 150 IOPS |
| | | | `irate(node_disk_read_bytes_total[$__rate_interval])` | Measures the per-second rate of bytes read from disk, useful for identifying sudden increases in read demand | Dependent | `node_disk_read_bytes_total` | Total number of bytes read from disk over time | 0 to +ve | | | |
| Storage File System | File Descriptors | Displays the count of open file descriptors on the system, helping monitor resource usage | `node_filefd_maximum` | Shows the maximum number of file descriptors available to the system, indicating the limit on simultaneously open files | Dependent | `node_filefd_maximum` | - | 0 to +ve | > 4500 | 4000 - 4500 | < 4000 |
| | | | `node_filefd_allocated` | Represents the current count of file descriptors in use, helping to track resource consumption and identify when limits are near | Dependent | `node_filefd_allocated` | - | 0 to +ve | | | |
| | File System in Read Only / Error | Indicates if a file system has switched to a read-only state or encountered errors, alerting to potential disk failures or permission issues impacting data writes | `node_filesystem_readonly{device!~"rootfs"}` | Indicates whether a file system is mounted in read-only mode, typically due to disk errors or file system issues | 0 | | | | | | |
When `100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="$node"}[$__rate_interval])))` shows increased CPU busy time, it indicates that the system is spending more time processing tasks rather than sitting idle. High CPU busy time can lead to system slowdowns and degraded performance.

To gather data, the C3 team should:

- Use `top`, `htop`, or `uptime` to check the current CPU load.
- Use `mpstat -P ALL` to see if any particular CPU core is being overwhelmed.
- Use `ps aux --sort=-%cpu` or `top` to view the CPU usage of processes.
- Review the load average metrics (`node_load1`, `node_load5`, `node_load15`), which can help correlate CPU usage with overall system load.

When CPU busy time is high, the following metrics can provide additional context:

- `irate(node_pressure_cpu_waiting_seconds_total)` helps determine if there is CPU contention, which could explain the increased CPU busy time.
- `node_procs_running` and `node_procs_blocked` can show how many processes are running or waiting for CPU resources.
- `node_disk_io_time_seconds_total` and `node_disk_reads_completed_total` can indicate if high CPU usage is due to I/O-bound tasks.
- `node_pressure_memory_waiting_seconds_total` can indicate if high CPU usage is a result of memory contention.

As a remedy, the C3 team should:

- Use `top`, `htop`, or `ps aux` to find out which processes are consuming the most CPU.
- Monitor the load averages (`node_load1`, `node_load5`, `node_load15`) to determine if high CPU usage is affecting overall system performance.

If C3 and DevOps actions do not resolve the issue, further analysis of system performance or application behavior may be needed to uncover deeper root causes.
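A minimal triage sketch that bundles the data-collection commands above into a single pass over SSH; the script name, ordering, and PSI check are illustrative assumptions rather than part of the official runbook (mpstat requires the sysstat package):

```bash
#!/usr/bin/env bash
# cpu_triage.sh - quick CPU snapshot for a node reporting high CPU busy time (illustrative sketch).
set -euo pipefail

echo "== Load averages and uptime =="
uptime

echo "== Per-core utilisation (1 sample over 1 second) =="
mpstat -P ALL 1 1

echo "== Top 10 processes by CPU =="
ps aux --sort=-%cpu | head -n 11

echo "== CPU pressure (PSI), if exposed by the kernel =="
cat /proc/pressure/cpu 2>/dev/null || echo "PSI not available on this kernel"
```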
If the C3 team has already followed the data collection and remediation steps but CPU pressure persists, the DevOps team can take additional actions:

- Add More Resources
- Optimize Application Code
- Rebalance Workloads
Confused about the difference between system load and CPU busy?

System load (CPU load) is based on the number of processes waiting for the CPU to be allocated, averaged over the last 1, 5, and 15 minutes. CPU busy is the percentage of CPU time that is not idle, i.e. time the CPU spends executing processes. CPU busy is computed from the metric `node_cpu_seconds_total`, while system load uses `node_load1`, `node_load5`, and `node_load15`, normalised by the CPU count (which is itself derived from `node_cpu_seconds_total`).

When the query `scalar(node_load1{instance="$node",job="$job"}) * 100 / count(count(node_cpu_seconds_total{instance="$node",job="$job"}) by (cpu))` is used, it calculates the system load as a percentage. It reflects how much CPU is needed to handle the load, considering the number of CPUs available. An increased value suggests that the system is under heavy load, potentially resulting in CPU contention or resource saturation.

- Idle time: idle time is inversely related to CPU load; when idle time increases, CPU load decreases, and vice versa.
- User time and system time: user time and system time directly indicate CPU load. The sum of user time, system time, and idle time equals 100% of the CPU time. Higher user and system time values indicate a higher CPU load.
- Wait or I/O wait time: I/O wait time refers to periods where the CPU is idle and waiting for an I/O operation to complete. This increases the CPU load, as more processes wait for the CPU while it is waiting for the I/O to finish.
- Steal time: the percentage of time a virtual CPU involuntarily waits for a real CPU while the hypervisor is servicing another virtual CPU.
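As a rough illustration of the same normalisation the dashboard query performs, the load average can be compared with the core count directly on the node; the script name and the 90% warning threshold below are assumptions for the sake of the example:

```bash
#!/usr/bin/env bash
# load_vs_cores.sh - express the 1-minute load average as a percentage of available cores
# (illustrative sketch; the 90% warning threshold is an assumption).
set -euo pipefail

cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)

# Mirror the Grafana query: load * 100 / number_of_cpus
pct=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.0f", l * 100 / c }')
echo "1-min load: $load1 on $cores cores -> ${pct}% of capacity"

if [ "$pct" -gt 90 ]; then
  echo "WARNING: load is above 90% of the core count"
fi
```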
To gather data when the System Load alert is triggered, the C3 team should:

- Check System Load: use `uptime` or `top` to review the system load averages over 1 minute, 5 minutes, and 15 minutes. The load values should ideally be less than the number of CPU cores available. The corresponding metrics are `node_load1`, `node_load5`, and `node_load15`.
- CPU Utilization: use `top`, `htop`, or `mpstat -P ALL` to ensure that CPU resources are not fully utilized.
- System Performance: check `dmesg` or `journalctl` for any hardware or software failures.

When system load is high, consider reviewing the following metrics for additional context:

- `irate(node_pressure_cpu_waiting_seconds_total)` to check if processes are waiting for CPU time.
- `irate(node_pressure_memory_waiting_seconds_total)` to see if memory constraints are contributing to the load.
- `node_disk_io_time_seconds_total` and `node_disk_reads_completed_total` to check if disk I/O is a bottleneck affecting CPU performance.
- `node_procs_running` to check how many processes are running and whether they are contributing to the system load.

As a remedy, the C3 team should:

- Monitor Running Processes: use `top`, `htop`, or `ps aux` to identify processes that are consuming significant CPU time.
- Terminate or Optimize Processes: stop, restart, or tune the offending processes if it is safe to do so.
- System Load Analysis: re-check the load averages after taking action to confirm the load is dropping.
When RAM usage is high, the C3 team needs to collect the following data by logging into the server to gather more context:

- Check Memory Usage: `free -m`
- Check Active Processes: `top`, or `htop` for a more interactive interface.
- Check Swap Usage: `swapon -s`
- Check System Logs: `dmesg | grep -i memory`; on a systemd-based distribution, use `journalctl`: `journalctl -xe | grep -i memory`
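The checks above can be bundled into a single pass; a minimal sketch (the script name and output layout are assumptions, not part of the official runbook):

```bash
#!/usr/bin/env bash
# mem_snapshot.sh - gather RAM-related evidence in one pass (illustrative sketch).
set -euo pipefail

echo "== Memory and swap summary =="
free -m

echo "== Top 10 processes by resident memory =="
ps aux --sort=-%mem | head -n 11

echo "== Active swap devices =="
swapon -s

echo "== Recent memory-related kernel messages =="
dmesg | grep -i memory | tail -n 20 || true
```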
Once the C3 team has gathered the data, they can take the following steps to remedy high RAM usage:

- Identify Memory-Hungry Processes: use the `top` or `htop` command to check which processes are consuming the most memory. If any processes are using more memory than expected, consider stopping or restarting them with `kill <PID>`, replacing `<PID>` with the process ID of the memory-hungry process. Be cautious and ensure that killing the process will not disrupt critical operations.
- Free Up Cached Memory: `sync; echo 3 > /proc/sys/vm/drop_caches`
- Check Swap Usage.
- Restart Services or Applications: `systemctl restart <service-name>`, replacing `<service-name>` with the name of the service, such as `apache2`, `mysql`, or `nginx`.
- Increase Swap Space (if necessary): if the system continues to run out of memory, more swap space may be needed (for example, by creating a swap file). Inform the DevOps team.

In the event of high RAM usage, the following should also be monitored:

- Check System Load: `uptime`
- Check CPU Usage: `top`
- Check Disk I/O: `iostat`

If C3 actions do not resolve the high RAM usage, the DevOps team should take the following steps:

- Increase Physical Memory.
- Review and Adjust System Configuration: tune kernel parameters such as `vm.swappiness`, which controls the kernel's tendency to swap, to reduce swap usage when physical memory is running low.
- Scale Resources.
- Check for Memory Leaks.
- Optimize Memory Usage at the Application Level.
Swap is a portion of your instance’s disk that is reserved for the operating system to use when the available RAM has been utilized. As it uses the disk, Swap is slower to access and is generally used as a last resort.
Swap can be used even if your instance has plenty of RAM.
High Swap is concerning if your instance is using all of the available RAM (i.e. consistently using more than 75%).
Affects:
When swap usage is high, it indicates that the system is running out of physical memory and is resorting to swap space, which can negatively impact system performance. The C3 team should perform the following actions via SSH login to the server to gather more information:

- Check Current Swap Usage: `swapon -s`; look at the `Used` column to see if swap usage is high.
- Check Overall Memory Usage: `free -m`
- Check Swap Usage Percentage: `vmstat 1 5`; the `si` (swap in) and `so` (swap out) columns show how much data is being swapped in and out, which indicates swap activity.
- Check System Logs for Swap-Related Issues: `dmesg | grep -i swap`; on systemd-based systems, use `journalctl` to search for swap-related messages: `journalctl -xe | grep -i swap`
- Identify High Memory Processes: use `top` or `htop` to identify processes that are consuming excessive memory. Processes with high memory usage can contribute to swap usage.

Once the C3 team has collected the data, they can take the following steps to resolve high swap usage:

- Identify Memory-Hungry Processes: use `top` or `htop` to identify processes consuming excessive memory. If certain processes are found, consider terminating or restarting them with `kill <PID>`, replacing `<PID>` with the process ID of the offending process. Be cautious about terminating critical processes.
- Free Up Cached Memory: `sync; echo 3 > /proc/sys/vm/drop_caches`
- Restart Memory-Hungry Applications: `systemctl restart <service-name>`, replacing `<service-name>` with the name of the service, such as `mysql`, `nginx`, or `apache2`.
- Increase Physical Memory (if possible).
- Adjust Swappiness: the `vm.swappiness` parameter controls the kernel's preference for swapping. By default it is set to 60, which is a balanced approach. You can decrease this value to make the kernel less likely to use swap: `sysctl vm.swappiness=30`. To make the change persistent, add the following line to `/etc/sysctl.conf`: `vm.swappiness = 30`
When swap usage is high, the following metrics should also be monitored to understand the impact:

- System Load: `uptime`
- Memory Usage: check the `free -m` output. If free memory is low and swap usage is high, it indicates that the system has insufficient RAM for its workload.
- CPU Usage: check CPU usage (`top` or `htop`) to determine if the system is spending too much time swapping (i.e., CPU waiting for I/O operations). High CPU usage coupled with high swap usage may suggest disk I/O bottlenecks.
- Disk I/O: `iostat`
- Page Faults: run `vmstat 1 5` and check the `pgpgin` (page ins) and `pgpgout` (page outs) columns.
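A small sketch for sampling swap activity over a short window; the script name is an assumption, and the counters read from `/proc/vmstat` are shown only as supporting evidence:

```bash
#!/usr/bin/env bash
# swap_activity.sh - sample swap-in/swap-out activity for a short window (illustrative sketch).
set -euo pipefail

echo "== Swap devices and usage =="
swapon -s

echo "== si/so columns over 5 one-second samples =="
vmstat 1 5

echo "== Cumulative paging and swapping counters =="
grep -E '^(pgpgin|pgpgout|pswpin|pswpout) ' /proc/vmstat
```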
If C3 actions do not resolve the high swap usage issue, the DevOps team should consider the following:

- Increase Physical Memory.
- Optimize System Configuration: for example, tune the `vm.swappiness` value or adjust resource limits for applications.
- Scale Infrastructure.
- Investigate Memory Leaks.
- Monitor Disk I/O Performance.
A Node Down Alert indicates that a server or node is no longer accessible within the network, which can affect application availability, data accessibility, and service continuity.
When a Node Down Alert is triggered, the C3 team should follow these steps to collect information on the status of the node (a combined reachability sketch follows this list):

- Verify Node Reachability: `ping <node_ip>`
- Check SSH Connectivity: `ssh <user>@<node_ip>`
- Check Application Monitoring System.
- Analyze Network and Power Status.
- Identify Last Log Entries: `journalctl -xe --since "5 minutes ago"`
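A minimal reachability sketch that combines the ping and SSH checks above; the script name, arguments, and the TCP port-22 probe are illustrative assumptions:

```bash
#!/usr/bin/env bash
# node_reachability.sh - quick reachability probe for a node reported as down
# (illustrative sketch; <node_ip> and <user> are placeholders).
set -euo pipefail

NODE_IP="${1:?usage: node_reachability.sh <node_ip> [user]}"
SSH_USER="${2:-root}"

echo "== ICMP check =="
ping -c 3 -W 2 "$NODE_IP" || echo "ping failed: node unreachable or ICMP blocked"

echo "== SSH port check =="
timeout 5 bash -c "cat < /dev/null > /dev/tcp/$NODE_IP/22" \
  && echo "TCP 22 open" || echo "TCP 22 closed or filtered"

echo "== SSH login check =="
ssh -o ConnectTimeout=5 -o BatchMode=yes "$SSH_USER@$NODE_IP" uptime \
  || echo "SSH login failed"
```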
Once reachability has been checked, the C3 team can attempt the following remediation:

- Restart Node Services (if SSH is accessible): `sudo systemctl restart <service-name>`, replacing `<service-name>` with the main service running on the node (e.g., `apache2`, `nginx`).
- Reboot the Node.
- Escalate to Network and Hardware Teams.

When a node is down, review these metrics to assess impact:

- System Load on Neighboring Nodes: `uptime`
- Database Cluster Status.
- Network and Firewall Logs.
- Storage Cluster Health.

If C3 actions don't resolve the Node Down Alert, the DevOps team should consider these options:

- Reallocate Workloads.
- Perform Hardware Diagnostics.
- Check Virtualization Platform.
- Plan Node Replacement.
- Update Alerting and Documentation.
A Root File System Filling Alert is triggered when the root (`/`) file system reaches a high usage threshold, potentially leading to critical performance degradation, service disruptions, or an inability to store necessary data.
When a Root File System Filling Alert is triggered, the C3 team should perform the following steps to gather information on disk usage:

- Check Disk Usage: use the `df` command to check the usage of the root file system: `df -h /`. Check the `Use%` column to determine how close it is to the threshold (e.g., 80%, 90%).
- Identify Largest Files and Directories: `du -ahx / | sort -rh | head -20`
- Examine Log Files: logs in `/var/log` can often grow unexpectedly. Use this command to identify large logs: `du -sh /var/log/*`
- Check Inode Usage: `df -i /`
- Review System Logs for Errors: `dmesg | grep -i "disk full"`; on systemd-based systems, use `journalctl` to search for disk-related warnings or errors: `journalctl -xe | grep -i "disk full"`
Once the C3 team has collected the necessary data, follow these steps to clear disk space:

- Clear Log Files: clear or shrink large logs under `/var/log`. If logs are critical, compress them to save space; otherwise truncate them: `truncate -s 0 /var/log/<large-log-file>` (for example, truncate `auth.log.1` instead of deleting `auth.log`), replacing `<large-log-file>` with the filename. Be careful not to remove logs without proper backup.
- Delete Unnecessary Files: remove unneeded files in `/tmp` or user-specific directories: `rm -rf /tmp/*`
- Clear Package Cache: `sudo apt-get clean`
- Archive and Move Old Data: `tar -czf /path/to/backup.tar.gz /path/to/old-files`
When the root file system is filling up, monitor the following metrics to understand the potential impact:

- Disk Usage on Other Filesystems: ensure other mount points (e.g., `/home`, `/var`, `/tmp`) are not approaching capacity: `df -h`
- Application Logs for Errors.
- System Performance Metrics: `top`
- I/O Performance: `iostat`

If C3 actions do not sufficiently reduce disk usage, the DevOps team should consider the following:

- Add More Disk Space.
- Implement Log Rotation: `sudo nano /etc/logrotate.conf` (a sample rotation policy follows this list).
- Optimize Application Configurations.
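A sample log-rotation policy for a fast-growing application log; the path, schedule, and retention count are assumptions to adapt to the actual service:

```bash
# Illustrative sketch: create a logrotate policy and dry-run it.
sudo tee /etc/logrotate.d/myapp > /dev/null <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
EOF

# Dry-run the configuration to confirm it parses before relying on it.
sudo logrotate --debug /etc/logrotate.d/myapp
```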
A Node Reboot Alert is triggered when a monitored server has unexpectedly restarted. This can be due to various issues, including system crashes, power failures, or kernel panics. A node reboot can lead to temporary service interruptions, application downtime, and potential data inconsistencies.
When a Node Reboot Alert is triggered, the C3 team should perform the following checks and gather relevant information:
- Confirm the Reboot: `last -x | grep reboot | head -n 1`
- Check System Uptime: `uptime`
- Review System Logs for Reboot Cause: `journalctl -b -1`
- Examine Kernel Logs: `dmesg | grep -i "panic\|error\|fatal"`
- Check Power Management and UPS Logs (if applicable): `cat /var/log/ups.log | grep -i "power failure"`

Once the C3 team has collected data, they can follow these steps to mitigate any issues caused by the unexpected reboot:

- Restart Critical Services: `systemctl start <service-name>`, replacing `<service-name>` with the specific service (e.g., `apache2`, `mysql`).
- Verify System Health: `top -b -n 1 | head -n 10`
- Check Application Logs: `tail -n 50 /var/log/<application-log>`, replacing `<application-log>` with the relevant log file for the service.
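A brief post-reboot verification sketch combining the checks above; the script name and the failed-units check are assumptions:

```bash
#!/usr/bin/env bash
# post_reboot_check.sh - confirm when a node rebooted and verify basic health afterwards
# (illustrative sketch).
set -euo pipefail

echo "== Last reboot record =="
last -x | grep reboot | head -n 1

echo "== Current uptime and load =="
uptime

echo "== Final log lines from the previous boot (requires a persistent journal) =="
journalctl -b -1 -n 30 --no-pager || echo "previous boot logs not available"

echo "== Failed systemd units after the reboot =="
systemctl --failed --no-pager
```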
After a node reboot, monitor these metrics to ensure continued stability:

- Service Status.
- System Load: `uptime`
- Memory Usage: `free -m`
- Disk Health: `smartctl -H /dev/sda`, replacing `/dev/sda` with the relevant disk device name.
- Network Connectivity: `ip a`

If the cause of the reboot is not apparent or is due to a recurring issue, the DevOps team should consider these additional actions:

- Check for Kernel Updates: `sudo apt update && sudo apt install linux-generic`
- Monitor for Hardware Failures: `lshw -C memory`
- Document Findings.
The Disk Space is Less Than 10% Available alert is triggered when the available disk space on a server falls below 10%. This can lead to various issues such as slow performance, application crashes, and failed system operations. Immediate attention is required to prevent critical system failures.
When the Disk Space is Less Than 10% Available alert is triggered, the C3 team should perform the following checks and gather relevant information:
- Check Available Disk Space: `df -h` shows disk usage in a human-readable format (the `-h` flag).
- Identify Specific Disk/Partition: `df -h /dev/sda1`
- List Large Files and Directories: `du -ahx / | sort -rh | head -n 20`. This scans the root (`/`) directory; modify the path if you need to check specific directories.
- Check System Logs: `dmesg | grep -i "disk"`
- Examine Log Files: check `/var/log` or other directories where logs are stored: `ls -lh /var/log`
- Check for Unused or Temporary Files: `find /tmp -type f -exec ls -lh {} \; | sort -rh | head -n 10`
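A compact disk-usage report that rolls the checks above into one pass; the script name and the top-20 cut-off are arbitrary assumptions:

```bash
#!/usr/bin/env bash
# disk_usage_report.sh - summarise where space is going on the root filesystem (illustrative sketch).
set -euo pipefail

echo "== Filesystem usage =="
df -h /

echo "== Inode usage =="
df -i /

echo "== 20 largest files/directories on / (same filesystem only) =="
du -ahx / 2>/dev/null | sort -rh | head -n 20

echo "== Largest log files =="
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 10
```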
Once the C3 team has collected data, they can follow these steps to resolve the issue:

- Clear Temporary and Cache Files: `rm -rf /tmp/*`
- Rotate and Clean Up Logs: `logrotate -f /etc/logrotate.conf`
- Delete Unnecessary Files or Backups: `rm -rf /path/to/old-backups/*`
- Move Data to Another Disk or Partition: `mv /path/to/large-data /path/to/another-disk`
- Extend Disk Space (if applicable): use a partitioning tool such as `gparted`.
- Check Disk for Errors: `sudo smartctl -a /dev/sda`

When the disk space is low, monitor the following metrics to assess the impact and take preventive measures:

- System Load: `uptime`
- Disk I/O Performance: `iostat`
- Memory Usage: `free -m`
- Log File Sizes: `du -sh /var/log/*`
- Backup and Archive Jobs: `du -sh /path/to/backups`

If C3 team actions do not resolve the disk space issue, the DevOps team should consider the following additional actions:

- Configure Disk Quotas.
- Upgrade Hardware (if needed).
- Implement Data Archiving Solutions.
The High Disk Reads alert is triggered when the disk read activity on a server exceeds a threshold that could indicate potential issues, such as slow disk performance or heavy I/O operations. High disk reads can lead to system slowdowns, especially when coupled with high disk writes or excessive I/O wait times.
When the High Disk Reads alert is triggered, the C3 team should gather the following information to understand the underlying cause and to assist with remediation:
- Check Disk I/O Activity: use `iostat` to check for any abnormalities in disk I/O performance; it shows disk read and write statistics: `iostat -dx 1 5`. Look at the `r/s` (reads per second) and `w/s` (writes per second) columns to identify high read activity.
- Examine Disk Usage: `df -h`
- Monitor Disk Reads Over Time: use `vmstat` to monitor disk read activity over time: `vmstat 1 5`. The `bi` (blocks in) column indicates the number of blocks read from disk.
- Check for High-Read Processes: use `iotop` or `top`: `sudo iotop -o`
- Review System Logs: `dmesg | grep -i "disk"`
- Check for Disk Errors: use `smartctl` to identify any possible errors or issues: `sudo smartctl -a /dev/sda`
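A short read-activity triage sketch built from the commands above; the script name, sample counts, and the iotop batch flags are assumptions (iotop and the sysstat package must be installed):

```bash
#!/usr/bin/env bash
# disk_read_triage.sh - capture read-heavy activity on a node (illustrative sketch).
set -euo pipefail

echo "== Extended device statistics, 5 one-second samples =="
iostat -dx 1 5

echo "== Block-in/block-out overview =="
vmstat 1 5

echo "== Top I/O consumers (batch mode, 3 iterations, active processes only) =="
sudo iotop -obn 3 || echo "iotop not installed or insufficient privileges"
```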
Once the C3 team has gathered the necessary information, the following steps should be taken to address the high disk read activity:

- Identify and Resolve High-Read Processes: use `iotop` or `top` to identify processes that are performing excessive disk reads. If a process is non-essential or consuming an excessive amount of I/O resources, consider stopping or restarting it: `kill <PID>`
- Optimize Database Queries.
- Clear Cache or Temporary Files: `rm -rf /tmp/*`
- Increase Disk I/O Throughput.

When high disk reads are observed, the following metrics and system performance indicators should be monitored to assess the impact and prevent further issues:

- Disk Write Activity: use `iostat` to monitor disk writes (the `w/s` column) and check if both reads and writes are high.
- CPU Usage: `top` or `htop`.
- Disk Queue Length: a high queue length (`avgqu-sz` in `iostat`) can indicate that the disk is struggling to handle read requests, leading to slower I/O operations.
- System Load: `uptime`
- Disk Latency: use `iostat` or check for errors in system logs related to disk timeouts or delays.
- Page Faults: `vmstat 1 5`

If C3 team actions do not resolve the high disk reads issue, the DevOps team should consider the following steps:

- Scale Disk Resources.
- Optimize System Configurations: for example, tune the `vm.swappiness` parameter or increase available memory.
- Database Optimization.
- Use Distributed File Systems.
- Implement I/O Queuing.
- Review Application-Level Disk Usage.
The Time Synchronization Drift alert is triggered when the system clock is found to be out of sync with the configured time source, such as a Network Time Protocol (NTP) server. Time synchronization is critical for ensuring accurate timestamps, smooth system operations, and consistency in logs and data across distributed systems.
When the Time Synchronization Drift alert is triggered, the C3 team should gather the following information to determine the cause of the time drift:
- Check System Time: `date`
- Verify NTP Service Status: `systemctl status ntp`; if the service is not running, start it: `sudo systemctl start ntp`
- Check NTP Synchronization: `ntpq -p`
- Check for Time Drift Logs: `journalctl -xe | grep ntp`
- Check Hardware Clock (RTC): `sudo hwclock --show`
- Check for Timezone Mismatches: `timedatectl`
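A small sketch that gathers the clock-related evidence in one pass; the script name and the chrony fallback are assumptions (use `chronyc` instead of `ntpq` on chrony-based nodes):

```bash
#!/usr/bin/env bash
# time_drift_check.sh - compare local clock state against NTP peers (illustrative sketch).
set -euo pipefail

echo "== Local time, RTC, timezone, and sync status =="
timedatectl

echo "== NTP peer status (offset and jitter) =="
ntpq -p || echo "ntpq not available; try: chronyc sources -v"

echo "== Hardware clock vs system clock =="
sudo hwclock --show
date
```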
Once the C3 team has gathered the necessary information, the following actions should be taken to resolve the time synchronization drift:

- Restart NTP Service: `sudo systemctl restart ntp`
- Manually Sync Time: `sudo ntpdate <NTP-server>`, replacing `<NTP-server>` with the address of a reliable NTP server.
- Check for Time Configuration Conflicts: make sure multiple time synchronization services (`chrony`, `ntpd`, `systemd-timesyncd`) are not running simultaneously. Disable any unused services: `sudo systemctl stop chronyd` and `sudo systemctl disable chronyd`
- Ensure Correct Timezone: `sudo timedatectl set-timezone <Timezone>`, replacing `<Timezone>` with the correct timezone, such as `UTC` or `Asia/Kolkata`.
- Check for Hardware Clock Issues: `sudo hwclock --systohc`
- Synchronize NTP Servers: edit the NTP configuration with `sudo nano /etc/ntp.conf` and add reliable servers, for example `server 0.pool.ntp.org` and `server 1.pool.ntp.org`.
When time synchronization drift is detected, the following metrics and system settings should also be monitored to identify underlying causes and ensure proper synchronization:

- System Load and Performance: `top`, `htop`, or `uptime`.
- NTP Synchronization Status: `ntpq -p` or `chronyc sources` to ensure that the system stays synchronized over time.
- System Logs: review the logs (`journalctl -xe | grep ntp`) to identify potential issues with the NTP service or time configuration.
- Hardware Clock (RTC) Drift: `hwclock`.
- Network Connectivity.

If C3 team actions do not resolve the time synchronization drift issue, the DevOps team should consider the following steps:

- Check Network and Firewall Settings.
- Scale Time Synchronization Infrastructure.
- Use Hardware Time Synchronization.
- Investigate NTP Service or Software Bugs.
- Monitor Server Time Continuously.
The High Context Switching alert is triggered when the number of context switches in the system exceeds a certain threshold. A context switch occurs when the CPU switches from one process or thread to another, which can happen frequently due to multitasking.
Excessive context switching can indicate issues like high CPU utilization, process contention, or inefficient use of system resources. It may lead to performance degradation due to overhead caused by the switching process.
When the High Context Switching alert is triggered, the C3 team should collect the following information to diagnose the root cause of the issue:
- Check System Load: check the system load using the `uptime` or `top` command. High load averages combined with high context switching can indicate resource contention.
- Check Context Switch Count: use the `vmstat` command and look for high values in the `cs` (context switches) column: `vmstat 1 5`
- Monitor Process-Specific Context Switching: use `pidstat` to monitor context switches per process and identify which processes are causing excessive context switching: `pidstat -w 1`. The `-w` option displays context switch statistics.
- Check CPU Usage and Load: use the `top` command to identify processes that are consuming high CPU.
- Check for I/O Wait: use the `iostat` command to monitor disk I/O: `iostat -xz 1 5`. Look at the `await` column, which indicates I/O wait time.
- Check for System Resource Contention: the `dstat` command can provide insight into system resource usage: `dstat -c -d -n`
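A brief sketch for identifying which processes drive the context switches; the script name and sample counts are assumptions (`pidstat` comes from the sysstat package):

```bash
#!/usr/bin/env bash
# ctx_switch_triage.sh - identify the sources of high context switching (illustrative sketch).
set -euo pipefail

echo "== System-wide context switches per second (cs column) =="
vmstat 1 5

echo "== Per-process voluntary/involuntary context switches, 5 one-second samples =="
pidstat -w 1 5

echo "== Runnable and blocked process counts =="
grep -E '^procs_(running|blocked)' /proc/stat
```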
Once the C3 team has collected the necessary information, the following actions should be taken to resolve high context switching:

- Identify Resource-Hungry Processes: use `top`, `htop`, or `pidstat` to identify processes consuming excessive resources, which may cause high context switching. Consider adjusting or terminating these processes if necessary: `kill <PID>`, replacing `<PID>` with the process ID of the offending process.
- Optimize Application Code.
- Increase CPU Resources.
- Review System Resource Limits: use `ulimit -a` to view the system limits and adjust them if necessary.
- Optimize I/O Operations.
- Reduce Process and Thread Creation.
- Review System Scheduler Configuration: review the kernel scheduler settings (e.g., the CFS scheduler) to ensure it is appropriately tuned for the workload.

When high context switching is detected, the following metrics and system settings should also be monitored:

- CPU Usage: `top`, `htop`, or `vmstat`.
- Disk I/O: use `iostat` and look for signs of I/O bottlenecks or high wait times.
- Process Behavior.
- System Load: `uptime` or `top` to monitor load averages and CPU utilization.
- Memory Usage: `free -m` and `vmstat` to monitor memory status.

If C3 team actions do not resolve the high context switching issue, the DevOps team should consider the following steps:

- Scale the Infrastructure.
- Optimize System Configuration: tune parameters such as `vm.swappiness`, `cpu.shares`, or process limits to ensure efficient resource allocation.
- Investigate Application-Level Issues.
- Tune the Kernel Scheduler.
- Monitor Long-Term Trends.
- Optimize Database Queries.
The High Open File Descriptors alert is triggered when the number of open file descriptors on the system exceeds a specified threshold. File descriptors are used by the operating system to manage files, network sockets, and other input/output resources. An unusually high number of open file descriptors can indicate resource leaks or excessive file/socket usage by applications.
Excessive open file descriptors can lead to resource exhaustion, causing the system or application to become unresponsive or fail to open new files or network connections.
When the High Open File Descriptors alert is triggered, the C3 team should collect the following information to diagnose the issue:
- Check Current Open File Descriptors: use the `lsof` command to list all open file descriptors; this shows the number of file descriptors being used by processes: `lsof | wc -l`. Use `lsof -p <PID>`, replacing `<PID>` with the process ID, to check open file descriptors for a specific process.
- Check System-Wide Open File Descriptor Limit: `ulimit -n`
- Check Per-Process Open File Descriptor Count: `lsof | awk '{print $1}' | sort | uniq -c | sort -n`
- Check System Logs for Errors: use `journalctl` or `dmesg` to search for any warnings or errors: `dmesg | grep -i file` or `journalctl -xe | grep -i file`
- Check Resource Usage by Application: use `top` or `htop` to see if the applications are consuming excessive CPU or memory resources.
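A small sketch that summarises descriptor usage without relying on `lsof` (which can be slow on busy nodes); the script name and the top-10 cut-off are assumptions:

```bash
#!/usr/bin/env bash
# fd_usage_report.sh - summarise open file descriptor usage (illustrative sketch).
set -euo pipefail

echo "== System-wide: allocated, unused, maximum file descriptors =="
cat /proc/sys/fs/file-nr

echo "== Soft limit for the current shell =="
ulimit -n

echo "== Top 10 processes by open descriptor count =="
for p in /proc/[0-9]*; do
  n=$(ls "$p/fd" 2>/dev/null | wc -l)
  [ "$n" -gt 0 ] && echo "$n ${p#/proc/} $(tr -d '\0' < "$p/cmdline" 2>/dev/null | head -c 50)"
done | sort -rn | head -n 10
```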
Once the C3 team has collected the necessary data, they can take the following steps to resolve high open file descriptor usage:
- Identify Processes with Excessive File Descriptors: use `lsof` or the command mentioned earlier to identify processes consuming an unusually high number of file descriptors. If any are found, investigate their behavior to determine if they are leaking file descriptors or using resources inefficiently.
- Increase Open File Descriptor Limit: `ulimit -n <new_limit>`, replacing `<new_limit>` with the desired value for the maximum number of open file descriptors.
- Check for File Descriptor Leaks.
- Restart Problematic Applications: `systemctl restart <service-name>`, replacing `<service-name>` with the service name, such as `mysql`, `nginx`, or `apache2`.
- Review Application Configuration.
When high open file descriptors are detected, the following metrics and system checks should also be monitored:

- System Load: `uptime` or `top` to monitor the system load.
- Memory Usage: `free -m` or `vmstat`.
- CPU Usage: `top` or `htop` to monitor CPU utilization.
- Disk I/O: `iostat` or `dstat` to identify any performance bottlenecks related to disk access.
- Application Resource Usage: `top`, `htop`, or custom application monitoring tools.

If C3 team actions do not resolve the high open file descriptor issue, the DevOps team should consider the following steps:

- Scale Infrastructure.
- Optimize Application Code.
- Increase File Descriptor Limits: edit `/etc/security/limits.conf` to raise the soft and hard limits for file descriptors, for example `* soft nofile 65536` and `* hard nofile 65536`.
- Check for System Configuration Issues.
- Monitor Long-Term Trends.
The Filesystem Read-Only State alert is triggered when the system’s filesystem has entered a read-only state. This usually occurs when there is a serious issue with the filesystem, such as corruption, hardware failure, or a kernel panic. A filesystem in read-only mode cannot accept write operations, meaning that any applications or processes that need to write to the disk will fail.
This alert indicates that the system is unable to perform critical operations such as writing logs, saving data, or updating files, which can significantly impact system performance and reliability.
When the Filesystem Read-Only State alert is triggered, the C3 team should collect the following information to diagnose the issue:

- Check the Filesystem State: use the `mount` command to verify if the filesystem is mounted in read-only mode: `mount | grep "ro,"`
- Check System Logs for Filesystem Errors: `dmesg | grep -i error` and `journalctl -xe | grep -i "read-only"`
- Check Disk Health and SMART Status: use `smartctl` to check the health of the disk, as physical disk failures are a common cause of read-only filesystems: `smartctl -a /dev/sda`, replacing `/dev/sda` with the appropriate disk device.
- Check Filesystem for Errors: run `fsck` (filesystem check) to scan and repair filesystem errors, for example `sudo fsck /dev/sda1`, replacing `/dev/sda1` with the appropriate partition.
- Check for Disk Space Issues: `df -h`
- Check for Pending Kernel or System Errors: `dmesg | tail -n 50`
Once the C3 team has collected the necessary data, they can take the following steps to resolve the filesystem read-only state:
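A minimal sketch of one common recovery path, assuming the underlying cause has been fixed and `fsck` reports the filesystem clean; the mount point is a placeholder, and this is an illustrative assumption rather than the official procedure:

```bash
#!/usr/bin/env bash
# remount_rw.sh - attempt to return a filesystem to read-write after the cause is fixed
# (illustrative sketch; /data is a placeholder mount point).
set -euo pipefail

MOUNTPOINT="/data"

# Only attempt this after fsck has been run and reports the filesystem clean.
sudo mount -o remount,rw "$MOUNTPOINT"

# Verify that writes actually succeed before declaring the issue resolved.
test_file="$MOUNTPOINT/.rw_test_$$"
if sudo touch "$test_file"; then
  sudo rm -f "$test_file"
  echo "$MOUNTPOINT is writable again"
else
  echo "$MOUNTPOINT is still read-only; escalate to DevOps" >&2
  exit 1
fi
```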
When the Filesystem Read-Only State alert is triggered, the following metrics and system checks should also be monitored:

- Disk I/O: `iostat`, `iotop`, or `dstat` to identify any disk bottlenecks.
- System Load: `uptime` or `top`.
- Disk Space: `df -h` to ensure the filesystem is not full, which could have caused the filesystem to become read-only.
- Memory Usage.
- System Logs: review the logs (`/var/log/syslog`, `/var/log/messages`, or `journalctl`) for additional filesystem or disk errors.

If C3 team actions do not resolve the filesystem read-only issue, the DevOps team should consider the following steps:
- Replace the Failing Disk.
- Scale Infrastructure.
- Reconfigure Disk and Filesystem.
- Review System and Kernel Configurations.