Node Exporter

In a Node Exporter dashboard in Grafana, you typically have multiple panels, each displaying different metrics or information about the system being monitored. Here’s an explanation of some common panels you might find in a Node Exporter dashboard:

CPU Usage Panel:

This panel displays the CPU usage metrics of the target machine(s) being monitored. It can show the overall CPU usage percentage or break it down by individual CPU cores or modes (user, system, idle, etc.).

Memory Usage Panel:

This panel provides information about the memory usage on the target machine(s). It may include metrics like total memory, used memory, free memory, and memory usage percentage.

Disk Usage Panel:

This panel shows the disk usage metrics, including total disk space, used space, and available space. It can provide an overview of the disk usage on the target machine(s) or specific disks or partitions.

Network Traffic Panel:

This panel displays the network traffic metrics, such as incoming and outgoing network traffic (bytes/sec) or network packets (packets/sec). It can give you insights into the network activity of the monitored system.

Load Average Panel:

The load average panel shows the system load averages over 1, 5, and 15 minutes. On Linux, the load average counts processes that are running or waiting to run, plus those in uninterruptible sleep (typically waiting on disk I/O). Comparing it with the number of CPU cores helps you judge whether the system is overloaded.

Process Metrics Panel:

This panel provides information about processes on the target machine(s). Node Exporter itself exposes aggregate figures, such as the number of running and blocked processes; per-process CPU and memory usage requires a separate exporter.

System Uptime Panel:

This panel displays the system uptime metric, indicating how long the target machine has been running since the last reboot. It gives you an overview of the system’s stability.

Custom Panels:

Depending on your monitoring requirements, you can create custom panels to display any specific metrics exposed by the Node Exporter or other system-level metrics you find relevant.

These are just some examples of panels you may find in a Node Exporter dashboard. The exact panels and their configurations may vary based on your specific monitoring needs and the metrics you choose to track. Grafana provides a wide range of visualization options and customization features, allowing you to create panels that best suit your monitoring requirements.

Metrics

CPU and Memory Metrics

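  • rate(node_cpu_seconds_total{mode="system"}[1m]): The average CPU time spent in system mode per second over the last minute, reported per CPU (a value between 0 and 1, in CPU-seconds per second).

  • node_memory_MemAvailable_bytes: The kernel’s estimate of the memory available for starting new applications without swapping, in bytes.

  • node_memory_MemTotal_bytes: The total usable RAM on the node, in bytes.

  • node_memory_MemFree_bytes: The amount of completely unused RAM on the node, in bytes. Memory held by caches and buffers is not counted as free.

  • node_vmstat_pgmajfault: The cumulative number of major page faults that have occurred since boot.

  • node_pressure_cpu_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for CPU resources. Apply rate() to get the share of time spent waiting.

  • node_pressure_memory_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for memory resources. Apply rate() to get the share of time spent waiting.

  • node_pressure_io_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for I/O operations. Apply rate() to get the share of time spent waiting.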

  • node_memory_SwapTotal_bytes: The total amount of swap memory in bytes available on the node.

  • node_memory_SwapFree_bytes: The amount of free swap memory in bytes available on the node.

  • node_memory_Cached_bytes: The amount of cached memory in bytes.

  • node_memory_Buffers_bytes: The amount of buffers memory in bytes.

  • node_memory_SReclaimable_bytes: The amount of reclaimable slab memory (kernel caches that can be freed under memory pressure) in bytes.
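
As a rough illustration of how the entries above are typically queried, counters (names ending in _total, plus node_vmstat_pgmajfault) are wrapped in rate(), while gauges are used directly. The 5m window is an arbitrary choice for this sketch, not something this document prescribes.

```promql
# Share of each second that tasks spent stalled waiting for CPU (0 to 1).
rate(node_pressure_cpu_waiting_seconds_total[5m])

# Major page faults per second.
rate(node_vmstat_pgmajfault[5m])

# Percentage of memory still available (gauges, so no rate() is needed).
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
```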

Network Metrics

  • rate(node_network_receive_bytes_total[1m]): The average network traffic received, per second, over the last minute (in bytes).

  • node_network_receive_bytes_total: The total number of bytes received on a network interface since the node was booted. One series is reported per interface.

  • node_network_transmit_bytes_total: The total number of bytes transmitted on a network interface since the node was booted. One series is reported per interface.

  • node_network_receive_errs_total: The total number of errors that have occurred while receiving data on a network interface.

  • node_network_transmit_errs_total: The total number of errors that have occurred while transmitting data on a network interface.
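
Because these counters are reported per interface, a query usually either selects interfaces or sums across them. The sketch below assumes the loopback interface is named lo and uses a 5m window; both are assumptions to adjust for your hosts.

```promql
# Receive throughput in bytes per second for each non-loopback interface.
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# Total receive throughput per host, summed over those interfaces.
sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
```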

File System Metrics

  • node_filesystem_avail_bytes: The filesystem space available to non-root users (in bytes).

  • node_filesystem_size_bytes: The total size of a filesystem on a node (in bytes).

  • node_filesystem_readonly: Whether a filesystem on a node is mounted read-only. A value of 1 indicates read-only, while 0 indicates read-write.

  • node_filesystem_files_free: The number of free inodes on a filesystem on a node.

  • node_filesystem_files: The total number of inodes (file nodes) on a filesystem on a node, whether used or free.
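
The two inode metrics combine into a usage percentage in the same way as the byte metrics. This is a minimal sketch; the fstype filter is an assumption meant to skip in-memory filesystems.

```promql
# Percentage of inodes in use on each filesystem.
(1 - node_filesystem_files_free{fstype!~"tmpfs"} / node_filesystem_files{fstype!~"tmpfs"}) * 100
```

Filesystems that report zero total inodes produce NaN here, so they appear as gaps rather than values.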

Disk Metrics

  • node_disk_read_bytes_total: The total number of bytes read from a disk device on a node since the node was booted.

  • node_disk_written_bytes_total: The total number of bytes written to a disk device on a node since the node was booted.

  • node_disk_read_time_seconds_total: The total amount of time spent reading data from a disk device on a node since the node was booted.

  • node_disk_reads_completed_total: The total number of read operations completed on a disk device on a node since the node was booted.

  • node_disk_write_time_seconds_total: The total amount of time spent writing data to a disk device on a node since the node was booted.

  • node_md_disks: The number of active/failed/spare disks of a device.

Kernel Metrics

  • node_nf_conntrack_entries: The number of currently allocated flow entries for connection tracking.

  • node_edac_uncorrectable_errors_total: The total number of uncorrectable errors detected by the Error Detection and Correction (EDAC) subsystem on a node.

  • node_vmstat_oom_kill: The cumulative number of times the kernel has killed processes due to out-of-memory (OOM) conditions on a node.

  • node_load1: The load average over the last minute for the node.

  • node_time_seconds: The current system time on the node, in seconds since the Unix epoch.

  • node_boot_time_seconds: The time at which the node was last booted, in seconds since the Unix epoch.

Temperature Metrics

  • node_hwmon_temp_celsius: Current temperature in Celsius.

  • node_hwmon_temp_crit_alarm_celsius: Critical temperature alarm flag. Despite the _celsius suffix, the value is 1 when the alarm is active and 0 otherwise.

  • node_hwmon_temp_crit_celsius: Critical temperature threshold in Celsius.

  • node_hwmon_temp_crit_hyst_celsius: Critical temperature hysteresis in Celsius.

  • node_hwmon_temp_max_celsius: Maximum temperature threshold reported by the sensor, in Celsius.

  • node_hwmon_chip_names: Names of the hardware chips monitored.

  • node_cooling_device_cur_state: Current state of the cooling device.

  • node_cooling_device_max_state: Maximum state of the cooling device.

Clock Synchronization Metrics

  • node_timex_offset_seconds: The difference between the node’s system clock and the NTP reference clock.

Netstat Metrics

  • irate(node_netstat_IpExt_InOctets[1m]): The rate of incoming IP octets per second.

  • irate(node_netstat_IpExt_OutOctets[1m]): The rate of outgoing IP octets per second.

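  • irate(node_netstat_Ip_Forwarding[1m]): The rate of change of the IP forwarding value. Ip_Forwarding is the kernel’s forwarding setting (1 = enabled, 2 = disabled) rather than a packet counter, so this is normally 0.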

  • irate(node_netstat_Icmp_InMsgs[1m]): The rate of incoming ICMP messages.

  • irate(node_netstat_Icmp_OutMsgs[1m]): The rate of outgoing ICMP messages.

  • irate(node_netstat_Icmp_InErrors[1m]): The rate of incoming ICMP errors.

  • irate(node_netstat_Udp_InDatagrams[1m]): The rate of incoming UDP datagrams.

  • irate(node_netstat_Udp_OutDatagrams[1m]): The rate of outgoing UDP datagrams.

  • irate(node_netstat_Udp_InErrors[1m]): The rate of incoming UDP errors.

  • irate(node_netstat_Udp_NoPorts[1m]): The rate of UDP datagrams received for a port with no listener.

  • irate(node_netstat_UdpLite_InErrors[1m]): The rate of incoming UDPLite errors.

  • irate(node_netstat_Udp_RcvbufErrors[1m]): The rate of UDP receive buffer errors.

  • irate(node_netstat_Udp_SndbufErrors[1m]): The rate of UDP send buffer errors.

  • irate(node_netstat_Tcp_InSegs[1m]): The rate of incoming TCP segments.

  • irate(node_netstat_Tcp_OutSegs[1m]): The rate of outgoing TCP segments.

  • irate(node_netstat_TcpExt_ListenOverflows[1m]): The rate of TCP listen overflows.

  • irate(node_netstat_TcpExt_ListenDrops[1m]): The rate of TCP listen drops.

  • irate(node_netstat_TcpExt_TCPSynRetrans[1m]): The rate of TCP SYN retransmissions.

  • irate(node_netstat_Tcp_RetransSegs[1m]): The rate of TCP retransmitted segments.

  • irate(node_netstat_Tcp_InErrs[1m]): The rate of TCP input errors.

  • irate(node_netstat_Tcp_OutRsts[1m]): The rate of TCP reset segments sent.

  • irate(node_netstat_TcpExt_TCPRcvQDrop[1m]): The rate of TCP receive queue drops.

  • irate(node_netstat_TcpExt_TCPOFOQueue[1m]): The rate at which TCP segments are placed in the out-of-order queue.

  • node_netstat_Tcp_CurrEstab: Current established TCP connections.

  • node_netstat_Tcp_MaxConn: Maximum number of TCP connections the host supports. A value of -1 means the limit is dynamic.

  • irate(node_netstat_TcpExt_SyncookiesFailed[1m]): The rate of failed SYN cookies.

  • irate(node_netstat_TcpExt_SyncookiesRecv[1m]): The rate of received SYN cookies.

  • irate(node_netstat_TcpExt_SyncookiesSent[1m]): The rate of sent SYN cookies.

  • irate(node_netstat_Tcp_ActiveOpens[1m]): The rate of active TCP connection opens.

  • irate(node_netstat_Tcp_PassiveOpens[1m]): The rate of passive TCP connection opens.


Grafana Dashboard Panels

Quick CPU / Mem / Disk Row

Pressure panel

This panel shows the percentage of time the system’s resources (CPU, memory, and I/O) are under pressure or experiencing wait times.

  • Metrics Used: node_pressure_cpu_waiting_seconds_total, node_pressure_memory_waiting_seconds_total, node_pressure_io_waiting_seconds_total

CPU Busy panel

It displays the percentage of time the CPU is actively working, indicating how much of the CPU’s capacity is being utilized.

  • Metric Used: node_cpu_seconds_total

Sys Load panel

This panel displays the system load as a percentage of total CPU capacity. It helps to understand how much work the system is handling compared to the number of CPU cores available.

  • Metrics Used: node_load1, node_cpu_seconds_total
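
One way to express load as a percentage of CPU capacity is to divide the 1-minute load average by the number of CPUs. This is a sketch of that idea and not necessarily the exact query the dashboard uses.

```promql
# 1-minute load average as a percentage of the CPU count.
# Each CPU has exactly one mode="idle" series, so counting them gives the CPU count.
(node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})) * 100
```

A value above 100 means more tasks are runnable (or blocked in uninterruptible sleep) than there are CPUs to serve them.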

RAM Used panel

This panel shows how much of the total RAM is currently being used by the system. It calculates the used RAM as a percentage of the total RAM.

  • Metrics Used: node_memory_MemTotal_bytes, node_memory_MemFree_bytes
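
Two common ways to turn these gauges into a used percentage are sketched below. Which one a given dashboard version uses is an assumption; check the panel’s query editor.

```promql
# Used RAM based on MemFree: counts page cache and buffers as used.
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100

# Used RAM based on MemAvailable: treats reclaimable cache as available.
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```

The first form usually reads much higher on a healthy Linux host, because the kernel deliberately fills spare RAM with cache.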

SWAP Used panel

This panel shows how much of the total swap memory is currently being used by the system.

  • Metrics Used: node_memory_SwapTotal_bytes, node_memory_SwapFree_bytes

Root FS Used panel

It displays the percentage of the root filesystem that is currently being used. This helps in monitoring disk usage on the root filesystem of the node.

  • Metrics Used: node_filesystem_avail_bytes, node_filesystem_size_bytes
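
The used percentage follows from the available and total sizes. The sketch assumes the root filesystem is mounted at / and excludes the rootfs pseudo-filesystem type; adjust the matchers if Node Exporter runs in a container with the host mounted elsewhere.

```promql
# Percentage of the root filesystem that is not available to non-root users.
100 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} * 100
    / node_filesystem_size_bytes{mountpoint="/", fstype!="rootfs"}
```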

CPU Cores panel

It displays the total number of CPU cores available on the node.

  • Metrics Used: node_cpu_seconds_total
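
The core count is not exposed as a metric of its own; it can be derived by counting the distinct cpu label values on node_cpu_seconds_total. A minimal sketch:

```promql
# Number of logical CPUs per host: one mode="idle" series exists per CPU.
count by (instance) (node_cpu_seconds_total{mode="idle"})
```

This counts logical CPUs as the kernel sees them, so hyperthreads are included.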

Root FS Total Panel

It displays the total size of the root filesystem on the node. This helps in understanding the total disk capacity available in the root filesystem.

  • Metrics Used: node_filesystem_size_bytes

Uptime Panel

It displays the uptime of the node, indicating how long the node has been running since its last boot.

  • Metrics Used: node_time_seconds, node_boot_time_seconds
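
Both metrics are Unix timestamps, so uptime is their difference. A minimal sketch:

```promql
# Uptime in seconds.
node_time_seconds - node_boot_time_seconds

# Uptime in days.
(node_time_seconds - node_boot_time_seconds) / 86400
```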

Basic CPU / Mem / Net / Disk Row

CPU Basic Panel

It shows the average system CPU usage per core on the node.

  • Metrics Used: node_cpu_seconds_total

Memory Basic Panel

It provides insights into memory usage on the node, including total memory, used memory, cached memory, buffers, and free memory.

  • Metrics Used: node_memory_MemTotal_bytes, node_memory_MemFree_bytes, node_memory_Cached_bytes, node_memory_Buffers_bytes, node_memory_SReclaimable_bytes

Network Traffic Basic Panel

It shows the incoming and outgoing network traffic on the node in bits per second (bps).

  • Metrics Used: node_network_receive_bytes_total, node_network_transmit_bytes_total
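
The underlying counters are in bytes, so a panel in bits per second has to multiply the per-second rate by 8. The window below is an assumption for illustration.

```promql
# Receive and transmit throughput in bits per second, per interface.
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8
```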

Disk Space Used Basic Panel

It shows the percentage of disk space used on non-root filesystems.

  • Metrics Used: node_filesystem_avail_bytes, node_filesystem_size_bytes

Hardware Misc Row

Hardware Temperature Monitor Panel

It monitors various temperature metrics of hardware components on the node.

  • Metrics Used: node_hwmon_temp_celsius, node_hwmon_temp_crit_alarm_celsius, node_hwmon_temp_crit_celsius, node_hwmon_temp_crit_hyst_celsius, node_hwmon_temp_max_celsius, node_hwmon_chip_names

Throttle Cooling Device Panel

It monitors the current and maximum states of cooling devices on the node.

  • Metrics Used: node_cooling_device_cur_state, node_cooling_device_max_state

Power Supply Panel

It monitors the online status of power supplies on the node.

  • Metrics Used: node_power_supply_online (the online status of the power supply)

Storage File System Row

Filesystem Space Available Panel

It shows the available space in bytes on the filesystem excluding rootfs.

  • Metrics Used: node_filesystem_avail_bytes, node_filesystem_free_bytes, node_filesystem_size_bytes

This panel provides information about the available and free space on the filesystem, helping monitor storage capacity.

File Nodes Free Panel

It displays the number of free file nodes (inodes) on the filesystem excluding rootfs.

  • Metrics Used: node_filesystem_files_free

This panel monitors the availability of file nodes, which are important for storing metadata about files.

File Descriptor Panel

It shows the maximum and currently allocated file descriptors for the node.

  • Metrics Used: node_filefd_maximum, node_filefd_allocated

A file descriptor is a unique identifier that the operating system assigns to an opened file or other input/output resource.

File descriptors are used by applications to access files, pipes, sockets, and other I/O resources. This panel tracks their usage and limits.
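
To see how close the node is to its limit, the two gauges can be combined into a percentage. This is an illustrative query, not one taken from the dashboard.

```promql
# Percentage of the system-wide file descriptor limit currently allocated.
node_filefd_allocated / node_filefd_maximum * 100
```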

File Nodes Size Panel

It displays the total number of file nodes (inodes) on the filesystem excluding rootfs.

  • Metrics Used: node_filesystem_files

File nodes (inodes) represent individual files or directories on the filesystem. This panel monitors the total count of these nodes.

Filesystem in ReadOnly / Error Panel

It indicates if the filesystem excluding tmpfs is in a read-only state or has encountered errors.

  • Metrics Used: node_filesystem_readonly, node_filesystem_device_error

This panel alerts when the filesystem is in a read-only state or has encountered device errors, helping to identify potential storage issues.


Netstat Row

Netstat IP In / Out Octets Panel

It displays the rate of incoming and outgoing octets (bytes) for IP traffic.

  • Metrics Used: node_netstat_IpExt_InOctets, node_netstat_IpExt_OutOctets

This panel provides insights into the data traffic in bytes for incoming and outgoing IP packets, crucial for monitoring network throughput.

Netstat IP Forwarding Panel

This panel plots irate() of the node’s IP forwarding value.

  • Metrics Used: node_netstat_Ip_Forwarding

Ip_Forwarding reports the kernel’s forwarding setting (1 when forwarding is enabled, 2 when it is not) rather than a count of forwarded packets, so the plotted rate is normally zero and changes only when the setting is toggled.

ICMP In / Out Panel

It indicates the rate of incoming and outgoing ICMP (Internet Control Message Protocol) messages.

  • Metrics Used: node_netstat_Icmp_InMsgs, node_netstat_Icmp_OutMsgs

This panel tracks the rate of ICMP messages exchanged by the node, which are essential for network troubleshooting and diagnostics.

ICMP Errors Panel

It displays the rate of ICMP message errors received by the node.

  • Metrics Used: node_netstat_Icmp_InErrors

UDP In / Out Panel

It shows the rate of incoming and outgoing UDP (User Datagram Protocol) datagrams.

  • Metrics Used: node_netstat_Udp_InDatagrams, node_netstat_Udp_OutDatagrams

This panel monitors the rate of UDP datagrams (packets) being received and sent by the node, crucial for real-time data transmission.

UDP Errors Panel

It displays various error rates related to UDP traffic.

  • Metrics Used: node_netstat_Udp_InErrors, node_netstat_Udp_NoPorts, node_netstat_UdpLite_InErrors, node_netstat_Udp_RcvbufErrors, node_netstat_Udp_SndbufErrors

This panel tracks different types of errors related to UDP traffic, aiding in the diagnosis and troubleshooting of UDP-related issues in the network.

TCP In / Out Panel

It shows the rate of incoming and outgoing TCP segments.

  • Metrics Used: node_netstat_Tcp_InSegs, node_netstat_Tcp_OutSegs

This panel monitors the rate at which TCP segments are received and sent by the node, providing insights into TCP traffic volume.

TCP Errors Panel

It displays various error rates related to TCP traffic.

  • Metrics Used: node_netstat_TcpExt_ListenOverflows, node_netstat_TcpExt_ListenDrops, node_netstat_TcpExt_TCPSynRetrans, node_netstat_Tcp_RetransSegs, node_netstat_Tcp_InErrs, node_netstat_Tcp_OutRsts, node_netstat_TcpExt_TCPRcvQDrop, node_netstat_TcpExt_TCPOFOQueue

This panel tracks different types of errors related to TCP traffic, aiding in diagnosing and troubleshooting TCP-related issues in the network.

TCP Connections Panel

It displays current and maximum TCP connections.

  • Metrics Used: node_netstat_Tcp_CurrEstab, node_netstat_Tcp_MaxConn

This panel shows the number of current established TCP connections and the maximum number of connections the node can handle, helping to monitor connection capacity and load.

TCP SynCookie Panel

It displays the rate of SYN cookies sent, received, and failed.

  • Metrics Used: node_netstat_TcpExt_SyncookiesFailed, node_netstat_TcpExt_SyncookiesRecv, node_netstat_TcpExt_SyncookiesSent

This panel monitors SYN cookies used to protect against SYN flood attacks, showing how many SYN cookies were sent, received, and failed.

TCP Direct Transition Panel

It shows the rate of active and passive TCP connection opens.

  • Metrics Used: node_netstat_Tcp_ActiveOpens, node_netstat_Tcp_PassiveOpens

Rows

CPU / Memory / Net / Disk

  • Displays metrics related to CPU usage, memory usage, network traffic, and disk usage.

Memory Meminfo

  • Shows detailed memory information, including active/inactive memory, committed memory, and various memory states.

Memory Vmstat

  • Provides virtual memory statistics such as page ins/outs, swap in/outs, and page faults.

System Timesync

  • Displays information related to system time synchronization, including time drift and synchronization status.

System Processes

  • Shows metrics related to system processes, including their status, memory usage, and process limits.

System Misc

  • Provides miscellaneous system metrics, including context switches, interrupts, CPU frequency scaling, and entropy.

Systemd

  • Displays metrics related to systemd services, including the state of systemd units and sockets.

Storage Disk

  • Shows disk I/O metrics, including IOPs, read/write data, average wait time, and queue size.

Storage Filesystem

  • Provides information about filesystem space, file nodes, file descriptors, and file node sizes.

Network Traffic

  • Displays network traffic metrics, including traffic by packets, errors, and drops.

Network Sockstat

  • Shows socket statistics, including TCP, UDP, frag/RAW, and memory size for sockets.

Node Exporter

  • Displays metrics related to the Node Exporter itself, including scrape times and statuses.

Alert rules

Node down

  • Expression: sum(up{job="node_exporter"}) < 7
  • Summary: Node is down
  • Description: Failed to scrape for more than 2 minutes. Node seems down.
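
The expression above fires when fewer than 7 targets are up, which ties the rule to a fixed fleet size. A per-target alternative is sketched below; it assumes the same job name, and the 2-minute delay from the description would be set with the rule’s for: duration rather than in the query.

```promql
# One alert series per node_exporter target whose last scrape failed.
up{job="node_exporter"} == 0
```

Note that this form cannot detect a target that has disappeared from service discovery entirely, which the count-based form can.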

HostOutOfMemory

  • Expression: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  • Summary: Host out of memory
  • Description: Node memory is filling up (< 10% left)

HostMemoryUnderMemoryPressure

  • Expression: rate(node_vmstat_pgmajfault[1m]) > 1000
  • Summary: Host memory under memory pressure
  • Description: The node is under heavy memory pressure. High rate of major page faults

HostUnusualNetworkThroughputIn

  • Expression: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
  • Summary: Host unusual network throughput in
  • Description: Host network interfaces are probably receiving too much data (> 100 MB/s)

HostUnusualNetworkThroughputOut

  • Expression: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
  • Summary: Host unusual network throughput out
  • Description: Host network interfaces are probably sending too much data (> 100 MB/s)

HostUnusualDiskReadRate

  • Expression: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
  • Summary: Host unusual disk read rate
  • Description: Disk is probably reading too much data (> 50 MB/s)

HostUnusualDiskWriteRate

  • Expression: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
  • Summary: Host unusual disk write rate
  • Description: Disk is probably writing too much data (> 50 MB/s)

HostOutOfDiskSpace

  • Expression: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
  • Summary: Host out of disk space
  • Description: Disk is almost full (< 10% left)

HostDiskWillFillIn24Hours

  • Expression: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
  • Summary: Host disk will fill in 24 hours
  • Description: Filesystem is predicted to run out of space within the next 24 hours at current write rate
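
The central piece of this rule is predict_linear(), which fits a straight line to the samples in the range and extrapolates it the given number of seconds ahead. Evaluated on its own, the sub-expression from the rule shows the forecast value:

```promql
# Predicted available bytes 24 hours from now, based on the last hour of samples.
# A negative result means the filesystem is forecast to fill before then.
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600)
```

The extra "< 10" condition in the rule keeps it from firing on filesystems that still have plenty of space but saw a short burst of writes.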

HostOutOfInodes

  • Expression: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
  • Summary: Host out of inodes
  • Description: Disk is almost running out of available inodes (< 10% left)

HostInodesWillFillIn24Hours

  • Expression: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{mountpoint="/rootfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
  • Summary: Host inodes will fill in 24 hours
  • Description: Filesystem is predicted to run out of inodes within the next 24 hours at current write rate

HostUnusualDiskReadLatency

  • Expression: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
  • Summary: Host unusual disk read latency
  • Description: Disk latency is growing (read operations > 100ms)

HostUnusualDiskWriteLatency

  • Expression: rate(node_disk_write_time_seconds_total{device!~"mmcblk.+"}[1m]) / rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0.1 and rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0
  • Summary: Host unusual disk write latency
  • Description: Disk latency is growing (write operations > 100ms)

HostHighCpuLoad

  • Expression: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
  • Summary: Host high CPU load
  • Description: CPU load is > 80%

HostCpuStealNoisyNeighbor

  • Expression: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  • Summary: Host CPU steal noisy neighbor
  • Description: CPU steal is > 10%. A noisy neighbor is hurting VM performance, or a spot instance may be out of credit.

HostContextSwitching

  • Expression: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
  • Summary: Host context switching
  • Description: Context switching is growing on node (> 1000 / s)

HostSwapIsFillingUp

  • Expression: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
  • Summary: Host swap is filling up
  • Description: Swap is filling up (> 80%)

HostSystemdServiceCrashed

  • Expression: node_systemd_unit_state{state="failed"} == 1
  • Summary: Host systemd service crashed
  • Description: A systemd service has crashed

HostPhysicalComponentTooHot

  • Expression: node_hwmon_temp_celsius > 75
  • Summary: Host physical component too hot
  • Description: Physical hardware component too hot

HostNodeOvertemperatureAlarm

  • Expression: node_hwmon_temp_crit_alarm_celsius == 1
  • Summary: Host node overtemperature alarm
  • Description: Physical node temperature alarm triggered

HostRaidArrayGotInactive

  • Expression: node_md_state{state="inactive"} > 0
  • Summary: Host RAID array got inactive
  • Description: RAID array is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

HostRaidDiskFailure

  • Expression: node_md_disks{state="failed"} > 0
  • Summary: Host RAID disk failure
  • Description: At least one device in the RAID array has failed. The array needs attention and possibly a disk swap.

HostKernelVersionDeviations

  • Expression: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
  • Summary: Host kernel version deviations
  • Description: Alerts when different kernel versions are running across the monitored hosts.

HostOomKillDetected

  • Expression: increase(node_vmstat_oom_kill[1m]) > 0
  • Summary: Host OOM kill detected
  • Description: Alerts when an OOM kill is detected on the host.

HostEdacCorrectableErrorsDetected

  • Expression: increase(node_edac_correctable_errors_total[1m]) > 0
  • Summary: Host EDAC Correctable Errors detected
  • Description: Alerts when correctable memory errors are detected by EDAC.

HostEdacUncorrectableErrorsDetected

  • Expression: node_edac_uncorrectable_errors_total > 0
  • Summary: Host EDAC Uncorrectable Errors detected
  • Description: Alerts when uncorrectable memory errors are detected by EDAC.

HostNetworkReceiveErrors

  • Expression: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
  • Summary: Host Network Receive Errors
  • Description: Alerts when network receive errors exceed 1%.

HostNetworkTransmitErrors

  • Expression: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
  • Summary: Host Network Transmit Errors
  • Description: Alerts when network transmit errors exceed 1%.

HostNetworkInterfaceSaturated

  • Expression: (rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8
  • Summary: Host Network Interface Saturated
  • Description: Alerts when network interface usage exceeds 80%.

HostConntrackLimit

  • Expression: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
  • Summary: Host conntrack limit
  • Description: Alerts when conntrack entries exceed 80% of the limit.

HostClockSkew

  • Expression: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
  • Summary: Host clock skew
  • Description: Alerts when clock skew is detected on the host.

HostClockNotSynchronising

  • Expression: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
  • Summary: Host clock not synchronising
  • Description: Alerts when the host clock is not synchronising.