Node Exporter
In a Node Exporter dashboard in Grafana, you typically have multiple panels, each displaying different metrics or information about the system being monitored. Here’s an explanation of some common panels you might find in a Node Exporter dashboard:
CPU Usage Panel:
This panel displays the CPU usage metrics of the target machine(s) being monitored. It can show the overall CPU usage percentage or break it down by individual CPU cores or modes (user, system, idle, etc.).
Memory Usage Panel:
This panel provides information about the memory usage on the target machine(s). It may include metrics like total memory, used memory, free memory, and memory usage percentage.
Disk Usage Panel:
This panel shows the disk usage metrics, including total disk space, used space, and available space. It can provide an overview of the disk usage on the target machine(s) or specific disks or partitions.
Network Traffic Panel:
This panel displays the network traffic metrics, such as incoming and outgoing network traffic (bytes/sec) or network packets (packets/sec). It can give you insights into the network activity of the monitored system.
Load Average Panel:
The load average panel shows the system load average over different time intervals (1 minute, 5 minutes, 15 minutes). On Linux, the load average counts processes that are running or waiting to run, plus those in uninterruptible sleep (usually waiting on I/O). Comparing it against the number of CPU cores helps you judge how heavily the system is loaded.
Process Metrics Panel:
This panel provides information about specific processes running on the target machine(s). It can show metrics like CPU usage, memory usage, and other process-specific metrics. You can configure this panel to monitor specific processes of interest.
System Uptime Panel:
This panel displays the system uptime metric, indicating how long the target machine has been running since the last reboot. It gives you an overview of the system’s stability.
Custom Panels:
Depending on your monitoring requirements, you can create custom panels to display any specific metrics exposed by the Node Exporter or other system-level metrics you find relevant.
These are just some examples of panels you may find in a Node Exporter dashboard. The exact panels and their configurations may vary based on your specific monitoring needs and the metrics you choose to track. Grafana provides a wide range of visualization options and customization features, allowing you to create panels that best suit your monitoring requirements.
rate(node_cpu_seconds_total{mode="system"}[1m]): The average amount of CPU time spent in system mode, per second, over the last minute (in seconds).
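node_memory_MemAvailable_bytes: An estimate of the memory, in bytes, available for starting new applications without swapping. It includes reclaimable cache, so it is usually larger than MemFree.
node_memory_MemTotal_bytes: The total usable RAM on the node in bytes.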
node_memory_MemFree_bytes: The amount of free RAM in bytes available on the node.
node_vmstat_pgmajfault: The number of major page faults that have occurred.
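node_pressure_cpu_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for CPU resources. Apply rate() to get the fraction of time under pressure.
node_pressure_memory_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for memory resources.
node_pressure_io_waiting_seconds_total: The cumulative time, in seconds, that tasks have spent waiting for I/O operations.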
node_memory_SwapTotal_bytes: The total amount of swap memory in bytes available on the node.
node_memory_SwapFree_bytes: The amount of free swap memory in bytes available on the node.
node_memory_Cached_bytes: The amount of cached memory in bytes.
node_memory_Buffers_bytes: The amount of buffers memory in bytes.
node_memory_SReclaimable_bytes: The amount of reclaimable memory in bytes.
rate(node_network_receive_bytes_total[1m]): The average network traffic received, per second, over the last minute (in bytes).
node_network_receive_bytes_total: The total number of bytes received on a network interface since the node was booted (one series per interface).
node_network_transmit_bytes_total: The total number of bytes transmitted on a network interface since the node was booted (one series per interface).
node_network_receive_errs_total: The total number of errors that have occurred while receiving data on a network interface.
node_network_transmit_errs_total: The total number of errors that have occurred while transmitting data on a network interface.
node_filesystem_avail_bytes: The filesystem space available to non-root users (in bytes).
node_filesystem_size_bytes: The total size of a filesystem on a node (in bytes).
node_filesystem_readonly: Whether a filesystem on a node is mounted read-only. A value of 1 indicates read-only, while 0 indicates read-write.
node_filesystem_files_free: The number of free inodes on a filesystem on a node.
node_filesystem_files: The total number of inodes (file nodes) on a filesystem on a node.
node_disk_read_bytes_total: The total number of bytes read from a disk device on a node since the node was booted.
node_disk_written_bytes_total: The total number of bytes written to a disk device on a node since the node was booted.
node_disk_read_time_seconds_total: The total amount of time spent reading data from a disk device on a node since the node was booted.
node_disk_reads_completed_total: The total number of read operations completed on a disk device on a node since the node was booted.
node_disk_write_time_seconds_total: The total amount of time spent writing data to a disk device on a node since the node was booted.
node_md_disks: The number of active/failed/spare disks of a device.
node_nf_conntrack_entries: The number of currently allocated flow entries for connection tracking.
node_edac_uncorrectable_errors_total: The total number of uncorrectable errors detected by the Error Detection and Correction (EDAC) subsystem on a node.
node_vmstat_oom_kill: The cumulative number of times the kernel has killed processes due to out-of-memory (OOM) conditions on a node.
node_load1: The load average over the last minute for the node.
node_time_seconds: The current time on the node, in seconds since the Unix epoch.
node_boot_time_seconds: The time when the node was last booted, in seconds since the Unix epoch.
node_hwmon_temp_celsius: Current temperature in Celsius.
node_hwmon_temp_crit_alarm_celsius: Critical temperature alarm status (1 when the alarm is triggered, 0 otherwise).
node_hwmon_temp_crit_celsius: Critical temperature threshold in Celsius.
node_hwmon_temp_crit_hyst_celsius: Critical temperature hysteresis in Celsius.
node_hwmon_temp_max_celsius: Maximum temperature threshold in Celsius.
node_hwmon_chip_names: Names of the hardware chips monitored.
node_cooling_device_cur_state: Current state of the cooling device.
node_cooling_device_max_state: Maximum state of the cooling device.
irate(node_netstat_IpExt_InOctets[1m]): The rate of incoming IP octets per second.
irate(node_netstat_IpExt_OutOctets[1m]): The rate of outgoing IP octets per second.
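irate(node_netstat_Ip_Forwarding[1m]): The per-second change in the kernel's IP Forwarding field. This field is a flag that indicates whether IP forwarding is enabled, not a packet counter, so the rate is normally zero.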
irate(node_netstat_Icmp_InMsgs[1m]): The rate of incoming ICMP messages.
irate(node_netstat_Icmp_OutMsgs[1m]): The rate of outgoing ICMP messages.
irate(node_netstat_Icmp_InErrors[1m]): The rate of incoming ICMP errors.
irate(node_netstat_Udp_InDatagrams[1m]): The rate of incoming UDP datagrams.
irate(node_netstat_Udp_OutDatagrams[1m]): The rate of outgoing UDP datagrams.
irate(node_netstat_Udp_InErrors[1m]): The rate of incoming UDP errors.
irate(node_netstat_Udp_NoPorts[1m]): The rate of UDP no ports errors.
irate(node_netstat_UdpLite_InErrors[1m]): The rate of incoming UDPLite errors.
irate(node_netstat_Udp_RcvbufErrors[1m]): The rate of UDP receive buffer errors.
irate(node_netstat_Udp_SndbufErrors[1m]): The rate of UDP send buffer errors.
irate(node_netstat_Tcp_InSegs[1m]): The rate of incoming TCP segments.
irate(node_netstat_Tcp_OutSegs[1m]): The rate of outgoing TCP segments.
irate(node_netstat_TcpExt_ListenOverflows[1m]): The rate of TCP listen overflows.
irate(node_netstat_TcpExt_ListenDrops[1m]): The rate of TCP listen drops.
irate(node_netstat_TcpExt_TCPSynRetrans[1m]): The rate of TCP SYN retransmissions.
irate(node_netstat_Tcp_RetransSegs[1m]): The rate of TCP retransmitted segments.
irate(node_netstat_Tcp_InErrs[1m]): The rate of TCP input errors.
irate(node_netstat_Tcp_OutRsts[1m]): The rate of TCP reset segments sent.
irate(node_netstat_TcpExt_TCPRcvQDrop[1m]): The rate of TCP receive queue drops.
irate(node_netstat_TcpExt_TCPOFOQueue[1m]): The rate of TCP segments placed in the out-of-order queue.
node_netstat_Tcp_CurrEstab: Current established TCP connections.
node_netstat_Tcp_MaxConn: Maximum TCP connections allowed.
irate(node_netstat_TcpExt_SyncookiesFailed[1m]): The rate of failed SYN cookies.
irate(node_netstat_TcpExt_SyncookiesRecv[1m]): The rate of received SYN cookies.
irate(node_netstat_TcpExt_SyncookiesSent[1m]): The rate of sent SYN cookies.
irate(node_netstat_Tcp_ActiveOpens[1m]): The rate of active TCP connection opens.
irate(node_netstat_Tcp_PassiveOpens[1m]): The rate of passive TCP connection opens.
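The entries above use both rate() and irate(). The following is a minimal Python sketch of the difference, assuming a counter that never resets inside the window and ignoring the extrapolation Prometheus applies at window boundaries. The helper names simple_rate and simple_irate and the sample values are hypothetical, not part of Prometheus.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 15 s, with a burst in the last interval.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 190.0), (60, 400.0)]
print(simple_rate(window))   # the burst is averaged over the whole window
print(simple_irate(window))  # only the last interval is considered
```

In this sketch rate() smooths the burst across the minute, while irate() reacts to it immediately, which is why irate() suits fast-moving graphs and rate() is usually preferred for alerting.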
Pressure panel
This panel shows the percentage of time the system’s resources (CPU, memory, and I/O) are under pressure or experiencing wait times.
CPU Busy panel
It displays the percentage of time the CPU is actively working, indicating how much of the CPU’s capacity is being utilized.
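One common way to derive a busy percentage is to subtract the idle share from 100, which is also the form the HostHighCpuLoad expression in this document uses. A minimal sketch, assuming the per-core idle rates have already been computed; the function name and the sample values are hypothetical.

```python
from typing import Dict


def cpu_busy_percent(idle_rate_per_core: Dict[str, float]) -> float:
    """Mirror 100 - avg(rate(node_cpu_seconds_total{mode="idle"})) * 100.

    Each value is the per-second rate of one core's idle counter,
    i.e. the fraction of each second that core spent idle (0.0 to 1.0).
    """
    avg_idle = sum(idle_rate_per_core.values()) / len(idle_rate_per_core)
    return 100 - avg_idle * 100


# Hypothetical four-core host: two cores mostly idle, two mostly busy.
rates = {"0": 0.75, "1": 0.75, "2": 0.25, "3": 0.25}
print(cpu_busy_percent(rates))
```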
Sys Load panel
This panel displays the system load as a percentage of total CPU capacity. It helps to understand how much work the system is handling compared to the number of CPU cores available.
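The calculation behind this panel can be sketched as the 1-minute load average divided by the core count. This is an assumption about how the panel is built, and the function name is hypothetical.

```python
def sys_load_percent(load1: float, cpu_cores: int) -> float:
    """Express the 1-minute load average as a percentage of the core count."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return load1 / cpu_cores * 100


print(sys_load_percent(2.0, 4))  # half of the available cores are in demand
print(sys_load_percent(6.0, 4))  # above 100: more runnable work than cores
```

Values above 100 mean that processes are queueing for CPU time.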
RAM Used panel
This panel shows how much of the total RAM is currently being used by the system. It calculates the used RAM as a percentage of the total RAM.
SWAP Used panel
This panel shows how much of the total swap memory is currently being used by the system.
Root FS Used panel
It displays the percentage of the root filesystem that is currently being used. This helps in monitoring disk usage on the root filesystem of the node.
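A used percentage can be derived from node_filesystem_avail_bytes and node_filesystem_size_bytes. The sketch below assumes the panel uses the "100 minus available share" form; the function name and the sizes are hypothetical.

```python
def fs_used_percent(avail_bytes: float, size_bytes: float) -> float:
    """Percentage of the filesystem that is not available to non-root users."""
    return 100 - (avail_bytes * 100) / size_bytes


GIB = 2**30
# Hypothetical root filesystem: 100 GiB in total, 25 GiB still available.
print(fs_used_percent(25 * GIB, 100 * GIB))
```

Because node_filesystem_avail_bytes excludes blocks reserved for root, the result counts that reserved space as used.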
CPU Cores panel
It displays the total number of CPU cores available on the node.
Root FS Total Panel
It displays the total size of the root filesystem on the node. This helps in understanding the total disk capacity available in the root filesystem.
Uptime Panel
It displays the uptime of the node, indicating how long the node has been running since its last boot.
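Uptime can be derived from two metrics listed earlier, node_time_seconds and node_boot_time_seconds. A minimal sketch with hypothetical timestamps:

```python
from datetime import timedelta


def uptime(node_time_seconds: float, node_boot_time_seconds: float) -> timedelta:
    """Uptime is the node's current time minus its boot time."""
    return timedelta(seconds=node_time_seconds - node_boot_time_seconds)


# Hypothetical Unix timestamps: the node booted 3 days and 2 hours before "now".
boot = 1_700_000_000
now = boot + 3 * 86_400 + 2 * 3_600
print(uptime(now, boot))
```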
CPU Basic Panel
It shows the average system CPU usage per core on the node.
Memory Basic Panel
It provides insights into memory usage on the node, including total memory, used memory, cached memory, buffers, and free memory.
Network Traffic Basic Panel
It shows the incoming and outgoing network traffic on the node in bits per second (bps).
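The network counters are in bytes, so a bits-per-second panel has to multiply the byte rate by 8. A minimal sketch of that conversion; the function names and the formatting choice (SI prefixes in powers of 1000) are assumptions for illustration.

```python
def bytes_per_sec_to_bits_per_sec(bytes_per_sec: float) -> float:
    """Convert a byte rate, e.g. from rate(node_network_receive_bytes_total[1m])."""
    return bytes_per_sec * 8


def human_bps(bits_per_sec: float) -> str:
    """Format a bit rate using SI prefixes (powers of 1000)."""
    for unit in ("b/s", "kb/s", "Mb/s", "Gb/s"):
        if bits_per_sec < 1000:
            return f"{bits_per_sec:.1f} {unit}"
        bits_per_sec /= 1000
    return f"{bits_per_sec:.1f} Tb/s"


# Hypothetical receive rate of 1.25 MB/s.
print(human_bps(bytes_per_sec_to_bits_per_sec(1_250_000)))
```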
Disk Space Used Basic Panel
It shows the percentage of disk space used on non-root filesystems.
Hardware Temperature Monitor Panel
It monitors various temperature metrics of hardware components on the node.
Throttle Cooling Device Panel
It monitors the current and maximum states of cooling devices on the node.
Power Supply Panel
It monitors the online status of power supplies on the node.
Filesystem Space Available Panel
It shows the available space in bytes on the filesystem excluding rootfs.
This panel provides information about the available and free space on the filesystem, helping monitor storage capacity.
File Nodes Free Panel
It displays the number of free file nodes (inodes) on the filesystem excluding rootfs.
This panel monitors the availability of file nodes, which are important for storing metadata about files.
File Descriptor Panel
It shows the maximum and currently allocated file descriptors for the node.
A file descriptor is a unique identifier that the operating system assigns to an open file or other input/output resource.
File descriptors are used by applications to access files, pipes, sockets, and other I/O resources. This panel tracks their usage and limits.
File Nodes Size Panel
It displays the total number of file nodes (inodes) on the filesystem excluding rootfs.
File nodes (inodes) represent individual files or directories on the filesystem. This panel monitors the total count of these nodes.
Filesystem in ReadOnly / Error Panel
It indicates if the filesystem excluding tmpfs is in a read-only state or has encountered errors.
This panel alerts when the filesystem is in a read-only state or has encountered device errors, helping to identify potential storage issues.
Netstat IP In / Out Octets Panel
It displays the rate of incoming and outgoing octets (bytes) for IP traffic.
This panel provides insights into the data traffic in bytes for incoming and outgoing IP packets, crucial for monitoring network throughput.
Netstat IP Forwarding Panel
This panel shows the rate of IP packets forwarded by the node, which helps in understanding network routing activity.
ICMP In / Out Panel
It indicates the rate of incoming and outgoing ICMP (Internet Control Message Protocol) messages.
This panel tracks the rate of ICMP messages exchanged by the node, which are essential for network troubleshooting and diagnostics.
ICMP Errors Panel
It displays the rate of ICMP message errors received by the node.
UDP In / Out Panel
It shows the rate of incoming and outgoing UDP (User Datagram Protocol) datagrams.
This panel monitors the rate of UDP datagrams (packets) being received and sent by the node, crucial for real-time data transmission.
UDP Errors Panel
It displays various error rates related to UDP traffic.
This panel tracks different types of errors related to UDP traffic, aiding in the diagnosis and troubleshooting of UDP-related issues in the network.
TCP In / Out Panel
It shows the rate of incoming and outgoing TCP segments.
This panel monitors the rate at which TCP segments are received and sent by the node, providing insights into TCP traffic volume.
TCP Errors Panel
It displays various error rates related to TCP traffic.
This panel tracks different types of errors related to TCP traffic, aiding in diagnosing and troubleshooting TCP-related issues in the network.
TCP Connections Panel
It displays current and maximum TCP connections.
This panel shows the number of current established TCP connections and the maximum number of connections the node can handle, helping to monitor connection capacity and load.
TCP SynCookie Panel
It displays the rate of SYN cookies sent, received, and failed.
This panel monitors SYN cookies used to protect against SYN flood attacks, showing how many SYN cookies were sent, received, and failed.
TCP Direct Transition Panel
It shows the rate of active and passive TCP connection opens.
CPU / Memory / Net / Disk
Memory Meminfo
Memory Vmstat
System Timesync
System Processes
System Misc
Systemd
Storage Disk
Storage Filesystem
Network Traffic
Network Sockstat
Node Exporter
Node down
sum(up{job="node_exporter"}) < 7
HostOutOfMemory
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
HostMemoryUnderMemoryPressure
rate(node_vmstat_pgmajfault[1m]) > 1000
HostUnusualNetworkThroughputIn
sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
HostUnusualNetworkThroughputOut
sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
HostUnusualDiskReadRate
sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
HostUnusualDiskWriteRate
sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
HostOutOfDiskSpace
(node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
HostDiskWillFillIn24Hours
(node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
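PromQL's predict_linear() fits a linear regression to the samples in the range and extrapolates it. The sketch below reproduces that idea with ordinary least squares, as a simplification: it extrapolates from the last sample's timestamp rather than from the evaluation time. The sample data (in GiB, for readability) is hypothetical.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample's timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical available space over one hour: 20 GiB falling by 1 GiB every 10 minutes.
history = [(i * 600.0, 20.0 - i) for i in range(7)]
predicted = predict_linear(history, 24 * 3600)
print(predicted < 0)  # a negative prediction means the disk would fill within 24 hours
```

In the alert expression above, this prediction is only one of three conditions: the filesystem must also have less than 10% space available and must not be read-only.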
HostOutOfInodes
- Expression: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
- Summary: Host out of inodes
- Description: Disk is almost running out of available inodes (< 10% left)
HostInodesWillFillIn24Hours
- Expression: node_filesystem_files_free{mountpoint ="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{mountpoint="/rootfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
- Summary: Host inodes will fill in 24 hours
- Description: Filesystem is predicted to run out of inodes within the next 24 hours at current write rate
HostUnusualDiskReadLatency
- Expression: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
- Summary: Host unusual disk read latency
- Description: Disk latency is growing (read operations > 100ms)
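Dividing the rate of time spent reading by the rate of completed reads yields the average time per read operation. A minimal sketch of that arithmetic, including the guard against idle disks that the second half of the expression provides; the function names and sample rates are hypothetical.

```python
from typing import Optional


def avg_read_latency(read_time_rate: float, reads_completed_rate: float) -> Optional[float]:
    """Seconds spent reading per second, divided by reads per second,
    gives average seconds per read. Returns None when the disk is idle."""
    if reads_completed_rate <= 0:
        return None
    return read_time_rate / reads_completed_rate


def should_alert(latency: Optional[float], threshold: float = 0.1) -> bool:
    """Fire only when there were reads and they averaged above the threshold."""
    return latency is not None and latency > threshold


# Hypothetical disk: 0.5 s of read time per second across 4 reads per second.
latency = avg_read_latency(0.5, 4.0)
print(latency, should_alert(latency))
```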
HostUnusualDiskWriteLatency
- Expression: rate(node_disk_write_time_seconds_total{device!~"mmcblk.+"}[1m]) / rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0.1 and rate(node_disk_writes_completed_total{device!~"mmcblk.+"}[1m]) > 0
- Summary: Host unusual disk write latency
- Description: Disk latency is growing (write operations > 100ms)
HostHighCpuLoad
- Expression: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
- Summary: Host high CPU load
- Description: CPU load is > 80%
HostCpuStealNoisyNeighbor
- Expression: avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
- Summary: Host CPU steal noisy neighbor
- Description: CPU steal is > 10%. A noisy neighbor is killing VM performance, or a spot instance may be out of credit.
HostContextSwitching
- Expression: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
- Summary: Host context switching
- Description: Context switching is growing on node (> 1000 / s)
HostSwapIsFillingUp
- Expression: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
- Summary: Host swap is filling up
- Description: Swap is filling up (>80%)
HostSystemdServiceCrashed
- Expression: node_systemd_unit_state{state="failed"} == 1
- Summary: Host systemd service crashed
- Description: A systemd service has crashed
HostPhysicalComponentTooHot
- Expression: node_hwmon_temp_celsius > 75
- Summary: Host physical component too hot
- Description: Physical hardware component too hot
HostNodeOvertemperatureAlarm
- Expression: node_hwmon_temp_crit_alarm_celsius == 1
- Summary: Host node overtemperature alarm
- Description: Physical node temperature alarm triggered
HostRaidArrayGotInactive
- Expression: node_md_state{state="inactive"} > 0
- Summary: Host RAID array got inactive
- Description: RAID array is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
HostRaidDiskFailure
- Expression: node_md_disks{state="failed"} > 0
- Summary: Host RAID disk failure
- Description: At least one device in the RAID array has failed. The array needs attention and possibly a disk swap.
HostKernelVersionDeviations
count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
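The expression above extracts the leading major.minor.patch part of each node's kernel release and counts how many distinct values exist. A minimal Python sketch of the same idea, with hypothetical release strings. Unlike the expression, this sketch escapes the dots so that they match only a literal ".".

```python
import re
from typing import Iterable, Set

# PromQL regexes are fully anchored, so fullmatch() is the closest equivalent.
KERNEL_RE = re.compile(r"([0-9]+\.[0-9]+\.[0-9]+).*")


def kernel_versions(releases: Iterable[str]) -> Set[str]:
    """Return the distinct major.minor.patch prefixes across all releases."""
    versions = set()
    for release in releases:
        match = KERNEL_RE.fullmatch(release)
        # Releases that do not match are grouped under an empty version.
        versions.add(match.group(1) if match else "")
    return versions


# Hypothetical fleet: two nodes differ only in build suffix, one runs another kernel.
fleet = ["5.15.0-91-generic", "5.15.0-94-generic", "6.1.0-18-amd64"]
found = kernel_versions(fleet)
print(sorted(found), len(found) > 1)
```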
HostOomKillDetected
increase(node_vmstat_oom_kill[1m]) > 0
HostEdacCorrectableErrorsDetected
increase(node_edac_correctable_errors_total[1m]) > 0
HostEdacUncorrectableErrorsDetected
node_edac_uncorrectable_errors_total > 0
HostNetworkReceiveErrors
rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
HostNetworkTransmitErrors
rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
HostNetworkInterfaceSaturated
(rate(node_network_receive_bytes_total{device!~"^tap.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*"} > 0.8
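The saturation check compares combined receive and transmit throughput with the interface speed, all in bytes per second, and skips tap devices. A minimal sketch of that logic; the function names and traffic figures are hypothetical.

```python
import re

TAP_DEVICE = re.compile(r"^tap.*")


def interface_utilisation(rx_bytes_per_sec: float, tx_bytes_per_sec: float,
                          speed_bytes_per_sec: float) -> float:
    """Combined throughput as a fraction of the interface speed."""
    return (rx_bytes_per_sec + tx_bytes_per_sec) / speed_bytes_per_sec


def is_saturated(device: str, rx: float, tx: float, speed: float,
                 threshold: float = 0.8) -> bool:
    """Apply the same tap-device exclusion and 80% threshold as the alert."""
    if TAP_DEVICE.match(device):
        return False
    return interface_utilisation(rx, tx, speed) > threshold


# Hypothetical 1 Gbit/s link, which is 125,000,000 bytes per second.
LINK_SPEED = 125_000_000
print(is_saturated("eth0", 60_000_000, 50_000_000, LINK_SPEED))
```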
HostConntrackLimit
node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8
HostClockSkew
(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
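The clock skew expression fires only when the offset exceeds 50 ms and is not shrinking: a positive offset with a non-negative derivative, or a negative offset with a non-positive derivative. A minimal sketch of that boolean logic, with a hypothetical function name:

```python
def clock_skew_alert(offset_seconds: float, offset_derivative: float,
                     limit: float = 0.05) -> bool:
    """Return True when the clock is off by more than limit and not converging."""
    ahead_not_recovering = offset_seconds > limit and offset_derivative >= 0
    behind_not_recovering = offset_seconds < -limit and offset_derivative <= 0
    return ahead_not_recovering or behind_not_recovering


print(clock_skew_alert(0.08, 0.001))   # ahead and drifting further
print(clock_skew_alert(0.08, -0.001))  # ahead but converging, no alert
```

The derivative check keeps the alert quiet while the time daemon is already correcting a large offset.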
HostClockNotSynchronising
- Expression: min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16
- Summary: Host clock not synchronising
- Description: Alerts when the host clock is not synchronising.