vectorized_application_uptime: Redpanda uptime in milliseconds.
vectorized_io_queue_delay: Total delay time in the I/O queue, in seconds. It indicates latency caused by disk operations.
vectorized_io_queue_length: Number of requests in the I/O queue.
vectorized_kafka_rpc_active_connections: Number of currently active Kafka RPC connections.
vectorized_reactor_utilization: Shows the true utilization of the CPU by Redpanda processes.
vectorized_storage_log_read_bytes: Total number of bytes read
vectorized_storage_log_written_bytes: Total number of bytes written
redpanda_cluster_brokers: Number of configured, fully commissioned brokers in a cluster.
redpanda_cluster_controller_log_limit_requests_dropped: Number of requests dropped by a cluster controller log due to exceeding the limit.
redpanda_cluster_partitions: Number of partitions managed by a cluster. Includes partitions of the controller topic, but not replicas.
redpanda_cluster_topics: Number of topics in a cluster.
The storage, memory, uptime, and latency metrics below are infrastructure metrics:
redpanda_cpu_busy_seconds_total: Total CPU busy time in seconds.
redpanda_memory_allocated_memory: Total allocated memory in bytes.
redpanda_memory_available_memory: Total memory potentially available (free plus reclaimable memory) to a CPU shard (core), in bytes.
redpanda_memory_free_memory: Available memory in bytes.
redpanda_rpc_request_latency_seconds: RPC latency in seconds.
redpanda_storage_disk_free_bytes: Available disk storage in bytes.
redpanda_storage_disk_free_space_alert: Alert for low disk storage: 0-OK, 1-low space, 2-degraded
redpanda_storage_disk_total_bytes: Total size in bytes of attached storage.
redpanda_uptime_seconds_total: Total CPU runtime (uptime) in seconds.
redpanda_rpc_active_connections: Counts the active RPC (Remote Procedure Call) connections in Redpanda, showing the current number of ongoing communications within the system.
redpanda_application_build: Denotes the current version or build identifier of the deployed Redpanda application.
redpanda_kafka_request_latency_seconds: Latency of produce/consume requests per broker. This duration measures from when a request is initiated on the partition to when the response is fulfilled.
redpanda_kafka_partitions: Counts the total number of Kafka partitions managed by Redpanda.
redpanda_kafka_request_bytes_total: Tracks the total number of bytes processed by Kafka requests within Redpanda.
redpanda_raft_leadership_changes: Tracks the number of leadership changes in the Raft protocol within Redpanda.
redpanda_kafka_max_offset: Tracks the highest offset (position) within Kafka topics managed by Redpanda.
redpanda_kafka_under_replicated_replicas: Measures Kafka topic replicas in Redpanda that are currently under-replicated, indicating potential data availability issues.
redpanda_kafka_request_latency_seconds_bucket: Measures the latency distribution of Kafka requests in Redpanda, providing insights into request processing times across different time intervals.
redpanda_kafka_consumer_group_consumers: Number of consumers in a consumer group.
redpanda_application_uptime_seconds_total: Redpanda application uptime in seconds.
Node Up panel
The “Node Up” panel determines the number of Redpanda nodes that are currently up and running. It uses the redpanda_application_uptime_seconds_total metric to count the active nodes within the specified data cluster.
Node uptime panel
The “Node Uptime” panel shows how long each Redpanda node has been running without any interruptions.
Topics panel
The “Topics” panel shows the total number of topics in the Redpanda cluster, grouped by a specific aggregation criterion.
Topic: A named channel or category in messaging systems like Kafka or Redpanda where data records (messages) are published by producers and consumed by subscribers (consumers).
Partitions panel
The “Partitions” panel shows the total number of partitions for all topics in the Redpanda system.
Throughput panel
This panel displays the rate of data being processed by Redpanda, measured in bytes per second.
Leaderships Transfer/5min panel
This panel shows the rate of leadership changes in the Redpanda system over a 5-minute interval.
Partition Balance panel
This panel measures how evenly data is distributed across partitions for each topic in Redpanda. It calculates the percentage imbalance based on the standard deviation and average of maximum offsets. A lower percentage indicates a more balanced distribution of data across partitions.
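The imbalance calculation described above can be illustrated with a small worked example. This is only a sketch of the arithmetic (standard deviation divided by the average of the per-partition maximum offsets, as a percentage); the function name `partition_imbalance_percent` and the sample offsets are hypothetical, and the use of the population standard deviation is an assumption about how the dashboard query aggregates.

```python
from statistics import mean, pstdev


def partition_imbalance_percent(max_offsets):
    """Return 100 * stddev / mean of the per-partition max offsets.

    Returns 0.0 for an empty list or when the mean offset is zero,
    so that an idle topic is not reported as imbalanced.
    """
    if not max_offsets:
        return 0.0
    avg = mean(max_offsets)
    if avg == 0:
        return 0.0
    # Population standard deviation (an assumption, see the text above).
    return 100.0 * pstdev(max_offsets) / avg


# Hypothetical topics with four partitions each.
balanced = partition_imbalance_percent([1000, 1000, 1000, 1000])  # 0.0
skewed = partition_imbalance_percent([4000, 0, 0, 0])             # about 173.2
```

A topic whose partitions all hold the same number of records scores 0%, while a topic with all records in one partition scores well above 100%.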
Under replicated partitions by topic panel
This panel calculates the total number of under-replicated partitions across all topics. An under-replicated partition is one that does not have the necessary number of replicas, which can impact data reliability and availability. The query sums these under-replicated replicas, and if none are found, it returns a zero value.
Leaderless partition panel
This panel calculates the total number of leaderless partitions by identifying partitions marked as unavailable and by comparing the expected number of partitions to those that are under-replicated. A leaderless partition is one that lacks an active leader, impacting the ability to manage read and write operations effectively.
Storage Used panel
The “Storage Used” panel shows the percentage of disk storage currently in use in the Redpanda system.
This helps monitor how much disk space is being utilized, which is crucial for capacity planning and avoiding storage issues.
Allocated memory panel
This panel shows the percentage of allocated memory used in the Redpanda system.
This panel calculates the memory usage by dividing the total allocated memory by the sum of allocated and free memory. It helps monitor the utilization of memory resources in Redpanda, crucial for optimizing performance and ensuring system stability.
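As a rough illustration of that ratio, the sketch below computes the allocated-memory percentage from two byte counts, standing in for the values of redpanda_memory_allocated_memory and redpanda_memory_free_memory. The function name and the sample values are hypothetical.

```python
GIB = 1024 ** 3


def allocated_memory_percent(allocated_bytes, free_bytes):
    """Return allocated / (allocated + free) as a percentage."""
    total = allocated_bytes + free_bytes
    if total == 0:
        return 0.0  # no memory reported; avoid dividing by zero
    return 100.0 * allocated_bytes / total


# Hypothetical node with 3 GiB allocated and 1 GiB free.
usage = allocated_memory_percent(3 * GIB, 1 * GIB)  # 75.0
```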
Storage Health panel
The “Storage Health” panel indicates the current health status of storage based on alert levels in the Redpanda system.
CPU Utilization panel
This panel shows the average utilization of CPU resources in the Redpanda system over a specified interval.
Kafka RPC: Currently active connections panel
This panel shows how many connections are currently active for Kafka services in your Redpanda system. This helps monitor how busy the system is handling requests.
Kafka: A service that manages data streaming and messaging.
Cluster info panel
Displays essential information about Redpanda application build version and uptime.
Produce Latency (p99) panel
This panel calculates the 99th percentile latency (p99) for produce requests in Redpanda Kafka. It looks at how much time the slowest 1% of produce requests are taking to complete. This metric helps understand the worst-case scenario for producing data, which is important for ensuring consistent and reliable performance.
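To make the percentile idea concrete, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets such as those exposed by redpanda_kafka_request_latency_seconds_bucket. It mimics the linear interpolation used by the PromQL function histogram_quantile, but it is a simplified illustration rather than the Prometheus implementation; the bucket boundaries and counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # No finite upper bound (or empty bucket): fall back to
                # the highest bound seen so far.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical produce-latency histogram: 1000 requests in total.
sample_buckets = [
    (0.001, 900),
    (0.01, 980),
    (0.1, 995),
    (1.0, 1000),
    (math.inf, 1000),
]
p95 = histogram_quantile(0.95, sample_buckets)  # about 0.0066 seconds
p99 = histogram_quantile(0.99, sample_buckets)  # about 0.07 seconds
```

In this sample, 95% of requests finish in under 7 ms, yet the p99 value is roughly ten times higher, which is why the p99 panel is the one to watch for worst-case behavior.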
IO Queue Write Operations panel
This panel shows the rate of write operations queued in the input/output (IO) system of the Redpanda nodes.
Bytes Received via Kafka RPC panel
Displays the rate of data bytes received through Kafka Remote Procedure Call (RPC) requests in the Redpanda system.
Fetch Latency panel
This panel tracks how long it takes for the system to respond to requests that fetch data (consume operations) from the Redpanda Kafka service.
IO Queue Read Operations panel
Displays how many read operations are waiting in the input/output (IO) system of the Redpanda nodes. It helps to monitor the workload and efficiency of the IO system.
Bytes Sent via Kafka RPC panel
Displays how much data is being sent via Kafka RPC requests. It helps to monitor the outgoing data flow through Kafka RPC, providing insights into data distribution rates and workload.
Allocated Memory panel
This panel measures how effectively memory is allocated on each Redpanda node.
It calculates the ratio of allocated memory (currently in use) to the total available memory (both allocated and free) on each node.
Available Memory panel
This panel displays the maximum amount of free memory available on each Redpanda node.
Available memory low watermark panel
This panel indicates the minimum threshold of available memory that should be maintained on each Redpanda node to avoid performance degradation.
Produce Latency (p95) panel
This panel monitors the 95th percentile latency (delay) experienced when producing (publishing) data into the Redpanda Kafka system.
It indicates how quickly or slowly produce requests are being processed in the Redpanda Kafka cluster.
Produce Latency (p99) panel
This panel displays the 99th percentile latency for produce requests (data publishing) in the Redpanda Kafka system.
Fetch Latency (p95) panel
This panel calculates the 95th percentile latency (p95) for fetch requests (data retrieval) in Redpanda Kafka.
Monitoring this metric helps assess the performance of data retrieval operations, ensuring timely fetching of messages from the Kafka system.
Fetch Latency (p99) panel
This panel in Grafana measures and displays the 99th percentile latency for fetch requests (data retrieval) in a Redpanda Kafka system.
Kafka RPC: Latency (p95) panel
This panel provides insights into the performance of Remote Procedure Call (RPC) requests within your Redpanda Kafka system, specifically focusing on the latency experienced by the slowest 5% of these RPC requests.
Kafka RPC: Latency (p99) panel
This panel shows the 99th percentile latency of Kafka RPC requests: 99% of requests complete within this time, and only the slowest 1% take longer.
Rate - Total number of bytes written panel
This panel shows the rate at which data is being written to the Kafka system. It provides insights into the total number of bytes written per second.
Rate - Total number of bytes read panel
This panel shows the rate at which data is being read from the Kafka system. It provides insights into the total number of bytes read per second.
Leadership Transfers panel
This panel shows the rate at which leadership changes occur within the Redpanda nodes. This metric helps you understand how often the leader role is transferred among nodes.
Leadership Transfer: In a distributed system like Redpanda, different nodes take on the role of leader to manage and coordinate tasks. A leadership transfer occurs when the leader role is passed from one node to another.
internal_rpc: Latency (p95) panel
This panel helps you monitor the performance of internal communication between Redpanda nodes by showing the 95th percentile latency of internal RPC requests.
95% of requests complete within this time; only the slowest 5% take longer.
internal_rpc: Latency (p99) panel
This panel helps you monitor the performance of internal communication between Redpanda nodes by showing the 99th percentile latency of internal RPC requests.
99% of requests complete within this time; only the slowest 1% take longer.
CPU utilization (Reactor) panel
This panel shows the average CPU utilization of the Redpanda nodes using the Reactor framework. This metric provides insights into how much CPU resources are being utilized by Redpanda processes.
Reactor refers to the underlying framework or mechanism used for managing concurrency and handling events efficiently.
Brokers are down
((sum(max_over_time(redpanda_cluster_brokers[30d])) - sum(up{job='redpanda'})) > 0)
Brokers are down (alternative)
((max_over_time(count(redpanda_application_uptime_seconds_total)[1w:])) - (count(redpanda_application_uptime_seconds_total) or on () vector(0))) > 0
Storage is degraded
(redpanda_storage_disk_free_space_alert) > 1
Storage - there is less than 1 GiB of free space
(redpanda_storage_disk_free_bytes) < 1073741824
Leaderless Partitions
(redpanda_cluster_unavailable_partitions) > 0
Low Memory
(redpanda_memory_available_memory) < 1073741824
Storage Alert - Low Space
(redpanda_storage_disk_free_space_alert) > 0
Under-replicated partitions
(redpanda_kafka_under_replicated_replicas) > 0
Storage space is predicted to be less than 1 GiB in 30 minutes
(predict_linear(redpanda_storage_disk_free_bytes[1h], 1800)) < 1073741824
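The predictive alert above relies on extrapolating a linear trend. The sketch below shows the idea with an ordinary least-squares fit over (timestamp, value) samples; it is an illustration of the technique rather than the exact Prometheus implementation of predict_linear, and it assumes the prediction is made relative to the last sample. The sample data (a disk losing 1 MiB of free space per second) is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def predict_linear(samples, horizon_seconds):
    """Fit a line to (timestamp_seconds, value) samples and return the
    fitted value horizon_seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# One hour of hypothetical samples, one every 10 minutes:
# free space starts at 6 GiB and shrinks by 1 MiB per second.
samples = [(t, 6 * GIB - t * MIB) for t in range(0, 3601, 600)]

current_free = samples[-1][1]                 # 2544 MiB, still above 1 GiB
predicted_free = predict_linear(samples, 1800)  # about 744 MiB
alert_fires = predicted_free < 1 * GIB          # True
```

The point of the predictive form is visible here: the current value is still comfortably above the threshold, but the trend crosses it within the 1800-second horizon.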
Memory is predicted to be less than 1 GiB in 30 minutes
- Expression: (predict_linear(redpanda_memory_available_memory[30m], 1800)) < 1073741824
- Summary: Memory has been consistently predicted to be less than 1 GiB (in 30 minutes), for over 5 minutes.
- Description: Predicts memory will be below 1 GiB in 30 minutes (the 1800-second horizon in the expression).
More than 1% of Schema Registry requests result in an error
- Expression: (100 * (sum by (instance) (rate(redpanda_schema_registry_request_errors_total[5m])) / sum by (instance) (rate(redpanda_schema_registry_request_latency_seconds_count[5m])))) > 1
- Summary: More than 1% of Schema Registry requests result in an error, for over 5 minutes.
- Description: Indicates high error rate for Schema Registry requests.
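The error-percentage expressions in this group all follow the same pattern: errors per second divided by requests per second, times 100. Because both rates cover the same 5-minute window, the ratio reduces to the increase in errors over the increase in requests. The sketch below shows that arithmetic; the function name and counts are hypothetical, and returning 0.0 when there were no requests is a simplification (PromQL would yield NaN for 0/0, which also does not fire the alert).

```python
def error_percent(error_increase, request_increase):
    """Return the share of requests that failed in a window, in percent."""
    if request_increase == 0:
        return 0.0  # simplification: no traffic means no alert
    return 100.0 * error_increase / request_increase


THRESHOLD_PERCENT = 1.0

# Hypothetical 5-minute windows on two instances.
healthy = error_percent(error_increase=5, request_increase=1000)   # 0.5
failing = error_percent(error_increase=12, request_increase=800)   # 1.5

healthy_fires = healthy > THRESHOLD_PERCENT  # False
failing_fires = failing > THRESHOLD_PERCENT  # True
```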
More than 1% of RPC requests result in an error
- Expression: (100 * (sum by (instance) (rate(redpanda_rpc_request_errors_total[5m])) / sum by (instance) (rate(redpanda_rpc_request_latency_seconds_count[5m])))) > 1
- Summary: More than 1% of RPC requests result in an error, for over 5 minutes.
- Description: Alerts when RPC request error rate is high.
More than 1% of REST requests result in an error
- Expression: (100 * (sum by (instance) (rate(redpanda_rest_proxy_request_errors_total[5m])) / sum by (instance) (rate(redpanda_rest_proxy_request_latency_seconds_count[5m])))) > 1
- Summary: More than 1% of REST requests result in an error, for over 5 minutes.
- Description: Indicates high error rate for REST requests.
More than 1% of Raft RPC requests result in an error
- Expression: (100 * (sum by (instance) (rate(redpanda_node_status_rpcs_timed_out[5m])) / sum by (instance) (rate(redpanda_node_status_rpcs_sent[5m])))) > 1
- Summary: More than 1% of Raft RPC requests result in an error, for over 5 minutes.
- Description: Alerts when Raft RPC error rate is high.
Raft leadership is continually changing
- Expression: (rate(redpanda_raft_leadership_changes[1m])) > 0
- Summary: Raft leadership is continually changing, rather than settling into a stable distribution, for over 5 minutes.
- Description: Indicates frequent changes in Raft leadership.
Kafka request latency is too high
- Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_kafka_request_latency_seconds_bucket[5m])))) > 0.1
- Summary: Kafka request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
- Description: Alerts when Kafka request latency is high.
RPC request latency is too high
- Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_rpc_request_latency_seconds_bucket[5m])))) > 0.1
- Summary: RPC request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
- Description: Indicates high RPC request latency.
REST request latency is too high
- Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_rest_proxy_request_latency_seconds_bucket[5m])))) > 0.1
- Summary: REST request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
- Description: Alerts when REST request latency is high.
Schema Registry request latency is too high
- Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_schema_registry_request_latency_seconds_bucket[5m])))) > 0.1
- Summary: Schema Registry request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
- Description: Indicates high Schema Registry request latency.
Storage - there is less than 10 GiB of free space
- Expression: (redpanda_storage_disk_free_bytes) < 10737418240
- Summary: There is less than 10 GiB free space available for more than 5 minutes.
- Description: Alerts when storage space is below 10 GiB.
Schema Registry errors are increasing
- Expression: (increase(redpanda_schema_registry_request_errors_total[1m])) > 0
- Summary: Schema Registry errors are increasing for more than 5 minutes.
- Description: Indicates increasing errors in Schema Registry.
RPC errors are increasing
- Expression: (increase(redpanda_rpc_request_errors_total[1m])) > 0
- Summary: RPC errors are increasing for more than 5 minutes.
- Description: Alerts when RPC errors are increasing.
REST Proxy errors are increasing
- Expression: (increase(redpanda_rest_proxy_request_errors_total[1m])) > 0
- Summary: REST Proxy errors are increasing for more than 5 minutes.
- Description: Indicates increasing errors in REST Proxy.
Raft RPC errors are increasing
- Expression: (increase(redpanda_node_status_rpcs_timed_out[1m]) OR on() vector(0)) > 0
- Summary: Raft RPC errors are increasing for more than 5 minutes.
- Description: Alerts when Raft RPC errors are increasing.
Cluster administration monitoring: https://docs.redpanda.com/21.11/cluster-administration/monitoring/
Public metrics reference: https://docs.redpanda.com/current/reference/public-metrics-reference/
Redpanda observability: https://github.com/redpanda-data/observability