Redpanda Exporter

Metrics

Internal metrics

  • vectorized_application_uptime: Redpanda uptime in milliseconds.

  • vectorized_io_queue_delay: Total delay time in the I/O queue, in seconds. It indicates latency caused by disk operations.

  • vectorized_io_queue_length: Number of requests in the I/O queue.

  • vectorized_kafka_rpc_active_connections: Number of currently active Kafka RPC connections.

  • vectorized_reactor_utilization: Actual CPU utilization of the Redpanda process.

  • vectorized_storage_log_read_bytes: Total number of bytes read

  • vectorized_storage_log_written_bytes: Total number of bytes written

Cluster metrics

  • redpanda_cluster_brokers: Number of configured, fully commissioned brokers in a cluster.

  • redpanda_cluster_controller_log_limit_requests_dropped: Number of requests dropped by a cluster controller log due to exceeding its limit.

  • redpanda_cluster_partitions: Number of partitions managed by a cluster. Includes partitions of the controller topic, but not replicas.

  • redpanda_cluster_topics: Number of topics in a cluster.

Infrastructure metrics

Infrastructure metrics cover storage, memory, uptime, and latency.

  • redpanda_cpu_busy_seconds_total: Total CPU busy time in seconds.

  • redpanda_memory_allocated_memory: Total allocated memory in bytes.

  • redpanda_memory_available_memory: Total memory potentially available (free plus reclaimable memory) to a CPU shard (core), in bytes.

  • redpanda_memory_free_memory: Available memory in bytes.

  • redpanda_rpc_request_latency_seconds: RPC latency in seconds.

  • redpanda_storage_disk_free_bytes: Available disk storage in bytes.

  • redpanda_storage_disk_free_space_alert: Alert for low disk storage: 0-OK, 1-low space, 2-degraded

  • redpanda_storage_disk_total_bytes: Total size in bytes of attached storage.

  • redpanda_uptime_seconds_total: Total CPU runtime (uptime) in seconds.

  • redpanda_rpc_active_connections: Counts the active RPC (Remote Procedure Call) connections in Redpanda, showing the current number of ongoing communications within the system.

  • redpanda_application_build: Denotes the current version or build identifier of the deployed Redpanda application.

Broker metrics

  • redpanda_kafka_request_latency_seconds: Latency of produce/consume requests per broker. This duration measures from when a request is initiated on the partition to when the response is fulfilled.

  • redpanda_kafka_partitions: Counts the total number of Kafka partitions managed by Redpanda.

  • redpanda_kafka_request_bytes_total: Tracks the total number of bytes processed by Kafka requests within Redpanda.

  • redpanda_raft_leadership_changes: Tracks the number of leadership changes in the Raft protocol within Redpanda.

  • redpanda_kafka_max_offset: Tracks the highest offset (position) within Kafka topics managed by Redpanda.

  • redpanda_kafka_under_replicated_replicas: Measures Kafka topic replicas in Redpanda that are currently under-replicated, indicating potential data availability issues.

  • redpanda_kafka_request_latency_seconds_bucket: Measures the latency distribution of Kafka requests in Redpanda, providing insights into request processing times across different time intervals.

Consumer group metrics

  • redpanda_kafka_consumer_group_consumers: Number of consumers in a consumer group.

  • redpanda_application_uptime_seconds_total: Redpanda application uptime in seconds.


Grafana Dashboard Panels

Redpanda Summary Row

Node Up panel

The “Node Up” panel determines the number of Redpanda nodes that are currently up and running. It uses the redpanda_application_uptime_seconds_total metric to count the active nodes within the specified data cluster.

  • Metric Used: redpanda_application_uptime_seconds_total

Node uptime panel

The “Node Uptime” panel shows how long each Redpanda node has been running without any interruptions.

  • Metric Used: redpanda_application_uptime_seconds_total

Topics panel

The “Topics” panel shows the total number of topics in the Redpanda cluster, grouped by a specific aggregation criterion.

  • Metric Used: redpanda_cluster_topics

Topic: A named channel or category in messaging systems like Kafka or Redpanda where data records (messages) are published by producers and consumed by subscribers (consumers).

Partitions panel

The “Partitions” panel shows the total number of partitions for all topics in the Redpanda system.

  • Metric Used: redpanda_kafka_partitions

Throughput panel

This panel displays the rate of data being processed by Redpanda, measured in bytes per second.

  • Metric Used: redpanda_kafka_request_bytes_total

Leaderships Transfer/5min panel

This panel shows the rate of leadership changes in the Redpanda system over a 5-minute interval.

  • Metric Used: redpanda_raft_leadership_changes

Partition Balance panel

This panel measures how evenly data is distributed across partitions for each topic in Redpanda. It calculates the percentage imbalance based on the standard deviation and average of maximum offsets. A lower percentage indicates a more balanced distribution of data across partitions.

  • Metric Used: redpanda_kafka_max_offset
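
As a rough illustration of the calculation described above, the following Python sketch computes an imbalance percentage from per-partition maximum offsets. It assumes the panel divides the population standard deviation by the mean; the function name and sample values are hypothetical, and the dashboard's actual query may differ.

```python
from statistics import mean, pstdev


def partition_imbalance_percent(max_offsets):
    """Return stddev / mean of per-partition max offsets, as a percentage.

    Hypothetical helper: assumes population standard deviation.
    """
    if not max_offsets:
        return 0.0
    avg = mean(max_offsets)
    if avg == 0:
        return 0.0
    return 100.0 * pstdev(max_offsets) / avg


# Three partitions with identical offsets are perfectly balanced.
balanced = partition_imbalance_percent([1000, 1000, 1000])

# One partition far ahead of the others yields a high percentage.
skewed = partition_imbalance_percent([100, 1000, 1900])

print(f"balanced: {balanced:.1f}%  skewed: {skewed:.1f}%")
```

A value near zero suggests that producers spread records evenly across partitions; a high value can point to a skewed partitioning key.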

Under replicated partitions by topic panel

This panel calculates the total number of under-replicated partitions across all topics. An under-replicated partition is one that does not have the necessary number of replicas, which can impact data reliability and availability. The query sums these under-replicated replicas, and if none are found, it returns a zero value.

  • Metric Used: redpanda_kafka_under_replicated_replicas

Leaderless partition panel

This panel calculates the total number of leaderless partitions by identifying partitions marked as unavailable and by comparing the expected number of partitions to those that are under-replicated. A leaderless partition is one that lacks an active leader, impacting the ability to manage read and write operations effectively.

  • Metric Used: redpanda_cluster_unavailable_partitions, redpanda_kafka_partitions, redpanda_kafka_under_replicated_replicas

Storage Used panel

The “Storage Used” panel shows the percentage of disk storage currently in use in the Redpanda system.

This helps monitor how much disk space is being utilized, which is crucial for capacity planning and avoiding storage issues.

  • Metric Used: redpanda_storage_disk_free_bytes, redpanda_storage_disk_total_bytes
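
The percentage can be derived from the two metrics listed above. The sketch below is one plausible way to do it, assuming the panel computes used space as total minus free; the function name and sample values are hypothetical.

```python
GIB = 1024 ** 3


def storage_used_percent(free_bytes, total_bytes):
    """Percentage of disk in use, given free and total bytes.

    Hypothetical helper: inputs correspond to redpanda_storage_disk_free_bytes
    and redpanda_storage_disk_total_bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (1.0 - free_bytes / total_bytes)


# A 100 GiB disk with 25 GiB free is 75% used.
used = storage_used_percent(25 * GIB, 100 * GIB)
print(f"storage used: {used:.1f}%")
```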

Allocated memory panel

This panel shows the percentage of allocated memory used in the Redpanda system.

This panel calculates the memory usage by dividing the total allocated memory by the sum of allocated and free memory. It helps monitor the utilization of memory resources in Redpanda, crucial for optimizing performance and ensuring system stability.

  • Metric Used: redpanda_memory_allocated_memory, redpanda_memory_free_memory
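
The ratio described above can be sketched in a few lines of Python. The function name and sample values are hypothetical; only the formula (allocated divided by allocated plus free) comes from the panel description.

```python
GIB = 1024 ** 3


def allocated_memory_percent(allocated_bytes, free_bytes):
    """Allocated memory as a percentage of allocated plus free memory.

    Hypothetical helper: inputs correspond to redpanda_memory_allocated_memory
    and redpanda_memory_free_memory.
    """
    total = allocated_bytes + free_bytes
    if total <= 0:
        raise ValueError("allocated_bytes + free_bytes must be positive")
    return 100.0 * allocated_bytes / total


# 2 GiB allocated and 6 GiB free means 25% of memory is allocated.
pct = allocated_memory_percent(2 * GIB, 6 * GIB)
print(f"allocated memory: {pct:.1f}%")
```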

Storage Health panel

The “Storage Health” panel indicates the current health status of storage based on alert levels in the Redpanda system.

  • 0 (OK): Indicates that storage is functioning within normal parameters.
  • 1 (Low): Indicates low disk space, which may require attention to prevent issues.
  • 2 (Degraded): Indicates degraded storage conditions, potentially affecting performance or reliability.
  • Metric Used: redpanda_storage_disk_free_space_alert

CPU Utilization panel

This panel shows the average utilization of CPU resources in the Redpanda system over a specified interval.

  • Metric Used: redpanda_cpu_busy_seconds_total

Kafka RPC: Currently active connections panel

This panel shows how many connections are currently active for Kafka services in your Redpanda system. This helps monitor how busy the system is handling requests.

Kafka: A service that manages data streaming and messaging.

  • Metric Used: redpanda_rpc_active_connections

Cluster info panel

Displays essential information about Redpanda application build version and uptime.

  • Metrics Used: redpanda_application_build, redpanda_application_uptime_seconds_total

Produce Latency (p99) panel

This panel calculates the 99th percentile latency (p99) for produce requests in Redpanda Kafka. It looks at how much time the slowest 1% of produce requests are taking to complete. This metric helps understand the worst-case scenario for producing data, which is important for ensuring consistent and reliable performance.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket
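
To make the percentile calculation concrete, the sketch below estimates a quantile from cumulative histogram buckets such as those exposed by a `_bucket` series. It follows the linear interpolation approach that Prometheus documents for histogram_quantile, but it is a simplified illustration rather than the panel's actual query: it works on raw cumulative counts instead of rates, assumes the lowest bucket starts at 0, and uses hypothetical bucket values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with the last upper bound equal to +infinity.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets: 1 ms, 10 ms, 100 ms, 1 s, +Inf.
sample_buckets = [
    (0.001, 50),
    (0.01, 900),
    (0.1, 990),
    (1.0, 1000),
    (math.inf, 1000),
]

p95 = histogram_quantile(0.95, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p95 = {p95:.3f} s, p99 = {p99:.3f} s")
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the percentile of interest.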

IO Queue Write Operations panel

This panel shows the rate of write operations queued in the input/output (IO) system of the Redpanda nodes.

  • Metric Used: redpanda_io_queue_total_write_ops

Bytes Received via Kafka RPC panel

Displays the rate of data bytes received through Kafka Remote Procedure Call (RPC) requests in the Redpanda system.

  • Metric Used: redpanda_kafka_request_bytes_total

Fetch Latency panel

This panel tracks how long it takes for the system to respond to requests that fetch data (consume operations) from the Redpanda Kafka service.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

IO Queue Read Operations panel

Displays how many read operations are waiting in the input/output (IO) system of the Redpanda nodes. It helps to monitor the workload and efficiency of the IO system.

  • Metric Used: redpanda_io_queue_total_read_ops

Bytes Sent via Kafka RPC panel

Displays how much data is being sent via Kafka RPC requests. It helps to monitor the outgoing data flow through Kafka RPC, providing insights into data distribution rates and workload.

  • Metric Used: redpanda_kafka_request_bytes_total

Redpanda Memory Row

Allocated Memory panel

The panel measures how effectively memory is allocated on each Redpanda node.

It calculates the ratio of allocated memory (currently in use) to the total available memory (both allocated and free) on each node.

  • Metrics Used: redpanda_memory_allocated_memory, redpanda_memory_free_memory

Available Memory panel

This panel displays the maximum amount of free memory available on each Redpanda node.

  • Metric Used: redpanda_memory_available_memory

Available memory low watermark panel

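This panel shows the low watermark of available memory on each Redpanda node, that is, the lowest amount of available memory recorded since the process started. A value close to zero indicates that the node has come near to exhausting its memory, which risks performance degradation.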

  • Metric Used: redpanda_memory_available_memory_low_water_mark

Latency Row

Produce Latency (p95) panel

This panel monitors the 95th percentile latency (p95) of produce (publish) requests in the Redpanda Kafka system.

It indicates how quickly produce requests are being processed in the Redpanda cluster.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Produce Latency (p99) panel

This panel displays the 99th percentile latency (p99) for produce requests (data publishing) in the Redpanda Kafka system.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Fetch Latency (p95) panel

This panel calculates the 95th percentile latency (p95) for fetch requests (data retrieval) in Redpanda Kafka.

Monitoring this metric helps assess the performance of data retrieval operations, ensuring timely fetching of messages from the Kafka system.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Fetch Latency (p99) panel

This panel in Grafana measures and displays the 99th percentile latency for fetch requests (data retrieval) in a Redpanda Kafka system.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Kafka RPC: Latency (p95) panel

This panel provides insight into the performance of Kafka Remote Procedure Call (RPC) requests in your Redpanda system. It shows the 95th percentile latency (p95): 95% of RPC requests complete within this time, and only the slowest 5% take longer.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Kafka RPC: Latency (p99) panel

This panel helps you understand the performance of your Kafka system by showing the 99th percentile latency (p99): the time within which 99% of RPC requests complete.

  • Metric Used: redpanda_kafka_request_latency_seconds_bucket

Throughput Row

Rate - Total number of bytes written panel

This panel shows the rate at which data is being written to the Kafka system. It provides insights into the total number of bytes written per second.

  • Metrics Used: redpanda_kafka_request_bytes_total

Rate - Total number of bytes read panel

This panel shows the rate at which data is being read from the Kafka system. It provides insights into the total number of bytes read per second.

  • Metrics Used: redpanda_kafka_request_bytes_total

Leadership Transfers panel

This panel shows the rate at which leadership changes occur within the Redpanda nodes. This metric helps you understand how often the leader role is transferred among nodes.

  • Metrics Used: redpanda_raft_leadership_changes

Leadership Transfer: In a distributed system like Redpanda, different nodes take on the role of leader to manage and coordinate tasks. A leadership transfer occurs when the leader role is passed from one node to another.

Internal RPC Latency Row

internal_rpc: Latency (p95) panel

This panel helps you monitor the performance of internal communication between Redpanda nodes by showing the 95th percentile latency (p95) of internal RPC requests.

95% of internal RPC requests complete within this time; only the slowest 5% take longer.

  • Metrics Used: redpanda_rpc_request_latency_seconds_bucket

internal_rpc: Latency (p99) panel

This panel helps you monitor the performance of internal communication between Redpanda nodes by showing the 99th percentile latency (p99) of internal RPC requests.

99% of internal RPC requests complete within this time; only the slowest 1% take longer.

  • Metrics Used: redpanda_rpc_request_latency_seconds_bucket

CPU utilization (Reactor) panel

This panel shows the average CPU utilization of the Redpanda nodes using the Reactor framework. This metric provides insights into how much CPU resources are being utilized by Redpanda processes.

  • Metrics Used: redpanda_cpu_busy_seconds_total

Reactor refers to the underlying framework or mechanism used for managing concurrency and handling events efficiently.


Alert rules

Brokers are down

  • Expression: ((sum(max_over_time(redpanda_cluster_brokers[30d])) - sum(up{job='redpanda'})) > 0)
  • Summary: The number of active brokers has been too low for more than 1 minute.
  • Description: Alerts when the number of active brokers drops below expected.

Brokers are down (alternative)

  • Expression: ((max_over_time(count(redpanda_application_uptime_seconds_total)[1w:])) - (count(redpanda_application_uptime_seconds_total) or on () vector(0))) > 0
  • Summary: The number of active brokers has been too low for more than 1 minute.
  • Description: Another alert for when the active brokers count is low.

Storage is degraded

  • Expression: (redpanda_storage_disk_free_space_alert) > 1
  • Summary: Redpanda is alerting that storage is degraded for more than 1 minute, resulting in writes being rejected.
  • Description: Indicates degraded storage leading to write rejections.

Storage - there is less than 1 GiB of free space

  • Expression: (redpanda_storage_disk_free_bytes) < 1073741824
  • Summary: There is less than 1 GiB free space available for more than 1 minute.
  • Description: Alerts when free storage space is below 1 GiB.

Leaderless Partitions

  • Expression: (redpanda_cluster_unavailable_partitions) > 0
  • Summary: There are leaderless partitions for more than 1 minute, so some data may be unavailable.
  • Description: Triggers when there are leaderless partitions.

Low Memory

  • Expression: (redpanda_memory_available_memory) < 1073741824
  • Summary: There is less than 1 GiB memory available for more than 1 minute.
  • Description: Alerts when available memory is below 1 GiB.

Storage Alert - Low Space

  • Expression: (redpanda_storage_disk_free_space_alert) > 0
  • Summary: Redpanda is alerting that space is too low for over 5 minutes.
  • Description: Indicates critically low storage space.

Under-replicated partitions

  • Expression: (redpanda_kafka_under_replicated_replicas) > 0
  • Summary: There have been under-replicated partitions for over 5 minutes.
  • Description: Alerts when partitions are under-replicated.

Storage space is predicted to be less than 1 GiB in 30 minutes

  • Expression: (predict_linear(redpanda_storage_disk_free_bytes[1h], 1800)) < 1073741824
  • Summary: Storage space has been consistently predicted to be less than 1 GiB (in 30 minutes), for over 5 minutes.
  • Description: Predicts storage space will be below 1 GiB in 30 minutes.
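
The prediction in the expression above relies on a linear trend fitted to recent samples. The Python sketch below shows the general idea with an ordinary least-squares fit. It is a simplified stand-in for PromQL's predict_linear, not its exact implementation: it extrapolates from the last sample's timestamp, and the function name and sample data are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def predict_linear(samples, horizon_seconds):
    """Fit a least-squares line to (timestamp, value) samples and return
    the value predicted horizon_seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Hypothetical hour of data: free space starts at 8 GiB and shrinks by
# 1.5 MiB per second, sampled every 10 minutes.
history = [(t, 8 * GIB - 1572864 * t) for t in range(0, 3601, 600)]

current_free = history[-1][1]
predicted_free = predict_linear(history, 1800)
should_alert = predicted_free < 1 * GIB

print(f"free now: {current_free / MIB:.0f} MiB")
print(f"predicted in 30 min: {predicted_free / MIB:.0f} MiB, alert: {should_alert}")
```

In this example the disk still has more than 1 GiB free at the last sample, yet the trend projects it below the threshold, which is the situation this alert is meant to catch early.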

Memory is predicted to be less than 1 GiB in one hour

  • Expression: (predict_linear(redpanda_memory_available_memory[30m], 1800)) < 1073741824
  • Summary: Memory has been consistently predicted to be less than 1 GiB (in one hour), for over 5 minutes.
  • Description: Predicts memory will be below 1 GiB in one hour.

More than 1% of Schema Registry requests results in an error

  • Expression: (100 * (sum by (instance) (rate(redpanda_schema_registry_request_errors_total[5m])) / sum by (instance) (rate(redpanda_schema_registry_request_latency_seconds_count[5m])))) > 1
  • Summary: More than 1% of Schema Registry requests results in an error, for over 5 minutes.
  • Description: Indicates high error rate for Schema Registry requests.

More than 1% of RPC requests results in an error

  • Expression: (100 * (sum by (instance) (rate(redpanda_rpc_request_errors_total[5m])) / sum by (instance) (rate(redpanda_rpc_request_latency_seconds_count[5m])))) > 1
  • Summary: More than 1% of RPC requests results in an error, for over 5 minutes.
  • Description: Alerts when RPC request error rate is high.

More than 1% of REST requests results in an error

  • Expression: (100 * (sum by (instance) (rate(redpanda_rest_proxy_request_errors_total[5m])) / sum by (instance) (rate(redpanda_rest_proxy_request_latency_seconds_count[5m])))) > 1
  • Summary: More than 1% of REST requests results in an error, for over 5 minutes.
  • Description: Indicates high error rate for REST requests.

More than 1% of Raft RPC requests results in an error

  • Expression: (100 * (sum by (instance) (rate(redpanda_node_status_rpcs_timed_out[5m])) / sum by (instance) (rate(redpanda_node_status_rpcs_sent[5m])))) > 1
  • Summary: More than 1% of Raft RPC requests results in an error, for over 5 minutes.
  • Description: Alerts when Raft RPC error rate is high.
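
The four error-rate expressions above share one shape: the per-second error rate divided by the per-second request rate, scaled to a percentage and compared against 1. A minimal Python sketch of that comparison follows; the function names and sample rates are hypothetical. Returning None when there is no traffic is a choice made for this sketch so that an idle instance does not trigger the alert.

```python
def error_percent(error_rate, request_rate):
    """Errors as a percentage of requests, or None when there is no traffic."""
    if request_rate == 0:
        return None
    return 100.0 * error_rate / request_rate


def exceeds_threshold(error_rate, request_rate, threshold_percent=1.0):
    """True when the error percentage is above the threshold."""
    pct = error_percent(error_rate, request_rate)
    return pct is not None and pct > threshold_percent


# 3 errors/s out of 200 requests/s is 1.5%, above the 1% threshold.
print(error_percent(3, 200), exceeds_threshold(3, 200))

# 1 error/s out of 200 requests/s is 0.5%, below the threshold.
print(error_percent(1, 200), exceeds_threshold(1, 200))
```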

Raft leadership is continually changing

  • Expression: (rate(redpanda_raft_leadership_changes[1m])) > 0
  • Summary: Raft leadership is continually changing, rather than settling into a stable distribution, for over 5 minutes.
  • Description: Indicates frequent changes in Raft leadership.

Kafka request latency is too high

  • Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_kafka_request_latency_seconds_bucket[5m])))) > 0.1
  • Summary: Kafka request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
  • Description: Alerts when Kafka request latency is high.

RPC request latency is too high

  • Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_rpc_request_latency_seconds_bucket[5m])))) > 0.1
  • Summary: RPC request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
  • Description: Indicates high RPC request latency.

REST request latency is too high

  • Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_rest_proxy_request_latency_seconds_bucket[5m])))) > 0.1
  • Summary: REST request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
  • Description: Alerts when REST request latency is high.

Schema Registry request latency is too high

  • Expression: (histogram_quantile(0.95, sum by(le) (rate(redpanda_schema_registry_request_latency_seconds_bucket[5m])))) > 0.1
  • Summary: Schema Registry request latency (95th percentile) is more than 100 milliseconds per request, for over 5 minutes.
  • Description: Indicates high Schema Registry request latency.

Storage - there is less than 10 GiB of free space

  • Expression: (redpanda_storage_disk_free_bytes) < 10737418240
  • Summary: There is less than 10 GiB free space available for more than 5 minutes.
  • Description: Alerts when storage space is below 10 GiB.

Schema Registry errors are increasing

  • Expression: (increase(redpanda_schema_registry_request_errors_total[1m])) > 0
  • Summary: Schema Registry errors are increasing for more than 5 minutes.
  • Description: Indicates increasing errors in Schema Registry.

RPC errors are increasing

  • Expression: (increase(redpanda_rpc_request_errors_total[1m])) > 0
  • Summary: RPC errors are increasing for more than 5 minutes.
  • Description: Alerts when RPC errors are increasing.

REST Proxy errors are increasing

  • Expression: (increase(redpanda_rest_proxy_request_errors_total[1m])) > 0
  • Summary: REST Proxy errors are increasing for more than 5 minutes.
  • Description: Indicates increasing errors in REST Proxy.

Raft RPC errors are increasing

  • Expression: (increase(redpanda_node_status_rpcs_timed_out[1m]) OR on() vector(0)) > 0
  • Summary: Raft RPC errors are increasing for more than 5 minutes.
  • Description: Alerts when Raft RPC errors are increasing.


References

Cluster administration monitoring: https://docs.redpanda.com/21.11/cluster-administration/monitoring/

Public metrics reference: https://docs.redpanda.com/current/reference/public-metrics-reference/

Redpanda observability: https://github.com/redpanda-data/observability