Prometheus Exporter

Prometheus

Prometheus is an open-source system monitoring and alerting tool. It pulls metrics from client servers over HTTP and stores them in its local time series database (TSDB).

Prometheus can only pull (scrape) data that is already exposed; it cannot collect the data on its own. Separate tools, called exporters, are therefore used to collect the data and expose it for scraping.
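To make the pull model concrete, the following Python sketch parses a small sample of the text format that a scraped endpoint returns. The payload, the label value, and the function name parse_exposition are illustrative assumptions, and the parser handles only simple lines without timestamps, not the full exposition format.

```python
def parse_exposition(text):
    """Parse simple Prometheus text-format lines into {series: value}.

    Simplified sketch: skips comment lines (# HELP / # TYPE) and assumes
    each sample line ends with its value (no trailing timestamp).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical payload, shaped like the body of a scrape response.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
prometheus_target_scrape_pool_targets{scrape_job="node_exporter"} 3
"""

metrics = parse_exposition(SAMPLE)
print(metrics["process_open_fds"])
```

In practice Prometheus performs this parsing itself on every scrape; the sketch only illustrates what kind of data an exporter has to expose.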

Metrics

General Performance Metrics

process_cpu_seconds_total: This is the total CPU time, in seconds, consumed by the Prometheus process. Its rate shows the CPU usage over time.
process_resident_memory_bytes: This displays the amount of resident memory (RAM) used by the Prometheus process.
process_open_fds: This shows the number of open file descriptors for the Prometheus process.
process_start_time_seconds: This is used to calculate the uptime of the Prometheus process.
time() - process_start_time_seconds: This represents the uptime of the Prometheus process.
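The uptime expression time() - process_start_time_seconds can be reproduced outside Prometheus as a plain subtraction of two Unix timestamps. The sketch below assumes hypothetical helper names (uptime_seconds, format_duration) and made-up timestamps; it is not part of the dashboard.

```python
import time


def uptime_seconds(process_start_time_seconds, now=None):
    """Equivalent of time() - process_start_time_seconds."""
    if now is None:
        now = time.time()  # current Unix time, like PromQL's time()
    return now - process_start_time_seconds


def format_duration(seconds):
    """Render a duration in seconds as days, hours, minutes, seconds."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Made-up timestamps: the process started 90061 seconds before "now".
print(format_duration(uptime_seconds(1_700_000_000, now=1_700_090_061)))
```

The same subtraction applies to time() - prometheus_config_last_reload_success_timestamp_seconds, which yields the time since the last successful configuration reload.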

Configuration and Reload Metrics

prometheus_config_last_reload_successful: This shows the success status of the last configuration reload.
prometheus_config_last_reload_success_timestamp_seconds: This is used to calculate the time since the last successful configuration reload.
time() - prometheus_config_last_reload_success_timestamp_seconds: This represents the time elapsed since the last successful configuration reload.

Storage and Database Metrics

prometheus_tsdb_storage_blocks_bytes: This shows the amount of storage used by the Prometheus TSDB in bytes.
prometheus_tsdb_checkpoint_deletions_total: This represents the total number of checkpoints deleted.
prometheus_tsdb_checkpoint_creations_total: This represents the total number of checkpoints created.
prometheus_tsdb_time_retentions_total: This shows the number of times blocks were deleted because the retention period was exceeded.
prometheus_tsdb_vertical_compactions_total: This represents the total number of vertical compactions.
prometheus_tsdb_checkpoint_deletions_failed_total: This shows the number of failed checkpoint deletions.
prometheus_tsdb_checkpoint_creations_failed_total: This shows the number of failed checkpoint creations.
prometheus_tsdb_clean_start: This displays the state of the TSDB clean start process.
prometheus_tsdb_data_replay_duration_seconds: This represents the duration of data replay during startup.
prometheus_tsdb_symbol_table_size_bytes: This shows the size of the symbol table in bytes.
prometheus_tsdb_mmap_chunk_corruptions_total: This represents the total number of mmap chunk corruptions.
prometheus_tsdb_reloads_failures_total: This shows the total number of reload failures.
prometheus_tsdb_snapshot_replay_error_total: This shows the total number of snapshot replay errors.
prometheus_tsdb_too_old_samples_total: This represents the total number of samples that were too old.
prometheus_tsdb_size_retentions_total: This shows the number of times blocks were deleted because the size retention limit was exceeded.
prometheus_tsdb_retention_limit_bytes: This represents the size retention limit in bytes.

Alerting Metrics

prometheus_notifications_dropped_total: Total number of dropped notifications in Prometheus.
prometheus_notifications_sent_total: Total number of notifications sent by Prometheus.
prometheus_notifications_alertmanagers_discovered: Number of Alertmanager instances discovered by Prometheus for sending notifications.
prometheus_notifications_queue_length: Length of the notification queue in Prometheus, indicating pending notifications.
prometheus_notifications_errors_total: Total number of errors encountered during notification sending in Prometheus.

Rule Evaluation Metrics

prometheus_rule_evaluation_duration_seconds: This shows the duration of rule evaluations.
prometheus_rule_group_duration_seconds: This represents the duration of rule group evaluations.

Query and Log Metrics

prometheus_engine_queries: This shows the number of active queries.
prometheus_engine_query_log_enabled: This indicates whether query logging is enabled.
prometheus_engine_query_log_failures_total: This shows the total number of query log failures.
prometheus_engine_query_duration_seconds: This metric measures the duration of queries processed by the Prometheus engine in seconds. It indicates how long Prometheus takes to process queries against its time series database.

Target and Scraping Metrics

prometheus_target_scrape_pool_targets: This shows the number of targets being scraped.

Time Series Data (TSDB) Metrics

prometheus_tsdb_lowest_timestamp_seconds: This represents the oldest timestamp in the TSDB.
timestamp(prometheus_tsdb_lowest_timestamp_seconds) - prometheus_tsdb_lowest_timestamp_seconds: This shows the age, in seconds, of the oldest value in the TSDB.

Template Metrics

prometheus_template_text_expansion_failures_total: Total number of failures encountered during text template expansion in Prometheus.
prometheus_template_text_expansions_total: Total number of successful text template expansions performed in Prometheus.

Target Sync Metrics

prometheus_target_sync_length_seconds: This metric measures the length of time it takes for Prometheus to synchronize targets.
prometheus_target_interval_length_seconds: This metric measures the actual intervals between consecutive scrapes of a target.
prometheus_target_scrape_pool_targets: This metric shows the number of scrape pool targets configured in Prometheus.

HTTP Metrics

prometheus_http_response_size_bytes_bucket: This metric represents the distribution of HTTP response sizes in buckets for Prometheus. It helps in understanding the size distribution of responses served by Prometheus.
prometheus_http_request_duration_seconds_bucket: This metric shows the distribution of HTTP request durations in buckets for Prometheus. It helps in analyzing the time taken by different HTTP requests served by Prometheus.

Miscellaneous Metrics

prometheus_ready: This indicates the readiness status of Prometheus.
prometheus_build_info: This displays the build information of the Prometheus instance.

Grafana Dashboard Panels

TSDB (Time Series Database) Row

TSDB: A specialized database optimized for storing and querying time-stamped data, such as metrics and events.

Local Storage Panel

This panel shows the total size of storage blocks used by Prometheus on the local instance. It represents the amount of disk space consumed by the time series data stored by Prometheus, helping to monitor and manage the storage utilization of the Prometheus instance.

Metric Used: prometheus_tsdb_storage_blocks_bytes

Checkpoints Deleted Panel

This panel shows the number of checkpoints deleted over a specified range of time. It represents the frequency of checkpoint deletions, which are used to manage the TSDB storage and maintain efficient access to time series data.

Metric Used: prometheus_tsdb_checkpoint_deletions_total

Checkpoint: A snapshot of the state of the TSDB at a particular point in time. Checkpoints are used to make recovery faster and more efficient, as they provide a reference point from which Prometheus can resume processing without needing to replay the entire WAL.

Checkpoints Created Panel

This panel shows the number of checkpoints created over a specified range of time. It represents the frequency of creating new checkpoints in the TSDB.

Metric Used: prometheus_tsdb_checkpoint_creations_total

Time Retention Hits Panel

This panel shows the number of time retention events over a specified range of time. It represents the instances when old data is removed based on the retention policy, helping to manage the TSDB size and performance.

Metric Used: prometheus_tsdb_time_retentions_total

Time Retention: The policy and process of automatically deleting old time-series data from the TSDB after a specified period.

Vertical Compactions Panel

This panel shows the number of vertical compactions performed over a specified range of time.

Metric Used: prometheus_tsdb_vertical_compactions_total

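Vertical Compactions: The process of merging TSDB blocks whose time ranges overlap into a single block. This differs from regular (horizontal) compaction, which merges adjacent smaller blocks into larger ones to reduce the number of blocks and improve query performance.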

Checkpoints Deletion Failures Panel

This panel shows the number of failed checkpoint deletions over a specified range of time. It represents the instances where attempts to delete checkpoints have failed.

Metric Used: prometheus_tsdb_checkpoint_deletions_failed_total

Checkpoints Creation Failures Panel

This panel shows the number of failed checkpoint creations over a specified range of time. It represents the instances where attempts to create new checkpoints have failed.

Metric Used: prometheus_tsdb_checkpoint_creations_failed_total

TSDB Clean Start Panel

This panel shows the status of TSDB clean starts. The possible values are:

1: CLEAN
-1: DISABLED
0: REPLACED

It represents the status of the TSDB lock file at startup, indicating whether it was created cleanly (clean start), lock file usage is disabled, or a lock file left over from a previous run was replaced.

Metric Used: prometheus_tsdb_clean_start
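A value mapping like the one listed for this panel behaves like a lookup table from the raw metric value to a display label. The Python sketch below illustrates the idea; the names CLEAN_START_MAPPING and map_value, and the "UNKNOWN" fallback, are assumptions made for illustration and are not taken from the Grafana configuration.

```python
# Display labels for prometheus_tsdb_clean_start, as listed for the panel.
CLEAN_START_MAPPING = {1: "CLEAN", -1: "DISABLED", 0: "REPLACED"}


def map_value(value, mapping, default="UNKNOWN"):
    """Translate a raw metric value into its display label.

    Metric values arrive as floats, so they are converted to int
    before the lookup. "UNKNOWN" is a hypothetical fallback label.
    """
    return mapping.get(int(value), default)


print(map_value(1.0, CLEAN_START_MAPPING))
print(map_value(-1.0, CLEAN_START_MAPPING))
```

The same approach fits the other mapped panels on this dashboard, such as the 0: FAILED / 1: OK mapping used for prometheus_ready.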

Replay Time Panel

This panel shows the duration of data replay in seconds. It represents the time taken to replay the write-ahead log (WAL) during Prometheus startup.

Metric Used: prometheus_tsdb_data_replay_duration_seconds

Replay: The process of replaying or reprocessing the write-ahead log (WAL) data during the startup of Prometheus. This ensures that any data written to the WAL before a shutdown is properly incorporated into the TSDB.

Symbol Table RAM Panel

This panel shows the size of the symbol table in bytes used by Prometheus on the specified instance. The symbol table is part of the TSDB’s internal data structures and is used for efficient indexing and lookup of time series data.

Metric used: prometheus_tsdb_symbol_table_size_bytes

MMap Corruptions Panel

This panel shows the number of memory map (MMap) chunk corruptions over a specified range of time. MMap is used by Prometheus for efficient memory management, and corruptions can indicate potential issues with memory handling or hardware problems.

Metric used: prometheus_tsdb_mmap_chunk_corruptions_total

Reload Failures Panel

This panel shows the number of times Prometheus failed to reload data over a specified range of time. Reload failures can occur due to various reasons such as data corruption or disk access issues.

Metric used: prometheus_tsdb_reloads_failures_total

Snapshot Replay Errors Panel

This panel shows the number of errors encountered during snapshot replay over a specified range of time. Snapshot replay is a process where Prometheus restores previously saved data, and errors here can affect data integrity or recovery.

Metric used: prometheus_tsdb_snapshot_replay_error_total

Samples Out of Time Panel

This panel shows the number of samples that were considered out of time (too old) over a specified range of time. This metric helps monitor issues related to outdated or expired data samples in the time series database.

Metric used: prometheus_tsdb_too_old_samples_total

Time Retention Hits Panel

This panel shows the number of time retention events over a specified range of time. Time retention involves removing old data based on configured retention policies to manage the size and performance of the time series database.

Metric used: prometheus_tsdb_time_retentions_total

Size Retention Hits Panel

This panel shows the number of size retention events over a specified range of time. Size retention involves removing data to stay within the configured storage limits, ensuring efficient use of storage resources.

Metric used: prometheus_tsdb_size_retentions_total

Size Retention Limit Panel

This panel shows the retention limit in bytes configured for the Prometheus instance. It represents the maximum amount of data that can be stored before size-based retention policies start removing older data.

Metric used: prometheus_tsdb_retention_limit_bytes

For this panel, the value mapping is:

0: DISABLED, indicating that size-based retention is disabled

INFO Row

Prometheus Ready Panel

This panel indicates the readiness status of the Prometheus instance.

Metric Used: prometheus_ready

It uses value mappings to show whether Prometheus is ready or has failed:

0: FAILED
1: OK

Configuration Reload Panel

This panel shows the success status of the last configuration reload for the Prometheus instance.

Metric Used: prometheus_config_last_reload_successful

It uses value mappings to represent the status:

0: FAILED
1: OK

Targets Panel

This panel displays the total number of targets being scraped by the Prometheus instance. It sums up the targets from all scrape pools.

Metric Used: prometheus_target_scrape_pool_targets

CPU Usage % Panel

This panel shows the CPU usage percentage of the Prometheus instance over a specified interval. It calculates the rate of CPU seconds used.

Metric Used: process_cpu_seconds_total
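Because process_cpu_seconds_total is an ever-increasing counter, the usage percentage comes from how fast it grows, not from its absolute value. The following Python sketch shows that calculation for two samples; the function name cpu_usage_percent and the sample numbers are hypothetical, and the panel's actual query may differ (for example in the interval it uses).

```python
def cpu_usage_percent(prev_value, prev_ts, curr_value, curr_ts):
    """CPU usage % between two samples of process_cpu_seconds_total.

    prev_value/curr_value are counter readings in CPU seconds;
    prev_ts/curr_ts are the Unix timestamps of those readings.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # CPU seconds consumed per wall-clock second; 1.0 means one full core.
    return (curr_value - prev_value) / elapsed * 100.0


# Made-up samples: 3 CPU seconds consumed over a 60-second window.
print(cpu_usage_percent(120.0, 0.0, 123.0, 60.0))
```

A process that uses several cores can exceed 100% with this definition, since each fully busy core contributes 100%.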

Active Alert Managers Panel

This panel displays the number of active Alertmanagers discovered by the Prometheus instance, indicating the status of the Alertmanager integration.

Metric Used: prometheus_notifications_alertmanagers_discovered

Rule Execution Panel

This panel shows the duration of rule evaluations in seconds for the Prometheus instance, focusing on the 99th percentile to indicate near-maximum execution times.

Metric Used: prometheus_rule_evaluation_duration_seconds

Oldest Value Panel

This panel displays the timestamp of the oldest data value stored in the Prometheus TSDB, which indicates how far back the retained data goes.

Metric Used: prometheus_tsdb_lowest_timestamp_seconds

File Descriptors Panel

This panel shows the number of open file descriptors for the Prometheus process, which is useful for monitoring resource usage and potential file handle limits.

Metric Used: process_open_fds

Process Uptime Panel

This panel displays the uptime of the Prometheus process by calculating the difference between the current time and the process start time.

Metric Used: process_start_time_seconds

Configuration Reload Time Panel

This panel shows the time elapsed since the last successful configuration reload of the Prometheus instance.

Metric Used: prometheus_config_last_reload_success_timestamp_seconds

Local Storage Panel

This panel displays the amount of storage used by the Prometheus time-series database (TSDB) in bytes.

Metric Used: prometheus_tsdb_storage_blocks_bytes

RAM Usage Panel

This panel shows the amount of resident memory (RAM) used by the Prometheus process.

Metric Used: process_resident_memory_bytes

Sent Alerts [10m] Panel

This panel displays the total number of alerts sent by Prometheus in the last 10 minutes.

Metric Used: prometheus_notifications_sent_total
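Since prometheus_notifications_sent_total only ever counts up, the number of alerts sent in the last 10 minutes is the growth of the counter across that window. The Python sketch below illustrates this, including the case where the counter restarts from zero; the function name counter_increase and the sample values are assumptions, and unlike PromQL's increase() it does not extrapolate to the window boundaries.

```python
def counter_increase(samples):
    """Total increase of a counter over an ordered list of sample values.

    A drop between consecutive samples is treated as a counter reset
    (for example after a Prometheus restart), so the value after the
    drop is counted as the increase since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Made-up samples collected during a 10-minute window.
print(counter_increase([10, 12, 15]))
# The same window with a restart between the second and third sample.
print(counter_increase([10, 12, 2, 4]))
```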

Rule Group Evaluation Panel

This panel shows the duration of rule group evaluations in seconds for the Prometheus instance, focusing on the 99th percentile to indicate near-maximum evaluation times.

Metric Used: prometheus_rule_group_duration_seconds

Versions Panel

This panel displays the build information of the Prometheus instance, including version and build metadata.

Metric Used: prometheus_build_info

Query Log Panel

This panel indicates whether query logging is enabled for the Prometheus instance, using value mappings to represent the status:

0: OFF
1: ON

Metric Used: prometheus_engine_query_log_enabled

Active Queries Panel

This panel displays the number of active queries currently being executed by the Prometheus instance.

Metric Used: prometheus_engine_queries

Log Failures Panel

This panel shows the rate of query log failures for the Prometheus instance over a specified interval.

Metric Used: prometheus_engine_query_log_failures_total

CPU / Memory Row

Prometheus CPU Usage % Panel

The “Prometheus CPU Usage %” panel focuses on monitoring the CPU load of the Prometheus server on the node ps-in-mon1a-node01. It displays the percentage of CPU usage, providing insights into how much processing power Prometheus is utilizing.

Metrics Used: process_cpu_seconds_total

Prometheus RAM Usage Panel

The “Prometheus RAM Usage” panel monitors the RAM usage of the Prometheus server on the node ps-in-mon1a-node01. It displays the amount of memory consumed, helping to understand the memory load of Prometheus.

Metrics Used: process_resident_memory_bytes

Alerts Row

Sent Alerts [10m] Panel

The “Sent Alerts [10m]” panel displays the number of alerts sent by Prometheus over the past 10 minutes. It helps track the frequency and volume of alerts, ensuring that alerting rules are functioning as expected and providing insight into alerting patterns.

Metrics Used: prometheus_notifications_sent_total

Active Alertmanagers Panel

The “Active Alertmanagers” panel shows the number of active Alertmanager instances. It ensures that the Alertmanager, which handles alerts from Prometheus, is operational. Having at least one active Alertmanager is crucial for managing notifications and routing alerts effectively.

Metrics Used: prometheus_notifications_alertmanagers_discovered

Notification Queue Panel

The “Notification Queue” panel indicates the current number of notifications queued for delivery. A high queue length may suggest issues with notification delivery or an excessive generation of alerts.

Metrics Used: prometheus_notifications_queue_length

Alert Errors Panel

The “Alert Errors” panel tracks the number of errors encountered while sending alert notifications. A high number of alert errors could indicate connectivity problems or configuration issues between Prometheus and Alertmanager.

Metrics Used: prometheus_notifications_errors_total

Template Expansions Panel

The “Template Expansions” panel shows the number of times alert templates have been expanded. Frequent template expansions indicate active alert processing involving template rendering.

Metrics Used: prometheus_template_text_expansions_total

Template Failures Panel

The “Template Failures” panel counts the number of failures encountered during template expansion. Zero failures indicate that templates are correctly formatted and functioning.

Metrics Used: prometheus_template_text_expansion_failures_total

Dropped Alerts Panel

The “Dropped Alerts” panel measures the number of alerts that have been dropped. This indicates issues in the alerting pipeline, such as overloads or misconfigurations, causing alerts to be dropped instead of processed.

Metrics Used: prometheus_notifications_dropped_total

Scrape Row

Targets per Job Panel

The “Targets per Job” panel in the Prometheus dashboard provides an overview of the number of scrape targets for each configured job. Each job represents a set of targets from which Prometheus scrapes metrics. The metrics displayed include the number of scrape targets for each job, such as Pushgateway, blackbox-https, haproxy_exporter, jenkins_exporter, libvirt_exporter, mysql_exporter, node_exporter, prometheus_exporter, pushgateway_exporter, redpanda_exporter, scylla, and tomcat_exporter.

Metrics Used: prometheus_target_scrape_pool_targets

Actual Intervals between Scrapes (p99) Panel

The “Actual Intervals between Scrapes (p99)” panel in the Prometheus dashboard shows the 99th percentile of the time intervals between consecutive scrapes.

p99: The 99th percentile, or p99, is a statistical measure indicating that 99% of the scrape intervals are at or below a certain time. For instance, if the p99 value is 15 seconds, it means that 99% of the time, the interval between scrapes is 15 seconds or less.

Metrics Used: prometheus_target_interval_length_seconds
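The meaning of p99 can be illustrated with a small Python sketch that computes a percentile from a list of observed intervals using the nearest-rank method. This is only an illustration of the statistic: the function name percentile and the sample data are hypothetical, and Prometheus computes its own quantiles differently.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p percent of the observations are at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up data: 99 scrapes arrived 15 s apart, one was delayed to 30 s.
intervals = [15.0] * 99 + [30.0]
print(percentile(intervals, 99))
print(percentile(intervals, 100))
```

With this data the single delayed scrape does not affect p99, which is why p99 is a useful view of near-worst-case behavior that ignores rare outliers.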

Sync Interval Panel

The “Sync Interval” panel shows the measured target sync times for various exporters and jobs, including Pushgateway, blackbox-https, haproxy_exporter, jenkins_exporter, libvirt_exporter, mysql_exporter, node_exporter, prometheus_exporter, pushgateway_exporter, redpanda_exporter, scylla, and tomcat_exporter. This information helps in understanding how long Prometheus takes to synchronize the target list of each job.

Metrics Used: prometheus_target_sync_length_seconds

Queries Row

Query Duration (p90) Panel

The “Query Duration (p90)” panel provides insight into query performance, broken down into the phases inner_eval, prepare_time, and queue_time. inner_eval measures the time spent evaluating the query expression itself. prepare_time indicates the duration required for query preparation tasks such as parsing and initial setup. queue_time reflects the time queries wait in the queue before execution, which is influenced by system load and the number of concurrent queries.

p90: The p90 value represents the point below which 90% of query durations fall, providing a measure of typical query performance.

Metrics Used: prometheus_engine_query_duration_seconds

Query Duration (p99) Panel

The “Query Duration (p99)” panel in Prometheus provides detailed metrics on query performance, focusing on the 99th percentile of query duration metrics: inner_eval, prepare_time, and queue_time.

Metrics Used: prometheus_engine_query_duration_seconds

HTTP Latency (/metrics) Panel

The “HTTP Latency (/metrics)” panel displays a histogram of HTTP request latency metrics. It shows different latency bucket ranges (le values) and the corresponding count of requests falling within each range.

Metrics Used: prometheus_http_request_duration_seconds_bucket
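Each le bucket is cumulative: it counts all requests that took at most that many seconds. A latency quantile can be estimated from such buckets by linear interpolation, which is the idea behind PromQL's histogram_quantile() function. The Python sketch below is a simplified illustration under that assumption; the function name bucket_quantile and the bucket counts are made up, and edge cases are handled more simply than in Prometheus.

```python
def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (le, cumulative_count) pairs sorted by le, where
    the last le is float("inf"). Observations are assumed to be spread
    evenly inside each bucket, and the first bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile lies above the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Made-up cumulative counts for le = 0.1, 0.5, 1.0 and +Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(bucket_quantile(0.95, buckets))
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.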

Response Size (/metrics) Panel

The “Response Size (/metrics)” panel provides a histogram view of HTTP response sizes. It shows various size buckets (le values) and the count of responses falling into each bucket.

Metrics Used: prometheus_http_response_size_bytes_bucket