Prometheus
It is an open-source system monitoring and alerting tool that pulls metrics from client servers over HTTP and stores them in its local time series database (TSDB).
Prometheus can only pull (scrape) data; it cannot collect the data on its own, so we use other tools, such as exporters, to collect the data and expose it for scraping.
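To make the scrape step concrete, here is a minimal Python sketch of reading a scraped payload. It assumes the target exposes its metrics in the Prometheus text format and handles only a simplified subset (no timestamps, no escaped label values). The function name parse_exposition and the sample payload, including its values, are illustrative assumptions and are not taken from a real server.

```python
def parse_exposition(text):
    """Parse a simplified subset of the Prometheus text exposition format.

    Returns a dict mapping each series (metric name plus any label set,
    kept as the raw string) to its float value. Comment lines such as
    "# HELP" and "# TYPE" and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field in this simplified form.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative payload; the numbers are made up for the example.
SAMPLE_SCRAPE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
# TYPE prometheus_target_scrape_pool_targets gauge
prometheus_target_scrape_pool_targets{scrape_job="node_exporter"} 3
"""

samples = parse_exposition(SAMPLE_SCRAPE)
```

In a real setup Prometheus performs this step itself on every scrape; the sketch only illustrates what kind of data arrives over HTTP.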
process_cpu_seconds_total: This is the total CPU time, in seconds, consumed by the Prometheus process; its rate of change shows the CPU usage over time.
process_resident_memory_bytes: This displays the amount of resident memory (RAM) used by the Prometheus process.
process_open_fds: This shows the number of open file descriptors for the Prometheus process.
process_start_time_seconds: This is the start time of the Prometheus process as a Unix timestamp in seconds; it is used to calculate the uptime of the process.
time() - process_start_time_seconds: This represents the uptime of the Prometheus process.
prometheus_config_last_reload_successful: This shows the success status of the last configuration reload.
prometheus_config_last_reload_success_timestamp_seconds: This is used to calculate the time since the last successful configuration reload.
time() - prometheus_config_last_reload_success_timestamp_seconds: This represents the time elapsed since the last successful configuration reload.
prometheus_tsdb_storage_blocks_bytes: This shows the amount of storage used by the Prometheus TSDB in bytes.
prometheus_tsdb_checkpoint_deletions_total: This represents the total number of checkpoints deleted.
prometheus_tsdb_checkpoint_creations_total: This represents the total number of checkpoints created.
prometheus_tsdb_time_retentions_total: This shows the number of times blocks were deleted because the time retention limit was exceeded.
prometheus_tsdb_vertical_compactions_total: This represents the total number of vertical compactions.
prometheus_tsdb_checkpoint_deletions_failed_total: This shows the number of failed checkpoint deletions.
prometheus_tsdb_checkpoint_creations_failed_total: This shows the number of failed checkpoint creations.
prometheus_tsdb_clean_start: This displays the state of the TSDB clean start process.
prometheus_tsdb_data_replay_duration_seconds: This represents the duration of data replay during startup.
prometheus_tsdb_symbol_table_size_bytes: This shows the size of the symbol table in bytes.
prometheus_tsdb_mmap_chunk_corruptions_total: This represents the total number of mmap chunk corruptions.
prometheus_tsdb_reloads_failures_total: This shows the total number of reload failures.
prometheus_tsdb_snapshot_replay_error_total: This shows the total number of snapshot replay errors.
prometheus_tsdb_too_old_samples_total: This represents the total number of samples that were too old.
prometheus_tsdb_size_retentions_total: This shows the number of times blocks were deleted because the size retention limit was exceeded.
prometheus_tsdb_retention_limit_bytes: This represents the size retention limit in bytes.
prometheus_notifications_sent_total: This displays the total number of alerts sent.
prometheus_notifications_dropped_total: Total number of dropped notifications in Prometheus.
prometheus_notifications_alertmanagers_discovered: Number of Alertmanager instances discovered by Prometheus for sending notifications.
prometheus_notifications_queue_length: Length of the notification queue in Prometheus, indicating pending notifications.
prometheus_notifications_errors_total: Total number of errors encountered during notification sending in Prometheus.
prometheus_rule_evaluation_duration_seconds: This shows the duration of rule evaluations.
prometheus_rule_group_duration_seconds: This represents the duration of rule group evaluations.
prometheus_engine_queries: This shows the number of active queries.
prometheus_engine_query_log_enabled: This indicates whether query logging is enabled.
prometheus_engine_query_log_failures_total: This shows the total number of query log failures.
prometheus_engine_query_duration_seconds: This metric measures the duration of queries processed by the Prometheus engine in seconds. It indicates how long Prometheus takes to process queries against its time series database.
prometheus_tsdb_lowest_timestamp_seconds: This represents the oldest timestamp in the TSDB.
timestamp(prometheus_tsdb_lowest_timestamp_seconds) - prometheus_tsdb_lowest_timestamp_seconds: This shows the age, in seconds, of the oldest value in the TSDB.
prometheus_template_text_expansion_failures_total: Total number of failures encountered during text template expansion in Prometheus.
prometheus_template_text_expansions_total: Total number of text template expansions performed in Prometheus.
prometheus_target_sync_length_seconds: This metric measures the time it takes for Prometheus to synchronize the targets of a scrape pool.
prometheus_target_interval_length_seconds: This metric records the actual intervals between consecutive scrapes.
prometheus_target_scrape_pool_targets: This metric shows the number of scrape pool targets configured in Prometheus.
prometheus_http_response_size_bytes_bucket: This metric represents the distribution of HTTP response sizes in buckets for Prometheus. It helps in understanding the size distribution of responses served by Prometheus.
prometheus_http_request_duration_seconds_bucket: This metric shows the distribution of HTTP request durations in buckets for Prometheus. It helps in analyzing the time taken by different HTTP requests served by Prometheus.
prometheus_ready: This indicates the readiness status of Prometheus.
prometheus_build_info: This displays the build information of the Prometheus instance.
TSDB: Time series database, a specialized database optimized for storing and querying time-stamped data, such as metrics and events.
Local Storage Panel
This panel shows the total size of storage blocks used by Prometheus on the local instance. It represents the amount of disk space consumed by the time series data stored by Prometheus, helping to monitor and manage the storage utilization of the Prometheus instance.
Checkpoints Deleted Panel
This panel shows the number of checkpoints deleted over a specified range of time. It represents the frequency of checkpoint deletions, which are used to manage the TSDB storage and maintain efficient access to time series data.
Checkpoint: A snapshot of the state of the TSDB at a particular point in time. Checkpoints are used to make recovery faster and more efficient, as they provide a reference point from which Prometheus can resume processing without needing to replay the entire WAL.
Checkpoints Created Panel
This panel shows the number of checkpoints created over a specified range of time. It represents the frequency of creating new checkpoints in the TSDB.
Time Retention Hits Panel
This panel shows the number of time retention events over a specified range of time. It represents the instances when old data is removed based on the retention policy, helping to manage the TSDB size and performance.
Time Retention: The policy and process of automatically deleting old time-series data from the TSDB after a specified period.
Vertical Compactions Panel
This panel shows the number of vertical compactions performed over a specified range of time.
Vertical Compactions: The process of merging TSDB blocks whose time ranges overlap into a single block, which reduces the number of blocks and improves query performance.
Checkpoints Deletion Failures Panel
This panel shows the number of failed checkpoint deletions over a specified range of time. It represents the instances where attempts to delete checkpoints have failed.
Checkpoints Creation Failures Panel
This panel shows the number of failed checkpoint creations over a specified range of time. It represents the instances where attempts to create new checkpoints have failed.
TSDB Clean Start Panel
This panel shows the status of the TSDB clean start, which reflects the state of the TSDB lockfile at startup.
The possible values are 1 (the lockfile was created cleanly), 0 (a lockfile from a previous run was replaced), and -1 (the lockfile is disabled).
Replay Time Panel
This panel shows the duration of data replay in seconds. It represents the time taken to replay the data logs during Prometheus startup.
Replay: The process of replaying or reprocessing the write-ahead log (WAL) data during the startup of Prometheus. This ensures that any data written to the WAL before a shutdown is properly incorporated into the TSDB.
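The relationship between checkpoints and replay can be illustrated with a toy model in Python. This is only a sketch of the general idea of a log plus a checkpoint; it does not reflect the on-disk WAL format or the actual recovery code of Prometheus, and the names replay, wal, checkpoint_position, and checkpoint_state are hypothetical.

```python
def replay(records, state=None):
    """Apply log records (series, value) in order to an in-memory state."""
    state = dict(state or {})
    for series, value in records:
        state[series] = value
    return state


# Toy write-ahead log: each record is the latest value written for a series.
wal = [("cpu", 1.0), ("mem", 10.0), ("cpu", 2.0), ("mem", 11.0), ("cpu", 3.0)]

# A checkpoint taken after the first three records: the position in the log
# plus the state at that position.
checkpoint_position = 3
checkpoint_state = replay(wal[:checkpoint_position])

# Recovery without a checkpoint replays the whole log.
full = replay(wal)

# Recovery with a checkpoint replays only the records written after it.
fast = replay(wal[checkpoint_position:], checkpoint_state)
```

Both recoveries end in the same state, but the second one processes two records instead of five, which is why checkpoints shorten the replay time at startup.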
Symbol Table RAM Panel
This panel shows the size of the symbol table in bytes used by Prometheus on the specified instance. The symbol table is part of the TSDB’s internal data structures and is used for efficient indexing and lookup of time series data.
MMap Corruptions Panel
This panel shows the number of memory map (MMap) chunk corruptions over a specified range of time. MMap is used by Prometheus for efficient memory management, and corruptions can indicate potential issues with memory handling or hardware problems.
Reload Failures Panel
This panel shows the number of times Prometheus failed to reload data over a specified range of time. Reload failures can occur due to various reasons such as data corruption or disk access issues.
Snapshot Replay Errors Panel
This panel shows the number of errors encountered during snapshot replay over a specified range of time. Snapshot replay is a process where Prometheus restores previously saved data, and errors here can affect data integrity or recovery.
Samples Out of Time Panel
This panel shows the number of samples that were considered out of time (too old) over a specified range of time. This metric helps monitor issues related to outdated or expired data samples in the time series database.
Time Retention Hits Panel
This panel shows the number of time retention events over a specified range of time. Time retention involves removing old data based on configured retention policies to manage the size and performance of the time series database.
Size Retention Hits Panel
This panel shows the number of size retention events over a specified range of time. Size retention involves removing data to stay within the configured storage limits, ensuring efficient use of storage resources.
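As a rough illustration of size-based retention, the following Python sketch drops the oldest blocks until the total size fits within a limit. It is a simplified model under stated assumptions: blocks are represented as hypothetical (min_time, size_bytes) tuples, and the function apply_size_retention is invented for this example. The real Prometheus logic may account for more than block sizes.

```python
def apply_size_retention(blocks, limit_bytes):
    """Drop the oldest blocks until the total size fits within limit_bytes.

    blocks is a list of (min_time, size_bytes) tuples.
    Returns (kept, deleted), both ordered oldest first.
    """
    kept = sorted(blocks)  # oldest block first
    deleted = []
    total = sum(size for _, size in kept)
    while kept and total > limit_bytes:
        oldest = kept.pop(0)
        deleted.append(oldest)
        total -= oldest[1]
    return kept, deleted


# Illustrative blocks: three blocks totalling 120 bytes against an 80-byte limit.
blocks = [(300, 40), (100, 50), (200, 30)]
kept, deleted = apply_size_retention(blocks, limit_bytes=80)
```

Each call that deletes at least one block corresponds to one size retention event in this model.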
Size Retention Limit Panel
This panel shows the retention limit in bytes configured for the Prometheus instance. It represents the maximum amount of data that can be stored before size-based retention policies start removing older data.
For this panel, the value mappings are:
Prometheus Ready Panel
This panel indicates the readiness status of the Prometheus instance.
It uses value mappings to show whether Prometheus is ready or has failed; the underlying prometheus_ready metric is 1 when Prometheus is ready and 0 otherwise.
Configuration Reload Panel
This panel shows the success status of the last configuration reload for the Prometheus instance.
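It uses value mappings to represent the status; the underlying prometheus_config_last_reload_successful metric is 1 after a successful reload and 0 after a failed one.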
Targets Panel
This panel displays the total number of targets being scraped by the Prometheus instance. It sums up the targets from all scrape pools.
CPU Usage % Panel
This panel shows the CPU usage percentage of the Prometheus instance over a specified interval. It calculates the rate of CPU seconds used.
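The calculation behind this panel can be sketched in Python as the rate of a CPU-seconds counter between two samples. The function name cpu_usage_percent and the sample values are assumptions made for illustration; the exact panel query is not given in these notes.

```python
def cpu_usage_percent(earlier, later):
    """Return the average CPU usage in percent between two counter samples.

    Each sample is a (unix_timestamp_seconds, process_cpu_seconds_total) pair.
    """
    (t0, cpu0), (t1, cpu1) = earlier, later
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # CPU seconds consumed per wall-clock second; 1.0 means one full core.
    cores_used = (cpu1 - cpu0) / elapsed
    return cores_used * 100.0


# Illustrative samples: 15 CPU seconds consumed over a 60 second interval.
usage = cpu_usage_percent((1000.0, 250.0), (1060.0, 265.0))
```

With these samples the result is 25.0, meaning a quarter of one core on average. On a multi-core machine the value can exceed 100.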
Active Alert Managers Panel
This panel displays the number of active Alertmanagers discovered by the Prometheus instance, indicating the status of the Alertmanager integration.
Rule Execution Panel
This panel shows the duration of rule evaluations in seconds for the Prometheus instance, focusing on the 99th percentile to indicate near-maximum execution times.
Oldest Value Panel
This panel displays the timestamp of the oldest data value stored in the Prometheus TSDB, which indicates how far back the retained data goes.
File Descriptors Panel
This panel shows the number of open file descriptors for the Prometheus process, which is useful for monitoring resource usage and potential file handle limits.
Process Uptime Panel
This panel displays the uptime of the Prometheus process by calculating the difference between the current time and the process start time.
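The same subtraction can be written as a small Python sketch. The helper names uptime_seconds and format_duration are hypothetical, and the timestamps are made up; the sketch assumes the start time is a Unix timestamp in seconds.

```python
import time


def uptime_seconds(process_start_time_seconds, now=None):
    """Return the uptime as current time minus process start time."""
    if now is None:
        now = time.time()
    return now - process_start_time_seconds


def format_duration(seconds):
    """Render a duration in seconds as days, hours, minutes, and seconds."""
    seconds = int(seconds)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


# Illustrative values: a process started 90061 seconds before "now".
uptime = uptime_seconds(1_700_000_000, now=1_700_090_061)
label = format_duration(uptime)
```

Here label is "1d 1h 1m 1s". The same pattern applies to the time elapsed since the last successful configuration reload, with the reload timestamp in place of the start time.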
Configuration Reload Time Panel
This panel shows the time elapsed since the last successful configuration reload of the Prometheus instance.
Local Storage Panel
This panel displays the amount of storage used by the Prometheus time-series database (TSDB) in bytes.
RAM Usage Panel
This panel shows the amount of resident memory (RAM) used by the Prometheus process.
Sent Alerts [10m] Panel
This panel displays the total number of alerts sent by Prometheus in the last 10 minutes.
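Because the underlying metric is a counter that only grows (and restarts from zero when the process restarts), the number of alerts sent in a window is the increase of the counter over that window. The Python sketch below shows the idea under simplifying assumptions: the function counter_increase and the sample values are hypothetical, and unlike the PromQL increase() function it does not extrapolate to the window boundaries.

```python
def counter_increase(samples):
    """Return the total increase of a counter across consecutive samples.

    A drop between two samples is treated as a counter reset (for example
    a process restart), so the new value itself counts as the increase.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr
    return total


# Illustrative samples of a sent-notifications counter within a 10 minute
# window, including one reset between 130 and 2.
sent_in_window = counter_increase([120, 124, 130, 2, 5])
```

For these samples the increase is 15: 4 + 6 before the reset, then 2 + 3 after it.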
Rule Group Evaluation Panel
This panel shows the duration of rule group evaluations in seconds for the Prometheus instance, focusing on the 99th percentile to indicate near-maximum evaluation times.
Versions Panel
This panel displays the build information of the Prometheus instance, including version and build metadata.
Query Log Panel
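This panel indicates whether query logging is enabled for the Prometheus instance, using value mappings to represent the status; the underlying prometheus_engine_query_log_enabled metric is 1 when the query log is enabled and 0 when it is disabled.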
Active Queries Panel
This panel displays the number of active queries currently being executed by the Prometheus instance.
Log Failures Panel
This panel shows the rate of query log failures for the Prometheus instance over a specified interval.
Prometheus CPU Usage % Panel
The “Prometheus CPU Usage %” panel focuses on monitoring the CPU load of the Prometheus server on the node ps-in-mon1a-node01. It displays the percentage of CPU usage, providing insights into how much processing power Prometheus is utilizing.
Prometheus RAM Usage Panel
The “Prometheus RAM Usage” panel monitors the RAM usage of the Prometheus server on the node ps-in-mon1a-node01. It displays the amount of memory consumed, helping to understand the memory load of Prometheus.
Sent Alerts [10m] Panel
The “Sent Alerts [10m]” panel displays the number of alerts sent by Prometheus over the past 10 minutes. It helps track the frequency and volume of alerts, ensuring that alerting rules are functioning as expected and providing insight into alerting patterns.
Active Alertmanagers Panel
The “Active Alertmanagers” panel shows the number of active Alertmanager instances. It ensures that the Alertmanager, which handles alerts from Prometheus, is operational. Having at least one active Alertmanager is crucial for managing notifications and routing alerts effectively.
Notification Queue Panel
The “Notification Queue” panel indicates the current number of notifications queued for delivery. A high queue length may suggest issues with notification delivery or an excessive generation of alerts.
Alert Errors Panel
The “Alert Errors” panel tracks the number of errors encountered while processing alerts. A high number of alert errors could indicate configuration issues or problems with alerting rules.
Template Expansions Panel
The “Template Expansions” panel shows the number of times alert templates have been expanded. Frequent template expansions indicate active alert processing involving template rendering.
Template Failures Panel
The “Template Failures” panel counts the number of failures encountered during template expansion. Zero failures indicate that templates are correctly formatted and functioning.
Dropped Alerts Panel
The “Dropped Alerts” panel measures the number of alerts that have been dropped. A nonzero value points to issues in the alerting pipeline, such as overloads or misconfigurations, that cause alerts to be dropped instead of processed.
Targets per Job Panel
The “Targets per Job” panel in the Prometheus dashboard provides an overview of the number of scrape targets for each configured job. Each job represents a set of targets from which Prometheus scrapes metrics. The metrics displayed include the number of scrape targets for each job, such as Pushgateway, blackbox-https, haproxy_exporter, jenkins_exporter, libvirt_exporter, mysql_exporter, node_exporter, prometheus_exporter, pushgateway_exporter, redpanda_exporter, scylla, and tomcat_exporter.
Actual Intervals between Scrapes (p99) Panel
The “Actual Intervals between Scrapes (p99)” panel in the Prometheus dashboard shows the 99th percentile of the time intervals between consecutive scrapes.
p99: The 99th percentile, or p99, is a statistical measure indicating that 99% of the scrape intervals are at or below a certain time. For instance, if the p99 value is 15 seconds, it means that 99% of the time, the interval between scrapes is 15 seconds or less.
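The percentile definition can be checked with a short Python sketch that uses the nearest-rank method on raw observations. This illustrates the statistical definition only; Prometheus estimates quantiles from its own metric types rather than from a full list of observations, and the function name percentile_nearest_rank and the interval values are assumptions.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers p percent of the data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative scrape intervals in seconds: 98 on-time scrapes and 2 late ones.
intervals = [15.0] * 98 + [15.2, 19.5]
p90 = percentile_nearest_rank(intervals, 90)
p99 = percentile_nearest_rank(intervals, 99)
```

With this data p90 is 15.0 and p99 is 15.2: the single 19.5 second outlier affects only the maximum, which is why p99 is described as a near-maximum value.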
Sync Interval Panel
The “Sync Interval” panel lists the configured sync intervals for various exporters and jobs, including Pushgateway, blackbox-https, haproxy_exporter, jenkins_exporter, libvirt_exporter, mysql_exporter, node_exporter, prometheus_exporter, pushgateway_exporter, redpanda_exporter, scylla, and tomcat_exporter. This information helps in understanding how frequently Prometheus syncs with each target to collect metrics.
Query Duration (p90) Panel
The “Query Duration (p90)” panel in Prometheus provides insight into query performance through three query phases: inner_eval, prepare_time, and queue_time. inner_eval measures the time spent evaluating the inner query expressions or computations. prepare_time indicates the time required for query preparation tasks such as parsing and initial setup. queue_time reflects how long queries wait in the queue before execution, which is influenced by system load and prioritization.
p90: The p90 value represents the point below which 90% of query durations fall, providing a measure of typical query performance.
Query Duration (p99) Panel
The “Query Duration (p99)” panel in Prometheus provides detailed metrics on query performance, focusing on the 99th percentile of query duration metrics: inner_eval, prepare_time, and queue_time.
HTTP Latency (/metrics) Panel
The “HTTP Latency (/metrics)” panel displays a histogram of HTTP request latencies. It shows the latency bucket upper bounds (le values) and, for each bound, the cumulative count of requests that completed within it.
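In the Prometheus histogram format, each le bucket counts the observations that are less than or equal to its bound, so the counts are cumulative. The Python sketch below converts cumulative buckets into per-range counts and estimates a quantile by linear interpolation inside the matching bucket, in the spirit of the PromQL histogram_quantile() function. The function names bucket_counts and estimate_quantile and the bucket values are hypothetical and chosen for illustration.

```python
INF = float("inf")


def bucket_counts(cumulative):
    """Convert cumulative (le, count) buckets into per-range counts."""
    result = []
    prev_le, prev_count = 0.0, 0
    for le, count in cumulative:
        result.append(((prev_le, le), count - prev_count))
        prev_le, prev_count = le, count
    return result


def estimate_quantile(q, cumulative):
    """Estimate quantile q (0..1) from cumulative buckets sorted by le."""
    total = cumulative[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in cumulative:
        if count >= rank:
            if le == INF:
                # No upper bound to interpolate towards in the last bucket.
                return prev_le
            in_bucket = count - prev_count
            if in_bucket == 0:
                return le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return prev_le


# Illustrative latency buckets in seconds: (le, cumulative request count).
latency_buckets = [(0.25, 60), (0.5, 90), (1.0, 100), (INF, 100)]
per_range = bucket_counts(latency_buckets)
p90_latency = estimate_quantile(0.9, latency_buckets)
```

For these buckets, 60 requests took at most 0.25 s, 30 took between 0.25 s and 0.5 s, and 10 took between 0.5 s and 1 s, which puts the estimated p90 at about 0.5 s. The same approach applies to response size buckets.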
Response Size (/metrics) Panel
The “Response Size (/metrics)” panel provides a histogram view of HTTP response sizes. It shows the size bucket upper bounds (le values) and, for each bound, the cumulative count of responses whose size is at or below it.