Please read the following documents before addressing the issues, to become familiar with PostgreSQL:
Common high-availability structures
Current architecture of our cluster
When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.
| Row | Panel | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | CRITICAL | WARNING | OK | PRIORITY |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.3 | Replication lag | pg_replication_lag_seconds | Lag behind master in seconds | +ve values | pg_replication_lag_seconds | Replication lag behind master in seconds | +ve values | > 9 seconds | | < 9 seconds | |
| 1.3 | Start time | pg_postmaster_start_time_seconds | Time since the PostgreSQL instance started | +ve values | pg_postmaster_start_time_seconds | Time since the PostgreSQL instance started | +ve values | < 120 seconds | | > 120 seconds | |
| 1.3 | DB instance type | pg_replication_is_replica | Returns 1 if the node is a replica | 0-1 | pg_replication_is_replica | Returns 1 if the node is a slave/replica | 0-1 | Change 0->1 or 1->0 | | No change in 10s | |
| 1.4 | Failed WAL archives | pg_stat_archiver_failed_count | Failed WAL archive files on master | +ve values | pg_stat_archiver_failed_count | Number of failed WAL archive files | +ve values | +ve values | | 0 | |
| 2.6 | Conflicts/Deadlocks | sum(rate(pg_stat_database_deadlocks{...}[$interval])) | Monitors the PostgreSQL deadlock rate over a 5m interval, offering insights into locking issues and conflicts | +ve values | pg_stat_database_deadlocks | Number of active deadlocks | +ve values | +ve values | | 0 | |
| 2.4 | Transaction duration | avg by (instance,state) ... | Calculates the average maximum transaction duration by instance and state over the interval; helps identify long transactions | > 0 | pg_stat_activity_max_tx_duration | Maximum duration of active transactions (in seconds) | > 0 | > 9 seconds | | < 9 seconds | |
| 1.3 | Active replication slots | count(pg_replication_slot_slot_is_active) | Determines the number of active replication slots | 0,1,2,3 | pg_replication_slot_slot_is_active | Indicates whether the replication slot is active | 0 or 1 | Count < 2 | | 2 | |
Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
The pg_up metric in PostgreSQL monitoring indicates whether a PostgreSQL instance is available and responding to queries. When pg_up is 1, the PostgreSQL server is running and successfully reachable by the exporter or monitoring tool. If pg_up is 0, the server is down, unresponsive, or cannot be reached due to network issues, configuration problems, or service crashes. The C3 team must submit all of the data gathered below to the DevOps team if the C3 remedy fails.
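As a quick cross-check (a minimal sketch, assuming the postgres_exporter is listening on port 9187 on the node, as referenced later in this runbook), the raw metric can be read directly from the exporter endpoint:

```bash
# Read pg_up straight from the exporter; a value of 1 means the exporter can reach
# PostgreSQL, 0 means it cannot.
curl -s http://localhost:9187/metrics | grep '^pg_up'
```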
- Run systemctl status postgres_exporter to check the status of the postgres_exporter service.
- Run systemctl status postgresql to see what state the service is in.
- Run ps aux | grep postgres and check whether the postgres process is running by looking for a command line similar to /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf (the version may differ).
- Run netstat -tlnp | grep postgres to see whether ports 5432 and 9187 appear in the output.
- Run df -h to check the disk space available on each partition, specifically the /data and / mounts.
- Run journalctl -xeu postgresql.service to get the postgresql.service logs.
- Run tail -100f /var/log/postgresql/postgresql-16-main.log to get the actual database logs; submit them to the DevOps team if the C3 remedy fails.

When pg_up shows 0 or no data, checking the following metrics will help in determining the issue:

- up{instance="<target_instance>"}: if the instance is in a shutdown state, you will typically receive a "no data" alert.
- node_filesystem_avail_bytes{instance="<target_instance>"}: with the instance filter, check the storage available for every mount on the server. You can check this metric using this panel. Specifically look at the / and /data mount points.
- node_memory_MemFree_bytes{instance="<target_instance>"}: check whether RAM usage is very near 100%, at which point the system kills the process.

Identify the type of instance (Master/Replica):

- Check pg_replication_is_replica{instance="<target_name>"} over a recent time frame in which the server was up and running. If the metric returns 1, the node is a replica; if it is 0 or no data, it is the master, and you should notify the DevOps team right away.
- Proceed to the following steps only when the instance type is replica; in any other case hand the issue over to the DevOps team.
Check the node status (shutdown/running). If the node is in the running state:
Run systemctl start postgresql postgres_exporter to start the postgresql and postgres_exporter services.
If the service starts:
Compare the last column of the ps aux | grep postgres output with the following sample:
postgres: /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
postgres: psorbit-pg-cluster: checkpointer
postgres: psorbit-pg-cluster: background writer
postgres: psorbit-pg-cluster: startup recovering 000000010000000000000073
postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
postgres: psorbit-pg-cluster: walreceiver streaming 0/73ED26A8
postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
Although systemctl status postgresql may show the service as active, actual database operations may still not work correctly; you must verify against the process names given above to confirm the database is operational.
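As a supplementary connectivity check (a minimal sketch, assuming the pg_isready utility shipped with the PostgreSQL client packages is installed on the node):

```bash
# Ask the local server whether it is accepting connections on the default port.
# Exit status 0 means accepting; 1 or 2 indicate rejecting connections or no response.
pg_isready -h localhost -p 5432
```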
Run df -h to check the storage on each partition. If Use% is near 100% (in the range of 95% to 100%, or less than 1 GB available), check for storage-consuming directories on the node using du -sh /*. Check the /var/log directory's consumption with du -sh /var/log and inform the DevOps team about the findings.
Sample output:
Filesystem Size Used Avail Use% Mounted on
tmpfs 794M 1016K 793M 1% /run
/dev/vda5 25G 14G 11G 57% /
tmpfs 3.9G 1.1M 3.9G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
none 3.9G 0 3.9G 0% /run/shm
tmpfs 3.9G 297M 3.6G 8% /tmp
/dev/vda3 974M 241M 666M 27% /boot
/dev/vdb 295M 175M 121M 60% /export/vdb
/dev/vda4 59G 6.8G 50G 13% /data
/dev/vda6 34G 2.9G 29G 9% /opt
tmpfs 794M 4.0K 794M 1% /run/user/1004
psorbit-node01:/ps_orbit_dfs 295M 178M 118M 61% /data/ps/orbit
Check the logs with journalctl -xeu postgresql and tail -100f /var/log/postgresql/postgresql-16-main.log.
Run free -h to check that enough RAM is available for PostgreSQL to work. Sample output:
       total        used        free      shared  buff/cache   available
Mem:   7.8Gi        3.7Gi       3.0Gi     321Mi   1.0Gi        3.4Gi
Swap:  8.0Gi        2.0Gi       6.0Gi
If the available RAM is less than 300M, or you see an OOM killer entry in any of the logs, free the page cache with:
sync; echo 3 > /proc/sys/vm/drop_caches
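If memory pressure is suspected, a quick generic Linux check (not specific to this cluster) is to list the processes using the most RAM:

```bash
# Show the top memory-consuming processes, largest resident set size first.
ps aux --sort=-%mem | head -n 15
```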
If the node is in the shutdown state:
Check for the existence of the standby.signal file in PostgreSQL's data directory (/data/db/psql/16/main) by running:
ll /data/db/psql/16/main/standby.signal
Sample output:
-rw------- 1 postgres postgres 0 Oct 1 09:51 /data/db/psql/16/main/standby.signal
If the file does not exist, create it and verify:
touch /data/db/psql/16/main/standby.signal
ll /data/db/psql/16/main/standby.signal
/data/db/psql/16/main/standby.signal must be accessible by the postgres user; set the ownership and verify with:
chown -R postgres:postgres /data/db/psql/16/main/standby.signal
ll /data/db/psql/16/main/standby.signal
Sample output:
-rw------- 1 postgres postgres 0 Oct 1 09:51 /data/db/psql/16/main/standby.signal
Start the service with systemctl start postgresql; if it still does not start, proceed from step 2.2 of this section.

If the C3 team has followed the data collection steps and tried the available remedies and the postgresql service still does not start, the DevOps team should follow these instructions:
If the node is Replica:
If the node is Master:
Instantiate a new master:
- Check pg_replication_slot_slot_current_wal_lsn and choose the replica whose slot is at the latest LSN.
- Update /etc/hosts for the hostname pg-master to point to the new master's IP address.
- Stop the service: systemctl stop postgresql
- Remove the standby.signal and postgresql.auto.conf files from the data directory using:
rm /data/db/psql/16/main/standby.signal /data/db/psql/16/main/postgresql.auto.conf
- Start the service: systemctl start postgresql
- Check the service status: systemctl status postgresql
Sample output:
● postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
Active: active (exited) since Thu 2024-11-07 16:08:40 IST; 2 weeks 0 days ago
Main PID: 3946191 (code=exited, status=0/SUCCESS)
CPU: 3ms
- Verify the processes with ps aux | grep postgres:
postgres: psorbit-pg-cluster: checkpointer
postgres: psorbit-pg-cluster: background writer
postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
postgres: psorbit-pg-cluster: walwriter
postgres: psorbit-pg-cluster: autovacuum launcher
postgres: psorbit-pg-cluster: logical replication launcher
postgres: psorbit-pg-cluster: walsender replicator 172.21.0.92(38070) streaming 0/73F706C8
postgres: psorbit-pg-cluster: walsender replicator 172.21.0.90(40186) streaming 0/73F706C8
You might not see the walsender (replicator) processes immediately, because the server has only just started; wait for some time and check again until they are active.
Start the old master as a slave node:
- Ensure that the hostname pg-master resolves to the new master's IP address.
- Create the standby.signal file in the data directory and change its permissions by using the following commands:
touch /data/db/psql/16/main/standby.signal
chown -R postgres:postgres /data/db/psql/16/main/standby.signal
- Start the postgresql service and verify that the walreceiver process is active by running ps aux | grep postgres. The output should resemble:
postgres: /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
postgres: psorbit-pg-cluster: checkpointer
postgres: psorbit-pg-cluster: background writer
postgres: psorbit-pg-cluster: startup recovering 000000010000000000000073
postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
postgres: psorbit-pg-cluster: walreceiver streaming 0/73ED26A8
postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
If replication does not start, rebuild the data directory from the new master and fix its ownership:
pg_basebackup -h pg-master -U replicator -p 5432 -D /data/db/psql/16/main -P -Xs -R
chown -R postgres:postgres /data/db/psql/16/main
ll /data/db/psql/16/main
Sample output:
drwx------ 19 postgres postgres 4096 Nov 7 16:08 ./
drwxr-xr-x 3 postgres postgres 4096 Sep 26 09:41 ../
-rw------- 1 postgres postgres 3 Oct 1 09:51 PG_VERSION
-rw------- 1 postgres postgres 227 Oct 1 09:51 backup_label.old
-rw------- 1 postgres postgres 260079 Oct 1 09:51 backup_manifest
drwx------ 7 postgres postgres 4096 Oct 1 09:51 base/
drwx------ 2 postgres postgres 4096 Nov 7 16:08 global/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_commit_ts/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_dynshmem/
drwx------ 4 postgres postgres 4096 Nov 22 12:38 pg_logical/
drwx------ 4 postgres postgres 4096 Oct 1 09:51 pg_multixact/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_notify/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_replslot/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_serial/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_snapshots/
drwx------ 2 postgres postgres 4096 Nov 7 16:08 pg_stat/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_stat_tmp/
drwx------ 2 postgres postgres 4096 Nov 12 01:03 pg_subtrans/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_tblspc/
drwx------ 2 postgres postgres 4096 Oct 1 09:51 pg_twophase/
drwx------ 3 postgres postgres 4096 Nov 21 15:02 pg_wal/
drwx------ 2 postgres postgres 4096 Nov 12 00:58 pg_xact/
-rw------- 1 postgres postgres 318 Oct 1 09:51 postgresql.auto.conf
-rw------- 1 postgres postgres 124 Nov 7 16:08 postmaster.opts
-rw------- 1 postgres postgres 107 Nov 7 16:08 postmaster.pid
-rw------- 1 postgres postgres 0 Oct 1 09:51 standby.signal
The pg_replication_slot_slot_is_active metric indicates whether each replication slot is active; the count of active slots shows how many replica nodes are currently connected. If the count drops below 2, a replica node is either disconnected from the master, has stopped replication, or is unavailable due to network, storage, or process issues.
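To cross-check this metric directly on the master (a minimal sketch, assuming local access as the postgres OS user), count the active replication slots; fewer than 2 corresponds to the CRITICAL condition in the table above:

```bash
# Count active replication slots on the master.
sudo -u postgres psql -c "SELECT count(*) AS active_slots FROM pg_replication_slots WHERE active;"
```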
The C3 team must collect and report the following data to the DevOps team if the issue cannot be resolved through C3’s remedies.
Instance Name, IP Address, and Alert Duration:
Collect the instance name, IP address, and the total duration the alert has been in the firing state.
Instantiate new master:
- Check pg_replication_slot_slot_current_wal_lsn and choose the replica whose slot is at the latest LSN.
- Update /etc/hosts for the hostname pg-master to point to the new master's IP address.
- Stop the service: systemctl stop postgresql
- Remove the standby.signal and postgresql.auto.conf files from the data directory using:
rm /data/db/psql/16/main/standby.signal /data/db/psql/16/main/postgresql.auto.conf
- Start the service: systemctl start postgresql
- Verify the processes with ps aux | grep postgres:
postgres: psorbit-pg-cluster: checkpointer
postgres: psorbit-pg-cluster: background writer
postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
postgres: psorbit-pg-cluster: walwriter
postgres: psorbit-pg-cluster: autovacuum launcher
postgres: psorbit-pg-cluster: logical replication launcher
postgres: psorbit-pg-cluster: walsender replicator 172.21.0.92(38070) streaming 0/73F706C8
postgres: psorbit-pg-cluster: walsender replicator 172.21.0.90(40186) streaming 0/73F706C8
Start the old master as a slave node:
- Ensure that the hostname pg-master resolves to the new master's IP address.
- Create the standby.signal file in the data directory and change its permissions by using the following commands:
touch /data/db/psql/16/main/standby.signal
chown -R postgres:postgres /data/db/psql/16/main/standby.signal
- Start the postgresql service and verify that the walreceiver process is active by running ps aux | grep postgres. The output should resemble:
postgres: /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
postgres: psorbit-pg-cluster: checkpointer
postgres: psorbit-pg-cluster: background writer
postgres: psorbit-pg-cluster: startup recovering 000000010000000000000073
postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
postgres: psorbit-pg-cluster: walreceiver streaming 0/73ED26A8
postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
If replication does not start, rebuild the data directory from the new master and fix its ownership:
pg_basebackup -h pg-master -U replicator -p 5432 -D /data/db/psql/16/main -P -Xs -R
chown -R postgres:postgres /data/db/psql/16/main
You might not see the replication processes immediately, because the server has only just started; wait for some time and check again until they are active.
Check PostgreSQL Service on Master and Replica Nodes:
Verify that the postgresql service is running on both master and replica nodes. Use:
systemctl status postgresql
Network Connectivity:
Ensure all nodes (master and replicas) can reach each other on the network. Test using:
ping pg-master
ping <replica_ip>
Sample output:
PING pg-master (172.21.0.91) 56(84) bytes of data.
64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=1 ttl=64 time=18.1 ms
64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=2 ttl=64 time=0.228 ms
64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=3 ttl=64 time=0.312 ms
64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=4 ttl=64 time=0.318 ms
--- pg-master ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3039ms
rtt min/avg/max/mdev = 0.228/4.727/18.052/7.692 ms
Port Connectivity (5432):
Check that the master and replicas are listening on the expected ports:
netstat -tlnp | grep postgres
telnet <replica_ip> 5432
telnet pg-master 5432
Sample output:
telnet pg-master 5432
Trying 172.21.0.91...
Connected to pg-master.
Escape character is '^]'.
^]
Connection closed by foreign host.
PostgreSQL Processes:
Verify the presence of walsender on the master and walreceiver on the replicas. Use:
ps aux | grep postgres
Your output must include the following process in the case of a replica:
postgres: psorbit-pg-cluster: walreceiver streaming 0/7514C2C8
For the master node:
walsender replicator 172.21.0.90(40186) streaming 0/7514CB78
Storage Space:
Check the disk space availability on all nodes, focusing on the PostgreSQL data directories. Use:
df -h
Check whether PostgreSQL has run out of free storage on either the / or /data mount.
Replication Logs:
Gather logs related to replication from both master and replicas:
journalctl -xeu postgresql.service
tail -100f /var/log/postgresql/postgresql-16-main.log
Submit both logs to the DevOps team if the C3 remedy fails.
Replication Slots:
On the master, verify the status of the replication slots. Use:
SELECT * FROM pg_replication_slots;
The output should be similar to:
slot_name | restart_lsn | wal_status
-----------+-------------+------------
slot_3 | 0/75151770 | reserved
slot_2 | |
slot_1 | 0/75151770 | reserved
Each slot corresponds to a replica node: slot_1 is for node-1, slot_3 is for node-3, and so on. At least one slot should be active for replication to be running on at least one node.
Check the Node Status and Type (Master/Replica):
Use the pg_replication_is_replica{instance="<target_instance>"} metric to identify the type of the node.
Verify Replication Processes:
- Confirm that walsender processes are active on the master, one for each replica.
- Confirm that walreceiver processes are active on the replicas.
ps aux | grep postgres
Sample output:
postgres 3854730 0.2 0.1 1238296 11908 ? Ssl Nov07 58:12 /opt/postgres_exporter/postgres_exporter-0.15.0.linux-amd64/postgres_exporter --web.listen-address=:9187 --web.telemetry-path=/metrics --collector.postmaster --collector.long_running_transactions --collector.database_wraparound --collector.process_idle --collector.stat_activity_autovacuum --collector.stat_statements --collector.stat_wal_receiver --collector.xlog_location
root 3909146 0.0 0.0 4028 2136 pts/2 S+ 13:45 0:00 grep --color=auto postgres
postgres 3946117 0.0 0.1 217276 16072 ? Ss Nov07 2:20 /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
postgres 3946125 0.0 0.5 217404 42080 ? Ss Nov07 0:36 postgres: psorbit-pg-cluster: checkpointer
postgres 3946126 0.0 0.0 217276 5120 ? Ss Nov07 0:16 postgres: psorbit-pg-cluster: background writer
postgres 3946127 0.0 0.5 219088 43632 ? Ss Nov07 1:23 postgres: psorbit-pg-cluster: startup recovering 000000010000000000000076
postgres 3946133 0.0 0.0 217380 4848 ? Ss Nov07 0:05 postgres: psorbit-pg-cluster: archiver last was 000000010000000000000075
postgres 3946134 0.0 0.1 218364 9672 ? Ss Nov07 5:02 postgres: psorbit-pg-cluster: walreceiver streaming 0/7648AC80
postgres 3946351 0.0 0.2 220736 17016 ? Ss Nov07 15:30 postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
Restart PostgreSQL Services:
Restart the PostgreSQL service on the affected replicas:
systemctl restart postgresql
Re-establish Replication:
If replication has stopped, re-establish it:
Use pg_basebackup from the master to recreate the replica:
pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
chown -R postgres:postgres /data/db/psql/16/main
Check Replication Lag:
Monitor the historic replication lag using metrics such as pg_replication_lag. If it is unusually high, troubleshoot the network and performance.
Address Storage Issues:
- Check free disk space with df -h.
- Check the /var/log directory for excessive usage.
Log Analysis:
Check for replication-specific errors in logs on both master and replicas.
- pg_stat_replication on the master.
- up{instance="<target_instance>"}.
- node_filesystem_avail_bytes for sufficient storage on the nodes.
- node_memory_MemFree_bytes to ensure adequate free memory on the nodes.

If C3 remedies fail, escalate to the DevOps team with collected data and logs. The DevOps team can perform:
Restore Replication:
Re-establish replication using pg_basebackup.
Rebuild Replication Slots:
If replication slots are corrupt or missing, recreate them on the master.
Address Persistent Storage or Network Issues:
Failover Procedures:
Promote a replica to master if the master is irrecoverable, and reinitialize the failed node as a replica.
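For reference, PostgreSQL also ships a built-in promotion mechanism that reaches the same end state as the standby.signal removal procedure documented above; treat this only as an optional cross-check, assuming the PostgreSQL 16 binary and data directory paths used elsewhere in this runbook:

```bash
# Promote a running replica to master with pg_ctl (run as the postgres user) ...
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl promote -D /data/db/psql/16/main
# ... or from SQL on the replica itself.
sudo -u postgres psql -c "SELECT pg_promote();"
```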
The pg_replication_lag_seconds metric in PostgreSQL monitoring measures the time difference (in seconds) between the master and a replica. High replication lag can result in inconsistent data for applications relying on replicas. The C3 team must collect the following data and attempt remedies before escalating the issue to the DevOps team.
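The exporter value can be cross-checked directly on a replica (a minimal sketch, assuming local access as the postgres OS user); pg_last_xact_replay_timestamp() is the standard building block for this kind of lag calculation:

```bash
# Seconds since the last replayed transaction committed on the master.
# Note: this value also grows when the master is idle (nothing new to replay).
sudo -u postgres psql -c "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;"
```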
- Run systemctl status postgresql on both master and replicas.
- Connect to the database as the postgres user and check the replication status:
sudo su postgres
psql
SELECT application_name, client_addr, state, sync_priority, sync_state, replay_lag FROM pg_stat_replication;
Sample Output:
application_name | client_addr | state | sync_priority | sync_state | replay_lag
--------------------+-------------+-----------+---------------+------------+-----------------
psorbit-pg-cluster | 172.21.0.90 | streaming | 0 | async | 00:00:00.002042
psorbit-pg-cluster | 172.21.0.92 | streaming | 0 | async | 00:00:00.002073
(2 rows)
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
Sample output:
slot_name | active | restart_lsn
-----------+--------+-------------
slot_3 | t | 0/764908E8
slot_2 | f |
slot_1 | t | 0/764908E8
(3 rows)
A value of t indicates that the slot is active.
- Collect logs with journalctl -xeu postgresql.service and tail -100f /var/log/postgresql/postgresql-16-main.log.
- Use ping <replica_IP> and telnet <replica_IP> 5432 to check connectivity and port status.
- Use df -h for disk space and iostat for disk I/O stats.
- Check the pg_wal directory for excessive files:
ls -lh /data/db/psql/archive
- Use top or htop for real-time resource monitoring.

When pg_replication_lag_seconds shows high values, investigate the following:
- pg_stat_replication.
- The up metric or pg_is_in_recovery().

Verify replica status:
SELECT client_addr, state, sync_state FROM pg_stat_replication;
Restart lagging replica:
systemctl restart postgresql
Verify replication configuration:
Ensure the postgresql.conf and pg_hba.conf files are correctly configured for replication (check for syntax errors only; hand over to DevOps for any configuration changes). A sketch for a quick syntax check follows this list.
Network-related issues:
Resolve WAL file issues:
Ensure sufficient WAL retention, for example:
wal_keep_size = '1GB'
Disk performance:
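A minimal read-only sketch for the syntax check mentioned under "Verify replication configuration" above, assuming local access as the postgres OS user; both views are standard PostgreSQL system views and report per-line errors without applying any changes:

```bash
# Lines in postgresql.conf that could not be parsed or applied.
sudo -u postgres psql -c "SELECT sourcefile, sourceline, name, error FROM pg_file_settings WHERE error IS NOT NULL;"
# Syntactically invalid entries in pg_hba.conf.
sudo -u postgres psql -c "SELECT line_number, error FROM pg_hba_file_rules WHERE error IS NOT NULL;"
```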
If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:
Increase replication slots and resources:
Rebuild the replica:
Use pg_basebackup from the master to recreate the replica on the lagging node:
pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
chown -R postgres:postgres /data/db/psql/16/main
Replica failover:
Review replication strategy:
The pg_postmaster_start_time_seconds metric (the Start time panel) identifies the uptime of a PostgreSQL node. This alert triggers when the uptime is 2 minutes or less, indicating a recent restart. This could signify planned maintenance or an unexpected failure. The C3 team must collect the following data and perform initial troubleshooting before escalating the issue to the DevOps team.
systemctl status postgresql
SELECT date_trunc('seconds', now() - pg_postmaster_start_time()) AS uptime FROM pg_stat_activity;
Sample output:
uptime
------------------
20 days 13:22:27
ps aux | grep 'walsender\|walreceiver'
Sample output:
postgres 2639 0.0 0.1 219532 11036 ? Ss Nov02 4:12 postgres: psorbit-pg-cluster: walsender replicator 172.21.0.92(38070) streaming 0/76490BC0
postgres 3978687 0.0 0.1 219532 12136 ? Ss Nov07 3:28 postgres: psorbit-pg-cluster: walsender replicator 172.21.0.90(40186) streaming 0/76490BC0
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
Sample output:
slot_name | active | restart_lsn
-----------+--------+-------------
slot_3 | t | 0/76491200
slot_2 | f |
slot_1 | t | 0/76491200
(3 rows)
Test a client connection (as permitted by pg_hba.conf; for credentials, contact DevOps):
psql -h 172.21.0.91 -U haproxy -d haproxy_checks -c 'SELECT 1;'
Sample output:
?column?
----------
1
(1 row)
SELECT pg_current_wal_lsn(), pg_is_in_recovery();
Sample output (in the case of a master node):
pg_current_wal_lsn | pg_is_in_recovery
--------------------+-------------------
0/764AA308 | f
(1 row)
Sample output (in the case of a replica node):
pg_current_wal_lsn | pg_is_in_recovery
--------------------+-------------------
0/764AA308 | t
(1 row)
SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;
Sample output (on the master):
application_name | state | sync_state | replay_lag
--------------------+-----------+------------+------------
psorbit-pg-cluster | streaming | async |
psorbit-pg-cluster | streaming | async |
(2 rows)
journalctl -xeu postgresql.service
tail -100 /var/log/postgresql/postgresql-16-main.log
When the pg_postmaster_start_time_seconds metric indicates a recent restart, monitor:
Confirm intentional restarts:
Restart PostgreSQL service:
systemctl restart postgresql
Validate replication:
Verify logs for errors:
Check disk and resource status:
If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:
Rebuild replication slots:
Reconfigure PostgreSQL settings:
Review system-level issues:
Cluster validation:
The pg_replication_is_replica metric detects changes in the node role within a PostgreSQL cluster. This alert triggers when a master node transitions into a replica or a replica behaves inconsistently. If a node's role changes unexpectedly, it may indicate an unintended failover or a misconfiguration. The C3 team must verify the current cluster state and perform immediate actions to stabilize the system.
Instance details:
Record the instance details and the current node role (master or replica).
Cluster status:
SELECT pg_is_in_recovery() AS is_replica, inet_server_addr() AS node_ip;
Sample output:
is_replica | node_ip
------------+---------
f |
(1 row)
Replication status:
SELECT client_addr, state, sync_state FROM pg_stat_replication;
client_addr | state | sync_state
-------------+-----------+------------
172.21.0.90 | streaming | async
172.21.0.92 | streaming | async
(2 rows)
Replication slot status:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
Sample output:
slot_name | active | restart_lsn
-----------+--------+-------------
slot_3 | t | 0/764AAD48
slot_2 | f |
slot_1 | t | 0/764AAD48
(3 rows)
Logs:
journalctl -xeu postgresql.service
tail -100 /var/log/postgresql/postgresql-16-main.log
Hostname-IP consistency:
Verify that the pg-master hostname resolves to the correct new master IP address:
nslookup pg-master
ping pg-master
Sample output:
Server: 127.0.0.53
Address: 127.0.0.53#53
Name: pg-master
Address: 172.21.0.91
Node downtime verification:
Database connections:
psql -h 172.21.0.91 -U haproxy -d haproxy_checks -c 'SELECT 1;'
When pg_replication_is_replica detects role changes, also monitor:
- Replication lag (pg_replication_lag_seconds): confirm replicas are catching up.
- Replication status (pg_stat_replication): ensure WAL files are replayed correctly.

Confirm intentional role change:
Update hostname:
Update the pg-master hostname to point to the correct master IP address.
Restart replication:
systemctl restart postgresql
Announce downtime:
Verify system connectivity:
If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:
Analyze cluster stability:
If the role change was unintentional, check for related issues and instantiate a new master following the changes described in the previous procedures.
Reconfigure replicas:
pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
The pg_stat_archiver_failed_count metric tracks the number of failed attempts to archive WAL (Write-Ahead Logging) files. A failure in WAL archiving typically arises due to insufficient storage, directory permission issues, or incorrect configurations. These failures can lead to WAL file accumulation, affecting the performance and durability of the database. The C3 team must collect data and address issues promptly to avoid disruptions.
Instance details:
Failed count and logs:
SELECT failed_count, last_failed_wal FROM pg_stat_archiver;
failed_count | last_failed_wal
--------------+-----------------
0 |
(1 row)
tail -100 /var/log/postgresql/postgresql-16-main.log | grep "archive"
Directory status:
ls -ld /data/db/psql/archive
Sample output:
drwx------ 2 postgres postgres 4096 Nov 21 14:57 /data/db/psql/archive
The directory must be owned by the postgres user.
Disk space availability:
df -h /data/db/psql/archive
Sample output:
Filesystem Size Used Avail Use% Mounted on
/dev/vda4 59G 6.8G 50G 13% /data
Network and storage connectivity (only if the archive directory is on a remote location, which is not the case as of now):
ping <storage_server_ip>
mount | grep /data/db/psql/archive
Archiving process status:
ps aux | grep "postgres: archiver"
WAL directory size:
Check the archive directory for unarchived files that may accumulate due to failures:
ls -lh /data/db/psql/archive
du -sh /data/db/psql/archive
Configuration validation:
Verify that archive_command in postgresql.conf is configured correctly:
grep "archive_command" /etc/postgresql/16/main/postgresql.conf
Manually test that the archive destination is writable, for example:
cp <test_wal_file> /data/db/psql/archive
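A non-destructive way to confirm that the postgres user can write to the archive directory (a sketch using a temporary file, independent of any real WAL segment):

```bash
# Verify write permission on the archive directory as the postgres user, then clean up.
sudo -u postgres touch /data/db/psql/archive/.archive_write_test && \
  sudo -u postgres rm /data/db/psql/archive/.archive_write_test && \
  echo "archive directory is writable"
```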
When pg_stat_archiver_failed_count increases, check:
- The pg_wal directory, for WAL file accumulation.
Fix directory permissions:
chown postgres:postgres /data/db/psql/archive
chmod 700 /data/db/psql/archive
Free up storage:
find /data/db/psql/archive -type f -mtime +30 -delete
Test and restart archiving:
Manually run the archive_command to ensure it works:
cp <sample_file> /data/db/psql/archive
systemctl restart postgresql
Monitor WAL directory:
Keep an eye on /data/pg_wal to prevent storage overflow.

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:
Archive directory reconfiguration:
Reconfigure the archive destination in postgresql.conf:
archive_command = 'cp %p /data/db/psql/archive/%f'
Automate cleanup:
crontab -e
0 3 * * * find /data/db/psql/archive -type f -mtime +30 -delete
Increase disk space:
Extend the storage backing /data/db/psql/archive.
WAL retention:
Adjust wal_keep_size to manage WAL retention efficiently:
wal_keep_size = '1GB'
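After changing wal_keep_size in postgresql.conf, the new value can be applied and verified without a full restart (a sketch, assuming local access as the postgres OS user):

```bash
# Reload the configuration and confirm the active value of wal_keep_size.
sudo -u postgres psql -c "SELECT pg_reload_conf();"
sudo -u postgres psql -c "SHOW wal_keep_size;"
```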
The Conflicts/Deadlocks panel, based on the pg_stat_database_deadlocks metric, captures deadlock events in PostgreSQL databases. Deadlocks occur when two or more transactions block each other, preventing any of them from proceeding. These conflicts can degrade database performance and impact application availability. The C3 team must investigate and address the root cause of deadlocks promptly before escalating to DevOps.
Instance details:
Deadlock metrics:
SELECT datname, deadlocks FROM pg_stat_database WHERE deadlocks > 0;
Sample output:
datname | deadlocks
---------+-----------
(0 rows)
Deadlock logs:
grep "deadlock detected" /var/log/postgresql/postgresql-16-main.log
Query inspection:
pg_stat_activity
:
SELECT pid, datname, usename, query, state, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
Sample output:
pid | datname | usename | query | state | wait_event
-----+---------+---------+-------+-------+------------
(0 rows)
You will see details of the event when there is a deadlock.
Blocked queries:
SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Sample output:
blocked_pid | blocking_pid | blocked_query | blocking_query
-------------+--------------+---------------+----------------
(0 rows)
Database performance:
top -b -n1 | grep postgres
iostat -xm 1 5
Sample output:
3854730 postgres 20 0 1238296 12544 5188 S 0.0 0.2 58:31.86 postgres_export
3946117 postgres 20 0 217276 16136 15692 S 0.0 0.2 2:21.53 postgres
3946125 postgres 20 0 217404 42080 41584 S 0.0 0.5 0:36.81 postgres
3946126 postgres 20 0 217276 5120 4768 S 0.0 0.1 0:16.76 postgres
3946127 postgres 20 0 219088 43632 42984 S 0.0 0.5 1:23.34 postgres
3946133 postgres 20 0 217380 4848 4448 S 0.0 0.1 0:05.72 postgres
3946134 postgres 20 0 218364 9672 8952 S 0.0 0.1 5:03.13 postgres
3946351 postgres 20 0 220736 17016 15172 S 0.0 0.2 15:35.74 postgres
Linux 5.15.0-124-generic (psorbit-node01) 11/22/24 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.17 0.05 8.22 0.41 0.02 81.13
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
loop0 0.00 0.00 0.00 0.00 0.70 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sr0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vda 4.98 0.25 0.95 16.07 2.18 50.98 36.22 0.33 13.33 26.90 4.05 9.34 0.03 0.08 0.00 0.00 0.35 2921.77 8.06 0.99 0.17 3.11
vdb 0.48 0.01 0.00 0.00 29.50 17.96 0.54 0.01 0.11 17.44 22.41 19.11 0.00 0.00 0.00 0.00 2.98 6700.00 0.07 2.88 0.03 1.59
Locks on objects:
SELECT locktype, relation::regclass, mode, granted, transactionid, virtualtransaction
FROM pg_locks
WHERE NOT granted;
Sample output:
locktype | relation | mode | granted | transactionid | virtualtransaction
----------+----------+------+---------+---------------+--------------------
(0 rows)
When deadlocks are detected, monitor:
Report conflicting queries:
SELECT pg_terminate_backend(<blocked_pid>);
Analyze transaction scope:
If the C3 team cannot resolve the issue, escalate to DevOps with the following actions:
Lock priority tuning:
max_locks_per_transaction = 128
Reorganize table structures:
Review application logic:
Terminate conflicting queries after discussion with the developers:
SELECT pg_terminate_backend(<blocked_pid>);
The pg_stat_activity_max_tx_duration metric identifies long-running transactions in PostgreSQL that have been executing for 9 seconds or more. Long transactions can lead to performance issues, such as locks, blocking other transactions, or consuming excessive resources. The C3 team should collect transaction details and report them, and the DevOps team may consider terminating transactions after consulting with developers to ensure system stability.
Instance details:
Identify long transactions:
SELECT pid, datname, usename, query, state, backend_start, xact_start, query_start
FROM pg_stat_activity
WHERE now() - query_start > interval '9 seconds'
AND state = 'active';
Sample output:
pid | datname | usename | query | state | backend_start | xact_start | query_start
---------+---------+------------+--------------------------------------------------------+--------+----------------------------------+------------+----------------------------------
    2639 |         | replicator | START_REPLICATION SLOT "slot_3" 0/4C000000 TIMELINE 1  | active | 2024-11-02 00:36:37.326071+05:30 |            | 2024-11-02 00:36:37.396455+05:30
 3978687 |         | replicator | START_REPLICATION SLOT "slot_1" 0/4E000000 TIMELINE 1  | active | 2024-11-07 16:08:38.691104+05:30 |            | 2024-11-07 16:08:38.744189+05:30
(2 rows)
Transaction details:
SELECT pid, usename, query, query_start, state, xact_start
FROM pg_stat_activity
WHERE pid IN (<list of long transaction PIDs>);
Locks held by transactions:
SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Sample output:
blocked_pid | blocking_pid | blocked_query | blocking_query
-------------+--------------+---------------+----------------
(0 rows)
Transaction impact on system resources:
top -b -n1 | grep postgres
Run iostat -xm 1 5 to identify any disk I/O performance issues.

Report transaction details:
Investigate transaction impact:
Contact developers:
If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:
Terminate long transactions (after developer discussion):
SELECT pg_terminate_backend(<pid>);
Review query optimization:
Evaluate transaction management: