PostgreSQL Observability

Prerequisites

Please read the following documents before addressing any issues, to become familiar with PostgreSQL.

PostgreSQL Architecture

Server setup and operator

Common high availability structures

Current architecture of our cluster

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the procedures outlined below.

  • Grafana dashboards for PostgreSQL link
  • Alerts link

Monitoring Panel Severity Matrix

Row | Panel | Query | Query Description | Query Operating Range | Metric | Metric Description | Metric Operating Range | Alert (Critical/Warning) | OK | Priority
----|-------|-------|-------------------|-----------------------|--------|--------------------|------------------------|--------------------------|----|---------
1.3 | Replication lag | pg_replication_lag_seconds | Lag behind master in seconds | +ve values | pg_replication_lag_seconds | Replication lag behind master in seconds | +ve values | > 9 seconds | < 9 seconds | 
1.3 | Start time | pg_postmaster_start_time_seconds | Time since PostgreSQL instance started | +ve values | pg_postmaster_start_time_seconds | Time since PostgreSQL instance started | +ve values | < 120 seconds | > 120 seconds | 
1.3 | DB instance type | pg_replication_is_replica | Returns 1 if the node is a replica | 0-1 | pg_replication_is_replica | Returns 1 if the node is a slave/replica | 0-1 | Change 0->1 or 1->0 | No change in 10s | 
1.4 | Failed WAL archives | pg_stat_archiver_failed_count | Failed WAL archive files on master | +ve values | pg_stat_archiver_failed_count | Number of failed WAL archive files | +ve values | +ve values | 0 | 
2.6 | Conflicts/Deadlocks | sum(rate(pg_stat_database_deadlocks{...}[$interval])) | Monitors PostgreSQL deadlock rate over a 5m interval, offering insight into locking issues and conflicts | +ve values | pg_stat_database_deadlocks | Number of deadlocks detected | +ve values | +ve values | 0 | 
2.4 | Transaction duration | avg by (instance,state) ... | Calculates the average maximum transaction duration by instance and state over the interval; helps identify long transactions | > 0 | pg_stat_activity_max_tx_duration | Maximum duration of active transactions (in secs) | > 0 | > 9 seconds | < 9 seconds | 
1.3 | Active replication slots | count(pg_replication_slot_slot_is_active) | Determines active replication slots | 0,1,2,3 | pg_replication_slot_slot_is_active | Indicates if the replication slot is active | 0 or 1 | Count < 2 | 2 | 
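
For ad-hoc verification of any value in the matrix, the panel metrics can also be queried directly from Prometheus rather than through Grafana. The snippet below is a minimal sketch only; it assumes the Grafana datasource is a standard Prometheus server reachable at prometheus:9090 (substitute the real address) and that jq is installed.

  # Query the replication-lag metric from the Prometheus HTTP API.
  # "prometheus:9090" is an assumed address; use the server behind the Grafana dashboards.
  curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=pg_replication_lag_seconds' \
    | jq -r '.data.result[] | "\(.metric.instance) lag=\(.value[1])s"'

  # Same idea for the active-slot count used by the "Active replication slots" panel.
  curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=count(pg_replication_slot_slot_is_active)' \
    | jq -r '.data.result[0].value[1]'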

Alert Handling Procedure

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.

PostgreSQL Instance State

Alertname: PostgresInstanceState

The pg_up metric in PostgreSQL monitoring indicates whether a PostgreSQL instance is available and responding to queries. When pg_up is 1, it means the PostgreSQL server is running and successfully reachable by the exporter or monitoring tool. If pg_up is 0, the server is down, unresponsive, or cannot be reached due to network issues, configuration problems, or service crashes. The C3 team must submit all collected data to the DevOps team if the C3 remedy fails.

C3 Data Collection

  1. Instance name, IP address, and age of alert in firing state: Collect the instance name, the IP address of the instance, and the total time the alert has been in the firing state.
  2. For a no-data alert: Check the postgres_exporter status first.
    • Use systemctl status postgres_exporter to check the status of postgres_exporter service.
  3. Service Status: Verify the PostgreSQL service status on the instance where the metric reported 0 or no data. Check if the service is in an active, activating, or failed state.
    • Use systemctl status postgresql to see what state the service is in.
  4. Process status: Verify that the postgres process is running with the expected child processes:
    • Use ps aux | grep postgres and check that the postgres process is running by looking for a command like /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf (the version may differ).
  5. Port status: Check that PostgreSQL and the exporter are listening on their expected ports.
    • Use netstat -tlnp | grep postgres to see whether ports 5432 and 9187 appear in the output.
  6. Disk space: Check that enough storage space is available for PostgreSQL to function.
    • Use df -h to check the disk space available on each partition, specifically the / and /data mounts.
  7. Log messages: Collect and submit the log messages using the commands below (a helper script that gathers all of the above in one pass is sketched after this list).
    • Use journalctl -xeu postgresql.service to get the postgresql.service logs.
    • Use tail -100f /var/log/postgresql/postgresql-16-main.log to get the actual database logs and submit them to DevOps when requested.
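
The commands above can be gathered in one pass to speed up escalation. This is a minimal helper sketch, not an official tool; it assumes the service names, ports, and log path used in this document (adjust the log path if the PostgreSQL major version differs).

  # Collect the C3 data for a PostgresInstanceState alert into a single file for escalation.
  OUT="/tmp/pg_c3_$(hostname)_$(date +%Y%m%d%H%M%S).txt"
  {
    echo "== service status =="          ; systemctl status postgresql postgres_exporter --no-pager
    echo "== processes =="               ; ps aux | grep '[p]ostgres'
    echo "== listening ports =="         ; netstat -tlnp | grep -E '5432|9187'
    echo "== disk space =="              ; df -h / /data
    echo "== postgresql.service log =="  ; journalctl -u postgresql.service -n 100 --no-pager
    echo "== database log =="            ; tail -100 /var/log/postgresql/postgresql-16-main.log
  } > "$OUT" 2>&1
  echo "Attach $OUT when escalating to the DevOps team"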

Dependent Metrics

When pg_up shows 0 or no data, checking the following metrics will help in determining the issue:

  • Node status: Check if the instance is in a running state by using up{instance="<target_instance>"}. If the instance is in a shutdown state, you will typically receive a no-data alert.
  • Disk space: Check the disk space by using node_filesystem_avail_bytes{instance="<target_instance>"} with instance filter to check for storage available for every mount on the server. You can check this metric using this panel. Specifically look out for / and /data mount points.
  • RAM usage: Check the node's physical memory metrics, since memory pressure can affect the operating status of lower-priority services. Use node_memory_MemFree_bytes{instance="<target_instance>"} to check whether free RAM is close to zero (usage near 100%), at which point the OOM killer may terminate processes.

C3 Remedy

  1. Identify the type of instance (Master/Slave):

    1. Check the type of node by inspecting pg_replication_is_replica{instance="<target_name>"} over a recent time range during which the server was still up and running. If the metric returns 1, the node is a replica; if it returns 0 or no data, the node is the master and you must notify the DevOps team right away.

      Proceed to the following steps only when the instance is a replica; in any other case, hand the issue over to the DevOps team.

  2. Check the node status (shutdown/running). If the node is in a running state:

    1. Start the service: Use systemctl start postgresql postgres_exporter to start both postgresql and postgres_exporter. If the service starts:
      1. Check whether replication has started correctly by verifying the processes a healthy PostgreSQL replica runs.
        Compare the last column of the ps aux | grep postgres output with the following sample output:
      postgres: /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
      postgres: psorbit-pg-cluster: checkpointer 
      postgres: psorbit-pg-cluster: background writer 
      postgres: psorbit-pg-cluster: startup recovering 000000010000000000000073
      postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
      postgres: psorbit-pg-cluster: walreceiver streaming 0/73ED26A8
      postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
      

      Even when systemctl status postgresql shows the service as active, the actual database operations may not be working correctly; you must verify against the process names given above to confirm the database is operational (a scripted version of this check is sketched at the end of this procedure).

      2. If the service doesn’t start:
      Check the storage space available:
      1. Use df -h to check the storage for each partition:
        • If any partition's Use% is close to 100% (in the range of 95% to 100%, or less than 1 GB free), check for storage-consuming directories on the node by using the following command: du -sh /*.
        • Check the /var/log directory's consumption by using du -sh /var/log and share the findings with the DevOps team. Sample df -h output:
        Filesystem                    Size  Used Avail Use% Mounted on
        tmpfs                         794M 1016K  793M   1% /run
        /dev/vda5                      25G   14G   11G  57% /
        tmpfs                         3.9G  1.1M  3.9G   1% /dev/shm
        tmpfs                         5.0M     0  5.0M   0% /run/lock
        none                          3.9G     0  3.9G   0% /run/shm
        tmpfs                         3.9G  297M  3.6G   8% /tmp
        /dev/vda3                     974M  241M  666M  27% /boot
        /dev/vdb                      295M  175M  121M  60% /export/vdb
        /dev/vda4                      59G  6.8G   50G  13% /data
        /dev/vda6                      34G  2.9G   29G   9% /opt
        tmpfs                         794M  4.0K  794M   1% /run/user/1004
        psorbit-node01:/ps_orbit_dfs  295M  178M  118M  61% /data/ps/orbit
        
        • If storage is available, then proceed to next step.
      2. Check the logs from journalctl -xeu postgresql and tail -100f /var/log/postgresql/postgresql-16-main.log.
      3. Use free -h to check that enough RAM is available for PostgreSQL to work. If the available RAM is less than 300M, or you see an OOM-killer entry in any of the logs:
        Sample Output:
                     total        used        free      shared  buff/cache   available
      Mem:           7.8Gi       3.7Gi       3.0Gi       321Mi       1.0Gi       3.4Gi
      Swap:          8.0Gi       2.0Gi       6.0Gi
      
      • To free up RAM (page cache), use:
      sync; echo 3 > /proc/sys/vm/drop_caches
      
      • Now try starting the service again; if it still does not start, hand the issue over to the DevOps team along with the collected data.
  3. If the node is in shutdown state:

    • Start the node:
      • Confirm that the standby.signal file exists in PostgreSQL's data directory (/data/db/psql/16/main) by using:
        ll /data/db/psql/16/main/standby.signal   
        
        Sample Output:
        -rw------- 1 postgres postgres 0 Oct  1 09:51 /data/db/psql/16/main/standby.signal
        
      • If the file doesn't exist, create it in the data directory mentioned above by using:
        touch /data/db/psql/16/main/standby.signal
        ll /data/db/psql/16/main/standby.signal
        
      • Change the ownership of the file /data/db/psql/16/main/standby.signal so that it is owned by the postgres user:
        chown -R postgres:postgres /data/db/psql/16/main/standby.signal
        ll /data/db/psql/16/main/standby.signal
        
        Sample Output:
        -rw------- 1 postgres postgres 0 Oct  1 09:51 /data/db/psql/16/main/standby.signal
        
      • Now start the service using systemctl start postgresql; if it still doesn't start, proceed from step 2.2 of this section.
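
The process check from step 2.1.1 can be scripted so the result is unambiguous. A minimal sketch, assuming the replica process names shown in the sample output above (checkpointer, background writer, walreceiver) and the standard data directory used in this document:

  # Verify that the replica is actually operational, not just "active" in systemd.
  for proc in checkpointer "background writer" walreceiver; do
    if ps aux | grep '[p]ostgres' | grep -q "$proc"; then
      echo "OK: $proc running"
    else
      echo "MISSING: $proc - the database may not be operational, escalate to DevOps"
    fi
  done
  # A replica must have standby.signal in its data directory; warn if it is absent.
  test -f /data/db/psql/16/main/standby.signal || echo "MISSING: standby.signal"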

DevOps Remedy

If the C3 team has followed the data collection steps and tried the available remedies and the postgresql service still did not start, the DevOps team should follow these instructions:

  1. If the node is Replica:

    1. If it is a storage issue:
      • Rely on the data collected by C3 (if available) and try clearing storage; prefer removing old log files after discussing with the team, and remove other files only as agreed.
    2. If this is any other issue:
      • Rely on the logs submitted by the C3 team, along with historical logs and monitoring data related to write operations, conflicts, deadlocks, and CPU & memory stats, and try to resolve the issue.
      • If the database is in an inconsistent state, try rollback procedures; if it is irrecoverable, restore from the latest backups.
  2. If the node is Master:

    • Announce downtime for the cluster in this case and proceed.
    1. If the node is in shutdown state:
      1. Instantiate new master:
        1. Check which node is suitable to be promoted to master by referring to metrics such as pg_replication_slot_slot_current_wal_lsn, and choose the node at the latest LSN.
        2. After choosing the new master, change the IP address in /etc/hosts for the hostname pg-master to the new master’s IP address.
        3. Log into the new master and stop the postgresql service using systemctl stop postgresql.
        4. Remove the standby.signal and postgresql.auto.conf files from the data directory using:
        rm /data/db/psql/16/main/standby.signal /data/db/psql/16/main/postgresql.auto.conf
        
        5. Start the postgresql service on this new master using:
        systemctl start postgresql
        systemctl status postgresql
        
        Sample output:
        ● postgresql.service - PostgreSQL RDBMS
        Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
        Active: active (exited) since Thu 2024-11-07 16:08:40 IST; 2 weeks 0 days ago
        Main PID: 3946191 (code=exited, status=0/SUCCESS)
           CPU: 3ms
        
        6. Check for processes on the started node using ps aux | grep postgres:
        postgres: psorbit-pg-cluster: checkpointer 
        postgres: psorbit-pg-cluster: background writer 
        postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
        postgres: psorbit-pg-cluster: walwriter 
        postgres: psorbit-pg-cluster: autovacuum launcher 
        postgres: psorbit-pg-cluster: logical replication launcher 
        postgres: psorbit-pg-cluster: walsender replicator 172.21.0.92(38070) streaming 0/73F706C8
        postgres: psorbit-pg-cluster: walsender replicator 172.21.0.90(40186) streaming 0/73F706C8
        

        You might not see the walsender (replicator) processes immediately after the server starts; wait for some time and check again until the walsender processes are active.

      2. Start the old master as a slave node:
        1. Start the old master as a replica as soon as possible by following the next steps.
        2. Start the node and make sure the postgresql service is stopped.
        3. Change the IP address for the hostname pg-master to the new master’s IP address.
        4. Create a standby.signal file in the data directory and change its permissions by using the following commands:
        touch /data/db/psql/16/main/standby.signal
        chown -R postgres:postgres /data/db/psql/16/main/standby.signal
        
        5. Start the postgresql service on this node and verify that the walreceiver process is active by running:
        ps aux | grep postgres
        
        output should resemble:
        postgres: /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
        postgres: psorbit-pg-cluster: checkpointer 
        postgres: psorbit-pg-cluster: background writer 
        postgres: psorbit-pg-cluster: startup recovering 000000010000000000000073
        postgres: psorbit-pg-cluster: archiver last was 000000010000000000000072
        postgres: psorbit-pg-cluster: walreceiver streaming 0/73ED26A8
        postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
        
        6. If you encounter any LSN issues, remove the data directory and use the following command to rebuild it from the new master (this step can also be used to deploy a new replica):
        pg_basebackup -h pg-master -U replicator -p 5432 -D /data/db/psql/16/main -P -Xs -R
        
        7. Change the ownership of the data directory to the postgres user:
        chown -R postgres:postgres /data/db/psql/16/main
        ll /data/db/psql/16/main
        
        Sample output:
        drwx------ 19 postgres postgres   4096 Nov  7 16:08 ./
        drwxr-xr-x  3 postgres postgres   4096 Sep 26 09:41 ../
        -rw-------  1 postgres postgres      3 Oct  1 09:51 PG_VERSION
        -rw-------  1 postgres postgres    227 Oct  1 09:51 backup_label.old
        -rw-------  1 postgres postgres 260079 Oct  1 09:51 backup_manifest
        drwx------  7 postgres postgres   4096 Oct  1 09:51 base/
        drwx------  2 postgres postgres   4096 Nov  7 16:08 global/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_commit_ts/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_dynshmem/
        drwx------  4 postgres postgres   4096 Nov 22 12:38 pg_logical/
        drwx------  4 postgres postgres   4096 Oct  1 09:51 pg_multixact/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_notify/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_replslot/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_serial/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_snapshots/
        drwx------  2 postgres postgres   4096 Nov  7 16:08 pg_stat/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_stat_tmp/
        drwx------  2 postgres postgres   4096 Nov 12 01:03 pg_subtrans/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_tblspc/
        drwx------  2 postgres postgres   4096 Oct  1 09:51 pg_twophase/
        drwx------  3 postgres postgres   4096 Nov 21 15:02 pg_wal/
        drwx------  2 postgres postgres   4096 Nov 12 00:58 pg_xact/
        -rw-------  1 postgres postgres    318 Oct  1 09:51 postgresql.auto.conf
        -rw-------  1 postgres postgres    124 Nov  7 16:08 postmaster.opts
        -rw-------  1 postgres postgres    107 Nov  7 16:08 postmaster.pid
        -rw-------  1 postgres postgres      0 Oct  1 09:51 standby.signal
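
For quick reference, the promotion sequence above can be summarised as follows. This is only a sketch of the numbered steps, assuming the paths, hostnames, and the /etc/hosts based pg-master mapping described in this document; run the numbered steps deliberately rather than pasting this blindly.

  ## On the node chosen as the new master (the one at the latest LSN):
  systemctl stop postgresql
  rm /data/db/psql/16/main/standby.signal /data/db/psql/16/main/postgresql.auto.conf
  systemctl start postgresql
  ps aux | grep '[p]ostgres'     # expect walsender / walwriter processes

  ## On every node, repoint the pg-master alias:
  # edit /etc/hosts so the pg-master entry carries the new master's IP address

  ## On the old master, once booted, bring it back as a replica:
  systemctl stop postgresql
  touch /data/db/psql/16/main/standby.signal
  chown postgres:postgres /data/db/psql/16/main/standby.signal
  systemctl start postgresql
  ps aux | grep '[p]ostgres'     # expect a "walreceiver streaming" process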
        

PostgreSQL No Active Replicas

Alertname: PostgresNoReplicas

The pg_replication_slot_slot_is_active metric indicates whether each replication slot is active, and the panel counts the active slots. If the count is less than 2, at least one replica node is either disconnected from the master, has stopped replicating, or is unavailable due to network, storage, or process issues.
The C3 team must collect and report the following data to the DevOps team if the issue cannot be resolved through C3’s remedies.

C3 Data Collection

  1. Instance Name, IP Address, and Alert Duration:
    Collect the instance name, IP address, and the total duration the alert has been in the firing state.

    1. If the node is in shutdown state and is a replica:
      1. Try starting the node, then start the postgresql service and verify network connectivity.
    2. If the node is in shutdown state and is the master:
      1. Instantiate a new master and bring the old master back as a replica by following the “Instantiate new master” and “Start the old master as a slave node” procedures in the DevOps Remedy of the PostgreSQL Instance State alert above.

  2. Check PostgreSQL Service on Master and Replica Nodes:
    Verify that the postgresql service is running on both master and replica nodes. Use:

    systemctl status postgresql
    
  3. Network Connectivity:
    Ensure all nodes (master and replicas) can reach each other on the network. Test using:

    ping pg-master
    ping <replica_ip>
    

    Sample output:

    PING pg-master (172.21.0.91) 56(84) bytes of data.
    64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=1 ttl=64 time=18.1 ms
    64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=2 ttl=64 time=0.228 ms
    64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=3 ttl=64 time=0.312 ms
    64 bytes from psorbit-node02 (172.21.0.91): icmp_seq=4 ttl=64 time=0.318 ms
    --- pg-master ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3039ms
    rtt min/avg/max/mdev = 0.228/4.727/18.052/7.692 ms
    
  4. Port Connectivity (5432):
    Check that the master and replicas are listening on the expected ports:

    netstat -tlnp | grep postgres
    telnet <replica_ip> 5432
    telnet pg-master 5432
    

    Sample output:

    telnet pg-master 5432
    Trying 172.21.0.91...
    Connected to pg-master.
    Escape character is '^]'.
    ^]
    Connection closed by foreign host.
    
  5. PostgreSQL Processes:
    Verify the presence of walsender on the master and walreceiver on replicas. Use:

    ps aux | grep postgres
    

    For a replica, the output must include the following process:

    postgres: psorbit-pg-cluster: walreceiver streaming 0/7514C2C8
    

    For master node:

    walsender replicator 172.21.0.90(40186) streaming 0/7514CB78
    
  6. Storage Space:
    Check the disk space availability on all nodes, focusing on the PostgreSQL data directories. Use:

    df -h
    

    Check whether free storage is running out for PostgreSQL on either / or /data.

  7. Replication Logs:
    Gather logs related to replication from both master and replicas:

    journalctl -xeu postgresql.service
    tail -100f /var/log/postgresql/postgresql-16-main.log
    

    Submit both logs to the DevOps team if the C3 remedy fails.

  8. Replication Slots:
    On the master, verify the status of the replication slots. Use:

    SELECT * FROM pg_replication_slots;
    

    The output should be similar to:

       slot_name | restart_lsn | wal_status 
       -----------+-------------+------------
       slot_3    | 0/75151770  | reserved
       slot_2    |             | 
       slot_1    | 0/75151770  | reserved
    

    Each slot corresponds to a replica node: slot_1 is for node-1, slot_3 is for node-3, and so on. At least one slot should be active for replication to be running on at least one node (a quick count of the active slots is sketched below).
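
The slot check can be reduced to a single number that matches the panel query. A minimal sketch, assuming local access as the postgres OS user on the master:

  # Count the active replication slots and compare with the expected value (2).
  ACTIVE=$(sudo -u postgres psql -Atc "SELECT count(*) FROM pg_replication_slots WHERE active;")
  echo "active replication slots: $ACTIVE"
  [ "$ACTIVE" -lt 2 ] && echo "WARNING: fewer than 2 active slots - at least one replica is not replicating"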

C3 Remedy

  1. Check the Node Status and Type (Master/Replica):
    Use the pg_replication_is_replica{instance="<target_instance>"} metric to identify the type of the node.

    1. If the master is down: announce downtime and hand the issue over to the DevOps team.
  2. Verify Replication Processes:

    • On the master, ensure walsender processes are active for each replica.
    • On replicas, ensure walreceiver processes are active.
      Use ps aux | grep postgres. Sample output:
    postgres 3854730  0.2  0.1 1238296 11908 ?       Ssl  Nov07  58:12 /opt/postgres_exporter/postgres_exporter-0.15.0.linux-amd64/postgres_exporter --web.listen-address=:9187 --web.telemetry-path=/metrics --collector.postmaster --collector.long_running_transactions --collector.database_wraparound --collector.process_idle --collector.stat_activity_autovacuum --collector.stat_statements --collector.stat_wal_receiver --collector.xlog_location
    root     3909146  0.0  0.0   4028  2136 pts/2    S+   13:45   0:00 grep --color=auto postgres
    postgres 3946117  0.0  0.1 217276 16072 ?        Ss   Nov07   2:20 /usr/lib/postgresql/16/bin/postgres -D /data/db/psql/16/main -c config_file=/etc/postgresql/16/main/postgresql.conf
    postgres 3946125  0.0  0.5 217404 42080 ?        Ss   Nov07   0:36 postgres: psorbit-pg-cluster: checkpointer 
    postgres 3946126  0.0  0.0 217276  5120 ?        Ss   Nov07   0:16 postgres: psorbit-pg-cluster: background writer 
    postgres 3946127  0.0  0.5 219088 43632 ?        Ss   Nov07   1:23 postgres: psorbit-pg-cluster: startup recovering 000000010000000000000076
    postgres 3946133  0.0  0.0 217380  4848 ?        Ss   Nov07   0:05 postgres: psorbit-pg-cluster: archiver last was 000000010000000000000075
    postgres 3946134  0.0  0.1 218364  9672 ?        Ss   Nov07   5:02 postgres: psorbit-pg-cluster: walreceiver streaming 0/7648AC80
    postgres 3946351  0.0  0.2 220736 17016 ?        Ss   Nov07  15:30 postgres: psorbit-pg-cluster: postgres_exporter postgres 172.21.0.90(55958) idle
    
  3. Restart PostgreSQL Services:
    Restart the PostgreSQL service on the affected replicas:

    systemctl restart postgresql
    
  4. Re-establish Replication:
    If replication has stopped, re-establish it (a consolidated sketch of these steps follows this list):

    • Remove the existing data directory on the replica.
    • Use pg_basebackup from the master to recreate the replica:
      pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
      
    • Adjust ownership of the directory:
      chown -R postgres:postgres /data/db/psql/16/main
      
  5. Check Replication Lag:
    Monitor the historical replication lag using metrics such as pg_replication_lag_seconds. If it is unusually high, troubleshoot network and performance issues.

  6. Address Storage Issues:

    • Verify available space on relevant partitions using df -h.
    • If disk usage is high, clean unnecessary files or logs under /var/log.
  7. Log Analysis:
    Check for replication-specific errors in logs on both master and replicas.
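
Step 4 above is destructive to the replica's data directory, so it is safer to keep the old directory until the new copy is confirmed to be streaming. A minimal sketch of the same steps with that extra precaution (the rename is an addition, not part of the official procedure, and assumes enough free space on /data):

  # Rebuild a replica from the master, preserving the old data directory until replication is confirmed.
  systemctl stop postgresql
  mv /data/db/psql/16/main /data/db/psql/16/main.old.$(date +%Y%m%d)    # assumption: enough free space
  pg_basebackup -h pg-master -U replicator -p 5432 -D /data/db/psql/16/main -P -Xs -R
  chown -R postgres:postgres /data/db/psql/16/main
  systemctl start postgresql
  ps aux | grep '[p]ostgres' | grep walreceiver    # confirm streaming before removing main.old.*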

Dependent Metrics

  • Replication Lag: Monitor pg_stat_replication on the master.
  • Node Health: Verify the node’s health using up{instance="<target_instance>"}.
  • Disk Space: Check node_filesystem_avail_bytes for sufficient storage on the nodes.
  • Memory: Check node_memory_MemFree_bytes to ensure adequate free memory on the nodes.

DevOps Remedy

If C3 remedies fail, escalate to the DevOps team with collected data and logs. The DevOps team can perform:

  1. Restore Replication:

    • Investigate replication lag, slot issues, or WAL archive problems.
    • Reinitialize the replica if necessary using pg_basebackup.
  2. Rebuild Replication Slots:
    If replication slots are corrupt or missing, recreate them on the master.

  3. Address Persistent Storage or Network Issues:

    • Clear space by archiving or deleting logs.
    • Troubleshoot network connectivity if port or host reachability issues are present.
  4. Failover Procedures:
    Promote a replica to master if the master is irrecoverable, and reinitialize the failed node as a replica.


PostgreSQL Replication Lag

Alertname: PostgresReplicationLag

The pg_replication_lag_seconds metric in PostgreSQL monitoring measures the time difference (in seconds) between the master and a replica. High replication lag can result in inconsistent data for applications relying on replicas. The C3 team must collect the following data and attempt remedies before escalating the issue to the DevOps team.

C3 Data Collection

  1. Instance name, IP address, and age of alert in the firing state: Record details of the instance with replication lag, including the duration of the alert.
  2. Service status: Confirm that PostgreSQL is running on both master and replicas.
    • Use systemctl status postgresql on both master and replicas.
  3. Replication lag metrics: Retrieve the current replication lag values.
    • Use the following SQL. To log in to the psql shell:
    sudo su postgres
    psql
    
    SELECT application_name, client_addr, state, sync_priority, sync_state, replay_lag FROM pg_stat_replication;
    
    Sample Output:
      application_name  | client_addr |   state   | sync_priority | sync_state |   replay_lag    
    --------------------+-------------+-----------+---------------+------------+-----------------
    psorbit-pg-cluster | 172.21.0.90 | streaming |             0 | async      | 00:00:00.002042
    psorbit-pg-cluster | 172.21.0.92 | streaming |             0 | async      | 00:00:00.002073
    (2 rows)
    
  4. Replication slot status: Check the status of replication slots on the master.
    • Use:
      SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
      
      Sample output:
        slot_name | active | restart_lsn 
       -----------+--------+-------------
       slot_3    | t      | 0/764908E8
       slot_2    | f      | 
       slot_1    | t      | 0/764908E8
       (3 rows)
      
      t indicates that the slot is active
  5. Replication logs: Collect logs from both the master and replica nodes for replication-related issues.
    • Use journalctl -xeu postgresql.service and tail -100f /var/log/postgresql/postgresql-16-main.log.
  6. Network connectivity: Verify connectivity between the master and the lagging replica.
    • Use ping <replica_IP> and telnet <replica_IP> 5432 to check connectivity and port status.
  7. Disk I/O and space: Examine the disk space and I/O performance on both master and replica nodes.
    • Use df -h for disk space and iostat for disk I/O stats.
  8. WAL transfer status: Check for WAL file transfer delays between the master and replicas.
    • On the master, check the WAL archive directory for an excessive number of files:
      ls -lh /data/db/psql/archive
      
    • On the replica, check if WAL files are being replayed.
  9. Load and CPU utilization: Check system resource usage on the master and replica servers.
    • Use top or htop for real-time resource monitoring.
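
For a quick on-node confirmation of the lag before escalating, the replica itself can report how far behind it is in replaying WAL. A minimal sketch using standard PostgreSQL functions, run on the lagging replica as the postgres user:

  # Replay delay as seen by the replica (a NULL delay means nothing is currently pending replay).
  sudo -u postgres psql -Atc \
    "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
            pg_last_wal_receive_lsn() AS received_lsn,
            pg_last_wal_replay_lsn()  AS replayed_lsn;"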

Dependent Metrics

When pg_replication_lag_seconds shows high values, investigate the following:

  • WAL replay status: Verify if WAL files are being replayed correctly using pg_stat_replication.
  • Replica up status: Ensure the replica is online using the up metric or pg_is_in_recovery().
  • Replication throughput: Examine the network throughput for replication between master and replica nodes.

C3 Remedy

  1. Verify replica status:

    • Check if the replica is connected to the master and replaying WAL files:
      SELECT client_addr, state, sync_state FROM pg_stat_replication;
      
  2. Restart lagging replica:

    • Restart the PostgreSQL service on the lagging replica:
      systemctl restart postgresql
      
  3. Verify replication configuration:

    • Ensure the postgresql.conf and pg_hba.conf files are correctly configured for replication (check for syntax errors only; hand over to DevOps for any configuration changes).
  4. Network-related issues:

    • Check for bandwidth issues using a speed-test tool or by inspecting the monitoring data.
  5. Resolve WAL file issues:

    • Ensure the master is retaining sufficient WAL files for the replica to catch up:
      wal_keep_size = '1GB'
      
  6. Disk performance:

    • Address any disk I/O issues by clearing temporary files or upgrading the storage hardware.
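
Before escalating, record the current WAL-retention settings so the DevOps team can judge whether step 5 applies; the C3 team only reads these values, configuration changes stay with DevOps. A read-only sketch:

  # Read-only check of replication and WAL-retention settings on the master.
  sudo -u postgres psql -c "SHOW wal_keep_size;"
  sudo -u postgres psql -c "SHOW max_wal_senders;"
  sudo -u postgres psql -c "SHOW max_replication_slots;"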

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:

  1. Increase replication slots and resources:

    • Add more slots or memory if replication is slowed due to resource constraints.
  2. Rebuild the replica:

    • Use pg_basebackup from the master to recreate the replica on the lagging node:
    pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
    
    • Adjust ownership of the directory:
      chown -R postgres:postgres /data/db/psql/16/main
      
  3. Replica failover:

    • If the master is overloaded, consider a failover to a healthy replica.
  4. Review replication strategy:

    • Analyze and modify the replication setup, including synchronous/asynchronous replication modes.

PostgreSQL Uptime

Alertname: PostgresqlUptime

The pg_postmaster_start_time_seconds metric in PostgreSQL monitoring identifies the uptime of a PostgreSQL node. This alert triggers when the uptime is 2 minutes or less, indicating a recent restart. This could signify planned maintenance or an unexpected failure. The C3 team must collect the following data and perform initial troubleshooting before escalating the issue to the DevOps team.

C3 Data Collection

  1. Instance name, IP address, and age of alert in the firing state: Record the instance details, IP address, and the time the alert has been active.
  2. Service status: Confirm PostgreSQL service status on the affected node.
    • Use:
      systemctl status postgresql
      
  3. Uptime details: Verify the actual uptime of the node.
    • Use:
      SELECT date_trunc('seconds', now() - pg_postmaster_start_time()) AS uptime;
      
      Sample output:
            uptime      
       ------------------
       20 days 13:22:27 
      
  4. Critical process check: Confirm key processes like WAL sender and WAL receiver are operational:
    • Use:
      ps aux | grep 'walsender\|walreceiver'
      
      Sample output:
      postgres    2639  0.0  0.1 219532 11036 ?        Ss   Nov02   4:12 postgres: psorbit-pg-cluster: walsender replicator 172.21.0.92(38070) streaming 0/76490BC0
      postgres 3978687  0.0  0.1 219532 12136 ?        Ss   Nov07   3:28 postgres: psorbit-pg-cluster: walsender replicator 172.21.0.90(40186) streaming 0/76490BC0
      
  5. Replication slot status: Check the status of replication slots to ensure data consistency:
    • Use:
      SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
      
      Sample output:
        slot_name | active | restart_lsn 
       -----------+--------+-------------
       slot_3    | t      | 0/76491200
       slot_2    | f      | 
       slot_1    | t      | 0/76491200
       (3 rows)
      
  6. Database connectivity: Validate database connections:
    This check works for the database given below; for newer databases, or databases not listed in pg_hba.conf, contact DevOps for credentials.
    psql -h 172.21.0.91 -U haproxy -d haproxy_checks -c 'SELECT 1;'
    
    Sample output:
      ?column? 
     ----------
           1
     (1 row)
    
  7. Current LSN status: Verify the current WAL and replication status across nodes.
    • Use:
      SELECT pg_current_wal_lsn(), pg_is_in_recovery();
      
      Sample output (on the master node):
        pg_current_wal_lsn | pg_is_in_recovery 
       --------------------+-------------------
       0/764AA308         | f
       (1 row)
      
      Sample output (on a replica node):
        pg_current_wal_lsn | pg_is_in_recovery 
       --------------------+-------------------
       0/764AA308         | t
       (1 row)
      
  8. Replication lag status: Confirm no abnormal lag exists between master and replicas.
    • Use:
      SELECT application_name, state, sync_state, replay_lag FROM pg_stat_replication;
      
      Sample Output(On master):
         application_name  |   state   | sync_state | replay_lag 
       --------------------+-----------+------------+------------
       psorbit-pg-cluster | streaming | async      | 
       psorbit-pg-cluster | streaming | async      | 
       (2 rows)
      
  9. Logs: Inspect logs for errors or indications of the restart:
    • Use:
      journalctl -xeu postgresql.service  
      tail -100 /var/log/postgresql/postgresql-16-main.log
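
systemd keeps its own record of when the unit last started, which helps distinguish a service-only restart from a full host reboot. A small sketch:

  # When did systemd last start the service, and when did the host itself boot?
  systemctl show postgresql -p ActiveEnterTimestamp
  uptime -s                          # host boot time
  last -x shutdown reboot | head -5  # recent reboot/shutdown events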
      

Dependent Metrics

When the uptime metric indicates a recent restart, monitor:

  • Replication processes: WAL sender and WAL receiver should be running.
  • Replication slots: Ensure they are active and functioning correctly.
  • Network connectivity: Validate seamless communication between nodes.
  • Resource utilization: Check CPU, memory, and disk usage to rule out resource-related restarts.

C3 Remedy

  1. Confirm intentional restarts:

    • Verify with the maintenance schedule or related team if the restart was planned.
  2. Restart PostgreSQL service:

    • If services are unresponsive, restart the affected PostgreSQL instance:
      systemctl restart postgresql
      
  3. Validate replication:

    • Ensure replication processes (WAL sender/receiver) are active and in sync.
  4. Verify logs for errors:

    • Investigate logs to identify causes such as disk issues, configuration errors, or system-level failures.
  5. Check disk and resource status:

    • Clear up space if disk usage is high and ensure sufficient memory is available.

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:

  1. Rebuild replication slots:

    • Recreate slots if necessary to maintain replication consistency.
  2. Reconfigure PostgreSQL settings:

    • Optimize configurations, such as increasing WAL retention or tuning connection limits.
  3. Review system-level issues:

    • Check for hardware failures or operating system misconfigurations.
  4. Cluster validation:

    • Ensure the cluster setup aligns with intended failover and replication requirements.

PostgreSQL Node Role Changed

Alertname: PostgresNodeRoleChanged

The pg_replication_is_replica metric detects changes in the node role within a PostgreSQL cluster. This alert triggers when a master node transitions into a replica or a replica behaves inconsistently. If a node’s role changes unexpectedly, it may indicate an unintended failover or a misconfiguration. The C3 team must verify the current cluster state and perform immediate actions to stabilize the system.

C3 Data Collection

  1. Instance details:

    • Record the instance name, IP address, and role (master or replica).
    • Identify the node where the alert is firing.
  2. Cluster status:

    • Retrieve the current cluster role of all nodes:
      SELECT pg_is_in_recovery() AS is_replica, inet_server_addr() AS node_ip;
      
      Sample output:
       is_replica | node_ip 
       ------------+---------
       f          | 
       (1 row)
      
  3. Replication status:

    • For replicas:
      SELECT client_addr, state, sync_state FROM pg_stat_replication;
      
    • Check for any inconsistencies in synchronization and replication state.
      Sample output:
     client_addr |   state   | sync_state 
    -------------+-----------+------------
    172.21.0.90 | streaming | async
    172.21.0.92 | streaming | async
    (2 rows)
    
  4. Replication slot status:

    • Validate the slots on the new master to ensure active replication:
      SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
      

    Sample output:

     slot_name | active | restart_lsn 
    -----------+--------+-------------
    slot_3    | t      | 0/764AAD48
    slot_2    | f      | 
    slot_1    | t      | 0/764AAD48
    (3 rows)
    
  5. Logs:

    • Gather logs from the affected node:
      journalctl -xeu postgresql.service  
      tail -100 /var/log/postgresql/postgresql-16-main.log
      
  6. Hostname-IP consistency:

    • Confirm that the pg-master hostname resolves to the correct new master IP address:
      nslookup pg-master  
      ping pg-master
      
    • Update DNS records if there is a mismatch.
      Sample output:
    Server:         127.0.0.53
    Address:        127.0.0.53#53
    
    Name:   pg-master
    Address: 172.21.0.91
    
  7. Node downtime verification:

    • Check if the previous master experienced downtime or failed during the transition.
  8. Database connections:

    • Verify database connectivity from applications to the new master.
      psql -h 172.21.0.91 -U haproxy -d haproxy_checks -c 'SELECT 1;'
      

Dependent Metrics

When pg_replication_is_replica detects role changes, also monitor:

  • Replication lag (pg_replication_lag_seconds): Confirm replicas are catching up.
  • WAL replay status (pg_stat_replication): Ensure WAL files are replayed correctly.
  • System resource usage: Check CPU, memory, and disk usage on all nodes.

C3 Remedy

  1. Confirm intentional role change:

    • Verify with the DevOps team if the role change was planned.
  2. Update hostname:

    • If a new master is detected, update the pg-master hostname to point to the correct master IP address (a sketch for checking and updating the /etc/hosts mapping follows this list).
  3. Restart replication:

    • On the new replicas, restart replication processes:
      systemctl restart postgresql
      
  4. Announce downtime:

    • If the transition was unintended, inform the team and application stakeholders about downtime.
  5. Verify system connectivity:

    • Ensure all application and replica connections are re-established with the new master.
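
Step 2 of this remedy depends on the pg-master alias being repointed in /etc/hosts on every node. A minimal sketch for checking and, with care, updating the mapping; the sed line is illustrative only and assumes pg-master appears on exactly one line of /etc/hosts:

  # Where does pg-master currently resolve on this node?
  getent hosts pg-master

  # Illustrative edit (assumption: a single pg-master line in /etc/hosts); NEW_IP is the new master's address.
  NEW_IP=172.21.0.91
  sudo sed -i.bak "/pg-master/s/.*/${NEW_IP} pg-master/" /etc/hosts
  getent hosts pg-master    # verify the change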

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:

  1. Analyze cluster stability:

    • Ensure all nodes in the cluster are in their expected roles and replication is functioning correctly.
  2. If the role change was unintentional, check for related issues and instantiate a new master using the procedures described earlier.

  3. Reconfigure replicas:

    • Rebuild replicas if necessary using:
      pg_basebackup -h pg-master -U replicator -D /data/db/psql/16/main -P -Xs -R
      

PostgreSQL WAL Archives Failed

Alertname: PostgresWALArchivesFailed

The pg_stat_archiver_failed_count metric tracks the number of failed attempts to archive WAL (Write-Ahead Logging) files. A failure in WAL archiving typically arises due to insufficient storage, directory permission issues, or incorrect configurations. These failures can lead to WAL file accumulation, affecting the performance and durability of the database. The C3 team must collect data and address issues promptly to avoid disruptions.

C3 Data Collection

  1. Instance details:

    • Record the instance name, IP address, and age of alert in the firing state.
  2. Failed count and logs:

    • Check the total number of WAL archive failures:
      SELECT failed_count, last_failed_wal FROM pg_stat_archiver;
      
       failed_count | last_failed_wal 
       --------------+-----------------
                   0 | 
       (1 row)
      
    • Collect PostgreSQL logs for specific errors:
      tail -100 /var/log/postgresql/postgresql-16-main.log | grep "archive"
      
  3. Directory status:

    • Confirm the archive directory exists and has the correct permissions:
      ls -ld /data/db/psql/archive  
      
      Sample output:
      drwx------ 2 postgres postgres 4096 Nov 21 14:57 /data/db/psql/archive
      
      • Ensure the directory is owned by the postgres user.
  4. Disk space availability:

    • Verify available disk space for the archive directory:
      df -h /data/db/psql/archive
      
      Sample output:
       Filesystem      Size  Used Avail Use% Mounted on
       /dev/vda4        59G  6.8G   50G  13% /data
      
  5. Network and storage connectivity (only if the archive directory is on remote storage, which is not the case as of now):

    • Confirm connectivity and mount status:
      ping <storage_server_ip>  
      mount | grep /data/db/psql/archive
      
  6. Archiving process status:

    • Check the PostgreSQL process responsible for archiving WAL files:
      ps aux | grep "postgres: archiver"
      
  7. WAL directory size:

    • Inspect the archive directory for unarchived files that may accumulate due to failures:
      ls -lh /data/db/psql/archive  
      du -sh /data/db/psql/archive  
      
  8. Configuration validation:

    • Confirm that archive_command in postgresql.conf is configured correctly:
      grep "archive_command" /etc/postgresql/16/main/postgresql.conf
      
    • Test the command manually:
      cp <test_wal_file> /data/db/psql/archive
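
Since archiving failures are very often permission problems, the manual test in step 8 is best run as the postgres user. A minimal sketch (the test file name is arbitrary):

  # Can the postgres user actually write to the archive directory?
  sudo -u postgres touch /data/db/psql/archive/.c3_write_test \
    && echo "archive directory is writable" \
    || echo "write FAILED - check ownership, permissions and disk space"
  sudo -u postgres rm -f /data/db/psql/archive/.c3_write_test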
      

Dependent Metrics

When pg_stat_archiver_failed_count increases, check:

  • WAL accumulation in pg_wal:
    • Monitor file count and size to ensure the directory doesn’t reach its capacity.
  • Disk I/O usage:
    • Verify disk performance to ensure smooth archiving.
  • System load:
    • Monitor CPU and memory utilization for potential bottlenecks.

C3 Remedy

  1. Fix directory permissions:

    • Ensure the archive directory is writable by PostgreSQL:
      chown postgres:postgres /data/db/psql/archive  
      chmod 700 /data/db/psql/archive  
      
  2. Free up storage:

    • Remove old or unnecessary files in the archive directory:
      find /data/db/psql/archive -type f -mtime +30 -delete  
      
  3. Test and restart archiving:

    • Manually test the archive_command to ensure it works:
      cp <sample_file> /data/db/psql/archive  
      
    • Restart PostgreSQL if needed:
      systemctl restart postgresql  
      
  4. Monitor WAL directory:

    • Check and remove unnecessary WAL files in /data/pg_wal to prevent storage overflow.

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:

  1. Archive directory reconfiguration:

    • If local storage is insufficient, consider moving the archive directory to a larger or networked storage. Update postgresql.conf:
      archive_command = 'cp %p /data/db/psql/archive/%f'
      
  2. Automate cleanup:

    • Schedule regular cleanups to avoid storage issues; discuss with the team before putting this process in place:
      crontab -e  
      0 3 * * * find /data/db/psql/archive -type f -mtime +30 -delete
      
  3. Increase disk space:

    • Allocate additional storage or upgrade disk space for /data/db/psql/archive.
    • Consider allocating a new disk for the data directory; discuss the options with the team.
  4. WAL retention:

    • Tune parameters like wal_keep_size to manage WAL retention efficiently:
      wal_keep_size = '1GB'
      

PostgreSQL Deadlocks Detected

Alertname: PostgresDeadlocksDetected

The pg_stat_database_deadlocks metric captures deadlock events in PostgreSQL databases. Deadlocks occur when two or more transactions block each other, preventing any from proceeding. These conflicts can degrade database performance and impact application availability. The C3 team must investigate and address the root cause of deadlocks promptly before escalating to DevOps.

C3 Data Collection

  1. Instance details:

    • Record the database instance name, IP address, and age of alert in the firing state.
  2. Deadlock metrics:

    • Query the number of deadlocks per database:
      SELECT datname, deadlocks FROM pg_stat_database WHERE deadlocks > 0;
      

    Sample output:

     datname | deadlocks 
    ---------+-----------
    (0 rows)
    
  3. Deadlock logs:

    • Extract logs of recent deadlock events:
      grep "deadlock detected" /var/log/postgresql/postgresql-16-main.log
      
    • Note the time of the event, affected transactions, and involved tables.
  4. Query inspection:

    • Identify the queries causing the deadlock by checking pg_stat_activity:
      SELECT pid, datname, usename, query, state, wait_event 
      FROM pg_stat_activity 
      WHERE wait_event LIKE 'deadlock';
      

    Sample output:

     pid | datname | usename | query | state | wait_event 
    -----+---------+---------+-------+-------+------------
    (0 rows)
    

    You will see details of the event when there is a deadlock.

  5. Blocked queries:

    • Investigate queries waiting for locks:
      SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, 
             blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query
      FROM pg_locks blocked_locks
      JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
      JOIN pg_locks blocking_locks
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
      JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
      WHERE NOT blocked_locks.granted;
      
      Sample output:
        blocked_pid | blocking_pid | blocked_query | blocking_query 
       -------------+--------------+---------------+----------------
       (0 rows)
      
  6. Database performance:

    • Monitor CPU, memory, and disk I/O utilization for potential bottlenecks during deadlock events:
      top -b -n1 | grep postgres  
      iostat -xm 1 5  
      
      Sample output:
       3854730 postgres  20   0 1238296  12544   5188 S   0.0   0.2  58:31.86 postgres_export
       3946117 postgres  20   0  217276  16136  15692 S   0.0   0.2   2:21.53 postgres
       3946125 postgres  20   0  217404  42080  41584 S   0.0   0.5   0:36.81 postgres
       3946126 postgres  20   0  217276   5120   4768 S   0.0   0.1   0:16.76 postgres
       3946127 postgres  20   0  219088  43632  42984 S   0.0   0.5   1:23.34 postgres
       3946133 postgres  20   0  217380   4848   4448 S   0.0   0.1   0:05.72 postgres
       3946134 postgres  20   0  218364   9672   8952 S   0.0   0.1   5:03.13 postgres
       3946351 postgres  20   0  220736  17016  15172 S   0.0   0.2  15:35.74 postgres
       Linux 5.15.0-124-generic (psorbit-node01)       11/22/24        _x86_64_        (4 CPU)
      
       avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10.17    0.05    8.22    0.41    0.02   81.13
      
       Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
       loop0            0.00      0.00     0.00   0.00    0.70     1.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
       sr0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
       vda              4.98      0.25     0.95  16.07    2.18    50.98   36.22      0.33    13.33  26.90    4.05     9.34    0.03      0.08     0.00   0.00    0.35  2921.77    8.06    0.99    0.17   3.11
       vdb              0.48      0.01     0.00   0.00   29.50    17.96    0.54      0.01     0.11  17.44   22.41    19.11    0.00      0.00     0.00   0.00    2.98  6700.00    0.07    2.88    0.03   1.59
      
  7. Locks on objects:

    • List locks held by sessions during the alert:
      SELECT locktype, relation::regclass, mode, granted, transactionid, virtualtransaction 
      FROM pg_locks 
      WHERE NOT granted;
      
      Sample output:
        locktype | relation | mode | granted | transactionid | virtualtransaction 
       ----------+----------+------+---------+---------------+--------------------
       (0 rows)
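
Two server settings influence how deadlocks are detected and whether lock waits show up in the logs; recording them alongside the data above helps DevOps interpret the log entries. A read-only sketch:

  # Read-only: how long PostgreSQL waits before checking for a deadlock, and whether lock waits are logged.
  sudo -u postgres psql -c "SHOW deadlock_timeout;"
  sudo -u postgres psql -c "SHOW log_lock_waits;"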
      

Dependent Metrics

When deadlocks are detected, monitor:

  • Query execution time: Identify slow or long-running queries.
  • Active connections: Ensure that connection pooling is effectively managing session limits.
  • Deadlock-prone workloads: Look for patterns in conflict-prone operations, such as updates on shared tables.

C3 Remedy

  1. Report conflicting queries:

    • Identify the deadlock-causing queries from the data collected above and report them to the DevOps team. Terminate a backend only after agreement with DevOps (see DevOps Remedy), using:
      SELECT pg_terminate_backend(<blocked_pid>);
      
  2. Analyze transaction scope:

    • Report long-running transactions.

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to DevOps with the following actions:

  1. Lock priority tuning:

    • Adjust PostgreSQL configuration for lock management. Example:
      max_locks_per_transaction = 128  
      
  2. Reorganize table structures:

    • Consider partitioning high-traffic tables to reduce lock contention.
  3. Review application logic:

    • Work with developers to ensure the application does not introduce conflicting transaction patterns.
  4. Terminate conflicting queries with discussion:

    • Identify and terminate deadlock-causing queries:
      SELECT pg_terminate_backend(<blocked_pid>);
      

PostgreSQL Long Transactions

Alertname: PostgresLongTransactions

The pg_stat_activity_max_tx_duration metric identifies long-running transactions in PostgreSQL that have been executing for 9 seconds or more. Long transactions can lead to performance issues, such as locks, blocking other transactions, or consuming excessive resources. The C3 team should collect transaction details and report them, and the DevOps team may consider terminating transactions after consulting with developers to ensure system stability.

C3 Data Collection

  1. Instance details:

    • Record the instance name, IP address, and the age of the alert in the firing state.
  2. Identify long transactions:

    • Query for active long-running transactions that have been running for 9 seconds or more:
      SELECT pid, datname, usename, query, state, backend_start, xact_start, query_start 
      FROM pg_stat_activity 
      WHERE now() - query_start > interval '9 seconds' 
        AND state = 'active';
      
      Sample output:
         pid   | datname |  usename   |                         query                          | state  |          backend_start           | xact_start |           query_start            
      ---------+---------+------------+--------------------------------------------------------+--------+----------------------------------+------------+----------------------------------
          2639 |         | replicator | START_REPLICATION SLOT "slot_3" 0/4C000000 TIMELINE 1  | active | 2024-11-02 00:36:37.326071+05:30 |            | 2024-11-02 00:36:37.396455+05:30
       3978687 |         | replicator | START_REPLICATION SLOT "slot_1" 0/4E000000 TIMELINE 1  | active | 2024-11-07 16:08:38.691104+05:30 |            | 2024-11-07 16:08:38.744189+05:30
      (2 rows)

  3. Transaction details:

    • Retrieve the details of the long-running transactions, including transaction start time and the related queries:
      SELECT pid, usename, query, query_start, state, xact_start 
      FROM pg_stat_activity 
      WHERE pid IN (<list of long transaction PIDs>);
      
  4. Locks held by transactions:

    • Check if any long-running transactions are holding locks that could be blocking other queries:
      SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, 
             blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query
      FROM pg_locks blocked_locks
      JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
      JOIN pg_locks blocking_locks
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
      JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
      WHERE NOT blocked_locks.granted;
      
      Sample output:
        blocked_pid | blocking_pid | blocked_query | blocking_query 
       -------------+--------------+---------------+----------------
       (0 rows)
      
  5. Transaction impact on system resources:

    • Monitor system resource usage to determine if long transactions are consuming excessive CPU or memory:
      top -b -n1 | grep postgres
      
      • Use iostat -xm 1 5 to identify any disk I/O performance issues.

Dependent Metrics

  • Locks held by transactions: Monitor whether long transactions are causing blocking or deadlocks.
  • Database load and performance: Keep track of the overall performance impact, including CPU, memory, and disk usage.

C3 Remedy

  1. Report transaction details:

    • Report the following transaction information to the DevOps team (a query that produces this report is sketched after this list):
      • Transaction start time
      • Query being executed
      • Duration of transaction
      • Locks being held
  2. Investigate transaction impact:

    • Examine the impact of long-running transactions on system performance and blocking other operations.
  3. Contact developers:

    • Inform developers about the long transactions to determine if the query is critical or if it can be optimized.
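
The report described in step 1 can be produced with a single query that lists the oldest open transactions together with their age; a minimal sketch, run as the postgres user (the 9-second threshold matches the alert):

  # Oldest open transactions, oldest first, with how long they have been running.
  sudo -u postgres psql -c "
    SELECT pid, usename, state,
           now() - xact_start  AS xact_age,
           now() - query_start AS query_age,
           left(query, 60)     AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '9 seconds'
    ORDER BY xact_start;"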

DevOps Remedy

If the C3 team cannot resolve the issue, escalate to the DevOps team with the following actions:

  1. Terminate long transactions (after developer discussion):

    • If it is determined that the long transactions are not critical or can be safely terminated, the DevOps team may decide to kill the offending transactions:
      SELECT pg_terminate_backend(<pid>);
      
  2. Review query optimization:

    • Developers should work on optimizing the queries causing the long transactions. Common optimizations include indexing, query refactoring, or partitioning large tables.
  3. Evaluate transaction management:

    • Review the transaction management processes and ensure that long transactions are broken into smaller units or handled in batches to avoid locking issues.