MySQL Monitoring

MySQL is a popular open-source relational database management system (RDBMS) known for its scalability, reliability, and use in a wide range of web-based and enterprise applications.

Percona XtraDB Cluster is a high-availability solution based on MySQL that provides synchronous replication, multi-master support, and data consistency using Galera Cluster technology. It ensures no single point of failure and is ideal for environments requiring fault tolerance and horizontal scalability.
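A quick way to confirm whether a given node is running Percona XtraDB Cluster / Galera (rather than standalone MySQL) and whether it has joined a cluster is to query the wsrep status variables from the mysql client; on a standalone MySQL server these return empty results. This is only a minimal sketch:

    -- Galera/PXC only: shows how many nodes are currently in the cluster
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
    -- Reports the Galera provider in use (e.g. "Galera"); empty on standalone MySQL
    SHOW GLOBAL STATUS LIKE 'wsrep_provider_name';
    -- Confirms write-set replication is enabled on this node
    SHOW GLOBAL VARIABLES LIKE 'wsrep_on';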

Difference Between MySQL and Percona XtraDB Cluster

| Feature | MySQL | Percona XtraDB Cluster |
|---|---|---|
| Replication | Asynchronous (default). | Synchronous, using Galera technology. |
| High Availability (HA) | Requires additional tools for HA. | Built-in HA with no single point of failure. |
| Write Scalability | Single primary server for writes. | Multi-master, allowing writes on all nodes. |
| Data Consistency | Eventual consistency. | Strong consistency across all nodes. |
  • Grafana Dashboards for MySQL Link
  • Grafana alerts for MySQL Link
  • Percona XtraDB Cluster Link
  • Improve Query Performance Link
  • Slow Queries Link
| Dashboard & Row | Alert Name | Panel | Panel Description | Query | Query Description | Query Operating Range | Metric | Metric Description | Metric Operating Range | Severity: Critical | Severity: Warning | Severity: OK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.1.1 | Cluster Size | | Represents the total number of nodes in the cluster. | mysql_global_status_wsrep_cluster_size{instance=~"pssb.*"} | Number of nodes currently active in the cluster | 0-5 | mysql_global_status_wsrep_cluster_size | Number of nodes currently active in the cluster | 0-5 | 0 | < 5 | 5 |
| 2.1.2 | Cluster Status (In sync / out of sync) | | Status of the MySQL node | mysql_global_status_wsrep_local_state{instance=~"pssb.*"} | State of the writeset replication | 0-4 | mysql_global_status_wsrep_local_state | State of the MySQL node, as reported by SHOW GLOBAL STATUS | 0-4 | != 4 | - | 4 |
| 2.4.1 | Max Thread Connections | MySQL Connections | Monitors MySQL connection statistics over a specified interval. | mysql_global_status_max_used_connections{instance=~"pssb.*"} | Highest number of connections used since the MySQL server started | Positive values | mysql_global_status_max_used_connections | Highest number of connections used since the MySQL server started | Positive values | > 100 | > 50 | < 15 |
| 2.4.1 | HighThreadsRunning | MySQL Connections | Number of MySQL threads currently executing (i.e., using I/O) | mysql_global_status_threads_running{instance=~"pssb.*"} | MySQL threads that are executing and not sleeping | Positive values | mysql_global_status_threads_running | MySQL connections that are not idle, i.e., in the running state | Positive values | > 30 | 20 < threads < 30 | < 20 |
| | Slow Queries | Slow Queries | Monitors the rate of slow queries | rate(mysql_global_status_slow_queries{instance=~"pssb.*"}[5m]) | Rate of slow queries per second | Positive values | mysql_global_status_slow_queries | Tracks the number of slow queries logged by MySQL | | > 5 | > 0 | 0 |
| 2.1.4 | MySQL Uptime | MySQL Uptime | Shows how long the server has been running | mysql_global_status_uptime{instance=~"pssb.*"} | Uptime in seconds | Positive values | mysql_global_status_uptime | Uptime in seconds | Positive values | < 60 | - | > 60 |
| 2.1.3 | Buffer Pool Usage | Buffer pool usage | Memory used by the buffer pool | (mysql_global_variables_innodb_buffer_pool_size{instance=~"pssb.*"} * 100) / on (instance) node_memory_MemTotal_bytes{instance=~"pssb.*"} | Buffer pool usage as a percentage of total memory | Positive percentage | mysql_global_variables_innodb_buffer_pool_size | Buffer pool size in bytes | Positive values | > 60 | 50 < size < 60 | < 50 |
| 2.2.1 | High Reads | Database Reads | MySQL reads at an instantaneous point in time | mysql_global_status_innodb_data_reads{instance=~"pssb.*"} | MySQL database data reads | Positive values | mysql_global_status_innodb_data_reads | MySQL database data reads | Positive values | - | > 15000 | < 15000 |
| 2.2.2 | High Writes | Database Writes | MySQL writes at an instantaneous point in time | rate(mysql_global_status_innodb_data_writes{instance=~"pssb.*"}[5m]) | Rate of writes per second over the last 5 minutes | Positive values | mysql_global_status_innodb_data_writes | Total number of write operations by MySQL | Positive values | > 50 | 50 > writes > 15 | < 15 |
| | InnoDB Log Waits | InnoDB Log Waits | InnoDB log writes are stalling | increase(mysql_global_status_innodb_log_waits{instance=~"pssb.*"}[5m]) | Triggers the alert if any waits occur | | mysql_global_status_innodb_log_waits | Number of times a transaction had to wait for the log buffer | Positive values | > 0 | - | 0 |
| 2.2.7 | Replication Traffic Received | Average Incoming Transaction Size | Calculates the average size of replicated transactions (in bytes) | rate(mysql_global_status_wsrep_received_bytes{instance=~"pssb.*"}[5m]) | Rate of replication traffic received, in bytes | Positive values | mysql_global_status_wsrep_received_bytes | Total number of bytes received via replication | | > 1 MB (1048576) | > 5 KB (5120) | - |
| 2.2.7 | Replication Traffic Transmitted | Average Replicated Transaction Size | Calculates the average size of replicated transactions | rate(mysql_global_status_wsrep_replicated_bytes{instance=~"pssb.*"}[5m]) | Rate of data being replicated, in bytes per second | | mysql_global_status_wsrep_replicated_bytes | Total number of bytes replicated (sent) to other nodes | | > 1 MB (1048576) | > 5 KB (5120) | - |
| 2.2.13 | Replication Latency | Average Replication Latency | Time delay in synchronizing data changes across all nodes in the Galera cluster | mysql_galera_evs_repl_latency_avg_seconds{instance=~"pssb.*"} | Latency of the Galera replication process for the 5-node MySQL cluster | Positive decimal values (seconds) | mysql_galera_evs_repl_latency_avg_seconds | Average Galera replication latency in seconds | Positive decimal values (seconds) | > 1 | 1 > latency > 0.5 | < 0.5 |

Alerts and C3 Procedures

When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.

  1. Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.

  2. Severity-Based Actions:

    • Low-Priority Alerts:
      • If the priority level is low, and the C3 team can address it, they should follow the “C3 Remedy” steps after reviewing “Dependent Metrics and Checks.”
    • Escalation to DevOps:
      • If the C3 team cannot resolve the issue, they should escalate it to the DevOps team.
  3. Severity-Specific Notifications:

    • Warning Alerts:
      • For alerts with a “Warning” severity level, the C3 team can notify DevOps in the current or next work shift.
    • Critical Alerts:
      • For “Critical” severity alerts, the C3 team must notify the DevOps team immediately, regardless of work shift status.

Preliminary Steps

Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.

This process ensures effective response and resolution for all alerts based on severity and priority.

MySQL Cluster Size

A change in the MySQL cluster size count means that nodes (or members) have either joined or left the MySQL cluster. These changes can significantly impact the cluster’s stability, performance, and resource distribution.

The alert is triggered when the MySQL cluster size is not equal to the expected size (e.g., 5 nodes in a 5-node cluster).

C3 Data Collection

To analyze changes in the MySQL cluster size:

  1. Monitor the Current Cluster Size

    • Access the MySQL Cluster Size Grafana dashboard panel.
    • Identify which node is not part of the cluster.
  2. Verify Node Visibility Across the Cluster

    • Ensure all 5 nodes in the cluster can detect the presence of all other nodes.
    • Check for potential network issues (firewall, connectivity) between nodes.
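One way to check node visibility from MySQL itself (a minimal sketch; run it on each node and compare the results) is to look at the membership-related wsrep status variables:

    -- Every healthy member should report the same peer list and the same cluster size
    SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses';
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';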

Dependent Metrics

  1. mysql_global_status_wsrep_cluster_size
    • Represents the number of nodes currently active in the cluster.

C3 Remedy

The C3 team should focus on basic troubleshooting and operational checks for MySQL cluster node count issues. Follow these steps:

  1. Check the Current Status of the MySQL Cluster Nodes:

    SHOW STATUS LIKE 'wsrep_cluster_size';
    

    Expected Output:

    +--------------------+-------+
    | Variable_name      | Value |
    +--------------------+-------+
    | wsrep_cluster_size | 5     |
    +--------------------+-------+
    
    • Ensure the cluster size matches the expected value (e.g., 5).
  2. Analyze MySQL Error Logs for the Disconnected Node:

    vi + /var/log/mysql/error.log
    
  3. Restart the MySQL Service on the Disconnected Node:

    systemctl restart mysql
    

DevOps Remedy

  1. Verify Galera Cluster Status and Logs on the Disconnected Node(s):

    SHOW GLOBAL STATUS LIKE 'wsrep%';
    

    Key Metrics to Verify:

    • wsrep_cluster_status:
      • Primary: Indicates the node is part of the primary component.
      • Non-Primary: Indicates the node is disconnected or part of a split-brain scenario.
    • wsrep_connected: Expected value: ON.
  2. Validate Cluster Configuration in MySQL Configuration File:

    cat /etc/mysql/mysql.conf.d/mysqld.cnf | grep wsrep_cluster_address
    

    Expected Output:

    wsrep_cluster_address=gcomm://pssb1avm001,pssb1avm002,pssb1avm003,pssb1avm004,pssb1avm005
    
  3. Resolve Errors Identified in MySQL Logs:

    vi + /var/log/mysql/error.log
    
  4. Restart MySQL Services on Affected Nodes After Fixing Errors:

    systemctl restart mysql
    
  5. Validate Cluster Size Restoration:
    After all steps, check if the cluster size is restored to the expected value.


MySQL Cluster Out of Sync

This alert is triggered when a node in the cluster transitions to an out-of-sync state. The alert indicates that one or more nodes may not be fully synchronized with the rest of the cluster, which could lead to degraded performance or data inconsistencies.

C3 Data Collection

To analyze cluster status changes:

  1. Monitor the Node Synchronization Status

    • Check the Node Role Grafana panel.
    • Identify which node(s) have moved to an out-of-sync state.
  2. Verify Node Sync Details

    • Check if the affected node’s wsrep_local_state_comment is set to anything other than Synced.
  3. Collect Logs for Investigation

    • Examine the MySQL error logs for the affected node:
      vi + /var/log/mysql/error.log
      
    • Look for any network connectivity errors or Galera-specific warnings.

Dependent Metrics

  1. wsrep_local_state_comment

    • Represents the current state of the node. Possible values:
      • Synced: Node is in sync.
      • Joining: Node is in the process of joining the cluster.
      • Donor/Desynced: Node is temporarily desynced (e.g., during IST/SST).
      • Joining (Recovering): Node is recovering.
      • Disconnected: Node is not communicating with the cluster.
  2. wsrep_cluster_status

    • Indicates the cluster’s status:
      • Primary: Node is part of the primary cluster.
      • Non-Primary: Node is part of a split-brain scenario.
  3. wsrep_flow_control_paused

    • Measures replication flow control; higher values may indicate bottlenecks.
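A quick way to inspect these flow-control counters on the affected node (a minimal sketch):

    -- wsrep_flow_control_paused: fraction of time replication was paused by flow control
    -- wsrep_flow_control_sent / _recv: flow-control messages sent/received by this node
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';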

C3 Remedy

  1. Verify Node State and Cluster Status:

    • Run the following command to check the node’s status:

      SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
      

      Expected Value: Synced.

    • Confirm the cluster status:

      SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
      

      Expected Value: Primary.

  2. Check Error Logs for More Details:

    vi + /var/log/mysql/error.log
    
    • Identify issues such as network disconnections, high load, or SST/IST failures.
  3. Restart MySQL Service on Affected Node(s):

    systemctl restart mysql
    

DevOps Remedy

  1. Validate Network Connectivity:

    • Ensure all cluster nodes can communicate with each other. Check network firewalls or connectivity issues:
      ping <node-IP>
      
  2. Check SST/IST Progress:

    • On the affected node, verify the progress of state transfers:
      SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
      
    • If the node is stuck in Joining (Recovering), investigate the donor logs for SST/IST failures.
  3. Validate Resource Allocation:

    • Ensure the affected node has sufficient resources (CPU, RAM, Disk I/O).
  4. Rejoin the Node to the Cluster:

    • If the node is still out of sync, force it to rejoin:
      systemctl restart mysql
      
  5. Validate Sync State After Remedies:

    • Check wsrep_local_state_comment again to confirm the node is back in Synced state.

Max Thread Connections

This alert is triggered when the number of threads connected to the MySQL server gets close to or exceeds the maximum limit. If this limit is reached, new connections will fail, causing potential issues for applications relying on the database.

C3 Data Collection

  1. Monitor Current Thread Connections:

    • Check the Max Thread Connections Grafana panel to view the number of active connections.
  2. Verify Current Connection Usage:

    • Run this query in MySQL to check the current number of connections:
      SHOW STATUS LIKE 'Threads_connected';
      
      Output
         +-------------------+-------+
         | Variable_name     | Value |
         +-------------------+-------+
         | Threads_connected | 11    |
         +-------------------+-------+
      
  3. Identify the Maximum Allowed Connections:

    • Run this query to check the maximum allowed connections:
      SHOW VARIABLES LIKE 'max_connections';
      
      Output
         +-----------------+-------+
         | Variable_name   | Value |
         +-----------------+-------+
         | max_connections | 9999  |
         +-----------------+-------+
      
  4. Analyze Connection Logs:

    • Check MySQL logs for any warnings or errors about connection limits:
      vi + /var/log/mysql/mysql.log
      

Dependent Metrics

  1. mysql_global_status_threads_connected

    • The current number of connections to the database.
  2. mysql_global_variables_max_connections

    • The maximum number of allowed connections.
  3. mysql_global_status_threads_running

    • Number of active threads currently running queries.

C3 Remedy

  1. Check the Current Number of Threads Connected:

    • Run the query:
      SHOW STATUS LIKE 'Threads_connected';
      
      Compare the result with the max_connections value.
  2. Identify Problematic Connections:

    • Use this query to see all active connections:
      SHOW FULL PROCESSLIST;
      
    • If needed, kill long-running or stuck queries
      KILL <thread_id>;
      
  3. Check if the application is opening too many connections without closing them properly.
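To see which users or application hosts are holding the most connections, a query like the following can help (a sketch; information_schema.PROCESSLIST is deprecated in newer 8.0 releases in favour of performance_schema.processlist, but is still available):

    -- Count open connections per user and client host
    SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client_host, COUNT(*) AS connections
    FROM information_schema.processlist
    GROUP BY user, client_host
    ORDER BY connections DESC;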

DevOps Remedy

  1. Increase max_connections:

    • If the current connection limit is too low, increase it in the MySQL configuration file:
      vi /etc/mysql/mysql.conf.d/mysqld.cnf
      
      Add or update:
      max_connections = <value>
      
      Restart MySQL to apply the changes:
      systemctl restart mysql
      
  2. Monitor Connection Errors:

    • Check if the Connection_errors_max_connections metric is incrementing to verify if the issue persists.

      SHOW STATUS LIKE 'Connection_errors_max_connections';
      

      Output

      +-----------------------------------+-------+
      | Variable_name                     | Value |
      +-----------------------------------+-------+
      | Connection_errors_max_connections | 0     |
      +-----------------------------------+-------+
      
  3. Identify applications that may be opening too many connections and optimize their connection handling.


High Threads Running

This alert is triggered when the number of running threads exceeds a defined threshold:

  • Critical: Threads running > 30
  • Warning: Threads running between 20 and 30

Threads in MySQL represent tasks or queries that the database server is currently processing. When the number of active (non-sleeping) threads increases significantly, it can mean that the database server is under heavy load or facing issues.

Dependent Metrics

  1. Threads_running
    • Represents the active threads executing queries at a given time.
  2. Threads_connected
    • Shows all connections (active + sleeping).
  3. Max_used_connections
    • Tracks the maximum number of connections used since the server started.
  4. Connection_errors_max_connections
    • Counts the number of times the server hit the max_connections limit.
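All four of these counters can be pulled in a single statement (a minimal sketch):

    SHOW GLOBAL STATUS WHERE Variable_name IN
      ('Threads_running', 'Threads_connected', 'Max_used_connections',
       'Connection_errors_max_connections');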

C3 Data Collection

  1. Monitor the current threads
    Run the following command to check the thread status:

    SHOW GLOBAL STATUS LIKE 'Threads%';
    

    Example output:

    +-------------------+-------+
    | Variable_name     | Value |
    +-------------------+-------+
    | Threads_cached    | 6     |
    | Threads_connected | 50    |
    | Threads_created   | 200   |
    | Threads_running   | 35    |
    +-------------------+-------+
    
  2. Check ongoing queries
    Use the SHOW PROCESSLIST command to see active queries and their states:

    SHOW PROCESSLIST;
    

    Focus on long-running or resource-intensive queries.

  3. Review the slow query log
    Check for inefficient queries that might be causing high load:

    vi /var/log/mysql/slow-query.log
    
  4. Identify max connection errors
    Query the metric for errors related to exceeding the max connection limit:

    SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';
    

C3 Remedy

  1. Basic Checks:

    • Confirm whether the number of running threads is above the thresholds (20 or 30).
    • Investigate resource utilization (CPU, memory).
  2. Slow Query Identification:

    • Use the slow query log to identify and optimize long-running queries.
    • Example: Add proper indexes to improve query performance.
  3. Restart the MySQL Service (if needed)

    • If the server becomes unresponsive, restart it to stabilize.
    systemctl restart mysql
    

DevOps Remedy

  1. Analyze Running Threads:

    • Check the current number of running threads to understand the workload:
      SHOW STATUS LIKE 'Threads_running';
      
  2. Investigate Long-Running Queries:

    • Identify queries that might be consuming threads for an extended period:
      SHOW FULL PROCESSLIST;
      
    • If needed, kill long-running or stuck queries
      KILL <thread_id>;
      
    • Review and optimize any slow or inefficient queries.
  3. Increase Thread Pool Size (If Necessary):

    • Update the MySQL configuration to handle more threads if the workload justifies it:
      vi /etc/mysql/mysql.conf.d/mysqld.cnf
      
      Add or modify the following parameter:
      thread_cache_size = 128
      
  4. Optimize Applications:

    • Ensure applications are closing database connections promptly to prevent excessive thread usage.
  5. Restart MySQL (if configuration changes are made):

    systemctl restart mysql
    

MySQL High Replication Latency

C3 Data Collection

1. Metric Query: Use the following query to fetch the average replication latency in seconds for all nodes in the cluster:

mysql_galera_evs_repl_latency_avg_seconds{instance=~"pssb.*"}

This metric helps determine the current replication latency across the cluster.

2. Node State Verification

  • Check the panel that displays all nodes’ states (e.g., Joining, Syncing, Synced) to identify any abnormal node statuses.
    View Panel

3. Galera Metrics: Run the following SQL command on the affected node to get detailed cluster and node statuses:

SHOW STATUS LIKE 'wsrep%';

Sample Output

| wsrep_local_state_comment | Synced |
| wsrep_incoming_addresses | pssb1avm001:3306,pssb1avm002:3306,pssb1abm003:3306,pssb1avm004:3306,pssb1avm005:3306 |
| wsrep_cluster_weight | 5 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.00198006/0.00321617/0.00554355/0.00164676/3 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_cluster_size | 5 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_provider_name | Galera |
| wsrep_ready | ON |
| wsrep_local_send_queue | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_flow_control_paused | 0.000783576 |

Key Metrics to Note:

  • wsrep_local_state_comment: Indicates the state of the node (e.g., Synced, Donor, Joining).
  • wsrep_cluster_size: Shows the total number of nodes in the cluster.
  • wsrep_flow_control_paused: Reveals replication delays due to flow control pauses.

Replication Queues:

  • wsrep_local_recv_queue: Number of events waiting to be processed locally.
  • wsrep_local_send_queue: Number of events waiting to be sent to other nodes.
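To read the raw latency numbers directly on a node, you can also query the underlying Galera counter shown in the sample output above; its value is a composite field (the min/avg/max/stddev/sample-size interpretation follows the Galera status-variable documentation). A minimal check:

    -- Composite value: min/avg/max/stddev/sample size, in seconds
    SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';
    -- Growing queues suggest the node cannot keep up with replication
    SHOW GLOBAL STATUS LIKE 'wsrep_local_%_queue';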

4. Resource Usage: Monitor system resource metrics on all nodes to check for potential bottlenecks:

  • CPU Usage: High usage can impact replication threads.
    • Use tools like top or Prometheus metric: node_cpu_seconds_total.
  • Memory Usage: Insufficient memory may affect performance.
    • Use free -m or Prometheus metric: node_memory_Active_bytes.
  • Disk I/O: High latency in I/O can cause transaction delays.
    • Use iostat, iotop, or Prometheus metrics: node_disk_read/write_bytes.
  • Network Latency: Packet drops or high RTT between nodes can increase replication latency.
    • Tools: ping, traceroute.

5. Logs

  • MySQL Error Logs:
    • Collect logs from /var/log/mysql/error.log for errors related to replication, network issues, or cluster communication.
  • System Logs:
    • Check for OS-level errors:
      journalctl -u mysql  
      
      dmesg | grep -i error  
      

6. Slow Query Analysis: Review the slow query log on the affected node for long-running statements that may be delaying replication (see the check below).
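A minimal way to confirm whether slow-query logging is enabled and whether slow statements are accumulating:

    -- Is the slow query log enabled, and where is it written?
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
    -- Threshold (in seconds) above which a statement is logged as slow
    SHOW GLOBAL VARIABLES LIKE 'long_query_time';
    -- Running count of statements that exceeded the threshold
    SHOW GLOBAL STATUS LIKE 'Slow_queries';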

7. External Factors

  • Verify network connectivity between nodes to check for packet loss or high latency.

    Ping Test Analysis for Packet Loss

    Command:

    ping -c 10 172.21.0.62
    

    Output:

    PING 172.21.0.62 (172.21.0.62) 56(84) bytes of data.
    64 bytes from 172.21.0.62: icmp_seq=1 ttl=64 time=0.421 ms
    64 bytes from 172.21.0.62: icmp_seq=2 ttl=64 time=0.296 ms
    64 bytes from 172.21.0.62: icmp_seq=3 ttl=64 time=0.417 ms
    64 bytes from 172.21.0.62: icmp_seq=4 ttl=64 time=0.363 ms
    64 bytes from 172.21.0.62: icmp_seq=5 ttl=64 time=0.418 ms
    64 bytes from 172.21.0.62: icmp_seq=6 ttl=64 time=0.391 ms
    64 bytes from 172.21.0.62: icmp_seq=7 ttl=64 time=0.208 ms
    64 bytes from 172.21.0.62: icmp_seq=8 ttl=64 time=0.364 ms
    64 bytes from 172.21.0.62: icmp_seq=9 ttl=64 time=0.441 ms
    64 bytes from 172.21.0.62: icmp_seq=10 ttl=64 time=0.465 ms
    
    --- 172.21.0.62 ping statistics ---
    10 packets transmitted, 10 received, 0% packet loss, time 9221ms
    rtt min/avg/max/mdev = 0.208/0.378/0.465/0.072 ms
    

Understanding the Output:

  • Packet Loss: The output indicates that 0% packet loss occurred. This means all 10 packets were successfully transmitted and received without any loss.

  • Round-Trip Time (RTT):

  • Min RTT: 0.208 ms — This is the minimum round-trip time for the packets.

  • Avg RTT: 0.378 ms — The average round-trip time for all packets.

  • Max RTT: 0.465 ms — The maximum round-trip time recorded.

Key Points:

  • No packet loss is a good sign that the network between the nodes is functioning correctly.
  • The RTT times are relatively low, indicating good network performance.
  • Consistent latency with minimal deviation suggests stable network conditions between the nodes.

In this case, there are no network issues detected based on this ping test output.

Dependent Metrics

To address high replication latency in a Galera or Percona XtraDB Cluster, it’s important to monitor several dependent metrics that can indicate the root cause. Here’s a breakdown of key metrics to track and how you can get more details from MySQL:

  1. mysql_global_status_wsrep_local_state: This metric represents the state of the node in the cluster. If it’s showing a state other than “Synced” (i.e., state 4), it indicates that the node is out of sync, which can cause replication delays.

  2. mysql_global_status_wsrep_local_commits: This metric tracks the number of local commits that have been replicated. A low or fluctuating number might suggest that there are issues in committing transactions to the cluster, which can lead to replication latency.

  3. mysql_global_status_wsrep_received_bytes: This metric shows the amount of data received by the node from other nodes via replication. If there’s high replication traffic, the node may struggle to process this data, increasing replication latency.

  4. mysql_global_status_wsrep_replicated_bytes: This tracks the amount of data sent by the node to other nodes. If a node is transmitting large amounts of data that are not being processed fast enough by the receiving nodes, it can lead to delays in replication.

  5. mysql_global_status_innodb_data_reads: High InnoDB data reads can cause I/O bottlenecks, which may affect replication performance, leading to latency if the system is unable to process the read operations quickly enough.

  6. mysql_global_status_innodb_data_writes: Similar to data reads, high write operations to the InnoDB storage engine can lead to bottlenecks, especially if replication has to wait for the writes to complete.

  7. mysql_global_status_innodb_buffer_pool_waits: If this metric is high, it indicates that there is contention for space in the InnoDB buffer pool, which can slow down the processing of reads and writes, ultimately impacting replication speed.

  8. mysql_global_status_innodb_buffer_pool_size: A small buffer pool can lead to slower read and write operations, which can directly affect replication performance. Increasing the buffer pool size might alleviate this issue.

  9. mysql_global_status_threads_running: A high number of active threads indicates that the system is under heavy load. This can cause replication delays, especially if many threads are handling different types of operations concurrently.

  10. mysql_global_status_slow_queries: Slow queries can block threads, causing delays in processing replication data. If slow queries are frequent, they should be optimized to avoid impact on replication latency.

  11. mysql_global_status_innodb_log_waits: This metric counts the number of times transactions had to wait for InnoDB log space. High log waits can indicate that transactions are being delayed, impacting replication performance.

You can obtain more detailed information about many of these metrics and their current values by running the following command:

SHOW STATUS LIKE '%wsrep%';

You can also refer to the MySQL Grafana dashboards; the links are listed above.

C3 Remedy

  1. Node Resource Issues:

    • Resolve resource bottlenecks (CPU, memory, disk I/O) by following the OS Observability page guidelines.
  2. Node in Joining State:

    • If a node is in the Joining state, wait until it fully joins and syncs with the cluster before taking further action.
  3. MySQL Runtime Issues:

    • If the issue is related to MySQL runtime, restart the MySQL service using:
      systemctl restart mysql
      
    Before restarting MySQL, confirm that all other nodes are present in the cluster and that only this node is affected. If any other problem is found, inform the DevOps team.
    
  4. Escalation to DevOps:

    • If the issue persists despite troubleshooting, escalate to DevOps with all relevant logs, metrics, and resource usage details.

DevOps Remedy

  1. Optimize Write Operations: Identify heavy write queries (INSERT/UPDATE) and optimize them. Where possible, modify queries so that large transactions are broken into smaller ones.

  2. Hardware Optimization:

    • Upgrade Disk Speed: Use faster storage solutions such as SSDs or NVMe drives to improve disk I/O performance.
    • Enhance CPU Resources: Add more CPU cores or improve CPU speed to handle increased processing requirements.
    • Assess Memory: Increase memory allocation to avoid disk-based operations and improve performance.
  3. Enable Binary Logging Compression:

    Enable binary log compression on the source database server to reduce the size of binary log files generated during replication. Compressed binary logs consume less network bandwidth during replication, thereby reducing replication lag.

    Steps to Enable:

    1. Edit the MySQL configuration file (e.g., /etc/my.cnf) and add this under [mysqld]:

    binlog_transaction_compression=ON

    2. Restart MySQL: Apply the change with:

    sudo systemctl restart mysql

    3. Verify: Run this query to ensure compression is enabled:

    SHOW VARIABLES LIKE 'binlog_transaction_compression';
    
  4. Tune Replica Configuration Parameters

    Optimize innodb_buffer_pool_size in the MySQL configuration to improve data retrieval efficiency if buffer pool write waits are occurring (see the check below).
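A minimal way to gauge whether the buffer pool is undersized before changing it (a sketch; interpret the counters against your workload):

    -- Current buffer pool size in bytes
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
    -- Non-zero values mean writes had to wait for free pages in the buffer pool
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';
    -- Compare reads served from disk vs. total read requests (a high ratio suggests the pool is too small)
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';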


MySQL High Replication Traffic Received / Transmitted

  • Alert URL
  • Mysql High Replication Transmitted Bytes URL
  • Mysql High Replication Traffic Received URL
  • Dashboard URL
  • InBound Traffic Panel URL
  • OutBound Traffic Panel URL

This alert triggers when excessive replication traffic is observed between nodes in a master-master MySQL cluster. High traffic can indicate large transactions, frequent updates, or inefficient data operations, potentially leading to replication conflicts, increased network latency, and resource contention.

Understanding Network Traffic in Master-Master MySQL Replication

In a master-master replication setup, two MySQL servers work together, each acting as both a master and a slave. This means that each server sends and receives data from the other server to keep both databases in sync. Let’s break down how this works, and how it affects network traffic between the servers.

What is Master-Master Replication?

In master-master replication, both servers replicate data to each other. When one server makes a change to its database (like adding a new record), it sends that change to the other server. The other server then applies the change to its own database, and the process repeats.

Two Key Threads in Replication

Two threads handle this data exchange:

  1. The I/O Thread: This thread pulls the changes (called binary log events) from the other server.
  2. The SQL Thread: After the I/O thread brings in the changes, the SQL thread applies those changes to the local database.

If both servers have log-bin enabled, after applying the changes, the SQL thread also writes those changes to its own binary log and sends them back to the other server.
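To confirm how a node is configured for this behaviour, you can check the relevant variables (a minimal sketch; on newer MySQL releases log_slave_updates is also available under the alias log_replica_updates):

    -- Binary logging, whether replicated changes are re-logged, and the node's identity
    SHOW GLOBAL VARIABLES WHERE Variable_name IN ('log_bin', 'log_slave_updates', 'server_id');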

How Traffic is Generated

When one server sends data to the other, it affects two network traffic metrics:

  • Bytes_sent: The amount of data sent by the server to another server.
  • Bytes_received: The amount of data the server has received from another server.

Let’s look at how this works in a master-master setup:

  • Server A (Master 1) sends its changes to Server B. This increases Bytes_sent on Server A.
  • Server A receives changes from Server B. This increases Bytes_received on Server A.
  • Similarly, Server B sends changes to Server A, increasing Bytes_sent on Server B, and receives changes from Server A, increasing Bytes_received on Server B.
Example: Monitoring with SHOW FULL PROCESSLIST

You can monitor what’s happening on the servers using the SHOW FULL PROCESSLIST command. This command shows all active processes on the server. In a master-master setup, you might see something like this:

Sample Output:

On Server A:

| Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | system user | | NULL | Sleep | 18 | wsrep: committed write set | NULL | 18191 | 0 | 0 |
| 2 | system user | | NULL | Sleep | 2134791 | wsrep: aborter idle | NULL | 2134791569 | 0 | 0 |
| 136348 | qrtzrw_psdb_sb | pssb1abm003:56308 | qrtz_psdb_sb | Sleep | 21 | | NULL | 21628 | 1 | 1 |
| 140859 | psrw_sb | pssb1abm003:46952 | psdb_sb | Sleep | 4 | | NULL | 4258 | 0 | 0 |
| 144696 | root | localhost | NULL | Query | 0 | init | show full processlist | 0 | 0 | 0 |

Here’s what’s happening:

  • The Sleep command means the I/O thread is waiting for new data from Server B.
  • The Sending data state means the SQL thread is applying changes to the database and sending them to Server B.

On Server B, you might see something similar:

Sample Output:

| Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | system user | | NULL | Sleep | 2745965 | wsrep: aborter idle | NULL | 2745965341 | 0 | 0 |
| 2 | system user | | NULL | Sleep | 14 | wsrep: committed write set | NULL | 14104 | 0 | 0 |
| 179129 | qrtzrw_psdb_sb | pssb-ms-node02:44554 | qrtz_psdb_sb | Sleep | 7 | | NULL | 7900 | 1 | 1 |
| 186172 | root | localhost | psdb_sb | Query | 0 | init | show processlist | 0 | 0 | 0 |
Why Does This Matter?

In master-master replication, Bytes_sent and Bytes_received help you monitor the amount of data flowing between the two servers. If these values increase, it might indicate heavy replication traffic or large queries.

  • High traffic might mean lots of changes are being made to the database or there are large queries being executed.
  • Monitoring these values helps you ensure that replication is happening smoothly and that both servers are properly synced.

In a master-master replication setup, both servers are constantly sending and receiving data to keep their databases in sync. This results in changes to the Bytes_sent and Bytes_received values, which reflect the traffic between the two servers. By monitoring these metrics, you can track replication health and identify any potential issues with data flow between the servers.

Data Collection

  1. Network Traffic Analysis:

    • Use tools like iftop, vnstat, or monitoring systems (e.g., Grafana) to measure data transfer rates between nodes.
  2. Binary Log Inspection:

    • Identify large or frequent transactions causing high traffic by inspecting binary logs. For example, to view events in a specific binary log file:
      SHOW BINLOG EVENTS IN 'mysql-bin.000001';
      
      Replace mysql-bin.000001 with the actual binary log file name.
  3. Bytes Sent and Received: This MySQL query retrieves global status variables related to the number of bytes sent and received by the server. These variables provide insight into the network traffic generated by MySQL.

    SHOW GLOBAL STATUS LIKE 'Bytes_%';
    

    Example Output

    +----------------+-------------+
    | Variable_name  | Value       |
    +----------------+-------------+
    | Bytes_received | 1267179425  |
    | Bytes_sent     | 30373626016 |
    +----------------+-------------+
    2 rows in set (0.00 sec)
    
  4. Conflict Detection:

    • Review the error log for replication conflicts caused by simultaneous writes to both nodes. In a master-master MySQL cluster, both nodes can accept write operations, which may lead to replication conflicts. These conflicts can arise if the same data is modified on both nodes simultaneously, or if there are issues such as primary key conflicts, foreign key violations, or write-order problems.
    • This may cause inconsistency in the MySQL data, and the MySQL server will report a failure in its status.
      systemctl status mysql
      
    • Look for MySQL conflicts in the error.log file. Occasional conflicts are expected and are not a problem as long as they do not lead to data consistency issues.
      cat /var/log/mysql/error.log | grep onflic
      
      Example Output
      2024-10-31T01:05:01.148460Z 14 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
      THD: 14, mode: high priority, state: exec, conflict: executing, seqno: 35492
      THD: 3581, mode: local, state: exec, conflict: committing, seqno: 35493
      2024-10-31T01:05:01.148572Z 14 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
      THD: 14, mode: high priority, state: exec, conflict: executing, seqno: 35492
      THD: 3581, mode: local, state: exec, conflict: must_replay, seqno: 35493
      
      These errors are caused by high replication traffic and can lead to inconsistency when conflicts occur frequently.
  5. Resource Monitoring:

    • Monitor CPU, memory, and disk usage on both nodes to identify system resource bottlenecks.

Dependent Metrics

  1. Binary Log Size
    Monitor the size and frequency of binary logs generated during replication. Excessive growth can impact performance.
    Check with:

    SHOW BINARY LOGS;
    

    Look for:

    • Log_name, File_size: Check for unexpected increases in log file size.
  2. Network Bandwidth
    Measure data transfer between nodes. High transfer rates can indicate heavy replication traffic or network issues.
    Check with (Linux):

    iftop -i eth0
    

    Look for:

    • High data transfer rates between nodes, which may suggest replication traffic overload.
  3. Disk I/O
    Monitor disk read/write operations for relay and binary logs. High disk I/O can impact replication speed.
    Check with (Linux):

    iostat -x 1
    

    Look for:

    • High await or svctm values, which indicate disk latency and potential replication delays.
  4. Check CPU, RAM and System Load in the OS Monitoring Section and identify the potential bottlenecks.

C3 Remedy

  1. Restart MySQL Server
    If there is inconsistency or the MySQL server has gone into an unknown state, a restart can often help restore normal replication behavior.
    Action:

    • Perform a clean restart of the MySQL service:
      sudo systemctl restart mysql
      
  2. Try adding RAM and CPU to the server experiencing high replication traffic, then monitor the server's performance and the replication traffic.

  3. Wait for Resolution
    After making the above adjustments, monitor the system to see if the issue resolves. If the server stabilizes and replication traffic returns to normal, no further action may be needed.
    Action:

    • Allow some time for the replication traffic to settle. Monitor the replication status and system health.
  4. Escalate to DevOps Team
    If the issue does not resolve after restarting the server and adjusting resources, escalate the issue to the DevOps team for further investigation.
    Action:

    • Document the issue, including any changes made (e.g., resource adjustments) and replication status.
    • Inform the DevOps team of the ongoing problem, but ensure no additional latency is introduced to the entire cluster while waiting for resolution.

    Note: Delaying the escalation to DevOps or failing to inform them of the issue in a timely manner can lead to increased latency and performance degradation across the entire cluster. Ensure that any alerts or issues are communicated promptly to avoid further impact.

DevOps Remedy

  1. Update Configurations:
    If the MySQL server is unable to handle increased replication traffic, increase RAM and CPU resources to handle the load.
    Action:
    • Dynamically adjust the MySQL configuration to allow the server to utilize more system resources. This may involve adjusting parameters like innodb_buffer_pool_size, max_connections, and other resource-intensive settings.
    • Scale infrastructure: Increase instance sizes or allocate additional resources if running in cloud environments.
      Example:
      For increasing innodb_buffer_pool_size:
      SET GLOBAL innodb_buffer_pool_size = <new_value>;
      
  2. Get More Cores or Faster Cores: If CPU usage is high and the system is experiencing CPU saturation, increasing CPU cores may help. However, if the Max CPU Core Utilization is near 100% and the CPU usage is low, focus on upgrading to faster CPU cores. In cloud environments, monitor for CPU Stealing to ensure the MySQL server is allocated sufficient resources.

MySQL InnoDB Log Waits

Understanding InnoDB log waits

In InnoDB, the log buffer is a memory structure that stores data modifications before they are written to the log files on disk. An appropriately sized log buffer can improve performance by reducing the frequency of expensive I/O operations.

To check whether you need to adjust the log buffer size, use the Innodb_log_waits status variable. InnoDB log waits occur when a transaction is unable to write to the log buffer because it is full. The Innodb_log_waits variable stores the number of times such waits have occurred.

If Innodb_log_waits is greater than zero, you likely need to increase the size of the log buffer.

Checking InnoDB Log Waits

First, connect to the MySQL server:

mysql -u root -p

Second, check the innodb_log_waits value using the following query:

SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';

Output:

+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_log_waits | 0     |
+------------------+-------+
1 row in set (0.00 sec)

C3 Data Collection

To troubleshoot and resolve the InnoDB waiting state issue, the C3 team should collect the following data:

  1. MySQL Service Status
    Check if the MySQL service is running and identify any service-level issues:

    systemctl status mysql 
    

    OR

    systemctl status mysql@bootstrap
    
  2. Disk Usage
    Ensure there is sufficient disk space available for InnoDB log writes:

    df -h
    
  3. InnoDB Engine Status
    Review the detailed status of the InnoDB engine to detect locks or stalls:

    mysql -e "SHOW ENGINE INNODB STATUS\G"
    

    Go through the TRANSACTIONS section in the output:

    TRANSACTIONS
    ------------
    Trx id counter 2844915
    Purge done for trx's n:o < 2844915 undo n:o < 0 state: running but idle
    History list length 2
    LIST OF TRANSACTIONS FOR EACH SESSION:
    ---TRANSACTION 421815622005464, not started
    0 lock struct(s), heap size 1128, 0 row lock(s)
    ---TRANSACTION 421815622006328, not started
    0 lock struct(s), heap size 1128, 0 row lock(s)
    ---TRANSACTION 421815622004600, not started
    
  4. InnoDB Log File Size
    Verify the size of the InnoDB transaction log files to identify potential issues:

    ls -lh /data/db/pxc/mysql
    

    Sample Output

    -rw-r----- 1 mysql mysql 1.1M Nov  8 13:28 '#ib_16384_3.dblwr'
    -rw-r----- 1 mysql mysql  64K Dec  9 08:42 '#ib_16384_4.dblwr'
    -rw-r----- 1 mysql mysql 1.1M Nov  8 13:28 '#ib_16384_5.dblwr'
    -rw-r----- 1 mysql mysql  64K Dec  9 08:43 '#ib_16384_6.dblwr'
    -rw-r----- 1 mysql mysql 1.1M Nov  8 13:28 '#ib_16384_7.dblwr'
    -rw-r----- 1 mysql mysql  64K Dec  9 08:43 '#ib_16384_8.dblwr'
    -rw-r----- 1 mysql mysql 1.1M Nov  8 13:28 '#ib_16384_9.dblwr'
    drwxr-x--- 2 mysql mysql 4.0K Dec  9 08:12 '#innodb_redo'
    
  5. Slow Query Analysis
    Search the MySQL logs for slow queries that might be causing contention:

    grep -i "slow" /var/log/mysql/error.log
    
  6. Error Log Analysis
    Identify any critical errors in the MySQL log file that could be causing the waiting state:

    grep -i "error" /var/log/mysql/error.log
    
  7. Connection Metrics
    Review active MySQL connections for potential bottlenecks or saturation:

    netstat -anp | grep mysql
    

    Example Output

    tcp        0      0 0.0.0.0:4567            0.0.0.0:*               LISTEN      4194/mysqld         
    tcp        0      0 172.21.0.63:46096       172.21.0.65:4567        ESTABLISHED 4194/mysqld         
    tcp        0      0 172.21.0.63:46002       172.21.0.64:4567        ESTABLISHED 4194/mysqld         
    tcp        0      0 172.21.0.63:4567        172.21.0.61:45638       ESTABLISHED 4194/mysqld         
    tcp        0      0 172.21.0.63:4567        172.21.0.62:37702       ESTABLISHED 4194/mysqld         
    tcp6       0      0 :::9104                 :::*                    LISTEN      738/mysqld_exporter 
    tcp6       0      0 :::33060                :::*                    LISTEN      4194/mysqld         
    tcp6       0      0 :::3306                 :::*                    LISTEN      4194/mysqld 
    
  8. System Load
    Analyze server load and resource usage to detect high contention or lack of available resources:

    top
    

Collecting this data will provide a comprehensive view of the system and help diagnose the root cause of the InnoDB waiting state.

Dependent Metrics

  1. Write Activity

    • Verify that no writes are occurring on the MySQL server; a lack of writes could point to a bottleneck or misconfiguration.
  2. Operating System Alerts

    • Review all active OS-level alerts to ensure they are resolved and in a known state, as issues like high I/O wait, memory pressure, or CPU saturation can contribute to InnoDB log waits.
  3. OS Metrics Panels

    • Examine the primary OS metrics panels for abnormalities. If any panels are firing alerts, refer to the procedures outlined in the OS observability documentation to address them.
  4. Prometheus Metrics

    • Inspect the specific Prometheus metric to track InnoDB log waits:
      mysql_global_status_innodb_log_waits{instance=~"your_instance"}
      
  5. Additional Factors

    • Disk Latency: Check for high latency on disks used by InnoDB log files, which could delay writes.
    • Buffer Pool Saturation: An overloaded buffer pool can cause InnoDB operations to stall.
    • Network Bottlenecks: For replicated setups, ensure network throughput is sufficient to handle traffic.

Monitoring these dependent metrics is essential to diagnose and resolve InnoDB log waits efficiently.

C3 Remedy

  1. Initial Assessment by C3

    • C3’s role in this alert is minimal, primarily focused on identifying potential OS-level issues that might be causing InnoDB log waits. The first step is to assess the health of the underlying infrastructure.
  2. Review OS Metrics and Dashboards

    • Begin by reviewing the relevant OS metrics and dashboards, looking for any abnormalities or spikes in key resource areas such as:

      • Disk I/O: Check for high latency or throughput bottlenecks in disk systems storing InnoDB logs.
      • CPU Utilization: Look for signs of CPU saturation or high I/O wait times, which can indirectly affect MySQL’s performance.
      • Memory Usage: Ensure the system is not under memory pressure, as this can result in slower disk writes or caching inefficiencies.
    • If any abnormal metrics are identified in the OS monitoring tools, refer to the OS Observability Documentation to troubleshoot and resolve the issue. This could involve addressing I/O issues, optimizing system resources, or adjusting hardware allocations.

  3. Escalation to DevOps Team

    • If C3 is unable to resolve the issue using the OS-level documentation or if the issue persists beyond OS resource adjustments, the problem may lie deeper within the MySQL configuration or system infrastructure that requires more advanced troubleshooting.

    • The DevOps team may need to review the MySQL configuration, disk setup, or perform more advanced debugging to identify the root cause of the InnoDB log waits.

By following this process, C3 ensures that the issue is assessed at the OS level first, with clear guidance to escalate the problem to DevOps when necessary. This helps ensure that MySQL InnoDB log waits are addressed promptly and efficiently, minimizing the impact on performance.

DevOps Remedy

  1. Check InnoDB Log Waits

    • First, connect to the MySQL server:
      mysql -u root -p
      
    • Then, check the Innodb_log_waits value using the following query:
      SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';
      
    • Interpretation:
      • If the value of Innodb_log_waits is 0, this indicates that the InnoDB log buffer size is sufficient.
      • If the value is greater than zero, it suggests that the buffer size may be too small, which is leading to InnoDB log waits.
  2. Configure InnoDB Log Buffer Size

    • Update the MySQL Configuration:

      • Open the MySQL configuration file (/etc/mysql/mysql.conf.d/mysqld.cnf) and set the innodb_log_buffer_size to an appropriate value based on available memory and workload requirements.
      • Example:
        innodb_log_buffer_size = 32M
        
      • The above configuration sets the log buffer size to 32MB. Adjust this based on the system’s resources and transaction load.
    • Restart MySQL:

      • After updating the configuration, restart the MySQL server for the changes to take effect:
        systemctl restart mysql
        
  3. Verify the Configuration Change

    • After restarting MySQL, connect to the server again:
      mysql -u root -p
      
    • Check the value of the innodb_log_buffer_size to ensure it has been updated:
      SHOW GLOBAL VARIABLES LIKE 'innodb_log_buffer_size';
      
    • The output should reflect the new buffer size. For example:
      +------------------------+----------+
      | Variable_name          | Value    |
      +------------------------+----------+
      | innodb_log_buffer_size | 33554432 |
      +------------------------+----------+
      
      • The buffer size is in bytes. You can convert it to megabytes:
        SELECT ROUND(@@innodb_log_buffer_size / 1024 / 1024, 0);
        
      • This should output the buffer size in MB (e.g., 32MB).
  4. Adjust for High Log Waits

    • If Innodb_log_waits continues to show non-zero values after increasing the log buffer size, consider:
      • Increasing the buffer size further based on workload and system memory.
      • Reviewing disk I/O performance: Slow disk writes can exacerbate log wait times. Ensure that the disk subsystem is performing optimally, and consider using faster storage (e.g., SSDs) if necessary.
  5. Optimize MySQL Configuration for Performance

    • Consider other related settings to reduce log wait times:
      • innodb_flush_log_at_trx_commit: Set this to 2 for less frequent flushing to disk (the default is 1); note that this trades some crash durability for performance.
      • innodb_write_io_threads: Increase the number of write I/O threads to better utilize CPU resources for disk writes.
  6. Monitor and Validate Changes

    • Continuously monitor the Innodb_log_waits status and other related metrics (e.g., disk I/O and CPU utilization) to ensure that the issue is resolved and no new bottlenecks have emerged.
    • Use Prometheus or another monitoring solution to set up proactive alerts for InnoDB log waits.
  7. Collaborate with C3

    • Once the issue is resolved, inform the C3 team of the configuration changes. C3 should continue monitoring for any recurrence and ensure that the system operates smoothly.

By following these steps, the DevOps team can effectively resolve MySQL InnoDB log waits and ensure better performance of the MySQL server.


MySQL High/Low Reads

  • MySQL High Reads Alert URL
  • MySQL Low Reads Alert URL
  • Dashboard Link
  • Panel URL

The “High MySQL Reads” alert is triggered when MySQL experiences a significant increase in the number of read operations. This could be an indication of database stress, inefficient queries, or issues with resource allocation. High read activity can result in high CPU usage, slow queries, or even server crashes in severe cases.

Data Collection:

1. Check MySQL Global Status: Review global status metrics related to read activity:

SHOW GLOBAL STATUS LIKE 'Com_select';  -- Number of SELECT queries executed
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd';  -- Table scans performed by MySQL
SHOW GLOBAL STATUS LIKE 'Handler_read_next';  -- Index scans performed by MySQL

These queries provide insight into the total number of read operations and the types of reads being performed.
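If you need to see which statements are driving the reads, one option (a sketch, assuming performance_schema is enabled, which it is by default) is to rank statement digests by rows examined:

    -- Top statements by rows examined; high rows_examined vs. rows_sent suggests missing indexes
    SELECT DIGEST_TEXT, COUNT_STAR, SUM_ROWS_EXAMINED, SUM_ROWS_SENT
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_ROWS_EXAMINED DESC
    LIMIT 10;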

2. Review Grafana Dashboards: We have dedicated Grafana dashboards to monitor reads. Navigate to the “MySQL Reads” section to understand the metrics in real time. Use the following queries to view read activity:

rate(mysql_global_status_innodb_data_reads{instance=~"pssb.*"}[5m])

Apply this query to the Prometheus dashboard to track any changes in read performance over a 5-minute interval.

3. Check System Resource Utilization: High read activity could lead to resource strain. Use the following commands to check system health:

  • CPU Usage:
    top
    
  • Memory Usage:
    free -m
    
  • Disk I/O:
    iostat -x 1 10
    

4. Check MySQL Connections: Ensure there are no unusually high numbers of active connections, which could indicate excessive reads:

netstat -anp | grep mysql

5. Check for Locking Issues: High read activity can sometimes be correlated with locking. Check for table or row locks:

SHOW ENGINE INNODB STATUS\G

Dependent Metrics:

  1. Track Query Performance & Frequency
  • Panels:
    • SELECT Queries: Monitors the number of SELECT queries being executed.
    • Rows Examined: Tracks how many rows are being examined by queries, indicating the efficiency of your queries.
    • Slow Queries: Displays the count of queries that take longer than the specified threshold, helping to identify performance bottlenecks.

These panels help identify spikes in read activity, inefficient queries, or performance degradation in read operations.

  2. Monitor Buffer Pool Efficiency

    • Panels:
      • Buffer Pool Reads: Shows how many reads are occurring in the buffer pool, indicating the frequency of accessing disk data.
      • Buffer Pool Hit Rate: Tracks the percentage of requests served from the buffer pool, where a lower hit rate suggests increased disk I/O.

    Monitoring these panels helps identify if the buffer pool is large enough to handle the workload, reducing the need for expensive disk reads.

  3. Check InnoDB and Locking Metrics

    • Panels:
      • Lock Waits: Monitors the number of lock waits, which can signal contention and delays.
      • InnoDB Read-Write Operations: Displays the read and write operations within InnoDB, helping to identify bottlenecks.
      • InnoDB Data Reads/Writes: Tracks data reads and writes at the InnoDB storage level, which can highlight inefficiencies in data access.

    These panels help identify lock contention and inefficiencies in InnoDB storage, which can directly impact read performance.


  4. OS Metrics: Review system performance metrics (CPU, memory, disk) to identify potential resource bottlenecks that may be contributing to the increased reads.

C3 Remedy:

  • Check the Grafana Dashboards for abnormal read patterns. Look for spikes or unexpected changes in read activity.
  • Examine the Global Status metrics for any unusual increases in SELECT queries or table/index scans. This can help identify if inefficient queries or table scans are causing the issue.
  • If the issue persists and is impacting system performance, escalate to the DevOps team for further investigation, as C3 has limited capabilities for addressing resource-intensive read spikes.

DevOps Remedy:

  1. Diagnose High Read Activity:

    • Check the MySQL Global Status for high read counts. If Com_select is high, it indicates that many SELECT queries are being executed; wait to see whether the issue resolves itself. If it does not, proceed to the next step.
  2. Optimize Queries:

    • Consider optimizing slow or inefficient queries by adjusting indexes or rewriting the queries for better performance (see the sketch after this list).
  3. Increase System Resources: If high read activity is causing performance degradation, consider increasing resources (CPU, RAM) on the server to handle the load. You can increase memory or CPU by adjusting server configurations or provisioning more powerful hardware.

  4. Restart MySQL Service: If the issue is related to server state (e.g., memory issues, resource starvation), restarting MySQL can help clear any temporary issues:

    systemctl restart mysql
    
  5. Monitor System Health:

    • Use tools like iostat, top, and free to ensure that the system has enough resources to handle high read activity.
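
As referenced in step 1, a quick way to judge whether high read counts come from index lookups or full table scans is to compare Com_select with the read-handler counters (all are cumulative since the last restart); a minimal sketch:

SHOW GLOBAL STATUS LIKE 'Com_select';            -- total SELECT statements executed
SHOW GLOBAL STATUS LIKE 'Handler_read_key';      -- row reads served via an index
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next'; -- row reads from full table scans (high values suggest missing indexes)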

High MySQL Writes

  • Mysql High Writes Alert URL
  • Dashboard Link
  • Panel URL

The “High MySQL Writes” alert is triggered when MySQL experiences a significant increase in write operations, which could indicate issues such as inefficient queries, inadequate indexing, or resource constraints. High write activity can lead to increased disk I/O, slower performance, and potential database contention. Similarly, “Low MySQL Writes” may indicate underutilization of the database, suggesting potential problems with write-intensive applications.

Data Collection:

1. Check MySQL Global Status: Review global status metrics related to write activity:

SHOW GLOBAL STATUS LIKE 'Com_insert';  -- Number of INSERT queries executed
SHOW GLOBAL STATUS LIKE 'Com_update';  -- Number of UPDATE queries executed
SHOW GLOBAL STATUS LIKE 'Com_delete';  -- Number of DELETE queries executed

These queries provide insight into the total number of write operations (INSERT, UPDATE, DELETE) and the types of writes being performed.
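
Because these counters are cumulative since the last restart, sampling them twice gives an approximate rate; a minimal sketch assuming performance_schema is enabled:

-- Estimate INSERTs per second over a 10-second window
SELECT VARIABLE_VALUE INTO @c1 FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Com_insert';
SELECT SLEEP(10);
SELECT VARIABLE_VALUE INTO @c2 FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Com_insert';
SELECT (@c2 - @c1) / 10 AS inserts_per_second;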

2. Review Grafana Dashboards: We have dedicated Grafana dashboards to monitor write activity. Navigate to the “MySQL Writes” section to track write operations in real time. Use the following queries to view write activity:

rate(mysql_global_status_innodb_data_writes{instance=~"pssb.*"}[5m])

Run this query in Prometheus or the Grafana dashboard to track changes in write throughput over a 5-minute window.

3. Check System Resource Utilization: High or low write activity can lead to resource strain or underutilization. Use the following commands to check system health:

  • CPU Usage:
    top
    
  • Memory Usage:
    free -m
    
  • Disk I/O:
    iostat -x 1 10
    

4. Check MySQL Connections: Ensure there are no unusually high numbers of active connections, which could indicate excessive writes:

netstat -anp | grep mysql

5. Check for Locking Issues: High write activity can sometimes be correlated with locking. Check for table or row locks:

SHOW ENGINE INNODB STATUS\G

Example Output:

------------
TRANSACTIONS
------------
Trx id counter 0 89026
---
TRANSACTION 0 89026, ACTIVE 25 sec
mysql tables in use 1, locked 1
LOCK WAIT 1 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 4235, OS thread handle 140595728161024, query id 118572 localhost root updating
UPDATE my_table SET column_name = 'value' WHERE id = 1000

In this example, transaction 89026 is in a LOCK WAIT state: its UPDATE is blocked because another transaction holds a lock on the row it needs. If high write activity is causing locking issues, you will likely see many “LOCK WAIT” and “waiting for lock” entries. These entries can guide you in diagnosing transaction blocking, deadlocks, and lock contention that may be affecting performance.

Dependent Metrics:

  1. Track Write Performance & Frequency

    • Panels:
      • INSERT Queries: Monitors the number of INSERT queries being executed.
      • UPDATE Queries: Tracks the number of UPDATE queries being executed.
      • DELETE Queries: Monitors the number of DELETE queries being executed.

    These panels help identify spikes in write activity, inefficient queries, or performance degradation in write operations.

  2. Monitor Buffer Pool Efficiency

    • Panels:
      • Buffer Pool Writes: Shows how many writes are occurring in the buffer pool, indicating the frequency of accessing disk for data modification.
      • Buffer Pool Hit Rate: Tracks the percentage of requests served from the buffer pool, where a lower hit rate suggests increased disk I/O due to write activity.

    Monitoring these panels helps identify if the buffer pool is large enough to handle the workload, reducing the need for expensive disk writes.

  3. Check InnoDB and Locking Metrics

    • Panels:
      • Lock Waits: Monitors the number of lock waits, which can signal contention and delays during write operations.
      • InnoDB Read-Write Operations: Displays the read and write operations within InnoDB, helping to identify bottlenecks.
      • InnoDB Data Reads/Writes: Tracks data reads and writes at the InnoDB storage level, highlighting inefficiencies in data access and modification.
  4. OS Metrics: Review system performance metrics (CPU, memory, disk) to identify potential resource bottlenecks that may be contributing to the increased or decreased writes.

C3 Remedy

The C3 team, as the first level of support, is responsible for identifying, mitigating, and documenting the issue without making any configuration changes to MySQL. The following steps outline how the C3 team can manage high write activity:

1. Redistribute Write Load

  • Action: Inform application teams about the high write activity and request them to throttle or reschedule non-critical write operations.
  • Outcome: This can reduce immediate pressure on the database while avoiding the need for direct changes.

2. Monitor and Report Resource Utilization

  • Use monitoring tools like Grafana or Prometheus to collect metrics on:
    • Write throughput: Observe metrics like rate(mysql_global_status_innodb_data_writes[5m]).
    • Disk I/O: Check if write latency is increasing or disk utilization is at critical levels.
    • CPU and memory usage: Ensure adequate system resources are available.
  • Share these observations in real-time with the DevOps team.

3. Verify and Clear Disk Space

  • Action: Check disk utilization to ensure there is enough space for write operations:
    df -h
    
  • If disk space is low, escalate to the storage team to provision additional capacity. Inform DevOps for immediate intervention if space cannot be cleared.

4. Notify Application Teams

  • Investigate the source of excessive writes:
    • Check for logs indicating write-heavy operations or anomalies.
    • Notify the responsible teams about abnormal patterns and recommend application-level throttling.

5. Restart Impacted Applications or Services

  • Action: If an application is generating excessive writes due to a bug or misconfiguration, request application owners to temporarily disable or restart the service to reduce load.

6. Collaborate with DevOps for Long-Term Solutions

  • Provide a comprehensive incident report including:
    • The timeframe and magnitude of high write activity.
    • Resource utilization metrics collected during the event.
    • Actions taken, such as throttling applications or clearing disk space.
    • Any patterns observed in the system metrics or application logs.

By following these steps, the C3 team can effectively manage high write scenarios, minimize their impact, and ensure proper escalation for permanent resolutions.

DevOps Remedy

If MySQL write operations are abnormally high, it may indicate performance bottlenecks or configuration issues. Here’s a structured approach for DevOps to analyze and address the problem.

Step 1: Analyze the Write Load
  1. Examine Write Metrics
    Use Prometheus or Grafana dashboards to analyze write rates over time.

    • Prometheus Query Example:
      rate(mysql_global_status_innodb_data_writes[5m])
      
    • Look for unusual spikes or patterns. Determine if the writes are due to legitimate application usage or anomalies such as batch jobs, misbehaving clients, or logging loops.
  2. Check Application Logs

    • Identify if specific services or application components are generating excessive writes.
    • Correlate logs with high-write periods to narrow down potential culprits.
Step 2: Assess Disk Performance
  1. Monitor Disk I/O Performance
    Use tools like iostat or vmstat to monitor disk write speeds and identify bottlenecks:

    iostat -x 1 5
    

    Output example:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
    sda      0.00   5.00 2.00  8.00  50.00 200.00  30.00    0.05  1.00  0.50  40.00
    

    High w/s (writes per second) and %util nearing 100% indicate a saturated disk.

  2. Investigate Disk Latency
    Check the time taken for write operations. If latency is high:

    • Consider upgrading to faster storage (e.g., SSDs).
    • Optimize InnoDB flush settings, such as innodb_flush_log_at_trx_commit.
Step 3: Optimize MySQL Settings
  1. Buffer Pool Size
    Increase the innodb_buffer_pool_size to reduce direct disk writes:

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4 GB (SET GLOBAL expects the value in bytes)
    
  2. Log Buffer Size
    Ensure the innodb_log_buffer_size is adequate to handle bursts of writes:

    SHOW VARIABLES LIKE 'innodb_log_buffer_size';
    SET GLOBAL innodb_log_buffer_size = 134217728;  -- 128 MB in bytes (dynamic in MySQL 8.0+; older versions require a my.cnf change and restart)
    
Step 4: Index Optimization
  1. Check Table Indexes

    • Analyze write-heavy tables for unnecessary or poorly optimized indexes.
    • Remove redundant indexes to reduce the overhead of maintaining them during write operations.
  2. Query Example:
    Identify index usage:

    SELECT TABLE_NAME, INDEX_NAME, NON_UNIQUE 
    FROM INFORMATION_SCHEMA.STATISTICS
    WHERE TABLE_SCHEMA = 'your_database';
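
    If the sys schema is available (MySQL 5.7+), candidate indexes to drop can also be surfaced directly; a minimal sketch (treat the output as a starting point, not an automatic drop list):

    -- Indexes fully covered by another index on the same table
    SELECT table_schema, table_name, redundant_index_name, dominant_index_name
    FROM sys.schema_redundant_indexes;

    -- Indexes that have not been used since the server started
    SELECT * FROM sys.schema_unused_indexes;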
    
Step 5: Manage Lock Contention
  1. Monitor Locking Issues

    • Use SHOW ENGINE INNODB STATUS\G to analyze transaction locks.
    • Output example:
      ------------
      TRANSACTIONS
      ------------
      Trx id counter 162053
      Purge done for trx's n:o < 162050
      History list length 3
      ---
      
    High transaction contention indicates the need to optimize queries or table design.
  2. Implement Optimistic Concurrency Control
    If lock contention is a bottleneck, consider adjusting application logic to minimize simultaneous writes.

Step 6: Investigate Write Amplification
  1. Analyze Binlog Writes
    If binary logging is enabled, it may contribute significantly to write overhead.
    • Check whether binary logging is enabled (binlog file sizes can be listed as shown below):
      SHOW VARIABLES LIKE 'log_bin';
      
    • Consider disabling binary logging temporarily in non-production environments if it’s not required.
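
    If binary logging is enabled, the size of the binlog files on disk gives a rough sense of how much write amplification it adds; a minimal sketch:

      SHOW BINARY LOGS;                                   -- lists each binlog file with its size in bytes
      SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';   -- retention setting (MySQL 8.0+)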
Step 7: Review Application Behavior
  • Audit application write patterns. Ensure unnecessary updates, excessive logging, or inefficient bulk operations are not generating additional write load.

High InnoDB Buffer Pool Usage

The alert “High InnoDB Buffer Pool Usage” is triggered when the MySQL InnoDB buffer pool is heavily utilized, nearing or exceeding its configured size. The buffer pool is a key component of MySQL’s performance, acting as a memory cache for frequently accessed data and indexes. High usage can indicate efficient use of the buffer pool or point to memory pressure and the need for optimization.

High InnoDB buffer pool usage often occurs when there is an increase in read or write requests, as MySQL leverages the buffer pool to cache frequently accessed data and indexes for better performance. This alert requires attention to system resources and workload patterns to prevent degraded performance or system instability.

Impact

  1. Increased latency for read and write operations as MySQL resorts to disk I/O.
  2. Potential memory contention with the operating system or other applications.
  3. Deterioration in query performance due to insufficient memory for caching.

C3 Data Collection

The C3 team, as the first-level responders, collects information to confirm the issue and determine if further escalation is needed.

  1. Check Buffer Pool Usage
    Use the following query to monitor buffer pool usage:

    SHOW ENGINE INNODB STATUS\G
    

    Look for the following sections in the output:

    • “Buffer pool size”
    • “Database pages”
    • “Free buffers”
      Example:
    BUFFER POOL AND MEMORY
    Total memory allocated 8589934592; in additional pool allocated 0
    Database pages 409600
    Free buffers 0
    
  2. Monitor Memory Utilization
    The Prometheus expression below reports buffer pool usage as a percentage; use it for real-time tracking (a SQL equivalent is sketched after this list):

    (mysql_global_status_innodb_buffer_pool_bytes_data{instance="$host"} / mysql_global_variables_innodb_buffer_pool_size{instance="$host"}) * 100
    
  3. Check System Memory: Run the following command to assess overall system memory utilization:

    free -h
    

    Ensure there is no memory contention between MySQL and other processes. Note that memory limits for the buffer pool size are already configured on the node.

  4. Disk I/O Activity
    Examine disk usage if the buffer pool is insufficient:

    iostat -x 1 5
    

    High await and %util indicate disk pressure due to inadequate buffer pool size.

  5. Query Statistics
    Analyze queries contributing to buffer pool pressure:

    SHOW GLOBAL STATUS LIKE 'Innodb_pages_read';
    SHOW GLOBAL STATUS LIKE 'Innodb_pages_written';
    

    High read/write operations can point to unoptimized queries or heavy workload.
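
As noted in step 2, the same buffer pool usage percentage can be computed directly in SQL; a minimal sketch assuming performance_schema is enabled:

SELECT ROUND(
         (SELECT VARIABLE_VALUE FROM performance_schema.global_status
           WHERE VARIABLE_NAME = 'Innodb_buffer_pool_bytes_data')
       / (SELECT VARIABLE_VALUE FROM performance_schema.global_variables
           WHERE VARIABLE_NAME = 'innodb_buffer_pool_size') * 100, 2) AS buffer_pool_used_pct;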

Dependent Metrics

  1. Key Metrics to Monitor:

    • Buffer pool hit rate:
      1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
      
    • High read requests with low buffer hit rate may indicate insufficient buffer pool size.
  2. CPU and Disk Correlation:

    • Increased CPU usage due to high memory contention or context switching.
    • Elevated disk I/O caused by excessive page flushes from the buffer pool.
  3. Table/Index Access Patterns: This refers to how MySQL handles table and index access during queries. If tables are accessed frequently without proper indexes, MySQL may perform full table scans instead of efficiently retrieving data via indexes.

    • Tables with frequent access and no indexes.
    • Prometheus query for table scan detection:
      rate(mysql_global_status_handler_read_rnd_next[5m])
      
  4. Memory Pressure:

    • System-wide memory pressure can be tracked via OS dashboards or vmstat.
    • Frequent swap usage (si/so columns) may exacerbate MySQL performance issues.

C3 Remedy

The C3 team does not modify MySQL configuration but ensures initial triaging of the issue:

  1. Validate Memory and Disk I/O:

    • Cross-check system memory (free -h) and disk I/O metrics (iostat -x).
    • Report abnormal usage patterns to the DevOps team.
  2. Monitor Read/Write Patterns:

    • Use dashboards to identify spikes in read/write operations and note the associated timeframes.
  3. Escalate Appropriately:

    • If the buffer pool is insufficient or memory contention is evident, escalate with evidence, including query statistics, buffer pool metrics, and system memory usage.

DevOps Remedy

DevOps ensures optimal configuration and resolves underlying issues:

  1. Increase Buffer Pool Size:

    • Check the current configuration:
      SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
      
    • Open the MySQL configuration file (my.cnf) and update the variable in the [mysqld] section (a dynamic, restart-free alternative is sketched after this list):
      innodb_buffer_pool_size = 2G
      
    • Restart MySQL to apply changes:
      systemctl restart mysql
      
  2. Optimize Query Patterns:

    • Analyze slow queries contributing to buffer pool pressure:
      SHOW GLOBAL STATUS LIKE 'Com_select';
      SHOW GLOBAL STATUS LIKE 'Com_update';
      
    • Index high-impact tables to reduce memory and disk overhead.
  3. Improve Memory Utilization:

    • Optimize OS memory management:
      • Set vm.swappiness=1 in /etc/sysctl.conf so the OS prefers keeping application memory in RAM over swapping.
    • Monitor and adjust the innodb_buffer_pool_instances for better concurrency.
  4. Evaluate Workload Distribution:

    • Where possible, distribute traffic across multiple MySQL instances using a load balancer such as ProxySQL.
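
As noted in step 1, on MySQL 5.7+ the buffer pool can also be resized online without a restart; a minimal sketch (the value is in bytes and is rounded up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances):

SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;   -- 2 GB
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';           -- track resize progress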

The “High InnoDB Buffer Pool Usage” alert indicates potential memory pressure and degraded database performance. By carefully monitoring, analyzing, and optimizing buffer pool settings and workload patterns, the issue can be resolved efficiently, ensuring stable and high-performing MySQL operations.

MySQL Uptime

The MySQL Uptime alert triggers when the uptime of a MySQL server falls below an expected threshold. This indicates that the MySQL service has restarted recently, either due to planned maintenance or unexpected failure. It is essential to investigate the root cause promptly to prevent data loss, operational disruption, or performance degradation.

C3 Data Collection

The C3 team is responsible for collecting the following information to assess the situation and escalate if necessary:

  1. Check MySQL Uptime:

    • Query the MySQL server to retrieve the uptime value:
      SHOW GLOBAL STATUS LIKE 'Uptime';
      
      Example Output:
      +---------------+-------+
      | Variable_name | Value |
      +---------------+-------+
      | Uptime        | 120   |  # Uptime in seconds
      +---------------+-------+
      
  2. Verify System Logs:

    • Review system logs for indications of a MySQL restart:
      journalctl -u mysql.service --since "1 hour ago"
      
    • Look for entries like MySQL started or errors causing restarts.
  3. Check Server Resources:

    • Inspect the system’s CPU, memory, and disk usage to identify potential resource constraints:
      top
      df -h
      free -m
      
  4. Review MySQL Error Log:

    • Examine the MySQL error log for issues preceding the restart:
      tail -n 50 /var/log/mysql/error.log
      
  5. Database Connection Metrics:

    • Identify any spikes in connections or failed connection attempts:
      SHOW GLOBAL STATUS LIKE 'Aborted_connects';
      SHOW GLOBAL STATUS LIKE 'Threads_connected';
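
To correlate the restart with the system log window from step 2, the Uptime value can be converted into an approximate restart timestamp; a minimal sketch assuming performance_schema is enabled:

SELECT NOW() - INTERVAL VARIABLE_VALUE SECOND AS approx_restart_time
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Uptime';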
      

Dependent Metrics

The following Prometheus metrics may show changes correlating with low MySQL uptime:

  1. Uptime Metrics:

    • Monitor MySQL uptime via Prometheus:
      mysql_global_status_uptime{instance=~"your-instance"}
      
  2. Connection Failures:

    • Detect failed connection attempts, which may indicate restart-related issues:
      rate(mysql_global_status_aborted_connects[5m])
      
  3. CPU and Memory Usage:

    • Monitor CPU/memory utilization for resource bottlenecks leading to restarts:
      node_memory_Active_bytes{instance=~"your-instance"}
      node_cpu_seconds_total{instance=~"your-instance"}
      

C3 Remedy

The C3 team should take the following steps to resolve or escalate the issue:

  1. Restart Confirmation:

    • Confirm whether the restart was planned by checking for maintenance schedules or alerts.
    • If unplanned, gather relevant logs and metrics to understand the cause.
  2. Resource Management:

    • Address temporary resource issues (e.g., high CPU/memory) by terminating unnecessary processes or allocating additional resources.
    • Verify that the server is not overburdened.
  3. Error Review:

    • Share the error log details with the DevOps team if critical errors are identified.
  4. Escalation:

    • If the root cause is unclear or cannot be mitigated by the C3 team, escalate to the DevOps team with collected evidence.

DevOps Remedy

  1. Analyze Logs and Metrics:

    • Review system and MySQL logs for causes of the restart (e.g., OOM killer events, configuration errors, or software crashes).
    • Use Prometheus dashboards to correlate uptime drops with other metrics.
  2. Optimize MySQL Configuration:

    • Ensure the MySQL configuration aligns with workload demands. Common adjustments include:
      • max_connections
      • innodb_buffer_pool_size
      • thread_cache_size
  3. Inspect Hardware and OS Issues:

    • Verify disk health, CPU performance, and memory allocation:
      dmesg | grep -i error
      smartctl -a /dev/sdX
      
  4. Reconfigure Monitoring and Alerts:

    • Adjust uptime alert thresholds to reduce false positives and align with operational policies.
  5. Implement Failover Mechanisms:

    • Ensure the cluster or failover setup can handle restarts seamlessly to minimize downtime.

MySQL Slow Queries

  • Identifying Slow Queries Link
  • Tune Slow Query Log Link

Grafana Dashboard Panel Link

The MySQL Slow Queries alert triggers when the number of queries taking longer than the configured threshold (long_query_time) exceeds acceptable limits. Slow queries can impact overall database performance, lead to increased resource usage, and degrade application response times. Causes of slow queries include missing indexes, poorly optimized queries, and insufficient server resources.

C3 Data Collection

The C3 team collects relevant data to investigate and identify the potential causes of slow queries:

  1. Verify Slow Query Log Status: Check if the slow query log is enabled and identify the configured long_query_time threshold:

    SHOW VARIABLES LIKE 'slow_query_log';
    SHOW VARIABLES LIKE 'long_query_time';
    

    Example Output:

    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | slow_query_log  | ON    |
    | long_query_time | 10.00 |
    +-----------------+-------+
    

    A long_query_time of 10 seconds means any query taking longer than 10 seconds to execute is logged as a slow query. Such queries usually indicate performance issues such as inefficient query design, lack of indexing, or high system load.

  2. Examine the Slow Query Log: Retrieve entries from the slow query log to identify problematic queries:

    tail -n 50 /var/log/mysql/slow.log
    
  3. Check Server Load and Resource Usage: Inspect system-level metrics for CPU, memory, and disk I/O:

    top
    vmstat
    iostat -x 1 5
    
  4. Identify Slow Query Patterns: Are the slow queries frequent? Are they tied to specific tables or query types?

  5. Prometheus Metric (Slow Queries Rate): Monitor the rate of slow queries in Prometheus:

    rate(mysql_global_status_slow_queries[5m])
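
In addition to the slow query log (step 2), the performance_schema digest table can rank the heaviest statements without parsing log files; a minimal sketch (timers are reported in picoseconds):

SELECT schema_name,
       digest_text,
       count_star                          AS exec_count,
       ROUND(sum_timer_wait / 1e12, 2)     AS total_latency_s,
       ROUND(avg_timer_wait / 1e12, 4)     AS avg_latency_s,
       sum_rows_examined                   AS rows_examined
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 5;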
    

Dependent Metrics

  1. Query Execution Time: Slow queries increase average execution time, leading to delays in processing and query completion.
    PromQL:

    rate(mysql_global_status_slow_queries[5m])
    
  2. Number of Slow Queries: A high count of slow queries indicates inefficient query patterns or resource bottlenecks.
    PromQL:

    mysql_global_status_slow_queries
    
  3. CPU Usage: Slow queries often involve complex operations, such as sorting or joins, which increase CPU consumption and strain processing power.
    PromQL:

    rate(node_cpu_seconds_total{mode="user"}[5m]) * 100
    
  4. Memory Utilization: Inefficient queries consuming large datasets can lead to higher memory usage, potentially causing swapping or out-of-memory issues.
    PromQL:

    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
    
  5. Disk I/O Metrics: Slow queries performing full table scans or large read/write operations increase disk I/O, leading to higher latency and throughput.
    PromQL:

    rate(node_disk_reads_completed_total[5m])
    rate(node_disk_writes_completed_total[5m])
    
  6. Active Connections: Slow queries hold connections longer, increasing the number of threads waiting or running, which can exhaust connection limits.
    PromQL:

    mysql_global_status_threads_connected
    mysql_global_status_threads_running
    
  7. Query Cache Efficiency: If query results are not cached or cache misses occur, slow queries further degrade performance by re-executing resource-heavy operations.
    PromQL:

    (mysql_global_status_qcache_hits / (mysql_global_status_qcache_hits + mysql_global_status_qcache_inserts)) * 100
    
  8. InnoDB Buffer Pool Usage: Slow queries with large data scans or updates can fill the buffer pool, causing evictions and increased disk reads/writes.
    Refer to the InnoDB Buffer Pool Usage dashboard panel.

  9. Table or Row Lock Contention: Slow queries often hold locks for longer durations, increasing lock waits and contention, which impacts other queries.
    PromQL:

    rate(mysql_global_status_innodb_row_lock_waits[5m])
    rate(mysql_global_status_innodb_row_lock_time[5m])
    

    Refer to the Table Lock dashboard panel Link

  10. Query Per Second (QPS): A spike in slow queries reduces the overall QPS due to longer execution times, leading to slower response rates for incoming requests.
    PromQL:

    rate(mysql_global_status_queries[5m])
    

By monitoring these metrics, you can assess how slow queries impact MySQL’s overall performance and identify areas for optimization.

C3 Remedy

  1. Verify Slow Query Log Activation: Use the following query to confirm whether the slow query log is enabled:
    SHOW VARIABLES LIKE 'slow_query_log';
    

If the log is disabled, escalate to the DevOps team to enable it. C3 does not have the authority to change configurations.

  2. Monitor Query Performance
  • Check the metric mysql_global_status_slow_queries in Prometheus or Grafana dashboards. Identify spikes or abnormal trends in the slow query rate.
  3. Analyze System Resource Usage
  • Inspect resource usage on the MySQL server:
    • CPU: Check for high CPU utilization indicating resource-heavy queries.
    • Memory: Verify if the memory usage is stable and not causing swapping.
    • Disk I/O: Review disk activity for signs of bottlenecks.
  • Use OS monitoring dashboards for a detailed view of resource metrics.
  4. Evaluate Table Scans
  • Use Prometheus query to detect table scans:
    rate(mysql_global_status_handler_read_rnd_next[5m])
    
  • High table scan rates often indicate missing indexes or inefficient queries. Document findings for escalation.
  5. Check for Resource Contention
  • Identify if high concurrency or locking issues are contributing to slow queries:
    SHOW ENGINE INNODB STATUS\G
    
  • If contention is detected, include it in the escalation report.
  6. Validate Query Patterns
  • Review application-level patterns in the logs or dashboards to identify problematic queries or redundant query execution.
  7. Collaborate with Application Teams
  • Inform application teams of specific slow queries causing issues, as they may need to optimize the queries at the application level. To see what each session is currently executing, use:
    SHOW FULL PROCESSLIST;
  8. Escalate When Necessary
  • If resource constraints, missing indexes, or configuration limitations are suspected, escalate to the DevOps team with the following:
    • Details of the slow queries (from the slow query log).
    • Metrics showing system resource usage.
    • Evidence of table scans or locking issues.
  • Work closely with application teams to address inefficiencies in query design or patterns.

DevOps Remedy

The DevOps team has the authority to make necessary configuration changes, analyze database performance in depth, and implement optimizations to resolve slow query issues effectively. Below is a detailed remedy process for addressing the “MySQL Slow Queries” alert:

  1. Enable and Configure the Slow Query Log

    • If the slow query log is not already enabled, modify the MySQL configuration file (my.cnf or my.ini) to activate it:
      slow_query_log = 1
      slow_query_log_file = /var/log/mysql/slow.log
      long_query_time = 1
      
    • Restart MySQL for the changes to take effect:
      systemctl restart mysql
      
  2. Analyze the Slow Query Log

    • Use the mysqldumpslow tool to summarize the slow queries:
      mysqldumpslow -s c -t 10 /var/log/mysql/slow.log
      
    • Alternatively, review the raw log for details:
      cat /var/log/mysql/slow.log
      
    • Identify queries taking the most time or being executed frequently.
  3. Optimize Index Usage

    • Use the EXPLAIN statement to understand how MySQL processes the identified slow queries:
      EXPLAIN SELECT * FROM table WHERE condition;
      
    • Add or refine indexes based on the output. For example:
      CREATE INDEX idx_column_name ON table_name (column_name);
      
    • Avoid over-indexing, which can slow down write operations.
  4. Tune MySQL Configuration Parameters

    • Adjust MySQL parameters to improve performance based on the workload:
      • Increase Query Cache (MySQL 5.7 and earlier only; the query cache was removed in MySQL 8.0):
        query_cache_size = 64M
        query_cache_type = 1
        
      • Optimize InnoDB Settings:
        # Size the buffer pool to roughly 60-70% of available memory, e.g.:
        innodb_buffer_pool_size = 4G
        innodb_log_buffer_size = 16M
        innodb_flush_log_at_trx_commit = 2
        
      • Thread Configuration:
        thread_cache_size = 16
        max_connections = 200
        
    • Restart MySQL to apply changes:
      systemctl restart mysql
      
  5. Optimize Query Execution

    • Review and rewrite slow queries for better efficiency:
      • Avoid SELECT *; explicitly list columns.
      • Use LIMIT to restrict results when possible.
      • Minimize subqueries in favor of JOINs.
  6. Analyze Locking and Concurrency

    • Check for locking issues:
      SHOW ENGINE INNODB STATUS\G
      
    • Increase the innodb_lock_wait_timeout parameter if locks are frequently causing delays:
      innodb_lock_wait_timeout = 50
      
  7. Monitor Resource Usage

    • Use system monitoring tools (e.g., htop, vmstat, or OS dashboards) to analyze CPU, memory, and disk I/O.
    • Ensure sufficient memory to avoid swapping, and allocate appropriate resources for the MySQL server.
  8. Collaborate with Application Teams

    • Share insights on poorly performing queries with developers.
    • Work together to implement query optimization and caching strategies in the application layer.
  9. Implement Long-Term Solutions

    • Scale the database infrastructure by:
      • Sharding large tables.
      • Replicating the database (if not already done) to distribute the workload.
    • Migrate to more performant storage (e.g., SSDs) to improve disk I/O.