MySQL is a popular open-source relational database management system (RDBMS) known for its scalability, reliability, and use in a wide range of web-based and enterprise applications.
Percona XtraDB Cluster is a high-availability solution based on MySQL that provides synchronous replication, multi-master support, and data consistency using Galera Cluster technology. It ensures no single point of failure and is ideal for environments requiring fault tolerance and horizontal scalability.
Feature | MySQL | Percona XtraDB Cluster |
---|---|---|
Replication | Asynchronous (default). | Synchronous with Galera technology. |
High Availability (HA) | Requires additional tools for HA. | Built-in HA with no single point of failure. |
Write Scalability | Single primary server for writes. | Multi-master, allowing writes on all nodes. |
Data Consistency | Eventual consistency. | Strong consistency across all nodes. |
Dashboard & Row | Alert Name | Panel | Panel Description | Query | Query Description | Query Operating Range | Metrics | Metric Description | Metric Operating Range | SEVERITY: CRITICAL | SEVERITY: WARNING | SEVERITY: OK |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3.1.1 | Cluster Size | Represents the total number of nodes in the cluster. | mysql_global_status_wsrep_cluster_size{instance=~"pssb.*"} | Represents the number of nodes currently active in the cluster | 0-5 | mysql_global_status_wsrep_cluster_size | Represents the number of nodes currently active in the cluster | 0-5 | 0 | < 5 | 5 | |
2.1.2 | Cluster Status (in sync / out of sync) | Status of the MySQL node | mysql_global_status_wsrep_local_state{instance=~"pssb.*"} | State of the writeset replication | 0-4 | mysql_global_status_wsrep_local_state | State of the MySQL node as reported by SHOW GLOBAL STATUS | 0-4 | != 4 | - | 4 | |
2.4.1 | Max Thread Connections | MySQL Connections | Monitors MySQL connection statistics over a specified interval. | mysql_global_status_max_used_connections{instance=~"pssb.*"} | Shows the highest number of connections used since the MySQL server started | Positive values | mysql_global_status_max_used_connections | Shows the highest number of connections used since the MySQL server started | Positive values | > 100 | > 50 | < 15 |
2.4.1 | HighThreadsRunning | MySQL Connections | MySQL threads that are currently executing, i.e., using I/O | mysql_global_status_threads_running{instance=~"pssb.*"} | MySQL threads that are executing and not sleeping | Positive values | mysql_global_status_threads_running | MySQL connections that are not idle, i.e., in a running state | Positive values | > 30 | 20 < threads < 30 | < 20 |
Slow Queries | Slow Queries | Monitors the rate of slow queries | rate(mysql_global_status_slow_queries{instance=~"pssb.*"}[5m]) | Gives the rate of slow queries per second | Positive values | mysql_global_status_slow_queries | Tracks the number of slow queries logged by MySQL | > 5 | > 0 | 0 | ||
2.1.4 | MySQL Uptime | MySQL Uptime | Shows how long the server has been running | mysql_global_status_uptime{instance=~"pssb.*"} | Shows how long the server has been running, in seconds | Positive values | mysql_global_status_uptime | Shows how long the server has been running, in seconds | Positive values | < 60 | - | > 60 |
2.1.3 | Buffer Pool Usage | Buffer pool usage | Memory used by the buffer pool | (mysql_global_variables_innodb_buffer_pool_size{instance=~"pssb.*"} * 100) / on (instance) node_memory_MemTotal_bytes{instance=~"pssb.*"} | Shows the buffer pool usage as a percentage | Positive percentage | mysql_global_variables_innodb_buffer_pool_size | Shows the buffer pool size in bytes | Positive values | 50 < size < 60 | > 60 | < 50 |
2.2.1 | High Reads | Database Reads | MySQL reads performed at a given instant | mysql_global_status_innodb_data_reads{instance=~"pssb.*"} | MySQL database data reads | Positive values | mysql_global_status_innodb_data_reads | MySQL database data reads | Positive values | - | > 15000 | < 15000 |
2.2.2 | High Writes | Database Writes | MySQL writes performed at a given instant | rate(mysql_global_status_innodb_data_writes{instance=~"pssb.*"}[5m]) | Calculates the rate of writes per second over the last 5 minutes | Positive values | mysql_global_status_innodb_data_writes | Total number of write operations by MySQL | Positive values | > 50 | 50 > writes > 15 | < 15 |
InnoDB Log Waits | InnoDB Log Waits | Indicates that MySQL InnoDB log writes are stalling | increase(mysql_global_status_innodb_log_waits{instance=~"pssb.*"}[5m]) | Triggers the alert if any waits occur | mysql_global_status_innodb_log_waits | The number of times a transaction had to wait | Positive values | > 0 | 0 | |||
2.2.7 | Replication Traffic Received | Average Incoming Transaction Size | Calculates the average size of replicated transactions (in bytes) | rate(mysql_global_status_wsrep_received_bytes{instance=~"pssb.*"}[5m]) | Calculates the rate of replication traffic received, in bytes | Positive values | mysql_global_status_wsrep_received_bytes | Tracks the total number of bytes received via replication. | > 5 KB (5120) | > 1 MB (1048576) | ||
2.2.7 | Replication Traffic Transmitted | Average Replicated Transaction Size | Calculates the average size of the replicated transactions. | rate(mysql_global_status_wsrep_replicated_bytes{instance=~"pssb.*"}[5m]) | Rate of data being replicated, in bytes per second, per instance | mysql_global_status_wsrep_replicated_bytes | Total number of bytes replicated to other nodes. | > 5 KB (5120) | > 1 MB (1048576) | |||
2.2.13 | Replication Latency | Average Replication Latency | Latency refers to the time delay in synchronizing data changes across all nodes in a Galera Cluster | mysql_galera_evs_repl_latency_avg_seconds{instance=~"pssb.*"} | Latency in the Galera replication process of the MySQL 5-node cluster | Positive decimal values (seconds) | mysql_galera_evs_repl_latency_avg_seconds | Average Galera replication latency in seconds | Positive decimal values (seconds) | > 1 | 1 > latency > 0.5 | < 0.5 |
When alerts are triggered, the C3 team receives notifications via email. The C3 team is expected to follow the outlined procedures below.
Data Collection: When an alert is fired, the C3 team should first gather relevant data to understand the source of the issue.
Severity-Based Actions:
Severity-Specific Notifications:
Before taking action on the C3 Remedy, the C3 team should thoroughly review the “Dependent Metrics and Checks” section to ensure all supporting data is understood.
This process ensures effective response and resolution for all alerts based on severity and priority.
A change in the MySQL cluster size count means that nodes (or members) have either joined or left the MySQL cluster. These changes can significantly impact the cluster’s stability, performance, and resource distribution.
The alert is triggered when the MySQL cluster size is not equal to the expected size (e.g., 5 nodes in a 5-node cluster).
To analyze changes in the MySQL cluster size:
Monitor the Current Cluster Size: Check the MySQL Cluster Size Grafana dashboard panel.
Verify Node Visibility Across the Cluster: Check the mysql_global_status_wsrep_cluster_size metric.
The C3 team should focus on basic troubleshooting and operational checks for MySQL cluster node count issues. Follow these steps:
Check the Current Status of the MySQL Cluster Nodes:
SHOW STATUS LIKE 'wsrep_cluster_size';
Expected Output:
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 5 |
+--------------------+-------+
Analyze MySQL Error Logs for the Disconnected Node:
vi + /var/log/mysql/error.log
Restart the MySQL Service on the Disconnected Node:
systemctl restart mysql
Verify Galera Cluster Status and Logs on the Disconnected Node(s):
SHOW GLOBAL STATUS LIKE 'wsrep%';
Key Metrics to Verify:
wsrep_cluster_status:
Primary: Indicates the node is part of the primary component.
Non-Primary: Indicates the node is disconnected or part of a split-brain scenario.
wsrep_connected: Expected value: ON.
Validate Cluster Configuration in MySQL Configuration File:
cat /etc/mysql/mysql.conf.d/mysqld.cnf | grep wsrep_cluster_address
Expected Output:
wsrep_cluster_address=gcomm://pssb1avm001,pssb1avm002,pssb1avm003,pssb1avm004,pssb1avm005
Resolve Errors Identified in MySQL Logs:
vi + /var/log/mysql/error.log
Restart MySQL Services on Affected Nodes After Fixing Errors:
systemctl restart mysql
Validate Cluster Size Restoration:
After all steps, check if the cluster size is restored to the expected value.
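As a quick confirmation, a small shell loop can ask every member how many nodes it currently sees. This is only a sketch: the node names are taken from the wsrep_cluster_address shown above, and it assumes MySQL client credentials are available (for example via ~/.my.cnf).

```
#!/usr/bin/env bash
# Ask each cluster member how many nodes it currently sees.
expected=5
for node in pssb1avm001 pssb1avm002 pssb1avm003 pssb1avm004 pssb1avm005; do
  size=$(mysql -h "$node" -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size';" 2>/dev/null | awk '{print $2}')
  if [ "$size" = "$expected" ]; then
    echo "$node: wsrep_cluster_size=$size (OK)"
  else
    echo "$node: wsrep_cluster_size=${size:-unreachable} (MISMATCH)"
  fi
done
```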
This alert is triggered when a node in the cluster transitions to an out-of-sync state. The alert indicates that one or more nodes may not be fully synchronized with the rest of the cluster, which could lead to degraded performance or data inconsistencies.
To analyze cluster status changes:
Monitor the Node Synchronization Status: Check the Node Role Grafana panel.
Verify Node Sync Details: Check whether wsrep_local_state_comment is set to anything other than Synced.
Collect Logs for Investigation
vi + /var/log/mysql/error.log
wsrep_local_state_comment
wsrep_cluster_status
wsrep_flow_control_paused
Verify Node State and Cluster Status:
Run the following command to check the node’s status:
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
Expected Value: Synced.
Confirm the cluster status:
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
Expected Value: Primary.
Check Error Logs for More Details:
vi + /var/log/mysql/error.log
Restart MySQL Service on Affected Node(s):
systemctl restart mysql
Validate Network Connectivity:
ping <node-IP>
Check SST/IST Progress:
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
Validate Resource Allocation:
Rejoin the Node to the Cluster:
systemctl restart mysql
Validate Sync State After Remedies:
Check wsrep_local_state_comment again to confirm the node is back in the Synced state.
This alert is triggered when the number of threads connected to the MySQL server approaches or exceeds the maximum limit. If this limit is reached, new connections will fail, causing potential issues for applications relying on the database.
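To quantify how close the server is to that limit, a quick check such as the sketch below (assuming performance_schema is enabled, which is the default) compares Threads_connected against max_connections:

```
# Rough headroom check: percentage of max_connections currently in use.
mysql -N -B -e "
  SELECT CONCAT(ROUND(100 * s.cur / v.max_conn, 1), '% of max_connections in use (', s.cur, '/', v.max_conn, ')')
  FROM (SELECT VARIABLE_VALUE AS cur FROM performance_schema.global_status
        WHERE VARIABLE_NAME = 'Threads_connected') AS s,
       (SELECT VARIABLE_VALUE AS max_conn FROM performance_schema.global_variables
        WHERE VARIABLE_NAME = 'max_connections') AS v;"
```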
Monitor Current Thread Connections:
Check the Max Thread Connections Grafana panel to view the number of active connections.
Verify Current Connection Usage:
SHOW STATUS LIKE 'Threads_connected';
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Threads_connected | 11 |
+-------------------+-------+
Identify the Maximum Allowed Connections:
SHOW VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| max_connections | 9999 |
+-----------------+-------+
Analyze Connection Logs:
vi + /var/log/mysql/mysql.log
mysql_global_status_threads_connected
mysql_global_variables_max_connections
mysql_global_status_threads_running
Check the Current Number of Threads Connected:
SHOW STATUS LIKE 'Threads_connected';
Compare the result with the max_connections value.
Identify Problematic Connections:
SHOW FULL PROCESSLIST;
KILL <thread_id>;
Check if the application is opening too many connections without closing them properly.
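One way to spot a connection-leaking application is to group the current processlist by user and client host, for example (a sketch; adjust credentials as needed):

```
# Count open connections per user and client host; leaking apps stand out.
mysql -e "
  SELECT USER, SUBSTRING_INDEX(HOST, ':', 1) AS client_host, COUNT(*) AS connections
  FROM information_schema.PROCESSLIST
  GROUP BY USER, client_host
  ORDER BY connections DESC;"
```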
Increase max_connections:
vi /etc/mysql/mysql.conf.d/mysqld.cnf
max_connections = <value>
Restart MySQL to apply the changes:
systemctl restart mysql
Monitor Connection Errors:
Check whether the Connection_errors_max_connections metric is incrementing to verify if the issue persists.
SHOW STATUS LIKE 'Connection_errors_max_connections';
Output
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| Connection_errors_max_connections | 0 |
+-----------------------------------+-------+
Identify applications that may be opening too many connections and optimize their connection handling.
This alert is triggered when the number of running threads exceeds a defined threshold:
Threads in MySQL represent tasks or queries that the database server is currently processing. When the number of active (non-sleeping) threads increases significantly, it can mean that the database server is under heavy load or facing issues.
A sustained increase in running threads can push the server toward its max_connections limit.
Monitor the current threads
Run the following command to check the thread status:
SHOW GLOBAL STATUS LIKE 'Threads%';
Example output:
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Threads_cached | 6 |
| Threads_connected | 50 |
| Threads_created | 200 |
| Threads_running | 35 |
+-------------------+-------+
Check ongoing queries
Use the SHOW PROCESSLIST command to see active queries and their states:
SHOW PROCESSLIST;
Focus on long-running or resource-intensive queries.
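A filtered view of the processlist makes the long runners easier to spot; the 10-second threshold below is only an assumption and should be tuned to the workload:

```
# List non-sleeping sessions that have been running longer than 10 seconds.
mysql -e "
  SELECT ID, USER, DB, TIME, STATE, LEFT(INFO, 80) AS query_snippet
  FROM information_schema.PROCESSLIST
  WHERE COMMAND <> 'Sleep' AND TIME > 10
  ORDER BY TIME DESC;"
```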
Review the slow query log
Check for inefficient queries that might be causing high load:
vi /var/log/mysql/slow-query.log
Identify max connection errors
Query the metric for errors related to exceeding the max connection limit:
SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';
Basic Checks:
Slow Query Identification:
Restart the MySQL Service (if needed)
systemctl restart mysql
Analyze Running Threads:
SHOW STATUS LIKE 'Threads_running';
Investigate Long-Running Queries:
SHOW FULL PROCESSLIST;
KILL <thread_id>;
Increase Thread Pool Size (If Necessary):
vi /etc/mysql/mysql.conf.d/mysqld.cnf
thread_cache_size = 128
Optimize Applications:
Restart MySQL (if configuration changes are made):
systemctl restart mysql
1. Metric Query: Use the following query to fetch the average replication latency in seconds for all nodes in the cluster:
mysql_galera_evs_repl_latency_avg_seconds{instance=~"pssb.*"}
This metric helps determine the current replication latency across the cluster.
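The same expression can also be pulled from the Prometheus HTTP API if the Grafana panel is unavailable; the Prometheus host below is a placeholder:

```
# Query the replication latency metric directly from Prometheus.
curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=mysql_galera_evs_repl_latency_avg_seconds{instance=~"pssb.*"}' \
  | python3 -m json.tool
```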
2. Node State Verification
3. Galera Metrics: Run the following SQL command on the affected node to get detailed cluster and node statuses:
SHOW STATUS LIKE 'wsrep%';
Sample Output
| wsrep_local_state_comment | Synced |
| wsrep_incoming_addresses | pssb1avm001:3306,pssb1avm002:3306,pssb1abm003:3306,pssb1avm004:3306,pssb1avm005:3306 |
| wsrep_cluster_weight | 5 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.00198006/0.00321617/0.00554355/0.00164676/3 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_cluster_size | 5 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_provider_name | Galera |
| wsrep_ready | ON |
| wsrep_local_send_queue | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_flow_control_paused | 0.000783576 |
Key Metrics to Note:
wsrep_local_state_comment: Indicates the state of the node (e.g., Synced, Donor, Joining).
wsrep_cluster_size: Shows the total number of nodes in the cluster.
wsrep_flow_control_paused: Reveals replication delays due to flow control pauses.
Replication Queues:
wsrep_local_recv_queue: Number of events waiting to be processed locally.
wsrep_local_send_queue: Number of events waiting to be sent to other nodes.
4. Resource Usage: Monitor system resource metrics on all nodes to check for potential bottlenecks:
CPU: top or the Prometheus metric node_cpu_seconds_total.
Memory: free -m or the Prometheus metric node_memory_Active_bytes.
Disk I/O: iostat, iotop, or the Prometheus metrics node_disk_read_bytes_total / node_disk_written_bytes_total.
Network: ping, traceroute.
5. Logs
Check /var/log/mysql/error.log for errors related to replication, network issues, or cluster communication.
journalctl -u mysql
dmesg | grep -i error
Verify network connectivity between nodes to check for packet loss or high latency.
Ping Test Analysis for Packet Loss
Command:
ping -c 10 172.21.0.62
Output:
PING 172.21.0.62 (172.21.0.62) 56(84) bytes of data.
64 bytes from 172.21.0.62: icmp_seq=1 ttl=64 time=0.421 ms
64 bytes from 172.21.0.62: icmp_seq=2 ttl=64 time=0.296 ms
64 bytes from 172.21.0.62: icmp_seq=3 ttl=64 time=0.417 ms
64 bytes from 172.21.0.62: icmp_seq=4 ttl=64 time=0.363 ms
64 bytes from 172.21.0.62: icmp_seq=5 ttl=64 time=0.418 ms
64 bytes from 172.21.0.62: icmp_seq=6 ttl=64 time=0.391 ms
64 bytes from 172.21.0.62: icmp_seq=7 ttl=64 time=0.208 ms
64 bytes from 172.21.0.62: icmp_seq=8 ttl=64 time=0.364 ms
64 bytes from 172.21.0.62: icmp_seq=9 ttl=64 time=0.441 ms
64 bytes from 172.21.0.62: icmp_seq=10 ttl=64 time=0.465 ms
--- 172.21.0.62 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9221ms
rtt min/avg/max/mdev = 0.208/0.378/0.465/0.072 ms
Understanding the Output:
Packet Loss: The output indicates that 0% packet loss occurred. This means all 10 packets were successfully transmitted and received without any loss.
Round-Trip Time (RTT):
Min RTT: 0.208 ms — This is the minimum round-trip time for the packets.
Avg RTT: 0.378 ms — The average round-trip time for all packets.
Max RTT: 0.465 ms — The maximum round-trip time recorded.
Key Points:
In this case, there are no network issues detected based on this ping test output.
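To cover all node-to-node paths quickly, the same check can be repeated against every peer listed in wsrep_cluster_address (a sketch; run it from each node in turn):

```
# Print only the ping summary lines for each cluster peer.
for peer in pssb1avm001 pssb1avm002 pssb1avm003 pssb1avm004 pssb1avm005; do
  echo "--- $peer ---"
  ping -c 3 -q "$peer" | tail -n 2
done
```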
To address high replication latency in a Galera or Percona XtraDB Cluster, it’s important to monitor several dependent metrics that can indicate the root cause. Here’s a breakdown of key metrics to track and how you can get more details from MySQL:
mysql_global_status_wsrep_local_state: This metric represents the state of the node in the cluster. If it's showing a state other than "Synced" (i.e., state 4), it indicates that the node is out of sync, which can cause replication delays.
mysql_global_status_wsrep_local_commits: This metric tracks the number of local commits that have been replicated. A low or fluctuating number might suggest that there are issues in committing transactions to the cluster, which can lead to replication latency.
mysql_global_status_wsrep_received_bytes: This metric shows the amount of data received by the node from other nodes via replication. If there's high replication traffic, the node may struggle to process this data, increasing replication latency.
mysql_global_status_wsrep_replicated_bytes: This tracks the amount of data sent by the node to other nodes. If a node is transmitting large amounts of data that are not being processed fast enough by the receiving nodes, it can lead to delays in replication.
mysql_global_status_innodb_data_reads: High InnoDB data reads can cause I/O bottlenecks, which may affect replication performance, leading to latency if the system is unable to process the read operations quickly enough.
mysql_global_status_innodb_data_writes: Similar to data reads, high write operations to the InnoDB storage engine can lead to bottlenecks, especially if replication has to wait for the writes to complete.
mysql_global_status_innodb_buffer_pool_waits: If this metric is high, it indicates that there is contention for space in the InnoDB buffer pool, which can slow down the processing of reads and writes, ultimately impacting replication speed.
mysql_global_status_innodb_buffer_pool_size: A small buffer pool can lead to slower read and write operations, which can directly affect replication performance. Increasing the buffer pool size might alleviate this issue.
mysql_global_status_threads_running: A high number of active threads indicates that the system is under heavy load. This can cause replication delays, especially if many threads are handling different types of operations concurrently.
mysql_global_status_slow_queries: Slow queries can block threads, causing delays in processing replication data. If slow queries are frequent, they should be optimized to avoid impact on replication latency.
mysql_global_status_innodb_log_waits: This metric counts the number of times transactions had to wait for InnoDB log space. High log waits can indicate that transactions are being delayed, impacting replication performance.
You can obtain more detailed information about many of these metrics and their current values by running the following command:
SHOW STATUS LIKE "%wsrep%";
We can also refer to the MySQL Grafana dashboards; the links are specified above.
Node Resource Issues:
Node in Joining State:
MySQL Runtime Issues:
systemctl restart mysql
Before restarting MySQL, verify that all other nodes are present in the cluster and that only this node is experiencing the problem. If any other problem is found, inform the DevOps team.
Escalation to DevOps:
Optimize Write Operations: Identify heavy write queries (INSERT/UPDATE) and optimize them. Modify database queries so that large transactions are broken into smaller ones where possible.
Hardware Optimization:
Enable Binary Logging Compression:
Enable binary log compression on the source database server to reduce the size of binary log files generated during replication. Compressed binary logs consume less network bandwidth during replication, thereby reducing replication lag.
Steps to Enable:
1. Edit the MySQL configuration file (e.g., /etc/my.cnf) and add this under [mysqld]:
binlog_transaction_compression=ON
2. Restart MySQL:
sudo systemctl restart mysql
3. Verify the setting:
SHOW VARIABLES LIKE 'binlog_transaction_compression';
Tune Replica Configuration Parameters
Optimize innodb_buffer_pool_size in the MySQL configuration to improve data retrieval efficiency if any write waits are occurring.
This alert is triggered when excessive replication traffic is observed between nodes in a master-master MySQL cluster. High traffic can indicate large transactions, frequent updates, or inefficient data operations, potentially leading to replication conflicts, increased network latency, and resource contention.
In a master-master replication setup, two MySQL servers work together, each acting as both a master and a slave. This means that each server sends and receives data from the other server to keep both databases in sync. Let’s break down how this works, and how it affects network traffic between the servers.
In master-master replication, both servers replicate data to each other. When one server makes a change to its database (like adding a new record), it sends that change to the other server. The other server then applies the change to its own database, and the process repeats.
Two threads handle this data exchange: the I/O thread, which receives binary log events from the other server and writes them to the relay log, and the SQL thread, which applies those events to the local database.
If both servers have log-bin
enabled, after applying the changes, the SQL thread also writes those changes to its own binary log and sends them back to the other server.
When one server sends data to the other, it affects two network traffic metrics: Bytes_sent on the sending server and Bytes_received on the receiving server.
Let’s look at how this works in a master-master setup:
SHOW FULL PROCESSLIST
You can monitor what's happening on the servers using the SHOW FULL PROCESSLIST command. This command shows all active processes on the server. In a master-master setup, you might see something like this:
Sample Output:
On Server A:
Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
---|---|---|---|---|---|---|---|---|---|---|
1 | system user | | NULL | Sleep | 18 | wsrep: committed write set | NULL | 18191 | 0 | 0 |
2 | system user | | NULL | Sleep | 2134791 | wsrep: aborter idle | NULL | 2134791569 | 0 | 0 |
136348 | qrtzrw_psdb_sb | pssb1abm003:56308 | qrtz_psdb_sb | Sleep | 21 | | NULL | 21628 | 1 | 1 |
140859 | psrw_sb | pssb1abm003:46952 | psdb_sb | Sleep | 4 | | NULL | 4258 | 0 | 0 |
144696 | root | localhost | NULL | Query | 0 | init | show full processlist | 0 | 0 | 0 |
Here’s what’s happening:
On Server B, you might see something similar:
Sample Output:
Id | User | Host | db | Command | Time | State | Info | Time_ms | Rows_sent | Rows_examined |
---|---|---|---|---|---|---|---|---|---|---|
1 | system user | | NULL | Sleep | 2745965 | wsrep: aborter idle | NULL | 2745965341 | 0 | 0 |
2 | system user | | NULL | Sleep | 14 | wsrep: committed write set | NULL | 14104 | 0 | 0 |
179129 | qrtzrw_psdb_sb | pssb-ms-node02:44554 | qrtz_psdb_sb | Sleep | 7 | | NULL | 7900 | 1 | 1 |
186172 | root | localhost | psdb_sb | Query | 0 | init | show processlist | 0 | 0 | 0 |
In master-master replication, Bytes_sent and Bytes_received help you monitor the amount of data flowing between the two servers. If these values increase, it might indicate heavy replication traffic or large queries.
In a master-master replication setup, both servers are constantly sending and receiving data to keep their databases in sync. This results in changes to the Bytes_sent and Bytes_received values, which reflect the traffic between the two servers. By monitoring these metrics, you can track replication health and identify any potential issues with data flow between the servers.
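Because Bytes_sent and Bytes_received are cumulative counters, the useful number is their rate of change. A rough sketch of how to sample them over a 10-second window on one server:

```
# Estimate MySQL network throughput from two samples of the counters.
before=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Bytes_%';" | awk '{print $2}')
sleep 10
after=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Bytes_%';" | awk '{print $2}')
# Output order is alphabetical: line 1 = Bytes_received, line 2 = Bytes_sent.
recv_rate=$(( ( $(echo "$after" | sed -n 1p) - $(echo "$before" | sed -n 1p) ) / 10 ))
sent_rate=$(( ( $(echo "$after" | sed -n 2p) - $(echo "$before" | sed -n 2p) ) / 10 ))
echo "received: ${recv_rate} B/s   sent: ${sent_rate} B/s"
```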
Network Traffic Analysis:
Use iftop, vnstat, or monitoring systems (e.g., Grafana) to measure data transfer rates between nodes.
Binary Log Inspection:
SHOW BINLOG EVENTS IN 'mysql-bin.000001';
Replace mysql-bin.000001 with the actual binary log file name.
Bytes Sent and Received: This MySQL query retrieves global status variables related to the number of bytes sent and received by the server. These variables provide insights into network traffic generated by MySQL.
SHOW GLOBAL STATUS LIKE 'Bytes_%';
Example Output
+----------------+-------------+
| Variable_name | Value |
+----------------+-------------+
| Bytes_received | 1267179425 |
| Bytes_sent | 30373626016 |
+----------------+-------------+
2 rows in set (0.00 sec)
Conflict Detection:
systemctl status mysql
Check for conflicts in the error.log file. Conflicts are common and are not a problem as long as no consistency issues occur.
grep -i conflict /var/log/mysql/error.log
Example Output
2024-10-31T01:05:01.148460Z 14 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
THD: 14, mode: high priority, state: exec, conflict: executing, seqno: 35492
THD: 3581, mode: local, state: exec, conflict: committing, seqno: 35493
2024-10-31T01:05:01.148572Z 14 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
THD: 14, mode: high priority, state: exec, conflict: executing, seqno: 35492
THD: 3581, mode: local, state: exec, conflict: must_replay, seqno: 35493
These errors are caused by high replication traffic and may lead to inconsistency when conflicts occur frequently.
Resource Monitoring:
Binary Log Size
Monitor the size and frequency of binary logs generated during replication. Excessive growth can impact performance.
Check with:
SHOW BINARY LOGS;
Look for:
Log_name, File_size: Check for unexpected increases in log file size.
: Check for unexpected increases in log file size.Network Bandwidth
Measure data transfer between nodes. High transfer rates can indicate heavy replication traffic or network issues.
Check with (Linux):
iftop -i eth0
Look for: sustained high transfer rates between the cluster nodes.
Disk I/O
Monitor disk read/write operations for relay and binary logs. High disk I/O can impact replication speed.
Check with (Linux):
iostat -x 1
Look for: high await or svctm values, which indicate disk latency and potential replication delays.
Check CPU, RAM, and system load in the OS Monitoring section and identify potential bottlenecks.
Restart MySQL Server
If there is inconsistency or the MySQL server has gone into an unknown state, a restart can often help restore normal replication behavior.
Action:
sudo systemctl restart mysql
Try adding RAM and CPU to the server that is experiencing high replication traffic, then monitor the server's performance and traffic.
Wait for Resolution
After making the above adjustments, monitor the system to see if the issue resolves. If the server stabilizes and replication traffic returns to normal, no further action may be needed.
Action:
Escalate to DevOps Team
If the issue does not resolve after restarting the server and adjusting resources, escalate the issue to the DevOps team for further investigation.
Action:
Note: Delaying the escalation to DevOps or failing to inform them of the issue in a timely manner can lead to increased latency and performance degradation across the entire cluster. Ensure that any alerts or issues are communicated promptly to avoid further impact.
Review innodb_buffer_pool_size, max_connections, and other resource-intensive settings.
innodb_buffer_pool_size:
SET GLOBAL innodb_buffer_pool_size = <new_value>;
Understanding InnoDB log waits
In InnoDB, a log buffer is a memory structure that stores the data modification before being written to the log files on disk. An optimal size of the log buffer can improve performance by reducing the frequency of expensive I/O operations.
To check whether you need to adjust the log buffer size, you can use the innodb_log_waits
status variable. InnoDB log waits occur when a transaction is unable to write to the log buffer because it is full. The innodb_log_waits variable stores the number of times such waits have occurred.
If the innodb_log_waits variable is greater than zero, you likely need to increase the size of the log buffer.
Checking InnoDB Log Waits First, connect to the MySQL server:
mysql -u root -p
Second, check the innodb_log_waits value using the following query:
SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';
Output:
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| Innodb_log_waits | 0 |
+------------------+-------+
1 row in set (0.00 sec)
To troubleshoot and resolve the InnoDB waiting state issue, the C3 team should collect the following data:
MySQL Service Status
Check if the MySQL service is running and identify any service-level issues:
systemctl status mysql
OR
systemctl status mysql@bootstrap
Disk Usage
Ensure there is sufficient disk space available for InnoDB log writes:
df -h
InnoDB Engine Status
Review the detailed status of the InnoDB engine to detect locks or stalls:
mysql -e "SHOW ENGINE INNODB STATUS\G"
Go through the TRANSACTIONS section in the output:
TRANSACTIONS
------------
Trx id counter 2844915
Purge done for trx's n:o < 2844915 undo n:o < 0 state: running but idle
History list length 2
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421815622005464, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 421815622006328, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 421815622004600, not started
InnoDB Log File Size
Verify the size of the InnoDB transaction log files to identify potential issues:
ls -lh /data/db/pxc/mysql
Sample Output
-rw-r----- 1 mysql mysql 1.1M Nov 8 13:28 '#ib_16384_3.dblwr'
-rw-r----- 1 mysql mysql 64K Dec 9 08:42 '#ib_16384_4.dblwr'
-rw-r----- 1 mysql mysql 1.1M Nov 8 13:28 '#ib_16384_5.dblwr'
-rw-r----- 1 mysql mysql 64K Dec 9 08:43 '#ib_16384_6.dblwr'
-rw-r----- 1 mysql mysql 1.1M Nov 8 13:28 '#ib_16384_7.dblwr'
-rw-r----- 1 mysql mysql 64K Dec 9 08:43 '#ib_16384_8.dblwr'
-rw-r----- 1 mysql mysql 1.1M Nov 8 13:28 '#ib_16384_9.dblwr'
drwxr-x--- 2 mysql mysql 4.0K Dec 9 08:12 '#innodb_redo'
Slow Query Analysis
Search the MySQL logs for slow queries that might be causing contention:
grep -i "slow" /var/log/mysql/error.log
Error Log Analysis
Identify any critical errors in the MySQL log file that could be causing the waiting state:
grep -i "error" /var/log/mysql/error.log
Connection Metrics
Review active MySQL connections for potential bottlenecks or saturation:
netstat -anp | grep mysql
Example Output
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 4194/mysqld
tcp 0 0 172.21.0.63:46096 172.21.0.65:4567 ESTABLISHED 4194/mysqld
tcp 0 0 172.21.0.63:46002 172.21.0.64:4567 ESTABLISHED 4194/mysqld
tcp 0 0 172.21.0.63:4567 172.21.0.61:45638 ESTABLISHED 4194/mysqld
tcp 0 0 172.21.0.63:4567 172.21.0.62:37702 ESTABLISHED 4194/mysqld
tcp6 0 0 :::9104 :::* LISTEN 738/mysqld_exporter
tcp6 0 0 :::33060 :::* LISTEN 4194/mysqld
tcp6 0 0 :::3306 :::* LISTEN 4194/mysqld
System Load
Analyze server load and resource usage to detect high contention or lack of available resources:
top
Collecting this data will provide a comprehensive view of the system and help diagnose the root cause of the InnoDB waiting state.
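To make escalation easier, the individual checks above can be bundled into a single snapshot file attached to the ticket. This is a sketch; the paths follow this document and may need adjusting for your hosts.

```
# Collect a one-shot diagnostic snapshot for the InnoDB log waits alert.
out=/tmp/innodb_log_waits_$(date +%F_%H%M).txt
{
  date
  systemctl status mysql --no-pager | head -n 5
  df -h
  mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';"
  tail -n 50 /var/log/mysql/error.log
} > "$out"
echo "snapshot written to $out"
```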
Write Activity
Operating System Alerts
OS Metrics Panels
Prometheus Metrics
mysql_global_status_innodb_log_waits{instance=~"your_instance"}
Additional Factors
Monitoring these dependent metrics is essential to diagnose and resolve InnoDB log waits efficiently.
Initial Assessment by C3
Review OS Metrics and Dashboards
Begin by reviewing the relevant OS metrics and dashboards, looking for any abnormalities or spikes in key resource areas such as:
If any abnormal metrics are identified in the OS monitoring tools, refer to the OS Observability Documentation to troubleshoot and resolve the issue. This could involve addressing I/O issues, optimizing system resources, or adjusting hardware allocations.
Escalation to DevOps Team
If C3 is unable to resolve the issue using the OS-level documentation or if the issue persists beyond OS resource adjustments, the problem may lie deeper within the MySQL configuration or system infrastructure that requires more advanced troubleshooting.
The DevOps team may need to review the MySQL configuration, disk setup, or perform more advanced debugging to identify the root cause of the InnoDB log waits.
By following this process, C3 ensures that the issue is assessed at the OS level first, with clear guidance to escalate the problem to DevOps when necessary. This helps ensure that MySQL InnoDB log waits are addressed promptly and efficiently, minimizing the impact on performance.
Check InnoDB Log Waits
Connect to the MySQL server:
mysql -u root -p
Check the Innodb_log_waits value using the following query:
SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';
If Innodb_log_waits is 0, this indicates that the InnoDB log buffer size is sufficient.
Configure InnoDB Log Buffer Size
Update the MySQL Configuration:
Edit the MySQL configuration file (e.g., /etc/mysql/mysqld.conf.d/mysqld.cnf) and set innodb_log_buffer_size to an appropriate value based on available memory and workload requirements.
innodb_log_buffer_size = 1024M
Restart MySQL:
systemctl restart mysql
Verify the Configuration Change
mysql -u root -p
Check innodb_log_buffer_size to ensure it has been updated:
SHOW GLOBAL VARIABLES LIKE 'innodb_log_buffer_size';
+------------------------+----------+
| Variable_name | Value |
+------------------------+----------+
| innodb_log_buffer_size | 33554432 |
+------------------------+----------+
SELECT ROUND(@@innodb_log_buffer_size / 1024 / 1024, 0);
This returns the buffer size in megabytes (e.g., 32MB).
Adjust for High Log Waits
If Innodb_log_waits continues to show non-zero values after increasing the log buffer size, consider:
Optimize MySQL Configuration for Performance
innodb_flush_log_at_trx_commit: Set this to 2 for less frequent flushing to disk (the default is 1).
innodb_write_io_threads: Increase the number of write I/O threads to better utilize CPU resources for disk writes.
Monitor and Validate Changes
Continue to monitor the Innodb_log_waits status and other related metrics (e.g., disk I/O and CPU utilization) to ensure that the issue is resolved and no new bottlenecks have emerged.
Collaborate with C3
By following these steps, the DevOps team can effectively resolve MySQL InnoDB log waits and ensure better performance of the MySQL server.
The “High MySQL Reads” alert is triggered when MySQL experiences a significant increase in the number of read operations. This could be an indication of database stress, inefficient queries, or issues with resource allocation. High read activity can result in high CPU usage, slow queries, or even server crashes in severe cases.
1. Check MySQL Global Status: Review global status metrics related to read activity:
SHOW GLOBAL STATUS LIKE 'Com_select'; -- Number of SELECT queries executed
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd'; -- Table scans performed by MySQL
SHOW GLOBAL STATUS LIKE 'Handler_read_next'; -- Index scans performed by MySQL
These queries provide insight into the total number of read operations and the types of reads being performed.
2. Review Grafana Dashboards: We have dedicated Grafana dashboards to monitor reads. Navigate to the “MySQL Reads” section to understand the metrics in real time. Use the following queries to view read activity:
rate(mysql_global_status_innodb_data_reads{instance=~"pssb.*"}[5m])
Apply this query to the Prometheus dashboard to track any changes in read performance over a 5-minute interval.
3. Check System Resource Utilization: High read activity could lead to resource strain. Use the following commands to check system health:
top
free -m
iostat -x 1 10
4. Check MySQL Connections: Ensure there are no unusually high numbers of active connections, which could indicate excessive reads:
netstat -anp | grep mysql
5. Check for Locking Issues: High read activity can sometimes be correlated with locking. Check for table or row locks:
SHOW ENGINE INNODB STATUS\G
SELECT Queries: Monitors the number of SELECT queries being executed.
Rows Examined: Tracks how many rows are being examined by queries, indicating the efficiency of your queries.
Slow Queries: Displays the count of queries that take longer than the specified threshold, helping to identify performance bottlenecks.
These panels help identify spikes in read activity, inefficient queries, or performance degradation in read operations.
Monitor Buffer Pool Efficiency
Buffer Pool Reads: Shows how many reads are occurring in the buffer pool, indicating the frequency of accessing disk data.
Buffer Pool Hit Rate: Tracks the percentage of requests served from the buffer pool, where a lower hit rate suggests increased disk I/O.
Monitoring these panels helps identify if the buffer pool is large enough to handle the workload, reducing the need for expensive disk reads.
Check InnoDB and Locking Metrics
Lock Waits: Monitors the number of lock waits, which can signal contention and delays.
InnoDB Read-Write Operations: Displays the read and write operations within InnoDB, helping to identify bottlenecks.
InnoDB Data Reads/Writes: Tracks data reads and writes at the InnoDB storage level, which can highlight inefficiencies in data access.
These panels help identify lock contention and inefficiencies in InnoDB storage, which can directly impact read performance.
Diagnose High Read Activity:
If Com_select is high, it indicates a large volume of SELECT queries; wait to see whether the issue resolves itself. If not, proceed to the next step.
Increase System Resources: If high read activity is causing performance degradation, consider increasing resources (CPU, RAM) on the server to handle the load. You can increase memory or CPU by adjusting server configurations or provisioning more powerful hardware.
Restart MySQL Service: If the issue is related to server state (e.g., memory issues, resource starvation), restarting MySQL can help clear any temporary issues:
systemctl restart mysql
Monitor System Health:
iostat
, top
, and free
to ensure that the system has enough resources to handle high read activity.The “High MySQL Writes” alert is triggered when MySQL experiences a significant increase in write operations, which could indicate issues such as inefficient queries, inadequate indexing, or resource constraints. High write activity can lead to increased disk I/O, slower performance, and potential database contention. Similarly, “Low MySQL Writes” may indicate underutilization of the database, suggesting potential problems with write-intensive applications.
1. Check MySQL Global Status: Review global status metrics related to write activity:
SHOW GLOBAL STATUS LIKE 'Com_insert'; -- Number of INSERT queries executed
SHOW GLOBAL STATUS LIKE 'Com_update'; -- Number of UPDATE queries executed
SHOW GLOBAL STATUS LIKE 'Com_delete'; -- Number of DELETE queries executed
These queries provide insight into the total number of write operations (INSERT, UPDATE, DELETE) and the types of writes being performed.
2. Review Grafana Dashboards: We have dedicated Grafana dashboards to monitor write activity. Navigate to the “MySQL Writes” section to track write operations in real time. Use the following queries to view write activity:
rate(mysql_global_status_innodb_data_writes{instance=~"pssb.*"}[5m])
Apply this query to the Prometheus dashboard to track any changes in write performance over a 5-minute interval.
3. Check System Resource Utilization: High or low write activity can lead to resource strain or underutilization. Use the following commands to check system health:
top
free -m
iostat -x 1 10
4. Check MySQL Connections: Ensure there are no unusually high numbers of active connections, which could indicate excessive writes:
netstat -anp | grep mysql
5. Check for Locking Issues: High write activity can sometimes be correlated with locking. Check for table or row locks:
SHOW ENGINE INNODB STATUS\G
Example Output
TRANSACTIONS
--------------
Trx id counter 0 89026
---
TRANSACTION 0 89026, ACTIVE 25 sec
mysql tables in use 1, locked 1
LOCK WAIT 1 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 4235, OS thread handle 140595728161024, query id 118572 localhost root updating
UPDATE my_table SET column_name = 'value' WHERE id = 1000
In this example, you can see that Trx id 89026 is waiting for a lock and is involved in a transaction that is holding a lock. This indicates a lock wait situation where one transaction is blocked by another. If high write activity is causing locking issues, you will likely see a lot of “LOCK WAIT” and “waiting for lock” entries. These entries can guide you in diagnosing transaction blocking, deadlocks, and lock contention that may be affecting performance.
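If the sys schema is installed (it is by default on MySQL 5.7 and later), the lock-wait picture is easier to read than the raw engine status; a minimal sketch:

```
# Show who is blocking whom, including the blocking query and process id.
mysql -e "SELECT wait_started, waiting_query, blocking_pid, blocking_query
          FROM sys.innodb_lock_waits\G"
```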
Track Write Performance & Frequency
INSERT Queries: Monitors the number of INSERT queries being executed.
UPDATE Queries: Tracks the number of UPDATE queries being executed.
DELETE Queries: Monitors the number of DELETE queries being executed.
These panels help identify spikes in write activity, inefficient queries, or performance degradation in write operations.
Monitor Buffer Pool Efficiency
Buffer Pool Writes: Shows how many writes are occurring in the buffer pool, indicating the frequency of accessing disk for data modification.
Buffer Pool Hit Rate: Tracks the percentage of requests served from the buffer pool, where a lower hit rate suggests increased disk I/O due to write activity.
Monitoring these panels helps identify if the buffer pool is large enough to handle the workload, reducing the need for expensive disk writes.
Check InnoDB and Locking Metrics
Lock Waits: Monitors the number of lock waits, which can signal contention and delays during write operations.
InnoDB Read-Write Operations: Displays the read and write operations within InnoDB, helping to identify bottlenecks.
InnoDB Data Reads/Writes: Tracks data reads and writes at the InnoDB storage level, highlighting inefficiencies in data access and modification.
OS Metrics: Review system performance metrics (CPU, memory, disk) to identify potential resource bottlenecks that may be contributing to the increased or decreased writes.
The C3 team, as the first level of support, is responsible for identifying, mitigating, and documenting the issue without making any configuration changes to MySQL. The following steps outline how the C3 team can manage high write activity:
1. Redistribute Write Load
2. Monitor and Report Resource Utilization
rate(mysql_global_status_innodb_data_writes[5m])
3. Verify and Clear Disk Space
df -h
4. Notify Application Teams
5. Restart Impacted Applications or Services
6. Collaborate with DevOps for Long-Term Solutions
By following these steps, the C3 team can effectively manage high write scenarios, minimize their impact, and ensure proper escalation for permanent resolutions.
If MySQL write operations are abnormally high, it may indicate performance bottlenecks or configuration issues. Here’s a structured approach for DevOps to analyze and address the problem.
Examine Write Metrics
Use Prometheus or Grafana dashboards to analyze write rates over time.
rate(mysql_global_status_innodb_data_writes[5m])
Check Application Logs
Monitor Disk I/O Performance
Use tools like iostat
or vmstat
to monitor disk write speeds and identify bottlenecks:
iostat -x 1 5
Output example:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 5.00 2.00 8.00 50.00 200.00 30.00 0.05 1.00 0.50 40.00
High w/s (writes per second) and %util nearing 100% indicate a saturated disk.
Investigate Disk Latency
Check the time taken for write operations. If latency is high:
innodb_flush_log_at_trx_commit
.Buffer Pool Size
Increase the innodb_buffer_pool_size
to reduce direct disk writes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4 GB
Log Buffer Size
Ensure the innodb_log_buffer_size
is adequate to handle bursts of writes:
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
SET GLOBAL innodb_log_buffer_size = 128 * 1024 * 1024;  -- 128 MB
Check Table Indexes
Query Example:
Identify index usage:
SELECT TABLE_NAME, INDEX_NAME, NON_UNIQUE
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'your_database';
Monitor Locking Issues
Use SHOW ENGINE INNODB STATUS\G to analyze transaction locks.
------------
TRANSACTIONS
------------
Trx id counter 162053
Purge done for trx's n:o < 162050
History list length 3
---
High transaction contention indicates the need to optimize queries or table design.
Implement Optimistic Concurrency Control
If lock contention is a bottleneck, consider adjusting application logic to minimize simultaneous writes.
SHOW VARIABLES LIKE 'log_bin';
The alert “High InnoDB Buffer Pool Usage” is triggered when the MySQL InnoDB buffer pool is heavily utilized, nearing or exceeding its configured size. The buffer pool is a key component of MySQL’s performance, acting as a memory cache for frequently accessed data and indexes. High usage can indicate efficient use of the buffer pool or point to memory pressure and the need for optimization.
High InnoDB buffer pool usage often occurs when there is an increase in read or write requests, as MySQL leverages the buffer pool to cache frequently accessed data and indexes for better performance. This alert requires attention to system resources and workload patterns to prevent degraded performance or system instability.
The C3 team, as the first-level responders, collects information to confirm the issue and determine if further escalation is needed.
Check Buffer Pool Usage
Use the following query to monitor buffer pool usage:
SHOW ENGINE INNODB STATUS\G
Look for the following sections in the output:
BUFFER POOL AND MEMORY
Total memory allocated 8589934592; in additional pool allocated 0
Database pages 409600
Free buffers 0
Monitor Memory Utilization
The query below shows the usage of the buffer pool in percentage. Use the Prometheus metric for real-time tracking:
(mysql_global_status_innodb_buffer_pool_bytes_data{instance="$host"} / mysql_global_variables_innodb_buffer_pool_size{instance="$host"}) * 100
Check system memory: Run the following command to assess overall system memory utilization:
free -h
Ensure there is no memory contention between MySQL and other processes. We have configured memory limits on the node for the buffer pool size.
Disk I/O Activity
Examine disk usage if the buffer pool is insufficient:
iostat -x 1 5
High await
and %util
indicate disk pressure due to inadequate buffer pool size.
Query Statistics
Analyze queries contributing to buffer pool pressure:
SHOW GLOBAL STATUS LIKE 'Innodb_pages_read';
SHOW GLOBAL STATUS LIKE 'Innodb_pages_written';
High read/write operations can point to unoptimized queries or heavy workload.
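A related check is the buffer pool hit ratio, computed directly in SQL from the buffer pool read counters (a sketch; assumes performance_schema is enabled). Values well below roughly 99% usually mean the pool is too small for the working set:

```
# Buffer pool hit ratio from Innodb_buffer_pool_reads vs. read requests.
mysql -N -B -e "
  SELECT ROUND(100 - 100 * r.v / q.v, 2) AS hit_ratio_pct
  FROM (SELECT VARIABLE_VALUE AS v FROM performance_schema.global_status
        WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') AS r,
       (SELECT VARIABLE_VALUE AS v FROM performance_schema.global_status
        WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS q;"
```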
Key Metrics to Monitor:
1 - (rate(mysql_global_status_innodb_buffer_pool_reads[5m]) / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]))
CPU and Disk Correlation:
Table/Index Access Patterns: This refers to how MySQL handles table and index access during queries. If tables are accessed frequently without proper indexes, MySQL may perform full table scans instead of efficiently retrieving data via indexes.
rate(mysql_global_status_handler[5m])
Memory Pressure:
Check swap activity with vmstat; swapping (si/so columns) may exacerbate MySQL performance issues.
The C3 team does not modify MySQL configuration but ensures initial triaging of the issue:
Validate Memory and Disk I/O:
Check memory usage (free -h) and disk I/O metrics (iostat -x).
Monitor Read/Write Patterns:
Escalate Appropriately:
DevOps ensures optimal configuration and resolves underlying issues:
Increase Buffer Pool Size:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
Add the following under the [mysqld] section:
innodb_buffer_pool_size = 2G
systemctl restart mysql
Optimize Query Patterns:
SHOW GLOBAL STATUS LIKE 'Com_select';
SHOW GLOBAL STATUS LIKE 'Com_update';
Improve Memory Utilization:
Set vm.swappiness=1 in /etc/sysctl.conf to prioritize application memory over swap.
Evaluate Workload Distribution:
The “High InnoDB Buffer Pool Usage” alert indicates potential memory pressure and degraded database performance. By carefully monitoring, analyzing, and optimizing buffer pool settings and workload patterns, the issue can be resolved efficiently, ensuring stable and high-performing MySQL operations.
Here is the documentation for the MySQL Uptime alert:
The MySQL Uptime alert triggers when the uptime of a MySQL server falls below an expected threshold. This indicates that the MySQL service has restarted recently, either due to planned maintenance or unexpected failure. It is essential to investigate the root cause promptly to prevent data loss, operational disruption, or performance degradation.
The C3 team is responsible for collecting the following information to assess the situation and escalate if necessary:
Check MySQL Uptime:
SHOW GLOBAL STATUS LIKE 'Uptime';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Uptime | 120 | # Uptime in seconds
+---------------+-------+
Verify System Logs:
journalctl -u mysql.service --since "1 hour ago"
Look for messages such as "MySQL started" or errors causing restarts.
Check Server Resources:
top
df -h
free -m
Review MySQL Error Log:
tail -n 50 /var/log/mysql/error.log
Database Connection Metrics:
SHOW GLOBAL STATUS LIKE 'Aborted_connects';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
The following Prometheus metrics may show changes correlating with low MySQL uptime:
Uptime Metrics:
mysql_global_status_uptime{instance=~"your-instance"}
Connection Failures:
rate(mysql_global_status_aborted_connects[5m])
CPU and Memory Usage:
node_memory_Active_bytes{instance=~"your-instance"}
node_cpu_seconds_total{instance=~"your-instance"}
The C3 team should take the following steps to resolve or escalate the issue:
Restart Confirmation:
Resource Management:
Error Review:
Escalation:
Analyze Logs and Metrics:
Optimize MySQL Configuration:
max_connections
innodb_buffer_pool_size
thread_cache_size
Inspect Hardware and OS Issues:
dmesg | grep -i error
smartctl -a /dev/sdX
Reconfigure Monitoring and Alerts:
Implement Failover Mechanisms:
Grafana Dashboard Panel Link
The MySQL Slow Queries alert triggers when the number of queries taking longer than the configured threshold (long_query_time) exceeds acceptable limits. Slow queries can impact overall database performance, lead to increased resource usage, and degrade application response times. Causes of slow queries include missing indexes, poorly optimized queries, and insufficient server resources.
The C3 team collects relevant data to investigate and identify the potential causes of slow queries:
Verify Slow Query Log Status: Check if the slow query log is enabled and identify the configured long_query_time threshold:
SHOW VARIABLES LIKE 'slow_query_log';
SHOW VARIABLES LIKE 'long_query_time';
Example Output:
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| slow_query_log | ON |
| long_query_time | 10.00 |
+-----------------+-------+
A long_query_time of 10.0 seconds means any query taking longer than 10 seconds to execute is considered a long-running query. This usually indicates performance issues such as inefficient query design, lack of indexing, or high system load.
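To see whether slow queries are still being produced right now, the cumulative Slow_queries counter can be watched for a couple of minutes (a sketch; the refresh interval is arbitrary):

```
# If the value keeps climbing between refreshes, slow queries are ongoing.
watch -n 10 "mysql -N -B -e \"SHOW GLOBAL STATUS LIKE 'Slow_queries';\""
```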
Examine the Slow Query Log: Retrieve entries from the slow query log to identify problematic queries:
tail -n 50 /var/log/mysql/slow.log
Check Server Load and Resource Usage: Inspect system-level metrics for CPU, memory, and disk I/O:
top
vmstat
iostat -x 1 5
Identify slow query patterns: Are they frequent? Are they related to specific tables or queries?
Prometheus Metric (Slow Queries Rate): Monitor the rate of slow queries in Prometheus:
rate(mysql_global_status_slow_queries[5m])
Query Execution Time: Slow queries increase average execution time, leading to delays in processing and query completion.
PromQL:
rate(mysql_global_status_slow_queries[5m])
Number of Slow Queries: A high count of slow queries indicates inefficient query patterns or resource bottlenecks.
PromQL:
mysql_global_status_slow_queries
CPU Usage: Slow queries often involve complex operations, such as sorting or joins, which increase CPU consumption and strain processing power.
PromQL:
rate(node_cpu_seconds_total{mode="user"}[5m]) * 100
Memory Utilization: Inefficient queries consuming large datasets can lead to higher memory usage, potentially causing swapping or out-of-memory issues.
PromQL:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
Disk I/O Metrics: Slow queries performing full table scans or large read/write operations increase disk I/O, leading to higher latency and throughput.
PromQL:
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
Active Connections: Slow queries hold connections longer, increasing the number of threads waiting or running, which can exhaust connection limits.
PromQL:
mysql_global_status_threads_connected
mysql_global_status_threads_running
Query Cache Efficiency: If query results are not cached or cache misses occur, slow queries further degrade performance by re-executing resource-heavy operations.
PromQL:
(mysql_global_status_qcache_hits / (mysql_global_status_qcache_hits + mysql_global_status_qcache_inserts)) * 100
InnoDB Buffer Pool Usage: Slow queries with large data scans or updates can fill the buffer pool, causing evictions and increased disk reads/writes.
Refer to the panel.
Table or Row Lock Contention: Slow queries often hold locks for longer durations, increasing lock waits and contention, which impacts other queries.
PromQL:
rate(mysql_global_status_innodb_row_lock_waits[5m])
rate(mysql_global_status_innodb_row_lock_time[5m])
Refer to the Table Lock dashboard panel link.
Query Per Second (QPS): A spike in slow queries reduces the overall QPS due to longer execution times, leading to slower response rates for incoming requests.
PromQL:
rate(mysql_global_status_queries[5m])
By monitoring these metrics, you can assess how slow queries impact MySQL’s overall performance and identify areas for optimization.
SHOW VARIABLES LIKE 'slow_query_log';
If the log is disabled, escalate to the DevOps team to enable it. C3 does not have the authority to change configurations.
Monitor mysql_global_status_slow_queries in Prometheus or Grafana dashboards. Identify spikes or abnormal trends in the slow query rate.
rate(mysql_global_status_handler_read_rnd_next[5m])
SHOW ENGINE INNODB STATUS\G
SHOW FULL PROCESSLIST;
The DevOps team has the authority to make necessary configuration changes, analyze database performance in depth, and implement optimizations to resolve slow query issues effectively. Below is a detailed remedy process for addressing the “MySQL Slow Queries” alert:
Enable and Configure the Slow Query Log
my.cnf
or my.ini
) to activate it:
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1
systemctl restart mysql
Analyze the Slow Query Log
mysqldumpslow
tool to summarize the slow queries:
mysqldumpslow -s c -t 10 /var/log/mysql/slow.log
cat /var/log/mysql/slow.log
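If Percona Toolkit is available on the host (an assumption), pt-query-digest is a heavier-weight alternative to mysqldumpslow that ranks query fingerprints by impact; the log path follows the configuration above:

```
# Summarize the slow log and show the top 10 query fingerprints.
pt-query-digest --limit 10 /var/log/mysql/slow.log | less
```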
Optimize Index Usage
EXPLAIN SELECT * FROM table WHERE condition;
CREATE INDEX idx_column_name ON table_name (column_name);
Tune MySQL Configuration Parameters
query_cache_size = 64M
query_cache_type = 1
innodb_buffer_pool_size = 70% of available memory
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 2
thread_cache_size = 16
max_connections = 200
systemctl restart mysql
Optimize Query Execution
Avoid SELECT *; explicitly list columns.
Analyze Locking and Concurrency
SHOW ENGINE INNODB STATUS\G
Adjust the innodb_lock_wait_timeout parameter if locks are frequently causing delays:
innodb_lock_wait_timeout = 50
Monitor Resource Usage
Use system tools (htop, vmstat, or OS dashboards) to analyze CPU, memory, and disk I/O.
Collaborate with Application Teams
Implement Long-Term Solutions