MySQL monitoring maturity model: from survival to expert
Production MySQL incidents rarely fail because nobody was watching. They fail because the team was watching the wrong signals. A replica reporting Seconds_Behind_Source = 0 while its I/O thread is stopped hides a stopped replica. A healthy buffer pool hit ratio while checkpoint age nears a stall hides an impending write freeze. Both create false safety. This model maps exactly what to instrument at each stage of operational maturity.
The four levels are additive. Level 1 confirms the server is alive. Level 2 confirms the workload is healthy. Level 3 provides leading indicators of saturation and internal backlog. Level 4 gives forensic precision to separate a forgotten transaction from a storage stall, and to catch replication divergence before it becomes split-brain. Skip a level and you inherit its blind spots.
This guide assumes InnoDB and covers MySQL 5.7 through 8.x. Modern terminology (REPLICA / SOURCE) is used with legacy equivalents noted for 5.7. Use the tables to audit your current monitoring stack or decide which alert to tune before the next peak.
flowchart TD
L1[Level 1: survival]
L2[Level 2: operational]
L3[Level 3: mature]
L4[Level 4: expert]
L1 --> L2
L2 --> L3
L3 --> L4Level 1: survival
Survival monitoring answers one question: is MySQL up and reachable? If you have only these signals, you will know when the server is down, but you will have zero warning before a saturation failure. This is the baseline for every production instance.
| Focus | Key metric or source | What to watch |
|---|---|---|
| Liveness | SELECT 1 or TCP connect to port 3306 | The probe must return inside the timeout. Success confirms the full auth and query path is accepting work. |
| Uptime | SHOW GLOBAL STATUS LIKE 'Uptime' | A reset near zero signals a crash, OOM kill, or unplanned restart. |
| Connection limit | Threads_connected / max_connections | Fetch max_connections from SHOW GLOBAL VARIABLES LIKE 'max_connections'. A ratio approaching 1.0 means new connections are refused instantly. |
| Replication state | SHOW REPLICA STATUS\G | Replica_IO_Running and Replica_SQL_Running must both be Yes. In 5.7, use SHOW SLAVE STATUS\G. Also check Last_IO_Errno and Last_SQL_Errno for non-zero codes. |
| Disk space | OS-level free space on the data volume | MySQL halts writes and can crash when the partition fills. Monitor both the data directory and the temporary directory (tmpdir). |
| Error log | Tail of the MySQL error log | Look for [ERROR] lines, assertion failures, and crash recovery messages. Rotate logs aggressively enough that you do not lose context during a crash loop. |
Level 2: operational
Operational monitoring adds workload quality and contention awareness. These signals distinguish an idle server from an oversubscribed one, and they catch query regressions before they cascade into connection exhaustion.
| Focus | Key metric or source | What to watch |
|---|---|---|
| Query throughput | Questions rate (per second) | A sustained drop while Threads_connected is stable indicates blocking or saturation. |
| Active concurrency | SHOW GLOBAL STATUS LIKE 'Threads_running' | As a rule of thumb, keep this below CPU core count during normal load. Spikes above core count indicate queuing or contention. If you use thread_pool, this still measures active queries, not pooled threads. |
| Slow query rate | Slow_queries delta with long_query_time <= 1 s | The default of 10 s misses most operationally relevant regressions. Track the rate, not the absolute count. Tuning long_query_time requires restarting the metric baseline; do not compare rates across different thresholds. |
| Buffer pool efficiency | 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests) | OLTP should stay above 99%. Below 95%, latency degrades non-linearly. |
| Lock contention | Innodb_row_lock_waits delta and lock_deadlocks from information_schema.INNODB_METRICS | Any sustained increase in deadlock rate or average lock wait time warrants investigation. Correlate with SHOW ENGINE INNODB STATUS LATEST DETECTED DEADLOCK to identify the conflicting table and index. |
| Connection health | Aborted_connects, Connection_errors_max_connections | Rejections or auth failures signal pool leaks, retry storms, or network issues. |
| Temp table spill | Created_tmp_disk_tables / Created_tmp_tables | A ratio above 25% indicates queries are spilling to disk due to memory limits or BLOB/TEXT columns. |
Level 3: mature
Mature monitoring shifts focus from symptoms to leading indicators. These signals predict cliff-edge failures such as redo log stalls, purge lag explosions, and metadata lock cascades. They require performance_schema and InnoDB metrics that are not visible in basic SHOW STATUS output. Verify that Performance_schema_*_lost counters are zero; any nonzero value means instrumentation is being dropped.
| Focus | Key metric or source | What to watch |
|---|---|---|
| Checkpoint age | Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn vs capacity | In 8.0.30+, compare against innodb_redo_log_capacity. In earlier versions, capacity is innodb_log_files_in_group * innodb_log_file_size. Above 75% triggers aggressive flushing and user-visible latency jitter. Above 90%, writes stall. Pre-8.0.30, parse SHOW ENGINE INNODB STATUS for the LOG section. |
| History list length | trx_rseg_history_len from information_schema.INNODB_METRICS | Tracks MVCC purge debt. Above 1,000,000 indicates a long-running transaction is blocking purge. |
| Metadata locks | performance_schema.metadata_locks | Pending locks predict a cascade before connection exhaustion. Common culprits are DDL statements blocked by an uncommitted transaction or a long-running backup lock. |
| Open transaction age | information_schema.INNODB_TRX ordered by trx_started | Any transaction older than 5 minutes in OLTP is suspicious. Idle transactions with trx_query IS NULL are often the culprit. Join with performance_schema.threads to map trx_mysql_thread_id to the client host and user. |
| Thread cache efficiency | Threads_created / Connections | Should be near zero. A ratio above 10% usually means thread_cache_size is too small for the connection churn pattern. |
| Binary log growth | SHOW BINARY LOGS total size vs disk capacity and binlog_expire_logs_seconds (5.7: expire_logs_days) | Growth without effective expiry risks disk exhaustion, especially when replicas lag. |
| Table cache pressure | Opened_tables rate and Open_tables / table_open_cache | Saturation causes file descriptor churn and latency spikes. |
| Query digest latency | performance_schema.events_statements_summary_by_digest | Track per-pattern latency regressions. A full digest table loses visibility into new queries. Truncate the table after schema or index changes so old plans do not mask new regressions: TRUNCATE TABLE performance_schema.events_statements_summary_by_digest. |
| Adaptive hash index | adaptive_hash_searches vs adaptive_hash_searches_btree from information_schema.INNODB_METRICS | Low hit ratio combined with btr_search semaphore waits suggests disabling AHI. This requires SET GLOBAL innodb_adaptive_hash_index = OFF; the change is dynamic but removes a hot structure, so test during a low-traffic window. |
Level 4: expert
Expert monitoring targets correctness, topology integrity, and internal engine efficiency. These signals are essential for large-scale deployments, strict RPO requirements, and environments where query plans and lock ordering change under load.
| Focus | Key metric or source | What to watch |
|---|---|---|
| Per-index access stats | performance_schema.table_io_waits_summary_by_index_usage | Identifies unused indexes and table scan patterns that evade aggregate counters. A COUNT_FETCH of zero with high COUNT_INSERT suggests a write-only index that adds maintenance overhead. |
| Lock-wait chains | performance_schema.data_lock_waits joined with data_locks (MySQL 8.0) | Query where BLOCKING_ENGINE_TRANSACTION_ID is not null to reconstruct who blocks whom in real time. In 5 |
[OUTPUT TRUNCATED: Response exceeded output token limit.]







