MySQL monitoring maturity model: from survival to expert

Production MySQL incidents rarely fail because nobody was watching. They fail because the team was watching the wrong signals. A replica reporting Seconds_Behind_Source = 0 while its I/O thread is stopped hides a stopped replica. A healthy buffer pool hit ratio while checkpoint age nears a stall hides an impending write freeze. Both create false safety. This model maps exactly what to instrument at each stage of operational maturity.

The four levels are additive. Level 1 confirms the server is alive. Level 2 confirms the workload is healthy. Level 3 provides leading indicators of saturation and internal backlog. Level 4 gives forensic precision to separate a forgotten transaction from a storage stall, and to catch replication divergence before it becomes split-brain. Skip a level and you inherit its blind spots.

This guide assumes InnoDB and covers MySQL 5.7 through 8.x. Modern terminology (REPLICA / SOURCE) is used with legacy equivalents noted for 5.7. Use the tables to audit your current monitoring stack or decide which alert to tune before the next peak.

flowchart TD
    L1[Level 1: survival]
    L2[Level 2: operational]
    L3[Level 3: mature]
    L4[Level 4: expert]

    L1 --> L2
    L2 --> L3
    L3 --> L4

Level 1: survival

Survival monitoring answers one question: is MySQL up and reachable? If you have only these signals, you will know when the server is down, but you will have zero warning before a saturation failure. This is the baseline for every production instance.

FocusKey metric or sourceWhat to watch
LivenessSELECT 1 or TCP connect to port 3306The probe must return inside the timeout. Success confirms the full auth and query path is accepting work.
UptimeSHOW GLOBAL STATUS LIKE 'Uptime'A reset near zero signals a crash, OOM kill, or unplanned restart.
Connection limitThreads_connected / max_connectionsFetch max_connections from SHOW GLOBAL VARIABLES LIKE 'max_connections'. A ratio approaching 1.0 means new connections are refused instantly.
Replication stateSHOW REPLICA STATUS\GReplica_IO_Running and Replica_SQL_Running must both be Yes. In 5.7, use SHOW SLAVE STATUS\G. Also check Last_IO_Errno and Last_SQL_Errno for non-zero codes.
Disk spaceOS-level free space on the data volumeMySQL halts writes and can crash when the partition fills. Monitor both the data directory and the temporary directory (tmpdir).
Error logTail of the MySQL error logLook for [ERROR] lines, assertion failures, and crash recovery messages. Rotate logs aggressively enough that you do not lose context during a crash loop.

Level 2: operational

Operational monitoring adds workload quality and contention awareness. These signals distinguish an idle server from an oversubscribed one, and they catch query regressions before they cascade into connection exhaustion.

FocusKey metric or sourceWhat to watch
Query throughputQuestions rate (per second)A sustained drop while Threads_connected is stable indicates blocking or saturation.
Active concurrencySHOW GLOBAL STATUS LIKE 'Threads_running'As a rule of thumb, keep this below CPU core count during normal load. Spikes above core count indicate queuing or contention. If you use thread_pool, this still measures active queries, not pooled threads.
Slow query rateSlow_queries delta with long_query_time <= 1 sThe default of 10 s misses most operationally relevant regressions. Track the rate, not the absolute count. Tuning long_query_time requires restarting the metric baseline; do not compare rates across different thresholds.
Buffer pool efficiency1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)OLTP should stay above 99%. Below 95%, latency degrades non-linearly.
Lock contentionInnodb_row_lock_waits delta and lock_deadlocks from information_schema.INNODB_METRICSAny sustained increase in deadlock rate or average lock wait time warrants investigation. Correlate with SHOW ENGINE INNODB STATUS LATEST DETECTED DEADLOCK to identify the conflicting table and index.
Connection healthAborted_connects, Connection_errors_max_connectionsRejections or auth failures signal pool leaks, retry storms, or network issues.
Temp table spillCreated_tmp_disk_tables / Created_tmp_tablesA ratio above 25% indicates queries are spilling to disk due to memory limits or BLOB/TEXT columns.

Level 3: mature

Mature monitoring shifts focus from symptoms to leading indicators. These signals predict cliff-edge failures such as redo log stalls, purge lag explosions, and metadata lock cascades. They require performance_schema and InnoDB metrics that are not visible in basic SHOW STATUS output. Verify that Performance_schema_*_lost counters are zero; any nonzero value means instrumentation is being dropped.

FocusKey metric or sourceWhat to watch
Checkpoint ageInnodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn vs capacityIn 8.0.30+, compare against innodb_redo_log_capacity. In earlier versions, capacity is innodb_log_files_in_group * innodb_log_file_size. Above 75% triggers aggressive flushing and user-visible latency jitter. Above 90%, writes stall. Pre-8.0.30, parse SHOW ENGINE INNODB STATUS for the LOG section.
History list lengthtrx_rseg_history_len from information_schema.INNODB_METRICSTracks MVCC purge debt. Above 1,000,000 indicates a long-running transaction is blocking purge.
Metadata locksperformance_schema.metadata_locksPending locks predict a cascade before connection exhaustion. Common culprits are DDL statements blocked by an uncommitted transaction or a long-running backup lock.
Open transaction ageinformation_schema.INNODB_TRX ordered by trx_startedAny transaction older than 5 minutes in OLTP is suspicious. Idle transactions with trx_query IS NULL are often the culprit. Join with performance_schema.threads to map trx_mysql_thread_id to the client host and user.
Thread cache efficiencyThreads_created / ConnectionsShould be near zero. A ratio above 10% usually means thread_cache_size is too small for the connection churn pattern.
Binary log growthSHOW BINARY LOGS total size vs disk capacity and binlog_expire_logs_seconds (5.7: expire_logs_days)Growth without effective expiry risks disk exhaustion, especially when replicas lag.
Table cache pressureOpened_tables rate and Open_tables / table_open_cacheSaturation causes file descriptor churn and latency spikes.
Query digest latencyperformance_schema.events_statements_summary_by_digestTrack per-pattern latency regressions. A full digest table loses visibility into new queries. Truncate the table after schema or index changes so old plans do not mask new regressions: TRUNCATE TABLE performance_schema.events_statements_summary_by_digest.
Adaptive hash indexadaptive_hash_searches vs adaptive_hash_searches_btree from information_schema.INNODB_METRICSLow hit ratio combined with btr_search semaphore waits suggests disabling AHI. This requires SET GLOBAL innodb_adaptive_hash_index = OFF; the change is dynamic but removes a hot structure, so test during a low-traffic window.

Level 4: expert

Expert monitoring targets correctness, topology integrity, and internal engine efficiency. These signals are essential for large-scale deployments, strict RPO requirements, and environments where query plans and lock ordering change under load.

FocusKey metric or sourceWhat to watch
Per-index access statsperformance_schema.table_io_waits_summary_by_index_usageIdentifies unused indexes and table scan patterns that evade aggregate counters. A COUNT_FETCH of zero with high COUNT_INSERT suggests a write-only index that adds maintenance overhead.
Lock-wait chainsperformance_schema.data_lock_waits joined with data_locks (MySQL 8.0)Query where BLOCKING_ENGINE_TRANSACTION_ID is not null to reconstruct who blocks whom in real time. In 5

[OUTPUT TRUNCATED: Response exceeded output token limit.]