MySQL monitoring checklist: the signals every production instance needs
This checklist organizes MySQL production signals into four levels. Start with survival, add operational, then mature and expert signals as your environment demands. Each level builds on the previous one.
flowchart TD
A[Level 1 Survival] --> B[Level 2 Operational]
B --> C[Level 3 Mature]
C --> D[Level 4 Expert]
A --> E[Page when dead]
B --> F[Ticket when degraded]
C --> G[Plan before crisis]
D --> H[Correlate and prevent]Level 1: Survival
These signals answer whether MySQL is alive and serving traffic.
SELECT 1liveness – TCP connect to port 3306 does not prove MySQL is ready; InnoDB crash recovery can block queries for minutes. RunSELECT 1through the full auth-query path from the application’s network perspective to confirm liveness.- Server uptime –
SHOW GLOBAL STATUS LIKE 'Uptime'tracks seconds since restart. Unexpected resets indicate crash, OOM kill, or assertion failure. Correlate with the error log immediately. - Connection utilization – Divide
Threads_connectedbymax_connections. Keep peak utilization below 80% to leave headroom for spikes and admin access. - Replication thread state – On replicas,
SHOW REPLICA STATUSfieldsReplica_IO_RunningandReplica_SQL_Runningmust both beYes. If either stops, lag grows unboundedly. - Disk space free – MySQL stops writing and may crash when data, redo log, binlog, or tmpdir volumes fill. Monitor every filesystem MySQL touches, not just the data directory. InnoDB cannot allocate new extents on a full volume and writes hang.
- Error log scan – Watch for
[ERROR],[FATAL], stack traces, and assertion failures. MySQL writes crash signatures (mysqld got signal), corruption detection (InnoDB: Database page corruption), and replication breaks to the error log, not toSHOW GLOBAL STATUS.
Level 2: Operational
Add these to detect degradation before it becomes an outage.
- Query throughput (
Questions) – The rate of client-initiated statements. A sudden drop whileThreads_connectedremains stable means queries are stuck, not that traffic disappeared. - Active execution queue (
Threads_running) – The number of threads currently executing queries. This is the definitive measure of database load. If it rises into the hundreds on a modest instance, threads are context-switching instead of working. Correlate with CPUiowaitto distinguish CPU saturation from disk stall. - Slow query rate –
Slow_queriesincrements for statements exceedinglong_query_time. The default of 10 seconds misses most OLTP problems. Setlong_query_timeto 1 second or lower. If the counter rises but the slow log shows no new entries, check the log destination and rotation. - Buffer pool hit ratio – Calculate as
1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests). For OLTP, sustained drops below 99% mean the working set no longer fits in RAM. Below 95% is a crisis. Ignore dips during cache warmup after a restart; wait for steady state. - InnoDB row lock contention – Watch
Innodb_row_lock_waitsrate,Innodb_row_lock_current_waits, andInnodb_row_lock_time_avg. Sustainedcurrent_waitswith risingThreads_runningsignals active contention. - InnoDB deadlocks – Query
INFORMATION_SCHEMA.INNODB_METRICSforlock_deadlocks. Occasional deadlocks are normal in OLTP. Sustained rates above 1 per minute indicate application logic or indexing issues. - Connection refusals –
Connection_errors_max_connectionscounts rejected connections. Any nonzero rate in production means clients are being turned away. - Aborted connections – Rising
Aborted_connectsindicates authentication failures or network issues. RisingAborted_clientsmeans connections are terminated abnormally, often because a client timeout is shorter than query runtime. - Temporary table disk spill – The ratio
Created_tmp_disk_tables / Created_tmp_tablesshows how often queries spill intermediate results to disk. Above 25% warrants investigation. In MySQL 8.0, theTempTableengine replacesMEMORYand handles larger in-memory temp tables before spilling, but disk spills still indicate missing indexes, largeGROUP BYcolumns, or lowtemptable_max_ram. - Read/write mix – Track
Com_select,Com_insert,Com_update,Com_delete. A shift in ratio without a corresponding deployment often explains performance changes, especially when read volume spikes while write volume stays flat.
Level 3: Mature
These are leading indicators for the composite failure patterns that cause most major MySQL incidents.
- Checkpoint age ratio – Compute
(Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn) / innodb_redo_log_capacityon MySQL 8.0.30+, or parseSHOW ENGINE INNODB STATUSon older versions. Above 75% enters aggressive flushing; above 90% risks synchronous write stalls. At 100%, new writes block until dirty pages flush. - History list length – Query
INFORMATION_SCHEMA.INNODB_METRICSfortrx_rseg_history_len. Values above 1,000,000 mean a long-running transaction is blocking purge. All MVCC reads degrade as history accumulates, slowing even simpleSELECTstatements that touch recently modified rows. - Buffer pool memory pressure –
Innodb_buffer_pool_wait_freecounts times queries blocked waiting for a clean page. Any nonzero sustained rate means the pool is undersized or page cleaners cannot keep up. - Dirty page ratio – Compare
Innodb_buffer_pool_pages_dirtytoInnodb_buffer_pool_pages_total. Sustained values above 75% indicate write pressure that may precede a checkpoint stall. Tuneinnodb_max_dirty_pages_pctonly as a safety rail; the real fix is write rate or IOPS capacity. - Log buffer waits –
Innodb_log_waitsshould be zero. Any sustained nonzero rate meansinnodb_log_buffer_sizeis too small for the write rate. - Open transaction age – Query
information_schema.INNODB_TRXand alert on transactions older than 5 minutes in OLTP. Idle transactions withtrx_query IS NULLhold read views and block purge even without active statements. Kill the connection or fix the application idle-in-transaction pattern. A single forgottenBEGINin a connection pool can hold a read view for hours. - Metadata lock buildup – Monitor
performance_schema.metadata_locks. More than 3 sessions pending on the same table for 30 seconds means a DDL or long-running transaction is about to cascade into a full connection pool stall. The blocking session usually holdsOBJECT_TYPE = 'TABLE'withLOCK_TYPE = 'SHARED_WRITE'or an explicit
[OUTPUT TRUNCATED: Response exceeded output token limit.]







