MySQL monitoring checklist: the signals every production instance needs

This checklist organizes MySQL production signals into four levels. Start with survival, add operational, then mature and expert signals as your environment demands. Each level builds on the previous one.

flowchart TD
    A[Level 1 Survival] --> B[Level 2 Operational]
    B --> C[Level 3 Mature]
    C --> D[Level 4 Expert]
    A --> E[Page when dead]
    B --> F[Ticket when degraded]
    C --> G[Plan before crisis]
    D --> H[Correlate and prevent]

Level 1: Survival

These signals answer whether MySQL is alive and serving traffic.

  • SELECT 1 liveness – TCP connect to port 3306 does not prove MySQL is ready; InnoDB crash recovery can block queries for minutes. Run SELECT 1 through the full auth-query path from the application’s network perspective to confirm liveness.
  • Server uptimeSHOW GLOBAL STATUS LIKE 'Uptime' tracks seconds since restart. Unexpected resets indicate crash, OOM kill, or assertion failure. Correlate with the error log immediately.
  • Connection utilization – Divide Threads_connected by max_connections. Keep peak utilization below 80% to leave headroom for spikes and admin access.
  • Replication thread state – On replicas, SHOW REPLICA STATUS fields Replica_IO_Running and Replica_SQL_Running must both be Yes. If either stops, lag grows unboundedly.
  • Disk space free – MySQL stops writing and may crash when data, redo log, binlog, or tmpdir volumes fill. Monitor every filesystem MySQL touches, not just the data directory. InnoDB cannot allocate new extents on a full volume and writes hang.
  • Error log scan – Watch for [ERROR], [FATAL], stack traces, and assertion failures. MySQL writes crash signatures (mysqld got signal), corruption detection (InnoDB: Database page corruption), and replication breaks to the error log, not to SHOW GLOBAL STATUS.

Level 2: Operational

Add these to detect degradation before it becomes an outage.

  • Query throughput (Questions) – The rate of client-initiated statements. A sudden drop while Threads_connected remains stable means queries are stuck, not that traffic disappeared.
  • Active execution queue (Threads_running) – The number of threads currently executing queries. This is the definitive measure of database load. If it rises into the hundreds on a modest instance, threads are context-switching instead of working. Correlate with CPU iowait to distinguish CPU saturation from disk stall.
  • Slow query rateSlow_queries increments for statements exceeding long_query_time. The default of 10 seconds misses most OLTP problems. Set long_query_time to 1 second or lower. If the counter rises but the slow log shows no new entries, check the log destination and rotation.
  • Buffer pool hit ratio – Calculate as 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests). For OLTP, sustained drops below 99% mean the working set no longer fits in RAM. Below 95% is a crisis. Ignore dips during cache warmup after a restart; wait for steady state.
  • InnoDB row lock contention – Watch Innodb_row_lock_waits rate, Innodb_row_lock_current_waits, and Innodb_row_lock_time_avg. Sustained current_waits with rising Threads_running signals active contention.
  • InnoDB deadlocks – Query INFORMATION_SCHEMA.INNODB_METRICS for lock_deadlocks. Occasional deadlocks are normal in OLTP. Sustained rates above 1 per minute indicate application logic or indexing issues.
  • Connection refusalsConnection_errors_max_connections counts rejected connections. Any nonzero rate in production means clients are being turned away.
  • Aborted connections – Rising Aborted_connects indicates authentication failures or network issues. Rising Aborted_clients means connections are terminated abnormally, often because a client timeout is shorter than query runtime.
  • Temporary table disk spill – The ratio Created_tmp_disk_tables / Created_tmp_tables shows how often queries spill intermediate results to disk. Above 25% warrants investigation. In MySQL 8.0, the TempTable engine replaces MEMORY and handles larger in-memory temp tables before spilling, but disk spills still indicate missing indexes, large GROUP BY columns, or low temptable_max_ram.
  • Read/write mix – Track Com_select, Com_insert, Com_update, Com_delete. A shift in ratio without a corresponding deployment often explains performance changes, especially when read volume spikes while write volume stays flat.

Level 3: Mature

These are leading indicators for the composite failure patterns that cause most major MySQL incidents.

  • Checkpoint age ratio – Compute (Innodb_redo_log_current_lsn - Innodb_redo_log_checkpoint_lsn) / innodb_redo_log_capacity on MySQL 8.0.30+, or parse SHOW ENGINE INNODB STATUS on older versions. Above 75% enters aggressive flushing; above 90% risks synchronous write stalls. At 100%, new writes block until dirty pages flush.
  • History list length – Query INFORMATION_SCHEMA.INNODB_METRICS for trx_rseg_history_len. Values above 1,000,000 mean a long-running transaction is blocking purge. All MVCC reads degrade as history accumulates, slowing even simple SELECT statements that touch recently modified rows.
  • Buffer pool memory pressureInnodb_buffer_pool_wait_free counts times queries blocked waiting for a clean page. Any nonzero sustained rate means the pool is undersized or page cleaners cannot keep up.
  • Dirty page ratio – Compare Innodb_buffer_pool_pages_dirty to Innodb_buffer_pool_pages_total. Sustained values above 75% indicate write pressure that may precede a checkpoint stall. Tune innodb_max_dirty_pages_pct only as a safety rail; the real fix is write rate or IOPS capacity.
  • Log buffer waitsInnodb_log_waits should be zero. Any sustained nonzero rate means innodb_log_buffer_size is too small for the write rate.
  • Open transaction age – Query information_schema.INNODB_TRX and alert on transactions older than 5 minutes in OLTP. Idle transactions with trx_query IS NULL hold read views and block purge even without active statements. Kill the connection or fix the application idle-in-transaction pattern. A single forgotten BEGIN in a connection pool can hold a read view for hours.
  • Metadata lock buildup – Monitor performance_schema.metadata_locks. More than 3 sessions pending on the same table for 30 seconds means a DDL or long-running transaction is about to cascade into a full connection pool stall. The blocking session usually holds OBJECT_TYPE = 'TABLE' with LOCK_TYPE = 'SHARED_WRITE' or an explicit

[OUTPUT TRUNCATED: Response exceeded output token limit.]