MySQL connection exhaustion: detection, diagnosis, and prevention

ERROR 1040 (HY000): Too many connections. Health checks fail. Users cannot sign in. If you configured an admin account correctly, a break-glass session may still get through on the reserved slot, and SELECT 1 returns. MySQL is running, but the connection pool is a wall.

This is a hard cliff. One moment queries flow; the next, every new TCP handshake to port 3306 is rejected. The server does not queue connections. Whether the root cause is a connection leak, a retry storm, or genuine overload from slow queries, the symptom is the same: Threads_connected has reached max_connections.

The key to fast diagnosis is the difference between connected threads and running threads. Threads_connected counts every open session, including thousands of idle Sleep connections. Threads_running counts only those executing a query right now. High connected with low running means you are leaking connections. High running means the database is choking on work. Picking the wrong branch wastes minutes during an outage.

What this means

MySQL uses a one-thread-per-connection model in the classic architecture. Each connection holds session memory and a file descriptor. The server enforces a hard limit at max_connections. The default is 151.

When the limit is reached, the server rejects new attempts and increments Connection_errors_max_connections. MySQL reserves one connection beyond max_connections for accounts with the CONNECTION_ADMIN privilege (or the deprecated SUPER privilege in older versions). This is your emergency access, but only if you configured an admin account and did not consume that slot through application misconfiguration.

Exhaustion is usually a symptom, not the disease. Three broad categories:

  • Idle accumulation: connections hold sessions without closing them, often because wait_timeout is too high or application pools leak.
  • Active overload: queries run slowly, so connections stay busy longer and the pool backs up.
  • Retry amplification: an application hits a timeout, opens more connections to retry, and accelerates saturation.
flowchart TD
    A[Connection refused errors] --> B{Threads_running low?}
    B -->|Yes| C[Idle leak]
    B -->|No| D[Active overload]
    C --> E[Check PROCESSLIST Sleep]
    D --> F[Check query time]
    D --> G[Check lock waits]
    E --> H[Kill old Sleep threads]
    F --> I[Kill blockers or tune]
    G --> I

Common causes

CauseWhat it looks likeFirst thing to check
Connection pool leakThreads_connected near limit, Threads_running very low, many Sleep entries with high TimeSHOW PROCESSLIST filtered for Sleep
Slow query pile-upBoth metrics high, queries active for many secondsSHOW PROCESSLIST for large Time values
Retry storm after partial failureWaves of Connection_errors_max_connections spikes correlating with application errorsApplication logs for ERROR 1040 and reconnect loops
wait_timeout too highGradual daily growth of idle Sleep connections that never dropSHOW GLOBAL VARIABLES LIKE 'wait_timeout'
Metadata lock cascadeOne table affected, processlist shows Waiting for table metadata lockperformance_schema.metadata_locks for pending locks

Quick checks

These are read-only unless noted.

# Saturation ratio
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"
# Active versus idle split
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
# Rejections happening now
mysql -e "SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';"
-- Oldest idle Sleep connections
SELECT id, user, host, db, command, time, state
FROM information_schema.PROCESSLIST
WHERE command = 'Sleep'
ORDER BY time DESC
LIMIT 20;
-- Pending metadata locks (MySQL 5.7+)
SELECT OBJECT_NAME, OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS, COUNT(*) AS waiting
FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING'
GROUP BY OBJECT_NAME, OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS;
-- Long-running active queries
SELECT id, user, host, db, command, time, state, LEFT(info, 100) AS query_preview
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep' AND time > 10
ORDER BY time DESC;
# Timeout configuration
mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout', 'interactive_timeout');"
# Peak usage since startup
mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
-- Oldest open transactions
SELECT trx_mysql_thread_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started ASC
LIMIT 5;

How to diagnose it

  1. Confirm saturation. Verify Threads_connected / max_connections > 0.95 and Connection_errors_max_connections is increasing. If the ratio is high but rejections are zero, you are near the edge but not yet falling.
  2. Determine the branch. Look at Threads_running. If it is low relative to Threads_connected, you have an idle accumulation problem. If it is high and climbing, you have an active overload problem.
  3. For idle accumulation, inspect information_schema.PROCESSLIST for Sleep commands with large Time values. Correlate User and Host with application services. If all idle connections come from one application host, that pool is leaking.
  4. Check wait_timeout and interactive_timeout. The default wait_timeout is 28800 seconds (8 hours). If your application or ORM does not explicitly close connections, reducing this reclaims idle sessions. Interactive clients use interactive_timeout; change both for consistent behavior.
  5. For active overload, inspect active queries. Look for high Time in SHOW PROCESSLIST. Check performance_schema.data_lock_waits (MySQL 8.0+) or information_schema.INNODB_LOCK_WAITS (5.7) for row lock contention backing up the queue.
  6. Check for metadata lock waits. Run the performance_schema.metadata_locks query from Quick checks. If one table has many pending metadata locks and a DDL statement is waiting, you are in a metadata lock cascade. This fills the pool even though the queries themselves are fast.
  7. Check application logs for retry loops. If the application receives ERROR 1040 and immediately retries without backoff, each attempt consumes a brief connection slot and worsens the storm.
  8. Verify admin access. Ensure your monitoring and break-glass accounts hold CONNECTION_ADMIN so you can still connect when the pool is full.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Threads_connected / max_connectionsSaturation of the connection pool> 0.80 sustained for 10 minutes; page if > 0.95 with rejections
Threads_runningDistinguishes idle accumulation from genuine overloadConsistently above CPU core count while throughput drops
Connection_errors_max_connectionsDirect evidence that clients are being refusedNonzero rate while Threads_connected is near the limit
Max_used_connections / max_connectionsHistorical peak trend for capacity planningTrending upward over weeks, crossing 0.70
Aborted_connects ratePrecursor or compounding factor from auth failures and handshake dropsSustained spike above baseline during saturation events
Threads_created / Connections ratioThread cache thrashing from connection churnRatio > 0.01 indicates cache is too small or pooling is missing

Fixes

Immediate relief without a restart

If you can connect via the admin slot, kill the worst offenders. Killing connections is disruptive and aborts in-flight transactions, but during an outage it may be necessary.

-- Kill the oldest idle Sleep connections first
KILL <id>;

For a metadata lock cascade, decide whether the DDL or the long transaction is more expendable. Killing the DDL provides instant relief; killing the long transaction may be required if the DDL is critical.

-- Kill a blocking DDL or long-running query
KILL <id>;

If idle accumulation is the culprit and you cannot restart the application, lower wait_timeout dynamically.

-- Reclaim idle connections (seconds)
SET GLOBAL wait_timeout = 300;
SET GLOBAL interactive_timeout = 300;

This only affects new idle periods. Existing idle connections are not retroactively subject to the new timeout and must be killed manually.

Short-term configuration changes

Raising max_connections is a band-aid, not a cure. Each connection consumes memory for thread stack, sort buffers, and join buffers. Raising the limit without increasing RAM trades connection rejections for OOM kills or excessive swap. Only raise it after confirming the server has memory headroom.

If you must raise it:

SET GLOBAL max_connections = <new_value>;

Long-term fixes

  • Fix the application connection pool logic. Ensure connections are returned to the pool and closed on exceptions.
  • Add exponential backoff with jitter to application retry logic when encountering ERROR 1040.
  • Deploy a connection pooler such as ProxySQL or MaxScale between the application and MySQL. Middleware multiplexes many application connections into fewer server connections.
  • Reduce wait_timeout and interactive_timeout permanently to values appropriate for your workload. For many web applications, 300 to 600 seconds is sufficient.
  • Schedule DDL operations during low-traffic windows and check for long-running transactions with information_schema.INNODB_TRX before starting.

Prevention

  • Size application connection pools so that total capacity across all instances leaves at least 20-30 percent of max_connections free for spikes, admin access, replication, backups, and monitoring.
  • Monitor Max_used_connections daily. Track the weekly peak trend. If it grows steadily, investigate before the cliff.
  • Set alerts at 80 percent of max_connections, not at 100 percent.
  • Ensure at least one break-glass account has CONNECTION_ADMIN and is not used by applications.
  • Tune thread_cache_size based on Threads_created rate. If you see constant thread creation, increase the cache to reduce connection latency overhead.
  • Review ORM and framework configurations. Many frameworks auto-begin transactions and rely on garbage collection or request-finalizers to close connections. Explicit close or try-finally blocks are safer.

How Netdata helps

  • Correlates Threads_connected, Threads_running, and Connection_errors_max_connections in one view to distinguish idle leaks from active overload.
  • Tracks Max_used_connections over time to expose gradual growth toward the limit.
  • Surfaces Aborted_connects and Threads_created alongside connection utilization to highlight pool thrashing or handshake storms.
  • Supports composite alerts that page only when Threads_connected / max_connections > 0.95 and Connection_errors_max_connections is increasing, filtering transient spikes.
  • Collects per-second processlist state breakdowns when configured, showing Sleep, Running, and lock wait proportions without manual polling.