MySQL connection exhaustion: detection, diagnosis, and prevention
ERROR 1040 (HY000): Too many connections. Health checks fail. Users cannot sign in. If you configured an admin account correctly, a break-glass session may still get through on the reserved slot, and SELECT 1 returns. MySQL is running, but the connection pool is a wall.
This is a hard cliff. One moment queries flow; the next, every new TCP handshake to port 3306 is rejected. The server does not queue connections. Whether the root cause is a connection leak, a retry storm, or genuine overload from slow queries, the symptom is the same: Threads_connected has reached max_connections.
The key to fast diagnosis is the difference between connected threads and running threads. Threads_connected counts every open session, including thousands of idle Sleep connections. Threads_running counts only those executing a query right now. High connected with low running means you are leaking connections. High running means the database is choking on work. Picking the wrong branch wastes minutes during an outage.
What this means
MySQL uses a one-thread-per-connection model in the classic architecture. Each connection holds session memory and a file descriptor. The server enforces a hard limit at max_connections. The default is 151.
When the limit is reached, the server rejects new attempts and increments Connection_errors_max_connections. MySQL reserves one connection beyond max_connections for accounts with the CONNECTION_ADMIN privilege (or the deprecated SUPER privilege in older versions). This is your emergency access, but only if you configured an admin account and did not consume that slot through application misconfiguration.
Exhaustion is usually a symptom, not the disease. Three broad categories:
- Idle accumulation: connections hold sessions without closing them, often because
wait_timeoutis too high or application pools leak. - Active overload: queries run slowly, so connections stay busy longer and the pool backs up.
- Retry amplification: an application hits a timeout, opens more connections to retry, and accelerates saturation.
flowchart TD
A[Connection refused errors] --> B{Threads_running low?}
B -->|Yes| C[Idle leak]
B -->|No| D[Active overload]
C --> E[Check PROCESSLIST Sleep]
D --> F[Check query time]
D --> G[Check lock waits]
E --> H[Kill old Sleep threads]
F --> I[Kill blockers or tune]
G --> ICommon causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Connection pool leak | Threads_connected near limit, Threads_running very low, many Sleep entries with high Time | SHOW PROCESSLIST filtered for Sleep |
| Slow query pile-up | Both metrics high, queries active for many seconds | SHOW PROCESSLIST for large Time values |
| Retry storm after partial failure | Waves of Connection_errors_max_connections spikes correlating with application errors | Application logs for ERROR 1040 and reconnect loops |
wait_timeout too high | Gradual daily growth of idle Sleep connections that never drop | SHOW GLOBAL VARIABLES LIKE 'wait_timeout' |
| Metadata lock cascade | One table affected, processlist shows Waiting for table metadata lock | performance_schema.metadata_locks for pending locks |
Quick checks
These are read-only unless noted.
# Saturation ratio
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"
# Active versus idle split
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
# Rejections happening now
mysql -e "SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';"
-- Oldest idle Sleep connections
SELECT id, user, host, db, command, time, state
FROM information_schema.PROCESSLIST
WHERE command = 'Sleep'
ORDER BY time DESC
LIMIT 20;
-- Pending metadata locks (MySQL 5.7+)
SELECT OBJECT_NAME, OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS, COUNT(*) AS waiting
FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING'
GROUP BY OBJECT_NAME, OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS;
-- Long-running active queries
SELECT id, user, host, db, command, time, state, LEFT(info, 100) AS query_preview
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep' AND time > 10
ORDER BY time DESC;
# Timeout configuration
mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout', 'interactive_timeout');"
# Peak usage since startup
mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
-- Oldest open transactions
SELECT trx_mysql_thread_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started ASC
LIMIT 5;
How to diagnose it
- Confirm saturation. Verify
Threads_connected / max_connections > 0.95andConnection_errors_max_connectionsis increasing. If the ratio is high but rejections are zero, you are near the edge but not yet falling. - Determine the branch. Look at
Threads_running. If it is low relative toThreads_connected, you have an idle accumulation problem. If it is high and climbing, you have an active overload problem. - For idle accumulation, inspect
information_schema.PROCESSLISTforSleepcommands with largeTimevalues. CorrelateUserandHostwith application services. If all idle connections come from one application host, that pool is leaking. - Check
wait_timeoutandinteractive_timeout. The defaultwait_timeoutis 28800 seconds (8 hours). If your application or ORM does not explicitly close connections, reducing this reclaims idle sessions. Interactive clients useinteractive_timeout; change both for consistent behavior. - For active overload, inspect active queries. Look for high
TimeinSHOW PROCESSLIST. Checkperformance_schema.data_lock_waits(MySQL 8.0+) orinformation_schema.INNODB_LOCK_WAITS(5.7) for row lock contention backing up the queue. - Check for metadata lock waits. Run the
performance_schema.metadata_locksquery from Quick checks. If one table has many pending metadata locks and a DDL statement is waiting, you are in a metadata lock cascade. This fills the pool even though the queries themselves are fast. - Check application logs for retry loops. If the application receives ERROR 1040 and immediately retries without backoff, each attempt consumes a brief connection slot and worsens the storm.
- Verify admin access. Ensure your monitoring and break-glass accounts hold
CONNECTION_ADMINso you can still connect when the pool is full.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Threads_connected / max_connections | Saturation of the connection pool | > 0.80 sustained for 10 minutes; page if > 0.95 with rejections |
Threads_running | Distinguishes idle accumulation from genuine overload | Consistently above CPU core count while throughput drops |
Connection_errors_max_connections | Direct evidence that clients are being refused | Nonzero rate while Threads_connected is near the limit |
Max_used_connections / max_connections | Historical peak trend for capacity planning | Trending upward over weeks, crossing 0.70 |
Aborted_connects rate | Precursor or compounding factor from auth failures and handshake drops | Sustained spike above baseline during saturation events |
Threads_created / Connections ratio | Thread cache thrashing from connection churn | Ratio > 0.01 indicates cache is too small or pooling is missing |
Fixes
Immediate relief without a restart
If you can connect via the admin slot, kill the worst offenders. Killing connections is disruptive and aborts in-flight transactions, but during an outage it may be necessary.
-- Kill the oldest idle Sleep connections first
KILL <id>;
For a metadata lock cascade, decide whether the DDL or the long transaction is more expendable. Killing the DDL provides instant relief; killing the long transaction may be required if the DDL is critical.
-- Kill a blocking DDL or long-running query
KILL <id>;
If idle accumulation is the culprit and you cannot restart the application, lower wait_timeout dynamically.
-- Reclaim idle connections (seconds)
SET GLOBAL wait_timeout = 300;
SET GLOBAL interactive_timeout = 300;
This only affects new idle periods. Existing idle connections are not retroactively subject to the new timeout and must be killed manually.
Short-term configuration changes
Raising max_connections is a band-aid, not a cure. Each connection consumes memory for thread stack, sort buffers, and join buffers. Raising the limit without increasing RAM trades connection rejections for OOM kills or excessive swap. Only raise it after confirming the server has memory headroom.
If you must raise it:
SET GLOBAL max_connections = <new_value>;
Long-term fixes
- Fix the application connection pool logic. Ensure connections are returned to the pool and closed on exceptions.
- Add exponential backoff with jitter to application retry logic when encountering ERROR 1040.
- Deploy a connection pooler such as ProxySQL or MaxScale between the application and MySQL. Middleware multiplexes many application connections into fewer server connections.
- Reduce
wait_timeoutandinteractive_timeoutpermanently to values appropriate for your workload. For many web applications, 300 to 600 seconds is sufficient. - Schedule DDL operations during low-traffic windows and check for long-running transactions with
information_schema.INNODB_TRXbefore starting.
Prevention
- Size application connection pools so that total capacity across all instances leaves at least 20-30 percent of
max_connectionsfree for spikes, admin access, replication, backups, and monitoring. - Monitor
Max_used_connectionsdaily. Track the weekly peak trend. If it grows steadily, investigate before the cliff. - Set alerts at 80 percent of
max_connections, not at 100 percent. - Ensure at least one break-glass account has
CONNECTION_ADMINand is not used by applications. - Tune
thread_cache_sizebased onThreads_createdrate. If you see constant thread creation, increase the cache to reduce connection latency overhead. - Review ORM and framework configurations. Many frameworks auto-begin transactions and rely on garbage collection or request-finalizers to close connections. Explicit close or try-finally blocks are safer.
How Netdata helps
- Correlates
Threads_connected,Threads_running, andConnection_errors_max_connectionsin one view to distinguish idle leaks from active overload. - Tracks
Max_used_connectionsover time to expose gradual growth toward the limit. - Surfaces
Aborted_connectsandThreads_createdalongside connection utilization to highlight pool thrashing or handshake storms. - Supports composite alerts that page only when
Threads_connected / max_connections > 0.95andConnection_errors_max_connectionsis increasing, filtering transient spikes. - Collects per-second processlist state breakdowns when configured, showing
Sleep,Running, and lock wait proportions without manual polling.







