Redis connection exhaustion: leaks, pools, and the retry storm
Application logs show connection timeouts and Redis returns ERR max number of clients reached. Downstream services fail because they cannot reach the cache. INFO clients shows connected_clients at the hard limit even though traffic has not increased. This is connection exhaustion. The most dangerous response is a retry storm that turns a small leak into a site-wide cascade.
Redis enforces a hard upper bound on connections via maxclients. When the sum of connected_clients, connected_slaves, and cluster_connections reaches that limit, Redis rejects every new TCP connection. Applications that retry immediately without backoff create a feedback loop: existing connections age out slowly while new attempts pile up, keeping the server pinned at the limit even after the original leak stops growing.
What this means
maxclients is a server-side hard stop. The default is 10000, but Redis also respects the OS file-descriptor limit (ulimit -n). If the OS limit is lower than maxclients plus the 32 file descriptors Redis reserves for internal use, and Redis cannot raise it, Redis lowers the effective maxclients at startup and reports this only as a warning in its log. You can hit the ceiling unexpectedly even when redis.conf requests a high number, so trust the running value from CONFIG GET maxclients rather than the configuration file.
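As a rough illustration of that startup adjustment, the sketch below computes the client limit a server could honor for a given descriptor limit. The function name effective_maxclients is hypothetical, and the sketch assumes Redis is unable to raise the descriptor limit itself.

```shell
# effective_maxclients CONFIGURED FD_LIMIT
# Prints the client limit that fits under FD_LIMIT once the 32
# descriptors reserved for internal use are set aside.
effective_maxclients() {
    configured=$1
    fd_limit=$2
    reserved=32
    usable=$((fd_limit - reserved))
    if [ "$usable" -lt "$configured" ]; then
        echo "$usable"
    else
        echo "$configured"
    fi
}

effective_maxclients 10000 4096    # prints 4064
effective_maxclients 10000 65536   # prints 10000
```

With a descriptor limit of 4096, a configured maxclients of 10000 shrinks to 4064, which is why the ceiling can arrive far earlier than the configuration file suggests.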
The capacity calculation must include every connection type. On a primary with replicas and cluster bus sockets, the true headroom is maxclients - connected_slaves - cluster_connections. Monitoring only connected_clients understates usage and delays intervention.
Once the limit is reached, Redis increments rejected_connections and returns an error. Naive client behavior turns this into an amplification attack against your own infrastructure.
flowchart TD
    A[Connection leak or pool oversize] --> B[connected_clients approaches maxclients]
    B --> C[Redis rejects new connections]
    C --> D[Application error and immediate retry]
    D --> E[More connection attempts]
    E --> C
    C --> F[Monitoring and admin connections blocked]
    F --> G[Downstream cascade]
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Connection leak | connected_clients climbs steadily while application throughput is flat; old connections show high idle seconds in CLIENT LIST | CLIENT LIST filtered by idle time |
| Pool misconfiguration | connected_clients near maxclients immediately after a deploy or scale-out; application instances each hold a fixed pool size | Application pool size multiplied by instance count |
| Retry storm | rejected_connections spikes rapidly; connected_clients oscillates near the limit as old connections time out and new attempts flood in | Application logs for immediate reconnects without backoff |
| maxclients too low for workload | rejected_connections on an otherwise healthy instance with legitimate traffic growth | CONFIG GET maxclients and OS ulimit -n |
| Sentinel or cluster internal connections | Unexpectedly high connection count on a primary after enabling Sentinel or joining a cluster | INFO replication for connected_slaves and INFO clients for cluster_connections |
Quick checks
Use these read-only commands to assess the situation without making changes.
# Full connection accounting against the hard limit
redis-cli INFO clients | grep -E "connected_clients|cluster_connections"
redis-cli INFO replication | grep connected_slaves
redis-cli CONFIG GET maxclients
Calculate the ratio manually: (connected_clients + connected_slaves + cluster_connections) / maxclients. If this is above 0.9, you are in the danger zone.
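The manual calculation can be scripted. The following is a minimal sketch, assuming INFO output in its usual key:value form; the function name capacity_ratio is hypothetical, and the sample numbers are invented for illustration.

```shell
# capacity_ratio MAXCLIENTS
# Reads INFO-style "key:value" lines on stdin, sums the three connection
# counters, and prints their ratio to MAXCLIENTS with two decimals.
capacity_ratio() {
    tr -d '\r' | awk -F: -v max="$1" '
        $1 == "connected_clients" ||
        $1 == "connected_slaves"  ||
        $1 == "cluster_connections" { used += $2 }
        END { printf "%.2f\n", used / max }
    '
}

# Live usage (assumes redis-cli can reach the instance):
#   max=$(redis-cli CONFIG GET maxclients | tail -n 1)
#   { redis-cli INFO clients; redis-cli INFO replication; } | capacity_ratio "$max"

# Offline example with invented sample values:
printf 'connected_clients:9150\nconnected_slaves:2\ncluster_connections:48\n' |
    capacity_ratio 10000   # prints 0.92
```

The tr call strips the carriage returns that INFO emits, so the field comparison in awk matches cleanly. The sketch does not guard against a maxclients value of zero.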
# Rejected connections since last restart (cumulative counter)
redis-cli INFO stats | grep rejected_connections
Any rate of increase here is an active incident.
# Idle connection audit to spot leaks
redis-cli CLIENT LIST | awk -F'[ =]' '{for(i=1;i<=NF;i++) {if($i=="id") id=$(i+1); if($i=="addr") addr=$(i+1); if($i=="idle") idle=$(i+1)} if(idle+0 > 300) print id, addr, idle"s"}'
Connections idle longer than your typical request lifetime are likely leaked.
# OS-level file descriptor ceiling (run on the Redis host)
ulimit -n
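Compare this to maxclients plus the 32 descriptors Redis reserves. If ulimit -n is below that sum, the OS is the real bottleneck. Keep in mind that ulimit -n reports the limit of the current shell; on Linux, the limit that applies to the running server is listed as "Max open files" in /proc/<pid>/limits.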
# Connection churn rate
redis-cli INFO stats | grep total_connections_received
Sample this twice, 10 seconds apart. A rapidly climbing counter during an outage suggests a retry storm.
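The two-sample comparison works for any cumulative counter, including rejected_connections. Here is a small sketch of it; the helper names counter_rate and sample_counter are hypothetical, and the sample values are invented.

```shell
# counter_rate FIRST SECOND SECONDS
# Per-second rate of a cumulative counter between two samples
# (integer division, which is enough for spotting a spike).
counter_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# sample_counter SECTION FIELD
# Reads one counter from a live instance (assumes redis-cli can connect).
sample_counter() {
    redis-cli INFO "$1" | tr -d '\r' | awk -F: -v f="$2" '$1 == f { print $2 }'
}

# Live usage:
#   a=$(sample_counter stats total_connections_received); sleep 10
#   b=$(sample_counter stats total_connections_received)
#   counter_rate "$a" "$b" 10

counter_rate 120000 123500 10   # prints 350
```

A result of 350 new connections per second against a server that normally serves long-lived pooled connections would point toward reconnect churn rather than ordinary traffic.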
# Blocked clients consuming slots without generating throughput
redis-cli INFO clients | grep blocked_clients
A stuck BLPOP, BRPOP, or WAIT holds a slot even while idle.
How to diagnose it
1. Confirm exhaustion. Sum connected_clients, connected_slaves, and cluster_connections. Compare to the effective maxclients limit. If the total equals or exceeds the limit, you have confirmed exhaustion.
2. Classify the connections. Determine how many slots are consumed by replicas, cluster bus, blocked clients, and normal clients. If replicas or cluster connections dominate, the fix is server topology or headroom planning, not application code.
3. Find idle connections. Use CLIENT LIST and sort by the idle field. Leaked connections often show idle times in the thousands of seconds while legitimate traffic keeps other connections active.
4. Map connections to sources. The addr field in CLIENT LIST shows the remote IP and port. If all connections originate from a small set of application hosts, pool sizing is the issue. If they come from hundreds of transient serverless invocations, one connection per invocation is the issue.
5. Check application logs for retry behavior. Look for connection errors followed immediately by another connection attempt. A healthy client should back off exponentially and trip a circuit breaker after consecutive failures.
6. Validate OS limits. Run ulimit -n on the Redis host. If it is lower than maxclients plus internal overhead, Redis has already auto-adjusted downward and your real limit is lower than you think.
7. Review for blocked clients. A stuck blocking command or long-running WAIT can consume a slot indefinitely without generating idle-time alerts.
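Mapping connections to sources is easier with a per-host count. The sketch below groups CLIENT LIST output by client IP; the function name connections_by_source is hypothetical, and the sample lines are invented and abbreviated.

```shell
# connections_by_source
# Reads CLIENT LIST output on stdin and prints "<count> <ip>" for each
# client IP, busiest source first.
connections_by_source() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^addr=/) {            # matches addr=, not laddr=
                ip = substr($i, 6)          # text after "addr="
                sub(/:[0-9]+$/, "", ip)     # drop the port, keep the IP
                count[ip]++
            }
        }
    }
    END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Live usage: redis-cli CLIENT LIST | connections_by_source | head

# Offline example with invented sample lines:
connections_by_source <<'EOF'
id=3 addr=10.0.0.5:51234 laddr=10.0.0.1:6379 fd=8 name= age=900 idle=850 flags=N
id=4 addr=10.0.0.5:51240 laddr=10.0.0.1:6379 fd=9 name= age=880 idle=870 flags=N
id=5 addr=10.0.0.9:40022 laddr=10.0.0.1:6379 fd=10 name= age=12 idle=0 flags=N
EOF
# prints "2 10.0.0.5" followed by "1 10.0.0.9"
```

A handful of IPs holding most of the connections suggests oversized pools; a long tail of single-connection sources suggests short-lived instances that each open their own connection.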
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Connection capacity ratio | maxclients is a cliff-edge limit; you need runway before hitting it | (connected_clients + connected_slaves + cluster_connections) / maxclients > 0.8 |
| rejected_connections rate | Every increment is a failed client operation | Any sustained rate > 0 |
| total_connections_received rate | Churn indicates leaks or retry storms | Sudden spike uncorrelated with deployment |
| blocked_clients | Blocking commands hold slots without generating throughput | Sustained growth above baseline |
| Idle time in CLIENT LIST | The clearest indicator of connection leaks | Idle time > 5 minutes on production traffic paths |
| Application connection error rate | Catches retry storms before they saturate Redis | Error rate climbing while Redis CPU is low |
Fixes
Immediate relief: kill stale connections and raise the ceiling
To restore service immediately, eliminate leaked connections. Use CLIENT LIST to identify stale clients by idle time and addr.
# WARNING: Disconnects clients. Use only after identifying stale connections via CLIENT LIST.
redis-cli CLIENT KILL <ip:port>
If the OS file-descriptor limit allows, raise maxclients temporarily:
# Runtime increase; reverts on restart unless redis.conf is updated.
redis-cli CONFIG SET maxclients 15000
Check ulimit -n first. If the OS limit is 1024, Redis cannot safely accept 15000 connections.
Fix the leak at the source
Connection leaks usually come from missing close() calls in error paths. Success-path metrics often look healthy while error-path connections accumulate silently. Audit application code for connections opened in try blocks that are not closed in finally or equivalent. Add connection budgets per endpoint so a leak in one path cannot exhaust the global pool.
Right-size connection pools
N application instances each with a pool of M connections present N times M connections to Redis. A four-instance deployment with pool size 50 consumes 200 server-side connections before replicas or cluster sockets are counted. Reduce per-instance pool sizes, share pools across workers, or place a proxy between applications and Redis to consolidate connections.
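The pool arithmetic can be checked before a deploy or scale-out. This is a minimal sketch; both function names are hypothetical, and the reserved headroom of 200 and the instance count of 40 are illustrative assumptions.

```shell
# total_pool_connections INSTANCES POOL_SIZE
# Server-side connections consumed when every pool is fully open.
total_pool_connections() {
    echo $(( $1 * $2 ))
}

# max_pool_per_instance MAXCLIENTS RESERVED INSTANCES
# Largest per-instance pool that still leaves RESERVED slots free.
max_pool_per_instance() {
    echo $(( ($1 - $2) / $3 ))
}

total_pool_connections 4 50          # prints 200
max_pool_per_instance 10000 200 40   # prints 245
```

Running the second calculation with the post-scale-out instance count, not the current one, shows whether the existing pool size survives the next deployment.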
Stop retry storms
When Redis refuses a connection, the client must not retry immediately. Implement exponential backoff with jitter and a circuit breaker that trips open after N consecutive failures. Without backoff, each rejected connection attempt generates more load than it resolves.
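A minimal sketch of exponential backoff with full jitter follows. All function names are hypothetical, and the sketch assumes /dev/urandom exists and that sleep accepts fractional seconds. It gives up after a fixed number of consecutive failures as a stand-in for a circuit breaker; a real breaker would also remember the open state across calls and reject attempts until a cooldown expires.

```shell
# backoff_ceiling ATTEMPT BASE_MS CAP_MS
# Upper bound for the wait before retry ATTEMPT: BASE_MS * 2^ATTEMPT, capped.
backoff_ceiling() {
    ceiling=$2
    i=0
    while [ "$i" -lt "$1" ] && [ "$ceiling" -lt "$3" ]; do
        ceiling=$((ceiling * 2))
        i=$((i + 1))
    done
    if [ "$ceiling" -gt "$3" ]; then ceiling=$3; fi
    echo "$ceiling"
}

# jittered_delay ATTEMPT BASE_MS CAP_MS
# Full jitter: a random delay between 0 and the ceiling, so that many
# clients rejected at the same moment do not retry in lockstep.
jittered_delay() {
    max=$(backoff_ceiling "$1" "$2" "$3")
    rand=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    echo $(( rand % (max + 1) ))
}

# retry_with_backoff MAX_FAILURES BASE_MS CAP_MS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or fails MAX_FAILURES times in a row.
retry_with_backoff() {
    max_failures=$1; base_ms=$2; cap_ms=$3
    shift 3
    failures=0
    while :; do
        if "$@"; then return 0; fi
        failures=$((failures + 1))
        if [ "$failures" -ge "$max_failures" ]; then
            echo "giving up after $failures consecutive failures" >&2
            return 1
        fi
        delay_ms=$(jittered_delay "$((failures - 1))" "$base_ms" "$cap_ms")
        sleep "$(awk -v ms="$delay_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
    done
}

# Example (assumes redis-cli is installed):
#   retry_with_backoff 5 100 5000 redis-cli PING
```

With a base of 100 ms and a cap of 5000 ms, the ceilings grow as 100, 200, 400, 800 and so on until they reach the cap, and each actual wait is drawn at random below the ceiling.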
Reserve headroom for internal connections
On primaries with replicas or cluster nodes, reserve at least 100 to 200 slots for connected_slaves and cluster_connections. Sentinel monitoring connections also count toward the limit. If you size maxclients exactly for application traffic, internal topology traffic pushes you over the edge during failovers or reconfigurations.
Enable server-side idle timeout
If you cannot deploy a code fix immediately, set a server-side idle timeout as a safety net.
# WARNING: Disconnects idle clients. Not applied to Pub/Sub clients or blocked clients.
redis-cli CONFIG SET timeout 300
This closes connections idle longer than 300 seconds. It is a safety net, not a substitute for proper pool hygiene.
Prevention
- Connection budgets. Enforce a maximum number of connections per application endpoint so a leak in one path cannot exhaust the global pool.
- Pool sizing discipline. Size pools based on concurrency needs, not instance count. Prefer smaller pools with pipelining over large pools with low utilization.
- Monitor the ratio, not just the absolute count. Alert on (connected_clients + connected_slaves + cluster_connections) / maxclients > 0.8 so you have time to act before the cliff.
- Circuit breakers and backoff. Mandate exponential backoff with jitter and circuit breakers in all Redis clients. Flat or immediate retry logic is an outage amplification mechanism.
- Serverless awareness. Each function instance typically creates its own pool. Size maxclients to expected_concurrent_instances * connections_per_instance + reserved_headroom, and consider connection proxies for high-concurrency serverless workloads.
- Audit OS limits during provisioning. Set ulimit -n well above maxclients plus internal overhead, and verify this at every deployment.
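The sizing formula for maxclients and the matching descriptor floor can be worked through together. This is a sketch with hypothetical function names; the inputs (500 concurrent instances, 2 connections each, 200 reserved slots) are invented for illustration.

```shell
# required_maxclients INSTANCES CONNS_PER_INSTANCE RESERVED_HEADROOM
required_maxclients() {
    echo $(( $1 * $2 + $3 ))
}

# required_fd_limit MAXCLIENTS
# Minimum ulimit -n, assuming the 32 descriptors Redis reserves internally.
required_fd_limit() {
    echo $(( $1 + 32 ))
}

required_maxclients 500 2 200   # prints 1200
required_fd_limit 1200          # prints 1232
```

The second number is a floor, not a target: provisioning the descriptor limit well above it leaves room to raise maxclients at runtime during an incident.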
How Netdata helps
- Correlate rejected_connections with application error spikes to confirm the cascade pattern.
- Alert on connection capacity ratio > 0.8 by combining connected_clients, connected_slaves, and cluster_connections against maxclients.
- Track total_connections_received rate to detect abnormal churn before the hard limit is reached.
- Surface blocked_clients alongside connection metrics to distinguish active traffic from idle slot consumption.
- Cross-reference connection exhaustion with memory metrics to spot output-buffer bloat from slow consumers.
Related guides
- How Redis actually works in production: a mental model for operators
- Redis aof_last_write_status:err: AOF write failures and recovery
- Redis appendfsync always latency: durability vs throughput trade-offs
- Redis BUSY Redis is busy running a script: blocking Lua and how to recover
- Redis Can’t save in background: fork: Cannot allocate memory - diagnosis and fix
- Redis event loop blocked: when one slow command freezes everything
- Redis eviction policy tuning: allkeys-lru vs volatile-ttl vs noeviction
- Redis fork/COW memory storm: why persistence doubles RSS and OOM-kills the box
- Redis KEYS command blocking production: why to replace it with SCAN
- Redis latency spikes: diagnosis with the LATENCY subsystem
- Redis latest_fork_usec too high: THP, NUMA, and fork latency
- Redis maxmemory not set: why every production instance needs a memory limit







