Redis connection exhaustion: leaks, pools, and the retry storm

Application logs show connection timeouts and Redis returns ERR max number of clients reached. Downstream services fail because they cannot reach the cache. INFO clients shows connected_clients at the hard limit even though traffic has not increased. This is connection exhaustion. The most dangerous response is a retry storm that turns a small leak into a site-wide cascade.

Redis enforces a hard upper bound on connections via maxclients. When the sum of connected_clients, connected_slaves, and cluster_connections reaches that limit, Redis rejects every new TCP connection. Applications that retry immediately without backoff create a feedback loop: existing connections age out slowly while new attempts pile up, keeping the server pinned at the limit even after the original leak stops growing.

What this means

maxclients is a server-side hard stop. The default is 10000, but Redis is also bound by the OS file-descriptor limit (ulimit -n). At startup Redis tries to raise its own descriptor limit to maxclients plus the 32 file descriptors it reserves for internal use. If the OS refuses, Redis lowers maxclients to fit and logs a warning. You can therefore hit the ceiling well below the value written in redis.conf. CONFIG GET maxclients reports the effective limit, so check the running server rather than the config file.
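
As a rough worked example of that adjustment, the sketch below computes the limit that fits inside a given file-descriptor ceiling. The function name and the single-subtraction model are assumptions for illustration; it does not reproduce Redis's actual startup code.

```python
RESERVED_FDS = 32  # descriptors Redis keeps for internal use, per the text above


def effective_maxclients(configured: int, fd_limit: int) -> int:
    """Return the client limit that fits inside fd_limit.

    Simplified model: if the descriptor limit cannot cover
    configured + RESERVED_FDS, the client limit shrinks to fit.
    """
    if fd_limit >= configured + RESERVED_FDS:
        return configured
    return max(fd_limit - RESERVED_FDS, 0)


print(effective_maxclients(10000, 65536))  # 10000
print(effective_maxclients(10000, 4096))   # 4064
print(effective_maxclients(10000, 1024))   # 992
```

With a descriptor limit of 1024, a configured maxclients of 10000 leaves room for fewer than a thousand connections.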

The capacity calculation must include every connection type. On a primary with replicas and cluster bus sockets, the capacity left for ordinary clients is maxclients - connected_slaves - cluster_connections. Monitoring only connected_clients understates usage and delays intervention.

Once the limit is reached, Redis increments rejected_connections, replies with ERR max number of clients reached, and closes the new connection. Naive client behavior turns this into an amplification attack against your own infrastructure.

flowchart TD
  A[Connection leak or pool oversize] --> B[connected_clients approaches maxclients]
  B --> C[Redis rejects new connections]
  C --> D[Application error and immediate retry]
  D --> E[More connection attempts]
  E --> C
  C --> F[Monitoring and admin connections blocked]
  F --> G[Downstream cascade]

Common causes

| Cause | What it looks like | First thing to check |
|---|---|---|
| Connection leak | connected_clients climbs steadily while application throughput is flat; old connections show high idle seconds in CLIENT LIST | CLIENT LIST filtered by idle time |
| Pool misconfiguration | connected_clients near maxclients immediately after a deploy or scale-out; application instances each hold a fixed pool size | Application pool size multiplied by instance count |
| Retry storm | rejected_connections spikes rapidly; connected_clients oscillates near the limit as old connections time out and new attempts flood in | Application logs for immediate reconnects without backoff |
| maxclients too low for workload | rejected_connections on an otherwise healthy instance with legitimate traffic growth | CONFIG GET maxclients and OS ulimit -n |
| Sentinel or cluster internal connections | Unexpectedly high connection count on a primary after enabling Sentinel or joining a cluster | INFO replication for connected_slaves and INFO clients for cluster_connections |

Quick checks

Use these read-only commands to assess the situation without making changes.

# Full connection accounting against the hard limit
redis-cli INFO clients | grep -E "connected_clients|cluster_connections"
redis-cli INFO replication | grep connected_slaves
redis-cli CONFIG GET maxclients

Calculate the ratio manually: (connected_clients + connected_slaves + cluster_connections) / maxclients. If this is above 0.9, you are in the danger zone.
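
The manual calculation can be scripted. The following is a minimal sketch that parses INFO-style text and applies the formula above; the helper names and the sample values are made up for illustration, and in practice the text would come from redis-cli or a client library's INFO call.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse `key:value` lines from INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def capacity_ratio(info: dict[str, str], maxclients: int) -> float:
    """(connected_clients + connected_slaves + cluster_connections) / maxclients."""
    used = sum(
        int(info.get(name, 0))
        for name in ("connected_clients", "connected_slaves", "cluster_connections")
    )
    return used / maxclients


# Made-up sample values, shaped like INFO clients + INFO replication output.
sample = """# Clients
connected_clients:9100
cluster_connections:40
blocked_clients:3
# Replication
role:master
connected_slaves:2
"""

ratio = capacity_ratio(parse_info(sample), maxclients=10000)
print(f"{ratio:.3f}")  # 0.914
```

Fields that are absent (for example cluster_connections on a server that does not report it) are counted as zero rather than raising an error.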

# Rejected connections since last restart (cumulative counter)
redis-cli INFO stats | grep rejected_connections

Any rate of increase here is an active incident.

# Idle connection audit to spot leaks
redis-cli CLIENT LIST | awk -F'[ =]' '{for(i=1;i<=NF;i++) {if($i=="id") id=$(i+1); if($i=="addr") addr=$(i+1); if($i=="idle") idle=$(i+1)} if(idle+0 > 300) print id, addr, idle"s"}'

Connections idle longer than your typical request lifetime are likely leaked.

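# OS-level file descriptor ceiling (run on the Redis host)
ulimit -n
# ulimit -n reports the current shell's limit; on Linux, read the running server's own limit
# (assumes the process is named redis-server)
grep "open files" /proc/$(pgrep -o -x redis-server)/limits

Compare this to maxclients. If the limit is lower than maxclients plus the 32 descriptors Redis reserves, the OS is the real bottleneck.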

# Connection churn rate
redis-cli INFO stats | grep total_connections_received

Sample this twice, 10 seconds apart. A rapidly climbing counter during an outage suggests a retry storm.
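
Turning the two samples into a rate is simple arithmetic, sketched below. The function name and sample numbers are illustrative assumptions. The reset handling reflects that these counters are cumulative since the last restart, and the same calculation applies to rejected_connections.

```python
def counter_rate(first: int, second: int, interval_s: float) -> float:
    """Per-second rate between two samples of a cumulative counter.

    A second sample lower than the first means the server restarted and the
    counter reset, so the second sample is treated as the whole delta.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = second - first if second >= first else second
    return delta / interval_s


# Two made-up samples of total_connections_received, taken 10 seconds apart.
print(counter_rate(1_204_500, 1_212_500, 10))  # 800.0
```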

# Blocked clients consuming slots without generating throughput
redis-cli INFO clients | grep blocked_clients

A stuck BLPOP, BRPOP, or WAIT holds a slot even while idle.

How to diagnose it

  1. Confirm exhaustion. Sum connected_clients, connected_slaves, and cluster_connections. Compare to the effective maxclients limit. If the total equals or exceeds the limit, you have confirmed exhaustion.

  2. Classify the connections. Determine how many slots are consumed by replicas, cluster bus, blocked clients, and normal clients. If replicas or cluster connections dominate, the fix is server topology or headroom planning, not application code.

  3. Find idle connections. Use CLIENT LIST and sort by the idle field. Leaked connections often show idle times in the thousands of seconds while legitimate traffic keeps other connections active.

  4. Map connections to sources. The addr field in CLIENT LIST shows remote IPs. If all connections originate from a small set of application hosts, pool sizing is the issue. If they come from hundreds of transient serverless invocations, the one-connection-per-invocation pattern is the issue.

  5. Check application logs for retry behavior. Look for connection errors followed immediately by another connection attempt. A healthy client should back off exponentially and trip a circuit breaker after consecutive failures.

  6. Validate OS limits. Run ulimit -n on the Redis host. If it is lower than maxclients plus internal overhead, Redis has already auto-adjusted downward and your real limit is lower than you think.

  7. Review for blocked clients. A stuck blocking command or long-running WAIT can consume a slot indefinitely without generating idle-time alerts.
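
Steps 3, 4, and 7 can be approximated offline from a saved CLIENT LIST dump. The sketch below is one way to do it, with hypothetical helper names and made-up, trimmed sample lines; real CLIENT LIST output carries more fields per client, which the parser simply keeps.

```python
from collections import Counter


def parse_client_list(text):
    """Turn CLIENT LIST output into one {field: value} dict per client."""
    clients = []
    for line in text.splitlines():
        if line.strip():
            clients.append(dict(tok.partition("=")[::2] for tok in line.split()))
    return clients


def stale_by_host(clients, min_idle_s=300):
    """Count clients idle longer than min_idle_s, grouped by remote host."""
    counts = Counter()
    for client in clients:
        if int(client.get("idle", 0)) > min_idle_s:
            # rsplit keeps IPv6 hosts such as [::1] intact while dropping the port.
            counts[client.get("addr", "unknown").rsplit(":", 1)[0]] += 1
    return counts


def blocked_count(clients):
    """Clients whose flags include 'b' are waiting in a blocking command."""
    return sum("b" in client.get("flags", "") for client in clients)


# Made-up, trimmed sample lines for illustration.
sample = """\
id=11 addr=10.0.0.5:50112 fd=9 name= age=9000 idle=8700 flags=N db=0 cmd=get
id=12 addr=10.0.0.5:50114 fd=10 name= age=8000 idle=7900 flags=N db=0 cmd=get
id=13 addr=10.0.0.6:40100 fd=11 name= age=120 idle=1 flags=N db=0 cmd=set
id=14 addr=10.0.0.7:41000 fd=12 name=worker age=600 idle=590 flags=b db=0 cmd=blpop
"""

clients = parse_client_list(sample)
print(stale_by_host(clients).most_common())  # [('10.0.0.5', 2), ('10.0.0.7', 1)]
print(blocked_count(clients))                # 1
```

A host that dominates the stale count is the first place to look for a leak. A stale client that is also blocked, like id=14 here, is more likely a stuck consumer than a leaked handle.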

Metrics and signals to monitor

| Signal | Why it matters | Warning sign |
|---|---|---|
| Connection capacity ratio | maxclients is a cliff-edge limit; you need runway before hitting it | (connected_clients + connected_slaves + cluster_connections) / maxclients > 0.8 |
| rejected_connections rate | Every increment is a failed client operation | Any sustained rate > 0 |
| total_connections_received rate | Churn indicates leaks or retry storms | Sudden spike uncorrelated with deployment |
| blocked_clients | Blocking commands hold slots without generating throughput | Sustained growth above baseline |
| Idle time in CLIENT LIST | The indicator for connection leaks | Idle time > 5 minutes on production traffic paths |
| Application connection error rate | Catches retry storms before they saturate Redis | Error rate climbing while Redis CPU is low |

Fixes

Immediate relief: kill stale connections and raise the ceiling

To restore service immediately, eliminate leaked connections. Use CLIENT LIST to identify stale clients by idle time and addr.

# WARNING: Disconnects clients. Use only after identifying stale connections via CLIENT LIST.
redis-cli CLIENT KILL <ip:port>

If the OS file-descriptor limit allows, raise maxclients temporarily:

# Runtime increase; reverts on restart unless redis.conf is updated.
redis-cli CONFIG SET maxclients 15000

Check ulimit -n first. If the OS limit is 1024, Redis cannot safely accept 15000 connections.

Fix the leak at the source

Connection leaks usually come from missing close() calls in error paths. Success-path metrics often look healthy while error-path connections accumulate silently. Audit application code for connections opened in try blocks that are not closed in finally or equivalent. Add connection budgets per endpoint so a leak in one path cannot exhaust the global pool.
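
The error-path leak is easy to reproduce in miniature. In the sketch below, FakeConnection is a hypothetical stand-in for a real client connection, not any library's API; it only counts open handles so the difference between the two read functions is visible.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a client connection; tracks open handles."""

    open_count = 0

    def __init__(self):
        FakeConnection.open_count += 1

    def get(self, key):
        raise TimeoutError("simulated failure on the error path")

    def close(self):
        FakeConnection.open_count -= 1


def leaky_read(key):
    conn = FakeConnection()
    value = conn.get(key)  # raises, so the close() below never runs
    conn.close()
    return value


def safe_read(key):
    # closing() calls conn.close() when the block exits, even on exceptions.
    with closing(FakeConnection()) as conn:
        return conn.get(key)


for read in (leaky_read, safe_read):
    before = FakeConnection.open_count
    try:
        read("user:42")
    except TimeoutError:
        pass
    print(read.__name__, "leaked", FakeConnection.open_count - before)
# leaky_read leaked 1
# safe_read leaked 0
```

With a pooled client the same principle applies to returning the connection to the pool: the release must sit in a finally block or context manager, not after the call that can fail.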

Right-size connection pools

N application instances each with a pool of M connections present N times M connections to Redis. A four-instance deployment with pool size 50 consumes 200 server-side connections before replicas or cluster sockets are counted. Reduce per-instance pool sizes, share pools across workers, or place a proxy between applications and Redis to consolidate connections.
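
The arithmetic can also be run in reverse to find the largest pool each instance can hold. This is a sketch under stated assumptions: the function names are hypothetical, the 200 reserved slots and the 80 percent target are borrowed from the headroom and alerting guidance elsewhere in this article, and the replica and cluster counts in the second call are made up.

```python
def total_connections(instances, pool_size, replicas=0, cluster_links=0):
    """Server-side connections presented by a deployment: N * M plus internals."""
    return instances * pool_size + replicas + cluster_links


def max_pool_size(maxclients, instances, reserved=200, target_percent=80):
    """Largest per-instance pool that keeps usage under target_percent of maxclients."""
    budget = maxclients * target_percent // 100 - reserved
    return max(budget // instances, 0)


print(total_connections(4, 50))                                # 200
print(total_connections(4, 50, replicas=2, cluster_links=10))  # 212
print(max_pool_size(10000, instances=40))                      # 195
```

The result is an upper bound, not a recommendation; a pool sized to actual request concurrency is usually far smaller.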

Stop retry storms

When Redis refuses a connection, the client must not retry immediately. Implement exponential backoff with jitter and a circuit breaker that trips open after N consecutive failures. Without backoff, each rejected connection attempt generates more load than it resolves.
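
One possible shape for this, as a minimal sketch rather than a production implementation: exponential backoff with full jitter, wrapped in a simple consecutive-failure circuit breaker. All names, thresholds, and delays are illustrative assumptions, and the built-in ConnectionError stands in for whatever exception your client library raises when Redis refuses the connection.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no attempt is made."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a probe attempt through (half-open).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def backoff_delay(attempt, base_s=0.1, cap_s=10.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap_s, base_s * 2**attempt)."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def connect_with_backoff(connect, breaker, max_attempts=6, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("breaker open; not contacting Redis")
        try:
            conn = connect()
        except ConnectionError:
            breaker.record_failure()
            sleep(backoff_delay(attempt))
            continue
        breaker.record_success()
        return conn
    raise ConnectionError(f"gave up after {max_attempts} attempts")


# Demo: a connect function that is refused three times, then succeeds.
attempts = {"n": 0}


def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise ConnectionError("ERR max number of clients reached")
    return "connection"


delays = []  # capture the delays instead of actually sleeping
conn = connect_with_backoff(flaky_connect, CircuitBreaker(), sleep=delays.append)
print(conn, len(delays))  # connection 3
```

The jitter matters as much as the exponent: without it, every client that was rejected at the same moment retries at the same moment. Sharing one breaker per process, rather than one per request, keeps an instance from probing Redis from every worker at once.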

Reserve headroom for internal connections

On primaries with replicas or cluster nodes, reserve at least 100 to 200 slots for connected_slaves and cluster_connections. Sentinel monitoring connections also count toward the limit. If you size maxclients exactly for application traffic, internal topology traffic pushes you over the edge during failovers or reconfigurations.

Enable server-side idle timeout

If you cannot deploy a code fix immediately, set a server-side idle timeout as a safety net.

# WARNING: Disconnects idle clients. Not applied to Pub/Sub clients or blocked clients.
redis-cli CONFIG SET timeout 300

This closes connections idle longer than 300 seconds. It is a safety net, not a substitute for proper pool hygiene.

Prevention

  • Connection budgets. Enforce a maximum number of connections per application endpoint so a leak in one path cannot exhaust the global pool.
  • Pool sizing discipline. Size pools based on concurrency needs, not instance count. Prefer smaller pools with pipelining over large pools with low utilization.
  • Monitor the ratio, not just the absolute count. Alert on (connected_clients + connected_slaves + cluster_connections) / maxclients > 0.8 so you have time to act before the cliff.
  • Circuit breakers and backoff. Mandate exponential backoff with jitter and circuit breakers in all Redis clients. Flat or immediate retry logic is an outage amplification mechanism.
  • Serverless awareness. Each function instance typically creates its own pool. Size maxclients to expected_concurrent_instances * connections_per_instance + reserved_headroom, and consider connection proxies for high-concurrency serverless workloads.
  • Audit OS limits during provisioning. Set ulimit -n well above maxclients plus internal overhead, and verify this at every deployment.
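
The serverless sizing formula and the OS-limit audit from the list above can be combined into a provisioning check. This is a sketch with hypothetical function names and made-up workload numbers; the 32 reserved descriptors come from the earlier discussion of the file-descriptor limit.

```python
RESERVED_FDS = 32  # descriptors Redis reserves for internal use


def required_maxclients(expected_concurrent_instances, connections_per_instance,
                        reserved_headroom):
    """maxclients needed to absorb peak serverless concurrency plus headroom."""
    return expected_concurrent_instances * connections_per_instance + reserved_headroom


def fd_limit_is_sufficient(fd_limit, maxclients):
    """True when the descriptor limit covers maxclients plus Redis's reserve."""
    return fd_limit >= maxclients + RESERVED_FDS


needed = required_maxclients(3000, 2, 200)
print(needed)                                 # 6200
print(fd_limit_is_sufficient(4096, needed))   # False
print(fd_limit_is_sufficient(65536, needed))  # True
```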

How Netdata helps

  • Correlate rejected_connections with application error spikes to confirm the cascade pattern.
  • Alert on connection capacity ratio > 0.8 by combining connected_clients, connected_slaves, and cluster_connections against maxclients.
  • Track total_connections_received rate to detect abnormal churn before the hard limit is reached.
  • Surface blocked_clients alongside connection metrics to distinguish active traffic from idle slot consumption.
  • Cross-reference connection exhaustion with memory metrics to spot output-buffer bloat from slow consumers.