PostgreSQL connection exhaustion: detection, diagnosis, and prevention

Application logs show FATAL: sorry, too many clients already. Health checks are failing. A rolling deploy just finished, and now the database is rejecting connections. Because PostgreSQL uses one process per connection, every slot consumes memory and scheduler overhead. Once max_connections is reached, the server refuses new backends entirely. Raising max_connections without fixing the root cause increases memory pressure and context-switch thrashing. This guide covers how to distinguish a true capacity shortage from a leak or pool misconfiguration, how to recover safely, and how to prevent recurrence.

What this means

max_connections is a hard limit. Changing it requires a server restart. The default is 100, including all client backends, replication connections, and reserved superuser slots. When the limit is reached, PostgreSQL returns a fatal error to every new connection attempt. PostgreSQL has no built-in connection pooling: every application connection becomes a dedicated backend process. Without an external pooler such as PgBouncer, connection churn from deploys, health checks, or retry storms can exhaust slots in seconds.

Connection exhaustion is not only about total count. The state of those connections matters. An idle backend holds a slot and memory but no locks. An idle in transaction backend also holds a snapshot, blocks VACUUM, and can accelerate table bloat and transaction-ID wraparound pressure. An active backend running a slow query or waiting on a lock consumes a slot while a queue builds behind it.

flowchart TD
  A["Connection errors: too many clients"] --> B{"pg_stat_activity count >= max_connections?"}
  B -->|Yes| C["Server limit hit"]
  B -->|No| D["Pooler limit hit"]
  C --> E{"Dominant state?"}
  E -->|"idle in transaction"| F["App leak or missing timeout"]
  E -->|"idle"| G["Pool size too large or missing pooler"]
  E -->|"active"| H["Slow queries or lock waits"]
  D --> I["Check PgBouncer SHOW POOLS"]
  F --> J["Terminate idle; set idle_in_transaction_session_timeout"]
  G --> K["Deploy PgBouncer; reduce app pool size"]
  H --> L["Tune query or kill blocker"]

Common causes

CauseWhat it looks likeFirst thing to check
Application connection leakidle in transaction count grows monotonically; errors during steady-state trafficpg_stat_activity filtered by state = 'idle in transaction' and application_name
Missing or misconfigured poolermax_connections hit during deploys or spikes; hundreds of idle backends visible directly in PostgreSQLWhether PgBouncer or an equivalent pooler is deployed, and its SHOW POOLS output
Idle-in-transaction sessionsOld state_change timestamps; VACUUM blocked; accelerating table bloatpg_stat_activity ordered by state_change for idle in transaction rows
Thundering herd on restart or deployConnection count spikes immediately after a rolling deploy or primary restart; spike correlates with deploy timestampsCorrelation between deploy events and pg_stat_activity count
Long-running queries or lock waitsactive connections with old query_start; a queue of ungranted locks in pg_lockspg_locks where granted = false, joined to pg_stat_activity

Quick checks

Run these read-only checks before making any changes.

# Check current connection count grouped by state
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# Check the hard server limit and reserved slots
psql -c "SHOW max_connections; SHOW superuser_reserved_connections;"
# Identify idle-in-transaction sessions and their age
psql -c "SELECT pid, usename, application_name, state, now() - state_change AS idle_time FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY idle_time DESC;"
# Check for lock waits that inflate required concurrency
psql -c "SELECT l.pid, a.usename, l.mode, l.granted, a.query FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE NOT l.granted LIMIT 10;"
# Inspect PgBouncer pool state if a pooler is in use
psql -p 6432 pgbouncer -c "SHOW POOLS;"
# Attribute connections by application source
psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count(*) DESC;"

How to diagnose it

  1. Confirm the boundary. Compare count(*) FROM pg_stat_activity against max_connections. If the server limit is not yet reached, the exhaustion may be happening in an intermediate pooler or application-side pool.
  2. Classify states. High idle suggests the application is opening connections it does not need, or the pooler is not multiplexing. High idle in transaction points to an application logic leak. High active with old query_start indicates slow queries or lock contention.
  3. Attribute by source. Use application_name and client_addr to find the specific service, host, or batch job responsible for the spike.
  4. Check for lock pressure. Query pg_locks for ungranted locks. A single blocked UPDATE can hold a row lock while a dozen connections queue behind it, artificially inflating the required connection count.
  5. Correlate with events. Match the timestamp of the spike against deploys, restarts, cron jobs, or ETL batches. A rolling deploy that opens new connections before draining old ones creates a transient doubling of slots.
  6. Inspect the pooler. If PgBouncer exists, run SHOW POOLS. Check the cl_waiting column. If clients are waiting, the pooler is saturated. If client and server active counts are nearly equal, the pooler is not multiplexing effectively, or the pool size is too large.

Metrics and signals to monitor

SignalWhy it mattersWarning sign
Connection utilization (count(*) / max_connections)Reaching the hard limit causes immediate fatal application errorsSustained utilization above 80%
Idle-in-transaction countHolds snapshots, blocks VACUUM, and wastes slots without doing workAny session in this state for more than 5 minutes
Pooler client vs. server active ratioReveals whether exhaustion is at the pooler or the databasecl_waiting greater than zero, or client and server counts near parity in transaction mode
Lock wait queue depthQueued locks increase the number of concurrent connections needed to maintain throughputUngranted locks persisting longer than 10 seconds
Connections by application_nameIdentifies which service is leaking or over-connectingA single application name consuming more than 50% of slots

Fixes

Terminate unsafe idle sessions for immediate relief

If idle in transaction sessions are consuming slots, terminate them to free capacity and unblock VACUUM.

-- Terminate idle-in-transaction sessions older than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < NOW() - INTERVAL '5 minutes';

Warning: This aborts open transactions and requires privileges to terminate backends. Any in-flight application request will fail and roll back. Do not terminate actively running queries unless you have confirmed they are safe to abort.

Deploy or reconfigure connection pooling

If connections are hitting PostgreSQL directly, deploy PgBouncer in transaction mode. Size the PostgreSQL-side pool to roughly 2-3 times the number of CPU cores, and set server_reset_query = DISCARD ALL to clear session state between transactions.

Tradeoff: Transaction mode breaks session-level features such as advisory locks, LISTEN/NOTIFY, temporary tables, and prepared statements that span multiple transactions. If your application requires these, use session mode and accept lower efficiency.

Fix application connection leaks

Set idle_in_transaction_session_timeout to 2 minutes. On PostgreSQL 14 and later, also set idle_session_timeout to 30 minutes. Audit application code to ensure transactions are committed or rolled back promptly, including in exception paths.

Tradeoff: Aggressive timeouts may abort legitimate long-running transactions. Tune the threshold based on your longest expected interactive transaction, and set stricter limits on batch and health-check roles.

Eliminate thundering herd on deploy

Implement connection draining so old instances release backends before new instances connect. Stagger rolling restarts. Reduce health-check frequency if each check opens a new connection.

Tradeoff: This requires coordination between application deploy pipelines and the infrastructure layer. It is easier to enforce when a pooler sits in front of the database.

Increase max_connections only as a last resort

If a pooler is already in place and the legitimate active workload still exceeds the limit, calculate a new max_connections based on pool_size multiplied by the number of poolers, plus replication connections, superuser reserved slots, and a small spare margin. Then restart PostgreSQL.

Tradeoff: Higher max_connections increases per-process memory overhead and OS scheduler pressure. Do not raise the limit above 200-300 without aggressive pooling and memory validation.

Prevention

  • PgBouncer in transaction mode. Multiplex hundreds of client connections over a small pool of PostgreSQL backends.
  • idle_in_transaction_session_timeout. Set to 2 minutes to prevent abandoned transactions from holding slots and blocking VACUUM.
  • Connection count alerts. Page when connection utilization exceeds 70% so you can investigate before hard rejection begins.
  • Deploy-time connection draining. Ensure rolling deployments close old connections before opening new ones to avoid double-counting slots.
  • Application-side pool limits. Cap application pool sizes so the aggregate across all instances stays well below the server or pooler limit.

How Netdata helps

  • Correlates PostgreSQL connection count with application error rates to distinguish database saturation from pooler misconfiguration.
  • Tracks pg_stat_activity state breakdowns (active, idle, idle in transaction).
  • Alerts on connection utilization percentage before max_connections is reached.
  • Surfaces memory pressure and context-switch rates, which typically increase when max_connections is raised without pooling.