PostgreSQL connection exhaustion: detection, diagnosis, and prevention
Application logs show FATAL: sorry, too many clients already. Health checks are failing. A rolling deploy just finished, and now the database is rejecting connections. Because PostgreSQL uses one process per connection, every slot consumes memory and scheduler overhead. Once max_connections is reached, the server refuses new backends entirely. Raising max_connections without fixing the root cause increases memory pressure and context-switch thrashing. This guide covers how to distinguish a true capacity shortage from a leak or pool misconfiguration, how to recover safely, and how to prevent recurrence.
What this means
max_connections is a hard limit. Changing it requires a server restart. The default is 100, including all client backends, replication connections, and reserved superuser slots. When the limit is reached, PostgreSQL returns a fatal error to every new connection attempt. PostgreSQL has no built-in connection pooling: every application connection becomes a dedicated backend process. Without an external pooler such as PgBouncer, connection churn from deploys, health checks, or retry storms can exhaust slots in seconds.
Connection exhaustion is not only about total count. The state of those connections matters. An idle backend holds a slot and memory but no locks. An idle in transaction backend also holds a snapshot, blocks VACUUM, and can accelerate table bloat and transaction-ID wraparound pressure. An active backend running a slow query or waiting on a lock consumes a slot while a queue builds behind it.
flowchart TD
A["Connection errors: too many clients"] --> B{"pg_stat_activity count >= max_connections?"}
B -->|Yes| C["Server limit hit"]
B -->|No| D["Pooler limit hit"]
C --> E{"Dominant state?"}
E -->|"idle in transaction"| F["App leak or missing timeout"]
E -->|"idle"| G["Pool size too large or missing pooler"]
E -->|"active"| H["Slow queries or lock waits"]
D --> I["Check PgBouncer SHOW POOLS"]
F --> J["Terminate idle; set idle_in_transaction_session_timeout"]
G --> K["Deploy PgBouncer; reduce app pool size"]
H --> L["Tune query or kill blocker"]Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Application connection leak | idle in transaction count grows monotonically; errors during steady-state traffic | pg_stat_activity filtered by state = 'idle in transaction' and application_name |
| Missing or misconfigured pooler | max_connections hit during deploys or spikes; hundreds of idle backends visible directly in PostgreSQL | Whether PgBouncer or an equivalent pooler is deployed, and its SHOW POOLS output |
| Idle-in-transaction sessions | Old state_change timestamps; VACUUM blocked; accelerating table bloat | pg_stat_activity ordered by state_change for idle in transaction rows |
| Thundering herd on restart or deploy | Connection count spikes immediately after a rolling deploy or primary restart; spike correlates with deploy timestamps | Correlation between deploy events and pg_stat_activity count |
| Long-running queries or lock waits | active connections with old query_start; a queue of ungranted locks in pg_locks | pg_locks where granted = false, joined to pg_stat_activity |
Quick checks
Run these read-only checks before making any changes.
# Check current connection count grouped by state
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
# Check the hard server limit and reserved slots
psql -c "SHOW max_connections; SHOW superuser_reserved_connections;"
# Identify idle-in-transaction sessions and their age
psql -c "SELECT pid, usename, application_name, state, now() - state_change AS idle_time FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY idle_time DESC;"
# Check for lock waits that inflate required concurrency
psql -c "SELECT l.pid, a.usename, l.mode, l.granted, a.query FROM pg_locks l JOIN pg_stat_activity a ON l.pid = a.pid WHERE NOT l.granted LIMIT 10;"
# Inspect PgBouncer pool state if a pooler is in use
psql -p 6432 pgbouncer -c "SHOW POOLS;"
# Attribute connections by application source
psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count(*) DESC;"
How to diagnose it
- Confirm the boundary. Compare
count(*) FROM pg_stat_activityagainstmax_connections. If the server limit is not yet reached, the exhaustion may be happening in an intermediate pooler or application-side pool. - Classify states. High
idlesuggests the application is opening connections it does not need, or the pooler is not multiplexing. Highidle in transactionpoints to an application logic leak. Highactivewith oldquery_startindicates slow queries or lock contention. - Attribute by source. Use
application_nameandclient_addrto find the specific service, host, or batch job responsible for the spike. - Check for lock pressure. Query
pg_locksfor ungranted locks. A single blockedUPDATEcan hold a row lock while a dozen connections queue behind it, artificially inflating the required connection count. - Correlate with events. Match the timestamp of the spike against deploys, restarts, cron jobs, or ETL batches. A rolling deploy that opens new connections before draining old ones creates a transient doubling of slots.
- Inspect the pooler. If PgBouncer exists, run
SHOW POOLS. Check thecl_waitingcolumn. If clients are waiting, the pooler is saturated. If client and server active counts are nearly equal, the pooler is not multiplexing effectively, or the pool size is too large.
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
Connection utilization (count(*) / max_connections) | Reaching the hard limit causes immediate fatal application errors | Sustained utilization above 80% |
| Idle-in-transaction count | Holds snapshots, blocks VACUUM, and wastes slots without doing work | Any session in this state for more than 5 minutes |
| Pooler client vs. server active ratio | Reveals whether exhaustion is at the pooler or the database | cl_waiting greater than zero, or client and server counts near parity in transaction mode |
| Lock wait queue depth | Queued locks increase the number of concurrent connections needed to maintain throughput | Ungranted locks persisting longer than 10 seconds |
Connections by application_name | Identifies which service is leaking or over-connecting | A single application name consuming more than 50% of slots |
Fixes
Terminate unsafe idle sessions for immediate relief
If idle in transaction sessions are consuming slots, terminate them to free capacity and unblock VACUUM.
-- Terminate idle-in-transaction sessions older than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < NOW() - INTERVAL '5 minutes';
Warning: This aborts open transactions and requires privileges to terminate backends. Any in-flight application request will fail and roll back. Do not terminate actively running queries unless you have confirmed they are safe to abort.
Deploy or reconfigure connection pooling
If connections are hitting PostgreSQL directly, deploy PgBouncer in transaction mode. Size the PostgreSQL-side pool to roughly 2-3 times the number of CPU cores, and set server_reset_query = DISCARD ALL to clear session state between transactions.
Tradeoff: Transaction mode breaks session-level features such as advisory locks, LISTEN/NOTIFY, temporary tables, and prepared statements that span multiple transactions. If your application requires these, use session mode and accept lower efficiency.
Fix application connection leaks
Set idle_in_transaction_session_timeout to 2 minutes. On PostgreSQL 14 and later, also set idle_session_timeout to 30 minutes. Audit application code to ensure transactions are committed or rolled back promptly, including in exception paths.
Tradeoff: Aggressive timeouts may abort legitimate long-running transactions. Tune the threshold based on your longest expected interactive transaction, and set stricter limits on batch and health-check roles.
Eliminate thundering herd on deploy
Implement connection draining so old instances release backends before new instances connect. Stagger rolling restarts. Reduce health-check frequency if each check opens a new connection.
Tradeoff: This requires coordination between application deploy pipelines and the infrastructure layer. It is easier to enforce when a pooler sits in front of the database.
Increase max_connections only as a last resort
If a pooler is already in place and the legitimate active workload still exceeds the limit, calculate a new max_connections based on pool_size multiplied by the number of poolers, plus replication connections, superuser reserved slots, and a small spare margin. Then restart PostgreSQL.
Tradeoff: Higher max_connections increases per-process memory overhead and OS scheduler pressure. Do not raise the limit above 200-300 without aggressive pooling and memory validation.
Prevention
- PgBouncer in transaction mode. Multiplex hundreds of client connections over a small pool of PostgreSQL backends.
- idle_in_transaction_session_timeout. Set to 2 minutes to prevent abandoned transactions from holding slots and blocking
VACUUM. - Connection count alerts. Page when connection utilization exceeds 70% so you can investigate before hard rejection begins.
- Deploy-time connection draining. Ensure rolling deployments close old connections before opening new ones to avoid double-counting slots.
- Application-side pool limits. Cap application pool sizes so the aggregate across all instances stays well below the server or pooler limit.
How Netdata helps
- Correlates PostgreSQL connection count with application error rates to distinguish database saturation from pooler misconfiguration.
- Tracks
pg_stat_activitystate breakdowns (active, idle, idle in transaction). - Alerts on connection utilization percentage before
max_connectionsis reached. - Surfaces memory pressure and context-switch rates, which typically increase when
max_connectionsis raised without pooling.






