PostgreSQL split-brain after failover: detection and reconciliation
A failover should leave one primary and a clean topology. Split-brain means two instances accept writes, application connections are split across both nodes, and transaction histories diverge. It typically starts with a network partition, a missed demotion signal, or a health-check false positive that promotes a replica while the old primary keeps running. Once both primaries accept transactions, timelines diverge and you must reconcile.
What this means
In streaming replication, promotion creates a new timeline. The old primary must stop accepting writes and rejoin as a replica or stay offline. Split-brain occurs when the old primary continues to run read-write after promotion. Each primary generates its own WAL stream in the same LSN space but on different timelines. Replicas cannot follow both. Applications that reconnect to the old primary write data that does not exist on the new primary. The longer both primaries stay active, the larger the divergence and the more likely you must rebuild instead of rewind.
flowchart TD
A[Failover triggered] --> B{Two primaries accepting writes?}
B -->|Yes| C[Split-brain confirmed]
B -->|No| D[Standard failover]
C --> E[Stop old primary immediately]
E --> F{pg_rewind viable?}
F -->|Yes| G[Snapshot data directory]
G --> H[Run pg_rewind to new primary]
H --> I[Start as replica]
F -->|No| J[Rebuild from pg_basebackup]
J --> I
Common causes
| Cause | What it looks like | First thing to check |
|---|---|---|
| Network partition with failed fencing | Old primary isolated from DCS and replicas, but still local to some app servers | Network connectivity and Patroni or repmgr logs for demotion failures |
| Misconfigured watchdog or systemd restart | Old primary is demoted, then systemd restarts PostgreSQL automatically | systemd restart policies and pg_is_in_recovery() on the old node |
| Two-node repmgr cluster without witness | Partition leaves each node believing it is the sole survivor | repmgr node count and witness server status |
| Application bypassing pooler reconnects to old primary | Failover completes, but app DNS or direct IP reconnects to the old address | PgBouncer or application connection strings pointing at old primary IP |
| Manual promotion without stopping old primary | Operator promotes a replica before confirming old primary is down | pg_controldata on both nodes showing “in production” |
Quick checks
Run these on every node that might be primary. They are read-only.
pg_controldata $PGDATA | grep -E "Database cluster state|Latest checkpoint's TimeLineID"
SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
SELECT pg_is_in_recovery();
patronictl list <cluster-name>
repmgr cluster show
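If more than one node reports Database cluster state: in production, or pg_is_in_recovery() returns false on more than one node, you have a split-brain. Active WAL senders in pg_stat_replication on both nodes confirm that each one is feeding its own replicas.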
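The control-file check can be automated once you have collected pg_controldata output from each node. The following Python sketch shows one way to do that. The function names and the way the output is gathered (for example over SSH) are assumptions for illustration, not part of Patroni, repmgr, or PostgreSQL.

```python
def parse_controldata(text):
    """Extract cluster state and checkpoint timeline from pg_controldata output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; some values (timestamps) contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "state": fields.get("Database cluster state"),
        "timeline": int(fields["Latest checkpoint's TimeLineID"]),
    }


def primaries(outputs):
    """Return the names of nodes whose control file says 'in production'.

    outputs maps a node name (hypothetical, chosen by you) to the raw
    pg_controldata text captured on that node.
    """
    return sorted(
        name
        for name, text in outputs.items()
        if parse_controldata(text)["state"] == "in production"
    )


def is_split_brain(outputs):
    """More than one node in production means the cluster is split."""
    return len(primaries(outputs)) > 1
```

A result with two or more names from primaries() is a reason to move straight to stopping the old primary, not proof of which node is authoritative.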
How to diagnose it
- Confirm scope. Check pg_stat_replication on every node. A healthy primary has replicas connected. If two nodes show connected replicas, or if some replicas are missing while applications report writes to multiple endpoints, the cluster is split.
- Map timeline IDs. Use pg_controldata on every node. The legitimate new primary usually holds the highest timeline ID because promotion increments it. The old primary remains on the pre-failover timeline. If timeline IDs are equal but both claim primary status, the promotion did not complete cleanly or the old primary was not shut down before the new one was promoted.
- Find active writes on the old primary. Query pg_stat_database and look for xact_commit, tup_inserted, tup_updated, or tup_deleted increasing after the known failover time. Check PostgreSQL logs for checkpoint activity. With log_checkpoints enabled, a node that is accepting writes logs checkpoints, whereas a node in recovery logs restartpoints.
- Identify application connections. Query pg_stat_activity on the old primary for active writes. Check application_name and client_addr to see which services still route to the wrong node.
- Assess divergence depth. Compare pg_current_wal_lsn() on each node with the LSN at which the new timeline forked, which is recorded in the timeline history file in the new primary's pg_wal directory. The further each node has advanced past the fork, the more data there is to reconcile. If the old primary has been writable for hours or days, expect to rebuild rather than rewind.
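To put a number on divergence depth, it helps to convert LSNs to byte offsets. An LSN such as 0/5000198 is two hexadecimal halves of a 64-bit position, and a timeline history file (for example 00000002.history in pg_wal on the new primary) lists, per line, a parent timeline ID, the LSN where that timeline ended, and a reason. The sketch below assumes that layout; the function names and sample values are made up for illustration.

```python
def lsn_to_int(lsn):
    """Convert an LSN like '0/5000198' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def fork_point(history_text, parent_timeline):
    """Return the LSN at which parent_timeline was left, or None if absent."""
    for line in history_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # timeline, switch LSN, free-text reason
        if int(parts[0]) == parent_timeline:
            return parts[1]
    return None


def bytes_past_fork(current_lsn, fork_lsn):
    """WAL bytes a node has generated since the timelines split."""
    return lsn_to_int(current_lsn) - lsn_to_int(fork_lsn)


# Hypothetical values: the old primary stayed on timeline 1 after the fork.
history = "1\t0/5000098\tno recovery target specified\n"
fork = fork_point(history, 1)
print(bytes_past_fork("0/5000198", fork))  # WAL written by the old primary after the fork
```

Run bytes_past_fork once with the old primary's pg_current_wal_lsn() and once with the new primary's to see how far each side moved independently.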
Metrics and signals to monitor
| Signal | Why it matters | Warning sign |
|---|---|---|
| Timeline ID per node | Promotion increments timeline; divergence indicates split-brain | Two nodes in the same cluster with different active timeline IDs |
| pg_stat_replication sender count | Only the primary should stream WAL to multiple replicas | Multiple nodes showing active WAL senders |
| Checkpoints on old primary | Read-only replicas should not checkpoint | checkpoints_req or checkpoints_timed increasing on a demoted node |
| Connection count to old primary | Apps may reconnect to the demoted node after failover | Sustained active connections to a node that should be in recovery |
| pg_stat_database.xact_commit on standby | A standby commits only read-only transactions | Commit counters increasing on a replica faster than its read-only traffic explains |
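A minimal sketch of the standby write signal, assuming you sample pg_stat_database twice on a node that should be in recovery. It looks at the tuple write counters rather than xact_commit, because read-only queries on a hot standby are also counted as commits. The function name and the dictionary layout are hypothetical.

```python
WRITE_COUNTERS = ("tup_inserted", "tup_updated", "tup_deleted")


def write_deltas(before, after):
    """Return the write counters that grew between two samples.

    before and after map pg_stat_database column names to integer values
    taken from the same database on the same node.
    """
    return {
        name: after[name] - before[name]
        for name in WRITE_COUNTERS
        if after[name] > before[name]
    }


first = {"tup_inserted": 100, "tup_updated": 40, "tup_deleted": 7}
second = {"tup_inserted": 160, "tup_updated": 40, "tup_deleted": 9}
print(write_deltas(first, second))  # {'tup_inserted': 60, 'tup_deleted': 2}
```

Any non-empty result on a node that is supposed to be a standby is worth an alert, since a node in recovery cannot insert, update, or delete rows.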
Fixes
Stop the old primary immediately
Warning: Stopping the old primary aborts all active connections on that node.
Do not attempt to reconcile while both nodes accept writes. Every new transaction deepens divergence. Stop PostgreSQL on the old primary:
pg_ctl stop -D $PGDATA -m fast
If systemd is configured to restart PostgreSQL automatically, mask the PostgreSQL unit temporarily to prevent restart loops. Block client access at the network layer or in PgBouncer.
Choose the authoritative primary
Choose the node that won the DCS election. Verify with patronictl list or repmgr cluster show. The correct primary normally has the higher timeline ID and the most recent LSN. Do not choose based on uptime or load; choose based on quorum state and timeline.
Reconcile with pg_rewind
pg_rewind synchronizes a diverged data directory without a full base backup by copying only changed blocks since the divergence point.
Prerequisites:
- wal_log_hints = on in postgresql.conf, or data checksums enabled at initdb.
- full_page_writes = on.
- WAL from the divergence point must be available in the archive or in pg_wal (pg_xlog in versions before 10).
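The configuration prerequisites can be checked before you commit to a rewind. The sketch below assumes you have already read the three settings from the old primary (for example with SHOW wal_log_hints, SHOW data_checksums, and SHOW full_page_writes, captured before it was stopped) into a dictionary. The function name is hypothetical, and the check does not cover WAL availability.

```python
def rewind_blockers(settings):
    """Return a list of reasons pg_rewind cannot be used, empty if none.

    settings maps setting names to the 'on'/'off' strings reported by SHOW.
    """
    problems = []
    hints = settings.get("wal_log_hints") == "on"
    checksums = settings.get("data_checksums") == "on"
    if not (hints or checksums):
        problems.append("neither wal_log_hints nor data checksums is enabled")
    if settings.get("full_page_writes") != "on":
        problems.append("full_page_writes is not on")
    return problems


# Example: checksums were enabled at initdb, so wal_log_hints may stay off.
print(rewind_blockers({
    "wal_log_hints": "off",
    "data_checksums": "on",
    "full_page_writes": "on",
}))
```

These settings must have been in effect while the old primary was running; turning them on after the divergence does not make an existing data directory rewindable.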
Process:
- Ensure a clean shutdown before rewinding. If the old primary shut down uncleanly and the divergence point lands on a partial WAL record, pg_rewind can fail with an invalid record length. If a clean shutdown is impossible, treat the node as unrecoverable and reclone.
- Snapshot the data directory. The PostgreSQL documentation warns that a failed pg_rewind can leave the data directory unrecoverable.
- Run pg_rewind on the old primary, using the new primary as the source:
pg_rewind --target-pgdata=$PGDATA --source-server="host=new_primary port=5432 user=replicator dbname=postgres"
The source connection is a normal (non-replication) connection, so it needs a database name and a role that is a superuser or has been granted EXECUTE on the functions pg_rewind calls.
- Configure standby. pg_rewind copies configuration files from the source along with the changed blocks, so files in the data directory can be overwritten. Review postgresql.conf, pg_hba.conf, and any included files. After the rewind, configure the node as a standby. In PostgreSQL 12 and later, create standby.signal and set primary_conninfo in postgresql.auto.conf. In earlier versions, create a recovery.conf file with standby_mode = on and primary_conninfo.
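The version-dependent standby configuration can be sketched as a small helper that returns the files to create and the lines to put in them. This is an illustration under stated assumptions: the function name is made up, the host, port, and user are placeholders, and the postgresql.auto.conf entry is a line to append to the existing file, not a replacement for it.

```python
def standby_files(major_version, host, port, user):
    """Map file names (relative to the data directory) to content to add."""
    conninfo = f"host={host} port={port} user={user}"
    # Configuration values are single-quoted; embedded quotes are doubled.
    quoted = "'" + conninfo.replace("'", "''") + "'"
    if major_version >= 12:
        return {
            "standby.signal": "",  # an empty file is enough
            "postgresql.auto.conf": f"primary_conninfo = {quoted}\n",
        }
    return {
        "recovery.conf": f"standby_mode = 'on'\nprimary_conninfo = {quoted}\n",
    }


for name, content in standby_files(14, "new_primary", 5432, "replicator").items():
    print(name, repr(content))
```

Keeping the logic in one place makes it harder to leave a rewound node without standby.signal, in which case it would start as a primary again.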
When to rebuild instead
Do not spend hours trying to rewind a node that diverged for an extended period. Rebuild from pg_basebackup if any of the following apply:
- The old primary was writable for many hours and WAL archive retention does not cover the divergence point.
- pg_rewind failed and left the data directory inconsistent.
- The node requires a major version upgrade or configuration overhaul anyway.
Rejoin and verify
Start the reconciled node. On the primary, confirm the replica connects and streams WAL:
SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
On the replica, confirm WAL reception:
SELECT status, receive_start_lsn, latest_end_lsn
FROM pg_stat_wal_receiver;
If you suspect inconsistency, run pg_checksums --check (this requires data checksums to be enabled and the node to be cleanly shut down) or compare row counts on critical tables.
Prevention
Fencing. Use Patroni with watchdog enabled if your environment supports it. On cloud VMs, the softdog software watchdog is the default, but hardware watchdog is preferred. Without watchdog, Patroni cannot fence an unresponsive primary when the DCS is also partitioned. For repmgr, avoid two-node clusters without a witness server. A partition in a two-node cluster leaves both nodes without quorum.
Connection routing. Route all application traffic through PgBouncer or a proxy that follows the HA tool’s leader key. Direct PostgreSQL connections bypass failover logic. If an application reverts to a hardcoded IP after failover, split-brain writes will continue regardless of Patroni’s state.
Failover testing. Test automated failover monthly in a non-production environment. Verify that the old primary stops, stays stopped, and can rejoin as a replica.
Synchronous replication. Consider synchronous_commit = remote_apply with synchronous_standby_names for workloads that cannot tolerate divergent writes. This trades availability for consistency: if the synchronous replica fails, writes pause until it recovers or the configuration is relaxed.
How Netdata helps
Netdata correlates WAL generation rate and checkpoint frequency across nodes to spot a demoted primary that is still writing. It tracks active connections per PostgreSQL instance; a connection spike on a node that should be in recovery signals routing failure. It monitors replication lag per replica. If some replicas show zero lag to one primary while others lag to a different node, the cluster may be split. Alert on pg_stat_database transaction rates on standby nodes. A standby should not show sustained commit throughput.
Related guides
- How PostgreSQL actually works in production: a mental model for operators: /guides/postgres/how-postgres-works-in-production/
- PostgreSQL ALTER TABLE blocked: zero-downtime DDL patterns: /guides/postgres/postgres-alter-table-blocked/
- PostgreSQL autovacuum blocked by long-running transaction: detection and fix: /guides/postgres/postgres-autovacuum-blocked-by-long-transaction/
- PostgreSQL autovacuum not running: detection, causes, and fixes: /guides/postgres/postgres-autovacuum-not-running/
- PostgreSQL autovacuum tuning: per-table thresholds for high-churn workloads: /guides/postgres/postgres-autovacuum-tuning/
- PostgreSQL blocking queries: finding the root blocker in a lock cascade: /guides/postgres/postgres-blocking-queries/
- PostgreSQL checkpoint storms: detection, causes, and tuning: /guides/postgres/postgres-checkpoint-storms/
- PostgreSQL: checkpoints are occurring too frequently – what to tune: /guides/postgres/postgres-checkpoints-occurring-too-frequently/
- PostgreSQL connection exhaustion: detection, diagnosis, and prevention: /guides/postgres/postgres-connection-exhaustion/
- PostgreSQL connection refused: pg_hba, listen_addresses, and TCP diagnosis: /guides/postgres/postgres-connection-refused/
- PostgreSQL: database is not accepting commands to avoid wraparound data loss: /guides/postgres/postgres-database-not-accepting-commands/
- PostgreSQL dead tuples piling up: why autovacuum can’t keep up: /guides/postgres/postgres-dead-tuples-piling-up/






