PostgreSQL split-brain after failover: detection and reconciliation

A failover should leave one primary and a clean topology. Split-brain means two instances accept writes, application connections are split across both nodes, and transaction histories diverge. It typically starts with a network partition, a missed demotion signal, or a health-check false positive that promotes a replica while the old primary keeps running. Once both primaries accept transactions, timelines diverge and you must reconcile.

What this means

In streaming replication, promotion creates a new timeline. The old primary must stop accepting writes and rejoin as a replica or stay offline. Split-brain occurs when the old primary continues to run read-write after promotion. Each primary generates its own WAL stream in the same LSN space but on different timelines. Replicas cannot follow both. Applications that reconnect to the old primary write data that does not exist on the new primary. The longer both primaries stay active, the larger the divergence and the more likely you must rebuild instead of rewind.

flowchart TD
    A[Failover triggered] --> B{Two primaries accepting writes?}
    B -->|Yes| C[Split-brain confirmed]
    B -->|No| D[Standard failover]
    C --> E[Stop old primary immediately]
    E --> F{pg_rewind viable?}
    F -->|Yes| G[Snapshot data directory]
    G --> H[Run pg_rewind to new primary]
    H --> I[Start as replica]
    F -->|No| J[Rebuild from pg_basebackup]
    J --> I

Common causes

Cause	What it looks like	First thing to check
Network partition with failed fencing	Old primary isolated from DCS and replicas, but still local to some app servers	Network connectivity and Patroni or repmgr logs for demotion failures
Misconfigured watchdog or systemd restart	Old primary is demoted, then systemd restarts PostgreSQL automatically	systemd restart policies and `pg_is_in_recovery()` on the old node
Two-node repmgr cluster without witness	Partition leaves each node believing it is the sole survivor	repmgr node count and witness server status
Application bypassing pooler reconnects to old primary	Failover completes, but app DNS or direct IP reconnects to the old address	PgBouncer or application connection strings pointing at old primary IP
Manual promotion without stopping old primary	Operator promotes a replica before confirming old primary is down	`pg_controldata` on both nodes showing “in production”

Quick checks

Run these on every node that might be primary. They are read-only.

pg_controldata $PGDATA | grep -E "Database cluster state|Latest checkpoint's TimeLineID"

SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;

SELECT pg_is_in_recovery();

patronictl list <cluster-name>

repmgr cluster show

If more than one node reports Database cluster state: in production, or pg_is_in_recovery() returns false on more than one node, you have a split-brain. Active senders in pg_stat_replication on both nodes confirm that the replicas are divided between them.
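The pg_controldata check above can be reduced to one comparable line per node. The following is a minimal sketch, assuming the standard English pg_controldata output; the function name controldata_summary is illustrative and not part of any tool.

```shell
# Summarize pg_controldata output (read from stdin) as "<cluster state>|<timeline>".
# Reading stdin lets the same function take input from ssh or from a saved file.
controldata_summary() {
    awk -F': *' '
        /^Database cluster state:/         { state = $2 }
        /^Latest checkpoint.s TimeLineID:/ { tli = $2 }
        END { printf "%s|%s\n", state, tli }
    '
}

# Usage on each candidate node:
#   pg_controldata "$PGDATA" | controldata_summary
```

Collect the line from every node. More than one node printing in production is the split-brain signal, and the timeline after the separator shows which node was promoted.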

How to diagnose it

Confirm scope. Check pg_stat_replication on every node. A healthy primary has replicas connected. If two nodes show connected replicas, or if some replicas are missing while applications report writes to multiple endpoints, the cluster is split.
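Map timeline IDs. Use pg_controldata on every node. The legitimate new primary usually holds the highest timeline ID because promotion increments it. The old primary remains on the pre-failover timeline. pg_controldata reports the timeline of the latest checkpoint, so a freshly promoted node can still show the old timeline until its first post-promotion checkpoint completes. If both nodes claim primary status with equal timeline IDs, re-check after a checkpoint and look for a new .history file in pg_wal on the promoted node.
Find active writes on the old primary. Query pg_stat_database and look for tup_inserted, tup_updated, or tup_deleted increasing after the known failover time. xact_commit also rises, but read-only transactions increment it too, so it is weaker evidence on its own. Check PostgreSQL logs for checkpoint activity. A node in recovery logs restartpoints rather than checkpoints, so checkpoint starting entries after the failover time indicate the node is still running as a primary.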
Identify application connections. Query pg_stat_activity on the old primary for active writes. Check application_name and source IPs to see which services still route to the wrong node.
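Assess divergence depth. Find the divergence point, which is the switchpoint recorded in the new primary's timeline history file in pg_wal, and compare it with pg_current_wal_lsn() on each node. The WAL each node wrote past that point is what must be reconciled, and a large amount on the old primary means more to rewind or discard. If the old primary has been writable for hours or days, expect to rebuild rather than rewind.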
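To put a number on divergence depth, LSNs can be converted to byte positions and subtracted. This is a minimal sketch with hypothetical helper names. It assumes the timeline history file format of one line per ancestor timeline (parent timeline, switchpoint LSN, reason) and that you pass in the old primary's current or last-known LSN yourself.

```shell
# Convert an LSN such as 0/5000060 (two hex halves) into a byte position.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Print the switchpoint LSN of the newest timeline in a .history file.
# Comment lines start with '#'; the last data line describes the latest fork.
switchpoint_from_history() {
    awk '!/^#/ && NF >= 2 { lsn = $2 } END { print lsn }' "$1"
}

# Bytes of WAL written past the fork. Args: node_lsn switchpoint_lsn
diverged_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Usage (the file name depends on the new timeline ID):
#   fork=$(switchpoint_from_history "$PGDATA/pg_wal/00000002.history")
#   diverged_bytes "<old primary LSN>" "$fork"
```

The result is WAL volume, not row count, so treat it as a rough indicator of how much the old primary wrote after the fork.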

Metrics and signals to monitor

Signal	Why it matters	Warning sign
Timeline ID per node	Promotion increments the timeline; two primaries on different timelines have diverged	Two nodes reporting in production with different timeline IDs
`pg_stat_replication` sender count	Only the primary, or a cascading standby, should stream WAL to replicas	Multiple nodes that are not in recovery showing active WAL senders
Checkpoints on old primary	A standby performs restartpoints, not checkpoints	`checkpoints_req` or `checkpoints_timed` increasing together with `checkpoint starting` log entries (rather than `restartpoint starting`) on a demoted node
Connection count to old primary	Apps may reconnect to the demoted node after failover	Sustained active connections to a node that should be in recovery
`pg_stat_database` write counters on standby	A standby can commit read-only transactions but never modifies rows	`tup_inserted`, `tup_updated`, or `tup_deleted` increasing on a replica

Fixes

Stop the old primary immediately

Warning: Stopping the old primary aborts all active connections on that node.

Do not attempt to reconcile while both nodes accept writes. Every new transaction deepens divergence. Stop PostgreSQL on the old primary:

pg_ctl stop -D $PGDATA -m fast

If systemd is configured to restart PostgreSQL automatically, mask the PostgreSQL unit temporarily to prevent restart loops. Block client access at the network layer or in PgBouncer.
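The stop-and-mask sequence can be wrapped so it is previewed before it runs. This is a sketch only: the function name and the DRY_RUN convention are illustrative, and the systemd unit name is an assumption that varies by distribution and packaging.

```shell
# Stop PostgreSQL on the old primary and prevent systemd from restarting it.
# Args: pgdata unit_name
# Set DRY_RUN=1 to print the commands instead of executing them.
fence_old_primary() {
    run=${DRY_RUN:+echo}
    $run pg_ctl stop -D "$1" -m fast   # fast mode disconnects clients and shuts down cleanly
    $run systemctl mask "$2"           # a masked unit cannot be started until it is unmasked
}

# Usage (unit name is hypothetical; check with: systemctl list-units 'postgres*'):
#   DRY_RUN=1 fence_old_primary "$PGDATA" postgresql.service
#   fence_old_primary "$PGDATA" postgresql.service
```

Remember to run systemctl unmask on the same unit before rejoining the node as a replica.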

Choose the authoritative primary

Choose the node that the HA tool recognizes as leader: the holder of the leader key in the DCS for Patroni, or the registered primary for repmgr. Verify with patronictl list or repmgr cluster show. The correct primary normally has the higher timeline ID. Do not choose based on uptime or load; choose based on quorum state and timeline.

Reconcile with pg_rewind

pg_rewind synchronizes a diverged data directory without a full base backup by copying only changed blocks since the divergence point.

Prerequisites:

wal_log_hints = on in postgresql.conf, or data checksums enabled at initdb.
full_page_writes = on.
WAL from the divergence point must be available in the archive or in pg_wal (pg_xlog in versions before 10).
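pg_rewind checks these settings itself and refuses to run without them, but it helps to know the answer before planning the repair. The sketch below reads pg_controldata output from the old primary (the rewind target) and assumes the standard English field labels; the function name is illustrative.

```shell
# Exit 0 if pg_controldata output on stdin shows the pg_rewind prerequisites:
# (wal_log_hints on OR data checksums enabled) AND full_page_writes on.
rewind_prereqs_ok() {
    awk -F': *' '
        /^wal_log_hints setting:/                { hints = $2 }
        /^Data page checksum version:/           { csum = $2 }
        /^Latest checkpoint.s full_page_writes:/ { fpw = $2 }
        END { exit !((hints == "on" || csum + 0 > 0) && fpw == "on") }
    '
}

# Usage on the old primary:
#   pg_controldata "$PGDATA" | rewind_prereqs_ok && echo "rewind possible" || echo "plan a rebuild"
```

This does not verify WAL availability, which still has to be checked against pg_wal or the archive.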

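Process:

Ensure a clean shutdown before rewinding. pg_rewind requires a target that was shut down cleanly. From PostgreSQL 13 it runs crash recovery on the target automatically when needed; on earlier versions, an unclean shutdown must be resolved first. If the old primary shut down uncleanly and the divergence point lands on a partial WAL record, pg_rewind can fail with an invalid record length. If a clean shutdown is impossible, treat the node as unrecoverable and reclone.

Snapshot the data directory. The PostgreSQL documentation warns that a failed pg_rewind can leave the data directory unrecoverable.

Run pg_rewind on the old primary, using the new primary as the source. pg_rewind opens a normal SQL connection, not a replication connection, so the role (replicator in this example) must be a superuser or, from PostgreSQL 11, hold EXECUTE on the file-access functions pg_rewind calls. Name a database explicitly, because the default is a database named after the user:

pg_rewind --target-pgdata="$PGDATA" --source-server="host=new_primary port=5432 user=replicator dbname=postgres"

Configure standby. pg_rewind copies configuration files from the source along with the changed blocks, so any postgresql.conf, pg_hba.conf, or included files that live inside the data directory now match the new primary. Review them before starting the node. Then configure the node as a standby. In PostgreSQL 12 and later, create standby.signal and set primary_conninfo in postgresql.auto.conf; from PostgreSQL 13, pg_rewind --write-recovery-conf does both. In earlier versions, create a recovery.conf file with standby_mode = on and primary_conninfo.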
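For PostgreSQL 12 and later, the standby configuration step comes down to one marker file and one setting. This is a minimal sketch with an illustrative function name; the connection string is a placeholder, and the quoting assumes it contains no single quotes.

```shell
# Mark a data directory as a standby and point it at the new primary.
# Args: pgdata primary_conninfo
configure_standby() {
    # An empty standby.signal file tells PostgreSQL 12+ to start in standby mode.
    touch "$1/standby.signal"
    # Appending is enough: when a setting appears more than once, the last entry wins.
    printf "primary_conninfo = '%s'\n" "$2" >> "$1/postgresql.auto.conf"
}

# Usage (host and user are placeholders):
#   configure_standby "$PGDATA" "host=new_primary port=5432 user=replicator"
```

Run it only while PostgreSQL is stopped on the node, after the rewind has completed.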

When to rebuild instead

Do not spend hours trying to rewind a node that diverged for an extended period. Rebuild from pg_basebackup if any of the following apply:

The old primary was writable for many hours and WAL archive retention does not cover the divergence point.
pg_rewind failed and left the data directory inconsistent.
The node requires a major version upgrade or configuration overhaul anyway.
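A rebuild can be sketched as two commands. Moving the diverged directory aside instead of deleting it is a suggestion of this sketch, not a step from the procedure above, and it needs enough disk space for both copies. The function name, host, and user are placeholders.

```shell
# Re-clone the old primary from the new primary.
# Args: pgdata source_host replication_user
# Set DRY_RUN=1 to print the commands instead of executing them.
rebuild_replica() {
    run=${DRY_RUN:+echo}
    # Keep the diverged data in case writes must be extracted from it later.
    $run mv "$1" "$1.diverged"
    # -X stream: stream WAL during the backup; -R: write standby configuration;
    # -P: show progress. The target directory must be empty or absent.
    $run pg_basebackup -h "$2" -U "$3" -D "$1" -X stream -R -P
}

# Usage:
#   DRY_RUN=1 rebuild_replica "$PGDATA" new_primary replicator
```

PostgreSQL must be stopped on the node before running this, and the role needs the REPLICATION attribute plus a matching pg_hba.conf entry on the new primary.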

Rejoin and verify

Start the reconciled node. On the primary, confirm the replica connects and streams WAL:

SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;

On the replica, confirm WAL reception:

SELECT status, receive_start_lsn, latest_end_lsn
FROM pg_stat_wal_receiver;

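If you suspect inconsistency, compare row counts on critical tables between the primary and the replica. pg_checksums --check (PostgreSQL 12 and later) can also verify data checksums, but only on a cleanly shut-down cluster that has checksums enabled, so run it before starting the node.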

Prevention

Fencing. Use Patroni with watchdog enabled if your environment supports it. On cloud VMs, the softdog kernel module is usually the only watchdog available; a hardware watchdog is preferred where one exists. Without watchdog, Patroni cannot fence an unresponsive primary when the DCS is also partitioned. For repmgr, avoid two-node clusters without a witness server. In a two-node cluster, a partition leaves the standby unable to tell a failed primary from an unreachable one, so it may promote while the old primary keeps running.

Connection routing. Route all application traffic through PgBouncer or a proxy that follows the HA tool’s leader key. Direct PostgreSQL connections bypass failover logic. If an application reverts to a hardcoded IP after failover, split-brain writes will continue regardless of Patroni’s state.

Failover testing. Test automated failover monthly in a non-production environment. Verify that the old primary stops, stays stopped, and can rejoin as a replica.

Synchronous replication. Consider synchronous_commit = remote_apply with synchronous_standby_names for workloads that cannot tolerate divergent writes. This trades availability for consistency: if the synchronous replica fails, writes pause until it recovers or the configuration is relaxed.

How Netdata helps

Netdata correlates WAL generation rate and checkpoint frequency across nodes to spot a demoted primary that is still writing. It tracks active connections per PostgreSQL instance; a connection spike on a node that should be in recovery signals routing failure. It monitors replication lag per replica. If some replicas show zero lag to one primary while others lag to a different node, the cluster may be split. Alert on pg_stat_database write activity on standby nodes. A hot standby serving read-only queries still commits transactions, but it should never show rows inserted, updated, or deleted.


