PostgreSQL failover with Patroni: detection, promotion, and rollback

Patroni automates PostgreSQL failover with machine-enforced lease expiration. An agent on each node maintains a leader lock in a distributed consensus store such as etcd or Consul. When the primary stops renewing that lock, a surviving replica acquires it and promotes itself. Detection to promotion typically completes within seconds to tens of seconds, depending on the lock TTL, rather than the minutes a manual procedure requires.

Automation does not remove operational risk. A network blip between Patroni and its consensus store can look exactly like a primary death. A promoted replica with unbounded lag can wipe out your RPO. An old primary that restarts outside Patroni’s control can create a split brain. Understanding detection, promotion, and rollback mechanics is essential in production.

What it is and why it matters

Patroni is an open-source template for PostgreSQL high availability. It stores cluster membership and leader state in a distributed consensus store (DCS), runs health checks against the local PostgreSQL instance, and drives the promotion and demotion lifecycle. Without an orchestrator, failover is a manual sequence: identify the most current replica, stop the old primary, run pg_ctl promote, rewire application connections, and re-establish streaming replication. Patroni compresses this into an automated state machine, but it adds a hard dependency on DCS availability and network stability between every node and the consensus layer.

The main reason to run Patroni is RTO reduction. Manual failover in a stressed incident rarely meets a sub-minute recovery target. Patroni can detect a dead primary, elect a replacement, and update service discovery before an operator opens a terminal. The tradeoff is complexity: you are running a distributed system to manage your database, and its failure modes become your database’s failure modes.

How it works

flowchart TD
  P[Primary holds DCS lock] -->|renews lease| D[DCS]
  R[Replicas watch lock] -->|watch| D
  P -->|stops renewing| D
  D -->|TTL expires| E[Leaderless state]
  E -->|evaluate lag| C[Candidate selection]
  C -->|winner promotes| NP[New primary]
  P -->|must be| F[Old primary fenced]

Leader lock and health checks

The primary Patroni process continuously renews a leader lock key in the DCS. This lock has a time-to-live (TTL). The Patroni agent on the primary polls PostgreSQL health and, if the instance is responsive, extends the lease. If the PostgreSQL instance crashes or is OOM-killed, or if the node loses connectivity to the DCS, renewals stop and the lock expires. Worst-case detection latency is therefore roughly the TTL plus one health-check interval.
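To make the timing concrete, here is a minimal sketch of the bound described above. The field names ttl and loop_wait mirror Patroni's configuration keys, and 30 and 10 seconds are Patroni's documented defaults. The formula is the simplified bound from the text, not an exact model of Patroni's HA loop, and the helper names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseTimings:
    ttl: int        # seconds the leader lock survives without a renewal
    loop_wait: int  # seconds between health-check / renewal attempts


def worst_case_detection_seconds(t: LeaseTimings) -> int:
    """Simplified upper bound: lock TTL plus one health-check interval."""
    return t.ttl + t.loop_wait


def lock_expired(last_renewal: float, now: float, ttl: int) -> bool:
    """True once the lock has gone unrenewed for at least ttl seconds."""
    return now - last_renewal >= ttl


timings = LeaseTimings(ttl=30, loop_wait=10)
print(worst_case_detection_seconds(timings))  # 40
```

Lowering ttl shortens detection but leaves less headroom for slow DCS responses, which is the tradeoff behind spurious failovers.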

Replicas watch the same leader key. While the lock is held and the primary is responsive, replicas remain in recovery and stream WAL. They do not attempt to modify the key.

Detection and election

When the leader lock disappears or is released, the cluster enters a leaderless state. Surviving Patroni agents evaluate local PostgreSQL instances as promotion candidates. The evaluation considers replication lag and configured priority. A replica that is too far behind the last known primary position may be excluded from the election to prevent promoting stale data.
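The evaluation can be pictured with the simplified model below. It is a sketch, not Patroni's implementation: in a real cluster each agent judges its own node and races for the lock rather than one function ranking everyone. The Replica type, the priority field, and pick_candidate are hypothetical names; maximum_lag_on_failover is the Patroni setting, shown here with its default of 1 MiB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    name: str
    replayed_lsn: int  # WAL position replayed so far, as a byte offset
    priority: int = 1  # hypothetical: higher wins ties, 0 means never promote


def pick_candidate(
    replicas: List[Replica],
    last_leader_lsn: int,
    maximum_lag_on_failover: int,
) -> Optional[Replica]:
    """Drop replicas that are too far behind, then prefer the most advanced."""
    eligible = [
        r for r in replicas
        if r.priority > 0
        and last_leader_lsn - r.replayed_lsn <= maximum_lag_on_failover
    ]
    if not eligible:
        return None  # nobody qualifies: the cluster stays leaderless
    return max(eligible, key=lambda r: (r.replayed_lsn, r.priority))


replicas = [
    Replica("pg-b", replayed_lsn=5_000_000),
    Replica("pg-c", replayed_lsn=4_999_000, priority=2),
    Replica("pg-d", replayed_lsn=1_000_000),  # about 4 MB behind, excluded
]
winner = pick_candidate(replicas, last_leader_lsn=5_000_500,
                        maximum_lag_on_failover=1_048_576)
print(winner.name if winner else "no eligible candidate")  # pg-b
```

In this model WAL position outranks priority, so a higher-priority replica only wins when positions are equal.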

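In synchronous replication setups, PostgreSQL itself determines which standbys have acknowledged WAL. When Patroni's synchronous_mode is enabled, Patroni manages synchronous_standby_names, records the current synchronous standbys in the DCS, and promotes only a standby that was synchronous when the primary failed.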

Promotion

The winning replica acquires the leader lock in the DCS and issues a promote command on its local PostgreSQL instance. The database transitions from recovery to read-write. Patroni updates the cluster state in the DCS so that proxies, poolers, or service discovery layers can route connections to the new endpoint.

After promotion, the new primary begins accepting writes. Remaining replicas must reconfigure their primary_conninfo to stream from the new leader. Patroni handles this reconfiguration automatically.
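A routing layer that follows cluster state needs to resolve the current leader from whatever Patroni publishes. The sketch below does this for a payload shaped like the response of Patroni's REST API GET /cluster endpoint. The field names (members, role, host, port) are assumptions to verify against your Patroni version, and leader_endpoint is a hypothetical helper.

```python
from typing import Optional, Tuple


def leader_endpoint(cluster: dict) -> Optional[Tuple[str, int]]:
    """Return (host, port) of the single leader, or None if leaderless."""
    leaders = [m for m in cluster.get("members", []) if m.get("role") == "leader"]
    if len(leaders) > 1:
        # Should never happen; refuse to route rather than pick one at random.
        names = ", ".join(m.get("name", "?") for m in leaders)
        raise RuntimeError(f"multiple leaders reported: {names}")
    if not leaders:
        return None
    return leaders[0]["host"], int(leaders[0]["port"])


# Sample payload, hand-written for illustration.
cluster_state = {
    "members": [
        {"name": "pg-a", "role": "replica", "host": "10.0.0.1", "port": 5432},
        {"name": "pg-b", "role": "leader", "host": "10.0.0.2", "port": 5432},
    ]
}
print(leader_endpoint(cluster_state))
```

Returning None for a leaderless cluster lets the caller hold or reject writes instead of sending them to a stale endpoint.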

Fencing the old primary

The most dangerous moment in any failover is the overlap between the old primary and the new one. If the old primary is still alive and accepting writes while a replica promotes, you have two divergent primaries. Patroni cannot demote a primary it cannot reach. If the node is partitioned from the DCS but still reachable by applications, and its local Patroni agent is hung or fails to demote PostgreSQL in time, writes can continue on a node that no longer holds the leader lock. Auto-restart policies in systemd can also resurrect a stopped primary outside Patroni’s control.

Preventing this requires fencing: disable systemd auto-restart for PostgreSQL, route applications through a proxy that follows DCS state, and be prepared to drop traffic via firewall rules or STONITH when partitions occur.
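A fencing check reduces to one question: is any node writable that does not hold the leader lock? The sketch below assumes you have already collected, per node, the result of PostgreSQL's pg_is_in_recovery() and the name of the lock holder from the DCS. How you collect them is out of scope, and nodes_to_fence is a hypothetical helper.

```python
from typing import Dict, List, Optional


def nodes_to_fence(in_recovery: Dict[str, bool],
                   lock_holder: Optional[str]) -> List[str]:
    """Names of nodes that accept writes without holding the leader lock.

    in_recovery maps node name -> pg_is_in_recovery() result; False means
    the node is running as a read-write primary.
    """
    return sorted(
        name for name, recovering in in_recovery.items()
        if not recovering and name != lock_holder
    )


# pg-a was the old primary and is still writable after pg-b took the lock.
observed = {"pg-a": False, "pg-b": False, "pg-c": True}
print(nodes_to_fence(observed, lock_holder="pg-b"))  # ['pg-a']
```

A non-empty result is the signal to cut traffic to those nodes before doing anything else.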

Planned switchover vs manual failover

A switchover is a graceful, operator-initiated transition. The current primary shuts down cleanly and releases the leader lock, and a chosen replica is promoted. Use switchover for planned maintenance, operating system upgrades, or availability-zone evacuations. It minimizes downtime and data-loss risk because the old primary stops cleanly, giving the candidate the chance to receive all outstanding WAL before it promotes.

A failover is an unplanned event driven by actual primary failure. Patroni’s automatic failover handles this without human intervention. An operator can also trigger a manual failover via patronictl failover when automatic detection has not fired but the primary is clearly unhealthy, or when the operator wants to force promotion of a specific candidate.

Rollback

A failover is effectively a one-way event. Once a replica promotes and begins accepting writes, the old primary cannot simply be re-promoted without risk of data divergence. The safe rollback path is to rebuild the old primary as a replica from the new one, using a physical backup or streaming replication reinitialization. If you need to return leadership to the original node, perform a planned switchover after the rebuild is complete and replication lag is zero.
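The precondition for switching back can be written as a small gate. This is a sketch under assumptions: the MemberStatus fields are hypothetical stand-ins for whatever your tooling reports about the rebuilt node, and the rule encoded is only the one stated above (rebuilt as a replica, streaming, zero lag).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberStatus:
    name: str
    role: str       # "leader" or "replica"
    state: str      # e.g. "streaming" once attached to the new primary
    lag_bytes: int  # replication lag behind the current leader


def ready_for_switchback(rebuilt: MemberStatus) -> bool:
    """Allow a planned switchover only to a caught-up, streaming replica."""
    return (
        rebuilt.role == "replica"
        and rebuilt.state == "streaming"
        and rebuilt.lag_bytes == 0
    )


old_primary = MemberStatus("pg-a", role="replica", state="streaming", lag_bytes=0)
print(ready_for_switchback(old_primary))  # True
```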

Pause mode

Patroni supports a pause mode that disables automatic failover while leaving the cluster running. This is useful during risky maintenance, DCS upgrades, or network changes. While paused, Patroni will not trigger an election if the primary fails, but an operator can still run a manual switchover or failover. Do not leave a cluster paused indefinitely; a paused cluster with a dead primary will not recover automatically.

Where it shows up in production

DCS connectivity blips. The most common cause of spurious failovers is not a dead PostgreSQL instance, but packet loss or latency spikes between Patroni and etcd or Consul. The primary is healthy, yet it cannot renew its lock, so Patroni demotes it. Before declaring a database incident, verify DCS health and network paths.

Network partitions. A partition between data centers can isolate the primary from the DCS while leaving it connected to application servers. Without proper fencing, the old primary continues to accept writes and diverges from the promoted replica.

Replication lag disqualification. If every replica is lagging beyond maximum_lag_on_failover, Patroni refuses to promote any of them. The cluster is left without a writable primary until a replica catches up or an operator forces a manual failover and accepts the data loss window.

Synchronous replication stalls. If the only synchronous standby fails and synchronous_mode_strict is enabled, the primary halts commits until a replacement is attached. Without strict mode, Patroni removes the failed standby from synchronous_standby_names and the primary falls back to asynchronous commits, so you may lose the RPO guarantees you assumed were in place.

Tradeoffs and when to use it

Automatic failover. Use when your RTO is measured in seconds and on-call staffing cannot guarantee immediate manual intervention. Accept the risk that DCS instability can trigger promotions on a healthy primary.

Manual failover only. Use when network partitions are frequent, when the cost of a spurious promotion exceeds the cost of downtime, or when regulatory constraints require human approval for any leadership change. Accept an RTO measured in minutes to tens of minutes.

Switchover. Always prefer switchover for planned changes. It is safer than failover because it avoids crash recovery and preserves replication consistency.

Pause mode. Use during maintenance windows, but set a reminder to unpause. A paused cluster with a dead primary will not recover automatically.

Signals to watch in production

| Signal | Why it matters | Warning sign |
| --- | --- | --- |
| DCS response latency | Patroni depends on timely lock renewal | Renewals timing out or approaching TTL |
| Replication lag | Determines candidate quality and RPO | Lag growing toward maximum_lag_on_failover |
| PostgreSQL process state on old primary | Detects split brain or auto-restart | Postmaster running after a demotion |
| Leader lock presence in DCS | Indicates who the cluster believes is primary | Missing lock or multiple claimants |
| WAL receive rate on replicas | Confirms the primary is generating and sending WAL | Rate drops to zero while the primary is still marked up in the DCS |

How Netdata helps

Use Netdata to distinguish a real primary failure from a DCS false positive.

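  • Correlate replication lag with connection refusal on the primary. pg_stat_replication.replay_lag is reported by the primary, so once the primary is unreachable, measure delay on the replica instead, for example with now() - pg_last_xact_replay_timestamp(). If the primary is truly down, replica-side delay grows and connections drop simultaneously.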
  • Monitor WAL receive and replay rate and disk I/O on replicas to confirm a candidate is catching up and suitable for promotion.
  • Track write throughput and transaction rates across nodes. A split brain appears as active writes on the old primary after the DCS has registered a new leader.
  • Alert on pg_stat_activity state. A sudden absence of active client backends on the primary, coupled with stalled replication, indicates a real failure more strongly than a single missed DCS heartbeat.
  • If you collect etcd or Consul metrics, overlay Raft latency or leader change counts with PostgreSQL role transitions to expose consensus-store-induced failovers.
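The correlation logic in the list above can be summarized as a small decision function. This is an illustrative sketch only: the Observation fields are hypothetical booleans you would derive from your own metrics, not Netdata chart or alert names, and the three labels are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    primary_accepts_connections: bool
    replicas_receiving_wal: bool
    dcs_renewals_failing: bool


def classify(obs: Observation) -> str:
    """Separate a dead primary from a consensus-store false positive."""
    if not obs.primary_accepts_connections and not obs.replicas_receiving_wal:
        return "primary_failure"
    if (obs.dcs_renewals_failing
            and obs.primary_accepts_connections
            and obs.replicas_receiving_wal):
        return "dcs_false_positive"
    return "inconclusive"


# Primary is serving clients and shipping WAL, but lock renewals are failing.
print(classify(Observation(True, True, True)))  # dcs_false_positive
```

Anything labelled inconclusive still needs a human to look at both the database and the consensus store.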