Failover failing with 3+ units, diverged timeline

Bug #1616433 reported by Stuart Bishop
This bug affects 3 people
Affects: PostgreSQL Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none listed

Bug Description

Failover can fail when there are three or more units.

To fail over, the leader should pause xlog replay on all units, determine which unit has received the most WAL, and declare it the new master. The new master then promotes itself, which switches timelines. Standby units should ensure replication settings for the new master are in place before restarting.
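The selection step can be sketched roughly as follows. This is an illustration only, not the charm's actual code: `parse_lsn` and `pick_new_master` are hypothetical names, and the sketch assumes each standby has already paused replay and reported its received WAL position as an LSN string (on PostgreSQL 9.5, something like the result of pg_last_xlog_receive_location()). LSNs are compared as integers because comparing them as strings gives the wrong order.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an integer byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_new_master(received: Dict[str, str]) -> str:
    """Return the unit whose received WAL position is furthest ahead.

    `received` maps a unit name to the LSN it reported after replay was
    paused (an assumption of this sketch). Ties are broken by unit name so
    that every caller makes the same choice.
    """
    if not received:
        raise ValueError("no standby positions reported")
    return max(received, key=lambda unit: (parse_lsn(received[unit]), unit))
```

The choice is only meaningful if no standby receives further WAL after reporting its position, which is the condition this bug calls into question.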

Something is failing: while the new master is successfully promoted, the remaining standbys may fail to replicate from it because their replay point is more advanced than the timeline switch point. Either the check to see which unit is most advanced is bogus, or standbys are receiving or replaying more of the old timeline after the check. Note that the old master may still be active.
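As a rough illustration of the failure condition described above, the following sketch flags standbys that have replayed past the point where the new master switched timelines. The function names are hypothetical, and the sketch assumes the switch point and each standby's replay position are available as LSN strings; it is not taken from the charm.

```python
from typing import Dict, List


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/5000028' to an integer byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standbys_past_switch(switch_lsn: str, replayed: Dict[str, str]) -> List[str]:
    """Return the units whose replay position is beyond the timeline switch.

    Such units have applied WAL from the old timeline that the new master
    never saw, so they cannot simply follow the new timeline.
    """
    limit = parse_lsn(switch_lsn)
    return sorted(unit for unit, lsn in replayed.items() if parse_lsn(lsn) > limit)
```

A non-empty result after promotion would point to the second explanation above: standbys kept receiving or replaying WAL from the old master after the check.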

Stuart Bishop (stub)
Changed in postgresql-charm:
status: New → Triaged
importance: Undecided → High
MichaelEino (michaeleino) wrote :

I have the same problem. Below are the related unit logs. The switchover is not being triggered, even though I shut down the master machine.

root@server:~# juju debug-log --include postgres-ha/6
unit-postgres-ha-6: 20:50:49 WARNING unit.postgres-ha/6.juju-log Falling back to comma separated extra_pg_auth
unit-postgres-ha-6: 20:50:50 DEBUG unit.postgres-ha/6.juju-log Setting hot_standby to True
unit-postgres-ha-6: 20:50:50 DEBUG unit.postgres-ha/6.juju-log Setting wal_level to logical
unit-postgres-ha-6: 20:50:50 DEBUG unit.postgres-ha/6.juju-log Setting wal_keep_segments to 500
unit-postgres-ha-6: 20:50:50 INFO unit.postgres-ha/6.juju-log PostgreSQL has been configured
unit-postgres-ha-6: 20:50:51 DEBUG unit.postgres-ha/6.juju-log postgresql.conf settings unchanged
unit-postgres-ha-6: 20:50:51 INFO unit.postgres-ha/6.juju-log Invoking reactive handler: reactive/postgresql/service.py:851:set_active
unit-postgres-ha-6: 20:50:51 INFO unit.postgres-ha/6.juju-log active: Live secondary (9.5.10)
unit-postgres-ha-6: 20:50:52 DEBUG unit.postgres-ha/6.juju-log Coordinator: Leader handling coordinator requests
unit-postgres-ha-6: 20:50:52 DEBUG unit.postgres-ha/6.juju-log Coordinator: Publishing state
unit-postgres-ha-6: 20:50:33 INFO unit.postgres-ha/6.juju-log Reactive main running for hook update-status
unit-postgres-ha-6: 20:50:33 DEBUG unit.postgres-ha/6.juju-log Coordinator: Using charms.coordinator.SimpleCoordinator coordinator
unit-postgres-ha-6: 20:50:33 INFO unit.postgres-ha/6.juju-log Initializing Snap Layer
unit-postgres-ha-6: 20:50:34 DEBUG unit.postgres-ha/6.update-status none
unit-postgres-ha-6: 20:50:34 INFO unit.postgres-ha/6.juju-log Initializing Apt Layer
unit-postgres-ha-6: 20:50:34 DEBUG unit.postgres-ha/6.juju-log Coordinator: Loading state
unit-postgres-ha-6: 20:50:35 DEBUG unit.postgres-ha/6.juju-log Coordinator: Leader handling coordinator requests
unit-postgres-ha-6: 20:50:35 INFO unit.postgres-ha/6.juju-log Coordinator: Initializing coordinator layer
unit-postgres-ha-6: 20:50:36 INFO unit.postgres-ha/6.juju-log Initializing Leadership Layer (is leader)
unit-postgres-ha-6: 20:50:36 INFO unit.postgres-ha/6.juju-log preflight handler: reactive/workloadstatus.py:57:initialize_workloadstatus_state
unit-postgres-ha-6: 20:50:36 INFO unit.postgres-ha/6.juju-log preflight handler: reactive/postgresql/preflight.py:25:block_on_bad_juju
unit-postgres-ha-6: 20:50:37 INFO unit.postgres-ha/6.juju-log preflight handler: reactive/postgresql/preflight.py:33:block_on_invalid_config
unit-postgres-ha-6: 20:50:37 INFO unit.postgres-ha/6.juju-log Invoking reactive handler: reactive/postgresql/service.py:45:main
unit-postgres-ha-6: 20:50:37 DEBUG unit.postgres-ha/6.juju-log Reactive state: leadership.is_leader
unit-postgres-ha-6: 20:50:37 DEBUG unit.postgres-ha/6.juju-log Reactive state: leadership.set.coordinator
unit-postgres-ha-6: 20:50:37 DEBUG unit.postgres-ha/6.juju-log Reactive state: leadership.set.master
unit-postgres-ha-6: 20:50:37 DEBUG unit.postgres-ha/6.juju-log Reactive state: leadership.set.replication_password
unit-postgres-ha-6: 20:50:38 DEBUG unit.postgres-ha/6.juju-log Reactive state: postgresql.client.passwords_set
unit-postgres-ha-6: 20:50:38 DEB...
