Switchover not triggered: fails to fail over

Bug #1744141 reported by MichaelEino
This bug affects 2 people
Affects: PostgreSQL Charm · Status: Triaged · Importance: High · Assigned to: Unassigned

Bug Description

When I shut down the master node, nothing happens: the secondary stays as it is.

unit-pgsql-ha-8: 20:31:02 INFO unit.pgsql-ha/8.juju-log active: Live master (9.5.10)
unit-pgsql-ha-8: 20:31:02 DEBUG unit.pgsql-ha/8.juju-log Coordinator: Publishing state

#### Nothing happened during these 5 minutes, even though I had stopped the master machine

unit-pgsql-ha-7: 20:35:58 INFO unit.pgsql-ha/7.juju-log Reactive main running for hook update-status
unit-pgsql-ha-7: 20:35:59 DEBUG unit.pgsql-ha/7.juju-log Coordinator: Using charms.coordinator.SimpleCoordinator coordinator
unit-pgsql-ha-7: 20:35:59 INFO unit.pgsql-ha/7.juju-log Initializing Snap Layer
unit-pgsql-ha-7: 20:35:59 DEBUG unit.pgsql-ha/7.update-status none
unit-pgsql-ha-7: 20:35:59 INFO unit.pgsql-ha/7.juju-log Initializing Apt Layer
unit-pgsql-ha-7: 20:36:00 DEBUG unit.pgsql-ha/7.juju-log Coordinator: Loading state
unit-pgsql-ha-7: 20:36:01 DEBUG unit.pgsql-ha/7.juju-log Coordinator: Leader handling coordinator requests
unit-pgsql-ha-7: 20:36:01 INFO unit.pgsql-ha/7.juju-log Coordinator: Initializing coordinator layer
unit-pgsql-ha-7: 20:36:01 INFO unit.pgsql-ha/7.juju-log Initializing Leadership Layer (is leader)
unit-pgsql-ha-7: 20:36:02 INFO unit.pgsql-ha/7.juju-log preflight handler: reactive/workloadstatus.py:57:initialize_workloadstatus_state
unit-pgsql-ha-7: 20:36:02 INFO unit.pgsql-ha/7.juju-log preflight handler: reactive/postgresql/preflight.py:25:block_on_bad_juju
unit-pgsql-ha-7: 20:36:02 INFO unit.pgsql-ha/7.juju-log preflight handler: reactive/postgresql/preflight.py:33:block_on_invalid_config
unit-pgsql-ha-7: 20:36:02 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/client.py:38:publish_client_relations
unit-pgsql-ha-7: 20:36:02 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/replication.py:38:replication_states
unit-pgsql-ha-7: 20:36:05 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:45:main
unit-pgsql-ha-7: 20:36:05 DEBUG unit.pgsql-ha/7.update-status sudo: unable to resolve host juju-maas1
unit-pgsql-ha-7: 20:36:05 DEBUG unit.pgsql-ha/7.juju-log Reactive state: leadership.is_leader
unit-pgsql-ha-7: 20:36:05 DEBUG unit.pgsql-ha/7.juju-log Reactive state: leadership.set.coordinator
unit-pgsql-ha-7: 20:36:05 DEBUG unit.pgsql-ha/7.juju-log Reactive state: leadership.set.master
unit-pgsql-ha-7: 20:36:05 DEBUG unit.pgsql-ha/7.juju-log Reactive state: leadership.set.replication_password
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.cluster.created
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.cluster.is_running
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.cluster.kernel_settings.set
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.cluster.locale.set
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.cloned
unit-pgsql-ha-7: 20:36:06 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.had_peers
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.has_master
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.has_peers
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.master.authorized
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.replication.master.peered
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: postgresql.wal_e.configured
unit-pgsql-ha-7: 20:36:07 DEBUG unit.pgsql-ha/7.juju-log Reactive state: workloadstatus.active
unit-pgsql-ha-7: 20:36:08 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:872:set_version
unit-pgsql-ha-7: 20:36:08 INFO unit.pgsql-ha/7.juju-log Setting application version to 9.5.10
unit-pgsql-ha-7: 20:36:08 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/replication.py:350:publish_replication_details
unit-pgsql-ha-7: 20:36:10 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/apt.py:47:ensure_package_status
unit-pgsql-ha-7: 20:36:10 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:925:update_postgresql_crontab
unit-pgsql-ha-7: 20:36:10 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:879:install_administrative_scripts
unit-pgsql-ha-7: 20:36:11 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:205:configure_cluster
unit-pgsql-ha-7: 20:36:11 WARNING unit.pgsql-ha/7.juju-log Falling back to comma separated extra_pg_auth
unit-pgsql-ha-7: 20:36:11 DEBUG unit.pgsql-ha/7.juju-log Setting hot_standby to True
unit-pgsql-ha-7: 20:36:11 DEBUG unit.pgsql-ha/7.juju-log Setting wal_level to logical
unit-pgsql-ha-7: 20:36:12 DEBUG unit.pgsql-ha/7.juju-log Setting wal_keep_segments to 500
unit-pgsql-ha-7: 20:36:12 INFO unit.pgsql-ha/7.juju-log PostgreSQL has been configured
unit-pgsql-ha-7: 20:36:12 DEBUG unit.pgsql-ha/7.juju-log postgresql.conf settings unchanged
unit-pgsql-ha-7: 20:36:12 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/client.py:47:set_client_passwords
unit-pgsql-ha-7: 20:36:12 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/replication.py:475:follow_master
unit-pgsql-ha-7: 20:36:13 INFO unit.pgsql-ha/7.juju-log Continuing to follow pgsql-ha/8
unit-pgsql-ha-7: 20:36:13 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/client.py:106:mirror_master
unit-pgsql-ha-7: 20:36:13 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:767:update_pgpass
unit-pgsql-ha-7: 20:36:14 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:205:configure_cluster
unit-pgsql-ha-7: 20:36:15 WARNING unit.pgsql-ha/7.juju-log Falling back to comma separated extra_pg_auth
unit-pgsql-ha-7: 20:36:15 DEBUG unit.pgsql-ha/7.juju-log Setting hot_standby to True
unit-pgsql-ha-7: 20:36:15 DEBUG unit.pgsql-ha/7.juju-log Setting wal_level to logical
unit-pgsql-ha-7: 20:36:15 DEBUG unit.pgsql-ha/7.juju-log Setting wal_keep_segments to 500
unit-pgsql-ha-7: 20:36:16 INFO unit.pgsql-ha/7.juju-log PostgreSQL has been configured
unit-pgsql-ha-7: 20:36:16 DEBUG unit.pgsql-ha/7.juju-log postgresql.conf settings unchanged
unit-pgsql-ha-7: 20:36:16 INFO unit.pgsql-ha/7.juju-log Invoking reactive handler: reactive/postgresql/service.py:851:set_active
unit-pgsql-ha-7: 20:36:16 DEBUG unit.pgsql-ha/7.update-status sudo: unable to resolve host juju-maas1
unit-pgsql-ha-7: 20:36:17 INFO unit.pgsql-ha/7.juju-log active: Live secondary (9.5.10)
unit-pgsql-ha-7: 20:36:17 DEBUG unit.pgsql-ha/7.juju-log Coordinator: Leader handling coordinator requests
unit-pgsql-ha-7: 20:36:17 DEBUG unit.pgsql-ha/7.juju-log Coordinator: Publishing state

Revision history for this message
Stuart Bishop (stub) wrote :

Failover is only triggered when the master unit is removed (juju remove-unit pgsql-ha/7). This was a deliberate design decision: I did not want automatic failover, but rather for it to be explicit and under operator control. Also, at the time, there was no regularly triggered hook that would allow the charm to detect a failed master and react.
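In practice, the operator-driven failover described above looks roughly like the following (a sketch using standard Juju CLI commands; the unit names are taken from this report, so substitute your own):

```shell
# Explicitly remove the failed master unit; the charm then promotes a
# surviving secondary to master.
juju remove-unit pgsql-ha/8

# Watch the application status until the promoted unit reports "Live master".
juju status pgsql-ha
```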

This will be reworked, as people do want their PostgreSQL deployments to be HA and I think we have enough pieces now to do it. This will likely involve tearing out a lot of logic from the charm, and replacing it with 2nd Quadrant's repmgr. This approach means failover is not tied to Juju hooks, or even a contactable Juju controller. We will probably have to drop support for older versions of PostgreSQL at the same time, but with the upcoming Ubuntu 18.04 release we can do it in a tasteful way (keeping the trusty/xenial charm as it is and doing the charm rework on the bionic branch).

Changed in postgresql-charm:
status: New → Triaged
importance: Undecided → High
Revision history for this message
MichaelEino (michaeleino) wrote :

Yes, you are right. When the cluster was 2 nodes and I removed the master unit, failover did happen and the secondary was promoted to master.

After a while I added 2 lxd nodes, and after they started successfully as secondaries, I removed the master, but unfortunately failover did not happen.

They are still waiting for the master "pgsql-ha/7" to fail over from, even though it is gone:

Unit Workload Agent Machine Public address Ports Message
pgsql-ha/10 waiting idle 4/lxd/1 x.x.x.41 5432/tcp Failover from pgsql-ha/7
pgsql-ha/11* error idle 4/lxd/2 x.x.x.40 5432/tcp hook failed: "replication-relation-changed"
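As an aside, the failed hook on pgsql-ha/11 can be retried with the standard Juju CLI once the underlying state is sorted out (a sketch, not a fix for the missing-master condition itself):

```shell
# Retry the failed "replication-relation-changed" hook on the errored unit.
juju resolved pgsql-ha/11

# Follow that unit's log to see whether the hook succeeds on retry.
juju debug-log --include pgsql-ha/11
```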

I'm attaching the related logs.

