HA tear down causes last controller to be unusable
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Ian Booth | 2.9.33
Bug Description
Found while running (cd tests ; ./main.sh -v controller), reproducible outside of the test as well:
juju 2.9.29, lxd 5.1, jammy/focal host
juju bootstrap localhost
juju enable-ha
wait for high availability to be established (see the sketch below)
juju remove-machine -m controller 2
juju remove-machine -m controller 1
juju commands start returning:
ERROR not master and slaveOk=false
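The "wait for high availability" step can be scripted by polling juju status until all three controller machines report as started. A minimal sketch (assumes jq is available; the 3-machine count and 10s poll interval are illustrative, and a stricter check would also confirm voting status, e.g. via juju show-controller):

# sketch: wait until all 3 controller machines report "started"
until [ "$(juju status -m controller --format json | jq '[.machines[] | select(."juju-status".current == "started")] | length')" = "3" ]; do
    sleep 10
done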
Investigation shows that the sole remaining controller is marked as SECONDARY rather than PRIMARY. Looking at logs for details. It seems to happen as the 2nd machine is removed. juju show-controller lists machine 0 as the primary until the commands start to fail.
There is a timing aspect to this issue.
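One way to confirm the replica set state is to inspect the members from the mongo shell on the remaining controller (a sketch using standard mongo shell helpers; the prompt mirrors the recovery steps below):

juju:SECONDARY> rs.isMaster().ismaster   // false while the node is stuck as SECONDARY
juju:SECONDARY> rs.status().members.forEach(function (m) { print(m._id, m.name, m.stateStr) })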
Recovery steps:
1. Log in to the remaining controller's db. https:/
2. Follow steps here: https:/
juju:SECONDARY> cfg = rs.conf()
juju:SECONDARY> cfg.members = [cfg.members[0]]
juju:SECONDARY> rs.reconfig(cfg, {force : true})
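After the forced reconfig, the single remaining member should elect itself and report as PRIMARY; a quick sanity check (sketch) before retrying juju commands:

juju:SECONDARY> rs.status().members[0].stateStr   // should become "PRIMARY" once the election completes
juju:SECONDARY> db.isMaster().ismaster            // should flip to true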
Not reproducible on AWS so far, nor with every LXD config.
Changed in juju:
milestone: none → 2.9.33
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
status: New → In Progress

Changed in juju:
status: In Progress → Fix Committed

Changed in juju:
status: Fix Committed → Fix Released
machine-2: 19:52:25 ERROR juju.worker.dependency "machiner" manifold worker returned unexpected error: machine-2 failed to set machine to dead: machine 2 is still a voting controller member
machine-1: 19:52:25 ERROR juju.worker.dependency "machiner" manifold worker returned unexpected error: machine-1 failed to set machine to dead: machine 1 is still a voting controller member
machine-0: 19:52:26 ERROR juju.worker.peergrouper failed to remove dying controller as a controller after removing its vote: controller 1 cannot be removed as it is the last controller
machine-0: 19:52:26 ERROR juju.worker.peergrouper failed to remove dying controller as a controller after removing its vote: controller 2 cannot be removed as it is the last controller
machine-1: 19:52:26 ERROR juju.worker.dependency "is-primary-controller-flag" manifold worker returned unexpected error: connection is shut down
machine-1: 19:52:26 ERROR juju.cmd.jujud.runner fatal "1-container-watcher": worker "1-container-watcher" exited: connection is shut down
machine-0: 19:52:26 ERROR juju.cmd.jujud.runner fatal "0-container-watcher": worker "0-container-watcher" exited: connection is shut down
machine-2: 19:52:26 ERROR juju.cmd.jujud.runner fatal "2-container-watcher": connection is shut down
machine-1: 19:52:26 ERROR juju.worker.dependency "is-primary-controller-flag" manifold worker returned unexpected error: permission denied (unauthorized access)
machine-2: 19:52:27 ERROR juju.worker.dependency "is-primary-controller-flag" manifold worker returned unexpected error: permission denied (unauthorized access)
machine-0: 19:52:37 ERROR juju.worker.peergrouper cannot set replicaset: cannot remove member 2 from replicaset: Reconfig finished but failed to propagate to a majority :: caused by :: Current config with {version: 4, term: 2} has not yet propagated to a majority of nodes :: caused by :: operation was interrupted
machine-0: 19:52:37 ERROR juju.worker.dependency "peer-grouper" manifold worker returned unexpected error: cannot get controller ids: reading controller info: cannot get controllers document: not master and slaveOk=false
"juju-machine-id" : "2" is the mongo replica set member I had to remove from the config in the recovery steps above.