HA: juju behaves incorrectly when mongo on master state server dies
Bug #1339866 reported by Michael Foord
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Won't Fix | Medium | Unassigned | |
Bug Description
In an HA system, if mongo dies but the jujud process itself remains alive, mongo fails over correctly (the system still appears to function), but the state servers don't recognise that the master state server is non-functional. The peergroupers do not rewrite agent.conf (etc.).
To reproduce:
juju bootstrap
juju ensure-availability
Then kill mongo on machine 0 (master state server).
description: updated
Changed in juju-core:
milestone: none → 1.21-alpha1
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Changed in juju-core:
status: Triaged → Invalid
Changed in juju-core:
milestone: 1.21-alpha1 → none
tags: added: cts
tags: added: cts-cloud-escalation removed: cts
Changed in juju-core:
status: Confirmed → Triaged
importance: High → Medium
After further experimentation, and verification of the *actual* specified behaviour, I can confirm that juju does behave correctly when mongo on the primary HA state server (or on a secondary) dies.
The symptom that led us to believe it didn't behave correctly was that the machine's agent.conf was not rewritten, and the now-dead machine was still listed as an API server. However, this is actually the expected behaviour. When mongo goes down, jujud remains up, but if it is the master it shuts down all the relevant jobs and workers (verified from the machine log), and the mongo primary fails over to a new machine, which becomes the juju master. The old machine is left in the mongo replica set, and is still listed as a valid API server, as it *may* come back. Running "juju ensure-availability" again will remove its entry (and also shut down the instance it runs on, I believe).
Clients and machine agents have a list of all api servers, and if contacting one fails (e.g. our down machine) then they will automatically try the other entries in the list. So this behaviour is "as specified" and not a problem.
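To make the fallback behaviour concrete, here is a minimal sketch of the "try each entry in the list" logic described above. It is an illustration only, not juju's actual client code: the names `connect_with_failover`, `AllServersFailed` and `fake_dial`, and the addresses used, are all made up for this example.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class AllServersFailed(Exception):
    """Raised when none of the listed API servers could be reached (hypothetical name)."""


def connect_with_failover(addresses: Iterable[str], dial: Callable[[str], T]) -> T:
    """Try each API server address in order; return the first successful connection."""
    errors: List[Tuple[str, OSError]] = []
    for addr in addresses:
        try:
            return dial(addr)
        except OSError as exc:  # e.g. connection refused or timed out
            errors.append((addr, exc))
    raise AllServersFailed(errors)


def fake_dial(addr: str) -> str:
    """Stand-in dialler: pretends the first (old master) machine is down."""
    if addr == "10.0.0.1:17070":  # assumed address of the machine whose mongo died
        raise ConnectionRefusedError(addr)
    return f"connected to {addr}"
```

With this stand-in dialler, `connect_with_failover(["10.0.0.1:17070", "10.0.0.2:17070"], fake_dial)` skips the dead entry and returns the connection to the second address, which is why leaving the old machine in the API server list is harmless.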