HA: juju behaves incorrectly when mongo on master state server dies

Bug #1339866 reported by Michael Foord
This bug affects 1 person
Affects: juju-core
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

In an HA system, if mongo dies but the jujud process itself remains alive, mongo fails over correctly (the system still appears to function), but the state servers don't recognise that the master state server is non-functional. The peergroupers do not rewrite agent.conf (etc.).

To reproduce:

juju bootstrap
juju ensure-availability

Then kill mongo on machine 0 (master state server).

Michael Foord (mfoord)
description: updated
Changed in juju-core:
milestone: none → 1.21-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → Invalid
Michael Foord (mfoord) wrote :

After further experimentation, and verification of the *actual* specified behaviour, I can confirm that juju does behave correctly when mongo on the primary HA state server (or on a secondary) dies.

The symptom that led us to believe it didn't behave correctly was that the machine's agent.conf was not rewritten and the now-dead machine was still listed as an API server. However, this is actually the expected behaviour. When mongo goes down, jujud remains up, but if it is the master it does shut down all the relevant jobs and workers (verified from the machine log), and the mongo primary fails over to a new machine, which becomes the juju master. The old machine is left in the mongo replica set, and is still listed as a valid API server, as it *may* come back. Running "juju ensure-availability" again will remove its entry (and, I believe, also shut down the instance it runs on).

Clients and machine agents have a list of all API servers, and if contacting one fails (e.g. our down machine), they automatically try the other entries in the list. So this behaviour is "as specified" and not a problem.

Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.21-alpha1 → none
julian wang (zeratul-j) wrote :

We are trying to deploy juju HA at a customer site.
With the same scenario:
$ juju bootstrap
$ juju ensure-availability
Then kill mongo on machine 0 (master state server).
Juju stops working (juju status does not respond). We think this is an HA bug.

===== juju log ============
ubuntu@maas-trusty:~$ juju status --debug
2014-10-19 17:42:14 INFO juju.cmd supercommand.go:37 running juju [1.20.8-trusty-amd64 gc]
2014-10-19 17:42:14 DEBUG juju.conn api.go:187 trying cached API connection settings
2014-10-19 17:42:14 INFO juju.conn api.go:270 connecting to API addresses: [bootstrap-trusty-01.beijing.cts.canonical.com:17070 bootstrap-trusty-01.beijing.cts.canonical.com:17070 10.231.64.39:17070 bootstrap-trusty-02.beijing.cts.canonical.com:17070 bootstrap-trusty-02.beijing.cts.canonical.com:17070 10.231.64.87:17070 bootstrap-trusty-03.beijing.cts.canonical.com:17070 bootstrap-trusty-03.beijing.cts.canonical.com:17070 10.231.64.88:17070]
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-01.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-01.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://10.231.64.39:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-02.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-02.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://10.231.64.87:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 DEBUG juju.state.api apiclient.go:248 error dialing "wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api", will retry: websocket.Dial wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api: dial tcp 10.231.64.88:17070: connection refused
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:42:14 DEBUG juju.state.api apiclient.go:248 error dialing "wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api", will retry: websocket.Dial wss://bootstrap-trusty-03.beijing.cts.canonical.com:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api: dial tcp 10.231.64.88:17070: connection refused
2014-10-19 17:42:14 INFO juju.state.api apiclient.go:242 dialing "wss://10.231.64.88:17070/environment/cf1c570b-0611-4585-8915-fc3fb53024d1/api"
2014-10-19 17:4...


Changed in juju-core:
status: Invalid → Confirmed
julian wang (zeratul-j)
tags: added: cts
Curtis Hovey (sinzui)
tags: added: cts-cloud-escalation
removed: cts
Changed in juju-core:
status: Confirmed → Triaged
importance: High → Medium
Alexis Bruemmer (alexis-bruemmer) wrote :

Please reopen this bug if it is still an issue on 2.0.

Changed in juju-core:
status: Triaged → Won't Fix