trusty juju 1.25.5 HA availability issues

Bug #1575448 reported by Brad Marshall
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Anastasia
juju-core       Fix Released  High        Anastasia
1.25            Fix Released  High        Anastasia

Bug Description

I'm trying to deploy Mitaka OpenStack via Juju 1.25.5 running on trusty with MAAS, deploying Xenial to bare metal.

As part of this deployment we're using HA juju state servers. I'm running into a problem where the extra state servers don't join the cluster, which then blocks them from doing anything further. They end up like this:

  "1":
    agent-state: started
    agent-state-info: (started)
    agent-version: 1.25.5
    dns-name: transient-vm2.maas
    instance-id: /MAAS/api/1.0/nodes/node-c33f327a-ba6d-11e5-a4b7-a0d3c1ef015d/
    series: xenial
    hardware: arch=amd64 cpu-cores=2 mem=30720M tags=kvm-transient,virtual,openstack-ha,kvm-transient-infra
      availability-zone=default
    state-server-member-status: adding-vote
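
A stuck state server like the one above can be picked out of the status output mechanically. The following is a minimal sketch, not part of the original report: the function name `stuck_state_servers` is hypothetical, and it assumes the YAML layout shown in this snippet (quoted machine ids, one `state-server-member-status` line per state server).

```shell
#!/bin/sh
# Sketch: list machines whose state-server-member-status is not "has-vote".
# Reads `juju status --format=yaml` style text on stdin.
# Assumes machine entries look like   "1":   as in the snippet above.
stuck_state_servers() {
    awk '
        # Remember the id of the machine entry we are currently inside.
        NF == 1 && $1 ~ /^"[0-9]+":$/ {
            id = $1
            gsub(/[":]/, "", id)
        }
        # Report any state server that has not reached has-vote yet.
        $1 == "state-server-member-status:" && $2 != "has-vote" {
            print id, $2
        }
    '
}
```

Possible usage, assuming a working juju client: `juju status --format=yaml | stuck_state_servers`. Each output line holds a machine id and its current member status.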

On the node 0 side, jujud is listening on port 17070 as expected, but neither node 1 nor node 2 is.
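
The listening check can be scripted as well. This is a hedged sketch only: `has_listener` is a hypothetical helper, and it assumes the usual `ss -ltn` column layout in which the fourth column is the local address and port.

```shell
#!/bin/sh
# Sketch: succeed if `ss -ltn` output on stdin shows a listener on the port.
# Assumes column 1 is the socket state and column 4 is "address:port".
has_listener() {
    port=$1
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

Possible usage: `juju ssh 1 'ss -ltn' | has_listener 17070 && echo "API port open"`. The exit status is 0 when a listener is found and 1 otherwise.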

The last line of the machine-1 log file is:

  2016-04-27 00:32:22 DEBUG juju.worker runner.go:191 stop "state"

I've reproduced this on two different environments, one with MAAS 1.9.0 and one with MAAS 1.9.1, both with juju 1.25.5 and trusty.

Steps to reproduce:
* Set default series to xenial in juju environment
* Use MAAS 1.9.x (tested with maas 1.9.0 and 1.9.1) and juju 1.25.5 on Trusty.
* juju bootstrap
* juju deploy -n 2 ubuntu
* juju ensure-availability --to 1,2

Test setup:
$ dpkg-query -W maas
maas 1.9.1+bzr4543-0ubuntu2~trusty1
   AND
$ dpkg-query -W maas
maas 1.9.0+bzr4533-0ubuntu1~trusty1

$ dpkg-query -W juju-core
juju-core 1.25.5-0ubuntu1~14.04.2~juju1

Please let me know if you need any further information.

Brad Marshall (brad-marshall) wrote :

Node 0 /var/log/juju

Brad Marshall (brad-marshall) wrote :

Node 1 /var/log/juju

Brad Marshall (brad-marshall) wrote :

Node 2 /var/log/juju

Cheryl Jennings (cherylj) wrote :

Can you attach /var/log/syslog for each of the three state servers? There may be something going on with mongo that would show up in those logs.

Changed in juju-core:
status: New → Incomplete
Brad Marshall (brad-marshall) wrote :

Tarball of all 3 state nodes syslog attached. Please let me know if you need any more information.

Changed in juju-core:
status: Incomplete → New
Anastasia (anastasia-macmood) wrote :

I am seeing a few "broken pipes" and some "bad credentials" messages.

I'll dig further.

Brad Marshall (brad-marshall) wrote :

FWIW I suspect something is going on with the non-HA state server we have running, so I'll attach the machine-0 log here. There are an awful lot of "broken pipes", "Closed explicitly", EOF and other messages.

Anastasia (anastasia-macmood) wrote :

From the conversation with Dimiter, it seems like the root cause is the same as https://bugs.launchpad.net/juju-core/+bug/1576021

It's reported fixed in 1.25.6.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.25.6
Cheryl Jennings (cherylj) wrote :

This bug can't be caused by bug 1576021, as the cause of that bug was introduced after we shipped 1.25.5.

Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
Brad Marshall (brad-marshall) wrote :

I've done some more testing, and restarting the 2 hung machine agents seems to let things eventually settle down, and all 3 nodes end up being a state server.
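
The thread does not say how the agents were restarted. As a sketch under stated assumptions (the helper name `agent_restart_cmd` is hypothetical, the agent service is assumed to be named `jujud-machine-<id>`, and xenial machines are assumed to use systemd while trusty machines use upstart), the restart command for one machine could be built like this:

```shell
#!/bin/sh
# Sketch: print the command that restarts one machine agent.
# Assumption: the service is called jujud-machine-<id>.
# The second argument selects the init system: systemd (default) or upstart.
agent_restart_cmd() {
    id=$1
    init=${2:-systemd}
    case "$init" in
        systemd) echo "sudo systemctl restart jujud-machine-$id" ;;
        upstart) echo "sudo service jujud-machine-$id restart" ;;
        *) echo "unknown init system: $init" >&2; return 1 ;;
    esac
}
```

Possible usage for the two hung xenial machines: `for m in 1 2; do juju ssh "$m" "$(agent_restart_cmd "$m")"; done`.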

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Changed in juju-core:
status: Triaged → In Progress
Anastasia (anastasia-macmood) wrote :

An obsolete version of the HA facade was used.

Proposal against 1.25: https://github.com/juju/juju/pull/5441

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Incomplete
importance: High → Undecided
assignee: Anastasia (anastasia-macmood) → nobody
milestone: 1.25.6 → none
Brad Marshall (brad-marshall) wrote :

Unfortunately I've just tried this with the juju binaries provided, and it doesn't seem to have worked.

To deploy, I copied the binaries to a directory, put that directory at the front of my PATH, and then ran:

  $ juju bootstrap --constraints "tags=<tag>" --show-log --debug --upload-tools
  $ juju deploy local:xenial/ubuntu -n2 --constraints "tags=<tag>"
  $ juju ensure-availability --to 1,2

After > 10 minutes, the extra nodes end up looking like:

  "1":
    agent-state: started
    agent-state-info: (started)
    agent-version: 1.25.6.1
    dns-name: ditto.maas
    instance-id: /MAAS/api/1.0/nodes/node-c832aaa2-bb3d-11e5-8d95-a0d3c1ef015d/
    series: xenial
    hardware: arch=amd64 cpu-cores=48 mem=262136M tags=baremetal-stable availability-zone=default
    state-server-member-status: adding-vote

There is no action in the machine 1 and 2 logs at all after the initial juju ensure-availability.

$ juju --version
1.25.6-trusty-amd64

$ dpkg-query -W maas
maas 1.9.2+bzr4568-0ubuntu1~trusty1

Brad Marshall (brad-marshall) wrote :

Machine 0 /var/log/syslog and machine-0.log after deployment, waited > 10 minutes.

Brad Marshall (brad-marshall) wrote :

Machine 1 /var/log/syslog and machine-1.log

Brad Marshall (brad-marshall) wrote :

Machine 2 /var/log/syslog and machine-2.log

Brad Marshall (brad-marshall) wrote :

Mongodb replication status.

Brad Marshall (brad-marshall) wrote :

Machine-0 agent.conf (redacted - let me know if you need any of the other values).

Brad Marshall (brad-marshall) wrote :

Machine-1 agent.conf - redacted again. The only difference between this and machine-2 agent.conf is tag, nonce, statepassword, apipassword and AGENT_SERVICE_NAME.

Brad Marshall (brad-marshall) wrote :

I can confirm that restarting the machine agents on machines 1 and 2 got mongodb running on those 2 nodes, and got the state servers working properly. The agent.conf files looked much better afterwards; the only bit missing was the oldpassword setting.

Brad Marshall (brad-marshall) wrote :

I redeployed the environment using a Xenial KVM instance, and I got the same thing: no HA, and machines 1 and 2 were stuck in adding-vote. If I restart the machine agents on them, they start up mongodb on the 2 nodes and set up the state servers, getting them into has-vote.

Anastasia (anastasia-macmood) wrote :

So to clarify, what Brad and I have observed:

1. 1.25.5 on trusty (host) -trusty (unit) works
2. 1.25.6 on trusty (host) -trusty (unit) works

3. 1.25.5 on trusty (host) -xenial (unit) requires restart
4. 1.25.6 on trusty (host) -xenial (unit) requires restart
5. 1.25.5 on xenial (host) -xenial (unit) requires restart

To be confirmed
6. 1.25.6 on xenial (host) -xenial (unit)

Anastasia (anastasia-macmood) wrote :

The original issue has been fixed and thoroughly tested manually.

The fallout happened after the fix was committed, with a follow-on commit: https://github.com/juju/juju/commit/c9588231f6ba90fe9b753c4dae66c9d5c57bbc1f

This commit rolled forward the systemd and dbus dependencies. I confirmed this by reverting to the previous versions, which un-stuck HA.

Changed in juju-core:
status: Incomplete → Fix Committed
Anastasia (anastasia-macmood) wrote :

Created new bug to track fix for dependent libraries - https://bugs.launchpad.net/juju-core/+bug/1595155

Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
importance: Undecided → High
milestone: none → 2.0-beta10
status: Fix Committed → Fix Released
milestone: 2.0-beta10 → none
affects: juju-core → juju
Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
importance: Undecided → High
status: New → Fix Released