Canonical Juju

HA tests fail after the leader is deleted

Bug #1640535 reported by Curtis Hovey on 2016-11-09

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	Horacio Durán	Canonical Juju 2.1-beta3

Bug Description

As seen in
http://reports.vapour.ws/releases/issue/5762cc04749a5666b4caf038

Juju CI started seeing systemic HA failures around Friday, Nov 3. The failures occurred on AWS, Rackspace, Azure, and Prodstack, across both trusty and xenial controllers. After the leader is deleted, juju status fails. We expect juju to fall back to another controller.

The first signs of the problem are in
http://reports.vapour.ws/releases/4560
http://reports.vapour.ws/releases/4561

After a change to the test to ensure lingering procs, we see a consistent timeout waiting for a new controller to answer the client's call to status.
http://reports.vapour.ws/releases/4563

Curtis Hovey (sinzui) wrote on 2016-11-09:

While a prodstack test was failing, I was able to confirm on the machine that juju status failed. The three controllers that "nova list" shows are listed in the api-endpoints in the JUJU_DATA dir set up for the test.

$ cat /var/lib/jenkins/cloud-city/jes-homes/functional-ha-recovery-prodstack/controllers.yaml
controllers:
  functional-ha-recovery-prodstack:
    unresolved-api-endpoints: ['10.25.28.67:17070', '10.25.28.7:17070', '10.25.28.70:17070']
    uuid: 60807685-8236-4984-8d91-7be04f1e4732
    api-endpoints: ['10.25.28.67:17070', '10.25.28.7:17070', '10.25.28.70:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      CERT
      -----END CERTIFICATE-----
    cloud: prodstack45
    region: bootstack-ps45
    agent-version: 2.1-beta1
    controller-machine-count: 1
    active-controller-machine-count: 0
    model-count: 2
    machine-count: 1
current-controller: functional-ha-recovery-prodstack
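
As a rough aid for reproducing this, the sketch below checks which of the api-endpoints listed above still accept a TCP connection after the leader is deleted. It is only an illustration: the helper names (parse_endpoint, tcp_probe, reachable_endpoints) are hypothetical and not part of Juju, and an open port does not prove that the API server behind it can actually answer juju status.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split a 'host:port' string, as found in api-endpoints, into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"not a host:port endpoint: {endpoint!r}")
    return host, int(port)


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachable_endpoints(
    endpoints: Iterable[str],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> List[str]:
    """Return the endpoints for which the probe succeeds, in the original order."""
    return [ep for ep in endpoints if probe(*parse_endpoint(ep))]


if __name__ == "__main__":
    # Endpoints copied from the controllers.yaml shown in this comment.
    endpoints = ["10.25.28.67:17070", "10.25.28.7:17070", "10.25.28.70:17070"]
    for ep in reachable_endpoints(endpoints):
        print("reachable:", ep)
```

The probe is passed in as a parameter so the filtering logic can be exercised without network access.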

Changed in juju:
milestone:	none → 2.1.0-beta1
assignee:	nobody → James Tunnicliffe (dooferlad)

Richard Harding (rharding) on 2016-11-14

Changed in juju:
assignee:	James Tunnicliffe (dooferlad) → Richard Harding (rharding)

Richard Harding (rharding) wrote on 2016-11-14:

+1 to it not working in GCE. In testing, the client does realize the master is gone and rotates through pinging the other two nodes. However, those nodes don't end up picking up the mongodb primary role.

I restarted juju-db on one of the other two controllers and got the following failure when trying to get mongodb happy again:

https://pastebin.canonical.com/170629/

Menno Finlay-Smits (menno.smits) wrote on 2016-11-14:

The first thing to establish is whether this is mongodb failing to elect a new master or juju not handling the failover correctly. My guess is that it's a Juju issue, but MongoDB 3.2 does use a new consensus protocol so there could be a problem there too.

In order to check that the remaining nodes elect a new leader once the previous leader is killed, please check the output of "rs.status()" in the mongo shell on each of the remaining nodes.

Up-to-date details for getting a mongo shell are here: https://github.com/juju/juju/wiki/Login-into-MongoDB
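
To make the rs.status() check concrete, here is a minimal sketch of how the result could be read once it has been captured as a dictionary (for example, pasted from the mongo shell as JSON). It relies on the "members" array with the "name", "stateStr" and "health" fields that rs.status() reports; the function names and the sample data are hypothetical and for illustration only.

```python
from typing import Any, Dict, List, Optional


def find_primary(status: Dict[str, Any]) -> Optional[str]:
    """Return the name of the healthy PRIMARY member, or None if there is none."""
    for member in status.get("members", []):
        if member.get("stateStr") == "PRIMARY" and member.get("health") == 1:
            return member.get("name")
    return None


def summarize_members(status: Dict[str, Any]) -> List[str]:
    """Return one 'name: state' line per replica set member."""
    return [
        f"{m.get('name')}: {m.get('stateStr')}"
        for m in status.get("members", [])
    ]


if __name__ == "__main__":
    # Hypothetical sample, not output captured from the failing test:
    # the old leader is unreachable and no remaining node has become primary.
    sample = {
        "members": [
            {"name": "node-a:37017", "stateStr": "(not reachable/healthy)", "health": 0},
            {"name": "node-b:37017", "stateStr": "SECONDARY", "health": 1},
            {"name": "node-c:37017", "stateStr": "SECONDARY", "health": 1},
        ]
    }
    for line in summarize_members(sample):
        print(line)
    print("primary:", find_primary(sample))
```

If find_primary returns None on every remaining node, that would point at the replica set not electing a new primary; if one node does report a primary and juju status still times out, the failover handling in Juju is the more likely suspect.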

Curtis Hovey (sinzui) on 2016-11-16

description:	updated

Curtis Hovey (sinzui) on 2016-11-17

Changed in juju:
milestone:	2.1-beta1 → 2.1-beta2

Alexis Bruemmer (alexis-bruemmer) on 2016-11-22

Changed in juju:
assignee:	Richard Harding (rharding) → Horacio Durán (hduran-8)

Horacio Durán (hduran-8) on 2016-11-23

Changed in juju:
status:	Triaged → In Progress

Horacio Durán (hduran-8) wrote on 2016-11-30:

The culprit is juju and almost certainly https://github.com/juju/juju/commit/b02339f16e10ae472bcbf646846f566a79dee5e7

Curtis Hovey (sinzui) on 2016-12-01

Changed in juju:
milestone:	2.1-beta2 → none

Curtis Hovey (sinzui) on 2016-12-02

Changed in juju:
milestone:	none → 2.1-rc1

Curtis Hovey (sinzui) wrote on 2016-12-02:

Horacio has a fix at https://github.com/juju/juju/pull/6652

Horacio Durán (hduran-8) wrote on 2016-12-02:

I proposed https://github.com/juju/juju/pull/6652

Anastasia (anastasia-macmood) on 2016-12-04

Changed in juju:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) on 2016-12-13

Changed in juju:
milestone:	2.1-rc1 → 2.1-beta3

Curtis Hovey (sinzui) on 2016-12-15

Changed in juju:
status:	Fix Committed → Fix Released

Duplicates of this bug

Bug #1638944