HA tests fail after the leader is deleted

Bug #1640535 reported by Curtis Hovey
This bug affects 2 people
Affects         Status         Importance  Assigned to     Milestone
Canonical Juju  Fix Released   Critical    Horacio Durán

Bug Description

As seen in
    http://reports.vapour.ws/releases/issue/5762cc04749a5666b4caf038

Juju CI started seeing systemic HA failures about Friday Nov 3. The failures were in AWS, Rackspace, Azure, and Prodstack. The failures were across trusty and xenial controllers. After the leader is deleted, juju status fails. We expect juju to fall back to another controller.

The first signs of the problem are in
    http://reports.vapour.ws/releases/4560
    http://reports.vapour.ws/releases/4561

After a change to the test to ensure lingering procs, we see a consistent timeout waiting for a new controller to answer the client's call to status.
    http://reports.vapour.ws/releases/4563
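
The timeout described above amounts to a poll-until-deadline loop. A minimal sketch of that pattern follows; it is not the CI test's actual code. The `check` callable is a hypothetical stand-in for the client's status call, and `clock` and `sleep` are injectable only so the loop can be exercised without real waiting.

```python
import time


def wait_for(check, timeout, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. Exceptions raised by
    check() are treated as "not ready yet".
    """
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # e.g. the status call failed because no controller answered
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a loop like this, a failure to fail over shows up as `wait_for` returning False once the deadline passes, which matches the consistent timeout seen in the CI runs.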

Tags: ci regression ha
Curtis Hovey (sinzui) wrote :

While a prodstack test was failing, I was able to confirm on the machine that juju status failed. I see the three controllers that "nova list" shows in the api-endpoints in the JUJU_DATA dir set up for the test.

$ cat /var/lib/jenkins/cloud-city/jes-homes/functional-ha-recovery-prodstack/controllers.yaml
controllers:
  functional-ha-recovery-prodstack:
    unresolved-api-endpoints: ['10.25.28.67:17070', '10.25.28.7:17070', '10.25.28.70:17070']
    uuid: 60807685-8236-4984-8d91-7be04f1e4732
    api-endpoints: ['10.25.28.67:17070', '10.25.28.7:17070', '10.25.28.70:17070']
    ca-cert: |
      -----BEGIN CERTIFICATE-----
      CERT
      -----END CERTIFICATE-----
    cloud: prodstack45
    region: bootstack-ps45
    agent-version: 2.1-beta1
    controller-machine-count: 1
    active-controller-machine-count: 0
    model-count: 2
    machine-count: 1
current-controller: functional-ha-recovery-prodstack
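
To see which of the listed api-endpoints still accept connections after the leader is deleted, something like the following could be run from the test machine. It is a rough sketch: the function name `probe_endpoints` is made up for illustration, and a successful TCP connect only shows the port is open, not that the API server behind it can answer status.

```python
import socket


def probe_endpoints(endpoints, timeout=2.0):
    """Return {endpoint: bool}, True when a TCP connection to it succeeds."""
    results = {}
    for endpoint in endpoints:
        # Endpoints are "host:port" strings, as in controllers.yaml.
        host, _, port = endpoint.rpartition(":")
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                results[endpoint] = True
        except OSError:
            # Covers refused connections, unreachable hosts and timeouts.
            results[endpoint] = False
    return results
```

For the file above, the call would be `probe_endpoints(['10.25.28.67:17070', '10.25.28.7:17070', '10.25.28.70:17070'])`.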

Changed in juju:
milestone: none → 2.1.0-beta1
assignee: nobody → James Tunnicliffe (dooferlad)
Changed in juju:
assignee: James Tunnicliffe (dooferlad) → Richard Harding (rharding)
Richard Harding (rharding) wrote :

+1 to it not working in GCE. In testing, the client does realize the master is gone and rotates pinging the other two nodes. However, those nodes don't end up picking up the mongodb primary.

I restarted juju-db on one of the other two controllers and got the following failure while trying to get mongodb happy again:

https://pastebin.canonical.com/170629/

Menno Finlay-Smits (menno.smits) wrote :

The first thing to establish is whether this is mongodb failing to elect a new master or juju not handling the failover correctly. My guess is that it's a Juju issue, but MongoDB 3.2 does use a new consensus protocol so there could be a problem there too.

In order to check that the remaining nodes elect a new leader once the previous leader is killed, please check the output of "rs.status()" in the mongo shell on each of the remaining nodes.

Up-to-date details for getting a mongo shell are here: https://github.com/juju/juju/wiki/Login-into-MongoDB
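
When comparing the "rs.status()" output from several nodes, the useful part is each member's "stateStr". A small sketch of that check, assuming the status document has already been obtained as a Python dict (for example through a driver's replSetGetStatus command); the function name and the member names in the usage note are made up for illustration.

```python
def summarize_replica_set(status):
    """Summarize an rs.status() document.

    Returns (primary, states) where `states` maps each member name to its
    stateStr, and `primary` is the name of the single member reporting
    PRIMARY, or None when there is no primary (or more than one).
    """
    states = {
        member["name"]: member.get("stateStr", "UNKNOWN")
        for member in status.get("members", [])
    }
    primaries = [name for name, state in states.items() if state == "PRIMARY"]
    primary = primaries[0] if len(primaries) == 1 else None
    return primary, states
```

If the remaining nodes elected a new leader, each node's document should yield the same non-None primary. If every surviving member stays in SECONDARY (or another non-primary state), the election did not happen and the problem is on the mongodb side rather than in Juju's failover handling.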

Curtis Hovey (sinzui)
description: updated
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1-beta1 → 2.1-beta2
Changed in juju:
assignee: Richard Harding (rharding) → Horacio Durán (hduran-8)
Changed in juju:
status: Triaged → In Progress
Horacio Durán (hduran-8) wrote :
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1-beta2 → none
Curtis Hovey (sinzui)
Changed in juju:
milestone: none → 2.1-rc1
Curtis Hovey (sinzui) wrote :
Horacio Durán (hduran-8) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
milestone: 2.1-rc1 → 2.1-beta3
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released