Destroyed leader, new leader not elected.

Bug #1511659 reported by Stuart Bishop on 2015-10-30
This bug affects 11 people

Affects         Importance  Assigned to
juju            High        Dave Cheney
juju-core       High        Dave Cheney
juju-core 1.25  High        Dave Cheney
Bug Description

I've been trying to track down a test failure in lp:~stub/charms/trusty/postgresql/rewrite, where the failover tests consistently fail on the Ecosystem Team's Jenkins (including the local provider) but always pass locally using the local provider.

I believe I have narrowed it down to Juju not promoting a surviving unit to leader after the leader is destroyed.

In the attached logs, the failing test starts around 2015-10-29 04:08:10.

First, a new unit is added to the PostgreSQL service. This works just fine.

At 2015-10-29 04:10:12, the leader (postgresql/0) is destroyed. This kicks off many, many hooks, starting with leader-settings-changed on postgresql/0 (which still thinks it is the leader) and replication-relation-departed hooks on postgresql/1 and postgresql/2.

By 2015-10-29 04:10:36 things are winding down, and you can see the final replication-relation-departed hook running on postgresql/0, which still thinks it is the leader and is happily able to change leadership settings.

At 2015-10-29 04:10:42, the stop hook is run on postgresql/0, successfully.

No further hooks run for the next 6 minutes. After this, the test suite gives up and things are torn down for the next set of tests.
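
For reference on what the failover test is waiting for, here is a minimal polling sketch of the check that never succeeds in this run; the unit names and the timeout are assumptions, not the harness's actual code:

#!/bin/bash
# Poll the surviving units until one reports leadership, giving up
# after ~6 minutes (the window in which no further hooks ran here).
for attempt in $(seq 1 36); do
    for unit in postgresql/1 postgresql/2; do
        if [ "$(juju run --unit=$unit is-leader)" = "True" ]; then
            echo "$unit is now the leader"
            exit 0
        fi
    done
    sleep 10
done
echo "no leader elected" >&2
exit 1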

Stuart Bishop (stub) wrote :

The attached log came from the lxc run at http://reports.vapour.ws/charm-test-details/charm-bundle-test-parent-3201

I believe the same failure is happening with all the providers (same test failure - service gives up waiting for a master to appear), but have not trawled through their logs to confirm.

Stuart Bishop (stub) wrote :

At 2015-10-29 03:59:44 you can see the leader-elected hook running when the service is initially set up, which confirms that it is wired up correctly and can be executed.
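
A quick way to verify this in the attached log is to grep for the hook name; the file name here is a guess at the attachment:

grep 'leader-elected' all-machines.log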

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Stuart Bishop (stub) wrote :

I also have a staging Cassandra environment in this state (OpenStack controller, dse nodes manually provisioned, older version of Juju). I had two nodes in the service, and dropped one due to hardware failure. Now I only have dse/1 left, but it is not the leader:

$ juju run --unit=dse/1 is-leader
False
$ juju --version
1.24.4-trusty-amd64
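
To check every unit of the service in one go, the same pattern used in later comments works here too:

juju run --service dse is-leader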

David Ames (thedac) wrote :

I have also seen this with rabbitmq-server.

Even after adding a new node, no election takes place. I left this alone for over 12 hours (in case leadership elections happen on a longer timescale), and still no election took place.

I have observed this with wily juju-core 1.24.6-0ubuntu3 and trusty juju-core 1.25.0-0ubuntu1~14.04.1~juju1.

Note this issue is intermittent. Occasionally a new leader is elected, but often one is not.

Process to recreate:
1. Deploy 3 nodes of a charm with a peer relationship
2. Determine the leader node (a scripted way to do this is sketched below)
3. Destroy the leader node
4. Check for a leader node
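
To script step 2, the YAML that juju run emits can be filtered for the unit reporting True; this is a hypothetical helper (the service name is an example), matching the output layout shown below:

# Print the unit that currently reports is-leader == True.
juju run --service rabbitmq-server is-leader | \
    awk '/^    True/ {found=1} found && /UnitId:/ {print $2; found=0}'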

Example with rabbitmq-server:
Juju status: http://pastebin.ubuntu.com/13093778/

ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "4"
  Stdout: |
    True
  UnitId: rabbitmq-server/0
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2

# rabbitmq-server/0 is the leader. Destroy it.
ubuntu@thedac-bastion:~/rabbitmq-server$ juju destroy-unit rabbitmq-server/0
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
# No leader exists

ubuntu@thedac-bastion:~/rabbitmq-server$ juju add-unit rabbitmq-server
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
# No leader exists

ubuntu@thedac-bastion:~/rabbitmq-server$ juju add-unit rabbitmq-server
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
- MachineId: "7"
  Stdout: |
    False
  UnitId: rabbitmq-server/3
# No leader exists

Changed in juju-core:
milestone: none → 1.26.0
Adam Collard (adam-collard) wrote :

I think the logging request in https://bugs.launchpad.net/juju-core/+bug/1488166/comments/4 is still relevant
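
For anyone collecting those logs, a more targeted configuration than <root>=DEBUG might look like this; the logger names are assumptions about juju's module hierarchy, not taken from the linked comment:

# Logger names are assumptions; fall back to <root>=DEBUG if they don't match.
juju set-env logging-config="juju.worker.leadership=TRACE;juju.worker.uniter=DEBUG"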

tags: added: bug-squad leadership
Changed in juju-core:
milestone: 1.26.0 → 2.0-alpha2
Edward Hope-Morley (hopem) wrote :

I too have seen this with the latest Juju i.e. 1.25.0-0ubuntu1~14.04.1~juju1. I deployed 3 units of the swift-proxy charm with the openstack provider, powered off the leader unit and waited for a re-election to occur. Even after 15 minutes a new leader had not been elected (so there was no leader).

tags: added: sts
Edward Hope-Morley (hopem) wrote :

As a follow-on to my previous comment, here is what seems to be a fairly reliable reproducer:

bzr branch https://code.launchpad.net/~openstack-charm-testers/+junk/swift-rings swift-test
cd swift-test
juju-deployer -c swift-next -d trusty-liberty

[wait for deployment to complete]

# work out who the leader is
juju run --service swift-proxy "is-leader"
...

# then poweroff the leader unit (assuming /0 here)
juju ssh swift-proxy/0 sudo poweroff

Now wait at least 60 seconds then re-run the is-leader check. When I do this, even after 15 minutes, I see that neither of the remaining units is now the leader.
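
To keep re-running that check hands-free, something like this works (a convenience wrapper, not part of the original reproducer):

# Re-run the leadership check every 60 seconds:
watch -n 60 'juju run --service swift-proxy is-leader'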

Edward Hope-Morley (hopem) wrote :

oh and enable debug logs prior to doing the test by doing:

juju set-env logging-config="<root>=DEBUG;"
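
With debug logging enabled, the relevant lines can be captured while the test runs; the grep filter is just one suggestion:

# Stream the consolidated environment logs, keeping leadership-related lines:
juju debug-log | grep -i leader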

Ursula Junque (ursinha) wrote :

I can reproduce the issue consistently. Once I poweroff the leader, the other unit just won't show up as leader, even long after 60s.
Juju 1.25.0 in wily.

William Reade (fwereade) on 2016-01-12
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
Nate Finch (natefinch) wrote :

I've done some testing with rabbitmq-server, and I keep getting errors from the hooks when I kill a unit. One time I got 'hook failed: "leader-elected"' and one time I got 'hook failed: "cluster-relation-changed"' ... so I'm not super confident about using rabbitmq as a test bed, if it can't even handle killing a unit.

I'll try it with another charm.

Changed in juju-core:
assignee: William Reade (fwereade) → Dave Cheney (dave-cheney)
Changed in juju-core:
status: Triaged → In Progress
Dave Cheney (dave-cheney) wrote :

Fix committed to master, will retest today and propose backport to 1.25 if successful

Changed in juju-core:
status: In Progress → Fix Committed
Dave Cheney (dave-cheney) wrote :

Due to various bugs it took all day to replicate the juju-deployer environment above, but I was able to reproduce the scenario and confirm that with this fix applied, a new leader is elected.

I will work on backporting this fix to 1.25.

Curtis Hovey (sinzui) on 2016-02-11
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2
Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
importance: Undecided → High
status: New → Fix Released