Destroyed leader, new leader not elected.

Bug #1511659 reported by Stuart Bishop
This bug affects 11 people
Affects          Status        Importance  Assigned to  Milestone
Canonical Juju   Fix Released  High        Dave Cheney
juju-core        Fix Released  High        Dave Cheney
juju-core 1.25   Fix Released  High        Dave Cheney

Bug Description

I've been trying to track down a test failure in lp:~stub/charms/trusty/postgresql/rewrite, where the failover tests consistently fail on the Ecosystem Team's Jenkins (even with the local provider) but always pass on my own machine using the local provider.

I believe I have narrowed it down to Juju not promoting a surviving unit to leader after the leader is destroyed.

In the attached logs, the failing test starts around 2015-10-29 04:08:10.

First, a new unit is added to the PostgreSQL service. This works just fine.

At 2015-10-29 04:10:12, the leader (postgresql/0) is destroyed. This kicks off many, many hooks, starting with leader-settings-changed on postgresql/0 (which still thinks it is the leader) and replication-relation-departed hooks on postgresql/1 and postgresql/2.

By 2015-10-29 04:10:36 things are winding down, and you can see the final replication-relation-departed hook running on postgresql/0, which still thinks it is the leader and is happily able to change leadership settings.

At 2015-10-29 04:10:42, the stop hook is run on postgresql/0, successfully.

No further hooks run for the next 6 minutes. After this, the test suite gives up and things are torn down for the next set of tests.

Revision history for this message
Stuart Bishop (stub) wrote :

The attached log came from the lxc run at http://reports.vapour.ws/charm-test-details/charm-bundle-test-parent-3201

I believe the same failure is happening with all the providers (same test failure: the service gives up waiting for a master to appear), but I have not trawled through their logs to confirm.

Revision history for this message
Stuart Bishop (stub) wrote :

At 2015-10-29 03:59:44 you can see the leader-elected hook running when the service is initially set up, which confirms that it is wired up correctly and can be executed.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Stuart Bishop (stub) wrote :

I also have a staging Cassandra environment in this state (OpenStack controller, dse nodes manually provisioned, older version of Juju). I had two nodes in the service, and dropped one due to hardware failure. Now I only have dse/1 left, but it is not the leader:

$ juju run --unit=dse/1 is-leader
False
$ juju --version
1.24.4-trusty-amd64

Revision history for this message
David Ames (thedac) wrote :

I have also seen this with rabbitmq-server.

Even after adding a new node, no election takes place. I left it for over 12 hours (in case leadership elections happen on a longer timescale), and still no election took place.

I have observed this with wily juju-core 1.24.6-0ubuntu3 and trusty juju-core 1.25.0-0ubuntu1~14.04.1~juju1.

Note this issue is intermittent. Occasionally a new leader is elected, but often one is not.

Process to recreate:
Deploy 3 nodes of a charm with peer relationship
Determine the leader node
Destroy the leader node
Check for a leader node

Example with rabbitmq-server:
Juju status: http://pastebin.ubuntu.com/13093778/

ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "4"
  Stdout: |
    True
  UnitId: rabbitmq-server/0
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2

# rabbitmq-server/0 is the leader. Destroy it.
ubuntu@thedac-bastion:~/rabbitmq-server$ juju destroy-unit rabbitmq-server/0
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
# No leader exists

ubuntu@thedac-bastion:~/rabbitmq-server$ juju add-unit rabbitmq-server
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
# No leader exists

ubuntu@thedac-bastion:~/rabbitmq-server$ juju add-unit rabbitmq-server
ubuntu@thedac-bastion:~/rabbitmq-server$ juju run --service rabbitmq-server is-leader
- MachineId: "5"
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: "6"
  Stdout: |
    False
  UnitId: rabbitmq-server/2
- MachineId: "7"
  Stdout: |
    False
  UnitId: rabbitmq-server/3
# No leader exists
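
For scripting the check above, here is a minimal sketch that extracts the leader's UnitId from the YAML that juju run prints. This is a hypothetical convenience script, not part of any charm, and it assumes PyYAML is available where the juju client runs:

juju run --service rabbitmq-server is-leader | python -c '
import sys, yaml
# Each list entry has MachineId, Stdout and UnitId keys, matching the
# output pasted above; Stdout is a block scalar, so strip its newline.
for entry in yaml.safe_load(sys.stdin):
    if entry["Stdout"].strip() == "True":
        print(entry["UnitId"])
'

If it prints nothing, no unit currently claims leadership, which is exactly the broken state shown above.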

Changed in juju-core:
milestone: none → 1.26.0
Revision history for this message
Adam Collard (adam-collard) wrote :

I think the logging request in https://bugs.launchpad.net/juju-core/+bug/1488166/comments/4 is still relevant

tags: added: bug-squad leadership
Changed in juju-core:
milestone: 1.26.0 → 2.0-alpha2
Revision history for this message
Edward Hope-Morley (hopem) wrote :

I too have seen this with the latest Juju, i.e. 1.25.0-0ubuntu1~14.04.1~juju1. I deployed 3 units of the swift-proxy charm with the OpenStack provider, powered off the leader unit, and waited for a re-election to occur. Even after 15 minutes, a new leader had not been elected (so there was no leader).

tags: added: sts
Revision history for this message
Edward Hope-Morley (hopem) wrote :

As a follow-on to my previous comment, here is what seems to be a fairly reliable reproducer:

bzr branch https://code.launchpad.net/~openstack-charm-testers/+junk/swift-rings swift-test
cd swift-test
juju-deployer -c swift-next -d trusty-liberty

[wait for deployment to complete]

# work out who the leader is
juju run --service swift-proxy "is-leader"
...

# then poweroff the leader unit (assuming /0 here)
juju ssh swift-proxy/0 sudo poweroff

Now wait at least 60 seconds, then re-run the is-leader check. When I do this, even after 15 minutes, I see that neither of the remaining units is the leader.
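
Here is a minimal sketch of that wait-and-check step (my own helper, assuming the juju 1.x CLI used above; grep -q True suffices because "True" appears only in the is-leader output of the leader):

# Poll for up to 15 minutes; stop early if a surviving unit becomes leader.
for i in $(seq 90); do
    if juju run --service swift-proxy "is-leader" | grep -q True; then
        echo "new leader elected after roughly $((i * 10)) seconds"
        break
    fi
    sleep 10
done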

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Oh, and enable debug logs prior to running the test:

juju set-env logging-config="<root>=DEBUG;"

Revision history for this message
Ursula Junque (ursinha) wrote :

I can reproduce the issue consistently. Once I power off the leader, the other unit just won't show up as leader, even long after 60s.
Juju 1.25.0 in wily.

William Reade (fwereade)
Changed in juju-core:
assignee: nobody → William Reade (fwereade)
Revision history for this message
Nate Finch (natefinch) wrote :

I've done some testing with rabbitmq-server, and I keep getting errors from the hooks when I kill a unit. One time I got 'hook failed: "leader-elected"' and one time I got 'hook failed: "cluster-relation-changed"' ... so I'm not super confident about using rabbitmq as a test bed if it can't even handle killing a unit.

I'll try it with another charm.

Changed in juju-core:
assignee: William Reade (fwereade) → Dave Cheney (dave-cheney)
Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Dave Cheney (dave-cheney) wrote :

Fix committed to master; I will retest today and propose a backport to 1.25 if successful.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Dave Cheney (dave-cheney) wrote :

Due to various bugs, it took all day to replicate the juju-deployer environment above, but I was able to reproduce the scenario and confirm that, with this fix applied, a new leader is elected.

I will work on backporting this fix to 1.25.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2
Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
importance: Undecided → High
status: New → Fix Released