Multiple units told they are leaders

Bug #1723184 reported by Jacek Nykis
This bug affects 2 people
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Medium      Joseph Phillips
2.4             Fix Released  Medium      Joseph Phillips
juju-core       Won't Fix     Undecided   Unassigned

Bug Description

In my environment I had 5 out of 8 units end up with hook errors.

On closer inspection I noticed that all failed units wrongly thought they were leaders and were trying to "leader-set" which was failing.

Once I ran "juju resolved" on all affected units, they worked fine.

I could not find anything interesting in the logs.

This is the second bug I have noticed recently that affects leadership election, so they may be related. The other is bug 1721159.

juju 1.25.13
Ubuntu 14.04.5 LTS

Tags: canonical-is
Junien F (axino) wrote :

I have seen kind of the same thing just today, on a fresh 2.2.4 (<24h) model :

2017-10-12 19:28:26 INFO juju-log leader-elected fired. This unit is the new leader: foo/3
2017-10-12 19:28:26 DEBUG leader-elected ERROR cannot write leadership settings: cannot write settings: not the leader
[...]
2017-10-12 19:28:26 DEBUG leader-elected subprocess.CalledProcessError: Command '['leader-set', 'leader_id=foo/3']' returned non-zero exit status 1
2017-10-12 19:28:26 ERROR juju.worker.uniter.operation runhook.go:107 hook "leader-elected" failed: exit status 1

tags: added: canonical-is
description: updated
John A Meinel (jameinel) wrote :

Can you describe the charm where you're seeing this? And are there steps to reproduce?

Junien F (axino) wrote :

It's our good friend ubuntu-repository-cache, same as in the linked bug. It's one of the rare (only?) charms that actively use leadership / have leader-* hooks, which could explain why we're seeing leadership problems only with it.

No steps to reproduce as far as I'm aware, unfortunately.

I guess I can resolve the hook error and enable TRACE debug on that env, and wait for a repro ?

John A Meinel (jameinel) wrote :

We are unlikely to fix this in 1.25 unless there is a clear reproduction case.

Changed in juju-core:
status: New → Won't Fix
John A Meinel (jameinel) wrote :

If we can reproduce this, then it is very likely worth fixing. In the 2.2.4 case especially, it looks like we are actively electing a leader and it is getting demoted within the same second.
I wonder if the problem is actually that it has not been fully promoted? (Could we have thought we requested the txn, but it had not been fully committed before the next step?)

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Jacek Nykis (jacekn) wrote :

We've just hit this in a production environment running juju 2.2.6

We ended up with a non-leader unit stuck with a failed "leader-elected" hook. The hook was trying to "leader-set", which returned an error. We recovered by running "exit 0" inside debug-hooks.

So maybe the solution is to cancel "leader-elected" hooks on non-leaders?

Haw Loeung (hloeung) wrote :

Seeing this as well with a 2.2.6 environment.

| ubuntu@ip-172-30-18-69:~$ sudo juju-run ubuntu-repository-cache/5 'is-leader'
| False

| ubuntu@ip-172-30-18-69:~$ sudo juju-run ubuntu-repository-cache/5 'leader-get'
| leader_id: ubuntu-repository-cache/4

At some point, this unit must have been elected as the leader and now it keeps trying to run the leader-elected hook.

John A Meinel (jameinel) wrote :

This sounds like we got into a bad state where a hook failed, and our logic
for automatically retrying hooks that are in an error state is not a good
fit for leader-elected.

One option on the charm side is to just return if is_leader returns false
instead of failing. That would avoid having the hook go into an error state
and triggering Juju to retry the hook.

John
=:->
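
The charm-side workaround suggested above could look roughly like the following. This is only a sketch, not the actual ubuntu-repository-cache code: the function names `is_leader` and `on_leader_elected` and the injectable `run`/`call` parameters are assumptions made for illustration. It shells out to the `is-leader` and `leader-set` hook tools seen earlier in this report, and returns quietly when the unit is not the leader instead of letting `leader-set` fail the hook.

```python
import json
import subprocess


def is_leader(run=subprocess.check_output):
    """Ask the `is-leader` hook tool whether this unit is currently the leader."""
    out = run(["is-leader", "--format=json"])
    return json.loads(out.decode("utf-8")) is True


def on_leader_elected(unit_name, run=subprocess.check_output,
                      call=subprocess.check_call):
    """Hypothetical leader-elected handler that skips work on non-leaders.

    Returns True if leader settings were written, False if the unit was
    not the leader and the hook exited early without error.
    """
    if not is_leader(run):
        # Leadership was lost (or never fully gained): exit cleanly so the
        # hook does not enter an error state and get retried.
        return False
    call(["leader-set", "leader_id={}".format(unit_name)])
    return True
```

Leadership can still change between the check and the `leader-set` call, so this narrows the window rather than closing it.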

Junien F (axino) wrote :

This is still happening with 2.4.2 by the way :

2018-08-29 09:13:59 INFO juju-log leader-elected fired. This unit is the new leader: ubuntu-repository-cache/2
2018-08-29 09:13:59 DEBUG leader-elected ERROR cannot write leadership settings: cannot write settings: not the leader

Tim Penhey (thumper) wrote :

Possibly a timing issue where the leadership changed before the unit was able to run the leader-elected hook.

Juju should ensure that the unit is still the leader before running the leader-elected hook to handle this case.
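
As a conceptual illustration of that check (not Juju's actual implementation, which is written in Go), the idea can be sketched as below. The function `run_pending_hook` and its `current_leader` and `run_hook` callbacks are hypothetical names invented for this example.

```python
def run_pending_hook(hook_name, unit, current_leader, run_hook):
    """Run a queued hook, dropping a stale leader-elected.

    current_leader: callable returning the name of the unit that holds
                    leadership right now (hypothetical lookup).
    run_hook:       callable that actually executes the named hook.
    """
    if hook_name == "leader-elected" and current_leader() != unit:
        # Leadership moved on after the hook was queued, so running it
        # would only make leader-set fail with "not the leader".
        return "skipped"
    run_hook(hook_name)
    return "ran"
```

With this shape, a unit such as ubuntu-repository-cache/5 that was queued for leader-elected while ubuntu-repository-cache/4 holds leadership would skip the hook instead of retrying it in an error state.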

Changed in juju:
milestone: none → 2.5-beta1
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
status: Triaged → In Progress
no longer affects: juju-core/2.0
Joseph Phillips (manadart) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Joseph Phillips (manadart) wrote :

The patch to address this included some logic that should have been removed.

It is addressed by https://github.com/juju/juju/pull/10301, and is in the edge versions for 2.5+.

no longer affects: juju/2.8