Multiple units told they are leaders

Bug #1723184 reported by Jacek Nykis on 2017-10-12
This bug affects 1 person
Affects: juju | Importance: Medium | Assigned to: Unassigned
Affects: juju-core | Importance: Undecided | Assigned to: Unassigned

Bug Description

In my environment I had 5 out of 8 units end up with hook errors.

On closer inspection I noticed that all failed units wrongly thought they were leaders and were trying to "leader-set" which was failing.

Once I ran "juju resolved" on all affected units, they worked fine.

I could not find anything interesting in the logs.

This is the second bug I have noticed recently that affects leadership election, so the two may be related. The other is bug 1721159.

juju 1.25.13
Ubuntu 14.04.5 LTS

Junien Fridrick (axino) wrote :

I saw much the same thing today on a fresh (<24h old) 2.2.4 model:

2017-10-12 19:28:26 INFO juju-log leader-elected fired. This unit is the new leader: foo/3
2017-10-12 19:28:26 DEBUG leader-elected ERROR cannot write leadership settings: cannot write settings: not the leader
[...]
2017-10-12 19:28:26 DEBUG leader-elected subprocess.CalledProcessError: Command '['leader-set', 'leader_id=foo/3']' returned non-zero exit status 1
2017-10-12 19:28:26 ERROR juju.worker.uniter.operation runhook.go:107 hook "leader-elected" failed: exit status 1

tags: added: canonical-is
description: updated
John A Meinel (jameinel) wrote :

Can you describe the charm where you're seeing this, and whether there are steps to reproduce?

Junien Fridrick (axino) wrote :

It's our good friend ubuntu-repository-cache, same as in the linked bug. It's one of the rare (only?) charms that actively uses leadership / has leader-* hooks, which could explain why we're seeing leadership problems only with it.

No steps to reproduce as far as I'm aware, unfortunately.

I guess I can resolve the hook error and enable TRACE debug on that env, and wait for a repro ?

John A Meinel (jameinel) wrote :

We are unlikely to fix this in 1.25 unless there is a clear reproduction case.

Changed in juju-core:
status: New → Won't Fix
John A Meinel (jameinel) wrote :

If we can reproduce this, then it is very likely worth fixing. In the 2.2.4 case especially, it looks like we are actively electing a leader and it is getting demoted within the same second.
I wonder if the problem is actually that it has not been fully promoted? (Could we have thought we requested the txn, but it had not been fully committed before the next step?)

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Jacek Nykis (jacekn) wrote :

We've just hit this in a production environment running juju 2.2.6.

We ended up with a non-leader unit stuck with a failed "leader-elected" hook. The hook was trying to "leader-set", which returned an error. We recovered by running "exit 0" inside debug-hooks.

So maybe the solution is to cancel "leader-elected" hooks on non-leaders?

Haw Loeung (hloeung) wrote :

Seeing this as well with a 2.2.6 environment.

| ubuntu@ip-172-30-18-69:~$ sudo juju-run ubuntu-repository-cache/5 'is-leader'
| False

| ubuntu@ip-172-30-18-69:~$ sudo juju-run ubuntu-repository-cache/5 'leader-get'
| leader_id: ubuntu-repository-cache/4

At some point, this unit must have been elected as the leader and now it keeps trying to run the leader-elected hook.
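
The two checks above can be scripted when several units need inspecting. This is only a rough sketch: it assumes `sudo juju-run` prints output in the same shape as shown above, that the charm stores a `leader_id` key in its leader settings, and the function names are made up for illustration.

```python
import subprocess


def leadership_view(unit, run=subprocess.check_output):
    """Collect what one unit reports about leadership via juju-run."""
    out = run(["sudo", "juju-run", unit, "is-leader"]).decode()
    is_leader = out.strip() == "True"

    leader_id = None
    settings = run(["sudo", "juju-run", unit, "leader-get"]).decode()
    for line in settings.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "leader_id":
            leader_id = value.strip()
    return {"unit": unit, "is_leader": is_leader, "leader_id": leader_id}


def agrees_with_settings(view):
    """True when is-leader matches whether leader_id names this unit."""
    return view["is_leader"] == (view["leader_id"] == view["unit"])
```

For the unit above, the view is consistent (not leader, and leader_id names another unit), which suggests the unit itself knows it is not the leader and only the pending hook is stale.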

John A Meinel (jameinel) wrote :

This sounds like we got into a bad state: a hook failed, and our logic for automatically retrying hooks that are in an error state is not a good fit for leader-elected.

One option on the charm side is to just return if is_leader returns false instead of failing. That would keep the hook from going into an error state and triggering juju to retry it.

John
=:->
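
A minimal sketch of that charm-side guard, assuming the hook is written in Python and shells out to the hook tools as the traceback earlier in this thread suggests. The function names and the `run` parameter are hypothetical (the latter is there only so the logic can be exercised outside a hook context); this is not the actual ubuntu-repository-cache code.

```python
import json
import subprocess


def is_leader(run=subprocess.check_output):
    """Ask the is-leader hook tool whether this unit holds leadership."""
    return json.loads(run(["is-leader", "--format=json"])) is True


def leader_elected(unit_name, run=subprocess.check_output):
    """Hypothetical leader-elected handler that exits cleanly on a non-leader.

    Returns True if leader settings were written, False if the unit skipped.
    """
    if not is_leader(run):
        # Not the leader (any more): do nothing and let the hook succeed,
        # so it is not left in an error state for juju to retry.
        return False
    try:
        run(["leader-set", "leader_id={}".format(unit_name)])
    except subprocess.CalledProcessError:
        # Leadership may have changed between the check and the write.
        # Swallow the failure only if the unit is no longer the leader.
        if is_leader(run):
            raise
        return False
    return True
```

The second is-leader check covers the case seen in the 2.2.4 log, where leader-set failed with "not the leader" within the same second as the election.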

