pxc cluster build failed due to leadership change in early unit lifecycle

Bug #1728111 reported by Jason Hobbs
This bug affects 5 people
Affects                          Status   Importance  Assigned to  Milestone
Canonical Juju                   Triaged  Low         Unassigned
Charm Helpers                    New      Undecided   Unassigned
OpenStack Percona Cluster Charm  Triaged  Low         Unassigned

Bug Description

The mysql/0 unit in my deployment failed a cluster-relation-changed hook and entered error state.

Here's the error:
http://paste.ubuntu.com/25831519/

Indeed, there is no mysql entry in /etc/passwd.

I've attached full logs from the run.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
James Page (james-page) wrote :

Looking at the mysql log data:

./12/lxd/6/var/log/juju/unit-mysql-2.log

2017-10-27 16:26:09 INFO juju.worker.uniter resolver.go:104 found queued "install" hook

2017-10-27 16:42:24 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:42:24 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:42:25 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:04 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:04 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"

./0/lxd/6/var/log/juju/unit-mysql-0.log

2017-10-27 16:25:56 INFO juju.worker.uniter resolver.go:104 found queued "install" hook

2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:35:30 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:36:50 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:36:50 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"
2017-10-27 16:43:57 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-elected hook
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run leader-elected hook"
2017-10-27 16:43:57 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running leader-elected hook
2017-10-27 16:43:59 INFO juju-log Unknown hook leader-elected - skipping.
2017-10-27 16:44:58 INFO juju.worker.uniter.operation runhook.go:113 ran "leader-elected" hook
2017-10-27 16:44:58 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run leader-elected hook"

pxc is only installed once the lead unit has actually set the cluster root and sst passwords into leader storage; it would appear that at the time of install, none of the units was the leader, so the data was never seeded into leader storage.

Revision history for this message
James Page (james-page) wrote :

Something wonky went on during early unit lifecycle:

2017-10-27 16:23:08 DEBUG juju.worker.dependency engine.go:504 "leadership-tracker" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
2017-10-27 16:23:08 DEBUG juju.worker.dependency engine.go:504 "leadership-tracker" manifold worker stopped: <nil>
2017-10-27 16:23:08 DEBUG juju.worker.dependency engine.go:504 "leadership-tracker" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
2017-10-27 16:23:14 DEBUG juju.worker.dependency engine.go:504 "leadership-tracker" manifold worker stopped: "migration-inactive-flag" not running: dependency not available
2017-10-27 16:23:14 DEBUG juju.worker.dependency engine.go:486 "leadership-tracker" manifold worker started
2017-10-27 16:23:14 DEBUG juju.worker.leadership tracker.go:126 mysql/2 making initial claim for mysql leadership
2017-10-27 16:24:52 INFO juju.worker.leadership tracker.go:185 mysql/2 promoted to leadership of mysql
2017-10-27 16:26:09 DEBUG juju.worker.uniter.remotestate watcher.go:354 got leader settings change: ok=true
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker.leadership tracker.go:208 mysql leadership for mysql/2 denied
2017-10-27 16:36:10 DEBUG juju.worker.leadership tracker.go:230 notifying mysql/2 ticket of impending loss of mysql leadership
2017-10-27 16:36:10 DEBUG juju.worker.leadership tracker.go:269 mysql/2 is not mysql leader
2017-10-27 16:36:10 DEBUG juju.worker.leadership tracker.go:215 mysql/2 waiting for mysql leadership release
2017-10-27 16:36:10 DEBUG juju.worker.uniter.remotestate watcher.go:394 got leadership change: minion
2017-10-27 16:36:10 DEBUG install ERROR cannot write leadership settings: cannot write settings: not the leader
2017-10-27 16:36:10 DEBUG install leader_set({key: _password})
2017-10-27 16:36:10 DEBUG install File "/var/lib/juju/agents/unit-mysql-2/charm/hooks/charmhelpers/core/hookenv.py", line 946, in leader_set
2017-10-27 16:36:10 DEBUG install subprocess.CalledProcessError: Command '['leader-set', 'root-password=6fwpXrzGGkb5gYqjmPk2qjxTSgLmbcR722Nwf4s2']' returned non-zero exit status 1

Revision history for this message
James Page (james-page) wrote :

and at the point where mysql/2 tried to write to leader storage:

2017-10-27 16:36:10 INFO juju.worker.leadership tracker.go:208 mysql leadership for mysql/2 denied

summary: - cluster-relation-changed KeyError: 'getpwnam(): name not found: mysql'
+ pxc cluster build failed due to leadership change in early unit
+ lifecycle
Revision history for this message
James Page (james-page) wrote :

tl;dr: leadership changed during the seeding of the passwords (i.e. between a call to is-leader and leader-set), which the charm does not currently handle, so the cluster never bootstrapped.

I'm guessing this is not that easy to reproduce, but at least the cause is visible from the log data provided; the logs from the controller might tell us more about why leadership changed.

Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
James Page (james-page) wrote :

Adding a bug task for juju; this is a pretty small code block to have leadership switch between two lines:

    _password = leader_get(key)
    if not _password and is_leader():
        _password = config(key) or pwgen()
        leader_set({key: _password})
    return _password

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I think the only way to really control for this error is to wrap every call to leader_set(...) in a try: ... except:, as leadership can change during hook execution. i.e. even if is_leader() -> True, it's still possible for a later leader_set(...) call to fail. It's better to catch that failure, undo any 'leader' things the hook was doing, exit the hook, and let the new leader unit perform the leadership actions instead.

In other words, unless Juju can provide a guarantee that leadership won't change during hook execution, charms are going to have to back out of a leader_set(...) failure gracefully.
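
A minimal sketch of that pattern, assuming the charmhelpers hookenv API plus its host.pwgen helper (leader_set shells out to the leader-set tool, which is what raised the CalledProcessError in the traceback above); a real charm would also undo any partially completed leader-only work before returning:

    import subprocess

    from charmhelpers.core.hookenv import (
        config,
        is_leader,
        leader_get,
        leader_set,
        log,
    )
    from charmhelpers.core.host import pwgen


    def get_or_seed_password(key):
        """Return the shared password for `key`, seeding it if we are the leader.

        Leadership may be lost between is_leader() and leader_set(), so the
        leader_set() call is treated as fallible: on failure we back out and
        let the new leader seed the value from its own hooks instead.
        """
        _password = leader_get(key)
        if not _password and is_leader():
            _password = config(key) or pwgen()
            try:
                leader_set({key: _password})
            except subprocess.CalledProcessError:
                # Deposed mid-hook; don't use a value that never made it
                # into leader storage.
                log('Lost leadership while seeding {}; deferring to the new '
                    'leader'.format(key), level='WARNING')
                return None
        return _password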

Revision history for this message
Tim Penhey (thumper) wrote :

Juju needs to confirm whether or not we have leadership bouncing between units.

Under "normal" circumstances, where normal means that we have continued network connectivity, once a unit is a leader, it should stay as leader until the API connection is dropped.

There have been reports before of leadership bouncing between units, and this is something we need to investigate. It is possible that clock skew could have been an issue before, but this is where the recent work has gone in to mitigate that problem.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.3.0
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1728111] Re: pxc cluster build failed due to leadership change in early unit lifecycle

It would be good to know from the logs how long *we* think it took for those two lines to execute. On a heavily loaded system I think we've seen things spike as high as 45s for a query to execute, which chews up most of the lease time. Also whether there was something like a controller restart, etc.

IIRC is_leader doesn't do an immediate refresh but just checks the current
status. It might make it more reliable if we just force a refresh at that
point.

John
=:->


Revision history for this message
John A Meinel (jameinel) wrote :

(This is speculation while on a walk, not while reading through the code)

Thinking it through... If is_leader isn't refreshing and we're only doing our async "every 30s extend the lease by 1min" loop, then if something happened to that async loop you could see a case where is_leader returns true but we are failing to actually extend the lease.

Even more so if we are only looking at the agent's local state when answering is_leader. If there is clock skew happening, what happens if we get the leadership token and our clock jumps backward by 1 min? It seems possible that locally we think we're the leader but don't try to refresh the token because our time isn't up yet.

Auditing the code to make sure we're using durations and time.Since rather than absolute times/deadlines would allow the monotonic timer of Go 1.9 to help out.
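
To illustrate the duration-versus-deadline point in general terms (a generic sketch, not Juju code): an absolute wall-clock deadline keeps looking valid after a backwards clock jump, while a monotonic-duration check does not:

    import time

    LEASE_SECONDS = 60

    # Fragile: an absolute wall-clock deadline. A backwards clock jump after
    # the lease is taken makes this report "still valid" for longer than the
    # real lease held on the controller.
    deadline = time.time() + LEASE_SECONDS


    def lease_valid_wallclock():
        return time.time() < deadline


    # Robust: measure elapsed time on the monotonic clock, which ignores
    # wall-clock adjustments (the same idea as using time.Since with Go 1.9's
    # monotonic timer).
    started = time.monotonic()


    def lease_valid_monotonic():
        return time.monotonic() - started < LEASE_SECONDS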

We also need to make sure we're confident we're not doing something wrong
when time is perfectly stable.

John
=:->


Revision history for this message
Andrew Wilkins (axwalk) wrote :

"is-leader" does refresh. You can see the details here: https://github.com/juju/juju/blob/develop/worker/uniter/runner/context/leader.go#L54.

If the clock was jumping on the controller, then this could be explained. I've looked over the worker/lease and worker/leadership code, and it should now be sound when compiled with Go 1.9+ (which we now do), from Juju 2.3-beta2+ (new lease manager code).

Revision history for this message
John A Meinel (jameinel) wrote :

So digging through the code we call:

    func (ctx *leadershipContext) ensureLeader() error {
        ...
        success := ctx.tracker.ClaimLeader().Wait()

which submits a claim ticket and waits for it to respond; claim tickets are handled here:

    if err := t.resolveClaim(ticketCh); err != nil {

resolveClaim calls:

    if leader, err := t.isLeader(); err != nil {

which then:

    func (t *Tracker) isLeader() (bool, error) {
        if !t.isMinion {
            // Last time we looked, we were leader.
            select {
            case <-t.tomb.Dying():
                return false, errors.Trace(tomb.ErrDying)
            case <-t.renewLease:
                logger.Tracef("%s renewing lease for %s leadership", t.unitName, t.applicationName)
                t.renewLease = nil
                if err := t.refresh(); err != nil {
                    return false, errors.Trace(err)
                }
            default:
                logger.Tracef("%s still has %s leadership", t.unitName, t.applicationName)
            }
        }
        return !t.isMinion, nil
    }

*that* looks to me like we only renew the lease if we are currently pending a renewal (so on a 1min lease we only renew on IsLeader if we're past the 30s mark). Otherwise the default: ("still leader") branch triggers and we just return true.

So if the timing was:
 0s: renew leadership for 60s
 25s: call IsLeader (no actual refresh)
then there doesn't appear to be any database activity after isLeader returns true.

All that refreshing would do is increase the window, which we could
probably do in a different way (just increase the lease time).

The other curious bit is the timing from the log:

2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-get"
2017-10-27 16:28:13 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"
2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"

That is a full 2m35s from the time we see "is-leader" being called before "leader-set" is then called.

Given the comment here:

    _password = leader_get(key)
    if not _password and is_leader():
        _password = config(key) or pwgen()
        leader_set({key: _password})
    return _password

Is pwgen() actually quite slow on a heavily loaded machine? Is it grabbing lots of entropy/reading from /dev/random rather than /dev/urandom and getting blocked?
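
If that is what is happening, a generator built on the kernel's non-blocking CSPRNG would avoid the stall. A sketch only (this is not the charm's actual pwgen; charmhelpers ships its own implementation):

    import secrets
    import string


    def pwgen(length=32):
        """Generate a random password without risking a block on /dev/random.

        secrets draws from os.urandom(), i.e. the non-blocking kernel CSPRNG,
        so it will not stall on an entropy-starved or freshly booted machine.
        """
        alphabet = string.ascii_letters + string.digits
        return ''.join(secrets.choice(alphabet) for _ in range(length))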

So 2m35s is quite a long time. But also note that other things are surprisingly slow:

2017-10-27 16:30:48 DEBUG worker.uniter.jujuc server.go:178 running hook tool "leader-set"
2017-10-27 16:36:10 INFO juju.worker.leadership tracker.go:208 mysql leadership for mysql/2 denied

Is it really taking us ~5 minutes to deal with the leader-set call? Or are these 2 separate calls we're dealing with?

I'm assuming mysql/2 is the one running in the "something wonky went on early".

We see that mysql/2 was set to be the leader at 16:24:

2017-10-27 16:23:14 DEBUG juju.worker.leadership tracker.go:126 mysql/2 making initial claim for mysql leadership
2017-10-27 16:24:52 INFO juju.worker.leadership tracker.go:185 mysql/2 promoted to leadership of mysql

At 16:36:10 mysql/2 is told it's no longer the leader, but 16:35:30 is where mysql/0 is told that it is now the leader:

2017-10-27 16:35:30 INFO juju.worker.uniter resolver.go:104 found queued "leader-elected" hook

I'm heading back to the raw logs now, but nearly 3min from a is-lea...


Revision history for this message
John A Meinel (jameinel) wrote :

Side note, we do potentially have a serious issue about responding to relation data and coordination of leadership. Our statement that we guarantee you will have no more than 1 leader at any given time doesn't work well with arbitrary hooks in response to relation data changes.
Here is an example timeline:

 0s mysql/0 => becomes the leader (goes unresponsive for a bit)
 20s rabbit/0 => joins the relation with mysql and sets data in the relation bucket that only the leader can handle
 35s mysql/1 sees rabbit's data but is not the leader
 35s mysql/2 sees rabbit's data but is not the leader
 60s mysql/0 demoted, mysql/1 is now the leader
 65s mysql/0 finally sees the relation data from rabbit but is no longer the leader

There is no guarantee that there will be a leader that sees relation change data.
The one backstop would be 'leader-elected', which could go through and re-evaluate if there is anything that the previous leader missed. (look at your existing relations, and see if there was something you didn't handle earlier because you weren't the leader, that the last leader also failed to handle).

All of the above is possible even with nothing wrong with our leader election process. All it takes is for the machine where the leader is currently running to be so busy with other hooks (colocated workloads) that it takes too long for what was the leader to actually respond to a relation.

I'd like us to figure out what charmers actually need in order to handle this case. Should there be an idea of "if I become the leader, this is what I would want to do" that gets set aside and presented again as context during leader-elected?
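
One hedged sketch of that leader-elected backstop, using the charmhelpers relation tools; 'shared-db' is just an example relation name, and handled_as_leader()/handle_leader_work() are hypothetical charm-specific helpers:

    from charmhelpers.core.hookenv import (
        is_leader,
        related_units,
        relation_get,
        relation_ids,
    )


    def leader_elected():
        """Backstop: re-check relation data the previous leader may never have handled."""
        if not is_leader():
            return
        for rid in relation_ids('shared-db'):
            for unit in related_units(rid):
                settings = relation_get(rid=rid, unit=unit)
                # Hypothetical helpers: decide whether a leader has already
                # processed this request, and process it now if not.
                if settings and not handled_as_leader(rid, unit, settings):
                    handle_leader_work(rid, unit, settings)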

Revision history for this message
John A Meinel (jameinel) wrote :

The logs show that leader-elected isn't implemented, which probably means that you can suffer from comment #13:
2017-10-27 16:35:31 INFO juju-log Unknown hook leader-elected - skipping.

I was discussing with Andrew, and one thing that we are thinking about this cycle is trying to introduce Application <=> Application relation data, rather than just having Unit <=> Application data.
In that context, it would be interesting to consider having a "relation-joined/changed" hook that is actually *guaranteed* to fire on the current leader, and if leadership changes and the hook has not exited successfully in the past, then the hook is triggered on the new leader.
The initial scope around Application data bags would not change the hook logic, so it wouldn't actually address this bug, but in the stuff we are calling "charms v2" and trying to change what hooks are fired, we could potentially address it there.

Potentially we could introduce a new hook more easily than deprecating all the existing hooks that we fire, which would allow you to have something like "application-relation-changed", or some other spelling. We would keep some notion of "what is the latest version of relation data that a leader has processed for all of its relations", always trigger a 'changed' hook whenever either the leader changes or the relation changes, and record that a leader has processed up to 'revno=X'.

Revision history for this message
John A Meinel (jameinel) wrote :

Looking at the charm: https://jujucharms.com/percona-cluster/
It does have a symlink of "leader-elected => percona_hooks.py"
but the Python code itself is hitting this line:
    try:
        hooks.execute(sys.argv)
    except UnregisteredHookError as e:
        log('Unknown hook {} - skipping.'.format(e))

So it's more a case that you're not actually responding when leader-elected really is fired.
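
For reference, a minimal sketch of what registering the hook might look like in percona_hooks.py, assuming the charmhelpers Hooks dispatcher the charm already uses; seed_cluster_passwords() is a hypothetical stand-in for whatever leader-only seeding the charm needs to repeat:

    import sys

    from charmhelpers.core.hookenv import (
        Hooks,
        UnregisteredHookError,
        is_leader,
        log,
    )

    hooks = Hooks()


    @hooks.hook('leader-elected')
    def leader_elected():
        """Re-run any leader-only setup the previous leader may not have finished."""
        if is_leader():
            # Hypothetical helper: seed the root/sst passwords into leader
            # storage if they are still missing so the bootstrap can proceed.
            seed_cluster_passwords()


    if __name__ == '__main__':
        try:
            hooks.execute(sys.argv)
        except UnregisteredHookError as e:
            log('Unknown hook {} - skipping.'.format(e))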

Revision history for this message
James Page (james-page) wrote :

I think the recommendation in #15 to implement the leader-elected hook, and deal with anything missing at that point in time, makes a lot of sense.

Revision history for this message
Tim Penhey (thumper) wrote :

I'm going to mark the Juju task invalid for now, then, based on John's comments above.

Changed in juju:
milestone: 2.3.0 → 2.3-rc1
status: Triaged → Invalid
milestone: 2.3-rc1 → none
assignee: Andrew Wilkins (axwalk) → nobody
Revision history for this message
James Page (james-page) wrote :

Setting Juju bug back to New; we can improve the charm but leader switching mid hook execution makes writing charms harder, so we should see if things can be improved.

Changed in juju:
status: Invalid → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Agree with James.

Changing leader whilst a hook is executing on the leader is not something we should expect charms and charmers to trap.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Can Juju also document the assurances made for leadership election, when/why it is determined to be changed, etc? This would be helpful documentation for charm authors to reference.

Revision history for this message
John A Meinel (jameinel) wrote :

It is taking you 2.5min to go from "is_leader" until we get to "leader_set". If it is taking that long, your system is under enough load that we apparently are unable to guarantee keep-alives. (We need a refresh of leadership, done by the unit agent every 30s, that extends the leadership for another 1 minute.) I don't know what exactly is causing it to take 2.5min, but if we can't get a network request through once a minute then we would allow leadership to lapse.
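
To make the arithmetic explicit (illustration only; the actual intervals live in Juju's lease manager, not in the charm): with renewals attempted every 30s and each successful renewal extending the lease by 60s, any stall longer than about a minute without a renewal lets the lease lapse:

    LEASE_DURATION = 60   # each successful renewal extends the lease by this many seconds
    RENEW_INTERVAL = 30   # the unit agent normally attempts a renewal this often


    def lease_survives(stall_seconds):
        """Worst case: the stall starts just after a renewal, so the next
        successful renewal only happens once the stall is over."""
        return stall_seconds < LEASE_DURATION


    print(lease_survives(45))    # True:  a 45s query spike is survivable
    print(lease_survives(150))   # False: a 2.5 minute stall lets leadership lapse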


Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Sorry, it's a long message but I've got meaningful stuff there (I think).

https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/2
https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/3

The behavior I encountered in a duplicate bug got me thinking about how to fix this problem at both Juju and charm levels (both will need modifications).

TL;DR:

Juju: revive "leader-deposed" hook work - actually run that hook instead of a no-op (see https://git.io/vF1Jn)

Charmers: Modify charms with service-level leadership (not only Juju-level) to use leader-deposed.

Juju: Document when is_leader no longer returns TRUE and think about leader transactions (where a leader executes code and cannot be deposed until it finishes execution or its process dies) or document operation interruption semantics (if any).

========

Topic 1.

Description:

For clarity, I will name 2 levels of leadership:

* level 1 (L1): Juju-level per-application unit leadership (a leader unit is an actor here);
* level 2 (L2): application-specific or service-specific leadership (a percona cluster process in this case, no explicit mapping from L2: L1 or L1: L2 leadership)

What happened (pad.lv/1732257)?

L1 leader got elected and started bootstrapping a cluster so L2 leader got created => L1 leader == L2 leader

L1 minions have not done <peer>-relation-joined yet => L1 leader cannot tell them that it is the L2 leader and there are no L2 minion processes yet => waits for more <peer>-relation-{joined, changed} events

L1-minion-0 got installed and joined a peer relation with the L1 leader but there are only 2/3 peers (min-cluster-size config option gating) => L2-minion-0 has NOT been set up yet (2/3, not clustered, not an L1 leader - no config rendering, no process running).

L1-leader got deposed, however, did not perform any action to depose L2 leader => **L1-minion-2**

L1-minion-1 became L1-leader and **started** bootstrapping a new cluster => L1 leader != L2 leader => 2 L2 leaders present!

L1-minion-0 started its service and spawned an L2 minion which got cluster state from L1-minion-2 (the old L1 and now contending L2 leader) ***before it got it from L1-leader*** => 2 leaders and 1 minion present - hit a RACE CONDITION on L2

L1-leader (new) set a new bootstrap_uuid leader bucket setting which is inconsistent with L2 UUIDs at L1-minion-0 and L1-minion-2 => hook errors at both L1-minion-0 and L1-minion-2

So in the final state there are no errors on L1-leader (new) as it has bootstrap_uuid that was set by it via leader-set (leader_settings_bootstrap_uuid == L1-leader_local_service_bootstrap_uuid)

2 minions are in a separate L2 cluster and have service-level UUIDs that are inconsistent with the leader setting.

AFAIKS Juju already has a somewhat transactional nature for leadership changes - there is a "Resign" operation and "leader-deposed hook" which apparently is not run (no-op):

https://github.com/juju/juju/blame/juju-2.2.6/worker/uniter/operation/leader.go#L74-L79

2017-11-14 17:21:32 INFO juju.worker.uniter.operation runhook.go:113 ran "shared-db-relation-changed" hook
2017-11-14 17:21:32 DEBUG juju.worker.uniter.operation executor.go:100 ...


tags: added: cpe-onsite
Ryan Beisner (1chb1n)
tags: added: uosci
John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
Revision history for this message
John A Meinel (jameinel) wrote :

I'm not sure that there is a logic bug in Juju, but we should understand what is going on in the system that is causing us to not refresh leadership correctly. I think the discussion around leader-elected is still relevant.

I'm not sure how much leader-deposed would actually help in this particular case. If you're in the middle of a hook, and you've ever called is_leader, should we kill the execution of that script if we want to depose you?
It might work for some of the other cases where you need to tear things down that only got partially set up. Units still get a leader-settings-changed when they get demoted, so they could do the same work in that hook.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Could we provide a guarantee that no unit of a given application will ever consider itself a leader until a previous leader has been deposed in apiserver's view? Likewise, an apiserver should not give any leader tokens until it receives a confirmation that the previous leader has been deposed and ran that hook.

The latter condition is a strong requirement as if there is a network partition and a unit agent is no longer available, apiserver will never elect a new leader. If we introduce a timeout for that this may result in a split-brain unless a unit agent is required to stop executing further operations if there is a connection loss with the apiserver.

We cannot just stop a hook execution because a charm may inherently spawn threads and processes of its own will, which may daemonize and do other arbitrary things on a system during hook execution. Any process tracking mechanisms are operating-system-specific (e.g. cgroups) and they can be escaped, so we shouldn't even look that way.

The complicated part is that a unit <-> apiserver connection may be lost but a service-level network may be fine (i.e. the loss of L1-relevant connectivity doesn't mean services on L2 have the same picture) - this is the case where we have ToR and BoR switches providing service and management networks respectively on different physical media (switch fabrics). This is a common scenario for us (that's why we have network spaces). In other words: there may be an L1-related partition but not L2-related partition.

I think that in this case a partitioned unit should run leader-deposed which may run L2-related checks to see if this is only the unit <-> apiserver connectivity problem. This is an interesting scenario as the unit agent is isolated in this case and cannot get anything from the apiserver (can't do facade RPC). However, I think this is a useful scenario to model.

As an operator, would you do something like that with your system? Probably yes, you would go out-of-band or in-person and check if this problem impacts only Juju-related connectivity and decide upon service-level impact - this is what you should have in the charm in leader-deposed hook.

===

Now, to having one per-app leader unit running at a time, I believe this is, at least partially, present in Juju.

https://github.com/juju/juju/blob/juju-2.3-rc1/worker/leadership/tracker.go#L206-L227
// setMinion arranges for lease acquisition when there's an opportunity.
func (t *Tracker) setMinion() error {
    ...
    t.claimLease = make(chan struct{})
    go func() {
        defer close(t.claimLease)
        logger.Debugf("%s waiting for %s leadership release", t.unitName, t.applicationName)
        err := t.claimer.BlockUntilLeadershipReleased(t.applicationName)
        if err != nil {
            logger.Debugf("error while %s waiting for %s leadership release: %v", t.unitName, t.applicationName, err)
        }

The only part I have not found yet is explicit blocks on leader-deposed on the apiserver side.

What I think we need:

1. leadership-tracker tries to renew the lease;
2. fails as the token has expired;
3. runs leader-deposed hook;
4. meanwhile, apiserver doesn't allow anybody else to claim leadership until it got EXPLICIT notificatio...


Revision history for this message
Ante Karamatić (ivoks) wrote :

This behavior is critical for us.

Revision history for this message
Tim Penhey (thumper) wrote :

A key problem we have here is that Juju really can't give any guarantees. I spent some time last week talking with Ryan about what Juju can and can't say at any particular point in time.

The short answer is no, Juju cannot guarantee that a new leader won't be elected until a leader deposed hook is executed because the old leader might not be communicative. Consider the situation where there is a hardware failure, and the machine just dies. There is no way for it to run the hook, and if we are waiting, no other unit would ever be elected leader. This isn't reasonable.

Considering that we can't make this guarantee, we shouldn't rely on it.

No, AFAIK we don't have any explicit waits on other units running leader-deposed.

Revision history for this message
Tim Penhey (thumper) wrote :

I think a key thing to note here is the term "guarantee". I think I may have been taking too hard a line with guarantee.

The key thing to think about here is that the leader "shouldn't" change under normal circumstances. So the situations that are causing a leadership change should be the exceptional circumstances.

To be clear, as long as the agents are able to communicate, the leadership shouldn't change.

All the sharp edge cases are at the exceptional edge though. Why would communication drop?
 * net splits - I'm still not clear on what causes a net split
 * hardware failures
 * severely overloaded servers - we should work out how to be more aware of this, perhaps via the number of running API calls.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Tim,

net split example: you have Juju controllers and MAAS region controllers sitting on layer 2 networks different from rack controllers and application servers in a data center. E.g. there are 9 racks to manage in different locations within the same DC but you would like to keep the same Juju & MAAS regiond control plane located separately so that you can add more racks. In this case there may be a situation where you lose access to one management network for rack "k" from a Juju controller which is a primary in a replicaset. It's a net split but your applications are unaffected - only machine & unit agents.

I think that what we encounter is mostly deployment-time problems because after a model has converged there is little use for Juju leadership hooks. It may be needed if you need to scale your infrastructure (deployment time again) but by then service-level clustering will have already been done.

Another use-case is rolling upgrades: a single unit should initiate them even if the "rolling" part is managed at the service level. But there are two different types of rolling upgrades:

1. for stateless applications - ordering of operations (by a leader) should be done on the Juju side as this is operator-driven if done manually in many cases. Otherwise we will need a "software-upgrader" application which will have to handle that and maintain the deployment state;
2. stateful applications - service-level quorum awareness is required so a leader unit only initiates an upgrade which is done in software itself.

In the cases I've seen we go through the following logic:

1. a leader unit defines who will bootstrap a service-level cluster;
2. service-level elections are performed (ordered connections to a master, PAXOS, RAFT, Totem RRP etc.);
3. leadership is managed at the service level. Leader settings contain an indication of a completed bootstrap procedure and leadership hooks are no-ops.

A practical example:

1. percona cluster (master bootstraps, slaves join without bootstrapping);
2. new slaves join the quorum;
3. any service-level failure conditions require disaster recovery and manual intervention.

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot