Comment 22 for bug 1728111

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Sorry, it's a long message, but I think there is meaningful content in it.

https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/2
https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/comments/3

The behavior I encountered in a duplicate bug got me thinking about how to fix this problem at both the Juju and charm levels (both will need modifications).

TL;DR:

Juju: revive the "leader-deposed" hook work - actually run that hook instead of treating it as a no-op (see https://git.io/vF1Jn)

Charmers: Modify charms that implement service-level leadership (not only Juju-level leadership) to use leader-deposed.

Juju: Document when is_leader stops returning True, and either introduce leader transactions (where a leader executes code and cannot be deposed until it finishes execution or its process dies) or document the operation interruption semantics (if any).

========

Topic 1.

Description:

For clarity, I will name 2 levels of leadership:

* level 1 (L1): Juju-level per-application unit leadership (a leader unit is the actor here);
* level 2 (L2): application-specific or service-specific leadership (a percona cluster process in this case); there is no explicit mapping between L1 and L2 leadership in either direction.

What happened (pad.lv/1732257)?

The L1 leader got elected and started bootstrapping a cluster, so an L2 leader got created => L1 leader == L2 leader

The L1 minions have not run <peer>-relation-joined yet => the L1 leader cannot tell them that it is the L2 leader, and there are no L2 minion processes yet => it waits for more <peer>-relation-{joined, changed} events

L1-minion-0 got installed and joined the peer relation with the L1 leader, but there are only 2/3 peers (gated by the min-cluster-size config option) => L2-minion-0 has NOT been set up yet (2/3 peers, not clustered, not an L1 leader: no config rendering, no process running).

The L1 leader got deposed but performed no action to depose its L2 leader => the old leader became **L1-minion-2**, with its L2 leader process still running

L1-minion-1 became the L1 leader and **started** bootstrapping a new cluster => L1 leader != L2 leader => 2 L2 leaders present!

L1-minion-0 started its service and spawned an L2 minion, which got its cluster state from L1-minion-2 (the old L1 leader and now a contending L2 leader) ***before it got it from the new L1 leader*** => 2 leaders and 1 minion present: we hit a RACE CONDITION on L2

The new L1 leader set a new bootstrap_uuid leader bucket setting which is inconsistent with the L2 UUIDs on L1-minion-0 and L1-minion-2 => hook errors on both L1-minion-0 and L1-minion-2

So in the final state there are no errors on the new L1 leader, as it has the bootstrap_uuid that it set itself via leader-set (leader_settings_bootstrap_uuid == L1-leader_local_service_bootstrap_uuid).

The 2 minions are in a separate L2 cluster and have service-level UUIDs that are inconsistent with the leader setting.
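The end state above can be modeled as a small sketch (with hypothetical helper names; the real charm's checks differ in detail): each unit compares the bootstrap UUID published via leader-set against the UUID its local percona process reports, and the two minions fail that comparison.

```python
# Hypothetical sketch of the consistency check that fails in the end state
# above; bootstrap_uuid_consistent and the UUID values are illustrative.

def bootstrap_uuid_consistent(leader_settings, local_service_uuid):
    """True iff the unit's local cluster UUID matches the one the
    current L1 leader published in the leader settings bucket."""
    return leader_settings.get("bootstrap_uuid") == local_service_uuid

# End state from the bug: the new L1 leader published its own UUID, while
# minion-0 and minion-2 bootstrapped from the old (deposed) leader's cluster.
leader_settings = {"bootstrap_uuid": "uuid-of-new-leader"}

assert bootstrap_uuid_consistent(leader_settings, "uuid-of-new-leader")      # new L1 leader: no error
assert not bootstrap_uuid_consistent(leader_settings, "uuid-of-old-leader")  # the 2 minions: hook error
```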

AFAICS, Juju already has a somewhat transactional flow for leadership changes: there is a "Resign" operation and a "leader-deposed" hook, which is apparently not run (a no-op):

https://github.com/juju/juju/blame/juju-2.2.6/worker/uniter/operation/leader.go#L74-L79

2017-11-14 17:21:32 INFO juju.worker.uniter.operation runhook.go:113 ran "shared-db-relation-changed" hook
2017-11-14 17:21:32 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run relation-changed (88; neutron-api/1) hook"
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:89 lock released
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:69 running operation resign leadership <---- resign OP
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "resign leadership"
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "resign leadership"
2017-11-14 17:21:33 WARNING juju.worker.uniter.operation leader.go:115 we should run a leader-deposed hook here, but we can't yet <---- leader-deposed
2017-11-14 17:21:33 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "resign leadership"

So, reviving this work would allow us to debootstrap (as in "kill the service and clean the persistent state") the old L2 leader **IF** the L2 bootstrap procedure has NOT been completed yet. At the charm level, a charm author has to be careful not to debootstrap on every leader-deposed, of course.
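The charm-side guard could look like the following sketch (all helper names are hypothetical; the actual hook wiring depends on the revived leader-deposed work): only tear down the local L2 leader if the service-level bootstrap has not finished, so a completed, healthy cluster is never destroyed by a routine Juju leadership change.

```python
# Hypothetical leader-deposed handling: debootstrap only when the L2
# bootstrap is still in flight. The debootstrap callable is injected so
# the decision logic can be exercised outside a Juju hook context.

def on_leader_deposed(bootstrap_complete, service_running, debootstrap):
    """Decide whether to kill the local service and wipe persistent state.

    Returns True if debootstrap was invoked, False otherwise.
    """
    if service_running and not bootstrap_complete:
        debootstrap()  # kill the service and clean persistent state
        return True
    # Bootstrap already finished (or nothing running): leave the cluster alone.
    return False
```

The key property is the second branch: a deposed L1 leader whose L2 bootstrap already completed keeps its state, which is exactly the "don't debootstrap on every leader-deposed" caveat above.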

=====

Topic 2.

What I am interested in is: at which point do we start getting is_leader == False?

2017-11-14 17:21:27 INFO juju.worker.leadership tracker.go:208 mysql leadership for mysql/2 denied
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:230 notifying mysql/2 ticket of impending loss of mysql leadership
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:215 mysql/2 waiting for mysql leadership release
2017-11-14 17:21:27 DEBUG juju.worker.leadership tracker.go:269 mysql/2 is not mysql leader
2017-11-14 17:21:27 DEBUG juju.worker.uniter.remotestate watcher.go:394 got leadership change: minion

^^^ is it here?

2017-11-14 17:21:27 DEBUG worker.uniter.jujuc server.go:178 running hook tool "is-leader"

^^^ what is the result?
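Since the tracker can revoke leadership between the events above, one defensive option for a charm is to re-run the "is-leader" hook tool immediately before a critical section rather than trusting an earlier answer. A sketch (the runner is injectable purely so the logic is testable outside a hook context; "--format=json" is the standard Juju hook tool output flag):

```python
import json
import subprocess

def hook_tool_is_leader(run=subprocess.check_output):
    """Ask the 'is-leader' hook tool right now; it prints 'true'/'false'
    as JSON. Only meaningful while executing inside a Juju hook."""
    return json.loads(run(["is-leader", "--format=json"]))

# Outside a hook we can still exercise the parsing with a fake runner:
assert hook_tool_is_leader(run=lambda cmd: b"false") is False
```

This does not close the race (leadership can still be lost right after the check, which is why leader transactions or documented interruption semantics are needed), but it narrows the window.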

====

Topic 3.

Now for the odd part with actual data unrelated to leadership (https://bugs.launchpad.net/charm-percona-cluster/+bug/1732257/+attachment/5009043/+files/juju-crashdump-b65366d4-4e66-42b7-98a3-cab719fd1529.tar.gz):

The peer relation of the percona-cluster application is called "cluster", and the application is named mysql in the model:

cluster-relation-joined counts:

unit-mysql-0: 1 # <--- ?
unit-mysql-1: 2
unit-mysql-2: 2
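For reference, the counts above can be reproduced from the unit logs in the crashdump by counting the AGENT-STATUS lines for the hook; a minimal sketch (the sample text stands in for a real unit log file):

```python
import re

# Count executions of cluster-relation-joined in a unit log by matching the
# uniter's AGENT-STATUS line for that hook.
HOOK_RE = re.compile(r"executing: running cluster-relation-joined hook")

def count_joined(log_text):
    return len(HOOK_RE.findall(log_text))

sample = """\
2017-11-14 17:22:52 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-joined hook
2017-11-14 17:25:10 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-joined hook
"""

assert count_joined(sample) == 2
```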

While unit-mysql-0 had only one -joined event, it could still observe *2* other peers:

the executor operation for mysql/2 joining on mysql/0:

2017-11-14 17:22:52 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run relation-joined (28; mysql/2) hook"
2017-11-14 17:22:52 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run relation-joined (28; mysql/2) hook"
2017-11-14 17:22:52 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-joined hook

previous executor operation:

2017-11-14 17:26:37 INFO juju-log shared-db:88: Insufficient number of peer units to form cluster (expected=3, got=2)

...

the next operation:

2017-11-14 17:31:05 DEBUG juju.worker.uniter.operation executor.go:100 preparing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:05 DEBUG juju.worker.uniter.operation executor.go:100 executing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:05 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running cluster-relation-changed hook

2017-11-14 17:31:48 INFO juju-log cluster:28: Sufficient number of peer units to form cluster 3

^^^ How is that possible? The logs never reported a join event for mysql/1, so how did it become visible in this state snapshot? ^^^

2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:100 committing operation "run relation-changed (28; mysql/2) hook"
2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:89 lock released
2017-11-14 17:31:52 DEBUG juju.worker.uniter.operation executor.go:69 running operation run leader-settings-changed hook

2017-11-14 17:31:57 ERROR juju-log bootstrap uuid differs: 0dde54c4-c961-11e7-9775-2f72db67cbda != b3e1b7e8-c95f-11e7-91fa-56c8581eb616