Upgrade from 2.8.10 to 2.9.0 results in applications with no leader

Bug #1927732 reported by Michael Skalka
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips
Milestone: 2.9.5

Bug Description

After upgrading a controller from 2.8.10 to 2.9.0, multiple applications within a model on that controller lost their leader, resulting in hook errors which are not recoverable.

ubuntu@prodymcprodface-cpe-f4b88f49-d1a9-4d98-b8c1-db1b25e5d641:~$ juju status -m controller
Model Controller Cloud/Region Version SLA Timestamp
controller foundations-maas maas_cloud 2.9.0 unsupported 12:54:59Z

Machine State DNS Inst id Series AZ Message
0 started 10.246.64.200 ybxnk6 bionic zone1 Deployed
1 started 10.246.64.201 juju-2 bionic zone2 Deployed
2 started 10.246.64.202 juju-3 bionic zone3 Deployed

ubuntu@prodymcprodface-cpe-f4b88f49-d1a9-4d98-b8c1-db1b25e5d641:~$ juju status keystone
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.10 unsupported 12:55:14Z

App Version Status Scale Charm Store Rev OS Notes
hacluster-keystone active 3 hacluster jujucharms 76 ubuntu
keystone 17.0.0 active 3 keystone jujucharms 323 ubuntu
keystone-ldap 17.0.0 active 3 keystone-ldap jujucharms 35 ubuntu
logrotated active 3 logrotated jujucharms 3 ubuntu
public-policy-routing active 3 advanced-routing jujucharms 4 ubuntu

Unit Workload Agent Machine Public address Ports Message
keystone/0 error idle 0/lxd/7 10.244.8.176 5000/tcp hook failed: "config-changed"
  hacluster-keystone/2 active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active executing 10.244.8.176 Unit is ready
  logrotated/47 active idle 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1 active executing 2/lxd/5 10.244.8.152 5000/tcp Unit is ready
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0 active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2 active executing 4/lxd/5 10.244.8.179 5000/tcp Unit is ready
  hacluster-keystone/1 active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1* active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23 active idle 10.244.8.179 Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 10.244.8.128 azurill bionic zone1 Deployed
0/lxd/7 started 10.244.8.176 juju-5f5691-0-lxd-7 bionic zone1 Container started
2 started 10.244.8.130 meowth bionic zone2 Deployed
2/lxd/5 started 10.244.8.152 juju-5f5691-2-lxd-5 bionic zone2 Container started
4 started 10.244.8.131 taillow bionic zone3 Deployed
4/lxd/5 started 10.244.8.179 juju-5f5691-4-lxd-5 bionic zone3 Container started

Occasionally the keystone application will seemingly elect a leader, then immediately error:

Unit Workload Agent Machine Public address Ports Message
keystone/0 error idle 0/lxd/7 10.244.8.176 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/2* active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active idle 10.244.8.176 Unit is ready
  logrotated/47 active executing 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1 error idle 2/lxd/5 10.244.8.152 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0* active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2* error idle 4/lxd/5 10.244.8.179 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/1 active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1 active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23* active idle 10.244.8.179 Unit is ready

Revision history for this message
Michael Skalka (mskalka) wrote :

Openstack model crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Controller crashdump

John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Skalka (mskalka) wrote :

Backing this down to high, the model has eventually elected a leader:

Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.10 unsupported 13:30:46Z

App Version Status Scale Charm Store Rev OS Notes
hacluster-keystone active 3 hacluster jujucharms 76 ubuntu
keystone 17.0.0 active 3 keystone jujucharms 323 ubuntu
keystone-ldap 17.0.0 active 3 keystone-ldap jujucharms 35 ubuntu
logrotated active 3 logrotated jujucharms 3 ubuntu
public-policy-routing active 3 advanced-routing jujucharms 4 ubuntu

Unit Workload Agent Machine Public address Ports Message
keystone/0 active idle 0/lxd/7 10.244.8.176 5000/tcp Unit is ready
  hacluster-keystone/2 active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active idle 10.244.8.176 Unit is ready
  logrotated/47 active idle 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1* maintenance executing 2/lxd/5 10.244.8.152 5000/tcp Updating endpoint for swift
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0 active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2 active idle 4/lxd/5 10.244.8.179 5000/tcp Unit is ready
  hacluster-keystone/1* active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1* active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23 active idle 10.244.8.179 Unit is ready

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The crash dumps do not have controller machine logs, nor primary unit logs. Luckily the setup is still available.

The controller has many errors along the lines of:
machine-2: 13:02:50 ERROR juju.worker.raft.raftforwarder couldn't expire lease "93375d9b-f8d3-4f19-8936-2eae6e9048dd:singular-controller#75bbcf8b-cf77-483b-8486-3b630a27ef6b#" in db: state changing too quickly; try again soon
machine-2: 13:02:51 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#ceph-osd#" for "ceph-osd/0" in db: state changing too quickly; try again soon
machine-2: 13:03:00 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#logrotated#" for "logrotated/37" in db: state changing too quickly; try again soon
machine-2: 13:03:05 ERROR juju.worker.raft.raftforwarder couldn't expire lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#hacluster-keystone#" in db: state changing too quickly; try again soon
machine-2: 13:03:08 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#ntp#" for "ntp/1" in db: state changing too quickly; try again soon
machine-2: 13:03:43 ERROR juju.worker.raft.raftforwarder couldn't expire lease "44c18b11-f1f0-4b9d-8185-ed6b3e35d05f:singular-controller#44c18b11-f1f0-4b9d-8185-ed6b3e35d05f#" in db: state changing too quickly; try again soon
machine-2: 13:03:56 ERROR juju.worker.raft.raftforwarder couldn't expire lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#hacluster-glance#" in db: state changing too quickly; try again soon
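
For context on that error string: it looks like the generic contention failure from Juju's assert-then-apply transaction pattern, where each write re-reads state, asserts it is unchanged, and gives up after a few aborted retries. A minimal sketch of that pattern, using hypothetical helper names rather than the real mgo/txn calls:

package main

import (
	"errors"
	"fmt"
)

var errAborted = errors.New("transaction aborted: assertion failed")

// attemptWrite stands in for "read the lease doc, assert it is unchanged,
// then write the new holder". It aborts when a concurrent writer has
// invalidated the assertion. All names here are hypothetical.
func attemptWrite(attempt int, contended func(int) bool) error {
	if contended(attempt) {
		return errAborted
	}
	return nil
}

// runWithRetries retries the asserted write a fixed number of times and then
// gives up with a contention error, which is the shape of failure seen in the
// controller logs above.
func runWithRetries(contended func(int) bool) error {
	const maxAttempts = 3
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := attemptWrite(attempt, contended)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errAborted) {
			return err
		}
		// Assertion failed: another writer changed the document first,
		// so retry against freshly read state.
	}
	return errors.New("state changing too quickly; try again soon")
}

func main() {
	// A writer that always loses the race exhausts its retries.
	fmt.Println(runWithRetries(func(int) bool { return true }))
	// A writer that wins on its second attempt succeeds.
	fmt.Println(runWithRetries(func(attempt int) bool { return attempt == 0 }))
}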

Revision history for this message
John A Meinel (jameinel) wrote :

machine-2: 13:02:50 ERROR juju.worker.raft.raftforwarder couldn't expire lease "93375d9b-f8d3-4f19-8936-2eae6e9048dd:singular-controller#75bbcf8b-cf77-483b-8486-3b630a27ef6b#" in db: state changing too quickly; try again soon

is particularly worrying, as this is the lease that decides which controller runs the singleton workers (like the provisioner).

It would be good to understand how we're hitting contention on leases, since when a lease is updated it shouldn't be a shared document that something else is writing at the same time.

Revision history for this message
John A Meinel (jameinel) wrote :

This might not be what I thought it was. I tried to trigger it by doing a deploy with lots of applications and then upgrading, but everything went through clean.

$ sudo snap install juju_28 --channel 2.8 --classic
$ sudo snap install juju --channel 2.9 --classic
$ juju_28 bootstrap lxd lxd
$ juju_28 enable-ha
$ juju_28 deploy cs:~jameinel/ubuntu-lite -n 10
$ juju status # until all containers are ready
# Deploy 100 applications to 10 containers
$ for j in `seq -w 0 9`; do echo $j; for i in `seq -w 0 9`; do echo " $i"; juju deploy cs:~jameinel/ubuntu-lite ul$j$i --to $i & done; time wait; done
# Make sure they all have 2 units
$ for j in `seq -w 0 9`; do echo $j; for i in `seq -w 0 9`; do echo " $i"; juju add-unit ul$j$i --to $i & done; time wait; done

$ juju upgrade-controller

I did see some failures, but nothing systemic:
2021-05-07 20:50:23 WARNING juju.core.raftlease store.go:300 response timeout waiting for Command(ver: 1, op: claim, ns: singular-controller, model: aef0a3, lease: c0cd5b

ubuntu-lite doesn't do anything in a leader-elected hook, but we also weren't seeing the database updates failing due to "state changing too quickly".

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

There is a question of what the authority is here.

The errors "ERROR juju.worker.raft.raftforwarder couldn't expire lease {key} in db..." are just logged errors.

When they are emitted, we have already applied the operation to the FSM, and the operation itself completes successfully, so as far as Raft and the lease manager are concerned, they can report who is or is not the leader. It's just that the "leaseholders" collection is out of sync.

I am still trying to tease out how this is actually possible in the HA up/down upgrade dance, but it seems to me we should remove the DB assertions and write the changes unconditionally, because the maintenance of leases moves forward regardless of the DB state; we might as well have the info correct there.
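
To make that proposal concrete, here is a rough sketch of the suggested direction under assumed names (leaseClaim, fsm, and leaseholderDocs are stand-ins, not the real lease store API): the raft FSM stays authoritative, and the mongo leaseholders mirror is written without assertions, with failures logged instead of propagated:

package main

import "log"

// leaseClaim, fsm and leaseholderDocs are hypothetical stand-ins, not the
// real Juju lease-store types.
type leaseClaim struct {
	Namespace, ModelUUID, Lease, Holder string
}

type fsm interface {
	Apply(leaseClaim) error // authoritative: raft log plus in-memory state
}

type leaseholderDocs interface {
	Upsert(leaseClaim) error // DB mirror only: no assertions, last write wins
}

// applyClaim applies the claim to the FSM first. The DB mirror is then
// written unconditionally, and a failure there is logged rather than
// propagated, because the FSM already holds the truth.
func applyClaim(f fsm, docs leaseholderDocs, c leaseClaim) error {
	if err := f.Apply(c); err != nil {
		return err
	}
	if err := docs.Upsert(c); err != nil {
		log.Printf("could not sync leaseholders doc for %q: %v", c.Lease, err)
	}
	return nil
}

// Minimal in-memory implementations so the sketch runs end to end.
type memFSM struct{ holders map[string]string }

func (m *memFSM) Apply(c leaseClaim) error { m.holders[c.Lease] = c.Holder; return nil }

type memDocs struct{}

func (memDocs) Upsert(leaseClaim) error { return nil }

func main() {
	f := &memFSM{holders: map[string]string{}}
	_ = applyClaim(f, memDocs{}, leaseClaim{
		Namespace: "application-leadership",
		Lease:     "keystone",
		Holder:    "keystone/1",
	})
	log.Printf("FSM holder for keystone: %s", f.holders["keystone"])
}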

John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.9.2
John A Meinel (jameinel)
tags: added: canonical-bootstack canonical-is
Changed in juju:
milestone: 2.9.2 → 2.9.3
Revision history for this message
Joseph Phillips (manadart) wrote :

Will start building a story here.

- Machine 0 is running the Mongo primary.
- Machine 2 evidently does not complete the upgrade after the first detection.
  The upgrade request is logged twice.
- All upgradesteps workers are killed via tomb at least once.
  3 times for machine 0.

There is a lot of churn here; we could improve the flow, reducing worker down-time, if there were a once-through run of the steps. The tomb-kill can be reproduced consistently.
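
For readers unfamiliar with the lifecycle being described, the sketch below shows the general tomb pattern (gopkg.in/tomb.v2) used by Juju workers; the worker body is a placeholder, not Juju's code, but it illustrates how a tomb-kill tears the worker down mid-run, so the upgrade steps have to start over when the runner brings a fresh worker up:

package main

import (
	"fmt"
	"time"

	"gopkg.in/tomb.v2"
)

// stepsWorker is a placeholder for an upgrade-steps-style worker; only the
// lifecycle shape is representative.
type stepsWorker struct {
	tomb tomb.Tomb
}

func newStepsWorker() *stepsWorker {
	w := &stepsWorker{}
	w.tomb.Go(w.loop)
	return w
}

func (w *stepsWorker) loop() error {
	for {
		select {
		case <-w.tomb.Dying():
			// The dependency engine (or another upgrade trigger) killed
			// us before the steps finished; any progress is abandoned
			// and a new worker has to start over.
			return tomb.ErrDying
		case <-time.After(100 * time.Millisecond):
			// Placeholder for running one upgrade step.
		}
	}
}

func (w *stepsWorker) Kill()       { w.tomb.Kill(nil) }
func (w *stepsWorker) Wait() error { return w.tomb.Wait() }

func main() {
	w := newStepsWorker()
	time.Sleep(150 * time.Millisecond)
	w.Kill()
	fmt.Println("worker stopped, error:", w.Wait())
}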

Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.3 → 2.9.4
Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

Based on interrogating the logs for entries from the raft-backstop worker, it does not appear that we have a backstop-induced split-brain in Raft.

Interestingly, Raft is still running on machine-1 after upgrade availability is detected, but apparently not on (at least) node 0. This should not be the case, as the main Raft worker is predicated upon "ifFullyUpgraded".
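
As a point of reference for that gating, here is a minimal sketch of the idea behind an "ifFullyUpgraded"-style wrapper, using a home-made gate rather than the actual dependency-engine manifold code: the wrapped start function cannot run until the upgrade gate unlocks, which is why finding Raft alive on a node that has not finished upgrading is surprising:

package main

import (
	"fmt"
	"sync"
)

// gate is a one-shot latch standing in for the "upgrade steps completed"
// flag; it is not the real dependency-engine gate worker.
type gate struct {
	once sync.Once
	ch   chan struct{}
}

func newGate() *gate { return &gate{ch: make(chan struct{})} }

func (g *gate) Unlock() { g.once.Do(func() { close(g.ch) }) }

func (g *gate) Unlocked() <-chan struct{} { return g.ch }

// ifFullyUpgraded wraps a worker start function so it cannot run until the
// upgrade gate has been unlocked.
func ifFullyUpgraded(g *gate, start func() error) func() error {
	return func() error {
		<-g.Unlocked()
		return start()
	}
}

func main() {
	g := newGate()
	startRaft := ifFullyUpgraded(g, func() error {
		fmt.Println("raft worker started")
		return nil
	})

	done := make(chan error)
	go func() { done <- startRaft() }()

	// Until the gate is unlocked, the wrapped worker stays blocked.
	g.Unlock()
	fmt.Println("start result:", <-done)
}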

It's getting interesting here. Look at the back-off times versus the log timestamps. This is the sort of logging we might expect to see if multiple instances of the same worker were running:
https://pastebin.canonical.com/p/XjTWzDz59G/

Here we see the start of our tribulation. The DB upgrade completes and the upgrade-steps runner starts up, but the upgrader worker sees the agent version for OS instead of series and tanks all the workers for another upgrade before the controllers have coordinated the steps for 2.9.0.
https://pastebin.canonical.com/p/GVz6tWc3FJ/

The other controllers will be doing this same dance and we see the dependency engine thrash a bit before the upgrade-steps worker can run successfully.
https://pastebin.canonical.com/p/yf2vpWVfRj/

John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

I tried to create some log files that had the right content in them for debugging. The full logs are still useful if we need to dig, but these concatenate all the log files and then snip them from 12:35 until 13:00 across all 3 controller machines.

Changed in juju:
milestone: 2.9.4 → 2.9.5
Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.5 → 2.9.6
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.6 → 2.9.5