Upgrade from 2.8.10 to 2.9.0 results in applications with no leader

Bug #1927732 reported by Michael Skalka
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips
Milestone: 2.9.5

Bug Description

After upgrading a controller from 2.8.10 to 2.9.0, multiple applications within a model on that controller lost their leader, resulting in hook errors which are not recoverable.

ubuntu@prodymcprodface-cpe-f4b88f49-d1a9-4d98-b8c1-db1b25e5d641:~$ juju status -m controller
Model Controller Cloud/Region Version SLA Timestamp
controller foundations-maas maas_cloud 2.9.0 unsupported 12:54:59Z

Machine State DNS Inst id Series AZ Message
0 started 10.246.64.200 ybxnk6 bionic zone1 Deployed
1 started 10.246.64.201 juju-2 bionic zone2 Deployed
2 started 10.246.64.202 juju-3 bionic zone3 Deployed

ubuntu@prodymcprodface-cpe-f4b88f49-d1a9-4d98-b8c1-db1b25e5d641:~$ juju status keystone
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.10 unsupported 12:55:14Z

App Version Status Scale Charm Store Rev OS Notes
hacluster-keystone active 3 hacluster jujucharms 76 ubuntu
keystone 17.0.0 active 3 keystone jujucharms 323 ubuntu
keystone-ldap 17.0.0 active 3 keystone-ldap jujucharms 35 ubuntu
logrotated active 3 logrotated jujucharms 3 ubuntu
public-policy-routing active 3 advanced-routing jujucharms 4 ubuntu

Unit Workload Agent Machine Public address Ports Message
keystone/0 error idle 0/lxd/7 10.244.8.176 5000/tcp hook failed: "config-changed"
  hacluster-keystone/2 active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active executing 10.244.8.176 Unit is ready
  logrotated/47 active idle 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1 active executing 2/lxd/5 10.244.8.152 5000/tcp Unit is ready
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0 active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2 active executing 4/lxd/5 10.244.8.179 5000/tcp Unit is ready
  hacluster-keystone/1 active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1* active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23 active idle 10.244.8.179 Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 10.244.8.128 azurill bionic zone1 Deployed
0/lxd/7 started 10.244.8.176 juju-5f5691-0-lxd-7 bionic zone1 Container started
2 started 10.244.8.130 meowth bionic zone2 Deployed
2/lxd/5 started 10.244.8.152 juju-5f5691-2-lxd-5 bionic zone2 Container started
4 started 10.244.8.131 taillow bionic zone3 Deployed
4/lxd/5 started 10.244.8.179 juju-5f5691-4-lxd-5 bionic zone3 Container started

Occasionally the keystone application will seemingly elect a leader, then immediately error:

Unit Workload Agent Machine Public address Ports Message
keystone/0 error idle 0/lxd/7 10.244.8.176 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/2* active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active idle 10.244.8.176 Unit is ready
  logrotated/47 active executing 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1 error idle 2/lxd/5 10.244.8.152 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0* active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2* error idle 4/lxd/5 10.244.8.179 5000/tcp hook failed: "leader-elected"
  hacluster-keystone/1 active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1 active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23* active idle 10.244.8.179 Unit is ready

Revision history for this message
Michael Skalka (mskalka) wrote :

Openstack model crashdump

Revision history for this message
Michael Skalka (mskalka) wrote :

Controller crashdump

John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Skalka (mskalka) wrote :

Backing this down to high, the model has eventually elected a leader:

Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.10 unsupported 13:30:46Z

App Version Status Scale Charm Store Rev OS Notes
hacluster-keystone active 3 hacluster jujucharms 76 ubuntu
keystone 17.0.0 active 3 keystone jujucharms 323 ubuntu
keystone-ldap 17.0.0 active 3 keystone-ldap jujucharms 35 ubuntu
logrotated active 3 logrotated jujucharms 3 ubuntu
public-policy-routing active 3 advanced-routing jujucharms 4 ubuntu

Unit Workload Agent Machine Public address Ports Message
keystone/0 active idle 0/lxd/7 10.244.8.176 5000/tcp Unit is ready
  hacluster-keystone/2 active idle 10.244.8.176 Unit is ready and clustered
  keystone-ldap/2 active idle 10.244.8.176 Unit is ready
  logrotated/47 active idle 10.244.8.176 Unit is ready.
  public-policy-routing/30 active idle 10.244.8.176 Unit is ready
keystone/1* maintenance executing 2/lxd/5 10.244.8.152 5000/tcp Updating endpoint for swift
  hacluster-keystone/0 active idle 10.244.8.152 Unit is ready and clustered
  keystone-ldap/0 active idle 10.244.8.152 Unit is ready
  logrotated/15 active idle 10.244.8.152 Unit is ready.
  public-policy-routing/6 active idle 10.244.8.152 Unit is ready
keystone/2 active idle 4/lxd/5 10.244.8.179 5000/tcp Unit is ready
  hacluster-keystone/1* active idle 10.244.8.179 Unit is ready and clustered
  keystone-ldap/1* active idle 10.244.8.179 Unit is ready
  logrotated/39 active idle 10.244.8.179 Unit is ready.
  public-policy-routing/23 active idle 10.244.8.179 Unit is ready

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The crash dumps do not have controller machine logs, nor primary unit logs. Luckily the setup is still available.

The controller has many errors along the lines of:
machine-2: 13:02:50 ERROR juju.worker.raft.raftforwarder couldn't expire lease "93375d9b-f8d3-4f19-8936-2eae6e9048dd:singular-controller#75bbcf8b-cf77-483b-8486-3b630a27ef6b#" in db: state changing too quickly; try again soon
machine-2: 13:02:51 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#ceph-osd#" for "ceph-osd/0" in db: state changing too quickly; try again soon
machine-2: 13:03:00 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#logrotated#" for "logrotated/37" in db: state changing too quickly; try again soon
machine-2: 13:03:05 ERROR juju.worker.raft.raftforwarder couldn't expire lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#hacluster-keystone#" in db: state changing too quickly; try again soon
machine-2: 13:03:08 ERROR juju.worker.raft.raftforwarder couldn't claim lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#ntp#" for "ntp/1" in db: state changing too quickly; try again soon
machine-2: 13:03:43 ERROR juju.worker.raft.raftforwarder couldn't expire lease "44c18b11-f1f0-4b9d-8185-ed6b3e35d05f:singular-controller#44c18b11-f1f0-4b9d-8185-ed6b3e35d05f#" in db: state changing too quickly; try again soon
machine-2: 13:03:56 ERROR juju.worker.raft.raftforwarder couldn't expire lease "07f0fc1e-1280-4158-87b2-b8a86f5f5691:application-leadership#hacluster-glance#" in db: state changing too quickly; try again soon
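
For context on that error string: it looks like the generic contention failure from Juju's assert-then-apply transaction pattern, where each write re-reads state, asserts it is unchanged, and gives up after a few aborted retries. A minimal sketch of that pattern, using hypothetical helper names rather than the real mgo/txn calls:

package main

import (
	"errors"
	"fmt"
)

var errAborted = errors.New("transaction aborted: assertion failed")

// attemptWrite stands in for "read the lease doc, assert it is unchanged,
// then write the new holder". It aborts when a concurrent writer has
// invalidated the assertion. All names here are hypothetical.
func attemptWrite(attempt int, contended func(int) bool) error {
	if contended(attempt) {
		return errAborted
	}
	return nil
}

// runWithRetries retries the asserted write a fixed number of times and then
// gives up with a contention error, which is the shape of failure seen in the
// controller logs above.
func runWithRetries(contended func(int) bool) error {
	const maxAttempts = 3
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := attemptWrite(attempt, contended)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errAborted) {
			return err
		}
		// Assertion failed: another writer changed the document first,
		// so retry against freshly read state.
	}
	return errors.New("state changing too quickly; try again soon")
}

func main() {
	// A writer that always loses the race exhausts its retries.
	fmt.Println(runWithRetries(func(int) bool { return true }))
	// A writer that wins on its second attempt succeeds.
	fmt.Println(runWithRetries(func(attempt int) bool { return attempt == 0 }))
}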

Revision history for this message
John A Meinel (jameinel) wrote :

machine-2: 13:02:50 ERROR juju.worker.raft.raftforwarder couldn't expire lease "93375d9b-f8d3-4f19-8936-2eae6e9048dd:singular-controller#75bbcf8b-cf77-483b-8486-3b630a27ef6b#" in db: state changing too quickly; try again soon

is particularly worrying, as this is the lease that decides which controller runs the singleton workers (like the provisioner).

It would be good to understand how we're hitting contention on leases, since when a lease is updated it shouldn't be a shared document that something else is writing at the same time.

Revision history for this message
John A Meinel (jameinel) wrote :

This might not be what I thought it was. I tried to trigger it by doing a deploy with lots of applications and then upgrading, but everything went through clean.

$ sudo snap install juju_28 --channel 2.8 --classic
$ sudo snap install juju --channel 2.9 --classic
$ juju_28 bootstrap lxd lxd
$ juju_28 enable-ha
$ juju_28 deploy cs:~jameinel/ubuntu-lite -n 10
$ juju status # until all containers are ready
# Deploy 100 applications to 10 containers
$ for j in `seq -w 0 9`; do echo $j; for i in `seq -w 0 9`; do echo " $i"; juju deploy cs:~jameinel/ubuntu-lite ul$j$i --to $i & done; time wait; done
# Make sure they all have 2 units
$ for j in `seq -w 0 9`; do echo $j; for i in `seq -w 0 9`; do echo " $i"; juju add-unit ul$j$i --to $i & done; time wait; done

$ juju upgrade-controller

I did see some failures, but nothing systemic:
2021-05-07 20:50:23 WARNING juju.core.raftlease store.go:300 response timeout waiting for Command(ver: 1, op: claim, ns: singular-controller, model: aef0a3, lease: c0cd5b

ubuntu-lite doesn't do anything in a leader-elected hook, but we also weren't seeing the database updates failing due to "state changing too quickly".

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

There is a question of what the authority is here.

The errors "ERROR juju.worker.raft.raftforwarder couldn't expire lease {key} in db..." are just logged errors.

When they are emitted, we have already applied the operation to the FSM, and the operation itself completes successfully, so as far as Raft and the lease manager are concerned, they can report who is or is not the leader. It's just that the "leaseholders" collection is out of sync.

I am still trying to tease out how this is actually possible in the HA up/down upgrade dance, but it seems to me we should remove the DB assertions and write the changes unconditionally, because the maintenance of leases moves forward regardless of the DB state; we might as well have the info correct there.
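
To make that proposal concrete, here is a rough sketch of the suggested direction under assumed names (leaseClaim, fsm, and leaseholderDocs are stand-ins, not the real lease store API): the raft FSM stays authoritative, and the mongo leaseholders mirror is written without assertions, with failures logged instead of propagated:

package main

import "log"

// leaseClaim, fsm and leaseholderDocs are hypothetical stand-ins, not the
// real Juju lease-store types.
type leaseClaim struct {
	Namespace, ModelUUID, Lease, Holder string
}

type fsm interface {
	Apply(leaseClaim) error // authoritative: raft log plus in-memory state
}

type leaseholderDocs interface {
	Upsert(leaseClaim) error // DB mirror only: no assertions, last write wins
}

// applyClaim applies the claim to the FSM first. The DB mirror is then
// written unconditionally, and a failure there is logged rather than
// propagated, because the FSM already holds the truth.
func applyClaim(f fsm, docs leaseholderDocs, c leaseClaim) error {
	if err := f.Apply(c); err != nil {
		return err
	}
	if err := docs.Upsert(c); err != nil {
		log.Printf("could not sync leaseholders doc for %q: %v", c.Lease, err)
	}
	return nil
}

// Minimal in-memory implementations so the sketch runs end to end.
type memFSM struct{ holders map[string]string }

func (m *memFSM) Apply(c leaseClaim) error { m.holders[c.Lease] = c.Holder; return nil }

type memDocs struct{}

func (memDocs) Upsert(leaseClaim) error { return nil }

func main() {
	f := &memFSM{holders: map[string]string{}}
	_ = applyClaim(f, memDocs{}, leaseClaim{
		Namespace: "application-leadership",
		Lease:     "keystone",
		Holder:    "keystone/1",
	})
	log.Printf("FSM holder for keystone: %s", f.holders["keystone"])
}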

John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.9.2
John A Meinel (jameinel)
tags: added: canonical-bootstack canonical-is
Changed in juju:
milestone: 2.9.2 → 2.9.3
Revision history for this message
Joseph Phillips (manadart) wrote :

Will start building a story here.

- Machine 0 is running the Mongo primary.
- Machine 2 evidently does not complete the upgrade after the first detection.
  The upgrade request is logged twice.
- All upgradesteps workers are killed via tomb at least once.
  3 times for machine 0.

There is a lot of churn here; we could improve the flow, reducing worker down-time, if there were a once-through run of the steps. The tomb-kill can be reproduced consistently.
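
For readers unfamiliar with the lifecycle being described, the sketch below shows the general tomb pattern (gopkg.in/tomb.v2) used by Juju workers; the worker body is a placeholder, not Juju's code, but it illustrates how a tomb-kill tears the worker down mid-run, so the upgrade steps have to start over when the runner brings a fresh worker up:

package main

import (
	"fmt"
	"time"

	"gopkg.in/tomb.v2"
)

// stepsWorker is a placeholder for an upgrade-steps-style worker; only the
// lifecycle shape is representative.
type stepsWorker struct {
	tomb tomb.Tomb
}

func newStepsWorker() *stepsWorker {
	w := &stepsWorker{}
	w.tomb.Go(w.loop)
	return w
}

func (w *stepsWorker) loop() error {
	for {
		select {
		case <-w.tomb.Dying():
			// The dependency engine (or another upgrade trigger) killed
			// us before the steps finished; any progress is abandoned
			// and a new worker has to start over.
			return tomb.ErrDying
		case <-time.After(100 * time.Millisecond):
			// Placeholder for running one upgrade step.
		}
	}
}

func (w *stepsWorker) Kill()       { w.tomb.Kill(nil) }
func (w *stepsWorker) Wait() error { return w.tomb.Wait() }

func main() {
	w := newStepsWorker()
	time.Sleep(150 * time.Millisecond)
	w.Kill()
	fmt.Println("worker stopped, error:", w.Wait())
}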

Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.3 → 2.9.4
Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

Based on interrogating the logs for entries from the raft-backstop worker, it does not appear that we have a backstop-induced split-brain in Raft.

Interestingly, Raft is still running on machine-1 after upgrade availability is detected, but apparently not on (at least) node 0. This should not be the case, as the main Raft worker is predicated upon "ifFullyUpgraded".
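
As a point of reference for that gating, here is a minimal sketch of the idea behind an "ifFullyUpgraded"-style wrapper, using a home-made gate rather than the actual dependency-engine manifold code: the wrapped start function cannot run until the upgrade gate unlocks, which is why finding Raft alive on a node that has not finished upgrading is surprising:

package main

import (
	"fmt"
	"sync"
)

// gate is a one-shot latch standing in for the "upgrade steps completed"
// flag; it is not the real dependency-engine gate worker.
type gate struct {
	once sync.Once
	ch   chan struct{}
}

func newGate() *gate { return &gate{ch: make(chan struct{})} }

func (g *gate) Unlock() { g.once.Do(func() { close(g.ch) }) }

func (g *gate) Unlocked() <-chan struct{} { return g.ch }

// ifFullyUpgraded wraps a worker start function so it cannot run until the
// upgrade gate has been unlocked.
func ifFullyUpgraded(g *gate, start func() error) func() error {
	return func() error {
		<-g.Unlocked()
		return start()
	}
}

func main() {
	g := newGate()
	startRaft := ifFullyUpgraded(g, func() error {
		fmt.Println("raft worker started")
		return nil
	})

	done := make(chan error)
	go func() { done <- startRaft() }()

	// Until the gate is unlocked, the wrapped worker stays blocked.
	g.Unlock()
	fmt.Println("start result:", <-done)
}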

It's getting interesting here. Look at the back-off times versus the log timestamps. This is the sort of logging we might expect to see if multiple instances of the same worker were running:
https://pastebin.canonical.com/p/XjTWzDz59G/

Here we see the start of our tribulation. The DB upgrade completes and the upgrade-steps runner starts up, but the upgrader worker sees the agent version for OS instead of series and tanks all the workers for another upgrade before the controllers have coordinated the steps for 2.9.0.
https://pastebin.canonical.com/p/GVz6tWc3FJ/

The other controllers will be doing this same dance and we see the dependency engine thrash a bit before the upgrade-steps worker can run successfully.
https://pastebin.canonical.com/p/yf2vpWVfRj/

John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

I tried to create some log files that had the right content in them for debugging. The full logs are still useful if we need to dig, but these concatenate all the log files and then snip them from 12:35 until 13:00 across all 3 controller machines.

Changed in juju:
milestone: 2.9.4 → 2.9.5
Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.5 → 2.9.6
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.6 → 2.9.5