upgrade-series hangs, leaves lock behind

Bug #1855013 reported by Xav Paice
This bug affects 6 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Running 2.6.10, Xenial.

I ran juju upgrade-series 12 prepare bionic, which looked to have completed the hooks but then hung for 20+ mins. I ended up ctrl-c'ing the command.

This left a lock behind, preventing retry of the prepare, and preventing complete:

https://pastebin.canonical.com/p/9ptW9CX9Yy/

Fixed with:

db.machineUpgradeSeriesLocks.update({ "_id" : "55ab2ebd-6e66-4d23-8549-aee9f03fae13:12" }, { $set: { "machine-status" : "prepare completed" } } )

Is there a better way to do this? Two things come to mind. First, don't ctrl-c; I should have spent more time digging into what was hanging. Second, we could do with a retry or a force option, in case I wasn't running this in a tmux and lost my session to the controller (which happens when we work remotely).

Changed in juju:
status: New → Triaged
importance: Undecided → High
Changed in juju:
milestone: none → 2.8-beta1
milestone: 2.8-beta1 → none
Diko Parvanov (dparv) wrote :

Affected by similar behavior. While doing a series upgrade from xenial to bionic, running the complete action at the end, one of the units hung for a long time. I had to ctrl+c as nothing had happened for 30+ minutes.

Afterwards I finished running the post-series-upgrade on the remaining units, then tried to add an lxd to one of the hosts and got:

ERROR acquiring machine to host unit <UNIT> machine "X" is locked for series upgrade

Then I checked mongodb and found out that all machines were in "complete started" machine-status.

Fixed it manually by updating the "complete started" to "completed" for all machines (all were affected by this):

db.machineUpgradeSeriesLocks.update({}, { $set: { "machine-status" : "completed" } } )

Tried to re-do the series upgrade or complete actions but juju did not allow it.

Changed in juju:
milestone: none → 2.8-beta1
Felipe Reyes (freyes)
tags: added: seg
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8-beta1 → 2.8.1
tags: added: upgradeseries
Tim Penhey (thumper) wrote :

Perhaps we add a sub-command "upgrade-series [id] reset" which rolls back the prepare.

@manadart, does this sound reasonable?

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.1 → 2.8.2
Changed in juju:
milestone: 2.8.2 → 2.8.3
Pen Gale (pengale)
Changed in juju:
milestone: 2.8.4 → none
David van der Spek (vanderspek-david) wrote :

I'm also facing this issue when trying to upgrade a Kubernetes node. The prepare command seemed to be stuck, and after ctrl-c I can neither retry the prepare nor mark it as complete. Being able to cancel a hanging command without causing a seemingly non-repairable state seems like an essential feature for production deployments.

Pen Gale (pengale) wrote :

@vanderspek-david: what version of Juju are you running in the model w/ the hung node?

A lot of upgrade series issues were fixed in 2.8.x. I'm not sure whether this issue is still outstanding, or has already been addressed.

Jose Guedez (jfguedez) wrote :

Hit this today on juju 2.9.27, Bionic => Focal. On the "complete" step after the upgrade, two subordinates got stuck with 'hook failed: "post-series-upgrade"', apparently while resuming services, like this:

2022-06-16 02:32:06 WARNING unit.ceilometer-agent/4.post-series-upgrade logger.go:60 raise Exception("Couldn't resume: {}".format("; ".join(messages)))
2022-06-16 02:32:06 WARNING unit.ceilometer-agent/4.post-series-upgrade logger.go:60 Exception: Couldn't resume: ceilometer-agent-compute didn't resume cleanly.; Services not running that should be: ceilometer-agent-compute
2022-06-16 02:32:06 ERROR juju.worker.uniter.operation runhook.go:146 hook "post-series-upgrade" (via explicit, bespoke hook script) failed: exit status 1

However, I was able to systemctl start the services without issues. It looks like juju never retried the hook or checked again, so the units remained stuck.

Ian Booth (wallyworld) wrote :

As Tim says in a previous comment, there's currently a lack of tooling to recover when a unit's post-series-upgrade hook fails.

The PR https://github.com/juju/juju/pull/14163 helps, but manual intervention may still be needed. You can directly "reset" a failed unit's upgrade series lock by updating the status in the machineUpgradeSeriesLocks collection, e.g.:

db.machineUpgradeSeriesLocks.updateOne({"_id" : "b77c7406-5dfa-42fc-8968-e69608a00f9c:19"}, {$set: {"unit-statuses.neutron-openvswitch/4.status": "complete started"}})

Replace the model UUID above and the machine "19" and unit name "neutron-openvswitch/4" with the relevant values.
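
As a rough sketch of that substitution, the snippet below builds the filter and update documents from their parts. The helper names (lockId, unitStatusReset) are hypothetical and not part of Juju. The "<model-uuid>:<machine-id>" key format and the "unit-statuses.<unit>.status" field path are taken only from the examples in this thread, so confirm them against your own lock document before changing anything.

```javascript
// Hypothetical helpers for illustration only; they are not part of Juju.

// Lock documents in this thread are keyed as "<model-uuid>:<machine-id>".
function lockId(modelUUID, machineId) {
  return modelUUID + ":" + machineId;
}

// Build the filter and update documents that reset one unit's status.
function unitStatusReset(modelUUID, machineId, unitName, status) {
  var set = {};
  // Dotted path into the nested unit-statuses document, as in the example above.
  set["unit-statuses." + unitName + ".status"] = status;
  return {
    filter: { _id: lockId(modelUUID, machineId) },
    update: { $set: set }
  };
}

// Values copied from the example above; replace them with your own.
var reset = unitStatusReset(
  "b77c7406-5dfa-42fc-8968-e69608a00f9c",
  "19",
  "neutron-openvswitch/4",
  "complete started"
);

// In the mongo shell, inspect the document first, then apply the update:
//   db.machineUpgradeSeriesLocks.find(reset.filter).pretty()
//   db.machineUpgradeSeriesLocks.updateOne(reset.filter, reset.update)
```

Running the find before the updateOne shows the current machine-status and unit-statuses values, which makes it easier to check that the filter matches exactly one lock document.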
