Canonical Juju

upgrade-series hangs, leaves lock behind

Bug #1855013 reported by Xav Paice on 2019-12-03

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Triaged	High	Unassigned

Bug Description

Running 2.6.10, Xenial.

I ran juju upgrade-series 12 prepare bionic, which looked to have completed the hooks but then hung for 20+ mins. I ended up ctrl-c'ing the command.

This left a lock behind, preventing retry of the prepare, and preventing complete:

https://pastebin.canonical.com/p/9ptW9CX9Yy/

Fixed with:

db.machineUpgradeSeriesLocks.update({ "_id" : "55ab2ebd-6e66-4d23-8549-aee9f03fae13:12" }, { $set: { "machine-status" : "prepare completed" } } )

Is there a better way to do this? Two things come to mind. First, don't ctrl-c: I should have spent more time digging into what was hanging. Second, we could do with a retry or a force option, in case I wasn't running this in tmux and lost my session to the controller (which happens when we work remotely).

Joseph Phillips (manadart) on 2019-12-04

Changed in juju:
status:	New → Triaged
importance:	Undecided → High

Richard Harding (rharding) on 2019-12-04

Changed in juju:
milestone:	none → 2.8-beta1
milestone:	2.8-beta1 → none

Diko Parvanov (dparv) wrote on 2020-02-12:

Affected by similar behavior. While doing a series upgrade from xenial to bionic, running the complete action at the end, one of the units hung for a long time. I had to ctrl+c as nothing was happening for 30+ minutes.

Afterwards I finished running the post-series-upgrade on the remaining units, then tried to add an LXD container to one of the hosts and got:

ERROR acquiring machine to host unit <UNIT> machine "X" is locked for series upgrade

Then I checked mongodb and found out that all machines were in "complete started" machine-status.

Fixed it manually by updating the "complete started" to "completed" for all machines (all were affected by this):

db.machineUpgradeSeriesLocks.update({}, { $set: { "machine-status" : "completed" } } )

Tried to re-do the series upgrade or complete actions but juju did not allow it.
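
A minimal sketch of a narrower version of the update above, assuming the goal is to touch only the locks that are actually stuck, and only in one model. The helper name stuckLockUpdate is hypothetical; the collection name, the "machine-status" field, and the "<model-uuid>:<machine-id>" shape of _id are taken from the commands in this report. Note that the legacy update shell helper modifies only the first matching document unless multi: true is passed, so the sketch targets updateMany instead.

```javascript
// Hypothetical helper (not part of Juju): builds the filter and update
// documents for an updateMany call that only touches locks in one model
// that are currently in `fromStatus`.
function stuckLockUpdate(modelUUID, fromStatus, toStatus) {
  // Escape regex metacharacters so the UUID is matched literally.
  var escaped = modelUUID.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return {
    filter: {
      // _id has the form "<model-uuid>:<machine-id>" in the examples here.
      _id: { $regex: "^" + escaped + ":" },
      "machine-status": fromStatus
    },
    update: { $set: { "machine-status": toStatus } }
  };
}

var stuck = stuckLockUpdate(
  "55ab2ebd-6e66-4d23-8549-aee9f03fae13",
  "complete started",
  "completed"
);

// In the mongo shell, against the Juju database:
//   db.machineUpgradeSeriesLocks.find(stuck.filter)   // review first
//   db.machineUpgradeSeriesLocks.updateMany(stuck.filter, stuck.update)
```

Running the find first shows exactly which lock documents would change before anything is written.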

Richard Harding (rharding) on 2020-02-12

Changed in juju:
milestone:	none → 2.8-beta1

Felipe Reyes (freyes) on 2020-04-01

tags:

added: seg

Ian Booth (wallyworld) on 2020-04-02

Changed in juju:
milestone:	2.8-beta1 → 2.8.1

Joseph Phillips (manadart) on 2020-05-20

tags:

added: upgradeseries

Tim Penhey (thumper) wrote on 2020-06-08:

Perhaps we add a sub-command "upgrade-series [id] reset" which rolls back the prepare.

@manadart, does this sound reasonable?

Ian Booth (wallyworld) on 2020-07-03

Changed in juju:
milestone:	2.8.1 → 2.8.2

Canonical Juju QA Bot (juju-qa-bot) on 2020-09-15

Changed in juju:
milestone:	2.8.2 → 2.8.3

Pen Gale (pengale) on 2020-09-23

Changed in juju:
milestone:	2.8.4 → none

David van der Spek (vanderspek-david) wrote on 2020-12-20:

I'm also facing this issue when trying to upgrade a Kubernetes node. The prepare command seemed to be stuck; after ctrl-c I can now neither retry the prepare nor mark it as complete. Being able to cancel a hanging command without causing a seemingly non-repairable state seems like an essential feature for production deployments.

Pen Gale (pengale) wrote on 2021-01-04:

@vanderspek-david: what version of Juju are you running in the model w/ the hung node?

A lot of upgrade series issues were fixed in 2.8.x. I'm not sure whether this issue is still outstanding, or has already been addressed.

Jose Guedez (jfguedez) wrote on 2022-06-16:

Hit this today on juju 2.9.27, Bionic => Focal. On the "complete" step after the upgrade two subordinates got stuck with 'hook failed: "post-series-upgrade"', apparently while resuming services like this:

2022-06-16 02:32:06 WARNING unit.ceilometer-agent/4.post-series-upgrade logger.go:60 raise Exception("Couldn't resume: {}".format("; ".join(messages)))
2022-06-16 02:32:06 WARNING unit.ceilometer-agent/4.post-series-upgrade logger.go:60 Exception: Couldn't resume: ceilometer-agent-compute didn't resume cleanly.; Services not running that should be: ceilometer-agent-compute
2022-06-16 02:32:06 ERROR juju.worker.uniter.operation runhook.go:146 hook "post-series-upgrade" (via explicit, bespoke hook script) failed: exit status 1

However, I was able to systemctl start the services without issues. It looks like juju never retried the hook or checked again, and the units remained stuck.

Ian Booth (wallyworld) wrote on 2022-06-20:

As Tim says in a previous comment, there's currently a lack of tooling to recover when a unit's post-series-upgrade hook fails.

This PR https://github.com/juju/juju/pull/14163 helps, but manual intervention may still be needed. You can directly "reset" a failed unit's upgrade series lock by updating the status in the machineUpgradeSeriesLocks collection, e.g.:

db.machineUpgradeSeriesLocks.updateOne({"_id" : "b77c7406-5dfa-42fc-8968-e69608a00f9c:19"}, {$set: {"unit-statuses.neutron-openvswitch/4.status": "complete started"}})

Replace the model UUID above and the machine "19" and unit name "neutron-openvswitch/4" with the relevant values.
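
The substitution described above can be sketched as a small helper, assuming the _id is always "<model-uuid>:<machine-id>" and the per-unit field is "unit-statuses.<unit-name>.status", as in the command above. The function name unitLockReset is hypothetical and not part of Juju.

```javascript
// Hypothetical helper (not part of Juju): builds the filter and update
// documents for resetting one unit's status in a machine's
// upgrade-series lock.
function unitLockReset(modelUUID, machineId, unitName, status) {
  var set = {};
  // Dotted path into the lock document, e.g.
  // "unit-statuses.neutron-openvswitch/4.status".
  set["unit-statuses." + unitName + ".status"] = status;
  return {
    filter: { _id: modelUUID + ":" + machineId },
    update: { $set: set }
  };
}

var reset = unitLockReset(
  "b77c7406-5dfa-42fc-8968-e69608a00f9c",
  "19",
  "neutron-openvswitch/4",
  "complete started"
);

// In the mongo shell, against the Juju database:
//   db.machineUpgradeSeriesLocks.updateOne(reset.filter, reset.update)
```

Building the documents from the three values keeps the model UUID, machine ID, and unit name in one place, which avoids editing the quoted strings by hand for each stuck unit.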
