Stuck in post-series-upgrade even after the application error is manually resolved

Bug #1948906 reported by Hemanth Nakkina
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips

Bug Description

Hi

I am performing a series upgrade from bionic to focal.
The post-series-upgrade hook failed with a known error from the application.
I resolved the application error manually.
However, the post-series-upgrade hook remains stuck with the following error, and the upgrade never completes.

Error message in juju unit logs:
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:339 unit "designate-bind/1" started
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:357 hooks are retried true
2021-10-27 06:28:14 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-upgrade" hook
2021-10-27 06:28:19 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-upgrade" hook
2021-10-27 06:28:19 ERROR juju.worker.uniter.operation runhook.go:200 error updating workload status before post-series-upgrade hook: upgrade series status "complete running"
2021-10-27 06:28:19 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 INFO juju.worker.uniter uniter.go:323 unit "designate-bind/1" shutting down: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
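
When several units are involved, it can help to scan the unit logs for this specific rejection. The following is a minimal sketch, assuming the log format shown above; the regular expression is derived from the quoted lines rather than from Juju's source, and `find_stuck_units` is a hypothetical helper name.

```python
import re

# Pattern derived from the log lines quoted in this report (an assumption,
# not taken from Juju's source code).
STUCK_RE = re.compile(
    r'executing operation "run (?P<hook>(?:pre|post)-series-upgrade) hook" '
    r'for (?P<unit>\S+): upgrade series status "(?P<status>[^"]+)"'
)


def find_stuck_units(log_lines):
    """Return {unit: (hook, status)} for units whose hook retry was rejected."""
    stuck = {}
    for line in log_lines:
        match = STUCK_RE.search(line)
        if match:
            stuck[match["unit"]] = (match["hook"], match["status"])
    return stuck


sample = [
    '2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:357 hooks are retried true',
    '2021-10-27 06:28:19 ERROR juju.worker.uniter agent.go:31 resolver loop error: '
    'executing operation "run post-series-upgrade hook" for designate-bind/1: '
    'upgrade series status "complete running"',
]
print(find_stuck_units(sample))
```

A unit that shows up here with status "complete running" (or "prepare running") matches the symptom described in this report.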

I had to edit MongoDB directly to recover from this.
Even though Juju does not expect the application to fail, there should be a mechanism to recover from this situation without changing the database.
(Also note that upgrading the application unit to a charm revision that fixes the application issue does not return the agent status to idle.)

Reproducer Steps:
1. juju add-model test
2. juju deploy cs:designate-bind-34 --series bionic
3. juju upgrade-series 1 prepare focal
4. # ssh to designate-bind unit and perform upgrade
5. juju upgrade-series 1 complete
   # This hangs
6. Manually fix the application error
   juju ssh designate-bind/1 sudo dpkg-reconfigure bind9
7. juju run -u designate-bind/1 -- hooks/update-status
   # After this step, the workload status is Active but the agent is in the Failed state.

Analysis:
After step 5, the upgrade-series lock document in MongoDB looks like this.
Note that the unit status for designate-bind/1 is "complete running".

db.machineUpgradeSeriesLocks.find().forEach(printjson)
{
 "_id" : "d3f7c58a-a55f-495c-8b23-bf86deb93a6b:1",
 "machine-id" : "1",
 "to-series" : "focal",
 "from-series" : "bionic",
 "machine-status" : "complete started",
 "messages" : [
  {
   "message" : "machine-1 validation of upgrade series from \"bionic\" to \"focal\"",
   "timestamp" : ISODate("2021-10-27T05:57:44.583Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 started upgrade series from \"bionic\" to \"focal\"",
   "timestamp" : ISODate("2021-10-27T05:57:44.669Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 pre-series-upgrade hook running",
   "timestamp" : ISODate("2021-10-27T05:57:44.783Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 pre-series-upgrade completed",
   "timestamp" : ISODate("2021-10-27T05:57:48.824Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 binaries and service files written",
   "timestamp" : ISODate("2021-10-27T05:57:48.964Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 complete phase started",
   "timestamp" : ISODate("2021-10-27T06:23:40.403Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 start units after series upgrade",
   "timestamp" : ISODate("2021-10-27T06:23:40.509Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 post-series-upgrade hook running",
   "timestamp" : ISODate("2021-10-27T06:23:40.646Z"),
   "seen" : true
  }
 ],
 "timestamp" : ISODate("2021-10-27T06:23:40.646Z"),
 "unit-statuses" : {
  "designate-bind/1" : {
   "status" : "complete running",
   "timestamp" : ISODate("2021-10-27T06:23:40.646Z")
  }
 },
 "model-uuid" : "d3f7c58a-a55f-495c-8b23-bf86deb93a6b",
 "txn-revno" : NumberLong(16),
 "txn-queue" : [
  "6178f06c3f0b39661e6cb75a_e44848a8"
 ]
}

As part of the post-series-upgrade beforeHook, Juju expects the unit to be in the "complete started" state so that the FSM can move it to "complete running". Since the unit is already in "complete running", the agent's retry of the post-series-upgrade hook fails in beforeHook.
https://github.com/juju/juju/blob/develop/worker/uniter/operation/runhook.go#L196
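
The failure mode can be illustrated with a deliberately simplified model of that guard. This is a sketch under assumptions, not Juju's actual code: only the two statuses quoted in this report are modeled, and the function and exception names are hypothetical.

```python
# Simplified model of the guard described above; NOT Juju's implementation.
COMPLETE_STARTED = "complete started"
COMPLETE_RUNNING = "complete running"


class UpgradeSeriesStatusError(Exception):
    """Raised when the unit is not in the status the guard expects."""


def before_post_series_upgrade_hook(current_status):
    """Return the status to record before running the hook, or raise."""
    if current_status != COMPLETE_STARTED:
        # Mirrors the logged error text: upgrade series status "complete running"
        raise UpgradeSeriesStatusError(f'upgrade series status "{current_status}"')
    return COMPLETE_RUNNING


# First attempt: the unit moves from "complete started" to "complete running".
status = before_post_series_upgrade_hook(COMPLETE_STARTED)

# The hook then fails and the status stays at "complete running",
# so every retry is rejected by the same guard.
try:
    before_post_series_upgrade_hook(status)
except UpgradeSeriesStatusError as err:
    retry_error = str(err)
    print(retry_error)
```

Because nothing in this model moves the status back to "complete started" after a failed hook, resolving the application error alone does not unblock the retry.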

Workaround:
Update the unit status to "complete started" in MongoDB:

db.machineUpgradeSeriesLocks.update(
    { "machine-id" : "1" },
    {
      $set: {
        "unit-statuses.designate-bind/1.status": "complete started"
      }
    }
)
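
The same update can be built programmatically for other machines and units. The sketch below is a hypothetical helper (`build_unit_status_reset` is not part of Juju) that only constructs the filter and update documents. It filters on `_id`, which in the dump above has the form "<model-uuid>:<machine-id>", so that a machine with the same ID in another model is not matched. Passing the result to a MongoDB driver (for example pymongo's `update_one`) is an assumption and is not shown.

```python
def build_unit_status_reset(model_uuid, machine_id, unit_name,
                            status="complete started"):
    """Build the (filter, update) pair that resets one unit's status."""
    # The lock document's _id is "<model-uuid>:<machine-id>" in the dump above.
    doc_filter = {"_id": f"{model_uuid}:{machine_id}"}
    # Dotted path into the nested "unit-statuses" sub-document.
    update = {"$set": {f"unit-statuses.{unit_name}.status": status}}
    return doc_filter, update


doc_filter, update = build_unit_status_reset(
    "d3f7c58a-a55f-495c-8b23-bf86deb93a6b", "1", "designate-bind/1"
)
print(doc_filter)
print(update)
```

As with the manual workaround, this edits Juju's internal state directly, so it should be treated as a last resort.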

After 5-10 minutes, the post-series-upgrade hook is retriggered automatically and the agent returns to the idle state.

Tags: seg
description: updated
Simon Richardson (simonrichardson) wrote :

> there should be mechanism to recover from the situation without changing the database.

This is a known issue, but you are right, we should do better here.

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: seg
DUFOUR Olivier (odufourc) wrote :

From my experience over the past few days, this can also happen with the pre-series-upgrade hook, with very similar behavior: the unit gets stuck in a loop, for various reasons.

I may add another reproducer as well.

The environment is the following :
MaaS : 3.1
Juju : 2.9.31
Openstack : bionic-ussuri

The goal is to upgrade from bionic-ussuri to focal-ussuri.

As per the following documents:
https://juju.is/docs/olm/upgrade-a-machines-series
https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/upgrade-series.html
The following steps are executed:
juju upgrade-series <machine-id> prepare focal
(manual steps below including do-release-upgrade)
juju upgrade-series <machine-id> complete

If the pre-series-upgrade hook in the “upgrade-series prepare” step or the post-series-upgrade hook in the “upgrade-series complete” step fails, there is no way to recover the unit from the “blocked” status and an infinite loop occurs[1][2]. The unit stays stuck in that state even after errors in the OS layer, such as APT package errors, are resolved.

How to reproduce:
deploy aodh on bionic by Juju with the bundle attached to this ticket
* juju deploy upgrade-issue-bundle.yaml
do upgrade-series prepare
* juju upgrade-series 0/lxd/0 prepare focal
do release-upgrade (for testing purpose, this is optional)
* juju run --machine 0/lxd/0 --timeout=60m \
  sudo DEBIAN_FRONTEND=noninteractive \
  do-release-upgrade -f DistUpgradeViewNonInteractive
break apt on purpose
* sudo ln -s /bin/false /usr/local/bin/apt-get
do upgrade-series complete
* juju upgrade-series 0/lxd/0 complete
fix apt
* sudo rm /usr/local/bin/apt-get

After that, there is no way to complete or rerun the “complete” step, since the unit is stuck in the blocked status. Executing the same command again errors out:

$ juju upgrade-series 0/lxd/0 complete
-> ERROR machine "0/lxd/0" can not complete, it is either not prepared or already completed

[1] in upgrade-series prepare
2022-06-13 07:06:10 ERROR juju.worker.dependency engine.go:693 "uniter" manifold worker returned unexpected error: executing operation "run pre-series-upgrade hook" for placement/5: upgrade series status "prepare running"
2022-06-13 07:08:21 ERROR juju.worker.uniter.operation runhook.go:194 error updating workload status before pre-series-upgrade hook: upgrade series status "prepare running"
2022-06-13 07:08:21 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run pre-series-upgrade hook" for placement/5: upgrade series status "prepare running"
2022-06-13 07:08:21 ERROR juju.worker.dependency engine.go:693 "uniter" manifold worker returned unexpected error: executing operation "run pre-series-upgrade hook" for placement/5: upgrade series status "prepare running"

[2] in upgrade-series complete
2022-06-13 07:42:13 ERROR juju.worker.dependency engine.go:693 "uniter" manifold worker returned unexpected error: executing operation "run post-series-upgrade hook" for placement/4: upgrade series status "complete running"
2022-06-13 07:44:10 ERROR juju.worker.uniter.operation runhook.go:194 error updating workload status before post-series-upgrade hook: upgrade series status "complete running"
2022-06-13 07:44:10 ERROR juju.worker.unite...


DUFOUR Olivier (odufourc) wrote :

subscribed ~field-high

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
milestone: none → 2.9.33
Joseph Phillips (manadart) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released