Comment 0 for bug 1948906

Hemanth Nakkina (hemanth-n) wrote :

Hi

I am performing a series upgrade from bionic to focal.
The post-series-upgrade hook failed with a known error from the application.
I resolved the application error manually.
However, the post-series-upgrade hook remains stuck with the following error and the upgrade never completes.

Error message in juju unit logs:
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:339 unit "designate-bind/1" started
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:357 hooks are retried true
2021-10-27 06:28:14 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-upgrade" hook
2021-10-27 06:28:19 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-upgrade" hook
2021-10-27 06:28:19 ERROR juju.worker.uniter.operation runhook.go:200 error updating workload status before post-series-upgrade hook: upgrade series status "complete running"
2021-10-27 06:28:19 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 INFO juju.worker.uniter uniter.go:323 unit "designate-bind/1" shutting down: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 ERROR juju.worker.dependency engine.go:671 "uniter" manifold worker returned unexpected error: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"

I had to edit mongodb directly to recover from this.
Even though juju does not expect the application to fail, there should be a mechanism to recover from this situation without changing the database.
(Also note that upgrading the application unit to the charm revision that contains the fix for the application issue does not return the agent status to idle.)

Reproducer Steps:
1. juju add-model test
2. juju deploy cs:designate-bind-34 --series bionic
3. juju upgrade-series 1 prepare focal
4. # ssh to designate-bind unit and perform upgrade
5. juju upgrade-series 1 complete
   # This hangs
6. Manually fix the application error
   juju ssh designate-bind/1 sudo dpkg-reconfigure bind9
7. juju run -u designate-bind/1 -- hooks/update-status
   # After this step, the workload status is Active but the agent is in the Failed state.

Analysis:
After step 5, the machineUpgradeSeriesLocks document in mongodb looks like this.
Note that the unit status for designate-bind/1 is "complete running".

db.machineUpgradeSeriesLocks.find().forEach(printjson)
{
 "_id" : "d3f7c58a-a55f-495c-8b23-bf86deb93a6b:1",
 "machine-id" : "1",
 "to-series" : "focal",
 "from-series" : "bionic",
 "machine-status" : "complete started",
 "messages" : [
  {
   "message" : "machine-1 validation of upgrade series from \"bionic\" to \"focal\"",
   "timestamp" : ISODate("2021-10-27T05:57:44.583Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 started upgrade series from \"bionic\" to \"focal\"",
   "timestamp" : ISODate("2021-10-27T05:57:44.669Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 pre-series-upgrade hook running",
   "timestamp" : ISODate("2021-10-27T05:57:44.783Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 pre-series-upgrade completed",
   "timestamp" : ISODate("2021-10-27T05:57:48.824Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 binaries and service files written",
   "timestamp" : ISODate("2021-10-27T05:57:48.964Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 complete phase started",
   "timestamp" : ISODate("2021-10-27T06:23:40.403Z"),
   "seen" : true
  },
  {
   "message" : "machine-1 start units after series upgrade",
   "timestamp" : ISODate("2021-10-27T06:23:40.509Z"),
   "seen" : true
  },
  {
   "message" : "designate-bind/1 post-series-upgrade hook running",
   "timestamp" : ISODate("2021-10-27T06:23:40.646Z"),
   "seen" : true
  }
 ],
 "timestamp" : ISODate("2021-10-27T06:23:40.646Z"),
 "unit-statuses" : {
  "designate-bind/1" : {
   "status" : "complete running",
   "timestamp" : ISODate("2021-10-27T06:23:40.646Z")
  }
 },
 "model-uuid" : "d3f7c58a-a55f-495c-8b23-bf86deb93a6b",
 "txn-revno" : NumberLong(16),
 "txn-queue" : [
  "6178f06c3f0b39661e6cb75a_e44848a8"
 ]
}

As part of the post-series-upgrade beforeHook, juju expects the unit to be in the "complete started" state so that the FSM can move it to "complete running". Because the unit is already in "complete running" from the first, failed hook run, the agent's retry of the post-series-upgrade hook fails in beforeHook.
https://github.com/juju/juju/blob/develop/worker/uniter/operation/runhook.go#L196
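To make the failure mode concrete, here is a minimal Python sketch of the check described above. It is not juju's code (the real logic is in the Go file linked above); the function name, the exception class and the in-memory dict are hypothetical, and only the two status strings and the error text are taken from the lock document and the unit log.

```python
# Status strings as they appear in the lock document and the unit log.
COMPLETE_STARTED = "complete started"
COMPLETE_RUNNING = "complete running"


class UpgradeSeriesStatusError(Exception):
    """Hypothetical error type standing in for the uniter's failure."""


def before_post_series_upgrade_hook(unit_statuses, unit):
    """Illustrative stand-in for the beforeHook status transition.

    Only "complete started" -> "complete running" is accepted; any other
    current status is reported as an error.
    """
    current = unit_statuses[unit]
    if current != COMPLETE_STARTED:
        raise UpgradeSeriesStatusError(f'upgrade series status "{current}"')
    unit_statuses[unit] = COMPLETE_RUNNING


# First attempt: the transition succeeds, then the charm hook itself fails,
# so the status is left at "complete running".
statuses = {"designate-bind/1": COMPLETE_STARTED}
before_post_series_upgrade_hook(statuses, "designate-bind/1")

# Retry: the unit is no longer in "complete started", so the check fails
# before the hook body can run again.
try:
    before_post_series_upgrade_hook(statuses, "designate-bind/1")
    retry_error = None
except UpgradeSeriesStatusError as err:
    retry_error = str(err)
```

In this model nothing ever moves the status back to "complete started" after a failed hook run, which matches the behaviour seen here: every retry hits the same check and the uniter shuts down.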

Workaround:
Update the unit status to "complete started" in mongodb

db.machineUpgradeSeriesLocks.update(
    { "machine-id" : "1" },
    {
      $set: {
        "unit-statuses.designate-bind/1.status": "complete started"
      }
    }
)

After 5-10 minutes, the post-series-upgrade hook is retriggered automatically and the agent returns to the idle state.
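
A slightly more defensive form of the same update could match on the document "_id" and on the stuck status, so that it only touches the one lock that is actually in "complete running". The Python sketch below only builds the query and update documents; the helper name is hypothetical, and the "_id" layout ("<model-uuid>:<machine-id>") is assumed from the document dump above, not from juju documentation. I have not run this variant against a live controller.

```python
def build_recovery_update(model_uuid, machine_id, unit):
    """Return (query, update) documents for resetting a stuck unit status.

    Hypothetical helper: it only constructs plain dicts and does not
    connect to mongodb.
    """
    status_path = f"unit-statuses.{unit}.status"
    query = {
        # Assumed "_id" layout, as seen in the dump: "<model-uuid>:<machine-id>".
        "_id": f"{model_uuid}:{machine_id}",
        # Only match while the unit is still stuck in "complete running".
        status_path: "complete running",
    }
    update = {"$set": {status_path: "complete started"}}
    return query, update


query, update = build_recovery_update(
    "d3f7c58a-a55f-495c-8b23-bf86deb93a6b", "1", "designate-bind/1"
)
```

The two dicts correspond to the first and second arguments of the db.machineUpgradeSeriesLocks.update(...) call above. If the unit has already left "complete running", the query matches nothing and the update is a no-op.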