Stuck in post-series-upgrade even after the application error is manually resolved
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju | Fix Released | High | Joseph Phillips | | |
Bug Description
Hi,
I was performing a series upgrade from bionic to focal when the post-series-upgrade hook failed with a known error from the application.
I resolved the application error manually, but the post-series-upgrade hook stays stuck with the following error and the upgrade never completes.
Error messages in the Juju unit log:
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:339 unit "designate-bind/1" started
2021-10-27 06:28:14 INFO juju.worker.uniter uniter.go:357 hooks are retried true
2021-10-27 06:28:14 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-
2021-10-27 06:28:19 INFO juju.worker.uniter resolver.go:150 awaiting error resolution for "post-series-
2021-10-27 06:28:19 ERROR juju.worker.
2021-10-27 06:28:19 ERROR juju.worker.uniter agent.go:31 resolver loop error: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 INFO juju.worker.uniter uniter.go:323 unit "designate-bind/1" shutting down: executing operation "run post-series-upgrade hook" for designate-bind/1: upgrade series status "complete running"
2021-10-27 06:28:19 ERROR juju.worker.
I had to edit MongoDB directly to recover from this.
Even though Juju does not expect the application to fail, there should be a mechanism to recover from the situation without changing the database.
(Also note that upgrading the application unit to a charm revision that contains the fix for the application issue does not return the agent status to idle.)
Reproducer Steps:
1. juju add-model test
2. juju deploy cs:designate-
3. juju upgrade-series 1 prepare focal
4. # ssh to designate-bind unit and perform upgrade
5. juju upgrade-series 1 complete
# This command hangs
6. Manually fix the application error
juju ssh designate-bind/1 sudo dpkg-reconfigure bind9
7. juju run -u designate-bind/1 -- hooks/update-status
# After this step, the workload status is Active but the agent is in the Failed state.
Analysis:
After step 5, the upgrade series lock in MongoDB looks like this.
Note that the unit status for designate-bind/1 is "complete running".
db.machineUpgra
{
"_id" : "d3f7c58a-
"machine-id" : "1",
"to-series" : "focal",
"from-series" : "bionic",
"machine-status" : "complete started",
"messages" : [
{
"message" : "machine-1 validation of upgrade series from \"bionic\" to \"focal\"",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "machine-1 started upgrade series from \"bionic\" to \"focal\"",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "designate-bind/1 pre-series-upgrade hook running",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "designate-bind/1 pre-series-upgrade completed",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "machine-1 binaries and service files written",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "machine-1 complete phase started",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "machine-1 start units after series upgrade",
"timestamp" : ISODate(
"seen" : true
},
{
"message" : "designate-bind/1 post-series-upgrade hook running",
"timestamp" : ISODate(
"seen" : true
}
],
"timestamp" : ISODate(
"unit-statuses" : {
"designate-
"status" : "complete running",
"timestamp" : ISODate(
}
},
"model-uuid" : "d3f7c58a-
"txn-revno" : NumberLong(16),
"txn-queue" : [
"6178f06c3f0b
]
}
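The field that matters in this document is the per-unit status under "unit-statuses". As a minimal sketch, assuming the lock document has been loaded into a plain Python dictionary, the check below lists the units that are currently in "complete running". The function name `units_in_complete_running` and the sample dictionary are illustrative only and are not part of Juju; the sample holds just the fields needed for the check, with values taken from this report.

```python
def units_in_complete_running(lock_doc):
    """Return the sorted names of units whose status is "complete running".

    lock_doc is a dict shaped like the lock document shown above.
    """
    statuses = lock_doc.get("unit-statuses", {})
    return sorted(
        name
        for name, entry in statuses.items()
        if entry.get("status") == "complete running"
    )


# Hand-written stand-in for the lock document (not a real database read).
sample = {
    "machine-id": "1",
    "machine-status": "complete started",
    "unit-statuses": {
        "designate-bind/1": {"status": "complete running"},
    },
}

print(units_in_complete_running(sample))
```

A unit reported here is not necessarily broken, since a hook that is still running also has this status. It only becomes a problem when the hook has failed and is being retried.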
As part of the post-series-upgrade beforeHook, Juju expects the unit to be in the "complete started" state so that the FSM can move it to "complete running". Because the unit is already in "complete running", the agent's retry of the post-series-upgrade hook fails in beforeHook.
https:/
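The failure mode described in this analysis can be modelled in a few lines. The sketch below is a simplified Python model of the state check, not Juju's actual Go code; the names `before_post_series_upgrade_hook` and `UpgradeSeriesStatusError` are hypothetical, and only the two status strings come from this report.

```python
COMPLETE_STARTED = "complete started"
COMPLETE_RUNNING = "complete running"


class UpgradeSeriesStatusError(Exception):
    """Raised when the unit is not in the status the hook expects."""


def before_post_series_upgrade_hook(current_status):
    """Model of the pre-hook check: only "complete started" may advance.

    Returns the new status on success and raises otherwise.
    """
    if current_status != COMPLETE_STARTED:
        raise UpgradeSeriesStatusError(
            f'upgrade series status "{current_status}"'
        )
    return COMPLETE_RUNNING


# First attempt: the status advances before the hook body runs.
status = before_post_series_upgrade_hook(COMPLETE_STARTED)

# The hook body then fails. On retry the status is still "complete running",
# so the same check rejects the operation before the hook can run again.
try:
    before_post_series_upgrade_hook(status)
except UpgradeSeriesStatusError as err:
    print(f"retry rejected: {err}")
```

In this model the transition is not idempotent: once the status has advanced, a retry can never pass the check unless something resets the status, which is what the manual database change does.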
Workaround:
Update the unit status to "complete started" in mongodb
db.machineUpgra
{ "machine-id" : "1" },
{
$set: {
}
}
)
After 5-10 minutes, the post-series-upgrade hook is retriggered automatically and the agent returns to the idle state.
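The body of the update is elided in this report, so its exact contents are unknown. As a sketch of what a reset of this kind could look like, the Python below builds a filter and a `$set` document for one unit, assuming the status lives at `unit-statuses.<unit name>.status` as the lock document in the analysis suggests. The helper name `build_status_reset` is hypothetical. The code only constructs dictionaries and does not connect to any database, and the field path should be checked against the real document before anything is applied.

```python
def build_status_reset(machine_id, unit_name, status="complete started"):
    """Build (filter, update) documents that reset one unit's status.

    Assumption: the status is stored at "unit-statuses.<unit_name>.status".
    """
    filter_doc = {"machine-id": machine_id}
    update_doc = {"$set": {f"unit-statuses.{unit_name}.status": status}}
    return filter_doc, update_doc


filter_doc, update_doc = build_status_reset("1", "designate-bind/1")
print(filter_doc)
print(update_doc)
```

Using `$set` on the single nested field leaves the rest of the lock document, including the message history, untouched.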
description: | updated |
tags: | added: seg |
Changed in juju: | |
status: | Triaged → In Progress |
assignee: | nobody → Joseph Phillips (manadart) |
milestone: | none → 2.9.33 |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
status: | Fix Committed → Fix Released |
> there should be mechanism to recover from the situation without changing the database.
This is a known issue, but you are right, we should do better here.