juju run reboot can leave units stuck in a failed state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
Eric Claude Jones | ||
2.2 |
Fix Released
|
High
|
Eric Claude Jones |
Bug Description
I'm having trouble reproducing this on a simpler system but I'll report it anyway. There is a certain amount of conjecture here, but enough evidence to suggest that this is the explanation. Our model contained a number of units which were reporting as "failed". Investigating one such unit showed that the logs contained a lot of failures like:
2017-05-31 09:46:21 ERROR juju.worker.
action 266c44a2-
2017-05-31 09:46:28 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-
159c89": transaction aborted
2017-05-31 09:46:28 ERROR juju.worker.
action 266c44a2-
2017-05-31 09:46:34 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-
159c89": transaction aborted
2017-05-31 09:46:34 ERROR juju.worker.
action 266c44a2-
2017-05-31 09:46:40 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-
159c89": transaction aborted
So the uniter was getting transaction errors trying to fail an action. The transaction is failing because in the database the action is already marked as complete.
The transaction is:
{
"_id" : ObjectId(
"s" : 5,
"o" : [
{
},
{
}
],
"n" : "40a334e1"
}
And the mongodb document it operates on is:
{
"_id" : "cfac1b37-
"model-uuid" : "cfac1b37-
"receiver" : "nrpe/130",
"name" : "juju-run",
"parameters" : {
"command" : "reboot",
"timeout" : NumberLong(
},
"enqueued" : ISODate(
"started" : ISODate(
"completed" : ISODate(
"status" : "completed",
"message" : "",
"results" : {
"Code" : "0",
"Stderr" : "",
"Stdout" : ""
},
"txn-revno" : NumberLong(4),
"txn-queue" : [
"592e85f5c2b5
]
}
There problem here appears to be that as far as the controller is concerned the action is complete, but the uniter state contains:
leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: 266c44a2-
Which seems to mean that the unit thinks it needs to update the controller. One thing here is that if the system is in this state then it should not get stuck there. If the uniter is trying to update the state of an action that is already in a terminal state then it should return an error in such a way that the uniter stops trying to do an update that can't possibly succeed.
How we go into this situation is a little fuzzier, but it looks like the action that was running was "juju run --unit nrpe/130 reboot". My guess is that there is a race condition where the uniter runs reboot and then has enough time to send the action complete message back to the controller, but is stopped before it manages to update the local state. When the agent comes back after the reboot it tries to send the fail message to the controller, which fails in the way described above.
Changed in juju: | |
milestone: | none → 2.3-alpha1 |
milestone: | 2.3-alpha1 → none |
assignee: | nobody → Eric Claude Jones (ecjones) |
tags: | added: canonical-is |
Changed in juju: | |
status: | Triaged → In Progress |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
milestone: | none → 2.2.2 |
milestone: | 2.2.2 → none |
For completeness the controller version is 2.1.2 and the model version is 2.1.1.