juju run reboot can leave units stuck in a failed state

Bug #1694734 reported by Martin Hilton
This bug affects 4 people
Affects          Status        Importance  Assigned to        Milestone
Canonical Juju   Fix Released  High        Eric Claude Jones
2.2              Fix Released  High        Eric Claude Jones

Bug Description

I'm having trouble reproducing this on a simpler system but I'll report it anyway. There is a certain amount of conjecture here, but enough evidence to suggest that this is the explanation. Our model contained a number of units which were reporting as "failed". Investigating one such unit showed that the logs contained a lot of failures like:

2017-05-31 09:46:21 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:28 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:28 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:34 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:34 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:40 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted

So the uniter was getting transaction errors trying to fail an action. The transaction is failing because in the database the action is already marked as complete.

The transaction is:

{
        "_id" : ObjectId("592e85f5c2b5ab36c4f6ebbf"),
        "s" : 5,
        "o" : [
                {
                        "c" : "actions",
                        "d" : "cfac1b37-693f-4c39-8586-405c93a08747:266c44a2-c780-493e-8c5b-9c78f5159c89",
                        "a" : {
                                "status" : {
                                        "$nin" : [
                                                "completed",
                                                "cancelled",
                                                "failed"
                                        ]
                                }
                        },
                        "u" : {
                                "$set" : {
                                        "status" : "failed",
                                        "message" : "action terminated",
                                        "results" : {

                                        },
                                        "completed" : ISODate("2017-05-31T08:59:34Z")
                                }
                        }
                },
                {
                        "c" : "actionnotifications",
                        "d" : "cfac1b37-693f-4c39-8586-405c93a08747:nrpe/130_a_266c44a2-c780-493e-8c5b-9c78f5159c89",
                        "r" : true
                }
        ],
        "n" : "40a334e1"
}

And the MongoDB document it operates on is:

{
 "_id" : "cfac1b37-693f-4c39-8586-405c93a08747:266c44a2-c780-493e-8c5b-9c78f5159c89",
 "model-uuid" : "cfac1b37-693f-4c39-8586-405c93a08747",
 "receiver" : "nrpe/130",
 "name" : "juju-run",
 "parameters" : {
  "command" : "reboot",
  "timeout" : NumberLong("300000000000")
 },
 "enqueued" : ISODate("2017-05-30T14:31:41Z"),
 "started" : ISODate("2017-05-30T16:14:39Z"),
 "completed" : ISODate("2017-05-30T16:14:39Z"),
 "status" : "completed",
 "message" : "",
 "results" : {
  "Code" : "0",
  "Stderr" : "",
  "Stdout" : ""
 },
 "txn-revno" : NumberLong(4),
 "txn-queue" : [
  "592e85f5c2b5ab36c4f6ebbf_500fe2a3"
 ]
}
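Putting the two together, the transaction's assert clause can never pass against this document. A minimal Python illustration of that check (not Juju's actual code, just the semantics of the `$nin` assertion above):

```python
# Illustration of the transaction's assert clause: the "$nin" condition
# means the update applies only while the action is NOT already in a
# terminal state.
TERMINAL = {"completed", "cancelled", "failed"}   # the $nin list

def assert_not_terminal(doc):
    """Mirror of the transaction's "a" (assert) clause."""
    return doc["status"] not in TERMINAL

stored = {"status": "completed"}    # what the database actually holds
print(assert_not_terminal(stored))  # False, so the transaction aborts
```

Since the stored status is already "completed", the assert fails every time, and the uniter's retry loop can never make progress.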

The problem here appears to be that as far as the controller is concerned the action is complete, but the uniter state contains:

leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: 266c44a2-c780-493e-8c5b-9c78f5159c89

This seems to mean that the unit thinks it still needs to update the controller. One thing is clear: even if the system gets into this state, it should not get stuck there. If the uniter is trying to update the state of an action that is already in a terminal state, it should receive an error in such a way that it stops retrying an update that can't possibly succeed.

How we got into this situation is a little fuzzier, but it looks like the action that was running was "juju run --unit nrpe/130 reboot". My guess is that there is a race condition where the uniter runs reboot and then has enough time to send the action-complete message back to the controller, but is stopped before it manages to update the local state. When the agent comes back after the reboot it tries to send the fail message to the controller, which fails in the way described above.
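A toy timeline of that suspected race (this is an assumption mirroring the report, not confirmed Juju internals):

```python
# Toy model of the suspected race: the controller is told the action
# completed BEFORE the uniter persists its local state, so a reboot in
# between leaves the two out of sync.
controller_status = "running"
local_state = {"op": "run-action", "opstep": "pending"}

controller_status = "completed"   # 1. completion reported to controller
# 2. `reboot` kills the agent here, before local_state is updated to
#    something like {"op": "continue"}

# 3. after the reboot, local_state still points at the pending action, so
#    the agent tries to fail it; the controller's assert rejects that
#    because the action is already terminal.
retry_rejected = controller_status in {"completed", "cancelled", "failed"}
print(retry_rejected)   # True: the retry loop seen in the logs
```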

Revision history for this message
Martin Hilton (martin-hilton) wrote :

For completeness the controller version is 2.1.2 and the model version is 2.1.1.

Revision history for this message
Tim Penhey (thumper) wrote :

Possibly related to bug 1649637

tags: added: actions uniter
Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Cory Johns (johnsca) wrote :

Unsure if 1649637 has the same root cause, but I can still replicate it consistently on trusty instances: http://pastebin.ubuntu.com/24741077/

I've never seen it happen on a xenial instance that I can recall.

Revision history for this message
Cory Johns (johnsca) wrote :

It also seems that doing the reboot via `juju ssh` does not cause the failures, so the proposed transaction-related cause seems reasonable.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.3-alpha1
milestone: 2.3-alpha1 → none
assignee: nobody → Eric Claude Jones (ecjones)
Paul Gear (paulgear)
tags: added: canonical-is
Changed in juju:
status: Triaged → In Progress
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Eric Claude Jones (ecjones) wrote :

WORKAROUND

For users hitting this bug on versions prior to the one containing the fix, the following workaround can be used.

If you notice that the unit agent is stuck in continual attempts to fail an action, you can try to step the agent past that action by updating its state file manually. For example, say the logs reveal that your unit agent is stuck cancelling an action with ID f5fe333a-9662-4fc1-8dc2-045b16208e85 and your unit agent's state file looks as follows:

leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: f5fe333a-9662-4fc1-8dc2-045b16208e85

You can delete the line starting with `action-id` and change the line starting with `op` to `op: continue`. This steps the unit agent past trying to fail the specified action. It should be safe since the controller already believes the action to be cancelled (which is why it is repeatedly rejecting the uniter's requests to fail the action).
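A minimal sketch of that edit, applied here to a scratch copy of the file. On a real machine the uniter state file usually lives under /var/lib/juju/agents/unit-&lt;name&gt;-&lt;num&gt;/state/uniter (that path is an assumption for your Juju version; stop the agent and keep a backup before editing the real file):

```python
import pathlib
import tempfile

# Scratch copy of the stuck state file from the example above.
sample = """leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: f5fe333a-9662-4fc1-8dc2-045b16208e85
"""

state = pathlib.Path(tempfile.mkdtemp()) / "uniter"
state.write_text(sample)

edited = []
for line in state.read_text().splitlines():
    if line.startswith("action-id:"):
        continue                  # drop the reference to the stuck action
    if line == "op: run-action":
        line = "op: continue"     # step the uniter past the action
    edited.append(line)
state.write_text("\n".join(edited) + "\n")

print(state.read_text())
```

After the edit the agent should resume normally on its next start, since it no longer tries to report a result for the already-terminal action.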

Changed in juju:
milestone: none → 2.2.2
milestone: 2.2.2 → none
Revision history for this message
Eric Claude Jones (ecjones) wrote :
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

pr landed 26-jun-2017

Changed in juju:
status: Fix Committed → Fix Released