juju run reboot can leave units stuck in a failed state

Bug #1694734 reported by Martin Hilton
This bug affects 4 people
Affects          Status        Importance  Assigned to        Milestone
Canonical Juju   Fix Released  High        Eric Claude Jones
2.2              Fix Released  High        Eric Claude Jones

Bug Description

I'm having trouble reproducing this on a simpler system but I'll report it anyway. There is a certain amount of conjecture here, but enough evidence to suggest that this is the explanation. Our model contained a number of units which were reporting as "failed". Investigating one such unit showed that the logs contained a lot of failures like:

2017-05-31 09:46:21 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:28 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:28 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:34 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:34 ERROR juju.worker.dependency engine.go:547 "uniter" manifold worker returned unexpected error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted
2017-05-31 09:46:40 ERROR juju.worker.uniter agent.go:28 resolver loop error: executing operation "fail action 266c44a2-c780-493e-8c5b-9c78f5159c89": transaction aborted

So the uniter was getting transaction errors trying to fail an action. The transaction is failing because in the database the action is already marked as complete.

The transaction is:

{
        "_id" : ObjectId("592e85f5c2b5ab36c4f6ebbf"),
        "s" : 5,
        "o" : [
                {
                        "c" : "actions",
                        "d" : "cfac1b37-693f-4c39-8586-405c93a08747:266c44a2-c780-493e-8c5b-9c78f5159c89",
                        "a" : {
                                "status" : {
                                        "$nin" : [
                                                "completed",
                                                "cancelled",
                                                "failed"
                                        ]
                                }
                        },
                        "u" : {
                                "$set" : {
                                        "status" : "failed",
                                        "message" : "action terminated",
                                        "results" : {

                                        },
                                        "completed" : ISODate("2017-05-31T08:59:34Z")
                                }
                        }
                },
                {
                        "c" : "actionnotifications",
                        "d" : "cfac1b37-693f-4c39-8586-405c93a08747:nrpe/130_a_266c44a2-c780-493e-8c5b-9c78f5159c89",
                        "r" : true
                }
        ],
        "n" : "40a334e1"
}

And the MongoDB document it operates on is:

{
 "_id" : "cfac1b37-693f-4c39-8586-405c93a08747:266c44a2-c780-493e-8c5b-9c78f5159c89",
 "model-uuid" : "cfac1b37-693f-4c39-8586-405c93a08747",
 "receiver" : "nrpe/130",
 "name" : "juju-run",
 "parameters" : {
  "command" : "reboot",
  "timeout" : NumberLong("300000000000")
 },
 "enqueued" : ISODate("2017-05-30T14:31:41Z"),
 "started" : ISODate("2017-05-30T16:14:39Z"),
 "completed" : ISODate("2017-05-30T16:14:39Z"),
 "status" : "completed",
 "message" : "",
 "results" : {
  "Code" : "0",
  "Stderr" : "",
  "Stdout" : ""
 },
 "txn-revno" : NumberLong(4),
 "txn-queue" : [
  "592e85f5c2b5ab36c4f6ebbf_500fe2a3"
 ]
}
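Putting the two together, the transaction's assert clause can never pass against this document. A minimal Python illustration of that check (not Juju's actual code, just the semantics of the `$nin` assertion above):

```python
# Illustration of the transaction's assert clause: the "$nin" condition
# means the update applies only while the action is NOT already in a
# terminal state.
TERMINAL = {"completed", "cancelled", "failed"}   # the $nin list

def assert_not_terminal(doc):
    """Mirror of the transaction's "a" (assert) clause."""
    return doc["status"] not in TERMINAL

stored = {"status": "completed"}    # what the database actually holds
print(assert_not_terminal(stored))  # False, so the transaction aborts
```

Since the stored status is already "completed", the assert fails every time, and the uniter's retry loop can never make progress.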

The problem here appears to be that as far as the controller is concerned the action is complete, but the uniter state contains:

leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: 266c44a2-c780-493e-8c5b-9c78f5159c89

This seems to mean that the unit thinks it still needs to update the controller. One thing is clear: even if the system gets into this state, it should not get stuck there. If the uniter is trying to update the state of an action that is already in a terminal state, it should receive an error in such a way that it stops retrying an update that can't possibly succeed.

How we got into this situation is a little fuzzier, but it looks like the action that was running was "juju run --unit nrpe/130 reboot". My guess is that there is a race condition where the uniter runs reboot and then has enough time to send the action-complete message back to the controller, but is stopped before it manages to update the local state. When the agent comes back after the reboot it tries to send the fail message to the controller, which fails in the way described above.
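A toy timeline of that suspected race (this is an assumption mirroring the report, not confirmed Juju internals):

```python
# Toy model of the suspected race: the controller is told the action
# completed BEFORE the uniter persists its local state, so a reboot in
# between leaves the two out of sync.
controller_status = "running"
local_state = {"op": "run-action", "opstep": "pending"}

controller_status = "completed"   # 1. completion reported to controller
# 2. `reboot` kills the agent here, before local_state is updated to
#    something like {"op": "continue"}

# 3. after the reboot, local_state still points at the pending action, so
#    the agent tries to fail it; the controller's assert rejects that
#    because the action is already terminal.
retry_rejected = controller_status in {"completed", "cancelled", "failed"}
print(retry_rejected)   # True: the retry loop seen in the logs
```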

Revision history for this message
Martin Hilton (martin-hilton) wrote :

For completeness the controller version is 2.1.2 and the model version is 2.1.1.

Revision history for this message
Tim Penhey (thumper) wrote :

Possibly related to bug 1649637

tags: added: actions uniter
Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Cory Johns (johnsca) wrote :

Unsure if 1649637 has the same root cause, but I can still replicate it consistently on trusty instances: http://pastebin.ubuntu.com/24741077/

I've never seen it happen on a xenial instance that I can recall.

Revision history for this message
Cory Johns (johnsca) wrote :

It also seems that doing the reboot via `juju ssh` does not cause the failures, so the proposed transaction-related cause seems reasonable.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.3-alpha1
milestone: 2.3-alpha1 → none
assignee: nobody → Eric Claude Jones (ecjones)
Paul Gear (paulgear)
tags: added: canonical-is
Changed in juju:
status: Triaged → In Progress
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Eric Claude Jones (ecjones) wrote :

WORKAROUND

For users hitting this bug on versions prior to the one containing the fix, the following workaround can be used.

If you notice that the unit agent is stuck in continual attempts to fail an action, you can try to step the agent past that action by updating its state file manually. For example, say the logs reveal that your unit agent is stuck cancelling an action with ID f5fe333a-9662-4fc1-8dc2-045b16208e85 and your unit agent's state file looks as follows:

leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: f5fe333a-9662-4fc1-8dc2-045b16208e85

You can delete the line starting with `action-id` and change the line starting with `op` to `op: continue`. This steps the unit agent past trying to fail the specified action. It should be safe since the controller already believes the action to be cancelled (which is why it is repeatedly rejecting the uniter's requests to fail the action).
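A minimal sketch of that edit, applied here to a scratch copy of the file. On a real machine the uniter state file usually lives under /var/lib/juju/agents/unit-&lt;name&gt;-&lt;num&gt;/state/uniter (that path is an assumption for your Juju version; stop the agent and keep a backup before editing the real file):

```python
import pathlib
import tempfile

# Scratch copy of the stuck state file from the example above.
sample = """leader: false
started: true
stopped: false
installed: true
status-set: false
op: run-action
opstep: pending
action-id: f5fe333a-9662-4fc1-8dc2-045b16208e85
"""

state = pathlib.Path(tempfile.mkdtemp()) / "uniter"
state.write_text(sample)

edited = []
for line in state.read_text().splitlines():
    if line.startswith("action-id:"):
        continue                  # drop the reference to the stuck action
    if line == "op: run-action":
        line = "op: continue"     # step the uniter past the action
    edited.append(line)
state.write_text("\n".join(edited) + "\n")

print(state.read_text())
```

After the edit the agent should resume normally on its next start, since it no longer tries to report a result for the already-terminal action.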

Changed in juju:
milestone: none → 2.2.2
milestone: 2.2.2 → none
Revision history for this message
Eric Claude Jones (ecjones) wrote :
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

pr landed 26-jun-2017

Changed in juju:
status: Fix Committed → Fix Released