failed hook after reboot

Bug #1829394 reported by Jason Hobbs
Affects                  Status   Importance  Assigned to  Milestone
Canonical Juju           Invalid  Undecided   Unassigned
Juju Wait Plugin         New      Undecided   Unassigned
Kubernetes Worker Charm  Invalid  Undecided   Unassigned

Bug Description

We're doing reboot testing, and after rebooting all the nodes in our kubernetes cluster, a kubernetes worker went into an error state with a failed 'update-status' hook.

It's not clear why the hook failed; there is no traceback.

2019-05-16 11:54:48 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "update-status"

We are using 'juju wait' to determine when the kubernetes applications are ready again after reboot, and this error trips up 'juju wait' and causes the test to fail.
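
For context, here's a minimal sketch (in Python, with illustrative names; our actual harness differs) of how the test drives 'juju wait' and why a single errored unit fails the run:

    import subprocess
    import sys

    def wait_for_cluster(timeout_secs: int = 1800) -> None:
        # The juju-wait plugin exits non-zero once it sees a unit in an
        # error state, so the transient hook failed: "update-status"
        # after reboot is enough to fail the whole test.
        try:
            subprocess.run(["juju", "wait"], check=True, timeout=timeout_secs)
        except subprocess.CalledProcessError as err:
            sys.exit("cluster did not settle: juju wait exited %d" % err.returncode)

    wait_for_cluster()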

Revision history for this message
Mike Wilson (knobby) wrote :

OK, after talking with the Juju guys, we think this is expected behaviour from Juju and bad behaviour from juju wait. They fully expect a transient error to be possible when an agent is restarting. The opinion of the Juju team is that juju wait should handle this transient error and not exit.

This is either a juju wait plugin (stubbs) issue or a Juju issue.
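
A rough sketch of the tolerant behaviour being suggested (hypothetical Python, not the plugin's actual implementation): only treat an error as fatal once it persists across several consecutive polls, since Juju may retry the hook on its own:

    import json
    import subprocess
    import time

    def units_in_error():
        # Collect units whose workload status is 'error' from
        # 'juju status --format=json'.
        status = json.loads(subprocess.check_output(
            ["juju", "status", "--format=json"], text=True))
        bad = []
        for app in status.get("applications", {}).values():
            for name, unit in (app.get("units") or {}).items():
                if unit.get("workload-status", {}).get("current") == "error":
                    bad.append(name)
        return bad

    def wait_tolerant(max_consecutive: int = 5, interval: int = 60) -> None:
        consecutive = 0
        while True:
            bad = units_in_error()
            if not bad:
                return  # real code would also wait for agents to go idle
            consecutive += 1
            if consecutive >= max_consecutive:
                raise RuntimeError("units stuck in error: %s" % bad)
            time.sleep(interval)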

Changed in charm-kubernetes-worker:
status: New → Invalid
Revision history for this message
Richard Harding (rharding) wrote :

I don't follow this, though. If the agent came up and went to exec an update-status hook, that hook should fail or succeed on its own merits. I guess I could see a condition where, if the machine was rebooted while exec'ing the hook, Juju would treat it as failed and the unit would be in an "error state" until it managed to recover/auto-retry and get to a successful spot.

Revision history for this message
Richard Harding (rharding) wrote :

If we can reproduce this or dig deeper into why the hook failed, that'd be great. As it is, I'm not sure what we can address, so I'm going to mark this as Incomplete pending investigation into the timing of the hook error, and logs showing whether it was running during the restart or not.

Changed in juju:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Rick,

What's happening is we're rebooting machines while a hook is running. When juju comes back up, the hook is marked as failed.

It can take an arbitrary amount of time to get out of error state at that point because there is no guarantee of when the hook will be retried.

We have added a longer delay to try to reduce the likelihood of hitting this, but it's not a sure thing and we still hit it sometimes.
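
As an illustration of the alternative (a hedged sketch, not what our test currently does): 'juju resolved <unit>' clears the error and retries the failed hook, which bounds the wait instead of relying on Juju's retry schedule:

    import json
    import subprocess
    import time

    def errored_units():
        status = json.loads(subprocess.check_output(
            ["juju", "status", "--format=json"], text=True))
        return [name
                for app in status.get("applications", {}).values()
                for name, unit in (app.get("units") or {}).items()
                if unit.get("workload-status", {}).get("current") == "error"]

    def nudge_until_clear(max_rounds: int = 10, delay: int = 30) -> None:
        for _ in range(max_rounds):
            bad = errored_units()
            if not bad:
                return
            for unit in bad:
                # 'juju resolved' marks the error resolved and retries the
                # failed hook by default.
                subprocess.run(["juju", "resolved", unit], check=False)
            time.sleep(delay)
        raise RuntimeError("units still in error after %d rounds" % max_rounds)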

summary: - failed update-status after reboot
+ failed hook after reboot
Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

How are you rebooting the machine?

If the reboot is happening in the middle of a hook execution, then I wouldn't expect anything about the failure to show up in the logs.

The uniter writes a record into local state when it starts a hook. If, on startup, the uniter sees that it had previously started a hook but never finished it, the hook is marked as failed. This is normal.
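
A conceptual sketch of that bookkeeping (illustrative Python, not Juju's actual Go uniter code; the state file path is made up):

    import json
    import os
    import subprocess

    STATE = "/tmp/uniter-hook-state.json"  # hypothetical path

    def run_hook(hook_path: str) -> None:
        # Durably record "started" before running the hook...
        with open(STATE, "w") as f:
            json.dump({"hook": hook_path, "done": False}, f)
            f.flush()
            os.fsync(f.fileno())
        subprocess.run([hook_path], check=True)  # a reboot here leaves done=False
        # ...and only record "done" once the hook completes.
        with open(STATE, "w") as f:
            json.dump({"hook": hook_path, "done": True}, f)

    def recover_on_startup() -> None:
        if not os.path.exists(STATE):
            return
        with open(STATE) as f:
            state = json.load(f)
        if not state["done"]:
            # A started-but-unfinished hook surfaces as an error, which is
            # exactly the hook failed: "update-status" seen after a reboot.
            print('hook failed: "%s"' % state["hook"])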

Hook executions are generally retried, except update-status.

The update-status hook will execute in five minutes ± jitter after the unit agent has started and connected to the controller.

This is expected behaviour.
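
Given that timing, a hedged sketch of how a reboot test could allow for one update-status window before declaring failure (the interval and jitter bound below are assumptions based on the comment above):

    import time

    UPDATE_STATUS_INTERVAL = 5 * 60  # nominal period described above
    JITTER_ALLOWANCE = 60            # assumed upper bound on the jitter

    def error_is_transient(agent_restart_time: float) -> bool:
        # Treat 'hook failed' as transient until the first post-restart
        # update-status run has had a chance to fire and clear the error.
        deadline = agent_restart_time + UPDATE_STATUS_INTERVAL + JITTER_ALLOWANCE
        return time.time() < deadline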

Changed in juju:
status: New → Invalid