failed hook after reboot

Bug #1829394 reported by Jason Hobbs
Affects                  Status   Importance  Assigned to  Milestone
Canonical Juju           Invalid  Undecided   Unassigned
Juju Wait Plugin         New      Undecided   Unassigned
Kubernetes Worker Charm  Invalid  Undecided   Unassigned

Bug Description

We're doing reboot testing, and after rebooting all the nodes in our kubernetes cluster, a kubernetes worker went into an error state with a failed 'update-status' hook.

It's not clear why the hook failed; there is no traceback.

2019-05-16 11:54:48 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "update-status"

We are using 'juju wait' to determine when the kubernetes applications are ready again after reboot, and this error trips up 'juju wait' and causes the test to fail.
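
For context, here's a minimal sketch (in Python, with illustrative names; our actual harness differs) of how the test drives 'juju wait' and why a single errored unit fails the run:

    import subprocess
    import sys

    def wait_for_cluster(timeout_secs: int = 1800) -> None:
        # The juju-wait plugin exits non-zero once it sees a unit in an
        # error state, so the transient hook failed: "update-status"
        # after reboot is enough to fail the whole test.
        try:
            subprocess.run(["juju", "wait"], check=True, timeout=timeout_secs)
        except subprocess.CalledProcessError as err:
            sys.exit("cluster did not settle: juju wait exited %d" % err.returncode)

    wait_for_cluster()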

Revision history for this message
Mike Wilson (knobby) wrote :

OK, after talking with the Juju guys, we think this is expected behaviour from Juju and bad behaviour from juju wait. They fully expect a transient error to be possible when an agent is restarting. The opinion of the Juju team is that juju wait should handle this transient error and not exit.

This is either a juju wait plugin (stubbs) issue or a Juju issue.
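
A rough sketch of the tolerant behaviour being suggested (hypothetical Python, not the plugin's actual implementation): only treat an error as fatal once it persists across several consecutive polls, since Juju may retry the hook on its own:

    import json
    import subprocess
    import time

    def units_in_error():
        # Collect units whose workload status is 'error' from
        # 'juju status --format=json'.
        status = json.loads(subprocess.check_output(
            ["juju", "status", "--format=json"], text=True))
        bad = []
        for app in status.get("applications", {}).values():
            for name, unit in (app.get("units") or {}).items():
                if unit.get("workload-status", {}).get("current") == "error":
                    bad.append(name)
        return bad

    def wait_tolerant(max_consecutive: int = 5, interval: int = 60) -> None:
        consecutive = 0
        while True:
            bad = units_in_error()
            if not bad:
                return  # real code would also wait for agents to go idle
            consecutive += 1
            if consecutive >= max_consecutive:
                raise RuntimeError("units stuck in error: %s" % bad)
            time.sleep(interval)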

Changed in charm-kubernetes-worker:
status: New → Invalid
Revision history for this message
Richard Harding (rharding) wrote :

I don't follow this, though. If the agent came up and went to exec an update-status hook, that hook should fail or succeed on its own merits. I guess I could see a condition where, if the machine was rebooted while exec'ing the hook, Juju would treat it as failed and the unit would be in an "error state" until it managed to recover/auto-retry and get to a successful spot.

Revision history for this message
Richard Harding (rharding) wrote :

If we can reproduce this or dig deeper into why the hook failed, that'd be great. As it is, I'm not sure what we can address, so I'm going to mark this as Incomplete pending investigation into the timing of the hook error, and logs showing whether it was running during the restart or not.

Changed in juju:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Rick,

What's happening is we're rebooting machines while a hook is running. When juju comes back up, the hook is marked as failed.

It can take an arbitrary amount of time to get out of error state at that point because there is no guarantee of when the hook will be retried.

We have added a longer delay to try to reduce the likelihood of hitting this, but it's not a sure thing and we still hit it sometimes.
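
As an illustration of the alternative (a hedged sketch, not what our test currently does): 'juju resolved <unit>' clears the error and retries the failed hook, which bounds the wait instead of relying on Juju's retry schedule:

    import json
    import subprocess
    import time

    def errored_units():
        status = json.loads(subprocess.check_output(
            ["juju", "status", "--format=json"], text=True))
        return [name
                for app in status.get("applications", {}).values()
                for name, unit in (app.get("units") or {}).items()
                if unit.get("workload-status", {}).get("current") == "error"]

    def nudge_until_clear(max_rounds: int = 10, delay: int = 30) -> None:
        for _ in range(max_rounds):
            bad = errored_units()
            if not bad:
                return
            for unit in bad:
                # 'juju resolved' marks the error resolved and retries the
                # failed hook by default.
                subprocess.run(["juju", "resolved", unit], check=False)
            time.sleep(delay)
        raise RuntimeError("units still in error after %d rounds" % max_rounds)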

summary: - failed update-status after reboot
+ failed hook after reboot
Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

How are you rebooting the machine?

If the reboot is happening in the middle of a hook execution, then I wouldn't expect anything about the failure to show up in the logs.

The uniter writes a record into local state when it starts a hook. If, on startup, the uniter sees that it had previously started a hook but never finished it, the hook is marked as failed. This is normal.
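
A conceptual sketch of that bookkeeping (illustrative Python, not Juju's actual Go uniter code; the state file path is made up):

    import json
    import os
    import subprocess

    STATE = "/tmp/uniter-hook-state.json"  # hypothetical path

    def run_hook(hook_path: str) -> None:
        # Durably record "started" before running the hook...
        with open(STATE, "w") as f:
            json.dump({"hook": hook_path, "done": False}, f)
            f.flush()
            os.fsync(f.fileno())
        subprocess.run([hook_path], check=True)  # a reboot here leaves done=False
        # ...and only record "done" once the hook completes.
        with open(STATE, "w") as f:
            json.dump({"hook": hook_path, "done": True}, f)

    def recover_on_startup() -> None:
        if not os.path.exists(STATE):
            return
        with open(STATE) as f:
            state = json.load(f)
        if not state["done"]:
            # A started-but-unfinished hook surfaces as an error, which is
            # exactly the hook failed: "update-status" seen after a reboot.
            print('hook failed: "%s"' % state["hook"])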

Hook executions are generally retried, except update-status.

The update-status hook will execute in five minutes ± jitter after the unit agent has started and connected to the controller.

This is expected behaviour.
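
Given that timing, a hedged sketch of how a reboot test could allow for one update-status window before declaring failure (the interval and jitter bound below are assumptions based on the comment above):

    import time

    UPDATE_STATUS_INTERVAL = 5 * 60  # nominal period described above
    JITTER_ALLOWANCE = 60            # assumed upper bound on the jitter

    def error_is_transient(agent_restart_time: float) -> bool:
        # Treat 'hook failed' as transient until the first post-restart
        # update-status run has had a chance to fire and clear the error.
        deadline = agent_restart_time + UPDATE_STATUS_INTERVAL + JITTER_ALLOWANCE
        return time.time() < deadline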

Changed in juju:
status: New → Invalid