juju update-status hook does not time out
Bug #1735264 reported by Felipe Reyes
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Low | Unassigned |
Bug Description
In Juju 2.3 the update-status hook is no longer reported in "juju status" output. That reporting used to help find charms looping indefinitely inside the update-status hook.
I think Juju should send a SIGTERM (and then a SIGKILL) if an update-status hook has been running for more than X seconds, ideally with X configurable via model-config. If the hook does not finish within the expected time, the unit should be set to an error state. An automatic retry (as is done for the config-changed hook) could be a good idea for update-status hooks that are not so reliable.
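To make the proposed escalation concrete, here is a minimal Python sketch of "SIGTERM, then SIGKILL" around a child process. It is not Juju code: the function name `run_hook_with_timeout` and the parameters `timeout_s` and `grace_s` are hypothetical, and a real implementation would live in the unit agent and take its timeout from configuration.

```python
import signal
import subprocess


def run_hook_with_timeout(cmd, timeout_s, grace_s=5.0):
    """Run cmd; send SIGTERM after timeout_s, then SIGKILL after a further grace_s.

    Hypothetical helper for illustration only.
    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Polite request first, so the hook can clean up if it handles SIGTERM.
        proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s), True
    except subprocess.TimeoutExpired:
        # SIGKILL cannot be caught or ignored by the child.
        proc.kill()
        return proc.wait(), True
```

A caller could treat `timed_out == True` as the trigger for putting the unit into an error state or scheduling a retry. On POSIX a negative return code is the number of the signal that ended the process.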
Changed in juju:
importance: Undecided → Wishlist
status: New → Triaged
summary:
- [2.3] juju update-status hook does not time out
+ juju update-status hook does not time out
I would not change the default of not doing anything at all.
I agree that it helps debugging such charm issues (encountered that myself - they are really annoying to debug) but I am more inclined to "warn on runs for too long" rather than "kill on runs for too long".
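A "warn instead of kill" approach could look roughly like the following Python sketch. It only logs when a hook runs past a threshold and never signals the process. The logger name `hook-watchdog`, the function `run_hook_warn_if_slow`, and the parameter `warn_after_s` are all hypothetical names, not anything that exists in Juju.

```python
import logging
import subprocess
import threading

logger = logging.getLogger("hook-watchdog")  # hypothetical logger name


def run_hook_warn_if_slow(cmd, warn_after_s):
    """Run cmd to completion; log one warning if it is still running after warn_after_s.

    Hypothetical helper for illustration only. The hook is never killed.
    """
    proc = subprocess.Popen(cmd)
    timer = threading.Timer(
        warn_after_s,
        lambda: logger.warning(
            "hook %r still running after %.1fs", cmd, warn_after_s
        ),
    )
    timer.daemon = True  # do not keep the interpreter alive for the timer
    timer.start()
    try:
        return proc.wait()
    finally:
        timer.cancel()  # has no effect if the warning already fired
```

The same hook point could feed an alerting system instead of a log line. The trade-off is that a hook that is truly stuck stays stuck until an operator reacts.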
To give an example: imagine that I spawned 100 child processes from the main charm process and they are blocked on something. After that the parent gets killed and all children are re-parented to the init process or a child subreaper process (unless you use PR_SET_PDEATHSIG via prctl(2) in each child) - they will hang there instead.
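One possible way to reduce the orphaned-children problem, if killing were implemented, is to start the hook in its own process group and signal the whole group rather than only the parent. The Python sketch below assumes a POSIX system; `run_in_own_group` and `timeout_s` are hypothetical names. Children that move themselves into a different session or process group would still escape.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd, timeout_s):
    """Run cmd as leader of a new process group; on timeout, kill the whole group.

    Hypothetical helper for illustration only (POSIX).
    Returns (returncode, timed_out).
    """
    # start_new_session=True calls setsid() in the child, so the child becomes
    # a session and process group leader and its PID doubles as the group ID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Signal every process still in the group, not just the parent.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait(), True
```

This goes straight to SIGKILL for brevity. A gentler variant could send SIGTERM to the group first and escalate after a grace period.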
I would also think about just alerting via some mechanism. Tracking each unit and presenting that in juju status might be expensive to calculate and maintain. Having an event name alongside "executing" status (https://git.io/vbkHj) would certainly help - we already fetch messages set via status_set, why not do the same with event names?
Logging in Uniter could help, though this is less visible - maybe explicit alerts in other ops tools will catch them.
I don't think this should be restricted just to update-status events. Bugs in other hooks may also result in hangs during deployment or testing on CI infra, so this would be generally useful for other events.