Canonical Juju

Bug #1735264
Comment #1

Dmitrii Shcherbakov (dmitriis) wrote on 2017-11-29: Re: [2.3] juju update-status hook does not time out

I would not change the current default, which is to do nothing at all.

I agree that a timeout helps with debugging such charm issues (I have run into them myself, and they are really annoying to debug), but I am more inclined toward "warn when a hook runs for too long" than "kill when a hook runs for too long".
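A minimal sketch of the "warn, don't kill" idea, written in Python purely for illustration. This is not how the Uniter runs hooks; `run_hook_with_warning`, the `hook-watchdog` logger name, and the `warn_after` parameter are all hypothetical names. A timer logs a warning once the threshold passes, and the process is left alone.

```python
import logging
import subprocess
import threading

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("hook-watchdog")


def run_hook_with_warning(argv, warn_after, hook_name="update-status"):
    """Run a hook command; warn if it exceeds warn_after seconds, never kill it."""
    proc = subprocess.Popen(argv)

    def warn():
        # Only reached if the timer was not cancelled, i.e. the hook is
        # still running (barring a tiny race right at completion).
        logger.warning(
            "hook %s (pid %d) still running after %.1fs",
            hook_name, proc.pid, warn_after,
        )

    timer = threading.Timer(warn_after, warn)
    timer.daemon = True  # do not keep the interpreter alive for the timer
    timer.start()
    try:
        # Block until the hook finishes on its own; no timeout, no kill.
        return proc.wait()
    finally:
        timer.cancel()
```

For example, `run_hook_with_warning(["./hooks/update-status"], warn_after=300)` would log one warning after five minutes and keep waiting; the hook path here is only an illustration.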

To give an example: imagine that I spawned 100 child processes from the main charm process and they are blocked on something. If the parent is then killed, all the children are re-parented to the init process or to a child subreaper process (unless each child uses PR_SET_PDEATHSIG via prctl(2)), and they will hang there instead.
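For reference, a rough sketch of what setting PR_SET_PDEATHSIG in each child could look like from a Python charm. It is Linux-only and assumes the C library is reachable via `ctypes.CDLL(None)`; `spawn_worker` and `_die_with_parent` are hypothetical helper names, and the choice of SIGTERM is an assumption. The constant value 1 comes from `<linux/prctl.h>`.

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>

# Load libc once in the parent so the child does no library loading after fork.
_libc = ctypes.CDLL(None, use_errno=True)


def _die_with_parent():
    """Runs in the child between fork and exec: request SIGTERM on parent death."""
    rc = _libc.prctl(
        PR_SET_PDEATHSIG,
        ctypes.c_ulong(int(signal.SIGTERM)),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if rc != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def spawn_worker(argv):
    """Start a child process that is signalled if its parent goes away."""
    return subprocess.Popen(argv, preexec_fn=_die_with_parent)
```

Two caveats from prctl(2): the setting is cleared when executing a set-user-ID or set-group-ID binary, and "parent" means the thread that created the child, so the signal is sent when that thread terminates. There is also a small window in which the parent can die before the child has called prctl.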

I would also think about just alerting via some mechanism. Tracking each unit and presenting that in juju status might be expensive to calculate and maintain. Having an event name alongside the "executing" status (https://git.io/vbkHj) would certainly help: we already fetch messages set via status_set, so why not do the same with event names?

Logging in the Uniter could help, though this is less visible; maybe explicit alerts in other ops tools would catch such hangs.

I don't think this should be restricted to update-status events. Bugs in other hooks may also result in hangs during deployment or testing on CI infra, so this would be generally useful for other events too.