Comment 2 for bug 1735264

Felipe Reyes (freyes) wrote : Re: [Bug 1735264] Re: [2.3] juju update-status hook does not time out

On Wed, Nov 29, 2017 at 08:17:21PM -0000, Dmitrii Shcherbakov wrote:
> I would not change the default of not doing anything at all.
>
> I agree that it helps debugging such charm issues (encountered that
> myself - they are really annoying to debug) but I am more inclined to
> "warn on runs for too long" rather than "kill on runs for too long".

Any suggestion on where this could fit? The workload message is the only place
that comes to mind, but it is not really a good one.
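To make the "warn" versus "kill" distinction concrete, here is a minimal Python sketch of a warn-only watchdog. It is not how the Juju uniter runs hooks; the function name, the logger name, and the `warn_after` / `repeat_every` parameters are all assumptions made for illustration. The hook process is never signalled; the wrapper only logs when the hook runs past a threshold.

```python
import logging
import subprocess
import sys
import time

log = logging.getLogger("hook-watchdog")  # hypothetical logger name


def run_hook_with_warning(cmd, warn_after, repeat_every=None):
    """Run cmd to completion and warn (never kill) if it runs too long.

    warn_after:   seconds before the first warning is logged.
    repeat_every: seconds between later warnings; None logs only once.
    Returns (exit_code, number_of_warnings_logged).
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    warnings = 0
    timeout = warn_after
    while True:
        try:
            # wait() raises TimeoutExpired but leaves the child running.
            return proc.wait(timeout=timeout), warnings
        except subprocess.TimeoutExpired:
            warnings += 1
            log.warning("hook %r still running after %.1fs",
                        cmd, time.monotonic() - start)
            # With None, the next wait() blocks until the hook exits.
            timeout = repeat_every


if __name__ == "__main__":
    # Stand-in for a slow update-status hook.
    slow_hook = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    print(run_hook_with_warning(slow_hook, warn_after=0.1))
```

Where the warning ends up (log file, status message, alerting tool) is exactly the open question above; the sketch only shows that detecting the overrun does not require killing anything.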

>
> To give an example: imagine that I spawned 100 child processes from the
> main charm process and they are blocked on something. After that the
> parent gets killed and all children are re-parented to the init process
> or a child subreaper process (unless you use PR_SET_PDEATHSIG via
> prctl(2) in each child) - they will hang there instead.

do we really expect an update-status hook implementation with these characteristics?
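For reference, the orphaned-children scenario can be avoided without touching each child. The sketch below is a different technique from the PR_SET_PDEATHSIG approach mentioned in the quote: it starts the hook in its own session and, on timeout, signals the whole process group. It is a POSIX-only illustration, not Juju code; the function name and the `timeout` parameter are assumptions.

```python
import os
import signal
import subprocess
import sys


def run_in_group_with_timeout(cmd, timeout):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns (exit_code, timed_out).
    """
    # start_new_session=True calls setsid() in the child, so the child
    # becomes the leader of a new process group whose id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        # Signal every process still in the group: the hook and the
        # children it spawned, instead of leaving them re-parented.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the hook process itself
        return proc.returncode, True


if __name__ == "__main__":
    # Stand-in for a hook that spawns a blocked child and then hangs.
    parent = (
        "import subprocess, sys, time\n"
        "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
        "time.sleep(60)\n"
    )
    print(run_in_group_with_timeout([sys.executable, "-c", parent], timeout=1))
```

A child that moves itself into another session or process group escapes this, so it narrows the problem rather than eliminating it.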

> I would also think about just alerting via some mechanism. Tracking each
> unit and presenting that in juju status might be expensive to calculate
> and maintain. Having an event name alongside "executing" status
> (https://git.io/vbkHj) would certainly help - we already fetch messages
> set via status_set, why not do the same with event names?

The executing state is no longer surfaced for the update-status hook, so we have
no mechanism to inform the user about this long-running hook.

>
> Logging in Uniter could help though this is less visible

we saw an environment where the update-status hook was running for more than a
day; the logs were continuously repeating the same thing, so it can be detected
by a human, but it is harder for tools to detect.

> - maybe explicit alerts in other ops tools will catch them.

yes, this is for sure an option, but I believe juju shouldn't allow its hooks
to go rogue :-)