juju update-status hook does not time out

Bug #1735264 reported by Felipe Reyes
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Triaged
Low
Unassigned

Bug Description

In juju 2.3 the update-status hook is no longer reported in "juju status" output, this is something has helped to find charms looping inside the update-status hook indefinitely

I think juju should send a SIGTERM (and then a SIGKILL) if a update-status has been running for more than X seconds, hopefully this is configurable via model-config. If the hook didn't finish with the expected time, then the unit should be set in error state, probably an automatic retry (like it's done with config-changed hook) could be a good idea for not so reliable update-status hooks.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I would not change the default of not doing anything at all.

I agree that it helps debugging such charm issues (encountered that myself - they are really annoying to debug) but I am more inclined to "warn on runs for too long" rather than "kill on runs for too long".

To give an example: imagine that I spawned 100 child processes from the main charm process and they are blocked on something. After that the parent gets killed and all children are re-parented to the init process or a child subreaper process (unless you use PR_SET_PDEATHSIG via prctl(2) in each child) - they will hang there instead.

I would also think about just alerting via some mechanism. Tracking each unit and presenting that in juju status might be expensive to calculate and maintain. Having an event name alongside "executing" status (https://git.io/vbkHj) would certainly help - we already fetch messages set via status_set, why not do the same with event names?

Logging in Uniter could help though this is less visible - maybe explicit alerts in other ops tools will catch them.

I don't think this should be restricted just to update-status events - some bugs in other hooks may result in hangs during deployment or testing on CI infra so this would be generally useful for other events.

tags: added: cpe-onsite
Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1735264] Re: [2.3] juju update-status hook does not time out

On Wed, Nov 29, 2017 at 08:17:21PM -0000, Dmitrii Shcherbakov wrote:
> I would not change the default of not doing anything at all.
>
> I agree that it helps debugging such charm issues (encountered that
> myself - they are really annoying to debug) but I am more inclined to
> "warn on runs for too long" rather than "kill on runs for too long".

any suggestion where this could fit?, the workload message is the only place
that comes to my mind, but it is not really a good one.

>
> To give an example: imagine that I spawned 100 child processes from the
> main charm process and they are blocked on something. After that the
> parent gets killed and all children are re-parented to the init process
> or a child subreaper process (unless you use PR_SET_PDEATHSIG via
> prctl(2) in each child) - they will hang there instead.

do we really expect an update-status hook implementation with these characteristics?.

> I would also think about just alerting via some mechanism. Tracking each
> unit and presenting that in juju status might be expensive to calculate
> and maintain. Having an event name alongside "executing" status
> (https://git.io/vbkHj) would certainly help - we already fetch messages
> set via status_set, why not do the same with event names?

The executing state is no longer surfaced for the update-status hook, so we have
no mechanism to inform the user about this long running hook.

>
> Logging in Uniter could help though this is less visible

we saw an environment where the update-status was running for more than a day,
the logs were continously repeating the same thing, so it can be detected by a
human, but harder by tools.

> - maybe explicit alerts in other ops tools will catch them.

yes, this is for sure an option, but I believe juju shouldn't allow their hooks
go rogue :-)

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote : Re: [2.3] juju update-status hook does not time out

> any suggestion where this could fit?

Would be difficult to fit into the same line because event names can get quite long. But yes before/after status message seems like the right place conceptually.

Maybe another section in Juju status would be useful?

"Operations in progress"? (hooks, actions)

or

"Stale operations"

> do we really expect an update-status hook implementation with these characteristics?

This is just an example - you may want to execute some binary in a child process even though it's update-status. For other events this is more important though.

Also if a process is in D state (uninterruptible sleep) for some reason you won't be able to kill it anyway. I am trying to explore generic cases.

> The executing state is no longer surfaced for the update-status hook

I see.

> I believe juju shouldn't allow their hooks go rogue :-)

I agree, let's figure out what works best without asking to make Juju track too much state.

It could be a goroutine per unit agent that tracks execution time as we only have one Uniter executing charm code per logical machine at a time. Such tracking would depend on clocks and has to be synchronized with retries of failed hooks which is additional complexity.

Revision history for this message
Felipe Reyes (freyes) wrote :

I just remembered another use case where the agent should monitor and terminate a process, see https://bugs.launchpad.net/juju/+bug/1638332/comments/4

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → Wishlist
status: New → Triaged
summary: - [2.3] juju update-status hook does not time out
+ juju update-status hook does not time out
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.