Need a consistent set of hook events surrounding stop/start of application units (i.e. reboots)

Bug #1844773 reported by Ryan Beisner
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

In our work surrounding power events such as planned and unplanned reboots of units, we've discovered that there is no consistent hook on which we can depend for post-reboot requirements.

We've found that the leader-settings-changed hook is mostly dependable as the first hook that occurs after a reboot, but that is not the proper hook to rely on for this need.

In order to give charm authors a facility to consistently handle lifecycle management operations, Juju should provide a consistent set of hook events surrounding stop/start of application units.

Ryan Beisner (1chb1n)
summary: - no dependable hook for start of a unit (ie. hook context after reboot)
+ Need a consistent set of hook events surrounding stop/start of
+ application unit (ie. reboots)
Revision history for this message
John A Meinel (jameinel) wrote :

We used to have config-changed trigger whenever a unit agent was restarted, but the feedback was that this caused outages, because some charms force-restarted their application whenever the hook was called (regardless of whether there was any actual change relative to the application's last config).

I'm curious what actual event you're looking for, and what you want to do when "the machine was restarted".
Wouldn't that be a systemd process that starts your instance?

leader-settings-changed certainly doesn't seem reliable for this. It might only be triggering for the same reason that config-changed used to trigger (during initialization we just assume NewValue != UnknownExistingValue).

It would be good to concretely understand what you're looking to run/alert on/do. (Generally we don't recommend starting an application process directly from a hook; the hook should instead configure a systemd service. Otherwise there is no way to guarantee that the service gets restarted if it dies, since there may be no Juju model lifecycle event that would trigger another hook.)

Changed in juju:
status: New → Incomplete
Revision history for this message
Richard Harding (rharding) wrote :

We talked about this one during the openstack sync call last week. The consensus was that the goal was to help tell when a system was coming up from a cold boot (e.g. the system was restarted, not the agent or the service). The idea is that if the cloud were to shut down it was important to know it was coming up from a cold start situation. The discussion was to leverage the start hook as a promised "first hook on reboot, as long as not in a series-upgrade or other defined situation". In this way logic could be handled on the start hook to help manage this cold start situation.

We agreed that the openstack team would put together some defined scenarios they want to make sure are covered and we'd make sure that this start hook on reboot would cover those in a sufficient manner and look at adding to the next cycle roadmap.

Revision history for this message
John A Meinel (jameinel) wrote :

How does the agent that would drive the hook distinguish between "I started
because the machine restarted", "I restarted because the agent restarted",
"I restarted because I lost connectivity to the outside world", and
"I restarted because of an unhandled error"?

My primary concern is that introducing something like this comes just after
we went to a lot of effort to *not* trigger config-changed on all of those
types of events. I can see a case for "start", but we'd want to understand
how we would give a reliable signal rather than just moving the problem around.

Revision history for this message
Richard Harding (rharding) wrote :

The initial thought is a check of system uptime, but there are also tools
like the last command (e.g. "last reboot"). I think that's part of gathering
the requirement scenarios from stakeholders and defining those rules.
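As a sketch of the uptime heuristic, assuming a Linux system where /proc/uptime is available (the function names here are illustrative, not anything Juju actually ships):

```python
import time

def boot_time():
    """Absolute UNIX timestamp of the last boot, derived from /proc/uptime."""
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])
    return time.time() - uptime_seconds

def rebooted_since(last_agent_start):
    """True if the machine booted after the given timestamp, e.g. one the
    agent recorded on its previous successful start."""
    return boot_time() > last_agent_start
```

The agent would persist its own start timestamp and compare it against boot_time() on the next start; a boot time newer than the recorded start implies a reboot happened in between.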

Revision history for this message
Tim Penhey (thumper) wrote :

When I was discussing this with John the other day, I remember we talked
about having a hook that executes when the agent first starts on a boot.
I'm less interested in how we determine that, and more interested in why
we need it.

Ryan, could you please add some use cases from the openstack charms
where this information is needed?

It is possible that we are addressing a symptom and not the problem, and
we'd like to understand the use-cases a little more.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The essential use cases are most simply:

 * One or more nodes are rebooted cleanly, either for maintenance reasons, or after a series upgrade.
 * One or more nodes are booted after an unexpected power failure.
 * An entire cloud is booted after an unexpected power failure.

The charm workloads are typically clustered, and not all workloads self-heal, hence the need for the charm to take action, triggered by a reliable signal/hook. Workload examples:

  Database services (mysql, percona cluster)
  API services (keystone, cinder, glance)
  Virt management daemons (nova, libvirt)
  Storage daemons (ceph-osd, ceph-mon)

To be clear, my request is not necessarily to walk back the config-changed behavior change on agent start. It is just the point in time where we noticed this gap in signalling.

Revision history for this message
Tim Penhey (thumper) wrote :

Asking questions, not of anyone in particular, mostly around implementation.

Instead of relying on boot time to determine whether there has been an agent start since reboot, is there a shared system resource the agent could create if missing, and use to determine whether or not the system has rebooted?

The idea here is that what is created is a named ephemeral system resource that goes away when the machine reboots.

Perhaps a named pipe that doesn't get closed? Something like that?

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

For POSIX I would try using named semaphores on a per-agent basis.

http://man7.org/linux/man-pages/man7/sem_overview.7.html
"Persistence
POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down."

So, on agent start/restart:

* sem_open;
* sem_trywait -> if EAGAIN is not returned, run the start hook.

I believe that if a process exits or crashes without calling sem_post, the semaphore will stay locked forever, which is exactly the behaviour sem_trywait needs here. Likewise, we would never call sem_unlink, so the semaphore stays there until reboot.

For windows agents there is similar functionality in winapi:
https://docs.microsoft.com/en-us/windows/win32/sync/semaphore-objects?redirectedfrom=MSDN
"Closing the handle does not affect the semaphore count; therefore, be sure to call ReleaseSemaphore before closing the handle or before the process terminates."

Since "stop" is mentioned in the feature request: the currently documented semantics are that application files (packages) and config are to be removed on "stop".

https://discourse.jujucharms.com/t/charm-hooks/1040#heading--stop
"Stop the application
Remove any files/configuration created during the application lifecycle
Prepare any backup(s) of the application that are required for restore purposes."

In other words: stop == uninstall.

So I think it's worth a discussion when it comes to running "stop" before reboot, because some applications remove packages and local state when this hook runs. Maybe we need a different event for this.
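A minimal sketch of the sem_open/sem_trywait scheme described above, assuming Linux/glibc and using ctypes for illustration (the semaphore name and function names are mine, not what Juju would actually use):

```python
import ctypes
import errno
import os

# Resolve sem_open/sem_trywait from the already-loaded C library.
libc = ctypes.CDLL(None, use_errno=True)
libc.sem_open.restype = ctypes.c_void_p
libc.sem_trywait.argtypes = [ctypes.c_void_p]

def first_start_since_boot(name=b"/example-agent-boot"):
    """True exactly once per boot: the first caller takes the single token
    (sem_trywait returns 0); every later caller gets EAGAIN.

    Nobody ever calls sem_post or sem_unlink, so the semaphore stays at
    zero until the kernel discards it at shutdown.
    """
    sem = libc.sem_open(name, os.O_CREAT, 0o600, 1)  # initial value 1
    if not sem:
        raise OSError(ctypes.get_errno(), "sem_open failed")
    if libc.sem_trywait(sem) == 0:
        return True                     # first agent start this boot
    if ctypes.get_errno() == errno.EAGAIN:
        return False                    # token already taken this boot
    raise OSError(ctypes.get_errno(), "sem_trywait failed")
```

One caveat with this sketch: named semaphores are backed by files under /dev/shm (sem.NAME), so on Linux the kernel persistence reduces to the same tmpfs lifetime as a plain marker file.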

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Tim Penhey (thumper)
Changed in juju:
status: Expired → Triaged
importance: Undecided → High
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot