Need a consistent set of hook events surrounding stop/start of application units (i.e. reboots)

Bug #1844773 reported by Ryan Beisner
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

In our work surrounding power events such as planned and unplanned reboots of units, we've discovered that there is no consistent hook on which we can depend for post-reboot requirements.

We've found that the leader-settings-changed hook is mostly dependable as the first hook that occurs after a reboot, but that is not the proper hook to rely on for this need.

In order to give charm authors a facility to consistently handle lifecycle management operations, Juju should provide a consistent set of hook events surrounding stop/start of application units.

Ryan Beisner (1chb1n)
summary: - no dependable hook for start of a unit (ie. hook context after reboot)
+ Need a consistent set of hook events surrounding stop/start of
+ application unit (ie. reboots)
Revision history for this message
John A Meinel (jameinel) wrote :

We used to have config-changed trigger whenever a unit agent was restarted, but the feedback was that this caused outages, because some charms force-restarted their application whenever the hook was called (regardless of whether there was any actual change relative to the application's last config).

I'm curious what actual event you're looking for, and what you want to do when "the machine was restarted".
Wouldn't that be a systemd process that starts your instance?

leader-settings-changed certainly doesn't seem reliable for this. It might only be triggering for the same reason that config-changed used to trigger (during initialization we just assume NewValue != UnknownExistingValue).

It would be good to concretely understand what you're looking to run/alert on/do. (Generally we don't recommend starting an application process directly from a hook; the hook should instead configure a systemd service. Otherwise there is no way to guarantee that the service gets restarted if it dies, since there may be no Juju model lifecycle event that would trigger another hook.)

Changed in juju:
status: New → Incomplete
Revision history for this message
Richard Harding (rharding) wrote :

We talked about this one during the openstack sync call last week. The consensus was that the goal was to help tell when a system was coming up from a cold boot (e.g. the system was restarted, not the agent or the service). The idea is that if the cloud were to shut down it was important to know it was coming up from a cold start situation. The discussion was to leverage the start hook as a promised "first hook on reboot, as long as not in a series-upgrade or other defined situation". In this way logic could be handled on the start hook to help manage this cold start situation.

We agreed that the openstack team would put together some defined scenarios they want to make sure are covered and we'd make sure that this start hook on reboot would cover those in a sufficient manner and look at adding to the next cycle roadmap.

Revision history for this message
John A Meinel (jameinel) wrote :

How does the agent that would drive the hook distinguish between "I started
because the machine restarted", "I restarted because the agent restarted",
"I restarted because I lost connectivity to the outside world", and
"I restarted because of an unhandled error"?

My primary concern is that introducing something like this comes just after
we went to a lot of effort to *not* trigger config-changed on all of those
types of events. I can see a case for "start", but we'd want to understand
how we would give a reliable signal rather than just moving the problem around.

Revision history for this message
Richard Harding (rharding) wrote :

The initial thought is a check of system uptime, but there are also tools
like the last command (e.g. "last reboot"). I think that's part of gathering
the requirement scenarios from stakeholders and defining those rules.
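As a sketch of the uptime heuristic, assuming a Linux system where /proc/uptime is available (the function names here are illustrative, not anything Juju actually ships):

```python
import time

def boot_time():
    """Absolute UNIX timestamp of the last boot, derived from /proc/uptime."""
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])
    return time.time() - uptime_seconds

def rebooted_since(last_agent_start):
    """True if the machine booted after the given timestamp, e.g. one the
    agent recorded on its previous successful start."""
    return boot_time() > last_agent_start
```

The agent would persist its own start timestamp and compare it against boot_time() on the next start; a boot time newer than the recorded start implies a reboot happened in between.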

Revision history for this message
Tim Penhey (thumper) wrote :

When I was discussing this with John the other day, I remember we talked
about having a hook that executes when the agent first starts on a boot.
I'm less interested in how we determine that, and more interested in why
we need it.

Ryan, could you please add some use cases from the openstack charms
where this information is needed?

It is possible that we are addressing a symptom and not the problem, and
we'd like to understand the use-cases a little more.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The essential use cases are most simply:

 * One or more nodes are rebooted cleanly, either for maintenance reasons, or after a series upgrade.
 * One or more nodes are booted after an unexpected power failure.
 * An entire cloud is booted after an unexpected power failure.

The charm workloads are typically clustered, and not all workloads self-heal, hence the need for the charm to take action, triggered by a reliable signal/hook. Workload examples:

  Database services (mysql, percona cluster)
  API services (keystone, cinder, glance)
  Virt management daemons (nova, libvirt)
  Storage daemons (ceph-osd, ceph-mon)

To be clear, my request is not necessarily to walk back the config-changed behavior change on agent start. It is just the point in time where we noticed this gap in signalling.

Revision history for this message
Tim Penhey (thumper) wrote :

Asking questions, not of anyone in particular, mostly around implementation.

Instead of relying on boot time to determine whether there has been an agent start since reboot, is there a shared system resource the agent could create if missing, and use to determine whether or not the system has rebooted?

The idea here is that what is created is a named ephemeral system resource that goes away when the machine reboots.

Perhaps a named pipe that doesn't get closed? Something like that?

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

For POSIX I would try using named semaphores on a per-agent basis.

http://man7.org/linux/man-pages/man7/sem_overview.7.html
"Persistence
POSIX named semaphores have kernel persistence: if not removed by sem_unlink(3), a semaphore will exist until the system is shut down."

So, on agent start/restart:

* sem_open;
* sem_trywait -> if EAGAIN is not returned, run the start hook.

I believe that if a process exits or crashes without calling sem_post, the semaphore will stay locked forever, which is exactly the behaviour sem_trywait needs here. Likewise, we would never call sem_unlink, so the semaphore stays there until reboot.

For windows agents there is similar functionality in winapi:
https://docs.microsoft.com/en-us/windows/win32/sync/semaphore-objects?redirectedfrom=MSDN
"Closing the handle does not affect the semaphore count; therefore, be sure to call ReleaseSemaphore before closing the handle or before the process terminates."

Since "stop" is mentioned in the feature request: the currently documented semantics are that application files (packages) and config are to be removed on "stop".

https://discourse.jujucharms.com/t/charm-hooks/1040#heading--stop
"Stop the application
Remove any files/configuration created during the application lifecycle
Prepare any backup(s) of the application that are required for restore purposes."

In other words: stop == uninstall.

So I think it's worth a discussion when it comes to running "stop" before reboot, because some applications remove packages and local state when this hook runs. Maybe we need a different event for this.
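A minimal sketch of the sem_open/sem_trywait scheme described above, assuming Linux/glibc and using ctypes for illustration (the semaphore name and function names are mine, not what Juju would actually use):

```python
import ctypes
import errno
import os

# Resolve sem_open/sem_trywait from the already-loaded C library.
libc = ctypes.CDLL(None, use_errno=True)
libc.sem_open.restype = ctypes.c_void_p
libc.sem_trywait.argtypes = [ctypes.c_void_p]

def first_start_since_boot(name=b"/example-agent-boot"):
    """True exactly once per boot: the first caller takes the single token
    (sem_trywait returns 0); every later caller gets EAGAIN.

    Nobody ever calls sem_post or sem_unlink, so the semaphore stays at
    zero until the kernel discards it at shutdown.
    """
    sem = libc.sem_open(name, os.O_CREAT, 0o600, 1)  # initial value 1
    if not sem:
        raise OSError(ctypes.get_errno(), "sem_open failed")
    if libc.sem_trywait(sem) == 0:
        return True                     # first agent start this boot
    if ctypes.get_errno() == errno.EAGAIN:
        return False                    # token already taken this boot
    raise OSError(ctypes.get_errno(), "sem_trywait failed")
```

One caveat with this sketch: named semaphores are backed by files under /dev/shm (sem.NAME), so on Linux the kernel persistence reduces to the same tmpfs lifetime as a plain marker file.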

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Tim Penhey (thumper)
Changed in juju:
status: Expired → Triaged
importance: Undecided → High
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot