OS-charms should check for expected services/processes before setting workload status to a ready state.

Bug #1524388 reported by Ryan Beisner
This bug affects 1 person
Affects: nova-compute (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Alex Kavanagh
Milestone: 16.04

Bug Description

OS-charms should check for expected services/processes before setting workload status to a ready state.

As of the 15.10 charms, workload status can be set to "Unit is Ready," even when a critical service has failed to start.

Taking it one step further: a hook should probably fail in those cases.

I've observed bug leaks in the following, where this type of sanity check within the charm would have raised red flags before charm commits or SRUs:
 - nova-compute
 - swift-*
 - rabbitmq-server

This also impacts automation and testability of our charms in that:

1. The Amulet tests, mojo spec tests, and other tests wait for the charm to advertise "I'm Ready" via workload status before commencing. Service checks in the Amulet tests will catch this leak, but other functional tests that do not inspect or exercise all relevant processes may miss it.

2. Automation systems such as autopilot, mojo specs, and generic bundle deployment would be better served by early failure, i.e. a failed hook or a not-ready service, before moving on to the next steps of the deployment automation.

This is targeted to the nova-compute charm for initial discussion. However, all OpenStack charms should be considered for this enhancement.

Tags: uosci


Ryan Beisner (1chb1n)
description: updated
Alex Kavanagh (ajkavanagh) wrote :

I've done some digging through 4 charms and there appears to be a pattern (or perhaps the beginnings of one) that defines three useful functions: 'services()', 'restart_map()' and 'assess_status(configs)'.

The charmhelpers.core.host module provides a 'service_running(<service_name_string>)' function that returns True if the service is running and False otherwise.

The charmhelpers.core.host module also provides 'service(<action string>, <service name string>)', which uses the OS systemctl (systemd) or service commands to perform an action (like start, stop, restart, etc.). This call blocks until the OS command finishes. Thus, either a 'restart' or a 'start' will succeed or fail quickly, although the service may still fail later.

The proposal, therefore, is to either:

a) Modify assess_status(...) in all of the charms to call something like:

all_running = reduce(operator.and_, [service_running(s) for s in services()], True)
if not all_running:
    <set state to some failed state>

(Obviously, for efficiency, we might want to bail on the first 'not running' service, so we could rewrite that as a for loop with a break.)
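
As a rough illustration of option (a) with the early bail-out, here is a minimal sketch. The function names first_stopped_service() and assess_services(), the returned (status, message) tuples and the message strings are assumptions made for this example and are not taken from any charm. The is_running argument stands in for a callable such as charmhelpers' service_running().

```python
def first_stopped_service(service_names, is_running):
    """Return the first service that is not running, or None if all are up.

    is_running is any callable that takes a service name and returns a
    truthy value when that service is running (hypothetical stand-in for
    charmhelpers.core.host.service_running).
    """
    for name in service_names:
        if not is_running(name):
            return name  # bail out on the first failure
    return None


def assess_services(service_names, is_running):
    """Return a (status, message) pair; the values are illustrative only."""
    stopped = first_stopped_service(service_names, is_running)
    if stopped is not None:
        return ("blocked", "Service not running: {}".format(stopped))
    return ("active", "Unit is ready")
```

Passing the check in as a callable keeps the sketch testable without a deployed unit; in a charm, assess_status(...) would presumably pass services() and service_running directly.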

OR

b) Change set_os_workload_status(...) to test for whether the services that should be running are running, and set a failed state if they are not. This would require a charm sync across all the charms, but might be simpler from a conceptual perspective.

However, I don't know enough about how set_os_workload_status(...) is used to say whether this would break the way it was designed to be used.

Thoughts?

Ryan Beisner (1chb1n) wrote :

I'd lean toward (a).

IMHO, the self-aware(tm) charm will possess some basic functionality checks to know if it is running properly, before declaring itself ready. This may involve introspection of more than just running services or processes. It could also be checking for a listening socket, or some arbitrary test method. I think we will be best served by having all three. But a process check is a good start.
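
For the listening-socket idea, a check could look roughly like the sketch below. The function name port_is_listening() and its parameters are assumptions for illustration. A single probe like this can report a false negative if the service is still starting up, so it would likely need to be combined with retries.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: it only proves that something accepts
    connections on the port, not that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```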

Idea: start out by just checking for running processes. Construct a mapping of <charm>: [<expected_processes>] in the form of a centralized helper dict (or yaml file) and process check helper. There may be another layer required in that data, as process names and their existence may vary across Ubuntu releases and/or OpenStack releases.

For example: if expected_processes_are_running('keystone', UBUNTU_RELEASE, OS_RELEASE) returns False, assess_status keeps the unit blocked and updates the status, then retries a number of times, and ultimately fails the hook after exhausting a generous retry threshold.
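
A minimal sketch of that idea follows. Everything in it is an assumption for illustration: the table layout, the process names in it (which are placeholders, not a verified list for any release), the extra is_running parameter, and the retry helper wait_for_processes().

```python
import time

# Hypothetical table: charm -> {(ubuntu_release, os_release) or "default": [processes]}.
# The process names are placeholders, not verified against any release.
EXPECTED_PROCESSES = {
    "keystone": {
        "default": ["keystone-all"],
        ("xenial", "mitaka"): ["apache2"],
    },
}


def expected_processes(charm, ubuntu_release, os_release, table=EXPECTED_PROCESSES):
    """Look up the expected processes, falling back to the charm's default."""
    per_charm = table.get(charm, {})
    key = (ubuntu_release, os_release)
    return per_charm.get(key, per_charm.get("default", []))


def expected_processes_are_running(charm, ubuntu_release, os_release, is_running):
    """True if every expected process passes the is_running callable."""
    names = expected_processes(charm, ubuntu_release, os_release)
    return all(is_running(name) for name in names)


def wait_for_processes(check, retries=5, delay=1.0, sleep=time.sleep):
    """Call check() up to `retries` times; return True as soon as it passes."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False
```

A caller could set a blocked status while waiting and raise an exception when wait_for_processes() returns False, which would fail the hook.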

This could lead nicely into a new self-check action, where the same would basically re-trigger.

All of this foo would still require a sync into the charms, but no harm there.

Changed in nova-compute (Juju Charms Collection):
status: New → In Progress
David Ames (thedac)
Changed in nova-compute (Juju Charms Collection):
assignee: nobody → Alex Kavanagh (ajkavanagh)
milestone: none → 16.04
importance: Undecided → High
Alex Kavanagh (ajkavanagh) wrote :

Note that for the moment, port checks are being disabled as some services are asynchronous with respect to their service start/stop scripts/functions.

Changed in nova-compute (Juju Charms Collection):
status: In Progress → Fix Committed
James Page (james-page)
Changed in nova-compute (Juju Charms Collection):
status: Fix Committed → Fix Released