Service restarts are not handled via Pacemaker for HA scenarios that use it

Bug #1891160 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects: charms.openstack
Status: Fix Released
Importance: High
Assigned to: Dmitrii Shcherbakov

Bug Description

While looking at https://bugs.launchpad.net/charm-manila-ganesha/+bug/1890401/ (comments #4 and #5 in particular), it became clear that service restarts via charmhelpers.core.host.service_restart in an HA scenario do not take into account services that should run on only one unit. While HA assumes the presence of multiple units, the other units may be passive, with their services down until a failover condition occurs.

Moreover, when service lifetime management is handed to Pacemaker, it is better not to interfere with its operations via manual service restarts and to use cluster resource lifecycle commands instead:

crm -w resource {start,stop,restart} <resource-name>

The -w option makes the CLI commands wait for a transition to complete instead of kicking off the asynchronous process and exiting.
https://crmsh.github.io/man-3/#topics_CommandLine

It seems like there needs to be a way in charms.openstack to trigger `crm -w resource restart` instead of restarting services via init system-specific commands (systemd, upstart). This would allow Pacemaker to restart the resource only on the units where it is actually scheduled.

Example:

1) while the services list is empty
https://opendev.org/openstack/charm-manila-ganesha/src/commit/7d804802302e696c1c9fbc623c951d3ac578c4e1/src/lib/charm/openstack/manila_ganesha.py#L163-L166

2) restart_on_change would still build a list of services from the restart map and restart them via charmhelpers.core.host.service_stop and charmhelpers.core.host.service_start.

https://opendev.org/openstack/charms.openstack/src/commit/7e8c5c1461c4bae155c27ef46e80c653cc77f1a8/charms_openstack/charm/core.py#L689-L709
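The two points above can be illustrated with a minimal, self-contained sketch. It is not the charms.openstack implementation: the function name `services_to_restart`, the config path, and the service name are hypothetical and only model the idea that the services to restart are derived from the restart map, independently of the services list.

```python
def services_to_restart(restart_map, changed_files):
    """Collect services whose config files changed (order-preserving, no duplicates).

    Hypothetical helper: it models deriving the restart list from the
    restart map alone, without consulting the charm's services list.
    """
    selected = []
    for path in changed_files:
        for service in restart_map.get(path, []):
            if service not in selected:
                selected.append(service)
    return selected


# The charm declares no directly managed services ...
services = []
# ... but the restart map still names a service (hypothetical path and name).
restart_map = {"/etc/ganesha/ganesha.conf": ["nfs-ganesha"]}

to_restart = services_to_restart(restart_map, ["/etc/ganesha/ganesha.conf"])
print(to_restart)
```

Under these assumptions the empty services list does not prevent the service from being selected for a stop/start cycle on every unit, including passive ones.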

--------------------------------------------------------------------

Considerations for restarting resources via crm commands:

1) crmsh restart commands are not node-specific and we would like to avoid causing cluster-wide restarts triggered by unit-local operations:
https://crmsh.github.io/man-2.0/#cmdhelp_resource_restart

2) crm_resource has --restart and --wait commands and a --node option:

https://manpages.ubuntu.com/manpages/xenial/man8/crm_resource.8.html
https://manpages.ubuntu.com/manpages/focal/man8/crm_resource.8.html
    --restart          (Advanced) Tell the cluster to restart this resource and anything that depends on it
    --wait             (Advanced) Wait until the cluster settles into a stable state
    -N, --node=value   Node name

Based on the implementation, if a resource isn't running on the specified host, the command exits with -ENXIO:

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-1.1.14/tools/crm_resource_runtime.c#L1027-L1070 (~xenial)
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.3/tools/crm_resource_runtime.c#L1261-L1305 (~focal)

So running the following from the charm and catching errno.ENXIO seems like a way forward:

crm_resource --restart --wait --node <node-name>
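A rough Python sketch of that approach follows. Several things here are assumptions rather than part of the report: the function name `restart_resource_on_node`, the injectable `run` parameter (there to make the sketch testable without a cluster), and the explicit `--resource` option, which the command line above leaves out. How the internal -ENXIO maps to the process exit status may differ between Pacemaker versions, so the set of "not running here" codes is a parameter and not hard-coded.

```python
import errno
import subprocess


def restart_resource_on_node(resource, node, run=subprocess.run,
                             not_here_codes=(errno.ENXIO,)):
    """Restart a Pacemaker resource on one node and wait for the cluster to settle.

    Returns True if the restart succeeded, False if the exit status says the
    resource is not running on that node. Any other failure is raised.
    """
    cmd = [
        "crm_resource", "--restart", "--wait",
        "--resource", resource,   # assumption: resource selected explicitly
        "--node", node,
    ]
    result = run(cmd)
    if result.returncode == 0:
        return True
    if result.returncode in not_here_codes:
        # Assumption: the exit status equals ENXIO; adjust not_here_codes
        # if the installed Pacemaker version reports a different code.
        return False
    raise subprocess.CalledProcessError(result.returncode, cmd)
```

A charm could call this with the local node name and treat a False result as "nothing to do on this unit", which keeps a unit-local operation from restarting the resource elsewhere in the cluster.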

The only alternative to crm_resource that I see is a combination of the `ban` and `clear` commands, but according to the docs the clear command doesn't accept a node argument:
https://crmsh.github.io/man-2.0/#cmdhelp_resource_ban
https://crmsh.github.io/man-2.0/#cmdhelp_resource_clear

description: updated
Dmitrii Shcherbakov (dmitriis) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to charms.openstack (master)

Reviewed: https://review.opendev.org/745896
Committed: https://git.openstack.org/cgit/openstack/charms.openstack/commit/?id=c9b4009810c66097689b17751f70363e6108ebd7
Submitter: Zuul
Branch: master

commit c9b4009810c66097689b17751f70363e6108ebd7
Author: Dmitrii Shcherbakov <email address hidden>
Date: Wed Aug 12 16:40:05 2020 +0300

    Allow service actions to be overridden

    In order to allow child classes to use other means of controlling
    service lifetime, this change introduces methods that call
    systemd/upstart via charm-helpers by default but allow for something
    else to be used in overrides.

    The example use-case in question is the use of crm tools to restart
    services:

    crm_resource --restart --wait --node <node-name>

    Which will restart a resource on a given node if it is running there or
    return ENXIO otherwise.

    Change-Id: Ifb16f6743296a1ef6dcb212cad517afd57270f7f
    Closes-Bug: #1891160
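
The override pattern described in the commit message can be sketched as follows. This is an illustration only: `BaseCharm`, `PacemakerManagedCharm`, the `resource_names` mapping, and the method names are hypothetical stand-ins, not the actual charms.openstack classes, and the methods record the action they would take instead of calling systemd or crm_resource.

```python
class BaseCharm:
    """Hypothetical stand-in for a base charm class."""

    def __init__(self):
        self.actions = []  # records what would be executed

    def service_restart(self, service_name):
        # Default behaviour: restart through the init system.
        self.actions.append(("init-restart", service_name))

    def restart_services(self, services):
        # Callers go through the overridable method, never the init system directly.
        for service in services:
            self.service_restart(service)


class PacemakerManagedCharm(BaseCharm):
    """Child class that routes cluster-managed services to Pacemaker."""

    # Hypothetical mapping from service name to cluster resource name.
    resource_names = {"nfs-ganesha": "res_nfs_ganesha"}

    def service_restart(self, service_name):
        resource = self.resource_names.get(service_name)
        if resource is None:
            # Not cluster-managed: fall back to the default behaviour.
            super().service_restart(service_name)
            return
        # Here a real charm would invoke the crm tooling for this resource.
        self.actions.append(("crm-restart", resource))


charm = PacemakerManagedCharm()
charm.restart_services(["nfs-ganesha", "manila-share"])
print(charm.actions)
```

The point of the design is that the restart loop stays in the base class while the decision about how a given service is restarted moves into a single overridable method.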

Changed in charms.openstack:
status: In Progress → Fix Released
Changed in charms.openstack:
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
importance: Undecided → High