Pacemaker remote resources marked as unclean and not moved.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | In Progress | Undecided | Unassigned |
OpenStack Pacemaker Remote Charm | Invalid | Undecided | Unassigned |
Bug Description
ocf:pacemaker:
not move in the event of the node hosting them powering off.
This causes the pacemaker_
to fail to connect:
# systemctl status pacemaker_
● pacemaker_
Loaded: loaded (/lib/systemd/
Active: active (running) since Mon 2020-07-27 11:00:02 UTC; 1h 5min ago
Docs: man:pacemaker_
http://
Main PID: 76263 (pacemaker_remot)
Tasks: 1 (limit: 4915)
CGroup: /system.
└─76263 /usr/sbin/
Jul 27 11:00:02 node2 systemd[1]: Started Pacemaker Remote Service.
Jul 27 11:00:02 node2 pacemaker_
Jul 27 11:00:05 node2 pacemaker_
Jul 27 11:52:32 node2 pacemaker_
Jul 27 11:52:32 node2 pacemaker_
Jul 27 11:52:32 node2 pacemaker_
This in turn causes masakari host monitors to hang when checking
the state of their peers which prevents a host down notification
being sent to masakari.
The cause of this seems to be that the cluster has the global
stonith-
resource for the pacemaker remotes but there is no
stonith resource for the lxd container. In the event of a host
being lost, the cluster tries to power off the compute
node, which does have a stonith resource. It then tries to power
off the lxd container but can find no corresponding stonith
resource. This causes the container to be marked as unclean
and the resources are not moved.
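The reasoning above can be illustrated with a small toy model. This is only a sketch of the described behaviour, not Pacemaker's actual scheduler logic; the function name `classify_lost_nodes`, the state labels, and the node names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_lost_nodes(lost_nodes: Iterable[str],
                        fence_targets: Set[str]) -> Dict[str, str]:
    """Toy model of the outcome for each lost node when stonith is enabled.

    fence_targets is the set of node names covered by some stonith resource.
    """
    states = {}
    for node in lost_nodes:
        if node in fence_targets:
            # A stonith resource exists, so the node can be powered off
            # and its resources can then be recovered elsewhere.
            states[node] = "fenced"
        else:
            # No stonith resource: fencing cannot be confirmed, so the
            # node stays unclean and its resources are not moved.
            states[node] = "unclean"
    return states
```

For example, `classify_lost_nodes(["compute-1", "lxd-1"], {"compute-1"})` reports the compute node as fenced and the container as unclean, which matches the situation in this report.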
Changed in charm-pacemaker-remote:
status: New → Invalid
Reviewed: https://review.opendev.org/743742
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=b40a6754b0256058213afcde80174ca7e730a403
Submitter: Zuul
Branch: master
commit b40a6754b0256058213afcde80174ca7e730a403
Author: Liam Young <email address hidden>
Date: Wed Jul 29 11:59:43 2020 +0000
Create null stonith resource for lxd containers.
If stonith is enabled then when a compute node is detected as failed
it is powered down. This can include a lxd container which is also
part of the cluster. In this case because stonith is enabled at a
global level, pacemaker will try and power off the lxd container
too. But the container does not have a stonith device and this causes
the container to be marked as unclean (but not down). This running
unclean state prevents resources being moved and causes any
pacemaker-remotes that are associated with the lost container from
losing their connection which prevents masakari hostmonitor from
ascertaining the cluster health.
The way to work around this is to create a dummy stonith device for
the lxd containers. This allows the cluster to properly mark the lost
container as down and resources are relocated.
Change-Id: Ic45dbdd9d8581f25549580c7e98a8d6e0bf8c3e7
Partial-Bug: #1889094
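
The workaround described in the commit message (a dummy stonith device for the lxd containers) might look roughly like the following when expressed as a crm shell command. This is a sketch under stated assumptions, not the charm's actual code: it assumes the `stonith:null` agent and its `hostlist` parameter are available on the cluster nodes, and the resource name `st-null` and the helper `null_stonith_command` are illustrative only. The helper only builds the command string; it does not run anything.

```python
from typing import Sequence


def null_stonith_command(hostnames: Sequence[str],
                         resource_name: str = "st-null") -> str:
    """Build a crm command that defines a dummy stonith device.

    Assumes the stonith:null agent with a space-separated hostlist
    parameter; the resource name is a hypothetical default.
    """
    if not hostnames:
        raise ValueError("at least one hostname is required")
    for name in hostnames:
        # Reject names that would break the simple quoting used below.
        if not name or any(c.isspace() or c in "'\"" for c in name):
            raise ValueError(f"unsupported hostname: {name!r}")
    hostlist = " ".join(hostnames)
    return (
        f"crm configure primitive {resource_name} stonith:null "
        f"params hostlist='{hostlist}'"
    )
```

Calling `null_stonith_command(["lxd-1"])` yields a single `crm configure primitive` line naming the container. A null device never powers anything off, so it is only appropriate for nodes such as these containers, where the goal is to let the cluster mark the lost node as down.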