it is not appropriate for pacemaker_remote to check host status

Bug #1937244 reported by zhaoleilc
This bug affects 2 people
Affects: masakari-monitors
Status: In Progress
Importance: Wishlist
Assigned to: suzhengwei
Milestone: (none listed)

Bug Description

To allow clusters to scale to dozens or even hundreds of nodes,
pacemaker-remote was introduced [1]. A physical host running the
pacemaker-remote service is called a remote node, and a node
running the full high-availability stack of corosync and all
pacemaker components is called a cluster node.

Hostmonitor distinguishes remote nodes from cluster nodes
by setting restrict_to_remotes. However, it is not appropriate
for pacemaker_remote to check host status, since the pacemaker_remote
service can establish only one network link between a cluster node
and a remote node, on port 3121 (the Pacemaker Remote default).

A production environment usually has multiple network interfaces,
such as the management, tenant and public networks. Evacuation
should be triggered when communication breaks down on multiple
networks rather than on just one. For example, live migration
might be a better response when only the tenant network is down.
A cluster node can establish multiple network links by using
corosync: corosync 2 supports two interfaces and corosync 3
supports more [2].
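
The policy described above can be sketched roughly as follows. This is only an illustration of the idea, not masakari-monitors code: the `Action` enum, the `decide_action` function and the network names ("management", "tenant", "public") are all hypothetical, and the per-network link status is assumed to come from some other monitor.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    NONE = "none"
    LIVE_MIGRATE = "live_migrate"
    EVACUATE = "evacuate"


def decide_action(link_up: Mapping[str, bool]) -> Action:
    """Pick a recovery action from per-network link status.

    Hypothetical policy: evacuate only when every monitored network
    is down, live-migrate when only the tenant network is down, and
    otherwise take no action.
    """
    if not link_up:
        raise ValueError("no networks are being monitored")
    # All links lost: the host is most likely really unreachable.
    if not any(link_up.values()):
        return Action.EVACUATE
    down = {name for name, up in link_up.items() if not up}
    # Only the tenant network failed: the host is still manageable.
    if down == {"tenant"}:
        return Action.LIVE_MIGRATE
    return Action.NONE
```

For example, `decide_action({"management": True, "tenant": False, "public": True})` would select live migration, while losing all three networks would select evacuation.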

In addition, it is dangerous to use pacemaker-remote in a production
environment. Specifically, the remote node is marked offline
if the pacemaker_remote service goes from active to down, and an
evacuation is triggered. This is misleading, since the node itself
may actually be healthy.
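
One way to guard against this false positive would be to confirm the failure through independent checks before evacuating. The sketch below is an assumption about how that could look, not existing hostmonitor behaviour: `host_is_really_down` and the probe callables are hypothetical names, and each probe stands for any check that does not depend on the pacemaker_remote service (for example, a reachability test on another network).

```python
from typing import Callable, Iterable


def host_is_really_down(remote_online: bool,
                        probes: Iterable[Callable[[], bool]]) -> bool:
    """Return True only if the node is reported offline AND no probe succeeds.

    remote_online: status derived from pacemaker_remote (assumed input).
    probes: hypothetical independent checks; each returns True if the
            host responded.
    """
    if remote_online:
        return False
    # Any successful probe means only the service failed, not the host.
    return not any(probe() for probe in probes)
```

With no probes configured, the function falls back to trusting the pacemaker_remote status alone, which matches the behaviour described in this report.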

[1] https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Remote/html/intro.html#overview
[2] https://github.com/corosync/corosync/issues/465

zhaoleilc (zhaoleilc)
description: updated
Radosław Piliszek (yoctozepto) wrote :

In practice, however, Pacemaker Remote is used for this and other purposes. We (as in Masakari team) are aware of the limitations of the Pacemaker stack (and the needless burden of extra features it brings with it) and are actively working on introducing an alternative in the form of Consul monitoring: https://blueprints.launchpad.net/masakari/+spec/host-monitor-by-consul
I hope this answers your report because there is nothing else I can offer here (other than discussing other alternatives but none were provided in the report - corosync has the original limitation of 16 nodes, it's the very reason why Pacemaker Remote is used instead as a quick hack).

Changed in masakari-monitors:
status: New → Incomplete
Radosław Piliszek (yoctozepto) wrote :

Please confirm if I understood you right and you agree with what I wrote. Then I can move this to "In progress"-"Wishlist" which actually reflects the reality pretty well.

zhaoleilc (zhaoleilc) wrote :

Thank you for your reply. I will give the Consul monitoring alternative a try!

Changed in masakari-monitors:
status: Incomplete → In Progress
importance: Undecided → Wishlist
assignee: nobody → suzhengwei (sue.sam)