False positives of host failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
masakari | Invalid | Undecided | Unassigned |
Bug Description
[My environment]
I'm using the latest masakari with pacemaker-remote. Pacemaker runs on the OpenStack controller nodes, and pacemaker-remote runs on the OpenStack compute nodes.
[Pacemaker remote behavior]
I found the following behavior of pacemaker-remote:
1. One of the OpenStack controller nodes goes down.
2. The remote resources associated with the down node go offline for a moment [1], but the compute nodes keep working correctly.
3. The remote resources are automatically migrated to the other controller node and go back online.
[1]
=================
<nodes>
<node name="masakari_
<node name="masakari_
</nodes>
=================
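To make the "offline for a moment" state above concrete, here is a minimal sketch of how a monitor might read node state from pacemaker's status XML (as produced by e.g. `crm_mon --as-xml`). The attribute values and the helper name are illustrative assumptions, not masakari's actual code:

```python
# Hypothetical sketch: detect which remote nodes the cluster reports
# as offline, given a <nodes> fragment like the one quoted above.
# The "online"/"type" attributes mirror pacemaker's status XML output.
import xml.etree.ElementTree as ET

STATUS_XML = """
<nodes>
  <node name="compute-1" online="true" type="remote"/>
  <node name="compute-2" online="false" type="remote"/>
</nodes>
"""

def offline_remote_nodes(xml_text):
    """Return the names of remote nodes currently reported offline."""
    root = ET.fromstring(xml_text)
    return [n.get("name")
            for n in root.iter("node")
            if n.get("type") == "remote" and n.get("online") == "false"]

print(offline_remote_nodes(STATUS_XML))  # ['compute-2']
```

A single snapshot like this is exactly what can mislead the monitor: during resource migration the node briefly reads `online="false"` even though the compute host itself is healthy.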
[Issue]
Once, masakari-
As a result, the compute nodes were stopped by STONITH and the instances on them were evacuated, even though the compute nodes didn't have any issues.
I think that a node being temporarily offline shouldn't be detected as a host failure.
As Pacemaker is not too robust with respect to these temporary issues, we have introduced a new feature in Wallaby to help with that: https://docs.openstack.org/releasenotes/masakari-monitors/wallaby.html#new-features
"""
Support for repeated check of node status in hostmonitor.
Repeated check is more reliable than single check to determine host status, especially when there is network instability in play.
With this feature, the following config option can be set.
[host]
monitoring_samples = 3
The above means 3 checks will be done before the node status is decided. The default value is 1 which is backwards compatible.
"""
We are also working on integrating Consul-based monitoring in Yoga which is supposed to be more resilient.