False positives of host failure

Bug #1947676 reported by Rikimaru Honjo
This bug affects 1 person
Affects: masakari
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

[My environment]

I'm using the latest masakari with pacemaker-remote. Pacemaker runs on the OpenStack controller nodes, and pacemaker-remote runs on the OpenStack compute nodes.

[Pacemaker remote behavior]
I observed the following behavior of pacemaker-remote.

1. One of the OpenStack controller nodes goes down.
2. Remote resources associated with the down node go offline for a moment[1], but the compute nodes keep working correctly.
3. The remote resources are automatically migrated to another controller node and go online again.

[1]
=================
    <nodes>
        <node name="masakari_cp01" id="masakari_cp01" online="false" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="false" is_dc="false" resources_running="0" type="remote" />
        <node name="masakari_cp02" id="masakari_cp02" online="true" standby="false" standby_onfail="false" maintenance="false" pending="false" unclean="false" shutdown="false" expected_up="false" is_dc="false" resources_running="1" type="remote" />
    </nodes>
=================
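To make the snapshot in [1] concrete, here is a minimal Python sketch that lists the remote nodes reported as offline in such a <nodes> fragment. It only illustrates what a single check sees at that moment; it is not how masakari-hostmonitor parses Pacemaker output. The function name offline_remote_nodes and the trimmed SAMPLE (only the attributes used here) are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed copy of the <nodes> fragment from [1]; only the attributes
# this sketch reads are kept.
SAMPLE = """
<nodes>
    <node name="masakari_cp01" id="masakari_cp01" online="false" type="remote" />
    <node name="masakari_cp02" id="masakari_cp02" online="true" type="remote" />
</nodes>
"""


def offline_remote_nodes(nodes_xml: str) -> List[str]:
    """Return the names of remote nodes whose online attribute is "false"."""
    root = ET.fromstring(nodes_xml.strip())
    return [
        node.get("name")
        for node in root.iter("node")
        if node.get("type") == "remote" and node.get("online") == "false"
    ]


print(offline_remote_nodes(SAMPLE))
```

A single snapshot like this cannot tell a brief migration blip from a real host failure, which is the problem described under [Issue].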

[Issue]
On one occasion, masakari-hostmonitor checked the online/offline status at the very moment the remote resources were briefly offline.

As a result, the compute nodes were stopped by STONITH and the instances on them were evacuated, even though the compute nodes themselves had no problems.

I think a temporary offline state like this should not be detected as a host failure.

Radosław Piliszek (yoctozepto) wrote :

As Pacemaker is not too robust w.r.t. these temporary issues, we have introduced a new feature in Wallaby to help with that: https://docs.openstack.org/releasenotes/masakari-monitors/wallaby.html#new-features

"""
Support for repeated check of node status in hostmonitor.

Repeated check is more reliable than single check to determine host status, especially when there is network instability in play.

With this feature, the following config option can be set.

  [host]
  monitoring_samples = 3

The above means 3 checks will be done before the node status is decided. The default value is 1 which is backwards compatible.
"""
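The idea behind repeated checks can be sketched in a few lines of Python. This is only an illustration under an assumption: the release note does not say how the samples are combined, so the sketch assumes a host is reported offline only when every sample sees it offline. The names host_is_offline, check_online and interval are hypothetical and are not taken from masakari-monitors.

```python
import time
from typing import Callable


def host_is_offline(check_online: Callable[[], bool],
                    samples: int = 1,
                    interval: float = 0.0) -> bool:
    """Report offline only if all `samples` checks see the host offline.

    Assumption for this sketch: a single online reading is enough to
    treat the host as healthy. samples=1 behaves like a single check.
    """
    for attempt in range(samples):
        if check_online():
            return False
        if attempt < samples - 1:
            time.sleep(interval)  # wait before taking the next sample
    return True


# Simulated readings: offline for one check, then online again,
# as in the brief remote-resource migration described in the report.
readings = iter([False, True, True])
print(host_is_offline(lambda: next(readings), samples=3))
```

With samples=1 the same simulated readings would report the host as offline, which corresponds to the false positive in this report.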

We are also working on integrating Consul-based monitoring in Yoga which is supposed to be more resilient.

Radosław Piliszek (yoctozepto) wrote :

(And STONITH can be configured in Pacemaker. Masakari is oblivious to its existence.)

Rikimaru Honjo (honjo-rikimaru-c6) wrote :

Oops... Thanks a lot.

I will check whether the suggested setting helps.

Radosław Piliszek (yoctozepto) wrote :

Sure thing. I will mark this as "Invalid" for now.

Changed in masakari:
status: New → Invalid