False positives of host failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
masakari | Invalid | Undecided | Unassigned |
Bug Description
[My environment]
I'm using the latest masakari with pacemaker-remote. Pacemaker runs on the OpenStack controller nodes, and pacemaker-remote runs on the OpenStack compute nodes.
[Pacemaker remote behavior]
I found the following behavior of pacemaker-remote:
1. One of the OpenStack controller nodes goes down.
2. The remote resources associated with the down node go offline for a moment [1], but the compute nodes keep working correctly.
3. The remote resources are automatically migrated to the other controller node and go back online.
[1]
=================
<nodes>
<node name="masakari_
<node name="masakari_
</nodes>
=================
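To make the "offline for a moment" state above concrete, here is a minimal sketch of how a monitor might read node state from pacemaker's status XML (as produced by e.g. `crm_mon --as-xml`). The attribute values and the helper name are illustrative assumptions, not masakari's actual code:

```python
# Hypothetical sketch: detect which remote nodes the cluster reports
# as offline, given a <nodes> fragment like the one quoted above.
# The "online"/"type" attributes mirror pacemaker's status XML output.
import xml.etree.ElementTree as ET

STATUS_XML = """
<nodes>
  <node name="compute-1" online="true" type="remote"/>
  <node name="compute-2" online="false" type="remote"/>
</nodes>
"""

def offline_remote_nodes(xml_text):
    """Return the names of remote nodes currently reported offline."""
    root = ET.fromstring(xml_text)
    return [n.get("name")
            for n in root.iter("node")
            if n.get("type") == "remote" and n.get("online") == "false"]

print(offline_remote_nodes(STATUS_XML))  # ['compute-2']
```

A single snapshot like this is exactly what can mislead the monitor: during resource migration the node briefly reads `online="false"` even though the compute host itself is healthy.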
[Issue]
Once, masakari-
As a result, the compute nodes were stopped by STONITH and the instances on them were evacuated, even though the compute nodes didn't have any issues.
I think that a node being temporarily offline shouldn't be detected as a host failure.
As Pacemaker is not too robust with respect to these temporary issues, we have introduced a new feature in Wallaby to help with that: https://docs.openstack.org/releasenotes/masakari-monitors/wallaby.html#new-features
"""
Support for repeated check of node status in hostmonitor.
Repeated check is more reliable than single check to determine host status, especially when there is network instability in play.
With this feature, the following config option can be set.
[host]
monitoring_samples = 3
The above means 3 checks will be done before the node status is decided. The default value is 1 which is backwards compatible.
"""
We are also working on integrating Consul-based monitoring in Yoga which is supposed to be more resilient.