A pacemaker node fails monitor (probe) and stop/start operations on a resource because it returns "rc=189"
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| pacemaker (Ubuntu) | Fix Released | Medium | Unassigned | |
| Bionic | In Progress | Medium | Jorge Niedbalski | |
| Focal | Fix Released | Medium | Unassigned | |
| Groovy | Fix Released | Medium | Unassigned | |
## Bug Description
Cause: Pacemaker implicitly ordered all stops needed on a Pacemaker Remote node before the stop of the node's Pacemaker Remote connection, including stops that were implied by fencing of the node. Also, Pacemaker scheduled actions on Pacemaker Remote nodes with a failed connection so that the actions could be done once the connection is recovered, even if the connection wasn't being recovered (for example, if the node was shutting down when the failure occurred).
Consequence: If a Pacemaker Remote node needed to be fenced while it was in the process of shutting down, once the fencing completed Pacemaker scheduled probes on the node. The probes failed because the connection was not actually active. Due to the failed probe, a stop was scheduled, which also failed, leading to fencing of the node again, and the situation repeated indefinitely.
Fix: Pacemaker Remote connection stops are no longer ordered after implied stops, and actions are not scheduled on Pacemaker Remote nodes when the connection is failed and not being started again.
Result: A Pacemaker Remote node that needs to be fenced while it is in the process of shutting down is fenced once, without repeating indefinitely.
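The failure loop and the effect of the fix can be sketched as a toy simulation. This is illustrative only, not Pacemaker code: the `simulate` function, the `SCHEDULE_WHEN_FAILED` flag, and the loop cap of 5 are invented for the sketch.

```shell
#!/bin/sh
# Toy model of the behavior described above (not Pacemaker code).
# SCHEDULE_WHEN_FAILED=yes models the old scheduler: probes/stops are
# scheduled on the dead connection, fail, and trigger another fence.
# SCHEDULE_WHEN_FAILED=no models the fix: nothing is scheduled on a
# failed connection that is not being recovered, so the node is fenced once.
simulate() {
    SCHEDULE_WHEN_FAILED=$1
    fences=0
    while [ "$fences" -lt 5 ]; do                   # cap stands in for "indefinitely"
        fences=$((fences + 1))                      # fence the node
        [ "$SCHEDULE_WHEN_FAILED" = yes ] || break  # fixed behavior: stop after one fence
        # old behavior: probe fails (rc=189) -> stop fails -> fence again
    done
    echo "$fences"
}
simulate yes   # old behavior: prints 5 (hits the cap; would repeat forever)
simulate no    # fixed behavior: prints 1
```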
The fix appears to be included in pacemaker-
Related to https:/
Changed in pacemaker (Ubuntu Groovy):
status: New → Fix Released
Changed in pacemaker (Ubuntu Focal):
status: New → Fix Released
Changed in pacemaker (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in pacemaker (Ubuntu Bionic):
importance: Undecided → Medium
Changed in pacemaker (Ubuntu Focal):
importance: Undecided → Medium
Changed in pacemaker (Ubuntu Groovy):
importance: Undecided → Medium
I am able to reproduce a similar issue with the following bundle: https://paste.ubuntu.com/p/VJ3m7nMN79/
Resource created with
sudo pcs resource create test2 ocf:pacemaker:Dummy op_sleep=10 op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-10.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-11.cloud.sts"
juju ssh nova-cloud-controller/2 "sudo pcs constraint location test2 prefers juju-acda3d-pacemaker-remote-12.cloud.sts"
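For context on the 10-second window this reproduction relies on: the ocf:pacemaker:Dummy agent sleeps op_sleep seconds inside each start/stop/monitor operation. Below is a simplified sketch of that behavior under stated assumptions; `dummy_stop` and `STATE_FILE` are stand-ins, not the real agent's code.

```shell
#!/bin/sh
# Simplified sketch (stand-in, not the real ocf:pacemaker:Dummy agent):
# each operation sleeps op_sleep seconds, so "op_sleep=10" opens a
# 10-second window during which a forced node shutdown interrupts the
# in-flight stop operation.
OCF_RESKEY_op_sleep=${OCF_RESKEY_op_sleep:-10}
STATE_FILE=${STATE_FILE:-/tmp/Dummy.state}

dummy_stop() {
    sleep "$OCF_RESKEY_op_sleep"   # killing the node during this sleep aborts the stop
    rm -f "$STATE_FILE"            # the real agent clears its state file on stop
    return 0                       # OCF_SUCCESS
}
```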
Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-10.cloud.sts juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]

Full list of resources:

 Resource Group: grp_nova_vips
     res_nova_bf9661e_vip (ocf::heartbeat:IPaddr2): Started juju-acda3d-pacemaker-remote-7
 Clone Set: cl_nova_haproxy [res_nova_haproxy]
     Started: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
 juju-acda3d-pacemaker-remote-10.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-12.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-8
 juju-acda3d-pacemaker-remote-11.cloud.sts (ocf::pacemaker:remote): Started juju-acda3d-pacemaker-remote-7
 test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-10.cloud.sts
## After running the following commands on juju-acda3d-pacemaker-remote-10.cloud.sts
1) sudo systemctl stop pacemaker_remote
2) forcefully shut down the instance (openstack server stop xxxx) less than 10 seconds after the pacemaker_remote stop is executed.
Remote is shut down:
RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]
The resource status remains Stopped across the 3 machines and doesn't recover.
$ juju run --application nova-cloud-controller "sudo pcs resource show | grep -i test2"
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/0
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/1
- Stdout: " test2\t(ocf::pacemaker:Dummy):\tStopped\n"
  UnitId: nova-cloud-controller/2
However, if I do a clean shutdown (without interrupting the pacemaker_remote stop), the resource ends up migrated correctly to another node.
6 nodes configured
9 resources configured
Online: [ juju-acda3d-pacemaker-remote-7 juju-acda3d-pacemaker-remote-8 juju-acda3d-pacemaker-remote-9 ]
RemoteOnline: [ juju-acda3d-pacemaker-remote-11.cloud.sts juju-acda3d-pacemaker-remote-12.cloud.sts ]
RemoteOFFLINE: [ juju-acda3d-pacemaker-remote-10.cloud.sts ]
Full list of resources:
[...]
 test2 (ocf::pacemaker:Dummy): Started juju-acda3d-pacemaker-remote-12.cloud.sts
I will keep investigating this behavior to determine whether it is linked to the reported bug.