[bionic] fence_scsi not working properly with Pacemaker 1.1.18-2ubuntu1.1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
fence-agents (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Medium | Rafael David Tinoco |
Disco | Won't Fix | Medium | Rafael David Tinoco |
Eoan | Fix Released | Undecided | Unassigned |
Bug Description
Note: I have split this bug into 2 bugs:
- fence-agents (this) and pacemaker (LP: #1866119)
#### SRU: fence-agents
[Impact]
* fence_scsi is not currently working in a shared disk environment
* all clusters relying on fence_scsi and/or fence_scsi + watchdog won't be able to start the fencing agents OR, in the worst case, the fence_scsi agent might start but won't make SCSI reservations on the shared SCSI disk.
[Test Case]
* having a 3-node setup, nodes called "clubionic01, clubionic02, clubionic03", with a shared scsi disk (fully supporting persistent reservations) /dev/sda, one might try the following command:
sudo fence_scsi --verbose -n clubionic01 -d /dev/sda -k 3abe0000 -o off
from nodes "clubionic02 or clubionic03" and check if the reservation worked:
(k)rafaeldtinoc
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoc
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
* having a 3-node setup, nodes called "clubionic01, clubionic02, clubionic03", with a shared scsi disk (fully supporting persistent reservations) /dev/sda, with corosync and pacemaker operational and running, one might try:
rafaeldtinoco@
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-
crm(live)configure# property have-watchdog=true
crm(live)configure# property symmetric-
crm(live)configure# commit
crm(live)configure# end
crm(live)# end
rafaeldtinoco@
stonith:
pcmk_
devices=
meta provides=unfencing
And see that crm_mon won't show fence_clubionic resource operational.
[Regression Potential]
* Fix involves adding new cmdline and stdin arguments to the fencing agents. Both changes in that direction (normalizing "-" with "_" and deprecating some commands in favor of others) keep the existing commands working and allow the new commands to work as well (that part is the fix, because of the integration with pacemaker).
* Comments #3 and #4 show this new version fully working.
* This is quite a complex change and I'd appreciate leaving it in -proposed for a while longer (15 days?) for a higher chance to detect issues. Furthermore, there has been no update since the bionic release, so users could, in the worst case (and only then), report a bug and downgrade to the former version.
* Judging by this issue, it is very likely that any Ubuntu user who tried using fence_scsi has migrated to a newer version, because the fence_scsi agent has been broken since its release.
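The "-" versus "_" normalization and the handling of deprecated option names described above can be pictured with a small sketch. This is not the code from the fence-agents patches: the function names and the alias table are hypothetical, chosen only to show how old and new spellings could resolve to the same option so that existing configurations keep working.

```python
# Illustrative alias table (an assumption, not the package's real list):
# deprecated option name -> preferred option name.
DEPRECATED_ALIASES = {"port": "plug"}


def normalize_option_name(name):
    """Return one canonical spelling for an option name.

    Dashes and underscores are treated as equivalent, and a deprecated
    name is mapped to its preferred replacement.
    """
    canonical = name.strip().lower().replace("-", "_")
    return DEPRECATED_ALIASES.get(canonical, canonical)


def parse_stdin_options(lines):
    """Parse 'name=value' lines into a dict keyed by canonical names."""
    options = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and anything that is not name=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[normalize_option_name(name)] = value.strip()
    return options


if __name__ == "__main__":
    # Both spellings and the deprecated name are accepted.
    print(parse_stdin_options(["power-wait=5", "port=clubionic01"]))
```

With this approach both the old and the new spellings end up under the same key, which is the property the regression analysis relies on.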
[Other Info]
* The way I fixed fence_scsi was this:
I packaged the latest 1.1.X version of pacemaker and kept it "vanilla" so I could bisect fence-agents. At that point I realized that bisecting was going to be hard because there were multiple issues, not just one. I backported the latest fence-agents together with Pacemaker 1.1.19-0 and saw that it worked.
From then on, I bisected the following intervals:
4.3.0 .. 4.4.0 (eoan - working)
4.2.0 .. 4.3.0
4.1.0 .. 4.2.0
4.0.25 .. 4.1.0 (bionic - not working)
In each of those intervals I discovered issues. For example, using 4.3.0 I faced problems, so I had to backport fixes from the 4.3.0 <-> 4.4.0 interval. Then, backporting 4.2.0, I faced issues, so I had to backport fixes from the 4.2.0 <-> 4.3.0 interval. I did this until I reached 4.0.25, the current Bionic fence-agents version.
* Original Description:
Trying to set up a cluster with an iSCSI shared disk, using fence_scsi as the fencing mechanism, I realized that fence_scsi is not working in Ubuntu Bionic. I first thought it was related to the Azure environment (LP: #1864419), where I was testing this setup, but then, trying locally, I found that pacemaker 1.1.18 is somehow not fencing the shared SCSI disk properly.
Note: I was able to "backport" vanilla 1.1.19 from upstream and fence_scsi worked. I then tried 1.1.18 without any of the quilt patches and it didn't work either. I think that bisecting 1.1.18 <-> 1.1.19 might tell us which commit fixed the behaviour needed by the fence_scsi agent.
(k)rafaeldtinoc
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
params pcmk_host_
meta provides=unfencing
property cib-bootstrap-
----
(k)rafaeldtinoc
Stack: corosync
Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Mar 2 15:55:30 2020
Last change: Mon Mar 2 15:45:33 2020 by root via cibadmin on clubionic01.private
3 nodes configured
1 resource configured
Online: [ clubionic01.private clubionic02.private clubionic03.private ]
Active resources:
fence_clubionic (stonith:
----
(k)rafaeldtinoc
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there are NO registered reservation keys
(k)rafaeldtinoc
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x0, there is NO reservation held
Related branches
- Christian Ehrhardt (community): Approve
- Canonical Server Core Reviewers: Pending requested
- Canonical Server: Pending requested
- Diff: 6068 lines (+5964/-7), 12 files modified
debian/changelog (+24/-0)
debian/patches/lp1865523-01-fencing-add-consistency-cmdline-stdin.patch (+158/-0)
debian/patches/lp1865523-02-fix-for-ignored-options.patch (+115/-0)
debian/patches/lp1865523-03-Maintain-ABI-compatibility.patch (+30/-0)
debian/patches/lp1865523-04-fence_scsi-Remove-period.patch (+26/-0)
debian/patches/lp1865523-05-fence_scsi-fix-python3-encoding.patch (+35/-0)
debian/patches/lp1865523-06-fence_scsi-fixes-around-node-id.patch (+123/-0)
debian/patches/lp1865523-07-fence-metadata-update.xml (+5441/-0)
debian/patches/remove-fence_amt_ws (+2/-2)
debian/patches/reproducible (+1/-1)
debian/patches/series (+7/-0)
debian/patches/zvm-stdint (+2/-4)
Changed in pacemaker (Ubuntu): | |
status: | New → Confirmed |
importance: | Undecided → Medium |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
Changed in pacemaker (Ubuntu): | |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
importance: | Medium → Undecided |
status: | Confirmed → Triaged |
no longer affects: | pacemaker (Ubuntu Focal) |
no longer affects: | pacemaker (Ubuntu Eoan) |
no longer affects: | pacemaker (Ubuntu Disco) |
Changed in pacemaker (Ubuntu): | |
status: | Triaged → Fix Released |
Changed in pacemaker (Ubuntu Bionic): | |
status: | New → Confirmed |
importance: | Undecided → High |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
Changed in fence-agents (Ubuntu Eoan): | |
status: | New → Fix Released |
Changed in fence-agents (Ubuntu Focal): | |
status: | New → Fix Released |
Changed in fence-agents (Ubuntu Disco): | |
status: | New → Confirmed |
Changed in fence-agents (Ubuntu Bionic): | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Changed in fence-agents (Ubuntu Disco): | |
importance: | Undecided → Medium |
Changed in pacemaker (Ubuntu Bionic): | |
importance: | High → Medium |
Changed in fence-agents (Ubuntu Bionic): | |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
Changed in fence-agents (Ubuntu Disco): | |
assignee: | nobody → Rafael David Tinoco (rafaeldtinoco) |
description: | updated |
description: | updated |
no longer affects: | fence-agents (Ubuntu Focal) |
Changed in fence-agents (Ubuntu Bionic): | |
status: | Confirmed → In Progress |
no longer affects: | pacemaker (Ubuntu) |
no longer affects: | pacemaker (Ubuntu Bionic) |
description: | updated |
description: | updated |
description: | updated |
description: | updated |
To make this work I had to use pacemaker from upstream (Vanilla) version: 1.1.19-0
$ dpkg -l | grep 1.1.19 | awk '{print $2" "$3}'
libcib4:amd64 1.1.19-0ubuntu1
libcrmcluster4:amd64 1.1.19-0ubuntu1
libcrmcommon3:amd64 1.1.19-0ubuntu1
libcrmservice3:amd64 1.1.19-0ubuntu1
liblrmd1:amd64 1.1.19-0ubuntu1
libpe-rules2:amd64 1.1.19-0ubuntu1
libpe-status10:amd64 1.1.19-0ubuntu1
libpengine10:amd64 1.1.19-0ubuntu1
libstonithd2:amd64 1.1.19-0ubuntu1
libtransitioner2:amd64 1.1.19-0ubuntu1
pacemaker 1.1.19-0ubuntu1
pacemaker-cli-utils 1.1.19-0ubuntu1
pacemaker-common 1.1.19-0ubuntu1
pacemaker-doc 1.1.19-0ubuntu1
pacemaker-resource-agents 1.1.19-0ubuntu1
AND fence-agents from Ubuntu Eoan:
fence-agents 4.2.1-1
Only with that combination was I able to make the fence_scsi agent work:
(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
    params pcmk_host_list="clubionic01.private clubionic02.private clubionic03.private" devices="/dev/sda" \
    meta provides=unfencing
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=1.1.19-1.1.19 \
    cluster-infrastructure=corosync \
    cluster-name=clubionic \
    stonith-enabled=true \
    stonith-action=off \
    no-quorum-policy=stop
with proper reservations being made:
(k)rafaeldtinoco@clubionic03:~$ sudo sg_persist --in --read-keys --device=/dev/sda
LIO-ORG cluster.bionic. 4.0
Peripheral device type: disk
PR generation=0x4, 3 registered reservation keys follow:
0x3abe0002
0x3abe0000
0x3abe0001
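
When repeating this check across nodes, it can help to extract the registered keys from the sg_persist output programmatically instead of reading them by eye. The following is a minimal sketch, assuming the output has the same shape as the listing above; the function name and the sample text are illustrative and are not part of fence-agents or sg3_utils.

```python
import re

# A key line contains nothing but a hexadecimal value such as 0x3abe0000.
KEY_PATTERN = re.compile(r"^\s*(0x[0-9a-fA-F]+)\s*$")


def parse_registered_keys(output):
    """Return the keys listed in `sg_persist --in --read-keys` output.

    An empty list means no key lines were found, which is what the
    "there are NO registered reservation keys" case produces.
    """
    keys = []
    for line in output.splitlines():
        match = KEY_PATTERN.match(line)
        if match:
            keys.append(match.group(1).lower())
    return keys


# Sample text copied from the working 3-node cluster shown above.
SAMPLE = """\
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x4, 3 registered reservation keys follow:
    0x3abe0002
    0x3abe0000
    0x3abe0001
"""

if __name__ == "__main__":
    keys = parse_registered_keys(SAMPLE)
    # With unfencing working, one key per cluster node is expected.
    print(len(keys), sorted(keys))
```

In a real check the text would come from running the sg_persist command shown above on each node; the sketch parses a pasted sample so it can run without a shared disk.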