fence_scsi and fence_mpath configuration issues (e.g. /var/run/cluster/fence_scsi.key)

Bug #1864404 reported by Rafael David Tinoco
This bug affects 1 person
Affects                 Status    Importance  Assigned to  Milestone
fence-agents (Ubuntu)   Invalid   Medium      Unassigned
  Bionic                Invalid   Medium      Unassigned
  Eoan                  Invalid   Medium      Unassigned
  Focal                 Invalid   Medium      Unassigned

Bug Description

This bug's intent is to check whether the fence_scsi and fence_mpath agents are working in all supported Ubuntu versions. This is needed because both agents are very prone to errors and, depending on how they are configured, a wide range of errors can occur.

# fence-agents:

Both agents, fence_scsi and fence_mpath, are prone to errors.

## fence_scsi:

You may find the following cluster resource manager errors:

Failed Actions:
* fence_bionic_start_0 on clubionic01 'unknown error' (1): call=8, status=Error, exitreason='', last-rc-change='Mon Feb 24 03:20:28 2020', queued=0ms, exec=1132ms

And the logs show:

Feb 24 03:20:31 clubionic02 fence_scsi[14072]: Failed: Cannot open file "/var/run/cluster/fence_scsi.key"
Feb 24 03:20:31 clubionic02 fence_scsi[14072]: Please use '-h' for usage

The fence_scsi agent is responsible for creating those files on the fly and this error might be related to how the fence agent was configured in pacemaker.
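
For reference, a fence_scsi stonith resource configured via crmsh typically looks something like the sketch below. This is not the configuration from this bug: the resource name and the devices= path are placeholders, the host names are simply the ones seen in the logs above, and the parameters shown are standard pacemaker/fence_scsi options; adjust everything to your cluster.

primitive fence-scsi stonith:fence_scsi \
    params devices="/dev/sdX" pcmk_host_list="clubionic01 clubionic02 clubionic03" \
           pcmk_monitor_action=metadata pcmk_reboot_action=off \
    meta provides=unfencing target-role=Started

The provides=unfencing meta attribute makes pacemaker run the agent's "on" (unfencing) action when a node joins, which is when fence_scsi writes /var/run/cluster/fence_scsi.key; a missing or broken unfencing step is one plausible cause of the "Cannot open file" error above.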

## fence_mpath:

You may find it very difficult to configure fence_mpath to work flawlessly; try to follow the comments in this bug.

Changed in fence-agents (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu):
status: Confirmed → New
importance: High → Undecided
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

# WORKAROUND:

A way to work around this issue is the following... in a 3-node cluster you can do this:

# node01

create a file /etc/fence_scsi.key containing: 62ed0000

# node02

create a file /etc/fence_scsi.key containing: 62ed0001

# node03

create a file /etc/fence_scsi.key containing: 62ed0002

On all 3 nodes, create a file /etc/tmpfiles.d/fence_scsi.conf containing:

L /var/run/cluster/fence_scsi.key - - - - /etc/fence_scsi.key

This will make systemd create a symlink /var/run/cluster/fence_scsi.key pointing to your /etc/fence_scsi.key file, and will allow the fence_scsi agent to use that file for its SCSI persistent reservations. (A command sketch of these steps follows the verification output below.)

After creating those files, reboot all the nodes and check that the file got created after boot:

rafaeldtinoco@clubionic01:~$ ls /var/run/cluster/fence_scsi.key
/var/run/cluster/fence_scsi.key

rafaeldtinoco@clubionic01:~$ cat /var/run/cluster/fence_scsi.key
62ed0000
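
The same workaround expressed as commands, here for node01 (use 62ed0001 on node02 and 62ed0002 on node03). The systemd-tmpfiles call at the end is just a way to apply the new tmpfiles entry right away instead of waiting for a reboot:

$ echo 62ed0000 | sudo tee /etc/fence_scsi.key
$ echo 'L /var/run/cluster/fence_scsi.key - - - - /etc/fence_scsi.key' | \
    sudo tee /etc/tmpfiles.d/fence_scsi.conf
$ sudo systemd-tmpfiles --create /etc/tmpfiles.d/fence_scsi.conf
$ cat /var/run/cluster/fence_scsi.key
62ed0000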

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Make sure to also create the "fence_scsi.dev" file the same way:

# node01

create a file /etc/fence_scsi.dev containing: /dev/disk/by-path/acpi-VMBUS:01-scsi-0:0:0:0

# node02

create a file /etc/fence_scsi.dev containing: /dev/disk/by-path/acpi-VMBUS:01-scsi-0:0:0:0

# node03

create a file /etc/fence_scsi.dev containing: /dev/disk/by-path/acpi-VMBUS:01-scsi-0:0:0:0

Note: of course, change the contents of the file to your shared disk device's "by-path" path.
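
One way to find the right "by-path" name for the shared disk is to list the symlinks under /dev/disk/by-path and resolve the one pointing at your shared device; the sdb target below is only an example, the output differs per machine:

$ ls -l /dev/disk/by-path/
... acpi-VMBUS:01-scsi-0:0:0:0 -> ../../sdb
$ readlink -f /dev/disk/by-path/acpi-VMBUS:01-scsi-0:0:0:0
/dev/sdb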

Then, on all 3 nodes, add a 2nd line to the /etc/tmpfiles.d/fence_scsi.conf file:

L /var/run/cluster/fence_scsi.key - - - - /etc/fence_scsi.key
L /var/run/cluster/fence_scsi.dev - - - - /etc/fence_scsi.dev

And reboot the nodes to check that the files get created after boot:

rafaeldtinoco@clubionic01:~$ sudo ls /var/run/cluster
fence_scsi.dev fence_scsi.key

rafaeldtinoco@clubionic02:~$ sudo ls /var/run/cluster
fence_scsi.dev fence_scsi.key

rafaeldtinoco@clubionic03:~$ sudo ls /var/run/cluster
fence_scsi.dev fence_scsi.key

no longer affects: fence-agents (Ubuntu Xenial)
summary: - [bionic] fence_scsi cannot open /var/run/cluster/fence_scsi.key (does
- not exist)
+ fence_scsi cannot open /var/run/cluster/fence_scsi.key (does not exist)
+ after nodes are rebooted
Changed in fence-agents (Ubuntu Bionic):
status: New → Confirmed
Changed in fence-agents (Ubuntu Eoan):
status: New → Confirmed
Changed in fence-agents (Ubuntu Focal):
status: New → Confirmed
Changed in fence-agents (Ubuntu Bionic):
importance: Undecided → High
Changed in fence-agents (Ubuntu Eoan):
importance: Undecided → High
Changed in fence-agents (Ubuntu Focal):
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Focal):
status: Confirmed → Fix Released
Changed in fence-agents (Ubuntu Focal):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in fence-agents (Ubuntu):
status: Fix Released → Confirmed
Changed in fence-agents (Ubuntu Focal):
status: Fix Released → Confirmed
Changed in fence-agents (Ubuntu Bionic):
importance: High → Medium
Changed in fence-agents (Ubuntu Focal):
importance: High → Medium
Changed in fence-agents (Ubuntu Eoan):
importance: High → Medium
Changed in fence-agents (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
importance: High → Medium
Changed in fence-agents (Ubuntu):
status: Confirmed → Fix Released
description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

For Groovy:

# fence_mpath

node 1: clusterg01
node 2: clusterg02
node 3: clusterg03
primitive fence-mpath-clusterg01 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg01 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
primitive fence-mpath-clusterg02 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg02 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
primitive fence-mpath-clusterg03 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg03 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.3-4b1f869f0f \
    cluster-infrastructure=corosync \
    cluster-name=clusterg \
    stonith-enabled=true \
    no-quorum-policy=stop \
    last-lrm-refresh=1590773755

--

$ crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: clusterg01 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Mon Jun 1 04:17:28 2020
  * Last change: Mon Jun 1 04:07:10 2020 by root via cibadmin on clusterg03
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ clusterg01 clusterg02 clusterg03 ]

Full List of Resources:
  * fence-mpath-clusterg01 (stonith:fence_mpath): Started clusterg01
  * fence-mpath-clusterg02 (stonith:fence_mpath): Started clusterg02
  * fence-mpath-clusterg03 (stonith:fence_mpath): Started clusterg03

--

(k)rafaeldtinoco@clusterg02:~$ sudo mpathpersist --in -r /dev/mapper/volume01
  PR generation=0x11, Reservation follows:
   Key = 0x59450001
  scope = LU_SCOPE, type = Write Exclusive, registrants only

(k)rafaeldtinoco@clusterg02:~$ sudo mpathpersist --in -k /dev/mapper/volume01
  PR generation=0x11, 12 registered reservation keys follow:
    0x59450001
    0x59450001
    0x59450001
    0x59450001
    0x59450000
    0x59450000
    0x59450000
    0x59450000
    0x59450002
    0x59450002
    0x59450002
    0x59450002
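
Twelve registrations is the expected picture here: each of the three nodes registers its key once per path, so this listing suggests 4 paths per node in this setup. For illustration only (registration is normally left to the agent), registering a key by hand, roughly what fence_mpath's "on"/unfencing action does, and re-listing the keys would look like:

$ sudo mpathpersist --out --register-ignore --param-sark=0x59450002 /dev/mapper/volume01
$ sudo mpathpersist --in -k /dev/mapper/volume01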

-- when removing communication in between all nodes and clusterg01:

(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -k /dev/mapper/volume01
  PR generation=0x12, 8 registered reservation keys follow:
    0x59450001
    0x59450001
    0x59450001
    0x59450001
    0x59450002
    0x59450002
    0x59450002
    0x59450002

(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -r /dev/mapper/volume01
  PR generation=0x12, Reservation follows:
   Key = 0x59450001
  scope = LU_SCOPE, type = Write Exclusive, registrants only

and

Node List:
  * Node clusterg01: UNCLEAN (offline)
  * Online: [ clusterg02 clusterg03 ]

Full List of Resources:
  * fence-mpath-clusterg01 (stonith:fence_mpath): Started [ clusterg01 clusterg02 ]
  * fence-mpath-clusterg02 (stonith:fence_mpath): Started clusterg03
  * fence-mpath-clusterg03 (stonith:fence_mpath): Started clusterg03

Pending Fencing Actions:
  * reboot of clusterg01 pending: client=pacemaker-controld.906, origin=clusterg02

and watchdog on host clusterg01 rebooted it. After reboot, only a single path has
came set the reservat...


description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

For the previous case (fence_mpath in Groovy), when the fenced node is rebooted automatically (by the watchdog monitoring /dev/watchdog from the softdog module), the pacemaker systemd unit should perhaps start only AFTER iscsid, open-iscsi AND multipath-tools have already started. This would give all paths time to come online. PROBLEM: adding the paths might take longer than the services take to tell systemd they are up. Note: don't fully trust automated reboots for unfencing a node with mpath; do it manually.
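
One way to express that ordering is a systemd drop-in for pacemaker.service (for example via "sudo systemctl edit pacemaker.service"), sketched below. The unit names assume the Ubuntu packages, and, as noted above, paths may still come up after those units report started, so this alone may not be enough:

[Unit]
After=iscsid.service open-iscsi.service multipathd.service

If the drop-in is created by hand under /etc/systemd/system/pacemaker.service.d/, run "sudo systemctl daemon-reload" afterwards.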

summary: - fence_scsi cannot open /var/run/cluster/fence_scsi.key (does not exist)
- after nodes are rebooted
+ fence_scsi and fence_mpath configuration issues (e.g.
+ /var/run/cluster/fence_scsi.key)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Turns out pacemaker.service already declares:

# Some OCF resources may have dependencies that aren't managed by the cluster;
# these must be started before Pacemaker and stopped after it. The
# resource-agents package provides this target, which lets system adminstrators
# add drop-ins for those dependencies.
After=resource-agents-deps.target
Wants=resource-agents-deps.target

Meaning that we just have to add OCF resource (or fencing agent) dependencies to this target... but it does not work =). I'm discussing this with upstream:

https://lists.clusterlabs.org/pipermail/users/2020-June/027207.html
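
For reference, the drop-in mechanism that comment describes would look roughly like this (the file name is arbitrary); as noted above, in this case it did not solve the problem:

# /etc/systemd/system/resource-agents-deps.target.d/iscsi-mpath.conf
[Unit]
Requires=iscsid.service open-iscsi.service multipathd.service
After=iscsid.service open-iscsi.service multipathd.service

followed by "sudo systemctl daemon-reload".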

Changed in fence-agents (Ubuntu):
status: Fix Released → In Progress
Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release

Changed in fence-agents (Ubuntu Eoan):
status: Confirmed → Won't Fix
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

This bug was serving me as a base to track tested fencing agents in Bionic, Focal and Groovy. There is no more reason to keep it open, so I am closing it as invalid.

Changed in fence-agents (Ubuntu Bionic):
status: Confirmed → Invalid
Changed in fence-agents (Ubuntu Eoan):
status: Won't Fix → Invalid
Changed in fence-agents (Ubuntu Focal):
status: Confirmed → Invalid
Changed in fence-agents (Ubuntu):
status: In Progress → Invalid