fence_scsi and fence_mpath configuration issues (e.g. /var/run/cluster/fence_scsi.key)

Bug #1864404 reported by Rafael David Tinoco on 2020-02-24
Affects                        Status        Importance   Assigned to
fence-agents (Ubuntu)          In Progress   Medium       Unassigned
fence-agents (Ubuntu Bionic)   Confirmed     Medium       Unassigned
fence-agents (Ubuntu Eoan)     Won't Fix     Medium       Unassigned
fence-agents (Ubuntu Focal)    Confirmed     Medium       Unassigned

Bug Description

This bug's intent is to check whether the fence_scsi and fence_mpath agents work in all supported Ubuntu versions. This is needed because both agents are very prone to errors and, depending on how they are configured, a wide range of errors can occur.

# fence-agents:

Both agents, fence_scsi and fence_mpath, are prone to errors.

## fence_scsi:

You may find the following cluster resource manager errors:

Failed Actions:
* fence_bionic_start_0 on clubionic01 'unknown error' (1): call=8, status=Error, exitreason='', last-rc-change='Mon Feb 24 03:20:28 2020', queued=0ms, exec=1132ms

And the logs show:

Feb 24 03:20:31 clubionic02 fence_scsi[14072]: Failed: Cannot open file "/var/run/cluster/fence_scsi.key"
Feb 24 03:20:31 clubionic02 fence_scsi[14072]: Please use '-h' for usage

The fence_scsi agent is responsible for creating these key files on the fly, so this error is likely related to how the fence agent was configured in Pacemaker.
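
For reference, a minimal fence_scsi primitive (crm shell syntax, matching the style of the Groovy example further below; the device path and host names here are placeholders) would look roughly like this:

primitive fence-scsi stonith:fence_scsi \
    params devices=/dev/sda pcmk_host_list="clubionic01 clubionic02 clubionic03" pcmk_reboot_action=off \
    meta provides=unfencing target-role=Started

The meta provides=unfencing attribute makes Pacemaker run the agent's "on" (unfence) action when a node joins the cluster, and it is typically during that unfencing step that /var/run/cluster/fence_scsi.key gets (re)created; if unfencing never happens, the key file is missing and later actions fail exactly as above.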

## fence_mpath:

You may find it very difficult to configure fence_mpath to work flawlessly; try to follow the comments in this bug.
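
As a minimal sketch of the moving parts (assumed from the fence_mpath documentation; the key and device values reuse the ones appearing later in this bug): each node gets its own persistent-reservation key, the same key is configured for multipathd in /etc/multipath.conf, and the stonith primitive for that node passes the key plus the multipath device.

/etc/multipath.conf on each node (one distinct key per node, e.g. the 0x5945000x keys seen in the mpathpersist output below):

defaults {
    reservation_key 0x59450001
}

and the stonith primitive for that node carries the same key and the multipath device, e.g. params key=0x59450001 devices=/dev/mapper/volume01, in addition to the pcmk_* options and meta provides=unfencing shown in the Groovy configuration below.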

Changed in fence-agents (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu):
status: Confirmed → New
importance: High → Undecided
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
no longer affects: fence-agents (Ubuntu Xenial)
summary: - [bionic] fence_scsi cannot open /var/run/cluster/fence_scsi.key (does
- not exist)
+ fence_scsi cannot open /var/run/cluster/fence_scsi.key (does not exist)
+ after nodes are rebooted
Changed in fence-agents (Ubuntu Bionic):
status: New → Confirmed
Changed in fence-agents (Ubuntu Eoan):
status: New → Confirmed
Changed in fence-agents (Ubuntu Focal):
status: New → Confirmed
Changed in fence-agents (Ubuntu Bionic):
importance: Undecided → High
Changed in fence-agents (Ubuntu Eoan):
importance: Undecided → High
Changed in fence-agents (Ubuntu Focal):
importance: Undecided → High
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Changed in fence-agents (Ubuntu Focal):
status: Confirmed → Fix Released
Changed in fence-agents (Ubuntu Focal):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in fence-agents (Ubuntu):
status: Fix Released → Confirmed
Changed in fence-agents (Ubuntu Focal):
status: Fix Released → Confirmed
Changed in fence-agents (Ubuntu Bionic):
importance: High → Medium
Changed in fence-agents (Ubuntu Focal):
importance: High → Medium
Changed in fence-agents (Ubuntu Eoan):
importance: High → Medium
Changed in fence-agents (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
importance: High → Medium
Changed in fence-agents (Ubuntu):
status: Confirmed → Fix Released
description: updated

For Groovy:

# fence_mpath

node 1: clusterg01
node 2: clusterg02
node 3: clusterg03
primitive fence-mpath-clusterg01 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg01 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
primitive fence-mpath-clusterg02 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg02 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
primitive fence-mpath-clusterg03 stonith:fence_mpath \
    params pcmk_on_timeout=70 pcmk_off_timeout=70 pcmk_host_list=clusterg03 pcmk_monitor_action=metadata pcmk_
    meta provides=unfencing target-role=Started
property cib-bootstrap-options: \
    have-watchdog=false \
    dc-version=2.0.3-4b1f869f0f \
    cluster-infrastructure=corosync \
    cluster-name=clusterg \
    stonith-enabled=true \
    no-quorum-policy=stop \
    last-lrm-refresh=1590773755

--

$ crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: clusterg01 (version 2.0.3-4b1f869f0f) - partition with quorum
  * Last updated: Mon Jun 1 04:17:28 2020
  * Last change: Mon Jun 1 04:07:10 2020 by root via cibadmin on clusterg03
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ clusterg01 clusterg02 clusterg03 ]

Full List of Resources:
  * fence-mpath-clusterg01 (stonith:fence_mpath): Started clusterg01
  * fence-mpath-clusterg02 (stonith:fence_mpath): Started clusterg02
  * fence-mpath-clusterg03 (stonith:fence_mpath): Started clusterg03

--

(k)rafaeldtinoco@clusterg02:~$ sudo mpathpersist --in -r /dev/mapper/volume01
  PR generation=0x11, Reservation follows:
   Key = 0x59450001
  scope = LU_SCOPE, type = Write Exclusive, registrants only

(k)rafaeldtinoco@clusterg02:~$ sudo mpathpersist --in -k /dev/mapper/volume01
  PR generation=0x11, 12 registered reservation keys follow:
    0x59450001
    0x59450001
    0x59450001
    0x59450001
    0x59450000
    0x59450000
    0x59450000
    0x59450000
    0x59450002
    0x59450002
    0x59450002
    0x59450002

-- when cutting communication between clusterg01 and all the other nodes:

(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -k /dev/mapper/volume01
  PR generation=0x12, 8 registered reservation keys follow:
    0x59450001
    0x59450001
    0x59450001
    0x59450001
    0x59450002
    0x59450002
    0x59450002
    0x59450002

(k)rafaeldtinoco@clusterg03:~$ sudo mpathpersist --in -r /dev/mapper/volume01
  PR generation=0x12, Reservation follows:
   Key = 0x59450001
  scope = LU_SCOPE, type = Write Exclusive, registrants only

and

Node List:
  * Node clusterg01: UNCLEAN (offline)
  * Online: [ clusterg02 clusterg03 ]

Full List of Resources:
  * fence-mpath-clusterg01 (stonith:fence_mpath): Started [ clusterg01 clusterg02 ]
  * fence-mpath-clusterg02 (stonith:fence_mpath): Started clusterg03
  * fence-mpath-clusterg03 (stonith:fence_mpath): Started clusterg03

Pending Fencing Actions:
  * reboot of clusterg01 pending: client=pacemaker-controld.906, origin=clusterg02

and the watchdog on host clusterg01 rebooted it. After the reboot, only a single path had come back to set the reservat...


description: updated

For the previous case (fence_mpath on Groovy), when the fenced node is rebooted automatically (by the watchdog monitoring /dev/watchdog from the softdog module), the pacemaker systemd unit should perhaps start only AFTER iscsid, open-iscsi AND multipath-tools have already started. That would give all paths time to come back online. PROBLEM: bringing the paths back up may take longer than it takes each service to tell systemd it is ON. Note: don't fully trust automated reboots for unfencing a node with mpath; do it manually.
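
A naive way to express that ordering would be a systemd drop-in on pacemaker.service (a sketch only; iscsid.service, open-iscsi.service and multipathd.service are the Ubuntu unit names for those packages):

# /etc/systemd/system/pacemaker.service.d/wait-for-storage.conf
[Unit]
# do not start pacemaker before the iSCSI and multipath daemons are up
After=iscsid.service open-iscsi.service multipathd.service
Wants=iscsid.service open-iscsi.service multipathd.service

followed by systemctl daemon-reload. As noted above, this only orders against the services having started, not against every path actually being back online. Manual unfencing after a reboot would be something along the lines of:

$ sudo fence_mpath --action=on --key=0x59450001 --devices=/dev/mapper/volume01

(key and device as in the Groovy example above; adjust to the rebooted node's own key).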

summary: - fence_scsi cannot open /var/run/cluster/fence_scsi.key (does not exist)
- after nodes are rebooted
+ fence_scsi and fence_mpath configuration issues (e.g.
+ /var/run/cluster/fence_scsi.key)

Turns out pacemaker.service already declares:

# Some OCF resources may have dependencies that aren't managed by the cluster;
# these must be started before Pacemaker and stopped after it. The
# resource-agents package provides this target, which lets system administrators
# add drop-ins for those dependencies.
After=resource-agents-deps.target
Wants=resource-agents-deps.target

Meaning that we just have to add the OCF resource (or fence agent) dependencies to this target... but it does not work =). I'm discussing this with upstream:

https://lists.clusterlabs.org/pipermail/users/2020-June/027207.html
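
The intended mechanism (a sketch of what such a drop-in would look like; the file name here is arbitrary) is something like:

# /etc/systemd/system/resource-agents-deps.target.d/storage.conf
[Unit]
Requires=iscsid.service open-iscsi.service multipathd.service
After=iscsid.service open-iscsi.service multipathd.service

so that pacemaker.service, which is ordered After=resource-agents-deps.target, only starts once those services are up. As the paragraph above says, that ordering did not take effect here, which is what the upstream thread is about.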

Changed in fence-agents (Ubuntu):
status: Fix Released → In Progress
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release

Changed in fence-agents (Ubuntu Eoan):
status: Confirmed → Won't Fix