Comment 0 for bug 1000355

Jacob Smith (jsmith-argotecinc) wrote : drbd fence-peer breaks when using kernel 2.6.32-41

Environment: Ubuntu 10.04 (Lucid) with the 2.6.32-41 kernel and drbd8.

Kernel 2.6.32-41 fixed a consistency issue around UMH_WAIT_PROC in this bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/963685

As a result, the exit code of the drbd fencing script is misinterpreted, which breaks drbd fencing.

To replicate:

Enable fencing in the drbd config:
In the handlers section: fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
In the disk section: fencing resource-only;

Have both drbd nodes UpToDate, with one Primary and one Secondary.
Trigger the fence-peer handler. I did this as follows:
With drbd under pacemaker control, both pacemaker nodes online and in sync, and drbd Primary on node 1, put node 1 in standby. The fence-peer handler then gets executed.
drbd will report that the fence-peer helper exited with 0 (broken), like this:

May 15 09:45:17 kernel: [56645.420714] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
May 15 09:45:17 kernel: [56645.420920] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 0 (0x0)
May 15 09:45:17 kernel: [56645.420925] block drbd0: fence-peer helper broken, returned 0
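
To check a syslog for this symptom, the helper lines can be matched with a small script. The following is only a sketch and not part of drbd: the regular expression is fitted to the log lines above and may not match other drbd versions, the function name check_fence_peer is made up for illustration, and the table of fence-peer exit codes reflects my understanding of the DRBD 8.3 documentation, so treat it as an assumption.

```python
import re

# Pattern fitted to the "helper command ... exit code N (0xN)" line above.
# The exact log format is an assumption based on that one sample.
HELPER_EXIT = re.compile(
    r"helper command: (?P<helper>\S+) (?P<handler>\S+) (?P<minor>\S+) "
    r"exit code (?P<code>\d+) \((?P<raw>0x[0-9a-fA-F]+)\)"
)

# fence-peer exit codes as I understand the DRBD 8.3 docs (assumption).
FENCE_PEER_CODES = {
    3: "peer already inconsistent",
    4: "peer outdated",
    5: "peer unreachable",
    6: "peer is primary and refused to be outdated",
    7: "peer was stonithed",
}


def check_fence_peer(line):
    """Return (code, raw, meaning) for a fence-peer exit line, else None."""
    m = HELPER_EXIT.search(line)
    if m is None or m.group("handler") != "fence-peer":
        return None
    code = int(m.group("code"))
    raw = int(m.group("raw"), 16)
    meaning = FENCE_PEER_CODES.get(code, "not a valid fence-peer result")
    return code, raw, meaning


line = ("May 15 09:45:17 kernel: [56645.420920] block drbd0: helper command: "
        "/sbin/drbdadm fence-peer minor-0 exit code 0 (0x0)")
print(check_fence_peer(line))  # (0, 0, 'not a valid fence-peer result')
```

An exit code outside the table, such as the 0 logged here, is what drbd reports as "fence-peer helper broken".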

If you log debug output from the fence-peer script (crm-fence-peer.sh), you can see that it actually exits with 4, not the 0 that the kernel reports.
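
For reference, here is a rough sketch of how an exit code of 4 would normally show up in that log line. It assumes that drbd prints the exit code as (ret >> 8) & 0xff followed by the raw helper return value in hex, which is consistent with the "exit code 0 (0x0)" format but is my assumption, not something I verified in the source. The function names are made up for illustration.

```python
def exit_code_from_wait_status(status):
    """Extract the exit code from a wait-style status word."""
    return (status >> 8) & 0xFF


def wait_status_for_exit_code(code):
    """Build the wait-style status of a process that exited normally."""
    return (code & 0xFF) << 8


# What the raw value should look like if the script's "exit 4" reached drbd.
expected_raw = wait_status_for_exit_code(4)
# What the kernel log above actually shows.
logged_raw = 0x0

print(hex(expected_raw), exit_code_from_wait_status(expected_raw))  # 0x400 4
print(hex(logged_raw), exit_code_from_wait_status(logged_raw))      # 0x0 0
```

Under that assumption, a raw value of 0x0 means the script's exit status never made it back to drbd at all, rather than a 4 being decoded incorrectly.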

This commit in drbd git should fix this behavior:

http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commitdiff;h=e6cbc43

Without this fix, a drbd setup that uses fencing cannot auto-recover or continue on its own; it requires manual intervention and repair.