Infinite loop in helper LVM script for DRBD 8 in Lucid

Bug #744293 reported by Numérigraphe
This bug affects 2 people
Affects         Status         Importance   Assigned to   Milestone
drbd8 (Ubuntu)  Fix Released   High         Unassigned
Lucid           Fix Released   High         Unassigned

Bug Description

The script /usr/lib/drbd/snapshot-resync-target-lvm.sh shipped with drbd v8.3.7 in Lucid fails to read its command line options and falls into an infinite loop.

This bug is not present in natty, but I humbly request an SRU to Lucid.

This script can be called by DRBD to create an LVM snapshot of a resource before it starts resyncing (and thus becomes inconsistent). The script is present (though commented out) in the default drbd configuration file /etc/drbd.d/global_common.conf.

This bug can result in drbd silently failing to resync an outdated resource. Thus, any newer data will be lost if the cluster fails over to the outdated node.

This is a known problem upstream and it was fixed in later versions of DRBD. The following patch was committed upstream to address this bug : http://git.drbd.org/?p=drbd-8.3.git;a=commitdiff;h=1fd3ad953663615e946167191e1d9885af81450a

Steps to reproduce:
On a cluster composed of node A and node B:
 - on both nodes, uncomment the line "before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";"
 - on both nodes, configure a drbd resource called "test"
 - make the initial synchronization
 - on B, do "drbdadm disconnect test"
 - on A, make "test" primary and write to it
 - on B, do "drbdadm connect test"
 => the script will kick in and fall into an endless loop - top will show it using 100% CPU.
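
The command-line part of these steps can be sketched as a small dry-run helper. This is only an illustration under stated assumptions: the `run` wrapper, the `DRY_RUN` switch, the `DEV` path and the `dd` write are hypothetical and not part of the report. Only the resource name "test" and the `drbdadm disconnect` / `drbdadm connect` calls come from the steps above; `drbdadm primary` is assumed as the way to make the resource primary.

```shell
# Dry-run sketch of the reproduction steps; nothing here ships with DRBD.
RES=test            # resource name used in the report
DEV=/dev/drbd0      # hypothetical device backing "test"; adjust to your setup
DRY_RUN=1           # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On node B: drop the replication link.
step_disconnect_b() {
    run drbdadm disconnect "$RES"
}

# On node A: promote the resource and write to it so that B falls behind.
# The dd write is an assumption and overwrites data on DEV.
step_write_on_a() {
    run drbdadm primary "$RES"
    run dd if=/dev/zero of="$DEV" bs=1M count=10
}

# On node B: reconnect. B becomes the resync target, which is when DRBD
# calls the before-resync-target handler.
step_reconnect_b() {
    run drbdadm connect "$RES"
}
```

With DRY_RUN left at 1 the functions only print what they would run, which makes it safe to review the sequence before trying it on a throwaway test cluster.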

lsb_release -rd
Description: Ubuntu 10.04.2 LTS
Release: 10.04

apt-cache policy drbd8-utils
drbd8-utils:
  Installed: 2:8.3.7-1ubuntu2.1
  Candidate: 2:8.3.7-1ubuntu2.1
  Version table:
 *** 2:8.3.7-1ubuntu2.1 0
        500 http://fr.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        100 /var/lib/dpkg/status
     2:8.3.7-1ubuntu2 0
        500 http://fr.archive.ubuntu.com/ubuntu/ lucid/main Packages

Lionel Sausin.

=======
SRU Justification

IMPACT:

When the --percent option is used, the script enters an endless loop. As a result, DRBD can fail to resync an outdated resource when using LVM.

REPRODUCE (as specified above):

On a cluster composed of node A and node B:
 - on both nodes, uncomment the line "before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";"
 - on both nodes, configure a drbd resource called "test"
 - make the initial synchronization
 - on B, do "drbdadm disconnect test"
 - on A, make "test" primary and write to it
 - on B, do "drbdadm connect test"
 => the script will kick in and fall into an endless loop - top will show it using 100% CPU.

HOW FIXED:

The fix was taken from upstream. It sources the default file first, shifts the options correctly, and uses drbdadm sh-minor to obtain the minor number instead of guessing it. Without these changes the script can enter an infinite loop and fail to resync, as specified above.
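
To illustrate the option-shifting part, here is a minimal sketch of how a value-taking option that shifts only once can leave a loop spinning. It is not the shipped script and not the upstream patch: `parse_buggy`, `parse_fixed`, the `guard` counter and the `REST` variable are hypothetical names, and the failure mode shown is an assumption that is consistent with the unpatched case statement quoted later in this thread.

```shell
# Buggy shape: -p reads its value from $2 but shifts only once, so the value
# ("15") becomes $1, matches no branch, and nothing ever shifts it away.
# The guard counter exists only so this demo terminates.
parse_buggy() {
    local guard=0
    SNAP_PERC=
    while [ $# -gt 0 ]; do
        guard=$((guard + 1))
        if [ "$guard" -gt 100 ]; then
            echo "stuck on: $1"
            return 1
        fi
        case $1 in
            -p|--percent) SNAP_PERC="$2"; shift ;;
            --) shift; break ;;
        esac
    done
}

# One possible correction: every pass through the loop ends with a shift,
# and a value-taking option shifts once more inside its branch.
parse_fixed() {
    SNAP_PERC=
    while [ $# -gt 0 ]; do
        case $1 in
            -p|--percent) SNAP_PERC="$2"; shift ;;
            --) shift; break ;;
        esac
        shift
    done
    REST="$*"    # whatever follows "--" is left for the LVM command
}

parse_fixed -p 15 -- -c 16k
echo "SNAP_PERC=$SNAP_PERC rest=$REST"
```

Calling `parse_buggy -p 15 -- -c 16k` reports that it is stuck on `15`, while `parse_fixed` with the same arguments stores 15 and leaves `-c 16k` for the caller.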

PATCH:

Attached. Uploaded to lucid-proposed for review.

REGRESSION POTENTIAL:

Minimal. This has been tested thoroughly.

=======

Andres Rodriguez (andreserl) wrote :

Hi there,

thank you for reporting bugs and trying to make Ubuntu better. I'll look into this bug. Thank you!

Changed in drbd8 (Ubuntu):
status: New → Triaged
importance: Undecided → Wishlist
Jacob Smith (jsmith-argotecinc) wrote :

Andres,

This is a real pain for those of us who would like to ensure that we don't lose data due to a failure during re-sync. If we don't take a snapshot prior to a re-sync of the secondary and there was a failure it could be catastrophic. I'm pretty sure Lars put out a patch which is mentioned in the original bug report.

I realize getting this into 10.04 can be a long process. Any chance in the meantime you would be willing to put up a patched version in the HA PPA? I could patch it myself; however, if you put it up, at least it would be there for those who don't know how or don't want to modify their packages outside of apt.

Thanks!

Changed in drbd8 (Ubuntu):
assignee: nobody → Andres Rodriguez (andreserl)
status: Triaged → In Progress
importance: Wishlist → High
description: updated
description: updated
Andres Rodriguez (andreserl) wrote :

Jacob,

I'm preparing a Stable Release Update to address this bug. Once it is done, it will be made available for testing. If it is successful then it can be released.

Cheers,

Jacob Smith (jsmith-argotecinc) wrote :

Excellent!

Thank you Andres

Jacob Smith (jsmith-argotecinc) wrote :

Andres,

There hasn't been any activity (that I can see) on this since you submitted the SRU. Do you have any further info or idea when this would be mainstream available for 10.04?

Or would you be willing to put up a patched package on the Ubuntu-HA PPA for Lucid?

Thanks!

Jacob Smith (jsmith-argotecinc) wrote :

Not to be a pain but any update on this? If it is put to SRU Proposed I would gladly test it.

Thanks!

Andres Rodriguez (andreserl) wrote :

Hi Jacob.

Sorry for the delay... apparently the SRU was lost for some reason.

Could you please re-test the package available at my ppa:

https://launchpad.net/~andreserl/+archive/ppa/+packages?field.series_filter=lucid

Once you proceed with the testing I'll proceed to upload it to the SRU queue for verification.

Cheers!

Jacob Smith (jsmith-argotecinc) wrote :

Hi Andres,

So I gave it a try and now I'm racking my brain because I thought I knew what I was doing but it didn't behave as I expected and I don't know if it's my ignorance or a problem so I'll let you be the judge!

Executed the following on both nodes, one after the other finished:
Added your PPA.
apt-get update; apt-get upgrade
Edit /etc/drbd.d/global_common.conf and uncomment before and after-resync-target
shutdown -r now

Ended up with all kinds of errors and zombie lvm processes and drbd complaining etc. after restart.
Looked at the lvm scripts in question - /usr/lib/drbd/snapshot-resync-target-lvm.sh
The changes from the patch are not there... if I understand what the patch was supposed to do.
First - The meat of the patch was to remove a couple of "shift" lines from the case statement among other things correct?
Second - doing the apt-get upgrade should have replaced the patched script?

Node 2:
root@Vulture:~# apt-cache policy drbd8-source
drbd8-source:
  Installed: 2:8.3.7-1ubuntu2.2
  Candidate: 2:8.3.7-1ubuntu2.2
  Version table:
 *** 2:8.3.7-1ubuntu2.2 0
        500 http://ppa.launchpad.net/andreserl/ppa/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
     2:8.3.7-1ubuntu2.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     2:8.3.7-1ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages

Node 1:
root@Condor:~# apt-cache policy drbd8-source
drbd8-source:
  Installed: 2:8.3.7-1ubuntu2.2
  Candidate: 2:8.3.7-1ubuntu2.2
  Version table:
 *** 2:8.3.7-1ubuntu2.2 0
        500 http://ppa.launchpad.net/andreserl/ppa/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
     2:8.3.7-1ubuntu2.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     2:8.3.7-1ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages

root@Vulture:~# cat /usr/lib/drbd/snapshot-resync-target-lvm.sh | grep -A 20 case
        case $1 in
                -p|--percent)
                        SNAP_PERC="$2"
                        shift
                        ;;
                -a|--additional)
                        SNAP_ADDITIONAL="$2"
                        shift 2
                        ;;
                -n|--disconnect-on-error)
                        DISCONNECT_ON_ERROR=1
                        shift
                        ;;
                -v|--verbose)
                        BE_VERBOSE=1
                        shift
                        ;;
                --)
                        shift
                        break
                        ;;

Thoughts??

Jacob Smith (jsmith-argotecinc) wrote :

Andres - have you had a chance to review my last update?

Adam Gandelman (gandelman-a) wrote :

Jacob-

It looks like the patch was included in debian/patches/ but not in debian/patches/00list, hence it was never being applied to the source.

I've fixed it and uploaded drbd8 - 2:8.3.7-1ubuntu2.3 to a ppa @ ppa:gandelman-a/ppa

Please test and verify, thanks!
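
A quick way to catch this kind of packaging slip is to check that the patch file is actually named in the dpatch list. The helper below is a hypothetical sketch, not part of dpatch or of the drbd8 package; it assumes that debian/patches/00list holds one patch name per line and that the ".dpatch" suffix may be omitted there.

```shell
# Returns 0 if the named patch appears in a dpatch list file.
# $1: path to 00list, $2: patch file name (with or without .dpatch)
patch_is_listed() {
    list=$1
    name=${2%.dpatch}
    [ -r "$list" ] || return 2
    # Trim surrounding whitespace, then accept either spelling of the name.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$list" |
        grep -qxF -e "$name" -e "$name.dpatch"
}
```

For example, run from the unpacked source tree: `patch_is_listed debian/patches/00list 11_fix_lvm_infinite_loop.dpatch || echo "patch not listed"`.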

Andres Rodriguez (andreserl) wrote :

Hi Jacob,

Yes. As was noted, the patch wasn't being applied during the package build. We are uploading a new package that applies the patch. I will let you know once it is ready for testing. Cheers!

Jacob Smith (jsmith-argotecinc) wrote :

Still the same thing - zombie processes for lvm snapshots with 100% cpu.

The script still doesn't have the changes from the patch applied - it still looks identical to my post #8.

root@Vulture:~# apt-cache policy drbd8-source
drbd8-source:
  Installed: 2:8.3.7-1ubuntu2.3
  Candidate: 2:8.3.7-1ubuntu2.3
  Version table:
 *** 2:8.3.7-1ubuntu2.3 0
        500 http://ppa.launchpad.net/gandelman-a/ppa/ubuntu/ lucid/main Packages
        100 /var/lib/dpkg/status
     2:8.3.7-1ubuntu2.2 0
        500 http://ppa.launchpad.net/andreserl/ppa/ubuntu/ lucid/main Packages
     2:8.3.7-1ubuntu2.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
     2:8.3.7-1ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages

Adam Gandelman (gandelman-a) wrote :

Hi Jacob-

Please also update drbd8-utils as this is the package that provides the lvm snapshot script.

Jacob Smith (jsmith-argotecinc) wrote :

Thanks Adam - thought that was weird.

Only update available was drbd8-source after adding your ppa.
Maybe because I use amd64 builds? I noticed it says need to be built next to those in your ppa.

I'll check again in the morning.

Jacob Smith (jsmith-argotecinc) wrote :

Ok looks good. No zombie processes. I updated and restarted node 1 - drbd initiated snapshots on startup, synced properly (wouldn't get that far before) and then removed them. After doing that on node 1 I updated and restarted node 2 (causing everything to fail over to node 1) and everything moved and resynced properly after the restart.

Only thing I don't understand/like is the following in the syslog:

Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 3 (/) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 4 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 6 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 7 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1761]: File descriptor 7 (/etc) leaked on lvdisplay invocation. Parent PID 1761: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 8 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 9 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash
Jul 13 10:50:16 Condor snapshot-resync-target-lvm.sh[1760]: File descriptor 10 (/etc) leaked on lvdisplay invocation. Parent PID 1760: /bin/bash

Adam Gandelman (gandelman-a) wrote :

Jacob-

This is an unrelated warning from lvdisplay and can be ignored in the context of this bug. Please see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=466138#15

Jacob Smith (jsmith-argotecinc) wrote :

Thanks Adam/Andres

I've got to remember: Google first, ask second - always not just mostly!

I'll report in again in about a week on this. I am doing a lot of reconfig, add resources, testing, and failing of my pacemaker cluster so I will report on how the snapshot patch holds up. It *should* get a good workout!

Jacob Smith (jsmith-argotecinc) wrote :

Still working good - no issues. I've changed/rebooted both nodes and it takes snapshots, re-syncs, then removes snapshots (at least 30 times since applying the patch) without a hitch.

Thanks again for your help!

Changed in drbd8 (Ubuntu):
assignee: Andres Rodriguez (andreserl) → nobody
status: In Progress → New
status: New → Fix Released
Changed in drbd8 (Ubuntu Lucid):
importance: Undecided → High
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Numérigraphe, or anyone else affected,

Accepted drbd8 into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in drbd8 (Ubuntu Lucid):
status: New → Fix Committed
tags: added: verification-needed
Jacob Smith (jsmith-argotecinc) wrote :

I have tested this a couple times a week since the fix was proposed and I have had no ill effects and it has fixed the problem.
Hopefully we can get this update out to all!

If there is any other info I can provide please let me know.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package drbd8 - 2:8.3.7-1ubuntu2.2

---------------
drbd8 (2:8.3.7-1ubuntu2.2) lucid-proposed; urgency=low

  * SRU:
    - debian/patches/11_fix_lvm_infinite_loop.dpatch: infinite loop in
      helper LVM script. (LP: #744293)
 -- Andres Rodriguez <email address hidden> Thu, 05 May 2011 10:20:53 -0400

Changed in drbd8 (Ubuntu Lucid):
status: Fix Committed → Fix Released