curtin fails with "Device is busy" error during unmount

Bug #1462139 reported by Mike Pontillo
Affects          Status        Importance  Assigned to  Milestone
curtin           Fix Released  Medium      Unassigned
curtin (Ubuntu)  Fix Released  Medium      Unassigned
  Trusty         Fix Released  Medium      Unassigned
  Vivid          Fix Released  Medium      Unassigned

Bug Description

=== Begin SRU Template ===
[Description]
A race condition can occur when invoking grub-install inside the target environment.
The failure case shows logs like this:
 | Installing for i386-pc platform.
 | Installation finished. No error reported.
 | umount: /tmp/tmpM4R1dI/target/dev: device is busy.
 | (In some cases useful info about processes that use
 | the device is found by lsof(8) or fuser(1))
 | Unexpected error while running command.
 | Command: ['umount', '/tmp/tmpM4R1dI/target/dev']
 | Exit code: 1
 | Reason: -
 | Stdout: ''
 | Stderr: ''
 | Installation failed with exception: Unexpected error while running command.
 | Command: ['curtin', 'curthooks']

This is believed to happen because some process (likely spawned by udev) still had open file handles on /dev when curtin went to clean up the target mounts.

The solution is to run 'udevadm settle' before unmounting '/dev/' from the target.
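The ordering described above can be sketched as follows. This is a minimal illustration, not curtin's actual code: the function names and the injectable `run` parameter are hypothetical and exist only so the ordering can be checked without root privileges.

```python
import subprocess


def settle_then_unmount(mountpoint, run=subprocess.check_call):
    """Wait for pending udev events, then unmount `mountpoint`.

    `run` is a hypothetical injection point (defaults to
    subprocess.check_call) so the call order can be tested.
    """
    # 'udevadm settle' blocks until the udev event queue is empty
    # (or until its timeout expires).
    run(["udevadm", "settle"])
    # Only after udev has gone quiet do we try to unmount.
    run(["umount", mountpoint])


def unmount_target_dev(target, run=subprocess.check_call):
    """Unmount <target>/dev, settling udev first. Returns the path used."""
    dev = target.rstrip("/") + "/dev"
    settle_then_unmount(dev, run=run)
    return dev
```

With the default `run`, a failing 'umount' still raises subprocess.CalledProcessError, so the caller sees the same error as before; the only change is the wait that precedes it.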

[Impact]
The impact is transient failure to install. This race condition is very rarely seen on hardware, but was somewhat easily reproduced in a heavily loaded vmware environment.

[Test Case]
In the original reporter's environment, it fails fairly reliably under heavy host load on vmware. He would deploy to several guests on the same host at the same time, and this would reproduce the failure. Unfortunately, I was unable to come up with a test case in a less complex environment.

[Regression Potential]
Regression potential should be very low here. The most likely fallout is just additional time for the install as a result of running 'udevadm settle'. A system that did not exhibit this bug will install a small fraction of a second slower.

$ sudo bash -c 'time for x in "$@"; do udevadm settle; done' -- $(seq 1 100)
  real 0m0.214s
  user 0m0.012s
  sys 0m0.008s

As shown above, 100 calls took about 0.2 seconds in total, so the added cost is roughly 2 ms per call, well under 1/100th of a second.
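To make the per-call cost explicit, the wall-clock total from the measurement above can be divided by the number of iterations. The helper name below is hypothetical; the numbers are the ones reported by 'time'.

```python
def per_call_ms(total_seconds, calls):
    """Average cost of one call in milliseconds."""
    return total_seconds * 1000.0 / calls


# 'real 0m0.214s' for 100 invocations of 'udevadm settle'
print("%.2f ms per call" % per_call_ms(0.214, 100))
```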

=== End SRU Template ===

Here's the relevant part of the curtin output:

Installing for i386-pc platform.
Installation finished. No error reported.
umount: /tmp/tmpM4R1dI/target/dev: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
Unexpected error while running command.
Command: ['umount', '/tmp/tmpM4R1dI/target/dev']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'curthooks']

I spoke with Scott Moser on IRC about this, and he suggested the attached patch, which fixed the problem 100%.

I checked the installation output of a few of my MAAS nodes, and didn't see any lsof output. So I assume the "udevadm settle" command is the fix.
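The attached patch is not reproduced in this report. The remark about lsof output suggests that it also logged open files when the unmount failed; the sketch below shows that idea only as an assumption. The function name and the injectable `run` and `capture` parameters are hypothetical.

```python
import subprocess


def unmount_with_diagnostics(mountpoint, run=subprocess.check_call,
                             capture=subprocess.check_output):
    """Try to unmount; on failure, print lsof output and re-raise."""
    try:
        run(["umount", mountpoint])
    except subprocess.CalledProcessError:
        try:
            # Given a mount point, lsof lists files open on that filesystem.
            holders = capture(["lsof", mountpoint])
        except (subprocess.CalledProcessError, OSError):
            # lsof exits non-zero when it finds nothing, or may be missing.
            holders = b""
        print("open files on %s:\n%s"
              % (mountpoint, holders.decode(errors="replace")))
        raise
```

If the unmount succeeds, nothing is printed, which matches the observation that no lsof output appears in a successful install log.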

(Thanks Scott! I hope we can land this in time for MAAS 1.8.)

Mike Pontillo (mpontillo) wrote :
Scott Moser (smoser) wrote :

I was unable to reproduce this bug in testing, so unfortunately I'm not 100% certain the fix is doing more than acting as a 'sleep'.
Here's what I did:
  wget http://cloud-images.ubuntu.com/daily/server/trusty/current/trusty-server-cloudimg-amd64-root.tar.gz
  mkdir x
  tar -C x -Sxpzf trusty-server-cloudimg-amd64-root.tar.gz

  $ cat go.py
#!/usr/bin/python
from curtin import util
import sys, os

runs = int(os.environ.get("RUNS", "10"))
target = sys.argv[1]
cmd = sys.argv[2:]
print("target: %s" % target)
print("cmd: %s" % ' '.join(cmd))
for run in range(0, runs):
    print("run: %s" % run)
    with util.ChrootableTarget(target=target, allow_daemons=False):
        util.subp(['chroot', target] + cmd)

# install grub-pc in the target
$ sudo RUNS=1 PYTHONPATH=$PWD ./go.py x apt-get install grub-pc
# set up /dev/vdb (kvm / openstack guest) with partition
$ sudo wipefs --all /dev/vdb; echo "2048," | sudo sfdisk --force --unit=S /dev/vdb
$ sudo RUNS=100 PYTHONPATH=$PWD ./go.py x grub-install /dev/vdb

I was unable to see that fail, even though it is essentially what I believe was happening in Mike's failure case.
I also tried with trusty OS + trusty root.tar.gz and vivid OS + vivid root.tar.gz.

That said, udevadm monitor *does* show udev events, so it's quite possible that something was responding to those events.

Worst case, this is a 'sleep'. Best case, it's a fix.

Scott Moser (smoser) wrote :

this should be fixed in revno 210.

Changed in curtin:
status: New → Fix Committed
Scott Moser (smoser)
affects: ubuntu (Ubuntu) → curtin (Ubuntu)
Changed in curtin (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Changed in curtin (Ubuntu Trusty):
status: New → Confirmed
Changed in curtin (Ubuntu Vivid):
status: New → Confirmed
Changed in curtin (Ubuntu Trusty):
importance: Undecided → Medium
Changed in curtin (Ubuntu Vivid):
importance: Undecided → Medium
Scott Moser (smoser)
description: updated
Chris J Arges (arges) wrote : Please test proposed package

Hello Mike, or anyone else affected,

Accepted curtin into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr221-0ubuntu1~14.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Trusty):
status: Confirmed → Fix Committed
tags: added: verification-needed
Chris J Arges (arges) wrote :

Hello Mike, or anyone else affected,

Accepted curtin into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/curtin/0.1.0~bzr221-0ubuntu1~14.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in curtin (Ubuntu Vivid):
status: Confirmed → Fix Committed
Scott Moser (smoser)
Changed in curtin (Ubuntu):
status: Confirmed → Fix Released
tags: added: verification-done
removed: verification-needed
Mike Pontillo (mpontillo) wrote :

I tested the fix; looks good! Thanks.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr221-0ubuntu1~14.04.1

---------------
curtin (0.1.0~bzr221-0ubuntu1~14.04.1) trusty-proposed; urgency=medium

  * New upstream snapshot.
    - support installation to multipath devices. (LP: #1371634)
    - know that kernel version 4.2.0 maps to linux-generic-lts-wily
    - support install to arm64 systems that use UEFI for boot (LP: #1447834)
    - fix remaining usage of 'lsblk --out' rather than 'lsblk --output'
      (LP: #1386275)
    - retry 'apt-get update' on failure to avoid transient failures
      (LP: #1403133)
    - run udevadm settle before unmounting /dev in a target to avoid transient
      failures (LP: #1462139)
    - fixes and additions to tools used in development.
    - Add --no-nvram to the grub-install command for UEFI. (LP: #1311827)
    - avoid race condition and transient failure due busy device in mkfs
      (LP: #1443542)
    - improvements to device and partition naming code which allow installation
      devices with HP cciss smart array drives(LP: #1401190, #1263181)
    - do not consider devices < 1G as installable targets
  * debian/README.source fix doc on how to create new upstream snapshots

 -- Scott Moser <email address hidden> Wed, 24 Jun 2015 14:31:14 -0400

Changed in curtin (Ubuntu Trusty):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for curtin has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package curtin - 0.1.0~bzr221-0ubuntu1~14.10.1

---------------
curtin (0.1.0~bzr221-0ubuntu1~14.10.1) vivid-proposed; urgency=medium

  * New upstream snapshot.
    - support installation to multipath devices. (LP: #1371634)
    - know that kernel version 4.2.0 maps to linux-generic-lts-wily
    - support install to arm64 systems that use UEFI for boot (LP: #1447834)
    - fix remaining usage of 'lsblk --out' rather than 'lsblk --output'
      (LP: #1386275)
    - retry 'apt-get update' on failure to avoid transient failures
      (LP: #1403133)
    - run udevadm settle before unmounting /dev in a target to avoid transient
      failures (LP: #1462139)
    - fixes and additions to tools used in development.
    - Add --no-nvram to the grub-install command for UEFI. (LP: #1311827)
    - avoid race condition and transient failure due busy device in mkfs
      (LP: #1443542)
    - improvements to device and partition naming code which allow installation
      devices with HP cciss smart array drives(LP: #1401190, #1263181)
    - do not consider devices < 1G as installable targets
  * debian/README.source fix doc on how to create new upstream snapshots

 -- Scott Moser <email address hidden> Wed, 24 Jun 2015 16:12:59 -0400

Changed in curtin (Ubuntu Vivid):
status: Fix Committed → Fix Released
Scott Moser (smoser)
Changed in curtin:
importance: Undecided → Medium
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin 17.1. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released