fsck.ext3: Device or resource busy while trying to open /dev/xvda2

Bug #898373 reported by Scott Moser on 2011-11-30
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
High
Unassigned

Bug Description

I'm opening this against cloud-init, but I do not actually think that is valid.
In an ec2 instance for test of alpha-1 images [1], we saw errors on boot like:

  Loading, please wait...
  [1342606.840604] udevd[81]: starting version 175
  Begin: Loading essential drivers ... done.
  Begin: Running /scripts/init-premount ... done.
  Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
  Begin: Running /scripts/local-premount ... done.
  [1342607.148903] EXT4-fs (xvda1): mounted filesystem with ordered data mode. Opts: (null)
  Begin: Running /scripts/local-bottom ... done.
  done.
  Begin: Running /scripts/init-bottom ... done.
  lxcmount stop/pre-start, process 178
  fsck from util-linux 2.19.1
  fsck from util-linux 2.19.1
  fsck: fsck.swap: not found
  fsck: Error 2 while executing fsck.swap for /dev/xvda3
  fsck.ext3: Device or resource busy while trying to open /dev/xvda2
  Filesystem mounted or opened exclusively by another program?
  cloud-init start-local running: Wed, 30 Nov 2011 19:25:20 +0000. up 3.98 seconds
  no instance data found in start-local

The swap errors are due to bug 898365.
The cloud-config given to the instance included:
  mounts:
   - [ ephemeral0, /opt , auto, "defaults,noexec" ]
   - [ swap, null ]

And given that bug, the /etc/fstab during this reboot probably had:
  /dev/xvda2 /opt auto defaults,noexec,comment=cloudconfig 0 2
  /dev/xvda3 None auto defaults,nobootwait,comment=cloudconfig 0 2

The thing that confuses me, and the reason I do not think this is cloud-init's fault is that this happens prior to cloud-init start-local. So, yes, there is a broken fstab, but not for /dev/xvda2.

The errors listed above are on first reboot after cloud-init wrote those fstab entries, and then issued 'mount -a' from a upstart job (cloud-config.conf) launched at:
   start on (filesystem and started rsyslog)

--
[1] https://jenkins.qa.ubuntu.com/job/precise-server-ec2/ARCH=i386,REGION=us-west-1,STORAGE=instance-store,TEST=cloud-config,label=ubuntu-server-ec2-testing/1/artifact/None/i386/m1.small/instance-store/i-835a3dc4/uec2-20111130-1920-5d6ee3bc489541-terminated.console.txt

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: cloud-init 0.6.2-0ubuntu1 [modified: usr/share/pyshared/cloudinit/CloudConfig/cc_mounts.py]
ProcVersionSignature: User Name 3.2.0-2.5-virtual 3.2.0-rc3
Uname: Linux 3.2.0-2-virtual i686
ApportVersion: 1.90-0ubuntu1
Architecture: i386
Date: Wed Nov 30 21:44:25 2011
Ec2AMI: ami-a7cc07ce
Ec2AMIManifest: ubuntu-us-east-1/images-testing/ubuntu-precise-daily-i386-server-20111130.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.small
Ec2Kernel: aki-805ea7e9
Ec2Ramdisk: unavailable
PackageArchitecture: all
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Related bugs:
 * bug 928990: fsck / dirty filesystem on instance is death

Related branches

Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :

Just a note, this did not occur on all instances with this user-data.

Scott Moser (smoser) on 2011-12-01
Changed in cloud-init (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Scott Moser (smoser) wrote :

after looking at the console pointed at in comment 3, I thought i might be able to trigger this by issuing 'reboot-instances' on a running node. I was not successful in doing so.

In a instance of:
 eu-west-1 ami-738fb007 ubuntu-precise-daily-i386-server-20120119

I issued reboot via 'euca-reboot-instances' 10 times and colllected log output (the low number is really because it takes 5m to get a console output). None of these resulted in the fsck issue.

Scott Moser (smoser) wrote :

after doing this, i caught a fsck device busy error:

args="--region=eu-west-1 i-087e0b41'
for((i=12;i<50;i++)); do
  echo $(date): i=$i; euca-reboot-instances $args && sleep 1m &&
  euca-get-console-output $args > /tmp/console.hard-reboot$i; done

the 37th boot (attaching) showed this error.

Scott Moser (smoser) on 2012-02-08
description: updated
Stefan Bader (smb) wrote :

Hm, when I look at /etc/init/cloud-init-local.conf and */cloud-init.conf they seem to say start on mounted MOUNTPOINT=/. So I'd somehow expect mountall to be finished with / at that point. Still the output of cloud-init and mountall seem to intermix often. As if they are executed around the same time...

Stefan Bader (smb) wrote :

Ok, / is non-sense as it is certainly mounted and also the failing fsck is fot /mnt. But generally it looks like mountall and cloud-init running concurrently and anything that opens the device may block a rw access for the fsck. So any blkid could do that. The window is not very big but who knows...

Scott Moser (smoser) wrote :

I think there are 2 things that could be causing this from cloud-init:
 a.) cloudinit/CloudConfig/cc_resize.py [1]
   This uses blkid, but almost certainly only passes blkid a device node that represents '/'. It basically does a stat on /, gets major and minor, creates a device node that has the same major, minor and calls blkid on that like:
    blkid -c /dev/null -sTYPE -ovalue /tmp/THATDEV
   It does this to find the filesystem type of /.

 b.) cloudinit/DataSourceOVF.py [2]
   This would run early in the boot, and possibly race with fsck. It is trying to determine if each device in /dev matching '^(sr[0-9]+|hd[a-z]|xvd.*)' is a cdrom. Notable in that list is 'xvd.*'. If that a device matching that is not mounted, then we do a read(512) in an attempt to determine if there is media in the CD-ROM drive, and if *that* succeeds, it will try to mount read-only and look for some files, then unmount.

It seems to me that 'b' is a potential source of the issue, as either the 'read()' or the mount, check, umount could cause the resource busy option.

2 things that could address that:
 a.) restrict devices being checked to 'xvd[a-z]' (eliminating the possible mount of /dev/xvda[0-9] entirely. This could still cause issue on other devices like /dev/xvdb.
 b.) mount with 'ro,noload' (per 'man mount')

I'm *guessing* that fsck is trying to open the device in read-only and the mount in 'b' is not truely read-only.

Any thoughts on how I can essentially poll these devices looking for "ovf data" and avoid this race?
Woudl a badly timed 'read()' interfere with fsck ? (ie, if it was an open RO, not RW?)

One thing that would filter out this device would be to filter out non-CD ROM devices, but I actually like the fact that the device does not have to be a CDROM.

--
[1] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-init/precise/view/head:/cloudinit/CloudConfig/cc_resizefs.py
[2] http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/precise/cloud-init/precise/view/head:/cloudinit/DataSourceOVF.py

Scott Moser (smoser) wrote :

Note, assuming our theory on the circumstances of the race is correct above (that it is 'read' or mount of that device), I was unable to reproduce with a loop device, which I thought would have caused a similar issue:

1. truncate --size 100m /tmp/my.img
2. mkfs -F -t ext3 /tmp/my.img
3. LODEV=$(sudo losetup --show --find /tmp/my.img)
4. sudo python -c 'import time, sys; print sys.argv[1]; fp=open(sys.argv[1]); time.sleep(30); fp.close();' /dev/loop1 &
   # inside the 30 seconds that that will have an open RO file handle, do step 5
5. sudo fsck.ext3 -f /dev/loop1 # '-f' forces fsck even if not marked dirty
   or
    sudo fsck.ext3 /dev/loop1

Other things I've found:
 * using 'open(sys.argv[1],"rw")' does not change anything, fsck will still continue.
 * changing step 4 to 'mount -o ro /dev/loop1 /mnt' *will* cause fsck.ext3 to complain that the filesystem is mounted and print a similar error message.

So, I'm suspecting that it is indeed the OVF code's mount -o,ro that is causing this problem.

I'm not sure how to resolve it, other than limiting the mounts checked to FS of iso9660 only.

Stefan Bader (smb) wrote :

Hm weird. I would have expected at least a rw open on the block device should fail, let alone a ro (since fsck probably wants to modify the fs). Though I'd need to check exactly what access mode mount -oro uses (maybe its some ro exclusive). Actually it is not really a read-only open on the block device (that much I know) because this still will cause changes to the superblock (mount time and so on) if that is possible and the block device itself is not read-only.
Beside that you could restrict the devices touched by looking at /sys/block/?/device/type which is 5 (cdrom) only for cd type devices.
The other option would be, if there is an event that mountall emits as the final one (done with automounting), that would be a safer time to start with the cloud-init, wouldn't it?

Scott Moser (smoser) wrote :

I believe this is fixed in cloud-init revision 515
http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/revision/515

Note, we could also restrict the device type searched by checking /sys/block/devname/device/type as stefan suggested.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3~bzr519-0ubuntu1

---------------
cloud-init (0.6.3~bzr519-0ubuntu1) precise; urgency=low

  * New upstream snapshot.
    - [Mike Milner] add support for managing CA Certificates (LP: #915232)
    - in ci-info lines, use '.' to for empty field for easier machine reading
    - support empty lines in "#include" files (LP: #923043)
    - [Jef Baeur] support configuration of salt minions Bauer) (LP: #927795)
    - DataSourceOVF: only search for OVF data on ISO9660 filesystems (LP: #898373)
    - DataSourceConfigDrive: support getting data from openstack config drive
      (LP: #857378)
    - [Juerg Haefliger] formating and pylint cleanups
  * increase timeouts for initial config check for metadata service
    to address slow metadata service in openstack
  * add awareness of ConfigDrive data source
 -- Scott Moser <email address hidden> Thu, 16 Feb 2012 17:27:05 -0500

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers