systemd mount targets fail due to device busy or already mounted

Bug #1718287 reported by Billy Olsen on 2017-09-19
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
cloud-init
High
Ryan Harper

Bug Description

[Issue]

After rebooting a 16.04 AWS instance (ami-1d4e7a66) with several external disks attached, formatted, and added to /etc/fstab - systemd mount targets fail to mount with:

● media-v.mount - /media/v
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2017-09-19 20:12:18 UTC; 1min 54s ago
    Where: /media/v
     What: /dev/xvdv
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 1196 ExecMount=/bin/mount /dev/xvdv /media/v -t ext4 -o defaults (code=exited, status=32)

Sep 19 20:12:17 ip-172-31-7-167 systemd[1]: Mounting /media/v...
Sep 19 20:12:17 ip-172-31-7-167 mount[1196]: mount: /dev/xvdv is already mounted or /media/v busy
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Mount process exited, code=exited status=32
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: Failed to mount /media/v.
Sep 19 20:12:18 ip-172-31-7-167 systemd[1]: media-v.mount: Unit entered failed state.

From the cloud-init logs, it appears that the the OVF datasource is mounting the device to find data:

2017-09-19 20:12:17,502 - util.py[DEBUG]: Peeking at /dev/xvdv (max_bytes=512)
2017-09-19 20:12:17,502 - util.py[DEBUG]: Reading from /proc/mounts (quiet=False)
2017-09-19 20:12:17,502 - util.py[DEBUG]: Read 2570 bytes from /proc/mounts
...
2017-09-19 20:12:17,506 - util.py[DEBUG]: Running command ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid'] with allowed return codes [0] (shell=False, capture=True)
2017-09-19 20:12:17,545 - util.py[DEBUG]: Failed mount of '/dev/xvdv' as 'iso9660': Unexpected error while running command.
Command: ['mount', '-o', 'ro,sync', '-t', 'iso9660', '/dev/xvdv', '/tmp/tmpw2tyqqid']
Exit code: 32
Reason: -
Stdout: -
Stderr: mount: wrong fs type, bad option, bad superblock on /dev/xvdv,
               missing codepage or helper program, or other error

               In some cases useful info is found in syslog - try
               dmesg | tail or so.
2017-09-19 20:12:17,545 - util.py[DEBUG]: Recursively deleting /tmp/tmpw2tyqqid
2017-09-19 20:12:17,545 - DataSourceOVF.py[DEBUG]: /dev/xvdv not mountable as iso9660

[Vitals]

Version: 0.7.9-153-g16a7302f-0ubuntu1~16.04.2
OS: Ubuntu 16.04
Provider: AWS - ami-1d4e7a66

[Recreate]

To recreate this

1. Launch an AWS instance using AMI ami-1d4e7a66 and attach several disks (I used 25 additional disks)

2. Format and mount all 25:
   mkdir /media/{b..z}
   for i in {b..z}; do
       mkfs -t ext4 /dev/xvd$i
       mount /dev/xvd$i /media/$i
       echo "/dev/xvd$i /media/$i ext4 defaults,nofail 0 2" >> /etc/fstab
   done

3. reboot instance

Since this is a race, multiple may be necessary. A reproducer script is attached.

Tags: sts Edit Tag help

Related branches

Billy Olsen (billy-olsen) wrote :
Changed in cloud-init:
importance: Undecided → High
assignee: nobody → Ryan Harper (raharper)
Ryan Harper (raharper) wrote :

The current analysis is that any parallel invocation of mount which specifies the same device is going to result in one of the two mounts to fail with EBUSY (or some other failure) do to how the underlying device handles either opens or mounts in the kernel. This is recreatable outside of cloud-init when running mounts in parallel pointing to the same disk.

# rw mount loop
while true; do
    umount -f ${MNT_POINT} |:
    mount ${DISK} ${MNT_POINT} -t ext4 -o defaults || {
     echo "`date +%s.%N`: mount failed on COUNT=$COUNT";
        exit 1
    }
    echo "`date +%s.%N`: COUNT=$COUNT" > ${MNT_POINT}/lasttouch
    COUNT=$(($COUNT + 1))
done

# ro mount loop
while true; do
    dd if=${DISK} bs=512 count=1 of=/dev/null
    # we expect this to fail, that's OK
    mount -o ro,sync -t iso9660 ${DISK} ${MNT_POINT} |:
    echo "`date +%s.%N`: READ COUNT=$COUNT"
    COUNT=$(($COUNT + 1))
done

There isn't currently a general solution to prevent this race from occuring; so the focus is on reducing the likelihood of a race, and configuration changes to be resilient in the face of the race.

The linked changes to cloud-init will reduce the number of block devices that OVF will probe. The OVF Datasource is looking for block devices with ISO9660 filesystems; on an instance that has no ISO9660 filesystems, OVF will not perform any mount operations.

When crafting fstab entries, one can set the 6th column, (fs_passno) to zero (0)
and systemd will not attempt to fsck the device prior to mounting. This does not remove
the mount race, but it does reduce the amount of time between when the mount unit starts and when it completes.

Changed in cloud-init:
status: New → In Progress
Ryan Harper (raharper) wrote :

Some more details, in particular the ami-id is needed:

https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-1d4e7a66

t2-micro is fine, in us-east1, add 26 volumes, mark for deletion.

Run the script and star the reboot check.

That triggered in a single reboot; Instance is *old* at this time, comes with linux-image-4.4.0-aws-1030.

I'm currently testing with the image updated; current linux image is 4.4.0-aws-1035; That's not reproducing as quickly, so far 25 reboots with no issues.

Separately on an a non-upgraded instance with the above ami and instance type as described, I've packaged and installed a cloud-init package from the linked branch; that's successfully run for 68 reboots with no issues.

i'll continue running these over night and update this in the morning.

Ryan Harper (raharper) wrote :

Instance on 4.4.0-aws-1035 has 525 boots with no issues/no changes to cloud-init
Instance on 4.4.0-aws-1030 has 650 boots with no issues/updated cloud-init

I believe it's still required to have the cloud-init changes; however, due to changes in the kernel/systemd or otherwise; systemd mounts are not interleaved with cloud-init probing. This behavior is very consistent within a specific kernel/systemd pair.

For example on the latest boot of 4.4.0-aws-1035, I can check the timestamps (first and last) of the mounts (ignoring xvda/rootfs)

% $ journalctl --utc -o short-precise | grep mounted.filesystem | grep -v xvda | sed -e 1b -e '$!d'

Sep 21 13:56:13.858878 ip-172-31-27-63 kernel: EXT4-fs (xvdr): mounted filesystem with ordered data mode. Opts: (null)
Sep 21 13:56:14.129499 ip-172-31-27-63 kernel: EXT4-fs (xvdo): mounted filesystem with ordered data mode. Opts: (null)

And check the most recent cloud-init boot for when OVF search starts/stops

cloud-init-local starts OVF searching at:
2017-09-21 13:56:09,617 - handlers.py[DEBUG]: start: init-local/search-OVF: searching for local data from DataSourceOVF

Finishes at:
2017-09-21 13:56:10,322 - handlers.py[DEBUG]: finish: init-local/search-OVF: SUCCESS: no local data found from DataSourceOVF

So we probed *before* systemd mounts

Then at cloud-init.service (net mode) time, start at
2017-09-21 13:56:11,650 - handlers.py[DEBUG]: start: init-network/search-OVFNet: searching for network data from DataSourceOVFNet

Finish OVFNet at:
2017-09-21 13:56:12,246 - handlers.py[DEBUG]: finish: init-network/search-OVFNet: SUCCESS: no network data found from DataSourceOVFNet

And OVFNet probes *before* systemd mounts start at 13:56:13.85

Chad Smith (chad.smith) on 2017-10-04
Changed in cloud-init:
status: In Progress → Fix Committed
Robin Miller (robincello) wrote :

This issue seems to have returned on kernel 4.4.0-1038-aws with cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.2

Ryan Harper (raharper) wrote :

Currently validating cloud-init in xenial-proposed which resolves this specific issue with cloud-init racing systemd-fstab entries with the probing during DatasourceOVF.

cloud-init | 17.1-18-gd4f70470-0ubuntu1~16.04.1 | xenial-proposed | source, all

This bug is believed to be fixed in cloud-init in 1705804. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers