mountall spins eating cpu when 'nobootwait' option exists in fstab followed by a comma

Reported by Scott Moser on 2010-09-28
This bug affects 12 people
Affects            Importance  Assigned to
mountall (Ubuntu)  Critical    Colin Watson
Lucid              High        Dustin Kirkland
Maverick           Critical    Colin Watson

Bug Description

Binary package hint: mountall

As reported at [1], mountall is eating cpu cycles on reboot of both 10.04 (images 20100923 and later) and 10.10.

I believe the issue is related to changes made under bug 634102.

Here is what happens:
- The images are created with an fstab like:
 | proc /proc proc nodev,noexec,nosuid 0 0
 | LABEL=uec-rootfs / ext4 defaults 0 0
- On first boot, cloud-init writes additional entries like:
 | /dev/sda2 /mnt auto defaults,nobootwait,comment=cloudconfig 0 2
 | /dev/sda3 none swap sw,comment=cloudconfig 0 0
- On reboot, mountall eats CPU and /mnt is *not* mounted.
- Removing the 'nobootwait' flag and rebooting results in a properly functioning system.

I've modified /etc/init/mountall and added '--debug' and '--verbose', and rebooted and collected the console log.

I believe the issue only presents itself when the device exists at boot and the 'nobootwait' flag is present.

I can boot 2 different instance types,
 m1.small : has /dev/sda2 and /dev/sda3
 t1.micro : does not have /dev/sda2 or /dev/sda3

mountall will spin in m1.small, but not in t1.micro.

I'm attaching the output of debug boot on m1.small (where /dev/sda2 does exist and mountall spins).

--
[1] http://groups.google.com/group/ec2ubuntu/browse_thread/thread/d415746dff066f06

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: mountall 2.17
ProcVersionSignature: Ubuntu 2.6.35-22.33-virtual 2.6.35.4
Uname: Linux 2.6.35-22-virtual i686
Architecture: i386
Date: Tue Sep 28 06:45:14 2010
Ec2AMI: ami-307d8859
Ec2AMIManifest: ubuntu-images-testing-us/ubuntu-maverick-daily-i386-server-20100927.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.small
Ec2Kernel: aki-407d9529
Ec2Ramdisk: unavailable
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: mountall

==== SRU Information ====
- Impact: On EC2 or UEC, due to how cloud-init writes fstab entries, on first reboot mountall will eat all CPU cycles. On other systems, a user who updates /etc/fstab in a seemingly correct way will have mountall spin the first time it runs.
- How it was addressed: A logic error in the cut_options routine in mountall was fixed. The code segment was easily removed and tested against different strings to verify no regressions.
- Patch: A branch has been linked with the same patch as used in maverick.
  lp:~smoser/ubuntu/lucid/mountall/bug649591
- How to reproduce:
  Edit an existing entry in /etc/fstab for which there is an existing device. Change the options field to contain 'nobootwait,' in front of what was previously there. Reboot. You will now see mountall consuming resources.
- Regression: The most likely regression would be either failure of mountall to mount a device or segfault in mountall. That said, the patch was tested well, and Colin had a strong understanding of the cause.
==== End SRU Information ====
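
To check whether an existing fstab is affected, the options field can be scanned for one of mountall's private options followed by a comma. The following is a minimal sketch, not part of mountall: the function name find_affected_entries is made up for illustration, and the option names are the ones passed to cut_options in the call quoted in the comments on this bug.

```shell
# Print fstab entries whose options field (4th column) contains one of
# mountall's private options immediately followed by a comma.
# find_affected_entries is a hypothetical helper, not shipped with mountall.
find_affected_entries() {
    awk '
        $1 ~ /^#/ { next }    # skip comment lines
        # Prefix a comma so every option, including the first, starts with ","
        NF >= 4 && ("," $4) ~ /,(showthrough|optional|bootwait|nobootwait),/
    ' "${1:-/etc/fstab}"
}

# Usage: find_affected_entries            # reads /etc/fstab
#        find_affected_entries ./fstab    # reads another file
```

Entries where the private option comes last in the options field are not printed, which matches the reordering workaround described in the comments.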

Scott Moser (smoser) wrote :
Changed in mountall (Ubuntu):
importance: Undecided → Critical
milestone: none → ubuntu-10.10
status: New → Confirmed
Scott Moser (smoser) wrote :

I installed debugging libraries and found 2 more pieces of info:
a.) removing 'comment=cloudconfig' works around the problem
b.) mountall is spinning in cut_options at mountall.c:622
When I attach with gdb backtrace shows:

#0 0xb7884c42 in cut_options (parent=0x0, mnt=0xb8ffbff8) at mountall.c:622
#1 0xb7884fbe in run_mount (mnt=0xb8ffbff8, fake=0) at mountall.c:1843
#2 0xb78867b9 in try_mount (mnt=0xb8ffbff8, force=0) at mountall.c:1659
#3 0xb7881fe8 in spawn_child_handler (proc=0xb9000478, pid=437,
    event=NIH_CHILD_EXITED, status=0) at mountall.c:1789
#4 0xb784b6bb in nih_child_poll () at child.c:217
#5 0xb784eed2 in nih_main_loop () at main.c:600
#6 0xb7888994 in main (argc=3, argv=0xbfaa7db4) at mountall.c:3409

Colin Watson (cjwatson) on 2010-09-28
Changed in mountall (Ubuntu Maverick):
assignee: nobody → Colin Watson (cjwatson)
Scott Moser (smoser) wrote :

Reordering the options also works around the problem.
The following fs_mntops fields do not cause a problem:
- comment=cloudconfig,defaults,nobootwait
- defaults,nobootwait
Where the original does:
- defaults,nobootwait,comment=cloudconfig

I think this is an issue with '=' occurring after an option passed to cut_options. I.e., the call to cut_options looks like:
        opts = cut_options (NULL, mnt, "showthrough", "optional",
                            "bootwait", "nobootwait",
                            NULL);
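
For illustration only, here is a minimal shell sketch of what a routine like this is expected to do: walk a comma-separated option string and drop the named options. It is not the mountall C code, and cut_options_sketch is a hypothetical name. It shows two guards, always stepping past the comma and skipping zero-length options, which are the conditions named in the 2.19 changelog entry on this bug.

```shell
# cut_options_sketch OPTS NAME...
# Print OPTS (comma-separated) with every option equal to one of NAME removed.
# Hypothetical illustration; the real cut_options lives in mountall.c.
cut_options_sketch() {
    input=$1
    shift
    kept=
    while [ -n "$input" ]; do
        opt=${input%%,*}                 # text up to the first comma
        case $input in
            *,*) input=${input#*,} ;;    # always step past the comma
            *)   input= ;;               # last option: nothing left
        esac
        if [ -z "$opt" ]; then
            continue                     # skip zero-length options ("a,,b")
        fi
        drop=no
        for name in "$@"; do
            if [ "$opt" = "$name" ]; then
                drop=yes
            fi
        done
        if [ "$drop" = no ]; then
            kept=${kept:+$kept,}$opt
        fi
    done
    printf '%s\n' "$kept"
}

# Usage:
#   cut_options_sketch 'defaults,nobootwait,comment=cloudconfig' \
#       showthrough optional bootwait nobootwait
```

Because the remaining input shrinks on every pass, the loop terminates for any option string, including ones with a trailing or doubled comma.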

Colin Watson (cjwatson) on 2010-09-28
summary: mountall spins eating cpu when 'nobootwait' option exists in fstab
+ followed by a comma
Changed in mountall (Ubuntu Maverick):
status: Confirmed → Fix Committed
Changed in mountall (Ubuntu Lucid):
importance: Undecided → High
status: New → Triaged
assignee: nobody → Colin Watson (cjwatson)
milestone: none → ubuntu-10.04.2
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.19

---------------
mountall (2.19) maverick; urgency=low

  * Fix infinite loop when one of mountall's private mount options is
    followed by a comma, and guard against other reasons why cut_options
    might end up comparing a zero-length option (LP: #649591).
 -- Colin Watson <email address hidden> Tue, 28 Sep 2010 09:35:11 +0100

Changed in mountall (Ubuntu Maverick):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

As pointed out in Eric Hammond's post, this bug can be fixed in a booted instance with:
 - sudo perl -pi -e 's/(nobootwait),(\S+)/$2,$1/' /etc/fstab
or, if you prefer sed:
 - sudo sed -i 's/\(nobootwait\),\([^[:space:]]\+\)/\2,\1/' /etc/fstab

That will change:
/dev/sda2 /mnt auto defaults,nobootwait,comment=cloudconfig 0 0
to
/dev/sda2 /mnt auto defaults,comment=cloudconfig,nobootwait 0 0
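
To see the effect before editing /etc/fstab in place, the same sed expression can be run without -i. This is a small sketch; preview_nobootwait_fix is a hypothetical helper name, and the expression assumes GNU sed because of the "\+" quantifier.

```shell
# Print FILE with the reordering applied, leaving FILE itself untouched.
# preview_nobootwait_fix is a hypothetical helper name; the sed expression is
# the one given above, minus the -i flag.
preview_nobootwait_fix() {
    sed 's/\(nobootwait\),\([^[:space:]]\+\)/\2,\1/' "${1:-/etc/fstab}"
}

# Usage: preview_nobootwait_fix /etc/fstab | diff -u /etc/fstab -
```

An empty diff means no entry has 'nobootwait' followed by further options, so nothing would change.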

If you haven't yet rebooted the instance, you will not need to take any further action. If you *have* rebooted the instance you will see 'mountall' spinning. That can be fixed with:
sudo stop mountall

Alternatively, you can avoid the bug when you start the instance by launching with user-data like
$ cat ud.txt
#cloud-config
mounts:
 - [ ephemeral0 ]
$ euca-run-instances --user-data-file=ud.txt ...

That indicates to cloud-config that it should not write an entry in /etc/fstab for ephemeral0 (/dev/sda2 on i386 and /dev/sdb on x86_64).

Scott Moser (smoser) on 2010-09-30
description: updated
Colin Watson (cjwatson) on 2010-09-30
description: updated
Scott Moser (smoser) wrote :

I did a local build of the proposed merge, and tested that it fixes the issue.
I also have a build with that change pending in my PPA.

This is ready for someone to sponsor to lucid-proposed, but I cannot do the upload myself.

Changed in mountall (Ubuntu Lucid):
assignee: Colin Watson (cjwatson) → Dustin Kirkland (kirkland)
status: Triaged → In Progress
status: In Progress → Fix Committed
Dustin Kirkland  (kirkland) wrote :

Uploaded to lucid-proposed. Please test that build when it becomes available and respond here with comments.

Accepted mountall into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
ingo (ingo-steiner) wrote :

I did test 2.15.3 in Lucid-amd64 and observed (compared with 2.15.2):

there is no more any output from regularly run fsck available during boot on the console. It just reports:
fsck from util-linux-ng 2.17.2

2.15.2 showed for every single device something like this:
fsck from util-linux-ng 2.17.2 sdaX: clean, 141299/1028160 files, 812106/4112632 blocks

Scott Moser (smoser) wrote :

Ingo,
  It doesn't seem likely that the changes made in 2.15.3 affected such output.
  See the diff from 2.15.1 -> 2.15.3 at [1]. I found that from [2]. The first 2 hunks of the mountall.c changes are the only changes made in 2.15.3 versus 2.15.2.

--
[1] http://launchpadlibrarian.net/57191747/mountall_2.15.1_2.15.3.diff.gz
[2] https://launchpad.net/ubuntu/+source/mountall

Scott Moser (smoser) wrote :

I've verified this by the following:

## start instance of type m1.small (any instance type that has an ephemeral0
## will show it)
## ami-6c06f305 = ebs/ubuntu-lucid-10.04-i386-server-20100923
# ec2-run-instances ami-6c06f305 --instance-type m1.small
# ssh to instance
$ sudo reboot # it took a reboot to trigger the bug
# reconnect to instance
# verify mountall pegging cpu
$ sudo stop mountall
$ rel=$(lsb_release --codename --short)
$ echo deb http://archive.ubuntu.com/ubuntu/ ${rel}-proposed restricted main multiverse universe | sudo tee -a /etc/apt/sources.list.d/${rel}-proposed.list
$ sudo apt-get update
$ sudo apt-get install mountall
$ sudo reboot
# ssh to instance
# verify mountall *not* pegging cpu
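
The two "verify" steps above can be scripted roughly as follows. This is a sketch under assumptions: cpu_at_least is a hypothetical helper, the threshold of 90 is arbitrary, and the pcpu value reported by ps is CPU time divided by the time the process has been running, so it is most meaningful shortly after the reboot.

```shell
# cpu_at_least THRESHOLD
# Read CPU percentages (one per line) on stdin; exit 0 if any value is at
# least THRESHOLD, exit 1 otherwise. Hypothetical helper for illustration.
cpu_at_least() {
    awk -v t="$1" '$1 + 0 >= t + 0 { found = 1 } END { exit !found }'
}

# Usage (procps ps; prints one pcpu value per mountall process, no header):
#   if ps -C mountall -o pcpu= | cpu_at_least 90; then
#       echo "mountall is pegging the CPU"
#   fi
```

If no mountall process is running, ps prints nothing and the check reports that the CPU is not being pegged.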

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.15.3

---------------
mountall (2.15.3) lucid-proposed; urgency=low

  [ Colin Watson ]
  * Fix infinite loop when one of mountall's private mount options is
    followed by a comma, and guard against other reasons why cut_options
    might end up comparing a zero-length option (LP: #649591).
 -- Scott Moser <email address hidden> Thu, 30 Sep 2010 03:35:23 -0400

Changed in mountall (Ubuntu Lucid):
status: Fix Committed → Fix Released
ingo (ingo-steiner) wrote :

My comment above was incorrect:

"there is no more any output from regularly run fsck available during boot on the console." - sorry.

It still shows up, the only difference:
it does not display partitions by device (/dev/sdaX), instead it displays the label. That's why I overlooked it.

Is there any way to enable the progress-bar again (like in Hardy), as obtained with 'fsck -C' on ext3 filesystems?

Mark - Syminet (mark-syminet) wrote :

What Ingo said - for servers, not having a console progress bar is *terrible*, since for servers we often aren't physically there. So even with e.g. remote VNC access we get nothing but a blank screen and have no idea if it's actually even doing a fsck, or locked up for some other reason. All of this while customers are calling frantically screaming at us because their server is down.

As a stopgap we're now booting them into rescue mode from external media and running the fscks manually, but all of this causes unnecessary hassle, stress and increased downtime. Would be much nicer to have our progress bar back like we always had before.
