Cloud-init fails to write ext4 filesystem to Azure Ephemeral Drive

Bug #1626243 reported by Matt Bearup on 2016-09-21
This bug affects 1 person
Affects               Importance   Assigned to
cloud-init            Medium       Unassigned
cloud-init (Ubuntu)   Medium       Unassigned
  Xenial              Medium       Unassigned
  Yakkety             Medium       Unassigned
  Zesty               Medium       Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
There is a race condition that occurs when cloud-init tries to partition a
block device (/dev/sdb) and then put a filesystem on a partition on it. It is
possible that cloud-init tries to run mkfs on /dev/sdb1 after partitioning the
device /dev/sdb but before the partition device node '/dev/sdb1' exists.

When this race condition occurs, cloud-init will fail to make the "ephemeral"
device available to the user on Azure.

[Test Case]
A reliable reproduction test case is hard to come by here. The failure case
is believed to be well understood.

[Regression Potential]
There should be very little chance of regression, since essentially all the
change does is replace:

1. sgdisk -n 1:0:0 /dev/sdb
2. mkfs.ext4 /dev/sdb1

to

1. sgdisk -n 1:0:0 /dev/sdb
1a. udevadm settle
1b. blockdev --rereadpt /dev/sdb
1c. udevadm settle
2. mkfs.ext4 /dev/sdb1

The steps '1b' and '1c' above are not strictly necessary, but were already
present in the method. They serve here as an additional wait.
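
The revised ordering can be sketched in Python as a list of commands run one
after another. This is only an illustration, not cloud-init's code: the names
partition_then_format_cmds and run_all are hypothetical, and passing the disk
as an argument to blockdev is an assumption of the sketch.

```python
import subprocess


def partition_then_format_cmds(disk="/dev/sdb", part="/dev/sdb1"):
    """Return the commands in the order they would run (hypothetical helper)."""
    return [
        ["sgdisk", "-n", "1:0:0", disk],   # 1.  create partition 1 on the disk
        ["udevadm", "settle"],             # 1a. wait for pending udev events
        ["blockdev", "--rereadpt", disk],  # 1b. ask the kernel to re-read the table
        ["udevadm", "settle"],             # 1c. wait again after the re-read
        ["mkfs.ext4", part],               # 2.  only now format the partition
    ]


def run_all(cmds, runner=subprocess.check_call):
    """Run each command in order; the first failure raises and stops the run."""
    for cmd in cmds:
        runner(cmd)
```

Passing a different runner (for example a list's append method) lets the
ordering be checked without touching a real disk.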

[Other Info]
The change that fixes this is viewable at [1]. For context, view all of
cc_disk_setup.py [2]. Basically we just add a call to read_parttbl [3] to
exec_mkpart_gpt after invoking the sgdisk command that partitions a disk.
read_parttbl basically does a udevadm settle, which fixes the race condition
that was seen.

[1] https://git.launchpad.net/cloud-init/commit/?id=29348af1c889931e8973f8fc8cb090c063316f7a
[2] https://git.launchpad.net/cloud-init/tree/cloudinit/config/cc_disk_setup.py?id=29348af1c889931e8973f8fc8cb090c063316f7a
[3] https://git.launchpad.net/cloud-init/tree/cloudinit/config/cc_disk_setup.py?id=29348af1c889931e8973f8fc8cb090c063316f7a#n674
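
As a rough illustration of the kind of helper described above, the sketch below
settles udev, asks the kernel to re-read the partition table, and settles
again. It is not the code from cc_disk_setup.py: reread_partition_table and its
run parameter are hypothetical names, and treating a failed re-read as
non-fatal is an assumption made for the sketch.

```python
import subprocess


def reread_partition_table(device, run=subprocess.check_call):
    """Settle udev, re-read the partition table, then settle again.

    Returns True if the re-read command succeeded, False otherwise.
    """
    settle = ["udevadm", "settle"]
    run(settle)  # let pending udev events (e.g. from sgdisk) finish first
    ok = True
    try:
        run(["blockdev", "--rereadpt", device])
    except (subprocess.CalledProcessError, OSError):
        ok = False  # assumption: a failed re-read should not abort disk setup
    run(settle)  # wait for partition nodes to appear before returning
    return ok
```

The second settle runs even when the re-read fails, so the caller still gets
the wait that closes the race.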

=== End SRU Template ===

The symptom is similar to bug 1611074 but the cause is different. In this case it seems there is an error accessing /dev/sdb1 when lsblk is run, possibly because sgdisk isn't done creating the partition. The specific error message is "/dev/sdb1: not a block device." A simple wait and retry here may resolve the issue.

util.py[DEBUG]: Running command ['/sbin/sgdisk', '-p', '/dev/sdb'] with allowed return codes [0] (shell=False, capture=True)
cc_disk_setup.py[DEBUG]: Device partitioning layout matches
util.py[DEBUG]: Creating partition on /dev/disk/cloud/azure_resource took 0.056 seconds
cc_disk_setup.py[DEBUG]: setting up filesystems: [{'filesystem': 'ext4', 'device': 'ephemeral0.1', 'replace_fs': 'ntfs'}]
cc_disk_setup.py[DEBUG]: ephemeral0.1 is mapped to disk=/dev/disk/cloud/azure_resource part=1
cc_disk_setup.py[DEBUG]: Creating new filesystem.
cc_disk_setup.py[DEBUG]: Checking /dev/sdb against default devices
cc_disk_setup.py[DEBUG]: Manual request of partition 1 for /dev/sdb1
cc_disk_setup.py[DEBUG]: Checking device /dev/sdb1
util.py[DEBUG]: Running command ['/sbin/blkid', '-c', '/dev/null', '/dev/sdb1'] with allowed return codes [0, 2] (shell=False, capture=True)
cc_disk_setup.py[DEBUG]: Device /dev/sdb1 has None None
cc_disk_setup.py[DEBUG]: Device /dev/sdb1 is cleared for formating
cc_disk_setup.py[DEBUG]: File system None will be created on /dev/sdb1
util.py[DEBUG]: Running command ['/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/sdb1', '--nodeps'] with allowed return codes [0] (shell=False, capture=True)
util.py[DEBUG]: Creating fs for /dev/disk/cloud/azure_resource took 0.008 seconds
util.py[WARNING]: Failed during filesystem operation#012Failed during disk check for /dev/sdb1#012Unexpected error while running command.#012Command: ['/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/sdb1', '--nodeps']#012Exit code: 32#012Reason: -#012Stdout: ''#012Stderr: 'lsblk: /dev/sdb1: not a block device\n'
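
The "wait and retry" idea mentioned above could look roughly like the following
Python sketch, which polls until the partition node exists as a block device.
The function names, the retry count and the delay are all assumptions chosen
for illustration; none of them come from cloud-init.

```python
import os
import stat
import time


def is_block_device(path):
    """Return True if path exists and is a block device node."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        return False


def wait_for_block_device(path, attempts=10, delay=0.5,
                          check=is_block_device, sleep=time.sleep):
    """Poll until check(path) is true; return False after `attempts` tries."""
    for attempt in range(attempts):
        if check(path):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give udev time to create the node before retrying
    return False
```

A caller would run something like wait_for_block_device("/dev/sdb1") before
invoking lsblk or mkfs, and treat a False result as a failure to report.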

Scott Moser (smoser) wrote :

quick read of above log does look like we might need a udevadm settle in there.

description: updated
Scott Moser (smoser) on 2016-10-21
Changed in cloud-init:
status: New → Confirmed
Changed in cloud-init (Ubuntu):
status: New → Confirmed
Changed in cloud-init:
importance: Undecided → Medium
Changed in cloud-init (Ubuntu):
importance: Undecided → Medium
Scott Moser (smoser) wrote :

It really does seem like this is an obvious case of needing to udevadm settle so that the partition exists after partitioning and before formatting.

I'm attaching a little shell script that really should reproduce the failure, but I can't get it to.

Scott Moser (smoser) wrote :

How easily can you recreate this? If I give you a patched cloud-init, could you be reasonably sure that it was now fixed?

Matt Bearup (mbearup) wrote :

We don't have a reliable repro but would be glad to test out a fix.

Scott Moser (smoser) on 2016-10-25
Changed in cloud-init:
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-27-g29348af-0ubuntu1

---------------
cloud-init (0.7.8-27-g29348af-0ubuntu1) zesty; urgency=medium

  * debian/cloud-init.templates: enable DigitalOcean by default [Ben Howard]
  * New upstream snapshot.
    - disk-config: udev settle after partitioning in gpt format. (LP: #1626243)
    - unittests: do not read system /etc/cloud/cloud.cfg.d (LP: #1635350)
    - Add documentation for logging features. [Wesley Wiedenmeier]
    - Add support for snap create-user on Ubuntu Core images. [Ryan Harper]
      (LP: #1619393)
    - Fix sshd restarts for rhel distros. [Jim Gorz] (LP: #1470433)
    - OpenNebula: replace 'ip' parsing with cloudinit.net usage.
    - Fix python2.6 things found running in centos 6.
    - Move user/group functions to new ug_util file [Joshua Harlow]
    - DigitalOcean: enable usage of data source by default.
    - update Gentoo initscripts to run in the correct order [Matthew Thode]
    - MAAS: improve the main of datasource to look at kernel cmdline config.
    - tests: silence the Cheetah UserWarning about NameMapper C version.

 -- Scott Moser <email address hidden> Tue, 25 Oct 2016 17:06:59 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Matt Bearup (mbearup) wrote :

Thanks for the quick response. In my testing I've been unable to repro the issue. Will this be backported to Xenial? Users will continue to hit this issue until the fix is backported.

Thanks again

Matt Bearup (mbearup) wrote :

Users continue to hit this issue every day in Xenial, and I don't see how it will be mitigated without backporting the fix. Can we get an ETA for backporting to Xenial?

Thanks again,

Scott Moser (smoser) on 2016-11-07
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
Changed in cloud-init (Ubuntu Yakkety):
status: New → Confirmed
Changed in cloud-init (Ubuntu Xenial):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Yakkety):
importance: Undecided → Medium
Scott Moser (smoser) on 2016-11-07
description: updated
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Matt, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-47-gb6561a1-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed
Scott Moser (smoser) on 2016-11-16
description: updated
Matt Bearup (mbearup) wrote :

Thanks for your help Steve and Scott. I ran a test with 100 VMs (custom image with this version of cloud-init included) and all 100 successfully came up with /mnt formatted as ext4. So I think this bug is sorted out.
However, I'm still able to repro bug 1611074 when I resize a VM with this version of cloud-init installed. The changelog indicates that the bugfix was backported, so it's a bit surprising.

Matt Bearup (mbearup) wrote :

Correction: if I follow the workflow (upgrade cloud-init -> reboot -> resize) then the ephemeral drive is formatted properly. If this is intended functionality (reboot is required before resize) then we can consider bug 1611074 resolved as well.

Steve Langasek (vorlon) wrote :

Hello Matt, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-49-g9e904bb-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Scott Moser (smoser) wrote :

I've verified this on xenial with the following. The net result is that when a new instance came across an empty ephemeral disk, it put a GPT partition table on it, with one partition, and formatted it as ext4.

## Launch an azure instance and ssh in
$ azure-ubuntu xenial . --user-data-file=none Standard_D1_v2
flavor=Basic_A0 image=xenial-daily location=us-east-1
azure vm create --vm-size=Standard_D1_v2 --vm-name=smoser1121x "--location=East US" --<email address hidden> --no-ssh-password --ssh=22 smoser1121x b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu_DAILY_BUILD-xenial-16_04-LTS-amd64-server-20161119-en-us-30GB smoser

$ ssh <email address hidden>

## these are useful for collecting logs and such.
% git clone https://gist.github.com/29ea35a797c0df1fcb6ac875a024efa9.git htools
% sudo ./htools/save-old-data first-boot
new instance local: not found
new instance net : true
reformattable: not found
disk_setup ran: true
mounts ran: true
proc-mounts: /dev/sdb1 /mnt ext4
/etc/fstab: /dev/disk/cloud/azure_resource-part1 /mnt defaults,nofail,comment=cloudconfig

% sudo ./htools/enable-proposed
deb http://azure.archive.ubuntu.com/ubuntu/ xenial-proposed main universe
% sudo eatmydata apt-get update -q && sudo eatmydata apt-get -qy install cloud-init
% dpkg-query --show cloud-init
cloud-init 0.7.8-49-g9e904bb-0ubuntu1~16.04.1

## Wipe the disk, taking off partition table and filesystem on part1
% sudo umount /mnt
% sudo dd if=/dev/zero of=/dev/disk/cloud/azure_resource bs=1M count=10

## reboot clearing /var/lib/cloud so instance believes it is new.
% sudo ./htools/do-reboot clean
cleared /var/lib/cloud
cleared logs
rebooting

## Now go back in. Note: a Basic_A0 may take around 20 minutes to come back up
## due to the IO of mkfs; a more reasonably sized instance will take ~2 seconds.
% sudo ./htools/save-old-data
new instance local: not found
new instance net : true
reformattable: true
disk_setup ran: true
mounts ran: true
proc-mounts: /dev/sdb1 /mnt ext4
/etc/fstab: /dev/disk/cloud/azure_resource-part1 /mnt defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig

% sudo blkid /dev/sdb
/dev/sdb: PTUUID="fd0861f4-305c-40e8-96f1-f6203fa66d39" PTTYPE="gpt"
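
The "proc-mounts" check reported above could be automated along these lines. It
is a minimal Python sketch that parses /proc/mounts-style text; find_mount and
ephemeral_is_ext4 are hypothetical helper names and are not part of the htools
scripts used in this comment.

```python
def find_mount(mounts_text, mount_point):
    """Return (device, fstype) for mount_point from /proc/mounts-style text."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = (fields[0], fields[2])  # later entries shadow earlier ones
    return found


def ephemeral_is_ext4(mounts_text, mount_point="/mnt"):
    """Return True if mount_point is mounted with an ext4 filesystem."""
    entry = find_mount(mounts_text, mount_point)
    return entry is not None and entry[1] == "ext4"
```

On a running instance it could be called as
ephemeral_is_ext4(open("/proc/mounts").read()).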

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-49-g9e904bb-0ubuntu1~16.04.1

---------------
cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/cloud-init.postinst: update /etc/fstab on Azure to fix
    future resize operations. (LP: #1611074)
  * New upstream snapshot.
    - Add activate_datasource, for datasource specific code paths.
      (LP: #1611074)
    - systemd: cloud-init-local use RequiresMountsFor=/var/lib/cloud
      (LP: #1642062)

cloud-init (0.7.8-47-gb6561a1-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/cloud-init.templates: enable DigitalOcean by default [Ben Howard]
  * New upstream snapshot.
    - systemd/cloud-init-local.service:
      + replace 'Wants' and 'After' on local-fs.target with more granular
        After=systemd-remount-fs.service and RequiresMountsFor=/var/lib
        and Before=sysinit.target.
        This is done run sufficiently early enough to update /etc/fstab.
        (LP: #1611074)
      + add Before=NetworkManager.service so that cloud-init can render
        NetworkManager network config before it would apply them.
    - systemd/cloud-init.service:
      + add Before=sysinit.target and DefaultDependencies=no (LP: #1611074)
      + drop Requires=networking.service to work where networking.service is
        not needed.
      + add Conflicts=shutdown.target
      + drop unnecessary Wants=local-fs.target
    - net: support reading ipv6 dhcp config from initramfs [LaMont Jones]
      (LP: #1621615)
    - dmidecode: Allow dmidecode to be used on aarch64, and only attempt
      usage on x86, x86_64, and aarch64. [Robert Schweikert]
    - disk-config: udev settle after partitioning in gpt format.
      (LP: #1626243)
    - Add support for snap create-user on Ubuntu Core images. [Ryan Harper]
      (LP: #1619393)
    - Fix sshd restarts for rhel distros. [Jim Gorz]
    - Move user/group functions to new ug_util file [Joshua Harlow]
    - update Gentoo initscripts to run in the correct order [Matthew Thode]
    - MAAS: improve the debugging tool in datasource to consider
      config provided on kernel cmdline.
    - lxd: Update network config for LXD 2.3 [Stéphane Graber] (LP: #1640556)
    - Decode unicode types in decode_binary [Robert Schweikert]
    - Allow ephemeral drive to be unpartitioned [Paul Meyer]
    - subp: add 'update_env' argument which allows for more easily adding
      environment variables to a subprocess call.
    - Adjust mounts and disk configuration for systemd. (LP: #1611074)
    - DataSources:
      + Ec2: protect against non-dictionary in block-device-mapping.
      + AliYun: Add new datasource for Ali-Cloud ECS, that is
        available but not enabled by default [kaihuan.pkh]
      + DigitalOcean: use meta-data for network configuration and
        enable data source by default. [Ben Howard]
      + OpenNebula: replace parsing of 'ip' command with similar function
        available in cloudinit.net. This fixed unit tests when running
        in environment with no networking.
    - doc changes:
      + Add documentation on stages of boot.
      + make the RST files consistently formated and other improvements.
     ...

Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris Halse Rogers (raof) wrote : Please test proposed package

Hello Matt, or anyone else affected,

Accepted cloud-init into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-49-g9e904bb-0ubuntu1~16.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Yakkety):
status: Confirmed → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Scott Moser (smoser) wrote :

I've verified this using:
$ dpkg-query --show cloud-init
cloud-init 0.7.8-49-g9e904bb-0ubuntu1~16.10.1
$ cat /etc/cloud/build.info
build_name: server
serial: 20161214

I basically did the same as in comment 12 with 'yakkety' instead of 'xenial'

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-49-g9e904bb-0ubuntu1~16.10.1

---------------
cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.10.1) yakkety; urgency=medium

  * debian/cloud-init.templates: enable DigitalOcean by default [Ben Howard]
  * debian/cloud-init.postinst: update /etc/fstab on Azure to fix
    future resize operations. (LP: #1611074)
  * New upstream snapshot.
    - systemd/cloud-init-local.service:
      + replace 'Wants' and 'After' on local-fs.target with more granular
        After=systemd-remount-fs.service and RequiresMountsFor=/var/lib
        and Before=sysinit.target.
        This is done run sufficiently early enough to update /etc/fstab.
        (LP: #1611074)
    - systemd/cloud-init.service:
      + add Before=sysinit.target and DefaultDependencies=no (LP: #1611074)
      + drop Requires=networking.service to work where networking.service is
        not needed.
      + add Conflicts=shutdown.target
      + drop unnecessary Wants=local-fs.target
    - net: support reading ipv6 dhcp config from initramfs [LaMont Jones]
      (LP: #1621615)
    - dmidecode: Allow dmidecode to be used on aarch64, and only attempt
      usage on x86, x86_64, and aarch64. [Robert Schweikert]
    - disk-config: udev settle after partitioning in gpt format.
      (LP: #1626243)
    - Add support for snap create-user on Ubuntu Core images. [Ryan Harper]
      (LP: #1619393)
    - Fix sshd restarts for rhel distros. [Jim Gorz]
    - Move user/group functions to new ug_util file [Joshua Harlow]
    - update Gentoo initscripts to run in the correct order [Matthew Thode]
    - MAAS: improve the debugging tool in datasource to consider
      config provided on kernel cmdline.
    - DataSources:
      + Ec2: protect against non-dictionary in block-device-mapping.
      + AliYun: Add new datasource for Ali-Cloud ECS, that is
        available but not enabled by default [kaihuan.pkh]
      + OpenNebula: replace parsing of 'ip' command with similar function
        available in cloudinit.net. This fixed unit tests when running
        in environment with no networking.
    - doc changes:
      + Add documentation on stages of boot.
      + make the RST files consistently formated and other improvements.
      + fixed example to not overwrite /etc/hosts [Chris Glass]
      + fix spelling / typos in ca_certs and scripts_vendor.
      + improve HACKING.rst file
      + Add documentation for logging features. [Wesley Wiedenmeier]
    - code style and unit test changes:
      + pep8: fix style errors reported by pycodestyle 2.1.0
      + pyflakes: fix issue with pyflakes 1.3 found in ubuntu zesty-proposed.
      + Add coverage dependency to bddeb to fix package build.
      + Add coverage collection to tox unit tests. [Joshua Powers]
      + do not read system /etc/cloud/cloud.cfg.d (LP: #1635350)
      + tests: silence the Cheetah UserWarning about NameMapper C version.
      + Fix python2.6 things found running in centos 6.

 -- Scott Moser <email address hidden> Tue, 22 Nov 2016 17:04:36 -0500

Changed in cloud-init (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

This is fixed in cloud-init 0.7.9.

Changed in cloud-init:
status: Fix Committed → Fix Released