Reformatting of ephemeral drive fails on resize of Azure VM

Bug #1611074 reported by Paul Meyer on 2016-08-08
This bug affects 1 person
Affects              Status  Importance  Assigned to  Milestone
cloud-init                   High        Unassigned
cloud-init (Ubuntu)          High        Scott Moser
  Xenial                     Medium      Unassigned
  Yakkety                    Medium      Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
In some cases, cloud-init writes entries to /etc/fstab, and on azure it will
even format a disk for mounting and then write the entry for that 'ephemeral'
disk there.

A supported operation on Azure is to "resize" the system. When you do this
the system is shut down, resized (given larger/faster disks and more CPU) and
then brought back up. In that process, the "ephemeral" disk is re-initialized
to its original NTFS format. The designed goal is for cloud-init to recognize
this situation and re-format the disk to ext4.

The problem is that the mount of that disk happens before cloud-init can
reformat. That's because the entry in fstab has 'auto' and is automatically
mounted. The end result is that after a resize operation the user will be left
with the ephemeral disk mounted at /mnt and having an NTFS filesystem rather
than ext4.

[Test Case]
Comment 3, by the original reporter, describes how to recreate the problem.
Another way to do this is to just re-format the ephemeral disk as
ntfs and then reboot. The result *should* be that after reboot it
comes back up and has an ext4 filesystem on it.

1.) boot system on azure
  (for this, I use https://gist.github.com/smoser/5806147, but you can
   use web ui or any other way).
   Save output of
     journalctl --no-pager > journalctl.orig
     systemctl status --no-pager > systemctl-status.orig
     systemctl --no-pager > systemctl.orig

2.) unmount the ephemeral disk
   $ umount /mnt

3.) repartition it so that mkfs.ntfs does less and is faster
   This is not strictly necessary, but mkfs.ntfs can take upwards of
   20 minutes. Shrinking the partition to 200M means it will finish
   in < 1 minute.

   $ disk=/dev/disk/cloud/azure_resource
   $ part=/dev/disk/cloud/azure_resource-part1
   $ echo "2048,$((2*1024*100)),7" | sudo sfdisk "$disk"
   $ time mkfs.ntfs --quick "$part"

4.) reboot
5.) expect that /proc/mounts has /dev/disk/cloud/azure_resource-part1 as ext4
    and that fstab has x-systemd.requires in it.

    $ awk '$2 == "/mnt" { print $0 }' /proc/mounts
    /dev/sdb1 /mnt ext4 rw,relatime,data=ordered 0 0

    $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
    /dev/sdb1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig 0 2

6.) collect journal and systemctl information as described in step 1 above.
    Compare output, specifically looking for case-insensitive "breaks".
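
The check in step 5 can also be scripted. The following is a minimal Python sketch, not part of cloud-init: the function names mount_entry and check_step5 are made up for illustration, and the code assumes only the whitespace-separated field layout that /proc/mounts and /etc/fstab share.

```python
def mount_entry(table_text, mount_point):
    """Return (device, fstype, options) for mount_point, or None.

    Accepts the text of /proc/mounts or /etc/fstab. Both use the layout
    'device mountpoint fstype options ...'. If several lines match, the
    last one wins, since later mounts sit on top of earlier ones.
    """
    found = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = (fields[0], fields[2], fields[3].split(","))
    return found


def check_step5(proc_mounts_text, fstab_text, mount_point="/mnt"):
    """Return a list of problems; an empty list means step 5 passed."""
    problems = []
    mounted = mount_entry(proc_mounts_text, mount_point)
    if mounted is None:
        problems.append("nothing mounted at %s" % mount_point)
    elif mounted[1] != "ext4":
        problems.append("%s is %s, expected ext4" % (mount_point, mounted[1]))

    entry = mount_entry(fstab_text, mount_point)
    if entry is None:
        problems.append("no fstab entry for %s" % mount_point)
    elif not any(o.startswith("x-systemd.requires=") for o in entry[2]):
        problems.append("fstab entry lacks x-systemd.requires")
    return problems
```

On a live system, pass in the contents of /proc/mounts and /etc/fstab. An empty result means both expectations from step 5 held.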

[Regression Potential]
Regression is unlikely. The likely failure case is just that the problem is not
correctly fixed, and the user ends up with either an NTFS-formatted disk
mounted at /mnt or nothing mounted at /mnt.

=== End SRU Template ===

After resizing a 16.04 VM on Azure, the VM is presented with a new ephemeral drive (of a different size), which is initially NTFS-formatted. Cloud-init tries to format the appropriate partition as ext4, but fails because the partition is mounted. Cloud-init has unmount logic for exactly this case in the get_data call on the Azure data source, but this is never called because a fresh cache is found.

Jun 27 19:07:47 azubuntu1604arm [CLOUDINIT] handlers.py[DEBUG]: start: init-network/check-cache: attempting to read from cache [trust]
Jun 27 19:07:47 azubuntu1604arm [CLOUDINIT] util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
Jun 27 19:07:47 azubuntu1604arm [CLOUDINIT] util.py[DEBUG]: Read 5950 bytes from /var/lib/cloud/instance/obj.pkl
Jun 27 19:07:47 azubuntu1604arm [CLOUDINIT] stages.py[DEBUG]: restored from cache: DataSourceAzureNet [seed=/dev/sr0]
Jun 27 19:07:47 azubuntu1604arm [CLOUDINIT] handlers.py[DEBUG]: finish: init-network/check-cache: SUCCESS: restored from cache: DataSourceAzureNet [seed=/dev/sr0]
...
Jun 27 19:07:48 azubuntu1604arm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Creating file system None on /dev/sdb1
Jun 27 19:07:48 azubuntu1604arm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Using cmd: /sbin/mkfs.ext4 /dev/sdb1
Jun 27 19:07:48 azubuntu1604arm [CLOUDINIT] util.py[DEBUG]: Running command ['/sbin/mkfs.ext4', '/dev/sdb1'] with allowed return codes [0] (shell=False, capture=True)
Jun 27 19:07:48 azubuntu1604arm [CLOUDINIT] util.py[DEBUG]: Creating fs for /dev/disk/cloud/azure_resource took 0.052 seconds
Jun 27 19:07:48 azubuntu1604arm [CLOUDINIT] util.py[WARNING]: Failed during filesystem operation#012Failed to exec of '['/sbin/mkfs.ext4', '/dev/sdb1']':#012Unexpected error while running command.#012Command: ['/sbin/mkfs.ext4', '/dev/sdb1']#012Exit code: 1#012Reason: -#012Stdout: ''#012Stderr: 'mke2fs 1.42.13 (17-May-2015)\n/dev/sdb1 is mounted; will not make a filesystem here!\n'

$ lsb_release -rd
Description: Ubuntu 16.04.1 LTS
Release: 16.04
$ cat /etc/cloud/build.info
build_name: server
serial: 20160721
~$ dpkg -l cloud-init
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-=====================-=====================-=====================================================================
ii cloud-init 0.7.7~bzr1256-0ubuntu all Init scripts for cloud instances

We're seeing ~100% repro of this bug on resize, where the only success cases are caused by another bug that messes up fstab and prevents mounting of the drive.

Related bugs:
 bug 1629868: cloud-init times out because of no dbus
 bug 1603222: Azure: incorrect entry in fstab for ephemeral disk

Paul Meyer (paul-meyer) wrote :
Dan Watkins (oddbloke) wrote :

Hi Paul,

Could you give me steps that I can follow to reproduce this issue (ideally using the Azure CLI)? That'll make it easier for us to test fixes.

Thanks,

Dan

Paul Meyer (paul-meyer) wrote : RE: [Bug 1611074] Re: Reformatting of ephemeral drive fails on resize of Azure VM

Hi Dan,

Thanks for checking this out. Basically just create a 16.04 VM and resize it (e.g. from D1 to D2). Look at mount/blkid output in between and after to see the difference:

azure config mode arm
azure vm quick-create bug1611074 reprovm centralus linux Canonical:UbuntuServer:16.04.0-LTS:latest $USER -M ~/.ssh/id_rsa.pub -z Standard_D1

ssh to machine, `mount|grep '/dev/sd'` should show something like this:
/dev/sda1 on / type ext4 (rw,relatime,discard,data=ordered)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)

Now, resize VM, which forces re-creation of the resource disk (formatted NTFS)
azure vm set bug1611074 reprovm -z Standard_D2

ssh to machine, `mount|grep '/dev/sd'` now shows this:
/dev/sda1 on / type ext4 (rw,relatime,discard,data=ordered)
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)

And `blkid` will show
/dev/sda1: LABEL="cloudimg-rootfs" UUID="b2e47a31-37fe-4914-b333-bd1c2a2dacae" TYPE="ext4" PARTUUID="c74ad4d8-01"
/dev/sdb1: LABEL="Temporary Storage" UUID="B82692572692170A" TYPE="ntfs" PARTUUID="4041cb24-01"

There's a slight chance that it doesn't repro. I noticed that there's a race between the SCSI initialization or udev and the code in cloud-init that determines whether it should take /dev/disk/azure/resource or /dev/disk/azure/resource-part1. This code checks for the existence of the latter and, if it exists, uses it. Sometimes this check fails, which leads to the resource disk not being prepared or mounted properly. The resulting incorrect fstab entry prevents the mount on the resized VM, which then allows the reformat to ext4 to succeed.
If you run into this, just resize again to any size and it should repro then.

-----Original Message-----
From: <email address hidden> [mailto:<email address hidden>] On Behalf Of Dan Watkins
Sent: Wednesday, August 24, 2016 6:21 AM
To: Paul Meyer <email address hidden>
Subject: [Bug 1611074] Re: Reformatting of ephemeral drive fails on resize of Azure VM

Hi Paul,

Could you give me steps that I can follow to reproduce this issue (ideally using the Azure CLI)? That'll make it easier for us to test fixes.

Thanks,

Dan


Jon Grimm (jgrimm) on 2016-09-02
Changed in cloud-init (Ubuntu):
assignee: nobody → Scott Moser (smoser)
Scott Moser (smoser) wrote :

Dan, thanks for the recreate description.
I'd never been aware of 'quick-create'. I'd built another wrapper around the azure cli to do something similar. https://gist.github.com/smoser/5806147

I think that the crux of the issue here is that, with the change to systemd, we do not block mounts from happening while the 'cloud-init-local' and 'cloud-init' units are running. Because of this, I think that while cloud-init is deciding whether to format, the old entry in /etc/fstab gets used, the filesystem gets mounted as ntfs, and then the attempt at mkfs fails (as seen in the log).

I'll poke a bit more.
Thanks for the cli instructions.

Scott

Martin Pitt (pitti) wrote :

Summary from IRC:

 - Add "x-systemd.requires=cloud-init.service" mount flag to fstab if [ -d /run/systemd/system ] (mountall chokes on unknown options, argh)
 - Make sure cloud-init calls "mount" on a newly written mount point, so that it is mounted when later services start
 - Mark them as "nofail", as they could go away on next boot and we don't want to fail boot because of that
 - Move cloud-init.service to early boot: Add DefaultDependencies=no and Before=basic.target
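
The first and third points of the summary can be sketched in Python as follows. This is an illustration only, not the code that landed in cloud-init: the function name ephemeral_mount_options and its parameters are hypothetical, and the only facts taken from this thread are the /run/systemd/system test, the "nofail" flag, and the option strings themselves.

```python
import os


def ephemeral_mount_options(base=("defaults", "nofail"),
                            unit="cloud-init.service",
                            systemd_dir="/run/systemd/system"):
    """Build the fstab options field for the ephemeral disk entry.

    The x-systemd.requires flag is only added when the system is booted
    with systemd, because mountall rejects options it does not know.
    """
    opts = list(base)
    if os.path.isdir(systemd_dir):
        opts.append("x-systemd.requires=" + unit)
    # Marker used in this thread to identify entries written by cloud-init.
    opts.append("comment=cloudconfig")
    return ",".join(opts)
```

The systemd_dir parameter exists only so the check can be exercised on a machine that is not running systemd.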

Scott Moser (smoser) on 2016-09-17
Changed in cloud-init:
status: New → Confirmed
Changed in cloud-init (Ubuntu):
status: New → Confirmed
Changed in cloud-init:
importance: Undecided → High
Changed in cloud-init (Ubuntu):
importance: Undecided → High
Scott Moser (smoser) on 2016-09-20
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-3-g80f5ec4-0ubuntu1

---------------
cloud-init (0.7.8-3-g80f5ec4-0ubuntu1) yakkety; urgency=medium

  * New upstream snapshot.
    - Adjust mounts and disk configuration for systemd. (LP: #1611074)
    - dmidecode: run dmidecode only on i?86 or x86_64 arch. [Robert Schweikert]

 -- Scott Moser <email address hidden> Tue, 20 Sep 2016 13:59:20 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Patricia Gaughen (gaughen) wrote :

Scott - wanted to confirm that this will be SRU'd back to Xenial. Also, this is seen on Trusty, can it be backported to Trusty?

Paul Meyer (paul-meyer) wrote :
Scott Moser (smoser) on 2016-10-03
Changed in cloud-init:
status: Confirmed → Fix Committed
description: updated
Scott Moser (smoser) wrote :

This can / will go back to xenial the next time we sync cloud-init back to xenial.
bug 1629868 seems like it is related, so I would hold off on an SRU to xenial until that is fixed.

Scott Moser (smoser) wrote :

Hi,

Now that bug 1629868 is understood (duped to bug 1629797), we can reasonably safely move this back to xenial. That issue is also not relevant for Ubuntu on xenial, because systemd-resolved is not used there.
There is a release of cloud-init currently in -proposed (0.7.8-1-g3705bb5-0ubuntu1~16.04.3) [1]. Once that clears we can look at moving this back also.

So the best-case scenario is that this enters -proposed in 10 days or so, and is then released a week after that.

--
[1] https://launchpad.net/ubuntu/+source/cloud-init

Patricia Gaughen (gaughen) wrote :

Hey Scott - I see that the version you were waiting for to clear has landed. Do you have an ETA on when this change will hit -proposed? Thanks!

Scott Moser (smoser) on 2016-11-07
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → Medium
Scott Moser (smoser) on 2016-11-09
description: updated
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Paul, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-47-gb6561a1-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed
Paul Meyer (paul-meyer) wrote :

I tested 0.7.8-47-gb6561a1-0ubuntu1~16.04.1 and it did not fix this bug yet:

Nov 16 20:09:18 testvm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Device /dev/sdb1 has Temporary Storage ntfs
Nov 16 20:09:18 testvm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Device /dev/sdb1 is cleared for formating
Nov 16 20:09:18 testvm [CLOUDINIT] cc_disk_setup.py[DEBUG]: File system None will be created on /dev/sdb1
Nov 16 20:09:18 testvm [CLOUDINIT] util.py[DEBUG]: Running command ['/bin/lsblk', '--pairs', '--output', 'NAME,TYPE,FSTYPE,LABEL', '/dev/sdb1', '--nodeps'] with allowed return codes [0] (shell=False, capture=True)
Nov 16 20:09:18 testvm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Creating file system None on /dev/sdb1
Nov 16 20:09:18 testvm [CLOUDINIT] cc_disk_setup.py[DEBUG]: Using cmd: /sbin/mkfs.ext4 /dev/sdb1
Nov 16 20:09:18 testvm [CLOUDINIT] util.py[DEBUG]: Running command ['/sbin/mkfs.ext4', '/dev/sdb1'] with allowed return codes [0] (shell=False, capture=True)
Nov 16 20:09:18 testvm [CLOUDINIT] util.py[DEBUG]: Creating fs for /dev/disk/cloud/azure_resource took 0.047 seconds
Nov 16 20:09:18 testvm [CLOUDINIT] util.py[WARNING]: Failed during filesystem operation#012Failed to exec of '['/sbin/mkfs.ext4', '/dev/sdb1']':#012Unexpected error while running command.#012Command: ['/sbin/mkfs.ext4', '/dev/sdb1']#012Exit code: 1#012Reason: -#012Stdout: ''#012Stderr: 'mke2fs 1.42.13 (17-May-2015)\n/dev/sdb1 is mounted; will not make a filesystem here!\n'

Let me know what other logs/info you want to see...

tags: added: verification-failed
removed: verification-needed
Scott Moser (smoser) wrote :

Paul, can you attach your full cloud-init.log from that boot above?

Paul Meyer (paul-meyer) wrote :
Scott Moser (smoser) wrote :

At the moment, I'm hoping the issue really stems from this being an upgrade, and that a new instance that already had the newer version would be OK. What I believe is happening is:
 a.) old cloud-init on first boot writes /etc/fstab for the resource disk with something like:
     /dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,comment=cloudconfig 0 2
 b.) apt-get install of new cloud-init
 c.) resize
 d.) new system boots with the /etc/fstab line shown above. The fstab entry does not stop systemd from mounting the device, and the mount of the ntfs partition happens. cloud-init goes to format it, and mkfs.ext4 complains.

The new cloud-init writes an fstab entry with the options field as:
 defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig

I see some other paths that I'd like to clean up, including grabbing Daniel's merge proposal at [1], but I'm currently hopeful that the issue above is what we're seeing.

We can test that theory by manually editing the /etc/fstab entry after first boot to include the options field above (specifically x-systemd.requires=cloud-init.service).

--
[1] https://code.launchpad.net/~daniel-thewatkins/cloud-init/+git/cloud-init/+merge/310411

Matt Bearup (mbearup) wrote :

Attaching logs from my repro as well. I did (patch -> reboot -> resize). The included fstab is after resize, I'll check the state of fstab at intermediary steps as well.

Paul Meyer (paul-meyer) wrote :

So I tried another time, this time paying attention to fstab in between steps. I created a machine and updated cloud-init to 0.7.8-47-gb6561a1-0u.

$ mount|grep sdb ; grep mnt /etc/fstab
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,comment=cloudconfig 0 2

Now I thought it might be worth trying a reboot, which apparently rewrites fstab:

$ mount|grep sdb ; grep mnt /etc/fstab
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig 0 2

So now I'm ready to resize, which I do, but unfortunately:

$ mount|grep sdb ; grep mnt /etc/fstab
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig 0 2

And:
$ systemctl status /mnt
● mnt.mount - /mnt
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: active (mounted) since Wed 2016-11-16 22:21:17 UTC; 10min ago
    Where: /mnt
     What: /dev/sdb1
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 1066 ExecMount=/bin/mount /dev/disk/cloud/azure_resource-part1 /mnt -o defaults,x-systemd.requires=cloud-init.service,comment=cloudconfig (c

$ systemctl status cloud-init.service
● cloud-init.service - Initial cloud-init job (metadata service crawler)
   Loaded: loaded (/lib/systemd/system/cloud-init.service; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2016-11-16 22:21:15 UTC; 11min ago
  Process: 1024 ExecStart=/usr/bin/cloud-init init (code=exited, status=0/SUCCESS)

From cloud-init.log:
Nov 16 22:21:18 testvm3 [CLOUDINIT] util.py[WARNING]: Failed during filesystem operation#012Failed to exec of '['/sbin/mkfs.ext4', '/dev/sdb1']':#012Unexpected error while running command.#012Command: ['/sbin/mkfs.ext4', '/dev/sdb1']#012Exit code: 1#012Reason: -#012Stdout: ''#012Stderr: 'mke2fs 1.42.13 (17-May-2015)\n/dev/sdb1 is mounted; will not make a filesystem here!\n'

So it looks like it's a closer race now than before, but it is still a race.
Should it be x-systemd.after? From the systemd.unit manpage:

If a unit foo.service requires a unit bar.service as configured with Requires= and no ordering is configured with After= or Before=, then both units will be started simultaneously and without any delay between them if foo.service is activated.

Paul Meyer (paul-meyer) wrote :

That can't be it... systemd.mount man page says:
 x-systemd.requires=
           Configures a Requires= and an After= dependency between the created mount unit and another systemd unit, such as a device or mount unit.

Matt Bearup (mbearup) wrote :

Agree with Paul; in my testing x-systemd.after makes no difference. Removing ntfs-3g and blocking the ntfs kernel module are the only things that are working for me.

Paul Meyer (paul-meyer) wrote :

Turns out it's cloud-config.service (not cloud-init.service) that does the mkfs:

paulmey@testvm3:~$ journalctl -b -ojson|jq 'select(.MESSAGE|contains("mkfs"))|._SYSTEMD_UNIT'
"cloud-config.service"
"cloud-config.service"
"cloud-config.service"
"cloud-config.service"
"cloud-config.service"

I changed that in /usr/lib/python3/dist-packages/cloudinit/config/cc_mounts.py and rebooted twice, once to write the new fstab, then the second reboot actually reformats:

paulmey@testvm3:~$ mount|grep sdb ; grep mnt /etc/fstab
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-config.service,comment=cloudconfig 0 2

I think the only thing that needs to change is to require cloud-config.service instead of cloud-init.service ?
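
Where jq is not installed, the journal query earlier in this comment can be approximated in Python. This is a sketch under stated assumptions: units_logging is a hypothetical helper name, and the input is assumed to be the one-JSON-object-per-line output of `journalctl -b -o json`, in which MESSAGE is not always a string (binary messages are encoded differently), so non-string values are skipped.

```python
import json


def units_logging(journal_json_lines, needle="mkfs"):
    """Return the _SYSTEMD_UNIT of every journal entry mentioning needle.

    journal_json_lines is any iterable of text lines, each holding one
    JSON object, such as sys.stdin fed by 'journalctl -b -o json'.
    """
    units = []
    for line in journal_json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        # Skip entries whose MESSAGE is missing or not plain text.
        if isinstance(message, str) and needle in message:
            units.append(entry.get("_SYSTEMD_UNIT"))
    return units
```

Passing sys.stdin as the argument while piping journalctl into the script gives the same list of unit names as the jq filter.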

Scott Moser (smoser) wrote :

Paul,
A long-winded comment; please stick with me. Please try to answer this
first:
Question 1.) Is there a definitive/declarative way to determine
   that an instance has been resized? I'd hope for something kind of
   like an instance id, like a "size-id". Basically, we need a way to
   determine if this event has occurred so that we can act on it.

You are right that it is cloud-config.service that is running this.
Steve Langasek helped me come to that realization also. I had originally
hoped that that too would be solved by this new cloud-init being present
in the "first" boot of an instance, but unfortunately that doesn't seem
right.

The following things were changed with commit 3705bb59 [1] that were
involved in the fix.
a.) we added x-systemd.requires=cloud-init.service to the mount options
    in /etc/fstab.
b.) we moved disk_setup and mounts from
      cloud_config_modules and running in cloud-config.service
    to
      cloud_init_modules and running in cloud-init.service
c.) An azure specific bit of behavior adjusts disk_setup and mounts
    to run every boot (per-always) rather than the default behavior
    of per-instance. This is done specifically to catch this resize.

    It is done dynamically, and prior to cloud-init doing better caching
    to save work, it ended up getting run every instance.

    The result is that after upgrade and then resize, the disk_setup
    and mounts config modules still get run at cloud-config.service
    and thus lose the race with the systemd mounting of the device.

    The code here really takes a bunch of semantics to try to determine
    if a resize has occurred. Thus the question above.

Unfortunately, 'c' happens based on the presence of the ephemeral
disk at the time when the datasource first runs. That is racy with
the disks coming online. We need to find a better way to determine
when disks can be erased (and thus the disk_setup and mounts modules
can re-run). Note, it always was racy with the presence of the disks,
but because we're running earlier now we hit the race more.

I can't think of a solution that doesn't basically require waiting for
the disk to appear.

Paul Meyer (paul-meyer) wrote :

Thanks Scott for the thorough explanation.

To answer your question: I don't know that there is any such property. I'm asking around to see if there is, but let's assume 'no' for now. Aside from resize, we run into the same situation after a VM needs to be moved to another node. In that case, the ephemeral drive is also recreated (and reformatted ntfs). So any indication we decide to rely on should include this situation. The common factor is that the disk is completely new.

I think you're right that we might need to wait for the disk to appear. I assume that b) was done for other reasons than this bug, or perhaps to not delay mounting too long?
An alternative could be to actively unmount the ntfs partition before mkfs. Of course that comes with its own host of race conditions, but the success rate may be higher.

Scott Moser (smoser) on 2016-11-17
Changed in cloud-init (Ubuntu):
status: Fix Released → Confirmed
Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Confirmed
Scott Moser (smoser) wrote :

I have put up a merge proposal at
 https://code.launchpad.net/~smoser/cloud-init/+git/cloud-init/+merge/311205

The plan there seems sane: we will now wait in the Azure datasource for the Azure resource disk to appear. Paul told me that all Azure instances will have an ephemeral disk, so waiting for it to appear is fine.
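
Waiting for the disk to appear amounts to polling for the device path with a deadline. The sketch below shows that pattern in Python. It is not the code from the merge proposal: wait_for_path, its timeout and interval defaults, and the injectable exists/sleep/clock hooks are all assumptions made for illustration.

```python
import os
import time


def wait_for_path(path, timeout=120.0, interval=0.25,
                  exists=os.path.exists, sleep=time.sleep,
                  clock=time.monotonic):
    """Poll until path exists; return True if it appeared, else False.

    The path is always checked at least once, even with timeout=0.
    """
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass something like /dev/disk/cloud/azure_resource and only go on to the format decision when the function returns True. Using a monotonic clock keeps the deadline stable if the wall clock is adjusted during early boot.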

I have seen a race condition where 'mount -a' fails; we're debugging that.

I like the overall logic much better, and we probably at least partially fix bug 1642383 by waiting.

Scott Moser (smoser) wrote :

I've uploaded again to my ppa, 0.7.8-53-g902745d-1~bddeb should build there sometime soon.
I've been testing that with my quick hack of formatting /dev/sdb1 to ntfs and rebooting,
(go-format.sh: http://paste.ubuntu.com/23492436/) and so far it is looking reasonable.

I expect to test tomorrow some more, and welcome any testing anyone else wants to give it.

I'll play with actual resizing tomorrow.

Scott Moser (smoser) wrote :

I've just uploaded this to zesty and to xenial-proposed.
The changes can be seen at
  https://git.launchpad.net/cloud-init/commit/?id=9e904bbc3336b96475bfd00fb3bf1262ae4de49f

There was also a change made to packaging so that on upgrade we will update /etc/fstab
on the ephemeral disk to include
  x-systemd.requires=cloud-init.service,comment=cloudconfig

This fixes the upgrade, then resize case.
That change can be seen at
  https://git.launchpad.net/cloud-init/commit/?h=ubuntu/devel&id=fac7c5c3b5758c03b3df40ce25849c73de1a8140
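
The effect of that upgrade step on /etc/fstab can be illustrated with a short Python sketch. This is not the postinst script itself (see the commit linked above for that): add_requires_option is a hypothetical name, and the sketch assumes that every entry carrying comment=cloudconfig should gain the option, which may be broader than what the packaging change actually does.

```python
def add_requires_option(fstab_text, unit="cloud-init.service",
                        marker="comment=cloudconfig"):
    """Return fstab_text with x-systemd.requires added where missing.

    Only non-comment lines whose options field contains marker are
    touched; lines that already have the option are left unchanged.
    """
    option = "x-systemd.requires=" + unit
    out = []
    for line in fstab_text.splitlines(keepends=True):
        fields = line.split()
        if len(fields) >= 4 and not line.lstrip().startswith("#"):
            opts = fields[3].split(",")
            has_option = any(o.startswith("x-systemd.requires=") for o in opts)
            if marker in opts and not has_option:
                # Insert just before the marker, matching the entries
                # shown in this bug report.
                opts.insert(opts.index(marker), option)
                line = line.replace(fields[3], ",".join(opts), 1)
        out.append(line)
    return "".join(out)
```

Applying the function twice gives the same result as applying it once, which matters for a step that runs on every package upgrade.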

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-49-g9e904bb-0ubuntu1

---------------
cloud-init (0.7.8-49-g9e904bb-0ubuntu1) zesty; urgency=medium

  * debian/cloud-init.postinst: update /etc/fstab on Azure to fix
    future resize operations. (LP: #1611074)
  * New upstream snapshot.
    - Add activate_datasource, for datasource specific code paths.
      Use that on Azure to handle re-formatting of ephemeral disk.
      (LP: #1611074)

 -- Scott Moser <email address hidden> Fri, 18 Nov 2016 16:37:34 -0500

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Steve Langasek (vorlon) wrote :

Hello Paul, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-49-g9e904bb-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: removed: verification-failed
tags: added: verification-needed
Matt Bearup (mbearup) wrote :

I tested this latest fix and it looks good to me. The post-install fix seems to work and after multiple resizes I still see /mnt coming back as ext4

-> Pre-install
$ dpkg -l | grep 'cloud-init '
ii cloud-init 0.7.8-1-g3705bb5-0ubuntu1~16.04.3 all Init scripts for cloud instances
$ grep mnt /etc/fstab
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,comment=cloudconfig 0 2

apt-get install cloud-init
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  cloud-init
1 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Need to get 288 kB of archives.
After this operation, 80.9 kB of additional disk space will be used.
Get:1 http://azure.archive.ubuntu.com/ubuntu xenial-proposed/main amd64 cloud-init all 0.7.8-49-g9e904bb-0ubuntu1~16.04.1 [288 kB]
Fetched 288 kB in 0s (651 kB/s)
Preconfiguring packages ...
(Reading database ... 61252 files and directories currently installed.)
Preparing to unpack .../cloud-init_0.7.8-49-g9e904bb-0ubuntu1~16.04.1_all.deb ...
Unpacking cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.04.1) over (0.7.8-1-g3705bb5-0ubuntu1~16.04.3) ...
Processing triggers for ureadahead (0.100.0-19) ...
Setting up cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.04.1) ...
Installing new version of config file /etc/cloud/cloud.cfg ...
Leaving 'diversion of /etc/init/ureadahead.conf to /etc/init/ureadahead.conf.disabled by cloud-init'
cloud-init postinst fixed /etc/fstab for x-systemd.requires

-> Post-install
$ dpkg -l | grep 'cloud-init '
ii cloud-init 0.7.8-49-g9e904bb-0ubuntu1~16.04.1 all Init scripts for cloud instances
$ grep mnt /etc/fstab
/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,comment=cloudconfig 0 2

-> After Resize
/dev/sdb1 ext4 281G 63M 267G 1% /mnt

-> After another resize
/dev/sdb1 ext4 596G 70M 566G 1% /mnt

-> After yet another resize
/dev/sdb1 ext4 69G 52M 66G 1% /mnt

Scott Moser (smoser) wrote :

I've marked verification done based on Matt's comment above.

tags: added: verification-done
removed: verification-needed
Paul Meyer (paul-meyer) wrote :

Thanks for the fix, Scott! Thanks for testing and confirming, Matt!

Matt Bearup (mbearup) wrote :

Thanks Scott, Steve, and Paul for driving a fix that will help everyone!

Scott Moser (smoser) on 2016-11-22
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-49-g9e904bb-0ubuntu1~16.04.1

---------------
cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/cloud-init.postinst: update /etc/fstab on Azure to fix
    future resize operations. (LP: #1611074)
  * New upstream snapshot.
    - Add activate_datasource, for datasource specific code paths.
      (LP: #1611074)
    - systemd: cloud-init-local use RequiresMountsFor=/var/lib/cloud
      (LP: #1642062)

cloud-init (0.7.8-47-gb6561a1-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/cloud-init.templates: enable DigitalOcean by default [Ben Howard]
  * New upstream snapshot.
    - systemd/cloud-init-local.service:
      + replace 'Wants' and 'After' on local-fs.target with more granular
        After=systemd-remount-fs.service and RequiresMountsFor=/var/lib
        and Before=sysinit.target.
        This is done run sufficiently early enough to update /etc/fstab.
        (LP: #1611074)
      + add Before=NetworkManager.service so that cloud-init can render
        NetworkManager network config before it would apply them.
    - systemd/cloud-init.service:
      + add Before=sysinit.target and DefaultDependencies=no (LP: #1611074)
      + drop Requires=networking.service to work where networking.service is
        not needed.
      + add Conflicts=shutdown.target
      + drop unnecessary Wants=local-fs.target
    - net: support reading ipv6 dhcp config from initramfs [LaMont Jones]
      (LP: #1621615)
    - dmidecode: Allow dmidecode to be used on aarch64, and only attempt
      usage on x86, x86_64, and aarch64. [Robert Schweikert]
    - disk-config: udev settle after partitioning in gpt format.
      (LP: #1626243)
    - Add support for snap create-user on Ubuntu Core images. [Ryan Harper]
      (LP: #1619393)
    - Fix sshd restarts for rhel distros. [Jim Gorz]
    - Move user/group functions to new ug_util file [Joshua Harlow]
    - update Gentoo initscripts to run in the correct order [Matthew Thode]
    - MAAS: improve the debugging tool in datasource to consider
      config provided on kernel cmdline.
    - lxd: Update network config for LXD 2.3 [Stéphane Graber] (LP: #1640556)
    - Decode unicode types in decode_binary [Robert Schweikert]
    - Allow ephemeral drive to be unpartitioned [Paul Meyer]
    - subp: add 'update_env' argument which allows for more easily adding
      environment variables to a subprocess call.
    - Adjust mounts and disk configuration for systemd. (LP: #1611074)
    - DataSources:
      + Ec2: protect against non-dictionary in block-device-mapping.
      + AliYun: Add new datasource for Ali-Cloud ECS, that is
        available but not enabled by default [kaihuan.pkh]
      + DigitalOcean: use meta-data for network configuration and
        enable data source by default. [Ben Howard]
      + OpenNebula: replace parsing of 'ip' command with similar function
        available in cloudinit.net. This fixed unit tests when running
        in environment with no networking.
    - doc changes:
      + Add documentation on stages of boot.
      + make the RST files consistently formated and other improvements.
     ...

Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris Halse Rogers (raof) wrote : Please test proposed package

Hello Paul, or anyone else affected,

Accepted cloud-init into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-49-g9e904bb-0ubuntu1~16.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Yakkety):
status: New → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Mathew Hodson (mhodson) on 2016-12-08
Changed in cloud-init (Ubuntu Yakkety):
importance: Undecided → Medium
Scott Moser (smoser) wrote :

I've verified this on yakkety as shown in description, using:

smoser@smoser1219y:~$ dpkg-query --show cloud-init
cloud-init 0.7.8-49-g9e904bb-0ubuntu1~16.10.1
smoser@smoser1219y:~$ cat /etc/cloud/build.info
build_name: server
serial: 20161214

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-49-g9e904bb-0ubuntu1~16.10.1

---------------
cloud-init (0.7.8-49-g9e904bb-0ubuntu1~16.10.1) yakkety; urgency=medium

  * debian/cloud-init.templates: enable DigitalOcean by default [Ben Howard]
  * debian/cloud-init.postinst: update /etc/fstab on Azure to fix
    future resize operations. (LP: #1611074)
  * New upstream snapshot.
    - systemd/cloud-init-local.service:
      + replace 'Wants' and 'After' on local-fs.target with more granular
        After=systemd-remount-fs.service and RequiresMountsFor=/var/lib
        and Before=sysinit.target.
        This is done run sufficiently early enough to update /etc/fstab.
        (LP: #1611074)
    - systemd/cloud-init.service:
      + add Before=sysinit.target and DefaultDependencies=no (LP: #1611074)
      + drop Requires=networking.service to work where networking.service is
        not needed.
      + add Conflicts=shutdown.target
      + drop unnecessary Wants=local-fs.target
    - net: support reading ipv6 dhcp config from initramfs [LaMont Jones]
      (LP: #1621615)
    - dmidecode: Allow dmidecode to be used on aarch64, and only attempt
      usage on x86, x86_64, and aarch64. [Robert Schweikert]
    - disk-config: udev settle after partitioning in gpt format.
      (LP: #1626243)
    - Add support for snap create-user on Ubuntu Core images. [Ryan Harper]
      (LP: #1619393)
    - Fix sshd restarts for rhel distros. [Jim Gorz]
    - Move user/group functions to new ug_util file [Joshua Harlow]
    - update Gentoo initscripts to run in the correct order [Matthew Thode]
    - MAAS: improve the debugging tool in datasource to consider
      config provided on kernel cmdline.
    - DataSources:
      + Ec2: protect against non-dictionary in block-device-mapping.
      + AliYun: Add new datasource for Ali-Cloud ECS, that is
        available but not enabled by default [kaihuan.pkh]
      + OpenNebula: replace parsing of 'ip' command with similar function
        available in cloudinit.net. This fixed unit tests when running
        in environment with no networking.
    - doc changes:
      + Add documentation on stages of boot.
      + make the RST files consistently formated and other improvements.
      + fixed example to not overwrite /etc/hosts [Chris Glass]
      + fix spelling / typos in ca_certs and scripts_vendor.
      + improve HACKING.rst file
      + Add documentation for logging features. [Wesley Wiedenmeier]
    - code style and unit test changes:
      + pep8: fix style errors reported by pycodestyle 2.1.0
      + pyflakes: fix issue with pyflakes 1.3 found in ubuntu zesty-proposed.
      + Add coverage dependency to bddeb to fix package build.
      + Add coverage collection to tox unit tests. [Joshua Powers]
      + do not read system /etc/cloud/cloud.cfg.d (LP: #1635350)
      + tests: silence the Cheetah UserWarning about NameMapper C version.
      + Fix python2.6 things found running in centos 6.

 -- Scott Moser <email address hidden> Tue, 22 Nov 2016 17:04:36 -0500
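
The unit ordering changes named in the changelog can be inspected on an installed system. Below is a minimal sketch; the helper name unit_directive is hypothetical, and what it prints depends on the unit files actually shipped by the installed package.

```shell
# Hypothetical helper: read unit file text on stdin and print the values of
# directive $1 (for example Before, After or Conflicts), one per line.
unit_directive() {
    sed -n "s/^$1=//p"
}

# Usage on a system with the package installed:
#   systemctl cat cloud-init.service | unit_directive Before
#   systemctl cat cloud-init-local.service | unit_directive RequiresMountsFor
```

If the installed units match the changelog, the first command should list sysinit.target among its output.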

Changed in cloud-init (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Scott Moser (smoser) wrote :

This is fixed in cloud-init 0.7.9.

Scott Moser (smoser) wrote :

This is fixed in 0.7.9.

Changed in cloud-init:
status: Fix Committed → Fix Released