fstab entries written by cloud-config may not be mounted

Bug #1691489 reported by Scott Moser on 2017-05-17
This bug affects 1 person
Affects               Importance   Assigned to
cloud-init            Medium       Unassigned
cloud-init (Ubuntu)   Medium       Unassigned
  Xenial              Medium       Unassigned
  Yakkety             Medium       Unassigned
  Zesty               Medium       Unassigned
  Artful              Medium       Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
There is a race condition on a re-deployment of cloud-init on Azure
where /mnt will not get properly formatted or mounted. This is due to
"dirty" entries in /etc/fstab that cause a device to be busy when
cloud-init goes to format it. This shows itself usually as 'mkfs'
complaining that the device is busy. The cause is that systemd
starts an fsck and collides with cloud-init re-formatting the disk.

The problem can be seen in other places, but it seemed to be most
reproducible on Azure, where it was originally found.

[Test Case]
1.) Launch an Azure vm, ideally size L32S.
2.) Log in and verify the system properly mounted /mnt.
3.) Re-deploy the vm through the web ui and try again.

[Regression Potential]
Worst case scenario, these changes unnecessarily slow down boot and
do not fix the problem.

[Regression]
This SRU change caused bug 1717477.

[Other Info]
Upstream commit at
  https://git.launchpad.net/cloud-init/commit/?id=1f5489c258

=== End SRU Template ===

As reported in bug 1686514, sometimes /mnt will not get mounted when re-deploying or stopping-then-starting an Azure vm of size L32S. This is probably a more generic issue; I suspect it shows up here due to the speed of the disks on these systems.

Related bugs:
 * bug 1686514: Azure: cloud-init does not handle reformatting GPT partition ephemeral disks
 * bug 1717477: cloud-init generates ordering cycle via After=cloud-init in systemd-fsck


Scott Moser (smoser) wrote :

These tarballs are collected with 'save-old-data' at
 https://git.launchpad.net/~smoser/cloud-init/+git/sru-info/tree/bin

They represent:
 orig-boot.tar.xz: the first boot of a 16.04 pristine image (0.7.9-90-g61eb03fe-0ubuntu1~16.04.1)
 upgrade-first-reboot.tar.xz: I did a dpkg -i of cloud-init_0.7.9-139-gb5722bd1-1~bddeb_all.deb (current branch with fix for bug 1686514)
 after-restart.tar.xz: After a 'stop' and then 'start' in the web console. This showed the bug.
 after-restart-with-fsck.tar.xz: dpkg -i of another branch cloud-init_0.7.9-140-g2e21a411-1~bddeb_all.deb and stop and start.

Changed in cloud-init:
status: New → Confirmed
importance: Undecided → Medium
Scott Moser (smoser) wrote :

It seems that in addition to blocking fsck, we should also block swap usage.
The severity of this issue is somewhat limited as the scenario will only happen when
a.) there is a filesystem (or swap) on a disk
b.) there is a (likely stale) entry in /etc/fstab for that disk already

This means that we're kind of limited to either
1. azure instances and resize/redeploy
2. first boot of an instance snapshotted with stuff in /etc/fstab
3. developer testing (re-partition/setup and rm -Rf /var/lib/cloud && reboot)
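
The two conditions above can be checked mechanically. The following is only a rough sketch, not cloud-init's actual code: the names `FstabEntry`, `parse_fstab` and `entries_at_risk` are hypothetical, and the sample fstab text is made up for illustration. It flags entries that point at a device about to be reformatted and that would cause systemd to fsck it or enable swap on it.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    spec: str        # device path, UUID=..., or LABEL=...
    mountpoint: str
    fstype: str
    options: str
    passno: int      # 6th column; non-zero requests an fsck


def parse_fstab(text: str) -> List[FstabEntry]:
    """Parse fstab-formatted text, skipping blank lines and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line; ignored in this sketch
        passno = int(fields[5]) if len(fields) > 5 else 0
        entries.append(
            FstabEntry(fields[0], fields[1], fields[2], fields[3], passno)
        )
    return entries


def entries_at_risk(entries: List[FstabEntry], target: str) -> List[FstabEntry]:
    """Entries for `target` that would trigger fsck or swap activation."""
    return [
        e for e in entries
        if e.spec == target and (e.passno != 0 or e.fstype == "swap")
    ]


# Hypothetical sample, not taken from a real system.
SAMPLE_FSTAB = """\
# /etc/fstab
UUID=aaaa / ext4 defaults 0 1
/dev/sdb1 /mnt auto defaults,nofail 0 2
/dev/sdb2 none swap sw 0 0
"""
```

For example, `entries_at_risk(parse_fstab(SAMPLE_FSTAB), "/dev/sdb1")` returns the single `/mnt` entry. The sketch only compares the spec string literally; matching `UUID=` or `LABEL=` specs to a device would need extra lookups.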

Scott Moser (smoser) wrote :

Dimitri,

Do you know how I can limit swap usage until after cloud-init.service is done?
I'm under the impression that I can do that with fsck by adding the drop-in to
 /systemd/systemd-fsck@.service.d/cloud-init.conf
as seen in the merge proposal.

I'm open to other ideas too.

Balint Reczey (rbalint) on 2017-05-25
Changed in cloud-init:
assignee: nobody → Balint Reczey (rbalint)
Balint Reczey (rbalint) wrote :

I tried finding other options, but to work around /etc/fstab containing a potentially invalid swap partition, the only option seems to be calling "swapoff -a" and then later "swapon -a" from cloud-init when it detects that a partition re-initialization needs to take place.

The same stands for systemd-fsckd.service. IMO it should be stopped while the reformatting takes place, instead of adding the drop-in, which would potentially slow down boot even when this workaround is not needed.

Scott Moser (smoser) wrote :

Balint,

Thanks for the reply.

With regard to slowing down boot, I'm not too concerned about that. Because in almost all properly functioning scenarios, cloud-init's generator will enable or disable cloud-init. So the slow down would be limited to scenarios where cloud-init was supposed to run, primarily on non-first boots of an instance. I agree though, it does put a bottleneck in boot.

With regard to 'swapoff -a' or 'swapon -a' or the systemd-fsck.service equivalent, I'm not opposed to that, but I don't know how it could be made to be non-racy. Do you have a solution in mind that doesn't have a race in it?

Ie, for swap:
  - cloud-init check if there is swap in use.
  - cloud-init run 'swapoff -a'
  - cloud-init do some things
  - cloud-init run 'swapon -a'

while systemd in parallel
  - enable swap for .mount entries that were generated from stale fstab

This can be mitigated some by being more granular (swapoff /dev/XXX), but still racy unless cloud-init can coordinate that with systemd. Is that possible?
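
To make the "more granular" variant concrete, here is a minimal sketch. It assumes the usual /proc/swaps layout (one header line, then one row per active swap area); the function names and the sample text are hypothetical. It only builds the commands and does not run them, and as noted above it narrows the race window without closing it, since systemd can still enable swap in parallel.

```python
from typing import Iterable, List


def active_swap_devices(proc_swaps_text: str) -> List[str]:
    """Return the names listed in /proc/swaps-formatted text."""
    lines = proc_swaps_text.splitlines()
    # The first line is the column header (Filename, Type, Size, ...).
    return [line.split()[0] for line in lines[1:] if line.strip()]


def granular_swapoff_commands(
    active: Iterable[str], targets: Iterable[str]
) -> List[List[str]]:
    """Build one 'swapoff DEV' command per active swap area in targets."""
    wanted = set(targets)
    return [["swapoff", dev] for dev in active if dev in wanted]


# Hypothetical sample of /proc/swaps content.
SAMPLE_PROC_SWAPS = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/sdb2\t\t\t\tpartition\t4194300\t0\t-2\n"
    "/swapfile\t\t\t\tfile\t\t1048572\t0\t-3\n"
)
```

A caller would pass the result to something like `subprocess.check_call` for each command, then re-enable the same devices with `swapon` once the disk work is done.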

Thanks again for the input.
Scott

Balint Reczey (rbalint) wrote :

I filed a merge request to limit the fsck delay to Azure, please take a look at it.

Regarding the swap, I think the least hack-ish safe solution would be to rely on systemd-fstab-generator to create the .swap units as usual. Instead of running swapoff/swapon, cloud-init could find all .swap units and stop them while it does its work.

That would avoid the race because the generator runs early, before the units, and stopping .swap units is done by systemd.
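
A rough sketch of that idea follows, under stated assumptions: the input is the output of `systemctl list-units --type=swap --no-legend --plain`, where the unit name is the first column, and the helper names `parse_swap_units` and `swap_units_stopped` are hypothetical. The `run` parameter exists only so the sketch can be exercised without touching a real system. This is not what cloud-init shipped.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, List, Sequence


def parse_swap_units(list_units_output: str) -> List[str]:
    """Extract .swap unit names from 'systemctl list-units' output."""
    units = []
    for line in list_units_output.splitlines():
        fields = line.split()
        if fields and fields[0].endswith(".swap"):
            units.append(fields[0])
    return units


@contextlib.contextmanager
def swap_units_stopped(
    units: Sequence[str],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> Iterator[List[str]]:
    """Stop the given .swap units, then start them again on exit."""
    stopped: List[str] = []
    try:
        for unit in units:
            run(["systemctl", "stop", unit])
            stopped.append(unit)
        yield stopped
    finally:
        # Restart only what was actually stopped, in reverse order.
        for unit in reversed(stopped):
            run(["systemctl", "start", unit])
```

Usage would look like `with swap_units_stopped(parse_swap_units(output)): ...` around the partitioning and formatting steps, so that systemd itself serializes the swap state changes.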

Scott Moser (smoser) on 2017-06-26
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
Changed in cloud-init (Ubuntu Yakkety):
status: New → Confirmed
Changed in cloud-init (Ubuntu Zesty):
status: New → Confirmed
Changed in cloud-init (Ubuntu Xenial):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Zesty):
importance: Undecided → Medium
Balint Reczey (rbalint) on 2017-07-12
Changed in cloud-init:
assignee: Balint Reczey (rbalint) → nobody
assignee: nobody → Balint Reczey (rbalint)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-231-g80bf98b9-0ubuntu1

---------------
cloud-init (0.7.9-231-g80bf98b9-0ubuntu1) artful; urgency=medium

  * New upstream snapshot.
    - tests: remove 'yakkety' from releases as it is EOL.
    - systemd: make systemd-fsck run after cloud-init.service (LP: #1691489)
    - tests: Add initial tests for EC2 and improve a docstring.
    - locale: Do not re-run locale-gen if provided locale is system default.
    - archlinux: fix set hostname usage of write_file.
      [Joshua Powers] (LP: #1705306)
    - sysconfig: support subnet type of 'manual'.
    - Drop rand_str() usage in DNS redirection detection
      [Bob Aman] (LP: #1088611)

 -- Scott Moser <email address hidden> Mon, 31 Jul 2017 09:47:34 -0400

Changed in cloud-init (Ubuntu Artful):
status: Confirmed → Fix Released
Changed in cloud-init:
status: Confirmed → Fix Released
Scott Moser (smoser) on 2017-08-04
description: updated
Scott Moser (smoser) on 2017-08-04
Changed in cloud-init:
status: Fix Released → Fix Committed

Hello Scott, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in cloud-init (Ubuntu Zesty):
status: Confirmed → Fix Committed
tags: added: verification-needed-zesty
Chris J Arges (arges) wrote :

Hello Scott, or anyone else affected,

Accepted cloud-init into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-233-ge586fe35-0ubuntu1~17.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Chad Smith (chad.smith) wrote :

Validated across multiple (5) 'clean' reboots that Azure vms don't hit the race condition with mounts and don't result in cloud-init errors.

ubuntu@xen1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~16.04.1
ubuntu@xen1:~$ grep -i error /var/log/cloud-init.log /run/cloud-init/*
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@xen1:~$ cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceAzure [seed=/var/lib/waagent]",
  "errors": []
 }
}
ubuntu@xen1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 21:58:12,526 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted
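
The manual checks in this transcript could be scripted along these lines. This is only a sketch: it assumes the result.json layout and the `reformattable=` log format exactly as shown above, and the function names are hypothetical.

```python
import json
import re
from typing import List

# Matches the DataSourceAzure debug message shown in the log excerpt.
REFORMAT_RE = re.compile(r"reformattable=(True|False)")


def result_errors(result_json_text: str) -> List[str]:
    """Return the 'errors' list from /run/cloud-init/result.json text."""
    return json.loads(result_json_text)["v1"]["errors"]


def reformattable_flags(log_text: str) -> List[bool]:
    """Return each reformattable= decision found in cloud-init.log text."""
    return [m.group(1) == "True" for m in REFORMAT_RE.finditer(log_text)]
```

A clean run in the sense used here would have `result_errors(...) == []`; the `reformattable_flags(...)` values show whether cloud-init decided the resource disk needed reformatting on that boot.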

tags: added: verification-done-xenial
removed: verification-needed-xenial
Chad Smith (chad.smith) wrote :

Zesty verification:
 # Saw initial failure before upgrade
ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-153-g16a7302f-0ubuntu1~17.04.2
ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:08:16,313 - DataSourceAzure.py[DEBUG]: reformattable=True: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was ntfs formatted and had no important files. Safe for reformatting.

# Saw 5 successes across reprovisions after upgrade

ubuntu@zesty1:~$ grep reformat /var/log/cloud-init.log
2017-09-12 22:19:39,881 - DataSourceAzure.py[DEBUG]: reformattable=False: partition 1 (/dev/sdb1) on device /dev/disk/cloud/azure_resource was not ntfs formatted
ubuntu@zesty1:~$ mount | grep mnt
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
ubuntu@zesty1:~$ dpkg-query --show cloud-init
cloud-init 0.7.9-233-ge586fe35-0ubuntu1~17.04.1
ubuntu@zesty1:~$ grep -i error /var/log/cloud-init* /run/cloud-init/*
grep: /run/cloud-init/dhclient.hooks: Is a directory
/run/cloud-init/result.json: "errors": []
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
/run/cloud-init/status.json: "errors": [],
ubuntu@zesty1:~$ cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceAzure [seed=/var/lib/waagent]",
  "errors": []
 }
}

Chad Smith (chad.smith) on 2017-09-12
tags: added: verification-done-zesty
removed: verification-needed verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~16.04.1

---------------
cloud-init (0.7.9-233-ge586fe35-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/cloud-init.templates: enable Scaleway cloud.
  * debian/cloud-init.templates: enable Aliyun cloud.
  * drop the following cherry picks, now incorporated in snapshot.
    + debian/patches/cpick-5fb49bac-azure-identify-platform...
    + debian/patches/cpick-003c6678-net-remove-systemd-link...
    + debian/patches/cpick-1cd4323b-azure-remove-accidental...
    + debian/patches/cpick-ebc9ecbc-Azure-Add-network-config...
    + debian/patches/cpick-11121fe4-systemd-make-cloud-final...
  * debian/patches/stable-release-no-jsonschema-dep.patch:
    add patch to remove optional dependency on jsonschema.
  * New upstream snapshot.
    - cloudinit.net: add initialize_network_device function and tests
      [Chad Smith]
    - makefile: fix ci-deps-ubuntu target [Chad Smith]
    - tests: adjust locale integration test to parse default locale.
    - tests: remove 'yakkety' from releases as it is EOL.
    - centos: do not package systemd-fsck drop-in.
    - systemd: make systemd-fsck run after cloud-init.service (LP: #1691489)
    - tests: Add initial tests for EC2 and improve a docstring.
    - locale: Do not re-run locale-gen if provided locale is system default.
    - archlinux: fix set hostname usage of write_file. [Joshua Powers]
    - sysconfig: support subnet type of 'manual'.
    - tools/run-centos: make running with no argument show help.
    - Drop rand_str() usage in DNS redirection detection
      [Bob Aman] (LP: #1088611)
    - sysconfig: use MACADDR on bonds/bridges to configure mac_address
      [Ryan Harper]
    - net: eni route rendering missed ipv6 default route config
      [Ryan Harper] (LP: #1701097)
    - sysconfig: enable mtu set per subnet, including ipv6 mtu
      [Ryan Harper]
    - sysconfig: handle manual type subnets [Ryan Harper]
    - sysconfig: fix ipv6 gateway routes [Ryan Harper]
    - sysconfig: fix rendering of bond, bridge and vlan types.
      [Ryan Harper]
    - Templatize systemd unit files for cross distro deltas. [Ryan Harper]
    - sysconfig: ipv6 and default gateway fixes. [Ryan Harper]
    - net: fix renaming of nics to support mac addresses written in upper
      case. (LP: #1705147)
    - tests: fixes for issues uncovered when moving to python 3.6.
    - sysconfig: include GATEWAY value if set in subnet
      [Ryan Harper]
    - Scaleway: add datasource with user and vendor data for Scaleway.
      [Julien Castets]
    - Support comments in content read by load_shell_content.
    - cloudinitlocal fail to run during boot [Hongjiang Zhang]
    - doc: fix disk setup example table_type options [Sandor Zeestraten]
    - tools: Fix exception handling. [Joonas Kylmälä]
    - tests: fix usage of mock in GCE test.
    - test_gce: Fix invalid mock of platform_reports_gce to return False
      [Chad Smith]
    - test: fix incorrect keyid for apt repository. [Joshua Powers]
    - tests: Update version of pylxd [Joshua Powers]
    - write_files: Remove log from helper function signatures.
      [Andrew Jorgensen]
    - doc: document...


Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.9-233-ge586fe35-0ubuntu1~17.04.1

---------------
cloud-init (0.7.9-233-ge586fe35-0ubuntu1~17.04.1) zesty; urgency=medium

  * debian/cloud-init.templates: enable Scaleway cloud.
  * debian/cloud-init.templates: enable Aliyun cloud.
  * drop the following cherry picks, now incorporated in snapshot.
    + debian/patches/cpick-5fb49bac-azure-identify-platform...
    + debian/patches/cpick-003c6678-net-remove-systemd-link...
    + debian/patches/cpick-1cd4323b-azure-remove-accidental...
    + debian/patches/cpick-ebc9ecbc-Azure-Add-network-config...
    + debian/patches/cpick-11121fe4-systemd-make-cloud-final...
  * debian/patches/stable-release-no-jsonschema-dep.patch:
    add patch to remove optional dependency on jsonschema.
  * New upstream snapshot.
    - cloudinit.net: add initialize_network_device function and tests
      [Chad Smith]
    - makefile: fix ci-deps-ubuntu target [Chad Smith]
    - tests: adjust locale integration test to parse default locale.
    - tests: remove 'yakkety' from releases as it is EOL.
    - centos: do not package systemd-fsck drop-in.
    - systemd: make systemd-fsck run after cloud-init.service (LP: #1691489)
    - tests: Add initial tests for EC2 and improve a docstring.
    - locale: Do not re-run locale-gen if provided locale is system default.
    - archlinux: fix set hostname usage of write_file. [Joshua Powers]
    - sysconfig: support subnet type of 'manual'.
    - tools/run-centos: make running with no argument show help.
    - Drop rand_str() usage in DNS redirection detection
      [Bob Aman] (LP: #1088611)
    - sysconfig: use MACADDR on bonds/bridges to configure mac_address
      [Ryan Harper]
    - net: eni route rendering missed ipv6 default route config
      [Ryan Harper] (LP: #1701097)
    - sysconfig: enable mtu set per subnet, including ipv6 mtu
      [Ryan Harper]
    - sysconfig: handle manual type subnets [Ryan Harper]
    - sysconfig: fix ipv6 gateway routes [Ryan Harper]
    - sysconfig: fix rendering of bond, bridge and vlan types.
      [Ryan Harper]
    - Templatize systemd unit files for cross distro deltas. [Ryan Harper]
    - sysconfig: ipv6 and default gateway fixes. [Ryan Harper]
    - net: fix renaming of nics to support mac addresses written in upper
      case. (LP: #1705147)
    - tests: fixes for issues uncovered when moving to python 3.6.
    - sysconfig: include GATEWAY value if set in subnet
      [Ryan Harper]
    - Scaleway: add datasource with user and vendor data for Scaleway.
      [Julien Castets]
    - Support comments in content read by load_shell_content.
    - cloudinitlocal fail to run during boot [Hongjiang Zhang]
    - doc: fix disk setup example table_type options [Sandor Zeestraten]
    - tools: Fix exception handling. [Joonas Kylmälä]
    - tests: fix usage of mock in GCE test.
    - test_gce: Fix invalid mock of platform_reports_gce to return False
      [Chad Smith]
    - test: fix incorrect keyid for apt repository. [Joshua Powers]
    - tests: Update version of pylxd [Joshua Powers]
    - write_files: Remove log from helper function signatures.
      [Andrew Jorgensen]
    - doc: document the cmdli...


Changed in cloud-init (Ubuntu Zesty):
status: Fix Committed → Fix Released
thermoman (thermoman) wrote :

This release broke a lot of my machines, generating ordering cycles on every machine.

Please see #1717477

Scott Moser (smoser) on 2017-09-15
Changed in cloud-init (Ubuntu Yakkety):
status: Confirmed → Won't Fix
description: updated
Scott Moser (smoser) wrote :

Not sure what to do here.
We intend to fix the other bug (bug 1717477) by reverting this change.
Thus re-opening this bug.

Changed in cloud-init:
status: Fix Committed → Confirmed
assignee: Balint Reczey (rbalint) → nobody
Changed in cloud-init (Ubuntu Xenial):
status: Fix Released → Confirmed
status: Confirmed → Fix Released
Ryan Harper (raharper) wrote :

As far as I can tell, I don't think we can "delay" the fsck service due to how the systemd-fstab-generator works on /etc/fstab entries.

For entries with a non-zero value for fsck (6th column), the generator will write out a .mount file that looks like this:

ubuntu@ubuntu:/run/systemd/generator$ cat btrfs.mount
# Automatically generated by systemd-fstab-generator

[Unit]
SourcePath=/etc/fstab
Documentation=man:fstab(5) man:systemd-fstab-generator(8)
Before=local-fs.target
Requires=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service
After=systemd-fsck@dev-disk-by\x2duuid-d8e33db0\x2d9a54\x2d11e7\x2dbd8f\x2d525400123456.service

[Mount]
What=/dev/disk/by-uuid/d8e33db0-9a54-11e7-bd8f-525400123456
Where=/btrfs
Type=btrfs

This will want to run fsck on the device, and then mount it, and all *before* local-fs.target
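
To illustrate how the fsck dependency in that unit is derived from the fstab entry, here is a simplified sketch. The names `systemd_escape_path` and `fsck_unit_for` are hypothetical, the escaping only approximates what `systemd-escape --path` does, and the real generator applies further conditions (filesystem type, `LABEL=` specs and so on) that are left out here.

```python
from typing import Optional


def systemd_escape_path(path: str) -> str:
    """Approximate systemd's path escaping for unit names."""
    # Drop empty components so leading, trailing and doubled slashes vanish.
    trimmed = "/".join(part for part in path.split("/") if part)
    if not trimmed:
        return "-"  # the root directory
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            # Everything else, including '-', becomes \xXX per byte.
            out.extend("\\x%02x" % byte for byte in ch.encode("utf-8"))
    return "".join(out)


def fsck_unit_for(spec: str, passno: int) -> Optional[str]:
    """Name of the fsck unit a .mount would require, or None if passno is 0."""
    if passno == 0:
        return None
    if spec.startswith("UUID="):
        device = "/dev/disk/by-uuid/" + spec[len("UUID="):]
    else:
        device = spec  # assumed to already be a device path
    return "systemd-fsck@%s.service" % systemd_escape_path(device)
```

With the UUID from the unit above and a non-zero pass number, `fsck_unit_for` reproduces the `systemd-fsck@dev-disk-by\x2duuid-...service` name seen in the `Requires=` and `After=` lines.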

cloud-init cannot run until *after* local-fs.target is reached. Asking the fsck service to run later is always going to be in conflict with the fsck+mount from the generator.

I'm not sure we can reliably interrupt these services; the .mount unit is going to require a fsck; if we stop the fsck, then the mount won't happen.

This is going to require some more thought and discussion.

Changed in cloud-init (Ubuntu Artful):
status: Fix Released → Confirmed

This bug is believed to be fixed in cloud-init in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Confirmed → Fix Released
Scott Moser (smoser) on 2017-10-03
Changed in cloud-init:
status: Fix Released → Confirmed
Changed in cloud-init (Ubuntu Xenial):
status: Fix Released → Fix Committed
status: Fix Committed → Confirmed
Changed in cloud-init (Ubuntu Zesty):
status: Fix Released → Confirmed
Scott Moser (smoser) on 2018-07-20
Changed in cloud-init (Ubuntu Zesty):
status: Confirmed → Won't Fix
Changed in cloud-init (Ubuntu Artful):
status: Confirmed → Won't Fix