cloud-init growpart race with udev

Bug #1834875 reported by Tobias Koch on 2019-07-01
Affects                Importance   Assigned to
cloud-init             Medium       Unassigned
cloud-utils            Undecided    Unassigned
linux-azure (Ubuntu)   Undecided    Unassigned
systemd (Ubuntu)       Undecided    Unassigned

Bug Description

On Azure, it happens regularly (in 20-30% of launches) that cloud-init's growpart module fails to extend the partition to its full size.

Such as in this example:

========================================

2019-06-28 12:24:18,666 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-06-28 12:24:19,157 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-06-28 12:24:19,726 - util.py[DEBUG]: resize_devices took 1.075 seconds
2019-06-28 12:24:19,726 - handlers.py[DEBUG]: finish: init-network/config-growpart: FAIL: running config-growpart with frequency always
2019-06-28 12:24:19,727 - util.py[WARNING]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
2019-06-28 12:24:19,727 - util.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 812, in _run_modules
    freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 351, in handle
    func=resize_devices, args=(resizer, devices))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2521, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 298, in resize_devices
    (old, new) = resizer.resize(disk, ptnum, blockdev)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 159, in resize
    return (before, get_size(partdev))
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 198, in get_size
    fd = os.open(filename, os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/disk/by-partuuid/a5f2b49f-abd6-427f-bbc4-ba5559235cf3'

========================================

@rcj suggested this is a race with udev. This seems to only happen on Cosmic and later.

Ryan Harper (raharper) wrote :

Hi,

Could you run cloud-init collect-logs and attach the tarball provided?

Are there any particular instance-specific details? What type, what region? What size is the ephemeral disk?

Changed in cloud-init:
importance: Undecided → Medium
status: New → Incomplete
Tobias Koch (tobijk) wrote :

This seems hard to reproduce with the azure-cli, but happens regularly on instances we launch directly via the API. I'm not aware of any special configurations. These are stock images launched in northeurope. The ephemeral disk is 30GB.

Tobias Koch (tobijk) wrote :

This happens across different instance types without a clear pattern.

Tobias Koch (tobijk) on 2019-07-02
Changed in cloud-init:
status: Incomplete → New
Dan Watkins (daniel-thewatkins) wrote :

Hey Tobi, could you clarify what the failure here actually is? Looking at the journal (specifically the kernel EXT4 entries), it looks like some resizing _does_ happen. (Obviously we don't want a traceback in the log so we can chase that down, but I'd like to understand the concrete effect that this is having on users in order to prioritise it. :)

(Also, if you could get us access to a failing instance, that would be great! I don't have cloud-test-framework checked out any longer, else I'd try to do it.)

Changed in cloud-init:
status: New → Incomplete
Dan Watkins (daniel-thewatkins) wrote :

OK, so Ryan and I just kicked this back and forth a bit in a meeting. Looking at the failure, I believe that we successfully resize the disks/partitions, but when we look up the partition by partuuid to double-check what changed we fail. (Ryan's hypothesis is that the symlink is in the process of being recreated by udev when we go to check it.)

(The reason I believe we successfully resize the disk is that the traceback occurs after we've successfully run the growpart command which we can see in the cloud-init log and in dmesg.)

Access to a failing instance would be really helpful, so I can double-check our thinking here.

Tobias Koch (tobijk) wrote :

Duh, you're right, the resize happens. The only remaining problem is that the tests cannot tell why the cloud-init service comes up degraded.

Test instance at 137.135.172.1.

Dan Watkins (daniel-thewatkins) wrote :

Thanks for the test instance! Yeah, even though the resize is successful, we don't want tracebacks, warnings or degraded boot! (Also, I haven't checked, but I imagine this would cause functional issues if there was more than a single disk/partition that needed resizing; I expect we would resize the first only.)

Looking at the test instance, the by-partuuid symlinks are created after we hit this error:

$ ls -l --time-style=full-iso /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 2019-07-03 11:39:34.578632000 +0000 251ca684-01 -> ../../sdb1
lrwxrwxrwx 1 root root 11 2019-07-03 11:39:34.718632000 +0000 46150cea-0763-4a42-b688-5d7c32604a99 -> ../../sda14
lrwxrwxrwx 1 root root 11 2019-07-03 11:39:34.674632000 +0000 76944892-a450-4661-ad65-fbe719c872b0 -> ../../sda15
lrwxrwxrwx 1 root root 10 2019-07-03 11:39:34.698632000 +0000 b429205c-8c8c-434b-bb30-f80882728e23 -> ../../sda1

That said, those timestamps are _so_ far after the error (~28s) that I _guess_ they were regenerated again. (It's also possible that nothing poked udev for that long, I suppose.)

Tobi, if I give you a patched cloud-init in a PPA, would you be able to build a test image from it?

Changed in cloud-init:
status: Incomplete → Confirmed
Tobias Koch (tobijk) wrote :

Thanks for looking into this! Sure, send over the PPA.

Tobias Koch (tobijk) wrote :

I still see this happening with the test package from the PPA.

Tobias Koch (tobijk) wrote :

It definitely happens less frequently. Test instance at 52.169.141.202.

Thanks for the feedback Tobi. I've built a new cloud-init package in the same PPA, which just adds a little more debug logging. Please give this a spin and then give me SSH access to both a passing and a failing instance.

Thanks!

Tobias Koch (tobijk) wrote :

Sorry for the slow feedback. It's a pain to produce these instances.

failed instance: 52.178.178.45
passed instance: 13.74.143.107


No worries, thanks for the instances!

OK, here are the relevant lines from the log:

== SUCCESSFUL ==

2019-07-24 06:34:11,361 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:11,363 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:33:59.017040800 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

2019-07-24 06:34:11,377 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:11,649 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,907 - util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,991 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,994 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:34:12.981040800 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

== FAILURE ==

2019-07-24 06:25:42,590 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:25:27.707830000 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

2019-07-24 06:25:42,590 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,097 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,624 - util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,633 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,636 - util.py[DEBUG]: resize_devices took 1.059 seconds
2019-07-24 06:25:43,636 - handlers.py[DEBUG]: finish: init-network/config-growpart: FAIL: running config-growpart with frequency always
2019-07-24 06:25:43,636 - util.py[WARNING]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
2019-07-24 06:25:43,637 - util.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 827, in _run_modules
    freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 187, in run
...


So looking at that output, I have these observations:

* we are dealing with a fairly tight window; in the successful run, the timestamp of the symlink is 06:34:12.981040800 and we run the `ls` command to show us that at 06:34:12,991; that's about 10ms
* based on the timestamp of the next log message in each case, `udevadm settle` on success appears to take substantially longer (06:34:12,907 -> 06:34:12,991; 84ms) than on failure (06:25:43,624 -> 06:25:43,633; 9ms)
* on success, the timestamp of the symlink is between the `udevadm settle` and the next log message, which suggests that it was created as a result of the settle

So none of this points to something _obvious_ (or, at least, not something obvious to me). Given the disparity in the apparent `udevadm settle` execution times, I wonder if the udev events from the resize haven't made their way into the queue before we execute the settle, meaning it just exits without waiting at all?

Ryan Harper (raharper) wrote :

Looking at the growpart command, it does two primary things: 1) issues commands to update the size of the partition (via sgdisk or sfdisk), and 2) issues a "rereadpt" command.

I really suspect you're right in that the "rereadpt" may have been issued but not yet completed, including the uevent which would trigger udev events. Now, since the partition already exists, we really shouldn't see the symlink go away, which is the _really_ strange part.

So, I think we should fire up udevadm monitor

sudo udevadm monitor --property --kernel --udev --subsystem-match=block

And then issue growpart commands on a disk whose partition can be grown.

You can launch an instance in Azure with user-data to disable growpart, then issue the same growpart command after logging in. I'd also run growpart under bash -x for a trace so we can see what's happening.
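
(For concreteness, a rough sketch of that manual reproduction: launch with user-data containing `growpart: {mode: "off"}` under #cloud-config so cloud-init skips the resize, then run something like the following on the instance. The exact commands are illustrative, not verified on Azure:)

sudo udevadm monitor --property --kernel --udev --subsystem-match=block &
sudo bash -x "$(command -v growpart)" /dev/sda 1   # growpart is a shell script, so bash -x gives a trace
sudo udevadm settle
ls -l /dev/disk/by-partuuid/                       # check whether the partition's symlink survived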

Lastly, we can also use an inotify watch on /dev/disk/by-partuuid

% inotifywait -mr --timefmt "%s" --format "%T %w%f %e" /dev/disk/by-partuuid
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
1563987982 /dev/disk/by-partuuid/foo CREATE
1563987982 /dev/disk/by-partuuid/foo OPEN
1563987982 /dev/disk/by-partuuid/foo ATTRIB
1563987982 /dev/disk/by-partuuid/foo CLOSE_WRITE,CLOSE
1563987985 /dev/disk/by-partuuid/foo DELETE

So manually running `growpart /dev/sda 1` resulted in the following inotifywait output:

1563992577 /dev/disk/by-partuuid/60a13b74-44c8-4d0d-b415-6c697e398204 ATTRIB
1563992577 /dev/disk/by-partuuid/43b8aeb4-14df-4ae2-8a6a-d3ada9ea01dd ATTRIB
1563992577 /dev/disk/by-partuuid/7f427e61-bcb8-491d-b067-689ba6f4702b ATTRIB

and the attached `udevadm monitor` output. The fact we don't see a removal here suggests we aren't tickling the bug, so I'm going to launch another instance and see if running the cloud-init module makes a difference.

I tried a few more times to get a manual reproducer and failed. So I've created two systemd units, one[0] to collect the `udevadm monitor` info, and one[1] to collect the `inotifywait` info. (As these commands output to stdout, their output can be found in the systemd journal.) I put these in place, and iterated a few reboots to get them working OK.

I then `cloud-init clean`'d and captured the image (following Azure's instructions[2]). I launched my captured image multiple times (I didn't keep track of exactly how many, but N>10 for sure) and didn't hit a single failure. So either (a) I got "lucky", (b) the extra instrumentation fixes the race condition, (c) the differences between my image and a truly pristine one are too big, (d) how I'm launching instances and how the CPC test framework is launching instances is sufficiently different to not tickle the bug, or, of course, (e) something else entirely.

Tobi, I know we're pushing things a bit now, but are you able to add these two units to an image to see what they produce? (Make sure to enable them as well as install them! It took me two reboots to understand why they weren't executing. ¬.¬)

(The `inotifywait` unit[1] appears to not yet be perfect, as it sometimes emits "Current command vanished from the unit file, execution of the command list won't be resumed" and then stops emitting anything else. I haven't been able to work out what this error message actually means either via search or by looking at the systemd source code briefly, and asking in #systemd on Freenode hasn't yet yielded any further info.)

[0] https://paste.ubuntu.com/p/FgbZqwZR6Y/
[1] https://paste.ubuntu.com/p/57HkmCKN67/
[2] https://docs.microsoft.com/en-us/azure/virtual-machines/linux/capture-image
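
(The real unit files are in the pastes above; purely as an illustration of the shape of the thing, the `udevadm monitor` unit would look roughly like this, with the naming and ordering details being guesses rather than the contents of [0]:)

# /etc/systemd/system/udevmon.service -- illustrative sketch only
[Unit]
Description=Log block-device udev events for LP: #1834875
DefaultDependencies=no
After=systemd-udevd.service
Before=cloud-init.service

[Service]
ExecStart=/bin/udevadm monitor --property --kernel --udev --subsystem-match=block

[Install]
WantedBy=sysinit.target

# remember: systemctl enable udevmon.service (installing the file alone is not enough)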

Changed in cloud-init:
status: Confirmed → In Progress
assignee: nobody → Dan Watkins (daniel-thewatkins)

(Marked as In Progress, though depending on the outcome of the triage we're doing ATM this might go back to Triaged rather than get fixed immediately.)

Tobias Koch (tobijk) wrote :

Is the problem with inotifywait an instance of this systemd bug [1]? I don't see a work-around mentioned.

With the two services from above included:

failed instance: 52.178.156.32
passed instance: 13.94.113.245

[1] https://github.com/systemd/systemd/issues/12258

Thanks for the continued debugging assistance! It definitely could be that bug, though looking at the failing instance, we're getting enough information from the inotifywait service, so thankfully I don't think we need to worry about it:

Jul 26 10:39:01.324469 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx inotifywait[462]: 1564137541 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 DELETE

OK, so we're definitely seeing a DELETE on failure, and it _looks_ like it happens during the growpart:

2019-07-26 10:39:00,725 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
Jul 26 10:39:01.324469 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx inotifywait[462]: 1564137541 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 DELETE
2019-07-26 10:39:01,339 - util.py[DEBUG]: resize_devices took 1.334 seconds

Looking at the `udevadm monitor` logs from each instance, we only have one udev/kernel remove on each (listed twice, once for the kernel event, once for the corresponding udev event):

ubuntu@<failing>:~$ journalctl -o short-precise -u udevmon.service | grep ACTION=remove -B1
Jul 26 10:39:03.571001 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: KERNEL[36.612409] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:39:03.571001 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: ACTION=remove
--
Jul 26 10:39:03.588454 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: UDEV [36.629717] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:39:03.588454 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: ACTION=remove

ubuntu@<passing>:~$ journalctl -o short-precise -u udevmon.service | grep ACTION=remove -B1
Jul 26 10:43:33.613194 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: KERNEL[33.632423] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:43:33.613194 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: ACTION=remove
--
Jul 26 10:43:33.627533 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: UDEV [33.644486] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:43:33.627533 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: ACTION=remove

I'm going to dig in to the logs a little further to see if I can find anything.

I've confirmed that the udev and kernel ACTION=change event on /dev/sda is the same across the failing and succeeding instances (except for some timestamps which I edited out of these two documents, and the ordering/naming of the DEVLINKS):

$ diff -u failing.sda_event success.sda_event
--- failing.sda_event 2019-07-26 10:25:13.088884853 -0400
+++ success.sda_event 2019-07-26 10:23:51.465073329 -0400
@@ -41,5 +41,5 @@
 ID_FS_TYPE=
 MAJOR=8
 MINOR=0
-DEVLINKS=/dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9 /dev/disk/azure/root /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9 /dev/disk/cloud/azure_root /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9
+DEVLINKS=/dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0 /dev/disk/azure/root /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9 /dev/disk/cloud/azure_root
 TAGS=:systemd:

After the /dev/sda udev event is logged, each partition on /dev/sda {sda1, sda14, sda15} then gets an ACTION=change udev event logged. One difference between the two instances is the order of these events. On the failing system, the order is (sda1, sda15, sda14), and on the succeeding system, it's (sda15, sda14, sda1). This vaguely suggests a race is involved, as the timing is different, but it's certainly not categorical. It's worth noting that the _kernel_ events (which are all emitted before even the sda udev event is processed) happen in the same order on each instance: sda, sda1, sda14, sda15.

OK, the sda1 events provide a substantial difference:

--- failing.sda1_event 2019-07-26 10:42:50.424486349 -0400
+++ success.sda1_event 2019-07-26 12:02:54.536766079 -0400
@@ -53,9 +53,16 @@
 ID_FS_VERSION=1.0
 ID_FS_TYPE=ext4
 ID_FS_USAGE=filesystem
+ID_PART_ENTRY_SCHEME=gpt
+ID_PART_ENTRY_UUID=23f70cea-8174-4426-bc82-462217b97245
+ID_PART_ENTRY_TYPE=0fc63daf-8483-4772-8e79-3d69d8477de4
+ID_PART_ENTRY_NUMBER=1
+ID_PART_ENTRY_OFFSET=227328
+ID_PART_ENTRY_SIZE=62689247
+ID_PART_ENTRY_DISK=8:0
 fabric_name=root
 .ID_FS_TYPE_NEW=ext4
 MAJOR=8
 MINOR=1
-DEVLINKS=/dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/cloud/azure_root-part1 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9-part1 /dev/disk/by-uuid/d800ba44-3054-4cb3-81f7-dfe32f5e8b48 /dev/disk/by-label/cloudimg-rootfs /dev/disk/azure/root-part1 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0-part1
+DEVLINKS=/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-uuid/d800ba44-3054-4cb3-81f7-dfe32f5e8b48 /dev/disk/by-label/cloudimg-rootfs /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9-part1 /dev/disk/azure/root-part1 /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0-part1 /dev/disk/cloud/azure_root-part1
 TAGS=:systemd:

As we can see here, there are quite a few differences, which appear to reflect what we're seeing: DEVLINKS on the failing instance doesn't include the by-partuuid path, and we're missing all of the ID_PART_ENTRY_* lines from its udev event.

So looks like the next step is probably to determine why the udev event doesn't contain what it should!

After some discussion with rharper on IRC[0], our working hypothesis is that the events that are being processed are as a result of the sgdisk command that does the partition table modification[1], _not_ the pt_update call (which boils down to `partx --update`[2]). As a result, we think we are racing the re-read of the partition table; this is consonant with the different ordering of the udev events on failure.

To test this hypothesis, I'm going to create a cloud-init package which performs its own partition table re-read (which should trigger udev events for those partitions) and `udevadm settle` after the resize. This _should_ mean that the symlink is correctly re-created. If this succeeds, we can examine what change we need to make to cloud-utils.
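
(In shell terms, the sequence the patched package performs is roughly the following; the real change lives inside cc_growpart.py, this is just the equivalent set of commands:)

growpart /dev/sda 1            # resize the partition (sgdisk + partx inside growpart itself)
partx --update /dev/sda        # explicitly re-read the partition table, which should emit fresh uevents
udevadm settle                 # wait for udev to finish processing them
ls -l /dev/disk/by-partuuid/   # the resized partition's symlink should now be back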

Tobi, I'll upload that package to the same PPA as the others, and will let you know here when it's ready. We'll want an image with that cloud-init as well as the debugging systemd units.

[0] Which should eventually show up here: https://irclogs.ubuntu.com/2019/07/26/%23cloud-init.html
[1] https://git.launchpad.net/cloud-utils/tree/bin/growpart#n505
[2] https://git.launchpad.net/cloud-utils/tree/bin/growpart#n650

OK, the PPA now has a package with the partx call added, and I've tested that it does something along the lines of what we want in a test instance of my own. Tobi, when you have time, please build a test image with the PPA cloud-init and the debugging units and point us at a failed and successful instance again. Thanks!

Tobias Koch (tobijk) wrote :

passed instance: 104.41.204.32
failed instance: 104.41.207.197

Thanks! TL;DR: the partx call didn't change much.

On the failing instance, this is when the partx call happens:

2019-07-29 12:09:25,391 - util.py[DEBUG]: Running command ['partx', '--update', '/dev/sda']

Looking in the journal at that timestamp, I see (lines omitted and/or edited for brevity):

2019-07-29T12:09:25.395099+0000 udevadm[246]: KERNEL[27.837161] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.396119+0000 tobijk-x-alan-eoan-base-uutvyfllcflbntzjhhes udevadm[246]: UDEV [27.838088] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.397301+0000 udevadm[246]: KERNEL[27.838192] add /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.405384+0000 inotifywait[462]: 1564402165 /dev/disk/by-partuuid/9c72c02a-6134-4811-86d8-feee960b40b2 CREATE
2019-07-29T12:09:25.418906+0000 udevadm[246]: UDEV [27.860977] add /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)

These remove/add lines for sda14 are consistent with what I saw when just running `partx` by hand in an instance.

Looking at the passing instance, there is one noticeable difference: the sda14 remove/add KERNEL events are emitted to the log _before_ any of the UDEV events from the resizing are emitted. This means that:

* On the failing instance, we see the first resizing KERNEL event (sda) emitted at 12:09:25.347818 and the sda1 UDEV event emitted at 12:09:25.372478; .025s.
* On the passing instance, we see the first resizing KERNEL event (sda) emitted at 12:14:21.925924 and the sda1 UDEV event emitted at 12:14:22.002377; .076s.

(And, once again, we see that the order in which udev processes those events goes sda15, sda14, sda1.)

OK, I chatted briefly with Ryan and he agrees with my assessment that this is happening too deep underneath us for us to readily work around it. The two obvious places to look are udev and the kernel.

To help triangulate, Tobi, could you attempt an eoan build and test using the generic kernel instead of linux-azure?

Tobias Koch (tobijk) wrote :

Yep, this doesn't happen with linux-generic.

Now one wrinkle about it being a linux-azure bug: if you _aren't_ seeing this on bionic but _were_ seeing it on cosmic, then there must be something else involved, because I believe (based on the publishing history[0]) that cosmic was on the same kernel version as bionic.

This could, of course, still be a kernel bug that's just not hit by bionic's userspace. But, to be sure of where we stand, can you confirm the statement in the original report that this was an issue on cosmic? Do you have enough test history to determine when this started happening?

[0] https://launchpad.net/ubuntu/+source/linux-meta-azure/+publishinghistory

(Oh, and thank you for that test run!)

Tobias Koch (tobijk) wrote :

Oh, good observation, Dan. Yes, it was happening on Cosmic and later, but I cannot say when it started. Now, in a moment of despair, I tried something very unorthodox: I copied udevadm from Bionic to the Eoan image and ran the tests again. Lo and behold, all tests pass.
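
(For the record, "copied udevadm" amounts to something like the following; a rough, untested sketch, and the package retrieval and paths are assumptions:)

# On the eoan image: grab bionic's udev package and swap in just its udevadm binary.
apt-get download udev/bionic                           # assumes a bionic entry in sources.list; otherwise fetch the .deb by hand
dpkg-deb -x udev_*.deb /tmp/bionic-udev
sudo cp "$(command -v udevadm)" /root/udevadm.eoan     # keep the original around
sudo cp /tmp/bionic-udev/bin/udevadm "$(command -v udevadm)"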

no longer affects: linux-azure (Ubuntu)

That certainly is unorthodox! When you say "copied udevadm", what specifically do you mean? Just the binary, the udev package, ...?

Tobias Koch (tobijk) wrote :

When I say unorthodox, I mean I copied the binary ;) So that should narrow it down quite a bit. Unless there is more funniness involved.

So, to reset on status: I believe that we've narrowed this down to a systemd/udev problem, as it goes away when a different udevadm binary is used. I don't have a good sense of how to go about digging into that issue, so I don't think I can take it much further.

(Also, as it appears to be a systemd issue, I'm going to mark the cloud-init task Invalid. Feel free to move it back to New if you believe this is a mistake!)

Changed in cloud-init:
status: In Progress → Invalid
assignee: Dan Watkins (daniel-thewatkins) → nobody
Tobias Koch (tobijk) wrote :

@daniel-thewatkins, I'm not convinced that this bug is invalid for cloud-init. After reading through all of this again, I still don't understand what guarantee there is that, when `udevadm settle` is called, all relevant events have already been queued.

Copying udevadm over, and with that suppressing the error, suggests that maybe the event-queue handling is at fault, but on the other hand it might just be that previous versions were slower at something, so the 10ms window discussed above was always exceeded anyway.

I'm not saying there cannot also be a bug somewhere else. But if there is no specification that says there cannot under any circumstances be a race condition in what cloud-init is doing, then cloud-init should handle this more robustly.

I'm not an expert at that level, but in a world of multi-core CPUs and fancy schedulers, invoking command-line tools in a certain order does not seem to preclude a race in how the event is submitted, routed and queued, without there being an explicit locking mechanism.

Tobias Koch (tobijk) on 2019-08-23
Changed in cloud-init:
status: Invalid → New
Ryan Harper (raharper) wrote :

The sequence is:

exec growpart
  exec sgdisk --info # read-only
  exec sgdisk --pretend # read-only
  exec sgdisk --backup # read-only copy
  # modification of disk starts
  exec sgdisk --move-second-header \
              --delete=PART \
              --new=PART \
              --typecode --partition-guid --change-name
  # now that sgdisk has *closed* the filehandle on the disk, systemd-udevd will
  # get an inotify signal and trigger udevd to run udev scripts on the disk.
  # this includes the *removal* of symlinks due to the --delete portion of sgdisk call
  # and following the removal, the --new will trigger the add rules to run, which would
  # recreate the symlinks.

  # update kernel partition sizes; this is an ioctl, so it does not trigger any udev events
  exec partx --update
  # the kernel has the new partition sizes, and udev scripts/events are all queued (and possibly in flight)
exit growpart

cloud-init then invokes its get_size() operation, which does:
   # this is where the race occurs if the symlink created by udev is *not* present
   os.open(/dev/disk/by-id/fancy-symlink-with-partuuid-points-to-sdb1)

Dan had put a udevadm settle in this spot, like so:

def get_size(filename):
    util.subp(['udevadm', 'settle'])
    os.open(....)

So, you're suggesting that somehow _not all_ of the uevents triggered by the sgdisk command in growpart would have been queued before we call udevadm settle?

If some other events are happening, how is cloud-init to know, such that it can take action to "handle this race" more robustly?

Lastly, if there is a *race* in the symlink creation/removal or a delay in uevent propagation, why is that a userspace, let alone a cloud-init, issue? This isn't universally reproducible; rather, it happens under pretty narrow circumstances with certain kernel and udev combinations, all the while the growpart/cloud-init code remains the same.

Changed in cloud-init:
status: New → Incomplete

[N.B. I wrote the below before I saw Ryan's comment, so there is some repetition.]

OK, I've spent some time catching up on this properly so I can summarise: per comment #24, the issue is that when udev processes the events emitted by the kernel, it (sometimes) doesn't determine the correct partition information. The kernel _does_ emit all the events we would expect, and udev _does_ handle all the events we would expect (which is to say that `udevadm settle` doesn't change behaviour here, it merely ensures that the broken behaviour has completed before we proceed). The hypothesised race condition is somewhere between the kernel and udev: I believe the kernel event is emitted before the partition table has necessarily been fully updated so when udev processes the event and reads the partition table, sometimes it finds the partition and sometimes it doesn't. To be clear, the kernel event generation and the buggy udev event handling all happens as a result of the resize command, _not_ as a result of anything else cloud-init runs subsequently.

So as far as I can tell, this bug would occur regardless of what runs the resize command, and no matter what commands are executed after the resize command. (It might be possible to work around this bug by issuing commands that force a re-read of the partition table on a disk, for example, but this bug _would_ still have occurred before then.)

cloud-init could potentially work around a (kernel|systemd) that isn't handling partitions correctly, but we really shouldn't have to. Until we're satisfied that they cannot actually be fixed, we shouldn't do that. (I am _not_ convinced that this cannot be fixed in (the kernel|systemd), because using a different kernel and using a different udevadm have both caused the issue to stop reproducing.)

So, let me be a little more categorical. The information we have at the moment indicates an issue in the interactions between the kernel and udev on partition resize. cloud-init's involvement is merely as the initiator of that resize. Until we have more information that indicates the issue to be in cloud-init, this isn't a valid cloud-init issue. Once we have more information from the kernel and/or systemd folks, if it indicates that cloud-init _is_ at fault, please move this back to New.

Dan Streetman (ddstreet) wrote :

> Dan had put a udevadm settle in this spot like so
>
> def get_size(filename)
> util.subp(['udevadm', 'settle'])
> os.open(....)

if you know you've just changed (e.g.) /dev/sda, possibly its kernel-generated udev events just haven't reached udev yet, so the immediate call to 'udev settle' has nothing to wait for; maybe you should tell udev to explicitly request a new full set of events and settle on that?

udevadm trigger --settle /dev/sda

Looking at the comment timestamps, Dan probably didn't see my comment, but to reiterate: all the events we expect to be processed _are_ processed, the issue is that when they are processed they don't always end up with the correct partition information.

On Fri, Aug 23, 2019 at 10:00 AM Dan Streetman <email address hidden>
wrote:

> > Dan had put a udevadm settle in this spot like so
> >
> > def get_size(filename)
> > util.subp(['udevadm', 'settle'])
> > os.open(....)
>
> if you know you've just changed (e.g.) /dev/sda, possibly its kernel-
> generated udev events just haven't reached udev yet, so the immediate
> call to 'udev settle' has nothing to wait for; maybe you should tell
> udev to explicitly request a new full set of events and settle on that?
>
> udevadm trigger --settle /dev/sda
>

Possible; though in the case that we don't race, we get to repeat all of the rules. If we can sort out what changes in the kernel/udev expose this race, then maybe we can see if there's something that cloud-init/growpart/etc. can do.

Ryan Harper (raharper) wrote :

On Fri, Aug 23, 2019 at 10:21 AM Dan Watkins <email address hidden>
wrote:

> Looking at the comment timestamps, Dan probably didn't see my comment,
> but to reiterate: all the events we expect to be processed _are_
> processed, the issue is that when they are processed they don't always
> end up with the correct partition information.
>

One bit of info we don't have is a timestamp for when the BLKPG ioctl from
partx --update is run. That would confirm the order of things.

Tobias Koch (tobijk) wrote :

I may be missing the point, but the symlink in question is eventually recreated; does that tell us anything? This here:

> Dan had put a udevadm settle in this spot like so
>
> def get_size(filename)
> util.subp(['udevadm', 'settle'])
> os.open(....)

looks to me like the event queue should be empty now, but how do you know userspace has acted on what came out of it? Is it strictly required that any event is cleared only after the corresponding action has completed? If yes, we can probably blame udev. If not, cloud-init should wait on the link to appear.

On Fri, Aug 23, 2019 at 08:23:06PM -0000, Tobias Koch wrote:
> I may be missing the point, but the symlink in question is eventually
> recreated, does that tell us anything? This here

Yes, this is more supporting evidence that this is a race condition; the
state of the system both before and some time after the resize is
consistent, the kernel/udev just lose track of the existence of the
partition for some amount of time after the resize happens.

> > Dan had put a udevadm settle in this spot like so
> >
> > def get_size(filename)
> > util.subp(['udevadm', 'settle'])
> > os.open(....)
>
> looks to me like the event queue should be empty now, but how do you
> know userspace has acted on what came out of it?

The symlink exists before the resize, so if it's missing then we know
that the udev events have been processed. The DEVLINKS line in the diff
in comment #24 shows that udev doesn't think that symlink should exist.
We can see the symlink being deleted in the logging in comment #22.

> Is it strictly required that any event is cleared only after the
> corresponding action has completed? If yes, we can probably blame
> udev. If not, cloud-init should wait on the link to appear.

The link is not going to appear without something prompting a re-read of
the partition table from udev. I don't believe this should be necessary
if the kernel/udev are behaving properly.

(Odds are that whatever causes it to be recreated later in boot would be
blocked by cloud-init waiting.)

Tobias Koch (tobijk) wrote :

> (Odds are that whatever causes it to be recreated later in boot would be
> blocked by cloud-init waiting.)

But that's not happening. The instance does boot normally; the only degraded service is cloud-init, and there is no significant delay either.

So conversely, if I put a loop into cloud-init and just waited on the symlink to appear and if that worked with minimal delay, would that refute the above?
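
(For concreteness, the kind of loop being proposed would be something like the following; illustrative only, using the partuuid from the logs above:)

link=/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245
for i in $(seq 1 50); do              # poll for up to ~5 seconds
    [ -e "$link" ] && break
    sleep 0.1
done
[ -e "$link" ] || echo "symlink still missing after 5s" >&2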

Ryan Harper (raharper) wrote :

On Mon, Aug 26, 2019 at 4:05 AM Tobias Koch <email address hidden>
wrote:

> > (Odds are that whatever causes it to be recreated later in boot would be
> > blocked by cloud-init waiting.)
>
> But that's not happening. The instance does boot normally, the only
> service degraded is cloud-init and there is no significant delay either.
>

> So conversely, if I put a loop into cloud-init and just waited on the
> symlink to appear and if that worked with minimal delay, would that
> refute the above?
>

That's still a workaround for something we don't exactly know why is racing, nor why this isn't more widespread. The code in cloud-init, growpart, sgdisk and partx is stable (it has not changed significantly in some time).

We don't have a root cause for the race at this time. When cloud-init invokes growpart the symlink exists, and when growpart returns sometimes it does not. If anything, growpart should address the race itself; and at this point, it would have to pick up a workaround as well.

Let's at least make sure we understand the actual race before we look further into workarounds.

From what I can see of what growpart is doing, the sgdisk command will clear the partition table entry (this involves removing the partition and then re-adding it, which triggers udev). Further, Dan's logs show that partx --update can also trigger a remove and an add. Looking at the partx update code, *sometimes* it will remove and add; however, if the partition to be updated *exists*, then it will instead issue an update IOCTL which only updates the size value in sysfs.

https://github.com/karelzak/util-linux/blob/53ae7d60cfeacd4e87bfe6fcc015b58b78ef4555/disk-utils/partx.c#L451

Which makes me think that in the successful path, we're seeing partx --update take the partx_resize_partition path, which submits the resize IOCTL:

https://github.com/karelzak/util-linux/blob/917f53cf13c36d32c175f80f2074576595830573/include/partx.h#L54

which in the Linux kernel is handled here:

https://elixir.bootlin.com/linux/latest/source/block/ioctl.c#L100

and just updates the size value in sysfs:

https://elixir.bootlin.com/linux/latest/source/block/ioctl.c#L146

which, AFAICT, does not emit any new uevents.
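
(A quick way to check that on a scratch instance is to watch udev while issuing the call by hand; assuming /dev/sda1 already exists, partx should take the size-only ioctl path:)

sudo udevadm monitor --udev --kernel --subsystem-match=block &
mon=$!
sleep 1
sudo partx --update /dev/sda    # if only the resize ioctl is issued, no remove/add events should appear
sudo udevadm settle
sudo kill "$mon"                # stop the background monitor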

Lastly, in either path (partx updates vs partx removes/adds), invoking a udevadm settle after the binary has exited is the reasonable way to ensure that *if* any uevents were created, they are processed.

growpart could add udevadm settle code; so could cloud-init. We actually did that in our first test package, and it did not result in the symlink being present.

All of this suggests to me that *something* isn't processing the sequence of uevents in such a way that once they've all been processed we have the symlink.

We must be missing some other bit of information in the failing path where the symlink is eventually recreated (possibly due to some other write or close on the disk which re-triggers the rules).

Scott Moser (smoser) wrote :

If there is a race, or a need to wait, it almost certainly is in cloud-utils (growpart). If you use growpart to grow a partition, the result should be that the partition is grown and ready. The caller should not be expected to need additional knowledge (to call 'udevadm settle') or to have to wait some amount of time.

On Mon, Aug 26, 2019 at 08:58:46AM -0000, Tobias Koch wrote:
> > (Odds are that whatever causes it to be recreated later in boot would be
> > blocked by cloud-init waiting.)
>
> But that's not happening. The instance does boot normally, the only
> service degraded is cloud-init and there is no significant delay either.

Right, I was talking about the proposed addition of a wait in
cloud-init. We wouldn't expect to see this today.

> So conversely, if I put a loop into cloud-init and just waited on the
> symlink to appear and if that worked with minimal delay, would that
> refute the above?

What would the loop do? `udevadm settle` doesn't do anything, because
the events have all been processed, they've just been processed
incorrectly.

On Mon, Aug 26, 2019 at 03:17:10PM -0000, Scott Moser wrote:
> If there is a race, or a need to wait, it almost certainly is in cloud-
> utils (growpart).

I would still like us to get a systemd/kernel person to take a look at
this, because I think that the race is somewhere further down than
cloud-utils. (I definitely agree that any workarounds we end up
deciding to use would be best placed in cloud-utils, though!)

> If you use growpart to grow a partition, the result
> should be that that is done, and it is ready. The caller should not
> expect that they have to know additional information (to call 'udevadm
> settle') or should have to wait some amount of time.

+1

Dan Streetman (ddstreet) wrote :

Pretty sure you need to actually try what I suggested in comment 40, but of course use the specific dev/partition you're missing the udev symlink for.

e.g. if you need the symlink for /dev/sda1 then instead of:

$ growpart /dev/sda 1 ; udevadm settle ; ls /dev/disk/by-uuid/$UUID

you should do

$ growpart /dev/sda 1 ; udevadm trigger -w /dev/sda1 ; ls /dev/disk/by-uuid/$UUID

the -w param is relatively new to udevadm so for older releases like bionic do 'udevadm trigger /dev/sda1 ; udevadm settle'
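
(Putting that together, the suggested workaround in full, with the fallback for older releases; a sketch only, using the same $UUID placeholder as above:)

growpart /dev/sda 1
udevadm trigger --settle /dev/sda1 2>/dev/null || {   # -w/--settle needs a newer udevadm
    udevadm trigger /dev/sda1 && udevadm settle       # fallback for older releases such as bionic
}
ls /dev/disk/by-uuid/$UUID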

Dan, are you asking for that to get more debugging information, or are you suggesting that as a fix for this?

(I'm yet to be convinced that this isn't a regression in (the kernel|src:systemd), so I don't think that we should have to run _any_ additional commands to address this.)

Dan Streetman (ddstreet) wrote :

I think that is the correct way to wait for udev to process a specific device's changes.

I don't think we're looking for a way to wait for udev. For this bug to present, udev has to have processed all of the available events (i.e. there's nothing left to wait for), and have done so incorrectly (probably as a result of its interaction with the Azure kernel, as this doesn't reproduce on the generic kernel).

And, to be clear, your proposal isn't just a "wait". You're proposing retriggering events and _then_ waiting. This sounds like a potential workaround, but I believe it would further mask the underlying problem. We (as Ubuntu) control every part of this, so I think we should be able to fix this properly, rather than working around a linux-azure/udev issue in cloud-init and/or cloud-image-utils.

Dan Streetman (ddstreet) wrote :

> You're proposing retriggering events and _then_ waiting

I'm proposing make sure that the uevent(s) for the specific partition you're interested in have started processing before you start waiting, which is what the retriggering does.

But it could be a kernel bug, sure.

The lack of events is not the issue. We see all the events from the kernel that we would expect to see during a resize, and we see the same number (and type) of events on both the passing and the failing instances.

The problem is that if udev processes the sda1 event first then, at least some of the time, it doesn't detect that sda1 has a partition on it, so it doesn't create the by-partuuid symlink for that partition. See comments #23 and #24 for more detail.

Dan Streetman (ddstreet) wrote :

> The lack of events is not the issue

I'm not saying it is. But you seem to have it under control; I was just offering my opinion.
