cloud-init growpart race with udev

Bug #1834875 reported by Tobias Koch
Affects                         Status        Importance  Assigned to  Milestone
cloud-initramfs-tools           Fix Released  Medium      Unassigned
cloud-utils                     Fix Released  Undecided   Unassigned
cloud-initramfs-tools (Ubuntu)  Fix Released  Undecided   Unassigned
  Eoan                          Won't Fix     Undecided   Unassigned
cloud-utils (Ubuntu)            Fix Released  Undecided   Unassigned
  Eoan                          Won't Fix     Undecided   Unassigned

Bug Description

[impact]
On Azure, cloud-init's growpart module regularly (20-30% of launches) fails to extend the partition to its full size.

Such as in this example:

========================================

2019-06-28 12:24:18,666 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-06-28 12:24:19,157 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-06-28 12:24:19,726 - util.py[DEBUG]: resize_devices took 1.075 seconds
2019-06-28 12:24:19,726 - handlers.py[DEBUG]: finish: init-network/config-growpart: FAIL: running config-growpart with frequency always
2019-06-28 12:24:19,727 - util.py[WARNING]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
2019-06-28 12:24:19,727 - util.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 812, in _run_modules
    freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 351, in handle
    func=resize_devices, args=(resizer, devices))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2521, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 298, in resize_devices
    (old, new) = resizer.resize(disk, ptnum, blockdev)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 159, in resize
    return (before, get_size(partdev))
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 198, in get_size
    fd = os.open(filename, os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/disk/by-partuuid/a5f2b49f-abd6-427f-bbc4-ba5559235cf3'

========================================

@rcj suggested this is a race with udev. This seems to only happen on Cosmic and later.

[test case]
Somewhat unfortunately, the only way we have been able to reliably reproduce this is to have the CPC team run their tests on Azure, so we will have to rely on them for SRU verification as well.

[regression potential]
The change is fairly simple and was thoroughly reviewed and tested before it was uploaded to Focal, and the backport is trivial, so I am confident that the regression potential is slight.

Related branches

Revision history for this message
Ryan Harper (raharper) wrote :

Hi,

Could you run cloud-init collect-logs and attach the tarball provided?

Are there any particular instance-specific defaults? What instance type, what region? What size is the ephemeral disk?

Changed in cloud-init:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Tobias Koch (tobijk) wrote :

This seems hard to reproduce with the azure-cli, but happens regularly on instances we launch directly via the API. I'm not aware of any special configurations. These are stock images launched in northeurope. The ephemeral disk is 30GB.

Revision history for this message
Tobias Koch (tobijk) wrote :

This happens across different instance types without a clear pattern.

Tobias Koch (tobijk)
Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Dan Watkins (oddbloke) wrote :

Hey Tobi, could you clarify what the failure here actually is? Looking at the journal (specifically the kernel EXT4 entries), it looks like some resizing _does_ happen. (Obviously we don't want a traceback in the log so we can chase that down, but I'd like to understand the concrete effect that this is having on users in order to prioritise it. :)

(Also, if you could get us access to a failing instance, that would be great! I don't have cloud-test-framework checked out any longer, else I'd try to do it.)

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, so Ryan and I just kicked this back and forth a bit in a meeting. Looking at the failure, I believe that we successfully resize the disks/partitions, but when we look up the partition by partuuid to double-check what changed we fail. (Ryan's hypothesis is that the symlink is in the process of being recreated by udev when we go to check it.)

(The reason I believe we successfully resize the disk is that the traceback occurs after we've successfully run the growpart command which we can see in the cloud-init log and in dmesg.)

Access to a failing instance would be really helpful, so I can double-check our thinking here.

Revision history for this message
Tobias Koch (tobijk) wrote :

Duh, you're right, the resize happens. Then the only problem with this is that the tests cannot tell why the cloud-init service starts degraded.

Test instance at 137.135.172.1.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Thanks for the test instance! Yeah, even though the resize is successful, we don't want tracebacks, warnings or degraded boot! (Also, I haven't checked, but I imagine this would cause functional issues if there was more than a single disk/partition that needed resizing; I expect we would resize the first only.)

Looking at the test instance, the by-partuuid symlinks are created after we hit this error:

$ ls -l --time-style=full-iso /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 2019-07-03 11:39:34.578632000 +0000 251ca684-01 -> ../../sdb1
lrwxrwxrwx 1 root root 11 2019-07-03 11:39:34.718632000 +0000 46150cea-0763-4a42-b688-5d7c32604a99 -> ../../sda14
lrwxrwxrwx 1 root root 11 2019-07-03 11:39:34.674632000 +0000 76944892-a450-4661-ad65-fbe719c872b0 -> ../../sda15
lrwxrwxrwx 1 root root 10 2019-07-03 11:39:34.698632000 +0000 b429205c-8c8c-434b-bb30-f80882728e23 -> ../../sda1

That said, those timestamps are _so_ far after the error (~28s) that I _guess_ they were regenerated again. (It's also possible that nothing poked udev for that long, I suppose.)

Toby, if I give you a patched cloud-init in a PPA, would you be able to build a test image from it?

Dan Watkins (oddbloke)
Changed in cloud-init:
status: Incomplete → Confirmed
Revision history for this message
Tobias Koch (tobijk) wrote :

Thanks for looking into this! Sure, send over the PPA.

Revision history for this message
Dan Watkins (oddbloke) wrote :
Revision history for this message
Tobias Koch (tobijk) wrote :

I still see this happening with the test package from the PPA.

Revision history for this message
Tobias Koch (tobijk) wrote :

It definitely happens less frequently. Test instance at 52.169.141.202.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Thanks for the feedback Tobi. I've built a new cloud-init package in the same PPA, which just adds a little more debug logging. Please give this a spin and then give me SSH access to both a passing and a failing instance.

Thanks!

Revision history for this message
Tobias Koch (tobijk) wrote :

Sorry for the slow feedback. It's a pain to produce these instances.

failed instance: 52.178.178.45
passed instance: 13.74.143.107

Revision history for this message
Dan Watkins (oddbloke) wrote :

No worries, thanks for the instances!

OK, here are the relevant lines from the log:

== SUCCESSFUL ==

2019-07-24 06:34:11,361 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:11,363 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:33:59.017040800 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

2019-07-24 06:34:11,377 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:11,649 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,907 - util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,991 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:34:12,994 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:34:12.981040800 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

== FAILURE ==

2019-07-24 06:25:42,590 - cc_growpart.py[CRITICAL]: DEBUG ODD_BLOKE: ls output:
lrwxrwxrwx 1 root root 10 2019-07-24 06:25:27.707830000 +0000 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 -> ../../sda1

2019-07-24 06:25:42,590 - util.py[DEBUG]: Running command ['growpart', '--dry-run', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,097 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,624 - util.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,633 - util.py[DEBUG]: Running command ['ls', '--full-time', '/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245'] with allowed return codes [0] (shell=False, capture=True)
2019-07-24 06:25:43,636 - util.py[DEBUG]: resize_devices took 1.059 seconds
2019-07-24 06:25:43,636 - handlers.py[DEBUG]: finish: init-network/config-growpart: FAIL: running config-growpart with frequency always
2019-07-24 06:25:43,636 - util.py[WARNING]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
2019-07-24 06:25:43,637 - util.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 827, in _run_modules
    freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 187, in run
...


Revision history for this message
Dan Watkins (oddbloke) wrote :

So looking at that output, I have these observations:

* we are dealing with a fairly tight window; in the successful run, the timestamp of the symlink is 06:34:12.981040800 and we run the `ls` command to show us that at 06:34:12,991; that's about 10ms
* based on the timestamp of the next log message in each case, `udevadm settle` on success appears to take substantially longer (06:34:12,907 -> 06:34:12,991; 84ms) than on failure (06:25:43,624 -> 06:25:43,633; 9ms)
* on success, the timestamp of the symlink is between the `udevadm settle` and the next log message, which suggests that it was created as a result of the settle

So none of this points to something _obvious_ (or, at least, not something obvious to me). Given the disparity in the apparent `udevadm settle` execution times, I wonder if the udev events from the resize haven't made their way into the queue before we execute the settle, meaning it just exits without waiting at all?

Revision history for this message
Ryan Harper (raharper) wrote :

Looking at the growpart command: it does two primary things. 1) It issues commands to update the size of the partition (via sgdisk or sfdisk). 2) It issues a "rereadpt" command.

I really suspect you're right in that the "rereadpt" may have been issued but not yet completed, including the uevent which would trigger udev events. Now, since the partition already exists, we really shouldn't see the symlink go away, which is the _really_ strange part.

So, I think we should fire up udevadm monitor

sudo udevadm monitor --property --kernel --udev --subsystem-match=block

And then issue growpart commands on a disk whose partition can be grown.

You can launch an instance in Azure with user-data to disable growpart, then issue the same growpart command after logging in. I'd also run growpart under bash -x for a trace so we can see what's happening.

Lastly, we can also use an inotify watch on /dev/disk/by-partuuid

% inotifywait -mr --timefmt "%s" --format "%T %w%f %e" /dev/disk/by-partuuid
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
1563987982 /dev/disk/by-partuuid/foo CREATE
1563987982 /dev/disk/by-partuuid/foo OPEN
1563987982 /dev/disk/by-partuuid/foo ATTRIB
1563987982 /dev/disk/by-partuuid/foo CLOSE_WRITE,CLOSE
1563987985 /dev/disk/by-partuuid/foo DELETE

Revision history for this message
Dan Watkins (oddbloke) wrote :

So manually running `growpart /dev/sda 1` resulted in the following inotifywait output:

1563992577 /dev/disk/by-partuuid/60a13b74-44c8-4d0d-b415-6c697e398204 ATTRIB
1563992577 /dev/disk/by-partuuid/43b8aeb4-14df-4ae2-8a6a-d3ada9ea01dd ATTRIB
1563992577 /dev/disk/by-partuuid/7f427e61-bcb8-491d-b067-689ba6f4702b ATTRIB

and the attached `udevadm monitor` output. The fact we don't see a removal here suggests we aren't tickling the bug, so I'm going to launch another instance and see if running the cloud-init module makes a difference.

Revision history for this message
Dan Watkins (oddbloke) wrote :

I tried a few more times to get a manual reproducer and failed. So I've created two systemd units, one[0] to collect the `udevadm monitor` info, and one[1] to collect the `inotifywait` info. (As these commands output to stdout, their output can be found in the systemd journal.) I put these in place, and iterated a few reboots to get them working OK.

I then `cloud-init clean`'d and captured the image (following Azure's instructions[2]). I launched my captured image multiple times (I didn't keep track of exactly how many, but N>10 for sure) and didn't hit a single failure. So either (a) I got "lucky", (b) the extra instrumentation fixes the race condition, (c) the differences between my image and a truly pristine one are too big, (d) how I'm launching instances and how the CPC test framework is launching instances is sufficiently different to not tickle the bug, or, of course, (e) something else entirely.

Tobi, I know we're pushing things a bit now, but are you able to add these two units to an image to see what they produce? (Make sure to enable them as well as install them! It took me two reboots to understand why they weren't executing. ¬.¬)

(The `inotifywait` unit[1] appears to not yet be perfect, as it sometimes emits "Current command vanished from the unit file, execution of the command list won't be resumed" and then stops emitting anything else. I haven't been able to work out what this error message actually means either via search or by looking at the systemd source code briefly, and asking in #systemd on Freenode hasn't yet yielded any further info.)

[0] https://paste.ubuntu.com/p/FgbZqwZR6Y/
[1] https://paste.ubuntu.com/p/57HkmCKN67/
[2] https://docs.microsoft.com/en-us/azure/virtual-machines/linux/capture-image

Changed in cloud-init:
status: Confirmed → In Progress
assignee: nobody → Dan Watkins (daniel-thewatkins)
Revision history for this message
Dan Watkins (oddbloke) wrote :

(Marked as In Progress, though depending on the outcome of the triage we're doing ATM this might go back to Triaged rather than get fixed immediately.)

Revision history for this message
Tobias Koch (tobijk) wrote :

Is the problem with inotifywait an instance of this systemd bug [1]? I don't see a work-around mentioned.

With the two services from above included:

failed instance: 52.178.156.32
passed instance: 13.94.113.245

[1] https://github.com/systemd/systemd/issues/12258

Revision history for this message
Dan Watkins (oddbloke) wrote :

Thanks for the continued debugging assistance! It definitely could be that bug, though looking at the failing instance, we're getting enough information from the inotifywait service, so thankfully I don't think we need to worry about it:

Jul 26 10:39:01.324469 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx inotifywait[462]: 1564137541 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 DELETE

Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, so we're definitely seeing a DELETE on failure, and it _looks_ like it happens during the growpart:

2019-07-26 10:39:00,725 - util.py[DEBUG]: Running command ['growpart', '/dev/sda', '1'] with allowed return codes [0] (shell=False, capture=True)
Jul 26 10:39:01.324469 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx inotifywait[462]: 1564137541 /dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 DELETE
2019-07-26 10:39:01,339 - util.py[DEBUG]: resize_devices took 1.334 seconds

Looking at the `udevadm monitor` logs from each instance, we only have one udev/kernel remove on each (listed twice, once for the kernel event, once for the corresponding udev event):

ubuntu@<failing>:~$ journalctl -o short-precise -u udevmon.service | grep ACTION=remove -B1
Jul 26 10:39:03.571001 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: KERNEL[36.612409] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:39:03.571001 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: ACTION=remove
--
Jul 26 10:39:03.588454 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: UDEV [36.629717] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:39:03.588454 tobijk-x-alan-eoan-base-tgqxhgulkkndpdyjlolx udevadm[251]: ACTION=remove

ubuntu@<passing>:~$ journalctl -o short-precise -u udevmon.service | grep ACTION=remove -B1
Jul 26 10:43:33.613194 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: KERNEL[33.632423] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:43:33.613194 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: ACTION=remove
--
Jul 26 10:43:33.627533 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: UDEV [33.644486] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host3/target3:0:1/3:0:1:0/block/sdb/sdb1 (block)
Jul 26 10:43:33.627533 tobijk-x-alan-eoan-base-afjyuaryhtemavrzofcr udevadm[247]: ACTION=remove

I'm going to dig in to the logs a little further to see if I can find anything.

Revision history for this message
Dan Watkins (oddbloke) wrote :

I've confirmed that the udev and kernel ACTION=change event on /dev/sda is the same across the failing and succeeding instances (except for some timestamps which I edited out of these two documents, and the ordering/naming of the DEVLINKS):

$ diff -u failing.sda_event success.sda_event
--- failing.sda_event 2019-07-26 10:25:13.088884853 -0400
+++ success.sda_event 2019-07-26 10:23:51.465073329 -0400
@@ -41,5 +41,5 @@
 ID_FS_TYPE=
 MAJOR=8
 MINOR=0
-DEVLINKS=/dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9 /dev/disk/azure/root /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9 /dev/disk/cloud/azure_root /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9
+DEVLINKS=/dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0 /dev/disk/azure/root /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9 /dev/disk/cloud/azure_root
 TAGS=:systemd:

After the /dev/sda udev event is logged, each partition on /dev/sda {sda1, sda14, sda15} then gets an ACTION=change udev event logged. One difference between the two instances is the order of these events. On the failing system, the order is (sda1, sda15, sda14), and on the succeeding system, it's (sda15, sda14, sda1). This vaguely suggests a race is involved, as the timing is different, but it's certainly not categorical. It's worth noting that the _kernel_ events (which are all emitted before even the sda udev event is processed) happen in the same order on each instance: sda, sda1, sda14, sda15.

Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, the sda1 events provide a substantial difference:

--- failing.sda1_event 2019-07-26 10:42:50.424486349 -0400
+++ success.sda1_event 2019-07-26 12:02:54.536766079 -0400
@@ -53,9 +53,16 @@
 ID_FS_VERSION=1.0
 ID_FS_TYPE=ext4
 ID_FS_USAGE=filesystem
+ID_PART_ENTRY_SCHEME=gpt
+ID_PART_ENTRY_UUID=23f70cea-8174-4426-bc82-462217b97245
+ID_PART_ENTRY_TYPE=0fc63daf-8483-4772-8e79-3d69d8477de4
+ID_PART_ENTRY_NUMBER=1
+ID_PART_ENTRY_OFFSET=227328
+ID_PART_ENTRY_SIZE=62689247
+ID_PART_ENTRY_DISK=8:0
 fabric_name=root
 .ID_FS_TYPE_NEW=ext4
 MAJOR=8
 MINOR=1
-DEVLINKS=/dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/cloud/azure_root-part1 /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9-part1 /dev/disk/by-uuid/d800ba44-3054-4cb3-81f7-dfe32f5e8b48 /dev/disk/by-label/cloudimg-rootfs /dev/disk/azure/root-part1 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0-part1
+DEVLINKS=/dev/disk/by-partuuid/23f70cea-8174-4426-bc82-462217b97245 /dev/disk/by-id/wwn-0x60022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-uuid/d800ba44-3054-4cb3-81f7-dfe32f5e8b48 /dev/disk/by-label/cloudimg-rootfs /dev/disk/by-id/scsi-14d53465420202020d7a4872310ff41908c3b8d8a6789eec9-part1 /dev/disk/azure/root-part1 /dev/disk/by-id/scsi-360022480d7a4872310ff8d8a6789eec9-part1 /dev/disk/by-path/acpi-VMBUS:01-vmbus-00000000000088990000000000000000-lun-0-part1 /dev/disk/cloud/azure_root-part1
 TAGS=:systemd:

As we can see here, there are quite a few differences, which appear to reflect what we're seeing: DEVLINKS on the failing instance doesn't include the by-partuuid path, and we're missing all of the ID_PART_ENTRY_* lines from its udev event.

So looks like the next step is probably to determine why the udev event doesn't contain what it should!

Revision history for this message
Dan Watkins (oddbloke) wrote :

After some discussion with rharper on IRC[0], our working hypothesis is that the events that are being processed are as a result of the sgdisk command that does the partition table modification[1], _not_ the pt_update call (which boils down to `partx --update`[2]). As a result, we think we are racing the re-read of the partition table; this is consonant with the different ordering of the udev events on failure.

To test this hypothesis, I'm going to create a cloud-init package which performs its own partition table re-read (which should trigger udev events for those partitions) and `udevadm settle` after the resize. This _should_ mean that the symlink is correctly re-created. If this succeeds, we can examine what change we need to make to cloud-utils.
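
(For reference, the experiment described above boils down to roughly the following; this is a simplified sketch, not the actual cloud-init patch:)

import subprocess

def reread_and_settle(disk):
    # Force a partition-table re-read after the resize, then wait for udev
    # to finish processing whatever events that generates.
    subprocess.run(["partx", "--update", disk], check=True)
    subprocess.run(["udevadm", "settle"], check=True)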

Tobi, I'll upload that package to the same PPA as the others, and will let you know here when it's ready. We'll want an image with that cloud-init as well as the debugging systemd units.

[0] Which should eventually show up here: https://irclogs.ubuntu.com/2019/07/26/%23cloud-init.html
[1] https://git.launchpad.net/cloud-utils/tree/bin/growpart#n505
[2] https://git.launchpad.net/cloud-utils/tree/bin/growpart#n650

Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, the PPA now has a package with the partx call added, and I've tested that it does something along the lines of what we want in a test instance of my own. Tobi, when you have time, please build a test image with the PPA cloud-init and the debugging units and point us at a failed and successful instance again. Thanks!

Revision history for this message
Tobias Koch (tobijk) wrote :

passed instance: 104.41.204.32
failed instance: 104.41.207.197

Revision history for this message
Dan Watkins (oddbloke) wrote :

Thanks! TL;DR: the partx call didn't change much.

On the failing instance, this is when the partx call happens:

2019-07-29 12:09:25,391 - util.py[DEBUG]: Running command ['partx', '--update', '/dev/sda']

Looking in the journal at that timestamp, I see (lines omitted and/or edited for brevity):

2019-07-29T12:09:25.395099+0000 udevadm[246]: KERNEL[27.837161] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.396119+0000 tobijk-x-alan-eoan-base-uutvyfllcflbntzjhhes udevadm[246]: UDEV [27.838088] remove /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.397301+0000 udevadm[246]: KERNEL[27.838192] add /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)
2019-07-29T12:09:25.405384+0000 inotifywait[462]: 1564402165 /dev/disk/by-partuuid/9c72c02a-6134-4811-86d8-feee960b40b2 CREATE
2019-07-29T12:09:25.418906+0000 udevadm[246]: UDEV [27.860977] add /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sda/sda14 (block)

These remove/add lines for sda14 are consistent with what I saw when just running `partx` by hand in an instance.

Looking at the passing instance, there is one noticeable difference: the sda14 remove/add KERNEL events are emitted to the log _before_ any of the UDEV events from the resizing are emitted. This means that:

* On the failing instance, we see the first resizing KERNEL event (sda) emitted at 12:09:25.347818 and the sda1 UDEV event emitted at 12:09:25.372478; .025s.
* On the passing instance, we see the first resizing KERNEL event (sda) emitted at 12:14:21.925924 and the sda1 UDEV event emitted at 12:14:22.002377; .076s.

(And, once again, we see that the order in which udev processes those events goes sda15, sda14, sda1.)

Revision history for this message
Dan Watkins (oddbloke) wrote :

OK, I chatted briefly with Ryan and he agrees with my assessment that this is happening too deep underneath us for us to readily work around it. The two obvious places to look are udev and the kernel.

To help triangulate, Tobi, could you attempt an eoan build and test using the generic kernel instead of linux-azure?

Revision history for this message
Tobias Koch (tobijk) wrote :

Yep, this doesn't happen with linux-generic.

Revision history for this message
Dan Watkins (oddbloke) wrote :

Now one wrinkle about it being a linux-azure bug: if you _aren't_ seeing this on bionic but _were_ seeing it on cosmic, then there must be something else involved, because I believe (based on the publishing history[0]) that cosmic was on the same kernel version as bionic.

This could, of course, still be a kernel bug that's just not hit by bionic's userspace. But, to be sure of where we stand, can you confirm the statement in the original report that this was an issue on cosmic? Do you have enough test history to determine when this started happening?

[0] https://launchpad.net/ubuntu/+source/linux-meta-azure/+publishinghistory

Revision history for this message
Dan Watkins (oddbloke) wrote :

(Oh, and thank you for that test run!)

Revision history for this message
Tobias Koch (tobijk) wrote :

Oh, good observation, Dan. Yes, it was happening on Cosmic and later. But I cannot say when it started. Now, in a moment of despair, I tried something very unorthodox: I copied udevadm from Bionic to the Eoan image and ran the tests again. Lo and behold, all pass.

no longer affects: linux-azure (Ubuntu)
Revision history for this message
Dan Watkins (oddbloke) wrote :

That certainly is unorthodox! When you say "copied udevadm", what specifically do you mean? Just the binary, the udev package, ...?

Revision history for this message
Tobias Koch (tobijk) wrote :

When I say unorthodox, I mean I copied the binary ;) So that should narrow it down quite a bit. Unless there is more funniness involved.

Revision history for this message
Dan Watkins (oddbloke) wrote :

So, to reset on status: I believe that we've narrowed this down to a systemd/udev problem, as it goes away when a different udevadm binary is used. I don't have a good sense of how to go about digging into that issue, so I don't think I can take it much further.

(Also, as it appears to be a systemd issue, I'm going to mark the cloud-init task Invalid. Feel free to move it back to New if you believe this is a mistake!)

Changed in cloud-init:
status: In Progress → Invalid
assignee: Dan Watkins (daniel-thewatkins) → nobody
Revision history for this message
Tobias Koch (tobijk) wrote :

@daniel-thewatkins, I'm not convinced that this bug is invalid for cloud-init. After reading through all of this again, I still don't understand what guarantee there is that, when `udevadm settle` is called, all relevant events have already been queued.

Copying udevadm over, and thereby suppressing the error, suggests that maybe the event queue handling is suspect, but on the other hand, it might just be that previous versions were slower at something and the 10ms window discussed above was always exceeded because of that anyway.

I'm not saying there cannot also be a bug somewhere else. But if there is no specification that says there cannot under any circumstances be a race condition in what cloud-init is doing, then cloud-init should handle this more robustly.

I'm not an expert at that level, but in a world of multi-core CPUs and fancy schedulers, invoking command-line tools in a certain order does not seem to preclude the possibility of a race in how the event is submitted, routed and queued, absent an explicit locking mechanism.

Tobias Koch (tobijk)
Changed in cloud-init:
status: Invalid → New
Revision history for this message
Ryan Harper (raharper) wrote :

The sequence is:

exec growpart
  exec sgdisk --info # read-only
  exec sgdisk --pretend # read-only
  exec sgdisk --backup # read-only copy
  # modification of disk starts
  exec sgdisk --move-second-header \
              --delete=PART \
              --new=PART \
              --typecode --partition-guid --change-name
  # now that sgdisk has *closed* the filehandle on the disk, systemd-udevd will
  # get an inotify signal and trigger udevd to run udev scripts on the disk.
  # this includes the *removal* of symlinks due to the --delete portion of the sgdisk call,
  # and following the removal, the --new portion will trigger the 'add' run of the rules,
  # which recreates the symlinks.

  # update kernel partition sizes; this is an ioctl, so it does not trigger udev events
  exec partx --update
  # the kernel has the new partition sizes, and udev scripts/events are all queued (and possibly in flight)
exit growpart

cloud-init then invokes the get_size() operation, which does:
   # this is where the race occurs, if the symlink created by udev is *not* present
   os.open(/dev/disk/by-id/fancy-symlink-with-partuuid-points-to-sdb1)

Dan had put a udevadm settle in this spot, like so:

def get_size(filename):
    util.subp(['udevadm', 'settle'])
    os.open(....)

So, you're suggesting that somehow _not all_ of the uevents triggered by the sgdisk command in growpart would have been queued before we call udevadm settle?

If some other events are happening, how is cloud-init to know, such that it can take action to "handle this race" more robustly?

Lastly, if there is a *race* in the symlink creation/removal, or a delay in uevent propagation, why is that a userspace issue, let alone a cloud-init issue? This isn't universally reproducible; rather, it happens under pretty narrow circumstances between certain kernels and udevs, all the while the growpart/cloud-init code remains the same.
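
(For reference, the get_size() in question, per the traceback in the bug description, is roughly the sketch below; the settle call is the experimental addition Dan tested in the PPA build, not stock cloud-init, and it did not close the window because by then udev has already (mis)processed the events.)

import os
import subprocess

def get_size(filename):
    subprocess.run(["udevadm", "settle"], check=True)
    # Raises FileNotFoundError if the by-partuuid symlink is missing.
    fd = os.open(filename, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)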

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Dan Watkins (oddbloke) wrote :

[N.B. I wrote the below before I saw Ryan's comment, so there is some repetition.]

OK, I've spent some time catching up on this properly so I can summarise: per comment #24, the issue is that when udev processes the events emitted by the kernel, it (sometimes) doesn't determine the correct partition information. The kernel _does_ emit all the events we would expect, and udev _does_ handle all the events we would expect (which is to say that `udevadm settle` doesn't change behaviour here, it merely ensures that the broken behaviour has completed before we proceed). The hypothesised race condition is somewhere between the kernel and udev: I believe the kernel event is emitted before the partition table has necessarily been fully updated so when udev processes the event and reads the partition table, sometimes it finds the partition and sometimes it doesn't. To be clear, the kernel event generation and the buggy udev event handling all happens as a result of the resize command, _not_ as a result of anything else cloud-init runs subsequently.

So as far as I can tell, this bug would occur regardless of what runs the resize command, and no matter what commands are executed after the resize command. (It might be possible to work around this bug by issuing commands that force a re-read of the partition table on a disk, for example, but this bug _would_ still have occurred before then.)

cloud-init could potentially work around a (kernel|systemd) that isn't handling partitions correctly, but we really shouldn't have to. Until we're satisfied that they cannot actually be fixed, we shouldn't do that. (I am _not_ convinced that this cannot be fixed in (the kernel|systemd), because using a different kernel and using a different udevadm have both caused the issue to stop reproducing.)

So, let me be a little more categorical. The information we have at the moment indicates an issue in the interactions between the kernel and udev on partition resize. cloud-init's involvement is merely as the initiator of that resize. Until we have more information that indicates the issue to be in cloud-init, this isn't a valid cloud-init issue. Once we have more information from the kernel and/or systemd folks, if it indicates that cloud-init _is_ at fault, please move this back to New.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Dan had put a udevadm settle in this spot like so
>
> def get_size(filename)
> util.subp(['udevadm', 'settle'])
> os.open(....)

if you know you've just changed (e.g.) /dev/sda, possibly its kernel-generated udev events just haven't reached udev yet, so the immediate call to 'udevadm settle' has nothing to wait for; maybe you should tell udev to explicitly request a new full set of events and settle on that?

udevadm trigger --settle /dev/sda

Revision history for this message
Dan Watkins (oddbloke) wrote :

Looking at the comment timestamps, Dan probably didn't see my comment, but to reiterate: all the events we expect to be processed _are_ processed; the issue is that when they are processed they don't always end up with the correct partition information.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1834875] Re: cloud-init growpart race with udev

On Fri, Aug 23, 2019 at 10:00 AM Dan Streetman <email address hidden>
wrote:

> > Dan had put a udevadm settle in this spot like so
> >
> > def get_size(filename)
> > util.subp(['udevadm', 'settle'])
> > os.open(....)
>
> if you know you've just changed (e.g.) /dev/sda, possibly its kernel-
> generated udev events just haven't reached udev yet, so the immediate
> call to 'udev settle' has nothing to wait for; maybe you should tell
> udev to explicitly request a new full set of events and settle on that?
>
> udevadm trigger --settle /dev/sda
>

Possible; though in the case that we don't race, we get to repeat all of the rules. If we can sort out what changes in the kernel/udev expose this race, then maybe we can see if there's something that cloud-init/growpart/etc. can do.

Revision history for this message
Ryan Harper (raharper) wrote :

On Fri, Aug 23, 2019 at 10:21 AM Dan Watkins <email address hidden>
wrote:

> Looking at the comment timestamps, Dan probably didn't see my comment,
> but to reiterate: all the events we expect to be processed _are_
> processed, the issue is that when they are processed they don't always
> end up with the correct partition information.
>

One bit of info we don't have is a timestamp for when the BLKPG ioctl from
partx --update is run. That would confirm the order of things.

Revision history for this message
Tobias Koch (tobijk) wrote :

I may be missing the point, but the symlink in question is eventually recreated; does that tell us anything? This here

> Dan had put a udevadm settle in this spot like so
>
> def get_size(filename)
> util.subp(['udevadm', 'settle'])
> os.open(....)

looks to me like the event queue should be empty now, but how do you know userspace has acted on what came out of it? Is it strictly required that any event is cleared only after the corresponding action has completed? If yes, we can probably blame udev. If not, cloud-init should wait on the link to appear.

Revision history for this message
Dan Watkins (oddbloke) wrote :

On Fri, Aug 23, 2019 at 08:23:06PM -0000, Tobias Koch wrote:
> I may be missing the point, but the symlink in question is eventually
> recreated, does that tell us anything? This here

Yes, this is more supporting evidence that this is a race condition; the
state of the system both before and some time after the resize is
consistent, the kernel/udev just lose track of the existence of the
partition for some amount of time after the resize happens.

> > Dan had put a udevadm settle in this spot like so
> >
> > def get_size(filename)
> > util.subp(['udevadm', 'settle'])
> > os.open(....)
>
> looks to me like the event queue should be empty now, but how do you
> know userspace has acted on what came out of it?

The symlink exists before the resize, so if it's missing then we know
that the udev events have been processed. The DEVLINKS line in the diff
in comment #24 shows that udev doesn't think that symlink should exist.
We can see the symlink being deleted in the logging in comment #22.

> Is it strictly required that any event is cleared only after the
> corresponding action has completed? If yes, we can probably blame
> udev. If not, cloud-init should wait on the link to appear.

The link is not going to appear without something prompting a re-read of
the partition table from udev. I don't believe this should be necessary
if the kernel/udev are behaving properly.

(Odds are that whatever causes it to be recreated later in boot would be
blocked by cloud-init waiting.)

Revision history for this message
Tobias Koch (tobijk) wrote :

> (Odds are that whatever causes it to be recreated later in boot would be
> blocked by cloud-init waiting.)

But that's not happening. The instance does boot normally; the only service degraded is cloud-init, and there is no significant delay either.

So conversely, if I put a loop into cloud-init and just waited on the symlink to appear, and if that worked with minimal delay, would that refute the above?
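
(For illustration, such a loop might look like the sketch below; the helper name, timeout and poll interval are made up, and this is not proposed cloud-init code:)

import os
import time

def wait_for_devlink(path, timeout=5.0, interval=0.1):
    # Poll for a /dev/disk/by-* symlink to (re)appear; True on success.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return False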

Revision history for this message
Ryan Harper (raharper) wrote :

On Mon, Aug 26, 2019 at 4:05 AM Tobias Koch <email address hidden>
wrote:

> > (Odds are that whatever causes it to be recreated later in boot would be
> > blocked by cloud-init waiting.)
>
> But that's not happening. The instance does boot normally, the only
> service degraded is cloud-init and there is no significant delay either.
>

> So conversely, if I put a loop into cloud-init and just waited on the
> symlink to appear and if that worked with minimal delay, would that
> refute the above?
>

That's still a workaround for something where we don't exactly know why it is racing, nor why it isn't more widespread. The code in cloud-init, growpart, sgdisk and partx is stable (it has not changed significantly in some time).

We don't have a root cause for the race at this time. When cloud-init invokes growpart the symlink exists, and when growpart returns sometimes it does not. If anything, growpart should address the race itself; and at this point, it would have to pick up a workaround as well.

Let's at least make sure we understand the actual race before we look
further into workarounds.

From what I can see of what growpart is doing, the sgdisk command will clear the partition table (this involves removing the partition and then re-adding it, which triggers udev). Further, Dan's logs show that partx --update can also trigger a remove and an add. Looking at the partx update code: *sometimes* it will remove and add; however, if the partition to be updated *exists*, it will instead issue an update IOCTL which only updates the size value in sysfs.

https://github.com/karelzak/util-linux/blob/53ae7d60cfeacd4e87bfe6fcc015b58b78ef4555/disk-utils/partx.c#L451

Which makes me think that in the successful path, we're seeing partx
--update take the partx_resize_partition path, which submits the resize
IOCTL

https://github.com/karelzak/util-linux/blob/917f53cf13c36d32c175f80f2074576595830573/include/partx.h#L54

which in the Linux kernel does:

https://elixir.bootlin.com/linux/latest/source/block/ioctl.c#L100

and just updates the size value in sysfs:

https://elixir.bootlin.com/linux/latest/source/block/ioctl.c#L146

which AFAICT does not emit any new uevents.

Lastly, in either path (partx updates vs. partx removes/adds), invoking a udevadm settle after the binary has exited is the reasonable way to ensure that, *if* any uevents were created, they are processed.

growpart could add udevadm settle code; so could cloud-init. We actually
did that in our first test package and that did not result in ensuring the
symlink was present.

All of this suggests to me that *something* isn't processing the sequence of uevents in such a way that, once they've all been processed, we have the symlink.

We must be missing some other bit of information in the failing path where the symlink is eventually recreated (possibly due to some other write or close on the disk which re-triggers the rules).
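
(For context, the "update IOCTL" mentioned above, BLKPG with BLKPG_RESIZE_PARTITION, can be exercised directly; the following is a rough ctypes sketch based on the struct layout in linux/blkpg.h, purely illustrative and not the partx or growpart code:)

import ctypes
import fcntl
import os

BLKPG = 0x1269                   # _IO(0x12, 105)
BLKPG_RESIZE_PARTITION = 3

class BlkpgPartition(ctypes.Structure):
    _fields_ = [("start", ctypes.c_longlong),   # partition start, in bytes
                ("length", ctypes.c_longlong),  # new partition length, in bytes
                ("pno", ctypes.c_int),          # partition number
                ("devname", ctypes.c_char * 64),
                ("volname", ctypes.c_char * 64)]

class BlkpgIoctlArg(ctypes.Structure):
    _fields_ = [("op", ctypes.c_int),
                ("flags", ctypes.c_int),
                ("datalen", ctypes.c_int),
                ("data", ctypes.c_void_p)]

def resize_partition(disk, pno, start_bytes, length_bytes):
    # Tell the kernel about one partition's new size without re-reading the
    # whole partition table (this is the path partx --update takes when the
    # partition already exists).
    part = BlkpgPartition(start=start_bytes, length=length_bytes, pno=pno)
    arg = BlkpgIoctlArg(op=BLKPG_RESIZE_PARTITION, flags=0,
                        datalen=ctypes.sizeof(part),
                        data=ctypes.cast(ctypes.byref(part), ctypes.c_void_p))
    fd = os.open(disk, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKPG, bytes(arg))
    finally:
        os.close(fd)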

Revision history for this message
Scott Moser (smoser) wrote :

If there is a race, or a need to wait, it almost certainly is in cloud-utils (growpart). If you use growpart to grow a partition, the result should be that the partition has been grown and is ready. The caller should not be expected to need additional knowledge (to call 'udevadm settle') or to have to wait some amount of time.

Revision history for this message
Dan Watkins (oddbloke) wrote :

On Mon, Aug 26, 2019 at 08:58:46AM -0000, Tobias Koch wrote:
> > (Odds are that whatever causes it to be recreated later in boot would be
> > blocked by cloud-init waiting.)
>
> But that's not happening. The instance does boot normally, the only
> service degraded is cloud-init and there is no significant delay either.

Right, I was talking about the proposed addition of a wait in
cloud-init. We wouldn't expect to see this today.

> So conversely, if I put a loop into cloud-init and just waited on the
> symlink to appear and if that worked with minimal delay, would that
> refute the above?

What would the loop do? `udevadm settle` doesn't do anything, because
the events have all been processed, they've just been processed
incorrectly.

Revision history for this message
Dan Watkins (oddbloke) wrote :

On Mon, Aug 26, 2019 at 03:17:10PM -0000, Scott Moser wrote:
> If there is a race, or a need to wait, it almost certainly is in cloud-
> utils (growpart).

I would still like us to get a systemd/kernel person to take a look at
this, because I think that the race is somewhere further down than
cloud-utils. (I definitely agree that any workarounds we end up
deciding to use would be best placed in cloud-utils, though!)

> If you use growpart to grow a partition, the result
> should be that that is done, and it is ready. The caller should not
> expect that they have to know additional information (to call 'udevadm
> settle') or should have to wait some amount of time.

+1

Revision history for this message
Dan Streetman (ddstreet) wrote :

Pretty sure you need to actually try what I suggested in comment 40, but of course use the specific dev/partition you're missing the udev symlink for.

e.g. if you need the symlink for /dev/sda1 then instead of:

$ growpart /dev/sda 1 ; udevadm settle ; ls /dev/disk/by-uuid/$UUID

you should do

$ growpart /dev/sda 1 ; udevadm trigger -w /dev/sda1 ; ls /dev/disk/by-uuid/$UUID

The -w param is relatively new to udevadm, so for older releases like Bionic do 'udevadm trigger /dev/sda1 ; udevadm settle'.
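
(As a rough illustration of that suggestion, with the fallback covering releases whose udevadm lacks -w; this is a hypothetical helper, not existing cloud-init or cloud-utils code:)

import subprocess

def trigger_and_wait(devpath):
    try:
        # Newer udevadm: trigger the device and wait for it in one step.
        subprocess.run(["udevadm", "trigger", "-w", devpath], check=True)
    except subprocess.CalledProcessError:
        # Older releases (e.g. bionic): trigger, then settle separately.
        subprocess.run(["udevadm", "trigger", devpath], check=True)
        subprocess.run(["udevadm", "settle"], check=True)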

Revision history for this message
Dan Watkins (oddbloke) wrote :

Dan, are you asking for that to get more debugging information, or are you suggesting that as a fix for this?

(I'm yet to be convinced that this isn't a regression in (the kernel|src:systemd), so I don't think that we should have to run _any_ additional commands to address this.)

Revision history for this message
Dan Streetman (ddstreet) wrote :

I think that is the correct way to wait for udev to process a specific device's changes.

Revision history for this message
Dan Watkins (oddbloke) wrote :

I don't think we're looking for a way to wait for udev. For this bug to present, udev has to have processed all of the available events (i.e. there's nothing left to wait for), and have done so incorrectly (probably as a result of its interaction with the Azure kernel, as this doesn't reproduce on the generic kernel).

And, to be clear, your proposal isn't just a "wait". You're proposing retriggering events and _then_ waiting. This sounds like a potential workaround, but I believe it would further mask the underlying problem. We (as Ubuntu) control every part of this, so I think we should be able to fix this properly, rather than working around a linux-azure/udev issue in cloud-init and/or cloud-image-utils.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> You're proposing retriggering events and _then_ waiting

I'm proposing make sure that the uevent(s) for the specific partition you're interested in have started processing before you start waiting, which is what the retriggering does.

But it could be a kernel bug, sure.

Revision history for this message
Dan Watkins (oddbloke) wrote :

The lack of events is not the issue. We see all the events from the kernel that we would expect to see during a resize, and we see the same number (and type) of events on both the passing and the failing instances.

The problem is that if udev processes the sda1 event first then, at least some of the time, it doesn't detect that sda1 has a partition on it, so it doesn't create the by-partuuid symlink for that partition. See comments #23 and #24 for more detail.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> The lack of events is not the issue

I'm not saying it is. But you seem to have it under control; I was just offering my opinion.

tags: added: id-5dbb311b74e3f364d8f98c56
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Some observations:
* growpart uses sfdisk without the --no-tell-kernel option, meaning that it does notify the kernel about partition changes
* growpart later calls partx, which may be redundant / cause no changes or events
* as a side note, partprobe and blockdev --rereadpt can also be used to reread partition tables; I'm not sure of the difference between them
* growpart does not take an exclusive lock on the device, meaning sgdisk is known to be racy with udev events

Imho the sequence of commands should be:
* take a flock on the device, to neutralise udev
* modify the device with sfdisk
* reread the partition table (I would say with blockdev --rereadpt, rather than partx/partprobe)
* release the flock
* udevadm trigger --action=add --wait device (or trigger && settle)

This ensures that no udev events are processed for the device whilst we are operating on and rereading its partitions; then we release the lock, at which point everything has to be quiet and steady: trigger, settle, done. (A rough sketch of this sequence follows.)
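
import fcntl
import subprocess

# Hypothetical sketch of the sequence above; the sfdisk grow syntax and other
# command arguments are assumptions, not growpart's actual invocation.
def grow_partition_locked(disk, partnum):
    with open(disk, "rb") as dev:
        # An exclusive flock on the whole-disk device makes udevd skip rule
        # processing for events on it while the partition table is modified.
        fcntl.flock(dev, fcntl.LOCK_EX)
        try:
            # Grow partition <partnum> to fill the disk (assumed sfdisk syntax).
            subprocess.run(["sfdisk", "--no-reread", "-N", str(partnum), disk],
                           input=", +", text=True, check=True)
            # Re-read the partition table while still holding the lock.
            subprocess.run(["blockdev", "--rereadpt", disk], check=True)
        finally:
            fcntl.flock(dev, fcntl.LOCK_UN)
    # Lock released: re-run the skipped rules and wait for them to complete.
    subprocess.run(["udevadm", "trigger", "--action=add", disk], check=True)
    subprocess.run(["udevadm", "settle"], check=True)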

See:
       sfdisk uses BLKRRPART (reread partition table) ioctl to make sure that the device is not used
       by system or another tools (see also --no-reread). It's possible that this feature or another
       sfdisk activity races with udevd. The recommended way how to avoid possible collisions is to
       use exclusive flock for the whole-disk device to serialize device access. The exclusive lock
       will cause udevd to skip the event handling on the device. For example:

              flock /dev/sdc sfdisk /dev/sdc

       Note, this semantic is not currently supported by udevd for MD and DM devices.

at http://manpages.ubuntu.com/manpages/eoan/en/man8/sfdisk.8.html

Changed in systemd (Ubuntu):
status: New → Incomplete
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@Cloud-init team, do we want to try changing cloud-utils to use a lock?

And have a canary, like "only use the locked codepath in this region", such that we can assert through testing that this no longer happens with the new code, but does with the old code.

Revision history for this message
Ryan Harper (raharper) wrote :

A couple of comments on the suggested path:

> Imho the sequency of commands should be:
> * take flock on the device, to neutralise udev

+1 on this approach. Do you know if the flock will block
systemd's inotify write watch on the block device which triggers
udevd? This is the typical race we see with partition creation
and rules executing.

> * modify device with sfdisk
> * reread partitions tables (i would say with blockdev --rereadpt, rather than partx/partprobe)

I'm not sure we can use blockdev --rereadpt, as we are operating upon the root disk we're booted on, and my understanding is that the ioctl that partx uses is the only way to update the kernel partition table while the disk is in use. Otherwise we'd see the usual warning, like when you fdisk your booted device and it says the disk is busy and the partition table cannot be re-read.

> * release the flock

+1

> * udevadm trigger --action=add --wait device (or trigger && settle)

I don't relish the idea of *re-adding* actions on the disk again, since the partx update should have already emitted the uevents associated with the new partitions. However, we could do this as a way to force reloading of everything. I'd like to withhold judgement on whether we need this until after testing with the flock on the device.

> And like have a canary "only use locked codepath on this region" such that we can assert through testing that this no longer happens with new code, but does with old code.

The change to cloud-utils growpart could add a flag (--use-flock) so cloud-init could emit different log messages about which path it takes, including a warning if we cannot use flock (i.e., you may race).

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

flock will not block inotify or udev events being emitted.

See
https://github.com/systemd/systemd/blob/master/src/udev/udevd.c#L322
https://github.com/systemd/systemd/blob/master/src/udev/udevd.c#L409

it will prevent udevd from running the rules against it. Thus effectively the event will be fired and done, but nothing actually executed for it.

" /* apply rules, create node, symlinks */"

Also, most importantly, udevd itself takes locks on devices; thus if one cannot take a lock on a device, it means it shouldn't be touched yet, as udevd is still processing it.

Yeah, I read the code, and both call ioctls; I have no idea which one is better, or what the difference is. If partx works, cool. But also, I thought udev should be updating partition tables all by itself. I somehow wonder if we even need the partx call, if we properly flock the device and trigger udev after everything is done.

Udev rules should be safe to be retriggered like that. But also, we should retrigger the expected events which have been skipped. So does growpart create a partition? Move it? Delete/recreate one? I.e. does ADD happen? Or REMOVE & ADD? Or maybe just MOVE or CHANGE? Do we have logs of the emitted events already?

I don't like flags, as then we'll have to support them forever =) Maybe an env variable? Or simply change it in Focal and compare Focal vs Eoan?

Revision history for this message
Ryan Harper (raharper) wrote :

> it will prevent udevd from running the rules against it. Thus effectively the event will be fired and done, but nothing actually executed for it.

Interesting. I suspect this is the race we see: the events were emitted, but no actions were taken (i.e., we didn't get our by-partuuid symlink created).

> I somehow wonder if we even need partx call, if we properly flock the device and trigger udev after everything is done.

Growpart is modifying the partition table of the root device which is already mounted. We will not be able to flock the root device since it's open and mounted. This is precisely why we need to use the partx call as it updates the kernel partition table mapping without requiring an exclusive lock on the device.

> So does growpart create partition? move it? delete/recreate one? i.e does ADD happen? or like REMOVE & ADD? or maybe it's like just MOVE or CHANGE? Do we have logs of the emitted events already?

growpart, on GPT, uses sgdisk, which deletes and then recreates the existing partition with a larger size.

The logs are above, see comment #22 -> #24 and #38.

> Don't like flags, as then we'll have to supported forever =) maybe env variable? or like simply change in focal and compare focal vs eoan?

This may not matter if we can't flock. I suspect we'll need to use the ugly re-triggering of events for the disk. growpart could take the disk it's pointed at, examine the existing udev symlinks associated with it (via udevadm info --export), perform its operation, and then, post-operation, trigger the add event on the disk, settle, and confirm that udevadm info --export contains the same set of links (partuuid, etc.). A rough sketch of such a check follows.
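
import subprocess

# Hypothetical helpers for the check described above; `udevadm info
# --query=property` stands in for --export, and the names are made up.
def devlinks(device):
    # Return the set of symlinks udev currently associates with a device.
    out = subprocess.run(["udevadm", "info", "--query=property", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("DEVLINKS="):
            return set(line.split("=", 1)[1].split())
    return set()

def retrigger_until_links_restored(disk, partition, expected, attempts=3):
    # Re-trigger 'add' on the disk and settle until the pre-resize links return.
    for _ in range(attempts):
        subprocess.run(["udevadm", "trigger", "--action=add", disk], check=True)
        subprocess.run(["udevadm", "settle"], check=True)
        if expected <= devlinks(partition):
            return True
    return False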

Revision history for this message
Scott Moser (smoser) wrote :

I really think you are all *way* overthinking this.
 a. growpart made a change to the partition table (using sfdisk)
 b. growpart called partx --update --nr 3 /dev/sda
 c. growpart exited

With a and b, growpart created udev events. If you create udev events, you really need to wait for those events to finish, or you're just pushing complexity onto your consumer.

Now Daniel's updated cloud-init with output captured in comment 14 clearly called udevadm settle after 'c'. But the problem still existed.

So that means we have this sequence of events:
 a.) growpart change partition table
 b.) growpart call partx
 c.) udev created and events being processed
 d.) growpart exits
 e.) cloud-init calls udevadm settle
 f.) udev events occurring that removed the link
 g.) cloud-init raced on reading that size - fail

To me, that means either udevadm settle called in 'e' didn't actually do what it is supposed to do and wait for all events in the queue to clear, or something else created new events. I have long suspected that to be the case, and I think the thing doing it is udev rules.

If that is true, then udev events cause more udev events, so a single 'settle' is never enough. Nor can you actually ever know if you *have* settled long enough.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

> So that means we have this sequence of events:
> a.) growpart change partition table
> b.) growpart call partx
> c.) udev created and events being processed

That is not true. Whilst sfdisk is deleting, creating, and finishing the partition table (a) and partx is called (b), udev events have already fired and are running in parallel, and may complete against a deleted, partially new, or completely new partition table, with or without partx having completed.

No amount of settling for events will fix the fact that events were run against racy state of the partition table _during_ sfdisk and partx calls.

Revision history for this message
Dan Streetman (ddstreet) wrote :

Just curious, did anyone actually test my suggestion from comment 40? That is, just make your partition table change (and update the kernel with partx or whatever, of course), and after it's done, trigger udev (with --settle) for the device. It would be interesting to know if that actually fixes it or not.
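
Concretely, something along these lines for the repartitioned device (path illustrative):

udevadm trigger --settle /sys/class/block/sdb1   # default action is 'change'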

Revision history for this message
Ryan Harper (raharper) wrote :

@ddstreet

Yes, settle does not help.

Re-triggering with

udevadm trigger --action=add /sys/class/block/sda

would re-run all of the rules after the partition change has occurred, which is what I'm currently suggesting as a heavy-handed workaround.

I would like to understand *why* the udevd/kernel pair exhibits this racy behavior whereas other kernels/udevd from Bionic do not.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Yes, settle does not help.

Well, I didn't suggest just to settle ;-)

> I'm currently suggesting as a heavy-handed workaround.

I don't really see why you think this is heavy-handed, but I must be missing something.

> I would like to understand *why* the udevd/kernel pair exhibits this racy behavior
> whereas other kernels/udevd from Bionic do not.

Certainly a good point; might be good to add some debug to the kernel and/or systemd to see what's going on.

Revision history for this message
Ryan Harper (raharper) wrote :

On Thu, Nov 7, 2019 at 1:30 PM Dan Streetman <email address hidden> wrote:

> > Yes, settle does not help.
>
> Well, I didn't suggest just to settle ;-)
>

Sorry; long bug thread.

> > I'm currently suggesting as a heavy-handed workaround.
>
> I don't really see why you think this is heavy-handed, but I must be
> missing something.
>

It's replaying the disk events, which results in running the rules multiple
times. The ADD|CHANGE event was already emitted once (when growpart modifies
the partition) and the rules were run (likely at the wrong time). Issuing a
second trigger will repeat this. IMO, that's a non-zero amount of time that
slows the boot down, so I'd like to avoid that.

Revision history for this message
Scott Moser (smoser) wrote :

> > So that means we have this sequence of events:
> > a.) growpart change partition table
> > b.) growpart call partx
> > c.) udev created and events being processed

> That is not true. whilst sfdisk is deleting, creating, finishing
> partition table (a) and partx is called (b), udev events are already fired
> and running in parallel and may complete against deleted, partially new,
> completely new partition table, with or without partx completed.

You're correct... I left out some 'events created and handled' after 'a'.
But that doesn't change anything. The problem we're seeing here is *not*
that 'b' had any issue.

>
> No amount of settling for events will fix the fact that events were run
> against racy state of the partition table _during_ sfdisk and partx calls.

Complete nonsense. I don't care about any racy state *during* anything. I
call 'udevadm settle'. That means "block until stuff is done." I think
you're saying that I cannot:
 1.) do something that causes udev events
 2.) wait until all udev events caused by that something are finished

if that is the case, then nothing ever can fix this, and we might as well
go find jobs on a farm.

Revision history for this message
Ryan Harper (raharper) wrote :

On Thu, Nov 7, 2019 at 11:30 AM Dimitri John Ledkov <email address hidden>
wrote:

> > So that means we have this sequence of events:
> > a.) growpart change partition table
> > b.) growpart call partx
> > c.) udev created and events being processed
>
> That is not true. whilst sfdisk is deleting, creating, finishing
> partition table (a) and partx is called (b), udev events are already
> fired and running in parallel and may complete against deleted,
> partially new, completely new partition table, with or without partx
> completed.
>

> No amount of settling for events will fix the fact that events were run
> against racy state of the partition table _during_ sfdisk and partx
> calls.
>

udevadm control --stop-exec-queue
sgdisk
partx
udevadm control --start-exec-queue
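
Fleshed out a little (the sgdisk and partx arguments are illustrative, not growpart's exact invocation), that would be roughly:

udevadm control --stop-exec-queue
sgdisk --delete=1 --new=1:2048:0 /dev/sda   # recreate partition 1 at its original start, extended to the end (start sector illustrative)
partx --update --nr 1 /dev/sda              # tell the kernel about the new size
udevadm control --start-exec-queue
udevadm settle                              # wait for the replayed queue to drain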

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Issuing a second
> trigger will repeat this.
> IMO, that's a non-zero amount of time that slows the boot down, so I'd like
> to avoid that.

systemd-udev-trigger.service retriggers *everything* at boot (except in an unprivileged container where it can't), so I'm not sure how much added time we're talking about just for a single block device. But yeah, you need the kernel to send a uevent to udevd after the partition table is ready for udevd to read, or udevd won't get the right partition table info; whether you do some locking to try to block udevd from reading it too early, or just retrigger an event after it's ready.

Revision history for this message
Ryan Harper (raharper) wrote :

On Thu, Nov 7, 2019 at 2:41 PM Dan Streetman <email address hidden> wrote:

> > Issuing a second
> > trigger will repeat this.
> > IMO, that's a non-zero amount of time that slows the boot down, so I'd
> like
> > to avoid that.
>
> systemd-udev-trigger.service retriggers *everything* at boot (except in
> an unprivileged container where it can't), so I'm not sure how much
> added time we're talking about just for a single block device.

We can't know how many devices are in the system. It may be nearly zero, or
it could be minutes. The cold plug is required (the initramfs may or may not
have run scripts/udevd etc., but kernel events were already emitted during
initial boot while userspace wasn't ready yet) and runs before cloud-init
runs.

> yeah, you need the kernel to send a uevent to udevd after the partition
> table is ready for udevd to read, or udevd won't get the right partition
> table info; whether you do some locking to try to block udevd from
> reading it or just retrigger an event after it's ready.
>

Generally yes; and what we still don't know is why it's racing here but not
everywhere all the time. Note that *growpart* runs on every single cloud
instance boot. There are a _LOT_ of those; something changed in udev/kernel
which is racing and it'd be good to track that bit down.

Revision history for this message
Dan Streetman (ddstreet) wrote :

> We can't know how many devices are in the system. It maybe nearly zero,
> it could be minutes.

I'm not suggesting you trigger uevents for all devices, just the one you're repartitioning...

> The cold plug is required (initramfs may or maynot
> have ran
> scripts/udevd etc but kernel events already were emitted during initial
> boot and userspace isn't ready yet) and runs before cloud-init runs.

yes I know, my point was you appear concerned about the extra time of retriggering a single device, while bootup retriggers all devices.

In any case, as I said in comment 57 - you appear to have things under control, I was only curious if anyone had actually tested if simply retriggering the repartitioned device does actually fix/workaround the problem. I'll step back out of this bug again. :)

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

On Thu, 7 Nov 2019 at 17:50, Ryan Harper <email address hidden> wrote:
>
> @ddstreet
>
> Yes, settle does not help.
>
> Re-triggering udevadm trigger --action=add /sys/class/block/sda
>
> Would re-run all of them after the partition change has occurred, which
> is what I'm currently suggesting as a heavy-handed workaround.
>
> I would like to understand *why* the udevd/kernel pair exhibits this
> racy behavior whereas other kernels/udevd from Bionic do not.
>

Just because a race is always won, doesn't mean it didn't always exist ;-)


Revision history for this message
Dimitri John Ledkov (xnox) wrote :

On Thu, 7 Nov 2019 at 20:05, Scott Moser <email address hidden> wrote:
>
> > > So that means we have this sequence of events:
> > > a.) growpart change partition table
> > > b.) growpart call partx
> > > c.) udev created and events being processed
>
> > That is not true. whilst sfdisk is deleting, creating, finishing
> > partition table (a) and partx is called (b), udev events are already fired
> > and running in parallel and may complete against deleted, partially new,
> > completely new partition table, with or without partx completed.
>
> You're correct... I left out some 'events created and handled' after 'a'.
> But that doesn't change anything. The problem we're seeing here is *not*
> that 'b' had any issue.
>
> >
> > No amount of settling for events will fix the fact that events were run
> > against racy state of the partition table _during_ sfdisk and partx calls.
>
> complete non-sense. I dont care about any racy state *during* anything. I
> call 'udevadm settle'. That means "block until stuff is done." I think
> you're saying that I cannot:
> 1.) do something that causes udev events
> 2.) wait until all udev events caused by that something are finished
>
> if that is the case, then nothing ever can fix this, and we might as well
> go find jobs on a farm.
>

Both those things happen, but udev events start being processed whilst
the partition table changes have not completed yet. This is what is
documented in the sfdisk manpage as a known bug that nobody has yet
managed to figure out and de-race.
Meaning: even if the udev events happened, and one waits for their
processing to finish, there is no guarantee that they were processed
against a consistent disk state.

This is why sfdisk recommends taking a flock, and this is why udev also
tries to take a flock.
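
The manpage's recommendation amounts to wrapping the repartitioning in the lock udev also tries to take, roughly (device and partition are illustrative):

# hold an exclusive BSD lock on the whole disk while rewriting the table
flock /dev/sdb sfdisk -N 1 /dev/sdb <<< ',+'   # grow partition 1 to the maximum size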

In the past IBM demonstrated a race similar to this one in
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1571707
where they tried to rapidly partition 256 devices in parallel:
only 89 of them successfully showed partitions after the limit
test was executed, and all of them appeared only after a reboot. That
was in April 2016, on top of Xenial.


Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So I've been handed the job of moving this bug forward this iteration.

Having just re-read all the comments again and read some source code while listening to loud techno music, my observations/questions are these:

1) this is a doozy

2) I would really like to see complete udevadm monitor and inotify output for successful and failing boots

3) As far as I can read the gdisk source, there is no point at which there is no entry for the partition on disk (i.e. it only writes the partition tables once)

4) I think this comment "I believe the kernel event is emitted before the partition table has necessarily been fully updated so when udev processes the event and reads the partition table, sometimes it finds the partition and sometimes it doesn't." is not quite right, or at least there is another possibility entirely consistent with the observed facts: the diff in comment #24 looks more like udev finds the partition OK but fails to fill out the ID_PART_ENTRY_* entries. These come from libblkid via udev-builtin-blkid.c. The part of this theory that doesn't make sense is that the ID_FS_* entries are there and they come from the same codepath. So who knows!

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Ah, no, I think this might be along the right lines: udev is calling blkid on the _partition_, of course, so it can probe for the filesystem etc. without looking at the partition table. After it's done that, it does look at the partition table so it can read the ID_PART_ENTRY_* values from it, but if it fails to load the partition table it just gives up and still returns success.

As to how this might happen, well, there are certainly moments during sgdisk's writing of the partition table when the GPT data is not completely consistent (it contains checksums and is not written atomically). Two things are still a bit unsatisfactory about this explanation though: 1) it's a mighty tight race, and hitting it with any regularity borders on the incredible, and 2) I have no idea why this only occurs on Azure on Cosmic and up. Although I guess 1) may explain 2), I suppose.

Setting LIBBLKID_DEBUG=lowprobe in systemd-udevd's environment during a failing boot would be very interesting to confirm or deny this line of reasoning -- although this will make udevd produce gobs of output and possibly disturb any race.
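
A minimal drop-in to do that might look like this (a sketch only; the exact drop-in used for the test runs may differ):

mkdir -p /etc/systemd/system/systemd-udevd.service.d
cat > /etc/systemd/system/systemd-udevd.service.d/udevd-debugging.conf <<'EOF'
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
Environment=LIBBLKID_DEBUG=lowprobe
EOF
# then: systemctl daemon-reload && systemctl restart systemd-udevd (or simply reboot)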

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh yeah and one other thing I don't understand: why udev is processing the partition while sgdisk is still running.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Hi, could someone prepare an image with https://paste.ubuntu.com/p/HYGv6m8gCc/ at /etc/systemd/system/systemd-udevd.service.d/udevd-debugging.conf, boot it on azure until it fails and put the journalctl output (probably a few megs) somewhere I can read it? Output from a successful boot would also be interesting.

Revision history for this message
Pat Viafore (patviafore) wrote :

I can get that going, and will report back on this bug when I have more information.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

On another tangent, I wonder if the change that brought this to light was the switch to booting without initrd. The timing is about right, and fits with the fact that it doesn't occur with the generic kernel (which cannot boot without initrd in Azure). So if someone has an excess of time to test <wink>, testing bionic but with initrdless boot and focal with initrdful boot would be interesting. I don't know exactly what runes have to be cast to do this.

Getting the udevd debug output is still higher priority though.

Revision history for this message
Pat Viafore (patviafore) wrote :

I'm attaching the output of a failing build

Revision history for this message
Ryan Harper (raharper) wrote :

Cloud-init service starts and will run growpart, etc

Feb 06 00:37:26 ubuntu systemd[1]: Starting Initial cloud-init job (pre-networking)...
Feb 06 00:37:37 test-xrdpdnvfctsofyygmzan systemd[1]: Starting Initial cloud-init job (metadata service crawler)...

Something has modified sdb1 (growpart/sgdisk), so we see a change event and the removal of the symlink:
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[265]: sdb1: Device (SEQNUM=3353, ACTION=change) is queued
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: sdb1: Updating old name, '/dev/disk/by-partuuid/5e01dd62-6f50-4cd7-8f62-30bb372b58ea' no longer belonging to '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host0/target0:0:0/0:0:0:0/block/sdb/sdb1'
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: read: off=2244476928 len=26214
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: sdb1: No reference left, removing '/dev/disk/by-partuuid/5e01dd62-6f50-4cd7-8f62-30bb372b58ea'

cc_growpart attempts to check if the block device has been resized and fails
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan cloud-init[570]: 2020-02-06 00:37:42,658 - util.py[WARNING]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed

cloud-init now runs resizefs on sdb1
Feb 06 00:37:43 test-xrdpdnvfctsofyygmzan kernel: EXT4-fs (sdb1): resizing filesystem from 548091 to 7836155 blocks
Feb 06 00:37:44 test-xrdpdnvfctsofyygmzan kernel: EXT4-fs (sdb1): resized filesystem to 7836155

About half a minute later, something triggers a rescan and we see the link return (looks like walinuxagent poking):

Feb 06 00:38:14 test-xrdpdnvfctsofyygmzan systemd-udevd[1482]: sdb1: /usr/lib/udev/rules.d/60-persistent-storage.rules:106 LINK 'disk/by-partuuid/5e01dd62-6f50-4cd7-8f62-30bb372b58ea'
Feb 06 00:38:14 test-xrdpdnvfctsofyygmzan systemd-udevd[1482]: sdb1: Creating symlink '/dev/disk/by-partuuid/5e01dd62-6f50-4cd7-8f62-30bb372b58ea' to '../../sdb1'

Revision history for this message
Pat Viafore (patviafore) wrote :

Adding in a passing version of journalctl

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

> Ah, no I think this might be along right lines: udev is calling blkid
> on the _partition_ of course, so it can probe for filesystem etc without
> looking at the partition table. After it's done that, it does look for
> the partition table so it can read the ID_PART_ENTRY_* values from it,
> but if it fails to load the partition table it just gives up and still
> returns success.

Ah, so this was in the right direction, but not completely right: the failure is not in reading the partition table of the disk being probed, but in failing to figure out which entry in that table corresponds to the partition being probed:

Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: 556: libblkid: LOWPROBE: trying to convert devno 0x811 to partition
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: 556: libblkid: LOWPROBE: searching by offset/size
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: 556: libblkid: LOWPROBE: not found partition for device
Feb 06 00:37:42 test-xrdpdnvfctsofyygmzan systemd-udevd[556]: 556: libblkid: LOWPROBE: parts: end probing for partition entry [nothing]

The function of interest in libblkid here is blkid_partlist_devno_to_partition, which (unless some APIs have very misleading names) is reading the offset and size of the partition from sysfs. What must be happening here is that udev sees the disk being closed by gdisk and somehow runs the built-in blkid command on the partition before the kernel has been informed of the resize of the partition. And indeed, we can see when the kernel notices this:

Feb 06 00:37:43 test-xrdpdnvfctsofyygmzan kernel: EXT4-fs (sdb1): resizing filesystem from 548091 to 7836155 blocks
Feb 06 00:37:44 test-xrdpdnvfctsofyygmzan kernel: EXT4-fs (sdb1): resized filesystem to 7836155

At least a second later. (In passing_journalctl_output.txt, this message is printed before udev even gets the inotify event about sdb being closed).

So there's always a race here but I don't know why we only see this on Azure with newer releases. A proper fix would be to get sgdisk to call partx_resize_partition before closing the disk but it would also be interesting to see why it takes so long to get to that part sometimes. growpart is too much shell for me, but maybe it's possible to get it to run partx under strace and get the output of that out somehow?
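
(For reference, the offset/size the kernel currently believes, which is what libblkid appears to be matching against here, can be inspected directly in sysfs; sdb1 is the partition from the logs above:)

# what the kernel currently believes about the partition, in 512-byte sectors
cat /sys/class/block/sdb1/start
cat /sys/class/block/sdb1/size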

Revision history for this message
Ryan Harper (raharper) wrote :

Yes, this is my read on the issue as well. The trigger is related to the inotify watch that systemd-udevd puts on the disk. Something that might help, and that we could try per xnox's comment, is the use of flock:

growpart would flock /dev/sda (we need to sort out what flags are needed to prevent udev probe and rule execution) prior to running sgdisk, and not release this flock until after partx runs (to inform the kernel of the update) and a udevadm settle (and possibly a trigger of a CHANGE event).

This should prevent early reads of the modified, but not yet up-to-date, partition data in the kernel.
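
A rough sketch of that sequence (the sgdisk line stands in for growpart's actual resize, and the exact flock flags are still to be determined, as noted above):

exec 9</dev/sda                             # keep the whole disk open for the duration
flock -x 9                                  # take an exclusive BSD lock on that fd
sgdisk --delete=1 --new=1:2048:0 /dev/sda   # partition-table rewrite (illustrative)
partx --update --nr 1 /dev/sda              # tell the kernel about the new size
udevadm settle                              # wait for any in-flight events
exec 9<&-                                   # release the lock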

Lastly, another area of exploration is: why isn't a change event emitted when partx runs to update the kernel with new partition data?

If there is one, and the rules then generate the correct symlink, another approach for growpart is to block until the size of the new partition is correct. growpart calculates what the new size should be, so it could poll for this value, blocking cloud-init's size check from running until growpart is confident that the kernel has the correct value (and the subsequent rules have completed execution)!
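
A sketch of that polling approach (the expected sector count and device are placeholders; growpart would already know both):

expected=39872479                               # sectors growpart asked for (placeholder)
tries=0
until [ "$(cat /sys/class/block/sda1/size)" -ge "$expected" ] || [ "$tries" -ge 50 ]; do
    sleep 0.1
    tries=$((tries + 1))
done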

Revision history for this message
Ryan Harper (raharper) wrote :

Here's the upstream changes to growpart I'm suggesting:

https://code.launchpad.net/~raharper/cloud-utils/+git/cloud-utils/+merge/379177

I've also proposed modifications to cloud-init's cc_growpart as a further method to aid debugging if this is hit, as well as some mitigation around the race.

https://github.com/canonical/cloud-init/pull/213

Scott Moser (smoser)
Changed in cloud-utils:
status: New → Fix Committed
Changed in cloud-utils (Ubuntu):
status: New → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :

The fix is in cloud-utils upstream now.
Still to do:
 a.) review/merge cloud-initramfs-tools pull request
   https://code.launchpad.net/~raharper/cloud-utils/+git/cloud-utils/+merge/379177
 b.) upload cloud-initramfs-tools to focal
 c.) upload cloud-utils to focal
 d.) any SRU

The order of 'b' and 'c' is not *terribly* important, because I do not think that there are many users of 'growroot' at this point. That said... the change to cloud-utils *does* break cloud-initramfs-tools, so they should be done in that order.

I personally do not really want to babysit these getting in. Can someone else sign up for that?

Changed in cloud-initramfs-tools (Ubuntu):
status: New → Confirmed
Revision history for this message
Joshua Powers (powersj) wrote :

@smoser

a looks merged

b has not had any changes since 2018 and looks up-to-date in Focal. What am I missing?

c I see a devel branch; is it assumed that it uses a packaging structure similar to cloud-init, or can someone just grab and upload master?

Revision history for this message
Scott Moser (smoser) wrote :

a.) I gave the wrong link. ugh. It should have been:
  https://code.launchpad.net/~smoser/cloud-initramfs-tools/+git/cloud-initramfs-tools/+merge/379774

b.) the fixed link to 'a' probably makes more sense now. But basically
you need a newer cloud-initramfs-tools to adjust for the fact that
growpart needs flock now.

cloud-initramfs-tools I believe is still a "native package", so it just
has the packaging alongside.

c.) you can use new-upstream-snapshot. cloud-utils is set to use the
same basic workflow as cloud-init or curtin.

I'm OK to upload cloud-initramfs-tools and also cloud-utils, but I think
in both cases it makes sense to get someone from server team to do it.
Just to have someone other than myself having done it.

Revision history for this message
Ryan Harper (raharper) wrote :

@Scott,

cloud-utils isn't quite new-upstream-snapshot out of the box; the debian dir does not contain the changelog. However, I think I've got this sorted out. I have an MP I can put up, but it will only show the addition of the changelog file. I'll attach a debdiff and a source package.

Revision history for this message
Ryan Harper (raharper) wrote :
Revision history for this message
Ryan Harper (raharper) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

this seemed to "just work" for me.
http://paste.ubuntu.com/p/93dWDPZfZT/

Revision history for this message
Ryan Harper (raharper) wrote :

On Tue, Feb 25, 2020 at 2:35 PM Scott Moser <email address hidden>
wrote:

> this seemed to "just work" for me.
> http://paste.ubuntu.com/p/93dWDPZfZT/

Ah, I didn't check that there was an existing ubuntu/devel branch. Sorry.

I've pushed a MR here:

https://code.launchpad.net/~raharper/cloud-utils/+git/cloud-utils/+merge/379851

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-initramfs-tools - 0.45ubuntu1

---------------
cloud-initramfs-tools (0.45ubuntu1) focal; urgency=medium

  * Add dependency on flock for growroot's use of growpart.
    (LP: #1834875)

 -- Scott Moser <email address hidden> Tue, 25 Feb 2020 13:08:22 -0500

Changed in cloud-initramfs-tools (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-utils - 0.31-7-gd99b2d76-0ubuntu1

---------------
cloud-utils (0.31-7-gd99b2d76-0ubuntu1) focal; urgency=medium

  * New upstream snapshot.
    - growpart: add flock support to prevent udev races (LP: #1834875)

 -- Ryan Harper <email address hidden> Tue, 25 Feb 2020 14:41:21 -0600

Changed in cloud-utils (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Can someone from CPC confirm that this has fixed the issues in the Azure tests?

Revision history for this message
Pat Viafore (patviafore) wrote :

Sorry, we were waiting on a bug regarding dictionary keys being modified to resolve before we could verify this. We have not seen this in the past couple of runs in our testing, so I think we're good to proceed.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Great, thanks for confirming.

I wonder if we should SRU this. You see it on Eoan I presume? The bug is theoretically present in Bionic too aiui but if it never crops up I don't know if it's worth it.

Revision history for this message
Pat Viafore (patviafore) wrote :

Yes, we see it on Eoan and it is blocking tests, so if you could SRU it there, it'd be much appreciated. We have never seen it on Bionic to my knowledge.

no longer affects: linux-azure (Ubuntu)
no longer affects: systemd (Ubuntu)
Revision history for this message
Markus Schade (lp-markusschade) wrote :

It seems that the changes to cloud-initramfs-growroot have a side effect that was not present before focal.

After initially provisioning an image, the root partition is resized in the initramfs, but cannot be mounted afterwards. A second boot from the now-resized disk then works without problems.

Removing cloud-initramfs-growroot provides a working focal image.

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... [ 2.261862] Btrfs loaded, crc32c=crc32c-intel
Scanning for Btrfs filesystems
done.
Begin: Will now check root file system ... fsck from util-linux 2.34
[/usr/sbin/fsck.ext4 (1) -- /dev/sda1] fsck.ext4 -a -C0 /dev/sda1
/dev/sda1: clean, 32500/188048 files, 340923/751616 blocks
done.
[ 2.397086] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
done.
Begin: Running /scripts/local-bottom ... [ 2.723549] sda: sda1 sda14 sda15
GROWROOT: CHANGED: partition=1 start=129024 old: size=6012928 end=6141952 new: size=39872479 end=40001503
done.
Begin: Running /scripts/init-bottom ... mount: mounting /dev on /root/dev failed: No such file or directory
mount: mounting /dev on /root/dev failed: No such file or directory
done.
mount: mounting /run on /root/run failed: No such file or directory
BusyBox v1.30.1 (Ubuntu 1:1.30.1-4ubuntu6) multi-call binary.

Revision history for this message
Scott Moser (smoser) wrote :

@Markus,
Please open a bug against cloud-initramfs-growroot and reference it here (and reference this bug there).

Please also provide a full console log.

FWIW, generally speaking cloud-initramfs-growroot has not been necessary since at least 14.04, quite possibly even before.

description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Tobias, or anyone else affected,

Accepted cloud-utils into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-utils/0.31-5-gef42f6b5-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-utils (Ubuntu Eoan):
status: New → Fix Committed
tags: added: verification-needed verification-needed-eoan
Changed in cloud-initramfs-tools (Ubuntu Eoan):
status: New → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello Tobias, or anyone else affected,

Accepted cloud-initramfs-tools into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-initramfs-tools/0.44ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (cloud-utils/0.31-5-gef42f6b5-0ubuntu1.1)

All autopkgtests for the newly accepted cloud-utils (0.31-5-gef42f6b5-0ubuntu1.1) for eoan have finished running.
The following regressions have been reported in tests triggered by the package:

lxc/3.0.4-0ubuntu1 (i386)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/eoan/update_excuses.html#cloud-utils

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Mathew Hodson (mhodson)
affects: cloud-init → cloud-initramfs-tools
Changed in cloud-initramfs-tools:
status: Incomplete → Fix Released
Revision history for this message
Pat Viafore (patviafore) wrote :

We are seeing some failures still in Eoan (not Focal) with tracebacks involving growpart:

2020-05-29 10:14:29,218 - util.py[DEBUG]: Running module growpart (<module 'cloudinit.config.cc_growpart' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 851, in _run_modules
    freq=freq)
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 187, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 360, in handle
    func=resize_devices, args=(resizer, devices))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2558, in log_time
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 307, in resize_devices
    (old, new) = resizer.resize(disk, ptnum, blockdev)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 159, in resize
    return (before, get_size(partdev))
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_growpart.py", line 206, in get_size
    fd = os.open(filename, os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/disk/by-partuuid/465f461c-537a-4d09-acae-cf4469c962c5'

It is still an intermittent failure.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Huh. Is that with cloud-utils 0.31-5-gef42f6b5-0ubuntu1.1 from proposed?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Anyway given Eoan's imminent demise, maybe we don't care.

Mathew Hodson (mhodson)
Changed in cloud-utils:
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Proposed package removed from archive

The version of cloud-utils in the proposed pocket of Eoan that was purported to fix this bug report has been removed because the bugs that were to be fixed by the upload were not verified in a timely (105 days) fashion.

tags: removed: verification-needed-eoan
Changed in cloud-utils (Ubuntu Eoan):
status: Fix Committed → Won't Fix
tags: removed: verification-needed
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

The version of cloud-initramfs-tools in the proposed pocket of Eoan that was purported to fix this bug report has been removed because the bugs that were to be fixed by the upload were not verified in a timely (105 days) fashion.

Changed in cloud-initramfs-tools (Ubuntu Eoan):
status: Fix Committed → Won't Fix