Comment 1 for bug 1983516

Chad Smith (chad.smith) wrote :

Thanks @ChrisPatterson for continuing to help us out here on big systems.

Looks like a case where the network rename by the kernel is colliding with cloud-init.

I'm thinking the failure symptom is the following:
  - cloud-init calls get_devicelist and starts looping through the devices found [1]
  - kernel renames some nic and sysfs gets updated
  - cloud-init is unable to finish the loop of calls to 'udevadm', 'test-builtin', 'net_setup_link', <PREVIOUS/STALE_DEVICE_NAME>

We need to handle this potential race condition better in cloud-init: either vet whether a rename happened out from under us, or temporarily block the renames in the kernel if we can.
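
To make the first option concrete, here is a rough sketch of what "vetting the rename" could look like. This is not the current cloud-init code, just an illustration: the function name net_setup_link_all and its parameters (list_devices, run_cmd, path_exists, max_passes) are hypothetical, and I'm assuming that a failed udevadm call for a name that no longer exists under /sys/class/net means the nic was renamed mid-loop.

```python
import os
import subprocess

SYS_CLASS_NET = "/sys/class/net/"


def _default_run(cmd):
    # Raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True, capture_output=True)


def net_setup_link_all(list_devices, run_cmd=_default_run,
                       path_exists=os.path.exists, max_passes=3):
    """Run net_setup_link for every device, tolerating renames mid-loop.

    Hypothetical helper: list_devices is a callable returning the current
    device names (e.g. something like get_devicelist).
    """
    done = set()
    for _ in range(max_passes):
        renamed = False
        for name in list_devices():
            if name in done:
                continue
            path = SYS_CLASS_NET + name
            cmd = ["udevadm", "test-builtin", "net_setup_link", path]
            try:
                run_cmd(cmd)
            except (subprocess.CalledProcessError, OSError):
                if path_exists(path):
                    # Device is still there, so this is a real failure.
                    raise
                # Assumption: the name vanished because of a rename.
                # Re-list the devices and retry on the next pass.
                renamed = True
                continue
            done.add(name)
        if not renamed:
            return done
    raise RuntimeError("device names kept changing during net_setup_link")
```

The idea is that a stale name is only ignored when sysfs confirms it is gone; any other failure still propagates, and the loop gives up after max_passes instead of spinning forever.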

References:

[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/net/netplan.py#L279-L284