Thanks @ChrisPatterson for continuing to help us out here on big systems.
Looks like a case where the network rename by the kernel is colliding with cloud-init.
I'm thinking the failure symptom is the following:
- cloud-init calls get_devicelist and starts looping through the devices found [1]
- kernel renames a NIC and sysfs gets updated
- cloud-init is unable to finish the loop of calls to 'udevadm', 'test-builtin', 'net_setup_link', <PREVIOUS/STALE_DEVICE_NAME>
We need to better handle this potential race condition in cloud-init and vet whether a rename happened out from under us, or block the renames in the kernel temporarily if we can.
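To make the first option concrete, here is a rough sketch of what "vet whether a rename happened" could look like. It is not the current cloud-init code: the helper names (list_devices, run_setup_link, setup_links), the max_passes knob and the /sys/class/net path passed to udevadm are all my assumptions for illustration. The idea is that when the udevadm call fails, we check whether the name still exists in sysfs. If it is gone, we treat it as a rename and pick up the new name on the next pass instead of aborting.

```python
import os
import subprocess

SYS_CLASS_NET = "/sys/class/net"  # assumption: standard sysfs location


def list_devices(sys_net=SYS_CLASS_NET):
    """Return the interface names currently present in sysfs."""
    try:
        return sorted(os.listdir(sys_net))
    except FileNotFoundError:
        return []


def run_setup_link(name, sys_net=SYS_CLASS_NET):
    """Run the udevadm call from this comment for one device (hypothetical wrapper)."""
    subprocess.run(
        ["udevadm", "test-builtin", "net_setup_link", os.path.join(sys_net, name)],
        check=True,
        capture_output=True,
    )


def setup_links(sys_net=SYS_CLASS_NET, runner=run_setup_link, max_passes=3):
    """Call runner for every device, tolerating renames that happen mid-loop.

    Returns the set of names that were handled successfully.
    """
    done = set()
    for _ in range(max_passes):
        # Re-read sysfs on every pass so renamed devices show up under the new name.
        pending = [d for d in list_devices(sys_net) if d not in done]
        if not pending:
            break
        for name in pending:
            try:
                runner(name, sys_net)
            except (subprocess.CalledProcessError, OSError):
                if os.path.exists(os.path.join(sys_net, name)):
                    raise  # device is still there, so this is a real failure
                continue  # name vanished: probably renamed, retry on the next pass
            done.add(name)
    return done
```

Re-listing on each pass keeps the window for a stale name small, but it does not close it entirely. A rename after the last pass would still be missed, so blocking renames temporarily may still be worth investigating.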
References:
[1] https://github.com/canonical/cloud-init/blob/main/cloudinit/net/netplan.py#L279-L284