node failed to deploy because an ephemeral network device was not found

Bug #1931735 reported by Junien F
This bug affects 1 person
Affects      Status    Importance  Assigned to  Milestone
MAAS         Invalid   Undecided   Unassigned
cloud-init   Expired   Undecided   Unassigned

Bug Description

Hi,

Using MAAS snap 2.8.6-8602-g.07cdffcaa.

I just had a node fail to deploy because a network device that was present during commissioning wasn't present anymore, making cloud-init sad. To be precise, the node deployed properly and rebooted, and during the post-deploy boot, cloud-init failed with:

RuntimeError: Not all expected physical devices present: {'be:65:46:cb:58:b7'}

(full stacktrace at https://pastebin.canonical.com/p/9Ycxwk5rRy/)

I was indeed able to find the network device with MAC address 'be:65:46:cb:58:b7', and it's an ephemeral NIC that gets created when someone logs in to the HTML5 console (this is a Gigabyte server, by the way). So someone was probably logged in to the HTML5 console when the node was commissioned.

I deleted this ephemeral device from the node in MAAS, and was then able to deploy it properly.

These ephemeral NICs appear to have random MAC addresses. I was logged on the HTML5 console during the boot logged above, and you can see there's a device named "enx5a099ca01d4b" with MAC address "5a:09:9c:a0:1d:4b" (which doesn't match a known OUI).
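
A quick way to back up the "doesn't match a known OUI" observation: both MAC addresses in this report have the locally administered bit (0x02 in the first octet) set, which is typical of generated addresses rather than vendor-assigned ones. A minimal sketch; the helper name is made up for illustration and is not part of MAAS or cloud-init:

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered bit (0x02) is set in the first octet."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# The two addresses seen in this report.
for mac in ("be:65:46:cb:58:b7", "5a:09:9c:a0:1d:4b"):
    print(mac, is_locally_administered(mac))
```

This is only a heuristic: a locally administered address does not by itself prove the device is an ephemeral BMC NIC.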

This is actually a cdc_ether device:
$ dmesg|grep cdc_ether
[ 29.867170] cdc_ether 1-1.3:2.0 usb0: register 'cdc_ether' at usb-0000:06:00.3-1.3, CDC Ethernet Device, 5a:09:9c:a0:1d:4b
[ 29.867296] usbcore: registered new interface driver cdc_ether
[ 29.958137] cdc_ether 1-1.3:2.0 enx5a099ca01d4b: renamed from usb0
[ 205.908811] cdc_ether 1-1.3:2.0 enx5a099ca01d4b: unregister 'cdc_ether' usb-0000:06:00.3-1.3, CDC Ethernet Device

(The last line is very probably from when I logged off the HTML5 console, which removes the device.)
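
For anyone wanting to spot these devices in their own logs, here is a rough sketch that pulls the cdc_ether register/rename/unregister events out of dmesg text. The regular expression is an assumption based only on the four lines quoted above, so other kernels may format these messages differently:

```python
import re

# The dmesg lines quoted in this report.
DMESG = """\
[ 29.867170] cdc_ether 1-1.3:2.0 usb0: register 'cdc_ether' at usb-0000:06:00.3-1.3, CDC Ethernet Device, 5a:09:9c:a0:1d:4b
[ 29.867296] usbcore: registered new interface driver cdc_ether
[ 29.958137] cdc_ether 1-1.3:2.0 enx5a099ca01d4b: renamed from usb0
[ 205.908811] cdc_ether 1-1.3:2.0 enx5a099ca01d4b: unregister 'cdc_ether' usb-0000:06:00.3-1.3, CDC Ethernet Device
"""

# Timestamp, driver name, USB path, interface name, then the event keyword.
EVENT = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+cdc_ether\s+\S+\s+(?P<iface>\S+):\s+"
    r"(?P<event>register|unregister|renamed)\b"
)


def cdc_ether_events(text):
    """Return (timestamp, interface, event) tuples for cdc_ether messages."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if m:
            events.append((float(m["ts"]), m["iface"], m["event"]))
    return events


events = cdc_ether_events(DMESG)
for ts, iface, event in events:
    print(f"{ts:>12.6f}  {iface:<18} {event}")

# Seconds between the device appearing and disappearing.
lifetime = events[-1][0] - events[0][0]
print(f"device was present for about {lifetime:.0f} s")
```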

So I think:
- MAAS should ignore these devices by default
- cloud-init shouldn't die when a cdc_ether device is missing.

Thanks

James Falcon (falcojr) wrote :

Without all of the cloud-init logs, it's hard to know what exactly happened here. cloud-init couldn't find a datasource, which normally wouldn't be related to a missing network device.

Is this issue reproducible? If so, running 'cloud-init collect-logs -u' on an affected instance might help us debug this issue.

Changed in cloud-init:
status: New → Incomplete
Junien F (axino) wrote :

If cloud-init fails like it did above, then the instance doesn't get an IP address and doesn't get SSH keys, so it's not possible to log in to run "cloud-init collect-logs -u".

Changed in cloud-init:
status: Incomplete → New
James Falcon (falcojr) wrote :

If you can't access the machine, how did you obtain the logs posted in your pastebin?

Junien F (axino) wrote :

I can see what's getting output on the console. I can't log in and run commands.

James Falcon (falcojr) wrote :

If we look at the source of the second traceback[1], you can see that it is being raised intentionally. Based on where it is called from[2], I think this makes sense.

If we have already applied networking, we will skip this step. We obtain the networking config immediately before we attempt to wait for these physical devices, so if there's a mismatch between the devices and the config, it points to a larger problem outside of cloud-init.

Given that this behavior is intentional and in place to prevent us from proceeding in a bad state, I think this works as intended, and I don't see a strong case for changing it. We can certainly discuss it further, but it appears that cloud-init is getting into this state because MAAS is handing us an invalid config, and that is the root issue that needs to be addressed.

[1] https://github.com/canonical/cloud-init/blob/950c186a7e0c66a3ed84ea97291e5829ca3d826c/cloudinit/distros/networking.py#L177
[2] https://github.com/canonical/cloud-init/blob/master/cloudinit/stages.py#L812
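
To illustrate the kind of check being described, here is a simplified sketch. It is not cloud-init's actual code: the function and argument names are hypothetical, the second MAC address is made up, and only the error message format is taken from the traceback quoted in the bug description.

```python
def check_expected_devices(expected_macs, present_macs):
    """Raise if the network config names a physical device the system lacks.

    Hypothetical helper: both arguments are iterables of MAC address strings.
    """
    missing = {m.lower() for m in expected_macs} - {m.lower() for m in present_macs}
    if missing:
        raise RuntimeError(
            "Not all expected physical devices present: %s" % missing
        )


# The config still lists the ephemeral NIC, but it is gone at boot time.
expected = ["00:16:3e:00:00:01", "be:65:46:cb:58:b7"]  # first MAC is made up
present = ["00:16:3e:00:00:01"]

try:
    check_expected_devices(expected, present)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

The point of failing hard here is that a config/hardware mismatch is treated as a sign the config itself is wrong, rather than something to work around.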

Junien F (axino) wrote :

I agree that MAAS shouldn't hand out a config which has a network device that's not present on the system.

On the other hand, cloud-init dying extremely early and not even trying to set up networking because an _unused_ device is missing isn't super user friendly either. I would expect cloud-init to try to do what it can when a device is missing, above all when said device doesn't have any configuration.
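
The more lenient behaviour asked for here could look roughly like the sketch below. This is purely an illustration of the proposal, not something cloud-init does: the data model (a dict mapping each expected MAC to whether the config actually uses that device) and all names are assumptions, and the first MAC address is made up.

```python
def split_missing(expected, present_macs):
    """Split missing devices into those the config uses and those it doesn't.

    expected: dict mapping MAC -> True if the config assigns the device any
    settings (addresses, bonds, VLANs, ...). Hypothetical, simplified model.
    """
    present = {m.lower() for m in present_macs}
    fatal, ignorable = set(), set()
    for mac, has_config in expected.items():
        if mac.lower() in present:
            continue
        (fatal if has_config else ignorable).add(mac.lower())
    return fatal, ignorable


expected = {
    "00:16:3e:00:00:01": True,   # made-up configured NIC
    "be:65:46:cb:58:b7": False,  # ephemeral NIC with no configuration
}
fatal, ignorable = split_missing(expected, ["00:16:3e:00:00:01"])
print("fatal:", fatal)
print("ignorable:", ignorable)
```

With this split, only a missing device that carries configuration would stop the boot; a missing unconfigured device would at most produce a warning.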

James Falcon (falcojr)
Changed in cloud-init:
status: New → Opinion
Bill Wear (billwear) wrote :

@axino, if you think there's something here MAAS should do, please add this to the Features category in Discourse. I think this is working pretty much as expected, given the circumstances.

Changed in maas:
status: New → Invalid
Junien F (axino) wrote :
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Opinion → Expired