Some hardening for vfio devices being less fatal at reboot

Bug #1967222 reported by Christian Ehrhardt 
This bug affects 1 person
Affects     Status  Importance  Assigned to  Milestone
cloud-init  New     Undecided   Unassigned

Bug Description

Hi,
I've just seen this on reboot, and I have to admit I'm not entirely sure we have to fix it, as it is a conflict between config and system. But first things first.

I have a system with a whole bunch of network cards:
1. a four port NetXtreme BCM5719
2. a single MLX ConnectX-4 Lx
3. a two port Intel X540-AT2

The system is MAAS deployed, so MAAS has config data for the system describing these cards (I guess).

I'm using DPDK on that system, which sometimes means I'm unassigning the "normal" driver and replacing it with e.g. vfio-pci for use in userspace. That makes the card disappear from a classic system's point of view (e.g. `ip`), but it is still visible in e.g. `lspci`.

The problem I'm seeing occurs after my workload has reassigned two of those devices, as seen here:

$ dpdk-devbind.py --status

Network devices using DPDK-compatible driver
============================================
0000:04:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=uio_pci_generic unused=ixgbe,vfio-pci
0000:04:00.1 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=uio_pci_generic unused=ixgbe,vfio-pci

Network devices using kernel driver
===================================
0000:02:00.0 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno1 drv=tg3 unused=vfio-pci,uio_pci_generic *Active*
0000:02:00.1 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno2 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:02:00.2 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno3 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:02:00.3 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' if=eno4 drv=tg3 unused=vfio-pci,uio_pci_generic
0000:08:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1 drv=mlx5_core unused=vfio-pci,uio_pci_generic
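To make the mismatch easier to spot, here is a minimal sketch that parses status text in the format shown above and lists the PCI devices that have no kernel network interface (no `if=` field). The function names are made up for illustration and are not part of DPDK or cloud-init; the regular expression only assumes the line layout visible in this report.

```python
import re

# Matches device lines in the layout shown in this report, e.g.:
# 0000:04:00.0 'Ethernet Controller ...' drv=uio_pci_generic unused=ixgbe,vfio-pci
DEVICE_LINE = re.compile(
    r"^(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"
    r"'(?P<name>[^']*)'\s+(?P<attrs>.*)$"
)


def parse_devbind_status(text):
    """Return one dict per device line found in the status text (hypothetical helper)."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if not match:
            continue  # headings, separators and blank lines
        # key=value tokens; markers such as *Active* carry no "=" and are skipped
        attrs = dict(
            token.split("=", 1)
            for token in match.group("attrs").split()
            if "=" in token
        )
        devices.append(
            {
                "pci": match.group("pci"),
                "name": match.group("name"),
                "driver": attrs.get("drv"),
                "interface": attrs.get("if"),
            }
        )
    return devices


def devices_without_netdev(devices):
    """PCI addresses of devices with no kernel interface name (hypothetical helper)."""
    return [dev["pci"] for dev in devices if dev["interface"] is None]
```

Fed the output above, `devices_without_netdev` would report the two X540-AT2 ports, which are the ones that no longer show up as network interfaces.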

If I reboot the system while in that mode (and the devices will be assigned that way again on reboot), the formerly configured MACs are not present and cloud-init complains.

...
[ 244.164245] cloud-init[1256]: failed run of stage init
[ 244.176231] cloud-init[1256]: ------------------------------------------------------------
[ 244.192244] cloud-init[1256]: Traceback (most recent call last):
[ 244.204241] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 384, in main_init
[ 244.216146] cloud-init[1256]: init.fetch(existing=existing)
[ 244.228220] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 432, in fetch
[ 244.240199] cloud-init[1256]: return self._get_data_source(existing=existing)
[ 244.252143] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 323, in _get_data_source
[ 244.264134] cloud-init[1256]: (ds, dsname) = sources.find_source(
[ 244.276209] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 923, in find_source
[ 244.288117] cloud-init[1256]: raise DataSourceNotFoundException(msg)
[ 244.300224] cloud-init[1256]: cloudinit.sources.DataSourceNotFoundException: Did not find any data source, searched classes: (DataSourceMAAS)
[ 244.312144] cloud-init[1256]: During handling of the above exception, another exception occurred:
[ 244.324119] cloud-init[1256]: Traceback (most recent call last):
[ 244.336203] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 761, in status_wrapper
[ 244.348129] cloud-init[1256]: ret = functor(name, args)
[ 244.360201] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 406, in main_init
[ 244.372143] cloud-init[1256]: init.apply_network_config(bring_up=bring_up_interfaces)
[ 244.384132] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 908, in apply_network_config
[ 244.396115] cloud-init[1256]: self.distro.networking.wait_for_physdevs(netcfg)
[ 244.408110] cloud-init[1256]: File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[ 244.420109] cloud-init[1256]: raise RuntimeError(msg)
[ 244.432230] cloud-init[1256]: RuntimeError: Not all expected physical devices present: {'8c:dc:d4:b3:6d:e8', '8c:dc:d4:b3:6d:e9'}
[ 244.444140] cloud-init[1256]: ------------------------------------------------------------
[ 304.197386] cloud-init[1256]: 2022-03-31 05:47:39,843 - handlers.py[WARNING]: failed posting event: finish: init-network: SUCCESS: searching for network datasources

There might be a related or unrelated (not sure) later crash on not finding any datasource, but you'll see that in the logs that I'll upload.
And as I said, you "might" say: you configured these devices and they are not there, so what are we supposed to do? But seeing the crash, I wondered if there might be a better way and wanted to bring it up for your consideration.
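To illustrate what "a better way" could mean, here is a minimal sketch of a check that can either raise (as in the traceback above) or only log a warning. This is not cloud-init's implementation; `find_missing_macs`, `check_physdevs` and the `strict` flag are hypothetical names used only for this example.

```python
import logging

LOG = logging.getLogger("physdev_check")


def find_missing_macs(expected_macs, present_macs):
    """Return the expected MACs that are not present, compared case-insensitively."""
    present = {mac.lower() for mac in present_macs}
    return {mac.lower() for mac in expected_macs} - present


def check_physdevs(expected_macs, present_macs, strict=True):
    """Hypothetical check: raise when strict, otherwise warn and return the missing set."""
    missing = find_missing_macs(expected_macs, present_macs)
    if not missing:
        return missing
    msg = "Not all expected physical devices present: %s" % sorted(missing)
    if strict:
        # Same kind of failure as in the traceback above
        raise RuntimeError(msg)
    # Less fatal variant: report the mismatch and let the caller continue
    LOG.warning(msg)
    return missing
```

With `strict=False`, a caller could still configure the interfaces that are present and skip the two missing MACs, at the cost of hiding a real mismatch between metadata and hardware.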

James Falcon (falcojr) wrote :

I think we can consider this a duplicate of https://bugs.launchpad.net/cloud-init/+bug/1936972. We've had this kind of issue with MAAS before, where we get metadata telling us that certain devices exist that cloud-init can't detect. Cloud-init's position has been "fix the metadata", as a mismatch there isn't something we want to silently ignore.

We're happy to revisit this if needed though.

Christian, do you agree that I can mark this as duplicate?

Christian Ehrhardt  (paelzer) wrote :

Functionally, yes, it is a dup, but my question was intentionally mostly "should we fail less fatally than with a traceback?"

James Falcon (falcojr) wrote :

My answer is still no, but I could be convinced otherwise.
