Azure/Xenial Pro FIPS: RuntimeError: duplicate mac found!

Bug #1927124 reported by Gauthier Jolly
This bug affects 1 person
Affects              Status        Importance  Assigned to  Milestone
cloud-init (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

On Azure instances running Xenial Pro FIPS images with accelerated networking enabled, cloud-init fails to set up the user's SSH key, and the following stack trace appears in the logs:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 652, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
    init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 735, in apply_network_config
    self.distro.networking.wait_for_physdevs(netcfg)
  File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 147, in wait_for_physdevs
    present_macs = self.get_interfaces_by_mac().keys()
  File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 76, in get_interfaces_by_mac
    blacklist_drivers=self.blacklist_drivers)
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 830, in get_interfaces_by_mac
    blacklist_drivers=blacklist_drivers)
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 901, in get_interfaces_by_mac_on_linux
    (name, ret[mac], mac))
RuntimeError: duplicate mac found! both 'eth0' and 'enP1p0s2' have mac '00:0d:3a:7f:a8:e5'
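
The check that raises this error can be sketched roughly as follows. This is a simplified illustration and not cloud-init's actual code: the function name `map_interfaces_by_mac` and its list-of-pairs input are assumptions made for the example; only the error message format is taken from the traceback above.

```python
def map_interfaces_by_mac(interfaces):
    """Build a {mac: name} dict, refusing to guess when two names share a MAC.

    `interfaces` is a hypothetical list of (name, mac) pairs; the real
    code reads this information from the running system.
    """
    ret = {}
    for name, mac in interfaces:
        if mac in ret:
            # Same message shape as in the traceback: the name being
            # processed first, then the name already recorded for this MAC.
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, ret[mac], mac)
            )
        ret[mac] = name
    return ret
```

With accelerated networking, the synthetic NIC and the SRIOV virtual function share one MAC address, so a check like this fails unless one of the two devices is filtered out beforehand.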

The following SAS URL can be used to start a VM with this image in order to reproduce the problem:
https://gjolly.blob.core.windows.net/daily-vhd/xenial/20210430/Ubuntu_DAILY_BUILD-xenial-16_04-LTS-amd64-server-20210430-en-us-30GB.vhd?sp=r&st=2021-05-04T14:27:27Z&se=2022-05-04T22:27:27Z&spr=https&sv=2020-02-10&sr=b&sig=UYNr7aoThE28sZqkgAWCRSHuaBRqz4rAfJHWzbUqXKw%3D

Gauthier Jolly (gjolly) wrote :
Chad Smith (chad.smith) wrote :

We addressed something like this in the past with https://bugs.launchpad.net/cloud-init/+bug/1844191 and the resulting commit https://github.com/canonical/cloud-init/commit/059d049c57cac02cdeaca832233a19712e0b4ded

Maybe something in the FIPS-specific kernel isn't surfacing bridges/bonds the way cloud-init expects.

Chad Smith (chad.smith) wrote :

Thanks for the bug Gauthier.

This issue is due to the older 4.4 FIPS kernel not exposing a master attribute in /sys/class/net/<devname> for SRIOV accelerated networking devices. Cloud-init is able to ignore such devices on Xenial's newer 4.15.0-1114-azure kernel on Azure, but I'm afraid we won't be able to change the certified 4.4 FIPS kernel to expose this 'master' sysfs attribute.

On the Xenial FIPS kernel 4.4.0-1017-fips, there is no "master" sysfs attribute that would allow cloud-init to identify the SRIOV device:

ubuntu@pro-xenial-up-fips:/sys/class/net/enP1p0s2$ ls
addr_assign_type dormant name_assign_type speed
address duplex netdev_group statistics
addr_len flags operstate subsystem
broadcast gro_flush_timeout phys_port_id tx_queue_len
carrier ifalias phys_port_name type
carrier_changes ifindex phys_switch_id uevent
device iflink power
dev_id link_mode proto_down
dev_port mtu queues

On the newer Azure kernel, by contrast:
ubuntu@pro-xenial-up-fips:~$ uname -r
4.15.0-1114-azure

# note the "master" file indicating that this is a network device which has a master, therefore ignored by cloud-init
ubuntu@pro-xenial-up-fips:~$ ls /sys/class/net/enP11928s1
addr_assign_type dev_port mtu speed
address dormant name_assign_type statistics
addr_len duplex netdev_group subsystem
broadcast flags operstate tx_queue_len
carrier gro_flush_timeout phys_port_id type
carrier_changes ifalias phys_port_name uevent
carrier_down_count ifindex phys_switch_id upper_eth0
carrier_up_count iflink power
device link_mode proto_down
dev_id master queues
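
The difference between the two listings can be probed with a few lines of Python. This is a minimal sketch, assuming only that a subordinate device shows a `master` entry under its sysfs directory, as in the 4.15 listing above; the helper names `has_master` and `candidate_interfaces` are made up for illustration and are not cloud-init functions.

```python
import os


def has_master(devname, sys_class_net="/sys/class/net"):
    """Return True if the device exposes a 'master' entry in sysfs."""
    return os.path.exists(os.path.join(sys_class_net, devname, "master"))


def candidate_interfaces(sys_class_net="/sys/class/net"):
    """List device names, skipping those that are attached to a master."""
    return sorted(
        name
        for name in os.listdir(sys_class_net)
        if not has_master(name, sys_class_net)
    )
```

On the 4.4 FIPS kernel, `has_master("enP1p0s2")` would be expected to return False, since the listing shows no `master` entry, so a filter of this kind cannot hide the SRIOV device there.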

Changed in cloud-init (Ubuntu):
status: New → Triaged
Chad Smith (chad.smith) wrote :

More details: this is an upstream bug caused by cloudinit/stages.py creating a copy of the distro instance, based on re-reading and updating the distro config from disk, if it is unset in Init:
https://github.com/canonical/cloud-init/blob/master/cloudinit/stages.py#L91-L96

The problems upstream are:
 1. cloudinit/distros/networking.py get_interfaces_by_mac doesn't honor blacklist_drivers from a datasource
 2. DataSourceAzure sets blacklist_drivers on DataSourceAzure.distro.networking.blacklist_drivers during _get_data.
 3. stages.py also does not copy blacklist_drivers into a newly instantiated distro instance on the found datasource.

This only affects older kernels like 4.4, because newer kernels surface a sysfs "master" link on SRIOV devices; cloud-init ignores such devices by default, so no duplicate MAC errors are seen.
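
The interaction described above can be reproduced in a toy model. This is a sketch under stated assumptions, not cloud-init code: the `Networking` and `Distro` classes are stripped-down stand-ins, the interface list is passed in by hand, and the driver names attached to each device (`hv_netvsc`, `mlx4_core`) are assumed for illustration.

```python
BLACKLIST_DRIVERS = ["mlx4_core", "mlx5_core"]


class Networking:
    def __init__(self):
        self.blacklist_drivers = None  # nothing excluded by default

    def get_interfaces_by_mac(self, interfaces):
        """`interfaces` is a hypothetical list of (name, mac, driver)."""
        excluded = set(self.blacklist_drivers or [])
        ret = {}
        for name, mac, driver in interfaces:
            if driver in excluded:
                continue  # skip devices bound to an excluded driver
            if mac in ret:
                raise RuntimeError("duplicate mac found!")
            ret[mac] = name
        return ret


class Distro:
    def __init__(self):
        self.networking = Networking()


INTERFACES = [
    ("eth0", "00:0d:3a:7f:a8:e5", "hv_netvsc"),
    ("enP1p0s2", "00:0d:3a:7f:a8:e5", "mlx4_core"),
]

# The datasource configures the exclusion on the distro it was given.
ds_distro = Distro()
ds_distro.networking.blacklist_drivers = BLACKLIST_DRIVERS
ok = ds_distro.networking.get_interfaces_by_mac(INTERFACES)

# A freshly instantiated distro does not inherit that setting, so calling
# get_interfaces_by_mac(INTERFACES) on it raises the duplicate MAC error.
fresh_distro = Distro()
```

Here `ok` maps the shared MAC to `eth0` only, while the same call on `fresh_distro` raises, which mirrors how the setting is lost when a new distro instance is created.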

The following diff resolves this for Azure on 4.4 FIPS kernel.

I'll have to talk with the team about how best to support this on Xenial PRO images.

diff --git a/cloudinit/distros/networking.py b/cloudinit/distros/networking.py
index c291196a..471d7e52 100644
--- a/cloudinit/distros/networking.py
+++ b/cloudinit/distros/networking.py
@@ -71,7 +71,7 @@ class Networking(metaclass=abc.ABCMeta):
     def get_interfaces(self) -> list:
         return net.get_interfaces()

-    def get_interfaces_by_mac(self) -> dict:
+    def get_interfaces_by_mac(self, *, blacklist_drivers=None) -> dict:
         return net.get_interfaces_by_mac(
             blacklist_drivers=self.blacklist_drivers)

@@ -144,7 +144,9 @@ class Networking(metaclass=abc.ABCMeta):
         expected_macs = set(expected_ifaces.keys())

         # set of current macs
-        present_macs = self.get_interfaces_by_mac().keys()
+        present_macs = self.get_interfaces_by_mac(
+            blacklist_drivers=self.blacklist_drivers
+        ).keys()

         # compare the set of expected mac address values to
         # the current macs present; we only check MAC as cloud-init
diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index dcdf9f8f..0069bd0a 100755
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -344,6 +344,7 @@ class DataSourceAzure(sources.DataSource):
         EventType.BOOT,
         EventType.BOOT_LEGACY
     }}
+    blacklist_drivers = BLACKLIST_DRIVERS

     _negotiated = False
     _metadata_imds = sources.UNSET
@@ -626,7 +627,7 @@ class DataSourceAzure(sources.DataSource):
         except Exception as e:
             LOG.warning("Failed to get system information: %s", e)

-        self.distro.networking.blacklist_drivers = BLACKLIST_DRIVERS
+        self.distro.networking.blacklist_drivers = self.blacklist_drivers

         try:
             crawled_data = util.log_time(
diff --git a/cloudinit/stages.py b/cloudinit/stages.py
index bbded1e9..cc7619b3 100644
--- a/cloudinit/stages.py
+++ b/cloudinit/stages.py
@@ -92,6 +92,14 @@ class Init(object):
             # said datasource and move its distro/system config
             # from whatever it was to a new set...
             if self.datasource is not NULL_DATA_SOURCE:
+                # Certain datasources excl...


Chad Smith (chad.smith) wrote :

I have proposed an upstream PR to fix this inconsistency in handling excluded drivers for azure in stages at https://github.com/canonical/cloud-init/pull/914

Given that Xenial is currently in Extended Security Maintenance for support, I don't know if we will be able to publish a fix into xenial-updates to fix this corner case.

This will only affect fresh launches of Azure Ubuntu PRO FIPS 16.04 (Xenial) images which also have Accelerated networking enabled.

Two possible workarounds in the absence of a cloud-init fix in xenial-updates:
 1. Provide the following #cloud-config userdata when launching Ubuntu PRO FIPS 16.04 with accelerated networking (attached as azure-xenial-pro-fips-workaround.yaml)

#cloud-config
bootcmd:
- "sed -i '/distro = self._distro/i \\ if self.datasource.dsname == \"Azure\":\\n self._distro.networking.blacklist_drivers = [\"mlx4_core\", \"mlx5_core\"]' /usr/lib/python3/dist-packages/cloudinit/stages.py"

OR

 2. Launch an Ubuntu PRO 16.04 (Xenial) VM with accelerated networking, then enable FIPS and reboot:

   ssh <azure_pro_xenial_vm>
   # Add overrides to /etc/ubuntu-advantage/uaclient.conf
   $ printf 'features:\n  allow_xenial_fips_on_cloud: true\n' | sudo tee -a /etc/ubuntu-advantage/uaclient.conf
   $ sudo ua enable fips --assume-yes
   $ sudo reboot

Option 2 works because the SSH keys will already have been generated on first boot, so the traceback on duplicate MAC addresses won't affect accessibility of the VM once it reboots into FIPS mode.
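
For readers less familiar with sed, the edit that workaround 1 performs can be sketched in Python. This only illustrates the insert-before-match behaviour of sed's `i` command: the helper `insert_before_match`, the three-line `sample` stand-in for stages.py, and the indentation of the inserted lines are all assumptions, and plain substring matching is used in place of sed's regular expression.

```python
def insert_before_match(source, needle, new_lines):
    """Return `source` with `new_lines` inserted before each matching line."""
    out = []
    for line in source.splitlines():
        if needle in line:
            out.extend(new_lines)
        out.append(line)
    return "\n".join(out)


# Hypothetical stand-in for the relevant part of stages.py.
sample = "\n".join([
    "def distro(self):",
    "    distro = self._distro",
    "    return distro",
])

# Lines the bootcmd injects (indentation assumed).
patch = [
    '    if self.datasource.dsname == "Azure":',
    '        self._distro.networking.blacklist_drivers = '
    '["mlx4_core", "mlx5_core"]',
]

patched = insert_before_match(sample, "distro = self._distro", patch)
```

The effect is that the driver exclusion list is set on the distro instance for the Azure datasource just before that instance is handed out.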

Chad Smith (chad.smith) wrote :

cloud-config userdata to provide during Azure PRO FIPS 16.04 (Xenial) launch with accelerated networking, via 'az vm create --custom-data azure-xenial*workaround.yaml...'

Éric St-Jean (esj) wrote :

hi,
marking this as fix released, which is not entirely correct i do agree
however, xenial is out of standard maintenance, and at this time we only issue critical security fixes for it
also, given that there are workarounds, i'm closing this issue

Changed in cloud-init (Ubuntu):
status: Triaged → Fix Released