azure advanced networking sometimes triggers duplicate mac detection

Bug #1844191 reported by Ryan Harper
This bug affects 5 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| cloud-init | Fix Released | Critical | Unassigned | |
| cloud-init (Ubuntu) | In Progress | Undecided | Unassigned | |
| Bionic | In Progress | Undecided | Unassigned | |
| Focal | In Progress | Undecided | Unassigned | |
| Jammy | In Progress | Undecided | Unassigned | |
| Kinetic | In Progress | Undecided | Unassigned | |

Bug Description

=== Begin SRU Template ===
[Impact]
When accelerated networking is enabled on Azure, the host presents two network interfaces with the same MAC address to the VM:
a synthetic NIC (netvsc) and a VF NIC, which is enslaved to the synthetic NIC.

The net module already excludes slave NICs when enumerating interfaces. However, if cloud-init starts enumerating after the kernel makes the VF visible to userspace but before the enslaving has finished, cloud-init will see two NICs with the same MAC address.

[Test Case]
Launch an instance with accelerated networking and ensure the instance comes up as expected with no networking-related Tracebacks in /var/log/cloud-init.log
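
The log check in this test case could be scripted roughly as follows. This is only a sketch: the helper name `find_tracebacks` is hypothetical, and it assumes failures show up as standard Python traceback header lines, as they do in the console log quoted in this report.

```python
from pathlib import Path

# Header line that Python prints at the start of every traceback.
TRACEBACK_MARKER = "Traceback (most recent call last):"


def find_tracebacks(log_text):
    """Return (line_number, line) pairs for each traceback header in log_text."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if TRACEBACK_MARKER in line
    ]


def check_log(path="/var/log/cloud-init.log"):
    """Return True when the log at `path` contains no traceback headers."""
    text = Path(path).read_text(errors="replace")
    return not find_tracebacks(text)
```

Calling `check_log()` on the launched instance returns `True` when the log is clean; `find_tracebacks` can be used to print the offending lines otherwise.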

[Regression Potential]
This is already in error handling code and is scoped to a particular driver. A regression here would mean we could allow a cloud-init instance to come up with duplicate macs when we otherwise wouldn't.

[Other info]
The cloud-init team attempted to reproduce this bug but could not. It was reported as being seen in "1 in 1000" launches.

Github PR: https://github.com/canonical/cloud-init/pull/1853

=== End SRU Template ===

Initial bug:

Hi, we're still being affected by this on Azure with 19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image: BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS

Here is the packer config:
````
    "provisioners": [
        {
            "type": "shell",
            "inline": [
                "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
            ]
        },
        {
            "type": "ansible",
            "playbook_file": "{{user `ansible_playbook`}}",
            "user": "packer",
            "extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
        },
        {
            "type": "shell",
            "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
            "inline_shebang": "/bin/sh -x",
            "inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
        }
    ]
````
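
The first provisioner above polls for the boot-finished marker with no upper bound. For illustration only, a bounded version of the same wait might look like the sketch below; the function name and the 600-second default are assumptions, not part of the packer setup described here.

```python
import os
import time


def wait_for_boot_finished(
    marker="/var/lib/cloud/instance/boot-finished",
    timeout=600.0,   # assumed upper bound in seconds, pick what suits the build
    interval=1.0,    # seconds between checks, matching the shell loop's sleep 1
):
    """Poll for cloud-init's boot-finished marker; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(marker):
            return True
        time.sleep(interval)
    # One last check in case the marker appeared during the final sleep.
    return os.path.exists(marker)
```

A `False` return gives the build a chance to fail fast and collect logs instead of waiting indefinitely.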

Here is the playbook:
````
---
- hosts: all
  remote_user: ubuntu
  become: yes
  become_method: sudo
  become_user: root

  environment:
    DEBIAN_FRONTEND: noninteractive
````

Note: we are applying `enableAcceleratedNetworking: true` to the NIC; anecdotally, we think this is related.

Usually our playbook has more in it (obviously) but Azure kept pointing fingers at us that our image was causing the problem, so I ran this test simply deploying a blank deprovisioned image via our same process.

And here's what happens on the serial console log:

````
[ 20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
[ 20.343177] sh[910]: + echo cleaning persistent cloud-init object
[ 20.349027] [ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Synchronized.
sh[910]: cleaning persistent cloud-init object
[ 20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
[ 20.412333] sh[910]: + exit 0
[ 34.282291] cloud-init[938]: Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 18:02:23 +0000. Up 32.02 seconds.
[ 34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: failed stage init-local
[ 34.423057] cloud-init[938]: failed run of stage init-local
[ 34.437716] cloud-init[938]: ------------------------------------------------------------
[ 34.441088] cloud-init[938]: Traceback (most recent call last):
[ 34.443719] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
[ 34.448072] cloud-init[938]: ret = functor(name, args)
[ 34.450532] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
[ 34.454849] cloud-init[938]: init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
[ 34.458725] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in apply_network_config
[ 34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
[ 34.466051] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
[ 34.470673] cloud-init[938]: present_macs = get_interfaces_by_mac().keys()
[ 34.473964] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
[ 34.479325] cloud-init[938]: (name, ret[mac], mac))
[ 34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
[ 34.486614] cloud-init[938]: ------------------------------------------------------------
[FAILED] Failed to start Initial cloud-init job (pre-networking).
See 'systemctl status cloud-init-local.service' for details.
[ OK ] Reached target Network (Pre).
         Starting Network Service...
[ OK ] Started Network Service.
         Starting Wait for Network to be Configured...
         Starting Network Name Resolution...
[ OK ] Started Wait for Network to be Configured.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
````

When this happens, the machine never boots, and we get an OSProvisioningTimedOut error after about 30 minutes, and the machine never reaches healthy state.

Ryan Harper (raharper)
Changed in cloud-init:
importance: Undecided → High
status: New → Triaged
Ryan Harper (raharper) wrote :

I can reproduce this on Azure with advanced networking on 19.2

root@ragged-bond1:~# python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cloudinit import net
>>> import yaml
>>> y = yaml.load(open('/etc/netplan/50-cloud-init.yaml'))
>>> net.wait_for_physdevs(y['network'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
    present_macs = get_interfaces_by_mac().keys()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
    (name, ret[mac], mac))
RuntimeError: duplicate mac found! both 'enP1s1' and 'eth0' have mac '00:0d:3a:6c:d9:80'

Looking at the SR-IOV device, its sysfs attributes include a 'master' entry pointing to eth0, so I think we can reasonably ignore devices that have a 'master' attribute, which marks a device as subordinate to another, as with bonding.

root@ragged-bond1:/usr/lib/python3/dist-packages# diff -u cloudinit/net/__init__.py.orig cloudinit/net/__init__.py
--- cloudinit/net/__init__.py.orig 2019-09-16 21:15:42.550376776 +0000
+++ cloudinit/net/__init__.py 2019-09-16 21:18:26.178760942 +0000
@@ -109,6 +109,10 @@
     return os.path.exists(sys_dev_path(devname, "bonding"))

+def has_master_attr(devname):
+    return os.path.exists(sys_dev_path(devname, path='master'))
+
+
 def is_renamed(devname):
     """
     /* interface name assignment types (sysfs name_assign_type attribute) */
@@ -661,6 +665,9 @@
             continue
         if is_bond(name):
             continue
+        if has_master_attr(name):
+            LOG.debug('Skipping device %s with "master" sysfs attribute', name)
+            continue
         mac = get_interface_mac(name)
         # some devices may not have a mac (tun0)
         if not mac:
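
As a standalone illustration of the same idea (not the cloud-init code itself), the sketch below enumerates a sysfs-style directory and skips any device that has a 'master' entry before checking for duplicate MACs. The function names and the `sys_net` parameter are hypothetical; the parameter exists only so the function can be pointed at a test directory instead of /sys/class/net.

```python
import os


def read_mac(sys_net, name):
    """Return the MAC stored in <sys_net>/<name>/address, or None if unreadable."""
    try:
        with open(os.path.join(sys_net, name, "address")) as handle:
            return handle.read().strip() or None
    except OSError:
        return None


def interfaces_by_mac(sys_net="/sys/class/net"):
    """Map MAC -> interface name, skipping devices that have a 'master' entry."""
    result = {}
    for name in sorted(os.listdir(sys_net)):
        # 'master' is a symlink in sysfs; lexists also catches a dangling link.
        if os.path.lexists(os.path.join(sys_net, name, "master")):
            continue  # subordinate device, e.g. the VF enslaved to eth0
        mac = read_mac(sys_net, name)
        if not mac:
            continue  # some devices may not have a MAC (tun0)
        if mac in result:
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, result[mac], mac)
            )
        result[mac] = name
    return result
```

With the 'master' check in place the VF is never compared against eth0; without a 'master' entry on the VF, the same duplicate error is raised.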

Changed in cloud-init:
importance: High → Critical
status: Triaged → In Progress
Ryan Harper (raharper) wrote :

I've uploaded a version of cloud-init with this patch to a PPA:

% add-apt-repository -y ppa:raharper/bugfixes
% apt install cloud-init

https://launchpad.net/~raharper/+archive/ubuntu/bugfixes/+files/cloud-init_19.2-36-g17b20580-1~bddeb~18.04.1_all.deb

Danno B (slikk66) wrote :

Hi Ryan, our current workflow is to take the DAILY image, create a base image for all our specialized images "base1804" on a bi-weekly basis, and then create a specialized image for each of our services as the code repositories are updated.

How long until you estimate this will natively find itself into the Canonical/UbuntuServer/18.04-DAILY-LTS image?

I'll try to get this installed currently via your deb file until then.

Thank you for your effort on this; you've got the patch out before Azure has even responded to our support ticket.

Danno B (slikk66) wrote :

Patch looks good on our instance! Was able to boot with advanced networking after manually installing this deb file to the image during packer build.

I'll keep the patch in place until I've confirmed it's been merged and released onto the daily image.

Thanks again!

Dan Watkins (oddbloke) wrote :

Added the block-proposed tag so that we can perform manual eoan testing before migration happens.

tags: added: block-proposed
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 059d049c to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=059d049c

Changed in cloud-init:
status: In Progress → Fix Committed
Dragonshadow (gteachey) wrote :

I'd like to confirm: this has not been released in a package update yet, correct? We appear to have hit this same bug.

We're using Accelerated Networking, and adding a second IP to the interface generated the same duplicate MAC error reported here.

I'm not sure if a separate bug report should be made? In our case the machine was already deployed/provisioned, but after adding in a second IP to the NIC we've lost routing and the error is seen.

lilideng (lilideng) wrote :

When will it go into the Azure gallery image?

Chad Smith (chad.smith) wrote :

I apologize for the delay here; this bug should have been set to Fix Released when we released 19.2.36 (which has been published to Ubuntu Xenial, Bionic, Disco and Eoan images as of Oct 10th, I believe). Azure image builds were delayed a bit due to an image build pipeline issue, but Azure also saw these fixes in October. Marking Fix Released on this bug now.

Changed in cloud-init:
status: Fix Committed → Fix Released
Chris Patterson (cjp256) wrote :

Please re-open. There is a race between the device surfacing and getting bonded. If this enumeration happens in between those events, it will fail with duplicate mac error causing other problems.

Brett Holman (holmanb)
Changed in cloud-init:
status: Fix Released → Confirmed
Chad Smith (chad.smith) wrote :

An upstream PR has landed with a fix for this issue, allowing cloud-init to ignore duplicate MACs as seen on Mellanox subordinate devices:
https://github.com/canonical/cloud-init/pull/1853
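
To illustrate the general shape of a driver-scoped exception, here is a rough sketch. It is not the code from the PR: the function name `dedupe_by_driver`, the tuple input, and the driver names `mlx4_core`, `mlx5_core` and `hv_netvsc` are assumptions made for the example.

```python
# Drivers assumed (for this sketch only) to back the Mellanox VF devices.
SUBORDINATE_DRIVERS = {"mlx4_core", "mlx5_core"}


def dedupe_by_driver(devices):
    """Build {mac: name} from (name, mac, driver) tuples.

    A duplicate MAC is tolerated only when one of the two devices uses a
    driver listed in SUBORDINATE_DRIVERS; the other device is kept.
    """
    names = {}
    drivers = {}
    for name, mac, driver in devices:
        if mac in names:
            if driver in SUBORDINATE_DRIVERS:
                continue  # ignore the VF, keep the device already recorded
            if drivers[mac] in SUBORDINATE_DRIVERS:
                # The VF was seen first; replace it with this device.
                names[mac], drivers[mac] = name, driver
                continue
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, names[mac], mac)
            )
        names[mac], drivers[mac] = name, driver
    return names
```

The point of scoping by driver is that duplicates between any other pair of devices still raise, so the existing safety check is preserved everywhere else.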

We have also released this into Ubuntu Lunar 23.04 as cloud-init version 22.4-0ubuntu4.

Our plan is also to queue this up as soon as possible for our next SRU (Stable release update).

Marking this as Fix Released as it will be in the next cloud images build for 23.04.
We will create separate bug tasks on this bug for Bionic, Focal, Jammy and Kinetic when we start the SRU release process for this bug.

In the meantime, https://code.launchpad.net/~cloud-init-dev/+archive/ubuntu/daily has development builds containing this fix for those looking to validate this behavior before an official SRU release to Bionic, Focal, Jammy and Kinetic.

Changed in cloud-init:
status: Confirmed → Fix Released
James Falcon (falcojr)
description: updated
Chad Smith (chad.smith)
Changed in cloud-init (Ubuntu):
status: New → In Progress
Changed in cloud-init (Ubuntu Bionic):
status: New → In Progress
Changed in cloud-init (Ubuntu Focal):
status: New → In Progress
Changed in cloud-init (Ubuntu Jammy):
status: New → In Progress
Changed in cloud-init (Ubuntu Kinetic):
status: New → In Progress