azure advanced networking sometimes triggers duplicate mac detection

Bug #1844191 reported by Ryan Harper
This bug affects 5 people
Affects              Status        Importance  Assigned to  Milestone
cloud-init           Fix Released  Critical    Unassigned
cloud-init (Ubuntu)  Fix Released  Undecided   Unassigned
  Bionic             Fix Released  Undecided   Unassigned
  Focal              Fix Released  Undecided   Unassigned
  Jammy              Fix Released  Undecided   Unassigned
  Kinetic            Fix Released  Undecided   Unassigned

Bug Description

=== Begin SRU Template ===
[Impact]
When Accelerated Networking is enabled on Azure, the host presents two network interfaces with the same MAC address to the VM: a synthetic NIC (netvsc) and a VF NIC, which is enslaved to the synthetic NIC.

The net module already excludes slave NICs when enumerating interfaces. However, if cloud-init starts enumerating after the kernel makes the VF visible to userspace but before the enslaving has finished, cloud-init sees two NICs with the same MAC address and raises a duplicate MAC error.

[Test Case]
Launch an instance with accelerated networking and ensure the instance comes up as expected with no networking-related Tracebacks in /var/log/cloud-init.log
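
The log check in this test case could be scripted roughly as follows. This is only a sketch and not part of cloud-init: the function name `find_failures` and the choice of marker strings are assumptions, taken from the traceback quoted later in this report. The plain "duplicate mac found!" text is deliberately not used as a marker, because fixed versions log it as a WARNING.

```python
# Marker strings are assumptions based on the traceback quoted in this report.
FAILURE_MARKERS = (
    "Traceback (most recent call last)",
    "RuntimeError: duplicate mac found",
)


def find_failures(log_text):
    """Return the log lines that contain any of the failure markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]
```

To use it on an instance, read /var/log/cloud-init.log (root may be required) and pass its contents to `find_failures`; an empty list means no matching lines were found.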

[Regression Potential]
This is already in error handling code and is scoped to a particular driver. A regression here would mean we could allow a cloud-init instance to come up with duplicate macs when we otherwise wouldn't.

[Other info]
The cloud-init team attempted to reproduce this bug but could not. It was reported as being seen in "1 in 1000" launches.

Github PR: https://github.com/canonical/cloud-init/pull/1853

=== End SRU Template ===

Initial bug:

Hi, we're still being affected by this on Azure with 19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image: BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS

Here is the packer config:
````
    "provisioners": [
        {
          "type": "shell",
          "inline": [
            "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
          ]
        },
        {
            "type": "ansible",
            "playbook_file": "{{user `ansible_playbook`}}",
            "user": "packer",
            "extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
        },
        {
            "type": "shell",
            "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
            "inline_shebang": "/bin/sh -x",
            "inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
        }
    ]
````
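
The shell loop in the first provisioner has no upper bound, so if the marker file never appears it keeps polling until something outside the loop stops it. A bounded version of the same wait could look like this sketch. The function name `wait_for_boot_finished` and the default timeout and interval are assumptions for illustration, not anything Packer or cloud-init provides.

```python
import time
from pathlib import Path


def wait_for_boot_finished(marker, timeout=600.0, interval=1.0):
    """Poll for the marker file; return True if it appears before the timeout.

    The 600 second default is an arbitrary illustrative value.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(marker).exists():
            return True
        time.sleep(interval)
    # One last check so a file created during the final sleep is not missed.
    return Path(marker).exists()
```

Calling `wait_for_boot_finished("/var/lib/cloud/instance/boot-finished")` returns False instead of waiting indefinitely, which lets the build fail with a clear message.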

Here is the playbook:
````
---
- hosts: all
  remote_user: ubuntu
  become: yes
  become_method: sudo
  become_user: root

  environment:
    DEBIAN_FRONTEND: noninteractive
````

Note: we are applying `enableAcceleratedNetworking: true` to the NIC. Anecdotally, we think this is related.

Usually our playbook has more in it (obviously) but Azure kept pointing fingers at us that our image was causing the problem, so I ran this test simply deploying a blank deprovisioned image via our same process.

And here's what happens on the serial console log:

````
[ 20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
[ 20.343177] sh[910]: + echo cleaning persistent cloud-init object
[ 20.349027] [ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Synchronized.
sh[910]: cleaning persistent cloud-init object
[ 20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
[ 20.412333] sh[910]: + exit 0
[ 34.282291] cloud-init[938]: Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 18:02:23 +0000. Up 32.02 seconds.
[ 34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: failed stage init-local
[ 34.423057] cloud-init[938]: failed run of stage init-local
[ 34.437716] cloud-init[938]: ------------------------------------------------------------
[ 34.441088] cloud-init[938]: Traceback (most recent call last):
[ 34.443719] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
[ 34.448072] cloud-init[938]: ret = functor(name, args)
[ 34.450532] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
[ 34.454849] cloud-init[938]: init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
[ 34.458725] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in apply_network_config
[ 34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
[ 34.466051] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
[ 34.470673] cloud-init[938]: present_macs = get_interfaces_by_mac().keys()
[ 34.473964] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
[ 34.479325] cloud-init[938]: (name, ret[mac], mac))
[ 34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
[ 34.486614] cloud-init[938]: ------------------------------------------------------------
[FAILED] Failed to start Initial cloud-init job (pre-networking).
See 'systemctl status cloud-init-local.service' for details.
[ OK ] Reached target Network (Pre).
         Starting Network Service...
[ OK ] Started Network Service.
         Starting Wait for Network to be Configured...
         Starting Network Name Resolution...
[ OK ] Started Wait for Network to be Configured.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
````

When this happens, the machine never finishes booting or reaches a healthy state, and we get an OSProvisioningTimedOut error after about 30 minutes.

Ryan Harper (raharper)
Changed in cloud-init:
importance: Undecided → High
status: New → Triaged
Ryan Harper (raharper) wrote :

I can reproduce this on Azure with accelerated networking on 19.2:

root@ragged-bond1:~# python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cloudinit import net
>>> import yaml
>>> y = yaml.load(open('/etc/netplan/50-cloud-init.yaml'))
>>> net.wait_for_physdevs(y['network'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
    present_macs = get_interfaces_by_mac().keys()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
    (name, ret[mac], mac))
RuntimeError: duplicate mac found! both 'enP1s1' and 'eth0' have mac '00:0d:3a:6c:d9:80'

Looking at the SR-IOV device, its sysfs attributes include a 'master' entry pointing to eth0. The 'master' attribute is what device bonding uses, so I think we can reasonably ignore any device that has it.

root@ragged-bond1:/usr/lib/python3/dist-packages# diff -u cloudinit/net/__init__.py.orig cloudinit/net/__init__.py
--- cloudinit/net/__init__.py.orig 2019-09-16 21:15:42.550376776 +0000
+++ cloudinit/net/__init__.py 2019-09-16 21:18:26.178760942 +0000
@@ -109,6 +109,10 @@
     return os.path.exists(sys_dev_path(devname, "bonding"))

+def has_master_attr(devname):
+    return os.path.exists(sys_dev_path(devname, path='master'))
+
+
 def is_renamed(devname):
     """
     /* interface name assignment types (sysfs name_assign_type attribute) */
@@ -661,6 +665,9 @@
             continue
         if is_bond(name):
             continue
+        if has_master_attr(name):
+            LOG.debug('Skipping device %s with "master" sysfs attribute', name)
+            continue
         mac = get_interface_mac(name)
         # some devices may not have a mac (tun0)
         if not mac:
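
For anyone who wants to try the same check outside cloud-init, here is a standalone sketch of the idea. It does not use cloud-init's `sys_dev_path` helper; the `sys_class_net` parameter and the `unenslaved_interfaces` function are assumptions added so the logic can be pointed at a test directory instead of the real sysfs.

```python
import os


def has_master_attr(devname, sys_class_net="/sys/class/net"):
    """True if the interface has a 'master' entry in sysfs, i.e. it is enslaved."""
    return os.path.exists(os.path.join(sys_class_net, devname, "master"))


def unenslaved_interfaces(sys_class_net="/sys/class/net"):
    """List interface names that do not have a 'master' entry (hypothetical helper)."""
    return sorted(
        name
        for name in os.listdir(sys_class_net)
        if not has_master_attr(name, sys_class_net)
    )
```

As later comments on this bug point out, this check only helps once the VF has actually been enslaved; an enumeration that runs before that still sees both devices.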

Changed in cloud-init:
importance: High → Critical
status: Triaged → In Progress
Ryan Harper (raharper) wrote :

I've uploaded a version of cloud-init with this patch to a PPA:

% add-apt-repository -y ppa:raharper/bugfixes
% apt install cloud-init

https://launchpad.net/~raharper/+archive/ubuntu/bugfixes/+files/cloud-init_19.2-36-g17b20580-1~bddeb~18.04.1_all.deb

Danno B (slikk66) wrote :

Hi Ryan, our current workflow is to take the DAILY image, create a base image for all our specialized images "base1804" on a bi-weekly basis, and then create a specialized image for each of our services as the code repositories are updated.

When do you estimate this will land in the Canonical/UbuntuServer/18.04-DAILY-LTS image?

Until then, I'll try installing your deb file.

Thank you for your effort on this. You've got the patch out before Azure has even responded to our support ticket.

Danno B (slikk66) wrote :

Patch looks good on our instance! Was able to boot with advanced networking after manually installing this deb file to the image during packer build.

I'll keep the patch in place until I've confirmed it's been merged and released onto the daily image.

Thanks again!

Dan Watkins (oddbloke) wrote :

Added the block-proposed tag so that we can perform manual eoan testing before migration happens.

tags: added: block-proposed
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 059d049c to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=059d049c

Changed in cloud-init:
status: In Progress → Fix Committed
Dragonshadow (gteachey) wrote :

I'd like to confirm: this has not been released in a package update yet, correct? We appear to have hit the same bug.

We're using Accelerated Networking, and adding a second IP to the interface produced the same duplicate MAC error reported here.

Should a separate bug report be made? In our case the machine was already deployed and provisioned, but after adding a second IP to the NIC we lost routing and the error appeared.

lilideng (lilideng) wrote :

When will this go into the Azure gallery image?

Chad Smith (chad.smith) wrote :

I apologize for the delay here. This bug should have been set to Fix Released when we released 19.2-36, which has been published to Ubuntu Xenial, Bionic, Disco and Eoan images as of Oct 10th, I believe. Azure image builds were delayed a bit due to an image build pipeline issue, but Azure also saw these fixes in October. Marking Fix Released on this bug now.

Changed in cloud-init:
status: Fix Committed → Fix Released
Chris Patterson (cjp256) wrote :

Please re-open. There is a race between the device surfacing and getting bonded. If this enumeration happens in between those events, it will fail with duplicate mac error causing other problems.

Brett Holman (holmanb)
Changed in cloud-init:
status: Fix Released → Confirmed
Chad Smith (chad.smith) wrote :

An upstream PR has landed with a fix for this issue. It allows cloud-init to ignore duplicate MACs seen on Mellanox subordinate devices:
https://github.com/canonical/cloud-init/pull/1853

We have also released this into Ubuntu Lunar 23.04 as cloud-init version 22.4-0ubuntu4.

Our plan is also to queue this up as soon as possible for our next SRU (Stable release update).

Marking this as Fix Released, as it will be in the next cloud image builds for 23.04.
We will create separate bug tasks on this bug for Bionic, Focal, Jammy and Kinetic when we start the SRU release process for this bug.

In the meantime, https://code.launchpad.net/~cloud-init-dev/+archive/ubuntu/daily has development builds containing this fix for those looking to validate this behavior before an official SRU release to Bionic, Focal, Jammy and Kinetic.

Changed in cloud-init:
status: Confirmed → Fix Released
James Falcon (falcojr)
description: updated
Chad Smith (chad.smith)
Changed in cloud-init (Ubuntu):
status: New → In Progress
Changed in cloud-init (Ubuntu Bionic):
status: New → In Progress
Changed in cloud-init (Ubuntu Focal):
status: New → In Progress
Changed in cloud-init (Ubuntu Jammy):
status: New → In Progress
Changed in cloud-init (Ubuntu Kinetic):
status: New → In Progress
Chad Smith (chad.smith)
tags: removed: block-proposed
Chad Smith (chad.smith) wrote :

Dropped the old block-proposed tag, which was stale: it was added for the first iteration of the fix on this bug.

Since the bug was subsequently re-opened and addressed due to race conditions seen at scale, the 'block-proposed' tag no longer applies. Uploads of 22.4.2 are queued for release in lunar-proposed, and we expect this to publish to lunar (and any other series) now that the stale tag is removed.

Chad Smith (chad.smith) wrote :

Additionally, we will perform manual SRU validation of 22.4.2 on Bionic, Focal, Jammy and Kinetic for this bug to ensure we don't introduce a regression. Note that this manual testing on Azure Ubuntu LTS releases was already performed while validating the proposed upstream fix, separately from the SRU verification results.

Brian Murray (brian-murray) wrote : Please test proposed package

Hello Ryan, or anyone else affected,

Accepted cloud-init into kinetic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/22.4.2-0ubuntu0~22.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-kinetic to verification-done-kinetic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-kinetic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Kinetic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-kinetic
Changed in cloud-init (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed-jammy
Brian Murray (brian-murray) wrote :

Hello Ryan, or anyone else affected,

Accepted cloud-init into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/22.4.2-0ubuntu0~22.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
Brian Murray (brian-murray) wrote :

Hello Ryan, or anyone else affected,

Accepted cloud-init into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/22.4.2-0ubuntu0~20.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cloud-init (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Brian Murray (brian-murray) wrote :

Hello Ryan, or anyone else affected,

Accepted cloud-init into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/22.4.2-0ubuntu0~18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

James Falcon (falcojr) wrote :

Running the test case against proposed, I ran the steps in a loop for 4 hours and have seen no regression or error in the logs.

James Falcon (falcojr)
tags: added: verification-done-bionic verification-done-focal verification-done-jammy verification-done-kinetic
removed: verification-needed verification-needed-bionic verification-needed-focal verification-needed-jammy verification-needed-kinetic
Anh Vo (MSFT) (vtqanh) wrote :

I was able to reproduce the issue and verified that the proposed 22.4.2 package addresses it on Bionic. Here's the message in cloud-init.log; I have uploaded the whole cloud-init.log to the bug.

2022-12-01 19:27:00,145 - __init__.py[WARNING]: duplicate mac found! both 'eth1' and 'eth3' have mac '60:45:bd:ba:75:8d'. Ignoring 'eth3' due to driver 'mlx5_core' and 'eth1' having driver hv_netvsc.
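
The behaviour behind that warning can be sketched as below. This is an illustration inferred from the log message only, not the code from the upstream PR: the function name `interfaces_by_mac`, the tuple input format, and the inclusion of `mlx4_core` alongside `mlx5_core` are all assumptions.

```python
# Drivers treated as VF drivers. mlx5_core comes from the log line above;
# mlx4_core is an assumption added for illustration.
VF_DRIVERS = {"mlx4_core", "mlx5_core"}
SYNTHETIC_DRIVER = "hv_netvsc"


def interfaces_by_mac(interfaces):
    """Map MAC -> interface name from (name, mac, driver) tuples.

    When two interfaces share a MAC and one is the hv_netvsc synthetic NIC
    while the other uses a VF driver, keep the synthetic NIC and ignore the VF.
    Any other duplicate still raises RuntimeError.
    """
    by_mac = {}
    driver_for = {}
    for name, mac, driver in interfaces:
        if mac not in by_mac:
            by_mac[mac] = name
            driver_for[mac] = driver
            continue
        kept_driver = driver_for[mac]
        if driver in VF_DRIVERS and kept_driver == SYNTHETIC_DRIVER:
            continue  # the new interface is the VF: ignore it
        if kept_driver in VF_DRIVERS and driver == SYNTHETIC_DRIVER:
            # the VF was seen first: replace it with the synthetic NIC
            by_mac[mac] = name
            driver_for[mac] = driver
            continue
        raise RuntimeError(
            "duplicate mac found! both %r and %r have mac %r"
            % (name, by_mac[mac], mac)
        )
    return by_mac
```

Because the decision is based on drivers instead of the 'master' attribute, the result is the same whether or not enslaving has finished at enumeration time, and it does not depend on the order in which the two interfaces are seen.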

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 22.4.2-0ubuntu0~22.10.1

---------------
cloud-init (22.4.2-0ubuntu0~22.10.1) kinetic; urgency=medium

  * Upstream snapshot based on 22.4.2 upstream release. (LP: #1996645)
    - List of changes from upstream can be found at
      https://raw.githubusercontent.com/canonical/cloud-init/22.4.2/ChangeLog
    - Includes (LP: #1997559, #1844191) not present in 22.4.0.

cloud-init (22.4-0ubuntu0~22.10.1) kinetic; urgency=medium

  * d/control: drop python3-httpretty from Build-Depends
  * d/cloud-init.templates: Add NWCS to datasource list
  * Upstream snapshot based on 22.4 upstream release. (LP: #1996645)
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/22.4/ChangeLog

 -- James Falcon <email address hidden> Mon, 28 Nov 2022 10:45:45 -0600

Changed in cloud-init (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 22.4.2-0ubuntu0~22.04.1

---------------
cloud-init (22.4.2-0ubuntu0~22.04.1) jammy; urgency=medium

  * Upstream snapshot based on 22.4.2 upstream release. (LP: #1996645)
    - List of changes from upstream can be found at
      https://raw.githubusercontent.com/canonical/cloud-init/22.4.2/ChangeLog
    - Includes (LP: #1997559, #1844191) not present in 22.4.0.

cloud-init (22.4-0ubuntu0~22.04.1) jammy; urgency=medium

  * d/control: drop python3-httpretty from Build-Depends
  * d/cloud-init.templates: Add NWCS to datasource list
  * refresh patches:
    + debian/patches/expire-on-hashed-users.patch
  * Upstream snapshot based on 22.4 upstream release. (LP: #1996645)
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/22.4/ChangeLog

 -- James Falcon <email address hidden> Mon, 28 Nov 2022 10:32:24 -0600

Changed in cloud-init (Ubuntu Jammy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 22.4.2-0ubuntu0~20.04.1

---------------
cloud-init (22.4.2-0ubuntu0~20.04.1) focal; urgency=medium

  * Upstream snapshot based on 22.4.2 upstream release. (LP: #1996645)
    - List of changes from upstream can be found at
      https://raw.githubusercontent.com/canonical/cloud-init/22.4.2/ChangeLog
    - Includes (LP: #1997559, #1844191) not present in 22.4.0.

cloud-init (22.4-0ubuntu0~20.04.1) focal; urgency=medium

  * d/control: drop python3-httpretty from Build-Depends
  * d/cloud-init.templates: Add NWCS to datasource list
  * refresh patches:
    + debian/patches/expire-on-hashed-users.patch
  * Upstream snapshot based on 22.4 upstream release. (LP: #1996645)
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/22.4/ChangeLog

 -- James Falcon <email address hidden> Mon, 28 Nov 2022 10:48:49 -0600

Changed in cloud-init (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 22.4.2-0ubuntu0~18.04.1

---------------
cloud-init (22.4.2-0ubuntu0~18.04.1) bionic; urgency=medium

  * Upstream snapshot based on 22.4.2 upstream release. (LP: #1996645)
    - List of changes from upstream can be found at
      https://raw.githubusercontent.com/canonical/cloud-init/22.4.2/ChangeLog
    - Includes (LP: #1997559, #1844191) not present in 22.4.0.

cloud-init (22.4-0ubuntu0~18.04.1) bionic; urgency=medium

  * d/control: drop python3-httpretty from Build-Depends
  * d/cloud-init.templates: Add NWCS to datasource list
  * refresh patches:
    + debian/patches/expire-on-hashed-users.patch
  * Upstream snapshot based on 22.4 upstream release. (LP: #1996645)
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/22.4/ChangeLog

 -- James Falcon <email address hidden> Mon, 28 Nov 2022 10:50:30 -0600

Changed in cloud-init (Ubuntu Bionic):
status: Fix Committed → Fix Released
James Falcon (falcojr)
Changed in cloud-init (Ubuntu):
status: In Progress → Fix Released