azure advanced networking sometimes triggers duplicate mac detection

Bug #1844191 reported by Ryan Harper on 2019-09-16
This bug affects 5 people
Affects: cloud-init
Importance: Critical
Assigned to: Unassigned

Bug Description

Hi, we're still being affected by this on Azure with 19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image: BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS

Here is the packer config:
````
    "provisioners": [
        {
          "type": "shell",
          "inline": [
            "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
          ]
        },
        {
            "type": "ansible",
            "playbook_file": "{{user `ansible_playbook`}}",
            "user": "packer",
            "extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
        },
        {
            "type": "shell",
            "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
            "inline_shebang": "/bin/sh -x",
            "inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
        }
    ]
````
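
The first provisioner above polls for `/var/lib/cloud/instance/boot-finished` with no upper bound, so it would wait forever if cloud-init never completes. As a rough sketch only (the helper name `wait_for_boot_finished` and its default timeout are assumptions, not part of Packer or cloud-init), a bounded version of the same wait might look like this in Python:

```python
import os
import time


def wait_for_boot_finished(marker="/var/lib/cloud/instance/boot-finished",
                           timeout=600.0, interval=1.0):
    """Return True once `marker` exists, or False if `timeout` seconds pass first.

    Hypothetical helper: the marker path is the one polled by the shell
    provisioner; the timeout and interval defaults are arbitrary.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could treat a `False` return as a failed build rather than hanging until the platform times out.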

Here is the playbook:
````
---
- hosts: all
  remote_user: ubuntu
  become: yes
  become_method: sudo
  become_user: root

  environment:
    DEBIAN_FRONTEND: noninteractive
````

Note: we are applying `enableAcceleratedNetworking: true` to the NIC; anecdotally, we think this is related.

Usually our playbook has more in it (obviously), but Azure kept claiming that our image was causing the problem, so I ran this test by simply deploying a blank deprovisioned image through our same process.

And here's what happens on the serial console log:

````
[ 20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
[ 20.343177] sh[910]: + echo cleaning persistent cloud-init object
[ 20.349027] [ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Synchronized.
sh[910]: cleaning persistent cloud-init object
[ 20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
[ 20.412333] sh[910]: + exit 0
[ 34.282291] cloud-init[938]: Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 18:02:23 +0000. Up 32.02 seconds.
[ 34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: failed stage init-local
[ 34.423057] cloud-init[938]: failed run of stage init-local
[ 34.437716] cloud-init[938]: ------------------------------------------------------------
[ 34.441088] cloud-init[938]: Traceback (most recent call last):
[ 34.443719] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
[ 34.448072] cloud-init[938]: ret = functor(name, args)
[ 34.450532] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
[ 34.454849] cloud-init[938]: init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
[ 34.458725] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in apply_network_config
[ 34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
[ 34.466051] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
[ 34.470673] cloud-init[938]: present_macs = get_interfaces_by_mac().keys()
[ 34.473964] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
[ 34.479325] cloud-init[938]: (name, ret[mac], mac))
[ 34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
[ 34.486614] cloud-init[938]: ------------------------------------------------------------
[FAILED] Failed to start Initial cloud-init job (pre-networking).
See 'systemctl status cloud-init-local.service' for details.
[ OK ] Reached target Network (Pre).
         Starting Network Service...
[ OK ] Started Network Service.
         Starting Wait for Network to be Configured...
         Starting Network Name Resolution...
[ OK ] Started Wait for Network to be Configured.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
````

When this happens, the machine never finishes booting or reaches a healthy state, and after about 30 minutes we get an OSProvisioningTimedOut error.
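
For anyone checking whether a failed deployment hit the same error, the message in the traceback above has a fixed shape that can be picked out of a saved serial console log. This is only a sketch: the function name `find_duplicate_mac_errors` is hypothetical, and the pattern is based solely on the message text shown in this report.

```python
import re

# Pattern derived from the RuntimeError text in the serial console log above.
DUP_MAC_RE = re.compile(
    r"duplicate mac found! both '(?P<first>[^']+)' and '(?P<second>[^']+)' "
    r"have mac '(?P<mac>[0-9a-fA-F:]+)'"
)


def find_duplicate_mac_errors(log_text):
    """Return a list of (first_iface, second_iface, mac) tuples found in log_text."""
    return [
        (m.group("first"), m.group("second"), m.group("mac").lower())
        for m in DUP_MAC_RE.finditer(log_text)
    ]
```

Passing the text of the serial console log to `find_duplicate_mac_errors` yields one tuple per occurrence of the error, or an empty list if the error is absent.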


Ryan Harper (raharper) on 2019-09-16
Changed in cloud-init:
importance: Undecided → High
status: New → Triaged
Ryan Harper (raharper) wrote :

I can reproduce this on Azure with advanced networking on 19.2

root@ragged-bond1:~# python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from cloudinit import net
>>> import yaml
>>> y = yaml.load(open('/etc/netplan/50-cloud-init.yaml'))
>>> net.wait_for_physdevs(y['network'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
    present_macs = get_interfaces_by_mac().keys()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
    (name, ret[mac], mac))
RuntimeError: duplicate mac found! both 'enP1s1' and 'eth0' have mac '00:0d:3a:6c:d9:80'

Looking at the SR-IOV device, its sysfs attributes include a 'master' entry pointing to eth0. The 'master' attribute is related to device bonding, so I think we can reasonably ignore devices that have it.

root@ragged-bond1:/usr/lib/python3/dist-packages# diff -u cloudinit/net/__init__.py.orig cloudinit/net/__init__.py
--- cloudinit/net/__init__.py.orig 2019-09-16 21:15:42.550376776 +0000
+++ cloudinit/net/__init__.py 2019-09-16 21:18:26.178760942 +0000
@@ -109,6 +109,10 @@
     return os.path.exists(sys_dev_path(devname, "bonding"))

+def has_master_attr(devname):
+    return os.path.exists(sys_dev_path(devname, path='master'))
+
+
 def is_renamed(devname):
     """
     /* interface name assignment types (sysfs name_assign_type attribute) */
@@ -661,6 +665,9 @@
             continue
         if is_bond(name):
             continue
+        if has_master_attr(name):
+            LOG.debug('Skipping device %s with "master" sysfs attribute', name)
+            continue
         mac = get_interface_mac(name)
         # some devices may not have a mac (tun0)
         if not mac:
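
To illustrate the idea behind the patch outside of cloud-init, here is a simplified, standalone sketch. It is not the cloud-init implementation: the function name `interfaces_by_mac` and the `sys_class_net` parameter are assumptions made so that the logic can be exercised against a fake sysfs tree, and the other filters in the real `get_interfaces_by_mac` (bonds and so on) are omitted.

```python
import os


def interfaces_by_mac(sys_class_net="/sys/class/net"):
    """Map MAC address -> interface name, skipping devices with a 'master' entry.

    Simplified illustration only; raises RuntimeError on a duplicate MAC,
    mirroring the message seen in the traceback.
    """
    by_mac = {}
    for name in sorted(os.listdir(sys_class_net)):
        dev = os.path.join(sys_class_net, name)
        # A device enslaved to another one (here, the SR-IOV device under
        # eth0) exposes a 'master' entry in sysfs; skip it.
        if os.path.lexists(os.path.join(dev, "master")):
            continue
        try:
            with open(os.path.join(dev, "address")) as fh:
                mac = fh.read().strip().lower()
        except OSError:
            continue  # entry has no readable 'address' attribute
        if not mac:
            continue  # some devices may not have a mac (tun0)
        if mac in by_mac:
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, by_mac[mac], mac))
        by_mac[mac] = name
    return by_mac
```

With the 'master' check in place, eth0 and its enslaved SR-IOV device no longer collide; without it, the second device to be visited triggers the duplicate MAC error.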

Changed in cloud-init:
importance: High → Critical
status: Triaged → In Progress
Ryan Harper (raharper) wrote :

I've uploaded a version of cloud-init with this patch to a PPA:

% add-apt-repository -y ppa:raharper/bugfixes
% apt install cloud-init

https://launchpad.net/~raharper/+archive/ubuntu/bugfixes/+files/cloud-init_19.2-36-g17b20580-1~bddeb~18.04.1_all.deb

Danno B (slikk66) wrote :

Hi Ryan, our current workflow is to take the DAILY image, create a base image for all our specialized images "base1804" on a bi-weekly basis, and then create a specialized image for each of our services as the code repositories are updated.

How long do you estimate it will take for this fix to land in the Canonical/UbuntuServer/18.04-DAILY-LTS image?

In the meantime, I'll try installing it via your deb file.

Thank you for your effort on this; you got the patch out before Azure even responded to our support ticket.

Danno B (slikk66) wrote :

Patch looks good on our instance! Was able to boot with advanced networking after manually installing this deb file to the image during packer build.

I'll keep the patch in place until I've confirmed it's been merged and released onto the daily image.

Thanks again!

Dan Watkins (daniel-thewatkins) wrote :

Added the block-proposed tag so that we can perform manual eoan testing before migration happens.

tags: added: block-proposed

This bug is fixed with commit 059d049c to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=059d049c

Changed in cloud-init:
status: In Progress → Fix Committed
Dragonshadow (gteachey) wrote :

I'd like to confirm: this has not been released in a package update yet, correct? We appear to have hit this same bug.

We're using Accelerated Networking, and adding a second IP to the interface generated the same duplicate MAC error reported here.

I'm not sure whether a separate bug report should be made. In our case the machine was already deployed and provisioned, but after adding a second IP to the NIC we lost routing and the error appeared.

lilideng (lilideng) wrote :

When will it go into the Azure gallery image?
