RuntimeError: duplicate mac found! both 'swp1' and 'swp3' have mac '32:98:88:9c:2d:29'

Bug #1997922 reported by Aristo Chen
This bug affects 1 person
Affects               Status        Importance  Assigned to   Milestone
OEM Priority Project  Fix Released  High        Unassigned
cloud-init            Fix Released  High        James Falcon

Bug Description

Hi,

This is Aristo from the OEM Enablement team in Taiwan. I am currently enabling a device that has 1 Ethernet port and 4 Ethernet switch ports, and I get the following error on first boot:
"""
[ 22.855169] cloud-init[519]: Cloud-init v. 22.3.4-0ubuntu1~22.04.1 running 'init-local' at Fri, 25 Nov 2022 01:23:27 +0000. Up 22.75 seconds.
[ 23.745575] cloud-init[519]: 2022-11-25 01:23:28,899 - util.py[WARNING]: failed stage init-local
[ 23.764650] cloud-init[519]: failed run of stage init-local
[ 23.780376] cloud-init[519]: ------------------------------------------------------------
[ 23.796379] cloud-init[519]: Traceback (most recent call last):
[ 23.812604] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 767, in status_wrapper
[ 23.832472] cloud-init[519]: ret = functor(name, args)
[ 23.848500] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 433, in main_init
[ 23.869966] cloud-init[519]: init.apply_network_config(bring_up=bring_up_interfaces)
[ 23.888410] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 922, in apply_network_config
[ 23.908494] cloud-init[519]: self.distro.networking.wait_for_physdevs(netcfg)
[ 23.928436] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 148, in wait_for_physdevs
[ 23.952413] cloud-init[519]: present_macs = self.get_interfaces_by_mac().keys()
[ 23.972380] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 75, in get_interfaces_by_mac
[ 23.996508] cloud-init[519]: return net.get_interfaces_by_mac(
[ 24.012399] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 926, in get_interfaces_by_mac
[ 24.036393] cloud-init[519]: return get_interfaces_by_mac_on_linux(
[ 24.056387] cloud-init[519]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 1007, in get_interfaces_by_mac_on_linux
[ 24.080426] cloud-init[519]: raise RuntimeError(
[ 24.099221] cloud-init[519]: RuntimeError: duplicate mac found! both 'swp1' and 'swp3' have mac '9a:57:7d:78:47:c0'
[ 24.120454] cloud-init[519]: ------------------------------------------------------------

"""

The network-config is
"""
#cloud-config
version: 2
ethernets:
  enp0s0f0:
    dhcp4: true
    optional: true
"""

Here are all the interfaces:
"""
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 16:11:29:db:df:62 brd ff:ff:ff:ff:ff:ff
3: enp0s0f2: <BROADCAST,MULTICAST> mtu 1520 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:57:7d:78:47:c0 brd ff:ff:ff:ff:ff:ff
4: can0: <NOARP,ECHO> mtu 16 qdisc noop state DOWN group default qlen 10
    link/can
5: can1: <NOARP,ECHO> mtu 16 qdisc noop state DOWN group default qlen 10
    link/can
6: swp0@enp0s0f2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:57:7d:78:47:c0 brd ff:ff:ff:ff:ff:ff
7: swp1@enp0s0f2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:57:7d:78:47:c0 brd ff:ff:ff:ff:ff:ff
8: swp2@enp0s0f2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:57:7d:78:47:c0 brd ff:ff:ff:ff:ff:ff
9: swp3@enp0s0f2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:57:7d:78:47:c0 brd ff:ff:ff:ff:ff:ff

"""
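To make the collision easier to see, here is a minimal sketch that groups the interfaces from the `ip a` output above by MAC address. It is only an illustration: the helper names (group_by_mac, find_duplicate_macs) are made up for this example and are not cloud-init functions, and the interface list is copied by hand from the listing (lo, can0 and can1 are left out because they have no Ethernet address to compare).

```python
from collections import defaultdict

# (name, mac) pairs copied from the `ip a` output in this report.
INTERFACES = [
    ("enp0s0f0", "16:11:29:db:df:62"),
    ("enp0s0f2", "9a:57:7d:78:47:c0"),
    ("swp0", "9a:57:7d:78:47:c0"),
    ("swp1", "9a:57:7d:78:47:c0"),
    ("swp2", "9a:57:7d:78:47:c0"),
    ("swp3", "9a:57:7d:78:47:c0"),
]


def group_by_mac(interfaces):
    """Return {mac: [names...]} with MACs normalised to lower case."""
    groups = defaultdict(list)
    for name, mac in interfaces:
        groups[mac.lower()].append(name)
    return dict(groups)


def find_duplicate_macs(interfaces):
    """Return only the MACs that are shared by more than one interface."""
    return {
        mac: names
        for mac, names in group_by_mac(interfaces).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for mac, names in find_duplicate_macs(INTERFACES).items():
        print(f"{mac} is shared by: {', '.join(names)}")
```

On this device the shared MAC covers enp0s0f2 and all four swp ports, so the pair named in the error message (swp1 and swp3) is just the first collision that happened to be reported.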

Please let me know if you need any further info from me, thanks!

Aristo Chen (aristochen) wrote :

This patch fixes the issue, though it may not be the proper way to do it. Please let me know if you have any concerns about this patch, thanks!

Chad Smith (chad.smith) wrote :

Thanks for filing a bug, for the patch suggestion, and for making cloud-init better. Please also run 'sudo cloud-init collect-logs' and attach the resulting cloud-init.tar.gz so we can see the order of driver loads (and /var/log/cloud-init.log). That will help us triage this issue and confirm the best course of action here.

Changed in cloud-init:
status: New → Incomplete
Chad Smith (chad.smith) wrote :

Marking bug status 'Incomplete' until we have the collect-logs output. Also, please set the bug status back to 'New' when the logs are attached to make sure our team sees the update and addresses this bug.

Chad Smith (chad.smith) wrote :
Chad Smith (chad.smith) wrote :

Ideally, we'd probably also like to see the output of:

cat > get_driver.py <<EOF
from cloudinit.net import device_driver
import sys

print(device_driver(sys.argv[1]))
EOF

for nic in enp0s0fs swp0 swp1 swp2 swp3; do
  echo "----- $nic"
  ls -l "/sys/class/net/$nic/"
  python3 ./get_driver.py "$nic"
done
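
For readers without cloud-init installed, the driver name can also be read straight from sysfs. The sketch below assumes the driver is exposed as the symlink /sys/class/net/<dev>/device/driver (as seen in the Hyper-V listing later in this thread). The function driver_name is a hypothetical helper written for this note and is not claimed to match how cloudinit.net.device_driver works internally.

```python
import os


def driver_name(dev, sys_class_net="/sys/class/net"):
    """Return the driver bound to `dev`, or None if it cannot be determined.

    Assumption: <sys_class_net>/<dev>/device/driver is a symlink whose
    target's last path component is the driver name.
    """
    link = os.path.join(sys_class_net, dev, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        # Missing device, virtual device without a driver link, or not a symlink.
        return None


if __name__ == "__main__":
    import sys

    for nic in sys.argv[1:]:
        print(f"{nic}: {driver_name(nic)}")
```

The sys_class_net parameter exists only so the helper can be pointed at a fake directory tree when testing.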

Aristo Chen (aristochen) wrote :

Hi,

Here is the cloud-init.tar.gz and the output (https://pastebin.canonical.com/p/79csrK6Njw/) requested in #5.

please let me know if you need any info from my side

thanks!

James Falcon (falcojr) wrote :

Thank you for the additional logs. I can see the problem.

Can you help me understand the use case for running cloud-init on a switch? Would it be acceptable to disable cloud-init's networking configuration? Does cloud-init even need to run at all?

Aristo Chen (aristochen) wrote :

Hi,

Cloud-init is used by default when the OEM Enablement team builds an image for our customers. The device has a standalone Ethernet port, and cloud-init is used to set the network configuration for that port.

Is there any other setting I can define in the "network-config" to tell cloud-init to ignore those 4 switch ports?

Brett Holman (holmanb) wrote :

Hi WeiMing Chen,

In order to disable cloud-init networking, please set the following in /etc/cloud/cloud.cfg.

network:
  config: disabled

More information here[1].

Please let us know if this solution is acceptable for your use case.

[1] https://cloudinit.readthedocs.io/en/latest/topics/network-config.html#disabling-network-configuration

Aristo Chen (aristochen) wrote :

Hi,

Thanks for your reply, but I don't think that is acceptable for my use case.

Just to clarify, there are 5 Ethernet ports on the device. One of them is a standalone Ethernet port used for the internet connection, and it is the only one I defined in network-config.

The other 4 Ethernet ports on the device belong to a switch, and those 4 ports somehow have the same MAC address, which causes cloud-init to crash.

Is it possible for cloud-init to only show a warning for a duplicated MAC address instead of crashing?

shixuantong (sxt1001) wrote :

Is this possible in other application scenarios? If so, I think it's better to show a warning than to crash.

shixuantong (sxt1001) wrote :

I found that there was also a problem checking for the same mac address in the function get_ib_hwaddrs_by_interface().

https://github.com/canonical/cloud-init/blob/dc1d27bae63b51a925e40a80475ae45be62b3857/cloudinit/net/__init__.py#L1133

Aristo Chen (aristochen) wrote :

Hi,

Any update on this issue? thanks!

Robert Liu (robertliu)
Changed in oem-priority:
importance: Undecided → High
tags: added: originate-from-1998894
tags: added: oem-priority
James Falcon (falcojr) wrote :

I understand what you're trying to accomplish and how that doesn't work with cloud-init, but I think we want a different approach than what you've specified in your patch. Cloud-init was never intended to be run on a device like a switch, and so on one hand this feels outside the scope of something cloud-init should be dealing with. On the other hand, we do understand that there are some limited use cases where duplicate mac addresses may be valid. Rather than adding one-off driver checks, we'd like to add something a little more general-use.

We can add a config option in /etc/cloud/cloud.cfg named something like "warn_on_duplicate_mac", which when set to true will warn rather than traceback. Will this work for you?
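
As a rough illustration of that proposal, the sketch below builds a MAC-to-name map and either raises or warns on a collision depending on a flag. Everything here is an assumption for discussion: the function interfaces_by_mac and its warn_on_duplicate_mac parameter are hypothetical, the choice to keep the first interface seen is arbitrary, and the option that finally ships in cloud-init may be named and behave differently.

```python
import logging

LOG = logging.getLogger("duplicate_mac_sketch")


def interfaces_by_mac(interfaces, warn_on_duplicate_mac=False):
    """Map mac -> interface name from an iterable of (name, mac) pairs.

    With warn_on_duplicate_mac=False a collision raises RuntimeError,
    mirroring the traceback in this report. With True, the collision is
    logged and the first interface seen for that MAC is kept.
    """
    by_mac = {}
    for name, mac in interfaces:
        mac = mac.lower()
        if mac in by_mac:
            msg = "duplicate mac found! both '%s' and '%s' have mac '%s'" % (
                name,
                by_mac[mac],
                mac,
            )
            if not warn_on_duplicate_mac:
                raise RuntimeError(msg)
            LOG.warning(msg)
            continue
        by_mac[mac] = name
    return by_mac
```

With the flag set, a caller such as the wait-for-physical-devices step would still see enp0s0f0 in the map, which is the only interface the network-config in this report refers to.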

Aristo Chen (aristochen) wrote :

Hi,

thanks for your reply, that should work for me, thanks!

James Falcon (falcojr)
Changed in cloud-init:
status: Incomplete → Triaged
importance: Undecided → High
Chad Smith (chad.smith) wrote :

One of the solutions we discussed was the ability to (on Sysfs-based environments) validate leader/subordinate (upper/lower) distinctions for devices based on whether the virtual functions are created by the driver once it configures the device. Typically subordinate devices will have either a 'master' or 'upper_<upper_dev_name>' symlink in '/sys/class/net/<lower_devName>/device/'. A leader will have a 'lower_<lower_dev_name>' symlink to indicate that it is the 'upper'.

The problem we have with just relying on the upper/lower relationship being set up is that there is a race present if cloud-init walks /sys/class/net before drivers finish initial configuration and setup of the upper/lower relationships for the network VFs. When cloud-init beats driver setup, those upper_|lower_ symlinks aren't surfaced yet by the kernel.

So, ultimately we still have to rely on driver names to inform us of the unlikely scenarios where a network device or switch may have duplicate MACs for upper and lower functions, but those devices have not yet been fully configured.

A gap of as much as 40 seconds can exist between driver detection for a device and the final upper_|lower_ configuration, as seen in /sys for this Hyper-V system where duplicate MACs are expected:

ubuntu@hyper-v-timestamp:~$ for dev in eth0 enP62764s1; do echo ---- $dev; ls -l --full-time /sys/class/net/$dev/device/driver; ls -l --full-time /sys/class/net/$dev/ | egrep 'upper|lower|master'; done
---- eth0
lrwxrwxrwx 1 root root 0 2023-01-31 21:25:18.651923500 +0000 /sys/class/net/eth0/device/driver -> ../../../../../../bus/vmbus/drivers/hv_netvsc
lrwxrwxrwx 1 root root 0 2023-01-31 21:25:27.864907346 +0000 lower_enP62764s1 -> ../../../1d6f00a2-f52c-4ea5-9bf1-5cbb5824b1d3/pcif52c:00/f52c:00:02.0/net/enP62764s1
---- enP62764s1
lrwxrwxrwx 1 root root 0 2023-01-31 21:25:19.227923500 +0000 /sys/class/net/enP62764s1/device/driver -> ../../../../../../../../bus/pci/drivers/mlx5_core
lrwxrwxrwx 1 root root 0 2023-01-31 21:25:20.923923500 +0000 master -> ../../../../../000d3a1e-a3af-000d-3a1e-a3af000d3a1e/net/eth0
lrwxrwxrwx 1 root root 0 2023-01-31 21:26:05.957294183 +0000 upper_eth0 -> ../../../../../000d3a1e-a3af-000d-3a1e-a3af000d3a1e/net/eth0
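
The symlink check described above could be sketched as follows. This is a minimal illustration and not cloud-init code: link_relations and classify are hypothetical helpers, and they look for 'master', 'upper_*' and 'lower_*' entries directly under /sys/class/net/<dev>/, which is where the listing above shows them. Because of the race just described, an "unknown" result only means the links are not visible yet, not that the device is standalone.

```python
import os


def link_relations(dev, sys_class_net="/sys/class/net"):
    """Return the sorted master/upper_*/lower_* entry names found for `dev`."""
    dev_dir = os.path.join(sys_class_net, dev)
    try:
        entries = os.listdir(dev_dir)
    except OSError:
        # Device directory missing or unreadable: nothing to report.
        return []
    return sorted(
        e for e in entries
        if e == "master" or e.startswith(("upper_", "lower_"))
    )


def classify(dev, sys_class_net="/sys/class/net"):
    """Classify `dev` as 'subordinate', 'leader' or 'unknown'.

    'unknown' covers both standalone devices and devices whose driver has
    not finished creating the upper/lower links.
    """
    relations = link_relations(dev, sys_class_net)
    if any(e == "master" or e.startswith("upper_") for e in relations):
        return "subordinate"
    if any(e.startswith("lower_") for e in relations):
        return "leader"
    return "unknown"
```

Applied to the Hyper-V listing, eth0 would be classified as the leader and enP62764s1 as the subordinate, but only once the links have appeared.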

Chad Smith (chad.smith) wrote :

Upstream pull request with a proposed fix for this issue: https://github.com/canonical/cloud-init/pull/1988

Thanks for the patch submittal

Changed in cloud-init:
status: Triaged → In Progress
assignee: nobody → James Falcon (falcojr)
Aristo Chen (aristochen) wrote :

Hi,

Thanks for the effort. May I know when this fix will land in Jammy? We are expecting to release our next image by mid-March.

James Falcon (falcojr) wrote :

This feature has made it into cloud-init 23.1, which is expected to be released tomorrow, followed by an SRU into the supported Ubuntu series (including Jammy). The SRU process generally takes us 1-2 weeks depending on any issues found during testing, so likely by March 8.

Alberto Contreras (aciba) wrote : Fixed in cloud-init version 23.1.

This bug is believed to be fixed in cloud-init in version 23.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: In Progress → Fix Released
Aristo Chen (aristochen) wrote :

Hi,

Thanks for all the effort! I can no longer reproduce this bug

Aristo Chen (aristochen)
Changed in oem-priority:
status: New → Fix Released
James Falcon (falcojr) wrote :