"Predictable Network Interface Naming" suddenly changed the name from enp96s0f0 to enp96s0f0np0 causing outage

Bug #2085835 reported by Stoyan Atanasov
This bug affects 2 people
Affects           Status     Importance  Assigned to  Milestone
linux (Ubuntu)    Confirmed  Undecided   Unassigned
systemd (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

We have three identical servers (sleds in a blade chassis).
The NIC name on one of the servers suddenly changed, causing an outage: the netplan config could not assign an IP address to the interface because of the name change.

enp96s0f0 to enp96s0f0np0

The two other servers still have the enp96s0f0 name. No BIOS or other configuration difference is detectable between the three systems. Firmware and software are all the same.

I tried to debug this issue myself, but it turned out to be a very niche and complicated topic.
I went to submit an issue on https://github.com/systemd/systemd/issues, but they only accept issues for newer versions.

System info:

dundts@mongodb-rs2:~$ lsb_release -rd
Description: Ubuntu 22.04.3 LTS
Release: 22.04

dundts@mongodb-rs2:~$ apt-cache policy systemd
systemd:
  Installed: 249.11-0ubuntu3.11
  Candidate: 249.11-0ubuntu3.12
  Version table:
     249.11-0ubuntu3.12 500
        500 http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
 *** 249.11-0ubuntu3.11 100
        100 /var/lib/dpkg/status
     249.11-0ubuntu3.7 500
        500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
     249.11-0ubuntu3 500
        500 http://de.archive.ubuntu.com/ubuntu jammy/main amd64 Packages

dundts@mongodb-rs2:~$ sudo dmesg | grep i40e
[sudo] password for dundts:
[ 3.657531] i40e: Intel(R) Ethernet Connection XL710 Network Driver
[ 3.658208] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
[ 3.715866] i40e 0000:60:00.0: fw 3.1.54559 api 1.5 nvm 3.2d 0x80000b4b 1.1767.0 [8086:37d3] [152d:8a40]
[ 3.720222] i40e 0000:60:00.0: MAC address: d8:c4:97:4c:66:ae
[ 3.720558] i40e 0000:60:00.0: FW LLDP is enabled
[ 3.732613] i40e 0000:60:00.0: Added LAN device PF0 bus=0x60 dev=0x00 func=0x00
[ 3.733075] i40e 0000:60:00.0: Features: PF-id[0] VFs: 32 VSIs: 66 QP: 32 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[ 3.788671] i40e 0000:60:00.1: fw 3.1.54559 api 1.5 nvm 3.2d 0x80000b4b 1.1767.0 [8086:37d3] [152d:8a40]
[ 3.804478] i40e 0000:60:00.1: MAC address: d8:c4:97:4c:66:af
[ 3.804964] i40e 0000:60:00.1: FW LLDP is enabled
[ 3.839982] i40e 0000:60:00.1 eth1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[ 3.851939] i40e 0000:60:00.1: Added LAN device PF1 bus=0x60 dev=0x00 func=0x01
[ 3.877994] i40e 0000:60:00.1: Features: PF-id[1] VFs: 32 VSIs: 66 QP: 32 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve PTP VEPA
[ 4.065330] i40e 0000:60:00.0 enp96s0f0np0: renamed from eth0
[ 4.096320] i40e 0000:60:00.1 enp96s0f1np1: renamed from eth1

dundts@mongodb-rs2:~$ sudo udevadm info /sys/class/net/enp96s0f0np0
P: /devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/net/enp96s0f0np0
L: 0
E: DEVPATH=/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/net/enp96s0f0np0
E: INTERFACE=enp96s0f0np0
E: IFINDEX=2
E: SUBSYSTEM=net
E: USEC_INITIALIZED=4079509
E: ID_MM_CANDIDATE=1
E: ID_NET_NAMING_SCHEME=v249
E: ID_NET_NAME_MAC=enxd8c4974c66ae
E: ID_OUI_FROM_DATABASE=Quanta Computer Inc.
E: ID_NET_NAME_PATH=enp96s0f0np0
E: ID_BUS=pci
E: ID_VENDOR_ID=0x8086
E: ID_MODEL_ID=0x37d3
E: ID_PCI_CLASS_FROM_DATABASE=Network controller
E: ID_PCI_SUBCLASS_FROM_DATABASE=Ethernet controller
E: ID_VENDOR_FROM_DATABASE=Intel Corporation
E: ID_MODEL_FROM_DATABASE=Ethernet Connection X722 for 10GbE SFP+
E: ID_PATH=pci-0000:60:00.0
E: ID_PATH_TAG=pci-0000_60_00_0
E: ID_NET_DRIVER=i40e
E: ID_NET_LINK_FILE=/usr/lib/systemd/network/99-default.link
E: ID_NET_NAME=enp96s0f0np0
E: NM_UNMANAGED=1
E: SYSTEMD_ALIAS=/sys/subsystem/net/devices/enp96s0f0np0
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:

Revision history for this message
Nick Rosbrook (enr0n) wrote :

By default (and as we can see in the ID_NET_LINK_FILE= property above), the name policy is controlled by NamePolicy= in /usr/lib/systemd/network/99-default.link. On Jammy, this looks like:

root@jammy:~# cat /lib/systemd/network/99-default.link
# SPDX-License-Identifier: LGPL-2.1-or-later
#
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.

[Match]
OriginalName=*

[Link]
NamePolicy=keep kernel database onboard slot path
AlternativeNamesPolicy=database onboard slot path
MACAddressPolicy=persistent

Basically, udev will try those policies in order until one matches. Based on the udev properties above, it looks like the 'path' policy was selected. This *might* be different than before, or maybe something else changed on the system that then changed the resulting name constructed by udev.
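
As a rough illustration of the "first policy that matches wins" behaviour described above, the following shell sketch reads udev properties (KEY=VALUE lines, as printed by udevadm info -q property) and reports which NamePolicy= entry would supply the name. It is a simplification, not udev's actual implementation: the function name pick_policy_name is made up for this example, the 'keep' and 'kernel' entries are skipped, and the mapping of each policy to its ID_NET_NAME_* property is taken from systemd.link(5).

```shell
# Report which NamePolicy= entry would supply the interface name, given
# udev properties (KEY=VALUE lines) on stdin.
# Simplified sketch: 'keep' and 'kernel' are not modelled.
pick_policy_name() {
    props=$(cat)
    # Same order as NamePolicy= in 99-default.link (minus keep/kernel).
    for entry in database:ID_NET_NAME_FROM_DATABASE \
                 onboard:ID_NET_NAME_ONBOARD \
                 slot:ID_NET_NAME_SLOT \
                 path:ID_NET_NAME_PATH; do
        policy=${entry%%:*}
        key=${entry#*:}
        # Take the value of the first line that starts with "<key>=".
        value=$(printf '%s\n' "$props" | sed -n "s/^$key=//p" | head -n 1)
        if [ -n "$value" ]; then
            printf '%s %s\n' "$policy" "$value"
            return 0
        fi
    done
    return 1   # none of the four properties is set
}
```

For example, running udevadm info -q property /sys/class/net/enp96s0f0np0 | pick_policy_name on the affected machine should print "path enp96s0f0np0", since ID_NET_NAME_PATH is the only one of the four properties set in the output from the bug description.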

Can you please enable debug logs for udev (ideally on both a working and non-working system), and share them here? E.g.

$ mkdir -p /etc/systemd/system/systemd-udevd.service.d
$ cat > /etc/systemd/system/systemd-udevd.service.d/debug.conf << EOF
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
EOF

Then, reboot, and then run:

$ journalctl -u systemd-udevd.service -b > udev.log

and attach the result.

Revision history for this message
Stoyan Atanasov (stoyanstatanasov) wrote : Re: [Bug 2085835] Re: "Predictable Network Interface Naming" suddenly changed the name from enp96s0f0 to enp96s0f0np0 causing outage

Thank you for your quick response.

We are very interested in solving this issue, as these servers are part of our critical infrastructure (a MongoDB replica set), so I secured permission to restart them.

Attached are the udev.log files. The "abnormal" one is from the machine where we had the issue, and the "normal" one is from one of our other machines that is unaffected.
I have zipped the files, as they were quite large (~6 MB).

Revision history for this message
Nick Rosbrook (enr0n) wrote :

Thanks for the logs!

So, I think in both cases, udev *is* using the 'path' rename policy. However, it seems that on *one* system, the kernel has a value for /sys/class/net/<iface>/phys_port_name, which results in the addition to the interface name.

Can you please run:

$ cat /sys/class/net/<iface>/phys_port_name

on each machine (with the correct interface name), and report back? Also, can you report the kernel versions (uname -r) from each machine?

My new suspicion is that you have a different/newer kernel on one machine, and that the network driver gained support for phys_port_name_show().
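
To make the effect of phys_port_name on the name concrete, here is a small shell sketch of how a path-style name is put together from the PCI address and the optional port name, following the p<bus>s<slot>f<function>[n<phys_port_name>] pattern documented in systemd.net-naming-scheme(7). The function name path_name is invented for this example, and it assumes PCI domain 0, an Ethernet device ("en" prefix) and a multi-function device (so the f<function> part is always emitted); the real logic in udev handles more cases.

```shell
# Build a path-style interface name from a PCI address and an optional
# phys_port_name. Sketch only; see the assumptions stated above.
path_name() {
    addr=$1            # PCI address, e.g. 0000:60:00.0
    port=${2:-}        # phys_port_name, e.g. p0; empty if not reported
    bus=${addr#*:}     # strip the domain      -> 60:00.0
    bus=${bus%%:*}     # keep the bus          -> 60
    rest=${addr##*:}   # slot.function         -> 00.0
    slot=${rest%%.*}
    func=${rest##*.}
    # Bus and slot are hexadecimal in the address, decimal in the name.
    name="enp$((0x$bus))s$((0x$slot))f$((0x$func))"
    if [ -n "$port" ]; then
        name="${name}n${port}"
    fi
    printf '%s\n' "$name"
}
```

With the address 0000:60:00.0 from the dmesg output, path_name 0000:60:00.0 gives enp96s0f0, while path_name 0000:60:00.0 p0 gives enp96s0f0np0, which is the difference seen between the machines.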

Revision history for this message
Stoyan Atanasov (stoyanstatanasov) wrote :

Hi Nick,
Thank you for the amazing and timely work!

I checked the kernel versions and you are correct! The versions are different. I don't know how this happened. Is there a way to find out how the kernel version changed?

on the normal machine:
---------
dundts@mongodb-rs3:~$ uname -r
5.15.0-124-generic
dundts@mongodb-rs3:~$ sudo cat /sys/class/net/enp96s0f0/phys_port_name
cat: /sys/class/net/enp96s0f0/phys_port_name: Operation not supported

on the abnormal machine:
---------
dundts@mongodb-rs2:~$ uname -r
6.8.0-47-generic
dundts@mongodb-rs2:~$ sudo cat /sys/class/net/enp96s0f0np0/phys_port_name
p0

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed

Revision history for this message
Nick Rosbrook (enr0n) wrote :

The output of:

$ apt policy linux-image-generic

should show you where the kernel is installed from (e.g. jammy vs. jammy-updates). However, based on the version numbers seen in $(uname -r), it appears you are running 22.04 (jammy) on one machine, and 24.04 (noble) on another? Or, at least, the kernel versions suggest that.

Changed in linux (Ubuntu):
status: New → Confirmed

Revision history for this message
Stoyan Atanasov (stoyanstatanasov) wrote :

Sorry to take so long, I asked my colleagues - nobody says they have done anything to the machine. Maybe they don't remember...

here is the output of the "abnormal" machine:
dundts@mongodb-rs2:~$ uname -r
6.8.0-47-generic
dundts@mongodb-rs2:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
dundts@mongodb-rs2:~$ apt policy linux-image-generic
linux-image-generic:
  Installed: (none)
  Candidate: 5.15.0.125.124
  Version table:
     5.15.0.125.124 500
        500 http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
     5.15.0.25.27 500
        500 http://de.archive.ubuntu.com/ubuntu jammy/main amd64 Packages
dundts@mongodb-rs2:~$

and here is one of the other two "normal" machines:

dundts@mongodb-rs3:~$ uname -r
5.15.0-124-generic
dundts@mongodb-rs3:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
dundts@mongodb-rs3:~$ apt policy linux-image-generic
linux-image-generic:
  Installed: 5.15.0.125.124
  Candidate: 5.15.0.125.124
  Version table:
 *** 5.15.0.125.124 500
        500 http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
        100 /var/lib/dpkg/status
     5.15.0.25.27 500
        500 http://de.archive.ubuntu.com/ubuntu jammy/main amd64 Packages
dundts@mongodb-rs3:~$

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

The 6.8.0 kernel has likely been installed via the linux-generic-hwe-22.04 package. Can you show the apt policy output for that one, too?

Revision history for this message
Stoyan Atanasov (stoyanstatanasov) wrote :

Sure!

Here is the output from the "abnormal" machine:
dundts@mongodb-rs2:~$ apt-cache policy linux-generic-hwe-22.04
linux-generic-hwe-22.04:
  Installed: 6.8.0-48.48~22.04.1
  Candidate: 6.8.0-48.48~22.04.1
  Version table:
 *** 6.8.0-48.48~22.04.1 500
        500 http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
        100 /var/lib/dpkg/status
     5.15.0.25.27 500
        500 http://de.archive.ubuntu.com/ubuntu jammy/main amd64 Packages
dundts@mongodb-rs2:~$

Here is the output from the "normal" machine:
dundts@mongodb-rs3:~$ apt-cache policy linux-generic-hwe-22.04
linux-generic-hwe-22.04:
  Installed: (none)
  Candidate: 6.8.0-48.48~22.04.1
  Version table:
     6.8.0-48.48~22.04.1 500
        500 http://de.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
     5.15.0.25.27 500
        500 http://de.archive.ubuntu.com/ubuntu jammy/main amd64 Packages
dundts@mongodb-rs3:~$
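
On the earlier question of how the kernel version changed: one place to look is the dpkg log, which on Ubuntu is normally /var/log/dpkg.log and records one line per package action. The sketch below lists the install and upgrade actions for kernel packages; the function name kernel_installs is made up for this example, and the package-name pattern (linux-image..., linux-generic...) is an assumption about which packages are of interest here.

```shell
# List dpkg log entries that install or upgrade kernel packages.
# Usage: kernel_installs [logfile]   (default: /var/log/dpkg.log)
kernel_installs() {
    log=${1:-/var/log/dpkg.log}
    # dpkg log lines look like:
    #   <date> <time> install|upgrade <package>:<arch> <old> <new>
    grep -E ' (install|upgrade) linux-(image|generic)' "$log"
}
```

Older entries may sit in rotated files such as /var/log/dpkg.log.1, which can be passed as the argument. The timestamps found this way can then be looked up in /var/log/apt/history.log, which typically records the command line that triggered each apt run.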
