[regression] netplan apply is not idempotent and fails trying to rename a bond member interface

Bug #1802322 reported by Dmitrii Shcherbakov on 2018-11-08
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
netplan
High
Mathieu Trudel-Lapierre
netplan.io (Ubuntu)
High
Mathieu Trudel-Lapierre
Bionic
Undecided
Unassigned
Cosmic
Undecided
Unassigned

Bug Description

[Impact]
Usage of juju to deploy systems with bridge configurations is severely broken if using layer3+4 bonding, as renaming might be attempted and break the application of config.

[Test case]
1) Deploy a system using MaaS and Juju, with a network configuration such as:
        bond0:
            interfaces:
            - enp4s0f0
            - enp5s0f0
            macaddress: 00:0a:f7:72:a7:28
            mtu: 9000
            parameters:
                down-delay: 0
                lacp-rate: fast
                mii-monitor-interval: 100
                mode: 802.3ad
                transmit-hash-policy: layer3+4
                up-delay: 0

        enp4s0f0:
            match:
                macaddress: 00:0a:f7:72:a7:28
            mtu: 9000
            set-name: enp4s0f0

        enp5s0f0:
            match:
                macaddress: 00:0e:1e:ac:67:00
            mtu: 9000
            set-name: enp5s0f0

[Regression potential]
This fix is to correct an existing regression. Changes in the rename code might otherwise impact the effect of attempting to rename interfaces when (and only when) 'netplan apply' is being run, which only ever happens as directed by the user (either directly at the command-line or via scripting such as via juju). Changes are limited to the behavior of 'netplan apply' in the interface renaming step; and the fix has been to ignore non-physical devices (which are not renamed anyway, but created) and physical devices members of a bond/bridge.

---

After an update for https://bugs.launchpad.net/netplan/+bug/1770082 was released for bionic and our systems started getting the new packages, *clean* MAAS + Juju + Bionic + LXD container deployments started to fail on bridge activation.

juju model-config logging-config='<root>=WARNING;unit=DEBUG;juju.network.netplan=TRACE'

2018-11-08 13:44:10 DEBUG juju.network.netplan activate.go:99 Netplan activation result "Traceback (most recent call last):
  File \"/usr/sbin/netplan\", line 23, in <module>
    netplan.main()
  File \"/usr/share/netplan/netplan/cli/core.py\", line 50, in main
    self.run_command()
  File \"/usr/share/netplan/netplan/cli/utils.py\", line 130, in run_command
    self.func()
  File \"/usr/share/netplan/netplan/cli/commands/apply.py\", line 43, in run
    self.run_command()
  File \"/usr/share/netplan/netplan/cli/utils.py\", line 130, in run_command
    self.func()
  File \"/usr/share/netplan/netplan/cli/commands/apply.py\", line 102, in command_apply
    stderr=subprocess.DEVNULL)
  File \"/usr/lib/python3.6/subprocess.py\", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ip', 'link', 'set', 'dev', 'enp5s0f0', 'name', 'enp4s0f0']' returned non-zero exit status 2.
" "" 1

From the Juju machine agent code:

command := fmt.Sprintf("%snetplan generate && netplan apply && sleep 10", params.RunPrefix)
// ...
logger.Debugf("Netplan activation result %q %q %d", result.Stderr, result.Stdout, result.Code)

The rename operation in question does not seem to be justified by anything that juju would want to do.

Inspecting closer it can be seen that 00:0a:f7:72:a7:28 is a mac address of enp4s0f0 which also happens to be a MAC address of the bond and gets applied to all members of a bond (enp5s0f0 is of specific interest) after the first run of netplan after the deployment.

It looks like a subsequent `netplan generate && netplan apply` invocation by Juju causes netplan to try to apply "enp4s0f0" name to "enp5s0f0" interface because it has "00:0a:f7:72:a7:28" for a mac address as a result of becoming a bond member.

netplan generated by cloud-init:
http://paste.ubuntu.com/p/QfR4f5yMYP/

        bond0:
            interfaces:
            - enp4s0f0
            - enp5s0f0

        enp4s0f0:
            match:
                macaddress: 00:0a:f7:72:a7:28
            mtu: 9000
            set-name: enp4s0f0

        enp5s0f0:
            match:
                macaddress: 00:0e:1e:ac:67:00
            mtu: 9000
            set-name: enp5s0f0

curtin config:
http://paste.ubuntu.com/p/NkvZKqZYjr/

# ip addr show enp5s0f0
8: enp5s0f0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond0 state DOWN group default qlen 1000
    link/ether 00:0a:f7:72:a7:28 brd ff:ff:ff:ff:ff:ff

# ip addr show enp4s0f0
6: enp4s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 00:0a:f7:72:a7:28 brd ff:ff:ff:ff:ff:ff

This is currently blocking all of our bionic deployments as all of them have bonds.

Dmitrii Shcherbakov (dmitriis) wrote :

Subscribed ~field-critical as it blocks field deployments and does not have a workaround.

Ryan Harper (raharper) wrote :

Can you collect /run/systemd/network/ and the system journal?

Mihai Gheorghe (mgheorghe) wrote :
Mihai Gheorghe (mgheorghe) wrote :

Why is this still including set-name calls? I thought we had established this was very wrong to do?

Regardless, there's clearly an error, but it's one that could be easily avoided in this particular if the config generated by cloud-init was "correct".

Changed in netplan.io (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
Changed in netplan:
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
importance: Undecided → High
status: New → Triaged
tags: added: regression-release
tags: added: regression-update
removed: regression-release
Mihai Gheorghe (mgheorghe) wrote :
Ryan Harper (raharper) wrote :

OK, I can recreate.

The specific detail is that the bond macaddress matches one of the
physical interfaces and that the bond itself is down at the time of
the apply command.

# netplan apply --debug
WARK: matches: {'by-driver': {}, 'by-mac': {'00:16:3e:2c:4c:e5':
'eth0', '00:16:3e:c3:41:0a': 'eth1', '00:16:3e:b2:be:9a': 'eth2'}}
WARK: device lo operstate is unknown, not changing
WARK iface=bond1 link={'addr': '00:16:3e:c3:41:0a', 'broadcast':
'ff:ff:ff:ff:ff:ff'} mac=00:16:3e:c3:41:0a
driver=/sys/devices/virtual/net/bond1/device/driver
WARK: device eth0 operstate is up, not changing
WARK iface=eth1 link={'addr': '00:16:3e:c3:41:0a', 'broadcast':
'ff:ff:ff:ff:ff:ff'} mac=00:16:3e:c3:41:0a
driver=/sys/devices/virtual/net/eth1/device/driver
WARK interface=eth1 new_name=eth1
WARK iface=eth2 link={'addr': '00:16:3e:c3:41:0a', 'broadcast':
'ff:ff:ff:ff:ff:ff'} mac=00:16:3e:c3:41:0a
driver=/sys/devices/virtual/net/eth2/device/driver
WARK process changes: {'bond1': {'name': 'eth1'}, 'eth2': {'name': 'eth1'}}
WARK: devices: ['lo', 'bond1', 'eth0', 'eth1', 'eth2']
WARK: changes: {'bond1': {'name': 'eth1'}, 'eth2': {'name': 'eth1'}}
Traceback (most recent call last):
  File "/usr/sbin/netplan", line 23, in <module>
    netplan.main()
  File "/usr/share/netplan/netplan/cli/core.py", line 50, in main
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 130, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/apply.py", line 43, in run
    self.run_command()
  File "/usr/share/netplan/netplan/cli/utils.py", line 130, in run_command
    self.func()
  File "/usr/share/netplan/netplan/cli/commands/apply.py", line 105,
in command_apply
    stderr=subprocess.DEVNULL)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ip', 'link', 'set', 'dev',
'bond1', 'name', 'eth1']' returned non-zero exit status 2.

Two things here:

1) we shouldn't attempt to rename non-physical interfaces (lo and
bond1 can be skipped)
2) physical devices which are part of a bond should be skipped since
the mac changes

On Thu, Nov 8, 2018 at 10:35 AM Mihai Gheorghe
<email address hidden> wrote:
>
> # netplan --debug apply
>
> http://paste.ubuntu.com/p/bcv8cJQGvY/
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1802322
>
> Title:
> [regression] netplan apply is not idempotent and fails trying to
> rename a bond member interface
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/netplan/+bug/1802322/+subscriptions

tags: added: id-5be4a491c40aa00e98bc940a
Dmitrii Shcherbakov (dmitriis) wrote :

As a temporary measure for Juju deployments this can be used as a workaround:

cat userdata-netplan.yaml
cloudinit-userdata: |
   preruncmd: sed -i '/set-name/d' /etc/netplan/50-cloud-init.yaml

# before `juju deploy`
juju model-config userdata-netplan.yaml

Dmitrii Shcherbakov (dmitriis) wrote :

Any update so far?

I can see the code + test commits in a relevant branch:
https://code.launchpad.net/~netplan-developers/netplan/+git/netplan/+ref/lp1802322

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.90

---------------
netplan.io (0.90) disco; urgency=medium

  * New upstream release:
    - build: fixes for building on RPM-based distros
    - build: code prettiness changes (make indentation consistent)
    - Fix device name-changes detection (LP: #1770082)
    - Add support for IPv6 Privacy Extensions (LP: #1750392)
    - Add dhcp{4,6}-overrides to control DNS, NTP, hostname updates via DHCP
      (LP: #1759014)
    - Clarify MAC and MTU setting requirements (LP: #1800668)
    - Various documentation fixes (LP: #1800669)
    - Improve error reporting to give clearer messages and context
      (LP: #1800670)
    - Skip non-physical/bond interfaces when applying renames (LP: #1802322)

 -- Mathieu Trudel-Lapierre <email address hidden> Wed, 21 Nov 2018 14:06:13 -0500

Changed in netplan.io (Ubuntu):
status: Triaged → Fix Released
description: updated

Hello Dmitrii, or anyone else affected,

Accepted netplan.io into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/netplan.io/0.40.2.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in netplan.io (Ubuntu Cosmic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-cosmic
Changed in netplan.io (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed-bionic
Adam Conrad (adconrad) wrote :

Hello Dmitrii, or anyone else affected,

Accepted netplan.io into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/netplan.io/0.40.1~18.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Verification-done on bionic: netplan.io 0.40.1~18.04.3
Verification-done on cosmic: netplan.io 0.40.2.2

I have checked that the new package in proposed now skips devices that can't be identified as physical interfaces or that are known to be members of bonds/bridges, etc. so as not to attempt to rename them:

$ netplan --debug apply --debug
[...]
DEBUG:Skipping non-physical interface: lo
DEBUG:Skipping composite member ens6
DEBUG:Skipping composite member ens7
DEBUG:Skipping non-physical interface: bond0
DEBUG:{}
DEBUG:netplan triggering .link rules for lo
DEBUG:netplan triggering .link rules for ens6
DEBUG:netplan triggering .link rules for ens7
DEBUG:netplan triggering .link rules for bond0

tags: added: verification-done-bionic verification-done-cosmic
removed: verification-needed verification-needed-bionic verification-needed-cosmic
Dmitrii Shcherbakov (dmitriis) wrote :

Unsubscribed ~field-critical as we can no longer reliably reproduce this although we could in the beginning.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.40.2.2

---------------
netplan.io (0.40.2.2) cosmic; urgency=medium

  * Fix idempotency in renaming: bond members should be exempt from rename, as
    they may all share a single MAC for the bond device. (LP: #1802322)
  * tests/integration.py: add test designed to catch the above regression.

netplan.io (0.40.2.1) cosmic; urgency=medium

  * Fix typo breaking rename on 'netplan apply'. (LP: #1770082)

 -- Mathieu Trudel-Lapierre <email address hidden> Wed, 21 Nov 2018 14:28:42 -0500

Changed in netplan.io (Ubuntu Cosmic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for netplan.io has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netplan.io - 0.40.1~18.04.3

---------------
netplan.io (0.40.1~18.04.3) bionic; urgency=medium

  * Fix idempotency in renaming: bond members should be exempt from rename, as
    they may all share a single MAC for the bond device. (LP: #1802322)
  * tests/integration.py: add test designed to catch the above regression.

netplan.io (0.40.1~18.04.2) bionic; urgency=medium

  * Fix typo breaking rename on 'netplan apply'. (LP: #1770082)

netplan.io (0.40.1~18.04.1) bionic; urgency=medium

  * Backport netplan 0.40.1 to 18.04. (LP: #1793309)

netplan.io (0.40.1) cosmic; urgency=medium

  * tests/generate.py: use random.sample() instead of random.choices() to
    better support older pythons.
  * Deal gracefully with empty files on 'netplan apply' (LP: #1795343)

netplan.io (0.40) cosmic; urgency=medium

  * New upstream release:
    - networkd: route source is PreferredSource= not From=
    - Improve NetworkManager error reporting on unrenderable routes.
    - Don't render ipv4 dns-search unless we have an ipv4 address.
      (LP: #1786726)
    - Set permissive umask on networkd .network, .link and .netdev files
      (LP: #1736965, LP: #1768560)
    - Fix support for link-scope routes. (LP: #1747455)
    - Update man pages for deletion of replug code.
    - Spell Gratuitous ARP correctly and make it work. (LP: #1756701)
    - Many typo fixes for documentation. (LP: #1783940)
    - Various build system fixes.
    - Fix integration tests:
      - iproute2 output changes for link-scope routes
      - fix stability of networkd igmp-resend test
      - fix manual_addresses test now that networkd lists ~. domain
    - Deduplicate code for parsing interface options
    - Add support for optional-addresses.

netplan.io (0.39) cosmic; urgency=medium

  * New upstream release:
    - Allow link-local addresses to be configured. (LP: #1771704)
    - Forces bridges with no addresses to be brought online. (LP: #1736975)

netplan.io (0.38) cosmic; urgency=medium

  * New upstream release:
    - Write udev .rules files to /run/udev/rules.d to enforce interface
      renaming. (LP: #1770082)
    - Don't traceback for 'netplan ip leases' when iface is not managed or
      doesn't DHCP (LP: #1768823)
    - Fix duplicate "/" path separator in error messages (LP: #1771440)
    - Fix incorrect terminal reset in 'netplan try' on Ctrl-C. (LP: #1768798)
    - Updated doc entries: mtu, fix fwmark->mark, cleanup optional.
      (LP: #1768783)
    - Added documentation validation at build.
    - Added configuration example for multi-ip interfaces.
  * tests/integration.py: fix test_eth_and_bridge autopkg test harder.
  * debian/control:
    - Add iproute2 to Depends.
    - Add python3-netifaces to Depends, Build-Depends.

 -- Mathieu Trudel-Lapierre <email address hidden> Wed, 21 Nov 2018 14:42:59 -0500

Changed in netplan.io (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers