networkctl reload with bond devices causes slaves to go DOWN and UP, causing couple of seconds of network loss

Bug #2003250 reported by frantisek reznicek
This bug affects 4 people
Affects            Status         Importance   Assigned to   Milestone
systemd (Ubuntu)   Fix Released   Low          Unassigned
  Jammy            Triaged        Medium       Unassigned
  Kinetic          Won't Fix      Low          Unassigned

Bug Description

We currently use Ubuntu 22.04.1 LTS, including updates, for our production cloud (we switched from legacy CentOS 7).
Although we like the distribution, we recently hit a serious systemd bug, described in upstream bug report [1], while using the packages listed in [2].

Unfortunately, the clouds we run consist of OpenStack on top of Kubernetes, and we need a complex network configuration that includes Linux bond devices.

Our observation is that every time we apply our configuration through our CI/CD infrastructure using Ansible and netplan (regardless of whether the network configuration actually changed), we see network interruptions of approximately 8-16 seconds and the bond interfaces going DOWN and then UP.

We expect bond interfaces to stay UP when there is no network configuration change.

We went through a couple of options for solving the issue; the first is to add the existing upstream patch [3] to the current systemd-249.11-0ubuntu3.6.

Could you comment on whether this kind of non-security patch is likely to land in 22.04.1 LTS soon?
We are able to help bring the patch into the systemd package the community way if you suggest the steps.

[1] https://github.com/systemd/systemd/issues/25067
[2] Packages
root@controlplane-001:/etc/apt0# apt list | grep -E '^(systemd/|netplan.io)'
netplan.io/jammy-updates,now 0.105-0ubuntu2~22.04.1 amd64 [installed,automatic]
systemd/jammy-updates,now 249.11-0ubuntu3.6 amd64 [installed,automatic]
[3] https://github.com/systemd/systemd/pull/25162
[4] # lsb_release -rd
Description: Ubuntu 22.04.1 LTS
Release: 22.04

Tags: networking
Revision history for this message
Nick Rosbrook (enr0n) wrote :

I have confirmed that this bug affects Jammy and newer. The upstream patch looks straightforward, so I will test a build with that patch included to see if it fixes the issue.

Changed in systemd (Ubuntu Jammy):
status: New → Triaged
Changed in systemd (Ubuntu Kinetic):
status: New → Triaged
Changed in systemd (Ubuntu Jammy):
importance: Undecided → Low
Changed in systemd (Ubuntu Kinetic):
importance: Undecided → Low
Changed in systemd (Ubuntu):
status: New → Triaged
importance: Undecided → Low
Nick Rosbrook (enr0n) wrote :

Applying the patch from upstream had some unexpected problems, i.e. causing other interfaces not to come up when I would have expected them to. This needs further investigation -- maybe there is some missing logic from current upstream that makes the patch incorrect when backported. For now, this patch will not be backported.

frantisek reznicek (frantisek-reznicek) wrote :

Thank you very much for the status.

On our side, we improved the Ansible logic that configures networking via netplan so that Ansible runs netplan apply **only if** the netplan configuration has changed. This helps minimize unexpected bond toggling, but the issue is still far from solved.
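
The "apply only on change" idea can be sketched in plain shell as follows. This is only an illustration, not the Ansible logic actually used here: the function name apply_if_changed, its arguments, and the example path in the comment are hypothetical, and the optional third argument exists only so the apply command can be swapped out for a dry run.

```shell
#!/bin/sh
# Hypothetical helper (not part of netplan or Ansible):
# install a freshly rendered netplan file and run the apply command
# only when the content differs from what is already installed.
apply_if_changed() {
    candidate="$1"                    # newly rendered config file
    target="$2"                       # installed file, e.g. /etc/netplan/01-bond.yaml (example path)
    apply_cmd="${3:-netplan apply}"   # command to run after a real change

    # cmp -s is silent and exits 0 only when both files are byte-identical.
    if [ -f "$target" ] && cmp -s "$candidate" "$target"; then
        echo "unchanged"
        return 0
    fi

    # Content differs (or target is missing): install it, then apply.
    # Only stdout of the apply command is discarded; errors stay visible.
    cp "$candidate" "$target" && $apply_cmd >/dev/null && echo "applied"
}
```

Calling apply_if_changed twice with the same rendered file runs the apply command only the first time, so an unchanged configuration never reaches networkd. It does not help when the configuration really changes, which is why the underlying bug still matters.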

Sergey Borodavkin (bocmanpy) wrote :

Same problem here, but with additional strange behaviour on the bond interface.
All slaves were released from the bond, and after reloading, the bond was left without any slave links.

dmesg logs

bond-xe2: (slave eth4): Releasing backup interface
bond-xe2: (slave eth5): Removing an active aggregator
bond-xe2: (slave eth5): Releasing backup interface

# cat /proc/net/bonding/bond-xe2
Ethernet Channel Bonding Driver: v5.19.0-35-generic
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
802.3ad info
LACP active: on
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: da:74:dc:e2:48:b4
bond bond-xe2 has no active aggregator

After running networkctl reload, the bond enslaves the links and everything starts working correctly.
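
A bond left in this state could be detected by counting the "Slave Interface:" entries in the bonding status file, which the kernel bonding driver prints once per enslaved link (the output above contains none). The helper below is a minimal sketch; bond_slave_count is a hypothetical name, and the file path is passed as an argument so the function can be tried against a saved copy of the output.

```shell
#!/bin/sh
# Hypothetical helper: print the number of links currently enslaved to a bond.
# $1 is the bonding status file, e.g. /proc/net/bonding/bond-xe2.
bond_slave_count() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so ignore the exit status and rely on the printed count.
    grep -c '^Slave Interface:' "$1" || true
}
```

A monitoring check could then compare the result against zero, for example: [ "$(bond_slave_count /proc/net/bonding/bond-xe2)" -gt 0 ] || echo "bond-xe2 has no slaves".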

Sergey Borodavkin (bocmanpy) wrote :

It seems this is not a "low importance" bug. 🤕
Updating the systemd package can trigger this bug, so if you have unattended-upgrades enabled, it will cause a network flap on that host.
It is easy to reproduce with:
# apt install --reinstall systemd

----------------------------------
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy

systemd 249 (249.11-0ubuntu3.9)

Junien F (axino) wrote :

I agree that the importance should be higher than "Low".

This bug is also triggered every time a "netplan apply" is run, since netplan will always re-generate the systemd-networkd config files.

VLAN interfaces are also torn down and recreated.

This is highly problematic on critical networking hosts, such as firewalls, since any networking configuration change will trigger seconds of downtime, which can lead to VRRP failovers and similar side effects.

Nick Rosbrook (enr0n) wrote :

The referenced commit is present in v253, so this should be fixed in the devel series now. Kinetic is EOL so it won't be fixed there. I have not had the time to find an alternative fix for Jammy, but as noted before, the referenced commit caused other regressions when I tested so it cannot simply be backported.

Changed in systemd (Ubuntu Kinetic):
status: Triaged → Won't Fix
Changed in systemd (Ubuntu):
status: Triaged → Fix Released
Changed in systemd (Ubuntu Jammy):
importance: Low → Medium