networkctl reload with bond devices causes slaves to go DOWN and UP, causing a couple of seconds of network loss
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| systemd (Ubuntu) | Fix Released | Low | Unassigned | |
| Jammy | Fix Released | Medium | Nick Rosbrook | |
| Kinetic | Won't Fix | Low | Unassigned | |
Bug Description
[SRU TEMPLATE]
[DESCRIPTION]
We currently run Ubuntu 22.04.1 LTS, including updates, for our production cloud (having switched from legacy CentOS 7).
Although we like the distribution, we recently hit a serious systemd bug described in bug report [1], using packages [2].
Unfortunately, the clouds we run consist of OpenStack on top of Kubernetes, and we need a complex network configuration that includes Linux bond devices.
Our observation is that every time we apply our configuration via our CI/CD infrastructure using Ansible and netplan (regardless of whether the network configuration actually changed), we see network interruptions of approximately 8-16 seconds, with the bond interfaces going DOWN and then UP.
We expect bond interfaces to stay UP when there is no network configuration change.
We went through a couple of options for solving the issue; the first is to add the existing patch [3] to the current systemd-
Could you comment on whether this kind of non-security patch is likely to land in 22.04.1 LTS soon?
We are able to help bring the patch into the systemd package the community way if you suggest the steps.
[TESTING]
On a Jammy system, create a bond interface with two subordinate devices. Assuming the interfaces ens3 and ens9 exist on the system, this can be done using the following:
$ cat > /etc/netplan/
network:
  version: 2
  renderer: networkd
  ethernets:
    ens3:
      dhcp4: no
    ens9:
      dhcp4: no
  bonds:
    bond0:
      dhcp4: yes
      interfaces:
        - ens3
        - ens9
      parameters:
        mode: active-backup
        primary: ens3
EOF
$ netplan generate && netplan apply
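Before running the tests, it can help to confirm that the bond actually came up with both subordinates. The following is a minimal sketch, not part of the original test plan: it assumes the kernel bonding driver exposes its status at /proc/net/bonding/bond0 with "Slave Interface:" and "Currently Active Slave:" lines, and the helper names bond_slaves and bond_active are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: parse a bonding status file.
# $1 is the file to read; it defaults to /proc/net/bonding/bond0
# (assumed location of the bonding driver's status output).

# Print one subordinate interface name per line.
bond_slaves() {
    sed -n 's/^Slave Interface: //p' "${1:-/proc/net/bonding/bond0}"
}

# Print the subordinate that is currently active (active-backup mode).
bond_active() {
    sed -n 's/^Currently Active Slave: //p' "${1:-/proc/net/bonding/bond0}"
}
```

With the netplan configuration above applied, bond_slaves would be expected to list ens3 and ens9, and bond_active to print the primary, ens3.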
From here, there are two tests that can be used to verify the fix.
1. Update the modification time of the generated network files, and call networkctl reload. From networkctl(1), when "reload" is called:
[...] If a new, modified or removed .network file is found, then all interfaces which match the file are reconfigured.
Hence, the following will trigger the desired code path:
$ touch /run/systemd/
$ networkctl reload
Without the fix, you can see in the logs the interfaces of the bond going up and down. With the fix, this should not happen.
$ journalctl -b -u systemd-
Finally, check that everything is back in the configured state:
$ networkctl status
2. This bug can also be triggered by calling networkctl reconfigure directly.
$ networkctl reconfigure ens3
$ networkctl reconfigure ens9
Check the logs to confirm that the links were not brought down:
$ journalctl -b -u systemd-
Finally, check that everything is back in the configured state:
$ networkctl status
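Both tests come down to reading the journal and looking for the subordinate links going down. That check can be sketched as a small filter over log text on stdin. This is an illustration only: the function name count_link_drops is hypothetical, and the message strings "Link DOWN" and "Lost carrier" are an assumption about how systemd-networkd words these events, so adjust them to match what your journal actually shows.

```shell
#!/bin/sh
# Hypothetical helper: count log lines on stdin that report a link going
# down for any of the interfaces given as arguments.
# Assumption: networkd logs "<ifname>: Link DOWN" and "<ifname>: Lost carrier".
count_link_drops() {
    # Build an alternation such as "ens3|ens9" from the arguments.
    pattern=$(printf '%s|' "$@")
    pattern="(${pattern%|}): (Link DOWN|Lost carrier)"
    # grep -c prints the count; it exits non-zero on zero matches,
    # so mask that to keep the function usable under "set -e".
    grep -E -c "$pattern" || true
}
```

Piping the output of the journalctl command used above into `count_link_drops ens3 ens9` should print 0 on a fixed system after a reload or reconfigure, and a non-zero count without the fix.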
[REGRESSION POTENTIAL]
This patch is confined to the SET_LINK_MASTER logic for configuring links in systemd-networkd. While bond interfaces are the motivation for the fix, this early return applies to all interface types for which SET_LINK_MASTER is supported, e.g. bridge interfaces as well.
This logic has seen exercise in newer releases of systemd and Ubuntu without further modification, so I would not expect to see regressions for other interface types. Furthermore, the bond type is the only type where the link is set to down in order to configure the master interface index, so this call was already effectively a no-op for those other interface types.
If any problems did occur, it would be related to (re-)configuring link types which have a master interface set.
[OTHER]
This fix requires two upstream patches:
https:/
https:/
The second is a follow-up to the first, to complete the fix.
These patches do not apply cleanly to v249, so some trivial conflicts were resolved to make them apply. In addition, extra logic is added to the patches so that the link state is correctly set when this new branch is hit.
Specifically, we decrement the set_link_messages counter and call link_check_ready() before returning -EALREADY. This is necessary because this area of systemd-networkd saw a lot of refactoring between v249 and the version of systemd where these patches originate. So, while in newer versions of systemd the message counter is handled correctly and link_check_ready() is eventually called despite cancelling the SET_LINK_MASTER request, this never happens when these patches are applied to v249. Hence, we add the necessary steps to the patch.
Related branches
- Nick Rosbrook: Approve
- Diff: 72 lines (+58/-0), 2 files modified
  debian/patches/lp2003250-network-skip-to-reassign-master-ifindex-if-already-s.patch (+57/-0)
  debian/patches/series (+1/-0)
| description: | updated |
| tags: | added: systemd-sru-next |
| Changed in systemd (Ubuntu Jammy): | |
| status: | Triaged → In Progress |
| assignee: | nobody → Nick Rosbrook (enr0n) |

I have confirmed that this bug affects Jammy and newer. The upstream patch looks straightforward, so I will test a build with that patch included to see if it fixes the issue.