systemd-networkd IPv6 default routes dropped under load, don't recover
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
systemd (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Ubuntu 22.04.3 LTS
systemd 249.11-0ubuntu3.12
The systemd issue tracker says this version is too old to report upstream and that I should report it to the downstream bug tracker instead.
IPv6 default routes are getting lost and not renewed.
We're using IPv6 RA to find default routes for our servers and desktops. The RAs come from HP/Aruba routers and have a short lifetime of about 46s. Occasionally, we will see the default routes get dropped. Despite receiving RAs, the default routes don't get recreated.
The most recent machine to be affected had a user running an excessively large job (load average 157). This is the state of the network when the machine is working:
```sh
# ip a
1: lo: <LOOPBACK,
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,
link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff
altname enp4s0f0
3: eno2: <BROADCAST,
link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff permaddr 2c:ea:7f:56:9a:67
altname enp4s0f1
4: bond0: <BROADCAST,
link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff
inet xxx.xxx.202.112/24 brd 129.215.202.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 xxxx:xxx:
valid_lft 2591994sec preferred_lft 604794sec
inet6 fe80::2eea:
valid_lft forever preferred_lft forever
# ip -6 r
::1 dev lo proto kernel metric 256 pref medium
xxxx:xxx:
fe80::/64 dev bond0 proto kernel metric 256 pref medium
default proto ra metric 1024 expires 28sec pref medium
nexthop via fe80::609:
nexthop via fe80::609:
```
When the problem arises, the last three lines disappear. `tcpdump icmp6` shows RAs being received but networkd doesn't create the routes in the kernel. The machine keeps its IPv6 addresses, but without a default route it can't make any IPv6 connections or answer incoming IPv6 connections.
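To spot the failure state described above without reading the routing table by hand, something like the following could be used. It is only a minimal sketch: the function names are made up for illustration, and it assumes the `ip` tool from iproute2 is installed and that a default route always appears as a line beginning with `default`, as in the output above.

```python
import subprocess


def has_ipv6_default_route(route_output: str) -> bool:
    """Return True if `ip -6 route` style output contains a default route.

    A default route is recognised by a line whose first word is "default";
    indented "nexthop" continuation lines are ignored.
    """
    return any(
        line.split()[:1] == ["default"] for line in route_output.splitlines()
    )


def current_ipv6_default_routes() -> str:
    """Ask the kernel for its IPv6 default routes (needs iproute2's `ip`)."""
    result = subprocess.run(
        ["ip", "-6", "route", "show", "default"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `has_ipv6_default_route(current_ipv6_default_routes())` from a cron job or monitoring check would return `False` on an affected machine while the addresses are still present.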
Sorry, reproduction method is unclear. Here's a best guess:
1. Configure networkd using netplan:
```yaml
---
network:
  bonds:
    bond0:
      addresses:
        - xxx.xxx.202.112/24
      dhcp4: false
      interfaces:
        - eth0
        - eth1
      macaddress: 2C:EA:7F:56:9A:66
      parameters:
        mode: active-backup
  ethernets:
    eth0:
      dhcp4: false
      match:
        macaddress: 2C:EA:7F:56:9A:66
    eth1:
      dhcp4: false
      match:
        macaddress: 2C:EA:7F:56:9A:67
  renderer: networkd
  version: 2
```
2. Load the machine, or just wait. Possibly this is related to packets being dropped, but I would expect the system to recover once the load is removed.
3. Note the lack of IPv6 connectivity, inability to log in with ssh, etc.
I see this behaviour too, quite often on multiple machines, always seemingly happening when the machine is under high load. When it happens, systemd-networkd typically logs something like:
systemd-networkd[3512]: enp193s0f0np0: Could not set route: Connection timed out
systemd-networkd[3512]: enp193s0f0np0: Failed
In my eyes there are two or three problems here:
1. networkd is deleting and re-adding a route when it probably doesn't need to (I guess when it receives an ICMPv6 router advertisement); the route hasn't changed and an identical one already exists in the routing table
2. The error is handled poorly; perhaps it could retry
3. After the error, the default route stays missing _permanently_ (until systemd-networkd is prodded with e.g. "netplan apply"); at the very least it ought to try to re-add the route next time it sees an RA packet
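The behaviour asked for in points 2 and 3 could look roughly like this. This is a generic illustration in Python, not networkd's actual code: `set_route_with_retry`, its parameters, and the `set_route` callback are all hypothetical, and the sketch assumes the failure surfaces as an `OSError` with `ETIMEDOUT` ("Connection timed out").

```python
import errno
import time
from typing import Callable


def set_route_with_retry(
    set_route: Callable[[], None],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call set_route(), retrying on timeout with exponential backoff.

    Returns True on success and False if every attempt timed out, so the
    caller can keep the route marked as wanted and try again on the next RA
    instead of giving up permanently.
    """
    for attempt in range(attempts):
        try:
            set_route()
            return True
        except OSError as exc:
            if exc.errno != errno.ETIMEDOUT:
                raise  # only timeouts are treated as transient here
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False
```

The important part is the `False` return path: a timed-out request is treated as transient state to be retried, rather than as a reason to leave the route missing.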