systemd-networkd IPv6 default routes dropped under load, don't recover

Bug #2053288 reported by Bruce Duncan
This bug affects 2 people
Affects: systemd (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Ubuntu 22.04.3 LTS
systemd 249.11-0ubuntu3.12

The systemd issue tracker says this version is too old to report upstream, so I am reporting it to the downstream bug tracker instead.

IPv6 default routes are getting lost and not renewed.

We're using IPv6 RA to find default routes for our servers and desktops. The RAs come from HP/Aruba routers and have a short lifetime of about 46s. Occasionally, we will see the default routes get dropped. Despite receiving RAs, the default routes don't get recreated.

The most recent machine to be affected had a user running an excessively large job (load average 157). This is the state of the network when the machine is working:

```sh
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff
    altname enp4s0f0
3: eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff permaddr 2c:ea:7f:56:9a:67
    altname enp4s0f1
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 2c:ea:7f:56:9a:66 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.202.112/24 brd 129.215.202.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 xxxx:xxx:xxx:202:2eea:7fff:fe56:9a66/64 scope global dynamic mngtmpaddr noprefixroute
       valid_lft 2591994sec preferred_lft 604794sec
    inet6 fe80::2eea:7fff:fe56:9a66/64 scope link
       valid_lft forever preferred_lft forever
# ip -6 r
::1 dev lo proto kernel metric 256 pref medium
xxxx:xxx:xxx:202::/64 dev bond0 proto ra metric 1024 expires 2591998sec pref medium
fe80::/64 dev bond0 proto kernel metric 256 pref medium
default proto ra metric 1024 expires 28sec pref medium
 nexthop via fe80::609:73ff:fe48:c000 dev bond0 weight 1
 nexthop via fe80::609:73ff:fe48:6500 dev bond0 weight 1
```

When the problem arises, the last three lines disappear. `tcpdump icmp6` shows RAs being received but networkd doesn't create the routes in the kernel. The machine keeps its IPv6 addresses, but without a default route it can't make any IPv6 connections or answer incoming IPv6 connections.
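To check for this state from a script, a minimal sketch along the following lines might help. It only looks for a `default` entry in the output of `ip -6 route show`. The function names and the shortened sample tables are illustrative assumptions, not part of any systemd or iproute2 API.

```python
import subprocess


def has_ipv6_default_route(route_table: str) -> bool:
    """Return True if any route entry in `ip -6 route` output is a default route."""
    for line in route_table.splitlines():
        fields = line.split()
        # Multipath nexthop lines are indented and begin with "nexthop",
        # so only the route entry itself begins with "default".
        if fields and fields[0] == "default":
            return True
    return False


def read_ipv6_routes() -> str:
    """Read the live IPv6 routing table (assumes the iproute2 `ip` tool is installed)."""
    result = subprocess.run(
        ["ip", "-6", "route", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Shortened copies of the routing table from this report.
HEALTHY = """\
::1 dev lo proto kernel metric 256 pref medium
fe80::/64 dev bond0 proto kernel metric 256 pref medium
default proto ra metric 1024 expires 28sec pref medium
 nexthop via fe80::609:73ff:fe48:c000 dev bond0 weight 1
 nexthop via fe80::609:73ff:fe48:6500 dev bond0 weight 1
"""
BROKEN = """\
::1 dev lo proto kernel metric 256 pref medium
fe80::/64 dev bond0 proto kernel metric 256 pref medium
"""

print(has_ipv6_default_route(HEALTHY))  # True
print(has_ipv6_default_route(BROKEN))   # False
```

On an affected machine, `has_ipv6_default_route(read_ipv6_routes())` would be expected to return `False` while `tcpdump icmp6` still shows RAs arriving.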

Sorry, I don't have a reliable reproduction method. Here's my best guess:

1. Configure networkd using netplan:

```yaml
---
network:
  bonds:
    bond0:
      addresses:
      - xxx.xxx.202.112/24
      dhcp4: false
      interfaces:
      - eth0
      - eth1
      macaddress: 2C:EA:7F:56:9A:66
      parameters:
        mii-monitor-interval: 1
        mode: active-backup
  ethernets:
    eth0:
      dhcp4: false
      match:
        macaddress: 2C:EA:7F:56:9A:66
    eth1:
      dhcp4: false
      match:
        macaddress: 2C:EA:7F:56:9A:67
  renderer: networkd
  version: 2
```

2. Load the machine, or just wait. Possibly this is related to packets being dropped, but I would expect the system to recover once the load is removed.
3. Note the lack of IPv6 connectivity, inability to log in with ssh, etc.

Malcolm Scott (malcscott) wrote :

I see this behaviour too, quite often and on multiple machines, seemingly always when the machine is under high load. When it happens, systemd-networkd typically logs something like:

systemd-networkd[3512]: enp193s0f0np0: Could not set route: Connection timed out
systemd-networkd[3512]: enp193s0f0np0: Failed
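
A rough sketch for spotting these messages in saved log text follows. The message format is assumed from the two lines quoted above and may differ between systemd versions. The function name and regular expression are hypothetical helpers, not part of systemd.

```python
import re

# Pattern based only on the log lines quoted in this comment (an assumption).
ROUTE_ERROR = re.compile(
    r"systemd-networkd\[\d+\]: (?P<iface>\S+): Could not set route: (?P<reason>.+)"
)


def find_route_errors(log_text: str) -> list[tuple[str, str]]:
    """Return (interface, reason) pairs for each 'Could not set route' line."""
    matches = []
    for line in log_text.splitlines():
        found = ROUTE_ERROR.search(line)
        if found:
            matches.append((found.group("iface"), found.group("reason").strip()))
    return matches


SAMPLE_LOG = """\
systemd-networkd[3512]: enp193s0f0np0: Could not set route: Connection timed out
systemd-networkd[3512]: enp193s0f0np0: Failed
"""

print(find_route_errors(SAMPLE_LOG))
# [('enp193s0f0np0', 'Connection timed out')]
```

The log text could come from something like `journalctl -u systemd-networkd`, saved to a file and passed in as a string.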

In my eyes there are two or three problems here:

1. networkd is deleting and re-adding a route when it probably doesn't need to (I guess when it receives an ICMPv6 router advertisement); the route hasn't changed and an identical one already exists in the routing table

2. The error is handled poorly; perhaps it could retry

3. After the error, the default route stays missing _permanently_ (until systemd-networkd is prodded with e.g. "netplan apply"); at the very least it ought to try to re-add the route next time it sees an RA packet
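
Until networkd itself retries, a watchdog along these lines could serve as a stopgap. This is a workaround sketch, not a fix and not how networkd behaves. It assumes `netplan apply` restores the route, as described in point 3. The function names, the threshold of three consecutive misses, and the 30-second polling interval are all arbitrary, hypothetical choices.

```python
import subprocess
import time
from typing import Callable


def default_route_present(route_table: str) -> bool:
    """True if `ip -6 route` output contains a default route entry."""
    return any(
        line.split()[:1] == ["default"] for line in route_table.splitlines()
    )


def read_routes() -> str:
    """Read the live IPv6 routing table (assumes iproute2 is installed)."""
    return subprocess.run(
        ["ip", "-6", "route", "show"],
        capture_output=True, text=True, check=True,
    ).stdout


def reapply_network_config() -> None:
    """Prod networkd the way the comment above describes (requires root)."""
    subprocess.run(["netplan", "apply"], check=True)


def watchdog_step(
    missing_streak: int,
    read: Callable[[], str] = read_routes,
    recover: Callable[[], None] = reapply_network_config,
    threshold: int = 3,
) -> int:
    """Run one check and return the updated count of consecutive misses.

    Recovery fires only after `threshold` misses in a row, so a brief gap
    while a route is deleted and re-added does not trigger it.
    """
    if default_route_present(read()):
        return 0
    missing_streak += 1
    if missing_streak >= threshold:
        recover()
        return 0
    return missing_streak


def run_forever(interval_seconds: float = 30.0) -> None:
    """Poll indefinitely; the interval is an arbitrary illustrative value."""
    streak = 0
    while True:
        streak = watchdog_step(streak)
        time.sleep(interval_seconds)
```

Passing `read` and `recover` as parameters lets the decision logic be exercised with fake inputs, without touching the real routing table.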

Malcolm Scott (malcscott) wrote :

This upstream fix may be relevant: https://github.com/systemd/systemd/issues/25441

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed