34 wireguard peers result in invalid peer configuration

Bug #1853956 reported by Joshua Sjoding
This bug affects 3 people
Affects: systemd (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Ubuntu Server 18.04.3 LTS
systemd 237-3ubuntu10.31
wireguard 0.0.20191012-wg1~bionic from PPA.

We're using systemd-networkd to configure WireGuard via wireguard.netdev and wireguard.network files in /etc/systemd/network/. All endpoints have IPv4 addresses.

When we include 34, 35, or 36 [WireGuardPeer] entries in the netdev file, some peers are configured incorrectly. Which peers are affected seems to depend on the total number of peers (peer indices below count from 0):

33 peers: No issue
34 peers: Peer 1 and 2 fail
35 peers: Peer 2 and 3 fail
36 peers: Peer 3 and 4 fail
37 peers: No issue

In all cases peer 0 is functional. For an affected pair of peers A and B, peer A ends up with the allowed IP address range of peer B. Peer B ends up with no allowed IP addresses. This can be seen in the output of wg. The connections to both peers fail because of incorrect address range assignments.
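A check like the following could flag the affected peers automatically. This is only a sketch: it assumes the text comes from `wg show <interface> allowed-ips` (one peer per line, public key first, then the allowed ranges, with `(none)` for a peer that has none), and the function names and the `expected` mapping are hypothetical helpers, not part of any WireGuard tooling.

```python
def parse_allowed_ips(wg_output):
    """Map each peer's public key to the set of allowed IP ranges wg reports."""
    actual = {}
    for line in wg_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, ranges = fields[0], fields[1:]
        # Assumption: wg prints "(none)" for a peer without allowed IPs.
        actual[key] = set() if ranges == ["(none)"] else set(ranges)
    return actual


def find_mismatched_peers(expected, wg_output):
    """Return the public keys whose reported ranges differ from the netdev file.

    `expected` is a hypothetical dict of public key -> iterable of ranges,
    built by hand or from the [WireGuardPeer] sections.
    """
    actual = parse_allowed_ips(wg_output)
    return sorted(
        key for key, ranges in expected.items()
        if actual.get(key, set()) != set(ranges)
    )


if __name__ == "__main__":
    # Placeholder keys; peer A has peer B's range and peer B has none.
    expected = {
        "KEY_A": {"10.0.1.0/24"},
        "KEY_B": {"10.0.2.0/24"},
        "KEY_C": {"10.0.3.0/24"},
    }
    sample = "KEY_A\t10.0.2.0/24\nKEY_B\t(none)\nKEY_C\t10.0.3.0/24\n"
    print(find_mismatched_peers(expected, sample))
```

With the placeholder data above, both members of the affected pair are reported, which matches the symptom described: peer A carries peer B's range and peer B is left empty.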

We first encountered this issue in a production environment when we moved from 33 to 34 unique peers on each server. The issue was reproduced on 3 different physical servers with similar configuration by adding and removing peer 34.

The [WireGuardPeer] entries do not need to be unique to reproduce the issue. In my testing I used 6 distinct peers and then used 28 or more identical copies of a 7th peer. The results were the same.
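For anyone wanting to reproduce this, a small generator along these lines could produce the test netdev file. It is a sketch under assumptions: the interface name, port, keys, addresses, and endpoints are all placeholders, and `build_netdev` is a hypothetical helper. Only the section and option names ([NetDev], [WireGuard], [WireGuardPeer], PublicKey, AllowedIPs, Endpoint) come from systemd.netdev.

```python
def build_netdev(distinct_peers, repeated_peer, copies, name="wg0"):
    """Build netdev text with the distinct peers followed by `copies`
    identical [WireGuardPeer] sections for `repeated_peer`.

    Each peer is a (public_key, allowed_ips, endpoint) tuple.
    """
    lines = [
        "[NetDev]",
        f"Name={name}",
        "Kind=wireguard",
        "",
        "[WireGuard]",
        "PrivateKey=PLACEHOLDER_PRIVATE_KEY",  # replace with a real key
        "ListenPort=51820",                    # placeholder port
        "",
    ]
    for public_key, allowed_ips, endpoint in list(distinct_peers) + [repeated_peer] * copies:
        lines += [
            "[WireGuardPeer]",
            f"PublicKey={public_key}",
            f"AllowedIPs={allowed_ips}",
            f"Endpoint={endpoint}",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 6 distinct peers plus 28 copies of a 7th peer gives 34 entries.
    distinct = [
        (f"PLACEHOLDER_KEY_{i}", f"10.0.{i}.0/24", f"192.0.2.{i + 1}:51820")
        for i in range(6)
    ]
    seventh = ("PLACEHOLDER_KEY_6", "10.0.6.0/24", "192.0.2.7:51820")
    print(build_netdev(distinct, seventh, copies=28))
```

Changing `copies` to 27, 29, 30, or 31 gives the 33, 35, 36, and 37 peer cases from the table above.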

In January 2019 a bug was reported that was also related to the number of wireguard peers, but the description seems sufficiently different from our case that I felt I should file a distinct bug report. Here's a link to that report in case I'm wrong about that:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1811149

Joshua Sjoding (joshua.sjoding) wrote :

On two systems with 33 peers I noticed that this shows up in dmesg after a reboot:

netlink: 'systemd-network': attribute type 5 has an invalid length.

These lines also show up whenever I run `sudo systemctl restart systemd-networkd` now. They didn't show up before the reboot.

This suggests that there may be issues I haven't noticed yet even with fewer than 34 peers. In our production environment not all of our peers are online all the time, so an issue affecting a few of them could go unnoticed.

Joshua Sjoding (joshua.sjoding) wrote :

I now believe the dmesg complaint in my last comment to be a separate issue. A fix for it was backported to systemd v238 in this commit:

https://github.com/systemd/systemd-stable/commit/7db3fe08c5eb83584f3a3d356876b4acaa797585#diff-f29d1bfc98e548dc0eb497c3d17cbefa

It was not backported to systemd v237:

https://github.com/systemd/systemd-stable/commits/v237-stable/src/network/netdev/wireguard.c

Joshua Sjoding (joshua.sjoding) wrote :

I think the underlying problem is improper fragmentation of netlink messages sent to the WireGuard device by systemd v237 in the set_wireguard_interface function:

https://github.com/systemd/systemd/blob/v237/src/network/netdev/wireguard.c#L107

Appending data to a netlink message can fail once the message size limit has been exceeded. This can happen if there are too many peers or IP masks in the netdev file, and the v237 code doesn't seem to handle that failure properly. It's supposed to split the data into message fragments, but instead it can write incoherent data to the netlink socket or get stuck in an infinite loop.
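To illustrate what correct splitting should look like, here is a minimal sketch of the general technique. It is not the systemd code from either v237 or v241; the sizes, the limit, and the function name are made up for illustration. The point is that every peer lands in exactly one message, in order, and the loop always advances.

```python
def split_into_messages(peer_sizes, max_message_size):
    """Greedily pack peers into messages without exceeding the size limit.

    `peer_sizes` holds a hypothetical encoded size for each peer, in config
    order. Returns a list of messages, each a list of peer indices.
    """
    messages = []
    current = []  # peer indices in the message being built
    used = 0      # bytes used in the message being built
    for index, size in enumerate(peer_sizes):
        if size > max_message_size:
            # A single peer that cannot fit needs finer splitting (for
            # example across its allowed IPs), which this sketch omits.
            raise ValueError(f"peer {index} does not fit in one message")
        if current and used + size > max_message_size:
            # Appending would overflow: finish this message, start a new one.
            messages.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        messages.append(current)
    return messages


if __name__ == "__main__":
    # Five peers of 40 bytes each with a 100-byte limit.
    print(split_into_messages([40, 40, 40, 40, 40], 100))
```

A peer is only placed after any pending overflow has been flushed, so no peer is dropped or attached to a neighbour's data when a message fills up.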

This issue was fixed in systemd v241 by reworking the code over a few commits:

https://github.com/systemd/systemd/pull/11418
https://github.com/systemd/systemd/pull/11580 (this fixed issues with the first PR)

I found some comments (now resolved) on one of the commits illuminating:

https://github.com/systemd/systemd/pull/11418/commits/e1f717d4a02e15ae11a191dd4962b2f4d117678d

Mic92 on 2019-01-15:

> The idea is that netlink's messages are limited in size. If an interface has many peers, addresses or ip masks then the configuration might not fit into one message and has to be split across different messages.

yuwata on 2019-01-15:

> Yeah. I guess there was some bug in the cancellation logic, and it causes infinite loop with the magic number 23.

The infinite loop with 23 peers yuwata mentions is a reference to Leonid's bug report from January:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1811149

I expect that backporting these fixes from v241 to bionic's systemd v237 branch would resolve both my issue and the issue reported by Leonid.

I realize this is a non-trivial change and there's a regression risk.

Joshua Sjoding (joshua.sjoding) wrote :

It turns out the fix for this issue was backported to systemd v240:

https://github.com/systemd/systemd-stable/pull/37

I performed a release upgrade on one of our affected servers, bringing it from Ubuntu 18.04 to Ubuntu 19.04 (which uses systemd v240), and I can confirm that the peers are now being configured correctly.

So this issue affects Ubuntu 18.04 LTS but no later supported release. 18.10 was also affected, but it is EOL.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Dan Streetman (ddstreet) wrote :

Please reopen if this is still an issue.

Changed in systemd (Ubuntu):
status: Confirmed → Invalid