netplan generator causes deadlock during systemd daemon-reload

Bug #1999178 reported by Imre Jonk
This bug affects 5 people
Affects: Netplan
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

From the issue in the systemd tracker [1]:

During a systemd daemon-reload, the Netplan generator in /usr/lib/systemd/system-generators/netplan tries to reload the udev rules and databases with `udevadm control --reload` [2]. This causes a deadlock: reloading udev rules and databases requires a functioning systemd-userdbd.service, which is not available during a systemd daemon-reload. The systemd developers will not consider changing any systemd component to prevent this deadlock. Instead, they suggest that the Netplan developers generate their systemd link/network/netdev units using a generator service like systemd-network-generator.service.

[1] https://github.com/systemd/systemd/issues/25543
[2] https://github.com/canonical/netplan/blob/bf8036d43837bc071d1d3f716f67dc24ef5a5b23/src/generate.c#L54

Imre Jonk (imrejonk) wrote :

The user-visible impact of this bug is that `systemctl daemon-reload` hangs until the udevadm reload times out. This causes some significant issues, or as devvick on GitHub describes it: "I cannot even ssh to the host while this hangup is happening."

Lukas Märdian (slyon)
Changed in netplan:
status: New → Triaged
importance: Undecided → High
tags: added: foundations-todo rls-ll-incoming

Danilo Egea Gondolfo (danilogondolfo) wrote :

It can be easily reproduced on Arch Linux:

lxc launch images:archlinux archlinux --vm -c security.secureboot=false
lxc exec archlinux bash
pacman -Sy netplan
mkdir /etc/netplan

cat << EOF > /etc/netplan/10-blah.yaml
network:
  version: 2
  ethernets:
    enp5s0:
      dhcp4: true
EOF

systemctl daemon-reload # this will get stuck for 60 seconds

Systemd version: 252.3-1-arch
netplan version: 0.105-1

I don't run into the same problem on Ubuntu Lunar with systemd 252.1-1ubuntu1.

Danilo Egea Gondolfo (danilogondolfo) wrote :

Thank you for your bug report.

I prepared a PR to start some discussions around this problem.

https://github.com/canonical/netplan/pull/304

Gauthier Jolly (gjolly) wrote :

We found a reproducer on Ubuntu 23.04 Minimal images for Azure, see this comment for more details: https://bugs.launchpad.net/ubuntu/+source/walinuxagent/+bug/2016012/comments/2

Danilo Egea Gondolfo (danilogondolfo) wrote :

Reproducer on regular Ubuntu images:

lxc launch ubuntu-daily:lunar lunar --vm
lxc shell lunar

echo 'KERNEL=="console", GROUP="syslog", MODE="0620"' > /usr/lib/udev/rules.d/67-console.rules

# time systemctl daemon-reload

real 0m0.264s
user 0m0.005s
sys 0m0.004s

groupdel -f syslog

# time systemctl daemon-reload

real 1m0.327s
user 0m0.000s
sys 0m0.009s

Lukas Märdian (slyon)
tags: added: fr-4130

Danilo Egea Gondolfo (danilogondolfo) wrote :

On Arch Linux, daemon-reload will get stuck even if all the groups referenced from udev rules exist.

After some digging, I've found the reason why it happens on Arch and not on Ubuntu: nsswitch.conf

On Arch Linux, udevd will query the /etc/group file AND systemd:

group: files [SUCCESS=merge] systemd

On Ubuntu, we only query systemd if we can't find the group in the /etc/group file (the default success action is "return"):

group: files systemd sss

Change /etc/nsswitch.conf on Ubuntu to always go to systemd and daemon-reload will always get stuck even if all groups exist:

group: files [SUCCESS=merge] systemd sss

Here is a stack trace from where it gets stuck waiting for a response from systemd (userdb?): https://paste.ubuntu.com/p/6k4nCssnGh/
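
The difference between the two nsswitch.conf variants can be checked mechanically. The following is a minimal sketch, assuming a POSIX shell with awk. The function name `group_merges_into_systemd` is hypothetical, and the check only recognizes a single `[SUCCESS=merge]` action placed directly before `systemd` on the `group:` line, as in the examples in this comment.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the "group:" line of an nsswitch.conf-style
# file has [SUCCESS=merge] directly before "systemd", i.e. the systemd NSS
# module is consulted even when an earlier source already found the group.
group_merges_into_systemd() {
    conf="${1:-/etc/nsswitch.conf}"
    awk '
        $1 == "group:" {
            for (i = 3; i <= NF; i++)
                if ($i == "systemd" && toupper($(i - 1)) == "[SUCCESS=MERGE]")
                    found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$conf"
}
```

Called without arguments it inspects /etc/nsswitch.conf; a zero exit status suggests the configuration that makes daemon-reload hang even when all groups exist.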

Greg (rollenwiese) wrote :

Just logging in to confirm that changing this line in nsswitch.conf:

group: files [SUCCESS=merge] systemd

to:

group: files systemd

does resolve the issue.

Hopefully this can be fixed soon. Alternately it would be fantastic if netplan could just output a systemd network config to an arbitrary directory...

Lupe Christoph (lupe) wrote :

I beg to differ. Yesterday I upgraded my Ubuntu 22.10 to 23.04. The upgrade hung many times and took two hours on a machine with a SATA SSD and a Ryzen 7 5700G. While working out some smaller problems, I found that systemctl daemon-reload hangs for almost exactly one minute.

Now, this one minute delay is familiar to me. I have it on a Debian 11 machine in the boot sequence. That machine also uses netplan. The timeout seems to occur twice in the boot sequence, causing a delay of two minutes. *But* systemctl daemon-reload is fast on that machine.

ubuntu-23.04$ grep '^group' /etc/nsswitch.conf
group: compat systemd

debian-11: $ grep '^group' /etc/nsswitch.conf
group: files systemd

Replacing "compat" with "files" on Ubuntu makes no difference.

I had filed a bug report more than a year ago in the Debian bug tracker (#1008995) which went ignored.

So far I have narrowed the problem on Ubuntu 23.04 down to these two lines:
Jul 22 19:45:12 alanya systemd[45294]: /usr/lib/systemd/system-generators/systemd-gpt-auto-generator succeeded.
Jul 22 19:46:12 alanya systemd[45294]: /usr/lib/systemd/system-generators/netplan succeeded.

This is from a debug-level journal. To see these lines, you need to change the default log level with "systemctl log-level debug"; the default on Ubuntu is "info".
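
The gap between two such journal lines can be computed instead of read off by eye. This is a rough sketch, assuming journal lines in the short format shown above arrive on stdin. `generator_delays` is a hypothetical helper name. Because generators may run in parallel, the gap only indicates which generator finished late, not how long it ran.

```shell
#!/bin/sh
# Hypothetical helper: for every "<generator> succeeded." journal line on
# stdin, print the generator name and the number of seconds since the
# previous such line. The time is taken from the third field (HH:MM:SS).
generator_delays() {
    awk '
        /system-generators\/.* succeeded\./ {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen) {
                delta = now - prev
                if (delta < 0) delta += 86400   # the log crossed midnight
                n = split($(NF - 1), p, "/")
                print p[n], delta
            }
            prev = now
            seen = 1
        }
    '
}
```

It could be fed from the journal, for example with `journalctl -b | generator_delays`, once debug logging is enabled.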

I tried to strace /usr/lib/systemd/system-generators/netplan, but it exited immediately with exit code 1.

If anybody can provide me with a hint on further debugging this, I'm quite willing to continue. Right now, I'm stuck.

Update:
I converted the Debian machine to systemd-networkd and uninstalled the netplan packages. The two-minute delay remained. The delay is in /lib/systemd/systemd-networkd-wait-online. It waits for any of the network devices. I can make it return immediately by adding a --ignore for all of them, which is not a good option.

But all this is beside the point. The delay on Ubuntu is definitely somewhere in the netplan mechanism.

Danilo Egea Gondolfo (danilogondolfo) wrote :

I can confirm we are hitting this issue when upgrading from Lunar to Mantic.

To reproduce it, one just needs to launch an LXD Lunar VM and upgrade to Mantic. The process will get stuck many times and will take longer than it should.

The problem seems to be that the user "usbmux" doesn't exist in the system, so whatever is trying to use it will go to systemd after looking for it in /etc/passwd.

The workaround is to add the user: useradd -u 111 -g 46 usbmux

But the real problem still must be addressed in Netplan.

Lupe Christoph (lupe) wrote :

I have the user usbmux, and judging by the UID, have for some time:

usbmux:x:108:46:usbmux

The highest system UID is 111.

Danilo Egea Gondolfo (danilogondolfo) wrote :

Thanks for the information.

Do you still have dpkg logs from that time to confirm if the package usbmuxd was installed along with the upgrade?

Here is the sequence of events that triggers this issue with a clean Lunar installation:

1) The upgrade will install the package usbmuxd. It will drop some udev configuration in /lib/udev/rules.d/39-usbmuxd.rules. This file refers to the user usbmux.
2) At some point, installation scripts will start calling systemctl daemon-reload during the upgrade. When that happens, you'll notice the upgrade process getting stuck.
3) When the upgrade process gets to the point where it calls usbmuxd's postinst script, the user will be created.

From this point on, daemon-reloads will not get stuck anymore, at least not because of the user usbmux.

If you have udev rules referring to non-existent system users, you'll probably see the same thing happening.
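
Rules of this kind can be searched for ahead of an upgrade. The following is a minimal sketch, assuming GNU grep and sed. `missing_udev_names` is a hypothetical helper, the default paths are assumptions, and only plain `OWNER="name"` and `GROUP="name"` assignments are recognized. It reads the passwd and group files directly instead of going through NSS, so the check itself does not query systemd. Users or groups that exist only in other NSS sources would therefore be reported as missing.

```shell
#!/bin/sh
# Hypothetical helper: print "OWNER name" / "GROUP name" for every user or
# group assigned in udev rules under RULES_DIR that has no entry in the
# given passwd / group file.
missing_udev_names() {
    rules_dir="${1:-/usr/lib/udev/rules.d}"
    passwd_file="${2:-/etc/passwd}"
    group_file="${3:-/etc/group}"
    for key in OWNER GROUP; do
        if [ "$key" = OWNER ]; then db="$passwd_file"; else db="$group_file"; fi
        # Collect the quoted value of each KEY="name" or KEY:="name"
        # assignment; KEY=="name" is a match, not an assignment, and is skipped.
        cat "$rules_dir"/*.rules 2>/dev/null |
            grep -v '^[[:space:]]*#' |
            grep -oE "${key}:?=\"[^\"]+\"" |
            sed -E 's/.*="([^"]+)"/\1/' |
            sort -u |
            while read -r name; do
                # Skip numeric IDs and values containing substitutions.
                case "$name" in
                    ''|*[!A-Za-z0-9._-]*|[0-9]*) continue ;;
                esac
                awk -F: -v n="$name" '$1 == n { f = 1 } END { exit(f ? 0 : 1) }' "$db" ||
                    echo "$key $name"
            done
    done
}
```

Running it once per rules directory (for example /etc/udev/rules.d as well) lists the names whose lookup would fall through to systemd during a daemon-reload.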

Lupe Christoph (lupe) wrote :

> 1) The upgrade will install the package usbmuxd. It will drop some udev configuration in /lib/udev/rules.d/39-usbmuxd.rules. This file refers to the user usbmux.

Sorry, the last update of usbmuxd was 1.1.1-2build2, which happened on Sun, 7 Aug 2022 15:31:28 +0200 (CEST).

So no usbmuxd installation scripts were called during the upgrade to Lunar. And, yes, I checked the log of the installation as well as the News and Changelog. No usbmuxd.

Kirt Runolfson (kirtr) wrote :

Is there a fix for this bug yet? Can it be pushed into production?

apt install cups
apt install ubuntu-desktop

Both of these commands trigger system hangs on a fresh mantic server or lxd installation.

I've tracked both of these down to netplan per Lupe's debugging strategy enumerated above. I haven't been able to trigger it with usbmuxd.

Since this is the currently supported path for network-based desktop installation, it seems like a critical bug, not a "This bug affects 2 people" bug. Should I create another report to get it more visibility?

Hoernchen (hoernchen) wrote :

This was a massive issue after upgrading Ubuntu from 22.04 to 23.10 and then 24.04:
Mai 11 13:42:04 insp systemd-udevd[397]: /etc/udev/rules.d/99-usbmon.rules:1 Unknown group 'wireshark', ignoring.

This innocent little line caused the systemd reload, which happens all the time, to take 60 seconds, which in turn made package operations using apt take forever.

Florian Hackenberger (f-hackenberger) wrote :

This also affects me and was very difficult to figure out. It took me about two hours to work out that the slow apt operations were related to systemd, then to find out how to trace systemd appropriately (I used 'sudo strace --summary --summary-wall-clock --timestamps --follow-forks -p 1', where 1 is the PID of the systemd init process), and finally to be fast enough to see that the process it was waiting on was '/usr/lib/systemd/system-generators/netplan /run/systemd/generator /run/systemd/generator.early /run/systemd/generator.late', which led me to this bug report. In my case it was a custom udev rule for 'mixxx' that referenced a non-existent group.

I filed an issue on the ubuntu bugtracker: https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2069495 to get this fixed.

Lukas Märdian (slyon) wrote :

I wonder if we could make use of the `resolve_names=late` setting from udev.conf(5), to work around this behavior?

https://www.freedesktop.org/software/systemd/man/latest/udev.conf.html

E.g. something like this:

$ mkdir -p /etc/udev/udev.conf.d/
$ echo "resolve_names=late" > /etc/udev/udev.conf.d/netplan.conf
$ reboot

Could anybody who's affected by this give this workaround a try?

Lupe Christoph (lupe) wrote :

The delay disappeared on my Debian system with the bookworm upgrade, so I can't test anymore.

Martin Damzog (thessalonians) wrote :

With 'resolve_names=late' in udev configuration there is no change in behaviour. 'netplan generate' still takes about 1 min to complete.
Deleting the link from /usr/lib/systemd/system-generators 'fixes' the problem, but there are side effects, I guess.

