[jammy] DNS issue triggered by command "udevadm trigger --subsystem-nomatch=input"

Bug #2028023 reported by Aristo Chen
This bug affects 3 people
Affects                Status        Importance  Assigned to  Milestone
OEM Priority Project   New           Critical    Aristo Chen
systemd (Ubuntu)       Fix Released  Undecided   Unassigned
  Jammy                Incomplete    High        Unassigned

Bug Description

[Summary]
In the jammy server image, when NetworkManager is used as the netplan renderer, DNS resolution may fail after running "udevadm trigger --subsystem-nomatch=input" as superuser. The same udevadm command may also be run as part of "apt install XXXXXX" or "snap install XXXXXX".

[Steps to reproduce]
1. Log in to a jammy server image environment with Vagrant (the jammy desktop environment works fine)
2. Execute command "apt install network-manager"
3. Modify the netplan config to use NetworkManager as renderer
"""
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager
"""
4. Reboot
5. Execute command "ping google.com" first to make sure DNS works
6. Execute command "udevadm trigger --subsystem-nomatch=input" as superuser
7. Execute command "ping google.com" to check whether DNS still works
8. If DNS still works, repeat steps 6-7; the failure rate is around 20-30%

[Other info]
I tested the steps above in the following scenarios, which work without the DNS issue:
1. A focal server environment with Vagrant: no DNS issue.
2. A jammy server environment with Vagrant (which has the DNS issue), after adding the kinetic/kinetic-updates source list and installing systemd (version: 251.4-1ubuntu7.3) from kinetic: no DNS issue.

Conversely, in a focal server environment with Vagrant, after adding the jammy/jammy-updates source list and installing systemd (version: 249.11-0ubuntu3.9) from jammy, I hit the DNS issue as well when following [Steps to reproduce].

[Other info - Vagrantfile]
The following is the Vagrantfile that I used for testing:
"""
Vagrant.configure("2") do |config|
  if Vagrant.has_plugin?("vagrant-timezone")
    config.timezone.value = :host
  end
  config.vm.define "test" do |test|
    test.vm.box = "ubuntu/jammy64"
    test.vm.provider "virtualbox" do |vb|
      vb.name = "test2"
      vb.memory = "4096"
      vb.cpus = 4
      config.disksize.size = "50GB"
    end
  end
end
"""

[Fail rate]
20-30%

[Test script]
The following is the script that I used for testing. Ideally it should run forever; because of "set -e", it stops at the first failed ping.
"""
#!/bin/bash

set -ex

counter=1

while true; do
 echo "$(date): test round ${counter}"
 counter=$((counter + 1))
 sudo udevadm trigger --subsystem-nomatch=input
 ping google.com -c 2
 sleep 0.5
done
"""

Aristo Chen (aristochen)
Changed in oem-priority:
importance: Undecided → Critical
assignee: nobody → Aristo Chen (aristochen)
Aristo Chen (aristochen)
description: updated
Aristo Chen (aristochen)
tags: added: originate-from-2012231
Rex Tsai (chihchun) wrote :

I would add a command to collect the output of resolvectl status, to understand what went wrong when DNS is not working.

% resolvectl status
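
One way to follow this suggestion is to extend the reproduction loop so that it saves the resolver state at the moment the first lookup fails. The sketch below is an illustration only: the TRIGGER_CMD, LOOKUP_CMD, STATUS_CMD and SETTLE_SECONDS variables, the run_rounds function and the log file name are hypothetical names introduced here, not part of the original test script.

```shell
#!/bin/bash
# Sketch only: repeat the trigger until a lookup fails, then save resolver state.
# The *_CMD variables and SETTLE_SECONDS are hypothetical knobs added so the
# loop can be dry-run without root or network access.
TRIGGER_CMD=${TRIGGER_CMD:-"sudo udevadm trigger --subsystem-nomatch=input"}
LOOKUP_CMD=${LOOKUP_CMD:-"ping -c 2 google.com"}
STATUS_CMD=${STATUS_CMD:-"resolvectl status"}

# run_rounds MAX LOGFILE
# Returns 1 at the first failed lookup (after appending the resolver status
# to LOGFILE), or 0 if all MAX rounds pass.
run_rounds() {
    max=$1
    log=$2
    round=1
    while [ "$round" -le "$max" ]; do
        $TRIGGER_CMD
        if ! $LOOKUP_CMD >/dev/null 2>&1; then
            {
                echo "$(date): lookup failed in round ${round}"
                $STATUS_CMD
            } >>"$log" 2>&1
            return 1
        fi
        sleep "${SETTLE_SECONDS:-0.5}"
        round=$((round + 1))
    done
    return 0
}

# Start the real loop only when explicitly requested.
if [ "${1:-}" = "--run" ]; then
    run_rounds "${2:-100}" "${3:-resolvectl-failure.log}"
fi
```

Saved as, say, capture.sh (a made-up name), it could be started with "./capture.sh --run 200 failure.log"; the log then holds the "resolvectl status" output taken right after the failing round.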

Aristo Chen (aristochen) wrote :

==== When DNS is not working ====
vagrant@ubuntu-jammy:~$ ping google.com
ping: google.com: Temporary failure in name resolution
vagrant@ubuntu-jammy:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (enp0s3)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

==== When DNS is working ====
vagrant@ubuntu-jammy:~$ ping google.com
PING google.com (172.217.160.110) 56(84) bytes of data.
64 bytes from tsa03s06-in-f14.1e100.net (172.217.160.110): icmp_seq=1 ttl=63 time=6.92 ms
64 bytes from tsa03s06-in-f14.1e100.net (172.217.160.110): icmp_seq=2 ttl=63 time=7.14 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 6.915/7.029/7.144/0.114 ms
vagrant@ubuntu-jammy:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (enp0s3)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.0.2.3
       DNS Servers: 10.0.2.3
        DNS Domain: buildd

Aristo Chen (aristochen) wrote :

I found that the issue can be fixed by this upstream commit:

"""
commit 26591ffffd62c62796b1f65146648022d68e1279
Author: Yu Watanabe <email address hidden>
Date: Sun Nov 14 15:46:47 2021 +0900

    resolve: do not clear DNS servers or friends on link which is not managed by networkd

    When networkd detects an unmanaged link, then the state is changed in
    the following order:
    pending -> initialized -> unmanaged

    The "initialized" state was added by bd08ce56156751d58584a44e766ef61340cdae2d.

diff --git a/src/resolve/resolved-link.c b/src/resolve/resolved-link.c
index 18dc3d29e9..dd219f297c 100644
--- a/src/resolve/resolved-link.c
+++ b/src/resolve/resolved-link.c
@@ -565,7 +565,7 @@ static int link_is_managed(Link *l) {
         if (r < 0)
                 return r;

-        return !STR_IN_SET(state, "pending", "unmanaged");
+        return !STR_IN_SET(state, "pending", "initialized", "unmanaged");
 }

 static void link_read_settings(Link *l) {

"""
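
To make the effect of the one-line change concrete, the two versions of the check can be modelled as follows. This is a shell model of the final return statement only, not the C code itself; is_managed_old and is_managed_new are hypothetical names, and the error handling that precedes the return in link_is_managed() is left out.

```shell
# Exit status 0 means "treated as managed by networkd", 1 means "not managed".

# Before the patch: only "pending" and "unmanaged" count as not managed.
is_managed_old() {
    case "$1" in
        pending|unmanaged) return 1 ;;
        *) return 0 ;;
    esac
}

# After the patch: "initialized" is also treated as not managed.
is_managed_new() {
    case "$1" in
        pending|initialized|unmanaged) return 1 ;;
        *) return 0 ;;
    esac
}
```

With the old check, a link passing through the intermediate "initialized" state is briefly considered managed, which per the commit message is when its DNS servers get cleared.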

Nick Rosbrook (enr0n) wrote :

This patch is present in Lunar and newer, so we only need to SRU for Jammy.

tags: added: systemd-sru-next
Changed in systemd (Ubuntu Jammy):
importance: Undecided → High
Changed in systemd (Ubuntu):
status: New → Fix Released
Nick Rosbrook (enr0n) wrote :

I think this is actually more of a configuration issue in the environment: it seems that systemd-networkd is still running, despite having enabled NetworkManager.

The reason I say that is that when systemd-networkd is not running, i.e. all links are unmanaged, the directory /run/systemd/netif/links will be empty. Hence, link_is_managed will always return 0 because it will get -ENODATA from sd_network_link_get_setup_state[1]. Therefore, without systemd-networkd running, you should never hit the condition where the link state is "initialized" (nor should you even hit the `return !STR_IN_SET(...);` statement in link_is_managed()).

Can you please confirm if doing e.g. systemctl disable --now systemd-networkd fixes the problem?

[1] https://github.com/systemd/systemd-stable/blob/v249.11/src/resolve/resolved-link.c#L562C32-L562C32
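
The claim about /run/systemd/netif/links can be checked directly on an affected machine. The helper below is a sketch (count_link_state_files is a hypothetical name) that counts the entries in that directory; a non-zero count is only a hint that systemd-networkd is, or recently was, writing link state.

```shell
# count_link_state_files [DIR]
# Print the number of entries in DIR (default: /run/systemd/netif/links),
# or 0 when the directory does not exist.
count_link_state_files() {
    dir=${1:-/run/systemd/netif/links}
    if [ -d "$dir" ]; then
        find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
    else
        echo 0
    fi
}
```

Running "systemctl is-active systemd-networkd" next to "count_link_state_files" shows whether the service state and the directory contents agree.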

Changed in systemd (Ubuntu Jammy):
status: New → Incomplete
Aristo Chen (aristochen) wrote :

Hi Nick,

Thanks for looking into this issue!

I have tried "systemctl disable --now systemd-networkd" and tested with the test script in the bug description. I am still able to reproduce this issue.

Nick Rosbrook (enr0n) wrote :

Even after a reboot etc.? And you confirmed that systemd-networkd is indeed not running anymore?

Nick Rosbrook (enr0n)
tags: removed: systemd-sru-next
Lukas Märdian (slyon) wrote :

systemd-networkd.service (and .socket) seems to be enabled by default on Ubuntu server images through livecd-rootfs: https://git.launchpad.net/ubuntu/+source/livecd-rootfs/tree/live-build/ubuntu-server/includes.chroot.ubuntu-server-minimal/etc/systemd/system

So in order to keep networkd disabled on such images, we'd probably need to use something like "systemctl mask systemd-networkd.service".
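
A minimal sketch of that idea, covering both the service and the socket unit mentioned above. Nothing in this report confirms that masking resolves the DNS issue, and masking networkd on a machine that relies on it for networking would cut that machine off, so treat this as an experiment for a test VM. SYSTEMCTL and mask_networkd are hypothetical names; the function has to run as root.

```shell
# SYSTEMCTL is overridable so the function can be dry-run (e.g. SYSTEMCTL=echo).
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# mask_networkd: stop and mask systemd-networkd's service and socket units.
# Stops at the first unit that fails and returns 1.
mask_networkd() {
    for unit in systemd-networkd.service systemd-networkd.socket; do
        $SYSTEMCTL mask --now "$unit" || return 1
    done
}
```

Afterwards "systemctl is-enabled systemd-networkd.service" should print "masked"; "systemctl unmask" on the same units reverts the change.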

Esko Järnfors (esko-jarnfors) wrote :

Hello!

We have a rather large environment with ~1500 computers and we keep hitting this issue annoyingly often. Is masking the systemd-networkd service unit a viable fix (that will not cascade onwards to cause new issues) for production systems or could the systemd patch be issued as an SRU to fix only the underlying bug (it looks like the SRU process was aborted for some reason)?
