systemd-udevd: Run net_setup_link on 'change' uevents to prevent DNS outages on Azure

Bug #1988119 reported by Pieter
This bug affects 43 people
Affects            Status        Importance  Assigned to       Milestone
systemd (Ubuntu)   Fix Released  Undecided   Unassigned
Bionic             Fix Released  Critical    Matthew Ruffell

Bug Description

[Impact]

A widespread outage was caused on Azure instances earlier today, when systemd 237-3ubuntu10.54 was published to the bionic-security pocket. Instances could no longer resolve DNS queries, breaking networking.

For affected users, the following workarounds are available. Use whatever is most convenient.
- Reboot your instances
- or -
- Issue "udevadm trigger -cadd -yeth0 && systemctl restart systemd-networkd" as root

The trigger was found to be open-vm-tools issuing "udevadm trigger". Azure uses a specific netplan configuration that matches the NIC by `driver` to set up networking. When a udevadm trigger is executed, the udev property carrying this driver information (ID_NET_DRIVER) is lost. The next time the netplan-generated configuration is applied, the server loses its DNS information.
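For context, the netplan configuration generated by cloud-init on Azure typically matches the NIC by driver, and systemd-networkd relies on the ID_NET_DRIVER udev property to satisfy that match. A rough illustration (the file path and exact contents are typical values and may differ between images):

$ cat /etc/netplan/50-cloud-init.yaml
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true
            match:
                driver: hv_netvsc
            set-name: eth0
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc

On an affected instance, the second command returns nothing once a bare "udevadm trigger" has run.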

This is the same as bug 1902960 experienced on Focal two years ago.

The root cause was found to be a bug in systemd: when udevd receives a non-'add' uevent (such as 'change'), it still needs to run net_setup_link() to re-assign the ID_NET_DRIVER=, ID_NET_LINK_FILE= and ID_NET_NAME= properties, while skipping the device rename and keeping the old name.
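On an unpatched system the asymmetry between the two uevent types can be seen directly from the shell (a minimal sketch, assuming the interface is eth0):

# a synthetic 'change' uevent (the default for "udevadm trigger") loses the property
$ sudo udevadm trigger -c change -y eth0
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
# (no output)

# a synthetic 'add' uevent re-runs net_setup_link and restores it
$ sudo udevadm trigger -c add -y eth0
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc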

[Testcase]

Start an instance up on Azure, any type. Simply issue udevadm trigger and reload systemd-networkd:

$ ping google.com
PING google.com (172.253.62.102) 56(84) bytes of data.
64 bytes from bc-in-f102.1e100.net (172.253.62.102): icmp_seq=1 ttl=56 time=1.85 ms
$ sudo udevadm trigger && sudo systemctl restart systemd-networkd
$ ping google.com
ping: google.com: Temporary failure in name resolution

To fix a broken instance, you can run:

$ sudo udevadm trigger -cadd -yeth0 && sudo systemctl restart systemd-networkd

and then install the test packages below:

Test packages are available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf343528-test

If you install them, the issue should no longer occur.
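Besides the ping test, a quick way to confirm the fix is to check that the driver property now survives a plain trigger (assuming eth0):

$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc
$ sudo udevadm trigger && sudo systemctl restart systemd-networkd
$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc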

[Where problems could occur]

If a regression were to occur, it would affect systemd-udevd's processing of 'change' uevents from network devices, which could lead to network outages. Since systemd-networkd is restarted on postinstall, a regression would cause widespread outages, because this SRU targets the security pocket, from which unattended-upgrades installs updates automatically.

Side effects could include incorrect udevd device properties.

It is very important that this SRU is well tested before release.

[Other info]

This was fixed in Systemd 247 with the following commit:

commit e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151
Author: Yu Watanabe <email address hidden>
Date: Mon, 14 Sep 2020 15:21:04 +0900
Subject: udev: re-assign ID_NET_DRIVER=, ID_NET_LINK_FILE=, ID_NET_NAME= properties on non-'add' uevent
Link: https://github.com/systemd/systemd/commit/e0e789c1e97e2cdf1cafe0c6b7d7e43fa054f151

This was backported to Focal's systemd 245.4-4ubuntu3.4 in bug 1902960 two years ago. Focal required a heavy backport, which was performed by Dan Streetman. Focal's backport can be found in d/p/lp1902960-udev-re-assign-ID_NET_DRIVER-ID_NET_LINK_FILE-ID_NET.patch, or in the pastebin below:

https://paste.ubuntu.com/p/K5k7bGt3Wx/

The changes between the Focal backport and the Bionic backport are:

- We use udev_device_get_action() instead of device_get_action()
- device_action_from_string() is used to get to enum DeviceAction
- We return 0 from the "if (a == DEVICE_ACTION_MOVE) " hunk instead of "goto no_rename"
- log_device_* has been changed to log_*.

See attached debdiff for Bionic backport.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Pieter Lexis (pieter-lexis-tt) wrote :

We've just had the same problem, on multiple VMs running in Azure.

In the dpkg log we can see that systemd was indeed updated (times in UTC):

2022-08-30 06:31:18 status unpacked udev:amd64 237-3ubuntu10.54
2022-08-30 06:31:18 status half-configured udev:amd64 237-3ubuntu10.54
2022-08-30 06:31:19 status installed udev:amd64 237-3ubuntu10.54
2022-08-30 06:31:19 status triggers-pending initramfs-tools:all 0.130ubuntu3.13
2022-08-30 06:31:19 trigproc man-db:amd64 2.8.3-2ubuntu0.1 <none>
2022-08-30 06:31:19 status half-configured man-db:amd64 2.8.3-2ubuntu0.1
2022-08-30 06:31:19 status installed man-db:amd64 2.8.3-2ubuntu0.1
2022-08-30 06:31:19 trigproc ureadahead:amd64 0.100.0-21 <none>
2022-08-30 06:31:19 status half-configured ureadahead:amd64 0.100.0-21
2022-08-30 06:31:20 status installed ureadahead:amd64 0.100.0-21
2022-08-30 06:31:20 trigproc libc-bin:amd64 2.27-3ubuntu1.5 <none>
2022-08-30 06:31:20 status half-configured libc-bin:amd64 2.27-3ubuntu1.5
2022-08-30 06:31:20 status installed libc-bin:amd64 2.27-3ubuntu1.5
2022-08-30 06:31:20 trigproc systemd:amd64 237-3ubuntu10.53 <none>
2022-08-30 06:31:20 status half-configured systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:20 status installed systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:20 trigproc initramfs-tools:all 0.130ubuntu3.13 <none>
2022-08-30 06:31:20 status half-configured initramfs-tools:all 0.130ubuntu3.13
2022-08-30 06:31:34 status installed initramfs-tools:all 0.130ubuntu3.13
2022-08-30 06:31:37 startup archives unpack
2022-08-30 06:31:38 upgrade libnss-systemd:amd64 237-3ubuntu10.53 237-3ubuntu10.54
2022-08-30 06:31:38 status triggers-pending libc-bin:amd64 2.27-3ubuntu1.5
2022-08-30 06:31:38 status half-configured libnss-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status unpacked libnss-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status half-installed libnss-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status triggers-pending man-db:amd64 2.8.3-2ubuntu0.1
2022-08-30 06:31:38 status half-installed libnss-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status unpacked libnss-systemd:amd64 237-3ubuntu10.54
2022-08-30 06:31:38 status unpacked libnss-systemd:amd64 237-3ubuntu10.54
2022-08-30 06:31:38 upgrade libpam-systemd:amd64 237-3ubuntu10.53 237-3ubuntu10.54
2022-08-30 06:31:38 status half-configured libpam-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status unpacked libpam-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status half-installed libpam-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status half-installed libpam-systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status unpacked libpam-systemd:amd64 237-3ubuntu10.54
2022-08-30 06:31:38 status unpacked libpam-systemd:amd64 237-3ubuntu10.54
2022-08-30 06:31:38 upgrade systemd:amd64 237-3ubuntu10.53 237-3ubuntu10.54
2022-08-30 06:31:38 status half-configured systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status unpacked systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:38 status half-installed systemd:amd64 237-3ubuntu10.53
2022-08-30 06:31:39 status triggers-pending ureadahead:amd64 0.100.0-21
2022-08-30 06:31:39 status triggers-pending dbus:amd64 1.12.2-1ubuntu1.3
2022-08-...

Revision history for this message
Lutz Willek (willek) wrote :

Seems to be a duplicate of https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1938791 - same symptoms.

[Workaround]

Reboot the node; DNS should return to normal.

Revision history for this message
Pieter Lexis (pieter-lexis-tt) wrote :

Microsoft has created an incident for this. https://azure.status.microsoft/en-us/status reports:

 Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors - Investigating

Starting at approximately 06:00 UTC on 30 Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs recently upgraded to systemd version 237-3ubuntu10.54 reported experiencing DNS errors when trying to access their resources. Reports of this issue are confined to this single Ubuntu version.

This bug and a potential fix have been highlighted on the Canonical / Ubuntu site, which we encourage impacted customers to read:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119

An additional potential workaround customers can consider is to reboot impacted VM instances so that they receive a fresh DHCP lease and new DNS resolver(s).

Any Azure service, including AKS, that uses Canonical Ubuntu version 18.04 of Linux may have some impact from this issue. We are working on mitigations across Azure services that are impacted.

More information will be provided within 60 minutes, when we expect to know more about the root cause and mitigation workstreams.

This message was last updated at 09:20 UTC on 30 August 2022

Revision history for this message
Iain Lane (laney) wrote :

I've removed the update from bionic-security and bionic-updates, and restored the versions which were previously in there.

This won't help anyone that has already received the broken update - I think the advice there is to restart, or there is a workaround in the OP here - but it should prevent any further occurrences.

Note that there will be a delay of up to an hour or so for mirrors to receive the deletion.

Revision history for this message
Pieter Lexis (pieter-lexis-tt) wrote :

> This won't help anyone that has already received the broken update - I think the advice there is to restart, or there is a workaround in the OP here - but it should prevent any further occurrences.

Do note this is not a solution for those using non-Azure resolvers provided via DHCP through their VNET. These users must reboot or manually set the fallback servers to their custom DNS resolver addresses.
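For reference, one way to re-apply custom resolvers by hand until a reboot is possible (a sketch; the addresses and domain below are placeholders for your own values):

$ sudo systemd-resolve --interface=eth0 --set-dns=10.0.0.10 --set-dns=10.0.0.11 --set-domain=example.internal
# or persistently, by adding them as fallbacks for systemd-resolved:
$ echo "FallbackDNS=10.0.0.10 10.0.0.11" | sudo tee -a /etc/systemd/resolved.conf
$ sudo systemctl restart systemd-resolved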

Vasili (vasili.namatov)
no longer affects: systemd
Revision history for this message
Lee Van Steerthem (leevs) wrote :

Not sure if this is the best place to help people understand whether their nodes are impacted.
We already saw 2 different types of impact on our Azure AKS clusters:
- Pods not able to terminate
- New images not being pulled from ACR (or any container registry)

Sometimes it was very clear because nodes were "Not Ready"; in other cases it is very hard to detect.

We have found a way to detect if your nodes are affected.

kubectl logs <pod name>
When you get the following error you know it's impacted: Error from server (InternalError): Internal error occurred: Authorization error (user=masterclient, verb=get, resource=nodes, subresource=proxy)

So restarting the node will help, and if your cluster is sensitive you can be more granular about the restarts.
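Another way to check whether the underlying nodes are affected is to query the driver property on each VMSS instance directly, along the lines of the run-command examples elsewhere in this thread (resource group and scale set names are placeholders):

$ az vmss list-instances -g <nodeResourceGroup> -n <vmss> --query "[].id" --output tsv | az vmss run-command invoke --ids @- \
    --command-id RunShellScript \
    --scripts "udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER || echo AFFECTED"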

I hope this helps some visitors from the Azure status page.

Revision history for this message
Luciano Santos da Silva (lucianosilva7374) wrote :

Hey guys, nothing is working. My application has been down since early this morning. We have already tried restarting the nodes and restarting the VM, but nothing has worked, and we don't have any update from Microsoft. 4 hours ago they said "More information will be provided within 60 minutes, when we expect to know more about the root cause and mitigation workstreams.".

Revision history for this message
Mark Gerrits (skinny79) wrote (last edit ):

For anyone hitting this issue with AKS clusters: I have embedded the workaround above in a daemonset to avoid having to restart all nodes (for now)

https://gist.github.com/skinny/96e7feb6b347299ebfacaa76295a82e7

- Please change the image+tag used in the daemonset to whatever is available in your cluster.
- Deploy this daemonset to the cluster (the default namespace is used); a rough kubectl flow is sketched after this list
- After all pods are running 1/1 for a bit, you can delete it
- Images can be pulled again :)
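A rough kubectl flow for the above, assuming the manifest from the gist has been saved locally as workaround-daemonset.yaml (the file name is a placeholder):

$ kubectl apply -f workaround-daemonset.yaml
# wait until every pod of the daemonset is Running 1/1
$ kubectl get pods -o wide
# once all nodes have been touched, remove it again
$ kubectl delete -f workaround-daemonset.yaml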

HTH

Revision history for this message
Robert Bopko (zer69) wrote (last edit ):

For people having this issue on AKS clusters with custom DNS...

We have done this on all affected nodepools:

$ VMSS=XXX-vmss
$ nodeResourceGroup=XXX-worker
$ az vmss list-instances -g $nodeResourceGroup -n $VMSS --query "[].id" --output tsv | az vmss run-command invoke --scripts "systemd-resolve --set-dns=your_dns --set-dns=your_dns --set-domain=reddog.microsoft.com --interface=eth0" --command-id RunShellScript --ids @-

Revision history for this message
JG (jgentworth) wrote :

We are testing this in our AKS clusters now, but we were able to manually scale up a node pool which brought up new "working" nodes. Then manually scaled the pool back down to remove the "non-working" nodes. This left only new nodes up and the services are functioning properly now.

Revision history for this message
Richard Prammer (richardprammer) wrote :

Could this be a related issue, where deployment to AKS fails due to a connection refused when pulling images from Azure Container Registry (ImagePullBackOff)? This problem started this morning out of the blue.
Credentials for Azure Container Registry are OK, and about every 20 image pulls one succeeds and the container starts.

Revision history for this message
Stefan Zwanenburg (zwaantje) wrote (last edit ):

> Could this be a related issue, when deployment to aks fails, due to a connection refused when pulling images from azure container registry(ImagePullBackOff).

If you look closer at the message accompanying the ImagePullBackOff, you should see something like:
  dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:36288->[::1]:53: read: connection refused

Port 53 is the port a DNS server usually listens on.

If this is what you're seeing, then yes: your problems are caused by the issue described in here.

Revision history for this message
Mark Lopez (silvenga) wrote :

Yes @richardprammer, it appears ImagePullBackOff is one of the symptoms of this issue.

Revision history for this message
William Bergmann Børresen (williambb) wrote :

To temporarily mitigate the ImagePullBackOff I scaled up a new functional node (DNS-wise) and used this command to reconcile the AKS cluster:
az resource update --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME> --namespace Microsoft.ContainerService --resource-type ManagedClusters

This recovered CoreDNS in the kube-system namespace, which fixed the ImagePullBackOff

Revision history for this message
Liam Macgillavry (cjdmax) wrote :

az cli from cmd.exe, something like this for AKS nodes experiencing the issue: az vmss list-instances -g <resourcegroup> -n vmss --query "[].id" --output tsv | az vmss run-command invoke --scripts "echo FallbackDNS=168.63.129.16 >> /etc/systemd/resolved.conf; systemctl restart systemd-resolved.service" --command-id RunShellScript --ids @-

Revision history for this message
Anton Tykhyi (atykhyy) wrote :

Is it safe to downgrade from systemd 237-3ubuntu10.54 to the previous 237-3ubuntu10.50?

tags: added: regression-update
Revision history for this message
James Adler (jamesadler) wrote :

@atykhyy thank you that worked for VMSS!

I also had some VMs without scale sets, fixed those with:

az vm availability-set list -g <resourcegroup> --query "[].virtualMachines[].id" --output tsv | az vm run-command invoke --scripts "echo FallbackDNS=168.63.129.16 >> /etc/systemd/resolved.conf; systemctl restart systemd-resolved.service" --command-id RunShellScript --ids @-

Revision history for this message
Sebastien Tardif (sebastientardifverituity) wrote :

Microsoft Support provided a fix for AKS, which I also tested successfully:

kubectl get no -o json | jq -r '.items[].spec.providerID' | cut -c 9- | az vmss run-command invoke --ids @- \
  --command-id RunShellScript \
  --scripts 'grep nameserver /etc/resolv.conf || { dhclient -x; dhclient -i eth0; sleep 10; pkill dhclient; grep nameserver /etc/resolv.conf; }'

Revision history for this message
Adrian Joian (ajoian-2) wrote :

I've added a few alternative ways to fix the problem, mainly using the az CLI for VMSS, Ansible, or running a daemonset, in this gist: https://gist.github.com/naioja/eb8bac307a711e704b7923400b10bc14

Revision history for this message
bob sacamano (bobsacamano) wrote :
Revision history for this message
ForEachToil (foreachtoil) wrote :

You can find here a simple Python script to run a command on the VMSS instances for all subscriptions [or filtered ones]: https://github.com/foreachtoil/execute-command-on-all-vmss
It still lacks threading, so this might take a little while.

Changed in systemd (Ubuntu Bionic):
status: New → In Progress
Changed in systemd (Ubuntu):
status: Confirmed → Fix Released
Changed in systemd (Ubuntu Bionic):
importance: Undecided → Critical
Revision history for this message
AMAN PURWAR (aman1159) wrote :

Manually scaling the node pool or rebooting the nodes solves this issue.
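For the record, scaling a pool up and back down can be scripted with the az CLI, roughly as follows (all names and counts are placeholders):

$ az aks nodepool scale -g <resourcegroup> --cluster-name <cluster> -n <nodepool> --node-count <increased-count>
# once workloads have moved to the new nodes (cordon/drain the old ones if needed), scale back down
$ az aks nodepool scale -g <resourcegroup> --cluster-name <cluster> -n <nodepool> --node-count <original-count>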

Changed in systemd (Ubuntu Bionic):
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: bionic sts
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for systemd on Bionic which fixes this bug.

description: updated
summary: - Update to systemd 237-3ubuntu10.54 broke dns
+ systemd-udevd: Run net_setup_link on 'change' uevents to prevent DNS
+ outages on Azure
Revision history for this message
Westerman (corwesterman) wrote :

Is there a workaround for Azure Container Apps at this point?

Revision history for this message
Severity1 (johnreilly-pospos) wrote :

@sebastientardifverituity the Microsoft support fix you mentioned worked for me.

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

@mruffell thank you for the debdiff! With my limited systemd codebase knowledge, this change feels fine. But I agree with the regression potential section of the SRU description - we should make sure that the update is well tested before it goes out, as it can potentially change behavior.

Revision history for this message
Steffen Vinther Sørensen (arihtmtrx) wrote :

Sebastien Tardif (sebastientardifverituity) the fix you mentioned works for me, thanks

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hello everyone,

I know there are quite a few people watching this bug, so I will provide a status update.

The test package has been looking good throughout our internal testing, and we have proceeded to build the next systemd update, version 237-3ubuntu10.55, which is currently in the ubuntu-security-proposed PPA for bionic.

If you would like to help test, that would be greatly appreciated. Please use a fresh VM on Azure, and please don't put the package into production just yet.

Instructions to install (On a Bionic system):
1) sudo add-apt-repository ppa:ubuntu-security-proposed/ppa
2) sudo apt update
3) sudo apt install libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
4) sudo apt-cache policy systemd | grep Installed
Installed: 237-3ubuntu10.55
5) sudo rm /etc/apt/sources.list.d/ubuntu-security-proposed-ubuntu-ppa-bionic.list
6) sudo apt update

From there you can run the reproducer:

$ sudo udevadm trigger && sudo systemctl restart systemd-networkd
$ ping google.com
PING google.com (172.253.122.138) 56(84) bytes of data.
64 bytes from bh-in-f138.1e100.net (172.253.122.138): icmp_seq=1 ttl=103 time=1.67 ms

If you do test, please comment here on how it went. Again, please don't put the package into production until it has had a little more testing, and we will get this released to the world as quickly and safely as we can.

Thanks,
Matthew

Revision history for this message
Milan Barton (miba1248) wrote :

Hi Matthew,
on our production Ubuntu VM in Azure we have a problem pinging google.com.
The version of prod Ubuntu is:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

I have installed new test Ubuntu VM now:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

ping google.com was working fine.

Then I applied your steps above, and ping google.com is still working fine.

Milan

Revision history for this message
maniak (maruniakl) wrote :

I confirm that Sebastien's approach also worked for my AKS instance.
Thank you, and to everyone involved: I owe you a pint :)

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119/comments/19

Revision history for this message
Ray Veldkamp (rayveldkamp) wrote (last edit ):

@mruffell, spinning up a clean Azure 18.04 Bionic VM and following your steps + reproducer, I can confirm DNS and network connectivity work fine after installing systemd from the security proposed ppa:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
Codename: bionic

$ sudo apt-cache policy systemd | grep Installed
  Installed: 237-3ubuntu10.55
$ sudo udevadm trigger && sudo systemctl restart systemd-networkd
$ ping google.com
PING google.com (216.58.214.14) 56(84) bytes of data.
64 bytes from lhr26s05-in-f14.1e100.net (216.58.214.14): icmp_seq=1 ttl=112 time=2.46 ms
64 bytes from lhr26s05-in-f14.1e100.net (216.58.214.14): icmp_seq=2 ttl=112 time=2.87 ms
64 bytes from lhr26s05-in-f14.1e100.net (216.58.214.14): icmp_seq=3 ttl=112 time=2.30 ms

Revision history for this message
Andres Hojman (ahojman) wrote (last edit ):

Can confirm we got rid of this issue on our Azure AKS setup by updating our NodePool's OS image to
AKSUbuntu-1804gen2containerd-2022.08.10 (using k8s version 1.22.11).

Changed in systemd (Ubuntu Bionic):
status: In Progress → Fix Released
Revision history for this message
Luciano Santos da Silva (lucianosilva7374) wrote :

I confirm that Mark Gerrits' approach also worked for my AKS instance.
Thank you very much.

Revision history for this message
Sander Aerts (vonkenketser) wrote :

I just created a new worker nodepool this morning and rescheduled all pods to the new workers. Solved it for us.

Changed in systemd (Ubuntu Bionic):
status: Fix Released → Fix Committed
Revision history for this message
Matthew Ruffell (mruffell) wrote :

The failure mode still exists if "udevadm trigger" has been issued before the package upgrade to systemd 237-3ubuntu10.55.

That is, if unattended-upgrades or the user has installed open-vm-tools (which issues "udevadm trigger") and the machine has not been rebooted since, the network connection will be lost on upgrade to 237-3ubuntu10.55.

We need to implement a way to add ID_NET_DRIVER back to the device before the systemd upgrade takes place, otherwise an outage will occur.

Release admins - DO NOT RELEASE systemd 237-3ubuntu10.55 yet.

Tagging block-proposed.

$ ping google.com
PING google.com (142.251.45.110) 56(84) bytes of data.
64 bytes from iad23s04-in-f14.1e100.net (142.251.45.110): icmp_seq=1 ttl=56 time=1.51 ms
64 bytes from iad23s04-in-f14.1e100.net (142.251.45.110): icmp_seq=2 ttl=56 time=1.35 ms
64 bytes from iad23s04-in-f14.1e100.net (142.251.45.110): icmp_seq=3 ttl=56 time=1.17 ms
^C
--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 1.172/1.349/1.516/0.140 ms
azureuser@mruffell-test:~$ sudo apt-cache policy systemd | grep Installed
  Installed: 237-3ubuntu10.53
azureuser@mruffell-test:~$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
E: ID_NET_DRIVER=hv_netvsc
azureuser@mruffell-test:~$ sudo udevadm trigger
azureuser@mruffell-test:~$ ping google.com
PING google.com (142.251.45.110) 56(84) bytes of data.
64 bytes from iad23s04-in-f14.1e100.net (142.251.45.110): icmp_seq=1 ttl=56 time=2.15 ms
64 bytes from iad23s04-in-f14.1e100.net (142.251.45.110): icmp_seq=2 ttl=56 time=1.21 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.212/1.682/2.152/0.470 ms
azureuser@mruffell-test:~$ udevadm info /sys/class/net/eth0 | grep ID_NET_DRIVER
azureuser@mruffell-test:~$ sudo apt install libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  linux-headers-4.15.0-191
Use 'sudo apt autoremove' to remove it.
Suggested packages:
  systemd-container
The following packages will be upgraded:
  libnss-systemd libpam-systemd libsystemd0 libudev1 systemd systemd-sysv udev
7 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.
Need to get 4497 kB of archives.
After this operation, 8192 B of additional disk space will be used.
Get:1 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 libsystemd0 amd64 237-3ubuntu10.55 [205 kB]
Get:2 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 libnss-systemd amd64 237-3ubuntu10.55 [105 kB]
Get:3 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 libpam-systemd amd64 237-3ubuntu10.55 [107 kB]
Get:4 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 systemd amd64 237-3ubuntu10.55 [2915 kB]
Get:5 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 udev amd64 237-3ubuntu10.55 [1099 kB]
Get:6 http://ppa.launchpad.net/ubuntu-security-proposed/ppa/ubuntu bionic/main amd64 libudev1 am...


tags: added: block-proposed
Revision history for this message
Norberto Bensa (nbensa) wrote :

We got bitten by this, but we added a cron job to our 18.04 instances:

* * * * * host google.com || dhclient

works-for-us :-)

HTH,
Norberto

Revision history for this message
Ihor Indyk (roootik) wrote :

Thanks to Microsoft for fixing the issue on all AKS clusters.
But for the record, we fixed it by just cordoning and then draining all cordoned nodes.
This effectively rotates the node pools gracefully.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is the second patch required to fully fix this bug. It adds a check on preinstall to see whether ID_NET_DRIVER is present on the network interface and, if it is missing, calls udevadm trigger -c add on the interface to re-add it.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is an improvement on the previous patch revision. Output is now forwarded to logger, we use shell expansion to enumerate network devices, we omit loopback, and we added a udevadm settle to wait for any thunderstorms to resolve before we continue installing the new udev package.
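The logic amounts to something like the following (a sketch of the approach described above, not the literal contents of debian/udev.preinst):

check_ID_NET_DRIVER() {
    for iface in /sys/class/net/*; do
        name=$(basename "$iface")
        # only real network interfaces matter; skip loopback
        [ "$name" = "lo" ] && continue
        if ! udevadm info "$iface" | grep -q ID_NET_DRIVER; then
            # re-emit an 'add' uevent so net_setup_link re-assigns the property
            udevadm trigger -c add -y "$name" 2>&1 | logger -t udev.preinst
        fi
    done
    # wait for the triggered events to be processed before the upgrade continues
    udevadm settle || true
}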

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 237-3ubuntu10.56

---------------
systemd (237-3ubuntu10.56) bionic-security; urgency=medium

  * debian/udev.preinst:
    Add check_ID_NET_DRIVER() to ensure that on upgrade or install
    from an earlier version ID_NET_DRIVER is present on network
    interfaces. (LP: #1988119)

 -- Matthew Ruffell <email address hidden> Tue, 06 Sep 2022 15:18:05 +1200

Changed in systemd (Ubuntu Bionic):
status: Fix Committed → Fix Released