KVM Guest - DHCP lease lost (Ubuntu 18.04)

Bug #1817998 reported by GOVINDA TATTI
This bug affects 1 person
Affects: systemd (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

On an Nvidia DGX2 system, we configured a Linux bridge (br0) on top of the host's physical NIC, with a static IP address (see the netplan file below). We are using 18.04.2-based BaseOS and guest images.

- All KVM guests are launched with a virtual network interface attached to br0. All VMs get a DHCP-assigned IP address, and the network interface works fine for a few hours (possibly up to 24 hours).
- After that, the VMs lose their IP address, and this message appears in the VM's syslog:
"Feb 26 17:16:41 test-1g0 systemd-networkd[3479]: enp6s0: DHCP lease lost".
- At this point, we tried to create new VMs using br0, and none of them got an IP address.
- We then checked the KVM host and the status of the bridge but didn't see any errors. We tried to unconfigure br0 by removing the bridge configuration from the host netplan and running "sudo netplan apply", but br0 was still there. The bridge seems to be in a bad state, and its driver cannot be unloaded.

Guest
lab@dgx-server-vm:~$ ssh nvidia@192.168.123.138
The authenticity of host '192.168.123.138 (192.168.123.138)' can't be established.
ECDSA key fingerprint is SHA256:k8XpnGH7yle76z46CX16pflYVeYcKoG6kWCymIkv0kk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.123.138' (ECDSA) to the list of known hosts.
nvidia@192.168.123.138's password:
 _ _ _ _ _ _ _ _ ___
| \ | |_ _(_) __| (_) __ _ | |_ ___ ___| |_ / | __ _ / _ \
| \| \ \ / / |/ _` | |/ _` | | __/ _ \/ __| __|____| |/ _` | | | |
| |\ |\ V /| | (_| | | (_| | | || __/\__ \ ||_____| | (_| | |_| |
|_| \_| \_/ |_|\__,_|_|\__,_| \__\___||___/\__| |_|\__, |\___/
                                                       |___/

Welcome to Ubuntu 18.04.2 LTS (4.15.0-45-generic)

Welcome to NVIDIA DGX KVM VM Server Version 4.0.5 (GNU/Linux 4.15.0-45-generic x86_64)

 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage
System information as of: Wed Feb 27 12:20:21 PST 2019

System load: 0.00 IP Address:
Memory usage: 0.0% (59.36G avail) System uptime: 21:04 hours
Usage on /: 8% (44G free) Swap usage: 0.0%
Local Users: 1 Processes: 158

  System information as of Wed Feb 27 12:20:22 PST 2019

  System load: 0.0 Processes: 155
  Usage of /: 6.7% of 48.96GB Users logged in: 1
  Memory usage: 0% IP address for enp1s0: 192.168.123.138
  Swap usage: 0% IP address for docker0: 172.17.0.1

 * Canonical Livepatch is available for installation.
   - Reduce system reboots and improve kernel security. Activate at:
     https://ubuntu.com/livepatch

15 packages can be updated.
9 updates are security updates.

Last login: Wed Feb 27 12:05:09 2019
nvidia@test-1g0:~$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        ether 02:42:5c:b9:6f:94 txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.123.138 netmask 255.255.255.0 broadcast 192.168.123.255
        inet6 fe80::5054:ff:feb9:b8a1 prefixlen 64 scopeid 0x20<link>
        ether 52:54:00:b9:b8:a1 txqueuelen 1000 (Ethernet)
        RX packets 38879 bytes 2449778 (2.4 MB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 977 bytes 132770 (132.7 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp6s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::5055:ff:fe78:faa9 prefixlen 64 scopeid 0x20<link>
        ether 52:55:00:78:fa:a9 txqueuelen 1000 (Ethernet)
        RX packets 93842 bytes 7637062 (7.6 MB)
        RX errors 0 dropped 27 overruns 0 frame 0
        TX packets 1874 bytes 442869 (442.8 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 562 bytes 52271 (52.2 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 562 bytes 52271 (52.2 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

nvidia@test-1g0:~$ uptime
 12:20:35 up 21:04, 2 users, load average: 0.00, 0.00, 0.00
nvidia@test-1g0:~$ date
Wed Feb 27 12:20:44 PST 2019
nvidia@test-1g0:~$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.123.0 0.0.0.0 255.255.255.0 U 0 0 0 enp1s0
nvidia@test-1g0:~$ dmesg | grep -i DHCP
nvidia@test-1g0:~$ cat /var/log/syslog | grep -i dhcp
Feb 26 15:15:21 test-1g0 systemd-networkd[569]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1
Feb 26 15:16:20 test-1g0 systemd-networkd[538]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1
Feb 26 15:16:20 test-1g0 systemd-networkd[538]: enp6s0: DHCPv4 address 172.18.232.32/25 via 172.18.232.1
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp6s0: DHCPv4 address 172.18.232.32/25 via 172.18.232.1
Feb 26 17:16:41 test-1g0 systemd-networkd[3479]: enp6s0: DHCP lease lost

nvidia@test-1g0:~$ sudo networkctl status enp6s0
[sudo] password for nvidia:
● 3: enp6s0
       Link File: /lib/systemd/network/99-default.link
    Network File: /run/systemd/network/10-netplan-virtionetworks.network
            Type: ether
           State: degraded (configured)
            Path: pci-0000:06:00.0
          Driver: virtio_net
          Vendor: Red Hat, Inc.
           Model: Virtio network device
      HW Address: 52:55:00:78:fa:a9
         Address: fe80::5055:ff:fe78:faa9

nvidia@test-1g0:~$ systemctl status systemd-networkd.service
● systemd-networkd.service - Network Service
   Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled-runtime; vendor preset: enabled)
   Active: active (running) since Tue 2019-02-26 15:16:42 PST; 21h ago
     Docs: man:systemd-networkd.service(8)
 Main PID: 3479 (systemd-network)
   Status: "Processing requests..."
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/systemd-networkd.service
           └─3479 /lib/systemd/systemd-networkd

Feb 26 15:16:42 test-1g0 systemd[1]: Started Network Service.
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: lo: Link is not managed by us
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp1s0: Link is not managed by us
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: docker0: Link is not managed by us
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: lo: Link is not managed by us
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: docker0: Link is not managed by us
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp6s0: DHCPv4 address 172.18.232.32/25 via 172.18.232.1
Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp6s0: Configured
Feb 26 17:16:41 test-1g0 systemd-networkd[3479]: enp6s0: DHCP lease lost

GOVINDA TATTI (gtatti) wrote :
Christian Ehrhardt  (paelzer) wrote :

Would you have the full journal from the time you started the guest (we should see it getting a lease from dnsmasq) until the guest complains about it being lost?

Please attach those from Guest AND Host as I'd like to check if dnsmasq in the host and/or systemd-networkd in the guest had any.

Best would be something like this:
- attach full journal of host and guest covering all the time
- provide a timestamp you started the guest and the guests name and MAC
- provide the timestamp at which the guest lost its lease
- start another guest (as you say they fail)
- provide a timestamp of that guest starting

That would help a lot to parse your logs more efficiently.
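
As a rough aid for collecting the timestamps listed above, the following is a minimal sketch that pulls the DHCP-related systemd-networkd lines out of a classic syslog file and measures the gap between the last address assignment and the "lease lost" message for one interface. The function names and the `year` parameter are hypothetical (syslog lines carry no year), and the sample lines are copied from the guest syslog in the bug description.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Feb 26 17:16:41 test-1g0 systemd-networkd[3479]: enp6s0: DHCP lease lost
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"systemd-networkd\[(?P<pid>\d+)\]:\s+(?P<iface>\S+):\s+(?P<msg>DHCP.*)$"
)

def dhcp_events(lines, year=2019):
    """Yield (timestamp, pid, interface, message) for each DHCP-related line.

    The year is an assumption supplied by the caller, since syslog omits it.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        yield ts, int(m["pid"]), m["iface"], m["msg"]

def time_to_lease_loss(lines, iface, year=2019):
    """Return the time between the last DHCPv4 address log for `iface`
    and the first 'DHCP lease lost' that follows it, or None if never lost."""
    last_addr = None
    for ts, _pid, name, msg in dhcp_events(lines, year):
        if name != iface:
            continue
        if msg.startswith("DHCPv4 address"):
            last_addr = ts
        elif msg == "DHCP lease lost" and last_addr is not None:
            return ts - last_addr
    return None

# Sample lines taken from the guest syslog quoted in the bug description.
SAMPLE = [
    "Feb 26 15:15:21 test-1g0 systemd-networkd[569]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1",
    "Feb 26 15:16:20 test-1g0 systemd-networkd[538]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1",
    "Feb 26 15:16:20 test-1g0 systemd-networkd[538]: enp6s0: DHCPv4 address 172.18.232.32/25 via 172.18.232.1",
    "Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp1s0: DHCPv4 address 192.168.123.138/24 via 192.168.123.1",
    "Feb 26 15:16:42 test-1g0 systemd-networkd[3479]: enp6s0: DHCPv4 address 172.18.232.32/25 via 172.18.232.1",
    "Feb 26 17:16:41 test-1g0 systemd-networkd[3479]: enp6s0: DHCP lease lost",
]

if __name__ == "__main__":
    print(time_to_lease_loss(SAMPLE, "enp6s0"))  # prints 1:59:59
```

To run it against a real log, pass the lines of the file instead of SAMPLE, for example `time_to_lease_loss(open("/var/log/syslog"), "enp6s0")`. The script only reports timing; it does not say why the lease was lost.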

Christian Ehrhardt  (paelzer) wrote :

Ah BTW - to keep things clean please just start one guest until the issue occurs and then a second as discussed.

Do not start/stop other guests at the time
Do not have other guests up before you start
Do not restart services for all of the time you track this

That also should help to keep the logs clean.

Christian Ehrhardt  (paelzer) wrote :

Your configuration looks as if you bridged the guests to an externally connected bridge (br0).
I assume your guests get DHCP from something else on 172.18.232.11/25 and not from the host's dnsmasq, is that right?

If that is true, please provide logs from the actual DHCP server.

Christian Ehrhardt  (paelzer) wrote :

Finally, if your DHCP server is just flaky/unresponsive, then this is the correct behavior.
Please fix it in the DHCP server setup then, OR follow the discussion at bug 1776013, which leads to https://www.freedesktop.org/software/systemd/man/systemd.network.html#CriticalConnection=

But really, if that is it, the setup is broken, and not letting go of the lease is usually not the right way.

Setting to Incomplete while waiting for all the data that was requested.
Once that is provided, we can hopefully also better triage which package/team should work on this.

Changed in systemd (Ubuntu):
status: New → Incomplete
GOVINDA TATTI (gtatti) wrote :

Thanks Christian for your response.

- Yes, the guests are connected to a software bridge (br0) created on top of the host's physical NIC, and they get their IP addresses from an external DHCP server.

- So far, we have seen this issue a couple of times, and the only way to recover is to reboot the host system. This suggests it has something to do with the Linux bridge configuration or driver. If the DHCP server were at fault, the issue should persist across reboots too.

- One more data point: we don't see this issue if we use a MacVTap configuration for the guests. Each of them gets a DHCP IP address, and none of them loses it.

This indicates the issue is specific to the Linux bridge (br0) setup.

GOVINDA TATTI (gtatti) wrote :

Christian,

Let us move this discussion to the DGX2 tracker ticket, since we don't want to share any DGX2/Nvidia-specific details in the generic ticket.

https://bugs.launchpad.net/nvidia-dgx-2/+bug/1818116

Launchpad Janitor (janitor) wrote :

[Expired for systemd (Ubuntu) because there has been no activity for 60 days.]

Changed in systemd (Ubuntu):
status: Incomplete → Expired