KVM with e1000e and WinGuest Host OS on kernel 5.3 (ok with 5.0)

Bug #1849720 reported by Robert Strube
This bug affects 4 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  High        Unassigned
qemu (Ubuntu)   Fix Released  High        Unassigned

Bug Description

After upgrading to Ubuntu 19.10 I noticed that all of my Windows Server VMs would cause a hard crash on the Host OS shortly after starting up the VM. I tracked it down to the Guest OS attempting to use the network connection: if I disable the virtual NIC for the VM, everything runs OK (albeit without a working network connection for the Guest OS). Note that I'm just using the built-in virtual network called "default" that gets installed by default and uses NAT forwarding.

I believe the problem is related to AppArmor, as I noticed some errors present in various log files.

Unfortunately, due to a time critical project I had to roll back to Ubuntu 19.04 and didn't capture any of the log files. I did, however, find another user on Reddit with the exact same problems that I encountered and he agreed to let me post the log files.

Here are what I think are the relevant pieces from the log files:

===================================================================

Oct 22 22:59:23 brian-pc dnsmasq[2178]: exiting on receipt of SIGTERM
Oct 22 22:59:23 brian-pc kernel: [ 67.001284] device virbr0-nic left promiscuous mode
Oct 22 22:59:23 brian-pc kernel: [ 67.001298] virbr0: port 1(virbr0-nic) entered disabled state
Oct 22 22:59:23 brian-pc NetworkManager[3557]: <info> [1571799563.3862] device (virbr0-nic): released from master device virbr0
Oct 22 22:59:23 brian-pc gnome-shell[4401]: Removing a network device that was not added
Oct 22 22:59:23 brian-pc gnome-shell[2463]: Removing a network device that was not added
Oct 22 22:59:23 brian-pc avahi-daemon[1621]: Interface virbr0.IPv4 no longer relevant for mDNS.
Oct 22 22:59:23 brian-pc avahi-daemon[1621]: Leaving mDNS multicast group on interface virbr0.IPv4 with address 192.168.122.1.
Oct 22 22:59:23 brian-pc avahi-daemon[1621]: Withdrawing address record for 192.168.122.1 on virbr0.
Oct 22 22:59:23 brian-pc NetworkManager[3557]: <info> [1571799563.6859] device (virbr0): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Oct 22 22:59:23 brian-pc gnome-shell[2463]: Removing a network device that was not added
Oct 22 22:59:23 brian-pc gnome-shell[4401]: Removing a network device that was not added
Oct 22 22:59:23 brian-pc dbus-daemon[1610]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.192' (uid=0 pid=3557 comm="/usr/sbin/NetworkManager --no-daemon " label="unconfined")
Oct 22 22:59:23 brian-pc systemd[1]: Starting Network Manager Script Dispatcher Service...
Oct 22 22:59:23 brian-pc gnome-shell[2463]: Object NM.ActiveConnection (0x55ccfb376e50), has been already deallocated — impossible to get any property from it. This might be caused by the object having been destroyed from C code using something such as destroy(), dispose(), or remove() vfuncs.
Oct 22 22:59:23 brian-pc gnome-shell[2463]: == Stack trace for context 0x55ccfb8d15f0 ==
Oct 22 22:59:23 brian-pc gnome-shell[2463]: #0 55ccfbc736c0 i resource:///org/gnome/shell/ui/status/network.js:1329 (7f52226be550 @ 56)
Oct 22 22:59:23 brian-pc gnome-shell[2463]: #1 55ccfbc73628 i resource:///org/gnome/shell/ui/status/network.js:1346 (7f52226be5e0 @ 113)
Oct 22 22:59:23 brian-pc gnome-shell[2463]: #2 55ccfbc73588 i resource:///org/gnome/shell/ui/status/network.js:2049 (7f52226c1940 @ 216)
Oct 22 22:59:23 brian-pc gnome-shell[2463]: #3 55ccfbc734f0 i resource:///org/gnome/shell/ui/status/network.js:1853 (7f52226bfee0 @ 134)
Oct 22 22:59:23 brian-pc gnome-shell[2463]: #4 55ccfbc73430 i self-hosted:979 (7f522262dee0 @ 440)
Oct 22 22:59:23 brian-pc gnome-shell[2463]: JS ERROR: TypeError: connection.get_setting_ip4_config is not a function#012_isHotSpotMaster@resource:///org/gnome/shell/ui/status/network.js:1333:25#012getIndicatorIcon@resource:///org/gnome/shell/ui/status/network.js:1346:13#012_updateIcon@resource:///org/gnome/shell/ui/status/network.js:2049:52#012_syncVpnConnections@resource:///org/gnome/shell/ui/status/network.js:1853:9
Oct 22 22:59:23 brian-pc gnome-shell[2463]: JS ERROR: TypeError: connectionSettings is null#012_updateConnection@resource:///org/gnome/shell/ui/status/network.js:1922:9
Oct 22 22:59:23 brian-pc gnome-shell[4401]: JS ERROR: TypeError: connectionSettings is null#012_updateConnection@resource:///org/gnome/shell/ui/status/network.js:1922:9
Oct 22 22:59:23 brian-pc dbus-daemon[1610]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 22 22:59:23 brian-pc systemd[1]: Started Network Manager Script Dispatcher Service.
Oct 22 22:59:30 brian-pc systemd[1]: systemd-localed.service: Succeeded.
Oct 22 22:59:30 brian-pc systemd[1]: systemd-hostnamed.service: Succeeded.
Oct 22 22:59:34 brian-pc systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Oct 22 22:59:34 brian-pc systemd[1]: systemd-timedated.service: Succeeded.
Oct 22 22:59:35 brian-pc kernel: [ 78.876946] kauditd_printk_skb: 38 callbacks suppressed
Oct 22 22:59:35 brian-pc kernel: [ 78.876947] audit: type=1400 audit(1571799575.216:50): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=5558 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 22:59:35 brian-pc kernel: [ 78.919660] audit: type=1400 audit(1571799575.260:51): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=5560 comm="apparmor_parser"
Oct 22 22:59:35 brian-pc systemd[1]: Started Virtual machine log manager.
Oct 22 22:59:43 brian-pc geoclue[2717]: Service not used for 60 seconds. Shutting down..
Oct 22 22:59:43 brian-pc systemd[1]: geoclue.service: Main process exited, code=killed, status=15/TERM
Oct 22 22:59:43 brian-pc systemd[1]: geoclue.service: Succeeded.
Oct 22 22:59:48 brian-pc tracker-store[4662]: OK
Oct 22 22:59:48 brian-pc systemd[4096]: tracker-store.service: Succeeded.
Oct 22 22:59:54 brian-pc wpa_supplicant[1611]: wlp0s20f3: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-60 noise=9999 txrate=130000
Oct 22 22:59:59 brian-pc xdg-desktop-por[4334]: Failed to get application states: GDBus.Error:org.freedesktop.portal.Error.Failed: Could not get window list: GDBus.Error:org.freedesktop.DBus.Error.AccessDenied: App introspection not allowed
Oct 22 23:00:01 brian-pc kernel: [ 104.801913] audit: type=1400 audit(1571799601.149:52): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=5586 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 23:00:01 brian-pc io.elementary.a[5579]: ComponentValidator.vala:38: Could not get the contents of blacklist file: Failed to open file “/etc/io.elementary.appcenter/appcenter.blacklist”: No such file or directory
Oct 22 23:00:01 brian-pc kernel: [ 104.845695] audit: type=1400 audit(1571799601.193:53): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=5588 comm="apparmor_parser"
Oct 22 23:00:02 brian-pc PackageKit: get-packages transaction /29684_bebacdcb from uid 1000 finished with success after 618ms
Oct 22 23:00:03 brian-pc kernel: [ 107.458511] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:70:f1:96:18:39:a2:08:00 SRC=72.21.91.29 DST=192.168.1.18 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=53668 DF PROTO=TCP SPT=80 DPT=49679 WINDOW=432 RES=0x00 ACK FIN URGP=0
Oct 22 23:00:07 brian-pc kernel: [ 110.735868] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:70:f1:96:18:39:a2:08:00 SRC=72.21.91.29 DST=192.168.1.18 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=53669 DF PROTO=TCP SPT=80 DPT=49679 WINDOW=432 RES=0x00 ACK FIN URGP=0
Oct 22 23:00:09 brian-pc PackageKit: refresh-cache transaction /29685_cbcddcda from uid 1000 finished with success after 7637ms
Oct 22 23:00:09 brian-pc kernel: [ 113.602632] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:70:f1:96:18:39:a2:08:00 SRC=72.21.81.240 DST=192.168.1.18 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=59921 DF PROTO=TCP SPT=80 DPT=49678 WINDOW=432 RES=0x00 ACK FIN URGP=0
Oct 22 23:00:11 brian-pc PackageKit: get-updates transaction /29686_cbeacbbd from uid 1000 finished with success after 1215ms
Oct 22 23:00:13 brian-pc kernel: [ 117.494453] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:70:f1:96:18:39:a2:08:00 SRC=72.21.91.29 DST=192.168.1.18 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=53670 DF PROTO=TCP SPT=80 DPT=49679 WINDOW=432 RES=0x00 ACK FIN URGP=0
Oct 22 23:00:16 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Attempting to re-register :1.123/org/ayatana/NotificationItem/virt_manager; resetting instead
Oct 22 23:00:16 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Item :1.123/org/ayatana/NotificationItem/virt_manager is already registered
Oct 22 23:00:19 brian-pc kernel: [ 123.228740] audit: type=1400 audit(1571799619.579:54): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=6486 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 23:00:19 brian-pc kernel: [ 123.270565] audit: type=1400 audit(1571799619.619:55): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=6488 comm="apparmor_parser"
Oct 22 23:00:19 brian-pc kernel: [ 123.294563] audit: type=1400 audit(1571799619.643:56): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=6489 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 23:00:19 brian-pc kernel: [ 123.336923] audit: type=1400 audit(1571799619.687:57): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=6491 comm="apparmor_parser"
Oct 22 23:00:19 brian-pc kernel: [ 123.361043] audit: type=1400 audit(1571799619.711:58): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=6492 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 23:00:19 brian-pc kernel: [ 123.402401] audit: type=1400 audit(1571799619.751:59): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=6494 comm="apparmor_parser"
Oct 22 23:00:19 brian-pc kernel: [ 123.426276] audit: type=1400 audit(1571799619.775:60): apparmor="DENIED" operation="open" profile="virt-aa-helper" name="/home/brian/.seconddrive/windows10/WindowsVM.img" pid=6495 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=64055
Oct 22 23:00:19 brian-pc kernel: [ 123.468397] audit: type=1400 audit(1571799619.819:61): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-556aee82-2888-4958-a5fb-a46cb463aaa9" pid=6497 comm="apparmor_parser"
Oct 22 23:00:19 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Attempting to re-register :1.123/org/ayatana/NotificationItem/virt_manager; resetting instead
Oct 22 23:00:19 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Item :1.123/org/ayatana/NotificationItem/virt_manager is already registered
Oct 22 23:00:19 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Attempting to re-register :1.123/org/ayatana/NotificationItem/virt_manager; resetting instead
Oct 22 23:00:19 brian-pc gnome-shell[4401]: [AppIndicatorSupport-WARN] Item :1.123/org/ayatana/NotificationItem/virt_manager is already registered
Oct 22 23:00:23 brian-pc kernel: [ 127.427774] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:b8:27:eb:75:38:47:08:00 SRC=192.168.1.10 DST=192.168.1.18 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=15892 DF PROTO=TCP SPT=445 DPT=38086 WINDOW=1332 RES=0x00 ACK URGP=0
Oct 22 23:00:27 brian-pc kernel: [ 130.911806] [UFW BLOCK] IN=wlp0s20f3 OUT= MAC=f4:d1:08:65:17:0b:70:f1:96:18:39:a2:08:00 SRC=72.21.91.29 DST=192.168.1.18 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=53671 DF PROTO=TCP SPT=80 DPT=49679 WINDOW=432 RES=0x00 ACK FIN URGP=0
Oct 22 23:00:44 brian-pc systemd[2021]: Reached target GNOME Session is stable (running for >2 minutes).
Oct 22 23:00:59 brian-pc xdg-desktop-por[4334]: Failed to get application states: GDBus.Error:org.freedesktop.portal.Error.Failed: Could not get window list: GDBus.Error:org.freedesktop.DBus.Error.AccessDenied: App introspection not allowed

===================================================================

I've also attached the entire log file for reference.
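
To separate the AppArmor denials from the rest of the noise in an excerpt like the one above, a small filter along these lines may help. This is only a sketch: the function name `apparmor_denials` is made up for illustration, and it assumes the audit lines have the same field layout as the ones quoted in this report.

```shell
#!/bin/sh
# Print "profile -> path" for every AppArmor denial read from stdin,
# de-duplicated. Assumes the audit line layout shown in the excerpt above
# (profile="..." followed by name="...").
apparmor_denials() {
    grep 'apparmor="DENIED"' \
        | sed -n 's/.*profile="\([^"]*\)".* name="\([^"]*\)".*/\1 -> \2/p' \
        | sort -u
}

# Example (the log path is an assumption, adjust to your system):
#   apparmor_denials < /var/log/syslog
```

Run against the excerpt above, this would collapse the repeated denials into a single line for the virt-aa-helper profile and the WindowsVM.img path.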

Robert Strube (robstrube) wrote :
Robert Strube (robstrube) wrote :

Accidentally filed bug under qemu-kvm, which is no longer a package being maintained. The bug affects Ubuntu 19.10 (Eoan).

no longer affects: qemu-kvm (Ubuntu)
Brian Mays (briguyjm) wrote :

I have the same issue with this bug. Can't run a Windows guest on libvirt + qemu-kvm without my computer crashing. The only way to run a Windows guest is to disable networking.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

Hi Brian and Robert,
interesting yet disturbing crash that is - thank you for the report.

The apparmor messages you see (at least those in the log provided) are just from being unable to access an extra disk in "/home/brian/.seconddrive/windows10/WindowsVM.img".

Which in itself is odd, as we have this rule for virt-aa-helper (that isn't the guest but a helper that runs in advance).

    # for backingstore -- allow access to non-hidden files in @{HOME} as well
    @{HOME}/** r,

But at least for now I'd not focus on apparmor since, as I said, that is disk related and not network related. And I never saw qemu being denied access to it, just the helper tool.

Changed in qemu (Ubuntu):
importance: Undecided → High
Christian Ehrhardt  (paelzer) wrote :

Hmm, the only errors/fails I see in the log are related to gnome/xdg-desktop :-/

So as I understand it, Robert reported the bug, but Brian, who chimed in in comment #3, still has a setup to reproduce it. Is that correct?

Christian Ehrhardt  (paelzer) wrote :

We will need some more logs, so please let me ask:
Does "hard crash on the host" mean the host kernel is crashing and the whole system goes down?
- If yes, is there a way to get the log that contains the kernel crash?
- If no, then I'd assume you mean that the associated qemu process dies (more likely)?
   In that case please provide:
   a) /var/log/libvirt/qemu/<guestname>.log
   b) the crash report of the qemu process, if you get one in /var/crash/, either directly or via apport [1]

To silence apparmor here, you might consider removing the second disk "/home/brian/.seconddrive/windows10/WindowsVM.img" from the guest definition, just to see if it has any effect on the issue.

Given the issue it is unlikely to find something interesting in libvirt's log, but since I need to reach out to you for logs anyway let us be complete. No need to enable extra debugging output, but a snippet of `journalctl -u libvirtd` from when the guest was started until it crashed would be nice as well.

What does your guest definition look like? Since you assume the issue is network related, which network definition do you use? Best would be to attach the full output of `virsh dumpxml <guestname>` here at the bug, as it would help to reproduce the issue.
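
For anyone gathering the items asked for above, they could be collected in one go with something like the following. It is a rough sketch, not an official tool: `collect_guest_logs` and the default output directory are hypothetical names, the qemu log is usually only readable as root, and the journal is dumped in full, so trimming it to the window from guest start to crash is left to the reader.

```shell
#!/bin/sh
# Gather the logs requested above into one directory for attaching to the bug.
# collect_guest_logs and ./guest-logs are illustrative names only.
collect_guest_logs() {
    guest=$1
    outdir=${2:-./guest-logs}
    mkdir -p "$outdir" || return 1

    # Per-guest qemu log written by libvirt (skipped if not readable).
    qemu_log="/var/log/libvirt/qemu/$guest.log"
    if [ -r "$qemu_log" ]; then
        cp "$qemu_log" "$outdir/"
    fi

    # libvirtd journal, if journalctl is available on this host.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -u libvirtd --no-pager > "$outdir/libvirtd-journal.txt" 2>&1 || true
    fi

    # Full guest definition, if virsh is available.
    if command -v virsh >/dev/null 2>&1; then
        virsh dumpxml "$guest" > "$outdir/$guest.xml" 2>&1 || true
    fi
    return 0
}

# Example:
#   sudo sh -c '. ./collect.sh; collect_guest_logs <guestname>'
```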

Finally, how did you set up networking in the guest? Any special drivers for virtio installed?
Or just Windows as-is? If so, which Windows version, btw?

I have no other windows than Win98 around and that doesn't fail :-)
I'm asking around for one. But there would be one more great thing: if you could test whether it triggers on a fresh install from an ISO [2] as well, then everyone, even without a license, could give it a debug try. So if you, knowing you are affected, could check and confirm that, it would be great.

[1]: http://manpages.ubuntu.com/manpages/eoan/man1/apport-bug.1.html
[2]: https://www.microsoft.com/en-gb/software-download/windows10ISO

Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Robert Strube (robstrube) wrote :

Hi Christian,

Thank you for the response. This is indeed a troubling bug - especially since it crashes the entire system. To clarify, I originally posted about the issue in /r/linux on Reddit, but because of a time critical deadline I needed to uninstall 19.10. This is my production machine and I attempted to install 19.10 over the weekend. In my haste I forgot to take a snapshot of the log files!

Brian chimed in that he also was experiencing the same issues, so I asked him if he would be willing to provide his log files so I could open the bug report.

I *can* answer some of your questions.

In regards to the networking, I was just using the standard networking that came pre-configured.

The XML looks something like this (stored at /etc/libvirt/qemu/networks/default.xml)

<network>
  <name>default</name>
  <uuid>354a5337-37fb-47f9-a88b-1d1c505c5652</uuid>
  <forward mode="nat">
    <nat>
      <port start="1024" end="65535"/>
    </nat>
  </forward>
  <bridge name="virbr0" stp="on" delay="0"/>
  <mac address="52:54:00:3f:c0:bf"/>
  <domain name="default"/>
  <ip address="192.168.100.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.100.128" end="192.168.100.254"/>
    </dhcp>
  </ip>
</network>

I don't have the exact values any more, but they were very similar. The key is that the forwarding is NAT, and the virtual network is named "default".

The XML for the actual VM looks something like this (again not exact values, but very similar):

<interface type="network">
  <mac address="52:54:00:c1:03:4c"/>
  <source network="default"/>
  <model type="e1000e"/>
  <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
</interface>

I would also be able to provide you with a stock Windows Server 2019 image if you would like for testing purposes. If you think this would be helpful, send me an email at rstrube --at-- gmail.com and I'll send you a link to download the image.

When I did a quick check in my logs, I did not notice any kernel panics or anything that truly stood out to me in my syslog or the dmesg logs.

Perhaps Brian would be able to assist more since he still has 19.10 installed?

Brian Mays (briguyjm) wrote :

This is the output of journalctl -u libvirtd. My computer was turned off when I got home after work, so these logs are the ones that I got as soon as I turned my machine on and ran my Windows VM:

-- Reboot --
Oct 25 16:49:21 brian-pc systemd[1]: Starting Virtualization daemon...
Oct 25 16:49:22 brian-pc libvirtd[1851]: libvirt version: 5.4.0, package: 0ubuntu5 (Matthias Klose <email address hidden> Thu, 05 Sep 2019 11:00:53 +0000)
Oct 25 16:49:22 brian-pc libvirtd[1851]: hostname: brian-pc
Oct 25 16:49:22 brian-pc libvirtd[1851]: Libvirt doesn't support VirtualBox API version 6000014
Oct 25 16:49:22 brian-pc systemd[1]: Started Virtualization daemon.
Oct 25 16:49:22 brian-pc dnsmasq[2102]: started, version 2.80 cachesize 150
Oct 25 16:49:22 brian-pc dnsmasq[2102]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify dumpfile
Oct 25 16:49:22 brian-pc dnsmasq-dhcp[2102]: DHCP, IP range 192.168.122.2 -- 192.168.122.254, lease time 1h
Oct 25 16:49:22 brian-pc dnsmasq-dhcp[2102]: DHCP, sockets bound exclusively to interface virbr0
Oct 25 16:49:22 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:22 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:22 brian-pc dnsmasq[2102]: read /etc/hosts - 3 addresses
Oct 25 16:49:22 brian-pc dnsmasq[2102]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Oct 25 16:49:22 brian-pc dnsmasq-dhcp[2102]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Oct 25 16:49:22 brian-pc libvirtd[1851]: cannot open directory '/home/brian/.seconddrive/virtualmachines/isos': No such file or directory
Oct 25 16:49:22 brian-pc libvirtd[1851]: internal error: Failed to autostart storage pool 'isos': cannot open directory '/home/brian/.seconddrive/virtualmachines/isos': No such file or directory
Oct 25 16:49:22 brian-pc libvirtd[1851]: cannot open directory '/home/brian/.seconddrive/virtualmachines': No such file or directory
Oct 25 16:49:22 brian-pc libvirtd[1851]: internal error: Failed to autostart storage pool 'virtualmachines': cannot open directory '/home/brian/.seconddrive/virtualmachines': No such file or directory
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:49:25 brian-pc dnsmasq[2102]: reading /etc/resolv.conf
Oct 25 16:49:25 brian-pc dnsmasq[2102]: using nameserver 127.0.0.53#53
Oct 25 16:53:08 brian-pc libvirtd[1851]: operation failed: pool 'defaul...

Christian Ehrhardt  (paelzer) wrote :

I have run a win10 install on 19.10 in virt-manager with the e1000e card that you used as well ... nothing is crashing :-/

And the network surely is used ...
$ virsh domifstat win10 vnet0
vnet0 rx_bytes 9306542
vnet0 rx_packets 51933
vnet0 rx_errs 0
vnet0 rx_drop 0
vnet0 tx_bytes 1591661
vnet0 tx_packets 5049
vnet0 tx_errs 0
vnet0 tx_drop 0

I can't reproduce the bug as-is, we need to find what is different for you.

You said it crashes later "when it accesses network". At what stage is that?
Might my install test not have gotten far enough?

reading the logs you attached now ...

Christian Ehrhardt  (paelzer) wrote :

Ok you both had e1000e as I had in my test.
The guest xml has no very special config that would make me wonder.

You said there is no crash report, and the guest log confirms that it seems to end kind of normally.
  2019-10-23T03:01:14.664402Z qemu-system-x86_64: terminating on signal 15 from pid 1899 (/usr/sbin/libvirtd)

That means libvirt told it to end, which is fine.
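
To check a guest log for this kind of line without reading it by hand, a filter like the one below could be used. It is a sketch that assumes the message has exactly the shape quoted above; `qemu_termination_cause` is a hypothetical helper name.

```shell
#!/bin/sh
# Read a qemu guest log on stdin and print who asked qemu to terminate.
# Matches only lines shaped like the one quoted above:
#   "... terminating on signal N from pid P (SENDER)"
qemu_termination_cause() {
    sed -n 's/.*terminating on signal \([0-9][0-9]*\) from pid \([0-9][0-9]*\) (\(.*\))$/signal=\1 pid=\2 sender=\3/p'
}

# Example:
#   qemu_termination_cause < /var/log/libvirt/qemu/<guestname>.log
```

If the sender turns out to be libvirtd, as here, qemu was shut down on request and did not crash on its own.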

Robert said there also were no kernel panics or anything.

Could you outline these two things please:
- you said "crashes on the host" but we lack crash reports; what exactly would such a crash look like?
- maybe one can get a journal log while the crash is happening, as it might be libvirt that thinks it needs to shut down the guest (at least from what we see so far)?

Robert Strube (robstrube) wrote :

Hi Christian,

I can tell you what was happening for me.

I run all my VMs through virt-manager, so I first open up virt-manager and then start my Windows Server 2019 VM. The VM boots fine and I get to the login screen where I need to send a CTRL+ALT+DELETE to log in. A couple of seconds after reaching the login screen my entire host system hangs. By this I mean I can no longer move the mouse in the Gnome DE, the screen no longer updates (one time I had a browser window open with some animated content), I can't switch to a virtual terminal (e.g. using CTRL+ALT+FX), nothing.

It makes no difference if I quickly hit CTRL+ALT+DELETE and start to type my name in, or if I just start the VM and wait, it always freezes my entire system, everything is unresponsive, and the only way to restart is to do a hard power off / power on.

The reason I determined it was the networking that caused the problem is because after one of these hard power off / power on cycles, the VM asked me if I wanted to boot into safe mode. I chose this option and the same thing happened. The next time I chose safe mode without a network connection and everything worked fine. I could browse files on the VM, start programs, etc.

After that I started messing with the configuration for my VM and specifically removed the network hardware; the VM started fine. It was then that I realized the networking causes the problem.

I also tried with a clean ISO install to eliminate the possibility of the qemu guest agent causing the issues. I can run through the entire Windows Server 2019 installation process, but once I reboot after I've completed the installation, I get the same hard hang where the host system is entirely unresponsive.

One thing I thought of is providing you access to the Windows Server 2019 ISO, this wouldn't violate any license agreements as you can install it without needing to put a key in (trial mode).

Robert Strube (robstrube) wrote :

One thing I thought of, is that both Brian and I have System76 computers, so perhaps there's a hardware issue specifically with our computers? Brian, I have a Darter Pro 2019, what model do you have?

Just to clarify I've tried both Pop_OS 19.10 and Ubuntu 19.10 and the issue is present on both installs. Perhaps it's an issue with the networking hardware on our systems in conjunction with the new 5.3 kernel?

I have a Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12) as my ethernet controller. I had some really strange issues with the ethernet with Ubuntu 19.04, in fact you can see some posts about this hardware here:

https://ubuntu-mate.community/t/19-04-ethernet-wired-connection-refuses-to-connect-when-plugged-in-before-boot/19333/2

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1752772

My issue was that if I had the ethernet cable plugged in at boot, the connection wouldn't work. I would have to start the system, and plug the cable in afterwards. Even then it wouldn't always work correctly.

My point is that perhaps there's something up at a very low level with this particular hardware? I wonder whether, if we blacklist the ethernet kernel module, we could run the VM without issue?

Robert Strube (robstrube) wrote :

Here's the main bug report I was thinking about:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841040

The kernel module is 'r8169'.

Brian perhaps you could do an 'lsmod | grep r8169' to see if that module is loaded for you as well?

If it is, perhaps you could try to blacklist the module by adding 'blacklist r8169' to the end of your '/etc/modprobe.d/blacklist.conf' file and rebooting?

Of course if you're using a wired connection this would cause problems, but if you're like me and are using a wireless connection, perhaps this will solve the issue?

It seems like that ethernet card is notoriously buggy.
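
For reference, modprobe.d expects a `blacklist <module>` line for this. A minimal sketch of adding it without duplicating an existing entry could look like the following; `blacklist_module` is a hypothetical helper, and writing to the real file needs root.

```shell
#!/bin/sh
# Append "blacklist <module>" to a modprobe.d file unless that exact line
# is already present. The second argument defaults to the file named above.
blacklist_module() {
    module=$1
    conf=${2:-/etc/modprobe.d/blacklist.conf}

    # Nothing to do if an identical entry already exists.
    if grep -qx "blacklist $module" "$conf" 2>/dev/null; then
        return 0
    fi
    printf 'blacklist %s\n' "$module" >> "$conf"
}

# Example (as root), followed by a reboot:
#   blacklist_module r8169
```

Note that this only stops the module from being loaded automatically; with a wired connection that depends on r8169 it would take the network down, as mentioned above.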

Brian Mays (briguyjm) wrote :

output of 'lsmod | grep r8169' :
Module                  Size  Used by
r8169                  81920  0

I am on a fully wireless setup (the VM is running on my laptop) so I'll try this and see if it works.

Rafael David Tinoco (rafaeldtinoco) wrote :

Robert, Brian,

I'm trying to reproduce this issue today. I saw you both said you're running this on System76 machines, is that correct? I'm asking because I was wondering if this can be reproduced on any machine or not (thinking of specific ways to get a kernel dump if I can't reproduce it on my side).

Will keep you posted.

Brian Mays (briguyjm) wrote :

Disabling the 'r8169' module has no effect on this issue. The system still hangs when running the VM with networking enabled.

Robert Strube (robstrube) wrote :

My colleague Zach has also run into this exact issue. He's running a Dell XPS laptop and we'll be posting more information here shortly.

Robert Strube (robstrube) wrote :

So it looks like the problem occurs when updating the kernel from 5.0.x to 5.3.x. Zach is running Pop_OS 19.04, but System76 will push Ubuntu kernel updates earlier, so he recently updated and received kernel 5.3, despite still running a 19.04 base. Questionable practice for sure...

I realize Pop_OS *is not* Ubuntu, but since they use the core platform as a foundation, it can provide insight into whatever the problem is.

As soon as he updated to kernel 5.3, he immediately started getting crashes with his Windows VMs. The problem appears to be the same: when the Guest OS begins to use the network connection, the crash occurs.

Christian Ehrhardt  (paelzer) wrote :

This is a somewhat blind stab,
but I've seen another "windows guest issue on 19.10" report at [1].
I don't have any particular reason to expect it would help, but if one of the people here could try kernel_irqchip=on, that would be nice.

[1]: https://askubuntu.com/questions/1185321/ubuntu-19-10-qemu4-0-not-able-to-set-kernel-irqchip-on-in-libvirt/

odror (ozdror) wrote :

Not able to change this feature (<irqchip mode='on'/>)
I get the following error:
 libvirt.libvirtError: unsupported configuration: unexpected feature 'irqchip

Robert Strube (robstrube) wrote :

Looks like another person is having the same problem, I've asked them to post more details here.
https://askubuntu.com/questions/1185341/qemu-system-freeze-unsigned-kernel-5-3-0-1004-kvm/1185536#1185536

Robert Strube (robstrube) wrote :

@Christian, your suspicion is that it's related to this issue: https://bugs.launchpad.net/qemu/+bug/1826422 ?

I have another system that I'll set up 19.10 on and see if I can test out your fix / get more logging information.

Robert Strube (robstrube) wrote :

Sorry for all the messages. I've been doing a bit more research and I'd like some of the folks having problems to try this.

It appears Q35 machines with qemu 4.0+ changed the default ioapic mode to split, which does not work with certain hardware combinations. This is not a new issue with split mode but it is an issue with qemu 4.0+ changing the default ioapic mode.

One suggestion I discovered was to add the following element:

<ioapic driver='kvm'/>

Inside the <features> element in your machine's XML. You can edit this directly in virt-manager, or you can find the machine's XML and edit it manually (look in /etc/libvirt/qemu for the appropriate XML).

It would be great for those people affected to try this simple change out and report back if it solves the issue.
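
When reporting back, it may be worth confirming that the element actually ended up in the guest definition. A small check along these lines could do that; it is only a sketch, `has_kvm_ioapic` is a made-up helper name, and it does a plain text match on the XML instead of parsing it.

```shell
#!/bin/sh
# Succeed if the domain XML on stdin contains <ioapic driver='kvm'/>
# (single or double quotes accepted). Plain text match, not an XML parser.
has_kvm_ioapic() {
    grep -Eq "<ioapic[[:space:]]+driver=['\"]kvm['\"][[:space:]]*/>"
}

# Example:
#   virsh dumpxml <guestname> | has_kvm_ioapic && echo "ioapic driver is kvm"
```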

odror (ozdror) wrote :

<ioapic driver='kvm'/> Did not work for me. The only thing that so far seems to work (after 1 test) is restarting libvirtd daemon before booting the VM.

Rafael David Tinoco (rafaeldtinoco) wrote :

@ozdror, would you mind opening a new bug for your case? It looks like a different issue, as the original issue in this bug was a host hang.

Rafael David Tinoco (rafaeldtinoco) wrote :

@robstrube,

would you mind stepping back a little bit and making a summary of the attempted changes and their effects, and attaching the following information:

- host kernel exact version
- guest kernel exact version
- qemu packages exact version
- windows kernel exact version
- virtual machine xml (virsh dumpxml <machine>)

if the problem was introduced between kernels 5.0.0 and 5.3.0, it is easily investigated through a kernel bisection, even if it is a qemu code change that triggers the issue with the kernel, because we can see the nature of the kernel change that caused it.

i'm trying to organize the approach a bit so we can help you. would you mind providing the items I asked for? hopefully, with those answers, I'll be able to reproduce this on my side as well, using the exact VM definition you have, and then I can bisect the kernel very easily and take the needed actions.

thank you

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

i'm not suggesting anything related to a kernel dump, as you've said it's not a host kernel crash but a machine hang. usually a machine hang is caused by HW misbehavior (which could have been triggered by HW, firmware and/or the OS kernel). in those cases, the only way to get a kernel dump would be to have a HW watchdog device and possibly an SMM able to get the full memory dump, which is NOT the case for a "notebook", as you described at the beginning of the bug.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

As an example, I have tried to reproduce with the following configuration:

(k)rafaeldtinoco@win2019crashhost:~$ uname -a
Linux win2019crashhost 5.0.0-34-generic #36-Ubuntu SMP Wed Oct 30 05:16:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

----

(k)rafaeldtinoco@win2019crashhost:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 19.10
Release: 19.10
Codename: eoan

----

(k)rafaeldtinoco@win2019crashhost:~$ virsh dumpxml win2019crash
<domain type='kvm'>
  <name>win2019crash</name>
  <uuid>68ea22b7-7401-45bb-a7b0-a60956081e42</uuid>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.5'>hvm</type>
    <boot dev='cdrom'/>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'/>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/win2019crash/disk01.ext4.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/kwin2019/win2019.iso'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <disk type='file' device='floppy'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/kwin2019/virtio.vfd'/>
      <target dev='fda' bus='fdc'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
    </controller>
    <controller type='scsi' index='0' model='lsilogic'>
    </controller>
    <controller type='fdc' index='0'/>
    <interface type='bridge'>
      <mac address='52:54:00:3d:1d:9d'/>
      <source bridge='bridge0'/>
      <model type='virtio'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
    </video>
    <memballoon model='virtio'>
    </memballoon>
  </devices>
</domain>

But I saw you're using q35 arch model, s...


Revision history for this message
Clinton (clintonminton) wrote :

I had the same issue:

Host system freeze whenever a Windows guest VM made it to the login screen. The last thing I could see in journalctl -f was a DHCP exchange with the VM before everything froze.

Solution:

Change the NIC device model from e1000e to virtio, apply the change and boot. No freeze; the Windows guest has been running for about 1 hour now.

My Specs:

System76 Galago Pro
PopOS 19.10
5.3.0-20-generic
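
The NIC model change described in this comment can be sketched roughly as follows. This is only an illustration working on an XML string, assuming the interface carries a <model type='e1000e'/> element like the virt-manager defaults; the helper name switch_nic_model and the sample domain are hypothetical, and the Windows guest will still need the virtio network driver installed.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def switch_nic_model(domain_xml: str, old: str = "e1000e",
                     new: str = "virtio") -> Tuple[str, int]:
    """Replace the NIC model on every <interface> that uses `old`.

    Returns the rewritten XML and the number of interfaces changed.
    """
    root = ET.fromstring(domain_xml)
    changed = 0
    for model in root.findall("./devices/interface/model"):
        if model.get("type") == old:
            model.set("type", new)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Illustrative domain using the libvirt "default" NAT network.
SAMPLE_NIC_DOMAIN = """<domain type='kvm'>
  <name>example</name>
  <devices>
    <interface type='network'>
      <source network='default'/>
      <model type='e1000e'/>
    </interface>
  </devices>
</domain>"""

if __name__ == "__main__":
    new_xml, count = switch_nic_model(SAMPLE_NIC_DOMAIN)
    print(count, "interface(s) switched")
    print(new_xml)
```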

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Clinton, can you provide the answers to comment #27 then?

- host kernel exact version (5.3.0-20-generic in your case)
- guest kernel exact version (windows kernel in this case)
- qemu packages exact version (?)
- virtual machine xml (virsh dumpxml <machine>) (?)

Revision history for this message
Clinton (clintonminton) wrote :

apt list --installed | grep qemu

ipxe-qemu-256k-compat-efi-roms/eoan,eoan,now 1.0.0+git-20150424.a25a16d-0ubuntu3 all [installed,automatic]
ipxe-qemu/eoan,eoan,now 1.0.0+git-20190109.133f4c4-0ubuntu2 all [installed,automatic]
qemu-block-extra/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed,automatic]
qemu-kvm/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed]
qemu-system-common/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed,automatic]
qemu-system-data/eoan,eoan,now 1:4.0+dfsg-0ubuntu9 all [installed,automatic]
qemu-system-gui/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed,automatic]
qemu-system-x86/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed,automatic]
qemu-utils/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed,automatic]
qemu/eoan,now 1:4.0+dfsg-0ubuntu9 amd64 [installed]

apt list --installed | grep libvirt

gir1.2-libvirt-glib-1.0/eoan,now 2.0.0-1 amd64 [installed,automatic]
libvirt-clients/eoan,now 5.4.0-0ubuntu5 amd64 [installed]
libvirt-daemon-driver-storage-rbd/eoan,now 5.4.0-0ubuntu5 amd64 [installed,automatic]
libvirt-daemon-system/eoan,now 5.4.0-0ubuntu5 amd64 [installed]
libvirt-daemon/eoan,now 5.4.0-0ubuntu5 amd64 [installed]
libvirt-dev/eoan,now 5.4.0-0ubuntu5 amd64 [installed]
libvirt-glib-1.0-0/eoan,now 2.0.0-1 amd64 [installed,automatic]
libvirt0/eoan,now 5.4.0-0ubuntu5 amd64 [installed,automatic]
python3-libvirt/eoan,now 5.0.0-1 amd64 [installed,automatic]
ruby-fog-libvirt/eoan,eoan,now 0.6.0-1 all [installed,automatic]
ruby-libvirt/eoan,now 0.7.1-1 amd64 [installed,automatic]
vagrant-libvirt/eoan,eoan,now 0.0.45-2 all [installed,automatic]

Windows Guest info:

Windows 10 Version 1903 (OS Build 18362.418)
QEMU guest agent 7.4.5
SPICE Guest Tools 0.141

virsh dumpxml win10

<domain type='kvm' id='1'>
  <name>win10</name>
  <uuid>b7d6ad96-dfde-49bc-a820-772f40bddffe</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-3.1'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>
    <vmport state='off'/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client-IBRS</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='clflushopt'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaves'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='disable' name='skip-l1dfl-vmentry'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='rtm'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' t...


Revision history for this message
Clinton (clintonminton) wrote :

See attached picture of journalctl -f during a crash using the e1000e NIC.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

That is awesome. Thx, I'll get back to you soon. I'll try to get a CascadeLake server in my lab so I can reproduce this on HW close to what you have and are using.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

BTW Clinton, one last question: what's your exact host CPU model? Sorry, I missed that question before.

Revision history for this message
Clinton (clintonminton) wrote :

See attached for host's lscpu output.

Changed in qemu (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Zach Graceffa (zacgrac) wrote :

Hello, a little late to the party, but I have what started as the same VM image as Robert Strube. It's Windows 2019 with the default network driver. My system is Pop OS 19.04, based on Ubuntu 19.04. After an update I got kernel version 5.3.0-20-generic and this issue started happening to me. Pasted below is my 'journalctl -u libvirtd' output from computer startup until the crash.

-- Reboot --
Nov 05 22:26:45 adelle systemd[1]: Starting Virtualization daemon...
Nov 05 22:26:46 adelle libvirtd[1366]: libvirt version: 5.0.0, package: 1ubuntu2.5 (Christian Ehrhardt <email address hidden> Wed, 21 Aug 2019 11:15:43
Nov 05 22:26:46 adelle libvirtd[1366]: hostname: adelle
Nov 05 22:26:46 adelle libvirtd[1366]: Libvirt doesn't support VirtualBox API version 6000014
Nov 05 22:26:46 adelle systemd[1]: Started Virtualization daemon.
Nov 05 22:26:46 adelle dnsmasq[1559]: started, version 2.80 cachesize 150
Nov 05 22:26:46 adelle dnsmasq[1559]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify du
Nov 05 22:26:46 adelle dnsmasq-dhcp[1559]: DHCP, IP range 192.168.122.2 -- 192.168.122.254, lease time 1h
Nov 05 22:26:46 adelle dnsmasq-dhcp[1559]: DHCP, sockets bound exclusively to interface virbr0
Nov 05 22:26:46 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:46 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:46 adelle dnsmasq[1559]: read /etc/hosts - 7 addresses
Nov 05 22:26:46 adelle dnsmasq[1559]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Nov 05 22:26:46 adelle dnsmasq-dhcp[1559]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:51 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:51 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:52 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:52 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:52 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:52 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:52 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:52 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:52 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:52 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:52 adelle dnsmasq[1559]: reading /etc/resolv.conf
Nov 05 22:26:52 adelle dnsmasq[1559]: using nameserver 127.0.0.53#53
Nov 05 22:26:58 adelle libvirtd[1366]: Device 0000:02:00.0 not found: could not acce...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Zach - that is an interesting detail.
This might explain why I didn't see it (had kernel 5.0 together with the userspace of Ubuntu 19.10)
So you can switch between good/bad mode by switching host kernels between 5.3 and others then?

Does this apply to others affected by this bug as well?
You might try to use a few kernels of [1] to try to pinpoint a specific version that made it start to fail.

@Rafael it seems your repro case is close, please give the kernels a try as well as machine type i440fx vs q35 and with/without ioapic.

P.S. the odd thing is that the error seems to only occur with the emulated e1000 network card, which shouldn't depend too much on the kernel, as I'd have expected it to be emulated in userspace.
Anyway, checking the above to see whether there really is a kernel dependency in all of this would be great.

[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Clinton's VM did have a virtio-only network, as did mine, and I also thought the same about the e1000 emulation not being anywhere close to the host kernel. Anyway, I have been trying to reproduce this issue with the attached VM definitions with no success.

To be honest, I'm starting to suspect the CPU flags and microcode for the specific CPUs. I'm wondering if any HW mitigations from those CPUs' new microcode are getting in our way here.

I haven't tested with apparmor enabled, that might be my next step here.

GRUB_CMDLINE_LINUX_DEFAULT=" ..."

In kernels:

- linux-image-5.0.0-32-generic
- linux-image-5.0.0-34-generic
- linux-image-5.3.0-21-generic

With and without host HW/kernel mitigations:

pti=off kpti=off nopcid noibrs noibpb spectre_v2=off nospec_store_bypass_disable mds=off l1tf=off

With qemus:

- 1:3.1+dfsg-2ubuntu3.5
- 1:4.0+dfsg-0ubuntu10

with and without CPU flags:

- <feature policy='require' name='arch-capabilities'/>
- <feature policy='require' name='skip-l1dfl-vmentry'/>

with and without:

- all spice configuration (just like given example)
- with regular vnc basic configuration (no audio, usbs)

with Windows 2019:

- Standard Evaluation
- Build 17763.rs5_release.180914-1434

with machine types:

- i440fx
- q35

with the following CPU:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz
Stepping: 9

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

My current suggestion would be to try changing the following CPU features:

    <feature policy='require' name='clflushopt'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>

Remove them from the virtual machines in question and check whether that mitigates the issue (remove the first half and try; then remove the second half while keeping the first and try; and so on).

That can be achieved with:

virsh edit <machine>

by deleting the lines and re-adding them.
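
The halving procedure above can be sketched as a small search. This is only an illustration under two loud assumptions: that exactly one feature is responsible, and that freezes_with stands in for a manual test (edit the domain so only the given suspect features are required, boot the guest, and note whether the host freezes). Neither name comes from libvirt.

```python
from typing import Callable, List, Optional

# Feature names taken from the list suggested in this comment.
CPU_FEATURES = [
    "clflushopt", "md-clear", "stibp", "ssbd",
    "pdpe1gb", "invtsc", "arch-capabilities", "skip-l1dfl-vmentry",
]


def find_culprit(features: List[str],
                 freezes_with: Callable[[List[str]], bool]) -> Optional[str]:
    """Narrow a list of suspect CPU features down to one by halving.

    freezes_with(kept) is a hypothetical oracle: True if the host still
    freezes when only `kept` of the suspect features remain in the XML.
    """
    suspects = list(features)
    if not suspects:
        return None
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first, second = suspects[:mid], suspects[mid:]
        # Keep only the first half; if the freeze persists, the culprit
        # is in that half, otherwise it must be in the second half.
        suspects = first if freezes_with(first) else second
    # Final confirmation with the single remaining feature.
    return suspects[0] if freezes_with(suspects) else None


if __name__ == "__main__":
    # Simulated oracle for demonstration only, not a real finding.
    print(find_culprit(CPU_FEATURES, lambda kept: "md-clear" in kept))
```

With eight features this takes four boots (three halvings plus the confirmation) instead of eight.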

Another thing to try (with and/or without the CPU features) would be to disable all security mitigations on the host and reproduce the issue.

That can be achieved by changing:

GRUB_CMDLINE_LINUX_DEFAULT="pti=off kpti=off nopcid noibrs noibpb spectre_v2=off nospec_store_bypass_disable mds=off l1tf=off ..."

in /etc/default/grub and running "update-grub".

And a last one would be to disable apparmor and check.

That can also be achieved by changing:

GRUB_CMDLINE_LINUX_DEFAULT="... apparmor=0"

in /etc/default/grub and running "update-grub".
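
As a rough sketch of the two GRUB edits above, the following helper appends kernel parameters to the GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file given as text. The function name add_kernel_params is made up for illustration, it assumes the value is double-quoted on a single line, and it does not touch /etc/default/grub itself or run update-grub.

```python
import re
from typing import List

# Parameters quoted from this comment.
MITIGATIONS_OFF = [
    "pti=off", "kpti=off", "nopcid", "noibrs", "noibpb",
    "spectre_v2=off", "nospec_store_bypass_disable", "mds=off", "l1tf=off",
]

_CMDLINE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"[ \t]*$', re.MULTILINE
)


def add_kernel_params(grub_text: str, params: List[str]) -> str:
    """Append params to GRUB_CMDLINE_LINUX_DEFAULT, skipping duplicates."""

    def repl(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + '"' + " ".join(current) + '"'

    new_text, count = _CMDLINE.subn(repl, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


# Illustrative file contents, not copied from any reporter's system.
SAMPLE_GRUB = (
    'GRUB_DEFAULT=0\n'
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    'GRUB_TIMEOUT=5\n'
)

if __name__ == "__main__":
    print(add_kernel_params(SAMPLE_GRUB, ["apparmor=0"]))
```

Passing MITIGATIONS_OFF instead of ["apparmor=0"] gives the mitigation-free variant; the written file still has to be followed by "update-grub" and a reboot.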

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Meanwhile I'll try to find other HW I can test the same thing on.

Revision history for this message
Zach Graceffa (zacgrac) wrote :

@Christian this is correct. I boot into 5.0.0-31-generic when I need to use my VM.

Revision history for this message
Zach Graceffa (zacgrac) wrote :

I do this because it's the only other kernel I have installed. I am not sure at what point between 5.0.0-31-generic and 5.3.0-20-generic this bug appears.

I've attached my history from the point that the bug started occurring, no references to qemu or libvirt though, only linux-image-5.3.0-20-generic.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Zach, thanks for the feedback.
Since we still struggle to recreate this on our side, is there a chance that you could test kernels from [1] to help spot exactly which version bump it was?

Going further we might even need to bisect things, but one step at a time.
I'd not want to put this on you (or the others), but if it continues not to trigger for us we might have to.

It is now clear that multiple people see this dependent on the kernel, adding a kernel task for it.
Maybe the kernel team has heard about a similar issue in other bugs?

[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/

tags: added: bot-stop-nagging
summary: - Running VM with Virtual NIC Crashes Host OS
+ KVM with e1000e and WinGuest Host OS on kernel 5.3 (ok with 5.0)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1849720

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This doesn't need kernel logs at this stage of the bug, bot please stop spamming :-)

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Zach Graceffa (zacgrac) wrote :

@Christian, I'll try some more kernels over the weekend and let you know the results.

I can confirm that switching my NIC to "virtio" from "e1000e" allows me to run my VM on kernel 5.3.0-20-generic.

Revision history for this message
odror (ozdror) wrote : apport information

ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: dror 1780 F.... pulseaudio
 /dev/snd/controlC0: dror 1780 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 18.04
InstallationDate: Installed on 2019-11-07 (2 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
IwConfig:
 wlo1 no wireless extensions.

 lo no wireless extensions.
MachineType: HP HP Spectre x360 Convertible 15t-df100
Package: qemu
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.0.0-32-generic root=UUID=1ddbd6e2-08a0-4189-8d71-1ab1332485e3 ro quiet splash vt.handoff=1
ProcVersionSignature: Ubuntu 5.0.0-32.34~18.04.2-generic 5.0.21
RelatedPackageVersions:
 linux-restricted-modules-5.0.0-32-generic N/A
 linux-backports-modules-5.0.0-32-generic N/A
 linux-firmware 1.173.9
Tags: bionic
Uname: Linux 5.0.0-32-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 07/29/2019
dmi.bios.vendor: AMI
dmi.bios.version: F.04
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: 863E
dmi.board.vendor: HP
dmi.board.version: 53.22
dmi.chassis.type: 31
dmi.chassis.vendor: HP
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAMI:bvrF.04:bd07/29/2019:svnHP:pnHPSpectrex360Convertible15t-df100:pvr:rvnHP:rn863E:rvr53.22:cvnHP:ct31:cvrChassisVersion:
dmi.product.family: 103C_5335KV HP Spectre
dmi.product.name: HP Spectre x360 Convertible 15t-df100
dmi.product.sku: 5ZV29AV
dmi.sys.vendor: HP

tags: added: apport-collected bionic
Revision history for this message
odror (ozdror) wrote : AlsaInfo.txt

apport information

Revision history for this message
Jason Schulz (uxcn) wrote :

I can confirm this issue with Eoan Ermine (5.3.0-19-generic). I tried disabling AppArmor (security_driver=none), but my laptop still locks up at the Windows login screen (Windows 10).

Using the virtio NIC does work around the problem though.

(root) ~ uname -a
Linux pi.aqdx.us 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

(root) ~ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Stepping: 9
CPU MHz: 2070.601
CPU max MHz: 3500.0000
CPU min MHz: 400.0000
BogoMIPS: 5799.77
Virtualization: VT-x
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 512 KiB
L3 cache: 4 MiB
NUMA node0 CPU(s): 0-3
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

I've been able to re-create the same issue using mainline (5.3.9-050309-generic). Has anybody bisected kernel versions yet?

Revision history for this message
Zach Graceffa (zacgrac) wrote :

I bisected them tonight and found the issue between 5.2.21-050221-generic and 5.3.0-050300-generic. As far as I can tell there are no versions between, so it seems it starts at 5.3.
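
For anyone repeating this with more kernel builds, the search can be sketched as a plain binary search over an ordered list of versions. The is_bad callback is a stand-in for the manual test (boot that kernel, start the Windows guest with the e1000e NIC, see whether the host freezes); the version list and the simulated oracle below are illustrative and only mirror the good/bad boundary reported in this comment.

```python
from typing import Callable, List, Tuple


def first_bad(versions: List[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (last_good, first_bad) from an oldest-to-newest version list.

    Assumes the oldest version is good, the newest is bad, and that the
    bug, once introduced, stays present in every later version.
    """
    lo, hi = 0, len(versions) - 1
    if len(versions) < 2 or is_bad(versions[lo]) or not is_bad(versions[hi]):
        raise ValueError("need a good oldest version and a bad newest one")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # Breakage is at mid or earlier.
        else:
            lo = mid  # Breakage is after mid.
    return versions[lo], versions[hi]


if __name__ == "__main__":
    # Illustrative mainline versions; the oracle simulates "bad from 5.3.0".
    candidates = ["5.0.21", "5.1.21", "5.2.21", "5.3.0", "5.3.9"]
    simulated_bad = {"5.3.0", "5.3.9"}
    print(first_bad(candidates, lambda v: v in simulated_bad))
```

The same loop works one level down on commits between the last good and first bad release, which is what git bisect automates.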

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Kernel Team - this has enough affected people that I'd rate this at least high severity.
Unfortunately none of "us" could reproduce it on our side yet to bisect on our own.
As you see above affected users were so kind to test mainline kernels and identified 5.2.21-050221 - 5.3.0-050300 already. Do you have an automation to help user-bisecting by providing such kernels like the mainline kernels so that - if we continue to be unable to repro on our side - we could have the users here test the bisect steps for us?

@Zach - thank you so much for your tests.
Have you by any chance tried 5.4 [1] as well, maybe there is hope that a fix already exists and only needs to be backported?

[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.4-rc7/

Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Rafael - IIRC you said all combinations you tried didn't trigger anything for you - is that correct?
If so please state it here and mark the bug on the qemu task invalid and unassign yourself as it seems much more a kernel issue right now.
Or did you have combinations left to try?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I did a retry on my own as well (kernel 5.3, virtual e1000e card, win server guest), but it just won't fail for me. That confirms Rafael's much more varied tests :-/

Revision history for this message
Brian Mays (briguyjm) wrote :

I switched my networking interface from e1000e to virtio and ran the VM on kernel 5.3.0-19-generic and had no issues with the VM. Networking worked for both the host and the VM. When I compared this to a clone of the machine with the e1000e networking interface, I had the system freezing issue.

Revision history for this message
Zach Graceffa (zacgrac) wrote :

@Christian - 5.4-rc7 worked for both e1000e and virtio NICs. No crash as of about 15 minutes.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1849720] Re: KVM with e1000e and WinGuest Host OS on kernel 5.3 (ok with 5.0)

On Tue, Nov 12, 2019 at 5:35 AM Zach Graceffa
<email address hidden> wrote:
>
> @Christian - 5.4-rc7 worked for both e1000e and virtio NICs. No crash as
> of about 15 minutes.

Great, thanks Zach for the check!

@kernel-team, that means there likely is something in 5.3 -> 5.4 that
you could identify and backport.
As I said before, you might want to provide bisecting builds for this
either for the 5.2->5.3 breakage or the possible 5.3->5.4 fix.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

https://bugzilla.kernel.org/show_bug.cgi?id=205247

This could be related. If so, it's a problem in the networking code.

Changed in qemu (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

I suspect upstream commit e7a409c3f46cb0dbc7bfd4f6f9421d53e92614a5 ("ipv4: fix IPSKB_FRAG_PMTU handling with fragmentation") might fix this. It has been queued in the eoan tree as part of the 5.3.10 update.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

https://bugzilla.kernel.org/show_bug.cgi?id=205173

This other report seems to be the one that is fixed by the referred upstream commit, as they have the same reporter name. I am very confident this might be the fix here.

Cascardo.

Revision history for this message
Clinton (clintonminton) wrote :

@Rafael - Regarding comment #39

The win10.xml dump was after I switched from "e1000e" to "virtio" NIC

To reiterate:

Freeze happens while using e1000e
NO freeze while using virtio

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Hi, the linux package version 5.3.0-24 from eoan-proposed should fix the issue. Can anyone test it?

Thanks.
Cascardo.
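
A quick way to tell whether a host already runs a kernel at or past that version is to compare the ABI number in the release string. The sketch below is only meaningful for Ubuntu 5.3.0-N kernels (the affected eoan series); the function names are hypothetical and the release string is whatever "uname -r" prints.

```python
import re
from typing import Tuple


def parse_ubuntu_kernel(release: str) -> Tuple[int, int, int, int]:
    """Parse '5.3.0-26-generic' into (5, 3, 0, 26)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unexpected kernel release string: " + release)
    major, minor, patch, abi = (int(group) for group in match.groups())
    return major, minor, patch, abi


def has_proposed_fix(release: str, fixed: str = "5.3.0-24") -> bool:
    """True if an Ubuntu 5.3.0-N kernel is at or past the fixed ABI.

    Other series are rejected rather than guessed at: 5.0 was reported
    unaffected and later series are outside the scope of this sketch.
    """
    current = parse_ubuntu_kernel(release)
    target = parse_ubuntu_kernel(fixed)
    if current[:3] != target[:3]:
        raise ValueError("only 5.3.0-N kernels are compared here")
    return current[3] >= target[3]


if __name__ == "__main__":
    for name in ("5.3.0-19-generic", "5.3.0-24-generic", "5.3.0-26-generic"):
        print(name, has_proposed_fix(name))
```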

Revision history for this message
Thiago Padilha (tpadilhacc) wrote :

Hi

I've faced a similar problem on Debian.

My system is a Ryzen 1700 and I use Windows and OSX VMs for work. After I upgraded to Debian 10 I started getting random host freezes some time after booting the VMs. Strangely, I have a Linux VM with an almost identical config to the Windows 10 one and it doesn't cause a host crash.

Since my processor is a rather troublesome one (I had hardware bugs in the past, which were worked around in the kernel), I assumed this to be another such case, especially since the bug didn't occur if I disabled SMT in the BIOS.

However, since my system was very stable under Debian 9, I spent quite a few hours trying to root out what change could have caused it, and at least in my specific case (which I'm not sure is the same as the one reported here), it was traced back to a certain libvirt commit, more specifically this one: https://github.com/libvirt/libvirt/commit/3527f9dde67460e9f2d50ce52b8dade8c0848e86

So a suggestion to anyone affected: try to explicitly disable seccomp by setting `seccomp_sandbox = 0` in /etc/libvirt/qemu.conf

Revision history for this message
dyadMisha (login-localhosd) wrote :

Hi,

5.3.0-24 fixed this issue for me.
BTW, a system with an i7-4790 was affected, while one with an i7-3820 was NOT affected. This was tested by attaching the same disk to both computers.

Thank you!

Revision history for this message
HellTriX (helltrix) wrote : apport information

ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu8.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: trixy 2281 F.... pulseaudio
 /dev/snd/controlC0: trixy 2281 F.... pulseaudio
 /dev/snd/pcmC0D8p: trixy 2281 F...m pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 19.10
InstallationDate: Installed on 2019-12-05 (0 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
KvmCmdLine: COMMAND STAT EUID RUID PID PPID %CPU COMMAND
MachineType: To Be Filled By O.E.M. To Be Filled By O.E.M.
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: qemu (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:
 0 amdgpudrmfb
 1 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.3.0-24-generic root=UUID=147e6b06-3c63-4fc0-847a-2bfdea6d3636 ro quiet splash iommu=pt iommu=1 amd_iommu=on vt.handoff=7
ProcVersionSignature: Ubuntu 5.3.0-24.26-generic 5.3.10
Tags: eoan
Uname: Linux 5.3.0-24-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirt lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 11/18/2019
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: P1.62
dmi.board.name: X570 AQUA
dmi.board.vendor: ASRock
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrP1.62:bd11/18/2019:svnToBeFilledByO.E.M.:pnToBeFilledByO.E.M.:pvrToBeFilledByO.E.M.:rvnASRock:rnX570AQUA:rvr:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.family: To Be Filled By O.E.M.
dmi.product.name: To Be Filled By O.E.M.
dmi.product.sku: To Be Filled By O.E.M.
dmi.product.version: To Be Filled By O.E.M.
dmi.sys.vendor: To Be Filled By O.E.M.

Revision history for this message
HellTriX (helltrix) wrote : AlsaInfo.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : CRDA.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : CurrentDmesg.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : IwConfig.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : Lspci.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : Lsusb.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : ProcInterrupts.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : ProcModules.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : PulseList.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : RelatedPackageVersions.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : RfKill.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : UdevDb.txt

apport information

Revision history for this message
HellTriX (helltrix) wrote : WifiSyslog.txt

apport information

Revision history for this message
Robert Strube (robstrube) wrote :

Sorry for responding so late. For the last several months I've been using the workaround of setting the virtual NIC's device model to "virtio" instead of "e1000e" for some critical production work, and I didn't want to rock the boat.

I'm currently running kernel 5.3.0-26-generic and am no longer experiencing the hard crash.
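
For anyone who wants to apply the same workaround to several guests, here is a minimal sketch of the NIC model switch, assuming the guest is defined through libvirt and its domain XML has the usual `<devices>/<interface>/<model type='...'/>` layout. The function name `switch_nic_model` and the sample XML (including the guest name `win-guest`) are made up for illustration and were not taken from this report.

```python
import xml.etree.ElementTree as ET


def switch_nic_model(domain_xml: str, old: str = "e1000e", new: str = "virtio") -> str:
    """Return the domain XML with every interface model `old` replaced by `new`.

    Hypothetical helper: it only rewrites the XML text and does not talk to libvirt.
    """
    root = ET.fromstring(domain_xml)
    # Interfaces live under <devices>; each may carry a <model type='...'/> child.
    for model in root.findall("./devices/interface/model"):
        if model.get("type") == old:
            model.set("type", new)
    return ET.tostring(root, encoding="unicode")


# Illustrative domain definition, not copied from any affected machine.
SAMPLE = """<domain type='kvm'>
  <name>win-guest</name>
  <devices>
    <interface type='network'>
      <source network='default'/>
      <model type='e1000e'/>
    </interface>
  </devices>
</domain>"""

if __name__ == "__main__":
    print(switch_nic_model(SAMPLE))
```

In practice the input would probably come from `virsh dumpxml` for the guest, and the result would be loaded back with `virsh define` while the guest is shut off. A Windows guest also needs the virtio network driver installed before the virtio NIC will work.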

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in qemu (Ubuntu):
status: In Progress → Fix Released