Virtual devices in Windows VMs occasionally loose connection to host

Bug #1807352 reported by Pascal
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Expired
Undecided
Unassigned

Bug Description

Description
===========
Multiple Windows virtual machine running on our OpenStack cloud seems to occasionally loose the connection from their disk & network virtual device (virtio) to the underlying host. The Windows VM is then no longer able to write down to the disk (see below for logs). Additionally the network connection is shown as "No internet access" (yellow alert symbol). Rebooting the affected VM fixes the problems.

Steps to reproduce
==================
We don't know how to reproduce the situation.

Expected result
===============
Windows VMs should run without any virtual device connectivity issues.

Actual result
=============
Virtual devices in Windows VMs occasionally loose connection to host, as described above.

Environment
===========
Host:
* CentOS Linux release 7.5.1804 (Core) / Kernel: 3.10.0-862
* Nova: openstack-nova-compute-15.1.3-1
* Hypervisor: Libvirt + KVM (Libvirt 3.9.0 / QEMU 2.10.0)
* Storage: Ceph 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
* Networking: Neutron with OpenVSwitch

VM:
* Windows Server 2016 Standard
* Red Hat VirtIO SCSI Disk Device (Driver version: 10.0.14393.1613)
* Red Hat VirtIO Ethernet Adapter (Driver version: 100.76.104.14900)
* Network: Static IP configuration

Libvirt XML (critical but irelevant data replaced by '***'):
<domain type='kvm' id='466'>
  <name>***</name>
  <uuid>***</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="15.1.3-1.el7"/>
      <nova:name>***</nova:name>
      <nova:creationTime>2018-12-05 07:11:13</nova:creationTime>
      <nova:flavor name="c04m036">
        <nova:memory>36864</nova:memory>
        <nova:disk>0</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>4</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="***">***</nova:user>
        <nova:project uuid="***">***</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="***"/>
    </nova:instance>
  </metadata>
  <memory unit='KiB'>37748736</memory>
  <currentMemory unit='KiB'>37748736</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <shares>4096</shares>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>RDO</entry>
      <entry name='product'>OpenStack Compute</entry>
      <entry name='version'>15.1.3-1.el7</entry>
      <entry name='serial'>9646161a-aa15-4a0a-b20f-f2a984052149</entry>
      <entry name='uuid'>***</entry>
      <entry name='family'>Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>SandyBridge-IBRS</model>
    <topology sockets='4' cores='1' threads='1'/>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='xsaveopt'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='***'>
        <secret type='ceph' uuid='***'/>
      </auth>
      <source protocol='rbd' name='***'>
        <host name='***' port='***'/>
        <host name='***' port='***'/>
        <host name='***' port='***'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <serial>***</serial>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <auth username='***'>
        <secret type='ceph' uuid='***'/>
      </auth>
      <source protocol='rbd' name='***'>
        <host name='***' port='***'/>
        <host name='***' port='***'/>
        <host name='***' port='***'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <serial>***</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <interface type='bridge'>
      <mac address='***'/>
      <source bridge='qbr70392c0e-f1'/>
      <target dev='tap70392c0e-f1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/12'/>
      <log file='/var/lib/nova/instances/***/console.log' append='off'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/12'>
      <source path='/dev/pts/12'/>
      <log file='/var/lib/nova/instances/***/console.log' append='off'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/org.qemu.guest_agent.0.***.sock'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <graphics type='spice' port='***' autoport='yes' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='***'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='10'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c9,c745</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c9,c745</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

Logs & Configs
==============
Compute and storage servers:
No relevant log entries found.

Windows VM:
Level Date and Time Source Event ID Task Category
Warning 05.12.2018 07:54:44 vhdmp 129 None "\Device\RaidPort2
"
Warning 05.12.2018 07:54:37 Disk 153 None The IO operation at logical block address 0xa968 for Disk 9 (PDO name: \Device\00000f0d) was retried.
Warning 05.12.2018 07:54:33 NTFS 50 None {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
Information 05.12.2018 07:54:33 Application Popup 26 None Application popup: Windows - Delayed Write Failed : Exception Processing Message 0xc0000222 Parameters 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28
Information 05.12.2018 07:54:33 Application Popup 26 None Application popup: Windows - Delayed Write Failed : Exception Processing Message 0xc0000222 Parameters 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28
Warning 05.12.2018 07:54:33 NTFS 50 None {Delayed Write Failed} Windows was unable to save all the data for the file \$Extend\$UsnJrnl:$J:$DATA. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
Warning 05.12.2018 07:54:33 NTFS 50 None {Delayed Write Failed} Windows was unable to save all the data for the file \AppData\Local\Microsoft\Windows\WebCache\V01.loglog. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
Warning 05.12.2018 07:54:33 NTFS 50 None {Delayed Write Failed} Windows was unable to save all the data for the file \AppData\Local\Microsoft\Windows\WebCache\WebCacheV01.jfm. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
Information 05.12.2018 07:54:33 Application Popup 26 None Application popup: Windows - Delayed Write Failed : Exception Processing Message 0xc0000222 Parameters 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28 0x7ffdf3051d28
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Error 05.12.2018 07:54:32 NTFS 137 (2) The default transaction resource manager on volume \\?\Volume{de53bca7-0000-0000-0000-100000000000} encountered a non-retryable error and could not start. The data contains the error code.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Microsoft-Windows-Ntfs 140 None "The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{de53bca7-0000-0000-0000-100000000000}, DeviceName: \Device\HarddiskVolume100.
(A device which does not exist was specified.)"
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 153 None The IO operation at logical block address 0x2dc40 for Disk 6 (PDO name: \Device\00000ea0) was retried.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 51 None An error was detected on device \Device\Harddisk6\DR98 during a paging operation.
Warning 05.12.2018 07:54:32 Disk 153 None The IO operation at logical block address 0xace0 for Disk 6 (PDO name: \Device\00000ea0) was retried.
Warning 05.12.2018 07:54:32 Disk 153 None The IO operation at logical block address 0x32ac0 for Disk 6 (PDO name: \Device\00000ea0) was retried.

Pascal (paescu)
description: updated
Pascal (paescu)
description: updated
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Pascal: do you have the qemu logs for the problematic instance? By default it is under /var/log/libvirt/qemu/

Is there anything generally happen on the compute when the VM lose the device? Could you check the journal of the host around the same time?

I mark this bug Incomplete until the requested logs are provided, please set it back to New state when you respond.

Changed in nova:
status: New → Incomplete
Revision history for this message
Pascal (paescu) wrote :

Thanks for your reply!

No, we checked every log and couldn't see any related event.

As far as I know, it hasn't happened again since we've switched to newer version of the VirtIO drivers.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.