Recent KVM RTC cherry-picks break (some) Windows Live-Migrations

Bug #1668594 reported by Fabian Grünbichler on 2017-02-28
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned
Xenial
Undecided
Unassigned

Bug Description

== SRU Justification ==

Impact: Windows Live-Migration does not work reliably anymore with recent KVM RTC cherry-picks.

Fix: Single follow-up upstream cherry pick which fixes the problem.

Regression Potential: The patch has been upstream since 4.8, so it should be well-tested at this point. Thus regressions are unlikely.

---

The fix for https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649718 cherry-picked three commits from upstream 4.6 related to RTC interrupt handling. Unfortunately, the followup commit included in 4.8 was missed. As a result, Windwos Live-Migration in Qemu is broken on certain hardware.

Test system's CPU:
 cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping : 1
microcode : 0xb00001b
cpu MHz : 1425.621
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
bugs :
bogomips : 4199.88
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Live-Migrating a Windows Server 2016 VM (Qemu commandline below) hangs the VM in about 1/3 attempts. Interestingly, migrating the VM back to the original host allows the VM to run normally again (but subsequent migration attempts might hang it again as well).

non-minimized qemu command line:
/usr/bin/kvm -id 101 -chardev 'socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -pidfile /var/run/qemu-server/101.pid -daemonize -smbios 'type=1,uuid=a6e1bea5-09ab-4f8a-b1c2-3991725892f5' -drive 'if=pflash,unit=0,format=raw,readonly,file=/usr/share/kvm/OVMF_CODE-pure-efi.fd' -drive 'if=pflash,unit=1,format=raw,file=/tmp/101-ovmf.fd' -name Windows2016 -smp '2,sockets=1,cores=2,maxcpus=2' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vga std -vnc unix:/var/run/qemu-server/101.vnc,x509,password -no-hpet -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed' -m 2048 -k de -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -chardev 'socket,path=/var/run/qemu-server/101.qga,server,nowait,id=qga0' -device 'virtio-serial,id=qga0,bus=pci.0,addr=0x8' -device 'virtserialport,chardev=qga0,name=org.qemu.guest_agent.0' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:98525b424092' -drive 'file=PATH,if=none,id=drive-ide2,media=cdrom,aio=threads' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -device 'virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5' -drive 'file=PATH,if=none,id=drive-scsi0,cache=writeback,discard=on,format=raw,aio=threads,detect-zeroes=unmap' -device 'scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on' -device 'virtio-net-pci,mac=32:F5:16:78:A7:F0,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -rtc 'driftfix=slew,base=localtime' -global 'kvm-pit.lost_tick_policy=discard'

The causing commits were triaged via git bisect, applying the follow-up commit b0eaf4506f5f95d15d6731d72c0ddf4a2179eefa fixes the issue with no observable side-effects. Mainline 4.6 is also affected and triggers the bug just like Ubuntu kernels >= Ubuntu-4.4.0-63.84

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1668594

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Do you plan on sending an SRU request to the kernel team mailing list?
 <email address hidden>

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
tags: added: kernel-da-key
Brad Figg (brad-figg) on 2017-03-01
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

works as expected, thanks.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (15.0 KiB)

This bug was fixed in the package linux - 4.4.0-67.88

---------------
linux (4.4.0-67.88) xenial; urgency=low

  * linux: 4.4.0-67.88 -proposed tracker (LP: #1667052)

  * Recent KVM RTC cherry-picks break (some) Windows Live-Migrations
    (LP: #1668594)
    - kvm: x86: correctly reset dest_map->vector when restoring LAPIC state

  * Regression in 4.4.0-65-generic causes very frequent system crashes
    (LP: #1669611)
    - Revert "UBUNTU: SAUCE: apparmor: fix lock ordering for mkdir"
    - Revert "UBUNTU: SAUCE: apparmor: fix leak on securityfs pin count"
    - Revert "UBUNTU: SAUCE: apparmor: fix reference count leak when
      securityfs_setup_d_inode() fails"
    - Revert "UBUNTU: SAUCE: apparmor: fix not handling error case when
      securityfs_pin_fs() fails"

  * Upgrade Redpine RS9113 driver to support AP mode (LP: #1665211)
    - SAUCE: Redpine driver to support Host AP mode

  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * [Hyper-V] SAUCE: pci-hyperv fixes for SR-IOV on Azure (LP: #1665097)
    - SAUCE: PCI: hv: Fix wslot_to_devfn() to fix warnings on device removal
    - SAUCE: pci-hyperv: properly handle pci bus remove
    - SAUCE: pci-hyperv: lock pci bus on device eject

  * [Hyper-V/Azure] Please include Mellanox OFED drivers in Azure kernel and
    image (LP: #1650058)
    - net/mlx4_en: Fix bad WQE issue
    - net/mlx4_core: Fix racy CQ (Completion Queue) free
    - net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT
      transitions
    - net/mlx4_core: Avoid command timeouts during VF driver device shutdown

  * Xenial update to v4.4.49 stable release (LP: #1664960)
    - ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
    - selinux: fix off-by-one in setprocattr
    - Revert "x86/ioapic: Restore IO-APIC irq_chip retrigger callback"
    - cpumask: use nr_cpumask_bits for parsing functions
    - hns: avoid stack overflow with CONFIG_KASAN
    - ARM: 8643/3: arm/ptrace: Preserve previous registers for short regset write
    - target: Don't BUG_ON during NodeACL dynamic -> explicit conversion
    - target: Use correct SCSI status during EXTENDED_COPY exception
    - target: Fix early transport_generic_handle_tmr abort scenario
    - target: Fix COMPARE_AND_WRITE ref leak for non GOOD status
    - ARM: 8642/1: LPAE: catch pending imprecise abort on unmask
    - mac80211: Fix adding of mesh vendor IEs
    - netvsc: Set maximum GSO size in the right place
    - scsi: zfcp: fix use-after-free by not tracing WKA port open/close on failed
      send
    - scsi: aacraid: Fix INTx/MSI-x issue with older controllers
    - scsi: mpt3sas: disable ASPM for MPI2 controllers
    - xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
    - ALSA: seq: Fix race at creating a queue
    - ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
    - drm/i915: fix use-after-free in page_flip_completed()
    - Linux 4.4.49

  * NFS client : kernel 4.4.0-57 crash with nfsv4 enries in /etc/fstab
    (LP: #1650336)
    - SUNRPC: fix refcounting problems with auth_g...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers