vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15 / B-oracle-4.15 / X-KVM / B-KVM

Bug #1872401 reported by Po-Hsu Lin on 2020-04-13
This bug affects 1 person
Affects               Importance   Assigned to
ubuntu-kernel-tests   High         Unassigned
linux (Ubuntu)        Undecided    Unassigned
  Xenial              Undecided    Unassigned
  Bionic              Medium       Thadeu Lima de Souza Cascardo

Bug Description

[Impact]
When kvm-unit-tests is run inside a guest, the guest is paused and requires
a reset. When the same test (vmx_nm_test) is run on a host, it fails.

[Test case]
Grab kvm-unit-tests, build it and run:

TESTNAME=vmx TIMEOUT=90s ACCEL= ./x86/run x86/vmx.flat -smp 1 -cpu host,+vmx -append "vmx_nm_test"

If done inside a guest, when the host runs the bionic 4.15 kernel, the guest
will pause.

[Potential regressions]
Nested KVM could stop working. Floating point could stop working on KVM
guests, though the code that relied on this was already removed from
Bionic.

------------------------------

This issue was first spotted on Mar.16 [1]

The ubuntu_kvm_unit_tests run is interrupted on X-oracle-4.15 on both VM.Standard2.1 and VM.Standard2.16. This is not a regression, since it can be reproduced with 4.15.0-1031-oracle #34~16.04.1:

Running '/home/ubuntu/autotest/client/tmp/ubuntu_kvm_unit_tests/src/kvm-unit-tests/tests/vmx_nm_test'
 BUILD_HEAD=4671e4ba
 timeout -k 1s --foreground 30 /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel /tmp/tmp.da3iFrsCzC -smp 1 -cpu host,+vmx -append vmx_nm_test # -initrd /tmp/tmp.h2DFw8L0AF
 enabling apic
 paging enabled
 cr0 = 80010011
 cr3 = 477000
 cr4 = 20

 Test suite: vmx_nm_test
client_loop: send disconnect: Broken pipe
(node disconnected here)

Before the test started, this can be found in syslog:
Apr 13 06:26:25 selfprovisioned-phlin-kvm-unit kernel: [ 1073.529005] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.

After that, nothing was printed and the node disconnected.

If you try to run this case manually, it will stop at:
# ./vmx_nm_test
BUILD_HEAD=4671e4ba
ready!!!
timeout -k 1s --foreground 30 /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel /tmp/tmp.ZcGrnXu6se -smp 1 -cpu host,+vmx -append vmx_nm_test # -initrd /tmp/tmp.ADjEOAcRKM
enabling ap
(stopped here, even the "enabling apic" string was not printed)

It looks like this is a new test case added since the cycle of 4.15.0-1037.41~16.04.1-oracle

[1] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1867623/comments/2

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-1031-oracle 4.15.0-1031.34~16.04.1
ProcVersionSignature: User Name 4.15.0-1031.34~16.04.1-oracle 4.15.18
Uname: Linux 4.15.0-1031-oracle x86_64
ApportVersion: 2.20.1-0ubuntu2.21
Architecture: amd64
Date: Mon Apr 13 05:18:03 2020
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-oracle
UpgradeStatus: No upgrade log present (probably fresh install)


Po-Hsu Lin (cypressyew) on 2020-04-13
description: updated
Po-Hsu Lin (cypressyew) on 2020-04-13
tags: added: 4.15 sru-20200406 ubuntu-kvm-unit-tests
Po-Hsu Lin (cypressyew) wrote :

On Oracle Bionic 4.15.0-1038.42, the test ran to completion on VM.Standard2.1 (but failed with bug 1872419).

Po-Hsu Lin (cypressyew) on 2020-05-21
tags: added: sru-20200518
Po-Hsu Lin (cypressyew) wrote :

This issue can be reproduced with KVM nodes on our BareMetal MAAS.

Steps to reproduce:
1. Deploy a KVM node with Xenial / Bionic
2. sudo apt install gcc build-essential cpu-checker qemu-kvm git
3. git clone --depth=1 git://kernel.ubuntu.com/ubuntu/kvm-unit-tests/ -b disco
4. cd kvm-unit-tests
5. ./configure
6. make standalone
7. cd tests
8. sudo ./vmx_nm_test
BUILD_HEAD=4671e4ba
timeout -k 1s --foreground 30 /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel /tmp/tmp.wHvN1CDAeZ -smp 1 -cpu host,+vmx -append vmx_nm_test # -initrd /tmp/tmp.J5VNtXChPd
enabling ap
(Test interrupted here)

When this happens, the node stops responding and nothing special can be found in syslog. virsh on the host shows that it has been paused:
$ virsh list
 Id    Name                           State
----------------------------------------------------
 533   curly                          paused

summary: - vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15
+ vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15 /
+ X-KVM / B-KVM
tags: added: 4.4 bionic

BTW, the power status on MAAS will be "OFF".
And since this KVM instance will be in the "paused" state, it needs to be destroyed before re-deploying. This seems to be why we are constantly seeing deployment failures on KVM nodes.
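
The manual cleanup described here could be scripted along these lines. This is only a sketch: `paused_domains` and `destroy_paused` are hypothetical helper names, not part of any existing tooling, and the parser assumes the default `virsh list` layout (two header lines, then one row per domain with the state in the last column).

```shell
#!/bin/sh
# Print the names of domains whose state is "paused", reading
# `virsh list` style output on stdin (two header lines, then rows).
paused_domains() {
    awk 'NR > 2 && $NF == "paused" { print $2 }'
}

# Hypothetical cleanup helper: destroy every paused domain on the host.
# It is defined but not invoked here; call it explicitly once the
# list of paused domains has been reviewed.
destroy_paused() {
    virsh list | paused_domains | while read -r dom; do
        virsh destroy "$dom"
    done
}
```

On the host shown in this comment, `virsh list | paused_domains` would print `curly`. Keeping the parser separate from the destructive step makes it possible to review the list before anything is destroyed.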

Po-Hsu Lin (cypressyew) wrote :

This is still failing with the kvm-unit-tests from upstream.

Note that with the upstream repo, you will have to split the vmx_nm_tests like below:
https://kernel.ubuntu.com/git/ubuntu/kvm-unit-tests/commit/?h=disco&id=4671e4baaec6c968e626e9fc3557aa4581504586

Po-Hsu Lin (cypressyew) wrote :

Adding the kqa-blocker tag as this is hindering the automation of the KVM kernel SRU.
(Tester will need to destroy the paused VM manually)

tags: added: kqa-blocker sru-20200608
Po-Hsu Lin (cypressyew) wrote :

This issue does not exist on Focal Oracle 5.4.0-1016.16

Sean Feole (sfeole) on 2020-09-03
Changed in ubuntu-kernel-tests:
importance: Undecided → High
Po-Hsu Lin (cypressyew) on 2020-09-11
tags: added: sru-20200831
Po-Hsu Lin (cypressyew) wrote :

Issue found on B-oracle-5.4 VM.DenseIO2.8 / VM.Standard2.1

Po-Hsu Lin (cypressyew) wrote :

This is affecting the autovm# instances on Intel Cloud; the node will be "PAUSED" when it hits this issue.

Po-Hsu Lin (cypressyew) wrote :

Still visible with 4.15.0-1059.65 (oracle) on VM.DenseIO2.8

tags: added: sru-20201109
summary: vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15 /
- X-KVM / B-KVM
+ B-oracle-4.15 / X-KVM / B-KVM
Changed in linux (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in linux (Ubuntu):
status: New → Invalid
description: updated
Po-Hsu Lin (cypressyew) on 2020-12-04
no longer affects: linux-signed-oracle (Ubuntu Bionic)
no longer affects: linux-signed-oracle (Ubuntu)
Stefan Bader (smb) on 2020-12-04
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Ian (ian-may) on 2020-12-08
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Po-Hsu Lin (cypressyew) wrote :

This issue does not exist on Oracle 4.15.0-1059.65~16.04.1 anymore.

Po-Hsu Lin (cypressyew) wrote :

I can still see this issue with 4.15.0-1082.84 KVM with instance autovm2

The instance autovm2 will be in paused state.

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Po-Hsu Lin (cypressyew) wrote :

This issue is affecting X-kvm 4.4.0-1087.96

The KVM instance will be in paused state when testing the vmx_nm_test

Po-Hsu Lin (cypressyew) wrote :

Hello Thadeu,

With the B-kvm proposed kernel (4.15.0-1084.86 linux-kvm), vmx_nm_test still hangs.
This is the syslog from the console:

[ 1003.399966] kvm: MWAIT instruction emulated as NOP!
[ 1061.306351] *** Guest State ***
[ 1061.306676] CR0: actual=0x0000000080010031, shadow=0x0000000080010031, gh_mask=fffffffffffffff7
[ 1061.307599] CR4: actual=0x0000000000002020, shadow=0x0000000000002020, gh_mask=ffffffffffffe871
[ 1061.308426] CR3 = 0x0000000000477000
[ 1061.308701] RSP = 0x000000000046e748 RIP = 0x0000000000404850
[ 1061.309219] RFLAGS=0x00000002 DR7 = 0x0000000000000400
[ 1061.309715] Sysenter RSP=aaaaaaaaaaaaaaaa CS:RIP=0008:00000000004003d8
[ 1061.310334] CS: sel=0x0008, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
[ 1061.311254] DS: sel=0x0010, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 1061.312037] SS: sel=0x0010, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 1061.312812] ES: sel=0x0010, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 1061.313780] FS: sel=0x0010, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 1061.314636] GS: sel=0x0010, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 1061.315517] GDTR: limit=0x0000ffff, base=0x000000000041d00a
[ 1061.316671] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
[ 1061.317590] IDTR: limit=0x0000ffff, base=0x0000000000000000
[ 1061.318534] TR: sel=0x0080, attr=0x0008b, limit=0x00000067, base=0x000000008b41d48a
[ 1061.319666] EFER = 0x0000000000000500 PAT = 0x0007040600070406
[ 1061.320394] DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
[ 1061.321303] BndCfgS = 0x0000000000000000
[ 1061.321756] Interruptibility = 00000000 ActivityState = 00000000
[ 1061.322576] InterruptStatus = 0000
[ 1061.322948] *** Host State ***
[ 1061.323317] RIP = 0xffffffffc058c0e6 RSP = 0xffffa5b8879c3ca0
[ 1061.323986] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[ 1061.324698] FSBase=00007fa9f3915700 GSBase=ffffa2487fd00000 TRBase=fffffe000005b000
[ 1061.325589] GDTBase=fffffe0000059000 IDTBase=fffffe0000000000
[ 1061.326382] CR0=0000000080050033 CR3=0000000128b3e000 CR4=00000000000026a0
[ 1061.327374] Sysenter RSP=fffffe000005b000 CS:RIP=0010:ffffffffbe801700
[ 1061.328157] EFER = 0x0000000000000d01 PAT = 0x0007040600070406
[ 1061.328838] *** Control State ***
[ 1061.329236] PinBased=000000bf CPUBased=b6a06dfe SecondaryExec=000223e3
[ 1061.330206] EntryControls=0001d3ff ExitControls=00afefff
[ 1061.330828] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
[ 1061.331815] VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
[ 1061.332592] VMExit: intr_info=00000000 errcode=00000000 ilen=00000003
[ 1061.333421] reason=80000021 qualification=0000000000000000
[ 1061.334151] IDTVectoring: info=00000000 errcode=00000000
[ 1061.334920] TSC Offset = 0xfffffcf67b3dcccc
[ 1061.335511] TPR Threshold = 0x00
[ 1061.335879] PostedIntrVec = 0xf2
[ 1061.336254] EPT pointer = 0x000000005b4ef05e
[ 1061.336755] Virtual processor ID = 0x0001
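
For readers decoding the dump above, the `reason` and `ExceptionBitmap` fields can be unpacked with shell arithmetic. The sketch below assumes the architectural layout from the Intel SDM (bit 31 of the exit reason flags a VM-entry failure, bits 15:0 hold the basic exit reason, and #NM is exception vector 7); the function names are made up for illustration.

```shell
#!/bin/sh
# Split a VMX exit reason (hex, no 0x prefix) into the VM-entry
# failure flag (bit 31) and the basic exit reason (bits 15:0).
decode_exit_reason() {
    reason=$(( 0x$1 ))
    echo "entry_failure=$(( (reason >> 31) & 1 )) basic=$(( reason & 0xffff ))"
}

# Succeed if exception vector $2 is set in the ExceptionBitmap value $1 (hex).
vector_intercepted() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

decode_exit_reason 80000021   # prints: entry_failure=1 basic=33

if vector_intercepted 00060042 7; then
    echo "#NM (vector 7) is intercepted"
else
    echo "#NM (vector 7) is not intercepted"
fi
```

Basic exit reason 33 is listed in the SDM as a VM-entry failure due to invalid guest state, which is consistent with the instance ending up paused rather than the test reporting a result.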

tags: added: verification-failed-bionic
removed: verification-needed-bionic
Po-Hsu Lin (cypressyew) wrote :

Follow up for comment #15, the instance will be in a "paused" state in virsh after this.

Po-Hsu Lin (cypressyew) wrote :

Another note: although this does not fix the hang on KVM instances, I can see this test passing on B-AWS bare metal (it was failing in the previous cycle).

Instance c5.metal
4.15.0-1091.96 AWS
Test failed with timeout
https://pastebin.ubuntu.com/p/b77pbFdjf7/

4.15.0-1093.99 AWS
Test passed
01/22 06:24:42 DEBUG| utils:0153| [stdout] timeout -k 1s --foreground 30 /usr/bin/qemu-system-x86_64 -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -machine accel=kvm -kernel /tmp/tmp.lGWO5mvwC6 -smp 1 -cpu host,+vmx -append vmx_nm_test # -initrd /tmp/tmp.zlSuqJDX9u
01/22 06:24:42 DEBUG| utils:0153| [stdout] enabling apic
01/22 06:24:42 DEBUG| utils:0153| [stdout] paging enabled
01/22 06:24:42 DEBUG| utils:0153| [stdout] cr0 = 80010011
01/22 06:24:42 DEBUG| utils:0153| [stdout] cr3 = 477000
01/22 06:24:42 DEBUG| utils:0153| [stdout] cr4 = 20
01/22 06:24:42 DEBUG| utils:0153| [stdout]
01/22 06:24:42 DEBUG| utils:0153| [stdout] Test suite: vmx_nm_test
01/22 06:24:42 DEBUG| utils:0153| [stdout] PASS: fnop with CR0.TS set in L2 triggers #NM VM-exit to L1
01/22 06:24:42 DEBUG| utils:0153| [stdout] PASS: fnop with CR0.EM set in L2 triggers #NM VM-exit to L1
01/22 06:24:42 DEBUG| utils:0153| [stdout] SUMMARY: 9 tests
01/22 06:24:42 DEBUG| utils:0153| [stdout] PASS vmx_nm_test (9 tests)
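
The logs in this report fall into three shapes: cut off before any summary (the hang), completed with failures, and completed cleanly. A rough way to tell them apart automatically is sketched below. `classify_log` is a hypothetical helper, and it assumes that failing checks are reported with a `FAIL:` prefix in the same way passing ones use `PASS:`.

```shell
#!/bin/sh
# Classify a kvm-unit-tests console log read from stdin:
#   interrupted - no SUMMARY line was ever printed
#   fail        - a SUMMARY line exists and at least one FAIL: line
#   pass        - a SUMMARY line exists and no FAIL: line
classify_log() {
    log=$(cat)
    if ! printf '%s\n' "$log" | grep -q 'SUMMARY:'; then
        echo interrupted
    elif printf '%s\n' "$log" | grep -q 'FAIL:'; then
        echo fail
    else
        echo pass
    fi
}
```

Fed the 4.15.0-1093.99 AWS log above, this would report `pass`; a log that stops at `enabling ap` would be reported as `interrupted`.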

Po-Hsu Lin (cypressyew) wrote :

Maybe it's a fix for bug 1866587

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-135.139

---------------
linux (4.15.0-135.139) bionic; urgency=medium

  * bionic/linux: 4.15.0-135.139 -proposed tracker (LP: #1912223)

  * [drm:qxl_enc_commit [qxl]] *ERROR* head number too large or missing monitors
    config: (LP: #1908219)
    - qxl: remove qxl_io_log()
    - qxl: move qxl_send_monitors_config()
    - qxl: hook monitors_config updates into crtc, not encoder.

  * Touchpad not detected on ByteSpeed C15B laptop (LP: #1906128)
    - Input: i8042 - add ByteSpeed touchpad to noloop table

  * vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15 /
    B-oracle-4.15 / X-KVM / B-KVM (LP: #1872401)
    - KVM: nVMX: Always reflect #NM VM-exits to L1

  * stack trace in kernel (LP: #1903596)
    - net: napi: remove useless stack trace

  * CVE-2020-27777
    - [Config]: Set CONFIG_PPC_RTAS_FILTER

  * Bionic update: upstream stable patchset 2020-12-04 (LP: #1906875)
    - regulator: defer probe when trying to get voltage from unresolved supply
    - ring-buffer: Fix recursion protection transitions between interrupt context
    - time: Prevent undefined behaviour in timespec64_to_ns()
    - nbd: don't update block size after device is started
    - btrfs: sysfs: init devices outside of the chunk_mutex
    - btrfs: reschedule when cloning lots of extents
    - genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
    - hv_balloon: disable warning when floor reached
    - net: xfrm: fix a race condition during allocing spi
    - perf tools: Add missing swap for ino_generation
    - ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link()
    - can: rx-offload: don't call kfree_skb() from IRQ context
    - can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ
      context
    - can: dev: __can_get_echo_skb(): fix real payload length return value for RTR
      frames
    - can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()
    - can: peak_usb: add range checking in decode operations
    - can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
    - can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is
      on
    - xfs: flush new eof page on truncate to avoid post-eof corruption
    - Btrfs: fix missing error return if writeback for extent buffer never started
    - ath9k_htc: Use appropriate rs_datalen type
    - usb: gadget: goku_udc: fix potential crashes in probe
    - gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-free
    - gfs2: Add missing truncate_inode_pages_final for sd_aspace
    - gfs2: check for live vs. read-only file system in gfs2_fitrim
    - scsi: hpsa: Fix memory leak in hpsa_init_one()
    - drm/amdgpu: perform srbm soft reset always on SDMA resume
    - mac80211: fix use of skb payload instead of header
    - cfg80211: regulatory: Fix inconsistent format argument
    - scsi: scsi_dh_alua: Avoid crash during alua_bus_detach()
    - iommu/amd: Increase interrupt remapping table limit to 512 entries
    - pinctrl: intel: Set default bias in case no particular value given
    - ARM: 9019/1: kprobes: Avoid fortify_panic() when copying optprobe template
    - ...


Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew) wrote :

This issue happened again on B-KVM 4.15.0-1086.88

Po-Hsu Lin (cypressyew) wrote :

Follow-up for comment #20: this was caused by an outdated host. The issue went away after updating the host to 4.15.0-136.
Thanks!
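
Since the symptom depended on the host kernel rather than the guest, a harness could check the host version before running vmx_nm_test. A minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper, and the 4.15.0-135 threshold comes from the fixed generic `linux` package in this report, so it does not carry over to derivative kernels (oracle, kvm, aws), which are numbered separately.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2,
# using the version-aware ordering of GNU sort.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Threshold taken from the fixed package in this report (4.15.0-135.139).
fixed="4.15.0-135"

if version_ge "4.15.0-136" "$fixed"; then
    echo "host kernel includes the fix"
fi
```

In practice the first argument would come from the host, for example the output of `uname -r`, rather than a literal.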
