Intel nested KVM exits L2 due to TRIPLE_FAULT

Bug #1970034 reported by Lorenz Bauer
Affects                        Status   Importance   Assigned to   Milestone
linux-meta-hwe-5.13 (Ubuntu)   New      Undecided    Unassigned

Bug Description

linux-image-5.13.0-39-generic:
  Installed: 5.13.0-39.44~20.04.1

Description: Ubuntu 20.04.1 LTS
Release: 20.04

I use qemu to run short-lived Linux VMs as part of a CI pipeline, using nested KVM on Intel CPUs. Fairly often, one of the qemu processes managing the VMs exits without any output. I've tracked the behaviour down to L1 qemu receiving KVM_EXIT_SHUTDOWN from the KVM_RUN ioctl:

    ...
    15268@1647341556.924605:kvm_run_exit cpu_index 0, reason 2
    15268@1647341556.928341:kvm_run_exit cpu_index 0, reason 8

This is on QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.21).
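
The numeric reasons in the QEMU trace can be mapped back to names. The following is a minimal sketch, not part of the original report: the table is a subset of the exit reasons defined in the kernel's include/uapi/linux/kvm.h, and the function name decode_run_exits is made up for illustration. It only assumes the "kvm_run_exit cpu_index N, reason M" line format shown above.

```python
import re

# Subset of the KVM_EXIT_* values from include/uapi/linux/kvm.h.
# Check them against your own kernel headers before relying on them.
KVM_EXIT_REASONS = {
    0: "KVM_EXIT_UNKNOWN",
    1: "KVM_EXIT_EXCEPTION",
    2: "KVM_EXIT_IO",
    3: "KVM_EXIT_HYPERCALL",
    4: "KVM_EXIT_DEBUG",
    5: "KVM_EXIT_HLT",
    6: "KVM_EXIT_MMIO",
    7: "KVM_EXIT_IRQ_WINDOW_OPEN",
    8: "KVM_EXIT_SHUTDOWN",
    9: "KVM_EXIT_FAIL_ENTRY",
    10: "KVM_EXIT_INTR",
}

# Matches the QEMU trace lines quoted in the report.
TRACE_RE = re.compile(r"kvm_run_exit cpu_index (\d+), reason (\d+)")


def decode_run_exits(lines):
    """Return a (cpu_index, reason_name) tuple for every kvm_run_exit line."""
    decoded = []
    for line in lines:
        match = TRACE_RE.search(line)
        if match is None:
            continue  # not a kvm_run_exit event
        cpu = int(match.group(1))
        reason = int(match.group(2))
        name = KVM_EXIT_REASONS.get(reason, f"unknown ({reason})")
        decoded.append((cpu, name))
    return decoded


# The two trace lines from the report.
sample = [
    "15268@1647341556.924605:kvm_run_exit cpu_index 0, reason 2",
    "15268@1647341556.928341:kvm_run_exit cpu_index 0, reason 8",
]

for cpu, name in decode_run_exits(sample):
    print(f"cpu {cpu}: {name}")
```

Read this way, the last two events are an ordinary I/O exit followed by the shutdown exit that ends the VM.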

Digging deeper, I managed to capture the following trace from the L1 kernel (via perf record -a -e "kvm:*"):

    ...
    [001] 770.850287: kvm:kvm_entry: vcpu 0, rip 0x100146
    [001] 770.850307: kvm:kvm_exit: vcpu 0 reason TRIPLE_FAULT rip 0x100146 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
    [001] 770.850313: kvm:kvm_fpu: unload
    [001] 770.850316: kvm:kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)

This is on Linux 5.13.0-30-generic #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.

Immediately prior to the triple fault there are a number of EXTERNAL_INTERRUPT exits, as well as reads and writes of MSRs and CRs. The crash seems independent of the Linux version running in L2: I see it across several LTS kernels. Unfortunately I don't know which version of Linux / Ubuntu is running in L0.
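
To summarise what leads up to the fault, the text form of the perf trace can be scanned for the exits preceding the first TRIPLE_FAULT. This is a rough sketch under stated assumptions: it expects lines in the "kvm:kvm_exit: vcpu N reason R rip 0x..." shape shown above, it does not separate vcpus, and the function name exits_before_triple_fault is hypothetical. Apart from the TRIPLE_FAULT line, the sample lines are synthetic and only illustrate the format; they are not taken from the real traces.

```python
import re
from collections import Counter, deque

# Matches kvm_exit events in the textual perf output quoted in the report.
EXIT_RE = re.compile(r"kvm:kvm_exit: vcpu (\d+) reason (\S+) rip (0x[0-9a-fA-F]+)")


def exits_before_triple_fault(lines, window=50):
    """Count exit reasons in the `window` exits before the first TRIPLE_FAULT.

    Returns (Counter of reasons, rip of the faulting exit), or None if the
    trace contains no TRIPLE_FAULT exit.
    """
    recent = deque(maxlen=window)  # keeps only the most recent exits
    for line in lines:
        match = EXIT_RE.search(line)
        if match is None:
            continue  # kvm_entry, kvm_fpu and other events are skipped
        reason, rip = match.group(2), match.group(3)
        if reason == "TRIPLE_FAULT":
            return Counter(recent), rip
        recent.append(reason)
    return None


sample = [
    # Synthetic lines, for illustration only.
    "[001] 770.850100: kvm:kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0x100140",
    "[001] 770.850150: kvm:kvm_exit: vcpu 0 reason MSR_WRITE rip 0x100142",
    "[001] 770.850200: kvm:kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT rip 0x100144",
    "[001] 770.850250: kvm:kvm_exit: vcpu 0 reason CR_ACCESS rip 0x100144",
    # Line from the report.
    "[001] 770.850307: kvm:kvm_exit: vcpu 0 reason TRIPLE_FAULT rip 0x100146 "
    "info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 "
    "error_code 0x00000000",
]

result = exits_before_triple_fault(sample)
if result is not None:
    counts, rip = result
    print(f"TRIPLE_FAULT at rip {rip}")
    for reason, count in counts.most_common():
        print(f"  {reason}: {count}")
```

Running the same function over both captured traces would show whether the mix of exits before the fault is similar across failures.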

I've tried to reproduce this on other machines I have access to, without much luck. I've also tried to make sense of rip 0x100146 on my own, but I don't understand the x86 / qemu boot process well enough. Finally, I've looked at commits to KVM between 5.13 and master that mention TRIPLE_FAULT, but nothing stood out.

I've put traces from two failed executions, plus lscpu output, at https://gist.github.com/lmb/c36479bb67f397ba08319b5e0f752386
For completeness' sake, you can see the failing CI runs at https://ebpf.semaphoreci.com/branches/317c3f18-4de0-488b-af6d-2a1fa0967f87

I've tried to get help with this issue via <email address hidden> but had no luck. See https://<email address hidden>/

Lorenz Bauer (lmbr)
affects: ubuntu → linux-meta-hwe-5.13 (Ubuntu)