nested KVM fails - KVM: entry failed, hardware error 0x0

Bug #1682077 reported by Iain Lane
This bug affects 15 people
Affects                    Status     Importance  Assigned to  Milestone
linux (Ubuntu)             Confirmed  High        Unassigned
linux-lts-xenial (Ubuntu)  Confirmed  High        Unassigned

Bug Description

Hello,

We just noticed that systemd's autopkgtests have started failing often on amd64 for all releases, when run in Canonical's scalingstack cloud.

This is nested KVM I think.

TEST RUN: Basic systemd setup
+ /usr/bin/qemu-system-x86_64 -smp 1 -net none -m 512M -nographic -kernel /boot/vmlinuz-4.10.0-19-generic -drive format=raw,cache=unsafe,file=/var/tmp/systemd-test.unFR7V/rootdisk.img -initrd /boot/initrd.img-4.10.0-19-generic -machine accel=kvm -enable-kvm -cpu host -append 'root=/dev/sda1 raid=noautodetect loglevel=2 init=/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no '
KVM: entry failed, hardware error 0x0
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000206a1
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=d8 66 e8 d8 a2 ff ff 66 5b 66 83 c4 08 66 5b 66 5e 66 c3 90 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

I have observed a failure on the compute node (region lgw01)

  Linux amemasu 4.4.0-59-generic #80~14.04.1-Ubuntu SMP Fri Jan 6 18:02:02 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

and a pass of the exact same package on the compute node (region lcy01)

  Linux kissel 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Here's an example log:
  https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-zesty-ci-train-ppa-service-2713/zesty/amd64/s/systemd/20170411_182221_b8bea@/log.gz

I found bug #1329434 which looks more or less identical and was fixed for 3.13. Perhaps it regressed in 4.4?

Tags: cscc xenial
Iain Lane (laney) wrote :

arges wrote a script in bug #1329434 which might be enough to reproduce this, but I don't have access to the hardware myself to test.

affects: linux (Ubuntu) → linux-lts-xenial (Ubuntu)
Iain Lane (laney) wrote :

Hmm, that's a non-current kernel. I wonder if someone can test with the current lts-xenial?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-xenial (Ubuntu):
status: New → Confirmed
Seth Forshee (sforshee) wrote :

We hit this late last week too, though I'm having difficulty reproducing it. smb thinks some checks may be failing for feature bits that are not passed into the guest because they are new, which would make this a problem mainly on fairly recent hardware.

Changed in linux-lts-xenial (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → High
tags: added: kernel-da-key xenial
tags: added: kernel-key
removed: kernel-da-key
Iain Lane (laney) wrote :

For me it's enough to boot an instance on an affected lgw01 compute node, install qemu-system-x86, download a cloud image and run "/usr/bin/qemu-system-x86_64 -smp 1 -net none -m 512M -nographic -drive format=raw,cache=unsafe,file=zesty-server-cloudimg-amd64.img -machine accel=kvm -enable-kvm -cpu host"
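
For repeated runs, the reproduction command above could be wrapped in a small script. This is only a sketch: the helper names `build_qemu_argv`, `entry_failed` and `run_once` are made up for illustration, the 60-second timeout is an arbitrary assumption, and the QEMU flags are copied unchanged from the command above.

```python
import subprocess

# Message QEMU prints when the failure described in this bug occurs.
KVM_ENTRY_FAILED = "KVM: entry failed, hardware error"


def build_qemu_argv(image, qemu="/usr/bin/qemu-system-x86_64"):
    """Return the QEMU command line from the comment above as an argv list."""
    return [
        qemu,
        "-smp", "1",
        "-net", "none",
        "-m", "512M",
        "-nographic",
        "-drive", f"format=raw,cache=unsafe,file={image}",
        "-machine", "accel=kvm",
        "-enable-kvm",
        "-cpu", "host",
    ]


def entry_failed(output):
    """True if the captured QEMU output contains the entry failure message."""
    return KVM_ENTRY_FAILED in output


def run_once(image, timeout=60):
    """Start QEMU once and report whether the failure message appeared.

    Assumption: QEMU may keep running after the failure (or after a
    successful boot), so it is killed once `timeout` seconds have passed
    and whatever output was captured so far is inspected.
    """
    try:
        proc = subprocess.run(
            build_qemu_argv(image),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the message may arrive on stderr
            timeout=timeout,
        )
        output = proc.stdout
    except subprocess.TimeoutExpired as exc:
        output = exc.stdout or b""
    return entry_failed(output.decode(errors="replace"))
```

Calling `run_once("zesty-server-cloudimg-amd64.img")` several times on the same instance would give a rough failure rate for that compute node.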

Seth Forshee (sforshee) wrote :

Adding some miscellaneous data points.

I'm still unable to reproduce this outside of a lgw01 compute node that Laney set up for me, running xenial with qemu 1:2.5+dfsg-5ubuntu10.10 and linux 4.4.0-72. It doesn't happen on every host. I'm still experimenting to see if I can reproduce this on some hardware I control.

The failures happen before the L2 kernel starts, so the problem must be in L0 or L1. I've experimented with using older versions of qemu (1:2.5+dfsg-5ubuntu10.6) and the kernel (4.4.0-62.83) in L1 but still see the issues with those.
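
The register dump in the bug description is consistent with this: EIP=0000fff0 with CS selector f000 and CS base ffff0000 is the x86 reset state, so the vCPU had not executed any guest code yet. A minimal sketch of checking a captured dump for that state follows; the function names `parse_dump` and `at_reset_vector` are hypothetical, and the parser assumes the dump layout shown in the description.

```python
import re

# x86 state right after reset: CS selector f000, CS base ffff0000, EIP fff0.
RESET_EIP = 0x0000FFF0
RESET_CS_SELECTOR = 0xF000
RESET_CS_BASE = 0xFFFF0000


def parse_dump(text):
    """Extract EIP and the CS selector/base from a QEMU register dump."""
    regs = {}
    eip = re.search(r"\bEIP=([0-9a-fA-F]{8})", text)
    if eip:
        regs["eip"] = int(eip.group(1), 16)
    # Segment lines look like: "CS =f000 ffff0000 0000ffff 00009b00"
    cs = re.search(
        r"^CS\s*=\s*([0-9a-fA-F]{4})\s+([0-9a-fA-F]{8})", text, re.MULTILINE
    )
    if cs:
        regs["cs_selector"] = int(cs.group(1), 16)
        regs["cs_base"] = int(cs.group(2), 16)
    return regs


def at_reset_vector(text):
    """True if the dump shows a vCPU still sitting at the reset vector."""
    regs = parse_dump(text)
    return (
        regs.get("eip") == RESET_EIP
        and regs.get("cs_selector") == RESET_CS_SELECTOR
        and regs.get("cs_base") == RESET_CS_BASE
    )
```

A dump for which `at_reset_vector` returns False would suggest a different failure, since the guest had already started running by then.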

tags: added: kernel-da-key
removed: kernel-key
Seth Forshee (sforshee) wrote :

Stefan Bader and I have continued trying to reproduce this. Here's what we've done:

- Tried to reproduce on various pieces of hardware available to us.
- Requested the hardware information about the host machine from IS and tested on the most similar machine available to us (same generation Xeon CPU, but a different model).
- Requested information about relevant software versions on the host machine (trusty with a 4.4.0-59 hwe kernel) and tried to reproduce using the same software versions.
- Requested the libvirt configuration for the instance which exhibits the failures and tried using a similar configuration.

So far we are still unable to reproduce the problem.

Iain Lane (laney) wrote :

Maybe IS happens to have a box that's not provisioned yet that you can borrow? Or you can borrow that one for a bit if lgw01 has some spare capacity? Or they'll upgrade to the current lts-xenial kernel so we maybe get lucky. :)

Manikandan R (maniram477) wrote :

Facing the same issue with a KVM-enabled L2 VM on the following setup:

CPU Model - Skylake-Client

L0:
Ubuntu 16.04.3 LTS
Kernel - 4.4.0-98-generic & 4.4.0-101-generic

libvirtd (libvirt) 2.5.0-3ubuntu5.5~cloud0
QEMU emulator version 2.8.0 (Debian 1:2.8+dfsg-3ubuntu2.5~cloud0)

L1:
Ubuntu 16.04.3 LTS
Kernel - 4.4.0-98-generic

libvirtd (libvirt) 1.3.1-1ubuntu10.15
QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.16)

I am able to launch the L2 VM in qemu mode (that is, without KVM acceleration).

With the same L0 and L1 configuration, I am able to launch the L2 VM even with KVM on a SandyBridge machine.

tags: removed: kernel-da-key
Ujwal Setlur (ujwal-setlur) wrote :

The last time something like this was fixed was https://bugs.launchpad.net/bugs/1329434. The fix was to disable APIC virtualization in nested guests. Did it suddenly get turned on again? Any patches we could test? Thanks.
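
To see what a given host currently has enabled, the `kvm_intel` module parameters can be read from sysfs. The sketch below reads `nested` and `enable_apicv`; the helper names are hypothetical, and whether either value has any bearing on this bug has not been established in this report.

```python
from pathlib import Path

# sysfs directory holding the kvm_intel module parameters on Intel hosts.
PARAM_DIR = Path("/sys/module/kvm_intel/parameters")


def parse_bool_param(raw):
    """Interpret a boolean module parameter ("Y"/"N" or "1"/"0")."""
    value = raw.strip()
    if value in ("Y", "y", "1"):
        return True
    if value in ("N", "n", "0"):
        return False
    return None  # unexpected content


def read_param(name, param_dir=PARAM_DIR):
    """Return the parsed parameter, or None if it cannot be read."""
    try:
        return parse_bool_param((Path(param_dir) / name).read_text())
    except OSError:
        # Module not loaded, parameter missing, or not an Intel host.
        return None


if __name__ == "__main__":
    for name in ("nested", "enable_apicv"):
        print(name, read_param(name))
```

Running this on both L0 and L1 and attaching the output would make reports from different hosts easier to compare.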

Kai (kaixxx) wrote :

I changed the script to work with xenial instead of trusty: https://gist.github.com/kmindi/4fb1332509a39b6214b0e8b39611f5ad

I can also confirm the bug on the following hardware configuration:

CPU Model - Skylake (Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz)

L0 and L1:
Ubuntu 16.04.3 LTS
Kernel - 4.4.0-104-generic
libvirtd (libvirt) 1.3.1 (libvirt-bin 1.3.1-1ubuntu10)
QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.16)

Yan Chen (cheny-f) wrote :

Running into the same issue as well.
Is there any workaround for this?

fransiscus bimo (av.bee) wrote :

I am sorry if this bug is too old, but I have this problem and have not been able to find a solution so far.

I can reproduce this bug with the following setup:

Ubuntu 16.04.5 LTS
Kernel 4.4.0-131-generic
libvirtd (libvirt) 1.3.1

QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.32)

Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz

But I have a different machine with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and the same package versions, and it has no problem.
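
Since the reports in this bug differ mainly by CPU model, it may help to record the model name and whether the `vmx` flag is visible on each level (L0 and L1). A minimal sketch that parses the text of `/proc/cpuinfo`; the function name `parse_cpuinfo` is hypothetical, and only the first processor entry is inspected.

```python
def parse_cpuinfo(text):
    """Return (model name, has_vmx) for the first processor entry."""
    model = None
    flags = set()
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends a processor entry; stop after the first one.
            if model is not None or flags:
                break
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "model name" and model is None:
            model = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model, "vmx" in flags


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        print(parse_cpuinfo(handle.read()))
```

The `vmx` flag inside L1 only shows that nested virtualization is exposed to the guest; it does not show that entering an L2 guest will succeed.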

Has this bug already been fixed?

Gianpietro Lavado (gianpietro.lavado) wrote :

Reproduced on Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz

Ubuntu 16.04.5 LTS
4.4.0-134-generic

libvirt 4.0.0
QEMU 4.0.0
Running hypervisor: QEMU 2.11.1

Nested KVM not working, instances log "KVM: entry failed, hardware error 0x0"

Martin Pitt (pitti) wrote :

This happens in about half of the xenial semaphoreci.com instances as well:

$ uname -a
Linux semaphore-light-1809b 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Brad Figg (brad-figg)
tags: added: cscc