KVM Guest pauses after upgrade to Ubuntu 20.04

Bug #1866870 reported by tstrike on 2020-03-10
44
This bug affects 6 people
Affects Status Importance Assigned to Milestone
QEMU
Undecided
Unassigned
qemu (Ubuntu)
Undecided
Unassigned
seabios (Ubuntu)
Undecided
Unassigned

Bug Description

Symptom:
Error unpausing domain: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1311, in resume
    self._backend.resume()
  File "/usr/lib/python3/dist-packages/libvirt.py", line 2174, in resume
    if ret == -1: raise libvirtError ('virDomainResume() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

---

As outlined here: https://bugs.launchpad.net/qemu/+bug/1813165/comments/15

After upgrade, all KVM guests are in a default pause state. Even after forcing them off via virsh, and restarting them the guests are paused.

These Guests are not nested.

A lot of diganostic information are outlined in the previous bug report link provided. The solution mentioned in previous report had been allegedly integrated into the downstream updates.

Related branches

tstrike (tstrike34) wrote :
no longer affects: ubuntu
description: updated

Hi tstrike,
thanks for the report.
I have slightly modified the description and changed the bug tasks accordingly for you.

I first checked the related known fixes from the old case that is linked.
Just in case if we might miss one in Ubuntu 20.04 that you are using.

Kernel:
=> https://marc.info/?l=kvm&m=155085391830663&w=2
Tested and verified https://bugs.launchpad.net/qemu/+bug/1813165/comments/13
This got upstream and is in:
$ git describe --contains ad7dc69aeb231
v5.0-rc8~1^2~2
That we'd clearly have in Focal being on 5.4

qemu
https://git.qemu.org/?p=qemu.git;a=commit;h=9c1f8f4493e8355d0e48f7d1eebdf86893ba082d
Other fixes related to the topic are in qemu 2.8

On seabios disabling of SMM
- https://bugzilla.redhat.com/show_bug.cgi?id=1378006
- https://bugzilla.redhat.com/show_bug.cgi?id=1464654#c21
The following is from >=1.12.0-1 (was enabled by default before)
There is a small (for old qemu) and large binary (new qemu):
 42 build/bios.bin:
 43 # A stripped-down version of bios, to fit in 128Kb, for qemu <= 1.7
 44 »···$(call build-bios,bios,QEMU=y ROM_SIZE=128 PVSCSI=n BOOTSPLASH=n XEN=n USB_OHCI=n USB_XHCI=n USB_UAS=n SDCARD=n TCGBIOS=n MPT_SCSI=n NVME=n USE_SMM=n VGAHOOKS=n)
 45 build/bios-256k.bin:
 46 »···$(call build-bios,bios,QEMU=y ROM_SIZE=256)

Note: if we are out of options we could try testing to set USE_SMM=n here, but lets check other details first.

But as already explained on the linked bug 1813165:
"If you're seeing "KVM internal error. Suberror: 1" it can be multiple things, not necessarily the same bug."

Copied here from the other bug about the system setup that is in use:

L0 DistroRelease: Ubuntu 20.04 on Kernel Linux 5.4.0-14-generic x86_64
L1 3 guests Windows 10, Centos 8
No L2s
No guests are enabled for UEFI Boot

libvirt: 6.0.0-0ubuntu4
qemu 1:4.2-3ubuntu1

Issue triggers without nesting (ensured via modprobe kvm_intel nested=)

@tstrike - can you trigger the same issue with all your guests?
You list Windows and Centos guests, does it triggers with Centos as well or only the Windows guests?
Also if you have a chance (just to be sure) does it trigger with an Ubuntu guest as well? This would help for people retrying not using a case that doesn't even trigger in your setup.

@tstrike - you seem to hit this while starting your guest through libvirt.
Could you please attach your guest XML so that we can try to recreate this case?
  $ virsh dumpxml <guestname>
That will help when trying to recreate your case.

@tstrike
It would also be helpful to get your qemu commandline as well as any further messages qemu might have reported.
You'll find that in the per guest log file at:
 $ cat /var/log/libvirt/qemu/<guestname>.log

If you could report all that here that should be useful for everyone tracking this bug.

Changed in qemu (Ubuntu):
status: New → Incomplete

@tstrike: finally for the sake of apparmor denials or any other odd error that might be mentioned in there attaching the output of `dmesg` on your host might be useful as well.

tstrike (tstrike34) wrote :

Christian,

Thanks for getting my report in the proper syntax. I would be extremely happy to follow through on the tasks you laid out to me. Give me about 3 hours and I will update the report with the items requested.

tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :

Thank you @tstrike:

In your logs I see a bunch of qemu warnings right at the beginning:
2020-02-12T15:09:37.773025Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48FH).vmx-exit-load-perf-global-ctrl [bit 12]
2020-02-12T15:09:37.773107Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(490H).vmx-entry-load-perf-global-ctrl [bit 13]
2020-02-12T15:09:37.774800Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48FH).vmx-exit-load-perf-global-ctrl [bit 12]
2020-02-12T15:09:37.774821Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(490H).vmx-entry-load-perf-global-ctrl [bit 13]
2020-02-12T15:09:38.024821Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control

And then a crash like:
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88

Boris Derzhavets (dbaxps0220) wrote :

Reproduced via attempt to install KVM Guest F31 Server on Ubuntu 20.04 (bare metal)

I tried with a guest XML matching yours (other than disk setup).
I didn't get those errors you reported even when using your config.

Notable differences to my default - your guest has:
- a rather old chip type (Penryn is a 2007 chip)
- a rather old machine type (uses xenial which matches ~pc-i440fx-2.5)
This probably based on when the system was created.
But since you also have the same issues on the windows guest which has the modern:
  <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
  <cpu mode='host-model' check='partial'/>
So this isn't a route we need to go down...

Note: I tried this on kernel 5.4.0-14-generic with some common not too old & not too new chips
- Intel(R) Xeon(R) CPU E5-2620
- AMD Opteron(tm) Processor 4226

Then I remembered that you followed to disable nesting and after all vmx-* you see in the warnings could be related.

I ran this and restarted my guests:
# sudo rmmod kvm_intel
# sudo modprobe kvm_intel nested=0
or
# sudo rmmod kvm_amd
# sudo modprobe kvm_amd nested=0

Even then I didn't get the same warnings or crashes you got.

FYI: maybe related (similar symptom - which could be anything as we know, but still worth a link): https://bugzilla.redhat.com/show_bug.cgi?id=1718584

Thanks Boris for chiming in!
Maybe it is something in the guest (or the way virt-manager sets things up) after all - will install an F31 via virt-manager as well ...

Boris Derzhavets (dbaxps0220) wrote :

I've got the same issue starting guest via virt-install even with serial console.

I fetched
https://download.fedoraproject.org/pub/fedora/linux/releases/31/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-31-1.9.iso

And installed it on Ubuntu 20.04 via virt-manager (keeping all things on its default).

- New
- Local Media
- select ISO (Autodetects F31)
- Forward, Forward, Forward, Finish

Now warnings:
sudo cat /var/log/libvirt/qemu/fedora31.log | grep warning
<empty>

And it just works, the installer is on the graphical UI and waiting for me.

@Boris - in your log I've seen that you also got the Penryn cpu which I find odd.
"-cpu Penryn,vme=on,vmx=on,x2apic=on,tsc-deadline=on,xsave=on,hypervisor=on,arat=on,tsc-adjust=on,arch-capabilities=on,skip-l1dfl-vmentry=on \"
Assuming you also only used default I wonder how it got to that, maybe the reason for that is the same reason that eventually triggers the error.
But virt-manager/libvirt would usually just do a best-fit (for me Haswell-noTSX-IBRS).

@Boris and @tstrike Could you both please report:
$ virsh capabilities
$ virsh domcapabilities
$ sudo qemu-system-x86_64 --enable-kvm --nographic --nodefaults -S -qmp-pretty stdio
{"execute":"qmp_capabilities"}
{"execute":"query-cpu-definitions"}
Note: the command seems to hang as you are on QMP, then just enter the two commands below one by one. This will add "qemu's explanation why a given cpu is usable or not"

tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :
tstrike (tstrike34) wrote :

This particular command seems to hang on:

qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]

I tried to execute (thinking I was in a shell):

{"execute":"qmp_capabilities"}
{"execute":"query-cpu-definitions"}

I might have misinterpreted what you wanted me to do.

Boris Derzhavets (dbaxps0220) wrote :

Thanks a lot . I will do it at my earliest convenience this night. Haswell i4770 is installed on small server 32 GB. Department's policy doesn't allow me to test Ubuntu whichever release on bare metal. I could test only on outdated CPU's box and it seems to be a core reason.

Boris Derzhavets (dbaxps0220) wrote :

Done on Penryn's box

Thanks Boris!

@tstrike - is your system also "really an old penryn" or is it something newer?
Maybe share /proc/cpuinfo?

tstrike (tstrike34) wrote :
Download full text (3.5 KiB)

tstrike39@islandhealthcenter-media:~$ sudo cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
stepping : 10
microcode : 0xa0b
cpu MHz : 2416.548
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti tpr_shadow vnmi flexpriority dtherm
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5303.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
stepping : 10
microcode : 0xa0b
cpu MHz : 2010.620
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti tpr_shadow vnmi flexpriority dtherm
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5303.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
stepping : 10
microcode : 0xa0b
cpu MHz : 2419.534
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm pti tpr_shadow vnmi flexpriority dtherm
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5303.23
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
stepping : 10
microcode : 0xa0b
cpu MHz : 1988.790
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse...

Read more...

It detects host as Penryn as well for @tstrike.
Which is fine if it is a chip of around that era.
He reported to have an "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz"
And for that chip the detection and chip used might be correct.

So to summarize all repro fails, but on Penryn ERA chips 2/2 cases trigger the bug.

I wonder if one that wants to reproduce needs a system with such a chip as well then to test and trigger this.

There should be plenty of people on CC as this is mirrored to qemu-devel due to the upstream qemu task. Is there an microarchitectural x86 specialist that knows if the chips of that generation have some special issues in regard to VMX that might explain what we see here?

It would be great if everyone could ask around for more systems with chips of that era.
Maybe we can further bisect which work and which will fail.

Andreas Hasenack (ahasenack) wrote :

I tried launching a focal vm on a focal host, and the vm launched but is in a paused state.

Attached is its log.

This is on an old E660 intel core system.

Andreas Hasenack (ahasenack) wrote :

/proc/cpuinfo

Andreas Hasenack (ahasenack) wrote :

I also have these two apparmor denied messages in dmesg:
[ 1380.529549] audit: type=1400 audit(1584023445.093:139): apparmor="DENIED" operation="open" profile="libvirt-aa346a1d-8caa-4c55-bef9-c3acbe17bdac" name="/" pid=19712 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=64055 ouid=0
[ 1380.529856] audit: type=1400 audit(1584023445.093:140): apparmor="DENIED" operation="open" profile="libvirt-aa346a1d-8caa-4c55-bef9-c3acbe17bdac" name="/sys/bus/nd/devices/" pid=19712 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=64055 ouid=0

And one last bit of info, this system booted with mitigations=off.

Andreas Hasenack (ahasenack) wrote :

virsh capabilities

Andreas Hasenack (ahasenack) wrote :

virsh domcapabilities

tstrike (tstrike34) wrote :

AppArmor is completely disabled on my server.

Andreas Hasenack (ahasenack) wrote :

After changing cpu to <cpu mode='host-model'/>:

I got this log (still in a paused state):
char device redirected to /dev/pts/3 (label charserial0)
2020-03-12T15:06:22.560159Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-vnmi-pending [bit 22]
2020-03-12T15:06:22.560708Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-secondary-ctls [bit 31]
2020-03-12T15:06:22.560971Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48BH).vmx-apicv-xapic [bit 0]
2020-03-12T15:06:22.561208Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48DH).vmx-vnmi [bit 5]
2020-03-12T15:06:22.561392Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(480H).vmx-ins-outs [bit 54]
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88

By using host-model Andreas also was able to get the same signature:

2020-03-12T15:06:22.560159Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-vnmi-pending [bit 22]
2020-03-12T15:06:22.560708Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-secondary-ctls [bit 31]
2020-03-12T15:06:22.560971Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48BH).vmx-apicv-xapic [bit 0]
2020-03-12T15:06:22.561208Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48DH).vmx-vnmi [bit 5]
2020-03-12T15:06:22.561392Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(480H).vmx-ins-outs [bit 54]
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88

So the warnings seem to depend a bit on which chip type we try to be to the guest.
We can ignore them for now.

What stays is the emulation error on this kind of chip.
I'll try to write up some tests to check different qemu and kernel levels to further corner what we are looking at.

Boris Derzhavets (dbaxps0220) wrote :

Penryn's architecture confirmed

Boris Derzhavets (dbaxps0220) wrote :

Seems to work fine on i4790 (Haswell) box.

Yeah @Boris - it really seems to be an issue bound to the Merom/Penryn processor generation.

I asked Andreas to check through some kernels and qemu versions so that we maybe eventually can consider bisecting something. But that will take a bit of time.

Of course everyone able to spend some time can consider checking a few kernels of [1] as well (probably the easiest test to begin with).

Still if there is an x86-microarchitecture expert out there that say "ah penryn, I know we added/dropped ... " please speak up :-)

[1]: https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D

The vmx things make me wonder about a fix Paolo did a while ago for enabling inidivudal vmx features rather than vmx as a whole; but I can't remember if that was a kernel or qemu fix.

One thing I notice, that may be a red-herring, all of the machine code in the errors are 'f3 0f 1e fb' which is the new 'endbr32' security instruction - but that's really a rep nop, that I thought old instructions can handle anyway??

Andreas was so kind to try kenels 4.4, 4.15 and 5.6 all fail (with qemu 4.2)
He then tried Eoan (qemu 4.0) and Focal (qemu 4.2).
4.0 worked and 4.2 failed.

We will set up a bisect run on Monday and hopefully find the offending change.

@David - I agree that the messages might be a red-herring, but to be sure was that fix between 4.0 and 4.2?

I think the one I was thinking of is 0723cc8a5558c94388db75ae1f4991314914edd3 which is in a 4.2.0 rc
and there was 2605188240f939fa9ae9353f53a0985620b34769 - but that's a different crash to what you have.

So hmm.

Thanks David!

While bisecting on upstream git with just "-cpu Penryn" we have seen that it always works there.
So it might be an interaction with some Ubuntu build/packaging/configure detail together with these old chips.

While we still can't be sure if the VMX warnings are a red-herring chances are that only "-cpu Penryn,vmx=on" will trigger the issue - Andreas will test and bisect with that once he is back online - we will see if that is any different.

I'll also build a Ubuntu'esque 4.2 with the Penryn changes of [1] reverted just to complete the interim picture of our testing. That is available for testing at [2]. Further I added a Ubuntu build with rather crude reverts of almost all VMX related 4.2 changes.

[1]: https://git.qemu.org/?p=qemu.git;a=commit;h=0723cc8a5558c94388db75ae1f4991314914edd3
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1866870-qemu-penryn-crash
[3]: https://launchpad.net/~paelzer/+archive/ubuntu/bug-1866870-qemu-penryn-crash-fullreverts

Boris Derzhavets (dbaxps0220) wrote :

No luck when testing [2]. Reports are attached

Boris Derzhavets (dbaxps0220) wrote :

Log file for f31wks guest

Boris Derzhavets (dbaxps0220) wrote :

Verification new packages to be installed

Boris Derzhavets (dbaxps0220) wrote :
Andreas Hasenack (ahasenack) wrote :

The package from the PPA failed the same way for me:

ubuntu@f1:~$ qemu-system-x86_64 --enable-kvm -cpu Penryn,vmx=on -m 512 --nodefaults --nographic
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1 [bit 19]
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88
^Cqemu-system-x86_64: terminating on signal 2

ubuntu@f1:~$ qemu-system-x86_64 --help 2>&1|head -n 1
QEMU emulator version 4.2.0 (Debian 1:4.2-3ubuntu3~ppa1)
ubuntu@f1:~$

Andreas Hasenack (ahasenack) wrote :

Also crashed with the packages from the other ppa:

ubuntu@f1:~$ qemu-system-x86_64 --help 2>&1|head -n 1
QEMU emulator version 4.2.0 (Debian 1:4.2-3ubuntu3~exp1)

ubuntu@f1:~$ qemu-system-x86_64 --enable-kvm -cpu Penryn,vmx=on -m 512 --nodefaults --nographic
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1 [bit 19]
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88

ubuntu@f1:~$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: 1:4.2-3ubuntu3~exp1
  Candidate: 1:4.2-3ubuntu3~exp1
  Version table:
 *** 1:4.2-3ubuntu3~exp1 500
        500 http://ppa.launchpad.net/paelzer/bug-1866870-qemu-penryn-crash-fullreverts/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status
     1:4.2-3ubuntu2 500
        500 http://br.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Andreas Hasenack (ahasenack) wrote :

Ok, upstream tag v4.2.0 and these configure options reproduced the crash:

export LDFLAGS="-Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,--as-needed"
export CFLAGS="-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g"

Full configure output: https://paste.ubuntu.com/p/Tzq6pDWD9R/

$ ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -cpu Penryn,vmx=on -m 512 --nodefaults --nographic
qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.01H:ECX.sse4.1 [bit 19]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-vnmi-pending [bit 22]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48EH).vmx-secondary-ctls [bit 31]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48BH).vmx-apicv-xapic [bit 0]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48BH).vmx-wbinvd-exit [bit 6]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48DH).vmx-vnmi [bit 5]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48FH).vmx-exit-load-perf-global-ctrl [bit 12]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(490H).vmx-entry-load-perf-global-ctrl [bit 13]
qemu-system-x86_64: warning: host doesn't support requested feature: MSR(480H).vmx-ins-outs [bit 54]
KVM internal error. Suberror: 1
emulation failure
EAX=00000000 EBX=00000000 ECX=000086d4 EDX=00000000
ESI=00000000 EDI=00000000 EBP=000086d4 ESP=00006d7c
EIP=00007acf EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =f000 000f0000 ffffffff 00809b00
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 000f6200 00000037
IDT= 00000000 000003ff
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=b8 90 d9 00 00 66 e8 6b f7 ff ff 66 b8 0a 00 00 00 e9 61 f2 <f3> 0f 1e fb 66 57 66 56 66 53 66 53 66 89 c7 67 66 89 14 24 66 89 ce 66 e8 15 f8 ff ff 88
^Cqemu-system-x86_64: terminating on signal 2

I've tested smoe more cmbinations and found that I van have v4.2 work on focal.
Eventually I have realized that when I install start the qemu from Ubuntu not only that but also the formerly working build of v4.2.0 from git start to fail (without rebuilding).

A bit of package bisect later I found seabios to be related.
Focal is at 1.13.0-1
Eoan is at 1.12.0-1

After I knew that I verified and found it really only triggers on seabios 1.13.0.

With 1.13 I was also able to break the qemu v4.0.0 git build on eoan.
As well as the packaged qemu in Eoan.

So it seems we are actually looking for a problem of seabios (instead of qemu) with the Penryn chip.

I'll look at their changelog and bisect that tomorrow as time permits

I wanted to make sure why different qemu configs make it trigger or not, and after finding seabios to be related the candidates were obvious.

Default config gets us:
BIOS directory /usr/local/share/qemu

The long conf had:
--firmwarepath=/usr/share/qemu:/usr/share/seabios:/usr/lib/ipxe/qemu

Adding that to the short config which had most things disabled made it break as well.
Since it has much less moving parts having most other features disabled I'll continue to use that.

With that confirmed I checked if I can just point to a bios to break it, and indeed adding
  -bios /root/seabios_1.12.0-1/usr/share/seabios/bios.bin
  -bios /root/seabios_1.13.0-1/usr/share/seabios/bios.bin
respectively is a make or break change.

As a next step I reproduced the error with seabios rel-1.13.0 from https://review.coreboot.org/seabios.git.
It crashes as well.

But to make this puzzle even more interesting rel-1.12.0 from the same git crashes as well.
I wonder where this trip might end, from qemu to seabios to ... compiler?

Turns out 1.12 is a fairly old build and the working version in Ubuntu was from in Disco, therefore about a year ago.
=> https://launchpad.net/ubuntu/+source/seabios/1.12.0-1/+build/16284605

Therefore I built it in Eoan and even Disco.
As an overview:
Disco: gcc 4:8.3.0-1ubuntu3 binutils 2.32-7ubuntu4
Eoan: gcc 4:9.2.1-3.1ubuntu1 binutils 2.33-2ubuntu1.2
Focal: gcc 4:9.2.1-3.1ubuntu1 binutils 2.34-4ubuntu1

I ended up with these binaries to test:
./git-built-in-eoan/rel-1.12/bios.bin Breaks
./git-built-in-eoan/rel-1.13/bios.bin Breaks
./git-built-in-disco/rel-1.13/bios.bin Works
./git-built-in-disco/rel-1.12/bios.bin Works
./git-built-in-focal/head/bios.bin Breaks
./git-built-in-focal/rel-1.12/bios.bin Breaks
./git-built-in-focal/rel-1.13/bios.bin Breaks
./packaging/disco-seabios_1.12.0-1/bios.bin Works
./packaging/focal-seabios_1.13.0-1/bios.bin Breaks

To summarize:
- qemu breaks on chips of the Penryn generation
- it only breaks if the seabios bios is executed
- does not really depend on seabios or qemu version
- but it depends on seabios build environment

That's getting more fun :-)
You could look at whether seabios's config works out hte same in the two environments, or whether something makes use of new build flags - try looking at the gcc lines that are invoked in the good/bad cases and see if they're passing any options that the other doesn't.

Starting from the Disco build env that I had I changed the packages

Step #1 binutils:
Unpacking binutils-x86-64-linux-gnu (2.33-2ubuntu1.2) over (2.32-7ubuntu4) ...
Unpacking libbinutils:amd64 (2.33-2ubuntu1.2) over (2.32-7ubuntu4) ...
Unpacking binutils (2.33-2ubuntu1.2) over (2.32-7ubuntu4) ...
Unpacking binutils-common:amd64 (2.33-2ubuntu1.2) over (2.32-7ubuntu4) ...

=> Still working

Step #2 gcc:
Unpacking libubsan1:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libtsan0:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking gcc-9-base:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libstdc++6:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libquadmath0:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking liblsan0:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libitm1:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libgomp1:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libcc1-0:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libatomic1:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libasan5:amd64 (9.2.1-9ubuntu2) over (9.1.0-2ubuntu2~19.04) ...
Unpacking libgcc1:amd64 (1:9.2.1-9ubuntu2) over (1:9.1.0-2ubuntu2~19.04) ...
Unpacking libisl21:amd64 (0.21-2) ...
Unpacking cpp-9 (9.2.1-9ubuntu2) ...
Unpacking libgcc-9-dev:amd64 (9.2.1-9ubuntu2) ...
Unpacking gcc-9 (9.2.1-9ubuntu2) ...
Unpacking libstdc++-9-dev:amd64 (9.2.1-9ubuntu2) ...
Unpacking g++-9 (9.2.1-9ubuntu2) ...
Unpacking g++ (4:9.2.1-3.1ubuntu1) over (4:8.3.0-1ubuntu3) ...
Unpacking gcc (4:9.2.1-3.1ubuntu1) over (4:8.3.0-1ubuntu3) ...
Unpacking cpp (4:9.2.1-3.1ubuntu1) over (4:8.3.0-1ubuntu3) ...

=> now it is breaking

One thing that we have seen to cause breakage in other cases was the new default to enable:
  -fcf-protection

The code already carries quite a bunch of similar "no" rules:
COMMONCFLAGS += $(call cc-option,$(CC),-nopie,)
COMMONCFLAGS += $(call cc-option,$(CC),-fno-pie,)
COMMONCFLAGS += $(call cc-option,$(CC),-fno-stack-protector,)
COMMONCFLAGS += $(call cc-option,$(CC),-fno-stack-protector-all,)
COMMONCFLAGS += $(call cc-option,$(CC),-fstack-check=no,)
COMMONCFLAGS += $(call cc-option,$(CC),-Wno-address-of-packed-member,)

Lets add to that:
COMMONCFLAGS += $(call cc-option,$(CC),-fcf-protection=none,)

=> That made it work \o/ !

Changed in qemu:
status: New → Invalid
Changed in qemu (Ubuntu):
status: Incomplete → Invalid
Changed in seabios (Ubuntu):
status: New → Triaged

I *think* it's the cf-protection that's adding the endbr32 instructions that I spotted as being the failing instruction each time; but I don't understand why they would be CPU type specific.

Sent to seabios for their consideration:
=> https://<email address hidden>/thread/IXAWMA2HWW75LSR3NBBYQKWT3TI5WVVP/

I deleted the experimental PPAs that we had and created a new one with a new seabios:
=> https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3982

An MP is open to fix this in Focal:
=> https://code.launchpad.net/~paelzer/ubuntu/+source/seabios/+git/seabios/+merge/380881

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package seabios - 1.13.0-1ubuntu1

---------------
seabios (1.13.0-1ubuntu1) focal; urgency=medium

  * d/p/lp-1866870-build-use-fcf-protection-none-when-available.patch
    fix breakage on older chips due to fcf-protection (LP: #1866870)

 -- Christian Ehrhardt <email address hidden> Thu, 19 Mar 2020 13:10:10 +0100

Changed in seabios (Ubuntu):
status: Triaged → Fix Released
tstrike (tstrike34) wrote :

Can I just apt update && apt upgrade to get this fix or do I need to patch?

Andreas Hasenack (ahasenack) wrote :

apt-get will be enough once it's published, and looks like it just was published.

Boris Derzhavets (dbaxps0220) wrote :

Works for me. F31 KVM guest is installing on Q9550 box.

tstrike (tstrike34) wrote :

I can confirm that this bug has been fixed (zapped). Thank you all for your hard work and determination. A job well done indeed! As a former programmer I love you all's zeal for attacking this bug.

As a side note, knowing what you all go through, I always look things up, walk through at least level 1 stuff and provide logfiles. I hope what little I did, help you all resolve this.

Again my thanks, and I believe is this is where we say, "Please close the bug marked SOLVED".

Thank you Boris and Tstrike for the report and your help.
It was a great bug to identify and fix before the release of 20.04, I appreciate you using (and hereby testing) it ahead of time!

Hello!
Unfortunately the bug has apparently reappeared. I have a Windows 10 running in a VM, which after my today's "apt upgrade" goes into pause mode after a few seconds of running time.

Tail output of my /var/log/libvirt/qemu/win10.log
char device redirected to /dev/pts/1 (label charserial0)
2020-05-05T08:53:23.733051Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48FH).vmx-exit-load-perf-global-ctrl [bit 12]
2020-05-05T08:53:23.733122Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(490H).vmx-entry-load-perf-global-ctrl [bit 13]
2020-05-05T08:53:23.736093Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(48FH).vmx-exit-load-perf-global-ctrl [bit 12]
2020-05-05T08:53:23.736110Z qemu-system-x86_64: warning: host doesn't support requested feature: MSR(490H).vmx-entry-load-perf-global-ctrl [bit 13]
2020-05-05T08:54:04.912098Z qemu-system-x86_64: terminating on signal 15 from pid 1593 (/usr/sbin/libvirtd)
2020-05-05 08:54:05.112+0000: shutting down, reason=destroyed

Apt upgraded packages (from /var/log/apt/history.log):
Start-Date: 2020-05-05 09:32:02
Commandline: apt upgrade
Requested-By: andreas (1000)
Install: linux-image-5.4.0-29-generic:amd64 (5.4.0-29.33, automatic), linux-modules-extra-5.4.0-29-generic:amd64 (5.4.0-29.33, automatic), linux-headers-5.4.0-29-generic:amd64 (5.4.0-29.33, automatic), linux-modules-5.4.0-29-generic:amd64 (5.4.0-29.33, automatic), linux-headers-5.4.0-29:amd64 (5.4.0-29.33, automatic)
Upgrade: linux-headers-generic:amd64 (5.4.0.28.33, 5.4.0.29.34), linux-libc-dev:amd64 (5.4.0-28.32, 5.4.0-29.33), linux-image-generic:amd64 (5.4.0.28.33, 5.4.0.29.34), linux-generic:amd64 (5.4.0.28.33, 5.4.0.29.34)
End-Date: 2020-05-05 09:33:11

Kind regards,
   Andreas

Hi Andreas,
so the only upgrade you did to trigger this for you was to bump the kernel from 5.4.0-28.33 to 5.4.0-29.34 - nothing else? I have not (yet?) heard other similar reports, but it might be just too early?
At least on my system for now things still work with the new kernel like before.

I'd recommend filing a new bug, refer to this one as maybe being related and adding the following right away:
- kernel version (you have this here I know)
- qemu/libvirt/seabios/ovmf version (if you don't mind just attach `dpkg -l`)
- guest XML (if using libvirt) otherwise the qemu command-line
- add a cross check and report what happens with other guests configs (e.g. non windows, using
  another bios as the former issue was tied to seabios, use different guest CPU types)
- the full /var/log/apt/history.log
- a date when you last started the VM successfully (not just still-had-it-running, but started it)
  and the date when it started to fail (probably yesterday then I guess)

Hi Christian.

Just filed bug: #1877052

Fabrice Moyen (fmoyen) wrote :

Same issue here. I've upgraded my IBM Power ppc64le system to ubuntu 20.04. Now I'm trying to create KVM VMs and whatever I'm doing, the VM is created but before any installation step starts, it's falling into "paused" mode. When trying to resume it, I get:
"
Error unpausing domain: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required
"

Any workaround ? Do I need to reinstall Ubuntu 18.04 ?

Fabrice: That's probably a different error given that this lp seems to be with x86 vmx flags.
Check your /var/log/libvirt/qemu/ on the host to see if there's a particular error shown in the destination qemu after migration.

Fabrice Moyen (fmoyen) wrote :

David: Indeed ! How stupid I am. I missed the root cause inside QUEMU log file. This was clear enough...
error: kvm run failed Device or resource busy
This is probably because your SMT is enabled.

So I switch SMT (Power Simultaneous Multi-Threading) off and now it's OK; VMs are running and installing.

It had been years since I last touched KVM on Power and I lost my reflexes.
So please forget my precedent comment telling I had the same issue on Power platform. It was similar symptoms but not the same problem.

To post a comment you must log in.