nova-compute sets wrong CPU capabilities => guest kernel Oops'es on boot

Bug #1618473 reported by Timur Nurlygayanov on 2016-08-30
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series   Importance   Assigned to
10.0.x   High         Alexei Sheplyakov
5.1.x    High         Alexey Stupnikov
6.0.x    High         Alexey Stupnikov
6.1.x    High         Alexey Stupnikov
7.0.x    High         Alexey Stupnikov
8.0.x    High         Alexey Stupnikov
9.x      High         Alexei Sheplyakov

Bug Description

Steps To Reproduce:

1. Find a CPU which supports AVX2 instructions (such as Intel Core i7)
2. Install a host kernel and qemu which can't handle AVX2 instructions in guest mode (for instance, the ones shipped with Ubuntu 14.04)
3. Define a VM (via libvirt) with the 'cpu_mode' parameter set to 'host-model' (<cpu mode='host-model'> in the VM XML definition)
4. Boot a kernel which can make use of AVX2 instructions (such as Linux 4.4.x) in the VM

Expected result: the *guest* kernel boots

Actual result: the *guest* kernel panics during the boot while trying to execute an AVX2 instruction (see http://xsnippet.org/361927)

Customer Impact:
Depending on the hardware, Ubuntu 16.04 cloud images can be unusable "out of the box" on OpenStack clouds deployed with Fuel/MOS 9.0 (and earlier versions). Most likely the problem also affects other kernels/drivers which make use of SIMD instructions.

Note: there are at least 3 factors which influence the problem:

- the host CPU (should be more or less recent)
- the host kernel and qemu which can't handle some of the SIMD instructions in the guest mode
- the guest kernel (which makes use of those instructions)

The root cause is that libvirt specifies an impossible combination of CPU model and capabilities, so the kernel Oops'es when trying to use instructions which are reported as supported (when in reality they aren't).
This wrong set of CPU capabilities is a result of <cpu mode='host-model'> in the libvirt domain XML. This mode is documented to produce unexpected results (http://libvirt.org/formatdomain.html#elementsCPU):

"Beware, due to the way libvirt detects host CPU and due to the fact libvirt does not talk to QEMU/KVM when creating the CPU model, CPU configuration created using host-model may not work as expected. The guest CPU may differ from the configuration and it may also confuse guest OS by using a combination of CPU features and other parameters (such as CPUID level) that don't work."

The wrong/inappropriate 'cpu mode' is set by Fuel in /etc/nova/nova.conf:

[libvirt]
inject_partition=-2
inject_password=False
disk_cachemodes="file=directsync,block=none"
cpu_mode=host-model

Fuel should not set a risky/problematic cpu_mode by default (it's better to leave this parameter unset).
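
To check whether a given compute node carries this setting, a minimal Python sketch along the following lines could be used. The function name libvirt_cpu_mode and the SAMPLE text are illustrative assumptions, not part of Fuel or nova; the sample simply repeats the fragment quoted above.

```python
import configparser


def libvirt_cpu_mode(conf_text):
    """Return the [libvirt] cpu_mode value from nova.conf text, or None if unset.

    Hypothetical helper for illustration only.
    """
    # strict=False tolerates repeated options; interpolation=None keeps
    # any '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.get("libvirt", "cpu_mode", fallback=None)


# Sample text copied from the nova.conf fragment in the description.
SAMPLE = """\
[libvirt]
inject_partition=-2
inject_password=False
disk_cachemodes="file=directsync,block=none"
cpu_mode=host-model
"""

if __name__ == "__main__":
    print(libvirt_cpu_mode(SAMPLE))
```

On a real compute node the text would come from reading /etc/nova/nova.conf; a result of None means the parameter is unset.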

tags: added: area-linux
description: updated
description: updated
Alexei Sheplyakov (asheplyakov) wrote :

Looks like Fuel misconfigures nova-compute so it reports wrong CPU capabilities (and the newer kernel tries to make use of them and Oopses).

Could you please try reproducing the problem with Ubuntu 14.04 *guest* running 4.4.0-x kernel?
Also what happens if you downgrade the kernel in xenial?

Alexei Sheplyakov (asheplyakov) wrote :

> Looks like Fuel misconfigures nova-compute so it reports wrong CPU capabilities (and the newer kernel tries to make use of them and Oopses).

Which is by the way the reason why nova-compute deployed by Fuel (on a VM) can't make use of nested virtualization.

Alexei Sheplyakov (asheplyakov) wrote :

Please attach

- the libvirt XML of the VM experiencing the problem (virsh dumpxml instance-xyz on the compute node running that VM)
- /proc/cpuinfo from the compute node running the VM in question
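
As a rough way to check the second attachment for the feature in question, a minimal Python sketch could collect the flags of the first CPU listed in /proc/cpuinfo text. The function name cpu_flags is an illustrative assumption, not part of any tool mentioned in this report.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text.

    Hypothetical helper for illustration only.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example, non-x86 or truncated input).
    return set()


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        flags = cpu_flags(f.read())
    print("avx2 advertised by host CPU:", "avx2" in flags)
```

A host that advertises avx2 is only one of the three factors listed in the description; the host kernel/qemu and the guest kernel matter as well.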

Alexei Sheplyakov (asheplyakov) wrote :

> It is important to note that the issue not reproduced on virtual environments (with virtual OpenStack compute nodes)

Actually being unable to use the nested (hardware) virtualization on compute nodes deployed by Fuel is the same (or at least very similar) bug

Piotr Skamruk (jell) wrote :

I had a similar issue with a fedora23 cloud image segfaulting as a nested VM under ubuntu (16.04) in ubuntu (16.10), while on the same configuration an ubuntu 16.04 cloud image was running ok. Host kernel 4.4.0-34-generic.

With the same configuration the same fedora23 image works perfectly on a lab machine (host kernel 4.4.13 - arch distribution).

Key difference noticed: the lab machine has server e5 processors (where everything seems to work ok), while my failing setup has an i7 desktop chip.

Alexei Sheplyakov (asheplyakov) wrote :

The kernel 4.4.0 from Ubuntu 16.04 [1] (along with the corresponding initramfs image [2])
boots just fine on the compute node in question when running qemu manually
*without specifying CPU capabilities*:

qemu-system-x86_64 -enable-kvm \
        -m 4096 \
        -hda xenial-tmp.qcow2 \
        -kernel xenial-server-cloudimg-amd64-vmlinuz-generic \
        -initrd xenial-server-cloudimg-amd64-initrd-generic \
        -append 'root=/dev/sda1 console=ttyS0 init=/bin/bash' \
        -machine pc-1.0,accel=kvm,usb=off \
        -rtc base=utc \
        -serial mon:stdio \
        -vga none -nographic

[ 2.336938] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22985a87e67, max_idle_ns: 440795315743 ns
Begin: Loading essential drivers ... [ 33.255089] md: linear personality registered for level -1
[ 33.257775] md: multipath personality registered for level -4
[ 33.260461] md: raid0 personality registered for level 0
[ 33.263413] md: raid1 personality registered for level 1
[ 33.332011] raid6: sse2x1 gen() 9879 MB/s
[ 33.400007] raid6: sse2x1 xor() 7749 MB/s
[ 33.468006] raid6: sse2x2 gen() 12580 MB/s
[ 33.536006] raid6: sse2x2 xor() 8603 MB/s
[ 33.604006] raid6: sse2x4 gen() 14482 MB/s
[ 33.672006] raid6: sse2x4 xor() 9791 MB/s
[ 33.672408] raid6: using algorithm sse2x4 gen() 14482 MB/s
[ 33.672839] raid6: .... xor() 9791 MB/s, rmw enabled
[ 33.673255] raid6: using intx1 recovery algorithm
[ 33.675180] xor: measuring software checksum speed
[ 33.712005] prefetch64-sse: 15306.000 MB/sec
[ 33.752005] generic_sse: 14939.000 MB/sec
[ 33.752384] xor: using function: prefetch64-sse (15306.000 MB/sec)
[ 33.754373] async_tx: api initialized (async)
[ 33.764205] md: raid6 personality registered for level 6
[ 33.764648] md: raid5 personality registered for level 5
[ 33.765090] md: raid4 personality registered for level 4
[ 33.769554] md: raid10 personality registered for level 10

[1] http://cloud-images.ubuntu.com/xenial/current/unpacked/xenial-server-cloudimg-amd64-vmlinuz-generic
[2] http://cloud-images.ubuntu.com/xenial/current/unpacked/xenial-server-cloudimg-amd64-initrd-generic

Alexei Sheplyakov (asheplyakov) wrote :

However if I set the same CPU capabilities as libvirt does (due to <cpu mode='host-model'>)
the kernel Oops'es while trying to use avx2 instructions:

CPUFLAGS="-cpu Nehalem,+invpcid,+erms,+bmi2,+smep,+avx2,+bmi1,+fsgsbase,+abm,+rdtscp,+pdpe1gb,+rdrand,+f16c,+avx,+osxsave,+xsave,+tsc-deadline,+movbe,+x2apic,+dca,+pcid,+pdcm,+xtpr,+fma,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -smp 1,sockets=1,cores=1,threads=1"

qemu-system-x86_64 -enable-kvm \
        -m 4096 $CPUFLAGS \
        -hda xenial-tmp.qcow2 \
        -kernel xenial-server-cloudimg-amd64-vmlinuz-generic \
        -initrd xenial-server-cloudimg-amd64-initrd-generic \
        -append 'root=/dev/sda1 console=ttyS0 init=/bin/bash' \
        -machine pc-1.0,accel=kvm,usb=off \
        -rtc base=utc \
        -serial mon:stdio \
        -vga none -nographic

Begin: Loading essential drivers ... [ 33.325464] md: linear personality registered for level -1
[ 33.329338] md: multipath personality registered for level -4
[ 33.332992] md: raid0 personality registered for level 0
[ 33.337063] md: raid1 personality registered for level 1
[ 33.408007] raid6: sse2x1 gen() 8742 MB/s
[ 33.476005] raid6: sse2x1 xor() 7756 MB/s
[ 33.544008] raid6: sse2x2 gen() 12452 MB/s
[ 33.612004] raid6: sse2x2 xor() 8569 MB/s
[ 33.680005] raid6: sse2x4 gen() 14347 MB/s
[ 33.748005] raid6: sse2x4 xor() 9526 MB/s
[ 33.752013] invalid opcode: 0000 [#1] SMP
[ 33.752376] Modules linked in: raid6_pq(+) libcrc32c raid1 raid0 multipath linear psmouse floppy e1000
[ 33.753319] CPU: 0 PID: 610 Comm: modprobe Not tainted 4.4.0-34-generic #53-Ubuntu
[ 33.753978] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 33.754498] task: ffff8800d92acb00 ti: ffff8800dacac000 task.ti: ffff8800dacac000
[ 33.755151] RIP: 0010:[<ffffffffc0086a8d>] [<ffffffffc0086a8d>] raid6_avx21_gen_syndrome+0x3d/0x120 [raid6_pq]
[ 33.756002] RSP: 0018:ffff8800dacafb78 EFLAGS: 00010246
[ 33.756002] RAX: 0000000000000000 RBX: ffff8800dacafbc8 RCX: 00000000fffefbfe
[ 33.756002] RDX: 0000000000000080 RSI: 0000000000001000 RDI: 0000000000000012
[ 33.756002] RBP: ffff8800dacafba8 R08: 000000000000000a R09: 00000000000001b6
[ 33.756002] R10: 00000000fffefbed R11: 00000000000001b6 R12: 0000000000001000
[ 33.756002] R13: ffff8800d928e000 R14: ffff8800d928f000 R15: 0000000000000012
[ 33.756002] FS: 00007f5cdbac3700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[ 33.756002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 33.756002] CR2: 000055e32b7d3180 CR3: 00000000d9314000 CR4: 00000000001006f0
[ 33.756002] Stack:
[ 33.756002] 0000000000000080 ffffffffc00997a0 0000000000000001 0000000000004c36
[ 33.756002] ffffffffc0087238 ffff8800d928e000 ffff8800dacafc88 ffffffffc009e116
[ 33.756002] 00000000fffefbfe 0000000000003964 ffffffffc0089600 ffffffffc008a600
[ 33.756002] Call Trace:
[ 33.756002] [<ffffffffc009e116>] init_module+0x116/0x1000 [raid6_pq]
[ 33.756002] [<ffffffffc009e000>] ? 0xffffffffc009e000
[ 33.756002] [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[ 33.756002] [<ffffffff8182a018>] ? preempt_schedule_...

Alexei Sheplyakov (asheplyakov) wrote :

Removing +avx2 from the above CPUFLAGS allows the kernel to successfully complete the boot
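
The same edit can be sketched in Python for a "model,+feature,..." string like the one passed to -cpu above. The function name drop_cpu_features is an illustrative assumption, and the sketch only handles the "+name"/"-name" feature syntax seen in this report.

```python
def drop_cpu_features(cpu_spec, unwanted):
    """Remove the named features from a 'Model,+feat1,+feat2' CPU string.

    Hypothetical helper for illustration only. The first comma-separated
    item is treated as the CPU model and is always kept.
    """
    model, *features = cpu_spec.split(",")
    # Compare on the bare feature name so '+avx2' matches 'avx2' but not 'avx'.
    kept = [f for f in features if f.lstrip("+-") not in unwanted]
    return ",".join([model, *kept])


if __name__ == "__main__":
    print(drop_cpu_features("Nehalem,+bmi2,+avx2,+avx,+xsave", {"avx2"}))
```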

Alexei Sheplyakov (asheplyakov) wrote :

So the root cause is that nova-compute (indirectly) specifies incorrect/impossible CPU capabilities via <cpu mode='host-model'> in libvirt XML

By the way, the "host-model" mode is documented to be dangerous [1]:

"Beware, due to the way libvirt detects host CPU and due to the fact libvirt does not talk to QEMU/KVM when creating the CPU model, CPU configuration created using host-model may not work as expected. The guest CPU may differ from the configuration and it may also confuse guest OS by using a combination of CPU features and other parameters (such as CPUID level) that don't work. Until these issues are fixed, it's a good idea to avoid using host-model and use custom mode with just the CPU model from host capabilities XML."

[1] http://libvirt.org/formatdomain.html#elementsCPU

summary: - It is not possible to boot Ubuntu 16.04 xenial on hardware host with
- Ubuntu 14.04
+ nova-compute sets wrong CPU capabilities => guest kernel Oops'es on boot

In /etc/nova/nova.conf:

[libvirt]
inject_partition=-2
inject_password=False
disk_cachemodes="file=directsync,block=none"
cpu_mode=host-model

Looks like a fuel-library bug

description: updated

Hi MOS Puppet team, it looks like we need to fix the Puppet manifests and remove the "cpu_mode" parameter from /etc/nova/nova.conf on all compute nodes.

Could you please take a look?

Thank you!

information type: Public → Public Security
Adam Heczko (aheczko-mirantis) wrote :

Changed type to private security.

information type: Public Security → Private Security
information type: Private Security → Public

The bug has nothing to do with security:

- it's the *guest* kernel which Oops'es (thus no way to escalate privileges in host system)
- the problem happens during the boot when everything runs with UID 0 (thus escalating privileges in the *guest* system makes no sense)

Adam Heczko (aheczko-mirantis) wrote :

Changing back to public.

description: updated

Patch https://review.openstack.org/#/c/363124 was prepared by Alexei Sheplyakov and it should solve the issue.

Assigned to 7.0, 8.0, 9.0 and 10.0 milestones, we need to fix this issue in all supported releases.

We tested the issue on a MOS 8.0 environment and everything works fine with the Ubuntu 16.04 image.
Status changed to Invalid for MOS 8.0 version.

Alexey Stupnikov (astupnikov) wrote :

It turns out that the bug description missed an important point: you have explicitly set the KVM hypervisor in the environment settings. This is why there was a problem with MOS8: Fuel selects qemu by default, not KVM.

Alexey Stupnikov (astupnikov) wrote :

I have opened bug #1621067 about essential vCPU features being excluded after the merged patch, since it will affect customers and it would be nice to find some workarounds before the issue is escalated.

Alexey Stupnikov (astupnikov) wrote :

This bug affects all MOS distributions starting from MOS 5.1.1. The picture is the same: segfault after raid6_pq module is added.

Vitaly Sedelnik (vsedelnik) wrote :

Setting to Fix Committed for 9.1 - the review is https://review.openstack.org/#/c/365860/

Alexey Stupnikov (astupnikov) wrote :

Moving to Won't Fix for 5.1-updates and 6.0-updates as it is impossible to patch old Fuel.

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

tags: added: on-verification

Verified on MOS 7.0 + MU6 updates.

Actual results:
http://paste.openstack.org/show/586882/

tags: removed: on-verification
Dmitry Sutyagin (dsutyagin) wrote :

the resolution is incorrect, it breaks guest VM functionality for some customers

need to change "none" to "host-model,disable=avx2"
I'm testing this solution now

Dmitry Sutyagin (dsutyagin) wrote :

Ok it won't work this way apparently.

The proposal is to:
1. revert this fix as it makes matters worse for some customers
2. send out a workaround for affected customers, which is to:
--1. check the current model which libvirt chooses by running "ps -ef | grep qemu" on any problematic compute and looking for the -cpu directive; it should look like "-cpu Westmere,+invpcid,+erms,+bmi2,+smep..."
--2. edit /usr/share/libvirt/cpu_map.xml on computes: add a new SOMEMODELNAME model; inside it add the base model (Westmere in this example) and "<feature policy='disable' name='avx2'/>", and also add all the features listed by qemu, like so: "<feature name='invpcid'/>" and so on
--3. edit nova.conf on computes:
----1. cpu_mode=custom
----2. cpu_model=SOMEMODELNAME
--4. restart nova-compute on computes

Dmitry Sutyagin (dsutyagin) wrote :

ok so final update from me:

1. roll back this fix
2. propose the workaround to all affected customers, which is:
--1. check the current model which libvirt chooses by running "ps -ef | grep qemu" on any problematic compute and looking for the -cpu directive; it should look like "-cpu Westmere,+invpcid,+erms,+bmi2,+smep..." - save this parameter
--2. edit nova.conf:
----1. cpu_mode=custom
----2. cpu_model=<PASTE THE PARAMETER HERE, BUT REMOVE PROBLEMATIC EXTENSIONS, SUCH AS avx2>

Example:
cpu_model=Westmere,+invpcid,+erms,+bmi2,+smep,+bmi1,+fsgsbase,+abm,+rdtscp,+pdpe1gb,+rdrand,+f16c,+avx,+osxsave,+xsave,+tsc-deadline,+movbe,+pcid,+pdcm,+xtpr,+fma,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme

If customers do not have problematic extensions in their cpu parameter, they will have to use an earlier CPU model (one can be found in /usr/share/libvirt/cpu_map.xml) as the first word of the parameter, and manually determine and add all the other extensions which appear in newer models. This is again possible by looking at cpu_map.xml, which shows how newer CPU models derive from older ones and add features.
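
Step --1 of the workaround (finding the -cpu parameter in the "ps -ef | grep qemu" output) could be sketched in Python as below. The function name qemu_cpu_argument and the sample line are illustrative assumptions; the sketch only extracts the value and says nothing about whether nova accepts it as cpu_model, which is taken from the proposal above and not verified here.

```python
import shlex


def qemu_cpu_argument(cmdline):
    """Return the value following '-cpu' in a qemu command line, or None.

    Hypothetical helper for illustration only. 'cmdline' may be a full
    line of 'ps -ef' output; the leading columns are simply skipped over.
    """
    args = shlex.split(cmdline)
    for i, arg in enumerate(args[:-1]):
        if arg == "-cpu":
            return args[i + 1]
    return None


if __name__ == "__main__":
    # Shortened, made-up example of a qemu process line.
    line = "qemu-system-x86_64 -enable-kvm -cpu Westmere,+invpcid,+erms -m 4096"
    print(qemu_cpu_argument(line))
```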

Alexey Stupnikov (astupnikov) wrote :

Dmitry, I agree that our solution should be a bit more flexible than just setting cpu_mode to none. On the other hand, I have discussed it with Alexey Sheplyakov and the nova and maintenance teams, and they prefer the merged change.

Gregory Elkinbard (gelkinbard) wrote :

Current fix does not work for telco

This issue was fixed in the openstack/fuel-library 10.0.0 release.

no longer affects: fuel-ccp