Bug #1618473 “nova-compute sets wrong CPU capabilities => guest k...”

Timur Nurlygayanov (tnurlygayanov) on 2016-08-30

tags:

added: area-linux

Timur Nurlygayanov (tnurlygayanov) on 2016-08-30

description:

updated

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#1

Looks like Fuel misconfigures nova-compute so that it reports wrong CPU capabilities (and the newer kernel tries to make use of them and Oopses).

Could you please try reproducing the problem with an Ubuntu 14.04 *guest* running a 4.4.0-x kernel?
Also, what happens if you downgrade the kernel in xenial?

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#2

> Looks like Fuel misconfigures nova-compute so that it reports wrong CPU capabilities (and the newer kernel tries to make use of them and Oopses).

Which is, by the way, the reason why nova-compute deployed by Fuel (on a VM) can't make use of nested virtualization.

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#3

Please attach

- the libvirt XML of the VM experiencing the problem (virsh dumpxml instance-xyz on the compute node running that VM)
- /proc/cpuinfo from the compute node running the VM in question
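
A minimal sketch for checking the second item, assuming a POSIX shell and a grep that supports -m and -w (GNU grep does). The helper name has_cpu_flag is hypothetical; it reports whether a feature such as avx2 appears on the flags line of /proc/cpuinfo, so the result can be compared with the <cpu> element in the virsh dumpxml output.

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Succeeds (exit status 0) if FLAG is listed on the first "flags" line of
# CPUINFO_FILE (default: /proc/cpuinfo). Hypothetical helper name.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # -w matches whole words only, so asking for "avx" does not match "avx2".
    grep -m1 '^flags' "$file" | grep -qw -- "$flag"
}
```

For example, running has_cpu_flag avx2 && echo "host advertises avx2" on the compute node shows whether the physical CPU really offers the feature.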

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#4

> It is important to note that the issue not reproduced on virtual environments (with virtual OpenStack compute nodes)

Actually, being unable to use nested (hardware) virtualization on compute nodes deployed by Fuel is the same (or at least a very similar) bug.

Piotr Skamruk (jell) wrote on 2016-08-30:

#5

I had a similar issue with a Fedora 23 cloud image segfaulting as a nested VM under Ubuntu (16.04) in Ubuntu (16.10), while on the same configuration the Ubuntu 16.04 cloud image ran fine. Host kernel: 4.4.0-34-generic.

With the same configuration, the same Fedora 23 image works perfectly on a lab machine (host kernel 4.4.13, Arch distribution).

Key difference noticed: the lab machine has server E5 processors (where everything seems to work), while my failing setup has an i7 desktop chip.

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#6

The kernel 4.4.0 from Ubuntu 16.04 [1] (along with the corresponding initramfs image [2])
boots just fine on the compute node in question when running qemu manually
*without specifying CPU capabilities*:

qemu-system-x86_64 -enable-kvm \
        -m 4096 \
        -hda xenial-tmp.qcow2 \
        -kernel xenial-server-cloudimg-amd64-vmlinuz-generic \
        -initrd xenial-server-cloudimg-amd64-initrd-generic \
        -append 'root=/dev/sda1 console=ttyS0 init=/bin/bash' \
        -machine pc-1.0,accel=kvm,usb=off \
        -rtc base=utc \
        -serial mon:stdio \
        -vga none -nographic

[    2.336938] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22985a87e67, max_idle_ns: 440795315743 ns
Begin: Loading essential drivers ... [   33.255089] md: linear personality registered for level -1
[   33.257775] md: multipath personality registered for level -4
[   33.260461] md: raid0 personality registered for level 0
[   33.263413] md: raid1 personality registered for level 1
[   33.332011] raid6: sse2x1   gen()  9879 MB/s
[   33.400007] raid6: sse2x1   xor()  7749 MB/s
[   33.468006] raid6: sse2x2   gen() 12580 MB/s
[   33.536006] raid6: sse2x2   xor()  8603 MB/s
[   33.604006] raid6: sse2x4   gen() 14482 MB/s
[   33.672006] raid6: sse2x4   xor()  9791 MB/s
[   33.672408] raid6: using algorithm sse2x4 gen() 14482 MB/s
[   33.672839] raid6: .... xor() 9791 MB/s, rmw enabled
[   33.673255] raid6: using intx1 recovery algorithm
[   33.675180] xor: measuring software checksum speed
[   33.712005]    prefetch64-sse: 15306.000 MB/sec
[   33.752005]    generic_sse: 14939.000 MB/sec
[   33.752384] xor: using function: prefetch64-sse (15306.000 MB/sec)
[   33.754373] async_tx: api initialized (async)
[   33.764205] md: raid6 personality registered for level 6
[   33.764648] md: raid5 personality registered for level 5
[   33.765090] md: raid4 personality registered for level 4
[   33.769554] md: raid10 personality registered for level 10

[1] http://cloud-images.ubuntu.com/xenial/current/unpacked/xenial-server-cloudimg-amd64-vmlinuz-generic
[2] http://cloud-images.ubuntu.com/xenial/current/unpacked/xenial-server-cloudimg-amd64-initrd-generic

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#7

However if I set the same CPU capabilities as libvirt does (due to <cpu mode='host-model'>)
the kernel Oops'es while trying to use avx2 instructions:

CPUFLAGS="-cpu Nehalem,+invpcid,+erms,+bmi2,+smep,+avx2,+bmi1,+fsgsbase,+abm,+rdtscp,+pdpe1gb,+rdrand,+f16c,+avx,+osxsave,+xsave,+tsc-deadline,+movbe,+x2apic,+dca,+pcid,+pdcm,+xtpr,+fma,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -smp 1,sockets=1,cores=1,threads=1"

qemu-system-x86_64 -enable-kvm \
        -m 4096 $CPUFLAGS \
        -hda xenial-tmp.qcow2 \
        -kernel xenial-server-cloudimg-amd64-vmlinuz-generic \
        -initrd xenial-server-cloudimg-amd64-initrd-generic \
        -append 'root=/dev/sda1 console=ttyS0 init=/bin/bash' \
        -machine pc-1.0,accel=kvm,usb=off \
        -rtc base=utc \
        -serial mon:stdio \
        -vga none -nographic

Begin: Loading essential drivers ... [   33.325464] md: linear personality registered for level -1
[   33.329338] md: multipath personality registered for level -4
[   33.332992] md: raid0 personality registered for level 0
[   33.337063] md: raid1 personality registered for level 1
[   33.408007] raid6: sse2x1   gen()  8742 MB/s
[   33.476005] raid6: sse2x1   xor()  7756 MB/s
[   33.544008] raid6: sse2x2   gen() 12452 MB/s
[   33.612004] raid6: sse2x2   xor()  8569 MB/s
[   33.680005] raid6: sse2x4   gen() 14347 MB/s
[   33.748005] raid6: sse2x4   xor()  9526 MB/s
[   33.752013] invalid opcode: 0000 [#1] SMP 
[   33.752376] Modules linked in: raid6_pq(+) libcrc32c raid1 raid0 multipath linear psmouse floppy e1000
[   33.753319] CPU: 0 PID: 610 Comm: modprobe Not tainted 4.4.0-34-generic #53-Ubuntu
[   33.753978] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   33.754498] task: ffff8800d92acb00 ti: ffff8800dacac000 task.ti: ffff8800dacac000
[   33.755151] RIP: 0010:[<ffffffffc0086a8d>]  [<ffffffffc0086a8d>] raid6_avx21_gen_syndrome+0x3d/0x120 [raid6_pq]
[   33.756002] RSP: 0018:ffff8800dacafb78  EFLAGS: 00010246
[   33.756002] RAX: 0000000000000000 RBX: ffff8800dacafbc8 RCX: 00000000fffefbfe
[   33.756002] RDX: 0000000000000080 RSI: 0000000000001000 RDI: 0000000000000012
[   33.756002] RBP: ffff8800dacafba8 R08: 000000000000000a R09: 00000000000001b6
[   33.756002] R10: 00000000fffefbed R11: 00000000000001b6 R12: 0000000000001000
[   33.756002] R13: ffff8800d928e000 R14: ffff8800d928f000 R15: 0000000000000012
[   33.756002] FS:  00007f5cdbac3700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[   33.756002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   33.756002] CR2: 000055e32b7d3180 CR3: 00000000d9314000 CR4: 00000000001006f0
[   33.756002] Stack:
[   33.756002]  0000000000000080 ffffffffc00997a0 0000000000000001 0000000000004c36
[   33.756002]  ffffffffc0087238 ffff8800d928e000 ffff8800dacafc88 ffffffffc009e116
[   33.756002]  00000000fffefbfe 0000000000003964 ffffffffc0089600 ffffffffc008a600
[   33.756002] Call Trace:
[   33.756002]  [<ffffffffc009e116>] init_module+0x116/0x1000 [raid6_pq]
[   33.756002]  [<ffffffffc009e000>] ? 0xffffffffc009e000
[   33.756002]  [<ffffffff81002123>] do_one_initcall+0xb3/0x200
[   33.756002]  [<ffffffff8182a018>] ? preempt_schedule_common+0x18/0x30
[   33.756002]  [<ffffffff8182a04c>] ? _cond_resched+0x1c/0x30
[   33.756002]  [<ffffffff811eb843>] ? kmem_cache_alloc_trace+0x183/0x1f0
[   33.756002]  [<ffffffff8118c763>] do_init_module+0x5f/0x1cf
[   33.756002]  [<ffffffff8110a1c7>] load_module+0x1667/0x1c00
[   33.756002]  [<ffffffff81106770>] ? __symbol_put+0x60/0x60
[   33.756002]  [<ffffffff81213180>] ? kernel_read+0x50/0x80
[   33.756002]  [<ffffffff8110a9a4>] SYSC_finit_module+0xb4/0xe0
[   33.756002]  [<ffffffff8110a9ee>] SyS_finit_module+0xe/0x10
[   33.756002]  [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[   33.756002] Code: 55 41 54 53 48 89 d3 48 8d 14 c5 00 00 00 00 41 89 ff 49 89 f4 48 83 ec 08 4c 8b 2c c3 4c 8b 74 13 08 48 89 55 d0 e8 f3 32 fb c0 <c5> fd 6f 05 8b 2e 01 00 c5 e5 ef db 4d 85 e4 48 8b 55 d0 0f 84 
[   33.756002] RIP  [<ffffffffc0086a8d>] raid6_avx21_gen_syndrome+0x3d/0x120 [raid6_pq]
[   33.756002]  RSP <ffff8800dacafb78>
[   33.773948] ---[ end trace 031120da9402feef ]---

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#8

Removing +avx2 from the above CPUFLAGS allows the kernel to successfully complete the boot
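
Dropping a feature by hand from a long -cpu list is error-prone, so here is a small sketch that does it mechanically. The function name drop_cpu_feature is hypothetical, and it expects only the comma-separated value of -cpu (without the trailing -smp options that CPUFLAGS above also carries).

```shell
# drop_cpu_feature SPEC FEATURE
# Prints SPEC (a comma-separated QEMU -cpu value such as
# "Nehalem,+avx,+avx2") with every "+FEATURE" entry removed.
# Hypothetical helper name.
drop_cpu_feature() {
    # One entry per line, drop exact matches of "+FEATURE", re-join with commas.
    # -x and -F make grep compare whole entries literally, so removing
    # "avx2" leaves "+avx" untouched.
    printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "+$2" | paste -sd, -
}

drop_cpu_feature 'Nehalem,+avx,+avx2' avx2   # prints Nehalem,+avx
```

The result can then be passed back to qemu-system-x86_64 as the -cpu value.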

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#9

So the root cause is that nova-compute (indirectly) specifies incorrect/impossible CPU capabilities via <cpu mode='host-model'> in the libvirt XML.

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#10

By the way, the "host-model" mode is documented to be dangerous [1]:

"Beware, due to the way libvirt detects host CPU and due to the fact libvirt does not talk to QEMU/KVM when creating the CPU model, CPU configuration created using host-model may not work as expected. The guest CPU may differ from the configuration and it may also confuse guest OS by using a combination of CPU features and other parameters (such as CPUID level) that don't work. Until these issues are fixed, it's a good idea to avoid using host-model and use custom mode with just the CPU model from host capabilities XML."

[1] http://libvirt.org/formatdomain.html#elementsCPU

summary:

- It is not possible to boot Ubuntu 16.04 xenial on hardware host with
- Ubuntu 14.04
+ nova-compute sets wrong CPU capabilities => guest kernel Oops'es on boot

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#11

In /etc/nova/nova.conf:

[libvirt]
inject_partition=-2
inject_password=False
disk_cachemodes="file=directsync,block=none"
cpu_mode=host-model

Looks like a fuel-library bug
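
To check which mode a given compute node is actually configured with, the [libvirt] section can be read directly. A rough sketch, assuming a POSIX awk; the function name nova_libvirt_option is hypothetical and the parsing is deliberately simple (no continuation lines, no inline comments).

```shell
# nova_libvirt_option FILE KEY
# Prints the value of KEY from the [libvirt] section of an INI-style FILE.
# Hypothetical helper name.
nova_libvirt_option() {
    awk -F= -v key="$2" '
        # A section header switches the "inside [libvirt]" state on or off.
        /^\[/ { in_libvirt = ($0 == "[libvirt]"); next }
        in_libvirt && NF > 1 {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                # Everything after the first "=" is the value, so values
                # that contain "=" themselves are kept whole.
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
            }
        }
    ' "$1"
}
```

With the configuration quoted above, nova_libvirt_option /etc/nova/nova.conf cpu_mode would print host-model.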

Alexei Sheplyakov (asheplyakov) on 2016-08-30

description:

updated

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-30:

#12

Hi MOS Puppet team, it looks like we need to fix the Puppet manifests and remove the "cpu_mode" parameter from /etc/nova/nova.conf on all compute nodes.

Could you please take a look?

Thank you!

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-30:

#13

https://review.openstack.org/363124

Maksim Malchuk (mmalchuk) on 2016-08-30

information type:

Public → Public Security

Adam Heczko (aheczko-mirantis) wrote on 2016-08-30:

#14

Changed type to private security.

information type:

Public Security → Private Security

Alexei Sheplyakov (asheplyakov) on 2016-08-31

information type:

Private Security → Public

Alexei Sheplyakov (asheplyakov) wrote on 2016-08-31:

#15

The bug has nothing to do with security:

- it's the *guest* kernel which Oops'es (thus no way to escalate privileges in host system)
- the problem happens during the boot when everything runs with UID 0 (thus escalating privileges in the *guest* system makes no sense)

Adam Heczko (aheczko-mirantis) wrote on 2016-08-31:

#16

Changing back to public.

Alexei Sheplyakov (asheplyakov) on 2016-08-31

description:

updated

Ivan Berezovskiy (iberezovskiy) wrote on 2016-08-31:

#17

Patch https://review.openstack.org/#/c/363124 was prepared by Alexei Sheplyakov and it should solve the issue.

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-08-31:

#18

Assigned to the 7.0, 8.0, 9.0 and 10.0 milestones; we need to fix this issue in all supported releases.

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-06:

#19

We tested the issue on a MOS 8.0 environment and everything works fine with the Ubuntu 16.04 image.
Status changed to Invalid for MOS 8.0.

Alexey Stupnikov (astupnikov) wrote on 2016-09-07:

#20

It turns out that the bug description missed an important point: you have explicitly set the KVM hypervisor in the environment settings. This is why there was a problem with MOS 8: Fuel selects qemu by default, not KVM.

Alexey Stupnikov (astupnikov) wrote on 2016-09-07:

#21

I have opened bug #1621067 about essential vCPU features being excluded after the patch was merged, since it will affect customers and it would be good to find some workarounds before this issue is escalated.

Alexey Stupnikov (astupnikov) wrote on 2016-09-08:

#22

This bug affects all MOS distributions starting from MOS 5.1.1. The picture is the same: segfault after raid6_pq module is added.

Vitaly Sedelnik (vsedelnik) wrote on 2016-09-09:

#23

Setting to Fix Committed for 9.1 - the review is https://review.openstack.org/#/c/365860/

Alexey Stupnikov (astupnikov) wrote on 2016-09-12:

#24

Moving to Won't Fix for 5.1-updates and 6.0-updates as it is impossible to patch old Fuel.

Timur Nurlygayanov (tnurlygayanov) wrote on 2016-09-12:

#25

Verified on MOS 9.1:

http://paste.openstack.org/show/572188/

Alexey Stupnikov (astupnikov) wrote on 2016-09-29:

#26

6.1 patch: https://review.openstack.org/#/c/368765
7.0 patch: https://review.openstack.org/#/c/368764
8.0 patch: https://review.openstack.org/#/c/368763

OpenStack Infra (hudson-openstack) wrote on 2016-10-14: Fix included in openstack/fuel-library 10.0.0rc1

#27

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

TatyanaGladysheva (tgladysheva) on 2016-10-19

tags:

added: on-verification

TatyanaGladysheva (tgladysheva) wrote on 2016-10-24:

#28

Verified on MOS 7.0 + MU6 updates.

Actual results:
http://paste.openstack.org/show/586882/

tags:

removed: on-verification

Dmitry Sutyagin (dsutyagin) wrote on 2016-11-02:

#29

The resolution is incorrect: it breaks guest VM functionality for some customers.

We need to change "none" to "host-model,disable=avx2".
I'm testing this solution now.

Dmitry Sutyagin (dsutyagin) wrote on 2016-11-02:

#30

OK, it apparently won't work this way.

The proposal is to:
1. revert this fix, as it makes matters worse for some customers
2. send out a workaround for affected customers, which is to:
--1. check the current model which libvirt chooses by running "ps -ef | grep qemu" on any problematic compute node and looking for the -cpu option; it should look like "-cpu Westmere,+invpcid,+erms,+bmi2,+smep..."
--2. edit /usr/share/libvirt/cpu_map.xml on the compute nodes: add a new SOMEMODELNAME model, put the base model inside it (Westmere in this example), add "<feature policy='disable' name='avx2'/>", and also add all features listed by qemu like so: "<feature name='invpcid'/>" and so on
--3. edit nova.conf on the compute nodes:
----1. cpu_mode=custom
----2. cpu_model=SOMEMODELNAME
--4. restart nova-compute on the compute nodes

Dmitry Sutyagin (dsutyagin) wrote on 2016-11-03:

#31

OK, so the final update from me:

1. roll back this fix
2. propose the workaround to all affected customers, which is:
--1. check the current model which libvirt chooses by running "ps -ef | grep qemu" on any problematic compute node and looking for the -cpu option; it should look like "-cpu Westmere,+invpcid,+erms,+bmi2,+smep..."; save this parameter
--2. edit nova.conf:
----1. cpu_mode=custom
----2. cpu_model=<PASTE THE PARAMETER HERE, BUT REMOVE PROBLEMATIC EXTENSIONS, SUCH AS avx2>

Example:
cpu_model=Westmere,+invpcid,+erms,+bmi2,+smep,+bmi1,+fsgsbase,+abm,+rdtscp,+pdpe1gb,+rdrand,+f16c,+avx,+osxsave,+xsave,+tsc-deadline,+movbe,+pcid,+pdcm,+xtpr,+fma,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme

If customers do not have problematic extensions in their cpu parameter, they will have to use an earlier CPU model as the first word of the parameter (one can be found in /usr/share/libvirt/cpu_map.xml) and then manually determine and add all the other extensions that appear in newer models. This is also possible by looking at cpu_map.xml, which shows how newer CPU models derive from older ones and add features.
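
Step 2.1 above can be sketched as a small filter, assuming sed is available; the function name qemu_cpu_arg is hypothetical. It reads process command lines on stdin and prints the value of the -cpu option for each line that has one.

```shell
# qemu_cpu_arg
# Reads command lines on stdin; for every line containing " -cpu VALUE",
# prints VALUE (the text up to the next space). Lines without a -cpu
# option produce no output. Hypothetical helper name.
qemu_cpu_arg() {
    sed -n 's/^.* -cpu  *\([^ ]*\).*$/\1/p'
}
```

On a compute node this could be fed with ps -eo args | grep '[q]emu' | qemu_cpu_arg. The printed string still contains the problematic extensions, so they have to be removed before the value is used for cpu_model.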

Alexey Stupnikov (astupnikov) wrote on 2016-11-03:

#32

Dmitry, I agree that our solution should be a bit more flexible than just setting cpu_mode to none. On the other hand, I have discussed it with Alexey Sheplyakov and the nova and maintenance teams, and they prefer the merged change.

Gregory Elkinbard (gelkinbard) wrote on 2016-11-08:

#33

The current fix does not work for telco.

Denis Meltsaykin (dmeltsaykin) wrote on 2016-12-06:

#34

The fix is reverted in https://bugs.launchpad.net/fuel/+bug/1640559

OpenStack Infra (hudson-openstack) wrote on 2016-12-07: Fix included in openstack/fuel-library 10.0.0

#35

This issue was fixed in the openstack/fuel-library 10.0.0 release.

Sergey Reshetnyak (sreshetniak) on 2017-01-13

no longer affects:

fuel-ccp

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Status tracked in 10.0.x
10.0.x	Fix Released	High	Alexei Sheplyakov	Mirantis OpenStack 10.0
5.1.x	Won't Fix	High	Alexey Stupnikov	Mirantis OpenStack 5.1.1-updates
6.0.x	Won't Fix	High	Alexey Stupnikov	Mirantis OpenStack 6.0-updates
6.1.x	Fix Committed	High	Alexey Stupnikov	Mirantis OpenStack 6.1-updates
7.0.x	Fix Released	High	Alexey Stupnikov	Mirantis OpenStack 7.0-mu-6
8.0.x	Fix Committed	High	Alexey Stupnikov	Mirantis OpenStack 8.0-updates
9.x	Fix Released	High	Alexei Sheplyakov	Mirantis OpenStack 9.1
