Trusty soft lockup issues with nested KVM
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Fix Released | High | Chris J Arges |
Bug Description
[Impact]
Upstream discussion: https:/
Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace:
PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_
#7 [ffff88043fd03f98] smp_apic_
#8 [ffff88043fd03fb0] apic_timer_
--- <IRQ stack> ---
#9 [ffff880426f0d958] apic_timer_
[exception RIP: generic_
RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68
R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000
R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880426f0da38] smp_call_
#11 [ffff880426f0dab0] smp_call_
#12 [ffff880426f0db10] native_
#13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff880426f0db68] pmdp_splitting_
#15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
#16 [ffff880426f0dc20] split_huge_
#17 [ffff880426f0dc48] __split_
#18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
#19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
#20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
#21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
#22 [ffff880426f0df80] system_
RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293
RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff
RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000
RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738
R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700
R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 000000000000001c CS: 0033 SS: 002b
[Fix]
commit 9242b5b60df8b13
Mitigates this issue if b6b8a1451fc4041
[Workaround]
In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the L1 VM needs each vCPU pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs):
virsh vcpupin <domain> 0 0
virsh vcpupin <domain> 1 1
[Test Case]
- Deploy openstack on openstack
- Run tempest on L1 cloud
- Check kernel log of L1 nova-compute nodes
(Although this may not necessarily be related to nested KVM)
Potentially related: https:/
Another test case is to do the following (on affected hardware):
1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM
Sometimes this is sufficient to reproduce the issue. I've observed that running
KSM in the L1 VM can aggravate this issue (it calls native_
If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
L1 vCPUs until the hang occurs.
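The random migration in step 4 can be sketched from the host shell. This is a hedged stand-in: it re-pins the shell's own PID between CPUs 0 and 1 with taskset, where a real reproduction would instead target the qemu vCPU thread's PID:

```shell
# Sketch of step 4: bounce a task between host CPUs with taskset.
# $$ (this shell) stands in for the L2 vCPU thread's PID; on a real
# reproduction you would target the qemu vCPU thread instead.
PID=$$
for cpu in 0 1 0 1; do
    # Re-pin to one CPU; ignore failures on single-CPU hosts.
    taskset -pc "$cpu" "$PID" >/dev/null 2>&1 || true
    sleep 1
done
taskset -p "$PID"   # show the final affinity mask
```

Each re-pin forces the scheduler to migrate the task, which is what shakes out the remote smp_call_function path.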
--
Original Description:
When installing qemu-kvm on a VM, KSM is enabled.
I have encountered this problem in trusty:
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
The way to see the behaviour:
1) $ more /sys/kernel/
0
2) $ sudo apt-get install qemu-kvm
3) $ more /sys/kernel/
1
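For reference, the sysfs file being read in steps 1 and 3 is KSM's standard control knob, /sys/kernel/mm/ksm/run (0 = disabled, 1 = running, 2 = unmerge and stop); that path comes from standard kernel documentation, not from this report. A guarded check looks like:

```shell
# Print the KSM state: 0 = off, 1 = merging, 2 = unmerge and stop.
# Guarded in case the kernel was built without KSM.
if [ -r /sys/kernel/mm/ksm/run ]; then
    cat /sys/kernel/mm/ksm/run
else
    echo "ksm-not-available"
fi
```

Writing 0 to the same file as root turns KSM off, which may reduce exposure given that KSM activity appears to aggravate the issue.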
To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):
[24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-
[24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-
[24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-
[24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-
[24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-
I am not sure whether the problem is that we are enabling KSM on a VM, or that nested KSM is not behaving properly. Either way, I can easily reproduce it; please contact me if you need further details.
ProblemType: Bug
ApportVersion: 2.14.1-0ubuntu3.6
Architecture: amd64
Date: Thu Jan 22 10:37:18 2015
DistroRelease: Ubuntu 14.04
Ec2AMI: ami-0000000f
Ec2AMIManifest: FIXME
Ec2AvailabilityZone: nova
Ec2InstanceType: m1.medium
Ec2Kernel: aki-00000002
Ec2Ramdisk: ari-00000002
Package: qemu-kvm (not installed)
ProcEnviron:
 TERM=xterm-256color
 SHELL=/bin/bash
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 XDG_RUNTIME_DIR=<set>
ProcVersionSignature: User Name 3.13.0-44.73-generic 3.13.11-ckt12
SourcePackage: qemu
Tags: trusty ec2-images
Uname: Linux 3.13.0-44-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
_MarkForUpload: True