Trusty soft lockup issues with nested KVM

Bug #1413540 reported by Gema Gomez on 2015-01-22
This bug affects 9 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Trusty
High
Chris J Arges

Bug Description

[Impact]
Upstream discussion: https://lkml.org/lkml/2015/2/11/247

Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace:

PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"
 #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
 #2 [ffff88043fd03e30] panic at ffffffff81719ff4
 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
 #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
 #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
 #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202
    RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001
    RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
    RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68
    R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000
    R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75
#11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6
#12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d
#15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b
#16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8
#17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956
#18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d
#19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341
#20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd
#21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf
#22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d
    RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293
    RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff
    RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000
    RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738
    R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700
    R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: 000000000000001c CS: 0033 SS: 002b

[Fix]

Commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 mitigates this issue when b6b8a1451fc40412c57d1 is applied (as is the case with the affected 3.13 distro kernel). However, the issue can still occur in some cases.

[Workaround]

In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the L1 VM needs to have each vCPU pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs):

virsh vcpupin <domain> 0 0
virsh vcpupin <domain> 1 1

[Test Case]
- Deploy openstack on openstack
- Run tempest on L1 cloud
- Check kernel log of L1 nova-compute nodes

(Although this may not necessarily be related to nested KVM)
Potentially related: https://lkml.org/lkml/2014/11/14/656

Another test case is to do the following (on affected hardware):

1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue. I've observed that running
KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
If this doesn't reproduce, then you can do the following:
4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between
L1 vCPUs until the hang occurs.

--

Original Description:

When installing qemu-kvm on a VM, KSM is enabled.

I have encountered this problem in trusty:
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The way to see the behaviour:
1) $ more /sys/kernel/mm/ksm/run
0
2) $ sudo apt-get install qemu-kvm
3) $ more /sys/kernel/mm/ksm/run
1

To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):
[24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791]
[24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]
[24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791]

I am not sure whether the problem is that we are enabling KSM on a VM, or that KSM under nested virtualization is not behaving properly. Either way I can easily reproduce it; please contact me if you need further details.


Gema Gomez (gema) on 2015-01-22
tags: added: cts
description: updated
affects: ubuntu → qemu-kvm (Ubuntu)
Gema Gomez (gema) wrote :

ProblemType: Bug
ApportVersion: 2.14.1-0ubuntu3.6
Architecture: amd64
Date: Thu Jan 22 10:37:18 2015
DistroRelease: Ubuntu 14.04
Ec2AMI: ami-0000000f
Ec2AMIManifest: FIXME
Ec2AvailabilityZone: nova
Ec2InstanceType: m1.medium
Ec2Kernel: aki-00000002
Ec2Ramdisk: ari-00000002
Package: qemu-kvm (not installed)
ProcEnviron:
 TERM=xterm-256color
 SHELL=/bin/bash
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 XDG_RUNTIME_DIR=<set>
ProcVersionSignature: User Name 3.13.0-44.73-generic 3.13.11-ckt12
SourcePackage: qemu
Tags: trusty ec2-images
Uname: Linux 3.13.0-44-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
_MarkForUpload: True

Gema Gomez (gema) wrote :

Please, bear in mind I have already modified the KSM value for this node not to crash, that is the only change made to this system after the issue, other than rebooting it.

Serge Hallyn (serge-hallyn) wrote :

Moving to qemu; qemu-kvm is the source package only up to precise.

affects: qemu-kvm (Ubuntu) → qemu (Ubuntu)
Serge Hallyn (serge-hallyn) wrote :

A few comments here.

First, part of upstream qemu's position is that nested KVM is not for enterprise installs in the first place. (Others feel that if there are bugs, they should be worked out.)

Second, the bug appears to be a soft lockup. It seems to me the bug should be about that. If you've opened a separate bug about the lockup itself, please mark it against both qemu and linux. If not, then let's mark this bug as also affecting linux.

Thirdly, there's the question of how to (assuming we decide to) implement this. Should /etc/init/qemu-kvm do something like

   grep -qi qemu /proc/cpuinfo && echo 0 > /sys/kernel/mm/ksm/run

?

Note that you can currently do this yourself by having your VM install (say at cloud-init or in juju install section) set KSM_ENABLED=0 in /etc/default/qemu-kvm

I'm not sure how to mark the priority of this bug. If the bug is about the soft lockup, that's high priority (against linux). If it's about disabling KSM, that's low priority (against qemu, since there's a workaround).

Changed in qemu (Ubuntu):
importance: Undecided → Low
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

@gema,

please re-do the 'apport-collect 1413540' (with the new, corrected package names)

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1413540

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Gema Gomez (gema) wrote :

apport file for linux.

Gema Gomez (gema) wrote :

apport for qemu

Gema Gomez (gema) wrote :

I can reproduce this issue and hand over a VM in a hung state to whoever is going to triage it.

Chris J Arges (arges) on 2015-01-22
Changed in linux (Ubuntu):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → High
Chris J Arges (arges) wrote :

@serge-hallyn:

First step is to debug this issue and see if there's a reasonable fix. If that's possible then we should leave qemu alone.

Second if we want to disable KSM for nested VMs relying on a 'QEMU' cpu isn't sufficient if someone uses a different -cpu flag. We could do:
grep -q "Booting paravirtualized kernel on KVM" /var/log/kern.log && echo 0 > /sys/kernel/mm/ksm/run
Or use tools like virt-what.

Regardless I'll work on step 1.

Chris J Arges (arges) on 2015-01-22
summary: - qemu-kvm package enables KSM on VMs
+ issues with KSM enabled for nested KVM VMs

I think you'll find that nested KVM and KSM are a worst-case scenario w.r.t. memory swap-out. KSM actively scans the host's memory for pages it can unmap, which means that when a guest (level 1) or nested guest (level 2) needs to write to that memory, you incur a page table walk (guest virtual to host physical), which is quite costly. This can be further affected by memory overcommit, which ends up having the hypervisor swap memory to disk to handle the working set size. KSM has no view into the page tables or the swapping activity of the guest (L1 or L2), meaning that over time it becomes increasingly likely that KSM will have swapped out memory needed by either the L1 or L2 guest. Swapping out L1 memory used to run the L2 guest is likely to be the most painful, since there are two levels of swap-in occurring (host to L1, and then L1 to L2). Swap in/out is IO intensive, and blocking on IO is also more likely to trigger soft lockups in either L1 or L2 guests.

I suggest looking at the OpenStack environment and disabling or turning down memory overcommit to reduce the memory pressure on the host. Given that KSM isn't optimized at all for nested KVM, it's certainly worth disabling KSM when running nested guests (certainly in the L1 guest, possibly the host as well), unless one wants to invest in tuning KSM as well as closely monitoring host memory pressure, and any L1 guest memory pressure if the L1 guest also runs an L2 guest.

Gema Gomez (gema) wrote :

I have a different VM that has crashed (also nested nova compute), this one had ksm disabled. See log attached.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Ryan Harper (raharper) wrote :

Any information about the physical host running the L1 guest?

Let's concentrate on the hang without KSM on this bug. I've split the KSM enabled in nested virt issue in bug 1414153.

summary: - issues with KSM enabled for nested KVM VMs
+ soft lockup issues with nested KVM VMs running tempest
no longer affects: qemu (Ubuntu)
Chris J Arges (arges) on 2015-01-23
description: updated
Chris J Arges (arges) on 2015-02-05
description: updated
Chris J Arges (arges) wrote :

Upstream discussion about this bug:
https://lkml.org/lkml/2015/2/11/247

Ryan Beisner (1chb1n) wrote :

We have begun to see a noticeable percentage of this in our Openstack-on-Openstack automated testing. I've not yet dug too deeply, but the symptom is: a deployment completes and tests successfully, then as we're collecting logs from machines, one or more of the compute nodes falls over with: http://paste.ubuntu.com/10556479/

Let me know if there's anything you'd like for us to try. This is on our private cloud "serverstack" environment.

ubuntu@chakora:~$ uname -a
Linux chakora 3.13.0-46-generic #77-Ubuntu SMP Mon Mar 2 18:23:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@chakora:~$ dpkg-query --show linux-generic libvirt-bin qemu-system nova-compute-kvm nova-compute-libvirt
libvirt-bin 1.2.2-0ubuntu13.1.9
linux-generic 3.13.0.46.53
nova-compute-kvm 1:2014.1.3-0ubuntu2
nova-compute-libvirt 1:2014.1.3-0ubuntu2
qemu-system 2.0.0+dfsg-2ubuntu1.10

Ryan Beisner (1chb1n) wrote :

Also FWIW, RAM overcommit does not appear to be a factor on the affected compute node: http://paste.ubuntu.com/10556588/

Ryan,

Could you enable kdump (http://www.inaddy.org/mini-howtos/dumps/using-ubuntu-crash-dump-with-kdump) for these hosts and provide me a way of downloading (Dropbox, public URL) the /var/crash/XXX/ directory (the dump from the kernel panic)? I would like to make sure we are facing the same problem (upstream discussion with Linus Torvalds -> https://lkml.org/lkml/2015/2/23/460). Your kernel dump might be helpful.

Thank you...

Rafael Tinoco

Or, to be faster, provide me the stack trace from your panic (so I can make sure you are facing the same issue as the one we are interested in).

Ryan Beisner (1chb1n) on 2015-03-23
summary: - soft lockup issues with nested KVM VMs running tempest
+ Trusty soft lockup issues with nested KVM
Ryan Beisner (1chb1n) wrote :

This does not appear to be specific to OpenStack, nor tempest. I've reproduced with Trusty on Trusty on Trusty, vanilla qemu/kvm.

Simplified reproducer, with an existing MAAS cluster:

@L0 baremetal:
 - Create a Trusty bare metal host from daily images.
 - sudo apt-get update -y && sudo apt-get -y install uvtool
 - sudo uvt-simplestreams-libvirt sync release=trusty arch=amd64
 - sudo uvt-simplestreams-libvirt query
 - ssh-keygen
 - sudo uvt-kvm create --memory 2048 trusty-vm release=trusty
 - sudo virsh shutdown trusty-vm
 - # edit the /etc/libvirt/qemu/trusty-vm.xml to enable serial console dump to file:
    <serial type='file'>
      <source path='/tmp/trusty-vm-console.log'/>
      <target port='0'/>
    </serial>
    <console type='file'>
      <source path='/tmp/trusty-vm-console.log'/>
      <target type='serial' port='0'/>
    </console>
 - sudo virsh define /etc/libvirt/qemu/trusty-vm.xml
 - sudo virsh start trusty-vm
 - # confirm console output:
 - sudo tailf /tmp/trusty-vm-console.log
 - # take note of the VM's IP:
 - sudo uvt-kvm ip trusty-vm
 - # ssh into the new vm.

@L1 "trusty-vm":
 - sudo apt-get update -y && sudo apt-get -y install uvtool
 - sudo uvt-simplestreams-libvirt sync release=trusty arch=amd64
 - sudo uvt-simplestreams-libvirt query
 - ssh-keygen
 - # change .122. to .123. in /etc/libvirt/qemu/networks/default.xml
 - # make sure default.xml is static linked inside /etc/libvirt/qemu/networks
 - sudo reboot # for good measure
 - sudo uvt-kvm create --memory 768 trusty-nest release=trusty
 - # take note of the nested VM's IP
 - sudo uvt-kvm ip trusty-nest
 - # ssh into the new vm.

@L2 "trusty-nest":
 - sudo apt-get update && sudo apt-get install stress
 - stress -c 1 -i 1 -m 1 -d 1 -t 600

Now watch the "trusty-vm" console for: [ 496.076004] BUG: soft lockup - CPU#0 stuck for 23s! [ksmd:36]. It happens to me within a couple of minutes. Then, both L1 and L2 become unreachable indefinitely, with two cores on L0 stuck at 100%.

Ryan Beisner (1chb1n) wrote :

Also FYI: I was not able to reproduce this issue when using Vivid as the bare metal L0.

Ryan Beisner (1chb1n) wrote :

s/static/sym/ ;-)

Ryan Beisner (1chb1n) wrote :

A few hrs later, those two L0 bare metal host CPUs are still maxed. In scenarios where L0 is hosting many VMs, such as in a cloud, this bug can be expected to cause significant performance, consistency and capacity issues on the host and in the cloud as a whole.

Ryan Beisner (1chb1n) wrote :

I've collected crash dumps, and have stored them on an internal Canonical server as they are 2gb+. Feel free to ping me for access.

Stefan Bader (smb) wrote :

Hm, following your instructions I instead run into a situation where the L2 guest gets paused, likely because L1 runs out of disk space. The uvtool default is 7G, which I would say the L2 stress run fills as it grows the L2 qcow image on L1, which has to fit the initial cloud image plus the snapshot of it for L2 into that 7G.

Ryan Beisner (1chb1n) wrote :

@smb - after repeating the test a few times, I too ran out of space with the default 8GB VM disk size, resulting in a paused VM. You'll have to re-create the VMs a little bit differently (--disk <GB>).

ex:
@L0:
sudo uvt-kvm destroy trusty-vm
sudo uvt-kvm create --memory 2048 --disk 40 trusty-vm release=trusty

@L1:
#repeat original repro

ref:
http://manpages.ubuntu.com/manpages/trusty/man1/uvt-kvm.1.html

Stefan Bader (smb) wrote :

Yeah, will do. Just got distracted and wanted to ensure that the repro was not accidentally another form of failure path to the out of space issue.

Stefan Bader (smb) wrote :

Hrmn... When I repeated the setup I seem to have triggered some kind of lockup even while bringing up L2. Of course it's hard to say without details of Ryan's dump. However, mine seems to have backtraces in the log which remind me an awful lot of an issue related to punching holes into ext4-based qcow images. Chris had been working on something like this before... he is on a sprint this week. Anyway, the stack trace in my log:

[ 1200.288031] INFO: task qemu-system-x86:4545 blocked for more than 120 seconds.
[ 1200.288712] Not tainted 3.13.0-46-generic #77-Ubuntu
[ 1200.289204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1200.289892] qemu-system-x86 D ffff88007fc134c0 0 4545 1 0x00000000
[ 1200.289895] ffff88007a9c5d28 0000000000000082 ffff88007bbd3000 ffff88007a9c5fd8
[ 1200.289897] 00000000000134c0 00000000000134c0 ffff88007bbd3000 ffff88007fc13d58
[ 1200.289898] ffff88007ffcdee8 0000000000000002 ffffffff8114eef0 ffff88007a9c5da0
[ 1200.289900] Call Trace:
[ 1200.289906] [<ffffffff8114eef0>] ? wait_on_page_read+0x60/0x60
[ 1200.289909] [<ffffffff817259fd>] io_schedule+0x9d/0x140
[ 1200.289910] [<ffffffff8114eefe>] sleep_on_page+0xe/0x20
[ 1200.289912] [<ffffffff81725e82>] __wait_on_bit+0x62/0x90
[ 1200.289914] [<ffffffff8114ecbf>] wait_on_page_bit+0x7f/0x90
[ 1200.289917] [<ffffffff810ab140>] ? autoremove_wake_function+0x40/0x40
[ 1200.289919] [<ffffffff8115c4a1>] ? pagevec_lookup_tag+0x21/0x30
[ 1200.289921] [<ffffffff8114edc9>] filemap_fdatawait_range+0xf9/0x190
[ 1200.289923] [<ffffffff8115066f>] filemap_write_and_wait_range+0x3f/0x70
[ 1200.289927] [<ffffffff8123bc4a>] ext4_sync_file+0xba/0x320
[ 1200.289930] [<ffffffff811ede21>] do_fsync+0x51/0x80
[ 1200.289931] [<ffffffff811ee0d3>] SyS_fdatasync+0x13/0x20
[ 1200.289933] [<ffffffff81731d7d>] system_call_fastpath+0x1a/0x1f

Chris J Arges (arges) wrote :

Stefan,
This looks like a separate bug (as we discussed). Please file another bug for this when you have time.

description: updated
Chris J Arges (arges) wrote :

I've added instructions for a workaround. The code paths I've seen in crashes have been the following:

kvm_sched_in
 -> kvm_arch_vcpu_load
  -> vmx_vcpu_load
   -> loaded_vmcs_clear
    -> smp_call_function_single

pmdp_clear_flush
 -> flush_tlb_mm_range
  -> native_flush_tlb_others
    -> smp_call_function_many

Generally this has been caused by workloads that use nested VMs and stress the L2/L1 VMs (causing non-local CPU TLB flushing or VMCS clearing).

The hang is in csd_lock_wait waiting for CSD_FLAG_LOCK bit to be cleared, which can only be triggered with non-local smp_call_function_* calls.

Another data point is that this can happen with x2apic as well as flat apic (as tested with nox2apic).

Chris J Arges (arges) wrote :

Ideas going forward:

1) Instrument the kernel for debugging csd_lock
2) Determine which CPUs exhibit this issue
3) Examine pinning more in depth (for example, pin 0-0, 1-2)
4) Test older and newer kernels to verify the issue

Tom Fifield (fifieldt) wrote :

Hi,

Just wanted to chime in that this bug also affected me - running OpenStack Juno w/KVM inside a KVM hypervisor.

CPU on the host machine is:
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

running 14.04 with the latest packages applied as of today (2015-03-27) for both the host and the guest.

Lockup appeared to happen with one host-guest VM after I altered the number of CPUs allocated to another VM (yet to reboot that VM for changes to take effect), though I had also recently booted a new host-guest-guest VM.

Mar 27 15:12:43 compute ntpd[1775]: peers refreshed
Mar 27 15:12:43 compute ntpd[1775]: new interface(s) found: waking up resolver
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPDISCOVER(br100) fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPOFFER(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPREQUEST(br100) 203.0.113.27 fa:16:3e:c3:81:22
Mar 27 15:12:48 compute dnsmasq-dhcp[2044]: DHCPACK(br100) 203.0.113.27 fa:16:3e:c3:81:22 test03
Mar 27 15:15:40 compute kernel: [ 436.100002] BUG: soft lockup - CPU#5 stuck for 23s! [ksmd:68]
Mar 27 15:15:40 compute kernel: [ 436.100002] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM ebt_ip ebt_arp ebtable_filter bridge stp llc xt_conntrack xt_nat xt_tcpudp iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_intel cirrus snd_hda_codec ttm snd_hwdep drm_kms_helper snd_pcm drm snd_page_alloc snd_timer syscopyarea snd sysfillrect soundcore sysimgblt dm_multipath i2c_piix4 kvm_intel scsi_dh serio_raw kvm mac_hid lp parport 8139too psmouse 8139cp mii floppy pata_acpi
Mar 27 15:15:40 compute kernel: [ 436.100002] CPU: 5 PID: 68 Comm: ksmd Not tainted 3.13.0-46-generic #79-Ubuntu
Mar 27 15:15:40 compute kernel: [ 436.100002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Mar 27 15:15:40 compute kernel: [ 436.100002] task: ffff8802306db000 ti: ffff8802306e4000 task.ti: ffff8802306e4000
Mar 27 15:15:40 compute kernel: [ 436.100002] RIP: 0010:[<ffffffff810dbf56>] [<ffffffff810dbf56>] generic_exec_single+0x86/0xb0
Mar 27 15:15:40 compute kernel: [ 436.100002] RSP: 0018:ffff8802306e5c00 EFLAGS: 00000202
Mar 27 15:15:40 compute kernel: [ 436.100002] RAX: 0000000000000006 RBX: ffff8802306e5bd0 RCX: 0000000000000005
Mar 27 15:15:40 compute kernel: [ 436.100002] RDX: ffffffff8180ade0 RSI: 0000000000000000 RDI: 0000000000000286
Mar 27 15:15:40 compute kernel: [ 436.100002] RBP: ffff8802306e5c30 R08: ffffffff8180adc8 R09: ffff880232989b48
Mar 27 15:15:40 compute kernel: [ 436.100002] R10: 0000000000000867 R11: 0000000000000000 R12: 0000000000000000
Mar 27 15:15:40 compute kernel: [ 436.100002] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 27 15:15:40 compute kernel: [ 436.100002] FS: 0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
Mar 27 15:15:4...


Chris J Arges (arges) wrote :

@fifieldt

Hi, that is the same bug. Things to reduce the hangs right now are:
- Disabling KSM in L1 guest
- Using 3.16 kernel on the L0 host
- Pinning L1 vCPUs to L0 host CPU

Note this doesn't fix the issue, it only decreases (potentially) the frequency of these lockups.
--chris

Aaron Rosen (arosen) wrote :

I am also hitting this issue in my CI a lot. Here is the trace I'm getting in syslog: http://logs2.aaronorosen.com/85/169585/1/check/dsvm-tempest-full-congress-nodepool/94f8441/logs/syslog.txt.gz#_Apr__1_02_43_44

Is there a work around for this?

Chris J Arges (arges) wrote :

@arosen,
This looks like a different soft lockup, and the machine also seems to recover from it. Please file a new bug and be sure to attach logs to the bug. Describe in detail how to reproduce it as well: what kind of host machine do you have? What VM definition are you using? Etc.

description: updated
Chris J Arges (arges) on 2015-04-06
Changed in linux (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → High
status: New → In Progress
description: updated
Chris J Arges (arges) on 2015-04-06
description: updated
Chris J Arges (arges) wrote :

With a revert of b6b8a145 ('Rework interception of IRQs and NMIs'), the issue does not occur readily with the test case. I was able to run for 1+ hour. Generally I can reproduce within 15m.

With 9242b5b6 ('KVM: x86: Check for nested events if there is an injectable interrupt') applied, I can run for 1+ hour without issue.

The current 3.13.0 patch level is in between those two commits, which allows this bug to reproduce easily.

Ramy Asselin (ramy-asselin) wrote :

Ran into this bug too on 3.13.0-48. My workaround is to run plain QEMU on top of KVM (instead of KVM on top of KVM).

devstack local.conf:
[[post-config|$NOVA_CONF]]
[libvirt]
virt_type = qemu

nova.conf
[libvirt]
virt_type = qemu

Andy Whitcroft (apw) on 2015-04-08
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Dr. Jens Harbott (j-harbott) wrote :

@Andy: So 3.16.0-34 is the kernel with the fix? Any chance that it will also be backported to the 3.13 series?

Chris J Arges (arges) wrote :

@j-rosenboom-j:

I've backported patch 9242b5b6 to the 3.13 series, which should greatly mitigate this issue. It should be available in the next update.

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Chris J Arges (arges) wrote :

Verified on my reproducers. I'm marking the development task as fixed for this bug. I'll move the upstream investigation to another bug.

Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
status: Confirmed → Fix Released
importance: High → Undecided
tags: added: verification-done-trusty
removed: verification-needed-trusty
Gema Gomez (gema) wrote :

I have been trying to verify this kernel and I haven't seen exactly the soft lockup crash, but this other one, which may or may not be related; I wanted to make a note of it:

[ 2406.041444] Kernel panic - not syncing: hung_task: blocked tasks
[ 2406.043163] CPU: 1 PID: 35 Comm: khungtaskd Not tainted 3.13.0-51-generic #84-Ubuntu
[ 2406.044223] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 2406.044223] 00000000003fffd1 ffff88080ec7fdf0 ffffffff817225ce ffffffff81a62a65
[ 2406.044223] ffff88080ec7fe68 ffffffff8171b46d ffffffff00000008 ffff88080ec7fe78
[ 2406.044223] ffff88080ec7fe18 ffff88080ec7fe40 0000000000000100 0000000000000004
[ 2406.044223] Call Trace:
[ 2406.044223] [<ffffffff817225ce>] dump_stack+0x45/0x56
[ 2406.044223] [<ffffffff8171b46d>] panic+0xc8/0x1d7
[ 2406.044223] [<ffffffff8110d7b6>] watchdog+0x296/0x2e0
[ 2406.044223] [<ffffffff8110d520>] ? reset_hung_task_detector+0x20/0x20
[ 2406.044223] [<ffffffff8108b5d2>] kthread+0xd2/0xf0
[ 2406.044223] [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2406.044223] [<ffffffff8173300c>] ret_from_fork+0x7c/0xb0
[ 2406.044223] [<ffffffff8108b500>] ? kthread_create_on_node+0x1c0/0x1c0

I have the crashdump for it, let me know how you want to proceed.

Chris J Arges (arges) wrote :

After speaking to Gema, she will re-test with this kernel installed in L0 in addition to L1.
NOTE: This fix needs to be present for L0/L1 kernels.

Guy Baconniere (lordbaco) wrote :

I still have the same issue with kernel 3.16.0-36-generic or 3.13.0-51-generic (proposed-updates)

# KVM HOST (3.16.0-36-generic)
sudo apt-get install linux-signed-generic-lts-utopic/trusty-proposed

# KVM GUEST (3.16.0-36-generic)
sudo apt-get install linux-virtual-lts-utopic/trusty-proposed
apt-get install cloud-installer
cloud-install

[ 1196.920613] kvm: vmptrld (null)/780000000000 failed
[ 1196.920953] vmwrite error: reg 401e value 31 (err 1)
[ 1196.921243] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921244] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921245] 0000000000000000 ffff88202018fb58 ffffffff81764a5f ffff880fe4960000
[ 1196.921248] ffff88202018fb68 ffffffffc0a9320d ffff88202018fb78 ffffffffc0a878bf
[ 1196.921250] ffff88202018fba8 ffffffffc0a8e1cf ffff880fe4960000 0000000000000000
[ 1196.921252] Call Trace:
[ 1196.921262] [<ffffffff81764a5f>] dump_stack+0x45/0x56
[ 1196.921277] [<ffffffffc0a9320d>] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921280] [<ffffffffc0a878bf>] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921283] [<ffffffffc0a8e1cf>] free_nested.part.73+0x5f/0x170 [kvm_intel]
[ 1196.921286] [<ffffffffc0a8e363>] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 1196.921305] [<ffffffffc03fd774>] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 1196.921312] [<ffffffffc03fe3e2>] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]
[ 1196.921318] [<ffffffff810d28ed>] ? synchronize_srcu+0x1d/0x20
[ 1196.921323] [<ffffffffc03e635e>] kvm_put_kvm+0x10e/0x220 [kvm]
[ 1196.921328] [<ffffffffc03e64a8>] kvm_vcpu_release+0x18/0x20 [kvm]
[ 1196.921331] [<ffffffff811d5e64>] __fput+0xe4/0x220
[ 1196.921333] [<ffffffff811d5fee>] ____fput+0xe/0x10
[ 1196.921337] [<ffffffff8108e284>] task_work_run+0xc4/0xe0
[ 1196.921342] [<ffffffff810702c8>] do_exit+0x2b8/0xa60
[ 1196.921345] [<ffffffff810e3792>] ? __unqueue_futex+0x32/0x70
[ 1196.921347] [<ffffffff810e47b6>] ? futex_wait+0x126/0x290
[ 1196.921349] [<ffffffff8109eb35>] ? check_preempt_curr+0x85/0xa0
[ 1196.921351] [<ffffffff81070aef>] do_group_exit+0x3f/0xa0
[ 1196.921353] [<ffffffff81080530>] get_signal_to_deliver+0x1d0/0x6f0
[ 1196.921357] [<ffffffff81012548>] do_signal+0x48/0xad0
[ 1196.921359] [<ffffffff81011627>] ? __switch_to+0x167/0x590
[ 1196.921361] [<ffffffff81013039>] do_notify_resume+0x69/0xb0
[ 1196.921364] [<ffffffff8176d44a>] int_signal+0x12/0x17
[ 1196.921365] vmwrite error: reg 2800 value ffffffffffffffff (err -255)
[ 1196.921733] CPU: 23 PID: 5240 Comm: qemu-system-x86 Not tainted 3.16.0-36-generic #48~14.04.1-Ubuntu
[ 1196.921734] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 11/03/2014
[ 1196.921735] 0000000000000000 ffff88202018fb58 ffffffff81764a5f ffff880fe4960000
[ 1196.921736] ffff88202018fb68 ffffffffc0a9320d ffff88202018fb78 ffffffffc0a878bf
[ 1196.921737] ffff88202018fba8 ffffffffc0a8e1e0 ffff880fe4960000 0000000000000000
[ 1196.921739] Call Trace:
[ 1196.921741] [<ffffffff81764a5f>] dump_stack+0x45/0x56
[ 1196.921744] [<ffffffffc0a9320d>] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 1196.921746] [<ffffffffc0a878bf>] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 1196.921748] [<ffffffffc0a8e1e0>] free_nest...


Chris J Arges (arges) wrote :

@baco-1

These backtraces look a bit different than the original bug. Can you file a new bug with how you are reproducing this and gather complete logs?

--chris

Guy Baconniere (lordbaco) wrote :

@arges

For me it's related, at least in part...

If I don't update the kernel to proposed-updates, I have the following messages:
If I use one CPU instead of two, I don't have those messages.

BUG: soft lockup CPU#1 stuck for 22s! [qemu-system-x86:6889]
INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 1, t=15002 jiffies, g=5324, c=5323, q=0)
BUG: soft lockup CPU#1 stuck for 22s! [qemu-system-x86:6889]
...

My stress test is installing OpenStack with cloud-install on a single VM (nested KVM) generated with uvt-kvm and two vCPUs.

The backtrace I posted is the result of a manual "force off" (using virt-manager) of the test OpenStack VM.
Shortly after cloud-install tries to launch a VM inside the VM, CPU usage reaches 100% and all shells (including the console) get stuck.

The last kernel message from the VM before I lose control is
"[ 942.295014] IPv6: ADDRCONF(NETDEV_UP): virbr0: link is not ready"

Chris J Arges (arges) wrote :

@baco-1

1) What kind of hardware are you running on L0? ('ubuntu-bug linux' and filing a bug would collect the necessary info)
2) What kind of load are you seeing in L0, L1?
3) Can you give me the output of 'tail /sys/module/kvm_intel/parameters/*' ?
4) You could set up crashdump to capture a dump on a hang (if we think it's the right one), or just get a full backtrace on a soft lockup by adding the following to the kernel cmdline:
softlockup_all_cpu_backtrace=1
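For item 4, a minimal sketch of wiring that flag into the kernel cmdline on a stock Ubuntu install (this assumes the usual /etc/default/grub layout; it operates on a temporary copy rather than the real file, so adapt it before use):

```shell
# Sketch: append softlockup_all_cpu_backtrace=1 to GRUB_CMDLINE_LINUX_DEFAULT.
# Works on a temporary copy here; on a real host, edit /etc/default/grub the
# same way, then run "sudo update-grub" and reboot for it to take effect.
cfg=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$cfg"
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 softlockup_all_cpu_backtrace=1"/' "$cfg"
cat "$cfg"
rm -f "$cfg"
```

Alternatively, the same knob can be flipped at runtime (no reboot) via `echo 1 | sudo tee /proc/sys/kernel/softlockup_all_cpu_backtrace`, on kernels that expose that sysctl.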

Having a single vCPU could either be reducing the load or avoiding a race; it would be hard to tell without a proper backtrace of the hang itself.

This seems like a pretty simple test case; I will put it on my list of things to try to reproduce.

Gema Gomez (gema) wrote :

My deployment is still running strong after over 36 hours. No crashes. I will leave it running for a few more days to see if it happens after a few days... and will report back.

@arges, thanks for this fix!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-51.84

---------------
linux (3.13.0-51.84) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1444141
  * Merged back Ubuntu-3.13.0-49.83 security release

linux (3.13.0-50.82) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1442285

  [ Andy Whitcroft ]

  * [Config] CONFIG_DEFAULT_MMAP_MIN_ADDR needs to match on armhf and arm64
    - LP: #1418140

  [ Chris J Arges ]

  * [Config] CONFIG_PCIEASPM_DEBUG=y
    - LP: #1398544

  [ Upstream Kernel Changes ]

  * KEYS: request_key() should reget expired keys rather than give
    EKEYEXPIRED
    - LP: #1124250
  * audit: correctly record file names with different path name types
    - LP: #1439441
  * KVM: x86: Check for nested events if there is an injectable interrupt
    - LP: #1413540
  * be2iscsi: fix memory leak in error path
    - LP: #1440156
  * block: remove old blk_iopoll_enabled variable
    - LP: #1440156
  * be2iscsi: Fix handling timed out MBX completion from FW
    - LP: #1440156
  * be2iscsi: Fix doorbell format for EQ/CQ/RQ s per SLI spec.
    - LP: #1440156
  * be2iscsi: Fix the session cleanup when reboot/shutdown happens
    - LP: #1440156
  * be2iscsi: Fix scsi_cmnd leakage in driver.
    - LP: #1440156
  * be2iscsi : Fix DMA Out of SW-IOMMU space error
    - LP: #1440156
  * be2iscsi: Fix retrieving MCCQ_WRB in non-embedded Mbox path
    - LP: #1440156
  * be2iscsi: Fix exposing Host in sysfs after adapter initialization is
    complete
    - LP: #1440156
  * be2iscsi: Fix interrupt Coalescing mechanism.
    - LP: #1440156
  * be2iscsi: Fix TCP parameters while connection offloading.
    - LP: #1440156
  * be2iscsi: Fix memory corruption in MBX path
    - LP: #1440156
  * be2iscsi: Fix destroy MCC-CQ before MCC-EQ is destroyed
    - LP: #1440156
  * be2iscsi: add an missing goto in error path
    - LP: #1440156
  * be2iscsi: remove potential junk pointer free
    - LP: #1440156
  * be2iscsi: Fix memory leak in mgmt_set_ip()
    - LP: #1440156
  * be2iscsi: Fix the sparse warning introduced in previous submission
    - LP: #1440156
  * be2iscsi: Fix updating the boot enteries in sysfs
    - LP: #1440156
  * be2iscsi: Fix processing CQE before connection resources are freed
    - LP: #1440156
  * be2iscsi : Fix kernel panic during reboot/shutdown
    - LP: #1440156
  * fixed invalid assignment of 64bit mask to host dma_boundary for scatter
    gather segment boundary limit.
    - LP: #1440156
  * quota: Store maximum space limit in bytes
    - LP: #1441284
  * ip: zero sockaddr returned on error queue
    - LP: #1441284
  * net: rps: fix cpu unplug
    - LP: #1441284
  * ipv6: stop sending PTB packets for MTU < 1280
    - LP: #1441284
  * netxen: fix netxen_nic_poll() logic
    - LP: #1441284
  * udp_diag: Fix socket skipping within chain
    - LP: #1441284
  * ping: Fix race in free in receive path
    - LP: #1441284
  * bnx2x: fix napi poll return value for repoll
    - LP: #1441284
  * net: don't OOPS on socket aio
    - LP: #1441284
  * bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify
    - LP: #1441284
  * ipv4: tcp: get rid of ugly unicast_sock
...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released