Activity log for bug #1413540

Date Who What changed Old value New value Message
2015-01-22 10:17:14 Gema Gomez bug added bug
2015-01-22 10:17:26 Gema Gomez tags cts
2015-01-22 10:19:11 Gema Gomez description When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually crash with (run tempest 2 times at least): 24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
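The KSM check from the reproduction steps quoted above, as plain shell commands run inside the affected VM. The last line, which turns KSM back off via the same sysfs knob, is an assumed addition for testing and is not part of the reported steps:
cat /sys/kernel/mm/ksm/run                  # 0 = KSM stopped, 1 = KSM running
sudo apt-get install qemu-kvm               # per the report, installing qemu-kvm flips the knob to 1
cat /sys/kernel/mm/ksm/run                  # now reads 1
echo 0 | sudo tee /sys/kernel/mm/ksm/run    # assumption: stop KSM again while testing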
2015-01-22 10:25:33 Adam Collard affects ubuntu qemu-kvm (Ubuntu)
2015-01-22 10:39:36 Gema Gomez attachment added apport.qemu-kvm.3i3ndao3.apport https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/1413540/+attachment/4303546/+files/apport.qemu-kvm.3i3ndao3.apport
2015-01-22 10:45:26 Gema Gomez attachment added apport.qemu-system-x86.hhh4e6d9.apport https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/1413540/+attachment/4303547/+files/apport.qemu-system-x86.hhh4e6d9.apport
2015-01-22 14:13:11 Serge Hallyn affects qemu-kvm (Ubuntu) qemu (Ubuntu)
2015-01-22 14:21:20 Serge Hallyn qemu (Ubuntu): importance Undecided Low
2015-01-22 14:21:28 Serge Hallyn qemu (Ubuntu): status New Confirmed
2015-01-22 14:27:04 Serge Hallyn bug task added linux (Ubuntu)
2015-01-22 14:30:10 Brad Figg linux (Ubuntu): status New Incomplete
2015-01-22 14:30:12 Brad Figg tags cts cts trusty
2015-01-22 14:55:45 Gema Gomez attachment added apport.linux-image-3.13.0-44-generic.61de1tqv.apport https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1413540/+attachment/4303621/+files/apport.linux-image-3.13.0-44-generic.61de1tqv.apport
2015-01-22 14:55:55 Gema Gomez attachment added apport.linux-image-3.13.0-44-generic.61de1tqv.apport https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1413540/+attachment/4303622/+files/apport.linux-image-3.13.0-44-generic.61de1tqv.apport
2015-01-22 14:56:21 Gema Gomez attachment added apport.qemu.pnfp6lff.apport https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1413540/+attachment/4303623/+files/apport.qemu.pnfp6lff.apport
2015-01-22 15:07:00 Chris J Arges linux (Ubuntu): assignee Chris J Arges (arges)
2015-01-22 15:07:03 Chris J Arges linux (Ubuntu): importance Undecided High
2015-01-22 17:34:24 Chris J Arges summary qemu-kvm package enables KSM on VMs issues with KSM enabled for nested KVM VMs
2015-01-22 19:49:53 Gema Gomez attachment added soft-lockup-different-node.log https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1413540/+attachment/4303814/+files/soft-lockup-different-node.log
2015-01-22 19:50:06 Gema Gomez linux (Ubuntu): status Incomplete Confirmed
2015-01-23 13:20:03 Rafael David Tinoco bug added subscriber Rafael David Tinoco
2015-01-23 20:29:13 Chris J Arges summary issues with KSM enabled for nested KVM VMs soft lockup issues with nested KVM VMs running tempest
2015-01-23 20:32:46 Chris J Arges bug task deleted qemu (Ubuntu)
2015-01-23 20:34:57 Chris J Arges description When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. [Impact] Users of nested KVM for testing openstack have soft lockups as follows: [74180.076007] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:14590] <snip> [74180.076007] Call Trace: [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffff810dbf75>] smp_call_function_single+0xe5/0x190 [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffffa00c4300>] ? rmap_write_protect+0x80/0x80 [kvm] [74180.076007] [<ffffffff810dc3a6>] smp_call_function_many+0x286/0x2d0 [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffff8105c8f7>] native_flush_tlb_others+0x37/0x40 [74180.076007] [<ffffffff8105c9cb>] flush_tlb_mm_range+0x5b/0x230 [74180.076007] [<ffffffff8105b80d>] pmdp_splitting_flush+0x3d/0x50 [74180.076007] [<ffffffff811ac95b>] __split_huge_page+0xdb/0x720 [74180.076007] [<ffffffff811ad008>] split_huge_page_to_list+0x68/0xd0 [74180.076007] [<ffffffff811ad9a6>] __split_huge_page_pmd+0x136/0x330 [74180.076007] [<ffffffff8117728d>] unmap_page_range+0x7dd/0x810 [74180.076007] [<ffffffffa00a66b5>] ? kvm_mmu_notifier_invalidate_range_start+0x75/0x90 [kvm] [74180.076007] [<ffffffff81177341>] unmap_single_vma+0x81/0xf0 [74180.076007] [<ffffffff811784ed>] zap_page_range+0xed/0x150 [74180.076007] [<ffffffff8108ed74>] ? hrtimer_start_range_ns+0x14/0x20 [74180.076007] [<ffffffff81174fbf>] SyS_madvise+0x3bf/0x850 [74180.076007] [<ffffffff810db841>] ? SyS_futex+0x71/0x150 [74180.076007] [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. 
I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
2015-01-27 16:38:37 Stefan Bader bug added subscriber Stefan Bader
2015-01-29 08:18:33 Nobuto Murata bug added subscriber Nobuto MURATA
2015-01-29 08:26:18 Yoshi Kadokawa bug added subscriber Yoshi Kadokawa
2015-01-29 08:27:00 Janghoon-Paul Sim bug added subscriber Janghoon-Paul Sim
2015-02-05 18:01:01 Chris J Arges description [Impact] Users of nested KVM for testing openstack have soft lockups as follows: [74180.076007] BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:14590] <snip> [74180.076007] Call Trace: [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffff810dbf75>] smp_call_function_single+0xe5/0x190 [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffffa00c4300>] ? rmap_write_protect+0x80/0x80 [kvm] [74180.076007] [<ffffffff810dc3a6>] smp_call_function_many+0x286/0x2d0 [74180.076007] [<ffffffff8105c7a0>] ? leave_mm+0x80/0x80 [74180.076007] [<ffffffff8105c8f7>] native_flush_tlb_others+0x37/0x40 [74180.076007] [<ffffffff8105c9cb>] flush_tlb_mm_range+0x5b/0x230 [74180.076007] [<ffffffff8105b80d>] pmdp_splitting_flush+0x3d/0x50 [74180.076007] [<ffffffff811ac95b>] __split_huge_page+0xdb/0x720 [74180.076007] [<ffffffff811ad008>] split_huge_page_to_list+0x68/0xd0 [74180.076007] [<ffffffff811ad9a6>] __split_huge_page_pmd+0x136/0x330 [74180.076007] [<ffffffff8117728d>] unmap_page_range+0x7dd/0x810 [74180.076007] [<ffffffffa00a66b5>] ? kvm_mmu_notifier_invalidate_range_start+0x75/0x90 [kvm] [74180.076007] [<ffffffff81177341>] unmap_single_vma+0x81/0xf0 [74180.076007] [<ffffffff811784ed>] zap_page_range+0xed/0x150 [74180.076007] [<ffffffff8108ed74>] ? hrtimer_start_range_ns+0x14/0x20 [74180.076007] [<ffffffff81174fbf>] SyS_madvise+0x3bf/0x850 [74180.076007] [<ffffffff810db841>] ? SyS_futex+0x71/0x150 [74180.076007] [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. 
[Impact] Users of nested KVM for testing openstack have soft lockups as follows: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86" #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203 #2 [ffff88043fd03e30] panic at ffffffff81719ff4 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> --- #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd [exception RIP: generic_exec_single+130] RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202 RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001 RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286 RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68 R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000 R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293 RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000 RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738 R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700 R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! 
[qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
2015-03-08 13:38:20 Ryan Beisner bug added subscriber Ryan Beisner
2015-03-23 21:49:27 Ryan Beisner summary soft lockup issues with nested KVM VMs running tempest Trusty soft lockup issues with nested KVM
2015-03-24 00:48:21 Ryan Beisner attachment added L0-baremetal-cpu-pegged.png https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+attachment/4353983/+files/L0-baremetal-cpu-pegged.png
2015-03-24 00:48:42 Ryan Beisner attachment added L1-console-log-soft-lockup.png https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/+attachment/4353984/+files/L1-console-log-soft-lockup.png
2015-03-25 20:50:38 Chris J Arges description [Impact] Users of nested KVM for testing openstack have soft lockups as follows: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86" #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02 #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203 #2 [ffff88043fd03e30] panic at ffffffff81719ff4 #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5 #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787 #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537 #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> --- #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd [exception RIP: generic_exec_single+130] RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202 RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001 RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286 RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68 R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000 R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293 RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000 RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738 R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700 R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! 
[qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. [Impact] Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. 
For the nested VM case, this means the L1 VM needs to have all vCPUs pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
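A minimal shell sketch of the vCPU pinning workaround recorded in the entry above, assuming a libvirt-managed L1 guest with 2 vCPUs; the domain name "l1-compute" and the 1:1 host-CPU mapping are illustrative only, not taken from the bug:
DOMAIN=l1-compute               # hypothetical libvirt domain name of the L1 VM
virsh vcpupin "$DOMAIN" 0 0     # pin vCPU 0 to host CPU 0
virsh vcpupin "$DOMAIN" 1 1     # pin vCPU 1 to host CPU 1
virsh vcpupin "$DOMAIN"         # list the resulting affinity to verify the pinning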
2015-03-29 21:06:39 Dr. Jens Harbott bug added subscriber Dr. Jens Rosenboom
2015-04-01 13:54:08 Chris J Arges description [Impact] Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the the L1 VM needs to have all vCPUs pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. 
I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. [Impact] Upstream discussion: https://lkml.org/lkml/2015/2/11/247 Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at 
ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the the L1 VM needs to have all vCPUs pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
2015-04-06 14:22:33 Chris J Arges nominated for series Ubuntu Trusty
2015-04-06 14:22:33 Chris J Arges bug task added linux (Ubuntu Trusty)
2015-04-06 14:23:45 Chris J Arges linux (Ubuntu Trusty): assignee Chris J Arges (arges)
2015-04-06 14:24:29 Chris J Arges linux (Ubuntu Trusty): importance Undecided High
2015-04-06 14:24:31 Chris J Arges linux (Ubuntu Trusty): status New In Progress
2015-04-06 14:26:25 Chris J Arges description [Impact] Upstream discussion: https://lkml.org/lkml/2015/2/11/247 Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the the L1 VM needs to have all vCPUs pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. 
I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. [Impact] Upstream discussion: https://lkml.org/lkml/2015/2/11/247 Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at 
ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the L1 VM needs to have all vCPUs pinned to a unique CPU. This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 Another test case is to do the following (on affected hardware): 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce) 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU 3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM Sometimes this is sufficient to reproduce the issue; I've observed that running KSM in the L1 VM can aggravate this issue (it calls native_flush_tlb_others). If this doesn't reproduce then you can do the following: 4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between L1 vCPUs until the hang occurs. -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
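A rough shell sketch of the second reproducer described in the entry above; the L2 domain name "l2-guest", the loop bounds, and the sleep interval are assumptions for illustration, not taken from the bug:
# Step 3, inside the L2 guest (1 vCPU): generate load for 20 minutes.
#   stress -c 1 -m 1 -d 1 -t 1200
# Step 4, from the L1 guest: if no lockup appears on its own, bounce the single
# L2 vCPU between the two L1 vCPUs until the L1 kernel logs a soft lockup.
L2_DOMAIN=l2-guest              # hypothetical libvirt domain name of the L2 VM
for i in $(seq 1 200); do
    virsh vcpupin "$L2_DOMAIN" 0 $((i % 2)) --live
    sleep 5
done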
2015-04-06 14:35:46 Chris J Arges description [Impact] Upstream discussion: https://lkml.org/lkml/2015/2/11/247 Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the the L1 VM needs to have all vCPUs pinned to a unique CPU. 
This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 Another test case is to do the following (on affected hardware): 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce) 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU 3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM Sometimes this is sufficient to reproduce the issue, I've observed that running KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others). If this doesn't reproduce then you can do the following: 4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR tasksel) between L1 vCPUs until the hang occurs. -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details. 
[Impact] Upstream discussion: https://lkml.org/lkml/2015/2/11/247 Certain workloads that need to execute functions on a non-local CPU using smp_call_function_* can result in soft lockups with the following backtrace: PID: 22262 TASK: ffff8804274bb000 CPU: 1 COMMAND: "qemu-system-x86"  #0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02  #1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203  #2 [ffff88043fd03e30] panic at ffffffff81719ff4  #3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5  #4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787  #5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f  #6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537  #7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f  #8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd --- <IRQ stack> ---  #9 [ffff880426f0d958] apic_timer_interrupt at ffffffff817326dd     [exception RIP: generic_exec_single+130]     RIP: ffffffff810dbe62 RSP: ffff880426f0da00 RFLAGS: 00000202     RAX: 0000000000000002 RBX: ffff880426f0d9d0 RCX: 0000000000000001     RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286     RBP: ffff880426f0da30 R8: ffffffff8180ad48 R9: ffff88042713bc68     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: ffff8804274bb000     R13: 0000000000000000 R14: ffff880407670280 R15: 0000000000000000     ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #10 [ffff880426f0da38] smp_call_function_single at ffffffff810dbf75 #11 [ffff880426f0dab0] smp_call_function_many at ffffffff810dc3a6 #12 [ffff880426f0db10] native_flush_tlb_others at ffffffff8105c8f7 #13 [ffff880426f0db38] flush_tlb_mm_range at ffffffff8105c9cb #14 [ffff880426f0db68] pmdp_splitting_flush at ffffffff8105b80d #15 [ffff880426f0db88] __split_huge_page at ffffffff811ac90b #16 [ffff880426f0dc20] split_huge_page_to_list at ffffffff811acfb8 #17 [ffff880426f0dc48] __split_huge_page_pmd at ffffffff811ad956 #18 [ffff880426f0dcc8] unmap_page_range at ffffffff8117728d #19 [ffff880426f0dda0] unmap_single_vma at ffffffff81177341 #20 [ffff880426f0ddd8] zap_page_range at ffffffff811784cd #21 [ffff880426f0de90] sys_madvise at ffffffff81174fbf #22 [ffff880426f0df80] system_call_fastpath at ffffffff8173196d     RIP: 00007fe7ca2cc647 RSP: 00007fe7be9febf0 RFLAGS: 00000293     RAX: 000000000000001c RBX: ffffffff8173196d RCX: ffffffffffffffff     RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007fe7be1ff000     RBP: 0000000000000000 R8: 0000000000000000 R9: 00007fe7d1cd2738     R10: 00007fe7d1f2dbd0 R11: 0000000000000206 R12: 00007fe7be9ff700     R13: 00007fe7be9ff9c0 R14: 0000000000000000 R15: 0000000000000000     ORIG_RAX: 000000000000001c CS: 0033 SS: 002b [Fix] Commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 mitigates this issue if b6b8a1451fc40412c57d1 is applied (as is the case for the affected 3.13 distro kernel). However, the issue can still occur in some cases. [Workaround] In order to avoid this issue, the workload needs to be pinned to CPUs such that the function always executes locally. For the nested VM case, this means the L1 VM needs to have all vCPUs pinned to a unique CPU.
This can be accomplished with the following (for 2 vCPUs): virsh vcpupin <domain> 0 0 virsh vcpupin <domain> 1 1 [Test Case] - Deploy openstack on openstack - Run tempest on L1 cloud - Check kernel log of L1 nova-compute nodes (Although this may not necessarily be related to nested KVM) Potentially related: https://lkml.org/lkml/2014/11/14/656 Another test case is to do the following (on affected hardware): 1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce) 2) Create an L2 KVM VM inside the L1 VM with 1 vCPU 3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM Sometimes this is sufficient to reproduce the issue; I've observed that running KSM in the L1 VM can aggravate this issue (it calls native_flush_tlb_others). If this doesn't reproduce then you can do the following: 4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR taskset) between L1 vCPUs until the hang occurs. -- Original Description: When installing qemu-kvm on a VM, KSM is enabled. I have encountered this problem in trusty:$ lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.1 LTS Release: 14.04 Codename: trusty $ uname -a Linux juju-gema-machine-2 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux The way to see the behaviour: 1) $ more /sys/kernel/mm/ksm/run 0 2) $ sudo apt-get install qemu-kvm 3) $ more /sys/kernel/mm/ksm/run 1 To see the soft lockups, deploy a cloud on a virtualised env like ctsstack, run tempest on it, the compute nodes of the virtualised deployment will eventually stop responding with (run tempest 2 times at least):  24096.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24124.072003] BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:24791] [24152.072002] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24180.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24208.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24236.072004] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] [24264.072003] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:24791] I am not sure whether the problem is that we are enabling KSM on a VM or the problem is that nested KSM is not behaving properly. Either way I can easily reproduce, please contact me if you need further details.
2015-04-06 15:19:30 Ramy Asselin bug added subscriber Ramy Asselin
2015-04-08 13:30:22 Andy Whitcroft linux (Ubuntu Trusty): status In Progress Fix Committed
2015-04-09 09:27:46 VangelisAngelou bug added subscriber VangelisAngelou
2015-04-16 14:49:33 Choe, Cheng-Dae bug added subscriber Choe, Cheng-Dae
2015-04-17 14:03:48 Brad Figg tags cts trusty cts trusty verification-needed-trusty
2015-04-21 16:06:36 Chris J Arges linux (Ubuntu): assignee Chris J Arges (arges)
2015-04-21 16:06:38 Chris J Arges linux (Ubuntu): status Confirmed Fix Released
2015-04-21 16:06:41 Chris J Arges linux (Ubuntu): importance High Undecided
2015-04-21 16:06:53 Chris J Arges tags cts trusty verification-needed-trusty cts trusty verification-done-trusty
2015-04-29 15:02:25 Lei Wang bug added subscriber Ray Wang
2015-04-29 15:38:53 Launchpad Janitor linux (Ubuntu Trusty): status Fix Committed Fix Released
2015-04-29 15:38:53 Launchpad Janitor cve linked 2015-2666
2015-04-29 15:38:53 Launchpad Janitor cve linked 2015-2922
2015-06-10 22:09:56 John L. Villalovos bug added subscriber John L. Villalovos
2016-02-16 23:59:43 Duncan Idaho bug added subscriber Duncan Idaho