I'm attaching the crash tool output from the 3.13 kernel dump.

Most likely related to the situation already found in the following case:

-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540

Handled by Chris Arges and me in LKML discussions with Ingo and Linus:

-> http://www.kernelhub.org/?p=2&msg=683682

FOR NOW, it is LIKELY that I'll rely on the already known recommendations for ProLiant (including the ones related to X2APIC mode):

-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

So we can TRY TO GUARANTEE that there are no LOST IRQs (IPIs) with the firmware you're using. Hopefully, with the proper APIC mode set as HP recommends, we will not have those IPI problems.

OBS: Whenever IPIs are lost (we've seen this on some nested KVMs and some buggy HW) we can be locked up in the SMP callback state machine. This means that the state machine loses IPI ACKs and loops forever trying to shut down the CPU so that the SMP task queue can continue.

I'll SOON provide a comment with SUGGESTIONS and a request for FEEDBACK.

################

For now, from the 3.13 kernel dump, the most interesting part:

We had 7 CPUs executing the migration kernel thread (for the SMP callback state machine execution):

#### migration tasks (state machine loop)

>    93      2   4  ffff8808147b47d0  RU   0.0       0      0  [migration/4]
>   118      2   9  ffff881814a2c7d0  RU   0.0       0      0  [migration/9]
>   123      2  10  ffff88081404c7d0  RU   0.0       0      0  [migration/10]
>   128      2  11  ffff881814a4c7d0  RU   0.0       0      0  [migration/11]
>   138      2  13  ffff881814a647d0  RU   0.0       0      0  [migration/13]
>   165      2  18  ffff8810149ec7d0  RU   0.0       0      0  [migration/18]
>   195      2  24  ffff881014a647d0  RU   0.0       0      0  [migration/24]

This logic will try to migrate tasks from one CPU to another. For that to happen, the migration threads rely on the state machine logic of shutting CPUs down before migrating the tasks (turning off IRQs, etc). The state machine, which shuts down the CPUs in phases, relies on the SMP callbacks below.
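
To make the "lost ACK" failure mode concrete, here is a minimal single-threaded sketch of a phased state machine in which every participating CPU has to acknowledge a phase before the next one starts. It is only loosely modeled on the kernel's stop-machine logic: the names (stop_data, set_state, ack_state, phase_advances) and the phase list are simplified assumptions for illustration, not the actual 3.13 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Phases of the model (simplified; an assumption, not the kernel's enum). */
enum stop_state {
    STOP_NONE,
    STOP_PREPARE,
    STOP_DISABLE_IRQ,
    STOP_RUN,
    STOP_EXIT,
};

/* Hypothetical shared state for all CPUs taking part. */
struct stop_data {
    enum stop_state state;  /* current phase */
    unsigned int num_cpus;  /* CPUs that must acknowledge each phase */
    unsigned int pending;   /* ACKs still missing for the current phase */
};

/* Enter a phase: every participating CPU owes one ACK again. */
static void set_state(struct stop_data *sd, enum stop_state next)
{
    sd->pending = sd->num_cpus;
    sd->state = next;
}

/* One CPU acknowledges the current phase; the last ACK advances it. */
static void ack_state(struct stop_data *sd)
{
    assert(sd->pending > 0);
    if (--sd->pending == 0 && sd->state != STOP_EXIT)
        set_state(sd, sd->state + 1);
}

/*
 * Deliver ACKs from `acking` CPUs for the current phase and report
 * whether the machine moved on. If even one ACK is missing, the phase
 * never changes and every CPU polling `state` would spin forever.
 */
static bool phase_advances(struct stop_data *sd, unsigned int acking)
{
    enum stop_state before = sd->state;

    for (unsigned int i = 0; i < acking && sd->state == before; i++)
        ack_state(sd);
    return sd->state != before;
}
```

With 7 participating CPUs, as in the dump, 7 ACKs move the model from STOP_PREPARE to STOP_DISABLE_IRQ, while 6 ACKs leave it stuck with one ACK pending. That stuck state is what a set of migration threads spinning in RU state would look like from the outside.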

We had 3 CPUs in a part of the kernel that we have already identified as problematic under certain conditions and/or HW.

**

> 17247      1  23  ffff881007055fc0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 17247  TASK: ffff881007055fc0  CPU: 23  COMMAND: "qemu-system-x86"
 #0 [ffff88203eac6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eac6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eac6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eac6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8810ea7c96e0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8810ea7c96e0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8810ea7c96e0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bc87bc6  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8810ea7c96e0] generic_exec_single at ffffffff810db712 !!!!
CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710
109                     cpu_relax();
110     }

**

> 21036      1  27  ffff8810b69947d0  RU   1.0 7484828 1401940  qemu-system-x86

PID: 21036  TASK: ffff8810b69947d0  CPU: 27  COMMAND: "qemu-system-x86"
 #0 [ffff88203eb46e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eb46e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eb46ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eb46ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8814959f7670  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8814959f7670  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8814959f7670  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000282  R14: 0000000000000000  R15: 0000000000000100
    ORIG_RAX: 0000000000000100  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8814959f7670] generic_exec_single at ffffffff810db712 !!!!
CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710
109                     cpu_relax();
110     }

**

> 18516      1  31  ffff881dd54a2fe0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 18516  TASK: ffff881dd54a2fe0  CPU: 31  COMMAND: "qemu-system-x86"
 #0 [ffff88203ebc6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203ebc6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203ebc6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203ebc6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff881dd55597a0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff881dd55597a0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff881dd55597a0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bca7bc8  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881dd55597a0] generic_exec_single at ffffffff810db712 !!!!
CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710
109                     cpu_relax();
110     }
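
The loop above spins until the target CPU clears CSD_FLAG_LOCK, which it only does after the IPI has been delivered and handled. A small user-space sketch of that handshake, assuming a hypothetical csd_model struct and a spin bound that the kernel loop does not have (the bound is added here purely so the lost-IPI case can be observed instead of hanging):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CSD_FLAG_LOCK 0x01u  /* bit 0, matching the testb $0x1 above */

/* Hypothetical stand-in for the kernel's per-request call data. */
struct csd_model {
    atomic_uint flags;
};

/* Sender side: mark the request as in flight before sending the IPI. */
static void csd_lock(struct csd_model *csd)
{
    atomic_fetch_or_explicit(&csd->flags, CSD_FLAG_LOCK,
                             memory_order_acquire);
}

/* Target side: called from the IPI handler once the callback has run. */
static void csd_unlock(struct csd_model *csd)
{
    atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK,
                              memory_order_release);
}

/*
 * Bounded variant of the wait loop (the bound is an assumption of this
 * sketch). Returns 0 once the lock bit is clear, or -1 if it is still
 * set after max_spins polls, i.e. nobody ever ran csd_unlock().
 */
static int csd_lock_wait_bounded(struct csd_model *csd,
                                 unsigned long max_spins)
{
    assert(csd != NULL);
    for (unsigned long i = 0; ; i++) {
        unsigned int f = atomic_load_explicit(&csd->flags,
                                              memory_order_acquire);
        if (!(f & CSD_FLAG_LOCK))
            return 0;
        if (i >= max_spins)
            return -1;
    }
}
```

If csd_lock() is called and the IPI never reaches the target, csd_unlock() never runs and the wait returns -1 in this model. Without the bound, as in the dumped kernel, the sender keeps spinning at generic_exec_single+130, which matches what the three qemu-system-x86 tasks show.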