Precise BUG: soft lockup in flush_tlb_others_ipi
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned | ||
Precise |
Won't Fix
|
Undecided
|
Unassigned |
Bug Description
The following stack trace (with kernel dump) was brought to me. It looks like this crash is happening every day (at least once) in a KVM + CEPH backend environment.
"""
[1796904.032010] BUG: soft lockup - CPU#0 stuck for 23s! [java:6383]
[1796904.036004] Modules linked in: isofs psmouse virtio_balloon serio_raw acpiphp floppy
[1796904.036004] CPU 0
[1796904.036004] Modules linked in: isofs psmouse virtio_balloon serio_raw acpiphp floppy
[1796904.036004]
[1796904.036004] Pid: 6383, comm: java Not tainted 3.2.0-76-virtual #111-Ubuntu OpenStack Foundation OpenStack Nova
[1796904.036004] RIP: 0010:[<
[1796904.036004] RSP: 0018:ffff880065
[1796904.036004] RAX: 0000000000000002 RBX: ffffea0003470bf0 RCX: 0000000000000002
[1796904.036004] RDX: 0000000000000002 RSI: 0000000000000040 RDI: 0000000000000296
[1796904.036004] RBP: ffff880065791d88 R08: ffffffff81e0c0a0 R09: 0000000000000040
[1796904.036004] R10: ffffea0003471240 R11: 0000000000000000 R12: ffff880065791e20
[1796904.036004] R13: ffff880059e96f20 R14: ffff880116249848 R15: 00ff880065791d78
[1796904.036004] FS: 00007f83612d270
[1796904.036004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1796904.036004] CR2: 00007f83be381420 CR3: 0000000118be0000 CR4: 00000000000006f0
[1796904.036004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1796904.036004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1796917.981999] Process java (pid: 6383, threadinfo ffff880065790000, task ffff880053c0dbc0)
[1796917.981999] Stack:
[1796917.981999] 00007f83612ccfff ffff880059e96f20 ffff880116200e00 ffff8801162010d0
[1796917.981999] 00007f83612cd000 ffff880116200e00 ffff880065791d98 ffffffff81046aae
[1796917.981999] ffff880065791db8 ffffffff81046b7b 00007f83611d5000 ffff880065791e20
[1796917.981999] Call Trace:
[1796917.982394] ata2: lost interrupt (Status 0x58)
[1796917.981999] [<ffffffff81046
[1796917.981999] [<ffffffff81046
[1796917.981999] [<ffffffff8113b
[1796917.981999] [<ffffffff8113b
[1796917.981999] [<ffffffff8113e
[1796917.981999] [<ffffffff8113a
[1796917.981999] [<ffffffff8107b
[1796917.981999] [<ffffffff81095
[1796917.981999] [<ffffffff8113a
[1796917.981999] [<ffffffff81666
[1796917.981999] Code: 41 8d b6 cf 00 00 00 49 8d 7d 18 ff 90 d0 00 00 00 49 83 bc 24 98 c0 e0 81 00 0f 84 74 ff ff ff 66 0f 1f 84 00 00 00 00 00 f3 90 <49> 83 7d 18 00 75 f7 e9 5d ff ff ff 66 90 55 48 89 e5 66 66 66
[1796917.981999] Call Trace:
[1796917.981999] [<ffffffff81046
[1796917.981999] [<ffffffff81046
[1796917.981999] [<ffffffff8113b
[1796917.981999] [<ffffffff8113b
[1796917.981999] [<ffffffff8113e
[1796917.981999] [<ffffffff8113a
[1796917.981999] [<ffffffff8107b
[1796917.981999] [<ffffffff81095
[1796917.981999] [<ffffffff8113a
[1796917.981999] [<ffffffff81666
[1796917.992066] ata2: drained 65536 bytes to clear DRQ
"""
Analysis Bellow...
description: | updated |
Changed in linux (Ubuntu Precise): | |
status: | In Progress → Incomplete |
Changed in linux (Ubuntu): | |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
Changed in linux (Ubuntu Precise): | |
assignee: | Rafael David Tinoco (rafaeldtinoco) → nobody |
tags: | added: cscc |
Analysing core dump with crash tool:
---- map-3.2. 0-80-virtual debug/boot/ vmlinux- 3.2.0-82- virtual interrupt at ffffffff81667b9e interrupt at ffffffff81667b9e others_ ipi+290]
SYSTEM MAP: /boot/System.
DEBUG KERNEL: /usr/lib/
DUMPFILE: ./VmCore
CPUS: 2
DATE: Sat Jun 6 00:23:24 2015
UPTIME: 1 days, 11:37:29
LOAD AVERAGE: 1.96, 1.11, 0.54
TASKS: 183
NODENAME: activemq2
RELEASE: 3.2.0-80-virtual
VERSION: #116-Ubuntu SMP Mon Mar 23 17:28:52 UTC 2015
MACHINE: x86_64 (1799 Mhz)
MEMORY: 4 GB
PANIC: "[128249.422864] Kernel panic - not syncing: softlockup: hung tasks"
PID: 6345
COMMAND: "java"
TASK: ffff880037412de0 [THREAD_INFO: ffff880037694000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
----
PID: 6345 TASK: ffff880037412de0 CPU: 0 COMMAND: "java"
#0 [ffff88011fc03cd0] machine_kexec at ffffffff8103970a
...
#7 [ffff88011fc03fb0] apic_timer_
--- <IRQ stack> ---
#8 [ffff880037695cd8] apic_timer_
[exception RIP: flush_tlb_
RIP: ffffffff81046932 RSP: ffff880037695d88 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880117c06a00 RCX: 0000000000000002
RDX: 0000000000000002 RSI: 0000000000000040 RDI: 0000000000000296
RBP: ffff880037695db8 R8: ffffffff81e0d0a0 R9: 0000000000000040
R10: 0000000000000000 R11: ffff8801163df4d0 R12: ffff880037695d28
R13: 0000000000000000 R14: ffffffff81143575 R15: ffff880037695d68
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
----
It is clear that we are locked inside flush_tlb_ others_ ipi. Disassembling flush_tlb_ others_ ipi I could see we are locked inside a loop waiting for a cpumask to be emptied. This cpumask is emptied when all other CPUs received the IPI and processed the event "INVALIDATE_ TLB_VECTOR_ START" properly.