Precise BUG: soft lockup in flush_tlb_others_ipi

Bug #1467955 reported by Rafael David Tinoco
2
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Unassigned
Precise
Won't Fix
Undecided
Unassigned

Bug Description

The following stack trace (with kernel dump) was brought to me. It looks like this crash is happening every day (at least once) in a KVM + CEPH backend environment.

"""
[1796904.032010] BUG: soft lockup - CPU#0 stuck for 23s! [java:6383]
[1796904.036004] Modules linked in: isofs psmouse virtio_balloon serio_raw acpiphp floppy
[1796904.036004] CPU 0
[1796904.036004] Modules linked in: isofs psmouse virtio_balloon serio_raw acpiphp floppy
[1796904.036004]
[1796904.036004] Pid: 6383, comm: java Not tainted 3.2.0-76-virtual #111-Ubuntu OpenStack Foundation OpenStack Nova
[1796904.036004] RIP: 0010:[<ffffffff81046922>] [<ffffffff81046922>] flush_tlb_others_ipi+0x122/0x130
[1796904.036004] RSP: 0018:ffff880065791d58 EFLAGS: 00000202
[1796904.036004] RAX: 0000000000000002 RBX: ffffea0003470bf0 RCX: 0000000000000002
[1796904.036004] RDX: 0000000000000002 RSI: 0000000000000040 RDI: 0000000000000296
[1796904.036004] RBP: ffff880065791d88 R08: ffffffff81e0c0a0 R09: 0000000000000040
[1796904.036004] R10: ffffea0003471240 R11: 0000000000000000 R12: ffff880065791e20
[1796904.036004] R13: ffff880059e96f20 R14: ffff880116249848 R15: 00ff880065791d78
[1796904.036004] FS: 00007f83612d2700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[1796904.036004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1796904.036004] CR2: 00007f83be381420 CR3: 0000000118be0000 CR4: 00000000000006f0
[1796904.036004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1796904.036004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1796917.981999] Process java (pid: 6383, threadinfo ffff880065790000, task ffff880053c0dbc0)
[1796917.981999] Stack:
[1796917.981999] 00007f83612ccfff ffff880059e96f20 ffff880116200e00 ffff8801162010d0
[1796917.981999] 00007f83612cd000 ffff880116200e00 ffff880065791d98 ffffffff81046aae
[1796917.981999] ffff880065791db8 ffffffff81046b7b 00007f83611d5000 ffff880065791e20
[1796917.981999] Call Trace:
[1796917.982394] ata2: lost interrupt (Status 0x58)
[1796917.981999] [<ffffffff81046aae>] native_flush_tlb_others+0xe/0x10
[1796917.981999] [<ffffffff81046b7b>] flush_tlb_mm+0x5b/0xa0
[1796917.981999] [<ffffffff8113ba06>] tlb_flush_mmu+0x46/0x90
[1796917.981999] [<ffffffff8113ba64>] tlb_finish_mmu+0x14/0x40
[1796917.981999] [<ffffffff8113e3a7>] zap_page_range+0xb7/0xd0
[1796917.981999] [<ffffffff8113a85d>] madvise_vma+0xfd/0x140
[1796917.981999] [<ffffffff8107b917>] ? __set_task_blocked+0x37/0x80
[1796917.981999] [<ffffffff81095b27>] ? getnstimeofday+0x57/0xe0
[1796917.981999] [<ffffffff8113aa7e>] sys_madvise+0x1de/0x280
[1796917.981999] [<ffffffff81666b82>] system_call_fastpath+0x16/0x1b
[1796917.981999] Code: 41 8d b6 cf 00 00 00 49 8d 7d 18 ff 90 d0 00 00 00 49 83 bc 24 98 c0 e0 81 00 0f 84 74 ff ff ff 66 0f 1f 84 00 00 00 00 00 f3 90 <49> 83 7d 18 00 75 f7 e9 5d ff ff ff 66 90 55 48 89 e5 66 66 66
[1796917.981999] Call Trace:
[1796917.981999] [<ffffffff81046aae>] native_flush_tlb_others+0xe/0x10
[1796917.981999] [<ffffffff81046b7b>] flush_tlb_mm+0x5b/0xa0
[1796917.981999] [<ffffffff8113ba06>] tlb_flush_mmu+0x46/0x90
[1796917.981999] [<ffffffff8113ba64>] tlb_finish_mmu+0x14/0x40
[1796917.981999] [<ffffffff8113e3a7>] zap_page_range+0xb7/0xd0
[1796917.981999] [<ffffffff8113a85d>] madvise_vma+0xfd/0x140
[1796917.981999] [<ffffffff8107b917>] ? __set_task_blocked+0x37/0x80
[1796917.981999] [<ffffffff81095b27>] ? getnstimeofday+0x57/0xe0
[1796917.981999] [<ffffffff8113aa7e>] sys_madvise+0x1de/0x280
[1796917.981999] [<ffffffff81666b82>] system_call_fastpath+0x16/0x1b
[1796917.992066] ata2: drained 65536 bytes to clear DRQ
"""

Analysis Bellow...

Tags: cscc
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Analysing core dump with crash tool:

----
SYSTEM MAP: /boot/System.map-3.2.0-80-virtual
DEBUG KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-82-virtual
DUMPFILE: ./VmCore
CPUS: 2
DATE: Sat Jun 6 00:23:24 2015
UPTIME: 1 days, 11:37:29
LOAD AVERAGE: 1.96, 1.11, 0.54
TASKS: 183
NODENAME: activemq2
RELEASE: 3.2.0-80-virtual
VERSION: #116-Ubuntu SMP Mon Mar 23 17:28:52 UTC 2015
MACHINE: x86_64 (1799 Mhz)
MEMORY: 4 GB
PANIC: "[128249.422864] Kernel panic - not syncing: softlockup: hung tasks"
PID: 6345
COMMAND: "java"
TASK: ffff880037412de0 [THREAD_INFO: ffff880037694000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
----
PID: 6345 TASK: ffff880037412de0 CPU: 0 COMMAND: "java"
#0 [ffff88011fc03cd0] machine_kexec at ffffffff8103970a
...
#7 [ffff88011fc03fb0] apic_timer_interrupt at ffffffff81667b9e
--- <IRQ stack> ---
#8 [ffff880037695cd8] apic_timer_interrupt at ffffffff81667b9e
[exception RIP: flush_tlb_others_ipi+290]
RIP: ffffffff81046932 RSP: ffff880037695d88 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880117c06a00 RCX: 0000000000000002
RDX: 0000000000000002 RSI: 0000000000000040 RDI: 0000000000000296
RBP: ffff880037695db8 R8: ffffffff81e0d0a0 R9: 0000000000000040
R10: 0000000000000000 R11: ffff8801163df4d0 R12: ffff880037695d28
R13: 0000000000000000 R14: ffffffff81143575 R15: ffff880037695d68
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
----

It is clear that we are locked inside flush_tlb_others_ipi. Disassembling flush_tlb_others_ipi I could see we are locked inside a loop waiting for a cpumask to be emptied. This cpumask is emptied when all other CPUs received the IPI and processed the event "INVALIDATE_TLB_VECTOR_START" properly.

Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

----

194 if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
0xffffffff81046865 <+85>: mov %gs:0xca10,%edx
0xffffffff8104688e <+126>: test %rax,%rax
0xffffffff81046899 <+137>: jne 0xffffffff81046900 <flush_tlb_others_ipi+240>

195 /*
196 * We have to send the IPI only to
197 * CPUs affected.
198 */
199 apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
0xffffffff81046900 <+240>: mov 0xc95c19(%rip),%rax # 0xffffffff81cdc520
0xffffffff8104690e <+254>: lea 0x18(%r13),%rdi
0xffffffff81046912 <+258>: callq *0xd0(%rax)

200 INVALIDATE_TLB_VECTOR_START + sender);
0xffffffff81046907 <+247>: lea 0xcf(%r14),%esi

201
202 while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
0xffffffff81046918 <+264>: cmpq $0x0,-0x7e1f2f68(%r12)
0xffffffff81046921 <+273>: je 0xffffffff8104689b <flush_tlb_others_ipi+139>
0xffffffff81046927 <+279>: nopw 0x0(%rax,%rax,1)
0xffffffff81046932 <+290>: cmpq $0x0,0x18(%r13)
0xffffffff81046937 <+295>: jne 0xffffffff81046930 <flush_tlb_others_ipi+288>
0xffffffff81046939 <+297>: jmpq 0xffffffff8104689b <flush_tlb_others_ipi+139>
0xffffffff8104693e: xchg %ax,%ax

203 cpu_relax();
204 }

----

#9 [ffff880037695dc0] native_flush_tlb_others at ffffffff81046abe
#10 [ffff880037695dd0] flush_tlb_mm at ffffffff81046b8b
...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

1) This might be happening because the IPI was sent to a offline CPU (just one example of miss-synchronization from flush_tlb_others_ipi logic) during a shutdown process (as discussed in the thread: https://lkml.org/lkml/2012/7/19/53)

At that time the fix wasn't picked because the IPI mechanism for flushing other cpu's TLB was being changed to what we have nowadays, the smp_call_function_many called by "flush_tlb_others".

commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
Author: Alex Shi <email address hidden>
Date: Thu Jun 28 09:02:23 2012 +0800

x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR

Is the one refactoring this code and probably the responsible for fixing this erratic behavior.

OR

2) MOST LIKELY, because the IPI was not delivered to a "context switched" virtual CPU on the host side (as discussed in: https://lkml.org/lkml/2012/6/4/24), causing the running CPU to spin waiting for the other cpu to ACK something did not arrive to it.

- Benchmark tool observing issues with flush_tlb_others_ipi on a workload similar to a "message broker" (like this case):

http://www.kernelhub.org/?p=2&msg=1209

- First idea of paravirt TLB flushes to avoid CPUs to stay spinning waiting for IPIs to be finished on context switched CPUs:

http://www.kernelhub.org/?p=2&msg=1218

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I've been able to back port the following commit:

"""
commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
Author: Alex Shi <email address hidden>
Date: Thu Jun 28 09:02:23 2012 +0800

x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR

There are 32 INVALIDATE_TLB_VECTOR now in kernel. That is quite big
amount of vector in IDT. But it is still not enough, since modern x86
sever has more cpu number. That still causes heavy lock contention
in TLB flushing.

The patch using generic smp call function to replace it. That saved 32
vector number in IDT, and resolved the lock contention in TLB
flushing on large system.

In the NHM EX machine 4P * 8cores * HT = 64 CPUs, hackbench pthread
has 3% performance increase.

Signed-off-by: Alex Shi <email address hidden>
Link: http://<email address hidden>
Signed-off-by: H. Peter Anvin <email address hidden>
"""

Responsible to alter the logic for the flush_tlb_others_ipi sequence. I also back-ported the following needed commits:

"""
commit 3a4f7b0a59006a3986b8ed6faf0031f1e5232db4
Author: Alex Shi <email address hidden>
Date: Thu Jun 28 09:02:17 2012 +0800

x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range

commit 3331548b0d3907b1ab84e86239e149b8a52cda5d
Author: Jan Beulich <email address hidden>
Date: Tue Nov 29 11:03:46 2011 +0000

x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
"""

Right now I'm sending the source code to a kernel builder machine and will provide a hotfixed kernel, to be tested, soon.

Changed in linux (Ubuntu Precise):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → In Progress
Changed in linux (Ubuntu):
status: In Progress → Fix Released
description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I have provided the following PPA:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1467955

with a 3.2 kernel with new flush_tlb logic (removing IRQ vectors and using smp_function_call_xxxx).

Please, if you are suffering from this issue, give this kernel a try and add comments giving feedback.

-Rafael

Changed in linux (Ubuntu Precise):
status: In Progress → Incomplete
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in linux (Ubuntu Precise):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release

Changed in linux (Ubuntu Precise):
status: Incomplete → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.