3.11 and 3.13 EXT4 race condition - kernel panic - ext4_es_lru_del

Bug #1389787 reported by Rafael David Tinoco
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: none

Bug Description

The following stack traces were brought to my attention (they occurred several times on different machines):

crash> bt
PID: 2877 TASK: ffff881009b42ee0 CPU: 1 COMMAND: "git"
#0 [ffff880c0c6bb9d0] machine_kexec at ffffffff8104b141
#1 [ffff880c0c6bba40] crash_kexec at ffffffff810d5a58
#2 [ffff880c0c6bbb10] oops_end at ffffffff81748b38
#3 [ffff880c0c6bbb40] no_context at ffffffff8172dd02
#4 [ffff880c0c6bbb90] __bad_area_nosemaphore at ffffffff8172dee4
#5 [ffff880c0c6bbbf0] bad_area at ffffffff8172df5d
#6 [ffff880c0c6bbc20] __do_page_fault at ffffffff8174bda8
#7 [ffff880c0c6bbd30] do_page_fault at ffffffff8174bde7
#8 [ffff880c0c6bbd60] page_fault at ffffffff81747e98
[exception RIP: kmem_cache_alloc_trace+106]
RIP: ffffffff8119b22a RSP: ffff880c0c6bbe18 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff88008407e0c0 RCX: 00000000031b6bc9
RDX: 00000000031b6bc8 RSI: 00000000000080d0 RDI: 00000000000173c0
RBP: ffff880c0c6bbe68 R8: ffff88081fa373c0 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: ffff88081f403800
R13: 0000000000629050 R14: ffffffff811bcfe4 R15: 00000000000080d0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880c0c6bbe70] alloc_pipe_info at ffffffff811bcfe4
#10 [ffff880c0c6bbe90] get_pipe_inode at ffffffff811bd0aa
#11 [ffff880c0c6bbeb0] create_pipe_files at ffffffff811bd648
#12 [ffff880c0c6bbef0] __do_pipe_flags at ffffffff811bd7d2
#13 [ffff880c0c6bbf30] sys_pipe2 at ffffffff811bd8f0
#14 [ffff880c0c6bbf70] sys_pipe at ffffffff811bd980
#15 [ffff880c0c6bbf80] system_call_fastpath at ffffffff8175099d
RIP: 00007f2d245da7e7 RSP: 00007fff6a4cfb18 RFLAGS: 00010283
RAX: 0000000000000016 RBX: ffffffff8175099d RCX: 000000000000001c
RDX: 00007f2d248aeac0 RSI: 0000000000000000 RDI: 00007fff6a4cfad0
RBP: 0000000000000002 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: ffffffff811bd980
R13: ffff880c0c6bbf78 R14: 0000000000000000 R15: 000000000232efc0
ORIG_RAX: 0000000000000016 CS: 0033 SS: 002b

AND

Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152703] SysRq : Show backtrace of all active CPUs
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152708] sending NMI to all CPUs:
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152714] NMI backtrace for cpu 0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152720] CPU: 0 PID: 13579 Comm: python Tainted: GF I 3.11.0-15-generic #25~precise1-Ubuntu
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152721] Hardware name: HP ProLiant BL460c G7, BIOS I27 05/05/2011
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152724] task: ffff880085f50000 ti: ffff880a64afa000 task.ti: ffff880a64afa000
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152725] RIP: 0010:[<ffffffff81050ff5>] [<ffffffff81050ff5>] __ticket_spin_lock+0x25/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152736] RSP: 0018:ffff880a64afb748 EFLAGS: 00000293
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152737] RAX: 000000000000db25 RBX: ffff880be8576480 RCX: 000000018066002a
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152739] RDX: 000000000000db2e RSI: 0000000000000001 RDI: ffff880be8576480
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152740] RBP: ffff880a64afb748 R08: 0000000000000000 R09: ffffea002407db00
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152741] R10: ffffffff8128d8ed R11: 0000000000000001 R12: ffff8807572480b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152743] R13: ffff8807572481b8 R14: 00000000ffffffff R15: ffff8805e9c77908
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152745] FS: 00007ff00d4db700(0000) GS:ffff880c0ba00000(0000) knlGS:0000000000000000
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152748] CR2: 00007fb66906d8e0 CR3: 000000053b3dd000 CR4: 00000000000007f0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152749] Stack:
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152750] ffff880a64afb758 ffffffff8174759e ffff880a64afb778 ffffffff8128eb02
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152756] ffff8807572480b0 ffff880757248200 ffff880a64afb798 ffffffff81270595
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152759] ffff880a64afb798 ffff8807572480b0 ffff880a64afb7c8 ffffffff81257621
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152763] Call Trace:
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152770] [<ffffffff8174759e>] _raw_spin_lock+0xe/0x20
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152776] [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152792] [<ffffffff811ce490>] evict+0xc0/0x1d0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152795] [<ffffffff811ce5e1>] dispose_list+0x41/0x50
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152797] [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152806] [<ffffffff811b7703>] prune_super+0x193/0x1b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152812] [<ffffffff8115af04>] shrink_slab+0x154/0x300
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152816] [<ffffffff8115dbf8>] do_try_to_free_pages+0x218/0x290
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152819] [<ffffffff8115dfa4>] try_to_free_pages+0xe4/0x1a0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152823] [<ffffffff811520b8>] __alloc_pages_nodemask+0x618/0x9a0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152827] [<ffffffff817336c8>] ? mem_cgroup_update_tree+0x9c/0x121
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152831] [<ffffffff811920a3>] alloc_pages_vma+0xa3/0x150
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152837] [<ffffffff81170249>] do_wp_page+0xc9/0x7c0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152840] [<ffffffff81172d8c>] handle_pte_fault+0x1ec/0x230
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152842] [<ffffffff81173ff0>] handle_mm_fault+0x2a0/0x3e0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152847] [<ffffffff8174b9ff>] __do_page_fault+0x1af/0x560
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152850] [<ffffffff811b8de7>] ? cp_new_stat+0x107/0x120
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152856] [<ffffffff810ca06c>] ? do_futex+0x7c/0x1b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152858] [<ffffffff811b91b5>] ? SYSC_newstat+0x25/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152861] [<ffffffff8174bde7>] do_page_fault+0x37/0x70
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152864] [<ffffffff81747e98>] page_fault+0x28/0x30

Tags: cscc saucy
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

For this specific backtrace...

Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152776] [<ffffffff8128eb02>] ext4_es_lru_del+0x32/0x80
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152792] [<ffffffff811ce490>] evict+0xc0/0x1d0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152795] [<ffffffff811ce5e1>] dispose_list+0x41/0x50
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152797] [<ffffffff8174756f>] ? _raw_spin_trylock+0xf/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
...
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152842] [<ffffffff81173ff0>] handle_mm_fault+0x2a0/0x3e0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152847] [<ffffffff8174b9ff>] __do_page_fault+0x1af/0x560
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152850] [<ffffffff811b8de7>] ? cp_new_stat+0x107/0x120
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152856] [<ffffffff810ca06c>] ? do_futex+0x7c/0x1b0
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152858] [<ffffffff811b91b5>] ? SYSC_newstat+0x25/0x30
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152861] [<ffffffff8174bde7>] do_page_fault+0x37/0x70

I DO NOT have a dump for this one (I am not sure whether any old dump exists for this backtrace).

For this backtrace...

I can see that memory pressure seems to be high when the page fault occurs (also signalled by the fact that page reclaiming is going through the "slowest path": __alloc_pages_nodemask -> __alloc_pages_slowpath -> __alloc_pages_direct_reclaim -> __perform_reclaim -> try_to_free_pages), and direct reclaim is a synchronous, blocking way of satisfying the allocation. This explains why dropping caches might be helping here, since under those conditions the page allocation takes a different path.

Moving a bit...

Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152812] [<ffffffff8115af04>] shrink_slab+0x154/0x300
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152816] [<ffffffff8115dbf8>] do_try_to_free_pages+0x218/0x290
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152819] [<ffffffff8115dfa4>] try_to_free_pages+0xe4/0x1a0

The kernel is trying to shrink the slab caches to free some memory so a page frame can be allocated for the page fault that just happened.
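
As a side note, here is a conceptual sketch of the shrinker interface that shrink_slab() drives. The names follow the count/scan split of the 3.12+ API (the 3.11 trace above still goes through the older prune_super-style callback), so treat the exact field names as approximate rather than a verbatim copy of the kernel source:

    #include <linux/shrinker.h>

    static unsigned long demo_count_objects(struct shrinker *shrink,
                                            struct shrink_control *sc)
    {
            /* report how many reclaimable objects this cache currently holds */
            return 0;
    }

    static unsigned long demo_scan_objects(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
            /* free up to sc->nr_to_scan objects; the superblock shrinker is
             * where prune_icache_sb() -> evict() -> ext4_evict_inode() ->
             * ext4_es_lru_del() ends up being called */
            return 0;
    }

    static struct shrinker demo_shrinker = {
            .count_objects = demo_count_objects,
            .scan_objects  = demo_scan_objects,
            .seeks         = DEFAULT_SEEKS,
    };

    /* register_shrinker(&demo_shrinker); ... unregister_shrinker(&demo_shrinker); */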

Moving further... it gets to the point where aged cache pages are scanned (LRU) to check whether they can be freed, AND we probably hit a file-backed page (page cache / demand paging) whose shrinker was called (to flush data, free cache, update filesystem metadata, etc.):

# prune the superblock's dentry/inode caches
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152800] [<ffffffff811cf5d5>] prune_icache_sb+0x185/0x340
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152806] [<ffffffff811b7703>] prune_super+0x193/0x1b0

# call the inode eviction path (to free the in-memory inode)
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152781] [<ffffffff81270595>] ext4_clear_inode+0x45/0x90
Oct 15 14:19:08 LGEARND8B5 kernel: [796493.152786] [<ffffffff81257621>] ext4_evict_inode+0x81/0x510

Finally we arrive at ext4_es_lru_del(struct inode *inode):

1023 spin_lock(&sbi->s_es_lru_lock);

The task is hung and someone is holding this lock. I would have to have this core to know who and why.
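
For context, this is approximately what ext4_es_lru_del() looks like in these kernels (reconstructed from the 3.11/3.13-era fs/ext4/extents_status.c; treat the exact body as approximate). Every inode eviction takes the per-filesystem s_es_lru_lock, so one task stuck while holding it stalls every other task reclaiming ext4 inodes:

    void ext4_es_lru_del(struct inode *inode)
    {
            struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
            struct ext4_inode_info *ei = EXT4_I(inode);

            spin_lock(&sbi->s_es_lru_lock);         /* the line ~1023 referenced above */
            if (!list_empty(&ei->i_es_lru))
                    list_del_init(&ei->i_es_lru);   /* drop inode from the extent-status LRU */
            spin_unlock(&sbi->s_es_lru_lock);
    }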

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

For the other backtrace (the one I have a core dump for and can analyze with the crash tool):
* I'm still analysing this...

PID: 4453 TASK: ffff8800362dddc0 CPU: 16 COMMAND: "git"
#0 [ffff880092791a60] machine_kexec at ffffffff8104b141
#1 [ffff880092791ad0] crash_kexec at ffffffff810d5a58
#2 [ffff880092791ba0] oops_end at ffffffff81748b38
#3 [ffff880092791bd0] no_context at ffffffff8172dd02
#4 [ffff880092791c20] __bad_area_nosemaphore at ffffffff8172dee4
#5 [ffff880092791c80] bad_area_nosemaphore at ffffffff8172df16
#6 [ffff880092791c90] __do_page_fault at ffffffff8174bc12
#7 [ffff880092791da0] do_page_fault at ffffffff8174bde7
#8 [ffff880092791dd0] page_fault at ffffffff81747e98
[exception RIP: kmem_cache_alloc+102]
RIP: ffffffff8119c316 RSP: ffff880092791e88 RFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffffffffffffea RCX: 00000000260000be
RDX: 00000000260000bd RSI: 00000000000000d0 RDI: 00000000000173c0
RBP: ffff880092791ed8 R8: ffff880c0bb173c0 R9: 0000000000001165
R10: 00007f2e4ac88dcc R11: 0000000000000246 R12: ffff880c0b403800
R13: ffff880ce78ccc00 R14: ffffffff8108f636 R15: 00000000000000d0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880092791ee0] prepare_creds at ffffffff8108f636
#10 [ffff880092791f00] sys_faccessat at ffffffff811b22b4
#11 [ffff880092791f70] sys_access at ffffffff811b24b8
#12 [ffff880092791f80] system_call_fastpath at ffffffff8175099d
RIP: 00007f2e4d883097 RSP: 00007f2e4ac87c88 RFLAGS: 00000206
RAX: 0000000000000015 RBX: ffffffff8175099d RCX: ffffffffffffffff
RDX: 00000000007bde07 RSI: 0000000000000000 RDI: 00000000007bdd80
RBP: 0000000000000001 R8: 0000000000000066 R9: 0000000000001165
R10: 00007f2e4ac88dcc R11: 0000000000000246 R12: ffffffff811b24b8
R13: ffff880092791f78 R14: 0000000000000002 R15: 0000000000000000
ORIG_RAX: 0000000000000015 CS: 0033 SS: 002b

We can see that the access() system call (which calls sys_faccessat) was issued from user mode (CS: 0033) to check whether this process has permission to handle the given file:

FD FILE DENTRY INODE TYPE PATH
3 ffff880be925be00 ffff88076dd34b40 ffff88082a8434b0 REG /home001/jeongku.choi/8994_recent/.repo/project-objects/LG_apps/android/vendor/lge/apps/LGSettings.git/objects/pack/tmp_p

You can check that the file structure ffff880be925be00 is present in frame #10 (because it is related to the system call's arguments and was pushed onto the stack):

#10 [ffff880092791f00] sys_faccessat at ffffffff811b22b4
ffff880092791f08: 0000000000000001 00007f2e30001050
ffff880092791f18: ffff880be925be00 0000000000000000
ffff880092791f28: ffff880092791f78 ffffffff811b46b6
ffff880092791f38: 00007f2e4ac87d00 00007f2e4cce9590
ffff880092791f48: 00007f2e4cce9590 0000000000000457
ffff880092791f58: 0000000000000000 0000000000000002
ffff880092791f68: ffff880092791f78 ffffffff811b24b8

To check this file's permission... continuing on the backtrace... (next comment)
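
As an aside, the userspace side of this is just a plain access(2) call, which enters the kernel via sys_access -> sys_faccessat and builds a temporary override credential with prepare_creds(). A minimal illustration (the path below is a placeholder, not the real one from the dump):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path = "/tmp/example-pack-file";    /* placeholder path */

            /* access(2) checks permissions against the caller's real uid/gid,
             * which is why the kernel builds temporary credentials first */
            if (access(path, R_OK) == 0)
                    printf("%s is readable\n", path);
            else
                    perror("access");

            return 0;
    }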

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Continuing on the backtrace:

#8 [ffff880092791dd0] page_fault at ffffffff81747e98
[exception RIP: kmem_cache_alloc+102]
RIP: ffffffff8119c316 RSP: ffff880092791e88 RFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffffffffffffea RCX: 00000000260000be
RDX: 00000000260000bd RSI: 00000000000000d0 RDI: 00000000000173c0
RBP: ffff880092791ed8 R8: ffff880c0bb173c0 R9: 0000000000001165
R10: 00007f2e4ac88dcc R11: 0000000000000246 R12: ffff880c0b403800
R13: ffff880ce78ccc00 R14: ffffffff8108f636 R15: 00000000000000d0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880092791ee0] prepare_creds at ffffffff8108f636
#10 [ffff880092791f00] sys_faccessat at ffffffff811b22b4
#11 [ffff880092791f70] sys_access at ffffffff811b24b8

In "prepare_creds" execution we called "kmem_cache_alloc" which triggered a page fault that was handled by the same task... (so we had a page fault inside kernel context). With the rest of the back trace:

#3 [ffff880092791bd0] no_context at ffffffff8172dd02
#4 [ffff880092791c20] __bad_area_nosemaphore at ffffffff8172dee4
#5 [ffff880092791c80] bad_area_nosemaphore at ffffffff8172df16
#6 [ffff880092791c90] __do_page_fault at ffffffff8174bc12
#7 [ffff880092791da0] do_page_fault at ffffffff8174bde7

But the kernel couldn't handle this page fault. Things that might have happened:

0) the error code can be PF_RSVD, PF_USER or PF_PROT (to be checked);
   if not,
0.1) the faulting address is not inside the vmalloc area (return -1) (since it is a page fault for a kernel virtual address);
1) this wasn't a spurious fault (for sure) (one caused by the CPU re-walking the page tables by itself).

The kernel oopses with no_context...
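
For reference, here is a sketch of the kernel-mode branch of the fault handler that implements checks 0/0.1/1 above, paraphrased from memory of arch/x86/mm/fault.c of that era (treat the exact structure as approximate):

    /* inside __do_page_fault(), when the faulting address is a kernel address */
    if (unlikely(fault_in_kernel_space(address))) {
            if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
                    /* lazily sync a missing vmalloc mapping into this mm */
                    if (vmalloc_fault(address) >= 0)
                            return;
            }

            /* stale TLB entry the CPU can resolve by re-walking the page tables */
            if (spurious_fault(error_code, address))
                    return;

            /* neither case applies: bad_area_nosemaphore() -> no_context() -> oops,
             * which is exactly the tail we see in frames #3-#6 of the backtrace */
            bad_area_nosemaphore(regs, error_code, address);
            return;
    }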

I'm still working on this... I will come back here with comments soon.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Small comment:

We might be dealing with two different situations/bugs for the two different stack traces we got (a possible race condition / deadlock in the ext4 inode-eviction path under high memory pressure, AND kmem_cache_alloc dereferencing a bad address)...

As I said, I am investigating both and checking whether they are correlated.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1389787

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: saucy
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote : Re: 3.11 memory consumption leads to HANG (not sure if 3.13 suffers from this).

For the kmem_cache_alloc problem I found a really promising fix, which says that the commit:

""""
commit ba5bb147330a8737b6b5a812cc774c79c070704b
Author: Al Viro <email address hidden>
Date: Thu Mar 21 02:21:19 2013 -0400

pipe: take allocation and freeing of pipe_inode_info out of ->i_mutex
""""

(present in every tag from v3.10 up to today...)

caused a use-after-free problem on VERY rare occasions (at least until a workload like yours came along). This really looks like the problem we are having here, and it is fixed by the following commit:

""""
commit b0d8d2292160bb63de1972361ebed100c64b5b37
Author: Linus Torvalds <email address hidden>
Date: Mon Dec 2 09:44:51 2013 -0800

vfs: fix subtle use-after-free of pipe_inode_info

The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:

        spin_lock(&inode->i_lock);
        if (!--pipe->files) {
                inode->i_pipe = NULL;
                kill = 1;
        }
        spin_unlock(&inode->i_lock);
        __pipe_unlock(pipe);
        if (kill)
                free_pipe_info(pipe);

where the final freeing is done last.

HOWEVER. The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".

This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.

Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.

Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early. And since there were two users of this
pattern, create a helper function for it.

Introduced commit ba5bb147330a ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").

Reported-by: Simon Kirby <email address hidden>
Reported-by: Ian Applegate <email address hidden>
Acked-by: Al Viro <email address hidden>
Cc: <email address hidden> # v3.10+
Signed-off-by: Linus Torvalds <email address hidden>
""""

The fix is contained in the following tags:

inaddy@inerddy:/bugs/kernel/upstream$ git tag --contains b0d8d2292160bb63de1972361ebed100c64b5b37
v3.13
v3.13-rc3
v3.13-rc4
v3.13-rc5
v3.13-rc6
v3.13-rc7
v3.13-rc8
v3.14
v3.14-rc1
v3.14-rc2

And v3.13 might not suffer from this issue.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I'm going to mark this bug as invalid for now, since it is reported against an EOL Saucy kernel. Please mark the bug as confirmed if the issue still persists with Trusty/3.13.

Changed in linux (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
penalvch (penalvch) wrote :

Rafael David Tinoco, thank you for reporting this and helping make Ubuntu better. The Saucy enablement kernel has been EoL since August 2014 as per https://wiki.ubuntu.com/Kernel/LTSEnablementStack .

While it appears you have a thorough analysis of the root cause, your best bet would be to upgrade these machines to the Trusty enablement kernel.

Would that work for you?

Changed in linux (Ubuntu):
importance: Medium → Low
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Bianco Veigel (binco) wrote :

I'm also affected by this bug on Trusty with the latest kernel (3.13.0-53-generic):

[500345.624596] BUG: soft lockup - CPU#0 stuck for 23s! [khugepaged:54]
[500345.630989] Modules linked in: ipt_REJECT btrfs ufs qnx4 hfsplus[500345.636604] BUG: soft lockup - CPU#1 stuck for 23s! [kthreadd:2]
[500345.636605] Modules linked in: ipt_REJECT btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xt_multiport iptable_filter ip_tables x_tables kvm_amd edac_core kvm k10temp edac_mce_amd i2c_piix4 serio_raw shpchp mac_hid nfsd auth_rpcgss nfs_acl nfs lp binfmt_misc lockd sunrpc fscache parport xfs libcrc32c xts gf128mul dm_crypt raid10 raid0 multipath linear raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx xor raid6_pq raid1 pata_acpi hid_generic usbhid hid radeon psmouse i2c_algo_bit ahci ttm libahci drm_kms_helper pata_atiixp mpt2sas drm r8169 raid_class scsi_transport_sas mii wmi
[500345.636627] CPU: 1 PID: 2 Comm: kthreadd Tainted: G D 3.13.0-53-generic #89-Ubuntu
[500345.636628] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./M3A785GMH/128M, BIOS P1.40 11/26/2009
[500345.636628] task: ffff880214bd9800 ti: ffff880214458000 task.ti: ffff880214458000
[500345.636629] RIP: 0010:[<ffffffff8172ae05>] [<ffffffff8172ae05>] _raw_spin_lock+0x35/0x50
[500345.636631] RSP: 0018:ffff8802144599f8 EFLAGS: 00000202
[500345.636632] RAX: 00000000000079f7 RBX: ffffffff8127f5bc RCX: 00000000000087ea
[500345.636632] RDX: 00000000000087ec RSI: 00000000000087ec RDI: ffff8800cdef5c80
[500345.636633] RBP: ffff8802144599f8 R08: 9038000000000000 R09: 0107e5b3481c0000
[500345.636633] R10: feda1a8eb150d207 R11: 00000000000003eb R12: 0000000000000046
[500345.636634] R13: ffffffff8115e67e R14: ffff880214459a20 R15: ffffffffffffffff
[500345.636635] FS: 00007fa8ed61e740(0000) GS:ffff88021fc40000(0000) knlGS:0000000000000000
[500345.636636] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[500345.636636] CR2: 00007f8263077000 CR3: 0000000210041000 CR4: 00000000000007e0
[500345.636637] Stack:
[500345.636637] ffff880214459a18 ffffffff8127ffb9 ffff880107e5b2a8 ffff880107e5b2a8
[500345.636639] ffff880214459a30 ffffffff81260091 ffff880107e5b3f8 ffff880214459a50
[500345.636641] ffffffff8124614a ffff880107e5b2a8 ffff880107e5b3b0 ffff880214459a78
[500345.636642] Call Trace:
[500345.636643] [<ffffffff8127ffb9>] ext4_es_lru_del+0x29/0x70
[500345.636644] [<ffffffff81260091>] ext4_clear_inode+0x41/0x90
[500345.636646] [<ffffffff8124614a>] ext4_evict_inode+0x8a/0x4d0
[500345.636647] [<ffffffff811d9490>] evict+0xb0/0x1b0
[500345.636649] [<ffffffff811d95c9>] dispose_list+0x39/0x50
[500345.636651] [<ffffffff811da4f7>] prune_icache_sb+0x47/0x60
[500345.636652] [<ffffffff811c1875>] super_cache_scan+0x105/0x170
[500345.636654] [<ffffffff81161357>] shrink_slab+0x1c7/0x370
[500345.636655] [<ffffffff8116448d>] do_try_to_free_pages+0x3ed/0x540
[500345.636657] [<ffffffff811646cc>] try_to_free_pages+0xec/0x180
[500345.636658] [<ffffffff81159385>] __alloc_pages_nodemask+0x7d5/0xb80
[500345.636660] [<ffffffff810a7400>] ? load_balance+0x120/0x890
[500345.636662] [<ffffffff810652e3>] copy_process.part.26+0x143/0x16b0
[500345.636663] [<ffffffff8109d665>] ...

Changed in linux (Ubuntu):
status: Expired → Confirmed
Revision history for this message
Bianco Veigel (binco) wrote :

I've hit this bug for the second time within one week on a 24/7 server system. Is there any workaround for this? Currently I'm thinking about switching from ext4 to ext3.

Revision history for this message
penalvch (penalvch) wrote :

Bianco Veigel, it will help immensely if you filed a new report via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Bianco,

based on my comment #1:

...
Finally we arrive at ext4_es_lru_del(struct inode *inode):

1023 spin_lock(&sbi->s_es_lru_lock);

The task is hung and someone is holding this lock. I would have to have this core to know who and why.
...

Could you provide me with a kdump?

If you are not sure how to do that, I suggest this:

http://www.inaddy.org/mini-howtos/dumps/using-ubuntu-crash-dump-with-kdump

If the kdump is too big, you can share it using Dropbox or something similar.

Thank you

Rafael

summary: - 3.11 memory consumption leads to HANG (not sure if 3.13 suffers from
- this).
+ 3.11 and 3.13 EXT4 race condition - kernel panic - ext4_es_lru_del
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Brad Figg (brad-figg)
tags: added: cscc