Kernel hang under memory stress

Bug #433646 reported by Petr Vandrovec
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

When running memhog test program (which allocates lots of memory and then randomly touches pages) 9.10 alpha 6 gets stuck in read_swap_cache_async forever - or at least for longer than 48000 seconds, which is effectively forever:

<3>[50313.402614] BUG: soft lockup - CPU#0 stuck for 61s! [memhog:16322]
<4>[50313.402614] Modules linked in: vmblock vsock vmci vmmemctl vmhgfs pvscsi acpiphp binfmt_misc snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev psmouse serio_raw snd soundcore
 snd_page_alloc i2c_piix4 parport_pc lp intel_agp parport shpchp mptspi mptscsih mptbase e1000 scsi_transport_spi floppy
<6>[50313.402614] CPU 0:
<4>[50313.402614] Modules linked in: vmblock vsock vmci vmmemctl vmhgfs pvscsi acpiphp binfmt_misc snd_ens1371 gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ppdev psmouse serio_raw snd soundcore
 snd_page_alloc i2c_piix4 parport_pc lp intel_agp parport shpchp mptspi mptscsih mptbase e1000 scsi_transport_spi floppy
<6>[50313.402614] Pid: 16322, comm: memhog Not tainted 2.6.31-10-generic #34-Ubuntu GT5414E
<6>[50313.402614] RIP: 0010:[<ffffffff810ff6b6>] [<ffffffff810ff6b6>] read_swap_cache_async+0x36/0x120
<6>[50313.402614] RSP: 0018:ffff88002f54bd18 EFLAGS: 00000246
<6>[50313.402614] RAX: 00000000ffffffef RBX: ffff88002f54bd68 RCX: 0000000000000000
<6>[50313.402614] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff81980210
<6>[50313.402614] RBP: ffffffff81012b6e R08: 0000000000000000 R09: 0000000000000efd
<6>[50313.402614] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
<6>[50313.402614] R13: ffffc90010664f42 R14: 0000000000000000 R15: 0000000000000efd
<6>[50313.402614] FS: 00007ff09d9316f0(0000) GS:ffff8800019c6000(0063) knlGS:00000000f7e198d0
<6>[50313.402614] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
<6>[50313.402614] CR2: 00000000e8fa9000 CR3: 000000001a82e000 CR4: 00000000000006b0
<6>[50313.402614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<6>[50313.402614] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[50313.402614] Call Trace:
<4>[50313.402614] [<ffffffff810ff6de>] ? read_swap_cache_async+0x5e/0x120
<4>[50313.402614] [<ffffffff810ffe96>] ? valid_swaphandles+0x166/0x190
<4>[50313.402614] [<ffffffff810ff81f>] ? swapin_readahead+0x7f/0xb0
<4>[50313.402614] [<ffffffff810f373e>] ? do_swap_page+0x2ce/0x420
<4>[50313.402614] [<ffffffff8104a155>] ? finish_task_switch+0x65/0x120
<4>[50313.402614] [<ffffffff810129ce>] ? common_interrupt+0xe/0x13
<4>[50313.402614] [<ffffffff81272da6>] ? rb_erase+0xd6/0x160
<4>[50313.402614] [<ffffffff810f3b5b>] ? handle_mm_fault+0x2cb/0x3c0
<4>[50313.402614] [<ffffffff81032439>] ? default_spin_lock_flags+0x9/0x10
<4>[50313.402614] [<ffffffff81525fba>] ? do_page_fault+0x16a/0x370
<4>[50313.402614] [<ffffffff81523975>] ? page_fault+0x25/0x30

In my current hang this function calls find_get_page(..., 0x17A1), which in turn returns NULL because radix_tree_lookup_slot(...., 0x17A1) returns NULL, followed by swapcache_prepare(0x17A1), which returns EEXIST because its swapentry has has_cache==1 and count==1.

Problem does not occur on 9.04.

Tags: kj-triage
Revision history for this message
Petr Vandrovec (petr-vmware) wrote :
Revision history for this message
Petr Vandrovec (petr-vmware) wrote :
Revision history for this message
Petr Vandrovec (petr-vmware) wrote :

I've reproed it with all daemons & programs killed - see new attachment. Apparently one of memhog threads went to sleep on malloc while adding entry to the radix tree:

<6>[ 1789.980812] memhog R running task 0 3039 3036 0x00020000
<4>[ 1789.980812] ffff88003309dc38 0000000000000086 ffff880000010800 0000000000015580
<4>[ 1789.980812] ffff88003992c7c0 0000000000015580 0000000000015580 0000000000015580
<4>[ 1789.980812] 0000000000015580 ffff88003992c7c8 0000000000015580 0000000000015580
<4>[ 1789.980812] Call Trace:
<4>[ 1789.980812] [<ffffffff8105630c>] __cond_resched+0x1c/0x50
<4>[ 1789.980812] [<ffffffff815215ab>] _cond_resched+0x2b/0x40
<4>[ 1789.980812] [<ffffffff811103e9>] kmem_cache_alloc+0xd9/0x150
<4>[ 1789.980812] [<ffffffff81271db6>] ? radix_tree_preload+0x36/0xa0
<4>[ 1789.980812] [<ffffffff81271db6>] radix_tree_preload+0x36/0xa0
<4>[ 1789.980812] [<ffffffff810ff5d1>] add_to_swap_cache+0x21/0xd0
<4>[ 1789.980812] [<ffffffff810ff706>] read_swap_cache_async+0x86/0x120
<4>[ 1789.980812] [<ffffffff810ffe96>] ? valid_swaphandles+0x166/0x190
<4>[ 1789.980812] [<ffffffff810ff81f>] swapin_readahead+0x7f/0xb0
<4>[ 1789.980812] [<ffffffff810f373e>] do_swap_page+0x2ce/0x420
<4>[ 1789.980812] [<ffffffff8104a155>] ? finish_task_switch+0x65/0x120
<4>[ 1789.980812] [<ffffffff810129ce>] ? common_interrupt+0xe/0x13
<4>[ 1789.980812] [<ffffffff810f3b5b>] handle_mm_fault+0x2cb/0x3c0
<4>[ 1789.980812] [<ffffffff81032439>] ? default_spin_lock_flags+0x9/0x10
<4>[ 1789.980812] [<ffffffff81525fba>] do_page_fault+0x16a/0x370
<4>[ 1789.980812] [<ffffffff81523975>] page_fault+0x25/0x30

while other thread then wants to read this very same entry. I have no idea why kswapd does not get woken up and swap something to unblock this thread.

Revision history for this message
Petr Vandrovec (petr-vmware) wrote :

Ah, none of Ubuntu kernels is built with CONFIG_PREEMPT=y - only with PREEMPT_NONE or PREEMPT_VOLUNTARY. That explains why thread above never runs, as there is no cond_resched() in a read_swap_cache_async() loop mentioned above.

Revision history for this message
Petr Vandrovec (petr-vmware) wrote :

Attached is patch which seems to fix problem on my system. I'm not sure whether cond_resched() is necessary in failure path when add fails (err != -ENOMEM), as I cannot trigger that path in the test, but one in EEXIST path is rather vital...

Signed-off-by: Petr Vandrovec <email address hidden>

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -293,8 +293,10 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                 * Swap entry may have been freed since our caller observed it.
                 */
                err = swapcache_prepare(entry);
- if (err == -EEXIST) /* seems racy */
+ if (err == -EEXIST) { /* seems racy */
+ cond_resched();
                        continue;
+ }
                if (err) /* swp entry is obsolete ? */
                        break;

@@ -320,6 +322,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
                ClearPageSwapBacked(new_page);
                __clear_page_locked(new_page);
                swapcache_free(entry, NULL);
+ if (err != -ENOMEM)
+ cond_resched();
        } while (err != -ENOMEM);

        if (new_page)

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Triaged a while ago but has not had any updated comments for quite some time. Please let us know if this issue remains in the current Ubuntu release, http://www.ubuntu.com/getubuntu/download . If the issue remains, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-triage
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Petr Vandrovec (petr-vmware) wrote :

It was fixed upstream. I'm not entirely sure whether 9.10 got this, but as VMware did not receive any further bug reports I guess it did.

commit 31a5639623a487d6db996c8138c9e53fef2e2d91
Author: Daisuke Nishimura <email address hidden>
Date: Mon Sep 21 17:02:50 2009 -0700

    mm: add_to_swap_cache() must not sleep

    After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    read_swap_cache_async() will busy-wait while a entry doesn't exist in swap
    cache but it has SWAP_HAS_CACHE flag.

    Such entries can exist on add/delete path of swap cache. On add path,
    add_to_swap_cache() is called soon after SWAP_HAS_CACHE flag is set, and
    on delete path, swapcache_free() will be called (SWAP_HAS_CACHE flag is
    cleared) soon after __delete_from_swap_cache() is called. So, the
    busy-wait works well in most cases.

    But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
    read_swap_cache_async() tries to swap-in the same entry on the same cpu.

    This patch calls radix_tree_preload() before swapcache_prepare() and
    divides add_to_swap_cache() into two part: radix_tree_preload() part and
    radix_tree_insert() part(define it as __add_to_swap_cache()).

    Signed-off-by: Daisuke Nishimura <email address hidden>
    Cc: KAMEZAWA Hiroyuki <email address hidden>
    Cc: Balbir Singh <email address hidden>
    Cc: Hugh Dickins <email address hidden>
    Cc: Johannes Weiner <email address hidden>
    Signed-off-by: Andrew Morton <email address hidden>
    Signed-off-by: Linus Torvalds <email address hidden>

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.