Comment 0 for bug 1239800

Pradeep (pradeep-krishnamurthy) wrote :

When bonnie++ is run in a loop, the system eventually hangs with the message
"rcu_sched: self-detected stall on CPU".
The time to failure is inconsistent: one run took 7 hours, the next more than 2 days.

Commands to reproduce the failure:
$ sudo apt-get install bonnie++
$ mkdir bonnie
$ while true; do bonnie++ -d bonnie; done &>>bonnie0.log &
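
Because the time to failure varies so much, it may help to timestamp each bonnie++ run in the log. The following is only a sketch of one way to do that: run_logged is a hypothetical helper, not part of bonnie++, and the MAX argument (0 meaning "loop forever") is an assumption added so the loop can also be bounded.

```shell
# run_logged LOG MAX CMD [ARGS...]
# Hypothetical helper: run CMD repeatedly, appending its output to LOG
# together with a UTC start timestamp and the exit status of every run.
# MAX is the number of runs; 0 means loop until interrupted.
run_logged() {
    local log=$1 max=$2 i=0 status
    shift 2
    while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        printf '=== run %d started %s ===\n' "$i" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >>"$log"
        "$@" >>"$log" 2>&1
        status=$?
        printf '=== run %d exit status %d ===\n' "$i" "$status" >>"$log"
    done
}
```

With this defined, the reproduction loop above would become "run_logged bonnie0.log 0 bonnie++ -d bonnie &". The last "started" line in the log then gives a rough upper bound on when the hang occurred.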

Stack trace:
[237019.072290] INFO: rcu_sched self-detected stall on CPU { 1} (t=19305216 jiffies g=580389 c=580388 q=84)
[237019.080901] CPU: 1 PID: 44 Comm: kswapd0 Tainted: GF 3.11.0-6-generic-lpae #12-Ubuntu
[237019.088879] [<c002bc00>] (unwind_backtrace+0x0/0x138) from [<c0026f1c>] (show_stack+0x10/0x14)
[237019.096700] [<c0026f1c>] (show_stack+0x10/0x14) from [<c05cbe50>] (dump_stack+0x74/0x90)
[237019.104051] [<c05cbe50>] (dump_stack+0x74/0x90) from [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798)
[237019.112262] [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798) from [<c00492a0>] (update_process_times+0x38/0x64)
[237019.121254] [<c00492a0>] (update_process_times+0x38/0x64) from [<c008cdbc>] (tick_sched_handle+0x54/0x60)
[237019.129933] [<c008cdbc>] (tick_sched_handle+0x54/0x60) from [<c008d00c>] (tick_sched_timer+0x44/0x74)
[237019.138300] [<c008d00c>] (tick_sched_timer+0x44/0x74) from [<c005db50>] (__run_hrtimer+0x74/0x1d4)
[237019.146433] [<c005db50>] (__run_hrtimer+0x74/0x1d4) from [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0)
[237019.154800] [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0) from [<c0492e44>] (arch_timer_handler_phys+0x28/0x30)
[237019.163871] [<c0492e44>] (arch_timer_handler_phys+0x28/0x30) from [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104)
[237019.173332] [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104) from [<c00b54ec>] (generic_handle_irq+0x20/0x30)
[237019.182402] [<c00b54ec>] (generic_handle_irq+0x20/0x30) from [<c0023ff4>] (handle_IRQ+0x38/0x94)
[237019.190378] [<c0023ff4>] (handle_IRQ+0x38/0x94) from [<c0008508>] (gic_handle_irq+0x28/0x5c)
[237019.198041] [<c0008508>] (gic_handle_irq+0x28/0x5c) from [<c05d1c00>] (__irq_svc+0x40/0x50)
[237019.205624] Exception stack(0xee2c1c18 to 0xee2c1c60)
[237019.210238] 1c00: 00000004 00000004
[237019.217666] 1c20: 00000008 00000001 ee2c1c8c ca208700 ca208700 0996b000 ca208708 00000001
[237019.225093] 1c40: 00000002 edb31300 00000003 ee2c1c60 c02f54fc c00923c8 200f0013 ffffffff
[237019.232523] [<c05d1c00>] (__irq_svc+0x40/0x50) from [<c00923c8>] (generic_exec_single+0x6c/0x94)
[237019.240500] [<c00923c8>] (generic_exec_single+0x6c/0x94) from [<c00924f4>] (smp_call_function_single+0x104/0x198)
[237019.249805] [<c00924f4>] (smp_call_function_single+0x104/0x198) from [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84)
[237019.259812] [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84) from [<c0029adc>] (flush_tlb_page+0x74/0xa8)
[237019.268882] [<c0029adc>] (flush_tlb_page+0x74/0xa8) from [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0)
[237019.277484] [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0) from [<c011a60c>] (page_referenced_one+0x64/0x1fc)
[237019.286554] [<c011a60c>] (page_referenced_one+0x64/0x1fc) from [<c011c034>] (page_referenced+0xf4/0x2e4)
[237019.295155] [<c011c034>] (page_referenced+0xf4/0x2e4) from [<c00fc410>] (shrink_active_list+0x1f0/0x35c)
[237019.303756] [<c00fc410>] (shrink_active_list+0x1f0/0x35c) from [<c00fdadc>] (shrink_lruvec+0x32c/0x598)
[237019.312279] [<c00fdadc>] (shrink_lruvec+0x32c/0x598) from [<c00fddb0>] (shrink_zone+0x68/0x180)
[237019.320176] [<c00fddb0>] (shrink_zone+0x68/0x180) from [<c00fe430>] (kswapd+0x568/0x9d4)
[237019.327527] [<c00fe430>] (kswapd+0x568/0x9d4) from [<c005aae0>] (kthread+0xa4/0xb0)
[237019.334487] [<c005aae0>] (kthread+0xa4/0xb0) from [<c0023198>] (ret_from_fork+0x14/0x3c)
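
To compare occurrences across runs, the stall lines can be pulled out of a saved kernel log. This is a minimal sketch, assuming the log was saved to a file first (for example with "dmesg > dmesg.txt") and that stall lines follow the format of the first trace line above; rcu_stall_summary is a hypothetical helper name.

```shell
# rcu_stall_summary FILE
# Hypothetical helper: for every "rcu_sched self-detected stall" line in
# FILE, print the stalled CPU and the reported stall duration in jiffies.
# The pattern assumes the line format shown in the trace above.
rcu_stall_summary() {
    sed -n -E \
        's/.*rcu_sched:? self-detected stall on CPU \{ *([0-9]+)\} \(t=([0-9]+) jiffies.*/cpu=\1 jiffies=\2/p' \
        "$1"
}
```

Run against the trace above, this would print one "cpu=... jiffies=..." line per stall report. Converting jiffies to seconds requires the kernel's HZ value, which is not stated in this report.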

Setup details:
Quad-core A15 server nodes on Calxeda Midway hardware.
The failure has been seen twice with a DDR setting of DDR3 at 1600 MT/s.

$ cat /proc/version_signature
Ubuntu 3.11.0-12.18-generic-lpae 3.11.3
The issue was first seen on Ubuntu 3.11.0-6.12-generic-lpae.

$ cat /etc/issue
Ubuntu 13.04 \n \l

Additional debug information is attached.