When bonnie++ is run in a loop, the system eventually hangs with
"rcu_sched: self-detected stall on CPU"
The time to failure is inconsistent: one run took 7 hours, the next more than 2 days.
Commands to reproduce the failure:
$ sudo apt-get install bonnie++
$ mkdir bonnie
$ while true; do bonnie++ -d bonnie; done &>>bonnie0.log &
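Because the time to failure varies so much, it may help to timestamp each pass so the log shows when the last successful run started. A minimal sketch, assuming bash; the helper name run_logged and its "0 means loop forever" convention are made up for illustration and are not part of bonnie++:

```shell
#!/bin/bash
# Hypothetical helper (not part of bonnie++): run a command repeatedly and
# print a UTC timestamp before each pass.
# First argument: number of passes, or 0 to loop forever as in the
# original reproduction loop. Remaining arguments: the command to run.
run_logged() {
    local max_runs=$1; shift
    local i=0
    while [ "$max_runs" -eq 0 ] || [ "$i" -lt "$max_runs" ]; do
        i=$((i + 1))
        echo "=== run $i started $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
        # Report a non-zero exit status but keep looping.
        "$@" || echo "=== run $i exited with status $? ==="
    done
}
```

With the function defined in the current shell, the reproduction loop above could be started as:
$ run_logged 0 bonnie++ -d bonnie &>>bonnie0.log &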
Stack trace:
[237019.072290] INFO: rcu_sched self-detected stall on CPU { 1} (t=19305216 jiffies g=580389 c=580388 q=84)
[237019.080901] CPU: 1 PID: 44 Comm: kswapd0 Tainted: GF 3.11.0-6-generic-lpae #12-Ubuntu
[237019.088879] [<c002bc00>] (unwind_backtrace+0x0/0x138) from [<c0026f1c>] (show_stack+0x10/0x14)
[237019.096700] [<c0026f1c>] (show_stack+0x10/0x14) from [<c05cbe50>] (dump_stack+0x74/0x90)
[237019.104051] [<c05cbe50>] (dump_stack+0x74/0x90) from [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798)
[237019.112262] [<c00bf37c>] (rcu_check_callbacks+0x31c/0x798) from [<c00492a0>] (update_process_times+0x38/0x64)
[237019.121254] [<c00492a0>] (update_process_times+0x38/0x64) from [<c008cdbc>] (tick_sched_handle+0x54/0x60)
[237019.129933] [<c008cdbc>] (tick_sched_handle+0x54/0x60) from [<c008d00c>] (tick_sched_timer+0x44/0x74)
[237019.138300] [<c008d00c>] (tick_sched_timer+0x44/0x74) from [<c005db50>] (__run_hrtimer+0x74/0x1d4)
[237019.146433] [<c005db50>] (__run_hrtimer+0x74/0x1d4) from [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0)
[237019.154800] [<c005e6f8>] (hrtimer_interrupt+0x10c/0x2c0) from [<c0492e44>] (arch_timer_handler_phys+0x28/0x30)
[237019.163871] [<c0492e44>] (arch_timer_handler_phys+0x28/0x30) from [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104)
[237019.173332] [<c00b8c2c>] (handle_percpu_devid_irq+0x6c/0x104) from [<c00b54ec>] (generic_handle_irq+0x20/0x30)
[237019.182402] [<c00b54ec>] (generic_handle_irq+0x20/0x30) from [<c0023ff4>] (handle_IRQ+0x38/0x94)
[237019.190378] [<c0023ff4>] (handle_IRQ+0x38/0x94) from [<c0008508>] (gic_handle_irq+0x28/0x5c)
[237019.198041] [<c0008508>] (gic_handle_irq+0x28/0x5c) from [<c05d1c00>] (__irq_svc+0x40/0x50)
[237019.205624] Exception stack(0xee2c1c18 to 0xee2c1c60)
[237019.210238] 1c00: 00000004 00000004
[237019.217666] 1c20: 00000008 00000001 ee2c1c8c ca208700 ca208700 0996b000 ca208708 00000001
[237019.225093] 1c40: 00000002 edb31300 00000003 ee2c1c60 c02f54fc c00923c8 200f0013 ffffffff
[237019.232523] [<c05d1c00>] (__irq_svc+0x40/0x50) from [<c00923c8>] (generic_exec_single+0x6c/0x94)
[237019.240500] [<c00923c8>] (generic_exec_single+0x6c/0x94) from [<c00924f4>] (smp_call_function_single+0x104/0x198)
[237019.249805] [<c00924f4>] (smp_call_function_single+0x104/0x198) from [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84)
[237019.259812] [<c0029920>] (broadcast_tlb_mm_a15_erratum+0x7c/0x84) from [<c0029adc>] (flush_tlb_page+0x74/0xa8)
[237019.268882] [<c0029adc>] (flush_tlb_page+0x74/0xa8) from [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0)
[237019.277484] [<c011fc8c>] (ptep_clear_flush_young+0x6c/0xd0) from [<c011a60c>] (page_referenced_one+0x64/0x1fc)
[237019.286554] [<c011a60c>] (page_referenced_one+0x64/0x1fc) from [<c011c034>] (page_referenced+0xf4/0x2e4)
[237019.295155] [<c011c034>] (page_referenced+0xf4/0x2e4) from [<c00fc410>] (shrink_active_list+0x1f0/0x35c)
[237019.303756] [<c00fc410>] (shrink_active_list+0x1f0/0x35c) from [<c00fdadc>] (shrink_lruvec+0x32c/0x598)
[237019.312279] [<c00fdadc>] (shrink_lruvec+0x32c/0x598) from [<c00fddb0>] (shrink_zone+0x68/0x180)
[237019.320176] [<c00fddb0>] (shrink_zone+0x68/0x180) from [<c00fe430>] (kswapd+0x568/0x9d4)
[237019.327527] [<c00fe430>] (kswapd+0x568/0x9d4) from [<c005aae0>] (kthread+0xa4/0xb0)
[237019.334487] [<c005aae0>] (kthread+0xa4/0xb0) from [<c0023198>] (ret_from_fork+0x14/0x3c)
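To compare stall durations across occurrences, the t= value can be pulled out of the first line of the trace. A rough sketch, assuming the message format shown above; the function name stall_jiffies is hypothetical. Converting jiffies to seconds depends on the kernel's CONFIG_HZ, which is not stated in this report, so the value is left in jiffies:

```shell
#!/bin/bash
# Hypothetical helper: read kernel log lines on stdin and print the t=
# jiffies count from each "self-detected stall on CPU" line.
# Assumes the message format shown in the trace above.
stall_jiffies() {
    sed -n 's/.*self-detected stall on CPU.*(t=\([0-9][0-9]*\) jiffies.*/\1/p'
}
```

Example use on a live system:
$ dmesg | stall_jiffies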
Setup details:
Quad-core A15 server nodes on Calxeda Midway hardware.
The failure has been seen twice with the DDR setting at DDR3 @ 1600 MT/s.
cat /proc/version_signature
Ubuntu 3.11.0-12.18-generic-lpae 3.11.3
The issue was first seen on Ubuntu 3.11.0-6.12-generic-lpae
cat /etc/issue
Ubuntu 13.04 \n \l
Additional debug information attached