Comment 65 for bug 1531768

Colin Ian King (colin-king) wrote :

On an idle Xenial cloud image I'm seeing:

[ 1485.236760] [<ffff800000086ad0>] __switch_to+0x90/0xa8
[ 1485.236772] [<ffff800000143e80>] __tick_nohz_idle_enter+0x50/0x3f0
[ 1485.236776] [<ffff800000144478>] tick_nohz_idle_enter+0x40/0x70
[ 1485.236785] [<ffff80000010baf0>] cpu_startup_entry+0x288/0x2d8
[ 1485.236791] [<ffff80000008fca8>] secondary_start_kernel+0x120/0x130
[ 1485.236795] [<000000004008290c>] 0x4008290c

After a while I get:

[ 2462.806971] rcu_sched kthread starved for 15002 jiffies! g2579 c2578 f0x0 s3 ->state=0x1
[ 2667.835351] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2667.836918] 0-...: (66 GPs behind) idle=cf0/0/0 softirq=5177/5177 fqs=0
[ 2667.838801] 2-...: (0 ticks this GP) idle=73a/0/0 softirq=4570/4570 fqs=0
[ 2667.840696] 3-...: (64 GPs behind) idle=eba/0/0 softirq=4654/4654 fqs=0
[ 2667.842533] (detected by 1, t=15002 jiffies, g=2638, c=2637, q=4389)

At this point sleeping blocks: for example, running strace on sleep(1) in the VM shows nanosleep({1, 0}) sleeping forever, and one has to SIGINT it as it never times out.
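
For reference, the same check can be made without strace. Below is a minimal C sketch of the test: it just calls nanosleep() for one second and reports whether it returned. This is illustrative only; I actually ran strace against the sleep binary.

/* Minimal sketch of the check described above: call nanosleep() for
 * one second and report if it returns. On the stalled VM it never
 * returned and had to be interrupted with SIGINT. Illustrative only;
 * I actually ran strace against /bin/sleep. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

        printf("calling nanosleep({1, 0})...\n");
        if (nanosleep(&ts, NULL) == 0)
                printf("woke up normally\n");  /* never reached while stalled */
        else
                perror("nanosleep");           /* e.g. EINTR after SIGINT */
        return 0;
}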

Also, the secondary_start_kernel() frame in the trace suggests that the VM puts CPUs to sleep and wakes them on a timer.

I can trigger this more often with more CPUs in the VM, and also by loading the host; for example, generating a lot of cache or memory activity on the host triggers the initial hangs more frequently than an idle host does.
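
By way of illustration, one crude way to generate that kind of cache and memory activity on the host is a loop that keeps rewriting a buffer larger than the last-level cache. The 256 MiB size below is an arbitrary assumption, not the exact load I used:

/* Crude host-side load generator: keep rewriting a buffer larger than
 * the last-level cache to produce sustained cache and memory traffic.
 * The 256 MiB size is an arbitrary assumption, not the exact load I
 * used; run one instance per host CPU for more pressure. */
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MiB */

int main(void)
{
        char *buf = malloc(BUF_SIZE);

        if (!buf)
                return 1;
        for (;;)                         /* thrash until killed */
                memset(buf, 0x5a, BUF_SIZE);
}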

So I suspect a cpuhotplug and nohz combination is causing the issues here.
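
If that theory is right, repeatedly offlining and onlining a secondary CPU in the VM should make the stall easier to hit. The sketch below is speculative, not a confirmed reproducer; cpu1 and the one-second delay are arbitrary choices, and it needs to run as root:

/* Speculative cpuhotplug stress: repeatedly offline and online cpu1
 * via sysfs inside the VM. If the stall really comes from a
 * cpuhotplug/nohz interaction this should make it easier to trigger.
 * cpu1 and the one-second delay are arbitrary; run as root. */
#include <stdio.h>
#include <unistd.h>

static void set_online(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        const char *cpu1 = "/sys/devices/system/cpu/cpu1/online";

        for (;;) {
                set_online(cpu1, "0");   /* take cpu1 down */
                sleep(1);
                set_online(cpu1, "1");   /* bring it back up */
                sleep(1);
        }
}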