I have been trying to find an easy way to reproduce this for days.
We initially observed it in OPNFV Armband, when we tried to upgrade our Ubuntu Xenial installation kernel to linux-image-generic-hwe-16.04 (4.8).
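For reference, the upgrade itself is just the standard hwe meta-package install (a minimal sketch, assuming stock Xenial apt sources):
$ sudo apt-get install linux-image-generic-hwe-16.04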
In our environment, this was easily triggered on compute nodes when launching multiple VMs (we suspected OVS, QEMU, etc.).
However, in order to rule out our specifics, we looked for a simple way to reproduce it on all ThunderX nodes we have access to, and we finally found it:
$ apt-get install stress-ng
$ stress-ng --hdd 1024
We tested different FW versions, provided by both chip/board manufacturers, and with all of them the result is 100% reproducible, leading to a kernel Oops [1]:
[ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
[ 726.077908] Tainted: G W I 4.8.0-41-generic #44~16.04.1-Ubuntu
[ 726.085850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.094383] kworker/0:1 D ffff0000080861bc 0 312 2 0x00000000
[ 726.094401] Workqueue: events vmstat_shepherd
[ 726.094404] Call trace:
[ 726.094411] [<ffff0000080861bc>] __switch_to+0x94/0xa8
[ 726.094418] [<ffff0000089854f4>] __schedule+0x224/0x718
[ 726.094421] [<ffff000008985a20>] schedule+0x38/0x98
[ 726.094425] [<ffff000008985d84>] schedule_preempt_disabled+0x14/0x20
[ 726.094428] [<ffff000008987644>] __mutex_lock_slowpath+0xd4/0x168
[ 726.094431] [<ffff000008987730>] mutex_lock+0x58/0x70
[ 726.094437] [<ffff0000080c552c>] get_online_cpus+0x44/0x70
[ 726.094440] [<ffff00000820ca24>] vmstat_shepherd+0x3c/0xe8
[ 726.094446] [<ffff0000080e1c60>] process_one_work+0x150/0x478
[ 726.094449] [<ffff0000080e1fd8>] worker_thread+0x50/0x4b8
[ 726.094453] [<ffff0000080e8eac>] kthread+0xec/0x100
[ 726.094456] [<ffff000008083690>] ret_from_fork+0x10/0x40
Over the last few days, I tested all 4.8-* kernels and 4.10 (zesty backport); the soft lockup happens with each and every one of them.
On the other hand, 4.4.0-45-generic seems to work perfectly fine under normal conditions (probably newer 4.4.0-* kernels too, but due to a regression in the ethernet drivers after 4.4.0-45, we can't easily test those), yet running stress-ng leads to the same oops.
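For anyone trying to reproduce this: a simple way to catch the hung-task reports while stress-ng is running (a minimal sketch, assuming a default Xenial userspace) is to follow the kernel log in a second terminal:
$ dmesg -w | grep -E 'blocked for more than|Call trace'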
[1] http://paste.ubuntu.com/24172516/