Based on the stack trace:
[ 1692.658756] Call Trace:
[ 1692.658762] [c00020739ba9b970] [0000000024008842] 0x24008842 (unreliable)
[ 1692.658769] [c00020739ba9bb48] [c00000000001c270] __switch_to+0x2a0/0x4d0
[ 1692.658774] [c00020739ba9bba8] [c000000000d048a4] __schedule+0x2a4/0xb00
[ 1692.658777] [c00020739ba9bc78] [c000000000d05140] schedule+0x40/0xc0
[ 1692.658781] [c00020739ba9bc98] [c000000000537bf4] jbd2_log_wait_commit+0xf4/0x1b0
[ 1692.658784] [c00020739ba9bd18] [c0000000004c5ee4] ext4_sync_file+0x354/0x620
[ 1692.658788] [c00020739ba9bd78] [c00000000042afb8] vfs_fsync_range+0x78/0x170
[ 1692.658790] [c00020739ba9bdc8] [c00000000042b138] do_fsync+0x58/0xd0
[ 1692.658792] [c00020739ba9be08] [c00000000042b528] SyS_fsync+0x28/0x40
[ 1692.658795] [c00020739ba9be28] [c00000000000b284] system_call+0x58/0x6c
[ 1692.658839] Kernel panic - not syncing: hung_task: blocked tasks
[ 1692.659238] CPU: 48 PID: 785 Comm: khungtaskd Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
[ 1692.659835] Call Trace:
[ 1692.660025] [c000008fd0eefbf8] [c000000000cea13c] dump_stack+0xb0/0xf4 (unreliable)
[ 1692.660564] [c000008fd0eefc38] [c000000000110020] panic+0x148/0x328
[ 1692.661004] [c000008fd0eefcd8] [c000000000233a08] watchdog+0x2c8/0x420
[ 1692.661429] [c000008fd0eefdb8] [c000000000140068] kthread+0x1a8/0x1b0
[ 1692.661881] [c000008fd0eefe28] [c00000000000b654] ret_from_kernel_thread+0x5c/0x88
[ 1692.662439] Sending IPI to other CPUs
[ 1693.971250] IPI complete
This IPI being sent to all other CPUs suggests that they were preempted by an NMI in order to stop execution and, likely, call panic() for a dump. If that is true, the behavior can be configured through sysctl variables:
kernel.hardlockup_panic = 0 -> THIS, for HARD lockups
kernel.hung_task_panic = 0 -> THIS, for SCHEDULING deadlocks
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.softlockup_panic = 0 -> THIS, for SOFT lockups
kernel.unknown_nmi_panic = 0
vm.panic_on_oom = 0 -> THIS for OOM issues
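For completeness, these knobs can be flipped at runtime with sysctl(8), or persisted across reboots with a drop-in file. A minimal sketch (the /etc/sysctl.d/ file name and the chosen values are just examples, not recommendations for this case):

```shell
# Runtime change (lost on reboot); requires root.
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.softlockup_panic=1

# Persistent change: drop a fragment in /etc/sysctl.d/ and reload it.
cat > /etc/sysctl.d/90-lockup-panic.conf <<'EOF'
kernel.hung_task_panic = 1
kernel.softlockup_panic = 1
EOF
sysctl --system

# Verify the effective values.
sysctl -n kernel.hung_task_panic kernel.softlockup_panic
```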
And the panic would not happen for live virsh dumps (the live dump likely causes delays in the VM and fully dirties the pagecache, so the I/Os can't be committed as fast as the pages are being dirtied).
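If panicking is not desired while a live dump is in progress, one option is to widen the detection windows so the transient steal-time stalls don't trip the watchdogs. A sketch with example values only (defaults are 120s for hung_task_timeout_secs and 10s for watchdog_thresh; the soft lockup detector fires at roughly 2x watchdog_thresh):

```shell
# Example mitigation during a live dump; values are illustrative.
sysctl -w kernel.hung_task_timeout_secs=600   # default 120s; 0 disables hung-task detection
sysctl -w kernel.watchdog_thresh=30           # soft lockup fires at ~2x this value
```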
Checking the sosreport you sent:
$ cat sos_commands/kernel/sysctl_-a | grep -i panic
kernel.hardlockup_panic = 0
kernel.hung_task_panic = 1
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_warn = 0
kernel.softlockup_panic = 1
vm.panic_on_oom = 0
You have kernel.softlockup_panic = 1; this is what causes the panic whenever the guest accumulates too much "steal time" to keep up with its needs (which is what makes the lockups happen).
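One quick way to test that theory (assuming a panic-free run is acceptable for the experiment): turn the panic off at runtime and retry the live dump. The lockups would then only be logged instead of crashing the guest:

```shell
# Disable panicking on soft lockups; the detector still logs to dmesg.
sysctl -w kernel.softlockup_panic=0

# After retrying the live dump, check whether lockups were still detected.
dmesg | grep -i "soft lockup"
```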
Am I missing something?