Based on the stack trace:
[ 1692.658756] Call Trace:
[ 1692.658762] [c00020739ba9b970] [0000000024008842] 0x24008842 (unreliable)
[ 1692.658769] [c00020739ba9bb48] [c00000000001c270] __switch_to+0x2a0/0x4d0
[ 1692.658774] [c00020739ba9bba8] [c000000000d048a4] __schedule+0x2a4/0xb00
[ 1692.658777] [c00020739ba9bc78] [c000000000d05140] schedule+0x40/0xc0
[ 1692.658781] [c00020739ba9bc98] [c000000000537bf4] jbd2_log_wait_commit+0xf4/0x1b0
[ 1692.658784] [c00020739ba9bd18] [c0000000004c5ee4] ext4_sync_file+0x354/0x620
[ 1692.658788] [c00020739ba9bd78] [c00000000042afb8] vfs_fsync_range+0x78/0x170
[ 1692.658790] [c00020739ba9bdc8] [c00000000042b138] do_fsync+0x58/0xd0
[ 1692.658792] [c00020739ba9be08] [c00000000042b528] SyS_fsync+0x28/0x40
[ 1692.658795] [c00020739ba9be28] [c00000000000b284] system_call+0x58/0x6c
[ 1692.658839] Kernel panic - not syncing: hung_task: blocked tasks
[ 1692.659238] CPU: 48 PID: 785 Comm: khungtaskd Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
[ 1692.659835] Call Trace:
[ 1692.660025] [c000008fd0eefbf8] [c000000000cea13c] dump_stack+0xb0/0xf4 (unreliable)
[ 1692.660564] [c000008fd0eefc38] [c000000000110020] panic+0x148/0x328
[ 1692.661004] [c000008fd0eefcd8] [c000000000233a08] watchdog+0x2c8/0x420
[ 1692.661429] [c000008fd0eefdb8] [c000000000140068] kthread+0x1a8/0x1b0
[ 1692.661881] [c000008fd0eefe28] [c00000000000b654] ret_from_kernel_thread+0x5c/0x88
[ 1692.662439] Sending IPI to other CPUs
[ 1693.971250] IPI complete
This IPI being sent to all other CPUs suggests that they were preempted by an NMI in order to stop execution and, likely, call panic() for a dump. If that is true, the behavior can be configured through sysctl variables:
kernel.hardlockup_panic = 0 -> THIS, for HARD lockups
kernel.hung_task_panic = 0 -> THIS, for SCHEDULING deadlocks
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.softlockup_panic = 0 -> THIS, for SOFT lockups
kernel.unknown_nmi_panic = 0
vm.panic_on_oom = 0 -> THIS for OOM issues
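For completeness, these knobs can be flipped at runtime with sysctl(8), or persisted across reboots with a drop-in file. A minimal sketch (the /etc/sysctl.d/ file name and the chosen values are just examples, not recommendations for this case):

```shell
# Runtime change (lost on reboot); requires root.
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.softlockup_panic=1

# Persistent change: drop a fragment in /etc/sysctl.d/ and reload it.
cat > /etc/sysctl.d/90-lockup-panic.conf <<'EOF'
kernel.hung_task_panic = 1
kernel.softlockup_panic = 1
EOF
sysctl --system

# Verify the effective values.
sysctl -n kernel.hung_task_panic kernel.softlockup_panic
```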
And the panic would not happen for live virsh dumps (the live dump likely causes delays in the VM and fully dirties the pagecache, so the I/Os can't be committed as fast as the pages are being dirtied).
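If panicking is not desired while a live dump is in progress, one option is to widen the detection windows so the transient steal-time stalls don't trip the watchdogs. A sketch with example values only (defaults are 120s for hung_task_timeout_secs and 10s for watchdog_thresh; the soft lockup detector fires at roughly 2x watchdog_thresh):

```shell
# Example mitigation during a live dump; values are illustrative.
sysctl -w kernel.hung_task_timeout_secs=600   # default 120s; 0 disables hung-task detection
sysctl -w kernel.watchdog_thresh=30           # soft lockup fires at ~2x this value
```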
Checking the sosreport you sent:
$ cat sos_commands/kernel/sysctl_-a | grep -i panic
kernel.hardlockup_panic = 0
kernel.hung_task_panic = 1
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_warn = 0
kernel.softlockup_panic = 1
vm.panic_on_oom = 0
You have kernel.softlockup_panic = 1; this is what causes the panic whenever the guest accumulates too much "steal time" to keep up with its needs (which is what makes the lockups happen).
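One quick way to test that theory (assuming a panic-free run is acceptable for the experiment): turn the panic off at runtime and retry the live dump. The lockups would then only be logged instead of crashing the guest:

```shell
# Disable panicking on soft lockups; the detector still logs to dmesg.
sysctl -w kernel.softlockup_panic=0

# After retrying the live dump, check whether lockups were still detected.
dmesg | grep -i "soft lockup"
```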
Am I missing something?