Kernel Panic while virsh dump of Guest with 300G RAM is triggered.

Bug #1846237 reported by bugproxy
Affects                            Status        Importance  Assigned to
---------------------------------  ------------  ----------  --------------------------------------
The Ubuntu-power-systems project   Fix Released  Low         Canonical Kernel Team
libvirt (Ubuntu)                   Fix Released  Low         Ubuntu on IBM Power Systems Bug Triage
linux (Ubuntu)                     Invalid       Undecided   Canonical Kernel Team

Bug Description

== Comment: #0 - HARIHARAN T. SUNDARESH REDDY - 2019-04-01 12:23:14 ==
---Problem Description---
Kernel panic occurred when `virsh dump` was triggered on a guest with 300G of RAM.

---uname output---
Linux ltcgen6 4.15.0-1017.19-bz175922-ibm-gt #bz175922 SMP Thu Mar 21 09:34:09 CDT 2019 ppc64le ppc64le ppc64le GNU/Linux

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Define a guest with 300G of RAM and start it.
2. Run the following command:
   `virsh dump ubuntui_hsr /home/vm.core --memory-only --live`
3. Wait for some time; a kernel panic will be seen in the console.

---Log---
        [ 1692.657251] INFO: task journal-offline:7763 blocked for more than 120 seconds.
        [ 1692.657754] Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
        [ 1692.658220] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1692.658839] Kernel panic - not syncing: hung_task: blocked tasks
        [ 1692.659238] CPU: 48 PID: 785 Comm: khungtaskd Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
        [ 1692.659835] Call Trace:
        [ 1692.660025] [c000008fd0eefbf8] [c000000000cea13c] dump_stack+0xb0/0xf4 (unreliable)
        [ 1692.660564] [c000008fd0eefc38] [c000000000110020] panic+0x148/0x328
        [ 1692.661004] [c000008fd0eefcd8] [c000000000233a08] watchdog+0x2c8/0x420
        [ 1692.661429] [c000008fd0eefdb8] [c000000000140068] kthread+0x1a8/0x1b0
        [ 1692.661881] [c000008fd0eefe28] [c00000000000b654] ret_from_kernel_thread+0x5c/0x88
        [ 1692.662439] Sending IPI to other CPUs
        [ 1693.971250] IPI complete
        [ 1694.122536] kexec: Starting switchover sequence.

        [ 1827.285354188,3] PHB#0003[0:3]: CRESET: Unexpected slot state 00000102, resetting...
        [ 1831.426449832,3] PHB#0030[8:0]: CRESET: Unexpected slot state 00000102, resetting...
        [ 1832.158844758,3] PHB#0033[8:3]: CRESET: Unexpected slot state 00000102, resetting...
        [ 1832.403055326,3] PHB#0034[8:4]: CRESET: Unexpected slot state 00000102, resetting...
        [ 1.924644] integrity: Unable to open file: /etc/keys/x509_ima.der (-2)
        [ 1.924647] integrity: Unable to open file: /etc/keys/x509_evm.der (-2)
        [ 1.991351] vio vio: uevent: failed to send synthetic uevent
        [ 2.524824] nouveau 0004:04:00.0: unknown chipset (140000a1)
        [ 2.525314] nouveau 0004:05:00.0: unknown chipset (140000a1)
        [ 2.525831] nouveau 0035:03:00.0: unknown chipset (140000a1)
        [ 2.526315] nouveau 0035:04:00.0: unknown chipset (140000a1)
        /dev/sda2: recovering journal
        systemd-update-utmp.service
        systemd-timesyncd.service
        apparmor.service
        [ 19.799936] kdump-tools[1192]: Starting kdump-tools: * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201904010617/dump-incomplete
        Copying data : [100.0 %] - eta: 0s
        [ 84.327105] kdump-tools[1192]: The kernel version is not supported.
        [ 84.327190] kdump-tools[1192]: The makedumpfile operation may be incomplete.
        [ 84.327269] kdump-tools[1192]: The dumpfile is saved to /var/crash/201904010617/dump-incomplete.
        [ 84.327347] kdump-tools[1192]: makedumpfile Completed.
        [ 84.357362] kdump-tools[1192]: * kdump-tools: saved vmcore in /var/crash/201904010617
        [ 86.168085] kdump-tools[1192]: * running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201904010617/dmesg.201904010617
        [ 86.207356] kdump-tools[1192]: The kernel version is not supported.
        [ 86.207492] kdump-tools[1192]: The makedumpfile operation may be incomplete.
        [ 86.207564] kdump-tools[1192]: The dmesg log is saved to /var/crash/201904010617/dmesg.201904010617.
        [ 86.207818] kdump-tools[1192]: makedumpfile Completed.
        [ 86.211403] kdump-tools[1192]: * kdump-tools: saved dmesg content in /var/crash/201904010617
        [ 86.350667] kdump-tools[1192]: Mon, 01 Apr 2019 06:18:47 -0400
        [ 86.531296] kdump-tools[1192]: Rebooting.

== Comment: #4 - HARIHARAN T. SUNDARESH REDDY - 2019-04-01 23:43:05 ==
After the kernel panic, the host triggers kdump and reboots.

== Comment: #13 - Leonardo Bras Soares Passos - 2019-05-15 11:51:33 ==
I have been doing a lot of tests lately and found that this kind of bug usually happens when disk I/O is so intense that no other application can write to disk.

I have studied libvirt and even proposed a patch on the mailing list that works around the problem by sleeping once in a while to make room for other applications to write to disk.
Unfortunately, that was a mistake: this sort of problem should not be the application's responsibility.
I came to understand that it is the kernel's responsibility to schedule I/O access so that no one starves; this is done by the Linux I/O scheduler.

The I/O scheduling priority seems to be set and queried with the ionice utility.
libvirt_iohelper runs with ionice = "none: prio 0".
journald also has ionice = "none: prio 0", and at times it seems unable to preempt libvirt_iohelper, causing the panic.
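For reference, the I/O class and priority of the processes involved can be checked (and experimentally changed) with ionice; a minimal sketch, where the pgrep/pidof patterns are assumptions about how the processes are named on this host:

ionice -p $(pgrep -f libvirt_iohelper)        # show class/priority of the helper
ionice -p $(pidof systemd-journald)           # show class/priority of journald
ionice -c 3 -p $(pgrep -f libvirt_iohelper)   # experiment: drop the helper to the idle class

These classes are only honored by schedulers that implement I/O priorities (such as CFQ or BFQ); mq-deadline and none ignore them.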

I have also crafted a simple test that mimics the behavior of libvirt_iohelper and can also recreate this bug (attached).
It shows the problem can happen independently of libvirt.

For now, I am looking for a way to understand how the scheduler works.
I believe it may be possible to fix this just by adjusting ionice priorities.

If that is not possible, I will try to debug the kernel and see if I can get any clue as to why this happens, and find a way to fix it.

== Comment: #15 - Leonardo Bras Soares Passos - 2019-05-29 16:14:09 ==
I was trying to better understand I/O priority, so I read a few documents and found out that the I/O scheduler on this host is CFQ.

After reading a brief description of Linux CFQ, I understood it is supposed to prevent this kind of starvation (as said before, both journald and libvirt_iohelper had the same priority, "none: prio 0"), since it creates a queue per process and serves each queue according to its priority when allocating disk timeslices.

For now, I am trying to better understand the behavior of CFQ (block/cfq-iosched.c) to see whether there is a fix that could solve the problem.

While trying to reproduce the problem on newer kernels (to check whether the bug was already fixed), I found that block/cfq-iosched.c had been removed, as CFQ was deprecated in favor of BFQ. On the Ubuntu 5.0 kernel (hwe-18.04-edge), the default I/O scheduler was replaced by mq-deadline.
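For reference, the active scheduler of a block device can be inspected and switched at runtime through sysfs; a minimal sketch, where sda is an assumed device name:

cat /sys/block/sda/queue/scheduler          # the active scheduler is shown in brackets, e.g. [mq-deadline] none
echo bfq > /sys/block/sda/queue/scheduler   # switch scheduler for testing (the bfq module must be available)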

The bug seems to reproduce much more often on mq-deadline. As it is simpler, I quickly studied its workings:
the concept of the deadline scheduler is that it prioritizes requests that have exceeded their maximum timeout.

Then, I created a systemtap program to better understand why it starves journald:
- It seems that the test program creates write requests much faster than they can be written to disk; even after the program finishes executing, mq-deadline keeps working for a few minutes.
- After a few seconds of writing, all requests seem to be hitting their deadline, i.e. they are only written after the timeout is reached.
- This means that, due to the large amount of data to write, the scheduler effectively became a simple FIFO.

About mq-deadline: it is important to debug its behavior further, because it looks like Ubuntu kernels after 4.20 will use it as the default I/O scheduler. There are a few more tests I want to try (like repeating the process with memory 90% full) to understand whether having less memory available for caching I/O changes the behavior.
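The systemtap program mentioned above is not attached here; the following one-liner is only a rough sketch of that kind of per-process block I/O accounting, assuming the ioblock tapset and kernel debug symbols are installed:

stap -e 'global reqs, bytes
probe ioblock.request { reqs[execname()]++; bytes[execname()] += size }
probe timer.s(5) {
  foreach (name in reqs- limit 10)
    printf("%-20s %8d reqs %14d bytes\n", name, reqs[name], bytes[name])
  delete reqs; delete bytes
}'

Every 5 seconds this prints the ten processes submitting the most block requests, which makes it easy to see whether journald gets any I/O through while the dump writer is running.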

Revision history for this message
bugproxy (bugproxy) wrote : guest.xml

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-176572 severity-medium targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : dmesg
  • dmesg (94.0 KiB, application/octet-stream)

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → libvirt (Ubuntu)
Changed in ubuntu-z-systems:
assignee: nobody → Canonical Foundations Team (canonical-foundations)
importance: Undecided → Medium
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Based on stack trace:

[ 1692.658756] Call Trace:
[ 1692.658762] [c00020739ba9b970] [0000000024008842] 0x24008842 (unreliable)
[ 1692.658769] [c00020739ba9bb48] [c00000000001c270] __switch_to+0x2a0/0x4d0
[ 1692.658774] [c00020739ba9bba8] [c000000000d048a4] __schedule+0x2a4/0xb00
[ 1692.658777] [c00020739ba9bc78] [c000000000d05140] schedule+0x40/0xc0
[ 1692.658781] [c00020739ba9bc98] [c000000000537bf4] jbd2_log_wait_commit+0xf4/0x1b0
[ 1692.658784] [c00020739ba9bd18] [c0000000004c5ee4] ext4_sync_file+0x354/0x620
[ 1692.658788] [c00020739ba9bd78] [c00000000042afb8] vfs_fsync_range+0x78/0x170
[ 1692.658790] [c00020739ba9bdc8] [c00000000042b138] do_fsync+0x58/0xd0
[ 1692.658792] [c00020739ba9be08] [c00000000042b528] SyS_fsync+0x28/0x40
[ 1692.658795] [c00020739ba9be28] [c00000000000b284] system_call+0x58/0x6c
[ 1692.658839] Kernel panic - not syncing: hung_task: blocked tasks
[ 1692.659238] CPU: 48 PID: 785 Comm: khungtaskd Not tainted 4.15.0-1017.19-bz175922-ibm-gt #bz175922
[ 1692.659835] Call Trace:
[ 1692.660025] [c000008fd0eefbf8] [c000000000cea13c] dump_stack+0xb0/0xf4 (unreliable)
[ 1692.660564] [c000008fd0eefc38] [c000000000110020] panic+0x148/0x328
[ 1692.661004] [c000008fd0eefcd8] [c000000000233a08] watchdog+0x2c8/0x420
[ 1692.661429] [c000008fd0eefdb8] [c000000000140068] kthread+0x1a8/0x1b0
[ 1692.661881] [c000008fd0eefe28] [c00000000000b654] ret_from_kernel_thread+0x5c/0x88
[ 1692.662439] Sending IPI to other CPUs
[ 1693.971250] IPI complete

This IPI being sent to all other CPUs suggests that they were preempted by an NMI in order to stop execution and, likely, call panic() for a dump. If that is true, it can be configured through sysctl variables:

kernel.hardlockup_panic = 0 -> THIS, for HARD lockups
kernel.hung_task_panic = 0 -> THIS, for SCHEDULING dead locks
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.panic_on_warn = 0
kernel.panic_print = 0
kernel.softlockup_panic = 0 -> THIS, for SOFT lockups
kernel.unknown_nmi_panic = 0
vm.panic_on_oom = 0 -> THIS for OOM issues

With those disabled, the panic would not happen for live virsh dumps (the live dump likely delays the VM and causes the pagecache to be fully dirtied, so the I/O cannot be committed as fast as the pages are being dirtied).
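One way to see that pressure while a live dump runs (just a sketch; the exact numbers will vary) is to watch the host's dirty/writeback counters against the writeback thresholds:

watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
sysctl vm.dirty_background_ratio vm.dirty_ratio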

Checking the sosreport you sent:

$ cat sos_commands/kernel/sysctl_-a | grep -i panic
kernel.hardlockup_panic = 0
kernel.hung_task_panic = 1
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 0
kernel.panic_on_warn = 0
kernel.softlockup_panic = 1
vm.panic_on_oom = 0

You have kernel.softlockup_panic = 1; this is what causes the panic whenever the guest accumulates too much "steal time" to keep up with its needs (which is what triggers the lockups).
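If panicking on lockups is not desired for this guest, those knobs could be turned off at runtime and persisted; a minimal sketch to be run inside the guest (the sysctl.d file name is arbitrary):

sysctl -w kernel.softlockup_panic=0 kernel.hung_task_panic=0
printf 'kernel.softlockup_panic = 0\nkernel.hung_task_panic = 0\n' > /etc/sysctl.d/99-no-lockup-panic.conf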

Am I missing something?

Changed in libvirt (Ubuntu):
status: New → Triaged
importance: Undecided → Low
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: Canonical Foundations Team (canonical-foundations) → Canonical Server Team (canonical-server)
status: New → Triaged
Frank Heimes (fheimes)
no longer affects: ubuntu-z-systems
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Canonical Server Team (canonical-server)
Manoj Iyer (manjo)
no longer affects: linux
Revision history for this message
Manoj Iyer (manjo) wrote :

Adding the kernel track to this bug after checking with Michael (IBM). Although the bug was found in a flavor kernel of Ubuntu, it sounds (from the notes in the description) like this might be a generic issue with 18.04.3 and newer kernels.

The next step for IBM is to reproduce this issue on an 18.04.3 or newer released kernel. Also, compare with the upstream kernel to see if the issue was already fixed. You could use the mainline builds to compare: https://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

Changed in libvirt (Ubuntu):
status: Triaged → Invalid
Changed in ubuntu-power-systems:
assignee: Canonical Server Team (canonical-server) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Comment #4 suggested that the panic only happened because kernel.softlockup_panic was set to 1; was that verified?

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Manoj,

This is a soft lockup in schedule() (a task waiting for I/O completion) inside a KVM guest.

The guest kernel has a completion that will be woken up, re-scheduling the task back onto a CPU run queue as soon as the I/O is finished (the I/O usually carries a handle that, as soon as the transport layer confirms the I/O was sent, calls a callback function to "release" the completion and let the application be re-scheduled). Depending on how the cache is configured, this logic might also wait for the I/O to be committed on the I/O server, only allowing the task to continue after that; that too shows up as a soft lockup (the task keeps re-scheduling itself until the I/O is done).

The guest simply panicked because it had "panic on hung task" configured.

It is highly probable that the "issue" here is I/O contention causing the lockup inside the guest, nothing else. There are no I/O timeouts (from a bad transport and/or block device layer) nor any hard lockup due to a deadlock, for example.

So, unless something else, undocumented in this bug, is happening, there is not much to be done without more information. To help the kernel team, it would be good for IBM to provide more details on what was being done on the host, how the I/O devices are configured in the KVM guest, etc.
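For example, something along these lines run on the host would capture most of that (ubuntui_hsr is the domain name from the reproduction steps; iostat needs the sysstat package):

virsh domblklist ubuntui_hsr                  # which disks/paths back the guest
virsh dumpxml ubuntui_hsr | grep -A5 '<disk'  # driver type and cache= mode per disk
iostat -xz 5 3                                # what else is loading the host disks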

That's my 5 cents.

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Following Rafael's comment, we believe this is working as expected. Marking as incomplete while awaiting IBM's response.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

As we believe that this is working as expected, we're lowering the priority to "low".

Changed in ubuntu-power-systems:
importance: Medium → Low
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Closing out after discussing with Michael Ranweiler. Please re-open if required.
Thanks.

Changed in ubuntu-power-systems:
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-02-25 10:58 EDT-------
An update:
One of my team members verified this bug on Ubuntu 20.04 and could successfully run virsh dump for a guest with 900G of memory.
So...based on these results we are OK to close this bug. Thanks!

Here are the steps tried:
----------------------------------------------------------
Created and brought up a guest with 900GB memory and tried virsh dump

root@ws-g48-2d81-host:~# virsh list --all
 Id   Name              State
------------------------------------
 1    virt-tests-raji   running
 -    virt-tests-vm1    shut off

<name>virt-tests-raji</name>
<uuid>8623a1dd-e043-4e2b-a75d-62a9cdead80e</uuid>
<memory unit='KiB'>943718400</memory>
<currentMemory unit='KiB'>943718400</currentMemory>

root@ws-g48-2d81-host:~# virsh dump virt-tests-raji vmcore

root@ws-g48-2d81-host:~# du -sh vmcore
4.3G vmcore

root@ws-g48-2d81-host:~# makedumpfile -d 15 vmcore dumpfile
check_elf_format: Can't get valid ehdr.

makedumpfile Failed.

Encountered an issue with makedumpfile for the dump generated. Also tried the crash tool directly on the dump and crash failed too.

Tried virsh dump with --live option and tried makedumpfile & crash on the dump and encountered the same failure.

Tried virsh dump with --memory-only option.
Since the VM had 900GB of memory, the dump was being taken for the whole 900GB, which was taking a long time to complete.

Identified that "The crash utility no longer supports the default core dump file format of the virsh dump command and --memory-only option is needed to analyse the core dump file created by virsh dump"

Tried virsh dump with the --memory-only option on a VM with less memory and was able to get the dump. Used crash to analyse the core dump and it worked successfully.

root@ws-g48-2d81-host:~# virsh dump --memory-only --format=kdump-lzo virt-tests-vm1 vmcore_memory1
error: Failed to core dump domain virt-tests-vm1 to vmcore_memory1
error: invalid argument: unsupported dumpformat 'kdump-lzo' for this QEMU binary

root@ws-g48-2d81-host:~# virsh dump --memory-only --format=kdump-zlib virt-tests-vm1 vmcore_memory1

Domain virt-tests-vm1 dumped to vmcore_memory1

root@ws-g48-2d81-host:~# du -sh vmcore_memory1
153M vmcore_memory1

root@ws-g48-2d81-host:~# crash vmcore_memory1 vmlinux-5.4.0-13-generic

crash 7.2.8
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

crash: vmlinux-5.4.0-13-generic: no debugging data available
----------------------------------------------------------

Revision history for this message
Frank Heimes (fheimes) wrote :

Thx Iranna for the reply and positive feedback.
With that I'm changing the status to Fix Released.

Changed in libvirt (Ubuntu):
status: Invalid → Fix Released
Changed in ubuntu-power-systems:
status: Invalid → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-02-28 04:11 EDT-------
Thanks! Closing the bug from IBM side.

tags: added: targetmilestone-inin2004
removed: targetmilestone-inin---