Ubuntu 16.04 KVM:kdump fails to mount root file system when noirqdistrib is missing as dump kernel parameter

Bug #1658733 reported by bugproxy on 2017-01-23
This bug affects 2 people
Affects                            Importance  Assigned to
The Ubuntu-power-systems project   High        Unassigned
kexec-tools (Ubuntu)               High        Louis Bouchard
  Trusty                           Undecided   Unassigned
  Xenial                           Undecided   Unassigned
  Zesty                            Undecided   Unassigned
  Artful                           High        Louis Bouchard
  Bionic                           High        Louis Bouchard
makedumpfile (Ubuntu)              Undecided   Thadeu Lima de Souza Cascardo
  Trusty                           Undecided   Unassigned
  Xenial                           High        Thadeu Lima de Souza Cascardo
  Zesty                            Undecided   Unassigned
  Artful                           High        Thadeu Lima de Souza Cascardo
  Bionic                           Undecided   Thadeu Lima de Souza Cascardo

Bug Description

[Impact]
On Power Systems, some interrupts are missed, and dumping the crash will fail. Adding the noirqdistrib kernel parameter to the kdump kernel will fix this.
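
As an illustration of the workaround described above, here is a minimal sketch of a shell helper that adds a parameter such as noirqdistrib to KDUMP_CMDLINE_APPEND. The function name add_kdump_param is hypothetical and not part of kdump-tools. It assumes the file already contains an active (uncommented) KDUMP_CMDLINE_APPEND="..." line and that the parameter contains no "/" or "&" characters. It deliberately refuses to create a new assignment, since setting the variable may replace the packaged default command line rather than extend it.

```shell
#!/bin/sh
# Hypothetical helper: add a kernel parameter to KDUMP_CMDLINE_APPEND.
# $1 = config file (e.g. /etc/default/kdump-tools), $2 = parameter to add.
add_kdump_param() {
    file=$1
    param=$2
    # Only touch an active (uncommented) assignment.
    if ! grep -q '^KDUMP_CMDLINE_APPEND="' "$file"; then
        echo "no active KDUMP_CMDLINE_APPEND in $file" >&2
        return 1
    fi
    current=$(sed -n 's/^KDUMP_CMDLINE_APPEND="\(.*\)"$/\1/p' "$file" | tail -n 1)
    case " $current " in
        *" $param "*) return 0 ;;    # already present, nothing to do
    esac
    # Keep a backup, then insert the parameter before the closing quote.
    cp -p "$file" "$file.bak" &&
        sed "s/^\(KDUMP_CMDLINE_APPEND=\".*\)\"\$/\1 $param\"/" "$file.bak" > "$file"
}

# Example (as root on the affected guest):
#   add_kdump_param /etc/default/kdump-tools noirqdistrib
#   kdump-config unload && kdump-config load
```

The helper is idempotent, so running it twice leaves a single copy of the parameter in place.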

[Test Case]
Setting up kdump to target a virtio-scsi device on a Power System.

[Regression Potential]
The parameter could be interpreted differently on a different platform, and kdump would then fail. However, it has been verified that no other platform uses this parameter. If the patch had incorrectly removed another parameter, kdump could fail on other systems.

== Comment: #0 - Richard M. Scheller - 2016-12-14 16:50:26 ==

---Problem Description---

On a KVM guest installed to a multipath root device, the kdump kernel fails to mount the root file system. This error does not occur in a similar guest installed to a single path device.

Full console output of the kdump failure is attached. These messages from the output may be relevant:

Begin: Loading multipath modules ... Success: loaded module dm-multipath.
done.
Begin: Loading multipath hardware handlers ... Failure: failed to load module scsi_dh_alua.
Failure: failed to load module scsi_dh_rdac.
Failure: failed to load module scsi_dh_emc.
done.
Begin: Starting multipathd ... done.

---uname output---
Linux dotg9 4.8.0-32-generic #34~16.04.1-Ubuntu SMP Tue Dec 13 17:01:57 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = 8247-22L Ubuntu 16.04.1 KVM guest

---Steps to Reproduce---
- Install Ubuntu 16.04.1 to a multipath target disk
- Install kdump-tools package
- Configure kexec-tools to reserve sufficient RAM for the kdump kernel to load (I use 512MB) in /etc/default/grub.d/kexec-tools.cfg
- Run update-grub
- Reboot
- Initiate a system crash using "echo c > /proc/sysrq-trigger"
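
The crashkernel configuration step above can be sketched as a small shell helper. This is a minimal sketch, not the reporter's exact procedure: the function name write_crashkernel_cfg and the single-line file content are assumptions, and the file shipped by the package may use a different (for example range-based) crashkernel value. The target path is a parameter so the helper can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: write a grub.d snippet that reserves memory for the
# kdump kernel. $1 = file to write, $2 = reservation size (e.g. 512M).
write_crashkernel_cfg() {
    cfg=$1
    size=$2
    # Single quotes keep $GRUB_CMDLINE_LINUX_DEFAULT literal, so it is
    # expanded later when the grub scripts source this file.
    printf '%s\n' \
        'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel='"$size"'"' \
        > "$cfg"
}

# On the guest (as root), the remaining steps from the report would be:
#   write_crashkernel_cfg /etc/default/grub.d/kexec-tools.cfg 512M
#   update-grub
#   reboot
#   echo c > /proc/sysrq-trigger    # crashes the system on purpose
```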

== Comment: #12 - Richard M. Scheller - 2016-12-20 20:37:45 ==
Here is the log level 8 kdump console log requested in comment 10.

== Comment: #21 - Richard M. Scheller - 2017-01-06 11:04:17 ==
(In reply to comment #19)
> Hi, I logged in dotkvm and I couldn't find the guest dotg9. Also, although I
> found a dotg9.xml in /kte/xml/ it doesn't look like it uses multipath (it
> uses .img files which I didn't found as disks).
>
> Could you please recreate the guest for further debug?

Yes, I recreated the guest with its correct multipath lun configuration. I have also attached the guest XML to this bug.

> Besides that could you please let us know:
> - is the multipath the system's root? I mean / is installed/mounted on the
> multipath device?

Yes, the guest has only one disk. That disk is actually a LUN from a fiber channel storage device with two paths on the host side. I have passed through both paths to the guest, so the multipath nature of the target disk is known to the guest.

In other words, the guest sees a multipath device and is using it as a multipath device. The root file system is called /dev/mapper/mpatha-part2 on the guest.

> - how did you attach the device to the guest?

Each FC LUN path on the host is mapped to a virtio-scsi controller on the guest using LUN passthrough. (See the guest XML for details on this.)

== Comment: #22 - Mauro Sergio Martins Rodrigues - 2017-01-11 09:31:38 ==
I managed to get kdump to mount the rootfs and perform its tasks by setting the KDUMP_CMDLINE_APPEND="nr_cpus=4" parameter in /etc/default/kdump-tools; see http://pastebin.hursley.ibm.com/8239

I'm still investigating to figure out what is the reason behind this behavior.

Thanks,

--
maurosr

== Comment: #23 - Mauricio Faria De Oliveira - 2017-01-11 11:56:40 ==
Mauro,

(In reply to comment #22)
> I managed to get kdump to mount rootfs and perform its tasks by setting
> KDUMP_CMDLINE_APPEND="nr_cpus=4" parameter in /etc/default/kdump-tools see
> http://pastebin.hursley.ibm.com/8239
>
> I'm still investigating to figure out what is the reason behind this
> behavior.
>
> Thanks,
>
> --
> maurosr

That would smell like an out of memory condition that is alleviated with a smaller number of CPUs allowed for the kernel (so the amount of memory associated with per-CPU stuff is less in total).

Per the bug description, the memory reserved for the crashkernel is 512MB:

(In reply to comment #23)
> - Configure kexec-tools to reserve sufficient RAM for the kdump kernel to
> load (I use 512MB) in /etc/default/grub.d/kexec-tools.cfg

That seems low for Power guests/systems.
In theory it doesn't seem so, but the reality is that _for some reason(s)_ we require just too much memory to load and boot a kernel/initramfs (either on boot or kdump).

When working w/ kdump and Ubuntu, I usually set the crashkernel allocated size right away to 4GB to avoid problems.

Since this is a smaller sized guest, obviously we'd want to use less than that, but more than 512 MB given the evidence observed.
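
When comparing reservation sizes like the ones discussed here, it helps to confirm what the running kernel was actually booted with. The following is a minimal sketch under that assumption; the function name crashkernel_arg is hypothetical. It only extracts the crashkernel= argument from a kernel command line and says nothing about how much memory a given system needs.

```shell
#!/bin/sh
# Hypothetical helper: print the crashkernel= value from a kernel command
# line. $1 = file holding the command line (defaults to /proc/cmdline).
crashkernel_arg() {
    # Put each argument on its own line, then keep the last crashkernel=.
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^crashkernel=//p' | tail -n 1
}

# Example on the guest:
#   crashkernel_arg                      # value passed at boot, if any
#   cat /sys/kernel/kexec_crash_size     # bytes currently reserved
```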

Hope this helps

== Comment: #28 - Mauro Sergio Martins Rodrigues - 2017-01-13 10:12:28 ==

> In theory it doesn't seem so, but the reality is that _for some reason(s)_ we require just too much memory to load and boot a kernel/initramfs (either on boot or kdump).

For the record, as you already know, I've raised the memory to 1024 MB and it didn't help.

> Per yesterday's conversations, this had to do with IRQ distribution and the nr_cpus kernel parameter, and seemed to affect multipath only by chance, usually failing/hanging the guest at kdump at other parts / way earlier in boot (at virtio-scsi disk probe phase too).

Yes, that's right. So looks like there are a couple of things going on here. The first and simpler:

According to kdump's documentation at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/kdump/kdump.txt#n365 noirqdistrib is a necessary parameter for using kdump on the ppc64 architecture, and indeed it solves the issue, including the case when nr_cpus=1 (which was failing in all my attempts until I used noirqdistrib).

So I believe a patch to Ubuntu's kdump package that sets this parameter for ppc builds may solve this for good, and that will be my focus for now.

Nevertheless I will keep investigating why this issue is happening only w/ multipath devices.

== Comment: #29 - Mauricio Faria De Oliveira - 2017-01-16 04:22:00 ==
(In reply to comment #28)
> > In theory it doesn't seem so, but the reality is that _for some reason(s)_ we require just too much memory to load and boot a kernel/initramfs (either on boot or kdump).
>
> For the record, as you already know, I've raised the memory to 1024 MB and it
> didn't help.

Definitely.
What I've observed is that more than 2GB (yes..) was required on some systems I checked on at the time.
Since you've identified the IRQ distribution aspect of this issue, the crashkernel memory size might not be completely related to this problem, and for this system, the configured sizes happen to work well.

> Nevertheless I will keep investigating why this issue is happening only w/
> multipath devices.

Based on the IRQ distribution aspect, the most reasonable suspicion I've thought of is...

In our testing, several times I observed the kernel initialization to hang in the probe stage of the virtio-scsi disks, and the IO request (probably for the partition table read operation) would time out (signaled by a 'tag abort' message).

If we suppose that these initial IO requests completed correctly (say, these initial IRQs happened to be assigned to the CPU that was online, and thus were delivered/handled correctly) BUT the IO requests issued by multipath (for disk/path identification) fail (i.e., they happened to be assigned to a CPU that was offline, so these requests would time out), then multipath would never get a response back, thus not initializing the individual paths and the respective multipath device.

So the /dev/mapper/mpathX device is not created, and the problem is observed.
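
One way to look for evidence of the suspicion above is to check how interrupt counts for the virtio devices are spread across CPUs. The following is a minimal diagnostic sketch, not something used in this bug: the function name irq_counts is hypothetical, and it simply reformats lines in the /proc/interrupts layout (a header row of CPU names followed by one row per IRQ). Counts that never increase for a device while its requests time out would be consistent with the hypothesis, but would not prove it.

```shell
#!/bin/sh
# Hypothetical helper: show per-CPU interrupt counts for matching IRQ lines.
# $1 = pattern to match (e.g. virtio), $2 = file (default /proc/interrupts).
irq_counts() {
    pattern=$1
    file=${2:-/proc/interrupts}
    awk -v pat="$pattern" '
        NR == 1 {                       # header row: one column per CPU
            for (i = 1; i <= NF; i++) cpu[i] = $i
            ncpu = NF
            next
        }
        $0 ~ pat {                      # $1 is the IRQ number, then counts
            line = $1
            for (i = 2; i <= ncpu + 1; i++) line = line " " cpu[i - 1] "=" $i
            print line
        }' "$file"
}

# Example, run twice a few seconds apart and compare:
#   irq_counts virtio
```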

tags: added: architecture-ppc64le bugnameltc-149956 severity-critical targetmilestone-inin16043

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kexec-tools (Ubuntu)
Manoj Iyer (manjo) on 2017-01-23
Changed in kexec-tools (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Louis Bouchard (louis-bouchard)
importance: Undecided → High

------- Comment on attachment From <email address hidden> 2017-01-23 12:10 EDT-------

This patch fixes the issue. The noirqdistrib is a necessary parameter for ppc64 platform as stated in kdump docs https://www.kernel.org/doc//Documentation/kdump/kdump.txt
With this fix, the crash kernel uses the correct IRQ server in the case where only one CPU is online.

------- Comment From <email address hidden> 2017-01-30 16:37 EDT-------
I have tried the attached patch to kdump-config, and it fixes the problem for me. Kdump to a multipath root disk works correctly.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-03-27 13:05 EDT-------
Hi Canonical,
Is there any news related to this bug and the patch attached? Is it already accepted?

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in kexec-tools (Ubuntu):
status: New → Confirmed
Changed in makedumpfile (Ubuntu):
status: New → Confirmed
Mauro S M Rodrigues (maurosr) wrote :

Hi Canonical,
Is there any news related to this bug and the patch attached? Is it already accepted?

Manoj Iyer (manjo) on 2017-05-08
Changed in kexec-tools (Ubuntu):
assignee: Louis Bouchard (louis) → Nish Aravamudan (nacc)
Changed in makedumpfile (Ubuntu):
assignee: nobody → Nish Aravamudan (nacc)
Manoj Iyer (manjo) on 2017-05-09
Changed in ubuntu-power-systems:
status: New → Confirmed
Manoj Iyer (manjo) on 2017-06-01
tags: added: ubuntu-16.04

------- Comment From <email address hidden> 2017-07-04 15:48 EDT-------
Hi Canonical,

Any updates here?

Louis Bouchard (louis) on 2017-07-06
Changed in kexec-tools (Ubuntu):
assignee: Nish Aravamudan (nacc) → Louis Bouchard (louis)
Changed in makedumpfile (Ubuntu):
assignee: Nish Aravamudan (nacc) → Louis Bouchard (louis)
Changed in kexec-tools (Ubuntu):
status: Confirmed → Invalid
Changed in makedumpfile (Ubuntu):
status: Confirmed → In Progress

This bug is a duplicate of bug #1635597.

As outlined in the other bug, could you test the potential fix in the following PPA :

ppa:louis/kdump-tools-multipath

------- Comment From <email address hidden> 2017-08-03 17:10 EDT-------
Hi Canonical,

Any updates here?

------- Comment From <email address hidden> 2017-08-03 17:17 EDT-------
Hi Canonical,

Sorry for my last comment. Disregard it.

In case Louis's package does not fix this specific issue, please try ppa:cascardo/ppa.

As this bug is about the use of noirqdistrib, this could affect other targets besides multipath devices. So unmarking this as duplicate of the multipath bug. Also, changing the title to indicate better what it is about.

summary: - Ubuntu 16.04.2KVM:kdump fails to mount root file system on multipath
- root device
+ Ubuntu 16.04.2KVM:kdump fails to mount root file system when
+ noirqdistrib is missing as dump kernel parameter
Changed in makedumpfile (Ubuntu Xenial):
status: New → Confirmed
Changed in makedumpfile (Ubuntu):
assignee: Louis Bouchard (louis) → Thadeu Lima de Souza Cascardo (cascardo)
Changed in makedumpfile (Ubuntu Xenial):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)

description: updated

The attachment "target fix for artful" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Changed in makedumpfile (Ubuntu Artful):
importance: Undecided → High
milestone: none → artful-updates
Changed in makedumpfile (Ubuntu Xenial):
importance: Undecided → High
tags: added: ppc64el-kdump
Łukasz Zemczak (sil2100) wrote :

I don't see this fix in bionic yet. Could anyone first release it there? Stable updates can only be backported if they're present in the devel series.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.2-1ubuntu1

---------------
makedumpfile (1:1.6.2-1ubuntu1) bionic; urgency=medium

  * KDUMP_CMDLINE_APPEND: add noirqdistrib to default command line. As it's
    only used by ppc64el, it's not required to be conditionally added.
    (LP: #1658733)
  * Set crashkernel for ppc64el to load at 128M instead of 32M. That allows
    larger kernels to boot. (LP: #1728115)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Tue, 07 Nov 2017 12:23:33 +0000
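
To check on an installed system that the first changelog entry above took effect, one could look for noirqdistrib in the command line that kdump will pass to the crash kernel. This is a minimal sketch under assumptions: the function name cmdline_has_param is hypothetical, and it only tests whether a parameter appears as a whole token in text read from standard input, without relying on any particular output layout.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the parameter $1 appears as a whole token
# in the text on stdin (tokens are split on spaces, tabs and double quotes).
cmdline_has_param() {
    tr ' \t"' '\n\n\n' | grep -qxF -- "$1"
}

# Example on the guest:
#   kdump-config show | cmdline_has_param noirqdistrib && echo present
#   dpkg-query -W -f='${Version}\n' makedumpfile   # version being tested
```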

Changed in makedumpfile (Ubuntu Bionic):
status: In Progress → Fix Released

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.1-2ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in makedumpfile (Ubuntu Artful):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-artful
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.5.9-5ubuntu0.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in makedumpfile (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial

tags: added: triage-a

@ IBM: Please can you verify the packages from artful-proposed and xenial-proposed?
And thx for the artful and bionic patches.

Changed in makedumpfile (Ubuntu Zesty):
status: New → Invalid
Changed in kexec-tools (Ubuntu Zesty):
status: New → Invalid
summary: - Ubuntu 16.04.2KVM:kdump fails to mount root file system when
- noirqdistrib is missing as dump kernel parameter
+ Ubuntu 16.04 KVM:kdump fails to mount root file system when noirqdistrib
+ is missing as dump kernel parameter
Changed in kexec-tools (Ubuntu Xenial):
status: New → Invalid
Changed in kexec-tools (Ubuntu Trusty):
status: New → Invalid
Changed in makedumpfile (Ubuntu Trusty):
status: New → Invalid
Changed in ubuntu-power-systems:
status: Confirmed → Fix Committed
tags: added: verification-done verification-done-artful verification-done-xenial
removed: verification-needed verification-needed-artful verification-needed-xenial
Changed in ubuntu-power-systems:
importance: Undecided → High
Andrew Cloke (andrew-cloke) wrote :

Following comment #44, marking verification-done for Artful and Xenial.

tags: added: triage-g
removed: triage-a
Robie Basak (racb) wrote :

What version of makedumpfile was used for SRU verification please?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.1-2ubuntu0.1

---------------
makedumpfile (1:1.6.1-2ubuntu0.1) artful; urgency=medium

  * KDUMP_CMDLINE_APPEND: add noirqdistrib to default command line. As it's
    only used by ppc64el, it's not required to be conditionally added.
    (LP: #1658733)
  * Set crashkernel for ppc64el to load at 128M instead of 32M. That allows
    larger kernels to boot. (LP: #1728115)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Tue, 07 Nov 2017 12:23:33 +0000

Changed in makedumpfile (Ubuntu Artful):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for makedumpfile has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.5.9-5ubuntu0.6

---------------
makedumpfile (1:1.5.9-5ubuntu0.6) xenial; urgency=medium

  * d/kernel-postinst-generate-initrd : Add scsi_dh_* modules if in
    use so the system can dump a crash when root is on multipath
    (LP: #1635597) (Closes: 862411)

  * KDUMP_CMDLINE_APPEND: add noirqdistrib to default command line. As it's
    only used by ppc64el, it's not required to be conditionally added.
    (LP: #1658733)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Tue, 29 Aug 2017 16:56:04 -0300

Changed in makedumpfile (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released