makekdump should re-exec with cio_ignore on s390x

Bug #1570775 reported by Dimitri John Ledkov on 2016-04-15
This bug affects 1 person
Affects                    Importance  Assigned to
Ubuntu on IBM z Systems    Medium      Unassigned
makedumpfile (Ubuntu)      Medium      Unassigned
  Xenial                   Medium      Louis Bouchard

Bug Description

[SRU justification]
kernel crash dump fails to work without the modification.

[Impact]
Broken functionality.

[Fix]
Add the output of cio_ignore -u -k to the kexec command line.

[Test Case]
Follow the steps in comment #2 to reproduce. With the fix, kdump will function as expected.

[Regression]
A regression situation is outlined in comment #16. Newer channel devices may not be visible because of the added cio_ignore output. This may be worked around with a single command.

The benefit of the fix outweighs this limitation, which can easily be worked around.

[Original description of the problem]

As per https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1564475/comments/19

We should re-exec with the cio_ignore parameters. According to the report there, this should lower the crashdump memory setting that is required.

Hypothetically, one should be able to test this empirically by lowering the crashdump memory setting until kdump no longer succeeds, then generating the output of `cio_ignore -k -u`, appending it to KDUMP_CMDLINE_APPEND=, and checking that kdump works again with the lower memory setting.
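The manual test described above could be prepared with something like the following. This is only a sketch: the helper name kdump_append_line is hypothetical, and the base options are copied from the default KDUMP_CMDLINE_APPEND value quoted later in this report.

```shell
#!/bin/sh
# Sketch only: print the KDUMP_CMDLINE_APPEND line for /etc/default/kdump-tools.
# $1 is the kernel parameter printed by `cio_ignore -u -k`.
# kdump_append_line is a hypothetical helper name, not part of kdump-tools.
kdump_append_line() {
    base='irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service'
    printf 'KDUMP_CMDLINE_APPEND="%s %s"\n' "$base" "$1"
}

# On a real s390x system one would pass the live value:
#   kdump_append_line "$(cio_ignore -u -k)"
# Here a sample value taken from this report is used instead.
kdump_append_line 'cio_ignore=all,!0009,!0200,!0600-0602'
```

The printed line would then replace the existing KDUMP_CMDLINE_APPEND= line before the kdump kernel is reloaded.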

Once this is developed / verified / tested, we should probably SRU this back to xenial.

tags: added: s390x
Louis Bouchard (louis) on 2016-04-15
Changed in makedumpfile (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Louis Bouchard (louis-bouchard)

------- Comment From <email address hidden> 2016-04-15 07:33 EDT-------
Reverse Mirror Request for following LP bugzilla:
https://bugs.launchpad.net/ubuntu/+source/makedumpfile/+bug/1570775

tags: added: architecture-s39064 bugnameltc-140369 severity-medium targetmilestone-inin1610
bugproxy (bugproxy) on 2016-04-20
tags: added: targetmilestone-inin16041
removed: targetmilestone-inin1610
no longer affects: ubuntu-release-notes
dann frazier (dannf) on 2016-04-26
Changed in ubuntu-z-systems:
importance: Undecided → Medium
status: New → Confirmed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-05-03 07:28 EDT-------
On an LPAR with 8458 (!!) ccw devices (only 11 of them in use) I set crashkernel with various values:
256M : no dump created
288M : oom killer killed some processes, dump was created, but system did not come up
320M : no oom killer, dump properly created
Then I invoked cio_ignore -u -k and added these values to KDUMP_CMDLINE_APPEND in /etc/default/kdump-tools. Kdump worked fine, even with crashkernel=128M.

IMO the best thing is to set this dynamically in /etc/default/kdump-tools with the following suggested patch:
--- /etc/default/kdump-tools.orig 2016-04-21 15:11:57.000000000 +0200
+++ /etc/default/kdump-tools 2016-05-03 13:17:38.862816261 +0200
@@ -63,7 +63,8 @@
# for the kdump kernel. If unset, it defaults to "irqpoll maxcpus=1 nousb"
#KDUMP_KEXEC_ARGS=""
#KDUMP_CMDLINE=""
-#KDUMP_CMDLINE_APPEND="irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service"
+APPEND=$(/sbin/cio_ignore -u -k)
+KDUMP_CMDLINE_APPEND="irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service $APPEND"

# ---------------------------------------------------------------------------
# Architecture specific Overrides:

The cmdline during kdump is set properly; the values from cio_ignore are reflected.
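One way to check that the parameter is reflected is to look for a cio_ignore= word in the kernel command line. The following is a minimal sketch; cio_ignore_param is a hypothetical helper name and the sample command line is abbreviated.

```shell
#!/bin/sh
# Sketch only: print the cio_ignore= parameter(s) found in a kernel command
# line. The exit status is non-zero if there is none.
# cio_ignore_param is a hypothetical helper name.
cio_ignore_param() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^cio_ignore='
}

# In the kdump kernel one would inspect the live command line:
#   cio_ignore_param "$(cat /proc/cmdline)"
# Here an abbreviated sample command line is used instead.
cio_ignore_param 'root=/dev/mapper/vg_ubuntu-root irqpoll maxcpus=1 nousb cio_ignore=all,!0009,!0200,!0600-0602'
```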

Louis Bouchard (louis) wrote :

Hello,

It is becoming unclear to me if the cio_ignore needs to be tied to a bigger crashkernel value or not. I am preparing an update to kdump-tools which will set the crashkernel value upon install. Right now, I'm setting it to 196M for LPAR and 128M for zVM.

What should be the default values for LPAR and zVM ?

I will add a separate change to take the cio_ignore into account.

Kind regards,

...Louis

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-05-10 04:13 EDT-------
Hi Louis,

it's easier than expected:
If you kexec with cio_ignore as proposed above, 128M will be enough for both z/VM and LPAR. To keep it simple, you can use the same cio_ignore mechanism in /etc/default/kdump-tools for both z/VM and LPAR.

Louis Bouchard (louis) wrote :

Hello,

Before committing this into the new kdump-tools, let me confirm with you:

The current use has the following kexec command (from kdump-config show) :

kexec command:
  /sbin/kexec -p --command-line="root=/dev/mapper/vg_ubuntu-root BOOT_IMAGE=0 irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Your suggestion is to have the following kexec command :

kexec command:
  /sbin/kexec -p --command-line="root=/dev/mapper/vg_ubuntu-root BOOT_IMAGE=0 irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service cio_ignore=all,!0009,!0200,!0600-0602" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

I have implemented this by adding the cio_ignore statement (from cio_ignore -k -u) to the KDUMP_CMDLINE_APPEND statement in /etc/default/kdump-tools file if installed on s390.
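The kind of change described here could look roughly like the sketch below. It is not the actual kdump-tools maintainer script: the function name add_cio_ignore is hypothetical, and it assumes the parameter contains no characters that are special in a sed replacement (such as & or |).

```shell
#!/bin/sh
# Sketch only; NOT the real kdump-tools postinst logic.
# Usage: add_cio_ignore FILE ARCH PARAM (hypothetical helper)
add_cio_ignore() {
    file=$1; arch=$2; param=$3
    # Only s390x has channel devices to ignore.
    [ "$arch" = "s390x" ] || return 0
    # Do nothing if an active line already carries a cio_ignore parameter.
    if grep -q '^KDUMP_CMDLINE_APPEND=.*cio_ignore=' "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Uncomment the setting if needed and append PARAM inside the quotes.
    sed 's|^#*KDUMP_CMDLINE_APPEND="\(.*\)"$|KDUMP_CMDLINE_APPEND="\1 '"$param"'"|' "$file" > "$tmp" \
        && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demonstration on a scratch copy instead of /etc/default/kdump-tools.
demo=$(mktemp)
printf '%s\n' '#KDUMP_CMDLINE_APPEND="irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service"' > "$demo"
# On a real system PARAM would come from: cio_ignore -u -k
add_cio_ignore "$demo" s390x 'cio_ignore=all,!0009,!0200,!0600-0602'
cat "$demo"
rm -f "$demo"
```

Checking for an existing cio_ignore= entry first keeps the edit idempotent, so running it again on a package upgrade would not append the parameter twice.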

This differs from the statement in https://wiki.ubuntu.com/S390X, which says that it should be added to /etc/zipl.conf and hence applied to any running kernel, not only the one that is kexec'd.

TIA,

...Louis

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.5.9-7

---------------
makedumpfile (1:1.5.9-7) sid; urgency=medium

  * [d/rules] Lower kexec-tools dependency to -2
      The ubuntu merge will happen on an 1:2.0.10-2 version so it cannot
      depends on -3.

 -- Louis Bouchard <email address hidden> Tue, 31 May 2016 14:36:07 +0200

Changed in makedumpfile (Ubuntu):
status: Confirmed → Fix Released
Frank Heimes (fheimes) on 2016-06-15
Changed in ubuntu-z-systems:
status: Confirmed → Fix Released
bugproxy (bugproxy) wrote :

+APPEND=$(/sbin/cio_ignore -u -k)

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-06-28 12:13 EDT-------
(In reply to comment #12)
> This bug was fixed in the package makedumpfile - 1:1.5.9-7
>
> ---------------
> makedumpfile (1:1.5.9-7) sid; urgency=medium
>
> * [d/rules] Lower kexec-tools dependency to -2
> The ubuntu merge will happen on an 1:2.0.10-2 version so it cannot
> depends on -3.
>
> -- Louis Bouchard <email address hidden> Tue, 31 May 2016 14:36:07
> +0200

I can not install
makedumpfile_1.5.9-7
kexec-tools_2.0.10-2
- not even via xenial-proposed?

Thus I picked makedumpfile_1.6.0-1 and kexec-tools_2.0.10-2 (from yakkety??) manually and it works as expected.
/etc/default/kdump-tools has KDUMP_CMDLINE_APPEND populated with the values as intended, so I treat this as fixed in a newer version. When a fixed version is available for _xenial_ as well, we can close this bug.

Louis Bouchard (louis) wrote :

Hello,

I don't know where you are getting these packages from :

~$ rmadison makedumpfile
 makedumpfile | 1:1.5.9-5 | xenial | source, amd64, armhf, i386, powerpc, ppc64el, s390x
 makedumpfile | 1:1.6.0-1 | yakkety | source, amd64, armhf, i386, powerpc, ppc64el, s390x
$ rmadison kexec-tools
 kexec-tools | 1:2.0.10-1ubuntu2 | xenial | source, amd64, armhf, i386, powerpc, ppc64el, s390x
 kexec-tools | 1:2.0.10-2ubuntu1 | yakkety | source, amd64, armhf, i386, powerpc, ppc64el, s390x
caribou@marvin:~$

Xenial has makedumpfile | 1:1.5.9-5 and kexec-tools | 1:2.0.10-1ubuntu2.

This change (along with a few more enablements) will have to be SRUed to Xenial. But this doesn't change the fact that the versions in the archive do install correctly.

Dimitri John Ledkov (xnox) wrote :

@thorsten

The changes landed in yakkety, not xenial yet =)

Looking at the bug statuses, this was not yet targeted at xenial series correctly, will adjust this on launchpad side now.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-01 05:48 EDT-------
As of today, I can update to the following levels:
ii kdump-tools 1:1.5.9-5
ii kexec-tools 1:2.0.10-1ubuntu2
ii makedumpfile 1:1.5.9-5

Version 1.5.9-5 does not contain the required fix. Still waiting for 1.5.9-7 in xenial updates.

Louis Bouchard (louis) on 2016-07-06
Changed in makedumpfile (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → Louis Bouchard (louis-bouchard)
Changed in makedumpfile (Ubuntu):
assignee: Louis Bouchard (louis-bouchard) → nobody
Changed in makedumpfile (Ubuntu Xenial):
importance: Undecided → Medium
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-12 09:26 EDT-------
Version 1.5.9-5 does not contain the required fix. Still waiting for 1.5.9-7 in xenial updates.
@Canonical: When is this available in xenial updates?

Louis Bouchard (louis) wrote :

Hello,

The fix for this bug, along with three others (all for your employer), is queued for the next SRU for makedumpfile. As soon as I have a PPA enabled for building s390x, I will provide the package for testing.

In the meantime, manually editing the file /etc/default/kdump-tools remains a valid workaround.

Kind regards,

...Louis

Louis Bouchard (louis) wrote :

Hello,

A tentative fix for this bug is available for testing in the following PPA :

   ppa:louis-bouchard/makedumpfile-test

Please test and verify that it solves this issue on Xenial. I will then proceed with the SRU.

Kind regards,

...Louis

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-13 05:53 EDT-------
Yep, that works as intended.
And thanks for also correcting maxcpus=1 to nr_cpus=1.
You may go ahead with that and SRU version 1.5.9-5ubuntu1 of both packages to xenial.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-13 11:52 EDT-------
Hmm, just one hour ago the s390-tools maintainer submitted a patch for the manpage of cio_ignore to clarify that the example command "cio_ignore -u -k" modifies(!) the current cio_ignore blacklist.
It was not obvious until today that it really modifies something!

What this means:
1. Under z/VM, if somebody has invoked "cio_ignore -u -k" via installation of kdump-tools, does not reboot AND attaches additional CCW devices, these devices will not show up in the channel subsystem (e.g. via lscss) until they are removed from the blacklist (e.g. via cio_ignore -r <busid>).
2. The same applies to LPAR if you define additional CCW devices via a dynamic I/O configuration change in HCD.
3. If someone runs "cio_ignore -p" after "cio_ignore -u -k" has been invoked (e.g. via installation of kdump-tools), the change also takes effect for existing unused devices: all unused CCW devices are blacklisted and no longer visible.
4. After a reboot, the cio_ignore blacklist again matches what was specified on the kernel parameter line. (Which means: Everything is fine again.)
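To illustrate why a newly attached device stays hidden, the sketch below evaluates a device number against a parameter of the simple form quoted earlier in this report (cio_ignore=all followed by negated entries). The function name is_ignored is hypothetical, and the parsing is deliberately simplified: it handles only 4-digit hex device numbers and negated single IDs or ranges, not every form the kernel accepts.

```shell
#!/bin/sh
# Sketch only: report whether a device number is ignored by a parameter of the
# form cio_ignore=all,!<id>,!<from>-<to>. Returns 0 if ignored, 1 if visible.
# Assumption: 4-digit hex device numbers such as 0600; full bus IDs and
# non-negated entries after "all" are not handled. is_ignored is hypothetical.
is_ignored() {
    param=${1#cio_ignore=}
    dev=$((0x$2))
    old_ifs=$IFS; IFS=,
    set -- $param
    IFS=$old_ifs
    for entry in "$@"; do
        case "$entry" in
            '!'*-*)
                range=${entry#!}
                from=${range%-*}
                to=${range#*-}
                [ "$dev" -ge $((0x$from)) ] && [ "$dev" -le $((0x$to)) ] && return 1
                ;;
            '!'*)
                [ "$dev" -eq $((0x${entry#!})) ] && return 1
                ;;
        esac
    done
    return 0
}

p='cio_ignore=all,!0009,!0200,!0600-0602'
for d in 0009 0601 0700; do
    if is_ignored "$p" "$d"; then
        echo "$d ignored"
    else
        echo "$d visible"
    fi
done
```

In this sample, 0009 and 0601 are covered by negated entries and stay visible, while 0700 is not listed and is ignored. That is the situation of a device attached after the blacklist was changed: it stays hidden until it is removed from the blacklist with cio_ignore -r or the system is rebooted.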

My recommendation:
1. Go ahead with the implementation as it stands today, since having a working kdump when it is needed is worth the above misbehaviour (which can be worked around).
2. When a new version of cio_ignore (within s390-tools) is available, you should pick up this new s390-tools version and also fix the postinst script in the kdump-tools package by adjusting the cio_ignore parameters.

We will open a new bug to fix the postinst script again, when an updated version of s390-tools is available.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-07-14 08:43 EDT-------
As an alternate solution, try this:

1) Modify /usr/share/initramfs-tools/scripts/init-top/udev

Replace line
udevadm trigger --action=add
with
udevadm trigger --type=subsystems --action=add
udevadm trigger --type=devices --action=add

2) Modify /etc/default/kdump-tools

Replace existing KDUMP_CMDLINE_APPEND= lines with
KDUMP_CMDLINE_APPEND="cio_ignore=all,!condev"

3) Rebuild the kdump initramfs

/etc/kernel/postinst.d/kdump-tools $(uname -r)

4) Check if kdump works

echo c > /proc/sysrq-trigger

Explanation:

- Ubuntu 16.04 uses zdev to configure z Systems specific devices.
- zdev also handles cio_ignore configuration via a udev rule that triggers when the CCW bus is registered (see /etc/udev/rules.d/41-cio-ignore.rules).
- Because of the way that Ubuntu's initramfs tools trigger coldplug of uevents, the uevent for the CCW bus is not generated and the cio_ignore rule is not triggered. I consider this a bug in Ubuntu's udev package because systemd provides a corresponding coldplug unit file (see /lib/systemd/system/systemd-udev-trigger.service) that performs the steps as proposed in 1).
- With cio_ignore handling covered by the udev rule, kdump can use the command line in 2) to blacklist all devices except the console. The latter needs to be excluded as the kernel would otherwise not boot on a z/VM guest.

Louis Bouchard (louis) on 2016-07-21
description: updated
bugproxy (bugproxy) wrote :

/etc/kernel/postinst.d/kdump-tools $(uname -r)

Hello Dimitri, or anyone else affected,

Accepted makedumpfile into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.5.9-5ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in makedumpfile (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed

------- Comment From <email address hidden> 2016-07-26 10:29 EDT-------
makedumpfile & kdump from xenial-proposed as of today are version 1.5.9-5ubuntu0.1 and are fixing that problem. Please promote it to xenial-updates.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.5.9-5ubuntu0.1

---------------
makedumpfile (1:1.5.9-5ubuntu0.1) xenial; urgency=medium

  [ Hari Bathini <email address hidden> ]
  * Fix networked kdump failure to reach remote server.
    Avoids "Network is unreachable" message when trying to do remote dumps on
    either SSH or NFS. (LP: #1571590)

  * Replace maxcpus by nr_cpus
    nr_cpus is a hard limit that has an impact on the (kdump) kernel
    memory consumption, while it is not the case with maxcpus=1, as we can
    theoretically hotplug cpus with maxcpus=1 (LP: #1568952)

  * define_stampdir() : Loop on hostname -I for 5 sec to get IP address
    if HOSTTAG=ip. The network stack may not be ready when kdump-config runs.
    Give it some time before reverting HOSTTAG to hostname if an IP address
    cannot be found. (LP: #1599561)

  * Add cio_ignore result to /etc/default/kdump-tools on s390x
    In order to have crashkernel=128M to work correctly on the s390
    architecture the result of cio_ignore -u -k needs to be appended to the
    KDUMP_CMDLINE_APPEND variable in /etc/default/kdump-tools. This patch
    adds the required logic to do the proper modification. (LP: #1570775)

  * debian/rules : drop the dh_installinit override
    Uses a syntax which is no longer supported and generate an error on
    install. (LP: #1599491)

 -- Louis Bouchard <email address hidden> Fri, 22 Jul 2016 10:15:20 +0200

Changed in makedumpfile (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for makedumpfile has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

------- Comment From <email address hidden> 2016-08-04 13:24 EDT-------
makedumpfile & kdump from xenial-proposed as of today are version 1.5.9-5ubuntu0.1 and are fixing that problem. Please promote it to xenial-updates.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-05 04:11 EDT-------
Thanks for promoting it to xenial-updates. Successfully verified and closed.
