kdump fails when crash is triggered after DLPAR cpu add operation

Bug #1828596 reported by bugproxy on 2019-05-10
Affects                           Importance  Assigned to
The Ubuntu-power-systems project  High        Canonical Kernel Team
makedumpfile (Ubuntu)             (status tracked in Eoan)
  Xenial                          Undecided   Unassigned
  Bionic                          Undecided   Thadeu Lima de Souza Cascardo
  Cosmic                          Undecided   Unassigned
  Disco                           Undecided   Thadeu Lima de Souza Cascardo
  Eoan                            Undecided   Thadeu Lima de Souza Cascardo

Bug Description

[Impact]
After a CPU add/hotplug operation on Power systems, kdump will fail after a crash. The kdump kernel needs to be reloaded after a CPU add/hotplug.

[Test case]
Do CPU add/hotplug, trigger a crash, and check for a successful kdump.

[Regression potential]
Multiple reloads caused by multiple sequential CPU adds may cause spurious log results, and systemd may fail to properly reload the kdump kernel. This has been handled by resetting the failure counter when doing such reloads.

== Comment: #0 - Hari Krishna Bathini - 2019-05-10 05:55:40 ==
---Problem Description---
kdump fails when crash is triggered after CPU add operation.

Machine Type = na

---System Hang---
 Crashed in early boot process of kdump kernel after crash

Had to issue system reset from HMC to reclaim

---Steps to Reproduce---
1. Configure kdump.
2. Add cpu from HMC.
3. Trigger crash.
4. Machine hangs after crash as below:

---
[169250.213166] IPI complete
[169250.234331] kexec: Starting switchover sequence.
I'm in purgatory
                             --- STUCK HERE ---

---uname output---
na

---Debugger---
A debugger is not configured

== Comment: #1 - Hari Krishna Bathini - 2019-05-10 05:56:46 ==
The problem is that the udev rule that restarts the kdump-tools service when a
core is added is not being triggered. As a result, the kdump kernel uses the old
device tree (DT) that kexec created before the core was added. When the system
crashes on a thread from the added core(s), the kdump kernel fails to get the
'boot_cpuid' and eventually fails to boot.

== Comment: #2 - Hari Krishna Bathini - 2019-05-10 06:02:27 ==
The udev rule for CPU add is not triggered because ppc64 does not
emit add/remove events when a CPU is hot added/removed. It only emits
online/offline events to user space when a CPU is hot added/removed.

So, the below udev rules are never triggered when needed:

SUBSYSTEM=="cpu", ACTION=="add", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="cpu", ACTION=="remove", PROGRAM="/bin/systemctl try-restart kdump-tools.service"

Also, with how CPU hot add & remove are handled in ppc64, a udev trigger
to reload kdump after CPU is hot removed is NOT necessary. So, fix the CPU
hot add case by updating the udev rule and drop the udev rule meant for CPU
hot remove in the kdump udev rules file:

SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"

bugproxy (bugproxy) on 2019-05-10
tags: added: architecture-ppc64le bugnameltc-177551 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kexec-tools (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High

I will start working on an upload to eoan by next week. I should have something for you to test early in the week.

Changed in kexec-tools (Ubuntu):
status: New → Invalid
Changed in makedumpfile (Ubuntu):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
Changed in ubuntu-power-systems:
status: New → Triaged
tags: added: powervm

At my ppa, there is a version with the change. Can you please test? The package is available for bionic, cosmic, disco and eoan.

ppa:cascardo/kdump2

Andrew Cloke (andrew-cloke) wrote :

Marking as "incomplete" while awaiting test results from Thadeu's PPA kernel.

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Changed in makedumpfile (Ubuntu):
status: New → Incomplete

------- Comment From <email address hidden> 2019-05-21 06:16 EDT-------
Cascardo, the udev rules (/lib/udev/rules.d/50-kdump-tools.rules) should have been:

SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"

but the package has:

SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="cpu", ACTION=="add", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="cpu", ACTION=="remove", PROGRAM="/bin/systemctl try-restart kdump-tools.service"
SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"

Can we get that sorted?

Thanks
Hari

Hi, Hari.

So, as you said, other architectures will use add/remove instead of online, and we want to support them too. Any reason not to do it that you are thinking of?

Thanks.
Cascardo.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-05-22 02:33 EDT-------
(In reply to comment #11)
> Hi, Hari.
>
> So, as you said, other architectures will use add/remove instead of online,
> and we want to support them too. Any reason not to do it that you are
> thinking of?

These rules have no effect on ppc64, as ADD/REMOVE events are not emitted
for the CPU subsystem as of today. So they have no impact and can be ignored.
But I thought these rules were there by accident and that the entries would be put
under arch flags to exclude them on ppc64.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-05-22 07:16 EDT-------
(In reply to comment #12)
[...]
> But I thought this rules were there by accident and the entries would be put
> under arch flags to avoid them for ppc64..

If that is too much to ask, I am fine with the current change.
The change works as expected..

Thanks
Hari

Andrew Cloke (andrew-cloke) wrote :

Based on the last comment, it looks like IBM's testing was successful and this patch is ready for SRU.
Thanks.

Changed in ubuntu-power-systems:
status: Incomplete → Confirmed
Changed in makedumpfile (Ubuntu):
status: Incomplete → Confirmed

This is now in eoan-proposed. Please verify. I will start the backport process when it hits eoan.

Thanks.
Cascardo.

Changed in makedumpfile (Ubuntu Eoan):
status: Confirmed → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-06-24 07:49 EDT-------
Thanks for the change. With it, try-restart is being triggered for
kdump-tools service after CPU add operation but systemd reported
failure with below logs:

Jun 24 06:47:06 ubuntu systemd[1]: Stopped Kernel crash dump capture service.
Jun 24 06:47:06 ubuntu systemd[1]: Starting Kernel crash dump capture service...
Jun 24 06:47:06 ubuntu kdump-tools[2023]: Starting kdump-tools: * Creating symlink /var/lib/kdump/vmlinuz
Jun 24 06:47:06 ubuntu kdump-tools[2023]: * Creating symlink /var/lib/kdump/initrd.img
Jun 24 06:47:06 ubuntu kdump-tools[2023]: Modified cmdline:BOOT_IMAGE=/vmlinux-5.0.0-17-generic root=/dev/mapper/ubuntu--vg-root ro systemd.unit=kdump-tools-dump.service maxcpus=1 irqpo
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Main process exited, code=killed, status=15/TERM
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Failed with result 'signal'.
Jun 24 06:47:06 ubuntu systemd[1]: Stopped Kernel crash dump capture service.
Jun 24 06:47:06 ubuntu systemd[1]: Starting Kernel crash dump capture service...
Jun 24 06:47:06 ubuntu kdump-tools[2071]: Starting kdump-tools: * Creating symlink /var/lib/kdump/vmlinuz
Jun 24 06:47:06 ubuntu kdump-tools[2071]: * Creating symlink /var/lib/kdump/initrd.img
Jun 24 06:47:06 ubuntu kdump-tools[2071]: Modified cmdline:BOOT_IMAGE=/vmlinux-5.0.0-17-generic root=/dev/mapper/ubuntu--vg-root ro systemd.unit=kdump-tools-dump.service maxcpus=1 irqpo
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Main process exited, code=killed, status=15/TERM
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Failed with result 'signal'.
Jun 24 06:47:06 ubuntu systemd[1]: Stopped Kernel crash dump capture service.
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Start request repeated too quickly.
Jun 24 06:47:06 ubuntu systemd[1]: kdump-tools.service: Failed with result 'signal'.
Jun 24 06:47:06 ubuntu systemd[1]: Failed to start Kernel crash dump capture service.

---
Looks like a rate-limit issue with systemd. Is there a systemd option to work around it?

I am running the commands below on a PowerVM machine:

# drmgr -c cpu -r -q 1 (to remove a core)
# drmgr -c cpu -a -q 1 (to add it back -> this triggers 8 CPU online udev events as SMT is 8)

To conclude, the udev rule alone is not sufficient. We need a way to handle multiple
requests arriving at once.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.5-1ubuntu2

---------------
makedumpfile (1:1.6.5-1ubuntu2) eoan; urgency=medium

  [ Thadeu Lima de Souza Cascardo ]
  * Use maxcpus instead of nr_cpus on ppc64el. (LP: #1828597)
  * Reload kdump when CPU is brought online. (LP: #1828596)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Fri, 14 Jun 2019 10:58:40 -0300

Changed in makedumpfile (Ubuntu Eoan):
status: Fix Committed → Fix Released
description: updated

On Mon, Jun 24, 2019 at 11:59:48AM -0000, bugproxy wrote:
> [...]
> Looks like a ratelimit issue with systemd. Is there some systemd option to workaround it?
>
> I am running the below command on a PowerVM machine:
>
> # drmgr -c cpu -r -q 1 (to remove a core)
> # drmgr -c cpu -a -q 1 (to add it back -> this triggers 8 CPU online udev events as SMT is 8)
>
> To conclude, udev rule alone is not sufficient. Need a way to address the multiple
> requests at once..

There are these systemd options, which default to a burst limit of 5 restarts within
an interval of 10s.

       StartLimitIntervalSec=interval, StartLimitBurst=burst

One other option, which I prefer, however, is to reset the start rate limit
counter by calling systemctl reset-failed kdump-tools.service from the udev rule.

Can you try that?

Thanks.
Cascardo.
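
For illustration, the reset-failed idea could be expressed as a udev rule along the following lines. This is only a sketch of the approach described in the comment above, not necessarily the rule that was shipped in any package; wrapping the two systemctl calls in /bin/sh -c is an assumption made here so that both run from a single rule.

```
# Sketch only: clear the start rate-limit counter, then restart the service.
SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/sh -c '/bin/systemctl reset-failed kdump-tools.service; /bin/systemctl try-restart kdump-tools.service'"
```

With a rule like this, each CPU online event still triggers a restart, but the burst counter starts from zero every time, so the 8 events produced by one SMT8 core should no longer exhaust the default limit of 5.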

Manoj Iyer (manjo) on 2019-07-08
Changed in ubuntu-power-systems:
status: Confirmed → Incomplete

------- Comment From <email address hidden> 2019-07-15 06:36 EDT-------
Cascardo, I did not tinker with other options but disabling ratelimit helped:

"StartLimitInterval=0"

"systemctl reset-failed kdump-tools.service" seems like a good option but
may not be needed if ratelimit is disabled..

Thanks
Hari
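
A drop-in along these lines is one way the rate limit could be disabled for local testing. It is a sketch under assumptions: the drop-in path is the conventional location for local overrides and is not taken from this report, and the option is spelled StartLimitIntervalSec in the [Unit] section as in current systemd, whereas the comment above quotes the older StartLimitInterval spelling.

```ini
# /etc/systemd/system/kdump-tools.service.d/override.conf (assumed path)
[Unit]
# An interval of 0 turns start rate limiting off for this unit.
StartLimitIntervalSec=0
```

After adding the file, "systemctl daemon-reload" is needed for the change to take effect. Note that this turns off start rate limiting for every failure mode of the service, not only for restarts caused by hotplug events.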

Disabling the ratelimit in general would break other failure modes, so I would rather just reset-failed when calling try-restart because of the hotplug events.

Can you try the package in ppa:cascardo/kdump2? Packages for eoan, disco and bionic available.

Thanks.
Cascardo.

description: updated
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-07-25 05:43 EDT-------
(In reply to comment #27)
> Disabling the ratelimit in general would break other failure modes, so I
> would rather just reset-failed when calling try-restart because of the
> hotplug events.
>
> Can you try the package in ppa:cascardo/kdump2? Packages for eoan, disco and
> bionic available.

Cascardo, is the fix package you are proposing still here? I see the below
package version:

ii kdump-tools 1:1.6.5-1ubuntu2~18.04.1

which doesn't seem to have "systemctl reset-failed kdump-tools" invoked anywhere.
I was trying this out on bionic with 5.0.0-17-generic kernel and the issue is reproducible..

Hi Hari, did you manage to try the package in https://launchpad.net/~cascardo/+archive/ubuntu/kdump2?

I've downloaded the kdump-tools deb package for ppc64 from the above PPA and checked that it contains the udev rule:
SUBSYSTEM=="cpu", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump-tools.service"

From the latest comments, I understand that the above rule is the fix for this LP bug, correct?
Can you manually download the package from the above PPA, install it and verify that /lib/udev/rules.d/50-kdump-tools.rules contains the fixing rule?

In case it has that and still fails your testing, then we need to understand why the fix is not working.
Thanks,

Guilherme
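
The check requested above can be scripted. The following is a minimal sketch, assuming the rules file path quoted in this report; the function names are made up for illustration, and the grep pattern expects exactly the spacing used in the rules shown in this thread.

```shell
#!/bin/sh
# Sketch: report whether a udev rules file contains the CPU "online" rule
# and whether any rule in it calls "reset-failed".
# The default path is the one quoted in this bug report.

RULES_FILE="${RULES_FILE:-/lib/udev/rules.d/50-kdump-tools.rules}"

# Succeeds if the file has a rule matching CPU online events.
has_cpu_online_rule() {
    grep -q 'SUBSYSTEM=="cpu", ACTION=="online"' "$1"
}

# Succeeds if any line in the file mentions "reset-failed".
has_reset_failed() {
    grep -q 'reset-failed' "$1"
}

# Print a two-line summary for the given rules file.
check_rules() {
    if has_cpu_online_rule "$1"; then
        echo "cpu online rule: present"
    else
        echo "cpu online rule: missing"
    fi
    if has_reset_failed "$1"; then
        echo "reset-failed: present"
    else
        echo "reset-failed: missing"
    fi
}

# Only inspect the file if it exists, so the sketch is safe to run anywhere.
if [ -f "$RULES_FILE" ]; then
    check_rules "$RULES_FILE"
fi
```

A package that carries only the initial fix would be expected to report the online rule as present and reset-failed as missing.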

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-07-26 06:48 EDT-------
Guilherme, the initial fix (udev rule) is still available. But while testing I observed failure
due to systemd ratelimiting. I proposed to disable ratelimit but IIUC, Cascardo
preferred a different approach that does not involve disabling systemd ratelimit
and provided an updated package with a different approach to solve ratelimiting.
My recent comment is that there is no updated package but just the initial fix.
Hope that clears it up..

Hi Hari, thanks for clarifying! Now I understand. It seems we need to wait for Cascardo's input to see whether he has already implemented the systemctl reset-failed change.

Cheers,

Guilherme

It was implemented, but the upload did not build on that ppa, because I used different versions. I am still catching up after vacation time, so will post some updates as soon as I have them.

Changed in ubuntu-power-systems:
status: Incomplete → Triaged
Eric Desrochers (slashd) on 2019-08-27
Changed in makedumpfile (Ubuntu Disco):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
status: New → In Progress
Changed in makedumpfile (Ubuntu Bionic):
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
status: New → In Progress
Changed in makedumpfile (Ubuntu Cosmic):
status: New → Won't Fix

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in makedumpfile (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco
Andy Whitcroft (apw) wrote :

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1~18.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in makedumpfile (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Changed in ubuntu-power-systems:
status: Triaged → In Progress

------- Comment on attachment From <email address hidden> 2019-08-29 08:34 EDT-------

udev rules are not triggering kdump-tools service restart after hot adding
CPU or hot adding/removing memory with kdump-tools package version
1.6.5-1ubuntu1~18.04.2

tags: added: verification-failed-bionic
removed: verification-needed-bionic

All autopkgtests for the newly accepted makedumpfile (1:1.6.5-1ubuntu1.1) for disco have finished running.
The following regressions have been reported in tests triggered by the package:

makedumpfile/1:1.6.5-1ubuntu1.1 (s390x, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/disco/update_excuses.html#makedumpfile

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

------- Comment (attachment only) From <email address hidden> 2019-08-30 02:48 EDT-------

bugproxy (bugproxy) on 2019-08-30
tags: added: verification-failed verification-failed-disco
removed: verification-needed verification-needed-disco
no longer affects: kexec-tools (Ubuntu Eoan)
no longer affects: kexec-tools (Ubuntu Disco)
no longer affects: kexec-tools (Ubuntu Cosmic)
no longer affects: kexec-tools (Ubuntu Bionic)
no longer affects: kexec-tools (Ubuntu Xenial)
no longer affects: kexec-tools (Ubuntu)

The version of makedumpfile in the proposed pocket of Bionic that was purported to fix this bug report has been removed because one or more bugs that were to be fixed by the upload have failed verification and been in this state for more than 10 days.

Changed in makedumpfile (Ubuntu Bionic):
status: Fix Committed → Won't Fix

Hi, can you try the package in ppa:cascardo/ppa ?

Thanks.
Cascardo.

There is some interaction between systemd MemoryDenyWriteExecute=yes setting on udevd and how grep has been built (possibly pointing out at the toolchain), so the new solution on the ppa isn't working on bionic.

We will work on this bug, and see how this behaves on disco and eoan. In case either of those is fine, we will ask IBM to test it there, while we move forward with this systemd/toolchain interaction bug.

Thanks.
Cascardo.

https://bugs.launchpad.net/ubuntu/+source/grep/+bug/1844524

This should be fixed by rebuilding grep. I have uploaded grep to my ppa, so if you install kdump-tools and grep from ppa:cascardo/ppa, you will be able to test this.

Can you please do it, so we reduce the risk of the next upload being removed from -proposed again?

Thanks.
Cascardo.

Changed in makedumpfile (Ubuntu Eoan):
status: Fix Released → In Progress
Changed in makedumpfile (Ubuntu Disco):
status: Fix Committed → In Progress
Changed in makedumpfile (Ubuntu Bionic):
status: Won't Fix → In Progress

A fixed grep is already in bionic-updates. Either that or the one on my ppa must be installed in order for the makedumpfile version in my ppa to work. I will wait for testing feedback before I get this fix uploaded to eoan, disco and bionic.

Thanks.
Cascardo.

Hi Hari, did you have a chance to re-test using the latest version Cascardo pointed in his last comment?

We wait on your testing to be sure all is working now, and we can re-upload the package to -proposed pocket.
Thanks in advance,

Guilherme

------- Comment From <email address hidden> 2019-09-23 13:28 EDT-------
Sorry about the delay. Observed that kdump/fadump is loaded even when
kdump-tools service is disabled. Not desirable, I guess. Probably need to
check if kdump-tools service is active before trying a reload?

Hi, Hari.

makedumpfile 1:1.6.5-1ubuntu1~18.04.2+cascardo2 on ppa:cascardo/ppa uses a try-reload instead. Can you test it, please?

Thanks.
Cascardo.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-09-25 02:41 EDT-------
(In reply to comment #47)
> Hi, Hari.
>
> makedumpfile 1:1.6.5-1ubuntu1~18.04.2+cascardo2 on ppa:cascardo/ppa uses a
> try-reload instead. Can you test it, please?

Cascardo, try-reload is not considering fadump case (supported on powerpc).
For fadump case, need to check whether "/sys/kernel/fadump_registered" is `1`
before proceeding with unload/load..

A suggestion I have is to check for "systemctl is-active kdump-tools" and run
"kdump-config reload" if it returns true, instead of "kdump-config try-reload"
as that should cover for both kdump and fadump cases.

Also, shouldn't we account for races when multiple udev events are triggered
simultaneously by using locks or such?
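
The suggestion above (check whether the service is active, then reload, with locking against concurrent udev events) could look roughly like the helper below. It is a hypothetical sketch, not the script shipped in kdump-tools: the function name, the lock file path, and the "run" argument are assumptions, and it relies on flock from util-linux being installed.

```shell
#!/bin/sh
# Hypothetical hotplug helper (not the packaged script).
# Commands and the lock path can be overridden through the environment,
# which also makes the function easy to exercise without systemd.

SYSTEMCTL="${SYSTEMCTL:-systemctl}"
KDUMP_CONFIG="${KDUMP_CONFIG:-kdump-config}"
LOCK_FILE="${LOCK_FILE:-/run/kdump-hotplug.lock}"

reload_if_active() {
    (
        # Serialize concurrent invocations: one udev event per CPU thread
        # may arrive at nearly the same time.
        flock -x 9 || exit 1

        # Do nothing when the service is not active, so that a disabled
        # kdump-tools is never loaded as a side effect of hotplug.
        if "$SYSTEMCTL" is-active --quiet kdump-tools.service; then
            "$KDUMP_CONFIG" reload
        fi
    ) 9>"$LOCK_FILE"
}

# Act only when called with "run", for example from a udev rule.
if [ "${1:-}" = "run" ]; then
    reload_if_active
fi
```

The exclusive lock only keeps reloads from overlapping; it does not coalesce them, so eight online events would still cause eight sequential reloads.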
