kdump is not captured in remote host when kdump over ssh is configured

Bug #1681909 reported by bugproxy on 2017-04-11
This bug affects 1 person
Affects | Importance | Assigned to
The Ubuntu-power-systems project | Low | Unassigned
makedumpfile (Ubuntu) | Medium | Guilherme G. Piccoli
Xenial | Medium | Thadeu Lima de Souza Cascardo
Bionic | Medium | Thadeu Lima de Souza Cascardo
Cosmic | Medium | Thadeu Lima de Souza Cascardo
Disco | Medium | Thadeu Lima de Souza Cascardo
Eoan | Medium | Guilherme G. Piccoli

Bug Description

[Impact]

* Kdump over network (like NFS mount or SSH dump) relies on the network-online target from systemd. Even so, some NICs report "Link Up" state but aren't yet ready to transmit packets. This bad behavior is probably caused by NIC firmware delays and is usually not fixable from drivers. Some adapters known to act like this are bnx2x, tg3 and ixgbe.

* Kdump may be a last resort for debugging complex or hard-to-reproduce issues, so it is worth increasing its reliability and resilience. We therefore propose a solution/quirk for network dumps: a retry/delay mechanism. If it's a network dump, kdump will retry a few times and sleep between attempts, to cover the case of NICs that aren't ready yet but will soon be able to transmit packets.

* Although first reported by IBM on the PowerPC arch, the scope of this issue is the NIC, and it was later reported on the x86 arch too.
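
The retry/delay idea described above can be illustrated with a minimal shell sketch. This is not the code shipped in kdump-config; the function name retry_with_delay, the attempt count, and the host in the usage comment are assumptions made for illustration. Only the 3-second pause mirrors the value mentioned in this description.

```shell
#!/bin/sh
# Illustrative sketch only; NOT the actual kdump-config implementation.
# retry_with_delay MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed,
# sleeping DELAY_SECONDS between attempts.
retry_with_delay() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0            # command worked, stop retrying
        fi
        echo "attempt $try of $max_tries failed" >&2
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$delay"      # give the NIC time to become ready
        fi
        try=$((try + 1))
    done
    return 1                    # every attempt failed
}

# Example (hypothetical host and attempt count, commented out so the
# sketch has no side effects):
# retry_with_delay 5 3 ssh -o ConnectTimeout=5 root@dump-server true
```

Bounding the number of attempts keeps the worst-case added delay predictable, which matters because the crash kernel should reboot the machine in a reasonable time even when the remote host never becomes reachable.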

[Test case]

This issue is usually hard to reproduce naturally in a deterministic way, but there is an artificial test case in comment #24 of this LP.
Also, one user who reported this bug managed to reproduce the problem consistently, and it was fixed when testing our solution.

[Regression potential]

There is no clear regression potential here since it's just a retry/delay mechanism. Potential problems could come from coding mistakes in the script.
The delay between attempts is only 3 sec per iteration, so it shouldn't block kdump progress for long at any one time.

[Other information]

Salsa Debian commit:
https://salsa.debian.org/debian/makedumpfile/commit/d63ba95337988be1eac8c8c76d90825ff5c6d17f


Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-152306 severity-high targetmilestone-inin1704
bugproxy (bugproxy) wrote :

Default Comment by Bridge

bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → makedumpfile (Ubuntu)
Manoj Iyer (manjo) on 2017-05-08
Changed in makedumpfile (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Nish Aravamudan (nacc)
importance: Undecided → High
Manoj Iyer (manjo) on 2017-06-01
tags: added: ubuntu-17.04

------- Comment From <email address hidden> 2017-06-21 10:28 EDT-------
Canonical, any take on introducing NET_WAIT_TIME in /etc/default/kdump-tools file
to deal with timing issues on some NICs?

Thanks
Hari

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-07-10 06:45 EDT-------
Issue is observed even on Briggs machine on Ubuntu 16.04.03

Thanks,
Pavithra

tags: added: targetmilestone-inin16043
removed: targetmilestone-inin1704
Manoj Iyer (manjo) on 2017-07-19
Changed in ubuntu-power-systems:
importance: Undecided → High
Manoj Iyer (manjo) on 2017-07-31
tags: added: triage-a

I don't see any fundamental issue with providing a NET_WAIT_TIME variable (probably should be namespaced to KDUMP_) in the kdump config file, but:

1) this seems like a hack to work around slow hardware, right?

2) it can't be automatically deduced, afaict. Or do you want to have 30s delays (potentially) on all POWER machines?

3) I'm not 100% familiar with the 'upstream' of kdump-tools -- is this something that we'd need to carry forever in the Debian/Ubuntu packaging?

------- Comment From <email address hidden> 2017-07-31 14:14 EDT-------
(In reply to comment #30)
> I don't see any fundamental issue with providing a NET_WAIT_TIME variable
> (probably should be namespaced to KDUMP_) in the kdump config file, but:

Right, KDUMP_NET_WAIT_TIME is better..

>
> 1) this seems like a hack to work around slow hardware, right?
>

Yeah. On NICs that are slow to initialize. With my limited expertise
in network related problems, I thought this can be a nice config option
to have. A right fix might be somewhere in network related stuff..

> 2) it can't be automatically deduced, afaict. Or do you want to have 30s
> delays (potentially) on all POWER machines?
>

No. I think this has more to do with the NIC than arch. What I have in mind
is a 0s delay time by default but something that can be set
to a non-zero value for NICs like this using KDUMP_NET_WAIT_TIME=

> 3) I'm not 100% familiar with the 'upstream' of kdump-tools -- is this
> something that we'd need to carry forever in the Debian/Ubuntu packaging?

Probably, unless there is a fix in NIC (hardware/firmware) and/or network related code
that makes this config option redundant..

Thanks
Hari

Changed in makedumpfile (Ubuntu):
assignee: Nish Aravamudan (nacc) → nobody
Manoj Iyer (manjo) on 2017-08-14
Changed in makedumpfile (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: kernel-da-key
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Manoj Iyer (manjo) on 2017-09-11
tags: added: triage-r
removed: triage-a

Hi, I am a little reluctant to add this without trying other solutions first.

So, options are:

1) For some reason, this driver is not behaving correctly. Can you add the PowerIO folks to this bug on IBM side and let them do some investigation?

2) network-online is not doing the correct thing. Well, from what I read, they indeed don't care much about this and think the program should wait for the network to be available. After eliminating 1, we should look into why network-online decides the network is online or why systemd would start kdump after that, and the ssh host would still not be reachable.

3) I would rather add the timeout but also conditionally checking for the host availability. That is: wait until it's available, then dump. If not available for the timeout duration, reboot.
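
Option 3 above (wait for the host, dump once it is reachable, give up after a timeout) could look roughly like the following sketch. The helper name wait_until_ready, the 60-second budget, and the ssh probe in the usage comment are assumptions for illustration, not part of kdump-tools.

```shell
#!/bin/sh
# Illustrative sketch only; names and values are hypothetical.
# wait_until_ready TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Polls PROBE until it succeeds or roughly TIMEOUT_SECONDS have been
# spent sleeping. Time spent inside PROBE itself is not counted.
wait_until_ready() {
    timeout=$1
    interval=$2     # must be a positive integer
    shift 2
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1            # budget exhausted, host never answered
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0                    # probe succeeded, safe to start the dump
}

# Example (hypothetical host, commented out to avoid side effects):
# if wait_until_ready 60 3 ssh -o ConnectTimeout=5 root@dump-server true; then
#     echo "host reachable, start the dump"
# else
#     echo "host unreachable, give up and reboot"
# fi
```

Compared with a fixed sleep, polling adds no delay on machines whose NIC is ready immediately and only costs time on the slow ones.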

------- Comment From <email address hidden> 2017-11-17 03:21 EDT-------
(In reply to comment #32)
> Hi, I am a little reluctant to add this without trying other
> solutions first.
>
> So, options are:
>
> 1) For some reason, this driver is not behaving correctly. Can you add the
> PowerIO folks to this bug on IBM side and let them do some investigation?

done.

>
> 2) network-online is not doing the correct thing. Well, from what I read,
> they indeed don't care much about this and think the program should wait for
> the network to be available. After eliminating 1, we should look into why
> network-online decides the network is online or why systemd would start
> kdump after that, and the ssh host would still not be reachable.
>

> 3) I would rather add the timeout but also conditionally checking for the
> host availability. That is: wait until it's available, then dump. If not
> available for the timeout duration, reboot.

Sounds reasonable.

Thanks
Hari

tags: added: ppc64el-kdump
Changed in ubuntu-power-systems:
status: New → Triaged

From comment #8 and #10, is there any result from the PowerIO team's investigation into the NIC and related KDUMP_NET_WAIT_TIME config option?

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
tags: added: triage-g
removed: triage-r

------- Comment From <email address hidden> 2018-02-26 13:57 EDT-------
I tried reproducing this issue in a Firestone system our team owns with Ubuntu 18.04 and I couldn't reproduce the issue.

root@ltc-fire1:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu Bionic Beaver (development branch)"

root@ltc-fire1:~# uname -r
4.15.0-10-generic

root@ltc-fire1:~# ethtool -i enP1p1s0f0 | grep "driver\|firmware\|bus-info"
driver: tg3
firmware-version: 5719-v1.38i
bus-info: 0001:01:00.0

root@ltc-fire1:~# lspci -vmmnn -s 0001:01:00.0
Slot: 0001:01:00.0
Class: Ethernet controller [0200]
Vendor: Broadcom Limited [14e4]
Device: NetXtreme BCM5719 Gigabit Ethernet PCIe [1657]
SVendor: IBM [1014]
SDevice: NetXtreme BCM5719 Gigabit Ethernet PCIe (FC 5260/5899 4-port 1 GbE Adapter for Power) [0420]
Rev: 01
NUMANode: 0

Here is a snippet of the kdump attempt to reproduce the issue:

[ 129.602468] kdump-tools[1599]: Starting kdump-tools: * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.194.212 : /var/crash/9.40.195.135-201802261141/dump-incomplete
[ 129.719688] kdump-tools[1599]: The kernel version is not supported.
[ 129.720035] kdump-tools[1599]: The makedumpfile operation may be incomplete.
Copying data : [100.0 %] \ eta: 0s
[ 144.173303] kdump-tools[1599]: The dumpfile is saved to STDOUT.
[ 144.173531] kdump-tools[1599]: makedumpfile Completed.
[ 144.184688] kdump-tools[1599]: 533781+259 records in
[ 144.184975] kdump-tools[1599]: 533918+1 records out
[ 144.185223] kdump-tools[1599]: 273366297 bytes (273 MB, 261 MiB) copied, 14.3426 s, 19.1 MB/s
[ 144.419378] kdump-tools[1599]: * kdump-tools: saved vmcore in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[ 144.439183] kdump-tools[1599]: * running makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.201802261141
[ 144.445902] kdump-tools[1599]: The kernel version is not supported.
[ 144.446266] kdump-tools[1599]: The makedumpfile operation may be incomplete.
[ 144.446557] kdump-tools[1599]: The dmesg log is saved to /tmp/dmesg.201802261141.
[ 144.446844] kdump-tools[1599]: makedumpfile Completed.
[ 144.717033] kdump-tools[1599]: * kdump-tools: saved dmesg content in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[ 144.718981] kdump-tools[1599]: Mon, 26 Feb 2018 11:42:11 -0700
[ 144.825931] kdump-tools[1599]: Rebooting.
[ 144.950195] reboot: Restarting system

Both dmesg and dump file were transferred to the peer under /var/crash/

root@ltc-zz4-lp2:~# ls /var/crash/9.40.195.135-201802261141
dmesg.201802261141 dump.201802261141

Pavithra, if you still have the system, could you attempt to reproduce this issue in your environment?

------- Comment From <email address hidden> 2018-02-28 01:46 EDT-------
The issue is not observed on the same machine with 17.10 and 18.04.

we can close the bug.

18.04
========

Starting Kernel crash dump capture service...
[ 29.664816] kdump-tools[1255]: Starting kdump-tools: * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.192.198 : /var/crash/9.47.70.29-201802280145/dump-incomplete
[ 29.732510] kdump-tool...


------- Comment From <email address hidden> 2018-03-05 08:49 EDT-------
(In reply to comment #40)
> Can you try 16.04?
>
> Thanks.
> Cascardo.

Cascardo, sure, I'll give it a try and report back the test results as soon as possible.

Best regards,
Murilo

summary: - Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ dump is not captured in remote host when kdump over ssh is configured on
+ firestone.
tags: removed: ubuntu-17.04

------- Comment From <email address hidden> 2018-03-06 09:38 EDT-------
Cascardo, I gave a try with kdump in Ubuntu 16.04 and it seems to occasionally fail.

The failures seem random. I see that we are hitting an EEH in slots behind a PLX switch, but we hit the EEH in successful attempts as well, so I'm not sure how related it is. I collected the console log of a failed attempt (I am attaching it).

I am attempting to drop into a shell by setting sh or bash to run in the KDUMP_FAIL_CMD option, but it seems to just hang and not give me a console. I want to collect more logs and check whether the adapter is able to reach the peer, to see if it is possibly a timing issue. Is there a proper way to drop to a shell in case of a kdump failure?

Also, I am attempting to reinstall Ubuntu 18.04 to retry kdump a few more times and make sure I didn't just get lucky, but I am hitting IBM Bug 165336 - Canonical LP 1753449, which is preventing me from reinstalling Ubuntu 18.04.

------- Comment (attachment only) From <email address hidden> 2018-03-06 09:39 EDT-------

Hi, Murilo.

Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the noirqdistrib option might be related to the EEH issues.

This was reported against 17.04, and since we don't know why the network had a problem, we don't know whether the fix belongs in the driver rather than in kdump. Looking at the Xenial 4.4 kernel, we could then check whether the issue has been fixed in later kernels but still affects 4.4.

If you continue hitting the EEH when using noirqdistrib (with kdump-tools from -proposed), then we might look into that, with a different bug.

Thanks.
Cascardo.

Looking at the log, I noticed the EEH is frozen right after finding the Broadcom card. Is that one the tg3?

[ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
[ 8.191135] EEH: Frozen PE#7 on PHB#21 detected
[ 8.191280] EEH: PE location: S00210f, PHB location: N/A

Also, the recovery problem seems to be caused by ast.

[ 18.267005] EEH: 2100000 reads ignored for recovering device at location=S00210f driver=ast pci addr=0021:10:00.0
[ 18.267334] EEH: Might be infinite loop in ast driver

Looking at the upstream logs, one commit came up. Can you open a new bug for it?

commit 298360af3dab45659810fdc51aba0c9f4097e4f6
Author: Russell Currey <email address hidden>
Date: Thu Dec 15 16:12:41 2016 +1100

    drivers/gpu/drm/ast: Fix infinite loop if read fails

    ast_get_dram_info() configures a window in order to access BMC memory.
    A BMC register can be configured to disallow this, and if so, causes
    an infinite loop in the ast driver which renders the system unusable.

    Fix this by erroring out if an error is detected. On powerpc systems with
    EEH, this leads to the device being fenced and the system continuing to
    operate.

    Cc: <email address hidden> # 3.10+
    Signed-off-by: Russell Currey <email address hidden>
    Reviewed-by: Joel Stanley <email address hidden>
    Signed-off-by: Daniel Vetter <email address hidden>
    Link: http://patchwork<email address hidden>

diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
index 904beaa932d03..f75c6421db623 100644
--- a/drivers/gpu/drm/ast/ast_main.c
+++ b/drivers/gpu/drm/ast/ast_main.c
@@ -223,7 +223,8 @@ static int ast_get_dram_info(struct drm_device *dev)
        ast_write32(ast, 0x10000, 0xfc600309);

        do {
-               ;
+               if (pci_channel_offline(dev->pdev))
+                       return -EIO;
        } while (ast_read32(ast, 0x10000) != 0x01);
        data = ast_read32(ast, 0x10004);

@@ -428,7 +429,9 @@ int ast_driver_load(struct drm_device *dev, unsigned long flags)
        ast_detect_chip(dev, &need_post);

        if (ast->chip != AST1180) {
-               ast_get_dram_info(dev);
+               ret = ast_get_dram_info(dev);

------- Comment From <email address hidden> 2018-03-06 11:15 EDT-------
(In reply to comment #45)
> Hi, Murilo.
>
> Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the
> noirqdistrib option might be related to the EEH issues.
>

Ok, I'll give it a try.

(In reply to comment #46)
> Looking at the log, I noticed the EEH is frozen right after finding the
> Broadcom card. Is that one the tg3?
>
> [ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
> [ 8.191135] EEH: Frozen PE#7 on PHB#21 detected
> [ 8.191280] EEH: PE location: S00210f, PHB location: N/A

Yeah, correct, this is the tg3 device. But the EEH is seen on a different PHB than the one the adapter is on: the adapter is on PHB#01, whereas the EEH is seen on PHB#21.

>
> Also, the recovery problem seems to be caused by ast.
>
> [ 18.267005] EEH: 2100000 reads ignored for recovering device at
> location=S00210f driver=ast pci addr=0021:10:00.0
> [ 18.267334] EEH: Might be infinite loop in ast driver
>
> Looking at the upstream logs, one commit came up. Can you open a new bug for
> it?
>
> commit 298360af3dab45659810fdc51aba0c9f4097e4f6
> Author: Russell Currey <email address hidden>
> Date: Thu Dec 15 16:12:41 2016 +1100
>
> drivers/gpu/drm/ast: Fix infinite loop if read fails

Cascardo, the mentioned patch is already in this kernel, according to the changelog for linux-image-4.4.0-116-generic:
* Xenial update to v4.4.41 stable release (LP: #1655041)
- drivers/gpu/drm/ast: Fix infinite loop if read fails

Also, this is not the only device hitting the EEH: when I blacklisted the ast module, I still saw the EEH hitting the other slots behind the PLX switch.

I was able to collect a full dmesg output by adding the dmesg command to the KDUMP_FAIL_CMD option, still no luck in getting it to drop to a shell.

------- Comment (attachment only) From <email address hidden> 2018-03-06 11:18 EDT-------

The kdump network timeout feature is a new feature request that we will investigate for possible inclusion in 18.10.

Manoj Iyer (manjo) on 2018-03-19
summary: - dump is not captured in remote host when kdump over ssh is configured on
- firestone.
+ [18.10]dump is not captured in remote host when kdump over ssh is
+ configured on firestone.
summary: - [18.10]dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ [Feat req18.10]dump is not captured in remote host when kdump over ssh
+ is configured on firestone.
Manoj Iyer (manjo) on 2018-03-19
Changed in ubuntu-power-systems:
importance: High → Low
Changed in makedumpfile (Ubuntu):
importance: High → Low
summary: - [Feat req18.10]dump is not captured in remote host when kdump over ssh
- is configured on firestone.
+ [Feat 18.10]dump is not captured in remote host when kdump over ssh is
+ configured on firestone.
bugproxy (bugproxy) on 2018-04-05
tags: removed: bugnameltc-152306 kernel-da-key ppc64el-kdump severity-high triage-g
tags: added: kernel-da-key ppc64el-kdump triage-g
summary: - [Feat 18.10]dump is not captured in remote host when kdump over ssh is
+ [FEAT 18.10] dump is not captured in remote host when kdump over ssh is
configured on firestone.
Manoj Iyer (manjo) on 2019-05-29
Changed in makedumpfile (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Changed in ubuntu-power-systems:
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Manoj Iyer (manjo) on 2019-06-10
Changed in makedumpfile (Ubuntu):
status: New → Incomplete
summary: - [FEAT 18.10] dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ kdump is not captured in remote host when kdump over ssh is configured
no longer affects: makedumpfile (Ubuntu)
Changed in makedumpfile (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in makedumpfile (Ubuntu Xenial):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Bionic):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Cosmic):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Disco):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Xenial):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu Bionic):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu Cosmic):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu Disco):
importance: Undecided → Medium
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in makedumpfile (Ubuntu Cosmic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in makedumpfile (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in makedumpfile (Ubuntu Xenial):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
tags: added: sts
removed: targetmilestone-inin16043

It came to my attention that this issue still exists; a user report on an x86-64 machine confirms that it's not only a ppc64 issue but a generic problem caused by the behavior of some NICs: even with link up, they are not really ready to transmit packets. I could reproduce this by modifying virtio to mimic this scenario, and I'm working on a solution/quirk in the kdump-tools (makedumpfile) package.

Cheers,

Guilherme

Changed in ubuntu-power-systems:
status: Incomplete → Confirmed

This is a mock reproducer of this issue by faking a 20s delay in virtio-net after its link is up.

To enable that, the user needs to build virtio-net with the attached patch and insert:

"echo 1 > /sys/module/virtio_net/parameters/droppkt"

into /usr/sbin/kdump-config before the network dump procedure.

It'll introduce a 20s delay in packet transmission, causing the issue reported in this LP to be reproduced.
This patch was developed against the Bionic 4.15.x kernel.
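
As a sketch of that step, the write can be guarded so that it becomes a no-op on kernels built without the test patch. The droppkt parameter exists only with the patch attached to this bug; the function name enable_droppkt and its optional path argument are assumptions added for illustration.

```shell
#!/bin/sh
# Illustrative sketch: enable the fake transmit delay only when the patched
# virtio_net module exposes the parameter. "droppkt" comes from the test
# patch attached to this bug, not from a stock kernel.
enable_droppkt() {
    # The optional argument (an assumption, for testing) overrides the path.
    param=${1:-/sys/module/virtio_net/parameters/droppkt}
    if [ -w "$param" ]; then
        echo 1 > "$param"
        return 0
    fi
    echo "droppkt parameter not available at $param" >&2
    return 1
}
```

The guard avoids a confusing shell error in the crash kernel when the reproducer line is left in place on an unpatched system.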

This is the debdiff with the retry/delay mechanism, for Eoan. I've discussed with Cascardo and we agreed he will do the SRU to old releases (X/B/C/D) after applying some other SRUs he's working on now.

I'd especially like to thank Hari, Murilo and Pavithra from IBM, who reported this issue, worked on it, and proposed a solution!

description: updated
Changed in makedumpfile (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in makedumpfile (Ubuntu Bionic):
status: Confirmed → In Progress
Changed in makedumpfile (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in makedumpfile (Ubuntu Disco):
status: Confirmed → In Progress
Changed in makedumpfile (Ubuntu Eoan):
status: Confirmed → In Progress
Changed in makedumpfile (Ubuntu Disco):
assignee: Guilherme G. Piccoli (gpiccoli) → Thadeu Lima de Souza Cascardo (cascardo)
Changed in makedumpfile (Ubuntu Cosmic):
assignee: Guilherme G. Piccoli (gpiccoli) → Thadeu Lima de Souza Cascardo (cascardo)
Changed in makedumpfile (Ubuntu Bionic):
assignee: Guilherme G. Piccoli (gpiccoli) → Thadeu Lima de Souza Cascardo (cascardo)
Changed in makedumpfile (Ubuntu Xenial):
assignee: Guilherme G. Piccoli (gpiccoli) → Thadeu Lima de Souza Cascardo (cascardo)
Changed in ubuntu-power-systems:
status: Confirmed → In Progress

The attachment "lp1681909_eoan.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch

Fix for eoan at my ppa. ppa:cascardo/kdump2.

Attaching SRU for disco and bionic.

Eric Desrochers (slashd) wrote :

Marking Cosmic as 'Won't fix'.

Ubuntu 18.10 (Cosmic Cuttlefish) End Of Life reached on July 18 2019.

Changed in makedumpfile (Ubuntu Cosmic):
status: In Progress → Won't Fix
Eric Desrochers (slashd) on 2019-07-23
description: updated
Eric Desrochers (slashd) wrote :

Sponsored for 'Eoan'.

We'll be able to start the SRU sponsoring as soon as it lands in -releases.

Notes:
* Patch landed in Debian unstable ~2 weeks ago : https://salsa.debian.org/debian/makedumpfile/commit/d63ba95337988be1eac8c8c76d90825ff5c6d17f

* Patch has been "Signed-off-by" by a member of the Ubuntu kernel team.

Changed in makedumpfile (Ubuntu Eoan):
status: In Progress → Fix Committed

Thanks Eric for sponsoring this LP!

I'm marking Xenial as "Won't Fix" for now, since we have had no issue reports and it would require a slightly more complex backport than Bionic. We'll revisit the SRU to Xenial later, especially if we get reports of this failure in that release.

Cheers,

Guilherme

Changed in makedumpfile (Ubuntu Xenial):
status: In Progress → Won't Fix
Eric Desrochers (slashd) wrote :

Quick update

# excuses... page:

makedumpfile (1:1.6.5-1ubuntu2 to 1:1.6.5-1ubuntu3)
Maintainer: Louis Bouchard
0 days old
autopkgtest for kpatch/0.5.0-0ubuntu2: amd64: Ignored failure
autopkgtest for makedumpfile/1:1.6.5-1ubuntu3: amd64: Pass, arm64: Pass, armhf: Pass, i386: Pass, ppc64el: Regression ♻ , s390x: Ignored failure
Not considered

# logs
.....
makedumpfile: crash test: checking for crash file
makedumpfile: ERROR: crash test: Found no compressed dumps
.....

gpiccoli and I are investigating the root cause.

Eric Desrochers (slashd) wrote :

Quick update:

It seems to fail the same way with 1:1.6.5-1ubuntu2, so NOT introduced by this SRU via 1:1.6.5-1ubuntu3

We still have to test the autopkgtest locally on ppc64el arch and instrument/monitor the test to understand why no crash is found in /var/crash at the end of the test.

Eric, thanks for looking into the testing failure.

I was discussing with Cascardo today, and he was able to spot quickly the problem.
First, the reason the "makedumpfile" test passed on ppc64el before was that the test was skipped; Cascardo changed "makedumpfile" to be part of the so-called "big packages"[0], so that the VMs used in the tests have more RAM. Now the test actually runs, and it fails if the kernel version is >= 4.20.

The failure happened because "makedumpfile" couldn't collect a compressed dump and fell back to 'cp', which made the 'if' check in the test fail. The root cause is that kernel patch 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory to 2PB"), introduced in v4.20, requires a counterpart in "makedumpfile", in the form of patch [1]. Without that, I was able to reproduce the problem locally:

$ makedumpfile -c -d 31 /proc/vmcore /var/crash/201908051539/dump-incomplete2
get_machdep_info_ppc64: Can't detect max_physmem_bits.
makedumpfile Failed.

When I've used kernel 4.18 in the same VM, I got:

$ makedumpfile -c -d 31 /proc/vmcore core.418
Copying data : [100.0 %] \ eta: 0s
The dumpfile is saved to core.418

The plan here according to Cascardo is to push makedumpfile 1.6.6 (already containing the fix for the "physmem_bits" issue as well as my fix for this LP, the retry/delay mechanism) to Eoan.
After that, in my understanding, we can move on with the SRU for Bionic/Disco.

Cheers,

Guilherme

[0] https://git.launchpad.net/~cascardo/autopkgtest-cloud/commit/?id=346b786925
[1] https://salsa.debian.org/debian/makedumpfile/commit/f349b51f

Eric Desrochers (slashd) wrote :

makedumpfile merge to "1:1.6.6-2ubuntu1" sponsored in Eoan.

I appended to the changelog the entry block[0] currently found in eoan-proposed, which was missing, to keep track of everything that has been done on the package. It was missing because the merge was prepared by cascardo before 1:1.6.5-1ubuntu3 existed.

Note:
- I didn't want this to block the upload, for several reasons, but cascardo/gpiccoli, it would be great if you could look at this lintian report[2] before the feature freeze[1]. It's good to modernize the code, but also the Debian packaging, especially when time permits, as it does now (devel release).

[0]
makedumpfile (1:1.6.5-1ubuntu3) eoan; urgency=medium

  * debian/kdump-config.in:
    - Add kdump retry/delay mechanism when dumping over network.
      (LP: #1681909)

 -- <email address hidden> (Guilherme G. Piccoli) Thu, 04 Jul 2019 15:20:53 -0300

[1] - https://wiki.ubuntu.com/EoanErmine/ReleaseSchedule

[2] - https://pastebin.canonical.com/p/dWYkNhwjCb/

- Eric

Eric Desrochers (slashd) on 2019-08-16
tags: added: sts-sponsor-slashd

After the package in eoan-proposed was updated with the fix for ppc64 (thanks Eric!), I manually tested that version and it's working fine, being able to collect the kernel crash dump.
The version I've tested is:

$ rmadison makedumpfile | grep eoan-proposed
 makedumpfile | 1:1.6.6-2ubuntu1 | eoan-proposed | source, amd64, arm64, armhf, i386, ppc64el, s390x

Even with my positive test results, the autopkgtest is broken and keeps failing for ppc64, according to: http://autopkgtest.ubuntu.com/packages/m/makedumpfile/eoan/ppc64el

According to the above test, we can see 2 important things:
a) The last time it succeeded was : "makedumpfile/1:1.6.5-1ubuntu2 2019-06-20".
And we can see in fact the ppc64 test was Skipped, that's the reason it succeeded.

b) Even the current released version for eoan, 1:1.6.5-1ubuntu2, is failing according to the autopkgtest that ran on: "makedumpfile/1:1.6.5-1ubuntu2 2019-07-25". This test was triggered due to makedumpfile being a reverse dependency of "file".

Also, I've tried to replicate the test on a local powerpc64 server, using autopkgtest. My command line was:

https://pastebin.ubuntu.com/p/y4KyRvJVmz/

I've toggled 2 parameters, giving 4 test results:

1) With "--apt-pocket=proposed=src:makedumpfile" (to test -proposed version) and "--nova-reboot":
http://paste.ubuntu.com/p/Z8dr6ssF2J/

2) With "--apt-pocket=proposed=src:makedumpfile" (to test -proposed version) only
http://paste.ubuntu.com/p/qxDRy5xPSZ/

3) Without both parameters above (testing the released version)
http://paste.ubuntu.com/p/cYdMrWHPsR/

4) Only with "--nova-reboot":
http://paste.ubuntu.com/p/xYC4KG9DZr/

In all cases the test failed with "Broken Pipe" late in the run. During the failure I could still SSH into the testbed, so something is clearly wrong with the test itself.

I'd like to request an exemption from this test, marking it as "badtest" on ppc64.
Cascardo, do you agree? I think we could release this package before the Eoan freeze: it is sane (based on manual tests made by a colleague and me), and we shouldn't block on a test that was always skipped.

We still plan to continue investigating the test failure, until we can fix it.
Thanks,

Guilherme

Merge proposal to britney/hints-ubuntu submitted, to mark this test as "badtest" for ppc64el: https://code.launchpad.net/~gpiccoli/britney/hints-ubuntu/+merge/371479

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.6-2ubuntu1

---------------
makedumpfile (1:1.6.6-2ubuntu1) eoan; urgency=medium

  [ Thadeu Lima de Souza Cascardo ]
  * Merge from Debian unstable. Remaining changes:
    - Bump amd64 crashkernel from 384M-:128M to 512M-:192M.
  * Add kdump retry/delay mechanism when dumping over network (LP: #1681909)
  * Allow proper reload of kdump after multiple hotplug events. (LP: #1828596)

  [ Connor Kuehl ]
  * Let the kernel decide the crashkernel offset for ppc64el (LP: #1741860)

makedumpfile (1:1.6.6-2) unstable; urgency=medium

  [ Guilherme G. Piccoli ]
  * Add kdump retry/delay mechanism when dumping over network

  [ Thadeu Lima de Souza Cascardo ]
  * Use a different service for vmcore dump.
  * Use maxcpus instead of nr_cpus on ppc64el.
  * Reload kdump when CPU is brought online.
  * Allow proper reload of kdump after multiple hotplug events.

makedumpfile (1:1.6.6-1) unstable; urgency=medium

  * Update to new upstream version 1.6.6.

 -- Thadeu Lima de Souza Cascardo <email address hidden> Tue, 06 Aug 2019 12:18:15 -0300

Changed in makedumpfile (Ubuntu Eoan):
status: Fix Committed → Fix Released
Andrew Cloke (andrew-cloke) wrote :

Next step is to consider backport feasibility for disco and bionic.

Eric Desrochers (slashd) wrote :

Hi Andrew Cloke,

Yes, I'm currently sponsoring D/B for cascardo/gpiccoli.

Disco is already uploaded waiting for SRU team approval:
https://launchpad.net/ubuntu/disco/+queue?queue_state=1&queue_text=makedumpfile

Bionic debdiff needs some rework before I do the final upload.

- Eric

Andrew Cloke (andrew-cloke) wrote :

Excellent news! Thanks Eric :-)

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in makedumpfile (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco
Andy Whitcroft (apw) wrote :

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1~18.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in makedumpfile (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Eric Desrochers (slashd) on 2019-08-29
tags: removed: sts-sponsor-slashd

All autopkgtests for the newly accepted makedumpfile (1:1.6.5-1ubuntu1.1) for disco have finished running.
The following regressions have been reported in tests triggered by the package:

makedumpfile/1:1.6.5-1ubuntu1.1 (s390x, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/disco/update_excuses.html#makedumpfile

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Tested with Bionic-proposed version (1:1.6.5-1ubuntu1~18.04.2) and Disco-proposed (1:1.6.5-1ubuntu1.1), both including the fix for this bug.
All working fine. I've used the "droppkt" hack (see comment #24) in the Bionic version, to simulate the problem, and it's fixed in the -proposed version.

Thanks,

Guilherme
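
For readers following the verification, the retry/delay behaviour being tested can be sketched roughly as below. This is only an illustration of the idea, not the code shipped in the package: the function name `retry_with_delay` and its arguments are hypothetical. The 3-second delay comes from the bug description; the attempt count in the usage note is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: names and defaults are illustrative and are NOT
# taken from the actual kdump scripts.
# Usage: retry_with_delay MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_with_delay() {
    max_attempts=$1
    delay=$2
    shift 2

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        # Sleep only if another attempt remains, so the final failure
        # is reported without an extra delay.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

A network dump step could then be wrapped as, for example, `retry_with_delay 5 3 ssh -o BatchMode=yes user@dumphost true`, where `user@dumphost` is a placeholder. A NIC that reports "Link Up" but cannot transmit yet then gets a few more seconds before the dump is abandoned.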

tags: added: verification-done verification-done-bionic verification-done-disco
removed: verification-needed verification-needed-bionic verification-needed-disco
Andrew Cloke (andrew-cloke) wrote :

Even though the bionic and disco verifications were successful (thanks for verifying), these patches were bundled in a single submission with other patches (from other bugs) which could not be successfully verified. As a result, all patches have had to be removed.

The next step is to re-upload a new version of makedumpfile.

Changed in ubuntu-power-systems:
status: Fix Committed → In Progress
Changed in makedumpfile (Ubuntu Bionic):
status: Fix Committed → In Progress
Changed in makedumpfile (Ubuntu Disco):
status: Fix Committed → In Progress

Just a quick update: the package was respun without the offending patch. Once it is sponsored, it will land in the -proposed pocket.

Thanks,

Guilherme

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in makedumpfile (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco
removed: verification-done verification-done-disco
Changed in makedumpfile (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
removed: verification-done-bionic
Andy Whitcroft (apw) wrote :

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.6.5-1ubuntu1~18.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

All autopkgtests for the newly accepted makedumpfile (1:1.6.5-1ubuntu1~18.04.3) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

makedumpfile/1:1.6.5-1ubuntu1~18.04.3 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#makedumpfile

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

All autopkgtests for the newly accepted makedumpfile (1:1.6.5-1ubuntu1.3) for disco have finished running.
The following regressions have been reported in tests triggered by the package:

makedumpfile/1:1.6.5-1ubuntu1.3 (s390x, ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/disco/update_excuses.html#makedumpfile

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Retested with Bionic-proposed version (1:1.6.5-1ubuntu1~18.04.3) and Disco-proposed (1:1.6.5-1ubuntu1.3), both including the fix for this bug.

All working fine. I've used the "droppkt" hack (see comment #24) to simulate the problem, and it's fixed in the -proposed version.

Thanks,

Guilherme

tags: added: verification-done verification-done-bionic verification-done-disco
removed: verification-needed verification-needed-bionic verification-needed-disco

There are some reported regressions in autopkgtests for ppc64el and s390x; I'm investigating.
Cheers,

Guilherme

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.5-1ubuntu1.3

---------------
makedumpfile (1:1.6.5-1ubuntu1.3) disco; urgency=medium

  [ Guilherme G. Piccoli ]
  * Add kdump retry/delay mechanism when dumping over network (LP: #1681909)

  [ Thadeu Lima de Souza Cascardo ]
  * Use maxcpus instead of nr_cpus on ppc64el. (LP: #1828597)
  * ppc64: increase MAX_PHYSMEM_BITS to 2PB (LP: #1841288)

  [ Connor Kuehl ]
  * Let the kernel decide the crashkernel offset for ppc64el (LP: #1741860)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Wed, 09 Oct 2019 15:33:57 -0300

Changed in makedumpfile (Ubuntu Disco):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for makedumpfile has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.6.5-1ubuntu1~18.04.3

---------------
makedumpfile (1:1.6.5-1ubuntu1~18.04.3) bionic; urgency=medium

  [ Guilherme G. Piccoli ]
  * Add kdump retry/delay mechanism when dumping over network (LP: #1681909)

  [ Thadeu Lima de Souza Cascardo ]
  * Use maxcpus instead of nr_cpus on ppc64el. (LP: #1828597)
  * ppc64: increase MAX_PHYSMEM_BITS to 2PB (LP: #1841288)

  [ Connor Kuehl ]
  * Let the kernel decide the crashkernel offset for ppc64el (LP: #1741860)

 -- Thadeu Lima de Souza Cascardo <email address hidden> Wed, 09 Oct 2019 15:38:08 -0300

Changed in makedumpfile (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: In Progress → Fix Released