Ubuntu 16.10: kdump over nfs did not generate complete vmcore

Bug #1641235 reported by bugproxy
This bug affects 1 person
Affects                 Status         Importance   Assigned to        Milestone
makedumpfile (Ubuntu)   Fix Released   Undecided    Taco Screen team
Trusty                  Confirmed      Medium       Louis Bouchard
Xenial                  Fix Released   Medium       Louis Bouchard
Yakkety                 Confirmed      Medium       Louis Bouchard

Bug Description

== Comment: #0 - HARSHA THYAGARAJA - 2016-11-03 08:05:59 ==
---Problem Description---
kdump over nfs did not generate complete vmcore

---uname output---
Linux ltciofvtr-firestone1 4.8.0-26-generic #28-Ubuntu SMP Tue Oct 18 14:41:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = PowerNV (Baremetal) - Firestone

---Steps to Reproduce---
1. Set up NFS
2. Trigger crash: echo c > /proc/sysrq-trigger
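
The reproduction steps above can be sketched as a small shell helper. This is a hedged sketch, not the reporter's actual procedure: the function name `trigger_kdump_test` and the `CONFIRM` variable are hypothetical, and it assumes kdump-tools is already configured with an NFS target. By default it only prints the commands, because the real trigger crashes the machine immediately.

```shell
# Hypothetical helper: prints the crash-trigger steps, and only runs them
# when CONFIRM=yes is set, because the final command panics the kernel.
trigger_kdump_test() {
    if [ "${CONFIRM:-no}" = "yes" ]; then
        # Enable all magic SysRq functions, then force a crash.
        echo 1 > /proc/sys/kernel/sysrq
        echo c > /proc/sysrq-trigger
    else
        echo "would run: echo 1 > /proc/sys/kernel/sysrq"
        echo "would run: echo c > /proc/sysrq-trigger"
    fi
}
```

Calling `trigger_kdump_test` with no environment set is a dry run; `CONFIRM=yes` must be set as root on a machine that is safe to crash.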

== Comment: #6 - Kevin W. Rudd - 2016-11-04 16:30:49 ==

Hi Harsha.

It looks like the base kdump NFS functionality works just fine. The known issue with makedumpfile is causing it to drop back to using "cp" to transfer the entire, non-compressed /proc/vmcore image. That's a rather large amount of data to send over to the remote server, and it appears to be sending back an I/O error after the first 122G.

Further debug would need to be done to determine if this is a client-side or server-side issue. I recommend first bringing your remote NFS server up to the current release as it is currently a bit down-rev.

== Comment: #8 - HARSHA THYAGARAJA - 2016-11-10 02:02:31 ==

Hi Kevin,
I updated my peer to Ubuntu 16.10 and still see the same behavior.
A snippet of the problem at hand is pasted below.

[ 20.610748] kdump-tools[4559]: Starting kdump-tools: * Mounting NFS mountpoint 150.1.1.20:/home/tools ...
[ 53.400516] kdump-tools[4559]: * Dumping to NFS mountpoint 150.1.1.20:/home/tools/201611100158
[ 53.409242] kdump-tools[4559]: * running makedumpfile -c -d 31 /proc/vmcore /mnt/var/crash/9.47.84.18-201611100158/dump-incomplete
[ 53.526593] kdump-tools[4559]: get_mem_map: Can't distinguish the memory type.
[ 53.527154] kdump-tools[4559]: The kernel version is not supported.
[ 53.527488] kdump-tools[4559]: The makedumpfile operation may be incomplete.
[ 53.527813] kdump-tools[4559]: makedumpfile Failed.
[ 53.528117] kdump-tools[4559]: * kdump-tools: makedumpfile failed, falling back to 'cp'
[ 90.754092] kdump-tools[4559]: cp: error writing '/mnt/var/crash/9.47.84.18-201611100158/vmcore-incomplete': Input/output error
[ 90.754857] kdump-tools[4559]: * kdump-tools: failed to save vmcore in /mnt/var/crash/9.47.84.18-201611100158
[ 90.756155] kdump-tools[4559]: * running makedumpfile --dump-dmesg /proc/vmcore /mnt/var/crash/9.47.84.18-201611100158/dmesg.201611100158
[ 90.758731] kdump-tools[4559]: get_mem_map: Can't distinguish the memory type.
[ 90.759089] kdump-tools[4559]: The kernel version is not supported.
[ 90.759436] kdump-tools[4559]: The makedumpfile operation may be incomplete.
[ 90.759780] kdump-tools[4559]: makedumpfile Failed.
[ 90.760094] kdump-tools[4559]: * kdump-tools: makedumpfile --dump-dmesg failed. dmesg content will be unavailable
[ 90.760668] kdump-tools[4559]: * kdump-tools: failed to save dmesg content in /mnt/var/crash/9.47.84.18-201611100158
[ 90.846117] kdump-tools[4559]: Thu, 10 Nov 2016 01:59:56 -0500
[ 90.886629] kdump-tools[4559]: Failed to read reboot parameter file: No such file or directory
[ 90.887070] kdump-tools[4559]: Rebooting.
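
A quick way to spot this failure mode in a saved console log is to look for the fallback message. The sketch below is illustrative only: the function name `kdump_used_cp_fallback` and the log file name in the usage note are hypothetical, while the matched string is taken from the log excerpt above.

```shell
# Returns 0 if the log on stdin shows makedumpfile failing and kdump-tools
# falling back to 'cp'; returns 1 otherwise.
kdump_used_cp_fallback() {
    grep -q "makedumpfile failed, falling back to 'cp'"
}
```

Usage, assuming the console output was saved to a file named console.log: `kdump_used_cp_fallback < console.log && echo "fell back to cp"`.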

== Comment: #13 - Kevin W. Rudd - 2016-11-11 17:12:33 ==

I was able to replicate this with debugging at both the kdump client and remote NFS server. The server was perfectly happy with the data coming at it, and appeared to be processing a COMMIT request from the client when the client shut down the connection.

Looking at the client-side logs after a failure showed that it was logging "server ... not responding" messages, and bailed on the connection within the span of just a few seconds.

This appears to be due to a very over-aggressive timeout being specified in /usr/sbin/kdump-config:

mount -t nfs -o nolock -o tcp -o soft -o timeo=5 -o retrans=5 $NFS $KDUMP_COREDIR

The timeo value is in deciseconds (tenths of a second), and "5" is far too aggressive for this type of connection. From my observations, the COMMIT was not issued until about 60G was transferred, and most remote servers will take a lot longer than 5 tenths of a second to flush that amount of data and respond to the COMMIT.

I'm not sure what problem specifying this timeo value was supposed to address, but it would be better to leave the timeo value at its default for a tcp connection (let the TCP protocol handle any communication timeouts on its own). When I modified kdump-config to use the default timeo of 600, the kdump process transferred the entire vmcore without error.
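
To make the units concrete, the arithmetic can be sketched in shell. The function names `ds_to_seconds` and `soft_mount_budget_ds` are hypothetical. The second function is only a rough estimate: it assumes the linear backoff described in nfs(5) for TCP mounts (the wait grows by timeo after each retransmission) and counts the first transmission plus `retrans` retransmissions; it does not model the per-timeout cap, and the exact accounting may differ between kernel versions.

```shell
# Convert an NFS timeo value (deciseconds) to seconds with one decimal place.
ds_to_seconds() {
    printf '%d.%d\n' "$(( $1 / 10 ))" "$(( $1 % 10 ))"
}

# Rough estimate of the total wait, in deciseconds, before a soft mount
# gives up. Assumption: waits of timeo, 2*timeo, ... for the first
# transmission plus `retrans` retransmissions (linear backoff).
soft_mount_budget_ds() {
    timeo=$1
    retrans=$2
    total=0
    attempt=1
    while [ "$attempt" -le $(( retrans + 1 )) ]; do
        total=$(( total + attempt * timeo ))
        attempt=$(( attempt + 1 ))
    done
    echo "$total"
}

echo "timeo=5   -> first timeout after $(ds_to_seconds 5) s"
echo "timeo=600 -> first timeout after $(ds_to_seconds 600) s"
echo "timeo=5, retrans=5 -> roughly $(ds_to_seconds "$(soft_mount_budget_ds 5 5)") s in total"
```

Under these assumptions the hardcoded options leave the server only a matter of seconds to answer a COMMIT, whereas the default timeo alone allows a full minute before the first retransmission.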

bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-148148 severity-high targetmilestone-inin1610
bugproxy (bugproxy) wrote : Latest crashdump console logs

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → makedumpfile (Ubuntu)
bugproxy (bugproxy) wrote : console logs of kdump trigger

Default Comment by Bridge

tags: added: severity-critical
removed: severity-high
bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-11-22 10:23 EDT-------
Hello Canonical.

This bug appears to have fallen between the screening cracks.

Louis Bouchard (louis) wrote :

Hello,
I do agree that the values used for NFS are quite low.

The best approach for this would be to add parameters to /etc/default/kdump and set them to the default values which would be :

NFS_TIMEO = 600
NFS_RETRANS = 3

Would that be acceptable to you ?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-12-09 09:52 EDT-------
(In reply to comment #24)
> Hello,
> I do agree that the values used for NFS are quite low.
>
> The best approach for this would be to add parameters to /etc/default/kdump
> and set them to the default values which would be :
>
> NFS_TIMEO = 600
> NFS_RETRANS = 3
>
> Would that be acceptable to you ?

That would be great. It has the nice benefit of being easily tunable in a persistent way.

Thanks Louis

Louis Bouchard (louis) wrote :

Hello,

I have just uploaded a fix for this to Debian/Unstable. It should sync to Zesty by tomorrow. Then I will proceed to SRU the fix to the stable releases.

Louis Bouchard (louis)
Changed in makedumpfile (Ubuntu Trusty):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Xenial):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Yakkety):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Trusty):
assignee: nobody → Louis Bouchard (louis-bouchard)
Changed in makedumpfile (Ubuntu Xenial):
assignee: nobody → Louis Bouchard (louis-bouchard)
Changed in makedumpfile (Ubuntu Yakkety):
assignee: nobody → Louis Bouchard (louis-bouchard)
Changed in makedumpfile (Ubuntu Trusty):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu Xenial):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu Yakkety):
importance: Undecided → Medium
Changed in makedumpfile (Ubuntu):
status: New → Fix Released
bugproxy (bugproxy)
tags: removed: bugnameltc-148148 severity-high
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-01-11 10:45 EDT-------
Thanks for the update Louis

tags: added: bugnameltc-148148 severity-high
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-01-12 16:29 EDT-------
FYI: I just pulled the latest kdump-tools (1:1.6.1-1) for Zesty. The changes look good. Thanks again.

bugproxy (bugproxy) wrote : kdump on 17.04

------- Comment (attachment only) From <email address hidden> 2017-01-17 05:19 EDT-------

tags: removed: bugnameltc-148148 severity-high
bugproxy (bugproxy)
tags: added: bugnameltc-148148 severity-high
bugproxy (bugproxy)
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin1610
Andy Whitcroft (apw) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted makedumpfile into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/makedumpfile/1:1.5.9-5ubuntu0.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in makedumpfile (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-02-15 18:17 EDT-------
Verified version 1:1.5.9-5ubuntu0.4. NFS vmcore saved with no issues and with new settings.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package makedumpfile - 1:1.5.9-5ubuntu0.4

---------------
makedumpfile (1:1.5.9-5ubuntu0.4) xenial; urgency=medium

  * d/p/0006-PATCH-Support-newer-kernels.patch :
    Support kernel versions up to 4.8 (LP: #1557751)
  * Turn hardcoded timeo and retrans NFS options into parameters that
    can be modified in /etc/default/kdump-tools. Also use the NFS defaults
    (timeo=600, retrans=3) for these parameters. Make those values visible
    in the 'show' command if NFS is configured (LP: #1641235)
  * Complete support for kernel versions 4.8 and later :
    d/p/0007-PATCH-Looking-for-page.compound_order-compound_dtor-.patch,
    d/p/0008-PATCH-Skip-examining-compound-tail-pages.patch (LP: #1655625)

 -- Louis Bouchard <email address hidden> Wed, 11 Jan 2017 11:33:42 +0100
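
As a rough illustration of the change described in this changelog entry, the mount options can be assembled from tunable variables with the NFS defaults as fallbacks. This is a sketch, not the actual kdump-config code: the function name `build_nfs_mount_cmd` is hypothetical, while `NFS_TIMEO`, `NFS_RETRANS`, and the fixed options are the names and values quoted earlier in this report.

```shell
# Hypothetical helper: print the mount command kdump would run, taking
# timeo/retrans from the environment and defaulting to timeo=600, retrans=3.
build_nfs_mount_cmd() {
    nfs_target=$1   # e.g. the value of $NFS (server:/export)
    coredir=$2      # e.g. the value of $KDUMP_COREDIR
    echo "mount -t nfs -o nolock -o tcp -o soft" \
         "-o timeo=${NFS_TIMEO:-600} -o retrans=${NFS_RETRANS:-3}" \
         "$nfs_target $coredir"
}
```

With neither variable set, `build_nfs_mount_cmd 150.1.1.20:/home/tools /mnt` prints a command using timeo=600 and retrans=3; setting `NFS_TIMEO` or `NFS_RETRANS` beforehand overrides the corresponding option.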

Changed in makedumpfile (Ubuntu Xenial):
status: Fix Committed → Fix Released
Andy Whitcroft (apw) wrote : Update Released

The verification of the Stable Release Update for makedumpfile has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
