Kdump over network(nfs) does not work

Bug #1423483 reported by bugproxy
This bug affects 1 person
Affects: makedumpfile (Ubuntu)
Status: Invalid
Importance: Low
Assigned to: Unassigned

Bug Description

Problem Description
==========================
Kdump over network(nfs) does not work

---uname output---
3.18.0-13-generic

Machine Type = POWER8

System Hang
=====================
The dump process takes an extremely long time and never finishes saving the dump. I waited almost 3 hours, but the dump did not complete.

Steps to Reproduce
===========================
1) Configure kdump over nfs
    Add the following line to /etc/default/kdump-tools

    NFS="9.3.189.84:/nfsshare"

2) Load kdump

root@lop824:~# kdump-config load
Modified cmdline:BOOT_IMAGE=/boot/vmlinux-3.18.0-13-generic root=UUID=234c5426-796e-4f54-bd77-7b0fe10e0407 ro splash irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service elfcorehdr=155072K
segment[0].mem:0x8000000 memsz:24510464
segment[1].mem:0x9760000 memsz:65536
segment[2].mem:0x9770000 memsz:65536
segment[3].mem:0x9780000 memsz:65536
segment[4].mem:0x9790000 memsz:21954560
segment[5].mem:0xec70000 memsz:196608
 * loaded kdump kernel

3) Trigger a dump. The kdump kernel boots and starts copying the dump, but hangs midway.

root@lop824:~# ls -lh /nfsmount/9.114.13.128-201502170326/
total 1.3M
-rw------- 1 nobody nogroup 27M Feb 17 03:27 dump-incomplete
root@lop824:~#

root@lop824:~# kdump-config show
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
NFS: 9.3.189.84:/nfsshare
HOSTTAG: ip
current state: ready to kdump

kexec command:
  /sbin/kexec -p --args-linux --command-line="BOOT_IMAGE=/boot/vmlinux-3.18.0-13-generic root=UUID=234c5426-796e-4f54-bd77-7b0fe10e0407 ro splash irqpoll maxcpus=1 nousb systemd.unit=kdump-tools.service" --initrd=/boot/initrd.img-3.18.0-13-generic /boot/vmlinux-3.18.0-13-generic
root@lop824:~#
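
The NFS= value configured in step 1 has the form host:/export. As a minimal sketch, assuming the defaults file contains only plain shell variable assignments, the setting can be split into its two parts so that each can be checked separately from the running system. The function name nfs_target is hypothetical and is not part of kdump-tools.

```shell
# Split an NFS="host:/export" setting into its host and export path.
# Assumption: the file named by $1 holds plain shell variable assignments,
# so sourcing it in a subshell is safe. nfs_target is a hypothetical helper.
nfs_target() {
    (
        NFS=
        . "$1" || exit 1
        # Fail if the file does not set NFS at all.
        [ -n "$NFS" ] || exit 1
        # Everything before the first ":" is the host, the rest is the export.
        printf 'host=%s export=%s\n' "${NFS%%:*}" "${NFS#*:}"
    )
}
```

For the configuration in this report, nfs_target /etc/default/kdump-tools would print the server address 9.3.189.84 and the export /nfsshare.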

== Comment: #3 - SACHIN P. SANT <email address hidden> - 2015-02-17 07:17:14 ==
The following messages are seen while saving a dump:

[ 31.059522] NFS: Registering the id_resolver key type
[ 31.059542] Key type id_resolver registered
[ 31.059544] Key type id_legacy registered
[ 36.021996] nfs: server 9.3.189.84 not responding, timed out
[ 36.022026] nfs: server 9.3.189.84 not responding, timed out
[ 36.022049] nfs: server 9.3.189.84 not responding, timed out
[ 40.530000] nfs: server 9.3.189.84 not responding, timed out
[ 40.530033] nfs: server 9.3.189.84 not responding, timed out
[ 45.037994] nfs: server 9.3.189.84 not responding, timed out
[ 45.038020] nfs: server 9.3.189.84 not responding, timed out
[ 48.550133] nfs: server 9.3.189.84 not responding, timed out
[ 48.550161] nfs: server 9.3.189.84 not responding, timed out
[ 51.557995] nfs: server 9.3.189.84 not responding, timed out
[ 51.558021] nfs: server 9.3.189.84 not responding, timed out
[ 55.617018] nfs: server 9.3.189.84 not responding, timed out
[ 55.617050] nfs: server 9.3.189.84 not responding, timed out
[ 58.621419] nfs: server 9.3.189.84 not responding, timed out
[ 58.621447] nfs: server 9.3.189.84 not responding, timed out
[ 58.621470] nfs: server 9.3.189.84 not responding, timed out
[ 61.413753] BUG: arch topology borken
[ 61.413757] the DIE domain not a subset of the NUMA domain
[ 61.413760] BUG: arch topology borken
[ 61.413762] the DIE domain not a subset of the NUMA domain
[ 61.413765] BUG: arch topology borken
[ 61.413766] the DIE domain not a subset of the NUMA domain
[ 61.413769] BUG: arch topology borken
[ 61.413770] the DIE domain not a subset of the NUMA domain
[ 61.413773] BUG: arch topology borken
[ 61.413774] the DIE domain not a subset of the NUMA domain
[ 61.413777] BUG: arch topology borken
[ 61.413778] the DIE domain not a subset of the NUMA domain
[ 61.413781] BUG: arch topology borken
[ 61.413782] the DIE domain not a subset of the NUMA domain
[ 61.413785] BUG: arch topology borken
[ 61.413786] the DIE domain not a subset of the NUMA domain
[ 61.625436] nfs: server 9.3.189.84 not responding, timed out
[ 66.133424] nfs: server 9.3.189.84 not responding, timed out
[ 66.133453] nfs: server 9.3.189.84 not responding, timed out
[ 70.641436] nfs: server 9.3.189.84 not responding, timed out
[ 70.641465] nfs: server 9.3.189.84 not responding, timed out
[ 74.149421] nfs: server 9.3.189.84 not responding, timed out
[ 74.149452] nfs: server 9.3.189.84 not responding, timed out
[ 78.209471] nfs: server 9.3.189.84 not responding, timed out
[ 78.209498] nfs: server 9.3.189.84 not responding, timed out
[ 81.629433] nfs: server 9.3.189.84 not responding, timed out
[ 81.629442] nfs: server 9.3.189.84 not responding, timed out
[ 84.633433] nfs: server 9.3.189.84 not responding, timed out
[ 87.637419] nfs: server 9.3.189.84 not responding, timed out
[ 90.649450] nfs: server 9.3.189.84 not responding, timed out
[ 93.653426] nfs: server 9.3.189.84 not responding, timed out
[ 95.005433] nfs: server 9.3.189.84 not responding, timed out
[ 96.653426] nfs: server 9.3.189.84 not responding, timed out
[ 98.009437] nfs: server 9.3.189.84 not responding, timed out
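
To judge how persistent the timeouts are, messages like the ones above can be tallied per server from a saved console log. This is a minimal sketch that assumes the log uses exactly the message format shown here; count_nfs_timeouts is a hypothetical helper name.

```shell
# Count "nfs: server <host> not responding, timed out" messages per server.
# Reads kernel log text on stdin and prints one "<host> <count>" line per server.
# count_nfs_timeouts is a hypothetical helper, not an existing tool.
count_nfs_timeouts() {
    awk '
        /nfs: server .* not responding, timed out/ {
            # The host is the field that follows the word "server".
            for (i = 1; i < NF; i++) {
                if ($i == "server") { count[$(i + 1)]++; break }
            }
        }
        END { for (host in count) print host, count[host] }
    '
}
```

A typical use would be dmesg | count_nfs_timeouts in the kdump kernel, or feeding it a captured console log.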

I can mount the NFS share manually (while the dump is in progress):

root@lop824:~# mount -t nfs 9.3.189.84:/nfsshare /nfsmount/
root@lop824:~# mount
/dev/sda2 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,nodev,noexec,nosuid)
sysfs on /sys type sysfs (rw,nodev,noexec,nosuid)
none on /sys/fs/cgroup type tmpfs (rw,uid=0,gid=0,mode=0755,size=1024)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,nodev,noexec,nosuid,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,nodev,noexec,nosuid,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
cgmfs on /run/cgmanager/fs type tmpfs (rw,relatime,size=128k,mode=755)
rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
9.3.189.84:/nfsshare on /nfsmount type nfs (rw,vers=4,addr=9.3.189.84,clientaddr=9.114.13.128)
root@lop824:~# ls
root@lop824:~# ls /nfsmount/
9.114.13.128-201502170326 test
root@lop824:~# ls /nfsmount/9.114.13.128-201502170326/
dump-incomplete
root@lop824:~#

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-121813 severity-critical targetmilestone-inin---
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1423483/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → makedumpfile (Ubuntu)
Louis Bouchard (louis)
Changed in makedumpfile (Ubuntu):
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Louis Bouchard (louis-bouchard)
Louis Bouchard (louis) wrote :

Hello,

First of all, the problem at hand is not that the mechanism doesn't work; it is that the NFS file transfer takes too long. From what I see, the NFS mechanism has worked at least partly.

The NFS share was correctly mounted and the coredump transfer was initiated. For some reason, the NFS service started to time out, but kdump-tools doesn't have much to do with that.

One thing did get my attention. The mount command that you issued returns the following (edited for clarity):

# mount
/dev/sda2 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,nodev,noexec,nosuid)
...
9.3.189.84:/nfsshare on /nfsmount type nfs (rw,vers=4,addr=9.3.189.84,clientaddr=9.114.13.128)

The NFS mount on /var/crash does not appear, which is definitely a problem, as this mount is done at a very early stage of the process. Yet it was mounted at the beginning, since there is a dump-incomplete file on the remote NFS server.

I don't have any context on the size of the file to be transferred. It may have brought the kexec-booted kernel to memory exhaustion, but there is no sign of the OOM messages that would be expected in that situation.

Right now, with the data at hand, I cannot point to anything other than a lack of availability of the NFS server as the cause of the failure.
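
The check discussed above, whether an NFS filesystem is currently mounted on /var/crash, can be sketched as a small filter over mount output. It assumes the "SOURCE on MOUNTPOINT type FSTYPE (OPTIONS)" line format shown in this report; has_nfs_mount is a hypothetical helper name.

```shell
# Succeed if mount-style output on stdin shows an NFS filesystem on "$1".
# Assumes lines of the form: SOURCE on MOUNTPOINT type FSTYPE (OPTIONS)
# has_nfs_mount is a hypothetical helper, not part of kdump-tools.
has_nfs_mount() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp && $4 == "type" && ($5 == "nfs" || $5 == "nfs4") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

In the kdump kernel this could be run as mount | has_nfs_mount /var/crash. With the mount output quoted in this report it would fail for /var/crash and succeed for /nfsmount.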

tags: added: cts
Louis Bouchard (louis)
Changed in makedumpfile (Ubuntu):
status: Triaged → Invalid
assignee: Louis Bouchard (louis-bouchard) → nobody