Oops in sunrpc:rpc_shutdown_client

Bug #253004 reported by Daniel J Blueman on 2008-07-29
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Debian)
Fix Released
Unknown
linux (Ubuntu)
Medium
Unassigned
Hardy
High
Manoj Iyer

Bug Description

SRU justification:

Impact: Oops in sunrpc:rpc_shutdown_client

Fix: likely related to: #212485 kernel bug rpc nfs client. Backported patch to Hardy.

Test: Test kernel in http://people.ubuntu.com/~manjo/lp253004-hardy/ was tested by community and reported to work.

---

Having setup an nfsv4 export in /etc/exports:

/store 192.168.20.0/24(rw,async,no_root_squash,no_subtree_check,fsid=0)

I restarted the nfs-kernel-server service:

# /etc/init.d/nfs-kernel-server stop
# /etc/init.d/nfs-kernel-server start

After ~15s, we see the callback being attempted twice and an oops:

[ 2361.807753] nfs4_cb: server 194.202.174.13 not responding, timed out
[ 2361.808059] nfs4_cb: server 194.202.174.13 not responding, timed out
[ 2361.808085] Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
[ 2361.808098] [<ffffffff8830d285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 2361.808140] PGD 7dfb7067 PUD 7dfae067 PMD 0
[ 2361.808155] Oops: 0000 [1] SMP
[ 2361.808166] CPU 1
[ 2361.808174] Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 ac sbs sbshc battery dock video output iptable_filter ip_tables x_tables xfs parport_pc lp parport af_packet loop container serio_raw cfi_cmdset_0002 cfi_util jedec_probe button cfi_probe gen_probe ck804xrom mtd chipreg shpchp pci_hotplug k8temp i2c_nforce2 map_funcs psmouse i2c_core evdev pcspkr ext3 jbd mbcache sg sd_mod pata_acpi sata_nv tg3 ata_generic pata_amd libata scsi_mod ehci_hcd ohci_hcd usbcore thermal processor fan fbcon tileblit font bitblit softcursor fuse
[ 2361.808369] Pid: 5877, comm: nfs4_cb_probe Not tainted 2.6.24-19-generic #1
[ 2361.808380] RIP: 0010:[<ffffffff8830d285>] [<ffffffff8830d285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 2361.808404] RSP: 0018:ffff81007dff3ea0 EFLAGS: 00010246
[ 2361.808412] RAX: 00000000fffffffb RBX: ffff81007b9dac00 RCX: ffff810001019480
[ 2361.808424] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 2361.808435] RBP: 0000000000000018 R08: ffffffff8833a490 R09: ffff81007b9c2000
[ 2361.808446] R10: 0000000000000000 R11: ffffffff80287840 R12: 0000000000000000
[ 2361.808471] R13: 0000000000000000 R14: ffff81007dff3eb8 R15: 0000000000000000
[ 2361.808496] FS: 00007fcf7f91a6e0(0000) GS:ffff81007dc01700(0000) knlGS:0000000000000000
[ 2361.808533] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 2361.808556] CR2: 0000000000000018 CR3: 000000007dfb6000 CR4: 00000000000006e0
[ 2361.808580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2361.808605] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2361.808630] Process nfs4_cb_probe (pid: 5877, threadinfo ffff81007dff2000, task ffff81007df68fc0)
[ 2361.808668] Stack: 0000000000000000 0000000000000000 0000000000000282 0000000000000282
[ 2361.808713] ffff81007d5bd180 ffffffff88313459 ffff81007b9dac00 ffff81007b9dac00
[ 2361.808755] ffffffff883888f0 0000000000000000 0000000000000000 ffffffff8838895c
[ 2361.808784] Call Trace:
[ 2361.808829] [<ffffffff88313459>] :sunrpc:rpc_put_task+0x99/0xc0
[ 2361.808871] [<ffffffff883888f0>] :nfsd:do_probe_callback+0x0/0x80
[ 2361.808901] [<ffffffff8838895c>] :nfsd:do_probe_callback+0x6c/0x80
[ 2361.808931] [<ffffffff8025363b>] kthread+0x4b/0x80
[ 2361.808957] [<ffffffff8020d198>] child_rip+0xa/0x12
[ 2361.808984] [<ffffffff802535f0>] kthread+0x0/0x80
[ 2361.809006] [<ffffffff8020d18e>] child_rip+0x0/0x12
[ 2361.809028]
[ 2361.809044]
[ 2361.809044] Code: 49 39 6c 24 18 0f 84 84 00 00 00 4c 89 e7 e8 88 67 00 00 49
[ 2361.809132] RIP [<ffffffff8830d285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 2361.809165] RSP <ffff81007dff3ea0>
[ 2361.809185] CR2: 0000000000000018
[ 2361.809493] ---[ end trace 949da475918d45ed ]---
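For anyone comparing their own trace against this one, the two values that matter are the fault address and the faulting symbol. Below is a minimal sketch that pulls both out of oops text; `summarize_oops` and the regular expressions are hypothetical helpers written against the 2.6.24-era x86-64 line format quoted above and may not match other kernels. A fault address as small as 0x18 is consistent with reading a member 0x18 bytes into a structure whose pointer was NULL, but that is an interpretation, not something the log states.

```python
import re

# Patterns written against the oops lines quoted in this report (assumed format).
FAULT_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
RIP_RE = re.compile(
    r"RIP\s+\[<([0-9a-f]+)>\]\s+:(\w+):(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def summarize_oops(text):
    """Return fault address and faulting symbol, or None if not found."""
    fault = FAULT_RE.search(text)
    rip = RIP_RE.search(text)
    if not fault or not rip:
        return None
    return {
        "fault_addr": int(fault.group(1), 16),  # address that was dereferenced
        "module": rip.group(2),                 # kernel module containing RIP
        "function": rip.group(3),               # function containing RIP
        "offset": int(rip.group(4), 16),        # byte offset into the function
        "size": int(rip.group(5), 16),          # total size of the function
    }


# Three lines copied from the trace above.
SAMPLE = """\
[ 2361.808085] Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
[ 2361.808098] [<ffffffff8830d285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 2361.809132] RIP [<ffffffff8830d285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
"""

info = summarize_oops(SAMPLE)
print(info)
```

If the module, function, and fault address match (sunrpc, rpc_shutdown_client, 0x18), the trace is probably the same crash as the one reported here.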

likely related to: #212485 kernel bug rpc nfs client.

Fixed apparently [http://linux-nfs.org/pipermail/nfsv4/2008-April/008441.html] by:

commit 63c86716ea34ad94d52e5b0abbda152574dc42b5 "nfsd: move callback rpc_client creation into separate thread":
commit 46f8a64bae11f5c9b15b4401f6e9863281999b66 "nfsd4: probe callback channel only once":

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=63c86716ea34ad94d52e5b0abbda152574dc42b5
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=46f8a64bae11f5c9b15b4401f6e9863281999b66

Chris Coulson (chrisccoulson) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. This bug did not have a package associated with it, which is important for ensuring that it gets looked at by the proper developers. You can learn more about finding the right package at https://wiki.ubuntu.com/Bugs/FindRightPackage . I have classified this bug as a bug in linux.
For future reference you might be interested to know that a lot of applications have bug reporting functionality built in to them. This can be accessed via the Report a Problem option in the Help menu for the application with which you are having an issue. You can learn more about this feature at https://wiki.ubuntu.com/ReportingBugs.

Chris Coulson (chrisccoulson) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Unfortunately we can't fix it, because your description does not yet have enough information.

Please include the following additional information, if you have not already done so (pay attention to lspci's additional options), as required by the Ubuntu Kernel Team:
1. Please include the output of the command "uname -a" in your next response. It should be one, long line of text which includes the exact kernel version you're running, as well as the CPU architecture.
2. Please run the command "dmesg > dmesg.log" after a fresh boot and attach the resulting file "dmesg.log" to this bug report.
3. Please run the command "sudo lspci -vvnn > lspci-vvnn.log" and attach the resulting file "lspci-vvnn.log" to this bug report.
4. Please also attach your /var/log/kern.log and /var/log/kern.log.0 files using the "Attachment:" box below.

For your reference, the full description of procedures for kernel-related bug reports is available at https://wiki.ubuntu.com/KernelTeamBugPolicies Thanks in advance!

Changed in linux:
assignee: nobody → chrisccoulson
status: New → Incomplete

Hardware is a stock HP DL145 G2 with current BIOS and good ECC memory, x86-64 Opteron. Installed from the minimal netboot image as of today, plus a few other packages, acting as an NFS server.

$ uname -a
Linux labfs 2.6.24-19-generic #1 SMP Fri Jul 11 21:01:46 UTC 2008 x86_64 GNU/Linux

Chris Coulson (chrisccoulson) wrote :

Confirming, and it's also reported in Debian.

Changed in linux:
assignee: chrisccoulson → ubuntu-kernel-team
status: Incomplete → Confirmed
Changed in linux:
status: Unknown → New
yknot (dennisthompso) wrote :

The oops has caused my NFS server to lock up. I have to do a hard reboot to recover. Attached are various logs and an excerpt from /var/log/messages with a kernel trace.

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Ralph Janke (txwikinger) wrote :

The Intrepid Ibex 8.10 Beta release was most recently announced - http://www.ubuntu.com/testing/intrepid/beta . It contains the 2.6.27 Ubuntu kernel. It would be great if you could test and verify if this is still an issue. The status is being set to Incomplete until we receive further feedback. Thanks.

Changed in linux:
status: Confirmed → Incomplete
Kain (kain-kain) wrote :

Will this bug fix be backported into the hardy kernel, or is this going to be a forced upgrade to intrepid for anyone running nfsv4 servers?

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Magnus Hjorth (magnus-hjorth) wrote :

I have also gotten oopses at the same location when using the 8.04 LTS server kernel on an NFS4 server, and
changing to the 8.10 server kernel fixed the issue.

Is there a point in supplying more information, i.e. are you actually interested in fixing this bug in the server LTS?

Changed in linux:
status: Incomplete → Confirmed
Kain (kain-kain) wrote :

I was *quite* interested in seeing a fix for this bug in the server LTS oh... 4 months ago? I assume the people running NFS servers would have been interested in a bug fix back in last July, especially considering that the original report even went to the trouble of bisecting and finding the commits that fixed said oops...

At this point I just run vanilla kernels now, which was actually the biggest reason I moved to Ubuntu. I thought I wouldn't have to worry about that for the most part in servers anymore. Oops.

Many organisations need to run the LTS releases, to be able to deploy a server/login/dev box and get good use from it for >24 months, rather than track the upstream releases.

Thus, we do need to add this patch to the 8.04 LTS kernel, or let Ubuntu's reputation be dented; since we want Ubuntu to penetrate the server space, this kind of thing is key.

Chris Coulson (chrisccoulson) wrote :

Can someone test this with a later release (Intrepid or Jaunty) and see if it is fixed?

Thanks

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Magnus Hjorth (magnus-hjorth) wrote :

As I wrote in an earlier comment, I did install the server kernel from the Intrepid repo and rebooted and since then I haven't seen any oopses for more than a week. Because this is a server running in production I can not upgrade the whole distribution just for testing purposes.

If someone could make a kernel package of the 8.04 LTS server kernel plus the two commits referenced in the bug description, I could test that kernel to verify that they solve the problem.

Nicholas J Kreucher (kreucher) wrote :

I am also hit by this bug. FWIW, the kernel bug policies link above is incorrect. The correct page appears to be: https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies

Nicholas J Kreucher (kreucher) wrote :

Ok, I tried to follow the KernelTeamBugPolicies reporting instructions to attach to this existing bug report, but they appear to be wrong: at least for hardy LTS, apport-collect doesn't seem to exist.

If Ubuntu is serious about LTS releases, this bug needs to be fixed.

Here is a quick summary--in lieu of a full bug report--to basically say "me too":

Linux pacifico 2.6.24-23-generic #1 SMP Wed Apr 1 21:43:24 UTC 2009 x86_64 GNU/Linux

[ 111.332683] nfs4_cb: server 192.168.3.4 not responding, timed out
[ 111.333679] nfs4_cb: server 192.168.3.4 not responding, timed out
[ 111.333713] Unable to handle kernel NULL pointer dereference at 0000000000000018 RIP:
[ 111.333716] [<ffffffff88a26285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 111.333738] PGD 1691d6067 PUD 169dfa067 PMD 0
[ 111.333741] Oops: 0000 [1] SMP
[ 111.333744] CPU 0
[ 111.333745] Modules linked in: lirc_serial nfsd exportfs ppdev vmnet vmblock vmci vmmon tun powernow_k8 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative video output sbs sbshc container dock battery nfs lockd nfs_acl iptable_filter ip_tables x_tables xfs ext3 jbd mbcache ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi ac rpcsec_gss_krb5 auth_rpcgss sunrpc ndiswrapper sbp2 lp mt2131 s5h1409 wm8775 snd_hda_intel dvb_pll cx25840 snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep or51132 tuner tea5767 snd_seq_dummy tda8290 cx23885 snd_seq_oss lirc_imon cx88_dvb cx88_vp3054_i2c tuner_simple mt20xx tea5761 videobuf_dvb dvb_core lirc_dev snd_seq_midi snd_rawmidi ivtv cx8800 cx8802 cx88xx snd_seq_midi_event ir_common cx2341x snd_seq i2c_algo_bit analog tveeprom compat_ioctl32 psmouse videodev v4l1_compat v4l2_common videobuf_dma_sg videobuf_core btcx_risc snd_timer snd_seq_device parport_pc serio_raw k8temp gameport i2c_nforce2 parport nvidia(P) snd button shpchp pci_hotplug evdev soundcore i2c_core pcspkr dm_multipath jfs sd_mod sg sr_mod cdrom sata_nv pata_amd pata_acpi ohci1394 ieee1394 ata_generic forcedeth ohci_hcd ehci_hcd libata usbcore scsi_mod raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod dm_mirror dm_snapshot dm_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
[ 111.333814] Pid: 7967, comm: nfs4_cb_probe Tainted: P 2.6.24-23-generic #1
[ 111.333816] RIP: 0010:[<ffffffff88a26285>] [<ffffffff88a26285>] :sunrpc:rpc_shutdown_client+0x25/0xf0
[ 111.333830] RSP: 0018:ffff81014e37dea0 EFLAGS: 00010246
[ 111.333831] RAX: 00000000fffffffb RBX: ffff81014e332400 RCX: ffff810001028480
[ 111.333833] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[ 111.333835] RBP: 0000000000000018 R08: ffffffff88a53490 R09: ffff81014e37a000
[ 111.333837] R10: 0000000000000000 R11: ffffffff80287c90 R12: 0000000000000000
[ 111.333839] R13: 0000000000000000 R14: ffff81014e37deb8 R15: 0000000000000000
[ 111.333841] FS: 00007f849d5196e0(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
[ 111.333843] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 111.333845] CR2: 0000000000000018 CR3: 0000000168c09000 CR4: 00000000000006e0...


Chris Coulson (chrisccoulson) wrote :

As this seems to be fixed in Intrepid onwards, this should be set to fixed. I've nominated it for Hardy.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Fix Released
Tim Gardner (timg-tpi) on 2009-04-30
Changed in linux (Ubuntu Hardy):
assignee: nobody → Manoj Iyer (manjo)
importance: Undecided → High
milestone: none → ubuntu-8.04.3
status: New → In Progress
Manoj Iyer (manjo) wrote :

Can you please verify the hardy kernel in http://people.ubuntu.com/~manjo/lp253004-hardy/ and report back here?

Changed in linux (Ubuntu Hardy):
status: In Progress → Incomplete
Manoj Iyer (manjo) wrote :

Verifying that the kernel in http://people.ubuntu.com/~manjo/lp253004-hardy/ works will help me SRU the patch for hardy.

buntunub (mckisick) wrote :

I have run into this bug as well. My server is running 8.04.3 with kernel 2.6.24-24-generic (per uname -r) on i686. I have one client that is also running the same kernel and causes no issues. The other client, which does trigger this bug, is running Kubuntu 9.04 with kernel 2.6.28-11-generic. Only the Kubuntu Jaunty client causes the NFS server to bug out with a steady stream of "nfs4_cb: server (client machine IP addr) not responding, timed out" messages every 60 seconds until the server locks up hard. Will this fix work with various versions of Ubuntu clients connecting to a Hardy server?

Manoj Iyer (manjo) wrote :

Yes, I back ported the patch to Hardy, so the kernel in http://people.ubuntu.com/~manjo/lp253004-hardy/ should fix this problem. Please let me know if this fixes this bug for you.

buntunub (mckisick) wrote :

I would be delighted to but those are amd64 binaries and the server is 32 bit Hardy. :(

Manoj Iyer (manjo) wrote :

buntunub,

OK, a 32-bit kernel is uploaded. Please try the 32-bit kernel in http://people.ubuntu.com/~manjo/lp253004-hardy/
I really appreciate you taking the time to test this kernel.

buntunub (mckisick) wrote :

The kernel is now just installed and seems to be running fine. I will monitor it and let you know, but thus far, no errors on /var/log/kern.log or syslog.

buntunub (mckisick) wrote :

Sorry, guess I spoke too soon. The issue continues just as with the old kernel.

May 7 07:25:42 kernel: [24722.738074] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:31:42 kernel: [25082.696419] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:33:42 kernel: [25202.647926] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:34:42 kernel: [25262.657654] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:36:42 dave234 kernel: [25382.637104] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:38:42 kernel: [25502.652512] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:40:42 kernel: [25622.631887] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:42:42 kernel: [25742.607180] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:44:42 kernel: [25862.610436] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:45:42 kernel: [25922.616057] nfs4_cb: server 192.168.11.3 not responding, timed out
May 7 07:47:42 kernel: [26042.579311] nfs4_cb: server 192.168.11.3 not responding, timed out

I have deployed the updated kernel Manoj kindly provided, and will be looking for further issues, though the usual workload isn't present on this server at present.

buntunub: We expect these 'not responding, timed out' messages - they were a pre-failure symptom. The problem is solved if we get these, and don't crash, which your information seems to suggest hasn't happened, so it looks good!

buntunub (mckisick) wrote :

Still getting the time out errors, although less frequently now, and 24 hours full load on the server with the new kernel and no crash, so looks like the fix worked so far.

Chris Coulson (chrisccoulson) wrote :

buntunub - your log shows no oopses though, so that problem seems fixed by the patch. The timeout messages are likely a separate problem.
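To make that distinction concrete when checking a log, here is a rough sketch that counts the callback timeout messages per client and separately counts oops markers. `scan_kernel_log` and both patterns are hypothetical, written against the messages quoted in this report; a zero oops count only means these particular markers did not appear.

```python
import re

# Message formats taken from the logs quoted in this report (assumed stable).
TIMEOUT_RE = re.compile(r"nfs4_cb: server (\S+) not responding, timed out")
OOPS_RE = re.compile(r"Oops: |Unable to handle kernel NULL pointer dereference")


def scan_kernel_log(lines):
    """Return (timeouts per client address, number of oops marker lines)."""
    timeouts = {}
    oops_lines = 0
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            client = match.group(1)
            timeouts[client] = timeouts.get(client, 0) + 1
        elif OOPS_RE.search(line):
            oops_lines += 1
    return timeouts, oops_lines


# Two lines copied from the log posted above.
SAMPLE = [
    "May 7 07:25:42 kernel: [24722.738074] nfs4_cb: server 192.168.11.3 not responding, timed out",
    "May 7 07:31:42 kernel: [25082.696419] nfs4_cb: server 192.168.11.3 not responding, timed out",
]

timeouts, oops_lines = scan_kernel_log(SAMPLE)
print(timeouts, oops_lines)
```

The function accepts any iterable of lines, so an open file handle on /var/log/kern.log can be passed in directly. Timeouts with no oops lines would match the behaviour described for the patched kernel.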

Manoj Iyer (manjo) wrote :

Thanks for testing, I submitted SRU for Hardy.

Matt Kassawara (ionosphere80) wrote :

I've also tested this kernel for several days with no issues.

I've not experienced any issues with the updated kernel in the 6 days since it was deployed, though this is with a reduced workload.

Tim Gardner (timg-tpi) on 2009-05-13
Changed in linux (Ubuntu Hardy):
status: Incomplete → In Progress
Stefan Bader (smb) on 2009-06-03
Changed in linux (Ubuntu Hardy):
status: In Progress → Fix Committed
description: updated
Martin Pitt (pitti) wrote :

Accepted linux into hardy-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Magnus Hjorth (magnus-hjorth) wrote :

I have been running the hardy-proposed server kernel (amd64) for over a week now and the bug appears to be fixed, no oopses.

Martin Pitt (pitti) on 2009-06-26
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.24-24.56

---------------
linux (2.6.24-24.56) hardy-proposed; urgency=low

  [Stefan Bader]

  * Rebuild of 2.6.24-24.54 with 2.6.24-24.55 security release applied

linux (2.6.24-24.54) hardy-proposed; urgency=low

  [Andy Whitcroft]

  * SAUCE: do not make sysdev links for processors which are not booted
    - LP: #295091

  [Brad Figg]

  * SAUCE: Add information to recognize Toshiba Satellite Pro M10 Alps Touchpad
    - LP: #330885
  * SAUCE: Add signatures to airprime driver to support newer Novatel devices
    - LP: #365291

  [Stefan Bader]

  * SAUCE: vgacon: Return the upper half of 512 character fonts
    - LP: #355057

  [Upstream Kernel Changes]

  * SUNRPC: Fix autobind on cloned rpc clients
    - LP: #341783, #212485
  * Input: atkbd - mark keyboard as disabled when suspending/unloading
    - LP: #213988
  * x86: mtrr: don't modify RdDram/WrDram bits of fixed MTRRs
    - LP: #292619
  * sis190: add identifier for Atheros AR8021 PHY
    - LP: #247889
  * bluetooth hid: enable quirk handling for Apple Wireless Keyboards in
    2.6.24
    - LP: #227501
  * nfsd: move callback rpc_client creation into separate thread
    - LP: #253004
  * nfsd4: probe callback channel only once
    - LP: #253004

 -- Stefan Bader <email address hidden> Sat, 20 Jun 2009 00:14:36 +0200

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released
Changed in linux (Debian):
status: New → Fix Released