Kernel Oops - BUG: unable to handle kernel paging request at ffffffffffffffb8; RIP: 0010:[<ffffffffa05a5839>] [<ffffffffa05a5839>] nfs_have_delegation+0x9/0x40 [nfs]

Reported by rvaliant on 2012-04-05
This bug affects 9 people
Affects          Status    Importance   Assigned to   Milestone
Linux            Unknown   Unknown
linux (Ubuntu)             Medium       Unassigned
Precise                    Medium       Unassigned

Bug Description

== Precise SRU Justification ==

This bug is preventing users from using NFS clients on Precise.
Several users have reported the issue, which has already been fixed
upstream but has not made it into stable yet.

== Fix ==

Four commits are relevant to fixing this issue. From the
mainline kernel:
 - 14977489ffdb80d4caf5a184ba41b23b02fbacd9 (cherry-pick)
 - 96dcadc2fdd111dca90d559f189a30c65394451a (backported)
From linux-nfs git tree
(git://git.linux-nfs.org/projects/trondmy/linux-nfs.git):
 - 487790f27df9bb27d3400486bd021dd59edc7589 (cherry-pick)
 - 5de4815015e550bdd33f39650554325540356f0c (cherry-pick)

== Impact ==

Several users report the same issue when running NFS clients on
Precise. The same issue has also been reported elsewhere:

https://bugzilla.redhat.com/show_bug.cgi?id=811138

== Test Case ==

According to one of the bug reporters:

"Logging into a 12.04 client system with autofs5, LDAP, kerberos authenticated
nfs4 mounts. However this is to a server which is running 10.04 (contrary to #4
it seems) In my case the triggering process is gnome-keyring-d."

========================================================================================

I have no idea.

ProblemType: KernelOops
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-22-generic 3.2.0-22.35
ProcVersionSignature: Ubuntu 3.2.0-22.35-generic 3.2.14
Uname: Linux 3.2.0-22-generic x86_64
NonfreeKernelModules: fglrx
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
Annotation: Your system might become unstable now and might need to be restarted.
ApportVersion: 2.0-0ubuntu4
Architecture: amd64
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Thu Apr 5 14:35:46 2012
Failure: oops
HibernationDevice: RESUME=UUID=73ced0d0-9777-4d7c-8841-8bd3c57ec88c
InstallationMedia: Ubuntu 11.10 "Oneiric Ocelot" - Release amd64 (20111012)
IwConfig:
 lo no wireless extensions.

 eth2 no wireless extensions.
MachineType: Gigabyte Technology Co., Ltd. GA-MA78GM-US2H
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-22-generic root=UUID=dd826ebd-811d-4f2b-ac6a-5f527aee88cd ro recovery nomodeset
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions: kerneloops-daemon 0.12+git20090217-1ubuntu18
RfKill:

SourcePackage: linux
Title: BUG: unable to handle kernel paging request at ffffffffffffffb8
UpgradeStatus: Upgraded to precise on 2012-03-21 (15 days ago)
dmi.bios.date: 10/08/2009
dmi.bios.vendor: Award Software International, Inc.
dmi.bios.version: F8
dmi.board.name: GA-MA78GM-US2H
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.type: 3
dmi.chassis.vendor: Gigabyte Technology Co., Ltd.
dmi.modalias: dmi:bvnAwardSoftwareInternational,Inc.:bvrF8:bd10/08/2009:svnGigabyteTechnologyCo.,Ltd.:pnGA-MA78GM-US2H:pvr:rvnGigabyteTechnologyCo.,Ltd.:rnGA-MA78GM-US2H:rvrx.x:cvnGigabyteTechnologyCo.,Ltd.:ct3:cvr:
dmi.product.name: GA-MA78GM-US2H
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

rvaliant (rvaliant) wrote :
Brad Figg (brad-figg) on 2012-04-05
Changed in linux (Ubuntu):
status: New → Confirmed

rvaliant, thank you for reporting this bug and helping make Ubuntu better. Please answer these questions:

* Is this reproducible?
* If so, what specific steps should we take to recreate this bug?
* Could you also please test the latest upstream kernel available? It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

summary: - BUG: unable to handle kernel paging request at ffffffffffffffb8
+ Kernel Oops - BUG: unable to handle kernel paging request at
+ ffffffffffffffb8; RIP: 0010:[<ffffffffa05a5839>] [<ffffffffa05a5839>]
+ nfs_have_delegation+0x9/0x40 [nfs]
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: nfs-have-delegation
Luis Henriques (henrix) wrote :

This bug has also been reported here: https://bugzilla.redhat.com/show_bug.cgi?id=811138

Dan Bishop (danbishop) wrote :

This problem is reproduced by logging into Ubuntu with a user's home directory mounted over NFS... though only if the server is also running 12.04.

I have a stable server running 10.04, and 12.04 clients can still mount NFS home directories from there. When I move to the test server, which is configured in the same way but runs 12.04, the clients crash as above.

Joseph Salisbury (jsalisbury) wrote :

Has anyone affected by this bug had a chance to test the latest mainline kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc3-precise/

tags: added: kernel-da-key kernel-key
rvaliant (rvaliant) wrote :

Actually, my server is Ubuntu 11.10, so if this is an NFS related issue, it's not confined to 12.04. I have some time today and will try the latest mainline kernel on my client as suggested by jsalisbury and report back - if someone doesn't beat me to it.

Ian Morris (ipm) wrote :

I'm seeing the same oops -- again logging into a 12.04 client system with autofs5, LDAP, kerberos authenticated nfs4 mounts. However this is to a server which is running 10.04 (contrary to #4 it seems). In my case the triggering process is gnome-keyring-d.

Running the mainline kernel (linux-image-3.4.0-030400rc3-generic_3.4.0-030400rc3.201204152235_amd64.deb) I no longer get the oops but get lots of messages:

[ 48.701213] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[ 48.701990] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[ 53.696076] nfs4_reclaim_open_state: 6440 callbacks suppressed

Until the client was upgraded to 12.04, it was running 11.10 perfectly in this configuration.
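
As a rough cross-check of the message volume in the log excerpt above, the following Python sketch counts the printed and suppressed "Lock reclaim failed" messages and divides by the time between the first and last lines. It assumes the "callbacks suppressed" count covers exactly that interval, which the report does not confirm; the function name `summarize` is made up for this illustration.

```python
import re

# Log lines copied from the comment above.
LOG = """\
[ 48.701213] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[ 48.701990] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
[ 53.696076] nfs4_reclaim_open_state: 6440 callbacks suppressed
"""

# "[ <seconds since boot>] <message>"
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
SUPPRESSED_RE = re.compile(r"(?P<count>\d+) callbacks suppressed")


def summarize(log_text):
    """Return (printed, suppressed, seconds) for lock-reclaim messages."""
    printed = 0
    suppressed = 0
    stamps = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m or "nfs4_reclaim_open_state" not in m.group("msg"):
            continue
        stamps.append(float(m.group("ts")))
        s = SUPPRESSED_RE.search(m.group("msg"))
        if s:
            # The kernel rate limiter reports how many messages it dropped.
            suppressed += int(s.group("count"))
        elif "Lock reclaim failed" in m.group("msg"):
            printed += 1
    # Assumption: the counts cover the span between first and last line.
    seconds = max(stamps) - min(stamps) if stamps else 0.0
    return printed, suppressed, seconds


printed, suppressed, seconds = summarize(LOG)
rate = (printed + suppressed) / seconds if seconds else 0.0
print(f"{printed} printed, {suppressed} suppressed over {seconds:.3f} s")
print(f"roughly {rate:.0f} failures per second")
```

Under that assumption the excerpt works out to somewhat more than a thousand failures per second.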

Luis Henriques (henrix) wrote :

It looks like commit 14977489ffdb80d4caf5a184ba41b23b02fbacd9 should fix this issue. I have built a test kernel that includes this commit. Could someone check if it solves the problem? The test kernel can be obtained here:

http://people.canonical.com/~henrix/lp974664/

Ian Morris (ipm) wrote :

Re: #8, I tried this kernel, with the same effect as running the mainline kernel in #7: specifically, a great many messages of the type:

NFS: nfs4_reclaim_open_state: Lock reclaim failed!

Luis Henriques (henrix) wrote :

Ian: Thanks for testing the kernel. Apparently, these messages are harmless and the mainline kernel already has a rate-limiting patch for this. I have backported the rate-limiting commit (96dcadc2fdd111dca90d559f189a30c65394451a) on top of the kernel you tested and uploaded it here:

http://people.canonical.com/~henrix/lp974664v2/

Please let me know if this new kernel fixes it.

Ian Morris (ipm) wrote :

Re: #10, Luis, thanks for your efforts on this. Running http://people.canonical.com/~henrix/lp974664v2, the oops is indeed resolved, and the rate limiting is definitely reducing the number of messages produced, but nonetheless many dozens of messages are still produced, just like with the mainline kernel in fact. I assume the mainline kernel also had the rate-limiting patch applied.

I'm not entirely sure about the "harmless" nature of the messages, however. My perception is that logging in is somewhat slower. This may be down to the beta nature of the release at present; however, what I am observing, even after the user's logon has completed, is a high level of network activity. From a packet trace and a Wireshark dump of this activity, it seems that the activity is down to NFS and that it is stuck in something of a loop (see screenshot attached): specifically, it continually gets an error when trying to lock the file (the same file each time). I wonder if the slowness has the same root cause: the sheer number of failed retries. Looking at the packet trace, it seems, if I'm not mistaken, that there are about 1000 of these per second!

I observe the file being processed is user.keystore and in my case, the prior oops was triggered by gnome-keyring-d so I can't help feeling these issues are related. However I will admit I'm no expert on NFS so I could be seeing ghosts :-)

Let me know if there is any information I can provide that might help.

Luis Henriques (henrix) wrote :

Ian, thanks again for testing and helping to triage this bug.

When I said "harmless" (and I'm far from being an expert on NFS as well!), I was not aware that the flood of logged messages would stay that high. Obviously, logging such a huge number of messages will at least impact performance. My comment referred to the fact that the log message was on an error path that would be retried.

I'll ping upstream for further information and post any developments here.

Dan Bishop (danbishop) wrote :

"Re: #10, Luis, thanks for your efforts on this. By running http://people.canonical.com/~henrix/lp974664v2, the Oops is indeed resolved and the rate limiting is definitely reducing the numbers of messages produced but none the less there are many many dozens of messages produced still -- just like with the mainline kernel infact. I assume the mainline kernel also had the rate limiting patch applied."

I can confirm that here too, using Luis' kernel on the client and the latest, untouched, untainted 3.2.0-23-generic on the server.

Dan Bishop (danbishop) wrote :

Can this not be marked as confirmed now?

Luis Henriques (henrix) wrote :

Ian, there's something I forgot to ask: could you take a look at the server logs (running Lucid, I believe) and check whether there's some extra information there? Please attach any relevant information (in particular the kernel logs).

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Thomas Spitzlei (t-spitzlei) wrote :

Same Problem here:
Server 10.04 LTS Stock-Kernel

Clients: 11.04 - Custom Kernel, nfs-home

Kernel 3.1, 3.2 from kernel.org:
BUG: unable to handle kernel paging request at ffffffffffffffb8
IP: [<ffffffffa0e83bfb>] nfs_have_delegation+0xb/0x40 [nfs]

Kernel 3.4 from kernel.org:
NFS: nfs4_reclaim_open_state: Lock reclaim failed!

It appears when trying to run, for example, tomboy, gedit, or pidgin. None of the 3.x kernels (3.1, 3.2, 3.4) can start these apps with an NFS home.

No entries in the server logs.
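
For readers decoding the oops lines quoted above, here is a small Python sketch that splits the `symbol+0xOFFSET/0xSIZE [module]` field and reinterprets the faulting address as a signed 64-bit value. The helper names are hypothetical. Reading a small negative address as "a negative structure offset applied to a NULL pointer" is a common interpretation and only an assumption here, not something the oops output itself states.

```python
import re

# Lines as quoted in this report.
FAULT_LINE = "BUG: unable to handle kernel paging request at ffffffffffffffb8"
IP_LINE = "IP: [<ffffffffa0e83bfb>] nfs_have_delegation+0xb/0x40 [nfs]"

# symbol+0x<offset into function>/0x<function size> [optional module]
SYMBOL_RE = re.compile(
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def to_signed64(value):
    """Interpret an unsigned 64-bit value as two's-complement signed."""
    return value - (1 << 64) if value >= (1 << 63) else value


def parse_fault_address(line):
    """Take the last token of the BUG line as a hex address."""
    return to_signed64(int(line.split()[-1], 16))


def parse_symbol(line):
    """Split the symbol field of an IP/RIP line into its parts."""
    m = SYMBOL_RE.search(line)
    if m is None:
        raise ValueError("no symbol+offset/size field found")
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
    }


fault = parse_fault_address(FAULT_LINE)
where = parse_symbol(IP_LINE)
print(f"fault address as signed value: {fault} ({hex(fault)})")
print(f"{where['symbol']} in module {where['module']}, "
      f"{hex(where['offset'])} bytes into a {hex(where['size'])}-byte function")
```

For the lines above, the address decodes to -72 (-0x48), and the faulting instruction is 0xb bytes into the 0x40-byte nfs_have_delegation function in the nfs module.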

Ian Morris (ipm) wrote :

Re: #15, Correct -- the server is running lucid with the stock kernel (from updates). There are no messages in the server logs that I can see (same as #16).

Luis Henriques (henrix) wrote :

I have pinged upstream about the issue and two new patches (not yet upstream) have been provided. I have uploaded a new test kernel here:

http://people.canonical.com/~henrix/lp974664v3/

It contains the two mentioned patches and the other two you have already tested.

Please let me know if this new kernel finally solves the issue.

Ian Morris (ipm) wrote :

Re: #18, Luis, I think you've cracked it! I have just installed that kernel and I am happy to report that I'm not getting an oops, nor am I getting any messages on the client. The network traffic is back down to normal levels, I'm not getting any messages on the server side and, last but not least, my subjective test of the time it takes to log in seems to be back to normal!

Thomas Spitzlei (t-spitzlei) wrote :

That's it, Luis! The bug is gone with your kernel.

Where can I get the patches to compile my own? When will the fixes be on kernel.org?

Thanks!

Shawn Haggett (podge-9) wrote :

I can confirm as well that with the kernel from #18, performing the same steps as before no longer triggers the bug, and everything seems to work correctly.

Thanks as well!

Luis Henriques (henrix) wrote :

That's great, thank you for testing. So, to summarise, here's the list of commits that were present in this last kernel you have tested:

- From Linux mainline:
    - 14977489ffdb80d4caf5a184ba41b23b02fbacd9
    - 96dcadc2fdd111dca90d559f189a30c65394451a

- From linux-nfs, branch bugfixes:
    - 487790f27df9bb27d3400486bd021dd59edc7589
    - 5de4815015e550bdd33f39650554325540356f0c
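
For anyone wanting to build their own kernel from the commits listed above, the following Python sketch only prints a plausible sequence of git commands; it does not run them. The commit IDs, the linux-nfs URL and the "bugfixes" branch name come from this report. The mainline URL and branch, and the idea of fetching straight from the URLs, are assumptions. The bug description marks 96dcadc2 as backported rather than cherry-picked, so expect that pick to need manual conflict resolution on a 3.2 tree, and the linux-nfs commit IDs may differ once those patches land in mainline.

```python
# Mainline URL and branch are assumptions; the rest is from this report.
MAINLINE_URL = "git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
LINUX_NFS_URL = "git://git.linux-nfs.org/projects/trondmy/linux-nfs.git"

# (repository URL, branch, commits to pick in the order listed above)
SOURCES = [
    (MAINLINE_URL, "master", [
        "14977489ffdb80d4caf5a184ba41b23b02fbacd9",
        "96dcadc2fdd111dca90d559f189a30c65394451a",  # backported in the SRU
    ]),
    (LINUX_NFS_URL, "bugfixes", [
        "487790f27df9bb27d3400486bd021dd59edc7589",
        "5de4815015e550bdd33f39650554325540356f0c",
    ]),
]


def build_commands(sources):
    """Return git command lines: one fetch per source, one pick per commit."""
    commands = []
    for url, branch, commits in sources:
        # Fetching makes the commits available locally by their IDs.
        commands.append(["git", "fetch", url, branch])
        for sha in commits:
            # -x records the original commit ID in the new commit message.
            commands.append(["git", "cherry-pick", "-x", sha])
    return commands


for cmd in build_commands(SOURCES):
    print(" ".join(cmd))
```

The printed commands are meant to be reviewed and run by hand inside a checkout of the kernel tree being patched.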

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Dan Bishop (danbishop) wrote :

Same here, thank you so much Luis!

Is there any way this will make it into precise before launch now? It seems almost unthinkable that an LTS could be released without the ability to mount NFS home directories :s

Is there anything we can do to help this process?

Luis Henriques (henrix) on 2012-04-19
description: updated
tags: removed: kernel-key

Hello rvaliant, or anyone else affected,

Accepted linux into precise-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in linux (Ubuntu Precise):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-24.37

---------------
linux (3.2.0-24.37) precise-proposed; urgency=low

  [ Herton Ronaldo Krzesinski ]

  * d-i: Add hid-logitech-dj to input-modules
    - LP: #975198
  * d-i: Add rtl8187 driver to nic-usb-modules
    - LP: #971719

  [ Ian Abbott ]

  * SAUCE: staging: comedi: Add module parameters for default buffer size
    - LP: #981234
  * SAUCE: staging: comedi: Add kernel config for default buffer sizes
    - LP: #981234

  [ K. Y. Srinivasan ]

  * SAUCE: hv_storvsc: Account for in-transit packets in the RESET path
    - LP: #978394

  [ Leann Ogasawara ]

  * [Config] Set CONFIG_COMEDI_DEFAULT_BUF_[SIZE_KB,MAXSIZE_KB]
    - LP: #981234

  [ Luis Henriques ]

  * SAUCE: ite-cir: postpone ISR registration
    - LP: #984387

  [ Manoj Iyer ]

  * SAUCE: Bluetooth: btusb: Add vendor specific ID (0489 e042) for
    BCM20702A0
    - LP: #980965

  [ Tim Gardner ]

  * Extract firmware module info during getabi
  * [Config] Remove hiq-quanta module references
    - LP: #913164
  * [Config] powerpc-smp: build in ATI and RADEON frame buffer drivers
    - LP: #949288

  [ Trond Myklebust ]

  * SAUCE: NFSv4: Ensure that the LOCK code sets exception->inode
    - LP: #974664
  * SAUCE: NFSv4: Ensure that we check lock exclusive/shared type against
    open modes
    - LP: #974664

  [ Upstream Kernel Changes ]

  * Input: psmouse - allow drivers to use psmouse_{de,}activate
    - LP: #969334
  * Input: psmouse - use psmouse_[de]activate() from sentelic and hgpk
    drivers
    - LP: #969334
  * Input: sentelic - refactor code for upcoming new hardware support
    - LP: #969334
  * Input: sentelic - enabling absolute coordinates output for newer
    hardware
    - LP: #969334
  * Input: sentelic - minor code cleanup
    - LP: #969334
  * Input: sentelic - improve packet debugging information
    - LP: #969334
  * Input: sentelic - filter taps in absolute mode
    - LP: #969334
  * drm/i915: Fixes distorted external screen image on HP 2730p
    - LP: #796030
  * NFSv4: Minor cleanups for nfs4_handle_exception and
    nfs4_async_handle_error
    - LP: #974664
  * NFSv4: Rate limit the state manager for lock reclaim warning messages
    - LP: #974664
  * HID: multitouch: merge quanta driver into hid-multitouch
    - LP: #913164
  * HID: usbhid: add quirk no_get for quanta 3008 devices
    - LP: #913164
 -- Leann Ogasawara <email address hidden> Tue, 24 Apr 2012 07:47:49 -0700

Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Andrei Terechko (terechko) wrote :

I observe the same kernel bug on Ubuntu 11.10 (oneiric) with kernel 3.0.0-19-generic, see the attached syslog. Interestingly, it's not there in the vanilla install of Ubuntu 11.10 with kernel 3.0.0-12-generic (tested by booting an older kernel from grub). This issue manifests itself when I log in using an LDAP user with home on an NFS mount.

Any chance of fixing it on Ubuntu 11.10?

Herton R. Krzesinski (herton) wrote :

Tagging this bug as verified for Precise. The patch is already released with Precise, but it is showing up on the ti-omap4 SRU report, where it isn't a ti-omap4-specific change.

tags: added: verification-done-precise

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.