NFS client : kernel 4.4.0-57 crash with nfsv4 entries in /etc/fstab

Bug #1650336 reported by Thomas Fili on 2016-12-15
This bug affects 2 people
Affects: linux (Ubuntu) | Importance: High | Assigned to: Joseph Salisbury
Affects: Xenial | Importance: High | Assigned to: Seth Forshee

Bug Description

SRU Justification

Impact: A commit from upstream 4.4 stable introduced a regression in refcounting auth_gss messages which can lead to an oops.

Fix: Cherry pick upstream commit 1cded9d2974fe4fe339fc0ccd6638b80d465ab2c "SUNRPC: fix refcounting problems with auth_gss messages."

Regression Potential: Test results confirm that the commit fixes an existing regression. It's pretty straightforward and is already present in some upstream stable kernels, so I think the potential for regressions is minimal.

---

On (K)ubuntu 16.04.1 (Xenial), after upgrading to kernel 4.4.0-57, the kernel crashes on boot when an nfsv4 entry is active in /etc/fstab.

When the line is commented out or set to noauto, the kernel boots without error, and mounting the nfs shares afterwards (after uncommenting the line) also works without error.

4.4.0-54 boots without this problem.

Thomas Fili (tfili69) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1650336

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Thomas Fili (tfili69) wrote :

Yes,
in fact it is not possible to execute the apport-collect command once the kernel has crashed.

If someone needs information about the hardware, I can boot the computer without the nfs fstab entry and then execute the command.

But I can also report the same problem from other computers (for example Supermicro servers) ... and not only ones running Xenial ... computers running Trusty with the linux-generic-lts-xenial kernel are also affected.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Medium → High
Changed in linux (Ubuntu Xenial):
importance: Medium → High
tags: added: performing-bisect xenial
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between Ubuntu 4.4.0-54 and Ubuntu 4.4.0-57. The kernel bisect will require testing of about 4-5 test kernels.

I built the first test kernel, up to the following commit:
0cd611da7d4c01b178144bc17da8cd92cae2b1fa

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1650336

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance
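
The "about 4-5 test kernels" estimate in the comment above is what one would expect from a bisect, which halves the candidate range with every test. A minimal sketch of that arithmetic follows; the function name and the commit counts are hypothetical, since the thread does not say how many commits separate 4.4.0-54 and 4.4.0-57.

```python
def bisect_rounds(num_commits: int) -> int:
    """Worst-case number of test kernels needed to isolate one bad commit.

    Assumes exactly one first-bad commit and a reliable good/bad verdict
    for every test kernel. Equivalent to ceil(log2(num_commits)).
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # Integer form of ceil(log2(n)); avoids floating-point rounding.
    return (num_commits - 1).bit_length()


# Hypothetical commit counts, not taken from the bug report.
for n in (16, 24, 32):
    print(n, "commits ->", bisect_rounds(n), "test kernels")
```

The estimate only holds when every verdict is correct; a single wrong "good" or "bad" answer sends the search into the wrong half of the range.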

Thomas Fili (tfili69) wrote :

Thank you for this test kernel.

Unfortunately this one has the same problem as the official 4.4.0-57.

Without the nfs entries in /etc/fstab, or with them set to noauto, the kernel boots fine.
With the nfs entries the kernel crashes ... sorry

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
ed8c9a98e60fc731a9d83a7a137d5d84210967f5

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1650336

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Thomas Fili (tfili69) wrote :

Unfortunately no change ... same behavior ...

But I noticed something strange ... if I boot one of these broken kernels that crash and then restart with the Magic SysRq into a "good" kernel ... that kernel crashes at the same point.

Another Magic SysRq restart or a cold start lets the "good" kernel boot normally.

The affected kernels do not boot in any case ... neither after a cold start, a warm start / reboot, nor after minutes without power ...

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
764d47217b1f3881600e11c08f109b177e521b15

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1650336

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Thomas Fili (tfili69) wrote :

Sorry, no change ... but I noticed something that could be useful.

To rule out a hardware problem on my computer, I installed a fresh Kubuntu 16.04.1 on a newer machine and configured it.

Same behaviour ... the base kernel from the installation, 4.4.0-31, works without problems ... but all newer kernels I tested there crash sometimes, including the 4.4 mainline kernels :(
Sometimes, after disconnecting power for some minutes, the mainline kernel boots successfully ...

In the boot logs I see that the mount of the first nfs entry seems to be successful; the kernel crashes when trying to mount the second entry ...

With or without kerberos ... I tried several combinations ... with only one nfs entry in fstab the system boots without problems; adding the second one makes the kernel crash at boot time.

Mounting a second nfs share after boot is complete is no problem.

Thomas Fili (tfili69) wrote :

I noticed an additional detail ...

The problem only occurs if the second nfs share is on the same NFS server as the first share.
I tried mounting a second share from another NFSv4 server, also running FreeBSD 11, without problems.

Thomas Fili (tfili69) wrote :

The mainline kernel 4.8.17-040817 does not have the problem; 4.4.41-040441 has the problem.

Joseph Salisbury (jsalisbury) wrote :

Thanks for finding out the bug does not exist in the upstream 4.8 kernel. There are only a couple more test kernels for the bisect, so we may as well finish it.

I built the next test kernel, up to the following commit:
4891ae8e5d0801f13739c26300ac4cd162c3e63c

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1650336

It might also be worthwhile to test the latest upstream 4.4 kernel to see if the commit that fixes the bug in 4.8 was also cc'd to stable.

The latest 4.4 kernel is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.41/

Thomas Fili (tfili69) wrote :

Thank you for the test kernel :)

But unfortunately it also crashes at the same position.
It begins with: BUG: unable to handle kernel paging request at ffffffff814121a8
...

4.4.41 also crashes, as I mentioned before ... On all computers I tested I see the same behavior: I can successfully boot such a kernel once when the computer was without power for some minutes beforehand ... or sometimes when I booted a good kernel before ... very strange

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
50f208e18014589971583a8495987194724d56e4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1650336

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Thomas Fili (tfili69) wrote :

No, sorry, this kernel crashes too.

Joseph Salisbury (jsalisbury) wrote :

The bisect reported commit 50f208e18014589971583a8495987194724d56e4 as the first bad commit. I built a Xenial test kernel with this commit reverted. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1650336/

Can you test this kernel and see if it resolves this bug?

Note: you need to install both the linux-image and linux-image-extra .deb packages for this kernel.

Thomas Fili (tfili69) wrote :

Sorry, the same behavior.

But i found something else...

It is very frustrating to apparently be the only one having this problem, so I tried to set up an nfsv4 server on Ubuntu (14.04.5 with linux-generic-lts-wily kernel 4.2.0-42)

/etc/exports:

/home/exports *(rw,fsid=0,crossmnt,no_subtree_check,sync)
/home/exports/user *(rw,nohide,insecure,no_subtree_check,sync)
/home/exports/staff *(rw,nohide,insecure,no_subtree_check,sync)

Client (Ubuntu 16.04.1)

/etc/fstab

server:/user /home/server/user nfs _netdev,auto,rw,noatime,nfsvers=4,sync 0 0
server:/staff /home/server/staff nfs _netdev,auto,rw,noatime,nfsvers=4,sync 0 0

mount -a or command line mount works without problem ...

---

The behavior is not exactly the same as in our environment with a FreeBSD server, but similar I think.

Default installation kernel 4.4.0-31-generic: OK, boots without problems

Latest kernel from the repo, 4.4.0-59, and all other kernels I tested, including the latest mainline kernel 4.9.4:

a. Booting with only one auto entry in /etc/fstab: no problem

b. Booting with both auto entries in /etc/fstab: the first share mounts quickly.

   The boot log shows: "A start job is running for /home/server/staff (1min 14s / 1min 38s)"

   After the timeout expires the computer finishes booting, but without mounting the second share.

   After logging in I can mount the second share without problems.

So the kernel does not crash with an Ubuntu nfs server ... but maybe the main reason for the problem is the same ?!
---

Is there anyone who could confirm this behavior in their environment?

So that I do not feel so alone any longer ;)

Joseph Salisbury (jsalisbury) wrote :

We may have provided an improper "Good" or "Bad" result to the bisect. We may have to test the previously posted kernels again to confirm test results.

However, can you first test the latest upstream 4.4 stable kernel and mainline kernel to see if this bug is already fixed upstream? They can be downloaded from:

4.4.43: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.43/

Mainline: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc4

Thomas Fili (tfili69) wrote :

Thank you for your effort.

Unfortunately the tests are very confusing.

With the two shares from the FreeBSD 11 server the mainline kernel 4.10.0 has no problem !!!
Also when I try to mount two additional shares from another FreeBSD 11 server: no problem at all!

But if I try to mount the two shares from the Ubuntu 14.04.01 nfsv4 server, only one share is mounted at boot time, though after the timeout expires the system boots fully.

With the mainline kernel 4.9.4 the computer boots with every combination and number of shares without problems, just like the 4.8.17 mainline kernel and the 16.04.1 installation kernel 4.4.0-31.

Mainline kernel 4.4.43 crashes just like 4.4.42 and 4.4.41.

Joseph Salisbury (jsalisbury) wrote :

So this bug does not happen with the 4.10-rc4 kernel from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc4

If that is the case, we can perform a "Reverse" bisect to identify the commit that resolves that bug upstream.

Thomas Fili (tfili69) wrote :

If you are asking whether kernel 4.10-rc4 no longer crashes at boot time, then I can answer: yes, the kernel no longer crashes.

But if you are asking whether kernel 4.10-rc4 works as expected with nfs,

then I have to answer: no, it does not work as expected with nfs shares.

IMHO the mainline kernel 4.9.4 works as expected ;)

But maybe you think different parts of the kernel are responsible for this behaviour?

Thomas Fili (tfili69) wrote :

Very strange ...

I tested 4.10-rc4 again and found that the kernel boots without problems with all shares when using the plain IP address for the server in /etc/fstab ... it makes no difference whether a correct entry in /etc/hosts exists or DNS resolution works.

I dunno why 4.9.4 works with DNS names alone ...

Seth Forshee (sforshee) wrote :

@Thomas: I've been going back through the test results here and something isn't making sense. During the bisect you reported that the kernel built at commit 50f208e18014589971583a8495987194724d56e4 was bad. This commit has no code changes relative to 4.4.0-54, so they should behave the same. This makes me think that either the crash is intermittent, i.e. it might happen sometimes but not other times with the same kernel, or else that something is changing in your testing or environment. There's a slight chance that it could be some difference in the builds, but that's pretty unlikely.

The other question I have is whether or not you're always seeing the same problem each time you say a kernel is bad. By that I mean you see a crash with nearly identical messages in the kernel log. If there are multiple different issues going on it's best to try to focus on one at a time, if possible (it isn't always possible though if one problem is interfering with testing for another).

Thomas Fili (tfili69) wrote :

@Seth : Yes, this problem is very confusing

Since posting #8 I have used a dedicated test system with a fresh Kubuntu 16.04.1 installation on well-tested hardware.
During testing I modified /etc/fstab, switching between the noauto/auto options of the nfs shares and later changing the server URI to an IP address or adding new entries from other servers.

And I installed the test kernels, of course ;)

Today I installed the kernel built at commit 50f208e18014589971583a8495987194724d56e4 again ... and it crashes as I said before.

Unfortunately the kernel 4.4.0-54 is no longer available from the official repos ... so I am not able to test this one again ... but maybe this kernel also had the problem ?!

At the beginning I was very confused because some kernels crashed ... not on the first try but on the second or third try.

In fact I found only one old official kernel that does not have the problem, and this is the 16.04.1 installation kernel 4.4.0-31

But I will try to find the last working kernel between 4.4.0-34 and 4.4.0-53 tomorrow.

Mainline kernels 4.10-rc4 and 4.9.4 both work since I changed the server URI to an IP address in /etc/fstab

> By that I mean you see a crash with nearly identical messages in the kernel log.

Yes, of course ... when I wrote that the kernel crashed, this always happened at the same place in the log with similar log entries

> If there are multiple different issues going on it's best to try to focus on one at a time, if possible

That is clear ... for example the problem of resolving the IP from the URI is secondary.
And there is also another bug with accessing an nfsv4 subshare

On Thu, Jan 19, 2017 at 06:14:33PM -0000, Thomas Fili wrote:
> @Seth : Yes, this problem is very confusing

Thanks for the clarifications.

> Unfortunately the kernel 4.4.0-54 is not available any longer from
> official repos ... so i not able to test this one again ... but maybe
> this kernel had also the problem ?!

Yeah, looks like this build never made it out of our PPA.

> But i will try to find the last working kernel between 4.4.0-34 to
> 4.4.0-53 tomorrow.

Please do. So far we've been working under the assumption that this is a
bug introduced after 4.4.0-54, so if it was introduced before that we
would never have found it. Honestly though if the build at commit
50f208e18014589971583a8495987194724d56e4 is bad then 4.4.0-54 is almost
certainly bad as well.

> And there is also another bug with accessing a nfsv4 subshare

Yes, I haven't forgotten about this one. I'm waiting on the upstream
developers right now, but if they don't come back with something by
early next week I'll pursue a temporary fix in x/y.

Thomas Fili (tfili69) wrote :

@Seth

> Thanks for the clarifications.

Har Har

> Yeah, looks like this build never made it out of our PPA.

Unfortunately, I did not find this kernel version on any of our computers.
But I am rather sure I have seen this kernel in the official repos

Thomas Fili (tfili69) wrote :

So, now I will try to formulate a new short bug description with all the facts I know at the moment:

Since kernel version 4.4.0-42 (official repo for 16.04.1) the boot process crashes when there are at least two nfsv4 entries for the same nfs server in /etc/fstab

With only one share entry in /etc/fstab the boot process does not crash.

The last working kernel not having this problem is 4.4.0-38.

Mainline kernels 4.10-rc4 and 4.9.4 both work without problems
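
The trigger condition in the summary above (two or more boot-time NFS entries pointing at the same server) can be checked mechanically. The following is a rough sketch, not an official tool: the function name is made up for illustration, and it treats every entry of type nfs or nfs4 as a candidate without inspecting nfsvers. The sample text reuses the fstab lines posted earlier in this thread.

```python
from collections import defaultdict


def nfs_servers_with_multiple_auto_mounts(fstab_text: str) -> dict:
    """Map each NFS server to its boot-time mount points when it has two or more."""
    mounts = defaultdict(list)
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, fs_type, options = fields[:4]
        if fs_type not in ("nfs", "nfs4"):
            continue  # assumption: NFS version is not checked further
        if "noauto" in options.split(","):
            continue  # not mounted at boot, so not part of the trigger
        server = device.rsplit(":", 1)[0]  # "server:/user" -> "server"
        mounts[server].append(mount_point)
    return {srv: pts for srv, pts in mounts.items() if len(pts) >= 2}


example = """\
server:/user /home/server/user nfs _netdev,auto,rw,noatime,nfsvers=4,sync 0 0
server:/staff /home/server/staff nfs _netdev,auto,rw,noatime,nfsvers=4,sync 0 0
"""
print(nfs_servers_with_multiple_auto_mounts(example))
# {'server': ['/home/server/user', '/home/server/staff']}
```

An empty result means no server has more than one boot-time entry, which matches the configurations reported as booting cleanly.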

Thomas Fili (tfili69) wrote :

Hm, sorry, I hope there was no misunderstanding ?!

As I mentioned several times before, there was strange behaviour when I tested the kernels the first time.
Sometimes a kernel booted successfully once or twice and did not crash until the third try ...

I could reproduce this behaviour on different computers, but at the beginning of the tests I mistakenly declared kernels to be OK because of it. Maybe someone else is able to explain such behaviour ...

When I tested the kernels the second time, I tried to boot every kernel three times to make sure I got the correct tag.

Sorry again for the unnecessary work
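
The "boot every kernel three times" rule described above amounts to calling a kernel bad if any attempt crashes. A small sketch of that rule follows, together with the chance that a bad kernel still slips through as "good". The function names and the crash probability are hypothetical, and the independence assumption is a simplification, since the thread reports that results depended on what had been booted before.

```python
def classify_kernel(boot_results: list) -> str:
    """Label a kernel 'bad' if any boot attempt crashed, otherwise 'good'."""
    return "bad" if any(r == "crash" for r in boot_results) else "good"


def chance_of_false_good(crash_probability: float, attempts: int) -> float:
    """Probability that a bad kernel survives every attempt.

    Assumes independent boots with a fixed per-boot crash probability.
    """
    return (1.0 - crash_probability) ** attempts


print(classify_kernel(["ok", "ok", "crash"]))  # bad
# Hypothetical crash probability of 0.5 per boot:
for attempts in (1, 3, 5):
    print(attempts, "boots ->", chance_of_false_good(0.5, attempts))
```

Under these assumptions a single boot would mislabel such a kernel half the time, which is consistent with the bisect having received wrong "good" verdicts early on.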

Seth Forshee (sforshee) wrote :

There were two commits to sunrpc between 4.4.0-38 and 4.4.0-42 which came from upstream 4.4 stable.

4bb0ea1f3289 SUNRPC: Handle EADDRNOTAVAIL on connection failures
8785a1d6c5b3 SUNRPC: allow for upcalls for same uid but different gss service

There's a later commit which says it fixes problems with the latter of these, and specifically mentions a NULL dereference in rpc_pipe_read:

1cded9d2974f SUNRPC: fix refcounting problems with auth_gss messages.

That one should be coming from upstream stable too, but it looks like we don't have it yet.

Joe, could you provide Thomas with a test kernel containing that fix that he can test? Thanks.

Seth Forshee (sforshee) wrote :

Sorry for the delay. Please test the kernel below to see if it fixes the problem. It also includes the submount permission fix from the other bug.

http://people.canonical.com/~sforshee/lp1650336/linux-4.4.0-63.84+lp1650336v201702150737/

Thomas Fili (tfili69) wrote :

Ok, looks good, I tried several reboots with this kernel and all were successful :)

Seth Forshee (sforshee) on 2017-02-16
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → Seth Forshee (sforshee)
Seth Forshee (sforshee) on 2017-02-16
description: updated
Tim Gardner (timg-tpi) on 2017-02-16
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Thomas Fili (tfili69) wrote :

Looks good,

this bug seems to be solved with the -proposed kernel 4.4.0-65.86 ... good work ... thank you very much

I changed the tag 'verification-needed-xenial' to 'verification-done-xenial'

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-65.86

---------------
linux (4.4.0-65.86) xenial; urgency=low

  * linux: 4.4.0-65.86 -proposed tracker (LP: #1667052)

  [ Stefan Bader ]
  * Upgrade Redpine RS9113 driver to support AP mode (LP: #1665211)
    - SAUCE: Redpine driver to support Host AP mode

  * NFS client : permission denied when trying to access subshare, since kernel
    4.4.0-31 (LP: #1649292)
    - fs: Better permission checking for submounts

  * [Hyper-V] SAUCE: pci-hyperv fixes for SR-IOV on Azure (LP: #1665097)
    - SAUCE: PCI: hv: Fix wslot_to_devfn() to fix warnings on device removal
    - SAUCE: pci-hyperv: properly handle pci bus remove
    - SAUCE: pci-hyperv: lock pci bus on device eject

  * [Hyper-V/Azure] Please include Mellanox OFED drivers in Azure kernel and
    image (LP: #1650058)
    - net/mlx4_en: Fix bad WQE issue
    - net/mlx4_core: Fix racy CQ (Completion Queue) free
    - net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT
      transitions
    - net/mlx4_core: Avoid command timeouts during VF driver device shutdown

  * Xenial update to v4.4.49 stable release (LP: #1664960)
    - ARC: [arcompact] brown paper bag bug in unaligned access delay slot fixup
    - selinux: fix off-by-one in setprocattr
    - Revert "x86/ioapic: Restore IO-APIC irq_chip retrigger callback"
    - cpumask: use nr_cpumask_bits for parsing functions
    - hns: avoid stack overflow with CONFIG_KASAN
    - ARM: 8643/3: arm/ptrace: Preserve previous registers for short regset write
    - target: Don't BUG_ON during NodeACL dynamic -> explicit conversion
    - target: Use correct SCSI status during EXTENDED_COPY exception
    - target: Fix early transport_generic_handle_tmr abort scenario
    - target: Fix COMPARE_AND_WRITE ref leak for non GOOD status
    - ARM: 8642/1: LPAE: catch pending imprecise abort on unmask
    - mac80211: Fix adding of mesh vendor IEs
    - netvsc: Set maximum GSO size in the right place
    - scsi: zfcp: fix use-after-free by not tracing WKA port open/close on failed
      send
    - scsi: aacraid: Fix INTx/MSI-x issue with older controllers
    - scsi: mpt3sas: disable ASPM for MPI2 controllers
    - xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
    - ALSA: seq: Fix race at creating a queue
    - ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
    - drm/i915: fix use-after-free in page_flip_completed()
    - Linux 4.4.49

  * NFS client : kernel 4.4.0-57 crash with nfsv4 enries in /etc/fstab
    (LP: #1650336)
    - SUNRPC: fix refcounting problems with auth_gss messages.

  * [0bda:0328] Card reader failed after S3 (LP: #1664809)
    - usb: hub: Wait for connection to be reestablished after port reset

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * ibmvscsis: Add SGL LIMIT (LP: #1662551)
    - ibmvscsis: Add SGL limit

  * [Hyper-V] Bug fixes for storvsc (tagged queuing, error conditions)
    (LP: #1663687)
    - scsi: storvsc: Enable tracking of queue depth
    - scsi: storvsc: Remove the ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
urraca (urraca) wrote :

Can whoever found the root cause in this case have a look at
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1466654
as well please? It sounds very much related, and is a major issue in our environment.

urraca (urraca) wrote :

N.B. that according to the changelog of the 4.4.0-70 kernel package, the patch has only been applied to -67 (thus effectively to -70)!

Now, can we assess whether the aforementioned bug will be fixed by this as well?!?

Chris Mohler (evilbob) wrote :

I'm getting this same issue in kernel 4.4.0-96 on Xenial. Any suggestions?
