NFS breaks after wake from suspend

Bug #1785788 reported by Christof Köhler
This bug affects 4 people
Affects          Status      Importance  Assigned to  Milestone
linux (CentOS)   Unknown     Unknown
linux (Debian)   New         Unknown
linux (Fedora)   Unknown     Unknown
linux (Ubuntu)   Incomplete  Medium      Unassigned

Bug Description

Hello,

I am observing all the symptoms of Debian bug 898060 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898060) on Ubuntu bionic clients (4.15.0-23-generic, x86_64) with a CentOS 7 NFS server (3.10.0-693.21.1.el7.x86_64).

After a wake from suspend, the server always logs
kernel: NFSD: client 2001:638:redacted testing state ID with incorrect client ID
which is not observed with Ubuntu 16.04 (kernel 4.4) clients. Sporadically after a wake from suspend (it takes days to weeks to trigger), the server logs are flooded with messages of the type
kernel: RPC request reserved 84 but used 276
and NFS on the client stops working. This has also not been observed with our 20 or so Ubuntu 16.04 NFS clients in the last two years.
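
To tell the always-present state ID message apart from the sporadic flood, the server log can be summarized per message type. The following is only a minimal sketch: the function name summarize_nfsd_log is hypothetical, and the regular expressions assume the messages appear exactly as quoted above.

```python
import re
from collections import Counter

# Patterns for the two server-side kernel messages quoted in this report.
# Assumption: the wording in the real log matches these quotes exactly.
STATE_ID_RE = re.compile(
    r"NFSD: client (\S+) testing state ID with incorrect client ID"
)
RPC_RE = re.compile(r"RPC request reserved (\d+) but used (\d+)")


def summarize_nfsd_log(lines):
    """Count state ID messages per client and RPC messages per (reserved, used) pair."""
    clients = Counter()
    rpc_pairs = Counter()
    for line in lines:
        match = STATE_ID_RE.search(line)
        if match:
            clients[match.group(1)] += 1
            continue
        match = RPC_RE.search(line)
        if match:
            rpc_pairs[(int(match.group(1)), int(match.group(2)))] += 1
    return clients, rpc_pairs
```

The function accepts any iterable of log lines, for example an open log file. A high count for a single (reserved, used) pair would indicate the flooding case, while a single state ID message per wake-up would match the behaviour that is always seen.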

Rebooting the client appears to stop the log flooding on the server, and NFS works normally after the reboot. Remark: systemd hangs during shutdown, presumably waiting for the NFS mount to stop, so the power button is the only remaining option.

I will upgrade to the latest bionic kernel later and try to trigger this again.

Christof Köhler (ckoe-ubuntu) wrote :
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: bionic
Christof Köhler (ckoe-ubuntu) wrote :

Hello,

I should add some clarification after reviewing my initial report.

While the Debian bug is apparently observed without the client being in a suspend state, for us the problem happened shortly after or directly after the client woke up from suspend (I cannot establish the exact timeline). I am not sure whether there is a causal connection. Also, as reported, I always observe the "... testing state ID with incorrect client ID" message after an Ubuntu 18.04 client wakes up from suspend. I am not sure whether that message is causally connected to the "RPC request reserved ..." message and to NFS becoming unusable on the client. This might be mere correlation.

Regards

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc8

tags: added: needs-bisect
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Christof Köhler (ckoe-ubuntu) wrote :

Hello,

After reading https://wiki.ubuntu.com/Kernel/MainlineBuilds and trying to check the actual contents of the repository, I currently believe that neither the nvidia driver for the Quadro P400 nor ZFS is available with the mainline kernel. Please correct me if I am wrong. Without these modules and their functionality the workstations would be useless to us. Also, when considering my reply, please take into account that this bug appears to trigger only sporadically.

So, from my perspective it is unfortunately not possible to test the mainline kernel. I will upgrade to the current kernel, which appears to be the best thing I can do. We will actively try to trigger this again by suspend/resume (which I assume is a causative factor), but only on one workstation at the beginning.

Regards

Christof Köhler (ckoe-ubuntu) wrote :

"current kernel" -> "latest bionic kernel"

drnlm (drnlmuller+bugs) wrote :

I've also just run into this bug. I hope to be able to test the upstream kernel on the client over the weekend, although the difficulty of triggering the bug on demand makes conclusive statements a bit tricky.

This looks like it could be the same bug reported at https://bugzilla.redhat.com/show_bug.cgi?id=1552037 , so it's probably worthwhile trying the workaround mentioned in comment 14 there and see if that helps.

Christof Köhler (ckoe-ubuntu) wrote :

Hello,

Good find. The headline mentions autofs, so I would probably never have found it!

We have not been able to trigger this again using just one client. I will probably increase the number of clients that suspend (over the weekends) again next week.

I checked on an Ubuntu 16.04 machine, which uses vers=4.0 according to mount. A Debian stable (Debian 9) machine apparently uses vers=4.2, as does Ubuntu 18.04. So forcing vers=4.0 is certainly worth a try, but unfortunately the problem of reliably reproducing the bug remains (not to mention fixing it under those circumstances).
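
The version check described above can be repeated across several clients by reading the vers= option from the mount table instead of inspecting the mount output by hand. This is a rough sketch only: the function name nfs_versions and the sample entry are hypothetical, and it assumes the usual /proc/mounts layout of device, mount point, file system type and options.

```python
def nfs_versions(mounts_text):
    """Return {mount_point: vers} for nfs/nfs4 entries in /proc/mounts-style text.

    The value is None when an entry carries no vers= or nfsvers= option.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        vers = None
        for option in fields[3].split(","):
            key, _, value = option.partition("=")
            if key in ("vers", "nfsvers"):
                vers = value
        result[fields[1]] = vers
    return result


# Hypothetical sample entry, for illustration only.
SAMPLE = "server:/home /home nfs4 rw,relatime,vers=4.2,rsize=1048576 0 0\n"
print(nfs_versions(SAMPLE))
```

On a client, passing the contents of /proc/mounts to the function should show whether a mount negotiated 4.2 or is pinned to 4.0 after the option has been forced.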

Changed in linux (Debian):
status: Unknown → New