NFSv4 CLOSE/LOCK-operation needs timing improvement for AIX compat

Bug #1167420 reported by Bryan Quigley
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Dave Chiluk

Bug Description

We're mounting our users' home directories via NFSv4 from a server
running AIX 7.1. When a process on an Ubuntu 12.04 client locks a file
accessed via NFSv4 using flock() and afterwards closes it, the close()
sometimes blocks for 15 seconds.

The cause of the problem turned out to be a race condition between the
NFSv4 operations CLOSE and RELEASE_LOCKOWNER. The client sends CLOSE
immediately after RELEASE_LOCKOWNER, without waiting for the reply
to RELEASE_LOCKOWNER (which is completely fine to do per the NFS RFC,
but not what AIX was expecting). Sometimes the server tries to process
the CLOSE before RELEASE_LOCKOWNER has finished. In that case
it replies to the CLOSE with NFS4ERR_DELAY, which causes the client to
retry the CLOSE after 15 seconds.

Instead of freezing for 15 seconds, Ubuntu should retry sooner and then
increase the timeout exponentially, up to 15 seconds. This should make
the client more responsive in other cases, while also working around
the implementation issues in AIX 7.1.
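
The retry schedule proposed above could look roughly like the sketch below. This is not the kernel patch. The function name and constants are hypothetical, the 100 ms starting value is borrowed from the test kernel described later in this report, and doubling on each retry is an assumption about what "exponentially" would mean here.

```c
/* Hypothetical constants, not taken from the kernel source. */
#define RETRY_DELAY_MIN_MS 100u    /* first retry after 0.1 s (assumed) */
#define RETRY_DELAY_MAX_MS 15000u  /* cap at the current fixed 15 s delay */

/*
 * Return the delay to wait before the next retry after NFS4ERR_DELAY.
 * Pass 0 as prev_ms for the first retry; afterwards pass the value
 * returned by the previous call. The delay doubles until it hits the cap.
 */
static unsigned int next_retry_delay_ms(unsigned int prev_ms)
{
    if (prev_ms == 0)
        return RETRY_DELAY_MIN_MS;
    if (prev_ms >= RETRY_DELAY_MAX_MS / 2)
        return RETRY_DELAY_MAX_MS;
    return prev_ms * 2;
}
```

With these assumed values the delays would be 100, 200, 400, 800, 1600, 3200, 6400 and 12800 ms, then 15000 ms for every retry after that.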

OS: Ubuntu 12.04 / AIX 7.1. Not fixed in upstream Linux releases.

Steps to Reproduce:
  gcc -o open-close open-close.c
  touch testfile # create file to be used by open-close
  for i in `seq 100`; do ./open-close; done

The 15-second delay in close() occurs in about 10% of runs.
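
The source of open-close.c is not included in this report. Based on the description (open a file, lock it with flock(), then close it), its core might look something like the sketch below. The helper name `timed_flock_close` and the timing code are assumptions added for illustration; the original program may differ.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Open an existing file, take an exclusive flock() on it, then close it.
 * Returns the number of seconds spent inside close(), or -1.0 on error.
 * (Hypothetical helper, not the reporter's original code.)
 */
static double timed_flock_close(const char *path)
{
    struct timeval start, end;
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        perror("open");
        return -1.0;
    }
    if (flock(fd, LOCK_EX) != 0) {
        perror("flock");
        close(fd);
        return -1.0;
    }

    /* close() drops the lock; per this report, this is the call that stalls. */
    gettimeofday(&start, NULL);
    if (close(fd) != 0) {
        perror("close");
        return -1.0;
    }
    gettimeofday(&end, NULL);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1e6;
}
```

A `main` that calls `timed_flock_close("testfile")` and prints the result would fit the gcc and loop commands above. On an affected mount, a stalled run should report close to 15 seconds.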

Relevant NFS mailing list discussion: https://www.ietf.org/mail-archive/web/nfsv4/current/msg11720.html

Tags: precise raring
Bryan Quigley (bryanquigley) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1167420

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
Dave Chiluk (chiluk)
Changed in linux (Ubuntu):
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk) wrote :

Anyone affected by this issue, please attempt to reproduce it using the kernel available below, and report back here.
http://people.canonical.com/~chiluk/lp1167420/

The patch included with this kernel should reduce the delay before the first retry from 15 seconds to 0.1 seconds.

tags: added: precise raring
Bryan Quigley (bryanquigley) wrote :

A tunable can be set in AIX 7.1 that forces better synchronization, which avoids this issue.

Changed in linux (Ubuntu):
status: Incomplete → Invalid