nfs_getattr starvation with heavy NFS write activity

Bug #420508 reported by Brent Nelson
This bug affects 1 person
Affects                      Status        Importance  Assigned to  Milestone
linux-ports-meta (Fedora)    Fix Released  Critical
linux-ports-meta (Ubuntu)    Triaged       Undecided   Unassigned

Bug Description

Binary package hint: linux-image

This is a known bug in kernels prior to 2.6.25 (not sure when it was introduced). If you have a long-running write (such as a dd) to an NFS mount, an "ls -l" on that mount won't complete until the write finishes. If you are copying a file that takes 20 minutes, a simple ls -l will also take 20 minutes. A plain "\ls" (ls without options, bypassing any alias) works fine, since it does not stat each file.
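The distinction between the two commands can be sketched in Python (a hypothetical illustration; the directory and file names are arbitrary). Plain ls needs only the directory entries, which is a READDIR on NFS; "ls -l" must additionally stat every entry, and it is that per-file stat (a GETATTR on NFS) that stalls behind the in-flight write.

```python
import os
import tempfile

# Plain "\ls" needs only the directory entries (READDIR on NFS);
# "ls -l" must additionally stat() every entry, and it is the per-file
# stat (GETATTR on NFS) that the buggy client could not complete while
# a writer kept streaming data to one of the files.
d = tempfile.mkdtemp()
for name in ("a", "b"):
    open(os.path.join(d, name), "w").close()

entries = sorted(os.listdir(d))                   # all that plain ls requires
sizes = {n: os.stat(os.path.join(d, n)).st_size   # the extra work "ls -l" does
         for n in entries}
```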

This was fixed in a really tiny patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28c494c5c8d425e15b7b82571e4df6d6bc34594d

Revision history for this message
Flavio (flavio-redhat-bugs) wrote:

Created attachment 322418
suggested patch

POSIX requires that ctime and mtime, as reported by the stat(2) call,
reflect the activity of the most recent write(2). To that end, nfs_getattr()
flushes pending dirty writes to a file before doing a GETATTR to allow the
NFS server to set the file's size, ctime, and mtime properly.

However, nfs_getattr() can be starved when a constant stream of application
writes to a file prevents nfs_sync_inode_wait() from completing. This usually
results in hangs of programs doing a stat against an NFS file that is being
written. "ls -l" is a common victim of this behavior.

[root@node3 ~]# uname -a
Linux node3 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:12 EDT 2008 i686 i686 i386 GNU/Linux

[root@node3 ~]# ls -lh /tmp/Test1G
-rw-r--r-- 1 root root 1000M Oct 24 12:37 /tmp/Test1G

(from /proc/mounts):
cluster1:/usr/local /usr/local nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=cluster1

[root@node3 tmp]# ls /usr/local/tmp
locking
[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 25030
[root@node3 tmp]# time ls -lh /usr/local/tmp

real 2m13.872s
user 0m0.230s
sys 0m7.897s
total 1001M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 1000M Oct 28 09:39 Test1G
[1]+ Done time cp -i Test1G /usr/local/tmp

real 2m2.414s
user 0m0.230s
sys 0m8.062s

To prevent starvation, hold the file's i_mutex in nfs_getattr() to freeze application writes temporarily so the client can more quickly obtain clean values for the file's size, mtime, and ctime.

Below is the upstream patch fixing this issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28c494c5c8d425e15b7b82571e4df6d6bc34594d

Another interesting patch to be applied:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=634707388baa440d9c9d082cfc4c950500c8952b

They were backported to -92.el5; see the feedback below:
----
After booting the test kernel with that change, I was able to slip a few "ls" commands in during the write:

[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 10142
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 68M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 68M Nov 3 10:44 Test1G

real 0m6.090s
user 0m0.001s
sys 0m0.054s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 114M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 114M Nov 3 10:44 Test1G

real 0m5.185s
user 0m0.005s
sys 0m0.016s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 148M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 148M Nov 3 10:44 Test1G

real 0m3.077s
user 0m0.000s
sys 0m0.030s
[root@node3 tmp]#
real 2m26.065s
user 0m0.211s
sys 0m8.042s

A bonus is that the impact on the write performance seemed to be minimal.
---

Attaching backported patch for RHEL-5 -92.el5

Revision history for this message
Issue (issue-redhat-bugs) wrote:

I'm not an NFS expert (nor do I play one on TV).. but ..
Would it be possible to simply send the 1st (and only the 1st) dirty page
(if at least one dirty page was present) for a given NFS inode any time a
local stat is made on an NFS client? Wouldn't a one page update force
mtime on the host?

This event sent from IssueTracker by jwest
 issue 234012

Revision history for this message
Jeff (jeff-redhat-bugs) wrote:

Sending one page would make it look close to correct for that one stat call. Then what happens?

The rest of the pages eventually get flushed. You do another stat call and...WTF? The file has changed? Who wrote to it?

The bottom line is that on a getattr we need to sync the state of the file out to the server and that means flushing all pending writes.

(Peter, correct me if I'm wrong here)
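The ordering described above (sync the file's state out to the server, then fetch attributes) can be mimicked from userspace. A minimal Python sketch, with fsync standing in for the kernel's internal dirty-page flush before the GETATTR RPC; the helper name is hypothetical:

```python
import os
import tempfile

def flush_then_stat(fd, path):
    """Userspace analogue of the NFS client's getattr path: flush all
    pending writes first, then read attributes, so that size/mtime/ctime
    reflect every write issued so far. (fsync stands in here for the
    kernel's internal dirty-page flush before the GETATTR RPC.)"""
    os.fsync(fd)            # push dirty data out before asking for attributes
    return os.stat(path)

fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
st = flush_then_stat(fd, path)
os.close(fd)
os.remove(path)
```

The starvation in this bug is exactly the failure mode of that first step: with a writer continuously dirtying pages, the flush never drained, so the stat never returned.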

Revision history for this message
Peter (peter-redhat-bugs) wrote:

This is correct, Jeff.

Perhaps I oversimplified when describing the semantics that are
required when handling the mtime.

Revision history for this message
Jeff (jeff-redhat-bugs) wrote:

*** Bug 464251 has been marked as a duplicate of this bug. ***

Revision history for this message
Steve (steve-redhat-bugs) wrote:

*** Bug 467374 has been marked as a duplicate of this bug. ***

Revision history for this message
Jeremy (jeremy-redhat-bugs) wrote:

Peter,

How goes the alternative patch/solution? IBM is willing to help test the next solution we come up with.

--jwest

Revision history for this message
Issue (issue-redhat-bugs) wrote:

The customer is providing access to this NFS mount to multiple clients via
ftp. As such, they have a large mix of ftp related directory read
activity combined with the potential for large file writes. One remote
client doing a large upload effectively stalls the access for other
clients.

We have already set expectations that they will still see performance
issues if a large write is happening. They have migrated from a RHEL4
environment (where they do not see this severe a stall), and would be
happy to simply get the responsiveness closer to what it was in RHEL4.
Our testing showed that the previous proposed patch accomplished this
goal.

This event sent from IssueTracker by jwest
 issue 234012

Revision history for this message
Peter (peter-redhat-bugs) wrote:

Thank you for the description of the environment and the situation. This helps me understand it much better.

Yes, the previously proposed patch would make things generally better.
There might still be very large delays, but they would mostly be finite
in nature.

If I can't get my patch accepted upstream in a reasonable amount of time,
then we will go with the proposed patch for the time being and then I
will continue to pursue the better option.

Revision history for this message
RHEL (rhel-redhat-bugs) wrote:

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.

Revision history for this message
RHEL (rhel-redhat-bugs) wrote:

Updating PM score.

Revision history for this message
Don (don-redhat-bugs) wrote:

in kernel-2.6.18-132.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.

Revision history for this message
Issue (issue-redhat-bugs) wrote:

The x86_64 version of the 2.6.18-132 has not made any difference in
behavior:

Before:

# time cp /tmp/bigfile /usr/local/tmp

real 1m44.479s
user 0m0.015s
sys 0m0.246s

# time ls /usr/local/tmp
a b bigfile c d

real 1m38.688s
user 0m0.000s
sys 0m0.009s

After:

# time cp /tmp/bigfile /usr/local/tmp

real 1m46.442s
user 0m0.017s
sys 0m0.222s

# time ls /usr/local/tmp
a b bigfile c d

real 1m44.069s
user 0m0.000s
sys 0m0.012s

(Note: the ls was manually run in a second window once the cp was started,
so a 1-2 second variation in test times would be expected)

This event sent from IssueTracker by jwest
 issue 234012

Revision history for this message
Harshula (harshula-redhat-bugs) wrote:

Release note added. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.
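The stat, write, stat requirement described in the note can be checked on any local filesystem with a short Python sketch (the sleep only guards against coarse mtime resolution; over NFS, the client must first flush its dirty pages to the server for the second stat to show the new mtime, which is where the latency comes from):

```python
import os
import tempfile
import time

# POSIX: a stat issued after a write must show the effects of that
# write (size, mtime). On a local filesystem this holds immediately;
# on NFS the client must first flush dirty pages to the server, which
# is why stat latency rises under a heavy write load.
fd, path = tempfile.mkstemp()
os.write(fd, b"first")
st1 = os.stat(path)

time.sleep(1.1)              # some filesystems keep mtime at 1 s resolution
os.write(fd, b" second")
st2 = os.stat(path)
os.close(fd)
os.remove(path)
```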

Revision history for this message
Peter (peter-redhat-bugs) wrote:

This looks fine, except for the very last clause. Please remove everything
after the last comma in the paragraph. The latencies can easily be much
higher than that.

Revision history for this message
Harshula (harshula-redhat-bugs) wrote:

Removed ", sometimes up to 30 seconds".

Revision history for this message
Harshula (harshula-redhat-bugs) wrote:

Release note updated. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.
+The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency.

Revision history for this message
Peter (peter-redhat-bugs) wrote:

This looks fine to me.

Revision history for this message
Brent Nelson (brent-phys-deactivatedaccount) wrote:
Revision history for this message
errata-xmlrpc (errata-xmlrpc-redhat-bugs) wrote:

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Jorge Castro (jorge)
Changed in linux-ports-meta (Ubuntu):
status: New → Triaged
Revision history for this message
Jeremy (jeremy-redhat-bugs) wrote:

*** Bug 491121 has been marked as a duplicate of this bug. ***

Changed in linux-ports-meta (Fedora):
status: Unknown → Fix Released
Revision history for this message
Harshula (harshula-redhat-bugs) wrote:

(In reply to comment #37)

The original author of comment #37 subsequently made these observations in IssueTracker:

"I just executed an additional test with a much larger file, and did see that the behavior was very dependent upon how long I let the write go before trying the ls. I suspect that I had delayed long enough in my previous test to allow the system to buffer 100% of the file before I issued the ls request."

"The customer fell into the same trap that I did on their initial test (too small of a file used for the test resulted in the entire file being buffered before the ls command == no change in behavior as the new code still waits for the buffered data to be flushed). Using a larger file did indeed demonstrate a positive change in behavior."

Changed in linux-ports-meta (Fedora):
importance: Unknown → Critical