NFS client does not submit "nfs_file_sync" write requests when the file open call includes O_SYNC.

Bug #709392 reported by Joseph Salisbury
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Linux            Fix Released   Medium
linux (Ubuntu)                  High         Unassigned
Lucid                           High         Unassigned

Bug Description

The NFS Client is ignoring O_SYNC when opening a file.

This bug is causing unstable writes in the application environment even
when the mount option "sync" is set on both sides (NFS client and server).

We have a test case that shows that the Ubuntu NFS client does
not submit "nfs_file_sync" write requests when the file open call
includes O_SYNC.

I have strace and tcpdump data showing this issue. The strace shows that
the file is opened with O_SYNC. The tcpdump shows "unstable writes"
which should not occur with either "sync" mount option or with forced
sync by using the open file system call with O_SYNC. Comparing the
tcpdump data between the Ubuntu NFS client and a RHEL NFS client shows
that the Ubuntu NFS client does not issue the "nfs_file_sync" write
requests, while the RHEL NFS client does.

This issue does not appear to be tied to a specific model of storage
array or a specific file system. Tests have been performed against an NFS
server having access to a GPFS file system where a SAN attached IBM
DS5300 is the underlying storage array. But the behaviour is exactly
the same when they export a local ext3 file system from the local scsi
disk.

Robbie Williamson (robbiew) wrote :

Which version of RHEL? Or, more specifically, which version of the NFS client packages in that RHEL release?

Changed in nfs-utils (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Server Team (canonical-server)
Changed in nfs-utils (Ubuntu Lucid):
importance: Undecided → High
status: New → Triaged
assignee: nobody → Canonical Server Team (canonical-server)
Joseph Salisbury (jsalisbury) wrote :

Information on the RPM fileset levels of Client and server:

RHEL 5.4 NFS CLIENT:
[root@fscc-hs21-6 mnt]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
[root@fscc-hs21-6 mnt]# cat /etc/redhat-release
[root@fscc-hs21-6 mnt]# rpm -qa | grep nfs
nfs-utils-1.0.9-42.el5
nfs-utils-lib-1.0.8-7.6.el5
[root@fscc-hs21-6 mnt]# rpm -qa | grep kernel
kernel-doc-2.6.18-164.el5
kernel-2.6.18-164.el5
kernel-headers-2.6.18-164.el5
kernel-devel-2.6.18-164.el5
[root@fscc-hs21-6 mnt]#

RHEL 5.5 NFS SERVER:
[root@h02n003mz ~]# rpm -qa | grep nfs
nfs-utils-lib-1.0.8-7.6.el5
nfs-utils-1.0.9-47.el5_5
[root@h02n003mz ~]# rpm -qa | grep kernel
kernel-2.6.18-164.el5
kernel-2.6.18-194.el5
kernel-2.6.18-194.8.1.el5
kernel-headers-2.6.18-194.8.1.el5
kernel-2.6.18-164.11.1.el5
kernel-devel-2.6.18-194.el5
kernel-devel-2.6.18-164.11.1.el5
kernel-devel-2.6.18-164.el5
kernel-devel-2.6.18-194.8.1.el5
kernel-doc-2.6.18-194.8.1.el5

Robbie Williamson (robbiew) wrote :

FYI, in lucid-updates we have: 1.2.0-4ubuntu4.1, so it "could" be an issue of a bug being introduced in a later version.

Robbie Williamson (robbiew) wrote :

The more I dig into this, the more I suspect it is kernel related. Basically, an open() is done on an NFS-mounted file with O_SYNC, followed by a write(). On the RHEL machine, the NFS write is done correctly, with FILE_SYNC sent to the server. On the Ubuntu client, the NFS write is sent UNSTABLE, i.e. the client doesn't wait for notification from the server that the data was synced.
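
For anyone who wants to reproduce the open()/write() sequence described above, here is a minimal sketch in Python. It is not the reporter's actual test case. The function name, the 1 KiB block size, and the block count are assumptions chosen for illustration, and the target path would need to point at a file on the NFS mount while a tcpdump capture is running.

```python
import os

BLOCK_SIZE = 1024  # illustrative block size, not taken from the original test case


def write_with_o_sync(path, count=5, block_size=BLOCK_SIZE):
    """Open `path` with O_SYNC and write `count` zero-filled blocks.

    Returns the total number of bytes written.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o666)
    try:
        block = bytes(block_size)
        total = 0
        for _ in range(count):
            # With O_SYNC, each write() is expected to return only after the
            # data has reached stable storage (for NFS, on the server side).
            written = 0
            while written < len(block):
                # os.write() may write fewer bytes than requested, so loop.
                written += os.write(fd, block[written:])
            total += written
        return total
    finally:
        os.close(fd)
```

Calling `write_with_o_sync` with a path on the NFS mount and then inspecting the capture in wireshark should show whether the WRITE requests carry FILE_SYNC or UNSTABLE.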

Dustin Kirkland  (kirkland) wrote :

Robbie,

Yes, I agree that this smacks of a kernel issue. We should probably ask for some help from our kernel team, perhaps Stefan or John.

Chuck Short (zulcss) wrote :

Also, are they using NFS v3 or NFS v4?

Joseph Salisbury (jsalisbury) wrote :

This is NFS V3

Joseph Salisbury (jsalisbury) wrote :

We also have tcpdump/wireshark data available that shows the issue in detail. We prefer not to post that data to the bug, since it includes sensitive data.

Robbie Williamson (robbiew) wrote :

Surbhi will look into this and see if it's kernel related. She *might* be able to fix it, if it's simple, otherwise we'll need assistance from the kernel team.

Changed in nfs-utils (Ubuntu):
assignee: Canonical Server Team (canonical-server) → Surbhi Palande (csurbhi)
Changed in nfs-utils (Ubuntu Lucid):
assignee: Canonical Server Team (canonical-server) → Surbhi Palande (csurbhi)
Stefan Bader (smb) wrote :

As far as I can see, this really is an issue of the NFS v3/v4 client (filesystem driver in the kernel). I do not find any code that would change the new UNSTABLE/COMMIT behavior (which is the new default since NFSv3) back to the FILE_SYNC mode that was the default in NFSv2.

However, reading documents about O_SYNC, I would also think that this should be switching modes. For the test case of writing out a single block, I could see the old mode turned on when using O_DIRECT as well. But looking at the code, this is related to the small number of bytes written rather than to any check of the O_SYNC flag.

So this bug report should be targeted against the kernel package (the reason I do not do it right now is that I am not sure whether we should keep the nfs-utils reference but mark it invalid or replace it by the kernel package).

As this needs to get fixed upstream, I will open a bugzilla there and later link it to this bug report.

Surbhi Palande (csurbhi)
Changed in nfs-utils (Ubuntu):
assignee: Surbhi Palande (csurbhi) → nobody
Changed in nfs-utils (Ubuntu Lucid):
assignee: Surbhi Palande (csurbhi) → nobody
Surbhi Palande (csurbhi) wrote :

In the current upstream code, I think that instead of the WRITE operation where arg.stable=FILE_SYNC, a COMMIT operation occurs so that the server commits the requested data to the stable storage. I think that this should be good enough to simulate the FILE_SYNC behavior as the requested data will get committed to the stable storage as needed (both by fsync() and by the case when the file is opened with a sync flag)

In case of Lucid, instead of a COMMIT a WRITE operation occurs. With the Lucid code, the trouble is that the arg "stable" does not get assigned to "FILE_SYNC" during a WRITE operation as it should when the file is opened with the sync flag (and also when a fsync is requested on the file)

(Just writing my thoughts from what little I got to research)

Joseph Salisbury (jsalisbury) wrote :

@Stefan, I was able to confirm the workaround. I modified my dd to use sync and direct (O_SYNC and O_DIRECT):

strace -o /tmp/strace.dd.joe.out dd if=/dev/zero of=/nfs_mount/syncfile bs=1k count=5 oflag=sync,direct

I confirmed the file was opened with both of those flags (from strace):
open("/nfs_mount/syncfile", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_DIRECT, 0666) = 3

I captured a tcpdump and reviewed the trace in wireshark. I can confirm that there are FILE_SYNC writes instead of UNSTABLE.

Surbhi Palande (csurbhi) wrote :

@Joseph, the semantics of O_DIRECT are different from those of O_SYNC.
In the case of NFS, O_DIRECT *bypasses the page cache on the client* entirely. The NFS protocol does not support passing this flag to the server. Bypassing the client-side cache could have other side effects, such as not being able to optimize the RPC requests. O_DIRECT on its own does not guarantee that your data is written to the backing store, as O_SYNC does.

O_SYNC, on the other hand, says that whatever data is requested for a write is synced to the backing store before the write operation returns. It's a synchronous write rather than an asynchronous one.

A user may want to use O_SYNC but not O_DIRECT. Coupling the two together would give you the O_SYNC effect of putting data on the disk, but would have an unwanted effect on the client side cache and could most probably degrade performance.

Stefan Bader (smb) wrote :

What I tried to say before with my test is just that for the test case, the O_DIRECT flag results in the fallback behavior which seems to be described in section 5.9 of http://www.faqs.org/docs/Linux-HOWTO/NFS-HOWTO.html#MOUNTOPTIONS (which is that the use of O_SYNC, when the export is sync should result in all requests being done with the NFS_FILE_SYNC flag set in the request and reply). This removes the need of the COMMIT call as the reply to a WRITE with NFS_FILE_SYNC only returns when the data is written to disk on the server.

What we see on both Lucid and Natty is that, regardless of the O_SYNC file flag used on open, it keeps using the newer style WRITE with UNSTABLE set plus COMMIT. I extended the testcase to write 100 blocks and checked the tcpdump for that. Both clients produce a commit for every write request sent even without fsync calls. So from a data integrity point of view there is no difference between the old mode and the new mode.

So in the end this is a matter of documentation and to a degree performance. Should a client be able to influence the method used for the transfer? Also the way chosen for now requires more network traffic. Doing a write and commit for all the writes just seems a waste of network bandwidth (as the same could be done with just a write with the file sync flag set).

Joseph Salisbury (jsalisbury) wrote :

@Stefan, so even though we see UNSTABLE writes, the data is synced before the write operation returns?

Stefan Bader (smb) wrote :

Hi Joseph, it depends on how you define write operation. If it is the dd, then yes. For the single write requests that you see in the tcpdump, no. Those are the v3 semantics: with an UNSTABLE write, the server may reply before the data is actually written to disk. But the COMMIT request that follows each write will only complete (commit reply) when all previous writes are on disk (which, in the dumps I see, can only be one). When the client sends the write request with FILE_SYNC, the reply to the write request itself is delayed until the data is on the server's disk. So doing it with unstable writes and a commit for each of them behaves the same way from the application's point of view. But you send twice as many TCP packets.
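
The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel or server code: `ToyServer` and the two `send_*` helpers are hypothetical names, and the model counts RPC requests rather than actual TCP packets. The `StableHow` values are the `stable_how` constants from the NFSv3 specification (RFC 1813).

```python
from enum import IntEnum


class StableHow(IntEnum):
    """NFSv3 stable_how values as defined in RFC 1813."""
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


class ToyServer:
    """Toy NFSv3 server model: counts requests and tracks what is on disk."""

    def __init__(self):
        self.cache = []    # data acknowledged but not yet on stable storage
        self.disk = []     # data on stable storage
        self.requests = 0  # number of RPC requests received

    def write(self, data, stable):
        self.requests += 1
        if stable == StableHow.UNSTABLE:
            # The reply may be sent before the data is on disk.
            self.cache.append(data)
        else:
            # Simplification: DATA_SYNC and FILE_SYNC are treated alike here;
            # the reply is delayed until the data is on disk.
            self.disk.append(data)

    def commit(self):
        self.requests += 1
        # The COMMIT reply is sent only once earlier unstable writes are on disk.
        self.disk.extend(self.cache)
        self.cache.clear()


def send_file_sync(server, blocks):
    """Old-style behavior: one WRITE with FILE_SYNC per block."""
    for block in blocks:
        server.write(block, StableHow.FILE_SYNC)


def send_unstable_with_commit(server, blocks):
    """Behavior seen in the dumps: UNSTABLE WRITE followed by a COMMIT."""
    for block in blocks:
        server.write(block, StableHow.UNSTABLE)
        server.commit()
```

Running both helpers over the same 100 blocks leaves identical data on the toy server's disk, but the second helper issues 200 requests instead of 100, which is the bandwidth cost discussed in this thread.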

Surbhi Palande (csurbhi) wrote :

I did a test to write 100 blocks using dd with and without oflag="sync" and saw the same result in wireshark, namely of getting a WRITE-COMMIT sequence. However it seems to me that this sequence is because of the nfs writeback code rather than vfs_write(). In Lucid, it appears that vfs_write() does not explicitly wait on a COMMIT to finish nor does it send stable=FILE_SYNC for a WRITE operation. So, I think that the expected synchronous behavior of O_SYNC flag or sync mounts is not seen in Lucid.

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Surbhi Palande (csurbhi) wrote :

So sorry about the previous comment.. I confirmed that both in Lucid and in upstream, WRITE is followed by a COMMIT in *vfs_write()*. This does mean that the data is ultimately put on the backing store and that vfs_write waits for that to happen before it returns back (iff the O_SYNC flag is set - which is the expected behavior)

Changed in linux:
status: Confirmed → Fix Released
Steve Langasek (vorlon) wrote :

The upstream kernel task has been marked as resolved; I think that means this bug should be assigned to the linux package for fixing in Lucid.

affects: nfs-utils (Ubuntu) → linux (Ubuntu)
Christopher M. Peñalver (penalvch) wrote :

Closing only the linux (Ubuntu) task as the noted upstream fix is available in Precise+.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Rolf Leggewie (r0lf) wrote :

Lucid has reached the end of its life and is no longer receiving any updates. Marking the Lucid task for this ticket as "Won't Fix".

Changed in linux (Ubuntu Lucid):
status: Triaged → Won't Fix