NFSv4 regression in 3.4-rc1 causes Invalid Argument on chown/chgrp

Bug #1101292 reported by Bryan Quigley on 2013-01-18
This bug affects 2 people
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Dave Chiluk
Precise          Undecided    Dave Chiluk
Quantal          Undecided    Dave Chiluk
Raring           Medium       Dave Chiluk

Bug Description

   SRU Justification:

    Impact: chown/chgrp on AIX-exported NFSv4 shares fails with "Invalid argument".
    Fix: (upstream) Fix the length of the string returned from the idmapper so that it does not include the terminating null.
    Testcase: Mount an NFSv4 share. Take a tcpdump while running a chgrp or chown command.
     Check the string length of the user or group name in the SETATTR operation inside the
     tcpdump to verify that no null character is included in the string length.
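The last step of the testcase could be automated roughly as follows. This is only a sketch: it assumes the owner or group field has already been cut out of the capture as raw bytes in XDR form (a 4-byte big-endian length followed by the data and padding). The function names and the sample name groupname@example.domain are made up for illustration and do not come from the report.

```python
import struct


def decode_xdr_string(buf, offset=0):
    """Decode one XDR variable-length field: 4-byte big-endian length, then data."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("buffer is shorter than the declared length")
    return length, data


def has_trailing_nul(buf):
    """True if the declared length covers a terminating NUL byte."""
    length, data = decode_xdr_string(buf)
    return length > 0 and data[-1:] == b"\x00"


# Hypothetical 24-byte name standing in for the hidden one in the report.
name = b"groupname@example.domain"

# Expected encoding: length 24, no padding needed (24 is a multiple of 4).
good = struct.pack(">I", 24) + name
# Encoding as seen in the failing case: length 25, NUL counted, 3 pad bytes.
bad = struct.pack(">I", 25) + name + b"\x00" + b"\x00" * 3

print("good has trailing NUL:", has_trailing_nul(good))
print("bad has trailing NUL:", has_trailing_nul(bad))
```

A capture passes the testcase when the check returns False for every owner and owner_group string sent in SETATTR.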

"Our home directories are mounted via NFSv4 from a server running AIX 7.1.
After an upgrade of my client from Ubuntu 12.04 to 12.10, 'chgrp' fails when
trying to change a file or directory in my home directory:"

username@1210:~/x$ ls -l newfile
-rw------- 1 username groupname 0 Jan 3 16:56 newfile
username@1210:~/x$ chgrp groupname newfile
chgrp: changing group of `newfile': Invalid argument

groupname is his default group.

username@1210:~/x$ chown -v username newfile
chown: changing ownership of `newfile': Invalid argument
failed to change ownership of `newfile' from username to username

The issue was bisected to:
57e62324e469e092ecc6c94a7a86fe4bd6ac5172 is the first bad commit
commit 57e62324e469e092ecc6c94a7a86fe4bd6ac5172
Author: Bryan Schumaker <email address hidden>
Date: Fri Feb 24 14:14:51 2012 -0500

    NFS: Store the legacy idmapper result in the keyring

    This patch removes the old hashmap-based caching and instead uses a
    "request key actor" to place an upcall to the legacy idmapper rather
    than going through /sbin/request-key. This will only be used as a
    fallback if /etc/request-key.conf isn't configured to use nfsidmap.

    Signed-off-by: Bryan Schumaker <email address hidden>
    Signed-off-by: Trond Myklebust <email address hidden>

:040000 040000 0f16d9ec47ae5135d43213a847a87d21e0571c85 69efedae21cc967b00e36a41f934514250012930 M fs
:040000 040000 c0e64c847a273af358fc1234860c4b07c6325203 3eeac805d6bdd30a486bb2077342c1017dc0e651 M include

In addition, a difference was found in the idmapd logs with "Verbosity = 3" set.
Syslog output when running chgrp groupname:

Working case on kernel 3.3.3:
rpc.idmapd[676]: Client 5: (user) name "<email address hidden>" -> id "5194"
rpc.idmapd[676]: Client 5: (group) name "<email address hidden>" -> id "100003"
rpc.idmapd[676]: Client 5: (group) id "100003" -> name "<email address hidden>"

Failing case on kernel 3.4-rc1 (the first 3.4-rc with the bug; 3.5 would behave the same):
rpc.idmapd[2129]: Client 0: (group) id "100003" -> name "<email address hidden>"
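To compare the two logs mechanically, a small parser along these lines may help. It is a sketch that assumes the log lines have exactly the shape quoted above; the regular expression and the function name are illustrative, not part of rpc.idmapd.

```python
import re

# Matches lines such as:
#   rpc.idmapd[676]: Client 5: (group) name "..." -> id "100003"
LINE_RE = re.compile(
    r'rpc\.idmapd\[\d+\]: Client \d+: \((user|group)\) '
    r'(name|id) "([^"]*)" -> (name|id) "([^"]*)"'
)


def lookup_directions(log_text):
    """Return (kind, direction) for every idmapd lookup found in the text."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            kind, src, _src_val, dst, _dst_val = match.groups()
            found.append((kind, src + "->" + dst))
    return found


working = '''
rpc.idmapd[676]: Client 5: (user) name "<email address hidden>" -> id "5194"
rpc.idmapd[676]: Client 5: (group) name "<email address hidden>" -> id "100003"
rpc.idmapd[676]: Client 5: (group) id "100003" -> name "<email address hidden>"
'''

failing = '''
rpc.idmapd[2129]: Client 0: (group) id "100003" -> name "<email address hidden>"
'''

print("working:", lookup_directions(working))
print("failing:", lookup_directions(failing))
```

On the excerpts above this shows that only the id-to-name lookup for the group is logged in the failing case, while the working case also logs name-to-id lookups for the user and the group.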

Possibly related side issue that impeded testing:
Mainline kernels 3.5-rc7 and later are unable to mount the share: mount.nfs4: an incorrect mount option was specified
[ 612.739763] gss_create: Pseudoflavor 390004 not found!
[ 612.739771] RPC: Couldn't create auth handle (flavor 390004)
This is quite confusing to me, as 3.5 in Quantal can definitely mount the NFS share.

4. Reproduce steps
 4.1. Mount NFSv4 share on AIX 7.1 with Kerberos krb5i
 4.2. Create new file that you should be able to change ownership of
 4.3. Run chown/chgrp. Note error
 a. Actual Results: chgrp: changing group of `newfile': Invalid argument
 b. Expected Results: the file's owner/group is changed without error

5. Known Workaround: Use a kernel pre-3.4-rc1
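The chown/chgrp step of the reproduction can also be scripted. The sketch below re-applies a file's current owner and group, which mirrors the "from groupname to groupname" case in the transcripts, and reports the errno name if the call fails. The function name is made up for illustration; "Invalid argument" is the message for EINVAL, so that is the value one would expect on an affected mount.

```python
import errno
import os
import tempfile


def try_chown_to_self(path):
    """Re-apply the file's current owner and group.

    Returns None on success, or the symbolic errno name on failure.
    """
    st = os.stat(path)
    try:
        os.chown(path, st.st_uid, st.st_gid)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


# Demonstration on a local temporary file, where the call should succeed.
# To reproduce the bug, point this at a new file on the NFSv4 mount instead.
with tempfile.NamedTemporaryFile() as tmp:
    print("result:", try_chown_to_self(tmp.name))
```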

Exports on the AIX server (yes, a different format than Linux):
/cfs -vers=4,sec=krb5:krb5i:krb5p
/cfs/home -vers=4,sec=krb5:krb5i:krb5p
/cfs/share -vers=4,sec=krb5:krb5i:krb5p

Fstab on client:
cfs-nfs.domainname.com:/ /cfs nfs4 sec=krb5i 0 0

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1101292

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.8 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc4-raring/

tags: added: kernel-da-key quantal regression-release
Changed in linux (Ubuntu):
importance: Undecided → Medium
Bryan Quigley (bryanquigley) wrote :

Please see: "Possibly related side issue that impeded testing:" for more details.
3.8-rc2 was also tested and had the same "side issue" that prevented mounting of the NFS share.

tags: added: kernel-unable-to-test-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Chris J Arges (arges) on 2013-01-18
tags: added: bot-stop-nagging
Rolf Anders (rolf-anders) wrote :

I'm the original reporter of this problem and would like to add a possibly
helpful detail. I did wireshark traces and found that in the failing
case the length of the group name in the SETATTR call is given wrongly:

Opcode: SETATTR (34)
    stateid
        [StateID Hash: 0xafa9]
        seqid: 0x00000000
        Data: 000000000000000000000000
    obj_attributes
        attrmask
            recc_attr: FATTR4_OWNER_GROUP (37)
                fattr4_owner_group: <email address hidden>
                    length: 25

In the working case the length is given correctly as 24 (the actual
group name has the same length as "<email address hidden>").
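The 24 versus 25 difference can be reproduced with a short sketch of how a variable-length string is laid out in XDR (length word, data, zero padding to a multiple of four bytes). The helper name and the 24-byte sample name are assumptions for illustration; this is not the kernel's encoder.

```python
import struct


def xdr_encode_opaque(data):
    """Encode bytes as XDR variable-length opaque: length, data, pad to 4 bytes."""
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


# Hypothetical 24-byte name standing in for the hidden group name.
name = b"groupname@example.domain"

correct = xdr_encode_opaque(name)            # NUL not counted
buggy = xdr_encode_opaque(name + b"\x00")    # terminating NUL counted by mistake

print(struct.unpack_from(">I", correct)[0], len(correct))  # 24 28
print(struct.unpack_from(">I", buggy)[0], len(buggy))      # 25 32
```

With the NUL counted, the length word grows by one and the field spills into another 4-byte padding unit, so a server that validates the name against the declared length sees a 25-byte string ending in a zero byte.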

Rolf Anders (rolf-anders) wrote :

The problem with mounting was my fault: I only installed 'linux-image' and
not 'linux-image-extra' which contains rpcsec_gss_krb5.ko. That module is
needed for mounting with 'sec=krb5'.
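Before testing a mainline build, it may be worth checking that the module file is actually installed. The sketch below walks a modules directory looking for a given .ko file (optionally compressed); the function name is made up, and the /lib/modules/<release> location in the usage comment is the conventional path rather than something stated in this report.

```python
import os


def find_module_files(module, modules_root):
    """Return paths under modules_root named <module>.ko or <module>.ko.<ext>."""
    wanted = module + ".ko"
    hits = []
    for dirpath, _dirnames, filenames in os.walk(modules_root):
        for fname in filenames:
            if fname == wanted or fname.startswith(wanted + "."):
                hits.append(os.path.join(dirpath, fname))
    return sorted(hits)


# Typical use on the client (path is an assumption, adjust as needed):
#   root = os.path.join("/lib/modules", os.uname().release)
#   print(find_module_files("rpcsec_gss_krb5", root))
```

An empty list for the running kernel would point to the missing linux-image-extra package described above.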

So I was able to test 3.8.0-030800rc5-generic from
http://kernel.ubuntu.com/~kernel-ppa/mainline. Unfortunately the bug is
still there:

$ uname -r
3.8.0-030800rc5-generic
$ chgrp -v groupname newfile
chgrp: changing group of `newfile': Invalid argument
failed to change group of `newfile' from groupname to groupname
$ chown -v username newfile
chown: changing ownership of `newfile': Invalid argument
failed to change ownership of `newfile' from username to username

Dave Chiluk (chiluk) on 2013-03-08
Changed in linux (Ubuntu):
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk) wrote :

I have verified that this null-termination occurs in quantal even with a precise NFSv4 server with no security set. I have also verified that this is still the same in raring, and it appears to be maintained in upstream. I read through the RFC, and it does not appear to be incorrect with respect to the protocol (as far as I read it). Still, I sent an e-mail to the linux-nfs mailing list to have the upstream maintainers take a look at it.

Until I get a response from the upstream maintainers, I believe that it is actually allowed to null-terminate the string like this. As a result, I think the fix will have to be done in the AIX server.

Just in case Rolf Anders, can you please attach a tcpdump of a single run of chown or chgrp?

Thank you,

tags: added: kernel-bug-exists-upstream
removed: kernel-unable-to-test-upstream
tags: added: raring

On Fri, Mar 08, 2013 at 05:30:17PM -0000, Dave Chiluk wrote:
> I have verified this null-termination occurs in quantal even with a
> precise nfsv4 server with no security set. I have also verified that
> this is still the same in raring, and appears to be maintained in
> upstream. I read through the RFC, and it does not appear to be
> incorrect in reference to the protocol *(as far as I read it).

I didn't find a definitive answer for that in the RFC, but to me it makes
no sense to null-terminate a string if the length is given, especially
when the length includes the null-byte. But let's see what the experts
say.

> Just in case Rolf Anders, can you please attach a tcpdump of a single
> run of chown or chgrp?

Attached.

Thank you

Dave Chiluk (chiluk) wrote :

Well, apparently I was wrong. I got a response from the mailing list, and a solution.

A compiled kernel for quantal with the fix is available here

http://people.canonical.com/~chiluk/lp1101292/

Changed in linux (Ubuntu):
status: Confirmed → In Progress
Dave Chiluk (chiluk) wrote :

Rolf Anders, can you please test with the kernel supplied in comment #8?

If it does indeed fix the issue we will likely wait for it to get accepted upstream before including it into an Ubuntu kernel.

Thank you.

Bryan Quigley (bryanquigley) wrote :

The kernel from comment #8 does not fix the issue.

Dave Chiluk (chiluk) wrote :

Many apologies. I just checked, and somehow that build did not pick up the change. Here's an updated kernel with the fix.

http://people.canonical.com/~chiluk/lp1101292/

Dave.

Bryan Quigley (bryanquigley) wrote :

The new build fixes the issue.

Dave Chiluk (chiluk) wrote :

The fix has been committed to the mainline kernel tree via commit cf4ab538f1516606d3ae730dce15d6f33d96b7e1.

Dave Chiluk (chiluk) on 2013-04-03
description: updated
Tim Gardner (timg-tpi) on 2013-04-04
Changed in linux (Ubuntu Precise):
assignee: nobody → Dave Chiluk (chiluk)
status: New → In Progress
Changed in linux (Ubuntu Quantal):
assignee: nobody → Dave Chiluk (chiluk)
status: New → In Progress
Changed in linux (Ubuntu Raring):
status: In Progress → Fix Committed
Dave Chiluk (chiluk) on 2013-04-10
Changed in linux (Ubuntu Precise):
status: In Progress → Invalid
Changed in linux (Ubuntu Quantal):
status: In Progress → Fix Released
Dave Chiluk (chiluk) wrote :

This has been included in
quantal via linux-image 3.5.0-28.47
raring via linux-image 3.8.0-17.27

Changed in linux (Ubuntu Raring):
status: Fix Committed → Fix Released
tags: added: verification-done

I am having this problem in precise (12.04.2) with the 3.5.0-26-generic kernel. I am connecting to netapp with DoT 8.1.
In netapp the error pops up with the following message:
Filer has received domain domaid from client xxx.xxx.xxx.xxx string which does not match value filer domainid

Dave Chiluk (chiluk) wrote :

It looks like you are using the quantal backports kernel in precise. You will have to wait for 3.5.0-28.47 to be pushed back to precise. If I recall correctly, that usually happens after the kernel moves to updates for quantal, which should be another few weeks. You can always download the packages manually and install them.

As for the other messages, I would suggest opening a bug with NetApp, or opening a new Launchpad bug by running ubuntu-bug on your affected machine.

Steve Conklin (sconklin) on 2013-04-15
tags: added: verification-needed-quantal
Dave Chiluk (chiluk) on 2013-04-15
tags: added: verification-done-quantal
removed: verification-needed-quantal

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
