NetApp NFS driver mounts fail intermittently
Bug #1395823 reported by Tom Barron
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Cinder | Fix Released | Undecided | Tom Barron | 2015.1.0
Bug Description
Andrew Kerr reports:
Ever since the NFS security patch merged, the NFS drivers will occasionally (~30-50% of the time) fail during initialization with the following error:
'nfs': u"Unexpected error while running command.
Command: sudo cinder-rootwrap /etc/cinder/
Exit code: 32
Stdout: u''
Stderr: u'mount.nfs: /opt/stack/
I uploaded a patch to revert the NFS security patch and it has not hit this problem after ~11
Changed in cinder:
assignee: nobody → Tom Barron (tpb)
status: New → In Progress

Changed in cinder:
milestone: none → kilo-1
status: Fix Committed → Fix Released

Changed in cinder:
milestone: kilo-1 → 2015.1.0
We can reproduce this issue by restarting the cinder-volume and cinder-backup processes concurrently.
At startup, both processes use the brick remotefs code to ensure that the shares for NFS volumes are mounted. In the RemoteFsClient mount() method, a simple 'mount' command is first run to check whether a share is already mounted; this is followed (in sub-methods) by an NFSv4.1 mount attempt and then an NFSv3 mount attempt. In our CI system, where the backend filers are not currently configured for NFSv4, we expect the v4 mount to fail, but the NFSv3 mount is also failing about 1 in 5 times with a 'busy or in use' message. A simplified sketch of this check-then-mount flow is shown below.
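For illustration only, here is a minimal sketch of the check-then-mount flow described above. The helper names (_read_mounts, mount_nfs_share) and the exact mount options are assumptions for this sketch and do not reproduce the actual cinder brick remotefs code.

import subprocess


def _read_mounts():
    """Return the set of mount points reported by the 'mount' command."""
    out = subprocess.check_output(['mount']).decode()
    # Each line looks like: "<device> on <mount point> type <fstype> (<options>)"
    return {line.split()[2] for line in out.splitlines() if ' on ' in line}


def mount_nfs_share(share, mount_point):
    """Check-then-mount: skip if already mounted, else try NFSv4.1, then NFSv3."""
    if mount_point in _read_mounts():   # check: is the share already mounted?
        return

    for opts in (['-o', 'vers=4,minorversion=1'], ['-o', 'vers=3']):
        try:
            subprocess.check_call(
                ['sudo', 'mount', '-t', 'nfs'] + opts + [share, mount_point])
            return
        except subprocess.CalledProcessError:
            continue   # e.g. the filer refuses NFSv4.1; fall back to the next version
    raise RuntimeError('could not mount %s on %s' % (share, mount_point))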
There is a window between the initial 'mount' command that checks for already-mounted shares and the subsequent mount commands, during which the cinder-backup and cinder-volume processes can race with one another: both see the share as unmounted, both attempt to mount it, and the loser fails. In a CI system, where the node has just been set up by devstack and the two processes are started concurrently, we are especially likely to hit the issue. One way such a race could be tolerated is sketched below.
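Continuing the sketch above, and again purely as an illustration rather than the actual fix that was merged: if the mount command fails but a re-check shows the share is now mounted, the other process evidently won the race and the failure can be treated as benign.

def mount_nfs_share_race_tolerant(share, mount_point):
    """Like mount_nfs_share() above, but tolerant of a concurrent mounter."""
    if mount_point in _read_mounts():
        return
    for opts in (['-o', 'vers=4,minorversion=1'], ['-o', 'vers=3']):
        try:
            subprocess.check_call(
                ['sudo', 'mount', '-t', 'nfs'] + opts + [share, mount_point])
            return
        except subprocess.CalledProcessError:
            # If the mount failed because the other process mounted the share
            # between our check and our mount attempt, treat it as success.
            if mount_point in _read_mounts():
                return
    raise RuntimeError('could not mount %s on %s' % (share, mount_point))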
This issue seems to have "always" been there, but with the NFS security patches the first ensure_mounted() call occurs in a new context: security options are checked, and if that check cannot be made, the backend driver is not loaded. Previously the driver would be loaded anyway, and if the other process succeeded with the mount there was no harm done; the silent failure really didn't matter.