NetApp NFS driver mounts fail intermittently

Bug #1395823 reported by Tom Barron
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
Cinder    Fix Released   Undecided    Tom Barron

Bug Description

Andrew Kerr reports:

Ever since the NFS security patch merged, the NFS drivers will occasionally (~30-50% of the time) fail during initialization with the following error:

'nfs': u"Unexpected error while running command.
Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf mount -t nfs 10.250.119.63:/vol_ed760df0 /opt/stack/data/cinder/mnt/756a10f5495cbddeafe5b25c5012b9a6
Exit code: 32
Stdout: u''
Stderr: u'mount.nfs: /opt/stack/data/cinder/mnt/756a10f5495cbddeafe5b25c5012b9a6 is busy or already mounted

I uploaded a patch to revert the NFS security patch and it has not hit this problem after ~11

Tom Barron (tpb)
Changed in cinder:
assignee: nobody → Tom Barron (tpb)
status: New → In Progress
Revision history for this message
Tom Barron (tpb) wrote :

We can reproduce this issue by restarting the cinder-volume and cinder-backup processes concurrently.

At startup, both processes use the brick remotefs code to ensure that the shares are mounted for NFS volumes. In the remotefsclient mount() method, a plain 'mount' command is run to check whether a share is already mounted, followed (in sub-methods) by an NFSv4.1 mount attempt and then an NFSv3 mount attempt. In our CI system, where the backend filers are not currently configured for NFSv4, we expect the v4 mount to fail, but the NFSv3 mount is also failing about 1 in 5 times with a 'busy or already mounted' message.

There is a window of opportunity between the initial 'mount' command that checks for already mounted shares and the subsequent mount commands in which the cinder backup and cinder volume processes can race with one another and cause this problem.
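
The check-then-mount window described above can be sketched with a small simulation. This is only an illustration, not the brick remotefs code: FakeMountTable, ensure_mounted() and run_race() are hypothetical names, and a threading.Barrier is used to force both "processes" to finish their check before either one mounts, which is the unlucky interleaving.

```python
import threading


class FakeMountTable:
    """Hypothetical stand-in for the system mount table (not real brick code)."""

    def __init__(self):
        self._mounted = set()
        self._lock = threading.Lock()

    def is_mounted(self, path):
        with self._lock:
            return path in self._mounted

    def mount(self, path):
        # Mimic mount.nfs refusing a mount point that is already in use.
        with self._lock:
            if path in self._mounted:
                raise RuntimeError("%s is busy or already mounted" % path)
            self._mounted.add(path)


def ensure_mounted(table, path, barrier, results, name):
    # Step 1: check whether the share is already mounted.
    already = table.is_mounted(path)
    # Force the bad interleaving: both callers check before either mounts.
    barrier.wait()
    if already:
        results[name] = "already mounted"
        return
    # Step 2: mount. The earlier check may be stale by now.
    try:
        table.mount(path)
        results[name] = "mounted"
    except RuntimeError as exc:
        results[name] = "failed: %s" % exc


def run_race():
    table = FakeMountTable()
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(
            target=ensure_mounted,
            args=(table, "/mnt/share", barrier, results, name),
        )
        for name in ("cinder-volume", "cinder-backup")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_race() leaves one caller with a successful mount and the other with the 'busy or already mounted' failure, even though both saw the share as unmounted when they checked.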

In a CI system, where the node has just been set up by devstack and the two processes are started concurrently, we are especially likely to hit the issue.

This issue seems to have "always" been there, but with the NFS security patches the first ensure_mounted() call occurs in a new context: security options are checked, and if the check cannot be made, the backend driver is not loaded. Previously, the driver could be loaded anyway, and if the other process succeeded with the mount, no harm was done; the silent failure really didn't matter.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139682

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/140019

OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Tom Barron (<email address hidden>) on branch: master
Review: https://review.openstack.org/140019
Reason: inadvertently sent up a new commit instead of an amended one.

OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/139682
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=bc3241cf7f06871f8c15d04b77871185625165e2
Submitter: Jenkins
Branch: master

commit bc3241cf7f06871f8c15d04b77871185625165e2
Author: Tom Barron <email address hidden>
Date: Wed Nov 26 16:01:14 2014 -0500

    Fixes intermittent NFS driver mount failure

    During cinder volume driver initialization, NFS drivers often
    fail to mount the NFS share backing their volumes, complaining
    that the share in question is 'busy or already mounted'.

    This commit introduces a retry loop around the ensure_mounted()
    call inside set_nas_security_options() so that if there is contention
    between volume process and backup process mounting the same share
    the driver will not be stopped from loading.

    Change-Id: I672433c1c31f420e5dcdbe565db3bb29af3abe7b
    Closes-bug: 1395823
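
The retry idea in the commit message might look roughly like the sketch below. It is not the merged change: MountError, ensure_mounted_with_retry() and the attempts/delay/sleep parameters are hypothetical names chosen for illustration, and the actual retry count and wait used in cinder are not stated in this report.

```python
import time


class MountError(Exception):
    """Hypothetical error raised when a mount attempt fails."""


def ensure_mounted_with_retry(ensure_mounted, share, attempts=3, delay=1.0,
                              sleep=time.sleep):
    """Call ensure_mounted(share), retrying if it raises MountError.

    Returns the number of the attempt that succeeded. The defaults are
    assumptions for this sketch, not values taken from cinder.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            ensure_mounted(share)
            return attempt
        except MountError as exc:
            last_exc = exc
            # Another process may be mid-mount; wait before trying again,
            # but do not sleep after the final attempt.
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_exc
```

On a retry, the usual check for an existing mount has a chance to see the share that the competing process just mounted, so the second attempt can succeed without issuing another mount command.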

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-1 → 2015.1.0