NetApp Cinder driver fails to clone the image from the cache, hence copy offload occurs from Glance

Bug #1608596 reported by Bishoy
Affects: Cinder
Status: Fix Released
Importance: Undecided
Assigned to: Eric Harney

Bug Description

It seems that the caching logic built into the NFS client doesn't update when we use os.path.exists() to verify the cloned file.
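
The clone is created on the filer itself rather than through this host's NFS client (see Chuck Fouts' comment below), so the client can keep serving a cached negative lookup for the new file. A minimal illustration of the check that goes stale (a sketch, not the driver's actual code; the path is taken from the logs below):

    import os

    # Destination of the freshly cloned file on the Cinder NFS mount
    # (mount point and volume name as they appear in the logs below).
    path = ('/var/lib/cinder/mnt/6a44f791b5713436e27605fb3e255784/'
            'volume-4ed170b4-5529-4d08-b95f-b841902772c0')

    # With the default lookupcache=all, a negative dentry cached before
    # the clone completed can keep this returning False until the parent
    # directory's cached attributes expire (60 seconds by default).
    print(os.path.exists(path))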

2016-05-30 10:05:59.663 17325 INFO cinder.volume.flows.manager.create_volume [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Volume 4ed170b4-5529-4d08-b95f-b841902772c0: being created as image with specification: {'status': u'creating', 'image_location': (u'file:///var/lib/glance/images/1b037081-2afb-4dc9-9da6-2960666eb702', [{u'url': u'file:///var/lib/glance/images/1b037081-2afb-4dc9-9da6-2960666eb702', u'metadata': {u'mount_point': u'/var/lib/glance/images', u'type': u'nfs', u'share_location': u'nfs://192.168.35.201/nfs_openstack_glance_prod_vol01'}}]), 'volume_size': 30, 'volume_name': u'volume-4ed170b4-5529-4d08-b95f-b841902772c0', 'image_id': u'1b037081-2afb-4dc9-9da6-2960666eb702', 'image_service': <cinder.image.glance.GlanceImageService object at 0x512a210>, 'image_meta': {u'status': u'active', u'file': u'/v2/images/1b037081-2afb-4dc9-9da6-2960666eb702/file', u'name': u'API-pm.XXXXXX1', u'tags': [], u'container_format': u'bare', u'created_at': datetime.datetime(2016, 4, 7, 4, 3, 52, tzinfo=<iso8601.iso8601.Utc object at 0x349d8d0>), u'disk_format': u'raw', u'locations': [{u'url': u'file:///var/lib/glance/images/1b037081-2afb-4dc9-9da6-2960666eb702', u'metadata': {u'mount_point': u'/var/lib/glance/images', u'type': u'nfs', u'share_location': u'nfs://192.168.35.201/nfs_openstack_glance_prod_vol01'}}], u'visibility': u'private', u'updated_at': datetime.datetime(2016, 4, 7, 4, 21, 47, tzinfo=<iso8601.iso8601.Utc object at 0x349d8d0>), u'owner': u'db80b77cfb674e378a059b496e93c02a', u'protected': False, u'id': u'1b037081-2afb-4dc9-9da6-2960666eb702', u'min_ram': 0, u'checksum': u'f2cbfb3b3032d9ef7f8b6327fc7d6494', u'min_disk': 0, u'direct_url': u'file:///var/lib/glance/images/1b037081-2afb-4dc9-9da6-2960666eb702', 'properties': {u'os_type': u'windows', u'hw_qemu_guest_agent': u'yes'}, u'size': 32212254720}}
2016-05-30 10:05:59.672 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Cloning image 1b037081-2afb-4dc9-9da6-2960666eb702 from cache
2016-05-30 10:06:00.044 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Cloning from cache to destination volume-4ed170b4-5529-4d08-b95f-b841902772c0
2016-05-30 10:06:00.978 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Performing post clone for volume-4ed170b4-5529-4d08-b95f-b841902772c0
2016-05-30 10:06:33.366 17325 INFO cinder.volume.manager [-] Updating volume status
2016-05-30 10:06:33.386 17325 INFO cinder.brick.remotefs.remotefs [-] Already mounted: /var/lib/cinder/mnt/6a44f791b5713436e27605fb3e255784
2016-05-30 10:06:33.389 17325 INFO cinder.volume.drivers.netapp.ssc_utils [-] Running stale ssc refresh job for server: 192.168.35.201 and vserver dc1svm06
2016-05-30 10:06:33.755 17325 INFO cinder.volume.manager [-] Updating volume replication status.
2016-05-30 10:06:34.527 17325 INFO cinder.volume.drivers.netapp.ssc_utils [-] Successfully completed stale refresh job for server: 192.168.35.201 and vserver XXXXX
2016-05-30 10:06:46.988 17325 WARNING cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Discover file retries exhausted.
2016-05-30 10:06:46.991 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Image cloning unsuccessful for image 1b037081-2afb-4dc9-9da6-2960666eb702. Message: NFS file could not be discovered.
2016-05-30 10:06:47.006 17325 INFO cinder.brick.remotefs.remotefs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Already mounted: /var/lib/cinder/mnt/6a44f791b5713436e27605fb3e255784
2016-05-30 10:06:47.007 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] casted to 192.168.35.201:/nfs_openstack_cinder_prod_vol01
2016-05-30 10:06:47.921 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Performing post clone for volume-4ed170b4-5529-4d08-b95f-b841902772c0
2016-05-30 10:06:48.097 17325 INFO cinder.volume.drivers.netapp.nfs [req-54156a7b-6a7e-45b5-ad0b-cfa0feb0fe19 8ea95b1cc1c07b58d3c3240bfd2a58c746f3c69458b7667bb3fbe70b9315307c e4ec15e3d8504823b5d504e26e5b4221 - - -] Copied image 1b037081-2afb-4dc9-9da6-2960666eb702 to volume 4ed170b4-5529-4d08-b95f-b841902772c0 using copy offload workflow.

Revision history for this message
Bishoy (bishoysamy) wrote :

Has been seen on Red Hat/CentOS, Cinder 2014.2.4.

Revision history for this message
Mike Rooney (rooneym) wrote :

This is likely an issue with the client's default configuration options for the NFS mount operation. The NFS client's cache refresh interval can vary depending on how its default mount options are configured. To avoid being confronted with a stale negative cache entry, an additional option must be passed to the NFS mount command invoked by the Cinder NFS client. This can be configured by adding the line "nfs_mount_options = lookupcache=pos" to the driver configuration stanza in your cinder.conf file. Alternatively, if you are already setting other NFS mount options, just append "lookupcache=pos" to your current "nfs_mount_options". This forces the NFS client to ignore any negative entries in its cache and always check the NFS host when attempting to confirm the existence of a file.
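
For reference, a minimal cinder.conf sketch of where this lands (the stanza name and the other option values are illustrative, not taken from this report):

    [netapp-nfs]
    volume_driver = cinder.volume.drivers.netapp.common.NetAppDriver
    nfs_shares_config = /etc/cinder/nfs_shares
    # New option, or appended to any mount options you already set:
    nfs_mount_options = lookupcache=pos
    # e.g. with pre-existing options:
    # nfs_mount_options = rsize=65536,wsize=65536,lookupcache=pos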

Revision history for this message
Eric Harney (eharney) wrote :

Mike,

> This can be configured by adding the line "nfs_mount_options = lookupcache=pos" to your driver configuration stanza in your cinder.conf file.

If this is required, can the NetApp driver just always ensure this option is added?

Revision history for this message
Bishoy (bishoysamy) wrote :

I think it should indeed be added automatically.

Changed in cinder:
assignee: nobody → Bishoy (bishoysamy)
Revision history for this message
Mike Rooney (rooneym) wrote :

This issue is not believed to be unique to the NetApp driver. If it is desired for this option to be automatically added, then it should be done for all Cinder NFS drivers.

Revision history for this message
Eric Harney (eharney) wrote :

I don't believe there's any evidence that the standard NFS driver needs this?

Revision history for this message
Mike Rooney (rooneym) wrote :

Depending on the environment in which a customer deploys Cinder, the cache settings for the underlying NFS client could be anything unless explicitly set, correct? All we were recommending in this case was to pass in an explicit option to ignore the cache on positive lookups. We couldn't replicate the issue exactly, and this is our best guess, as we have not had a direct response from the customer.

Revision history for this message
Bishoy (bishoysamy) wrote :

The customer resolved the issue today by applying the NFS mount option.

Bishoy (bishoysamy)
Changed in cinder:
assignee: Bishoy (bishoysamy) → nobody
Revision history for this message
Chuck Fouts (fchuck) wrote :

Eric,

One instance where this is an issue is during rapid cloning in the NetApp driver. Per https://linux.die.net/man/5/nfs, setting lookupcache=pos does the following: "If pos or positive is specified, the client assumes positive entries are valid until their parent directory's cached attributes expire, but always revalidates negative entries before an application can use them."
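
For context, lookupcache is a standard nfs(5) client mount option, so the recommendation above is equivalent to mounting the share by hand with something like the following (server and export taken from the logs; the mount point is illustrative):

    mount -t nfs -o lookupcache=pos \
        192.168.35.201:/nfs_openstack_cinder_prod_vol01 /mnt/cinder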

The issue with the NFS client's cache coherence is unique to the NetApp driver in this scenario. The NetApp driver creates the file in the storage pool with a backend operation that does not involve the NFS client on the Cinder volume service host. If multiple Cinder volumes are created from the same Glance image in rapid succession, there is an opportunity for a cache miss in the NFS client even though the image file actually exists. In this case the correct thing to do is to set lookupcache=pos for the NetApp driver, instead of allowing the default behavior of caching negative lookup results.

Regarding the standard NFS driver, this isn't the case, even when using Cinder's Image-Volume cache, since the Cinder DB keeps a record of the image-volumes that are cached. However, the same man page has this to say: "Caching directory entries improves the performance of applications that do not share files with applications on other clients. Using cached information about directories can interfere with applications that run concurrently on multiple clients and need to detect the creation or removal of files quickly, however. The lookupcache mount option allows some tuning of directory entry caching behavior."

Since Cinder uses the DB as the source of truth, the NFS client's cache coherence isn't an issue when there is a single NFS client. With HA, however, there will be multiple NFS clients, and each client's cache coherence does become an issue. Continuing to cache negative lookup results doesn't seem to provide any performance benefit, given how Cinder uses the NFS client, and can lead to problems that are difficult to debug.

I'd be interested in hearing your take on whether this should be a Cinder core change, something to be addressed with documentation, or not really a problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/386041

Changed in cinder:
assignee: nobody → Eric Harney (eharney)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/386041
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=0865c4ac6e0348351f61b4ef6ae86a2928a544d9
Submitter: Jenkins
Branch: master

commit 0865c4ac6e0348351f61b4ef6ae86a2928a544d9
Author: Eric Harney <email address hidden>
Date: Thu Oct 13 11:15:15 2016 -0400

    NetApp NFS: Look for new file for >60s

    The default lookup cache expiry time for the
    Linux NFS client is 60 seconds. This means that
    looking for a new file could fail for the current
    45 seconds, but if the driver looks for >60s, the
    cache will be refreshed, and the file's presence
    will be detected.

    This means that we don't have to change the NFS
    client mount options to compensate for this cache
    behavior.

    Change-Id: I8c63fe6b8c5de74eac3f3c259783bb946e0482a0
    Closes-Bug: #1608596
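
In effect, the fix keeps the driver's polling approach and simply stretches the discovery window past the NFS client's default 60-second directory attribute cache lifetime. A schematic of the idea (function name, timeout, and interval are illustrative, not the actual patch):

    import os
    import time

    def discover_file(path, timeout=75, interval=5):
        """Poll for a file on an NFS mount.

        Polling for longer than 60 seconds guarantees the parent
        directory's cached attributes expire at least once, so a
        stale negative dentry is revalidated before we give up.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.exists(path):
                return True
            time.sleep(interval)
        return False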

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/386230

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/newton)

Reviewed: https://review.openstack.org/386230
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=58d26c031d33a905c28a583082dd2f872262c0db
Submitter: Jenkins
Branch: stable/newton

commit 58d26c031d33a905c28a583082dd2f872262c0db
Author: Eric Harney <email address hidden>
Date: Thu Oct 13 11:15:15 2016 -0400

    NetApp NFS: Look for new file for >60s

    The default lookup cache expiry time for the
    Linux NFS client is 60 seconds. This means that
    looking for a new file could fail for the current
    45 seconds, but if the driver looks for >60s, the
    cache will be refreshed, and the file's presence
    will be detected.

    This means that we don't have to change the NFS
    client mount options to compensate for this cache
    behavior.

    Change-Id: I8c63fe6b8c5de74eac3f3c259783bb946e0482a0
    Closes-Bug: #1608596
    (cherry picked from commit 0865c4ac6e0348351f61b4ef6ae86a2928a544d9)

tags: added: in-stable-newton
Revision history for this message
Bishoy (bishoysamy) wrote :

A new bug will be opened, as we shouldn't wait for the negative cache entry to expire; the new bug will address all NFS drivers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 9.1.0

This issue was fixed in the openstack/cinder 9.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 10.0.0.0b1

This issue was fixed in the openstack/cinder 10.0.0.0b1 development milestone.
