NetApp: Errors if requests are too close together

Bug #1091480 reported by Sam Morrison on 2012-12-18
This bug affects 3 people
Affects    Status    Importance    Assigned to       Milestone
Cinder               Undecided     Rushi Agrawal
Folsom               Medium        Rushi Agrawal

Bug Description

If I try to delete (or create) multiple volumes in short succession, I get errors:

2012-12-18 11:04:35 11163 ERROR suds.client [-] <?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:ns0="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="http://www.netapp.com/management/v1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   <SOAP-ENV:Header/>
   <ns0:Body>
      <ns1:DatasetEditBegin>
         <ns1:DatasetNameOrId>442</ns1:DatasetNameOrId>
      </ns1:DatasetEditBegin>
   </ns0:Body>
</SOAP-ENV:Envelope>
2012-12-18 11:04:35 11163 ERROR cinder.openstack.common.rpc.amqp [-] Exception during message handling
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp Traceback (most recent call last):
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/amqp.py", line 276, in _process_data
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp rval = self.proxy.dispatch(ctxt, version, method, **args)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/dispatcher.py", line 145, in dispatch
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 206, in delete_volume
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp {'status': 'error_deleting'})
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp self.gen.next()
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 195, in delete_volume
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp self.driver.delete_volume(volume_ref)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/volume/netapp.py", line 500, in delete_volume
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp self._remove_destroy(name, project)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/cinder/volume/netapp.py", line 444, in _remove_destroy
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp lock_id = server.DatasetEditBegin(DatasetNameOrId=lun.dataset.id)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/suds/client.py", line 542, in __call__
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp return client.invoke(args, kwargs)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/suds/client.py", line 602, in invoke
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp result = self.send(soapenv)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/suds/client.py", line 653, in send
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp result = self.failed(binding, e)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/suds/client.py", line 708, in failed
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp r, p = binding.get_fault(reply)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/suds/bindings/binding.py", line 265, in get_fault
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp raise WebFault(p, faultroot)
2012-12-18 11:04:35 11163 TRACE cinder.openstack.common.rpc.amqp WebFault: Server raised fault: '(22243:EEDITSESSIONINPROGRESS) Dataset OpenStack_a6cf29e7e30e4749975a18f52cbd7724 already enrolled in edit session 1355788168 initiated by oncommand.'

We need some way for the volume service to not take messages off the queue until it is ready to process them.
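
For reference, a minimal reproduction sketch of the scenario described above, assuming python-cinderclient (v1 API) against this deployment; the placeholder credentials, volume names, and sizes are arbitrary:

from cinderclient.v1 import client

USERNAME = 'admin'                        # placeholder credentials; substitute real ones
PASSWORD = 'secret'
TENANT_NAME = 'demo'
AUTH_URL = 'http://keystone:5000/v2.0'

c = client.Client(USERNAME, PASSWORD, TENANT_NAME, AUTH_URL)

# Issue several create calls back to back; with the NetApp 7-mode/DFM driver
# the concurrent DatasetEditBegin calls race against each other.
volumes = [c.volumes.create(1, display_name='race-test-%d' % i) for i in range(5)]

# Deleting them in quick succession triggers the same
# EEDITSESSIONINPROGRESS fault shown in the traceback above.
for v in volumes:
    c.volumes.delete(v)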

Changed in cinder:
assignee: nobody → Ben Swartzlander (bswartz)
Ben Swartzlander (bswartz) wrote :

We need to add lock-detection and retry logic to solve this issue.
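
A rough sketch of that retry idea (not the eventual patch): retry DatasetEditBegin while DFM reports that an edit session is already in progress. The error-code string comes from the traceback above; the retry count and delay here are arbitrary.

import time

from suds import WebFault


def dataset_edit_begin_with_retry(server, dataset_id, retries=10, delay=2):
    """Call DatasetEditBegin, retrying while another edit session holds the dataset."""
    for attempt in range(retries):
        try:
            return server.DatasetEditBegin(DatasetNameOrId=dataset_id)
        except WebFault as e:
            # EEDITSESSIONINPROGRESS means another edit session currently
            # owns the dataset; back off and try again. Re-raise any other
            # fault, or the last failure once retries are exhausted.
            if 'EEDITSESSIONINPROGRESS' not in str(e) or attempt == retries - 1:
                raise
            time.sleep(delay)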

Changed in cinder:
status: New → Confirmed
Brano Zarnovican (zarnovican) wrote :

I'm attaching a quick _temporary_ workaround for this issue. It ensures that only one thread at a time can execute DatasetEditCommit; the other threads are blocked until they can grab the semaphore lock.

Here is an example of 6s gap, where the thread was blocked:
2013-01-14 13:53:56 DEBUG nova.utils [req-76e7048d-596a-4510-947a-a9a35c8bbc96 01021119097b49f99cc13928c918bceb 103a49bb861e485ea05aa78f9b0216bd] Attempting to grab semaphore "dfm-ds-edit-lock" for method "_provision"... from (pid=24310) inner /opt/common-python/lib/python2.6/site-packages/nova/utils.py:928
2013-01-14 13:54:02 DEBUG nova.utils [req-76e7048d-596a-4510-947a-a9a35c8bbc96 01021119097b49f99cc13928c918bceb 103a49bb861e485ea05aa78f9b0216bd] Got semaphore "dfm-ds-edit-lock" for method "_provision"... from (pid=24310) inner /opt/common-python/lib/python2.6/site-packages/nova/utils.py:932

I did not put the synchronized decorator around create/delete_snapshot because the LUN creation is done differently there. I'm not sure whether they are affected too.
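
For illustration, a self-contained sketch of the same idea using a plain threading lock instead of nova's utils.synchronized helper; the lock and decorator names below mirror the log excerpt but are not the attached patch:

import functools
import threading

# Module-level lock shared by every caller in this process, named after the
# "dfm-ds-edit-lock" semaphore seen in the log excerpt above.
_dfm_edit_lock = threading.Lock()


def dfm_edit_synchronized(func):
    """Let only one thread at a time run a DFM dataset edit session."""
    @functools.wraps(func)
    def inner(*args, **kwargs):
        with _dfm_edit_lock:
            return func(*args, **kwargs)
    return inner

# Applied to the driver methods that call DatasetEditBegin/DatasetEditCommit,
# e.g. _provision and _remove_destroy.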

Rushi Agrawal (rushiagr) on 2013-01-17
tags: added: netapp
Rushi Agrawal (rushiagr) wrote :

Able to reproduce. Working on it.

Fix proposed to branch: master
Review: https://review.openstack.org/23077

Changed in cinder:
assignee: Ben Swartzlander (bswartz) → Rushi Agrawal (rushiagr)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/23077
Committed: http://github.com/openstack/cinder/commit/3cff1cb24f7627dd913ce02a0f62c7ef85b5c37c
Submitter: Jenkins
Branch: master

commit 3cff1cb24f7627dd913ce02a0f62c7ef85b5c37c
Author: Rushi Agrawal <email address hidden>
Date: Wed Feb 27 06:37:41 2013 -0500

    NetApp: Fix race condition in 7-mode iSCSI driver with DFM.

    A race condition was discovered in the driver while creating
    or deleting multiple volumes in a very short interval of time
    (can be reproduced by making successive API calls to Cinder).
    This fix decorates the methods responsible for the race
    condition with a synchronized() decorator. The problem lies
    with the management software (DFM): in principle, there
    should not be two processes modifying the same DFM dataset
    object at the same time. This patch prevents that possibility.

    Fixes bug 1091480

    Change-Id: I11cfc8868171acec356f4140aba52b8078d109df
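
Roughly, the committed change has this shape; the decorator import, lock name, and method bodies below are illustrative assumptions, so see the review linked above for the actual diff:

from cinder import utils


class NetAppISCSIDriver(object):
    # Stand-in for the real 7-mode iSCSI driver class in cinder/volume/netapp.py.

    @utils.synchronized('netapp_dfm_edit')
    def _provision(self, *args, **kwargs):
        # Create path: opens and commits a DFM dataset edit session.
        pass

    @utils.synchronized('netapp_dfm_edit')
    def _remove_destroy(self, name, project):
        # Delete path from the traceback above: DatasetEditBegin on the
        # dataset, remove the LUN, then DatasetEditCommit.
        pass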

Changed in cinder:
status: In Progress → Fix Committed
Abel Lopez (al592b) wrote :

Please backport this to Folsom.

Kevin Bringard (kbringard) wrote :

And Essex.

Rushi Agrawal (rushiagr) on 2013-03-08
tags: added: folsom-backport-potential
Thierry Carrez (ttx) on 2013-03-15
Changed in cinder:
milestone: none → grizzly-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-04-04
Changed in cinder:
milestone: grizzly-rc1 → 2013.1
Alan Pevec (apevec) on 2013-04-05
no longer affects: cinder/grizzly
Mark McLoughlin (markmc) on 2013-04-05
tags: removed: folsom-backport-potential
tags: added: essex-backport-netapp-potential folsom-backport-netapp-potential
Navneet (singn) on 2013-10-03
tags: added: navneet-netapp-backport