Cinder

EMC VMAX not threadsafe in waiting for SMI-S job completion

Bug #1393568 reported by Carl Pecinovsky on 2014-11-17

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Cinder	Fix Released	Medium	Xing Yang	Cinder 2015.1.0 "kilo"

Bug Description

For a variety of operations, the VMAX cinder driver invokes a CIM method, and then must periodically poll for completion of the backend submitted job. To accomplish this polling, the FixedIntervalLoopingCall cinder utility construct is used by both emc_vmax_utils.py-->_wait_for_job_complete() and emc_vmax_utils.py-->wait_for_sync().

However, the logic uses global class variables in order to track a retry count to a configured maximum, rather than using thread local variables. Because of the way the EMCVMAXUtils class objects get constructed, there exists 4 of these objects hanging off of common, masking, provision, and masking.provision, so there is some isolation between driver areas, but not within an area (masking, for example).

Here is one of the problematic methods:

def wait_for_sync(self, conn, syncName):
"""Given the sync name wait for it to fully synchronize.

        :param conn: connection to the ecom server
        :param syncName: the syncName
        """

        def _wait_for_sync():
            """Called at an interval until the synchronization is finished"""
            if self._is_sync_complete(conn, syncName):
                raise loopingcall.LoopingCallDone()
            if self.retries > JOB_RETRIES: <----- **************
                LOG.error(_("_wait_for_sync failed after %(retries)d tries")
                          % {'retries': self.retries})
                raise loopingcall.LoopingCallDone()
            try:
                self.retries += 1 <----- **************
                if not self.wait_for_sync_called:
                    if self._is_sync_complete(conn, syncName):
                        self.wait_for_sync_called = True <----- **************
            except Exception as e:
                LOG.error(_("Exception: %s") % six.text_type(e))
                exceptionMessage = (_("Issue encountered waiting for "
                                      "synchronization."))
                LOG.error(exceptionMessage)
                raise exception.VolumeBackendAPIException(exceptionMessage)

        self.retries = 0 <----- **************
        self.wait_for_sync_called = False <----- **************
        timer = loopingcall.FixedIntervalLoopingCall(_wait_for_sync)
        timer.start(interval=INTERVAL_10_SEC).wait()

The intent may have been to wrapper the nested method in an outer class, but instead it is just a parent method. Concurrent operations can cause self.retries to be bumped up or zeroed out incorrectly, resulting in hard-to-debug failures or timeouts.

Tags:

Xing Yang (xing-yang) on 2014-11-18

Changed in cinder:
assignee:	nobody → Xing Yang (xing-yang)
tags:	added: drivers emc

Xing Yang (xing-yang) on 2014-11-18

Changed in cinder:
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-03: Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138878

Changed in cinder:
status:	Confirmed → In Progress

Xing Yang (xing-yang) on 2014-12-07

Changed in cinder:
milestone:	none → kilo-1

Mike Perez (thingee) on 2014-12-16

Changed in cinder:
milestone:	kilo-1 → kilo-2

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-18: Fix merged to cinder (master)

Reviewed: https://review.openstack.org/138878
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=e845c1182dde5aa8858290c2719b017fb5ca5915
Submitter: Jenkins
Branch: master

commit e845c1182dde5aa8858290c2719b017fb5ca5915
Author: Xing Yang <email address hidden>
Date: Sun Dec 7 14:57:27 2014 -0500

Fixed wait for job completion in VMAX driver

    The VMAX driver used non-thread-safe variables in the wait for job
    completion routines, resulting in failures or timeouts during concurrent
    operations. This patch fixed the problem.

Closes-Bug: #1393568
Change-Id: I444b961a6543886da35fb44127b1cf7c509ce8d8

Changed in cinder:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2015-02-05

Changed in cinder:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-04-30

Changed in cinder:
milestone:	kilo-2 → 2015.1.0

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.