EMC VMAX not threadsafe in waiting for SMI-S job completion

Bug #1393568 reported by Carl Pecinovsky
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Medium
Xing Yang

Bug Description

For a variety of operations, the VMAX cinder driver invokes a CIM method, and then must periodically poll for completion of the backend submitted job. To accomplish this polling, the FixedIntervalLoopingCall cinder utility construct is used by both emc_vmax_utils.py-->_wait_for_job_complete() and emc_vmax_utils.py-->wait_for_sync().

However, the logic uses global class variables in order to track a retry count to a configured maximum, rather than using thread local variables. Because of the way the EMCVMAXUtils class objects get constructed, there exists 4 of these objects hanging off of common, masking, provision, and masking.provision, so there is some isolation between driver areas, but not within an area (masking, for example).

Here is one of the problematic methods:

    def wait_for_sync(self, conn, syncName):
        """Given the sync name wait for it to fully synchronize.

        :param conn: connection to the ecom server
        :param syncName: the syncName
        """

        def _wait_for_sync():
            """Called at an interval until the synchronization is finished"""
            if self._is_sync_complete(conn, syncName):
                raise loopingcall.LoopingCallDone()
            if self.retries > JOB_RETRIES: <----- **************
                LOG.error(_("_wait_for_sync failed after %(retries)d tries")
                          % {'retries': self.retries})
                raise loopingcall.LoopingCallDone()
            try:
                self.retries += 1 <----- **************
                if not self.wait_for_sync_called:
                    if self._is_sync_complete(conn, syncName):
                        self.wait_for_sync_called = True <----- **************
            except Exception as e:
                LOG.error(_("Exception: %s") % six.text_type(e))
                exceptionMessage = (_("Issue encountered waiting for "
                                      "synchronization."))
                LOG.error(exceptionMessage)
                raise exception.VolumeBackendAPIException(exceptionMessage)

        self.retries = 0 <----- **************
        self.wait_for_sync_called = False <----- **************
        timer = loopingcall.FixedIntervalLoopingCall(_wait_for_sync)
        timer.start(interval=INTERVAL_10_SEC).wait()

The intent may have been to wrapper the nested method in an outer class, but instead it is just a parent method. Concurrent operations can cause self.retries to be bumped up or zeroed out incorrectly, resulting in hard-to-debug failures or timeouts.

Tags: drivers emc
Xing Yang (xing-yang)
Changed in cinder:
assignee: nobody → Xing Yang (xing-yang)
tags: added: drivers emc
Xing Yang (xing-yang)
Changed in cinder:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138878

Changed in cinder:
status: Confirmed → In Progress
Xing Yang (xing-yang)
Changed in cinder:
milestone: none → kilo-1
Mike Perez (thingee)
Changed in cinder:
milestone: kilo-1 → kilo-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/138878
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=e845c1182dde5aa8858290c2719b017fb5ca5915
Submitter: Jenkins
Branch: master

commit e845c1182dde5aa8858290c2719b017fb5ca5915
Author: Xing Yang <email address hidden>
Date: Sun Dec 7 14:57:27 2014 -0500

    Fixed wait for job completion in VMAX driver

    The VMAX driver used non-thread-safe variables in the wait for job
    completion routines, resulting in failures or timeouts during concurrent
    operations. This patch fixed the problem.

    Closes-Bug: #1393568
    Change-Id: I444b961a6543886da35fb44127b1cf7c509ce8d8

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-2 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.