VMware: ExtendVirtualDisk_Task fails due to locked file

Bug #1333587 reported by Matthew Booth
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  High        Matthew Booth
Icehouse                  Fix Released  Undecided   Unassigned
VMwareAPI-Team            New           Undecided   Unassigned

Bug Description

Extending a disk during spawn is racy, which can result in failure. The bug can be triggered by simultaneously launching a large number of instances of an image which isn't already cached. Some of them will race to extend the cached image, ultimately resulting in an error such as:

2014-06-17 10:49:26.006 9177 WARNING nova.virt.vmwareapi.driver [-] Task [ExtendVirtualDisk_Task]
   value = "task-12073"
   _type = "Task"
 } status: error Unable to access file [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk since it is locked

Matthew Booth (mbooth-9) wrote :
Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
Changed in nova:
status: New → In Progress
Gary Kotton (garyk)
Changed in nova:
importance: Undecided → High
milestone: none → juno-2
tags: added: icehouse-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/102224
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=994cdb234b2b16d97f0276c6356db65817944ee2
Submitter: Jenkins
Branch: master

commit 994cdb234b2b16d97f0276c6356db65817944ee2
Author: Matthew Booth <email address hidden>
Date: Tue Jun 24 12:12:59 2014 +0100

    VMware: Fix race in spawn() when resizing cached image

    spawn() guards against multiple threads simultaneously attempting to
    cache the same image, but it wasn't guarding against them
    simultaneously trying to create a resized copy in the cache. Attempting
    to create a large number of images simultaneously of an uncached image
    would result in a race to create the resized image. This resulted in 2
    classes of failed instance:

    1. Instances whose disk was a linked clone of a copy which had been
       subsequently overwritten. These were corrupt.
    2. Instances whose spawn() failed in ExtendVirtualDisk_Task due to a
       locked image.

    This patch creates a Nova-local lock for the resized image. The image
    is in a per-Nova directory on the datastore, so inter-Nova locking is
    not a concern. The lock guards both testing for the existence of the
    image, and its creation. Therefore when multiple processes race, only
    1 will create the resized copy, and all others will find and use it.
    In normal usage this will add the overhead of an additional
    uncontended local lock creation and deletion in spawn().

    In wrapping this code in a lock, we also make certain that any failure
    to create the resized image is appropriately cleaned up. Otherwise
    subsequent users will attempt to use a corrupt copy.

    Change-Id: I3df3d614656e511c909b6c1837582c0d34bf84c6
    Closes-bug: 1333587
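
The locking pattern described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `FakeDatastore`, `_named_lock`, `ensure_resized_image` and the `<base>.<size>.vmdk` naming are hypothetical stand-ins, and a process-local `threading.Lock` stands in for whatever Nova-local lock the real change uses.

```python
import threading
from collections import defaultdict

# Hypothetical registry of process-local named locks (assumption: the real
# patch uses Nova's own locking helpers, which are not reproduced here).
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def _named_lock(name):
    """Return the single lock object associated with `name`."""
    with _locks_guard:
        return _locks[name]


class FakeDatastore:
    """In-memory stand-in for a datastore, used only for illustration."""

    def __init__(self):
        self.files = set()
        self.copies = 0

    def exists(self, path):
        return path in self.files

    def copy_and_extend(self, src, dst, size_gb):
        # Models copying the cached image and extending it to size_gb.
        self.copies += 1
        self.files.add(dst)

    def delete(self, path):
        self.files.discard(path)


def ensure_resized_image(ds, base_path, size_gb):
    """Create the resized copy at most once, even when callers race."""
    resized = "%s.%d.vmdk" % (base_path, size_gb)
    # The lock covers both the existence test and the creation, so a
    # second caller can never observe a half-created copy.
    with _named_lock(resized):
        if not ds.exists(resized):
            try:
                ds.copy_and_extend(base_path + ".vmdk", resized, size_gb)
            except Exception:
                # Remove any partial copy so later callers do not reuse it.
                ds.delete(resized)
                raise
    return resized
```

If several threads call `ensure_resized_image` for the same image and size, only the first one to take the lock performs the copy; the others find the existing file once they acquire the lock and simply return its path.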

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/106979

Changed in nova:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/106979
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a8b52c05ed27faf97f0b35ecf473e4cc7eac66ab
Submitter: Jenkins
Branch: stable/icehouse

commit a8b52c05ed27faf97f0b35ecf473e4cc7eac66ab
Author: Matthew Booth <email address hidden>
Date: Tue Jun 24 12:12:59 2014 +0100

    VMware: Fix race in spawn() when resizing cached image

    spawn() guards against multiple threads simultaneously attempting to
    cache the same image, but it wasn't guarding against them
    simultaneously trying to create a resized copy in the cache. Attempting
    to create a large number of images simultaneously of an uncached image
    would result in a race to create the resized image. This resulted in 2
    classes of failed instance:

    1. Instances whose disk was a linked clone of a copy which had been
       subsequently overwritten. These were corrupt.
    2. Instances whose spawn() failed in ExtendVirtualDisk_Task due to a
       locked image.

    This patch creates a Nova-local lock for the resized image. The image
    is in a per-Nova directory on the datastore, so inter-Nova locking is
    not a concern. The lock guards both testing for the existence of the
    image, and its creation. Therefore when multiple processes race, only
    1 will create the resized copy, and all others will find and use it.
    In normal usage this will add the overhead of an additional
    uncontended local lock creation and deletion in spawn().

    In wrapping this code in a lock, we also make certain that any failure
    to create the resized image is appropriately cleaned up. Otherwise
    subsequent users will attempt to use a corrupt copy.

    Conflicts:
     nova/virt/vmwareapi/vmops.py

    Required changes:
      datastore.build_path -> ds_util.build_datastore_path
      No datastore object
      vm_util.copy_virtual_disk -> _copy_virtual_disk
      self.fake_image_uuid -> 'fake_image_uuid'
      removed use of _LE() for log message

    This change includes a test which depends on change
    I2025bffa887582eaa9e9072d0400f90ca97d1898.

    Change-Id: I3df3d614656e511c909b6c1837582c0d34bf84c6
    Closes-bug: 1333587
    (cherry picked from commit 994cdb234b2b16d97f0276c6356db65817944ee2)

tags: added: in-stable-icehouse
Chuck Short (zulcss)
tags: removed: icehouse-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2