RemoteFS: race in _create_snapshot_online => infinite loop

Bug #1538496 reported by Jordan Pittier
This bug affects 1 person
Affects: Cinder
Status: Fix Released
Importance: Undecided
Assigned to: Jordan Pittier

Bug Description

Hi,
I think there's a race condition in the _create_snapshot_online() method in the volume/drivers/remotefs.py file. [1]

In that method, there's a "while true" loop. To exit that loop, the status of the snapshot must either be "error" or stay "creating" for a long time (> timeout).

The problem is that if Nova fails to create the snapshot (remember that we are in _create_snapshot_online, so Nova performs a live snapshot), Nova sets the snapshot status to "error" [2]. If a client deletes that erroneous snapshot right afterwards (say 0.1 sec later, i.e. as fast as Tempest does), the status will now be "deleting", and the loop may never see "error".

Thus we are stuck in the "while true" loop. Cinder consumes 100% of CPU and keeps logging "Status of snapshot XX is now deleting from (pid=15251) _create_snapshot_online /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1305".
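To make the failure mode concrete, here is a simplified model of the loop described above. It is only a sketch, not the driver's code: the names `wait_for_snapshot`, `SnapshotFailed`, the `done` success flag and the `max_polls` guard are all made up for illustration. It also assumes the loop sleeps only while the status is "creating", which would be consistent with the 100% CPU usage reported here.

```python
import itertools


class SnapshotFailed(Exception):
    """Hypothetical stand-in for the driver's real exception type."""


def wait_for_snapshot(get_snapshot, sleep, timeout=600, max_polls=None):
    """Poll until the snapshot is ready, fails, or times out.

    max_polls is a demo-only guard so the stuck case can be observed
    without hanging; the loop described in the report has no such guard.
    """
    elapsed = 0
    for polls in itertools.count(1):
        snap = get_snapshot()
        if snap["status"] == "creating":
            if snap.get("done"):  # placeholder success condition (assumption)
                return polls
            sleep(1)  # assumed to be the only place the loop sleeps
            elapsed += 1
        elif snap["status"] == "error":
            raise SnapshotFailed("Nova reported an error")
        # Any other status (e.g. "deleting") falls through: no sleep and
        # no change to `elapsed`, so the timeout below can never fire.
        if elapsed > timeout:
            raise SnapshotFailed("timed out waiting for Nova")
        if max_polls is not None and polls >= max_polls:
            raise RuntimeError("still spinning after %d polls" % polls)


# Simulate the race: "error" was set and replaced by "deleting" between polls.
statuses = itertools.chain([{"status": "creating"}],
                           itertools.repeat({"status": "deleting"}))
sleeps = []
try:
    wait_for_snapshot(lambda: next(statuses), sleeps.append, max_polls=10000)
    outcome = "returned"
except RuntimeError as exc:
    outcome = str(exc)
```

In this model the loop sleeps once (during the single "creating" poll) and then spins through the remaining polls without sleeping or advancing the elapsed time, which is why neither the "error" exit nor the timeout exit is ever reached.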

I can reproduce this 100% of the time with CentOS 7 (with a qemu-kvm binary without support for live snapshot), OpenStack master and the Tempest test "tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern".

[1] : https://github.com/openstack/cinder/blob/d3fe19cd3b6ae78c81b4c317c49e8e8d579714b1/cinder/volume/drivers/remotefs.py#L1256
[2] : https://github.com/openstack/nova/blob/c7c00d82991a16e44e628941f224740d55970d95/nova/virt/libvirt/driver.py#L1943

OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/281753

Changed in cinder:
assignee: nobody → Jordan Pittier (jordan-pittier)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/281753
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=94b3288fe5430061eca0f2b48ce5bab7ce376f30
Submitter: Jenkins
Branch: master

commit 94b3288fe5430061eca0f2b48ce5bab7ce376f30
Author: Jordan Pittier <email address hidden>
Date: Wed Feb 17 19:05:49 2016 +0100

    Fix race condition in RemoteFS create_snapshot_online

    There's a `while True` loop in create_snapshot_online and each
    iteration can make the `cinder-volume` process sleep up to 10sec. In
    the mean time, if Nova fails to create the snapshot, the snapshot
    status could turn to "error" and someone could want to delete the
    erroneous snapshot, which would make its status to be "deleting".

    In that case the `while True` loop would never exit. Cinder consumes
    100% of CPU and keeps logging "Status of snapshot XX is now deleting".

    The patch fix this issue by exiting the `while True` loop if we detect
    that the snapshot is to be deleted.

    Closes-Bug: #1538496

    Change-Id: I5de0e8479a552ce101cecd06a874a170e54d5c18
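A rough sketch of the idea in this commit message, under stated assumptions: the function and exception names below (`wait_for_snapshot`, `SnapshotFailed`, `SnapshotBeingDeleted`) and the `done` success flag are hypothetical, and raising an exception is only one possible way to leave the loop. The commit message says the loop is exited when the snapshot is to be deleted, but not how.

```python
class SnapshotFailed(Exception):
    """Hypothetical stand-in for the driver's real exception type."""


class SnapshotBeingDeleted(Exception):
    """Hypothetical: raised when a delete request races with creation."""


def wait_for_snapshot(get_snapshot, sleep, timeout=600):
    """Poll until the snapshot is ready, fails, is deleted, or times out."""
    elapsed = 0
    while True:
        snap = get_snapshot()
        status = snap["status"]
        if status == "creating":
            if snap.get("done"):  # placeholder success condition (assumption)
                return
            sleep(1)
            elapsed += 1
        elif status == "error":
            raise SnapshotFailed("Nova reported an error")
        elif status == "deleting":
            # New exit path: someone asked to delete the snapshot while
            # we were waiting, so stop polling instead of spinning.
            raise SnapshotBeingDeleted("snapshot deleted while being created")
        if elapsed > timeout:
            raise SnapshotFailed("timed out waiting for Nova")


# Same race as in the bug description: "creating", then straight to "deleting".
statuses = iter([{"status": "creating"}, {"status": "deleting"}])
sleeps = []
try:
    wait_for_snapshot(lambda: next(statuses), sleeps.append)
    outcome = "ready"
except SnapshotBeingDeleted:
    outcome = "aborted"
```

Note that in this sketch any other unexpected status would still fall through without sleeping. A more defensive variant might sleep and advance the elapsed time on every iteration so that the timeout always applies.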

Changed in cinder:
status: In Progress → Fix Released
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/cinder 8.0.0.0b3

This issue was fixed in the openstack/cinder 8.0.0.0b3 development milestone.
