Cinder

RemoteFS: race in _create_snapshot_online => infinite loop

Bug #1538496 reported by Jordan Pittier on 2016-01-27

This bug affects 1 person

Affects   Status         Importance   Assigned to      Milestone
Cinder    Fix Released   Undecided    Jordan Pittier

Bug Description

Hi,
I think there's a race condition in the _create_snapshot_online() method in the volume/drivers/remotefs.py file. [1]

In that method, there's a "while true" loop, and to exit that loop the status of the snapshot must either stay "creating" for a long time (> timeout) or become "error".

The problem is that if Nova fails to create the snapshot (remember we are in _create_snapshot_online, so Nova performs a live snapshot), then Nova will set the snapshot status to "error" [2]. If right after (like 0.1 sec after, i.e. as fast as Tempest) a client deletes that erroneous snapshot, then the status will now be "deleting".

Thus we will be stuck in the "while true" loop. Cinder consumes 100% of CPU and keeps logging "Status of snapshot XX is now deleting from (pid=15251) _create_snapshot_online /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1305".
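
To make the failure mode concrete, here is a simplified model of the loop as described above. It is only a sketch, not the actual Cinder code: `poll_snapshot`, `get_status` and `max_iterations` are hypothetical names, the success path is left out, and the iteration cap exists only so the demonstration terminates. The key assumption, taken from the description, is that the elapsed-time counter only advances while the status is "creating".

```python
def poll_snapshot(get_status, timeout=600, max_iterations=1000,
                  sleep=lambda seconds: None):
    """Simplified model of the polling loop (hypothetical, not Cinder code).

    get_status: callable returning the current snapshot status string.
    max_iterations: safety cap for this demo; the loop in the report has none.
    sleep: injected so the model runs instantly; a real loop would really sleep.
    """
    seconds_elapsed = 0
    increment = 1
    for iteration in range(max_iterations):
        status = get_status()
        if status == 'creating':
            # Only this branch sleeps and advances the clock.
            sleep(increment)
            seconds_elapsed += increment
        elif status == 'error':
            return ('error', iteration)
        # Any other status (e.g. "deleting") falls through here:
        # no sleep, no progress toward the timeout.
        if seconds_elapsed > timeout:
            return ('timeout', iteration)
    return ('still looping', max_iterations)


# "error" is seen in time: the loop exits.
statuses = iter(['creating', 'error'])
print(poll_snapshot(lambda: next(statuses)))

# The snapshot was already deleted by a fast client: the loop never
# sleeps and never times out, so only the demo cap stops it.
print(poll_snapshot(lambda: 'deleting'))
```

In this model the "deleting" case spins without sleeping, which matches the 100% CPU usage and the repeated log line reported above.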

I can reproduce this 100% of the time with CentOS 7 (with a qemu-kvm binary without support for live snapshot), OpenStack master and the Tempest test "tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern".

[1] : https://github.com/openstack/cinder/blob/d3fe19cd3b6ae78c81b4c317c49e8e8d579714b1/cinder/volume/drivers/remotefs.py#L1256
[2] : https://github.com/openstack/nova/blob/c7c00d82991a16e44e628941f224740d55970d95/nova/virt/libvirt/driver.py#L1943

OpenStack Infra (hudson-openstack) wrote on 2016-02-18: Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/281753

Changed in cinder:
assignee:	nobody → Jordan Pittier (jordan-pittier)
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-02-29: Fix merged to cinder (master)

Reviewed: https://review.openstack.org/281753
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=94b3288fe5430061eca0f2b48ce5bab7ce376f30
Submitter: Jenkins
Branch: master

commit 94b3288fe5430061eca0f2b48ce5bab7ce376f30
Author: Jordan Pittier <email address hidden>
Date: Wed Feb 17 19:05:49 2016 +0100

Fix race condition in RemoteFS create_snapshot_online

There's a `while True` loop in create_snapshot_online and each
iteration can make the `cinder-volume` process sleep up to 10 sec. In
the meantime, if Nova fails to create the snapshot, the snapshot
status could turn to "error" and someone could delete the
erroneous snapshot, which would set its status to "deleting".

In that case the `while True` loop would never exit. Cinder consumes
100% of CPU and keeps logging "Status of snapshot XX is now deleting".

The patch fixes this issue by exiting the `while True` loop if we detect
that the snapshot is to be deleted.

Closes-Bug: #1538496

Change-Id: I5de0e8479a552ce101cecd06a874a170e54d5c18
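
The idea behind the fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged patch: `wait_for_snapshot`, `get_status` and the `SnapshotBeingDeleted` exception are hypothetical names, the success path is omitted, and treating exactly the "deleting" status as the exit condition is an assumption based on the commit message.

```python
class SnapshotBeingDeleted(Exception):
    """Hypothetical exception; the real patch may signal this differently."""


def wait_for_snapshot(get_status, timeout=600, sleep=lambda seconds: None):
    """Poll a snapshot status, bailing out if the snapshot is being deleted.

    get_status: callable returning the current snapshot status string.
    sleep: injected so the sketch runs instantly; real code would sleep.
    """
    seconds_elapsed = 0
    increment = 1
    while True:
        status = get_status()
        if status == 'creating':
            sleep(increment)
            seconds_elapsed += increment
        elif status == 'error':
            raise RuntimeError('snapshot creation failed')
        elif status == 'deleting':
            # New exit: someone asked to delete the snapshot while we
            # were still waiting for it, so stop polling.
            raise SnapshotBeingDeleted('snapshot is being deleted')
        # Success handling is omitted from this sketch.
        if seconds_elapsed > timeout:
            raise TimeoutError('timed out waiting for snapshot')


# A client deletes the snapshot while we are polling.
statuses = iter(['creating', 'deleting'])
try:
    wait_for_snapshot(lambda: next(statuses))
except SnapshotBeingDeleted as exc:
    print('stopped polling:', exc)
```

Raising a dedicated exception rather than returning silently lets the caller tell "snapshot is going away" apart from a Nova failure or a timeout.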

Changed in cinder:
status:	In Progress → Fix Released

Doug Hellmann (doug-hellmann) wrote on 2016-03-03: Fix included in openstack/cinder 8.0.0.0b3

This issue was fixed in the openstack/cinder 8.0.0.0b3 development milestone.
