rmtree in obj replication is racy

Bug #1408061 reported by John Dickinson
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

In object replication update_deleted(), there's a call to shutil.rmtree to delete the entire partition directory tree after the partition has been successfully pushed to the primary nodes.

However, it's possible[1] that a new object has been placed in the partition directory after the suffixes are calculated but before the rmtree is called. The code would delete this new object.

[1] For example, if the client internal network is down but the replication network is working just fine.
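To make the window concrete, here is a minimal sketch of the pattern described above, assuming the handoff partition is synced to the primaries and then removed wholesale; the function name and the sync_to_node callable are illustrative stand-ins, not the actual update_deleted() code in the Swift replicator.

    import os
    import shutil


    def update_deleted_sketch(partition_path, nodes, sync_to_node):
        # Snapshot the suffix dirs currently in the handoff partition and
        # push them to every primary node (sync_to_node stands in for the
        # real rsync/ssync call).
        suffixes = [name for name in os.listdir(partition_path)
                    if os.path.isdir(os.path.join(partition_path, name))]
        success = all(sync_to_node(node, partition_path, suffixes)
                      for node in nodes)

        # If every primary accepted the data, delete the whole partition.
        # RACE: an object PUT that lands between the listdir() above and
        # this rmtree() is deleted even though it was never replicated.
        if success:
            shutil.rmtree(partition_path, ignore_errors=True)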

Revision history for this message
paul luse (paul-e-luse) wrote :

So one thing we could do is lock the part dir throughout the process, so that anyone trying to land an object here would get an error and go try the next handoff. Or we could look at some of the work I'm doing on the EC version of this and, once that's fully baked, apply the same kind of scheme here. On the EC side we're not killing the entire part dir at once: we're removing .data files as they're sync'd (for a totally different reason), and once we're done we attempt to remove the dir, but if it's not empty we just move on. The reason it wouldn't be empty isn't the race mentioned here; it's the normal case where multiple fragment indexes are stored in the same place. That logic would still eliminate the issue called out here of accidentally deleting something that landed just because we could contact all the primaries and therefore assumed the proxy could too. I kinda like the latter option (put this on hold til the reconstructor is done and then borrow).
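A rough sketch of the second option described above (delete the files that were actually synced, then attempt rmdir and move on if the directory isn't empty). The function name, the synced_files argument, and the use of os.walk are assumptions for illustration, not the reconstructor's actual code.

    import errno
    import os


    def cleanup_synced(partition_path, synced_files):
        # Delete only the files we know were successfully pushed to the
        # primaries; anything that landed after the sync is left alone.
        for path in synced_files:
            try:
                os.unlink(path)
            except OSError as err:
                if err.errno != errno.ENOENT:
                    raise

        # Then try to remove the directories bottom-up. If a new object
        # arrived in the meantime, rmdir() fails with ENOTEMPTY and the
        # dir is simply left for a later replication pass.
        for dirpath, _dirs, _files in os.walk(partition_path, topdown=False):
            try:
                os.rmdir(dirpath)
            except OSError as err:
                if err.errno not in (errno.ENOTEMPTY, errno.ENOENT, errno.EEXIST):
                    raise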

Revision history for this message
Samuel Merritt (torgomatic) wrote :

Either that or just make a bunch of rmdir() calls ourselves instead of delegating to rmtree(). Probably more efficient, too, since we know what dirs we've traversed so we can avoid another listdir (maybe?).
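A minimal sketch of that idea, assuming the replicator hands over the directories it already traversed during the sync (ordered deepest-first) so no second listdir is needed; the names here are hypothetical.

    import errno
    import os


    def rmdir_traversed(traversed_dirs):
        # traversed_dirs: the hash and suffix dirs already visited while
        # syncing, ordered deepest-first, so no extra listdir() is needed.
        for dirpath in traversed_dirs:
            try:
                os.rmdir(dirpath)
            except OSError as err:
                # Non-empty means something landed after the sync; skip it
                # rather than deleting data that was never replicated.
                if err.errno not in (errno.ENOTEMPTY, errno.ENOENT, errno.EEXIST):
                    raise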

Revision history for this message
clayg (clay-gerrard) wrote :

this one time I tried to fix this:

https://review.openstack.org/#/c/158927/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by John Dickinson (<email address hidden>) on branch: master
Review: https://review.openstack.org/158927
Reason: Abandoning based on the lack of activity since the last negative review. If you want to continue working on this, please reopen this patch.

Changed in swift:
status: New → Confirmed
importance: Undecided → Medium