[Re-open in 2015 Oct] DELETE operation not write affinity aware

Bug #1503161 reported by Hugo Kou on 2015-10-06
This bug affects 7 people
Affects: OpenStack Object Storage (swift)
Assigned to: Lingxian Kong

Bug Description

This was supposed to be fixed by an old report:
DELETE operation not write affinity aware

I found that this issue still appears in a recent Swift release. My test environment is a 3-region cluster with a 3-replica policy.
The user performs a DELETE before the object has been replicated to the other regions.

HugotekiMacBook-Air:swift-source hugo$ swift delete c4 1
Error Deleting: c4/1: Object DELETE failed: 404 Not Found [first 60 chars of response] <html><h1>Not Found</h1><p>The resource could not be found.<

Here's a screenshot of my operation : http://imgur.com/cPQK9Uf

It seems the responses were not passed into best_response correctly in this case.

R1Node1 - Oct 6 07:16:52 ip6-localhost object-server: - - [06/Oct/2015:07:16:52 +0000] "DELETE /d0/20653/AUTH_demo1/c4/1" 204 - "DELETE" "tx7e8530fe3f474613ae348-0056137564" "proxy-server 10103" 0.0062 "-" 9865 0

R2Node2 - Oct 6 07:16:52 ip6-localhost object-server: - - [06/Oct/2015:07:16:52 +0000] "DELETE /d5/20653/AUTH_demo1/c4/1" 404 - "DELETE" "tx7e8530fe3f474613ae348-0056137564" "proxy-server 10103" 0.0286 "-" 4105 0

R3Node3 - Oct 6 07:16:52 ip6-localhost object-server: - - [06/Oct/2015:07:16:52 +0000] "DELETE /d10/20653/AUTH_demo1/c4/1" 404 - "DELETE" "tx7e8530fe3f474613ae348-0056137564" "proxy-server 10103" 0.0066 "-" 10587 0

Proxy - Oct 6 07:16:52 ip6-localhost proxy-server: 06/Oct/2015/07/16/52 DELETE /v1/AUTH_demo1/c4/1 HTTP/1.0 404 - python-swiftclient-2.5.0 demo1%2CAUTH_tkbaa... - 70 - tx7e8530fe3f474613ae348-0056137564 - 0.0355 - - 1444115812.263063908 1444115812.298532009 0

John Dickinson (notmyname) wrote :

It seems like the previous patch would only map 404->204 on DELETEs if there was no quorum. But if there is a quorum of 404s, it still returns a 404.

In the provided example, the object servers returned 404, so the proxy also returned 404. But note that the tombstones are written to disk.

This might be a bug with the CLI reporting an error when there isn't one.

Hugo Kou (tonytkdk) wrote :

How does the CLI know whether the DELETE reached quorum or not?

"This might be a bug with the CLI reporting an error when there isn't one."

I don't understand what bug in the CLI reporting this would be.

Samuel Merritt (torgomatic) wrote :

If you issue a DELETE request, you are trying to ensure that the object does not exist. A 2xx response indicates that the object has been deleted and thus does not exist, and a 404 indicates that the object did not exist and continues to not exist. In both cases, the object does not exist, so no error should be reported IMO.
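Samuel's reading of DELETE idempotency can be sketched from the client side (a hypothetical helper, not swiftclient's actual code):

```python
# Hypothetical helper, not swiftclient's actual logic: from the
# caller's perspective a DELETE "succeeded" whenever the object no
# longer exists afterwards, which covers both 2xx and 404.
def delete_outcome_ok(status: int) -> bool:
    return 200 <= status < 300 or status == 404

print(delete_outcome_ok(204))  # True: found it, deleted it
print(delete_outcome_ok(404))  # True: it already didn't exist
print(delete_outcome_ok(503))  # False: genuinely failed
```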

Hugo Kou (tonytkdk) wrote :

Imagine that you can GET an object with a 2xx status code. Then you try to delete it, and the Swift CLI gets a 404.
As a user, that is very confusing.

From an operator's or developer's point of view, the object is deleted and a new_timestamp.ts was created in one region's primary location,
since the other two primary locations will get the new_timestamp.ts "eventually".

Region 1 now has these entries for the three copies:


If the replicator replicates the .data file to the remote regions before the .ts file, it runs into another bug, linked below, and the user will still have a chance to GET the object.

Error syncing handoff partition : no attribute 'intersection

Even once we fix "Error syncing handoff partition : no attribute 'intersection", the 404 still makes users confused, right?

Zack M. Davis (zdavis) wrote :

> In both cases, the object does not exist, so no error should be reported IMO.

But given the ever-present possibility of human error in composing the request, distinguishing between "found the thing, deleted it" and "couldn't find the thing" seems important: if I meant to delete "foo" but the command I actually typed said to delete "fooo", I might want that to be an error, to shake me out of my lethal complacency and alert me to the fact that "foo" with two 'o's might still exist.

Perhaps swiftclient could give a special, more-specific error message for this case of a 404 on DELETE?

Zack M. Davis (zdavis) wrote :

But on the other hand, if it _is_ writing the tombstone and we take that to be the philosophical true nature of something having been "deleted," then it is weirdly counterintuitive (and discontinuous with the case where we don't happen to have quorum) for that to be a 404.

clayg (clay-gerrard) wrote :

We were probably too loose with the 204=>404 mapping in the response. No need to limit the override to the no-quorum case; since the on-disk behavior is the same no matter what the response code, any 204 is probably sufficient to return 204.

In a pathological case with some timeouts mixed in you might manage to get a second 204 from a DELETE after a previous 204, but if I think hard enough I can probably come up with a failure scenario where that could happen anyway:

t1: 204, 503, 204
t2: 404, 204, 503

In the second request the existing override would "fix" the 404 -> 204 and return 204


t2: 404, 204, 404

... would have returned 404 as soon as it saw the quorum, without doing any status translation - which is exactly the case we're dealing with in this bug. There's a quorum of 404s, and we're taking that to mean we have strong confidence that the object on the node responding 204 should have already been deleted by a previous request - but if those 404s mean "missing" instead of "deleted", we should not be applying any such confidence.

I don't think this is strictly a swiftclient issue.

Bruno Lago (teolupus) wrote :

Conditions to reproduce:
1) Create geo-replicated cluster with 2 regions, 3 replicas.
2) Upload a test object and find out its placement with swift-get-node.
3) Delete the object.
4) Connect to the API endpoint of the region that is expected to have 1 replica and re-upload the object.
5) Before the replication to the other region is completed, delete the object via the API of the region that has only 1 replica.
Result: 404 Not Found. The resource could not be found.

Why is this problematic?
a) From a cloud customer's point of view, the object has been deleted and will never reappear.
b) From a cloud provider's point of view, the object has been deleted. When the replicator gets to this object, it will notice it has been deleted and will not replicate it to the other region.
c) Cloud customers get the same error message they get when they try to delete an object that does not exist.

We are interested in fixing this and would like to hear the opinion of more experienced developers on what the best path would be.

John Dickinson (notmyname) wrote :

Bruno, what are you wanting to see?

The way things are now, DELETEs don't respect write affinity. This could be rephrased as "DELETEs assume they will be called more than <replication cycle time> after the initial PUT." We could have made the other choice: make DELETEs respect write affinity. However, that would mean that DELETEing longer-lived data (i.e. more than <replication cycle time> old) would get a 404 instead of a 204. On that balance, both options seem relatively equal in "rightness", but favoring longer-lived objects likely impacts fewer requests.

An alternative would be to always return 204 from the object server if a tombstone is written, since we *did* write a tombstone and it will be replicated out. Then the proxy would do quorum detection as normal. However, this would hide a "tried to delete a non-existent object" by returning 204 instead of 404.

We can't skip writing a tombstone when the .data file isn't found. That happens for many reasons in a normal cluster, and we need to ensure the delete action gets persisted. So there are a few options:

 * Leave it as-is, and know that with write-affinity you'll get more 404s on DELETEs than you may expect. Also, maybe turn off write-affinity.
 * Switch DELETE to use write-affinity and get more 404s on older data. Also, maybe turn off write-affinity.
 * Change the object server to always return 204 if the tombstone is written. This means that DELETEs to objects that were never written will also return a 204 to the client.

clayg (clay-gerrard) wrote :

A strategy of using the timestamps in the responses to make better choices might be a good idea


clayg (clay-gerrard) wrote :

So at first glance one might think object controller DELETE already does this:


But it turns out no,


... the overrides only kick in if you do *not* have quorum

when the responses are:

[204, Timeout, 404]: there will be no "majority quorum", so the 404 gets translated to a 204...

[204, Timeout, 404] becomes [204, Timeout, 204], and the quorum is now clearly 204 (i.e. in the *absence* of a clear majority we can be optimistic). This behavior is particularly useful in a 2-replica ring, or any even-replica config more generally, where a strong majority is harder to come by under failure.

Unfortunately, however, the quorum is *equally* clear when the responses are

[204, 404, 404] - the *majority* said 404 :'(
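A toy model of this behavior (illustrative only; Swift's real logic lives in best_response and the proxy's DELETE handler) shows why those two response sets diverge:

```python
from collections import Counter

def best_response(statuses, quorum):
    """Toy model of the proxy's DELETE status resolution; None stands
    for a timeout. Not Swift's actual best_response."""
    answered = [s for s in statuses if s is not None]
    # A clear majority wins outright, with no override applied ...
    for status, count in Counter(answered).most_common():
        if count >= quorum:
            return status
    # ... and only when *lacking* quorum are 404s optimistically
    # treated as 204s before looking for a majority again.
    translated = [204 if s == 404 else s for s in answered]
    for status, count in Counter(translated).most_common():
        if count >= quorum:
            return status
    return 503  # still no quorum

print(best_response([204, None, 404], quorum=2))  # 204: override kicks in
print(best_response([204, 404, 404], quorum=2))   # 404: majority, no override
```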

It's easy to imagine this set of responses after the following series of requests with merely a *single* timeout:

t0 PUT => [201, 201, 201] => 201
t1 DELETE => [204, 204, Timeout] => 204
t2 DELETE => [404, 404, 204] => 404

In this scenario the 404 at t2 is "correct" in the sense that the object was already deleted at t1.

This scenario was discussed in depth in the original review that added override_responses https://review.openstack.org/#/c/114120/

In the 3 region 3 replica write_affinity case however, it's equally likely to see the [404, 404, 204] response at t1 even with *no* failures - because the .data files will be on handoffs until the async replication moves the replicas to the remote regions.

Ideally, in the situation where we collect mostly 404 but with *some* 204's we will get more information from handoff nodes before we make the final decision.

e.g. in the case where an object node missed the DELETE at t1 and poisons the response at t2, another request to a handoff node wouldn't change the majority response:

t2 DELETE => [404, 404, 204] => [..., 404] => [404, 404, 204, 404] => 404

It's going to be annoying to fix the 3-region issue without breaking this :\

But if we can add in the handoff requests in a write affinity global cluster it has two benefits over what's implemented today:

#1) we'd confidently respond 204 to a DELETE iff the majority of nodes actually *holding* the .data files responded 204
#2) we would actually *remove* the .data files from the handoff nodes, freeing up space and preventing them from having to rsync to the remote region before being swallowed by the tombstones there when the rehash cleans them up.

The DELETE requests to the backend storage nodes are ultimately made here:


The most explicit and straightforward way I can think of to get the extra requests working is to have something like a:

"handoff_delete = N"

option that defaults to 0 but indicates how many get_more_nodes handoff nodes get chained onto the node iter and used to spawn _make_requests (backend headers can just be itertools cycled)
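A rough sketch of that chaining (the helper name and node names are illustrative, not Swift's actual internals):

```python
import itertools

def delete_node_iter(primaries, handoffs, handoff_delete=0):
    """Chain handoff_delete extra handoff nodes onto the primary node
    iterator; the DELETE then fans out to every yielded node.
    Illustrative sketch, not Swift code."""
    extra = itertools.islice(iter(handoffs), handoff_delete)
    return itertools.chain(iter(primaries), extra)

primaries = ['r1z1-10.0.0.1', 'r2z1-10.0.1.1', 'r3z1-10.0.2.1']
handoffs = ['r1z2-10.0.0.2', 'r1z3-10.0.0.3', 'r2z2-10.0.1.2']
# With handoff_delete=2 the DELETE fans out to 5 nodes instead of 3,
# so local handoffs still holding .data files get tombstoned too.
print(list(delete_node_iter(primaries, handoffs, handoff_delete=2)))
```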

I thought about doing it more dynamically...


Tim Burke (1-tim-z) wrote :

Clay, do you see this interacting with the per-policy config? Like, would you want to see handoff_delete be override-able in that first pass or shortly thereafter?

Considering the [404, 404, 204, 204, 204, 404, 404, 404, 404, ... N] case -- should we consider 404s without X-Timestamp (beyond the first num_replicas responses) like timeouts, so that when we're looking for a quorum we just see the [204, 204, 204] responses? And if there are only two successful deletes, we'd see [204, 204, 404] and still respond 204... but just one will get us [204, 404, 404] and we could respond 404....
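Tim's filtering idea might look roughly like this (the helper and the (status, timestamp) pairs are invented for illustration; not Swift code):

```python
def effective_statuses(responses, num_replicas):
    """Beyond the first num_replicas responses, a 404 without an
    X-Backend-Timestamp tells us nothing, so treat it like a timeout
    and drop it before looking for a quorum."""
    kept = []
    for i, (status, timestamp) in enumerate(responses):
        if i >= num_replicas and status == 404 and timestamp is None:
            continue  # uninformative 404 from a node that never had data
        kept.append(status)
    return kept

responses = [(404, None), (404, None), (204, 't1'), (204, 't1'),
             (204, 't1'), (404, None), (404, None)]
print(effective_statuses(responses, num_replicas=3))
# -> [404, 404, 204, 204, 204]: the trailing timestamp-less 404s no
#    longer drown out the 204s
```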

clayg (clay-gerrard) wrote :

FWIW this doc string:


correctly describes that overrides only come into effect when *lacking* quorum

clayg (clay-gerrard) wrote :

Yeah, additional context from timestamps would only help make better decisions.

Certainly a 404 w/o a timestamp is less valuable than a 404 w/ a timestamp, and a 404 w/ a timestamp is less valuable still than a 204 w/ a timestamp. The more context we can bring in, the better. But the more I think about those handoffs, the more I think we need a way to make more than replica-count requests. An explicit (per-policy) config var would be sufficient.

clayg (clay-gerrard) on 2017-05-17
Changed in swift:
status: New → Confirmed
importance: Undecided → Medium
Matthew Oliver (matt-0) wrote :

Ahh the overrides do tend to get a bit confusing for people.

So this is interesting. I talked briefly to Bruno in Boston and feel his pain, so we should totally fix this. So far the best I've seen is Clay's suggestion of supporting handoff affinity for DELETEs; this should help somewhat. But there also might be merit in what John was saying.

One option could be:
If DELETEs _could_ follow handoffs in Clay's approach, or maybe even if we don't do that, we could drop a .ts (on handoffs or primaries) so long as at least 1 node returns a 204. We did delete one copy, and even if something else happens in the cluster, the <timestamp>.ts will sort itself out in the long run.
This way, since we did issue a delete, I think we can safely return 204:

   t1 DELETE [204, 404, 404] --> [204, 204 (ts), 204 (ts)]
   t2 PUT [ 404, 200, 200 ] --> 200

In this case the .ts files have no effect, as they will be cleared out by the PUT at t2, except for 2 zero-byte files that may exist on handoffs for a while. However, in the system the DELETE is an event and it did happen, so we did the right thing.

Though I guess the downside is that there is a chance we delete without a quorum, which is problematic. But on the other hand, if we get a 404 we have no idea whether the object lives on handoffs or has already been deleted.

Damn, why do people want to delete on an object store anyway :P

Another option would be to do exactly as Clay says, and after thinking about it some more I tend to agree. Have DELETEs use write affinity such that the node iter orders itself (handoff_deletes = 2):

  [pri, pri, pri, affinity_handoff, affinity_handoff, handoff]

And only read 2x replicas or something.

My 2 cents.

clayg (clay-gerrard) wrote :

Mostly orthogonal, but for context it's helpful to remember the original fix was for solving situations like lp bug #1692380

... If we go with the explicit "request_delete_count" 2-replica configs would probably just keep it @ the default value of "replica" (or maybe "0" if the option is implemented as "handoff_delete_count"?) because an extra 404 in the 2-replica case would result in a majority 404 which is *not* what we want (assuming primary 404'd because of rebalance instead of an earlier write_affinity/handoff PUT).

Here again x-backend-timestamps (or lack thereof) could offer additional context/insight.

Tim Burke (1-tim-z) wrote :

FWIW, I think there's something similar affecting POSTs...

Lingxian Kong (kong) wrote :
Changed in swift:
assignee: nobody → Lingxian Kong (kong)
Changed in swift:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/470158
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=831eb6e3ce317d8209178dd7e4dee509edd179cf
Submitter: Jenkins
Branch: master

commit 831eb6e3ce317d8209178dd7e4dee509edd179cf
Author: Lingxian Kong <email address hidden>
Date: Fri Jun 2 15:44:52 2017 +1200

    Write-affinity aware object deletion

    When deleting objects in a multi-region swift deployment with write
    affinity configured, users always get a 404 when deleting an object
    before it's replicated to the appropriate nodes.

    This patch adds a config item 'write_affinity_handoff_delete_count' so
    that operator could define how many local handoff nodes should swift
    send request to get more candidates for the final response, or by
    default just leave it to swift to calculate the appropriate number.

    Change-Id: Ic4ef82e4fc1a91c85bdbc6bf41705a76f16d1341
    Closes-Bug: #1503161
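The option added by this fix lives in the proxy config; a minimal illustrative fragment (the surrounding affinity settings are examples, not requirements):

```ini
[app:proxy-server]
use = egg:swift#proxy
sort_method = affinity
read_affinity = r1=100
write_affinity = r1
write_affinity_node_count = 2 * replicas
# How many local handoff nodes to send DELETEs to in addition to the
# primaries; "auto" lets the proxy calculate a suitable number.
write_affinity_handoff_delete_count = auto
```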

Changed in swift:
status: In Progress → Fix Released

This issue was fixed in the openstack/swift 2.15.0 release.
