Comment 11 for bug 1503161

clayg (clay-gerrard) wrote :

So at first glance one might think object controller DELETE already does this:

https://github.com/openstack/swift/blob/2abffb99b9e1dc2f01d86801f4197eb5735e057c/swift/proxy/controllers/obj.py#L605

But it turns out no,

https://github.com/openstack/swift/blob/2abffb99b9e1dc2f01d86801f4197eb5735e057c/swift/proxy/controllers/obj.py#L605

... the overrides only kick in if you do *not* have quorum.

when the responses are:

[204, Timeout, 404] there will be no "majority quorum", so the 404s get translated to 204...

[204, Timeout, 404] becomes [204, Timeout, 204] and the quorum is now clearly 204 (i.e. in the *absence* of a clear majority we can be optimistic). This behavior is particularly useful in a 2-replica ring, or any even-replica config more generally, where a strong majority is harder to come by under failure.
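To make the interaction concrete, here's a minimal sketch (my own toy code, *not* the actual best_response implementation) of how I read the quorum/override behavior — a plain majority wins outright, and the 404 -> 204 override only runs when there is no quorum:

```python
from collections import Counter

def resolve_delete(statuses, replica_count=3):
    # Hypothetical simplification of the proxy's best-response logic:
    # a clear majority of statuses wins before the override is consulted.
    quorum = replica_count // 2 + 1
    status, n = Counter(statuses).most_common(1)[0]
    if n >= quorum:
        return status            # clear majority -- override never runs
    # no quorum: optimistically treat 404s as 204s and recount
    overridden = [204 if s == 404 else s for s in statuses]
    status, n = Counter(overridden).most_common(1)[0]
    return status if n >= quorum else 503

# [204, Timeout, 404]: no majority, so the override turns the 404 into 204
print(resolve_delete([204, 'Timeout', 404]))   # 204
# [204, 404, 404]: the 404 majority wins before the override can help
print(resolve_delete([204, 404, 404]))         # 404
```

Which is exactly the asymmetry described below: the override rescues the mixed/Timeout case but is never reached when the 404s already have the numbers.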

Unfortunately, however, the quorum is *equally* clear when the responses are

[204, 404, 404] - the *majority* said 404 :'(

It's easy to imagine this set of responses after the following series of requests with merely a *single* timeout:

t0 PUT => [201, 201, 201] => 201
t1 DELETE => [204, 204, Timeout] => 204
t2 DELETE => [404, 404, 204] => 404

In this scenario the 404 at t2 is "correct" in the sense that the object was already deleted at t1.

This scenario was discussed in depth in the original review that added override_responses: https://review.openstack.org/#/c/114120/

In the 3-region, 3-replica write_affinity case, however, it's equally likely to see the [404, 404, 204] response at t1 even with *no* failures - because the .data files will be on handoffs until async replication moves the replicas to the remote regions.

Ideally, in the situation where we collect mostly 404s but with *some* 204s, we would get more information from handoff nodes before we make the final decision.

e.g. in the case where an obj node missed the DELETE at t1 and poisons the response at t2, another request to a handoff node wouldn't change the majority response:

t2 DELETE => [404, 404, 204] => [..., 404] => [404, 404, 204, 404] => 404
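Tallying that extended response set (a toy check, not proxy code) shows why the extra request is harmless here - the handoff's 404 only reinforces the majority that already existed:

```python
from collections import Counter

# t2 DELETE fanned out to the 3 primaries plus 1 handoff (hypothetical)
statuses = [404, 404, 204, 404]
status, n = Counter(statuses).most_common(1)[0]
print(status, n)   # 404 3 -- the handoff response can't flip the majority
```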

It's going to be annoying to fix the 3-region issue without breaking this :\

But if we can add in the handoff requests in a write-affinity global cluster, it has two benefits over what's implemented today:

#1) we'd confidently respond 204 to a DELETE iff the majority of nodes actually *holding* the .data files responded 204
#2) we would actually *remove* the .data files from the handoff nodes, freeing up space and preventing them from having to rsync to the remote before being swallowed by the tombstones in the remote when the rehash cleans them up.

The DELETE requests to the backend storage nodes are ultimately made here:

https://github.com/openstack/swift/blob/2abffb99b9e1dc2f01d86801f4197eb5735e057c/swift/proxy/controllers/base.py#L1595

The most explicit and straightforward way I can think of to get the extra requests to work is to have something like a:

"handoff_delete = N"

option that defaults to 0 but indicates how many get_more_nodes handoff nodes get chained onto the node iter used to spawn _make_requests (backend headers can just be itertools-cycled).
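Roughly what I have in mind - a sketch under my own assumptions, not the actual _make_requests plumbing (`delete_node_iter` and the node/header names are made up for illustration):

```python
from itertools import chain, cycle, islice

def delete_node_iter(primaries, handoffs, handoff_delete=0):
    # Hypothetical "handoff_delete = N" option: chain N extra handoff
    # nodes onto the primary node iter so the proxy spawns that many
    # additional backend DELETE requests.
    return chain(primaries, islice(handoffs, handoff_delete))

primaries = ['p0', 'p1', 'p2']
handoffs = iter(['h0', 'h1', 'h2'])      # stand-in for get_more_nodes()
nodes = list(delete_node_iter(primaries, handoffs, handoff_delete=2))
print(nodes)                             # ['p0', 'p1', 'p2', 'h0', 'h1']

# the per-replica backend headers can just be itertools-cycled across
# however many requests the extended iter produces
headers = list(zip(nodes, cycle(['hdr0', 'hdr1', 'hdr2'])))
print(headers[3])                        # ('h0', 'hdr0')
```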

I thought about doing it more dynamically - only if the first three responses were mixed (i.e. [404, 404, 204]) - but the early quorum wait there is going to be super annoying if the 404s are faster than the 204. Besides, how do you decide when to stop? If the policy using write affinity in a three-replica, three-region cluster was *configured* to set handoff_delete = 2, they would *always* get:

[404, 404, 204, 204, 204] *before* async replication
[204, 404, 204, 404, 204] *during* async replication
[204, 204, 204, 404, 404] *after* async replication

... and the majority is always clear, even without the override to make up the slop under failure.
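A quick tally bears that out (toy check, not Swift code, assuming the handoff_delete = 2 setup above):

```python
from collections import Counter

# the three phases of async replication in a 3-replica / 3-region
# write-affinity cluster with handoff_delete = 2
phases = {
    'before': [404, 404, 204, 204, 204],
    'during': [204, 404, 204, 404, 204],
    'after':  [204, 204, 204, 404, 404],
}
for phase, statuses in phases.items():
    status, n = Counter(statuses).most_common(1)[0]
    print(f'{phase}: {status} ({n} of {len(statuses)})')   # 204 (3 of 5) every time
```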

The added benefit of being simple, explicit, and independently tunable is that I get to use handoff_delete = N for nefarious situations where I have data on handoffs because of 507s.

So instead of:

t0 PUT => [507, 507, 201, 201, 201] => 201
t1 DELETE => [404, 404, 204] => 404 (with the .data still on handoffs)

I could trick out my config to make N (say, 10) extra DELETE requests - and get:

t1 DELETE => [404, 404, 204, 204, 204, 404, 404, 404, 404, ... N] => 404

I still get a 404 response but hey, at least the .data files are reaped and I get to pull one out of the fire!?