old async pendings result in ghost listings

Bug #1619408 reported by clayg
Affects: OpenStack Object Storage (swift)
Status: In Progress
Importance: Medium
Assigned to: Matthew Oliver

Bug Description

If you leave a container disk unmounted for too long there's a really good chance you're going to end up with ghost listings [1] - this is mainly because async pendings for really old transactions are super edge-casey.

One way I've seen this happen is with write_affinity turned on. You can get an async for a PUT on some random handoff in the local region and the DELETE over on the primary. The updater will keep trying that container and getting that 507 until you fix your rings - long after the container-replicator has made all the primaries agree that the tombstone row can be reclaimed.

Once the new disk(s) come online and rings go out - BOTH updaters on separate nodes send their respective updates to the new database and there's a bunch of possible edge case outcomes.

1) if the DELETE async fails [2] or is lost, the PUT will create the ghost listing
2) if the DELETE async 204's but gets reclaimed, the PUT will create the ghost listing

I'm not sure exactly how common this is, but we know ghost listings are a problem and this is one scenario where it can definitely happen (see unittest).

I think we just need to be more careful when doing a container update for a PUT with a timestamp that's older than reclaim age.

1. Ghost Listings are rows in the container database replicas for deleted objects.
2. if the new device doesn't have a container db yet they get a 404 (Related lp bug #1328735)
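To make the failure mode concrete, here's a toy model (plain Python, not real Swift code - the merge/reclaim helpers and float timestamps are simplified stand-ins for what the container broker does) showing how reclaiming the tombstone before the stale PUT update arrives leaves a ghost row:

    # Toy model of container listing rows: name -> (timestamp, deleted flag).
    # Timestamps are plain floats; real Swift uses swift.common.utils.Timestamp.
    RECLAIM_AGE = 604800  # one week, the default reclaim_age

    def merge_row(db, name, timestamp, deleted):
        # Newest timestamp wins, roughly how container rows merge.
        current = db.get(name)
        if current is None or timestamp > current[0]:
            db[name] = (timestamp, deleted)

    def reclaim(db, now):
        # Drop tombstone rows older than reclaim_age.
        for name, (ts, deleted) in list(db.items()):
            if deleted and ts < now - RECLAIM_AGE:
                del db[name]

    db = {}
    put_ts, delete_ts = 1000.0, 2000.0

    # The DELETE update lands first (its async pending was on a healthy node)...
    merge_row(db, 'obj', delete_ts, deleted=True)
    # ...and the tombstone gets reclaimed once it is older than reclaim_age.
    reclaim(db, now=delete_ts + RECLAIM_AGE + 1)

    # Much later the stale PUT async pending finally gets processed: the
    # tombstone is gone, so nothing newer beats it and it creates a row for an
    # object that no longer exists - a ghost listing.
    merge_row(db, 'obj', put_ts, deleted=False)
    print(db)  # {'obj': (1000.0, False)}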

Revision history for this message
clayg (clay-gerrard) wrote :
Revision history for this message
Charles Hsu (charles0126) wrote :

I think this does not only happen when write_affinity is enabled. If you send PUT requests for an object multiple times and then DELETE it, you might end up with a DELETE request in async_pending on node-A and another PUT request in async_pending on node-B. So if the drive is still unmounted when the object is fully removed (reclaimed) from the container DB, and the drive is then mounted back (or a ring change happens) and the DELETE request finishes before the PUT requests, you'll see this issue.

* Force the object-updater to check the handoff location when it gets a 507 from the primary location, and send the update there if the container DB is in the handoff location. If not, just skip it, leave a warning message and go on to the next run. I think the container-replicator will take care of the DB and sync it to the handoff location.
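
For what it's worth, a rough sketch of that proposal (not real Swift code - container_update_request() and handoff_has_container_db() are made-up helpers standing in for the updater's HTTP plumbing; only the ring calls are real):

    from http.client import ACCEPTED, CREATED, INSUFFICIENT_STORAGE, NO_CONTENT, OK

    SUCCESS = (OK, CREATED, ACCEPTED, NO_CONTENT)

    def send_container_update(container_ring, part, update, logger):
        primaries = container_ring.get_part_nodes(part)
        statuses = [container_update_request(node, update) for node in primaries]
        if INSUFFICIENT_STORAGE not in statuses:
            return all(status in SUCCESS for status in statuses)
        # A primary answered 507 (disk unmounted): only send to a handoff if the
        # container DB already lives there; otherwise skip with a warning and
        # let the container-replicator sort things out before the next pass.
        for handoff in container_ring.get_more_nodes(part):
            if handoff_has_container_db(handoff, update):
                return container_update_request(handoff, update) in SUCCESS
        logger.warning('507 from a primary and no handoff holds the container '
                       'DB; skipping update %r until the next run', update)
        return False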

Revision history for this message
Charles Hsu (charles0126) wrote :

I might know why the async_pending requests are distributed and stored on different object servers and drives.

The X-Container-* headers (X-Container-Host, X-Container-Device, ...) are consistent from each proxy node,
but the proxy will shuffle the object PUT requests [1] and add these X-Container-* headers to the PUT requests in order [2].

For example, if a client issues a request (first-PUT) to the proxy server, it fans out three requests to the object servers below,

   [{'ip': u'192.168.200.21', 'device': u'd1', ...},
    {'ip': u'192.168.200.22', 'device': u'd2', ...},
    {'ip': u'192.168.200.23', 'device': u'd3', ...},
    ...]

And another client request (second-PUT) for the same object could be,

   [{'ip': u'192.168.200.22', 'device': u'd2', ...},
    {'ip': u'192.168.200.23', 'device': u'd3', ...},
    {'ip': u'192.168.200.21', 'device': u'd1', ...},
    ...]

And if the X-Container-Host and X-Container-Device headers are as below,

   [{'X-Container-Host': '192.168.200.21:6001', 'X-Container-Device': 'd10', ...},
    {'X-Container-Host': '192.168.200.22:6001', 'X-Container-Device': 'd11', ...},
    {'X-Container-Host': '192.168.200.23:6001', 'X-Container-Device': 'd12', ...}]

So if the account/container drive `d11` is unmounted, you'll see the async_pending for the first-PUT stored on "192.168.200.22/d2" and
the async_pending for the second-PUT stored on "192.168.200.23/d3". The object-updaters on `192.168.200.22` and `192.168.200.23` then both try to update the same container DB under `d11`,
which can cause a race condition after the container reclaim age.

[1] https://github.com/openstack/swift/blob/f1b1bad79d990c82b254e5403d31c16a552702f7/swift/proxy/controllers/obj.py#L132-L168
[2] https://github.com/openstack/swift/blob/f1b1bad79d990c82b254e5403d31c16a552702f7/swift/proxy/controllers/obj.py#L547-L552
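
To illustrate the mismatch (toy code, not the real proxy logic - just shuffled object nodes zipped against in-order container update targets):

    import random

    object_nodes = ['192.168.200.21/d1', '192.168.200.22/d2', '192.168.200.23/d3']
    container_nodes = ['192.168.200.21/d10', '192.168.200.22/d11', '192.168.200.23/d12']

    def fan_out_put(label):
        targets = object_nodes[:]
        random.shuffle(targets)  # node order differs per client PUT
        for obj_node, cont_node in zip(targets, container_nodes):
            print(label, 'object PUT ->', obj_node,
                  'carries the container update for', cont_node)

    fan_out_put('first-PUT ')
    fan_out_put('second-PUT')
    # If d11 is unmounted, the async pending for the d11 update can land under a
    # different object node/device for each PUT, so two different object-updaters
    # later race to update the same container DB.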

clayg (clay-gerrard)
Changed in swift:
importance: Undecided → High
Revision history for this message
Matthew Oliver (matt-0) wrote :

OK, so updating the object-updater to unlink async pendings older than reclaim age is relatively simple (see example patch).
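
Something like this is the gist of it (a minimal sketch, not the actual patch - it relies on async_pending filenames ending in the update's timestamp and ignores timestamp offsets for simplicity):

    import os
    import time

    def process_async_pending(path, reclaim_age, logger):
        # async_pending files are named <hash>-<timestamp>
        timestamp = float(os.path.basename(path).rsplit('-', 1)[1])
        if time.time() - timestamp > reclaim_age:
            logger.info('Unlinking stale async pending %s (older than '
                        'reclaim_age)', path)
            os.unlink(path)
            return
        # ...otherwise fall through to the normal container update path...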

But if you wanted to try to stop this update happening from the container side, then I'd drop it in the container server rather than anywhere in the broker, because otherwise there would be sharding ramifications: when sharding (and using merge_items) I do send old timestamps, because I still want potential merging to happen rather than going ahead and creating my own ghost listings the sharding way (i.e. moving an item at t1 with a new timestamp t3 when another container has it deleted at t2).
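
If we did go the container-server route, the guard could be as small as something like this (hypothetical, not the real swift.container.server handler), sitting in the object PUT/DELETE update path so merge_items is left alone and sharding can still merge old-timestamped rows:

    import time

    def should_accept_object_update(update_timestamp, reclaim_age):
        # update_timestamp is plain epoch seconds here; real Swift would use
        # swift.common.utils.Timestamp.
        return time.time() - float(update_timestamp) <= reclaim_age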

Revision history for this message
Matthew Oliver (matt-0) wrote :

Oops, I meant this patch - the last one was from a small refactor.

Revision history for this message
clayg (clay-gerrard) wrote :

I... think I'm OK with unlinking updates once they're a reclaim age old? The risk is dark data.

The other extreme would be to introduce a "check": if an async update is "old enough" we use direct_client to inspect the available primary object and container nodes and apply some sort of heuristic to decide whether the update should be dropped or applied (e.g. an async PUT where I see the data on a majority of primary data nodes but not in any container, or an async DELETE with a 404 or an older timestamp on a majority of primary data nodes but the row still in all the containers, etc.)

... that's a bunch more code tho - would it be worth it?
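
A very rough sketch of what the PUT half of such a check could look like, using swift.common.direct_client (signatures simplified, error handling and the DELETE side omitted; the majority/listing heuristic is just one possible policy):

    from swift.common import direct_client
    from swift.common.exceptions import ClientException

    def listed_in_container(node, part, account, container, obj):
        # Ask one container replica whether the object shows up in its listing.
        _headers, listing = direct_client.direct_get_container(
            node, part, account, container, prefix=obj, limit=1)
        return any(row['name'] == obj for row in listing)

    def should_apply_old_put_update(object_ring, container_ring,
                                    account, container, obj, update_timestamp):
        part, nodes = object_ring.get_nodes(account, container, obj)
        found = 0
        for node in nodes:
            try:
                headers = direct_client.direct_head_object(
                    node, part, account, container, obj)
                # naive timestamp comparison; ignores offsets
                if float(headers['X-Timestamp']) >= float(update_timestamp):
                    found += 1
            except ClientException:
                pass
        if found <= len(nodes) // 2:
            return False  # data isn't on a majority of primaries; drop it
        cont_part, cont_nodes = container_ring.get_nodes(account, container)
        # Apply the PUT only if no container replica lists the object yet.
        return not any(
            listed_in_container(node, cont_part, account, container, obj)
            for node in cont_nodes)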

I think most of the time I've seen ghost listings come from container replication on re-ingested nodes rather than asyncs, but I don't really know if I can say that for sure.

N.B. I don't think the updater is documented to have a reclaim_age option; that said, I'm sure everyone has already moved this value to the DEFAULT section - see lp bug #1626296

Revision history for this message
Matthew Oliver (matt-0) wrote :

Oh yeah, I added reclaim age here, which isn't currently an option of the updater... so yeah. But if people are moving it to DEFAULT then awesome, that'll make life easier.

clayg (clay-gerrard)
Changed in swift:
importance: High → Medium
Revision history for this message
Matthew Oliver (matt-0) wrote :

Here is a newer version of the minimal approach (unlink async pendings older than reclaim age). Clay, I agree something that goes and checks would be a lot of code, but it would be better... so I'll leave this here as an option while I think about the more codey approach :)

NOTE: This patch is based on the reclaim age moving to the DEFAULT section (https://review.openstack.org/#/c/374419), so it would be a follow-up patch.

Changed in swift:
assignee: nobody → Matthew Oliver (matt-0)
status: New → Triaged
clayg (clay-gerrard)
Changed in swift:
status: Triaged → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/527296

Changed in swift:
status: Confirmed → In Progress