sharding: container GETs to root container get slow

Bug #1781291 reported by John Dickinson on 2018-07-11
Affects: OpenStack Object Storage (swift)
Importance: Medium
Assigned to: Unassigned

Bug Description

During an object PUT to a sharded container, the proxy does a GET to the container server to get the shard ranges. This container GET starts to get really slow under high concurrency.

To reproduce, make a container with a lot of shards. Then start concurrently putting new objects into it. As concurrency increases, the container GET gets very slow.
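A rough repro sketch (not part of the original report; the auth details and container name here are made up), using python-swiftclient against a test cluster and assuming the container has already been sharded:

    # Concurrent object PUTs into an already-sharded container; watch how the
    # average PUT time climbs as concurrency goes up.
    from concurrent.futures import ThreadPoolExecutor
    import time

    from swiftclient.client import Connection

    def put_one(i):
        conn = Connection(authurl='http://127.0.0.1:8080/auth/v1.0',
                          user='test:tester', key='testing')
        start = time.time()
        conn.put_object('sharded', 'obj-%06d' % i, contents=b'x')
        return time.time() - start

    if __name__ == '__main__':
        for concurrency in (1, 8, 32, 128):
            with ThreadPoolExecutor(max_workers=concurrency) as pool:
                timings = list(pool.map(put_one, range(concurrency * 10)))
            print('concurrency=%d avg PUT time %.3fs' % (
                concurrency, sum(timings) / len(timings)))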

There are a few ideas for fixing this (none of them final):

Cache the shard ranges in the proxy. Since they are unbounded, memcache isn't likely a good place. Maybe some in-process memory could be used with some sort of LRU on it (see the rough sketch after these options). But then you've got cache invalidation issues...

Defer the child DB update until later. Only tell the object server to update the root, and let the updater handle the redirects. Also let the root DB enqueue child updates.

Instead of updating the root, send the updates to the root's handoffs (some number of them) so that they get sharded in place and the sharder handles getting the updates to the right place.

Don't do any of the deferring in the server and instead slow down the client (i.e. backpressure).
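A very rough sketch of what the in-process LRU idea (the first option above) could look like; the class and method names are made up for illustration:

    # Illustrative only: a tiny in-process LRU for shard ranges, keyed by the
    # root container path. Invalidation is ignored here beyond plain eviction.
    from collections import OrderedDict

    class ShardRangeCache(object):
        def __init__(self, max_containers=1000):
            self.max_containers = max_containers
            self._cache = OrderedDict()  # 'acct/cont' -> list of shard ranges

        def get_or_fetch(self, root_path, fetch_func):
            if root_path in self._cache:
                self._cache.move_to_end(root_path)  # refresh LRU position
                return self._cache[root_path]
            shard_ranges = fetch_func(root_path)  # the existing container GET
            self._cache[root_path] = shard_ranges
            if len(self._cache) > self.max_containers:
                self._cache.popitem(last=False)  # evict least recently used
            return shard_ranges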

Matthew Oliver (matt-0) wrote :

I like the idea of caching in the proxy, as the other options just leave it to the sharder to move things later, i.e. putting the work off. With cached shard ranges, "most" of the time the updates will go to the correct place, and when they don't, we're only deferring the work until later.

I don't think invalidation is strictly necessary: if things go to the wrong place before the cache time/limit is up, we're just giving the sharder some extra work.
If we _want_ invalidation, we could have get_container_info include the latest shard range timestamp of the container (timestamp, not meta_timestamp, as we'd only care if there was a structural change in the shard network). When this timestamp is greater than the timestamp of the root container in the cache, we invalidate it and force a new GET.
The downside is that the info calls are also cached, so is invalidation really that important then?
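To illustrate that check (sketch only: get_container_info doesn't currently expose such a timestamp, so the 'sharding_timestamp' key is made up):

    # Hypothetical invalidation check: compare the root's latest shard range
    # timestamp (assumed to be surfaced via container info) with the timestamp
    # stored alongside the cached shard ranges.
    from swift.common.utils import Timestamp

    def cached_ranges_still_valid(cached_entry, container_info):
        # cached_entry: {'timestamp': Timestamp, 'shard_ranges': [...]}
        latest = Timestamp(container_info.get('sharding_timestamp', 0))
        return latest <= cached_entry['timestamp']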

The real problem with shard caching is what to cache. If you look at the object controller in the proxy, we only ask for the shard in which the object resides (https://github.com/openstack/swift/blob/2.18.0/swift/proxy/controllers/obj.py#L269-L281). That isn't a good cache candidate, as it _should_ mostly just return a single shard range.
And then there's the problem of what happens when there are thousands of shards. Maybe in that case sending to the wrong one isn't too bad a cost?
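For reference, that lookup is roughly of this shape (a paraphrase, not the actual proxy code; the device/partition in the URL are placeholders):

    # The proxy asks the root container for just the shard range that covers
    # the object name, so the response is usually a single shard range.
    import json
    import requests  # stand-in for the proxy's internal backend requests

    def get_update_shard(container_server, account, container, obj):
        resp = requests.get(
            'http://%s/sda1/0/%s/%s' % (container_server, account, container),
            headers={'X-Backend-Record-Type': 'shard'},
            params={'includes': obj, 'format': 'json'})
        ranges = json.loads(resp.text)
        return ranges[0] if ranges else None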

So what do we cache? We could cache the results of container GETs... but how often do those happen?

We should get the sharder to update some memcache cache that the proxy can use... but then what to cache? We don't want to cache _everything_, only the hot containers I'd expect.

Anyway, just thought I'd put some comments in as I spend some time thinking about it.

Maybe caching isn't the answer then? Or is there some way to identify the top containers to cache? Proxies update a hotlist in memcache, and sharders/replicators/auditors (something) keep caches of the shards?
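For what it's worth, the hotlist half of that could be as simple as proxies bumping a per-container counter in memcache (sketch only; the key name is made up):

    # Proxies bump a per-container counter on each object update; something
    # (the sharder?) could then decide which containers' shard ranges are
    # worth keeping cached.
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def bump_hotness(account, container):
        key = 'shard-hotlist/%s/%s' % (account, container)
        if mc.incr(key) is None:
            # counter doesn't exist yet; add() is a no-op if we lose the race
            mc.add(key, 1, time=60)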

/me is just brainstorming

Matthew Oliver (matt-0) wrote :

s/We should get the sharder to update some memcache cache/We could get the sharder to update some memcache cache/

clayg (clay-gerrard) wrote :

I agree caching is hard, but we need to keep thinking about it. The caching doesn't have to be in the proxy-server process or memcache.

I'm still concerned that the PUT rate on a single replicated but unsharded container's objects table (even accounting for .pending file bulk inserts) is faster than the GET rate against a single replicated, sharded root container's shard range table. It doesn't make sense to me why that should obviously be true.

I really want to do some experimental analysis to provide some data on the cost of these operations directly against the container-server API, varying concurrency and cardinality, so we can all look at it.
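Something as crude as the following would probably be enough to get first numbers (sketch only; the port, device/partition, and params are placeholders):

    # Time container-server GETs directly at varying concurrency.
    from concurrent.futures import ThreadPoolExecutor
    import time

    import requests

    URL = 'http://127.0.0.1:6201/sda1/0/AUTH_test/sharded_container'

    def timed_get(_):
        start = time.time()
        requests.get(URL, params={'format': 'json'},
                     headers={'X-Backend-Record-Type': 'shard'})
        return time.time() - start

    for concurrency in (1, 8, 32, 128):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            timings = list(pool.map(timed_get, range(concurrency * 20)))
        print('concurrency=%d avg %.4fs max %.4fs' % (
            concurrency, sum(timings) / len(timings), max(timings)))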

It's possible the solution is to index the shard ranges table, tweak vfs_cache_pressure and let the page cache do its thing.
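For the indexing part, a quick local experiment could be as blunt as adding an index over the range bounds in one root container DB and re-running the GETs (the table and column names below are from memory, so check them against the actual schema first):

    # Quick-and-dirty: index the shard range bounds in a root container DB.
    import sqlite3

    db_path = '/path/to/root_container.db'  # illustrative path

    with sqlite3.connect(db_path) as conn:
        conn.execute('CREATE INDEX IF NOT EXISTS ix_shard_range_bounds '
                     'ON shard_range (lower, upper)')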
