sharding: container GETs to root container get slow

Bug #1781291 reported by John Dickinson
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

During an object PUT to a sharded container, the proxy does a GET to the container server to fetch the shard ranges. This container GET becomes very slow under high concurrency.

To reproduce, make a container with a lot of shards. Then start concurrently putting new objects into it. As concurrency increases, the container GET gets very slow.
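
A rough reproduction sketch (assuming a SAIO-style endpoint and python-swiftclient; the auth URL, credentials, container name, and worker/object counts are placeholders, and the container is presumed to have been sharded already, e.g. with swift-manage-shard-ranges):

    # Reproduction sketch: hammer an already-sharded container with concurrent
    # object PUTs while watching container GET latency on the root container.
    from concurrent.futures import ThreadPoolExecutor

    from swiftclient.client import Connection

    AUTH = dict(authurl='http://127.0.0.1:8080/auth/v1.0',
                user='test:tester', key='testing')
    CONTAINER = 'already_sharded_container'

    def put_one(i):
        # one connection per call keeps the workers independent
        conn = Connection(**AUTH)
        conn.put_object(CONTAINER, 'obj-%08d' % i, contents=b'x')

    with ThreadPoolExecutor(max_workers=64) as pool:  # raise to increase pressure
        list(pool.map(put_one, range(100000)))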

There are a few ideas for fixing this (none of them final):

Cache the shard ranges in the proxy. Since they are unbounded, memcache likely isn't a good place; maybe some in-process memory could be used, with some sort of LRU on it (a rough sketch follows this list of ideas). But then you've got cache invalidation issues...

Defer the child DB update until later. Only tell the object server to update the root, and let the updater handle the redirects. Also let the root DB enqueue child updates.

Instead of updating the root, send the updates to the root's handoffs (some number of them) so that they get sharded in place and the sharder handles getting the updates to the right place.

Don't do any of the deferring in the server and instead slow down the client (ie backpressure).
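
A minimal sketch of what the in-process LRU idea above could look like; the class name, key scheme, and default size are purely illustrative, not Swift API:

    # Illustrative in-process LRU for cached shard ranges, keyed by the root
    # container path; just the shape of the idea, not a proposed implementation.
    from collections import OrderedDict

    class ShardRangeLRU(object):
        def __init__(self, max_entries=1000):
            self.max_entries = max_entries
            self._cache = OrderedDict()  # 'account/container' -> shard ranges

        def get(self, root_path):
            try:
                self._cache.move_to_end(root_path)  # mark as most recently used
                return self._cache[root_path]
            except KeyError:
                return None  # caller falls back to a container GET

        def put(self, root_path, shard_ranges):
            self._cache[root_path] = shard_ranges
            self._cache.move_to_end(root_path)
            while len(self._cache) > self.max_entries:
                self._cache.popitem(last=False)  # evict least recently used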

Revision history for this message
Matthew Oliver (matt-0) wrote :

I like the idea of caching in the proxy, as the other options leave it to the sharder to move the updates later, i.e. they put the work off. In the sharded case, "most" of the time the updates will go to the correct place, and when they don't, we're just deferring them until later.

I don't think invalidation is necessary: if things go to the wrong place before the cache time/limit is up, then we're just giving the sharder some work.
If we _want_ invalidation, we could get get_container_info to include the latest shard range timestamp of the container (timestamp, not meta_timestamp, as we'd only care if there was a structural change in the shard network). When this timestamp is greater than the timestamp of the root container in the cache, we invalidate it and force a new GET.
The downside is that the info calls are also cached, so is it really that important to invalidate then?
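
A sketch of that timestamp-based check, assuming get_container_info() were extended with a (currently hypothetical) latest shard range timestamp field:

    # Hypothetical staleness check; 'latest_shard_range_timestamp' is NOT a
    # real get_container_info() field today, it's the extension suggested above.
    from swift.common.utils import Timestamp

    def shard_cache_is_stale(cached_root_timestamp, container_info):
        latest = container_info.get('latest_shard_range_timestamp')
        if latest is None:
            return False  # no signal; rely on the normal cache expiry
        # Only 'timestamp' (structural changes) matters here, not 'meta_timestamp'.
        return Timestamp(latest) > Timestamp(cached_root_timestamp)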

The real problem with shard caching is what to cache. If you look at the object controller in the proxy, we only ask for the shard the object resides in (https://github.com/openstack/swift/blob/2.18.0/swift/proxy/controllers/obj.py#L269-L281). That isn't a good cache candidate, since it _should_ mostly return just a single shard.
And then there's the problem of what happens when there are thousands of shards? Maybe in that case sending an update to the wrong one isn't too bad a cost?
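
For reference, that per-object lookup boils down to a shard-range GET against the root container scoped to a single object name; an illustrative direct request (host, device, partition, and names are placeholders) might look like:

    # Illustrative backend request mirroring the proxy's per-object shard lookup.
    import requests

    resp = requests.get(
        'http://127.0.0.1:6201/sdb1/1234/AUTH_test/sharded_container',
        headers={'X-Backend-Record-Type': 'shard'},  # ask for shard ranges
        params={'states': 'updating',                # ranges accepting updates
                'includes': 'some/object/name',      # only the range covering this name
                'format': 'json'})
    shard_ranges = resp.json()  # normally a single shard range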

So what do we cache? We could cache the results of container GETs... but how often does that happen?

We should get the sharder to update some memcache cache that the proxy can use... but then what do we cache? We don't want to cache _everything_, only the hot containers, I'd expect.

Anyway, just thought I'd put some comments in as I spend some time thinking about it.

Maybe caching isn't the answer then? Or is there some way to identify the top containers to cache? Proxies update a hotlist in memcache; sharders/replicators/auditors (something) keep caches of shards?

/me is just brainstorming

Revision history for this message
Matthew Oliver (matt-0) wrote :

s/We should get the sharder to update some memcache cache/We could get the sharder to update some memcache cache/

Revision history for this message
clayg (clay-gerrard) wrote :

I agree caching is hard, but we need to keep thinking about it. The caching doesn't have to be in the proxy-server process or memcache.

I'm still concerned that the PUT rate to a single replicated but unsharded container's objects table (even accounting for .pending file bulk inserts) is faster than the GET rate against a single replicated but sharded root container's shard range table. It doesn't make sense to me that this obviously needs to be true.

I really want to do some experimental analysis, exercising the container-server API directly while varying concurrency and cardinality, to provide some data on the cost of these operations so we can all look at it.

It's possible the solution is to index the shard ranges table, tweak vfs_cache_pressure, and let the page cache do its thing.
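
Something along these lines, for the "just index it" experiment (the DB path and column names are illustrative; check the actual container schema before trying this on a real cluster):

    # Sketch of the experiment: index the shard_range table of a root container
    # DB and let the OS page cache keep it hot. Path and columns are placeholders.
    import sqlite3

    DB_PATH = '/srv/node/sdb1/containers/PART/SUFFIX/HASH/HASH.db'

    conn = sqlite3.connect(DB_PATH)
    conn.execute('CREATE INDEX IF NOT EXISTS ix_shard_range_bounds '
                 'ON shard_range (lower, upper)')
    conn.commit()
    conn.close()

    # ...and on the container-server nodes, make the kernel less eager to
    # reclaim dentry/inode caches, e.g.:
    #   sysctl -w vm.vfs_cache_pressure=50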

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/667203
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=a1af3811a79963e2e5d1db3e5588cbc9748f9d57
Submitter: Zuul
Branch: master

commit a1af3811a79963e2e5d1db3e5588cbc9748f9d57
Author: Tim Burke <email address hidden>
Date: Mon Jun 24 12:25:33 2019 -0700

    sharding: Cache shard ranges for object writes

    Previously, we issued a GET to the root container for every object PUT,
    POST, and DELETE. This puts load on the container server, potentially
    leading to timeouts, error limiting, and erroneous 404s (!).

    Now, cache the complete set of 'updating' shards, and find the shard for
    this particular update in the proxy. Add a new config option,
    recheck_updating_shard_ranges, to control the cache time; it defaults to
    one hour. Set to 0 to fall back to previous behavior.

    Note that we should be able to tolerate stale shard data just fine; we
    already have to worry about async pendings that got written down with
    one shard but may not get processed until that shard has itself sharded
    or shrunk into another shard.

    Also note that memcache has a default value limit of 1MiB, which may be
    exceeded if a container has thousands of shards. In that case, set()
    will act like a delete(), causing increased memcache churn but otherwise
    preserving existing behavior. In the future, we may want to add support
    for gzipping the cached shard ranges as they should compress well.

    Change-Id: Ic7a732146ea19a47669114ad5dbee0bacbe66919
    Closes-Bug: 1781291
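
A minimal sketch of the behaviour this change describes, i.e. cache the full 'updating' listing and resolve the shard for a given object name in the proxy; the cache key, field handling, and the gzip step (only mentioned as future work above) are illustrative, not the merged implementation:

    # Sketch only: cache the whole 'updating' listing, pick the shard locally.
    import json
    import zlib

    def cache_updating_shards(memcache, root_path, shard_ranges, time=3600):
        # memcache values top out around 1 MiB; compressing the JSON (the
        # future work noted in the commit message) keeps big listings smaller.
        blob = zlib.compress(json.dumps(shard_ranges).encode('utf-8'))
        memcache.set('shard-updating/%s' % root_path, blob, time=time)

    def find_update_shard(cached_ranges, obj_name):
        # cached_ranges are sorted by upper bound; '' means "no upper bound"
        for sr in cached_ranges:
            if sr['upper'] == '' or obj_name <= sr['upper']:
                return sr
        return None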

Changed in swift:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/losf)

Fix proposed to branch: feature/losf
Review: https://review.opendev.org/671181

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/losf)

Reviewed: https://review.opendev.org/671181
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=ea5cbf633cfa56a418a79cde4ccfea0c1f6f3f92
Submitter: Zuul
Branch: feature/losf

commit a1af3811a79963e2e5d1db3e5588cbc9748f9d57
Author: Tim Burke <email address hidden>
Date: Mon Jun 24 12:25:33 2019 -0700

    sharding: Cache shard ranges for object writes

    Previously, we issued a GET to the root container for every object PUT,
    POST, and DELETE. This puts load on the container server, potentially
    leading to timeouts, error limiting, and erroneous 404s (!).

    Now, cache the complete set of 'updating' shards, and find the shard for
    this particular update in the proxy. Add a new config option,
    recheck_updating_shard_ranges, to control the cache time; it defaults to
    one hour. Set to 0 to fall back to previous behavior.

    Note that we should be able to tolerate stale shard data just fine; we
    already have to worry about async pendings that got written down with
    one shard but may not get processed until that shard has itself sharded
    or shrunk into another shard.

    Also note that memcache has a default value limit of 1MiB, which may be
    exceeded if a container has thousands of shards. In that case, set()
    will act like a delete(), causing increased memcache churn but otherwise
    preserving existing behavior. In the future, we may want to add support
    for gzipping the cached shard ranges as they should compress well.

    Change-Id: Ic7a732146ea19a47669114ad5dbee0bacbe66919
    Closes-Bug: 1781291

commit 0ae1ad63c10ea8643fc0998dca159b844286c414
Author: zengjia <email address hidden>
Date: Wed Jul 10 15:46:32 2019 +0800

    Update auth_url in install docs

    Beginning with the Queens release, the keystone install guide
    recommends running all interfaces on the same port. This patch
    updates the swift install guide to reflect that change

    Change-Id: Id00cfd2c921da352abdbbbb6668b921f3cb31a1a
    Closes-bug: #1754104

commit f4bb1bea28640dc603da891559387d6b15f1f2da
Author: Tim Burke <email address hidden>
Date: Wed Jul 10 23:48:39 2019 -0700

    reconciler: Enqueue right work for shard containers

    This fixes newly-enqueued work going forward, but doesn't offer anything
    to try to parse existing bad work items or even to kick shards so they
    reset their reconciler high-water marks.

    Change-Id: I79d20209cea70a6447c4e94941e5e854889cbec5
    Closes-Bug: 1836082

commit c1d170225281a39973dda1b8e46f5b3b5c943d1a
Author: Tim Burke <email address hidden>
Date: Wed Jul 10 15:37:44 2019 -0700

    functests: Make test_PUT_metadata less flakey

    Change-Id: I846e746c2fe7591a3ab502428f587e3cbe753225

commit b10f4bae28bec9c0c394c340bf813a28ac8a3380
Author: Tim Burke <email address hidden>
Date: Tue Jun 18 09:54:02 2019 -0700

    func tests: tolerate more 404s when deleting

    Change-Id: I3129e4f94ac39964f2f17ea05b6b2dd807fba82a

commit 557335465561b7a00d08cc5a370d5fcd6e7d83b0
Author: Tim Burke <email address hidden>
Date: Sat Jun 1 10:46:54 2019 -0700

    Move calls to self.app outside of error handling

    On p...


tags: added: in-feature-losf
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.22.0

This issue was fixed in the openstack/swift 2.22.0 release.
