object-updater should shuffle work before making requests
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Fix Released | Undecided | Unassigned |
Bug Description
Currently the updater works through async pendings on a drive by:
* looping through policies, in order
* looping through each policy's suffix dirs, in os.listdir order
* looping through asyncs, in reverse sorted order.
That last one allows us to remove stale work without making any requests (which is handy), but the others can be problematic when a bunch of the updates are directed at just one or two containers. The first time through might be more or less ok:
container-A update successful
container-B update timeout
container-C update successful
container-D update successful
container-B update timeout
container-E update successful
container-B update timeout
container-F update successful
container-B update timeout
container-B update successful
container-G update successful
container-B update timeout
container-B update successful
container-H update successful
container-B update timeout
container-I update successful
container-B update timeout
but if you restart the updater partway through its cycle (to adjust concurrency settings, say) you have a minor heart attack: nearly everything starts failing, and it'll be a while before you see successes again, since we're retrying all the earlier failures first (and at approximately the same time, so they're less likely to succeed):
container-B update timeout
container-B update timeout
container-B update timeout
container-B update timeout
container-B update timeout
container-B update timeout
container-B update timeout
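The walk order described at the top of this report can be sketched roughly as below. This is a simplified illustration, not the updater's actual code: the function name `iter_async_pendings`, the `<object hash>-<timestamp>` file naming, and the fixed-width timestamps are assumptions made for the example.

```python
import os


def iter_async_pendings(policy_dirs):
    """Yield (path, obj_hash, timestamp) for the newest pending per object.

    Hypothetical sketch: assumes each pending file is named
    "<obj_hash>-<timestamp>" with fixed-width timestamps, so that a
    reverse lexicographic sort puts the newest entry for a hash first.
    """
    for policy_dir in policy_dirs:              # policies, in order
        for suffix in os.listdir(policy_dir):   # suffix dirs, os.listdir order
            suffix_path = os.path.join(policy_dir, suffix)
            if not os.path.isdir(suffix_path):
                continue
            last_obj_hash = None
            # asyncs, in reverse sorted order
            for name in sorted(os.listdir(suffix_path), reverse=True):
                obj_hash, _, timestamp = name.partition('-')
                path = os.path.join(suffix_path, name)
                if obj_hash == last_obj_hash:
                    # An older pending for an object we already saw:
                    # stale, so drop it without making any request.
                    os.unlink(path)
                    continue
                last_obj_hash = obj_hash
                yield path, obj_hash, timestamp
```

Because nothing in this walk is randomized, a restarted process visits the same suffix dirs in the same order, which is why the previously failing updates are all retried first.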
Reviewed: https://review.opendev.org/726570
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=dee98a74d43771d48a58d62647a0628ef7d1cf76
Submitter: Zuul
Branch: master
commit dee98a74d43771d48a58d62647a0628ef7d1cf76
Author: Tim Burke <email address hidden>
Date: Sat May 9 23:16:04 2020 -0700
updater: Shuffle suffixes so we don't keep hitting the same failures
When tuning your updater, you often want to try a new config, see how it
changes your metrics, then adjust concurrency up or down depending on
how your container layer is responding.
If your containers haven't been doing well, though, and you've got a
giant backlog of async pendings to work through, updater restarts to
change concurrency previously posed a problem: the updater would walk
the suffix directories in the same order every start-up. So, if you
found a config that was making decent progress for a while but still had
*some* failures, and you wanted to try tweaking settings to see if you
could *reduce* those failures -- you'd likely start getting *all*
failures as it went to retry the failed ones first and all at once. If
you continued trying to tweak configs to get your failures to a
reasonable rate, you'd almost certainly over-correct for these handful
of overwhelmed DBs and not the overall cluster.
Now, shuffle the suffixes before we walk them.
Change-Id: I3ef34119f0cb563ab405a6517335a24dbaf2b4c3
Closes-Bug: #1878056
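A minimal sketch of the idea in the commit message, assuming the suffix dirs are simply listed and then shuffled in place before being walked. The helper name `shuffled_suffixes` and its `rng` parameter are hypothetical and are not taken from the actual change.

```python
import os
import random


def shuffled_suffixes(policy_dir, rng=random):
    """Return the suffix dir names under policy_dir in random order.

    Hypothetical helper: `rng` can be the `random` module or a
    `random.Random` instance (handy for seeding in tests).
    """
    suffixes = os.listdir(policy_dir)
    rng.shuffle(suffixes)  # in-place shuffle; different order on each start-up
    return suffixes
```

With a walk order that differs on every start-up, a restart no longer retries the same failed updates first and all at once, so the failure rate seen right after a config change should better reflect the cluster as a whole rather than a few overwhelmed container DBs.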