object-updater should shuffle work before making requests

Bug #1878056 reported by Tim Burke on 2020-05-11
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Currently the updater works through async pendings on a drive by

* looping through policies, in order
* looping through each policy's suffix dirs, in os.listdir order
* looping through asyncs, in reverse sorted order.

That last one allows us to remove stale work without making any requests (which is handy), but the others can be problematic when a bunch of the updates are directed at just one or two containers. The first time through might be more or less ok:

  container-A update successful
  container-B update timeout
  container-C update successful
  container-D update successful
  container-B update timeout
  container-E update successful
  container-B update timeout
  container-F update successful
  container-B update timeout
  container-B update successful
  container-G update successful
  container-B update timeout
  container-B update successful
  container-H update successful
  container-B update timeout
  container-I update successful
  container-B update timeout

but if you restart the updater partway through its cycle (to adjust concurrency settings, say) you have a minor heart attack -- nearly everything starts failing, and it'll be a while before you see successes again, since we retry all the earlier failures first (and at approximately the same time, so they're less likely to succeed):

  container-B update timeout
  container-B update timeout
  container-B update timeout
  container-B update timeout
  container-B update timeout
  container-B update timeout
  container-B update timeout
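
A minimal sketch of the walk described above, with the suffix order shuffled as the title proposes. This is an illustration, not Swift's actual updater code: the function name, its parameters, and the assumption that async pending files are named `<object hash>-<timestamp>` are all hypothetical here.

```python
import os
import random


def iter_async_pendings(device_dir, policy_dirs, rng=random):
    """Yield (policy_dir, suffix, filename) for each update worth sending.

    Hypothetical helper for illustration only; names and on-disk layout
    are assumptions, not Swift's real API.
    """
    for policy_dir in policy_dirs:  # policies, in order
        async_dir = os.path.join(device_dir, policy_dir)
        if not os.path.isdir(async_dir):
            continue
        suffixes = os.listdir(async_dir)
        # Proposed change: randomize suffix order so a restart does not
        # replay the same failing updates first.
        rng.shuffle(suffixes)
        for suffix in suffixes:
            suffix_path = os.path.join(async_dir, suffix)
            if not os.path.isdir(suffix_path):
                continue
            last_hash = None
            # Reverse sorted order: the newest update for a hash comes first.
            for name in sorted(os.listdir(suffix_path), reverse=True):
                obj_hash = name.partition('-')[0]
                if obj_hash == last_hash:
                    # Older update for the same object: stale, so remove it
                    # without making any request.
                    os.unlink(os.path.join(suffix_path, name))
                    continue
                last_hash = obj_hash
                yield policy_dir, suffix, name
```

Only the suffix order is randomized; the reverse sorted walk inside each suffix dir is kept, since that is what lets stale asyncs be dropped for free.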

Reviewed: https://review.opendev.org/726570
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=dee98a74d43771d48a58d62647a0628ef7d1cf76
Submitter: Zuul
Branch: master

commit dee98a74d43771d48a58d62647a0628ef7d1cf76
Author: Tim Burke <email address hidden>
Date: Sat May 9 23:16:04 2020 -0700

    updater: Shuffle suffixes so we don't keep hitting the same failures

    When tuning your updater, you often want to try a new config, see how it
    changes your metrics, then adjust concurrency up or down depending on
    how your container layer is responding.

    If your containers haven't been doing well, though, and you've got a
    giant backlog of async pendings to work through, updater restarts to
    change concurrency previously posed a problem: the updater would walk
    the suffix directories in the same order every start-up. So, if you
    found a config that was making decent progress for a while but still had
    *some* failures, and you wanted to try tweaking settings to see if you
    could *reduce* those failures -- you'd likely start getting *all*
    failures as it went to retry the failed ones first and all at once. If
    you continued trying to tweak configs to get your failures to a
    reasonable rate, you'd almost certainly over-correct for this handful
    of overwhelmed DBs and not the overall cluster.

    Now, shuffle the suffixes before we walk them.

    Change-Id: I3ef34119f0cb563ab405a6517335a24dbaf2b4c3
    Closes-Bug: #1878056

Changed in swift:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/735381
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=481f126e6b59689599f438e5d27f7328f5b3e813
Submitter: Zuul
Branch: feature/losf

commit 51a587ed8dd5700b558ad26d70dcb7facc0f91e4
Author: Tim Burke <email address hidden>
Date: Tue Jun 16 11:34:01 2020 -0700

    Use ensure-pip role

    Hopefully this will fix the currently-broken probe test gate?

    Depends-On: https://review.opendev.org/#/c/736070/
    Change-Id: Ib652534b35236fdb6bcab131c7dc08a079bf72f6

commit 79811df34c84b416ce9f445926b31a23a32ea1a4
Author: Tim Burke <email address hidden>
Date: Fri Apr 10 22:02:57 2020 -0700

    Use ini_file to update timeout instead of crudini

    crudini seems to have trouble on py3 -- still not sure *why* it's using
    py3 for the losf job, though...

    Change-Id: Id98055994c8d59e561372417c9eb4aec969afc6a

commit e4586fdcde5267f39056bb1b5f413a411bb8e7a0
Author: Tim Burke <email address hidden>
Date: Tue Jun 9 10:50:07 2020 -0700

    memcached: Plumb logger into MemcacheRing

    This way proxies log memcached errors in the normal way instead of
    to the root logger (which eventually gets them out on STDERR).

    If no logger is provided, fall back to the root logger behavior.

    Change-Id: I2f7b3e7d5b976fab07c9a2d0a9b8c0bd9a840dfd

commit 1dfa41dada30c139129cb2771b0d68c95fd84e32
Author: Tim Burke <email address hidden>
Date: Tue Apr 28 10:45:27 2020 -0700

    swift-get-nodes: Allow users to specify either quoted or unquoted paths

    Now that we can have null bytes in Swift paths, we need a way for
    operators to be able to locate such containers and objects. Our usual
    trick of making sure the name is properly quoted for the shell won't
    suffice; running something like

       swift-get-nodes /etc/swift/container.ring.gz $'AUTH_test/\0versions\0container'

    has the path get cut off after "AUTH_test/" because of how argv works.

    So, add a new option, --quoted, to let operators indicate that they
    already quoted the path.

    Drive-bys:

      * If account, container, or object are explicitly blank, treat them
        as though they were not provided. This provides better errors when
        account is explicitly blank, for example.
      * If account, container, or object are not provided or explicitly
        blank, skip printing them. This resolves ambiguities about things
        like objects whose name is actually "None".
      * When displaying account, container, and object, quote them (since
        they may contain newlines or other control characters).

    Change-Id: I3d10e121b403de7533cc3671604bcbdecb02c795
    Related-Change: If912f71d8b0d03369680374e8233da85d8d38f85
    Closes-Bug: #1875734
    Closes-Bug: #1875735
    Closes-Bug: #1875736
    Related-Bug: #1791302
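
To illustrate the argv problem from the swift-get-nodes commit above, here is a small sketch, assuming that "quoted" means percent-encoded (URL-quoted). It only models the truncation and the round trip; it does not call swift-get-nodes, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# A path containing null bytes, as in the commit message's example.
raw_path = 'AUTH_test/\x00versions\x00container'

# argv entries are NUL-terminated C strings, so a program only sees the
# text before the first null byte.
seen_by_argv = raw_path.split('\x00', 1)[0]

# Percent-encoding the path first keeps it free of null bytes, so it
# survives argv intact and can be decoded on the other side.
quoted = quote(raw_path)
restored = unquote(quoted)

print(repr(seen_by_argv))  # 'AUTH_test/'
print(quoted)              # AUTH_test/%00versions%00container
print(restored == raw_path)  # True
```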

commit 1b6c8f7fdf630458affe2778fc7be86df3ef1674
Author: Tim Burke <email address hidden>
Date: Fri Jun 5 16:36:32 2020 -0700

    Remove etag-quoter from 2.25.0 release notes

    This was released in 2.24.0, which already has a release note for it.

    Change-Id: I9837df281ec8baa19e8e4a7976f415e8add4a2da

commi...

tags: added: in-feature-losf