Reclaim of tombstone rows is unbounded and causes LockTimeout (10s)

Bug #1877651 reported by clayg on 2020-05-08
Affects: OpenStack Object Storage (swift)
Importance: Medium
Assigned to: clayg

Bug Description

While troubleshooting some big slow container dbs I noticed that there was a noticeable periodicity to their successful response rate:

204 894
500 0
Fri May 8 18:32:46 UTC 2020
204 912
500 0
Fri May 8 18:33:16 UTC 2020
204 890
500 0
Fri May 8 18:33:46 UTC 2020
204 888
500 0
Fri May 8 18:34:16 UTC 2020
204 906
500 0
Fri May 8 18:34:46 UTC 2020
204 908
500 0
Fri May 8 18:35:16 UTC 2020
204 217
500 380
Fri May 8 18:35:46 UTC 2020
204 659
500 139
Fri May 8 18:36:16 UTC 2020
204 897
500 0
Fri May 8 18:36:46 UTC 2020
204 906
500 0
Fri May 8 18:37:16 UTC 2020
204 922
500 0

500s would spike and successes would dip - and it was happening every 5-10 minutes across nodes. We were seeing a lot of LockTimeout (10s) errors.

At first I thought it was something related to our WAL (.pending files) - but after some flailing I turned off the sharders and replicators, and remembered that I had always hated this query:

https://github.com/openstack/swift/blob/master/swift/common/db.py#L989

I'm not even sure it uses the (name, deleted) index (it might need a "WHERE name > ''") - but even if it can filter out NOT deleted rows, there could easily still be many millions of tombstones in a given db that it has to page through to check whether their `created_at` < reclaim_age,

which, you know, might not even delete anything for a couple of weeks - and it's not like we need to get rid of reclaimed tombstones *immediately* or anything (they've been there for weeks already).

Yet we do it every time we replicate the database... and no one else can run SQL queries while that one is going.

I no-op'ed that function and my LockTimeouts went away, and there were no more dips in the success rates.

Now I just need to figure out how to LIMIT the query effectively so we amortize the work across multiple queries, and ideally run it on a less frequent interval.

Tim Burke (1-tim-z) wrote :

FWIW, looks like we *do* use the index; doing an EXPLAIN on that delete query gets me something like http://paste.openstack.org/raw/793483/ (note the use of SeekGE/IdxGT/IdxRowid).
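To see that plan for yourself, here is a minimal sqlite3 session against a pared-down stand-in for the container DB's object table (the real schema has more columns; the index name and its (deleted, name) column order are assumed to match Swift's ix_object_deleted_name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE object (
        name TEXT,
        created_at TEXT,
        deleted INTEGER DEFAULT 0
    );
    CREATE INDEX ix_object_deleted_name ON object (deleted, name);
""")
# EXPLAIN QUERY PLAN on a reclaim-style DELETE: the equality constraint
# on `deleted` lets the planner search the index rather than scan the table
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "DELETE FROM object WHERE deleted = 1 AND created_at < ?",
    ("1588900000.00000",)).fetchall()
for row in plan:
    print(row)
```

Of course, as noted above, using the index only narrows the walk to deleted rows; every tombstone still gets visited to test `created_at`.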

Trying to cap the number of rows deleted per query might be tricky -- our usual sync-point system that we use for the replicator, container sync, and the reconciler uses ROWID, but that won't be useful with our index. Maybe we could first do something like

    SELECT name FROM object WHERE deleted=1 AND name > ? ORDER BY NAME LIMIT 1 OFFSET ?

to find the next sync point, then

    DELETE FROM object WHERE deleted=1 AND name > ? AND name < ? AND created_at < ?

to clean up just a portion of the namespace at a time. If the SELECT returns no rows, set the sync-point to the empty string and skip the DELETE. Probably make the offset something reasonably low, like 1,000 or 10,000. I think I'd advocate for an easy-to-change constant somewhere over a new config option, though.
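A rough Python/sqlite3 sketch of that two-query scheme (the function and constant names are made up for illustration, and it uses an inclusive upper bound so the boundary row is reclaimed in the same pass rather than deferred to the next sync point):

```python
import sqlite3

RECLAIM_BATCH = 1000  # illustrative constant, per the comment above


def batched_reclaim(conn, reclaim_cutoff):
    """Delete reclaimable tombstones in bounded, index-friendly chunks.

    Hypothetical sketch of the two-query scheme described above; not
    Swift's actual implementation.
    """
    marker = ''
    while True:
        # Find the next sync point: the tombstone name RECLAIM_BATCH
        # rows past the current marker.
        row = conn.execute(
            "SELECT name FROM object WHERE deleted = 1 AND name > ? "
            "ORDER BY name LIMIT 1 OFFSET ?",
            (marker, RECLAIM_BATCH)).fetchone()
        if row is None:
            # Fewer than a full batch remains: sweep the tail and stop.
            conn.execute(
                "DELETE FROM object WHERE deleted = 1 AND name > ? "
                "AND created_at < ?", (marker, reclaim_cutoff))
            conn.commit()
            return
        upper = row[0]
        # Delete only the reclaimable tombstones in this name range;
        # the write lock is held for at most one bounded chunk.
        conn.execute(
            "DELETE FROM object WHERE deleted = 1 AND name > ? "
            "AND name <= ? AND created_at < ?",
            (marker, upper, reclaim_cutoff))
        conn.commit()
        marker = upper
```

Each iteration commits, so other connections can interleave queries between chunks instead of waiting out one giant DELETE.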

Note that if we're seriously considering bug 1521363, resolving that would almost certainly require a new (deleted, created_at) index that would make our existing query quite well-optimized... but I don't know that anyone actually wants to add a new index to container DBs, especially given how difficult it would be to make it play well with sharding.

Tim Burke (1-tim-z) wrote :

Oh, also: I wonder if this was at the core of the complaint in bug 1260460 -- 10M objects in a 2.5GB DB doesn't seem that big. (These days, anyway; maybe I've grown too accustomed to having DBs on flash...)

Fix proposed to branch: master
Review: https://review.opendev.org/727876

Changed in swift:
status: New → In Progress

Reviewed: https://review.opendev.org/727876
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=aab45880f8aaddfed15fe641939aac0d2eccc465
Submitter: Zuul
Branch: master

commit aab45880f8aaddfed15fe641939aac0d2eccc465
Author: Clay Gerrard <email address hidden>
Date: Wed May 13 13:32:18 2020 -0500

    Breakup reclaim into batches

    We want to do the table scan without locking and group the locking
    deletes into small indexed operations to minimize the impact of
    background processes calling reclaim each cycle.

    Change-Id: I3ccd145c14a9b68ff8a9da61f79034549c9bc127
    Co-Authored-By: Tim Burke <email address hidden>
    Closes-Bug: #1877651
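The shape of the fix the commit message describes - scan without locking, then group the deletes into small indexed operations - can be sketched roughly like this (hypothetical function, not the actual patch):

```python
import sqlite3


def reclaim_in_batches(conn, reclaim_cutoff, batch_size=100):
    """Scan for reclaimable tombstones without a write lock, then delete
    them in small indexed batches.

    Hypothetical sketch of the approach in the commit message above,
    not the actual Swift patch.
    """
    # Read pass: a plain SELECT needs only a shared lock, so the long
    # table walk no longer blocks other writers.
    names = [row[0] for row in conn.execute(
        "SELECT name FROM object WHERE deleted = 1 AND created_at < ?",
        (reclaim_cutoff,))]
    # Write passes: each DELETE targets a bounded set of rows by name,
    # so the exclusive lock is held only briefly per batch.
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        qmarks = ",".join("?" * len(batch))
        conn.execute(
            "DELETE FROM object WHERE deleted = 1 AND name IN (%s)" % qmarks,
            batch)
        conn.commit()
```

The batch size stays well under SQLite's default bound-parameter limit, and committing per batch is what amortizes the lock time across the reclaim cycle.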

Changed in swift:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/735381
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=481f126e6b59689599f438e5d27f7328f5b3e813
Submitter: Zuul
Branch: feature/losf

commit 51a587ed8dd5700b558ad26d70dcb7facc0f91e4
Author: Tim Burke <email address hidden>
Date: Tue Jun 16 11:34:01 2020 -0700

    Use ensure-pip role

    Hopefully this will fix the currently-broken probe test gate?

    Depends-On: https://review.opendev.org/#/c/736070/
    Change-Id: Ib652534b35236fdb6bcab131c7dc08a079bf72f6

commit 79811df34c84b416ce9f445926b31a23a32ea1a4
Author: Tim Burke <email address hidden>
Date: Fri Apr 10 22:02:57 2020 -0700

    Use ini_file to update timeout instead of crudini

    crudini seems to have trouble on py3 -- still not sure *why* it's using
    py3 for the losf job, though...

    Change-Id: Id98055994c8d59e561372417c9eb4aec969afc6a

commit e4586fdcde5267f39056bb1b5f413a411bb8e7a0
Author: Tim Burke <email address hidden>
Date: Tue Jun 9 10:50:07 2020 -0700

    memcached: Plumb logger into MemcacheRing

    This way proxies log memcached errors in the normal way instead of
    to the root logger (which eventually gets them out on STDERR).

    If no logger is provided, fall back to the root logger behavior.

    Change-Id: I2f7b3e7d5b976fab07c9a2d0a9b8c0bd9a840dfd

commit 1dfa41dada30c139129cb2771b0d68c95fd84e32
Author: Tim Burke <email address hidden>
Date: Tue Apr 28 10:45:27 2020 -0700

    swift-get-nodes: Allow users to specify either quoted or unquoted paths

    Now that we can have null bytes in Swift paths, we need a way for
    operators to be able to locate such containers and objects. Our usual
    trick of making sure the name is properly quoted for the shell won't
    suffice; running something like

       swift-get-nodes /etc/swift/container.ring.gz $'AUTH_test/\0versions\0container'

    has the path get cut off after "AUTH_test/" because of how argv works.

    So, add a new option, --quoted, to let operators indicate that they
    already quoted the path.

    Drive-bys:

      * If account, container, or object are explicitly blank, treat them
        as though they were not provided. This provides better errors when
        account is explicitly blank, for example.
      * If account, container, or object are not provided or explicitly
        blank, skip printing them. This resolves ambiguities about things
        like objects whose name is actually "None".
      * When displaying account, container, and object, quote them (since
        they may contain newlines or other control characters).

    Change-Id: I3d10e121b403de7533cc3671604bcbdecb02c795
    Related-Change: If912f71d8b0d03369680374e8233da85d8d38f85
    Closes-Bug: #1875734
    Closes-Bug: #1875735
    Closes-Bug: #1875736
    Related-Bug: #1791302

commit 1b6c8f7fdf630458affe2778fc7be86df3ef1674
Author: Tim Burke <email address hidden>
Date: Fri Jun 5 16:36:32 2020 -0700

    Remove etag-quoter from 2.25.0 release notes

    This was released in 2.24.0, which already has a release note for it.

    Change-Id: I9837df281ec8baa19e8e4a7976f415e8add4a2da

commi...

tags: added: in-feature-losf

Reviewed: https://review.opendev.org/741755
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=170dddaacc609493d4b3e7ba1f6a9168e7633216
Submitter: Zuul
Branch: stable/train

commit 170dddaacc609493d4b3e7ba1f6a9168e7633216
Author: Clay Gerrard <email address hidden>
Date: Wed May 13 13:32:18 2020 -0500

    Breakup reclaim into batches

    We want to do the table scan without locking and group the locking
    deletes into small indexed operations to minimize the impact of
    background processes calling reclaim each cycle.

    Change-Id: I3ccd145c14a9b68ff8a9da61f79034549c9bc127
    Co-Authored-By: Tim Burke <email address hidden>
    Closes-Bug: #1877651
    (cherry picked from commit aab45880f8aaddfed15fe641939aac0d2eccc465)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/741753
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=bb0a283a07df3c2c73eb696c5c3ac75e4a366f5a
Submitter: Zuul
Branch: stable/ussuri

commit bb0a283a07df3c2c73eb696c5c3ac75e4a366f5a
Author: Clay Gerrard <email address hidden>
Date: Wed May 13 13:32:18 2020 -0500

    Breakup reclaim into batches

    We want to do the table scan without locking and group the locking
    deletes into small indexed operations to minimize the impact of
    background processes calling reclaim each cycle.

    Change-Id: I3ccd145c14a9b68ff8a9da61f79034549c9bc127
    Co-Authored-By: Tim Burke <email address hidden>
    Closes-Bug: #1877651
    (cherry picked from commit aab45880f8aaddfed15fe641939aac0d2eccc465)

tags: added: in-stable-ussuri