Reclaim of tombstone rows is unbounded and causes LockTimeout (10s)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Fix Released | Medium | clayg |
Bug Description
While troubleshooting some big slow container dbs I noticed that there was a noticeable periodicity to their successful response rate - these are counts of 204 and 500 responses, sampled every 30 seconds:
```
204 894
500 0
Fri May 8 18:32:46 UTC 2020
204 912
500 0
Fri May 8 18:33:16 UTC 2020
204 890
500 0
Fri May 8 18:33:46 UTC 2020
204 888
500 0
Fri May 8 18:34:16 UTC 2020
204 906
500 0
Fri May 8 18:34:46 UTC 2020
204 908
500 0
Fri May 8 18:35:16 UTC 2020
204 217
500 380
Fri May 8 18:35:46 UTC 2020
204 659
500 139
Fri May 8 18:36:16 UTC 2020
204 897
500 0
Fri May 8 18:36:46 UTC 2020
204 906
500 0
Fri May 8 18:37:16 UTC 2020
204 922
500 0
```
500s would spike and success would dip - and it was happening every 5-10 minutes across nodes. We were seeing a lot of LockTimeout (10s) errors.
At first I thought it was something related to our WAL (.pending files) - but after some flailing I turned off the sharders and replicators, and remembered that at some point in the past I'd decided I hated this query:
https:/
I'm not even sure it uses the (name, deleted) index (it might need a "WHERE name > ''") - but even if it can filter out the NOT-deleted rows, there could easily still be many, many millions of tombstones in a given db that it has to page through to check whether their `created_at` < reclaim_age - which, you know, might not even delete anything for a couple of weeks. And it's not like we need to get rid of reclaimed tombstones *immediately* or anything (they've been there for weeks already).
Yet we do it every time we replicate the database... and no one else can get any SQL queries in while that one is running.
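For context, the reclaim run on every replication pass boils down to a single unbounded DELETE along these lines (a paraphrase, not the literal swift source - the exact code is behind the truncated link above):

```python
import sqlite3

def reclaim(conn: sqlite3.Connection, age_timestamp: str) -> None:
    # Paraphrase of the per-replication reclaim: one unbounded DELETE
    # that has to visit every tombstone row (deleted = 1) in the db just
    # to compare its created_at against the reclaim-age cutoff, holding
    # the write lock for the whole scan.
    conn.execute(
        'DELETE FROM object WHERE deleted = 1 AND created_at < ?',
        (age_timestamp,))
```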
I no-op'd that function and my LockTimeouts went away, and there were no more dips in success rates.
Now I just need to figure out how to LIMIT the query effectively so we amortize the work across multiple queries - and ideally run it at a less frequent interval.
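One way to bound each pass, sketched under a couple of assumptions: stock SQLite only supports DELETE ... LIMIT when compiled with SQLITE_ENABLE_UPDATE_DELETE_LIMIT, so this uses a rowid subquery instead, and the batch size is a made-up placeholder:

```python
import sqlite3

RECLAIM_BATCH = 10000  # made-up batch size; not an existing swift constant

def reclaim_batched(conn: sqlite3.Connection, age_timestamp: str) -> None:
    # Reclaim tombstones at most RECLAIM_BATCH rows per statement,
    # committing between batches so other connections can get queries in
    # instead of stacking up behind one giant DELETE until they LockTimeout.
    while True:
        cur = conn.execute(
            'DELETE FROM object WHERE rowid IN'
            ' (SELECT rowid FROM object'
            '  WHERE deleted = 1 AND created_at < ? LIMIT ?)',
            (age_timestamp, RECLAIM_BATCH))
        conn.commit()  # drop the write lock between batches
        if cur.rowcount < RECLAIM_BATCH:
            break
```

Stopping after one batch (or a fixed few) per replication pass, rather than looping to completion, is what would actually spread the work across passes.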
FWIW, looks like we *do* use the index; doing an EXPLAIN on that delete query gets me something like http://paste.openstack.org/raw/793483/ (note the use of SeekGE/IdxGT/IdxRowid).
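For anyone who'd rather not read raw opcodes, EXPLAIN QUERY PLAN against a toy copy of the schema tells the same story - this assumes the container schema's ix_object_deleted_name index on (deleted, name):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Toy stand-in for the container db's object table and its
# ix_object_deleted_name index on (deleted, name).
conn.executescript('''
    CREATE TABLE object (name TEXT, created_at TEXT,
                         deleted INTEGER DEFAULT 0);
    CREATE INDEX ix_object_deleted_name ON object (deleted, name);
''')
for row in conn.execute(
        'EXPLAIN QUERY PLAN '
        'DELETE FROM object WHERE deleted = 1 AND created_at < ?',
        ('0000000000.00000',)):
    print(row)
# Prints something like:
#   (3, 0, 0, 'SEARCH object USING INDEX ix_object_deleted_name (deleted=?)')
# i.e. the index seeks straight to the tombstones, but still has to walk
# all of them to test created_at.
```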
Trying to cap the number of rows deleted per query might be tricky -- our usual sync-point system (the one the replicator, container sync, and the reconciler use) is based on ROWID, but that won't be useful with our index. Maybe we could first do something like
SELECT name FROM object WHERE deleted=1 AND name > ? ORDER BY name LIMIT 1 OFFSET ?
to find the next sync point, then
DELETE FROM object WHERE deleted=1 AND name > ? AND name < ? AND created_at < ?
to clean up just a portion of the namespace at a time. If the SELECT returns no rows, set the sync point to the empty string and skip the DELETE. Probably make the offset something reasonably low like 1,000 or 10,000. I think I'd advocate for an easy-to-change constant somewhere over a new config option, though.
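A sketch of that approach with made-up names (reclaim_one_slice, RECLAIM_OFFSET), since none of this is settled. Two small liberties versus the description above: the upper bound is inclusive so the boundary row itself isn't skipped between slices, and on wrap-around it sweeps the open tail rather than skipping the DELETE, so dbs with fewer than RECLAIM_OFFSET tombstones still get reclaimed:

```python
import sqlite3

RECLAIM_OFFSET = 10000  # "reasonably low"; a constant, not a config option

def reclaim_one_slice(conn: sqlite3.Connection, age_timestamp: str,
                      sync_point: str) -> str:
    # Find the far edge of this slice: the name RECLAIM_OFFSET tombstones
    # past the current sync point (empty string means start of namespace).
    row = conn.execute(
        'SELECT name FROM object WHERE deleted = 1 AND name > ? '
        'ORDER BY name LIMIT 1 OFFSET ?',
        (sync_point, RECLAIM_OFFSET)).fetchone()
    if row is None:
        # Fewer than RECLAIM_OFFSET tombstones remain past the sync point:
        # sweep the tail and wrap the sync point back to the empty string.
        conn.execute(
            'DELETE FROM object WHERE deleted = 1 AND name > ? '
            'AND created_at < ?', (sync_point, age_timestamp))
        return ''
    # Otherwise reclaim just this bounded slice of the namespace.
    conn.execute(
        'DELETE FROM object WHERE deleted = 1 AND name > ? AND name <= ? '
        'AND created_at < ?', (sync_point, row[0], age_timestamp))
    return row[0]  # next call resumes from here
```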
Note that if we're seriously considering bug 1521363, resolving that would almost certainly require a new (deleted, created_at) index that would make our existing query quite well-optimized... but I don't know that anyone actually wants to add a new index to container DBs, especially given how difficult it would be to make it play well with sharding.
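For concreteness, that hypothetical index would be something along the lines of
CREATE INDEX ix_object_deleted_created_at ON object (deleted, created_at)
(the name is made up), which would let the reclaim DELETE seek straight to the reclaimable rows instead of walking every tombstone to test created_at.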