Speed up bibs-by-item-age

Bug #1236979 reported by Mike Rylander on 2013-10-08
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Medium
Galen Charlton

Bug Description

From Hubert Lubaczewski (a.k.a. depesz), based on investigation of performance issues at the database layer of Evergreen, we received an initial implementation of a replacement bibs-by-item-age mechanism. See http://markmail.org/thread/3pe3uzokzzcqwy5b for details, but the tl;dr version is a speedup on test data from ~13s to about 3ms.

The version offered removed some functionality, which I have attempted to reimplement within his framework. See http://markmail.org/thread/3pe3uzokzzcqwy5b#query:+page:1+mid:bekujg5ggfydujdl+state:results for the details of that.

No pullrequest yet until more eyes look at it. It's untested. I also cargo-culted the four indexes depesz used during testing into the branch, but it's likely a different set may be better. Branch at:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/faster_bibs_by_item_age_depesz

Kathy Lussier (klussier) on 2013-10-08
tags: added: performance
Mike Rylander (mrylander) wrote :

I've force-pushed some updates to the stored proc, based on feedback from Liam Whalen. The changes may impact performance compared to the 2ms timing that depesz was able to attain, but that's the cost of correctness. However, they may not ... testing with real data will tell.

Liam Whalen (whalen-ld) wrote :

I think the incrementing of the v_found counter should be wrapped in a check to see if the key has been defined before.

IF NOT v_results ? v_record::TEXT THEN
 v_found := v_found + 1;
END IF;

In the case where a record that has already been added to v_results has a copy from a different circ lib that has a newer date than the current copy's date stored in v_results, then the current v_found will be incremented twice for the same record.

Liam Whalen (whalen-ld) wrote :

Another option would be to count the number of keys in v_results and compare that to limit + offset, and get rid of v_found altogether.

The counter is probably a quicker option, but I do not know.

Mike Rylander (mrylander) wrote :

Well, in the current implementation, I'm specifically looking for a full v_found amount of records (or the end of the cursor) from each org unit. This makes the logic simpler to reason about, and for reasonable (read: likely and practical) limits (limit+offset) should not have a significant effect on performance.

Thinking about it now, I'm not liking my CONTINUE-based status and location checks. I think folding the location and status checks into the outer query is probably the better solution. Thoughts, before I push yet more commits?

Liam Whalen (whalen-ld) wrote :

How about keeping track of the oldest create_date and adding a WHERE clause to the query that returns the asset.copy data, but only use this WHERE clause when you have the limit + offset number of records in v_result.

So, once you have a possible full set to return, only look for newer items after that?

IF (SELECT count(skeys(v_results)) == p_limit + p_offset)
    SELECT c.call_number, c.create_date, c.status, c.location
    FROM asset.copy c
    WHERE c.circ_lib = v_circ_lib AND NOT c.deleted'
    AND c.create_date > oldest_created_date_in_v_results
    ORDER BY c.create_date DESC;
ELSE
    SELECT c.call_number, c.create_date, c.status, c.location
    FROM asset.copy c
    WHERE c.circ_lib = v_circ_lib AND NOT c.deleted'
    ORDER BY c.create_date DESC;
END IF;

I do not know if SELECT count(skeys(v_results)) is valid syntax, but getting a count of the keys should be do able.
And doing it here means it is only done once per lib, so not too expensive.

Liam Whalen (whalen-ld) wrote :

That should probably be >= in the IF. I think it is possible for more than p_limit + p_offset worth of records to be added to v_result.

Liam Whalen (whalen-ld) wrote :

As well, if you are pulling out the status and location checks into an outer query, you could add LIMIT p_limit + p_offset to that query too. That could help speed up the lower pages at least.

Liam Whalen (whalen-ld) wrote :

I did not consider that there could be multiple copies with that last statement.

Would an SQL statement like the following:

SELECT ac.record, MAX(c.create_date)
FROM asset.copy c INNER JOIN asset.call_number ac ON c.call_number = ac.id
WHERE c.circ_lib = v_circ_lib AND NOT c.deleted
GROUP BY ac.record
ORDER BY MAX(c.create_date) DESC

be any better? This should return the newest create date and record for the circ_lib in question.

I am not sure if joining is slower than returning just the list of copies and looping. But, this could be used with
LIMIT p_limit + p_offset to get the right window for each cicr_lib.

Mike Rylander (mrylander) wrote :

I hesitate to push too much into the outer query, because one of the neat features depesz is leveraging is the fast-startup behaviour of cursors in PG when it believes that the time-to-first-row will be very significantly less than the cost of generating the full result set and you tell it that you won't try to scroll backwards with NO SCROLL. The more constraining information we put in the outer query (and limit+offset in particular can defeat this feature) the more likely PG will decide that it's better to just generate the full result set

That said, I'm not against trying out any and all formulations of the stored proc. Liam, are you in a position to point these ideas at a realistic data set? If so, some quick testing should be very instructive as to the benefits of any of these ideas.

And, thanks for digging into this. The more folks that understand these deeper bits of EG, the better!

Mike Rylander (mrylander) wrote :

Oh, and I think your optimization in comment #15 would be great. I don't think that will be of concern WRT fast-startup cursors, and it should make use of the (circ_lib,create_date) index.

Liam Whalen (whalen-ld) wrote :

Hmm, I hadn't looked at depesz comments since I first started looking at this. I think I've basically come full circle and put the inefficient stuff into my suggestion.

I will try and test this on some data here.

If the cursors are what is speeding this up, then we might be able to do the original query as a cursor and forget all the looping.

If the joining is what is slowing it down then we will have to loop over just the copy data.

I will try with retrieving just the new items per circ lib via a cursor and new items for all libs via cursor.

Liam Whalen (whalen-ld) wrote :

I modified the records_by_item_age stored procedure to check and see if the v_results set has enough records in it to fulfill the offest + limit. If it does, then it looks for newer records only from that point forward.

I started to work on a separate commit that used the join with asset.call_number, but I realized it would be no use. I was hoping that by returned an ordered result set, the offset and limit could help decrease load times for pages deeper in the results. However, I did not realize when I started it, that the v_results skips over records that are older, so using offset and limit within each circ lib would not be practical.

So, with the modifications the speed has increased significantly for results at greater page values. Here are some stats, which were run against a copy of the Sitka database from July 23rd, 2013.

Time For new_books_by_item query

This is the original SQL query that depesz identified as being a bottleneck.

Old Query:

All Libs:

19576.464 ms

Single Library

8166.984 ms

This is the original modifications by miker to depesz suggested changes. Run at the first page.

First SQL:

All Libs:

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=179.337..179.338 rows=10 loops=1)
 Total runtime: 179.353 ms

Single Library

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=0.947..0.947 rows=1 loops=1)
 Total runtime: 0.961 ms

This is my modifications to miker's commit.
New SQL

All Libs

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=13.190..13.191 rows=10 loops=1)
 Total runtime: 13.207 ms

Single Library

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=1.295..1.295 rows=1 loops=1)
 Total runtime: 1.310 ms

At page 20:

first commit:

All Libs:

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=29645.705..29645.707 rows=10 loops=1)
 Total runtime: 29645.745 ms

my commit:

All Libs:

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=32.808..32.810 rows=10 loops=1)
 Total runtime: 32.832 ms

The modifications that only look for newer records once enough records have been found to satisfy the offset + limit make a significant difference when looking at pages further into the results. They also make a difference within the first page, but not as significant.

However, in the case of a single library, my modifications increase the time needed because it makes an unnecessary skeys check. I just realized, I could put a quick check in to see if the size of p_circ_libs array is 1 and ignore the skeys check if this is the case. I will get that done.

Liam Whalen (whalen-ld) wrote :

I added the checks for the case where data from a single circ lib is requested. It has increased the speed in this case:

 Function Scan on records_by_item_age "biblio.records_by_item_age" (cost=0.25..10.25 rows=1000 width=16) (actual time=0.853..0.853 rows=1 loops=1)
 Total runtime: 0.865 ms

The commits have been pushed to the collab branch.

Ben Shum (bshum) on 2013-11-06
Changed in evergreen:
importance: Undecided → Medium
status: New → Confirmed
Kathy Lussier (klussier) wrote :

Thank you for all your work on this Mike and Liam! I've looked at this on a master and 2.4 server that has loaded the code. The improved performance is remarkable.

On the master server, I performed a pre-test looking at the feed by item age for a branch, system and consortium and noting down the exact order of the results. After the branch was loaded, I looked at the same pages to verify that the same titles were being returned in the same order.

In most cases, the results were the same. However, there were two titles that didn't display in the proper order. This testing was done on a development server, and no copies were added, deleted or otherwise changed between the pre- and post-tests.

In my pre-test, when I went to http://jasondev.mvlcstaff.org/opac/extras/browse/html/item-age/MVLC/2/10, the 10th title on the page (the one that follows How to train a train) was for The New Yorker (record 748134). However, it doesn't show up on this page after the code is loaded.

For all of the bibs you see on this page, the most recent copy in the consortium has a create date of 12/5/2013. For the missing record, there are several copies that also have a create date of 12/5/2013.

The New Yorker title does eventually show up in the feed, see towards the bottom of http://jasondev.mvlcstaff.org/opac/extras/browse/html/item-age/MVLC/1/300. However, the title that appears above it, Better homes and gardens, doesn't have any copies created on 12/5/2013. Instead, the most recent copy has a create date of 12/4/2013, so one would expect it to display before the New Yorker title.

We see something similar at http://jasondev.mvlcstaff.org/opac/extras/browse/html/item-age/MVLC/3/10. In the pre-test, Adbusters (record 1361399) displayed after Maureen O'Hara : the biography. The most recent copy on that record had a create date of 12/5/2013. I never did find that title further down in the feed for the consortium. However, I was able to pull it up easily on the feed for its owning branch - http://jasondev.mvlcstaff.org/opac/extras/browse/rss2-full/item-age/MGN/.

I don't know if its relevant that both problem records were for serials.

If there is anything I can do to help troubleshoot this further, let me know!

Kathy Lussier (klussier) wrote :

Sorry! In paragraph 6 above, I should have said one would expect Better homes and gardens to display AFTER the New Yorker, not before.

Kathy

Liam Whalen (whalen-ld) wrote :

Hi Kathy,

Do you use the serials module? Do you know if the items attached to those records were added via the serials module if you do use it? I will take a better look at this either way.

Liam

Kathy Lussier (klussier) wrote :

Hi Liam,

The test system I was using does not use the serials module. The copies attached to those records were added through traditional cataloging.

Thanks for asking the question!

Kathy

Kathy Lussier (klussier) wrote :

Just a note that the above examples are now outdated since the database has been reloaded with more recent data since my last round of testing. I'm happy to do pre- and post-testing again to see if I can come up with some fresh examples at a time when somebody has the tuits to look at it. Just let me know.

Liam Whalen (whalen-ld) wrote :

Thank you for the update on the data. It will be at least two weeks before I can take some time to look at this.

Galen Charlton (gmc) on 2016-09-27
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers