User-level search speed instability

Bug #1161601 reported by Mike Rylander
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Critical
Unassigned

Bug Description

A whole bunch of developers collaborated on a ton of important improvements to the way Evergreen parses, interprets, executes and returns results of searches initiated by end users. Most of this work culminated in the branch merged at the end of bug 1049131.

However, in pursuit of some of the features of this work, particularly the ability to nest filters that depend on non-bibliographic data and especially copy (and copy-attached) data, changes were made to the inner-most core of the logic that constructs SQL queries for Postgres to execute. In order to understand the situation, some history is needed.

In the beginning, Evergreen used a pure-SQL search mechanism. All search logic was embedded in one complex SQL query and the results were used to generate a search result page for the user. However, on large data sets, both the calculation of relevance ranking information and the inspection of visibility-affecting data are too expensive to be performed on an essentially unbounded set of records. For example, a keyword search for "dog" would, from the user's point of view, time out.

There was a progression of incremental improvements made over time that eventually led to a solution involving the use of pipelined approach to search result construction. First, a core "bibliographic-only" SQL query is used to find paged sets of cadidate records (or metarecords) and, if the user chooses relevance ranking, to calculate the baseline relevance rank of these. Second, set of visibility checks over copies, shelving locations, statuses, libraries and other criteria are applied to the current paged set of results. This mechanism also keeps track of how many records remain visible from this page after visibility checking is accomplished for estimation of the total record count the user can expect to receive. Finally, a stored procedure is used to tie these two parts together into a single database call from the application layer. The application layer uses the visibility counts from each successive page of a search to refine the total record estimate, and passes the applicable set of records and hit estimation back to the display layer of the system.

This system is not perfect. It restricts the complexity of certain filter types to be top-level, and for very large result sets, the estimated counts are not guaranteed to be exact. However, this method does provide a configurable, hard backstop to search time, so that searches do not time out as they did in the earlier, pure-SQL world.

Which brings us to the change mentioned at the top of this description. Intermingled with the large set of improvements provided by code from many developers, and functionally tested by many, was a reversion to the pure-SQL search mechanism. Concerns were raised regarding the efficacy of this approach, given the history of Evergreen, and both caution and testing of known problem cases was requested. I was stated that this testing was performed, and while no results were published broadly, the claim was made that this new pure-SQL code was, in every case, faster than the stored-procedure version that existed before as of 2.3. This claim was taken on it's face and in good faith, and after functional testing showed the code working, it was merged to master for 2.4.

However, soon after beta1 was released and initial testing by several sites with large datasets began, the exact issues of concern were noted. Broad searches were timing out, and in some cases no amount of caching via search helped.

[ASIDE: Given the resources available in the Evergreen community, and the current practice of leaving branches unmerged until late in the development cycle -- especially for highly specialized areas such as search or ACQ -- this beta1 testing is the first time we could reasonably expect to see testing beyond that for functional completeness. Perhaps it seems counter-intuitive, but I contend that early merge would help address these issues, as more individuals test master than will ever test individual branches, simply due to time and resources, and human testing will always be needed regardless of any automation we devise.]

So with that history and problem description out of the way, I offer here a branch which reverts 2 commits out of the (approximately) 36 that make up the body of work committed via bug 1049131. This branch reverts the inner-most logic to the stored-procedure style search which avoids the timeouts that are so simple to provoke in the current master code. However, it comes at a cost: Copy status, copy ownership and shelving location filters are forced to be top-level filters, and are not nestable. It is extremely important to note, however, that this is already the case for every version of Evergreen that contains those filtering options, so /no existing features are lost/, but I felt it only fair to mention them because the ability to nest these filters is provided by the code in master (unreleased) today.

To date, Ben Shum at bibliomation has tested this branch and confirms that it removes timeouts on an appropriately cached database, and testing by Jason Stephenson and Thomas Berezansky comparing current master to January master (which was before the merge of the 36-commit branch mentioned above) shows the same. Jason's data is available at https://docs.google.com/file/d/0B1R9Iilpi7wzOVp3LUxBZ3lmTEU/edit and he is planning to follow up with a second round of testing which specifically includes comparison of the branch offered here. I expect to be able to post more test data from another large site soon, as well.

Obviously, the more testing by as many groups as possible, the better. If you can test this, please do, and please test broad searches before applying this branch, for comparison purposes.

To apply this branch to an existing 2.4-beta1 system simply run the upgrade script called XXXX.schema.qp_fts_stored_proc.sql and then install Evergreen from source as normal. Then, stop services, restart memcached, restart services and test.

working/collab/miker/revive-qp-fts

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/miker/revive-qp-fts

Tags: pullrequest
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I am looking at this. I will perform timed search tests with the beta code both before and after applying this branch. I will record the results of the tests so a decision can be made based on evidence about what to do with this situation.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
description: updated
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Oh. I should have added I will have results early tomorrow afternoon, Friday, 29 March.

Revision history for this message
Mike Rylander (mrylander) wrote :

UPDATE: I added a third commit to the branch with an upgrade script that will bring the stored procedure back for you. See the bottom of the description (since edited) for instructions.

Revision history for this message
Dan Wells (dbw2) wrote :

I'm out of the office today, so if this is too late, that's alright. I just want to express my opinion that we should strongly consider delaying the 2.4 final to see this through properly. While it seems like we should probably apply this revert branch, that also seems like a fundamental enough change by itself that another week of testing and consideration are worth the cost of delaying.

Ideally we can use that week to pursue both paths independently, then make the final determination. We all agree that the pure SQL approach is best if we can work it out, so I would hate to see it torn out at the last minute due in part to our self-imposed deadline.

Revision history for this message
Mike Rylander (mrylander) wrote :

With blessing from Rogan Hamby, who was one user that reported this issue earlier this week, I've run more tests on a full-sized data load. See https://docs.google.com/document/d/1SklSxYuDZd77zZPf3HZU9_08lVSX8V-103VFwW6QwXI/edit?usp=sharing for the gory details, but the tl;dr is that what I describe above, and expected to see, is what happened.

Thanks, Rogan and SCLENDS, for offering the use of your test server for this. It will make 2.4.0 even better.

Revision history for this message
Mike Rylander (mrylander) wrote :

Dan,

Thanks for putting though toward this issue. Based on a couple facts, I don't think the release date deadline is a factor here.

First, all the tests have shown that the current code shows the problems I worried about, and that reverting just the all-SQL part is a performance win that doesn't remove any existing features. So regardless of timing, we have a way forward. (BTW, from Jason via G+: https://docs.google.com/file/d/0B1R9Iilpi7wzdC1XUTg2UTEzaW8/edit ... apologies for any thunder stolen, Jason.)

Second, no all-SQL code to address these issues is available for testing, or even evaluation and discussion about the general direction taken. Thomas believed he'd have a branch for testing RSN since the middle of last week, but it didn't materialize. Nobody can blame Thomas for not finding a solution to this in just a few days. It's a very hard problem, and (at the risk of sounding even more arrogant than usual) I can speak authoritatively about this because I fought the war already. It's not something that I believe can (or should) be addressed by a lone developer without discussing possible angles of attack, and I would want to see more pre-code brainstorming from as many folks as want to help.

Even if blind inspiration were to strike someone right this moment and they could produce a branch for testing tomorrow morning, I'm not comfortable saying that another week or even two would be enough time to properly evaluate and understand the implications of another, likely significantly new and different, set of changes to the core of search.

Put another way, what we have in revive-qp-fts is an evolved solution that, while imperfect, has documented and understood bounds both in terms of benefits and drawbacks. I do not want to replace such a known quantity with a set of assumptions and unknowns for 2.4 -- if we want to release 2.4 within the next month or so, that is.

In a perfect world, if we wanted to delay 2.4 for, say, six weeks (recall, the conference is coming up and will eat about a week), and we started having pre-code discussion right now, and we could produce code with four weeks to test (that is, within two weeks) I would be more amenable. But a delay like that has it's own set of problems; it unfairly penalizes all the other work that's gone into 2.4 and concentrates resources on a problem that already has a short-term solution. But the biggest problem for me personally is that I cannot dedicate the next several weeks to this one problem. I just have too much other work that has to get done -- other responsibilities that must pull me away.

Alternatively, we could spend the next 5 months finding the best possible solution instead of trying to rush one in a couple weeks. To me, that is the most appropriate response to the current situation.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

Thanks, Mike and everyone else who looked into this.

Committed to master.

Changed in evergreen:
status: New → Fix Committed
assignee: Jason Stephenson (jstephenson) → nobody
Ben Shum (bshum)
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.