User-level search speed instability
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
Fix Released
|
Critical
|
Unassigned |
Bug Description
A whole bunch of developers collaborated on a ton of important improvements to the way Evergreen parses, interprets, executes and returns results of searches initiated by end users. Most of this work culminated in the branch merged at the end of bug 1049131.
However, in pursuit of some of the features of this work, particularly the ability to nest filters that depend on non-bibliographic data and especially copy (and copy-attached) data, changes were made to the inner-most core of the logic that constructs SQL queries for Postgres to execute. In order to understand the situation, some history is needed.
In the beginning, Evergreen used a pure-SQL search mechanism. All search logic was embedded in one complex SQL query and the results were used to generate a search result page for the user. However, on large data sets, both the calculation of relevance ranking information and the inspection of visibility-
There was a progression of incremental improvements made over time that eventually led to a solution involving the use of pipelined approach to search result construction. First, a core "bibliographic-
This system is not perfect. It restricts the complexity of certain filter types to be top-level, and for very large result sets, the estimated counts are not guaranteed to be exact. However, this method does provide a configurable, hard backstop to search time, so that searches do not time out as they did in the earlier, pure-SQL world.
Which brings us to the change mentioned at the top of this description. Intermingled with the large set of improvements provided by code from many developers, and functionally tested by many, was a reversion to the pure-SQL search mechanism. Concerns were raised regarding the efficacy of this approach, given the history of Evergreen, and both caution and testing of known problem cases was requested. I was stated that this testing was performed, and while no results were published broadly, the claim was made that this new pure-SQL code was, in every case, faster than the stored-procedure version that existed before as of 2.3. This claim was taken on it's face and in good faith, and after functional testing showed the code working, it was merged to master for 2.4.
However, soon after beta1 was released and initial testing by several sites with large datasets began, the exact issues of concern were noted. Broad searches were timing out, and in some cases no amount of caching via search helped.
[ASIDE: Given the resources available in the Evergreen community, and the current practice of leaving branches unmerged until late in the development cycle -- especially for highly specialized areas such as search or ACQ -- this beta1 testing is the first time we could reasonably expect to see testing beyond that for functional completeness. Perhaps it seems counter-intuitive, but I contend that early merge would help address these issues, as more individuals test master than will ever test individual branches, simply due to time and resources, and human testing will always be needed regardless of any automation we devise.]
So with that history and problem description out of the way, I offer here a branch which reverts 2 commits out of the (approximately) 36 that make up the body of work committed via bug 1049131. This branch reverts the inner-most logic to the stored-procedure style search which avoids the timeouts that are so simple to provoke in the current master code. However, it comes at a cost: Copy status, copy ownership and shelving location filters are forced to be top-level filters, and are not nestable. It is extremely important to note, however, that this is already the case for every version of Evergreen that contains those filtering options, so /no existing features are lost/, but I felt it only fair to mention them because the ability to nest these filters is provided by the code in master (unreleased) today.
To date, Ben Shum at bibliomation has tested this branch and confirms that it removes timeouts on an appropriately cached database, and testing by Jason Stephenson and Thomas Berezansky comparing current master to January master (which was before the merge of the 36-commit branch mentioned above) shows the same. Jason's data is available at https:/
Obviously, the more testing by as many groups as possible, the better. If you can test this, please do, and please test broad searches before applying this branch, for comparison purposes.
To apply this branch to an existing 2.4-beta1 system simply run the upgrade script called XXXX.schema.
working/
description: | updated |
Changed in evergreen: | |
status: | Fix Committed → Fix Released |
I am looking at this. I will perform timed search tests with the beta code both before and after applying this branch. I will record the results of the tests so a decision can be made based on evidence about what to do with this situation.