Evergreen

Bug #1704396
Comment #0

Comment 0 for bug 1704396

Revision history for this message

Kathy Lussier (klussier) wrote on 2017-07-14: Slowness for metecord and one-hit searches in 2.12

We've been investigating slowness/timeouts that have been seen since NOBLE upgraded from 2.10 to 2.12 and that I've recently seen while performing searches on test searches. We've seen the following behavior:

- Slowness and timeouts at NOBLE when users perform one-hit searches. We mostly hear of this problem occurring with ISBN searches, but I think that's because those are the searches that are most likely to retrieve just one result. NOBLE is configured to automatically go to the record on one-hit searches, which I believe may be a factor in this slowness.

- Metarecord searches on my test servers loaded with master frequently hang for over a minute before retrieving a set of search results. In testing, we've found this happening on NOBLE's 2.12 test servers too.

While investigating the issue with the metarecord searches, we found that whenever a metarecord search hangs, there is always a record in the list that is not part of a group and exhibits the behavior we saw in bug 1694497. There is no record ID in the link to the record. It appears as if the system timed out while trying to retrieve the record ID for that result.

After performing one of these metarecord searches, we took the ISBN for the record that was causing the problem for the metarecord search, and did a keyword search just for that ISBN. That search timed out, leading me to believe that the two different instances of slowness may be related.

We turned on DEBUG_TIMING and found the following:

- For the metarecord searches, we see a lot of "Firing off the multiclass query" and "Returned from the multiclass query" messages for each metarecord that's retrieved on the first page of results. When we had a slow metarecord search, we would see a 70 second lag between one of those firing off and returned messages. It was always 70 or 71 seconds (we saw similar 70 second pauses in the logs). This bottleneck always appeared where we would have been retrieving the result that had no record ID in the link.

- For the ISBN searches we've been able to catch, we also see that 70 second lag. In one case, I saw it in the exact same spot. There was 70 seconds between the one instance of "Firing off the multiclass query" and "Returned from the multiclass query." In another instance, I saw that 70 second lag happen at past peer bib id retrieval.

It's generally hard to predict when this slowdown will happen. A specific search may cause it during one trial, then be fine for another trial, and then be problematic on a third try. Also, if I see slowness both times when performing the same search twice, it isn't always the same result that exhibits the problem with the missing record id. It seems fairly random, but, if I perform enough metarecord searches on my test system, I can eventually make the problem happen.

If there are other logs I can post here to help diagnose the problem, let me know. I have looked at the postgres logs, but I'm not seeing any long-running queries in there.

We turned on DEBUG_TIMING and found the following:

If there are other logs I can post here to help diagnose the problem, let me know. I have looked at the postgres logs, but I'm not seeing any long-running queries in there.