Slowness for metarecord and one-hit searches in 2.12

Bug #1704396 reported by Kathy Lussier on 2017-07-14
40
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Evergreen
High
Unassigned
2.12
High
Unassigned

Bug Description

We've been investigating slowness/timeouts that have been seen since NOBLE upgraded from 2.10 to 2.12 and that I've recently seen while performing searches on test servers. We've seen the following behavior:

- Slowness and timeouts at NOBLE when users perform one-hit searches. We mostly hear of this problem occurring with ISBN searches, but I think that's because those are the searches that are most likely to retrieve just one result. NOBLE is configured to automatically go to the record on one-hit searches, which I believe may be a factor in this slowness.

- Metarecord searches on my test servers loaded with master frequently hang for over a minute before retrieving a set of search results. In testing, we've found this happening on NOBLE's 2.12 test servers too.

While investigating the issue with the metarecord searches, we found that whenever a metarecord search hangs, there is always a record in the list that is not part of a group and exhibits the behavior we saw in bug 1694497. There is no record ID in the link to the record. It appears as if the system timed out while trying to retrieve the record ID for that result.

After performing one of these metarecord searches, we took the ISBN for the record that was causing the problem for the metarecord search, and did a keyword search just for that ISBN. That search timed out, leading me to believe that the two different instances of slowness may be related.

We turned on DEBUG_TIMING and found the following:

- For the metarecord searches, we see a lot of "Firing off the multiclass query" and "Returned from the multiclass query" messages for each metarecord that's retrieved on the first page of results. When we had a slow metarecord search, we would see a 70 second lag between one of those firing off and returned messages. It was always 70 or 71 seconds (we saw similar 70 second pauses in the logs). This bottleneck always appeared where we would have been retrieving the result that had no record ID in the link.

- For the ISBN searches we've been able to catch, we also see that 70 second lag. In one case, I saw it in the exact same spot. There was 70 seconds between the one instance of "Firing off the multiclass query" and "Returned from the multiclass query." In another instance, I saw that 70 second lag happen at past peer bib id retrieval.

It's generally hard to predict when this slowdown will happen. A specific search may be slow in one trial, then be fine for another trial, and then be problematic on a third try. Also, if I see slowness both times when performing the same search twice, it isn't always the same result that exhibits the problem with the missing record id. It seems fairly random, but, if I perform enough metarecord searches on my test system, I can eventually make the problem happen.

If there are other logs I can post here to help diagnose the problem, let me know. I have looked at the postgres logs, but I'm not seeing any long-running queries in there.

Kathy Lussier (klussier) on 2017-07-14
description: updated
Mike Rylander (mrylander) wrote :

Kathy, do the NOBEL servers have the patch supplied for bug 1689576 applied? If they're 2.12.2 or later it looks like they should, but you didn't list the exact version so I thought I'd check.

Kathy Lussier (klussier) wrote :

Yes, they are 2.12.2. They have also applied berick's branch from bug 1697029.

For my test servers, I am running master from a few weeks ago, so I should have it too.

After filing this bug, I did some searches of a couple of other EG catalogs on 2.12, and I was able to replicate the metarecord search issue on the C/W MARS training server and at Evergreen Indiana. Jason indicated C/W MARS is on 2.12.3.

Michele Morgan (mmorgan) wrote :
Download full text (3.2 KiB)

Adding a timelog capture for an example of a search that exhibited the 70 second delay on our training server. Also attaching a capture of log entries for this search.

Note the 70 second jump in the timelog:

At 1.7283: past copy note retrieval call
At 71.8001: past peer bib id retrieval

Also note the 70 second gap in the attached log entries:

open-ils.cstore 2017-07-18 12:35:13 [INFO:21705:osrf_stack.c:163:150020070421828115] Message processing duration 0.005465
[2017-07-18 16:36:23] open-ils.circ [INFO:38459:Application.pm:159:150020070421828116] CALL: open-ils.circ open-ils.circ.copy_note.retrieve.all HASH(0x512e230)
open-ils.cstore 2017-07-18 12:36:23 [INFO:21142:osrf_application.c:1070:150020070421828116] CALL: open-ils.cstore open-ils.cstore.direct.asset.copy_note.search.atomic {"owning_copy":10622772,"pub":"t"}

Here's the query info:

Query: glamour
Qtype: title
Search_format: serial

https://egtraining.noblenet.org/eg/opac/record/1811637?query=glamour;qtype=title;fi%3Asearch_format=serial;locg=1;detail_record_view=0

Timelog

At 0.0000: New page
At 0.2291: Initial load
At 0.2969: load_record() began
At 0.3004: past added content stage 1
At 0.3013: past staff saved searches
At 0.3014: Loading results
At 0.3014: Getting search parameters
At 0.3022: Tag circed items?
At 0.3022: Got search parameters
At 0.3033: Firing off the multiclass query
At 0.3209: Returned from the multiclass query
At 0.3214: past related search info
At 0.3754: get_records_and_facets(): about to call multisession
At 0.3769: get_records_and_facets(): about to call unapi.bre via json_query (rec_ids has 1
At 1.3819: get_records_and_facets(): got response content
At 1.3828: get_records_and_facets(): parsed xml
At 1.3840: get_records_and_facets(): end of success handler
At 1.3841: get_records_and_facets():past session wait
At 1.3841: past get_records_and_facets()
At 1.3928: past copy note retrieval call
At 1.4310: past peer bib id retrieval
At 1.4446: past copy note retrieval call
At 1.4881: past peer bib id retrieval
At 1.5017: past copy note retrieval call
At 1.5448: past peer bib id retrieval
At 1.5581: past copy note retrieval call
At 1.6007: past peer bib id retrieval
At 1.6141: past copy note retrieval call
At 1.6572: past peer bib id retrieval
At 1.6704: past copy note retrieval call
At 1.7144: past peer bib id retrieval
At 1.7283: past copy note retrieval call
At 71.8001: past peer bib id retrieval
At 71.8181: past copy note retrieval call
At 71.8706: past peer bib id retrieval
At 71.8845: past copy note retrieval call
At 72.1416: past peer bib id retrieval
At 72.1566: past copy note retrieval call
At 72.2008: past peer bib id retrieval
At 72.2009: past store copy retrieval call
At 72.5801: past get_hold_copy_summary()
At 72.8488: past serials holding stuff
At 72.8489: past expandies
At 72.8509: past added content stage 2
At 72.8572: get_records_and_facets(): about to call multisession
At 72.8585: get_records_and_facets(): about to call unapi.mmr via json_query (rec_ids has 1
At 73.0858: get_records_and_facets(): got response content
At 73.0867: get_records_and_facets(): parsed xml
At 73.0885: get_records_and_facets(): end of success handler
At 73.0886: get_r...

Read more...

Mike Rylander (mrylander) wrote :

Michele, is NOBEL recording "slow" queries in the postgres logs? That is, do you have log_min_duration_statement in use and set to a value other than -1? If so, are you seeing any slow queries that are touching biblio.peer_bib_copy_map? Also, are you seeing any queries that are erroring that touch either asset.copy or biblio.peer_bib_copy_map?

The slowdown in your timelog seems to be happening when in the call to open-ils.search.multi_home.bib_ids.by_barcode from within OpenILS::WWW::EGCatLoader::load_record().

Mike Rylander (mrylander) wrote :

Also, do you have any entries in your opensrf error logs that contain the string 'children available' around the time of these incidents?

Kathy Lussier (klussier) wrote :

I'm adding links to a couple of other log entries from sample metarecord searches.

The log from a metarecord search I performed a couple of weeks ago on a MassLNC test VM:
https://pastebin.com/7HU2nacA

The log from a metarecord search I performed today on the NOBLE training server:
https://pastebin.com/Y0fhuBp5

I can't speak to whether NOBLE is tracking long-running queries and whether there are any for the one-hit search examples, but, when testing on my VM a couple of weeks ago, I was logging long-running queries. No long-running queries were showing in the logs when I experienced the hung MR searches. The one-hit searches may show something different.

Michele Morgan (mmorgan) wrote :
Download full text (14.3 KiB)

I'm pasting in some timelogs for different instances of the same searches that showed delays below. The time lag doesn't always happen at the same place in the retrieval.

To address Mike's questions, we're not currently recording slow queries, but can enable that on our training server, for starters.

We're not seeing any query errors that touch asset.copy or biblio.peer_bib_copy_map. And none of the "children available" entries in the error logs around the times of the searches.

Timelogs:

https://egtraining.noblenet.org/eg/opac/record/1811637?query=glamour;qtype=title;fi%3Asearch_format=serial;locg=1;detail_record_view=0

At 0.0000: New page
At 0.1073: Initial load
At 0.1375: load_record() began
At 0.1393: past added content stage 1
At 0.1398: past staff saved searches
At 0.1399: Loading results
At 0.1399: Getting search parameters
At 0.1403: Tag circed items?
At 0.1403: Got search parameters
At 0.1411: Firing off the multiclass query
At 0.1496: Returned from the multiclass query
At 0.1499: past related search info
At 0.1972: get_records_and_facets(): about to call multisession
At 0.1979: get_records_and_facets(): about to call unapi.bre via json_query (rec_ids has 1
At 1.1367: get_records_and_facets(): got response content
At 1.1378: get_records_and_facets(): parsed xml
At 1.1389: get_records_and_facets(): end of success handler
At 1.1390: get_records_and_facets():past session wait
At 1.1390: past get_records_and_facets()
At 1.1477: past copy note retrieval call
At 1.1778: past peer bib id retrieval
At 1.1855: past copy note retrieval call
At 1.2134: past peer bib id retrieval
At 1.2206: past copy note retrieval call
At 1.2586: past peer bib id retrieval
At 1.2722: past copy note retrieval call
At 1.3154: past peer bib id retrieval
At 1.3290: past copy note retrieval call
At 1.3722: past peer bib id retrieval
At 1.3857: past copy note retrieval call
At 1.4282: past peer bib id retrieval
At 1.4417: past copy note retrieval call
At 1.4985: past peer bib id retrieval
At 1.5121: past copy note retrieval call
At 1.5554: past peer bib id retrieval
At 1.5690: past copy note retrieval call
At 1.6125: past peer bib id retrieval
At 1.6261: past copy note retrieval call
At 71.6979: past peer bib id retrieval
At 71.6980: past store copy retrieval call
At 71.9477: past get_hold_copy_summary()
At 72.2270: past serials holding stuff
At 72.2270: past expandies
At 72.2290: past added content stage 2
At 72.2353: get_records_and_facets(): about to call multisession
At 72.2366: get_records_and_facets(): about to call unapi.mmr via json_query (rec_ids has 1
At 72.4033: get_records_and_facets(): got response content
At 72.4042: get_records_and_facets(): parsed xml
At 72.4060: get_records_and_facets(): end of success handler
At 72.4061: get_records_and_facets():past session wait

https://egtraining.noblenet.org/eg/opac/record/1818019?query=entrepreneur;qtype=title;fi%3Asearch_format=serial;locg=1;detail_record_view=0;sort=

At 0.0000: New page
At 0.2229: Initial load
At 0.4900: load_record() began
At 0.4937: past added content stage 1
At 0.6601: past staff saved searches
At 0.6601: Loading results
At 0.6602: Getting search parameters
At 0.6610: Tag circed ite...

Kathy Lussier (klussier) wrote :

I had previously only been able to replicate the MR search problem on multiple servers. To ensure that the one-hit search problem wasn't the result of a customization, I looked a little more deeply at the problem and was able to replicate it on the C/W MARS training server, which is running 2.12.3. C/W MARS has also enabled the setting to automatically redirect to the record on one-hit searches.

I don't have access to the logs, but I wanted to note down a few observations from my testing:

1. I ran about 100 one-hit ISBN searches in the C/W MARS catalog. Ten of those searches showed the 70-second delay. Another search resulted in an Internal Server error. The percentage may seem low, but when you consider the amount of ISBN searches a cataloger does throughout the day, they are likely to encounter several of these slow searches every day.

2. In most cases, the delay only happened when I performed the first search on the ISBN. Subsequent searches were fine, including the ISBN search that had previously resulted in an Internal Server error. However, two of the ISBN searches continued to show the 70-second lag in repeated searches, eventually showing normal retrieval times after 4-5 tries.

3. After the delay, the bib record is retrieved as expected (except in the case of the Internal Server error). I performed a diff on the HTML generated from the record page after two identical ISBN searches (one that experienced the lag and the other that didn't), but there were no differences. This is different from my experience with the MR search, where searches with a 70-second lag resulted in one of the links missing a record ID.

4. The one-hit redirect setting doesn't always work. We sometimes land on the results page when there is only one result (I think this has always been true.) There were a couple of cases where I performed an ISBN search and landed on the results page very quickly even though there was just one hit. I then re-executed that search in an attempt to get it to land on the record page and saw the 70-second lag. This makes me think that the problem isn't with the search query, but with attempt to load the record on redirect.

Michele Morgan (mmorgan) wrote :

Just to add to Kathy's comment #4. When examining logs for one-hit isbn searches that show the delay, the bib record id is identified immediately, which does point to the problem being with the load of the full record rather than the search query.

Mike Rylander (mrylander) wrote :

To provide a little analysis based on comment #8:

(3) suggests that the search (in all circumstances) is succeeding, because a fast return on subsequent requests means that the search is most likely cached as expected.

(4) further lends weight to the idea that it's the record detail page rendering that is the actual issue here. The fact that it only happens when the user is bounced past the result page and redirected to the detail page pretty much rules out the actual search process.

Do we have any indication that record detail pages generally are having issues, or have there /only/ been reports of 1-hit auto-redirects causing this pause behavior?

Now with more activity log traces that document the problem in real-time, it looks like the peer-bib check I noted above is a red herring. Two of the pauses happened after all those calls, which is more than 25% of the accumulated timelog evidence.

I think it's time to start looking at the opensrf system logs in detail...

Jason Boyer (jboyer) wrote :

Because I know everyone loves the spicy zing of anecdata:

Something I've seen more and more since our move to 2.12 is pileups of unapi.bre and biblio.record_copy_count queries. So many and so slow that they've been added to the list of queries that we cancel after 90 seconds. They can run for minutes and minutes if unchecked, and there was one day where postgres was literally crashed due to their piling up. I didn't bug it at the time because I assumed it was a local config bottleneck, but maybe that's not the case? It doesn't seem to be related to metarecords but if the problem is potentially record detail pages they may have some nebulous relation at least.

Jason Boyer (jboyer) wrote :

I should add to comment #11, this obviously isn't constant or it would likely be a single local bottleneck. As near as I can tell its nearly random, but higher usage times do make it more likely. And, of course, if I try to run the very same query that took 4+ minutes to run and was killed, this time it will return in a reasonable number of milliseconds.

Michele Morgan (mmorgan) wrote :

In addition to the one hit isbn searches and metarecord searches where we have see the 70 second delay, I have also seen the same delay in basic keyword searching.

I conducted a series of two-word keyword searches on a 2.12.2 training server with production data while logged in as a patron with pref_ou=SAL. Out of 26 searches, the 11th, 14th, and 24th failed with a timeout at ~70 seconds. I performed the failed searches later in the day and they returned hits normally.

A summary of the searches:

8:48am - apple kick - no hits (timeout)
8:55am - apple nest - no hits (timeout)
9:15am - apple water - no hits (timeout)

Successful re-searches:

9:18am - apple kick - succeeded, 15 hits
9:23am - apple nest - succeeded, 27 hits
9:35am - apple water - succeeded, 141 hits

I have captured timelogs and osrfsys logs for these searches and uploaded the files here:

https://drive.google.com/drive/folders/0B-CjwZFtLEK6VGdNX241RXpmaXM?usp=sharing

The failed searches all seem to have a "flushing data from socket" entry prior to the 70 second delay.

Mike Rylander (mrylander) wrote :

Thanks, Michele. All of these look like cold-cache (or similar) timeouts. That is, the mod_perl code timed out after 70s waiting for the search service to return data. Are you recording long-running queries in the postgres log, and if so, can you find those three search queries? We don't need the details, just confirmation (or not) that the query duration was longer than 70s. If they're much shorter, then that's important data, too, as it would suggest a silent failure somewhere.

Mike Rylander (mrylander) wrote :

Actually, looking more closely at the opensrf logs, it's looking like there is some sort of silent failure, and the search app does not seem to be receiving the data from the storage app. Is that the full log for the trace 1501410304144751966 on the failed "apple nest" search?

Mike Rylander (mrylander) wrote :

And, having traced the code more, I think getting the query details will be useful. We shouldn't be seeing the "NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored" messages coming from the database.

TIA!

Michele Morgan (mmorgan) wrote :

Thanks for looking at this Mike!

That is indeed the full log for the trace 1501410304144751966 on the failed "apple nest" search.

We have not been logging long-running queries on that server, but will get that turned on and capture more logs and look for associated queries.

Regarding the "NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored" entries, we see those for both successful and unsuccessful searches, and also saw them prior to our upgrade to 2.12.2, when the search delay problem started.

Kathy indicates she has not seen those messages on other servers she has tested that show the 70 second delay. I suspect those messages may be related to changes we made to add a synonym dictionary.

Jeff Davis (jdavis-sitka) wrote :

We've been seeing similar issues in our 2.12.1-ish environment. I just caught an instance similar to what Michele reports: a search that times out after a minute or so, showing no hits; repeating the search returns in a reasonable amount of time with 15 hits.

Logs show the "flushing data from socket..." message followed by a 67-second gap before one final message for that log trace:

http://paste.evergreen-ils.org/605

Unfortunately I can't get the query duration from our Postgres logs. I'll tweak our setup so I can report that next time.

Jeff Davis (jdavis-sitka) wrote :

Another example of a failed search (no hits) which succeeds on a second attempt (multiple hits), this time with Postgres query durations:

http://paste.evergreen-ils.org/636

If I'm reading our logs correctly, the db query takes a reasonable amount of time: 140ms for the failed search vs. 168ms for the successful followup. So I don't think the query itself is the issue.

As before, the failed search gives us a series of OpenSRF log messages, then the "flushing data from socket..." message about 1 second later, then one final log entry for that trace 60-70 seconds after that.

Jeff Davis (jdavis-sitka) wrote :

And here are the timing logs for the failed search:

At 0.0000: New page
At 0.2392: Initial load
At 0.2870: Loading results
At 0.2871: Getting search parameters
At 0.2875: Tag circed items?
At 0.2875: Got search parameters
At 0.2901: Firing off the multiclass query
At 70.3606: Returned from the multiclass query

Mike Rylander (mrylander) wrote :

So far, all these logs really look like the response is getting dropped on the floor -- either never sent by the Perl code, or never passed on by ejabberd.

Are there any corroborating ejabberd logs that suggest dropped stanzas? I'm not betting on it, though, because ejabberd's logs are pretty thin...

Jeff Davis (jdavis-sitka) wrote :

There's nothing unusual in ejabberd logs that I can see in that timeframe, just a bunch of Accepted connection, Accepted legacy authentication, and Close session messages.

For completeness, here are the ejabberd logs that match the PIDs from my last paste:
http://paste.evergreen-ils.org/637

Jeff Davis (jdavis-sitka) wrote :

I do note this correspondence:

2017-08-23T11:37:20.427355-07:00 app2 open-ils.search: [INFO:22652:Application.pm:159:150349595122006341] CALL: open-ils.search open-ils.search.biblio.multiclass.query HASH(0x69ebaf0), title:red is best site(BW) depth(2), 1

=INFO REPORT==== 2017-08-23 11:37:20 ===
I(<0.6443.3>:ejabberd_c2s:1502) : ({socket_state,gen_tcp,#Port<0.105223>,<0.6442.3>}) Close session for <email address hidden>/open-ils.search_drone_at_localhost_22652

That is, ejabberd shows a "Close session" message for the open-ils.search drone with PID 22652 at 11:37:20, which is the same time that the OpenSRF logs show a multiclass query for that PID. There are no further log entries for that PID in my OpenSRF logs.

Mike Rylander (mrylander) wrote :

/me stares at code...

Would it be possible for someone experiencing this issue to apply the following attached patch, to confirm that there are no problems printing to the socket?

Mike Rylander (mrylander) wrote :
Mike Rylander (mrylander) wrote :

Jeff,

Well... that's unfortunate. Now I'm getting concerned about src/perl/lib/OpenSRF/Application.pm around lines 170-177. That's where we check to see if this drone has served as many requests as it's allowed.

The status() call in that block should be tossing a message at the caller with a STATUS_REDIRECTED statusCode, which should then invoke the code at src/perl/lib/OpenSRF/DomainObject/oilsMessage.pm lines 334-340. Part of that is to write a WARN-level message to the log that says "Disconnected because of redirect", which we're not seeing, and the request should be resent to the service.

I wonder if we could be dropping the socket and going away too fast, before the data makes it out the door, if we hit max requests. That shouldn't be, since the socket should have autoflush turned on (per https://perldoc.perl.org/IO/Socket.html ) in modern versions of IO::Socket.

**sad-emoji**

The above patch may still be useful.

Michele Morgan (mmorgan) wrote :

Mike,

We have also not seen any long running queries for these searches. I applied the patch to a training system and executed a few one-hit isbn searches that exhibited the 70 second delay. I did not see any "Error printing XMPP stanza to socket" messages logged.

I've added a doc to the shared folder referenced in comment 13 with the timelog, osrfsys.log entries and ejabberd.log entries for a one-hit isbn search. The osrfsys.log pauses at 12:10:08 and resumes at 12:11:18. Here's the excerpt:

[2017-08-24 12:10:08] open-ils.search [INFO:3474:CStoreEditor.pm:139:150358408346840262] editor[0|0] request en-US open-ils.cstore.direct.biblio.peer_bib_copy_map.search.atomic {"target_copy":5735238}
open-ils.cstore 2017-08-24 12:10:08 [INFO:46670:osrf_application.c:1070:150358408346840262] CALL: open-ils.cstore open-ils.cstore.direct.biblio.peer_bib_copy_map.search.atomic {"target_copy":5735238}
open-ils.cstore 2017-08-24 12:10:08 [INFO:46670:osrf_app_session.c:1177:150358408346840262] [open-ils.cstore] sent 369 bytes of data to <email address hidden>/open-ils.search_drone_at_localhost_3474
open-ils.cstore 2017-08-24 12:10:08 [INFO:46670:osrf_stack.c:163:150358408346840262] Message processing duration 0.001401
[2017-08-24 12:10:08] open-ils.search [INFO:3474:CStoreEditor.pm:139:150358408346840262] editor[0|0] open-ils.cstore.direct.biblio.peer_bib_copy_map.search.atomic: returned 0 result(s)
[2017-08-24 12:10:08] open-ils.search [INFO:3474:Event.pm:64:150358408346840262] Loading events xml file /openils/var/data/ils_events.xml
[2017-08-24 12:11:18] open-ils.circ [INFO:46623:Application.pm:159:150358408346840263] CALL: open-ils.circ open-ils.circ.copy_note.retrieve.all HASH(0x4b00350)
open-ils.cstore 2017-08-24 12:11:18 [INFO:46873:osrf_application.c:1070:150358408346840263] CALL: open-ils.cstore open-ils.cstore.direct.asset.copy_note.search.atomic {"owning_copy":10199236,"pub":"t"}

Here's a direct link to the doc with the complete logs:

https://docs.google.com/document/d/1MG4O9BWu25AcIVcADndV5kzdBho2X_9FqTzFCoi54C4/edit?usp=sharing

Mike Rylander (mrylander) wrote :

Thanks, Michele.

One interesting thing from this log snippet is that there's no "flushing data" message -- though, I suspect that's simply because it was a C server (cstore) that stalled, but only Perl servers have that log message.

For the same reason (C server rather than Perl server) you wouldn't see the new log message. If you run across another instance where the search itself (the "Firing off the multiclass query" / "Returned from the multiclass query" pair) is where the long pause happens, that would definitely be a nearly-completely-Perl-stack API call, so would be a good candidate for spotting the new error message.

Another interesting thing is that the search service started data from the loading the ils_events file, meaning that it was trying to react to some error condition -- but what that was, I don't think we can tell.

I don't have any new theories based on this data, yet.

Kathy Lussier (klussier) on 2017-08-24
summary: - Slowness for metecord and one-hit searches in 2.12
+ Slowness for metacord and one-hit searches in 2.12
summary: - Slowness for metacord and one-hit searches in 2.12
+ Slowness for metarecord and one-hit searches in 2.12
Jeff Davis (jdavis-sitka) wrote :

I applied two patches to our system with extra logging for this bug:

1. Mike's "Error printing XMPP stanza to socket" patch above, and
2. the patch attached below, which adds log messages for certain error conditions in parts of AppSession.pm related to Mike's comment #26 above.

Since then, we've had several more instances of searches showing no results due to timeout, following the same pattern above. Unfortunately I haven't seen any extra log messages either from Mike's patch or mine. Does the negative result help to narrow down the cause?

Mike Rylander (mrylander) wrote :

Jeff,

Well, it tells us it's probably not any of that code with new log messages, which is good ...

Are all of your recent incidents failing during the search part, or are some like Michele's earlier in the bug, and failing when gathering some copy data?

Jeff Davis (jdavis-sitka) wrote :

All the incidents I've investigated failed during search, never on copies.

I had an interesting one just now:

query: overdrive search_format(ebook) site(BC_ILC) pref_ou(BNA)
database query duration: 7989.742 ms
OpenSRF logs: http://paste.evergreen-ils.org/714

This is the first example I've had where the db query took >1s (not too surprising, this is a very broad search with a format limiter). This time, there are a few additional requests between "pref_ou = " and "flushing data from socket" messages, and there are four "flushing data from socket" messages instead of just one. That's slightly different from previous incidents, all of which have been a close match for my examples in comments 18-20 above.

Michele Morgan (mmorgan) wrote :

I was able to recreate a few searches that showed the delay between the "Firing off the multiclass query" / "Returned from the multiclass query" pair as Mike mentioned in comment 28, but have not seen any instances of "Error printing XMPP stanza to socket" in the logs.

An observation I found interesting:

Two metarecord keyword searches for "apple tree" both showed the delay before returning results. The timelogs showed the delay in different places.

For the first search, the delay was between the FIRST "Firing off the multiclass query" / "Returned from the multiclass query" pair.

For the second search, the delay was between the SECOND "Firing off the multiclass query" / "Returned from the multiclass query" pair.

Viewing the search results where the time lag showed between the FIRST multiclass query pair, the first record in the list of hits showed the following:

- no record id in the title link
- no format icon and label

Viewing the search results where the time lag showed between the SECOND multiclass query pair, the second record in the list of hits showed both the missing record id and the missing format information.

Not sure if this is really new information or just restating what's already been reported, but might this suggest where more logging could be inserted to help narrow this particular facet of the problem?

Michele Morgan (mmorgan) wrote :
Download full text (6.3 KiB)

Evergreen 2.12.4
opensrf 5.1

The search delay issue is still present. Metarecord searches that show the delay consistently show a lack of format icon and no record id in the title link for the first or second hit.

An example of a timelog for a metarecord search for "apple juice" is:

At 7.4020: Returned from get_records_and_facets()
At 7.8243: Firing off the second multiclass query
At 7.8249: Gathering meta_results
At 8.6264: Returned from gathering meta_results
At 8.6264: Returned from the second multiclass query
At 8.6268: Firing off the second multiclass query
At 8.6275: Gathering meta_results
At 79.6177: Returned from gathering meta_results
At 79.6178: Returned from the second multiclass query
At 79.6182: Firing off the second multiclass query
At 79.6191: Gathering meta_results
At 80.4367: Returned from gathering meta_results
At 80.4367: Returned from the second multiclass query

The timelog gap corresponds to these lines in OpenILS/WWW/EGCatloader/Search.pm

                $self->timelog("Gathering meta_results");
                $meta_results = $req->gather(1);
                $self->timelog("Returned from gathering meta_results");
                $self->timelog("Returned from the second multiclass query");

Also attaching a screenshot of the search results screen showing the missing info

For comparison, here are some grepped log entries from the first gather, which was successful (metarecord id 2759947, source id 2598144):

Here are all the log entries where metarecord id 2759947 appears:

[2017-09-12 12:53:44] open-ils.search [INFO:6649:CStoreEditor.pm:139:15052229973698331] editor[0|0] request en-US open-ils.cstore.json_query {"from":["search.facets_for_metarecord_set","{}","{2759947,2387987,2794176,2987877,3418043,2112438,2105728,2109922}"]}
open-ils.cstore 2017-09-12 12:53:44 [INFO:5427:osrf_application.c:1075:15052229973698331] CALL: open-ils.cstore open-ils.cstore.json_query {"from":["search.facets_for_metarecord_set","{}","{2759947,2387987,2794176,2987877,3418043,2112438,2105728,2109922}"]}
open-ils.cstore 2017-09-12 12:53:44 [INFO:5429:osrf_application.c:1075:15052229973698332] CALL: open-ils.cstore open-ils.cstore.json_query {"from":["unapi.mmr","2759947","marcxml","record","{holdings_xml,mra,acp,acnp,acns,bmp,cbs,mmr.unapi}","SAL","2","acn=>5,acp=>5,bre=>5",null,null,"63"]}
[2017-09-12 12:53:46] open-ils.storage [INFO:6490:Application.pm:159:15052229973698348] CALL: open-ils.storage open-ils.storage.biblio.multiclass.staged.search_fts.atomic estimation_strategy, inclusion, skip_check, 0, query, apple juice site(SAL) depth(2), from_metarecord, 2759947, core_limit, 10000, return_query, 1, limit, 1000, check_limit, 1000, offset, 0
[2017-09-12 12:53:47] open-ils.search [INFO:6649:Biblio.pm:1349:15052229973698348] Completed canonicalized search is: from_metarecord(2759947) core_limit(10000) limit(1000) badge_orgs(1,62,63) estimation_strategy(inclusion) site(SAL) depth(2) #CD_documentLength #CD_meanHarmonic #CD_uniqueWords (keyword: apple juice)

and all the entries where source id 2598144 appears:

[2017-09-12 12:53:47] open-ils.search [INFO:6649:CStoreEditor.pm:139:] editor[0|0] request en-US open-ils.cstore.json_query {"from...

Read more...

Mike Rylander (mrylander) wrote :

Michele and Jeff,

This may be a long shot, but I want to rule out one thing. Could you both install either OpenSRF 2.5.2 (newly released) or the branch attached to bug 1717526 and see if that has any effect on this issue? I think there's a slight chance that broken JSON is been passed back because of the issue introduced in 2.5.1 by the branch attached to bug 1709710.

Keeping my fingers crossed, but I'm not confident that this is the problem...

Kathy Lussier (klussier) wrote :

For the records, Michele noted in IRC that she saw no improvement this morning after upgrading to OpenSRF 2.5.2. I also upgraded my test system to 2.5.2 and saw the problem on the 2nd metarecord search I tried. :(

Michele Morgan (mmorgan) wrote :
Download full text (18.8 KiB)

Just adding some selected log entries for a slow metarecord search in hopes it may provide some clues.

Metarecord search for "apple juice" site:SAL

Search matched 8 metarecords with the following ids and associated sources, in the following order:

mmr.id mmrsm.sources
2759947 - 2598144
2387987 - 3734245
2794176 - 2696417
2987877 - 2975240,3017142
3418043 - 2670717
2112438 - 3174791,3483376,3519535,3833608
2105728 - 2910493,3483453,3483581
2109922 - 3352056

The timelog showed the 70 second delay between Gathering meta_results/Returned from gathering meta_results for the second hit in the results (metarecord 2387987, source 3734245)

When the page loaded the second hit was missing format information. Looking at the threadtraces in the logs, the trace for the second hit shows several calls not present for the other hits, and is missing others that are present in the other hits. For hits 1 and 3-8, the threadtraces looked similar.

Here are threadtraces from the first three hits retrieved:

Hit 1

[2017-09-20 16:02:44] open-ils.search [INFO:30211:Application.pm:159:150564390479022184] CALL: open-ils.search open-ils.search.biblio.multiclass.query HASH(0x436e1c0), apple juice site(SAL) depth(2), 1
[2017-09-20 16:02:44] open-ils.storage [INFO:30241:Application.pm:159:150564390479022184] CALL: open-ils.storage open-ils.storage.biblio.multiclass.staged.search_fts.atomic estimation_strategy, inclusion, skip_check, 0, query, apple juice site(SAL) depth(2), from_metarecord, 2759947, core_limit, 10000, return_query, 1, limit, 1000, check_limit, 1000, offset, 0
open-ils.cstore 2017-09-20 16:02:44 [INFO:29957:osrf_application.c:1075:150564390479022184] CALL: open-ils.cstore open-ils.cstore.json_query.atomic {"from":["actor.org_unit_ancestors","63"]}
open-ils.cstore 2017-09-20 16:02:44 [INFO:29957:osrf_app_session.c:1177:150564390479022184] [open-ils.cstore] sent 1094 bytes of data to <email address hidden>/open-ils.storage_drone_at_localhost_30241
open-ils.cstore 2017-09-20 16:02:44 [INFO:29957:osrf_stack.c:163:150564390479022184] Message processing duration 0.002724
[2017-09-20 16:02:44] open-ils.storage [INFO:30241:metabib.pm:2984:150564390479022184] pref_ou =
[2017-09-20 16:02:44] open-ils.storage [WARN:30241:Application.pm:624:150564390479022184] open-ils.storage.biblio.multiclass.staged.search_fts: NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
[2017-09-20 16:02:44] open-ils.storage [WARN:30241:Application.pm:624:150564390479022184] open-ils.storage.biblio.multiclass.staged.search_fts: NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
[2017-09-20 16:02:44] open-ils.storage [WARN:30241:Application.pm:624:150564390479022184] open-ils.storage.biblio.multiclass.staged.search_fts: NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
[2017-09-20 16:02:44] open-ils.storage [WARN:30241:Application.pm:624:150564390479022184] open-ils.storage.biblio.multiclass.staged.search_fts: NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
[2017-09-20 16:02:44] open-ils.storage [INFO:30241:Transport.pm:163:15056439...

Michele Morgan (mmorgan) wrote :

For ease of access, I added the text of the above comment to a google doc:

https://drive.google.com/open?id=1YqGInXDeQmeLjbxKvNYkS5FmU48hbny9vN1xi__oIDI

Chris Sharp (chrissharp123) wrote :

Going to add my strident "me too!" here. Like Jason Boyer, I'm seeing a definite uptick in the number of long-running queries we're automatically canceling (every 60s we kill off queries running over 20s). I just applied OpenSRF 2.5.2 today and there's no discernible difference. Catalogers are complaining about search timeouts and network errors in the XUL client.

Happy to provide any log data that would help and happy to add any log debugging messages too - just let me know!

Changed in evergreen:
status: New → Confirmed
Mike Rylander (mrylander) wrote :

Chris, do you record the text of the queries you're killing? If so, can you paste those? Thanks!

Michele Morgan (mmorgan) wrote :

Adding a link to *possibly* related bug 1721133

Andrea Neiman (aneiman) wrote :

Equinox has been contracted by MassLNC to investigate this issue & pursue a resolution if possible.

I encourage anyone experiencing this to include here any log/other helpful info that you have.

Blake GH (bmagic) wrote :

I have a theory. Totally untested but, could it be:
http://git.evergreen-ils.org/?p=working/Evergreen.git;a=blobdiff;f=Open-ILS/src/perlmods/lib/OpenILS/WWW/EGCatLoader/Search.pm;h=2d7a8d85bb9b7f99e4c668c07e088badcbf04abd;hp=c5a53078790936413fff309e752fcb1f6c8fcda9;hb=0bb8deebafb61b91ba4904ff54f02a468cddbac0;hpb=f7215da5476e8fd73f5a2136ecde39e721c1000f

Where $args->{from_metarecord} = $rec_id; is set to a bre.id because open-ils.search.biblio.multiclass.query replaces the ID number with bre.id instead of metabib.metarecord.id because only one result?

I am not certain of any of those assumptions. Just a theory.

Jeff Davis (jdavis-sitka) wrote :

I added some more logging to query_parser_fts (see attached patch). The resulting logs for a false no-results search are here:

http://paste.evergreen-ils.org/866

I think this is another negative result, i.e. the extra logging doesn't tell me what's causing the problem. But maybe it will help to narrow things down.

Mike Rylander (mrylander) wrote :

After ALL THE LOGGING, I managed to trace this to the select() call used to see if we have data to read from the network. Bill Erickson noticed that there were spurious SIGALRMs arriving, interrupting the select() and killing the process, which we then both traced to the use of alarm() that tells the facet fetching code to give up on retrieving facets for its caller.

Modifying retrieve_cached_facets to disable the alarm timer after the facets become available seems to address this, so a patch should be forthcoming tomorrow, after some more testing here.

Bill Erickson (berick) wrote :

Thanks, Mike. You may already be doing this, but given the far-reaching impact of an ALRM that does fire, I'd like to reiterate my preference for not using alarm() in the facet code, when a simple while($i++ < $itr_count) accomplishes the same thing.

Mike Rylander (mrylander) wrote :

I disagree that it's equivalent, as it does not interrupt the sleep loop as quickly as possible, but I will post two branches, one with alarm and one with a loop implementation.

Mike Rylander (mrylander) wrote :

Here's the least-change version, that simply makes sure the alarm timer is reset:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1704396-reset_alarm_timer_for_facets

And here is a version that uses a loop to count up to 10s, 50ms at a time. This one is untested but super-simple:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/lp-1704396-reset_alarm_timer_for_facets-loop

I'll leave it to the committer that picks this bug up to choose an implementation.

Changed in evergreen:
milestone: none → 3.next
Changed in evergreen:
status: Confirmed → Fix Committed
Jason Stephenson (jstephenson) wrote :

Pushed to master, rel_3_0, and rel_2_12 for crunchy, searchy goodness.

Galen Charlton (gmc) on 2017-11-09
Changed in evergreen:
milestone: 3.next → none
milestone: none → 3.0.2
no longer affects: evergreen/3.0
Dan Wells (dbw2) wrote :

For the record (or mainly myself, since I missed it), the fix pushed in was the alarm reset version rather than the loop-based approach. Thanks, all!

Kathy Lussier (klussier) wrote :

I'm adding some details about NOBLE testing results just to demonstrate some of the log messages that were cleared up by this patch beyond the visible search slowness. I thought this information might be useful for those who have not applied the patch yet and are seeing similar log messages.

These comments came from mmorgan about a day after this patch was applied to their production servers.

"Returning NULL" errors in gateway logs:

We had been seeing around 200 errors like these on production per day:

gateway.10.log:2017-11-07 10:17:57 brick0 osrf_json_gw: [INFO:167018:osrf_app_session.c:394:151006763216701821] Returning NULL from app_request_recv after timeout: open-ils.search.biblio.copy_location_counts.summary.retrieve [3140357,11]
gateway.10.log:2017-11-07 10:21:39 brick1 osrf_json_gw: [INFO:143511:osrf_app_session.c:394:151006798914351111] Returning NULL from app_request_recv after timeout: open-ils.search.biblio.record.mods_slim.retrieve.authoritative ["4056296"]
gateway.10.log:2017-11-07 10:24:22 brick0 osrf_json_gw: [INFO:167520:osrf_app_session.c:394:15100681901675200] Returning NULL from app_request_recv after timeout: open-ils.search.biblio.record.print [["4048109"]]
gateway.10.log:2017-11-07 10:24:45 brick0 osrf_json_gw: [INFO:167290:osrf_app_session.c:394:151006794316729052] Returning NULL from app_request_recv after timeout: open-ils.search.biblio.copy_counts.location.summary.retrieve ["3688292",39,1]
gateway.10.log:2017-11-07 10:28:21 brick1 osrf_json_gw: [INFO:143412:osrf_app_session.c:394:151006786314341280] Returning NULL from app_request_recv after timeout: open-ils.search.biblio.record.print [["3841093"]]

I have seen no such errors since the patch was installed on production.

"NOT CONNECTED" errors.

We had been seeing sometimes hundreds of these errors in a day, and other than a small flurry of 46 between 2:25 and 2:30pm yesterday, there have been none since the patch was put on production.

Changed in evergreen:
status: Fix Committed → Fix Released
Elaine Hardy (ehardy) on 2019-03-07
tags: added: metarecords search
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers