Filtering Results by 'Past Year' is Very Slow

Bug #824631 reported by Nat Katin-Borland
This bug affects 1 person
Affects   Status         Importance   Assigned to      Milestone
KARL3     Fix Released   Medium       Shane Hathaway

Bug Description

Filtering by 'Past Year' on the advanced search results page takes a very long time to return results. I have been seeing response times of 30 to 50 seconds across several searches.

Tags: r3.74
Paul Everitt (paul-agendaless) wrote :

Things are about to get fun on performance. This, the tag cloud. Sigh.

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
importance: Undecided → Medium
milestone: none → m69
Changed in karl3:
milestone: m69 → m70
Shane Hathaway (shane-hathawaymix) wrote :

FWIW, I studied this using sample content on my own computer, but I could not find any obvious performance issues. We'll have to dig deeper to figure this out.

Paul Everitt (paul-agendaless) wrote :

This one is likely to fall into the "not easy" category. We have a few things demanding attention in the next couple of weeks. Moving this out of the way for now.

Changed in karl3:
milestone: m70 → m73
Paul Everitt (paul-agendaless) wrote :

Let's move this to next week.

Changed in karl3:
milestone: m73 → m74
Changed in karl3:
milestone: m74 → m75
Changed in karl3:
milestone: m75 → m76
Changed in karl3:
milestone: m76 → m78
Changed in karl3:
milestone: m78 → m79
Tres Seaver (tseaver) wrote :

Stealing per Paul's request.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Tres Seaver (tseaver)
status: New → In Progress
Tres Seaver (tseaver) wrote :

Observations:

- By default, we sort search queries by 'modified_date', ascending
  (see 'karl.views.batch:get_catalog_batch').

- That same code "resolves" the docids returned by the catalog to
  model objects (there are no equivalents to the "catalog brains"
  used in Zope2).

- The first batch of search results for a "past year" query is
  thus highly likely to pull in objects which are neither in the
  ZODB connection's RAM cache nor in the second-level memcache
  maintained by relstorage. In this case, we are going to block
  on loads from the database.

- Requests which cause lots of previously uncached objects to be
  loaded are not only slow in themselves: they also tend to
  evict "popular" objects from the cache, thereby causing
  subsequent requests which reuse the same connection to be
  slower, because they need to re-fetch the just-evicted objects.
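
The eviction effect described in the last point can be illustrated with a toy model. This is only a sketch: it assumes a plain least-recently-used policy, which simplifies what the ZODB connection cache and the relstorage memcache layer really do, and the class name, cache size, and key names are all made up for the illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts simulated database loads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.loads = 0  # number of cache misses (pretend database fetches)

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        self.loads += 1  # cache miss: pretend to block on the database
        self.data[key] = "object-%s" % key
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
        return self.data[key]


cache = LRUCache(capacity=100)
popular = ["popular-%d" % i for i in range(50)]

for key in popular:  # warm the cache with the "popular" objects
    cache.get(key)
warm_loads = cache.loads

for key in popular:  # an ordinary request: everything is cached
    cache.get(key)
hits_only = cache.loads - warm_loads

for i in range(500):  # a "past year" search touching many cold objects
    cache.get("cold-%d" % i)

before = cache.loads
for key in popular:  # the next ordinary request on this connection
    cache.get(key)
refetched = cache.loads - before

print(warm_loads, hits_only, refetched)
```

In this model the ordinary request costs no loads before the large search and has to re-fetch every popular object after it.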

I don't see any simple way to resolve this problem.

- The "search ghetto" idea (sending search requests to a
  specially-configured instance) is very complicated, and might
  still not result in better performance on the original request.

- We could jam in a 'reverse=True' query term for the "past year"
  requests, which would increase the likelihood that the first
  batch would include already-cached objects. That likelihood
  would then decrease as subsequent batches were fetched.

Shane Hathaway (shane-hathawaymix) wrote :

After some analysis, it appears Karl is in fact fairly efficient in its handling of catalog results. It only resolves the docids of documents that it actually displays in the search results. The problem seems to be that the date index has to load a lot of docid buckets when users search over a large time span. On the OSF staging server, Karl loaded at least 27,000 objects, each only about 150 bytes, when I performed a "past year" search. I'm guessing all of those objects are little BTree buckets.

I think if we add a secondary date index with very coarse granularity, such as 1 week, we might be able to crack this nut without too much work. The theory is that each bucket in the new index will hold a lot of docids, so getting the docids for a large time span will involve loading only a few buckets.

The costs will be:

- Slightly increased time for indexing and slightly more conflicts. I expect both costs to be very small since nearly all conflicts will be resolved invisibly at the bucket level.

- A little extra complexity; developers will need to know they should search the alternate index for large time spans.

- Less precision when using the alternate index. We could solve that by storing exact timestamps in addition to the coarse timestamps, but that solution would enlarge the index, and I don't think users asking for documents created in the past year will care if the system returns a few days of extra documents.
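
A rough sketch of the coarse-index idea, for illustration only. It uses plain dicts instead of BTrees and counts distinct index keys as a stand-in for the number of small persistent objects a range query has to load; the function names and the sample data are invented, and the one-week granularity is simply the value suggested above.

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600  # coarse granularity in seconds (one week, as suggested)


def build_index(docs, granularity=1):
    """Map each rounded-down timestamp to the set of docids stored under it."""
    index = defaultdict(set)
    for docid, timestamp in docs.items():
        index[timestamp - timestamp % granularity].add(docid)
    return index


def range_query(index, start, end, granularity=1):
    """Return (docids, keys_visited) for timestamps in [start, end]."""
    lo = start - start % granularity  # round the lower bound down as well
    keys = [k for k in index if lo <= k <= end]
    docids = set()
    for k in keys:
        docids |= index[k]
    return docids, len(keys)


# Made-up sample data: one document every 20 minutes for two years.
docs = {docid: docid * 1200 for docid in range(2 * 365 * 72)}
now = max(docs.values())
year_ago = now - 365 * 24 * 3600

fine = build_index(docs)          # one key per exact timestamp
coarse = build_index(docs, WEEK)  # one key per week

fine_ids, fine_keys = range_query(fine, year_ago, now)
coarse_ids, coarse_keys = range_query(coarse, year_ago, now, WEEK)

print(fine_keys, coarse_keys, len(coarse_ids - fine_ids))
```

With this sample data the coarse query visits a few dozen keys where the fine query visits tens of thousands, and it returns a superset of the exact result containing less than one week of extra documents, which is the precision cost listed above.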

Paul Everitt (paul-agendaless) wrote : Re: [Bug 824631] Filtering Results by 'Past Year' is Very Slow

We have a number of places where we might be using the dates. For example, the "communities" page filters out "active" communities, where "active" is defined by a recent date.

Since we don't have query optimization, that date operation isn't narrowed by content type, so it might also be doing a big date operation.

We have several other places where date might be an implicit part of a frequently-used screen.

--Paul

On Oct 28, 2011, at 4:44 AM, Shane Hathaway wrote:

> After some analysis, it appears Karl is in fact fairly efficient in its
> handling of catalog results. [...]

Shane Hathaway (shane-hathawaymix) wrote :

Ok, having thought more about it, I think we can reasonably solve the problem at the index level instead of the application level. I'd like to create a type of index that stores multiple resolution levels. It will look the same to the application as a CatalogFieldIndex, but will automatically adapt queries to the most appropriate resolution.
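
One way such an index could work, sketched with plain dicts and integer values. This is not the code that was eventually committed: the class name MultiResolutionIndex, its method names, and the granularities 1000 and 100 are all hypothetical. A range query is covered with whole coarse blocks wherever they fit entirely inside the range, and only the leftover edges fall back to finer levels, so the result stays exact.

```python
from collections import defaultdict


class MultiResolutionIndex:
    """Hypothetical field index that stores each value at several granularities."""

    def __init__(self, levels=(1000, 100)):
        # Coarse granularities, tried coarsest first (values are assumptions).
        self.levels = sorted(levels, reverse=True)
        self.exact = defaultdict(set)
        self.coarse = {g: defaultdict(set) for g in self.levels}

    def index_doc(self, docid, value):
        self.exact[value].add(docid)
        for g, mapping in self.coarse.items():
            mapping[value - value % g].add(docid)  # round down to block start

    def query_range(self, start, end):
        """Return (docids, lookups) for start <= value <= end (integers)."""
        result, lookups = set(), 0
        pending = [(start, end, 0)]  # (lo, hi, position in self.levels)
        while pending:
            lo, hi, depth = pending.pop()
            if lo > hi:
                continue
            if depth == len(self.levels):
                # No coarse level left: look up the exact values one by one.
                for value in range(lo, hi + 1):
                    lookups += 1
                    result |= self.exact.get(value, set())
                continue
            g = self.levels[depth]
            first = -(-lo // g) * g   # start of the first block at or after lo
            last = (hi + 1) // g * g  # exclusive end of the last full block
            if first >= last:
                pending.append((lo, hi, depth + 1))  # no full block fits
                continue
            for block in range(first, last, g):
                lookups += 1
                result |= self.coarse[g].get(block, set())
            pending.append((lo, first - 1, depth + 1))  # left edge
            pending.append((last, hi, depth + 1))       # right edge
        return result, lookups


index = MultiResolutionIndex()
exact_only = MultiResolutionIndex(levels=())
for docid in range(1000):
    index.index_doc(docid, docid * 7)  # made-up values
    exact_only.index_doc(docid, docid * 7)

docids, lookups = index.query_range(250, 5430)
_, exact_lookups = exact_only.query_range(250, 5430)
expected = {d for d in range(1000) if 250 <= d * 7 <= 5430}

print(docids == expected, lookups, exact_lookups)
```

The application only ever calls query_range, so the choice of resolution stays hidden inside the index, which is the property asked for above. A real implementation would use BTree range scans rather than per-value lookups at the exact level.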

Paul Everitt (paul-agendaless) wrote : Re: [Bug 824631] Re: Filtering Results by 'Past Year' is Very Slow

Sounds good to me. Let's set a number of hours for a little R&D project. Come back to me if it goes over. Say, 10 hours as a starting point?

FWIW tags are now impossibly slow too.

On Oct 31, 2011, at 6:19 AM, Shane Hathaway wrote:

> Ok, having thought more about it, I think we can reasonably solve the
> problem at the index level instead of the application level. [...]

Changed in karl3:
assignee: Tres Seaver (tseaver) → Shane Hathaway (shane-hathawaymix)
Shane Hathaway (shane-hathawaymix) wrote :

I've implemented the index with multiple granularity levels and deployed it on branch1 for testing. It seems to have solved the problem: searching by past year is quick and no longer fills the ZODB cache.

JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m79 → m81
Shane Hathaway (shane-hathawaymix) wrote :

The fix is now on the karl trunk.

Changed in karl3:
status: In Progress → Fix Committed
Changed in karl3:
milestone: m81 → m85
milestone: m85 → m82
JimPGlenn (jpglenn09)
tags: added: r3.74
JimPGlenn (jpglenn09) wrote :

deployed

Changed in karl3:
status: Fix Committed → Fix Released