Evergreen

QP query generation is incorrect with two or more bool ops in a row

Bug #1251353 reported by Mike Rylander on 2013-11-14

258

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Evergreen	Fix Released	Medium	Unassigned	Evergreen 2.5.1
	2.4	Fix Released	Medium	Unassigned	Evergreen 2.4.4

Bug Description

Dan Wells raised two issues with search here: http://markmail.org/message/qch7yu6rkvmd6ojz

Of those, the second is a new issue. It's root cause is improper use of "local" variables in perl. In addition to (at least) two other bugs, the branch below addresses this, generating the correct SQL for the examples given.

From the relevant commit message:

First, we didn't need to make $last_type local, and it broke explicit
grouping anyway. That's removed, and we now reset that (and a few more
like it) at calls to the top level parse() method.

Second, we are smarter about finding the boundary of atoms. Previous
to this commit, and curly brace could send the parser into a tailspin
from which it would not recover. Now we use alternation instead of
a character class, which is much safer with the default multi-character
float syntax specifier.

Third, as a catch-all, if we can't parse the remained of a query we
now simply say so (when in debug mode) and go away, instead of risking
an infinite loop. We do this via a final, unqualified "else" clause
in decompose().

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/QP-boolean-alternation-and-parsing

See original description

Tags:

Mike Rylander (mrylander) on 2013-11-14

description:

updated

Dan Wells (dbw2) on 2013-11-15

information type:

Public → Private Security

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-15:

Thanks for your work on this, Mike. This helped my example queries in testing, so I pushed it into production. The situation is generally improved, but now that the queries are actually being generated, I have seen a few cases where "pathological" queries can actually crash the database system. In at least one case, our DB system went into "recovery mode" in such a way that it never recovered.

As a little more background, what the queries are trying to do is determine (with help of xISBN services) whether we have *any* edition of a title represented by a given ISBN. So they are only pathological in the sense that a human would never construct them, but the question itself is reasonable. I am doing these queries via an API request to open-ils.search.biblio.multiclass.query, not using the TPAC search box.

The request which crashed our DB permanently was for "Animal Farm". Apparently, there are at least 300 editions of this book with different ISBNs. So the expanded, generated query was something like 'identifier|isbn:(0010008381 || 0030554349 || 0134354680 ...(300+ ISBNs later)... || 9992772115)'.

I'm a reasonable guy, so I can accept is such a query is simply too much, and can't easily be made to work. If so, I think we should consider putting some hard limits at some point in the query generation chain to prevent these cases from getting too far along.

That said, I am also concerned about at least one aspect of the generated query for this case. The "meat" of the query looks something like this:

...
(
  (
    (xd221da0_identifier_isbn.id IS NOT NULL)
  )
  OR
  (
    (
      (
        (xd221f38_identifier_isbn.id IS NOT NULL)
      )
      OR
      (
        (
          (
     (xd222208_identifier_isbn.id IS NOT NULL)
          )
          OR
          (
     (
       (
         (xd2224d8_identifier_isbn.id IS NOT NULL)
       )
            )
          )
        )
      )
    )
  )
)
...

Is there a reason the nesting gets deeper and deeper for every added bool? Or is this another aspect to this bug? Maybe there is a reason, with the most likely being that the extra nesting makes the parser simpler in some way, but I thought it at least looked fishy enough to mention, as it seems very possible that this massive nested structure is a major factor in the DB crashes.

Finally, I have moved this to a security bug, since giving an example query which crashes EG makes me uneasy.

Thanks again.

Thanks for your work on this, Mike.  This helped my example queries in testing, so I pushed it into production.  The situation is generally improved, but now that the queries are actually being generated, I have seen a few cases where "pathological" queries can actually crash the database system.  In at least one case, our DB system went into "recovery mode" in such a way that it never recovered.

As a little more background, what the queries are trying to do is determine (with help of xISBN services) whether we have *any* edition of a title represented by a given ISBN.  So they are only pathological in the sense that a human would never construct them, but the question itself is reasonable.  I am doing these queries via an API request to open-ils.search.biblio.multiclass.query, not using the TPAC search box.

The request which crashed our DB permanently was for "Animal Farm".  Apparently, there are at least 300 editions of this book with different ISBNs.  So the expanded, generated query was something like 'identifier|isbn:(0010008381 || 0030554349 || 0134354680 ...(300+ ISBNs later)... || 9992772115)'.

I'm a reasonable guy, so I can accept is such a query is simply too much, and can't easily be made to work.  If so, I think we should consider putting some hard limits at some point in the query generation chain to prevent these cases from getting too far along.

That said, I am also concerned about at least one aspect of the generated query for this case.  The "meat" of the query looks something like this:

...
(
  (
    (xd221da0_identifier_isbn.id IS NOT NULL)
  )
  OR
  (
    (
      (
        (xd221f38_identifier_isbn.id IS NOT NULL)
      )
      OR
      (
        (
          (
	    (xd222208_identifier_isbn.id IS NOT NULL)
          )
          OR
          (
	    (
	      (
	        (xd2224d8_identifier_isbn.id IS NOT NULL)
	      )
            )
          )
        )
      )
    )
  )
)
...

Is there a reason the nesting gets deeper and deeper for every added bool?  Or is this another aspect to this bug?  Maybe there is a reason, with the most likely being that the extra nesting makes the parser simpler in some way, but I thought it at least looked fishy enough to mention, as it seems very possible that this massive nested structure is a major factor in the DB crashes.

Finally, I have moved this to a security bug, since giving an example query which crashes EG makes me uneasy.

Thanks again.

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-15:

Dan,

First, thanks for testing and I'm glad it generally improved things!

It sounds like Postgres decided to eat all the memory and then the linux OOM killer came along and shot down a backend, which would throw the db into recovery mode. If memory, I/O or CPU pressure were high enough, I could imagine this looking like an all-out crash.

That's working as intended, given the current code. The way we deal with bools is to create a nested tree of left joins and then apply the bool operator to that branch of the WHERE clause. This is a straight forward implementation that leverages postgres as much as possible, but past a certain number of joins, it's simply too much to handle.

I've been contemplating possible optimizations for a while. The way I would like to attack this problem is to teach the top-level QueryParser module to recognize situations where logically adjacent nodes use the same class and bool operator, and pull the contents of the down-link node into the parent, when the down-link is a leaf node of the parse tree. This would be a post-parsing optimization implemented as a depth-first recursive pass to pull up appropriate nodes.

Our QP driver should already know how to deal with such a data structure and generate the appropriate tsearch query. However, there will still be limits on the size of individual tsearch queries, so we may not be out of the woods with just that.

Thoughts?

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-15:

Dan,

What do you think of the following as a short-term fix:

diff --git a/Open-ILS/src/perlmods/lib/OpenILS/Application/Storage/Driver/Pg/QueryParser.pm b/Open-ILS/src/perlmods/lib/OpenILS/Application/Storage/Driver/Pg/QueryParser.pm
index e4e237c..b141664 100644
--- a/Open-ILS/src/perlmods/lib/OpenILS/Application/Storage/Driver/Pg/QueryParser.pm
+++ b/Open-ILS/src/perlmods/lib/OpenILS/Application/Storage/Driver/Pg/QueryParser.pm
@@ -820,6 +820,8 @@ SQL
sub flatten {
my $self = shift;

+ die 'ERROR: nesting too deep or boolean list too long' if ($self->plan_level > 25);
+
     my $from = shift || '';
     my $where = shift || '';
     my $with = '';

------------------------------------------------------------------------------------------

I don't know if 25 is too much or not enough ... perhaps a global flag would be in order?

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-15:

I've force-pushed that change to my branch, for the time being.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-22:

I did some testing with this safety valve in place, and it limits me to 13 "ORs" ("||"). It appears that we add two plan levels for every OR. Is that what is expected? If so, we can almost certainly go with a higher setting than that.

To be clear, my test simply searched the catalog for:

dinosaur || dinosaur2 || dinosaur3 || dinosaur4 || ... || dinosaur13

Adding dinosaur14 triggered the valve.

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-23:

Dan,

That is expected, actually. Each boolean op is represented by a plan node with an appropriate joiner and two sub nodes, one for each side of the operator.

Shall we start with, say, 40-deep (20 JOINs), and plan to add a global flag to adjust this in the future?

--miker

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-26:

Thanks, Mike. Your basic explanation makes a lot of sense of what I have seen so far. I typically take the time to fully understand the source code I am looking at, but I've tried to be upfront about not finding time to do that here, so I also hope my questions haven't seemed too dumb because of that.

Before I saw your reply, I had bumped the value up to 50, and that was still low enough to prevent the instability we had experienced (and certainly far lower than the 600+ plan_level which caused the initial problems). 40 is probably enough, and I am fine with anything up to my test case of 50.

So, I currently have all of this branch's changes in production. Based on that, I am certain it doesn't cause any major problems, but I am less sure we aren't introducing new wrinkles. I know we've talked about it in the past, but do we have any existing tests for this area of the codebase? If not, should we try to cook up a test or two before pushing this?

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-26:

I'm not against tests in the least. There is a script which purports to test QP (Open-ILS/src/support-scripts/test-scripts/query_tests.pl), but it doesn't seem to be called from anywhere. How about I add a test for the {}s there, and we agree to revisit this ASAP for 2.next, for better (failure-expected, etc) coverage.

Revision history for this message

Mike Rylander (mrylander) wrote on 2013-11-26:

I've squashed these changes into one commit, increased the protective plan depth limit, and added tests for the two types of failing queries to query_tests.pl.

Revision history for this message

Dan Wells (dbw2) wrote on 2013-11-26:

#10

Mike, thanks for pointing out the test script and adding the test cases. As noted, this is working well for us, so pushed to master through rel_2_4.

Thanks again for your work on this, Mike!

Changed in evergreen:
status:	New → Fix Committed
importance:	Undecided → Medium

Ben Shum (bshum) on 2013-12-06

Changed in evergreen:
status:	Fix Committed → Fix Released

Dan Wells (dbw2) on 2018-01-29

information type:

Private Security → Public Security

Report a bug

This report contains Public Security information

Everyone can see this security related information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.