Bug #1775958 “Malicious search queries can cause downtime” : Bugs : Evergreen

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-06-09:

#1

SQL query generated from malicious search Edit (46.7 KiB, application/x-sql)

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-06-12:

#2

Hey Jeff, thanks for the details.

There is obviously a problem with the "pullup" step of query parsing. The point of that code is to pull child parse tree nodes' atoms (individual search terms) that use the same search class and joiner into their parent, depth-first, in an effort to make the tree as shallow as possible, which translates into fewer join branches in the generated SQL. Your example should flatten completely, but obviously doesn't.

I'll be looking ASAP, but the more eyes the merrier!

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-06-12:

#3

More info:

So, we currently, intentionally skip explicitly nested clauses on the theory that the user specified that for a reason. I have a WIP branch that stops doing that and pulls up atoms from adjacent class/field/joiner-matching nodes:

security/user/miker/lp-1775958-downgrade-explicit-grouping

NOTE: this does not prevent slightly more complex constructs from doing the same basic thing, such as alternating between && and || joiners, or using different field specifiers before each term. Looking at the former now.

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-06-12:

#4

Joiner alternation pullup will take more concentrated time than I have this afternoon, but I believe it can be accomplished. The main problem is being able to preserve explicit grouping within the same field/class set but with mixed joiners.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-08-15:

#5

I believe the existing patch breaks facet search. In an environment where it is applied, a regular search succeeds, but clicking on one of the facet links in the sidebar returns zero results. Logs show the following message:

2018-08-15T10:19:52.802565-07:00 app1 open-ils.search: [WARN:18208:Application.pm:624:153426879722844341] open-ils.search.biblio.multiclass.staged: received error : service=open-ils.storage : method=open-ils.storage.biblio.multiclass.staged.search_fts.atomic : params=$VAR1 = ['limit',1000,'offset',0,'return_query',1,'skip_check',0,'check_limit',1000,'core_limit',100000,'query','subject|topic[British] chuzzlewit site(BC_ILC) pref_ou(BNA) depth(1)','estimation_strategy','inclusion']; Exception: OpenSRF::EX::ERROR 2018-08-15T10:19:52 OpenILS::Application::AppUtils /usr/local/share/perl/5.18.2/OpenILS/Application/AppUtils.pm:200 System ERROR: Exception: OpenSRF::DomainObject::oilsMethodException 2018-08-15T10:19:52 OpenSRF::AppRequest /usr/local/share/perl/5.18.2/OpenSRF/AppSession.pm:1148 <500> *** Call to [open-ils.storage.biblio.multiclass.staged.search_fts.atomic] failed for session [1534353592.541524617.96308508395], thread trace [1]: Can't locate object method "joiners" via package "OpenILS::Application::Storage::Driver::Pg::QueryParser::query_plan::facet" at /usr/local/share/perl/5.18.2/OpenILS/Application/Storage/QueryParser.pm line 1679.

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-08-15:

#6

Thanks for testing, Jeff. I'll take a look as soon as I can. I suspect we just need to be more discriminating in the second 'if' of pass 2...

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-08-23:

#7

I've force-pushed an update to this branch which protects facets (and filters and modifiers, and other tree-peers to query_plans that we might later add) by simply tossing them on the pile rather than trying to treat them like subplans, which they are not.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-09-13:

#8

lp1775958-facet-timeout.sql Edit (9.0 KiB, application/x-sql)

Thanks, Mike.

With the updated fix applied, facets basically work. However, limiting by multiple facets seems to cause problems. For example:

1. Do a search.
2. Click on a facet on the results page. This works fine.
3. Click another facet on the second set of results. The resulting search consistently times out.

On a test server with no patch for this bug, the final search works normally (no timeout).

I've attached the generated SQL query for a timed-out search.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-09-13:

#9

lp1775958-facet-timeout.explain Edit (25.3 KiB, text/plain)

EXPLAIN output for the previous facet timeout search is attached.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-09-13:

#10

SQL query generated from malicious search #2 Edit (68.2 KiB, application/x-sql)

And, finally, here's a different kind of malicious search that we've seen lately which has the same problem:

bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords core_limit(100000) badge_orgs(1,24,3) estimation_strategy(inclusion) skip_check(0) check_limit(1000) the fix;qtype=keyword;locg=3;detail_record_view=1;badges=1,1%' AND 8027=CAST((CHR(113)||CHR(106)||CHR(122)||CHR(112)||CHR(113))||(SELECT (CASE WHEN (8027=8027) THEN 1 ELSE 0 END))::text||(CHR(113)||CHR(107)||CHR(107)||CHR(120)||CHR(113)) AS NUMERIC) AND '%'=' site(BFN) depth(2)

The generated SQL for this query (see attachment) chews up all available memory even on a system where the previous fix is applied.

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-09-14:

#11

FWIW, the last is (almost certainly) a case of "ye olde mix of ampersand vs semicolon in GET URLs". I don't have any good ideas on how to fix that, unfortunately. I have some bad ideas, though... :(

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-09-14:

#12

Jeff, do you have the query that was generated for the multiple-facet timeout query?

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-09-15:

#13

Sorry, the generated SQL for multiple-facet timeout query is attached to comment #8 and the explain output for that query is in comment #9, are you looking for something else?

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-09-17:

#14

Jeff, nope, I just missed it entirely. Sorry!

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-09-17:

#15

Hi Jeff,

So, obviously, the plan is bad. The biggest point where the plan gets things wrong is in the estimation of how many rows will come out of each branch of the Append node. It thinks it's going to get tens of thousands of rows but ends up with just a few from each. What I'm not clear on is why it's not choosing to push the source filter all the way down to the bottom of each.
Also, without poking at the database directly, I'm not sure why the facet joins, in particular, are causing a big plan change. Theoretically, multiple inner joins should lower the expected row output (which they indeed do) and a plan using source should be preferable.

I have a something to try, if you're amenable. It may be useful to create combined source+field index, along the lines of:

CREATE INDEX CONCURRENTLY metabib_title_field_entry_source_field_idx ON metabib.title_field_entry (source, field);

for each of title, author, subject, series, keyword, and identifier metabib.*_field_entry tables. The theory is that this index can be used to find a smaller set of intermediate rows than the tsquery join condition.

Thanks!

Revision history for this message

Mike Rylander (mrylander) wrote on 2018-10-12:

#16

Jeff, we're testing a branch here that looks like it will address this issue. The problem now is the number of joins that end up in the main core query, the inner-most non-CTE SELECT statement. The branch we're testing pulls each search subquery out and turns it into a CTE, which effectively separates it from the main one -- they're planned separately instead of the planner trying to find an optimal join order for tens (or hundreds, in some cases) of joins all at once. That's safe because the logic and data used inside each subquery is self-contained. There is a risk of normalization toward a higher "lowest-cost" by taking away some optimizations opportunities, such as when searching within a defined record set, but only actual testing will show if that's a real risk or not -- and it may be possible to mitigate that, anyway.

More as the story develops...

Revision history for this message

Galen Charlton (gmc) wrote on 2019-02-07:

#17

I've pushed to security/collab/gmcharlt/lp1775958_catalog_search a branch that includes Mike's pullup patch and the patch he mentioned about splitting out query components into CTEs.

Together, the two patches are testing OK for me:

[1] The pathological queries listed in this bug are no longer resulting in Pg backends either consuming all system memory or getting tied up indefinitely doing query planning.
[2] Because of the pullup patch, queries like "(a) (a) (a) (a) (a) (a) (a) (a)" return results quickly. (Otherwise, with only the CTE patch, "(a) (a)..." still CPU-pins a Pg backend.)
[3] Applying multiple facets works for me, at least in Concerto.

Upshot: I think we're ready for another round of broader testing.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2019-02-14:

#18

Thanks for the update! I've put Galen's branch on one of our test servers. We'll do some testing and I'll report back on the results.

Revision history for this message

Galen Charlton (gmc) wrote on 2019-02-25:

#19

Issuing a nudge for testing: we've had the current branch running for our customers for nearly two weeks. It's eliminated the problem without (to our knowledge) breaking other search features.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2019-03-01:

#20

Added my signoff & rebased on master: securitycollab/dyrcona/lp1775958_catalog_search-signoff

I tested this on 3.2 after reproducing the bug, and it not only fixes the bug, but facets, etc. seem to be working correctly.

I didn't push it because I'm curious what Jeff Davis has to say.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2019-03-01:

#21

Bleh.... I should have read more carefully after editing and hitting Post Comment.

security/collab/dyrcona/lp1775958_catalog_search-signoff

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2019-03-01:

#22

I applied the two patches on a 3.1 test system. All the issues identified in this bug appear to be fixed, and testing has not uncovered any new problems with search. I hope to apply the patches to our production environment during our next service window.

Thanks to Mike and Galen for the fix, and Jason for testing and signoff!

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2019-03-01:

#23

Credit is also due to Jennifer Pringle and Tina Ji for testing the facet problem and checking for any new issues with these fixes.

tags:

added: pullrequest signedoff

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2019-03-01:

#24

Jeff,

I want to ask specifically about search highlighting. Do you have it enabled, and if so, did it stop working after you applied this branch?

After I applied this branch, I got an email asking why search highlighting wasn't working on our training server. I removed the patches and search highlighting still wasn't working. I don't think the search highlighting is related to this branch, but I thought that I should ask.

Jason

Revision history for this message

Dan Wells (dbw2) wrote on 2019-03-05:

#25

We recently had a query kill our DB while doing nothing apparently special, just a certain facet. It turned out to be the same as or similar to the facet bug noted here, and we had applied locally a very similar fix to good effect (moving to a CTE). Check out this EXPLAIN of the original bad query for all the gory details:

https://explain.depesz.com/s/Nko

And yes, sadly, we did let this run for 30+ minutes for the purposes of science.

So, in short, we have been running the CTE portion of this fix for about two weeks now with no noted ill effects.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2019-03-05: Re: [Bug 1775958] Re: Malicious search queries can cause downtime

#26

On 2019-03-01 2:29 p.m., Jason Stephenson wrote:
> I want to ask specifically about search highlighting. Do you have it
> enabled, and if so, did it stop working after you applied this branch?

I haven't tested thoroughly, but I don't see any search highlighting
issues on a test server with this patch applied, and Tina didn't mention
anything either.

Revision history for this message

Dan Wells (dbw2) wrote on 2019-03-05:

#27

I've applied the pullup patch and spent a few hours testing, with interesting results. I might be doing something improperly, so take this with a grain of salt.

I know it isn't exactly what was targeted here, but we do a lot of multi-ISBN queries, so that is the angle I was testing. First a bare 'OR' list:

identifier|isbn:0441172717 || 0441013597 || 0340839937 || 0450011844 || 0399128964 || 0425080021

This resulted in:
Pre-patch: 6 xxx_identifier_isbn joins
Post-patch: 6 xxx_identifier_isbn joins

Now it isn't a surprise exactly that this didn't change, since the change here was for manual parens, but oddly enough, these tokens weren't grouped at all, but still fully "pushed down". This is odd, as I thought that behavior was changed eons ago.

Adding one group of parens:

identifier|isbn:(0441172717 || 0441013597 || 0340839937 || 0450011844 || 0399128964 || 0425080021)

Resulted in:
Pre-patch: 5 xxx_identifier_isbn joins (last two grouped)
Post-patch: 5 xxx_identifier_isbn joins (last two grouped)

Again, no visible change, but again oddly enough, adding manual parens improved the resulting query.

Finally, something akin to the real test case:

identifier|isbn:(0441172717 || (0441013597 || (0340839937 || (0450011844 || (0399128964 || (0425080021))))))

Resulted in:
Pre-patch: 6 xxx_identifier_isbn joins
Post-patch: 3 xxx_identifier_isbn joins (last four grouped)

My conclusion from this last test is that the patch is working, but maybe not doing quite everything intended. (I think, in the ideal case, we want just one join here.)

For further kicks, I did a query with 165 ISBNs, structured just like the last query. I didn't count them precisely, but rough count was right around 80 JOINs, and maybe a few more. Extrapolating (unreliably, of course) from the two data points, perhaps we are getting exactly half of the collapsing we expect? (6->3 and 165->80ish)

I am interested to hear thoughts about either the non-parens or the parens half-collapsing observations above, recognizing those may be fundamentally different. Also happy to do more tests as needed!

I've applied the pullup patch and spent a few hours testing, with interesting results.  I might be doing something improperly, so take this with a grain of salt.

I know it isn't exactly what was targeted here, but we do a lot of multi-ISBN queries, so that is the angle I was testing.  First a bare 'OR' list:

identifier|isbn:0441172717  || 0441013597  || 0340839937  || 0450011844  || 0399128964  || 0425080021

This resulted in:
Pre-patch: 6 xxx_identifier_isbn joins
Post-patch: 6 xxx_identifier_isbn joins

Now it isn't a surprise exactly that this didn't change, since the change here was for manual parens, but oddly enough, these tokens weren't grouped at all, but still fully "pushed down".  This is odd, as I thought that behavior was changed eons ago.

Adding one group of parens:

identifier|isbn:(0441172717  || 0441013597  || 0340839937  || 0450011844  || 0399128964  || 0425080021)

Resulted in:
Pre-patch: 5 xxx_identifier_isbn joins (last two grouped)
Post-patch: 5 xxx_identifier_isbn joins (last two grouped)

Again, no visible change, but again oddly enough, adding manual parens improved the resulting query.

Finally, something akin to the real test case:

identifier|isbn:(0441172717  || (0441013597  || (0340839937  || (0450011844  || (0399128964  || (0425080021))))))

Resulted in:
Pre-patch: 6 xxx_identifier_isbn joins
Post-patch: 3 xxx_identifier_isbn joins (last four grouped)

My conclusion from this last test is that the patch is working, but maybe not doing quite everything intended.  (I think, in the ideal case, we want just one join here.)

For further kicks, I did a query with 165 ISBNs, structured just like the last query.  I didn't count them precisely, but rough count was right around 80 JOINs, and maybe a few more.  Extrapolating (unreliably, of course) from the two data points, perhaps we are getting exactly half of the collapsing we expect? (6->3 and 165->80ish)

I am interested to hear thoughts about either the non-parens or the parens half-collapsing observations above, recognizing those may be fundamentally different.  Also happy to do more tests as needed!

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2019-03-06:

#28

Jeff, thanks for looking! I'm about 99% certain that our OPAC customization broke our highlighting, so I am looking into that.

Only thing I have to say about Dan Wells' comments is that ISBN searches have always seemed slow to me, even when you're only looking for one. Also, when I was dealing with attackers trying SQL injection attacks against our system on Monday (3/4/19), the only such attacks that had a noticeable impact were those done via ISBN search. Naturally, the SQL injection failed, but the queries ran for 2 to 4 minutes on our database server. This system is still on 3.0, so not necessarily susceptible to this bug.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2019-03-06:

#29

And, just to confirm, our lack of search highlighting was caused by an old version of misc_util.tt2 being used for our custom templates. Once I corrected that, all is well, so that had nothing to do with this branch, as I suspected.

Revision history for this message

Dan Wells (dbw2) wrote on 2019-04-15:

#30

Likely not directly related, but at least tangentially. I just ran into another relatively simple search which completely pegged our catalog. The search was of the form:

"term1" and "term2"

Notable characteristics include 1) the quoting of individual terms, and 2) the use of 'and' for an apparent Boolean operator. I haven't had time to do any study, but I suspect the quoting is the culprit. I would be interested to hear whether this search results in very long running queries in other instances.

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-04-15:

#31

I suspect this is not related to the pull-up or CTE code specifically.

Do you mean that the problem manifests pre-patch or with the offered branch? Do you have a sample SQL query and plan (ideally with ANALYZE) available?

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-04-15:

#32

And, thinking a bit, I suspect that the deepest part of the plan is going to look like a couple seq-scans of the table behind whichever class is being searched, looking for substrings for the quoted terms. That would be the least-good way to do it, but if the stats are way off...

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-05-10:

#33

Hi Dan,

Regarding comment #27, I retested the branch that Galen posted, which is just a compilation of two of my patches, and I'm not seeing the behavior that you are. Now, I'm using the QP testing script, and not pulling the queries from a log, but in my testing the only difference between the fully flat and fully nested id|isbn queries you shared above are in the ephemeral identifiers that are generated as table aliases based on in-process memory addresses of certain internal variables.

Commands to replicate my testing, given a running Evergreen instance of recent vintage and a checkout of the above branch living at ~/Evergreen/:

user@host:~/Evergreen/Open-ILS/src/support-scripts/test-scripts$ sudo PERL5LIB=~/Evergreen/Open-ILS/src/perlmods/lib/ ./query_parser.pl --query 'identifier|isbn:0441172717 || 0441013597 || 0340839937 || 0450011844 || 0399128964 || 0425080021' > /tmp/flat.test
user@host:~/Evergreen/Open-ILS/src/support-scripts/test-scripts$ sudo PERL5LIB=~/Evergreen/Open-ILS/src/perlmods/lib/ ./query_parser.pl --query 'identifier|isbn:(0441172717 || (0441013597 || (0340839937 || (0450011844 || (0399128964 || (0425080021))))))' > /tmp/nested.test
user@host:~/Evergreen/Open-ILS/src/support-scripts/test-scripts$ diff /tmp/nested.test /tmp/flat.test

That shows full pullup, combination (flattening) of all OR'd values, and CTE-ization of the whole thing. Here's a little more complicated query that shows the code properly discriminating between when it's safe to flatten (same-joiner) and not (mixed-joiner):

user@host:~/Evergreen/Open-ILS/src/support-scripts/test-scripts$ sudo PERL5LIB=~/Evergreen/Open-ILS/src/perlmods/lib/ ./query_parser.pl --query 'identifier|isbn:(0441172717 || (0441013597 || 0340839937)) && (0450011844 || (0399128964 || 0425080021))'

Do you mind retesting?

Thanks!

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-05-13:

#34

Hi all,

I've pushed another commit to the collab branch of Galen's which deals with the phrase (exact match) issue that Dan found. While not related to the existing fix on the branch per se, it is caused by an adjacent concept -- phrases create a subquery that defeats pull-up due to direct plan nesting.

There is still more work to do here, but all else being equal phrases should not cause any bad queries simply because they exist. I think, if testing shows it to be safe, it would be good to merge the 3 commits at the top of the branch.

Thanks in advance for taking a look!

Revision history for this message

Dan Wells (dbw2) wrote on 2019-05-23:

#35

Hello Mike,

Thanks for all your work on this!

Also, thanks for pointing out the test script, that makes things quite a bit easier to test (go figure). At first, I thought I must have been crazy in my #27 comments, as my tests all turned out like yours, looking fine. I then tested a live search again, and my previous results reared their ugly head.

After some trial and error, it turns out that adding really any other modifier to the search triggers the divide-by-2 behavior instead of full collapse. For example, give this a try:

./query_parser.pl --query 'identifier|isbn:(0441172717 || (0441013597 || (0340839937 || (0450011844 || (0399128964 || (0425080021)))))) depth(1)'

When I add the 'depth(1)' to the end (which is our default catalog search), I end up with three joins of the identifier table. If I leave that out, I get just the one, as expected.

I haven't done so yet, but will also test your newest phrase patch soon.

Dan

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-05-24:

#36

Hi Dan,

With the phrase commit in place I'm not seeing the behavior you describe. I think the "clean, single-subplan pullup" that the phrase fix adds may be addressing this as a side effect.

Side note, I've pushed a change the the collab branch that avoids loading the test config when we have a real system to connect to, which should produce more fulsome and correct output. I also find it useful to use the --debug flag to get a better sense for the transformation of the plan during pullup. The commands I'm currently testing with are:

# To test your example:
./query_parser.pl --debug --query 'identifier|isbn:(0441172717 || (0441013597 || (0340839937 || (0450011844 || (0399128964 || (0425080021)))))) depth(1)' 2>&1 |less

# To test that the "losing filters" bug is gone (also, and more intentionally, part of the phrase pullup fix)
./query_parser.pl --debug --query 'identifier|isbn:(0441172717 || (0441013597 || (0340839937 || (0450011844 || (0399128964 || (0425080021))))) site(BR1) depth(1))' 2>&1 |less

Thanks again!

Revision history for this message

Dan Wells (dbw2) wrote on 2019-05-28:

#37

Mike,

It looks like you are right, the phrase commit appears to fix the issue I noted in #35. I can also confirm it fixes the performance issues caused by multi-phrase queries, so that is very good.

A few new issues surfaced in broader testing:

1) record_list() queries (e.g. from lists) do not work. They generate an error of 'Can't call method "isa" on an undefined value at /usr/local/share/perl/5.22.1/OpenILS/Application/Storage/QueryParser.pm line 1772.' and the service dies. This was easy enough to paper over with changing the line to:

} elsif ($new_nodes[0] and $new_nodes[0]->isa('QueryParser::query_plan::node')) {

but I have not looked any deeper at whether that was a good idea, as we soon hit another problem.

2) Phrase searches in a specific index (e.g. subject:"Kidnapping") generate bad SQL, with the following error:

ERROR: operator does not exist: && tsquery
LINE 2: WITH x564fb2db5a98_subject_xq AS (SELECT &&

The error is easy enough to see. It appears from the debug output to occur early in the process, as an extra '&', seeps into the query structure. It also appears just from reading the debug that the parser tries to add nodes for both the index and the phrase without realizing they are one and the same node, but again, that is just a cursory evaluation of the output. I can dig deeper if that would be helpful.

Thanks again!

Dan

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-10-23:

#38

Regarding 1) there are 2 parts to the fix, one of which is to do full facet/filter/modifier pullup, and the other is the belt-and-suspenders check that you added. Both are now in the branch, along with some other minor changes.

Re 2, I see it happening in the debug output, but it's not clear to me how that's happening. As for adding nodes for both the atom and the phrase, that's intended. They generate different SQL components. We could invent a way to combine those nodes, but that's not there yet.

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-10-25:

#39

I've added a late-phase cleanup commit that addresses Dan's second issue.

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-11-08:

#40

I've pushed out part of my branch above in hopes of addressing #1808055, which deals with "losing" dynamic filters that are inside nested nodes in the query tree.

Revision history for this message

Mike Rylander (mrylander) wrote on 2019-11-20:

#41

With the help of a colleague I identified a new pullup-related bug that discards negation of terms. I pushed a new commit to address this, and to provide a belt to the suspenders of my early commit dealing with Dan's reported "extra joiner" issue.

Eyes appreciated!

Jason Stephenson (jstephenson) on 2020-12-01

Changed in evergreen:
milestone:	none → 3.6.2
status:	New → Confirmed

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2021-02-05:

#42

I started looking at this branch today on a VM with rel_3_4, and I came to the conclusion that it can no longer be considered signed off, so I have removed that tag and the targets for the upcoming security/patch releases.

I think this branch needs Perl live tests for the search behavior and fixes.

In addition, I think the top commits could be squashed, and I wonder if commit d4b07f5c should remain.

Changed in evergreen:
milestone:	3.6.2 → none
tags:	added: needstest removed: signedoff

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2021-06-14:

#43

See also duplicate bug 1849328.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2021-11-15:

#45

We're still seeing this issue. Today the attacker switched to POST instead of GET to submit the malicious searches, which circumvented our existing mitigation (blocking known malicious query strings in nginx). I'm blocking the same patterns in EGCatLoader/Search.pm now, which is not a permanent solution.

I'm not sure of the current status of the working branch for this one?

Jason Stephenson (jstephenson) on 2022-02-17

no longer affects:	evergreen/3.4
no longer affects:	evergreen/3.5

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2022-05-11:

#46

Seemingly innocuous searches with embedded parenthesis or quotes can trigger this issue as well. The search below, entered by staff, caused the search to run forever. I don't recall if it was using a dangerous amount of RAM or temp space.

"winter" AND "sports"

Jeff Davis (jdavis-sitka) on 2022-05-11

Changed in evergreen:
importance:	Undecided → Critical

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2022-05-13:

#47

Another duplicate bug report, bug 1972901.

Revision history for this message

Mike Rylander (mrylander) wrote on 2022-12-02:

#48

Just as a head's up, I'm back to working on this.

As a preview, I have an adjacent unrelated branch from which I'll pick a commit that may help with certain subclasses of the issue -- specifically, the multiple-exact-matches pattern from Jason Stephenson in comment 47 -- by folding un-anchored quoted term searches (one word or phrase) directly into the tsquery rather than requiring a separate WHERE clause that pulls in the bare text data.

More as the story develops.

Revision history for this message

Mike Rylander (mrylander) wrote on 2022-12-15:

#49

tl;dr: Branch at security/user/miker/lp-1775958-malicious-search-queries

This branch uses, as its starting point, the content of the first few commits from the earlier branch, but then significantly reworks the pullup logic, as well as phrase and boolean handling. The outcome is much larger tsquery strings (good! more "search" bang for each index scan buck), much fewer joins where possible (less duplicated search work), and index probing for phrase searches (all phrase searches except for left/right anchored search is handled as part of the main tsquery), for all non-trivial search constructs.

Basically, the "long list of OR'd ISBNs" and "random SQL injection attempt" use cases should be entirely mitigated WRT time and RAM explosions.

There's also a commit to let the query parser testing tool work without a complete, running Evergreen stack.

From the commit message:

The bulk of this commit reworks the query tree pullup logic, which is responsible for simplifying the query tree that is used to generate the SQL query for search. In particular, we now do a better job of finding opportunities to merge adjacent parts of the query that have the same requested_class (pre-dealiasing) in the face of boolean OR operators, explicit grouping, and alternating requested_class values. The result is fewer joins in the SQL, which should speed up all but the most trivial searches, and generally help protect the database from mis- or mal-constructed queries. We also now use CTEs to separate branches of the logical search tree into descrete subqueries, which helps reduce the total core query JOINs, and provides the planner with more options for join order.

This also does away with the conversion of a negated atom into an "un-phrase". Instead, we just detect and handle those directly as atoms with a prefix, as appropriate. This allows single negated words to be used directly in the core tsquery construct, rather than having them require a separate join and special where clause.

Additionally, this commit handles phrases differently at both the QP and SQL level, making use of Postgres's phrase support in modern versions and simplifying how they're handled within the base parse tree structure.

tl;dr: Branch at security/user/miker/lp-1775958-malicious-search-queries

This branch uses, as its starting point, the content of the first few commits from the earlier branch, but then significantly reworks the pullup logic, as well as phrase and boolean handling.  The outcome is much larger tsquery strings (good! more "search" bang for each index scan buck), much fewer joins where possible (less duplicated search work), and index probing for phrase searches (all phrase searches except for left/right anchored search is handled as part of the main tsquery), for all non-trivial search constructs.

Basically, the "long list of OR'd ISBNs" and "random SQL injection attempt" use cases should be entirely mitigated WRT time and RAM explosions.

There's also a commit to let the query parser testing tool work without a complete, running Evergreen stack.

From the commit message:

The bulk of this commit reworks the query tree pullup logic, which is responsible for simplifying the query tree that is used to generate the SQL query for search.  In particular, we now do a better job of finding opportunities to merge adjacent parts of the query that have the same requested_class (pre-dealiasing) in the face of boolean OR operators, explicit grouping, and alternating requested_class values.  The result is fewer joins in the SQL, which should speed up all but the most trivial searches, and generally help protect the database from mis- or mal-constructed queries.  We also now use CTEs to separate branches of the logical search tree into descrete subqueries, which helps reduce the total core query JOINs, and provides the planner with more options for join order.

This also does away with the conversion of a negated atom into an "un-phrase".  Instead, we just detect and handle those directly as atoms with a prefix, as appropriate.  This allows single negated words to be used directly in the core tsquery construct, rather than having them require a separate join and special where clause.

Additionally, this commit handles phrases differently at both the QP and SQL level, making use of Postgres's phrase support in modern versions and simplifying how they're handled within the base parse tree structure.

Jason Stephenson (jstephenson) on 2022-12-22

Changed in evergreen:
assignee:	nobody → Jason Stephenson (jstephenson)

Jason Stephenson (jstephenson) on 2022-12-28

tags:

added: needsreleasenote signedoff

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2022-12-28:

#50

I've been testing Mike's branch with rel_3_10 and production as well as concerto data. While this is a bit difficult to test, there's a definite improvement.

I can say that known "malicious" queries (including the previous examples in this bug) no longer cause PostgreSQL crashes. Other complicated queries seem to perform better with this branch applied. I did not do any real performance analysis or comparisons with an unpatched installation owning to time and hardware limitations.

I'm satisfied that this code works, particularly in combination with a few other patches.

I've pushed a signoff branch to the security repository: user/dyrcona/lp-1775958-malicious-search-queries-signoff.

The code needs a release note, and I believe that Mike would still like to come up with some tests, so the code should probably not make it into mainline just yet, but I'm satisfied with the code so far.

Changed in evergreen:
assignee:	Jason Stephenson (jstephenson) → nobody

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2023-01-09:

#51

CW MARS has this patch installed in production as of 2023-01-04. It has been in place for nearly a week with no issues so far.

Revision history for this message

Mike Rylander (mrylander) wrote on 2023-02-21:

#52

I've pushed a rebased version of this branch with release notes to security/user/miker/lp-1775958-malicious-search-queries-signoff-rebase

tags:

removed: needsreleasenote needstest

Mike Rylander (mrylander) on 2023-02-21

Changed in evergreen:
milestone:	none → 3.11-beta

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2023-02-24:

#53

The pull up patch appears to be causing bug 2008423. It is the only one of the security patches that CWMARS has installed in production.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2023-02-24:

#54

I should have provided more detail in my previous comment:

We have the patch applied to production. Benjamin Kailish who reported bu 2008423 reported this based on his experience with the CWMARS catalog.

I installed a stock 3.10 Evergreen looking at a copy of production data upgraded to 3.10 and was not able to reproduce the "no results with parenthesis" part of the bug.

On a 3.10 test server with the patch applied, I got no results in the Angular Staff search when using parenthesis. When I reverted the commit that does the pull up changes and reinstalled Evergreen, I got results with parenthesis in the search queries.

As mentioned in my comment on the other bug, I am not able to reproduce this in the patron OPAC. I get results when using parenthesis there.

Revision history for this message

Galen Charlton (gmc) wrote on 2023-03-23:

#55

Download full text (5.8 KiB)

Confirming the problems, alas - the patch as is is resulting in SQL syntax errors, for example:

023-03-23 15:30:17.404 EDT [9850] egx@egx ERROR: syntax error at or near "AS" at character 1027
2023-03-23 15:30:17.404 EDT [9850] egx@egx STATEMENT: -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords #staff core_limit(100000) badge_orgs(1) estimation_strategy(inclusion) skip_check(0) check_limit(1000) (keyword:fortran (computer) site(CONS)
WITH w AS (

        WITH x563777dfeaf8_keyword_xq AS (SELECT
              ((to_tsquery('english_nostop', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$fortran$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$fortran$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')))&&
              ((to_tsquery('english_nostop', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$computer$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$computer$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')))) AS tsq,
              (to_tsquery('english_nostop', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$fortran$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$fortran$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) ||
              (to_tsquery('english_nostop', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$computer$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) || to_tsquery('simple', COALESCE(NULLIF( '(' || btrim(regexp_replace(search_normalize(split_date_range($_9849$computer$_9849$)),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq_rank ),
         x563777dfeaf8_keyword AS (
                  SELECT fe.*, fe_weight.weight, x563777dfeaf8_keyword_xq.tsq, x563777dfeaf8_keyword_xq.tsq_rank /* search */
                    FROM metabib.keyword_field_entry AS fe
                      JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
                      JOIN x563777dfeaf8_keyword_xq ON (fe.index_vector @@ x563777dfeaf8_keyword_xq.tsq)
                        UNION ALL
                  SELECT fe.id, fe.source, fe.field, fe.value, fe.index_vector, fe_weight.weight, x563777dfeaf8_keyword_xq.tsq, x563777dfeaf8_keyword_xq.tsq_rank /* virtual field addition */
                    FROM metabib.author_field_entry AS fe
                      JOIN config.metabib_field_virtual_map AS fe_weight ON (fe_weight.virtual = 45 AND fe_weight.real IN (8) AND fe_weight.real = fe.field)
                      JOIN x563777dfeaf8_keyword_xq ON (fe.index_vector @@ x563777dfeaf8_keyword_xq.tsq)
                        UNION ALL
                  SELECT fe.id, fe.source, fe.field, fe.value, fe.index_vector, fe_weight.weight, ...

	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	Critical	Unassigned	Evergreen 3.11-beta
3.10	Fix Released	Critical	Unassigned	Evergreen 3.10.2
3.8	Fix Released	Critical	Unassigned	Evergreen 3.8.4
3.9	Fix Released	Critical	Unassigned	Evergreen 3.9.3

Changed in evergreen:
milestone:	3.11.0 → 3.11-beta
status:	Confirmed → Fix Released
no longer affects:	evergreen/3.6
no longer affects:	evergreen/3.7
information type:	Private Security → Public Security

Evergreen

Malicious search queries can cause downtime

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches