Evergreen Denial of Service easily accomplished

Bug #1361782 reported by Dan Pearl
This bug affects 2 people
Affects      Status         Importance   Assigned to   Milestone
Evergreen    Fix Released   Medium       Unassigned
3.8          Fix Released   Medium       Unassigned
3.9          Fix Released   Medium       Unassigned

Bug Description

Evergreen 2.5
Other product versions irrelevant

This afternoon, all our brick heads got saturated and users got Network Errors when attempting to log in. Load was normal on the database server. Investigation by Equinox revealed that:

"it looks like there is a search for 'glorias way' being conducted over and over at the [branch deleted to protect the guilty] ip address. 4932 times when last counted. We temporarily blocked the ip address and reloaded apache on the brick heads. We will wait a few minutes and unblock the ip address. "

Their actions resolved the problem.

It is unclear whether the problem was caused by someone falling asleep with their finger on "Enter" (autorepeat) or by an object resting on the key. What is clear is that this is a not-unlikely occurrence.

Revision history for this message
Ben Shum (bshum) wrote :

Funnily enough, this happened to us about two weeks ago on our production systems (master as of the 2.6.1-ish era) too. Same symptoms: a library ended up issuing the same search page request 1000+ times and ate up all our apache workers on our bricks, after which all the rest of the libraries were getting errors or dead pages.

We blocked that library's PC by IP, and then we found out they were still using Windows XP and kindly "suggested" that they upgrade at least to Windows 7 before we allowed them back onto Evergreen. We do not believe that was the real problem, but it was a good excuse at the time to get them to upgrade.

I've seen this sort of effect occur with other approaches too, like the time we "load tested" production by pointing a small script at it to request the library home page 2000 times (which overloaded all the workers).

I've wondered whether this is also something we could mitigate with apache configuration best practices, like adding a reasonable rate limiter on requests from the same IP address so that we don't burn all our apache resources on any one person or bot.
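
As a rough illustration of that kind of apache-side throttle (this is a hypothetical mod_evasive stanza, not anything shipped with Evergreen, and the thresholds are placeholders rather than tuned values):

    # Hypothetical mod_evasive settings -- thresholds are placeholders, not tuned values.
    <IfModule mod_evasive20.c>
        DOSPageCount       10    # same URI more than 10 times ...
        DOSPageInterval     1    # ... within a 1 second window
        DOSSiteCount      100    # any URI more than 100 times ...
        DOSSiteInterval     1    # ... within a 1 second window
        DOSBlockingPeriod  60    # then block that IP for 60 seconds
    </IfModule>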

That said, if there's an Evergreen-related issue, we should find that too...

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Dan Scott (denials) wrote :

I'm sure there's a previous security bug that Mike Rylander worked on which caches search queries to prevent exactly this sort of attack. Possibly there's some variation here that makes it ineffective in this situation, but we should at the very least link to the old bug...

Revision history for this message
Dan Scott (denials) wrote :

Bug 1200770 was a follow-up to it, but it wasn't the one I was thinking of...

Revision history for this message
Dan Scott (denials) wrote :

Bug 1172936 is the one I was thinking of; it sounds very, very similar to this.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

This is happening to us, and I don't think caching queries will be enough.

We have logs showing the following search spammed more than 22 times per second from a single host, leading up to a period with load above 100 and over 140 apache drones running:

GET /eg/opac/results?query=the+cleaner&qtype=title&fi%3Asearch_format=&locg=1&sort= HTTP/1.1

Revision history for this message
Mike Rylander (mrylander) wrote :

Comments from Jason and others are correct: the "queue compression" code only protects the backend from an avalanche of identical searches; it does not stop apache from being overwhelmed.

There are ways to configure apache for rate limiting with modules. However, most of them work based on total connections per IP or similar metrics. The cost per search is much higher than the cost of other, normal requests, meaning that a relatively tiny number of search requests from a particular IP can cause a problem. IOW, the traffic that causes the problem is very difficult to detect because there is so little of it compared to other, non-problematic traffic. A cluster of apache servers only makes this problem worse, of course, because those multiple apache instances don't cooperate to compare total traffic.

What's more, in order to identify the "bad" traffic, the full URL needs to be inspected and interpreted. We do that in the "queue compression" code, but at that layer we don't have access to the IP address of the client. Perhaps we can teach the existing code to inform the mod_perl layer of the fact that there are existing requests for the search in question (IOW, that there is, in fact, a compressed search queue), as well as augment the queue compression code to count the number of concurrent searches in the queue. Then the mod_perl layer can correlate the IP to the queue size, and, if a threshold is passed, set a flag in memcache so that future requests from that IP address for any searches in general can be dropped for some amount of time. The effect would be similar to a human blocking the IP as described above.
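
As a very rough sketch of that idea (illustrative only; the key names, thresholds, and Cache::Memcached usage here are assumptions, not Evergreen's actual code), the mod_perl-layer check could look something like:

    # Illustrative sketch only -- not Evergreen code. Assumes the mod_perl
    # handler can see the client IP and that the search layer reports how
    # many identical searches are already queued for this query.
    use strict;
    use warnings;
    use Cache::Memcached;

    use constant HTTP_TOO_MANY_REQUESTS => 429;
    use constant HTTP_OK                => 200;

    my $cache = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

    sub check_search_request {
        my ($client_ip, $queue_size) = @_;
        my $queue_threshold = 20;    # hypothetical per-query concurrency limit
        my $block_seconds   = 60;    # hypothetical cool-down for a flagged IP

        my $flag_key = "search_block:$client_ip";

        # If this IP was flagged recently, cut the request short.
        return HTTP_TOO_MANY_REQUESTS if $cache->get($flag_key);

        # If the compressed queue for this query is already too deep, flag
        # the IP so its future searches are dropped for a while.
        if ($queue_size > $queue_threshold) {
            $cache->set($flag_key, 1, $block_seconds);
            return HTTP_TOO_MANY_REQUESTS;
        }

        return HTTP_OK;
    }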

The biggest drawback to this is that we risk blocking an entire branch without human intervention. So, maybe any requests that receive a "search is queued" message with a threshold-passing count from the search API are just cut short with a "too many requests" response to the client. That, at least, would only kill the identical searches. However, some resources would still be used to service the requests. IOW, the bar to DoS would be raised, but not removed.

It's a tricky problem...

Revision history for this message
Mike Rylander (mrylander) wrote :

... and, 7 years later, I have a branch that should move us in the right direction to mitigate these sorts of problems. The branch is at security/user/miker/lp-1361782-restrict-concurrent-searches, and from the commit message:

This commit adds two types of simple DoS protection:

* Limit concurrent search requests per client IP address, regardless of the searches being performed. This helps address issues of accidental spamming from a malfunctioning OPAC workstation, or crawlers of various types. The limit is controlled by a global flag called "opac.max_concurrent_search.ip".

* Limit the global concurrent search requests for the same query. This helps address both simple and distributed DoS that send the same search request over and over. The limit is controlled by a global flag called "opac.max_concurrent_search.query", and defaults to 20.

When the limit is exceeded in either case, the client receives an HTTP 429 "Too many requests" response from the web server, and the connection is ended.
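
Assuming these land as rows in the usual config.global_flag table (an assumption based on how other Evergreen global flags are stored; the flag names come from the commit message above, the values shown are arbitrary), enabling them would look something like:

    -- Hypothetical example; adjust the values to taste.
    UPDATE config.global_flag
       SET enabled = TRUE, value = '10'
     WHERE name = 'opac.max_concurrent_search.ip';

    UPDATE config.global_flag
       SET enabled = TRUE, value = '20'
     WHERE name = 'opac.max_concurrent_search.query';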

tags: added: pullrequest
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I have tested this branch on an EOLI test server and one of my own. It works for me with Apache bench spamming requests. It can also be tested by going to an OPAC search page and holding down the "Enter" key after entering search terms. You'll get a 429 response once the global flags are enabled.
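
For reference, an Apache bench run along those lines might look like this (the hostname and search terms are placeholders); once the global flags are enabled you should start seeing 429 responses:

    # Spam the same catalog search from a single client.
    ab -n 2000 -c 50 'https://eg-test.example.org/eg/opac/results?query=the+cleaner&qtype=title'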

I'd say this needs a release note, and the settings should be documented. Other than that, the functionality works, so I've pushed a signoff of Mike's branch to the security repository:

user/dyrcona/lp-1361782-restrict-concurrent-searches-signoff

tags: added: needsreleasenote signedoff
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
milestone: none → 3.11-beta
Revision history for this message
Mike Rylander (mrylander) wrote :

I've rebased the branch against master and added release notes to document the global flags beyond the commit message. New branch up at security/user/miker/lp-1361782-restrict-concurrent-searches-rebase

tags: removed: needsreleasenote
Galen Charlton (gmc)
Changed in evergreen:
milestone: 3.11-beta → 3.10.1
no longer affects: evergreen/3.10
Revision history for this message
Galen Charlton (gmc) wrote :

Committed in the branches that will be used to build the March 2023 releases. Thanks, Mike and Jason!

Changed in evergreen:
status: Confirmed → Fix Committed
Galen Charlton (gmc)
information type: Private Security → Public Security
Changed in evergreen:
status: Fix Committed → Fix Released