Comment 7 for bug 1361782

Mike Rylander (mrylander) wrote :

Comments from Jason and others are correct: the "queue compression" code only protects the backend from an avalanche of identical searches; it does not stop Apache itself from being overwhelmed.

There are Apache modules that can be configured for rate limiting. However, most work based on total connections per IP or similar metrics. The cost per search is much higher than the cost of other, normal requests, so a relatively tiny number of search requests from a particular IP can cause a problem. IOW, the traffic that causes the problem is very difficult to detect because there is so little of it compared to other, non-problematic traffic. A cluster of Apache servers only makes this worse, of course, because those multiple Apache instances don't cooperate to compare total traffic.
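To make the cost asymmetry concrete, here is a hypothetical sketch (the cost numbers are illustrative, not measured Evergreen figures) of why a per-connection limiter throttles the wrong client while a cost-weighted view catches the real offender:

```python
# Illustrative relative backend cost per request type (made-up numbers).
REQUEST_COST = {"static": 1, "page": 5, "search": 200}

def total_cost(requests):
    """Sum the relative backend cost of a list of request types."""
    return sum(REQUEST_COST[r] for r in requests)

normal_ip = ["static"] * 90 + ["page"] * 10  # 100 cheap requests
search_ip = ["search"] * 5                   # only 5 expensive searches

# A connection-counting limiter sees 100 requests vs. 5 and would
# throttle normal_ip; weighting by cost shows search_ip is far heavier.
assert len(normal_ip) > len(search_ip)
assert total_cost(search_ip) > total_cost(normal_ip)
```

This is the core of the detection problem: by volume alone the problematic traffic is invisible.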

What's more, identifying the "bad" traffic requires inspecting and interpreting the full URL. We do that in the "queue compression" code, but at that layer we don't have access to the IP address of the client. Perhaps we can teach the existing code to inform the mod_perl layer that there are already requests in flight for the search in question (IOW, that there is, in fact, a compressed search queue), and also augment the queue compression code to count the number of concurrent searches in the queue. The mod_perl layer could then correlate the IP with the queue size and, if a threshold is passed, set a flag in memcache so that future requests from that IP address for any search at all are dropped for some amount of time. The effect would be similar to a human blocking the IP as described above.
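A rough sketch of that flag-setting logic, in Python rather than the mod_perl layer's actual Perl, with an in-memory stand-in for memcache; the names and thresholds (`QUEUE_THRESHOLD`, `BLOCK_SECONDS`, `on_search_queued`, `should_drop`) are all hypothetical, not existing Evergreen APIs:

```python
import time

QUEUE_THRESHOLD = 3    # hypothetical: queued identical searches tolerated
BLOCK_SECONDS = 300    # hypothetical: how long an offending IP stays flagged

# Stand-in for memcache: key -> (value, expiry timestamp)
cache = {}

def cache_set(key, value, ttl):
    cache[key] = (value, time.time() + ttl)

def cache_get(key):
    entry = cache.get(key)
    if entry is None or entry[1] < time.time():
        return None
    return entry[0]

def on_search_queued(ip, queue_depth):
    """Called when the search API reports a compressed queue for a search
    this IP requested; flags the IP if the queue is too deep."""
    if queue_depth >= QUEUE_THRESHOLD:
        cache_set("blocked:" + ip, True, BLOCK_SECONDS)

def should_drop(ip):
    """The mod_perl layer would consult this before servicing any search."""
    return cache_get("blocked:" + ip) is not None

on_search_queued("203.0.113.7", queue_depth=5)
assert should_drop("203.0.113.7")
assert not should_drop("198.51.100.1")
```

The TTL on the flag gives the "for some amount of time" behavior: the block expires on its own, with no human intervention needed to lift it.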

The biggest drawback is that we risk blocking an entire branch without human intervention. So perhaps any request that receives a "search is queued" message with a threshold-passing count from the search API is simply cut short with a "too many requests" response to the client. That, at least, would only kill the identical searches. However, some resources would still be spent servicing those requests. IOW, the bar to a DoS would be raised, but not removed.
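The softer option could look something like this hypothetical sketch: refuse only the over-queued identical search with HTTP 429 ("Too Many Requests") instead of blocking the whole IP. The queue-depth bookkeeping and threshold here are assumptions for illustration:

```python
QUEUE_THRESHOLD = 3  # hypothetical: concurrent identical searches allowed

def handle_search(search_key, queue_depths):
    """Return (status, body) for a search, given current per-search
    queue depths. Only the over-queued search is refused; other
    searches, even from the same client, still go through."""
    depth = queue_depths.get(search_key, 0)
    if depth >= QUEUE_THRESHOLD:
        return (429, "Too Many Requests")
    queue_depths[search_key] = depth + 1
    return (200, "search results")

depths = {"title=harry+potter": 5}          # already heavily queued
assert handle_search("title=harry+potter", depths)[0] == 429
assert handle_search("title=dune", depths)[0] == 200
```

Note the trade-off the comment describes: the request still has to reach this code and be parsed before it can be refused, so some cost is paid either way.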

It's a tricky problem...