Google bot causing significant unnecessary load on our servers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Triaged | High | Unassigned |
Bug Description
In an analysis of two 20-second time slices [1], compared with other user agents and webservice clients, Google bot made the most requests, leading the next highest by a factor of approximately 1.2 [2], and was responsible for the most page-rendering time on our servers, by a factor of approximately 1.7 [3].
This is not conclusive, but very suggestive that Google bot is the biggest single consumer of our hardware's resources.
We have nothing that keeps Google from indexing unnecessary pages: either multiple batches of the same content in differing sort orders, or content that has not changed since the Google bot's last visit. Launchpad devs anecdotally report seeing Google bot as the only source of some of our rarer OOPSes, found when viewing high-numbered batches with relatively unusual sort orderings.
There may be some low-hanging fruit here that could significantly reduce our load and improve our hardware usage, perhaps even helping to alleviate the issues that cause us to have to kick haproxy periodically. Google supports pattern matching in robots.txt [4], and we might be able to use it to keep the bot away from most of the duplicated sort orderings of our pages.
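As a rough illustration of that approach, a robots.txt stanza along the following lines could keep Googlebot out of re-sorted and deep-batched views. This is only a sketch: the orderby, start, and batch parameter names are assumptions used for illustration and would need to be checked against the query parameters Launchpad actually emits.

    User-agent: Googlebot
    # Skip re-sorted views of listings already crawled in the default order
    # (parameter name is illustrative)
    Disallow: /*?*orderby=
    # Skip deep batch pages reached through "next" links
    # (parameter names are illustrative)
    Disallow: /*?*start=
    Disallow: /*?*batch=

Care would be needed not to block the first, default-ordered batch of each listing, since that is the one copy we do want indexed.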
Much harder work in the same direction would be to implement efficient support for the If-Modified-Since HTTP header. This bug is not about that work.
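For context only, the general shape of that harder work looks roughly like the sketch below. This is not Launchpad code; the maybe_not_modified helper and its arguments are invented for illustration of the If-Modified-Since handshake.

    from email.utils import format_datetime, parsedate_to_datetime

    def maybe_not_modified(request_headers, last_modified, render):
        """Return (status, headers, body), short-circuiting to 304 when possible.

        last_modified must be a timezone-aware datetime; render() is the
        expensive page-rendering call we are trying to avoid.
        """
        ims = request_headers.get('If-Modified-Since')
        if ims:
            try:
                if last_modified <= parsedate_to_datetime(ims):
                    # Nothing changed since the crawler last fetched the page:
                    # skip rendering entirely and answer with an empty 304.
                    return '304 Not Modified', [], b''
            except (TypeError, ValueError):
                pass  # malformed or incomparable date: fall through and render
        headers = [('Last-Modified', format_datetime(last_modified, usegmt=True))]
        return '200 OK', headers, render()

The hard part is not the header handling itself but computing last_modified cheaply enough that the check costs much less than rendering the page.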
This bug is marked critical because we believe it may help with the HA Proxy Queue Depth issue by addressing a systemic problem, rather than only looking at the particular clients that push us over the edge of our capacity in a given incident.
First steps that I see would be to verify the analysis with a much larger time slice, and then to pursue the robots.txt configuration.
[1] Done for our repeating haproxy restarts, https:/
[2] See "Top 20 useragents (by request count)" in the two pastebins from the first footnote.
[3] See "Top 20 useragents (by aggregated duration)" in the two pastebins from the first footnote.
[4] See http://
Changed in launchpad:
importance: Critical → High
You should also look at http://en.wikipedia.org/wiki/Site_map, which can give the crawler clues about which pages to walk and how often; it should be pretty easy to set up, and it is more finely grained than robots.txt.
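For illustration, a minimal sitemap entry looks like the following; the loc, lastmod, changefreq, and priority values here are placeholders, and whether Launchpad can compute lastmod cheaply for its listings is exactly the open question.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- placeholder values for illustration only -->
        <loc>https://launchpad.net/example-page</loc>
        <lastmod>2010-01-01</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.5</priority>
      </url>
    </urlset>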
> This is not conclusive, but very suggestive that Google bot is the biggest single consumer of our hardware's resources.
I did a grep -c and I see it making 12% of requests, versus 44% coming from various restfulclient consumers. I suppose it depends on how you count it.