Google bot causing significant unnecessary load on our servers

Bug #907552 reported by Gary Poster
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

In an analysis of two 20-second time slices [1], Google bot made more requests than any other user agent or webservice client, by a factor of approximately 1.2 over the next-highest [2], and was responsible for the most page-rendering time on our servers, by a factor of approximately 1.7 [3].

This is not conclusive, but it strongly suggests that Google bot is the biggest single consumer of our hardware resources.

We have nothing that keeps Google from indexing unnecessary pages: either multiple batches of the same content in differing sort orders, or content that has not changed since the Google bot's last visit. Launchpad devs anecdotally report seeing the Google bot as the only source of some of our rarer OOPSes, found when it views high-numbered batches with relatively unusual sort orderings.

There may be some low-hanging fruit here that could significantly reduce our load and improve our hardware usage, perhaps even helping to alleviate the issues that force us to kick haproxy periodically. Google supports pattern matching in robots.txt [4], and we might be able to use it to exclude most duplicated sort orders of our pages from crawling.
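As a rough illustration only (the query parameter names below, such as "orderby" and "start", are assumptions about our batch navigation and would need to be checked against the real URLs), a few wildcard rules of this shape could keep the crawler off re-sorted and deep-batched views of the same listings:

    User-agent: Googlebot
    # Skip re-sorted views of a listing (assumed sort parameter).
    Disallow: /*?*orderby=
    # Skip deep batches (assumed batching parameter).
    Disallow: /*?*start=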

Much harder work in the same direction would be to implement efficient support for the If-Modified-Since HTTP header. That work is out of scope for this bug.
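For reference, honoring that header would let an unchanged page cost a 304 instead of a full render; a hypothetical exchange (URL and date invented for illustration) would look roughly like:

    GET /ubuntu/+bugs?orderby=-importance HTTP/1.1
    Host: bugs.launchpad.net
    If-Modified-Since: Thu, 15 Sep 2011 10:00:00 GMT

    HTTP/1.1 304 Not Modified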

This bug is marked critical because we believe it may help with the HA Proxy Queue Depth issue by addressing a systemic problem, rather than looking at the particular clients that push us over the edge of our capacity in any given incident.

First steps that I see would be to verify the analysis with a much larger time slice, and then to pursue the robots.txt configuration.

[1] Analysis done for our recurring haproxy restarts: https://wiki.canonical.com/Launchpad/HAProxyQueueDepth20110915 . See in particular https://pastebin.canonical.com/53707/ and https://pastebin.canonical.com/53709/
[2] See "Top 20 useragents (by request count)" in the two pastebins from the first footnote.
[3] See "Top 20 useragents (by aggregated duration)" in the two pastebins from the first footnote.
[4] See http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=83097&rd=1 in the "Manually create a robots.txt file" section.

Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 907552] [NEW] Google bot causing significant unnecessary load on our servers

You should also look at http://en.wikipedia.org/wiki/Site_map , which
can give the crawler clues about which pages to walk and how often. It
should be pretty easy to set up, and it is more finely nuanced than
robots.txt.
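For what it's worth, a sitemap entry can carry hints like these (the
URL and the values are purely illustrative):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://launchpad.net/ubuntu/+bugs</loc>
        <lastmod>2011-12-15</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>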

> This is not conclusive, but very suggestive that Google bot is the biggest single consumer of our hardware's resources.

I did a grep -c and I see it making 12% of requests, vs 44% of them
coming from various restfulclient consumers. I suppose it depends on
how you count it.

Revision history for this message
Gary Poster (gary) wrote :

Hi Martin.

Yes, I was aware of the site map, but my understanding is that it is primarily about link inclusion rather than exclusion; robots.txt, as I understand it, is more about exclusion. Certainly that would be another avenue worth exploring. I seem to recall that we already have a Google site map of some sort set up.

In terms of counting, sure--lies, damn lies and statistics.

One difference is that our stats divided traffic by user agent first, and *then* divided restfulclient traffic further by the restfulclient client name.

Another difference in counting, which I already noted, is that looking at time spent per request significantly increases the relative weight of the Google bot. My hypothesis is that this is because it is indexing many lesser-used pages.

The reason I believe this is worth calling out is that I suspect large chunks of the work we do to serve the Google bot are unnecessary, and even 12% of total load by request count seems like a juicy target to me. It's even more compelling if the changes are low-hanging fruit.

We've been tasked with reviewing individual restfulclient scripts and attempting to optimize them because of our load issues, presumably because that is viewed as lower-hanging fruit than systemic webservice changes. These individual webservice clients each represent a fraction of the 44% you found. Tackling the 12% is at least as valid a target, and I suspect it is more so.

Revision history for this message
Curtis Hovey (sinzui) wrote :

I wonder if the problem of URL requests correlates with the use of safe_action (launchpadform), where we explicitly permit GET form submissions because we want users and bots to be able to see and save the URL. If there is a relationship, safe_action could add an x-robots header or meta tag to deter bots. Adding nofollow would let the bots index the page but not follow its links, so Google would still know about all the first pages and the few deep pages that users have posted links to.
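As a sketch of what that could look like (where exactly safe_action would emit this is an open question), either an HTTP response header:

    X-Robots-Tag: nofollow

or a tag in the page head:

    <meta name="robots" content="nofollow" />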

Curtis Hovey (sinzui)
Changed in launchpad:
importance: Critical → High