The source browser invites spiders to destroy site performance with massive amounts of spurious crawling

Bug #898858 reported by Jean-Paul Calderone on 2011-12-01
This bug affects 1 person
Affects: Twisted Website
Importance: Undecided
Assigned to: Unassigned

Bug Description

For example, http://twistedmatrix.com/trac/browser/trunk currently has a link to http://twistedmatrix.com/trac/log/trunk/admin?rev=33262 which links to one hundred other revisions and changesets. Spiders fall down this rabbit hole and never hit the bottom.

I don't think we need to have all the old revisions indexed. Keeping trunk@HEAD up to date in the various indexes is nice and seems sufficient.

Jonathan Jacobs (jjacobs) wrote :

I see that robots.txt now disallows a bunch of browser and changeset related URLs.

Is this bug still open because you think something else should be changed to stop spiders from destroying site performance?

Jean-Paul Calderone (exarkun) wrote :

It doesn't disallow http://twistedmatrix.com/trac/log/trunk/admin?rev=33262 though, does it?

Richard Wall (richardw) wrote :

One solution (discussed on IRC) might be:
 * modify robots.txt to disallow /trac/browser
 * add a sitemap file containing links to only the current / head files in trac
 * specify a crawl-delay

See:
 * https://en.wikipedia.org/wiki/Sitemaps
 * https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nonstandard_extensions
 * http://trac.calendarserver.org/robots.txt
 * http://trac.edgewall.org/robots.txt
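
The proposal above can be checked before deploying it. The following is a minimal sketch, not the site's actual configuration: the robots.txt text is hypothetical, the /trac/log and /trac/changeset rules, the Crawl-delay value of 10, and the sitemap URL are all assumptions added for illustration. It uses Python's standard urllib.robotparser to confirm which URLs a compliant spider would skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. Only "Disallow: /trac/browser" comes from the
# proposal; the other paths, the delay, and the sitemap URL are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trac/browser
Disallow: /trac/log
Disallow: /trac/changeset
Crawl-delay: 10
Sitemap: http://twistedmatrix.com/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# URLs mentioned in this report, plus one page that should stay crawlable.
urls = [
    "http://twistedmatrix.com/trac/browser/trunk",
    "http://twistedmatrix.com/trac/log/trunk/admin?rev=33262",
    "http://twistedmatrix.com/trac/wiki",
]
for url in urls:
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(verdict, url)

print("crawl delay:", parser.crawl_delay("*"))
print("sitemaps:", parser.site_maps())  # site_maps() needs Python 3.8+
```

Note that Crawl-delay is a nonstandard extension, so not every spider honours it, and robots.txt rules match by path prefix, which is why a rule for /trac/browser alone would not cover the /trac/log URL.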

Jean-Paul Calderone (exarkun) wrote :

The source browser is now completely excluded from spidering by robots.txt. This clearly isn't the *ideal* solution.

Changed in twisted-website:
status: New → Fix Committed
status: Fix Committed → Triaged
status: Triaged → Fix Committed