The source browser invites spiders to destroy site performance with massive amounts of spurious crawling

Bug #898858 reported by Jean-Paul Calderone
This bug affects 1 person
Affects: Twisted Website
Importance: Undecided
Assigned to: Unassigned

Bug Description

For example, http://twistedmatrix.com/trac/browser/trunk currently has a link to http://twistedmatrix.com/trac/log/trunk/admin?rev=33262 which links to one hundred other revisions and changesets. Spiders fall down this rabbit hole and never hit the bottom.

I don't think we need to have all the old revisions indexed. Keeping trunk@HEAD up to date in the various indexes is nice and seems sufficient.

Jonathan Jacobs (jjacobs) wrote :

I see that robots.txt now disallows a bunch of browser and changeset related URLs.

Is this bug still open because you think something more should be changed with regard to spiders destroying site performance?

Jean-Paul Calderone (exarkun) wrote :

It doesn't disallow http://twistedmatrix.com/trac/log/trunk/admin?rev=33262 though, does it?
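
A question like this can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`. The rule set is hypothetical: the actual robots.txt is not quoted in this report, so the sketch assumes it disallows only `/trac/browser` and `/trac/changeset`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the real robots.txt is not quoted in this report.
ASSUMED_RULES = """\
User-agent: *
Disallow: /trac/browser
Disallow: /trac/changeset
"""


def is_allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# URLs taken from the bug description.
log_url = "http://twistedmatrix.com/trac/log/trunk/admin?rev=33262"
browser_url = "http://twistedmatrix.com/trac/browser/trunk"

log_allowed = is_allowed(ASSUMED_RULES, log_url)
browser_allowed = is_allowed(ASSUMED_RULES, browser_url)
```

Under these assumed rules the `/trac/log` URL is still fetchable, because robots.txt rules match by path prefix and no rule covers `/trac/log`.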

Richard Wall (richardw) wrote :

One solution (discussed on IRC) might be:
 * modify robots.txt to disallow /trac/browser
 * add a sitemap file containing links to only the current / head files in trac
 * specify a crawl-delay

See:
 * https://en.wikipedia.org/wiki/Sitemaps
 * https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nonstandard_extensions
 * http://trac.calendarserver.org/robots.txt
 * http://trac.edgewall.org/robots.txt
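
As a rough illustration of the three steps proposed above, the sketch below builds a robots.txt and reads it back with Python's standard `urllib.robotparser`. The delay value, the sitemap URL, and the helper name `build_robots_txt` are assumptions made for the example, not values agreed in this report. Crawl-delay is a nonstandard extension, so not every crawler will honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumed values, chosen only for illustration.
ASSUMED_DELAY_SECONDS = 10
ASSUMED_SITEMAP_URL = "http://twistedmatrix.com/sitemap.xml"  # hypothetical


def build_robots_txt(disallowed, crawl_delay, sitemap_url):
    """Render a robots.txt that applies to all user agents."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed]
    lines.append(f"Crawl-delay: {crawl_delay}")
    lines.append("")  # a blank line ends the user-agent record
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    ["/trac/browser"], ASSUMED_DELAY_SECONDS, ASSUMED_SITEMAP_URL
)

# Read the file back to confirm a parser sees what was intended.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

browser_blocked = not parser.can_fetch(
    "*", "http://twistedmatrix.com/trac/browser/trunk"
)
delay = parser.crawl_delay("*")
sitemaps = parser.site_maps()  # requires Python 3.8 or later
```

The sitemap file itself, listing only the current trunk files, would still have to be generated separately; this sketch only covers the robots.txt side.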

Jean-Paul Calderone (exarkun) wrote :

The source browser is now completely excluded from spidering by robots.txt. This clearly isn't the *ideal* solution.

Changed in twisted-website:
status: New → Fix Committed
status: Fix Committed → Triaged
status: Triaged → Fix Committed