The source browser invites spiders to destroy site performance with massive amounts of spurious crawling

Bug #898858 reported by Jean-Paul Calderone on 2011-12-01
This bug affects 1 person
Affects: Twisted Website

Bug Description

For example, the source browser currently has a link to a page which in turn links to one hundred other revisions and changesets. Spiders fall down this rabbit hole and never hit the bottom.

I don't think we need to have all the old revisions indexed. Keeping trunk@HEAD up to date in the various indexes is nice and seems sufficient.

Jonathan Jacobs (jjacobs) wrote :

I see that robots.txt now disallows a bunch of browser and changeset related URLs.

Is this bug open because there is something you think should be changed with regards to spiders destroying site performance?

Jean-Paul Calderone (exarkun) wrote :

It doesn't disallow though, does it?

Richard Wall (richardw) wrote :

One solution (discussed on IRC) might be:
 * modify robots.txt to disallow /trac/browser
 * add a sitemap file containing links to only the current / head files in trac
 * specify a crawl-delay
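A sketch of what that combined robots.txt might look like, assuming the Trac instance lives under /trac and the sitemap is served at /sitemap.xml (both paths are assumptions for illustration, not confirmed from this report):

```
# Keep spiders out of the revision-history rabbit hole
User-agent: *
Disallow: /trac/browser
Disallow: /trac/changeset

# Ask well-behaved crawlers to wait between requests (seconds)
Crawl-delay: 10

# Point crawlers at a curated list of current/head files instead
Sitemap: https://twistedmatrix.com/sitemap.xml
```

Note that Crawl-delay is a non-standard extension honored by some crawlers (e.g. Bing) but ignored by others (Google ignores it in favor of Search Console settings), so the Disallow rules do the real work here.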


Jean-Paul Calderone (exarkun) wrote :

The source browser is now completely excluded from spidering by robots.txt. This clearly isn't the *ideal* solution.

Changed in twisted-website:
status: New → Fix Committed
status: Fix Committed → Triaged
status: Triaged → Fix Committed