Comment 3 for bug 1507845

Dan Scott (denials) wrote:

Galen - you're not wrong. In principle, that's exactly what sitemaps are supposed to do.

However, after we told the robots not to follow links to the search results / browse pages, the portion of our sitemap that was actually indexed plummeted from the full 800K pages down to 50K. In practice, a sitemap is just a guide: search engines will index the pages it lists only if they have already found a healthy number of links to those pages, both from external sources and internally.

This is apparently reasonably well-known in the SEO community, but I was caught up in the more idealistic view of the world where *of course* search engines want to index each and every one of our individual pages with all of their finely crafted schema.org linked data! Wrong.

Sites that experience database load problems due to robots crawling them can stop that with a simple rule in robots.txt:

User-agent: *
Disallow: /eg/opac/results

Of course, that will only stop well-behaved robots. Those that ignore robots.txt will likely also ignore rel="nofollow" attributes.
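For reference, adding that attribute to a catalogue link looks something like this (the URL is purely illustrative, not a specific template change):

<!-- A results link marked so that well-behaved crawlers skip it -->
<a href="/eg/opac/results?query=example" rel="nofollow">example</a>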

As a corollary to this bug, I suppose I could add sample robots.txt files and documentation on basic SEO settings (telling search engines to ignore various GET parameters, etc.).
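A starting point for such a sample might look something like the following (the browse path and the sitemap line are assumptions about a typical Evergreen install, not tested settings):

User-agent: *
# Keep crawlers out of the expensive, effectively unbounded search and browse spaces
Disallow: /eg/opac/results
Disallow: /eg/opac/browse

# Point well-behaved crawlers at the generated sitemap instead
Sitemap: https://example.org/sitemap1.xml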