Correct search engine optimization -- allow robots to crawl, but not index, results

Bug #1507845 reported by Dan Scott on 2015-10-20
This bug affects 1 person
Affects: Evergreen | Importance: Undecided | Assigned to: Unassigned

Bug Description

Back in bug #1414033, we added rel="nofollow" attributes to many of the links leading to search results or call number browsing. That was the right idea, but a bit overzealous as it turns out: we actually want search engines to follow those links; we just don't want them to index the resulting pages.

Therefore, we can tweak the <meta> tag in the results and browse page headers to say "follow,noindex", and remove many of the inline rel="nofollow" attributes from the other pages.
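Concretely, the header tweak amounts to a robots meta directive on the results and browse pages (a sketch only; the exact markup and its location in Evergreen's templates may differ):

    <meta name="robots" content="follow,noindex">

A crawler honoring this directive will traverse the links on the page but keep the page itself out of its index, which is the inverse of what the blanket rel="nofollow" attributes achieved.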

Dan Scott (denials) wrote:
Changed in evergreen:
milestone: none → 2.next
tags: added: pullrequest
Galen Charlton (gmc) wrote:

But wouldn't specifying that crawlers can follow such links mean that they'd end up kicking off more searches (and thereby additional load on the database) when, in principle, the generated sitemap takes care of identifying all of the records in the catalog?

Dan Scott (denials) wrote:

Galen - you're not wrong. In principle, that's exactly what sitemaps are supposed to do.

However, after we told robots not to follow links to the search results / browse pages, the amount of our sitemap that was actually indexed plummeted from the full 800K pages down to 50K pages. That showed us that, in practice, a sitemap is just a guide: search engines act on it only if they have already found a healthy number of links to the site's pages (both from external sources and internally) in the first place.

This is apparently reasonably well-known in the SEO community, but I was caught up in the more idealistic view of the world where *of course* search engines want to index each and every one of our individual pages with all of their finely crafted schema.org linked data! Wrong.

Sites that experience problems with database load due to robots crawling them can stop that with a simple line in robots.txt:

Disallow: /eg/opac/results

Of course, that will only stop well-behaved robots. Those that ignore robots.txt will likely also ignore rel="nofollow" attributes.

As a corollary to this bug, I suppose I could add sample robots.txt files and documentation on basic SEO settings (telling search engines to ignore various GET parameters, etc.).
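For instance, a minimal robots.txt along those lines might look like the following (the Disallow line is the one discussed above; the Sitemap line and its hostname are placeholders for a real catalogue's values):

    User-agent: *
    Disallow: /eg/opac/results
    Sitemap: https://catalogue.example.org/sitemap1.xml

This keeps well-behaved crawlers out of the search results entirely while still pointing them at the sitemap for record detail pages.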

Ben Shum (bshum) wrote:

Dan, in the interests of getting this out there, I pushed it to master for 2.10-beta, but will not be backporting it to 2.9 series for the time being. If the release maintainers decide to add it, they can, but otherwise, I'm fine with making this the practice moving forward.

no longer affects: evergreen/2.9
no longer affects: evergreen/master
Changed in evergreen:
status: Fix Committed → Fix Released