Correct search engine optimization -- allow robots to crawl, but not index, results

Bug #1507845 reported by Dan Scott
This bug affects 1 person
Affects: Evergreen
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Back in bug #1414033, we added rel="nofollow" attributes to many of the links that led to search results or call number browsing. This was the right idea, but a bit overzealous as it turns out. We actually want search engines to follow those links; we just don't want search engines to index them.

Therefore, we can tweak the <meta> tag in the results and browse page headers to say "follow,noindex", and remove many of the inline rel="nofollow" attributes from the other pages.
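For instance, the robots meta tag in the results and browse templates could read something like the following (the exact markup in the Evergreen templates may differ slightly):

<meta name="robots" content="follow,noindex">

while the other pages keep the default "index,follow" behaviour and lose their inline rel="nofollow" attributes.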

Tags: pullrequest
Revision history for this message
Dan Scott (denials) wrote :
Changed in evergreen:
milestone: none → 2.next
tags: added: pullrequest
Revision history for this message
Galen Charlton (gmc) wrote :

But wouldn't specifying that crawlers can follow such links mean that they'd end up kicking off more searches (and thereby additional load on the database) when, in principle, the generated sitemap takes care of identifying all of the records in the catalog?

Revision history for this message
Dan Scott (denials) wrote :

Galen - you're not wrong. In principle, that's exactly what sitemaps are supposed to do.

However, after we told the robots not to follow links to the search results / browse pages, the portion of our sitemap that was actually indexed plummeted from the full 800K pages down to 50K pages. In practice, sitemaps are just a guide: search engines act on them only if they have already found a healthy number of links to the pages in the site (both from external sources and internally) in the first place.

This is apparently reasonably well-known in the SEO community, but I was caught up in the more idealistic view of the world where *of course* search engines want to index each and every one of our individual pages with all of their finely crafted schema.org linked data! Wrong.

Sites that experience problems with database load due to robots crawling them can stop that with a simple line in robots.txt:

Disallow: /eg/opac/results
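
In a complete robots.txt that rule sits inside a User-agent record; a minimal version applying it to all crawlers would be:

User-agent: *
Disallow: /eg/opac/results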

Of course, that will only stop well-behaved robots. Those that ignore robots.txt will likely also ignore rel="nofollow" attributes.

As a corollary to this bug, I suppose I could add samples of robots.txt and documentation on basic SEO settings (telling search engines to ignore various GET parameters, etc.).
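
If that documentation happens, one likely ingredient would be a canonical link element, so that crawlers fold URLs differing only in GET parameters into a single canonical URL; the hostname and record path below are illustrative only:

<link rel="canonical" href="https://example.org/eg/opac/record/123">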

Revision history for this message
Ben Shum (bshum) wrote :

Dan, in the interests of getting this out there, I pushed it to master for 2.10-beta, but will not be backporting it to the 2.9 series for the time being. If the release maintainers decide to add it, they can; otherwise, I'm fine with making this the practice moving forward.

no longer affects: evergreen/2.9
no longer affects: evergreen/master
Changed in evergreen:
status: Fix Committed → Fix Released