Evergreen needs automated sitemap generation

Bug #1330784 reported by Dan Scott on 2014-06-17

Bug Description

* Evergreen master

While the discovery of the contents of Evergreen catalogues is handled reasonably well within Evergreen's own search mechanism, that first requires users to know that the Evergreen catalogue exists, and where they can access it. When most people search for something, however, they type words into their browser's address bar or (maybe) go to a search engine such as Google, Yahoo, or Bing. If the content of the library catalogue is not indexed by the search engine, then our resources are effectively invisible to most of those users.

One means of making the contents of our library catalogues available to search engines is to publish sitemaps compliant with the sitemaps.org specification. Not only do these sitemaps enable search engines to crawl catalogues efficiently (for example, only crawling pages that are new or changed since the last time the search engine visited), but they also enable third-party applications to track changes more efficiently in a standard fashion (rather than having to rely on a custom rawmeat feed or the like).
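For reference, a minimal sitemaps.org-compliant file looks like the following (the host and record URL are invented for illustration; Evergreen's actual record URLs depend on the local TPAC configuration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/eg/opac/record/123</loc>
    <lastmod>2014-06-17</lastmod>
  </url>
</urlset>
```

The optional `<lastmod>` element is what lets a crawler skip records that have not changed since its last visit.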

Several years ago I summarized the core logic for building sitemaps in Evergreen at http://goo.gl/8M8p23; my intention is to build a more flexible script around that basic logic so that it can be run as a cron job, and we can document it as a standard Evergreen feature.

Dan Scott (denials) wrote :

I have created a quick script at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/sitemap_builder that handles the basic requirements:

1. Follows sitemaps.org standards (such as 50,000 URLs per file, with the creation of a sitemap index file);
2. Reflects the edit date of each bib record so that crawlers can skip those bibs that have not changed since their last crawl;
3. Enables users to create different site maps reflecting different sections of the library hierarchy (for example, one sitemap for SYS1 and a different sitemap for SYS2) using command line options.

The script is expected to run as a cron job from the web root; the generated files are currently written to the current working directory.
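A minimal sketch of the chunking logic described above (this is not Dan's actual script; the base URL, record URL pattern, and record data are illustrative assumptions):

```python
# Sketch: split (bib_id, edit_date) records into sitemap files of at
# most 50,000 URLs each, per sitemaps.org, and emit an index file
# linking every generated sitemap. Stdlib only; paths are relative to
# the current working directory, as in the branch under discussion.
from xml.sax.saxutils import escape

MAX_URLS = 50000  # per-file limit from the sitemaps.org protocol
BASE = "https://example.org"  # hypothetical catalogue base URL

def build_sitemaps(records):
    """records: iterable of (bib_id, edit_date) tuples; returns filenames."""
    sitemaps = []
    chunk = []

    def flush():
        # Write the pending chunk as sitemapN.xml, if non-empty.
        if not chunk:
            return
        name = "sitemap%d.xml" % (len(sitemaps) + 1)
        body = ['<?xml version="1.0" encoding="UTF-8"?>',
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for bib_id, edit_date in chunk:
            body.append("  <url><loc>%s/eg/opac/record/%d</loc>"
                        "<lastmod>%s</lastmod></url>"
                        % (BASE, bib_id, escape(edit_date)))
        body.append("</urlset>")
        with open(name, "w") as f:
            f.write("\n".join(body))
        sitemaps.append(name)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) == MAX_URLS:
            flush()
    flush()

    # Index file pointing at each generated sitemap.
    with open("sitemap_index.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in sitemaps:
            f.write("  <sitemap><loc>%s/%s</loc></sitemap>\n" % (BASE, name))
        f.write("</sitemapindex>\n")
    return sitemaps
```

Splitting different library systems into separate sitemaps (requirement 3) would simply mean filtering `records` by org unit before calling a function like this.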

Jeff Godin (jgodin) wrote :

Tested Dan's current branch on a test system across ~247k opac-visible bibs for a single library system. Script completes in under 30 seconds with a cold cache, generating 26M of valid XML across one index and five sitemaps of 50,000 or fewer URLs. Script also executes without error when omitting the --library-shortname argument, generating slightly different (as expected) output.


Dan Scott (denials) wrote :

Given the size of the sitemaps, two useful future features per sitemaps.org would include:
* gzipping the generated sitemap files (and linking accordingly from the sitemap index)
* using the <lastmod> property at the sitemap file level

These are a little trickier (more dependencies for the gzipping, and adhering to the spirit of the lastmod for sitemap files), so I'm not going to tackle them right now.
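For what it's worth, the gzipping half needs no extra dependencies on a system with a stock Python: a hypothetical helper (not part of the branch) could compress each generated file with the stdlib, after which the index would link the `.gz` names instead.

```python
# Hypothetical helper: gzip a generated sitemap file, as permitted by
# the sitemaps.org spec for large sitemaps. Stdlib only.
import gzip
import shutil

def gzip_sitemap(path):
    """Compress path to path + '.gz' and return the new filename."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

The trickier part Dan mentions remains the sitemap-index `<lastmod>`, which should honestly reflect when each sitemap file's contents last changed rather than when the cron job last ran.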

Dan Scott (denials) wrote :

Force-pushed an update that includes documentation (in the form of a full release notes entry) and removes a few unused variables and output ("print scalar(@bibs);") that would be annoying in a cron context.

tags: added: pullrequest
Ben Shum (bshum) wrote :

Sweet! Thanks Dan, tested and seems to be alright. I'll run it for a few days to know more details.

In the meantime though, pushed to master for inclusion in 2.7 series.

Changed in evergreen:
milestone: 2.next → 2.7.0-alpha
status: New → Fix Committed
Changed in evergreen:
status: Fix Committed → Fix Released