Evergreen

Evergreen needs automated sitemap generation

Bug #1330784 reported by Dan Scott on 2014-06-17

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	Wishlist	Unassigned	Evergreen 2.7.0-alpha1

Bug Description

* Evergreen master

While the discovery of the contents of Evergreen catalogues is handled reasonably well within Evergreen's own search mechanism, that first requires users to know that the Evergreen catalogue exists, and where they can access it. When most people search for something, however, they type words into their browser's address bar or (maybe) go to a search engine like Google or Yahoo or Bing--and if the content of the library catalogue is not indexed by the search engine, then our resources are effectively invisible to most of those users.

One means of making the contents of our library catalogues available to search engines is to publish sitemaps compliant with the sitemaps.org specification. Not only do these sitemaps enable search engines to crawl catalogues efficiently (for example, only crawling pages that are new or changed since the last time the search engine visited), but they also enable third-party applications to more efficiently track changes in a standard fashion (rather than having to rely on a custom rawmeat feed or the like).

Several years ago I summarized the core logic for building sitemaps in Evergreen at http://goo.gl/8M8p23 ; my intention is to build a more flexible script around that basic logic so that it can be run as a cron job, and we can document it as a standard Evergreen feature.

Dan Scott (denials) wrote on 2014-06-19:

#1

I have created a quick script at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/sitemap_builder that handles the basic requirements:

1. Follows sitemaps.org standards (such as a maximum of 50,000 URLs per file, with the creation of a sitemap index file);
2. Reflects the edit date of each bib record so that crawlers can skip those bibs that have not changed since their last crawl;
3. Enables users to create different site maps reflecting different sections of the library hierarchy (for example, one sitemap for SYS1 and a different sitemap for SYS2) using command line options.
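
As a rough illustration of the first two requirements above (this is not the script in the branch, just a minimal Python sketch), the function below splits a list of records into sitemap files of at most 50,000 URLs, writes each record's edit date as `<lastmod>`, and builds a sitemap index that points at the generated files. The function name `build_sitemaps`, the file names, and the `base_url` parameter are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file URL limit from the sitemaps.org protocol


def build_sitemaps(records, base_url, max_urls=MAX_URLS):
    """Return {filename: xml_bytes} for the sitemap files plus an index.

    records:  list of (url, edit_date) pairs, edit_date as 'YYYY-MM-DD'.
    base_url: hypothetical URL prefix under which the files are served.
    """
    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n, start in enumerate(range(0, len(records), max_urls), start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc, edit_date in records[start:start + max_urls]:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            # lastmod lets crawlers skip records unchanged since their last visit
            ET.SubElement(url, "lastmod").text = edit_date
        name = f"sitemap{n}.xml"
        files[name] = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        # Each generated file gets one entry in the sitemap index
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{name}"
    files["sitemapindex.xml"] = ET.tostring(
        index, encoding="utf-8", xml_declaration=True
    )
    return files
```

In a cron context, the returned byte strings would simply be written to files in the directory served by the web server.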

The script is intended to run as a cron job from the web root; for now, it writes the generated files to the current working directory.

Jeff Godin (jgodin) wrote on 2014-06-19:

#2

Tested Dan's current branch on a test system across ~247k OPAC-visible bibs for a single library system. Script completes in under 30 seconds with a cold cache, generating 26 MB of valid XML across one index and five sitemaps of 50,000 or fewer URLs. Script also executes without error when omitting the --library-shortname argument, generating slightly different (as expected) output.

dbs++

Dan Scott (denials) wrote on 2014-06-19:

#3

Given the size of the sitemaps, two useful future features supported by the sitemaps.org protocol would be:
* gzipping the generated sitemap files (and linking accordingly from the sitemap index)
* using the <lastmod> element at the sitemap file level (that is, on each entry in the sitemap index)

These are a little trickier (more dependencies for the gzipping, and adhering to the spirit of the lastmod for sitemap files), so I'm not going to tackle them right now.
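
For reference, a minimal Python sketch of what those two features could look like. It is not part of the branch, and the helper names `gzip_sitemap` and `add_index_entry` are hypothetical. Using the newest record edit date in a file as that file's `<lastmod>` is only one possible reading of the spirit of the element, and is an assumption here.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def gzip_sitemap(xml_bytes):
    """Compress one generated sitemap; the result would be saved as *.xml.gz."""
    return gzip.compress(xml_bytes)


def add_index_entry(index, loc, edit_dates):
    """Append a <sitemap> entry whose <lastmod> is the newest record edit date.

    edit_dates are 'YYYY-MM-DD' strings, so max() on the strings picks the
    latest date. Choosing the newest date is an assumption of this sketch.
    """
    entry = ET.SubElement(index, "sitemap")
    ET.SubElement(entry, "loc").text = loc  # should point at the .xml.gz file
    ET.SubElement(entry, "lastmod").text = max(edit_dates)
    return entry
```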

Dan Scott (denials) wrote on 2014-07-07:

#4

Force-pushed an update that includes documentation (in the form of a full release notes entry) and removes a few unused variables and output ("print scalar(@bibs);") that would be annoying in a cron context.

tags:	added: pullrequest

Ben Shum (bshum) wrote on 2014-07-10:

#5

Sweet! Thanks Dan, tested and it seems to work fine. I'll run it for a few days to learn more.

In the meantime though, pushed to master for inclusion in 2.7 series.

Changed in evergreen:
milestone:	2.next → 2.7.0-alpha
status:	New → Fix Committed

Evergreen Bug Maintenance (bugmaster) on 2014-11-06

Changed in evergreen:
status:	Fix Committed → Fix Released
