Evergreen needs automated sitemap generation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | Fix Released | Wishlist | Unassigned |
Bug Description
* Evergreen master
While the discovery of the contents of Evergreen catalogues is handled reasonably well within Evergreen's own search mechanism, that first requires users to know that the Evergreen catalogue exists and where they can access it. When most people search for something, however, they type words into their browser's address bar or (maybe) go to a search engine like Google, Yahoo, or Bing; if the content of the library catalogue is not indexed by those search engines, then our resources are effectively invisible to most of those users.
One means of making the contents of our library catalogues available to search engines is to publish sitemaps compliant with the sitemaps.org specification. Not only do these sitemaps enable search engines to crawl catalogues efficiently (for example, crawling only those pages that are new or have changed since the search engine's last visit), but they also enable third-party applications to track changes more efficiently in a standard fashion (rather than having to rely on a custom rawmeat feed or the like).
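For reference, a sitemaps.org sitemap is just a small XML file listing URLs and their last-modification dates; the record URL and date below are invented purely for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/eg/opac/record/123</loc>
    <lastmod>2013-06-01</lastmod>
  </url>
</urlset>
```

The `<lastmod>` element is what lets a crawler skip records that have not changed since its previous visit.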
Several years ago I summarized the core logic for building sitemaps in Evergreen at http://
Changed in evergreen:
status: Fix Committed → Fix Released
I have created a quick script at http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dbs/sitemap_builder that handles the basic requirements:
1. Follows sitemaps.org standards (such as 50,000 URLs per file, with the creation of a sitemap index file);
2. Reflects the edit date of each bib record so that crawlers can skip those bibs that have not changed since their last crawl;
3. Enables users to create different sitemaps reflecting different sections of the library hierarchy (for example, one sitemap for SYS1 and a different sitemap for SYS2) using command-line options. (A rough sketch of this logic follows the list.)
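To make the requirements above concrete, here is a minimal, illustrative Python sketch of the same logic: splitting records into files of at most 50,000 URLs, emitting a lastmod value from each record's edit date, and writing a sitemap index. The record data, base URL, and file names are assumptions; this is not the actual script from the branch above, which should be consulted for the real implementation.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: write sitemaps.org-compliant sitemap files
plus a sitemap index. Inputs and URL patterns below are assumptions."""

import math
from datetime import date

# Hypothetical input: (record_id, last_edit_date) pairs that would normally
# come from a query against the bibliographic records of one org unit.
records = [(1, date(2013, 1, 15)), (2, date(2013, 2, 3))]

BASE_URL = "http://example.org/eg/opac/record"  # assumed OPAC record URL pattern
MAX_URLS = 50000                                # sitemaps.org per-file limit


def write_sitemaps(records, prefix="sitemap"):
    """Split records into files of at most MAX_URLS entries each and
    return the list of generated sitemap file names."""
    files = []
    for chunk_no in range(math.ceil(len(records) / MAX_URLS)):
        chunk = records[chunk_no * MAX_URLS:(chunk_no + 1) * MAX_URLS]
        name = f"{prefix}{chunk_no + 1}.xml"
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for rec_id, edited in chunk:
                f.write(f"  <url>\n    <loc>{BASE_URL}/{rec_id}</loc>\n"
                        f"    <lastmod>{edited.isoformat()}</lastmod>\n  </url>\n")
            f.write('</urlset>\n')
        files.append(name)
    return files


def write_index(files, site="http://example.org"):
    """Write the sitemap index file that points at each generated sitemap."""
    with open("sitemapindex.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in files:
            f.write(f"  <sitemap>\n    <loc>{site}/{name}</loc>\n"
                    f"    <lastmod>{date.today().isoformat()}</lastmod>\n  </sitemap>\n")
        f.write('</sitemapindex>\n')


if __name__ == "__main__":
    write_index(write_sitemaps(records))
```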
This is expected to be a cron job that runs in the web root; the created files are currently dumped in the current directory.
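As an example of the intended deployment, a cron entry along these lines could regenerate the sitemaps nightly; the web-root path and script name here are assumptions, not the documented invocation:

```
# Hypothetical crontab entry: change to the web root, then rebuild the sitemaps
# so the generated files land where the web server can serve them.
0 2 * * *  cd /openils/var/web && /path/to/sitemap_builder
```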