warc series + warc file names too long

Bug #689994 reported by siznax
Affects: Archive Widecrawl
Status: Fix Committed
Importance: Medium
Assigned to: siznax

Bug Description

in the interest of CDX generation in the deriver, shorten the warc series name (item identifier) to "job-segment-date", and the warc file name template to "${prefix}-${serial}"

so, instead of:

WIDE-20101213234612854-00395-00413-ia360905/WIDE-20101213234612854-00395-29264~ia360905.us.archive.org~9443.warc.gz

more like:

WIDE-5-20101213/WIDE-00395.warc.gz

Tags: drain
siznax (siznax) wrote :

will probably need to add time to the date in order to avoid filename collisions when restarting a crawler, e.g.

WIDE-5-20101213120000/WIDE-00395.warc.gz

tags: added: drain
Changed in archivewidecrawl:
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → siznax (siznax)
siznax (siznax) wrote :

oops - the last comment only addresses _item_ name collisions within 24 hours, which is certainly required, but the warc _filename_ will also need a timestamp to protect against collisions after a restart (a restart resets the serial number). so we'll need something more like:

WIDE-5-20101213120000/WIDE-20101213120000-00395.warc.gz

or

{item_name}/{warc_filename}

item_name:

  {job}-{jobnode}-{timestamp}

warc_filename:

  {job}-{timestamp}-{serialno}.warc.gz

hopefully that's still helpful.
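
a minimal Python sketch of the two templates above, for illustration only. the function names, the 5-digit zero padding of the serial number, and the 14-digit timestamp format are assumptions read off the example names in this report, not the actual crawler or drain code:

```python
from datetime import datetime

# Assumption: 14-digit timestamp, as in the example 20101213120000.
TS_FORMAT = "%Y%m%d%H%M%S"


def item_name(job: str, jobnode: str, started: datetime) -> str:
    """Build {job}-{jobnode}-{timestamp}."""
    return f"{job}-{jobnode}-{started.strftime(TS_FORMAT)}"


def warc_filename(job: str, started: datetime, serialno: int) -> str:
    """Build {job}-{timestamp}-{serialno}.warc.gz.

    Assumption: the serial number is zero-padded to 5 digits, as in 00395.
    """
    return f"{job}-{started.strftime(TS_FORMAT)}-{serialno:05d}.warc.gz"


def warc_path(job: str, jobnode: str, started: datetime, serialno: int) -> str:
    """Build {item_name}/{warc_filename}."""
    return f"{item_name(job, jobnode, started)}/{warc_filename(job, started, serialno)}"


if __name__ == "__main__":
    first_start = datetime(2010, 12, 13, 12, 0, 0)
    restart = datetime(2010, 12, 13, 18, 30, 0)
    # Same serial number after a restart, but the timestamp keeps the names apart.
    print(warc_path("WIDE", "5", first_start, 395))
    print(warc_path("WIDE", "5", restart, 395))
```

because the timestamp is the crawler start time, two runs that both emit serial 00395 still produce distinct item and file names.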

siznax (siznax) wrote :

it turns out that the "jobnode" of a mapped hashcrawlmapped crawler is not easily accessible, so we'll need to use the short hostname instead, i.e.

WIDE-ia360913-20101213120000/WIDE-20101213120000-00395.warc.gz

if we allow the crawler to write files with its safe file-naming convention, but rename the files on upload, then we could conceivably drop the job prefix and timestamp from warc_filename for something like:

WIDE-ia360913-20101213120000/00395.warc.gz

however, i'm not certain of the risk in creating item members with non-unique filenames.
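
a rough Python sketch of the rename-on-upload idea, including a check for the non-unique filename risk. the regular expression is an assumption inferred from the single long filename quoted in the bug description (the meaning of the numeric field before the first "~" is not stated in this report, so it is just called "extra" here), and all function names are hypothetical:

```python
import re
from collections import Counter

# Assumed layout, inferred from the one long example name in this report:
#   {prefix}-{timestamp}-{serial}-{extra}~{hostname}~{port}.warc.gz
LONG_NAME = re.compile(
    r"^(?P<prefix>[^-]+)-(?P<timestamp>\d+)-(?P<serial>\d+)-(?P<extra>\d+)"
    r"~(?P<host>[^~]+)~(?P<port>\d+)\.warc\.gz$"
)


def short_hostname(fqdn: str) -> str:
    """Return the first label of a hostname, e.g. ia360905."""
    return fqdn.split(".")[0]


def compact_name(long_name: str) -> str:
    """Map the crawler's long filename to {serial}.warc.gz."""
    match = LONG_NAME.match(long_name)
    if match is None:
        raise ValueError(f"unrecognized warc filename: {long_name}")
    return f"{match.group('serial')}.warc.gz"


def find_collisions(names):
    """Return the sorted names that occur more than once in one item."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    long_name = (
        "WIDE-20101213234612854-00395-29264~ia360905.us.archive.org~9443.warc.gz"
    )
    print(compact_name(long_name))
    print(short_hostname(LONG_NAME.match(long_name).group("host")))
```

running find_collisions over the compact names planned for an item before upload would flag the case where a restart has reset the serial number and two files map to the same short name.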

siznax (siznax) wrote :

fixed with r61 | steve | Wed, 19 Jan 2011
support shorter names (compact_names)

Changed in archivewidecrawl:
status: Confirmed → Fix Committed