upload or derive crawl logs

Bug #661524 reported by siznax
This bug affects 1 person
Affects            Status     Importance  Assigned to  Milestone
Archive Widecrawl  Confirmed  Medium      Unassigned

Bug Description

as discussed...

1) upload crawl logs with draintasker - have draintasker look for timestamped crawl log(s) on each pass, then create a manifest of warc series (item identifiers) that correspond to that log, then upload the log and manifest into a new item.

2) derive crawl log from warcs - on warc series derive, write an equivalent crawl log from warc content.
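
a rough sketch of option (1), to make the discussion concrete. this is not draintasker code: the `crawl.log.<14-digit timestamp>` naming scheme, the manifest layout and both function names are assumptions made up for illustration, and the upload step is left out entirely.

```python
import re
from pathlib import Path

# ASSUMPTION: timestamped logs are named crawl.log.<14-digit timestamp>.
# The real naming scheme may differ; adjust the pattern to match.
LOG_PATTERN = re.compile(r"^crawl\.log\.(\d{14})$")


def find_timestamped_logs(log_dir):
    """Return (timestamp, path) pairs for timestamped crawl logs, oldest first."""
    found = []
    for path in Path(log_dir).iterdir():
        match = LOG_PATTERN.match(path.name)
        if match and path.is_file():
            found.append((match.group(1), path))
    return sorted(found)


def build_manifest(log_name, item_identifiers):
    """Render a plain-text manifest (hypothetical layout):
    a comment line naming the log, then one item identifier per line."""
    lines = ["# crawl log: " + log_name]
    lines.extend(item_identifiers)
    return "\n".join(lines) + "\n"
```

on each pass, draintasker would call something like `find_timestamped_logs()` and, for each log found, write the `build_manifest()` output next to it before uploading both into a new item. how to work out which warc series belong to a given log is the open question and is not answered here.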

please discuss in comments.

Tags: drain logs warc
siznax (siznax) wrote :

i guess the first step is to determine if warcs currently contain enough information to (sufficiently) reproduce a crawl log.

if so, then we can just "rederive" already uploaded warc series.

if not, then we'll need to upload existing crawl logs as described in (1) above.

Changed in archivewidecrawl:
status: New → Confirmed
importance: Undecided → Medium
siznax (siznax) wrote :

looks like warcs do contain "hopsFromSeed" in the metadata records, e.g. (from newscrawl):

WARC-Type: metadata
WARC-Target-URI: http://www.foxnews.com/static/managed/img/Opinion/2Ken-ArmstrongOR.snl.jpg
WARC-Date: 2010-10-18T06:23:33Z
WARC-Concurrent-To: <urn:uuid:870be57a-37b0-4ff1-9aed-bf66258bc335>
WARC-Record-ID: <urn:uuid:7b879681-d919-493c-8274-c3164b322027>
Content-Type: application/warc-fields
Content-Length: 214

via: http://www.foxnews.com/slideshow/opinion/2010/10/12/photo-op-snl-saturday-night-live-anniversary/?test=faces
hopsFromSeed: LX
fetchTimeMs: 13
outlink: http://www.foxnews.com/favicon.ico I =INFERRED_MISC
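
to check how much of a crawl log line can be rebuilt, the body of a metadata record like the one above can be read as plain `name: value` lines. a minimal sketch, assuming the payload has already been extracted from the warc as text; `parse_warc_fields` is a hypothetical helper, not part of any existing tool, and it ignores folded (multi-line) values.

```python
def parse_warc_fields(payload):
    """Parse an application/warc-fields body into a dict of lists.

    Repeated names (e.g. several 'outlink' lines) accumulate in order.
    """
    fields = {}
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first ': ' so URLs in the value stay intact.
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # not a 'name: value' line
        fields.setdefault(name, []).append(value.strip())
    return fields


sample = (
    "via: http://example.com/page\n"
    "hopsFromSeed: LX\n"
    "fetchTimeMs: 13\n"
    "outlink: http://example.com/favicon.ico I =INFERRED_MISC\n"
)
fields = parse_warc_fields(sample)
print(fields["hopsFromSeed"], fields["via"], fields["fetchTimeMs"])
```

that would cover the discovery path (`hopsFromSeed`), the referrer (`via`) and the fetch duration. the remaining crawl log columns (status code, size, mime type, digest) would have to come from the matching response record, which this sketch does not touch.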
