Alien Loves Predator Scraper Fix

Bug #492143 reported by Ged Walsh
Affects: Dosage
Status: Fix Committed
Importance: Medium
Assigned to: Tristan Seligmann
Milestone: 1.7.0

Bug Description

class AlienLovesPredator(BasicScraper):
    imageUrl = 'http://alienlovespredator.com/%s'
    imageSearch = compile(r'<img src="(.+?)"[^>]+>(<center>\n|\n|</center>\n)<div style="height: 2px;">&nbsp;</div>', MULTILINE)
    prevSearch = compile(r'<a href="(.+?)"><img src="/images/nav_previous.jpg"')
    help = 'Index format: nnn'
    starter = indirectStarter('http://alienlovespredator.com/index.php', compile(r'<a href="(.+?)"><img src="/images/nav_previous.jpg"'))

    def namer(cls, imageUrl, pageUrl):
        # Name the strip after the last four path components of the page URL,
        # since the image file names themselves are random.
        vol = pageUrl.split('/')[-5]
        num = pageUrl.split('/')[-4]
        ccc = pageUrl.split('/')[-3]
        ddd = pageUrl.split('/')[-2]
        return '%s-%s-%s-%s' % (vol, num, ccc, ddd)

The site now uses random image names, so this scraper deliberately skips the latest strip so that every other strip can be named from its page URL. The skipped strip will be picked up on the next update.
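
To illustrate the naming scheme, here is a small sketch of what the namer computes. The page URL is made up for illustration (the site's URL layout is assumed to end in four path components followed by a trailing slash), and `name_from_page_url` is a hypothetical standalone helper, not part of Dosage.

```python
def name_from_page_url(page_url):
    """Join the last four path components of page_url with hyphens.

    Assumes the URL ends with a trailing slash, so the final element of
    the split is an empty string and is excluded by the [-5:-1] slice.
    """
    parts = page_url.split('/')
    return '-'.join(parts[-5:-1])


# Hypothetical page URL, for illustration only.
example = 'http://alienlovespredator.com/2009/11/30/some-title/'
print(name_from_page_url(example))  # prints 2009-11-30-some-title
```

This also shows why the index page is awkward as a starting point: a URL such as `http://alienlovespredator.com/index.php` does not have those four trailing components, so it cannot produce a meaningful name.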

Tristan Seligmann (mithrandi) wrote :

By using bounceStarter instead of indirectStarter, you can still fetch the latest strip. Essentially, bounceStarter follows the "previous" link (like you're doing with indirectStarter), but then follows the "next" link in order to get back to the very latest comic.
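
The bounce described above can be sketched roughly as follows. This is not Dosage's actual implementation: the `bounce` function, the `PAGES` dictionary standing in for HTTP fetches, and the `/strip/N` URLs are all hypothetical, and only the two navigation regexes are taken from the scraper in this report.

```python
import re

# Toy in-memory "site" mapping URL -> HTML. URLs and markup are hypothetical.
PAGES = {
    '/index.php': '<a href="/strip/2"><img src="/images/nav_previous.jpg">\n',
    '/strip/2': '<a href="/strip/1"><img src="/images/nav_previous.jpg">\n'
                '<a href="/strip/3"><img src="/images/nav_next.jpg">\n',
    '/strip/3': '<a href="/strip/2"><img src="/images/nav_previous.jpg">\n',
}

PREV = re.compile(r'<a href="(.+?)"><img src="/images/nav_previous.jpg"')
NEXT = re.compile(r'<a href="(.+?)"><img src="/images/nav_next.jpg"')


def bounce(index_url, prev_search, next_search, pages=PAGES):
    """Return the page URL of the latest strip.

    Step 1: follow the "previous" link found on the index page.
    Step 2: follow the "next" link found on that previous page, which
            leads back to the latest strip under its own page URL.
    """
    prev_url = prev_search.search(pages[index_url]).group(1)
    latest_url = next_search.search(pages[prev_url]).group(1)
    return latest_url


print(bounce('/index.php', PREV, NEXT))  # prints /strip/3
```

The point of the round trip is that the scraper starts from the latest strip's own page URL rather than from `index.php`, so the namer can derive a name for it like for any other strip.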

Ged Walsh (bleedingheart) wrote :

Thanks, here is the fix using bounceStarter:

class AlienLovesPredator(BasicScraper):
    imageUrl = 'http://alienlovespredator.com/%s'
    imageSearch = compile(r'<img src="(.+?)"[^>]+>(<center>\n|\n|</center>\n)<div style="height: 2px;">&nbsp;</div>', MULTILINE)
    prevSearch = compile(r'<a href="(.+?)"><img src="/images/nav_previous.jpg"')
    help = 'Index format: nnn'
    starter = bounceStarter('http://alienlovespredator.com/index.php', compile(r'<a href="(.+?)"><img src="/images/nav_next.jpg"'))

    def namer(cls, imageUrl, pageUrl):
        # Name the strip after the last four path components of the page URL,
        # since the image file names themselves are random.
        vol = pageUrl.split('/')[-5]
        num = pageUrl.split('/')[-4]
        ccc = pageUrl.split('/')[-3]
        ddd = pageUrl.split('/')[-2]
        return '%s-%s-%s-%s' % (vol, num, ccc, ddd)

Changed in dosage:
assignee: nobody → Tristan Seligmann (mithrandi)
importance: Undecided → Medium
milestone: none → 1.7.0
status: New → In Progress
Changed in dosage:
status: In Progress → Fix Committed