lxml

html parser raises IndexError on iterlinks()

Bug #712107 reported by Sardar on 2011-02-02

This bug affects 1 person

Affects    Status     Importance    Assigned to    Milestone
lxml       Invalid    Undecided     Unassigned

Bug Description

lxml/html/__init__.py

Sometimes HtmlMixin.iter() returns an element whose tag is the empty string (el.tag == '').
HtmlMixin.iterlinks() (line 322) calls _nons(el.tag), which accesses tag[0] directly and therefore raises IndexError on an empty tag.
I know the HTML is invalid, but the parser is expected to be robust.

Fix:

    if el.tag == '':
        continue

This skips the broken tag, which is not a link anyway.

Note: this is purely a problem in lxml's Python code, not in the underlying libxml2/libxslt libraries.
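
To illustrate the failure mode described above without depending on a particular lxml version, here is a minimal standalone sketch. The names nons_sketch and link_tags are hypothetical stand-ins, not lxml's actual code; the sketch only assumes, as the report states, that the helper indexes tag[0] before checking for a namespace prefix.

```python
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"


def nons_sketch(tag):
    """Hypothetical stand-in for the _nons helper named in the report.

    Strips an XHTML namespace prefix such as '{http://...}a' down to 'a'.
    """
    if isinstance(tag, str):
        # tag[0] raises IndexError when tag is the empty string.
        if tag[0] == "{" and tag[1:len(XHTML_NAMESPACE) + 1] == XHTML_NAMESPACE:
            return tag.split("}")[-1]
    return tag


def link_tags(tags):
    """Yield namespace-free tag names, applying the guard proposed in the report."""
    for tag in tags:
        if tag == "":
            continue  # skip the broken tag; it cannot be a link
        yield nons_sketch(tag)


# Without the guard, an empty tag triggers the IndexError.
try:
    nons_sketch("")
    raised = False
except IndexError:
    raised = True

# With the guard, the empty tag is skipped silently.
cleaned = list(link_tags(["a", "", "{http://www.w3.org/1999/xhtml}img"]))
```

Another way to avoid the exception in a helper like this would be to test tag.startswith("{") instead of tag[0] == "{", since str.startswith returns False for an empty string rather than raising.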

scoder (scoder) wrote on 2011-02-06:

Would you have an example of a page that shows this behaviour? I wonder how it is possible that the parser returns an empty tag name in the first place.

scoder (scoder) on 2012-06-02

Changed in lxml:
status:	New → Triaged

scoder (scoder) wrote on 2013-04-27:

Cannot reproduce.

Changed in lxml:
status:	Triaged → Incomplete

scoder (scoder) wrote on 2019-08-11:

Closing as outdated and missing a way to reproduce the problem.

Changed in lxml:
status:	Incomplete → Invalid
