html parser raises IndexError on iterlinks()

Bug #712107 reported by Sardar on 2011-02-02
This bug affects 1 person
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

lxml/html/__init__.py

Sometimes HtmlMixin.iter() returns an element with el.tag == ''.
HtmlMixin.iterlinks() (line 322) calls _nons(el.tag), which accesses tag[0] directly and therefore raises IndexError on the empty string.
I know the HTML is invalid, but the parser is expected to be robust.
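
To illustrate the failure mode described above, here is a minimal sketch that does not import lxml. The function nons_sketch is a simplified stand-in written for this example, assuming the helper indexes tag[0] without checking for an empty string, as the report states. It is not the library's actual code.

```python
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"


def nons_sketch(tag):
    """Hypothetical stand-in: strip the XHTML namespace from a tag name."""
    if isinstance(tag, str):
        # tag[0] assumes a non-empty string; '' raises IndexError here.
        if tag[0] == "{" and tag[1:len(XHTML_NAMESPACE) + 1] == XHTML_NAMESPACE:
            return tag.split("}")[-1]
    return tag


def raises_index_error(tag):
    """Return True if nons_sketch(tag) raises IndexError."""
    try:
        nons_sketch(tag)
    except IndexError:
        return True
    return False
```

Calling raises_index_error('') exercises the empty-tag case, while an ordinary tag name such as 'a' passes through unchanged.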

Fix:

if el.tag == '':
    continue

This skips the broken tag, which cannot be a link anyway.
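
As a rough sketch of where such a guard would sit, the loop below skips elements with an empty tag before doing any further work. FakeElement and iter_link_candidates are hypothetical names invented for this illustration; the real iterlinks() inspects many more tags and attributes.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed element; real lxml elements carry more state.
FakeElement = namedtuple("FakeElement", ["tag", "href"])


def iter_link_candidates(elements):
    """Yield (tag, href) pairs for anchors, skipping elements with an empty tag."""
    for el in elements:
        if el.tag == "":
            # Proposed guard: a broken, empty tag name cannot be a link.
            continue
        if el.tag == "a" and el.href is not None:
            yield el.tag, el.href


elements = [
    FakeElement("", None),
    FakeElement("a", "http://example.com/"),
    FakeElement("p", None),
]
links = list(iter_link_candidates(elements))
```

An alternative with the same effect would be to make the helper itself tolerate empty strings, for example by testing tag.startswith('{') instead of tag[0] == '{'.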

Note: this is purely a Python code problem in lxml, not a problem in the underlying libxml2/libxslt libraries.

scoder (scoder) wrote :

Would you have an example of a page that shows this behaviour? I wonder how it is possible that the parser returns an empty tag name in the first place.

scoder (scoder) on 2012-06-02
Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Cannot reproduce.

Changed in lxml:
status: Triaged → Incomplete