HTML parser raises IndexError on iterlinks()

Bug #712107 reported by Sardar
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

lxml/html/__init__.py

Sometimes HtmlMixin.iter() returns an element with el.tag == ''.
HtmlMixin.iterlinks() (line 322) calls _nons(el.tag), which accesses tag[0] directly and therefore raises IndexError on the empty string.
I know the HTML is invalid, but the parser is expected to be robust.

Fix:

if el.tag == '':
    continue

This skips the broken tag, which is not a link anyway.

Note: this is purely a Python code problem, not a problem in the libxml2/libxslt libraries that lxml wraps.
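
To illustrate the failure mode, here is a minimal, self-contained sketch that does not import lxml. The functions nons_unguarded and nons_guarded, the FakeElement class, and iter_named are hypothetical names made up for this example; nons_unguarded only imitates the tag[0] access pattern described above and is not a copy of lxml's code. It assumes an element with an empty tag can reach the loop at all, which this report never demonstrated.

```python
XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"


def nons_unguarded(tag):
    """Strip the XHTML namespace; indexes tag[0], so '' raises IndexError."""
    if isinstance(tag, str):
        if tag[0] == "{" and tag[1:len(XHTML_NAMESPACE) + 1] == XHTML_NAMESPACE:
            return tag.split("}")[-1]
    return tag


def nons_guarded(tag):
    """Same idea, but str.startswith() is safe on an empty string."""
    if isinstance(tag, str) and tag.startswith("{" + XHTML_NAMESPACE + "}"):
        return tag.split("}")[-1]
    return tag


class FakeElement:
    """Hypothetical stand-in for a parsed element; only carries a tag."""

    def __init__(self, tag):
        self.tag = tag


def iter_named(elements):
    """Yield namespace-free tag names, skipping elements with an empty tag."""
    for el in elements:
        if el.tag == "":
            continue  # the guard proposed in this report
        yield nons_guarded(el.tag)
```

Either guard alone would avoid the exception: the continue keeps empty tags out of the link scan, while the startswith() variant makes the helper itself tolerate an empty string.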

scoder (scoder) wrote :

Would you have an example of a page that shows this behaviour? I wonder how it is possible that the parser returns an empty tag name in the first place.

scoder (scoder)
Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Cannot reproduce.

Changed in lxml:
status: Triaged → Incomplete
scoder (scoder) wrote :

Closing as outdated and missing a way to reproduce the problem.

Changed in lxml:
status: Incomplete → Invalid