html parser raises IndexError on iterlinks()
Bug #712107 reported by Sardar
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
lxml/html/
Sometimes HtmlMixin.iter() returns an element with el.tag == ''
The HtmlMixin.
I know the HTML is invalid, but the parser is expected to be robust.
Fix:
    if el.tag == '':
        continue
This skips the broken tag, which is not a link anyway.
Note: this is purely a problem in lxml's Python code, not in the underlying libxml2/libxslt libraries.
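To illustrate the proposed guard without depending on lxml itself, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `FakeElement` is a stand-in for a parsed element, and `strip_namespace` and `iter_hrefs` are invented for illustration, not copied from lxml. It also assumes, without confirmation from this report, that the IndexError comes from code that indexes the first character of the tag name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeElement:
    """Hypothetical stand-in for a parsed element; only .tag and .attrib are modelled."""
    tag: str
    attrib: dict = field(default_factory=dict)


def strip_namespace(tag):
    # Hypothetical helper (not lxml's code): tag[0] raises IndexError
    # when the tag name is an empty string.
    if tag[0] == '{':
        return tag.split('}')[-1]
    return tag


def iter_hrefs(elements):
    """Yield href values of <a> elements, skipping elements with an empty tag."""
    for el in elements:
        if el.tag == '':
            # The guard proposed in this report: a broken tag cannot be a link.
            continue
        if strip_namespace(el.tag) == 'a' and 'href' in el.attrib:
            yield el.attrib['href']


elements = [
    FakeElement('a', {'href': 'http://example.com/'}),
    FakeElement(''),  # the broken element described above
    FakeElement('{http://www.w3.org/1999/xhtml}a', {'href': '/page'}),
]
links = list(iter_hrefs(elements))
print(links)
```

Without the `el.tag == ''` check, the second element would reach `strip_namespace` and fail there. Whether the real failure in iterlinks() follows this path would need to be confirmed against a page that reproduces it.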
Changed in lxml:
status: New → Triaged
Would you have an example of a page that shows this behaviour? I wonder how it is possible that the parser returns an empty tag name in the first place.