htmldiff fails when there are img tags without src
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Fix Released | Medium | scoder | | |
Bug Description
The htmldiff function fails with a KeyError exception when the HTML contains img tags without a src attribute. Although this is not valid HTML, I have found that it is common on many sites, so I think it would be better to ignore this kind of tag.
Here's a simple example:
>>> from lxml.html.diff import htmldiff
>>> htmldiff('<img />', '<img />')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/
old_html_tokens = tokenize(old_html)
File "/home/
return fixup_chunks(
File "/home/
for chunk in chunks:
File "/home/
for item in flatten_el(child, include_
File "/home/
yield ('img', el.attrib['src'], start_tag(el))
File "lxml.etree.pyx", line 2198, in lxml.etree.
KeyError: 'src'
I'm attaching a patch that solves the issue by ignoring the tag when the src attribute is not found.
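For illustration only, here is a minimal sketch of the "ignore the tag" idea. It is not the attached patch and not lxml's code: it uses the standard library's xml.etree.ElementTree (whose elements expose the same attrib mapping and get() method) so that it runs without lxml, and the helper name img_tokens is hypothetical.

```python
import xml.etree.ElementTree as ET


def img_tokens(root):
    """Yield ('img', src) for each <img> that has a src; skip the others.

    Hypothetical helper for illustration; it only mirrors the idea of
    ignoring img tags that lack a src attribute.
    """
    for el in root.iter('img'):
        # el.attrib['src'] raises KeyError when the attribute is missing;
        # el.get('src') returns None instead.
        src = el.get('src')
        if src is None:
            continue
        yield ('img', src)


doc = ET.fromstring('<div><img /><img src="a.png" /></div>')
tokens = list(img_tokens(doc))
print(tokens)
```

With this approach the img tag without src is dropped from the token stream rather than raising an exception; whether dropping it or keeping it with an empty src is preferable for the diff output is a design choice.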
Changed in lxml:
milestone: none → 2.3.x
Thanks for the report. I committed a fix here:
https://github.com/lxml/lxml/commit/95a7adcfc9741b8f6664e8bc62ea86b310692a8f