htmldiff fails when there are img tags without src

Bug #889200 reported by Lucas Moauro
This bug affects 1 person
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 2.3.x

Bug Description

The htmldiff function fails with a "KeyError" exception when the HTML contains img tags without a src attribute. Although this is not valid HTML, I have found that it is common on many sites, so I think it would be better to ignore such tags.

Here's a simple example:
>>> from lxml.html.diff import htmldiff
>>> htmldiff('<img />', '<img />')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lucas/pythonwebalert/local/lib/python2.7/site-packages/lxml-2.3.1-py2.7-linux-x86_64.egg/lxml/html/diff.py", line 168, in htmldiff
    old_html_tokens = tokenize(old_html)
  File "/home/lucas/pythonwebalert/local/lib/python2.7/site-packages/lxml-2.3.1-py2.7-linux-x86_64.egg/lxml/html/diff.py", line 534, in tokenize
    return fixup_chunks(chunks)
  File "/home/lucas/pythonwebalert/local/lib/python2.7/site-packages/lxml-2.3.1-py2.7-linux-x86_64.egg/lxml/html/diff.py", line 576, in fixup_chunks
    for chunk in chunks:
  File "/home/lucas/pythonwebalert/local/lib/python2.7/site-packages/lxml-2.3.1-py2.7-linux-x86_64.egg/lxml/html/diff.py", line 691, in flatten_el
    for item in flatten_el(child, include_hrefs=include_hrefs):
  File "/home/lucas/pythonwebalert/local/lib/python2.7/site-packages/lxml-2.3.1-py2.7-linux-x86_64.egg/lxml/html/diff.py", line 682, in flatten_el
    yield ('img', el.attrib['src'], start_tag(el))
  File "lxml.etree.pyx", line 2198, in lxml.etree._Attrib.__getitem__ (src/lxml/lxml.etree.c:49115)
KeyError: 'src'

I'm attaching a patch that solves the issue by ignoring the tag when the src attribute is not found.
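
To illustrate the idea behind the patch, here is a minimal sketch of the guard, not the actual lxml code: it uses the standard library's xml.etree.ElementTree as a stand-in, and the function name flatten_images is hypothetical. The point is only that el.get('src') returns None where el.attrib['src'] raises KeyError, so the tag can be skipped.

```python
import xml.etree.ElementTree as ET


def flatten_images(el):
    """Yield ('img', src) for each <img> under el, skipping those without src.

    Hypothetical, simplified stand-in for the img branch of flatten_el.
    """
    for node in el.iter():
        if node.tag != 'img':
            continue
        # .get() returns None for a missing attribute instead of raising KeyError
        src = node.get('src')
        if src is None:
            continue  # ignore the tag when no src attribute is found
        yield ('img', src)


doc = ET.fromstring('<div><img /><img src="a.png" /></div>')
print(list(flatten_images(doc)))
```

With this guard, an img tag without src contributes nothing to the token stream, while tags that do carry a src are handled as before.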

Tags: htmldiff
Lucas Moauro (lagenar) wrote :
scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Medium
status: New → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 2.3.3.

Changed in lxml:
status: Fix Committed → Fix Released
scoder (scoder)
Changed in lxml:
milestone: none → 2.3.x