Wrong comment tag processing by lxml.html.diff.htmldiff

Bug #496670 reported by Alexander Voronin on 2009-12-14
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

htmldiff treats an HTML comment as an ordinary element: it emits an opening "tag" like '<built-in function comment>', then the text from the comment's tail, then a closing '</built-in>'.
Simple example:

===
Python 2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html as h
>>> import lxml.html.diff as d
>>> a=h.fromstring('<html><body><b>text</b><!-- the comment --> other text</body></html>')
>>> d.htmldiff(a,a)
u'<b>text</b>&gt;the comment <built-in function comment>&gt; other text</built-in>'
>>>
===

I debugged diff.py and located the bug near flatten_el and the start_tag/end_tag functions. Changing the start_tag return value from "el.tag" to "el.tag if not callable(el.tag) else el.tag" fixes the tag escaping, but not the order of the tag and its contents (the result looks like ...the comment<!----> other text...).
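To illustrate the ordering problem, here is a minimal sketch of a flattener that treats a comment as a single unit instead of a start tag, contents, and end tag. It is not the code from diff.py: it uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made so the example runs without lxml), relying on the fact that both libraries mark a comment node by setting its tag to the Comment factory function. The start_tag and flatten functions here are hypothetical helpers written for this example, and text escaping is left out.

```python
import xml.etree.ElementTree as ET


def start_tag(el):
    """Return the opening markup for el, special-casing comment nodes."""
    if el.tag is ET.Comment:
        # A comment keeps its content in el.text; emit it as one unit
        # so the content cannot end up outside the <!-- --> markers.
        return "<!--%s-->" % (el.text or "")
    attrs = "".join(' %s="%s"' % (k, v) for k, v in el.attrib.items())
    return "<%s%s>" % (el.tag, attrs)


def flatten(el):
    """Yield markup and text chunks for el in document order (no escaping)."""
    yield start_tag(el)
    if el.tag is not ET.Comment:
        # Only real elements have inner text, children and an end tag.
        if el.text:
            yield el.text
        for child in el:
            yield from flatten(child)
        yield "</%s>" % el.tag
    if el.tail:
        yield el.tail


# Build the body from the report by hand: <b>text</b><!-- the comment --> other text
body = ET.Element("body")
b = ET.SubElement(body, "b")
b.text = "text"
comment = ET.Comment(" the comment ")
comment.tail = " other text"
body.append(comment)

result = "".join(flatten(body))
print(result)
```

The key point is that the comment text, the comment markers, and the tail are emitted in that order, and that no end tag is produced for a node whose tag is callable.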

PS: lxml version - 2.2.2_win32_py2.6

scoder (scoder) on 2009-12-15
Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed