Wrong comment tag processing by lxml.html.diff.htmldiff

Bug #496670 reported by Alexander Voronin
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone:

Bug Description

htmldiff treats an HTML comment as if it were an ordinary element: the output contains a "tag" like '<built-in function comment>', then some text from the tail of the comment, then '</built-in>'.
A simple example:

===
Python 2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html as h
>>> import lxml.html.diff as d
>>> a=h.fromstring('<html><body><b>text</b><!-- the comment --> other text</body></html>')
>>> d.htmldiff(a,a)
u'<b>text</b>&gt;the comment <built-in function comment>&gt; other text</built-in>'
>>>
===

I debugged diff.py and traced the bug to flatten_el and the start_tag/end_tag functions. Changing the start_tag return value from "el.tag" to "el.tag if not callable(el.tag) else el.tag" fixes the tag escaping, but not the order of the tag and its contents (the result looks like ...the comment<!----> other text...).
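
To illustrate the callable(el.tag) check, here is a minimal sketch, not the actual diff.py code. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (both follow the convention that a comment node's .tag is the Comment factory function rather than a name), and flatten is a hypothetical helper written for this example. Text is not escaped, to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def flatten(el):
    """Yield markup chunks for el in document order (hypothetical helper)."""
    if callable(el.tag):
        # Comments and processing instructions carry a factory function in
        # .tag instead of a name, so they have no start/end tag pair.
        if el.tag is ET.Comment:
            yield "<!--%s-->" % (el.text or "")
    else:
        yield "<%s>" % el.tag
        if el.text:
            yield el.text
        for child in el:
            yield from flatten(child)
        yield "</%s>" % el.tag
    # The tail belongs after the node itself, so it follows the comment.
    if el.tail:
        yield el.tail


# Build <body><b>text</b><!-- the comment --> other text</body> by hand.
body = ET.Element("body")
b = ET.SubElement(body, "b")
b.text = "text"
comment = ET.Comment(" the comment ")
comment.tail = " other text"
body.append(comment)

print(callable(comment.tag))
print("".join(flatten(body)))
```

The point of the sketch is the ordering: the comment's own text is emitted inside the comment markup, and only the tail is emitted after it.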

PS: lxml version - 2.2.2_win32_py2.6

scoder (scoder)
Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed