Htmldiff strips newlines from pre tags
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
lxml | Fix Released | Low | Unassigned | |
Bug Description
The htmldiff function strips newlines from <pre> tags, destroying their original formatting. For example:
>>> html = "<pre>test\
>>> print repr(htmldiff(html, html))
u'<pre>test test2 test3</pre>'
Internally, htmldiff calls tokenize() on both inputs, which appears to lose all notion of whitespace:
>>> tokenize(
... test2
... test3
... """)
[token(u'test', ['<pre>'], []), token(u'test2', [], []), token(u'test3', [], ['</pre>'])]
It then calls htmldiff_tokens, which produces output like this:
['<pre>', u'test ', u'test2 ', u'test3', '</pre>']
It then joins those with an empty string, which results in a loss of whitespace.
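The effect described above can be reproduced without lxml. The following is a minimal sketch, not lxml's actual implementation: `naive_tokens` and `preserving_tokens` are hypothetical helpers invented for illustration. The first mimics the behaviour seen above (split on whitespace, then give each word a single trailing space); the second shows one possible way to keep the original whitespace attached to each token so that joining with an empty string round-trips the text.

```python
import re


def naive_tokens(text):
    # Hypothetical helper: split on any run of whitespace and give every
    # word except the last exactly one trailing space, as in the output above.
    words = text.split()
    return [w + " " for w in words[:-1]] + words[-1:]


def preserving_tokens(text):
    # Hypothetical helper: keep each word together with the exact whitespace
    # that follows it. Whitespace before the first word is not handled here.
    return re.findall(r"\S+\s*", text)


pre_text = "test\ntest2\ntest3"

# Joining with an empty string collapses the newlines into single spaces.
naive = "".join(naive_tokens(pre_text))

# Joining with an empty string reproduces the input, newlines included.
preserved = "".join(preserving_tokens(pre_text))

print(repr(naive))
print(repr(preserved))
```

Whether a whitespace-preserving token is still compared by its word alone during diffing is a separate design question that this sketch does not address.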
I've tested this on Python 2.6 and 2.7 using lxml 3.0 through to the latest sources on GitHub, all with the same result.
Python : sys.version_
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Changed in lxml:
status: Confirmed → Fix Committed
I agree that this is a bug. However, I'm not sure how much work it will be to fix. Changing the tokeniser might make it more difficult to find text differences.
Want to give it a try? You seem to have dug into the code already.