Bug #1206077 reported by Tom on 2013-07-29
The htmldiff function strips newlines from <pre> tags, destroying their original formatting. For example:

>>> html = "<pre>test\ntest2\ntest3</pre>"
>>> print repr(htmldiff(html, html))
u'<pre>test test2 test3</pre>'

htmldiff calls tokenize() on both inputs, which appears to lose all notion of whitespace:
>>> tokenize("""<pre>test
... test2
... test3
... """)
[token(u'test', ['<pre>'], []), token(u'test2', [], []), token(u'test3', [], ['</pre>'])]

It then calls htmldiff_tokens, which produces output like this:
['<pre>', u'test ', u'test2 ', u'test3', '</pre>']

Finally, it joins those with an empty string. Because each token carries only a single trailing space, the original newlines are lost.
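
The effect can be reproduced without lxml. The following is only a simplified model of the behaviour described above, not lxml's actual tokenizer: `naive_roundtrip` and `preserving_roundtrip` are hypothetical helpers written for illustration. The first splits on whitespace and re-attaches a single space, the second keeps whatever whitespace followed each word.

```python
import re


def naive_roundtrip(text):
    # Hypothetical model of the buggy path: split on any whitespace,
    # then give every word except the last one a single trailing space.
    words = text.split()
    return "".join([w + " " for w in words[:-1]] + words[-1:])


def preserving_roundtrip(text):
    # Hypothetical alternative: keep each word together with the exact
    # whitespace that follows it (leading whitespace is not handled here).
    chunks = re.findall(r"\S+\s*", text)
    return "".join(chunks)


text = "test\ntest2\ntest3"
print(repr(naive_roundtrip(text)))       # newlines collapse to spaces
print(repr(preserving_roundtrip(text)))  # newlines survive
```

In this model the only difference is whether the tokenizer remembers the original trailing whitespace or assumes a single space.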

I've tested this on Python 2.6 and 2.7 using lxml 3.0 through to the latest sources on GitHub, all with the same result.

Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

I agree that this is a bug. However, I'm not sure how much work it will be to fix it. Changing the tokeniser might make it more difficult to find text differences.

Want to give it a try? You seem to have dug into the code already.
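
One way to address the concern about finding text differences, sketched here under assumptions and not taken from the lxml sources: keep the token comparable as the bare word and store the whitespace as a side attribute, so the matcher never sees it. `Word`, `trailing_whitespace`, and `render` are hypothetical names used only for this illustration.

```python
from difflib import SequenceMatcher


class Word(str):
    """Hypothetical token: compares as the bare word, remembers its whitespace."""

    def __new__(cls, text, trailing_whitespace=""):
        obj = str.__new__(cls, text)
        obj.trailing_whitespace = trailing_whitespace
        return obj

    def render(self):
        # Reattach the original whitespace only when producing output.
        return str(self) + self.trailing_whitespace


old = [Word("test", "\n"), Word("test2", "\n"), Word("test3")]
new = [Word("test", " "), Word("test2", "\n"), Word("test3")]

# Equality and hashing are inherited from str, so whitespace-only
# differences do not affect the match.
matcher = SequenceMatcher(a=old, b=new)
print(matcher.ratio())
print(repr("".join(w.render() for w in old)))
```

With this design the diff step still works on words alone, while serialization can restore the newlines inside <pre>.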

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed
Tom (r-tom-3) wrote :

Sure, I will give it a shot. I think I have narrowed down the issue. Do you accept pull requests through GitHub, or should I attach a patch to this issue?

Please provide a pull request. Thanks!

scoder (scoder) on 2013-08-01
Changed in lxml:
status: Confirmed → Fix Committed
Raniere Silva (raniere) wrote :

The fix won't work in some cases:

import lxml.html.diff

bug = """<pre><span>foo</span>

print("Input `bug`:\n{}\nDiff Output `bug`:\n{}".format(bug,

fine = """<pre><span>foo

print("Input `fine`:\n{}\nDiff Output `fine`:\n{}".format(fine,

scoder (scoder) wrote :

Fixed in lxml 3.2.4.

Changed in lxml:
status: Fix Committed → Fix Released
