Htmldiff strips newlines from pre tags

Bug #1206077 reported by Tom
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Low
Assigned to: Unassigned

Bug Description

The htmldiff function strips newlines from <pre> tags, destroying their original formatting. For example:

>>> from lxml.html.diff import htmldiff
>>> html = "<pre>test\ntest2\ntest3</pre>"
>>> print repr(htmldiff(html, html))
u'<pre>test test2 test3</pre>'

Inside htmldiff it calls tokenize() on both inputs, which appears to lose all notion of whitespace:
>>> from lxml.html.diff import tokenize
>>> tokenize("""<pre>test
... test2
... test3
... """)
[token(u'test', ['<pre>'], []), token(u'test2', [], []), token(u'test3', [], ['</pre>'])]

It then calls htmldiff_tokens, which produces output like this:
['<pre>', u'test ', u'test2 ', u'test3', '</pre>']

It then joins those with an empty string, which results in a loss of whitespace.
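
The mechanism described above can be reproduced without lxml. The following is a minimal sketch, not lxml's actual code: `lossy_roundtrip` and `preserving_roundtrip` are hypothetical helpers. The first mimics a tokenizer that splits on whitespace and re-emits each word with a single trailing space; the second keeps each word together with whatever whitespace originally followed it.

```python
import re


def lossy_roundtrip(text):
    # Hypothetical stand-in for the behaviour described in the report:
    # splitting on any run of whitespace makes "\n" and " " indistinguishable.
    words = text.split()
    if not words:
        return ""
    # Every word but the last gets one trailing space, then the pieces
    # are joined with an empty string.
    pieces = [word + " " for word in words[:-1]] + [words[-1]]
    return "".join(pieces)


def preserving_roundtrip(text):
    # Assumed alternative: keep each word with the exact whitespace that
    # followed it, so joining the pieces restores the original text.
    pieces = re.findall(r"\S+\s*", text)
    return "".join(pieces)


sample = "test\ntest2\ntest3"
print(repr(lossy_roundtrip(sample)))       # 'test test2 test3'
print(repr(preserving_roundtrip(sample)))  # 'test\ntest2\ntest3'
```

With the sample from the report, the first helper collapses the newlines to spaces while the second returns the text unchanged. Whitespace before the first word is still dropped by this sketch.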

I've tested this on Python 2.6 and 2.7 using lxml 3.0 through to the latest sources on GitHub, all with the same result.

Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

I agree that this is a bug. However, I'm not sure how much work it will be to fix it. Changing the tokeniser might make it more difficult to find text differences.

Want to give it a try? You seem to have dug into the code already.

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed
Tom (orf) wrote :

Sure, I will give it a shot; I think I have narrowed down the issue. Do you accept pull requests through GitHub, or should I attach a patch to this issue?

scoder (scoder) wrote : Re: [Bug 1206077] Re: Htmldiff strips newlines from pre tags

Please provide a pull request. Thanks!

Tom (orf) wrote :
scoder (scoder)
Changed in lxml:
status: Confirmed → Fix Committed
Raniere Silva (raniere) wrote :

The fix won't work in some cases:

```
import lxml.html.diff

bug = """<pre><span>foo</span>
<span>bar</span></pre>"""

print("Input `bug`:\n{}\nDiff Output `bug`:\n{}".format(bug,
    lxml.html.diff.htmldiff(bug,bug)))

fine = """<pre><span>foo
bar</span></pre>"""

print("Input `fine`:\n{}\nDiff Output `fine`:\n{}".format(fine,
    lxml.html.diff.htmldiff(fine,fine)))
```
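
The comparison above relies on reading the printed output by eye. As a rough sketch, the same idea can be turned into a boolean check. `preserves_newlines` and `fake_lossy_diff` are hypothetical names introduced here, and the diff function is passed in as a parameter so the snippet runs without lxml installed. Counting newlines is a crude proxy and assumes the diff adds no line breaks of its own.

```python
def preserves_newlines(diff_func, html):
    # Diff the document against itself; with no changes, every line
    # break in the input should still be present in the output.
    result = diff_func(html, html)
    return result.count("\n") == html.count("\n")


def fake_lossy_diff(old, new):
    # Stand-in for a diff that collapses whitespace; this is not lxml code.
    return " ".join(new.split())


bug = "<pre><span>foo</span>\n<span>bar</span></pre>"
print(preserves_newlines(fake_lossy_diff, bug))  # False
```

To exercise the real function, pass `lxml.html.diff.htmldiff` as `diff_func` together with the `bug` and `fine` strings.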

scoder (scoder) wrote :

Fixed in lxml 3.2.4.
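
For code that depends on newlines surviving inside <pre>, a version guard is one option. This is a minimal sketch, assuming 3.2.4 is the first release containing the fix as stated above; `has_pre_newline_fix` is a hypothetical helper, and it takes the version tuple as an argument so it can be run without lxml.

```python
# First release reported in this thread to contain the fix.
FIXED_IN = (3, 2, 4)


def has_pre_newline_fix(version):
    # Compare only major, minor and micro; version tuples such as the
    # (3, 2, 1, 0) shown in the report carry a fourth component.
    return tuple(version[:3]) >= FIXED_IN


print(has_pre_newline_fix((3, 2, 1, 0)))  # False
print(has_pre_newline_fix((3, 2, 4, 0)))  # True
```

In practice the tuple would come from `lxml.etree.LXML_VERSION`, which has the same shape as the "lxml.etree" line in the environment details of the report.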

Changed in lxml:
status: Fix Committed → Fix Released