lxml

Htmldiff strips newlines from pre tags

Bug #1206077 reported by Tom on 2013-07-29

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
lxml	Fix Released	Low	Unassigned

Bug Description

The htmldiff function strips newlines from <pre> tags, destroying their original formatting. For example:

>>> from lxml.html.diff import htmldiff
>>> html = "<pre>test\ntest2\ntest3</pre>"
>>> print repr(htmldiff(html, html))
u'<pre>test test2 test3</pre>'

Internally, htmldiff calls tokenize() on both inputs, which appears to lose all whitespace information:
>>> tokenize("""<pre>test
... test2
... test3
... """)
[token(u'test', ['<pre>'], []), token(u'test2', [], []), token(u'test3', [], ['</pre>'])]
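
The effect can be reproduced without lxml. The following is a minimal sketch and not lxml's actual tokenizer: `naive_tokenize` and `naive_join` are hypothetical helpers that split on whitespace and keep only the words, which is roughly what the token list above suggests.

```python
def naive_tokenize(text):
    """Hypothetical stand-in for a whitespace-splitting tokenizer (not lxml's code)."""
    # str.split() with no arguments splits on any run of whitespace,
    # so spaces, tabs, and newlines all become indistinguishable.
    return text.split()


def naive_join(words):
    """Rebuild text by putting exactly one space between words."""
    return " ".join(words)


original = "test\ntest2\ntest3"
words = naive_tokenize(original)
rebuilt = naive_join(words)
```

Once the words have been split this way, nothing in `words` records that the separators were newlines, so no later step can restore them.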

It then calls htmldiff_tokens, which produces output like this:
['<pre>', u'test ', u'test2 ', u'test3', '</pre>']

Finally, it joins those strings with an empty string. Because each newline has already been replaced by a single trailing space, the original whitespace is lost.
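
One possible way to avoid the loss is to let each token carry the exact whitespace that followed it. This is only a sketch under that assumption; `split_keep_whitespace` and `join_tokens` are hypothetical names and are not part of lxml.

```python
import re

# A word is a run of non-whitespace characters; the whitespace that
# follows it (possibly empty) is captured verbatim in the second group.
_WORD_AND_SPACE = re.compile(r"(\S+)(\s*)")


def split_keep_whitespace(text):
    """Return (word, trailing_whitespace) pairs; hypothetical helper, not lxml API."""
    return _WORD_AND_SPACE.findall(text)


def join_tokens(tokens):
    """Concatenate each word with the exact whitespace that followed it."""
    return "".join(word + space for word, space in tokens)


original = "test\ntest2\ntest3"
tokens = split_keep_whitespace(original)
roundtrip = join_tokens(tokens)
```

With this representation, joining with an empty string reproduces the input, newlines included. Whitespace before the first word is not captured by this pattern, so a real implementation would need to handle that case separately.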

I've tested this on Python 2.6 and 2.7 using lxml 3.0 through to the latest sources on GitHub, all with the same result.

Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2013-07-29:

I agree that this is a bug. However, I'm not sure how much work it will take to fix. Changing the tokeniser might make it more difficult to find text differences.

Want to give it a try? You seem to have dug into the code already.

Changed in lxml:
importance:	Undecided → Low
status:	New → Confirmed

Tom (orf) wrote on 2013-07-29:

Sure, I'll give it a shot. I think I have narrowed down the issue. Do you accept pull requests through GitHub, or should I attach a patch to this issue?

scoder (scoder) wrote on 2013-07-29:

Please provide a pull request. Thanks!

Tom (orf) wrote on 2013-07-29:

Done, https://github.com/lxml/lxml/pull/124

scoder (scoder) on 2013-08-01

Changed in lxml:
status:	Confirmed → Fix Committed

Raniere Silva (raniere) wrote on 2013-11-29:

The fix won't work in some cases:

```
import lxml.html.diff

bug = """<pre>foo
bar</pre>"""

print("Input `bug`:\n{}\nDiff Output `bug`:\n{}".format(bug,
lxml.html.diff.htmldiff(bug,bug)))

fine = """<pre>foo
bar</pre>"""

print("Input `fine`:\n{}\nDiff Output `fine`:\n{}".format(fine,
lxml.html.diff.htmldiff(fine,fine)))
```

scoder (scoder) wrote on 2014-01-06:

Fixed in lxml 3.2.4.

Changed in lxml:
status:	Fix Committed → Fix Released
