lxml.html.diff gives peculiar diffs

Bug #315511 reported by Sean B. Palmer on 2009-01-09
This bug affects 2 people.
Affects: lxml · Importance: Undecided · Assigned to: Ian Bicking

Bug Description

The following two cases strike me as peculiar:

>>> from lxml.html import diff

>>> diff.htmldiff('<p>a b c</p>', '<p>a b d e f</p> <div>f</div> <div>g</div>')
u'<p>a b <ins>d e f</ins><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div> <del>c</del> </p>'

>>> diff.htmldiff('<p>a b c</p>', '<p>c d e</p> <div>f</div> <div>g</div>')
u'<p><ins>c d e</ins></p><ins> </ins><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div> <p><del>a b c</del></p>'

The lesser problem is that in the second case a new <p> is created, which is inconsistent. The major problem is that in both cases the <div> elements are interspersed into the <p>, whereas they should be kept separate. The following, for example, would be reasonable markup to produce:

u'<p>a b <ins>d e f</ins> <del>c</del> </p><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div>'

As it stands, the markup produced by htmldiff(...) is unusable for me.

scoder (scoder) on 2009-01-23
Changed in lxml:
assignee: nobody → ianb
Noah Slater (nslater) wrote :

Any update on this issue?

A certain chap I know has his knickers in a twist about it.

Ian Bicking (ianb) wrote :

Well, trying to get the algorithm to respect the nesting rules of <p> and <div> seems like it'd be really hard. The HTML parser itself basically fixes this, but it seems terribly crude. But if you try:

>>> from lxml.html import diff, fromstring, tostring
>>> tostring(fromstring(diff.htmldiff('<p>a b c</p>', '<p>a b d e f</p> <div>f</div> <div>g</div>')))
'<div><p>a b <ins>d e f</ins></p><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div> <del>c</del> </div>'

And I believe that is basically a decent diff. Arguably the HTML parser has all the right rules to resolve the ambiguities, and does it in the "correct" way... so potentially this could be an actual fix. Stefan: do you find adding a parse/serialize step reasonable?
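The parse/serialize step Ian describes can be wrapped in a small helper. This is only a sketch of the workaround under discussion, not part of lxml's API; the function name `normalized_htmldiff` is my own:

```python
from lxml.html import diff, fromstring, tostring

def normalized_htmldiff(old_html, new_html):
    """Diff two HTML fragments, then round-trip the result through
    lxml's HTML parser so invalid nesting (e.g. <div> inside <p>)
    gets normalized the way a browser would normalize it."""
    raw = diff.htmldiff(old_html, new_html)
    # fromstring() parses leniently and repairs the tree;
    # tostring() serializes the repaired tree back to markup.
    return tostring(fromstring(raw))
```

Note the trade-off Stefan raises below: this treats the symptom at the cost of an extra parse, and the parser's repairs are not guaranteed to match what the diff algorithm intended.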

Noah Slater (nslater) wrote :

This strikes me as a worrying suggestion. I'm not sure I would be comfortable using a method that produced invalid HTML, only to be told that I should parse and serialize it to fix the errors. Feels like a case of treating the symptoms, instead of the cause.

Ian Bicking (ianb) wrote :

Well, I would have just committed it if I thought it was entirely okay. OTOH, the diff it produces isn't *entirely* wrong, it just isn't good HTML. It would be a valid diff if you ignored the way HTML works.

Well... on further thought, that's not a great way of looking at it, since the algorithm should use the </p> as a kind of anchor. Insertions are shown before deletions, a largely arbitrary choice, but I wonder if it would work better with that order reversed, deletes first. Hmm... I'll give that a try.

Ian Bicking wrote:
> Stefan: do you find adding a parse/serialize step reasonable?

I'd vote for avoiding this if at all possible. It sounds like a work-around
that a user can do until there's a better fix. Also, there is no guarantee
that the parser will really fix these things. The diff mechanism should
have its own ways of dealing with issues like this.

Stefan

Noah Slater (nslater) wrote :

On Tue, Apr 07, 2009 at 04:19:42AM -0000, Ian Bicking wrote:
> Well, I would have just committed it if I thought it was entirely okay.
> OTOH, the diff it produces isn't *entirely* wrong, it just isn't good
> HTML. It would be a valid diff if you ignored the way HTML works.

This is an HTML diff function, so producing invalid HTML seems like a fundamental
problem. It seems almost absurd to argue that this could be considered correct
behaviour if only you ignore how HTML works. This module is, without doubt, the
best tool for working with HTML and XML in Python, but it worries me that such a
position could even be considered. I'm not trivialising the work involved, and I
certainly don't think I could successfully maintain and improve this library
myself, but if this is supposed to be used with HTML, I would expect valid
output to be the baseline of quality assurance.

--
Noah Slater, http://tumbolia.org/nslater

scoder (scoder) wrote :

Noah Slater wrote:
> I'm not trivialising the work involved, and I certainly don't think I
> could successfully maintain and improve this library myself, but if
> this is supposed to be used with HTML, I would expect valid output to
> be the baseline of quality assurance.

While I do agree that the diff should provide valid HTML output (and thus,
this is a bug that should be fixed), I must remind you of the contract
that you signed when you started using lxml. Assuming that you are correct
in saying that you couldn't maintain the library yourself; unless you are
willing to pay someone for doing it for you, the chances for successfully
suing anyone because the library returns invalid HTML (in cases) are quite
feeble.

Stefan

Noah Slater (nslater) wrote :

On Tue, Apr 07, 2009 at 01:28:35PM -0000, Stefan Behnel wrote:
> I must remind you of the contract that you signed when you started using lxml.
> Assuming that you are correct in saying that you couldn't maintain the library
> yourself; unless you are willing to pay someone for doing it for you, the
> chances for successfully suing anyone because the library returns invalid HTML
> (in cases) are quite feeble.

Eh? Who said ANYTHING about suing anybody?

I am hugely appreciative of this module, as I stated in my last message. For a
library that works with HTML and XML however, it would be very worrying if the
output of invalid markup was not considered a major problem. I'm certainly not
demanding the bug be fixed, nor fixed within any specific timescale, so I have
no idea why you sent me this message.

--
Noah Slater, http://tumbolia.org/nslater

Ian Bicking (ianb) wrote :

Well, the problem here is that the algorithm is really quite difficult. I spent over an hour last night just trying to remind myself how the thing works. And it's hard for a reason, because the problem it's trying to solve is quite tricky, and it basically inverts the etree model, making words and text primary and tags secondary. It knows about block level and inline elements, and it knows a little about links and images, but it doesn't know that a <div> inside a <p> is invalid, and that's a tricky rule to get in there. Heck, that's a rule I forget frequently.

I think it comes down to a few things:

1. It tries to create markup that most closely matches the second document's structure. So the result is basically supposed to look like the second document, just with the deleted text put in somewhere.
2. It puts deletes after inserts, so the "c" in these examples gets put at the end of the text.
3. But for some reason switching that order (ins before del) didn't really help either, so there's something else going on. The algorithm does literally try to fix up invalid markup, so that fixing stage might be messing up this particular case.
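As an aside on the div-inside-p rule Ian mentions: such nesting can be spotted in the raw diff output with the standard library's parser, which (unlike libxml2's lenient HTML parser) reports tags as written instead of repairing the tree. This checker is my own illustrative sketch, not part of lxml.html.diff, and it applies a deliberately simplified rule that ignores the <object> exception Sean raises in the next comment:

```python
from html.parser import HTMLParser

# Block-level tags that (simplified rule) may not appear inside <p>.
BLOCK_TAGS = {'p', 'div', 'table', 'ul', 'ol', 'blockquote', 'pre'}

class NestingChecker(HTMLParser):
    """Record block-level start tags that open while a <p> is still open."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.violations = []  # offending tag names, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS and 'p' in self.stack:
            self.violations.append(tag)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inlines.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def block_in_p_violations(markup):
    checker = NestingChecker()
    checker.feed(markup)
    return checker.violations
```

Run against the first diff output from the bug description, this flags both misplaced <div> elements.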

On Tue, Apr 7, 2009 at 11:43 PM, Ian Bicking wrote:

> but it doesn't know that a <div> inside a <p> is invalid, and
> that's a tricky rule to get in there.  Heck, that's a rule I forget
> frequently.

And that is not entirely accurate, if by inside you mean a descendant
element: both <div> and <p> can appear inside <object> in HTML 4.01,
and <object> can appear inside <p>, so the following are valid:

<p><object><p>...
<p><object><div>...

--
Sean B. Palmer, http://inamidst.com/sbp/

A while ago I added bug 1190768 as a much simpler test case of (probably) the same issue. I missed linking it to this bug properly.
