Comment 2 for bug 315511

Ian Bicking (ianb) wrote :

Well, getting the algorithm itself to respect the nesting rules of <p> and <div> seems like it'd be really hard. Running the diff output through the HTML parser basically fixes this, though it seems like a terribly crude approach. But if you try:

>>> from lxml.html import diff, fromstring, tostring
>>> tostring(fromstring(diff.htmldiff('<p>a b c</p>', '<p>a b d e f</p> <div>f</div> <div>g</div>')))
'<div><p>a b <ins>d e f</ins></p><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div> <del>c</del> </div>'

I believe that is basically a decent diff. Arguably the HTML parser already has all the right rules to resolve the ambiguities, and resolves them in the "correct" way, so this could potentially be an actual fix. Stefan: do you find adding a parse/serialize step reasonable?
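
For concreteness, here is a rough sketch of what that parse/serialize step might look like as a wrapper. The name normalized_htmldiff is made up for illustration and is not part of lxml; the sketch only relies on htmldiff, fromstring and tostring as used above, and it assumes that an empty diff should be passed through untouched, since fromstring() refuses empty input.

```python
from lxml.html import diff, fromstring, tostring


def normalized_htmldiff(old_html, new_html):
    """Hypothetical helper: htmldiff followed by a parse/serialize round trip."""
    raw = diff.htmldiff(old_html, new_html)
    if not raw.strip():
        # fromstring() rejects empty documents, so pass an empty diff through.
        return raw
    # Let the HTML parser repair any invalid <p>/<div> nesting, then
    # serialize back to text. encoding="unicode" makes tostring() return str.
    return tostring(fromstring(raw), encoding="unicode")


result = normalized_htmldiff(
    '<p>a b c</p>',
    '<p>a b d e f</p> <div>f</div> <div>g</div>',
)
print(result)
```

One side effect to be aware of: when the diff has more than one top-level element, fromstring() wraps everything in an extra <div>, as in the output above, so callers that want a bare fragment would need to strip that wrapper themselves.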