lxml

Bug #315511
Comment #2

Ian Bicking (ianb) wrote on 2009-04-06:

Well, getting the diff algorithm itself to respect the nesting rules of <p> and <div> seems like it'd be really hard. Running the output through the HTML parser basically fixes this, though that seems terribly crude. Still, if you try:

>>> from lxml.html import diff, fromstring, tostring
>>> tostring(fromstring(diff.htmldiff('<p>a b c</p>', '<p>a b d e f</p> <div>f</div> <div>g</div>')))
'<div><p>a b <ins>d e f</ins></p><div><ins>f</ins></div><ins> </ins><div><ins>g</ins></div> <del>c</del> </div>'

And I believe that is basically a decent diff. Arguably the HTML parser has all the right rules to resolve the ambiguities, and does it in the "correct" way... so potentially this could be an actual fix. Stefan: do you find adding a parse/serialize step reasonable?
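
For concreteness, here is a minimal sketch of what such a parse/serialize step might look like if wrapped in a helper. The name normalized_htmldiff is hypothetical and not part of lxml; it only chains the three existing calls from the snippet above, and the exact markup it produces may vary between lxml versions.

```python
from lxml.html import diff, fromstring, tostring


def normalized_htmldiff(old_html, new_html):
    """Hypothetical helper (not an lxml API): run htmldiff, then
    round-trip the result through the HTML parser so that the parser's
    own nesting rules clean up the <ins>/<del> placement."""
    raw = diff.htmldiff(old_html, new_html)
    # fromstring() wraps a fragment with several top-level elements
    # in a single <div>, as in the output shown above.
    tree = fromstring(raw)
    # encoding="unicode" asks for a text string rather than bytes.
    return tostring(tree, encoding="unicode")


result = normalized_htmldiff(
    '<p>a b c</p>',
    '<p>a b d e f</p> <div>f</div> <div>g</div>',
)
print(result)
```

The cost is one extra parse per diff, and callers should expect the result to gain a wrapping <div> whenever the diff has more than one top-level element.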