Comment 5 for bug 400588

scoder (scoder) wrote :

I think I changed my mind on this. It's very annoying that this only happens at serialisation time and not right during parsing. The problem is that (for some hard-to-understand reason) libxml2 switches to Latin-1 decoding when it encounters UTF-8 errors, so the tree ends up with a mixed encoding in this case.
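
To illustrate what "mixed encoding" means here, a rough pure-Python sketch follows. It is not libxml2's code; the helper `decode_with_latin1_fallback` is a hypothetical name, and the exact point at which the real parser switches decoders is an assumption made for the example.

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Decode as UTF-8, but switch to Latin-1 from the first invalid byte on.

    Hypothetical helper that only mimics the behaviour described above.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Everything before the error was valid UTF-8.
        head = data[:exc.start].decode("utf-8")
        # Everything from the error on is read as Latin-1, which never fails.
        tail = data[exc.start:].decode("latin-1")
        return head + tail


# Valid UTF-8 "é", then a stray Latin-1 "é" byte, then valid UTF-8 "é" again.
mixed = "é".encode("utf-8") + b"\xe9" + "é".encode("utf-8")
result = decode_with_latin1_fallback(mixed)
```

In this sketch the first two characters come out as intended, but the last, perfectly valid UTF-8 "é" is misread as two Latin-1 characters, so the decoded text mixes both interpretations.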

My proposal would be to treat this case as a hard error and raise an exception, even if the user asked the parser to "recover". There might be a way to post-process the tree (all text content, all names) to recover more gracefully from this, but even that would only be a half-baked solution that cannot provide a "correct" result, simply because there is no correct result in the case of illegal input.
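
As a sketch of the "hard error" idea from the caller's side, input could be checked strictly before it ever reaches the parser. This assumes the document is meant to be UTF-8 (no other encoding is declared); `InvalidInputEncoding` and `require_valid_utf8` are hypothetical names, not part of lxml or libxml2.

```python
class InvalidInputEncoding(ValueError):
    """Raised when input that should be UTF-8 contains illegal byte sequences."""


def require_valid_utf8(data: bytes) -> bytes:
    """Return data unchanged if it is valid UTF-8, otherwise raise.

    Hypothetical pre-check: there is deliberately no "recover" path here,
    because illegal input has no correct decoding.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise InvalidInputEncoding(
            f"illegal UTF-8 sequence at byte offset {exc.start}"
        ) from exc
    return data
```

Something like `require_valid_utf8(raw_bytes)` would then be called before parsing, so the failure surfaces at parse time rather than at serialisation time.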