Comment 38 for bug 191199

In , Khampton (khampton) wrote :

(In reply to comment #34)

> >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
>
> That's getting too complex. The way the filter is used, it should be
> displaying "™" if somebody writes "™".

Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is.
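To make the "display" vs. "consume" distinction concrete, here is a minimal sketch of the two behaviours. It assumes the replacement text in the quoted patch line was `&amp;`, and the helper names and the sample input are hypothetical, not taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical helper names; neither comes from the patch under review.

# "Display" behaviour: escape every ampersand, so text typed as an entity
# is shown literally when the XML is rendered.
sub escape_all_amp {
    my ($var) = @_;
    $var =~ s/&/&amp;/g;
    return $var;
}

# "Consume" behaviour: escape only ampersands that do not already start
# something shaped like an entity or character reference (assumed to be
# what the quoted substitution does).
sub escape_bare_amp {
    my ($var) = @_;
    $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
    return $var;
}

my $input = 'AT&T &trade; &#8482;';
print escape_all_amp($input), "\n";   # AT&amp;T &amp;trade; &amp;#8482;
print escape_bare_amp($input), "\n";  # AT&amp;T &trade; &#8482;
```

The first form is what a reader looking at the bug in a browser would expect; the second passes existing references through untouched for a downstream XML tool chain.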
>
> >+ # the following nukes characters disallowed by the XML 1.0
> >+ # spec, Production 2.2.

::snipping fugly regex::

> I'd rather replace them with HTML entities, is that possible? People export
> data via XML, and theoretically some of these characters could be in comments
> (as unlikely as it seems).

As Phil rightly points out below, any instance of a character not allowed by Production 2.2, whether pasted in from a Word doc or written by hand as a character reference to the code point, instantly makes the document not well-formed, and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.
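For reference, a rough sketch of the "nuke" approach, written against the Char production in the XML 1.0 spec rather than copied from the snipped regex. The function name is hypothetical, and the input is assumed to be a decoded Perl character string, not raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper, not the regex from the patch. Removes every character
# outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Assumes $text is a character string (already decoded), not bytes.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

# A form feed and a bell character, e.g. pasted in from another application.
my $dirty = "form\x{0C}feed and bell\x{07} removed";
print strip_invalid_xml_chars($dirty), "\n";  # formfeed and bell removed
```

This is lossy by design: the stripped characters are gone from the export. It only handles literal characters; a numeric character reference to one of these code points written out by hand would need a separate pass.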