(In reply to comment #34)
> >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
>
> That's getting too complex. The way the filter is used, it should be
> displaying "&trade;" if somebody writes "&trade;".
Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is.
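
To make the two readings concrete, here is a rough sketch of both behaviours side by side. The sub names and the sample comment are made up for illustration and are not part of the patch; only the first regex comes from it.

```perl
use strict;
use warnings;

# "Consume" reading (the regex from the patch): escape only bare ampersands,
# leaving anything that already looks like an entity or character reference.
# escape_bare_amp is a hypothetical name, not something in the patch.
sub escape_bare_amp {
    my ($var) = @_;
    $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
    return $var;
}

# "Display" reading: escape every ampersand, so whatever the user typed
# shows up literally when the XML is rendered. Also a hypothetical name.
sub escape_all_amp {
    my ($var) = @_;
    $var =~ s/&/&amp;/g;
    return $var;
}

my $comment = 'AT&T owns Foo&trade; & Bar';
print escape_bare_amp($comment), "\n";  # AT&amp;T owns Foo&trade; &amp; Bar
print escape_all_amp($comment),  "\n";  # AT&amp;T owns Foo&amp;trade; &amp; Bar
```

With the first version a downstream XML tool sees the `&trade;` entity reference itself; with the second a browser shows the literal text "&trade;".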
>
> >+ # the following nukes characters disallowed by the XML 1.0
> >+ # spec, Production 2.2.
::snipping fugly regex::
> I'd rather replace them with HTML entities, is that possible? People export
> data via XML, and theoretically some of these characters could be in comments
> (as unlikely as it seems).
As Phil rightly points out below, any instance of a character not allowed by Production 2.2 -- whether pasted in from a Word doc or written by hand as a character reference to the code point -- instantly makes the document not well-formed, and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.
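
For reference, a minimal sketch of the "nuke" approach, assuming the input is already a decoded Perl character string. This is not the regex snipped above; it is just the character ranges from the Char production in the XML 1.0 spec, negated. The sub name is hypothetical, and it only handles literal characters, not hand-written references like `&#x0C;`.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character outside the ranges XML 1.0
# allows (#x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF).
sub strip_invalid_xml_chars {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

# A form feed (0x0C) and a bell (0x07) are dropped; the tab survives.
my $dirty = "form\x{C}feed and bell\x{7} removed, tab\tkept";
my $clean = strip_invalid_xml_chars($dirty);
print $clean, "\n";  # formfeed and bell removed, tab<TAB>kept
```

The loss is visible right there: once the form feed is gone, nothing in the output says it was ever present.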