Comment 35 for bug 191199

Max Kanat-Alexander (mkanat) wrote:

Comment on attachment 354687
V1

>+ # substitute &amp; for & unless it is already
>+ # used in a character entity.
>+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;

  That's getting too complex. The way the filter is used, it should be displaying "&trade;" if somebody writes "&trade;".
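
  To make the difference concrete, here is a minimal sketch comparing the two behaviours. Both subroutine names are hypothetical and are not part of the patch or of Bugzilla; the second one only wraps the substitution from the hunk above so the outputs can be compared side by side.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every "&", including one that already
# starts something that looks like an entity.
sub escape_all_ampersands {
    my ($var) = @_;
    $var =~ s/&/&amp;/g;
    return $var;
}

# The approach from the V1 hunk, wrapped in a hypothetical sub:
# leave "&" alone when it appears to begin a character entity.
sub escape_bare_ampersands {
    my ($var) = @_;
    $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
    return $var;
}

print escape_all_ampersands('Foo&trade; & Bar'), "\n";   # Foo&amp;trade; &amp; Bar
print escape_bare_ampersands('Foo&trade; & Bar'), "\n";  # Foo&trade; &amp; Bar
```

  With the unconditional version, a reader of the rendered output sees the literal text "&trade;" that the user typed; with the lookahead version, the entity passes through and is rendered as the trademark symbol.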

>+ # the following nukes characters disallowed by the XML 1.0
>+ # spec, Production 2.2. 1.0 declares that only the following
>+ # are valid:
>+ # (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF])
>+ $var =~ s/([\x{0001}-\x{0008}]|
>+ [\x{000B}-\x{000C}]|
>+ [\x{00E}-\x{0019}]|
>+ [\x{D800}-\x{DFFF}]|
>+ [\x{FFFE}-\x{FFFF}])//gx;

  I'd rather replace them with HTML entities; is that possible? People export data via XML, and theoretically some of these characters could be in comments (as unlikely as it seems).
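
  As a rough sketch of what that replacement could look like, the snippet below rewrites each disallowed control character as a hexadecimal numeric character reference instead of deleting it. The subroutine name is hypothetical and not part of the patch. For brevity it covers only the control-character range and leaves out the surrogate and \x{FFFE}-\x{FFFF} ranges; it also takes the range up to \x{001F}, since the production quoted above starts the valid range at #x20. One caveat: XML 1.0 requires that character references also resolve to a legal character, so a strict XML parser may still reject output like this, and whether it is acceptable depends on what consumes the export.

```perl
use strict;
use warnings;

# Hypothetical helper: turn control characters that XML 1.0 disallows
# into numeric character references rather than removing them.
# Tab (0x09), LF (0x0A) and CR (0x0D) are legal and are left untouched.
sub entity_encode_invalid_xml_chars {
    my ($var) = @_;
    $var =~ s/([\x{0000}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{001F}])/sprintf('&#x%X;', ord($1))/ge;
    return $var;
}

print entity_encode_invalid_xml_chars("ok\x{01}\x{0B}\x{1B}"), "\n";  # ok&#x1;&#xB;&#x1B;
```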