Comment 2 for bug 1482410

Aaron Wells (u-aaronw) wrote :

After further research I've decided to be more thorough and just whitelist the characters that are valid in XML, listed here: https://en.wikipedia.org/wiki/Valid_characters_in_XML

I wound up using preg_replace() with the "/u" modifier to make it Unicode-safe. The downside is that we read the entire file into memory and then run preg_replace() on it, but that shouldn't add much memory overhead, because we already read the entire file into memory in order to use simplexml.
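
For illustration, here is a rough sketch of what that kind of whitelist filter can look like. It is not the actual patch: the function name strip_invalid_xml_chars() is made up for this example, and the character ranges are the ones XML 1.0 allows (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF).

```php
<?php
// Hypothetical helper for illustration only; not the function from the patch.
function strip_invalid_xml_chars($xml) {
    // Character class listing everything XML 1.0 allows; the leading ^
    // negates it, so the pattern matches every character that is NOT allowed.
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $xml);
    // With /u, preg_replace() returns null when the subject is not valid
    // UTF-8, so report that as false instead of passing null along.
    return $clean === null ? false : $clean;
}
```

The cleaned string could then be handed to simplexml_load_string(). Note that input which is not valid UTF-8 comes back as false here, so a caller would need to handle that case separately.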

I also discovered that htmlentities() can get rid of these invalid characters if you pass the ENT_XML1 | ENT_DISALLOWED flags. But those flags were only added in PHP 5.4, and we still aim to support PHP 5.3. Plus, the best they can do is replace each invalid character with the Unicode replacement character U+FFFD, which displays as a placeholder glyph. So, it's still better to just remove them entirely.