Ah, one more commit with some happiness on the corruption front: it turned out that because the input from the test case did not have decode_utf8() invoked on it, we ran into the misery that we saw.
So, my extra commit adds a precautionary decode_utf8() call on the input, because that's the sane thing to do; it ensures that the regexes know they're dealing with a Unicode string instead of some random binary string and can behave accordingly. I've also restored the order of the entityize() call in this commit.
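
For anyone wondering why the missing decode_utf8() matters to the regexes, here is a minimal standalone sketch. It is not the code from the commit: the sample string and the variable names are made up for illustration, and only decode_utf8() from the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical input: the raw UTF-8 bytes for "café", as an undecoded
# test fixture might supply them (the "é" is the two bytes C3 A9).
my $bytes = "caf\xC3\xA9";

# Undecoded, Perl treats every byte as its own character.
my $raw_len = length $bytes;

# Decoded, the two bytes collapse into the single character U+00E9.
my $text        = decode_utf8($bytes);
my $decoded_len = length $text;

# A pattern expecting exactly one character after "caf" fails on the
# byte string (two "characters" remain) and succeeds on the decoded one.
my $raw_match     = ($bytes =~ /\Acaf.\z/) ? 1 : 0;
my $decoded_match = ($text  =~ /\Acaf.\z/) ? 1 : 0;

print "raw: length=$raw_len match=$raw_match\n";
print "decoded: length=$decoded_len match=$decoded_match\n";
```

The same mismatch applies to anything that works character by character after the regexes, which is presumably why decoding up front is the safer default.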