Comment 3 for bug 192398

Adam Buchbinder (adam-buchbinder) wrote :

I commented too soon. The list of encodings supported by Adobe's implementations is very short (p. 1025): Acrobat 4.0 accepts only Shift-JIS, and 5.0 accepts only Shift-JIS, UHC, GBK, and BigFive. (The spec doesn't say what later versions accept.) I had assumed that PDFDocEncoding was something like UTF-8, but it's a single-byte superset of Latin-1, so converting to PDFDocEncoding by default will mangle any text that uses characters outside that set. There's also a note (p. 132) explaining that Unicode strings must be encoded as UTF-16BE with a leading BOM so that they can be unambiguously distinguished from PDFDocEncoding strings. Converting to UTF-8 will make the exported forms information incompatible with at least some implementations.

The best solution I can think of here is to check whether the string can be reencoded in PDFDocEncoding without losing any characters and, if it can't, to leave it in UTF-16. This would maintain backwards compatibility while making it way, way more hand-editable.
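
A rough sketch of that fallback in Python, not a patch against the actual code: the function name is hypothetical, and since Python has no built-in PDFDocEncoding codec, the check below assumes a conservative subset of byte values where, as far as I can tell, PDFDocEncoding coincides with Latin-1 (tab, LF, CR, printable ASCII, and 0xA1-0xFF except 0xAD). Anything outside that subset falls back to UTF-16BE with a BOM, even if full PDFDocEncoding could represent it.

```python
import codecs

# Assumption: code points where PDFDocEncoding and Latin-1 agree.
# This is a deliberately conservative subset, not the full PDFDocEncoding table.
_SAFE_CODE_POINTS = frozenset(
    [0x09, 0x0A, 0x0D]
    + list(range(0x20, 0x7F))
    + [c for c in range(0xA1, 0x100) if c != 0xAD]
)


def encode_pdf_text_string(text: str) -> bytes:
    """Hypothetical helper: prefer a single-byte encoding, else UTF-16BE with a BOM."""
    if all(ord(ch) in _SAFE_CODE_POINTS for ch in text):
        # Every character is in the assumed shared subset, so the Latin-1
        # bytes are also valid PDFDocEncoding bytes and stay hand-editable.
        return text.encode("latin-1")
    # Otherwise keep the string as Unicode; the leading BOM (0xFE 0xFF)
    # is what distinguishes it from a PDFDocEncoding string.
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

With this, a field value like "café" comes out as four plain bytes, while Japanese text still gets the BOM-prefixed UTF-16BE form that existing readers expect.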