Comment 18 for bug 135985

era (era) wrote :

Proper UTF-8 should not be hard to identify: the high-bit sequences are required to follow a particular pattern (look up UTF-8 in Wikipedia to see a good illustration). What's hard is deciding how to interpret a legacy 8-bit encoding which is not valid UTF-8. You can probably figure out whether something is Latin or KOI etc. based on trigram frequencies, for example, but which Latin? (Latin-1 aka ISO 8859-1 and Latin-9 aka ISO 8859-15 differ in only eight code points, so you can probably just assume it's Latin-9 and never be too badly wrong.) If you interpret Latin-2 as Latin-9, though, you get most of the extended code points wrong (or should I say c$mpl^t^ly wr$ng ... look up "mojibake" in Wikipedia too). Anyway, the baseline should be to identify correct UTF-8 and handle it without the need for any preference or user interaction. How you handle the rest might need something like what is being specified in other comments above.
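To illustrate the baseline, here is a minimal Python sketch: try a strict UTF-8 decode first, and only guess when that fails. The function names and the Latin-9 fallback are my own assumptions for illustration, not anything taken from the code under discussion.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (pure ASCII counts too)."""
    try:
        # Strict decoding rejects any high-bit byte that does not follow
        # the UTF-8 lead-byte / continuation-byte pattern.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def decode_with_fallback(data: bytes, fallback: str = "iso8859-15") -> str:
    """Decode as UTF-8 when valid; otherwise guess a legacy 8-bit encoding.

    The Latin-9 default is only an assumption. A real implementation would
    need a preference or a heuristic to choose the fallback.
    """
    if looks_like_utf8(data):
        return data.decode("utf-8")
    return data.decode(fallback)


if __name__ == "__main__":
    czech = "žluťoučký"
    # Valid UTF-8 is recognised without any user interaction.
    print(decode_with_fallback(czech.encode("utf-8")))
    # Latin-2 bytes are not valid UTF-8 here, so the Latin-9 guess
    # kicks in and produces mojibake.
    print(decode_with_fallback(czech.encode("iso8859-2")))
```

The second call shows the hard part: the bytes are correctly rejected as UTF-8, but the fallback guess is wrong for Latin-2 input, which is why anything beyond the UTF-8 check still needs a preference or a heuristic.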