Comment 3 for bug 1798144

Dick Hollenbeck (dickelbeck) wrote:

Seth,

I am not so sure about this patch. Here is one version of the source of the called function:

wxCharBuffer wxSafeConvertWX2MB(const wchar_t *ws)
{
    if ( !ws )
        return wxCharBuffer();

    wxCharBuffer buf(wxConvLibc.cWX2MB(ws));
    if ( !buf )
        buf = wxMBConvUTF8(wxMBConvUTF8::MAP_INVALID_UTF8_TO_OCTAL).cWX2MB(ws);

    return buf;
}

A couple of things I understand to be true:

1) On Windows, the default in-process encoding (aka the locale) is not UTF8, so wxConvLibc.cWX2MB() will not produce UTF8. And since you are starting from 16 bit characters, that conversion will not fail, so the if ( !buf ) test is never true and you end up with something other than UTF8 on Windows (see the sketch after this list).

2) On Linux, the default in-process encoding is often UTF8, but again, now starting from a 32 bit character, why would wxConvLibc.cWX2MB() ever fail? And if the in-process encoding, aka the locale, is not UTF8, you do not have UTF8 in the result.

3) On Windows, since the input characters are only 16 bits wide, you cannot be sure that the input string is one Unicode code point per character, since not all of Unicode can be represented in a 16 bit character. The input could be UTF16.
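
To make points 1 and 2 concrete, here is a minimal sketch (my own test program, not code from the patch; the exact bytes you get depend on the active locale) of why the UTF8 fallback branch of wxSafeConvertWX2MB() is never reached for non-ASCII wide strings:

#include <wx/strconv.h>
#include <cstdio>

int main()
{
    // Non-ASCII sample text; any string outside plain ASCII will do.
    const wchar_t* ws = L"r\u00e9sum\u00e9";

    // First branch of wxSafeConvertWX2MB(): locale based conversion.
    // With a non-UTF8 locale (a Windows ANSI code page, or Latin-1 on
    // Linux) this succeeds, so the UTF8 fallback is never taken and the
    // bytes returned are NOT UTF8.
    wxCharBuffer localeBytes = wxConvLibc.cWX2MB(ws);

    // Second branch: the explicit UTF8 conversion, which is what the
    // caller of wxSafeConvertWX2MB() presumably expects to receive.
    wxCharBuffer utf8Bytes =
        wxMBConvUTF8(wxMBConvUTF8::MAP_INVALID_UTF8_TO_OCTAL).cWX2MB(ws);

    std::printf("locale bytes: %s\nUTF8 bytes:   %s\n",
                localeBytes.data() ? localeBytes.data() : "(conversion failed)",
                utf8Bytes.data() ? utf8Bytes.data() : "(conversion failed)");
    return 0;
}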

Questions:

1) On Windows, does WX assume that these 16 bit wchars are UTF16, and if so, is wxMBConvUTF8(wxMBConvUTF8::MAP_INVALID_UTF8_TO_OCTAL).cWX2MB(ws) going to first convert each character to a 32 bit platform-neutral Unicode code point before going to UTF8, like libiconv does?

2) Why don't you just use the 2nd part of the above function straight away? This paradigm of trying one thing and then another is the right one when coming from 8 bit characters, but not when coming from 32 bit characters, where the first conversion cannot fail.
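
For clarity, this is roughly what I mean by using the 2nd part straight away; a minimal sketch, not a proposed patch, and the helper name is made up:

#include <wx/strconv.h>

// Hypothetical helper: always produce UTF8, never the locale encoding.
wxCharBuffer ConvertWX2MB_AlwaysUTF8(const wchar_t* ws)
{
    if ( !ws )
        return wxCharBuffer();

    // Go straight to UTF8; invalid input is mapped to octal escapes
    // instead of failing, so the caller always gets UTF8 back.
    return wxMBConvUTF8(wxMBConvUTF8::MAP_INVALID_UTF8_TO_OCTAL).cWX2MB(ws);
}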

The 16 bit case is, unfortunately, simply different. The wx designers made a tough call in even using that bit width. I suspect it is actually UTF16.

In the Debug build there was support in this module to test whether the 8 bit string was actually UTF8, but I don't know if those tests were valid for Unicode content wider than 16 bits.
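
Something along these lines is what I have in mind, though this is only a sketch of such a check, not the code that was actually in the module, and IsProbablyUTF8 is a made-up name:

#include <wx/strconv.h>

// Debug-time sanity check: does the 8 bit string decode cleanly as UTF8?
static bool IsProbablyUTF8(const char* s)
{
    if ( !s )
        return true;

    // A strict UTF8 decoder (wxMBConvUTF8 with no mapping flags) fails on
    // invalid byte sequences, so a non-NULL result means valid UTF8.
    return wxMBConvUTF8().cMB2WC(s).data() != NULL;
}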

All this to say that, in my view, your fix is not producing UTF8, and once you let non-UTF8 characters into this class it can be very hard to find the source of problems which pop up much later, when that string is finally converted or used.

HTH,

Dick