Text::utf8ToWc and Text::wcToUtf8 interfaces/declarations incorrectly assume wchar_t on Win32 can represent any Unicode codepoint
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
DC++ | New | Undecided | Unassigned | | |
Bug Description
int utf8ToWc(const char* str, wchar_t& c);
void wcToUtf8(wchar_t c, string& str);
Both assume that every relevant Unicode codepoint can be represented as one wchar_t. This is not the case.
On at least certain Win32 platforms, https:/
https:/
=======
Windows applications normally use UTF-16 to represent Unicode character data. The use of 16 bits allows direct representation of 65,536 unique characters, but this Basic Multilingual Plane (BMP) is not nearly enough to cover all the symbols used in human languages. Unicode version 4.1 includes over 97,000 characters, with over 70,000 characters for Chinese alone.
The Unicode standard has established 16 additional "planes" of characters, each the same size as the BMP. Naturally, most code points beyond the BMP do not yet have characters assigned to them, but definition of the planes gives Unicode the potential to define 1,114,112 characters (that is, 2¹⁶ * 17 characters) within the code point range U+0000 to U+10FFFF. For UTF-16 to represent this larger set of characters, the Unicode Standard defines "supplementary characters".
A supplementary character is a character located beyond the BMP, and a "surrogate" is a UTF-16 code value. For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using the surrogate mechanism, UTF-16 can support all 1,114,112 potential Unicode characters. For more details about supplementary characters, surrogates, and surrogate pairs, refer to The Unicode Standard.
=======
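The surrogate arithmetic described in the quoted passage can be sketched roughly as follows. This is only an illustration, not DC++ code: `appendUtf16` and `combineSurrogates` are hypothetical helper names, and `char16_t` stands in for the 16-bit `wchar_t` found on Win32 so that the example behaves the same on every platform.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of DC++): append code point cp to out as UTF-16.
// Returns false for values that are not encodable: surrogate code points
// (U+D800..U+DFFF) and anything above U+10FFFF.
bool appendUtf16(std::uint32_t cp, std::u16string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
    if (cp < 0x10000) {
        // BMP code point: fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
        return true;
    }
    // Supplementary code point: split the remaining 20 bits across two units.
    cp -= 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    return true;
}

// Hypothetical helper: recombine a valid high/low surrogate pair into a code point.
// The caller is assumed to have checked that both values are in range.
std::uint32_t combineSurrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1F600 does not fit in one 16-bit unit and would be stored as the two units 0xD83D and 0xDE00.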
To use this surrogate pair mechanism, Text::utf8ToWc and Text::wcToUtf8, along with their downstream users (e.g., in dcpp/Util.cpp), would have to be adapted to allow multiple wchar_t values per codepoint.
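One possible shape for such an adaptation, shown for the wcToUtf8 direction only, is sketched below. It is an assumption about how the interface could look, not a proposed patch: the name `utf16ToUtf8`, its signature, and the choice to replace unpaired surrogates with U+FFFD are all hypothetical, and `char16_t` is used in place of the 16-bit Win32 `wchar_t` to keep the example portable.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical replacement for wcToUtf8(wchar_t, string&).
// Reads one code point from the UTF-16 string `in` starting at index i,
// appends its UTF-8 encoding to `out`, and returns the number of 16-bit
// units consumed (1 or 2), or 0 if i is past the end of the input.
// Assumption: an unpaired surrogate is replaced with U+FFFD.
std::size_t utf16ToUtf8(const std::u16string& in, std::size_t i, std::string& out) {
    if (i >= in.size())
        return 0;
    std::uint32_t cp = in[i];
    std::size_t used = 1;
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
        && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        // Valid surrogate pair: combine both units into one code point.
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        used = 2;
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        cp = 0xFFFD; // lone surrogate
    }
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return used;
}
```

A caller would advance its index by the returned count instead of by one, which is the kind of change downstream loops over wchar_t strings would need.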