Text::utf8ToWc and Text::wcToUtf8 interfaces/declarations incorrectly assume wchar_t on Win32 can represent any Unicode codepoint

Bug #1715870 reported by cologic
Affects: DC++
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

int utf8ToWc(const char* str, wchar_t& c);
void wcToUtf8(wchar_t c, string& str);

Both declarations assume that every relevant Unicode codepoint fits in a single wchar_t. That does not hold where wchar_t is only 16 bits wide.

On at least certain Win32 platforms, sizeof(wchar_t) == 2 (16 bits), as documented by https://msdn.microsoft.com/en-us/library/gg269344%28v=exchg.10%29.aspx and https://msdn.microsoft.com/en-us/library/windows/desktop/aa367308(v=vs.85).aspx, among other MSDN pages. A single 16-bit value cannot hold any codepoint above U+FFFF, which rules out, e.g., many of the emoji listed at https://apps.timwhitlock.info/emoji/tables/unicode.
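
To make the size problem concrete, here is a minimal sketch. The helper names fitsInOneUnit and truncateToUnit are hypothetical and not part of DC++. char16_t stands in for a 16-bit Win32 wchar_t so the example behaves the same on platforms where wchar_t is wider.

```cpp
#include <cassert>

// Hypothetical helper: true if the codepoint fits in one 16-bit code unit.
inline bool fitsInOneUnit(char32_t codepoint) {
    assert(codepoint <= 0x10FFFF);  // upper bound of the Unicode code space
    return codepoint <= 0xFFFF;
}

// Hypothetical helper: what a blind narrowing conversion to 16 bits keeps.
// The upper bits are silently discarded, yielding an unrelated code unit.
inline char16_t truncateToUnit(char32_t codepoint) {
    return static_cast<char16_t>(codepoint);
}
```

For U+1F600 (an emoji), fitsInOneUnit returns false and truncateToUnit yields 0xF600, which lies in the Private Use Area and has nothing to do with the intended character.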

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx describes how UTF-16 handles such codepoints:
==================================================
Windows applications normally use UTF-16 to represent Unicode character data. The use of 16 bits allows direct representation of 65,536 unique characters, but this Basic Multilingual Plane (BMP) is not nearly enough to cover all the symbols used in human languages. Unicode version 4.1 includes over 97,000 characters, with over 70,000 characters for Chinese alone.

The Unicode standard has established 16 additional "planes" of characters, each the same size as the BMP. Naturally, most code points beyond the BMP do not yet have characters assigned to them, but definition of the planes gives Unicode the potential to define 1,114,112 characters (that is, 2¹⁶ * 17 characters) within the code point range U+0000 to U+10FFFF. For UTF-16 to represent this larger set of characters, the Unicode Standard defines "supplementary characters".

A supplementary character is a character located beyond the BMP, and a "surrogate" is a UTF-16 code value. For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using the surrogate mechanism, UTF-16 can support all 1,114,112 potential Unicode characters. For more details about supplementary characters, surrogates, and surrogate pairs, refer to The Unicode Standard.
==================================================
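
The surrogate arithmetic in the quoted passage can be sketched as follows. The function names toSurrogatePair and fromSurrogatePair are hypothetical, and char16_t again stands in for a 16-bit wchar_t. The offset 0x10000 and the 10-bit split are the standard UTF-16 encoding rules.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split a supplementary codepoint (U+10000..U+10FFFF)
// into its (high, low) UTF-16 surrogate pair.
inline std::pair<char16_t, char16_t> toSurrogatePair(char32_t codepoint) {
    assert(codepoint >= 0x10000 && codepoint <= 0x10FFFF);
    const char32_t offset = codepoint - 0x10000;  // 20 significant bits remain
    const char16_t high = static_cast<char16_t>(0xD800 + (offset >> 10));
    const char16_t low = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return std::make_pair(high, low);
}

// Hypothetical helper: recombine a valid surrogate pair into one codepoint.
inline char32_t fromSurrogatePair(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and recombining that pair gives U+1F600 again.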

To use this surrogate pair mechanism, Text::utf8ToWc and Text::wcToUtf8, along with their downstream users (e.g., in dcpp/Util.cpp), would have to be adapted to allow more than one wchar_t value per codepoint.
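
One possible shape for such an adaptation is sketched below. This is only an illustration under stated assumptions, not a patch against DC++: the names codepointToUtf8, nextCodepoint and utf16ToUtf8 are hypothetical, std::u16string stands in for a wide string built from 16-bit wchar_t, and replacing unpaired surrogates with U+FFFD is a choice made for the sketch. The idea is that the per-character interface takes a full codepoint (char32_t), while callers walk the wide string one codepoint at a time, consuming one or two code units.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical counterpart of wcToUtf8 that takes a full codepoint.
inline void codepointToUtf8(char32_t cp, std::string& str) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        str += static_cast<char>(cp);
    } else if (cp < 0x800) {
        str += static_cast<char>(0xC0 | (cp >> 6));
        str += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        str += static_cast<char>(0xE0 | (cp >> 12));
        str += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        str += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        str += static_cast<char>(0xF0 | (cp >> 18));
        str += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        str += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        str += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: read the codepoint starting at text[pos] and advance
// pos by one code unit (BMP) or two (surrogate pair).
inline char32_t nextCodepoint(const std::u16string& text, std::size_t& pos) {
    assert(pos < text.size());
    const char32_t unit = text[pos++];
    if (unit >= 0xD800 && unit <= 0xDBFF && pos < text.size()) {
        const char32_t low = text[pos];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            ++pos;
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    // Sketch-only policy: an unpaired surrogate becomes U+FFFD.
    if (unit >= 0xD800 && unit <= 0xDFFF) return 0xFFFD;
    return unit;
}

// Hypothetical downstream user: convert a whole UTF-16 string to UTF-8.
inline std::string utf16ToUtf8(const std::u16string& text) {
    std::string out;
    for (std::size_t pos = 0; pos < text.size();) {
        codepointToUtf8(nextCodepoint(text, pos), out);
    }
    return out;
}
```

With this shape, a string holding the surrogate pair 0xD83D, 0xDE00 converts to the four UTF-8 bytes F0 9F 98 80 rather than to two separately encoded surrogate halves.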
