Invalid UTF-8 data is not always being rejected

Bug #1649066 reported by maksis
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
AirDC++
Fix Released
Undecided
Unassigned
DC++
Fix Committed
Undecided
Unassigned

Bug Description

There are various cases where invalid UTF-8 data is being consumed by the core:

1. Text::convert will return the original string in case of errors (Linux only, respective Windows-specific functions will return an empty string in case of errors)
2. When using "utf-8" encoding in NMDC hubs, the original string will always be returned by conversion functions without validation (generally Linux only since "utf-8" can't be selected from DC++'s GUI)
3. UTF-8 validation is not performed for strings parsed from XML (specifically file/directory names in filelists)

This will cause issues especially when the data is processed by external sources/libraries that expect to receive valid UTF-8 data (https://github.com/airdcpp-web/airdcpp-webclient/issues/204). I'm not sure about security implications.

Another note: messages that fail UTF-8 validation in ADC hubs are ignored silently. At least Flexhub seems to be having problems with data validation which currently goes unnoticed.

Revision history for this message
maksis (maksis) wrote :

I've implemented UTF-8 validation for XML parsing in AirDC++: https://github.com/airdcpp/airgit/commit/82e02ff9f7023a4aa5a35b61c4cd313b9fed59a6

The only issue I've noticed is that the default hublist (http://dchublist.com/hublist.xml.bz2) won't pass the validation.

Revision history for this message
maksis (maksis) wrote :

This patch for DC++ addresses 1) and 3)

Revision history for this message
poy (poy) wrote :

applied - utf8-validation-1649066 bookmark.

TODO - could you please:
- add a test case for this in test/testxml.cpp?
- make this somehow optional so current hub lists keep working for now?

and I'd be interested in knowing how much this impacts the time it takes to parse some big file lists...

Revision history for this message
maksis (maksis) wrote :

The validation function is cheap and there is effectively no difference in filelist loading times.

Since it's possible to add favorite hubs directly from the hublist, it's quite likely that there are users with malformed data in their Favorites.xml as well. Disabling the validity check for certain files isn't an acceptable solution for me because of the issues that would cause. One alternative would be to (optionally) replace malformed strings with an error messages (such as "UTF-8 validation failed") instead of throwing. Any thoughts?

Revision history for this message
RoLex (hundrambit) wrote :

@ maksis

Hello. I'm developer of Team Elite Hublist. You reported UTF-8 validation issue in our XML files, so I took a good look at it. You was absolutely right, my pinger did not try to encode hub data to UTF-8 when hub encoding was already set to UTF-8. It seemed obvious to me that if hub says to pinger that it already uses UTF-8, it would return already encoded data. Sadly that seems not always to be the case though. I fixed the bug however, thank you sir for the report!

Revision history for this message
poy (poy) wrote :

good point about favs. maybe the download queue & hash files would be of concern as well?

could we handle them through some upgrade step? could be done outside of the XML parsing code, launched for these "precious" files when the XML parser encounters a validation error?
some best effort to replace by question marks or ascii equivalents, and an error message when all else fails... it is important no user loses settings (especially favorites) when upgrading.

Revision history for this message
maksis (maksis) wrote :

AirDC++ will now just replace invalid data when loading config files: https://github.com/airdcpp/airgit/commit/6ab6e22770d1c6d61a5c25a6d58fb6cdedc67858

Since I generally expect this to affect only (a small number of) favorite hub names and descriptions (and recent hubs in AirDC++), I find the solution good enough for that. I can't think of any common cases that would have caused other files to contain invalid data.

Revision history for this message
maksis (maksis) wrote :

Similar patch for DC++ with test included.

Seems that the non-Windows version of Text::utf8ToWide could be used to sanitize invalid characters already. I'm not sure that why there is a separate implementation for Windows as the other code doesn't call any platform-specific functions (the same applies to Text::wideToUtf8 as well).

Revision history for this message
maksis (maksis) wrote :

An alternative patch that should sanitize the string instead

Revision history for this message
maksis (maksis) wrote :

AirDC++ now implements the latter version.

This issue has also been partially discussed earlier: https://sourceforge.net/p/dcplusplus/mailman/message/797137/

maksis (maksis)
Changed in airdcpp:
status: New → Fix Released
Revision history for this message
maksis (maksis) wrote :

Non-Windows versions of Text::utf8ToWide and Text::wideToUtf8 won't obviously handle UTF-16 surrogate pairs at all, thus producing incorrect results. See the following test case for 🌍:

string toUtf8(const wstring& str) {
 string tgt;
 string::size_type n = str.length();
 for (string::size_type i = 0; i < n; ++i) {
  Text::wcToUtf8(str[i], tgt);
 }
 return tgt;
}

wstring emoji = L"\U0001F30D";
ASSERT_EQ(Text::wideToUtf8(emoji), toUtf8(emoji)); // error

Revision history for this message
maksis (maksis) wrote :

Looks like local file/directory names may also contain invalid characters on non-Windows operating systems (at least on Linux). Those may cause crashes and other issues, such as std::bad_alloc exceptions when generating XML files.

https://github.com/airdcpp-web/airdcpp-webclient/issues/382
https://github.com/airdcpp/airdcpp-windows/commit/a92bf8263b2a12c0258d1306d33e0915ff21d994

Revision history for this message
cologic (cologic) wrote :

https://sourceforge.net/p/dcplusplus/code/ci/65212958fd4e1e008eeac859fea56f8d71e3625a/ addresses that aspect of the issue.

The specific issue pointed out wasn't really possible to trigger on DC++'s supported platforms insofar as Linux isn't a supported target, and Windows claims to provide filenames in natively Unicode form already, but worth addressing.

Revision history for this message
eMTee (realprogger) wrote :

Committed the alternative patch which sanitizes strings with invalid data when the flag is present.

The Windows version of Text::utf8ToWide that's more conform with newer Unicode standards will replace invalid UTF-8 data with U+FFFD, the "UNICODE REPLACEMENT CHARACTER".

No change in the non-Windows version of the function right now (for some reason AirDC++ seems to be still using this one for both targets despite comment #11).

Anyone affected please reopen this if the above behavior is considered as appropriate for non-Windows targets or if there's any issue regarding the changes has been made.

Also note that Text::convert is currently not used in DC++; I applied the proposed change for that function for both targets.

Changed in dcplusplus:
status: New → Fix Committed
Revision history for this message
eMTee (realprogger) wrote :

I did some performance tests regarding the changes this patch does (using the optimized release build).

After a few startups to get the file fully cached by Windows, the load of a hashindex.xml with 80k lines, repeated 10 times with and without utf8 validation each, results overhead of an average 3.1% when validation is enabled.

While I was at it I also tested the change of Text::wideToUtf8 and utf8ToWide to the WinAPI versions, added in https://bugs.launchpad.net/dcplusplus/+bug/1473791 presumably to get better Unicode version support for the new function in the patch and in general thoughout the program.

Looks like the Win32 and the old (now for unix only) versions perform pretty much the same in a wideToUtf8(utf8ToWide(str)) operation done on all the data used in the above test scenario.
Suprisingly the Win32 version of wideToUtf8 looks to be even a few percent faster than the previously used one. So adding the Win API versions seems to be justified especially since they also appear to support UTF-16 surrogate pairs so they should work better for far-east locales.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.