Invalid UTF-8 data is not always being rejected

Bug #1649066 reported by maksis
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
AirDC++
Fix Released
Undecided
Unassigned
DC++
New
Undecided
Unassigned

Bug Description

There are various cases where invalid UTF-8 data is being consumed by the core:

1. Text::convert will return the original string in case of errors (Linux only, respective Windows-specific functions will return an empty string in case of errors)
2. When using "utf-8" encoding in NMDC hubs, the original string will always be returned by conversion functions without validation (generally Linux only since "utf-8" can't be selected from DC++'s GUI)
3. UTF-8 validation is not performed for strings parsed from XML (specifically file/directory names in filelists)

This will cause issues especially when the data is processed by external sources/libraries that expect to receive valid UTF-8 data (https://github.com/airdcpp-web/airdcpp-webclient/issues/204). I'm not sure about security implications.

Another note: messages that fail UTF-8 validation in ADC hubs are ignored silently. At least Flexhub seems to be having problems with data validation which currently goes unnoticed.

Revision history for this message
maksis (maksis) wrote :

I've implemented UTF-8 validation for XML parsing in AirDC++: https://github.com/airdcpp/airgit/commit/82e02ff9f7023a4aa5a35b61c4cd313b9fed59a6

The only issue I've noticed is that the default hublist (http://dchublist.com/hublist.xml.bz2) won't pass the validation.

Revision history for this message
maksis (maksis) wrote :

This patch for DC++ addresses 1) and 3)

Revision history for this message
poy (poy) wrote :

applied - utf8-validation-1649066 bookmark.

TODO - could you please:
- add a test case for this in test/testxml.cpp?
- make this somehow optional so current hub lists keep working for now?

and I'd be interested in knowing how much this impacts the time it takes to parse some big file lists...

Revision history for this message
maksis (maksis) wrote :

The validation function is cheap and there is effectively no difference in filelist loading times.

Since it's possible to add favorite hubs directly from the hublist, it's quite likely that there are users with malformed data in their Favorites.xml as well. Disabling the validity check for certain files isn't an acceptable solution for me because of the issues that would cause. One alternative would be to (optionally) replace malformed strings with an error messages (such as "UTF-8 validation failed") instead of throwing. Any thoughts?

Revision history for this message
RoLex (hundrambit) wrote :

@ maksis

Hello. I'm developer of Team Elite Hublist. You reported UTF-8 validation issue in our XML files, so I took a good look at it. You was absolutely right, my pinger did not try to encode hub data to UTF-8 when hub encoding was already set to UTF-8. It seemed obvious to me that if hub says to pinger that it already uses UTF-8, it would return already encoded data. Sadly that seems not always to be the case though. I fixed the bug however, thank you sir for the report!

Revision history for this message
poy (poy) wrote :

good point about favs. maybe the download queue & hash files would be of concern as well?

could we handle them through some upgrade step? could be done outside of the XML parsing code, launched for these "precious" files when the XML parser encounters a validation error?
some best effort to replace by question marks or ascii equivalents, and an error message when all else fails... it is important no user loses settings (especially favorites) when upgrading.

Revision history for this message
maksis (maksis) wrote :

AirDC++ will now just replace invalid data when loading config files: https://github.com/airdcpp/airgit/commit/6ab6e22770d1c6d61a5c25a6d58fb6cdedc67858

Since I generally expect this to affect only (a small number of) favorite hub names and descriptions (and recent hubs in AirDC++), I find the solution good enough for that. I can't think of any common cases that would have caused other files to contain invalid data.

Revision history for this message
maksis (maksis) wrote :

Similar patch for DC++ with test included.

Seems that the non-Windows version of Text::utf8ToWide could be used to sanitize invalid characters already. I'm not sure that why there is a separate implementation for Windows as the other code doesn't call any platform-specific functions (the same applies to Text::wideToUtf8 as well).

Revision history for this message
maksis (maksis) wrote :

An alternative patch that should sanitize the string instead

Revision history for this message
maksis (maksis) wrote :

AirDC++ now implements the latter version.

This issue has also been partially discussed earlier: https://sourceforge.net/p/dcplusplus/mailman/message/797137/

maksis (maksis)
Changed in airdcpp:
status: New → Fix Released
Revision history for this message
maksis (maksis) wrote :

Non-Windows versions of Text::utf8ToWide and Text::wideToUtf8 won't obviously handle UTF-16 surrogate pairs at all, thus producing incorrect results. See the following test case for 🌍:

string toUtf8(const wstring& str) {
 string tgt;
 string::size_type n = str.length();
 for (string::size_type i = 0; i < n; ++i) {
  Text::wcToUtf8(str[i], tgt);
 }
 return tgt;
}

wstring emoji = L"\U0001F30D";
ASSERT_EQ(Text::wideToUtf8(emoji), toUtf8(emoji)); // error

Revision history for this message
maksis (maksis) wrote :

Looks like local file/directory names may also contain invalid characters on non-Windows operating systems (at least on Linux). Those may cause crashes and other issues, such as std::bad_alloc exceptions when generating XML files.

https://github.com/airdcpp-web/airdcpp-webclient/issues/382
https://github.com/airdcpp/airdcpp-windows/commit/a92bf8263b2a12c0258d1306d33e0915ff21d994

Revision history for this message
cologic (cologic) wrote :

https://sourceforge.net/p/dcplusplus/code/ci/65212958fd4e1e008eeac859fea56f8d71e3625a/ addresses that aspect of the issue.

The specific issue pointed out wasn't really possible to trigger on DC++'s supported platforms insofar as Linux isn't a supported target, and Windows claims to provide filenames in natively Unicode form already, but worth addressing.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.