Zorba hangs with invalid utf-8 input

Bug #1169908 reported by Chris Hillery
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Zorba
Fix Released
Medium
Paul J. Lucas

Bug Description

import module namespace file = "http://expath.org/ns/file";

file:read-text('text-plain-utf-16be-bom-lines.txt', "utf-8") ! string-length(.)

or

fn:unparsed-text-lines('text-plain-utf-16be-bom-lines.txt', "utf-8") ! string-length(.)

Either of these queries causes Zorba to hang at 100% CPU.

The problem is that the input file (attached) is actually utf-16 with a BOM. Since we lie and tell Zorba the input is utf-8 (the same thing also happens if we don't specify the encoding since Zorba assumes utf-8), the stream is passed through untouched. This causes the implementation of string-length() to busy-loop, specifically in utf8::length(), because the char_length() of the byte \xFE (the first byte in the file) is 0.

Related branches

Revision history for this message
Chris Hillery (ceejatec) wrote :

My recollection is that in earlier discussions about Zorba verifying user input, we agreed that it should only do so at the boundaries - ie, data should be checked for validity when it first enters Zorba, and not as it flows through.

However, we aren't currently performing this input check for streaming data. Paul has written a streaming UTF-8 checker, utf8::streambuf, which verifies the data is valid UTF-8 as it comes in. However, it is very challenging and fragile to use correctly, as he explains here: https://code.launchpad.net/~zorba-coders/zorba/bug1073175/+merge/144762/comments/318167

I do not believe this is a viable solution, so I have asked for a more encapsulated solution here: https://bugs.launchpad.net/zorba/+bug/1073175/comments/3

But... given that this bug is no longer causing FOTS failures and is apparently only triggered by providing bogus input, I am currently marking this bug as "Medium" priority without a specific milestone.

Changed in zorba:
importance: Undecided → Medium
Revision history for this message
Chris Hillery (ceejatec) wrote :

Just to consolidate further, here is the API I requested:

Paul, I need you to propose an API that meets the following requirements:

1. It should accept a std::istream, and either modify it in-place or return a new istream. (Alternately it could work on std::streambuf objects, if and only if all the memory-management requirements can still be met, which I don't think they can.)

2. It should accept a Zorba error code or exception object, a QueryLoc, and any additional error-message parameters as necessary.

3. It should arrange so that any further reading from that istream will throw an instance of that Zorba exception, with all appropriate parameters, if the istream produces invalid UTF8.

4. It should correctly handle memory manage in all cases, de-allocating any created objects either when the istream is itself deleted or when an exception is thrown.

Please let me know how you would like to proceed, and any consequences of the above API that may not be obvious.

Once this is done, we may need/want to revisit transcode::streambuf in a similar fashion. If there is a clever template-y way to layer this stuff around any existing streambuf implementations, so much the better.

Changed in zorba:
assignee: nobody → Paul J. Lucas (paul-lucas)
Revision history for this message
Chris Hillery (ceejatec) wrote :

Paul asked some questions about the API; here are my answers:

> Can you explain why it needs an error code that isn't
> ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE? AFAIK,
> there isn't an error code defined in the spec for this
> (which is why I made a Zorba-specific one up).

This originally arose with fn:unparsed-text() and friends, and in that case, the spec says to throw http://www.w3.org/TR/xpath-functions-30/#ERRFOUT1190 for bogus input. However, the same thing happens with the file module, which doesn't actually specify what to do for bogus-encoded data but says that file:FOFL9999 is thrown "if any other error occurs". On the whole it just made sense to me to make the error code configurable. However, if this is for some reason Really Hard, I would suggest using only FOUT1190.

> What QueryLoc should it use? The same stream might be
> read from several places throughout the code?

That's not your problem; the API should simply accept a QueryLoc and use it. The people using this API are responsible for picking an appropriate location, probably the point at which the data-reading function is called (fn:unparsed-text(), file:read-text(), etc.).

Changed in zorba:
status: New → In Progress
Revision history for this message
Paul J. Lucas (paul-lucas) wrote :

> This causes the implementation of string-length() to busy-loop ...

What string-length()? Where?

Revision history for this message
Chris Hillery (ceejatec) wrote :

I don't recall exactly, I'm afraid. I believe it was in utf8::length but I'm not sure. If you run this query in a debugger and suspend it once it locks up you'll see it soon enough.

Revision history for this message
Paul J. Lucas (paul-lucas) wrote :

BTW: there is no attachment.

Revision history for this message
Paul J. Lucas (paul-lucas) wrote :
Revision history for this message
Paul J. Lucas (paul-lucas) wrote :

FYI: The string-length() I was referring to is in strings_impl.cpp:748.

On this branch, I've fixed utf8::length(), sort of. It will now return 0 if the string contains any invalid UTF-8 byte. However, that's not right because then you can't distinguish between an invalid byte and a string having a true 0 length. If I instead return utf8::npos, then there are probably lots of places in the code that won't correctly handle that. I think I should instead throw an illegal_argument exception that the caller must wrap in try/catch.

Regardless of the above, string-length() can now detect the error (independently of either read-text or unparsed-text-lines) and handle it itself. Should it? If we follow the "check only at the boundaries" rule, then, in theory, string-length() should never have been given invalid UTF-8. This implies catching the invalid UTF-8 earlier when reading it (which is mostly what this bug is about). But I should still probably do SOMETHING if utf8::length() returns an error. I'm thinking I should have utf8::length() throw an exception and NOT catch it... kind of like an "assert" of sorts (belt and suspenders).

Revision history for this message
Paul J. Lucas (paul-lucas) wrote :

Update: I'm now having [C++] char_length() throw an invalid_byte exception (that is-an invalid_argument exception). [XQuery] string-length() will now report ZXQD0006_INVALID_UTF8_BYTE_SEQUENCE.

Also, I think the "check only at the boundaries" rule needs clarification, specifically how robust should the checking be? utf8::read(), for example, now (using the patched char_length()) checks the validity of the first byte of a UTF-8 character, and it also (as it has always done) checks that the remaining bytes are "continuation" bytes -- but should it? It does take non-zero CPU time to check. If the first byte says there are 2 more bytes comprising the character, it COULD just read 2 more bytes without checking them -- garbage in, garbage out.

Finally, I COULD close this bug now since it no longer hangs.

Changed in zorba:
status: In Progress → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.