fn:match fails if the string is non-utf8

Bug #867159 reported by Daniel Turcanu
Affects: Zorba
Status: Fix Released
Importance: Medium
Assigned to: Paul J. Lucas

Bug Description

I have a query that reads a lot of files and applies fn:matches to them.
Some files contain invalid UTF-8 byte sequences, yet file:read-text reads them with no problem.
fn:matches calls to_string to convert to an ICU string, but that conversion fails. So fn:matches returns false, although I think it should raise an error. Actually, to_string should raise an error; otherwise the non-UTF-8 problem goes unnoticed.
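
To make the difference concrete, here is a minimal Python sketch (not Zorba's code; both function names are hypothetical and Python's `re` stands in for the ICU regex engine). The first function mimics the reported behaviour, where a failed conversion is indistinguishable from "no match"; the second lets the conversion error surface.

```python
import re


def matches_silent(raw: bytes, pattern: str) -> bool:
    """Hypothetical: a decoding failure is reported as 'no match'."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid input goes unnoticed by the caller
    return re.search(pattern, text) is not None


def matches_strict(raw: bytes, pattern: str) -> bool:
    """Hypothetical: a decoding failure propagates to the caller."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    return re.search(pattern, text) is not None
```

With `matches_silent`, a file containing the byte `0xFF` simply "does not match"; with `matches_strict`, the caller learns that the input was not valid UTF-8.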

Tags: core-runtime

Paul J. Lucas (paul-lucas) wrote :

We (meaning several Zorba team members) had a long discussion a while ago about what to do with invalid UTF-8 byte sequences, and the consensus was that the validity of a UTF-8 byte sequence should be checked only on entry into Zorba and not after it has been read in. So if the byte sequence is to be checked at all, it should be checked in read-text, and any error should be raised there.

Daniel Turcanu (danielturcanu) wrote :

But this is about ICU.
It does the UTF-8 check once again and returns an error. I think to_string should throw a UTF-8 error, not return false.
ICU may return other errors as well. It is not right for fn:matches to just return false, as if the pattern hadn't matched.

Paul J. Lucas (paul-lucas) wrote :

I can't control what ICU does. If I were to change to_string() to throw an exception now, there are probably plenty of places in the code that would not catch it, so the exception would go all the way to the top and crash Zorba.

You didn't specify a release ("Group") for this bug, so it's not clear whether you expect anything to be done by 2.0. Given what I said above, I think it's too dangerous a change so close to a release.

If the bad UTF-8 sequence were caught when it was first read in, and an exception thrown then, this problem would be moot, since the sequence would never reach my code.

Daniel Turcanu (danielturcanu) wrote :

Gabriel, then it's your bug.
file:read-text should validate that its input is valid UTF-8.

Gabriel Petrovay (gabipetrovay) wrote :

There is a problem with this. read-text generates a streamable string, and at that point its job is over. So the invalid string will probably be consumed somewhere later in the runtime.

I'm not sure how to solve this, so let's escalate it to zorba-coders.

Chris Hillery (ceejatec) wrote :

IMHO, the check needs to be added in the StreamableString implementation. The "rule" is that Zorba doesn't check for UTF8 validity internally, only at the entry points. This is a new entry point, so it needs the check.
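
A minimal sketch of what checking a streamed string at its entry point could look like, written in Python rather than Zorba's C++ and assuming a hypothetical `read_text_validated` helper; it is not the StreamableString implementation. It relies on the standard library's incremental UTF-8 decoder, which handles multi-byte characters split across chunk boundaries.

```python
import codecs
import io
from typing import BinaryIO, Iterator


def read_text_validated(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[str]:
    """Hypothetical helper: yield decoded text chunk by chunk,
    raising UnicodeDecodeError as soon as an invalid byte is consumed."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        text = decoder.decode(chunk)  # buffers an incomplete trailing character
        if text:
            yield text
    # final=True raises if the stream ended in the middle of a character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


if __name__ == "__main__":
    good = io.BytesIO("héllo".encode("utf-8"))
    print("".join(read_text_validated(good, chunk_size=2)))
```

Because the generator is lazy, the error is raised when the consumer reaches the bad chunk, not when the stream is created, but it is still raised by the reader rather than by whatever function happens to consume the string later.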

Gabriel Petrovay (gabipetrovay) wrote :

I am not sure if I am the right person to do this. Removing myself from this bug.

Changed in zorba:
assignee: Gabriel Petrovay (gabipetrovay) → nobody
Changed in zorba:
assignee: nobody → Matthias Brantner (matthias-brantner)
milestone: none → 2.2

Matthias Brantner (matthias-brantner) wrote :

Paul, can you please mark this bug as fix committed as soon as the transcoding_streambuffer branch has been merged? If I understood correctly, these changes should resolve the bug.

Changed in zorba:
assignee: Matthias Brantner (matthias-brantner) → Paul J. Lucas (paul-lucas)
Changed in zorba:
status: New → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released