Zorba

fn:match fails if the string is non-utf8

Bug #867159 reported by Daniel Turcanu on 2011-07-28

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Zorba	Fix Released	Medium	Paul J. Lucas	Zorba 2.2 "Coeus"

Bug Description

I have a query that reads a lot of files and applies fn:matches to them.
Some files contain non-UTF-8 bytes, and file:read-text reads them with no problem.
fn:matches calls to_string to convert the input to an ICU string, but that conversion fails, so fn:matches returns false, although I think it should raise an error. Actually, to_string itself should raise an error; otherwise the non-UTF-8 problem goes unnoticed.
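
To make the difference concrete, here is a minimal sketch of the two behaviours, not Zorba's actual code. The names convert, matches_silent and matches_strict are hypothetical, std::regex stands in for ICU, and the conversion check is simplified to "7-bit ASCII only" purely for illustration.

```cpp
#include <optional>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a conversion step such as to_string().
// Assumption: "convertible" is simplified to "7-bit ASCII only";
// a real implementation would validate UTF-8 instead.
std::optional<std::string> convert(const std::string& in) {
    for (unsigned char c : in) {
        if (c >= 0x80) return std::nullopt;  // conversion failed
    }
    return in;
}

// Behaviour described in the report: a failed conversion is folded
// into "false", so it looks exactly like "the pattern did not match".
bool matches_silent(const std::string& input, const std::string& pattern) {
    const std::optional<std::string> s = convert(input);
    if (!s) return false;
    return std::regex_search(*s, std::regex(pattern));
}

// Behaviour requested in the report: a failed conversion is an error
// that the caller cannot mistake for a non-match.
bool matches_strict(const std::string& input, const std::string& pattern) {
    const std::optional<std::string> s = convert(input);
    if (!s) throw std::invalid_argument("input could not be converted");
    return std::regex_search(*s, std::regex(pattern));
}
```

With an input such as "a", byte 0xFF, "bc" and the pattern "b", matches_silent returns false even though the pattern is present, while matches_strict throws.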

Related branches

lp:~zorba-coders/zorba/feature-transcode_streambuf

Merged into lp:zorba at revision 10663

Matthias Brantner: Approve on 2012-02-16

Paul J. Lucas: Approve on 2012-02-16

Paul J. Lucas (paul-lucas) wrote on 2011-07-28:

#1

We (meaning several Zorba team members) had a long discussion a while ago about what to do with invalid UTF-8 byte sequences, and the consensus was that the validity of UTF-8 byte sequences should be checked only on entry into Zorba and not after the data has been read in. So if the byte sequence is to be checked at all, it should be checked in read-text, and any error should be raised there.
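
A minimal sketch of what a check at the entry point could look like, assuming the whole input is available as one buffer. The names is_valid_utf8 and read_text_checked are hypothetical and are not Zorba functions; the validity rules follow RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if s is well-formed UTF-8 per RFC 3629.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                 // stray continuation or bad lead byte
        if (len > n - i) return false;     // sequence truncated by end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}

// Hypothetical entry point: validate once, on the way in, and raise there.
std::string read_text_checked(const std::string& bytes) {
    if (!is_valid_utf8(bytes)) {
        throw std::runtime_error("invalid UTF-8 byte sequence in input");
    }
    return bytes;
}
```

Once the data has passed such a check, later stages can treat it as valid UTF-8 without re-checking, which is the rule described above.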

Daniel Turcanu (danielturcanu) wrote on 2011-07-28:

#2

But this is about ICU.
It does the UTF-8 check once again and returns an error. I think to_string should throw a UTF-8 error, not return false.
ICU may return other errors as well. It is not right for fn:matches to just return false, as if the pattern hadn't matched.

Paul J. Lucas (paul-lucas) wrote on 2011-07-28:

#3

I can't control what ICU does. If I were to change to_string() to throw an exception now, there are probably plenty of places in the code that would not catch it, so the exception would go all the way to the top and crash Zorba.

You didn't specify a release ("Group") for this bug, so it's not clear whether you expect anything to be done by 2.0. Given what I said above, I think it's too dangerous a change so close to a release.

If the bad UTF-8 sequence were caught when it was first read in, and an exception thrown then, this problem would be moot, since the data would never reach my code in this case.

Daniel Turcanu (danielturcanu) wrote on 2011-07-28:

#4

Gabriel, then it's your bug.
file:read-text should check that what it reads is valid UTF-8.

Gabriel Petrovay (gabipetrovay) wrote on 2011-07-29:

#5

There is a problem with this. read-text generates a streamable string, and at that point its job is over. So the invalid string will probably be consumed somewhere later in the runtime.

I'm not sure how to solve this, so let's escalate it to zorba-coders.

Chris Hillery (ceejatec) wrote on 2011-07-29:

#6

IMHO, the check needs to be added in the StreamableString implementation. The "rule" is that Zorba doesn't check for UTF8 validity internally, only at the entry points. This is a new entry point, so it needs the check.
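
As a rough illustration of checking a streamed string, here is a sketch of a validator that keeps its state between chunks, so a multi-byte sequence split across two reads is still accepted. Utf8StreamChecker is a hypothetical name and this is not the StreamableString implementation; the byte ranges follow the well-formed UTF-8 table in the Unicode Standard.

```cpp
#include <stdexcept>
#include <string>

// Incremental UTF-8 validator: feed() may be called once per chunk as the
// stream is consumed; finish() must be called at end of stream.
class Utf8StreamChecker {
public:
    // Validates one chunk; throws std::runtime_error on the first bad byte.
    void feed(const std::string& chunk) {
        for (unsigned char b : chunk) step(b);
    }

    // Throws if the stream ended in the middle of a multi-byte sequence.
    void finish() const {
        if (remaining_ != 0) {
            throw std::runtime_error("truncated UTF-8 sequence at end of stream");
        }
    }

private:
    void step(unsigned char b) {
        if (remaining_ == 0) {
            if (b < 0x80) return;  // single-byte (ASCII) character
            if (b >= 0xC2 && b <= 0xDF) {
                remaining_ = 1; lo_ = 0x80; hi_ = 0xBF;
            } else if (b >= 0xE0 && b <= 0xEF) {
                remaining_ = 2;
                lo_ = (b == 0xE0) ? 0xA0 : 0x80;  // excludes overlong forms
                hi_ = (b == 0xED) ? 0x9F : 0xBF;  // excludes surrogates
            } else if (b >= 0xF0 && b <= 0xF4) {
                remaining_ = 3;
                lo_ = (b == 0xF0) ? 0x90 : 0x80;  // excludes overlong forms
                hi_ = (b == 0xF4) ? 0x8F : 0xBF;  // excludes > U+10FFFF
            } else {
                throw std::runtime_error("invalid UTF-8 lead byte");
            }
            return;
        }
        if (b < lo_ || b > hi_) {
            throw std::runtime_error("invalid UTF-8 continuation byte");
        }
        // Only the first continuation byte has a restricted range.
        lo_ = 0x80;
        hi_ = 0xBF;
        --remaining_;
    }

    int remaining_ = 0;         // continuation bytes still expected
    unsigned char lo_ = 0x80;   // allowed range for the next continuation byte
    unsigned char hi_ = 0xBF;
};
```

With this shape, the error surfaces when the offending chunk is consumed rather than when the string is created, which is the trade-off of validating a stream lazily.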

Gabriel Petrovay (gabipetrovay) wrote on 2011-11-22:

#7

I am not sure if I am the right person to do this. Removing myself from this bug.

Changed in zorba:
assignee:	Gabriel Petrovay (gabipetrovay) → nobody

Matthias Brantner (matthias-brantner) on 2011-11-22

Changed in zorba:
assignee:	nobody → Matthias Brantner (matthias-brantner)
milestone:	none → 2.2

Matthias Brantner (matthias-brantner) wrote on 2012-02-08:

#8

Paul, can you please mark this bug as fix committed as soon as the transcoding_streambuffer branch has been merged? If I understood correctly, these changes should resolve the bug.

Changed in zorba:
assignee:	Matthias Brantner (matthias-brantner) → Paul J. Lucas (paul-lucas)

Zorba Build Bot (zorba-buildbot) on 2012-02-16

Changed in zorba:
status:	New → Fix Committed

Paul J. Lucas (paul-lucas) on 2012-03-27

Changed in zorba:
status:	Fix Committed → Fix Released
