Need ASCII regex

Bug #867130 reported by Paul J. Lucas
Affects: Zorba
Status: Fix Released
Importance: Medium
Assigned to: Daniel Turcanu

Bug Description

As discussed on the zorba-coders mailing list, in order to build Zorba without ICU, there needs to be an ASCII regular expression library to take the place of ICU. The existing code in zorbatypes/regex_ascii.h/.cpp needs to be wrapped by the existing regex class to provide an alternate implementation when ZORBA_NO_UNICODE=ON, e.g.:

#ifndef ZORBA_NO_UNICODE
// existing regex class
#else
// new regex class backed by regex_ascii
#endif

You probably also need to provide an alternate implementation of convert_xquery_re() (in regex.h/.cpp) that currently converts an XQuery regular expression into an ICU regular expression. If the existing regex_ascii regular expressions exactly match XQuery regular expressions, then the alternate implementation of convert_xquery_re() can simply copy xq_re to lib_re as-is.
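
A minimal sketch of the pass-through case described above, assuming regex_ascii accepts XQuery regular expressions unchanged. The signature is hypothetical (only the names convert_xquery_re, xq_re and lib_re come from this report), and the #define merely stands in for the ZORBA_NO_UNICODE=ON build setting:

```cpp
#include <string>

// Stand-in for building with ZORBA_NO_UNICODE=ON; a real build would set
// this from the build system rather than in the source file.
#define ZORBA_NO_UNICODE

#ifndef ZORBA_NO_UNICODE
// The existing implementation that translates XQuery syntax to ICU syntax
// would stay here, unchanged.
#else
// Hypothetical signature; the real declaration in regex.h may differ.
// If regex_ascii understands XQuery syntax directly, conversion is a copy.
void convert_xquery_re(const std::string& xq_re, std::string* lib_re) {
  *lib_re = xq_re;
}
#endif
```

If regex_ascii turns out to differ from XQuery syntax in some construct, this function is where the translation would have to go.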

Daniel Turcanu (danielturcanu) wrote :

Done in svn 11287.
The API implemented is approximately the same. Some functions taking "string" parameters were deleted. I also had trouble understanding how next_token is supposed to work; please check it.

I also made some changes and updates to regex_ascii, but have not tested them. I am waiting for ZORBA_NO_UNICODE to compile before I start debugging it.

Paul J. Lucas (paul-lucas) wrote :

I have created the no_unicode branch. Check it out using:

svn co https://zorba.svn.sourceforge.net/svnroot/zorba/branches/no_unicode

I have changed unicode_util.h so that unicode::string is defined simply to be zstring when ZORBA_NO_UNICODE=ON. This means you need to implement the complete API, putting back the functions you deleted.

next_token differs from next_match in that next_token uses the regex to specify what SEPARATES the tokens rather than what matches the tokens. For example, if I want to parse the string:

a,b,c

that is, letters separated by commas, I can do it in one of two ways:

1. Using next_match and specifying the regex to be "[a-z]"
2. Using next_token and specifying the regex to be ","

next_token is similar to strtok(3) because in strtok(3), you specify what separates tokens.
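
To make the distinction concrete, here is a small sketch that uses std::regex as a stand-in for the Zorba regex class. The helper names by_match and by_separator are made up for illustration and are not part of the Zorba API:

```cpp
#include <regex>
#include <string>
#include <vector>

// next_match view: the regex describes the pieces we want.
std::vector<std::string> by_match(const std::string& s, const std::string& re) {
  std::regex r(re);
  std::vector<std::string> out;
  for (std::sregex_iterator it(s.begin(), s.end(), r), end; it != end; ++it)
    out.push_back(it->str());
  return out;
}

// next_token view: the regex describes what SEPARATES the pieces we want.
// Submatch index -1 asks the iterator for the text between matches.
std::vector<std::string> by_separator(const std::string& s,
                                      const std::string& re) {
  std::regex r(re);
  std::vector<std::string> out;
  for (std::sregex_token_iterator it(s.begin(), s.end(), r, -1), end;
       it != end; ++it)
    out.push_back(it->str());
  return out;
}
```

Calling by_match("a,b,c", "[a-z]") and by_separator("a,b,c", ",") should both yield the three strings a, b and c, even though the two regexes describe opposite parts of the input.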

Note: priority reduced and group changed since it's apparently not going into 2.0.

Daniel Turcanu (danielturcanu) wrote :

For the next_token() function in the regex API, I don't understand what the boolean values should be.
As I understand it, matched should be true if the regex matches something, and false if not.
And the function should return true if the returned token is non-empty.
Is this correct?

As for the deleted functions, I don't need to add them back because they overlap with the template ones.

Paul J. Lucas (paul-lucas) wrote :

next_token() must return true if it returns a token. Note that it must return true for the special case of the "last" token. Again, if I have:

a,b,c

and I specify a regex of "," and use next_token(), I should be able to call it 3 times (and it should return "true" 3 times) for the 3 tokens 'a', 'b', and 'c', even though the ',' itself is only matched twice.
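
The return-value rule can be sketched as follows. This is only an illustration built on std::regex: the signature, the pos cursor and the meaning given to matched (whether a separator was found after the token) are assumptions, not the actual Zorba API. It also assumes the separator regex never matches the empty string.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical signature. Returns true whenever a token is produced,
// including the last token, which is not followed by a separator.
bool next_token(const std::string& s, std::size_t* pos, const std::regex& sep,
                std::string* token, bool* matched) {
  if (*pos > s.size())
    return false;  // input exhausted on an earlier call

  std::smatch m;
  std::string::const_iterator begin = s.begin() + *pos;
  *matched = std::regex_search(begin, s.end(), m, sep);

  if (*matched) {
    // Token is the text before the separator; resume after the separator.
    token->assign(begin, m[0].first);
    *pos = static_cast<std::size_t>(m[0].second - s.begin());
  } else {
    // No separator left: the rest of the input is the last token.
    token->assign(begin, s.end());
    *pos = s.size() + 1;  // mark the input as exhausted
  }
  return true;
}
```

With the input a,b,c and the separator ",", this sketch returns true three times (tokens a, b, c) and false on the fourth call; matched is true for the first two calls only.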

Daniel Turcanu (danielturcanu) wrote :

I debugged regex_ascii; I think it's OK now.
We should merge the branch into the trunk if that's OK with you.

Paul J. Lucas (paul-lucas) wrote :

Does it pass all tests that don't use UTF-8?

Changed in zorba:
status: Fix Committed → Fix Released