Zorba

Need ASCII regex

Bug #867130 reported by Paul J. Lucas on 2011-07-15

Affects  Status        Importance  Assigned to     Milestone
Zorba    Fix Released  Medium      Daniel Turcanu

Bug Description

As discussed on the zorba-coders mailing list, building Zorba without ICU requires an ASCII regular expression library to take the place of ICU. The existing code in zorbatypes/regex_ascii.h/.cpp needs to be wrapped by the existing regex class to provide an alternate implementation when ZORBA_NO_UNICODE=ON, e.g.:

#ifndef ZORBA_NO_UNICODE
// existing regex class
#else
// new regex class backed by regex_ascii
#endif

You probably also need to provide an alternate implementation of convert_xquery_re() (in regex.h/.cpp), which currently converts an XQuery regular expression into an ICU regular expression. If the existing regex_ascii regular expressions exactly match XQuery regular expressions, then the alternate implementation of convert_xquery_re() can simply copy xq_re to lib_re as-is.
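
A minimal sketch of that pass-through case follows. The signature is an assumption made for illustration (std::string in, pointer out); the real convert_xquery_re() in regex.h may use different types. ZORBA_NO_UNICODE is defined inside the block only so that the sketch selects the non-ICU branch on its own; in a real build the build option would set it.

```cpp
#include <string>

// Assumption: defined here only so this sketch compiles the non-ICU branch.
// A real build would set it through the ZORBA_NO_UNICODE=ON option.
#define ZORBA_NO_UNICODE 1

#ifndef ZORBA_NO_UNICODE
// ICU build: the existing implementation translates XQuery syntax to ICU
// syntax. Only a declaration with the assumed signature is shown here.
void convert_xquery_re(const std::string& xq_re, std::string* lib_re);
#else
// Non-ICU build: valid only if regex_ascii accepts XQuery regular
// expression syntax unchanged, so the pattern is copied as-is.
void convert_xquery_re(const std::string& xq_re, std::string* lib_re) {
    *lib_re = xq_re;
}
#endif
```

If regex_ascii turns out to differ from XQuery syntax in any construct, the copy would have to be replaced by a real translation step.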

Daniel Turcanu (danielturcanu) wrote on 2011-07-18:

Done in svn 11287.
The API implemented is approximately the same. Some functions taking "string" parameters were deleted. I also had trouble understanding how next_token is supposed to work; please check it.

I also made some changes and updates to regex_ascii, but have not tested them. I am waiting for ZORBA_NO_UNICODE to compile before I start debugging it.

Paul J. Lucas (paul-lucas) wrote on 2011-07-20:

I have created the no_unicode branch. Check it out using:

svn co https://zorba.svn.sourceforge.net/svnroot/zorba/branches/no_unicode

I have changed unicode_util.h so that unicode::string is defined simply to be zstring when ZORBA_NO_UNICODE=ON. This means you need to implement the complete API, putting back the functions you deleted.

next_token differs from next_match in that next_token uses the regex to specify what SEPARATES the tokens rather than what matches the tokens. For example, if I want to parse the string:

a,b,c

that is, letters separated by commas, I can do it in one of two ways:

1. Using next_match and specifying the regex to be "[a-z]"
2. Using next_token and specifying the regex to be ","

next_token is similar to strtok(3) because in strtok(3), you specify what separates tokens.
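
The two approaches can be compared with a short sketch. It uses the standard library's std::regex rather than Zorba's regex class, and the helper names collect_matches and collect_tokens are hypothetical; it only illustrates the match-versus-separator distinction, not the actual API.

```cpp
#include <regex>
#include <string>
#include <vector>

// next_match view: the regex describes the tokens themselves.
std::vector<std::string> collect_matches(const std::string& input,
                                         const std::regex& re) {
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}

// next_token view: the regex describes what SEPARATES the tokens.
// Submatch index -1 asks the iterator for the text between matches.
std::vector<std::string> collect_tokens(const std::string& input,
                                        const std::regex& re) {
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(input.begin(), input.end(), re, -1), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For the input a,b,c, collect_matches with "[a-z]" and collect_tokens with "," both produce the three strings a, b and c.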

Note: priority reduced and group changed since it's apparently not going into 2.0.

Daniel Turcanu (danielturcanu) wrote on 2011-07-21:

For the next_token() function in the regex API, I don't understand what the boolean values should be.
As I understand it, matched should be true if the regex matches something, and false if not.
And the return value should be true if the token returned is non-empty.
Is this correct?

As for the deleted functions, I don't need to add them back because they overlap with the template ones.

Paul J. Lucas (paul-lucas) wrote on 2011-07-21:

next_token() must return true if it returns a token. Note that it must return true for the special case of the "last" token. Again, if I have:

a,b,c

and I specify a regex of "," and use next_token(), I should be able to call it 3 times (and it should return "true" 3 times) for the 3 tokens 'a', 'b', and 'c', even though the ',' itself is only matched twice.
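
A sketch of that return contract, again built on std::regex rather than regex_ascii. The class name separator_tokenizer and the parameter list (including the optional matched flag) are assumptions for illustration and do not reproduce Zorba's actual next_token() signature.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the regex class; not Zorba's real API.
class separator_tokenizer {
public:
    explicit separator_tokenizer(const std::string& separator_re)
        : re_(separator_re) {}

    // Returns true whenever a token is produced, including the last token,
    // which is followed by no separator. *pos is the resume offset (start
    // at 0). *matched, if given, reports whether the separator was found.
    bool next_token(const std::string& s, std::size_t* pos,
                    std::string* token, bool* matched = nullptr) const {
        bool found = false;
        bool have_token = false;
        if (*pos <= s.size()) {
            std::smatch m;
            const auto start = s.begin() + static_cast<std::ptrdiff_t>(*pos);
            // A zero-length separator match is treated as "no separator".
            found = std::regex_search(start, s.end(), m, re_) && m.length(0) > 0;
            if (found) {
                const std::size_t offset = static_cast<std::size_t>(m.position(0));
                *token = s.substr(*pos, offset);
                *pos += offset + static_cast<std::size_t>(m.length(0));
            } else {
                *token = s.substr(*pos);  // last token: rest of the input
                *pos = s.size() + 1;      // mark the input as exhausted
            }
            have_token = true;
        }
        if (matched) *matched = found;
        return have_token;
    }

private:
    std::regex re_;
};
```

With the input a,b,c and the separator ",", the first three calls return true (with matched true, true, false) and the fourth returns false. The thread does not settle how empty tokens should be handled; this sketch returns them, so an input ending in a separator yields a final empty token.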

Daniel Turcanu (danielturcanu) wrote on 2011-09-13:

I debugged regex_ascii; I think it's OK now.
We should merge the branch into the trunk if that's OK with you.

Paul J. Lucas (paul-lucas) wrote on 2011-09-13:

Does it pass all tests that don't use UTF-8?

Dana Florescu (dflorescu) on 2012-06-13

Changed in zorba:
status:	Fix Committed → Fix Released
