Multi-char escapes wrongly forbidden in character class

Bug #1022762 reported by Paul J. Lucas
Affects: Zorba
Status: Fix Released
Importance: High
Assigned to: Paul J. Lucas
Milestone: none

Bug Description

If you have a character range, e.g., A-Z, then the end-point chars of the range may be SingleCharEsc, but not MultiCharEsc. A while ago, a "fix" was made to enforce this, but the "fix" went too far and forbids MultiCharEsc anywhere within a charClassExpr, even outside a range.
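
For reference, the two kinds of escape can be told apart by the character that follows the backslash. The following is a minimal sketch in C++, assuming the XML Schema regex grammar as extended by XQuery (which adds \$ to SingleCharEsc). The names EscKind and classify_escape are made up for illustration and are not taken from the Zorba source.

```cpp
#include <cassert>
#include <string>

// Kinds of backslash escape, named after the grammar productions.
enum class EscKind { single_char, multi_char, category, unknown };

// Classify the character that follows a backslash (hypothetical helper).
EscKind classify_escape( char c ) {
  // SingleCharEsc: \n \r \t \\ \| \. \? \* \+ \( \) \{ \} \- \[ \] \^ \$
  static std::string const single_chars = "nrt\\|.?*+(){}-[]^$";
  // MultiCharEsc: \s \S \i \I \c \C \d \D \w \W
  static std::string const multi_chars = "sSiIcCdDwW";

  if ( single_chars.find( c ) != std::string::npos )
    return EscKind::single_char;
  if ( multi_chars.find( c ) != std::string::npos )
    return EscKind::multi_char;
  if ( c == 'p' || c == 'P' )           // \p{...} and \P{...}
    return EscKind::category;
  return EscKind::unknown;
}

// A few sanity checks on the classification.
void classify_escape_examples() {
  assert( classify_escape( 's' ) == EscKind::multi_char );
  assert( classify_escape( '-' ) == EscKind::single_char );
  assert( classify_escape( 'P' ) == EscKind::category );
}
```

Only the single_char kind denotes exactly one character, which is why only it makes sense as the end-point of a range.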

Changed in zorba:
status: New → In Progress
Paul J. Lucas (paul-lucas) wrote :

Removing the "fix" code results in the regex_err16.xq test failing. That test is:

  fn:matches("a", "[\s-e]")

The charClassExpr is invalid because, in character ranges, only SingleCharEsc are allowed and \s is a MultiCharEsc. ICU doesn't detect this and the test just returns "false."

Adding a proper fix for this would involve adding more state to the regex parser and knowing when we're within a character class *and* within a character range, i.e.:

  if ( in_char_class && c == '-' && prev_c_was_an_esc && !prev_c_was_a_single_char_esc )
    throw an exception

Paul J. Lucas (paul-lucas) wrote :

It's actually harder than that, especially for detecting the opposite case, e.g., "[e-\s]".
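
One way to catch both "[\s-e]" and "[e-\s]" is to split the contents of the character class into atoms first and only then look at what sits on either side of each hyphen. The following C++ sketch does that under simplifying assumptions: it takes only the text between the brackets, and it ignores negation, class subtraction and nested brackets. The names Atom and has_multi_char_esc_in_range are hypothetical and this is not the Zorba implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One atom inside a bracket expression.
struct Atom {
  bool is_single_char;  // denotes exactly one character
  bool is_hyphen;       // an unescaped '-'
};

// Returns true if a range in the class body has a multi-char escape
// (\s, \d, \p{...}, etc.) as either end-point.
bool has_multi_char_esc_in_range( std::string const &body ) {
  static std::string const multi = "sSiIcCdDwWpP";
  std::vector<Atom> atoms;

  for ( std::size_t i = 0; i < body.size(); ++i ) {
    if ( body[i] == '\\' && i + 1 < body.size() ) {
      char const e = body[++i];
      bool const is_multi = multi.find( e ) != std::string::npos;
      if ( (e == 'p' || e == 'P') &&
           i + 1 < body.size() && body[i + 1] == '{' ) {
        // Skip the {...} part of a category escape.
        std::size_t const close = body.find( '}', i );
        if ( close != std::string::npos )
          i = close;
      }
      atoms.push_back( { !is_multi, false } );
    } else {
      atoms.push_back( { true, body[i] == '-' } );
    }
  }

  // A hyphen forms a range only when it has an atom on both sides.
  for ( std::size_t k = 1; k + 1 < atoms.size(); ++k ) {
    if ( atoms[k].is_hyphen &&
         ( !atoms[k - 1].is_single_char || !atoms[k + 1].is_single_char ) )
      return true;
  }
  return false;
}

// The two cases from this report, plus a class that must stay legal.
void range_check_examples() {
  assert( has_multi_char_esc_in_range( "\\s-e" ) );
  assert( has_multi_char_esc_in_range( "e-\\s" ) );
  assert( !has_multi_char_esc_in_range( "a-z\\s" ) );
}
```

Because the check works on whole atoms rather than on the previous character alone, a MultiCharEsc that merely appears in the class, as in "[a-z\s]", is left alone, and an escaped hyphen such as "\-" is never mistaken for a range operator.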

Paul J. Lucas (paul-lucas) wrote :

It's been decided to remove the "fix" and mark regex_err16.xq as an expected failure. A new bug #1023168 has been created.

Changed in zorba:
status: In Progress → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released