You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes#994 and solves the single-character case of #5391. Also makes slight progress towards #438.
Collating symbols and equivalences are now parsed following the intended semantics in the standard as much as is possible under the current layout of NFA nodes:
Pass the name of the collating symbol to lookup_collatename() in the traits class to obtain the collating element.
Reject the expression with error_collate if lookup_collatename() returns an empty string (see [re.grammar]/8 and 10). Also reject it with error_space if the returned collating element can't be stored in the NFA node.
If this is an expression starting with [., then we should process the collating element like a normal character. We can easily do so if the collating element is just a single character. However, ranges bounded by such collating elements can't be represented in the NFA currently. For this reason, multi-character collating elements are treated syntactically as a set and immediately added to the NFA by calling _Nfa._Add_coll2(). This means that these collating elements can be individual elements of a character class, but any attempts to use them as range bounds will be rejected by the parser with an error_range.
If this is an expression starting with [=, this denotes an equivalence class. If so, add the collating element to the NFA by calling _Nfa._Add_equiv2(). This function calls transform_primary() and checks if we can obtain a primary sort key for this collating element. If not, it will reject the equivalence with an error_collate (see [re.grammar]/10).
It would be preferable to store the primary sort key rather than the collating element in the NFA in order to avoid the (repeated) computation of the sort key during matching. Unfortunately, though, this choice wasn't made in the past, and while the parser didn't do the name lookup through lookup_collatename(), this feature worked otherwise. So this feature seems to have been usable enough that we shouldn't deliberately break people just for some performance gain.
In contrast, I did make such a change for collating elements: These are now passed through translate() or translate_nocase() before being stored, and the matcher assumes that this translation pass has happened. But the major difference is: This feature is utterly broken before this PR (as documented in <regex>: Regex erroneously returns a matchΒ #994). So I believe this assumption is fine, because even if the new matcher after this PR is picked up with the old parser before this PR, at worst there will just be some patterns for which some inputs are still mistakenly not matched; this is still much better than the utterly broken behavior that the old matcher produces.
In the matcher, the implementation of _Lookup_coll was completely revised as the old one was utterly broken. The new implementation now assumes that the collating elements are sorted descendingly in length. The parser has always sorted them in this way (as it took advantage of the sorting), but nothing in the matcher has made use of it until now.
I can also change the code to remove this assumption if this is your preference.
In addition, I also added code that marks repetitions surrounding some character classes as "not simple". The reason is that these character classes can potentially match character sequences of various lengths, so they can behave like an "if". The "simple loop optimization" doesn't work when the repeated subexpression can branch in the sense that it can match the same input in observably different ways (e.g., substrings of different lengths can be matched or capture groups are located differently). For this reason, repetitions surrounding such character classes should not be marked "simple", and I think this PR is the best opportunity to make this change since it repairs the matching of multi-character collating elements in character classes.
I made a very conservative assessment what character classes should cause a surrounding repetition to be marked as "non-simple"; I included any character classes that I could somehow reasonably imagine to match character sequences of various lengths today or in the future, but this certainly includes too many. I settled on such a conservative choice for two reasons: First, it's trivial to enable the simple loop optimization, but it's basically impossible to take it back later (except if the feature related to this change is utterly broken anyway, like collating symbols before this PR and regex_constants::collate before <regex>: Implement collating rangesΒ #5238). Second, we might change which character classes can potentially branch in the future. For example, let's assume that the collating element "ch" from the Czech alphabet is recognized. Currently, the pattern "[c[.ch.]]h" doesn't match "ch", but there is an argument to be had that it should. So all in all, if we mark too many repetitions surrounding character classes as "simple" now, it might become much more difficult or even impossible to change the semantics of character classes to a more appropriate choice later because we have to maintain backwards compatibility.
Finally, we have to make choice what collating symbol names should and can be recognized by the regex implementation by default (see #5393). Here, I have provisionally made the minimal safe choice: Only single characters are recognized by default, as every character is a collating element by definition.
bugSomething isn't workingregexmeow is a substring of homeowner
2 participants
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #994 and solves the single-character case of #5391. Also makes slight progress towards #438.
Collating symbols and equivalences are now parsed following the intended semantics in the standard as much as is possible under the current layout of NFA nodes:
lookup_collatename()
in the traits class to obtain the collating element.error_collate
iflookup_collatename()
returns an empty string (see [re.grammar]/8 and 10). Also reject it witherror_space
if the returned collating element can't be stored in the NFA node.[.
, then we should process the collating element like a normal character. We can easily do so if the collating element is just a single character. However, ranges bounded by such collating elements can't be represented in the NFA currently. For this reason, multi-character collating elements are treated syntactically as a set and immediately added to the NFA by calling_Nfa._Add_coll2()
. This means that these collating elements can be individual elements of a character class, but any attempts to use them as range bounds will be rejected by the parser with anerror_range
.[=
, this denotes an equivalence class. If so, add the collating element to the NFA by calling_Nfa._Add_equiv2()
. This function callstransform_primary()
and checks if we can obtain a primary sort key for this collating element. If not, it will reject the equivalence with anerror_collate
(see [re.grammar]/10).transform_primary()
inregex_traits
can even return an empty string. But it might do so in the future when LWG-4186regex_traits::transform_primary
mistakenly detectstypeid
of a functionΒ #5291 is addressed.lookup_collatename()
, this feature worked otherwise. So this feature seems to have been usable enough that we shouldn't deliberately break people just for some performance gain.translate()
ortranslate_nocase()
before being stored, and the matcher assumes that this translation pass has happened. But the major difference is: This feature is utterly broken before this PR (as documented in <regex>: Regex erroneously returns a matchΒ #994). So I believe this assumption is fine, because even if the new matcher after this PR is picked up with the old parser before this PR, at worst there will just be some patterns for which some inputs are still mistakenly not matched; this is still much better than the utterly broken behavior that the old matcher produces.In the matcher, the implementation of
_Lookup_coll
was completely revised as the old one was utterly broken. The new implementation now assumes that the collating elements are sorted descendingly in length. The parser has always sorted them in this way (as it took advantage of the sorting), but nothing in the matcher has made use of it until now.In addition, I also added code that marks repetitions surrounding some character classes as "not simple". The reason is that these character classes can potentially match character sequences of various lengths, so they can behave like an "if". The "simple loop optimization" doesn't work when the repeated subexpression can branch in the sense that it can match the same input in observably different ways (e.g., substrings of different lengths can be matched or capture groups are located differently). For this reason, repetitions surrounding such character classes should not be marked "simple", and I think this PR is the best opportunity to make this change since it repairs the matching of multi-character collating elements in character classes.
regex_constants::collate
before<regex>
: Implement collating rangesΒ #5238). Second, we might change which character classes can potentially branch in the future. For example, let's assume that the collating element "ch" from the Czech alphabet is recognized. Currently, the pattern "[c[.ch.]]h" doesn't match "ch", but there is an argument to be had that it should. So all in all, if we mark too many repetitions surrounding character classes as "simple" now, it might become much more difficult or even impossible to change the semantics of character classes to a more appropriate choice later because we have to maintain backwards compatibility.Finally, we have to make choice what collating symbol names should and can be recognized by the regex implementation by default (see #5393). Here, I have provisionally made the minimal safe choice: Only single characters are recognized by default, as every character is a collating element by definition.