You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes#992 in a binary-compatible way by repurposing some unused NFA node flag bits. Also makes a little bit of progress towards #995.
Even if the solution in this PR is born out of the need to maintain binary compatibility, I think now that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same for std::regex_traits and custom traits classes -- we could save one more bit for std::regex_traits specifically, because we know how the character classes \w, \s and \d intersect).
Alternative solutions
Store a single char_class_type value in the NFA node to represent all specified negated character classes. Sounds like a good idea until you realize that De Morgan's law applies here: For example, we have (not d) or (not w) = not (w and d) for character class [\W\D], so we would have to store the value of w and d. While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9, there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work for std::regex_traits though because we know how \W, \S and \D relate and control the associated char_class_type values.) See [\W\D] fails to match alphabetic characters boostorg/regex#241 and [libc++] <regex>: Character class [\W\D] fails to match alphabetic characters llvm/llvm-project#131516 for bug reports related to this.
Store a char_class_type in the NFA node for each specified negated character class. Works, but it's quite wasteful.
Note that we could choose these solutions as well: We could create a class derived from _Node_class that stores these char_class_type values and repurpose a flag bit to mark that this is the extended version of the _Node_class. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)
We could apply the same solution for #5243 as well, but it would have an unfortunate side effect: If new parser and old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently, [\w\s] matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to #5243 as well and the old matcher were to be picked up, [\w\s] would match spaces but fail to match alphanumeric characters at code points >= 256.
We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.
So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?
Why does this PR also set bits on the root node?
These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization so that these lookups don't have to happen each time an attempt is made to match a squared character class with a negated character class. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future because an efficient fix for #5365 also requires a change to some internal data structures of the matcher.)
Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.
Why did you leave a small gap between old and new node flags?
I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.
So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?
I think we should go with the first solution. Simpler is less risky and more maintainable, and changing buggy behavior in the event of mixing is fine.
I resolved adjacent-edit conflicts in <regex> where #5392 changed:
To _Add_equiv2 and _Add_coll2, while this PR changed the parameters of _Add_named_class.
_Do_ex_class2 logic commented with // process =, while this PR changed the second argument of _Nfa._Add_named_class(_Cls, false); to _Rx_char_class_kind::_Positive.
bugSomething isn't workingregexmeow is a substring of homeowner
2 participants
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #992 in a binary-compatible way by repurposing some unused NFA node flag bits. Also makes a little bit of progress towards #995.
Even if the solution in this PR is born out of the need to maintain binary compatibility, I think now that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same for
std::regex_traits
and custom traits classes -- we could save one more bit forstd::regex_traits
specifically, because we know how the character classes \w, \s and \d intersect).Alternative solutions
char_class_type
value in the NFA node to represent all specified negated character classes. Sounds like a good idea until you realize that De Morgan's law applies here: For example, we have(not d) or (not w) = not (w and d)
for character class[\W\D]
, so we would have to store the value ofw and d
. While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9, there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work forstd::regex_traits
though because we know how \W, \S and \D relate and control the associatedchar_class_type
values.) See[\W\D]
fails to match alphabetic characters boostorg/regex#241 and [libc++]<regex>
: Character class[\W\D]
fails to match alphabetic characters llvm/llvm-project#131516 for bug reports related to this.char_class_type
in the NFA node for each specified negated character class. Works, but it's quite wasteful.Note that we could choose these solutions as well: We could create a class derived from
_Node_class
that stores thesechar_class_type
values and repurpose a flag bit to mark that this is the extended version of the_Node_class
. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)What about #5243?
We could apply the same solution for #5243 as well, but it would have an unfortunate side effect: If new parser and old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently,
[\w\s]
matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to #5243 as well and the old matcher were to be picked up,[\w\s]
would match spaces but fail to match alphanumeric characters at code points >= 256.We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.
So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?
Why does this PR also set bits on the root node?
These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization so that these lookups don't have to happen each time an attempt is made to match a squared character class with a negated character class. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future because an efficient fix for #5365 also requires a change to some internal data structures of the matcher.)
Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.
Why did you leave a small gap between old and new node flags?
I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.