`<regex>`: Make `wregex` correctly match negated character classes #5403

muellerj2 · 2025-04-12T11:57:37Z

Fixes #992 in a binary-compatible way by repurposing some unused NFA node flag bits. Also makes a little bit of progress towards #995.

Even if the solution in this PR is born out of the need to maintain binary compatibility, I think now that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same for std::regex_traits and custom traits classes -- we could save one more bit for std::regex_traits specifically, because we know how the character classes \w, \s and \d intersect).
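To make the flag-bit idea concrete, here is a minimal sketch under stated assumptions; it is not the code in `stl/inc/regex`. The names `neg_flags`, `neg_w`, `neg_s`, `neg_d` and `matches_negated_escapes` are hypothetical and the bit values are arbitrary. Only `std::regex_traits::lookup_classname` and `isctype` are real standard-library calls. Each negated escape gets its own bit, and each flagged class is tested separately, so no intersection of `char_class_type` values is ever needed.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical flag bits; the real node flags in <regex> have other names and values.
enum neg_flags : unsigned {
    neg_none = 0,
    neg_w    = 1u << 0, // the class contains \W
    neg_s    = 1u << 1, // the class contains \S
    neg_d    = 1u << 2, // the class contains \D
};

using traits_t = std::regex_traits<wchar_t>;

// True if `ch` is matched by at least one negated escape recorded in `flags`.
// Note that this sketch looks the classes up on every call.
bool matches_negated_escapes(const traits_t& traits, unsigned flags, wchar_t ch) {
    assert((flags & ~(neg_w | neg_s | neg_d)) == 0); // only the three sketch flags exist

    const auto not_in = [&](const wchar_t* name) {
        const std::wstring s(name);
        return !traits.isctype(ch, traits.lookup_classname(s.begin(), s.end()));
    };

    return ((flags & neg_w) != 0 && not_in(L"w"))
        || ((flags & neg_s) != 0 && not_in(L"s"))
        || ((flags & neg_d) != 0 && not_in(L"d"));
}
```

With `flags == (neg_w | neg_d)`, which models `[\W\D]`, a letter is accepted because it is not a digit, while a digit is rejected because it belongs to both `\w` and `\d`.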

Alternative solutions

1. Store a single char_class_type value in the NFA node to represent all specified negated character classes. This sounds like a good idea until you realize that De Morgan's law applies here: for example, we have (not d) or (not w) = not (w and d) for character class [\W\D], so we would have to store the value of (w and d). While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9), there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work for std::regex_traits, though, because we know how \W, \S and \D relate and control the associated char_class_type values.) For related bug reports, see boostorg/regex#241 ("[\W\D] fails to match alphabetic characters") and llvm/llvm-project#131516 ("[libc++] <regex>: Character class [\W\D] fails to match alphabetic characters").
2. Store a char_class_type in the NFA node for each specified negated character class. This works, but it's quite wasteful.
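The De Morgan pitfall in the first alternative can be shown with a small sketch. The helper names `class_mask`, `matches_each_negated` and `matches_negated_union` are hypothetical; the traits calls are the standard `lookup_classname` and `isctype`. The sketch contrasts testing each negated class on its own with negating the or'ed mask, which is the only combination [re.grammar]/9 guarantees to be meaningful.

```cpp
#include <cassert>
#include <regex>
#include <string>

using traits_t = std::regex_traits<wchar_t>;

// Hypothetical helper: look up the class mask for a name such as L"w" or L"d".
traits_t::char_class_type class_mask(const traits_t& traits, const std::wstring& name) {
    assert(!name.empty()); // an empty name never denotes a character class
    return traits.lookup_classname(name.begin(), name.end());
}

// Intended meaning of [\W\D]: the character is a non-word character OR a non-digit.
bool matches_each_negated(const traits_t& traits, wchar_t ch) {
    return !traits.isctype(ch, class_mask(traits, L"w"))
        || !traits.isctype(ch, class_mask(traits, L"d"));
}

// Tempting but wrong: negate the union of both masks.
// This computes not (w or d), whereas [\W\D] means not (w and d).
bool matches_negated_union(const traits_t& traits, wchar_t ch) {
    const traits_t::char_class_type both = class_mask(traits, L"w") | class_mask(traits, L"d");
    return !traits.isctype(ch, both);
}
```

For a letter, `matches_each_negated` accepts (a letter is not a digit), while `matches_negated_union` rejects, which is the same symptom as in the linked Boost and libc++ reports.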

Note that we could choose these solutions as well: We could create a class derived from _Node_class that stores these char_class_type values and repurpose a flag bit to mark that this is the extended version of the _Node_class. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)

What about #5243?

We could apply the same solution for #5243 as well, but it would have an unfortunate side effect: If new parser and old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently, [\w\s] matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to #5243 as well and the old matcher were to be picked up, [\w\s] would match spaces but fail to match alphanumeric characters at code points >= 256.

We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.

So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

Why does this PR also set bits on the root node?

These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization, so that the lookups don't have to happen every time the matcher tries to match a bracketed character class that contains a negated character class escape. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future because an efficient fix for #5365 also requires a change to some internal data structures of the matcher.)

Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.
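A rough sketch of how the root-node bits could be used later, assuming a hypothetical `neg_class_cache` that is filled once when the matcher is set up. None of these names or bit values come from `<regex>`; they only illustrate the idea of hoisting the lookups out of the per-character path.

```cpp
#include <cassert>
#include <regex>
#include <string>

using traits_t = std::regex_traits<wchar_t>;
using mask_t   = traits_t::char_class_type;

// Hypothetical flag bits shared by the root node and the class nodes.
enum neg_flags : unsigned { neg_w = 1u << 0, neg_s = 1u << 1, neg_d = 1u << 2 };

// Hypothetical cache: looks up only the classes announced on the root node, once.
struct neg_class_cache {
    unsigned known = 0; // copy of the root node's negated-class flags
    mask_t w{};
    mask_t s{};
    mask_t d{};

    neg_class_cache(const traits_t& traits, unsigned root_flags) : known(root_flags) {
        const auto lookup = [&](const wchar_t* name) {
            const std::wstring n(name);
            return traits.lookup_classname(n.begin(), n.end());
        };
        if (root_flags & neg_w) w = lookup(L"w");
        if (root_flags & neg_s) s = lookup(L"s");
        if (root_flags & neg_d) d = lookup(L"d");
    }
};

// Per-character check for one class node; no lookups happen here.
bool matches_negated_escapes(
    const traits_t& traits, const neg_class_cache& cache, unsigned node_flags, wchar_t ch) {
    // Invariant described above: flags set on a class node are also set on the root node.
    assert((node_flags & ~cache.known) == 0);
    return ((node_flags & neg_w) != 0 && !traits.isctype(ch, cache.w))
        || ((node_flags & neg_s) != 0 && !traits.isctype(ch, cache.s))
        || ((node_flags & neg_d) != 0 && !traits.isctype(ch, cache.d));
}
```

Because the parser already sets the bits on the root node, a matcher written this way never has to handle a class node whose flags are missing from the cache.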

Why did you leave a small gap between old and new node flags?

I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.

stl/inc/regex

StephanTLavavej · 2025-04-18T20:25:44Z

> So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?

I think we should go with the first solution. Simpler is less risky and more maintainable, and changing buggy behavior in the event of mixing is fine.

StephanTLavavej · 2025-04-19T02:35:21Z

Thanks, this is great! 😻 I pushed tiny changes and expanded the test coverage slightly.
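For readers who want to try the patterns named in the added coverage, a standalone check could look roughly like this. It is a sketch and not the code in `tests/std/tests/VSO_0000000_regex_use/test.cpp`; the helper names `full_match` and `check_negated_class_coverage` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if all of `text` matches `pattern` (ECMAScript grammar, the std::wregex default).
bool full_match(const std::wstring& pattern, const std::wstring& text) {
    return std::regex_match(text, std::wregex(pattern));
}

// A few expectations that follow from the meaning of the escapes.
void check_negated_class_coverage() {
    // [^\W] is a double negation, so it behaves like \w.
    assert(full_match(L"[^\\W]", L"a"));
    assert(!full_match(L"[^\\W]", L" "));

    // A class combined with its own negation matches any single character,
    // including one at a code point >= 256 (here U+20AC).
    assert(full_match(L"[\\w\\W]", L"a"));
    assert(full_match(L"[\\s\\S]", L" "));
    assert(full_match(L"[\\d\\D]", L"\x20ac"));
}
```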

StephanTLavavej · 2025-04-22T10:14:22Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-04-22T21:16:11Z

I resolved adjacent-edit conflicts in <regex> where #5392 changed:

- To _Add_equiv2 and _Add_coll2, while this PR changed the parameters of _Add_named_class.
- _Do_ex_class2 logic commented with // process =, while this PR changed the second argument of _Nfa._Add_named_class(_Cls, false); to _Rx_char_class_kind::_Positive.

StephanTLavavej · 2025-04-22T22:15:53Z

Thanks for figuring out how to fix the unfixable! 🧠 💡 😻

<regex>: Make wregex correctly match negated character classes

1a714f5

muellerj2 requested a review from a team as a code owner April 12, 2025 11:57

github-project-automation bot added this to STL Code Reviews Apr 12, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 12, 2025

muellerj2 commented Apr 13, 2025

StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels Apr 14, 2025

StephanTLavavej self-assigned this Apr 14, 2025

StephanTLavavej reviewed Apr 18, 2025

StephanTLavavej added 3 commits April 18, 2025 13:31

_Lookup_char_class can be a const member function.

63981c7

Properly name neg_w_regex_skip.

ea15dec

Add test coverage for [^\W], [^\S], [^\D], [\w\W], [\s\S], …

8234a5c

…`[\d\D]`.

StephanTLavavej reviewed Apr 19, 2025

StephanTLavavej approved these changes Apr 19, 2025

StephanTLavavej removed their assignment Apr 19, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 19, 2025

muellerj2 mentioned this pull request Apr 20, 2025

<regex>: basic_regex wants regex_traits to provide things not required by [re.req] #995

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025

Merge branch 'main' into regex-final-negated-char-class-fix

c2f290f

StephanTLavavej merged commit c0f5f35 into microsoft:main Apr 22, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025

muellerj2 mentioned this pull request May 10, 2025

<regex>: Cache bitmasks of negated character classes during matching #5487

Merged

muellerj2 mentioned this pull request May 18, 2025

<regex>: Restrict control letters in escapes to alphabetic ASCII characters #5524

Merged

muellerj2 deleted the regex-final-negated-char-class-fix branch May 31, 2025 21:33
