`<regex>`: Revise parsing of escape sequences #5380

muellerj2 · 2025-03-30T11:33:33Z

Fixes #5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.

Changes

ECMAScript: [\b] now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the _L_flags to enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag _L_ident_ECMA implies the same.
ECMAScript: [\z] and [\Z] no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char range #992 in the future.)
ECMAScript: The standard states that a decimal escape starting with digit \0 represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw an error_escape in this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).
basic, grep: The escape sequence \| is no longer interpreted to mean the character |. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write | instead.
- The current edition of the POSIX standard makes the meaning of \|implementation-defined (in a not particularly good way, I think): It states that \| could either be interpreted as the character | or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require \| to denote a pipe.
  In any case, the interpretation of \| as the character |, which was implemented before this PR, is not the one recommended by the current version of the standard.
basic, grep: The regular expression ] is accepted now, as the POSIX standard requires because it does not declare ] to be a special character). (I noticed this issue when I added [\]] as a test case.)
- Unfortunately, there is no truly suitable _L_flags for this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting _L_paren_bal for basic regular expressions would make the parser accept wrong usages of \} and \) as well.
awk: As required by the POSIX standard (see here), octal escape sequences and all awk-specific character escapes are now recognized within squared character classes (before this PR, only the escapes \a, \b, \f, \n, \r, \t and \v were recognized). On the other hand, unknown escape sequences are now rejected in character classes, because their meaning is ambiguous:
- The underlying POSIX extended regular expressions don't treat \ as an escape character in character classes, so [\a] matches the backslash and a. This was how these escape sequences were interpreted before this PR.
- The POSIX standard leaves the interpretation of unknown sequences undefined in the awk specification.
- Many awk implementations treat such unknown sequences as identity escapes (as std::regex does in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.

`\}` and `\]` in extended regular expressions

According to the POSIX standard referenced by the C++ standard, the escape sequence \} is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject \} while \{ is accepted, so I opted to keep supporting the escape sequence \}.

The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?

stl/inc/regex

StephanTLavavej · 2025-04-09T07:18:07Z

I am reviewing this now - thanks as always for the extremely detailed PR description with citations, it is super appreciated for these complex changes. I edited it to fix a typo that made a citation harder to understand - please meow if I messed it up instead.

The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?

Yes, I would be in favor of that, in a followup PR - it seems analogous to implementing a C++ Defect Report, and I suspect 99.9% of users would already expect that behavior.

stl/inc/regex

tests/tr1/tests/regex2/test.cpp

tests/std/tests/GH_005244_regex_escape_sequences/test.cpp

StephanTLavavej · 2025-04-09T11:22:06Z

Thanks for this amazing overhaul and extreme attention to detail! 😻 I pushed one significant change to the product code and a few to the test code - please meow if I messed anything up.

StephanTLavavej · 2025-04-09T12:09:37Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-04-10T06:51:13Z

Resolved adjacent-edit conflicts with #5364 in <regex> where:

it changed to _ClassAtom(bool) and this PR changed to _ClassEscape3()
it changed to _ClassAtom(const bool _Initial) and this PR expanded "check for valid escape sequence"

StephanTLavavej · 2025-04-10T07:45:39Z

⬅️ 🏃 🎉

<regex>: Revise parsing of escape sequences

a192d76

muellerj2 requested a review from a team as a code owner March 30, 2025 11:33

github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 30, 2025

github-project-automation bot added this to STL Code Reviews Mar 30, 2025

add two missing member functions to custom regex traits in test

a49844c

muellerj2 commented Mar 30, 2025

View reviewed changes

stl/inc/regex Outdated Show resolved Hide resolved

Avoid parsing first decimal digit twice in _Parser::_AtomEscape()

e51ea02

StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Mar 31, 2025

StephanTLavavej self-assigned this Mar 31, 2025

StephanTLavavej added 11 commits April 9, 2025 00:50

Style nitpicks.

c6508bb

Introduce _L_brk_bal for clarity.

bd8c667

Expand test coverage for "[\\]]" to all grammars.

d14c10a

Use std::string UDL concatenation to avoid suppressing clang-format.

7fb589b

Include more headers.

ee7e9ae

Capture the bug title in a comment.

c2b5659

Take const string&.

0227314

Test spaces matched against \d and \D.

ba2b655

Sort awk-only escapes.

e57c401

Fix: "\0" => "\0"s

ad6f36f

Don't check for awk in test_gh_5244_classescape_posix_not_awk.

fe21384

StephanTLavavej reviewed Apr 9, 2025

View reviewed changes

StephanTLavavej approved these changes Apr 9, 2025

View reviewed changes

StephanTLavavej removed their assignment Apr 9, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 9, 2025

StephanTLavavej mentioned this pull request Apr 9, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 9, 2025

StephanTLavavej self-assigned this Apr 9, 2025

Alcaro mentioned this pull request Apr 9, 2025

[libc++] <regex>: \00 should be rejected llvm/llvm-project#135048

Open

Merge branch 'main' into regex-revise-escape-sequences

0cdf367

StephanTLavavej approved these changes Apr 10, 2025

View reviewed changes

StephanTLavavej merged commit 0e63fc6 into microsoft:main Apr 10, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 10, 2025

muellerj2 mentioned this pull request Apr 10, 2025

<regex>: Permit escaping of closing brackets in POSIX regexes #5399

Merged

muellerj2 deleted the regex-revise-escape-sequences branch April 13, 2025 13:51

StephanTLavavej mentioned this pull request Apr 18, 2025

<regex>: Properly parse and match collating symbols and equivalences #5392

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`<regex>`: Revise parsing of escape sequences #5380

`<regex>`: Revise parsing of escape sequences #5380

Uh oh!

muellerj2 commented Mar 30, 2025 •

edited by StephanTLavavej

Loading

Uh oh!

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

StephanTLavavej commented Apr 10, 2025

Uh oh!

Uh oh!

StephanTLavavej commented Apr 10, 2025

Uh oh!

Uh oh!

<regex>: Revise parsing of escape sequences #5380

<regex>: Revise parsing of escape sequences #5380

Uh oh!

Conversation

muellerj2 commented Mar 30, 2025 • edited by StephanTLavavej Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

\} and \] in extended regular expressions

Uh oh!

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

StephanTLavavej commented Apr 9, 2025

Uh oh!

StephanTLavavej commented Apr 10, 2025

Uh oh!

Uh oh!

StephanTLavavej commented Apr 10, 2025

⬅️ 🏃 🎉

Uh oh!

Uh oh!

`<regex>`: Revise parsing of escape sequences #5380

`<regex>`: Revise parsing of escape sequences #5380

muellerj2 commented Mar 30, 2025 •

edited by StephanTLavavej

Loading

`\}` and `\]` in extended regular expressions