You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes#5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.
Changes
ECMAScript: [\b] now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the _L_flags to enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag _L_ident_ECMA implies the same.
ECMAScript: [\z] and [\Z] no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char rangeย #992 in the future.)
ECMAScript: The standard states that a decimal escape starting with digit \0 represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw an error_escape in this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).
basic, grep: The escape sequence \| is no longer interpreted to mean the character |. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write | instead.
The current edition of the POSIX standard makes the meaning of \|implementation-defined (in a not particularly good way, I think): It states that \| could either be interpreted as the character | or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require \| to denote a pipe.
In any case, the interpretation of \| as the character |, which was implemented before this PR, is not the one recommended by the current version of the standard.
Unfortunately, there is no truly suitable _L_flags for this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting _L_paren_bal for basic regular expressions would make the parser accept wrong usages of \} and \) as well.
awk: As required by the POSIX standard (see here), octal escape sequences and all awk-specific character escapes are now recognized within squared character classes (before this PR, only the escapes \a, \b, \f, \n, \r, \t and \v were recognized). On the other hand, unknown escape sequences are now rejected in character classes, because their meaning is ambiguous:
The underlying POSIX extended regular expressions don't treat \ as an escape character in character classes, so [\a] matches the backslash and a. This was how these escape sequences were interpreted before this PR.
The POSIX standard leaves the interpretation of unknown sequences undefined in the awk specification.
Many awk implementations treat such unknown sequences as identity escapes (as std::regex does in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.
\} and \] in extended regular expressions
According to the POSIX standard referenced by the C++ standard, the escape sequence \} is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject \} while \{ is accepted, so I opted to keep supporting the escape sequence \}.
The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?
I am reviewing this now - thanks as always for the extremely detailed PR description with citations, it is super appreciated for these complex changes. I edited it to fix a typo that made a citation harder to understand - please meow if I messed it up instead.
The parser supports the escape sequence \[ but not \]. More recent POSIX versions define \] as well in basic and extended regular expressions. Should we add support for \]?
Yes, I would be in favor of that, in a followup PR - it seems analogous to implementing a C++ Defect Report, and I suspect 99.9% of users would already expect that behavior.
Thanks for this amazing overhaul and extreme attention to detail! ๐ป I pushed one significant change to the product code and a few to the test code - please meow if I messed anything up.
bugSomething isn't workingregexmeow is a substring of homeowner
2 participants
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #5244. There is still one remaining issue related to escape sequences in basic regular expressions. I created #5379 to track this.
Changes
[\b]
now matches "backspace" (see the semantics of the production in the recent ECMAScript standard, unchanged since the 3rd edition). I repurposed and renamed one of the_L_flags
to enable this oddity in the parser for ECMAScript. This flag is only set for ECMAScript and had previously been used to represent that the backslash can escape itself in ECMAScript -- but this is superfluous because the flag_L_ident_ECMA
implies the same.[\z]
and[\Z]
no longer match a custom character class "z" provided by a custom regex traits class. Instead, they will match the characters "z" and "Z", as the escapes also do outside of squared character classes. (This change will also make it easier to solve <regex> mishandles locale-based character classes outside of the char rangeย #992 in the future.)\0
represents the literal NUL. The standard also demands that such an escape must not be followed by another decimal digit (which is still specified this way in recent ECMAScript standards), so we now throw anerror_escape
in this case. As a consequence of this change, a backreference escape can no longer be written with leading zeros (see also the most recent ECMAScript standard).\|
is no longer interpreted to mean the character|
. Instead, it is rejected, since its meaning is undefined according to the POSIX standard version referenced by the C++ standard (Section 9.3.2). Users can write|
instead.\|
implementation-defined (in a not particularly good way, I think): It states that\|
could either be interpreted as the character|
or an actual pipe (Section 9.3.2 in the recent standard). The standard then continues that a future version may require\|
to denote a pipe.In any case, the interpretation of
\|
as the character|
, which was implemented before this PR, is not the one recommended by the current version of the standard.]
is accepted now, as the POSIX standard requires because it does not declare]
to be a special character). (I noticed this issue when I added[\]]
as a test case.)_L_flags
for this, so I check one of the flags that are set for basic regular expressions but not for any of the other supported grammars. Setting_L_paren_bal
for basic regular expressions would make the parser accept wrong usages of\}
and\)
as well.[\a]
matches the backslash and a. This was how these escape sequences were interpreted before this PR.std::regex
does in ECMAScript mode, too), but others stick closer to the underlying extended regular expressions semantics and treat them as backslash + character.\}
and\]
in extended regular expressionsAccording to the POSIX standard referenced by the C++ standard, the escape sequence
\}
is undefined in extended regular expressions. However, more recent POSIX versions do define it and it also feels unbalanced to reject\}
while\{
is accepted, so I opted to keep supporting the escape sequence\}
.\[
but not\]
. More recent POSIX versions define\]
as well in basic and extended regular expressions. Should we add support for\]
?