This renames the parser to add a new member variable describing the lexer mode: Default or inside character class. This allows the lexer to correctly process a backslash when parsing a character class/bracket expression.
I also tried to do this without renaming the parser, but this would mean we would have to pass the lexer mode (in or outside a character class) as an argument to all the functions processing escapes in any way, which is a bit of a pain. By renaming the parser, we need the least changes to the logic itself.
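To make the trade-off concrete, here is a minimal sketch of the "mode as a member variable" approach. It is not the actual `<regex>` implementation: `ToyLexer`, `LexMode`, `is_esc`, and `classify` are hypothetical names, and the rule that a backslash is an ordinary character inside a bracket expression is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not the STL's parser.
enum class LexMode : unsigned char { Default, CharClass };

class ToyLexer {
public:
    explicit ToyLexer(std::string pattern) : pat_(std::move(pattern)) {}

    void set_mode(LexMode m) { mode_ = m; }

    // True if the character at pos starts an escape sequence.
    // Assumption of this sketch: inside a bracket expression,
    // a backslash is treated as an ordinary character.
    bool is_esc(std::size_t pos) const {
        return mode_ == LexMode::Default && pos + 1 < pat_.size() && pat_[pos] == '\\';
    }

    // A helper that may sit several calls away from whoever set the mode.
    // It consults the member, so no extra parameter has to be threaded through.
    char classify(std::size_t pos) const { return is_esc(pos) ? 'E' : 'C'; }

private:
    std::string pat_;
    LexMode mode_ = LexMode::Default;
};

inline void toy_lexer_demo() {
    ToyLexer lx("\\a");
    assert(lx.classify(0) == 'E'); // default mode: backslash starts an escape
    lx.set_mode(LexMode::CharClass);
    assert(lx.classify(0) == 'C'); // character-class mode: plain character
}
```

With the alternative design, `is_esc` and every caller of it would need an additional mode parameter, which is the churn the description above wants to avoid.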
Since the parser is renamed, this PR also makes a number of minor cleanups to the parser and the builder (which is renamed as well so that it can be cleaned up).
The PR is split into several commits to simplify reviewing:
Rename _Parser to _Parser2.
Since we have renamed the parser, we can strip any version numbers from member functions.
Clean up the parse flags, which we can do now because there is no longer any chance of mix-and-matching the parser constructor and the parser member function _Compile. Specifically:
_L_brk_bal is assigned its own bit; previously (STL/stl/inc/regex, line 1879 in cbd091e) it was
_L_brk_bal = 0x20000000, // ']' special only after '[' (ERE, BRE); TRANSITION, ABI: same value as _L_brk_rstr
The _L_grp_esc flag is added to the awk flags so that the workaround in _ClassAtom can be removed.
I also extended the _Lang_flags enum and the _L_flags member variable to unsigned long long so that we can add more flags more easily in the future. (This already adds _L_dsh_rstr to signify that the dash - cannot appear as the starting point of a character range in BREs and EREs, but doesn't perform the parser changes to support it yet.)
Remove the unused member _Begin from the parser.
Slightly reorder the parser member variables to reduce padding a bit. (_Char is usually a char or wchar_t, so it [plus the single-byte _Mode member variable added in the last commit] can usually fit into the four bytes the compiler must add after _Mchar.)
Rename _Builder to _Builder2.
Strip version numbers from member functions of the builder.
Remove obsolete members _Bmax and _Tmax from the builder.
Actually fix "<regex>: Backslashes in character classes are sometimes not matched in basic regular expressions" (#5379), essentially by making _Is_esc() always return false when not in default (read: outside-bracketed-character-class) mode. Note that it matters how we change the lexer mode in _Parser2::_Alternative(): _Next() and _Expect() process the first token inside or outside the square brackets, so we must change the mode before calling these functions. The tests check that we didn't get this wrong.
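The ordering constraint in the last commit can be illustrated with a toy parser that keeps one token of lookahead. This is a sketch under stated assumptions, not the STL's code: `ToyParser`, `Mode`, `Tok`, and `first_token_in_class` are hypothetical names, and `next()` merely stands in for the role that `_Next()` and `_Expect()` play in the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names; a stand-in for a parser with one token of lookahead.
enum class Mode : unsigned char { Default, CharClass };
enum class Tok { Char, Escape, LBracket, End };

struct ToyParser {
    std::string pat;
    std::size_t pos = 0;
    Mode mode = Mode::Default;
    Tok tok = Tok::End;

    // Classifies the character at pos under the *current* mode, then advances.
    void next() {
        if (pos >= pat.size()) {
            tok = Tok::End;
            return;
        }
        const char c = pat[pos++];
        if (c == '\\' && mode == Mode::Default) {
            tok = Tok::Escape;
        } else if (c == '[' && mode == Mode::Default) {
            tok = Tok::LBracket;
        } else {
            tok = Tok::Char;
        }
    }
};

// Returns the kind of the first token after a leading '['.
// switch_mode_first selects whether the mode changes before or after advancing.
inline Tok first_token_in_class(const std::string& pattern, bool switch_mode_first) {
    ToyParser p;
    p.pat = pattern;
    p.next(); // fetch the leading '['
    if (p.tok != Tok::LBracket) {
        return Tok::End;
    }
    if (switch_mode_first) {
        p.mode = Mode::CharClass; // mode change comes first, then the token is read
        p.next();
    } else {
        p.next(); // token is read under the old mode: too late to switch afterwards
        p.mode = Mode::CharClass;
    }
    return p.tok;
}

inline void toy_parser_demo() {
    assert(first_token_in_class("[\\]", true) == Tok::Char);    // backslash is literal
    assert(first_token_in_class("[\\]", false) == Tok::Escape); // misclassified
}
```

Because advancing already classifies the first character inside the brackets, switching the mode after the call leaves that one token lexed under the wrong rules, which is the kind of mistake the tests mentioned above are meant to catch.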
Reviewing now, I'll push changes soon. I updated the PR description from saying "_L_paren_bal is assigned its own bit." to say _L_brk_bal instead; please meow if I was somehow confused.
Use muellerj2's superior descriptions of `_L_alt_nl` and `_L_no_nl`.
Note that `_L_no_nl` is (grep, egrep).
Note that `_L_esc_oct` and `_L_esc_ffnx` are (awk).
`_L_esc_ffn` confusingly said "(\[fnrtv])" when other comments like `_L_ident_ERE` mean square brackets literally. Spell out "(\f \n \r \t \v)" for clarity and improved searchability.
Rephrase `_L_ident_awk`'s comment for clarity.
Note that `_L_anch_rstr` is (BRE) only, `_L_paren_bal` is (ERE) only, and `_L_brk_rstr` is (ERE, BRE).
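For readers unfamiliar with this style of flag enum, the following sketch shows the general shape of the cleanup: a 64-bit underlying type, one bit per flag, and per-grammar combinations. The enumerator names, bit positions, and combinations are hypothetical (only loosely modeled on the flags named above) and do not reflect the real values in `stl/inc/regex`.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical flag set; names and bit positions are illustrative only.
enum ToyLangFlags : unsigned long long {
    TL_brk_rstr = 1ULL << 0,  // some restriction on ']' handling
    TL_brk_bal  = 1ULL << 1,  // its own bit, no longer aliasing TL_brk_rstr
    TL_grp_esc  = 1ULL << 2,  // escape handling flag
    TL_dsh_rstr = 1ULL << 32, // a flag that only fits because the type is 64-bit

    // Per-grammar combinations (made up for this sketch).
    TL_toy_bre = TL_brk_rstr | TL_brk_bal | TL_dsh_rstr,
    TL_toy_awk = TL_grp_esc,
};

static_assert(std::is_same<std::underlying_type<ToyLangFlags>::type, unsigned long long>::value,
    "flags are stored in 64 bits");
static_assert((TL_brk_bal & TL_brk_rstr) == 0, "each flag owns a distinct bit");
static_assert(TL_dsh_rstr > 0xFFFFFFFFULL, "new flags can live above bit 31");

// Tests a single flag in a combined flag word.
constexpr bool has_flag(unsigned long long flags, ToyLangFlags f) {
    return (flags & f) != 0;
}

inline void toy_flags_demo() {
    assert(has_flag(TL_toy_bre, TL_dsh_rstr));
    assert(!has_flag(TL_toy_awk, TL_brk_bal));
}
```

Giving every flag its own bit means a check for one flag can never accidentally succeed because a different flag with the same value was set.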
Thanks!! 😻 I pushed some follow-up commits for additional cleanups, please double-check.
I really appreciate the well-structured commit history here; ordinarily I would be nervous about mixing a refactoring and a bugfix but this was entirely reasonable.
Labels: bug (Something isn't working), regex (meow is a substring of homeowner)
2 participants
Fixes #5379.