You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Towards #995. A second PR in the future will remove _Uelem from the matcher.
Requirements on the character type _Elem in the standard
Effectively, the standard currently spells out the following guarantees for the character type:
The regex traits class can handle the character type for all the operations in [re.traits].
Indirectly, through the reference to string_type as a basic_string in the regex traits class requirements, the traits_type of the string_type (which should be char_traits<_Elem>) must support all operations in [char.traits.require].
According to [strings.general]/1 referenced by [re.general]/2, _Elem must be a non-array trivially copyable standard-layout and trivially default-constructible type.
Alas, this is woefully underspecified because regex has to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement that regex must be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement that regex must be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.
One idea out might be to rely on the existing int_type in the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters + eof() (see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement that int_type is an integer type -- there is only a guarantee that it can be used through the API of the character traits class. So int_type does not seem helpful, rather, relying on it would just open its own can of worms.
Requirements on _Elem in the regex implementations before and after this PR
The following fundamental requirements on _Elem remain unchanged by this PR:
All integral conversions must be consistent with some implicit one-to-one-mapping between unsigned integers (code points) and characters (up to lossiness and sign extension), and the natural ordering of these unsigned integers must be consistent with the lt() and eq() functions in the character traits class.
Lossy casts to and from integers are allowed and behave as one would expect for casts between integers.
There must not be any gaps between code points, i.e., if a character has code point c, then 0, .., c-1 must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting).
The code points for the basic character set (to the degree they have special meaning in regexes) plus the paragraph and line separator must agree with the corresponding Unicode code points. For the paragraph and line separator, this must only be the case if the _Elem type is large enough to represent these code points.
_Elem{} must represent NUL (with code point 0).
Requirements on _Elem in the regex implementation before this PR
The current implementation makes at least the following additional assumptions on convertibility and comparability:
_Elem must be equality-comparable to itself, char, int and the internal enum type _Meta_type (maybe including some implicit conversion).
_Elem type must be implicitly convertible to int and _Meta_type must be implicitly convertible to _Elem.
char, int and unsigned int must be explicitly convertible to _Elem.
_Elem must be explicitly convertible to _Meta_type.
The code point for any character can fit into an int, i.e., conversion _Elem -> int -> _Elem must yield the original character again.
The character traits must provide an unsigned integral type _Uelem such that explicit conversion to this type yields the code point for any character, i.e., conversion _Elem -> _Uelem -> _Elem produces the original character again and the natural ordering of the _Uelem values after conversion must be consistent with the lt() and eq() functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)
make_unsigned<_Elem> must be well-defined. (This mostly defeats the purpose of _Uelem, because specializing make_unsigned<_Elem> for user-defined types is forbidden.)
This list might be non-exhaustive.
Requirements on _Elem in the regex implementation after this PR
This PR imposes the following requirements on comparisons and conversions:
_Elem must be equality-comparable to itself.
_Elem must be explicitly convertible to unsigned char and unsigned int.
If _Elem is an integral or enum type, it must also be explicitly convertible to and from make_unsigned<_Elem>. Explicit conversion to this character type thus yields the code points of the characters.
Otherwise, conversion to unsigned int must yield the unsigned code point for a character, if the code point can fit into an unsigned int.
This means that _Elem must behave like an unsigned integer type when converted to unsigned int.
char, unsigned char and unsigned int must be explicitly convertible to _Elem.
This list should be exhaustive; the new test checks that we don't do any conversions not listed above.
We cannot just drop most of the special logic for signed integral and enum types because we must support char.
But if desired, we could drop explicit convertibility of _Elem from and to char and unsigned char in favor of unsigned int (and make_unsigned<_Elem> for integral or enum _Elem) only. But this will mean we will have to add even more casts in <regex>.
Differences in requirements
Essentially, we have the following new requirements:
_Elem must be explicitly convertible to unsigned int.
_Elem must support explicit conversion to and from unsigned char.
_Elem must convert like an unsigned integer type when explicitly converted to unsigned int, if _Elem is not an integral or enum type. (This requirement is only kind of new: Before this PR, regex didn't even compile for such types, except when entering UB territory by specializing make_unsigned.)
In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:
No more equality-comparability between _Elem and other integral types.
_Elem does not have to be convertible to and from int and _Meta_type.
No more implicit conversions from and to _Elem.
The code points for _Elem no longer have to fit into an int.
No more _Uelem in the regex traits class.
No more reliance on make_unsigned<_Elem> for types that are not integral or enum.
Changes
Add new _Unescaped_char member to _Parser2 to represent the character represented by some kind of escape sequence.
This allows us to drop the "must fit into an int" requirement, because we no longer have to represent such characters in the _Val member.
Lots of static_cast's added to avoid implicit conversions.
For many _Meta_type values, we first have to cast them to char before casting them to _Elem.
In two cases, I avoided casts from _Meta_type values by temporarily saving the character from the input character sequence.
To decide whether a character can be stored in the bitmap, the test checks now whether a character code point fits into an unsigned char.
Such tests are done by roundtrip casting: _Elem -> unsigned char -> _Elem yields the original character again if and only if the code point is less than 256.
Only call strchr() in _Parser2::_Trans() when the character code point fits into an unsigned char.
It's actually surprising that this logic worked for wregex (probably by accident). It uses the fact that casting to _Meta_type for non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into an int" requirement.
Replace some ordering checks by calls to the lt() function of the character traits class.
This lets us avoid some casting and case distinction between integral/enum types and other types. We can also avoid the addition of a requirement that the character code points must fit into an unsigned int or similar.
Rework the logic for the bitmap optimization and small character range optimization for the new requirements.
The code for the small character range optimization is no longer generated if sizeof(_Elem) = 1U: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.
Checks whether character code points are too large (to store the character in the bitmap or the code point in unsigned int) are done using roundtrip casting.
We have to differentiate between integral/enum types and other user-defined types for the small character range optimization, as we have to cast the characters to different unsigned integer types.
When _Elem is not an integral or enum type, the small character range optimization is not performed for code points unrepresentable by unsigned int.
Store the named character classes and equivalence classes if sizeof(_Elem) > 1U.
This is sufficient (due to the requirement that code points must have no gaps), but not necessary: We could have a character type with sizeof(_Elem) > 1U with maximal code point < 256. A more accurate check appears much more ugly to me, though, for little gain.
Revise logic to cast integer value (from escape sequence) to an _Elem in _CharacterEscape():
Again, we have to differentiate between integral/enum types and other user-defined types to perform the correct casts.
We check whether the integer value corresponds to a code point by performing roundtrip and testing whether it yields the same integer value again.
No more switch on _Elem. All switches are now performed on unsigned char (after checking that the code point is less than 256 while handling the case for code points >= 256 separately) or int (when it is on _Mchar). In the one switch on _Mchar, the value is cast to int first to prevent a warning that there aren't case labels for some of the enum identifiers.
Test
For now, the test only checks that the parser compiles and doesn't crash on two user-defines character types. We can only check semantic correctness after adjusting the matcher similarly.
The test currently suppresses warning C6510 because #5563 hasn't been merged yet.
Thanks as always for the extremely detailed writeup and careful changes! π» I pushed a conflict-free merge with main, followed by minor nitpicks, a theoretical correctness fix for the test's custom char traits move, and a fix for what I believe was a copy-paste typoed function call - please double-check.
Ah, the test is trying to exert lots of different paths in the parser with custom types to make sure that the parser doesn't unexpectedly crash. The regex in the test includes an equivalence, but I remember now that we had to disable regex_traits::transform_primary for /clr:pure because of its reliance on RTTI, so parsing the regex fails for /clr:pure.
If you think it's worth it, I will create a follow-up PR that replaces transform_primary in the custom traits class by some dummy implementation that doesn't rely on regex_traits::transform_primary.
Thanks for the explanation. I don't think it's worth the effort - if the impact is limited to the test alone, I don't care about /clr:pure coverage at all. (We're keeping that mode in life support for one legacy project.)
bugSomething isn't workingregexmeow is a substring of homeowner
2 participants
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Towards #995. A second PR in the future will remove
_Uelem
from the matcher.Requirements on the character type
_Elem
in the standardEffectively, the standard currently spells out the following guarantees for the character type:
string_type
as abasic_string
in the regex traits class requirements, thetraits_type
of thestring_type
(which should bechar_traits<_Elem>
) must support all operations in [char.traits.require]._Elem
must be a non-array trivially copyable standard-layout and trivially default-constructible type.Alas, this is woefully underspecified because
regex
has to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement thatregex
must be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement thatregex
must be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.One idea out might be to rely on the existing
int_type
in the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters +eof()
(see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement thatint_type
is an integer type -- there is only a guarantee that it can be used through the API of the character traits class. Soint_type
does not seem helpful, rather, relying on it would just open its own can of worms.Requirements on
_Elem
in theregex
implementations before and after this PRThe following fundamental requirements on
_Elem
remain unchanged by this PR:lt()
andeq()
functions in the character traits class.c
, then0, .., c-1
must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting)._Elem
type is large enough to represent these code points._Elem{}
must represent NUL (with code point 0).Requirements on
_Elem
in theregex
implementation before this PRThe current implementation makes at least the following additional assumptions on convertibility and comparability:
_Elem
must be equality-comparable to itself,char
,int
and the internal enum type_Meta_type
(maybe including some implicit conversion)._Elem
type must be implicitly convertible toint
and_Meta_type
must be implicitly convertible to_Elem
.char
,int
andunsigned int
must be explicitly convertible to_Elem
._Elem
must be explicitly convertible to_Meta_type
.int
, i.e., conversion_Elem
->int
->_Elem
must yield the original character again._Uelem
such that explicit conversion to this type yields the code point for any character, i.e., conversion_Elem
->_Uelem
->_Elem
produces the original character again and the natural ordering of the_Uelem
values after conversion must be consistent with thelt()
andeq()
functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)make_unsigned<_Elem>
must be well-defined. (This mostly defeats the purpose of_Uelem
, because specializingmake_unsigned<_Elem>
for user-defined types is forbidden.)This list might be non-exhaustive.
Requirements on
_Elem
in theregex
implementation after this PRThis PR imposes the following requirements on comparisons and conversions:
_Elem
must be equality-comparable to itself._Elem
must be explicitly convertible tounsigned char
andunsigned int
._Elem
is an integral or enum type, it must also be explicitly convertible to and frommake_unsigned<_Elem>
. Explicit conversion to this character type thus yields the code points of the characters.unsigned int
must yield the unsigned code point for a character, if the code point can fit into anunsigned int
._Elem
must behave like an unsigned integer type when converted tounsigned int
.char
,unsigned char
andunsigned int
must be explicitly convertible to_Elem
.This list should be exhaustive; the new test checks that we don't do any conversions not listed above.
We cannot just drop most of the special logic for signed integral and enum types because we must support
char
.But if desired, we could drop explicit convertibility of
_Elem
from and tochar
andunsigned char
in favor ofunsigned int
(andmake_unsigned<_Elem>
for integral or enum_Elem
) only. But this will mean we will have to add even more casts in<regex>
.Differences in requirements
Essentially, we have the following new requirements:
_Elem
must be explicitly convertible tounsigned int
._Elem
must support explicit conversion to and fromunsigned char
._Elem
must convert like an unsigned integer type when explicitly converted tounsigned int
, if_Elem
is not an integral or enum type. (This requirement is only kind of new: Before this PR,regex
didn't even compile for such types, except when entering UB territory by specializingmake_unsigned
.)In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:
_Elem
and other integral types._Elem
does not have to be convertible to and fromint
and_Meta_type
._Elem
._Elem
no longer have to fit into anint
._Uelem
in the regex traits class.make_unsigned<_Elem>
for types that are not integral or enum.Changes
_Unescaped_char
member to_Parser2
to represent the character represented by some kind of escape sequence.int
" requirement, because we no longer have to represent such characters in the_Val
member.static_cast
's added to avoid implicit conversions._Meta_type
values, we first have to cast them tochar
before casting them to_Elem
._Meta_type
values by temporarily saving the character from the input character sequence.unsigned char
._Elem
->unsigned char
->_Elem
yields the original character again if and only if the code point is less than 256.strchr()
in_Parser2::_Trans()
when the character code point fits into anunsigned char
.wregex
(probably by accident). It uses the fact that casting to_Meta_type
for non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into anint
" requirement.lt()
function of the character traits class.unsigned int
or similar.sizeof(_Elem) = 1U
: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.unsigned int
) are done using roundtrip casting._Elem
is not an integral or enum type, the small character range optimization is not performed for code points unrepresentable byunsigned int
.sizeof(_Elem) > 1U
.sizeof(_Elem) > 1U
with maximal code point < 256. A more accurate check appears much more ugly to me, though, for little gain._Elem
in_CharacterEscape()
:switch
on_Elem
. All switches are now performed onunsigned char
(after checking that the code point is less than 256 while handling the case for code points >= 256 separately) orint
(when it is on_Mchar
). In the one switch on_Mchar
, the value is cast toint
first to prevent a warning that there aren't case labels for some of the enum identifiers.Test
For now, the test only checks that the parser compiles and doesn't crash on two user-defines character types. We can only check semantic correctness after adjusting the matcher similarly.
The test currently suppresses warning C6510 because #5563 hasn't been merged yet.