`<regex>`: Fix character range bounds in case-insensitive regexes #5164

muellerj2 · 2024-12-05T12:40:39Z

This PR deals with three related problems for character ranges in case-insensitive mode:

1. _Builder::_Add_range() casts the character bounds to unsigned int. As a consequence, characters with negative numeric values are not added to the bitmap, but rather to the _Large list of characters. This means that these characters are not found during matching. (Note the suspiciously different casts in the else branch for the case-sensitive case.)
2. The parser fails to reject some empty ranges in case-insensitive mode, such as [Z-a] (= [z-a]).
3. When both the collate and icase flags are set, the bounds are unnecessarily translated by _Traits.translate() before being passed to _Traits.translate_nocase(). The Standard says in [re.grammar]/14.1 and 14.2 that it is sufficient to call translate_nocase() only. (See also _Builder::_Add_char(), which already follows the Standard in this regard.)
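
To illustrate the first problem, here is a minimal sketch of how a direct cast to unsigned int differs from a cast through the unsigned counterpart of the character type. The names `kBitmapSize`, `direct_cast` and `bitmap_index` are hypothetical and chosen for illustration only; this is not the actual code in `<regex>`.

```cpp
#include <cassert>      // assert, for runtime checks like the one in the note below
#include <climits>      // UINT_MAX
#include <type_traits>  // std::make_unsigned_t

// Hypothetical bitmap size, chosen for illustration only.
constexpr unsigned int kBitmapSize = 256;

// Problematic pattern: converting a negative character straight to
// unsigned int wraps around, so the result is far above kBitmapSize.
template <class Elem>
constexpr unsigned int direct_cast(Elem ch) {
    return static_cast<unsigned int>(ch);
}

// Safer pattern: convert to the unsigned counterpart of the character type
// first, which keeps the value within that type's range.
template <class Elem>
constexpr unsigned int bitmap_index(Elem ch) {
    using Uelem = std::make_unsigned_t<Elem>;
    return static_cast<unsigned int>(static_cast<Uelem>(ch));
}

static_assert(direct_cast(static_cast<signed char>(-32)) >= kBitmapSize, "lands outside the bitmap");
static_assert(bitmap_index(static_cast<signed char>(-32)) == 224, "lands inside the bitmap");
```

A runtime check such as `assert(bitmap_index(static_cast<signed char>(-1)) == 255u);` holds for the same reason. The sketch uses signed char so that the result does not depend on whether plain char is signed on a given platform.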

The PR moves the entire character translation into the parser so that empty ranges can be reliably diagnosed there in case-insensitive mode as well. It also fixes the unsigned cast and removes the unnecessary translate() call.
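
As a rough sketch of why translating the bounds before the emptiness check matters, the following helper folds both bounds with std::regex_traits<char>::translate_nocase() and only then compares them. The function name `is_empty_icase_range` is hypothetical, the comparison through unsigned char is an assumption made for this sketch, and the default "C" locale is assumed; the actual parser code differs.

```cpp
#include <cassert>  // assert, for runtime checks like the one in the note below
#include <regex>    // std::regex_traits

// Hypothetical helper: reports whether the range [first-last] is empty once
// both bounds have been case-folded the way case-insensitive mode requires.
inline bool is_empty_icase_range(char first, char last) {
    std::regex_traits<char> traits;
    // Fold the bounds first; translate_nocase() alone is used here,
    // without a preceding translate() call.
    const auto lo = static_cast<unsigned char>(traits.translate_nocase(first));
    const auto hi = static_cast<unsigned char>(traits.translate_nocase(last));
    // Compare as unsigned values so characters with negative numeric
    // values are ordered consistently.
    return hi < lo;
}
```

With this helper, `assert(is_empty_icase_range('Z', 'a'));` holds because the folded bounds are z and a, even though Z precedes a before folding.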

The test deliberately does not use any manual signed/unsigned casts, but leaves all of these casts to char_traits to avoid getting the casts similarly wrong in <regex> and the test.
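
The idea of leaving the casts to char_traits can be sketched as follows. This is not the PR's test code; the helper names `make_char`, `code_unit_of` and `precedes` are hypothetical, and the sketch only shows the standard std::char_traits<char> members that make manual casts unnecessary.

```cpp
#include <cassert>  // assert, for runtime checks like the one in the note below
#include <string>   // std::char_traits

using Traits = std::char_traits<char>;

// Builds a char from a code unit value such as 0xE0 without a manual cast.
inline char make_char(int code_unit) {
    return Traits::to_char_type(code_unit);
}

// Recovers the non-negative code unit value, whether or not char is signed.
inline int code_unit_of(char ch) {
    return Traits::to_int_type(ch);
}

// Orders two chars the way char_traits does (as unsigned values for char).
inline bool precedes(char a, char b) {
    return Traits::lt(a, b);
}
```

For instance, `assert(code_unit_of(make_char(0xE0)) == 0xE0);` holds on platforms with signed and unsigned char alike, so a test written this way cannot repeat a signedness mistake made in the library.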

muellerj2 · 2024-12-05T13:02:01Z

stl/inc/regex

-            if (static_cast<typename _RxTraits::_Uelem>(_Val) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {
+            if (static_cast<typename _RxTraits::_Uelem>(_Chr2) < static_cast<typename _RxTraits::_Uelem>(_Chr1)) {


Note that there is still a (pre-existing) bug in this error check when the collate flag is set: The bounds should be transformed by _Traits.transform() before performing the comparison.

However, the matcher does not do this transform() dance either, so I think this should be fixed for both matcher and parser in a coordinated fashion in a follow-up PR. (But I will adjust the PR if you would rather fix this check completely now.)

StephanTLavavej · 2025-01-12T20:12:51Z

A follow-up PR will be great.
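
For readers unfamiliar with the transform() step discussed in this thread, here is a minimal sketch of a collation-aware emptiness check. It assumes the default "C" locale and single-character bounds; the function name `is_empty_collated_range` is hypothetical, and the sketch says nothing about how the follow-up PR implements this.

```cpp
#include <cassert>  // assert, for runtime checks like the one in the note below
#include <regex>    // std::regex_traits
#include <string>   // std::string

// Hypothetical helper: compares the range bounds by their sort keys, as
// produced by regex_traits::transform(), instead of by their raw values.
inline bool is_empty_collated_range(char first, char last) {
    std::regex_traits<char> traits;
    const std::string lo = traits.transform(&first, &first + 1);
    const std::string hi = traits.transform(&last, &last + 1);
    // The range is empty when the upper bound sorts before the lower bound.
    return hi < lo;
}
```

In the "C" locale, `assert(is_empty_collated_range('z', 'a'));` holds, while the range from a to z is not empty. In other locales the sort keys, and therefore the result, may differ from a comparison of raw character values.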

StephanTLavavej · 2025-01-12T20:33:33Z

Thanks - I really appreciate the detailed explanations in your PRs! 🐱

StephanTLavavej · 2025-01-13T20:04:25Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-01-14T07:39:24Z

Thanks! 😻+

<regex>: Fix character range bounds in case-insensitive regexes

23d8dee

muellerj2 requested a review from a team as a code owner December 5, 2024 12:40

CaseyCarter added the bug Something isn't working label Dec 5, 2024

StephanTLavavej self-assigned this Dec 5, 2024

StephanTLavavej added the regex meow is a substring of homeowner label Jan 8, 2025

StephanTLavavej approved these changes Jan 12, 2025

StephanTLavavej removed their assignment Jan 12, 2025

StephanTLavavej mentioned this pull request Jan 12, 2025

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Jan 13, 2025

StephanTLavavej merged commit af0bd00 into microsoft:main Jan 14, 2025
39 checks passed

muellerj2 deleted the fix-case-insensitive-char-ranges branch January 14, 2025 20:21

muellerj2 mentioned this pull request Jan 15, 2025

<regex>: Implement collating ranges #5238

Merged
