`<regex>`: Make `wregex` handle small character ranges containing U+00FF and U+0100 correctly by muellerj2 · Pull Request #5437 · microsoft/STL · GitHub
StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels on Apr 25, 2025
Merged: StephanTLavavej merged 3 commits into microsoft:main from muellerj2:regex-handle-ranges-near-u+0100-correctly on May 10, 2025
StephanTLavavej approved these changes on May 3, 2025:
Thanks as always for the thorough investigation, careful fix, and extremely detailed writeup to help mortals understand the problem and solution! I pushed a conflict-free merge and a minor expansion of test coverage with negative cases.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request on May 9, 2025
Thanks for noticing and fixing this bug!
This fixes a weird bug for small character ranges near U+0100 that I stumbled upon while working on the proof-of-concept for eliminating `_Uelem` (#995). It's also a follow-up to #5238 and adds the characters in a collating range to the bitmap of the character class.

I will discuss the follow-up for collating ranges first. Strictly speaking, as of now we don't have to add characters in a range to the bitmap as long as we store the character range, because the matcher always checks character ranges before looking up the character in the bitmap. (This will become relevant for the bug as well.) So adding these characters to the bitmap for collating ranges seems like unnecessary work, because the difference can't be observed by users. However, I think it makes sense to invert this check order in the matcher in the longer run for performance reasons (`regex_traits::transform()` is relatively expensive, so should be avoided during matching). That's why I think it's a good idea to correctly update the bitmap for collating ranges now before the first official STL release containing the collating ranges fix becomes available, because this might allow us to assume in some future change that the bitmap is always filled correctly when the `collate` syntax flag is set.

Now let's discuss the bug.
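For orientation, the user-visible symptom can be checked with a few lines against the public `<regex>` API. This is only a sketch: the helper names `matches_small_range` and `check_small_range` are hypothetical, and the pattern is the `[\u00FF-\u0100]` example used in this description. On an implementation without the bug, all three assertions are expected to hold.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns whether `text` as a whole matches the two-character class [\u00FF-\u0100].
// (Hypothetical helper, not part of the STL or its test suite.)
bool matches_small_range(const std::wstring& text) {
    static const std::wregex re(L"[\u00FF-\u0100]");
    return std::regex_match(text, re);
}

// Checks the behavior one would expect from the regex grammar.
void check_small_range() {
    assert(matches_small_range(L"\u00FF"));  // lower end of the range, the case affected by the bug
    assert(matches_small_range(L"\u0100"));  // upper end of the range
    assert(!matches_small_range(L"\u00FE")); // just below the range
}
```

The range is deliberately tiny and straddles the boundary between U+00FF and U+0100, which is the combination the rest of this description is about.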
`_Builder::_Add_range2()` performs two optimizations when the `collate` flag is not set: The first optimization tries to put characters with code points `<= 0xFF` in a bitmap, while the second one replaces small ranges by the individual characters in this range.

In the first optimization, the loop condition checks `_Ex0 <= _Ex1 && _Ex1 < _Get_bmax()`, so it doesn't set the bits for characters between `_Ex0` and U+00FF in the bitmap when `_Ex1 >= 0x100`. This usually remains unobservable, though, because the matcher tests the stored character ranges first before looking up the bitmap. (I wonder whether this order was chosen as a workaround because these bits don't get set in the bitmap here. Or conversely the loop condition is like this because ranges are checked first in the matcher: If the optimization can't optimize the range away, it is actually a slight pessimization for the characters in the bitmap range performance-wise if they are represented in the bitmap and not by the range.)
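To make the effect of that loop condition concrete, here is a minimal sketch of the first optimization in isolation. The names `kBmpMax`, `Bitmap`, and `fill_bitmap_old` are hypothetical stand-ins (`kBmpMax` plays the role of `_Get_bmax()` and is assumed to be 0x100); this is a model of the quoted condition, not the actual `_Add_range2()` code.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical stand-in for _Get_bmax(): code points below this value live in the bitmap.
constexpr unsigned int kBmpMax = 0x100;
using Bitmap = std::bitset<kBmpMax>;

// Models the old loop: bits are only set while the END of the range (ex1)
// is below kBmpMax. Returns the updated ex0 so the caller can see which
// part of the range has not been moved into the bitmap.
unsigned int fill_bitmap_old(Bitmap& bmp, unsigned int ex0, unsigned int ex1) {
    assert(ex0 <= ex1); // a reversed range is not meaningful here
    for (; ex0 <= ex1 && ex1 < kBmpMax; ++ex0) {
        bmp.set(ex0);
    }
    return ex0;
}
```

For a range such as `[0x41, 0x5A]` every bit is set and the returned value is past the end of the range. For `[0xFE, 0x100]` the loop body never runs, so the bits for `0xFE` and `0xFF` stay unset and the whole range is left for later handling.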
But for a range with `_Ex0 < 0x100 <= _Ex1`, this leads to a problem when the second optimization in `_Builder::_Add_range2` gets applied: This optimization assumes that it is applied to character ranges that don't overlap with the range of characters represented by the bitmap (i.e., it assumes `_Ex0 >= 0x100`). When it kicks in, the character range is no longer represented as a range in the NFA node, but instead all characters in the range are added to a character array that is intended for all matched characters with code points >= 0x100.

This leads to the following problem:

- `_Ex1 >= 0x100`, so the bits for characters with code points between `_Ex0` and `0xFF` aren't set in the bitmap.
- `_Ex1 - _Ex0` is small enough, so the character range is no longer represented as a character range. Instead, the characters are added to the matched character array for code points >= 0x100.
- When matching `\u00FF` to the character class `[\u00FF-\u0100]`:
  - The matcher does not find `\u00FF` in the bitmap because the bit for `\u00FF` is not set due to the first optimization's loop condition.
  - […] `\u00FF`.

To fix this, I opted to change the loop condition of the first optimization to `_Ex0 <= _Ex1 && _Ex0 < _Bmp_max` (where `_Bmp_max == _Get_bmax()`). As mentioned, this is actually a pessimization for characters with code points between `0x00` and `0xFF` if `_Ex1 >= 0x100` and the second optimization isn't applied, but (a) these ranges starting at some code point <= 0xFF and ending at some code point >= 0x100 are probably rare in practice and (b) this is a first step towards evaluating the bitmap before ranges in the matcher.

Additional changes:

- `_Get_bmax()` and `_Get_tmax()` have lost their usefulness. I replaced the functions by the returned constants and marked the members `_Bmax` and `_Tmax` with a `// TRANSITION, ABI` comment. (We can't remove the initialization of `_Bmax` and `_Tmax` in the constructor yet due to the possibility of mixing.)
- […] `_Builder::_Add_range3()` to avoid calling `_Traits.transform()` more than necessary.
- `_Add_range2()` was renamed to ensure that the check can't get lost when old and new TUs are mixed.
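The interaction between the two optimizations and the matcher, and the effect of the changed loop condition, can be modelled with a small self-contained sketch. Everything here is a toy under stated assumptions: `CharClass`, `add_range`, `matches`, `kBmpMax`, and `kTinyRange` are hypothetical names, the threshold value 4 for "small" ranges is assumed for illustration, and the matcher order (ranges, then bitmap for code points below 0x100, then the character array only for larger code points) is a simplified reading of the description above rather than the real matcher code.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <utility>
#include <vector>

constexpr unsigned int kBmpMax = 0x100;  // stand-in for _Bmp_max
constexpr unsigned int kTinyRange = 4;   // assumed threshold for "small" ranges

struct CharClass {
    std::bitset<kBmpMax> bitmap;                                // code points < kBmpMax
    std::vector<unsigned int> large;                            // single characters, meant for >= kBmpMax
    std::vector<std::pair<unsigned int, unsigned int>> ranges;  // ranges kept as ranges
};

// Toy version of the two optimizations; `fixed` selects the new loop condition.
void add_range(CharClass& cls, unsigned int ex0, unsigned int ex1, bool fixed) {
    assert(ex0 <= ex1);
    // First optimization: move code points below kBmpMax into the bitmap.
    // Old condition tests ex1, new condition tests ex0.
    for (; ex0 <= ex1 && (fixed ? ex0 : ex1) < kBmpMax; ++ex0) {
        cls.bitmap.set(ex0);
    }
    if (ex0 > ex1) {
        return;  // the whole range fit into the bitmap
    }
    // Second optimization: replace a small range by its individual characters.
    if (ex1 - ex0 < kTinyRange) {
        for (; ex0 <= ex1; ++ex0) {
            cls.large.push_back(ex0);
        }
    } else {
        cls.ranges.emplace_back(ex0, ex1);
    }
}

// Toy matcher: stored ranges first, then the bitmap for small code points,
// and the character array only for code points >= kBmpMax.
bool matches(const CharClass& cls, unsigned int ch) {
    for (const auto& r : cls.ranges) {
        if (r.first <= ch && ch <= r.second) {
            return true;
        }
    }
    if (ch < kBmpMax) {
        return cls.bitmap.test(ch);
    }
    return std::find(cls.large.begin(), cls.large.end(), ch) != cls.large.end();
}
```

With the old condition, `add_range(cls, 0xFF, 0x100, false)` puts both characters into `large` and sets no bit, so `matches(cls, 0xFF)` fails in this model. With the new condition, the bit for `0xFF` is set and only `0x100` goes into `large`, so both characters match. A wide range such as `[0x80, 0x200]` works either way in the model because it stays stored as a range, which mirrors why the old condition usually went unnoticed.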