Simplifies regex_traits<_Elem>::translate(_Elem) to just return its only argument. In #5204 (comment), I voiced my suspicion that the current implementation essentially does just that, only in a very complicated and expensive way. I have now verified this by running the following program against the 902 locales available on my machine. The program prints a line for every character that translate() changes, and it produced no output on stdout:
#include <iostream>
#include <locale>
#include <regex>
#include <string>
#include <vector>
#include <windows.h>

using namespace std;

BOOL add_locale(LPWSTR str, DWORD flags, LPARAM param) {
    vector<string>& vec = *reinterpret_cast<vector<string>*>(param);
    int size = WideCharToMultiByte(CP_ACP, 0, str, -1, nullptr, 0, nullptr, nullptr);
    if (size) {
        vector<char> narrow(static_cast<size_t>(size), '\0');
        if (WideCharToMultiByte(CP_ACP, 0, str, -1, narrow.data(), size, nullptr, nullptr)) {
            vec.push_back(narrow.data());
        }
    }
    return TRUE;
}

int main() {
    vector<string> locales;
    locales.push_back("C");
    EnumSystemLocalesEx(&add_locale, 0, reinterpret_cast<LPARAM>(&locales), nullptr);
    for (const string& loc_name : locales) {
        const locale loc(loc_name);
        regex_traits<char> traits;
        regex_traits<wchar_t> wtraits;
        traits.imbue(loc);
        wtraits.imbue(loc);
        for (unsigned int i = 0; i <= 0xff; ++i) {
            try {
                const char x = (char) (unsigned char) i;
                if (traits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }
        for (unsigned int i = 0; i <= 0xffff; ++i) {
            try {
                const wchar_t x = (wchar_t) i;
                if (wtraits.translate(x) != x) {
                    cout << loc_name << i << '\n';
                }
            } catch (const length_error&) {
            }
        }
    }
    return 0;
}
Even so, this PR still introduces a minor behavior change: the previous implementation can throw length_error("string too long") when it is passed a char that isn't a valid character in the locale's encoding (e.g., 0x80 in locales using UTF-8 encoding). But I think the old behavior is undesirable anyway: in regex_constants::collate mode, it makes the regex engine fail with an exception whenever a locale using UTF-8 encoding is imbued and the engine is applied to strings containing non-ASCII characters.
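As a rough illustration of the simplified behavior (a sketch only, not the actual STL source), the following uses a hypothetical identity_traits class whose translate() returns its argument. The names identity_traits and check_translate_is_identity are made up for this example. Because no locale lookup is involved, a value such as 0x80 passes through unchanged and nothing can throw:

```cpp
#include <cassert>
#include <regex>

// Hypothetical stand-in, NOT the real STL implementation: it only mirrors
// the behavior described above, where translate() returns its argument.
template <class Elem>
struct identity_traits : std::regex_traits<Elem> {
    Elem translate(Elem ch) const noexcept {
        return ch; // no locale lookup, so no length_error is possible
    }
};

// Checks that every unsigned char value is mapped to itself,
// including bytes such as 0x80 that are not valid on their own in UTF-8.
inline void check_translate_is_identity() {
    identity_traits<char> traits;
    for (unsigned int i = 0; i <= 0xff; ++i) {
        const char x = static_cast<char>(static_cast<unsigned char>(i));
        assert(traits.translate(x) == x);
    }
}
```

Calling check_translate_is_identity() from a test is enough to exercise the sketch; it does not depend on which locale is imbued.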
I put this much work into the verification mainly because I was worried about mix-and-match scenarios, not because I consider the old implementation correct. Suppose the old and new implementations commonly produced different results, and the regex parser and matcher picked up different implementations (or some other strange combination occurred). Then this change could have degraded the regex engine in such a mix-and-match scenario and led to regex bugs that are difficult to understand and reproduce.
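The collate-mode failure described above can be probed with a small helper along these lines. This is a sketch under assumptions: collate_search_throws_length_error is a hypothetical name, and whether it reports true depends on the standard library version and on the locale name passed in, which is platform-specific.

```cpp
#include <locale>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether compiling or searching a regex in
// regex_constants::collate mode under the named locale throws length_error.
// std::locale itself throws std::runtime_error for unknown locale names;
// that exception is deliberately not caught here.
inline bool collate_search_throws_length_error(const std::string& locale_name,
                                               const std::string& pattern,
                                               const std::string& subject) {
    try {
        std::regex re;
        re.imbue(std::locale(locale_name)); // imbue before assigning the pattern
        re.assign(pattern, std::regex_constants::collate);
        (void) std::regex_search(subject, re);
        return false;
    } catch (const std::length_error&) {
        return true;
    }
}
```

With the "C" locale and ASCII-only input the helper should return false. To reproduce the problem described in this PR, one would pass the name of a locale that uses UTF-8 encoding and a subject string containing non-ASCII bytes.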