`<regex>`: Simplify `regex_traits<_Elem>::translate(_Elem)` #5209

muellerj2 · 2024-12-27T15:07:49Z

Simplifies `regex_traits<_Elem>::translate(_Elem)` to just return its only argument. In #5204 (comment), I voiced my suspicion that the current implementation essentially does just that in a very complicated and expensive way. I have now verified this by running the following program, which tested 902 locales available on my machine and produced no output on stdout:

```cpp
#include <iostream>
#include <locale>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>
#include <windows.h>
using namespace std;

// Callback for EnumSystemLocalesEx: converts each locale name to a narrow string
// and appends it to the vector<string> passed through param.
BOOL CALLBACK add_locale(LPWSTR str, DWORD /* flags */, LPARAM param) {
    vector<string>& vec = *reinterpret_cast<vector<string>*>(param);
    int size            = WideCharToMultiByte(CP_ACP, 0, str, -1, nullptr, 0, nullptr, nullptr);
    if (size) {
        vector<char> narrow(static_cast<size_t>(size), '\0');
        if (WideCharToMultiByte(CP_ACP, 0, str, -1, narrow.data(), size, nullptr, nullptr)) {
            vec.push_back(narrow.data());
        }
    }
    return TRUE; // keep enumerating
}

int main() {
    vector<string> locales;
    locales.push_back("C");
    EnumSystemLocalesEx(&add_locale, 0, reinterpret_cast<LPARAM>(&locales), nullptr);
    for (const string& loc_name : locales) {
        const locale loc(loc_name);
        regex_traits<char> traits;
        regex_traits<wchar_t> wtraits;
        traits.imbue(loc);
        wtraits.imbue(loc);
        // Report every char value that translate() does not map to itself.
        for (unsigned int i = 0; i <= 0xff; ++i) {
            try {
                const char x = static_cast<char>(static_cast<unsigned char>(i));
                if (traits.translate(x) != x) {
                    cout << loc_name << ' ' << i << '\n';
                }
            } catch (const length_error&) {
                // The old implementation throws for chars that are invalid in the locale's encoding.
            }
        }
        // Same check for every 16-bit wchar_t value.
        for (unsigned int i = 0; i <= 0xffff; ++i) {
            try {
                const wchar_t x = static_cast<wchar_t>(i);
                if (wtraits.translate(x) != x) {
                    cout << loc_name << ' ' << i << '\n';
                }
            } catch (const length_error&) {
            }
        }
    }
    return 0;
}
```

Even so, this PR still introduces a minor behavior change: the previous implementation can throw `length_error("string too long")` when it is passed a `char` that isn't a valid character in the locale's encoding (e.g., 0x80 in locales using UTF-8 encoding). But I think the old behavior is undesirable anyway: in `regex_constants::collate` mode, it makes the regex engine always fail with an exception when a locale using UTF-8 encoding is imbued and the engine is applied to strings containing non-ASCII characters.
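The scenario described above can be sketched portably as follows. This is only an illustration, not a test from the PR: the helper `matches_in_collate_mode` and the sample input are assumptions made for this sketch, and the usage below passes the classic locale because names of UTF-8 locales are platform-specific. The pattern deliberately avoids bracket ranges, so only `translate` is exercised on the non-ASCII bytes.

```cpp
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: compiles `pattern` in collate mode under `loc`
// and reports whether it matches all of `text`.
bool matches_in_collate_mode(const std::string& pattern, const std::string& text, const std::locale& loc) {
    std::regex re;
    re.imbue(loc); // imbue first; assign() then compiles the pattern under this locale
    re.assign(pattern, std::regex_constants::ECMAScript | std::regex_constants::collate);
    return std::regex_match(text, re);
}

// "caf" followed by the two UTF-8 code units of U+00E9, both of which are >= 0x80.
std::string sample_text() {
    return "caf\xC3\xA9";
}
```

Calling `matches_in_collate_mode("caf..", sample_text(), std::locale::classic())` matches each of the two trailing bytes with one `.`. Substituting a locale that uses UTF-8 encoding would reproduce the situation the comment describes, where the old `translate` could throw `length_error` on those bytes.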

frederick-vs-ja

Aha, it seems that we took too much time to look into this.
[re.traits]/4 simply specifies that an implementation-provided `regex_traits<C>::translate` just returns the argument. See also https://en.cppreference.com/w/cpp/regex/regex_traits/translate.
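As a rough illustration of what [re.traits]/4 implies, the check below counts the `char` values for which `translate` does not return its argument. It is a simplified, portable variant of the idea behind the program in the PR description, without the Windows locale enumeration; the function name `count_translate_mismatches` is made up for this sketch.

```cpp
#include <locale>
#include <regex>

// Hypothetical helper: counts char values that translate() does not map to themselves.
int count_translate_mismatches(const std::locale& loc) {
    std::regex_traits<char> traits;
    traits.imbue(loc);
    int mismatches = 0;
    for (unsigned int i = 0; i <= 0xff; ++i) {
        const char x = static_cast<char>(static_cast<unsigned char>(i));
        if (traits.translate(x) != x) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

On an implementation that follows [re.traits]/4, `count_translate_mismatches(std::locale::classic())` should be zero.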

muellerj2 · 2024-12-27T16:19:03Z

I put this much work into it mainly because I was worried about mix-and-match scenarios, not because I consider the old implementation correct. Suppose the two implementations commonly produced different results, and the regex parser and the matcher picked up different implementations, or some other strange combination occurred. This change could then actually degrade the regex engine in such a scenario and lead to regex bugs that are difficult to understand and reproduce.

StephanTLavavej · 2025-01-11T00:09:35Z

Thanks for the careful analysis! 😻

StephanTLavavej · 2025-01-11T00:26:46Z

This has gotta be a perf win.

StephanTLavavej · 2025-01-13T20:05:53Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-01-14T08:44:17Z

Who knew machine translation was so easy! 🤖 😹 🤪

<regex>: Simplify regex_traits<_Elem>::translate(_Elem)

fdebb65

muellerj2 requested a review from a team as a code owner December 27, 2024 15:07

frederick-vs-ja approved these changes Dec 27, 2024

muellerj2 mentioned this pull request Dec 27, 2024

<locale>: std::collate<_Elem>::do_transform() should behave appropriately when _LStrxfrm() fails #5210

Closed

StephanTLavavej added the enhancement Something can be improved label Jan 4, 2025

StephanTLavavej self-assigned this Jan 4, 2025

Merge branch 'main' into simplify-regex_traits-translate

ff20c8f

StephanTLavavej added the regex meow is a substring of homeowner label Jan 8, 2025

StephanTLavavej approved these changes Jan 11, 2025

StephanTLavavej removed their assignment Jan 11, 2025

StephanTLavavej mentioned this pull request Jan 11, 2025

Maintainer priorities #4700

Open

StephanTLavavej added performance Must go faster and removed enhancement Something can be improved labels Jan 11, 2025

StephanTLavavej self-assigned this Jan 13, 2025

StephanTLavavej merged commit 615ce66 into microsoft:main Jan 14, 2025
39 checks passed

muellerj2 deleted the simplify-regex_traits-translate branch January 14, 2025 20:14

muellerj2 mentioned this pull request Jan 15, 2025

<regex>: Implement collating ranges #5238

Merged
