Step 6.1 of find a range from a node list has this sentence: “The string search must be performed using a base character comparison, or the primary level, as defined in [UTS10].” It is followed by the note: “Intuitively, this is a case-insensitive search also ignoring accents, umlauts, and other marks.”
There are multiple problems with this:
In Chrome, the meaning of links depends on the UI language
Consider the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=H%C3%A4n . Test this in Chrome with the UI language set to English, German, and Finnish. A different word (in a Finnish part of the text) is highlighted in the three different UI languages.
Also test the link https://hsivonen.com/test/moz/text-fragment-language.html#:~:text=ILIK in Chrome with the UI language set to English and Turkish. A different word is highlighted depending on the UI language.
Obviously, it’s very bad for the meaning of links to depend on the UI language!
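The locale dependence is also observable from script via Intl.Collator, which in ICU-backed engines draws on the same CLDR data. A minimal sketch, assuming an ICU-backed engine that resolves the requested locales (and noting that Intl.Collator can only compare full strings):

```js
// Primary-level ("base character") comparison, as the spec prescribes.
const opts = { usage: "search", sensitivity: "base" };

// "Hän" vs. "Han": ä is an accented a at the primary level in the root
// collation, but a distinct base letter in the Finnish tailoring.
new Intl.Collator("en", opts).compare("Hän", "Han"); // 0 (match)
new Intl.Collator("fi", opts).compare("Hän", "Han"); // non-zero (no match)

// "ILIK" vs. "ilik": Turkish pairs I with ı and İ with i.
new Intl.Collator("en", opts).compare("ILIK", "ilik"); // 0 (match)
new Intl.Collator("tr", opts).compare("ILIK", "ilik"); // non-zero (no match)
new Intl.Collator("tr", opts).compare("ILIK", "ılık"); // 0 (match)
```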
The current situation is not interoperable
Safari (tested 16.6 on Ventura) appears to perform a case-insensitive search that doesn’t ignore diacritics and does not appear to depend on either the UI language or the language of the content (i.e. the case mapping is wrong for languages that use Turkish i).
The use case isn’t well-motivated
Why should the search be ignoring certain aspects of the text? This seems unnecessary when the text fragments are generated by copying concrete text from the page (by plain copy-paste, by using Chrome’s context menu item “Copy link to highlight”, or by a server-side system having loaded the target page for server-side processing). This seems to mainly cater to typing text fragments by hand when the text input method in use isn’t well matched with the page text (e.g. using a U.S. English keyboard to try to match English text that has French-style accents like naïve or café). What use case justifies the resulting complexity?
The spec doesn’t actually say what Chrome does
The definition is too vague to actually use collator-based search. The spec doesn’t say how to choose what collation to use. The referenced spec, UTS 10, specifies DUCET, but no browser carries an implementation of raw DUCET. All browsers use CLDR collations instead. Chrome seems to be using the CLDR search collation for the UI language of the browser (as with cmd/ctrl-f).
Distinction from cmd/ctrl-f
Using the search collation for the UI language makes sense for cmd/ctrl-f, since the text is user-entered, so it’s reasonable to align locale-dependent behavior to the locale of the user as opposed to the locale of the page. However, links should mean the same thing when shared between people who use different UI languages.
Speccing the root collation doesn’t make sense
The whole point of collator-based search is to make the meanings of accent-insensitive and case-insensitive reuse the collation data, so that those meanings are language-dependent. The meaning of case-insensitive changes for Turkish i. A number of European Latin-script languages treat a letter that English analyzes as an accented letter as a base letter in its own right.
Saying that the CLDR search root collation shall always be used would make things always wrong for the languages that were supposed to be the beneficiaries of collator-based search.
Collator-based search doesn’t deal with multiple language spans in the haystack
Collator-based search maps the text for both the needle and the haystack into collation elements, performs the search over collation elements, and then tries to correlate the result back to the original text.
This requires the use of the same mapping from text to collation elements for both the needle and the haystack. In practice, the allocation of tailored primary weights in a language-specific tailoring is scoped to that tailoring, so collator-based search isn’t suited for handling spans of different languages in the haystack.
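To make the scoping problem concrete, here’s a toy sketch. The primary weights are invented for illustration and stand in for the scoped weight allocations of two different tailorings; real collation elements are far more involved (contractions, expansions, multiple levels):

```js
// Two invented tailorings that disagree about ä: the root-like one folds
// ä to a at the primary level; the Finnish-like one gives ä its own
// tailored primary weight. (Weights are made up, not real CLDR data.)
const rootLike = { a: 1, h: 2, n: 3, "ä": 1 };
const finnishLike = { a: 1, h: 2, n: 3, "ä": 29 }; // 29 is scoped to this tailoring

// Map text to primary collation elements under one tailoring.
function primaries(text, tailoring) {
  return [...text.toLowerCase()].map(
    (ch) => tailoring[ch] ?? (ch.codePointAt(0) + 1000) // untailored fallback
  );
}

// Search in collation-element space; the needle and the haystack must be
// mapped with the SAME tailoring for the weights to be comparable at all.
function collatorSearch(needle, haystack, tailoring) {
  const n = primaries(needle, tailoring);
  const h = primaries(haystack, tailoring);
  outer: for (let i = 0; i + n.length <= h.length; i++) {
    for (let j = 0; j < n.length; j++) {
      if (h[i + j] !== n[j]) continue outer;
    }
    return i; // an index in collation-element space, still to be mapped back to text
  }
  return -1;
}

collatorSearch("han", "Hän", rootLike);    // 0: matches
collatorSearch("han", "Hän", finnishLike); // -1: ä is a distinct letter here
```

A haystack with spans in different languages would need a different tailoring per span, but a tailored weight like the 29 above has no meaning outside the tailoring that allocated it, so the per-span collation elements wouldn’t be mutually comparable.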
Creating a Web-exposed dependency on a complex operation
The spec as written brings collator-based search into the scope of the Web Platform, even though collator-based search hasn’t previously been a standardized part of the Web Platform. (WebKit and Blink implement their cmd/ctrl-f functionality via collator-based search, but Firefox doesn’t. window.find exposes this to the Web, but that API remains unstandardized.)
Notably, while Intl.Collator exposes search collation data, it does so only in the context of providing a sorting API. Intl.Collator doesn’t provide a search API, so it can only be used for testing full-string matches, not substring or (properly) even prefix matches. (I think it’s a design error for Intl.Collator to expose search data.)
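For example (the substring method mentioned in the comment is hypothetical; no such API exists):

```js
const search = new Intl.Collator("en", { usage: "search", sensitivity: "base" });

search.compare("café", "cafe") === 0; // true: full-string matching works

// There is no collator-aware substring primitive (nothing like a
// hypothetical search.indexOf("un café noir", "cafe")), and prefix
// testing can't be layered on compare() correctly: contractions and
// expansions mean that collation element boundaries need not line up
// with code point boundaries.
```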
Collator-based search means shipping ICU4C or undertaking major implementation effort
Before implementing the collator for ICU4X, I surveyed the use cases of collation in Firefox and determined that certain capabilities of ICU4C’s collator aren’t exposed to the Web or otherwise needed by Firefox. (These include: collator-based search, generating search keys that a database can store after up-front computation and then use multiple times with a computationally lighter comparison operation, and generating an alphabetical index as seen in the Contacts apps on mobile phones with items grouped under exemplar characters, typically the recitable alphabet of the locale.)
I wish to keep the option open for Firefox to migrate from ICU4C to ICU4X without bringing collator-based search into scope as a blocker feature to implement.
There’s a scope document that contains the rationale for excluding collator-based search from ICU4X.
I suggest revisiting case-insensitivity and accent-insensitivity in light of use cases and reassessing whether addressing the use cases that would call for case-insensitivity or accent-insensitivity is really worth the complexity.
I am not advocating in favor of case-insensitivity, but I note that implementing case-insensitivity on top of the Unicode database’s notion of fold case would involve much less complexity than involving a collator in any way. Unlike in Safari, it could even be sensitive to Turkish i based on the language declared for each node in text content without making things infeasible. Still, even that level of complexity would need to be justified by use cases.
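A sketch of what that could look like, using toLowerCase()/toLocaleLowerCase() as a stand-in for Unicode simple case folding (they differ for a handful of characters):

```js
// Root case mapping, for nodes with no (relevant) declared language:
"ILIK".toLowerCase();           // "ilik"

// Turkish-aware mapping, selectable per node from its declared language:
"ILIK".toLocaleLowerCase("tr"); // "ılık"

// Case mapping leaves accents alone, so no collation data is involved:
"Naïve".toLowerCase();          // "naïve"
```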
Accent-insensitivity is a big enough undertaking to do well that the use cases would need to be extraordinarily compelling.
Collator-based search is conceptually insensitive to the Unicode Normalization Form. However, in practice the Web is in Normalization Form C, and in any case generating the URL fragments by copying whatever text the page happens to have doesn’t seem to justify making the comparison insensitive to the Unicode Normalization Form.
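For concreteness:

```js
const nfc = "caf\u00E9";      // "café" with é as one precomposed code point (NFC)
const nfd = "cafe\u0301";     // "café" as e + U+0301 COMBINING ACUTE ACCENT (NFD)

nfc === nfd;                  // false: code point comparison is form-sensitive
nfc === nfd.normalize("NFC"); // true
```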