This document defines the syntax and formal semantics of XQuery 1.0 and XPath 2.0 Full-Text, a language that extends XQuery 1.0 and XPath 2.0 with full-text search capabilities.
This is a public W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This is the fifth version of this document. Since the last version was published, several technical and editorial changes have been made to all the sections of the document. Among the most significant changes are: the addition of a section describing the processing model for full-text search and how it integrates with the XQuery Processing Model; the reformulation of the AllMatches model so that a primitive match (TokenInfo) now can represent an interval of token positions, and hence, a match of a phrase (in the former version phrases were modeled using distance constraints, which had certain unwanted implications when distance operators were explicitly applied to phrases); the restriction of the FTTimes operation to simple FTSelections; and several simplifications in the semantics functions that the latter two changes made possible, like the removal of the AllMatches normalization. The XQuery functions that are used to define the semantics of the full-text operations have been thoroughly revised and are now syntax- and type-checked.
This document has been produced
following the procedures set out for the W3C Process. This document
was produced through the efforts of
Public comments on this document and its open issues are invited.
Comments should be entered into the
This document was produced by groups operating under the
SA January 2004: First version of document before Feb F2F
SA 26 February 2004: Second version of document before Feb F2F meetings.
This document defines the language and the formal semantics of
XQuery 1.0 and XPath 2.0 Full-Text. This language is designed to meet the requirements
identified in W3C XQuery and XPath Full-Text Requirements
XQuery 1.0 and XPath 2.0 Full-Text extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
As XML becomes mainstream, users expect to be able to
search their XML documents. This requires a standard way to do
full-text search, as well as structured searches, against XML
documents. A similar requirement for full-text search led ISO to
define the SQL/MM-FT
XML documents may contain highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.
There is an expectation that a full-text search will support language-based searches, which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example, based on token proximity, is "find me all the news items that contain the tokens 'XML' and 'Query', allowing up to 3 intervening words".
Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full-Text.
The following definitions apply to full-text search:
In some natural languages, tokens and words can be used interchangeably.
Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).
Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).
Tokenization also uniquely identifies sentences and paragraphs in which tokens appear.
The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens. Everything else is implementation-defined.
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries, while formatting markup sometimes does not. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization.
This document is organized as follows. We first present a
Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:
xml = http://www.w3.org/XML/1998/namespace
xs = http://www.w3.org/2001/XMLSchema
xsi = http://www.w3.org/2001/XMLSchema-instance
fn = http://www.w3.org/2005/xpath-functions
xdt = http://www.w3.org/2005/xpath-datatypes
local = http://www.w3.org/2005/xquery-local-functions
In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative.
Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0
specifications, particularly
Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery 1.0 and XPath 2.0 Full-Text functions. There is no requirement that these functions be implemented; therefore, no URI is associated with that prefix.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:
Adds a new expression called FTContainsExpr;
Enhances the syntax of FLWOR expressions in XQuery 1.0 and for expressions in XPath 2.0 with optional score variables; and
Adds static context declarations for full-text match options to the query prolog.
Additionally, it extends the data model and processing models in various ways.
As part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance, an implementation-defined full-text process, called tokenization, is usually executed.
Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of token occurrences found in the target text (nodes) of a search. These token occurrences are characterized by unique identifiers that capture the relative position of the token inside the string, the relative position of the sentence containing the token, and the relative position of the paragraph containing the token.
The tokenization process is implementation-dependent. For example, tokenization may differ from domain to domain and from language to language. This specification imposes only a small number of constraints on the semantics of a correct tokenizer. As a consequence, all the examples in this document are given for explanatory purposes only and are not mandatory, i.e., the result of such full-text queries depends on the tokenizer that is being used.
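The following Python sketch illustrates the kind of positional information described above. It is not part of this specification: the class name TokenOccurrence, the function tokenize, and the splitting rules (paragraphs at blank lines, sentences at ".", "!" or "?") are assumptions chosen for illustration, and a real tokenizer may behave quite differently.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenOccurrence:
    """One token occurrence with its relative positions (hypothetical layout)."""
    word: str       # token text, lowercased for convenience
    position: int   # 1-based position of the token in the string
    sentence: int   # 1-based position of the containing sentence
    paragraph: int  # 1-based position of the containing paragraph


def tokenize(text: str) -> List[TokenOccurrence]:
    """Naive tokenizer: equal input strings always yield equal token sequences."""
    occurrences = []
    position = 0
    sentence = 0
    # Assumption: paragraphs are separated by blank lines.
    for p_index, paragraph in enumerate(re.split(r"\n\s*\n", text), start=1):
        # Assumption: sentences end at '.', '!' or '?'.
        for chunk in re.split(r"[.!?]+", paragraph):
            words = re.findall(r"\w+", chunk)
            if not words:
                continue
            sentence += 1
            for word in words:
                position += 1
                occurrences.append(
                    TokenOccurrence(word.lower(), position, sentence, p_index)
                )
    return occurrences


sample = tokenize("Web site usability. Testing works.\n\nNew paragraph here.")
```

Because the function is deterministic, two equal strings are always tokenized in the same way, which is the one constraint on tokenizers stated in this document.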
A full-text expression or
An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that specifies the sequence of items to be searched. Those items are called the search context.
The full-text selection to be applied (
Required:
Words and phrases for which a search is performed (FTWords).
Optional:
Match options, such as indicators for case sensitivity and stop words (FTMatchOptions);
Boolean full-text operators, that compose an FTSelection from simpler FTSelections;
Other full-text operators that are constraints on the positions of matches, such as indicators for distance between tokens and for the cardinality of matches; and
The weighting information. Each individual search term in an FTSelection may be annotated with optional weight information. This information may be used during the evaluation of the FTSelections to calculate scoring, i.e., information that quantifies the relevance of the result to the given search criteria.
An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that specifies the set of nodes, descendants of the RangeExpr, whose contents may be ignored for the purpose of determining a match during the search (
The results of the evaluation of the FTSelection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query word or phrase in the FTSelection, and a TokenInfo structure that describes a consecutive sequence of token occurrences in the text string which match the query word or phrase.
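The structure described above can be pictured with the following Python sketch. It is an illustration only: the field names (start_pos, end_pos, query_pos, string_includes, string_excludes) and the single StringMatch class used for both positive and negative terms are assumptions, not definitions taken from this specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TokenInfo:
    """A consecutive run of token occurrences matching a query word or phrase."""
    word: str       # the query word or phrase that was matched
    start_pos: int  # position of the first matching token
    end_pos: int    # position of the last matching token


@dataclass(frozen=True)
class StringMatch:
    """One term: position of the query item in the FTSelection plus its TokenInfo."""
    query_pos: int
    token_info: TokenInfo


@dataclass
class Match:
    """One possible solution to the full-text query."""
    string_includes: List[StringMatch] = field(default_factory=list)  # positive terms
    string_excludes: List[StringMatch] = field(default_factory=list)  # negative terms


@dataclass
class AllMatches:
    """All possible solutions for one search context item."""
    matches: List[Match] = field(default_factory=list)


# A phrase match is a single TokenInfo spanning an interval of positions.
example = AllMatches(
    matches=[Match(string_includes=[StringMatch(1, TokenInfo("web site", 2, 3))])]
)
```

Representing a phrase as one TokenInfo that spans positions 2 to 3 mirrors the interval-based primitive match mentioned in the status section of this document.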
Figure 1 provides a schematic overview of the XQuery 1.0 and XPath
2.0 Full-Text processing steps that are discussed in detail below.
Some of these steps are completely outside the domain of XQuery; in
Figure 1, these are depicted outside the black line that represents
the boundaries of language. The diagram only shows the central pieces
of the XQuery Processing Model (see
Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all FTSelections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization normally occurs at the time of parsing of the original XML documents, for example, during the Data Model Generation process (see Figure 1). It may also occur "on-the-fly", transforming an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of FTSelections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.
The AllMatches instance obtained by the evaluation of a Full Text expression is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation, as follows. If at least one member of the disjunction contains only positive terms, then the value returned is true. If all members of the disjunction contain negative terms, the result is false.
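The conversion rule can be sketched in Python as follows. The Match type with includes and excludes fields is a simplified, hypothetical stand-in for the AllMatches model. Returning false for an AllMatches with no members follows the rule as stated, since no member then consists of positive terms only.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Simplified Match: terms are plain strings here (an assumption)."""
    includes: List[str]  # positive terms (StringInclude)
    excludes: List[str]  # negative terms (StringExclude)


def all_matches_to_boolean(all_matches: List[Match]) -> bool:
    """True if at least one Match contains only positive terms."""
    return any(len(match.excludes) == 0 for match in all_matches)


only_positive = [Match(["usability"], [])]
only_negative = [Match(["usability"], ["testing"])]
mixed = only_negative + only_positive
```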
Weighting information may be used, in an implementation-dependent fashion, when calculating the scoring information computed and made available by FTContainsExpr to the optional score construct.
Section 3 describes the syntax and the informal semantics of Full Text operators. Their formal semantics is defined in Section 4. The AllMatches data model is formally defined in Section 4.
Given the components of a given Full Text expression, the evaluation
algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FT
Evaluate the search context expression, resulting in the set of search context items; (FT1 provides the evaluation of any XPath 2.0 or XQuery 1.0 expressions that generate or modify the search context, as well as the query string(s) in a partially evaluated FTSelection expression)
Evaluate the (optional) ignore expression, resulting in the set of ignored nodes, and virtually delete the ignored nodes from the search context node tree. (Included in FT1)
Apply the tokenization algorithm to query string(s). (FT2.1 -- this is implementation-dependent)
For each search context item:
Apply the tokenization algorithm in order to extract potentially matching terms together with their positional information. This step results in a sequence of token occurrences. (FT2.2 -- this is implementation-dependent)
Evaluate the simple FTWords operators in the FTSelection against the tokenized input. This results in a set of AllMatches instances. (FT3)
Evaluate the rest of the FTSelection operator tree in a bottom-up fashion. At each step, the AllMatches instances produced by the previous steps are given as input, and a new AllMatches instance is obtained as output. At each step, the FTMatchOptions control the semantics of the application of the FTWords operator. (FT4)
Convert the AllMatches instance into a Boolean value. (FT5)
The additional scoring information (also part of FT5) that is produced by the evaluation of the Full Text expression is implementation-dependent and is not specified in this document. It is made available at the same time the Boolean value is returned.
As a syntactic construct, an FTContainsExpr behaves similarly to a comparison expression (see
An FTContainsExpr may be used anywhere a ComparisonExpr may be used. FTContainsExprs have higher precedence than comparison operators, so the results of FTContainsExpr may be compared without enclosing them in parentheses.
An FTContainsExpr returns a Boolean
value. It returns true, if there is some node in
RangeExpr that, after
The following example in extended XQuery 1.0 returns the author of each book with a title containing a token with the same root as "dog" and the token "cat".
The same example in extended XPath 2.0 is written as:
Besides specifying a match of a full-text
search as a Boolean condition, full-text search applications
typically also have the ability to associate scores with
the results.
XQuery 1.0 and XPath 2.0 Full-Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the for and let clauses of FLWOR expressions.
The production for the extended for clause follows.
When a score variable is present in a for clause, the evaluation of the expression following the in keyword must not only determine the result sequence of the expression, i.e., the sequence of items that are iteratively bound to the for variable; it must also determine, in each iteration, the relevance "score" value of the current item and bind the score variable to that value.
In the following example, book elements are determined that satisfy the condition [content ftcontains "web site" && "usability" and .//chapter/title ftcontains "testing"]. The scores assigned to the book elements are returned.
XPath 2.0 Full-Text extends the language of XPath 2.0 in the for expression in the same way: with optional score variables. The example above is also a legal example of the XPath 2.0 extension.
Scores are typically used to order results, as in the
following, more complete example.
The score variable is bound to a value which reflects the relevance of the match criteria in the FTSelections to the nodes in the respective RangeExprs. The calculation of relevance is implementation-dependent, but score evaluation must follow these rules:
Score values are of type xs:double in the range [0, 1].
For score values greater than 0, a higher score must imply a higher degree of relevance.
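As an illustration of how an application might rely on these rules, the following Python sketch checks that each score lies in the range [0, 1] and orders results by descending score. The function names and the (item, score) pair representation are assumptions made for this example. How the scores themselves are computed remains implementation-dependent.

```python
from typing import List, Tuple


def check_score(score: float) -> float:
    """Reject values outside the range [0, 1] required by the rules above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range [0, 1]: {score}")
    return score


def order_by_score(scored_items: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Return (item, score) pairs with the most relevant items first."""
    checked = [(item, check_score(score)) for item, score in scored_items]
    return sorted(checked, key=lambda pair: pair[1], reverse=True)


ranked = order_by_score([("book-a", 0.2), ("book-b", 0.9), ("book-c", 0.5)])
```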
Similar to their use in a for clause, score variables may be specified in a let clause. A score variable in a let clause is also bound to the score of the expression evaluation, but in the let clause one score is determined for the complete result. The let variable may be dropped from the let clause if the score variable is present.
The production for the extended let clause follows.
When the score option is used in a for clause, the expression following the in keyword has the dual purpose of filtering, i.e., driving the iteration, and determining the scores. It is possible, however, to specify separate expressions for filtering and scoring by combining a simple for clause with a let clause that uses scoring. The following is an example of this.
book elements with chapter titles that contain "testing". Along with the book elements, scores are returned. These scores, however, reflect whether the book content contains "web site" and "usability".
Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above, it is not possible to infer the Boolean value of the FTContainsExpr in the let clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains only "web site" but not "usability", as this may be considered more relevant than a book that contains neither.
The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider the following replacement of the clause let score $s := FTContainsExpr where a function score is applied to some FTContainsExpr. If the function score were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a score function would be able to return, and no further differentiation would be possible.
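The argument can be made concrete with a small Python sketch. All names here are hypothetical, and the "fraction of query terms found" formula is merely one possible scoring scheme chosen for illustration. It is not a scheme defined by this specification.

```python
from typing import List, Set

QUERY_TERMS = ["web site", "usability"]


def contains_all(content_terms: Set[str], query_terms: List[str]) -> bool:
    """Stand-in for the Boolean result of an FTContainsExpr."""
    return all(term in content_terms for term in query_terms)


def first_order_score(result: bool) -> float:
    """Sees only the Boolean result, so it can return at most two values."""
    return 1.0 if result else 0.0


def second_order_score(content_terms: Set[str], query_terms: List[str]) -> float:
    """Sees the search criteria and the content, so it can differentiate further."""
    found = sum(1 for term in query_terms if term in content_terms)
    return found / len(query_terms)


books = [{"web site", "usability"}, {"web site"}, {"cooking"}]
first_order = {first_order_score(contains_all(b, QUERY_TERMS)) for b in books}
second_order = {second_order_score(b, QUERY_TERMS) for b in books}
```

In this sketch the second and third books are indistinguishable to the first-order function, whereas the second-order function assigns the book containing only "web site" an intermediate value.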
The effect of weights on the result score is implementation-dependent. However, weight declarations must follow these rules:
Weights in an FTContainsExpr are significant only in relation to each other; and
When no explicit weight is specified, the default weight is 0.5.
Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
The XQuery Static Context is extended by a component for each of the
full-text match options. Thus, the default of a match option in a
query may be changed by providing a setting in the static context using the
following declaration syntax.
This section describes
FTSelections which contain the full-text
operators in the
The
The "weight" value is the result of evaluating ExprSingle and can be any numeric value.
The syntax and semantics of the individual full-text selection operators follow.
This XML document fragment is the source document for examples in this section.
Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results may be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
An FTWords is an FTWordsValue followed by the optional modifier FTAnyallOption.
The right-hand side of FTWordsValue is an XQuery expression which must evaluate to a sequence of string values or nodes of type "xs:string". The result is then atomized into a sequence of strings which is tokenized into a sequence of tokens and phrases. If the atomized sequence is not a subtype of "xs:string*", an error is raised:
If the "any" option is specified, a match occurs, if and only if at least one token or phrase in the sequence has a match in the searched text.
If the "all" option is specified, a match occurs, if and only if all of the tokens and phrases in the sequence are matched in the searched text.
If the "phrase" option is specified, all words and phrases are used to create a sequence of ordered words representing a new phrase. A match occurs, if and only if the resulting phrase is matched in the searched text.
If the "any word" option is specified, a match occurs, if and only if at least one token in the sequence of tokens and phrases is matched in the searched text.
If the "all word" option is specified, a match occurs, if and only if all tokens in the sequence of tokens and phrases are matched in the searched text.
If no option is specified, "any" is the default.
If the result is a single string, "any", "all", and "phrase" are equivalent.
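The five options can be compared side by side in the following Python sketch. It is an informal illustration, not the formal semantics: the function names are hypothetical, tokenization is a naive split on word characters, matching is case-insensitive, and the searched text is made up for the example.

```python
import re
from typing import List


def tokens_of(text: str) -> List[str]:
    """Naive, case-insensitive tokenization (an assumption for this sketch)."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(text_tokens: List[str], phrase: List[str]) -> bool:
    """True if the phrase tokens occur consecutively in the text tokens."""
    n = len(phrase)
    if n == 0:
        return False
    return any(text_tokens[i:i + n] == phrase
               for i in range(len(text_tokens) - n + 1))


def ft_words(text: str, query_strings: List[str], option: str = "any") -> bool:
    text_tokens = tokens_of(text)
    phrases = [tokens_of(q) for q in query_strings]    # one phrase per string
    words = [t for phrase in phrases for t in phrase]  # all individual tokens
    if option == "any":
        return any(contains_phrase(text_tokens, p) for p in phrases)
    if option == "all":
        return all(contains_phrase(text_tokens, p) for p in phrases)
    if option == "phrase":
        return contains_phrase(text_tokens, words)     # one combined phrase
    if option == "any word":
        return any(w in text_tokens for w in words)
    if option == "all word":
        return all(w in text_tokens for w in words)
    raise ValueError(f"unknown option: {option}")


text = "expert reviews improve web site usability"
```

For example, the query string "usability web" does not match this text under "any", because the two tokens are not adjacent in that order, but it does match under "all word".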
returns the book element whose number is 1, because its title element contains the token "Expert".
returns the book element whose number is 1, because its title element contains the phrase "Expert Reviews".
returns the book element whose number is 1, because its title element contains the two tokens "Expert" and "Reviews".
returns false, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the tokens in the phrase.
returns the numbers of book elements by "Marigold" with a title about "Web Site Usability", sorted in descending score order.
A match must satisfy at least one of the
returns the book element written by "Millicent".
A match must satisfy all of the
returns true, since the book title contains "usability" and "testing".
returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element.
A match to
returns true, because "usability" appears in the title and the p elements, and the occurrence within the phrase "Usability Testing" in the title element is not considered.
The right-hand side of a
returns the empty sequence, because all book elements contain "usability".
returns true, because book elements contain "information" and "retrieval" but not "information retrieval".
returns book elements containing "web site usability" but not "usability testing".
The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.
returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability".
returns false, because although "Montana" and "Millicent" appear in the title element, they do not appear in the order they are written in the query.
Possible scopes are sentences and paragraphs.
By default, there are no restrictions on the scope of the matches.
If two tokens appear in the same sentence and in different sentences, then both same sentence and different sentence return true. The same is true for same paragraph and different paragraph.
returns false, because the tokens "usability" and "Marigold" are not contained within the same sentence.
returns true, because the tokens "usability" and "Marigold" are contained within different sentences.
returns a book element, because it contains "usability" and "testing" in the same paragraph.
returns a book element, because "site" and "errors" appear in the same sentence.
Some subtle relationships between
| ("at" "least"
| ("at" "most"
| ("from"
Let the value of the first (or only)
The following rule applies to
Zero words (sentences, paragraphs) means adjacent tokens (sentences, paragraphs).
If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the closed interval [0, M]. If "from-to" is specified, then the range is the closed interval [M, N].
Here are some examples of
'exactly 0' specifies the range [0, 0].
'at least 1' specifies the range [1, unbounded).
'at most 1' specifies the range [0, 1].
'from 5 to 10' specifies the range [5, 10].
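The mapping from range keywords to intervals can be sketched in Python as follows. The function names and the use of math.inf for the unbounded upper end are choices made for this illustration only.

```python
import math
from typing import Optional, Tuple

Bounds = Tuple[float, float]


def ft_range(kind: str, m: int, n: Optional[int] = None) -> Bounds:
    """Return the (lower, upper) bounds for a range keyword."""
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)   # no upper bound
    if kind == "at most":
        return (0, m)
    if kind == "from-to":
        if n is None:
            raise ValueError("'from-to' needs both M and N")
        return (m, n)
    raise ValueError(f"unknown range kind: {kind}")


def in_range(value: int, bounds: Bounds) -> bool:
    """True if value lies within the interval described by bounds."""
    low, high = bounds
    return low <= value <= high
```

For instance, in_range(3, ft_range("at most", 2)) is false, while in_range(7, ft_range("from-to", 5, 10)) is true.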
The distances computed by FTDistance are not affected by the presence or absence of element boundaries in the text. Stop words are counted in those computations whether they are ignored or not.
returns false, because "information" and "retrieval" are at least 11 tokens apart.
returns true, because "web", "site", and "usability" have at most 2 intervening tokens between them.
returns the book title. A similar query for the p element would return false because "web site" and "usability" have two intervening tokens between them.
A match of an
returns true, because "web", "site", and "usability" are within a window of 5 tokens in the title element.
returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 tokens.
returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3.
returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 tokens which would not also contain an occurrence of "and".
In the document fragment "very very big":
The
The
The
The
returns book numbers because book elements contain 2 or more occurrences of "usability".
returns the empty sequence, because there are 4 occurrences of "usability" || "testing" in the designated title.
returns true, because the book element contains 3 occurrences of "usability" in its title element although its p element contains only 1 occurrence.
The "at" "start" option finds matches in which the tokens or phrases are the first tokens or phrases in the tokenized string value of the element being searched.
The "at" "end" option finds matches in which the tokens or phrases are the last tokens or phrases in the tokenized string value of the element being searched.
The "entire" "content" option finds matches in which the tokens or phrases are the entire content of the tokenized string value of the element being searched.
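The three options can be illustrated with the following Python sketch, which operates on already tokenized input. The function name matches_anchor and the sample token list are made up for this example. Positions of matches within larger FTSelections are not modeled.

```python
from typing import List


def matches_anchor(text_tokens: List[str], phrase: List[str], anchor: str) -> bool:
    """Check a phrase against the start, end, or entire tokenized content."""
    n = len(phrase)
    if n == 0:
        return False
    if anchor == "at start":
        return text_tokens[:n] == phrase
    if anchor == "at end":
        return text_tokens[-n:] == phrase
    if anchor == "entire content":
        return text_tokens == phrase
    raise ValueError(f"unknown anchor: {anchor}")


tokens = "the tool runs fast propagating few errors".split()
phrase = ["propagating", "few", "errors"]
```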
returns each title element starting with the phrase "improving the usability of a web site".
returns each p element ending with the phrase "propagating few errors".
returns each note element whose entire content is "this site has been approved by the web site users association".
If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:
is equivalent to the query
We describe each match option in more detail in the following sections.
| "uppercase"
| ("case" "sensitive")
| ("case" "insensitive")
There are four possible character case options:
The option "uppercase" matches tokens and phrases with uppercase characters, regardless of the case of characters of the tokens and phrases as they are written in the query.
The option "lowercase" matches tokens and phrases with lowercase characters, regardless of the case of characters of the tokens and phrases as they are written in the query.
The option "case" "insensitive" matches the uppercase and lowercase characters of tokens and phrases. The case of characters as they are written in the query is not considered.
The option "case" "sensitive" matches the case of the characters in tokens and phrases as they are written in the query.
The default is "case insensitive".
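The four case options can be sketched in Python as a comparison of a single text token with a single query token. This is a simplification that ignores collations entirely and uses Python's own lower() and upper() methods. The function name case_match is an assumption made for this example.

```python
def case_match(text_token: str, query_token: str,
               option: str = "case insensitive") -> bool:
    """Compare one text token with one query token under a case option."""
    if option == "case insensitive":
        return text_token.lower() == query_token.lower()
    if option == "case sensitive":
        return text_token == query_token
    if option == "lowercase":
        # The text must contain the lowercase form, however the query is written.
        return text_token == query_token.lower()
    if option == "uppercase":
        # The text must contain the uppercase form, however the query is written.
        return text_token == query_token.upper()
    raise ValueError(f"unknown case option: {option}")
```

Under this sketch, the text token "Usability" matches the query token "usability" with the default option but not with "lowercase".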
The following table summarizes the interactions between the case match options and the use of the default collations.
Default collation options/Case options | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation) |
insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI |
sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error |
uppercase | uppercase(Expr) + UCC | uppercase(Expr) + CCS | CCI |
lowercase | lowercase(Expr) + UCC | lowercase(Expr) + CCS | CCI |
In this table, "else error" means "Otherwise, an error
is raised:
returns false, because the title element doesn't contain "usability" in lower-case characters.
returns true, because the character case is not considered.
| ("without" "diacritics")
| ("diacritics" "sensitive")
| ("diacritics" "insensitive")
There are four possible diacritics options:
The option "with" "diacritics" matches tokens and phrases with diacritics, regardless of whether the diacritics are written in the query.
The option "without" "diacritics" matches tokens and phrases without diacritics, regardless of whether the diacritics are written in the query.
The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.
The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.
The default is "diacritics insensitive".
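The difference between the "diacritics insensitive" and "diacritics sensitive" options can be sketched in Python using Unicode normalization. This is one possible approach, assumed for illustration: it ignores collations, does not cover the "with diacritics" and "without diacritics" options, and the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritics_match(text_token: str, query_token: str,
                     option: str = "diacritics insensitive") -> bool:
    """Compare two tokens with or without regard to diacritics."""
    if option == "diacritics insensitive":
        return strip_diacritics(text_token) == strip_diacritics(query_token)
    if option == "diacritics sensitive":
        # Normalize so that precomposed and decomposed forms compare equal.
        return (unicodedata.normalize("NFC", text_token)
                == unicodedata.normalize("NFC", query_token))
    raise ValueError(f"unknown diacritics option: {option}")
```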
The following table summarizes the interactions between the diacritics match options and the use of the default collations.
Default collation options/Diacritics options | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation) |
insensitive | compare as if with and without | diacritics-insensitive variant of CDS if it exists, else error | CDI |
sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error |
with diacritics | "resume diacritic insensitive" not in "resume" | "resume diacritic insensitive" not in "resume" | CDI |
without diacritics | "resume" not in "resume diacritic sensitive" | "resume" not in "resume diacritic sensitive" | CDI |
In this table, "else error" means "Otherwise, an error
is raised:
returns true, because the editor element contains the token "Vera" with an acute accent.
returns false, because the editor element does not contain the token "Vera" without an acute accent.
The "with stemming" option specifies that matches may contain tokens that have the same stem as the tokens and phrases written in the query. It is implementation-defined what a stem of a token is.
The "without stemming" option specifies that the tokens and phrases are not stemmed.
It is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach.
The default is "without stemming".
returns true, because the title of the specified book contains "improving" which has the same stem as "improve".
| ("with" "thesaurus" "(" (
| ("without" "thesaurus")
The at
in
Thesauri add related tokens and phrases to the search. Thus, the user may narrow, broaden, or otherwise modify the search using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related search tokens and phrases in a disjunction (FTOr).
A thesaurus may be standards-based or locally-defined. It may be a traditional thesaurus, or a taxonomy, soundex, ontology, or topic map. How the thesaurus is represented is implementation-dependent.
FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.
Relationships include, but are not limited to, the relationships
and their abbreviations presented in
The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri.
The "without thesaurus" option specifies that no thesaurus will be used.
The "with default thesaurus" option specifies that a system-defined default thesaurus with a system-defined relationship is used. The default thesaurus may be used in combination with other explicitly specified thesauri.
The default is "without thesaurus".
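Thesaurus expansion with a level limit can be sketched as follows. The thesaurus content, the dictionary representation, and the function name are assumptions made for illustration; the sketch handles only a "narrower terms" relationship and reads the level range simply as "at most N levels", with no limit meaning all levels.

```python
# Hypothetical thesaurus: each term maps to its directly narrower terms.
NARROWER = {
    "web site components": ["navigation", "layout"],
    "navigation": ["menus"],
}


def expand(term, levels=None):
    """Return the term plus narrower terms, descending at most `levels` levels."""
    seen = [term]
    frontier = [term]
    depth = 0
    while frontier and (levels is None or depth < levels):
        next_frontier = []
        for current in frontier:
            for narrower in NARROWER.get(current, []):
                if narrower not in seen:  # guards against cycles
                    seen.append(narrower)
                    next_frontier.append(narrower)
        frontier = next_frontier
        depth += 1
    return seen


# The returned alternatives are then searched as if joined in a disjunction.
print(expand("web site components", levels=1))
print(expand("web site components"))
```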
returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties".
returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout".
returns a book element containing "Marigold", which sounds like "Merrygould".
| ("without" "stop" "words")
| ("with" "default" "stop" "words"
| ("("
at
. If a URI
is used, it must point to a sequence of string atoms or nodes of type
"xs:string". In both cases, no tokenization is performed on the
strings: they are used as they occur in the sequence.
The "with stop words" option specifies that if a token is within the
specified collection of stop words, it is removed from the search and
any token may be substituted for it. Stop words retain their position
numbers and are counted in
Multiple stop word lists may be combined using "union" or "except". If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
The "with default stop words" option specifies that an implementation-defined collection of stop words is used.
The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.
The default is "without stop words".
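The "union" and "except" combinations of stop word lists correspond to ordinary set operations. The following sketch assumes that each list has already been resolved to a sequence of strings; the function name is hypothetical.

```python
def combine_stop_words(left, operator, right):
    """Combine two stop word lists; strings are used as they occur."""
    left_set, right_set = set(left), set(right)
    if operator == "union":
        # Every string in either list is a stop word.
        return left_set | right_set
    if operator == "except":
        # Only strings in the left list that are not in the right list.
        return left_set - right_set
    raise ValueError("unknown operator: " + operator)


print(sorted(combine_stop_words(["the", "then"], "union", ["of"])))
print(sorted(combine_stop_words(["the", "then"], "except", ["then"])))
```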
returns true, because the document contains the phrase "propagating few errors".
Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
returns false, because "of" is not in the p element between "propagating" and "errors".
uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any token as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would have also returned the book if the phrase "planning and conducting" or "planning before conducting" had been in its content.
returns book elements containing "planning then conducting", but does not return book elements containing "planning and conducting", since it exempts "then" from being a stop word.
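The asymmetric stop word semantics can be sketched as a phrase matcher in which a stop word in the query matches any token at the same position, while stop words in the document receive no special treatment. The function name is hypothetical and the tokens are assumed to be already tokenized and normalized.

```python
def phrase_matches(query_tokens, doc_tokens, stop_words=frozenset()):
    """True if doc_tokens contains query_tokens as a consecutive phrase.

    A query token that is a stop word keeps its position but matches any
    document token; document tokens are never treated as stop words.
    """
    length = len(query_tokens)
    for start in range(len(doc_tokens) - length + 1):
        window = doc_tokens[start:start + length]
        if all(q in stop_words or q == d for q, d in zip(query_tokens, window)):
            return True
    return False


query = ["planning", "then", "conducting"]
doc = ["planning", "and", "conducting"]
print(phrase_matches(query, doc, stop_words={"then"}))
print(phrase_matches(query, doc))
```

With "then" as a stop word the phrase "planning and conducting" is matched; without stop words only the literal phrase is matched.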
The StringLiteral following the keyword language
designates one language. It must either be castable to "xs:language",
or be the value "none". Otherwise, an error is raised:
The "language" option influences tokenization, stemming, and stop words.
If the language "none" option is specified, no language is selected.
The set of valid language identifiers is implementation-defined.
By default, there is no language selected.
This is an example where the language option is used to select the appropriate stop word list.
In addition to specifying the "with wildcards" option, indicators (represented by periods (.)) and qualifiers are appended to or inserted into tokens being searched. Zero or more characters replace each indicator and qualifier.
Indicators are mandatory. When the "with wildcards" option is present, one or more periods (.) must be appended at the beginning or end of tokens or inserted into tokens. If the period is at the beginning of a token, the wildcard is a prefix wildcard. If the period is at the end of a token, it is a suffix wildcard. If the period is inserted into a token, it is an infix wildcard.
When the "with wildcards" option and one or more periods (.) appended to or inserted into tokens are present, characters are appended or inserted at each of the periods. Any characters may be appended or inserted except newline characters (#xA), return characters (#xD), and tab characters (#x9). The number of characters depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.
If a period is present, but no qualifiers, one character is appended or inserted.
If a period is followed by a question mark (.?), zero or one characters are appended or inserted.
If a period is followed by an asterisk (.*), zero or more characters are appended or inserted.
If a period is followed by a plus sign (.+), one or more characters are appended or inserted.
If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters is appended or inserted.
The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are recognized as regular characters.
The default is "without wildcards".
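The indicator and qualifier rules above can be sketched as a translation of a wildcard token into a regular expression. This is an illustration only; the function names and the sample patterns are hypothetical, and all characters other than a period and its qualifier are treated literally.

```python
import re

# A period stands for any character except newline, return, and tab.
ANY_CHAR = r"[^\n\r\t]"
# Qualifiers that may directly follow a period: ?, *, +, or {n,m}.
QUALIFIER = re.compile(r"\?|\*|\+|\{\d+,\d+\}")


def wildcard_to_regex(token: str) -> re.Pattern:
    """Translate a token with wildcard indicators into a compiled regex."""
    parts = []
    i = 0
    while i < len(token):
        if token[i] == ".":
            parts.append(ANY_CHAR)
            qualifier = QUALIFIER.match(token, i + 1)
            if qualifier:
                parts.append(qualifier.group())
                i = qualifier.end()
                continue
        else:
            parts.append(re.escape(token[i]))
        i += 1
    return re.compile("".join(parts))


def wildcard_match(pattern: str, token: str) -> bool:
    """True if the whole token matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None


print(wildcard_match("improv.*", "improving"))  # suffix wildcard
print(wildcard_match(".?site", "site"))         # prefix wildcard
print(wildcard_match("w.ll", "well"))           # infix wildcard
```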
returns true, because the title element contains "improving".
returns true, because the title element contains "site".
returns true, because the p element contains "well".
Let N1, N2, ..., Nk be the sequence of nodes of the search context. The expression UnionExpr is evaluated in the context of each node Ni being searched. That is, the search context expression of the ftcontains predicate creates a new focus for the evaluation of the UnionExpr
given with E1/E2
or a filter expression E1[E2]
(see
Now, let I1, I2, ..., In be the sequence of items that UnionExpr evaluates to. For each Ni (i=1..k) a copy is made that omits each node Ij (j=1..n) that is not Ni. Those copies form the new search context. If UnionExpr evaluates to an empty sequence, no nodes are omitted.
In the following fragment, if .//annotation is ignored, "Web Usability" will be found 2 times: once in the title element and once in the editor element. The 2 occurrences in the 2 annotation elements are ignored. On the other hand, "expert" will not be found, as it appears only in an annotation element.
By default, no element content is ignored.
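The copy-and-omit step for ignored content can be sketched with the standard ElementTree library. The document fragment below is invented for illustration (it is not the fragment of this specification), and the function name is hypothetical; only descendant elements selected by tag name are omitted in this sketch.

```python
import copy
import xml.etree.ElementTree as ET


def prune(node, ignored_tag):
    """Return a deep copy of node with all descendant ignored_tag elements omitted."""
    clone = copy.deepcopy(node)
    for parent in list(clone.iter()):
        previous = None
        for child in list(parent):
            if child.tag == ignored_tag:
                # Keep the text that follows the omitted element.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return clone


# Hypothetical fragment with two annotation elements.
book = ET.fromstring(
    "<book>"
    "<title>Web Usability<annotation>Web Usability notes</annotation></title>"
    "<editor>Web Usability Press<annotation>expert on Web Usability</annotation></editor>"
    "</book>"
)
searched = "".join(prune(book, "annotation").itertext())
print(searched.count("Web Usability"), "expert" in searched)
```

The original node is left unchanged; only the copy that forms the new search context lacks the ignored subtrees.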
This section describes the formal semantics of XQuery 1.0 and XPath 2.0 Full-Text. The figure below shows how XQuery 1.0 and XPath 2.0 Full-Text integrates with XQuery 1.0 and XPath 2.0.
The following diagram represents the interaction of XQuery 1.0 and XPath 2.0 Full-Text with the rest of the XQuery 1.0 and XPath 2.0 languages. It specifies how full-text expressions can be nested within XQuery 1.0 and XPath 2.0 expressions and vice versa.
Arrow 1 represents the composability of the XQuery 1.0 and XPath 2.0 expressions. This is outside the scope of this document and will not be discussed further.
Arrow 2 shows how XQuery 1.0 and XPath 2.0 expressions
can be nested inside FTSelections by evaluating them to a sequence of
items. If the XQuery 1.0 and XPath 2.0 expression is nested on the left-hand side of a
Arrow 3 represents the composability of
Arrow 4 shows how the result of the evaluation of XQuery 1.0 and XPath 2.0 Full-Text and scoring
expressions are integrated into the XQuery 1.0 and XPath 2.0
model. The section
The functions and schemas defined in this section are considered to be within the fts: namespace. These functions and schemas are used only for describing the semantics. There is no requirement that these functions and schemas be implemented, so no URI is associated with the fts: prefix.
Tokenization is subject to the following constraint:
Attribute values are not tokenized.
The following document fragment is the source document for examples in this section. Tokenization is implementation-defined. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
In this sample tokenization, tokens are delimited by punctuation and whitespace symbols.
The token "Ford" is at relative position 1.
The token "Mustang" is at relative position 2.
The token "2000" is at relative position 3.
Relative position numbers are assigned sequentially through the end of the document.
Hence each token occupies exactly one position, and no overlapping of tokens occurs. The relative positions of token occurrences are shown below in parentheses.
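A tokenizer in the spirit of this sample tokenization can be sketched as follows. Tokenization is implementation-defined, so this is only one possible choice; the function name is hypothetical and the regular expression treats every run of word characters as a token.

```python
import re


def tokenize(text):
    """Split on punctuation and whitespace and number the tokens from 1."""
    words = re.findall(r"\w+", text)
    return [(word, position) for position, word in enumerate(words, start=1)]


print(tokenize("Ford Mustang 2000"))
```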
The relative positions of paragraphs are determined similarly. In this sample tokenization, the paragraph delimiters are start tags, end tags, and end of line characters.
The tokens in the first element are assigned relative paragraph number 1.
The tokens from the next element are assigned relative paragraph number 2.
Relative paragraph numbers are assigned sequentially through the end of the document.
The relative positions of sentences are determined similarly using sentence delimiters.
Implementations may provide the means to ignore or side-step certain structural elements when performing tokenization. In the following example, the implementation has decided to ignore the markup for <bold> and prune out the entire subtree headed by <deleted>.
Using the same notation as before, this sample tokenization is shown below. All the token occurrences marked with a token position also have the same sentence and paragraph relative positions. Note that there are no tokens marked for the ignored subtree.
Two representations of tokenized text will be employed in the formal semantics functions, one for the search strings of a query and one for matched token occurrences of search context items.
A
A
A
a unique identifier that captures the relative position of the first token occurrence of the sequence in the document order: startPos
a unique identifier that captures the relative position of the last token occurrence of the sequence in the document order: endPos
the relative position of the sentence containing the first token occurrence or zero if the tokenizer does not report sentences: startSent
the relative position of the sentence containing the last token occurrence or zero if the tokenizer does not report sentences: endSent
the relative position of the paragraph containing the first token occurrence or zero if the tokenizer does not report paragraphs: startPara
the relative position of the paragraph containing the last token occurrence or zero if the tokenizer does not report paragraphs: endPara
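The six positional properties listed above can be pictured as a small record. The class and helper below are an illustrative sketch with hypothetical names; they cover only these positional properties, with zero standing for "not reported by the tokenizer".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    start_pos: int       # startPos
    end_pos: int         # endPos
    start_sent: int = 0  # startSent, 0 if sentences are not reported
    end_sent: int = 0    # endSent
    start_para: int = 0  # startPara, 0 if paragraphs are not reported
    end_para: int = 0    # endPara


def span(first: TokenInfo, last: TokenInfo) -> TokenInfo:
    """Build the interval covering a phrase from its first and last token."""
    return TokenInfo(first.start_pos, last.end_pos,
                     first.start_sent, last.end_sent,
                     first.start_para, last.end_para)


ford = TokenInfo(1, 1, 1, 1, 1, 1)
mustang = TokenInfo(2, 2, 1, 1, 1, 1)
print(span(ford, mustang))
```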
The following matching function is the central implementation-defined primitive performing the full-text retrieval.
The above function returns the $searchContext
that match the search string represented by
the sequence $searchTokens
, when using the match
options in $matchOptions
and stop words in
$stopWords
. If $searchTokens
is a
sequence of more than one search token, each returned
While this matching function assumes a tokenized representation of the search strings, it does not assume a tokenized representation of the input items in $searchContext, i.e. the texts in which the search happens. Hence, the tokenization of the search context is implicit in this function and coupled to the retrieval of matches. Of course, this does not imply that tokenization of the search context cannot be done a priori. Because tokenization is implementation-defined, the tokenization of each item in $searchContext does not necessarily take into account the match options in $matchOptions or the search tokens in $searchTokens. This allows implementations to tokenize and index input data without the knowledge of particular match options used in full-text queries.
The sequence of nodes in the XQuery 1.0 and XPath 2.0 Data Model is
inadequate to support fully composable
XQuery 1.0 and XPath 2.0 Full-Text adds relative token, sentence, and
paragraph position numbers via
The
Intuitively,
The
Since in most of the examples below the tokens span only a single
position, we characterize the startPos
and the endPos
attribute. Furthermore, for expository reasons, we
include in each
The simplest example of an "Mustang"
. The
As shown, the "Mustang"
. The result represented by the first
A more complex example of an "Ford Mustang"
. The
There are two possible results for this
An even more complex example of an "Mustang"
&& ! "rust"
that searches for
"Mustang" but not "rust". The
This example introduces
The XML schema for representing
The stokenNum
attribute in
stokenNum
attribute stores
the number of search tokens used when evaluating the queryPos
attribute in new
The XML representation of the
<left>
and <right>
descendant elements. For unary <selection>
descendant element is used. Additional
characteristics of
The denotational semantics for the evaluation of
The
The semantics for the
For
concreteness, assume that the ftcontains
expression such
as searchContext ftcontains ftselection
. In order to
determine the
ftselection
, the
fts:evaluate($ftselection,
$searchContext, $matchOptions, 0)
, where
$ftselection
is the XML representation of the
ftselection
and
$searchContext
is bound to the result of
the evaluation of the XQuery expression
searchContext
.
Initially, the
$searchTokensNum
is 0, i.e., no
search tokens have been processed.
The variable $matchOptions
is bound to the
list of match options as defined in the static context (see
Appendix ftselection
modify the match options collection as
evaluation proceeds.
Match options are applied to an
The top match option in the stack is applied first.
The second match option is applied next.
Match options are applied sequentially down to the bottom of the stack.
Ordering among match options is necessary because match options are not always commutative. For example, synonym(stem(word)) is not always the same as stem(synonym(word)). Naturally, match options may be reordered when they commute, but this is an optimization issue and is beyond the scope of this document.
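The need for an ordering can be seen with two toy transformations. The stemmer and the synonym table below are deliberately naive and entirely hypothetical; they only serve to show that applying the same two options in a different order can yield a different search token.

```python
# Hypothetical single-entry synonym table.
SYNONYMS = {"improving": "bettering"}


def stem(word):
    """Toy stemmer: strip a trailing 'ing'."""
    return word[:-3] if word.endswith("ing") else word


def synonym(word):
    """Toy thesaurus lookup: replace a word by its synonym, if any."""
    return SYNONYMS.get(word, word)


def apply_options(word, stack):
    """Apply match options sequentially, top of the stack first."""
    for option in stack:
        word = option(word)
    return word


print(apply_options("improving", [stem, synonym]))  # synonym(stem(word))
print(apply_options("improving", [synonym, stem]))  # stem(synonym(word))
```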
Given the invocation of: fts:evaluate($ftselection,
$searchContext, $matchOptions)
, evaluation proceeds as
follows. First, $ftselection
is checked to see whether
a match option is applied 1) on a nested
If $ftselection
contains a match option,
then it modifies the context for the nested
If $ftselection
contains a weight
specification, then the specification is ignored because it
does not alter the semantics. The
If $ftselection
is an
If $ftselection
contains neither a match
option nor a weight specification and is not an &&
, ||
, window
.
These operations are fully-compositional and may be
invoked on nested
First, the
The FTSelection1
which is
generically named
For example, let
FTSelection1
be FTSelection2 &&
FTSelection3
. Here FTSelection2
and
FTSelection3
may themselves be arbitrarily nested
FTSelection2
and FTSelection3
, and the
resulting &&
.
The semantics of the
The formal semantics of the
The
The $tokenInfo1
and
$tokenInfo2
. For example, two consecutive
tokens have a distance of 0 tokens.
The $tokenInfo1
and $tokenInfo2
.
The $tokenInfo1
and
$tokenInfo2
.
The $tokenInfo
describes a token whose start position is the first position of
the node $searchContext
.
The $tokenInfo
describes a token whose end position is the last position of
the node $searchContext
.
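The position-based helpers described above can be sketched on plain (startPos, endPos) pairs. The function names are hypothetical, and the distance computation assumes that the two intervals do not overlap.

```python
def word_distance(a, b):
    """Number of token positions strictly between two (start, end) intervals."""
    first, second = sorted([a, b])
    return second[0] - first[1] - 1


def is_start_token(token, context_first_pos):
    """True if the token starts at the first position of the search context node."""
    return token[0] == context_first_pos


def is_end_token(token, context_last_pos):
    """True if the token ends at the last position of the search context node."""
    return token[1] == context_last_pos


print(word_distance((1, 1), (2, 2)))  # consecutive tokens
print(word_distance((5, 5), (1, 2)))  # order of arguments does not matter
```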
An fts:searchToken
items, and 4) the position where the latter search string occurs in the
query.
If after the application of all the match options, the sequence
of search tokens returned for an
The Pos: N
, if the attributes
startPos
and endPos
are the same
with N
being that position.
There are five variations of
When any word is specified, at least one token in the tokenization of the nested expression must be matched.
When all word is specified, all tokens in the tokenization of the nested expression must be matched.
When phrase is specified, all tokens in the tokenization of the nested expression must be matched as a phrase.
When any is specified, at least one string atomic value in the nested expression must be matched as a phrase.
When all is specified, all string atomic values in the nested expression must be matched as a phrase.
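The five variations can be contrasted in a simplified sketch that returns only a boolean rather than a full match structure. The function names are hypothetical; each search string is assumed to be already tokenized, and match options are not modeled.

```python
def contains_phrase(doc_tokens, phrase):
    """True if phrase occurs in doc_tokens as consecutive tokens."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))


def ft_words(doc_tokens, search_strings, mode):
    """search_strings holds one token list per string atomic value."""
    flat = [token for string in search_strings for token in string]
    if mode == "any word":
        return any(contains_phrase(doc_tokens, [t]) for t in flat)
    if mode == "all word":
        return all(contains_phrase(doc_tokens, [t]) for t in flat)
    if mode == "phrase":
        return contains_phrase(doc_tokens, flat)
    if mode == "any":
        return any(contains_phrase(doc_tokens, s) for s in search_strings)
    if mode == "all":
        return all(contains_phrase(doc_tokens, s) for s in search_strings)
    raise ValueError("unknown mode: " + mode)


doc = "ford mustang in great condition".split()
strings = [["great", "condition"], ["ford", "focus"]]
for mode in ("any word", "all word", "phrase", "any", "all"):
    print(mode, ft_words(doc, strings, mode))
```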
The semantics for any word
is specified
is given below. Since
The tokenized search strings are passed to
ApplyFTWordsAnyWord as a sequence of
fts:searchItem
, each containing the tokens of
a single search string. A single flattened sequence of all
tokens (of type fts:searchToken
) over all
search items is constructed. For each of these,
the result of
The semantics for all word
is specified is similar to the above, however composes a
conjunction. It is given below.
The semantics for phrase
is specified
is given below.
The
The semantics for any
is specified is
given below.
The any
specified forms the disjunction of the
The semantics for all
is specified
is given below.
The difference between all
and
any
is the use of conjunction instead of
disjunction.
The
The parameters of the
The
For example, consider the "Mustang" || "Honda"
. The
The
The parameters of the
The result of the conjunction is a new
For example, consider the "Mustang" && "rust"
. The
source
The
The parameters of the
The generation of the resulting
In the
The function
The function
For example, consider the ! ("Mustang" || "Honda")
. The
source
The
The parameters of the
The resulting
For example, consider the ("Ford" mildnot "Ford
Mustang")
. The
source
The
source
The
The parameters of the
The resulting
For example, consider the ("great" && "condition")
ordered
. The source
The
The parameters of the
The semantics of same sentence
is given below.
An same sentence
contains those
The semantics of different sentence
is given below.
An different sentence
contains those
The semantics of same paragraph
is analogous to same
sentence
and is given below.
The semantics of different paragraph
is analogous to
different sentence
and is given below.
The semantics for the general case is given below.
For example, consider the ("Mustang" && "Honda") same
paragraph
. The source
The
The parameters of the
The evaluation of scope functions depends on the type of the content match.
entire match
is evaluated as
distance exactly 0 words at start at end
, i.e., all the
at start
retains only
fts:isStartToken
.
at end
retains the
fts:isEndToken
.
The parameters of the
The semantics of word distance exactly N
is given below.
The semantics of word distance at least N
is given
below.
The semantics of word distance at most N
is given
below.
The semantics of word distance from M to N
is given
below.
The semantics of sentence distance exactly N
is given below.
The semantics of sentence distance at least N
is given below.
The semantics of sentence distance at most N
is given below.
The semantics of sentence distance from M to N
is given below.
The semantics of paragraph distance exactly N
is given below.
The semantics of paragraph distance at least N
is given below.
The semantics of paragraph distance at most N
is given below.
The semantics of paragraph distance from M to N
is given below.
The resulting
In the general case, the semantics is given below.
For example, consider the ("Ford Mustang" &&
"excellent") distance at most 3 words
.
The ("Ford Mustang" &&
"excellent")
are given below.
The result for the
The parameters of the
fts:DistanceType
, 4) a size, and 5) one
The semantics of window N words
is given below.
The semantics of window N sentences
is given below.
The semantics of window N paragraphs
is given below.
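A window check over token positions can be sketched as follows. The sketch assumes that a match satisfies "window N words" when all of its matched intervals fit between a smallest start position and a largest end position that span at most N positions; the function name is hypothetical, and the sentence and paragraph variants would use the corresponding sentence or paragraph numbers instead.

```python
def within_window(intervals, size):
    """True if all (start, end) intervals fit in a window of `size` positions."""
    lowest = min(start for start, _ in intervals)
    highest = max(end for _, end in intervals)
    return highest - lowest + 1 <= size


# A two-token phrase at positions 1-2 and a single token at position 9.
print(within_window([(1, 2), (9, 9)], 10))
print(within_window([(1, 2), (9, 9)], 8))
```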
The resulting
The semantics for the general function is given below.
For example, consider the ("Ford Mustang" &&
"excellent") window 10 words
.
The ("Ford Mustang" &&
"excellent")
are given below.
The result for the
The parameters of the
The function definitions depend on the range
specification
The general semantics is given below.
The semantics of occurs exactly N times
is given
below.
The semantics of occurs at least N times
is given below.
The semantics of occurs at most N times
is given
below.
The semantics of occurs from M to N times
is given below.
The way to ensure that
there are at least at least N
contains the possible
combinations of
The range [l, u] is represented by the condition
at least l and not at least u+1
. This transformation
is performed in the function
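The reduction of every range specification to "at least" conditions can be sketched on plain occurrence counts. The function names are hypothetical; the sketch represents a range [l, u] as "at least l and not at least u + 1", and works on a count rather than on match structures.

```python
def occurs_at_least(count, n):
    return count >= n


def occurs_from_to(count, lower, upper):
    """Range [lower, upper]: at least lower and not at least upper + 1."""
    return occurs_at_least(count, lower) and not occurs_at_least(count, upper + 1)


def occurs_exactly(count, n):
    return occurs_from_to(count, n, n)


def occurs_at_most(count, n):
    return not occurs_at_least(count, n + 1)


print(occurs_from_to(3, 2, 4), occurs_from_to(5, 2, 4))
```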
The semantics for the general case is given below.
The above function performs a sanity check to ensure that the nested
Otherwise, an error
For example, consider the "Mustang" occurs at least 2 times
. The source
"Mustang"
is given below.
The result consists of the pairs of the
XQuery 1.0 functions are used to
define the semantics of
The previous section described FTSelections without
giving any details about how
The extension is achieved by modifying an existing
function and adding functions that are specific to the
The semantics of most of the
Two
Differently from all other fts:ApplyFTWordsAny
.
The matching of the alternatives is performed with
For the semantics of the
The expansion of
The above function
This function determines how match options of the same kind overwrite each other, so that only one option of the same kind remains.
The details of the semantics of the remaining
The function
The function $tokens
in the thesaurus $thesaurusName
for the language
$thesaurusLanguage
using the relationship
$relationship
within the optional number of levels
$range
. If $tokens
consists of
more than one search token, it is regarded as a
phrase.
The thesaurus function returns a sequence of expansion
alternatives. Each alternative is regarded as a new search
phrase and is represented as a search item.
Alternatives are treated as though they are connected with
a disjunction (
$matchOptions
parameter to
$matchOptions
parameter to
$matchOptions
parameter to
The semantics for the
Stop words interact with
The stop words set is computed using the
fts:calcStopwords
function. The function uses
the function fts:resolveStopwordsUri
to resolve any URI
to a sequence of strings. Then, the stop words are
removed from the set of search tokens.
The
$matchOptions
parameter to
The xs:boolean
atomic value. This value is true
if and only if some node
in the search context satisfies the full-text condition given by the
Consider an EvaluationContext ftcontains FTSelection
,
where EvaluationContext
is an XQuery 1.0
expression that returns a sequence of nodes and
FTSelection
is an EvaluationContext
satisfies the FTSelection
.
If the EvaluationContext
ftcontains FTSelection without content IgnoreExpr
for
some XQuery 1.0 expression IgnoreExpr
, then
the following helper function is required.
In the general case, the XQuery 1.0 and XPath 2.0
The sequence of items returned by
EvaluationContext
;
The XML node representation of FTSelection
;
The sequence of nodes returned by
IgnoreExpr
, if that expression is present, or
the empty sequence otherwise; and
The XML representation of the set of default values for each of the
The
The $ignoreNodes
that is part of the tree of a node
in the search context is pruned from that tree using the function
This section addresses the semantics of
scoring variables in XQuery 1.0 for
and
let
clauses and XPath 2.0 for
expressions.
Scoring variables associate a numeric score with the result of the evaluation
of XQuery 1.0 and XPath 2.0 expressions. This numeric score
tries to estimate the value of a result item to the user
information need expressed using the XQuery 1.0 and XPath 2.0
expression. The numeric score is computed using an implementation-provided
There are numerous scoring algorithms used in practice. Most of the scoring algorithms take as inputs a query and a set of results to the query. In computing the score, these algorithms rely on the structure of the query to estimate the relevance of the results.
In the context of defining the semantics of XQuery 1.0 and XPath 2.0 Full-Text, passing the structure of the query poses a problem. The query is an XQuery 1.0 and XPath 2.0 expression and an XQuery 1.0 and XPath 2.0 Full-Text expression in particular. The semantics of XQuery 1.0 and XPath 2.0 expressions is expressed using functions that take sequences of items as arguments and return sequences of items. They are not aware of what expression produced a particular sequence, i.e., they are not aware of the expression structure.
To define the semantics of scoring in XQuery 1.0 and XPath 2.0 Full-Text using XQuery 1.0, expressions that produce the query result (or the functions that implement the expressions) must be passed as arguments. In other words, second-order functions are necessary. Current XQuery 1.0 and XPath 2.0 do not provide such functions.
Nevertheless, in the interest of the exposition, assume that such second-order functions are present. In particular, assume that there are two semantic second-order functions, fts:score and fts:scoreSequence, that take one argument (an expression) and return, respectively, the score value of this expression and a sequence of score values, one for each item to which the expression evaluates. The scores must satisfy
A for
clause containing a score variable
$scoreSeq
and $i
are
new variables, not appearing elsewhere, and
fts:scoreSequence
is the
second-order function.
Similarly, a let
clause containing a score variable
This section presents a more complex example for the evaluation of $doc
.
Consider the following
Begin by evaluating the
Step 1: Evaluate the "mustang"
.
Step 2: Evaluate the {"great", "excellent"} any word
.
Step 2.1: Match the token "great"
Step 2.2: Match the token "excellent"
Step 2.3: Combine the above
Step 3 - Apply the {("great", "excellent")} any word occurs at least 2 times
forming two pairs of
Step 4 - Apply the "Mustang"
&& ({("great", "excellent")} any word occurs at least 2
times)
forming all possible pairs of
Step 5 - Apply the ("Mustang"
&& ({("great", "excellent")} any word
occurs at least 2 times)) window 30 words
, filtering out
Step 6 - Evaluate "rust"
.
Step 7 - Apply the ! "rust"
,
transforming the StringInclude
into a
StringExclude
.
Step 8 - Apply the (("Mustang"
&& ({("great", "excellent")} any word occurs at least 2 times))
window 30 words) && ! "rust"
, forming all
possible combinations of three
Step 9: Apply the <offer>
elements determine
paragraph boundaries).
The resulting true
.
The EBNF in this document and in this section is aligned with
the current XML Query 1.0 grammar (see
|
|
|
|
|
|
| ("//"
|
| ("descendant" "::")
| ("attribute" "::")
| ("self" "::")
| ("descendant-or-self" "::")
| ("following-sibling" "::")
| ("following" "::")
| ("ancestor" "::")
| ("preceding-sibling" "::")
| ("preceding" "::")
| ("ancestor-or-self" "::")
| (
| ("*" ":"
|
|
|
| ("'" (
|
|
|
|
|
|
|
|
|
|
| (
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| "uppercase"
| ("case" "sensitive")
| ("case" "insensitive")
| ("without" "diacritics")
| ("diacritics" "sensitive")
| ("diacritics" "insensitive")
| ("with" "thesaurus" "(" (
| ("without" "thesaurus")
| ("without" "stop" "words")
| ("with" "default" "stop" "words"
| ("("
| ("at" "least"
| ("at" "most"
| ("from"
The following symbols are used only in the definition of
terminal symbols; they are not terminal symbols in the
grammar of
The EBNF in this document and in this section is aligned with
the current XPath 2.0 grammar (see
|
|
|
|
|
| ("//"
|
| ("descendant" "::")
| ("attribute" "::")
| ("self" "::")
| ("descendant-or-self" "::")
| ("following-sibling" "::")
| ("following" "::")
| ("namespace" "::")
| ("ancestor" "::")
| ("preceding-sibling" "::")
| ("preceding" "::")
| ("ancestor-or-self" "::")
| (
| ("*" ":"
| (
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| "uppercase"
| ("case" "sensitive")
| ("case" "insensitive")
| ("without" "diacritics")
| ("diacritics" "sensitive")
| ("diacritics" "insensitive")
| ("with" "thesaurus" "(" (
| ("without" "thesaurus")
| ("without" "stop" "words")
| ("with" "default" "stop" "words"
| ("("
| ("at" "least"
| ("at" "most"
| ("from"
The following symbols are used only in the definition of
terminal symbols; they are not terminal symbols in the
grammar of
The following table describes the full-text components of
the
Component | Default initial value | Can be overwritten or augmented by implementation? | Can be overwritten or augmented by a query? | Scope | Consistency rules |
---|---|---|---|---|---|
case
insensitive | overwriteable | overwriteable by prolog | lexical | Value must be
case insensitive or case sensitive . |
|
diacritics insensitive | overwriteable | overwriteable by prolog | lexical | Value must be diacritics insensitive or
diacritics sensitive . |
|
without stemming | overwriteable | overwriteable by prolog | lexical | Value must be without stemming or
with stemming . |
|
without thesaurus | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known thesauri. | |
Statically known thesauri | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a thesaurus list. |
without stopwords | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known stop word lists. | |
Statically known stop word lists | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a stop word list. |
no language is selected | overwriteable | overwriteable by prolog | lexical | Value must be castable to "xs:language" or "none". | |
Statically known languages | none | augmentable | cannot be augmented or overwritten by prolog | module | Each string uniquely identifies a language. |
without wildcards | no | overwriteable by prolog | lexical | Value must be with wildcards or without
wildcards . |
It is a type error if, during the static analysis phase,
an expression is found to have a static type
that is not appropriate for the context in which the expression occurs, or during the
dynamic evaluation phase, the dynamic type of a value does not match a required type as
specified by the matching rules in
It is a dynamic error if, in a function invocation, the argument corresponding to the specified function's collation parameter does not identify a supported collation.
We would like to thank the members of the XQuery and XPath Full-Text group for their fruitful discussions.
We would like to thank the following people for their contributions on earlier drafts of this document.
"Andrew Eisenberg" - IBM - andrew.eisenberg@us.ibm.com
"Roland Seiffert" - IBM - seiffert@de.ibm.com
"Andrew Cencini" - Microsoft - acencini@microsoft.com
"Nimish Khanolkar" - Microsoft - nimishk@exchange.microsoft.com
"Ashok Malhotra" - Oracle - ashok.malhotra@oracle.com
"Tapas Nayak" - Microsoft - tapasnay@exchange.microsoft.com
This appendix provides a summary of features defined in this specification
whose effect is explicitly implementation-defined.
Everything about tokenization, including the definition of
the term "words", is implementation-defined, except that each word consists of one or more consecutive characters; the tokenizer must preserve the containment hierarchy
(paragraphs contain sentences, which contain words); and the tokenizer must, when tokenizing two equal strings,
identify the same tokens in each.
Implementations are free to provide
It is
When the option "with default stop words" is used, an
The set of valid language identifiers is
Certain values in the static context (see
Sihem Amer-Yahia | 2005-04-08 | Updated case matrix | Updated case matrix row "sensitive", column "CCI" from "case-insensitive variant of CCI if it exists, else error" to "case-sensitive variant of CCI if it exists, else error". |
Sihem Amer-Yahia | 2005-05-02 | Closed issues with no changes | Closed Cluster B, Issue 28 IGNORE Syntax with no change to the document. Closed Cluster B, Issue 50 IGNORE Queries with no change to the document. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTTimes syntax | Closed Cluster G, Issue 14 FTTimesSelection and added a related bullet item in Section 3. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTWildCard syntax | Updated FTWildCardOption in Section 3. |
Sihem Amer-Yahia | 2005-05-03 | Updated introduction | Replaced "semantic element" with "semantic markup" and "tag" with "element" in the introduction. |
Sihem Amer-Yahia | 2005-05-03 | Added issue on error codes | Added Cluster J, Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-05-03 | Closed issues with no change | Closed Cluster A, Issue 54 Weight Granularity in Scoring with same resolution as for Cluster A, Issue 5 Score Weighting, no further change to document. Closed Cluster H, Issue 9 Window with no change to the document. Closed Cluster H, Issue 19 FTScopeSelection on structure with no change to the document. Closed Cluster E, Issue 25 MatchOption Syntax with no change to the document. Closed Cluster H, Issue 44 FTContains Semantics with no change to the document. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTContent syntax | Updated FTContent adding "entire content", Closed Cluster C, Issue 39 Exact Element Content. |
Sihem Amer-Yahia | 2005-05-03 | Closed issue on Boolean Naming | Closed Cluster F, Issue 38 Boolean Naming. Changes to the document are pending awaiting a decision on whether it is OK to use "and", "or", "not" for full-text. If so change existing symbols to "and", "or", "not". If not change existing symbols to "ftand", "ftor", "ftnot". |
Chavdar Botev | 2005-05-03 | Updated FTDistance semantics | Updated the semantics for distance. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTRange syntax | Made "exactly" required before an exact number in FTRange. Closed Cluster F, Issue 43 Exactly in FTRangeSpec. |
Sihem Amer-Yahia | 2005-05-04 | Closed issue on collations | Closed Cluster D, Issue 57 Collations Match Option. |
Jochen Doerre | 2005-05-19 | Added issue on scoring | Added Cluster A, Issue 60 Extended Scoring. |
Chavdar Botev | 2005-06-29 | Added issue on FTNegation | Added Cluster G, Issue 62 Precise semantics of double negation. |
Chavdar Botev | 2005-06-29 | Added issue on FTTimes | Added Cluster G, Issue 61 Desired semantics of FTTimes. |
Sihem Amer-Yahia | 2005-07-11 | Updated FTMildNegation syntax | Updated the mild not syntax from "mild not" to "not in". Closed Cluster I, Issue 10 MildNot and Cluster F, Issue 41 Mildnot Naming. |
Chavdar Botev | 2005-07-12 | Updated FTIgnore semantics | Changed semantics of FTIgnoreOption. |
Sihem Amer-Yahia | 2005-07-18 | Corrected error codes | Corrected and added error codes, closing and implementing the resolution for Cluster J Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-07-18 | Closed issues with no changes | Closed Cluster I, Issue 13 "loose-grammar" leaving the grammar as it is. Closed Cluster D, Issue 53 "matchoptions-default" with no change to the document. Closed Cluster H, Issue 58 "ft-about-operator" with no change to the document. |
Sihem Amer-Yahia | 2005-07-21 | Updated score syntax | Closed Cluster A, Issue 60 "new-scoring-proposal" and Issue 2 "scoring-values" and updated Section 2.2 Score Clause to reflect new score syntaxes. There are now syntaxes for scored queries 1) returning the same results as queries with Boolean predicates and 2) for returning more or fewer results. |
Sihem Amer-Yahia | 2005-07-21 | Added appendix for defaults | Added appendix for defaults in the query prolog analogous to C.1 in the XQuery language document. |
Sihem Amer-Yahia | 2005-07-21 | Updated FTThesaurus section | Aligned description in Section 3.2.4 FTThesaurusOption with current grammar. |
Sihem Amer-Yahia | 2005-07-21 | Opened and closed issue on nested FTNegation | Opened and closed Cluster I, Issue 65 Nested FTNegations on the right side of an FTMildNegation. |
Chavdar Botev | 2005-07-25 | Updated FTMildNegation semantics | Changed the semantics of MildNot. |
Sihem Amer-Yahia | 2005-08-10 | Added Change Log | Added Change Log harvesting back entries from CVS change log. |
Jochen Doerre | 2005-08-17 | Grammar changes | Changed XQuery/XPath grammar for new scoring syntax (resolution of Issue 60), for match option defaults in query prolog (resolution of Issue 45), for simplified window operator (resolution to Issue 51), renamed "mild not" to "not in" (resolution of Issue 41), modified FTThesaurusOption, FTStopwordOption and FTLanguageOption to require StringLiterals as decided in May 05 F2F. |
Jochen Doerre | 2005-08-17 | Changes to Section 2 | Introduced new scoring syntax; rewrote most of 2.2. Corrected use of weights in 2.2.1 (wrong default, wrong use of 1.5). |
Jochen Doerre | 2005-08-17 | Changes to Section 3 | Adapted the explanations to the changed syntax for FTWindow, FTThesaurusOption, FTStopwordOption and FTLanguageOption. Also corrected a couple of example explanations. Removed FTIgnoreOption from the list of match option defaults in 3.2. Corrected explanation and example of FTLanguageOption (neither diacritics nor case is language-specific!). Commented out last two examples of FTDistance, because distance 15 does not work for phrases. |
Jochen Doerre | 2005-08-17 | Appendices A+B | Adapted introductory comment about which version of the XQuery/XPath grammars we are aligned to. |
Jochen Doerre | 2005-08-17 | Dates in Header | Adapted current date and previous date and links in full-text-query-language-semantics.xml and in tqheader.xml. |
Jochen Doerre | 2005-08-19 | Added Section 2.3, Changes in 3+4 | Added Section 2.3 Extension to Static Context. Changed Sections 3.2 and 4.4.1.1 to refer to match option settings in the static context. |
Jochen Doerre | 2005-08-19 | Added Issue 63 | Added Cluster G Issue 63: Distance constraints do not work on phrases. |
Jochen Doerre | 2005-08-19 | Changes in Section 4 | Adapted semantics to new scoring feature (resolution of Issue 60), changed FTWindow semantics according to resolution of Issue 51, and cleaned examples. |
Jochen Doerre | 2005-08-19 | Appendix G | Added lines for statically known thesauri and stop lists. |
Jochen Doerre | 2005-08-25 | Added Issue 64 | Added Cluster E Issue 64: System Relative Operator Defaults (using wording proposed by Pat Case). |
Jochen Doerre | 2005-10-10 | Changes in Section 3 | Rephrased Section 3.2.7 FTIgnoreOption. Explanation and example adapted to simple (non-recursive) use of "ignore". |
Jochen Doerre | 2005-10-10 | Changes in Section 4 | Incorporated Section 4.3.1.4 Match and AllMatches Normal Form. |
Sihem Amer-Yahia | 2005-10-12 | Incorporated comments | Incorporated Pat's comments at https://lists.w3.org/Archives/Member/member-query-fttf/2005Sep/0068.html |
Jim Melton | 2005-10-20 | Changes in Sections 3 and 4 | Properly marked up errors and inserted error summary appendix. Re-ordered appendices so normative appendices precede non-normative appendices. |
Jochen Doerre | 2005-10-24 | Final editings | Included corrections to examples in Section 3. Changed meaning of distance 0 for sentences (paragraphs) to mean adjacent. Rework of Appendix H Checklist of Implementation-Defined Features. Resolution texts to issues 45, 59, and 62. |
Jochen Doerre | 2005-11-28 | Restrict FTTimes to FTWords | Modified EBNF syntax to allow the FTTimes operation to be applicable only to simple FTWords. |
Jochen Doerre | 2005-11-28 | Re: Bug 2299: Changes to Section 4 | The AllMatches model has been changed to allow the TokenInfo of a StringMatch to represent an interval of token positions instead of a single position. Thus, a phrase is now modeled using a single StringMatch, and consequently distance constraints (which always apply to the individual StringMatches) can be used to constrain the entire phrase. In addition, this change allows overlapping tokens to be modeled. The semantics functions for FTOrder (order now constrains the start positions of tokens), for FTScope, for FTDistance (a distance constraint requires a certain number of positions between the end of one token and the start of the next) and for FTWindow have been adapted. |
Jochen Doerre | 2006-01-09 | Issues List removed | Dropped Appendix I "Issues List", as issues are tracked in Bugzilla now. |
Mary Holstege | 2006-02-01 | Static context | Added known languages to static context. |
Jochen Doerre | 2006-03-06 | Bug 2776 | Changed EBNF grammar to allow weights to be specified using RangeExpr. |
Mary Holstege | 2006-03-30 | Updated Tokenization 4.2.7 | Expanded and clarified definition. Added examples. |
Pat Case | 2006-04-13 | Replaced glossary | Removed glossary copied from the XQuery language document and inserted coding to produce a full-text glossary. |
Jochen Doerre | 2006-04-24 | Section 2 | Added new Processing Model section. |
Jochen Doerre | 2006-04-25 | Section 4 | Included the completely revised semantics schemata and functions, which now (i) correctly handle interval-based TokenInfos, (ii) separate the representation of TokenInfos and SearchTokenInfos and SearchItems, (iii) have been simplified regarding the semantics of match options by no longer separating the implementation-defined matching function from (most of) the implementation-defined application of match options, and (iv) have been type- and syntax-checked. |