This document defines the syntax and formal semantics of XQuery and XPath Full Text 1.0
which is a language that extends XQuery 1.0
W3C publishes a
This document has been jointly developed by the W3C
The
A test suite is available that tests each identified XQuery and XPath Full Text 1.0 feature, both required and optional.
Minimal Conformance to this specification, as defined in
An XPath Full Text parsing applet that generates XQueryX is available.
The Working Groups have responded formally to all issues raised during the CR period against this document.
Once the entrance criteria for Proposed Recommendation have been achieved,
the Director will be requested to advance this document to
The 15
The WG believes that this document, published on 16 May 2008, is sufficiently mature and stable for the development community to begin developing implementation experience and reporting on that experience.
The WGs particularly solicit feedback regarding how
No implementation report currently exists.
However, a Test Suite for this document is under development.
Implementors are encouraged to run this test suite and report their results.
The Test Suite can be found at
This document incorporates changes made against the
Please report errors in this document using W3C's
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by groups operating under the
SA January 2004: First version of document before Feb F2F
SA 26 February 2004: Second version of document before Feb F2F meetings.
JM 18 May 2007: Last Call Working Draft
This document defines the language and the formal semantics of
XQuery and XPath Full Text 1.0. This language is designed to meet the requirements
identified in W3C XQuery and XPath Full Text Requirements
XQuery and XPath Full Text 1.0 extends the syntax and semantics of XQuery 1.0 and XPath 2.0.
Additionally, this document defines an XML syntax for XQuery and XPath Full Text 1.0.
The most recent versions of the two XQueryX XML Schemas and the
XQueryX XSLT stylesheet for XQuery and XPath Full Text 1.0 are available at
As XML becomes mainstream, users expect to be able to
search their XML documents. This requires a standard way to do
full-text search, as well as structured searches, against XML
documents. A similar requirement for full-text search led ISO to
define the SQL/MM-FT
XML documents may contain highly structured data (fixed schemas, known types such as numbers, dates), semi-structured data (flexible schemas and types), markup data (text with embedded tags), and unstructured data (untagged free-flowing text). Where a document contains unstructured or semi-structured data, it is important to be able to search using Information Retrieval techniques such as scoring and weighting.
Full-text search is different from substring search in many ways:
A full-text search searches for tokens and phrases rather than substrings. A substring search for news items that contain the string "lease" will return a news item that contains "Foobar Corporation releases the 20.9 version ...". A full-text search for the token "lease" will not.
There is an expectation that a full-text search will support language-based searches which substring search cannot. An example of a language-based search is "find me all the news items that contain a token with the same linguistic stem as 'mouse'" (finds "mouse" and "mice"). Another example based on token proximity is "find me all the news items that contain the tokens 'XML' and 'Query' allowing up to 3 intervening tokens".
Full-text search must address the vagaries and nuances of language. Search results are often of varying usefulness. When you search a web site for cameras that cost less than $100, this is an exact search. There is a set of cameras that matches this search, and a set that does not. Similarly, when you do a string search across news items for "mouse", there is only 1 expected result set. When you do a full-text search for all the news items that contain the token "mouse", you probably expect to find news items containing the token "mice", and possibly "rodents", or possibly "computers". Not all results are equal. Some results are more "mousey" than others. Because full-text search may be inexact, we have the notion of score or relevance. We generally expect to see the most relevant results at the top of the results list.
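The difference between substring search and token search, and the token-proximity example mentioned above, can be sketched in Python. This is only an illustration: the `tokenize` function is a hypothetical sample tokenizer (lowercased runs of letters and digits), and the helper names are assumptions, not part of this specification.

```python
import re

def tokenize(text):
    """Hypothetical sample tokenizer: lowercased runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def substring_search(text, needle):
    """Plain substring search: matches anywhere inside the string."""
    return needle.lower() in text.lower()

def token_search(text, token):
    """Token search: the query must equal a whole token."""
    return token.lower() in tokenize(text)

def within_distance(text, first, second, max_gap):
    """True if `first` and `second` occur with at most `max_gap` intervening tokens."""
    tokens = tokenize(text)
    firsts = [i for i, t in enumerate(tokens) if t == first.lower()]
    seconds = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in firsts
        for j in seconds
    )

news = "Foobar Corporation releases the 20.9 version"
print(substring_search(news, "lease"))  # substring hit inside "releases"
print(token_search(news, "lease"))      # no token equals "lease"
```

With this tokenizer, the substring search finds "lease" inside "releases", while the token search does not, which mirrors the news-item example above.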
As XQuery and XPath evolve, they may apply the notion of score to querying structured data. For example, when making travel plans or shopping for cameras, it is sometimes useful to get an ordered list of near matches in addition to exact matches. If XQuery and XPath define a generalized inexact match, we expect XQuery and XPath to utilize the scoring framework provided by XQuery and XPath Full Text.
Tokenization, in general terms, is the process of converting a text string into smaller units that are used in query processing. Those units, called tokens, are the most basic text units that a full-text search can refer to. Full-text operators typically work on sequences of tokens found in the target text of a search. These tokens are characterized by integers that capture the relative position(s) of the token inside the string, the relative position(s) of the sentence containing the token, and the relative position(s) of the paragraph containing the token. The positions typically comprise a start and an end position.
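As a rough illustration of the positional information described above, the following Python sketch attaches a token position, a sentence position, and a paragraph position to each token. It is a simplified assumption: the `TokenInfo` class here records a single position per token instead of a start and end position, sentences are split on ".", "!" and "?", and paragraphs on blank lines. None of these choices is mandated by this specification.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenInfo:
    word: str
    pos: int       # relative position of the token in the text
    sentence: int  # relative position of the containing sentence
    para: int      # relative position of the containing paragraph

def tokenize(text):
    """Hypothetical tokenizer that numbers tokens, sentences and paragraphs from 1."""
    infos = []
    pos = 0
    sentence = 0
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for para_no, para in enumerate(paragraphs, start=1):
        for sent in re.split(r"[.!?]+", para):
            words = re.findall(r"\w+", sent)
            if not words:
                continue  # skip empty fragments left by the split
            sentence += 1
            for word in words:
                pos += 1
                infos.append(TokenInfo(word, pos, sentence, para_no))
    return infos

sample = "Improving the usability. Of a site.\n\nExpert reviews help."
for info in tokenize(sample):
    print(info)
```

Operators that depend on relative positions, such as proximity or same-sentence constraints, can then compare the `pos`, `sentence`, and `para` fields of two tokens.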
Tokenization, including the definition of the term "tokens",
Consecutive tokens need not be separated by either punctuation or space, and tokens may overlap.
In some natural languages, tokens and words can be used interchangeably.
Some XML elements represent semantic
markup, e.g., <title>. Others represent formatting markup, e.g.,
<b> to indicate bold. Semantic markup serves well as token
boundaries. Some formatting markup serves
well as token boundaries, for example, paragraphs are most commonly delimited
by formatting markup. Other formatting markup may not serve well as token
boundaries. Implementations
are free to provide
A sample tokenization is used for the examples in this document. The results might be different for other tokenizations.
Tokenization enables functions and operators that operate on a part or the root of the token (e.g., wildcards, stemming).
Tokenization enables functions and operators which work with the relative positions of tokens (e.g., proximity operators).
This specification focuses on functionality that serves all languages. It also selectively includes functionalities useful within specific families of languages. For example, searching within sentences and paragraphs is useful to many western languages and to some non-western languages, so that functionality is incorporated into this specification.
Certain aspects of language
processing are described in this specification as
This document is organized as follows. We first present a
Certain namespace prefixes are predeclared by XQuery 1.0 and, by implication, by this specification, and bound to fixed namespace URIs. These namespace prefixes are as follows:
xml = http://www.w3.org/XML/1998/namespace
xs = http://www.w3.org/2001/XMLSchema
xsi = http://www.w3.org/2001/XMLSchema-instance
fn = http://www.w3.org/2005/xpath-functions
local = http://www.w3.org/2005/xquery-local-functions
In addition to the prefixes in the above list, this document uses the prefix err to represent the namespace URI http://www.w3.org/2005/xqt-errors. This namespace prefix is not predeclared and its use in this document is not normative.
Error codes that are not defined in this document are defined in other XQuery 1.0 and XPath 2.0
specifications, particularly
Finally, this document uses the prefix fts to represent a namespace containing a number of functions used in this document to describe the semantics of XQuery and XPath Full Text functions. There is no requirement that these functions be implemented; therefore, no URI is associated with that prefix.
XQuery and XPath Full Text extends the languages of XQuery 1.0 and XPath 2.0 in three ways. It:
Adds a new expression called FTContainsExpr;
Enhances the syntax of FLWOR expressions in XQuery 1.0 and "for" expressions in XPath 2.0 with optional score variables; and
Adds static context declarations for full-text match options to the query prolog.
Additionally, it extends the data model and processing models in various ways.
A
An XPath 2.0 or XQuery 1.0 expression (RangeExpr) that
specifies the sequence of items to be searched.
The full-text selection to be applied (
Required:
Tokens and phrases for which a search is performed (
Optional:
Match options, such as indicators for case sensitivity and stop
words (
Boolean full-text operators, that compose a full-text selection from
simpler full-text selections (
Other full-text operators that are constraints on the positions of
matches, such as indicators for distance between tokens and for the
cardinality of matches (
The weighting information. Each individual search term in a full-text selection may be annotated with optional weight information. This information may be used during the evaluation of the full-text selections to calculate scoring, information that quantifies the relevance of the result to the given search criteria.
An optional XPath 2.0 or XQuery 1.0 expression (UnionExpr) that
specifies the set of nodes, descendants of the RangeExpr, whose
contents must be ignored for the purpose of determining a match
during the search (
The results of the evaluation of the full-text selection operators are instances of the AllMatches model, which complements the XQuery Data Model (XDM) for processing full-text queries. An AllMatches instance describes all possible solutions to the full-text query for a given search context item. Each solution is described by a Match instance. A Match instance contains the tokens from the search context item that must be included (described using StringInclude instances, which model the positive terms) and the tokens from the search context item that must be excluded (described using StringExclude instances, which model the negative terms). Each negative or positive term is modeled as a tuple: the position of the query token or phrase in the full-text selection, and a TokenInfo structure that describes a set of tokens in the text string which match the query token or phrase.
Figure 1 provides a schematic overview of the XQuery and XPath Full Text processing steps that are discussed in detail below.
Some of these steps are completely outside the domain of XQuery; in
Figure 1, these are depicted outside the black line that represents
the boundaries of the language. The diagram only shows the central pieces
of the XQuery Processing Model (see
Like all XQuery expressions, an FTContainsExpr returns an XDM Instance (see Fig. 1). With the exception of FTWords, which consumes TokenInfos, all full-text selections are closed under the AllMatches data model, i.e., their input and output are AllMatches instances. Tokenization transforms an XDM instance into TokenInfos, which ultimately get converted into AllMatches instances by the evaluation of full-text selections. Thus, the evaluation of nested full-text and XQuery expressions moves back and forth between these two models.
The resulting AllMatches instance obtained by the evaluation of an FTContainsExpr is converted into a Boolean value before being returned to the enclosing XPath or XQuery operation, as follows. If at least one member of the disjunction contains only positive terms, then the value returned is true. If all members of the disjunction contain negative terms, the result is false.
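The conversion rule above can be sketched in Python. The `Match` class and its field names are hypothetical simplifications of the AllMatches model: each match holds only lists of positive (StringInclude) and negative (StringExclude) token positions, which is enough to illustrate the Boolean conversion.

```python
from dataclasses import dataclass, field

@dataclass
class Match:
    # Simplified stand-ins for StringInclude and StringExclude entries.
    includes: list = field(default_factory=list)
    excludes: list = field(default_factory=list)

def all_matches_to_boolean(all_matches):
    """True if at least one Match (member of the disjunction) has only positive terms."""
    return any(not match.excludes for match in all_matches)

# One solution with a negative term, one with positive terms only.
candidates = [
    Match(includes=[3], excludes=[7]),
    Match(includes=[3, 5]),
]
print(all_matches_to_boolean(candidates))
```

In this sketch an empty AllMatches instance, which has no members at all, converts to false.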
Weighting information, in an
Given the components of a given full-text contains expression, the evaluation
algorithm will proceed according to the following steps, also referenced in the processing model diagram as steps FT
Evaluate the search context expression (resulting in the sequence of search context items), the ignore option, if any (resulting in the set of ignored nodes), and any other XQuery/XPath expressions nested within the full-text contains expression. (FT1)
Tokenize the query string(s). (FT2.1)
For each search context item:
Delete the ignored nodes from the search context item.
Tokenize the result of the previous step. This produces a sequence of tokens. (FT2.2) Note that implementations may (as an optimization) perform tokenization as part of the External Processing that is described in the XQuery Processing Model, when an XML document is parsed into an Infoset/PSVI and ultimately into an XQuery Data Model instance.
Evaluate the FTSelection against the tokens of the search context. (FT3, FT4)
Convert the topmost AllMatches instances into a Boolean value. (FT5)
The additional scoring information (also part of FT5) that is produced
by the evaluation
of the full-text contains expression is
(A more detailed version of the above procedure
appears in Section
Section
As a syntactic construct, a full-text contains expression
(grammar symbol:
A full-text contains expression may be used anywhere a ComparisonExpr may be used. The "ftcontains" operator has higher precedence than other comparison operators, so the results of "ftcontains" expressions may be compared without enclosing them in parentheses.
A full-text contains expression returns a Boolean
value. It returns true if there is some item returned by
the RangeExpr that, after
An XQuery and XPath Full Text processor
The following example in XQuery Full Text returns the author of each book with a title containing a token with the same root as "dog" and the token "cat".
The same example in XPath Full Text is written as:
In the next example, a ComparisonExpr is combined with an FTContainsExpr using the logical XQuery operator "and". The query selects books that have a price of less than 50 and a title which contains a token with the same root as "train":
The following example shows the combination of two "ftcontains" expressions, the results of which are compared using the not-equals operator. The query selects books where either the title contains the token "dog" and the token "cat" and the content does not contain a token with the same root as "train", or where the title fails to have one of the matching tokens but the content does:
Besides specifying a match of a full-text query as a Boolean condition, full-text query applications typically also have the ability to associate scores with the results. XQuery and XPath Full Text extends the languages of XQuery 1.0 and XPath 2.0 further by adding optional score variables to the "for" and "let" clauses of FLWOR expressions.
The production for the extended "for" clause in XQuery 1.0 follows.
In XPath 2.0, the SimpleForClause is extended similarly.
When a score variable is present in a "for" clause, the evaluation of the expression following the "in" keyword must not only determine the result sequence of the expression, i.e., the sequence of items which are iteratively bound to the "for" variable; it must also determine, in each iteration, the relevance "score" value of the current item and bind the score variable to that value.
The semantics of scoring and how it relates to second-order functions is
discussed in Section
In the following example, book elements are determined that satisfy the condition [content ftcontains "web site" ftand "usability" and .//chapter/title ftcontains "testing"]. The scores assigned to the book elements are returned.
The example above is also a legal example of the XPath 2.0 extension.
Scores are typically used to order results, as in the
following, more complete example.
Note that the score variable gets in
keyword,
regardless of the number of FTContainsExprs in that expression. In the following example, two separate full-text contains expressions are used to select the matching paragraphs. There is still just one score for each para returned. The highest scoring paragraphs will be returned first:
The following more elaborate example uses multiple score variables to return the matching paragraphs ordered so that those from the highest scoring books precede those from the lowest scoring books, where the highest scoring paragraphs of each book are returned before the lower scoring paragraphs of that book:
The score variable is bound to a value which reflects the relevance of the match criteria in the full-text selections to the items returned by the respective RangeExprs. The
calculation of relevance is
Score values are of type xs:double in the range [0, 1].
For score values greater than 0, a higher score must imply a higher degree of relevance.
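The ordering by multiple score variables described above (paragraphs from the highest scoring books first and, within each book, the highest scoring paragraphs first) can be sketched in Python. The dictionaries, field names, and score values are made-up illustrations; how scores are actually computed is outside this sketch.

```python
def check_score(value):
    """Scores are doubles in the range [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score out of range: {value}")
    return value

def order_by_scores(rows):
    """Sort by book score descending, then by paragraph score descending."""
    for row in rows:
        check_score(row["book_score"])
        check_score(row["para_score"])
    return sorted(rows, key=lambda r: (-r["book_score"], -r["para_score"]))

# Hypothetical results: each paragraph carries its book's score and its own score.
rows = [
    {"para": "p1", "book_score": 0.4, "para_score": 0.9},
    {"para": "p2", "book_score": 0.8, "para_score": 0.2},
    {"para": "p3", "book_score": 0.8, "para_score": 0.7},
]
print([row["para"] for row in order_by_scores(rows)])
```

Because only the relative order of scores matters here, the sketch makes no assumption about how an implementation derives them.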
Similarly to their use in a "for" clause, score variables may be specified in a "let" clause. A score variable in a "let" clause is also bound to the score of the expression evaluation, but in the "let" clause one score is determined for the complete result.
The production for the extended "let" clause follows.
When using the score option in a "for" clause, the expression following the "in" keyword has the dual purpose of filtering, i.e., driving the iteration, and determining the scores. It is possible to separately specify expressions for filtering and scoring by combining a simple "for" clause with a "let" clause that uses scoring. The following is an example of this.
The example returns book elements with chapter titles that contain "testing". Along with the book elements, scores are returned. These scores, however, reflect whether the book content contains "web site" and "usability".
Note that the score of an FTContainsExpr is not required to be 0 if the expression evaluates to false, nor to be non-zero if the expression evaluates to true. Hence, in the example above it is not possible to infer the Boolean value of the FTContainsExpr in the "let" clause from the calculated score of a returned result element. For instance, an implementation may want to assign a non-zero score to a book that contains "web site" but not "usability", as this may be considered more relevant than a book that contains neither "web site" nor "usability".
The expression ExprSingle associated with the score variable is passed to the scoring algorithm. The scoring algorithm calculates the score value based on the passed expression
(not on the value returned by evaluating the expression). The set of expressions supported by the scoring algorithm is
The use of score variables introduces a second-order aspect to the evaluation of expressions which cannot be emulated by (first-order) XQuery functions. Consider the following replacement of the clause "let score $s := FTContainsExpr" where a function "score" is applied to some FTContainsExpr. If the function "score" were first-order, it would only be applied to the result of the evaluation of its argument, which is one of the Boolean constants true or false. Hence, there would be at most two possible values such a "score" function would be able to return and no further differentiation would be possible.
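The limitation of a first-order score function can be illustrated with a small Python sketch. Everything here is hypothetical: search terms are single tokens, `contains_all` stands in for a Boolean full-text test, and the fraction-of-terms formula in `second_order_score` is an arbitrary relevance measure chosen only for illustration.

```python
def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def contains_all(tokens, terms):
    """The Boolean result a full-text test for all terms would produce."""
    return all(term in tokens for term in terms)

def first_order_score(matched):
    """Receives only the Boolean value, so it can return at most two scores."""
    return 1.0 if matched else 0.0

def second_order_score(tokens, terms):
    """Receives the search terms themselves (made-up relevance formula)."""
    return sum(term in tokens for term in terms) / len(terms)

books = [
    "web usability testing",
    "web design basics",
    "gardening for beginners",
]
terms = ["web", "usability"]

first = [first_order_score(contains_all(tokenize(b), terms)) for b in books]
second = [second_order_score(tokenize(b), terms) for b in books]
print(first)
print(second)
```

The first-order variant cannot distinguish the second book from the third, while the variant that sees the expression can rank a partial match above no match at all.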
The weight
The weights assigned are not related to any absolute standard, but typically have a relationship to other weights within the same FTContains expression.
The effect of weights on the resulting score is
When no explicit weight is specified, the default weight is 1.0; and
Weight declarations in an FTContainsExpr for which no scores are evaluated are ignored.
The following example illustrates how different weights can be used
for different search terms.
The XQuery Static Context is extended with a component for each
full-text
This section describes the
full-text selections which contain the full-text
operators in a
As shown in the grammar, a full-text selection consists of search
conditions possibly involving logical operators (xs:double
; it must
be between 0.0 and 1000.0 inclusive.
The syntax and semantics of the individual full-text selection operators follow.
This XML document is the source document for examples in this section.
Tokenization is <p>
for paragraph boundaries. The results may be different
for other tokenizations.
The first five tokens in this example using the sample tokenization would be "Improving", "the", "usability", "of", and "a".
Unless stated otherwise, the results assume a case-insensitive match.
FTWords consists of two parts: a mandatory
In general, the tokens and phrases in
The following rules specify how an xs:string*
. Then, each of those strings is tokenized into a
sequence of tokens as
described in
If
If
If
If
If
If the
If
The following expression returns the sample book element, because its title element contains the token "Expert":
The following expression returns the sample book element, because its title element contains the phrase "Expert Reviews":
The following expression returns the sample book element, because its title element contains the two tokens "Expert" and "Reviews":
The following expression returns false for our sample document, because the p element doesn't contain the phrase "Web Site Usability" although it contains all of the tokens in the phrase:
The following expression returns book numbers of book elements by "Marigold" with a title about "Web Site Usability", sorting them in descending score order:
A cardinality selection limits the number of different
matches of
In the document fragment "very very big":
The "very big" has 1 match consisting of the second "very" and "big".
The {"very", "big"} all has 2 matches; one consisting of the first "very" and "big", and the other containing the second "very" and "big".
The {"very", "big"} any has 3 matches.
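The three match counts for the fragment "very very big" can be reproduced with a short Python sketch. The function names are hypothetical, and matches are represented simply as tuples of zero-based token positions; this is a counting illustration, not the formal AllMatches semantics.

```python
from itertools import product

def phrase_matches(tokens, phrase):
    """One match per place where the phrase tokens occur consecutively."""
    n = len(phrase)
    return [
        tuple(range(i, i + n))
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    ]

def all_matches(tokens, words):
    """One match per combination of one occurrence of every word."""
    positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]
    return list(product(*positions))

def any_matches(tokens, words):
    """One match per occurrence of any of the words."""
    return [(i,) for i, t in enumerate(tokens) if t in words]

tokens = ["very", "very", "big"]
print(len(phrase_matches(tokens, ["very", "big"])))  # phrase "very big"
print(len(all_matches(tokens, ["very", "big"])))     # {"very", "big"} all
print(len(any_matches(tokens, ["very", "big"])))     # {"very", "big"} any
```

A cardinality constraint would then compare these counts against the requested range of occurrences.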
The following expression returns the example book element's number, because the book element contains 2 or more occurrences of "usability":
The following expression returns the empty sequence, because there are 3 occurrences of {"usability", "testing"} any in the designated title:
Full-text match options modify the matching behaviour of
the
Note that, along with the syntax rules above, there is an extra-grammatical
constraint,
Although match options only take effect in the application of
"(" FTSelection ")"
. Such a higher-level match option
provides a default for the respective match option group for any
embedded
Match options are propagated through the query via the static context.
For each of the seven match option groups,
the static context has a component
that contains one option from that group.
The seven settings are initialized by the implementation
in accordance with the table in
Appendix VarDecl
s and FunctionDecl
s,
and including any that happen to be nested within
another FTContainsExpr
).
At any given FTContainsExpr, the settings from the static context are copied to the FTContainsExpr's inner settings, which are then propagated down the syntax tree.
At each
Thus, when a match option appears in an FTContainsExpr
s
that happen to be embedded within that FTPrimary
.
Instead, for a nested FTContainsExpr, the default match options are those declared in the Prolog or, if not declared in the Prolog, then supplied by the implementation's initial values.
The match option application order is subject to some constraints:
The Language Option must be applied first.
The Stemming Option must be applied before the Case Option and the Diacritics Option.
More information on
their semantics is given in
If no match options declarations are present in the prolog and the implementation does not define any overwriting of the static context components for the match options, the query:
is, assuming "de" is the
We describe each match option group in more detail in the following sections.
The StringLiteral following the keyword "language" designates one language. It must be castable to xs:language; otherwise, an
error is raised:
The "language" option influences tokenization, stemming, and stop
words in an
The set of standardized language identifiers is defined in
The default language is specified in the static context.
When an XQuery and XPath Full Text processor evaluates text in a document
that is governed by an xml:lang attribute and
the portion of the full-text query doing that evaluation contains an FTLanguageOption that
specifies a different language from the language specified by the governing xml:lang attribute,
the language-related behavior of that full-text query is
This is an example where the language option is used to select the appropriate stop word list:
When the "with wildcards" option is used, wildcard indicators (represented by periods (.)) and qualifiers may be appended to or inserted into the query tokens. If the period is at the beginning of a query token, the wildcard is a prefix wildcard. If the period is at the end of a query token, it is a suffix wildcard. If the period is inserted into a query token, it is an infix wildcard.
Each indicator and qualifier in a query token will match zero or more characters within a token in the text being searched, as described below. The number of characters matched depends on the qualifier. Qualifiers available are none, question mark, asterisk, plus sign, and two numbers separated by a comma, both enclosed by curly braces.
If a period is present, but there are no qualifiers, one character in the text will match.
If a period is followed by a question mark (.?), zero or one characters in the text being searched will match.
If a period is followed by an asterisk (.*), zero or more characters will match.
If a period is followed by a plus sign (.+), one or more characters will match.
If a period is followed by two numbers separated by a comma, both enclosed by curly braces (.{n,m}), a specified range of characters (at least n characters and no more than m characters) will match.
When "with wildcards" is present and an indicator or qualifier character is intended to be taken literally (as itself), that character must be preceded by ("escaped by") a backslash (\). For example, a period (.) that is intended to be a sentence terminator or a decimal point must be preceded by a backslash so that it is not interpreted to be an indicator. Similarly a question mark (?), asterisk (*), or plus sign (+) that is intended to be interpreted as an ordinary text character must be preceded by a backslash so that it is not interpreted to be an indicator.
The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are always recognized as ordinary text characters.
The default is "without wildcards".
Note: Wildcard indicators and qualifiers may be token boundaries. How text with wildcard indicators and qualifiers is tokenized is implementation-defined.
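The wildcard rules above map closely onto regular expressions, which the following Python sketch uses to illustrate them. It is an assumption-laden illustration: `wildcard_to_regex` and `matches` are hypothetical helpers, matching is case-insensitive, and each query token is compared against one whole token of the text. How text containing indicators is tokenized remains implementation-defined and is not modeled here.

```python
import re

# Qualifiers that may follow a period: ?, *, + or {n,m}.
QUALIFIER = re.compile(r"\?|\*|\+|\{\d+,\d+\}")

def wildcard_to_regex(query_token):
    """Translate a "with wildcards" query token into a compiled regex."""
    out = []
    i = 0
    while i < len(query_token):
        ch = query_token[i]
        if ch == "\\" and i + 1 < len(query_token):
            # Escaped character: taken literally.
            out.append(re.escape(query_token[i + 1]))
            i += 2
        elif ch == ".":
            # Wildcard indicator, optionally followed by a qualifier.
            m = QUALIFIER.match(query_token, i + 1)
            qualifier = m.group(0) if m else ""
            out.append("." + qualifier)
            i += 1 + len(qualifier)
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out), re.IGNORECASE | re.DOTALL)

def matches(query_token, token, wildcards=True):
    """Compare one query token against one text token."""
    if not wildcards:
        return query_token.lower() == token.lower()
    return wildcard_to_regex(query_token).fullmatch(token) is not None

print(matches("improv.*", "improving"))
print(matches("w.ll", "well"))
print(matches("w.ll", "well", wildcards=False))
```

A period without a qualifier matches exactly one character, so in this sketch "w.ll" matches "well" but not "wll".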
The following expression returns true, because the title element contains "improving":
The following expression returns true, because the title element contains "site":
The following expression returns true, because the p element contains "well":
The following expression returns false, because the p element does not contain the phrase "w ll":
(Note that, without wildcards, the sample tokenization will treat the period in "w.ll" as punctuation, thus producing "w" and "ll" as separate tokens.)
| ("with" "thesaurus" "(" (
| ("without" "thesaurus")
Thesauri add related tokens and phrases to the query or change query tokens. Thus, the user may narrow, broaden, or otherwise modify the query using synonyms, hypernyms (more generic terms), etc. The search is performed as though the user has specified all related query tokens and phrases in a disjunction (FTOr).
A thesaurus may be standards-based or locally-defined. It may be a
traditional thesaurus, or a taxonomy, soundex, ontology, or topic
map. How the thesaurus is represented is
FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.
Relationships include, but are not limited to, the relationships
and their abbreviations presented in
The "with thesaurus" option specifies that string matches include tokens that can be found in one of the specified thesauri. When "default" is used in place of an FTThesaurusID, the thesauri specified in the static context are used, which are either given by the prolog declaration for the thesaurus option or, if no such declaration exists, a system-defined default thesaurus with a system-defined relationship. The default thesaurus may be used in combination with other explicitly specified thesauri.
The "without thesaurus" option specifies that no thesaurus will be used.
The default is "without thesaurus".
The following expression returns true, because it finds a content element containing "tasks" which the thesaurus identified as a synonym for "duties":
The following expression returns book elements, because it finds a content element containing "web site components", and narrower terms "navigation" and "layout":
Assuming the thesaurus available at URL "https://bstore1.example.com/UsabilitySoundex.xml" contains soundex capabilities, the following query returns a book element containing "Marigold" which sounds like "Merrygould":
The "with stemming" option specifies that matches may contain tokens
that have the same stem as the tokens and phrases written in the
query. It is
The "without stemming" option specifies that the tokens and phrases are not stemmed.
It is
The default is "without stemming".
The following expression returns true, because the title of the specified book contains "improving" which has the same stem as "improve":
| ("case" "sensitive")
| "lowercase"
| "uppercase"
There are four possible character case options:
Using the option "case insensitive", tokens and phrases are matched, regardless of the case of characters of the query tokens and phrases.
Using the option "case sensitive", tokens and phrases are matched, if and only if the case of their characters is the same as written in the query.
Using the option "lowercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only lowercase characters.
Using the option "uppercase", tokens and phrases are matched, if and only if they match the query without regard to character case, but contain only uppercase characters.
The default is "case insensitive".
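The four case options can be sketched in Python for the simplest situation, a plain codepoint comparison (the Unicode Codepoint Collation column of the table in this section). The function `case_match` is a hypothetical helper, and Python's `str.lower()` and `str.upper()` stand in for fn:lower-case() and fn:upper-case(); other collations are not modeled.

```python
def case_match(query, token, option="case insensitive"):
    """Compare a query token with a text token under one of the four case options."""
    if option == "case insensitive":
        # Compare as if both were lowercase.
        return query.lower() == token.lower()
    if option == "case sensitive":
        return query == token
    if option == "lowercase":
        # Lowercase the query, then compare exactly.
        return query.lower() == token
    if option == "uppercase":
        # Uppercase the query, then compare exactly.
        return query.upper() == token
    raise ValueError(f"unknown case option: {option}")

print(case_match("usability", "Usability"))
print(case_match("usability", "Usability", "lowercase"))
```

Under "lowercase", the capitalized token "Usability" does not match, because the comparison requires the text token to consist of lowercase characters only.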
The effect of the case options is also influenced by the query's
default collation
(see
Case option \ Default collation | UCC (Unicode Codepoint Collation) | CCS (some generic case-sensitive collation) | CCI (some generic case-insensitive collation) |
---|---|---|---|
case insensitive | compare as if both lower | case-insensitive variant of CCS if it exists, else error | CCI |
case sensitive | UCC | CCS | case-sensitive variant of CCI if it exists, else error |
lowercase | compare using UCC after applying fn:lower-case() to the query string | compare using CCS after applying fn:lower-case() to the query string | CCI |
uppercase | compare using UCC after applying fn:upper-case() to the query string | compare using CCS after applying fn:upper-case() to the query string | CCI |
In this table, "else error" means "Otherwise, an error
is raised:
The following expression returns false, because the title element doesn't contain "usability" in lower-case characters:
The following expression returns true, because the character case is not considered:
| ("diacritics" "sensitive")
There are two possible diacritics options:
The option "diacritics" "insensitive" matches tokens and phrases with and without diacritics. Whether diacritics are written in the query or not is not considered.
The option "diacritics" "sensitive" matches tokens and phrases only if they contain the diacritics as they are written in the query.
The default is "diacritics insensitive".
The effect of the diacritics options is also influenced by the query's
default collation
(see
Diacritics option \ Default collation | UCC (Unicode Codepoint Collation) | CDS (some generic diacritics-sensitive collation) | CDI (some generic diacritics-insensitive collation) |
---|---|---|---|
diacritics insensitive | UCC comparison, but without considering diacritics | diacritics-insensitive variant of CDS if it exists, else error | CDI |
diacritics sensitive | UCC | CDS | diacritics-sensitive variant of CDI if it exists, else error |
In this table, "else error" means "Otherwise, an error
is raised:
The following expression returns true, because the token "Véra" in the editor element is matched, as the acute accent is not considered in the comparison:
This returns false, because the editor element does not contain the token "Vera" in this exact form, i.e. without any diacritics:
| ("without" "stop" "words")
| ("with" "default" "stop" "words"
| ("("
at
followed by a literal URI.
If the URI specifies a list of stop words that is not found in the statically
known stop word lists, an error is raised
The "with stop words" option specifies that if a token is within the
specified collection of stop words, it is removed from the search and
any token may be substituted for it. Stop words retain their position
numbers and are counted in
Multiple stop word lists may be combined using "union" or "except". The keywords "union" and "except" are applied from left to right. If "union" is specified, every string occurring in the lists specified by the left-hand side or the right-hand side is a stop word. If "except" is specified, only strings occurring in the list specified by the left-hand side but not in the list specified by the right-hand side are stop words.
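The left-to-right application of "union" and "except" can be sketched as set operations. The following Python fragment is only an illustration; the function name combine_stop_words and the sample lists are assumptions, not part of this specification.

```python
def combine_stop_words(first, rest):
    """Combine stop word lists strictly from left to right.

    first: the left-most stop word list.
    rest:  a sequence of (keyword, list) pairs, where keyword is
           "union" or "except".
    """
    result = set(first)
    for keyword, words in rest:
        if keyword == "union":
            # Every string from either side is a stop word.
            result |= set(words)
        elif keyword == "except":
            # Only strings on the left that are not on the right remain.
            result -= set(words)
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return result
```

Because the keywords are applied from left to right, the list ("a", "the") except ("a") union ("a") contains "a" again at the end in this sketch.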
The "with default stop words" option specifies that an
The "without stop words" option specifies that no stop words are used. This is equivalent to specifying an empty list of stop words.
The default is "without stop words".
Some implementations may apply stop word lists during indexing and be
unable to comply with query-time requests to not apply those stop words. An
implementation may still support stop-word options (and therefore not raise
The following expression returns true, because the document contains the phrase "propagating few errors":
Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagation" followed by 2 stop words, or even a sequence of 3 stop words in the document.
The following expression returns false. In this case specifying "few" as a stop word has no effect, since "few" does not appear in the query. Although the words "propagating" and "errors" appear in the text being searched, the phrase "propagating errors" cannot be matched, since that phrase does not occur.
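The asymmetry described above can be sketched in Python. This is a simplified illustration under stated assumptions: the function name phrase_matches is hypothetical, matching is case-insensitive, each token occupies one position, and the query phrases used in the usage note are assumptions chosen to be consistent with the outcomes described above.

```python
def phrase_matches(query_tokens, doc_tokens, stop_words=frozenset()):
    """Return True if the query phrase occurs in the document tokens.

    A query token that is a stop word keeps its position but may be
    substituted by any document token. Document tokens are never
    treated as stop words.
    """
    stop = {w.lower() for w in stop_words}
    query = [t.lower() for t in query_tokens]
    doc = [t.lower() for t in doc_tokens]
    for start in range(len(doc) - len(query) + 1):
        if all(q in stop or q == doc[start + i]
               for i, q in enumerate(query)):
            return True
    return False
```

With the document tokens "while propagating few errors", the assumed query "propagating of errors" matches when "of" is a stop word, while the assumed query "propagating errors" does not match even when "few" is declared a stop word, because "few" does not occur in the query.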
The following expression returns false, because "of" is not in the p element between "propagating" and "errors":
The following expression uses the stop words list specified at the URL. Assuming that the specified stop word list contains the word "then", this query is reduced to a query on the phrase "planning X conducting", allowing any token as a substitute for X. It returns a book element, because its content element contains "planning then conducting". It would also return the book if the phrases "planning and conducting" and "planning before conducting" had been in its content:
The following expression returns books containing "planning then conducting", but does not return books containing "planning and conducting", since it is exempting "then" from being a stop word:
An extension option consists of an identifying QName and a StringLiteral. Typically, a particular option will be recognized by some implementations and not by others. The syntax is designed so that option declarations can be successfully parsed by all implementations.
The QName of an extension option must resolve to a namespace URI and local name, using the statically known namespaces.
There is no default namespace for options.
Each implementation recognizes an
If the namespace part of the QName is not a namespace recognized by the implementation as one used to denote extension option, then the extension option is ignored.
Otherwise, the effect of the extension option, including its error behavior,
is
Implementations may impose rules on where particular extension options may appear relative to other match options, and the interpretation of an option declaration may depend on its position.
An extension option must not be used to change the syntax accepted by the processor, or to suppress the detection of static errors. However, it may be used without restriction to modify the set of tokens in the query or how they are matched against tokens in the text being searched. An extension option has the same scope as other match options.
The following examples illustrate several possible uses for extension options:
This extension option is set as part of the static context of all full-text expressions in the module and might be used to ensure that queries are insensitive to Arabic short-vowels.
This extension option applies only to the matching in the full-text selection in which it is found and might be used to specify how compound words should be matched.
Full-text selections can be combined with the logical connectives ftor (full-text or), ftand (full-text and), not in (mild not), and ftnot (unary full-text not).
ftor
operator.
An or-selection finds all matches that satisfy at least one of the operand full-text selections.
The following expression returns the book element written by "Millicent":
ftand
operator.
An and-selection finds matches that satisfy all of the operand full-text
selections simultaneously. A match of an and-selection is formed by combining
matches for each of the operand full-text selections as described in
For example, "usability" ftand "testing" will find two matches in //book[@number="1"]/title: each of the two matches for the FTWords selection "usability" (the two occurrences of "usability" in the string value of the title element) is combined with the single match for the FTWords "testing" (only one occurrence of "testing" in the title).
Since the above and-selection has at least one match, the following
expression will return "true".
The following expression returns false, because "Millicent" and "Montana" are not contained by the same author element in any book element:
No author element in any book element contains both "Millicent" and "Montana". Therefore, for any such author element, there is either one match for the FTWords "Millicent" and zero matches for the FTWords "Montana", or vice versa, or no matches for either of them. In any of these cases, the and-selection will have zero matches.
not in
operator.
The not in operator is a milder form of the operator combination ftand ftnot. The selection A not in B matches a token sequence that matches A, but not when it is a part of a match of B. In contrast, A ftand ftnot B only finds matches when the token sequence contains A and does not contain B.
As an example, consider a search for "Mexico" not in "New Mexico". This may return, among others, a document which is all about "Mexico" but mentions at the end that "New Mexico was named after Mexico". The occurrence of "Mexico" in "New Mexico" is not considered, but other occurrences of "Mexico" are matched. Note that this document would not be matched by the full-text selection "Mexico" ftand ftnot "New Mexico".
A match to a mild-not selection must contain at least one token that satisfies the first condition and does not satisfy the second condition. If it contains a token that satisfies both the first and the second condition, the token is not considered as a match.
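The effect of the mild not can be sketched over token positions. The following Python fragment is a simplified illustration, not the formal semantics: matches are reduced to hypothetical (start, end) position pairs, and the function name mild_not is an assumption.

```python
def mild_not(a_spans, b_spans):
    """Keep the matches of A that are not part of a match of B.

    Each span is a (start, end) pair of token positions, inclusive.
    """
    def inside(a, b):
        # True if span a lies completely within span b.
        return b[0] <= a[0] and a[1] <= b[1]

    return [a for a in a_spans
            if not any(inside(a, b) for b in b_spans)]


# "New Mexico was named after Mexico" tokenized at positions 1..6.
mexico = [(2, 2), (6, 6)]   # occurrences of "Mexico"
new_mexico = [(1, 2)]       # occurrence of "New Mexico"
remaining = mild_not(mexico, new_mexico)
```

In this sketch only the occurrence of "Mexico" at position 6 remains, since the occurrence at position 2 is part of "New Mexico".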
The following expression returns true, because "usability" appears in the title and the p elements and the token within the phrase "Usability Testing" in the title element is not considered:
Operands of a mild-not selection may not contain a full-text selection
that evaluates to an at most
,
from ... to
, and exactly
occurrences ranges.
If such an expression is encountered, an error
ftnot
.
A not-selection selects matches that do not
satisfy the operand full-text selection.
Details about how such matches are constructed are given in
The following expression returns the empty sequence, because all book elements contain "usability":
The following expression returns true, because book elements contain "information" and "retrieval" but not "information retrieval":
The following expression returns book elements containing "web site usability" but not "usability testing":
Recall that the grammar rule for
The default is unordered. Unordered is in effect when ordered is not specified in the query. Unordered cannot be written explicitly in the query.
An ordered selection selects matches which satisfy the operand full-text selection and which also satisfy the following constraint: the order that the matching tokens or phrases have in the text being searched is the same order that the corresponding query tokens or phrases have in the operand selection. In both cases, the ordering is determined from the minimum start positions of the constituent tokens.
The following expression returns true, because titles of book elements contain "web site" and "usability" in the order in which they are written in the query, i.e., "web site" must precede "usability":
The following expression returns false, because although "Montana" and "Millicent" both appear in the book element, they do not appear in the order they are written in the query:
xs:integer
.
A window selection may cross element boundaries. The size of the window is not affected by the presence or absence of element boundaries. Stop words are included in the computation of the window size whether they are ignored by the query or not.
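Assuming that each matched token or phrase is represented by its (start, end) token positions, the window constraint for a window measured in words can be sketched as follows. The helper name within_window and the sample positions are hypothetical, and the check used here (largest position minus smallest position plus one must not exceed the window size) is this sketch's reading of the window size. Stop words keep their position numbers, so they count toward the size.

```python
def within_window(spans, size):
    """Check whether all matched spans fit into a window of `size` tokens.

    spans: (start, end) token positions of the matched tokens/phrases.
    """
    smallest = min(start for start, _ in spans)
    largest = max(end for _, end in spans)
    return largest - smallest + 1 <= size


# Hypothetical positions of "web", "site", and "usability".
tight = [(2, 2), (3, 3), (4, 4)]
loose = [(1, 1), (2, 2), (7, 7)]
```

Under these assumptions, the first set of positions fits a window of 5 tokens and the second does not.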
A window selection examines the matches generated by the preceding
portion of the
The following expression returns true, because "web", "site", and "usability" are within a window of 5 tokens in the title element:
The following expression returns true, because "web" and "site" in the order they are written in the query and either "usability" or "testing" are within a window of at most 10 tokens:
The following expression returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3:
The following expression returns the sample book element, because its number attribute is 1 and it contains a window of 2 words which contains an occurrence of "efficient" but not an occurrence of "and". There is just one such matching window in the sample text and it contains "enable efficient".
The following expression returns the empty sequence, because in the selected book element, there is no occurrence of "efficient" within a window of 3 tokens which would not also contain an occurrence of "and":
In order to allow meaningful results for nested positional filters,
e.g., a window selection embedded inside a distance selection, the
resulting matches for window selections are formed from the input matches
that satisfy the window constraint as follows. All StringIncludes of
such a match are coerced into a single StringInclude that spans all
token positions from the smallest to the largest position of any input
StringIncludes. This is explained in more detail in Section
| ("at" "least"
| ("at" "most"
| ("from"
A distance selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases satisfy the specified distance conditions.
Distances in the search context are measured in units of tokens, sentences, or paragraphs. Roughly speaking, the distance between two matches is the number of intervening units, so a distance of zero tokens (sentences, paragraphs) means no intervening tokens (sentences, paragraphs). More precisely, given two matches, we first determine their order by sorting on starting position and if necessary on ending position. Let M1 be the "earlier" and M2 be the "later". (If there are overlapping tokens involved, the designations "earlier" and "later" may not be intuitively obvious.) Then the distance between the two is M2's starting position minus M1's ending position, minus 1.
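The distance computation just described can be written down directly. The following Python sketch assumes that a match is given as a hypothetical (start, end) pair of token positions; the function name distance is an assumption and only distances measured in tokens are covered.

```python
def distance(match1, match2):
    """Distance in tokens between two matches.

    Each match is a (start, end) pair of token positions. The matches
    are ordered by starting position and, if necessary, by ending
    position; the result is the later start minus the earlier end,
    minus 1.
    """
    earlier, later = sorted([match1, match2])
    return later[0] - earlier[1] - 1
```

In this sketch, two tokens at consecutive positions are at distance 0, two tokens assigned the same position are at distance -1, and a token covering positions 2 through 4 and a token at position 3 are at distance 3 - 4 - 1 = -2.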
When computing distances in the search context, a distance selection may cross element boundaries; they affect the distance computed only to the extent that they affect the tokenization of the search context. Stop words are counted in those computations whether they are ignored or not.
When a distance selection applies a distance condition to more than two matches, the distance condition is required to hold on each successive pair of matches.
An words
, sentences
, or paragraphs
,
where words
refers to a distance measured in tokens.
An xs:integer
.
Let the value of the first (or only) operand be M. If "from" is specified, let the value of the second operand be N.
If "exactly" is specified, then the range is the closed interval [M, M]. If "at least" is specified, then the range is the half-closed interval [M, unbounded). If "at most" is specified, then the range is the half-closed interval (unbounded, M]. If "from-to" is specified, then the range is the closed interval [M, N]. Note: If M is greater than N, the range is empty.
Here are some examples of
'exactly 0' specifies the range [0, 0].
'at least 1' specifies the range [1,unbounded).
'at most 1' specifies the range (unbounded, 1].
'from 5 to 10' specifies the range [5, 10].
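The four kinds of range can be sketched as pairs of bounds. This Python fragment is an illustration only; the names ft_range and in_range are hypothetical, and an unbounded end is represented by an infinite float.

```python
import math


def ft_range(kind, m, n=None):
    """Return the (lower, upper) bounds of a range, both inclusive."""
    if kind == "exactly":
        return (m, m)
    if kind == "at least":
        return (m, math.inf)
    if kind == "at most":
        return (-math.inf, m)
    if kind == "from":
        if n is None:
            raise ValueError("'from' requires a second operand")
        return (m, n)
    raise ValueError(f"unknown range kind: {kind}")


def in_range(value, bounds):
    """True if value lies within the bounds; empty if lower > upper."""
    lower, upper = bounds
    return lower <= value <= upper
```

With these definitions, no value lies in the range 'from 10 to 5', which reflects the note that the range is empty when M is greater than N.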
The following expression returns false, because "completion" and "errors" are less than 11 tokens apart:
The following expression returns false:
The search context does contain the phrase "The usability of a Web site", in which the tokens "usability" and "Web" have a distance of 2 words, and the tokens "Web" and "site" have a distance of 0 words, both of which satisfy the constraint distance at most 2 words. However, the problem is that "usability" and "site" have a distance of 3 words, which does not satisfy the constraint, and so the distance selection yields no matches, and the expression as a whole yields false. (The phrase "Improving Web Site Usability" would satisfy the given full-text selection, but it occurs in an attribute value, and so is not subject to tokenization.)
The following expression returns the empty sequence, because between any token "usability" and the token in any occurrence of the phrase "web site" that is the nearest to the token "usability" there is always more than one intervening token:
The following expression returns the book title, because for the occurrences of the tokens "web" and "users" in the note element only one intervening token appears:
In order to allow meaningful results for nested positional filters, e.g., a distance selection embedded inside another distance selection, the resulting matches for distance selections are formed from the input matches that satisfy the distance constraint as follows. All StringIncludes of such a match are coerced into a single StringInclude that spans all token positions from the smallest to the largest position of any input StringIncludes. Thus, a distance selection that embeds a window or a distance selection takes the result of the embedded selection as a single unit.
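The coercion into a single unit can be sketched as taking the smallest and largest token position of a match. The following Python fragment is a simplified illustration; StringIncludes are reduced to hypothetical (start, end) pairs, and the names coerce and gap as well as the sample positions are assumptions.

```python
def coerce(spans):
    """Merge the spans of one match into a single spanning unit."""
    return (min(start for start, _ in spans),
            max(end for _, end in spans))


def gap(unit1, unit2):
    """Intervening tokens between two units (later start - earlier end - 1)."""
    earlier, later = sorted([unit1, unit2])
    return later[0] - earlier[1] - 1


# Two inner matches, each consisting of three single-token spans.
inner1 = coerce([(10, 10), (11, 11), (12, 12)])
inner2 = coerce([(40, 40), (41, 41), (42, 42)])
outer_gap = gap(inner1, inner2)
```

Here each inner match is treated as one unit by the outer selection, so the outer constraint is applied to the 27 tokens between position 12 and position 40.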
The following gives an example of nested distance selections:
This expression finds book elements that contain, for instance, "Richard M. Nixon" and "George W. Bush" at least 20 words apart. The matches for the inner distance selections are treated as single units (represented by StringIncludes) by the outer distance selection. If such phrases are present in the search context, the outer distance selection enforces a constraint on the number of intervening tokens ("at least 20") between the last token of "Richard M. Nixon" and the first token of "George W. Bush".
A scope selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases are contained in the same scope or in different scopes.
Possible scopes are sentences and paragraphs.
By default, there are no restrictions on the scope of the matches.
The following expression returns false, because the tokens "usability" and "Marigold" are not contained within the same sentence:
The following expression returns true, because the tokens "usability" and "Marigold" are contained within different sentences:
The following expression returns a book element, because it contains "usability" and "testing" in the same paragraph:
The following expression returns a book element, because "site" and "errors" appear in the same sentence:
It is possible that both "same sentence" and "different sentence" conditions are simultaneously satisfied for several tokens and/or phrases within the same document fragment. This can be observed if there are occurrences of the tokens and/or phrases both within the same sentence and within different sentences. For example, consider the following document fragment.
This sample will satisfy both conditions ("usability" ftand "reviews") different sentence and ("usability" ftand "reviews") same sentence. The tokens "usability" and "reviews" occur both in different sentences (the first and second shown sentences) and in the same sentence (the second shown sentence).
The above observation also holds for the "same paragraph" and "different paragraph" conditions.
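That both scope conditions can hold for the same fragment can be sketched with relative sentence numbers. The following Python fragment is an illustration under stated assumptions: the function names and the sentence numbers are hypothetical, and a sentence number of zero is taken to mean that the tokenizer does not report sentences, in which case neither condition is satisfied in this sketch.

```python
from itertools import product


def same_scope(units):
    """All matched tokens lie in one and the same sentence (or paragraph)."""
    return all(u > 0 for u in units) and len(set(units)) == 1


def different_scope(units):
    """No two matched tokens lie in the same sentence (or paragraph)."""
    return all(u > 0 for u in units) and len(set(units)) == len(units)


# "usability" occurs in sentences 1 and 2, "reviews" only in sentence 2.
usability = [1, 2]
reviews = [2]
combinations = list(product(usability, reviews))
same = [c for c in combinations if same_scope(c)]
different = [c for c in combinations if different_scope(c)]
```

Under these assumptions one combination of occurrences satisfies "same sentence" and another satisfies "different sentence", so both selections yield a match.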
An anchoring selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases are the first, last, or all tokens in the tokenized form of the items being searched.
Using the "at start" operator, tokens or phrases are matched, if they cover the first token position in the tokenized string value of the item being searched.
Using the "at end" operator, tokens or phrases are matched, if they cover the last token position in the tokenized string value of the item being searched.
Using the "entire content" operator, tokens or phrases are matched, if they cover all token positions of the tokenized string value of the item being searched.
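The three anchoring operators can be sketched in terms of covered token positions. This Python fragment is a simplified illustration; matches are reduced to hypothetical (start, end) position pairs, the item being searched is assumed to occupy the positions first..last, and all function names are assumptions.

```python
def covered_positions(spans):
    """Set of all token positions covered by the given spans."""
    return {p for start, end in spans for p in range(start, end + 1)}


def at_start(spans, first):
    """The match covers the first token position of the item."""
    return first in covered_positions(spans)


def at_end(spans, last):
    """The match covers the last token position of the item."""
    return last in covered_positions(spans)


def entire_content(spans, first, last):
    """The match covers every token position of the item."""
    return covered_positions(spans) >= set(range(first, last + 1))
```

For an item with token positions 1 to 6, a match covering positions 1 to 3 satisfies "at start" only, while a match covering positions 1 to 6 satisfies all three operators in this sketch.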
The following expression returns each title element starting with the phrase "improving the usability of a web site":
The following expression returns the p element of the sample, because it ends with the phrase "propagating few errors":
Since the distance operator doesn't imply an ordering, the last example would also yield a match if the p element ended with, say, "few errors are propagated".
The following expression returns each note element whose entire content is "this book has been approved by the web site users association":
The following example returns true because both the content and the note elements match:
The
Let I1, I2, ..., In be the sequence of items of the search context and let N1, N2, ..., Nk be the sequence of nodes that UnionExpr evaluates to. For each Ij (j=1..n) a copy is made that omits each node Ni (i=1..k). Those copies form the new search context. If UnionExpr evaluates to an empty sequence, no nodes are omitted.
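The construction of the new search context can be sketched with Python's xml.etree.ElementTree. This is only an illustration: the function name copy_without is hypothetical, ignored nodes are selected by element name rather than by an arbitrary UnionExpr, and the sample fragment is invented for this sketch.

```python
import copy
import xml.etree.ElementTree as ET


def copy_without(item, ignored_tag):
    """Return a copy of `item` that omits every element named `ignored_tag`.

    The original item is left untouched. Text that follows an omitted
    element (its tail) is preserved.
    """
    clone = copy.deepcopy(item)
    for parent in list(clone.iter()):
        for child in list(parent):
            if child.tag != ignored_tag:
                continue
            siblings = list(parent)
            index = siblings.index(child)
            tail = child.tail or ""
            if index == 0:
                parent.text = (parent.text or "") + tail
            else:
                previous = siblings[index - 1]
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
    return clone


doc = ET.fromstring(
    "<book><title>Web Usability"
    "<annotation>Web Usability expert notes</annotation> Basics</title>"
    "<editor>Web Usability"
    "<annotation>Web Usability press</annotation> Ltd</editor></book>"
)
searched_text = "".join(copy_without(doc, "annotation").itertext())
```

In this invented fragment, "Web Usability" occurs twice in the text of the copy, and "expert" no longer occurs because it appears only inside an omitted annotation element.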
In the following fragment, if $x//annotation is ignored, "Web Usability" will be found 2 times: once in the title element and once in the editor element. The 2 occurrences in the 2 annotation elements are ignored. On the other hand, "expert" will not be found, as it appears only in an annotation element.
By default, no element content is ignored.
Nodes
An extension selection consists of one or more pragmas followed by a full-text selection enclosed in curly braces. See
(#
and #)
, and
consists of an identifying QName followed by #)
. The QName of a
pragma must resolve to a namespace URI and local name, using the statically known namespaces.
Since there is no default namespace for pragmas, a pragma QName must include a namespace prefix.
Each implementation recognizes an
If the namespace part of a pragma QName is not recognized by the
implementation as a pragma namespace, then the pragma
is ignored. If all the pragmas in an
If an implementation recognizes the namespace of one or more pragmas in an
It is a static error
If an implementation recognizes a pragma, it must report any static errors in the following full-text selection even if it will not apply that selection.
The following examples illustrate three ways in which extension selections might be used.
A pragma can be used to furnish a hint for how to evaluate the following full-text selection, without actually changing the result. For example:
An implementation that recognizes the exq:use-index pragma might use an index to evaluate the full-text selection that follows. An implementation that does not recognize this pragma would evaluate the full-text selection in its normal way.
A pragma might be used to modify the semantics of the following
full-text selection in ways that would not (in the absence of the pragma) be
conformant with this specification. For example, a pragma might be used to
change distance counting so that adjacent words are at a distance of 1
(otherwise they would be at a distance of 0):
Such changes to the language semantics must be scoped to the expression contained within the curly braces following the pragma.
A pragma might contain syntactic constructs that are evaluated in place of the following full-text selection. In this case, the following selection itself (if it is present) provides a fallback for use by implementations that do not recognize the pragma. For example:
Here an implementation that recognizes the pragma will return the result of evaluating the proprietary syntax with class 'animals', while an implementation that does not recognize the pragma will instead return the result of the thesaurus option.
If no fallback expression is required, or
if none is feasible, then the expression between the curly braces may be
omitted, in which case implementations that do not recognize the pragma will
raise a static error.
This section describes the formal semantics of XQuery and XPath Full Text 1.0. The figure below shows how XQuery and XPath Full Text 1.0 integrates with XQuery 1.0 and XPath 2.0.
The following diagram represents the interaction of XQuery and XPath Full Text with the rest of XQuery 1.0 and XPath 2.0. It illustrates how full-text expressions can be nested within XQuery 1.0 and XPath 2.0 expressions and vice versa.
Step 1 represents the composability of XQuery 1.0 and XPath 2.0 expressions and the fact that such expressions evaluate to a sequence of XDM items. This process is outside the scope of this document and will not be discussed further.
Step 2 shows how XQuery 1.0 and XPath 2.0 expressions
can be nested within full-text expressions.
If an XQuery 1.0 and XPath 2.0 expression
is nested on the left-hand side of an
Step 3 represents the composability of
Step 4 shows how XQuery and XPath Full Text 1.0 and scoring
expressions can be nested into XQuery 1.0 and XPath 2.0 expressions.
The sections
In the list above and throughout the rest of this section, bold
typeface has been used to distinguish the concepts that are part of the
The functions and schemas defined in this section are
considered to be within the fts: namespace (as discussed in
section
Note that by using XQuery 1.0 and XPath 2.0 to specify the formal semantics, we avoid the need to introduce new formalism. We simply reuse the formal semantics of XQuery 1.0 and XPath 2.0.
Tokenization, including the definition of the term "token",
Each token
Tokenization of an item
The tokenizer
The starting and ending position of a token
In the tokenization of an item, consider the range of token positions from the smallest starting position to the largest ending position; every token position in that range must be covered by some token in the tokenization. That is, for every token position P, there must exist some token T such that T's starting position <= P <= T's ending position.
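The coverage requirement above can be checked mechanically. The following Python sketch assumes that a tokenization is given as a list of hypothetical (start, end) token positions; the function name positions_covered is an assumption.

```python
def positions_covered(tokens):
    """Check that every position between the smallest starting position
    and the largest ending position is covered by some token."""
    lowest = min(start for start, _ in tokens)
    highest = max(end for _, end in tokens)
    return all(
        any(start <= p <= end for start, end in tokens)
        for p in range(lowest, highest + 1)
    )
```

A tokenization with tokens at positions 1, 2, and 3 passes this check, as does one with overlapping tokens such as (2, 4) and (3, 3), whereas tokens at positions 1 and 3 alone leave position 2 uncovered.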
The tokenizer
Each token is contained in at most one sentence and at most one paragraph. (In particular, this means that no tokens of any sentence are contained in any other sentence, and no tokens of any paragraph are contained in any other paragraph.)
All tokens of a sentence are contained in at most one paragraph.
The range of token positions from the smallest starting position to the largest ending position in a sentence does not overlap with the token position range from any other sentence.
The range of token positions from the smallest starting position to the largest ending position in a paragraph does not overlap with the token position range from any other paragraph.
Useful information for tokenizer implementors may be found
in
Usually, the starting and ending positions of a token are the same. For some languages, some tokenizers may identify overlapping tokens. For example, the German word "Donaudampfschifffahrtskapitaensmuetze" might be tokenized into the following tokens: "Donaudampfschifffahrtskapitaensmuetze", "Donau", "dampf", "schiff", "dampfschiff", "kapitaen", "muetze", "kapitaensmuetze", "schifffahrt", "dampfschifffahrt", and perhaps others. In the face of overlapping tokens, it is implementation-dependent what positions a tokenizer assigns to each such token. For example, a tokenizer might assign the same position value to each of the tokens "Donaudampfschifffahrtskapitaensmuetze", "Donau", "dampf", "schiff", "dampfschiff", etc. In that case, the distance between each (overlapping) token assigned the same position is -1. Tokenizers might retain additional information about those overlapping tokens that allows the full-text implementation to distinguish among them.
Consider the sentence "Ich sehe den Dampfschifffahrtskapitän auf dem Fluß." If an implementation tokenizes "Dampfschifffahrtskapitän" as overlapping tokens at the same position, then the implementation could still determine that the query "'Schifffahrt Dampf' window 0 words ordered" fails to match the sentence because phrase matching is implementation-defined and may make use of additional implementation-dependent token information.
Even more complex situations can arise. Consider, for example, the German sentence "Er stellte sie vor." A sophisticated tokenizer might construct the token "vorstellen" covering positions 2 through 4, which overlaps the token "sie" at position 3. For the purposes of distance calculations, tokens are considered in the order of their starting positions, so the distance between "vorstellen" and "sie" would be 3-4-1=-2. (See fts:wordDistance, below.)
For example, the following example must return false, because the 'secret' only occurs within an attribute and a comment, neither of which contributes characters to the string value of the 'p' element node:
The following document may lead to overlapping tokens to account for the ambiguity caused by the hyphen:
The following document fragment is the source document for examples in this section. A sample tokenization is used for the examples in this section. The results might be different for other tokenizations.
Unless stated otherwise, the results assume a case-insensitive match.
In this sample tokenization, tokens are delimited by punctuation and whitespace symbols.
The token "Ford" is at relative position 1.
The token "Mustang" is at relative position 2.
The token "2000" is at relative position 3.
Relative position numbers are assigned sequentially through the end of the document.
Hence in this example each token occupies exactly one position, and no overlapping of tokens occurs. The relative positions of tokens are shown below in parentheses.
The relative positions of paragraphs are determined similarly. In this sample tokenization, the paragraph delimiters are start tags and end tags.
The tokens in the first 'offer' element are assigned relative paragraph number 1.
The tokens from the next 'offer' element are assigned relative paragraph number 2.
Relative paragraph numbers are assigned sequentially through the end of the document.
The relative positions of sentences are determined similarly using sentence delimiters.
Implementations may provide for the means to ignore or side-step certain structural elements when performing tokenization. In the following example, the implementation has decided to ignore the markup for <bold> and prune out the entire subtree headed by <deleted>.
Using the same notation as before, this sample tokenization is shown below. All the tokens marked with a token position also have the same sentence and paragraph relative positions. Note that there are no tokens marked for the ignored subtree.
startPos: the smallest starting position of a token in the sequence
endPos: the largest ending position of any token of the sequence
startSent: the relative position of the sentence containing the token with the smallest starting position or zero if the tokenizer does not report sentences
endSent: the relative position of the sentence containing the token with the largest ending position or zero if the tokenizer does not report sentences
startPara: the relative position of the paragraph containing the token with the smallest starting position or zero if the tokenizer does not report paragraphs
endPara: the relative position of the paragraph containing the token with the largest ending position or zero if the tokenizer does not report paragraphs
The following matching function is the central
The above function returns the $searchContext
that match the query string represented by
the sequence $queryTokens
, when using the match
options in $matchOptions
and stop words in
$stopWords
. If $queryTokens
is a
sequence of more than one query token, each returned
While this matching function assumes a tokenized representation of the query strings, it does not assume a tokenized representation of the input items in $searchContext, i.e. the texts being searched. Hence, the tokenization of the search context is implicit in this function and coupled to the retrieval of matches. Of course, this does not imply that tokenization of the search context cannot be done a priori. The tokenization of each item in $searchContext does not necessarily take into account the match options in $matchOptions or the query tokens in $queryTokens. This allows implementations to tokenize and index input data without the knowledge of particular match options used in full-text queries.
The XQuery 1.0 and XPath 2.0 Data Model is
inadequate to support fully composable
XQuery and XPath Full Text adds relative token, sentence, and
paragraph position numbers via
The
Intuitively,
The
Since in most of the examples below the tokens span only a single
position, we characterize the startPos
and the endPos
attribute. Furthermore, for expository reasons, we
include in each
The simplest example of an "Mustang"
. The
As shown, the "Mustang"
. The result represented by the first
A more complex example of an "Ford Mustang"
. The
There are two possible results for this
An even more complex example of an "Mustang"
ftand ftnot "rust"
that searches for
"Mustang" but not "rust". The
This example introduces
The XML schema for representing
The stokenNum
attribute in
stokenNum
attribute stores
the number of query tokens used when evaluating the queryPos
attribute in new
The XML structures defined by the following schema
represent <left>
and <right>
descendant elements. For unary <selection>
descendant element is used. Additional
characteristics of
The semantics for the evaluation of
The
The semantics for the
For
concreteness, assume that the ftcontains
expression such
as searchContext ftcontains ftSelection
. In order to
determine the
ftSelection
, the
fts:evaluate($ftSelection,
$searchContext, $matchOptions, 0)
, where
$ftSelection
is the XML representation of the
ftSelection
and
$searchContext
is bound to the result of
the evaluation of the XQuery expression
searchContext
.
Initially, the
$queryTokensNum
is 0, i.e., no
query tokens have been processed.
The variable $matchOptions
is bound to the
list of match options as defined in the static context (see
Appendix $ftSelection
modify the match options collection as
evaluation proceeds.
Given the invocation of: fts:evaluate($ftSelection,
$searchContext, $matchOptions)
, evaluation proceeds as
follows. First, $ftSelection
is checked to see whether
1) it contains a match option,
2) it contains a weight specification,
3) it is an
If $ftSelection
contains one or more match options,
these are combined with the inherited match options
via a call to
If $ftSelection
contains a weight
specification, then the specification is ignored because it
does not alter the semantics. The
If $ftSelection
is an
If $ftSelection
contains neither a match
option nor a weight specification and is not an ftand
, ftor
, window
.
These operations are fully-compositional and may be
invoked on nested
First, the
The FTSelection1
which is
generically named
For example, let
FTSelection1
be FTSelection2 ftand
FTSelection3
. Here FTSelection2
and
FTSelection3
may themselves be arbitrarily nested
FTSelection2
and FTSelection3
, and the
resulting ftand
.
The semantics of the
The formal semantics of the
The $tokenInfo1
and
$tokenInfo2
. For example, two tokens with consecutive
positions have a distance of 0 tokens, and two overlapping tokens
have a distance of -1 tokens.
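The distance convention in the example above can be illustrated with a small sketch. This is not the specification's XQuery definition; it is a Python approximation in which `TokenInfo` and `word_distance` are hypothetical names standing in for an fts:tokenInfo element and the word-distance function, and only the startPos and endPos attributes are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    """Hypothetical stand-in for an fts:tokenInfo element."""
    start_pos: int
    end_pos: int


def word_distance(a: TokenInfo, b: TokenInfo) -> int:
    """Number of token positions lying strictly between two tokens."""
    # Order the pair by start position, then by end position.
    first, second = sorted((a, b), key=lambda t: (t.start_pos, t.end_pos))
    # Consecutive positions give 0; tokens sharing a position give -1.
    return second.start_pos - first.end_pos - 1


consecutive = word_distance(TokenInfo(3, 3), TokenInfo(4, 4))
overlapping = word_distance(TokenInfo(3, 3), TokenInfo(3, 3))
```

Sorting the pair first makes the result independent of argument order, which matches the symmetric wording of the example.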
The $tokenInfo1
and
$tokenInfo2
.
The $tokenInfo1
and $tokenInfo2
.
The $tokenInfo
describes a token whose starting position is the first position of
the item $searchContext
.
The $tokenInfo
describes a token whose ending position is the last position of
the item $searchContext
.
An fts:queryToken
items, and 4) the position where the latter query string occurs in the
query.
If after the application of all the match options, the sequence
of query tokens returned for an
The Pos: N
, if the attributes
startPos
and endPos
are the same
with N
being that position.
There are five variations of
When "any word" is specified, at least one token in the tokenization of the nested expression must be matched.
When "all words" is specified, all tokens in the tokenization of the nested expression must be matched.
When "phrase" is specified, all tokens in the tokenization of the nested expression must be matched as a phrase.
When "any" is specified, at least one string atomic value in the nested expression must be matched as a phrase.
When "all" is specified, all string atomic values in the nested expression must be matched as a phrase.
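The five variations can be contrasted with a rough boolean sketch. It is written in Python rather than in the XQuery used by this specification, it returns only true or false instead of building match structures, and its tokenizer (lowercased runs of word characters) is an assumption, since tokenization is left to the implementation. The names `tokenize`, `contains_phrase`, and `ft_words` are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(doc: List[str], phrase: List[str]) -> bool:
    """True if the phrase tokens occur at consecutive positions in doc."""
    n = len(phrase)
    if n == 0:
        return False
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))


def ft_words(doc_text: str, query_strings: List[str], option: str) -> bool:
    doc = tokenize(doc_text)
    per_string = [tokenize(q) for q in query_strings]
    all_tokens = [t for tokens in per_string for t in tokens]
    if not all_tokens:
        return False  # nothing to match
    if option == "any word":
        return any(t in doc for t in all_tokens)
    if option == "all words":
        return all(t in doc for t in all_tokens)
    if option == "phrase":
        # All tokens of all query strings form one phrase.
        return contains_phrase(doc, all_tokens)
    if option == "any":
        # Each query string is its own phrase; one must match.
        return any(contains_phrase(doc, p) for p in per_string)
    if option == "all":
        # Each query string is its own phrase; all must match.
        return all(contains_phrase(doc, p) for p in per_string)
    raise ValueError(f"unknown option: {option}")


text = "Ford Mustang in great condition"
queries = ["Ford Mustang", "great condition"]
results = {opt: ft_words(text, queries, opt)
           for opt in ("any word", "all words", "phrase", "any", "all")}
```

In this example only "phrase" fails, because the four query tokens do not occur consecutively in the text.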
The semantics for the case where "any word" is specified is given below. Since
The tokenized query strings are passed to ApplyFTWordsAnyWord as a sequence of fts:queryItem, each containing the tokens of a single query string. A single flattened sequence of all tokens (of type fts:queryToken) over all query items is constructed. For each of these, the result of
The semantics for the case where "all words" is specified is similar to the above, except that it composes a conjunction. It is given below.
The semantics for the case where "phrase" is specified is given below.
The
The semantics for the case where "any" is specified is given below.
The any
specified forms the disjunction of the
The semantics for the case where "all" is specified is given below.
The difference between "all" and "any" is the use of conjunction instead of disjunction.
The
XQuery 1.0 functions are used to
define the semantics of
The previous section described FTSelections without
giving any details about how
The extension is achieved by modifying an existing
function and adding functions that are specific to the
The semantics of most of the
Two
Unlike all other fts:ApplyFTWordsAny
.
The matching of the alternatives is performed with
For the semantics of the
The expansion of
The $ftSelection/fts:matchOptions
to override any options
of the same group declared up the query tree ($matchOptions
).
This function determines how match options of the same group overwrite each other, so that only one option of the same group remains.
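The override rule can be pictured as a keyed merge. The following Python sketch is only an illustration under assumed names: `replace_match_options` is hypothetical, and match options are modeled as a dictionary from a group name (for example "case") to the option chosen for that group, which is a simplification of the XML representation.

```python
from typing import Dict


def replace_match_options(inherited: Dict[str, str],
                          local: Dict[str, str]) -> Dict[str, str]:
    """Local options replace inherited options of the same group."""
    merged = dict(inherited)  # copy, so the inherited options are untouched
    merged.update(local)      # one option per group remains
    return merged


# Group names and option values here are illustrative assumptions.
inherited = {"case": "case insensitive", "stemming": "without stemming"}
local = {"case": "lowercase"}
effective = replace_match_options(inherited, local)
```

Keying by group is what guarantees that only one option of each group survives the merge.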
The details of the semantics of the remaining
The function
The function $tokens
in the thesaurus $thesaurusName
for the language
$thesaurusLanguage
using the relationship
$relationship
within the optional number of levels
$range
. If $tokens
consists of
more than one query token, it is regarded as a
phrase.
The thesaurus function returns a sequence of expansion
alternatives. Each alternative is regarded as a new search
phrase and is represented as a query item.
Alternatives are treated as though they are connected with
a disjunction (
$matchOptions
parameter to
$matchOptions
parameter to
$matchOptions
parameter to
The semantics for the
Stop words interact with
The stop word set is computed using the fts:calcStopWords function. The function uses the function fts:resolveStopWordsUri to resolve any URI to a sequence of strings. Then, the stop words are removed from the set of query tokens.
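A minimal Python sketch of this two-stage process follows. It is an approximation, not the specification's definition: the in-memory registry, the example URI, and the function names are assumptions, the comparison is simply case-insensitive, and the sketch ignores how removed stop words affect token positions.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical registry standing in for whatever a URI resolves to.
STOP_WORD_LISTS: Dict[str, List[str]] = {
    "http://example.org/stopwords/en": ["the", "a", "of"],
}


def resolve_stop_words_uri(uri: str) -> List[str]:
    """Resolve a URI to a sequence of strings (raises KeyError if unknown)."""
    return STOP_WORD_LISTS[uri]


def calc_stop_words(inline_words: Iterable[str], uris: Iterable[str]) -> Set[str]:
    """Combine inline stop words with every URI-referenced list."""
    words = {w.lower() for w in inline_words}
    for uri in uris:
        words.update(w.lower() for w in resolve_stop_words_uri(uri))
    return words


def remove_stop_words(query_tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop the query tokens that are stop words, keeping the original order."""
    return [t for t in query_tokens if t.lower() not in stop_words]


stop_words = calc_stop_words(["in"], ["http://example.org/stopwords/en"])
remaining = remove_stop_words(["the", "Mustang", "in", "Ohio"], stop_words)
```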
The
$matchOptions
parameter to
The parameters of the
The
For example, consider the "Mustang" ftor "Honda"
. The
The
The parameters of the
The result of the conjunction is a new
For example, consider the "Mustang" ftand "rust"
. The
source
The
The
The generation of the resulting
In the
The function
The function
For example, consider the ftnot ("Mustang" ftor "Honda")
. The
source
The
The parameters of the
The resulting
For example, consider the ("Ford" not in "Ford
Mustang")
. The
source
The
source
The
The
The resulting
For example, consider the ("great" ftand "condition")
ordered
. The source
The
The parameters of the
The semantics of same sentence is given below.
An same sentence
contains those
The semantics of different sentence is given below.
An different sentence
contains those
The semantics of same paragraph is analogous to same sentence and is given below.
The semantics of different paragraph is analogous to different sentence and is given below.
The semantics for the general case is given below.
For example, consider the ("Mustang" ftand "Honda") same
paragraph
. The source
The
The parameters of the at start
, at end
, or entire content
),
and 3) one
The evaluation of scope functions depends on the type of the content match.
entire content
is evaluated as
distance exactly 0 words at start at end
, i.e., all the
at start
retains only
fts:isStartToken
.
at end
retains the
fts:isEndToken
.
Before we define the semantics functions of the joinIncludes
that will
be used in their definitions. joinIncludes
takes a sequence of
The parameters of the
fts:distanceType
, 2) a size, and 3) one
The semantics of window N words is given below.
The semantics of window N sentences is given below.
The semantics of window N paragraphs is given below.
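As a rough, non-normative illustration of the word case, the Python sketch below checks whether a set of matched token positions fits inside a window of N words. It assumes that a window is satisfied when the span from the smallest start position to the largest end position covers at most N positions, and it ignores excluded tokens entirely; `within_word_window` is a hypothetical name.

```python
from typing import Sequence, Tuple


def within_word_window(positions: Sequence[Tuple[int, int]], n: int) -> bool:
    """positions holds (start_pos, end_pos) for each matched token."""
    if not positions:
        return False
    min_start = min(start for start, _ in positions)
    max_end = max(end for _, end in positions)
    # The span covers max_end - min_start + 1 token positions.
    return max_end - min_start + 1 <= n


fits = within_word_window([(1, 1), (10, 10)], 10)
too_wide = within_word_window([(1, 1), (10, 10)], 9)
```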
The resulting
The semantics for the general function is given below.
For example, consider the ("Ford Mustang" ftand
"excellent") window 10 words
.
The ("Ford Mustang" ftand
"excellent")
are given below.
The result for the
The parameters of the
The semantics of word distance exactly N is given below.
The semantics of word distance at least N is given below.
The semantics of word distance at most N is given below.
The semantics of word distance from M to N is given below.
The semantics of sentence distance exactly N is given below.
The semantics of sentence distance at least N is given below.
The semantics of sentence distance at most N is given below.
The semantics of sentence distance from M to N is given below.
The semantics of paragraph distance exactly N is given below.
The semantics of paragraph distance at least N is given below.
The semantics of paragraph distance at most N is given below.
The semantics of paragraph distance from M to N is given below.
The resulting
In the general case, the semantics is given below.
For example, consider the ("Ford Mustang" ftand
"excellent") distance at most 3 words
.
The ("Ford Mustang" ftand
"excellent")
are given below.
The result for the
The parameters of the
The function definitions depend on the range
specification
The general semantics is given below.
The semantics of occurs exactly N times is given below.
The semantics of occurs at least N times is given below.
The semantics of occurs at most N times is given below.
The semantics of occurs from M to N times is given below.
The way to ensure that
there are at least at least N
contains the possible
combinations of
The range [L, U] is represented by the condition at least L and not at least U+1. This transformation is performed in the function
The semantics for the general case is given below.
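Separately from the normative definition, the reduction of a range [L, U] to "at least L and not at least U+1" can be illustrated in Python. The sketch works on plain occurrence counts rather than on combinations of matches, so it shows only the logical shape of the transformation; all function names are hypothetical.

```python
from typing import List


def at_least(count: int, n: int) -> bool:
    """'occurs at least n times' over a plain occurrence count."""
    return count >= n


def in_range(count: int, lower: int, upper: int) -> bool:
    """[lower, upper] expressed as: at least lower and not at least upper + 1."""
    return at_least(count, lower) and not at_least(count, upper + 1)


def count_occurrences(tokens: List[str], word: str) -> int:
    """Count exact token matches (no match options applied)."""
    return sum(1 for t in tokens if t == word)


tokens = "mustang red mustang fast mustang".split()
count = count_occurrences(tokens, "mustang")
exactly_three = in_range(count, 3, 3)   # exactly N is the range [N, N]
at_most_two = in_range(count, 0, 2)     # at most N is the range [0, N]
```

Expressing every range through a single at-least primitive means only one counting rule needs to be defined.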
The above function performs a sanity check to ensure that the nested
For example, consider the "Mustang" occurs at least 2 times
. The source
"Mustang"
is given below.
The result consists of the pairs of the
Consider an SearchContext ftcontains FTSelection
,
where SearchContext
is an XQuery 1.0
expression that returns a sequence of items.
The FTSelection
.
If the SearchContext
ftcontains FTSelection without content IgnoreExpr
for
some XQuery 1.0 expression IgnoreExpr
, then
any nodes returned by IgnoreExpr
are (notionally) pruned from each search context item
before attempting to satisfy the FTSelection
.
More formally, evaluation of an
1. For each XQuery/XPath expression nested within the FTContainsExpr, evaluate it with respect to the same dynamic context as the FTContainsExpr (FT1). Specifically:
a. Evaluate the search context expression (SearchContext), resulting in the sequence of search context items.
b. Evaluate the ignore option (IgnoreExpr) if any, resulting in the set of ignored nodes.
c. At each FTWordsValue, evaluate the literal/expression and convert the result to xs:string*.
d. At each weight specification, evaluate the expression and convert the result to xs:double.
e. At each FTWindow and FTRange, evaluate the AdditiveExpr(s) and convert each to xs:integer.
2. Using the settings of the match option components in the FTContainsExpr's static context, construct an element(fts:matchOptions) structure.
3. Based on the parse-tree of the FTContainsExpr's FTSelection and the results of steps 1c-1e, construct an element(*,fts:ftSelection) structure. We refer to this as the "operator tree" below. In this process:
a. Construct the operator tree from the top down, propagating FTMatchOptions down to FTWordsValues.
b. Tokenize the query string(s) obtained at 1c. (FT2.1)
4. Call the function
$searchContextItems: The sequence of items returned by SearchContext, calculated in step 1a.
$ignoreNodes: The sequence of items returned by IgnoreExpr (in 1b), if that expression is present, or the empty sequence otherwise.
$ftSelection: The XML node representation of FTSelection (constructed in step 3).
$defOptions: The XML representation of the match options in the FTContainsExpr's static context (constructed in step 2).
Within the function, for each search context item:
a. Delete the ignored nodes from the search context item.
[
b. Traverse the operator tree from the top down, propagating FTMatchOptions down to FTWordsValues.
[
c. At each FTWordsValue, using the prevailing FTMatchOptions:
Tokenize the search context obtained at 4a. (FT2.2) (Whether this pays any attention to FTMatchOptions is up to the implementation.)
[This happens within
Match the search context tokens and the query tokens, yielding an element(fts:tokenInfo)* structure.
[This happens within
Convert that into an element(fts:allMatches). (FT3)
[This happens in
d. Traverse the operator tree from the bottom up.
At each point,
the
If the topmost true
.
[This is handled by the QuantifiedExpr in
[Note that the section 4 code doesn't implement 4b-4d as three sequential steps. Instead, they are different aspects of a single traversal of the operator tree.]
If none of the topmost false
.
The boolean value returned by the call to
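The overall flow of the steps above (prune ignored nodes, tokenize each search context item, evaluate the selection, and return true as soon as one item satisfies it) can be sketched in Python. This is an illustration under several assumptions: search context items are modeled as `xml.etree.ElementTree` elements, element boundaries are treated as token boundaries, the tokenizer is a simple regular expression, and the selection is reduced to a predicate over the token list. The names `text_without` and `ft_contains` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List, Set


def text_without(node: ET.Element, ignored: Set[ET.Element]) -> str:
    """String value of node with ignored subtrees (notionally) pruned."""
    parts: List[str] = []

    def walk(el: ET.Element) -> None:
        if el in ignored:
            return  # skip the element and all of its descendants
        parts.append(el.text or "")
        for child in el:
            walk(child)
            # Tail text belongs to the parent, so it is kept even
            # when the child itself is ignored.
            parts.append(child.tail or "")

    walk(node)
    return " ".join(parts)


def ft_contains(items: Iterable[ET.Element],
                ignored: Set[ET.Element],
                selection: Callable[[List[str]], bool]) -> bool:
    """True if at least one search context item satisfies the selection."""
    for item in items:
        tokens = re.findall(r"\w+", text_without(item, ignored).lower())
        if selection(tokens):
            return True
    return False


offer = ET.fromstring(
    "<offer>Ford Mustang <note>some rust</note> excellent condition</offer>")
note = offer.find("note")
with_note = ft_contains([offer], set(), lambda toks: "rust" in toks)
without_note = ft_contains([offer], {note}, lambda toks: "rust" in toks)
```

With the `note` element ignored, the token "rust" is no longer visible to the selection, while the text following the element is still searched.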
This section addresses the semantics of scoring variables in XQuery 1.0 for and let clauses and XPath 2.0 for expressions.
Scoring variables associate a numeric score with the result of the evaluation of XQuery 1.0 and XPath 2.0 expressions. This numeric score estimates how well a result item satisfies the user's information need, as expressed by the XQuery 1.0 and XPath 2.0 expression. The numeric score is computed using an
There are numerous scoring algorithms used in practice. Most of the scoring algorithms take as inputs a query and a set of results to the query. In computing the score, these algorithms rely on the structure of the query to estimate the relevance of the results.
In the context of defining the semantics of XQuery and XPath Full Text, passing the structure of the query poses a problem. The query may contain XQuery 1.0 and XPath 2.0 expressions and XQuery and XPath Full Text expressions in particular. The semantics of XQuery 1.0 and XPath 2.0 expressions is defined using (among other things) functions that take as arguments sequences of items and return sequences of items. They are not aware of what expression produced a particular sequence, i.e., they are not aware of the expression structure.
To define the semantics of scoring in XQuery and XPath Full Text using XQuery 1.0, expressions that produce the query result (or the functions that implement the expressions) must be passed as arguments. In other words, second-order functions are necessary. Currently XQuery 1.0 and XPath 2.0 do not provide such functions.
Nevertheless, in the interest of the exposition, assume that such second-order functions are present. In particular, assume that there are two semantic second-order functions, fts:score and fts:scoreSequence, that each take one argument (an expression) and return, respectively, the score value of this expression and a sequence of score values, one for each item to which the expression evaluates. The scores must satisfy
A for
clause containing a score variable
$scoreSeq
and $i
are
new variables, not appearing elsewhere, and
fts:scoreSequence
is the
second-order function.
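The role of the second-order function in the for clause can be sketched in Python, where an expression can be passed around as an object. Everything below is an assumption made for illustration: `ContainsExpr` is a hypothetical expression type, the scoring formula (term frequency divided by item length) is a toy choice that happens to stay within 0 and 1, and `for_with_score` mirrors the pairing of each result with the score at the same index.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ContainsExpr:
    """Hypothetical expression object: items that contain a query term."""
    items: Sequence[str]
    term: str

    def evaluate(self) -> List[str]:
        return [i for i in self.items if self.term in i.lower().split()]


def score_sequence(expr: ContainsExpr) -> List[float]:
    """Toy scoring: one score per result item, derived from the expression."""
    scores = []
    for item in expr.evaluate():
        tokens = item.lower().split()
        scores.append(tokens.count(expr.term) / len(tokens))
    return scores


def for_with_score(expr: ContainsExpr) -> List[Tuple[str, float]]:
    """Pair each result with the score at the same position."""
    score_seq = score_sequence(expr)  # plays the role of $scoreSeq
    return [(result, score_seq[i])    # i plays the role of $i
            for i, result in enumerate(expr.evaluate())]


expr = ContainsExpr(["mustang mustang car", "honda car", "red mustang"], "mustang")
pairs = for_with_score(expr)
```

The scoring function receives the expression itself, not just its result sequence, which is the point of requiring a second-order function.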
Similarly, a let
clause containing a score variable
This section presents a more complex example for the evaluation of $doc
.
Consider the following
Begin by evaluating the
Step 1: Evaluate the "mustang"
.
Step 2: Evaluate the {"great", "excellent"} any word
.
Step 2.1: Match the token "great"
Step 2.2: Match the token "excellent"
Step 2.3 - Combine the above
Step 3 - Apply the {("great", "excellent")} any word occurs at least 2 times
forming two pairs of
Step 4 - Apply the "Mustang"
ftand ({("great", "excellent")} any word occurs at least 2
times)
forming all possible pairs of
Step 5 - Apply the ("Mustang"
ftand ({("great", "excellent")} any word
occurs at least 2 times)) window 11 words
, filtering out
Step 6 - Evaluate "rust"
.
Step 7 - Apply the ftnot "rust"
,
transforming the StringInclude
into a
StringExclude
.
Step 8 - Apply the (("Mustang"
ftand ({("great", "excellent")} any word occurs at least 2 times))
window 11 words) ftand ftnot "rust"
, forming all
possible combinations of three
Step 9: Apply the <offer>
elements determine
paragraph boundaries).
The resulting true
.
This section defines the conformance criteria for an XQuery and XPath Full Text 1.0 processor.
In this section, the following terms are used to indicate the
requirement levels defined in
An XQuery and XPath Full Text 1.0 processor that claims to conform to
this specification
Minimal Conformance to this specification
Minimal support for XQuery 1.0
Support for everything specified in this document except those
operators and match options specified in
A definition of every item specified to be
Implementations are not required to define items specified to
be
It is optional whether the implementation supports the FTMildNot. If
it does not support FTMildNot and encounters one in a full-text
query, then it
The unrestricted form of negation in FTUnaryNot, that can negate every kind of FTSelection, is optional. Implementations may choose to support the negation operation in a restricted form, enforcing one or both of the following restrictions.
Consider the following example FTSelections.
The first two FTSelections both violate restriction 1, while the third and the fourth conform to both restrictions. The fifth one violates restriction 2, while obeying restriction 1. Note that in the last example the FTSelection to which the window operation is applied is "information" ftand ftnot "retrieval", which contains an FTUnaryNot expression.
If the implementation does enforce
one or both of these restrictions on FTUnaryNot and encounters a
full-text query that does not obey the restriction then it
It is optional whether the implementation supports all the choices
of
The unrestricted form of the FTOrder postfix operator, that can be applied to any kind of FTSelection, is optional. Implementations may choose to enforce the following restriction on the use of FTOrder.
If the implementation does enforce this restriction and encounters a
full-text query that does not obey the restriction then it
It is optional whether the implementation supports the FTScope
operator. If it does not support FTScope and encounters one in a
full-text query, then it
The unrestricted form of the FTWindow postfix operator, that can be applied to any kind of FTSelection, is optional. Implementations may choose to enforce the following restriction on the use of FTWindow.
If the implementation does enforce this restriction and encounters a
full-text query that does not obey the restriction then it
The unrestricted form of the FTDistance postfix operator, that can be applied to any kind of FTSelection, is optional. Implementations may choose to enforce the following restriction on the use of FTDistance.
If the implementation does enforce this restriction and encounters a
full-text query that does not obey the restriction then it
It is optional whether the implementation supports the FTTimes
operator. If it does not support FTTimes and encounters one in a
full-text query, then it
It is optional whether the implementation supports the FTContent
operator. If it does not support FTContent and encounters one in a
full-text query, then it
It is optional whether the implementation supports the
"lowercase" and "uppercase" choices for the
FTCaseOption. If it does not support these choices for the FTCaseOption
and encounters an unsupported choice in a full-text query, then it
It is optional whether the implementation supports the
FTStopWordOption. If it does not support FTStopWordOption and
encounters one in a full-text query, then it
It is optional whether the implementation supports the
FTStopWordOption in the body of the query. If it supports
FTStopWordOption in the prolog, but not in the body of a query, and
encounters one in the body of a query it
It is optional whether the implementation supports the StringLiteral
alternative of
It is optional whether the implementation supports the unrestricted form of FTLanguageOption. Implementations may choose to enforce the following restriction on the use of FTLanguageOption.
If the implementation does enforce this restriction and encounters a
full-text query that does not obey the restriction then it
The implementation may constrain the set of ignored nodes.
If the operand of
The implementation may restrict the allowable expressions used to
compute scores. The restrictions are
If the implementation does enforce such restrictions and encounters a
full-text query that does not obey the restriction then it
An implementation may constrain the range of valid weights to
non-negative values. If an implementation does enforce this restriction and
encounters a full-text query that uses a negative weight, it
The EBNF in this document and in this section is aligned with
the current XML Query 1.0 grammar (see
[EBNF productions omitted; only these alternatives survive: the abbreviated step ("//" ...); the axes ("descendant" "::"), ("attribute" "::"), ("self" "::"), ("descendant-or-self" "::"), ("following-sibling" "::"), ("following" "::"), ("ancestor" "::"), ("preceding-sibling" "::"), ("preceding" "::"), ("ancestor-or-self" "::"); the wildcard ("*" ":" ...); the literal form ("'" ...); the range forms ("at" "least" ...), ("at" "most" ...), ("from" ...); and the match option forms ("case" "sensitive"), "lowercase", "uppercase", ("diacritics" "sensitive"), ("with" "thesaurus" "(" ...), ("without" "thesaurus"), ("without" "stop" "words"), ("with" "default" "stop" "words" ...).]
The following symbols are used only in the definition of
terminal symbols; they are not terminal symbols in the
grammar of
This section contains constraints on the EBNF productions, which are required to parse legal sentences. The note below is referenced from the right side of the production, with the notation:
No single alternative for FTMatchOption can be specified more than once as part of the same FTMatchOptions. For example, if the FTCaseOption "lowercase" is specified, then "uppercase" cannot also be specified as part of the same FTMatchOptions.
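A check for this constraint might look like the following Python sketch. The table of option groups is an assumption assembled from the option values named in this document, and `check_match_options` is a hypothetical helper; a real processor would apply the check to parsed FTMatchOptions rather than to strings.

```python
from typing import Dict, Iterable

# Assumed mapping from an option keyword to the alternative it belongs to.
OPTION_GROUP: Dict[str, str] = {
    "case insensitive": "case",
    "case sensitive": "case",
    "lowercase": "case",
    "uppercase": "case",
    "diacritics insensitive": "diacritics",
    "diacritics sensitive": "diacritics",
    "with stemming": "stemming",
    "without stemming": "stemming",
    "with wildcards": "wildcards",
    "without wildcards": "wildcards",
}


def check_match_options(options: Iterable[str]) -> None:
    """Raise ValueError if two options come from the same alternative."""
    seen: Dict[str, str] = {}
    for option in options:
        group = OPTION_GROUP[option]
        if group in seen:
            raise ValueError(
                f"{option!r} conflicts with {seen[group]!r} "
                f"(both are {group} options)")
        seen[group] = option


check_match_options(["lowercase", "with stemming"])  # accepted silently
```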
The EBNF in this document and in this section is aligned with
the current XPath 2.0 grammar (see
[EBNF productions omitted; only these alternatives survive: the abbreviated step ("//" ...); the axes ("descendant" "::"), ("attribute" "::"), ("self" "::"), ("descendant-or-self" "::"), ("following-sibling" "::"), ("following" "::"), ("namespace" "::"), ("ancestor" "::"), ("preceding-sibling" "::"), ("preceding" "::"), ("ancestor-or-self" "::"); the wildcard ("*" ":" ...); the range forms ("at" "least" ...), ("at" "most" ...), ("from" ...); and the match option forms ("case" "sensitive"), "lowercase", "uppercase", ("diacritics" "sensitive"), ("with" "thesaurus" "(" ...), ("without" "thesaurus"), ("without" "stop" "words"), ("with" "default" "stop" "words" ...).]
The following symbols are used only in the definition of
terminal symbols; they are not terminal symbols in the
grammar of
The following table describes the full-text components of
the
Component | Default initial value | Can be overwritten or augmented by implementation? | Can be overwritten or augmented by a query? | Scope | Consistency rules |
---|---|---|---|---|---|
FTCaseOption | case insensitive | overwriteable | overwriteable by prolog | lexical | Value must be case insensitive, case sensitive, lowercase, or uppercase. |
FTDiacriticsOption | diacritics insensitive | overwriteable | overwriteable by prolog | lexical | Value must be diacritics insensitive or diacritics sensitive. |
FTStemOption | without stemming | overwriteable | overwriteable by prolog | lexical | Value must be without stemming or with stemming. |
FTThesaurusOption | without thesaurus | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known thesauri. |
Statically known thesauri | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a thesaurus list. |
FTStopWordOption | without stop words | overwriteable | overwriteable by prolog (refer to default to augment) | lexical | Value must be part of the statically known stop word lists. |
Statically known stop word lists | none | augmentable | cannot be augmented or overwritten by prolog | module | Each URI uniquely identifies a stop word list. |
FTLanguageOption |  | overwriteable | overwriteable by prolog | lexical | Value must be castable to xs:language. |
Statically known languages | none | augmentable | cannot be augmented or overwritten by prolog | module | Each string uniquely identifies a language. |
FTWildCardOption | without wildcards | no | overwriteable by prolog | lexical | Value must be without wildcards or with wildcards. |
An implementation that does not support the FTMildNot operator must raise a static error if a full-text query contains a mild not.
An implementation that enforces one of the restrictions on FTUnaryNot must raise a static error if a full-text query does not obey the restriction.
An implementation that does not support one or more of the choices on FTUnit and FTBigUnit must raise a static error if a full-text query contains one of those choices.
An implementation that does not support the FTScope operator must raise a static error if a full-text query contains a scope.
An implementation that does not support the FTTimes operator must raise a static error if a full-text query contains a times.
An implementation that restricts the use of FTStopWordOption must raise a static error if a full-text query contains a stop word option that does not meet the restriction.
An implementation that restricts the use of FTIgnoreOption must raise a static error if a full-text query contains an ignore option that does not meet the restriction.
It is a static error if, during the static analysis phase, the query is found to contain a stop word option that refers to a stop word list that is not found in the statically known stop word lists.
It may be a static error if, during the static analysis phase, the query is found to contain a language identifier in a language option that the implementation does not support. The implementation may choose not to raise this error and instead provide some other implementation-defined behavior.
It is a static error if, during the static analysis phase, an expression is found to use an FTOrder operator that does not appear directly succeeding an FTWindow or an FTDistance operator and the implementation enforces this restriction.
An implementation may restrict the use of FTWindow and FTDistance to an FTOr that is either a single FTWords or a combination of FTWords involving only the operators && and ||. It is a static error if, during the static analysis phase, an expression is found that violates this restriction and the implementation enforces this restriction.
An implementation that does not support the FTContent operator must raise a static error if a full-text query contains one.
It is a static error if, during the static analysis phase, an implementation that restricts the use of FTLanguageOption to a single language encounters more than one distinct language option.
An implementation may constrain the form of the expression used to compute scores. It is a static error if, during the static analysis phase, such an implementation encounters a scoring expression that does not meet the restriction.
It is a static error if, during the static analysis phase, an implementation that restricts the choices of FTCaseOption encounters the "lowercase" or "uppercase" option.
It is a dynamic error if an implementation that does not support negative weights encounters a weight expression that does not meet the restriction.
It is a dynamic error if an implementation encounters a mild not
selection, one of whose operands evaluates to an
It is a type error if, during the static analysis phase,
an expression is found to have a static type
that is not appropriate for the context in which the expression occurs, or during the
dynamic evaluation phase, the dynamic type of a value does not match a required type as
specified by the matching rules in
It is a dynamic error if, in a function invocation, the argument corresponding to the specified function's collation parameter does not identify a supported collation.
The XML Schema specified in this appendix accomplishes integration by importing
the XML Schema defined for XQueryX in
The semantics of a Full Text XQueryX document are determined by the
semantics of the XQuery Full Text expression that
results from transforming the XQueryX document into XQuery Full Text
syntax using the XSLT stylesheet that appears in
section
The XML Schema that defines the complex types and elements for XQueryX in support of XQuery and XPath Full Text 1.0, including the ftContainsExpr, incorporates a second XML Schema that defines types and elements to support the ftMatchOption. Both XML Schemas are defined in this section.
The XSLT stylesheet that defines the semantics of XQueryX
in support of XQuery and XPath Full Text 1.0 integrates seamlessly with the
XQueryX XSLT stylesheet defined in
The following example is based on the data and queries of one of the use cases
in
Comparison of the results of the Full Text XQueryX-to-XQuery Full Text
transformation given in this document with the XQuery Full Text solutions
in the
The XQuery Full Text Use Cases solution given for the example is provided only to assist readers of this document in understanding the Full Text XQueryX solution. There is no intent to imply that this document specifies a "compilation" or "transformation" of XQuery Full Text syntax into Full Text XQueryX syntax.
In the following example, note that path expressions are expanded to show their structure. Also, note that the prefix syntax for binary operators like "and" makes the precedence explicit. In general, humans find it easier to read an XML representation that does not expand path expressions, but it is less convenient for programmatic representation and manipulation. XQueryX is designed as a language that is convenient for production and modification by software, and not as a convenient syntax for humans to read and write.
Finally, please note that white space, including new lines, has been added to some of the Full Text XQueryX documents and XQuery Full Text expressions for readability. That additional white space is not necessarily produced by the Full Text XQueryX-to-XQuery Full Text transformation.
Here is Q4 from the
Application of the stylesheet in
We would like to thank the members of the XQuery and XPath Full-Text group for their fruitful discussions.
We would like to thank the following people for their contributions on earlier drafts of this document.
Andrew Cencini, Microsoft - acencini@microsoft.com
Andrew Eisenberg, IBM - andrew.eisenberg@us.ibm.com
Nimish Khanolkar, Microsoft - nimishk@exchange.microsoft.com
Ashok Malhotra, Oracle - ashok.malhotra@oracle.com
Tapas Nayak, Microsoft - tapasnay@exchange.microsoft.com
Roland Seiffert, IBM - seiffert@de.ibm.com
This appendix provides a summary of features defined in this specification
whose effect is explicitly
Tokenization, including the definition of the term "tokens",
A phrase is an ordered sequence of any number of tokens. Beyond that, phrases
are
A sentence is an ordered sequence of any number of tokens. Beyond that,
sentences are
A paragraph is an ordered sequence of any number of tokens. Beyond that,
paragraphs are
Implementations are free to provide
How text with wildcard indicators and qualifiers is tokenized is
The set of expressions (of form ExprSingle) that can be assigned to a
score variable in a let-clause is
The
It is
It is
The behavior of the implementation when it encounters a combination of
thesauri, levels, and relationships that it does not support is
When the option "with default stop words" is used, an
When a stop word is specified in a query, then the number of tokens in the text that are matched by that stop word is
The "language" option influences tokenization, stemming, and stop
words in an
The set of valid language identifiers is
The behavior of the implementation when it encounters a language
identifier it does not support is
Certain values in the static context (see
Which namespace URIs will be recognized for denoting extension
selection pragmas is
Which namespace URIs will be recognized for denoting extension
options is
The conditions under which tokenization of two equal items produces
different tokens is
The restrictions on allowable expressions used to compute scores are
Sihem Amer-Yahia | 2005-04-08 | Updated case matrix | Updated case matrix row "sensitive", column "CCI" from "case-insensitive variant of CCI if it exists, else error" to "case-sensitive variant of CCI if it exists, else error". |
Sihem Amer-Yahia | 2005-05-02 | Closed issues with no changes | Closed Cluster B, Issue 28 IGNORE Syntax with no change to the document. Closed Cluster B, Issue 50 IGNORE Queries with no change to the document. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTTimes syntax | Closed Cluster G, Issue 14 FTTimesSelection and added a related bullet item in Section 3. |
Sihem Amer-Yahia | 2005-05-02 | Updated FTWildCard syntax | Updated FTWildCardOption in Section 3. |
Sihem Amer-Yahia | 2005-05-03 | Updated introduction | Replaced "semantic element" with "semantic markup" and "tag" with "element" in the introduction. |
Sihem Amer-Yahia | 2005-05-03 | Added issue on error codes | Added Cluster J, Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-05-03 | Closed issues with no change | Closed Cluster A, Issue 54 Weight Granularity in Scoring with same resolution as for Cluster A, Issue 5 Score Weighting, no further change to document. Closed Cluster H, Issue 9 Window with no change to the document. Closed Cluster H, Issue 19 FTScopeSelection on structure with no change to the document. Closed Cluster E, Issue 25 MatchOption Syntax with no change to the document. Closed Cluster H, Issue 44 FTContains Semantics with no change to the document. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTContent syntax | Updated FTContent adding "entire content", Closed Cluster C, Issue 39 Exact Element Content. |
Sihem Amer-Yahia | 2005-05-03 | Closed issue on Boolean Naming | Closed Cluster F, Issue 38 Boolean Naming. Changes to the document are pending awaiting a decision on whether it is OK to use "and", "or", "not" for full text. If so change existing symbols to "and", "or", "not". If not change existing symbols to "ftand", "ftor", "ftnot". |
Chavdar Botev | 2005-05-03 | Updated FTDistance semantics | Updated the semantics for distance. |
Sihem Amer-Yahia | 2005-05-03 | Updated FTRange syntax | Made "exactly" required before an exact number in FTRange. Closed Cluster F, Issue 43 Exactly in FTRangeSpec. |
Sihem Amer-Yahia | 2005-05-04 | Closed issue on collations | Closed Cluster D, Issue 57 Collations Match Option. |
Jochen Doerre | 2005-05-19 | Added issue on scoring | Added Cluster A, Issue 60 Extended Scoring. |
Chavdar Botev | 2005-06-29 | Added issue on FTNegation | Added Cluster G, Issue 62 Precise semantics of double negation. |
Chavdar Botev | 2005-06-29 | Added issue on FTTimes | Added Cluster G, Issue 61 Desired semantics of FTTimes. |
Sihem Amer-Yahia | 2005-07-11 | Updated FTMildNegation syntax | Updated the mild not syntax from "mild not" to "not in". Closed Cluster I, Issue 10 MildNot and Cluster F, Issue 41 Mildnot Naming. |
Chavdar Botev | 2005-07-12 | Updated FTIgnore semantics | Changed semantics of FTIgnoreOption. |
Sihem Amer-Yahia | 2005-07-18 | Corrected error codes | Corrected and added error codes, closing and implementing the resolution for Cluster J Issue 59 Error Codes. |
Sihem Amer-Yahia | 2005-07-18 | Closed issues with no changes | Closed Cluster I, Issue 13 "loose-grammar" leaving the grammar as it is. Closed Cluster D, Issue 53 "matchoptions-default" with no change to the document. Closed Cluster H, Issue 58 "ft-about-operator" with no change to the document. |
Sihem Amer-Yahia | 2005-07-21 | Updated score syntax | Closed Cluster A, Issue 60 "new-scoring-proposal" and Issue 2 "scoring-values" and updated Section 2.2 Score Clause to reflect new score syntaxes. There are now syntaxes for scored queries 1) returning the same results as queries with Boolean predicates and 2) for returning more or fewer results. |
Sihem Amer-Yahia | 2005-07-21 | Added appendix for defaults | Added appendix for defaults in the query prolog analogous to C.1 in the XQuery language document. |
Sihem Amer-Yahia | 2005-07-21 | Updated FTThesaurus section | Aligned description in Section 3.2.4 FTThesaurusOption with current grammar. |
Sihem Amer-Yahia | 2005-07-21 | Opened and closed issue on nested FTNegation | Opened and closed Cluster I, Issue 65 Nested FTNegations on the right side of an FTMildNegation. |
Chavdar Botev | 2005-07-25 | Updated FTMildNegation semantics | Changed the semantics of MildNot. |
Sihem Amer-Yahia | 2005-08-10 | Added Change Log | Added Change Log harvesting back entries from CVS change log. |
Jochen Doerre | 2005-08-17 | Grammar changes | Changed XQuery/XPath grammar for new scoring syntax (resolution of Issue 60), for match option defaults in query prolog (resolution of Issue 45), for simplified window operator (resolution to Issue 51), renamed "mild not" to "not in" (resolution of Issue 41), modified FTThesaurusOption, FTStopwordOption and FTLanguageOption to require StringLiterals as decided in May 05 F2F. |
Jochen Doerre | 2005-08-17 | Changes to Section 2 | Introduced new scoring syntax; rewrote most of 2.2. Corrected use of weights in 2.2.1 (wrong default, wrong use of 1.5). |
Jochen Doerre | 2005-08-17 | Changes to Section 3 | Adapted the explanations to the changed syntax for FTWindow, FTThesaurusOption, FTStopwordOption and FTLanguageOption. Also corrected a couple of example explanations. Removed FTIgnoreOption from the list of match option defaults in 3.2. Corrected explanation and example of FTLanguageOption (neither diacritics nor case are language-specific!). Commented out last two examples of FTDistance, because distance 15 does not work for phrases. |
Jochen Doerre | 2005-08-17 | Appendices A+B | Adapted introductory comment about which version of the XQuery/XPath grammars we are aligned to. |
Jochen Doerre | 2005-08-17 | Dates in Header | Adapted current date and previous date and links in full-text-query-language-semantics.xml and in tqheader.xml. |
Jochen Doerre | 2005-08-19 | Added Section 2.3, Changes in 3+4 | Added Section 2.3 Extension to Static Context. Changed Sections 3.2 and 4.4.1.1 to refer to match option settings in the static context. |
Jochen Doerre | 2005-08-19 | Added Issue 63 | Added Cluster G Issue 63: Distance constraints do not work on phrases. |
Jochen Doerre | 2005-08-19 | Changes in Section 4 | Adapted semantics to new scoring feature (resolution of Issue 60), changed FTWindow semantics according to resolution of Issue 51, and cleaned examples. |
Jochen Doerre | 2005-08-19 | Appendix G | Added lines for statically known thesauri and stop lists. |
Jochen Doerre | 2005-08-25 | Added Issue 64 | Added Cluster E Issue 64: System Relative Operator Defaults (using wording proposed by Pat Case). |
Jochen Doerre | 2005-10-10 | Changes in Section 3 | Rephrased Section 3.2.7 FTIgnoreOption. Explanation and example adapted to simple (non-recursive) use of "ignore". |
Jochen Doerre | 2005-10-10 | Changes in Section 4 | Incorporated Section 4.3.1.4 Match and AllMatches Normal Form. |
Sihem Amer-Yahia | 2005-10-12 | Incorporated comments | Incorporated Pat's comments at https://lists.w3.org/Archives/Member/member-query-fttf/2005Sep/0068.html |
Jim Melton | 2005-10-20 | Changes in Sections 3 and 4 | Properly marked up errors and inserted error summary appendix. Re-ordered appendices so normative appendices precede non-normative appendices. |
Jochen Doerre | 2005-10-24 | Final edits | Included corrections to examples in Section 3. Changed meaning of distance 0 for sentences (paragraphs) to mean adjacent. Reworked Appendix H Checklist of Implementation-Defined Features. Added resolution texts to issues 45, 59, and 62. |
Jochen Doerre | 2005-11-28 | Restrict FTTimes to FTWords | Modified EBNF syntax to allow the FTTimes operation to be applicable only to simple FTWords. |
Jochen Doerre | 2005-11-28 | Re: Bug 2299: Changes to Section 4 | The AllMatches model has been changed to allow the TokenInfo of a StringMatch to represent an interval of token positions, instead of single positions. Thus, a phrase is now modeled using a single StringMatch, and consequently distance constraints (which always apply to the individual StringMatches) can be used to constrain the entire phrase. In addition, this change allows overlapping tokens to be modeled. The semantics functions for FTOrder (order now constrains the start positions of tokens), for FTScope, for FTDistance (a distance constraint requires a certain number of positions between the end of one token and the start of the next) and for FTWindow have been adapted. |
Jochen Doerre | 2006-01-09 | Issues List removed | Dropped Appendix I "Issues List", as issues are tracked in Bugzilla now. |
Mary Holstege | 2006-02-01 | Static context | Added known languages to static context. |
Jochen Doerre | 2006-03-06 | Bug 2776 | Changed EBNF grammar to allow weights to be specified using RangeExpr. |
Mary Holstege | 2006-03-30 | Updated Tokenization 4.2.7 | Expanded and clarified definition. Added examples. |
Pat Case | 2006-04-13 | Replaced glossary | Removed glossary copied from the XQuery language document and inserted coding to produce a full-text glossary. |
Jochen Doerre | 2006-04-24 | Section 2 | Added new Processing Model section. |
Jochen Doerre | 2006-04-25 | Section 4 | Included the completely revised semantics schemata and functions, which now (i) correctly handle interval-based TokenInfos, (ii) separate the representation of TokenInfos and SearchTokenInfos and SearchItems, (iii) have been simplified regarding the semantics of match options by no longer separating the implementation-defined matching function from (most of) the implementation-defined application of match options, and (iv) have been type- and syntax-checked. |
Mary Holstege | 2006-05-31 | Bug 2483 | Clarified type constraints on full-text operator parameters in Section 3. Revised EBNF to be more specific in some cases. |
Jochen Doerre | 2006-08-04 | Bug 3374 | Revised complete example in Section 4.3.3. |
Jim Melton | 2006-08-17 | Added XQueryX support | Added new normative appendix defining the XML schemas and XSLT stylesheet necessary for XQuery and XPath Full Text 1.0 to integrate into XQueryX. |
Jochen Doerre | 2006-08-21 | Bug 3439 | Fixed FTMildNot semantics. |
Mary Holstege | 2006-08-22 | Conformance | Added new conformance section as section 5. Added error code definitions to appendix D. |
Mary Holstege | 2006-08-22 | FTWords | Fixed wording of FTWords with respect to type constraints. |
Mary Holstege | 2006-10-05 | Score Variables | Added more complex scoring examples as clarification for bug #3596. |
Mary Holstege | 2006-10-05 | FTSelection | Improved reading flow for examples. Made linkage of non-terminals consistent. |
Mary Holstege | 2006-11-01 | Overall | Reorganized structure of document to improve reading flow. |
Jim Melton | 2006-12-26 | FTLanguageOption | Revised text dealing with FTLanguageOption values that do not identify a known, defined language in RFC 3066. Added reference to RFC 4646. |
Jim Melton | 2006-12-26 | FTLanguageOption and FTContainsExpr | Added text saying that a full-text processor SHOULD use xml:lang information when choosing collations and when processing FTMatchOptions. Also added text saying that an xml:lang specification SHOULD take precedence over an FTLanguageOption specification. |
Jim Melton | 2006-12-26 | Tokenization | Made changes clarifying that tokenization SHOULD be implementation-defined (implicitly permitting it to be implementation-dependent). |
Jochen Doerre | 2007-01-22 | Definitions for implementation-defined/ -dependent. | Added definitions for implementation-defined/dependent to Introduction as in XQuery document. Added links throughout the paper. |
Jochen Doerre | 2007-02-17 | Bug 3698 | Removed options "with diacritics", "without diacritics". |
Jochen Doerre | 2007-02-17 | Bug 3914 | Changed syntax of Booleans to "ftand", "ftor", "ftnot". |
Jochen Doerre | 2007-02-17 | Bug 3920 | Changed 3rd example in 3.3.7 FTDistance and added a 4th. |
Jim Melton | 2007-02-25 | Bug 3935 | Added text to define how wildcard characters can be escaped so they can be used in a search. |
Pat Case | 2007-02-26 | Itemized sample tokens in 3 FTSelections | To resolve Bug 3913, added a sentence itemizing the first 5 tokens in the sample tokenization. |
Pat Case | 2007-02-26 | Corrected example in 3.3.7 FTDistance | To resolve Bug 3920, corrected the first example and preceding text in 3.3.7 FTDistance to remove the "not in" operator and to use terms from the sample data. |
Pat Case | 2007-02-26 | Inserted sentence into 3.2.6 FTLanguageOption | To resolve Bug 3926, inserted sentence into 3.2.6 FTLanguageOption saying that the "language" option MAY influence the behavior of other match options. |
Pat Case | 2007-02-26 | Inserted a sentence into 3.2.5 FTStopWordOption | To resolve Bug 3930, inserted a sentence into 3.2.5 FTStopWordOption saying that "union" and "except" are applied from left to right. |
Pat Case | 2007-02-26 | Added a note to 3.2.5 FTStopWordOption | To resolve Bug 3932, added a note to 3.2.5 FTStopWordOption saying that stop word lists MAY be applied during indexing. If they are applied during indexing, asking for stop words not to be used during a query will have no effect. |
Pat Case | 2007-02-26 | Added a note to 3.4 FTIgnore | To resolve Bug 3936, added a note to 3.4 FTIgnore saying that nodes MAY be ignored during indexing and during query processing. The ignore option applies only to query processing. Whether and how indexing ignores nodes is out of scope for this specification. |
Jochen Doerre | 2007-02-26 | Bug 3924 | Changed grammar for match options: now precedence of match options is higher than Booleans. Included restriction to have at most one option of a group at a level. |
Jochen Doerre | 2007-02-27 | Bug 3910, 3924, 3928 | Reformulated what the case options mean. Added lower/uppercase as possible values for the case option to table in Appendix C (Static Context Components) and put rules and alternatives in the grammar into a more logical order. Also ordered tables and lists in the text the same. |
Jochen Doerre | 2007-03-02 | Bug 3737 | Reformulated and restructured most of section 3. Added explanation of the application structure of positional filters (formerly: FTProximities) and how match options take effect. Renamed the following grammar symbols: FTWordsSelection to FTPrimary, FTWordsMatches to FTPrimaryWithOptions, FTProximity to FTPosFilter. |
Mary Holstege | 2007-04-02 | Bugs 4345, 4355, 4358, 4445 | Reworked description of the wildcard option and added a new example. Added note on the effect when the lower bound of a range is greater than the upper bound. Fixed FTContent example to be "with wildcards". |
Jochen Doerre | 2007-04-09 | Bug 3939 | Added example for overlapping tokens in 4.1. |
Jochen Doerre | 2007-04-09 | Bug 3931 | Added match option application order, as agreed in FTTF-136. |
Mary Holstege | 2007-04-19 | Conformance | Made support for uppercase and lowercase FTCaseOptions optional. |
Mary Holstege | 2007-04-19 | Extensions | Added text to describe extension options and selections. |
Jochen Doerre | 2007-04-19 | Bug 4386 | And-selection description fixed in Sec. 3. |
Jochen Doerre | 2007-04-20 | Bugs 3898, 4388 | Finalized the additions needed to allow for nested FTDistance/FTWindow. |
Jochen Doerre | 2007-04-23 | Section 4 | Simplifications to the match option schemata and processing. |
Mary Holstege | 2007-04-25 | Schemas | Misc. editorial improvements to schemas. |
Pat Case | 2007-09-13 | Definition of a token | Refined the definition of a token. |
Pat Case | 2007-09-13 | Sections 1-2 | Made editorial changes throughout Sections 1-2. |
Mary Holstege | 2007-10-11 | Semantics | Clarified definition of tokenization; made fixes with respect to overlapping tokens. |
Mary Holstege | 2007-10-11 | Conformance | Reinstated lost conformance item on negative weights; fixed up constraints on scoring expressions. |
Pat Case | 2007-10-12 | Reorganized Section 1.1 | Reorganized Section 1.1, taking paragraphs out of the second ordered list, removing 2 sentences, reordering some of the paragraphs. |
Pat Case | 2007-10-12 | Tokenization | Consolidated the early, informal introduction to tokenization into Section 1.1, moving what was in 2.1 Processing Model to Section 1.1. Removed some text and added a forward reference to the formal definition and constraints in 4.1. |
Pat Case | 2007-10-13 | Using Weights | In 2.3.1. Using Weights, relabelled and reorganized the constraints pertaining to weights and scoring algorithms. |
Pat Case | 2007-10-15 | Processing Model | In 2.1 Processing Model, made step 2, the new step 4a. |
Jochen Doerre | 2007-11-09 | FTStopWords grammar and description | Renamed nonterminals: FTRefOrList to FTStopWords, FTInclExclStringLiteral to FTStopWordsInclExcl. Added negative stop words example: .../p ftcontains "propagating errors" with stop words ("few"). |
Jochen Doerre | 2007-11-12 | Chapter 3 | Adapted text where it assumed that tokens have unique positions. Now talks explicitly of covered token positions (in FTWords, FTContent). |
Jochen Doerre | 2007-11-13 | Chapter 3 | More explanation for 2nd example for anchoring selection "at end" (3.6.5). Bug 4717. |
Pat Case | 2007-12-04 | Title | Removed 1.0, 2.0, and hyphen from title and title references. |
Mary Holstege | 2008-01-24 | Misc. | Bug fixes: 4714, 4715/2, 4717, 4728, 5415. Replaced incorrect text in definition of FTWindow. Eliminated notion of "adjacent" and "consecutive" tokens; replaced with description in terms of token positions. Made definition of Ignore option consistent with formal semantics: no new context focus is generated. Added additional examples. Added informative reference to UAX29. Consistent usage of the term "query string" etc. |
Mary Holstege | 2008-01-24 | Grammar. | Moved ft-option to first part of prolog. |
Mary Holstege | 2008-02-28 | Semantics. | Clarified handling of overlapping tokens with respect to distance. |
Mary Holstege | 2008-03-17 | Semantics. | Minor fixes to function definitions to resolve issues: 5572, 5573, 5574, and 5575. |
Mary Holstege | 2008-04-22 | Naming conventions. | Used the casing "StopWord" and "MildNot" consistently. |