This is a Public Working Draft for review by W3C Members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document describes how
XSLT 2.0 and XQuery 1.0 Serialization has been defined jointly by
the
This is a
Patent disclosures relevant to this specification may be found on the
XML Query Working Group's patent disclosure page at
This document defines serialization for the
See the CVS changelog.
This document defines serialization of the W3C XQuery 1.0 and XPath 2.0 Data Model,
which is the data model of at least
Serialization is the process of converting an instance of the
In this specification the words must, must not,
should, should not, may, required, and
recommended are to be interpreted as described in
The XQuery 1.0 and XPath 2.0 Data Model is richer and less
constrained than XML. There are valid instances of the data model
that have no direct analog in XML. In particular, data model
instances can contain typed values, sequences, and sequences of typed
values. And whereas XML deals only with documents
, data
model instances can have as their root any node type, simple value, or
sequence and may even be empty.
This section describes how to convert an arbitrary data model instance into one of several simplified forms. We then describe how these forms are serialized. This greatly simplifies the sections which follow. Implementations are not required to implement serialization of arbitrary data model instances in this way, provided that they produce the same results as this conceptual model.
If the data model instance contains any
xs:string
and replace the value with its
string representation. xs:string
, serialization of the data model is
undefined.
If adjacent strings occur in a sequence, replace both values with their concatenation separated by a single space.
If empty sequences occur, replace them with the empty string.
To complete the simplification, perform the following steps
If the data model instance has as its root an attribute or
namespace node,
If the data model instance has as its root
If the data model instance has as its root a sequence of document nodes, or a sequence which contains document nodes, replace each document node with its children in document order.
If the data model instance has as its root a string value, or a sequence which contains one or more string values, replace each string value with a text node that contains the same string.
If there are any remaining string values among the children of elements in the data model instance, replace them with text nodes that contain the same string values and merge adjacent text nodes.
An instance of the data model that is input to the serialization process is a sequence. The serialization process must first place that input sequence into a normalized form for serialization; it is the normalized sequence that is actually serialized. The normalized form for serialization is constructed by applying all of the following rules in order, with the initial sequence being input to the first step, and the sequence that results from any step being used as input to the subsequent step.
Replace an empty sequence with a zero-length string.
If the data model instance contains any atomic values, or sequences that contain atomic values, convert the atomic values to strings: obtain the lexical representation of each value by casting it to an xs:string and replace the value with its string representation. It is a serialization error if the value cannot be cast to xs:string.
Replace each run of adjacent strings in the sequence with a single string formed by concatenating the strings in order, separated by single spaces.
Replace any string in the sequence with a text node whose string value is equal to the string.
Replace any document node in the sequence with its children.
It is a serialization error if an item in the sequence is an attribute node or a namespace node. Otherwise, create a new document node and make all the items in the sequence, which are all nodes, children of that document node.
The tree rooted in the document node that is created by the final step of this normalization process is the instance of the data model to which the rules of the appropriate output method are applied. If the normalization process results in a serialization error, the processor must signal the error.
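The normalization steps above can be sketched in Python as follows. This is only an illustrative model, not part of the specification: the classes `Text`, `Element`, `Attribute` and `Document`, the helper `cast_to_string`, and the function `normalize_sequence` are hypothetical names, only a few atomic types are modelled, and namespace nodes are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Text:
    value: str


@dataclass
class Element:
    name: str
    children: List[object] = field(default_factory=list)


@dataclass
class Attribute:
    name: str
    value: str


@dataclass
class Document:
    children: List[object] = field(default_factory=list)


NODE_TYPES = (Text, Element, Attribute, Document)


class SerializationError(Exception):
    """Raised where the text above calls for a serialization error."""


def cast_to_string(value) -> str:
    # Stand-in for casting an atomic value to xs:string (assumption:
    # only booleans, integers and strings are modelled here).
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (str, int)):
        return str(value)
    raise SerializationError(f"cannot cast {value!r} to a string")


def normalize_sequence(seq) -> Document:
    # 1. An empty sequence becomes a zero-length string.
    items = list(seq) or [""]
    # 2. Atomic values are converted to strings.
    items = [i if isinstance(i, NODE_TYPES) else cast_to_string(i) for i in items]
    # 3. Adjacent strings are joined with a single space.
    merged = []
    for item in items:
        if isinstance(item, str) and merged and isinstance(merged[-1], str):
            merged[-1] = merged[-1] + " " + item
        else:
            merged.append(item)
    # 4. Each string becomes a text node.
    nodes = [Text(i) if isinstance(i, str) else i for i in merged]
    # 5. Each document node is replaced by its children.
    flattened = []
    for node in nodes:
        if isinstance(node, Document):
            flattened.extend(node.children)
        else:
            flattened.append(node)
    # 6. Attribute nodes are an error; otherwise wrap in a new document node.
    for node in flattened:
        if isinstance(node, Attribute):
            raise SerializationError("attribute node in sequence")
    return Document(flattened)
```

Under these assumptions, `normalize_sequence([1, "a", Element("e"), True])` yields a document node whose children are a text node, the element, and a second text node.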
$seq
is equivalent
to constructing a document node using the
or the XQuery expression:
This process will fail with certain sequences,
for example sequences containing parentless attribute and namespace
nodes, or atomic values such as xs:QName
and
xs:NOTATION
that cannot be cast to a
string.
There are a number of parameters that influence how serialization is performed. Host languages may allow users to specify any or all of these parameters, but they are not required to be able to do so.
The following serialization parameters are defined:
encoding
specifies the preferred character
encoding charset
registered with the Internet
Assigned Numbers Authority X-
If this parameter is not specified,
cdata-section-elements
specifies a list of the
names of elements whose text node children
If this parameter is not specified, no elements will be treated specially.
doctype-system
specifies the system identifier
to be used in the document type declaration
If this parameter is not specified,
doctype-public
.
doctype-public
specifies the public identifier
to be used in the document type declaration
If this parameter is not specified,
escape-uri-attributes
specifies whether the
processor yes
or no
.
If this parameter is not specified, the value is implementation defined.
include-content-type
specifies whether the serialization
process meta
element in HTML and XHTML
output. The value must be yes
or no
.
If this parameter is not specified, the value is implementation defined.
indent
specifies whether the processor may add
additional whitespace when outputting the data model; the value must
be yes
or no
If this parameter is not specified, the value is implementation defined.
media-type
specifies the media type (MIME
content type) charset
parameter
text
, a charset
parameter should be added
according to the character encoding actually used by the output
method
If this parameter is not specified, the media type is implementation defined.
normalize-unicode
yes
or
no
.
If this parameter is not specified, the value is implementation defined.
omit-xml-declaration
specifies whether the serialization
process encoding
parameter or the
standalone
parameteryes
or no
If this parameter is not specified, the value is implementation defined.
standalone
specifies whether the processor
yes
or no
If this parameter is not specified,
undeclare-namespaces
specifies whether namespaces,
yes
or no
.
If this parameter is not specified, the value is implementation defined.
This parameter only applies when the XML serialization method is used and the version is greater than 1.0.
use-character-maps
provides a list of character/string
pairs that are used in serialization (see
If this parameter is not specified, no character maps are used.
version
specifies the version of the output
method
If this parameter is not specified, the value is implementation defined.
The method
identifies the overall method that should
be used for serializing. The value of the method
parameter must be a valid QName. If the QName is in no namespace,
then it identifies a method specified in this document and must be one
of xml
, html
, xhtml
, or
text
The detailed semantics of each parameter will be described separately for each output method for which it is applicable. If the semantics of a parameter are not described for an output method, then it is not applicable to that output method.
Serialization can be regarded as involving four phases of processing, carried out sequentially as follows:
method
, doctype-system
,
doctype-public
, include-content-type
,
indent
, omit-xml-declaration
,
standalone
, and version
.
URI escaping (in the case of URI-valued attributes in the
HTML and XHTML output methods), as determined by the
escape-uri-attributes
parameter
Creation of CDATA sections, as determined by the
cdata-section-elements
parameter. Note that this is also
affected by the encoding
parameter, in that characters
not present in the selected encoding cannot be represented in a CDATA
section.
Character mapping, as determined by the
use-character-maps
parameter.
Escaping of special characters according to XML or HTML
rules, for example replacing <
by
&lt;
normalize-unicode
parameter. Unicode normalization is
applied to the character stream that results after all markup
generation and character expansion has taken place.
encoding
parameter. This converts the character stream
produced by the previous phases into a byte stream.
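The ordering of the later phases can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function names `escape_text` and `finish_serialization` are hypothetical, only three special characters are escaped, and NFC is assumed as the normalization form because the text above does not name one.

```python
import unicodedata


def escape_text(text: str) -> str:
    # Escaping of special characters; "&" is handled first so that the
    # ampersands introduced by the other replacements are not re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def finish_serialization(chars: str, encoding: str = "UTF-8",
                         normalize_unicode: bool = False) -> bytes:
    # Unicode normalization is applied to the character stream that results
    # after markup generation and character expansion (NFC is an assumption).
    if normalize_unicode:
        chars = unicodedata.normalize("NFC", chars)
    # Encoding converts the character stream into a byte stream.
    return chars.encode(encoding)
```

For instance, `finish_serialization(escape_text("a<b"))` escapes first and encodes last, so the byte stream contains the escaped form.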
The xml
output method outputs the
data model as an XML entity that
xml
output
method are described using the verb "should"; the processor might
not be able to meet the requirements of the xml
output method due to:
serialization errors;
specification of character mapping, as determined by the
use-character-maps
parameter, whose expansion results
in XML that is not well-formed; or
disabled output escaping, that results in XML that is not well-formed.
In all other circumstances, the serialized form
must comply with the requirements described for the xml
output method.
If the document node of the data model has a single element
node child and no text node children,
where entity-URI
is a URI for the entity, produces a
document which
In addition, the output
If the document was produced by adding a document wrapper, as
described above, then it will contain an extra doc
element as the document element.
The order of attribute and namespace nodes in the two trees may be different.
The base URIs of nodes in the two trees may be different.
The new tree may contain additional attributes and text nodes resulting from the expansion of default and fixed values in its DTD or schema.
The type annotations of the nodes in the two trees may be
different. Type annotations in a result tree are discarded when the
tree is serialized. Any new type annotations obtained by parsing the
document will
In order to permit such type annotations
to be available in a data model that results from processing a
serialized XML document, the process that creates the input data
model could create it so that the serialized form
uses mechanisms provided by xsi:type
and xsi:schemaLocation
attributes.
Additional namespace nodes may be present
in the new tree if the serialization process undeclared namespaces,
as described in
Additional nodes may be present in the new tree, and the values of attribute nodes and text nodes in the new tree may be different from those in the original tree, due to the character expansion phase of serialization.
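A consequence of this rule is that certain whitespace characters
are output as character references: a CR character in a text node
is output as &#xD; or an equivalent; while CR, NL, and TAB
characters in attribute nodes are output as &#xD;, &#xA;, and
&#x9; respectively, or their equivalents.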
For example, an attribute with the value "x" followed by "y"
separated by a newline will result in the output
"x&#xA;y"
(or with any equivalent character
reference). The XML output cannot be "x" followed by a literal newline
followed by a "y" because after parsing, the attribute value would be
"x y"
as a consequence of the XML attribute normalization
rules.
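The round-trip behaviour described here can be checked with a short Python sketch. The helper `escape_attribute` is a hypothetical name, and the standard-library `xml.etree.ElementTree` parser is used only to stand in for "an XML parser"; the sketch assumes attribute values are always delimited by double quotes.

```python
import xml.etree.ElementTree as ET


def escape_attribute(value: str) -> str:
    # Escape markup-significant characters first, then write CR, NL and TAB
    # as character references so that they survive attribute normalization.
    value = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    return value.replace("\r", "&#xD;").replace("\n", "&#xA;").replace("\t", "&#x9;")


original = "x\ny"

# Serialized with a character reference: the newline survives parsing.
escaped_doc = '<e a="' + escape_attribute(original) + '"/>'
round_tripped = ET.fromstring(escaped_doc).get("a")

# Serialized with a literal newline: the parser normalizes it to a space.
literal_doc = '<e a="' + original + '"/>'
normalized = ET.fromstring(literal_doc).get("a")
```

Here `round_tripped` equals the original value, whereas `normalized` contains a space where the newline was.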
To anticipate the proposed changes to end-of-line handling in XML 1.1, implementations may also output the characters x85 and x2028 as character references. This will not affect the way they are interpreted by an XML 1.0 parser.
It is a serialization error to request the output of a document
type declaration, or of a standalone
parameter, if the
data model contains text nodes or multiple element nodes as children
of the root node. The processor
standalone
parameter.
The result of serialization using the XML output method is not
guaranteed to be well-formed XML if character maps have been specified
(see
version
Parameter
The version
parameter specifies the version of XML to
be used for outputting the data model. If the processor does not
support this version of XML, it version
parameter
encoding
Parameter
The encoding
parameter specifies the preferred
encoding to use for outputting the data model. Processors are
required to respect values of UTF-8
and
UTF-16
. A serialization error occurs when an output
encoding other than UTF-8
or UTF-16
is
requested, if the implementation does not support that encoding. The
processor UTF-8
or UTF-16
instead. The processor must
not use an encoding whose name does not match the
encoding
parameter is specified, then the processor
UTF-8
or UTF-16
.
When outputting a newline character in the data model, the
implementation is free to represent it using any character sequence
that will be normalized to a newline character by an XML parser,
unless a specific mapping for the newline character is
provided in a character map: see
When outputting any other character that is defined in the
selected encoding, the character
It is possible that the data model will contain a character that
cannot be represented in the encoding that the processor is using for
output. In this case, if the character occurs in a context where XML
recognizes character references (that is, in the value of an attribute
node or text node), then the character
indent
Parameter
If the indent
parameter has the value
yes
, then the xml
output method may output
whitespace in addition to the whitespace in the data model (possibly
based on whitespace stripped from either the source document or the
stylesheet, indent
parameter has the value no
, it
xml
output method does output additional whitespace,
Whitespace characters must not be added adjacent to a text node that contains non-whitespace characters.
Whitespace may only be added adjacent to an element node, that is, immediately before a start tag or immediately after an end tag.
The new whitespace characters may replace existing whitespace characters in the same position, for example a tab may be inserted as a replacement for existing spaces. However, existing whitespace must not be removed without such a replacement.
Whitespace characters must not be inserted in a part of the
result document that is controlled by an
xml:space="preserve"
attribute.
The effect of these rules is to ensure that whitespace
<xsl:strip-space>
declaration could cause it to be removed, and
(b) it does not affect the string value of any element node with
simple content. It is usually not safe to indent document types that include elements
with mixed content.
cdata-section-elements
Parameter
The cdata-section-elements
parameter contains a list
of expanded-QNames. If the expanded-QName of the parent of a text node
is a member of the list, then the text node
If the text node contains the sequence of characters
]]>
, then the currently open CDATA section
]]
and a new CDATA section opened
before the >
.
If the text node contains characters that are not
representable in the character encoding being used to output the
data model, then the currently open CDATA section
CDATA sections cdata-section-elements
parameter, or by using some other
implementation-defined mechanism.
This is phrased to permit an implementor to provide an option that attempts to preserve CDATA sections present in the source document.
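The handling of the character sequence `]]>` inside a CDATA section can be sketched as follows. The function name `cdata_sections` is hypothetical, and the sketch covers only this one case; it assumes every character of the text is representable in the output encoding.

```python
import xml.etree.ElementTree as ET


def cdata_sections(text: str) -> str:
    # Close the open section after "]]" and open a new one before ">",
    # so that the terminator "]]>" never appears inside a section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


serialized = "<e>" + cdata_sections("a]]>b") + "</e>"
# Parsing the result gives back the original text node content.
reparsed = ET.fromstring(serialized).text
```

Splitting at this point keeps both halves of the terminator in the output while ensuring neither section ends early.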
omit-xml-declaration
Parameter
The xml
output method
omit-xml-declaration
parameter has the value
yes
. The XML declaration
standalone
parameter is specified, it
standalone
parameter. Otherwise, it
The omit-xml-declaration
parameter
standalone
parameter is present, or if the
encoding
parameter specifies a value other than UTF-8 or
UTF-16.
doctype-system
and doctype-public
Parameters
If the doctype-system
parameter is specified, the
xml
output method <!DOCTYPE
doctype-public
parameter is also specified, then the
xml
output method PUBLIC
followed by the public identifier and then the system identifier;
otherwise, it SYSTEM
followed by the system
identifier. The internal subset doctype-public
parameter doctype-system
parameter is specified.
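As a rough illustration of how the two parameters combine, the following Python sketch builds a document type declaration. It reflects one reading of the paragraph above: the function name `doctype_declaration` is hypothetical, the use of the document element's name after `<!DOCTYPE` is an assumption, and the identifiers are assumed not to contain double quotes.

```python
from typing import Optional


def doctype_declaration(root_name: str,
                        doctype_system: Optional[str] = None,
                        doctype_public: Optional[str] = None) -> str:
    # Without doctype-system no declaration is produced; doctype-public
    # on its own has no effect in this sketch.
    if doctype_system is None:
        return ""
    if doctype_public is not None:
        # PUBLIC, then the public identifier, then the system identifier.
        return f'<!DOCTYPE {root_name} PUBLIC "{doctype_public}" "{doctype_system}">'
    # Otherwise SYSTEM followed by the system identifier.
    return f'<!DOCTYPE {root_name} SYSTEM "{doctype_system}">'
```

The identifiers passed in any call are supplied by the user of the serializer; the values used to exercise the sketch are placeholders.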
undeclare-namespaces
Parameter
The Data Model allows an element to have fewer in-scope namespaces than
its parent. In XML 1.1, this can be represented most accurately by undeclaring
namespaces. If undeclare-namespaces
is "yes
" and
the output method is XML and the version
is greater than
Consider an element x:foo
with three in-scope
namespaces:
Suppose that it has a child element with two in-scope namespaces:
If namespace undeclaration is in effect, it will be serialized this way:
In XML 1.0, namespace undeclaration is not possible.
xml
and
the value of the version
parameter is 1.0, namespace
undeclaration is not performed, and the undeclare-namespaces
parameter is ignored.
The media-type
parameter is applicable for the
xml
output method.
The normalize-unicode
parameter is applicable for the
xml
output method.
The use-character-maps
parameter is applicable for the
xml
output method.
The xhtml
output method serializes the data model as
XML, using the HTML compatibility guidelines defined in the XHTML
specification.
It is entirely the responsibility of the
The serialization of the data model follows the same rules as for
the xml
output method, with the exceptions noted below.
These differences are based on the HTML compatibility guidelines
published in Appendix C of
Given an empty instance of an <p></p>
and not
<p />
.
Given an XHTML element whose content model is EMPTY, the serializer
<br />
, as the alternative syntax
<br></br>
allowed by XML gives uncertain
results in many existing user agents. The serializer
/>
, e.g.
<br />
, <hr />
and
<img src="karen.jpg" alt="Karen" />
.
The serializer should avoid outputting line breaks and multiple whitespace characters within attribute values. These are handled inconsistently by user agents.
The serializer '
which, although legal in XML and therefore in
XHTML, is not defined in HTML and is not recognized by all HTML user
agents.
The serializer should output namespace declarations
in a way that is consistent with the requirements of the XHTML DTD if this is
possible. The DTD requires the declaration
xmlns="http://www.w3.org/1999/xhtml"
to appear on the html
element, and only on the html
element.
The serializer must output namespace declarations that are consistent with
the namespace nodes present in the result tree, but it should avoid outputting
redundant namespace declarations on elements where the DTD would make them invalid.
Where the process used to construct
the input data model does not provide complete control over the prefix
used for an element name in the data model or control of whether the element is
in the default namespace (for instance, the XSLT namespace fixup process),
implementors are encouraged to provide means or endeavor to preserve the
obvious intent of a user to place the html
element
in the default namespace, wherever possible. For example, implementors
of XSLT processors are encouraged to place the html
element that results from a literal result element like the following in
the default namespace.
Although the specification of the namespace
fixup process provides no guarantees about the namespace prefixes that
are allocated,
implementors are encouraged to ensure that where possible,
writing the literal result element
<html xmlns="http://www.w3.org/1999/xhtml"> ... </html>
places the resulting html
element in the default namespace.
If head
element in
the XHTML namespaceinclude-content-type
parameter has the
value "no"
, the xhtml
output method
meta
element immediately after the start-tag of the
head
element specifying the character encoding actually
used.
For example,
The content type should be set to the value given for the
media-type
parameter; the default value for XHTML is
text/html
. The value application/xhtml+xml
,
registered in
If the data model includes a head
element that has a meta
element child, the processor should
replace any content
attribute of the meta
element, or add such an attribute, with the value as described above,
rather than output a new meta
element.
Unless the escape-uri-attributes
parameter
has the value no
, the xhtml
output
method
This escaping is deliberately confined to non-ASCII characters,
because escaping of ASCII characters is not always appropriate, for
example when URIs or URI fragments are interpreted locally by the HTML
user agent. Even in the case of non-ASCII characters, escaping can
sometimes cause problems. More precise control of URI escaping is
therefore available by setting escape-uri-attributes
to
no
, and controlling the escaping of URIs by means of the
As with the XML output method, the XHTML
output method outputs an XML declaration unless it is suppressed using
the omit-xml-declaration
parameter. Appendix C.1 of
The html
output method outputs the data model as
HTML.
For example,
The version
parameter indicates the version of the
HTML. The default value is 4.0
, which specifies that the
result
The html
output method
xml
output method unless the
expanded-QName of the element has a null namespace URI; an element
whose expanded-QName has a non-null namespace URI
span
. In particular:
If the result tree contains namespace nodes for namespaces other than the
XML namespace, the HTML output method xmlns
or xmlns:
If the result tree contains elements or attributes whose names have a
non-null namespace URI, the HTML output method
Where special rules are defined later in this section for
serializing specific HTML elements and attributes, these rules
When serializing an element whose name is not defined in the
HTML specification, but that is in the null namespace, the HTML output
method
span
element. The descendants of such
an element span
element.
When serializing an element whose name is in a non-null
namespace, the HTML output method div
element. The descendants of such an element
div
element.
The html
output method area
, base
, basefont
,
br
, col
, frame
,
hr
, img
, input
,
isindex
, link
, meta
and
param
. For example, an element written as
<br/>
or <br></br>
in an
XSLT stylesheet <br>
.
The html
output method br
, BR
or Br
br
element and output without an
end-tag.
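The treatment of empty elements can be sketched in Python. The set below copies the element names listed above; the function name `html_element` is hypothetical, and attributes are omitted to keep the sketch short.

```python
# Element names listed above as being output without an end-tag.
HTML_EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def html_element(name: str, content: str = "") -> str:
    # Names are compared without regard to case, so br, BR and Br
    # are all recognized as the HTML br element.
    if name.lower() in HTML_EMPTY_ELEMENTS:
        return f"<{name}>"
    return f"<{name}>{content}</{name}>"
```

With this sketch an empty `br` element is written as a start-tag only, while an element such as `p` keeps both tags even when it has no content.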
The html
output method script
and style
elements.
For example, script
element
created by an XQuery direct element constructor or an XSLT
or
A common requirement is to output a script
element
as shown in the example below:
This is illegal HTML, for the reasons explained in section B.3.2 of the HTML 4.01 specification. Nevertheless, it is possible to output this fragment, using either of the following constructs:
Firstly, by use of script
element
created by an XQuery direct element constructor or an
XSLT
Secondly, by constructing the markup from ordinary text characters:
As the HTML specification points out, the correct way to write this is to use the escape conventions for the specific scripting language. For JavaScript, it can be written as:
The HTML 4.01 specification also shows examples of how to write this in various other scripting languages. The escaping must be done manually; it will not be done by the serializer.
The html
output method
<
" characters occurring in attribute values.
If the indent
parameter has the value
yes
, then the html
output method may add or
remove whitespace as it outputs the data model, so long as it does
not change how an HTML user agent would render the output.
Unless the escape-uri-attributes
parameter is present
and has the value no
, the html
output method
This escaping is deliberately confined to non-ASCII characters,
because escaping of ASCII characters is not always appropriate, for
example when URIs or URI fragments are interpreted locally by the HTML
user agent. Even in the case of non-ASCII characters, escaping can
sometimes cause problems. More precise control of URI escaping is
therefore available by setting escape-uri-attributes
to
no
, and controlling the escaping of URIs by means of the
The html
output method
For example, a start-tag
The html
output method &
character occurring in an attribute value
immediately followed by a {
character (see
For example, a start-tag
If the indent
parameter has the value
yes
, then the html
output method may add or
remove whitespace as it outputs the result tree, so long as it does
not change the way that a conforming HTML user agent would render the output. The
default value is yes
.
This rule can be satisfied by observing the following constraints:
Whitespace must only be added before or after an element, or adjacent to an existing whitespace character.
Whitespace must not be added or removed adjacent to an inline element.
The inline elements are those included in the %inline
category INS
and
DEL
elements if they are used as inline elements
(i.e., if they do not contain element children).
Whitespace must not be added or removed inside a formatted element,
the formatted elements being pre
, script
,
style
, and textarea
.
Note that the HTML definition of whitespace is different from the XML definition: see section 9.1 of the HTML 4.01 specification.
The html
output method may output a character using a
character entity reference in preference to using a numeric character
reference, if an entity is defined for the character in the version of
HTML that the output method is using. Entity references and character
references should be used only where the character is not present in
the selected encoding, or where the visual representation of the
character is unclear (as with
&nbsp;, for
example).
When outputting a sequence of whitespace characters in the data
model, within an element where whitespace is treated normally,
pre
and
textarea
)html
output method
Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. It is an error to use the HTML output method when such characters appear in the data model. The processor may signal the error, but is not required to do so. If it does not signal the error, it may copy the offending characters into the serialized output, creating invalid HTML.
The html
output method >
rather than
?>
.
The encoding
parameter specifies the preferred
encoding to be used. If there is a HEAD
element, then
unless the include-content-type
parameter is present and
has the value "no"
, the html
output method
META
element
immediately after the start-tag
of the HEAD
element specifying the character encoding
actually used.
For example,
The content type media-type
parameter; the default value is
text/html
.
If the data model includes a head
element that has a meta
element child, the processor should
replace any content
attribute of the meta
element, or add such an attribute, with the value as described above,
rather than output a new meta
element.
It is possible that the data model will contain a character that
cannot be represented in the encoding that the processor is using for
output. In this case, if the character occurs in a context where HTML
recognizes character references, then the character script
or
style
element or in a comment), the processor
If the doctype-public
or doctype-system
parameters are specified, then the html
output method
<!DOCTYPE
HTML
or html
. If the
doctype-public
parameter is specified, then the output
method PUBLIC
followed by the specified
public identifier; if the doctype-system
parameter is
also specified, it doctype-system
parameter is specified but the doctype-public
parameter
is not specified, then the output method
SYSTEM
followed by the specified system identifier.
The media-type
parameter is applicable for the
html
output method.
The normalize-unicode
parameter is applicable for the
html
output method.
The use-character-maps
parameter is applicable for the
html
output method.
The text
output method outputs the data model by
outputting the string-value of every text node in the data model in
document order without any escaping.
A newline character in the data model may be output using any character sequence that is conventionally used to represent a line ending in the chosen system environment.
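A minimal Python sketch of this behaviour is shown below. The function name `text_output` and its `newline` parameter are hypothetical, and the standard-library `xml.etree.ElementTree` module is used only as a convenient stand-in for a data model instance.

```python
import xml.etree.ElementTree as ET


def text_output(xml_source: str, newline: str = "\n") -> str:
    root = ET.fromstring(xml_source)
    # itertext() yields the text content in document order; the strings
    # are joined as they are, with no escaping of "<" or "&".
    text = "".join(root.itertext())
    # A newline may be written using the line ending of the environment.
    return text.replace("\n", newline)
```

Calling `text_output` with `newline="\r\n"` writes each newline in the data as a CR LF pair, which is one conventional choice of line ending.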
The media-type
parameter is applicable for the
text
output method.
The encoding
parameter identifies the encoding that
the text
output method
encoding
parameter.
The default encoding for the text
output method is
implementation-defined.
The normalize-unicode
parameter is applicable for the
text
output method.
The use-character-maps
parameter is applicable for the
text
output method.
The use-character-maps
parameter is a list of characters
and corresponding string substitutions.
Character maps allow a specific character appearing in a text or attribute node in the data model to be substituted by a specified string of characters during serialization. The string that is substituted is output "as is", and the serializer performs no checks that the resulting document is well-formed. This mechanism can therefore be used to introduce arbitrary markup in the serialized output.
Character mapping is applied to the characters that actually appear in a text or attribute node in the data model, before any other serialization operations such as escaping or Unicode normalization are applied. If a character is mapped, then it is not subjected to XML or HTML escaping, nor to Unicode normalization. The string that is substituted for a character is not validated or processed in any way by the serializer, except for translation into the target encoding. In particular, it is not subjected to XML or HTML escaping, it is not subjected to Unicode normalization, and it is not subjected to further character mapping. If the string cannot be represented using the target encoding, the serializer takes the same action as it would if the offending characters appeared directly in the data model.
Character mapping is not applied to characters in text nodes whose
parent elements are listed in the cdata-section-elements
parameter, nor to characters in attribute
values that are subject to the URI escaping defined for the HTML and
XHTML output methods, unless URI escaping has been disabled using the
escape-uri-attributes
parameter in the output
definition.
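The interaction between character mapping and escaping can be sketched in Python. This is an illustration only: the function name `serialize_text` is hypothetical, the character map is modelled as a dictionary from single characters to replacement strings, and only text node escaping is shown.

```python
from typing import Dict


def serialize_text(text: str, character_map: Dict[str, str]) -> str:
    out = []
    for ch in text:
        if ch in character_map:
            # A mapped character is replaced by its string "as is":
            # the replacement is neither escaped nor mapped again.
            out.append(character_map[ch])
        elif ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        else:
            out.append(ch)
    return "".join(out)
```

Because the replacement strings are not checked, a map such as `{"«": "<%", "»": "%>"}` introduces markup that the serializer would otherwise have escaped, and the result need not be well-formed.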
On serialization, occurrences of a character specified in the
use-character-maps
in text nodes and attribute values
are replaced by the corresponding string from the use-character-maps
parameter.
Using a character map can result in non-well-formed documents if the string contains XML-significant characters. For example, it is possible to create documents containing unmatched start and end tags, references to entities that are not declared, or attributes that contain tags or unescaped quotation marks.
Character mapping is applied to the characters that actually appear in a text or attribute node in the data model, before any other serialization operations such as escaping or Unicode normalization are applied.
Character mapping is not applied to characters for which output
escaping has been disabled (disabling output escaping is an cdata-section-elements
parameter,
nor to characters in attribute values that are
subject to the URI escaping defined for the HTML and XHTML output
methods, unless URI escaping has been disabled using the
escape-uri-attributes
parameter.
If a character is mapped, then it is not subjected to XML or HTML escaping.
A serialization error occurs if character mapping causes the output
of a string containing a character that cannot be represented in the
encoding that the processor is using for output. The processor