Canonical XML
Version 1.0
W3C Working Draft 19 January 2000
- This version:
- https://www.w3.org/TR/2000/WD-xml-c14n-20000119
- Latest version:
- https://www.w3.org/TR/xml-c14n
- Previous versions:
- https://www.w3.org/TR/1999/WD-xml-c14n-19991115
- https://www.w3.org/TR/1999/WD-xml-c14n-19991109
- Editors:
- Tim Bray <tbray@textuality.com>
- James Clark <jjc@jclark.com>
- James Tauber <jtauber@jtauber.com>
- John Cowan <jcowan@reutershealth.com>
Copyright ©2000 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
Abstract
This document describes a subset of the information contained in an XML document and a syntax for expressing that subset. This syntax, called Canonical XML, is designed to encode the logical structure of XML documents; two XML documents whose Canonical-XML form is identical will be considered equivalent for the purposes of many applications.
Status of this document
The XML Core Working Group, with this 19 January 2000 Infoset Last Call working draft, invites comment on this specification. The Last Call period ends on 22 February 2000.
The W3C Membership and other interested parties are invited to review the specification and report implementation experience. Please send comments to www-xml-canonicalization-comments@w3.org (archive).
Note: The XML Core Working Group strongly solicits commentary, especially from early implementors of this Working Draft, on the appropriateness of the requirement that Canonical XML be in W3C normalized text form as well. The Working Group has published a minority report on this question at https://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0000.html. A rationale for the majority viewpoint embodied in this draft has been published at https://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0001.html.
For background on this work, please see the XML Activity Statement. While we welcome implementation experience reports, the XML Core Working Group will not allow early implementation to constrain its ability to make changes to this specification prior to final release.
A list of current W3C Recommendations and other technical documents can be found at https://www.w3.org/TR.
Table of contents
1 Introduction
2 Information Included in Canonical XML
2.1 The Document Information Item
2.2 Element Information Items
2.3 Attribute Information Items
2.4 Processing Instruction Information Item
2.5 Reference to Skipped Entity Information Items
2.6 Character Information Items
2.7 Comment Information Items
2.8 Document Type Declaration Information Items
2.9 Entity Information Items
2.10 Notation Information Items
2.11 Entity Start Marker Information Items
2.12 Entity End Marker Information Items
2.13 CDATA Start Marker Information Items
2.14 CDATA End Marker Information Items
2.15 Namespace Declaration Information Items
3 Document Type Definition Processing
4 Entity and Reference Processing
5 The Syntax of Canonical XML
5.1 Character Encoding
5.2 Character Escaping
5.3 Prolog
5.4 Epilog
5.5 Elements
5.6 Tags
5.7 Attributes
5.8 Processing Instructions
5.9 Namespaces
Appendices
A References
B Acknowledgements (Non-Normative)
1 Introduction
The XML 1.0 Recommendation [XML] describes the syntax of a class of resources called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in their physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding. This means that much equivalence testing of XML documents cannot be done at the byte-comparison level. This Canonical XML specification aims to introduce a notion of equivalence between XML documents which can be tested at the syntactic level and, in particular, by byte-for-byte comparison. In the syntax it describes, logically equivalent documents are byte-for-byte identical.
The syntax described in this specification is called Canonical XML. XML documents may be transformed into Canonical XML (with potentially some information loss) - the result of this transformation is described as the canonical form of the original document. Canonical XML is XML - that is to say, the canonical form of any XML document is an XML document.
There are two essential aspects to the specification of Canonical XML:
- Which information from an XML document is included in its canonical form (and which is not).
- How information is expressed in Canonical XML.
2 Information Included in Canonical XML
For the purposes of this specification, the information in an XML document is that described by the XML Information Set Specification [Infoset]. The canonical form of an XML document, which is itself an XML document, also has an information set. This section describes what portion of an XML document's information set is included in that of its canonical form.
Note that information not included in Canonical XML may still be used to produce it. In particular:
- Attribute types serve as the basis of the normalization process for attribute values in Canonical XML, but the type of attributes is not preserved in it.
- The replacement text of general parsed entities that are referenced is included in Canonical XML, but the information about which entity any character or logical structure came from is not.
- Attribute values provided by default are included in Canonical XML, but the fact that the value was provided by default is not.
2.1 The Document Information Item
The information set of the canonical form includes only the "children" property of the document information item. It does not include any of the peripheral properties of the document information item, nor the "notations" or "entities" properties.
2.2 Element Information Items
The information set of the canonical form includes the properties: "namespace URI," "local name," "children" and "attributes" from each element information item. It does not include the "declared namespaces" property, nor any of the peripheral properties. Note that the infoset lists the "children" property as including references to skipped entity information items, but the canonical form does not include these.
2.3 Attribute Information Items
The information set of the canonical form includes all of the core properties, but none of the peripheral properties, of the attribute information item.
2.4 Processing Instruction Information Item
For Processing Instructions appearing outside of the Document Type Definition, the information set of the canonical form includes all of the core properties, but none of the peripheral properties, of the processing instruction information item. For those which appear in the Document Type Definition, the information set of the canonical form includes no Processing Instruction information items.
2.5 Reference to Skipped Entity Information Items
Reference to skipped entity information items are not included in the information set of the canonical form of a document. Such information items could not appear in Canonical XML because canonicalization requires the reading of declarations for all entities referenced in a document.
2.6 Character Information Items
The information set of the canonical form includes the core "character code" property of the character information item. None of the peripheral properties of the character information item are included.
2.7 Comment Information Items
The information set of the canonical form does not include comment information items.
2.8 Document Type Declaration Information Items
The information set of the canonical form does not include document type declaration information items.
2.9 Entity Information Items
The information set of the canonical form does not include entity information items.
2.10 Notation Information Items
The information set of the canonical form does not include notation information items.
2.11 Entity Start Marker Information Items
The information set of the canonical form does not include entity start marker information items.
2.12 Entity End Marker Information Items
The information set of the canonical form does not include entity end marker information items.
2.13 CDATA Start Marker Information Items
The information set of the canonical form does not include CDATA start marker information items.
No CDATA sections occur in the information set of the canonical form. They are not necessary since all syntactically-significant characters in Canonical XML are escaped in the fashion described in this specification.
2.14 CDATA End Marker Information Items
The information set of the canonical form does not include CDATA end marker information items.
2.15 Namespace Declaration Information Items
The information set of the canonical form does not include namespace declaration information items.
3 Document Type Definition Processing
The process of canonicalizing an XML document depends on its standalone document declaration. If the declaration is present and its value is "yes", then assuming the XML document satisfies the Standalone Document Declaration validity constraint, no external portion of the DTD can contain material which affects its canonical form.
In all other cases, the process of canonicalization requires reading the whole of the DTD. The following information from the DTD affects the canonical form of an XML document:
- Default attribute values.
- Declarations of general entities which are referenced in the document.
- Attribute type declarations which affect the process of attribute value normalization.
Note that the process of canonicalization is effectively impossible for a non-standalone document for which some external component of the DTD cannot be retrieved. Implementors of software which is designed to produce Canonical XML should provide an interface to users which allows this error condition to be signaled.
The canonical form of an XML document is standalone.
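As an illustration of how the DTD-derived information listed above reaches the parsed document, the following sketch feeds a document with an internal DTD subset to Python's standard `xml.etree.ElementTree` parser. The choice of parser is an assumption (this specification names no implementation), and the element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal subset declares a default value
# for "a" and a non-CDATA (NMTOKENS) type for "t".
DOC = """<!DOCTYPE x [
<!ATTLIST x a CDATA "dflt"
            t NMTOKENS #IMPLIED>
]>
<x t="  one   two "/>"""

root = ET.fromstring(DOC)

# The defaulted attribute is reported like any explicitly given attribute.
default_value = root.get("a")

# The NMTOKENS value is normalized: leading and trailing spaces are
# dropped and internal runs of spaces collapse to a single space.
tokens_value = root.get("t")
```

Neither the fact that `a` was defaulted nor the declared type of `t` is visible in the parsed result, which matches the information that the canonical form keeps.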
4 Entity and Reference Processing
The canonical form of an XML document contains no general entity references - all such references are expanded so that the canonical form contains only the replacement text. Since it contains no DTD, it also contains no parameter entity references.
Suppose a file named "e1.xml" contains the following text, with no trailing newline (#xA) character.
Hallelujah, I'm a bum!
then if the following XML document is stored in a file in the same directory
<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY bum SYSTEM "e1.xml">
]>
<d>&lsb;&bum;&rsb;</d>
its canonical form is
<d>[Hallelujah, I'm a bum!]</d>
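The expansion in this example can be reproduced approximately with Python's standard `xml.etree.ElementTree` parser, which replaces references to internal general entities with their replacement text. This is only a sketch: to keep it self-contained, `bum` is declared here as an internal entity, which is an assumption that departs from the example, where it is an external entity stored in "e1.xml".

```python
import xml.etree.ElementTree as ET

# Assumption: "bum" is an internal entity so that no file is needed;
# the example above declares it with SYSTEM "e1.xml".
DOC = """<!DOCTYPE d [
<!ENTITY lsb '['>
<!ENTITY rsb ']'>
<!ENTITY bum "Hallelujah, I'm a bum!">
]>
<d>&lsb;&bum;&rsb;</d>"""

root = ET.fromstring(DOC)

# All three references have been replaced by their replacement text.
expanded = root.text

# This text contains no character that needs escaping, so the element
# can be written out directly.
canonical_element = "<d>" + expanded + "</d>"
```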
5 The Syntax of Canonical XML
This section describes the syntax of Canonical XML. This syntax is a proper subset of the syntax of XML 1.0. The canonical form of an XML document is identical to its original form except as described in this section.
Each Canonical XML document must match the production labeled canonXML in the grammar below, where the notation and the semantics of the word "match" are those described in the XML 1.0 specification.
Canonical XML
[1]  canonXML   ::= (PI #xA)* element #xA (PI #xA)*
[2]  element    ::= Stag (Datachar | element | PI)* Etag
[3]  Stag       ::= '<' Name NSDecl? (Att NSDecl?)* '>'
[4]  Etag       ::= '</' Name '>'
[5]  NSDecl     ::= #x20 'xmlns:' Prefix '=' '"' Attvalchar* '"'
[6]  Att        ::= #x20 Name '=' '"' Attvalchar* '"'
[7]  Datachar   ::= '&amp;' | '&lt;' | '&gt;' | '&#13;' | (Char - ('&' | '<' | '>' | #xD))
[8]  Attvalchar ::= '&amp;' | '&lt;' | '&quot;' | '&#9;' | '&#10;' | '&#13;' | (Char - ('&' | '<' | '"' | #x9 | #xA | #xD))
[9]  Name       ::= (Prefix ':')? NCName
[10] Prefix     ::= 'n' [1-9] [0-9]*
[11] PI         ::= '<?' PITarget (#x20 (Char+ - (Char* '?>' Char*)))? '?>'
[12] PITarget   ::= NCName - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
The remainder of this section expresses additional constraints beyond those expressed in the grammar and provides further explanatory material on key aspects of Canonical XML.
5.1 Character Encoding
Canonical XML uses UTF-8 in the normalized form recommended by [CharModel] as the character encoding.
For example, consider the following small document:
<?xml version="1.0" encoding="ISO-8859-1"?> <lang>Español</lang>
Since it is encoded in ISO-8859-1 ("ISO Latin-1"), the character "ñ" is represented as #xF1. In Canonical XML, however, that character must be represented using UTF-8 in two bytes whose values are #xC3 and #xB1.
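The byte values quoted above can be checked with a short Python sketch. It only illustrates the transcoding step (decode with the declared encoding, re-encode as UTF-8) and is not a full canonicalizer.

```python
text = "Espa\u00f1ol"  # "Español"; U+00F1 is "ñ"

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# "ñ" occupies one byte in ISO-8859-1 and two bytes in UTF-8.
n_latin1 = "\u00f1".encode("iso-8859-1")
n_utf8 = "\u00f1".encode("utf-8")

# Transcoding a byte stream: decode using the declared encoding,
# then encode the resulting characters as UTF-8.
transcoded = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```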
The Unicode standard [Unicode] allows multiple different representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized representation [CharModel]. Canonical XML uses this normalized form.
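A minimal sketch of the problem, using Python's standard `unicodedata` module: the precomposed and decomposed spellings of "ç" are different character sequences until they are normalized. Unicode Normalization Form C is used here as a stand-in for the normalized form of [CharModel]; that choice is an assumption of this sketch, not something this specification states.

```python
import unicodedata

precomposed = "\u00e7"   # "ç" as a single code point
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

def normalize(text: str) -> str:
    # Assumption: NFC stands in for the [CharModel] normalized form.
    return unicodedata.normalize("NFC", text)

# After normalization both spellings yield the same UTF-8 bytes.
normalized_bytes = normalize(decomposed).encode("utf-8")
```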
Note: The XML Core Working Group strongly solicits commentary, especially from early implementors of this Working Draft, on the appropriateness of this requirement for normalized form. The Working Group has published a minority report on this question at https://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0000.html. A rationale for the majority viewpoint embodied in this draft has been published at https://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0001.html.
5.2 Character Escaping
The XML 1.0 specification requires XML processors to perform certain simple transformations on white-space characters in XML documents, when they serve as line separators and when they appear in attribute values. For each character in the result of the transformation, there will be a character information item as described by the Information Set. For example, in an XML 1.0 document:
- Where an element contains two lines separated by CR-NL (#xD, #xA), the information set contains a single NL (#xA) character information item.
- Where an element or attribute value contains the string "&#13;", the information set contains a single CR (#xD) character information item.
- Where a CDATA attribute value contains a TAB (#x9) character, the information set contains a single space (#x20) character information item.
- Where a non-CDATA attribute value contains a TAB (#x9) character, the information set contains a single space (#x20) character information item if the TAB character immediately followed a non-white-space character, and otherwise contains nothing at all.
- Where an attribute value contains the string "&#9;", the information set contains a TAB character (#x9).
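Several of the transformations listed above can be observed with any conforming XML processor. The sketch below uses Python's standard `xml.etree.ElementTree` parser as one such processor (an assumption; no particular implementation is implied), with a hypothetical element `e` and attribute `a`.

```python
import xml.etree.ElementTree as ET

# CR-NL between two lines is reported as a single NL (#xA).
line_break = ET.fromstring("<e>one\r\ntwo</e>").text

# A character reference to CR is reported as a literal CR (#xD).
cr_reference = ET.fromstring("<e>one&#13;two</e>").text

# A literal TAB in a CDATA attribute value is reported as a space (#x20).
literal_tab = ET.fromstring('<e a="one\ttwo"/>').get("a")

# A character reference to TAB is reported as a literal TAB (#x9).
tab_reference = ET.fromstring('<e a="one&#9;two"/>').get("a")
```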
All character information items are represented in a Canonical XML document by their UTF-8 encoding, with the following exceptions:
- In character data and attribute values, the character information items "<" and "&" are represented by "&lt;" and "&amp;" respectively.
- In character data, the character information item ">" is represented by "&gt;".
- In attribute values, the double-quote character information item (") is represented by "&quot;".
- In character data, the carriage-return (#xD) character information item is represented by "&#13;".
- In attribute values, the character information items TAB (#x9), newline (#xA), and carriage-return (#xD) are represented by "&#9;", "&#10;", and "&#13;" respectively.
5.3 Prolog
Canonical-XML documents have a prolog which contains only those Processing Instructions appearing before the start-tag of the root element but not within the Document Type Definition. Each PI is followed by a single newline (#xA) character. These PIs and newline characters make up the whole content of the prolog. If there are no such PIs, the first character is the "<" marking the beginning of the root element's start-tag.
For the following XML document
<!DOCTYPE x PUBLIC "myX" "x.dtd" [ <!ENTITY a "aVal"> ]> <x>y</x>
the canonical form is
<x>y</x>
If PIs are involved
<?t1 t1-body ?>
<!DOCTYPE x PUBLIC "myX" "x.dtd" [
<?t2 t2-body ?>
<!ENTITY a "aVal">
]>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x><?t3 ?>
the canonical form is
<?t1 t1-body ?>
<?xml-stylesheet href="mystyle.css" type="text/css" ?>
<?rating mostly-harmless?>
<x>y</x>
<?t3?>
5.4 Epilog
The epilog of all Canonical-XML documents contains a single newline (#xA) character, which immediately follows the ">" marking the end of the root element's end-tag. If the epilog contains Processing Instructions they are preserved in the Canonical-XML epilog, each followed by a newline (#xA) character.
For the following XML document
<x>y</x><?audio stop here ?> <!-- Local variables: mode: xml End: --><?pi?>
the canonical form is
<x>y</x>
<?audio stop here ?>
<?pi?>
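The layout of prolog, root element, and epilog follows production [1] (canonXML) and can be sketched as a simple join. The function name `assemble` and its parameters are hypothetical; the pieces passed in are assumed to be in canonical form already.

```python
def assemble(prolog_pis, root, epilog_pis):
    """Join already-canonical pieces as (PI #xA)* element #xA (PI #xA)*."""
    parts = [pi + "\n" for pi in prolog_pis]      # each prolog PI, then NL
    parts.append(root + "\n")                     # root element, then NL
    parts.extend(pi + "\n" for pi in epilog_pis)  # each epilog PI, then NL
    return "".join(parts)

# The epilog example above: no prolog PIs, two epilog PIs.
document = assemble([], "<x>y</x>", ["<?audio stop here ?>", "<?pi?>"])
```

A document with no PIs at all therefore consists of the root element followed by exactly one newline character.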
5.5 Elements
In Canonical XML, all elements have a start-tag and an end-tag. For elements which have no content, the end-tag follows the start-tag with no intervening characters.
For the following element
<x> <a n="1"/><b n="2"/> <c n="3"/></x>
the canonical form is
<x> <a n="1"></a><b n="2"></b> <c n="3"></c></x>
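The expansion of empty-element tags can be reproduced with Python's standard `xml.etree.ElementTree` serializer by turning off its short empty-element form. This only illustrates the rule in this section; the serializer is not a Canonical XML implementation and makes no attempt to apply the other rules of this specification.

```python
import xml.etree.ElementTree as ET

source = '<x> <a n="1"/><b n="2"/> <c n="3"/></x>'
root = ET.fromstring(source)

# short_empty_elements=False writes <a ...></a> instead of <a ... />.
expanded = ET.tostring(root, encoding="unicode", short_empty_elements=False)
```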
5.6 Tags
In Canonical XML, for end-tags and start-tags which contain no attributes, the ">" character closing the tag follows the element type immediately with no intervening white space. Any attributes and namespace declarations follow with each attribute and namespace declaration preceded by one space (#x20) character. When the element type and the attribute names do not have namespaces, the attributes are sorted lexicographically by attribute name (based on Unicode character code points); the ordering when namespaces are present is described in [5.9 Namespaces].
The canonical form of an XML document includes all its attributes, whether provided explicitly or by default in the original document.
For the following element
<x a="Earth" ñ="Wind" z="Fire" >!!</x >
the canonical form is
<x a="Earth" z="Fire" ñ="Wind">!!</x>
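A minimal sketch of the ordering rule for names without namespaces follows. Python compares strings by Unicode code point, so a plain `sorted` call gives the required order. The function name `start_tag` is hypothetical, and attribute values are assumed to be escaped already.

```python
def start_tag(name, attrs):
    """Build a start-tag for an element and attributes without namespaces."""
    parts = ["<" + name]
    # sorted() orders str keys by code point: "a" < "z" < "ñ" (U+00F1).
    for attr_name in sorted(attrs):
        # One space before each attribute, no spaces around "=".
        parts.append(' %s="%s"' % (attr_name, attrs[attr_name]))
    parts.append(">")
    return "".join(parts)

tag = start_tag("x", {"a": "Earth", "\u00f1": "Wind", "z": "Fire"})
```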
5.7 Attributes
In the canonical form of an XML document, attribute values are normalized in the fashion required of an XML processor.
In Canonical XML, attribute names and values are separated by a single "=" character and no spaces. All attribute values are delimited by double-quote (") characters. Within attribute values, all occurrences of double-quote are replaced by "&quot;".
For the following start-tag
<x a = '"Don&apos;t!", he cried.' b = "'>'">
the canonical form is
<x a="&quot;Don't!&quot;, he cried." b="'>'">
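The escaping of attribute values shown here, together with the character-data rules of section 5.2, can be sketched as two small functions. The names `escape_text` and `escape_attr` are hypothetical, and the sketch assumes the decimal spelling of the numeric character references ("&#9;", "&#10;", "&#13;").

```python
def escape_text(data: str) -> str:
    """Escape character data: & < > and carriage return."""
    # "&" is replaced first so that the "&" introduced by later
    # replacements is not escaped a second time.
    return (data.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#13;"))

def escape_attr(value: str) -> str:
    """Escape an attribute value: & < " and TAB, newline, carriage return."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#9;")
                 .replace("\n", "&#10;")
                 .replace("\r", "&#13;"))

# The two attribute values of the example start-tag above.
a_value = escape_attr('"Don\'t!", he cried.')
b_value = escape_attr("'>'")
```

Note that ">" and the single quote are left untouched in attribute values, and that the double quote, TAB, and newline are left untouched in character data.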
5.8 Processing Instructions
In Canonical XML, there is no Document Type Definition and thus no PIs contained in it. PIs which precede and follow the root element are normalized as follows:
- The white-space separating the PI Target from the rest of the PI contents is replaced by a single space (#x20) character.
- The "?>" sequence which closes the PI is followed by a single newline (#xA) character.
PIs which are contained in the content of an element are normalized as follows:
- The white-space separating the PI Target from the rest of the PI contents is replaced by a single space (#x20) character.
For the following XML document
<?pi1 v1 ?><?pi2 v2 ?><root>Hello <?audio bang! ?> he said.</root><?pi3?>
the canonical form is
<?pi1 v1 ?>
<?pi2 v2 ?>
<root>Hello <?audio bang! ?> he said.</root>
<?pi3?>
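The white-space rule for a single PI can be sketched with a regular expression over its source text. The function name `canonical_pi` is hypothetical, and the sketch assumes its input is one well-formed processing instruction; the trailing newline required outside the root element is left to the caller.

```python
import re

# Target, then optional XML white space (space, TAB, CR, NL) and data.
_PI = re.compile(r"<\?([^\s?]+)(?:[ \t\r\n]+(.*?))?\?>", re.DOTALL)

def canonical_pi(pi: str) -> str:
    """Replace the white space after the PI target by a single space."""
    match = _PI.fullmatch(pi)
    if match is None:
        raise ValueError("not a processing instruction: %r" % pi)
    target, data = match.group(1), match.group(2)
    if not data:
        # No contents: production [11] allows no space before "?>".
        return "<?%s?>" % target
    return "<?%s %s?>" % (target, data)
```

White space inside or at the end of the PI contents is kept as it is, which is why "<?pi1 v1 ?>" retains the space before "?>".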
5.9 Namespaces
In Canonical XML, namespace prefixes always have the form n1, n2 and so on. The positive integer following the n is called the index of the prefix.
A start-tag always contains namespace declarations for exactly those prefixes that are used in the element type and the attribute names occurring in the start-tag. Namespace declarations are never inherited.
NOTE: This approach was chosen so that canonicalization is context-independent: the canonical form of an element is independent of where it occurs in the document.
The default namespace is never used. An attribute name never has the same prefix as the element type or another attribute name. The namespace declaration for a prefix immediately follows the element type or attribute that uses the prefix. Attributes are ordered primarily by the lexicographic order of the namespace URI with which the prefix of the attribute name is associated, and secondarily by the lexicographic order of the local part of the attribute name. A null namespace URI is considered to precede a non-null namespace URI: thus all attributes without prefixes precede all attributes with prefixes.
In the start-tag namespace prefixes occur in order of prefix index. The index of the first namespace prefix in the start-tag is always 1. The indices of the prefixes occurring in the start-tag are always consecutive integers. Thus if the element type has a prefix, its prefix will be n1; the prefix of the first attribute name in the start-tag that has a prefix will be n2 if the element type has a prefix, and n1 otherwise; for subsequent attributes, the index of the prefix of the attribute name will be one greater than the index of the prefix of the name of the preceding attribute.
For example, for the following element
<doc xmlns:x="https://w3.org/2" xmlns:y="https://w3.org/1">
<x:e a="a"/>
<x:e x:a="x:a"/>
<e x:a="x:a"/>
<e x:a="x:a" y:a="y:a"/>
<e x:a="x:a" a="a"/>
<e x:a="x:a" x:b="x:b"/>
</doc>
the canonical form is
<doc>
<n1:e xmlns:n1="https://w3.org/2" a="a"></n1:e>
<n1:e xmlns:n1="https://w3.org/2" n2:a="x:a" xmlns:n2="https://w3.org/2"></n1:e>
<e n1:a="x:a" xmlns:n1="https://w3.org/2"></e>
<e n1:a="y:a" xmlns:n1="https://w3.org/1" n2:a="x:a" xmlns:n2="https://w3.org/2"></e>
<e a="a" n1:a="x:a" xmlns:n1="https://w3.org/2"></e>
<e n1:a="x:a" xmlns:n1="https://w3.org/2" n2:b="x:b" xmlns:n2="https://w3.org/2"></e>
</doc>
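The prefix assignment and attribute ordering described in this section can be sketched as follows. The function name `canonical_start_tag` and its argument layout are hypothetical; namespace URIs and attribute values are assumed to be escaped already, and a null namespace is represented by `None`.

```python
def canonical_start_tag(elem_ns, elem_local, attrs):
    """attrs: iterable of (namespace URI or None, local name, value)."""
    index = 0  # index of the most recently assigned prefix

    def qualified(ns, local):
        nonlocal index
        if ns is None:
            return local, ""  # no prefix and no declaration
        index += 1            # prefixes are consecutive: n1, n2, ...
        prefix = "n%d" % index
        return prefix + ":" + local, ' xmlns:%s="%s"' % (prefix, ns)

    name, decl = qualified(elem_ns, elem_local)
    parts = ["<" + name + decl]
    # Null namespace first, then by namespace URI, then by local name.
    ordered = sorted(attrs, key=lambda a: (a[0] is not None, a[0] or "", a[1]))
    for ns, local, value in ordered:
        name, decl = qualified(ns, local)
        # The declaration immediately follows the attribute that uses it.
        parts.append(' %s="%s"%s' % (name, value, decl))
    parts.append(">")
    return "".join(parts)

W1 = "https://w3.org/1"
W2 = "https://w3.org/2"
fourth = canonical_start_tag(None, "e", [(W2, "a", "x:a"), (W1, "a", "y:a")])
```

Because every prefixed name gets its own prefix and declaration, the result depends only on the start-tag itself, which is the context independence that the NOTE above describes.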
A References
- CharModel
- Character Model for the World Wide Web, ed. Martin J. Dürst, François Yergeau. Available at https://www.w3.org/TR/charmod.
- Infoset
- XML Information Set, ed. John Cowan. Available at https://www.w3.org/TR/xml-infoset.
- Namespaces
- Namespaces in XML, eds. Tim Bray, Dave Hollander, and Andrew Layman. Available at https://www.w3.org/TR/REC-xml-names.
- Unicode
- The Unicode Consortium. The Unicode Standard, version 3.0. ISBN 0-201-61633-5. Described at https://www.unicode.org/unicode/standard/versions/Unicode3.0.html.
- XML
- Extensible Markup Language (XML) 1.0, eds. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 10 February 1998. Available at https://www.w3.org/TR/REC-xml.
B Acknowledgements (Non-Normative)
The work of producing this specification was accomplished by the membership of the W3C XML Syntax Working Group and its successor, the W3C XML Core Working Group:
- Joel Nava, Adobe (Co-chair, Syntax)
- Tim Bray, Invited Expert (Co-chair, Syntax; Editor)
- Paul Grosso, Arbortext (Co-chair, Core)
- Arnaud Le Hors, IBM (Co-chair, Core)
- James Clark, Invited Expert (Editor)
- James Tauber, Bow Street Software (Editor)
- John Cowan, Reuters (Editor)
- Bert Bos, W3C (W3C Liaison, Syntax)
- Joseph Reagle, W3C (W3C Liaison, Syntax)
- Dan Connolly, W3C (W3C Liaison, Core)
- Daniel Veillard, W3C (W3C Liaison, Core)
- Daniel Austin, Ask Jeeves
- Gary Bisaga, Mitre
- Tim Boland, NIST, Invited Expert
- Allen Brown, Microsoft
- John Evdemon, XMLSolutions
- Charles Frankston, Microsoft
- Eduardo Gutentag, Sun Microsystems
- Michael Hyman, Microsoft
- Murata Makoto, Fuji Xerox
- Eve Maler, Sun Microsystems
- Murray Maloney, Commerce One
- Jonathan Marsh, Microsoft
- Mark Needleman, Data Research Associates
- Anguel Novoselsky, Oracle
- David Orchard, IBM
- Lew Shannon, NCR
- Michael Sperberg-McQueen, U. Ill. and W3C
- Steph Tryphonas, Microstar
- Norman Walsh, Arbortext
- François Yergeau, Alis