Copyright © 2001
This Note describes the features and syntax of the XML Pipeline Definition Language. Pipeline is an XML vocabulary for describing the processing relationships between XML resources. A pipeline document specifies the inputs and outputs of XML processes, and a pipeline controller uses this document to determine the chain of processing that must be executed in order to produce a particular result.
This document is a submission to the World Wide Web Consortium.
This Note is made available by W3C for discussion only. Publication of this Note by the W3C indicates no endorsement by W3C or the W3C Team, or any W3C Members. The W3C has had no editorial control over the preparation of this Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.
For a full list of acknowledged submissions, see the list of acknowledged Submissions to W3C.
Created in electronic form.
There is a large and growing set of specifications that describe processes operating on XML documents.
This specification is not generally concerned with the XML parsing process.
XML documents are here considered to be operated on as information sets.
The processes of interest in this specification are those that construct, inspect, augment, or extract from information sets. A process begins with zero or more information sets and produces zero or more information sets (it may also produce ancillary information, such as whether it succeeded or failed).
Although some, perhaps most, applications work with concrete object models that are not identical to the Infoset, it is still useful to describe the processing model in terms of the Infoset. In practice, applications will use SAX event streams or DOM object models, or even use XML representations of infosets, to pass information back and forth.
Applications, for example many web services, can be implemented by integrating business processing and standard XML processing. Business processes vary greatly among different applications and environments, but XML processing is often mostly a matter of composing the individual operations described by separate specifications (validation, transformation, and so on) in a useful order.
However, the order of processing actually used in any one application is an important aspect of the semantics of the application, in that it might be imperative that one process be performed before another.
Note that for any given application, the processes might require a partial order, and not necessarily a total order. In other words:
There are dependency relationships among the processes and any processing order that violates these dependencies is erroneous.
But any processing order that satisfies those dependencies is acceptable.
The satisfaction of dependencies between processes is already a well-understood problem in software development; it forms the heart of systems like make.
This specification describes a declarative XML vocabulary that addresses this need in a way that is tuned for XML processing.
The focus of this specification is intentionally quite narrow, in order to provide a working solution for the most pressing needs first. It is clear that a number of useful extensions could be made to support processing meta-models, wildcards in process inputs and outputs, and so on. These features might be developed over time, but they are not critical to solving the most immediate problems.
It is clear that addressing the whole problem will ultimately require APIs that allow a process manager to initiate individual processes as necessary. This specification does not attempt to address this issue. However, the software development community has already begun to do so with the development of SAX, TRaX, and other APIs.
The identification of a processing model itself could be performed entirely at the programming API level, but this approach seems to set the bar too high. True interoperability will be achieved only when it is possible for end-users working with stock XML tools to build processing models for themselves. Thus, this specification proposes an XML vocabulary rather than an API.
The following process classification provides a framework in which to discuss the operations of applications in general terms:
Constructive processes. These are processes that build new information sets. Processes that produce a new information set, or that add information items or property values defined in the XML Information Set, are constructive.
Augmenting processes. Processes, like XML Schema validity assessment, that add properties or information items to an existing information set.
Inspection processes. Processes that inspect but do not modify a document, such as a checker that reports whether a document satisfies some set of constraints.
Extraction processes. Some processes reach into an existing information set and remove or copy parts of it for further processing. Processes that use XPath expressions to select portions of a document are extractive, for example.
Packaging processes. Distributed or federated web applications will need to package a collection of resources to transmit to another location or service. This packaging could be performed using SOAP with attachments, for example.
It is important to note that the classes are not mutually exclusive, as shown, for example, by the fact that some schemas can cause a process to perform construction, augmentation, and extraction. In addition, processes can be hierarchical, with a constructive process performing a bit of validation or extraction, for example.
The fact that documents can be augmented or transformed raises another set of issues with respect to addressing into augmented or transformed results.
The XML Pipeline Definition Language (hereafter, the Pipeline language) is a vocabulary for describing the processing relationships between XML resources.
The high-level requirements of the Pipeline language are as follows:
It shall be expressed in XML.
It should be possible to author and manipulate pipeline documents using standard XML tools.
It shall be as declarative as possible.
Declarative languages are more amenable to optimization and other compilation techniques. It should be relatively easy to implement a conformant pipeline controller, but it should also be possible to build a sophisticated controller that can perform parallel operations, lazy or greedy processing, and other optimizations.
It shall be neutral with respect to implementation language.
Just as there is no single language that can process XML exclusively, there should be no single language that can implement the Pipeline language exclusively. It should be possible to interoperably exchange pipeline documents across computing platforms.
An individual process produces one or more result resources from some set of input resources. The Pipeline language uses URIs to identify resources (both inputs and results). A pipeline controller uses these URIs to keep track of the resources that it has available. In the context of a pipeline, every resource is identified by exactly one URI. Two resources are the same if they have the same URI. Two URIs are the same if they are lexically equivalent.
To indicate that an inspection process (a process that does not modify an input infoset) has succeeded, the Pipeline forces the process to produce a result with a new URI. The resulting information set will be identical to the input information set, but its new label allows the pipeline controller to keep track of its status.
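For instance, a minimal sketch of such a relabeling, using the process, input, and output elements described later in this Note (the check.p process type and the labels are hypothetical):

<process id="inspect" type="check.p">
  <!-- the input information set is not modified by the inspection -->
  <input name="document" label="example.xml"/>
  <!-- the same information set under a new URI; its existence
       records that the inspection succeeded -->
  <output name="result" label="checked/example.xml"/>
</process>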
Note that the processing of URIs in pipeline documents depends on XML Base.
The resources operated on by a pipeline controller are considered to consist of XML information sets.
In practice, some information set properties introduce dependencies between information items that make composition non-trivial:
XML Schema validity properties introduce dependencies on descendants.
The in-scope namespaces property can introduce dependencies (at least implicitly) on ancestors.
Markers, as have existed in earlier drafts of XInclude, introduce similar dependencies.
These dependencies have to be addressed by the specification of each process that operates on information sets.
Processing begins with an initial set of zero or more information sets, the name of some target that is the desired result, and a pipeline document that describes how documents are related to each other. Every relationship has three parts: a set of input information sets, a process, and a set of result information sets. For example, a result produced by one process may serve as an input to another.
Processing begins with the pipeline document and the URI of the desired target.
If the target is up to date, no processing is required and the target is returned.
If the target is not up to date, the controller identifies the process from the pipeline document that can produce the target. It does this by examining the outputs declared by each process.
It is an error if there is not exactly one such process in the pipeline document. If there is no such process, the target cannot be produced; if there is more than one, the processing is ambiguous.
Assuming that exactly one process is found, the controller considers each of the information sets that are inputs to that process as intermediate targets. These intermediate targets are resolved in exactly the same manner as the principal target.
It is an error for processes to depend directly or indirectly on their outputs. In other words, if Process 1 produces C from B and Process 2 produces B from A, it is an error for the pipeline document to contain a process that produces A from B or C (or any intermediate target produced from B or C).
When all of the input documents are up to date, and no errors have occurred, the process is executed to produce the output result(s) and the target is returned.
If an error occurs, processing terminates. Each process can selectively ignore errors and return appropriate error documents in place of its normal outputs.
Note that the order of the processes specified in the pipeline document is insignificant. Users need not figure out the right order for the whole pipeline; they need only declare the dependencies. Naturally, dependencies can be made linear, forcing a fixed order if that is what is desired.
We consider a simple application of the Pipeline language:
Begin with a source document.
Expand any XInclude directives that it contains.
Make sure that it is schema-valid with respect to a particular schema.
If it is, transform it with a particular stylesheet to produce the result.
The following example shows what the corresponding pipeline document might look like.
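One such sketch, in which the labels, hrefs, definition URIs, and namespace URI are illustrative assumptions, while the process names (p1, p2, p3) and types (xinclude.p, validate.p, transform.p) are those used in the walk-through that follows:

<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline">

  <param name="target" select="result.html"/>

  <!-- hypothetical definition URIs; in practice these identify
       the XInclude, XML Schema, and XSLT processes -->
  <processdef name="xinclude.p"  definition="urn:example:xinclude"/>
  <processdef name="validate.p"  definition="urn:example:xmlschema"/>
  <processdef name="transform.p" definition="urn:example:xslt"/>

  <process id="p1" type="xinclude.p">
    <input name="document" href="example.xml" label="example.xml"/>
    <output name="result" label="expanded.xml"/>
  </process>

  <process id="p2" type="validate.p">
    <input name="document" label="expanded.xml"/>
    <input name="schema" href="example.xsd" label="example.xsd"/>
    <output name="result" label="valid.xml"/>
  </process>

  <process id="p3" type="transform.p">
    <input name="document" label="valid.xml"/>
    <input name="stylesheet" href="example.xsl" label="example.xsl"/>
    <output name="result" label="result.html"/>
  </process>

</pipeline>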
Assume that initially, the pipeline controller is handed this pipeline document along with the URI of the desired target.
The target information set does not yet exist, so it is out of date.
The process p3 has an output whose label matches the target, so p3 must be executed.
The p3 process depends on its inputs: the validated document and the stylesheet.
The process p2 has an output that produces the validated document.
The p2 process in turn depends on the XInclude-expanded document (and on the schema).
The process p1 has an output that produces the expanded document.
The only input to the p1 process already exists, so the xinclude.p process is executed according to whatever definition was provided.
Assuming that p1 runs without error, it produces the document that p2 requires.
With all of the inputs to p2 available, it can be executed, producing the necessary input to p3.
Finally, p3 can be executed, producing the desired result.
The controller succeeds.
If at any point an error occurs, the controller returns either a specified error document or a built-in error document and fails.
In this example, there is only one order of processing that can satisfy all of the dependencies: xinclude.p, then validate.p, then transform.p. In a more complex pipeline, multiple or even parallel orders are possible.
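For example, in the following sketch (process types and labels hypothetical), pa and pb have no dependencies on each other's outputs, so a controller may run them in either order, or in parallel, before pc:

<process id="pa" type="transform.p">
  <input name="document" label="chapter1.xml"/>
  <input name="stylesheet" label="style.xsl"/>
  <output name="result" label="chapter1.html"/>
</process>

<process id="pb" type="transform.p">
  <input name="document" label="chapter2.xml"/>
  <input name="stylesheet" label="style.xsl"/>
  <output name="result" label="chapter2.html"/>
</process>

<!-- pc depends on both pa and pb, but pa and pb are mutually independent -->
<process id="pc" type="aggregate.p">
  <input name="first" label="chapter1.html"/>
  <input name="second" label="chapter2.html"/>
  <output name="result" label="book.html"/>
</process>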
The following sections define the rules for pipeline documents and the required semantics of elements in the Pipeline language.
The following XML Schema describes the Pipeline language.
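A condensed sketch of such a schema; the content models and attribute types shown are assumptions derived from the element descriptions in this section, and the namespace URI is likewise assumed:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:p="http://www.w3.org/2002/02/xml-pipeline"
           targetNamespace="http://www.w3.org/2002/02/xml-pipeline"
           elementFormDefault="qualified">

  <!-- the root element holds parameters, process definitions, and processes -->
  <xs:element name="pipeline">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="p:param"/>
        <xs:element ref="p:processdef"/>
        <xs:element ref="p:process"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

  <xs:element name="param">
    <xs:complexType>
      <xs:attribute name="name" type="xs:NCName" use="required"/>
      <xs:attribute name="select" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <!-- binds a process type name to the definition of what the process does -->
  <xs:element name="processdef">
    <xs:complexType>
      <xs:attribute name="name" type="xs:NCName" use="required"/>
      <xs:attribute name="definition" type="xs:anyURI" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="process">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="p:input"/>
        <xs:element ref="p:output"/>
        <xs:element ref="p:error"/>
        <xs:element ref="p:param"/>
      </xs:choice>
      <xs:attribute name="id" type="xs:ID"/>
      <xs:attribute name="type" type="xs:NCName" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="input">
    <xs:complexType>
      <xs:attribute name="name" type="xs:NCName" use="required"/>
      <xs:attribute name="label" type="xs:anyURI"/>
      <xs:attribute name="href" type="xs:anyURI"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="output">
    <xs:complexType>
      <xs:attribute name="name" type="xs:NCName" use="required"/>
      <xs:attribute name="label" type="xs:anyURI"/>
      <xs:attribute name="ignore-errors" type="xs:boolean" default="false"/>
    </xs:complexType>
  </xs:element>

  <!-- an error document may be given inline (as foreign content) or by href -->
  <xs:element name="error">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##other" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:NCName" use="required"/>
      <xs:attribute name="href" type="xs:anyURI"/>
    </xs:complexType>
  </xs:element>

</xs:schema>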
A pipeline document primarily contains elements from the pipeline namespace.
The processing pipeline begins with the root element, pipeline.
In certain locations, the pipeline document may contain any element not from the pipeline namespace, provided that the expanded-name of the element has a non-null namespace URI. The presence of such foreign content must not change the behavior of pipeline elements and functions defined in this specification. Thus, a pipeline controller is always free to ignore such foreign content, and must ignore a foreign element or attribute (other than any that it explicitly recognizes).
It is an error for foreign elements to contain elements from the pipeline namespace.
The pipeline element contains the param, processdef, and process elements that make up the pipeline.
The param element, as a child of pipeline, defines a named parameter of the pipeline as a whole.
The target parameter identifies the default target of the pipeline.
The value of the target parameter is the URI of the information set that the pipeline is intended to produce.
The processdef element associates a process type name, given by its name attribute, with a definition of the process to be performed.
Each step in the pipeline is described by a process element. A process must have a type attribute that identifies the kind of process to perform.
A process contains the input, output, error, and param elements that describe its inputs, its results, its error handling, and its parameters.
The type of a process must match the name of a process definition in the pipeline document.
A process is executed if and only if at least one of its outputs is out of date with respect to its inputs.
The dependency evaluation process is recursive. If Process 2 depends on information set A that is produced by Process 1, the dependencies on A described by Process 1 must be evaluated, and information set A may be updated by Process 1, before the controller can determine if the outputs of Process 2 are up to date with respect to A.
If the process has unnamed inputs or outputs, they cannot be matched to the process definition, since arguments are identified by name.
It is an error for more than one process to produce the same information set.
The pipeline controller is responsible for storing and tracking output documents. Arguments are passed to processes by name, not position, so the order of the children of the process element is insignificant.
When a process is executed, it either succeeds or fails. If it succeeds, it produces its named outputs.
If it fails, processing either terminates or proceeds. A process may proceed only if both of the following conditions are met:
The ignore-errors attribute of the relevant outputs has the value true.
For each such output, an error element provides a replacement information set.
If the process is to proceed, the error information sets are used in place of the corresponding outputs.
If the process fails, no information sets are produced and an indication of failure is returned. The failure of any executed process causes the pipeline controller to abandon the task of building the target.
An input element identifies an information set that the process consumes.
If the label of the input identifies an information set that the controller already has, that information set is used.
If the label identifies the output of another process in the pipeline, that process is evaluated first.
Otherwise, the resource is retrieved from the URI in the href attribute.
It is an error for the input to identify no available information set and to have no href attribute.
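A sketch (names hypothetical): the first input below is matched by label to the output of another process; the second has no producing process, so it is retrieved from its href:

<process id="p2" type="validate.p">
  <!-- produced by p1 elsewhere in the pipeline; matched by label -->
  <input name="document" label="expanded.xml"/>
  <!-- no process produces this label, so it is retrieved from href -->
  <input name="schema" label="example.xsd"
         href="https://example.com/example.xsd"/>
  <output name="result" label="valid.xml"/>
</process>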
An output element identifies an information set that the process produces.
When a process is executed, it is the responsibility of the pipeline controller to collect the result information sets produced by the process, associate them with the appropriate output statements, and store them for use in evaluating other processes.
The error element identifies an information set that is produced in the event that the process fails.
If a process fails, the error information sets are produced in place of its normal outputs, provided that the process is permitted to proceed as described above.
The content of the error information set comes from one of two locations:
If the error element has content, that content is the error information set.
Otherwise, the resource is retrieved from the URI in the href attribute.
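A sketch (labels hypothetical) of a process that is permitted to proceed past failure, using the href form of the error element:

<process id="p2" type="validate.p">
  <input name="document" label="expanded.xml"/>
  <input name="schema" label="example.xsd"/>
  <output name="result" label="valid.xml" ignore-errors="true"/>
  <!-- on failure, valid.xml is replaced by the infoset at invalid.html;
       alternatively, the error element may carry inline content -->
  <error name="result" href="invalid.html"/>
</process>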
Additional parameters may be passed to a process with the param element.
Pipeline parameters are named. It is the responsibility of the pipeline controller to marshal them appropriately for the actual process used. If the actual process has no corresponding parameter, the controller's handling is implementation-dependent.
For example, given a param element on a process and a process definition for the "identity" transformation, the controller marshals the parameter to the processor that performs the identity transform.
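A sketch of that example (the parameter name and definition URI are hypothetical):

<processdef name="identity.p" definition="urn:example:xslt-identity"/>

<process id="copy" type="identity.p">
  <input name="document" label="example.xml"/>
  <output name="result" label="copy/example.xml"/>
  <!-- marshalled by the controller to the underlying XSLT processor -->
  <param name="indent" select="'yes'"/>
</process>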
The need for a pipeline definition language is motivated by a selection of use cases.
Two parties agree to conduct business electronically. They will exchange purchase orders, invoices, and other business documents using some appropriate transport protocol. Before responding to a request, each party wishes to validate the request against a known schema so that errors do not result in mismanaged funds.
This use case depends on approximately the following processing order (a pipeline sketch follows the list):
Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.
Any XInclude elements must be expanded.
Validation must be performed against a specific schema, ignoring any schema location information in the document.
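A minimal sketch of such a pipeline, assuming hypothetical labels and definition URIs (the parse step is implicit in reading each document as an information set):

<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline">
  <processdef name="xinclude.p" definition="urn:example:xinclude"/>
  <processdef name="validate.p" definition="urn:example:xmlschema"/>

  <process id="expand" type="xinclude.p">
    <input name="document" label="po.xml"/>
    <output name="result" label="expanded/po.xml"/>
  </process>

  <process id="check" type="validate.p">
    <input name="document" label="expanded/po.xml"/>
    <!-- the known schema; schema location hints in the document are ignored -->
    <input name="schema" label="po.xsd"/>
    <output name="result" label="valid/po.xml"/>
    <error name="result" href="rejected.xml"/>
  </process>
</pipeline>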
A company has a collection of documents in "XYZ Schema V1.0". A new release, "XYZ Schema V1.1" is released which contains a small number of backwards incompatible (but programmatically correctable) changes and a few new features. The company needs to to update its existing documents to the new schema in order to exploit the new features.
This use case depends on approximately the following processing order (a stylesheet sketch follows the list):
Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.
Validation must be performed against a specific "XYZ Schema V1.0" schema, ignoring any schema location information in the document (in order to be sure that local extensions will not interfere with the automated conversion process).
Documents must be transformed with a specific stylesheet, ignoring any style information in the document.
The transformation must preserve all existing markup (it must be an identity transformation) except for the specific changes required to convert to V1.1. (In particular, it must preserve existing XInclude elements.)
Schema location information in the document must be updated.
The converted document must be validated against the V1.1 schema, making use of any local schema information that is present, in order to assure that the transformation did not introduce errors.
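The identity-preserving transformation called for in step 4 can be sketched in XSLT; the renamed element names here are hypothetical:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <!-- identity template: copy everything, including XInclude elements -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- targeted V1.0-to-V1.1 correction: rename one element -->
  <xsl:template match="old-name">
    <new-name>
      <xsl:apply-templates select="@*|node()"/>
    </new-name>
  </xsl:template>
</xsl:stylesheet>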
A company has a collection of documents in XML. It wishes to publish them on a periodic basis.
This use case depends on approximately the following processing order:
Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.
The document must be schema validated using local schema information that is present.
The document must be transformed and published either with the stylesheet information in the document or with a specific set of stylesheets.
A company has built some sort of hub for doing business transaction processing. It accepts documents in a variety of schemas, transforms them to some internal schema, operates on those documents, and returns results in the schema appropriate for the requestor.
This use case depends on approximately the following processing order:
Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.
XInclude elements must be expanded.
The document must be schema validated using either the schema identified by the document or an out of band schema.
The document must be transformed to the "hub schema". This new document may use XInclude elements to refer to standard boilerplate or other constant information.
XInclude elements must be expanded again.
Validate again, to make sure the transformation did not introduce errors.
Perform whatever processing is required.
Transform the result into an appropriate outbound schema.
Perhaps expand XIncludes again.
Consider a web service that is part of some larger service chain. It might need to operate only on portions of a document (because other portions are encrypted, for example, or simply because it only deals with a certain namespace). It might perform validation on only some elements, for example, or expand only certain XIncludes.
This use case depends on approximately the following processing order:
Documents must be parsed and verified as well-formed with respect to XML 1.0, XML Namespaces, and XML Base.
Selected portions of the document must be schema validated.
XInclude processing must be performed selectively.
The information set may be augmented or transformed as a result of the web services operation.
The following pipeline document describes the build process for the HTML version of this specification. This specification consists of three source XML files.
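A condensed sketch of the shape such a build pipeline might take (the file names and definition URIs are assumptions):

<pipeline xmlns="http://www.w3.org/2002/02/xml-pipeline">
  <param name="target" select="pipeline.html"/>

  <processdef name="xinclude.p" definition="urn:example:xinclude-cmdline"/>
  <processdef name="xslt.p" definition="urn:example:xslt-cmdline"/>

  <process id="assemble" type="xinclude.p">
    <!-- the main source file includes the other two source files -->
    <input name="document" label="pipeline.xml"/>
    <output name="result" label="expanded/pipeline.xml"/>
  </process>

  <process id="format" type="xslt.p">
    <input name="document" label="expanded/pipeline.xml"/>
    <input name="stylesheet" label="xmlspec.xsl"/>
    <output name="result" label="pipeline.html"/>
  </process>
</pipeline>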
The pipeline controller in this case is a command-line processor. Each of the processes is defined to be a standard command line that is executed to perform the process. All of the information sets in this case are XML files on disk, but that is not necessary (or even desirable).
Note that an implementation-specific XPath-like notation is used here in the parameter values.
This appendix identifies some open issues.
Can a pipeline document be its own target? Can it have XInclude instructions and validity constraints and other anticipated processing?
In order to provide greater flexibility in the Pipeline document, does it make sense to consider allowing some attribute values (for example, the href of an input) to be computed dynamically?
As currently defined, the param mechanism is quite limited; it may need to be generalized.
The schema for the Pipeline language in this Note does not capture all of the constraints expressed in the prose.
The issue of controlling extractive processes (beyond the control already built into specifications such as XML Schema) is not so far addressed here. The issue was explored to a certain extent at the XML Processing Model Workshop (see, for example, the paper presented by Philippe Le Hégaret).