Speech Synthesis Markup Language Version 1.0
W3C Candidate Recommendation 18 December 2003
- This version:
- https://www.w3.org/TR/2003/CR-speech-synthesis-20031218/
- Latest version:
- https://www.w3.org/TR/speech-synthesis/
- Previous version:
- https://www.w3.org/TR/2002/WD-speech-synthesis-20021202/
Editors:
- Daniel C. Burnett, Nuance
- Mark R. Walker, Intel
- Andrew Hunt, ScanSoft
Copyright © 1999-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
Abstract
The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
Status of this Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This is the 18 December 2003 W3C Candidate Recommendation of "Speech Synthesis Markup Language (SSML) Version 1.0". W3C publishes a technical report as a Candidate Recommendation to indicate that the document is believed to be stable, and to encourage implementation by the developer community. Candidate Recommendation status is described in section 7.1.1 of the Process Document. Comments can be sent until 18 February 2004.
Comments on this document and requests for further information should be sent to the Working Group's public mailing list www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines. Please check the disposition of comments received during the Last Call period.
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).
The entrance criteria to the Proposed Recommendation phase require at least two independently developed interoperable implementations of each required feature, and at least one or two implementations of each optional feature depending on whether the feature's conformance requirements have an impact on interoperability. Detailed implementation requirements and the invitation for participation in the Implementation Report are provided in the Implementation Report Plan. Note, this specification already has significant implementation experience that will soon be reflected in its Implementation Report. We expect to meet all requirements of that report within the Candidate Recommendation period closing 18 February 2004.
Patent disclosures relevant to this specification may be found on the Working Group's patent disclosure page in conformance with W3C policy.
Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress. A list of current W3C Recommendations and other technical documents can be found at https://www.w3.org/TR/.
0. Table of Contents
- 1. Introduction
- 2. SSML Documents
- 3. Elements and Attributes
- 3.1 Document Structure, Text Processing and Pronunciation
- 3.1.1 "speak" Root Element
- 3.1.2 "xml:lang" Language Attribute
- 3.1.3 "xml:base" Attribute
- 3.1.4 "lexicon" Element
- 3.1.5 "meta" Element
- 3.1.6 "metadata" Element
- 3.1.7 "p" and "s"
- 3.1.8 "say-as" Element
- 3.1.9 "phoneme" Element
- 3.1.10 "sub" Element
- 3.2 Prosody and Style
- 3.2.1 "voice" Element
- 3.2.2 "emphasis" Element
- 3.2.3 "break" Element
- 3.2.4 "prosody" Element
- 3.3 Other Elements
- 3.3.1 "audio" Element
- 3.3.2 "mark" Element
- 3.3.3 "desc" Element
- 4. References
- 5. Acknowledgments
- Appendix A. Audio File Formats (normative)
- Appendix B. Internationalization (normative)
- Appendix C. MIME Types and File Suffix (normative)
- Appendix D. Schema for the Speech Synthesis Markup Language (normative)
- Appendix E. DTD for the Speech Synthesis Markup Language (informative)
- Appendix F. Example SSML (informative)
- Appendix G. Summary of changes since the Last Call Working Draft (informative)
1. Introduction
This W3C Specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [SABLE], which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS]. Since then, SABLE itself has not undergone any further development.
The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see Section 1.2). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see Section 2.2.2) or as part of a fragment (see Section 2.2.1) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.
1.1 Design Concepts
The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].
The following items were the key design criteria.
- Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.
- Interoperability: support use along with other W3C specifications including (but not limited to) VoiceXML, aural Cascading Style Sheets and SMIL.
- Generality: support speech output for a wide range of applications with varied speech content.
- Internationalization: Enable speech output in a large number of languages within or across documents.
- Generation and Readability: Support automatic generation and hand authoring of documents. The documents should be human-readable.
- Implementable: The specification should be implementable with existing, generally available technology, and the number of optional features should be minimal.
1.2 Speech Synthesis Process Steps
A Text-To-Speech system (a synthesis processor) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.
Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.
-
XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps. Tokens (words) in SSML cannot span markup tags. A simple English example is "cup<break/>board"; the synthesis processor will treat this as the two words "cup" and "board" rather than as one word with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it.
-
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
-
Markup support: The p and s elements defined in SSML explicitly indicate document structures that affect the speech output.
-
Non-markup behavior: In documents and parts of documents where these elements are not used, the synthesis processor is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
-
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit.
-
Markup support: The say-as element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more. Note that many acronyms and abbreviations can be handled by the author via direct text replacement or by use of the sub element, e.g. "BBC" can be written as "B B C" and "AAA" can be written as "triple A". These replacement written forms will likely be pronounced as one would want the original acronyms to be pronounced. In the case of Japanese text, if you have a synthesis processor that supports both Kanji and kana, you may be able to use the sub element to identify whether 今日は should be spoken as きょうは ("kyou wa" = "today") or こんにちは ("konnichiwa" = "hello"). (An informative example combining several of these markup hooks appears after this list of steps.)
-
Non-markup behavior: For text content that is not marked with the say-as element the synthesis processor is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.
-
Text-to-phoneme conversion: Once the synthesis processor has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").
-
Markup support: The phoneme element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The say-as element might also be used to indicate that text is a proper name that may allow a synthesis processor to apply special rules to determine a pronunciation. The lexicon element can be used to reference external definitions of pronunciations. These elements can be particularly useful for acronyms and abbreviations that the processor is unable to resolve via its own text normalization and that are not addressable via direct text substitution or the sub element (see paragraph 3, above).
-
Non-markup behavior: In the absence of a phoneme element the synthesis processor must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary (which may be language-dependent) and applying rules to determine other pronunciations. Synthesis processors are designed to perform text-to-phoneme conversions so most words of most documents can be handled automatically. As an alternative to relying upon the processor, authors may choose to perform some conversions themselves prior to encoding in SSML. Written words with indeterminate or ambiguous pronunciations could be replaced by words with an unambiguous pronunciation; for example, in the case of "read", "I will reed the book". Authors should be aware, however, that the resulting SSML document may not be optimal for visual display.
-
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
-
Markup support: The emphasis element, break element and prosody element may all be used by document creators to guide the synthesis processor in generating appropriate prosodic features in the speech output.
-
Non-markup behavior: In the absence of these elements, synthesis processors are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the break and prosody elements mentioned above operate at a later point in the process and thus must coexist both with uses of the emphasis element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to casually or arbitrarily mix these two levels of control.
-
Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step so there may be considerable processor-specific variation.
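Informative: the following example draws together several of the markup hooks mentioned in the steps above (p and s for structure, say-as and sub for text normalization, phoneme for text-to-phoneme conversion, and emphasis and break for prosody). The interpret-as value "date" is purely illustrative, since this specification does not enumerate say-as values, and the IPA string is only an approximation.

<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- Structure: a paragraph containing two sentences -->
    <s>Your appointment is on
       <say-as interpret-as="date">2/1/2000</say-as>.</s>
    <s>It was arranged by the
       <sub alias="World Wide Web Consortium">W3C</sub>,
       <emphasis>not</emphasis> by
       <phoneme alphabet="ipa" ph="kiːz">Caius</phoneme> College.
       <break/> Please arrive on time.</s>
  </p>
</speak>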
1.3 Document Generation, Applications and Contexts
There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.
-
The document creator has no access to information to mark up the text. All processing steps in the synthesis processor must be performed fully automatically on raw text. The document requires only the containing speak element to indicate the content is to be spoken.
-
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
-
Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
-
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level speech synthesis markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.
The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.
-
Dialog language: It is a requirement that it should be possible to include documents marked with SSML into the dialog description document to be produced by the Voice Browser Working Group.
-
Interoperability with aural CSS (ACSS): Any HTML processor that is aural CSS-enabled can produce SSML. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification [CSS2 §19]. This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
-
Application-specific style sheet processing: As mentioned above, there are classes of applications that have knowledge of text content to be spoken, and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to SSML. This is equivalent to the use of ACSS with HTML and once again SSML is the resulting representation to be passed to the synthesis processor. In this context, SSML may be viewed as a superset of ACSS [CSS2 §19] capabilities, excepting spatial audio.
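Informative: the following XSLT 1.0 stylesheet sketches the style-sheet approach described above by transforming a hypothetical email document into SSML. The source vocabulary (an email root element with received-date and body children) and the interpret-as value "date" are assumptions made for this illustration only; they are not defined by this specification.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="https://www.w3.org/2001/10/synthesis">
  <!-- Wrap the whole message in a root speak element -->
  <xsl:template match="/email">
    <speak version="1.0" xml:lang="en-US">
      <p>
        <s>Message received on
           <say-as interpret-as="date">
             <xsl:value-of select="received-date"/>
           </say-as>.</s>
      </p>
      <p>
        <xsl:value-of select="body"/>
      </p>
    </speak>
  </xsl:template>
</xsl:stylesheet>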
1.4 Platform-Dependent Output Behavior of SSML Content
SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.
Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
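Informative: one way the situation above can arise is with nested duration requests on the prosody element (see Section 3.2.4). In the sketch below the author asks for the whole sentence in three seconds but for a subset of it in five seconds; since both requests cannot be honored, the processor may adjust either or both durations.

<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Outer request: roughly three seconds for the whole sentence -->
  <prosody duration="3s">
    This whole sentence should take about three seconds,
    <!-- Inner request: five seconds for a subset, which cannot fit -->
    <prosody duration="5s">even though this part alone asks for five.</prosody>
  </prosody>
</speak>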
1.5 Terminology
Requirements terms- The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
At user option- A conforming synthesis processor may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.
Error- Results are undefined. A conforming synthesis processor may detect and report an error and may recover from it.
Fatal error- An error which a conforming synthesis processor must detect and report to the host environment. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to create audio or other output).
Media Type- A media type (defined in [RFC2045] and [RFC2046]) specifies the nature of a linked
resource. Media types are case insensitive. A list of registered
media types is available for download [TYPES].
See Appendix C for information on media types for SSML.
Speech Synthesis- The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.
Synthesis Processor- A Text-To-Speech system that accepts SSML documents as input and renders them as spoken output.
Text-To-Speech- The process of automatic generation of speech output from text or annotated text input.
URI: Uniform Resource Identifier- A URI is a unifying syntax for the expression of names and
addresses of objects on the network as used in the World Wide Web.
A URI is defined as any legal
anyURI
primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. For informational purposes only, [RFC2396] and [RFC2732] may be useful in understanding the structure, format, and use of URIs. Any relative URI reference must be resolved according to the rules given in Section 3.1.3.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements.
Voice Browser- A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
2. SSML Documents
2.1 Document Form
A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8]. If present, the optional DOCTYPE must read as follows:
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd">
The XML prolog is followed by the root speak element. See Section 3.1.1 for details on this element.
The speak element must
designate the SSML namespace. This can be achieved by declaring an
xmlns
attribute or an attribute with an
"xmlns" prefix. See [XMLNS §2]
for details. Note that when the xmlns
attribute is used alone, it sets the default namespace for the
element on which it appears and for any child elements. The
namespace for SSML is defined to be https://www.w3.org/2001/10/synthesis.
It is recommended that the speak element also indicate the location of the SSML
schema (see Appendix D) via the
xsi:schemaLocation
attribute from [SCHEMA1 §2.6.3]. Although such
indication is not required, to encourage it this document provides
such indication on all of the examples.
The following are two examples of legal SSML headers:
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US">
<?xml version="1.0"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
The meta, metadata and lexicon elements must occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
2.2. Conformance
2.2.1 Conforming Speech Synthesis Markup Language Fragments
A document fragment is a Conforming Speech Synthesis Markup Language Fragment if it conforms to the criteria for Conforming Stand-Alone Speech Synthesis Markup Language Documents after the following transformations:
- with the exception of xml:lang and xml:base, all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document,
- an appropriate XML declaration (i.e., <?xml...?>) is included at the top of the document,
- and, if the speak element does not already designate the synthesis namespace using the xmlns attribute, then xmlns="https://www.w3.org/2001/10/synthesis" is added to the element.
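Informative: the following sketch applies these criteria to a fragment that carries an attribute from a hypothetical non-synthesis namespace (the "ex" prefix and its URI are invented for this illustration). The fragment, as it might appear embedded in another language:

<speak version="1.0" xml:lang="en-US"
       xmlns:ex="https://example.com/extensions" ex:priority="high">
  Hello world.
</speak>

After removing the foreign attribute and its xmlns declaration, adding an XML declaration, and adding the synthesis namespace to speak, the result is a Conforming Stand-Alone Speech Synthesis Markup Language Document:

<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Hello world.
</speak>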
2.2.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents
A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if it meets both the following conditions:
- It is a well-formed XML document [XML §2.1] conforming to Namespaces in XML [XMLNS].
- It is a valid XML document [XML §2.8] which adheres to the specification described in this document (Speech Synthesis Markup Language Specification) including the constraints expressed in the Schema (see Appendix D) and having an XML Prolog and speak root element as specified in Section 2.1.
The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
2.2.3 Using SSML with other Namespaces
The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.
2.2.4 Conforming Speech Synthesis Markup Language Processors
A Speech Synthesis Markup Language processor is a program that can parse and process Conforming Stand-Alone Speech Synthesis Markup Language documents.
In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [XML] and Namespaces in XML [XMLNS]. This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is optional to apply or expand external entity references defined in an external DTD.
A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.
A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of natural (human) languages:
- A Conforming Speech Synthesis Markup Language Processor is required to parse all legal natural language declarations successfully.
- A Conforming Speech Synthesis Markup Language Processor may be able to apply the semantics of markup languages which refer to more than one natural language. When a processor is able to support each natural language in the set but is unable to handle them concurrently it should inform the hosting environment. When the set includes one or more natural languages that are not supported by the processor it should inform the hosting environment.
- A Conforming Speech Synthesis Markup Language Processor may implement natural languages by approximate substitutions according to a documented, processor-specific behavior. For example, using a US English synthesis processor to process British English input.
When a Conforming Speech Synthesis Markup Language Processor
encounters elements or attributes, other than xml:lang
and xml:base
, in a non-synthesis namespace it
may:
- ignore the non-standard elements and/or attributes
- or, process the non-standard elements and/or attributes
- or, reject the document containing those elements and/or attributes
There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.
2.2.5 Conforming User Agent
A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent must support at least one natural language.
Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.
2.3 Integration With Other Markup Languages
2.3.1 SMIL
The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text-editor. See the SMIL/SSML integration examples in Appendix F.
2.3.2 ACSS
Aural Cascading Style Sheets [CSS2 §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
2.3.3 VoiceXML
The Voice Extensible Markup Language [VXML] enables Web-based development and content-delivery for interactive voice response applications (see voice browser ). VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see Appendix F.
2.4 SSML Document Fetching
The fetching and caching behavior of SSML documents is defined by the environment in which the synthesis processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.
3. Elements and Attributes
The following elements and attributes are defined in this specification.
- 3.1 Document Structure, Text Processing and Pronunciation
- 3.1.1 "speak" Root Element
- 3.1.2 "xml:lang" Language Attribute
- 3.1.3 "xml:base" Attribute
- 3.1.4 "lexicon" Element
- 3.1.5 "meta" Element
- 3.1.6 "metadata" Element
- 3.1.7 "p" and "s"
- 3.1.8 "say-as" Element
- 3.1.9 "phoneme" Element
- 3.1.10 "sub" Element
- 3.2 Prosody and Style
- 3.2.1 "voice" Element
- 3.2.2 "emphasis" Element
- 3.2.3 "break" Element
- 3.2.4 "prosody" Element
- 3.3 Other Elements
- 3.3.1 "audio" Element
- 3.3.2 "mark" Element
- 3.3.3 "desc" Element
3.1 Document Structure, Text Processing and Pronunciation
3.1.1 speak Root Element
The Speech Synthesis Markup Language is an XML application. The
root element is speak.
xml:lang
is a required attribute specifying the
language of the root document. xml:base
is an optional
attribute specifying the Base URI of the
root document. The version
attribute is a
required attribute that indicates the version of the specification
to be used for the document and must have the value "1.0".
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> ... the body ... </speak>
The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, lexicon, mark, meta, metadata, p, phoneme, prosody, say-as, sub, s, voice.
3.1.2 xml:lang
Attribute: Language
The
xml:lang
attribute, as defined by XML 1.0 [XML §2.12], can be used in SSML to
indicate the natural language of the enclosing element and its
attributes and subelements. RFC 3066 [RFC3066] may be of some use in understanding how
to use this attribute.
Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
xml:lang
is a defined attribute for the voice, speak, p, and
s elements. For vocal
rendering, a language change can have an effect on various other
parameters (including gender, speed, age, pitch, etc.) which may be
disruptive to the listener. There might even be unnatural breaks
between language shifts. For this reason authors are encouraged to
use the voice element to
change the language. xml:lang
is permitted on
p and s only because it is common to
change the language at those levels.
Although this attribute is also permitted on the desc element, none of the voice-change behavior described in this section applies when used with that element.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p>I don't speak Japanese.</p> <p xml:lang="ja">日本語ãŒåˆ†ã‹ã‚Šã¾ã›ã‚“。</p> </speak>
In the case that a document requires speech output in a language
not supported by the processor, the synthesis processor largely determines
behavior. Specifying xml:lang
does not imply a
change in voice, though this may indeed occur. When a given voice
is unable to speak content in the indicated language, a new voice
may be selected by the processor. No change in the voice or prosody
should occur if the xml:lang
value is the same as
the inherited value. Further information about voice selection
appears in Section 3.2.1.
There may be variation across conforming processors in the
implementation of xml:lang
voice changes for different markup
elements (e.g. p and
s elements).
All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis, break, p and s elements should each be rendered in a manner that is appropriate to the current language.
The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <s>Today, 2/1/2000.</s> <!-- Today, February first two thousand --> <s xml:lang="it">Un mese fà , 2/1/2000.</s> <!-- Un mese fà , il due gennaio duemila --> <!-- One month ago, the second of January two thousand --> </speak>
3.1.3 Base URI
Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.1.3.1 for details on the resolution of relative URIs.
The base URI declaration is permitted but optional. The two elements affected by it are audio and lexicon.
The xml:base attribute
The base URI declaration follows
[XML-BASE] and is indicated by an
xml:base
attribute on the root speak element.
<?xml version="1.0"?> <speak version="1.0" xml:lang="en-US" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:base="https://www.example.com/base-file-path">
<?xml version="1.0"?> <speak version="1.0" xml:lang="en-US" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:base="https://www.example.com/another-base-file-path">
3.1.3.1 Resolving Relative URIs
User agents must calculate the base URI for resolving relative URIs according to [RFC2396]. The following describes how RFC 2396 applies to synthesis documents.
User agents must calculate the base URI according to the following precedences (highest priority to lowest):
- The base URI is set by the xml:base attribute on the speak element (see Section 3.1.3).
- The base URI is given by metadata discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
- By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). It is an error if such documents contain relative URIs.
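Informative: the following sketch, using URIs invented for this illustration, shows a relative URI on the lexicon element (Section 3.1.4) being resolved against an explicit xml:base. Under the rules of [RFC2396], the reference "lexicon.file" below resolves to "https://www.example.com/lexicons/lexicon.file".

<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"
       xml:base="https://www.example.com/lexicons/">
  <!-- Relative URI, resolved against the xml:base of the speak element -->
  <lexicon uri="lexicon.file"/>
  Hello world.
</speak>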
3.1.4 Pronunciation Lexicon
An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type. No standard lexicon media type has yet been defined as the default for this specification.
The W3C Voice Browser Working Group is developing the Pronunciation Lexicon Markup Language [LEX]. The specification will address the matching process between tokens and lexicon entries and the mechanism by which a synthesis processor handles multiple pronunciations from internal and synthesis-specified lexicons. Pronunciation handling with proprietary lexicon formats will necessarily be specific to the synthesis processor.
A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken. The pronunciation information contained within a lexicon is used for tokens appearing within the referencing document.
Pronunciation lexicons are necessarily language-specific. Pronunciation lookup in a lexicon and pronunciation inference for any token may use an algorithm that is language-specific. As mentioned in Section 1.2, the definition of what constitutes a "token" may itself be language-specific.
When multiple lexicons are referenced, their precedence goes from lower to higher with document order. Precedence means that a token is first looked up in the lexicon with highest precedence. Only if not found in that lexicon, the next lexicon is searched and so on until a first match or until all lexicons have been used for lookup.
The lexicon element
Any number of lexicon
elements may occur as immediate children of the speak element. The lexicon element must have a
uri
attribute specifying a URI that identifies the location of the
pronunciation lexicon document.
The lexicon element may
have a type
attribute that specifies the
media type of the pronunciation
lexicon document.
<?xml version="1.0"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <lexicon uri="https://www.example.com/lexicon.file"/> <lexicon uri="https://www.example.com/strange-words.file" type="media-type"/> ... </speak>
Details of the type attribute
Note: the description and table that follow use an imaginary
vendor-specific lexicon type of x-vnd.example.lexicon
.
This is intended to represent whatever format is
returned/available, as appropriate.
A lexicon resource indicated by a URI
reference may be available in one or more media types. The SSML author can specify the
preferred media type via the type
attribute. When the content represented by a URI is available in
many data formats, a synthesis
processor may use the preferred type to influence which of the
multiple formats is used. For instance, on a server implementing
HTTP content negotiation, the processor may use the type to order
the preferences in the negotiation.
Upon delivery, the resource indicated by a URI reference may be
considered in terms of two types. The declared media type is
the alleged value for the resource and the actual media type is the
true format of its content. The actual type should be the
same as the declared type, but this is not always the case (e.g. a
misconfigured HTTP server might return text/plain
for
a document following the vendor-specific
x-vnd.example.lexicon
format). A specific URI scheme
may require that the resource owner always, sometimes, or never
return a media type. Whenever a type is returned, it is treated as
authoritative. The declared media type is determined by the value
returned by the resource owner or, if none is returned, by the
preferred media type given in the SSML document.
Three special cases may arise. The declared type may not be supported by the processor; this is an error. The declared type may be supported but the actual type may not match; this is also an error. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the synthesis processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616 §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:
|  | HTTP 1.1 request | HTTP 1.1 request | HTTP 1.1 request | Local file access |
| --- | --- | --- | --- | --- |
| Media type returned by the resource owner | text/plain | x-vnd.example.lexicon | <none> | <none> |
| Preferred media type from the SSML document | Not applicable; the returned type is authoritative | Not applicable; the returned type is authoritative | x-vnd.example.lexicon | <none> |
| Declared media type | text/plain | x-vnd.example.lexicon | x-vnd.example.lexicon | <none> |
| Behavior for an actual media type of x-vnd.example.lexicon | The document must be processed as text/plain. This will generate an error if text/plain is not supported or if the document does not follow the expected format. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error. | Scheme specific; the synthesis processor might introspect the document to determine the type. |
The lexicon element is an empty element.
3.1.5 meta
The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.
A meta declaration
associates a string to a declared meta property or declares
"http-equiv" content. Either a name
or
http-equiv
attribute is required. It is an
error to provide both name
and http-equiv
attributes. A content
attribute is
required. The seeAlso
property is the only
defined meta property name.
It is used to specify a resource that might provide additional
metadata information about the content. This property is modelled
on the
rdfs:seeAlso
property of Resource Description Framework
(RDF) Schema Specification 1.0 [RDF-SCHEMA §2.3.4]. The
http-equiv
attribute has a special
significance when documents are retrieved via HTTP. Although the
preferred method of providing HTTP header information is by using
HTTP header fields, the "http-equiv" content may be used in
situations where the SSML document author is unable to configure
HTTP header fields associated with their document on the origin
server, for example, cache control information. Note that, as with
meta
in HTML documents [HTML], HTTP servers and caches are not required to
introspect the contents of meta in SSML documents and thereby override the header
values they would send otherwise.
Informative: This is an example of how meta elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.
<?xml version="1.0"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <meta name="seeAlso" content="https://example.com/my-ssml-metadata.xml"/> <meta http-equiv="Cache-Control" content="no-cache"/> </speak>
The meta element is an empty element.
3.1.6 metadata
The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is recommended that the Resource Description Framework (RDF) schema [RDF-SCHEMA] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].
RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.).
Document properties declared with the metadata element can use any metadata schema.
Informative: This is an example of how metadata can be included in an SSML document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:
<?xml version="1.0"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <metadata> <rdf:RDF xmlns:rdf = "https://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs = "https://www.w3.org/TR/1999/PR-rdf-schema-19990303#" xmlns:dc = "https://purl.org/metadata/dublin_core#"> <!-- Metadata about the synthesis document --> <rdf:Description about="https://www.example.com/meta.ssml" dc:Title="Hamlet-like Soliloquy" dc:Description="Aldine's Soliloquy in the style of Hamlet" dc:Publisher="W3C" dc:Language="en-US" dc:Date="2002-11-29" dc:Rights="Copyright 2002 Aldine Turnbet" dc:Format="application/ssml+xml" > <dc:Creator> <rdf:Seq ID="CreatorsAlphabeticalBySurname"> <rdf:li>William Shakespeare</rdf:li> <rdf:li>Aldine Turnbet</rdf:li> </rdf:Seq> </dc:Creator> </rdf:Description> </rdf:RDF> </metadata> </speak>
The metadata element can have arbitrary content, although none of the content will be rendered by the synthesis processor.
3.1.7 p and s: Text Structure Elements
A p element represents a paragraph. An s element represents a sentence.
xml:lang
is a defined attribute on the p and s elements.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p> <s>This is the first sentence of the paragraph.</s> <s>Here's another sentence.</s> </p> </speak>
The use of p and s elements is optional. Where text occurs without an enclosing p or s element the synthesis processor should attempt to determine the structure using language-specific knowledge of the format of plain text.
The p element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, s, voice.
The s element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
3.1.8 say-as Element
The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.
Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.
The say-as element has
three attributes: interpret-as
,
format
, and detail
. The interpret-as
attribute is always required; the other two attributes are
optional. The legal values for the format
attribute depend on the value of the interpret-as
attribute.
The say-as element can only contain text to be rendered.
The interpret-as
and
format
attributes
The interpret-as
attribute indicates
the content type of the contained text construct. Specifying the
content type helps the synthesis
processor to distinguish and interpret text constructs that may
be rendered in different ways depending on what type of information
is intended. In addition, the optional format
attribute can give further hints on the precise
formatting of the contained text for content types that may have
ambiguous formats.
When specified, the interpret-as
and
format
values are to be interpreted by the
synthesis processor as hints provided
by the markup document author to aid text normalization and pronunciation.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. A synthesis processor should be able to support the common, orthographic forms of the specified language for every content type that it supports.
When the value for the interpret-as
attribute is unknown or unsupported by a processor, it must render
the contained text as if no interpret-as
value were specified.
When the value for the format
attribute
is unknown or unsupported by a processor, it must render the
contained text as if no format
value were
specified, and should render it using the interpret-as
value that is specified.
When the content of the say-as element contains additional text next to the
content that is in the indicated format
and interpret-as
type, then this
additional text MUST be rendered. The processor may make the
rendering of the additional text dependent on the interpret-as
type of the element in which it
appears.
When the content of the say-as element contains no content in the indicated
interpret-as
type or format
, the processor MUST render the content either
as if the format
attribute were not
present, or as if the interpret-as
attribute were not present, or as if neither the format
nor interpret-as
attributes were present. The processor SHOULD also notify the
environment of the mismatch.
Indicating the content type or format does not necessarily affect the way the information is pronounced. A synthesis processor should pronounce the contained text in a manner in which such content is normally produced for the language.
The detail
attribute
The detail
attribute is an optional
attribute that indicates the level of detail to be read aloud or
rendered. Every value of the detail
attribute must render all of the informational content in the
contained text; however, specific values for the detail
attribute can be used to render content that is
not usually informational in running text but may be important to
render for specific purposes. For example, a synthesis processor will usually render
punctuations through appropriate changes in prosody. Setting a
higher level of detail may be used to speak punctuations
explicitly, e.g. for reading out coded part numbers or pieces of
software code.
The detail
attribute can be used for
all interpret-as
types.
If the detail
attribute is not
specified, the level of detail that is produced by the synthesis processor depends on the text
content and the language.
When the value for the detail
attribute
is unknown or unsupported by a processor, it must render the
contained text as if no value were specified for the detail
attribute.
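Informative: the following sketch shows the say-as attributes together. The values "date", "mdy", "characters" and "punctuation" are illustrative only; as noted above, this specification does not define say-as attribute values, and a processor that does not recognize them must fall back as described in this section.

<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Content type hint plus a format hint for month-day-year ordering -->
  Your appointment is on <say-as interpret-as="date" format="mdy">2/1/2000</say-as>.
  <!-- A higher level of detail might cause punctuation to be spoken explicitly -->
  The part number is <say-as interpret-as="characters" detail="punctuation">ZB-27#A</say-as>.
</speak>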
3.1.9 phoneme Element
The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
The ph
attribute is a required
attribute that specifies the phoneme/phone string.
This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon (see Section 3.1.4), while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.
The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" (see the next paragraph) and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [JEITA] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-2000" for their phoneme alphabet [JEIDAALPHABET].
Synthesis processors should support a value for alphabet of "ipa", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [IPA]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal ph values are strings of the values specified in Appendix 2 of [IPAHNDBK]. Informative tables of the IPA-to-Unicode mappings can be found at [IPAUNICODE1] and [IPAUNICODE2]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,
- The processor must syntactically accept all legal ph values.
- The processor should produce output when given Unicode IPA codes that can reasonably be considered to belong to the current language.
- The production of output when given other codes is entirely at processor discretion.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme> <!-- This is an example of IPA using character entities --> <!-- Because many platform/browser/text editor combinations do not correctly cut and paste Unicode text, this example uses the entity escape versions of the IPA characters. Normally, one would directly use the UTF-8 representation of these symbols: "tÉ™mei̥ɾouÌ¥". --> </speak>
It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.
The phoneme element itself can only contain text (no elements).
3.1.10 sub Element
The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string. The processor should apply text normalization to the alias value.
The sub element can only contain text (no elements).
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <sub alias="carview.php?tsp=World Wide Web Consortium">W3C</sub> <!-- World Wide Web Consortium --> </speak>
3.2 Prosody and Style
3.2.1 voice Element
The voice element is a production element that requests a change in speaking voice. Attributes are:
- xml:lang: optional language specification attribute.
- gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
- age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. Acceptable values are of type xsd:nonNegativeInteger [SCHEMA2 §3.3.20].
- variant: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice). Valid values of variant are of type xsd:positiveInteger [SCHEMA2 §3.3.25].
- name: optional attribute indicating a processor-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. As a result, a name must not contain any white space.
Although each attribute individually is optional, at least one must be specified any time the voice element is used.
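Because the name attribute may hold a space-separated preference list, a minimal sketch of that usage follows (the voice names are hypothetical, processor-specific identifiers):
<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="https://www.w3.org/2001/10/synthesis
                 https://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Try the hypothetical voice "Mike" first; fall back to "Paul". -->
  <voice name="Mike Paul" gender="male">Thank you for calling.</voice>
</speak>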
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <voice gender="female">Mary had a little lamb,</voice> <!-- now request a different female child's voice --> <voice gender="female" variant="2"> Its fleece was white as snow. </voice> <!-- processor-specific voice selection --> <voice name="Mike">I want to be like Mike.</voice> </speak>
The voice element is commonly used to change the language. When there is not a voice available that exactly matches the attributes specified in the document, or there are multiple voices that match the criteria, the following voice selection algorithm must be used. There are cases in the algorithm that are ambiguous; in such cases voice selection may be processor-specific. Approximately speaking, the xml:lang attribute has the highest priority and all other attributes are equal in priority but below xml:lang. The complete algorithm is:
- If a voice is available for a requested xml:lang, a synthesis processor must use it. If there are multiple such voices available, the processor should use the voice that best matches the specified values for name, variant, gender and age.
- If there is no voice available for the requested xml:lang, the processor should select a voice that is closest to the requested language (e.g. a variant or dialect of the same language). If there are multiple such voices available, the processor should use a voice that best matches the specified values for name, variant, gender and age.
- It is an error if the processor decides it does not have a voice that sufficiently matches the above criteria.
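As a hedged illustration of this priority order (actual voice availability is of course processor-dependent):
<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="https://www.w3.org/2001/10/synthesis
                 https://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- xml:lang has the highest priority: a processor that has only a
       female fr-FR voice installed would still select it here, even
       though a male voice was requested. -->
  <voice xml:lang="fr-FR" gender="male">Bonjour tout le monde.</voice>
</speak>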
Note that simple cases of foreign-text embedding (where a voice change is not needed or undesirable) can be done. See Appendix F for examples.
voice attributes are inherited down the tree including to within elements that change the language.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <voice gender="female"> Any female voice here. <voice age="6"> A female child voice here. <p xml:lang="ja"> <!-- A female child voice in Japanese. --> </p> </voice> </voice> </speak>
Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.
The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.
The voice element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
3.2.2 emphasis Element
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
- level: the optional level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the synthesis processor from emphasizing words that it might typically emphasize. The values "none", "moderate", and "strong" are monotonically non-decreasing in strength.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> That is a <emphasis> big </emphasis> car! That is a <emphasis level="strong"> huge </emphasis> bank account! </speak>
The emphasis element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.
3.2.3 break Element
The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present between words, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:
- strength: the strength attribute is an optional attribute having one of the following values: "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong". This attribute is used to indicate the strength of the prosodic break in the speech output. The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between words. The stronger boundaries are typically accompanied by pauses. "x-weak" and "x-strong" are mnemonics for "extra weak" and "extra strong", respectively.
- time: the time attribute is an optional attribute indicating the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
The strength attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The synthesis processor may insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the time attribute.
If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied.
If both strength and time attributes are supplied, the processor will insert a break with a duration as specified by the time attribute, with other prosodic changes in the output based on the value of the strength attribute.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Take a deep breath <break/> then continue. Press 1 or wait for the tone. <break time="3s"/> I didn't hear you! <break strength="weak"/> Please repeat. </speak>
3.2.4 prosody Element
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:
- pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
- contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
- range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
- rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
- duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheets Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
- volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.
Although each attribute individually is optional, at least one must be specified any time the prosody element is used. The "x-foo" attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.
Number
A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.
Relative values
Relative changes for the attributes above can be specified
- as a percentage (a number optionally preceded by "+" or "-" and followed by "%"), e.g. "3%", "+15.2%", "-8.0%", or
- as a relative number:
  - For the rate attribute, relative changes are a number.
  - For the volume attribute, relative changes are a number preceded by "+" or "-", e.g. "+10", "-5.5".
  - For the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> The price of XYZ is <prosody rate="-10%">$45</prosody> </speak>
Pitch contour
The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <prosody contour="(0%,+20Hz)(10%,+30%)(40%,+10Hz)"> good morning </prosody> </speak>
The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.
The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
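A short sketch of these precedence rules: in the first element the duration value overrides the rate value, and in the second the contour overrides the pitch label.
<?xml version="1.0"?>
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis"
       xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="https://www.w3.org/2001/10/synthesis
                 https://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- duration takes precedence: the contents are fit to roughly five
       seconds regardless of the requested rate. -->
  <prosody duration="5s" rate="x-fast">
    Please listen carefully to the following options.
  </prosody>
  <!-- contour takes precedence over the pitch label. -->
  <prosody contour="(0%,+10Hz)(100%,-10Hz)" pitch="x-high">
    Thank you for calling.
  </prosody>
</speak>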
The prosody element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
Limitations
All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.
In some cases, synthesis processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
3.3 Other Elements
3.3.1 audio Element
The audio element supports the insertion of recorded audio files (see Appendix A for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element). The required attribute is src, which is the URI of a document with an appropriate MIME type.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <!-- Empty element --> Please say your name after the tone. <audio src="beep.wav"/> <!-- Container element with alternative text --> <audio src="prompt.au">What city do you want to fly from?</audio> <audio src="welcome.wav"> <emphasis>Welcome</emphasis> to the Voice Portal. </audio> </speak>
An audio element is successfully rendered:
- If the referenced audio source is played, or
- If the synthesis processor is unable to execute #1 but the alternative content is successfully rendered, or
- If the processor can detect that text-only output is required and the alternative content is successfully rendered.
Deciding which conditions result in the alternative content being rendered is processor-dependent. If the audio element is not successfully rendered, a synthesis processor should continue processing and should notify the hosting environment. The processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.
The audio element can only contain text to be rendered and the following elements: audio, break, desc, emphasis, mark, p, phoneme, prosody, say-as, sub, s, voice.
3.3.2 mark Element
A mark element is an empty element that places a marker into the text/tag sequence. It has one required attribute, name, which is of type xsd:token [SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor must do one or both of the following:
- inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
- when audio output of the SSML document reaches the mark, issue an event that includes the required name attribute of the element. The hosting environment defines the destination of the event.
The mark element does not affect the speech output process.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Go from <mark name="here"/> here, to <mark name="there"/> there! </speak>
3.3.3 desc Element
The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of xml:lang in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output.
<?xml version="1.0"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <!-- Normal use of <desc> --> Heads of State often make mistakes when speaking in a foreign language. One of the most well-known examples is that of John F. Kennedy: <audio src="ichbineinberliner.wav">If you could hear it, this would be a recording of John F. Kennedy speaking in Berlin. <desc>Kennedy's famous German language gaffe</desc> </audio> <!-- Suggesting the language of the recording --> <!-- Although there is no requirement that a recording be in the current language (since it might even be non-speech such as music), an author might wish to suggest the language of the recording by marking the entire <audio> element using <voice>. In this case, the xml:lang attribute on <desc> can be used to put the description back into the original language. --> Here's the same thing again but with a different fallback: <voice xml:lang="de-DE"> <audio src="ichbineinberliner.wav">Ich bin ein Berliner. <desc xml:lang="en-US">Kennedy's famous German language gaffe</desc> </audio> </voice> </speak>
The desc element can only contain descriptive text.
4. References
4.1 Normative References
- [CSS2]
- Cascading Style Sheets, level 2: CSS2 Specification , B. Bos, et al., Editors. World Wide Web Consortium, 12 May 1998. This version of the CSS2 Recommendation is https://www.w3.org/TR/1998/REC-CSS2-19980512/. The latest version of CSS2 is available at https://www.w3.org/TR/REC-CSS2/.
- [IPAHNDBK]
- Handbook of the International Phonetic Association , International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at https://www.arts.gla.ac.uk/ipa/ipa.html#Handbook_of_the_IPA.
- [RFC1521]
- MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies , N. Borenstein and N. Freed, Editors. IETF, September 1993. This RFC is available at https://www.ietf.org/rfc/rfc1521.txt.
- [RFC2045]
- Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. , N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at https://www.ietf.org/rfc/rfc2045.txt.
- [RFC2046]
- Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types , N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at https://www.ietf.org/rfc/rfc2046.txt.
- [RFC2119]
- Key words for use in RFCs to Indicate Requirement Levels , S. Bradner, Editor. IETF, March 1997. This RFC is available at https://www.ietf.org/rfc/rfc2119.txt.
- [RFC2396]
- Uniform Resource Identifiers (URI): Generic Syntax , T. Berners-Lee et al., Editors. IETF, August 1998. This RFC is available at https://www.ietf.org/rfc/rfc2396.txt.
- [RFC3066]
- Tags for the Identification of Languages , H. Alvestrand, Editor. IETF, January 2001. This RFC is available at https://www.ietf.org/rfc/rfc3066.txt.
- [SCHEMA1]
- XML Schema Part 1: Structures , H. S. Thompson, et al., Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 1 Recommendation is https://www.w3.org/TR/2001/REC-xmlschema-1-20010502/. The latest version of XML Schema 1 is available at https://www.w3.org/TR/xmlschema-1/.
- [SCHEMA2]
- XML Schema Part 2: Datatypes , P.V. Biron and A. Malhotra, Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 2 Recommendation is https://www.w3.org/TR/2001/REC-xmlschema-2-20010502/. The latest version of XML Schema 2 is available at https://www.w3.org/TR/xmlschema-2/.
- [TYPES]
- MIME Media types . IANA. This continually-updated list of media types registered with IANA is available at https://www.iana.org/assignments/media-types/index.html.
- [XML]
- Extensible Markup Language (XML) 1.0 (Second Edition) , T. Bray et al., Editors. World Wide Web Consortium, 6 October 2000. This version of the XML 1.0 Recommendation is https://www.w3.org/TR/2000/REC-xml-20001006. The latest version of XML 1.0 is available at https://www.w3.org/TR/REC-xml.
- [XML-BASE]
- XML Base , J. Marsh, Editor. World Wide Web Consortium, 27 June 2001. This version of the XML Base Recommendation is https://www.w3.org/TR/2001/REC-xmlbase-20010627/. The latest version of XML Base is available at https://www.w3.org/TR/xmlbase/.
- [XMLNS]
- Namespaces in XML , T. Bray et al., Editors. World Wide Web Consortium, 14 January 1999. This version of the XML Namespaces Recommendation is https://www.w3.org/TR/1999/REC-xml-names-19990114/. The latest version of XML Namespaces is available at https://www.w3.org/TR/REC-xml-names/.
4.2 Informative References
- [DC]
- Dublin Core Metadata Initiative. See https://dublincore.org/
- [HTML]
- HTML 4.01 Specification , D. Raggett et al., Editors. World Wide Web Consortium, 24 December 1999. This version of the HTML 4 Recommendation is https://www.w3.org/TR/1999/REC-html401-19991224/. The latest version of HTML 4 is available at https://www.w3.org/TR/html4/.
- [IPA]
- International Phonetic Association . See https://www.arts.gla.ac.uk/ipa/ipa.html for the organization's website.
- [IPAUNICODE1]
- The International Phonetic Alphabet , J. Esling. This table of IPA characters in Unicode is available at https://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm.
- [IPAUNICODE2]
- The International Phonetic Alphabet in Unicode , J. Wells. This table of Unicode values for IPA characters is available at https://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.
- [JEIDAALPHABET]
- JEIDA-62-2000 Phoneme Alphabet . JEITA. An abstract of this document (in Japanese) is available at https://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf.
- [JEITA]
- Japan Electronics and Information Technology Industries Association . See https://www.jeita.or.jp/.
- [JSML]
- JSpeech Markup Language , A. Hunt, Editor. World Wide Web Consortium, 5 June 2000. Copyright ©2000 Sun Microsystems, Inc. This version of the JSML submission is https://www.w3.org/TR/2000/NOTE-jsml-20000605/. The latest W3C Note of JSML is available at https://www.w3.org/TR/jsml/.
- [LEX]
- Pronunciation Lexicon Markup Requirements , F. Scahill, Editor. World Wide Web Consortium, 12 March 2001. This document is a work in progress. This version of the Lexicon Requirements is https://www.w3.org/TR/2001/WD-lexicon-reqs-20010312/. The latest version of the Lexicon Requirements is available at https://www.w3.org/TR/lexicon-reqs/.
- [RDF-SYNTAX]
- Resource Description Framework (RDF) Model and Syntax Specification , O. Lassila and R. Swick, Editors. World Wide Web Consortium, 22 February 1999. This version of the RDF Syntax Recommendation is https://www.w3.org/TR/1999/REC-rdf-syntax-19990222/. The latest version of RDF Syntax is available at https://www.w3.org/TR/REC-rdf-syntax/.
- [RDF-SCHEMA]
- Resource Description Framework (RDF) Model and Syntax Specification , D. Brickley and R. Guha, Editors. World Wide Web Consortium, 27 March 2000. This document is a work in progress. This version of the RDF Schema Candidate Recommendation is https://www.w3.org/TR/2000/CR-rdf-schema-20000327/. The latest version of RDF Schema is available at https://www.w3.org/TR/rdf-schema/.
- [REQS]
- Speech Synthesis Markup Requirements for Voice Markup Languages , A. Hunt, Editor. World Wide Web Consortium, 23 December 1999. This document is a work in progress. This version of the Synthesis Requirements is https://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/. The latest version of the Synthesis Requirements is available at https://www.w3.org/TR/voice-tts-reqs/.
- [RFC2616]
- Hypertext Transfer Protocol -- HTTP/1.1 , R. Fielding, et al., Editors. IETF, June 1999. This RFC is available at https://www.ietf.org/rfc/rfc2616.txt.
- [RFC2732]
- Format for Literal IPv6 Addresses in URL's , R. Hinden, et al., Editors. IETF, December 1999. This RFC is available at https://www.ietf.org/rfc/rfc2732.txt.
- [SABLE]
- "SABLE: A Standard for TTS Markup", Richard Sproat, et al. Proceedings of the International Conference on Spoken Language Processing, R. Mannell and J. Robert-Ribes, Editors. Causal Productions Pty Ltd (Adelaide), 1998. Vol. 5, pp. 1719-1722. Conference proceedings are available from the publisher at https://www.causalproductions.com/.
- [SMIL]
- Synchronized Multimedia Integration Language (SMIL 2.0) , J. Ayars, et al., Editors. World Wide Web Consortium, 7 August 2001. This version of the SMIL 2 Recommendation is https://www.w3.org/TR/2001/REC-smil20-20010807/. The latest version of SMIL2 is available at https://www.w3.org/TR/smil20/.
- [UNICODE]
- The Unicode Standard . The Unicode Consortium. Information about the Unicode Standard and its versions can be found at https://www.unicode.org/standard/standard.html.
- [VXML]
- Voice Extensible Markup Language (VoiceXML) Version 2.0 , S. McGlashan, et al., Editors. World Wide Web Consortium, 20 February 2003. This document is a work in progress. This version of the VoiceXML 2.0 Candidate Recommendation is https://www.w3.org/TR/2003/CR-voicexml20-20030220/. The latest version of VoiceXML 2 is available at https://www.w3.org/TR/voicexml20/.
5. Acknowledgments
This document was written with the participation of the following participants in the W3C Voice Browser Working Group (listed in alphabetical order):
- Paolo Baggia, Loquendo
- Dan Burnett, Nuance
- Dave Burke, VoxPilot
- Jerry Carter, Independent Consultant
- Sasha Caskey, IBM
- Brian Eberman, ScanSoft
- Andrew Hunt, ScanSoft
- Jim Larson, Intel
- Bruce Lucas, IBM
- Scott McGlashan, HP
- T.V. Raman, IBM
- Dave Raggett, W3C/Canon
- Laura Ricotti, Loquendo
- Richard Sproat, ATT
- Luc Van Tichelen, ScanSoft
- Mark Walker, Intel
- Kuansan Wang, Microsoft
- Dave Wood, Microsoft
Appendix A: Audio File Formats
This appendix is normative.
SSML requires that a platform support the playing of the audio formats specified below.
Audio Format | Media Type |
---|---|
Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel. (G.711) | audio/basic (from [RFC1521]) |
Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel. (G.711) | audio/x-alaw-basic |
WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel. | audio/wav |
WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel. | audio/wav |
The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for playing, the mu-law format must be used. For playback with the 'audio/basic' MIME type, processors must support the mu-law format and may support the 'au' format.
Appendix B: Internationalization
This appendix is normative.
SSML is an application of XML 1.0 [XML] and thus supports [UNICODE] which defines a standard universal character set.
SSML provides a mechanism for control of the spoken language via the use of the xml:lang attribute. Language changes can occur as frequently as per word, although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the lexicon and phoneme elements, features that can help to mitigate poor quality default lexicons for languages with only minimal commercial support today.
Appendix C: MIME Types and File Suffix
This appendix is normative.
The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The media type applied for is "application/ssml+xml".
The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where speak is the root element.
Appendix D: Schema for the Speech Synthesis Markup Language
This appendix is normative.
The synthesis schema is located at https://www.w3.org/TR/speech-synthesis/synthesis.xsd.
Note: the synthesis schema includes a no-namespace core schema, located at https://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments (Sec. 2.2.1) embedded in non-synthesis namespace schemas.
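The following is a minimal, informative sketch of how a host-language schema might reuse the no-namespace core schema through a "chameleon" xsd:include, whereby the included components take on the including schema's target namespace. The target namespace shown is hypothetical; real host languages (such as VoiceXML 2.0) define their own integration details.
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="https://www.w3.org/2001/XMLSchema"
            targetNamespace="https://example.org/host-language"
            xmlns="https://example.org/host-language"
            elementFormDefault="qualified">
  <!-- Chameleon include: the no-namespace synthesis core components
       are adopted into this schema's target namespace. -->
  <xsd:include
      schemaLocation="https://www.w3.org/TR/speech-synthesis/synthesis-core.xsd"/>
  <!-- Host-language declarations that reference the included SSML
       content models would follow here. -->
</xsd:schema>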
Appendix E: DTD for the Speech Synthesis Markup Language
This appendix is informative.
The SSML DTD is located at https://www.w3.org/TR/speech-synthesis/synthesis.dtd.
Due to DTD limitations, the SSML DTD does not correctly express that the metadata element can contain elements from other XML namespaces.
Appendix F: Example SSML
This appendix is informative.
The following is an example of reading headers of email messages. The p and s elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.
<?xml version="1.0"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p> <s>You have 4 new messages.</s> <s>The first is from Stephanie Williams and arrived at <break/> 3:45pm. </s> <s> The subject is <prosody rate="-20%">ski trip</prosody> </s> </p> </speak>
The following example combines audio files and different spoken voices to provide information on a collection of music.
<?xml version="1.0"> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p> <voice gender="male"> <s>Today we preview the latest romantic music from Example.</s> <s>Hear what the Software Reviews said about Example's newest hit.</s> </voice> </p> <p> <voice gender="female"> He sings about issues that touch us all. </voice> </p> <p> <voice gender="male"> Here's a sample. <audio src="https://www.example.com/music.wav"/> Would you like to buy it? </voice> </p> </speak>
It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the voice element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> The title of the movie is: "La vita è bella" (Life is beautiful), which is directed by Roberto Benigni. </speak>
With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see Section 3.1.4) or via the phoneme element as shown in the next example.
It is worth noting that IPA alphabet support is an optional feature and that phonemes for an external language may be rendered with some approximation (see Section 3.1.4 for details). The following example only uses phonemes common to US English.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> The title of the movie is: <phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə"> La vita è bella </phoneme> <!-- The IPA pronunciation is ˈlÉ‘ ˈviËɾə ˈʔeɪ ˈbÉ›lÉ™ --> (Life is beautiful), which is directed by <phoneme alphabet="ipa" ph="ɹəˈbɛːɹɾoʊ bɛˈniːnji"> Roberto Benigni </phoneme> <!-- The IPA pronunciation is ɹəˈbÉ›ËɹɾoÊŠ bɛˈniËnji --> <!-- Note that in actual practice an author might change the encoding to UTF-8 and directly use the Unicode characters in the document rather than using the escapes as shown. The escaped values are shown for ease of copying. --> </speak>
SMIL Integration Example
The SMIL language [SMIL] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.
File 'greetings.ssml' contains the following:
<?xml version="1.0"> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "https://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://www.w3.org/2001/10/synthesis https://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <s> <mark name="greetings"/> <emphasis>Greetings</emphasis> from the <sub alias="carview.php?tsp=World Wide Web Consortium">W3C</sub>! </s> </speak>
SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:
<smil xmlns="https://www.w3.org/2001/SMIL20/Language"> <head> <top-layout width="640" height="320"> <region id="whole" width="640" height="320"/> </top-layout> </head> <body> <par> <img src="https://w3clogo.gif" region="whole" begin="0s"/> <ref src="greetings.ssml" begin="1s"/> </par> </body> </smil>
SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:
<smil xmlns="https://www.w3.org/2001/SMIL20/Language"> <head> <top-layout width="640" height="320"> <region id="whole" width="640" height="320"/> </top-layout> </head> <body> <seq> <img id="logo" src="https://w3clogo.gif" region="whole" begin="0s" end="logo.activateEvent"/> <ref src="greetings.ssml"/> </seq> </body> </smil>
VoiceXML Integration Example
The following is an example of SSML in VoiceXML (see Section 2.3.3) for voice browser applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [VXML] for details.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="https://www.w3.org/2001/vxml"
      xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="https://www.w3.org/2001/vxml
                  https://www.w3.org/TR/voicexml20/vxml.xsd">
  <form>
    <block>
      <prompt>
        <emphasis>Welcome</emphasis> to the Bird Seed Emporium.
        <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
        We have 250 kilogram drums of thistle seed for $299.95
        plus shipping and handling this month.
        <audio src="https://www.birdsounds.example.com/mourningdove.wav"/>
      </prompt>
    </block>
  </form>
</vxml>
Appendix G: Summary of changes since the Last Call Working Draft
This is a list of the major changes to the specification since the Last Call Working Draft:
- Reorganized the document to consolidate all element and attribute definitions
- Incorporated conformance section into Document Form section
- Reintroduced strength/time distinction in <break> element. (CR140-5/8, CR153)
- Noted for all labelled values in <break> and <prosody> that values are monotonically non-decreasing. (CR115)
- Removed claim that SSML doesn't provide explicit controls over the generation of output waveforms. (CR122)
- Style clean-up. (CR124/125, CR133-5)
- Removed claim that synthesizers are expert at performing text-to-speech conversion. (CR129)
- For <voice> and <prosody>, added requirement that at least one attribute must be specified. (CR148)
- Added high/low-level distinction and warnings to 1.2 and 1.4. (CR126-2, CR126-3)
- Clarified in 1.2 that actual behavior is always a combination of what the author specified and what the processor would have done on its own and that this behavior varies by tag. (CR126-4)
- In 3.2.1, clarified that relative prosodic value changes should be carried across voice changes. (CR126-9)
- In 3.3.2, changed <mark> name to be of type xsd:token. (CR131, CR146)
- Clarified in 2.1 how namespace and schema information is to be indicated. (CR133-1, CR159)
- Explained in 2.1 why we include namespace info in every example. (CR133-2)
- Clarified definition of "error" in 1.5. (CR133-3)
- Clarified type values in 3.2.1, 3.2.4, and 3.3.2. (CR133-4, CR133-6, CR133-7, CR140-3)
- In 2.2.1, removed requirement for SSML fragments to be well-formed XML documents. (CR133-10)
- In 3.1.4, clarified that lexicons operate at the token level. (CR134-2)
- Noted in 3.1.2 that <voice> element can be used just to change the language. (CR145-3)
- Misc wording changes. (CR141-2, CR145-4, CR145-5, CR145-8, CR145-9, CR145-11, CR145-16, CR145-24, CR145-29, CR145-32, CR145-46, CR145-50, CR145-55, CR145-56, CR145-57, CR145-59, CR145-61, CR145-64, CR145-75, CR145-77, CR145-79, CR145-80, CR145-81)
- Added paragraph describing the intended production and use of SSML to the Introduction. (CR145-6)
- Added SABLE info to Introduction. (CR145-7)
- Added phoneme inventory size for Hawai'ian to 1.2. (CR145-10)
- Replaced Tlalpachicatl example in 1.2 with better ones. (CR145-12)
- Moved definitions in 1.1 to 1.5. (CR145-13)
- Clarified URI definition in 1.5. (CR145-14)
- Clarified XML 1.0 as the definer of the xml:lang attribute (and its values) in 1.5 and 3.1.2. (CR145-15, CR145-19)
- Made xml:lang required on <speak>. (CR145-17)
- Clarified that version attribute on <speak> must have the value "1.0". (CR145-18)
- Clarification of xml:lang behavior in 3.1.2, 3.2.1, and 3.3.3. Also added example to 3.3.3. (CR133-8, CR145-20, CR145-71)
- Added references in 3.1.2 and 3.1.10 to text normalization description in 1.2. (CR145-21)
- Changed examples to use utf-8 and updated <phoneme> example. (CR145-22, CR145-45)
- Updated example in 3.1.2 to use real Japanese text. (CR145-23)
- Clarified in 3.1.2 that conformance variation for xml:lang is only in terms of voice change. (CR145-25)
- Removed <paragraph> and <sentence> elements, leaving only <p> and <s>. (CR145-26)
- Removed examples from <say-as> section. (CR145-27)
- Clarified behavior of <say-as> when mismatching content is present. (CR145-28)
- Clarifications to the phoneme element. (CR145-38, CR145-39, CR145-40, CR145-41, CR145-42, CR145-43, CR145-44)
- Added comment to 1.2 on what to do with acronyms and abbreviations. (CR145-49)
- In 3.2.1, clarified the voice selection algorithm. (CR145-51)
- Added examples of short foreign-text embedding in Examples appendix and linked from <voice> section. (CR145-52)
- Added word/token explanation to section 1.2 in steps 1 and 3 and related comment to 3.1.4. (CR145-60, CR145-73)
- Noted that customary pitch levels and ranges may differ by language. (CR145-62)
- Added short explanations of baseline pitch and pitch range in section 3.2.4. (CR145-63)
- Added semitone definition to section 3.2.4. (CR145-65)
- Clarified syntax of pitch contour in section 3.2.4. (CR145-66)
- Changed prosody rate to use relative specifiers rather than WPM. (CR126-12, CR127-3, CR145-68, and CR152)
- Clarified that <prosody> units Hz and st are case-sensitive. (CR145-70)
- Cleaned up 2.1, including removing requirement for xml line. (CR133-1, CR145-72)
- Explain how type attribute on lexicon works, synced with TAG findings. (CR145-74, CR154)
- Clarified that xml:lang and xml:base were to be left in when converting a fragment to a document. (CR145-76)
- In 2.2.5 added requirement that User Agent must support at least one natural language. (CR145-78)
- Removed Future features appendix. (CR124-3, CR145-83, CR145-84, CR145-85)
- Cleaned up Internationalization appendix. (CR145-86, CR145-87)
- Reordered appendices to place normative ones first. Changes are A->F, B->E, C->D, D->A, E->C, and G->B (CR145-88)
- Made explicit that the default behavior when no alphabet specified for <phoneme> is processor-specific. (CR160)
- Added <meta> element (CR161)
- Removed statements in 3.1.3 and 3.1.4 implying a multi-document environment for SSML. (CR162)
- Clarified case in 3.1.3.1 where no base URI is found and document has relative URIs. (CR163)
- Added JEITA alphabet example to 3.1.9. (CR164)
- Clarified precedence if multiple lexicons are present. (CR167)
- Made MIME types appendix normative. (CR168)
- Updated Acknowledgments section.
- Removed unnecessary speak attribute text from 2.1 and linked to 3.1.1.
- Reformatted Normative references (CR136-1).
- Revised element content wording (CR145-35).
- Fixed lexicon type text (CR154).
- Removed "language identifier" term from 1.5.
- Removed unused ISO3166 reference in 4.
- Reformatted Informative references.
- Added Appendix G.
- Updated Status section to reflect that this is now a Candidate Recommendation.
- Miscellaneous editorial fixes