HTML Text to Speech (TTS) API Specification
Editor's Draft 28 October 2010
- Latest Editor's Draft:
- https://dev.w3.org/...
- Editors:
- Bjorn Bringert, Google Inc.
Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract
This is a proposal for adding support for speech synthesis to HTML.
Status of This Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document is an API proposal from Google Inc. to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives).
All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Table of Contents
- 1 Conformance requirements
- 2 Introduction
- 3 Scope
- 4 API Description
- 5 Backwards Compatibility
- Acknowledgments
- References
1 Conformance requirements
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]
2 Introduction
This section is non-normative.
The HTML Text to Speech API aims to provide web developers with programmatic access to speech synthesis and playback. The API itself is agnostic of the underlying speech synthesizer implementation and can support both server-based and embedded synthesizers.
The API consists of a new element, tts, with a corresponding DOM interface HTMLTtsElement. Like the existing audio and video elements, the new tts element extends HTMLMediaElement. As with the audio element, the playback of synthesized speech can be controlled with a playback UI or by scripting. The text to synthesize can be specified as plain text or as SSML.
Use Cases
- Speech translation
- The app works as an interpreter between two users that speak different languages.
- Speech-enabled webmail client, e.g. for in-car use.
- Reads out e-mails and gives confirmations for commands processed such as "e-mail sent to Bob".
- Turn-by-turn navigation
- Speaks driving instructions, e.g. "in 500 meters, left turn on Buckingham Palace Road".
- Dialog systems
- For example, flight booking or pizza ordering.
All of the above should be easy to extend to work in multiple languages.
Some of these use cases require speech recognition as well as speech synthesis. See HTML Speech Input API for a proposed API for speech recognition.
Examples
The following code extracts illustrate how to use speech synthesis in various cases:
Hello World
<tts autoplay value="hello world"></tts>
Behavior
- "hello world" is spoken when the page has loaded.
- In browsers that don't support TTS, the text "hello world" is displayed.
Speak Spanish text typed by the user
<form>
  <input name="t" type="text">
  <input type="button" value="Speak"
         onclick="var tts = document.getElementById('say'); tts.value = this.form.t.value; tts.play()">
</form>
<tts id="say" lang="es"></tts>
Behavior
- The text typed in the input field is spoken in Spanish when the button is pressed.
Read out text, highlighting current line
<style type="text/css">
  .current { background-color: yellow; }
</style>
<script type="text/javascript">
  var prevLine = null;
  // Highlight the line whose id matches the most recent SSML mark.
  function highlight(event) {
    var mark = event.target.lastMark;
    var line = document.getElementById(mark);
    // Ignore events fired before the first mark or for an unchanged mark.
    if (!line || line === prevLine) { return; }
    if (prevLine) { prevLine.className = ""; }
    line.className = "current";
    prevLine = line;
  }
</script>
<blockquote><span id="l1">Beware the Jabberwock, my son!</span><br>
<span id="l2">The jaws that bite, the claws that catch!</span><br>
<span id="l3">Beware the Jubjub bird, and shun</span><br>
<span id="l4">The frumious Bandersnatch!</span></blockquote>
<tts id="out" src="text.ssml" controls ontimeupdate="highlight(event)"></tts>
text.ssml:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s><mark name="l1" />Beware the Jabberwock, my son!</s>
  <s><mark name="l2" />The jaws that bite, the claws that catch!</s>
  <s><mark name="l3" />Beware the Jubjub bird, and shun
     <mark name="l4" />the frumious Bandersnatch!</s>
</speak>
Behavior
- The TTS element shows playback controls.
- When play is pressed, the synthesized speech is played back.
- When a new line starts to play back, that line is highlighted.
3 Scope
This section is non-normative.
This specification is limited to adding a new HTML element for speech synthesis.
The scope of this specification does not include providing a new markup language of any kind.
The scope of this specification does not include interfacing with telephony systems of any kind.
4 API Description
4.1 The tts HTML element
This API adds a new tts element that extends HTMLMediaElement.
interface HTMLTtsElement : HTMLMediaElement {
  attribute DOMString value;
  readonly attribute DOMString lastMark;
};
The content of the HTMLTtsElement is the data to be given as input to the speech synthesizer.
The new value attribute sets the content of the HTMLTtsElement to the plain text value of the attribute.
The new lastMark attribute contains the name of the last SSML mark element that was encountered during playback.
4.1.1 Notes about existing attributes, events and methods
This section describes how some existing attributes of HTMLMediaElement should be interpreted when used on HTMLTtsElement.
The src attribute contains the URI of a document whose contents should override the content of the HTMLTtsElement. Implementations should support at least UTF-8 encoded text/plain and application/ssml+xml.
If the src attribute is not set, or is set to a URI that does not reference a valid document that the user agent can use as input to the speech synthesizer, the value of the value attribute should be used instead. If value is not set either, the TTS element has no content and playback should produce no audio.
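This example is non-normative. The fallback order described above can be sketched as a small function. The names resolveTtsInput and loadDocument are hypothetical and are not part of this proposal; the element is modeled as a plain object with src and value properties, and loadDocument is an assumed callback that returns the document text, or null when the URI does not reference usable input.

```javascript
// Hypothetical helper: picks the synthesizer input for a tts element.
// `element` is a plain object standing in for an HTMLTtsElement.
// `loadDocument(uri)` is an assumed callback returning the document text,
// or null when the URI does not reference usable synthesizer input.
function resolveTtsInput(element, loadDocument) {
  if (element.src) {
    const text = loadDocument(element.src);
    if (text !== null && text !== undefined) {
      return text; // a usable src document overrides the element content
    }
  }
  if (typeof element.value === "string") {
    return element.value; // fall back to the value attribute
  }
  return ""; // no content: playback should produce no audio
}

// Usage with a stub loader that only knows one document.
const docs = { "text.ssml": "<speak>hello</speak>" };
const load = (uri) => (uri in docs ? docs[uri] : null);

const fromSrc = resolveTtsInput({ src: "text.ssml", value: "hi" }, load);
const fromValue = resolveTtsInput({ src: "missing.txt", value: "hi" }, load);
const empty = resolveTtsInput({}, load);
```

In this sketch fromSrc holds the SSML document, fromValue falls back to the plain text "hi", and empty is the empty string.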
The lang attribute, if present, sets the language in which speech should be synthesized. If this attribute is not set, the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document. If the value to be synthesized is SSML, any language attributes in the SSML document override any language attributes in the HTML document.
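This example is non-normative. The language fallback described above can be sketched as follows. resolveTtsLang is a hypothetical name, nodes are plain objects with optional lang and parent properties standing in for DOM elements, and the SSML case is simplified to a single language value (for instance the xml:lang of the speak element), which is an assumption of this sketch.

```javascript
// Hypothetical helper: resolves the synthesis language for a tts element.
// Nodes are plain objects with optional `lang` and `parent` properties,
// standing in for DOM elements. `ssmlLang` is an assumed simplification:
// one language taken from the SSML input, if the input is SSML.
function resolveTtsLang(element, documentLang, ssmlLang) {
  if (ssmlLang) {
    return ssmlLang; // language set inside the SSML input overrides HTML
  }
  for (let node = element; node; node = node.parent) {
    if (node.lang) {
      return node.lang; // the element itself, then the closest ancestor
    }
  }
  return documentLang; // last resort: the language of the document
}

// Usage: the tts element has no lang, its parent section is Spanish.
const body = { lang: "en" };
const section = { parent: body, lang: "es" };
const tts = { parent: section };

const inherited = resolveTtsLang(tts, "en-US");
const overridden = resolveTtsLang(tts, "en-US", "de-DE");
```

Here inherited is "es", taken from the closest ancestor, and overridden is "de-DE", because the SSML language takes precedence.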
All other HTMLMediaElement attributes work in the same way as for HTMLAudioElements, including autoplay, loop etc.
The existing timeupdate event is dispatched to report progress through the synthesized speech. If the value is SSML, timeupdate events should be fired for each mark element that is encountered.
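This example is non-normative. Because timeupdate also reports general playback progress, a handler may see the same lastMark value more than once. One possible way to react only when the mark changes is sketched below. onMarkChange is a hypothetical helper, and the sketch assumes that lastMark is empty or null before the first mark is reached, which this proposal does not specify.

```javascript
// Hypothetical helper: wraps a callback so it runs once per new mark.
// Assumes event.target exposes lastMark as described in this proposal,
// and that lastMark is empty or null before the first mark is reached.
function onMarkChange(callback) {
  let previous = null;
  return function handleTimeUpdate(event) {
    const mark = event.target.lastMark;
    if (mark && mark !== previous) {
      previous = mark;
      callback(mark);
    }
  };
}

// Usage with stub events. In a page, the handler would be registered with
// tts.addEventListener("timeupdate", handler).
const seen = [];
const handler = onMarkChange((mark) => seen.push(mark));
handler({ target: { lastMark: "" } });   // before any mark: ignored
handler({ target: { lastMark: "l1" } }); // first mark
handler({ target: { lastMark: "l1" } }); // progress update, same mark
handler({ target: { lastMark: "l2" } }); // next mark
```

After these four stub events, seen contains "l1" and "l2", one entry per mark.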
All other HTMLMediaElement events work in the same way as for HTMLAudioElements, including play, ended, error etc.
All HTMLMediaElement methods work in the same way as for HTMLAudioElements, including play() and pause().
5 Backwards Compatibility
A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "TTS" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.
Since the tts element does not have any child elements, the element should not be displayed in UAs that don't support speech synthesis.
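This example is non-normative. The feature check described in this section can be sketched as a guarded call. supportsTts is a hypothetical helper name; the stub objects below stand in for a DOM Document so that the sketch can run outside a browser.

```javascript
// Hypothetical helper: reports whether the TTS module is advertised.
// `doc` is expected to look like a DOM Document with an `implementation`
// property that provides hasFeature(feature, version).
function supportsTts(doc) {
  const impl = doc && doc.implementation;
  return Boolean(
    impl &&
    typeof impl.hasFeature === "function" &&
    impl.hasFeature("TTS", "1.0")
  );
}

// Usage with stub documents; in a page this would be supportsTts(document).
const withTts = {
  implementation: { hasFeature: (f, v) => f === "TTS" && v === "1.0" },
};
const withoutTts = { implementation: { hasFeature: () => false } };

const yes = supportsTts(withTts);
const no = supportsTts(withoutTts);
```

A page could use such a check to decide whether to show a text alternative when speech synthesis is unavailable.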
Acknowledgments
Satish Sampath, Dave Burke, Andrei Popescu, Jeremy Orlow
References
- [RFC2119]
- Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. Internet Engineering Task Force, March 1997. See https://www.ietf.org/rfc/rfc2119.txt
- [RFC3066]
- Tags for the Identification of Languages, Harald Tveit Alvestrand. Internet Engineering Task Force, January 2001. See https://www.ietf.org/rfc/rfc3066.txt
- [WEBIDL]
- Web IDL, Cameron McCormack, Editor. World Wide Web Consortium, 19 December 2008. See https://dev.w3.org/2006/webapi/WebIDL/
- [SSML]
- Speech Synthesis Markup Language (SSML) Version 1.1, Daniel C. Burnett, Zhi Wei Shuang, Editors, W3C Recommendation, 7 September 2010.
- [HTML5]
- HTML5: A vocabulary and associated APIs for HTML and XHTML, Ian Hickson, Editor, W3C Editor's Draft, 25 October 2010.