Sourcing In-band Media Resource Tracks from Media Containers into HTML
Unofficial Draft
- Latest editor's draft:
- https://dev.w3.org/html5/html-sourcing-inband-tracks/
- Editors:
- Silvia Pfeiffer, NICTA
- Bob Lund, CableLabs Inc
This document is licensed under a Creative Commons Attribution 3.0 License.
Abstract
This specification is provided to promote interoperability among implementations and users of in-band text tracks sourced for [HTML5]/[HTML] from media resource containers. The specification provides guidelines for the creation of video, audio and text tracks and their attribute values as mapped from in-band tracks from media resource types typically supported by User Agents. It also explains how the UA should map in-band text track content into text track cues.
Mappings are defined for [MPEGDASH], [ISOBMFF], [MPEG2TS], [OGGSKELETON] and [WebM].
Status of This Document
This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.
This is the first draft. Please send feedback to: public-inbandtracks@w3.org.
1. Introduction
The specification maintains mappings from in-band audio, video and other data tracks of media resources to HTML VideoTrack, AudioTrack, and TextTrack objects and their attribute values.
This specification defines the mapping of tracks from media resources depending on the MIME type of that resource. If an implementation claims to support that MIME type and exposes a track from a resource of that type, the exposed track must conform to this specification.
Which actual tracks are exposed by a user agent from a supported media resource is implementation dependent. A user agent may expose tracks for which it supports parsing, decoding and rendering, for playback selection by the Web application or user. A user agent may also decide to expose tracks coded in formats it is not able to decode, but which it can identify and describe through metadata such as the HTML kind attribute and others as defined in this specification. For text tracks, the track content may be exposed to the Web application via TextTrackCue or DataCue objects.
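As an illustration, a Web application can inspect whichever tracks the user agent chooses to expose through the standard HTMLMediaElement track lists. A minimal sketch, assuming a user agent that implements the AudioTrackList and VideoTrackList interfaces (the selector is illustrative):

```ts
// Sketch: list the in-band tracks a user agent has exposed for a media element.
const video = document.querySelector('video') as HTMLVideoElement;

video.addEventListener('loadedmetadata', () => {
  const audioTracks = (video as any).audioTracks; // AudioTrackList, if implemented
  const videoTracks = (video as any).videoTracks; // VideoTrackList, if implemented

  for (let i = 0; audioTracks && i < audioTracks.length; i++) {
    const t = audioTracks[i];
    console.log('audio track', t.id, t.kind, t.language, t.label);
  }
  for (let i = 0; videoTracks && i < videoTracks.length; i++) {
    const t = videoTracks[i];
    console.log('video track', t.id, t.kind, t.language, t.label);
  }
  for (let i = 0; i < video.textTracks.length; i++) {
    const t = video.textTracks[i];
    console.log('text track', t.id, t.kind, t.language, t.label);
  }
});
```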
A generic rule to follow is that a track as exposed in HTML only ever represents a single semantic concept. When mapping from a media resource, sometimes an in-band track does not relate 1-to-1 to an HTML text, audio or video track.
For example, an HTML TextTrack object is either a subtitle track or a caption track, never both. However, in-band text tracks may encapsulate caption and subtitle cues of the same language as a single in-band track. Since a caption track is essentially a subtitle track with additional cues of transcripts of audio-only information, such an encapsulation in a single in-band track can save space. In HTML, these tracks should be exposed as two TextTrack objects, since they represent different semantic concepts. The cues appear in their relevant tracks - subtitle cues would be present in both. This allows users to choose between the two tracks and activate the desired one in the same manner that they do when the two tracks are provided through two track elements.
A similar logic applies to in-band text tracks that have subtitle cues of different languages mixed together in one track. These, too, should be exposed as one track per language.
A further example is when a UA decides to implement rendering for a caption track but without exposing the caption track through the TextTrack API. To the Web developer and the Web page user, such a video appears as though it has burnt-in captions. Therefore, the UA could expose two video tracks on the HTMLMediaElement - one with captions and a kind attribute set to captions, and one without captions and a kind attribute set to main. In this way, the user and the Web developer still get the choice of whether to see the video with or without captions.
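A sketch of how a page could act on that choice, assuming the user agent exposes the two renditions as separate video tracks as described above (VideoTrackList is standard HTML but not universally implemented; the function name is illustrative):

```ts
// Sketch: switch between the "captions" and "main" video renditions.
function selectVideoTrackByKind(video: HTMLVideoElement, kind: 'main' | 'captions'): void {
  const tracks = (video as any).videoTracks; // VideoTrackList
  if (!tracks) return;                       // UA does not expose video tracks
  for (let i = 0; i < tracks.length; i++) {
    // Selecting one video track deselects the others.
    tracks[i].selected = tracks[i].kind === kind;
  }
}
// e.g. selectVideoTrackByKind(video, 'captions') shows the rendition with captions
```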
Another generic rule to follow for in-band data tracks is that in order to map them to TextTrack objects, the contents of the track need to be mapped to media-time-aligned cues that relate to a non-zero interval of time.
For every MIME-type/subtype of an existing media container format, this specification defines the following information:
- Track order.
Tracks sourced according to this specification are referenced by HTML TrackList objects (audioTracks, videoTracks or textTracks). The [HTML5]/[HTML] specification mandates that the tracks in those objects be consistently ordered. This requirement ensures that the order of tracks is not changed when a track is added or removed, e.g. that videoTracks[3] points to the same object if the tracks with indices 0, 1, 2 and 3 were not removed. It also ensures a deterministic result when calls to getTrackById are made with media resources, possibly invalid, that declare two tracks with the same id. This specification defines a consistent ordering of tracks between the media resource and TrackList objects when the media resource is consumed by the user agent. Note that in some media workflows, the order of tracks in a media resource may be subject to change (e.g. tracks may be added or removed) between authoring and publication. Applications associated with a media resource should not rely on the order of tracks being the same between when the media resource was authored and when it is consumed by the user agent.
All media resource formats used in this specification support identifying tracks using a unique identifier. This specification defines how those unique identifiers are mapped onto the id attribute of HTML Track objects. Application authors are encouraged to use the id attribute to identify tracks, rather than the index in a TrackList object (see the sketch after this list).
- How to identify the type of tracks - one of audio, video or text.
- Setting the attributes id, kind, language and label for sourced TextTrack objects.
- Setting the attributes id, kind, language and label for sourced AudioTrack and VideoTrack objects.
- Mapping Text Track content into text track cues.
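A minimal sketch of such id-based lookup, assuming a user agent that implements getTrackById() on its track list objects (the id value shown is invented for illustration):

```ts
// Sketch: prefer the container-derived id over positional indices, which can
// change when tracks are added or removed.
const media = document.querySelector('video') as HTMLVideoElement;
const audioTracks = (media as any).audioTracks; // AudioTrackList

// For an MPEG-2 TS resource the id would be the decimal elementary_PID;
// "482" is an invented example value.
const track = audioTracks ? audioTracks.getTrackById('482') : null;
if (track) {
  track.enabled = true; // enable that audio track for playback
}
```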
2. MPEG-DASH
MIME type/subtype: application/dash+xml
[MPEGDASH] defines formats for a media manifest, called MPD (Media Presentation Description), which references media containers, called media segments. [MPEGDASH] also defines some media segment formats based on [MPEG2TS] or [ISOBMFF]. Processing of media manifests and segments to expose tracks to Web applications can be done by the user agent. Alternatively, a Web application can process the manifests and segments to expose tracks. When the user agent processes the MPD and media segments directly, it exposes tracks for AdaptationSet and ContentComponent elements, as defined in this document. When the Web application processes the MPD and media segments, it passes media segments to the user agent according to the Media Source Extensions [MSE] specification. In this case, the tracks are exposed by the user agent according to [MSE]. The Web application may set default track attributes from MPD data, using the trackDefaults object; these defaults will be used by the user agent to set attributes that are not set from initialization segment data.
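A rough sketch of the application-driven path described above, in which the page fetches MPD-referenced media segments itself and hands them to the user agent through [MSE]; the segment URL and codecs string are illustrative, and the trackDefaults mechanism is not shown:

```ts
// Sketch: Web application processes the MPD and feeds media segments to the UA.
const video = document.querySelector('video') as HTMLVideoElement;
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  // MIME type and codecs would come from the AdaptationSet in the MPD.
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  const init = await (await fetch('segments/init.mp4')).arrayBuffer(); // illustrative URL
  sb.appendBuffer(init); // the UA exposes tracks found in this initialization segment
});
```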
Track Order
If an AdaptationSet contains ContentComponents, a track is created for each ContentComponent. Otherwise, a track is created for the AdaptationSet itself. The order of tracks specified in the MPD (Media Presentation Description) format [MPEGDASH] is maintained when sourcing multiple MPEG DASH tracks into HTML.
Determining the type of track
A user agent recognises and supports data from an MPEG DASH media resource as being equivalent to an HTML track using the content type given by the MPD. The content type of the track is the first present value out of: the ContentComponent's "contentType" attribute, the AdaptationSet's "contentType" attribute, or the main type in the AdaptationSet's "mimeType" attribute (i.e. for "video/mp2t", the main type is "video").
- text track:
  - the content type is "application" or "text"
  - the content type is "video" and the AdaptationSet contains one or more ISOBMFF CEA 608 or 708 caption services.
- video track: the content type is "video"
- audio track: the content type is "audio"
Track Attributes for sourced Text Tracks
Data for sourcing text track attributes may exist in the media content or in the MPD. Text track attribute values are first sourced from track data in the media container, as described for text track attributes in MPEG-2 Transport Streams and text track attributes in MPEG-4 ISOBMFF. If a track attribute's value cannot be determined from the media container, then the track attribute value is sourced from data in the track's ContentComponent. If the needed attribute or element does not exist on the ContentComponent (or if the AdaptationSet doesn't contain any ContentComponents), then that attribute or element is sourced from the AdaptationSet:
Attribute: How to source its value
id
The track is: - An ISOBMFF CEA 608 caption service: the string "cc" concatenated with the value of the '
channel-number
' field in theAccessibility
descriptor in theContentComponent
orAdaptationSet
. - An ISOBMFF CEA 708 caption service: the string "sn" concatenated with the value of the '
service-number
' field in theAccessibility
descriptor in theContentComponent
orAdaptationSet
. - Otherwise, the content of the '
id
' attribute in theContentComponent
, orAdaptationSet
.
kind
The track: - Represents a
ContentComponent
orAdaptationSet
containing aRole
descriptor withschemeIdURI
attribute = "urn:mpeg:dash:role:2011
":- "
captions
": if theRole
descriptor's value is "caption
" - "
subtitles
": if theRole
descriptor's value is "subtitle
" - "
metadata
": otherwise
- "
- Is an ISOBMFF CEA 608 or 708 caption service: "
captions
".
label
The empty string. language
The track is: - An ISOBMFF CEA 608 708 caption service: the value of the '
language
' field in theAccessibility
descriptor, in theContentComponent
orAdaptationSet
, where the corresponding 'channel-number
' or 'service-number
' is the same as this track's 'id
' attribute. The empty string if there is no such corresponding 'channel-number
' or 'service-number
'. - Otherwise: the content of the '
lang
' attribute in theContentComponent
orAdaptationSet
element.
inBandMetadataTrackDispatchType
If kind
is "metadata
", an XML document containing theAdaptationSet
element and all childRole
descriptors andContentComponents
, and their childRole
descriptors. The empty string otherwise.mode
" disabled
"
Track Attributes for sourced Audio and Video Tracks
Data for sourcing audio and video track attributes may exist in the media content or in the MPD. Audio and video track attribute values are first sourced from track data in the media container, as described for audio and video track attributes in MPEG-2 Transport Streams and audio and video track attributes in MPEG-4 ISOBMFF. If a track attribute's value cannot be determined from the media container, then the track attribute value is sourced from data in the track's ContentComponent. If the needed attribute or element does not exist on the ContentComponent (or if the AdaptationSet doesn't contain any ContentComponents), then that attribute or element is sourced from the AdaptationSet:
Attribute: How to source its value
id
Content of the id
attribute in theContentComponent
orAdaptationSet
element. Empty string if theid
attribute is not present on either element.kind
Given a
Role
scheme of "urn:mpeg:dash:role:2011
", determine thekind
attribute from the value of theRole
descriptors in theContentComponent
andAdaptationSet
elements.- "
alternative
": if the role is "alternate
" but not also "main
" or "commentary
", or "dub
" - "
captions
": if the role is "caption
" and also "main
" - "
descriptions
": if the role is "description
" and also "supplementary
" - "
main
": if the role is "main
" but not also "caption
", "subtitle
", or "dub
" - "
main-desc
": if the role is "main
" and also "description
" - "
sign
": not used - "
subtitles
": if the role is "subtitle
" and also "main
" - "
translation
": if the role is "dub
" and also "main
" - "
commentary
": if the role is "commentary
" but not also "main
" - "": otherwise
label
The empty string. language
Content of the lang
attribute in theContentComponent
orAdaptationSet
element.
Mapping Text Track content into text track cues
TextTrackCue objects may be sourced from DASH media content in the WebVTT, TTML, MPEG-2 TS or ISOBMFF format.
Media content with the MIME type "text/vtt" is in the WebVTT format and should be exposed as VTTCue objects as defined in [WEBVTT].
Media content with the MIME type "application/ttml+xml" is in the TTML format and should be exposed as an as yet to be defined TTMLCue object. Alternatively, browsers can also map the TTML features to VTTCue objects [WEBVTT]. Finally, browsers that cannot render TTML [ttaf1-dfxp] format data should expose it as DataCue objects [HTML51]. In this case, the TTML file must be parsed in its entirety and then converted into a sequence of TTML Intermediate Synchronic Documents (ISDs). Each ISD creates a DataCue object with attributes sourced as follows:
Attribute: How to source its value
id
Decimal representation of the id
attribute of thehead
element in the XML document. Null if there is noid
attribute.startTime
Value of the beginning media time of the active temporal interval of the ISD. endTime
Value of the ending media time of the active temporal interval of the ISD. pauseOnExit
" false
"data
The (UTF-16 encoded) ArrayBuffer composing the ISD resource.
Media content with the MIME type "application/mp4" or "video/mp4" is in the [ISOBMFF] format and should be exposed following the same rules as for ISOBMFF text tracks.
Media content with the MIME type "video/mp2t" is in the MPEG-2 TS format and should be exposed following the same rules as for MPEG-2 TS text tracks.
3. MPEG-2 Transport Streams
MIME type/subtype: audio/mp2t, video/mp2t
Track Order
Tracks are called "elementary streams" in a MPEG-2 Transport Stream (TS) [MPEG2TS]. The order in which elementary streams are listed in the "Program Map Table" (PMT) of a MPEG-2 TS is maintained when sourcing multiple MPEG-2 tracks into HTML. Additions or deletions of elementary streams in the PMT should invoke
addtrack
orremovetrack
events in the user agent.NoteThe order of elementary streams in the PMT may change between when the media resource was created and when it is received by the user agent. Scripts should not infer any information from the ordering, or rely on any particular ordering being present.
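For example, a page can observe those changes through the standard addtrack and removetrack events on the element's track lists; a minimal sketch:

```ts
// Sketch: react to elementary streams appearing in or disappearing from the PMT.
const video = document.querySelector('video') as HTMLVideoElement;

video.textTracks.addEventListener('addtrack', (e: Event) => {
  const track = (e as TrackEvent).track as TextTrack | null;
  if (track) {
    console.log('in-band text track added', track.id, track.kind, track.language);
  }
});

video.textTracks.addEventListener('removetrack', (e: Event) => {
  const track = (e as TrackEvent).track as TextTrack | null;
  if (track) {
    console.log('in-band text track removed', track.id);
  }
});
```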
Determining the type of track
A user agent recognizes and supports data in an MPEG-2 TS elementary stream identified by the elementary_PID field in the Program Map Table as being equivalent to an HTML track based on the value of the stream_type field associated with that elementary_PID:
- text track:
- The elementary stream with PID 0x02 or the
stream_type
value is "0x02", "0x05" or between "0x80" and "0xFF". - The CEA 708 caption service [CEA708], as identified by:
- A
caption_service_descriptor
[ATSC65] in the 'Elementary Stream Descriptors' in the PMT entry for a video stream with stream type 0x02 or 0x1B. - For
stream_type
0x02, the presence of caption data in theuser_data()
field [ATSC52]. - For
stream_type
0x1B, the presence of caption data in theATSC1_data()
field [SCTE128-1].
- A
- a DVB subtitle component [DVB-SUB] as identified by a
subtitling_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06" - an ITU-R System B Teletext component [DVB-TXT] as identified by an
teletext_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06" - a VBI data component [DVB-VBI] as identified by a
VBI_data_descriptor
[DVB-SI] or aVBI_teletext_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06"
- The elementary stream with PID 0x02 or the
- video track: the
stream_type
value is "0x01", "0x02", "0x10", "0x1B", between "0x1E" and "0x24" or "0xEA". - audio track:
- the
stream_type
value is "0x03", "0x04", "0x0F", "0x11", "0x1C", "0x81" or "0x87". - an AC-3 audio component as identified by an
AC-3_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06" - an Enhanced AC-3 audio component as identified by an
enhanced_ac-3_descriptor
[DVB-SI]in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06" - a DTS® audio component as identified by a
DTS_audio_stream_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06" - a DTS-HD® audio component as identified by a
DTS-HD_audio_stream_descriptor
[DVB-SI] in the 'Elementary Stream Descriptors' in the PMT entry for a stream with astream_type
of "0x06"
- the
- text track:
Track Attributes for sourced Text Tracks
Attribute How to source its value id
Decimal representation of the elementary stream's identifier ( elementary_PID
field) in the PMT.For CEA 608 closed captions, the string "cc" concatenated with the decimal representation of the channel number.
For CEA 708 closed captions, the string "sn" concatenated with the decimal representation of the
service_number
field in the 'Caption Channel Service Block'.If program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" consisting of the following, lower-case hexadecimal encoded fields:
- OOOO is the four character representation of the 16-bit
original_network_id
[DVB-SI]. - TTTT is the four character representation of the 16-bit
transport_stream_id
[DVB-SI]. - SSSS is the four character representation of the 16-bit
service_id
[DVB-SI]. - CC is:
- If a
stream_identifier_descriptor
[DVB-SI] is present in the PMT, a two character representation of the 8-bitcomponent_tag
value. - Otherwise, a four character representation of the elementary stream's identifier (13-bit
elementary_PID
field) in the PMT.
- If a
kind
- "
captions
":- For a CEA708 caption service.
- for a DVB subtitle component [DVB-SUB] as identified by a
subtitling_descriptor
[DVB-SI] in the PMT with asubtitling_type
in the range "0x20" to "0x25". - an ITU-R System B Teletext component [DVB-TXT] as identified by an
teletext_descriptor
[DVB-SI] with ateletext_type
value of "0x05" in the PMT - a VBI data component [DVB-VBI] as identified by a
VBI_teletext_descriptor
[DVB-SI] with ateletext_type
value of "0x05" in the PMT.
- "
subtitles
":- If the stream type value is "0x82".
- for a DVB subtitle component [DVB-SUB] as identified by a
subtitling_descriptor
[DVB-SI] in the PMT with asubtitling_type
in the range "0x10" to "0x15". - an ITU-R System B Teletext component [DVB-TXT] as identified by an
teletext_descriptor
[DVB-SI] with ateletext_type
value of "0x02" in the PMT - a VBI data component [DVB-VBI] as identified by a
VBI_teletext_descriptor
[DVB-SI] with ateletext_type
value of "0x02" in the PMT.
- "
metadata
": otherwise
label
- If a
component_name_descriptor
[ATSC65] is found immediately after theES_info_length
field in the Program Map Table [MPEG2TS], theDOMString
representation of thecomponent_name_string
in thatcomponent_name_descriptor
. - If a
component_descriptor
[DVB-SI] for the component is present in the SDT or EIT, theDOMString
representation of the content of the text field in thatcomponent_descriptor
- The empty string otherwise.
language
kind
is- "
captions
":- For a CEA708 caption service.
- Content of the
language
field for the caption service in thecaption_service_descriptor
, if present. - Otherwise, for the first caption service, as identified by the
service_number
field in theservice_block
[CEA708] with a value of 1, the value oflanguage
of the audio track wherekind
has the value "main
". - The empty string for all other caption services, as identified by values greater than 1 in the
service_number
field.
- Content of the
- For a DVB subtitle component [DVB-SUB], the value of the
ISO_639_language_code
field in thesubtitling_descriptor
[DVB-SI] in the PMT - For an ITU-R System B Teletext component [DVB-TXT], the value of the
ISO_639_language_code
field in theteletext_descriptor
[DVB-SI] in the PMT - For a VBI data component [DVB-VBI], the value of the
ISO_639_language_code
field in theVBI_teletext_descriptor
[DVB-SI] in the PMT
- For a CEA708 caption service.
- "
subtitles
":- If
stream_type
value is "0x82", the content of theISO_639_language_code
field in theISO_639_language_descriptor
in the elementary stream descriptor array in the PMT. - for a DVB subtitle component [DVB-SUB], the value of the
ISO_639_language_code
field in thesubtitling_descriptor
[DVB-SI] in the PMT - for an ITU-R System B Teletext component [DVB-TXT], the value of the
ISO_639_language_code
field in theteletext_descriptor
[DVB-SI] in the PMT - for a VBI data component [DVB-VBI], the value of the
ISO_639_language_code
field in theVBI_teletext_descriptor
[DVB-SI] in the PMT
- If
- "
metadata
": The empty string.
inBandMetadataTrackDispatchType
If kind
is "metadata
", then the concatenation of thestream_type
byte field in the program map table andES_info_length
bytes following theES_info_length
field expressed in hexadecimal using uppercase ASCII hex digits. The empty string otherwise.mode
" disabled
"- OOOO is the four character representation of the 16-bit
Track Attributes for sourced Audio and Video Tracks
Attribute How to source its value id
- Decimal representation of the elementary stream's identifier (
elementary_PID
field) in the PMT. - If a program 0 (zero) is present in the transport stream, a string of the format "OOOO.TTTT.SSSS.CC" or "OOOO.TTTT.SSSS.CC&CC", consisting of the following, lower-case hexadecimal encoded fields:
- OOOO is the four character representation of the 16-bit
original_network_id
[DVB-SI]. - TTTT is the four character representation of the 16-bit
transport_stream_id
[DVB-SI]. - SSSS is the four character representation of the 16-bit
service_id
[DVB-SI]. - CC is:
- If a
stream_identifier_descriptor
[DVB-SI] is present in the PMT, a two character representation of the 8-bitcomponent_tag
value. - Otherwise, a four character representation of the elementary stream's identifier (13-bit
elementary_PID
field) in the PMT.
- If a
Where a track is derived from two components, the second form ("CC&CC") identifies the independent and dependent streams, where the first 'CC' identifies the independent stream, and the second 'CC' identifies the dependent stream. Otherwise the first form is used.
- OOOO is the four character representation of the 16-bit
kind
- If a
supplementary_audio_descriptor
[DVB-SI] is present in the PMT for an audio component, the value is derived according to the audio purpose defined in table J.3 of [DVB-SI] using the following rules:- "
main
" if PSI signalling of audio purpose indicates "Main audio" for the audio track that the user agent would select by default, otherwise to "translation
". Note: Need to define how the UA would select a track by default.
- components with an audio purpose of "Audio description (broadcast-mix)" map to "
main-desc
" - components with an audio purpose of "Audio description (receiver-mix)":
- The user agent exposes an audio track of
kind
"main-desc
" for each permitted combination of this track with another audio track as defined in annex J.2 of [DVB-SI]. Enabling this track results in the combination being presented. - If the user agent can present the stream in isolation, it also exposes an audio track of
kind
"descriptions
" for this audio component.
- The user agent exposes an audio track of
- components with an audio purpose of "Clean audio (broadcast-mix)", "Parametric data dependent stream", or "Unspecific audio for the general audience" map to "
alternative
" - components with other audio purposes map to the empty string
- "
- Otherwise:
- "
descriptions
":- For AC-3 audio [ATSC52] if the
bsmod
field is 2 and thefull_svc
field is 0 in theAC-3_audio_stream_descriptor()
in the PMT - For E-AC-3 audio [ATSC52] if the
audio_service_type
field is 2 and thefull_service_flag
is 0 in theE-AC-3_audio_descriptor()
in the PMT - For AAC audio [SCTE193-2] if the
AAC_service_type
field is 2 and thereceiver_mix_rqd
is 1 in theMPEG_AAC_descriptor()
in the PMT
- For AC-3 audio [ATSC52] if the
- "
main
" if the first audio (video) elementary stream in the PMT and theaudio_type
field in theISO_639_language_descriptor
, if present, is "0x00" or "0x01" - "
main-desc
":- For AC-3 audio [ATSC52] if the
bsmod
field is 2 and thefull_svc
field is 1 in theAC-3_audio_stream_descriptor()
- For E-AC-3 audio [ATSC52] if the
audio_service_type
field is 2 and thefull_service_flag
is 1 in theE-AC-3_audio_descriptor()
- For AAC audio [SCTE193-2] if the
AAC_service_type
field is 2 and thereceiver_mix_rqd
is 0 in theMPEG_AAC_descriptor()
- For AC-3 audio [ATSC52] if the
- "
sign
" video components with acomponent_descriptor
[DVB-SI] in the SDT or EIT, where thestream_content
is "0x3" and thecomponent_type
is "0x30" or "0x31" - "
translation
": not first audio elementary stream in the PMT and theaudio_type
field in theISO_639_language_descriptor
is "0x00" or "0x01" and bsmod=0 - "": otherwise
- "
label
- If a
component_descriptor
[DVB-SI] is present in the SDT or EIT, theDOMString
representation of the content of the text field in thatcomponent_descriptor
- If a
component_name_descriptor
[ATSC65] is present for this elementary in the Program Map Table [MPEG2TS], theDOMString
representation of thecomponent_name_string
field in that descriptor . - The empty string otherwise.
language
kind
is:- "
descriptions
" or "main-desc
": Content of thelanguage
field in theAC-3_audio_stream_descriptor
orAC-3_audio_stream_descriptor
[ATSC52] if present. - otherwise: Content of the
ISO_639_language_code
field in theISO_639_language_descriptor
.
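To make the identifier format above concrete, the following sketch composes the DVB-style id string from already-parsed SI values. The function and parameter names are hypothetical; only the field widths and lower-case hexadecimal encoding follow the rules above:

```ts
// Sketch: build the "OOOO.TTTT.SSSS.CC" track id from DVB SI values.
// All inputs are assumed to have been parsed from the transport stream already.
function dvbTrackId(
  originalNetworkId: number,  // 16-bit original_network_id
  transportStreamId: number,  // 16-bit transport_stream_id
  serviceId: number,          // 16-bit service_id
  componentTagOrPid: number,  // 8-bit component_tag or 13-bit elementary_PID
  isComponentTag: boolean     // true when a stream_identifier_descriptor is present
): string {
  const hex = (value: number, digits: number) =>
    value.toString(16).toLowerCase().padStart(digits, '0');
  const cc = hex(componentTagOrPid, isComponentTag ? 2 : 4);
  return `${hex(originalNetworkId, 4)}.${hex(transportStreamId, 4)}.${hex(serviceId, 4)}.${cc}`;
}

// e.g. dvbTrackId(0x233a, 0x0001, 0x0f3d, 0x05, true) === "233a.0001.0f3d.05"
```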
Mapping Text Track content into text track cues for MPEG-2 TS
MPEG-2 transport streams may contain data that should be exposed as cues on "
captions
", "subtitles
" or "metadata
" text tracks. No data is defined that equates to "descriptions
" or "chapters
" text track cues.Metadata cues
Cues on an MPEG-2 metadata text track are created as
DataCue
objects [HTML51]. Eachsection
in an elementary stream identified as a text track creates aDataCue
object with itsTextTrackCue
attributes sourced as follows:Attribute How to source its value id
The empty string. startTime
0 endTime
The time, in the media resource timeline, that corresponds to the presentation time of the video frame received immediately prior to the section
in the media resource.pauseOnExit
" false
"data
The entire MPEG-TS section, starting with table_id
and endingsection_length
bytes after thesection_length
field.Captions cues
- CEA 708
MPEG-2 TS captions in the CEA 708 format [CEA708] are carried in the video stream in Picture User Data [ATSC53-4] for
stream_type
0x02 and in Supplemental Enhancement Information [ATSC72-1] forstream_type
0x1B. Browsers that can render the CEA 708 format should expose them in as yet to be specifiedCEA708Cue
objects. Alternatively, browsers can also map the CEA 708 features toVTTCue
objects [VTT708]. Finally, browsers that cannot render CEA 708 captions should expose them asDataCue
objects [HTML51]. In this case, eachservice_block
in a digital TV closed caption (DTVCC) transport channel creates aDataCue
object withTextTrackCue
attributes sourced as follows:Attribute How to source its value id
Decimal representation of the service_number
in theservice_block
.startTime
The time, in the HTML media resource timeline, that corresponds to the presentation time stamp for the video frame that contained the first 'Caption Channel Data Byte' of the service_block
.endTime
The sum of the startTime
and 4 seconds. Note: CEA 708 captions do not have an explicit end time - a rendering device derives the end time for a caption based on subsequent caption data. Setting
endTime
equal tostartTime
might be more appropriate but this would require better support for zero-length cues, as proposed in HTML Bug 25693.pauseOnExit
" false
"data
The service_block
DVB
MPEG-2 TS captions in the DVB subtitle format [DVB-SUB], ITU-R System B Teletext [DVB-TXT] and VBI [DVB-VBI] formats are not exposed in a
TextTrackCue
.
Subtitles cues
- SCTE 27
MPEG-2 TS subtitles in the SCTE 27 format [SCTE27] should be exposed in as yet to be specified
SCTE27Cue
objects. Alternatively, browsers can also map the SCTE 27 features toVTTCue
object via an as yet to be specified mapping process. Finally, browsers that cannot render SCTE 27 subtitles, should expose them asDataCue
objects [HTML51]. In this case, eachsection
in an elementary stream identified as a subtitles text track creates aDataCue
object withTextTrackCue
attributes sourced as follows:Attribute How to source its value id
The empty string. startTime
The time, in the HTML media resource timeline, that corresponds to the display_in_PTS
field in thesection
data.endTime
The sum of the startTime
and thedisplay_duration
field in thesection
data expressed in seconds.pauseOnExit
" false
"data
The entire MPEG-TS section, starting with table_id
and endingsection_length
bytes after thesection_length
field. DVB
MPEG-2 TS subtitles in the DVB subtitle format [DVB-SUB], ITU-R System B Teletext [DVB-TXT] and VBI [DVB-VBI] formats are not exposed in a
TextTrackCue
.
4. MPEG-4 ISOBMFF
MIME type/subtype: audio/mp4, video/mp4, application/mp4
Track Order
The order of tracks specified by TrackBox (trak) boxes in the MovieBox (moov) container [ISOBMFF] is maintained when sourcing multiple MPEG-4 tracks into HTML.
Determining the type of track
A user agent recognises and supports data from a TrackBox as being equivalent to an HTML track based on the value of the handler_type field in the HandlerBox (hdlr) of the MediaBox (mdia) of the TrackBox:
- text track:
- the
handler_type
value is "meta
", "subt
" or "text
" - the
handler_type
value is "vide
" and an ISOBMFF CEA 608 or 708 caption service is encapsulated in the video track as an SEI message as defined in [DASHIFIOP].
- the
- video track: the
handler_type
value is "vide
" - audio track: the
handler_type
value is "soun
"
- text track:
Track Attributes for sourced Text Tracks
Attribute How to source its value id
For ISOBMFF CEA 608 closed captions, the string "cc" concatenated with the decimal representation of the
channel_number
.For ISOBMFF CEA 708 closed captions, the string "sn" concatenated with the decimal representation of the
service_number
field in the 'Caption Channel Service Block'.Otherwise, the decimal representation of the
track_ID
of aTrackHeaderBox
(tkhd
) in aTrackBox
(trak
).kind
- "
captions
":- WebVTT caption:
handler_type
is "text
" andSampleEntry
format isWVTTSampleEntry
[ISO14496-30] and the VTT metadata headerKind
is "captions
" - SMPTE-TT caption:
handler_type
is "subt
" andSampleEntry
format isXMLSubtitleSampleEntry
[ISO14496-30] and thenamespace
is set to "https://www.smpte-ra.org/schemas/2052-1/2013/smpte-tt#cea708
" [SMPTE2052-11]. - An ISOBMFF CEA 608 or 708 caption service.
- 3GPP caption:
handler_type
is "text
" and theSampleEntry
code (format
field) is "tx3g
".NoteAre all sample entries of this type "
captions
"?
- WebVTT caption:
- "
subtitles
":
- WebVTT subtitle:
handler_type
is "text
" andSampleEntry
format isWVTTSampleEntry
[ISO14496-30] and the VTT metadata headerKind
is "subtitles
" - SMPTE-TT subtitle:
handler_type
is "subt
" andSampleEntry
format isXMLSubtitleSampleEntry
[ISO14496-30] and thenamespace
is set to a TTML namespace that does not indicate a SMPTE-TT caption.
- WebVTT subtitle:
- "
metadata
": otherwise
label
Content of the name
field in theHandlerBox
.language
If the track is an ISOBMFF CEA 608 or 708 caption service then the empty string (""). Otherwise, the content of the
language
field in theMediaHeaderBox
. Note: No signaling is currently defined for specifying the language of CEA 608 or 708 captions in ISOBMFF. MPEG DASH MPDs may specify caption track metadata, including language [DASHIFIOP]. The user agent should set the
language
attribute of CEA 608 or 708 caption text tracks to the empty string so that script may use the media source extensions [MSE]TrackDefault
object to provide a default for thelanguage
attribute.inBandMetadataTrackDispatchType
kind
is "metadata
":- if a
XMLMetaDataSampleEntry
box is present the concatenation of the string "metx
", a U+0020 SPACE character, and the value of thenamespace
field - if a
TextMetaDataSampleEntry
box is present the concatenation of the string "mett
", a U+0020 SPACE character, and the value of themime_format
field - otherwise the empty string
- if a
- otherwise the empty string
mode
" disabled
"
Track Attributes for sourced Audio and Video Tracks
Attribute How to source its value id
Decimal representation of the track_ID
of aTrackHeaderBox
(tkhd
) in aTrackBox
(trak
).kind
- "
alternative
": not used - "
captions
": not used - "
descriptions
"- For E-AC-3 audio [ETSI102366] if the
bsmod
field is 2 and theasvc
is 1 in theEC3SpecificBox
- For E-AC-3 audio [ETSI102366] if the
- "
main
": first audio (video) track - "
main-desc
- For AC-3 audio [ETSI102366] if the
bsmod
field is 2 in theAC3SpecificBox
- For E-AC-3 audio [ETSI102366] if the
bsmod
field is 2 and theasvc
is 0 in theEC3SpecificBox
- For AC-3 audio [ETSI102366] if the
- "
sign
": not used - "
subtitles
": not used - "
translation
": not first audio (video) track - "
commentary
": not used - "": otherwise
label
Content of the name
field in theHandlerBox
.language
Content of the language
field in theMediaHeaderBox
.
Mapping Text Track content into text track cues for MPEG-4 ISOBMFF
[ISOBMFF] text tracks may be in the WebVTT or TTML format [ISO14496-30], 3GPP Timed Text format [3GPP-TT], or other format.
[ISOBMFF] text tracks carry WebVTT data if the media handler type is "text" and a WVTTSampleEntry format is used, as described in [ISO14496-30]. Browsers that can render text tracks in the WebVTT format should expose a VTTCue [WEBVTT] as follows:
Attribute: How to source its value
id
The cue_id
field in theCueIDBox
.startTime
The sample presentation time. endTime
The sum of the startTime
and the sample duration.pauseOnExit
" false
"cue setting attributes The settings
field in theCueSettingsBox
.text
The cue_text
field in theCuePayloadBox
.ISOBMFF text tracks carry TTML data if the media handler type is "
subt
" and an XMLSubtitleSampleEntry
format is used with a TTML-basedname_space
field, as described in [ISO14496-30]. Browsers that can render text tracks in the TTML format should expose an as yet to be definedTTMLCue
. Alternatively, browsers can also map the TTML features toVTTCue
objects. Finally, browsers that cannot render TTML [ttaf1-dfxp] format data should expose them asDataCue
objects [HTML51]. Each TTML subtitle sample consists of an XML document and creates aDataCue
object with attributes sourced as follows:Attribute How to source its value id
Decimal representation of the id
attribute of thehead
element in the XML document. Null if there is noid
attribute.startTime
Value of the beginning media time of the top-level temporal interval of the XML document. endTime
Value of the ending media time of the top-level temporal interval of the XML document. pauseOnExit
" false
"data
The (UTF-16 encoded) ArrayBuffer
composing the XML document.TTML data may contain tunneled CEA708 captions [SMPTE2052-11]. Browsers that can render CEA708 data should expose it as defined for MPEG-2 TS CEA708 cues.
3GPP timed text data is carried in [ISOBMFF] as described in [3GPP-TT]. Browsers that can render text tracks in the 3GPP Timed Text format should expose an as yet to be defined 3GPPCue. Alternatively, browsers can also map the 3GPP features to VTTCue objects.
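Across the cases above where content is exposed as DataCue objects [HTML51] (for example TTML samples a browser cannot render natively), script can consume the cues as sketched below. The selector, track selection and UTF-16 decoding follow the attribute tables above but are not mandated by this specification:

```ts
// Sketch: read cues that the UA exposes as DataCue objects on an in-band
// metadata text track.
const video = document.querySelector('video') as HTMLVideoElement;
const track = Array.from(video.textTracks).find(t => t.kind === 'metadata');

if (track) {
  track.mode = 'hidden'; // sourced tracks start as "disabled"; "hidden" still fires cue events
  track.addEventListener('cuechange', () => {
    const cues = track.activeCues;
    for (let i = 0; cues && i < cues.length; i++) {
      const cue = cues[i] as any; // DataCue, where supported
      const xml = new TextDecoder('utf-16').decode(cue.data); // UTF-16 per the tables above
      console.log(cue.startTime, cue.endTime, xml);
    }
  });
}
```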
5. WebM
MIME type/subtype: audio/webm, video/webm
Track Order
The order of tracks specified in the EBML initialisation segment [WebM] is maintained when sourcing multiple WebM tracks into HTML.
Determining the type of track
A user agent recognises and supports data from a WebM resource as being equivalent to an HTML track based on the value of the TrackType field of the track in the Segment info:
- text track:
TrackType
field is "0x11" or "0x21" - video track:
TrackType
field is "0x01" - audio track:
TrackType
field is "0x02"
- text track:
Track Attributes for sourced Text Tracks
WebM has defined how to store WebVTT [WEBVTT] files in WebM [WebM][WEBVTT-WEBM]. Sourcing text tracks from WebM differs between chapter tracks and tracks of other kinds, as explained below the table.
Attribute How to source its value id
Decimal representation of the TrackNumber
field of the track in theTrack
section of the WebM file Segment.kind
Map the content of the
TrackType
andCodecID
fields of the track as follows:- "
captions
":TrackType
is "0x11" andCodecId
is "D_WEBVTT/captions
" - "
subtitles
":TrackType
is "0x11" andCodecId
is "D_WEBVTT/subtitles
" - "
descriptions
":TrackType
is "0x11" andCodecId
is "D_WEBVTT/descriptions
" - "
metadata
": otherwise
label
Content of the name
field of the track.language
Content of the language
field of the track.inBandMetadataTrackDispatchType
If kind
is "metadata
", then the value of theCodecID
element. The empty string otherwise.mode
" disabled
Tracks of kind "chapters" are found in the "Chapters" section of the WebM file Segment, which is at the beginning of the WebM file, such that chapters can be used for navigation. The details of this mapping have not been specified yet and simply point to the more powerful Matroska chapter specification [Matroska]. Presumably, the id attribute could be found in EditionUID, label is empty, and language can come from the first ChapterAtom's ChapLanguage value.
Note: The Matroska container format, which is the basis for WebM, has specifications for other text tracks, in particular SRT, SSA/ASS, and VOBSUB. The described attribute mappings can be applied to these, too, except that the kind field will always be "subtitles". The information in their CodecPrivate field is exposed in the inBandMetadataTrackDispatchType attribute.
Track Attributes for sourced Audio and Video Tracks
Attribute How to source its value id
Decimal representation of the TrackNumber
field of the track in the Segment info.kind
- "
alternative
": not used - "
captions
": not used - "
descriptions
": not used - "
main
": theFlagDefault
element is set on the track - "
main-desc
": not used - "
sign
": not used - "
subtitles
": not used - "
translation
": not first audio (video) track - "
commentary
": not used - "": otherwise
label
Content of the name
field of the track in the Segment info.language
Content of the language
field of the track in the Segment info.
Mapping Text Track content into text track cues
The only type of text track defined for WebM is the WebVTT format [WEBVTT-WEBM]. Therefore, cues on a text track are created as VTTCue objects [WEBVTT]. Each Block in the BlockGroup of the WebM track that holds the actual data of the text track creates a VTTCue object with its TextTrackCue attributes sourced as follows:
Attribute: How to source its value
id
First line of the Block's data. startTime
Calculated from the BlockTimecode
field in the Block's header and theTimecode
field in the Cluster relative to whichBlockTimecode
is specified.endTime
Calculated from the BlockDuration
field in the Block's header and the startTime
.pauseOnExit
" false
"cue setting attributes Parsed from the second line of the Block's data. text
The third and all following lines of the Block's data. Note: Other Matroska container format text tracks can also be mapped to
TextTrackCue
objects. These will be created asDataCue
objects [HTML51] withid
,startTime
,endTime
, andpauseOnExit
attributes filled identically to theVTTCue
objects, and thedata
attribute containing the Block's data.
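As an illustration of the startTime and endTime derivation above, a sketch of the arithmetic on already-parsed container values (WebM expresses Cluster and Block timestamps in units of the Segment's TimecodeScale, 1,000,000 nanoseconds by default; the function name is illustrative):

```ts
// Sketch: derive cue times (in seconds) for a WebVTT Block in a WebM Cluster.
// Inputs are raw container values; names here are descriptive, not field names
// from any particular parsing library.
function webmCueTimes(
  clusterTimecode: number,            // Cluster's Timecode
  blockTimecode: number,              // Block's (relative) BlockTimecode
  blockDuration: number,              // BlockDuration
  timecodeScale: number = 1_000_000   // nanoseconds per timecode unit (WebM default)
): { startTime: number; endTime: number } {
  const toSeconds = (t: number) => (t * timecodeScale) / 1e9;
  const startTime = toSeconds(clusterTimecode + blockTimecode);
  const endTime = startTime + toSeconds(blockDuration);
  return { startTime, endTime };
}

// e.g. webmCueTimes(10_000, 40, 2_000) -> { startTime: 10.04, endTime: 12.04 }
```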
6. Ogg
MIME type/subtype: audio/ogg, video/ogg
Track Order
The order of tracks specified in the Skeleton fisbone headers [OGGSKELETON] is maintained when sourcing multiple Ogg tracks into HTML. If no Skeleton track is available, the order of the "beginning of stream" (BOS) pages determines track order [OGG].
Determining the type of track
A user agent recognises and supports data from an Ogg resource as being equivalent to an HTML track based on the value of the Role field of the fisbone header in Ogg Skeleton:
- text track:
Role
starts with "text
" - video track:
Role
starts with "video
" - audio track:
Role
starts with "audio
"
If no Skeleton track is available, determine the type based on the codec used in the BOS pages, e.g. Vorbis is an audio track and Theora is a video track.
- text track:
Track Attributes for sourced Text Tracks
Attribute How to source its value id
Content of the name
message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.kind
Map the content of the
Role
message header fields of Ogg Skeleton as follows:- "
captions
":Role
is "text/captions
" - "
subtitles
":Role
is "text/subtitle
" or "text/karaoke
" - "
descriptions
":Role
is "text/textaudiodesc
" - "
chapters
":Role
is "text/chapters
" - "
metadata
": otherwise
label
Content of the title
message header field of the fisbone header. If no Skeleton header is available, the empty string.language
Content of the language
message header field of the fisbone header. If no Skeleton header is available, the empty string.inBandMetadataTrackDispatchType
If kind
is "metadata
", then the value of theRole
header field. The empty string otherwise.mode
" disabled
"
Track Attributes for sourced Audio and Video Tracks
Attribute How to source its value id
Content of the name
message header field of the fisbone header in Ogg Skeleton. If no Skeleton header is available, use a decimal representation of the stream's serialnumber as given in the BOS.kind
Map the content of the
Role
message header fields of Ogg Skeleton as follows:- "
alternative
":Role
is "audio/alternate
" or "video/alternate
" - "
captions
":Role
is "video/captioned
" - "
descriptions
":Role
is "audio/audiodesc
" - "
main
":Role
is "audio/main
" or "video/main
" - "
main-desc
":Role
is "audio/described
" - "
sign
":Role
is "video/sign
" - "
subtitles
":Role
is "video/subtitled
" - "
translation
":Role
is "audio/dub
" - "
commentary
":Role
is "audio/commentary
" - "": otherwise
label
Content of the title
message header field of the fisbone header. If no Skeleton header is available, the empty string.language
Content of the language
message header field of the fisbone header. If no Skeleton header is available, the empty string.
Mapping Text Track content into text track cues
TBD
A. Acknowledgements
Thanks to all In-band Track Community Group members for helping to create this specification.
Thanks also to the WHATWG and W3C HTML WG where a part of this specification originated.
B. References
B.1 Informative references
- [3GPP-TT]
- Transparent end-to-end Packet switched Streaming Service (PSS) Timed text format (Release 12). URL: https://www.3gpp.org/ftp/Specs/archive/26_series/26.245/26245-c00.zip
- [ATSC52]
- Digital Audio Compression (AC-3, E-AC-3). 17 December 2012. URL: https://www.atsc.org/cms/standards/A52-2012(12-17).pdf
- [ATSC53-4]
- MPEG-2 Video System Characteristics. 7 August 2009. URL: https://www.atsc.org/cms/standards/a53/a_53-Part-4-2009.pdf
- [ATSC65]
- Program and System Information Protocol for Terrestrial Broadcast and Cable. 7 August 2013. URL: https://www.atsc.org/cms/standards/A65_2013.pdf
- [ATSC72-1]
- Video System Characteristics of AVC in the ATSC Digital Television System. 18 February 2014. URL: https://www.atsc.org/cms/standards/a72/A72-Part-1-2014.pdf
- [CEA708]
- Digital Television (DTV) Closed Captioning CEA-708-B. URL: https://www.ce.org/Standards/Standard-Listings/R4-3-Television-Data-Systems-Subcommittee/CEA-708-D.aspx
- [DASHIFIOP]
- Guidelines for Implementation: DASH-IF Interoperability Points. 7 April 2015. Version 3.0 (Final Version). URL: https://dashif.org/w/2015/04/DASH-IF-IOP-v3.0.pdf
- [DVB-SI]
- ETSI EN 300 468: "Digital Video Broadcasting (DVB); Specification for Service Information (SI) in DVB systems". URL: https://www.etsi.org/deliver/etsi_en/300400_300499/300468/01.14.01_60/en_300468v011401p.pdf
- [DVB-SUB]
- ETSI EN 300 743: "Digital Video Broadcasting (DVB); Subtitling systems". URL: https://www.etsi.org/deliver/etsi_en/300700_300799/300743/01.05.01_60/en_300743v010501p.pdf
- [DVB-TXT]
- ETSI EN 300 472: "Digital Video Broadcasting (DVB); Specification for conveying ITU-R System B Teletext in DVB bitstreams". URL: https://www.etsi.org/deliver/etsi_en/300400_300499/300472/01.03.01_60/en_300472v010301p.pdf
- [DVB-VBI]
- ETSI EN 301 775: "Digital Video Broadcasting (DVB); Specification for the carriage of Vertical Blanking Information (VBI) data in DVB bitstreams". URL: https://www.etsi.org/deliver/etsi_en/301700_301799/301775/01.02.01_60/en_301775v010201p.pdf
- [ETSI102366]
- Digital Audio Compression (AC-3, Enhanced AC-3) Standard v1.3.1. URL: https://www.etsi.org/deliver/etsi_ts/102300_102399/102366/01.03.01_60/ts_102366v010301p.pdf
- [HTML]
- Ian Hickson. HTML. Living Standard. URL: https://html.spec.whatwg.org/
- [HTML5]
- Ian Hickson; Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Edward O'Connor; Silvia Pfeiffer. HTML5. 28 October 2014. W3C Recommendation. URL: https://www.w3.org/TR/html5/
- [HTML51]
- Ian Hickson; Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Edward O'Connor; Tab Atkins Jr.; Simon Pieters; Yoav Weiss; Marcos Caceres; Mathew Marquis. HTML 5.1. 17 April 2015. W3C Working Draft. URL: https://www.w3.org/TR/html51/
- [ISO14496-30]
- Information technology — Coding of audio-visual objects — Part 30: Timed text and other visual overlays in ISO base media file format. 11 March 2014. URL: https://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=63107
- [ISOBMFF]
- Information technology -- Coding of audio-visual objects -- Part 12: ISO base media file format ISO/IEC 14496-12:2012. URL: https://standards.iso.org/ittf/PubliclyAvailableStandards/c061988_ISO_IEC_14496-12_2012.zip
- [MPEG2TS]
- Information technology -- Generic coding of moving pictures and associated audio information: Systems ITU-T Rec. H.222.0 / ISO/IEC 13818-1:2013. URL: https://www.itu.int/rec/T-REC-H.222.0-201206-I
- [MPEGDASH]
- ISO/IEC 23009-1:2014 Information technology -- Dynamic adaptive streaming over HTTP (DASH) -- Part 1: Media presentation description and segment formats. URL: https://standards.iso.org/ittf/PubliclyAvailableStandards/c065274_ISO_IEC_23009-1_2014.zip
- [MSE]
- Aaron Colwell; Adrian Bateman; Mark Watson. Media Source Extensions. 31 March 2015. W3C Candidate Recommendation. URL: https://www.w3.org/TR/media-source/
- [Matroska]
- Matroska Specifications. 9 January 2014. URL: https://matroska.org/technical/specs/index.html
- [OGG]
- S. Pfeiffer. The Ogg Encapsulation Format Version 0. May 2003. Informational. URL: https://tools.ietf.org/html/rfc3533
- [OGGSKELETON]
- Ogg Skeleton 4 Message Headers. 17 March 2014. URL: https://wiki.xiph.org/SkeletonHeaders
- [SCTE128-1]
- ANSI/SCTE 128-1 2013 AVC Constraints for Cable Television Part 1- Coding. URL: https://www.scte.org/documents/pdf/Standards/ANSI_SCTE%20128-1%202013.pdf
- [SCTE193-2]
- SCTE 193-2 2014 MPEG-4 AAC Family Audio System – Part 2 Constraints for Carriage over MPEG-2 Transport. URL: https://www.scte.org/documents/pdf/standards/SCTE%20193-2%202014.pdf
- [SCTE27]
- Subtitling Methods For Broadcast Cable. URL: https://www.scte.org/documents/pdf/Standards/ANSI_SCTE_27_2011.pdf
- [SMPTE2052-11]
- Conversion from CEA-708 Caption Data to SMPTE-TT. URL: https://www.smpte.org/sites/default/files/RP2052-11-2013.pdf
- [VTT708]
- Silvia Pfeiffer. Conversion of 608/708 captions to WebVTT. Draft Community Group Report. URL: https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
- [WEBVTT]
- Silvia Pfeiffer; Philip Jägenstedt; Ian Hickson. WebVTT: The Web Video Text Tracks Format. 16 May 2014. W3C Editor's Draft. URL: https://dev.w3.org/html5/webvtt/
- [WEBVTT-WEBM]
- Matthew Heaney; Frank Galligan. Embedding WebVTT in WebM. 1 February 2012. URL: https://wiki.webmproject.org/webm-metadata/temporal-metadata/webvtt-in-webm
- [WebM]
- WebM Container Guidelines. 28 April 2014. URL: https://www.webmproject.org/code/specs/container/
- [ttaf1-dfxp]
- Glenn Adams. Timed Text Markup Language (TTML) 1.0 (Second Edition). 9 July 2013. W3C Proposed Edited Recommendation. URL: https://www.w3.org/TR/ttaf1-dfxp/