Standards: Part 24 - Timed-text & Subtitles Overview

Timed-text carriage must be closely synchronized to the AV stream so that subtitles are presented at the right moment. Here we describe the standards that enable this for both broadcast and internet delivery.


This article is part of our growing series on Standards.
There is an overview of all 26 articles in Part 1 - An Introduction To Standards.


Accessibility in broadcasting is important and often mandated by regulation. Various techniques make media available to viewers with hearing impairments or viewers who need a foreign language translation.

The fundamental concept is very simple, but the wide variety of implementations, languages and delivery systems makes it extremely complex.

Implementation

Subtitles for the Deaf and Hard of Hearing (SDH) must be delivered in a timely fashion and carefully composed so that the intent and meaning of the spoken word are preserved accurately.

Captions (closed or otherwise) transcribe the dialogue and describe significant sounds for viewers who cannot hear the audio, whereas ordinary subtitles assume the viewer can hear and typically provide a translation of foreign language speech. The terms are often used interchangeably, but they are not the same thing.

Synchronized text can be implemented in several different ways:

  • Hard - Text rendered and burned into the video. It cannot be hidden, edited or altered.
  • Pre-rendered - The subtitles are encoded as a secondary video track that is overlaid on top of the main program material. There may be several alternatives to cater for multiple languages.
  • Soft - Delivered as timed-text data fragments synchronized to the program. These are highly flexible and can be styled and positioned using controls in the player.

Conventions

Various conventions have evolved to make timed-text more useful. For example:

  • Narrative subtitles transcribe the dialogue spoken by the actors in the scene. This is sometimes abridged but it should always convey the correct meaning.
  • Different text colors identify multiple speakers.
  • Positioning the text to avoid obscuring important visual detail or to indicate which on-screen character is speaking.
  • Enclosing text in square brackets when it refers to sound effects rather than spoken words.
  • Translations of dialogue spoken in a foreign language.
  • Translations of onscreen foreign language text.
  • Bonus texts or iconography provided as special features, perhaps accompanying a director's commentary audio track.
  • Forced narrative subtitles appear even when subtitles are turned off. They convey important information to the viewer when it is essential to the plot. For example, when a character speaks a phrase in a foreign language.

The Important Standards

Ancillary timed-text tracks are based on very low bitrate streams. These are the primary standards:

  • AES18 - Embedding ancillary metadata within AES3 audio streams.
  • DVB - Refer to DVB-TXT for historical details of how analog Teletext data is supported on Digital TV platforms. Consult DVB-SUB and DVB-TTML for more recent delivery of text services on broadcast platforms.
  • ISO 14496-17 - Synchronized text streams for delivery in MPEG transport containers for 3GPP timed-text applications.
  • ISO 14496-30 - An update that adds TTML and WebVTT support.
  • ISO 23001-10 - Carriage of timed metadata metrics in the ISO Base Media File Format.
  • ISO 23001-18 - Extends timed-text to pass triggering events in the ISOBMFF.
  • ST 291-1 - Describes ancillary data in an SDI TV signal. This is where classic Teletext data is carried. Relevant when ingesting old analog material from the archives.
  • ST 2110-40 - Describes how to carry ST 291-1 ancillary data synchronized to an ST 2110-10 foundation for timing and control.
  • ST 2110-43 - Describes how to carry Timed Text Markup Language (TTML) on an ST 2110-10 foundation, specifically using RTP.

Timed-text Formats

Wikipedia lists over twenty different timed-text formats, most of which have niche applications. Two principal formats have become popular in recent times:

  • TTML - Timed Text Markup Language.
  • WebVTT - Web Video Text Tracks (based on the earlier SRT format).

TTML is popular in broadcast workflows and has been described in standards from ISO, SMPTE and W3C.

WebVTT is more popular for delivering subtitles to Internet web browsers. It is easily created by processing TTML in your workflow. WebVTT is easy to author or edit manually if necessary. It is also described in ISO and W3C standards.

Timed Text Markup Language - TTML

Timed Text Markup Language (TTML) is an XML format originally standardized by W3C. Use an XML parser to import these TTML files and reconstruct the object graph inside a workflow process.

TTML is widely used by the broadcast industry to exchange subtitle information. This allows the same source content to be authored once and used for broadcast TV and web streamed media. It is not well supported in web browsers and must be converted for deployment.
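
As a sketch of what that conversion involves, this hypothetical JavaScript routine uses the browser's built-in DOMParser to read simple TTML cues like those shown later in this article and emit WebVTT. A production converter must also handle styles, regions, <br/> line breaks and the many other TTML time expressions.

// Hypothetical sketch: convert simple TTML <p> cues to WebVTT text.
// Styling, regions, nested spans and line breaks are ignored for brevity.
function ttmlToWebVTT(ttmlSource) {
  const doc = new DOMParser().parseFromString(ttmlSource, "application/xml");
  const paragraphs = doc.getElementsByTagNameNS("http://www.w3.org/ns/ttml", "p");
  const output = ["WEBVTT", ""];
  for (const p of paragraphs) {
    output.push(`${toTimestamp(p.getAttribute("begin"))} --> ${toTimestamp(p.getAttribute("end"))}`);
    output.push(p.textContent.trim(), "");
  }
  return output.join("\n");
}

// Format a TTML offset such as "3.45s" as a WebVTT timestamp
// "00:00:03.450". Other TTML time expressions are not handled.
function toTimestamp(offset) {
  const total = parseFloat(offset);               // drops the trailing "s"
  const hh = String(Math.floor(total / 3600)).padStart(2, "0");
  const mm = String(Math.floor((total % 3600) / 60)).padStart(2, "0");
  const ss = (total % 60).toFixed(3).padStart(6, "0");
  return `${hh}:${mm}:${ss}`;
}

Applied to the body example later in this section, this would emit a cue reading "It seems a paradox, does it not," between 00:00:00.760 and 00:00:03.450.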

TTML was originally based on the Synchronized Multimedia Integration Language (SMIL) standards. The first edition was completed in 2010 and it was revised and renamed TTML1 in 2013 while TTML2 was being developed. A third edition of TTML1 was published in 2018, the same year that TTML2 was released. TTML1 and TTML2 are both very large standards. TTML2 adds more sophisticated support for Asian languages.

Most deployments use only a fraction of the TTML functionality, so profiles are defined to simplify the standards for specific applications. The W3C describes more than twenty different profiles for TTML1 and TTML2 in a registry:

https://www.w3.org/TR/ttml-profile-registry/

The carriage of TTML in MPEG streams and files is described in Section 5 of ISO 14496-30.

SMPTE developed an extended superset of TTML1 (SMPTE-TT) described in ST 2052-1, which the FCC declared a safe harbor format for storing timed-text sequences.

TTML is also supported by DVB for Digital TV applications.

Inside TTML

A TTML document is constructed within a <tt> tag container:

<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml">
  <head>
    <metadata/>
    <styling/>
    <layout/>
  </head>
  <body/>
</tt>

There are two main sections: a header block (<head>) containing metadata, styling and layout details, and a body block (<body>) containing the subtitle texts.

The <metadata/> placeholder in the header expands like this:

<metadata xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
  <ttm:title>Timed-text TTML Example</ttm:title>
  <ttm:copyright>The Authors (c) 2006</ttm:copyright>
</metadata>

Metadata lives in a separate namespace and in this example carries a title and copyright text.

The <styling/> placeholder in the header expands like this:

<styling xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <!-- s1 specifies default color, font, and text alignment -->
  <style xml:id="s1"
    tts:color="white"
    tts:fontFamily="proportionalSansSerif"
    tts:fontSize="22px"
    tts:textAlign="center"
  />
  <!-- alternative yellow text otherwise same as style s1 -->
  <style xml:id="s2" style="s1" tts:color="yellow"/>
  <!-- a style based on s1 but justified to the right -->
  <style xml:id="s1Right" style="s1" tts:textAlign="end" />
  <!-- a style based on s2 but justified to the left -->
  <style xml:id="s2Left" style="s2" tts:textAlign="start" />
</styling>

The styling information exists in a separate namespace and constructs style sets in a similar way to CSS, although the format and syntax are different. Cascading is supported: in this example, styles s2, s1Right and s2Left each reference an earlier style definition and override a single property.

Layout control exists in the same namespace as styling and the <layout/> placeholder in the header is expanded like this:

<layout xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <region xml:id="subtitleArea"
    style="s1"
    tts:extent="560px 62px"
    tts:padding="5px 3px"
    tts:backgroundColor="black"
    tts:displayAlign="after"
  />
</layout>

The extent describes the rectangle where the text is drawn. Padding and background color describe the appearance of the container, while displayAlign="after" anchors the text towards the bottom of the region.

The <body/> block contains text messages for the entire program. Only the first two are shown here to save space. A real text stream would have a much longer <body> section:

<body region="subtitleArea">
  <div>
    <p xml:id="subtitle1" begin="0.76s" end="3.45s">
      It seems a paradox, does it not,
    </p>
    <p xml:id="subtitle2" begin="5.0s" end="10.0s">
      that the image formed on<br/>
      the Retina should be inverted?
    </p>
  </div>
</body>

The times (relative to the program start) when the text should appear and disappear are described in decimal notation measured in seconds. This is independent of the video frame rate and will survive transcoding between 25fps and 30fps systems.
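
For example, a cue beginning at 0.76s lands on frame 19 at 25fps (0.76 × 25) and on frame 22.8, which rounds to 23, at 30fps. A hypothetical JavaScript helper makes the mapping explicit:

// Map a TTML offset such as "0.76s" to the nearest frame at a
// given frame rate (hypothetical helper, not part of any standard).
function offsetToFrame(offset, fps) {
  return Math.round(parseFloat(offset) * fps);
}

offsetToFrame("0.76s", 25);   // 19
offsetToFrame("0.76s", 30);   // 23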

About Subtitle Resource Track (SRT) Files

The SubRip application was released for Windows in 2000. This remarkable app was designed to extract subtitle texts from media files. Extracting embedded text streams would have been easy, but SubRip instead used optical character recognition to 'read' text that was burned into the video.

The times when each subtitle appears and is removed are logged and stored with the recognized text. The output is an .srt file with a simple format. Here is a very short example:

1
00:03:17,512 --> 00:03:20,386
Captain, the enemy just fired
a torpedo at the aircraft carrier.

2
00:03:29,382 --> 00:03:31,629
Carry on, bosun.

3
00:04:14,346 --> 00:04:16,277
Direct hit sir.

The SRT abbreviation stands for Subtitle Resource Track.
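
The format is simple enough that a minimal parser is easy to sketch. This hypothetical JavaScript routine assumes well-formed input and ignores real-world quirks such as byte-order-marks and styling tags:

// Split the file into blank-line separated blocks, then read the
// index, the timecode line and the remaining text lines of each.
function parseSRT(source) {
  return source.trim().split(/\r?\n\r?\n+/).map((block) => {
    const lines = block.split(/\r?\n/);
    const [start, end] = lines[1].split(" --> ").map(srtTime);
    return {
      index: parseInt(lines[0], 10),
      start,
      end,
      text: lines.slice(2).join("\n"),
    };
  });
}

// Convert "00:03:17,512" to seconds. Note the comma decimal separator.
function srtTime(timecode) {
  const [h, m, s] = timecode.replace(",", ".").split(":");
  return Number(h) * 3600 + Number(m) * 60 + Number(s);
}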


Note that the term SRT used here describes a file format for storing timed-text subtitles. This is not to be confused with Secure Reliable Transport which uses the same acronym to describe a streaming technology.


Web Video Text Tracks - WebVTT

The WHATWG group develops living web standards such as HTML, working alongside the W3C. In 2010, it considered TTML and SRT as candidates for carrying timed-text.

WHATWG and W3C chose SRT as a starting point and created WebSRT (soon renamed WebVTT). Media players access this via the <track> tag, a child of the <video> tag inside web pages.

Because the <video>, <source> and <track> tags are all first-class citizens in an HTML5 web browser, they are well supported in dynamic HTML pages by JavaScript, DOM and CSS.

When the play-head traverses a time-coded text item in the viewing application, it triggers a JavaScript event. An event handler captures the event and calls the active code into action. This might present a subtitle text or completely alter the appearance and content of the user interface.
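
A minimal sketch of this model, assuming a page containing a <video> element with at least one text track and a hypothetical #caption overlay element:

// Render the active cue's text into a caption element whenever the
// play-head enters or leaves a cue on the first text track.
const video = document.querySelector("video");
const track = video.textTracks[0];        // e.g. the English subtitles
track.mode = "hidden";                    // fire events without native rendering

track.addEventListener("cuechange", () => {
  const cue = track.activeCues[0];        // undefined between cues
  // "#caption" is a hypothetical overlay element in the page.
  document.querySelector("#caption").textContent = cue ? cue.text : "";
});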

WebVTT files are well documented in freely accessible W3C standards, which define the fundamental syntax. The Mozilla (MDN) reference library provides a document describing how to apply WebVTT files, including the browser versions that support each feature.

Inside WebVTT

The format is similar to SRT with a few minor differences which make WebVTT files incompatible with older (SRT only) players.

Here is a simplified example of a WebVTT file:

WEBVTT

00:21.000 --> 00:23.000
<v First Speaker>Spoken text 1

00:40.500 --> 00:42.500 align:left size:50%
<v Second Speaker>Spoken text 2

00:42.000 --> 00:45.500 align:right size:50%
<v First Speaker>Spoken text 3

00:42.500 --> 00:43.500 align:left size:50%
<v Second Speaker>Spoken text 4

These files use Unicode text, which makes localization for foreign languages very easy.

An optional Unicode UTF-8 byte-order-mark can be placed at the start of the file.

The first line of content is the WEBVTT format identifier. This must always be present.

An optional header section can be placed after the format identifier and before the first timed-text cue. This can carry notes and style definitions before the text begins.
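
For example, a header can carry comments in NOTE blocks and styling in STYLE blocks (both defined by the WebVTT specification, although browser support for STYLE varies):

WEBVTT - example header

NOTE
This comment block is ignored by the player.

STYLE
::cue {
  background-color: black;
  color: yellow;
}

00:00:01.000 --> 00:00:04.000
The first cue follows the header.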

Each text cue is preceded by the start and end timecode values and optional size and alignment settings. Note that the timecode format differs slightly from SRT: a full stop replaces the comma as the decimal separator and the hours field is optional.

The text cues can carry a small set of HTML-style markup tags for styling and structure. In the example, the speakers are identified with the <v> tag at the start of the line. HTML character entities are supported for displaying complex Unicode characters (provided the player supports the necessary glyphs). Embedded metadata using JSON formatting rules is also supported.
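
Metadata cues are never displayed directly; they are usually delivered in a separate track with kind="metadata" and read from script. A hypothetical example cue:

00:10.000 --> 00:15.000
{"event": "chapterStart", "title": "Introduction"}

A cuechange handler like the one shown earlier can recover the object with JSON.parse(cue.text).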

Each text cue is separated from the next with a blank line.

WebVTT is supported in recent versions of all browsers compatible with HTML5.

Multiple <track> tags using .vtt files as their source are supported as child elements of the containing <video> tag. These can carry different kinds of information such as chapter marks, descriptions and captions (which are different to subtitles). Localization support for different languages is easy to implement with multiple <track> tags.

<video controls src="video.mp4">
  <track kind="subtitles"    src="subtitles_en.vtt" srclang="en" />
  <track kind="subtitles"    src="subtitles_de.vtt" srclang="de" />
  <track kind="captions"     src="captions.vtt"     srclang="en" />
  <track kind="descriptions" src="descriptions.vtt" srclang="en" />
  <track kind="chapters"     src="chapters.vtt"     srclang="en" />
</video>
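
Script can switch between these tracks at runtime by setting each track's mode property (one of disabled, hidden or showing). A short sketch, assuming the markup above:

// Show the German subtitles and disable every other text track.
const tracks = document.querySelector("video").textTracks;
for (let i = 0; i < tracks.length; i++) {
  const track = tracks[i];
  track.mode = (track.kind === "subtitles" && track.language === "de")
    ? "showing"
    : "disabled";
}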

Styling can be defined inside the WebVTT file or in separate CSS stylesheets. The CSS support is complete on Firefox and Safari browsers.
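
In a separate stylesheet, cues are addressed with the ::cue pseudo-element, optionally scoped to the voice tags carried in the file, as in this sketch:

video::cue {
  background-color: rgba(0, 0, 0, 0.8);
  color: white;
}

/* Style one speaker's lines, matching the <v> tags shown earlier. */
video::cue(v[voice="First Speaker"]) {
  color: yellow;
}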

File Extensions

Subtitles created for TV broadcast are stored in a variety of file formats. The majority of these formats are proprietary to the vendors of subtitle insertion systems. These are the generic file extensions that are preferred for TTML, SRT and WebVTT files:

  • .dfxp - Distribution Format Exchange Profile. An earlier nomenclature used for TTML.
  • .mks - Matroska container file containing subtitle texts.
  • .srt - SubRip SRT files.
  • .ttml - TTML files.
  • .vtt - WebVTT files.
  • .xml - Since TTML is fundamentally an XML format, this file extension can also be used.

Relevant Standards

All of these standards were referred to by the source material used to compile this article. Add them to your library of reference documents to facilitate your TTML and WebVTT deployments.

  • AES18 (1996) - Format for the user data channel of the AES digital audio interface. Minor revisions to the text were made in 2019.
  • ASCII (1986) - Originally designed for use with Teletype terminals. It forms the basis of many character sets including Unicode. The first 128 characters of many code sets, defined by 7-bit values, are equivalent to the ASCII standard.
  • CTA-708-E (2023) - ANSI closed captioning specification used for ATSC digital TV services.
  • EN 300 743 (2018) - DVB Subtitling systems.
  • EN 303 560 (2018-05) - DVB-TTML subtitling systems.
  • ISO 639 (2023) - Codes for individual languages and language groups. Earlier separate parts are now merged into one standard.
  • ISO 639-2 (1998) - Three-letter codes for the representation of names of languages. The multiple parts have since been merged into a single standard (ISO 639).
  • ISO 8859 (1999-2003) - 8-bit single-byte coded graphic character sets, with multiple parts describing different language sub-sets. Superseded by Unicode.
  • ISO 14496-1 (2010) - MPEG-4 Systems layer.
  • ISO 14496-12 (2022) - MPEG-4 ISO Base Media File Format (ISOBMFF).
  • ISO 14496-17 (2006) - MPEG-4 Streaming Text Format for 3GPP.
  • ISO 14496-18 (2004) - MPEG-4 font compression and streaming. Corrected in 2007 and amended in 2014.
  • ISO 14496-30 (2018) - MPEG-4 carriage of timed-text and other visual overlays in the ISO Base Media File Format.
  • ISO 15444-3 (2007) - Motion JPEG 2000 file format. Based on the ISOBMFF.
  • LMT (in progress) - Language Metadata Table devised by MESA and adopted by SMPTE.
  • QTFF (2016) - Originally designed in 1991; the latest version was published in 2016. For practical purposes, ISOBMFF is probably the better alternative.
  • RFC 3550 (2003) - RTP: A Transport Protocol for Real-Time Applications.
  • RFC 3640 (2003) - RTP payload for transport of generic MPEG-4 content.
  • RFC 3986 (2005) - Uniform Resource Identifier (URI): Generic Syntax.
  • RFC 4396 (2006) - RTP payload for 3GPP timed-text.
  • RFC 5691 (2009) - RTP payload format for elementary streams with MPEG Surround multi-channel audio.
  • RFC 6381 (2011) - The 'Codecs' and 'Profiles' parameters for 'Bucket' media types (MIME types).
  • RFC 8141 (2017) - Uniform Resource Names (URNs).
  • RFC 8216 (2017) - An informational document describing HTTP Live Streaming, including MPEG-2 streams with embedded timed-text stored in ID3 tags. A useful reference source but not endorsed by the IETF as a formal standard.
  • SCTE 128-1 (2020) - AVC Video Constraints for Cable Television, Part 1 - Coding. Published by ANSI.
  • SRT (2000) - SubRip subtitle files. Informally standardized but widely used and supported by most players.
  • ST 291-1 (2011) - Ancillary Data Packet and Space Formatting. The document was renumbered in 2013 but the content was unchanged.
  • ST 2052-1 (2013) - Timed-Text Format (SMPTE-TT). Refer to the other parts of ST 2052 for converting other formats.
  • ST 2110-10 (2022) - System Timing and Definitions.
  • ST 2110-40 (2023) - Transport of SMPTE ST 291-1 Ancillary Data.
  • ST 2110-43 (2021) - Transport of Timed Text Markup Language for captions and subtitles in systems conforming to SMPTE ST 2110-10.
  • TS 26.245 (2024-05) - Timed-text delivery specification published by 3GPP. This is described in ISO 14496-17 section 7.
  • TS 126.245 (2024-05) - ETSI-published version of TS 26.245.
  • TTML (2018) - Third edition of the W3C Recommendation, Timed Text Markup Language. Used without a numeric suffix, this describes TTML1.
  • TTML2 (2018) - A revised version of TTML referred to as version 2.0.
  • Unicode (2024-09) - Version 16 of the Unicode standard, describing a total of 154,998 character glyphs.
  • UTF-8 (see Unicode) - Coding scheme for representing Unicode multi-byte characters as 8-bit character strings.
  • WebVTT (2019) - Candidate Recommendation for WebVTT: The Web Video Text Tracks Format.
  • Windows-1252 (1998) - A legacy superset of the ISO 8859 and ASCII character sets used on the Microsoft Windows platform. Superseded by Unicode.

Conclusion

TTML is ideal for production workflows. It can be transformed into other formats for deployment to broadcast head ends and web serving platforms.

WebVTT is well integrated with the HTML5 environment and easy to use in web pages. Implementing subtitles in web-based media players is not difficult. The JavaScript event driven model provides a powerful framework for implementing dynamic content in web pages.

Using timed-text streams for subtitles is merely a starting point. There is a huge opportunity to introduce other more compelling ideas to enhance accessibility and provide more interactivity and dynamism in web pages.
