Standards: Part 24 - Timed-text & Subtitles Overview

Timed-text carriage must be closely synchronized to the AV stream so that subtitles are presented at the right moment. Here we describe the standards that enable this for both broadcast and internet delivery.


This article is part of our growing series on Standards.
There is an overview of all 26 articles in Part 1 - An Introduction To Standards.


Accessibility in broadcasting is important and often mandated by regulation. Various techniques make media available to viewers with hearing impairments or viewers who need a foreign language translation.

The fundamental concept is very simple, but the wide variety of implementations, languages and delivery systems makes it extremely complex.

Implementation

Subtitles for the Deaf and Hard of Hearing (SDH) must be delivered in a timely fashion and carefully composed so that the intent and meaning of the spoken word are preserved accurately.

Captions (closed or otherwise) transcribe the dialogue and describe significant sounds for viewers who cannot hear the audio, whereas ordinary subtitles assume the viewer can hear and typically provide a translation of foreign language speech. The terms are often used interchangeably, but they are not the same thing.

Synchronized text can be implemented in several different ways:

  • Hard - Text rendered and burned into the video. It cannot be hidden, edited or altered.
  • Pre-rendered - The subtitles are encoded as a secondary video track that is overlaid on top of the main program material. There may be several alternatives to cater for multiple languages.
  • Soft - Delivered as timed-text data fragments synchronized to the program. These are highly flexible and can be styled and positioned using controls in the player.

Conventions

Various conventions have evolved to make timed-text more useful. For example:

  • Narrative subtitles transcribe the dialogue spoken by the actors in the scene. This is sometimes abridged but it should always convey the correct meaning.
  • Different text colors identify multiple speakers.
  • Positioning the text to avoid obscuring important visual detail or to indicate which on-screen character is speaking.
  • Enclosing text in square brackets when it refers to sound effects rather than spoken words.
  • Translations of dialogue spoken in a foreign language.
  • Translations of onscreen foreign language text.
  • Bonus texts or iconography provided as special features, perhaps accompanying a director's commentary audio track.
  • Forced narrative subtitles appear even when subtitles are turned off. They convey important information to the viewer when it is essential to the plot. For example, when a character speaks a phrase in a foreign language.

The Important Standards

Ancillary timed-text tracks are based on very low bitrate streams. These are the primary standards:

  • AES18 - Embedding ancillary metadata within AES3 audio streams.
  • DVB - Refer to DVB-TXT for historical details of how analog Teletext data is supported on Digital TV platforms. Consult DVB-SUB and DVB-TTML for more recent delivery of text services on broadcast platforms.
  • ISO 14496-17 - Synchronized text streams for delivery in MPEG transport containers for 3GPP timed-text applications.
  • ISO 14496-30 - An update that adds TTML and WebVTT support.
  • ISO 23001-10 - Carriage of timed metadata metrics in the ISO Base Media File Format.
  • ISO 23001-18 - Extends timed-text to pass triggering events in the ISOBMFF.
  • ST 291-1 - Describes ancillary data in an SDI TV signal. This is where classic Teletext data is carried. Relevant when ingesting old analog material from the archives.
  • ST 2110-40 - Describes how to carry ST 291-1 ancillary data synchronized to an ST 2110-10 foundation for timing and control.
  • ST 2110-43 - Describes how to carry Timed Text Markup Language (TTML) on an ST 2110-10 foundation, specifically using RTP.

Timed-text Formats

Wikipedia lists over twenty different timed-text formats, most of which have niche applications. Two principal formats have become popular in recent times:

  • TTML - Timed Text Markup Language.
  • WebVTT - Web Video Text Tracks (based on the earlier SRT format).

TTML is popular in broadcast workflows and has been described in standards from ISO, SMPTE and W3C.

WebVTT is more popular for delivering subtitles to Internet web browsers. It is easily created by processing TTML in your workflow. WebVTT is easy to author or edit manually if necessary. It is also described in ISO and W3C standards.

Timed Text Markup Language - TTML

Timed Text Markup Language (TTML) is an XML format originally standardized by W3C. Use an XML parser to import these TTML files and reconstruct the object graph inside a workflow process.

TTML is widely used by the broadcast industry to exchange subtitle information. This allows the same source content to be authored once and used for broadcast TV and web streamed media. It is not well supported in web browsers and must be converted for deployment.
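
As a sketch of what that conversion involves, this hypothetical JavaScript routine uses the browser's built-in DOMParser to read simple TTML cues like those shown later in this article and emit WebVTT. A production converter must also handle styles, regions, <br/> line breaks and the many other TTML time expressions.

// Hypothetical sketch: convert simple TTML <p> cues to WebVTT text.
// Styling, regions, nested spans and line breaks are ignored for brevity.
function ttmlToWebVTT(ttmlSource) {
  const doc = new DOMParser().parseFromString(ttmlSource, "application/xml");
  const paragraphs = doc.getElementsByTagNameNS("http://www.w3.org/ns/ttml", "p");
  const output = ["WEBVTT", ""];
  for (const p of paragraphs) {
    output.push(`${toTimestamp(p.getAttribute("begin"))} --> ${toTimestamp(p.getAttribute("end"))}`);
    output.push(p.textContent.trim(), "");
  }
  return output.join("\n");
}

// Format a TTML offset such as "3.45s" as a WebVTT timestamp
// "00:00:03.450". Other TTML time expressions are not handled.
function toTimestamp(offset) {
  const total = parseFloat(offset);               // drops the trailing "s"
  const hh = String(Math.floor(total / 3600)).padStart(2, "0");
  const mm = String(Math.floor((total % 3600) / 60)).padStart(2, "0");
  const ss = (total % 60).toFixed(3).padStart(6, "0");
  return `${hh}:${mm}:${ss}`;
}

Applied to the body example later in this section, this would emit a cue reading "It seems a paradox, does it not," between 00:00:00.760 and 00:00:03.450.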

TTML was originally based on the Synchronized Multimedia Integration Language (SMIL) standards. The first edition was completed in 2010 and it was revised and renamed TTML1 in 2013 while TTML2 was being developed. A third edition of TTML1 was published in 2018, the same year that TTML2 was released. TTML1 and TTML2 are both very large standards. TTML2 adds more sophisticated support for Asian languages.

Most deployments use only a fraction of the TTML functionality, so profiles are defined to simplify the standards for specific applications. The W3C describes more than twenty different profiles for TTML1 and TTML2 in a registry:

https://www.w3.org/TR/ttml-profile-registry/

The carriage of TTML in MPEG streams and files is described in Section 5 of ISO 14496-30.

SMPTE developed an extended superset of TTML1 (SMPTE-TT) described in ST 2052-1, which the FCC declared a safe harbor format for storing timed-text sequences.

TTML is also supported by DVB for Digital TV applications.

Inside TTML

A TTML document is constructed within a <tt> tag container:

<tt xml:lang="" xmlns="http://www.w3.org/ns/ttml">
  <head>
    <metadata/>
    <styling/>
    <layout/>
  </head>
  <body/>
</tt>

There are two main sections: a header block (<head>) containing metadata, styling and layout details, and a body block (<body>) containing the subtitle texts.

The <metadata/> placeholder in the header expands like this:

<metadata xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
  <ttm:title>Timed-text TTML Example</ttm:title>
  <ttm:copyright>The Authors (c) 2006</ttm:copyright>
</metadata>

Metadata lives in a separate namespace and in this example carries a title and copyright text.

The <styling/> placeholder in the header expands like this:

<styling xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <!-- s1 specifies default color, font, and text alignment -->
  <style xml:id="s1"
    tts:color="white"
    tts:fontFamily="proportionalSansSerif"
    tts:fontSize="22px"
    tts:textAlign="center"
  />
  <!-- alternative yellow text otherwise same as style s1 -->
  <style xml:id="s2" style="s1" tts:color="yellow"/>
  <!-- a style based on s1 but justified to the right -->
  <style xml:id="s1Right" style="s1" tts:textAlign="end" />
  <!-- a style based on s2 but justified to the left -->
  <style xml:id="s2Left" style="s2" tts:textAlign="start" />
</styling>

The styling information exists in a separate namespace and constructs style sets in a similar way to CSS, although the format and syntax are different. Cascading is supported: in this example, styles s2, s1Right and s2Left each reference an earlier style definition and override a single property.

Layout control exists in the same namespace as styling and the <layout/> placeholder in the header is expanded like this:

<layout xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <region xml:id="subtitleArea"
    style="s1"
    tts:extent="560px 62px"
    tts:padding="5px 3px"
    tts:backgroundColor="black"
    tts:displayAlign="after"
  />
</layout>

The extent describes the rectangle where the text is drawn. Padding and background color describe the appearance of the container, while displayAlign="after" anchors the text towards the bottom of the region.

The <body/> block contains text messages for the entire program. Only the first two are shown here to save space. A real text stream would have a much longer <body> section:

<body region="subtitleArea">
  <div>
    <p xml:id="subtitle1" begin="0.76s" end="3.45s">
      It seems a paradox, does it not,
    </p>
    <p xml:id="subtitle2" begin="5.0s" end="10.0s">
      that the image formed on<br/>
      the Retina should be inverted?
    </p>
  </div>
</body>

The times (relative to the program start) when the text should appear and disappear are described in decimal notation measured in seconds. This is independent of the video frame rate and will survive transcoding between 25fps and 30fps systems.
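
For example, a cue beginning at 0.76s lands on frame 19 at 25fps (0.76 × 25) and on frame 22.8, which rounds to 23, at 30fps. A hypothetical JavaScript helper makes the mapping explicit:

// Map a TTML offset such as "0.76s" to the nearest frame at a
// given frame rate (hypothetical helper, not part of any standard).
function offsetToFrame(offset, fps) {
  return Math.round(parseFloat(offset) * fps);
}

offsetToFrame("0.76s", 25);   // 19
offsetToFrame("0.76s", 30);   // 23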

About Subtitle Resource Track (SRT) Files

The SubRip application was released for Windows in 2000. This remarkable app was designed to extract subtitle texts from media files. Extracting embedded text streams would have been easy, but SubRip instead used optical character recognition to 'read' text that was burned into the video.

The times when each subtitle appears and is removed are logged and stored with the recognized text. The output is an .srt file with a simple format. Here is a very short example:

1
00:03:17,512 --> 00:03:20,386
Captain, the enemy just fired
a torpedo at the aircraft carrier.

2
00:03:29,382 --> 00:03:31,629
Carry on, bosun.

3
00:04:14,346 --> 00:04:16,277
Direct hit sir.

The SRT abbreviation stands for Subtitle Resource Track.
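
The format is simple enough that a minimal parser is easy to sketch. This hypothetical JavaScript routine assumes well-formed input and ignores real-world quirks such as byte-order-marks and styling tags:

// Split the file into blank-line separated blocks, then read the
// index, the timecode line and the remaining text lines of each.
function parseSRT(source) {
  return source.trim().split(/\r?\n\r?\n+/).map((block) => {
    const lines = block.split(/\r?\n/);
    const [start, end] = lines[1].split(" --> ").map(srtTime);
    return {
      index: parseInt(lines[0], 10),
      start,
      end,
      text: lines.slice(2).join("\n"),
    };
  });
}

// Convert "00:03:17,512" to seconds. Note the comma decimal separator.
function srtTime(timecode) {
  const [h, m, s] = timecode.replace(",", ".").split(":");
  return Number(h) * 3600 + Number(m) * 60 + Number(s);
}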


Note that the term SRT used here describes a file format for storing timed-text subtitles. This is not to be confused with Secure Reliable Transport which uses the same acronym to describe a streaming technology.


Web Video Text Tracks - WebVTT

The WHATWG group develops living web standards such as HTML, working alongside the W3C. In 2010, it considered TTML and SRT as candidates for carrying timed-text.

WHATWG and W3C chose SRT as a starting point and created WebSRT (soon renamed WebVTT). Media players access this via the <track> tag, a child of the <video> tag inside web pages.

Because the <video>, <source> and <track> tags are all first-class citizens in an HTML5 web browser, they are well supported in dynamic HTML pages by JavaScript, DOM and CSS.

When the play-head traverses a time-coded text item in the viewing application, it triggers a JavaScript event. An event handler captures the event and calls the active code into action. This might present a subtitle text or completely alter the appearance and content of the user interface.
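
A minimal sketch of this model, assuming a page containing a <video> element with at least one text track and a hypothetical #caption overlay element:

// Render the active cue's text into a caption element whenever the
// play-head enters or leaves a cue on the first text track.
const video = document.querySelector("video");
const track = video.textTracks[0];        // e.g. the English subtitles
track.mode = "hidden";                    // fire events without native rendering

track.addEventListener("cuechange", () => {
  const cue = track.activeCues[0];        // undefined between cues
  // "#caption" is a hypothetical overlay element in the page.
  document.querySelector("#caption").textContent = cue ? cue.text : "";
});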

WebVTT files are well documented in freely accessible W3C standards, which define the fundamental syntax. The Mozilla (MDN) reference library provides a document describing how to apply WebVTT files, including the browser versions that support each feature.

Inside WebVTT

The format is similar to SRT with a few minor differences which make WebVTT files incompatible with older (SRT only) players.

Here is a simplified example of a WebVTT file:

WEBVTT

00:21.000 --> 00:23.000
<v First Speaker>Spoken text 1

00:40.500 --> 00:42.500 align:left size:50%
<v Second Speaker>Spoken text 2

00:42.000 --> 00:45.500 align:right size:50%
<v First Speaker>Spoken text 3

00:42.500 --> 00:43.500 align:left size:50%
<v Second Speaker>Spoken text 4

These files use Unicode text, which makes localization for foreign languages very easy.

An optional Unicode UTF-8 byte-order-mark can be placed at the start of the file.

The first line of content is the WEBVTT format identifier. This must always be present.

An optional header section can be placed after the format identifier and before the first timed-text cue. This can carry notes and style definitions before the text begins.
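
For example, a header can carry comments in NOTE blocks and styling in STYLE blocks (both defined by the WebVTT specification, although browser support for STYLE varies):

WEBVTT - example header

NOTE
This comment block is ignored by the player.

STYLE
::cue {
  background-color: black;
  color: yellow;
}

00:00:01.000 --> 00:00:04.000
The first cue follows the header.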

Each text cue is preceded by the start and end timecode values and optional size and alignment settings. Note that the timecode format differs slightly from SRT: a full stop replaces the comma as the decimal separator and the hours field is optional.

The text cues can carry a small set of HTML-style markup tags for styling and structure. In the example, the speakers are identified with the <v> tag at the start of the line. HTML character entities are supported for displaying complex Unicode characters (provided the player supports the necessary glyphs). Embedded metadata using JSON formatting rules is also supported.
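
Metadata cues are never displayed directly; they are usually delivered in a separate track with kind="metadata" and read from script. A hypothetical example cue:

00:10.000 --> 00:15.000
{"event": "chapterStart", "title": "Introduction"}

A cuechange handler like the one shown earlier can recover the object with JSON.parse(cue.text).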

Each text cue is separated from the next with a blank line.

WebVTT is supported in recent versions of all browsers compatible with HTML5.

Multiple <track> tags using .vtt files as their source are supported as child elements of the containing <video> tag. These can carry different kinds of information such as chapter marks, descriptions and captions (which are different to subtitles). Localization support for different languages is easy to implement with multiple <track> tags.

<video controls src="video.mp4">
  <track kind="subtitles"    src="subtitles_en.vtt" srclang="en" />
  <track kind="subtitles"    src="subtitles_de.vtt" srclang="de" />
  <track kind="captions"     src="captions.vtt"     srclang="en" />
  <track kind="descriptions" src="descriptions.vtt" srclang="en" />
  <track kind="chapters"     src="chapters.vtt"     srclang="en" />
</video>
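
Script can switch between these tracks at runtime by setting each track's mode property (one of disabled, hidden or showing). A short sketch, assuming the markup above:

// Show the German subtitles and disable every other text track.
const tracks = document.querySelector("video").textTracks;
for (let i = 0; i < tracks.length; i++) {
  const track = tracks[i];
  track.mode = (track.kind === "subtitles" && track.language === "de")
    ? "showing"
    : "disabled";
}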

Styling can be defined inside the WebVTT file or in separate CSS stylesheets. The CSS support is complete on Firefox and Safari browsers.
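
In a separate stylesheet, cues are addressed with the ::cue pseudo-element, optionally scoped to the voice tags carried in the file, as in this sketch:

video::cue {
  background-color: rgba(0, 0, 0, 0.8);
  color: white;
}

/* Style one speaker's lines, matching the <v> tags shown earlier. */
video::cue(v[voice="First Speaker"]) {
  color: yellow;
}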

File Extensions

Subtitles created for TV broadcast are stored in a variety of file formats. The majority of these formats are proprietary to the vendors of subtitle insertion systems. These are the generic file extensions that are preferred for TTML, SRT and WebVTT files:

  • .dfxp - Distribution Format Exchange Profile. An earlier nomenclature used for TTML.
  • .mks - Matroska container file containing subtitle texts.
  • .srt - SubRip SRT files.
  • .ttml - TTML files.
  • .vtt - WebVTT files.
  • .xml - Since TTML is fundamentally an XML format, this file extension can also be used.

Relevant Standards

All of these standards were referred to by the source material used to compile this article. Add them to your library of reference documents to facilitate your TTML and WebVTT deployments.

  • AES18 (1996) - Format for the user data channel of the AES digital audio interface. Minor revisions to the text were made in 2019.
  • ASCII (1986) - Originally designed for use with Teletype terminals. It forms the basis of many character sets including Unicode. The first 128 characters of many code sets, defined by 7-bit values, are equivalent to the ASCII standard.
  • CTA-708-E (2023) - ANSI closed captioning specification used for ATSC digital TV services.
  • EN 300 743 (2018) - DVB Subtitling systems.
  • EN 303 560 (2018-05) - DVB-TTML subtitling systems.
  • ISO 639 (2023) - Codes for individual languages and language groups. Earlier separate parts are now merged into one standard.
  • ISO 639-2 (1998) - Three-letter codes for the representation of names of languages. The multiple parts have since been merged into a single standard (ISO 639).
  • ISO 8859 (1999-2003) - 8-bit single-byte coded graphic character sets, with multiple parts describing different language sub-sets. Superseded by Unicode.
  • ISO 14496-1 (2010) - MPEG-4 Systems layer.
  • ISO 14496-12 (2022) - MPEG-4 ISO Base Media File Format (ISOBMFF).
  • ISO 14496-17 (2006) - MPEG-4 Streaming Text Format for 3GPP.
  • ISO 14496-18 (2004) - MPEG-4 font compression and streaming. Corrected in 2007 and amended in 2014.
  • ISO 14496-30 (2018) - MPEG-4 carriage of timed-text and other visual overlays in the ISO Base Media File Format.
  • ISO 15444-3 (2007) - Motion JPEG 2000 file format. Based on the ISOBMFF.
  • LMT (in progress) - Language Metadata Table devised by MESA and adopted by SMPTE.
  • QTFF (2016) - Originally designed in 1991; the latest version was published in 2016. For practical purposes, ISOBMFF is probably the better alternative.
  • RFC 3550 (2003) - RTP: A Transport Protocol for Real-Time Applications.
  • RFC 3640 (2003) - RTP payload for transport of generic MPEG-4 content.
  • RFC 3986 (2005) - Uniform Resource Identifier (URI): Generic Syntax.
  • RFC 4396 (2006) - RTP payload for 3GPP timed-text.
  • RFC 5691 (2009) - RTP payload format for elementary streams with MPEG Surround multi-channel audio.
  • RFC 6381 (2011) - The 'Codecs' and 'Profiles' parameters for 'Bucket' media types (MIME types).
  • RFC 8141 (2017) - Uniform Resource Names (URNs).
  • RFC 8216 (2017) - An informational document describing HTTP Live Streaming, including MPEG-2 streams with embedded timed-text stored in ID3 tags. A useful reference source but not endorsed by the IETF as a formal standard.
  • SCTE 128-1 (2020) - AVC Video Constraints for Cable Television, Part 1 - Coding. Published by ANSI.
  • SRT (2000) - SubRip subtitle files. Informally standardized but widely used and supported by most players.
  • ST 291-1 (2011) - Ancillary Data Packet and Space Formatting. The document was renumbered in 2013 but the content was unchanged.
  • ST 2052-1 (2013) - Timed-Text Format (SMPTE-TT). Refer to the other parts of ST 2052 for converting other formats.
  • ST 2110-10 (2022) - System Timing and Definitions.
  • ST 2110-40 (2023) - Transport of SMPTE ST 291-1 Ancillary Data.
  • ST 2110-43 (2021) - Transport of Timed Text Markup Language for captions and subtitles in systems conforming to SMPTE ST 2110-10.
  • TS 26.245 (2024-05) - Timed-text delivery specification published by 3GPP. This is described in ISO 14496-17 section 7.
  • TS 126.245 (2024-05) - ETSI-published version of TS 26.245.
  • TTML (2018) - Third edition of the W3C Recommendation, Timed Text Markup Language. Used without a numeric suffix, this describes TTML1.
  • TTML2 (2018) - A revised version of TTML referred to as version 2.0.
  • Unicode (2024-09) - Version 16 of the Unicode standard, describing a total of 154,998 character glyphs.
  • UTF-8 (see Unicode) - Coding scheme for representing Unicode multi-byte characters as 8-bit character strings.
  • WebVTT (2019) - Candidate Recommendation for WebVTT: The Web Video Text Tracks Format.
  • Windows-1252 (1998) - A legacy superset of the ISO 8859 and ASCII character sets used on the Microsoft Windows platform. Superseded by Unicode.

Conclusion

TTML is ideal for production workflows. It can be transformed into other formats for deployment to broadcast head ends and web serving platforms.

WebVTT is well integrated with the HTML5 environment and easy to use in web pages. Implementing subtitles in web-based media players is not difficult. The JavaScript event driven model provides a powerful framework for implementing dynamic content in web pages.

Using timed-text streams for subtitles is merely a starting point. There is a huge opportunity to introduce other more compelling ideas to enhance accessibility and provide more interactivity and dynamism in web pages.
