Standards: Part 19 - ST 2110-30/31 & AES Standards For Audio

Our series continues with the ST 2110-3x standards which deploy AES3 and AES67 digital audio in an IP networked studio. Many other AES standards are important as the foundations on which AES3 and AES67 are constructed.

This article is part of our growing series on Broadcast Standards.
The first 26 articles are now available in Broadcast Standards – The Book.

The ST 2110-3x series of standards describe how to deliver digital audio within an IP studio. There are currently only two parts relating to uncompressed audio. Their payloads are described by AES3 and AES67, hence the SMPTE standards themselves are quite brief.

AES3 - Describes a simple two channel stereo streamed transmission. Deployed as ST 2110 Part 31.
AES67 - Adds more channels and improves the performance. Deployed as ST 2110 Part 30.

Those standards are based on many other AES documents on related topics. Study them all to understand the complete picture. These are the most important but they do refer to others:

AES5 - Preferred sample frequencies.
AES11 - Synchronization of digital audio equipment.
AES18 - Ancillary user data channel format.

The Audio Engineering Society (AES)

The Audio Engineering Society was established in 1948. It produces standards and guidelines for audio engineers. Conventions are currently organized every year in Europe, the USA and Latin America. The technical papers are collected into a proceedings volume and selected papers are published in the members' journal. The individual papers are easily obtained from the AES Electronic Library.

The AES works closely with the European Broadcasting Union (EBU) and provides input to the ISO/IEC standards bodies.

AES3 - Serial Transmission For 2-channel Digital Audio

This standard describes how to transmit 2 channels of digital audio over a variety of media. The supported audio format is linear Pulse Code Modulation (PCM), which is an uncompressed stream of samples. Sample sizes between 16 and 24 bits are supported. Other formats are possible but not described by AES3. See AES5 for the list of acceptable sample rates.

AES3 comprises four separate parts:

Part	Description
1	Audio content semantics. Describing the sampling frequency based on AES5.
2	Metadata and sub-code data transmitted with the audio content such as channel-status, user and ancillary data. The use of pre-emphasis to enhance the audio is indicated in the channel status.
3	Unidirectional transport link framing and channel co-ordination. This also embeds a recoverable clock signal.
4	Physical and electrical signal levels and wiring.

The use of abbreviations in audio/visual contexts is sometimes ambiguous and overloaded with hidden meaning. For example, when the interface is described as AES rather than AES/EBU, the means of electrical connection might be different. This quotation from Ray Arthur Rayburn, a highly respected audio engineer in the AES community, explains why:

“AES3 allows the use of transformer or transformerless interfaces, while the corresponding EBU standard requires the use of transformers. Therefore, it has become a common shorthand to say AES/EBU when the interface is transformer coupled, and AES3 when it is not or if the interface type is unknown.”

AES/EBU is described in the third edition of the EBU Tech 3250 document.

AES67 - High-Performance Streaming Audio-Over-IP Interoperability

The original intent for AES67 was to deliver professional quality audio over a high-performance IP network with less than 10 ms latency. Bridging diverse pre-existing audio networking systems to provide interoperability is also a core goal. This is suitable for sound reinforcement at live events.

High performance is feasible on existing local area networks (LAN). If suitable switching hardware is available, it can be supported widely across an enterprise.

These are the main features:

Based on existing and well-known IT standards described in IETF RFC documents.
Synchronization with boundary clock converters.
Streaming transport via RTP.
Session description with SDP.
Low-latency delivery of uncompressed audio.
Ideal for live, studio and broadcast situations.
Decentralized configuration and management of devices.
Coexists with other IT data traffic on the same network.

Prior to AES67, the available audio networking solutions were incompatible with one another. AES67 is designed to reconcile the needs of architectures designed by different manufacturers and facilitates interoperability between:

Dante.
Ravenna.
QLAN.
WheatNet-IP.
Livewire.

These topics are addressed by the standard:

Transport synchronization - A variety of techniques are discussed in Section 4 of the standard.

Media profiles - Standard IP networks must adhere to a media profile (see Annex A) to ensure timely delivery of packets.

Boundary clock converters - Networks using switching hardware that supports IEEE PTP protocols can provide boundary clock conversion and should provide adequate performance for audio delivery.

AVB - Enhanced Ethernet Networks that conform to IEEE 802.1Q are described as Audio Video Bridging (AVB) and provide synchronization based on IEEE PTP. This is covered in Annexes C and D.

Media clocks - These are described in Section 5 and provide synchronization at the sample level. A media clock advances in sync with the sample rate. The same frequency should be used for the RTP clock.
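The relationship between the media clock and the RTP clock can be illustrated with a short Python sketch. It assumes the media clock simply counts sample periods elapsed since the clock epoch and that the RTP timestamp is that count, plus an offset, wrapped to the 32-bit field defined in RFC 3550. The function names and the nanosecond input are illustrative assumptions and are not taken from the standard.

```python
RTP_WRAP = 2 ** 32  # RTP timestamps are 32-bit unsigned values (RFC 3550)


def media_clock_samples(elapsed_ns: int, sample_rate_hz: int) -> int:
    """Whole sample periods elapsed since the clock epoch (assumed model)."""
    return (elapsed_ns * sample_rate_hz) // 1_000_000_000


def rtp_timestamp(elapsed_ns: int, sample_rate_hz: int = 48_000, offset: int = 0) -> int:
    """RTP timestamp running at the same frequency as the media clock.

    The offset parameter is hypothetical; a value of zero means the RTP
    clock and the media clock read the same value before wrapping.
    """
    return (media_clock_samples(elapsed_ns, sample_rate_hz) + offset) % RTP_WRAP


if __name__ == "__main__":
    # One second of elapsed time at 48 kHz advances both clocks by 48000.
    print(rtp_timestamp(1_000_000_000))
    # After 100000 seconds the 32-bit RTP timestamp has wrapped once.
    print(rtp_timestamp(100_000 * 1_000_000_000))
```

Integer arithmetic is used throughout so that the sample count never suffers from floating-point rounding.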

Payload encoding - This is described in Section 7, which reiterates the limited range of three preferred sample rates from AES5 with two possible sample sizes. Packet sizes are determined primarily by how long the data in them would play for the given sample rate. AES67 describes these sample rates (derived from AES5):

48 kHz.
96 kHz.
44.1 kHz.

The standardized sample sizes and formats are defined in great detail in these IETF RFC documents:

L16 - 16-bit linear format as defined in RFC 3551 clause 4.5.11.
L24 - 24-bit linear format as defined in RFC 3190 clause 4.
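As a rough illustration of how packet time, sample rate and sample size combine, the following Python sketch computes the audio payload carried in one packet. The function names are hypothetical and RTP, UDP and IP header overhead is ignored. Combinations where the packet time does not hold a whole number of samples are rejected rather than modeled.

```python
BYTES_PER_SAMPLE = {"L16": 2, "L24": 3}  # 16-bit and 24-bit linear PCM


def samples_per_packet(sample_rate_hz: int, packet_time_us: int) -> int:
    """Samples per channel that play for packet_time_us microseconds."""
    total = sample_rate_hz * packet_time_us
    if total % 1_000_000:
        # This sketch does not model packet times that hold a fractional
        # number of samples.
        raise ValueError("packet time is not a whole number of samples")
    return total // 1_000_000


def payload_bytes(sample_rate_hz: int, packet_time_us: int,
                  channels: int, encoding: str = "L24") -> int:
    """Audio payload size of one packet, excluding all protocol headers."""
    per_channel = samples_per_packet(sample_rate_hz, packet_time_us)
    return per_channel * channels * BYTES_PER_SAMPLE[encoding]


if __name__ == "__main__":
    # 8 channels of L24 at 48 kHz with 1 ms packets: 48 samples per channel.
    print(payload_bytes(48_000, 1000, 8))
    # 64 channels of L24 at 48 kHz with 125 us packets: 6 samples per channel.
    print(payload_bytes(48_000, 125, 64))
```

Both examples give the same payload size, which shows how a shorter packet time leaves room for more channels in a packet of similar size.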

Channel count - Up to 120 channels of audio can be carried in a generic AES67 link. ST 2110-30 limits the number of channels depending on the conformance level of the receiving device. This may be as low as 4 channels at level AX and not more than 64 for level C.

SDP - Session Description Protocol provides discovery and connection management support. This includes keep-alive heartbeats to maintain connections. The discovery systems are described in Annex E. These include the AMWA NMOS IS-04 specification used by ST 2110.

IETF RFC references - Because this is a standard describing IP network transmission, there are many RFC documents cited in the normative references in Section 2 of the standard and more references are included in the bibliography in Annex H. Using the IETF specifications ensures compatibility with the rest of the IP network traffic.

AES5 - Preferred Sample Frequencies

This standard describes various sample rates and recommends 48 kHz at the outset because it is numerically easier to convert this to other sample rates. See Section 4.2 of the standard for an explanation. Sample rates at 96 and 44.1 kHz are also described.

There is an interesting paragraph on bandwidth (see Section 4.1) based on the Nyquist-Shannon sampling theorem.

Derived sample rates ranging from half to 8 times the basic sample rate are also described. There are tables listing the number of samples per frame of video at different frames per second rates vs. audio sample rates.
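The samples-per-frame relationship is simple arithmetic and can be reproduced with exact fractions. The Python sketch below is illustrative only: the values are computed, not copied from the tables in the standard, and the function names are assumptions.

```python
from fractions import Fraction


def samples_per_frame(sample_rate_hz: int, fps: Fraction) -> Fraction:
    """Exact number of audio samples spanning one video frame."""
    return Fraction(sample_rate_hz) / fps


def frames_for_whole_samples(sample_rate_hz: int, fps: Fraction) -> int:
    """Smallest number of frames that contains a whole number of samples."""
    return samples_per_frame(sample_rate_hz, fps).denominator


if __name__ == "__main__":
    ntsc = Fraction(30000, 1001)  # the frame rate usually written as 29.97
    print(samples_per_frame(48_000, Fraction(25)))  # whole number per frame
    print(samples_per_frame(48_000, ntsc))          # fractional per frame
    print(frames_for_whole_samples(48_000, ntsc))   # frames in the repeat
```

At 25 frames per second every frame holds a whole number of 48 kHz samples. At 30000/1001 frames per second the count only becomes whole over a repeating group of frames.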

This is an important foundational standard referred to by AES3 and AES67 and other related documents.

AES11 - Synchronization Of Digital Audio Equipment

Multiple channels of audio must be carefully synchronized. The sample clocks governing when the source audio is captured must be accurately regulated. Any downstream processing needs to maintain the phase relationships between channels to avoid introducing unwanted audible artefacts. This is a complex topic and there are many solutions.

Equipment having an internal sample clock must be locked to an external source. AES11 describes this as a Digital Audio Reference Signal (DARS) which is delivered separately from the audio content (usually via a separate connection). AES5 describes multiples of (up to 8 times) the basic sample rate. The internal sample clock must be capable of reliably locking to all of these.

Alternative synchronization techniques can be used instead of DARS:

Embedded time signatures based on the packet header timestamps. This may drift out of sync with other streams.
Video reference syncing to frame start events.
GPS locked. This requires a separate receiver device and locks to real-world time.

AES11 describes the word clock (see Annex B). This synchronizes hardware devices (such as digital tape machines or CD players). The word clock governs the timing of each sample passing through the system and is derived from a centralized reference. This will be familiar to broadcast engineers who ensure that video across an enterprise is frame synchronous by distributing sync pulses from a reliable source.

The word clock is not the same as timecode. The word clock is integral to the sampling process and transmission of the digital audio, whereas timecode is a separate metadata service that describes the media being transmitted.

AES11 refers to AES5 and augments the sample rate descriptions with advice pertaining to video reference timing.

AES18 - Ancillary User Data Channel Format

Ancillary user specified metadata can be embedded within an AES3 audio stream. Messages can be any length. The only limitation is the maximum bitrate which caps the amount of data that can be inserted in addition to the audio payload. A long message could describe the entire asset with an abstract for display in an EPG. Shorter messages provide synchronous data such as:

Subtitle text.
Script cues.
Editing information.
Copyright assertions.
Performer credits.
Downstream switching instructions.

This is managed carefully to avoid delaying the audio content. Messages can be split and portions deferred to accommodate the bitrate capping limit.

Ancillary data adapts the High-level Data Link Control (HDLC) protocol. AES18 cites ISO 3309 as the original definition, but that standard has now been withdrawn and replaced by ISO 13239. HDLC is bi-directional, but in the context of AES3, the messages only travel one way with no handshaking.

Error resilience helps detect data corruption at the receiver. If necessary, important data could be delivered in a carousel-like structure and repeated periodically.

The standard lists many external references in the Annex C Bibliography. These date from the mid-1980s to the 1990s and cover radio text services which are now deployed worldwide. Because of their vintage, the specified character sets do not use Unicode. Text is constrained to 8-bit character codes as defined in ISO 4873. UTF-8 character encoding of Unicode text is briefly mentioned in the AES67 standard.

ST 2110-30 - RTP Streamed PCM AES67 Audio

The ST 2110-30 standard describes how to deliver uncompressed AES67 audio via RTP streams in an IP based studio. The delivery is supported by signaling metadata delivered using the Session Description Protocol (SDP). This metadata is necessary to receive and correctly interpret the stream.
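To show the kind of information a receiver extracts from the SDP, here is a minimal Python sketch that parses an `a=rtpmap` attribute using the general syntax from RFC 4566 (payload type, encoding name, clock rate and an optional channel count). The example lines and the payload type numbers are illustrative assumptions, not taken from a real device or from the standard.

```python
from typing import NamedTuple


class RtpMap(NamedTuple):
    payload_type: int
    encoding: str
    clock_rate_hz: int
    channels: int


def parse_rtpmap(line: str) -> RtpMap:
    """Parse 'a=rtpmap:<pt> <encoding>/<clock rate>[/<channels>]'."""
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError("not an rtpmap attribute")
    pt_text, _, fmt = line[len(prefix):].strip().partition(" ")
    parts = fmt.split("/")
    if len(parts) not in (2, 3):
        raise ValueError("unexpected rtpmap format")
    # RFC 4566 allows the channel count to be omitted for a single channel.
    channels = int(parts[2]) if len(parts) == 3 else 1
    return RtpMap(int(pt_text), parts[0], int(parts[1]), channels)


if __name__ == "__main__":
    # Hypothetical attribute describing 8 channels of 24-bit audio at 48 kHz.
    print(parse_rtpmap("a=rtpmap:96 L24/48000/8"))
```

A real receiver would read several other SDP attributes as well; this sketch covers only the encoding, clock rate and channel count.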

SMPTE ST 2110-30 can be seen as a subset of AES67. Most of the requirements for stream transport, packet setup, and signaling are common to ST 2110-30 and AES67. ST 2110-30 profiles AES67 with these constraints to ensure more reliable interoperability:

Support of the PTP profile defined in SMPTE ST 2059-2 instead of that defined in AES67.
An offset value of zero between the media clock and the RTP stream clock.
Mandatory signaling to force a device to operate in PTP slave-only mode.
Support of IGMPv3 for multicasting. Refer to RFC 3376 for details.
RTCP messaging is tolerated but not required.
Receivers need not support the Session Initiation Protocol (SIP) or other connection management support.
The size of UDP datagrams (packets) is specified in ST 2110-10 and supersedes those in AES67.
SDP channel ordering syntax must follow RFC 3190 conventions.
The maximum number of channels is limited depending on the conformance level.

The AIMS Alliance has published a helpful white paper that describes how ST 2110-30 and AES67 interact. Download a copy of AES67-SMPTE-ST-2110-Commonalities-and-Constraints here:

https://aimsalliance.org/white-papers/

Channel Ordering

Channel ordering is described in a Session Description Protocol message with symbolic names. The receiver can use them to deduce how to unpack the received samples and reconstruct the correct channel mapping. If this SDP description is omitted the receiver will assume all channels are of an undefined type:

Symbol	Channels	Description
M	1	Single Monophonic.
DM	2	Dual Monophonic.
ST	2	Stereo-pair.
LtRt	2	Matrix Stereo.
51	6	Surround 5.1.
71	8	Surround 7.1.
222	24	Surround 22.2.
SGRP	4	SDI audio group.
U{xx}	1 - 64	An arbitrary number of channels of an undefined type. The count is given by the value {xx}, which must be between 01 and 64. Subject also to the overall 64 channel maximum in an ST 2110-30 implementation of an AES67 link.

Note that the channel ordering can be stacked. This example SDP fragment describes the first six channels as 5.1 surround format and the next two as a stereo-pair delivered alongside them:

channel-order=SMPTE2110.(51,ST)

This second example SDP fragment describes the first four channels as separate monophonic channels, and the next two channels as a stereo-pair and the last two channels as an undefined type:

channel-order=SMPTE2110.(M,M,M,M,ST,U02)
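A receiver could expand these fragments into channel groups along the following lines. This Python sketch uses the symbols and channel counts from the table above; the function names are hypothetical and the parsing is deliberately minimal, so it should not be read as a complete implementation of the SDP syntax.

```python
import re

# Channel counts per grouping symbol, taken from the table above.
GROUP_CHANNELS = {
    "M": 1, "DM": 2, "ST": 2, "LtRt": 2,
    "51": 6, "71": 8, "222": 24, "SGRP": 4,
}


def group_size(symbol: str) -> int:
    """Number of channels described by one grouping symbol."""
    if symbol in GROUP_CHANNELS:
        return GROUP_CHANNELS[symbol]
    undefined = re.fullmatch(r"U(\d{2})", symbol)
    if undefined and 1 <= int(undefined.group(1)) <= 64:
        return int(undefined.group(1))
    raise ValueError(f"unknown channel grouping symbol: {symbol!r}")


def parse_channel_order(fragment: str) -> list[tuple[str, int]]:
    """Split a channel-order fragment into (symbol, channel count) pairs."""
    match = re.fullmatch(r"channel-order=SMPTE2110\.\((.*)\)", fragment.strip())
    if not match:
        raise ValueError("not a SMPTE2110 channel-order fragment")
    symbols = [s.strip() for s in match.group(1).split(",")]
    return [(s, group_size(s)) for s in symbols]


if __name__ == "__main__":
    groups = parse_channel_order("channel-order=SMPTE2110.(M,M,M,M,ST,U02)")
    print(groups)
    print(sum(count for _, count in groups))  # total channels in the stream
```

Summing the counts gives the total number of channels the fragment describes, which can be checked against the channel count signaled elsewhere in the SDP.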

Conformance Levels

The ST 2110-30 standard describes receiver conformance levels that mandate how many channels must be supported. This is based on the sample rate vs. the packet times. The product of these determines how many channels can be carried within the available capacity. If the chosen level limits the number of channels to fewer than you need, multiple AES67 links can be delivered with the channels sensibly divided between them. For example, if 16 channels are arriving in an SDI stream but your receiver is only able to support Level A, you will need to transmit 8 channels in each of two separate AES67 links running side-by-side.

These are the three basic conformance levels:

Level	Receiver must support
A	48 kHz incoming streams. 1 to 8 audio channels at packet times of 1.0 ms.
B	48 kHz incoming streams. 1 to 8 audio channels at packet times of 1.0 ms. 1 to 8 audio channels at packet times of 0.125 ms.
C	48 kHz incoming streams. 1 to 8 audio channels at packet times of 1.0 ms. 1 to 64 audio channels at packet times of 0.125 ms.

Level A is mandatory and must be supported by all receivers. This is also defined in AES67 as the minimum support. These are all based on 24-bit samples.

Levels B & C support shorter packet times to improve latency. They also support more channels for interoperability with MADI (AES10) systems.

Levels AX, BX and CX add support for 96 kHz sample rates but reduce the number of supported channels where necessary.
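The three basic levels and the link-splitting example can be expressed as a small Python sketch. It covers Levels A, B and C only, exactly as listed in the table above. The even split across links is an assumption about what a sensible division looks like, and all names are hypothetical.

```python
import math

# Mandatory 48 kHz receiver support per level, from the table above:
# packet time in microseconds mapped to the maximum channel count.
LEVELS = {
    "A": {1000: 8},
    "B": {1000: 8, 125: 8},
    "C": {1000: 8, 125: 64},
}


def receiver_supports(level: str, channels: int, packet_time_us: int) -> bool:
    """True if the level obliges a receiver to accept this stream."""
    limit = LEVELS[level].get(packet_time_us)
    return limit is not None and 1 <= channels <= limit


def split_channels(total_channels: int, max_per_link: int) -> list[int]:
    """Divide channels as evenly as possible across the fewest links."""
    links = math.ceil(total_channels / max_per_link)
    base, extra = divmod(total_channels, links)
    return [base + 1] * extra + [base] * (links - extra)


if __name__ == "__main__":
    # 16 channels cannot go to a Level A receiver in a single stream.
    print(receiver_supports("A", 16, 1000))
    # Two side-by-side links of 8 channels each satisfy Level A.
    print(split_channels(16, 8))
```

These are the mandatory minimums only. A given receiver may accept more than its level requires, so the check answers what is guaranteed, not what is possible.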

ST 2110-31 - AES3 Transparent Transport Over RTP

The ST 2110-31 standard describes real-time, RTP-based transport of any audio format that can be encapsulated into AES3.

The RTP packet header and payload format is described in Section 5. Packets are synchronized to a network reference clock described in ST 2110-10. Session Description Protocol (SDP) messages describe how the payload is constructed for the benefit of the receiver. See Section 6 of the standard.

This is all based on the original Ravenna AM824 specification and registered with the IANA as RTP Media Type 'AM824'. Refer to RFC 4855 and RFC 6838 for details.

Annex A of the standard describes how AES3 and AES10 (MADI) protocols interact. They are broadly compatible but some data needs to be correctly framed and some flag-bits must be adjusted within the packets as they move between the two environments.

Relevant Standards

These are the relevant standards you need to fully explore ST 2110-30 and ST 2110-31. The version column lists the most recent edition, amendment or, in the case of AES standards, reaffirmation as being up to date. Note that the AES standards refer to some IEC standards that are more often identified as ISO standards. Where standards are superseded, the original reference (from AES or SMPTE) and the replacement standard are listed:

Standard	Version	Description
AES3	-	A specification for 2-channel digital audio interconnection. Commonly known as AES/EBU.
AES3-1	2009	Part 1: Audio Content.
AES3-2	2009	Part 2: Metadata and Sub-code.
AES3-3	2009	Part 3: Transport.
AES3-4	2009	Part 4: Physical and electrical.
AES5	2018	Preferred sampling frequencies for applications employing pulse-code modulation.
AES10	2020	Serial Multichannel Audio Digital Interface (MADI).
AES11	2020	Synchronization of digital audio equipment in studio operations.
AES14-1	1992	Part 1: Analog XLR pin-out polarity and gender.
AES31	-	AES standard for network and file transfer of audio - Audio-file transfer and exchange.
AES31-1	2001	Part 1: Disk format.
AES31-2	2019	Part 2: File format for transferring digital audio data between systems of different type and manufacture.
AES31-3	2021	Part 3: Simple project interchanges.
AES31-4	2015	Part 4: XML Implementation of Audio Decision Lists.
AES42	2019	Digitally interfaced microphones.
AES47	2017	Transmission of digital audio over asynchronous transfer mode (ATM) networks.
AES51	2017	Transmission of ATM cells over Ethernet physical layer.
AES53	2018	Sample-accurate timing in AES47.
AES67	2018	High-performance streaming audio-over-IP interoperability.
AES70	-	Open Control Architecture.
AES70-1	2018	Part 1: Framework.
AES70-2	2018	Part 2: Class structure.
AES70-3	2018	Part 3: OCP.1: Protocol for IP Networks.
AES74	2019	Requirements for Media Network Directories and Directory Services.
RP 168	2009	SMPTE Definition of Vertical Interval Switching Point for Synchronous Video Switching.
ST 318	2015	Synchronization of 59.94 or 50 Hertz related video and audio systems in analogue and digital areas.
ST 337	2015	Format for Non-PCM Audio and Data in an AES3 Serial Digital Audio Interface.
ST 338	2019	Format for Non-PCM Audio and Data in AES3 - Data Types.
ST 339	2015	Format for Non-PCM Audio and Data in AES3 - Generic Data Types.
ST 340	2015	Format for Non-PCM Audio and Data in AES3 - ATSC A/52B Digital Audio Compression Standard for AC-3 and Enhanced AC-3 Data Types.
ST 2036-2	2013	Ultra-High-Definition-Television - Audio Characteristics and Audio Channel Mapping for Program production.
ST 2059-2	2021	SMPTE Profile for Use of IEEE-1588 Precision Time Protocol in Professional Broadcast Applications.
ST 2067-8	2013	Interoperable Master Format - Common Audio Labels.
ST 2110-30	2017	PCM Digital Audio.
ST 2110-31	2022	AES3 Transparent Transport.
ST 2116	2019	Format for Non-PCM Audio and Data in AES3 - Carriage of Metadata of Serial ADM (Audio Definition Model).
ISO 3309	-	Information processing systems, Data communications High-level data link control procedures and Frame structure. Referred to by AES18 but now withdrawn and replaced by ISO 13239.
ISO 13239	2002	High-level data link control (HDLC) procedures.
ISO 9314-1	1989	Part 1: Token Ring Physical Layer Protocol (PHY).
ISO 9314-3	1990	Fibre Distributed Data Interface (FDDI) - Part 3: Physical Layer Medium Dependent (PMD).
ISO 10646	2023	Information technology — Universal coded character set (UCS). Currently being revised.
IEC 60169-8	1978	Radio-frequency connectors. Part 8: R.F. coaxial connectors with inner diameter of outer conductor 6.5 mm (0.256 in) with bayonet lock (BNC). Replaced by 61169-8.
IEC 60958	-	Two channel digital audio data format used by S/PDIF and AES3.
IEC 60958-1	2021	Digital audio interface - Part 1: General.
IEC 60958-3	2006	Part 3: Consumer applications - Sony/Philips consumer optical digital interface (S/PDIF) based on AES3.
IEC 60958-4	2016	Digital audio interface - Part 4: Professional applications.
IEC 61169-8	2007	RF coaxial connectors.
IEC 61937	2021	Digital audio - Interface for non-linear PCM encoded audio bitstreams applying IEC 60958 - Surround sound digital audio data format.
IEC 61883-6	2014	Part 6: Audio and music data transmission protocol.
IEEE 1588-2008	2008	PTP - Precision Clock Synchronization Protocol for Networked Measurement and Control Systems.
RFC 3190	2002	RTP Payload Format for 12-bit DAT Audio and 20- and 24-bit Linear Sampled Audio.
RFC 3376	2002	Internet Group Management Protocol, Version 3.
RFC 3550	2003	RTP: A Transport Protocol for Real-Time Applications.
RFC 3551	2003	RTP Profile for Audio and Video Conferences with Minimal Control.
RFC 3629	2003	UTF-8, a transformation format of ISO 10646.
RFC 4566	2006	SDP: Session Description Protocol.
RFC 4855	2007	Media Type Registration of RTP Payload Formats.
RFC 6838	2013	Media Type Specifications and Registration Procedures.
RFC 7273	2014	RTP Clock Source Signaling.
EBU Tech 3250	2017	Specification of the digital audio interface (the AES/EBU interface).
ITU-R BS.450-3	2001	Transmission standards for FM sound broadcasting at VHF (aka CCIR Rec 450-1).
ITU-R BS.647	2011	A digital audio interface for broadcasting studios.
ITU-T J.17	1988	Pre-emphasis used on sound-program circuits.
ITU-T J.53	2000	Sampling frequency to be used for the digital transmission of studio-quality and high-quality sound-program signals.
Ravenna AM824	2012	RTP Payload Format for AES3.
Rane Note 149	2014	Describes the differences between AES3 and S/PDIF and how to interface them correctly.
BBC WHP 074	IBC 2003	BBC White Paper - The development of ATM network technology for live production infrastructure.
IS-04	Version 1.3.2	NMOS Discovery & Registration.
IGMPv3	2002	See RFC 3376.

Refer to the appendices for a complete list of AES standards. Guideline documents have an 'id' suffix.

Conclusion

The ST 2110 and AES standards described here convey uncompressed digital audio around an IP network. Compressed audio formats such as AAC are currently outside the scope of these standards.

Although AES67 is an open standard, patent licensing may be necessary if you are designing a commercial product based on it. The principal patent holder is an Australian company called Audinate Pty Ltd. There may be other relevant patents that AES are unaware of.

The AES and SMPTE standards are easier to read and understand than MPEG standards. They tend to be much shorter and focused on a specific topic. They are very dependent on other earlier documents. SMPTE, AES and IETF RFC documents frequently refer to each other.

IETF standards are available online to download free of charge. SMPTE and AES standards are free downloads for subscribing members. Joining both societies is easy and worth the subscription fee if you intend to purchase more than a couple of their standards.

