Standards: Part 18 - High Efficiency And Other Advanced Audio Codecs

Our series on Standards moves on to discussion of advancements in AAC coding, alternative coders for special case scenarios, and their management within a consistent framework.


This article is part of our growing series on Standards.
There is an overview of all 26 articles in Part 1 -  An Introduction To Standards.


The previous two articles addressed the evolution of MPEG audio, starting with MPEG-1 and MPEG-2 for Layers 1 to 3. Later, MPEG-2 introduced AAC as a better but non-backwards compatible alternative in Part 7. MPEG-4 improves on the coding and MPEG-D adds a hybrid codec for speech and music.

MPEG-4 Part 3

Where earlier MPEG Audio standards concentrated on general audio coding, MPEG-4 covers a much wider range of target applications. It manages a collection of versatile codecs through a unified interface. The new codecs are sometimes a more efficient choice than AAC:

Category | Codecs
Lossy speech coding | HVXC and CELP.
General audio coding | AAC, TwinVQ and BSAC.
Lossless coding | MPEG-4 SLS, Audio Lossless Coding, MPEG-4 DST.
Text to speech | TTSI.
Structured Audio | SAOL, SASL and MIDI.
Audio Synthesis | Wavetable based, Sample based, Algorithmic and Effects.

 


Building your own custom player for these special codecs might incur a patent licensing liability.


High Efficiency AAC Coding (HE-AAC/AAC+)

High Efficiency AAC (HE-AAC) refines the coding techniques but remains compatible with MPEG-2 Part 7. It can deliver almost CD-quality sound at 32 kbps.

HE-AAC is known by a variety of other names:

Alias | Canonical name
AAC+ | HE-AAC v1
aacPlus | HE-AAC v1
aacPlus v2 | HE-AAC v2
eAAC+ | HE-AAC v2

 

These are some important features:

  • Spectral Band Replication (SBR).
  • Parametric Stereo (PS).
  • Perceptual Noise Substitution (PNS).
  • Long Term Predictor (LTP).
  • Low Delay (AAC-LD).
  • MPEG-4 Scalable to Lossless (SLS).
  • AAC Scalable Sample Rate (SSR).
  • Structured Audio (SA).
  • Text To Speech (TTSI).
  • Error Resilience (ER).

Structured Audio and text to speech are extremely compact because they describe a sound that the player renders entirely in the receiving client. This allows bitrates as low as 100 bps.

There are many historical versions, with profiles adding more variants:

Version | Description
AAC (Original) | Described in ISO 13818-7:1997.
AAC (Version 1) | Described in ISO 14496-3:1999.
AAC (Version 2) | Described in the ISO 14496-3:2000 revision.
AAC (Current) | Described in ISO 14496-3:2009.
AAC+ (Version 2) | Version 2 of aacPlus is described in ETSI TS 102 005:2010.
HE-AAC v1 (AAC+) Profile | Described in the ISO 14496-3:2001 revision. Combines AAC LC with Spectral Band Replication (SBR).
HE-AAC v2 (aacPlus v2) Profile | Described in the ISO 14496-3:2005 revision. Adds Parametric Stereo (PS) to the version 1 features to achieve lower bitrates. Sometimes described as eAAC+.
xHE-AAC | Fraunhofer introduced Loudness Control and adaptive streaming around 2016 (see MPEG-DASH). Well supported by many players including iOS and Android.
Extended HE-AAC | ISO 23003-3:2020 adds USAC coding to HE-AAC v2, extending the tool set.

 

Tools & Technologies

The coding tools were reorganized in MPEG-4 to make them more flexible. New tools were added and older tools were refactored into separate items. They are now described as Audio Objects, each one having a specific identity and purpose. Some of them are containers for descriptive information. This is an additional layer of abstraction facilitating profile definitions.

These are the basic audio objects and the technologies that are used inside them. Refer to the MPEG-4 Part 3 sub-parts for functional descriptions of these tools and how they are mapped into the profiles. The sub-part references are all in the MPEG-4 Part 3 standard:

Terminology | Description
AAC Main | Based on AAC LC.
AAC LC | The Low Complexity Audio Object combines the MPEG-2 Part 7 Low Complexity profile (LC) with Perceptual Noise Substitution (PNS). See sub-part 4.
AAC SSR | Scalable Sample Rate is based on the MPEG-2 Part 7 Scalable Sampling Rate profile (SSR) combined with Perceptual Noise Substitution (PNS). See sub-part 4.
AAC LTP | Long Term Prediction introduces a forward predictor with lower computational complexity. Also uses AAC LC.
AAC LD | Low Delay, used with CELP, HVXC, and TTSI in the Low Delay Profile. Suitable for real-time conversation applications.
AAC ELD | Enhanced Low Delay improves the bitrate and latency at the expense of a small increase in computational workload.
SBR | Spectral Band Replication used with AAC LC in the HE-AAC Profile version 1.
TwinVQ | Transform-domain Weighted Interleave Vector Quantization is designed for coding audio at extremely low bitrates (8 kbps). See sub-part 4.
CELP | Speech coding with Code Excited Linear Prediction operates at low bitrates. TwinVQ may be more efficient. Not suited for use with music. See sub-part 3.
HVXC | Speech coding with Harmonic Vector eXcitation Coding works well with low sample rates around 8 kHz, delivering coded output at 1.6 kbps. Latency is very low, making it suitable for telephony applications. See sub-part 2.
SSC | SinuSoidal Coding. The technical underpinnings of Parametric Stereo coding for high quality audio. See sub-part 8.
PS | Parametric Stereo used with AAC LC and SBR in the HE-AAC v2 Profile. The implementation uses SinuSoidal Coding (SSC). Stereo audio is coded as a monaural channel with two differential channels for the left and right signals. See sub-part 8.
MP1, MP2, MP3 | MPEG-1/MPEG-2 Audio Layers 1, 2 and 3 in MPEG-4. See sub-part 9.
USAC | Unified Speech and Audio Coding switches the coding strategy mid-stream between low bitrate CELP (for speech) and HE-AAC (for music) as it determines which is more efficient for each segment. See ISO 23003-3.
BSAC | Bit Sliced Arithmetic Coding is an alternative scalable noiseless coding mechanism providing almost perfect quality at 64 kbps. Used for Digital Multimedia Broadcasting (DMB) services. See sub-part 4.
HILN | Parametric audio coding with Harmonic and Individual Line plus Noise. Sound can be coded as various harmonics of a sine wave plus a noise component described as a spectral envelope. See sub-part 7.
PNS | Perceptual Noise Substitution improves efficiency by representing noise-like signal components with a parametric representation instead of coding the exact waveform. The decoder synthesizes the noise component based on the description.
DST | Lossless coding of oversampled audio with Direct Stream Transfer. Popularized by Super Audio CDs. See sub-part 10.
ALS | Audio Lossless Coding uses short and long-term predictors to encode sounds that are rich in harmonics. See sub-part 11.
SLS | Scalable Lossless Coding is based on a layered approach which implements a lossy coding component in AAC with an additional correction layer that enhances it to provide the lossless result. SLS and ALS are not related to one another. See sub-part 12.
SLS non-core | A lossless audio coder with a single coding stream without the lossy General Audio base layer.
MPEG Surround | Also known as MPEG Spatial Audio Coding (SAC). Not the same as SAOC.
SAOC | Spatial Audio Object Coding. See ISO 23003-2.
SAOC-DE | Spatial Audio Object Coding Dialogue Enhancement.
LD MPEG Surround | Low Delay MPEG Surround coding. The side channel information is described in ISO 23003-2.
Audio Sync | Audio synchronization maintains the coherence of multiple content streams in multiple devices. See sub-part 13.
TTSI | Text to Speech Interface that synthesizes the audio. See sub-part 6.
SA | Structured Audio describes the audio as components or algorithms. The top level is a scheduler for controlling the construction and playback. See sub-part 5.
Wavetable synthesis | Uses combinations of waveforms to create virtual instrument sounds.
Sample based synthesis | Sampled natural sound fragments are combined and mixed to create a track. Based on SoundFont technologies.
Algorithmic synthesis | Converts a description of a sound and how to play it into a compiled source code format (such as C Language). Then an application can be created to generate the sound.
Audio effects | Part of the structured audio toolset.
SMR Simple | Simplified version of Symbolic Music Representation. See ISO 14496-23.
SMR Main | Main version of Symbolic Music Representation. See ISO 14496-23.
SAOL | Structured Audio Orchestra Language. Derived from the earlier MUSIC-N language.
SASL | Structured Audio Score Language.
SASBF | Structured Audio Sample Bank Format.
MIDI | Musical Instrument Digital Interface describes sound (predominantly music based) as a series of events (notes), sounds (patches) and modulations (controls).
General MIDI | A standard set of sounds defined by Roland Corp to provide instrument sound (patch) compatibility across multiple MIDI devices.
DLS | Downloadable Sounds are standardized digital musical instrument sound banks that can be used with data-driven sound tracks such as MIDI or SAOL.

 

Spectral Band Replication (SBR)

Spectral Band Replication discards redundant harmonic components in the encoder but reconstructs them by replicating the lower frequencies to derive suitable replacements in the player. This can be used with any codec.

A typical stream of audio might be coded to a target bitrate of around 128 kbps. This would reproduce all frequencies up to 15 kHz with a small reduction in the frequency response at the top end.

SBR cuts off the incoming frequencies at around 7.5 kHz. This loses a lot of the detail but reduces the bitrate to 64 kbps.

Then the higher band from 7.5 kHz to 15 kHz is processed through a more aggressive compression tool. This generates a description of the high frequency sounds that can be used in the decoder to reconstruct them from the lower order harmonics. The description is carried in auxiliary segments within the stream and only adds 1.5 kbps to the bitstream (65.5 kbps in total).

The player transposes the lower frequencies into the upper band where it can filter and mix them in using the descriptions in the auxiliary segments.
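The replicate-and-adjust idea can be sketched in a few lines of Python. This is only a toy model working on a list of non-negative spectral magnitudes, not the tool defined in the standard. The function names, the band count and the use of a simple per-band average as the "description" are all assumptions made for illustration, and the spectrum length is assumed to divide evenly into the chosen bands.

```python
# Toy model of SBR on a magnitude spectrum (illustrative only).
# Bitrate figures from the text: 64 kbps core plus 1.5 kbps of description.
CORE_KBPS = 64.0
SIDE_INFO_KBPS = 1.5
TOTAL_KBPS = CORE_KBPS + SIDE_INFO_KBPS


def sbr_encode(spectrum, bands=4):
    """Keep the lower half of the spectrum and summarize the upper half
    as one average magnitude per band (the hypothetical 'description')."""
    half = len(spectrum) // 2
    low, high = spectrum[:half], spectrum[half:]
    width = len(high) // bands
    envelope = [sum(high[i * width:(i + 1) * width]) / width
                for i in range(bands)]
    return low, envelope


def sbr_decode(low, envelope):
    """Transpose the low band upward, then scale each band so that its
    average magnitude matches the transmitted envelope."""
    width = len(low) // len(envelope)
    high = []
    for i, target in enumerate(envelope):
        patch = low[i * width:(i + 1) * width]
        mean = sum(patch) / width
        gain = target / mean if mean else 0.0
        high.extend(value * gain for value in patch)
    return low + high
```

The decoder never sees the original upper band. It only has the low band and a handful of envelope values, which is why the extra side information is so small compared with the core stream.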

Parametric Stereo (PS) & SinuSoidal Coding (SSC)

Parametric stereo exploits the similarity between the left and right channels to code them more efficiently.

The two channels are mixed down into a single monophonic channel and coded at full resolution. This is a base from which two differential channels can be derived. Those differences can be coded to a 3-kbps bitrate using SinuSoidal Coding.

The player decodes the mono channel and applies the differences to make the left and right outputs.

Instead of delivering two full-bitrate channels, the encoder delivers one full-bitrate channel and two very low bitrate differentials.
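A rough Python sketch of the downmix-plus-side-information idea follows. It assumes the per-channel "differences" are reduced to a single gain per channel per block, which is a deliberate simplification for illustration; the actual parameter set is defined in sub-part 8. The function names and the block size are hypothetical.

```python
import math


def ps_encode(left, right, block=4):
    """Downmix to mono and keep one (left_gain, right_gain) pair per block.
    The gains are the only stereo information that is transmitted."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    params = []
    for start in range(0, len(mono), block):
        e_mono = sum(x * x for x in mono[start:start + block])
        e_left = sum(x * x for x in left[start:start + block])
        e_right = sum(x * x for x in right[start:start + block])
        if e_mono == 0:
            params.append((0.0, 0.0))
        else:
            params.append((math.sqrt(e_left / e_mono),
                           math.sqrt(e_right / e_mono)))
    return mono, params


def ps_decode(mono, params, block=4):
    """Rebuild left and right by applying each block's gains to the mono signal."""
    left, right = [], []
    for i, (gain_left, gain_right) in enumerate(params):
        segment = mono[i * block:(i + 1) * block]
        left.extend(gain_left * x for x in segment)
        right.extend(gain_right * x for x in segment)
    return left, right
```

With this simplification the reconstruction is exact only when one channel is a scaled copy of the other. For real programme material the result is an approximation of the stereo image, which is the trade-off that buys the very low side-information bitrate.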

Perceptual Noise Substitution (PNS)

The bitrate gains from using PNS are often not worth the computational workload when the audio is of a high quality.

For noisy audio sources, the noise can be filtered out and described as control parameters for a pseudorandom noise generator in the player, where it can be recreated.
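As a minimal sketch, assume the only control parameter is the energy of a noise-like band. The encoder sends that one number and the player generates fresh pseudorandom samples scaled to match it. The function names and the seed parameter are hypothetical and are not taken from the standard.

```python
import math
import random


def pns_encode(noise_band):
    """Replace a noise-like band with a single parameter: its energy."""
    return sum(x * x for x in noise_band)


def pns_decode(energy, length, seed=0):
    """Generate pseudorandom samples and scale them to the signaled energy."""
    rng = random.Random(seed)
    samples = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    raw_energy = sum(x * x for x in samples)
    gain = math.sqrt(energy / raw_energy) if raw_energy else 0.0
    return [x * gain for x in samples]
```

The regenerated samples differ from the originals, but they carry the same energy, which is what matters perceptually for noise-like content.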

Scalable Sample Rate (SSR)

Scaling the audio coding by splitting at the sample level is an interesting alternative to using base and enhancement layers.

If we de-interleave CD audio (sampled at 44.1 kHz) into 3 scalable streams, then stream 1 carries the first sample, stream 2 the second, and stream 3 the third and fourth. The fifth sample is added to stream 1 and so on. This yields two streams at roughly 11 kHz and one at roughly 22 kHz, which can be used by the target device in any combination.

A low-quality service can be reconstructed from one stream or all of them can be combined to reconstruct the original sample stream.
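The de-interleaving illustration above can be written out as follows. This sketch follows the simplified description in the text (groups of four samples split 1 + 1 + 2) and does not reproduce the standardized tool in sub-part 4. The function names are hypothetical.

```python
def split_streams(samples):
    """De-interleave groups of four samples into three streams (1 + 1 + 2)."""
    stream1, stream2, stream3 = [], [], []
    for i in range(0, len(samples), 4):
        group = samples[i:i + 4]
        stream1.extend(group[0:1])   # first sample of each group
        stream2.extend(group[1:2])   # second sample of each group
        stream3.extend(group[2:4])   # third and fourth samples
    return stream1, stream2, stream3


def merge_streams(stream1, stream2, stream3):
    """Re-interleave the three streams into the original sample order."""
    merged = []
    for i in range(len(stream1)):
        merged.extend(stream1[i:i + 1])
        merged.extend(stream2[i:i + 1])
        merged.extend(stream3[2 * i:2 * i + 2])
    return merged
```

At a 44.1 kHz input rate, streams 1 and 2 each carry 11,025 samples per second and stream 3 carries 22,050, which together account for every input sample.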

Error Resilience (ER)

Some audio objects have Error Resilient counterparts which are indicated with the ‘ER’ prefix. This is useful for transmitting coded audio over unreliable and error prone network links.

Additional error resilience is possible with checksums and Forward Error Correction introduced as the payload is segmented into network packets.
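A minimal sketch of the checksum part, assuming a CRC-32 appended to each packet; the packet size and function names are hypothetical. It only detects damage. Forward Error Correction, which can also repair it, is not shown.

```python
import zlib


def packetize(payload: bytes, size: int = 8):
    """Split the payload into packets, each followed by a 4-byte CRC-32."""
    packets = []
    for start in range(0, len(payload), size):
        chunk = payload[start:start + size]
        packets.append(chunk + zlib.crc32(chunk).to_bytes(4, "big"))
    return packets


def is_intact(packet: bytes) -> bool:
    """Recompute the CRC-32 over the data and compare it with the stored value."""
    chunk, stored = packet[:-4], packet[-4:]
    return zlib.crc32(chunk).to_bytes(4, "big") == stored
```

A receiver that finds a damaged packet can then conceal the gap or request the data again, depending on what the transport allows.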

Profiles & Audio Objects

Some profiles use a single Audio Object while others stack the tools hierarchically to make more efficient and sophisticated coders. The added complexity requires more computational effort and increases the latency:

  • MPEG-2 AAC-LC profile just uses the Low Complexity AAC-LC object.
  • MPEG-4 AAC-LC adds Perceptual Noise Substitution (PNS).
  • MPEG-4 HE-AAC v1 adds Spectral Band Replication (SBR).
  • MPEG-4 HE-AAC v2 adds Parametric Stereo (PS).

Because they are hierarchical, HE-AAC v2 players can decode any of the lower stacked levels.
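The stacking can be modelled as sets of tools, where a player can decode a stream if it supports every tool the stream's profile uses. The tool sets below are taken from the list above. The dictionary and function names are hypothetical, and the model ignores level and sample-rate constraints.

```python
# Tools used by each profile, following the stacking described in the list above.
PROFILE_TOOLS = {
    "MPEG-2 AAC-LC": {"AAC-LC"},
    "MPEG-4 AAC-LC": {"AAC-LC", "PNS"},
    "MPEG-4 HE-AAC v1": {"AAC-LC", "PNS", "SBR"},
    "MPEG-4 HE-AAC v2": {"AAC-LC", "PNS", "SBR", "PS"},
}


def can_decode(player_profile: str, stream_profile: str) -> bool:
    """A player can decode a stream if the stream's tools are a subset of its own."""
    return PROFILE_TOOLS[stream_profile] <= PROFILE_TOOLS[player_profile]
```

For example, can_decode("MPEG-4 HE-AAC v2", "MPEG-4 HE-AAC v1") is true, while can_decode("MPEG-4 AAC-LC", "MPEG-4 HE-AAC v2") is false because the player lacks SBR and PS.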

These are the standardized profiles. Organizations such as Fraunhofer create their own proprietary profiles:

Profile | Introduced by
Low-Complexity | MPEG-2
Main | MPEG-2
Scalable Sampling Rate | MPEG-2
AAC | MPEG-4
High Efficiency AAC (v1) | MPEG-4
HE-AAC v2 | MPEG-4
Main Audio | MPEG-4
Scalable Audio | MPEG-4
Speech Audio | MPEG-4
Synthetic Audio | MPEG-4
High Quality Audio | MPEG-4
Low Delay Audio | MPEG-4
Low Delay v2 Audio | MPEG-4
Natural Audio | MPEG-4
Mobile Audio Inter-networking | MPEG-4
HD-AAC | MPEG-4
ALS Simple | MPEG-4
Extended High Efficiency AAC | MPEG-D
(Limited) Scalable Lossless Coding | Fraunhofer HD-AAC

 

The Fraunhofer Scalable Lossless Coding (HD-AAC) is not the same as the SLS support defined by the MPEG-4 standard.


Refer to section 1.5 of the MPEG-4 Part 3 standard for a detailed description of the Audio Objects and how they are mapped to the profiles.


Additional MIME Type

Because HE-AAC is coded differently to classic AAC, a new MIME type is needed so that browsers can distinguish between the two formats:

MIME type | Description
audio/aac | Use this for Standard AAC format content. This is the most widely supported.
audio/aacp | Describes AAC+ content but is not as widely supported by web browsers.
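A server might choose between the two types with a small lookup like the one below. This is a hypothetical helper, not part of any standard or library. The alias mapping reuses the names listed earlier in this article, and anything that is not recognized as HE-AAC is assumed to be standard AAC.

```python
# Marketing names mapped to their canonical HE-AAC versions.
CANONICAL = {
    "AAC+": "HE-AAC v1",
    "aacPlus": "HE-AAC v1",
    "aacPlus v2": "HE-AAC v2",
    "eAAC+": "HE-AAC v2",
}


def mime_type_for(name: str) -> str:
    """Return audio/aacp for HE-AAC content and audio/aac for everything else."""
    canonical = CANONICAL.get(name, name)
    return "audio/aacp" if canonical.startswith("HE-AAC") else "audio/aac"
```

Because audio/aacp is less widely supported, a service may still need to test how its target browsers and players respond to it.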

 

Related Standards

The version date indicates the most recent base standard, corrigenda or amendment. Although the latest versions are indicated, earlier versions may contain relevant information that is removed from later standards. Some devices may be compatible only with an earlier version and you should use that when developing your services for them.

There is a gradual refactoring of the MPEG standards so they can benefit from reusing supporting technologies without needing to repeat them. The MPEG-D and Coding Independent Code Points standards are examples of that, as are the ISO 23XXX group of MPEG-related standards, which provide additional infrastructural support outside of the individual coding specifications.

Standard | Version | Description
ISO 11172-3 | 1996 | MPEG-1 Part 3 - Audio.
ISO 13818-1 | 2023 | MPEG-2 Part 1 - Systems.
ISO 13818-3 | 1998 | MPEG-2 Part 3 - Audio.
ISO 13818-7 | 2007 | MPEG-2 Part 7 - Advanced Audio Coding (AAC).
ISO 14496-1 | 2014 | MPEG-4 Part 1 - Systems. Currently being revised.
ISO 14496-3 | 2009 | MPEG-4 Part 3 - Audio coding. Released in 2001 and amended in 2003 and 2004.
ISO 14496-4 | 2019 | MPEG-4 Part 4 - Conformance bit-streams specification.
ISO 14496-5 | 2019 | MPEG-4 Part 5 - Reference Software.
ISO 14496-11 | 2015 | MPEG-4 Part 11 - Scene description and application engine.
ISO 14496-23 | 2008 | Symbolic Music Representation.
ISO 23091-3 | 2022 | MPEG-CICP - Coding Independent Code Points for delivering out of band metadata.
ISO 23001-8 | n/a | Withdrawn and replaced by ISO 23091.
ISO 23003 | n/a | MPEG-D is a group of standards for audio coding.
ISO 23003-1 | 2017 | MPEG-D Part 1 - MPEG Surround (a.k.a. Spatial Audio Coding).
ISO 23003-2 | 2018 | MPEG-D Part 2 - Spatial Audio Object Coding (SAOC).
ISO 23003-3 | 2021 | MPEG-D Part 3 - Unified speech and audio coding (USAC).
ISO 23003-4 | 2023 | MPEG-D Part 4 - Dynamic Range Control. Currently being revised.
ISO 23003-5 | 2020 | MPEG-D Part 5 - Uncompressed audio in MPEG-4 File Format.
ISO 23003-6 | 2022 | MPEG-D Part 6 - USAC Reference Software.
ISO 23003-7 | 2022 | MPEG-D Part 7 - USAC Conformance specification.
DVB-H | 2004 | Handheld mobile TV services.
DVB-SH | 2008 | Handheld mobile TV services delivered via a satellite link.
ETSI TS 101 154 | 2019 | HE-AAC and HE-AAC v2 audio coding for DVB applications.
ETSI TS 102 005 | 2010 | Video and Audio Coding in DVB services delivered directly over IP protocols.
ETSI TR 102 377 | 2009 | DVB-H Implementation guidelines.
ETSI TS 103 466 | 2019 | DAB audio coding (MPEG Layer II).
ETSI TS 126 401 | 2017 | Enhanced aacPlus general audio codec.
ETSI EN 302 304 | 2004 | Describes DVB-H.
3GPP TS 26.401 | 2024 | Describes the use of Enhanced AAC+ for mobile services.
General MIDI | 1999 | Developed by Roland to allow MIDI devices to sound similar when music sequences are played through them.
DLS | 1998 | The MIDI Downloadable Sounds Specification by the MIDI Manufacturers Association.
MIDI 1.0 | 1996 | The Complete MIDI 1.0 Detailed Specification by the MIDI Manufacturers Association.
MIDI 2.0 | 2020 | Extends MIDI 1.0 with additional capabilities.
ITU Rec H.223 | 1998 | Annex C describes a Multiplexing Protocol for Low Bitrate Multimedia Communication Over Highly Error-Prone Channels.
ITU Rec H.222.0 | 1995 | See ISO/IEC 13818-1 - Systems.

 

Patent Licenses

Patents for MPEG-4 Audio coding are managed by Via Licensing. Contact them for a license if you design and sell an Encoder or Decoder (Player) of your own.

Content owners do not need a license to distribute their MPEG Audio content.

Patents for AAC baseline technologies expire in 2028 and some newer extensions will have active patents until 2031.

Conclusion

Audio and video compression is a complex subject. We balance it here at a level sufficient to explain the fundamentals whilst avoiding a deep dive into the ‘Rabbit Hole’. Consult the MPEG-4 Part 3 standard if you need to explore MPEG Audio coding in greater detail.

The AAC audio standard is increasingly being used with High-Definition TV services (HDTV). This is supported by the DVB standards that are managed by ETSI. HE-AAC is particularly relevant for mobile TV using DVB-H.

Digital radio services such as DAB+ and Digital Radio Mondiale are also adopting HE-AAC.

Our recent articles have traced the history of MPEG audio coding from MPEG-1 Layer 1 up to the latest hybrid USAC codec. We are not quite finished with Audio yet. ST 2110-3x and AES are on the horizon.
