Standards: Part 18 - High Efficiency And Other Advanced Audio Codecs

Our series on Standards moves on to discussion of advancements in AAC coding, alternative coders for special case scenarios, and their management within a consistent framework.


This article is part of our growing series on Standards.
There is an overview of all 26 articles in Part 1 -  An Introduction To Standards.


The previous two articles addressed the evolution of MPEG audio, starting with MPEG-1 and MPEG-2 for Layers 1 to 3. Later, MPEG-2 Part 7 introduced AAC as a better but non-backwards compatible alternative. MPEG-4 improves on the coding and MPEG-D adds a hybrid codec for speech and music.

MPEG-4 Part 3

Where earlier MPEG Audio standards concentrated on general audio coding, MPEG-4 covers a much wider range of target applications. It manages a collection of versatile codecs through a unified interface. The new codecs are sometimes a more efficient choice than AAC:

Category Codecs
Lossy speech coding HVXC and CELP.
General audio coding AAC, TwinVQ and BSAC.
Lossless coding MPEG-4 SLS, Audio Lossless Coding, MPEG-4 DST.
Text to speech TTSI.
Structured Audio SAOL, SASL and MIDI.
Audio Synthesis Wavetable based, Sample based, Algorithmic and Effects.

 


Building your own custom player for these special codecs might incur a patent licensing liability.


High Efficiency AAC Coding (HE-AAC/AAC+)

High Efficiency AAC (HE-AAC) refines the coding techniques but remains compatible with MPEG-2 Part 7. It can deliver almost CD-quality sound at 32 kbps.

HE-AAC is known by a variety of other names:

Alias Canonical name
AAC+ HE-AAC v1
aacPlus HE-AAC v1
aacPlus v2 HE-AAC v2
eAAC+ HE-AAC v2

 

These are some important features:

  • Spectral Band Replication (SBR).
  • Parametric Stereo (PS).
  • Perceptual Noise Substitution (PNS).
  • Long Term Predictor (LTP).
  • Low Delay (AAC-LD).
  • MPEG-4 Scalable to Lossless (SLS).
  • AAC Scalable Sample Rate (SSR).
  • Structured Audio (SA).
  • Text To Speech (TTSI).
  • Error Resilience (ER).

Structured Audio and text to speech are extremely compact because they describe a sound that the receiving client renders entirely itself. This can deliver bitrates as low as 100 bps.

There are many historical versions, with profiles adding further variants:

Version Description
AAC (Original) Described in ISO 13818-7:1997.
AAC (Version 1) Described in ISO 14496-3:1999.
AAC (Version 2) Described in the ISO 14496-3:2000 revision.
AAC (Current) Described in ISO 14496-3:2009.
AAC+ (Version 2) Version 2 of aacPlus is described in ETSI TS 102 005:2010.
HE-AAC v1 (AAC+) Profile Described in the ISO 14496-3:2001 revision. Combines AAC LC with Spectral Band Replication (SBR).
HE-AAC v2 (aacPlus) Profile Described in the ISO 14496-3:2005 Revision. Adds Parametric Stereo (PS) to the version 1 features to achieve lower bitrates. Sometimes described as eAAC+.
xHE-AAC Fraunhofer introduced Loudness Control and adaptive streaming around 2016 (See MPEG-DASH). Well supported by many players including iOS and Android.
Extended HE-AAC ISO 23003-3:2020 adds USAC coding to HE-AAC v2, extending the tool set.

 

Tools & Technologies

The coding tools were reorganized in MPEG-4 to make them more flexible. New tools were added and older tools were refactored into separate items. They are now described as Audio Objects, each one having a specific identity and purpose. Some of them are containers for descriptive information. This is an additional layer of abstraction facilitating profile definitions.

These are the basic audio objects and the technologies that are used inside them. Refer to the MPEG-4 Part 3 sub-parts for functional descriptions of these tools and how they are mapped into the profiles. The sub-part references are all in the MPEG-4 Part 3 standard:

Terminology Description
AAC Main Based on AAC LC.
AAC LC The Low Complexity Audio Object combines the MPEG-2 Part 7 Low Complexity profile (LC) with Perceptual Noise Substitution (PNS). See sub-part 4.
AAC SSR Scalable Sample Rate is based on the MPEG-2 Part 7 Scalable Sampling Rate profile (SSR) combined with Perceptual Noise Substitution (PNS). See sub-part 4.
AAC LTP Long Term Prediction introduces a forward predictor with lower computational complexity. Also uses AAC LC.
AAC LD Low Delay, used with CELP, HVXC, and TTSI in the Low Delay Profile. Suitable for real-time conversation applications.
AAC ELD Enhanced Low Delay improves the bitrate and latency at the expense of a small increase in computational workload.
SBR Spectral Band Replication used with AAC LC in the HE-AAC Profile version 1.
TwinVQ Transform-domain Weighted Interleave Vector Quantization is designed for coding audio at extremely low bitrates (8 kbps). See sub-part 4.
CELP Speech coding with Code Excited Linear Prediction operates at low bitrates. TwinVQ may be more efficient. Not suited for use with music. See sub-part 3.
HVXC Speech coding with Harmonic Vector eXcitation Coding works well with low sample rates around 8 kHz delivering coded output at 1.6 kbps. Latency is very low making it suitable for telephony applications. See sub-part 2.
SSC SinuSoidal Coding. The technical underpinnings of Parametric Stereo coding for high quality audio. See sub-part 8.
PS Parametric Stereo used with AAC LC and SBR in the HE-AAC v2 Profile. The implementation uses SinuSoidal Coding (SSC). Stereo audio is coded as a monaural channel with two differential channels for the left and right signals. See sub-part 8.
MP1, MP2, MP3 MPEG-1/MPEG-2 Audio Layers 1, 2 & 3 in MPEG-4. See sub-part 9.
USAC Unified Speech and Audio Coding switches the coding strategy mid-stream between low bitrate CELP (for speech) and HE-AAC (for music) as it determines which is more efficient for each segment. See ISO 23003-3.
BSAC Bit Sliced Arithmetic Coding is an alternative scalable noiseless coding mechanism providing almost perfect quality at 64 kbps. Used for Digital Media Broadcasting (DMB) services. See sub-part 4.
HILN Parametric audio coding with Harmonic and Individual Line plus Noise. Sound can be coded as various harmonics of a sine wave plus a noise component described as a spectral envelope. See sub-part 7.
PNS Perceptual Noise Substitution improves efficiency by representing noise-like signal components with a parametric representation instead of coding the exact waveform. The decoder synthesizes the noise component based on the description.
DST Lossless coding of oversampled audio with Direct Stream Transfer. Popularized by Super Audio CDs. See sub-part 10.
ALS Audio Lossless Coding uses short and long-term predictors to encode sounds that are rich in harmonics. See sub-part 11.
SLS Scalable Lossless Coding is based on a layered approach which implements a lossy coding component in AAC with an additional correction layer that enhances it to provide the lossless result. SLS and ALS are not related to one another. See sub-part 12.
SLS non-core A lossless audio coder with a single coding stream without the lossy General Audio base layer.
MPEG Surround Also known as MPEG Spatial Audio Coding (SAC). Not the same as SAOC.
SAOC Spatial Audio Object Coding. See ISO 23003-2.
SAOC-DE Spatial Audio Object Coding Dialogue Enhancement.
LD MPEG Surround Low Delay MPEG Surround coding. The side channel information is described in ISO 23003-2.
Audio Sync Audio synchronization maintains the coherence of multiple content streams in multiple devices. See sub-part 13.
TTSI Text to Speech Interface that synthesizes the audio. See sub-part 6.
SA Structured Audio describes the audio as components or algorithms. The top level is a scheduler for controlling the construction and playback. See sub-part 5.
Wavetable synthesis Uses combinations of waveforms to create virtual instrument sounds.
Sample based synthesis Sampled natural sound fragments are combined and mixed to create a track. Based on SoundFont technologies.
Algorithmic synthesis Converts a description of a sound and how to play it into a compiled source code format (such as C Language). Then an application can be created to generate the sound.
Audio effects Part of the structured audio toolset.
SMR Simple Simplified version of Symbolic Music Representation. See ISO 14496-23.
SMR Main Main version of Symbolic Music Representation. See ISO 14496-23.
SAOL Structured Audio Orchestra Language. Derived from the earlier MUSIC-N language.
SASL Structured Audio Score Language.
SASBF Structured Audio Sample Bank Format.
MIDI Musical Instrument Digital Interface describes sound (predominantly music based) as a series of events (notes), sounds (patches) and modulations (controls).
General MIDI A standard set of sounds defined by Roland Corp to provide instrument sound (patch) compatibility across multiple MIDI devices.
DLS Downloadable Sounds standardized digital musical instrument sound banks which can be used with data driven sound tracks such as MIDI or SAOL.

 

Spectral Band Replication (SBR)

Spectral Band Replication discards redundant harmonic components in the encoder but reconstructs them by replicating the lower frequencies to derive suitable replacements in the player. This can be used with any codec.

A typical stream of audio might be coded to a target bitrate of around 128 kbps. This would reproduce all frequencies up to 15 kHz with a small reduction in the frequency response at the top end.

SBR cuts off the incoming frequencies at around 7.5 kHz. This loses a lot of the detail but reduces the bitrate to 64 kbps.

Then the higher band from 7.5 kHz to 15 kHz is processed through a more aggressive compression tool. This generates a description of the high frequency sounds that can be used in the decoder to reconstruct them from the lower order harmonics. The description is carried in auxiliary segments within the stream and only adds 1.5 kbps to the bitstream (65.5 kbps in total).

The player transposes the lower frequencies into the upper band where it can filter and mix them in using the descriptions in the auxiliary segments.
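The principle can be sketched in a few lines of Python. This is only an illustration of the transpose-and-rescale idea, not the real MPEG-4 SBR tool (which operates on a 64-band QMF filter bank with time-varying envelopes); the function names and the list-of-magnitudes "spectrum" are our own simplifications.

```python
def sbr_encode(spectrum):
    """Transmit the lower half of a magnitude spectrum exactly, and
    describe the upper half only by its average energy (the envelope)."""
    half = len(spectrum) // 2
    low, high = spectrum[:half], spectrum[half:]
    envelope = sum(h * h for h in high) / len(high)  # mean energy of high band
    return low, envelope

def sbr_decode(low, envelope):
    """Reconstruct the high band by transposing (copying) the low band
    upward, then rescaling it so its energy matches the envelope."""
    energy = sum(x * x for x in low) / len(low)
    gain = (envelope / energy) ** 0.5 if energy > 0 else 0.0
    return low + [x * gain for x in low]
```

The encoder's side information is a single number per band group here; the real envelope data is richer, but still small enough to explain the 1.5 kbps figure quoted above.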

Parametric Stereo (PS) & SinuSoidal Coding (SSC)

Parametric stereo exploits the similarity between the left and right channels to code them more efficiently.

The two channels are mixed down into a single monophonic channel and coded at full resolution. This is a base from which two differential channels can be derived. Those differences can be coded to a 3-kbps bitrate using SinuSoidal Coding.

The player decodes the mono channel and applies the differences to make the left and right outputs.

Instead of delivering two full bitrate channels, the encoder delivers one full bitrate channel and two very low bitrate differentials.
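The scheme described above can be sketched with simple mid/side arithmetic. Real Parametric Stereo transmits perceptual parameters (level and phase differences) rather than literal difference samples, so this is a toy model and the function names are our own.

```python
def ps_encode(left, right):
    """Mix down to one full-resolution mono channel plus a difference
    signal (which PS would code at a very low bitrate)."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mono, side

def ps_decode(mono, side):
    """Apply the differences to the mono channel to recover left/right."""
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right
```

When the channels are similar, the side signal is close to zero and compresses very cheaply, which is exactly the redundancy PS exploits.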

Perceptual Noise Substitution (PNS)

The bitrate gains from using PNS are often not worth the computational workload when the audio is of a high quality.

For noisy audio sources, the noise can be filtered out and described as control parameters for a pseudorandom noise generator in the player where they can be recreated.
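A minimal sketch of that substitution, assuming the only control parameter is the band's energy (the actual PNS tool works per scale-factor band inside the AAC frame, and these names are illustrative):

```python
import random

def pns_encode(band):
    """Replace the samples of a noise-like band with just its mean energy."""
    return sum(x * x for x in band) / len(band)

def pns_decode(energy, length, seed=0):
    """Re-synthesize a noise band of the transmitted energy using a
    pseudorandom generator in the player."""
    rng = random.Random(seed)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    actual = sum(x * x for x in noise) / length
    gain = (energy / actual) ** 0.5 if actual > 0 else 0.0
    return [gain * x for x in noise]
```

The decoded waveform is not sample-identical to the original, but for genuinely noise-like content the ear cannot tell the difference, which is the perceptual bet PNS makes.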

Scalable Sample Rate (SSR)

Scaling the audio coding by splitting at the sample level is an interesting alternative to using base and enhancement layers.

If we de-interleave CD audio into 3 scalable streams then stream 1 carries the first sample, stream 2 the second, and stream 3 both the third and the fourth. The next sample returns to stream 1 and the pattern repeats. This yields two 11 kHz sample streams and one 22 kHz stream which can be used by the target device in any combination.

A low-quality service can be reconstructed from one stream or all of them can be combined to reconstruct the original sample stream.
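The splitting pattern described above can be sketched as follows. Note this illustrates only the de-interleaving idea from the text; the actual MPEG-4 SSR tool divides the signal into frequency bands with a filter bank rather than picking alternate samples.

```python
def ssr_split(samples):
    """De-interleave into two quarter-rate streams and one half-rate stream:
    sample 1 -> stream 1, sample 2 -> stream 2, samples 3 and 4 -> stream 3."""
    s1, s2, s3 = [], [], []
    for i, x in enumerate(samples):
        slot = i % 4
        if slot == 0:
            s1.append(x)
        elif slot == 1:
            s2.append(x)
        else:          # slots 2 and 3 both go to the half-rate stream
            s3.append(x)
    return s1, s2, s3

def ssr_merge(s1, s2, s3):
    """Recombine all three streams into the original sample order."""
    out = []
    for i in range(len(s1)):
        out.extend([s1[i], s2[i], s3[2 * i], s3[2 * i + 1]])
    return out
```

A player with limited bandwidth could fetch only `s3` for a half-rate service, while a full-quality player merges all three.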

Error Resilience (ER)

Some audio objects have Error Resilient counterparts which are indicated with the ‘ER’ prefix. This is useful for transmitting coded audio over unreliable and error prone network links.

Additional error resilience is possible with checksums and Forward Error Correction introduced as the payload is segmented into network packets.
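The checksum part of that packet-level protection can be sketched with a CRC-32 per packet. The 4-byte framing here is invented for illustration, and a real system would add Forward Error Correction so that errors can be repaired as well as detected.

```python
import zlib

def packetize(payload: bytes, size: int):
    """Segment the payload into packets, prepending a CRC-32 to each."""
    packets = []
    for i in range(0, len(payload), size):
        chunk = payload[i:i + size]
        crc = zlib.crc32(chunk).to_bytes(4, "big")
        packets.append(crc + chunk)
    return packets

def check(packet: bytes) -> bool:
    """Return True if the packet's checksum still matches its contents."""
    crc, chunk = packet[:4], packet[4:]
    return zlib.crc32(chunk).to_bytes(4, "big") == crc
```

A receiver would discard (or request retransmission of) any packet failing the check, while the ER audio object variants keep the decoder stable across the resulting gaps.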

Profiles & Audio Objects

A profile could use a single Audio Object while other profiles stack the tools hierarchically to make more efficient and sophisticated coders. Complexity requires more computational effort and increases the latency:

  • MPEG-2 AAC-LC profile just uses the Low Complexity AAC-LC object.
  • MPEG-4 AAC-LC adds Perceptual Noise Substitution (PNS).
  • MPEG-4 HE-AAC v1 adds Spectral Band Replication (SBR).
  • MPEG-4 HE-AAC v2 adds Parametric Stereo (PS).

Because they are hierarchical, HE-AAC v2 players can decode any of the lower stacked levels.

These are the standardized profiles. Organizations such as Fraunhofer create their own proprietary profiles:

Profile Introduced by
Low-Complexity MPEG-2
Main MPEG-2
Scalable Sampling Rate MPEG-2
AAC MPEG-4
High Efficiency AAC (v1) MPEG-4
HE-AAC v2 MPEG-4
Main Audio MPEG-4
Scalable Audio MPEG-4
Speech Audio MPEG-4
Synthetic Audio MPEG-4
High Quality Audio MPEG-4
Low Delay Audio MPEG-4
Low Delay v2 Audio MPEG-4
Natural Audio MPEG-4
Mobile Audio Inter-networking MPEG-4
HD-AAC MPEG-4
ALS Simple MPEG-4
Extended High Efficiency AAC MPEG-D
(Limited) Scalable Lossless Coding Fraunhofer HD-AAC

 

The Fraunhofer Scalable Lossless Coding (HD-AAC) is not the same as the SLS support defined by the MPEG-4 standard.


Refer to section 1.5 of the MPEG-4 Part 3 standard for a detailed description of the Audio Objects and how they are mapped to the profiles.


Additional MIME Type

Because HE-AAC is coded differently to classic AAC, a new MIME type is needed so that browsers can distinguish between the two formats:

MIME type Description
audio/aac Use this for Standard AAC format content. This is the most widely supported.
audio/aacp Describes AAC+ content but is not as widely supported by web browsers.
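A server choosing a Content-Type header could follow the table above with a simple lookup. The codec labels used as keys here are our own illustrative names, not values defined by any standard:

```python
# Map codec names to the MIME types from the table above.
MIME_TYPES = {
    "aac": "audio/aac",        # classic AAC, the most widely supported
    "he-aac": "audio/aacp",    # AAC+ / HE-AAC, less widely supported
}

def content_type(codec: str) -> str:
    """Fall back to the widely supported type for unknown codecs."""
    return MIME_TYPES.get(codec.lower(), "audio/aac")
```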

 

Related Standards

The version date indicates the most recent base standard, corrigenda or amendment. Although the latest versions are indicated, earlier versions may contain relevant information that is removed from later standards. Some devices may be compatible only with an earlier version and you should use that when developing your services for them.

There is a gradual refactoring of the MPEG standards so they can benefit from reusing supporting technologies without needing to repeat them. The MPEG-D and Coding Independent Code Points standards are examples of that as are the ISO 23XXX group of MPEG related standards which provide additional infrastructural support outside of the individual coding specifications.

Standard Version Description
ISO 11172-3 1996 MPEG-1 Part 3 - Audio.
ISO 13818-1 2023 MPEG-2 Part 1 - Systems.
ISO 13818-3 1998 MPEG-2 Part 3 - Audio.
ISO 13818-7 2007 MPEG-2 Part 7 - Advanced Audio Coding (AAC).
ISO 14496-1 2014 MPEG-4 Part 1 - Systems. Currently being revised.
ISO 14496-3 2009 MPEG-4 Part 3 - Audio coding. Released in 2001 and amended in 2003 and 2004.
ISO 14496-4 2019 MPEG-4 Part 4 - Conformance bit-streams specification.
ISO 14496-5 2019 MPEG-4 Part 5 - Reference Software.
ISO 14496-11 2015 MPEG-4 Part 11 - Scene description and application engine.
ISO 14496-23 2008 Symbolic Music Representation.
ISO 23091-3 2022 MPEG-CICP - Coding Independent Code Points for delivering out of band metadata.
ISO 23001-8 n/a Withdrawn and replaced by ISO 23091.
ISO 23003 n/a MPEG-D is a group of standards for audio coding.
ISO 23003-1 2017 MPEG-D Part 1 - MPEG Surround (a.k.a. Spatial Audio Coding).
ISO 23003-2 2018 MPEG-D Part 2 - Spatial Audio Object Coding (SAOC).
ISO 23003-3 2021 MPEG-D Part 3 - Unified speech and audio coding (USAC).
ISO 23003-4 2023 MPEG-D Part 4 - Dynamic Range Control. Currently being revised.
ISO 23003-5 2020 MPEG-D Part 5 - Uncompressed audio in MPEG-4 File Format.
ISO 23003-6 2022 MPEG-D Part 6 - USAC Reference Software.
ISO 23003-7 2022 MPEG-D Part 7 - USAC Conformance specification.
DVB-H 2004 Handheld mobile TV services.
DVB-SH 2008 Handheld mobile TV services delivered via a satellite link.
ETSI TS 101 154 2019 HE-AAC and HE-AAC v2 audio coding for DVB applications.
ETSI TS 102 005 2010 Video and Audio Coding in DVB services delivered directly over IP protocols.
ETSI TR 102 377 2009 DVB-H Implementation guidelines.
ETSI TS 103 466 2019 DAB audio coding (MPEG Layer II).
ETSI TS 126 401 2017 Enhanced aacPlus general audio codec.
ETSI EN 302 304 2004 Describes DVB-H.
3GPP TS 26.401 2024 Describes the use of Enhanced AAC+ for mobile services.
General MIDI 1999 Developed by Roland to allow MIDI devices to sound similar when music sequences are played through them.
DLS 1998 The MIDI Downloadable Sounds Specification by the MIDI Manufacturers Association.
MIDI 1.0 1996 The Complete MIDI 1.0 Detailed Specification by the MIDI Manufacturers Association.
MIDI 2.0 2020 Extends MIDI 1.0 with additional capabilities.
ITU Rec H.223 1998 Annexe C describes a Multiplexing Protocol for Low Bitrate Multimedia Communication Over Highly Error-Prone Channels.
ITU Rec H.222.0 1995 See ISO/IEC 13818-1 - Systems.

 

Patent Licenses

Patents for MPEG-4 Audio coding are managed by Via Licensing. Contact them for a license if you design and sell an Encoder or Decoder (Player) of your own.

Content owners do not need a license to distribute their MPEG Audio content.

Patents for AAC baseline technologies expire in 2028 and some newer extensions will have active patents until 2031.

Conclusion

Audio and video compression is a complex subject. We balance it here at a level sufficient to explain the fundamentals whilst avoiding a deep dive into the ‘Rabbit Hole’. Consult the MPEG-4 Part 3 standard if you need to explore MPEG Audio coding in greater detail.

The AAC audio standard is increasingly being used with High-Definition TV services (HDTV). This is supported by the DVB standards that are managed by ETSI. HE-AAC is particularly relevant for mobile TV using DVB-H.

Digital radio services such as DAB+ and Digital Radio Mondiale are also adopting HE-AAC.

Our recent articles have traced the history of MPEG audio coding from MPEG-1 Layer 1 up to the latest hybrid USAC codec. We are not quite finished with Audio yet. ST 2110-3x and AES are on the horizon.

