Standards: Part 18 - High Efficiency And Other Advanced Audio Codecs
Our series on Standards moves on to discussion of advancements in AAC coding, alternative coders for special case scenarios, and their management within a consistent framework.
This article is part of our growing series on Standards.
There is an overview of all 26 articles in Part 1 - An Introduction To Standards.
The previous two articles addressed the evolution of MPEG audio, starting with MPEG-1 and MPEG-2 for layers 1 to 3. Later, MPEG-2 introduced AAC in Part 7 as a better but non-backwards compatible alternative. MPEG-4 improves on the coding and MPEG-D adds a hybrid codec for speech and music.
MPEG-4 Part 3
Where earlier MPEG Audio standards concentrated on general audio coding, MPEG-4 covers a much wider range of target applications. It manages a collection of versatile codecs through a unified interface. The new codecs are sometimes a more efficient choice than AAC:
Category | Codecs |
---|---|
Lossy speech coding | HVXC and CELP. |
General audio coding | AAC, TwinVQ and BSAC. |
Lossless coding | MPEG-4 SLS, Audio Lossless Coding, MPEG-4 DST. |
Text to speech | TTSI. |
Structured Audio | SAOL, SASL and MIDI. |
Audio Synthesis | Wavetable based, Sample based, Algorithmic and Effects. |
Building your own custom player for these special codecs might incur a patent licensing liability.
High Efficiency AAC Coding (HE-AAC/AAC+)
High Efficiency AAC (HE-AAC) refines the coding techniques but remains compatible with MPEG-2 Part 7. It can deliver almost CD-quality sound at 32 kbps.
HE-AAC is known by a variety of other names:
Alias | Canonical name |
---|---|
AAC+ | HE-AAC v1 |
aacPlus | HE-AAC v1 |
aacPlus v2 | HE-AAC v2 |
eAAC+ | HE-AAC v2 |
These are some important features:
- Spectral Band Replication (SBR).
- Parametric Stereo (PS).
- Perceptual Noise Substitution (PNS).
- Long Term Predictor (LTP).
- Low Delay (AAC-LD).
- MPEG-4 Scalable to Lossless (SLS).
- AAC Scalable Sample Rate (SSR).
- Structured Audio (SA).
- Text To Speech (TTSI).
- Error Resilience (ER).
Structured Audio and text to speech are extremely compact because they only describe a sound, which the player then renders entirely in the receiving client. This can deliver bitrates as low as 100 bps.
There are many historical versions with profiles adding more variants:
Version | Description |
---|---|
AAC (Original) | Described in ISO 13818-7:1997. |
AAC (Version 1) | Described in ISO 14496-3:1999. |
AAC (Version 2) | Described in the ISO 14496-3:2000 revision. |
AAC (Current) | Described in ISO 14496-3:2009. |
AAC+ (Version 2) | Version 2 of aacPlus is described in ETSI TS 102 005:2010. |
HE-AAC v1 (AAC+) Profile | Described in the ISO 14496-3:2001 revision. Combines AAC LC with Spectral Band Replication (SBR). |
HE-AAC v2 (aacPlus v2) Profile | Described in the ISO 14496-3:2005 revision. Adds Parametric Stereo (PS) to the version 1 features to achieve lower bitrates. Sometimes described as eAAC+. |
xHE-AAC | Fraunhofer introduced Loudness Control and adaptive streaming around 2016 (See MPEG-DASH). Well supported by many players including iOS and Android. |
Extended HE-AAC | ISO 23003-3:2020 adds USAC coding to HE-AAC v2, extending the tool set. |
Tools & Technologies
The coding tools were reorganized in MPEG-4 to make them more flexible. New tools were added and older tools were refactored into separate items. They are now described as Audio Objects, each one having a specific identity and purpose. Some of them are containers for descriptive information. This is an additional layer of abstraction facilitating profile definitions.
These are the basic audio objects and the technologies that are used inside them. Refer to the MPEG-4 Part 3 sub-parts for functional descriptions of these tools and how they are mapped into the profiles. The sub-part references are all in the MPEG-4 Part 3 standard:
Terminology | Description |
---|---|
AAC Main | Based on AAC LC. |
AAC LC | The Low Complexity Audio Object combines the MPEG-2 Part 7 Low Complexity profile (LC) with Perceptual Noise Substitution (PNS). See sub-part 4. |
AAC SSR | Scalable Sample Rate is based on the MPEG-2 Part 7 Scalable Sampling Rate profile (SSR) combined with Perceptual Noise Substitution (PNS). See sub-part 4. |
AAC LTP | Long Term Prediction introduces a forward predictor with lower computational complexity. Also uses AAC LC. |
AAC LD | Low Delay, used with CELP, HVXC, and TTSI in the Low Delay Profile. Suitable for real-time conversation applications. |
AAC ELD | Enhanced Low Delay improves the bitrate and latency at the expense of a small increase in computational workload. |
SBR | Spectral Band Replication used with AAC LC in the HE-AAC Profile version 1. |
TwinVQ | Transform-domain Weighted Interleave Vector Quantization is designed for coding audio at extremely low bitrates (8 kbps). See sub-part 4. |
CELP | Speech coding with Code Excited Linear Prediction operates at low bitrates. TwinVQ may be more efficient. Not suited for use with music. See sub-part 3. |
HVXC | Speech coding with Harmonic Vector eXcitation Coding works well with low sample rates around 8 kHz delivering coded output at 1.6 kbps. Latency is very low making it suitable for telephony applications. See sub-part 2. |
SSC | SinuSoidal Coding. The technical underpinnings of Parametric Stereo coding for high quality audio. See sub-part 8. |
PS | Parametric Stereo used with AAC LC and SBR in the HE-AAC v2 Profile. The implementation uses SinuSoidal Coding (SSC). Stereo audio is coded as a monaural channel with two differential channels for the left and right signals. See sub-part 8. |
MP1, MP2, MP3 | MPEG-1/MPEG-2 Audio Layers 1, 2 and 3 in MPEG-4. See sub-part 9. |
USAC | Unified Speech and Audio Coding switches the coding strategy mid-stream between low bitrate CELP (for speech) and HE-AAC (for music) as it determines which is more efficient for each segment. See ISO 23003-3. |
BSAC | Bit Sliced Arithmetic Coding is an alternative scalable noiseless coding mechanism providing almost perfect quality at 64 kbps. Used for Digital Media Broadcasting (DMB) services. See sub-part 4. |
HILN | Parametric audio coding with Harmonic and Individual Line plus Noise. Sound can be coded as various harmonics of a sine wave plus a noise component described as a spectral envelope. See sub-part 7. |
PNS | Perceptual Noise Substitution improves efficiency by representing noise-like signal components with a parametric representation instead of coding the exact waveform. The decoder synthesizes the noise component based on the description. |
DST | Lossless coding of oversampled audio with Direct Stream Transfer. Popularized by Super Audio CDs. See sub-part 10. |
ALS | Audio Lossless Coding uses short and long-term predictors to encode sounds that are rich in harmonics. See sub-part 11. |
SLS | Scalable Lossless Coding is based on a layered approach which implements a lossy coding component in AAC with an additional correction layer that enhances it to provide the lossless result. SLS and ALS are not related to one another. See sub-part 12. |
SLS non-core | A lossless audio coder with a single coding stream without the lossy General Audio base layer. |
MPEG Surround | Also known as MPEG Spatial Audio Coding (SAC). Not the same as SAOC. |
SAOC | Spatial Audio Object Coding. See ISO 23003-2. |
SAOC-DE | Spatial Audio Object Coding Dialogue Enhancement. |
LD MPEG Surround | Low Delay MPEG Surround coding. The side channel information is described in ISO 23003-2. |
Audio Sync | Audio synchronization maintains the coherence of multiple content streams in multiple devices. See sub-part 13. |
TTSI | Text to Speech Interface that synthesizes the audio. See sub-part 6. |
SA | Structured Audio describes the audio as components or algorithms. The top level is a scheduler for controlling the construction and playback. See sub-part 5. |
Wavetable synthesis | Uses combinations of waveforms to create virtual instrument sounds. |
Sample based synthesis | Sampled natural sound fragments are combined and mixed to create a track. Based on SoundFont technologies. |
Algorithmic synthesis | Converts a description of a sound and how to play it into a compiled source code format (such as C Language). Then an application can be created to generate the sound. |
Audio effects | Part of the structured audio toolset. |
SMR Simple | Simplified version of Symbolic Music Representation. See ISO 14496-23. |
SMR Main | Main version of Symbolic Music Representation. See ISO 14496-23. |
SAOL | Structured Audio Orchestra Language. Derived from the earlier MUSIC-N language. |
SASL | Structured Audio Score Language. |
SASBF | Structured Audio Sample Bank Format. |
MIDI | Musical Instrument Digital Interface describes sound (predominantly music based) as a series of events (notes), sounds (patches) and modulations (controls). |
General MIDI | A standard set of sounds defined by Roland Corp to provide instrument sound (patch) compatibility across multiple MIDI devices. |
DLS | Downloadable Sounds are standardized digital musical instrument sound banks which can be used with data driven sound tracks such as MIDI or SAOL. |
Spectral Band Replication (SBR)
Spectral Band Replication discards redundant harmonic components in the encoder but reconstructs them by replicating the lower frequencies to derive suitable replacements in the player. This can be used with any codec.
A typical stream of audio might be coded to a target bitrate of around 128 kbps. This would reproduce all frequencies up to 15 kHz with a small reduction in the frequency response at the top end.
SBR cuts off the incoming frequencies at around 7.5 kHz. This loses a lot of the detail but reduces the bitrate to 64 kbps.
Then the higher band from 7.5 kHz to 15 kHz is processed through a more aggressive compression tool. This generates a description of the high frequency sounds that can be used in the decoder to reconstruct them from the lower order harmonics. The description is carried in auxiliary segments within the stream and only adds 1.5 kbps to the bitstream (65.5 kbps in total).
The player transposes the lower frequencies into the upper band where it can filter and mix them in using the descriptions in the auxiliary segments.
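The bitrate arithmetic in this example can be checked with a short sketch. The function name and its parameters are illustrative only. It assumes, as the figures above do, that halving the coded bandwidth halves the core bitrate, which real encoders will only approximate.

```python
def sbr_bitrate_budget(full_band_kbps, sbr_side_info_kbps=1.5):
    """Estimate the bitrate of an SBR-coded stream from the article's figures.

    Assumption: coding only the lower half of the band halves the bitrate.
    """
    core_kbps = full_band_kbps / 2                 # low band only (up to ~7.5 kHz)
    total_kbps = core_kbps + sbr_side_info_kbps    # plus the auxiliary description
    saving = 1 - total_kbps / full_band_kbps       # fraction of bitrate saved
    return {"core_kbps": core_kbps, "total_kbps": total_kbps, "saving": saving}


budget = sbr_bitrate_budget(128)
print(budget["total_kbps"])  # 65.5
```

With these numbers the stream needs just over half of the original bitrate while still describing the full 15 kHz bandwidth.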
Parametric Stereo (PS) & SinuSoidal Coding (SSC)
Parametric stereo exploits the similarity between the left and right channels to code them more efficiently.
The two channels are mixed down into a single monophonic channel and coded at full resolution. This is a base from which two differential channels can be derived. Those differences can be coded at a bitrate of 3 kbps using SinuSoidal Coding.
The player decodes the mono channel and applies the differences to make the left and right outputs.
Instead of delivering two full bitrate channels, the encoder delivers one full bitrate channel and two very low bitrate differentials.
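The mono-plus-differences idea can be illustrated with a few lines of Python. This is a simplified sketch of the model described above, not the Parametric Stereo tool itself: a real encoder sends compact parameters rather than sample-by-sample differences, which is how the side information stays so small. The function names `downmix` and `upmix` are hypothetical.

```python
def downmix(left, right):
    """Encoder side: one mono channel plus a difference signal per output."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    diff_left = [l - m for l, m in zip(left, mono)]
    diff_right = [r - m for r, m in zip(right, mono)]
    return mono, diff_left, diff_right


def upmix(mono, diff_left, diff_right):
    """Player side: apply the differences to the mono channel."""
    left = [m + d for m, d in zip(mono, diff_left)]
    right = [m + d for m, d in zip(mono, diff_right)]
    return left, right


left = [0.5, 0.25, -0.75]
right = [0.25, 0.25, -0.25]
mono, diff_left, diff_right = downmix(left, right)
restored_left, restored_right = upmix(mono, diff_left, diff_right)
```

The more alike the two channels are, the closer the differences are to zero and the cheaper they become to describe.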
Perceptual Noise Substitution (PNS)
The bitrate gains from using PNS are often not worth the computational workload when the audio is of a high quality.
For noisy audio sources, the noise can be filtered out and described as control parameters for a pseudorandom noise generator in the player, where the noise is recreated.
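As a rough sketch of that principle, the encoder below reduces a noise-like band to a single power value and the player regenerates noise at the same level. The function names, the single-parameter description and the fixed seed are assumptions made for illustration. The standard defines its own parameters and signalling.

```python
import math
import random


def describe_noise(band):
    """Encoder side: replace the waveform with its mean power."""
    return sum(x * x for x in band) / len(band)


def synthesize_noise(mean_power, length, seed=0):
    """Player side: generate pseudorandom noise scaled to the described power."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(length)]
    raw_power = sum(x * x for x in raw) / length
    gain = math.sqrt(mean_power / raw_power)
    return [x * gain for x in raw]


noisy_band = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, 0.25, -0.1]
power = describe_noise(noisy_band)
replacement = synthesize_noise(power, len(noisy_band))
```

The regenerated samples differ from the originals, but the listener hears noise of the same loudness, which is all that matters for noise-like content.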
Scalable Sample Rate (SSR)
Scaling the audio coding by splitting at the sample level is an interesting alternative to using base and enhancement layers.
If we de-interleave CD audio (sampled at 44.1 kHz) into 3 scalable streams, then stream 1 carries the first sample of each group of four, stream 2 the second, and stream 3 the third and fourth. The fifth sample is added to stream 1 and so on. This yields two streams at roughly 11 kHz and one at roughly 22 kHz, which can be used by the target device in any combination.
A low-quality service can be reconstructed from one stream or all of them can be combined to reconstruct the original sample stream.
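The de-interleaving described here can be sketched as follows. This follows the simplified illustration in the text rather than the normative SSR tool, and the function names are hypothetical. It assumes the number of samples is a multiple of four.

```python
def split_streams(samples):
    """Deal samples out in groups of four: 1st, 2nd, then 3rd and 4th together."""
    stream1 = samples[0::4]
    stream2 = samples[1::4]
    stream3 = [s for i, s in enumerate(samples) if i % 4 >= 2]
    return stream1, stream2, stream3


def merge_streams(stream1, stream2, stream3):
    """Recombine all three streams into the original sample order."""
    merged = []
    for i in range(len(stream1)):
        merged.extend([stream1[i], stream2[i], stream3[2 * i], stream3[2 * i + 1]])
    return merged


samples = list(range(8))                 # stand-in for eight audio samples
s1, s2, s3 = split_streams(samples)      # s3 holds twice as many samples
restored = merge_streams(s1, s2, s3)     # identical to the input
```

A device with limited bandwidth could play stream 1 alone at a quarter of the original sample rate.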
Error Resilience (ER)
Some audio objects have Error Resilient counterparts which are indicated with the ‘ER’ prefix. This is useful for transmitting coded audio over unreliable and error prone network links.
Additional error resilience is possible with checksums and Forward Error Correction introduced as the payload is segmented into network packets.
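A minimal sketch of the checksum part of that idea is shown below, using the CRC-32 function from Python's standard `zlib` module. The packet layout (data followed by a 4-byte checksum) is an assumption for illustration and does not correspond to any particular transport format. Forward Error Correction is not shown.

```python
import zlib


def packetize(payload, size):
    """Split a payload into packets, each followed by a CRC-32 of its data."""
    packets = []
    for offset in range(0, len(payload), size):
        chunk = payload[offset:offset + size]
        checksum = zlib.crc32(chunk).to_bytes(4, "big")
        packets.append(chunk + checksum)
    return packets


def is_intact(packet):
    """Recompute the CRC-32 and compare it with the one carried in the packet."""
    chunk, checksum = packet[:-4], packet[-4:]
    return zlib.crc32(chunk).to_bytes(4, "big") == checksum


packets = packetize(b"coded audio frame data", 8)
damaged = bytes([packets[0][0] ^ 0x01]) + packets[0][1:]  # flip one bit
```

The receiver can discard or conceal any packet for which `is_intact` returns `False`, as it would for the `damaged` packet here.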
Profiles & Audio Objects
One profile might use a single Audio Object, while others stack the tools hierarchically to make more efficient and sophisticated coders. Greater complexity requires more computational effort and increases the latency:
- MPEG-2 AAC-LC profile just uses the Low Complexity AAC-LC object.
- MPEG-4 AAC-LC adds Perceptual Noise Substitution (PNS).
- MPEG-4 HE-AAC v1 adds Spectral Band Replication (SBR).
- MPEG-4 HE-AAC v2 adds Parametric Stereo (PS).
Because they are hierarchical, HE-AAC v2 players can decode any of the lower stacked levels.
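The stacking listed above can be modelled as sets of tools, where a player can decode a stream if it supports every tool the stream uses. The table and the `can_decode` function are an illustrative sketch built only from the four levels listed here, not a complete model of the profiles in the standard.

```python
# Each level contains everything in the level below plus one more tool.
PROFILE_TOOLS = {
    "MPEG-2 AAC-LC": {"AAC-LC"},
    "MPEG-4 AAC-LC": {"AAC-LC", "PNS"},
    "MPEG-4 HE-AAC v1": {"AAC-LC", "PNS", "SBR"},
    "MPEG-4 HE-AAC v2": {"AAC-LC", "PNS", "SBR", "PS"},
}


def can_decode(player_profile, stream_profile):
    """True if the player supports every tool that the stream requires."""
    return PROFILE_TOOLS[stream_profile] <= PROFILE_TOOLS[player_profile]


print(can_decode("MPEG-4 HE-AAC v2", "MPEG-4 AAC-LC"))   # True
print(can_decode("MPEG-4 AAC-LC", "MPEG-4 HE-AAC v1"))   # False
```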
These are the standardized profiles. Organizations such as Fraunhofer create their own proprietary profiles:
Profile | Introduced by |
---|---|
Low-Complexity | MPEG-2 |
Main | MPEG-2 |
Scalable Sampling Rate | MPEG-2 |
AAC | MPEG-4 |
High Efficiency AAC (v1) | MPEG-4 |
HE-AAC v2 | MPEG-4 |
Main Audio | MPEG-4 |
Scalable Audio | MPEG-4 |
Speech Audio | MPEG-4 |
Synthetic Audio | MPEG-4 |
High Quality Audio | MPEG-4 |
Low Delay Audio | MPEG-4 |
Low Delay v2 Audio | MPEG-4 |
Natural Audio | MPEG-4 |
Mobile Audio Inter-networking | MPEG-4 |
HD-AAC | MPEG-4 |
ALS Simple | MPEG-4 |
Extended High Efficiency AAC | MPEG-D |
(Limited) Scalable Lossless Coding | Fraunhofer HD-AAC |
The Fraunhofer Scalable Lossless Coding (HD-AAC) is not the same as the SLS support defined by the MPEG-4 standard.
Refer to section 1.5 of the MPEG-4 Part 3 standard for a detailed description of the Audio Objects and how they are mapped to the profiles.
Additional MIME Type
Because HE-AAC is coded differently to classic AAC, a new MIME type is needed so that browsers can distinguish between the two formats:
MIME type | Description |
---|---|
audio/aac | Use this for Standard AAC format content. This is the most widely supported. |
audio/aacp | Describes AAC+ content but is not as widely supported by web browsers. |
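A service might choose between the two types with a simple lookup like the sketch below. The function name and the set of recognized names are assumptions for illustration, drawn from the aliases listed earlier in this article.

```python
# Names that all refer to HE-AAC content, in lower case for comparison.
HE_AAC_NAMES = {"he-aac v1", "he-aac v2", "aac+", "aacplus", "aacplus v2", "eaac+"}


def mime_type_for(codec_name):
    """Return audio/aacp for HE-AAC variants and audio/aac for anything else."""
    if codec_name.strip().lower() in HE_AAC_NAMES:
        return "audio/aacp"
    return "audio/aac"


print(mime_type_for("aacPlus v2"))  # audio/aacp
print(mime_type_for("AAC LC"))      # audio/aac
```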
Related Standards
The version date indicates the most recent base standard, corrigenda or amendment. Although the latest versions are indicated, earlier versions may contain relevant information that is removed from later standards. Some devices may be compatible only with an earlier version and you should use that when developing your services for them.
There is a gradual refactoring of the MPEG standards so they can benefit from reusing supporting technologies without needing to repeat them. The MPEG-D and Coding Independent Code Points standards are examples of that as are the ISO 23XXX group of MPEG related standards which provide additional infrastructural support outside of the individual coding specifications.
Standard | Version | Description |
---|---|---|
ISO 11172-3 | 1996 | MPEG-1 Part 3 - Audio. |
ISO 13818-1 | 2023 | MPEG-2 Part 1 - Systems. |
ISO 13818-3 | 1998 | MPEG-2 Part 3 - Audio. |
ISO 13818-7 | 2007 | MPEG-2 Part 7 - Advanced Audio Coding (AAC). |
ISO 14496-1 | 2014 | MPEG-4 Part 1 - Systems. Currently being revised. |
ISO 14496-3 | 2009 | MPEG-4 Part 3 - Audio coding. Released in 2001 and amended in 2003 and 2004. |
ISO 14496-4 | 2019 | MPEG-4 Part 4 - Conformance bit-streams specification. |
ISO 14496-5 | 2019 | MPEG-4 Part 5 - Reference Software. |
ISO 14496-11 | 2015 | MPEG-4 Part 11 - Scene description and application engine. |
ISO 14496-23 | 2008 | Symbolic Music Representation. |
ISO 23091-3 | 2022 | MPEG-CICP - Coding Independent Code Points for delivering out of band metadata. |
ISO 23001-8 | n/a | Withdrawn and replaced by ISO 23091. |
ISO 23003 | n/a | MPEG-D is a group of standards for audio coding. |
ISO 23003-1 | 2017 | MPEG-D Part 1 - MPEG Surround (a.k.a. Spatial Audio Coding). |
ISO 23003-2 | 2018 | MPEG-D Part 2 - Spatial Audio Object Coding (SAOC). |
ISO 23003-3 | 2021 | MPEG-D Part 3 - Unified speech and audio coding (USAC). |
ISO 23003-4 | 2023 | MPEG-D Part 4 - Dynamic Range Control. Currently being revised. |
ISO 23003-5 | 2020 | MPEG-D Part 5 - Uncompressed audio in MPEG-4 File Format. |
ISO 23003-6 | 2022 | MPEG-D Part 6 - USAC Reference Software. |
ISO 23003-7 | 2022 | MPEG-D Part 7 - USAC Conformance specification. |
DVB-H | 2004 | Handheld mobile TV services. |
DVB-SH | 2008 | Handheld mobile TV services delivered via a satellite link. |
ETSI TS 101 154 | 2019 | HE-AAC and HE-AAC v2 audio coding for DVB applications. |
ETSI TS 102 005 | 2010 | Video and Audio Coding in DVB services delivered directly over IP protocols. |
ETSI TR 102 377 | 2009 | DVB-H Implementation guidelines. |
ETSI TS 103 466 | 2019 | DAB audio coding (MPEG Layer II). |
ETSI TS 126 401 | 2017 | Enhanced aacPlus general audio codec. |
ETSI EN 302 304 | 2004 | Describes DVB-H. |
3GPP TS 26.401 | 2024 | Describes the use of Enhanced AAC+ for mobile services. |
General MIDI | 1999 | Developed by Roland to allow MIDI devices to sound similar when music sequences are played through them. |
DLS | 1998 | The MIDI Downloadable Sounds Specification by the MIDI Manufacturers Association. |
MIDI 1.0 | 1996 | The Complete MIDI 1.0 Detailed Specification by the MIDI Manufacturers Association. |
MIDI 2.0 | 2020 | Extends MIDI 1.0 with additional capabilities. |
ITU Rec H.223 | 1998 | Annex C describes a Multiplexing Protocol for Low Bitrate Multimedia Communication Over Highly Error-Prone Channels. |
ITU Rec H.222.0 | 1995 | See ISO/IEC 13818-1 - Systems. |
Patent Licenses
Patents for MPEG-4 Audio coding are managed by Via Licensing. Contact them for a license if you design and sell an Encoder or Decoder (Player) of your own.
Content owners do not need a license to distribute their MPEG Audio content.
Patents for AAC baseline technologies expire in 2028 and some newer extensions will have active patents until 2031.
Conclusion
Audio and video compression is a complex subject. We balance it here at a level sufficient to explain the fundamentals whilst avoiding a deep dive into the ‘Rabbit Hole’. Consult the MPEG-4 Part 3 standard if you need to explore MPEG Audio coding in greater detail.
The AAC audio standard is increasingly being used with High-Definition TV services (HDTV). This is supported by the DVB standards that are managed by ETSI. HE-AAC is particularly relevant for mobile TV using DVB-H.
Digital radio services such as DAB+ and Digital Radio Mondiale are also adopting HE-AAC.
Our recent articles have traced the history of MPEG audio coding from MPEG-1 Layer 1 up to the latest hybrid USAC codec. We are not quite finished with Audio yet. ST 2110-3x and AES are on the horizon.