Audio For Broadcast: Synchronization

There is nothing worse than out-of-sync audio. We examine timing and synchronization in IP, baseband and hybrid systems from the perspective of audio… with a little history lesson in synchronization formats along the way.

Radio has been part of our lives for more than 125 years and gives us almost everything that television does; it tells stories, reports live news, covers sporting events and hosts discussions. It makes us laugh, it entertains and educates us, and it builds communities. And it does all of this purely with audio and without the need for pictures.

Television adds pictures to bring us what most see as a more immersive experience. However, it needs both audio and video to function; that’s kind of its thing. But it doesn’t just need both to function; it needs both to function together, time-aligned and synchronous, working in harmony with each other so that people watching buy into the illusion of being onlookers to a distant world.

If they don’t, the illusion is quickly broken.

Every bit of equipment in television production operates to the rhythm of the same heartbeat. Everything has to be in sync for television to work, and although broadcasters have adapted the ways they manage this over time, it has been this way since the beginning. Until now.

But we’ll get to that.

Audio + Video

The problem lies with those pesky pictures. The end goal is to deliver perfectly synced audio and video content to the viewer, but the entire production process is set up to fail. As audio and video signals are processed, delays are added to the production chain, and those delays are never the same. Every single component has the potential to create synchronization errors.

Audio is seldom the culprit; it tends to be extremely low latency, and adjustments like compression and EQ can be processed in real-time. Video processing, on the other hand, can take significantly longer because there is far more data to process, so delay has to be added to the audio chain to keep everything lip-synced.

Luckily, our brains are able to deal with some sync issues; in fact, we are considerably more forgiving when sound lags behind the visuals because we have always been used to seeing things before we hear them.

Even so, there are handy international guidelines to advise broadcasters on how much lag is acceptable. ITU-R BT.1359-1 from the International Telecommunication Union (ITU) dates back to 1998 and advises that the public’s detection of AV sync errors lies between 45ms of audio leading video, and 125ms of audio lagging behind video (note the higher tolerance for audio lagging behind the video; thanks, science!).
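As a minimal sketch of how those two thresholds might be applied, the function below flags a measured offset as detectable or not. The function name and the sign convention (positive means audio leads video) are assumptions made for illustration; only the 45ms and 125ms figures come from the guideline quoted above.

```python
# Detectability thresholds quoted from ITU-R BT.1359-1, in milliseconds.
AUDIO_LEAD_LIMIT_MS = 45.0
AUDIO_LAG_LIMIT_MS = 125.0


def sync_error_detectable(audio_offset_ms: float) -> bool:
    """Return True if viewers are likely to notice the AV sync error.

    Assumed sign convention: a positive offset means audio leads video,
    a negative offset means audio lags behind video.
    """
    if audio_offset_ms >= 0:
        return audio_offset_ms > AUDIO_LEAD_LIMIT_MS
    return -audio_offset_ms > AUDIO_LAG_LIMIT_MS


if __name__ == "__main__":
    for offset in (60.0, -60.0, -130.0):
        print(offset, sync_error_detectable(offset))
```

Note how the same 60ms error is flagged when the audio is early but tolerated when it is late.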

But the fact that it can drift out of sync doesn’t mean it should, and there are other reasons why the AV should be synced throughout the process.

How It Is Done

In produced programming, like sitcoms or continuous drama, AV can easily be synced up in post-production; while syncing all the cameras is still important, all you really need is a clapperboard and an eyeball. But live broadcast environments are less forgiving.

Timecode is used to synchronize audio with video over time, and it uses a binary code to identify video frames in a refreshingly sensible format; hours:minutes:seconds:frames. Timecode refers to the SMPTE/EBU time and control code and conforms to video frame rates used by different broadcasters across the world, such as 25fps for PAL and 29.97fps for NTSC.
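To make the hours:minutes:seconds:frames format concrete, here is a small sketch that turns a running frame count into a timecode string at 25fps. The function name is hypothetical, and the sketch only covers whole-number frame rates; 29.97fps material normally uses drop-frame timecode, which is deliberately left out here.

```python
def frames_to_timecode(frame_count: int, fps: int = 25) -> str:
    """Convert a frame count to non-drop-frame HH:MM:SS:FF timecode.

    Assumes an integer frame rate such as 25fps; drop-frame counting
    for 29.97fps is not handled in this sketch.
    """
    frames = frame_count % fps
    total_seconds = frame_count // fps
    seconds = total_seconds % 60
    minutes = (total_seconds // 60) % 60
    hours = (total_seconds // 3600) % 24  # timecode wraps at 24 hours
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


if __name__ == "__main__":
    # One hour of 25fps video is 90,000 frames; add 12 more frames.
    print(frames_to_timecode(90_012))
```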

So far, so good. It works by embedding a timestamp in both the audio and video signals to provide a reference point for synchronization, but where it can cause problems is on long, real-time, live broadcasts like sport and entertainment events. The issue is that even if every piece of equipment on set is time-aligned at the start of a shoot, things can drift over time. This is because different equipment has internal clocks with ever-so-slightly different levels of accuracy, and although they all start at the same point the cumulative effect over time is for them to drift further apart. Over longer periods these differences become more and more noticeable. Damn you, time!
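A rough back-of-envelope calculation shows why this matters. The sketch below assumes two free-running clocks whose frequencies differ by a fixed number of parts per million (ppm); the 10ppm figure and the three-hour duration are illustrative values, not measurements from any real device.

```python
def drift_ms(ppm_difference: float, elapsed_seconds: float) -> float:
    """Accumulated timing error between two free-running clocks, in ms."""
    return ppm_difference * 1e-6 * elapsed_seconds * 1000.0


def drift_in_frames(ppm_difference: float, elapsed_seconds: float, fps: float) -> float:
    """The same drift expressed in video frames at the given frame rate."""
    frame_duration_ms = 1000.0 / fps
    return drift_ms(ppm_difference, elapsed_seconds) / frame_duration_ms


if __name__ == "__main__":
    three_hours = 3 * 60 * 60
    # Hypothetical 10ppm mismatch over a three-hour live event at 25fps.
    print(round(drift_ms(10, three_hours), 3), "ms")
    print(round(drift_in_frames(10, three_hours, 25), 3), "frames")
```

Under these assumptions the two devices end up more than two and a half frames apart by the end of the event, even though they started perfectly aligned.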

For video, generator locking (genlock for short) enables cameras and recording devices to synchronize with each other to an external clock source, which enables the director on a real-time live broadcast to switch between these different cameras without any sync issues.

For audio, wordclock provides the same function.

Wordclock

Wordclock works by sending electrical pulses when a sample occurs during A/D and D/A conversion (or, when it clocks each audio sample, hence the name). In the same way that there are standard frame rates for video, broadcast audio uses standard sample rates, which we covered right at the start of this series; broadcast digital audio tends to operate at 48kHz, and HD audio is commonly assumed to be 96kHz (but at any rate must be higher than 44.1kHz).
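The relationship between sample rates and frame rates can be worked out with simple arithmetic. The sketch below, with hypothetical function names, computes how long each wordclock period lasts and how many audio samples fall within one video frame. It uses exact fractions because 29.97fps is really 30000/1001 frames per second.

```python
from fractions import Fraction


def sample_period_us(sample_rate_hz: int) -> float:
    """Time between wordclock pulses, in microseconds."""
    return 1_000_000 / sample_rate_hz


def samples_per_frame(sample_rate_hz: int, frame_rate: Fraction) -> Fraction:
    """Exact number of audio samples that span one video frame."""
    return Fraction(sample_rate_hz) / frame_rate


if __name__ == "__main__":
    print(round(sample_period_us(48_000), 3), "microseconds per sample")
    print(samples_per_frame(48_000, Fraction(25)), "samples per 25fps frame")
    print(samples_per_frame(48_000, Fraction(30_000, 1001)), "samples per 29.97fps frame")
```

At 25fps each frame holds a whole number of 48kHz samples. At 29.97fps it does not, and the samples only line up with frame boundaries every five frames.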

Formats like MADI and AES3, which we looked at in the previous section, both use wordclock to synchronize audio devices in a digital audio network.

In a broadcast infrastructure there can be multiple connected devices which generate these signals, but it’s more likely that there is one central sync device that will generate the timecode, genlock and wordclock together.

Sync is achieved when every device shares the same clock source. In other words, when every clock follows the same master clock generating timecode for the initial sync, and both genlock and wordclock to maintain timing for video and audio throughout the broadcast.

Let’s Get Messy

Thanks to external clocking systems, audio and video are always in sync within the production environment, and since the advent of digital formats they have stayed that way by embedding audio into the video signal.

And it works a treat; well-established technologies like SDI can embed up to 16 channels of audio, as well as metadata, into the video signal on a single coax cable. It is simple, it is clean and it is efficient.

IP changes all of that because it is fundamentally asynchronous. In an SDI or AES network, the distributed baseband synchronization signals are on an independent infrastructure and this kind of sync has maintained backwards compatibility all the way back to the 1930s. But IP clocking is completely different. An IP environment uses a protocol called PTP and everything is timed relative to its timestamps.

PTPv2

PTP stands for Precision Time Protocol and in a SMPTE ST2110 network it actually refers to PTPv2, a network protocol developed by the IEEE. PTP removes any requirement for an independent sync infrastructure. In fact, both the AES67 and ST2110 standards require PTP to be used.

A single PTP grand master clock (GMC) provides the timing for every single device on an IP network, and a GMC can be either a dedicated device or any device on the network. All connected devices adhere to Leader/Follower relationships with the GMC, which can all be configured via a standard control User Interface (UI).

Although fundamentally different, video and audio are both streams that can be packetized and transported as data across an IP network, and because all IP equipment uses the same standard transport protocol (ST2110), they can all connect to one another irrespective of manufacturer.

Meanwhile, more comprehensive interoperability between equipment is provided by integrating NMOS IS-04 and IS-05 for discovery and connection management of IP devices. This means that the management of sync across IP networks can be performed across a variety of manufacturers’ UIs.

PTP Messaging

PTP setup is worthy of an entire article if not an entire book, but in essence PTP works by multicasting data packets between clocks on a network. Multicast works by sending packets of data to multiple networked devices at the same time; they are transmitted once but they are replicated as and when necessary by the network.

Timing data is passed between Leader and Follower devices through a series of messages that calculate the offset from the GMC and the delay on the link. Follower devices use the timestamps in these messages to derive their own time, which enables them to take things like path delay into account. They do this in real time and are continuously updated as new data is received from the Leader.
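The arithmetic behind that exchange is compact. The sketch below assumes PTP's delay request-response mechanism, in which the Leader timestamps a Sync message (t1), the Follower timestamps its arrival (t2), the Follower timestamps a Delay_Req message it sends back (t3) and the Leader timestamps that arrival (t4). It also assumes the path delay is the same in both directions. The function name and the example numbers are illustrative only.

```python
def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int) -> tuple[float, float]:
    """Return (offset_from_leader, mean_path_delay) in the timestamp units.

    t1: Leader clock when Sync is sent
    t2: Follower clock when Sync is received
    t3: Follower clock when Delay_Req is sent
    t4: Leader clock when Delay_Req is received
    Assumes a symmetric network path between Leader and Follower.
    """
    leader_to_follower = t2 - t1  # path delay plus clock offset
    follower_to_leader = t4 - t3  # path delay minus clock offset
    mean_path_delay = (leader_to_follower + follower_to_leader) / 2
    offset_from_leader = (leader_to_follower - follower_to_leader) / 2
    return offset_from_leader, mean_path_delay


if __name__ == "__main__":
    # Hypothetical exchange in nanoseconds: the Follower clock runs
    # 500ns ahead of the Leader and the one-way path delay is 1000ns.
    offset, delay = ptp_offset_and_delay(t1=0, t2=1500, t3=2000, t4=2500)
    print("offset:", offset, "ns, path delay:", delay, "ns")
```

The Follower subtracts the calculated offset from its own clock, and repeats the calculation every time a new set of timestamps arrives.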

PTP is incredibly accurate, but there are multiple ways to set up and manage these networks, with different kinds of GMC, message intervals, PTP Port States and something called the Best Master Clock Algorithm (known as the BMCA), which compares how every potential clock on a network advertises itself and decides which one takes on the role of GMC.
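The heart of the Best Master Clock Algorithm is a field-by-field comparison of the attributes each clock advertises, where the lower value wins at each step. The sketch below is a heavily simplified illustration of that ordering (priority1, clock class, clock accuracy, variance, priority2, then clock identity as the tie-breaker). It ignores network topology and port state decisions entirely, and the class name, device names and attribute values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockAnnouncement:
    """Attributes a clock advertises about itself (simplified)."""
    name: str
    priority1: int       # operator-set, compared first
    clock_class: int     # traceability of the time source
    clock_accuracy: int
    variance: int        # clock stability
    priority2: int       # operator-set tie-breaker
    clock_identity: int  # unique ID, final tie-breaker

    def rank(self) -> tuple:
        # Tuples compare element by element, mirroring the ordered comparison.
        return (self.priority1, self.clock_class, self.clock_accuracy,
                self.variance, self.priority2, self.clock_identity)


def best_clock(candidates: list[ClockAnnouncement]) -> ClockAnnouncement:
    """Pick the candidate with the lowest rank tuple."""
    return min(candidates, key=ClockAnnouncement.rank)


if __name__ == "__main__":
    clocks = [
        ClockAnnouncement("main-gmc", 128, 6, 0x21, 0x4E5D, 128, 0x001),
        ClockAnnouncement("audio-console", 128, 248, 0xFE, 0xFFFF, 128, 0x002),
        ClockAnnouncement("backup-gmc", 128, 6, 0x21, 0x4E5D, 129, 0x003),
    ]
    print(best_clock(clocks).name)
```

Because priority1 is compared first, a carelessly configured device with a low priority1 value can win the election over a far better clock, which is one reason these settings deserve care.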

All of these settings have an effect on scalability and Quality of Service and should not be entered into lightly.

The Abstraction Of Timing

All this fundamentally changes how we approach sync.

With ST2110, audio and video are processed independently of any external timing processes, and the genius of ST2110 is the ability to divorce the timing of audio and video from each other.

This asynchronous processing approach is something that broadcasters have never had the opportunity to do before, and although everything has to be resynchronized relative to those timestamps, it promotes the ability to do new things in new ways, such as cloud production.
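Resynchronizing relative to timestamps can be pictured with a simple playout buffer. The sketch below is a generic illustration, not the ST2110 mechanism itself: it assumes every piece of audio and video carries the time it was captured, and that the receiver plays everything out a fixed delay after capture. The class, field names and numbers are hypothetical, with times in microseconds.

```python
from dataclasses import dataclass


@dataclass
class MediaUnit:
    kind: str        # "audio" or "video"
    capture_us: int  # timestamp applied at the source
    arrival_us: int  # receiver clock time when the data arrived


def hold_time_us(unit: MediaUnit, playout_delay_us: int) -> int:
    """How long the receiver must buffer this unit before playing it."""
    playout_us = unit.capture_us + playout_delay_us
    hold = playout_us - unit.arrival_us
    if hold < 0:
        raise ValueError("arrived too late for the chosen playout delay")
    return hold


if __name__ == "__main__":
    # Captured at the same instant, but video takes much longer to arrive.
    video = MediaUnit("video", capture_us=10_000_000, arrival_us=10_120_000)
    audio = MediaUnit("audio", capture_us=10_000_000, arrival_us=10_005_000)
    for unit in (video, audio):
        print(unit.kind, "held for", hold_time_us(unit, 200_000), "microseconds")
```

The audio is simply held for longer than the video, so both are presented at the same instant no matter how different their journeys were.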

We are used to television being a synchronous system, with video at prescribed frame rates and audio at 48 or 96kHz. The side effect of making these things asynchronous is that it does away with all of this: it frees up storage, transport and processing from having to worry about timing.

It just needs to be correct when it goes to the viewer at the end. 
