Deep Learning Accelerates Object Tracking In TV Production

Advances in motion tracking for audiovisual production, both live and recorded, were slow until recently, when they were accelerated by modern AI techniques built on neural network based deep learning and mathematical graph theory. These advances have converged from multiple application domains, including robotics, medical imaging, sports science and video surveillance, as well as broadcasting.

Multiple Object Tracking (MOT) in audiovisual (AV) production has a long history but until recently a slow trajectory, with significant technical advances proving elusive. That has all changed with the advent of modern AI, in particular neural network based deep learning, allied to unprecedented underlying computational power.

The first recorded use of real time manipulation of live video in response to motion tracking came at the 1964 Tokyo Summer Olympics, not the pandemic-delayed 2020 games staged a year late in the same city, but it proved something of a false dawn. It sprang from advances in automated photo finish cameras, designed primarily to adjudicate close finishes in sprint events, which allowed almost instant integration of runners' times on screen.

While this was a significant technical achievement in 1964, it did not then herald more general advances in motion tracking, which required computational power many orders of magnitude beyond what was available at the time. It is only really in the last 10 to 15 years that continual increases in transistor count on integrated circuits, in line with Moore's Law of doubling every two years, have provided the foundation for the AI era.

This unleashed innovation in MOT with both R&D investment and application across multiple fields, such as medical imaging, robotics, drone vision, video surveillance, military, and sports analysis, as well as broadcast production. Much of the underlying R&D feeds all these areas, but broadcast production has specific requirements in terms of integration with existing and emerging production tools and workflow.

MOT has to couple with graphical playout systems, special effects and CGI, as well as forms of Extended Reality (XR), all of which are in turn potentially enhanced by motion tracking. There is also the audio dimension, with the promise of combining sound effects and enhanced audio with the movement of video objects within the visual field.

There are generic challenges mostly associated with the state of the MOT science, such as catering for blurring, occlusion of tracked objects by others on the screen, and fast movement with rapid changes in speed and direction. These challenges are being addressed through deep learning, which, once trained, can home in on target objects while treating others as background visual noise.
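As a simple illustration of the occlusion problem, the sketch below shows how a tracker can coast through frames where the detector loses its target, using a constant-velocity prediction until a detection reappears nearby. It is a minimal stand-in for the Kalman filters used in production trackers, and all thresholds and coordinates are illustrative.

```python
# Coast through short occlusions with a constant-velocity prediction.
import numpy as np

def track_with_gaps(detections, max_gap=10, gate=50.0):
    """detections: per-frame (x, y) tuples, or None where the target is occluded."""
    pos, vel, missed = None, np.zeros(2), 0
    track = []
    for det in detections:
        if det is not None and (pos is None or np.linalg.norm(np.asarray(det) - pos) < gate):
            new_pos = np.asarray(det, dtype=float)
            if pos is not None:
                vel = 0.7 * vel + 0.3 * (new_pos - pos)   # smoothed velocity estimate
            pos, missed = new_pos, 0
        elif pos is not None and missed < max_gap:
            pos = pos + vel                               # coast on the prediction
            missed += 1
        else:
            pos, vel, missed = None, np.zeros(2), 0       # give the track up as lost
        track.append(None if pos is None else (round(pos[0], 1), round(pos[1], 1)))
    return track

# Target drifts right at ~5 px/frame and is hidden for three frames mid-sequence.
dets = [(0, 0), (5, 1), (10, 2), None, None, None, (30, 6)]
print(track_with_gaps(dets))
```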

There are also challenges in exploiting these features effectively in production. To some extent these challenges revolve around blending the real and virtual worlds, increasing the effectiveness of XR and CGI, which can be combined with live footage in video montages, for example superimposing robots or UFOs onto a background that has been captured on camera.

It is not just objects or people in the field of view that move, but also the cameras themselves, whether in a studio or at an outside broadcast event. Indeed, there is more and more use of roving cameras to capture greater detail at sporting or other events.

There is also ever greater use of virtual studios for production of adverts, feature films, and series, as well as live broadcasts, where camera mobility is essential. This adds a further dimension to motion tracking, since the movement of the cameras has to be taken into account to maintain the correct studio geometry. That is necessary for accurate overlay of captions, statistics, or CGI effects.
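The underlying geometry can be made concrete with the standard pinhole model: given the camera's intrinsics K and its tracked pose (rotation R, translation t), a fixed 3D point in the studio projects to pixel coordinates x = K[R|t]X. The sketch below, with made-up camera values, shows how the overlay position shifts as the camera pans, which is exactly what camera tracking must compensate for.

```python
# Project a fixed studio point into pixels for two camera poses.
import numpy as np

def project(K, R, t, X):
    """Project world point X (3,) to pixel coordinates via x = K [R|t] X."""
    x_cam = R @ X + t                  # world -> camera coordinates
    u, v, w = K @ x_cam                # camera -> homogeneous pixel coordinates
    return np.array([u / w, v / w])

K = np.array([[1200.0, 0, 960],        # illustrative focal length, principal point
              [0, 1200.0, 540],
              [0, 0, 1]])
X = np.array([0.0, 1.5, 6.0])          # caption anchor fixed in the studio

# Same point, two poses: camera at rest, then a small pan to the right.
theta = np.deg2rad(5)
R_pan = np.array([[np.cos(theta), 0, np.sin(theta)],
                  [0, 1, 0],
                  [-np.sin(theta), 0, np.cos(theta)]])
print(project(K, np.eye(3), np.zeros(3), X))   # overlay position, frame 1
print(project(K, R_pan, np.zeros(3), X))       # overlay shifts as the camera pans
```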

Increasingly, motion tracking tools are incorporated into established video editing packages such as Apple’s Final Cut Pro, as well as postproduction systems like Adobe After Effects. At the same time, the latest motion tracking software integrates machine learning algorithms to automate the motion analysis as far as possible within the video editing process.

However, managing this process and getting the best results takes practice and has become a skill in its own right, an example of how AI can create new domains of expertise by automating processes that previously required manual intervention. Judgement is needed when tracking objects during video editing, so that they can be matched with artificially created elements or used to anchor inserted graphics and statistics.

The process typically begins by importing the footage into the motion tracking software, which may be part of a larger editing package. Tracking targets are then selected, which could range from almost point-like features such as eyes, to whole bodies or vehicles. The required animations, graphics, captions or data are then brought up and aligned with the movement, followed by fine tuning, which may be at least partially automated.
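A minimal sketch of that workflow, using OpenCV's built-in MIL tracker, is shown below. It assumes opencv-python 4.5.1 or later, where TrackerMIL_create lives in the main namespace; the file name, initial bounding box and caption are placeholders that an editor would normally set interactively.

```python
# Select an object in the first frame, track it, and pin a caption to it.
import cv2

cap = cv2.VideoCapture("footage.mp4")          # hypothetical input clip
ok, frame = cap.read()
assert ok, "could not read footage"

tracker = cv2.TrackerMIL_create()
tracker.init(frame, (300, 200, 80, 80))        # x, y, w, h of the chosen object

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, (x, y, w, h) = tracker.update(frame)
    if found:
        # Draw the caption just above the tracked object, frame by frame.
        cv2.putText(frame, "Player 7", (int(x), int(y) - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
        cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)),
                      (0, 255, 0), 2)
    cv2.imshow("tracked overlay", frame)
    if cv2.waitKey(1) == 27:                   # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```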

Some of the skills involved in optimizing motion tracking are quite basic, associated with selecting appropriate tracking points for a given application, which may concern camera movement rather than objects in the frame. Factors such as lighting also come into play. Over time all of this will be increasingly automated, but not just yet.

There are also advances continually being made that increase the scope and application of motion tracking. The 2024 Paris Summer Olympics showed how coverage of the events could benefit from innovations in imaging and tracking across multiple sports, including those with a ball, and also sports where analyzing the detailed movements of athletes is valuable.

Techniques on this front have evolved from various applications of imaging in sport, both for applying rules and for measuring aspects of performance. Examples range from the Hawk-Eye system used to determine whether tennis balls are in or out, to the contentious VAR (Video Assistant Referee) system adopted by some football leagues and knockout competitions.

Among evolutions is a system of cameras employed at the Olympics for diving competitions, both for enforcing safety and for helping with the judging, which has always had a subjective element. The image data is used to track the distance between the athlete and the diving board during the routine, known as the "safe gap" because getting any closer risks injury. Points are deducted if the diver is too close, with the ultimate potential sanction being disqualification.
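The safe gap computation itself reduces to simple geometry once the tracking is done. The sketch below shows assumed logic, not the actual Olympic system: take the per-frame minimum distance between tracked diver keypoints and the board edge, convert pixels to metres with a known calibration, and flag frames below a threshold. All values are illustrative.

```python
# Flag frames where the diver passes too close to the board edge.
import numpy as np

METRES_PER_PIXEL = 0.004      # illustrative calibration value
SAFE_GAP_M = 0.30             # illustrative threshold

board_edge = np.array([420.0, 310.0])          # board tip, pixel coordinates

def min_gap_metres(keypoints):
    """keypoints: (N, 2) array of diver joint positions in pixels."""
    d = np.linalg.norm(keypoints - board_edge, axis=1).min()
    return d * METRES_PER_PIXEL

frame_keypoints = np.array([[500.0, 200.0], [480.0, 290.0], [460.0, 350.0]])
gap = min_gap_metres(frame_keypoints)
print(f"gap = {gap:.2f} m", "TOO CLOSE" if gap < SAFE_GAP_M else "ok")
```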

The same camera system also produces a 3D reconstruction of the dive, which can assist judges in their assessment of each performance. There is then potential for the system to provide nominal scores on the basis of smoothness, splash, and complexity of routine.

In several other sports, camera systems represented an important technical advance by avoiding the need for athletes to wear tracking devices, such as chips attached to running shoes or body sensors inside uniforms. Instead, HD cameras around the field of action can now track each athlete, as well as the ball itself, with precision equal to or better than the previous sensor-driven approach.

In beach volleyball, the system was able to derive statistics such as maximum speed and distance covered by each player, as well as individual variations in jump heights and shot types, all of which enrich the experience for viewers at home, and potentially for spectators through large screen displays.
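Such statistics fall out of the tracked position series with little extra work. The sketch below, which assumes player positions already calibrated to court coordinates in metres and uses an illustrative frame rate, derives distance covered and peak speed.

```python
# Distance covered and peak speed from per-frame court coordinates.
import numpy as np

FPS = 50.0   # assumed capture frame rate

def player_stats(positions):
    """positions: (T, 2) array of court coordinates in metres, one row per frame."""
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)  # metres per frame
    return {"distance_m": steps.sum(), "max_speed_ms": steps.max() * FPS}

# Illustrative track: a short sprint across the court.
track = np.array([[0.0, 0.0], [0.1, 0.0], [0.25, 0.05], [0.45, 0.1]])
print(player_stats(track))
```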

Among other examples is the Pole Vault, where the system made it possible for the first time to measure accurately the distance by which each athlete had cleared the bar. Such information is useful for the athlete and of interest to the viewer, because it gives some idea of whether the vaulter will clear the bar when it is raised a notch as the competition progresses.

Much of this information can be incorporated into the video as it goes out live, applying machine tracking software where relevant.

For broadcasters, at least at this stage in the game, motion tracking might most easily be exploited effectively within virtual studios set up to allow use of common tools across footage from multiple events. This is particularly the case for an event such as the Summer Olympics, with its diversity of event types and forms.

Radio Television Serbia was one of the first national broadcasters to modify its workflow for the Olympic Games with the objective of incorporating real time graphics and overlays, whether or not these were derived from motion tracking and analysis. The broadcaster in effect transformed a relatively small studio into a massive virtual stadium, designed for graphical overlays on objects laid out in a standard format.

Producers were able to trigger real-time statistics, player information, graphics and in principle any dynamic element directly inside their accustomed newsroom environment, for display at the right time in the right place.

Motion tracking has benefited from the combination of contemporary methods in machine learning and established techniques for capturing spatial (intra-frame) and temporal (inter-frame) correlations in video. The starting point is the ability to identify objects as groups of pixels within a single frame, and then track them between frames in a suitably fuzzy way, allowing for changes in light intensity and viewing angle, which result in pixels changing in value, as well as some dropping out of the image and others coming in.
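The inter-frame step can be illustrated with the simplest association scheme: match detections in consecutive frames by the overlap (intersection-over-union) of their bounding boxes, which tolerates moderate appearance change because it compares positions rather than pixel values. The sketch below uses greedy matching; production trackers typically use the Hungarian algorithm for optimal assignment.

```python
# Carry object identities across frames by greedy IoU matching.
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(prev_boxes, curr_boxes, thresh=0.3):
    """Greedy one-to-one matching of previous to current detections."""
    pairs, used = [], set()
    for i, p in enumerate(prev_boxes):
        best, best_j = thresh, None
        for j, c in enumerate(curr_boxes):
            if j in used:
                continue
            score = iou(p, c)
            if score > best:
                best, best_j = score, j
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    return pairs

prev = [(10, 10, 50, 50), (100, 100, 140, 140)]
curr = [(104, 98, 144, 138), (14, 12, 54, 52)]
print(match(prev, curr))   # -> [(0, 1), (1, 0)]: identities carried across
```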

Beyond frame-to-frame matching, a technique now employed in machine learning called the Graph Neural Network (GNN) often comes into play. This derives from mathematical graph theory, developed long before contemporary machine learning, which depicts objects as nodes and the relationships between them as edges joining those nodes. In a simple graph there is at most one edge between any pair of nodes, and the structure is abstract: how the graph happens to be drawn on the page is irrelevant. Relationships can be mutual, represented by undirected edges, or directional, pointing in just one direction like the flow of water.

These GNNs have proved effective in many complex domains that involve relationships and interactions between individual components or objects, including pattern recognition, medical imaging, and also TV recommendation systems.

In motion analysis and computer vision, graphs can be built with pixels as nodes, with edges connecting spatially adjacent pixels to reflect the local structure of the image. This allows graphs to be built up around objects in a hierarchical way, so that a model of a person could be broken down into a face, then eyes, then irises, and in theory even eyelashes, conferring flexibility and granularity.
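The core GNN operation, message passing, can be sketched in a few lines: each node updates its feature vector from an aggregate of its neighbours' features, transformed by learned weights. The toy example below wires a 2x2 pixel patch as a grid graph, with edges between adjacent pixels only; the feature and weight values are illustrative rather than learned.

```python
# One message-passing step on a tiny grid graph.
import numpy as np

# Adjacency for a 2x2 pixel grid: edges 0-1, 0-2, 1-3, 2-3 (no diagonals).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

H = np.array([[1.0, 0.0],    # per-node features, e.g. colour or edge responses
              [0.0, 1.0],
              [0.5, 0.5],
              [1.0, 1.0]])

W = np.array([[0.8, 0.2],    # weights that would normally be learned
              [0.1, 0.9]])

def message_pass(A, H, W):
    deg = A.sum(axis=1, keepdims=True)
    neighbour_mean = (A @ H) / deg       # aggregate messages from neighbours
    return np.tanh(neighbour_mean @ W)   # transform and apply a nonlinearity

print(message_pass(A, H, W))             # updated node embeddings
```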

Although the main focus has been on video objects, there is also the audio dimension, though this is more intimately bound up with output. Object based sound has already been employed to direct the audio output. In a cinema, for example, a helicopter flying across the screen can be associated with the specific sounds of its engine, separated from the rest of the background noise.

Processing this within spatial surround systems such as Dolby Atmos can be enhanced by AI techniques. One challenge is that the configuration of the sound system in the home is not known, so that in practice the benefits are lost on many viewers.

This can be addressed to some extent by audio beamforming, where sound is directed from a speaker system so that it appears to come from a particular point, which can move continuously. The advantage is that object-based sound can then be conveyed without requiring the speakers to be in the right place, provided they support beamforming.
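The principle behind steering is simple to sketch: for a linear array, each element is delayed so that the emitted wavefronts align in the chosen direction, and the steering angle can be updated continuously to follow a tracked object. The element spacing and angle below are illustrative.

```python
# Per-element delays to steer a linear speaker array.
import numpy as np

C = 343.0                       # speed of sound, m/s

def steering_delays(n_elements, spacing_m, angle_deg):
    """Per-element delays (seconds) to steer a linear array to angle_deg off axis."""
    n = np.arange(n_elements)
    return n * spacing_m * np.sin(np.deg2rad(angle_deg)) / C

# An 8-element soundbar steering 25 degrees off axis; the angle can be
# updated every frame to follow a tracked on-screen object.
print(steering_delays(8, 0.05, 25.0) * 1000, "ms")
```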

Beamforming of this kind comes under the heading of Audio AI, a research field in its own right with applications, like motion tracking, across many fields. In AV transmission it can be applied to lip synchronization of audio and video, which can be an issue not just in streaming but also in interactive applications such as conferencing.
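One simple way to estimate an audio/video offset, sketched below as a stand-in for the learned methods used in practice, is to cross-correlate the audio energy envelope with a per-frame mouth-opening signal extracted by the video tracker, and take the lag with the highest correlation. The signals here are synthetic.

```python
# Estimate audio/video offset by cross-correlating two activity signals.
import numpy as np

def estimate_offset(audio_env, mouth_open, fps=25.0):
    """Both inputs sampled at the video frame rate; returns offset in seconds."""
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-9)
    m = (mouth_open - mouth_open.mean()) / (mouth_open.std() + 1e-9)
    corr = np.correlate(a, m, mode="full")
    lag = corr.argmax() - (len(m) - 1)       # positive lag: audio is late
    return lag / fps

# Synthetic test: the audio envelope is the mouth signal delayed by 3 frames.
mouth = np.sin(np.linspace(0, 6 * np.pi, 100)) ** 2
audio = np.roll(mouth, 3)
print(f"estimated offset: {estimate_offset(audio, mouth):.2f} s")  # ~0.12 s
```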

Integration between motion tracking and audio analysis has potential not just for generation and insertion of graphics or other dynamic elements on the fly, but also for more incisive and accurate metadata creation for archiving. The whole field is really in its infancy, even more so than AI itself. 
