Machine Learning (ML) For Broadcasters: Part 6 - ML In Production

Machine learning is touching just about every aspect of TV production, pre and post, in studios, large venues and remote sites. It is cutting costs of many routine operations while also opening new possibilities from archives to live.

Machine learning is touching just about all aspects of TV production now, extending across pre and post, from archive liberation to the creation of near-live highlights or summaries. It is still at an early stage in many areas, with plenty of scope for cost savings as well as radical improvements in the viewing experience enabled within the production cycle.

When broadcasters talk about AI, they often mean its subset machine learning (ML), in which models converge on patterns in data relating to particular tasks, such as recognition of faces. The system learns during training to perform given tasks as well as possible within the limitations of the data and the computational power available, and can sometimes improve further in actual operation. ML has become applicable to various tasks in video production as a result of continuing rapid advances in processing power and memory, which enable effective training across massive data sets.

Other forms of AI, such as decision trees based on mathematical graph theory, also feature in some aspects of video production, notably processing of natural language to understand audio, but to a lesser extent than ML.

Major broadcasters, among others, have been conducting R&D on the application of machine learning for well over a decade now. Some of this work came to fruition around 2018 through projects that highlighted the potential of ML in various aspects but also identified limitations and where further innovations were needed.

This applied to the BBC’s prototype of an unattended production system called Ed, which in turn was a sequel to two automated systems, Primer and SOMA, designed to cover live events by cutting between virtual cameras. For Ed, high resolution cameras were placed manually to point at locations where action was to take place, such as a football pitch. The automation then kicked in to follow the action, switching between virtual cameras on the basis of decisions led by ML algorithms to obtain optimal raw footage, which was then cropped and framed.

The results were on the whole inferior to what human editors would obtain, but they led to clear rules that such a system should follow, derived in part from testing with humans along the same lines as Mean Opinion Scoring (MOS). Such rules, enforced by ML, included keeping the edges of frames clear of objects and people, and in particular avoiding cutting off the tops of heads.
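Rules of this kind can be expressed very simply. The sketch below is a hypothetical illustration rather than anything from the Ed project: it checks a candidate virtual-camera crop against two such rules, using an assumed person bounding box and margin threshold.

```python
# A minimal sketch (not the BBC's actual Ed code) of how framing rules of this
# kind might be checked when choosing a virtual-camera crop. The rule thresholds
# and the detected person box are illustrative assumptions.

def violates_framing_rules(person_box, crop_box, edge_margin=0.05):
    """Return True if a detected person breaks basic framing rules inside a crop.

    person_box, crop_box: (left, top, right, bottom) in pixels.
    edge_margin: minimum clearance from the crop edge, as a fraction of crop size.
    """
    p_l, p_t, p_r, p_b = person_box
    c_l, c_t, c_r, c_b = crop_box
    crop_w = c_r - c_l

    # Rule 1: keep the left and right edges of the frame clear of people.
    if (p_l - c_l) < edge_margin * crop_w or (c_r - p_r) < edge_margin * crop_w:
        return True
    # Rule 2: never cut off the top of a head - the box must start below the crop top.
    if p_t <= c_t:
        return True
    return False


# Illustrative usage: a 1920x1080 crop with a person too close to the right edge.
print(violates_framing_rules((1700, 200, 1910, 900), (0, 0, 1920, 1080)))  # True
```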

This was all quite obvious and, as the BBC admitted, the rule-based approach, while initially necessary, was also a hindrance to further advances in scope to embrace the more subtle decisions involved. Among the key underlying processes leading towards more advanced automated production, where ML is increasingly being applied, is object recognition to analyze the character of a scene. This might include the type of landscape, whether people are present and what kind of clothing they are wearing. More subtle still perhaps is visual energy, where successive frames are analyzed to identify levels of action or dynamism as a prelude to creating compilations with variations in pace.
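One crude way to approximate visual energy is simply to measure how much the picture changes from frame to frame. The sketch below is an illustrative assumption rather than any particular broadcaster's method: it uses OpenCV to score each frame by its mean absolute difference from the previous one.

```python
# A minimal sketch of one way "visual energy" could be estimated: mean absolute
# pixel difference between successive frames. The metric and the OpenCV-based
# approach are illustrative assumptions, not a specific production system.
import cv2
import numpy as np

def visual_energy(video_path):
    """Yield a per-frame energy score: average absolute change from the previous frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        yield float(np.mean(cv2.absdiff(gray, prev_gray)))
        prev_gray = gray
    cap.release()

# High-energy segments (large frame-to-frame change) could then be favored when
# compiling fast-paced highlights, and low-energy ones for slower passages.
```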

Some of these techniques were first applied commercially by the big tech companies such as Google and Amazon, which are also major content producers. Google has a system that derives still images from video content at close to professional quality, which can be used to generate thumbnails to mark content.
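A much simpler stand-in for that idea, assuming nothing about Google's actual system, is to sample frames and keep the sharpest one as a thumbnail, scored here by the variance of the Laplacian.

```python
# A minimal sketch of automated thumbnail selection (an assumed approach, not
# Google's method): sample frames and keep the sharpest, judged by Laplacian variance.
import cv2

def pick_thumbnail(video_path, sample_every=30):
    cap = cv2.VideoCapture(video_path)
    best_frame, best_score, index = None, -1.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            score = cv2.Laplacian(gray, cv2.CV_64F).var()  # higher = sharper
            if score > best_score:
                best_frame, best_score = frame, score
        index += 1
    cap.release()
    return best_frame

# Illustrative usage, with an assumed file name:
# cv2.imwrite("thumbnail.jpg", pick_thumbnail("episode.mp4"))
```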

Amazon Web Services (AWS) has been at the forefront of applying AI to content moderation, which has become an integral part of content production, particularly for non-live User Generated Content (UGC) but increasingly also for live shows. AWS applies ML or other AI techniques during at least four stages of content moderation.

The first is Amazon Transcribe for initial speech-to-text conversion, after which the text is checked for specified profanities or signatures of hate speech. The latter is partly in the ear of the listener, with cultural distinctions, so the text is then translated into the languages of the regions where the content may be played, using the ML technology employed in Amazon Translate. Text analysis is then applied again using NLP capabilities in Amazon Comprehend. Finally, the results are integrated with Amazon A2I (Augmented AI) to facilitate human review for subsequent ML workflows. A2I is designed to enable human supervisors to oversee ML applications, here in video production, minimizing the effort and time involved.
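The sketch below shows how such a chain might be wired together with the AWS boto3 SDK. It is illustrative only: the job name, bucket URI and target language are assumptions, and the transcript fetch, profanity scan and A2I human review loop are reduced to comments.

```python
# A condensed, assumed wiring of the moderation chain described above, using boto3.
import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")
comprehend = boto3.client("comprehend")

# 1. Speech to text with Amazon Transcribe (asynchronous job; the result lands in S3).
transcribe.start_transcription_job(
    TranscriptionJobName="moderation-demo-job",            # assumed job name
    Media={"MediaFileUri": "s3://my-bucket/episode.mp4"},   # assumed source file
    MediaFormat="mp4",
    LanguageCode="en-US",
)

# ... once the job completes, fetch the transcript text from the job's output location
# and scan it against the profanity / hate-speech term lists.
transcript_text = "example transcript text"  # placeholder for the fetched transcript

# 2. Translate into each target market's language with Amazon Translate.
translated = translate.translate_text(
    Text=transcript_text, SourceLanguageCode="en", TargetLanguageCode="es"
)["TranslatedText"]

# 3. Analyze the translated text with Amazon Comprehend, e.g. sentiment and entities.
sentiment = comprehend.detect_sentiment(Text=translated, LanguageCode="es")
entities = comprehend.detect_entities(Text=translated, LanguageCode="es")

# 4. Anything flagged here would be routed to Amazon A2I (Augmented AI), which starts
#    a human review loop so supervisors can confirm or overrule the model's decision.
```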

Many ML applications in production so far have involved the first of the three ML categories to be widely applied, supervised learning, which is essentially an extension of statistical regression. Training is an iterative process in which the model’s output is continually compared with a reference target. The difference between the output and the reference is determined at each iteration and used to update the internal parameters of the system, with the aim of improving the match until the two have converged.
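That loop can be shown in a few lines. The sketch below fits a toy linear model by gradient descent; the data and learning rate are made up, but the pattern of comparing the output against a reference and nudging parameters until they converge is the one described.

```python
# A minimal sketch of a supervised training loop: a toy linear model fitted by
# gradient descent with NumPy. The data and learning rate are invented.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))                # inputs
y = 3.0 * x + 0.5 + rng.normal(0, 0.05, x.shape)    # reference targets (labels)

w, b = 0.0, 0.0                                     # internal parameters to be learned
lr = 0.5                                            # learning rate

for step in range(500):
    pred = w * x + b                                # model output
    error = pred - y                                # difference from the reference
    w -= lr * float(np.mean(error * x))             # update parameters to reduce the error
    b -= lr * float(np.mean(error))

print(f"learned w={w:.2f}, b={b:.2f}")              # should approach w=3.0, b=0.5
```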

The other two types of ML are unsupervised learning and reinforcement learning, with the best method depending on the application. In unsupervised learning the model works on its own, without human labelling, to seek patterns in data that might be unexpected or impossible to predict in advance. This can surface valuable correlations, but it can be too complex, inaccurate and time consuming for tasks that are clearly defined, such as recognition of biometrics like human faces or handwriting.
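A classic unsupervised technique is clustering. The sketch below groups shots by some simple, assumed color and brightness statistics with k-means, so similar-looking material ends up together without any labels being supplied.

```python
# A minimal sketch of unsupervised learning: clustering shots by simple color
# statistics with k-means. The features and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Pretend each row describes one shot: mean (R, G, B) plus luminance variance.
shot_features = np.array([
    [0.2, 0.5, 0.2, 0.01],   # green-dominated pitch shots
    [0.2, 0.5, 0.3, 0.02],
    [0.6, 0.6, 0.7, 0.20],   # bright, busy crowd shots
    [0.7, 0.6, 0.7, 0.25],
    [0.1, 0.1, 0.1, 0.02],   # dark interiors
    [0.1, 0.2, 0.1, 0.01],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shot_features)
print(kmeans.labels_)  # shots with similar character end up in the same cluster
```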

Reinforcement learning is then a system based on reward and penalty, so it could be called the carrot and stick approach: the model is encouraged when it performs an action or makes a prediction that moves closer to a desired goal and discouraged when it seems to go backward. This approach could be used to teach a robot to walk, for example, avoiding the need to code the required capabilities directly, which is very challenging.
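The reward-and-penalty mechanism can be demonstrated on a toy problem. The sketch below runs tabular Q-learning on a tiny invented corridor world, rewarding the agent for reaching the goal and penalizing every wasted step; the environment and parameters are purely illustrative.

```python
# A minimal sketch of reinforcement learning: tabular Q-learning on a made-up corridor.
import random

N_STATES, ACTIONS = 6, (0, 1)        # positions 0..5; actions: 0 = left, 1 = right
q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0
    while state != N_STATES - 1:                         # goal is the last position
        # Mostly pick the best-known action, occasionally explore at random.
        action = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda a: q[state][a])
        next_state = max(0, min(N_STATES - 1, state + (1 if action else -1)))
        reward = 1.0 if next_state == N_STATES - 1 else -0.01   # carrot vs stick
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state

# Learned policy: every position should prefer action 1 (move right towards the goal).
print([max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)])
```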

There are various applications in broadcast production where unsupervised learning has potential, or indeed is already being evaluated. One is nonlinear editing involving tasks that are hard to define clearly, such as selection of cutaways when editing a news package, or assessing multiple takes of a drama for qualities that are subjective, such as rapport or “chemistry” between actors, or comic timing.

It also has potential for final quality control in post-production, which in the past has been a labor-intensive and therefore costly manual process. With unsupervised ML there is scope for automating many of these checks and also extending what they cover.
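One way unsupervised QC might work is anomaly detection: frames whose statistics look unlike the rest of the material are flagged for human review. The sketch below does this with an Isolation Forest over some invented per-frame features; the features and contamination rate are assumptions.

```python
# A minimal sketch of unsupervised QC: flag frames whose simple statistics look
# anomalous (black frames, flashes, audio dropouts) with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

# Pretend each row is one frame: [mean luminance, luminance variance, audio level].
frames = np.vstack([
    np.random.default_rng(1).normal([0.5, 0.05, 0.3], 0.02, size=(200, 3)),  # normal material
    [[0.01, 0.001, 0.3]],   # black frame
    [[0.99, 0.001, 0.3]],   # white flash
    [[0.5, 0.05, 0.0]],     # audio dropout
])

model = IsolationForest(contamination=0.02, random_state=0).fit(frames)
flags = model.predict(frames)            # -1 marks suspected faults for human review
print(np.where(flags == -1)[0])
```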

An area related to this is lighting adjustment, an essential component of TV and film production and post-production. In the studio, and for major sets such as festivals or big sports stadia, this is less of an issue, but it matters for smaller venues and fringe or educational events where professional lighting is not available. Several projects are now investigating unsupervised ML for re-lighting footage from events that lacked dedicated equipment and personnel.

Such systems generate new versions of a scene that look as though they were shot that way originally, a kind of automated Photoshop for video.
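ML-based re-lighting learns scene-aware adjustments, but the basic intent can be made concrete with a deliberately crude stand-in: lifting shadows and rebalancing local contrast on a poorly lit frame with gamma correction and CLAHE in OpenCV.

```python
# A deliberately simple stand-in for ML-based re-lighting, for illustration only:
# brighten dark regions with a gamma lift, then enhance local contrast with CLAHE.
import cv2
import numpy as np

def relight(frame_bgr, gamma=1.8):
    # Gamma lift to open up the shadows.
    lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255 for i in range(256)], dtype=np.uint8)
    lifted = cv2.LUT(frame_bgr, lut)
    # Local contrast enhancement on the luminance channel only.
    lab = cv2.cvtColor(lifted, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

# Illustrative usage, with assumed file names:
# frame = cv2.imread("badly_lit_frame.png"); cv2.imwrite("relit.png", relight(frame))
```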

There could also be scope for reinforcement learning where the results of a new production system or process have to be assessed by a human panel. The model could then be updated on the basis of average opinion scores, aiming to converge on the highest obtainable level.

Of the many opportunities for ML in production, some are more mature than others, and many are still at the research stage. It is almost certain that ML in its various forms will be widely deployed across the whole production process and will interact increasingly with other aspects of the workflow, including codecs, cameras and viewing devices, to improve quality, reduce costs and enable novel effects.
