Machine Learning (ML) For Broadcasters: Part 6 - ML In Production

Machine learning is touching just about every aspect of TV production, pre and post, in studios, large venues and remote sites. It is cutting costs of many routine operations while also opening new possibilities from archives to live.

Machine learning is touching just about all aspects of TV production now, extending across pre and post, from archive liberation to the creation of near-live highlights or summaries. It is still at an early stage in many areas, with plenty of scope for radical improvements in the viewing experience enabled within the production cycle, as well as for cost savings.

When broadcasters talk about AI, they often mean its subset machine learning (ML), which involves training models to converge on patterns in data relevant to particular tasks, such as recognition of faces. The system learns during training to perform given tasks as well as possible within the limitations of the data and the computational power available, and is sometimes able to improve further in actual operation. ML has become possible and applicable to various tasks in video production as a result of continuing rapid advances in processing power and memory, enabling effective training across massive data sets.

Other forms of AI, such as decision trees based on mathematical graph theory, also feature in some aspects of video production, notably processing of natural language to understand audio, but to a lesser extent than ML.

Major broadcasters, among others, have been conducting R&D on the application of machine learning for well over a decade now. Some of this work came to fruition around 2018 through projects that highlighted the potential of ML in various aspects but also identified limitations and where further innovations were needed.

This applied to the BBC’s prototype of an unattended production system called Ed, which in turn was a sequel to two automated systems, Primer and SOMA, designed to cover live events by cutting between virtual cameras. For Ed, high resolution cameras were placed manually to point at locations where action was to take place, such as a football pitch. The automation then kicked in to follow the action, switching between virtual cameras on the basis of decisions led by ML algorithms, with the resulting raw footage then cropped and framed.

The results were on the whole inferior to what human editors would obtain, but they led to clear rules that such a system should follow, derived in part from testing with humans along the same lines as Mean Opinion Scoring (MOS). Such rules, enforced by ML, included keeping edges of frames clear of objects and people, and especially avoiding cutting off the tops of heads.

This was all quite obvious and, as the BBC admitted, the rule-based approach, while initially necessary, was also a hindrance to further advances aimed at embracing the more subtle decisions involved. Among the key underlying processes leading towards more advanced automated production where ML is increasingly being applied is object recognition to analyze the character of a scene. This might include the type of landscape, whether people are included and what kind of clothing they wear. More subtle still perhaps is visual energy, where successive frames are analyzed to identify levels of action or dynamism as a prelude to creating compilations involving variations in pace.
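To make the idea concrete, below is a minimal sketch of one way visual energy can be estimated, assuming footage in a local file (the filename is illustrative). It scores each frame by its mean absolute pixel difference from the previous one using OpenCV; this is an illustrative baseline, not the specific metric used in any broadcaster's system.

```python
# A simple "visual energy" estimate: mean absolute difference between
# successive greyscale frames. Runs of high scores suggest action, low
# scores suggest calm passages a compilation tool could pace around.
import cv2
import numpy as np

def visual_energy(path: str) -> list[float]:
    cap = cv2.VideoCapture(path)
    scores, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Mean absolute pixel difference from the previous frame.
            scores.append(float(np.mean(cv2.absdiff(gray, prev))))
        prev = gray
    cap.release()
    return scores

if __name__ == "__main__":
    energy = visual_energy("match_footage.mp4")  # hypothetical clip
    if energy:
        print(f"{len(energy)} frames scored, peak energy {max(energy):.1f}")
```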

Some of these techniques were first applied commercially by the big tech companies such as Google and Amazon, which are also major content producers. Google has a system that derives still images from video content at near-professional quality, which can be used to generate thumbnails to mark content.

Amazon Web Services (AWS) has been at the forefront of applying AI to content moderation, which has become an integral part of content production, particularly for non-live User Generated Content (UGC), but increasingly also for live shows. AWS applies ML or other AI techniques during at least four stages of content moderation.

The first is Amazon Transcribe for initial speech-to-text conversion, after which the text is checked for specified profanities or signatures of hate speech. The latter is partly in the ear of the listener, varying with cultural distinctions, so the text is then translated into the languages of the regions where the content may be played, using the ML technology employed in Amazon Translate. Text analysis is then applied again using NLP capabilities in Amazon Comprehend. Finally, the results are integrated with Amazon A2I (Augmented AI) to facilitate human review for subsequent ML workflows. A2I is designed to enable human supervisors to oversee ML applications, here in video production, minimizing the effort and time involved.
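As an illustration of how the text-based stages of such a pipeline could be wired together, the sketch below uses boto3, the AWS SDK for Python, and assumes the audio has already been transcribed. The Transcribe and A2I stages are omitted for brevity, and the profanity list, region and example text are placeholders rather than anything AWS prescribes.

```python
# A minimal sketch of the text stages of a moderation pipeline with boto3.
# Requires AWS credentials configured in the environment.
import boto3

translate = boto3.client("translate", region_name="us-east-1")
comprehend = boto3.client("comprehend", region_name="us-east-1")

PROFANITY_LIST = {"exampleword1", "exampleword2"}  # placeholder terms

def moderate_transcript(text: str, target_language: str = "es") -> dict:
    # Stage 2: simple keyword screen on the source-language transcript.
    flagged = sorted(w for w in PROFANITY_LIST if w in text.lower())

    # Stage 3: translate so equivalent checks can be run per target market.
    translated = translate.translate_text(
        Text=text,
        SourceLanguageCode="en",
        TargetLanguageCode=target_language,
    )["TranslatedText"]

    # Stage 4: NLP analysis of the transcript with Amazon Comprehend.
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")

    return {
        "flagged_terms": flagged,
        "translated_text": translated,
        "sentiment": sentiment["Sentiment"],
    }

if __name__ == "__main__":
    print(moderate_transcript("An example line of transcribed dialogue."))
```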

Many ML applications in production so far have involved the first of three ML categories, supervised learning, which was the earliest form of the method to be widely applied and is essentially an extension of statistical regression. Training is an iterative process in which the model’s output is continually compared with a reference as a target. The difference between the output and the reference is determined for each iteration and used to update the internal parameters of the system with the aim of improving the match, until the two have converged.
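A toy example makes the loop concrete. The sketch below fits a straight line to labelled data with gradient descent; real production models are vastly larger, but the cycle of comparing output against a reference and nudging parameters to improve the match is the same.

```python
# Supervised learning in miniature: fit y = w*x + b to labelled data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y_ref = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=200)  # labelled reference

w, b = 0.0, 0.0           # internal parameters to be learned
learning_rate = 0.1

for step in range(2000):
    y_out = w * x + b                            # model output this iteration
    error = y_out - y_ref                        # difference from the reference
    loss = np.mean(error ** 2)                   # how far from convergence
    w -= learning_rate * 2 * np.mean(error * x)  # update parameters to
    b -= learning_rate * 2 * np.mean(error)      # improve the match
    if step % 500 == 0:
        print(f"step {step}: loss={loss:.5f}, w={w:.3f}, b={b:.3f}")
```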

The other two types of ML are unsupervised learning and reinforcement learning, with the best method depending on the application. Under unsupervised learning, the model works on its own, without human intervention, to seek patterns in data that might be unexpected or impossible to predict in advance. This can find valuable correlations or patterns that were entirely unexpected, but it can be too complex, inaccurate and time consuming for clearly defined tasks, such as recognition of biometrics like human faces or handwriting.
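The sketch below illustrates the unsupervised approach on invented data, assuming each shot of a programme has already been reduced to a small feature vector (here a hypothetical pair of brightness and motion values). K-means clustering then groups similar shots without any labels or human guidance.

```python
# Unsupervised learning in miniature: cluster shots by simple features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical per-shot features: [mean_brightness, motion_level]
shot_features = np.vstack([
    rng.normal([0.2, 0.1], 0.05, size=(30, 2)),   # e.g. static, dark shots
    rng.normal([0.7, 0.8], 0.05, size=(30, 2)),   # e.g. bright, high-action shots
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(shot_features)
for cluster_id in range(2):
    count = int(np.sum(kmeans.labels_ == cluster_id))
    centre = kmeans.cluster_centers_[cluster_id]
    print(f"cluster {cluster_id}: {count} shots, centre={centre}")
```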

Reinforcement learning is then a system based on reward and penalty, so could be called the stick and carrot approach, where the model is encouraged when it performs an action or makes a prediction that is closer to a desired goal and discouraged when it seems to go backward. This approach could be used to teach a robot to walk for example, avoiding the need to code directly for the required capabilities, which is very challenging.
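A minimal sketch of the reward-and-penalty loop is shown below: tabular Q-learning on a toy task in which an agent learns to walk towards a goal cell. The reward for reaching the goal and the small penalty per step are the carrot and stick; the walking behaviour itself is never coded directly.

```python
# Reinforcement learning in miniature: Q-learning on a 1D "walk to goal" task.
import numpy as np

N_STATES, GOAL = 6, 5          # cells 0..5, goal at cell 5
ACTIONS = [-1, +1]             # step left, step right
q_table = np.zeros((N_STATES, len(ACTIONS)))
rng = np.random.default_rng(2)
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0
    while state != GOAL:
        # Explore occasionally, otherwise act greedily on learned values.
        action = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q_table[state]))
        next_state = int(np.clip(state + ACTIONS[action], 0, GOAL))
        reward = 1.0 if next_state == GOAL else -0.01   # carrot and stick
        # Q-learning update: nudge the value toward reward + discounted future value.
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

print("Learned preference for stepping right in each cell:")
print(np.argmax(q_table, axis=1))   # all 1s (step right) once trained
```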

There are various applications in broadcast production where unsupervised learning has potential, or indeed is already being evaluated. One is nonlinear editing involving tasks that are hard to define clearly, such as selection of cutaways when editing a news package, or assessing multiple takes of a drama for qualities that are subjective, such as rapport or “chemistry” between actors, or comic timing.

It also has potential for final quality control in post-production, which in the past has been a labor intensive and therefore costly manual process. With unsupervised ML there is potential to automate many of these processes and also to extend their scope.

An area related to this is lighting adjustment, an essential component of TV and film production and post-production. In the studio, and for major sets such as festivals or big sports stadia, this is less of an issue, but it becomes one for smaller venues and fringe or educational events where professional lighting is not available. Several projects are now investigating unsupervised ML for re-lighting footage from events lacking dedicated equipment and personnel.

Such systems generate new versions of a scene that look as if they were shot that way originally, a kind of automated Photoshop for video.

There could also be scope for reinforcement learning where the results of a new production system or process have to be assessed by a human panel. The model could then be updated on the basis of average opinion scores, aiming to converge around the highest obtainable level.
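As a sketch of how panel scores could drive such learning, the example below treats the choice between hypothetical editing styles as a multi-armed bandit, updating a running estimate of each style's mean opinion score from simulated panel ratings and converging on the highest-scoring style. The styles and scores are invented for illustration.

```python
# Learning from panel opinion scores, sketched as an epsilon-greedy bandit.
import numpy as np

STYLES = ["fast_cut", "slow_cut", "mixed_pace"]
true_mos = {"fast_cut": 3.2, "slow_cut": 3.8, "mixed_pace": 4.3}  # hidden, simulated

rng = np.random.default_rng(3)
estimates = {s: 0.0 for s in STYLES}
counts = {s: 0 for s in STYLES}

for trial in range(300):
    # Mostly exploit the best estimate so far, sometimes explore another style.
    if rng.random() < 0.1:
        style = STYLES[rng.integers(len(STYLES))]
    else:
        style = max(estimates, key=estimates.get)
    # Simulated panel score (1-5) with some viewer-to-viewer variation.
    score = np.clip(rng.normal(true_mos[style], 0.5), 1.0, 5.0)
    counts[style] += 1
    estimates[style] += (score - estimates[style]) / counts[style]  # running mean

print({s: round(v, 2) for s, v in estimates.items()})  # converges near true_mos
```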

Of the many opportunities for ML in production, some are more mature than others, with many still at the research stage. It is almost certain that ML in its various forms will be widely deployed across the whole production process and will interact increasingly with other aspects of the workflow, including codecs, cameras and viewing devices, to improve quality, reduce costs and enable novel effects.
