The Technology Of The Internet: Part 5 - Search & Discovery

As legacy linear broadcasting converges with internet streaming, the two worlds of search and discovery are also coming together. Machine Learning techniques are enhancing and accelerating relevant metadata generation, as well as enabling ever more sophisticated direct content search.

Some broadcasters are still lagging behind the larger streaming and social media platforms in search and recommendation, but others have successfully narrowed or closed the gap. For all video service providers, Machine Learning (ML) is now gaining a pivotal role, primarily at this stage for generating relevant metadata, but also increasingly for enabling direct content search by keyword, image, or context.

The primary motivations for broadcasters and video service providers are much the same as ever: to retain or gain subscribers by improving the experience through faster search and compelling recommendations, while also liberating content archives that had previously been locked away behind archaic indexing mechanisms. Broadcasters with large archives dating back decades, such as the UK’s BBC, have already enjoyed some success opening them to viewers by generating new metadata.

The other challenge only recently succumbing to technical advances lies in searching within videos to find specific clips or sequences. This could be on the basis of keywords, such as a particular actor or scene type, or of a presented image, such as the face of somebody known to feature in the footage.

It is also notable that video content analysis, or metadata generation, applies not purely to search and recommendation, but also to the creation of clips, especially in the case of live content. Comcast, owner of Sky and NBC Universal, which also distributes its legacy Xfinity cable channels in the USA and now online video under the Peacock brand, won a Technology & Engineering Emmy Award early in 2023 for applying machine learning to create sports clips and highlights rapidly after the event. This was achieved using the company’s VideoAI technology, which automatically generates contextual metadata on the basis of particular actions or events within the video.

VideoAI was launched as a commercial product available to other pay TV operators and broadcasters in January 2022 on a SaaS (Software as a Service) basis, having been used for content chaptering, for creating thumbnails to advertise a program, and for Dynamic Ad Insertion (DAI) through automatic identification of ad breaks.
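Comcast has not published VideoAI’s internals, but content chaptering and ad-break identification typically build on shot-boundary detection. The Python sketch below illustrates that generic building block using OpenCV histogram comparison; the threshold and histogram parameters are illustrative assumptions, not values from any production system.

```python
# Minimal sketch of shot-boundary detection, a common building block for
# automatic chaptering and ad-break identification. Illustrative only;
# not Comcast's proprietary VideoAI implementation.
import cv2  # OpenCV for video decoding and histogram computation

def detect_boundaries(path, threshold=0.5):
    """Return indices of frames where the colour histogram changes sharply."""
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means consecutive frames look alike; a
            # sharp drop suggests a cut, a candidate chapter or ad boundary.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```

A production system would combine such detected cuts with audio cues and learned event detectors before labelling a segment as an ad break or a highlight.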

Such ML-based technology can also be used for contextually targeted advertising, taking account of the content enveloping the ad as well as the user’s preferences and other possible factors such as time and date. Indeed, Comcast described contextual advertising as potentially the most interesting and lucrative application of its VideoAI technology.

User viewing preferences can also vary with mood, and that is something service providers have been grappling with for some time, generally with little success. ML provides a powerful tool for incorporating mood into content discovery and recommendation, and that has spawned a number of start-ups aiming to exploit the hype around this relatively new opportunity.

Many of these are quite speculative and even open to ridicule, with little proven success, but the underlying concept is becoming more mainstream and subject to research and development by major broadcasters. Again, the BBC deserves full marks for effort with its work on what it called “mood metadata”. Unlike some of the opportunistic start-ups, the BBC has built this on a solid foundation around metadata creation in general, based on a taxonomic hierarchy in which descriptions are built up from combinations of words describing elements of mood or context.
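The BBC has not published its taxonomy in full, but the combinatorial principle can be pictured with a few invented axes of mood and context, as in this sketch; the vocabulary is entirely hypothetical.

```python
# Simplified illustration of composing mood descriptors from elemental
# terms, in the spirit of the BBC's "mood metadata" work. The vocabulary
# below is invented for illustration, not the BBC's actual taxonomy.
from itertools import product

MOOD_AXES = {
    "energy": ["calm", "lively", "frenetic"],
    "valence": ["dark", "neutral", "uplifting"],
    "context": ["intimate", "epic"],
}

def composite_tags():
    """Yield composite descriptors such as 'calm/dark/intimate'."""
    for combo in product(*MOOD_AXES.values()):
        yield "/".join(combo)

# A programme can then carry one or more composite descriptors, giving a
# far richer search key than any single mood word.
print(next(composite_tags()))  # -> calm/dark/intimate
```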

The most interesting aspect of the BBC’s mood metadata creation, which goes beyond what many fellow broadcasters are doing, lies in the way the metadata tags are created in the first place, rather than in how they are then assembled into more complex descriptions. The BBC uses signal processing to generate raw data for subsequent application of ML algorithms to characterize moods.

This yields sequences of audio and video that are then aligned with moods during training. The ML models converge to identify given moods with given combinations of audio and video sequences, without any recognition of particular objects, words or higher-level features. The concept is the same as in ML models trained to diagnose medical conditions through analysis of blood samples, for example, in the absence of direction from a human consultant. It is an application of unsupervised, or semi-supervised, ML, in which a model converges around data patterns associated with specific outcomes, actions or decisions.
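As an illustration of that semi-supervised pattern, the sketch below clusters placeholder feature vectors and then names clusters from a small labelled sample. The feature dimensions, cluster count and mood labels are all invented, and scikit-learn’s KMeans merely stands in for whatever models the BBC actually uses.

```python
# Hedged sketch of semi-supervised mood characterization: cluster low-level
# audio/video features without labels, then name clusters from a few
# human-labelled sequences. Features here are random placeholders; real
# pipelines would derive them from signal processing of the content.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((500, 4))  # one row per sequence, 4 invented features

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(features)

# A handful of human-labelled sequences is enough to name each cluster.
labelled = {0: "tense", 3: "tense", 7: "joyful"}  # sequence index -> mood
cluster_moods = {}
for idx, mood in labelled.items():
    cluster_moods.setdefault(int(kmeans.labels_[idx]), mood)

# New content inherits the mood of its nearest cluster, if one is named.
new_sequence = rng.random((1, 4))
cluster = int(kmeans.predict(new_sequence)[0])
print(cluster_moods.get(cluster, "unlabelled"))
```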

There are some aspects of search and recommendation that have changed at both the technical and commercial level. One is associated with the growth in mobile viewing on smartphones, which, like mood, changes the content a particular user might want to watch. That may favor shorter-form content, and it also requires presentation of results in a way conducive to the small screen.

The trend towards streaming, and especially mobile viewing, is encouraging new genres of short films and documentaries that can fruitfully be recommended to users on the road but not at home. There is also growing scope for aggregating such shorter items, moving a little closer to the playlist model already popular for music, accompanied by some video.

Some broadcasters have already experimented with curated music playlists generated by applying ML to a user’s known previous selections or preferences. Spotify has gone one step further with an AI-based DJ unveiled as a beta version in February 2023. This creates a curated lineup of music interspersed with commentary about the tracks, singers, players and composers as appropriate, in a voice also generated by ML to match the content. Some broadcasters are attempting to apply this idea to short-form video.

Another trend is towards filtering video by quality or format as well as content. This has become more valuable in the streaming era because content providers have less control over, or knowledge of, the end viewing device, such that users themselves may want to specify the quality level. The major online search engines and viewing platforms now allow video search on the basis of keywords such as Live, 4K, HD, HDR and 360 degrees, as well as factors such as duration, date of creation, source, and whether there are subtitles. Broadcasters have to be more aware of these distinctions now that their content may be consumed via platforms beyond their control, and certainly on devices of varying capabilities.
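One way to picture such faceted filtering is as exact-match constraints over metadata fields, as in the sketch below; the field names and catalogue entries are invented for illustration.

```python
# Illustrative facet filter over video metadata, of the kind search engines
# expose through keywords like "4K" or "HDR". All names are hypothetical.
from dataclasses import dataclass

@dataclass
class VideoAsset:
    title: str
    resolution: str   # e.g. "HD", "4K"
    hdr: bool
    live: bool
    duration_s: int
    subtitles: bool

CATALOGUE = [
    VideoAsset("Match highlights", "4K", True, False, 420, True),
    VideoAsset("Evening news", "HD", False, True, 1800, True),
]

def search(assets, **facets):
    """Return assets whose fields match every requested facet exactly."""
    return [a for a in assets
            if all(getattr(a, key) == value for key, value in facets.items())]

print(search(CATALOGUE, resolution="4K", hdr=True))  # -> [Match highlights]
```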

The ability to search inside video to locate desired positions or sequences is also becoming established within search engines and EPGs. It is not yet possible, certainly with a commercial product, to use a video clip itself as a search key, because that would be too demanding in computation and bandwidth. But it is already possible to extend reverse image search capabilities to video. It is called reverse search because the image is used as the key rather than, as more often, being the result returned to the user.

Such capabilities are already supported by leading search engines such as Google’s, enabling an image file, taken say from a photo, to be uploaded into the engine, which then returns identical or very similar images found online. Applied to video, it would return content containing frames very similar to the uploaded image, which could then be used as the start point for playback. In principle that could identify clips containing a given object such as a mountain or person, or a scene matching one in a photo taken by the user, or a thumbnail.
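A minimal sketch of that workflow follows: sample frames are embedded, and the frame nearest the uploaded image becomes the playback start point. A crude colour-histogram embedding stands in for the learned image embeddings a production engine would use; the function names and sampling stride are illustrative.

```python
# Hedged sketch of reverse image search applied to video: embed sampled
# frames, then return the frame number closest to a query image. The
# histogram embedding is a stand-in for learned (deep) embeddings.
import cv2
import numpy as np

def embed(image_bgr):
    """Stand-in embedding: an L2-normalised colour histogram."""
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None,
                        [8, 8, 8], [0, 256] * 3).flatten()
    return hist / (np.linalg.norm(hist) + 1e-9)

def index_video(path, stride=30):
    """Embed every `stride`-th frame, returning (frame_number, vector) pairs."""
    cap, vectors, n = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if n % stride == 0:
            vectors.append((n, embed(frame)))
        n += 1
    cap.release()
    return vectors

def find_start_point(query_image_path, frame_vectors):
    """Return the frame number whose embedding best matches the query image."""
    query = embed(cv2.imread(query_image_path))
    return max(frame_vectors, key=lambda fv: float(fv[1] @ query))[0]
```

At scale, the linear scan over frames would be replaced by an approximate nearest-neighbour index over the frame embeddings.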

Over the next few years, it is likely such capabilities will be extended to the familiar voice-enabled engines for finding videos, while image-based search becomes fuzzier with the help of deeper ML. There will also be further progress in applying the same technology to near-instant clip creation from live content, especially sports, but also music events and news.
