Big Data Retrieval

As the volume of file-based media grows, the requirement for metadata advances significantly. Simultaneously, the number of available metadata sources expands as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is growing exponentially, and no one person can access all of the available data in a coherent way.

Traditionally, metadata consisted of a few scrawled notes on a piece of camera tape stuck to the side of a video tape box, or written on a piece of paper. Producers used to visit the tape librarian to find footage for their edit, and a stack of tapes would be given to the edit assistant, who would dutifully load them onto a trolley and wheel them to the edit suite, a process that could take many hours or even days.

Today, browsing online, a producer can find library media, provide timecode for edit points and make the material available to an editor in a matter of minutes, sometimes even editing the programme themselves. The power of these search facilities lies in the stored tags and descriptions derived from the metadata elements of the media file.

Workflows rarely respect metadata embedded in media files. For example, camera information in a recorded file, such as f-stop, lens focal length, time and GPS location, can easily be stripped out during ingest into non-linear editing software. It’s not uncommon for a separate XML file to be created containing further metadata, which then has to be kept in sync with the video and audio essence. Even embedded MXF doesn’t solve the problem: a file’s metadata can change during an edit or transcode, and valuable original data can be lost or modified.
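One mitigation is to harvest the camera metadata into a plain, application-neutral record before the media enters the edit chain. The sketch below illustrates the idea with an invented sidecar XML layout; the element and attribute names are assumptions for illustration, not a real camera schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sidecar XML of the kind a camera might write alongside its
# video essence; the element names here are illustrative, not a real schema.
SIDECAR = """
<clip name="A003C012">
  <exposure fstop="5.6"/>
  <lens focalLengthMm="35"/>
  <timestamp>2023-05-01T08:00:00Z</timestamp>
  <gps lat="40.7128" lon="-74.0060"/>
</clip>
"""

def harvest_camera_metadata(xml_text: str) -> dict:
    """Copy the camera fields we care about into a plain record, so the
    values survive even if a later ingest or transcode strips them."""
    root = ET.fromstring(xml_text)
    return {
        "clip": root.get("name"),
        "f_stop": float(root.find("exposure").get("fstop")),
        "focal_length_mm": float(root.find("lens").get("focalLengthMm")),
        "timestamp": root.find("timestamp").text,
        "gps": {
            "lat": float(root.find("gps").get("lat")),
            "lon": float(root.find("gps").get("lon")),
        },
    }

record = harvest_camera_metadata(SIDECAR)
print(json.dumps(record, indent=2))
```

Because the harvested record is decoupled from the essence, it no longer matters if downstream tools rewrite or discard the embedded values.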

Retrieving the metadata and performing meaningful searches has its own challenges. Searching the media files alone can be time-consuming and cumbersome. Should you search the rushes or the edited master? The transmission transcode or the library archive? Data storage is only as effective as data retrieval: searching takes time and requires local knowledge of the system.

Historic data may be stored in databases that are no longer supported and would require a highly competent programmer to extract any meaningful information, assuming the operating system needed to retrieve the data is still available in the first place. Metadata could also be spread over many databases, from the media asset management system to the transmission logs, creating further problems with searching.

A media asset’s value can only be truly monetized if clients can find the material they’re looking for. As well as robust storage systems, the tag and search information has to exist in the first place. Having a human sit and watch every piece of media to transcribe the dialogue is unrealistic and simply would not work.

As we have seen, the very act of ingesting, editing or transcoding a file can change its contents by stripping out and creating new metadata. An automated way of extracting and recording metadata has to be provided at each workflow node in the chain, removing the risk of losing information during processing. The system has to be scalable beyond our usual understanding of scalability; that is, we have to be able to parse file formats we haven’t yet designed.
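That kind of open-ended scalability is usually achieved with a plug-in architecture: the core system knows nothing about specific containers, and support for a format designed next year is added by registering a new extractor. The sketch below shows a minimal version of that pattern; the extractor bodies are placeholders, not real MXF or QuickTime parsers.

```python
# A minimal sketch of a modular parser registry: each node in the workflow
# calls extract(), and formats designed later are supported by registering
# new extractors without changing the core system.
from typing import Callable, Dict

EXTRACTORS: Dict[str, Callable[[bytes], dict]] = {}

def register(extension: str):
    """Decorator that registers a metadata extractor for a file extension."""
    def wrap(fn: Callable[[bytes], dict]):
        EXTRACTORS[extension.lower()] = fn
        return fn
    return wrap

@register(".mxf")
def extract_mxf(payload: bytes) -> dict:
    # Placeholder: a real extractor would walk the MXF KLV structure.
    return {"container": "MXF", "size_bytes": len(payload)}

@register(".mov")
def extract_mov(payload: bytes) -> dict:
    # Placeholder: a real extractor would walk the QuickTime atom tree.
    return {"container": "QuickTime", "size_bytes": len(payload)}

def extract(filename: str, payload: bytes) -> dict:
    """Dispatch to whichever extractor handles this file's extension."""
    ext = filename[filename.rfind("."):].lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"No extractor registered for {ext}")
    return EXTRACTORS[ext](payload)

print(extract("rushes_0001.mxf", b"\x06\x0e\x2b\x34"))
```

The registry itself never changes; only the list of registered extractors grows, which is what makes the approach expandable to formats that don’t exist yet.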

Metadata storage and retrieval is much more involved than simply providing access to XML files or embedded MXF data. Historic practice has seen users creating metadata-rich filenames in an attempt to overcome retrieval problems, an unsustainable approach as filenames are limited in length and cannot be easily parsed. The more information we can provide about a media asset, the more useful it is to us, and the more people will be willing to buy it.

Companies such as GrayMeta have taken the initiative with big data farming and retrieval. Their software works across many systems, public cloud, private cloud and on-premises, and is completely expandable through its modularity, allowing it to deal with future formats we haven’t yet designed.

Using big data systems, producers can now search archive media and rushes with unprecedented granularity to quickly find obscure and interesting shots. For instance, consider a scenario where a producer wants a shot of a cloudy day over New York at 8am. The camera would have recorded the time and GPS position of the shot, a publicly available weather service would be able to provide historical data about the conditions at that time, and the big data retrieval engine would be able to join and match all of this information and provide a link to the media. Fundamentally, this requires the camera’s data still to be available. As the metadata would have been recorded by the big data system at the point of ingest, the information created by the camera is maintained, even after the rushes have been edited or transcoded.
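The join described above can be sketched in a few lines: match each shot’s timestamp against the nearest historical weather observation, then filter by time of day and conditions. The shot and weather records below are invented for illustration; a real system would query the harvested ingest metadata and a public weather archive.

```python
from datetime import datetime, timedelta

# Illustrative records only: shot metadata as harvested at ingest, and
# historical observations as a public weather archive might return them.
shots = [
    {"clip": "A003C012", "time": datetime(2023, 5, 1, 8, 0),
     "lat": 40.71, "lon": -74.01},
    {"clip": "A003C013", "time": datetime(2023, 5, 1, 12, 30),
     "lat": 40.71, "lon": -74.01},
]
weather = [
    {"time": datetime(2023, 5, 1, 8, 0), "conditions": "cloudy"},
    {"time": datetime(2023, 5, 1, 12, 0), "conditions": "sunny"},
]

def conditions_at(t, observations, window=timedelta(hours=1)):
    """Return the nearest weather observation within the time window."""
    best = min(observations, key=lambda o: abs(o["time"] - t))
    return best["conditions"] if abs(best["time"] - t) <= window else None

# Join: find clips shot around 8am under cloudy skies.
matches = [s["clip"] for s in shots
           if s["time"].hour == 8
           and conditions_at(s["time"], weather) == "cloudy"]
print(matches)  # ['A003C012']
```

A production system would also match on location, but the principle is the same: two independent data sources joined on a shared key, here the timestamp.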

Future-proofing legacy systems will be a giant task going forward. The traditional method has been to copy SQL databases by decoding their data and tables, reformatting to the new database design, and then transferring the data, hoping there are no corruptions, ambiguities or errors along the way. Another method is to leave the legacy database alone and build an API to extract information from it. This works to a certain extent, but the database is generally tied heavily to the underlying operating system, which at some point will no longer be supported by the manufacturer, causing a major headache for the IT department, especially around security and compliance. It’s quite common for a large client to audit an IT system, and if they find unsupported operating systems they will simply refuse to deal with you because of their security concerns.
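The traditional migration path can be illustrated with a toy example: read rows out of a legacy schema, reformat them to the new design, and load them across. SQLite stands in for both databases here, and the table layouts and date format are invented for the sketch.

```python
import sqlite3

# Legacy database with an invented tape-log schema and DD/MM/YYYY dates.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE tapes (tape_no TEXT, title TEXT, tx_date TEXT)")
legacy.executemany("INSERT INTO tapes VALUES (?, ?, ?)",
                   [("T0001", "Evening News", "01/05/2023"),
                    ("T0002", "Morning Show", "02/05/2023")])

# New database with a different design and ISO 8601 dates.
modern = sqlite3.connect(":memory:")
modern.execute("CREATE TABLE assets (asset_id TEXT, title TEXT, tx_date TEXT)")

def dd_mm_yyyy_to_iso(d: str) -> str:
    """Reformat a legacy DD/MM/YYYY date to ISO 8601 YYYY-MM-DD."""
    day, month, year = d.split("/")
    return f"{year}-{month}-{day}"

# Transfer with reformatting; a real migration would also validate every
# row and log ambiguities rather than silently converting.
for tape_no, title, tx_date in legacy.execute("SELECT * FROM tapes"):
    modern.execute("INSERT INTO assets VALUES (?, ?, ?)",
                   (f"legacy-{tape_no}", title, dd_mm_yyyy_to_iso(tx_date)))

rows = list(modern.execute(
    "SELECT asset_id, tx_date FROM assets ORDER BY asset_id"))
print(rows)  # [('legacy-T0001', '2023-05-01'), ('legacy-T0002', '2023-05-02')]
```

Even in this trivial case, the date reformatting shows where ambiguities creep in: a single malformed or transposed date in the source silently corrupts the target, which is exactly the risk the traditional method carries at scale.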

A single, easily accessible solution that can parse data from different sources, including scripts, rights databases, XML files and emails, meaningfully join them together, and make the search results available, is key to the library and archive systems of the future.
