Big Data Retrieval
Just having metadata is not sufficient. One needs to be able to effectively search it.
As the volume of file-based media grows, the requirement for metadata advances significantly. Simultaneously, the number of sources of metadata available expands as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is increasing exponentially, and no one person will be able to access all of the available data in a coherent way.
Traditionally, metadata consisted of a few scrawled notes on a piece of camera tape stuck to the side of a video tape box, or written on a piece of paper. Producers used to visit the tape librarian to find footage for their edit, and a stack of tapes would be given to the edit assistant, who would dutifully load them onto a trolley and wheel them to the edit suite, a process that could take many hours or even days.
As the volume of file-based media continues to grow, the requirement for metadata advances significantly. Browsing online, a producer can find library media, provide timecode for edit points and make the material available to an editor in a matter of minutes, sometimes even editing the programme themselves. The power of these search facilities comes from stored tags and descriptions derived from the metadata element of the media file. The number of sources of metadata available continues to grow as each node in the programme chain adds information to the file. And here lies the problem: the amount of information a producer has to search through is increasing exponentially, and no one person will be able to access all of the available data in a coherent way.
Workflows rarely respect metadata embedded in media files. For example, the camera information in a recorded file such as f-stop, lens focal length, time and GPS location can be easily stripped out during ingest to non-linear editing software. It’s not uncommon for a separate XML file to be created containing further metadata which has to be kept in sync with the video and audio essence. Even embedded MXF doesn’t solve the problem as a file’s metadata can change during an edit, or transcode, and valuable original data could be lost or modified.
Retrieving the metadata and doing meaningful searches has its own challenges. Searching the media files alone can be time consuming and cumbersome. Should you search the rushes or the edited master? The transmission transcode or the library archive? Data storage is only as effective as data retrieval; searching takes time and requires local knowledge of the system.
Historic data can be stored in databases that are no longer supported and would require a highly competent programmer to extract any meaningful information, assuming the operating system was still available to retrieve the data in the first place. Metadata could be spread over many databases, from the media asset management system to the transmission logs, creating further problems with searching.
A media asset’s value can only be truly monetized if clients can find the material they’re looking for. As well as having robust storage systems, the tag and search information has to be there in the first place. Having a person watch every piece of media to transcribe the dialogue is unrealistic.
As we have seen, the very act of ingesting, editing or transcoding a file can change its contents by stripping out and creating new metadata. An automated way of extracting and recording metadata has to be provided at each workflow node in the chain, thus removing the problem of losing information during processing. The system has to be scalable beyond our usual understanding of scalability; that is, we have to be able to parse file formats we haven’t yet designed.
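As a rough illustration of recording metadata at each workflow node, the Python sketch below appends a snapshot to a log every time an asset passes through a node, so fields stripped later in the chain remain searchable. The function names, field names and asset ID are hypothetical and chosen only for this example; a real system would write to a persistent store rather than a list.

```python
import hashlib
import json
import time


def record_snapshot(log, asset_id, node, metadata):
    """Append a snapshot of an asset's metadata as seen at one workflow node."""
    payload = json.dumps(metadata, sort_keys=True)
    entry = {
        "asset_id": asset_id,
        "node": node,
        "recorded_at": time.time(),
        # Round-trip through JSON so later changes to `metadata` cannot alter the log.
        "metadata": json.loads(payload),
        # Digest of the snapshot, useful for spotting changes between nodes.
        "digest": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
    log.append(entry)
    return entry


def merged_view(log, asset_id):
    """Combine all snapshots for an asset; the earliest recorded value of each key wins."""
    merged = {}
    for entry in log:
        if entry["asset_id"] != asset_id:
            continue
        for key, value in entry["metadata"].items():
            merged.setdefault(key, value)
    return merged


# Hypothetical example data: camera fields captured at ingest.
log = []
record_snapshot(log, "clip-001", "ingest",
                {"f_stop": 2.8, "focal_length_mm": 35, "gps": [40.7128, -74.0060]})
# The edit node has stripped the camera fields and added its own.
record_snapshot(log, "clip-001", "edit", {"editor": "suite-3", "duration_s": 12.5})

view = merged_view(log, "clip-001")
print(view)
```

Because the log is append-only, the camera's f-stop and GPS position survive even though the edited file no longer carries them.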
Metadata storage and retrieval is much more involved than just providing access to XML files or embedded MXF data. Historic practices have seen users creating metadata-rich filenames in an attempt to overcome retrieval problems, an unsustainable practice, as filenames are limited in length and cannot be easily parsed. The more information we can provide about a media asset, the more useful it is to us, and the more people will be willing to buy it.
Companies such as GrayMeta have been able to take the initiative with big data farming and retrieval. The software works across many systems: public cloud, private cloud and on-prem. It is also expandable through modularity, so it can be extended to deal with the future formats we haven’t yet designed.
Using big data systems, producers can now search archive media and rushes with unprecedented granularity to quickly find obscure and interesting shots. For instance, consider a scenario where a producer wants a shot of a cloudy day over New York at 8am. The camera would have recorded the time and GPS position of the shot, a publicly available weather system would be able to provide historical data about the weather conditions at that time, and the big data retrieval engine would be able to join and match all of this information and provide a link to the media. This fundamentally requires the camera’s data to still be available. As the metadata would have been recorded by the big data system at the point of ingest, the information created by the camera is maintained, even after the rushes have been edited or transcoded.
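The New York scenario can be sketched in a few lines of Python. This is only an illustration of the join, not how any particular product implements it: the clip records, the weather table and the function names are invented for the example, and timestamps are assumed to be local time at the place of recording.

```python
from datetime import date, datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def find_shots(clips, weather, centre, radius_km, hour, condition):
    """Return IDs of clips shot near `centre` at `hour` when the weather matched `condition`."""
    matches = []
    for clip in clips:
        when = clip["recorded_at"]
        if when.hour != hour:
            continue
        if haversine_km(clip["lat"], clip["lon"], *centre) > radius_km:
            continue
        # Join the camera's timestamp against the historical weather record.
        if weather.get((when.date(), when.hour)) == condition:
            matches.append(clip["id"])
    return matches


NEW_YORK = (40.7128, -74.0060)

# Invented camera metadata, as captured at ingest.
clips = [
    {"id": "rushes/A001", "recorded_at": datetime(2016, 3, 14, 8, 5), "lat": 40.7580, "lon": -73.9855},
    {"id": "rushes/A002", "recorded_at": datetime(2016, 3, 15, 8, 10), "lat": 40.7484, "lon": -73.9857},
    {"id": "rushes/B001", "recorded_at": datetime(2016, 3, 14, 8, 0), "lat": 51.5074, "lon": -0.1278},
]

# Invented historical weather for New York, keyed by (date, hour).
weather = {
    (date(2016, 3, 14), 8): "cloudy",
    (date(2016, 3, 15), 8): "clear",
}

result = find_shots(clips, weather, NEW_YORK, 25, 8, "cloudy")
print(result)
```

Only the first clip satisfies all three conditions: it was shot within 25 km of New York, in the 8am hour, on a day the weather record marks as cloudy.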
Future-proofing legacy systems will be a giant task. The traditional method has been to copy SQL databases by decoding their data and tables, reformatting to the new database design, and then transferring the data, hoping there are no corruptions, ambiguities or errors on the way. Another method is to leave the legacy database alone and build an API to extract information from it. This will work to a certain extent, but the database is generally heavily tied to the underlying operating system, which at some point will no longer be supported by the manufacturer, causing a major headache for the IT department, especially with security and compliance. It’s quite common for a large client to audit an IT system, and if they find unsupported operating systems, they will simply refuse to deal with you because of their concerns about security.
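The traditional copy-and-reformat approach can be sketched as follows, using Python's built-in sqlite3 module with in-memory databases standing in for the legacy and replacement systems. The table names, column names and sample rows are assumptions made for the example, and the row-count comparison at the end is only the most basic of the checks a real migration would need.

```python
import sqlite3

# Stand-in for the legacy database, with a hypothetical schema.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE TAPES (TAPE_NO TEXT, TITLE TEXT, TX_DATE TEXT)")
legacy.executemany(
    "INSERT INTO TAPES VALUES (?, ?, ?)",
    [("T0001", "Harbour at dawn ", "19980412"), ("T0002", "City skyline", "19990103")],
)

# Stand-in for the new database design.
modern = sqlite3.connect(":memory:")
modern.execute(
    "CREATE TABLE assets (asset_id TEXT PRIMARY KEY, title TEXT NOT NULL, tx_date TEXT)"
)


def reformat(row):
    """Map one legacy row onto the new design, converting YYYYMMDD to ISO 8601."""
    tape_no, title, tx_date = row
    iso_date = f"{tx_date[0:4]}-{tx_date[4:6]}-{tx_date[6:8]}"
    return (tape_no, title.strip(), iso_date)


rows = legacy.execute("SELECT TAPE_NO, TITLE, TX_DATE FROM TAPES").fetchall()
with modern:  # commits on success, rolls back if any insert fails
    modern.executemany("INSERT INTO assets VALUES (?, ?, ?)", [reformat(r) for r in rows])

# Verify rather than hope: compare row counts on both sides.
legacy_count = legacy.execute("SELECT COUNT(*) FROM TAPES").fetchone()[0]
modern_count = modern.execute("SELECT COUNT(*) FROM assets").fetchone()[0]
migrated_ok = legacy_count == modern_count
print(migrated_ok)
```

Wrapping the transfer in a transaction means a failed insert leaves the new database untouched, which removes some of the guesswork about partial or corrupted copies.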
A single solution that is easily accessible, can parse data from different sources including scripts, rights databases, XML files and emails, can meaningfully join them together, and can make the search results available is key to the library and archive systems of the future.