The Ultimate Compression Technology?

Our resident provocateur Dave Shapton speculates on the nature of compression and its potential future evolutionary path.
We are starting to see a path towards the ultimate video compression codec. It will be one codec for everything, and curiously, it will be a subset of another type of compression: knowledge compression.
Ray Kurzweil, in his 2024 book “The Singularity Is Nearer”, speculates that the “singularity” will occur around 2040. It’s a big claim, but essentially, he says that by then, we will have brains enhanced by additional cognitive layers in the cloud and be billions of times more intelligent. If that seems almost impossible to imagine, Kurzweil predicts that as our brain power grows, we will see it as a natural extension of ourselves.
It’s very easy to be skeptical about this, except that since his first “singularity” book, “The Singularity is Near”, published in 2005, Kurzweil has been remarkably prescient, accurate almost to the exact year about the rise of AI and the growth in computer power.
About ten years ago, I predicted that the ultimate manifestation of digital video technology would be when you can feed a film script into a computer, and the output will be a feature film. I also predicted this wouldn’t happen for a long time - perhaps a hundred years. I was very wrong about the timescale but substantially correct that it would become possible. I can say that with some degree of certainty because it has happened. If 2022 was the year of ChatGPT, then 2024 was the year of text-to-video, which has rapidly gone from being like a bad LSD trip to a thing of aesthetic wonder. And we certainly do need to wonder where this will end up.
Text-to-video technology is incredibly disruptive for the film and broadcast industries. It also raises questions about authenticity and verifiable provenance. Meanwhile, I’d like to discuss an aspect of this new technology that will sound familiar to anyone involved in video for the past thirty years: compression.
But this time, we’re not (just) talking about video compression. What the rapidly evolving Large Language Models (LLMs) that power the likes of GPT-4 and Claude 3.5 do at their core is compress knowledge.
That’s quite a hard concept to understand, but when you use an LLM, you can see why this characterization might be a good one.
At the risk of oversimplifying, the macro effect of what LLMs do is to encode and compress knowledge, not in the sense of data reduction but by reducing its dimensionality.
So far, so abstract. We’ll unpack “dimensionality” later.
If the world represents uncompressed knowledge, then any attempt to pack that knowledge (losslessly or lossily) into another kind of space must represent knowledge compression. In effect, an LLM is a knowledge codec: it encodes and packs knowledge into a different form, and it decodes it when we ask it questions, give it a prompt or ask it to carry out a task.
The knowledge is compressed because of the way LLMs work. An LLM doesn’t learn the meaning of words the way we do. Instead, it encodes how words (and parts of words) are used and, in doing so, the relationships behind the concepts those words express. It does this in a so-called high-dimensional vector space, which is nothing like our familiar three dimensions but a way of representing how concepts relate to one another. This “high-dimensional space” allows the model to, well, I want to use the word “understand” here, but that might be seen as a bit too anthropomorphic.
LLMs know that a calf is to a cow what a puppy is to a dog. These conceptual mappings - billions of them - are what make LLMs so powerful.
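Here’s one way to picture that, as a minimal Python sketch using invented “embedding” vectors (nothing below comes from a real model, whose dimensions number in the hundreds or thousands and are opaque). The analogy turns into simple vector arithmetic: the nearest remaining vector to cow - calf + puppy is dog.

```python
import numpy as np

# Toy 4-dimensional "embeddings", invented purely for illustration.
# Each dimension loosely means: [bovine, canine, feline, adult].
vectors = {
    "cow":    np.array([1.0, 0.0, 0.0, 1.0]),
    "calf":   np.array([1.0, 0.0, 0.0, 0.0]),
    "dog":    np.array([0.0, 1.0, 0.0, 1.0]),
    "puppy":  np.array([0.0, 1.0, 0.0, 0.0]),
    "cat":    np.array([0.0, 0.0, 1.0, 1.0]),
    "kitten": np.array([0.0, 0.0, 1.0, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The analogy becomes arithmetic: cow - calf + puppy should land near "dog".
candidate = vectors["cow"] - vectors["calf"] + vectors["puppy"]
scores = {w: cosine(candidate, v) for w, v in vectors.items()
          if w not in ("cow", "calf", "puppy")}
print(max(scores, key=scores.get))  # -> "dog"
```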
The models contain knowledge embedded within them, but they don’t encompass everything. You might even find that they have “compressed” their knowledge deliberately in various ways. For example, the “weights” that indicate a “good” result for a neuron in a neural net don’t need to be encoded with great precision: it could be that a 4-bit representation is very nearly as good as a sixteen-bit one in terms of accuracy.
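To see roughly what that trade looks like, here’s a minimal Python sketch of symmetric 4-bit quantization applied to some made-up weights. Real schemes (per-channel scales, grouped quantization and so on) are considerably more sophisticated, but the principle is the same: far fewer bits per weight in exchange for a small loss of precision.

```python
import numpy as np

# Minimal sketch of symmetric 4-bit weight quantization with a single
# per-tensor scale. The weights are random numbers standing in for a real
# layer; packing two 4-bit values per byte is omitted for clarity.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

levels = 7                                   # usable range of a signed 4-bit integer
scale = np.abs(weights).max() / levels
quantized = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)

# Decode (dequantize) and measure what the lost precision cost us.
restored = quantized.astype(np.float32) * scale
print("mean absolute error:", float(np.abs(weights - restored).mean()))
# Storage per weight drops from 32 (or 16) bits to 4, for a tiny error.
```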
Some newer, smaller LLMs can run on surprisingly low-powered devices - smartphones, for example. Effectively, much of the world’s knowledge is compressed into your pocket.
In a sense, an LLM (or whatever different AI model follows it) is a codec. It encodes knowledge to a state where it is still almost fully useful, and, via our questions, it can reproduce that knowledge pretty accurately (at least sometimes, when it’s not “hallucinating”).
It’s different with images, but the process is similar at a very high level. Instead of LLMs (which may still be used to interpret user prompts), still images and videos are often created using a “diffusion model.” These work in a somewhat unintuitive way.
In essence, diffusion models work by gradually adding controlled amounts of random noise to an image over several iterations, eventually transforming it into pure noise so that nothing of the original image is left. The model learns to reverse this process by predicting and removing the added noise step by step. This allows it to reconstruct images or generate new ones from noise by referring to the learned probability distribution of the data. Please understand that what’s being stored in the model isn’t the images themselves but a representation of their statistical properties - the probabilities that specific patterns or features exist across multiple images. This probabilistic encoding allows diffusion models to generate realistic outputs by sampling from these learned distributions and iteratively refining them during decoding.
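To make the forward half of that concrete, here’s a minimal Python sketch of the noising process, assuming a simple linear noise schedule. The “image” is just random numbers standing in for real pixels; in an actual diffusion model, a network would be trained to look at the noisy result and predict the noise that was mixed in.

```python
import numpy as np

# A minimal sketch of the forward (noising) half of diffusion, assuming a
# simple linear schedule. Everything here is for illustration only.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(64, 64))   # pretend this is a photo of a cat

steps = 1000
betas = np.linspace(1e-4, 0.02, steps)          # how much noise each step adds
alphas_cumprod = np.cumprod(1.0 - betas)        # how much of the original survives

def noisy_version(x0, t):
    """Jump straight to step t of the forward process."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1.0 - alphas_cumprod[t]) * noise
    return x_t, noise

# By the final step almost nothing of the original image is left - it's static.
x_t, added_noise = noisy_version(image, steps - 1)
print(f"signal remaining at final step: {np.sqrt(alphas_cumprod[-1]):.4f}")
# Training teaches a network to look at x_t (and t) and predict `added_noise`.
```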
A practical example might explain this better. This one involves a furry animal.
Imagine that you have an image of a cat. The diffusion model gets to work by gradually adding random noise to the image, step by step, until it is unrecognizable - just a picture full of static. This is the encoding process, where the diffusion model learns how the image changes as noise is added. As it does this, it captures patterns from the image at different scales:
- Small scale: It notices fine details like the texture of individual strands of fur or the sharp edges of whiskers.
- Mid-scale: It notices features like the shape of the cat’s eyes, ears, and nose and how these are placed relative to each other.
- Large scale: It identifies larger features, like the roundness of the cat’s head or the arch of its back.
To decode, the model starts with pure noise and works backwards. First, it draws the largest, most basic shapes (like the body's overall shape). Then, it refines the smaller-scale features, adding ears, eyes and nose in the appropriate places. And finally, it fills in the minute details like individual strands of fur. Each step reduces the overall noise while adding more detail until you’re left with a cat. But, importantly, it's not the same cat.
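Here, equally hedged, is what that decoding loop looks like in outline. The predict_noise function below is an empty placeholder for the trained network that does all the real work in an actual diffusion model; the loop structure, though, follows the standard sampling recipe: start from static and peel the noise away step by step.

```python
import numpy as np

# A toy sketch of the reverse (decoding) loop: start from pure noise and
# repeatedly remove a little of it. `predict_noise` is a placeholder for
# the trained network a real diffusion model provides; returning zeros
# only keeps the loop runnable - it is NOT a real model.
rng = np.random.default_rng(0)
steps = 1000
betas = np.linspace(1e-4, 0.02, steps)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def predict_noise(x, t):
    return np.zeros_like(x)   # stand-in: a trained network goes here

x = rng.normal(size=(64, 64))                   # begin with pure static
for t in reversed(range(steps)):
    eps = predict_noise(x, t)                   # "what noise is in this image?"
    # Simplified DDPM-style update: subtract the predicted noise, rescale,
    # and add a little fresh noise on every step except the last.
    x = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * rng.normal(size=x.shape)
# With a real predictor, x would now be a brand-new, plausible cat.
```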
Raw pixels are meaningless in themselves. Conventional digital video treats pixels as having an independent existence, but viewers don’t see pixels; they see pictures. Pictures have meaning; pixels don’t unless you see a lot of them. A much more efficient way to record images would be to write a description of each image and of how it changes from frame to frame. You can see a glimmer of this in the way Long GOP compression works: there’s no need to re-encode every object in every frame when its motion is predictable.
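As a toy illustration of that glimmer, here’s a Python sketch of inter-frame prediction in the Long GOP spirit: store one reference frame in full, then describe the next frame as a motion vector plus a small residual. Real encoders do this per block after a motion search; shifting the whole frame by a known amount is purely to show the principle.

```python
import numpy as np

# Toy Long GOP idea: one full reference frame, then "how the next frame
# differs". The pan and the noise are invented for illustration.
rng = np.random.default_rng(0)
frame1 = rng.uniform(0.0, 1.0, size=(72, 128))            # the I-frame (stored in full)
frame2 = np.roll(frame1, shift=3, axis=1) \
         + rng.normal(0.0, 0.01, size=frame1.shape)        # scene panned 3 pixels, plus sensor noise

motion_vector = (0, 3)                                     # what a motion search would find
predicted = np.roll(frame1, shift=motion_vector[1], axis=1)
residual = frame2 - predicted                              # here: nearly zero

# The "P-frame" is just the motion vector plus a near-empty residual,
# which compresses far better than a second full frame.
print("mean residual magnitude:", float(np.abs(residual).mean()))
```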
But AI goes far beyond that. Instead of pixels or even groups of pictures, it encodes images and video into a “high-dimensional vector space”, where the content of images is stored not as colored points of luminosity but as a statistical distribution of aspects of the pictures. But this is not one-to-one encoding; it does not store each input so that it can be faithfully reproduced. Instead, each input is added cumulatively to a “latent space”, building up a repository of information about what a given subject (selectable via a prompt or some other input) tends to look like, which can then be used to create novel images.
A latent space is a lower-dimensional - you could argue compressed - representation of the original visual data. No longer in the domain of pixels, which have very high dimensionality because they mean nothing taken individually, images are stored as patterns and probabilities. Some of these abstract relationships may be meaningless or invisible to humans, but, behind the scenes, they explain generative AI’s ability to make such detailed images.
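For a rough feel of what “lower-dimensional representation” means, here’s a Python sketch that uses PCA as a stand-in for the learned encoder and decoder (production pipelines typically train a VAE for this, but the dimensionality reduction is the same idea): each fake 64 x 64 image collapses from 4,096 numbers to 128, and can be approximately reconstructed from them.

```python
import numpy as np

# Minimal sketch of "latent space" compression using PCA in place of a
# learned encoder/decoder. The images are random data for illustration.
rng = np.random.default_rng(1)
images = rng.normal(size=(1000, 64 * 64))       # 1000 fake 64x64 grayscale "images"

# "Encode": project onto the 128 directions that capture the most variance.
mean = images.mean(axis=0)
u, s, vt = np.linalg.svd(images - mean, full_matrices=False)
basis = vt[:128]                                 # a 128-dimensional latent space
latents = (images - mean) @ basis.T              # each image is now 128 numbers, not 4096

# "Decode": map the latents back to pixel space - close, but not identical.
reconstructed = latents @ basis + mean
print(images.shape, "->", latents.shape)         # (1000, 4096) -> (1000, 128)
```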
This kind of “compression” is tantalizing for video technologists. It can’t substitute for a conventional codec, but there may be ways to modify the technique to store and retrieve individual video feeds accurately. (The word “modify” is doing a lot of heavy lifting here.)
AI makes breakthroughs almost daily, and it would be foolish to make predictions more than a month or two into the future. It seems at least possible that researchers could find a way to store a very low-resolution copy of the original video and use AI to reconstruct it, based on the model’s ability to “know” what each element in the video should look like. If you used a video frame as an input or “prompt”, the model could reproduce it from its data, and the result would no longer have a fixed resolution or frame rate: it would be made up of generalized data based on patterns and probabilities. The errors, instead of being stark digital artefacts, would at least be “like” what the reproduction requires. Significant errors would be conspicuous: instead of someone’s head, you might get a teapot. But if the errors were of the same magnitude as film grain or a pixel grid, you wouldn’t notice them.
Existing AI-based image enhancement apps tend to take liberties with reality and can be shockingly inaccurate compared to the original while looking plausible to anyone unfamiliar with the scene.
But if you look at the rate of progress over the last two years, it is incredible how far the technology has come. Early text-to-video looked like a different artist was responsible for each frame, but the latest models are virtually pixel-perfect and even seem to understand elements of physics. The last three months have shown that the rate of change is accelerating dramatically, with OpenAI’s “reasoning” models and with ever more authentic-looking text-to-video apps. I personally would be surprised if, within a decade, we’re not using AI to encode and decode all our content.