Compression: Part 4 - Introducing Motion Compensation
Here we introduce the different types of redundancy that can be located in moving pictures.
The redundancies that a modern compression algorithm looks for can be divided into roughly three categories. As we have no true moving picture technology, we instead send a series of still pictures and leave it to the viewer to perceive the motion. Each picture is somewhat like a photograph, so a compressor can look for redundancy within an individual picture, just as JPEG does for actual photographs. Compression of this kind is known as intra coding, or spatial coding, because it works with the spatial, or two dimensional, information within a single picture, without needing reference to any other picture.
In addition to intra coding, compressors can also look for redundancy through time, where similarities between successive pictures can be exploited. This is known as inter coding or temporal coding.
As digital images are just a particular form of data, it is possible to consider data statistically and obtain compression by representing common bit patterns by short codes and infrequent patterns by longer codes. A real coder is likely to combine all three approaches.
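As a rough illustration of the statistical approach, the following Python sketch builds a Huffman code, which is one well known way of giving short codes to common symbols and longer codes to rare ones. Nothing above says which statistical coder a particular codec uses, so the choice of Huffman coding, the function name huffman_code_lengths and the sample data are all assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code built from data."""
    counts = Counter(data)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie breaker, {symbol: depth so far})
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical symbol stream in which the value 0 is by far the most common.
samples = [0] * 12 + [1] * 5 + [2] * 2 + [3] * 1

lengths = huffman_code_lengths(samples)
coded_bits = sum(lengths[s] for s in samples)
fixed_bits = 2 * len(samples)   # four symbols need 2 bits each with a fixed code

print(lengths)                  # {3: 3, 2: 3, 1: 2, 0: 1}
print(coded_bits, fixed_bits)   # 31 40
```

The common symbol gets a one bit code and the rare ones get three bits, so the 20 samples need 31 bits instead of 40. The saving depends entirely on how uneven the statistics are.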
The most commonly cited example of spatial coding is the identification of a patch of blue sky in which all of the pixels have substantially the same value. The example is not wrong, but there is a good deal more to it than that, as we shall see. Intra coding has one dominant advantage for the broadcaster, which is that cut editing is possible on the compressed data without quality loss. As the pictures are compressed independently, each one becomes a kind of data file. Cut editing can re-arrange the sequence of a number of these files without changing their content in any way. When the files are decoded, the picture quality is as good as if the editing had not taken place.
Why not use intra coding for everything, in that case? The answer is simple, which is that intra coding does not yield a high enough compression factor for most distribution purposes.
That leads us on to coding based on redundancy between successive pictures or within a series of pictures. This is known as inter coding, which is a good name for it. The term temporal coding will also be found, but that’s not quite such a good name, since it suggests that the redundancy is found on the time axis, when it frequently isn’t.
In the case of a still scene, there will be no difference between successive pictures and the redundancy will be total. But nobody wants to watch that. In real TV material, things move. In the case of a pan, the whole background moves. Real inter coders need to deal with that.
If one imagines a camera set up to pan across a still scene, the movement of the camera will cause every picture to be different, thereby destroying the temporal redundancy that existed when nothing moved. However, even though the camera moved, the scene didn’t and the scene still contains its redundancy. In the presence of the camera motion, that redundancy is strictly no longer temporal. It exists along a new axis that is known as the axis of optic flow.
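The point can be sketched in a few lines of Python. The example below is a toy, not part of any real coder: it treats a single scan line as the picture, and the scene contents, the capture function and the pan of 3 pixels are all assumptions chosen for illustration.

```python
# Hypothetical still scene: one scan line of 40 pixel values.
scene = [(7 * x) % 23 for x in range(40)]

def capture(scene, pan, width=16):
    """Return the part of the scene the camera sees at a given pan offset."""
    return scene[pan:pan + width]

frame_a = capture(scene, pan=0)
frame_b = capture(scene, pan=3)   # the camera has panned 3 pixels

# Difference along the time axis: same screen position in successive pictures.
temporal_diff = [b - a for a, b in zip(frame_a, frame_b)]

# Difference along the optic flow axis: compare pixels showing the same scene point.
shift = 3
flow_diff = [frame_b[x] - frame_a[x + shift] for x in range(len(frame_a) - shift)]

print(any(temporal_diff))   # True: the pan has destroyed the temporal redundancy
print(any(flow_diff))       # False: along the optic flow axis nothing has changed
```

Only the 13 pixels that appear in both frames can be compared along the optic flow axis. The 3 pixels that entered the frame during the pan are new information and have no earlier counterpart.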
Fig.1 shows the idea. A moving object appears in different places in successive pictures, but with respect to an optic flow axis, the movement is cancelled. There are many optic flow axes. They may begin when motion allows something to enter the frame or when motion of a foreground object reveals background. They may end when something leaves the frame or when concealed by a moving foreground. In the special case where nothing moves, the optic flow axes are parallel to the time axis, otherwise they are not.
In fact most of the illusion of moving pictures is based on optic flow axes. As the eyes can move, they will try to follow moving objects of interest both in real life and on a screen. A complete explanation of that will have to wait for another time.
When any image processing system is using optic flow axes correctly to portray motion, it is said to be motion compensated. A good example is a standards convertor, which has to create pictures in a new standard at times that are different to the timing of the input standard. By displacing moving objects along the appropriate optic flow axes, the motion in the new standard will still be correct.
Figure 1 - An optic flow axis follows the same point on a moving object. With respect to the optic flow axis, there is no motion. A tracking eye will follow an optic flow axis.
All motion compensated processes include the step of measuring the motion over a sequence of pictures. The fundamental difference between compression and other applications of motion compensation is the required accuracy. A standards convertor should correctly identify the outline and motion of every moving object on the screen.
A compressor does not need to do this, because the motion compensation is, or should be, part of the prediction process. The prediction process is not expected to be perfect, and the predicted picture is compared with the actual picture to produce a prediction error or residual. Transmission of the residual cancels out the shortcomings of all of the prediction techniques used in the picture.
This means that if motion compensation is used in the prediction, it can be imperfect, with considerable saving in complexity and cost. If, for example, the boundary of a moving object is not correctly identified, some of the prediction will be wrong, but the residual puts it right. The same thing applies to motion vectors. If they do not accurately describe the motion, the residual data must increase.
This is the same old story of the compromise between complexity and compression factor. The simple motion compensator that makes a lot of prediction errors will need a higher bit rate to send the residual, whereas the complex motion compensator should require a smaller residual. Over time, the complexity of motion compensation has risen, as progress in microelectronics has made it practical to meet the demand for higher compression factors.
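The relationship between prediction quality and residual size can be shown with a minimal Python sketch. The pixel values and the two predictions are invented for the example, and the residual is assumed to be sent without loss, which a real lossy codec would not do because it quantizes the residual.

```python
def residual(actual, predicted):
    """What the encoder sends: the prediction error for each pixel."""
    return [a - p for a, p in zip(actual, predicted)]

def reconstruct(predicted, res):
    """What the decoder does: make the same prediction, then add the residual."""
    return [p + r for p, r in zip(predicted, res)]

actual          = [10, 12, 15, 15, 14, 11]   # hypothetical row of pixels
good_prediction = [10, 12, 15, 14, 14, 11]   # e.g. from an accurate motion vector
poor_prediction = [10, 10, 10, 10, 10, 10]   # e.g. from a crude predictor

good_res = residual(actual, good_prediction)
poor_res = residual(actual, poor_prediction)

print(good_res, sum(abs(r) for r in good_res))   # [0, 0, 0, 1, 0, 0] 1
print(poor_res, sum(abs(r) for r in poor_res))   # [0, 2, 5, 5, 4, 1] 17

# Either way the decoder ends up with the correct picture.
print(reconstruct(good_prediction, good_res) == actual)   # True
print(reconstruct(poor_prediction, poor_res) == actual)   # True
```

Both predictions lead to the same decoded pixels. The difference is only in how much residual has to be sent, which is where the trade-off between complexity and bit rate comes from.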
For example, in MPEG-2, the unit of motion compensation was the macroblock which was 16 pixels square. Where there was motion between two pictures, pixel data from the first picture could be shifted a macroblock at a time in two dimensions according to a pair of vectors sent with each macroblock.
This worked well for macroblocks that were totally within a moving object, but at the boundaries, some pixels were shifted when they should not have been and required residual data to put them right. Later codecs such as AVC adopted motion compensation blocks of variable size and aspect ratio going down to 4 x 4 pixels, which also required a different transform to the usual 8 x 8 DCT.
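The behaviour of fixed-size blocks at object boundaries can be reproduced with a small Python sketch. It is a toy under stated assumptions, not the search used by any real encoder: it uses 4 x 4 blocks rather than 16 x 16 macroblocks to keep the data small, an exhaustive search scored by the sum of absolute differences, and invented pictures in which a square object moves 2 pixels to the right over a static textured background. All of the function names are hypothetical.

```python
H, W, N = 8, 12, 4   # picture height, picture width, block size

def make_picture(obj_left):
    """Static textured background with an N x N object whose left edge is at obj_left."""
    pic = [[(3 * y + 5 * x) % 7 for x in range(W)] for y in range(H)]
    for r in range(N):
        for c in range(N):
            pic[2 + r][obj_left + c] = 50 + 10 * r + c
    return pic

def block(pic, top, left, size):
    return [row[left:left + size] for row in pic[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def find_vector(ref, cur, top, left, size, search):
    """Exhaustive search for the (dy, dx) offset into ref that best predicts a block of cur."""
    target = block(cur, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if 0 <= t <= len(ref) - size and 0 <= l <= len(ref[0]) - size:
                cost = sad(target, block(ref, t, l, size))
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best[1], best[0]

ref = make_picture(obj_left=1)   # earlier picture
cur = make_picture(obj_left=3)   # object has moved 2 pixels to the right

# A block lying wholly inside the moving object is predicted perfectly.
inside_vec, inside_cost = find_vector(ref, cur, 2, 3, N, 2)
# A block straddling the object boundary drags background pixels along with it.
edge_vec, edge_cost = find_vector(ref, cur, 2, 5, N, 2)

print(inside_vec, inside_cost)
print(edge_vec, edge_cost > 0)
```

Both blocks get the vector (0, -2), meaning that the prediction is fetched from 2 pixels to the left in the earlier picture. For the interior block the match is exact. For the boundary block the static background pixels are shifted when they should not have been, so the cost is non-zero and a residual is needed. Smaller or variable-sized blocks reduce the number of pixels caught in that situation.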
Motion compensation has a problem when motion of a foreground object reveals background that had never been seen before and so could not be predicted from a previous picture. The solution is to use prediction from a future picture. Taken literally, this is impossible, but if a sequence of pictures is stored in memory, then it is possible, provided that an overall delay is acceptable. In bidirectional coding, a predicted picture can be assembled using motion compensated pixel data from previous or later pictures.
As was seen earlier, the coding delay or latency tends to rise with compression factor. Here is one of the reasons. Bidirectional coding requires pictures to be sent out of sequence, so that future pixel data are available. The decoder incorporates a time base corrector that puts decoded pictures back in the correct order and this causes delay.
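The re-ordering can be sketched as follows. The picture types I, P and B and the particular sequence used are assumptions borrowed from common MPEG-style practice rather than anything specified above, and the function name is hypothetical. Each B picture here is taken to need the next I or P picture as its future reference.

```python
def transmission_order(display):
    """Re-order pictures so each future reference is sent before the B pictures that use it.

    display is a list of (display_index, picture_type) tuples in viewing order.
    """
    out, pending_b = [], []
    for pic in display:
        if pic[1] == "B":
            pending_b.append(pic)     # hold back until the later reference has gone
        else:
            out.append(pic)           # I or P picture: send it first
            out.extend(pending_b)     # then the B pictures that were waiting for it
            pending_b = []
    out.extend(pending_b)             # trailing B pictures, if the sequence ends on them
    return out

display = list(enumerate("IBBPBBP"))
sent = transmission_order(display)
print(" ".join(f"{t}{i}" for i, t in sent))       # I0 P3 B1 B2 P6 B4 B5

# The decoder restores viewing order, which it can only do by holding pictures back.
restored = sorted(sent)
print(" ".join(f"{t}{i}" for i, t in restored))   # I0 B1 B2 P3 B4 B5 P6
```

Picture B1 cannot be decoded until P3 has arrived, and P3 cannot be displayed until B1 and B2 have been shown, so both encoder and decoder must buffer pictures. That buffering is the source of the extra delay.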
Inter coding brings the concept of predicting all or some of a picture from another picture, which means the pictures are no longer independent as they were in intra coding. Once inter coding is used, editing can no longer be performed on compressed data. If a given picture requires information from an earlier picture for decoding, clearly decoding cannot work if that earlier picture has been removed by an edit.
Although inter coding could be convolutional, for practical reasons the bit stream requires entry points where a decoder can begin decoding if it missed the beginning of the clip or if non-linear access is required. Pictures that depend on one another are assembled into groups. Typically a group will start with an intra coded picture so that decoding can start there without reference to any earlier picture.
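A minimal sketch of how a decoder might locate an entry point is given below. The group structure, with an intra coded picture (I) followed by five pictures each predicted from its predecessor (P), is an assumption for the example, as is the function name.

```python
def entry_point(types, wanted):
    """Index of the nearest intra coded picture at or before the wanted picture."""
    for i in range(wanted, -1, -1):
        if types[i] == "I":
            return i
    raise ValueError("no intra coded picture at or before the requested one")

types = list("IPPPPPIPPPPP")   # two hypothetical groups of six pictures

wanted = 9
start = entry_point(types, wanted)
to_decode = list(range(start, wanted + 1))

print(start)       # 6
print(to_decode)   # [6, 7, 8, 9]
```

To show picture 9, the decoder has to start at picture 6 and decode everything in between. The longer the group, the further back the entry point may be, which is one reason why long groups suit final delivery better than production.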
It follows that in order to edit inter coded pictures, they must be decoded back to conventional video first, before being re-encoded. When lossy codecs are used, this must result in generation loss. This is a fundamental characteristic of inter coding. Using motion compensation, inter coding out-performs intra coding by a significant factor, so where high compression factors are needed, inter coding is the way to go.
The greatest strength of inter coding is for final delivery of video after all post production steps have been taken and no further processing is anticipated. Such applications can use high compression factors with long groups of pictures (GOPs).