Compression: Part 8 - Spatial Compression

Now we turn to spatial compression, which works entirely within individual images and takes no account of other images.

Spatial compression, or intra coding, works by looking for redundancy within individual frames. For a photograph, the input is a single frame. For video, the frame may be one of a sequence, and intra coding that frame allows a decoder to start working because it relies on no previous frames. When a viewer changes channel, for example, earlier frames from the new channel are not available.

Typically, Groups of Pictures (GOPs) start with an I picture at which decoding can begin.

Another use of I pictures is that they arrest error propagation. In differential coding, where new pictures are decoded by altering the previous picture, a failure could corrupt all subsequent pictures, whereas an I picture does not rely on any previous picture and so will be unaffected.

In practice, video signals contain cut edits, and these are handled badly by temporal coders since there is no redundancy across a cut edit. All optic flow axes end at a cut edit and new ones begin. One of the best ways of handling a cut edit is to encode the first frame after the cut as an I picture that is the beginning of a GOP. This may require earlier GOPs to have varying length, but there is nothing in the codec standards that requires GOPs to have a particular size.
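As a rough sketch of the idea, an encoder might decide picture types like this; the function names and the threshold are hypothetical, chosen only for illustration:

```python
import numpy as np

def frame_difference(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute luma difference between two frames."""
    return float(np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16))))

def pick_picture_type(prev, curr, cut_threshold: float = 30.0) -> str:
    """Start a new GOP with an I picture when a cut is suspected.
    Real encoders use subtler detectors; the threshold is illustrative."""
    if prev is None or frame_difference(prev, curr) > cut_threshold:
        return "I"   # no usable temporal redundancy, so intra code
    return "P"       # enough redundancy to predict from the previous frame
```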

I pictures are less efficient than temporally coded pictures, but without them the practicality and reliability of compression schemes would be compromised.

In many compression schemes, spatial coding is used in all types of pictures, not just in I pictures. To give an example, if a P picture is being encoded, the encoder will use motion compensation in the form of vectors to predict as much of the P picture as possible. The encoder contains a decoder, so it knows what the actual decoder would make of the prediction, which is bound to be imperfect.

The encoder compares the predicted picture with the actual input picture by subtracting pixel by pixel to produce an error picture, also known as a residual. If that error picture is added to the imperfect prediction, the errors are cancelled out. A further saving in bit rate can be obtained if the residual picture is spatially coded. Although a residual is not a recognizable picture, it is still a spatial array of pixel differences and can be treated like an image for coding purposes.
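In code, the residual step is simple. A minimal sketch, assuming 8-bit luma frames held in NumPy arrays:

```python
import numpy as np

def make_residual(actual: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Subtract pixel by pixel to produce the error picture, or residual.
    A wider working type stops the differences wrapping around in 8 bits."""
    return actual.astype(np.int16) - prediction.astype(np.int16)

def reconstruct(prediction: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Adding the residual to the imperfect prediction cancels the errors."""
    return np.clip(prediction.astype(np.int16) + residual, 0, 255).astype(np.uint8)
```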

The majority of pictures contain objects. We recognize them instinctively because they are typically areas of the picture bounded by edges. Within the edges the object is self-similar, whereas outside the edges there will be a contrasting background.

According to Murray Gell-Mann, real objects in a picture must be considerably larger than one pixel. If every object were one pixel in size, the picture would look like noise and would be incompressible. It follows immediately that a picture containing recognizable objects is compressible.

As William Schreiber pointed out, most of the information in a picture is in the edges. It is at edges that the greatest changes in pixel values are found, and these obviously require the greatest amount of data to describe.

An edge reveals itself as a significant change in value along a series of pixels. A vertical edge, due to a utility pole or the side of a building, will show up in a horizontal run of pixels. Being vertical, the edge will appear in the same place in the next row of pixels. This means that the position of the edge can be predicted, requiring fewer data to describe it. That, in a nutshell, is how spatial prediction works. As usual, any prediction failure is compensated using a residual.
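A toy demonstration of the principle, using a synthetic block with a vertical edge; predicting each row as a copy of the row above means the edge costs data only once, in the first row where it appears:

```python
import numpy as np

# Synthetic 8x8 luma block with a vertical edge at column 4.
img = np.full((8, 8), 40, dtype=np.uint8)
img[:, 4:] = 200

# Predict each row from the row above (mid-grey for the first row).
pred = np.empty_like(img)
pred[0] = 128
pred[1:] = img[:-1]

residual = img.astype(np.int16) - pred.astype(np.int16)
# Only the first row needs real data; every later row was predicted
# perfectly, so its residual is zero.
print(np.count_nonzero(residual, axis=1))   # [8 0 0 0 0 0 0 0]
```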

Any compression system needs to understand the statistics of the original data as well as the sensitivity of the destination. Analysis of typical pictures shows that the spectrum of spatial frequency is far from flat. Typically, most of the energy is concentrated at zero frequency, where the average brightness of the entire picture resides. The energy then falls as spatial frequency rises. In the presence of motion, the image will move across the sensor, and the resulting smear filters out the highest frequencies. The spectrum of the video is truncated, and that gives further potential for compression.
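That falling spectrum is easy to observe. The sketch below, purely an analysis aid and not part of any codec, bins the power of a frame's 2D spectrum by radial spatial frequency; on typical pictures the first bin dominates by orders of magnitude:

```python
import numpy as np

def spectrum_energy_profile(luma: np.ndarray) -> np.ndarray:
    """Radially binned energy of the 2D spatial spectrum of one frame."""
    f = np.fft.fftshift(np.fft.fft2(luma.astype(np.float64)))
    power = np.abs(f) ** 2
    h, w = luma.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)  # radius ~ spatial frequency
    return np.bincount(r.ravel(), weights=power.ravel())
```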

Fig.1 a) - the pixel block is mirrored by putting the same data in reverse order before the original data b) - the mirroring cancels out all of the sine components, leaving only the cosine components.

One interesting factor is that the amount of image smear in the camera, and therefore the maximum amount of information in the frame, is determined by the shutter opening time, which is a function of the frame rate, and by the motion, which depends on the subject material. The amount of image smear has nothing to do with the pixel count of the sensor.

What this means is that the information to be conveyed is fixed by the motion speed and cannot be increased by raising the pixel count. If the pixel count is raised, all that happens is that the finite picture information is being oversampled. There is nothing to occupy the wider spectrum the higher pixel count allows.

Suppose that with a 2K picture the motion within a frame is sufficient to halve the width of the spatial spectrum. The same picture shot with a 4K sensor will have information only in the bottom quarter of the spatial spectrum, and with an 8K sensor the picture information will reside in the bottom eighth. Codec designers are laughing, because high pixel counts simply increase the amount of redundancy in the image.
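The arithmetic is easy to check. Taking the example above, where motion smear limits the picture information to what half of a 2K spectrum can hold:

```python
# Smear fixes the finest detail that survives, independent of sensor size.
smear_limited_width = 2048 // 2   # motion halves the 2K spectrum
for sensor_width in (2048, 4096, 8192):
    fraction = smear_limited_width / sensor_width
    print(f"{sensor_width}-wide sensor: bottom {fraction:.1%} of spectrum occupied")
# 2048: 50.0%, 4096: 25.0%, 8192: 12.5%
```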

The only way out of this is to reduce the image smear on the camera sensor, which requires the frame rate to rise. Until frame rates based on power frequencies are abandoned for rates based on psycho-optics, high pixel counts serve only to impress the gullible.

Entering the spatial frequency domain has another advantage, which is that the sensitivity of the human visual system (HVS) to noise is strongly dependent on spatial frequency. Spatial coding is the part of a codec where the compression can be made lossy to meet bit rate targets. Usually, the bit rate that is available is constant, but the information content of the video varies from moment to moment. Thus, the codec has to be able to vary the degree of compression to make the two match.

The degree of compression is increased by raising the noise floor of the pictures, which allows a shorter wordlength to be used. Clearly, if this were done in the pixel domain, the noise at all spatial frequencies would be raised by the same amount. If, however, the process is carried out in the frequency domain, the resulting noise can be shaped so as to be less visible to the viewer.
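A sketch of the idea; the step matrix here is hypothetical, chosen only to show coarser quantizing, and hence a higher error floor, at higher spatial frequencies:

```python
import numpy as np

# Hypothetical 8x8 step matrix: steps grow with spatial frequency, pushing
# the error towards frequencies where the HVS is least sensitive.
u, v = np.indices((8, 8))
step = 1 + 2 * (u + v)

def quantize(coeffs: np.ndarray) -> np.ndarray:
    """Shorten the effective wordlength of an 8x8 block of coefficients."""
    return np.round(coeffs / step) * step
```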

Most codecs convert the pixel-based pictures to the frequency domain using some kind of transform so that spatial frequencies where there is little or no energy can be identified and noise shaping can be used.

There are many different types of transform, but the requirement to process the coefficients to reduce bit rate narrows the field. The Fourier transform, for example, has sine and cosine components for each frequency. Shortening the wordlength of a sine-cosine pair by truncation is not guaranteed to preserve their relative amplitudes, so it could alter the phase, which shifts that frequency across the screen.
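A small numerical example of the problem, with values chosen purely for illustration:

```python
import numpy as np

# One frequency component with cosine part c and sine part s has phase
# atan2(s, c). Truncating c and s independently changes their ratio.
c, s = 13.0, 9.0
print(np.degrees(np.arctan2(s, c)))        # about 34.7 degrees
c_t = np.trunc(c / 8) * 8                  # coarse truncation: 8.0
s_t = np.trunc(s / 8) * 8                  # 8.0
print(np.degrees(np.arctan2(s_t, c_t)))    # 45.0 degrees: the phase has moved
```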

The Discrete Cosine Transform (DCT) is not a complex transform and has only one coefficient per frequency. This means that truncating a coefficient cannot result in a phase shift. Fig.1 shows that the DCT has an additional step before the transform proper. The block of pixels to be transformed is mirrored, meaning that the same series of pixel values runs in reverse ahead of the original values.

If a mirrored block of pixels is subjected to a Fourier transform, all of the sine components in the mirrored part will cancel the sine components in the original part, and all of the sine coefficients will be zero. Only the cosine coefficients will be non-zero. In practice the sine coefficients are simply not computed.
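This cancellation can be verified numerically. In the sketch below a block is mirrored as in Fig.1; because the mirror axis falls between two pixels, a half-sample phase term must be compensated before the FFT shows the effect (the pixel values are arbitrary):

```python
import numpy as np

block = np.array([40., 44., 52., 70., 110., 160., 190., 200.])
N = len(block)

# Fig.1a: the same data in reverse order ahead of the original block.
mirrored = np.concatenate([block[::-1], block])

# The sequence is symmetric about the point between the two copies. After
# compensating for that half-sample offset, every sine (imaginary)
# component cancels, leaving only the cosine components.
k = np.arange(2 * N)
spectrum = np.fft.fft(mirrored) * np.exp(-1j * np.pi * k / (2 * N))
print(np.allclose(spectrum.imag, 0))   # True: only cosines remain
```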

The zero-frequency coefficient is left untouched, but higher-frequency coefficients, which will typically be smaller in magnitude, are companded. Then, according to the degree of loss that is demanded, the coefficients are truncated to fewer bits. Typically there is an output buffer memory that is read at the target bit rate. If the compression is insufficient, the memory will overflow; if it is too strong, the memory will underflow. The buffer is balanced by increasing or reducing the amount of truncation.
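A minimal sketch of that feedback loop, with thresholds and step sizes that are purely illustrative:

```python
def adjust_truncation(buffer_fill: float, truncation_bits: int) -> int:
    """Balance the output buffer by varying coefficient truncation.
    buffer_fill is the fraction of the buffer in use (0.0 to 1.0)."""
    if buffer_fill > 0.75 and truncation_bits < 8:
        return truncation_bits + 1   # nearing overflow: compress harder
    if buffer_fill < 0.25 and truncation_bits > 0:
        return truncation_bits - 1   # nearing underflow: compress less
    return truncation_bits
```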

The process is often referred to as noise shaping, but in practice truncation does not produce noise, as the error introduced is neither random nor decorrelated from the signal. If it were, artefacts caused by truncation, such as blocking, would not be visible. Equally, trying to measure the performance of a codec by measuring an assumed signal-to-noise ratio is likely to produce meaningless results.
