Compression: Part 10 - Determining Bitrate
Having earlier looked at the different ingredients of spatial coding, the opportunity now arises to pull them together into a system.
The two main approaches to moving image compression are inter-coding and intra-coding. Using motion compensation, inter-coding exploits the great redundancy that exists through time in most moving picture material. Motion prediction is never perfect and the predicted pictures are always compared with the originals to produce a residual, which has the spatial structure of a frame but contains pixel errors rather than true pixels.
It should be evident that if the residual is transmitted in full along with the prediction, the decoder could use the residual to cancel the prediction errors, so the result would be completely lossless. In that case all inter-coders would have the same picture quality, and the only basis for choosing one over another would be that it achieves that ideal quality with a lower bit rate. This could be done by improving the efficiency of the prediction and by improving the encoding of the prediction data, so that the amount of residual data would fall.
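The lossless case can be illustrated with a few lines of Python. This is only a sketch: the 3 x 3 pixel values and the function names `residual` and `reconstruct` are invented for illustration and do not come from any standard.

```python
def residual(original, predicted):
    """Pixel-by-pixel prediction error: original minus prediction."""
    return [[o - p for o, p in zip(orow, prow)]
            for orow, prow in zip(original, predicted)]


def reconstruct(predicted, resid):
    """Decoder side: adding the residual cancels the prediction error."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(predicted, resid)]


# Hypothetical pixel values, chosen only for the example.
original = [[52, 55, 61], [59, 79, 61], [62, 59, 55]]
predicted = [[50, 55, 60], [60, 80, 60], [62, 58, 57]]

resid = residual(original, predicted)
# If the residual is sent in full, the decoder output is bit-exact.
assert reconstruct(predicted, resid) == original
```

The better the prediction, the closer the residual values are to zero and the fewer bits they need.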
In most cases, the compression factor achieved by the use of inter-coding alone will not be enough and further steps will be needed to obtain greater compression. The usual approach is to use spatial coding on the residual data, treating it as if it were a picture.
Moving pictures vary considerably in their difficulty and the immediate result is that to maintain constant quality the bit rate required must vary. This is readily achieved in media such as video disks, but is not generally possible in broadcasting or communications where variable bit rate causes technical and administrative difficulties.
In that case the bit rate has to be fixed and the quality must rise and fall with picture difficulty. Somewhere, the ideal bit rate must be reduced. Should this be in the inter-coding or the intra-coding?
Considering inter-coding, if efforts were made to reduce the bit rate of the predicted pictures, this would make them less accurate and would simply increase the size of the residual, which would be counterproductive. In practice the inter-coding is left to its own devices to produce the best predictions possible, and any bit rate reduction needed to meet some external constraint is obtained by making the intra-coding lossy.
In summary, the inter-coding remains lossless, and it falls to the intra-coding to reduce the output bit rate by reducing the accuracy of intra-coded pictures and residual pictures.
It is important to realise that the last thing a compression standard does is to explain how to build an encoder. Instead, the standard specifies a common language that all compatible decoders can understand. This means that it is not possible to establish how an encoder works by examining its output. In turn that means encoder designers can expend a lot of effort on a design without it being copied as soon as it comes into use.
It follows that any example of an encoder given here must be speculative. Fig.1 shows a representative spatial coder.
The input to the coder will be either I-pictures or residuals. The picture is broken into coding areas called macroblocks. These interface the chroma subsampling used in component television pictures with the needs of the later transform coding. In the 4:2:0 format, the color difference samples are subsampled by a factor of two both horizontally and vertically. In a block of 16 x 16 pixels there will be two blocks of 8 x 8 color difference values, one for R-Y and one for B-Y.
A 16 x 16 macroblock can be represented by four 8 x 8 blocks of luma values and two blocks of color difference values, making 6 blocks that all have the same size and are compatible with the transform coding. MPEG-2 also has a structure called a slice, which is a horizontal row of macroblocks associated together for coding purposes.
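A minimal sketch of that arrangement follows. The function names `split_luma` and `macroblock_420` and the block ordering are assumptions made for the example, not taken from the MPEG-2 text.

```python
def split_luma(luma):
    """Split a 16 x 16 luma array into four 8 x 8 blocks.

    Assumed order: top-left, top-right, bottom-left, bottom-right.
    """
    blocks = []
    for by in (0, 8):
        for bx in (0, 8):
            blocks.append([row[bx:bx + 8] for row in luma[by:by + 8]])
    return blocks


def macroblock_420(luma, r_y, b_y):
    """Gather the six equal-sized 8 x 8 blocks of a 4:2:0 macroblock."""
    return split_luma(luma) + [r_y, b_y]


# Hypothetical sample data: a luma ramp and flat color difference blocks.
luma = [[y * 16 + x for x in range(16)] for y in range(16)]
r_y = [[0] * 8 for _ in range(8)]
b_y = [[0] * 8 for _ in range(8)]

blocks = macroblock_420(luma, r_y, b_y)
print(len(blocks), "blocks of", len(blocks[0]), "x", len(blocks[0][0]))
```

Every block handed to the transform then has the same 8 x 8 size, whether it carries luma or color difference data.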
Fig.1 shows the transform coding of MPEG-2, which uses the DCT. Later codecs can use other transforms such as Haar and wavelets. The goal of any transform is to convert the input data into a form in which any redundancy can better be identified.
Real pictures will produce sets of DCT coefficients that are sparse, meaning that many of the coefficients will have small or zero magnitude. Small coefficients can be set to zero without doing much harm and zero valued coefficients need not be sent. In real pictures, most of the energy is at low frequencies. Motion smear filters out high frequencies most of the time. The result is that the magnitude of coefficients tends to increase towards the top left corner of the coefficient block. The opposite corner may have no coefficients of any consequence.
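The concentration of energy at low frequencies can be seen with a direct, unoptimised 8 x 8 DCT. The sketch below uses the textbook orthonormal DCT-II written out as nested loops; real coders use fast algorithms, and the test blocks are invented for the example.

```python
import math

N = 8  # block size used by the MPEG-2 DCT


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (slow reference form)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical frequency index
        for u in range(N):      # horizontal frequency index
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[v][u] = c(u) * c(v) * s
    return out


# A flat block has all its energy in the zero frequency (top left) term.
flat = [[100] * N for _ in range(N)]
flat_coeffs = dct_2d(flat)

# A horizontal ramp has no vertical detail, so only the top row is non-zero.
ramp = [[10 * x for x in range(N)] for _ in range(N)]
ramp_coeffs = dct_2d(ramp)
```

With this scaling the flat block of value 100 gives a zero frequency coefficient close to 800, and every other coefficient is zero apart from rounding error.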
The zig-zag scan takes advantage of that distribution by re-ordering coefficients such that the largest values are first and those having zero value come last. This allows the use of variable length messages to describe the coefficient block, since the message can stop as soon as all remaining coefficients in the scan are zero.
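The scan order can be generated by walking the anti-diagonals of the block in alternating directions. The sketch below produces the conventional zig-zag order; the names `zigzag_order` and `scan_block` are illustrative, and the coefficient values are made up.

```python
def zigzag_order(n=8):
    """Return (row, column) pairs in conventional zig-zag order."""
    order = []
    for d in range(2 * n - 1):                  # d = row + column
        cells = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:
            cells.reverse()                     # even diagonals run upwards
        order.extend(cells)
    return order


def scan_block(block):
    """Read a square coefficient block out in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Hypothetical block with a few coefficients near the top left corner.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 40, 7, -5, 2

scanned = scan_block(block)
# Everything after the last non-zero value need not be sent.
last = max(i for i, value in enumerate(scanned) if value != 0)
print(scanned[:last + 1])
```

For this block only the first five values of the scan need to be described; the remaining 59 are zero.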
It is a characteristic of the HVS that the visibility of noise is far from uniform. At low spatial frequencies, noise is highly visible and is related to wide area flicker. As spatial frequency rises, visibility of noise falls, not least because it is masked by picture detail.
Compressors take advantage of that, making coefficients more approximate by reducing their wordlength. The process is known as noise shaping. Done well, the noise floor of the picture remains below the sensitivity of the HVS. In the process of matching the coder’s output bit rate to the channel it is feeding, the coder must drive the noise floor up if the bit rate needs to be reduced.
The zero frequency coefficient is not noise shaped, but is carried intact. The zero frequency coefficient represents the average brightness of the whole block and nearby blocks may have a similar value, so differential coding can be used along a slice.
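The differential coding of the zero frequency coefficient along a slice can be sketched as follows. The starting predictor of zero is an assumption made to keep the example short, and the coefficient values are invented.

```python
def dc_differences(dc_values, predictor=0):
    """Send each zero frequency coefficient as a difference from the last."""
    diffs = []
    for dc in dc_values:
        diffs.append(dc - predictor)
        predictor = dc
    return diffs


def dc_restore(diffs, predictor=0):
    """Decoder side: accumulate the differences to recover the values."""
    values = []
    for d in diffs:
        predictor += d
        values.append(predictor)
    return values


# Hypothetical zero frequency coefficients of neighbouring blocks in a slice.
slice_dc = [800, 804, 801, 801, 790]
diffs = dc_differences(slice_dc)
assert dc_restore(diffs) == slice_dc   # nothing is lost
```

After the first block the differences are small numbers, which suit the variable length coding that follows.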
Noise shaping consists of truncating coefficient data by removing low order bits. The truncation must vary according to the bit rate target. The truncation process is simplified if it can be the same for all frequencies. This is achieved by weighting the coefficients before truncation.
Fig.2 shows that coefficients are weighted, or divided by a constant before truncation. An inverse weighting is required in the decoder. If the coefficient is divided by one, the effect of the truncation is minimal, whereas if it is divided by ten, the effect of truncation will be ten times greater. The shape of the noise is determined by the weighting and the level of the noise is determined by the truncation.
Fig.2 - A noise shaping system uses weighting at the encoder and a common quantizing step. The inverse weighting at the decoder changes the noise amplitude as well as restoring coefficients to their correct magnitudes.
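The arithmetic of weighting followed by truncation can be tried with a small sketch. The function names, the coefficient value of 57 and the weights are assumptions for the example; practical coders use a table of weights, one per coefficient position.

```python
import math


def quantize(coeff, weight, step):
    """Encoder: divide by the weight, then truncate to a common step size."""
    return math.trunc(coeff / (weight * step))


def dequantize(level, weight, step):
    """Decoder: inverse weighting restores the magnitude, not the lost bits."""
    return level * weight * step


coeff = 57  # hypothetical coefficient value
for weight in (1, 10):
    level = quantize(coeff, weight, step=1)
    restored = dequantize(level, weight, step=1)
    print(f"weight {weight}: sent {level}, restored {restored}, "
          f"error {coeff - restored}")
```

With a weight of one the coefficient survives intact, whereas with a weight of ten it returns as 50, an error of 7. The error can never exceed the product of the weight and the step size, which is how the weighting sets the shape of the noise and the step size sets its level.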
In the case of a P picture, the pixel values are obtained by prediction from, for example, an I picture to which is added a residual. The I picture will have had noise shaping and if the residual also had noise shaping, the result would be a build up of noise. This is avoided by restricting the use of noise shaping to intra-coded pictures. Residuals are treated the same as pictures, except that they are flat weighted.
The truncated coefficients are then subject to variable length coding which uses short codes for the most common values and longer codes for uncommon values. One such system is the Huffman code in which the short codes are never a prefix of a long code. The decoder tests an increasing number of bits until it finds a code that is in the code book. The next bit must be the first bit of the next code. In that way the bit stream becomes self-parsing. The coding stops when there are no more non-zero coefficients.
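The self-parsing property can be demonstrated with a toy code book. The five codes below are invented for the example and are not taken from any standard; the only property that matters is that no code is a prefix of another.

```python
# Hypothetical prefix-free code book: short codes for the common values.
CODE_BOOK = {"10": 1, "11": -1, "010": 2, "011": -2, "00": "EOB"}
ENCODE_BOOK = {symbol: code for code, symbol in CODE_BOOK.items()}


def encode(symbols):
    """Concatenate the codes and finish with the end-of-block code."""
    return "".join(ENCODE_BOOK[s] for s in symbols) + ENCODE_BOOK["EOB"]


def decode(bits):
    """Test a growing run of bits until it matches a code in the book."""
    symbols = []
    candidate = ""
    for bit in bits:
        candidate += bit
        if candidate in CODE_BOOK:
            symbol = CODE_BOOK[candidate]
            if symbol == "EOB":
                break              # no more non-zero coefficients
            symbols.append(symbol)
            candidate = ""         # next bit starts the next code
    return symbols


bits = encode([1, 2, -1])
print(bits, "->", decode(bits))
```

No separators are needed between codes: the decoder always knows where one code ends because no shorter code could have matched first.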
Prior to that there will be a phase in which the coefficients become sparse, and along the zig-zag scan there will be runs of zero-valued coefficients before another non-zero coefficient is found. The number of zero coefficients in the run is encoded together with the next non-zero coefficient, so the decoder knows where to place the decoded coefficient in the DCT block. Because the codes are of variable length, a single bit error can cause the decoder to lose its place in the bit stream, so compressed data need good error protection.
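The pairing of a zero run with the following non-zero coefficient can be sketched as below. The function names and the scan values are illustrative assumptions; in a real coder each pair would then be given a variable length code.

```python
def run_level_encode(scan):
    """Turn a zig-zag scan into (zero run, non-zero value) pairs."""
    pairs = []
    run = 0
    for value in scan:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs  # trailing zeros are implied by the end of the block


def run_level_decode(pairs, length=64):
    """Rebuild the scan, padding with zeros after the last pair."""
    scan = []
    for run, value in pairs:
        scan.extend([0] * run)
        scan.append(value)
    scan.extend([0] * (length - len(scan)))
    return scan


# Hypothetical scan of 64 coefficients that becomes sparse, then stops.
scan = [12, 5, 0, 0, -3, 0, 1] + [0] * 57
pairs = run_level_encode(scan)
assert run_level_decode(pairs) == scan
```

Here 64 coefficients are described by just four pairs.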
The output stage of a coder consists of a memory buffer which is read at the target bit rate. It is written by the output of the spatial coder. Clearly if the spatial coder bit rate exceeds the output bit rate for long enough, the memory will overflow and data will be lost. The amount of available memory can be obtained by comparing the read and write addresses and, as this falls, the quantizing stage in the spatial coder is made more aggressive. If the available memory is increasing, the quantizer backs off.
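The feedback from buffer occupancy to the quantizer can be simulated crudely. Everything in the sketch below is an assumption for illustration: the linear mapping from fullness to quantizer scale, the idea that the bits produced are inversely proportional to that scale, and all the numbers. Real rate control is considerably more elaborate.

```python
def simulate_buffer(picture_bits, bits_out, capacity, max_scale=31):
    """Return (quantizer scale, buffer fullness) after each picture.

    picture_bits: bits each picture would need with the finest quantizing.
    bits_out:     bits read from the buffer per picture period.
    """
    fullness = 0
    history = []
    for demand in picture_bits:
        # The fuller the buffer, the more aggressive the quantizer.
        scale = 1 + (max_scale - 1) * fullness // capacity
        produced = demand // scale            # assumed coder behaviour
        fullness = max(0, fullness + produced - bits_out)
        if fullness > capacity:
            raise OverflowError("buffer overflow: data would be lost")
        history.append((scale, fullness))
    return history


# Two easy pictures, three difficult ones, then two easy ones.
history = simulate_buffer([100, 100, 600, 600, 600, 100, 100],
                          bits_out=100, capacity=1000)
for scale, fullness in history:
    print(f"scale {scale:2d}  fullness {fullness}")
```

In this run the first difficult picture half fills the buffer, the quantizer scale jumps from 1 to 16 for the next picture, and it then backs off as the buffer drains.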
To an extent, the buffer memory absorbs temporary increases in bit rate resulting from difficult pictures, but the memory capacity must be duplicated in the input stages of the decoder so that it can replicate the bit rate variations of the encoder. This raises cost, so the size of the buffering memory is limited.