Is Gamma Still Needed?: Part 9 - Processing In Floating Point

Floating-point notation and gamma are both techniques that trade precision for dynamic range. However, they differ fundamentally. Gamma is a non-linear function whereas floating point remains linear. Any mathematical manipulation carried out on floating-point encoded data will be correct, whereas manipulations of gamma-encoded luma cannot be. Gamma was intended to compensate for the non-linearity of the cathode ray tube, whereas floating-point encoding was designed from the outset for mathematical manipulation.

As any encoding step can affect quality, it is important to consider what floating-point notation does to the encoded information. If some physical quantity, such as distance or size, expressed as a high-resolution fixed-point binary number, is converted to a floating point number there may be a further quantizing step caused by the finite word length of the mantissa.

Fig.1a) shows part of the transfer function of a sixteen-bit fixed-point parameter, having about 65,000 possible values. Fig.1b) shows the same parameter expressed with a twelve-bit mantissa. The quantizing steps are four times as big. If the parameter is doubled in size, the exponent increases by one, which has the effect of doubling the size of the quantizing steps.

Note that in computing terminology the quantizing process is called rounding. It is not quite the same thing, because computing offers several different ways of rounding numbers, and only the round-to-nearest option simulates the quantizing of an ADC.

Fig.1 The transfer function of a 16-bit quantizer is shown at a). If a sixteen bit number is expressed to 12-bit accuracy, the steps b) become four times larger. In floating point this can only happen in the presence of a signal, so the approximation is masked.

Essentially a floating-point number displays a quantizing error in comparison with the original fixed-point value, which changes according to the exponent. Whether that quantizing error matters or not depends upon the length of the mantissa, the accuracy of the original parameter and the sensitivity of the destination.
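To make the effect concrete, the minimal Python sketch below rounds an integer to a limited number of significant bits, simulating storage in a short mantissa; the twelve-bit mantissa length is only an illustrative assumption. Doubling the input value increments the effective exponent, which doubles both the quantizing step and the worst-case error.

```python
# Minimal sketch: round an unsigned integer to a limited number of
# significant bits, as storing it in a mantissa of that length would.
# The quantizing step is one LSB of the mantissa, so it doubles every
# time the effective exponent increases by one.

def quantize_to_mantissa(value: int, mantissa_bits: int) -> int:
    """Round-to-nearest, simulating the quantizing of an ADC."""
    if value == 0:
        return 0
    shift = max(value.bit_length() - mantissa_bits, 0)  # the exponent, in effect
    step = 1 << shift                                    # quantizing step size
    return ((value + step // 2) // step) * step

# Doubling the value doubles the quantizing step (and the possible error):
for v in (12345, 24690, 49380):
    q = quantize_to_mantissa(v, 12)
    print(f"input {v:5d}  quantized {q:5d}  step {1 << max(v.bit_length() - 12, 0)}")
```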

In real audiovisual signals, from microphones and cameras, there is always a noise floor associated with the wanted signal. In audio, the noise floor will only be audible if there is no sound. If there is an audible signal, the noise will be masked.

In the case of video, the human visual system perceives noise that exceeds a fixed proportion of the brightness. Once again, if the noise level increases due to a larger quantizing step, this can only be because the brightness has also increased.

Unlike a gamma-encoded signal, a floating-point number used in imaging is at all times proportional to the luminance. There are two important consequences. Firstly, the linearity means that the increase in bandwidth caused by the non-linearity of gamma is not present in floating-point coding. Secondly, the artifacts caused by processing gamma-encoded signals are absent.

For example, a color difference signal expressed in floating point will be just that: a signal containing no luminance, whereas a color difference signal calculated from gamma corrected signals still contains luma. Color bars computed in linear light using floating-point notation and viewed on a linear display will not show a dark bar at the green-magenta transition. This will be true whether RGB or color difference signals are used.
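As an illustration of the dark bar, the sketch below mixes the green and magenta bars both ways. The Rec. 709 luminance weights and the display exponent of 2.4 are assumed values for the example only, not taken from the article.

```python
# Worked example (assumed Rec. 709 luminance weights and an illustrative
# display exponent of 2.4): mixing gamma-encoded green and magenta bars
# produces a dark band at the transition; mixing in linear light does not.

def luminance(rgb):
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b   # Rec. 709 weights (assumed)

GAMMA = 2.4                                        # illustrative display exponent
green, magenta = (0.0, 1.0, 0.0), (1.0, 0.0, 1.0)  # identical in either domain
mid = tuple((a + b) / 2 for a, b in zip(green, magenta))   # (0.5, 0.5, 0.5)

# If the mix was done on gamma-encoded signals, the display raises the
# result to the power of GAMMA before it becomes light:
lum_gamma_mix = luminance(tuple(c ** GAMMA for c in mid))  # about 0.19: a dark bar
# If the mix was done in linear light, the result is displayed as it is:
lum_linear_mix = luminance(mid)                            # 0.5

print(f"green {luminance(green):.3f}  magenta {luminance(magenta):.3f}")
print(f"gamma-domain mix {lum_gamma_mix:.3f}  linear-light mix {lum_linear_mix:.3f}")
```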

Fig.2 A five-bit exponent has 32 combinations. The ends of the scale denote zero and infinity, leaving fourteen negative values, zero and fifteen positive values for the exponent.

Another way of looking at the issue is that whilst it is possible to pretend that gamma encoded signals are linear and to live with the sub-optimal results, such a pretense is not necessary with floating point.

A binary number is represented in floating point by shifting it until there is a leading one immediately to the left of the radix point. The process is called normalizing. This leading one does not need to be stored and adds a bit of precision. The exponent records the number of shifts necessary. Numbers below one require negative exponents. The exponent is typically encoded with an offset of half its maximum value, so that a zeroth power exponent is in the middle of the range.

Fig.2 shows that with a 5-bit exponent there are 32 combinations. The exponent is represented with an offset of 15, which means that 15 must be subtracted from the encoded value to obtain the exponent. An encoded value of 15 represents 2 to the zeroth power, which is 1.

The implicit bit approach of floating-point coding does, however, mean that extra steps must be taken to handle an input having a value of zero or infinity, as clearly neither can be encoded in the usual way. The solution is to reserve two of the values of the exponent.

In the five-bit example of Fig.2, there are 32 possible combinations. The largest exponent code is 31, which represents infinity or an invalid result and the smallest is zero, which indicates that the number cannot be normalized because it has a value of zero (not to be confused with a power of zero). There are fifteen positive powers, a power of zero, fourteen negative powers, a value of zero and a value of infinity, using up the 32 combinations.
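The sketch below shows one way to realize the scheme just described: the value is normalized, the implicit leading one is dropped, and the shift count is stored with an offset of 15, with exponent codes 0 and 31 reserved. The ten-bit mantissa length is an illustrative assumption.

```python
# Minimal sketch of the encoding described above. The value is normalized so
# the leading one sits just left of the radix point, that one is not stored,
# and the shift count goes into a five-bit exponent with an offset of 15.
# Exponent codes 0 and 31 are reserved for zero and infinity respectively.
# (Mantissa overflow on rounding is ignored in this sketch.)

import math

BIAS = 15           # offset for the five-bit exponent
MANTISSA_BITS = 10  # illustrative mantissa length (an assumption)

def encode(x: float):
    """Return (exponent code, stored mantissa) for a non-negative value x."""
    if x == 0.0:
        return 0, 0                     # reserved: cannot be normalized
    if math.isinf(x):
        return 31, 0                    # reserved: infinity / invalid result
    frac, exp = math.frexp(x)           # x = frac * 2**exp, frac in [0.5, 1)
    frac, exp = frac * 2.0, exp - 1     # re-express as 1.xxx * 2**exp
    mantissa = round((frac - 1.0) * (1 << MANTISSA_BITS))  # drop implicit one
    return exp + BIAS, mantissa

def decode(exp_code: int, mantissa: int) -> float:
    if exp_code == 0:
        return 0.0
    if exp_code == 31:
        return math.inf
    return (1.0 + mantissa / (1 << MANTISSA_BITS)) * 2.0 ** (exp_code - BIAS)

for x in (0.0, 0.75, 1.0, 6.5, math.inf):
    e, m = encode(x)
    print(f"x = {x}: exponent code {e}, mantissa {m}, decoded {decode(e, m)}")
```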

Fig.3 shows how two floating-point numbers can be added. This is only possible if the two numbers have the same exponent, and if this is not the case, one of the numbers will have to be shifted. It must be the one that needs to be shifted right, as a left shift would by definition lose the most significant bit(s), causing a gross error. After the addition, which must take into account the implicit ones to the left of the radix point, the sum must be re-normalized.

When the exponents are equal, the presence of the two implicit ones means that the addition will always overflow, requiring the sum to be shifted right and the exponent to be incremented. If the case of adding a number to itself is considered, the result is the same as incrementing the exponent, which doubles the number.

In imaging, the most accurate result is obtained if the process simulates the addition of two analog signals whose sum is then quantized to the available accuracy. One way of doing that would be to perform the addition using logic having sufficient word length to accommodate both mantissae after shifting them to make the exponents match. The extended word length sum could then be rounded/quantized in the re-normalization process.
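A minimal sketch of that addition, working on positive numbers held as (exponent, mantissa) pairs with an implicit leading one, might look as follows; the ten-bit mantissa is again only an illustrative assumption, and for clarity the shifted-out bits are simply truncated rather than handled with the guard, round and sticky bits described below.

```python
# Minimal sketch of floating-point addition: align the exponents by shifting
# the smaller operand right, add the mantissae with their implicit ones
# re-attached, then re-normalize. Bits shifted out are truncated here; the
# guard, round and sticky bits described below refine this.

MANTISSA_BITS = 10  # illustrative length (an assumption)

def fp_add(a, b):
    (ea, ma), (eb, mb) = a, b
    ia = (1 << MANTISSA_BITS) | ma      # re-attach the implicit leading ones
    ib = (1 << MANTISSA_BITS) | mb
    if ea < eb:                          # shift the smaller operand right
        ia >>= (eb - ea)
        ea = eb
    elif eb < ea:
        ib >>= (ea - eb)
    total = ia + ib
    while total >= (1 << (MANTISSA_BITS + 1)):   # re-normalize
        total >>= 1
        ea += 1
    return ea, total & ((1 << MANTISSA_BITS) - 1)

# Adding a number to itself simply increments the exponent, doubling it:
x = (3, 512)            # 1.5 * 2**3 = 12.0
print(fp_add(x, x))     # (4, 512), i.e. 1.5 * 2**4 = 24.0
```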

Fig.3 Addition first requires the exponents to be the same. Then the mantissae are added and the result is re-normalized.

In practice this is not necessary, and with care the word length of the logic need only be extended by three bits in order to obtain the same result. Fig.4 shows that when a mantissa is shifted right, bits fall off the end. These are designated the guard, round and sticky bits respectively. The value of the mantissa shifts right into these bits. The guard and round bits are simply an extension of the shift register, whereas once any shift causes the sticky bit to become one, it remains set to one even if a subsequent shift would otherwise have reset it.

That is where the term sticky comes from. If the bit becomes one, it sticks.

The addition includes the extra three bits and the normalizing process is based on their values. After normalization, these three bits disappear, so the goal is to determine whether their removal requires the LSB of the mantissa to stay the same (round down) or whether one should be added (round up).

As the LSB of the mantissa corresponds to one quantizing interval, the three extra bits divide the quantizing interval into eight parts, and anything above four parts out of eight rounds up. Four or fewer out of eight rounds down.
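A sketch of that rounding step is shown below; the bit patterns are arbitrary examples. Note that the default IEEE 754 mode additionally sends an exact half (four eighths) to the nearest even value, a refinement not described above.

```python
# Minimal sketch of the guard, round and sticky bits. When a mantissa is
# shifted right, the three bits record what fell off the end; they split one
# quantizing interval into eighths, and more than four eighths rounds the
# surviving LSB up. (IEEE 754's default mode also sends an exact half to the
# nearest even LSB, a detail omitted here.)

def shift_right_with_grs(mantissa: int, shift: int):
    """Shift right, returning the kept bits and the 3-bit guard/round/sticky value."""
    kept = mantissa >> shift
    lost = mantissa & ((1 << shift) - 1)          # everything that fell off
    if shift >= 3:
        grs = lost >> (shift - 3)                 # top three lost bits
        if lost & ((1 << (shift - 3)) - 1):
            grs |= 1                              # sticky: a lower bit was set
    else:
        grs = lost << (3 - shift)
    return kept, grs

def round_grs(kept: int, grs: int) -> int:
    return kept + 1 if grs > 4 else kept          # above half an LSB rounds up

kept, grs = shift_right_with_grs(0b1011_0110_1101, 4)
print(kept, bin(grs), round_grs(kept, grs))       # 182 0b111 183
```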

Subtraction is similar, in that the exponents must first be made the same. After subtraction the result is re-normalized using the three extra bits.

Multiplication of two floating-point numbers is simple; the mantissae are multiplied together, the exponents are added and the result is re-normalized. This ease of multiplication is especially important for video processing. Simple processes like vision mixing, all the way up to filtering and transforms, all depend on multiplication of pixel values.
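Using the same (exponent, mantissa) representation as the earlier addition sketch, multiplication might be sketched as follows; the mantissa length is still an illustrative assumption and the extra product bits are truncated rather than rounded.

```python
# Minimal sketch of floating-point multiplication: multiply the mantissae
# (implicit ones re-attached), add the exponents, then re-normalize and drop
# the extra product bits (truncation here; rounding in a real implementation).

MANTISSA_BITS = 10  # illustrative length (an assumption)

def fp_mul(a, b):
    (ea, ma), (eb, mb) = a, b
    ia = (1 << MANTISSA_BITS) | ma
    ib = (1 << MANTISSA_BITS) | mb
    product = ia * ib
    exp = ea + eb
    while product >= (1 << (2 * MANTISSA_BITS + 1)):   # re-normalize
        product >>= 1
        exp += 1
    mantissa = (product >> MANTISSA_BITS) & ((1 << MANTISSA_BITS) - 1)
    return exp, mantissa

a = (1, 512)            # 1.5  * 2**1 = 3.0
b = (2, 256)            # 1.25 * 2**2 = 5.0
print(fp_mul(a, b))     # (3, 896), i.e. 1.875 * 2**3 = 15.0
```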

Fig.4. When a mantissa is shifted right, three bits shifting from the end are considered. These are the guard, round and sticky bits that are used to optimize the normalization.

Whilst standards exist for interchange of floating-point data, for many cases the word length is too long and a shorter representation, called a minifloat, can be used. A minifloat is a floating-point encoding using fewer than 16 bits. A ten-bit minifloat has some interesting characteristics, not least that it needs no more storage capacity or bandwidth than the ten-bit binary traditionally used for gamma-corrected video.

When used for luminance, there is no requirement for a sign bit and the exponent can be unipolar. If the luminance value of zero is forbidden, as is done in 601, the exponent needs no reserved value to handle it. The ten-bit word can be broken up into exponent and mantissa. If, for example, a seven-bit mantissa and a three-bit exponent are used, the mantissa is actually eight bits because of the implicit bit, and the exponent has a range of eight values. With an exponent of zero, the encoded value is equal to the mantissa and has values from 1 to 255. Only above a value of 100 does the step size fall within the 1 percent required for perceptual accuracy.

However, with a three-bit exponent the mantissa can be shifted up to seven times, allowing a maximum code of 32K, corresponding to a dynamic range of around 8 stops.
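The arithmetic behind those figures can be checked with a few lines. The layout below simply follows the description above and is not a standardized format; taking 100 as the reference for the dynamic range calculation follows the 1 percent point mentioned in the previous paragraph.

```python
# Quick check of the ten-bit minifloat arithmetic described above: a
# seven-bit stored mantissa (eight bits with the implicit one) and a
# three-bit exponent giving eight shift values, 0 to 7.

import math

MANTISSA_MAX = 255        # eight-bit mantissa value at an exponent of zero
MAX_SHIFTS = 7            # three-bit exponent allows up to seven shifts

max_code = MANTISSA_MAX << MAX_SHIFTS            # 255 * 2**7 = 32640, about 32K
print(f"maximum code {max_code}")

# With an exponent of zero the step is 1, so the step only falls within the
# roughly 1 percent the eye can resolve once the value exceeds 100:
for value in (50, 100, 200):
    print(f"value {value}: a step of 1 is {100 / value:.1f} percent of the value")

print(f"about {math.log2(max_code / 100):.1f} stops above the 1 percent point")
```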
