Floating Point Numbers

The general name for the Decimal Point is Radix Point
Computers store the sign, exponent and mantissa of a floating-point number
Mantissa is also called Fraction

Biasing

It is the process of offsetting numbers in a series by a fixed value (offset)

Assume we have 4 bits to store the exponent of a floating-point number
Using 4 bits we can represent 16 unique values
Exponents can be positive and negative so the range is equally divided
Now we can represent numbers ranging from -8 to 7

Next we find the largest magnitude in the series (in our case 8); this value is the bias
The bias is added to all the numbers in the series
This gives us a new series with numbers ranging from 0 to 15
Using the new series, negative exponents can also be stored as non-negative values
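
The offsetting step above can be sketched in a few lines of Python. This is only an illustration for the 4-bit exponent in these notes; the names `bias_exponent` and `unbias_exponent` are made up for the example.

```python
EXPONENT_BITS = 4
BIAS = 2 ** (EXPONENT_BITS - 1)  # 8, the offset used in these notes


def bias_exponent(exponent):
    """Return the stored bit pattern for a signed exponent in -8..7."""
    if not -BIAS <= exponent <= BIAS - 1:
        raise ValueError("exponent does not fit in the 4-bit field")
    return format(exponent + BIAS, "0{}b".format(EXPONENT_BITS))


def unbias_exponent(bits):
    """Recover the signed exponent from a stored bit pattern."""
    return int(bits, 2) - BIAS


for e in (-8, -1, 0, 2, 7):
    print(e, "->", bias_exponent(e))
```

Every stored pattern is a plain unsigned number, so exponents can be compared without any sign handling.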

Assume we have 10 bits to store a floating-point number
1 bit for sign, 4 bits for exponent (bias 8) and 5 bits for mantissa
$(0.0101)_{2} = 0.101 \times 2^{-1} = 0011110100$ (0.M form, stored exponent $-1 + 8 = 7$)
$(101.101)_{2} = 1.01101 \times 2^{2} = 0101001101$ (1.M form, stored exponent $2 + 8 = 10$)
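
To read the two bit patterns above, it helps to split them back into their fields. A minimal sketch, assuming the 1 + 4 + 5 layout and the bias of 8 from these notes; `split_fields` is a hypothetical helper name.

```python
BIAS = 8  # offset for the 4-bit exponent field used in these notes


def split_fields(bits):
    """Split a 10-bit pattern into (sign, stored exponent, mantissa bits)."""
    if len(bits) != 10 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 10 binary digits")
    sign = int(bits[0])
    stored_exponent = int(bits[1:5], 2)
    mantissa = bits[5:]
    return sign, stored_exponent, mantissa


for pattern in ("0011110100", "0101001101"):
    sign, stored, mantissa = split_fields(pattern)
    print(pattern, "sign:", sign, "exponent:", stored - BIAS, "mantissa:", mantissa)
```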

Normalization

The process of representing a floating-point number in scientific notation
$(101.011)_{2} = 0.101011 \times 2^{3}$
$(101.011)_{2} = 101011 \times 2^{- 3}$
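
A quick check that both forms above denote the same value, $(5.375)_{10}$. The sketch also uses Python's `math.frexp`, which returns a fraction in the range 0.5 to 1 together with a power of two, i.e. the first of the two forms.

```python
import math

value = 0b101011 / 2 ** 3  # (101.011)_2

# The same value written with the radix point in different places.
forms = [
    (0b101011 / 2 ** 6, 3),  # 0.101011 x 2^3
    (0b101011, -3),          # 101011   x 2^-3
]
values = [m * 2.0 ** e for m, e in forms]

# math.frexp returns (m, e) with 0.5 <= m < 1 and value == m * 2**e.
fraction, exponent = math.frexp(value)
print(value, values, fraction, exponent)
```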

Explicit Normalization

Move radix point to the LHS of the most significant 1 in the bit sequence
$(101.011)_{2} = 0.101011 \times 2^{3}$
Formula: $(-1)^{S} \times 0.M \times 2^{E - \text{Bias}}$

$(5.625)_{10} = (101.101)_{2} = 0.101101 \times 2^{3} = 0101110110$
The last 1 is dropped since the machine does not have space to store it
Converting to Decimal: $(101.1)_{2} = (5.5)_{10}$
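
The worked example above can be reproduced with a small encoder and decoder. This is a sketch under the assumptions of these notes (10 bits, bias 8, 5 mantissa bits, extra bits truncated rather than rounded); the function names are invented for the illustration.

```python
from fractions import Fraction

BIAS = 8
MANTISSA_BITS = 5


def encode_explicit(value):
    """Encode a non-zero value as sign | exponent | mantissa in 0.M form."""
    x = Fraction(value)
    if x == 0:
        raise ValueError("zero has no normalized form")
    sign = "1" if x < 0 else "0"
    x = abs(x)
    exponent = 0
    # Scale until 0.5 <= x < 1, i.e. x = (0.1...)_2
    while x >= 1:
        x /= 2
        exponent += 1
    while x < Fraction(1, 2):
        x *= 2
        exponent -= 1
    stored = exponent + BIAS
    if not 0 <= stored <= 15:
        raise ValueError("exponent does not fit in 4 bits")
    mantissa = int(x * 2 ** MANTISSA_BITS)  # bits that do not fit are dropped
    return sign + format(stored, "04b") + format(mantissa, "05b")


def decode_explicit(bits):
    """Evaluate (-1)^S x 0.M x 2^(E - Bias) for a 10-bit pattern."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:5], 2) - BIAS
    mantissa = Fraction(int(bits[5:], 2), 2 ** MANTISSA_BITS)
    return float(sign * mantissa * Fraction(2) ** exponent)


pattern = encode_explicit(5.625)
print(pattern, decode_explicit(pattern))
```

Encoding 5.625 and decoding the result gives 5.5, which shows the precision lost when the sixth mantissa bit is dropped.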

Implicit Normalization

Move radix point to the RHS of the most significant 1 in the bit sequence
$(101.011)_{2} = 1.01011 \times 2^{2}$
Formula: $(-1)^{S} \times 1.M \times 2^{E - \text{Bias}}$

Implicit normalization allows values to be stored with higher precision, since the leading 1 is not stored and this frees one mantissa bit
$(5.625)_{10} = (101.101)_{2} = 1.01101 \times 2^{2} = 0101001101$
Converting to Decimal: $(101.101)_{2} = (5.625)_{10}$
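
The same 10-bit layout with the 1.M form can be sketched as follows. As before, the bias of 8, the 5 mantissa bits and the truncation of extra bits are assumptions taken from these notes, and the function names are illustrative only.

```python
from fractions import Fraction

BIAS = 8
MANTISSA_BITS = 5


def encode_implicit(value):
    """Encode a non-zero value as sign | exponent | mantissa in 1.M form."""
    x = Fraction(value)
    if x == 0:
        raise ValueError("zero has no normalized form")
    sign = "1" if x < 0 else "0"
    x = abs(x)
    exponent = 0
    # Scale until 1 <= x < 2, i.e. x = (1.xxx)_2
    while x >= 2:
        x /= 2
        exponent += 1
    while x < 1:
        x *= 2
        exponent -= 1
    stored = exponent + BIAS
    if not 0 <= stored <= 15:
        raise ValueError("exponent does not fit in 4 bits")
    # Only the bits after the radix point are stored; the leading 1 is implied.
    mantissa = int((x - 1) * 2 ** MANTISSA_BITS)
    return sign + format(stored, "04b") + format(mantissa, "05b")


def decode_implicit(bits):
    """Evaluate (-1)^S x 1.M x 2^(E - Bias) for a 10-bit pattern."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:5], 2) - BIAS
    mantissa = 1 + Fraction(int(bits[5:], 2), 2 ** MANTISSA_BITS)
    return float(sign * mantissa * Fraction(2) ** exponent)


pattern = encode_implicit(5.625)
print(pattern, decode_implicit(pattern))
```

Here 5.625 survives the round trip exactly, because the implied leading 1 leaves room for all five remaining bits.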

IEEE 754 Standard

Name	Common Name	Significand bits	Exponent bits	Exponent Bias
binary16	Half Precision	11	5	15
binary32	Single Precision	24	8	127
binary64	Double Precision	53	11	1023
binary128	Quadruple Precision	113	15	16383
binary256	Octuple Precision	237	19	262143

Significand bits (the third column): the stored mantissa bits plus the implicit leading 1; the sign bit is stored separately
Most programming languages implement Single and Double Precision Floats
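
The fields of a single precision float can be inspected with Python's standard `struct` module. A minimal sketch; `float32_fields` is a hypothetical helper name, and the layout assumed is 1 sign bit, 8 exponent bits and 23 stored mantissa bits.

```python
import struct


def float32_fields(value):
    """Return (sign, stored exponent, stored mantissa) of a binary32 value."""
    # Pack as a big-endian 32-bit float, then reread the bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(5.625)
# 5.625 = 1.01101 x 2^2, so the stored exponent is 2 + 127
print(sign, exponent, format(mantissa, "023b"))
```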

When 5 bits are reserved for the exponent we have 32 unique combinations (0-31)
Subtracting the bias of 15 maps these stored values to exponents ranging from -15 to 16
In the IEEE 754 standard the exponent patterns all 0s and all 1s are reserved
So normal numbers use the stored values 1 to 30 and the range of the exponent becomes -14 to 15
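
The usable exponent range for any field width can be checked with a short calculation. A sketch, assuming the IEEE 754 bias of $2^{k-1} - 1$ for $k$ exponent bits; `exponent_range` is an illustrative name.

```python
def exponent_range(exponent_bits):
    """Return (bias, smallest exponent, largest exponent) for normal numbers."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_stored = 1                        # all 0s is reserved
    max_stored = 2 ** exponent_bits - 2   # all 1s is reserved
    return bias, min_stored - bias, max_stored - bias


for k in (5, 8, 11):
    print(k, exponent_range(k))
```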

Exponent	Mantissa	Represents	Value
All 0s	All 0s	$\pm 0$	$(-1)^{S} \times 0$
All 1s	All 0s	$\pm \infty$	$(-1)^{S} \times \infty$
$1 \leq E \leq 2^{k} - 2$	Any value	Implicit Normal Form	$(-1)^{S} \times 1.M \times 2^{E - \text{Bias}}$
All 0s	$M \neq 0$	Fractional Form (subnormal)	$(-1)^{S} \times 0.M \times 2^{-(\text{Bias} - 1)}$
All 1s	$M \neq 0$	NaN	Used for exception handling

Here $E$ is the stored (biased) exponent and $k$ is the number of exponent bits
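
The cases in the table can be turned into a small decoder for half precision (binary16) bit patterns. This is a sketch that assumes 1 sign bit, 5 exponent bits, 10 mantissa bits and a bias of 15; `decode_half` is a made-up name.

```python
import math

BIAS = 15
MANTISSA_BITS = 10


def decode_half(bits):
    """Decode a 16-bit integer pattern following the table's five cases."""
    sign = -1.0 if bits >> 15 else 1.0
    exponent = (bits >> MANTISSA_BITS) & 0x1F
    mantissa = bits & 0x3FF
    fraction = mantissa / 2 ** MANTISSA_BITS
    if exponent == 0x1F:
        # All 1s: infinity when the mantissa is zero, NaN otherwise
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:
        # All 0s: zero or fractional form, 0.M x 2^-(Bias - 1)
        return sign * fraction * 2.0 ** (1 - BIAS)
    # Implicit normal form, 1.M x 2^(E - Bias)
    return sign * (1 + fraction) * 2.0 ** (exponent - BIAS)


for pattern in (0x0000, 0x7C00, 0x7E00, 0x0001, 0x3C00, 0xC500):
    print(format(pattern, "016b"), decode_half(pattern))
```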

Precision

Decimal Precision: $S \times \log_{10}(2)$, where $S$ here is the number of significand bits
Single Precision Floats: $24 \times \log_{10}(2) = 7.22$ digits
Double Precision Floats: $53 \times \log_{10}(2) = 15.95$ digits
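
The precision figures above can be recomputed for any format from its significand width. A minimal sketch; `decimal_digits` is an illustrative name.

```python
import math


def decimal_digits(significand_bits):
    """Approximate number of decimal digits carried by a binary significand."""
    return significand_bits * math.log10(2)


for name, bits in (("half", 11), ("single", 24), ("double", 53)):
    print(name, round(decimal_digits(bits), 2))
```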

What is the difference between float and double? - Stack Overflow
