8.4 Floating-Point Data Types
The compiler uses 32- and 64-bit forms of the IEEE-754 floating-point format to store floating-point values. The implementation limits applicable to a translation unit are contained in the float.h header.
Type | Bits |
---|---|
float | 32 |
double | 64 |
long double | 64 |
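The sizes in the table can be confirmed from within a program. The following is a minimal sketch, not part of the compiler's documentation; the function names are hypothetical. It assumes a C11 compiler for `_Static_assert`, and it leaves long double as a run-time query because other implementations may use a wider type.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

/* Width in bits of each floating-point type on the current implementation.
   The function names are hypothetical, chosen for this sketch. */
size_t float_bits(void)       { return sizeof(float) * CHAR_BIT; }
size_t double_bits(void)      { return sizeof(double) * CHAR_BIT; }
size_t long_double_bits(void) { return sizeof(long double) * CHAR_BIT; }

/* Compile-time checks against the table above; a build for an
   implementation with different sizes stops here. */
_Static_assert(sizeof(float) * CHAR_BIT == 32, "float is expected to be 32 bits");
_Static_assert(sizeof(double) * CHAR_BIT == 64, "double is expected to be 64 bits");
```

With the sizes listed in the table, long_double_bits() would be expected to return 64.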
Variables may be declared using the float, double and long double keywords, respectively, to hold values of these types. Floating-point types are always signed and the unsigned keyword is illegal when specifying a floating-point type. All floating-point values are represented in little-endian format with the Least Significant Byte (LSB) at the lower address.
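The byte order can be observed by copying a value's object representation into a byte array. The sketch below is illustrative only; the function names are hypothetical, and it assumes a 4-byte float. The value 0.1f is used because its IEEE-754 bit pattern, 0x3DCCCCCD, has distinct first and last bytes.

```c
#include <string.h>   /* memcpy */

/* Copy the object representation of a float into out[0..3], where
   out[0] is the byte at the lowest address. Assumes a 4-byte float. */
void float_to_bytes(float value, unsigned char out[4])
{
    memcpy(out, &value, sizeof value);
}

/* Returns 1 if the least significant byte is stored at the lowest
   address (little endian), 0 otherwise. */
int float_is_lsb_first(void)
{
    unsigned char bytes[4];
    float_to_bytes(0.1f, bytes);   /* 0.1f has the bit pattern 0x3DCCCCCD */
    return bytes[0] == 0xCD && bytes[3] == 0x3D;
}
```

On a little-endian implementation such as the one described here, the bytes of 0.1f appear in memory as CD CC CC 3D.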
This format is described in the table below, where:
- Sign is the sign bit which indicates if the number is positive or negative.
- For 32-bit floating-point values, the exponent is 8 bits which is stored as excess 127 (i.e., an exponent of 0 is stored as 127).
- For 64-bit floating-point values, the exponent is 11 bits which is stored as excess 1023 (i.e., an exponent of 0 is stored as 1023).
- Mantissa is the fractional part of the number, which is to the right of the radix point. There is an implied bit to the left of the radix point which is always 1 except for a zero value, where the implied bit is zero. A zero value is indicated by a zero exponent.
The value of this number for 32-bit floating-point values is:
$$(-1)^{\text{sign}} \times 2^{(\text{exponent} - 127)} \times 1.\text{mantissa}$$
and for 64-bit values:
$$(-1)^{\text{sign}} \times 2^{(\text{exponent} - 1023)} \times 1.\text{mantissa}$$
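The 32-bit expression can be sketched in C by splitting a raw bit pattern into its three fields and recombining them. This is an illustrative sketch only: the type and function names are hypothetical, and infinities and NaNs (biased exponent 255) are deliberately not handled.

```c
#include <stdint.h>

/* Hypothetical helper type holding the three fields of a 32-bit value. */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, stored as excess 127 */
    uint32_t mantissa;   /* 23 stored bits, without the implied bit */
} float_fields;

float_fields split_float_bits(uint32_t bits)
{
    float_fields f;
    f.sign     = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* 2 raised to the power n by repeated multiplication or division,
   which avoids a dependency on <math.h>. */
static double pow2(int n)
{
    double r = 1.0;
    for (; n > 0; --n) r *= 2.0;
    for (; n < 0; ++n) r /= 2.0;
    return r;
}

/* (-1)^sign x 2^(exponent-127) x 1.mantissa for a 32-bit pattern.
   A zero exponent is treated as the value zero, as described above;
   exponent 255 (infinity, NaN) is outside the scope of this sketch. */
double float_bits_value(uint32_t bits)
{
    float_fields f = split_float_bits(bits);
    if (f.exponent == 0)
        return 0.0;
    double significand = 1.0 + (double)f.mantissa / 8388608.0;   /* 2^23 */
    double magnitude = pow2((int)f.exponent - 127) * significand;
    return f.sign ? -magnitude : magnitude;
}
```

For instance, the pattern 0x3F800000 has a biased exponent of 127 and a zero mantissa, so float_bits_value() returns 1.0.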
In the following table, examples of the 32- and 64-bit IEEE 754 formats are shown. Note that the Most Significant bit of the mantissa column (that is, the bit to the left of the radix point) is the implied bit, which is assumed to be 1 unless the exponent is zero (in which case the float number is zero).
Format | Number | Biased Exponent | 1.mantissa | Decimal |
---|---|---|---|---|
32-bit | 0x7DA6B69C | 11111011b (251) | 1.01001101011011010011100b (1.3024477959) | 2.770000117e+37 |
64-bit | 0x47B4D6D37131A6DE | 10001111011b (1147) | 1.0100110101101101001101110001001100011010011011011110b (1.3024477407110946) | 2.77e+37 |
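The 64-bit row can be checked the same way. The sketch below assumes the full bit pattern is 0x47B4D6D37131A6DE, which is the value that the biased exponent and mantissa columns of the table encode; the function names are hypothetical. It also assumes that double is a 64-bit IEEE-754 type stored with the same byte order as uint64_t.

```c
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Biased 11-bit exponent of a 64-bit pattern (stored as excess 1023). */
unsigned double_exponent_bits(uint64_t bits)
{
    return (unsigned)((bits >> 52) & 0x7FFu);
}

/* The 52 stored mantissa bits, without the implied bit. */
uint64_t double_mantissa_bits(uint64_t bits)
{
    return bits & ((((uint64_t)1) << 52) - 1);
}

/* Reinterpret a 64-bit pattern as a double. Assumes a 64-bit IEEE-754
   double with the same byte order as uint64_t. */
double bits_to_double(uint64_t bits)
{
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Under these assumptions, the exponent field of the pattern is 1147 and the reinterpreted value is approximately 2.77e+37.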
The example in the table can be calculated manually as follows.
The sign bit is zero; the biased exponent is 251, so the exponent is 251-127=124. Take the binary number to the right of the radix point in the mantissa. Convert this to decimal (2537116) and divide it by 2^23, where 23 is the number of bits taken up by the mantissa, to give 0.3024477959. Add 1 to this fraction. The floating-point number is then given by:
$$(-1)^{0} \times 2^{124} \times 1.3024477959$$
which becomes:
$$1 \times 2.126764793256 \times 10^{37} \times 1.3024477959$$
which is approximately equal to:
$$2.77000 \times 10^{37}$$
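The manual calculation can be cross-checked against the value the hardware or library assigns to the same bit pattern. The following sketch is illustrative; the function names are hypothetical, and it assumes that float is a 32-bit IEEE-754 type stored with the same byte order as uint32_t.

```c
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Reinterpret a 32-bit pattern as a float. Assumes a 32-bit IEEE-754
   float with the same byte order as uint32_t. */
float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The worked example for 0x7DA6B69C, step by step. */
double worked_example(void)
{
    /* Mantissa field 0x26B69C (2537116) divided by 2^23. */
    double fraction = 2537116.0 / 8388608.0;

    /* 2^(251-127) = 2^124, built by repeated doubling. */
    double scale = 1.0;
    for (int i = 0; i < 124; ++i)
        scale *= 2.0;

    /* The sign bit is zero, so the result is positive. */
    return scale * (1.0 + fraction);
}
```

Every step in worked_example() is exact in double precision, so its result should compare equal to (double)bits_to_float(0x7DA6B69C).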
Binary floating-point values are sometimes misunderstood. It is important to remember that not every real value can be represented exactly by a finite-sized floating-point number. The size of the exponent in the number dictates the range of values that the number can hold, and the size of the mantissa determines the spacing between the values that can be represented exactly. Thus the 64-bit floating-point format allows for a larger range of values, and those values can be represented more accurately.
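The spacing between adjacent representable values can be measured by stepping a value's bit pattern by one. This is a sketch under stated assumptions: the function names are hypothetical, float is taken to be a 32-bit IEEE-754 type with the same byte order as uint32_t, and the input must be positive, finite and normalized.

```c
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Next representable float above value. Valid only for positive, finite,
   normalized inputs below the largest finite float; incrementing the bit
   pattern by one moves to the adjacent value. */
float next_float_up(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits += 1u;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Gap between value and the next representable float above it. */
float float_spacing(float value)
{
    return next_float_up(value) - value;
}
```

The gap grows with the magnitude of the value: near 1.0f it is 2^-23, while near 95000.0f it is 2^-7 (0.0078125).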
So, for example, if you are using a 32-bit wide floating-point type, it can exactly store the value 95000.0. However, the next higher number it can represent is (approximately) 95000.00781, and it is impossible to represent any value in between these two in such a type, as it will be rounded. This implies that C/C++ code which compares floating-point values may not behave as expected. For example:
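    volatile float myFloat;
    myFloat = 95000.006f;
    if (myFloat == 95000.007f)  // both constants round to the same float value
        LATA++;                 // this line will be executed!

in which the result of the if() expression will be true, even though it appears the two values being compared are different. The f suffix gives both constants the float type; an unsuffixed constant has type double, and the comparison would then be performed in that wider type.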
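The rounding effect can be reproduced in portable C without device registers. The sketch below is not taken from the compiler documentation; both function names are hypothetical. The first function shows that two different constants can collapse to the same float, and the second shows one common alternative to ==, which makes the intended tolerance explicit.

```c
/* Returns 1 when two double values round to the same 32-bit float.
   The volatile qualifier forces each value to be stored as a float. */
int rounds_to_same_float(double a, double b)
{
    volatile float fa = (float)a;
    volatile float fb = (float)b;
    return fa == fb;
}

/* Hypothetical helper: equality within a relative tolerance, scaled
   by the larger magnitude of the two operands. */
int nearly_equal(float a, float b, float rel_tol)
{
    float diff    = a > b ? a - b : b - a;
    float mag_a   = a < 0.0f ? -a : a;
    float mag_b   = b < 0.0f ? -b : b;
    float largest = mag_a > mag_b ? mag_a : mag_b;
    return diff <= largest * rel_tol;
}
```

A suitable value for rel_tol depends on the application; a small multiple of FLT_EPSILON is a frequent starting point, but that choice is an assumption rather than a recommendation from this guide.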
The characteristics of the floating-point formats are summarized in Table 8-3. The symbols in this table are preprocessor macros which are available after including <float.h> in your source code. Two sets of macros are available for float and double types, where XXX represents FLT and DBL, respectively. So, for example, FLT_MAX represents the maximum floating-point value of the float type. DBL_MAX represents the same value for the double type. As the size and format of floating-point data types are not fully specified by the ANSI Standard, these macros allow for more portable code which can check the limits of the range of values held by the type on this implementation.
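As an illustration of the portable range checking these macros allow, the following sketch tests whether a double value lies within the finite range of float before it is narrowed. The function name is hypothetical.

```c
#include <float.h>   /* FLT_MAX */

/* Returns 1 if x lies within the finite range of the float type, so
   that converting it to float cannot overflow. Returns 0 for
   out-of-range values and for NaN. */
int fits_in_float(double x)
{
    return x >= -FLT_MAX && x <= FLT_MAX;
}
```

A caller might use this check to clamp or reject a value instead of relying on the result of an out-of-range conversion.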
Symbol | Meaning | 32-bit Value | 64-bit Value |
---|---|---|---|
XXX_RADIX | Radix of exponent representation | 2 | 2 |
XXX_ROUNDS | Rounding mode for addition | 1 | |
XXX_MIN_EXP | Min. n such that FLT_RADIX^(n-1) is a normalized float value | -125 | -1021 |
XXX_MIN_10_EXP | Min. n such that 10^n is a normalized float value | -37 | -307 |
XXX_MAX_EXP | Max. n such that FLT_RADIX^(n-1) is a normalized float value | 128 | 1024 |
XXX_MAX_10_EXP | Max. n such that 10^n is a normalized float value | 38 | 308 |
XXX_MANT_DIG | Number of FLT_RADIX mantissa digits | 24 | 53 |
XXX_EPSILON | The smallest number which added to 1.0 does not yield 1.0 | 1.1920929e-07 | 2.2204460492503131e-16 |
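The meaning of the EPSILON entry can be demonstrated with a short test. This sketch uses a hypothetical function name and assumes the default round-to-nearest mode.

```c
#include <float.h>   /* FLT_EPSILON */

/* Returns 1 if adding delta to 1.0f produces a float different from 1.0f.
   The volatile qualifier forces the sum to be rounded to float. */
int changes_one(float delta)
{
    volatile float sum = 1.0f + delta;
    return sum != 1.0f;
}
```

Under that assumption, changes_one(FLT_EPSILON) returns 1, whereas changes_one(FLT_EPSILON / 2) returns 0 because the sum rounds back to 1.0f.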