9.4 Floating-Point Data Types

The compiler uses 32- and 64-bit forms of the IEEE-754 floating-point format to store floating-point values. The implementation limits applicable to a translation unit are contained in the float.h header.

Attention: Some target PIC32M devices implement a Floating-point Unit (FPU). The compiler implements certain features, described in this guide, to support this hardware.

The table below shows the floating-point data types and their sizes in bits.


Type	Bits
`float`	32
`double`	64
`long double`	64

Variables may be declared using the float, double and long double keywords, respectively, to hold values of these types. Floating-point types are always signed and the unsigned keyword is illegal when specifying a floating-point type. All floating-point values are represented in little endian format with the Least Significant Byte (LSB) at the lower address.

This format is described in the table below, where:

Sign is the sign bit which indicates if the number is positive or negative.
For 32-bit floating-point values, the exponent is 8 bits which is stored as excess 127 (i.e., an exponent of 0 is stored as 127).
For 64-bit floating-point values, the exponent is 11 bits which is stored as excess 1023 (i.e., an exponent of 0 is stored as 1023).
Mantissa is the stored fraction, which lies to the right of the radix point. There is an implied bit to the left of the radix point which is always 1, except for a zero value, where the implied bit is zero. A zero value is indicated by a zero exponent.

The value of this number for 32-bit floating-point values is:

(-1)^sign × 2^(exponent - 127) × 1.mantissa

and for 64-bit values

(-1)^sign × 2^(exponent - 1023) × 1.mantissa.

In the following table, examples of the 32- and 64-bit IEEE 754 formats are shown. Note that the Most Significant bit of the mantissa column (that is, the bit to the left of the radix point) is the implied bit, which is assumed to be 1 unless the exponent is zero (in which case the float number is zero).

Table 9-1. Floating-Point Format Example IEEE 754
Format	Number	Biased Exponent	1.mantissa	Decimal
32-bit	0x7DA6B69C	11111011b (251)	1.01001101011011010011100b (1.3024477959)	2.770000117e+37
64-bit	0x47B4D6D37131A6DE	10001111011b (1147)	1.0100110101101101001101110001001100011010011011011110b (1.3024477407110946)	2.77e+37

The example in the table can be calculated manually as follows.

The sign bit is zero; the biased exponent is 251, so the exponent is 251-127=124. Take the binary number to the right of the radix point in the mantissa. Convert this to decimal (2537116) and divide it by 2²³, where 23 is the number of bits taken up by the mantissa, to give 0.3024477959. Add 1 to this fraction. The floating-point number is then given by:

(-1)⁰×2¹²⁴×1.3024477959

which becomes:

1×2.126764793256e+37×1.3024477959

which is approximately equal to:

2.77000e+37

Binary floating-point values are sometimes misunderstood. It is important to remember that not every real value can be represented exactly by a finite-sized floating-point number. The size of the exponent dictates the range of values that the number can hold, and the size of the mantissa determines the spacing between adjacent values that can be represented exactly. Thus the 64-bit floating-point format offers both a larger range of values and finer spacing between them, so values can be represented more accurately.

So, for example, a 32-bit wide floating-point type can exactly store the value 95000.0. However, the next higher number it can represent is (approximately) 95000.00781, and any value between these two cannot be represented in such a type; it will be rounded to one of them. This implies that C/C++ code which compares floating-point values may not behave as expected. For example:

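// The f suffix makes each constant a 32-bit float; without it the
// constants are 64-bit doubles and the comparison is made in double.
volatile float myFloat;
myFloat = 95000.006f;
if(myFloat == 95000.007f) // both constants round to the same float value
   LATA++;                // this line will be executed!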

in which the result of the if() expression will be true, even though it appears the two values being compared are different.

The characteristics of the floating-point formats are summarized in Table 9-2. The symbols in this table are preprocessor macros which are available after including <float.h> in your source code. Two sets of macros are available for float and double types, where XXX represents FLT and DBL, respectively. So, for example, FLT_MAX represents the maximum floating-point value of the float type, and DBL_MAX represents the same value for the double type. The radix and rounding-mode macros are defined only with the FLT prefix (FLT_RADIX and FLT_ROUNDS) and apply to all floating-point types. As the size and format of floating-point data types are not fully specified by the ANSI Standard, these macros allow for more portable code which can check the limits of the range of values held by the type on this implementation.

Table 9-2. Ranges of Floating-Point Type Values
Symbol	Meaning	32-bit Value	64-bit Value
`FLT_RADIX`	Radix of exponent representation	2	2
`FLT_ROUNDS`	Rounding mode for addition (1 = to nearest)	1	1
`XXX_MIN_EXP`	Min. n such that FLT_RADIXⁿ⁻¹ is a normalized float value	-125	-1021
`XXX_MIN_10_EXP`	Min. n such that 10ⁿ is a normalized float value	-37	-307
`XXX_MAX_EXP`	Max. n such that FLT_RADIXⁿ⁻¹ is a normalized float value	128	1024
`XXX_MAX_10_EXP`	Max. n such that 10ⁿ is a normalized float value	38	308
`XXX_MANT_DIG`	Number of FLT_RADIX mantissa digits	24	53
`XXX_EPSILON`	Difference between 1.0 and the next representable value greater than 1.0	1.1920929e-07	2.2204460492503131e-16