8.4 Floating-Point Data Types

The compiler uses 32- and 64-bit forms of the IEEE-754 floating-point format to store floating-point values. The implementation limits applicable to a translation unit are contained in the float.h header.

Important: Some target PIC32C/SAM devices implement a Floating-point Unit (FPU). The compiler implements certain features, described in this guide, to support this hardware.
The table below shows the floating-point data types and their corresponding sizes.
Type Bits
float 32
double 64
long double 64

Variables may be declared using the float, double and long double keywords, respectively, to hold values of these types. Floating-point types are always signed, and the unsigned keyword is illegal when specifying a floating-point type. All floating-point values are stored in little-endian format, with the Least Significant Byte (LSB) at the lower address.
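The storage rules above can be checked with a short sketch. The helper names float_bytes and float_is_little_endian are hypothetical and are not part of the compiler or its libraries; the sketch assumes a C11 compiler for _Static_assert.

```c
#include <stdint.h>
#include <string.h>

/* Sizes stated in the table above; compilation fails if they do not hold. */
_Static_assert(sizeof(float) == 4, "float is expected to be 32 bits");
_Static_assert(sizeof(double) == 8, "double is expected to be 64 bits");

/* Copy the object representation of a float into a byte array.
 * bytes[0] is the byte stored at the lowest address. */
static void float_bytes(float value, uint8_t bytes[4])
{
    memcpy(bytes, &value, sizeof value);
}

/* Returns 1 if the least significant byte is stored at the lowest address. */
static int float_is_little_endian(void)
{
    uint8_t bytes[4];
    float_bytes(1.0f, bytes);   /* 1.0f has the bit pattern 0x3F800000 */
    return bytes[0] == 0x00 && bytes[3] == 0x3F;
}
```

A similar _Static_assert on sizeof(long double) could confirm the 64-bit size on the target; it is left out here because the size of long double differs between toolchains.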

This format is described in the table below, where:

  • Sign is the sign bit which indicates if the number is positive or negative.
  • For 32-bit floating-point values, the exponent is 8 bits which is stored as excess 127 (i.e., an exponent of 0 is stored as 127).
  • For 64-bit floating-point values, the exponent is 11 bits which is stored as excess 1023 (i.e., an exponent of 0 is stored as 1023).
  • Mantissa is the fractional part of the number, which is stored to the right of the radix point. There is an implied bit to the left of the radix point which is always 1 except for a zero value, where the implied bit is zero. A zero value is indicated by a zero exponent.

The value of this number for 32-bit floating-point values is:

(-1)^sign × 2^(exponent-127) × 1.mantissa

and for 64-bit values

(-1)^sign × 2^(exponent-1023) × 1.mantissa
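As a minimal sketch of the 64-bit formula, the following C code splits a double into its three fields and rebuilds the value from them. The type double_fields and the functions split_double and join_double are hypothetical names introduced for illustration only. The sketch assumes a normalized value (biased exponent between 1 and 2046); zero, infinities and NaNs are not handled.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, stored as excess 1023 */
    uint64_t mantissa;   /* 52 stored bits, right of the radix point */
} double_fields;

/* Extract the sign, biased exponent and mantissa of a 64-bit value. */
static double_fields split_double(double value)
{
    uint64_t bits;
    double_fields f;
    memcpy(&bits, &value, sizeof bits);   /* well-defined way to read the bits */
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.mantissa = bits & 0xFFFFFFFFFFFFFull;
    return f;
}

/* Apply (-1)^sign * 2^(exponent-1023) * 1.mantissa (normalized values only). */
static double join_double(double_fields f)
{
    double significand = 1.0 + (double)f.mantissa / 4503599627370496.0; /* 2^52 */
    double magnitude = ldexp(significand, (int)f.exponent - 1023);
    return f.sign ? -magnitude : magnitude;
}
```

For example, split_double(-2.5) gives a sign of 1, a biased exponent of 1024 and a mantissa of 0x4000000000000, since -2.5 is -1.25 × 2^1. The 32-bit case follows the same pattern with an 8-bit exponent, a 23-bit mantissa and a bias of 127.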

In the following table, examples of the 32- and 64-bit IEEE 754 formats are shown. Note that the Most Significant bit of the mantissa column (that is, the bit to the left of the radix point) is the implied bit, which is assumed to be 1 unless the exponent is zero (in which case the float number is zero).

Table 8-2. Floating-Point Format Example IEEE 754
Format  Number              Biased Exponent      1.mantissa                                                                    Decimal
32-bit  0x7DA6B69C          11111011b (251)      1.01001101011011010011100b (1.3024477959)                                     2.770000117e+37
64-bit  0x47B4D6D37131A6DE  10001111011b (1147)  1.0100110101101101001101110001001100011010011011011110b (1.3024477407110946)  2.77e+37

The example in the table can be calculated manually as follows.

The sign bit is zero; the biased exponent is 251, so the exponent is 251-127=124. Take the binary number to the right of the radix point in the mantissa. Convert this to decimal (2537116) and divide it by 2^23 (8388608), where 23 is the number of bits taken up by the mantissa, to give 0.3024477959. Add 1 to this fraction. The floating-point number is then given by:

(-1)^0 × 2^124 × 1.3024477959

which becomes:

1 × 2.126764793256e+37 × 1.3024477959

which is approximately equal to:

2.77000e+37
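The manual calculation can be repeated in code as a cross-check. This is a sketch only: manual_value and pattern_to_float are hypothetical helper names, and manual_value assumes a normalized pattern with a sign bit of zero, as in the 32-bit example from the table.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Repeat the manual steps for a 32-bit pattern whose sign bit is 0. */
static double manual_value(uint32_t bits)
{
    int exponent = (int)((bits >> 23) & 0xFFu) - 127;           /* remove the bias */
    double fraction = (double)(bits & 0x7FFFFFu) / 8388608.0;   /* mantissa / 2^23 */
    return ldexp(1.0 + fraction, exponent);                     /* (1 + fraction) * 2^exponent */
}

/* Reinterpret the same pattern as a float so the two results can be compared. */
static float pattern_to_float(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Calling manual_value(0x7DA6B69Cu) and pattern_to_float(0x7DA6B69Cu) should produce the same number, close to 2.77e+37, because every step of the manual method is exact in double precision.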

Binary floating-point values are sometimes misunderstood. It is important to remember that not every real value can be represented exactly by a floating-point number of finite size. The size of the exponent dictates the range of values that the number can hold, and the size of the mantissa determines the spacing between adjacent values that can be represented exactly. Thus, the 64-bit floating-point format offers both a larger range of values and a finer spacing between them than the 32-bit format, so values can be represented more accurately.

So, for example, a 32-bit wide floating-point type can exactly store the value 95000.0. However, the next higher number it can represent is 95000.0078125 (approximately 95000.00781), and no value between these two can be represented in such a type; it will be rounded to one of them. This implies that C/C++ code that compares floating-point values might not behave as expected. For example:

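volatile float myFloat;
myFloat = 95000.006f;       // rounded to 95000.0078125 when stored
if(myFloat == 95000.007f)   // this constant is rounded to the same value
   LATA++;                  // this line will be executed!

in which the result of the if() expression will be true, even though it appears the two values being compared are different. Both constants carry the f suffix, so they have type float and round to the same 32-bit value. An unsuffixed constant such as 95000.007 has type double, which is 64 bits wide with this compiler; comparing myFloat against it would promote myFloat to double, and the result would be false.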
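The spacing between adjacent float values, and one common way to avoid relying on exact equality, can be sketched as follows. Both helpers are hypothetical names chosen for this illustration. next_float_up assumes a positive, finite input and relies on the fact that adding 1 to the bit pattern of such a value gives the next representable value; the standard library function nextafterf from <math.h> is the general-purpose alternative. The tolerance passed to nearly_equal is an application-specific choice, not a value recommended by this guide.

```c
#include <stdint.h>
#include <string.h>

/* Next representable float above a positive, finite value. */
static float next_float_up(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits += 1u;                       /* step to the adjacent bit pattern */
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Compare two floats using an absolute tolerance instead of ==. */
static int nearly_equal(float a, float b, float tolerance)
{
    float diff = a - b;
    if (diff < 0.0f) {
        diff = -diff;
    }
    return diff <= tolerance;
}
```

With these helpers, next_float_up(95000.0f) returns 95000.0078125f, which is why 95000.006f and 95000.007f compare equal. A comparison such as nearly_equal(x, y, 0.01f) states the intended precision explicitly rather than depending on how the constants happen to round.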

The characteristics of the floating-point formats are summarized in Table 8-3. The symbols in this table are preprocessor macros that are available after including <float.h> in your source code. Two sets of macros are available for the float and double types, where XXX represents FLT and DBL, respectively. So, for example, FLT_MAX represents the maximum floating-point value of the float type, and DBL_MAX represents the same value for the double type. As the size and format of floating-point data types are not fully specified by the ANSI C Standard, these macros allow for more portable code, which can check the limits of the range of values held by each type on this implementation.

Table 8-3. Ranges of Floating-Point Type Values
Symbol          Meaning                                                          32-bit Value   64-bit Value
FLT_RADIX       Radix of exponent representation                                 2              2
FLT_ROUNDS      Rounding mode for addition                                       1              1
XXX_MIN_EXP     Min. n such that FLT_RADIX^(n-1) is a normalized float value     -125           -1021
XXX_MIN_10_EXP  Min. n such that 10^n is a normalized float value                -37            -307
XXX_MAX_EXP     Max. n such that FLT_RADIX^(n-1) is a normalized float value     128            1024
XXX_MAX_10_EXP  Max. n such that 10^n is a normalized float value                38             308
XXX_MANT_DIG    Number of FLT_RADIX mantissa digits                              24             53
XXX_EPSILON     Difference between 1.0 and the next larger representable value   1.1920929e-07  2.2204460492503131e-16
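As a brief sketch of how these macros might be used, the following helpers check a value against the float range before narrowing, and show the effect of FLT_EPSILON. The function names fits_in_float and changes_one are hypothetical; only the macros come from <float.h>. The sketch assumes the default round-to-nearest mode.

```c
#include <float.h>

/* Returns 1 if a finite double lies within the range of float, so that
 * converting it does not overflow. Precision can still be lost. */
static int fits_in_float(double value)
{
    return value >= -FLT_MAX && value <= FLT_MAX;
}

/* Returns 1 if adding delta to 1.0f produces a float other than 1.0f.
 * The volatile store forces the sum to be rounded to float precision. */
static int changes_one(float delta)
{
    volatile float sum = 1.0f + delta;
    return sum != 1.0f;
}
```

Here changes_one(FLT_EPSILON) is expected to return 1, while a much smaller increment such as FLT_EPSILON / 4 is lost to rounding and returns 0. Testing fits_in_float before assigning a double to a float is one way to make range assumptions explicit in portable code.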