Floating-Point Data Types

The MPLAB XC8 compiler supports 32-bit and 24-bit floating-point types: the IEEE 754 32-bit format and a truncated, 24-bit form of it, respectively. The 32-bit size is set automatically when you select C99 compliance. If 24-bit floating-point types are explicitly selected, the compiler will use the C90 libraries. The table below shows the data types and their corresponding size and arithmetic type.

Table 1. Floating-point Data Types
Type	Size (bits)	Arithmetic Type
`float`	24 / 32	Real
`double`	24 / 32	Real
`long double`	same size as `double`	Real

For both float and double values, the 24-bit format is the default. The options -fshort-float and -fshort-double can also be used to specify this explicitly. The 32-bit format is used for double values if -fno-short-double option is used and for float values if -fno-short-float is used.

Variables can be declared using the float and double keywords to hold values of these types. Floating-point types are always signed, and the unsigned keyword is illegal when specifying a floating-point type. Types declared as long double will use the same format as types declared as double. All floating-point values are represented in little-endian format, with the least significant byte (LSB) at the lower address.

The 32-bit floating-point type supports “relaxed” semantics when compared to the full IEEE implementation, which means the following rules are observed.

Tiny (sub-normal) arguments to floating-point routines are interpreted as zeros. There are no representable floating-point values possible between -1.17549435E-38 and 1.17549435E-38, except for 0.0. This range is called the denormal range. Sub-normal results of routines are flushed to zero. There are no negative 0 results produced.

Not-a-number (NaN) arguments to routines are interpreted as infinities. NaN results are never created in addition, subtraction, multiplication, or division routines where a NaN would be normally expected—an infinity of the proper sign is created instead. The square root of a negative number will return the “distinguished” NaN (default NaN used for error return).

Infinities are legal arguments for all operations and behave as the largest representable number with that sign. For example, +inf + -inf yields the value 0.
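The flush-to-zero rule can be modeled on a host machine with a short sketch. The helper name flush_subnormal is hypothetical and is not part of any XC8 header; the code only imitates the rule described above for sub-normal values and does not reproduce the compiler's library routines.

```c
#include <float.h>

/* Hypothetical host-side model of the flush-to-zero rule.
 * FLT_MIN is the smallest normalized float (1.17549435E-38 for the
 * IEEE 754 32-bit format), so any value strictly inside
 * (-FLT_MIN, FLT_MIN) is either zero or sub-normal. */
static float flush_subnormal(float x)
{
    if (x > -FLT_MIN && x < FLT_MIN)
        return 0.0f;    /* always positive zero; no negative zero result */
    return x;           /* normalized values pass through unchanged */
}
```

For instance, flush_subnormal(1e-40f) returns 0.0f on a host that supports sub-normal values, while flush_subnormal(-2.5f) returns its argument unchanged.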

The format for both floating-point types is described in Table 2, where:

sign is the sign bit, which indicates whether the number is positive or negative.
biased exponent is the exponent field, which is 8 bits wide and is stored as excess 127 (i.e., an exponent of 0 is stored as 127).
mantissa is the fraction to the right of the radix point. There is an implied bit to the left of the radix point which is always 1 except for a zero value, where the implied bit is zero. A zero value is indicated by a zero exponent.

The value of this number is (-1)^sign × 2^(biased exponent - 127) × 1.mantissa.

Table 2. Floating-point Formats
Format	Sign	Biased Exponent	Mantissa
IEEE 754 32-bit	x	xxxx xxxx	xxx xxxx xxxx xxxx xxxx xxxx
modified IEEE 754 24-bit	x	xxxx xxxx	xxx xxxx xxxx xxxx
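As an illustration of the layouts in Table 2, the following sketch splits a raw bit pattern into its three fields. The type fp_fields and the function fp_split are hypothetical names introduced here, not part of the compiler's headers, and the code is intended to run on any host with a C99 compiler.

```c
#include <stdint.h>

/* Hypothetical container for the three fields of Table 2. */
typedef struct {
    unsigned sign;        /* 0 = positive, 1 = negative */
    unsigned biased_exp;  /* 8-bit exponent, stored excess 127 */
    uint32_t mantissa;    /* stored mantissa bits, implied bit excluded */
} fp_fields;

/* Split a raw bit pattern into sign, biased exponent and mantissa.
 * mant_bits is 23 for the 32-bit format and 15 for the 24-bit format. */
static fp_fields fp_split(uint32_t raw, unsigned mant_bits)
{
    fp_fields f;
    f.mantissa   = raw & ((UINT32_C(1) << mant_bits) - 1u);
    f.biased_exp = (unsigned)((raw >> mant_bits) & 0xFFu);
    f.sign       = (unsigned)((raw >> (mant_bits + 8u)) & 1u);
    return f;
}
```

Calling fp_split(0x42123A, 15) gives a sign of 0, a biased exponent of 132 and a mantissa of 0x123A.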

Table 3 shows examples of the IEEE 754 32-bit format and the modified 24-bit format. Note that the most significant bit (MSb) of the 1.mantissa column (i.e., the bit to the left of the radix point) is the implied bit, which is assumed to be 1 unless the exponent is zero.

Table 3. Floating-point Format Example IEEE 754
Format	Value	Biased Exponent	1.mantissa	Decimal
32-bit	7DA6B69Bh	11111011b	1.01001101011011010011011b	2.77000e+37
32-bit	7DA6B69Bh	(251)	(1.302447676659)	—
24-bit	42123Ah	10000100b	1.001001000111010b	36.557
24-bit	42123Ah	(132)	(1.142395019531)	—

Use the following process to manually calculate the 32-bit example in Table 3.

The sign bit is zero; the biased exponent is 251, so the exponent is 251 - 127 = 124. Take the binary number to the right of the radix point in the mantissa. Convert this to decimal and divide it by 2^23, where 23 is the number of mantissa bits, to give 0.302447676659. Add 1 to this fraction. The floating-point number is then given by:

(-1)^0 × 2^124 × 1.302447676659

which is approximately equal to:

2.77000e+37
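The same steps can be sketched in C and applied to the 24-bit example from Table 3. The function fp_value is a hypothetical helper, not a compiler routine; it assumes a host with an ordinary double type and does not handle the infinity and NaN encodings (biased exponent of 255).

```c
#include <stdint.h>

/* Hypothetical helper: evaluate (-1)^sign x 2^(exponent - 127) x 1.mantissa
 * for a raw bit pattern. mant_bits is 23 (32-bit) or 15 (24-bit).
 * Assumption: a biased exponent of 255 (infinity/NaN) is never passed in. */
static double fp_value(uint32_t raw, unsigned mant_bits)
{
    uint32_t mantissa   = raw & ((UINT32_C(1) << mant_bits) - 1u);
    int      biased_exp = (int)((raw >> mant_bits) & 0xFFu);
    int      negative   = (int)((raw >> (mant_bits + 8u)) & 1u);
    double   value;
    int      exp;

    if (biased_exp == 0)
        return 0.0;                 /* a zero exponent indicates zero */

    /* Divide the mantissa by 2^mant_bits, then add the implied 1. */
    value = 1.0 + (double)mantissa / (double)(UINT32_C(1) << mant_bits);

    /* Scale by 2^(biased exponent - 127) without needing <math.h>. */
    exp = biased_exp - 127;
    for (; exp > 0; exp--) value *= 2.0;
    for (; exp < 0; exp++) value /= 2.0;

    return negative ? -value : value;
}
```

For the 24-bit pattern 42123Ah, the biased exponent is 132, so the exponent is 5, and the mantissa 123Ah divided by 2^15 is 0.142395019531. The call fp_value(0x42123A, 15) therefore returns 36.556640625, which agrees with the rounded value 36.557 in Table 3.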

Binary floating-point values are sometimes misunderstood. It is important to remember that not every real value can be represented by a finite-sized floating-point number. The size of the exponent dictates the range of values that the number can hold, and the size of the mantissa determines the spacing between values that can be represented exactly. Thus the 24-bit format covers approximately the same range of values as the 32-bit format, but the values that it can represent exactly are more widely spaced.

For example, if you are using a 24-bit wide floating-point type, it can exactly store the value 95000.0. However, the next highest number it can represent is 95002.0 and it is impossible to represent any value in between these two in such a type as it will be rounded. This implies that C code which compares floating-point values might not behave as expected.

For example:

volatile float myFloat;
myFloat = 95002.0;
if(myFloat == 95001.0)     // value will be rounded
    PORTA++;                // this line will be executed!

in which the result of the if expression will be true, even though it appears the two values being compared are different.
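One common way to make such comparisons less sensitive to rounding is to test whether two values lie within a chosen tolerance of each other, rather than testing for exact equality. The sketch below is a general C technique and not something prescribed by the compiler; the helper name nearly_equal is hypothetical, and a suitable tolerance depends on the application and on the spacing of representable values in the range being compared.

```c
#include <stdbool.h>

/* Hypothetical helper: true when a and b differ by no more than tol.
 * tol should be chosen with the spacing of representable values in mind. */
static bool nearly_equal(float a, float b, float tol)
{
    float diff = a - b;
    if (diff < 0.0f)
        diff = -diff;       /* absolute difference without <math.h> */
    return diff <= tol;
}
```

With this helper, the intent of a comparison is explicit: nearly_equal(myFloat, 95001.0, 2.0) states that values within 2.0 of each other are to be treated as equal.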

Compare this to a 32-bit floating-point type, which has a higher precision. It also can exactly store 95000.0 as a value. The next highest value which can be represented is (approximately) 95000.00781.
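The spacing figures quoted above follow from the number of mantissa bits. As a rough sketch, the hypothetical helper fp_spacing below computes the gap between adjacent representable values near a positive, normalized value x: it finds the largest power of two not exceeding x and divides it by 2 raised to the number of stored mantissa bits.

```c
/* Hypothetical helper: gap between adjacent representable values near x
 * for a format with mant_bits stored mantissa bits (15 or 23 here).
 * Assumption: x is positive and within the normalized range. */
static double fp_spacing(double x, unsigned mant_bits)
{
    double gap = 1.0;
    unsigned i;

    /* Find the largest power of two that does not exceed x. */
    while (gap * 2.0 <= x) gap *= 2.0;
    while (gap > x)        gap /= 2.0;

    /* One step of the least significant mantissa bit at that magnitude. */
    for (i = 0; i < mant_bits; i++)
        gap /= 2.0;

    return gap;
}
```

For x = 95000.0, the largest power of two not exceeding x is 65536, so fp_spacing(95000.0, 15) returns 2.0 for the 24-bit format and fp_spacing(95000.0, 23) returns 0.0078125 for the 32-bit format.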

The characteristics of the floating-point formats are summarized in <float.h> Floating-Point Characteristics.

The symbols in this table are preprocessor macros that are available after including <float.h> in your source code. As the size and format of floating-point data types are not fully specified by the C Standard, these macros allow for more portable code which can check the limits of the range of values held by the type on this implementation.
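As a minimal sketch of how these macros might be used, the code below relies only on FLT_MIN, FLT_MAX and FLT_MANT_DIG, which are defined by the C Standard in <float.h>. The function names are hypothetical, and the values the macros expand to depend on the floating-point size selected for the build.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: true when v is zero or its magnitude lies within
 * the normalized range of the float type on this implementation. */
static bool fits_in_float(double v)
{
    if (v < 0.0)
        v = -v;
    return v == 0.0 || (v >= FLT_MIN && v <= FLT_MAX);
}

/* Hypothetical helper: number of base-FLT_RADIX digits of precision in
 * a float, counting the implied bit when FLT_RADIX is 2. */
static int float_precision_digits(void)
{
    return FLT_MANT_DIG;
}
```

Because the limits come from the macros rather than from hard-coded constants, the same source adapts to whichever floating-point format the implementation provides.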