Names specified here
Name	Description	Notes				Source	Availability
`__STDC_IEC_559__`	Indicator of support for IEC 60559-compatible floating-point arithmetic	L	?	M		Predefined	C99	C11
`double_t`	Evaluation type for `double`				T	`<math.h>`	C99	C11
`FP_INFINITE`	Floating-point classification: infinity			M		`<math.h>`	C99	C11
`FP_NAN`	Floating-point classification: not a number (NaN)			M		`<math.h>`	C99	C11
`FP_NORMAL`	Floating-point classification: normal number			M		`<math.h>`	C99	C11
`FP_SUBNORMAL`	Floating-point classification: subnormal number			M		`<math.h>`	C99	C11
`FP_ZERO`	Floating-point classification: zero			M		`<math.h>`	C99	C11
`float_t`	Evaluation type for `float`				T	`<math.h>`	C99	C11
`fpclassify()`	Classify floating-point number			M	(·)	`<math.h>`	C99	C11
`INFINITY`	Infinity			M		`<math.h>`	C99	C11
`isfinite()`	Test for finite value			M	(·)	`<math.h>`	C99	C11
`isgreater()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`isgreaterequal()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`isinf()`	Test for infinite value			M	(·)	`<math.h>`	C99	C11
`isless()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`islessequal()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`islessgreater()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`isnan()`	Test for NaN			M	(·)	`<math.h>`	C99	C11
`isnormal()`	Test for normal number			M	(·)	`<math.h>`	C99	C11
`isunordered()`	Test for ordering			M	(·)	`<math.h>`	C99	C11
`NAN`	Not-a-number (NaN) constant		?	M		`<math.h>`	C99	C11
`nextafter()`	Get next value				(·)	`<math.h>`	C99	C11
`nextafter()`	Get next value			M	(·)	`<tgmath.h>`	C99	C11
`nextafterf()`	Get next value				(·)	`<math.h>`	C99	C11
`nextafterl()`	Get next value				(·)	`<math.h>`	C99	C11
`nexttoward()`	Get next value				(·)	`<math.h>`	C99	C11
`nexttoward()`	Get next value			M	(·)	`<tgmath.h>`	C99	C11
`nexttowardf()`	Get next value				(·)	`<math.h>`	C99	C11
`nexttowardl()`	Get next value				(·)	`<math.h>`	C99	C11

Names specified for `<float.h>`, “Characteristics of floating-point types”
Name	Description	Notes		Source	Availability
`DBL_DECIMAL_DIG`	Decimal digits needed to represent any `double` value without loss	L	M	`<float.h>`					C11
`DBL_DIG`	Decimal digits of precision of `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_EPSILON`	Difference between 1 and the least `double` value greater than 1	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_HAS_SUBNORM`	Indicates whether `double` has subnormal values	L	M	`<float.h>`					C11
`DBL_MANT_DIG`	Number of base-`FLT_RADIX` digits in mantissa of `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MAX`	Maximum value of `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MAX_10_EXP`	Maximum integral base-10 exponent yielding `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MAX_EXP`	One plus maximum integral exponent of base `FLT_RADIX` yielding `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MIN`	Minimum normalized value of `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MIN_10_EXP`	Minimum integral base-10 exponent yielding normalized `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_MIN_EXP`	One plus minimum integral exponent of base `FLT_RADIX` yielding normalized `double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`DBL_TRUE_MIN`	Minimum positive value of `double`	L	M	`<float.h>`					C11
`DECIMAL_DIG`	Decimal digits needed to represent any value of the widest supported floating-point type without loss	L	M	`<float.h>`				C99	C11
`FLT_DECIMAL_DIG`	Decimal digits needed to represent any `float` value without loss	L	M	`<float.h>`					C11
`FLT_DIG`	Decimal digits of precision of `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_EPSILON`	Difference between 1 and the least `float` value greater than 1	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_EVAL_METHOD`	Floating-point evaluation method	L	M	`<float.h>`				C99	C11
`FLT_HAS_SUBNORM`	Indicates whether `float` has subnormal values	L	M	`<float.h>`					C11
`FLT_MANT_DIG`	Number of base-`FLT_RADIX` digits in mantissa of `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MAX`	Maximum value of `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MAX_10_EXP`	Maximum integral base-10 exponent yielding `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MAX_EXP`	One plus maximum integral exponent of base `FLT_RADIX` yielding `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MIN`	Minimum normalized value of `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MIN_10_EXP`	Minimum integral base-10 exponent yielding normalized `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_MIN_EXP`	One plus minimum integral exponent of base `FLT_RADIX` yielding normalized `float`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_RADIX`	Radix of the exponent representation of `float`, `double` and `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_ROUNDS`	Rounding mode for floating-point addition in `float`, `double` and `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`FLT_TRUE_MIN`	Minimum positive value of `float`	L	M	`<float.h>`					C11
`LDBL_DECIMAL_DIG`	Decimal digits needed to represent any `long double` value without loss	L	M	`<float.h>`					C11
`LDBL_DIG`	Decimal digits of precision of `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_EPSILON`	Difference between 1 and the least `long double` value greater than 1	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_HAS_SUBNORM`	Indicates whether `long double` has subnormal values	L	M	`<float.h>`					C11
`LDBL_MANT_DIG`	Number of base-`FLT_RADIX` digits in mantissa of `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MAX`	Maximum value of `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MAX_10_EXP`	Maximum integral base-10 exponent yielding `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MAX_EXP`	One plus maximum integral exponent of base `FLT_RADIX` yielding `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MIN`	Minimum normalized value of `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MIN_10_EXP`	Minimum integral base-10 exponent yielding normalized `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_MIN_EXP`	One plus minimum integral exponent of base `FLT_RADIX` yielding normalized `long double`	L	M	`<float.h>`	C89	C90	C95	C99	C11
`LDBL_TRUE_MIN`	Minimum positive value of `long double`	L	M	`<float.h>`					C11

This header is available in C89, C90, C95, C99 and C11.

Floating point is a way of representing a large range of real numbers in binary. A floating-point value consists of two signed integers, called the significand (or mantissa) m and the exponent p, and represents the value m×2^p. The mantissa usually has a separate bit to indicate the sign. The exponent is typically stored in biased form, as in IEC 60559.

If m is allowed to have a fractional part, there are infinitely many ways to express any single value, just by varying the exponent, so floating-point values are usually normalized so that the mantissa always has the binary form ±1·xxxxxx. For example, 100 in base 10, which is 1100100 in binary, would be represented as +1·1001×10⁺¹¹⁰ (all digits binary, i.e., 1·5625×2⁶ in decimal). Furthermore, since the top bit is by definition always 1, it does not need to be recorded. A 10-bit mantissa would then hold the pattern 1001000000, while the exponent would still be 110.

Other special patterns may be reserved to represent positive or negative infinity, positive or negative zero, subnormals, and not-a-number (NaN).

C defines the following real floating-point types: float, double and long double.

If they behave in accordance with ISO/IEC 60559, the macro __STDC_IEC_559__ will be predefined with the value 1, which also guarantees the existence of the following symbols:

The real floating-point types and the complex types form the floating-point types.

Are imaginary types also floating-point types?

The header <float.h> defines macros describing the characteristics of the native floating-point types. For example, the range of float is +FLT_MAX to -FLT_MAX.

It also defines FLT_RADIX, the base that all floating-point types use. It is usually 2, but doesn't have to be.

DECIMAL_DIG (from C99) is the number of decimal digits, n, such that any value of the widest supported real floating-point type can be converted to a decimal number with n significant digits and back again without change. It is usually the largest of FLT_DECIMAL_DIG, DBL_DECIMAL_DIG and LDBL_DECIMAL_DIG, which give the same property for the individual types from C11.

Floating-point constants

Decimal floating-point constants always include at least one digit, and at least a period . or an exponent (consisting of e or E, followed by an optional sign, then at least one decimal digit). Without the period and the exponent, such a sequence would be interpreted as an integer constant. The exponent multiplies the represented value by a power of 10. 0.0, 0. and .0 all represent zero. 1e3 represents 1000, 1.2e3 represents 1200, and 1.2e-3 represents 0·0012.

From C99, hexadecimal floating-point constants are permitted, prefixed by 0x or 0X, and use p or P to introduce the exponent, e.g., 0x1EFp+12. The exponent is required, to avoid the token being mistaken for an integer constant. It is always expressed in decimal, and multiplies the represented value by a power of 2.

Floating-point constants may be suffixed by f or F to give them the type float, while l or L gives them the type long double. Otherwise, FP constants are of type double.

Floating-point constants are always non-negative, and do not include a sign. -3.0 is a unary-expression formed from a sign - and a floating-point constant 3.0, but it is still a constant expression.

constant: floating-constant
floating-constant: decimal-floating-constant; hexadecimal-floating-constant (from C99)
decimal-floating-constant: fractional-constant exponent-part_opt floating-suffix_opt; digit-sequence exponent-part floating-suffix_opt
fractional-constant: digit-sequence_opt . digit-sequence; digit-sequence .
exponent-part: e sign_opt digit-sequence; E sign_opt digit-sequence
digit-sequence: digit; digit-sequence digit
digit: any of 0 1 2 3 4 5 6 7 8 9
hexadecimal-floating-constant: hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part floating-suffix_opt; hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part floating-suffix_opt
hexadecimal-fractional-constant: hexadecimal-digit-sequence_opt . hexadecimal-digit-sequence; hexadecimal-digit-sequence .
binary-exponent-part: p sign_opt digit-sequence; P sign_opt digit-sequence
hexadecimal-digit-sequence: hexadecimal-digit; hexadecimal-digit-sequence hexadecimal-digit
hexadecimal-digit: any of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
sign: +; -
floating-suffix: f; l; F; L

Floating-point zero

Mathematically, zero does not have a sign. However, in C, a floating-point zero may have distinct positive and negative representations. C treats positive and negative zero as equal, so you can't test for negative zero with ==, but you can test for a magnitude of zero first, and then check the sign with signbit (from <math.h>, C99 onwards). Note, though, that negative zero does not have to be distinct from positive zero in all environments, and tests to detect it will fail in those cases. fpclassify returns FP_ZERO for a positive or negative zero argument.

Infinity

The floating-point types may permit representations of infinity, and possibly distinct values for positive and negative infinity. <math.h> defines INFINITY as a constant of type float that represents positive infinity if available, or otherwise a positive constant that overflows at translation time.

#include <math.h>
int isfinite(real-floating-type x);
int isinf(real-floating-type x);

The macro isinf yields non-zero for arguments with infinite magnitude, while isfinite yields non-zero for arguments with finite magnitude. fpclassify returns FP_INFINITE for arguments with infinite magnitude, and FP_NORMAL, FP_ZERO or FP_SUBNORMAL for arguments with finite magnitude.

Normal and subnormal numbers

With normal floating-point values, the significand/mantissa begins with 1, and so has no leading zeros. Very small numbers may be represented with subnormal values, where the significand does have leading zeros.

From C11, the macros FLT_HAS_SUBNORM, DBL_HAS_SUBNORM and LDBL_HAS_SUBNORM indicate whether the types float, double and long double, respectively, support subnormal representations, yielding 1 for yes, 0 for no, and -1 if this cannot be determined.

#include <math.h>
int isnormal(real-floating-type x);

The macro isnormal will return non-zero if its argument is normal, and zero otherwise. The macro fpclassify returns FP_SUBNORMAL for subnormal arguments, and FP_NORMAL for normal arguments.

Not-a-number (NaN)

The floating-point types may include the concept of not-a-number, or NaN. NaNs are special values that either propagate silently through most arithmetic operations (quiet NaNs), or raise a floating-point exception when used as an operand (signalling NaNs). A C implementation may support both, either or neither.

<math.h> might define NAN, a constant of type float representing a quiet NaN. The macro is defined only if quiet NaNs are supported in at least float.

Quiet NaNs may also be generated from strings by nan, nanf and nanl.

#include <math.h>
int isnan(real-floating-type x);

The macro isnan yields non-zero when given a quiet NaN argument. The macro fpclassify yields FP_NAN when given a quiet NaN argument.

Ordering and comparison of floating-point numbers

relational-expression: relational-expression < shift-expression; relational-expression > shift-expression; relational-expression <= shift-expression; relational-expression >= shift-expression
equality-expression: equality-expression == relational-expression; equality-expression != relational-expression

The operators <, >, <=, >=, == and != allow real floating-point values to be compared in the expected mathematical ways. == and != should be used with caution, however, as it can be extremely difficult for two computed floating-point expressions to yield exactly the same number. Instead, compare a value with a small range centered on the intended value, e.g., with fabs(val - target) < epsilon, or consider avoiding the circumstance altogether.

<math.h> also defines several macros for comparing two floating-point numbers that, unlike the operators, don't raise the "invalid" floating-point exception when an argument is a NaN.

isless(x, y) is equivalent to (x < y).
isgreater(x, y) is equivalent to (x > y).
islessequal(x, y) is equivalent to (x <= y).
isgreaterequal(x, y) is equivalent to (x >= y).
islessgreater(x, y) is equivalent to (x < y || x > y), except that x and y are evaluated only once.

Also, the macro isunordered returns non-zero if either of its arguments is a NaN.

#include <math.h>
float nextafterf(float x, float y);
double nextafter(double x, double y);
long double nextafterl(long double x, long double y);
float nexttowardf(float x, long double y);
double nexttoward(double x, long double y);
long double nexttowardl(long double x, long double y);

#include <tgmath.h>
real-floating-type nextafter(real-floating-type x, real-floating-type y);
real-floating-type nexttoward(real-floating-type x, long double y);

Real floating-point values can be stepped up or down by the smallest possible amount. The nextafter and nexttoward functions return the next representable value after x in the direction of y. If x equals y, the result is y (converted to the type of the function in the case of nexttoward).

Floating-point classification

#include <math.h>
int fpclassify(real-floating-type x);

The macro fpclassify takes a value of any real floating-point type, and returns a code indicating what kind of value it is. FP_ZERO is returned if the value is positive or negative zero. FP_INFINITE is returned if the value is positive or negative infinity. FP_NAN is returned if the value is NaN. FP_SUBNORMAL is returned if the value is subnormal. FP_NORMAL is returned if the value is normalized. An implementation may define additional classifications, with macro names beginning with FP_ followed by an uppercase letter.

The macros isinf, isfinite, isnormal and isnan test for specific classifications.

Floating-point promotions

float_t should be the most efficient real floating-point type that is at least as wide as float. double_t should be the most efficient floating-point type that is at least as wide as double. These are defined in <math.h>. FLT_EVAL_METHOD can indicate exactly what these types really are:

Value of `FLT_EVAL_METHOD`	Type of `float_t`	Type of `double_t`
`0`	`float`	`double`
`1`	`double`	`double`
`2`	`long double`	`long double`
Other values	Implementation-defined

[Need to say how operations involving two floats would be done using float_t, etc.]