Names specified here
Name Description Notes Source Availability
__STDC_IEC_559__ Indicator of support for IEC 60559-compatible floating-point arithmetic L ? M Predefined C99 C11
double_t Evaluation type for double T <math.h> C99 C11
FP_INFINITE Floating-point classification: infinity M <math.h> C99 C11
FP_NAN Floating-point classification: not a number (NaN) M <math.h> C99 C11
FP_NORMAL Floating-point classification: normal number M <math.h> C99 C11
FP_SUBNORMAL Floating-point classification: subnormal number M <math.h> C99 C11
FP_ZERO Floating-point classification: zero M <math.h> C99 C11
float_t Evaluation type for float T <math.h> C99 C11
fpclassify() Classify floating-point number M (·) <math.h> C99 C11
INFINITY Infinity M <math.h> C99 C11
isfinite() Test for finite value M (·) <math.h> C99 C11
isgreater() Test for ordering M (·) <math.h> C99 C11
isgreaterequal() Test for ordering M (·) <math.h> C99 C11
isinf() Test for infinite value M (·) <math.h> C99 C11
isless() Test for ordering M (·) <math.h> C99 C11
islessequal() Test for ordering M (·) <math.h> C99 C11
islessgreater() Test for ordering M (·) <math.h> C99 C11
isnan() Test for NaN M (·) <math.h> C99 C11
isnormal() Test for normal number M (·) <math.h> C99 C11
isunordered() Test for ordering M (·) <math.h> C99 C11
NAN Not-a-number (NaN) constant ? M <math.h> C99 C11
nextafter() Get next value (·) <math.h> C99 C11
nextafter() Get next value M (·) <tgmath.h> C99 C11
nextafterf() Get next value (·) <math.h> C99 C11
nextafterl() Get next value (·) <math.h> C99 C11
nexttoward() Get next value (·) <math.h> C99 C11
nexttoward() Get next value M (·) <tgmath.h> C99 C11
nexttowardf() Get next value (·) <math.h> C99 C11
nexttowardl() Get next value (·) <math.h> C99 C11
Names specified for <float.h>, “Characteristics of floating-point types”
Name Description Notes Source Availability
DBL_DECIMAL_DIG Decimal digits needed to round-trip any double L M <float.h> C11
DBL_DIG Decimal digits of precision of double L M <float.h> C89 C90 C95 C99 C11
DBL_EPSILON Difference between 1.0 and the least double greater than 1.0 L M <float.h> C89 C90 C95 C99 C11
DBL_HAS_SUBNORM Indicator of whether double has subnormal values L M <float.h> C11
DBL_MANT_DIG Number of base-FLT_RADIX digits in mantissa of double L M <float.h> C89 C90 C95 C99 C11
DBL_MAX Maximum value of double L M <float.h> C89 C90 C95 C99 C11
DBL_MAX_10_EXP Maximum integral base-10 exponent yielding double L M <float.h> C89 C90 C95 C99 C11
DBL_MAX_EXP One plus maximum integral exponent of base FLT_RADIX yielding double L M <float.h> C89 C90 C95 C99 C11
DBL_MIN Minimum normalized value of double L M <float.h> C89 C90 C95 C99 C11
DBL_MIN_10_EXP Minimum integral base-10 exponent yielding normalized double L M <float.h> C89 C90 C95 C99 C11
DBL_MIN_EXP One plus minimum integral exponent of base FLT_RADIX yielding normalized double L M <float.h> C89 C90 C95 C99 C11
DBL_TRUE_MIN Minimum positive value of double L M <float.h> C11
DECIMAL_DIG Decimal digits needed to round-trip the widest real floating-point type L M <float.h> C99 C11
FLT_DECIMAL_DIG Decimal digits needed to round-trip any float L M <float.h> C11
FLT_DIG Decimal digits of precision of float L M <float.h> C89 C90 C95 C99 C11
FLT_EPSILON Difference between 1.0 and the least float greater than 1.0 L M <float.h> C89 C90 C95 C99 C11
FLT_EVAL_METHOD Floating-point evaluation method L M <float.h> C99 C11
FLT_HAS_SUBNORM Indicator of whether float has subnormal values L M <float.h> C11
FLT_MANT_DIG Number of base-FLT_RADIX digits in mantissa of float L M <float.h> C89 C90 C95 C99 C11
FLT_MAX Maximum value of float L M <float.h> C89 C90 C95 C99 C11
FLT_MAX_10_EXP Maximum integral base-10 exponent yielding float L M <float.h> C89 C90 C95 C99 C11
FLT_MAX_EXP One plus maximum integral exponent of base FLT_RADIX yielding float L M <float.h> C89 C90 C95 C99 C11
FLT_MIN Minimum normalized value of float L M <float.h> C89 C90 C95 C99 C11
FLT_MIN_10_EXP Minimum integral base-10 exponent yielding normalized float L M <float.h> C89 C90 C95 C99 C11
FLT_MIN_EXP One plus minimum integral exponent of base FLT_RADIX yielding normalized float L M <float.h> C89 C90 C95 C99 C11
FLT_RADIX Radix of the exponent representation of float, double and long double L M <float.h> C89 C90 C95 C99 C11
FLT_ROUNDS Rounding mode for floating-point addition of float, double and long double L M <float.h> C89 C90 C95 C99 C11
FLT_TRUE_MIN Minimum positive value of float L M <float.h> C11
LDBL_DECIMAL_DIG Decimal digits needed to round-trip any long double L M <float.h> C11
LDBL_DIG Decimal digits of precision of long double L M <float.h> C89 C90 C95 C99 C11
LDBL_EPSILON Difference between 1.0 and the least long double greater than 1.0 L M <float.h> C89 C90 C95 C99 C11
LDBL_HAS_SUBNORM Indicator of whether long double has subnormal values L M <float.h> C11
LDBL_MANT_DIG Number of base-FLT_RADIX digits in mantissa of long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MAX Maximum value of long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MAX_10_EXP Maximum integral base-10 exponent yielding long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MAX_EXP One plus maximum integral exponent of base FLT_RADIX yielding long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MIN Minimum normalized value of long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MIN_10_EXP Minimum integral base-10 exponent yielding normalized long double L M <float.h> C89 C90 C95 C99 C11
LDBL_MIN_EXP One plus minimum integral exponent of base FLT_RADIX yielding normalized long double L M <float.h> C89 C90 C95 C99 C11
LDBL_TRUE_MIN Minimum positive value of long double L M <float.h> C11

This header is available in C89, C90, C95, C99 and C11.

Floating point is a way of representing a large range of real numbers in binary. A floating-point value consists of two signed integers, called the significand (or mantissa) m and the exponent p, and represents the value m×2^p. The mantissa usually has a separate bit to indicate the sign. The exponent is commonly stored in a biased form, with a fixed offset added so that the stored field is never negative.

If m may have a fractional part, there are infinitely many ways to express any single value, just by varying the exponent, so floating-point values are usually normalized so that the mantissa always has the form ±1·xxxxxx. For example, 100 in base 10, which is 1100100 in binary, would be represented in binary as +1·1001×10^+110 (that is, 1·1001 in binary times 2^6). Furthermore, since the top bit of a normalized mantissa is by definition always 1, it does not need to be recorded. A 10-bit mantissa would then hold the pattern 1001000000, while the exponent would still be 110.
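The normalization described above can be checked with frexp from <math.h>. The following is a minimal sketch; split_normalized is a hypothetical helper name, and it assumes a finite, non-zero argument.

```c
#include <math.h>

/* Hypothetical helper: split x into a significand of the form 1.xxxx
   (magnitude in [1, 2)) and a power-of-two exponent, so that
   x == significand * 2^(*exp). Assumes x is finite and non-zero. */
static double split_normalized(double x, int *exp)
{
    int e;
    double frac = frexp(x, &e);  /* x == frac * 2^e, 0.5 <= |frac| < 1 */

    *exp = e - 1;                /* move one factor of 2 into the significand */
    return frac * 2.0;
}
```

For 100.0 this yields a significand of 1.5625 (1·1001 in binary) and an exponent of 6 (110 in binary), matching the worked example.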

Other special patterns may be reserved to represent positive or negative infinity, positive or negative zero, subnormals, and not-a-number (NaN).

C defines the following real floating-point types: float, double and long double.

If they behave in accordance with ISO/IEC 60559, the macro __STDC_IEC_559__ will be predefined with the value 1, which also guarantees the existence of the following symbols:

The real floating-point types and the complex types form the floating-point types.

Are imaginary types also floating-point types?

The header <float.h> defines macros describing the characteristics of the native floating-point types. For example, the finite range of float is -FLT_MAX to +FLT_MAX.

It also defines FLT_RADIX, the base that all floating-point types use. It is usually 2, but doesn't have to be.

DECIMAL_DIG is the number of decimal digits needed to convert a value of the widest supported real floating-point type to decimal and back again without changing it. It therefore matches the largest of FLT_DECIMAL_DIG, DBL_DECIMAL_DIG and LDBL_DECIMAL_DIG, which give the same count for float, double and long double respectively.
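One common use of these digit counts is writing a value as text so that it can be read back unchanged. The sketch below is illustrative only: ROUND_TRIP_DIGITS and round_trips are hypothetical names, x is assumed to be finite, and the result depends on the library's printf and strtod rounding correctly.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

/* DBL_DECIMAL_DIG is C11 only. DECIMAL_DIG (C99) covers the widest
   supported type, so it is a safe, possibly larger, substitute. */
#ifdef DBL_DECIMAL_DIG
#define ROUND_TRIP_DIGITS DBL_DECIMAL_DIG
#else
#define ROUND_TRIP_DIGITS DECIMAL_DIG
#endif

/* Hypothetical helper: format x with enough significant digits,
   parse the text again, and report whether the value survived. */
static int round_trips(double x)
{
    char buf[64];

    /* %e prints one digit before the point, hence the "- 1". */
    snprintf(buf, sizeof buf, "%.*e", ROUND_TRIP_DIGITS - 1, x);
    return strtod(buf, NULL) == x;
}
```

Printing with only DBL_DIG digits would go the other way: any decimal number with that many digits survives a trip through double, but not every double survives a trip through that many digits.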

Floating-point constants

Decimal floating-point constants always include at least one digit, and at least a period . or an exponent (consisting of e or E, and followed by an optional sign, then at least one decimal digit). Without the period and the exponent, such a sequence would be interpreted as an integer constant. The exponent multiplies the represented value by a power of 10. 0.0, 0. and .0 all represent zero. 1e3 represents 1000, 1.2e3 represents 1200, and 1.2e-3 represents 0·0012.

From C99, hexadecimal floating-point constants are permitted, prefixed by 0x or 0X, and use p or P to introduce the exponent, e.g., 0x1EFp+12. The exponent is required, to avoid the token being mistaken for an integer constant. It is always expressed in decimal, and multiplies the represented value by a power of 2.

Floating-point constants may be suffixed by f or F to give them the type float, while l or L gives them the type long double. Otherwise, floating-point constants are of type double.

Floating-point constants are always non-negative, and do not include a sign. -3.0 is a unary-expression formed from a sign - and a floating-point constant 3.0, but it is still a constant expression.
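The rules for exponents and suffixes can be illustrated with a few checks. This is a small sketch with hypothetical function names; each function returns non-zero when the stated property holds, and the hexadecimal form needs a C99 or later compiler.

```c
/* A decimal exponent scales the value by a power of 10. */
static int decimal_exponent_ok(void)
{
    return 1e3 == 1000.0 && 1.2e3 == 1200.0;
}

/* A binary exponent scales a hexadecimal constant by a power of 2:
   0x1EF is 495, so 0x1EFp+12 is 495 * 4096. */
static int hex_exponent_ok(void)
{
    return 0x1EFp+12 == 495.0 * 4096.0 && 0x1.8p1 == 3.0;
}

/* The suffix selects the type; an unsuffixed constant is a double. */
static int suffix_types_ok(void)
{
    return sizeof 1.0f == sizeof(float)
        && sizeof 1.0  == sizeof(double)
        && sizeof 1.0L == sizeof(long double);
}
```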

constant:
    floating-constant

floating-constant:
    decimal-floating-constant
    hexadecimal-floating-constant    (from C99)

decimal-floating-constant:
    fractional-constant exponent-part_opt floating-suffix_opt
    digit-sequence exponent-part floating-suffix_opt

fractional-constant:
    digit-sequence_opt . digit-sequence
    digit-sequence .

exponent-part:
    e sign_opt digit-sequence
    E sign_opt digit-sequence

digit-sequence:
    digit
    digit-sequence digit

digit:
    any of 0 1 2 3 4 5 6 7 8 9

hexadecimal-floating-constant:
    hexadecimal-prefix hexadecimal-fractional-constant binary-exponent-part floating-suffix_opt
    hexadecimal-prefix hexadecimal-digit-sequence binary-exponent-part floating-suffix_opt

hexadecimal-fractional-constant:
    hexadecimal-digit-sequence_opt . hexadecimal-digit-sequence
    hexadecimal-digit-sequence .

binary-exponent-part:
    p sign_opt digit-sequence
    P sign_opt digit-sequence

hexadecimal-digit-sequence:
    hexadecimal-digit
    hexadecimal-digit-sequence hexadecimal-digit

hexadecimal-digit:
    any of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

sign:
    +
    -

floating-suffix:
    f
    l
    F
    L

Floating-point zero

Mathematically, zero does not have a sign. However, in C, a floating-point zero may have distinct positive and negative representations. C treats them as equal, so you can't test for negative zero with ==, but you can test for a magnitude of zero first, and then check the sign with signbit. Note, though, that negative zero does not have to be distinct from positive zero in all environments, and tests to detect it will fail in those cases. fpclassify returns FP_ZERO for a positive or negative zero argument.
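The two-step test described above might look like the following sketch. is_negative_zero is a hypothetical helper name; it can only ever return non-zero in environments that actually have a distinct negative zero.

```c
#include <math.h>

/* Hypothetical helper: non-zero only for a zero whose sign bit is set.
   The == test matches both zeros; signbit then tells them apart. */
static int is_negative_zero(double x)
{
    return x == 0.0 && signbit(x) != 0;
}
```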

Infinity

The floating-point types may permit representations of infinity, and possibly distinct values for positive and negative infinity. <math.h> defines INFINITY as a constant of type float that represents positive (or unsigned) infinity if available, or a positive value too large for float if not.

#include <math.h>
int isfinite(real-floating-type x);
int isinf(real-floating-type x);

The macro isinf yields non-zero for arguments with infinite magnitude, while isfinite yields non-zero for arguments with finite magnitude. fpclassify returns FP_INFINITE for arguments with infinite magnitude, and FP_NORMAL, FP_ZERO or FP_SUBNORMAL for arguments with finite magnitude.
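As a brief illustration, isinf can be combined with signbit to tell the two infinities apart. infinity_sign is a hypothetical helper name, and the sketch assumes an environment where INFINITY really is an infinity.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: +1 for positive infinity, -1 for negative
   infinity, and 0 for everything else (finite values and NaNs). */
static int infinity_sign(double x)
{
    if (!isinf(x))
        return 0;
    return signbit(x) ? -1 : 1;
}
```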

Normal and subnormal numbers

With normal floating-point values, the significand/mantissa begins with 1, and so has no leading zeros. Very small numbers may be represented with subnormal values, where the significand does have leading zeros.

The macros FLT_HAS_SUBNORM, DBL_HAS_SUBNORM and LDBL_HAS_SUBNORM indicate whether the types float, double and long double respectively support subnormal representations, yielding 1 if they do, 0 if they do not, and -1 if this cannot be determined.

#include <math.h>
int isnormal(real-floating-type x);

The macro isnormal will return non-zero if its argument is normal, and zero otherwise. The macro fpclassify returns FP_SUBNORMAL for subnormal arguments, and FP_NORMAL for normal arguments.
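One way to see a subnormal value is to halve the smallest normal number. The sketch below uses a hypothetical function name, and its result is only non-zero where double supports subnormals and they are not flushed to zero.

```c
#include <float.h>
#include <math.h>

/* Hypothetical check: DBL_MIN is the smallest normal double, so half
   of it can only be represented as a subnormal (or flushed to zero). */
static int half_dbl_min_is_subnormal(void)
{
    /* volatile discourages the compiler from folding the division. */
    volatile double smallest_normal = DBL_MIN;
    double half = smallest_normal / 2.0;

    return fpclassify(half) == FP_SUBNORMAL;
}
```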

Not-a-number (NaN)

The floating-point types may include the concept of not-a-number, or NaN. NaNs are special values that either propagate silently through mathematical operations (quiet NaNs), or raise a floating-point exception when used as an operand (signalling NaNs). The C implementation may support both kinds, only one, or neither.

<math.h> might define NAN, a constant of type float representing a quiet NaN. The macro is defined only if quiet NaNs are supported for float, so its existence is a reliable indicator of that support.

Quiet NaNs may also be generated from strings by nan, nanf and nanl.

#include <math.h>
int isnan(real-floating-type x);

The macro isnan yields non-zero when given a quiet NaN argument. The macro fpclassify yields FP_NAN when given a quiet NaN argument.
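A short sketch of generating and detecting a quiet NaN follows. Both function names are hypothetical, and the sketch assumes quiet NaNs are supported for double; where they are not, nan("") returns zero instead.

```c
#include <math.h>

/* Hypothetical helper: obtain a quiet NaN from the empty string. */
static double make_quiet_nan(void)
{
    return nan("");
}

/* Hypothetical helper: a NaN is the only value that compares unequal
   to itself, so this works even without the isnan macro. */
static int is_nan_by_comparison(double x)
{
    return x != x;
}
```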

Ordering and comparison of floating-point numbers

relational-expression:
    relational-expression < shift-expression
    relational-expression > shift-expression
    relational-expression <= shift-expression
    relational-expression >= shift-expression

equality-expression:
    equality-expression == relational-expression
    equality-expression != relational-expression

The operators <, >, <=, >=, == and != allow real floating-point values to be compared in the expected mathematical ways. == and != should be used with caution, however, as it can be extremely difficult for two computed floating-point expressions to yield exactly the same number. Instead, compare a value with a small range centered on the intended value, e.g., with fabs(val - target) < epsilon, or consider avoiding the circumstance altogether.
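The tolerance test mentioned above can be wrapped in a small function. This is a minimal sketch: nearly_equal is a hypothetical name, epsilon is an absolute tolerance chosen by the caller, and a suitable value depends on the magnitudes involved.

```c
#include <math.h>

/* Hypothetical helper: non-zero if val lies strictly within epsilon
   of target. A NaN argument always yields zero. */
static int nearly_equal(double val, double target, double epsilon)
{
    return fabs(val - target) < epsilon;
}
```

For values that vary widely in magnitude, a tolerance scaled by the size of the operands is usually a better fit than a fixed absolute one.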

<math.h> also defines several macros for comparing two floating-point numbers. Unlike the operators, they do not raise the "invalid" floating-point exception when an argument is NaN.

  • isless(x, y) is equivalent to (x < y).
  • isgreater(x, y) is equivalent to (x > y).
  • islessequal(x, y) is equivalent to (x <= y).
  • isgreaterequal(x, y) is equivalent to (x >= y).
  • islessgreater(x, y) is equivalent to (x < y || x > y).

Also, the macro isunordered returns non-zero if either of its arguments is NaN.
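These macros can be combined into a comparison that reports the unordered case explicitly. The sketch below uses a hypothetical function name and a hypothetical return convention.

```c
#include <math.h>

/* Hypothetical helper: three-way comparison that never raises a
   floating-point exception for quiet NaN arguments.
   Returns -1, 0 or 1 for x < y, x == y or x > y, and 2 if either
   argument is NaN (the unordered case). */
static int compare_quiet(double x, double y)
{
    if (isunordered(x, y))
        return 2;
    if (isless(x, y))
        return -1;
    if (isgreater(x, y))
        return 1;
    return 0;
}
```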

#include <math.h>
float nextafterf(float x, float y);
double nextafter(double x, double y);
long double nextafterl(long double x, long double y);
float nexttowardf(float x, long double y);
double nexttoward(double x, long double y);
long double nexttowardl(long double x, long double y);
#include <tgmath.h>
real-floating-type nextafter(real-floating-type x, real-floating-type y);
real-floating-type nexttoward(real-floating-type x, long double y);

The nextafter and nexttoward functions step a real floating-point value by the smallest possible amount: they return the next representable value after x in the direction of y.

Floating-point classification

#include <math.h>
int fpclassify(real-floating-type x);

The macro fpclassify takes a value of any real floating-point type, and returns a code indicating what kind of value it is. FP_ZERO is returned if the value is positive or negative zero. FP_INFINITE is returned if the value is positive or negative infinity. FP_NAN is returned if the value is NaN. FP_SUBNORMAL is returned if the value is subnormal. FP_NORMAL is returned if the value is normalized. Implementations may define further classification codes, with macro names beginning with FP_ and an uppercase letter.

The macros isinf, isfinite, isnormal and isnan test for specific classifications.
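A typical use of fpclassify is a switch over the five standard codes. The sketch below is illustrative; class_name is a hypothetical helper, and the strings it returns are arbitrary labels.

```c
#include <math.h>

/* Hypothetical helper: map the classification of x to a label. */
static const char *class_name(double x)
{
    switch (fpclassify(x)) {
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL:    return "normal";
    case FP_INFINITE:  return "infinite";
    case FP_NAN:       return "nan";
    default:           return "other";  /* implementation-specific code */
    }
}
```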

Floating-point promotions

float_t should be the most efficient real floating-point type that is at least as wide as float. double_t should be the most efficient floating-point type that is at least as wide as double. These are defined in <math.h>. FLT_EVAL_METHOD can indicate exactly what these types really are:

Value of FLT_EVAL_METHOD   Type of float_t   Type of double_t
0                          float             double
1                          double            double
2                          long double       long double
Other values               Implementation-defined
[ Work in progress : Need to say how operations involving two floats would be done using float_t, etc.]
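The table can be spot-checked at run time by comparing sizes. This is only a rough sketch with a hypothetical function name: equal sizes are a necessary condition for two types being the same, not proof of it.

```c
#include <float.h>
#include <math.h>

/* Hypothetical check: 1 if float_t and double_t have the sizes the
   table predicts for this FLT_EVAL_METHOD, 0 if they do not, and -1
   for values of FLT_EVAL_METHOD that the table does not cover. */
static int eval_types_match_table(void)
{
    switch (FLT_EVAL_METHOD) {
    case 0:
        return sizeof(float_t) == sizeof(float)
            && sizeof(double_t) == sizeof(double);
    case 1:
        return sizeof(float_t) == sizeof(double)
            && sizeof(double_t) == sizeof(double);
    case 2:
        return sizeof(float_t) == sizeof(long double)
            && sizeof(double_t) == sizeof(long double);
    default:
        return -1;
    }
}
```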

Page updated 2023-10-04T20:24:03.213+0000