FP64

  • mantissa: 52 bits
  • exponent: 11 bits (stored values 0~2047; normal numbers use 1~2046, 0 is reserved for denorms and zero, and 2047 is reserved for NaN and infinity)
  • sign: 1 bit
  • bias: 1023
  • range: 2^-1022 to 2^1023
  • precision: 15-17 decimal digits
  • size: 8 bytes
  • largest positive value: 1.7976931348623157e+308
  • largest negative value: -1.7976931348623157e+308
  • smallest positive normal value: 2.2250738585072014e-308
  • smallest negative normal value: -2.2250738585072014e-308
  • machine epsilon: 2.2204460492503131e-16
  • smallest denormalized value: 4.9406564584124654e-324
  • largest denormalized value: 2.2250738585072009e-308
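
The FP64 limits listed here can be reproduced from their bit patterns. The following is a minimal Python sketch, assuming the host's `float` is an IEEE 754 binary64 (true on mainstream platforms); the helper name `f64_from_bits` is made up for illustration.

```python
import struct
import sys

def f64_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an FP64 value."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Layout: 1 sign bit | 11 exponent bits | 52 mantissa bits (sign = 0 below)
largest = f64_from_bits(0x7FEF_FFFF_FFFF_FFFF)             # exponent 2046, mantissa all ones
smallest_normal = f64_from_bits(0x0010_0000_0000_0000)     # exponent 1, mantissa 0
largest_subnormal = f64_from_bits(0x000F_FFFF_FFFF_FFFF)   # exponent 0, mantissa all ones
smallest_subnormal = f64_from_bits(0x0000_0000_0000_0001)  # exponent 0, mantissa 1
epsilon = 2.0 ** -52                                       # gap between 1.0 and the next FP64

# Cross-check against what the interpreter reports for its own float type
print(largest == sys.float_info.max)
print(smallest_normal == sys.float_info.min)
print(epsilon == sys.float_info.epsilon)
print(largest_subnormal, smallest_subnormal)
```

Building the values from raw bits keeps the link between the field widths and the limits explicit, which a plain lookup in `sys.float_info` would hide.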

FP32

  • mantissa: 23 bits
  • exponent: 8 bits (stored values 0~255; normal numbers use 1~254, 0 is reserved for denorms and zero, and 255 is reserved for NaN and infinity)
  • sign: 1 bit
  • bias: 127
  • range: 2^-126 to 2^127
  • precision: 6-9 decimal digits
  • size: 4 bytes
  • largest positive value: 3.4028235e+38
  • largest negative value: -3.4028235e+38
  • smallest positive normal value: 1.1754944e-38 (2^-126)
  • smallest negative normal value: -1.1754944e-38 (-2^-126)
  • machine epsilon: 1.1920929e-07
  • smallest denormalized value: 1.4012985e-45 (2^-(126+23))
  • largest denormalized value: 1.1754942e-38
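
To show how the sign, exponent, mantissa, and bias fit together, here is a small Python sketch that splits an FP32 value into its fields and rebuilds the value from them. The function names `f32_fields` and `f32_value` are hypothetical helpers, and the rebuild deliberately ignores the reserved exponent 255.

```python
import struct

def f32_fields(x: float):
    """Split x, rounded to FP32, into (sign, biased exponent, mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def f32_value(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild a finite FP32 value from its fields (exponent 255 is not handled)."""
    if exponent == 0:
        # Zero or denorm: implicit leading 0, exponent fixed at 1 - bias = -126
        magnitude = (mantissa / 2**23) * 2.0 ** -126
    else:
        # Normal: implicit leading 1, exponent = stored value - bias (127)
        magnitude = (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
    return -magnitude if sign else magnitude

print(f32_fields(1.0))                 # fields of 1.0
print(f32_fields(-0.15625))            # -1.25 * 2^-3
print(f32_value(0, 254, 0x7FFFFF))     # largest finite value
print(f32_value(0, 1, 0))              # smallest normal value
print(f32_value(0, 0, 1))              # smallest denormalized value
```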

FP16

  • mantissa: 10 bits
  • exponent: 5 bits (stored values 0~31; normal numbers use 1~30, 0 is reserved for denorms and zero, and 31 is reserved for NaN and infinity)
  • sign: 1 bit
  • bias: 15
  • range: 2^-14 to 2^15
  • precision: 3-4 decimal digits
  • size: 2 bytes
  • largest positive value: 65504
  • largest negative value: -65504
  • smallest positive normal value: 6.1035e-5
  • smallest negative normal value: -6.1035e-5
  • machine epsilon: 9.77e-4
  • smallest denormalized value: 5.96e-8
  • largest denormalized value: 6.0976e-5
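
Python's `struct` module supports IEEE 754 half precision through the `e` format code, so the 16-bit half-precision limits can be checked without third-party packages. This is a sketch only; `to_f16` and `f16_from_bits` are illustrative names, not part of any library.

```python
import struct

def to_f16(x: float) -> float:
    """Round x to the nearest half-precision value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f16_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit unsigned integer as a half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

epsilon = 2.0 ** -10                      # one mantissa step at 1.0

print(to_f16(1.0 + epsilon) > 1.0)        # 1 + epsilon is representable
print(to_f16(1.0 + epsilon / 4) == 1.0)   # well below half a step: rounds back to 1.0

print(f16_from_bits(0x7BFF))              # exponent 30, mantissa all ones: largest finite
print(f16_from_bits(0x0400))              # exponent 1, mantissa 0: smallest normal
print(f16_from_bits(0x0001))              # exponent 0, mantissa 1: smallest denorm
print(f16_from_bits(0x03FF))              # exponent 0, mantissa all ones: largest denorm
```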

BF16

  • mantissa: 7 bits
  • exponent: 8 bits (stored values 0~255; normal numbers use 1~254, 0 is reserved for denorms and zero, and 255 is reserved for NaN and infinity)
  • sign: 1 bit
  • bias: 127
  • range: 2^-126 to 2^127
  • precision: 2-3 decimal digits
  • size: 2 bytes
  • largest positive value: 3.38953139e+38
  • largest negative value: -3.38953139e+38
  • smallest positive normal value: 1.1754944e-38 (2^-126)
  • smallest negative normal value: -1.1754944e-38 (-2^-126)
  • machine epsilon: 7.8125e-03 (2^-7)
  • smallest denormalized value: 9.1835e-41 (2^-(126+7))
  • largest denormalized value: 1.1663e-38
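
Because BF16 keeps the FP32 sign and exponent fields and only shortens the mantissa, a BF16 bit pattern can be decoded by padding it with 16 zero bits and reading the result as FP32. The Python sketch below relies on that; the helper names are hypothetical, and the conversion shown simply truncates, whereas real hardware and libraries usually round to nearest.

```python
import struct

def bf16_from_bits(bits: int) -> float:
    """Decode a 16-bit BF16 pattern by padding it to FP32 with 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def f32_to_bf16_bits_truncate(x: float) -> int:
    """Convert by dropping the low 16 bits of the FP32 encoding (no rounding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

print(bf16_from_bits(0x7F7F))   # exponent 254, mantissa all ones: largest finite
print(bf16_from_bits(0x0080))   # exponent 1, mantissa 0: smallest normal
print(bf16_from_bits(0x0001))   # exponent 0, mantissa 1: smallest denorm
print(bf16_from_bits(0x007F))   # exponent 0, mantissa all ones: largest denorm

# Machine epsilon: distance from 1.0 (0x3F80) to the next BF16 value (0x3F81)
print(bf16_from_bits(0x3F81) - bf16_from_bits(0x3F80))

# Truncating 3.14159 keeps only 7 mantissa bits
print(bf16_from_bits(f32_to_bf16_bits_truncate(3.14159)))
```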

NANOO FP8: E4M3

  • mantissa: 3 bits
  • exponent: 4 bits
  • sign: 1 bit
  • bias: 7
  • range: 2^-6 to 2^7
  • precision: 1-2 decimal digits
  • size: 1 byte
  • largest positive value: 240
  • largest negative value: -240
  • smallest positive normal value: 0.015625 (2^-6)
  • smallest negative normal value: -0.015625 (-2^-6)
  • machine epsilon: 0.125 (2^-3)
  • smallest denormalized value: 0.001953125 (2^-(6+3))
  • largest denormalized value: 0.013671875 (0.875 × 2^-6)

NANOO FP8: E5M2

  • mantissa: 2 bits
  • exponent: 5 bits
  • sign: 1 bit
  • bias: 15
  • range: 2^-14 to 2^15
  • precision: 1-2 decimal digits
  • size: 1 byte
  • largest positive value: 57344 (1.75 × 2^15)
  • largest negative value: -57344
  • smallest positive normal value: 6.1035e-5 (2^-14)
  • smallest negative normal value: -6.1035e-5 (-2^-14)
  • machine epsilon: 0.25 (2^-2)
  • smallest denormalized value: 1.5259e-5 (2^-(14+2))
  • largest denormalized value: 4.5776e-5 (0.75 × 2^-14)
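
The limits of a small format follow directly from its exponent width, mantissa width, and bias. The Python sketch below derives them under one stated assumption: the format follows the IEEE-style conventions used elsewhere in these notes (stored exponent 0 for zero and denorms, all-ones exponent reserved). It does not model any NANOO-specific special-value encoding, and `minifloat_limits` is a hypothetical helper name.

```python
def minifloat_limits(exp_bits: int, man_bits: int, bias: int) -> dict:
    """Derive limits for an IEEE-style format (all-ones exponent reserved)."""
    e_min = 1 - bias                       # exponent of the smallest normal value
    e_max = (2 ** exp_bits - 2) - bias     # exponent of the largest normal value
    return {
        "largest": (2 - 2.0 ** -man_bits) * 2.0 ** e_max,
        "smallest_normal": 2.0 ** e_min,
        "epsilon": 2.0 ** -man_bits,
        "smallest_subnormal": 2.0 ** (e_min - man_bits),
        "largest_subnormal": (1 - 2.0 ** -man_bits) * 2.0 ** e_min,
    }

e4m3 = minifloat_limits(exp_bits=4, man_bits=3, bias=7)
e5m2 = minifloat_limits(exp_bits=5, man_bits=2, bias=15)

for name, limits in (("E4M3", e4m3), ("E5M2", e5m2)):
    print(name, limits)
```

The same function reproduces the larger formats as well, for example `minifloat_limits(5, 10, 15)` for half precision, which makes it a convenient cross-check for any entry in these lists.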

Special case binary representations

  • Zero: 0/1 sign bit for positive or negative zero; exponent and mantissa bits are all 0
  • NaN (Not a Number): all exponent bits are 1 and the mantissa is nonzero; the sign bit can be 0 or 1. The most significant bit of the mantissa determines the type of NaN: "quiet NaN" or "signaling NaN"
  • +/-inf: all exponent bits are 1 and the mantissa is zero; 0/1 sign bit for positive or negative infinity
  • subnormal/denorms:

    Normalized numbers have an implicit leading binary digit of 1. To reduce the loss of precision when an underflow occurs, IEEE 754 can represent values smaller than the smallest normalized number by making the implicit leading digit a 0. Such numbers are called denormal or subnormal. They carry fewer significant digits than a normalized number, but they allow a gradual loss of precision when the result of an operation is not exactly zero but is too close to zero to be represented by a normalized number.

    A denormal number is represented with a biased exponent of all 0 bits, which represents, for example, an exponent of -126 in single precision (not -127), or -1022 in double precision (not -1023). In contrast, the smallest biased exponent representing a normal number is 1.
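
The special-case encodings can be summarized as a classifier over the raw bits. The following Python sketch does this for single precision; `classify_f32` and `f32_bits` are illustrative names, and treating a set top mantissa bit as "quiet" is an assumption that matches the common convention but not every historical platform.

```python
import struct

def classify_f32(bits: int) -> str:
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0xFF:
        if mantissa == 0:
            return "infinity"
        # Assumption: most significant mantissa bit set means quiet NaN
        return "quiet NaN" if mantissa & 0x400000 else "signaling NaN"
    return "normal"

def f32_bits(x: float) -> int:
    """Return the FP32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x3F800000,
                0x7F800000, 0xFF800000, 0x7FC00000, 0x7F800001):
    print(f"{pattern:#010x}", classify_f32(pattern))

print(classify_f32(f32_bits(1e-40)))   # below 2^-126, so stored as a denorm
```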

Operations generating NaN

There are three kinds of operations that can return NaN:

  • Operations with a NaN operand.
  • Indeterminate forms:
    • The divisions (±0) / (±0) and (±∞) / (±∞).
    • The multiplications (±0) × (±∞) and (±∞) × (±0).
    • Remainder x % y when x is an infinity or y is zero.
    • The additions (+∞) + (−∞), (−∞) + (+∞) and equivalent subtractions (+∞) − (+∞) and (−∞) − (−∞).
    • The standard has alternative functions for powers:
      • The standard pow function and the integer exponent pown function define 0^0, 1^∞, and ∞^0 as 1.
      • The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
  • Real operations with complex results, for example:
    • The square root of a negative number.
    • The logarithm of a negative number.
    • The inverse sine or inverse cosine of a number that is less than -1 or greater than 1.
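
A few of these cases can be tried directly in Python, with one caveat: Python is stricter than the bare IEEE 754 behavior and raises an exception for several invalid operations instead of returning NaN. The sketch below uses only the standard `math` module and shows both outcomes.

```python
import math

inf = math.inf
nan = math.nan

# Operations that return NaN in Python
results = {
    "nan + 1": nan + 1.0,      # NaN operand
    "inf - inf": inf - inf,    # indeterminate addition
    "inf * 0": inf * 0.0,      # indeterminate multiplication
    "inf / inf": inf / inf,    # indeterminate division
}
for name, value in results.items():
    print(name, math.isnan(value))

# pow treats the indeterminate power forms as 1
print(math.pow(0.0, 0.0), math.pow(1.0, inf), math.pow(inf, 0.0))

# Invalid operations that Python reports as exceptions rather than NaN
checks = [
    ("0.0 / 0.0", lambda: 0.0 / 0.0),
    ("sqrt(-1)", lambda: math.sqrt(-1.0)),
    ("log(-1)", lambda: math.log(-1.0)),
    ("asin(2)", lambda: math.asin(2.0)),
    ("fmod(inf, 1)", lambda: math.fmod(inf, 1.0)),
]
for label, op in checks:
    try:
        print(label, op())
    except (ZeroDivisionError, ValueError) as exc:
        print(label, "raises", type(exc).__name__)
```

Note that NaN compares unequal to everything, including itself, so `math.isnan` is the reliable test rather than `==`.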