$NetBSD: softfloat.txt,v 1.1 2000/06/06 08:15:10 bjh21 Exp $
$FreeBSD$

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four
formats are supported: single precision, double precision, extended double
precision, and quadruple precision. All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat. It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard. Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code. The
SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt
has been made to accommodate compilers that are not ISO-conformant. In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic. If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions. When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser. This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704. Funding was partially
provided by the National Science Foundation under grant MIP-9311980. The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types: `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).
The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used. The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types: `float32' and `float64'. Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type. The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types. (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format.
   (The floating-point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding. The rounding mode is selected
by the global variable `float_rounding_mode'. This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'. The rounding mode is initialized
to nearest/even.


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'. The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format. Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively. When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified. Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
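
As an illustration of what rounding to reduced precision means for the
significand, the following sketch rounds a 64-bit extended double-precision
significand to a given number of significant bits and sets the bits beyond
the rounding point to zero. It is not SoftFloat code: the function name
`demo_round_significand' and its interface are hypothetical, only the
nearest/even rounding mode is modeled, and the widths 24 and 53 are the
single- and double-precision significand widths of the IEC/IEEE Standard.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of SoftFloat.
 * Rounds a 64-bit significand (most significant bit at bit 63) to `bits'
 * significant bits, 1 <= bits <= 64, using round-to-nearest-even.
 * All bits below the rounding point are returned as zero.
 * *carry is set to 1 when rounding carries out of bit 63; the returned
 * significand is then 0x8000000000000000 and a caller would be expected
 * to increment the exponent.
 */
static uint64_t demo_round_significand(uint64_t sig, int bits, int *carry)
{
    *carry = 0;
    if (bits >= 64) {
        return sig;                               /* full precision: nothing to do */
    }

    int drop = 64 - bits;                         /* number of low bits discarded */
    uint64_t mask = (UINT64_C(1) << drop) - 1;    /* selects the discarded bits */
    uint64_t half = UINT64_C(1) << (drop - 1);    /* exactly half of one kept unit */
    uint64_t rem  = sig & mask;                   /* value of the discarded bits */
    uint64_t kept = sig & ~mask;                  /* discarded bits forced to zero */

    /* Round up when above the halfway point, or exactly halfway with an
       odd kept part (ties go to the even neighbor). */
    if (rem > half || (rem == half && ((kept >> drop) & 1))) {
        kept += UINT64_C(1) << drop;
        if (kept == 0) {                          /* wrapped past bit 63 */
            *carry = 1;
            kept = UINT64_C(1) << 63;
        }
    }
    return kept;
}
```

Calling the helper with 53 models the significand handling for a rounding
precision of 64, and calling it with 24 models a rounding precision of 32.
The real library also has to handle the other three rounding modes and the
exponent range, which this sketch leaves out.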


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented. Each flag is stored as a unique bit in the global variable
`float_exception_flags'. The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'. The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name. To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding. The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals. The other option is provided for compatibility
with some systems. Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.
The complete set of conversion functions is:

    int32_to_float32           int64_to_float32
    int32_to_float64           int64_to_float64
    int32_to_floatx80          int64_to_floatx80
    int32_to_float128          int64_to_float128

    float32_to_int32           float32_to_int64
    float64_to_int32           float64_to_int64
    floatx80_to_int32          floatx80_to_int64
    float128_to_int32          float128_to_int64

    float32_to_float64         float32_to_floatx80        float32_to_float128
    float64_to_float32         float64_to_floatx80        float64_to_float128
    floatx80_to_float32        floatx80_to_float64        floatx80_to_float128
    float128_to_float32        float128_to_float64        float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result. Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding. Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits). If the floating-point operand is a NaN, the largest
positive integer is returned. Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.
Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero     float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero     float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero    floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero    float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard. The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands. The operands and result are all
of the same type. Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y. If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.
The remainder functions are always exact and so require no rounding.

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions. This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard. The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type. (Note that the result is not an integer type.) The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq     float32_le     float32_lt
    float64_eq     float64_le     float64_lt
    floatx80_eq    floatx80_le    floatx80_lt
    float128_eq    float128_le    float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_. The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided. The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs. For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling     float32_le_quiet     float32_lt_quiet
    float64_eq_signaling     float64_le_quiet     float64_lt_quiet
    floatx80_eq_signaling    floatx80_le_quiet    floatx80_lt_quiet
    float128_eq_signaling    float128_le_quiet    float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input. Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise. No
result is returned. In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.
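
The way a raise function interacts with sticky exception flags can be
modeled in a few lines. The sketch below is a stand-alone stand-in rather
than SoftFloat itself: every `demo_' name is hypothetical, the bit values
chosen for the masks are assumptions (the real masks come from `softfloat.h'
and depend on the port), and no trap or abort is modeled.

```c
/*
 * Hypothetical stand-ins for illustration only. The real masks and the
 * real flags variable are declared in softfloat.h; the bit positions
 * used here are assumed, not taken from any SoftFloat port.
 */
enum {
    demo_flag_inexact   = 0x01,
    demo_flag_underflow = 0x02,
    demo_flag_overflow  = 0x04,
    demo_flag_divbyzero = 0x08,
    demo_flag_invalid   = 0x10
};

/* Initialized to all 0, meaning no exceptions have been raised. */
static unsigned int demo_exception_flags = 0;

/* Model of a raise function: set every flag in `mask'. Flags already
   set stay set, and this model never traps or aborts. */
static void demo_raise(unsigned int mask)
{
    demo_exception_flags |= mask;
}

/* Return 1 if any flag in `mask' is currently set, 0 otherwise. */
static int demo_flags_set(unsigned int mask)
{
    return (demo_exception_flags & mask) != 0;
}

/* Clear the flags in `mask', using the same `&= ~' idiom this document
   gives for clearing an individual exception flag. */
static void demo_clear(unsigned int mask)
{
    demo_exception_flags &= ~mask;
}
```

A typical pattern with such flags is to clear the ones of interest, run a
computation, and test the flags afterward; because the flags are sticky,
a single test at the end covers the whole computation.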

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.