$NetBSD: softfloat.txt,v 1.1 2000/06/06 08:15:10 bjh21 Exp $
$FreeBSD$

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
formats are supported:  single precision, double precision, extended double
precision, and quadruple precision.  All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat.  It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard.  Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic.  If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions.  When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704.  Funding was partially
provided by the National Science Foundation under grant MIP-9311980.  The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).  The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used.  The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type.  The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
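
As a rough illustration of the layout claim above, the following C sketch
copies the bytes of a native `float' into a 32-bit integer and back.  The
`float32' typedef and the helper functions are stand-ins written for this
example, not declarations from `softfloat.h', and the expected bit patterns
assume that the native `float' is IEC/IEEE single precision.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for SoftFloat's `float32'; the real typedef comes from
   `softfloat.h' and may name a different 32-bit integer type. */
typedef uint32_t float32;

/* Reinterpret a native float as a float32 bit pattern.  memcpy copies the
   bytes unchanged, so no numeric conversion takes place. */
float32 float32_from_native(float f)
{
    float32 a;
    memcpy(&a, &f, sizeof a);
    return a;
}

/* Reinterpret a float32 bit pattern as a native float. */
float float32_to_native(float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}

/* With an IEC/IEEE single-precision `float', 1.0 is encoded as sign 0,
   biased exponent 127, fraction 0, and -2.0 as sign 1, biased exponent
   128, fraction 0. */
void float32_layout_example(void)
{
    assert(sizeof(float) == sizeof(float32));
    assert(float32_from_native(1.0f) == 0x3F800000);
    assert(float32_to_native(0xC0000000) == -2.0f);
}
```

Copying bytes with `memcpy' sidesteps the aliasing problems that a pointer
cast between `float *' and an integer pointer could cause.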

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format.  (The
   floating-point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.
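
The effect of the four modes can be sketched with plain integer arithmetic.
The function below discards the low bits of a magnitude and decides whether
the kept part must be incremented, which is the decision a rounding step has
to make.  The enum constants and the function name are hypothetical and
local to this example; they are not SoftFloat's identifiers, and the code is
not taken from the SoftFloat sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local names for the four IEC/IEEE rounding modes. */
enum round_mode { ROUND_NEAREST_EVEN, ROUND_TO_ZERO, ROUND_DOWN, ROUND_UP };

/* Discard the low `shift' bits (1 <= shift <= 63) of the magnitude `mag'
   and round the remaining bits.  `negative' is nonzero when the value
   being rounded is negative, which matters for the two directed modes. */
uint64_t round_magnitude(uint64_t mag, int shift, int negative,
                         enum round_mode mode)
{
    uint64_t kept = mag >> shift;
    uint64_t lost = mag & (((uint64_t)1 << shift) - 1);
    uint64_t half = (uint64_t)1 << (shift - 1);
    int increment = 0;

    if (lost == 0)
        return kept;                    /* already exact */
    switch (mode) {
    case ROUND_NEAREST_EVEN:            /* ties go to the even neighbor */
        increment = lost > half || (lost == half && (kept & 1));
        break;
    case ROUND_TO_ZERO:                 /* always truncate the magnitude */
        break;
    case ROUND_DOWN:                    /* toward minus infinity */
        increment = negative != 0;
        break;
    case ROUND_UP:                      /* toward plus infinity */
        increment = !negative;
        break;
    }
    return kept + increment;
}

/* 5 with one bit discarded stands for 2.5; 7 stands for 3.5. */
void round_mode_example(void)
{
    assert(round_magnitude(5, 1, 0, ROUND_NEAREST_EVEN) == 2);
    assert(round_magnitude(7, 1, 0, ROUND_NEAREST_EVEN) == 4);
    assert(round_magnitude(5, 1, 0, ROUND_TO_ZERO) == 2);
    assert(round_magnitude(5, 1, 1, ROUND_DOWN) == 3);  /* -2.5 -> -3 */
    assert(round_magnitude(5, 1, 1, ROUND_UP) == 2);    /* -2.5 -> -2 */
}
```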


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'.  The operations affected are:

   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format.  Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively.  When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified.  Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
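
To make ``bits beyond the rounding point are set to zero'' concrete, the
sketch below rounds a 64-bit extended double-precision significand to 24 or
53 significant bits, the significand widths of single and double precision.
It handles only the nearest/even mode, the function name is hypothetical,
and the code is an illustration rather than SoftFloat's own routine.

```c
#include <assert.h>
#include <stdint.h>

/* Round the 64-bit significand `sig' to `sig_bits' significant bits
   (24, 53, or 64) using nearest/even, leaving the discarded low bits
   zero.  If rounding carries out of the top bit, the result is
   renormalized to 2^63 and *carry is set to 1 so that the caller can
   increment the exponent. */
uint64_t round_sig_nearest_even(uint64_t sig, int sig_bits, int *carry)
{
    int drop = 64 - sig_bits;           /* number of bits to clear */
    uint64_t unit, mask, lost, half, kept;

    *carry = 0;
    if (drop == 0)
        return sig;                     /* full precision: nothing to do */
    unit = (uint64_t)1 << drop;         /* weight of the last kept bit */
    mask = unit - 1;
    lost = sig & mask;
    half = unit >> 1;
    kept = sig & ~mask;                 /* low bits are now zero */
    if (lost > half || (lost == half && (kept & unit))) {
        kept += unit;
        if (kept == 0) {                /* wrapped past 2^64 */
            kept = (uint64_t)1 << 63;
            *carry = 1;
        }
    }
    return kept;
}

void rounding_precision_example(void)
{
    int carry;

    /* One stray low bit disappears when rounding to 24 bits. */
    assert(round_sig_nearest_even(0x8000000000000001ULL, 24, &carry)
           == 0x8000000000000000ULL && carry == 0);
    /* All ones rounds up and carries into the exponent. */
    assert(round_sig_nearest_even(0xFFFFFFFFFFFFFFFFULL, 53, &carry)
           == 0x8000000000000000ULL && carry == 1);
}
```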


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented.  Each flag is stored as a unique bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
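
The following sketch shows the raise, test, and clear pattern in one place.
So that it compiles on its own, it declares local stand-ins for
`float_exception_flags', the five masks, and `float_raise'.  The mask values
and the `int' type are placeholders chosen for this example; real code
should include `softfloat.h' and use the definitions found there.

```c
#include <assert.h>

/* Placeholder mask values; the real ones are defined by `softfloat.h'
   and may differ from system to system. */
enum {
    float_flag_inexact   = 1,
    float_flag_divbyzero = 2,
    float_flag_underflow = 4,
    float_flag_overflow  = 8,
    float_flag_invalid   = 16
};

/* Stand-in for the global flags variable, initialized to no exceptions. */
int float_exception_flags = 0;

/* Stand-in for `float_raise': it only accumulates the flags.  The real
   function may also trap or abort, depending on the system. */
void float_raise(int flags)
{
    float_exception_flags |= flags;
}

void exception_flags_example(void)
{
    float_exception_flags = 0;

    /* Flags are sticky: they stay set until explicitly cleared. */
    float_raise(float_flag_overflow | float_flag_inexact);
    assert(float_exception_flags & float_flag_overflow);

    /* Clear one flag and leave the others untouched. */
    float_exception_flags &= ~float_flag_overflow;
    assert(!(float_exception_flags & float_flag_overflow));
    assert(float_exception_flags & float_flag_inexact);
}
```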

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

   int32_to_float32      int64_to_float32
   int32_to_float64      int64_to_float64
   int32_to_floatx80     int64_to_floatx80
   int32_to_float128     int64_to_float128

   float32_to_int32      float32_to_int64
   float64_to_int32      float64_to_int64
   floatx80_to_int32     floatx80_to_int64
   float128_to_int32     float128_to_int64

   float32_to_float64    float32_to_floatx80   float32_to_float128
   float64_to_float32    float64_to_floatx80   float64_to_float128
   floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
   float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

   float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
   float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
   floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
   float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
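
The NaN and overflow rules described above can be modeled with a native
`double' operand.  The sketch below follows the round-toward-zero variant
for a 32-bit result.  The function name and the `invalid' output parameter
are hypothetical, introduced only for this example; the real functions take
SoftFloat types and report the invalid exception through the exception
flags.

```c
#include <assert.h>
#include <math.h>       /* NAN, used only in the example */
#include <stdint.h>

/* Convert `x' to a 32-bit integer, rounding toward zero.  *invalid is
   set to 1 when the value has no representable 32-bit result. */
int32_t double_to_int32_round_to_zero_sketch(double x, int *invalid)
{
    *invalid = 0;
    if (x != x) {                       /* only a NaN differs from itself */
        *invalid = 1;
        return INT32_MAX;               /* NaN: largest positive integer */
    }
    if (x >= 2147483648.0) {            /* truncates to 2^31 or more */
        *invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {           /* truncates below -2^31 */
        *invalid = 1;
        return INT32_MIN;
    }
    return (int32_t)x;                  /* C conversion truncates */
}

void conversion_example(void)
{
    int invalid;

    assert(double_to_int32_round_to_zero_sketch(42.9, &invalid) == 42);
    assert(invalid == 0);
    assert(double_to_int32_round_to_zero_sketch(-1e10, &invalid)
           == INT32_MIN);
    assert(invalid == 1);
    assert(double_to_int32_round_to_zero_sketch(NAN, &invalid)
           == INT32_MAX);
    assert(invalid == 1);
}
```

The range checks come before the cast because converting an out-of-range
`double' to an integer type is undefined behavior in C.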

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

   float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
   float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
   floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
   float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard.  The remainder functions are:

   float32_rem
   float64_rem
   floatx80_rem
   float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.
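
The choice of n, including the halfway case, can be checked on small
integer-valued operands, where everything is exact.  The sketch below
applies the rule x - n*y to `long' values.  The function name is
hypothetical, the code assumes a nonzero y and operands small enough that
no intermediate step overflows, and it is an illustration of the definition
rather than SoftFloat's algorithm.

```c
#include <assert.h>

/* Remainder in the IEC/IEEE sense for integer-valued operands:
   x - n*y with n the integer nearest to x/y, ties going to even n. */
long ieee_rem_sketch(long x, long y)
{
    long n = x / y;                     /* quotient truncated toward zero */
    long r = x % y;                     /* x - n*y; has the sign of x */
    long ar = r < 0 ? -r : r;
    long ay = y < 0 ? -y : y;
    long rest = ay - ar;                /* distance to the next multiple */

    /* If the next multiple of y is closer, or equally close while the
       truncated n is odd, move n one step away from zero. */
    if (ar > rest || (ar == rest && n % 2 != 0)) {
        long step = ((r < 0) != (y < 0)) ? -1 : 1;   /* sign of r/y */
        r -= step * y;
    }
    return r;
}

void remainder_example(void)
{
    assert(ieee_rem_sketch(5, 2) == 1);     /* 2.5 is a tie: n = 2 */
    assert(ieee_rem_sketch(7, 2) == -1);    /* 3.5 is a tie: n = 4 */
    assert(ieee_rem_sketch(8, 3) == -1);    /* 2.67 rounds to n = 3 */
    assert(ieee_rem_sketch(-7, 2) == 1);    /* -3.5 is a tie: n = -4 */
}
```

Note that the result can be negative even when both operands are positive,
which distinguishes this operation from the C `%' operator.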

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard.  The functions are:

   float32_round_to_int
   float64_round_to_int
   floatx80_round_to_int
   float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

   float32_eq    float32_le    float32_lt
   float64_eq    float64_le    float64_lt
   floatx80_eq   floatx80_le   floatx80_lt
   float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided.  The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
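
The derivations just described might be written as the three wrappers at
the end of the sketch below.  The names `float32_ne', `float32_ge', and
`float32_gt' are hypothetical and are not part of SoftFloat.  So that the
example compiles on its own, `float32_eq', `float32_le', and `float32_lt'
are replaced by stand-ins built on the native `float' type, which assumes an
IEC/IEEE single-precision `float'; real code would call the SoftFloat
functions instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t float32;               /* stand-in for SoftFloat's type */

/* Reinterpret a float32 bit pattern as a native float. */
static float to_native(float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}

/* Stand-ins for the SoftFloat comparison functions. */
int float32_eq(float32 a, float32 b) { return to_native(a) == to_native(b); }
int float32_le(float32 a, float32 b) { return to_native(a) <= to_native(b); }
int float32_lt(float32 a, float32 b) { return to_native(a) <  to_native(b); }

/* Hypothetical derived comparisons. */
int float32_ne(float32 a, float32 b) { return !float32_eq(a, b); }
int float32_ge(float32 a, float32 b) { return float32_le(b, a); }
int float32_gt(float32 a, float32 b) { return float32_lt(b, a); }

void float32_compare_example(void)
{
    const float32 one = 0x3F800000;     /* 1.0 */
    const float32 two = 0x40000000;     /* 2.0 */
    const float32 nan = 0x7FC00000;     /* a quiet NaN */

    assert(float32_gt(two, one));
    assert(float32_ge(one, one));
    assert(!float32_ge(one, two));
    assert(float32_ne(nan, nan));       /* a NaN is unequal to everything */
    assert(!float32_ge(nan, one));      /* ordered comparisons are false */
}
```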

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

   float32_eq_signaling    float32_le_quiet    float32_lt_quiet
   float64_eq_signaling    float64_le_quiet    float64_lt_quiet
   floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
   float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

   float32_is_signaling_nan
   float64_is_signaling_nan
   floatx80_is_signaling_nan
   float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
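
For single precision, such a test might look like the sketch below.  It
assumes the common convention in which a NaN is quiet when the most
significant fraction bit (bit 22) is set and signaling when that bit is
clear; some architectures use the opposite convention, and SoftFloat's
target-specific code decides which one applies.  The `float32' typedef and
the function name are stand-ins for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t float32;               /* stand-in for SoftFloat's type */

/* Return 1 if `a' encodes a signaling NaN, assuming that bit 22 set
   means "quiet".  A NaN has an all-ones exponent field and a nonzero
   fraction; an all-zero fraction would be an infinity instead. */
int float32_is_signaling_nan_sketch(float32 a)
{
    uint32_t exponent = (a >> 23) & 0xFF;
    uint32_t fraction = a & 0x007FFFFF;

    return exponent == 0xFF
        && fraction != 0
        && (fraction & 0x00400000) == 0;
}

void signaling_nan_example(void)
{
    assert(float32_is_signaling_nan_sketch(0x7FA00000) == 1); /* signaling */
    assert(float32_is_signaling_nan_sketch(0x7FC00000) == 0); /* quiet */
    assert(float32_is_signaling_nan_sketch(0x7F800000) == 0); /* infinity */
    assert(float32_is_signaling_nan_sketch(0x3F800000) == 0); /* 1.0 */
}
```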

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.