s_cbrtf.c - OpenGrok history log for /freebsd/lib/msun/src/s

Revision	Date	Author	Comments
# 0dd5a560	28-Jan-2024	Steve Kargl <kargl@FreeBSD.org>	lib/msun: Cleanup after $FreeBSD$ removal Remove no longer needed explicit inclusion of sys/cdefs.h. PR: 276669 MFC after: 1 week
Revision tags: release/14.0.0
# 1d386b48	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
Revision tags: release/13.2.0, release/12.4.0, release/13.1.0, release/12.3.0, release/13.0.0, release/12.2.0, release/11.4.0, release/12.1.0, release/11.3.0, release/12.0.0, release/11.2.0, release/10.4.0, release/11.1.0, release/11.0.1, release/11.0.0, release/10.3.0, release/10.2.0, release/10.1.0, release/9.3.0, release/10.0.0, release/9.2.0, release/8.4.0, release/9.1.0, release/8.3.0_cvs, release/8.3.0, release/9.0.0, release/7.4.0_cvs, release/8.2.0_cvs, release/7.4.0, release/8.2.0, release/8.1.0_cvs, release/8.1.0, release/7.3.0_cvs, release/7.3.0, release/8.0.0_cvs, release/8.0.0, release/7.2.0_cvs, release/7.2.0, release/7.1.0_cvs, release/7.1.0, release/6.4.0_cvs, release/6.4.0, release/7.0.0_cvs, release/7.0.0
# 5aa554c7	22-Feb-2008	David Schultz <das@FreeBSD.org>	s/rcsid/__FBSDID/
Revision tags: release/6.3.0_cvs, release/6.3.0
# 20a99011	29-May-2007	Bruce Evans <bde@FreeBSD.org>	Merge the relevant part of rev.1.14 of s_cbrt.c (a micro-optimization involving moving the check for x == 0). The savings in cycles are smaller for cbrtf() than for cbrt(), and positive in all measured cases with gcc-3.4.4, but still very machine/compiler-dependent.
Revision tags: release/6.2.0_cvs, release/6.2.0, release/5.5.0_cvs, release/5.5.0, release/6.1.0_cvs, release/6.1.0
# fd289100	05-Jan-2006	Bruce Evans <bde@FreeBSD.org>	Oops, on amd64 (and probably on all non-i386 systems), the previous commit broke the 2^24 cases where |x| > DBL_MAX/2. There are exponent range problems not just for denormals (underflow) but for large values (overflow). Doubles have more than enough exponent range to avoid the problems, but I forgot to convert enough terms to double, so there was an x+x term which was sometimes evaluated in float precision. Unfortunately, this is a pessimization with some combinations of systems and compilers (it makes no difference on Athlon XP's, but on Athlon64's it gives a 5% pessimization with gcc-3.4 but not with gcc-3.3). Explain the problem better in comments.
# 4bb97803	05-Jan-2006	Bruce Evans <bde@FreeBSD.org>	Use double precision internally to optimize cbrtf(), and change the algorithm for the second step significantly to also get a perfectly rounded result in round-to-nearest mode. The resulting optimization is about 25% on Athlon64's and 30% on Athlon XP's (about 25 cycles out of 100 on the former). Using extra precision, we don't need to do anything special to avoid large rounding errors in the third step (Newton's method), so we can regroup terms to avoid a division, increase clarity, and increase opportunities for parallelism. Rearrangement for parallelism loses the increase in clarity. We end up with the same number of operations but with a division reduced to a multiplication. Using specifically double precision, there is enough extra precision for the third step to give enough precision for perfect rounding to float precision provided the previous steps are accurate to 16 bits. (They were accurate to 12 bits, which was almost minimal for imperfect rounding in the old version but would be more than enough for imperfect rounding in this version (9 bits would be enough now).) I couldn't find any significant time optimizations from optimizing the previous steps, so I decided to optimize for accuracy instead. The second step needed a division although a previous commit optimized it to use a polynomial approximation for its main detail, and this division dominated the time for the second step. Use the same Newton's method for the second step as for the third step since this is insignificantly slower than the division plus the polynomial (now that Newton's method only needs 1 division), significantly more accurate, and simpler.
Single precision would be precise enough for the second step, but doesn't have enough exponent range to handle denormals without the special grouping of terms (as in previous versions) that requires another division, so we use double precision for both the second and third steps.
# c5964538	19-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Use a minimax polynomial approximation instead of a Pade rational function approximation for the second step. The polynomial has degree 2 for cbrtf() and 4 for cbrt(). These degrees are minimal for the final accuracy to be essentially the same as before (slightly smaller). Adjust the rounding between steps 2 and 3 to match. Unfortunately, for cbrt(), this breaks the claimed accuracy slightly although incorrect rounding doesn't. Claim less accuracy since it's not worth pessimizing the polynomial or relying on exhaustive testing to get insignificantly more accuracy. This saves about 30 cycles on Athlons (mainly by avoiding 2 divisions) so it gives an overall optimization in the 10-25% range (a larger percentage for float precision, especially in 32-bit mode, since other overheads are more dominant for double precision, surprisingly more in 32-bit mode).
# ce804bff	18-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Fixed code to match comments and the algorithm: - in preparing for the third approximation, actually make t larger in magnitude than cbrt(x). After chopping, t must be incremented by 2 ulps to make it larger, not 1 ulp since chopping can reduce it by almost 1 ulp and it might already be up to half a different-sized-ulp smaller than cbrt(x). I have not found any cases where this is essential, but the think-time error bound depends on it. The relative smallness of the different-sized-ulp limited the bug. If there are cases where this is essential, then the final error bound would be 5/6+epsilon instead of 4/6+epsilon ulps (still < 1). - in preparing for the third approximation, round more carefully (but still sloppily to avoid branches) so that the claimed error bound of 0.667 ulps is satisfied in all cases tested for cbrt() and remains satisfied in all cases for cbrtf(). There isn't enough spare precision for very sloppy rounding to work: - in cbrt(), even with the inadequate increment, the actual error was 0.6685 in some cases, and correcting the increment increased this a little. The fix uses sloppy rounding to 25 bits instead of very sloppy rounding to 21 bits, and starts using uint64_t instead of 2 words for bit manipulation so that rounding more bits is not much costly. - in cbrtf(), the 0.667 bound was already satisfied even with the inadequate increment, but change the code to almost match cbrt() anyway. There is not enough spare precision in the Newton approximation to double the inadequate increment without exceeding the 0.667 bound, and no spare precision to avoid this problem as in cbrt(). The fix is to round using an increment of 2 smaller-ulps before chopping so that an increment of 1 ulp is enough.
In cbrt(), we essentially do the same, but move the chop point so that the increment of 1 is not needed. Fixed comments to match code: - in cbrt(), the second approximation is good to 25 bits, not quite 26 bits. - in cbrt(), don't claim that the second approximation may be implemented in single precision. Single precision cannot handle the full exponent range without minor but pessimal changes to renormalize, and although single precision is enough, 25 bit precision is now claimed and used. Added comments about some of the magic for the error bound 4/6+epsilon. I still don't understand why it is 4/6+ and not 6/6+ ulps. Indent comments at the right of code more consistently.
# ec761d75	13-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Optimize by not doing excessive conversions for handling the sign bit. This gives an optimization of between 9 and 22% on Athlons (largest for cbrt() on amd64 -- from 205 to 159 cycles). We extracted the sign bit and worked with |x|, and restored the sign bit as the last step. We avoided branches to a fault by using accesses to FP values as bits to clear and restore the sign bit. Avoiding branches is usually good, but the bit access macros are not so good (especially for setting FP values), and here they always caused pipeline stalls on Athlons. Even using branches would be faster except on args that give perfect branch misprediction, since only mispredicted branches cause stalls, but it is possible to avoid touching the sign bit in FP values at all (except to preserve it in conversions from bits to FP not related to the sign bit). Do this. The results are identical except in 2 of the 3 unsupported rounding modes, since all the approximations use odd rational functions so they work right on strictly negative values, and the special case of -0 doesn't use an approximation.
# 7d5a4821	13-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Fixed some especially horrible style bugs (indentation that is neither KNF nor fdlibmNF combined with multiple statements per line).
# af7f9913	11-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Added comments about the magic behind <cbrt(x) in bits> ~= <x in bits>/3 + BIAS. Keep the large comments only in the double version as usual. Fixed some style bugs (mainly grammar and spelling errors in comments).
# 288a8c86	11-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Fixed the unexpectedly large maximum error after the previous commit. It was because I forgot to translate the part of the double precision algorithm that chops t so that tt is exact. Now the maximum error is the same as for double precision (almost exactly 2.0/3 ulps).
# 6de073b4	11-Dec-2005	Bruce Evans <bde@FreeBSD.org>	Fixed all 502518670 errors of more than 1 ulp for cbrtf() on amd64. The maximum error was 3.56 ulps. The bug was another translation error. The double precision version has a comment saying "new cbrt to 23 bits, may be implemented in single precision". This means exactly what it says -- that the 23 bit second approximation for the double precision cbrt() may be implemented in single (i.e., float) precision. It doesn't mean what the translation assumed -- that this approximation, when implemented in float precision, is good enough for the final approximation in float precision. First, float precision needs a 24 bit approximation. The "23 bit" approximation is actually good to 24 bits on float precision args, but only if it is evaluated in double precision. Second, the algorithm requires a cleanup step to ensure its error bound. In float precision, any reasonable algorithm works for the cleanup step. Use the same algorithm as for double precision, although this is much more than enough and is a significant pessimization, and don't optimize or simplify anything using double precision to implement the float case, so that the whole double precision algorithm can be verified in float precision. A maximum error of 0.667 ulps is claimed for cbrt() and the max for cbrtf() using the same algorithm shouldn't be different, but the actual max for cbrtf() on amd64 is now 0.9834 ulps. (On i386 -O1 the max is 0.5006 (down from < 0.7) due to extra precision.)
Revision tags: release/6.0.0_cvs, release/6.0.0, release/5.4.0_cvs, release/5.4.0, release/4.11.0_cvs, release/4.11.0, release/5.3.0_cvs, release/5.3.0, release/4.10.0_cvs, release/4.10.0, release/5.2.1_cvs, release/5.2.1, release/5.2.0_cvs, release/5.2.0, release/4.9.0_cvs, release/4.9.0, release/5.1.0_cvs, release/5.1.0, release/4.8.0_cvs, release/4.8.0, release/5.0.0_cvs, release/5.0.0, release/4.7.0_cvs, release/4.6.2_cvs, release/4.6.2, release/4.6.1, release/4.6.0_cvs
# 59b19ff1	28-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Fix formatting, this is hard to explain, so I'll show one example. - float ynf(int n, float x) /* wrapper ynf */ +float +ynf(int n, float x) /* wrapper ynf */ This is because the __STDC__ stuff was indented. Reviewed by: md5
# 2dcc2286	28-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Assume __STDC__, remove non-__STDC__ code. Reviewed by: md5
Revision tags: release/4.5.0_cvs, release/4.4.0_cvs, release/4.3.0_cvs, release/4.3.0, release/4.2.0, release/4.1.1_cvs, release/4.1.0, release/3.5.0_cvs, release/4.0.0_cvs, release/3.4.0_cvs, release/3.3.0_cvs
# 7f3dea24	28-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
Revision tags: release/3.2.0, release/3.1.0, release/3.0.0, release/2.2.8, release/2.2.7, release/2.2.6, release/2.2.5_cvs, release/2.2.2_cvs, release/2.2.1_cvs, release/2.2.0, release/2.1.7_cvs
# 7e546392	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Revert $FreeBSD$ to $Id$
Revision tags: release/2.1.6_cvs, release/2.1.6.1
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
Revision tags: release/2.1.5_cvs, release/2.1.0_cvs, release/2.0.5_cvs
# 6c06b4e2	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
Revision tags: release/2.0
# 3a8617a8	19-Aug-1994	Jordan K. Hubbard <jkh@FreeBSD.org>	J.T. Conklin's latest version of the Sun math library. -- Begin comments from J.T. Conklin: The most significant improvement is the addition of "float" versions of the math functions that take float arguments, return floats, and do all operations in floating point. This doesn't help (performance) much on the i386, but they are still nice to have. The float versions were originally done by Cygnus' Ian Taylor when fdlibm was integrated into the libm we support for embedded systems. I gave Ian a copy of my libm as a starting point since I had already fixed a lot of bugs & problems in Sun's original code. After he was done, I cleaned it up a bit and integrated the changes back into my libm. -- End comments Reviewed by: jkh Submitted by: jtc
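
Illustrative sketch (not part of the history log). The commits af7f9913, 4bb97803 and fd289100 together describe a three-step scheme for cbrtf(): a rough estimate from integer arithmetic on the bit pattern (<x in bits>/3 + BIAS), followed by two applications of the iteration the log calls Newton's method, carried out in double precision and regrouped so that each needs only one division. The C below is a minimal sketch of that scheme, not the committed code. The name cbrtf_sketch, the memcpy-based bit helpers, the bias constants B1 and B2, and the exact regrouped form of the iteration are assumptions recalled from the msun source and should be checked against s_cbrtf.c itself.

```c
#include <stdint.h>
#include <string.h>

/* Assumed bias constants; verify against the real s_cbrtf.c. */
static const uint32_t B1 = 709958130;	/* for normal inputs */
static const uint32_t B2 = 642849266;	/* for inputs pre-scaled by 2**24 */

/* Hypothetical helpers: portable access to a float's bit pattern. */
static uint32_t
float_bits(float f)
{
	uint32_t u;

	memcpy(&u, &f, sizeof(u));
	return (u);
}

static float
bits_float(uint32_t u)
{
	float f;

	memcpy(&f, &u, sizeof(f));
	return (f);
}

float
cbrtf_sketch(float x)
{
	uint32_t hx, sign;
	double r, T;
	float t;

	hx = float_bits(x);
	sign = hx & 0x80000000u;
	hx ^= sign;			/* bits of |x| */

	if (hx >= 0x7f800000u)
		return (x + x);		/* NaN or infinity */
	if (hx == 0)
		return (x);		/* +-0 */

	/* Step 1: rough estimate, <x in bits>/3 + BIAS. */
	if (hx < 0x00800000u) {
		/* Subnormal: scale by 2**24 first, compensate in the bias. */
		t = x * 16777216.0f;
		t = bits_float(sign | ((float_bits(t) & 0x7fffffffu) / 3 + B2));
	} else
		t = bits_float(sign | (hx / 3 + B1));

	/*
	 * Steps 2 and 3: the same iteration twice, in double precision.
	 * The x+x term is forced to double so that it cannot overflow
	 * for large |x|, the problem fd289100 describes.
	 */
	T = t;
	r = T * T * T;
	T = T * ((double)x + x + r) / (x + r + r);
	r = T * T * T;
	T = T * ((double)x + x + r) / (x + r + r);

	return ((float)T);		/* round once, to float */
}
```

Because x and the estimate share a sign, the ratio in each iteration is positive and the sign never has to be touched in floating point, which matches the motivation given in ec761d75. For an exact cube such as 27.0f the result rounds to exactly 3.0f.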