s_cosf.c - OpenGrok history log for /freebsd/lib/msun/src/s

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
# 0dd5a560	28-Jan-2024	Steve Kargl <kargl@FreeBSD.org>	lib/msun: Cleanup after $FreeBSD$ removal Remove no longer needed explicit inclusion of sys/cdefs.h. PR: 276669 MFC after: 1 week
Revision tags: release/14.0.0
# 1d386b48	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
Revision tags: release/13.2.0, release/12.4.0, release/13.1.0, release/12.3.0, release/13.0.0, release/12.2.0, release/11.4.0, release/12.1.0, release/11.3.0, release/12.0.0, release/11.2.0, release/10.4.0, release/11.1.0, release/11.0.1, release/11.0.0, release/10.3.0, release/10.2.0, release/10.1.0, release/9.3.0, release/10.0.0, release/9.2.0, release/8.4.0, release/9.1.0, release/8.3.0_cvs, release/8.3.0, release/9.0.0, release/7.4.0_cvs, release/8.2.0_cvs, release/7.4.0, release/8.2.0, release/8.1.0_cvs, release/8.1.0, release/7.3.0_cvs, release/7.3.0, release/8.0.0_cvs, release/8.0.0, release/7.2.0_cvs, release/7.2.0, release/7.1.0_cvs, release/7.1.0, release/6.4.0_cvs, release/6.4.0
# e822ea5b	25-Feb-2008	Bruce Evans <bde@FreeBSD.org>	Inline __ieee754__rem_pio2f(). On amd64 (A64) and i386 (A64), this gives an average speedup of about 12 cycles or 17% for 9pi/4 < \|x\| <= 219pi/2 and a smaller speedup for larger x, and a small spe Inline __ieee754__rem_pio2f(). On amd64 (A64) and i386 (A64), this gives an average speedup of about 12 cycles or 17% for 9pi/4 < \|x\| <= 219pi/2 and a smaller speedup for larger x, and a small speeddown for \|x\| <= 9pi/4 (only 1-2 cycles average, but that is 4%). Inlining this is less likely to bust caches than inlining the float version since it is much smaller (about 220 bytes text and rodata) and has many fewer branches. However, the float version was already large due to its manual inlining of the branches and also the polynomial evaluations. show more ...
# 70d818a2	25-Feb-2008	Bruce Evans <bde@FreeBSD.org>	Change __ieee754_rem_pio2f() to return double instead of float so that this function and its callers cosf(), sinf() and tanf() don't waste time converting values from doubles to floats and back for \| Change __ieee754_rem_pio2f() to return double instead of float so that this function and its callers cosf(), sinf() and tanf() don't waste time converting values from doubles to floats and back for \|x\| > 9pi/4. All these functions were optimized a few years ago to mostly use doubles internally and across the __kernel() interfaces but not across the __ieee754_rem_pio2f() interface. This saves about 40 cycles in cosf(), sinf() and tanf() for \|x\| > 9pi/4 on amd64 (A64), and about 20 cycles on i386 (A64) (except for cosf() and sinf() in the upper range). 40 cycles is about 35% for \|x\| < 9pi/4 <= 219pi/2 and about 5% for \|x\| > 2*19pi/2. The saving is much larger on amd64 than on i386 since the conversions are not easy to optimize except on i386 where some of them are automatic and others are optimized invalidly. amd64 is still about 10% slower in cosf() and tanf() in the lower range due to conversion overhead. This also gives a tiny speedup for \|x\| <= 9pi/4 on amd64 (by simplifying the code). It also avoids compiler bugs and/or additional slowness in the conversions on (not yet supported) machines where double_t != double. show more ...
Revision tags: release/7.0.0_cvs, release/7.0.0
# 5aa554c7	22-Feb-2008	David Schultz <das@FreeBSD.org>	s/rcsid/__FBSDID/
Revision tags: release/6.3.0_cvs, release/6.3.0, release/6.2.0_cvs, release/6.2.0, release/5.5.0_cvs, release/5.5.0, release/6.1.0_cvs, release/6.1.0
# 8d3b309b	30-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Fixed cosf(x) when x is a "negative" NaNs. I broke this in rev.1.10. cosf(x) is supposed to return something like x when x is a NaN, and we actually fairly consistently return x-x which is normally Fixed cosf(x) when x is a "negative" NaNs. I broke this in rev.1.10. cosf(x) is supposed to return something like x when x is a NaN, and we actually fairly consistently return x-x which is normally very like x (on i386 and and it is x if x is a quiet NaN and x with the quiet bit set if x is a signaling NaN. Rev.1.10 broke this by normalising x to fabsf(x). It's not clear if fabsf(x) is should preserve x if x is a NaN, but it actually clears the sign bit, and other parts of the code depended on this. The bugs can be fixed by saving x before normalizing it, and using the saved x only for NaNs, and using uint32_t instead of int32_t for ix so that negative NaNs are not misclassified even if fabsf() doesn't clear their sign bit, but gcc pessimizes the saving very well, especially on Athlon XPs (it generates extra loads and stores, and mixes use of the SSE and i387, and this somehow messes up pipelines). Normalizing x is not a very good optimization anyway, so stop doing it. (It adds latency to the FPU pipelines, but in previous versions it helped except for \|x\| <= 3pi/4 by simplifying the integer pipelines.) Use the same organization as in s_sinf.c and s_tanf.c with some branches reordered. These changes combined recover most of the performance of the unfixed version on A64 but still lose 10% on AXP with gcc-3.4 -O1 but not with gcc-3.3 -O1. show more ...
# 0bea84b2	28-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Exploit skew-symmetry to avoid an operation: -sin(x-A) = sin(A-x). This gives a tiny but hopefully always free optimization in the 2 quadrants to which it applies. On Athlons, it reduces maximum la Exploit skew-symmetry to avoid an operation: -sin(x-A) = sin(A-x). This gives a tiny but hopefully always free optimization in the 2 quadrants to which it applies. On Athlons, it reduces maximum latency by 4 cycles in these quadrants but has usually has a smaller effect on total time (typically ~2 cycles (~5%), but sometimes 8 cycles when the compiler generates poor code). show more ...
# 35ae3476	28-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Try to use the "proximity" (~) operator consistently in comments (x ~<= a, not x <= ~a). This got messed up in some places when the comments were moved from e_rem_pio2f.c. Added my (non-)copyright.
# 59aad933	28-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Use only double precision for "kernel" cosf and sinf (except for returning float). The functions are renamed from __kernel_{cos,sin}f() to __kernel_{cos,sin}df() so that misuses of them will cause l Use only double precision for "kernel" cosf and sinf (except for returning float). The functions are renamed from __kernel_{cos,sin}f() to __kernel_{cos,sin}df() so that misuses of them will cause link errors and not crashes. This version is an almost-routine translation with no special optimizations for accuracy or efficiency. The not-quite-routine part is that in __kernel_cosf(), regenerating the minimax polynomial with double precision coefficients gives a coefficient for the x2 term that is not quite -0.5, so the literal 0.5 in the code and the related `hz' variable need to be modified; also, the special code for reducing the error in 1.0-x20.5 is no longer needed, so it is convenient to adjust all the logic for the x2 term a little. Note that without extra precision, it would be very bad to use a coefficient of other than -0.5 for the x2 term -- the old version depends on multiplication by -0.5 being infinitely precise so as not to need even more special code for reducing the error in 1-x20.5. This gives an unimportant increase in accuracy, from ~0.8 to ~0.501 ulps. Almost all of the error is from the final rounding step, since the choice of the minimax polynomials so that their contribution to the error is a bit less than 0.5 ulps just happens to give contributions that are significantly less (~.001 ulps). An Athlons, for uniformly distributed args in [-2pi, 2pi], this gives overall speed increases in the 10-20% range, despite giving a speed decrease of typically 19% (from 31 cycles up to 37) for sinf() on args in [-pi/4, pi/4]. show more ...
# 4ce51209	21-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Mess up the "kernel" float trig function .c files with ifdefs so that they can be #included in other .c files to give inline functions, and use them to inline the functions in most callers (not in e_ Mess up the "kernel" float trig function .c files with ifdefs so that they can be #included in other .c files to give inline functions, and use them to inline the functions in most callers (not in e_lgammaf_r.c). __kernel_tanf() is too large and complicated for gcc to inline very well. An athlons, this gives a speed increase under favourable pipeline conditions of about 10% overall (larger for AXP, smaller for A64). E.g., on AXP, sinf() on uniformly distributed args in [-2Pi, 2Pi] now takes 30-56 cycles; it used to take 45-61 cycles; hardware fsin takes 65-129. show more ...
# 8299eb7e	19-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Moved all the optimizations for \|x\| <= 9pi/2 from __ieee754_rem_pio2f() to its 3 callers and manually inline them. On Athlons, with favourable compiler flags and optimizations and favourable pipelin Moved all the optimizations for \|x\| <= 9pi/2 from __ieee754_rem_pio2f() to its 3 callers and manually inline them. On Athlons, with favourable compiler flags and optimizations and favourable pipeline conditions, this gives a speedup of 30-40 cycles for cosf(), sinf() and tanf() on the range pi/4 < \|x\| <= 9pi/4, so thes functions are now signifcantly faster than the hardware trig functions in many cases. E.g., in a benchmark with uniformly distributed x in [-2pi, 2pi], A64 hardware fcos took 72-129 cycles and cosf() took 37-55 cycles. Out-of-order execution is needed to get both of these times. The optimizations in this commit apparently work more by removing 1 serialization point than by reducing latency. show more ...
# 75ff209c	17-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Minor cleanups: s_cosf.c and s_sinf.c: Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the resu Minor cleanups: s_cosf.c and s_sinf.c: Use a non-bogus magic constant for the threshold of pi/4. It was 2 ulps smaller than pi/4 rounded down, but its value is not critical so it should be the result of natural rounding. s_cosf.c and s_tanf.c: Use a literal 0.0 instead of an unnecessary variable initialized to [(float)]0.0. Let the function prototype convert to 0.0F. Improved wording in some comments. Attempted to improve indentation of comments. show more ...
Revision tags: release/6.0.0_cvs, release/6.0.0
# 4339c67c	24-Oct-2005	Bruce Evans <bde@FreeBSD.org>	Moved the optimization for tiny x from __kernel_{cos,sin}[f](x) to {cos_sin}[f](x) so that x doesn't need to be reclassified in the "kernel" functions to determine if it is tiny (it still needs to be Moved the optimization for tiny x from __kernel_{cos,sin}[f](x) to {cos_sin}[f](x) so that x doesn't need to be reclassified in the "kernel" functions to determine if it is tiny (it still needs to be reclassified in the cosine case for other reasons that will go away). This optimization is quite large for exponentially distributed x, since x is tiny for almost half of the domain, but it is a pessimization for uniformally distributed x since it takes a little time for all cases but rarely applies. Arg reduction on exponentially distributed x rarely gives a tiny x unless the reduction is null, so it is best to only do the optimization if the initial x is tiny, which is what this commit arranges. The imediate result is an average optimization of 1.4% relative to the previous version in a case that doesn't favour the optimization (double cos(x) on all float x) and a large pessimization for the relatively unimportant cases of lgamma[f][_r](x) on tiny, negative, exponentially distributed x. The optimization should be recovered for lgamma() as part of fixing lgamma()'s low-quality arg reduction. Fixed various wrong constants for the cutoff for "tiny". For cosine, the cutoff is when x2/2! == {FLT or DBL}_EPSILON/2. We round down to an integral power of 2 (and for cos() reduce the power by another 1) because the exact cutoff doesn't matter and would take more work to determine. For sine, the exact cutoff is larger due to the ration of terms being x2/3! instead of x2/2!, but we use the same cutoff as for cosine. We now use a cutoff of 2-27 for double precision and 2-12 for single precision. 2-27 was used in all cases but was misspelled 2**27 in comments. Wrong and sloppy cutoffs just cause missed optimizations (provided the rounding mode is to nearest -- other modes just aren't supported). show more ...
Revision tags: release/5.4.0_cvs, release/5.4.0, release/4.11.0_cvs, release/4.11.0, release/5.3.0_cvs, release/5.3.0, release/4.10.0_cvs, release/4.10.0, release/5.2.1_cvs, release/5.2.1, release/5.2.0_cvs, release/5.2.0, release/4.9.0_cvs, release/4.9.0, release/5.1.0_cvs, release/5.1.0, release/4.8.0_cvs, release/4.8.0, release/5.0.0_cvs, release/5.0.0, release/4.7.0_cvs, release/4.6.2_cvs, release/4.6.2, release/4.6.1, release/4.6.0_cvs
# 59b19ff1	28-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Fix formatting, this is hard to explain, so I'll show one example. - float ynf(int n, float x) /* wrapper ynf / +float +ynf(int n, float x) / wrapper ynf / This is because the __S Fix formatting, this is hard to explain, so I'll show one example. - float ynf(int n, float x) / wrapper ynf / +float +ynf(int n, float x) / wrapper ynf */ This is because the __STDC__ stuff was indented. Reviewed by: md5 show more ...
# 2dcc2286	28-May-2002	Alfred Perlstein <alfred@FreeBSD.org>	Assume __STDC__, remove non-__STDC__ code. Reviewed by: md5
Revision tags: release/4.5.0_cvs, release/4.4.0_cvs, release/4.3.0_cvs, release/4.3.0, release/4.2.0, release/4.1.1_cvs, release/4.1.0, release/3.5.0_cvs, release/4.0.0_cvs, release/3.4.0_cvs, release/3.3.0_cvs
# 7f3dea24	28-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
Revision tags: release/3.2.0, release/3.1.0, release/3.0.0, release/2.2.8, release/2.2.7, release/2.2.6, release/2.2.5_cvs, release/2.2.2_cvs, release/2.2.1_cvs, release/2.2.0, release/2.1.7_cvs
# 7e546392	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Revert $FreeBSD$ to $Id$
Revision tags: release/2.1.6_cvs, release/2.1.6.1
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise. show more ...
Revision tags: release/2.1.5_cvs, release/2.1.0_cvs, release/2.0.5_cvs
# 6c06b4e2	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
Revision tags: release/2.0
# 3a8617a8	19-Aug-1994	Jordan K. Hubbard <jkh@FreeBSD.org>	J.T. Conklin's latest version of the Sun math library. -- Begin comments from J.T. Conklin: The most significant improvement is the addition of "float" versions of the math functions that take float J.T. Conklin's latest version of the Sun math library. -- Begin comments from J.T. Conklin: The most significant improvement is the addition of "float" versions of the math functions that take float arguments, return floats, and do all operations in floating point. This doesn't help (performance) much on the i386, but they are still nice to have. The float versions were orginally done by Cygnus' Ian Taylor when fdlibm was integrated into the libm we support for embedded systems. I gave Ian a copy of my libm as a starting point since I had already fixed a lot of bugs & problems in Sun's original code. After he was done, I cleaned it up a bit and integrated the changes back into my libm. -- End comments Reviewed by: jkh Submitted by: jtc show more ...
Revision tags: release/13.2.0, release/12.4.0, release/13.1.0, release/12.3.0, release/13.0.0, release/12.2.0, release/11.4.0, release/12.1.0, release/11.3.0, release/12.0.0, release/11.2.0, release/10.4.0, release/11.1.0, release/11.0.1, release/11.0.0, release/10.3.0, release/10.2.0, release/10.1.0, release/9.3.0, release/10.0.0, release/9.2.0, release/8.4.0, release/9.1.0, release/8.3.0_cvs, release/8.3.0, release/9.0.0, release/7.4.0_cvs, release/8.2.0_cvs, release/7.4.0, release/8.2.0, release/8.1.0_cvs, release/8.1.0, release/7.3.0_cvs, release/7.3.0, release/8.0.0_cvs, release/8.0.0, release/7.2.0_cvs, release/7.2.0, release/7.1.0_cvs, release/7.1.0, release/6.4.0_cvs, release/6.4.0
# e822ea5b	25-Feb-2008	Bruce Evans <bde@FreeBSD.org>	Inline __ieee754__rem_pio2f(). On amd64 (A64) and i386 (A64), this gives an average speedup of about 12 cycles or 17% for 9pi/4 < \|x\| <= 219pi/2 and a smaller speedup for larger x, and a small spe Inline __ieee754__rem_pio2f(). On amd64 (A64) and i386 (A64), this gives an average speedup of about 12 cycles or 17% for 9pi/4 < \|x\| <= 219pi/2 and a smaller speedup for larger x, and a small speeddown for \|x\| <= 9pi/4 (only 1-2 cycles average, but that is 4%). Inlining this is less likely to bust caches than inlining the float version since it is much smaller (about 220 bytes text and rodata) and has many fewer branches. However, the float version was already large due to its manual inlining of the branches and also the polynomial evaluations. show more ...
# 70d818a2	25-Feb-2008	Bruce Evans <bde@FreeBSD.org>	Change __ieee754_rem_pio2f() to return double instead of float so that this function and its callers cosf(), sinf() and tanf() don't waste time converting values from doubles to floats and back for \| Change __ieee754_rem_pio2f() to return double instead of float so that this function and its callers cosf(), sinf() and tanf() don't waste time converting values from doubles to floats and back for \|x\| > 9pi/4. All these functions were optimized a few years ago to mostly use doubles internally and across the __kernel() interfaces but not across the __ieee754_rem_pio2f() interface. This saves about 40 cycles in cosf(), sinf() and tanf() for \|x\| > 9pi/4 on amd64 (A64), and about 20 cycles on i386 (A64) (except for cosf() and sinf() in the upper range). 40 cycles is about 35% for \|x\| < 9pi/4 <= 219pi/2 and about 5% for \|x\| > 2*19pi/2. The saving is much larger on amd64 than on i386 since the conversions are not easy to optimize except on i386 where some of them are automatic and others are optimized invalidly. amd64 is still about 10% slower in cosf() and tanf() in the lower range due to conversion overhead. This also gives a tiny speedup for \|x\| <= 9pi/4 on amd64 (by simplifying the code). It also avoids compiler bugs and/or additional slowness in the conversions on (not yet supported) machines where double_t != double. show more ...
Revision tags: release/7.0.0_cvs, release/7.0.0
# 5aa554c7	22-Feb-2008	David Schultz <das@FreeBSD.org>	s/rcsid/__FBSDID/
Revision tags: release/6.3.0_cvs, release/6.3.0, release/6.2.0_cvs, release/6.2.0, release/5.5.0_cvs, release/5.5.0, release/6.1.0_cvs, release/6.1.0
# 8d3b309b	30-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Fixed cosf(x) when x is a "negative" NaNs. I broke this in rev.1.10. cosf(x) is supposed to return something like x when x is a NaN, and we actually fairly consistently return x-x which is normally Fixed cosf(x) when x is a "negative" NaNs. I broke this in rev.1.10. cosf(x) is supposed to return something like x when x is a NaN, and we actually fairly consistently return x-x which is normally very like x (on i386 and and it is x if x is a quiet NaN and x with the quiet bit set if x is a signaling NaN. Rev.1.10 broke this by normalising x to fabsf(x). It's not clear if fabsf(x) is should preserve x if x is a NaN, but it actually clears the sign bit, and other parts of the code depended on this. The bugs can be fixed by saving x before normalizing it, and using the saved x only for NaNs, and using uint32_t instead of int32_t for ix so that negative NaNs are not misclassified even if fabsf() doesn't clear their sign bit, but gcc pessimizes the saving very well, especially on Athlon XPs (it generates extra loads and stores, and mixes use of the SSE and i387, and this somehow messes up pipelines). Normalizing x is not a very good optimization anyway, so stop doing it. (It adds latency to the FPU pipelines, but in previous versions it helped except for \|x\| <= 3pi/4 by simplifying the integer pipelines.) Use the same organization as in s_sinf.c and s_tanf.c with some branches reordered. These changes combined recover most of the performance of the unfixed version on A64 but still lose 10% on AXP with gcc-3.4 -O1 but not with gcc-3.3 -O1. show more ...
# 0bea84b2	28-Nov-2005	Bruce Evans <bde@FreeBSD.org>	Exploit skew-symmetry to avoid an operation: -sin(x-A) = sin(A-x). This gives a tiny but hopefully always free optimization in the 2 quadrants to which it applies. On Athlons, it reduces maximum la Exploit skew-symmetry to avoid an operation: -sin(x-A) = sin(A-x). This gives a tiny but hopefully always free optimization in the 2 quadrants to which it applies. On Athlons, it reduces maximum latency by 4 cycles in these quadrants but has usually has a smaller effect on total time (typically ~2 cycles (~5%), but sometimes 8 cycles when the compiler generates poor code). show more ...
12