[HN Gopher] Implementing Cosine in C from Scratch (2020)
       ___________________________________________________________________
        
       Implementing Cosine in C from Scratch (2020)
        
       Author : signa11
       Score  : 150 points
       Date   : 2022-03-29 16:38 UTC (6 hours ago)
        
 (HTM) web link (austinhenley.com)
 (TXT) w3m dump (austinhenley.com)
        
       | aaaronic wrote:
       | Mandatory Physics "troll" comment:
       | 
        | For small angles, sin(x) ~= x and cos(x) ~= 1 -- the "small angle
        | approximation"
        | (https://en.wikipedia.org/wiki/Small-angle_approximation)
       | 
       | It's actually kind of ridiculous how many times it comes up and
       | works well enough in undergraduate Mechanics.
        
         | andi999 wrote:
          | Don't forget optics, where tan x ~= sin x
        
           | aaaronic wrote:
           | Yep, it's all over Physics, really. Many non-mechanical
           | concepts are often mapped to a mechanical analog model (so
           | many tiny oscillators out there!).
        
         | Sharlin wrote:
         | Useful in astronomy and telephoto photography as well.
        
         | jhgb wrote:
          | This is a perfectly nice Programming 101 example of a recursive
         | function -- if an angle is too large, calculate its sine using
         | the sine of the half-angle, otherwise return the angle (or, if
         | you're fancy, some simple polynomial approximation of the
         | sine). I'm sure everyone did this in school (we did, in fact).
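          | 
          | A minimal sketch of that recursion (untested; the 1e-4 cutoff
          | is an arbitrary base-case threshold, and the argument is
          | assumed already reduced to [-pi, pi] so that cos(x/2) >= 0):
          | 
          |     #include <math.h>
          | 
          |     double rec_sin(double x) {
          |         if (fabs(x) < 1e-4)
          |             return x;          /* small-angle base case */
          |         double s = rec_sin(x / 2.0);
          |         /* sin(x) = 2 sin(x/2) cos(x/2) */
          |         return 2.0 * s * sqrt(1.0 - s * s);
          |     }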
        
           | gugagore wrote:
           | It's not obvious when you should switch over to an
           | approximation (base case), so I'd say it's not a good example
           | to introduce recursion. I have never seen it, and I did not
           | do it in school.
        
             | jhgb wrote:
             | It's not obvious but that makes it a nice example in
             | parameter optimization as well.
        
             | scythe wrote:
             | It's "obvious" if you remember how the Taylor series for
             | sin(x) works. The error between x and sin(x) is bounded
             | above by x^3/6. So there you go.
        
             | aaaronic wrote:
             | When the difference between iterations is less than
             | whatever threshold of accuracy you're looking for, I'd
             | assume.
             | 
             | "Relaxation" algorithms for calculating fields work that
             | way.
        
         | dekhn wrote:
          | IIRC this comes up in the derivation of the Boltzmann
          | distribution. I also recall Stirling's approximation for x!.
        
       | nwallin wrote:
       | It's all well and good to have a library with fast sin/cos/tan
       | functions, but always remember to cheat if you can.
       | 
       | For instance, in a game, it's common to have a gun/spell or
       | whatever that shoots enemies in a cone shape. Like a shotgun or
        | burning hands. One way to code this is to calculate the angle
        | between where you're pointing and the enemies (using arccos), and
        | if that angle is small enough, apply the damage or whatever.
       | 
        | A better way to do it is to take the vector where the shotgun is
        | pointing and the vector to a candidate enemy, and take the dot
        | product of those two. Pre-compute the cosine of the angle of
        | effect, and compare the dot product to that -- if the dot product
        | is higher, it's a hit; if it's lower, it's a miss. (You can get
        | away with _very_ rough normalization in a game, for instance
        | using the rsqrt instruction in SSE2.) You've taken a potentially
        | slow arccos operation and turned it into a fast handful of basic
        | float operations.
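        | 
        | An untested sketch of that test (vec3 and the helper names are
        | illustrative, not from any particular engine):
        | 
        |     typedef struct { float x, y, z; } vec3;
        | 
        |     static float dot(vec3 a, vec3 b) {
        |         return a.x * b.x + a.y * b.y + a.z * b.z;
        |     }
        | 
        |     /* aim and to_enemy roughly unit length; cos_half_angle
        |        precomputed once as cosf(cone_half_angle) */
        |     static int in_cone(vec3 aim, vec3 to_enemy,
        |                        float cos_half_angle) {
        |         return dot(aim, to_enemy) >= cos_half_angle;
        |     }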
       | 
        | Or for instance, you're moving something in a circle over time.
        | You might have something like
        | 
        |     float angle = 0;
        |     for (...) {
        |         angle += 0.01;
        |         <something with sin(angle), cos(angle)>
        |     }
        | 
        | Instead, you might do:
        | 
        |     const float sin_dx = sin(0.01);
        |     const float cos_dx = cos(0.01);
        |     float sin_angle = 0;
        |     float cos_angle = 1;
        |     for (...) {
        |         const float new_sin_angle =
        |             sin_angle * cos_dx + cos_angle * sin_dx;
        |         cos_angle = cos_angle * cos_dx - sin_angle * sin_dx;
        |         sin_angle = new_sin_angle;
        |     }
        | 
        | And you've replaced a sin/cos pair with 6 elementary float
        | operations.
       | 
       | And there's the ever popular comparing squares of distances
       | instead of comparing distances, saving yourself a square root.
       | 
       | In general, inner loops should never have a transcendental or
        | exact square root in them. If you think you need one, there's
        | almost always a way to hoist it from the inner loop out into an
        | outer loop.
        
       | aj7 wrote:
       | So 57 years ago, there weren't any programming books. Well, there
       | was one. McCracken and Dorn, Numerical Methods and Fortran
       | Programming. A kindly childless couple, Ben and Bluma Goldin, who
       | lived in the same apartment house in Brooklyn, bought me this
       | book as a gift. An axiomatic lesson from the book was distrust of
       | Taylor expansions, and the necessity of moving the expansion
       | point of any function to a favorable location. Because harmonic
       | functions are very well behaved, it should also be obvious that
       | even a very coarse lookup table + interpolation should converge
       | to high accuracy quickly. Finally, use of trigonometric
       | identities to zero in on the desired angle is helpful. OK I'll
       | shut up now.
        
       | ok123456 wrote:
        | A lot of BASIC implementations that had software support for
        | floating point numbers implemented it like this. Microsoft BASIC
        | used a 6th-degree polynomial, IIRC.
        
       | nonrandomstring wrote:
        | Audio stuff uses different methods depending on the application.
        | Additive oscillators need to be accurate and can use some closed
        | forms and series; mostly people use Taylor series and quadratic
        | interpolation, as the article shows. For tables, remember you
        | only really need to compute and store one quadrant and shift it
        | around. But for LFOs people use all manner of quick-and-dirty
        | methods like Bhaskara.
       | 
       | https://en.wikipedia.org/wiki/Bhaskara_I%27s_sine_approximat...
        
       | dilyevsky wrote:
        | I remember researching CORDIC for a pet FPGA project I had and
        | finding some libs/discussions, so it's definitely used.
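        | 
        | For the curious, a rough, untested sketch of fixed-point CORDIC
        | in rotation mode (everything scaled by 2^16; the input angle
        | should be range-reduced to about +-1.74 rad first; the constants
        | are atan(2^-i) and 1/gain):
        | 
        |     #include <stdint.h>
        | 
        |     static const int32_t atan_tab[16] = { /* atan(2^-i) * 65536 */
        |         51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
        |         256, 128, 64, 32, 16, 8, 4, 2 };
        | 
        |     void cordic(int32_t theta, int32_t *s, int32_t *c) {
        |         int32_t x = 39797, y = 0;  /* (1/gain, 0), gain ~1.6468 */
        |         for (int i = 0; i < 16; i++) {
        |             int32_t dx = x >> i, dy = y >> i;
        |             if (theta >= 0) {
        |                 x -= dy; y += dx; theta -= atan_tab[i];
        |             } else {
        |                 x += dy; y -= dx; theta += atan_tab[i];
        |             }
        |         }
        |         *c = x; *s = y;            /* cos, sin scaled by 65536 */
        |     }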
        
       | xioxox wrote:
       | Here's a solution which is vectorizable:
       | http://gallium.inria.fr/blog/fast-vectorizable-math-approx/
       | 
       | I've also read that Chebyshev polynomials are far better for
       | approximating functions than Taylor series (eg
       | https://en.wikipedia.org/wiki/Approximation_theory)
        
         | freemint wrote:
          | This matters less once you put things through the Horner
          | scheme. And if you are not, you are leaving performance on the
          | table.
        
       | dheera wrote:
       | > So we are really only calculating cosine from 0 to pi.
       | 
        | You could go further and do only 0 to pi/2, as it is mirrored and
        | flipped for pi/2 to pi.
        | 
        | Or you could go even further and do only 0 to pi/4 and use a
        | simple trig identity and your presumably parallel-developed sin
        | function for the cos of pi/4 to pi/2.
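        | 
        | A sketch of that folding (untested; my_sin and my_cos_small are
        | hypothetical kernels valid on [0, pi/4], and x is assumed
        | already reduced to [0, pi]):
        | 
        |     #include <math.h>
        | 
        |     extern double my_sin(double);        /* [0, pi/4] kernel */
        |     extern double my_cos_small(double);  /* [0, pi/4] kernel */
        | 
        |     double cos_folded(double x) {
        |         double sign = 1.0;
        |         if (x > M_PI_2) {          /* mirror and flip */
        |             x = M_PI - x;
        |             sign = -1.0;
        |         }
        |         if (x > M_PI_4)            /* cos(x) = sin(pi/2 - x) */
        |             return sign * my_sin(M_PI_2 - x);
        |         return sign * my_cos_small(x);
        |     }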
       | 
       | > One of the implementations is nearly 3x faster than math.h
       | 
       | Is this true even for -O3?
        
       | blt wrote:
        | If you want sinusoidal oscillation over time, then you can
        | integrate the second-order differential equation x'' = -w^2 x
        | instead. This only costs a few arithmetic instructions per time
        | step. On the other hand, it adds a dependency on state other than
        | the current time, and you must be careful to use energy-
        | conserving integration.
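        | 
        | A minimal untested sketch using semi-implicit (symplectic)
        | Euler, which keeps the amplitude bounded where plain forward
        | Euler would spiral outward:
        | 
        |     /* x'' = -w^2 x; x(0) = 1 gives out[i] ~= cos(w * i * dt) */
        |     void oscillate(double w, double dt, int n, double *out) {
        |         double x = 1.0, v = 0.0;
        |         for (int i = 0; i < n; i++) {
        |             out[i] = x;
        |             v -= w * w * x * dt;   /* update velocity first... */
        |             x += v * dt;           /* ...then position */
        |         }
        |     }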
        
         | duped wrote:
          | Here's a quick way of doing it with one matrix multiplication,
          | you just need to precompute cos(o) and sin(o) with a bit of
          | intuition:
          | 
          |     [ x(n+1) ]   [ cos(o)  -sin(o) ] [ x(n) ]
          |     [ y(n+1) ] = [ sin(o)   cos(o) ] [ y(n) ]
          | 
          | where
          | 
          |     x(n) = cos(o n)
          |     y(n) = sin(o n)
          |     x(0) = 1
          |     y(0) = 0
          |     o    = frequency (radians) * time_step
         | 
         | There are two ways to look at this, from the systems
         | perspective this is computing the impulse response of a
         | critically stable filter with poles at e^(+/-jo).
         | Geometrically, it's equivalent to starting at (1, 0) and
         | rotating around the unit circle by o radians each time step.
         | You can offset to arbitrary phases.
         | 
          | It's not suitable for numerous cases (notably if the frequency
          | needs to change in real time) but provided that the
          | coefficients sum to `1.0` it's guaranteed stable and will not
          | alias.
        
       | user_7832 wrote:
       | Slightly off topic but kudos to the author for explaining the
       | process extremely clearly. As someone without experience of low
       | level programing this was quite understandable for me (albeit
       | with uni-level maths knowledge).
        
       | em3rgent0rdr wrote:
        | Speaking of lookup tables, I recall reading about the DX-7
        | (https://www.righto.com/2021/11/reverse-engineering-yamaha-dx...
        | shared a few months ago on HN): the DX-7 has a 4096-entry 14-bit
        | sine ROM built into the chip itself.
        
       | anonymousiam wrote:
       | I did the same thing for sqrt() seven years ago. It benchmarked
       | at about five times faster than math.h/libm, but the trade was a
       | reduction in the maximum value of the result. My version would
       | not produce accurate results for inputs greater than 2^16. It did
       | work very well for generating real-time laser pointing X/Y
       | coordinates for an Archimedean Spiral.
       | 
       | https://en.wikipedia.org/wiki/Archimedean_spiral
       | 
        | (This was on an MSP430 platform with an FPU that only did
        | multiplication.)
        
         | adgjlsfhk1 wrote:
         | I'm surprised this was faster than an initial guess (half the
          | exponent and mantissa) followed by Newton's method.
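          | 
          | Roughly like this untested sketch: frexp() supplies the
          | "half the exponent" starting guess, and a few Newton steps
          | y <- (y + x/y) / 2 do the rest (assumes x > 0):
          | 
          |     #include <math.h>
          | 
          |     double newton_sqrt(double x) {
          |         int e;
          |         frexp(x, &e);                  /* x = m * 2^e */
          |         double y = ldexp(1.0, e / 2);  /* guess ~ 2^(e/2) */
          |         for (int i = 0; i < 6; i++)
          |             y = 0.5 * (y + x / y);     /* Newton for y^2 = x */
          |         return y;
          |     }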
        
       | [deleted]
        
       | ThePhysicist wrote:
       | I remember doing this in Macromedia Flash as I wanted to make
       | stuff go in circles and the language didn't have any higher math
       | functions back then. Good ole times.
        
       | todd8 wrote:
       | The article uses a game movement function as a possible
       | motivation for needing the cosine. If object.rotation is constant
       | as the spiral animation seems to suggest, there is an important
       | optimization for rapidly computing the sequence: cos(theta),
        | cos(theta + delta), cos(theta + 2*delta), cos(theta + 3*delta),
       | ...
       | 
       | This comes up in movement under constant rate of rotation as well
       | as Fourier transforms. See [1] for details, but the basic idea
        | uses the simple trig identities:
        | 
        |     cos(theta + delta) = cos(theta) - (a*cos(theta) + b*sin(theta))
        |     sin(theta + delta) = sin(theta) - (a*sin(theta) - b*cos(theta))
        | 
        | The values for a and b must be calculated, but only once if delta
        | is constant:
        | 
        |     a = 2 sin^2(delta/2)
        |     b = sin(delta)
       | 
       | By using these relationships, it is only necessary to calculate
       | cos(theta) and sin(theta) _once_ for the first term in the series
       | in order to calculate the entire series (until accumulated errors
       | become a problem).
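        | 
        | A concrete (untested) sketch of that recurrence:
        | 
        |     #include <math.h>
        | 
        |     /* out[k] = cos(theta + k*delta); one sin/cos pair total */
        |     void cos_series(double theta, double delta, int n, double *out)
        |     {
        |         double h = sin(delta / 2.0);
        |         double a = 2.0 * h * h, b = sin(delta);
        |         double c = cos(theta), s = sin(theta);
        |         for (int k = 0; k < n; k++) {
        |             out[k] = c;
        |             double cn = c - (a * c + b * s);
        |             s = s - (a * s - b * c);
        |             c = cn;
        |         }
        |     }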
       | 
        | [1] Press, William H., et al., _Numerical Recipes_, Third
        | Edition, Cambridge University Press, p. 219.
        
       | dancsi wrote:
       | The implementation of the __cos kernel in Musl is actually quite
       | elegant. After reducing the input to the range [-pi/4, pi/4], it
       | just applies the best degree-14 polynomial for approximating the
       | cosine on this interval. It turns out that this suffices for
       | having an error that is less than the machine precision. The
       | coefficients of this polynomial can be computed with the Remez
       | algorithm, but even truncating the Chebyshev expansion is going
       | to yield much better results than any of the methods proposed by
       | the author.
        
         | brandmeyer wrote:
         | This has been the standard algorithm used by every libm for
          | decades. It's not special to Musl.
        
           | enriquto wrote:
           | But isn't this code rarely called in practice? I guess on
           | intel architectures the compiler just calls the fsin
           | instruction of the cpu.
        
             | stephencanon wrote:
             | No. FSIN has accuracy issues as sibling mentions, but is
             | also much slower than a good software implementation (it
             | varies with uArch, but 1 result every ~100 cycles is
             | common; even mediocre scalar software implementations can
             | produce a result every twenty cycles).
        
             | jcelerier wrote:
              | Wasn't there some blog article a few years ago that showed
              | how glibc's implementation was faster than fsin?
        
             | chrisseaton wrote:
             | > I guess on intel architectures the compiler just calls
             | the fsin instruction of the cpu.
             | 
             | Do people do that in practice? It's on the FPU, which is
             | basically legacy emulated these days, and it's inaccurate.
        
               | enriquto wrote:
               | > the FPU, which is basically legacy
               | 
               | you'll pry my long doubles from my cold, dead hands!
        
               | chrisseaton wrote:
               | You're selectively quoting me - I said it's 'legacy
               | emulated'. It's emulated using very long-form microcode,
               | so basically emulated in software. I didn't say it was
               | simply 'legacy'.
        
               | jhgb wrote:
               | The question is if you wouldn't be better served with
               | double-doubles today. You get ~100 bits of mantissa AND
               | you can still vectorize your computations.
        
               | enriquto wrote:
               | Sure. There should be a gcc flag to make "long double"
               | become quadruple precision.
               | 
               | The thing is, my first programming language was x86
               | assembler and the fpu was the funniest part. Spent weeks
               | as a teenager writing almost pure 8087 code. I have a lot
               | of emotional investment in that tiny rolling stack of
               | extended precision floats.
        
               | colejohnson66 wrote:
               | How is "quad floating point" implemented on x86? Is it
               | software emulation?
        
             | jhgb wrote:
             | FSIN _only_ works on x87 registers which you will rarely
             | use on AMD64 systems -- you really want to use at least
             | scalar SSE2 today (since that is whence you receive your
             | inputs as per typical AMD64 calling conventions anyway).
             | Moving data from SSE registers to the FP stack _just_ to
             | calculate FSIN and then moving it back to SSE will probably
             | kill your performance even if your FSIN implementation is
              | good. If you're vectorizing your computation over 4 double
             | floats or 8 single floats in an AVX register, it gets even
             | worse for FSIN.
        
               | stephencanon wrote:
               | Moving between x87 and xmm registers is actually fairly
               | cheap (it's through memory, so it's not free, but it's
               | also not _that_ bad). FSIN itself is catastrophically
               | slow.
        
               | jhgb wrote:
               | Fair enough, and I imagine there may even be some
               | forwarding going on? There often is when a load follows a
               | store, if I remember correctly. (Of course this _will_ be
               | microarchitecture-dependent.)
        
               | [deleted]
        
             | adgjlsfhk1 wrote:
             | No. The fsin instruction is inaccurate enough to be
             | useless. It gives 0 correct digits when the output is close
             | to 0.
        
               | enriquto wrote:
               | > 0 correct digits when the output is close to 0
               | 
               | this is an amusing way to describe the precision of sub-
               | normal floating point numbers
        
               | adgjlsfhk1 wrote:
                | It's not just sub-normal numbers. As
                | https://randomascii.wordpress.com/2014/10/09/intel-underesti...
                | shows, fsin only uses 66 bits of pi, which means you have
                | roughly no precision whenever abs(sin(x)) < 10^-16, which
                | is way bigger than the biggest subnormal (10^-307 or so).
        
               | lifthrasiir wrote:
               | It is much more amusing if you describe it in ulps; for
               | some inputs the error can reach > 2^90 ulps, more than
               | the mantissa size itself.
        
             | lifthrasiir wrote:
              | x87 trig functions can be very inaccurate due to Intel's
              | original sloppy implementation and subsequent compatibility
              | requirements [1].
             | 
             | [1] http://nighthacks.com/jag/blog/134/index.html
        
         | toolslive wrote:
         | Yup. Chebyshev is the way to go (after 30s of consideration)
        
         | potiuper wrote:
         | https://github.com/ifduyue/musl/blob/master/src/math/__cos.c
        
           | ufo wrote:
           | I think the polynomial calculation in the end looks
           | interesting. It doesn't use Horner's rule.
        
             | lifthrasiir wrote:
             | It does use Horner's rule, but splits the expression into
             | two halves in order to exploit instruction-level
             | parallelism.
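              | 
              | Schematically (illustrative only, not musl's actual
              | coefficients or exact split):
              | 
              |     /* p(z) = c0 + c1 z + ... + c5 z^5 evaluated as
              |        two independent Horner chains in z^2 */
              |     double p_split(double z, const double c[6]) {
              |         double z2 = z * z;
              |         double even = c[0] + z2 * (c[2] + z2 * c[4]);
              |         double odd  = c[1] + z2 * (c[3] + z2 * c[5]);
              |         return even + z * odd;
              |     }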
        
               | jacobolus wrote:
               | Considering the form of both halves is the same, are
               | compilers smart enough to vectorize this code?
        
           | llimllib wrote:
           | > Input y is the tail of x.
           | 
           | What does "tail" mean in this context?
        
             | lifthrasiir wrote:
             | It is a double-double representation [1], where a logical
             | fp number is represented with a sum of two machine fp
             | numbers x (head, larger) and y (tail, smaller). This
             | effectively doubles the mantissa. In the context of musl
             | this representation is produced from the range reduction
             | process [2].
             | 
              | [1] https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...
              | 
              | [2] https://github.com/ifduyue/musl/blob/6d8a515796270eb6cec8a27...
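              | 
              | The head/tail pair can be produced exactly; a minimal
              | sketch of Dekker's Fast2Sum (assuming |a| >= |b|):
              | 
              |     typedef struct { double hi, lo; } dd;
              | 
              |     dd fast_two_sum(double a, double b) {
              |         dd r;
              |         r.hi = a + b;
              |         r.lo = b - (r.hi - a);  /* exact rounding error */
              |         return r;
              |     }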
        
               | jhgb wrote:
               | Does it make sense to use a double-double input when you
                | only have double output? Sine is Lipschitz-bounded by 1,
                | so I don't see how this makes a meaningful difference.
        
               | lifthrasiir wrote:
               | The input might be double but the constant pi is not. Let
               | f64(x) be a function from any real number to double, so
               | that an ordinary expression `a + b` actually computes
               | f64(a + b) and so on. Then in general f64(sin(x)) may
               | differ from f64(sin(f64(x mod 2pi))); since you can't
               | directly compute f64(sin(x mod 2pi)), you necessarily
               | need more precision during argument reduction so that
               | f64(sin(x)) = f64(sin(f64timeswhatever(x mod 2pi))).
        
               | jhgb wrote:
               | But am I correct in thinking that is at worst a 0.5 ulp
               | error in this case? The lesser term in double-double
               | can't be more than 0.5 ulp of the greater term and
               | sensitivity of both sine and cosine to an error in the
               | input will not be more than 1.
               | 
               | Also, in case of confusion, I was specifically commenting
                | on the function over the [-pi/4, pi/4] domain in
                | https://github.com/ifduyue/musl/blob/master/src/math/__cos.c,
               | which the comment in
               | https://news.ycombinator.com/item?id=30846546 was
               | presumably about.
        
               | adgjlsfhk1 wrote:
               | Double rounding can still bite you. You are forced to
               | incur up to half an ulp of error from your polynomial, so
               | taking another half ulp in your reduction can lead to a
               | total error of about 1 ulp.
        
               | lifthrasiir wrote:
               | Yeah, sine and cosine are not as sensitive (but note that
               | many libms target 1 or 1.5 ulp error for them, so a 0.5
               | ulp error might still be significant). For tangent
               | however you definitely need more accurate range
               | reduction.
        
             | [deleted]
        
       | skulk wrote:
       | I learned how to write programs from Scratch[0]. Back in the
        | pre-1.4 days (IIRC) there were no native sin/cos functions, so
       | they had to be "implemented" somehow to translate arbitrary
       | angles into delta (x,y). The way that I found people did this was
       | as follows (for sin(A), cos(A)):
       | 
       | Invisible sprite:
       | 
        | [ when I receive "sine" ]:
        |     [ go to center ]
        |     [ turn A degrees ]
        |     [ move 1 step forward ]
       | 
       | Now, anyone that wants to know sin(A), cos(A) just has to read
       | that sprite's X and Y position.
       | 
       | [0]: https://scratch.mit.edu
        
         | em3rgent0rdr wrote:
          | I guess Scratch (or LOGO) is Turing complete, so someone could
          | make a CPU built using a little turtle in a box, where all
          | functionality is built out of LOGO commands.
        
       | aaaronic wrote:
       | I teach lookup tables specifically for sine/cosine (we just use
       | one with a phase shift and modulus) on the Gameboy Advance. The
       | hardware has no floating point unit support, so we do it in 8.8
       | fixed point with 1-degree resolution.
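        | 
        | Roughly like this (an illustrative sketch, not the actual course
        | code; table values are sin(deg) scaled by 256 for 8.8 fixed
        | point):
        | 
        |     #include <stdint.h>
        | 
        |     int16_t sin_lut[360];    /* filled once: sin(deg) * 256 */
        | 
        |     int16_t fixed_sin(int deg) {
        |         return sin_lut[((deg % 360) + 360) % 360];
        |     }
        | 
        |     int16_t fixed_cos(int deg) {
        |         return fixed_sin(deg + 90);   /* phase shift */
        |     }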
       | 
       | It turns out the BIOS on the hardware can do sine/cosine, but you
       | have to do that via inline assembly (swi calls) and it's no
       | faster than the simple lookup table.
       | 
       | This is all just to get terms for the affine matrix when rotating
       | sprites/backgrounds.
        
       | waynecochran wrote:
        | You are doing the polynomial evaluation wrong for the Taylor
        | series. First of all, you should be using Horner's method to
        | reduce the number of multiplications needed, and then you should
        | be using FMA to increase the precision of the result.
       | https://momentsingraphics.de/FMA.html
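        | 
        | For instance, a degree-10 Taylor kernel for cos on a reduced
        | range, written with Horner's method and C99 fma() (a sketch
        | using plain Taylor coefficients rather than tuned minimax ones):
        | 
        |     #include <math.h>
        | 
        |     double cos_horner(double x) {
        |         double xx = x * x;
        |         double p = -1.0 / 3628800.0;       /* -1/10! */
        |         p = fma(p, xx,  1.0 / 40320.0);    /*  1/8!  */
        |         p = fma(p, xx, -1.0 / 720.0);      /* -1/6!  */
        |         p = fma(p, xx,  1.0 / 24.0);       /*  1/4!  */
        |         p = fma(p, xx, -0.5);              /* -1/2!  */
        |         return fma(p, xx, 1.0);
        |     }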
        
       | phonebucket wrote:
       | This is a fun article!
       | 
       | Alternative avenues that complement the approaches shown:
       | 
       | - Pade Approximants
       | (https://en.wikipedia.org/wiki/Pad%C3%A9_approximant) can be
       | better than long Taylor Series for this kind of thing.
       | 
       | - Bhaskara I's sin approximation
       | (https://en.wikipedia.org/wiki/Bhaskara_I%27s_sine_approximat...)
       | is easily adaptable to cosine, remarkably accurate for its
       | simplicity and also fast to calculate.
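        | 
        | The cosine adaptation works out to a single division (a sketch;
        | valid for x in [-pi/2, pi/2], absolute error on the order of
        | 10^-3):
        | 
        |     #include <math.h>
        | 
        |     /* cos(x) ~= (pi^2 - 4x^2) / (pi^2 + x^2) */
        |     double bhaskara_cos(double x) {
        |         double pi2 = M_PI * M_PI;
        |         return (pi2 - 4.0 * x * x) / (pi2 + x * x);
        |     }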
        
       | deepspace wrote:
        | I was surprised that pre-calculating x*x as double xx = x*x made
        | a difference.
       | 
       | The author's compiler must not be great at optimization if this
       | is the case.
        
       | msk-lywenn wrote:
        | My favorite implementation of a super fast but not that accurate
        | cosine is from a now-defunct forum. Good thing we have the Web
        | Archive!
       | https://web.archive.org/web/20111104163237/http://www.devmas...
        
         | karma_fountain wrote:
          | You can also use Chebyshev polynomials to economize a higher-
          | degree Taylor series with a bit of a loss of accuracy.
        
           | slim wrote:
           | you can't because boycott Chebyshev
        
             | rurban wrote:
             | You can't boycott Chebyshev, he died 1894 or so.
        
       | gweinberg wrote:
        | I think you should be able to get good results by precalculating
        | the cosines for a set of angles and then using the formula for
        | the cosine of the sum of two angles combined with the Taylor
        | series. The series converges very quickly for very small angles,
        | right?
        
       | dboreham wrote:
       | This takes me back in time, to c. 1982 when I got curious about
       | how computers calculated trig functions. Luckily I was using one
       | of the first OSS systems (modulo AT&T lawyers): Unix V7. So I
       | pulled up the source and took a look. You can still see it here:
       | https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libm...
        | There's a comment referencing "Hart & Cheney", which is this book:
        | https://www.google.com/books/edition/Computer_Approximations...
       | Also luckily I had access to a library that had the book which I
        | was able to read, at least in part. This taught me a few things:
       | that the behavior of a computer can be a white box, not a black
       | box; the value of having access to the source code; to use
       | comments effectively; that for many difficult tasks someone
       | probably had written a book on the subject. Also that polynomials
       | can approximate transcendental functions.
        
       | hatsunearu wrote:
        | Since cosine is even and periodic, you really only need, like,
        | x = 0 to pi/4; the rest you can infer just from that data.
        
         | mhh__ wrote:
          | That is how it's often implemented, i.e. shift back to the
          | domain you actually need to be able to compute.
        
       | stncls wrote:
       | This is a very nice article, and I think the three initial
       | objectives are fully attained (simple enough, accurate enough,
       | fast enough).
       | 
       | Some of the performance claims have a caveat, though: Lookup
       | tables and micro-benchmarks don't mix well. At all.
       | 
       | I just added a simple loop that thrashes the L3 cache every 500
       | iterations (diff here:
       | https://goonlinetools.com/snapshot/code/#sm4fjqtvjyn36dyednc...).
        | Now the method recommended at the end (cos_table_0_001_LERP) is
        | _slower_ than glibc's cos() (while still having an accuracy that
       | is more than 10^8 times worse)!
       | 
        | Time benchmark output:
        | 
        |     cos_table_1_LERP         0.0038976235167879
        |     cos_table_0_1_LERP       0.0042602838286585
        |     cos_table_0_01_LERP      0.0048867938030232
        |     cos_table_0_001_LERP     0.0091254562801794
        |     cos_table_0_0001_LERP    0.0139627164397000
        |     cos_math_h               0.0089332715693581
        | 
        | Lookup table sizes:
        | 
        |     cos_table_1_LERP             64 bytes
        |     cos_table_0_1_LERP          512 bytes
        |     cos_table_0_01_LERP        5040 bytes
        |     cos_table_0_001_LERP      50280 bytes
        |     cos_table_0_0001_LERP    502664 bytes
        
         | adgjlsfhk1 wrote:
          | How long does the cache-thrashing loop take compared to 500
          | iterations? I'm not sure how much you can thrash the cache
          | while still being performance-bottlenecked by a table-based
          | implementation of a trig function.
        
           | stncls wrote:
           | Yes of course the cache thrashing takes a lot longer. But I
           | modified the code to time only the cos() functions (see
           | diff), using the rdtsc instruction. That's why it had to be
           | every 500 iterations: If it was after each iteration, then
           | rdtsc itself would become the bottleneck!
        
             | adgjlsfhk1 wrote:
              | My point wasn't about how you were timing the code. What I
              | meant is that in the real world, I don't think there are
              | programs where this will be an issue, because if they are
              | thrashing their caches too much, they won't be bottlenecked
              | on trig. The advantage of table-based functions is that
              | they are really fast if you are calling a lot of them at
              | once, but that's also the only case where they need to be
              | fast. If only 1 in 10000 instructions is trig, it's OK if
              | the trig is a little slower.
        
       | anugrahit wrote:
        
       | charlieyu1 wrote:
        | I think an off-zero Taylor series along with a lookup table
        | should at least be tried. You can approximate cos(x+a) using only
        | cos(a) and sin(a), and the series converges really fast.
        
       | SeanLuke wrote:
        | A LERP lookup table is fast; only somewhat slower, and probably
        | quite a bit more accurate, would be a lookup table interpolated
        | by a Catmull-Rom spline.
        
       | marcodiego wrote:
        | I don't know if this benchmark is trustworthy. Using a lookup
        | table pollutes the cache. It definitely needs to be measured
        | while interacting with other parts of the code to get a good
        | performance measurement.
        
       | naoqj wrote:
       | >I couldn't find any notable projects on GitHub that use a table
       | for trig functions, but I'm sure they exist.
       | 
       | Doom?
        
         | zekica wrote:
          | Doom also uses the famous fast inverse square root method,
          | which still amazes me even though I know how it works.
        
           | naoqj wrote:
           | It doesn't, to the best of my knowledge. It appeared in an id
           | game for the first time in Quake 3.
        
         | freemint wrote:
         | Julia?
        
           | adgjlsfhk1 wrote:
           | Julia uses a very similar approach to musl here. Reduction
           | mod 2pi followed by polynomial kernel. We use tables for
           | `exp` and `log`, but the trig functions have good reductions
            | so we use those instead. The implementation is here:
            | https://github.com/JuliaLang/julia/blob/03af78108afe6058f54c...
            | for anyone interested.
        
       | dekhn wrote:
        | lookup tables and lerp are the most common solution I've seen
        | (for a wide range of functions). You can also memoize with an
        | LRU cache.
        
         | phkahler wrote:
         | >> lookup tables and lerp are the most common solution I've
         | seen
         | 
         | The author missed one nice table method. When you need sin()
         | and cos() split the angle into a coarse and fine angle. Coarse
         | might be 1/256 of a circle and fine would be 1/65536 of a
         | circle (whatever that angle is). You look up the sin/cos of the
         | coarse angle in a 256 entry table. Then you rotate that by the
          | fine angle, which uses another 256 sin and 256 cos entries. The
         | rotation between coarse values is more accurate than linear
          | interpolation and gives almost the same accuracy as a 65536-
          | entry table using only 768 entries. You can use larger tables
         | or even another rotation by finer angle steps to get more
         | accuracy.
        
           | LeifCarrotson wrote:
           | > _Then you rotate that by the fine angle_
           | 
           | Explain?
           | 
           | Are you trying to reduce the error in the linear
           | interpolation by describing the convex curve between the
           | endpoints?
           | 
            | The derivative of the cosine is the sine, and the derivative
            | of the sine is the cosine... so I'd expect the fine-angle
            | table output required to interpolate 0.15 between 0.1 and
            | 0.2 degrees, and the output required to interpolate 45.15
            | between 45.1 and 45.2 degrees, to be very different.
        
             | marginalia_nu wrote:
             | Remember that cos(a+b)=cos(a)cos(b)-sin(a)sin(b)
        
             | phkahler wrote:
             | >> Explain?
             | 
             | Yeah I thought that might not be clear enough. Let's say
             | you want 0.1 degree accuracy but don't want a 3600 entry
             | table. Make one table for every 1 degree. You get to use
             | the same table for sin and cos by changing the index. That
             | is the coarse table with 360 entries. Then make another
             | table with sin and cos values for every 0.1 degrees, or
             | specifically 0.1*n for n going from 0-9. This is another 20
             | values, 10 sin and 10 cos for small angles.
             | 
             | Take your input angle and split it into a coarse (integer
             | degrees) and fine (fractional degrees) angle. Now take the
             | (sin(x),cos(x)) from the coarse table as a vector and
             | rotate it using the sin & cos values of the fine angle
             | using a standard 2x2 rotation matrix.
             | 
             | You can size these tables however you need. I would not use
             | 360 and 10 for their sizes, but maybe 256 and 256. This can
             | also be repeated with a 3rd table for "extra fine" angles.
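              | 
              | An untested sketch with 256+256 entries and a 16-bit
              | binary angle (names and sizes are illustrative):
              | 
              |     #include <math.h>
              |     #include <stdint.h>
              | 
              |     /* sin/cos of 2*pi*k/256 (coarse), 2*pi*k/65536 (fine) */
              |     static double cs[256], ss[256], cf[256], sf[256];
              | 
              |     void init_tables(void) {
              |         for (int k = 0; k < 256; k++) {
              |             cs[k] = cos(2 * M_PI * k / 256.0);
              |             ss[k] = sin(2 * M_PI * k / 256.0);
              |             cf[k] = cos(2 * M_PI * k / 65536.0);
              |             sf[k] = sin(2 * M_PI * k / 65536.0);
              |         }
              |     }
              | 
              |     /* a is a 16-bit fraction of a full turn */
              |     void sincos16(uint16_t a, double *s, double *c) {
              |         unsigned hi = a >> 8, lo = a & 0xFFu;
              |         /* rotate the coarse vector by the fine angle */
              |         *c = cs[hi] * cf[lo] - ss[hi] * sf[lo];
              |         *s = ss[hi] * cf[lo] + cs[hi] * sf[lo];
              |     }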
        
         | marginalia_nu wrote:
          | > You can also memoize with an LRU cache.
         | 
         | That's probably not going to perform well for an operation this
          | fast. The computation is faster than a main memory read.
        
           | dekhn wrote:
           | You can fit quite a large LRU cache in the L2 cache of the
           | processor. You never want to go to main memory for a lerp
           | table.
        
             | Sharlin wrote:
              | In some cases (eg. oldskool 2D graphics) a 256-entry LUT is
              | enough. And an 8-bit fixed-point format can be enough, so
              | the whole table fits in 256 bytes.
        
         | munificent wrote:
         | Yeah, lerping into a lookup table is common in audio and (old
         | school) games.
         | 
         | A common trick to make that faster is to not use doubles and
         | radians to represent the phase. Instead, represent phase using
         | an integer in some power-of-two range. That lets you truncate
         | the phase to fit in a single period using a bit mask instead of
         | the relatively slow modulo.
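          | 
          | Something like this untested sketch (sizes are illustrative: a
          | 16-bit phase indexing a 1024-entry one-period table):
          | 
          |     #include <stdint.h>
          | 
          |     #define PHASE_MASK 0xFFFFu    /* 2^16 steps per period */
          | 
          |     float sine_table[1024];       /* one period, filled once */
          | 
          |     float osc_next(uint32_t *phase, uint32_t inc) {
          |         *phase = (*phase + inc) & PHASE_MASK;  /* mask, not % */
          |         return sine_table[*phase >> 6];   /* 16 -> 10 bits */
          |     }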
        
       ___________________________________________________________________
       (page generated 2022-03-29 23:00 UTC)