[HN Gopher] Implementing Cosine in C from Scratch (2020)
       ___________________________________________________________________
        
       Implementing Cosine in C from Scratch (2020)
        
       Author : alraj
       Score  : 187 points
       Date   : 2023-06-05 11:12 UTC (11 hours ago)
        
 (HTM) web link (austinhenley.com)
 (TXT) w3m dump (austinhenley.com)
        
       | JKCalhoun wrote:
        | The author's Taylor series looked salvageable to me. The range
        | between -pi/2 and pi/2 looked fine. That usable range can be
        | reused to serve all other portions of the circle.
        | 
        | Add a conditional that applies (-result) for some angles and
        | you're golden.
        
       | londons_explore wrote:
        | OP's lookup table benchmarking is bad. The "modd(x, CONST_2PI);"
        | at the top will, by far, dominate the runtime.
        | 
        | Anyone who wants performance measures angles using n-bit fixed-
        | point math, mapping 0 to 2*pi onto 0 to (2^n)-1.
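        | 
        | A minimal sketch of that mapping with n = 16 (the helper name is
        | made up):
        | 
        |     #include <math.h>
        |     #include <stdint.h>
        |     
        |     /* 16-bit "binary angle": 0..2*pi maps onto 0..65535, and
        |        unsigned wrap-around gives the 2*pi periodicity for free. */
        |     static uint16_t angle_to_bam16(double radians)
        |     {
        |         double turns = radians / 6.283185307179586;
        |         turns -= floor(turns);    /* fraction of a turn, [0,1) */
        |         return (uint16_t)(turns * 65536.0);
        |     }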
        
       | omgmajk wrote:
       | >Maybe don't stare at that for too long...
       | 
       | I sure did stare at that for way too long.
        
       | pacaro wrote:
        | I went through the same exercise implementing trig functions for
        | Scheme in WebAssembly...
       | 
       | It was a rabbit hole for sure
       | 
       | https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s...
       | 
        | [Edit: above is just the range from 0-pi/4, the branch cuts for
        | cos are at
        | https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s... ]
        
       | dang wrote:
       | Url changed from
       | https://web.archive.org/web/20210513043002/http://web.eecs.u...,
       | which points to this.
        
       | jpfr wrote:
        | The Taylor expansion locally fits a polynomial based on the first
        | n derivatives. If you want to find the "best" nth-degree
        | polynomial to approximate the sine function, functional analysis
        | gives the tools for solving that optimization in closed form.
        | 
        | By selecting an appropriate norm (in function space) you can
        | minimize either the maximum error or the error integral over some
        | range (e.g. the 0-pi range).
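        | 
        | In symbols, the two objectives are
        | 
        |     min_p  max_{x in [a,b]} |f(x) - p(x)|     (worst-case, L-inf)
        |     min_p  integral_a^b |f(x) - p(x)| dx      (error integral, L1)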
       | 
        | Here's a video on the subject. You might want to watch the
        | earlier ones as well for more context.
       | 
       | https://www.youtube.com/watch?v=tMlKZZf2Kac&list=PLdkTDauaUn...
       | 
       | Full disclosure, this is my university lecture on optimization
       | that was recorded during Covid.
        
         | stephencanon wrote:
         | It's worth noting that for fixed-precision math libraries, we
         | usually want the best approximation in the (possibly weighted)
         | L-inf sense, which doesn't generally have a closed-form
         | solution, so we use iterative methods to generate
         | approximations instead.
        
           | bigdict wrote:
           | What's a good textbook intro to these methods?
        
             | remcob wrote:
             | I wrote this intro for a friend and since then more people
             | have found it useful:
             | 
             | https://xn--2-umb.com/22/approximation
             | 
             | If you want a proper textbook, I recommend L.N. Trefethen
             | (2019). "Approximation Theory and Approximation Practice."
             | Which is available online here:
             | 
             | https://epubs.siam.org/doi/book/10.1137/1.9781611975949
        
               | munificent wrote:
               | I think this is the first time I've seen a punycode URL
               | in the wild. I love it!
        
               | stephencanon wrote:
               | That's a really nice overview of Remez, thanks for
               | sharing!
        
           | nsajko wrote:
           | Seems also worth noting that although the usual algorithm for
           | finding the minimax polynomial is the very old Remez
           | algorithm, I'm playing with a minimax finder that relies on
           | what I think is a novel algorithm. My algorithm seems better
           | than Remez, and certainly beats it (as in, finds the solution
            | when Remez cannot) in many cases, even though I don't have
           | an analysis of the algorithm.
           | 
           | The main idea is to use linear programming with some chosen
           | points from the interval to look for the polynomial that
           | approximates the chosen function over that interval. Unlike
           | Remez, this enables control over which individual points from
           | the interval are chosen as representatives, which enables
           | avoiding ill-behaved points. An example of where this leads
           | to improvements over Remez is when optimizing relative error:
            | Remez would trip over zeros of the function, because they
            | cause singularities in the relative error; however, my
            | algorithm works (by avoiding ill-behaved points) as long as
            | the discontinuities are removable.
           | 
            | My algorithm is also a lot more flexible than Remez's, for
           | example it allows optimizing over multiple intervals instead
           | of a single one.
           | 
           | The Git repo of the (still in-progress) project is here:
           | https://gitlab.com/nsajko/FindMinimaxPolynomial.jl
           | 
           | The Julia package is already registered (installable from the
           | official Julia registry): https://juliahub.com/ui/Packages/Fi
           | ndMinimaxPolynomial/kNIo8...
        
             | adgjlsfhk1 wrote:
             | This is really cool and I think it might be a re-discovery
             | of https://dl.acm.org/doi/abs/10.1145/3434310.
        
       | rjmunro wrote:
        | Can you get some better accuracy by noting that cos(x) ===
        | sin(pi/2 - x) and using e.g. the Taylor expansion for sin when
        | pi/4 < x < 3pi/4?
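        | 
        | A minimal sketch of that idea (plain Taylor coefficients, not
        | tuned ones):
        | 
        |     #include <math.h>
        |     
        |     /* For pi/4 < x < 3*pi/4, evaluate cos(x) as sin(pi/2 - x):
        |        the reduced argument t = pi/2 - x lies in (-pi/4, pi/4),
        |        where the sine series converges quickly. */
        |     static double cos_via_sin(double x)
        |     {
        |         double t  = M_PI / 2.0 - x;
        |         double t2 = t * t;
        |         /* sin t = t - t^3/3! + t^5/5! - t^7/7!, Horner form */
        |         return t * (1.0 - t2 / 6.0
        |                   * (1.0 - t2 / 20.0
        |                   * (1.0 - t2 / 42.0)));
        |     }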
        
       | Technotroll wrote:
        | Looks to me like you could compute the top half, and then just
        | repeat the rest as a kind of mirrored function that repeats with
        | some set translation. Am I wrong here?
        
         | dh2022 wrote:
          | I thought the same thing. When I saw the Taylor sum diverging
          | at the "edges" I thought: well, a Taylor sum converges
          | absolutely only on a compact interval. An easy solution to this
          | limitation is to use the periodicity of the cos/sin function...
        
       | sojuz151 wrote:
        | I would say the benchmarks with lookup tables are bad. In the
        | benchmark, the CPU will keep the table in cache, but in a real
        | program this cache would have to be shared with the rest of the
        | code/data. This would kill the performance of either the cosine
        | or the rest of the app.
        
       | dang wrote:
       | Related:
       | 
       |  _Implementing Cosine in C from Scratch (2020)_ -
       | https://news.ycombinator.com/item?id=30844872 - March 2022 (134
       | comments)
       | 
       |  _Implementing cosine in C from scratch_ -
       | https://news.ycombinator.com/item?id=23893325 - July 2020 (20
       | comments)
        
       | eschneider wrote:
       | Trig functions are where accuracy and performance go to die.
       | Accumulated error is a thing, so when optimizing always consider
       | exactly how you're going to be using those functions so you make
       | the 'right' tradeoffs for your application. One size definitely
       | doesn't fit all here, so test and experiment.
        
       | amadio wrote:
        | Taylor expansion is usually not so good for this; you'll fare
        | better with either Legendre or Chebyshev polynomials. Robin Green
       | has some excellent material on the subject, which I am linking
       | below.
       | 
       | Faster Math Functions:
       | https://basesandframes.files.wordpress.com/2016/05/fast-math...
       | https://basesandframes.files.wordpress.com/2016/05/fast-math...
       | 
       | Even faster math functions GDC 2020:
       | https://gdcvault.com/play/1026734/Math-in-Game-Development-S...
        
       | aidenn0 wrote:
        | Marz's Taylor series implementation as-is is significantly faster
        | (~60% of the runtime) than the glibc implementation at 6 terms on
        | My Machine(tm), rather than about the same as in TFA. I haven't
        | compared to the LERP lookup table yet, though.
        
       | femto wrote:
        | If doing signal processing, an optimisation is to use an N-bit
        | integer to represent the range 0 to 2*pi. Some example points in
        | the mapping: 0 -> 0, 2^(N-2) -> pi/2, 2^(N-1) -> pi, 2^N (wraps
        | to 0) -> 2*pi (wraps to 0).
       | 
        | If your lookup table has an M-bit index, the index into the
        | lookup table is calculated with: index = (unsigned)theta >>
        | (N-M), where theta is the N-bit integer representing the angle.
       | 
       | The fractional part, which can be used for interpolation, is:
       | theta & ((1<<(N-M))-1).
       | 
        | If you choose M = N = word size (16 bits is often nice), the
        | angle can be used directly as an index. With 16 bits, accuracy is
        | typically good enough without interpolation, and the entire trig
        | operation reduces to an array indexing operation.
       | 
       | This minimises the overhead of converting an angle to a table
       | index.
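        | 
        | A minimal sketch of that split (N = 16, M = 8 here; names are
        | made up and the table fill is elided):
        | 
        |     #include <stdint.h>
        |     
        |     #define N 16                    /* bits in the binary angle */
        |     #define M 8                     /* bits in the table index  */
        |     
        |     static int16_t sin_table[1 << M]; /* filled once at startup */
        |     
        |     static int16_t sin_bam(uint16_t theta)
        |     {
        |         uint16_t idx  = theta >> (N - M);
        |         uint16_t frac = theta & ((1u << (N - M)) - 1);
        |         int16_t  a = sin_table[idx];
        |         int16_t  b = sin_table[(idx + 1) & ((1 << M) - 1)];
        |         /* linear interpolation: a + (b - a) * frac / 2^(N-M) */
        |         return (int16_t)(a + (((int32_t)(b - a) * frac) >> (N - M)));
        |     }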
        
         | polalavik wrote:
          | This method has two error terms: phase error and amplitude
          | error. You can only represent sinusoids with a resolution of
         | 2pi/2^N rad/samp (same phase error if you don't do phase
         | truncation). Maybe that's enough for most applications, but
         | it's probably significantly more error than math libraries that
         | calculate cos. Totally depends on your application.
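          | 
          | (For N = 16 that resolution is 2pi/2^16 ~= 9.6e-5 rad per
          | sample.)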
        
         | rjmunro wrote:
         | Why just signal processing? Surely this is good for storing
         | e.g. rotations of characters in games, longitude, in fact
         | almost anywhere where you want an angle where a full rotation
         | is the same as no rotation at all, and more precision as you
         | approach 0deg is not useful.
        
           | vikingerik wrote:
           | Accumulated error can matter, which is where you'd want that
           | precision near 0deg. If something is making tiny turns each
           | frame for thousands of frames, floating-point inaccuracy can
           | add up. (A smarter mathematical approach might remedy this,
           | or might not; think of some game component that is supposed
           | to jitter randomly each frame, maybe some particle system.)
           | It's not common, but it's within hypothetical feasibility
           | that the precision would be useful.
        
             | amitport wrote:
              | He is describing deterministic uniform round-to-nearest
              | quantization. The most common smarter mathematical approach
              | would be randomized rounding, which would solve this issue.
              | 
              | (The issue being that deterministic rounding is a biased
              | estimator of the sine function.)
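              | 
              | A minimal sketch of randomized (stochastic) rounding, with
              | rand() as a stand-in RNG:
              | 
              |     #include <math.h>
              |     #include <stdlib.h>
              |     
              |     /* Round x down or up with probability proportional to
              |        proximity, so the expected value equals x and the
              |        quantization error carries no bias. */
              |     static long stochastic_round(double x)
              |     {
              |         double lo   = floor(x);
              |         double frac = x - lo;                    /* [0,1) */
              |         double u    = rand() / (RAND_MAX + 1.0); /* [0,1) */
              |         return (long)lo + (u < frac ? 1 : 0);
              |     }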
        
               | londons_explore wrote:
               | > randomized rounding
               | 
               | There are a bunch of places where randomized rounding
               | works better. But is there any computer hardware that
               | implements this? Obviously the internal precision (of the
               | random element) can't be infinite, but a couple of extra
               | bits would seem worth the complexity.
        
       | mvcalder wrote:
       | If you like this sort of thing there's a great book: Methods and
       | Programs for Mathematical Functions by Stephen Moshier. It covers
        | the computation of all sorts of special functions. The code is
        | available in the Cephes library, but the book may be out of
        | print.
        
         | jonsen wrote:
         | Not even a single hit on bookfinder.com!
        
       | xorvoid wrote:
       | How good do you need it? Lol.
       | 
        | This is the approximation I used for the animated sine-wave
        | example in SectorC:
       | 
       | y ~= 100 + (x*(157 - x)) >> 7
       | 
       | https://github.com/xorvoid/sectorc/blob/main/examples/sinwav...
        
       | azhenley wrote:
       | No need for archive.org, my website moved a few years ago:
       | https://austinhenley.com/blog/cosine.html
        
         | butterlesstoast wrote:
         | Thanks for taking the time to make such an incredible article.
         | Made my day reading it this morning.
        
         | dang wrote:
         | Fixed. Thanks!
        
         | capitainenemo wrote:
         | BTW, WRT projects using a lookup table...
         | 
         | https://hg.hedgewars.org/hedgewars/file/tip/hedgewars/uSinTa...
         | 
          | I think there's even a GitHub mirror someone made, since that's
          | where you were looking. Maybe a little out of date, but this is
          | old code.
        
           | capitainenemo wrote:
            | Oh, also: the sine table isn't just a performance thing.
            | Hedgewars does custom fixed-point math for deterministic
            | lockstep, so it keeps the math simple and predictable across
            | compilers/hardware.
        
           | [deleted]
        
       | charlieyu1 wrote:
       | Related:
       | 
       | https://news.ycombinator.com/item?id=35381968 Cosine
       | Implementation in C
       | 
       | Much better approximation with only 7 terms
        
       | londons_explore wrote:
        | Note that you can combine the lookup table with the Taylor series
        | expansion...
       | 
       | And you can even use the same lookup table for each. That means
       | with 2 table lookups in a 32 entry table, a single multiply and
       | add (and a few bit shifts), you can get ~9 bits of precision in
       | your result, which is fine for most uses. It also probably makes
       | a sin operation take ~1 clock cycle on most superscalar
       | architectures, as long as your workload has sufficient
       | instruction parallelism.
       | 
        | Note that a smaller table typically works out faster: 32 entries
        | fit in the cache, whereas repeated random indexing into a
        | 1024-entry table would probably kick a bunch of other stuff you
        | wanted out of the cache.
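        | 
        | A sketch of the trick (32-entry sine table; the derivative of sin
        | is cos, which is the same table shifted a quarter turn):
        | 
        |     #include <math.h>
        |     #include <stdint.h>
        |     
        |     #define TBL 32
        |     static float sin_tbl[TBL];  /* sin(2*pi*i/TBL), filled once */
        |     
        |     static void init_tbl(void)
        |     {
        |         for (int i = 0; i < TBL; i++)
        |             sin_tbl[i] = (float)sin(2.0 * M_PI * i / TBL);
        |     }
        |     
        |     /* theta: 16-bit binary angle, 0..65535 == one full turn */
        |     static float sin_bam(uint16_t theta)
        |     {
        |         uint16_t i    = theta >> 11;     /* top 5 bits: index */
        |         uint16_t frac = theta & 0x7FF;   /* low 11 bits       */
        |         float dx = frac * (float)(2.0 * M_PI / 65536.0);
        |         /* first-order Taylor: sin(x0+dx) ~ sin(x0) + dx*cos(x0),
        |            and cos(x0) = sin(x0 + quarter turn): same table. */
        |         return sin_tbl[i] + dx * sin_tbl[(i + TBL / 4) & (TBL - 1)];
        |     }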
        
       | throwawaaarrgh wrote:
        | Strange. This article was easy to read, simple formatting, not
        | driven by trends or weird internet culture. There weren't any
        | bombastic claims, aggrandizing statements or dramatic opinions.
        | Just interesting information presented clearly without being
        | verbose. Almost like it was written by an adult. How did this end
        | up on HN?
        
         | smlacy wrote:
         | Yeah, very much unlike your comment. Strange, that.
        
       | [deleted]
        
       | Helenarttr wrote:
       | [dead]
        
       | Aardwolf wrote:
       | " In all my time of coding, there has only been one situation in
       | which I used cosine: games, games, games.
       | function move(obj) {           obj.x += obj.speed *
       | Math.sin(obj.rotation);           obj.y += obj.speed *
       | Math.cos(obj.rotation);       }
       | 
       | "
       | 
        | Why not store the velocity as a 2D vector instead? Then you still
        | have to use cos/sin to compute that vector, but at least you
        | don't need it every frame. Often you don't need cos/sin to
        | compute the vector at all, since forces that act on the velocity
        | have their own x and y components you can add to it directly.
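        | 
        | A sketch of that alternative (names made up): trig runs only when
        | the heading changes, and the per-frame update is trig-free:
        | 
        |     #include <math.h>
        |     
        |     typedef struct { double x, y, vx, vy; } Obj;
        |     
        |     /* cos/sin happen only here, when the heading changes */
        |     static void set_heading(Obj *o, double speed, double rotation)
        |     {
        |         o->vx = speed * sin(rotation);
        |         o->vy = speed * cos(rotation);
        |     }
        |     
        |     static void move(Obj *o)   /* called every frame */
        |     {
        |         o->x += o->vx;
        |         o->y += o->vy;
        |     }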
        
         | messe wrote:
         | Sometimes it's just easier to work and think in polar
         | coordinates. If that's the case, then (speed, rotation) is just
         | as valid a choice as (vx, vy). For smaller games, the cost of
          | computing sin and cos every frame is so astronomically small
          | that it's not even worth worrying about.
        
           | Aardwolf wrote:
           | Hmm, I usually find polar coordinates harder to reason with
           | due to the gimbal lock :)
           | 
            | In 2D there's no gimbal lock of course, but there's still a
            | number that behaves differently from the others: the angle,
            | wrapping around in the range 0..2pi!
           | 
           | And of course the article is about fast computation of
           | sin/cos so the option of not computing it should be
           | considered
        
             | midjji wrote:
                | The equivalent of polar coordinates in 3D is quaternions,
                | which do not have gimbal lock; the problem only occurs
                | for Euler angles (which should never ever be used for
                | anything, ever).
        
               | messe wrote:
               | > The equivalent of polar coordinates in 3d are
               | quaternions
               | 
               | I'd disagree. Quaternions are the 3D equivalent (yes, I'm
               | aware of the subtleties here, but speaking in a rough
               | sense) to using complex numbers to represent rotations.
               | 
               | Euler angles are much closer in spirit to polar
               | coordinates, but unfortunately have gimbal lock.
        
             | thehappypm wrote:
             | Can you explain gimbal lock? I can't wrap my head around it
        
               | Aardwolf wrote:
                | The main thing is that when you represent a 3D look
                | orientation as multiple angles (pitch = up/down, yaw =
                | compass direction), you lose control over the yaw axis
                | when looking straight up.
        
               | [deleted]
        
               | pacaro wrote:
               | Of course gimbal lock is a thing, and there are ways to
               | avoid it, even without quaternions, but I'm curious as to
                | when it should be avoided. I think that gamers are now so
                | used to it that removing it might cause more problems
                | than it solves.
        
               | [deleted]
        
               | Aardwolf wrote:
                | I guess that's more for airplane simulators then. But I'd
                | also think any physics engine where things can rotate
                | arbitrarily needs to work correctly here.
               | 
               | I also think, despite various sources on the internet
               | saying "quaternions are the solution to gimbal lock",
               | that classical 3D vectors and 3D matrices without
               | quaternions also perfectly solve it.
        
               | pacaro wrote:
               | Agreed absolutely. I've used quaternions when someone
               | else made the decision that was what the physics engine
               | was going to do! But when I'm the decision maker it's 3d
               | vectors and matrices all the way
        
               | [deleted]
        
               | basementcat wrote:
                | A physical analogy is an azimuth-elevation mounted
                | antenna tracking an object that flies over the zenith.
                | When the object is near the horizon, the angular velocity
                | of the azimuth actuator may be near zero and that of the
                | elevation actuator may be small. Near zenith, the angular
                | velocity of the elevation actuator may be small while
                | that of the azimuth actuator may approach infinity.
        
       | mistercow wrote:
       | This is a fine introduction to how you even approach the problem
       | of computing transcendental functions.
       | 
       | It would have been nice to have some discussion of accuracy
       | requirements rather than just eyeballing it and saying "more
       | accuracy than I need". This is a spot where inexperienced devs
       | can easily get tangled up. An attitude like "ten digits? That's
       | so many! I'm only making a game, after all" is exactly the sort
       | of thing that gets you into trouble if you start accumulating
       | errors over time, and this is particularly easy to do with trig
       | functions.
        
       | 082349872349872 wrote:
        | Niklaus Wirth also has some cosine code (probably meant more for
        | software floating point, or FPGA floating-point blocks); I don't
        | know how it compares with these approximations, but his seems to
        | be within 1e-6 of Python's math.cos ...
       | 
       | https://people.inf.ethz.ch/wirth/ProjectOberon/Sources/Math....
        
       | dm319 wrote:
        | The answer to cos(1.57079632) (in radians) will give you a clue
        | as to how your calculator does it.
       | 
       | See here [0]
       | 
       | [0]
       | https://www.reddit.com/r/calculators/comments/126st95/cos157...
        
         | aidenn0 wrote:
          | 6.7948900e-9, which has only the first 5 digits correct (if
          | you assume it would have had to be 6.79490 to get 6 digits
          | correct, since that is what the correct value rounds to)
        
         | the_third_wave wrote:
          | Quite well, when you realise this comes down to cos(pi/2), to
          | which the thing answers correctly: cos(pi/2) = 0.
          | 
          | Using HiPER Calc on Android
        
       | midjji wrote:
        | A good idea is not to compute the values of cos over 0-2pi, but
        | to further reduce the range using cos(a) = cos(-a), cos(2a) =
        | 2cos^2(a) - 1, or cos(a+pi/4) =...
        | 
        | So we really only ever need to be able to compute cos in the
        | range 0-pi/4.
        | 
        | Then for further accuracy we can do the Taylor expansion around
        | pi/8 (or other approximations).
        | 
        | Finally, the number of terms for a fixed accuracy varies with the
        | distance from pi/8.
        
         | londons_explore wrote:
          | I think all real implementations use a lookup table... Small
          | lookup tables (e.g. 8 elements) work out far smaller in silicon
          | area than even one multiplier...
        
       | xipix wrote:
       | Nice. I'd love to see how this changes when you have SIMD and
        | multiple cosines to compute in parallel. Also when you have to
        | compute sine and cosine simultaneously, which is often the case;
        | then you may be more interested in polar error than Cartesian
        | error.
        
         | p0nce wrote:
          | You can do this with: https://github.com/RJVB/sse_mathfun
          | Those functions do not have the same safety as libc, but even
          | for scalar sin/cos they can be faster.
        
       | amiga386 wrote:
       | One of the things I loved most about reading kernel and libc (or
       | rather, libm) sources was the floating-point and FP emulation
        | code. I thought it was immensely cool to see that what others
        | just expected an FPU instruction to do, someone had not only
        | written out in C, but had also documented with comments
        | explaining the mathematics (rendered in ASCII), often with
        | numerical analysis of error propagation, accuracy, etc.
       | 
       | Some examples:
       | 
       | https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=new...
       | 
       | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
       | 
       | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
        
       | sampo wrote:
       | > cosine repeats every 2pi
       | 
       | > We can do better since the cosine values are equivalent every
       | multiple of pi, except that the sign flips.
       | 
        | There is one more step to take: each pi/2-long segment has an
        | identical shape; they are just pointing up or down (as you
        | already noticed), and left or right. So you can reduce your basic
        | domain not just to 0...pi, but to 0...pi/2.
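        | 
        | A sketch of that folding (the short Taylor kernel is only there
        | to keep the sketch self-contained; any good [0, pi/2]
        | approximation works in its place):
        | 
        |     #include <math.h>
        |     
        |     static double kernel_cos(double x)   /* valid on [0, pi/2] */
        |     {
        |         double x2 = x * x;
        |         /* 1 - x^2/2! + x^4/4! - x^6/6!, Horner form */
        |         return 1.0 - x2 / 2.0
        |                    * (1.0 - x2 / 12.0
        |                    * (1.0 - x2 / 30.0));
        |     }
        |     
        |     static double my_cos(double x)
        |     {
        |         x = fabs(fmod(x, 2.0 * M_PI)); /* even, 2*pi periodic  */
        |         if (x > M_PI)
        |             x = 2.0 * M_PI - x;        /* fold [pi,2pi]->[0,pi] */
        |         if (x > M_PI / 2.0)
        |             return -kernel_cos(M_PI - x); /* fold + sign flip  */
        |         return kernel_cos(x);
        |     }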
        
         | amelius wrote:
         | But that takes extra clock cycles, since now you have to fiddle
         | with the sign.
        
           | Someone wrote:
            | I think it's likely that you more than get that back by
            | doing (at least) one iteration fewer.
           | 
           | Those who know more about this: educate me.
        
             | adgjlsfhk1 wrote:
             | Pretty much every implementation does reduction mod pi/2.
             | Switching polynomials is basically free (and halving the
             | domain significantly reduces the number of terms needed).
        
         | IIAOPSW wrote:
          | And you can keep going with this logic, building cos on 0-x by
          | recursively calculating cos on 0-x/2, then mirroring over some
          | line if needed. Eventually you hit the machine precision limit,
          | aka the base case cos(0) = 1.
        
         | ajross wrote:
         | And still further: the function can be piecewise decomposed
         | into separate polynomials, which is what the pervasive SunPro
         | code in everyone's libc does. There's one "mostly parabolic"
         | function to handle the curvy top, and another "mostly linear"
         | one to handle the zero crossing region.
        
         | gmiller123456 wrote:
          | You can actually reduce it to a quarter period, as the range
          | from 0 to pi/2 is the same shape as the remaining 3 segments,
          | just mirrored along different axes.
        
       ___________________________________________________________________
       (page generated 2023-06-05 23:02 UTC)