[HN Gopher] Implementing Cosine in C from Scratch (2020) ___________________________________________________________________ Implementing Cosine in C from Scratch (2020) Author : alraj Score : 187 points Date : 2023-06-05 11:12 UTC (11 hours ago) (HTM) web link (austinhenley.com) (TXT) w3m dump (austinhenley.com) | JKCalhoun wrote: | The author's Taylor Series looked salvageable to me. The range | between -Pi/2 and Pi/2 looked fine. That usable range can be | re-used to serve all other portions of the circle. | | Add a conditional that applies (-result) for some angles and | you're golden. | londons_explore wrote: | OP's lookup table benchmarking is bad. The "modd(x, CONST_2PI);" | at the top will dominate, by far, the runtime. | | Anyone who wants performance measures angles using n-bit fixed- | point math, mapping 0 to 2*pi as 0 to (2^n)-1. | omgmajk wrote: | >Maybe don't stare at that for too long... | | I sure did stare at that for way too long. | pacaro wrote: | I went through the same exercise implementing trig functions for | scheme in webassembly... | | It was a rabbit hole for sure | | https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s... | | [Edit: above is just the range from 0-pi/4, the branch cuts for | cos are at | https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s... ] | dang wrote: | Url changed from | https://web.archive.org/web/20210513043002/http://web.eecs.u..., | which points to this. | jpfr wrote: | The Taylor expansion locally fits a polynomial based on the first | n derivatives. If you want to find the "best" nth-degree | polynomial to approximate the sine function, functional analysis | gives the tools for solving that optimization in closed form. | | By selecting an appropriate norm (in function space) you can | minimize either the maximum error or the error integral over some | range (e.g. the 0-pi range). | | Here's a video on the subject. You might want to watch earlier | ones also for more context.
| | https://www.youtube.com/watch?v=tMlKZZf2Kac&list=PLdkTDauaUn... | | Full disclosure, this is my university lecture on optimization | that was recorded during Covid. | stephencanon wrote: | It's worth noting that for fixed-precision math libraries, we | usually want the best approximation in the (possibly weighted) | L-inf sense, which doesn't generally have a closed-form | solution, so we use iterative methods to generate | approximations instead. | bigdict wrote: | What's a good textbook intro to these methods? | remcob wrote: | I wrote this intro for a friend and since then more people | have found it useful: | | https://xn--2-umb.com/22/approximation | | If you want a proper textbook, I recommend L.N. Trefethen | (2019). "Approximation Theory and Approximation Practice." | Which is available online here: | | https://epubs.siam.org/doi/book/10.1137/1.9781611975949 | munificent wrote: | I think this is the first time I've seen a punycode URL | in the wild. I love it! | stephencanon wrote: | That's a really nice overview of Remez, thanks for | sharing! | nsajko wrote: | Seems also worth noting that although the usual algorithm for | finding the minimax polynomial is the very old Remez | algorithm, I'm playing with a minimax finder that relies on | what I think is a novel algorithm. My algorithm seems better | than Remez, and certainly beats it (as in, finds the solution | when Remez can not) in many cases, even though I don't have | an analysis of the algorithm. | | The main idea is to use linear programming with some chosen | points from the interval to look for the polynomial that | approximates the chosen function over that interval. Unlike | Remez, this enables control over which individual points from | the interval are chosen as representatives, which enables | avoiding ill-behaved points. 
An example of where this leads | to improvements over Remez is when optimizing relative error: | Remez would trip over zeros of the function, because they | cause singularities in the relative error; however, my | algorithm works (by avoiding ill-behaved points) as long as | the discontinuities are removable. | | My algorithm is also a lot more flexible than Remez', for | example it allows optimizing over multiple intervals instead | of a single one. | | The Git repo of the (still in-progress) project is here: | https://gitlab.com/nsajko/FindMinimaxPolynomial.jl | | The Julia package is already registered (installable from the | official Julia registry): https://juliahub.com/ui/Packages/Fi | ndMinimaxPolynomial/kNIo8... | adgjlsfhk1 wrote: | This is really cool and I think it might be a re-discovery | of https://dl.acm.org/doi/abs/10.1145/3434310. | rjmunro wrote: | Can you get better accuracy by noting that cos(x) === | sin(pi/2 - x) and using e.g. the Taylor expansion for sin when | pi/4 < x < 3pi/4? | Technotroll wrote: | Looks to me like you could compute the top half, and then just | repeat the rest as a kind of mirror function, that repeats with | some set translation. Am I wrong here? | dh2022 wrote: | I thought the same thing as well. When I saw the Taylor sum | diverging at the "edges" I thought - well, a Taylor sum converges | absolutely only in a compact interval. An easy solution to this | limitation is to use the periodicity of the cos/sin function... | sojuz151 wrote: | I would say the benchmarks with lookup tables are bad. In the | benchmark the CPU will keep the table in cache, but in a real | program this cache would have to be shared with the rest of the | code/data.
| This would either kill the performance of cosine or the rest of | the app | dang wrote: | Related: | | _Implementing Cosine in C from Scratch (2020)_ - | https://news.ycombinator.com/item?id=30844872 - March 2022 (134 | comments) | | _Implementing cosine in C from scratch_ - | https://news.ycombinator.com/item?id=23893325 - July 2020 (20 | comments) | eschneider wrote: | Trig functions are where accuracy and performance go to die. | Accumulated error is a thing, so when optimizing always consider | exactly how you're going to be using those functions so you make | the 'right' tradeoffs for your application. One size definitely | doesn't fit all here, so test and experiment. | amadio wrote: | Taylor expansion is usually not so good for this, you'll fare | better with either Legendre or Chebyshev polynomials. Robin Green | has some excellent material on the subject, which I am linking | below. | | Faster Math Functions: | https://basesandframes.files.wordpress.com/2016/05/fast-math... | https://basesandframes.files.wordpress.com/2016/05/fast-math... | | Even faster math functions GDC 2020: | https://gdcvault.com/play/1026734/Math-in-Game-Development-S... | aidenn0 wrote: | Marz's Taylor series implementation as-is is significantly faster | (~60% the runtime) than the glibc implementation at 6 terms on My | Machine(tm), rather than about the same as in TFA. I haven't | compared to the LERP lookup table yet though. | femto wrote: | If doing signal processing, an optimisation is to use an N-bit | integer to represent the range 0 to 2.pi. Some example points in | the mapping: 0->0, 2^(N-2)->pi/2, 2^(N-1)->pi, 2^N(wraps to | 0)->2.pi(wraps to 0). | | If your lookup table has an M-bit index, the index to the lookup | table is calculated with: index = (unsigned)theta>>(N-M), where | theta is the N-bit integer representing the angle. | | The fractional part, which can be used for interpolation, is: | theta & ((1<<(N-M))-1).
| | If you choose M=N=word size (16 bits is often nice), the angle | can be used directly as an index. With 16 bits, accuracy is | typically good enough without interpolation and the entire trig | operation reduces to an array indexing operation. | | This minimises the overhead of converting an angle to a table | index. | polalavik wrote: | This method has 2 error terms - phase error and amplitude | error. You can only represent sinusoids with a resolution of | 2pi/2^N rad/samp (same phase error if you don't do phase | truncation). Maybe that's enough for most applications, but | it's probably significantly more error than math libraries that | calculate cos. Totally depends on your application. | rjmunro wrote: | Why just signal processing? Surely this is good for storing | e.g. rotations of characters in games, longitude, in fact | almost anywhere where you want an angle where a full rotation | is the same as no rotation at all, and more precision as you | approach 0deg is not useful. | vikingerik wrote: | Accumulated error can matter, which is where you'd want that | precision near 0deg. If something is making tiny turns each | frame for thousands of frames, floating-point inaccuracy can | add up. (A smarter mathematical approach might remedy this, | or might not; think of some game component that is supposed | to jitter randomly each frame, maybe some particle system.) | It's not common, but it's within hypothetical feasibility | that the precision would be useful. | amitport wrote: | He is describing deterministic uniform round-to-nearest | quantization. The most common smarter mathematical approach | would be randomized rounding and it will solve this issue. | | (Issue being that the deterministic rounding is a biased | estimator of the sin function) | londons_explore wrote: | > randomized rounding | | There are a bunch of places where randomized rounding | works better. But is there any computer hardware that | implements this?
Obviously the internal precision (of the | random element) can't be infinite, but a couple of extra | bits would seem worth the complexity. | mvcalder wrote: | If you like this sort of thing there's a great book: Methods and | Programs for Mathematical Functions by Stephen Moshier. It covers | the computation of all sorts of special functions. The code is | available in the cephes library but the book may be out of print. | jonsen wrote: | Not even a single hit on bookfinder.com! | xorvoid wrote: | How good do you need it? Lol. | | This is the approximation that I used for the animated sinwave | example for SectorC: | | y ~= 100 + (x*(157 - x)) >> 7 | | https://github.com/xorvoid/sectorc/blob/main/examples/sinwav... | azhenley wrote: | No need for archive.org, my website moved a few years ago: | https://austinhenley.com/blog/cosine.html | butterlesstoast wrote: | Thanks for taking the time to make such an incredible article. | Made my day reading it this morning. | dang wrote: | Fixed. Thanks! | capitainenemo wrote: | BTW, WRT projects using a lookup table... | | https://hg.hedgewars.org/hedgewars/file/tip/hedgewars/uSinTa... | | I think there's even a github mirror someone made since that's | where you were looking. Maybe a little out of date, but this is | old code. | capitainenemo wrote: | oh... also, sine table isn't just a performance thing. | Hedgewars does fixed-point custom math for deterministic | lockstep, so keeps the math simple and predictable on various | compilers/hardware. | [deleted] | charlieyu1 wrote: | Related: | | https://news.ycombinator.com/item?id=35381968 Cosine | Implementation in C | | Much better approximation with only 7 terms | londons_explore wrote: | Note that you can combine the lookup table with the Taylor series | expansion... | | And you can even use the same lookup table for each.
That means | with 2 table lookups in a 32-entry table, a single multiply and | add (and a few bit shifts), you can get ~9 bits of precision in | your result, which is fine for most uses. It also probably makes | a sin operation take ~1 clock cycle on most superscalar | architectures, as long as your workload has sufficient | instruction parallelism. | | Note that a smaller table typically works out faster because 32 | entries fit in the cache, whereas repeated random entry into a | 1024-entry table would probably kick a bunch of other stuff out | of the cache that you wanted. | throwawaaarrgh wrote: | Strange. This article was easy to read, simple formatting, not | driven by trends or weird internet culture. There weren't any | bombastic claims, aggrandizing statements or dramatic opinions. | Just interesting information presented clearly without being | verbose. Almost like it was written by an adult. How did this end | up on HN? | smlacy wrote: | Yeah, very much unlike your comment. Strange, that. | [deleted] | Helenarttr wrote: | [dead] | Aardwolf wrote: | " In all my time of coding, there has only been one situation in | which I used cosine: games, games, games. | function move(obj) { obj.x += obj.speed * | Math.sin(obj.rotation); obj.y += obj.speed * | Math.cos(obj.rotation); } | | " | | Why not store the velocity as a 2D vector instead? Then you still | have to use cos/sin to compute this vector, but at least you | don't need it every frame, plus often you don't need to use | cos/sin to compute this vector either since forces that act on | the velocity themselves can have an x and y component you can | directly add to it | messe wrote: | Sometimes it's just easier to work and think in polar | coordinates. If that's the case, then (speed, rotation) is just | as valid a choice as (vx, vy). For smaller games, the cost of | computing sin and cos every frame is so astronomically small, | that it's not even worth concerning yourself over.
| Aardwolf wrote: | Hmm, I usually find polar coordinates harder to reason with | due to the gimbal lock :) | | In 2D there's no gimbal lock of course, but there's still a | number that behaves differently than the others, the angle | wrapping around in range 0..2pi! | | And of course the article is about fast computation of | sin/cos so the option of not computing it should be | considered | midjji wrote: | The equivalent of polar coordinates in 3d are quaternions, | which do not have gimbal lock; the problem only occurs for | Euler angles (which should never ever be used for anything | ever). | messe wrote: | > The equivalent of polar coordinates in 3d are | quaternions | | I'd disagree. Quaternions are the 3D equivalent (yes, I'm | aware of the subtleties here, but speaking in a rough | sense) to using complex numbers to represent rotations. | | Euler angles are much closer in spirit to polar | coordinates, but unfortunately have gimbal lock. | thehappypm wrote: | Can you explain gimbal lock? I can't wrap my head around it | Aardwolf wrote: | The main thing is how when representing the 3D looking | orientation as multiple angles (pitch=up/down, | yaw=compass direction), you lose control over the yaw | axis when looking straight up. | [deleted] | pacaro wrote: | Of course gimbal lock is a thing, and there are ways to | avoid it, even without quaternions, but I'm curious as to | when it should be avoided. I think that gamers are now so | used to it, that removing it might cause more problems | than it solves | [deleted] | Aardwolf wrote: | I guess that's more for airplane simulators then. But I'd | also think any physics engine where things can rotate | arbitrarily needs to work correctly here. | | I also think, despite various sources on the internet | saying "quaternions are the solution to gimbal lock", | that classical 3D vectors and 3D matrices without | quaternions also perfectly solve it. | pacaro wrote: | Agreed absolutely.
I've used quaternions when someone | else made the decision that was what the physics engine | was going to do! But when I'm the decision maker it's 3d | vectors and matrices all the way | [deleted] | basementcat wrote: | A physical analogy is an azimuth-elevation mounted | antenna tracking an object that flies over the zenith. | When the object is near the horizon, the angular velocity | of the azimuth actuator may be near zero and the | elevation actuator may be small. Near zenith, the angular | velocity of the elevation actuator may be small and the | azimuth actuator may be near infinity. | mistercow wrote: | This is a fine introduction to how you even approach the problem | of computing transcendental functions. | | It would have been nice to have some discussion of accuracy | requirements rather than just eyeballing it and saying "more | accuracy than I need". This is a spot where inexperienced devs | can easily get tangled up. An attitude like "ten digits? That's | so many! I'm only making a game, after all" is exactly the sort | of thing that gets you into trouble if you start accumulating | errors over time, and this is particularly easy to do with trig | functions. | 082349872349872 wrote: | Niklaus Wirth also has some cosine code (probably meant more for | software floating point, or fp FPGA blocks); I don't know how | they compare with these approximations but his seem to be within | 1e-6 of python's math.cos ... | | https://people.inf.ethz.ch/wirth/ProjectOberon/Sources/Math.... | dm319 wrote: | The answer to Cos(1.57079632) (in radians) will give you a clue | as to how your calculator does it. | | See here [0] | | [0] | https://www.reddit.com/r/calculators/comments/126st95/cos157... 
| aidenn0 wrote: | 6.7948900e-9 which has only the first 5 digits correct (if you | assume it would have to have been 6.79490 to get 6 digits | correct since that is what the correct value rounds to) | the_third_wave wrote: | Quite well when you realise this comes down to _cos(pi/2)_ to | which the thing answers correctly: _cos(pi/2) = 0_. | | Using HiPER Calc on Android | midjji wrote: | A good idea is to not compute the values of cos from 0-2pi, | but further reduce the range, using cos(a) = cos(-a), and cos(2a) | = 2cos(a)^2 - 1, or cos(a+pi/4) =... | | So we really only ever need to be able to compute cos in the | range 0-pi/4. | | Then for further accuracy we can do the Taylor expansion around | pi/8. (or other approximations) | | Finally, the number of terms for a fixed accuracy varies with the | distance from pi/8. | londons_explore wrote: | I think all real implementations use a lookup table... Small | lookup tables (e.g. 8 elements) all work out far smaller in | silicon area than even one multiplier... | xipix wrote: | Nice. I'd love to see how this changes when you have SIMD and | multiple cosines to compute in parallel. Also when you have to | compute sine and cosine simultaneously, which is often the case, | and then you may be more interested in polar error than Cartesian | error. | p0nce wrote: | You can do this with: https://github.com/RJVB/sse_mathfun Those | functions do not have the same safety as libc, but even for | scalar sin/cos it can be faster. | amiga386 wrote: | One of the things I loved most about reading kernel and libc (or | rather, libm) sources was the floating-point and FP emulation | code. I thought it was immensely cool to see what others just | expected an FPU instruction to do, someone had not only written | out in C, but also wrote comments explaining the mathematics | (rendered in ASCII), often with numerical analysis about error | propagation, accuracy, etc.
| | Some examples: | | https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=new... | | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... | | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... | sampo wrote: | > cosine repeats every 2pi | | > We can do better since the cosine values are equivalent every | multiple of pi, except that the sign flips. | | There is one more step to take: Each pi/2-long segment has | identical shape, they are just pointing up or down (like you | already noticed), and left or right. So you can reduce your basic | domain down to not just 0...pi, but to 0...pi/2. | amelius wrote: | But that takes extra clock cycles, since now you have to fiddle | with the sign. | Someone wrote: | I think it's likely that you get that more than back by | having to do (at least) one iteration fewer. | | Those who know more about this: educate me. | adgjlsfhk1 wrote: | Pretty much every implementation does reduction mod pi/2. | Switching polynomials is basically free (and halving the | domain significantly reduces the number of terms needed). | IIAOPSW wrote: | And you can keep going with this logic, building cos from 0-x | by recursively calculating cos from 0-x/2 then mirroring over | some line if needed. Eventually you hit the machine precision | limit, aka the base case cos(0) = 1. | ajross wrote: | And still further: the function can be piecewise decomposed | into separate polynomials, which is what the pervasive SunPro | code in everyone's libc does. There's one "mostly parabolic" | function to handle the curvy top, and another "mostly linear" | one to handle the zero crossing region. | gmiller123456 wrote: | You can actually reduce it to pi/4 as the range from 0 to pi/2 | is the same shape as the remaining 3 segments, just mirrored | along different axes. ___________________________________________________________________ (page generated 2023-06-05 23:02 UTC)