[HN Gopher] Speeding up atan2f
       ___________________________________________________________________
        
       Speeding up atan2f
        
       Author : rostayob
       Score  : 196 points
       Date   : 2021-08-17 12:26 UTC (10 hours ago)
        
 (HTM) web link (mazzo.li)
 (TXT) w3m dump (mazzo.li)
        
       | cogman10 wrote:
       | I'm actually a bit surprised that the x86 SIMD instructions don't
       | support trig functions.
        
       | drfuchs wrote:
       | Wouldn't CORDIC have done the trick faster? There's no mention
       | that they even considered it, even though it's been around for
       | half a century or so.
        
         | adrian_b wrote:
         | CORDIC was great for devices that were too simple to include
         | hardware multipliers.
         | 
          | CORDIC, in its basic form, produces just 1 bit of result per
          | clock cycle. Only at very low precisions (i.e. with few bits
          | per result) would you have the chance to overlap enough
          | parallel atan2 computations to achieve a throughput comparable
          | to what you can do with polynomial approximations on modern
          | CPUs, with pipelined multipliers.
         | 
         | CORDIC remains useful only for ASIC/FPGA implementations, in
         | the cases when the area and the power consumption are much more
         | important than the speed.
        
         | mzs wrote:
          | CORDIC is unlikely to be faster than this article's approach:
          | "the approximation produces a result every 2 clock cycles"
        
       | jvz01 wrote:
       | I have developed very fast, accurate, and vectorizable atan() and
       | atan2() implementations, leveraging AVX/SSE capabilities. You can
       | find them here [warning: self-signed SSL-Cert].
       | 
       | https://fox-toolkit.org/wordpress/?p=219
        
         | ghusbands wrote:
         | Your result is significantly slower than the versions presented
         | in the article, though yours has more terms and so may be more
         | accurate.
        
           | jvz01 wrote:
            | Yes, the aim was to be accurate down to 1 LSB while being
            | significantly faster. Feel free to drop terms from the
            | polynomial if you can live with less accurate results!
           | 
            | The coefficients were generated by a package called Sollya;
            | I've used it a few times to develop accurate Chebyshev
            | approximations for functions.
           | 
           | Abramowitz & Stegun is another good reference.
        
             | touisteur wrote:
              | Please, would you mind one of these days updating your blog
              | post with the instructions you gave to Sollya? I'm trying
              | something stupid with log1p and can't get Sollya to help,
              | mostly because I'm not putting enough time into reading all
              | the docs...
        
         | jvz01 wrote:
          | Little side note: the algorithm as given is scalar; however,
          | it's branch-free and defined entirely in the header file. So
          | compilers will typically be able to vectorize it, and thus
          | achieve a speedup directly based on the vector size. I see a
          | potential [but architecture-dependent] optimization using
          | Estrin's scheme for evaluating the polynomial.
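          | 
          | For illustration, a minimal C sketch of Estrin's scheme on a
          | degree-7 polynomial (placeholder coefficients, not the ones
          | from the linked post):
          | 
          |     #include <math.h>
          | 
          |     /* p(x) = c[0] + c[1]*x + ... + c[7]*x^7, evaluated as
          |        pairs of FMAs; the pairs are independent, so the CPU
          |        can overlap them, unlike Horner's serial chain. */
          |     static inline float estrin7(float x, const float c[8]) {
          |         float x2 = x * x, x4 = x2 * x2;
          |         float t01 = fmaf(c[1], x, c[0]);
          |         float t23 = fmaf(c[3], x, c[2]);
          |         float t45 = fmaf(c[5], x, c[4]);
          |         float t67 = fmaf(c[7], x, c[6]);
          |         float lo = fmaf(t23, x2, t01);
          |         float hi = fmaf(t67, x2, t45);
          |         return fmaf(hi, x4, lo);
          |     }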
        
       | aconz2 wrote:
        | Nice writeup and interesting results. I hadn't seen
        | perf_event_open(2) used directly in code before, which looks cool.
       | 
       | The baseline is at a huge disadvantage here because the call to
       | atan2 in the loop never gets inlined and the loop doesn't seem to
       | get unrolled (which is surprising actually). Manually unrolling
       | by 8 gives me an 8x speedup. Maybe I'm missing something with the
       | `-static` link but unless they're using musl I didn't think -lm
       | could get statically linked.
        
         | xxpor wrote:
         | Could it be that calls to glibc never get inlined, since
         | inlining is essentially static linking by another name? Or
         | since they're in separate compilation units, without LTO you'd
         | never perform the analysis to figure out if it's worth it.
         | Would LTO inspect across a dynamic loading boundary? Just
         | speculating, I really have no idea. Everything I work with is
         | statically linked, so I've never really had to think about it.
        
         | aconz2 wrote:
          | You've made a silly mistake and just did 1/8th the work, so of
          | course it was an 8x speedup. I am still wondering how the numbers
         | would look if you could inline atan2f from libc. Too bad that
         | isn't easier
        
       | [deleted]
        
       | h0mie wrote:
       | Love posts like this!
        
       | sorenjan wrote:
       | How do you handle arrays of values where the array lengths are
       | not a multiple of 8 in this kind of vectorized code? Do you zero
        | pad them before handing them to the vectorized function, or do
       | you run a second loop element by element on the remaining
       | elements after the main one? What happens if you try to do
       | `_mm256_load_ps(&ys[i])` with < 8 elements remaining?
        
         | TonyTrapp wrote:
          | Typically you have the vectorized loop followed by a non-
          | vectorized loop handling the remaining items, or even just
         | explicit if-then-else statements for each possible number of
         | remaining items.
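          | 
          | A minimal sketch of that pattern, assuming a hypothetical
          | kernel (process8/process1 stand in for the real per-element
          | work):
          | 
          |     #include <stddef.h>
          |     #include <immintrin.h>
          | 
          |     static inline __m256 process8(__m256 v) {
          |         return _mm256_mul_ps(v, _mm256_set1_ps(2.0f));
          |     }
          |     static inline float process1(float x) { return 2.0f * x; }
          | 
          |     void process_all(const float *in, float *out, size_t n) {
          |         size_t i = 0;
          |         /* main vectorized loop: 8 floats per iteration */
          |         for (; i + 8 <= n; i += 8) {
          |             __m256 v = _mm256_loadu_ps(&in[i]);
          |             _mm256_storeu_ps(&out[i], process8(v));
          |         }
          |         /* scalar tail loop for the remaining n % 8 items */
          |         for (; i < n; ++i)
          |             out[i] = process1(in[i]);
          |     }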
        
         | vlovich123 wrote:
          | > Do you zero pad them before handing them to the vectorized
         | function, or do you run a second loop element by element on the
         | remaining elements after the main one?
         | 
          | Typically the latter. That's why I find SVE so interesting.
          | It should improve code density.
        
         | zbjornson wrote:
         | You could pad the input as you said, which avoids a "tail
         | loop," but otherwise you usually do a partial load (load <8
         | elements into a vector) and store. Some instruction set
         | extensions provide "masked load/store" instructions for this,
         | but there are ways to do it without those too.
         | 
         | To your last question specifically, if you
         | _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll
         | get a segfault. Otherwise you'll get undefined values (which
         | can be okay, you could ignore them instead of dealing with a
         | partial load).
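          | 
          | A minimal sketch of the masked-load/store variant, using AVX's
          | maskload/maskstore (the integer compare used to build the mask
          | needs AVX2):
          | 
          |     #include <stddef.h>
          |     #include <immintrin.h>
          | 
          |     /* lane i is enabled iff i < rem, for rem in 0..7 */
          |     static __m256i tail_mask(size_t rem) {
          |         const __m256i idx =
          |             _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
          |         return _mm256_cmpgt_epi32(_mm256_set1_epi32((int)rem),
          |                                   idx);
          |     }
          | 
          |     void copy_tail(const float *in, float *out, size_t rem) {
          |         __m256i m = tail_mask(rem);
          |         __m256 v = _mm256_maskload_ps(in, m); /* rest read 0 */
          |         _mm256_maskstore_ps(out, m, v);   /* rest untouched */
          |     }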
        
         | twic wrote:
         | RISC-V has a solution to this which seems quite elegant - the
         | vector length is set by the program, up to some maximum, and
         | for the final block, it is simply whatever is left:
         | 
         | https://gms.tf/riscv-vector.html
        
       | Const-me wrote:
        | I wonder how it compares with Microsoft's implementation
        | here:
       | https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
       | 
        | Based on the code, your version is probably much faster. It would
        | still be interesting to compare precision; MS uses a 17-degree
        | polynomial there.
        
         | pclmulqdq wrote:
         | The "small error" in this article isn't so small when people
         | want exact results. Unfortunately, a high-degree polynomial is
         | necessary if you want 24 bit precision across the input range
         | for functions like this.
         | 
         | That said, I wonder if a rational approximation might not be so
         | bad on modern machines...
        
           | azalemeth wrote:
           | Indeed. I love Pade approximants -- they're very underused.
            | Another really fun trick for approximating expensive
            | functions is turning them into Chebyshev approximations, as
            | in Nick Trefethen's chebfun package. You can trivially
           | differentiate, integrate and compute chebfun approximations
           | of really horrible special functions (e.g. DE solutions) with
           | fantastic accuracy. Criminally underused.
        
             | jacobolus wrote:
             | If you like Pade approximants, have you been following
             | Trefethen & al.'s recent work on the "AAA" algorithm etc.?
             | https://arxiv.org/pdf/1612.00337.pdf
             | https://arxiv.org/pdf/1705.10132.pdf
             | 
             | See also the lecture
             | https://www.youtube.com/watch?v=S1upJPMIFfg
        
               | azalemeth wrote:
               | No, I haven't -- thank you very much for sharing. I've
               | only ever used them, e.g. for spectral decomposition in
               | magnetic resonance in a medical setting (1) with the fast
               | Pade transformation, which has a wonderfully deep
               | relationship with the DEs describing magnetic resonance
               | (2).
               | 
                | Thanks also for the lecture - I'll enjoy it. Nick Trefethen
               | is such a good speaker. He taught me in a graduate course
               | in numerical methods -- honestly, I'd watch those
               | lectures on a loop again.
               | 
               | (1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5179227/
               | 
               | (2) https://iopscience.iop.org/article/10.1088/0031-9155/
               | 51/5/00...
        
           | Const-me wrote:
           | > I wonder if a rational approximation might not be so bad on
           | modern machines
           | 
           | Uops.info says the throughput of 32-byte FP32 vector division
           | is 5 cycles on Intel, 3.5 cycles on AMD.
           | 
           | The throughput of 32-byte vector multiplication is 0.5
            | cycles. The same goes for FMA, which is a single instruction
           | computing a*b+c for 3 float vectors.
           | 
           | If the approximation formula only has 1 division where both
           | numerator and denominator are polynomials, and the degree of
           | both polynomials is much lower than 17, the rational version
            | might not be too bad compared to the 17-degree polynomial
           | version.
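            | 
            | A minimal sketch of that shape, with placeholder
            | coefficients (a real fit would come from a tool like
            | Sollya): two cubics evaluated with FMAs, then one division.
            | 
            |     #include <immintrin.h>
            | 
            |     static inline __m256 poly3(__m256 x, float c3, float c2,
            |                                float c1, float c0) {
            |         __m256 r = _mm256_set1_ps(c3);
            |         r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c2));
            |         r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c1));
            |         return _mm256_fmadd_ps(r, x, _mm256_set1_ps(c0));
            |     }
            | 
            |     static inline __m256 rational_approx(__m256 x) {
            |         __m256 num = poly3(x, 0.1f, 0.2f, 0.3f, 0.4f);
            |         __m256 den = poly3(x, 0.5f, 0.6f, 0.7f, 1.0f);
            |         return _mm256_div_ps(num, den); /* one division */
            |     }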
        
             | pclmulqdq wrote:
             | A Pade approximation is generally like that. Especially
             | when you are dealing with singularities, it should be a
             | much smaller polynomial than 17 terms in the numerator and
             | denominator. I think I'm going to have to start a blog, and
             | this will be on the list.
        
               | janwas wrote:
               | Indeed, degree-4 rational polynomials are generally
               | enough for JPEG XL. Code for evaluating them using
               | Horner's scheme (just FMAs): https://github.com/libjxl/li
               | bjxl/blob/bdde644b94c125a15e532b...
               | 
               | I initially used Chebfun to approximate them, then a
               | teammate implemented the same Caratheodory-Fejer method
               | in C. We subsequently convert from Chebyshev basis to
               | monomials.
               | 
               | Blog sounds good :)
        
           | petschge wrote:
            | 24-bit precision would be possible with the Hastings
            | approximation with 15 terms instead of 6. A naive estimate
            | would be that this requires 2.5 times as much work, could
            | run at 6 GB/s, and take a little less than 5 cycles per
            | element.
        
             | Someone wrote:
             | I think that remains to be seen. Exact calculations will
             | give you 24 bits, but in computers, floating point
             | calculations typically aren't exact.
             | 
             | If, as this article does, you compute all your intermediate
             | values as 32-bit IEEE floats, it would hugely surprise me
             | if you still got 24 bits of precision after 1 division (to
             | compute y/x or x/y), 15 multiplications and 14 additions
             | (to evaluate the polynomial) and, sometimes, a subtraction
              | from pi/2.
             | 
             | (That's a weakness in this article. They point out how fast
             | their code is, but do not show any test that it actually
              | computes approximations to atan2, let alone how accurate
              | their results are.)
             | 
             | Getting all the bits right in floating point is expensive.
        
               | adgjlsfhk1 wrote:
                | The good news is that the high-order terms of the
                | polynomial don't notably affect the error. With FMA,
                | Horner's method converges to 0.5 ULP accuracy. The
                | subtraction from pi/2 can also be done without adding
                | inaccuracy by using compensated summation.
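                | 
                | A minimal sketch of that pi/2 trick, using a two-float
                | (hi + lo) split of pi/2 (constants approximate):
                | 
                |     /* nearest float to pi/2, and the remainder */
                |     #define PI_2_HI 1.57079637050628662109375f
                |     #define PI_2_LO -4.371139e-8f
                | 
                |     /* atan(x) = pi/2 - atan(1/x) for x > 0; folding
                |        the low part into the subtrahend first avoids
                |        losing the rounding error of float(pi/2). */
                |     static inline float from_recip(float atan_recip) {
                |         return PI_2_HI - (atan_recip - PI_2_LO);
                |     }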
        
       | gumby wrote:
       | Because of the disregard for the literature common in CS I loved
       | this part:
       | 
       | > This is achieved through ... _and some cool documents from the
       | 50s_.
       | 
        | A bit of an anecdote: back when I was a research scientist
       | (corporate lab) 30+ years ago I would in fact go downstairs to
       | the library and read -- I was still a kid with a lot to learn
       | (and still am). When I came across (by chance or by someone's
       | suggestion) something useful to my work I'd photocopy the article
        | and try it out. I'd put a comment in the code with the
       | reference.
       | 
       | My colleagues in my group would give me (undeserved) credit for
       | my supposed brilliance even though I said in the code where the
       | idea came from and would determinedly point to the very paper on
       | my desk. This attitude seemed bizarre as the group itself was
       | producing conference papers and even books.
       | 
       | (Obviously this was not a universal phenomenon as there were
       | other people in the lab overall, and friends on the net,
       | suggesting papers to read. But I've seen it a lot, from back then
       | up to today)
        
         | lmilcin wrote:
          | It is very unfortunate that in the world of ubiquitous
          | information, the ability to just look up a reasonable solution
          | to a well known and well studied problem that has already been
          | solved long ago is practically a superpower.
         | 
          | As a developer, I am regularly in situations where I solve
          | some large problem people have had for a long time, with
          | significant impact for the company, mostly with a couple of
          | Google searches.
         | 
         | Once I solved a batch task that took 11 hours to complete and
         | reduced it to about 5s of work by looking up a paper on
         | bitemporal indexes. The reactions ranged from "You are
         | awesome!" to "You red a... paper????" to "No, we can't have
         | this, this is super advanced code and it surely must have bugs.
         | Can't you find an open source library that implements this?"
        
           | jacobolus wrote:
           | One of the most valuable (but often under-appreciated) kinds
           | of work people can do is to take hard-won knowledge that
           | exists somewhere obscure and make it more accessible, e.g. by
           | writing expository survey papers, adding material to
           | Wikipedia, writing documentation, ...
        
           | zokier wrote:
           | > It is very unfortunate that in the world of ubiquitous
           | information, ability to just look up a reasonable solution to
           | a well known and well studied problem that has already been
           | solved long ago is practically a superpower
           | 
            | It's exactly because information is so ubiquitous and
           | plentiful that finding a reasonable solution is like looking
           | for a needle in a haystack.
        
             | lmilcin wrote:
              | In most cases in my experience, the people did not face any
              | challenge actually grepping the internet for the solution.
             | 
              | More likely they just went and rolled their own solution
             | for any number of other reasons:
             | 
             | -- For the fun of intellectual challenge (I am guilty of
             | this),
             | 
              | -- For having the hubris to think they are the first to
              | meet that challenge (I have been guilty of this in the
              | past; I no longer make that mistake),
             | 
              | -- For having the hubris to think they have the uniquely
              | best solution to the problem, so it's highly unlikely to
              | even be worth verifying,
             | 
              | -- For feeling superior to everybody else, and hence seeing
              | no real reason to dignify other, lesser people by being
              | interested in their work,
             | 
             | -- For feeling there is just too much pressure and no time
             | to do it right or even to take a peek at what is right and
             | how much time it would take,
             | 
             | -- For not having enough experience to suspect there might
             | be a solution to this problem,
             | 
             | -- Books? Only old people read books,
             | 
              | -- Papers? They are for research scientists. I didn't
              | finish a CS degree just to read papers at work.
             | 
             | and so on.
             | 
             | There have really been very few concrete problems that I
             | could not find some solution to on the Internet.
             | 
              | It is more difficult to find solutions to less concrete
              | problems, but even with these kinds of problems, if you
              | have a habit of regularly seeking knowledge, reading
              | various sources, etc., you will probably amass enough
              | experience and knowledge to deal with most problems, even
              | complex ones, on your own.
        
               | djmips wrote:
               | I regularly hear coders say they cannot understand other
               | people's code so they _had_ to write it themselves. Most
                | people can't read papers either, it seems.
        
               | lmilcin wrote:
               | Well, that's another thing.
               | 
                | People have lost the ability to read anything longer than
                | a tweet with focus and comprehension. Especially new
                | developers.
               | 
                | I know this because whatever I put in the third and
                | further paragraphs of my tickets is not being read or
                | understood.
               | 
                | I sometimes quiz guys in review meetings and I even
               | sometimes put the most important information somewhere in
               | the middle of the ticket. Always with the same result --
               | they are surprised.
               | 
               | Now, to read a paper is to try to imagine being the
                | person who wrote it and appreciate not just the written
                | text but also a lot more things that are not necessarily
                | spelled out.
               | 
                | The same with code -- when I read somebody's code I want
               | to recognize their style, how they approach various
               | problems. Once you know a little bit about their
               | vocabulary you can really speed up reading the rest
               | because you can anticipate what is going to happen next.
        
               | nuclearnice1 wrote:
               | People may have a preference for certainty.
               | 
                | If they work on the problem, they will make progress and
                | likely solve it.
                | 
                | If they search the literature, they will spend time on
                | research and quite possibly lose days before emerging
                | with nothing to show for it.
        
               | [deleted]
        
           | refset wrote:
           | > bitemporal indexes [...] Can't you find an open source
           | library that implements this?
           | 
           | Undoubtedly too late for you, but
           | https://github.com/juxt/crux/blob/master/crux-
           | core/src/crux/... :)
        
         | kbenson wrote:
         | > 30+ years ago ... I was still a kid with a lot to learn (and
         | still am).
         | 
         | We should all strive to stay young at heart like this. ;)
        
         | dleslie wrote:
         | I find that disregard for education to be tasteless, but
         | perhaps warranted due to the modern prevalence of "bootcamps"
         | and similar fast-but-shallow programmes.
         | 
         | Personally, I love when I get a chance to apply something from
         | a published paper; I will leave a citation as a comment in the
         | code and a brief explanation of how it works and why it was
          | chosen. I have no regrets about having earned a computing
          | degree, so many years ago.
        
         | touisteur wrote:
          | But knowing where to look, being able to suss out details
          | from papers, having an almost adequate level of conversation
          | to try to contact authors (if still alive...), the ability to
          | put undecipherable formulas into code and to bear Stack
          | Overflow's requirements for a 'good' question -- that's not a
          | set of skills that amounts to nothing. It's rare nowadays that
          | I see any copy of a reference textbook on numerical recipes,
          | although we do tons of signal processing and HPC stuff.
         | 
          | Recently I rediscovered the ability to compute logs with
          | 64-bit integers with better precision than double precision.
         | Man, jumping back into those depths after so many years... Not
         | as easy as reading other people's code.
        
         | teddyh wrote:
         | I've seen an odd use of language, mostly from U.S. Americans,
         | where "intelligent" = "smart" = " _knowledgable_ ". And
         | conversely, assuming that knowing many things is what makes you
         | smart. I suspect this is what happened there; they thought you
         | "brilliant" because you knew many things - in their minds,
         | there might not be a difference between the two concepts.
        
         | whatshisface wrote:
         | Not brilliant? Use of the written word to pass ideas down
         | between generations is one of the most brilliant ideas in human
         | history, and since nobody around you was doing it, the only
         | explanation is that you independently invented the ethos of the
         | scholar yourself.
        
           | gumby wrote:
           | Heh heh, Ockham's razor provides an explanation!
        
         | specialist wrote:
         | Nice.
         | 
         | I truly miss foraging the stacks. Such delight.
         | 
         | Browsing scholar.google.com, resulting in 100s of tabs open,
         | just isn't the same.
        
           | jjgreen wrote:
           | I always found that the smell of old books filled me with the
           | urge to defecate, to the extent that I had to deliberately go
           | for a crap before diving into the stacks. Something of a
           | disadvantage as a pre-internet researcher. I thought I was
           | the only person in the world with this condition, but have
           | subsequently found a couple of other similarly afflicted
           | souls.
        
         | _moof wrote:
         | No one likes to do lit review, which is a shame. I get it -
         | it's way less fun to find out someone already bested your cool
         | idea than it is to write code. But it's definitely more
         | productive to read the literature!
        
       | prionassembly wrote:
       | I wonder whether Pade approximants are well known by this kind of
       | researcher. E.g. http://www-
       | labs.iro.umontreal.ca/~mignotte/IFT2425/Documents...
        
         | nimish wrote:
         | Numerical analysts are definitely familiar. Library authors may
         | not be unless they took a course in function approximation.
         | 
          | Personally I prefer Chebyshev approximants (as in chebfun),
          | but Pade approximants are really good, at the cost of a
          | division.
        
           | stephencanon wrote:
           | A math library implementor will generally be familiar with at
           | least minimax and Chebyshev approximations, and generally
           | Pade and Caratheodory-Fejer approximations as well. (Source:
           | I implemented math libraries for a decade. We used all of
           | these frequently.)
        
       | zokier wrote:
        | I would have liked to see the error analysis section expanded a
        | bit, or maybe some sort of test to validate the max error. In
        | particular, if the mathematical approximation function arctan*
        | has a max error of 1/10000 degrees, then I'd naively expect the
        | float-based implementation to have a worse error. Furthermore,
        | it's not obvious whether additional error could be introduced by
        | the division
        | 
        |     float atan_input = (swap ? x : y) / (swap ? y : x);
        
       | spaetzleesser wrote:
       | I am envious of people who can deal with such problems. The
       | problem is defined clearly and can be measured easily.
       | 
        | This is so much more fun than figuring out why some SaaS service
       | is misbehaving.
        
       | jacobolus wrote:
        | If someone wants a fast version of x -> tan(pi*x/2), let me
        | recommend the approximation:
        | 
        |     tanpi_2 = function tanpi_2(x) {
        |       var y = (1 - x*x);
        |       return x * (((-0.000221184 * y + 0.0024971104) * y
        |           - 0.02301937096) * y + 0.3182994604 + 1.2732402998 / y);
        |     }
       | 
       | (valid for -1 <= x <= 1)
       | 
       | https://observablehq.com/@jrus/fasttan with error:
       | https://www.desmos.com/calculator/hmncdd6fuj
       | 
       | But even better is to avoid trigonometry and angle measures as
       | much as possible. Almost everything can be done better (faster,
       | with fewer numerical problems) with vector methods; if you want a
       | 1-float representation of an angle, use the stereographic
        | projection:
        | 
        |     stereo = (x, y) => y/(x + Math.hypot(x, y));
        |     stereo_to_xy = (s) => {
        |       var q = 1/(1 + s*s);   // cos = (1-s*s)*q, sin = 2*s*q
        |       return !q ? [-1, 0] : [(1 - s*s)*q, 2*s*q];
        |     }
        
         | leowbattle wrote:
         | > Almost everything can be done better ... use the
         | stereographic projection
         | 
         | I was just learning about stereographic projection earlier.
         | Isn't it odd how when you know something you notice it
         | appearing in places.
         | 
         | Can you give an example of an operation that could be performed
         | better using stereographic projection rather than angles?
        
           | jacobolus wrote:
           | Generally you can just stick to storing 2-coordinate vectors
           | and using vector operations.
           | 
           | The places where you might want to convert to a 1-number
           | representation are when you have a lot of numbers you want to
           | store or transmit somewhere. Using the stereographic
           | projection (half-angle tangent) instead of angle measure
           | works better with the floating point number format and uses
           | only rational arithmetic instead of transcendental functions.
        
       | hdersch wrote:
        | A comparison with one of the many SIMD math libraries would have
        | been fairer than with plain libm. A long time ago I wrote such a
        | dual-platform library for the PS3 (Cell processor) and x86
        | architecture (outdated, but still available here [1]). Depending
        | on how the standard libm implements atan2f, a speedup of 3x to
        | 15x is achieved, without sacrificing accuracy.
       | 
       | 1. https://webuser.hs-furtwangen.de/~dersch/libsimdmath.pdf
        
       | drej wrote:
       | Nice. Reminds me of an optimisation trick from a while ago: I
       | remember being bottlenecked by one of these trigonometric
       | functions years ago when working with a probabilistic data
       | structure... then I figured the input domain was pretty small (a
       | couple dozen values), so I precomputed those and used an array
       | lookup instead. A huge win in terms of perf, obviously only
       | applicable in these extreme cases.
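        | 
        | A minimal sketch of that kind of trick, assuming the hot loop
        | only ever sees inputs from a small known set:
        | 
        |     #include <math.h>
        | 
        |     #define N 24  /* "a couple dozen" possible inputs */
        |     static float table[N];
        | 
        |     /* fill once, up front */
        |     static void init_table(const float inputs[N]) {
        |         for (int i = 0; i < N; ++i)
        |             table[i] = atanf(inputs[i]);
        |     }
        | 
        |     /* in the hot loop: one array read instead of a libm call */
        |     static inline float fast_atan_by_index(int input_index) {
        |         return table[input_index];
        |     }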
        
         | ThePadawan wrote:
         | One of the things I recently learned that sounded the most
         | "that can't possibly work well enough" is an optimization for
         | sin(x):
         | 
         | If abs(x) < 0.1, "sin(x)" is approximated really well by "x".
         | 
         | That's it. For small x, just return x.
         | 
         | (Obviously, there is some error involved, but for the speedup
         | gained, it's a very good compromise)
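          | 
          | A quick way to check the claim (the relative error of
          | sin(x) ~ x is about 0.17% at x = 0.1 and shrinks rapidly
          | below that):
          | 
          |     #include <math.h>
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         for (double x = 0.1; x > 1e-4; x /= 10.0) {
          |             double rel = (x - sin(x)) / sin(x);
          |             printf("x = %g  rel. error = %.3g\n", x, rel);
          |         }
          |         return 0;
          |     }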
        
           | sharikone wrote:
           | I think that you will find that for subnormal numbers any
           | math library will use the identity function for sin(x) and 1
            | for cos(x).
        
             | ThePadawan wrote:
             | Right, but the largest subnormal number in single-precision
             | floats is ~ 10^-38.
             | 
             | That the sin(x) approximation still works well for 10^-1
             | (with an error of ~0.01%) is the really cool thing!
        
           | chriswarbo wrote:
           | This is a very common assumption in Physics, e.g. https://en.
           | wikipedia.org/wiki/Pendulum_(mathematics)#Small-a...
           | 
           | Whether it's appropriate in a numerical calculation obviously
           | depends on the possible inputs and the acceptable error bars
           | :)
        
           | tzs wrote:
           | To put some numbers on it, using N terms of the Taylor series
           | for sin(x) [1] with |x| <= 0.1, the maximum error percentage
            | cannot exceed [2]:
            | 
            |     N  Error Limit
            |     1  0.167%          (1/6%)
            |     2  8.35 x 10^-5%   (1/11980%)
            |     3  1.99 x 10^-8%   (1/50316042%)
            |     4  2.76 x 10^-12%  (1/362275502328%)
           | 
           | Even for |x| as large as 1 the sin(x) = x approximation is
           | within 20%, which is not too bad when you are just doing a
           | rough ballpark calculation, and a two term approximation gets
           | you under 1%. Three terms is under 0.024% (1/43%).
           | 
           | For |x| up to Pi/2 (90deg) the one term approximation falls
           | apart. The guarantee from the Taylor series is within 65% (in
           | reality it does better, but is still off by 36%). Two terms
           | is guaranteed to be within 8%, three within 0.5%, and four
           | within 0.02%.
           | 
           | Here's a quick and dirty little Python program to compute a
           | table of error bounds for a given x [3].
           | 
           | [1] x - x^3/3! + x^5/5! - x^7/7! + ...
           | 
           | [2] In reality the error will usually be quite a bit below
           | this upper limit. The Taylor series for a given x is a
           | convergent alternating series. That is, it is of the form A0
           | - A1 + A2 - A3 + ... where all the A's have the same sign,
           | |Ak| is a decreasing sequence past some particular k, and
           | |Ak| has a limit of 0 as k goes to infinity. Such a series
           | converges to some value, and if you approximate that by just
           | taking the first N terms the error cannot exceed the first
           | omitted term as long as N is large enough to take you to the
           | point where the sequence from there on is decreasing. This is
           | the upper bound I'm using above.
           | 
           | [3] https://pastebin.com/thN8B7Gf
        
           | quietbritishjim wrote:
           | That is precisely the technique discussed in the article:
           | it's the first term of the Taylor expansion. Except that the
           | article used more terms of the expansion, and also used very
           | slightly "wrong" coefficients to improve the overall accuracy
           | within the small region.
        
           | Bostonian wrote:
           | Why wouldn't you at least include the x^3 term in the Taylor
           | series for abs(x) < 0.1?
        
             | ThePadawan wrote:
             | That's what I assumed would have been a reasonable
             | optimization!
             | 
             | What I really found amazing was that rather than reducing
             | the number of multiplications to 2 (to calculate x^3), you
             | can reduce it to 0, and it would still do _surprisingly_
             | well.
        
           | jvz01 wrote:
            | Another good hint: the classical half-angle formulas. You
           | can often avoid calling sin() and cos() altogether!
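            | 
            | One minimal example of that idea, assuming cos(theta) is
            | already available (say, from a dot product of unit
            | vectors); valid for 0 <= theta <= pi:
            | 
            |     #include <math.h>
            | 
            |     /* half-angle identities: only a sqrt, no trig calls */
            |     static inline void half_angle(float cos_theta,
            |                                   float *c, float *s) {
            |         *c = sqrtf(0.5f * (1.0f + cos_theta)); /* cos(t/2) */
            |         *s = sqrtf(0.5f * (1.0f - cos_theta)); /* sin(t/2) */
            |     }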
        
           | MauranKilom wrote:
           | Some trivia, partly stolen from Bruce Dawson[0]:
           | 
           | The sin(x) = x approximation is actually exact (in terms of
           | doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens
           | to cover 48.6% of all doubles.
           | 
           | Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
           | 
           | Finally, sin(double(pi)) tells you the error in double(pi) -
           | that is, how far the double representation of pi is away from
           | the "real", mathematical pi [3].
           | 
           | [0]: https://randomascii.wordpress.com/2014/10/09/intel-
           | underesti...
           | 
           | [1]: https://github.com/lattera/glibc/blob/master/sysdeps/iee
           | e754...
           | 
           | [2]: https://github.com/lattera/glibc/blob/master/sysdeps/iee
           | e754...
           | 
           | [3]: https://randomascii.wordpress.com/2012/02/25/comparing-
           | float...
        
           | RicoElectrico wrote:
           | Come on, at this point I've seen this "engineering
           | approximation" memed so many times, even on Instagram ;)
           | 
           | What is more interesting to me is that this can be one of
            | the rationales behind using radians.
           | 
           | And that tan(x)~x also holds for small angles, greatly easing
           | estimations of distance to objects of known size (cf. mil-dot
           | reticle).
        
             | lapetitejort wrote:
             | tan(x)~x because cos(x)~1. It's approximations all the way
             | down.
        
         | mooman219 wrote:
         | This would be a decent lookup for the atan2 function:
         | 
         | https://gist.github.com/mooman219/19b18ff07bb9d609a103ef0cd0...
        
         | tantalor wrote:
         | https://en.wikipedia.org/wiki/Memoization
        
           | bluedino wrote:
            | _Technically_ it's a _lookup table_ if you pre-compute them.
           | Memoization would just be caching them as you do them.
        
             | tantalor wrote:
             | Not necessarily, you could "cache" them in a compilation
             | step and then use the table at runtime.
        
           | bruce343434 wrote:
           | Tangential at best, but why was the 'r' dropped from that
           | term? Or why not call it caching? Why the weird "memo-
           | ization"? It makes me think of a mass extinction event where
           | everything is turned into a memo.
        
             | franciscop wrote:
             | It's explained right in the linked Wikipedia page:
             | 
             | > The term "memoization" was coined by Donald Michie in
             | 1968[3] and is derived from the Latin word "memorandum"
             | ("to be remembered"), usually truncated as "memo" in
             | American English, and thus carries the meaning of "turning
             | [the results of] a function into something to be
             | remembered". While "memoization" might be confused with
             | "memorization" (because they are etymological cognates),
             | "memoization" has a specialized meaning in computing.
        
             | wongarsu wrote:
             | The term memoization likely precedes the word caching (as
             | related to computing, obviously weapon caches are far
             | older). Memoization was coined in 1968. CPU caches only
             | came about in the 80s as registers became significantly
             | faster than main memory.
             | 
              | As Wikipedia outlines, the r was dropped because of the
              | memo. It's derived from the Latin word memorandum, which
              | does contain the r, just like memory, but apparently it
              | was meant more as an analogy to written memos.
        
       | unemphysbro wrote:
       | coolest blog post I've seen here in a while. :)
        
       | aj7 wrote:
       | Around 1988, I added phase shift to my optical thin film design
       | program written in Excel 4.0 for the Mac. At the time, this was
       | utterly unique: each spreadsheet row represented a layer and the
       | matrices describing each layer could be calculated right in that
       | row by squashing them down horizontally. The S- and
       | P-polarization matrices could be recorded this way, and the
       | running matrix products similarly maintained. Finally, using a
       | simple one input table, the reflectance of a typically 25-31
       | layer laser mirror could be calculated. And in less than a second
        | on a 20 MHz 68020 (?) Mac II for about 50 wavelengths. The best
        | part was the graphics, which were instantaneous, beautiful,
       | publishable, and pasteable into customer quotations. Semi-
       | technical people could be trained to use the whole thing.
       | 
       | Now about the phase shift. In 1988, atan2 didn't exist. Anywhere.
       | Not in FORTRAN, Basic, Excel, or a C library. I'm sure phase
       | shift calculators implemented it, each working alone. For us, it
       | was critical. You see we actually cared not about the phase
       | shift, but its second derivative, the group delay dispersion.
       | This was the beginning of the femtosecond laser era, and people
       | needed to check whether these broadband laser pulses would be
       | inadvertently stretched by reflection off or transmission through
       | the mirror coating. So atan2, the QUADRANT-PRESERVING arc
        | tangent, is required for a continuous, differentiable phase
       | function. An Excel function macro did this, with IF statements
       | correcting the quadrant. And the irony of all this?
       | 
       | I CALLED it atan2.
        
         | dzdt wrote:
          | Fortran did have it much earlier. Per Wikipedia, atan2 appeared
         | in Fortran IV from IBM in 1961.
         | https://en.wikipedia.org/wiki/Atan2
        
           | pklausler wrote:
           | And it's been in every ANSI & ISO Fortran standard from F'66
           | through F'2018.
        
         | Narishma wrote:
         | > Around 1988, I added phase shift to my optical thin film
         | design program written in Excel 4.0 for the Mac.
         | 
         | Excel 4 was released in 1992. In 1988 the only version of Excel
         | that would have been available on the Mac would be the very
         | first one.
        
         | floxy wrote:
         | Table 8-3, page 8-7 (pdf page 97):
         | 
         | http://www.bitsavers.org/www.computer.museum.uq.edu.au/pdf/D...
        
           | raphlinus wrote:
           | 1965: https://www.ecma-international.org/publications-and-
           | standard... pdf page 36
        
             | floxy wrote:
              | Nice find. Anyone have a way to search for old versions of
              | "math.h"? The 1989 version of ANSI C had atan2:
             | 
             | https://web.archive.org/web/20161223125339/http://flash-
             | gord...
             | 
             | ...but I wonder when it first hit the scene.
        
               | kimixa wrote:
               | According to https://unix.superglobalmegacorp.com/ it was
                | in 3BSD libm - https://unix.superglobalmegacorp.com/cgi-
               | bin/cvsweb.cgi/3BSD...
               | 
               | 3BSD was released at the end of 1979 according to
                | Wikipedia, and that's just the oldest BSD source I could
               | find, so it may have been in even older versions.
               | 
               | Before ANSI/ISO C I don't think there was really a
               | "math.h" to look for - as it may be different on
               | different systems, but as so many things derive from BSD
               | I wouldn't be surprised if it was "de-facto" standardised
               | to what that supported.
        
       | shoo wrote:
        | Related -- there's a 2011 post from Paul Mineiro with fast
       | approximations for logarithm, exponential, power, inverse root.
       | http://www.machinedlearnings.com/2011/06/fast-approximate-lo...
       | 
        | Mineiro's faster approximate log2, < 1.4% relative error for x
        | in [1/100, 10]. Here's the simple non-SSE version:
        | 
        |     static inline float
        |     fasterlog2 (float x)
        |     {
        |       union { float f; uint32_t i; } vx = { x };
        |       float y = vx.i;
        |       y *= 1.1920928955078125e-7f;
        |       return y - 126.94269504f;
        |     }
       | 
       | This fastapprox library also includes fast approximations of some
       | other functions that show up in statistical / probabilistic
        | calculations -- gamma, digamma, the Lambert W function. It is
        | BSD licensed and originally lived on Google Code; copies of the
        | library live on at GitHub, e.g. https://github.com/etheory/fastapprox
       | 
       | It's also interesting to read through libm. E.g. compare Sun's
       | ~1993 atan2 & atan:
       | 
       | https://github.com/JuliaMath/openlibm/blob/master/src/e_atan...
       | 
       | https://github.com/JuliaMath/openlibm/blob/master/src/s_atan...
        
       | nice2meetu wrote:
       | I did something similar for tanh once, though I found I could get
       | to 1 ulp.
       | 
       | Part of the motivation was that I could get 10x faster than libc.
       | However, I then tried on my FreeBSD and could only get 4x faster.
       | After a lot of head scratching and puzzling it turned out there
       | was a bug in the version of libc on my linux box that slowed
       | things down. It kind of took the wind out of the achievement, but
       | it was still a great learning experience.
        
       | azhenley wrote:
       | This is pretty similar to my quest to make my own cos() when my
       | friend didn't have access to libc. It was fun! Though I don't
       | have the math or low-level knowledge that this author does.
       | 
       | https://web.eecs.utk.edu/~azh/blog/cosine.html
        
       | pklausler wrote:
       | (Undoubtedly) stupid question: would it be any faster to project
       | (x, y) to the unit circle (x', y'), then compute acos(x') or
       | asin(y'), and then correct the result based on the signs of x &
       | y? When converting Cartesian coordinates to polar, the value of
       | r=HYPOT(x, y) is needed anyway, so the projection to the unit
       | circle would be a single division by r.
        
       | stephencanon wrote:
       | > if we're working with batches of points and willing to live
       | with tiny errors, we can produce an atan2 approximation which is
       | 50 times faster than the standard version provided by libc.
       | 
       | Which libc, though? I assume glibc, but it's frustrating when
       | people talk about libc as though there were a single
        | implementation. Each vendor supplies their own implementation;
        | libc is just a common interface defined by the C standard. There
        | is no "standard version" provided by libc.
       | 
       | In particular, glibc's math functions are not especially fast--
        | Intel's and Apple's math libraries, for example, are 4-5x faster
        | for some functions[1], and often more accurate as well (and both
        | vendors provide vectorized implementations). Even within
       | glibc versions, there have been enormous improvements over the
       | last decade or so, and for some functions there are big
       | performance differences depending on whether or not -fno-math-
       | errno is specified. (I would also note that atan2 has a lot of
       | edge cases, and more than half the work in a standards-compliant
       | libc is in getting those edge cases with zeros and infinities
       | right, which this implementation punts on. There's nothing wrong
       | with that, but that's a bigger tradeoff for most users than the
       | small loss of accuracy and important to note.)
       | 
       | So what are we actually comparing against here? Comparing against
       | a clown-shoes baseline makes for eye-popping numbers, but it's
       | not very meaningful.
       | 
       | None of this should really take away from the work presented, by
       | the way--the techniques described here are very useful for people
       | interested in this stuff.
       | 
       | [1] I don't know the current state of atan2f in glibc
       | specifically; it's possible that it's been improved since last I
       | looked at its performance. But the blog post cites "105.98 cycles
       | / element", which would be glacially slow on any semi-recent
       | hardware, which makes me think something is up here.
        
         | rostayob wrote:
         | (I'm the author)
         | 
          | You're right, I should have specified -- it is glibc 2.32-48.
          | This is the source specifying how glibc is built:
         | https://github.com/NixOS/nixpkgs/blob/97c5d0cbe76901da0135b0...
         | .
         | 
         | I've amended the article so that it says `glibc` rather than
         | `libc`, and added a sidenote specifying the version.
         | 
         | I link to it statically as indicated in the gist, although I
         | believe that shouldn't matter. Also see
         | https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47ac...
         | .
         | 
         | Note that the hardware is not particularly recent (Q3 2017),
         | but we tend to rent servers which are not exactly on the
         | bleeding edge, so that was my platform.
        
           | stephencanon wrote:
           | Thanks for clarifying!
           | 
           | Without benchmarking, I would expect atan2f to be around
           | 20-30 cycles per element or less with either Intel's or
           | Apple's scalar math library, and proportionally faster for
           | their vector libs.
        
             | rostayob wrote:
             | Thanks for the info.
             | 
             | By the way, your writing on floating point arithmetic is
             | very informative -- I even cite a message of yours on FMA
             | in the post itself!
        
         | rostayob wrote:
         | Ah, regarding edge cases: the only edge cases I do not handle
         | are infinities and the origin. In my case these are both
         | uninteresting: we deal with points from panoramic images, so
         | those both make no sense.
         | 
         | I did want to deal with 0s in either coordinates though, and to
         | preserve NaNs.
         | 
         | You're right that atan2 has a lot of edge cases, which is why
         | it made for an interesting subject. I thought it was neat that
         | the edge cases I did care about fell out fairly naturally from
         | the efficient bit-twiddling.
         | 
          | That said, the versions presented in the article are _not_
          | conforming, as I remark repeatedly. The article is more about
          | getting people curious about some topics that I think are neat
          | than a guide on how to implement those standard functions
          | correctly.
        
       ___________________________________________________________________
       (page generated 2021-08-17 23:01 UTC)