[HN Gopher] Speeding up atan2f
___________________________________________________________________
 
Speeding up atan2f
 
Author : rostayob
Score  : 196 points
Date   : 2021-08-17 12:26 UTC (10 hours ago)
 
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
 
| cogman10 wrote:
| I'm actually a bit surprised that the x86 SIMD instructions don't
| support trig functions.
 
| drfuchs wrote:
| Wouldn't CORDIC have done the trick faster? There's no mention
| that they even considered it, even though it's been around for
| half a century or so.
 
| adrian_b wrote:
| CORDIC was great for devices that were too simple to include
| hardware multipliers.
|
| CORDIC, in its basic form, produces just 1 bit of result per
| clock cycle. Only at very low precisions (i.e., with few bits per
| result) would you have a chance to overlap enough parallel atan2
| computations to achieve a throughput comparable to what you can
| do with polynomial approximations on modern CPUs, with pipelined
| multipliers.
|
| CORDIC remains useful only for ASIC/FPGA implementations, in the
| cases when area and power consumption are much more important
| than speed.
 
| mzs wrote:
| CORDIC is unlikely to be faster than this article's result: "the
| approximation produces a result every 2 clock cycles"
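For readers who haven't seen it, here is a minimal CORDIC "vectoring
mode" sketch in C -- illustrative only, not code from the thread or
the article. It shows why the method yields about one result bit per
iteration: each step conditionally rotates (x, y) by atan(2^-i) to
drive y toward zero, accumulating the angle in z.

    #include <math.h>

    /* atan2 for x > 0; a full version adds a quadrant fixup for
       x <= 0. The angle accumulation is unaffected by CORDIC's
       magnitude scaling, so no scale compensation is needed here. */
    static float cordic_atan2(float y, float x) {
        float z = 0.0f;
        for (int i = 0; i < 24; i++) {      /* ~1 bit of angle per step */
            float step = ldexpf(1.0f, -i);  /* 2^-i */
            float at = atanf(step);         /* in hardware: a small table */
            if (y > 0.0f) {
                float xn = x + y * step;
                y -= x * step;
                x = xn;
                z += at;
            } else {
                float xn = x - y * step;
                y += x * step;
                x = xn;
                z -= at;
            }
        }
        return z;
    }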
| jvz01 wrote:
| I have developed very fast, accurate, and vectorizable atan() and
| atan2() implementations, leveraging AVX/SSE capabilities. You can
| find them here [warning: self-signed SSL cert]:
|
| https://fox-toolkit.org/wordpress/?p=219
 
| ghusbands wrote:
| Your result is significantly slower than the versions presented
| in the article, though yours has more terms and so may be more
| accurate.
 
| jvz01 wrote:
| Yes, the aim was to be accurate down to 1 lsb while remaining
| significantly faster. Feel free to drop terms from the polynomial
| if you can live with less accurate results!
|
| The coefficients were generated by a package called Sollya; I've
| used it a few times to develop accurate Chebyshev approximations
| for functions.
|
| Abramowitz & Stegun is another good reference.
 
| touisteur wrote:
| Please, would you mind one of these days updating your blog post
| with the instructions you gave to Sollya? I'm trying something
| stupid with log1p and can't get Sollya to help, mostly because
| I'm not putting enough time into reading all the docs...
 
| jvz01 wrote:
| Little side note: the algorithm as given is scalar; however, it's
| branch-free and defined entirely in the header file. So compilers
| will typically be able to vectorize it, and thus achieve a
| speedup directly proportional to the vector size. I see a
| potential [but architecture-dependent] optimization using the
| Estrin scheme for evaluating the polynomial.
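For reference, the Estrin scheme jvz01 mentions trades Horner's
strictly serial dependency chain for pairs of FMAs that can issue in
parallel. A sketch for a degree-7 polynomial (the coefficients are
placeholders, not a real fit):

    #include <math.h>

    static inline float poly7_estrin(float x, const float c[8]) {
        float x2 = x * x;
        float x4 = x2 * x2;
        float a = fmaf(c[1], x, c[0]);  /* c0 + c1*x */
        float b = fmaf(c[3], x, c[2]);  /* c2 + c3*x */
        float d = fmaf(c[5], x, c[4]);  /* c4 + c5*x */
        float e = fmaf(c[7], x, c[6]);  /* c6 + c7*x */
        float f = fmaf(b, x2, a);       /* a + b*x^2 */
        float g = fmaf(e, x2, d);       /* d + e*x^2 */
        return fmaf(g, x4, f);          /* f + g*x^4 */
    }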
| aconz2 wrote:
| Nice writeup and interesting results. I hadn't seen
| perf_event_open(2) used directly in code before, which looks
| cool.
|
| The baseline is at a huge disadvantage here because the call to
| atan2 in the loop never gets inlined and the loop doesn't seem to
| get unrolled (which is surprising, actually). Manually unrolling
| by 8 gives me an 8x speedup. Maybe I'm missing something with the
| `-static` link, but unless they're using musl I didn't think -lm
| could get statically linked.
 
| xxpor wrote:
| Could it be that calls to glibc never get inlined, since inlining
| is essentially static linking by another name? Or since they're
| in separate compilation units, without LTO you'd never perform
| the analysis to figure out if it's worth it. Would LTO inspect
| across a dynamic loading boundary?
|
| Just speculating, I really have no idea. Everything I work with
| is statically linked, so I've never really had to think about it.
 
| aconz2 wrote:
| You've made a silly mistake and just did 1/8th the work, so of
| course it was an 8x speedup. I am still wondering how the numbers
| would look if you could inline atan2f from libc. Too bad that
| isn't easier.
 
| [deleted]
 
| h0mie wrote:
| Love posts like this!
 
| sorenjan wrote:
| How do you handle arrays whose lengths are not a multiple of 8 in
| this kind of vectorized code? Do you zero-pad them before handing
| them to the vectorized function, or do you run a second loop
| element by element on the remaining elements after the main one?
| What happens if you try to do `_mm256_load_ps(&ys[i])` with < 8
| elements remaining?
 
| TonyTrapp wrote:
| Typically you have the vectorized loop followed by a non-
| vectorized loop handling the remaining items, or even just
| explicit if-then-else statements for each possible number of
| remaining items.
 
| vlovich123 wrote:
| > Do you zero-pad them before handing them to the vectorized
| > function, or do you run a second loop element by element on the
| > remaining elements after the main one?
|
| Typically the latter. That's why I find SVE so interesting.
| Should improve code density.
 
| zbjornson wrote:
| You could pad the input as you said, which avoids a "tail loop,"
| but otherwise you usually do a partial load (load <8 elements
| into a vector) and store. Some instruction set extensions provide
| "masked load/store" instructions for this, but there are ways to
| do it without those too.
|
| To your last question specifically, if you
| _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll
| get a segfault. Otherwise you'll get undefined values (which can
| be okay; you could ignore them instead of dealing with a partial
| load).
 
| twic wrote:
| RISC-V has a solution to this which seems quite elegant - the
| vector length is set by the program, up to some maximum, and for
| the final block, it is simply whatever is left:
|
| https://gms.tf/riscv-vector.html
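To make the two options above concrete, here is a sketch in C of a
main 8-wide loop plus masked load/store tail handling (AVX2).
`atan2f8` is a hypothetical stand-in for the article's 8-wide
kernel, not a real function:

    #include <immintrin.h>
    #include <stddef.h>

    extern __m256 atan2f8(__m256 y, __m256 x);  /* hypothetical kernel */

    void map_atan2(float *out, const float *ys, const float *xs,
                   size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {            /* main vectorized loop */
            __m256 y = _mm256_loadu_ps(&ys[i]);
            __m256 x = _mm256_loadu_ps(&xs[i]);
            _mm256_storeu_ps(&out[i], atan2f8(y, x));
        }
        if (i < n) {                            /* 1..7 leftover elements */
            /* Lanes whose mask top bit is clear are not touched in
               memory, so we never read or write past the end. */
            __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
            __m256i mask =
                _mm256_cmpgt_epi32(_mm256_set1_epi32((int)(n - i)), lane);
            __m256 y = _mm256_maskload_ps(&ys[i], mask);
            __m256 x = _mm256_maskload_ps(&xs[i], mask);
            _mm256_maskstore_ps(&out[i], mask, atan2f8(y, x));
        }
    }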
| Const-me wrote:
| I wonder how it compares with Microsoft's implementation here:
| https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
|
| Based on the code, your version is probably much faster. It would
| still be interesting to compare precision; MS uses a 17-degree
| polynomial there.
 
| pclmulqdq wrote:
| The "small error" in this article isn't so small when people want
| exact results. Unfortunately, a high-degree polynomial is
| necessary if you want 24-bit precision across the input range for
| functions like this.
|
| That said, I wonder if a rational approximation might not be so
| bad on modern machines...
 
| azalemeth wrote:
| Indeed. I love Padé approximants -- they're very underused.
| Another really fun trick for approximating expensive functions is
| turning them into Chebyshev approximations, like Nick Trefethen's
| chebfun package. You can trivially differentiate, integrate and
| compute chebfun approximations of really horrible special
| functions (e.g. DE solutions) with fantastic accuracy. Criminally
| underused.
 
| jacobolus wrote:
| If you like Padé approximants, have you been following Trefethen
| et al.'s recent work on the "AAA" algorithm etc.?
| https://arxiv.org/pdf/1612.00337.pdf
| https://arxiv.org/pdf/1705.10132.pdf
|
| See also the lecture
| https://www.youtube.com/watch?v=S1upJPMIFfg
 
| azalemeth wrote:
| No, I haven't -- thank you very much for sharing. I've only ever
| used them, e.g. for spectral decomposition in magnetic resonance
| in a medical setting (1) with the fast Padé transform, which has
| a wonderfully deep relationship with the DEs describing magnetic
| resonance (2).
|
| Thanks also for the lecture - I'll enjoy it. Nick Trefethen is
| such a good speaker. He taught me in a graduate course in
| numerical methods -- honestly, I'd watch those lectures on a loop
| again.
|
| (1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5179227/
|
| (2) https://iopscience.iop.org/article/10.1088/0031-9155/51/5/00...
 
| Const-me wrote:
| > I wonder if a rational approximation might not be so bad on
| > modern machines
|
| Uops.info says the throughput of 32-byte FP32 vector division is
| 5 cycles on Intel, 3.5 cycles on AMD.
|
| The throughput of 32-byte vector multiplication is 0.5 cycles.
| The same goes for FMA, which is a single instruction computing
| a*b+c for 3 float vectors.
|
| If the approximation formula has only 1 division, where both
| numerator and denominator are polynomials and the degree of both
| polynomials is much lower than 17, the rational version might not
| be too bad compared to the 17-degree polynomial version.
 
| pclmulqdq wrote:
| A Padé approximation is generally like that. Especially when you
| are dealing with singularities, it should be a much smaller
| polynomial than 17 terms in the numerator and denominator. I
| think I'm going to have to start a blog, and this will be on the
| list.
 
| janwas wrote:
| Indeed, degree-4 rational polynomials are generally enough for
| JPEG XL. Code for evaluating them using Horner's scheme (just
| FMAs):
| https://github.com/libjxl/libjxl/blob/bdde644b94c125a15e532b...
|
| I initially used Chebfun to approximate them, then a teammate
| implemented the same Caratheodory-Fejer method in C. We
| subsequently convert from Chebyshev basis to monomials.
|
| Blog sounds good :)
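The cost structure under discussion looks roughly like this sketch
(the coefficients are placeholders, not a real atan fit): two short
Horner chains of FMAs, then a single division.

    #include <math.h>

    /* Odd rational approximation r(x) = x * P(x^2) / Q(x^2). */
    static inline float rational_eval(float x, const float p[4],
                                      const float q[4]) {
        float x2 = x * x;
        float num = fmaf(fmaf(fmaf(p[3], x2, p[2]), x2, p[1]), x2, p[0]);
        float den = fmaf(fmaf(fmaf(q[3], x2, q[2]), x2, q[1]), x2, q[0]);
        return x * (num / den);  /* one division per result */
    }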
| petschge wrote:
| 24 bits of precision would be possible with the Hastings
| approximation with 15 terms instead of 6 terms. A naive estimate
| is that this requires 2.5 times as much work, so it could run at
| 6 GB/s and take a little less than 5 cycles per element.
 
| Someone wrote:
| I think that remains to be seen. Exact calculations will give you
| 24 bits, but in computers, floating point calculations typically
| aren't exact.
|
| If, as this article does, you compute all your intermediate
| values as 32-bit IEEE floats, it would hugely surprise me if you
| still got 24 bits of precision after 1 division (to compute y/x
| or x/y), 15 multiplications and 14 additions (to evaluate the
| polynomial) and, sometimes, a subtraction from π/2.
|
| (That's a weakness in this article. They point out how fast their
| code is, but do not show any test that it actually computes
| approximations to atan2, let alone how accurate their results
| are.)
|
| Getting all the bits right in floating point is expensive.
 
| adgjlsfhk1 wrote:
| The good news is that the high-order terms of the polynomial
| don't notably affect the error. With FMA, Horner's method
| converges to 0.5 ULP accuracy. The subtraction from pi/2 can also
| be done without adding inaccuracy by using compensated summation.
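One standard way to do that compensated subtraction (a sketch, not
code from the thread): store pi/2 as a head+tail pair of floats, so
the rounding error of the constant can be folded back in after the
cancelling subtraction.

    /* PIO2_HI is the float nearest pi/2; PIO2_LO is (pi/2 - PIO2_HI)
       rounded to float. Together they carry ~48 bits of pi/2. */
    #define PIO2_HI 1.57079637050628662109375f
    #define PIO2_LO -4.3711390e-8f

    static inline float pio2_minus(float t) {
        return (PIO2_HI - t) + PIO2_LO;  /* more accurate than pi/2 - t */
    }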
| gumby wrote:
| Because of the disregard for the literature common in CS, I loved
| this part:
|
| > This is achieved through ... _and some cool documents from the
| > 50s_.
|
| A bit of anecdote: back when I was a research scientist (corporate
| lab) 30+ years ago, I would in fact go downstairs to the library
| and read -- I was still a kid with a lot to learn (and still am).
| When I came across (by chance or by someone's suggestion)
| something useful to my work, I'd photocopy the article and try it
| out. I'd put a comment in the code with the reference.
|
| My colleagues in my group would give me (undeserved) credit for
| my supposed brilliance even though I said in the code where the
| idea came from and would determinedly point to the very paper on
| my desk. This attitude seemed bizarre, as the group itself was
| producing conference papers and even books.
|
| (Obviously this was not a universal phenomenon, as there were
| other people in the lab overall, and friends on the net,
| suggesting papers to read. But I've seen it a lot, from back then
| up to today.)
 
| lmilcin wrote:
| It is very unfortunate that in the world of ubiquitous
| information, the ability to just look up a reasonable solution to
| a well-known and well-studied problem that was solved long ago is
| practically a superpower.
|
| As a developer, I am regularly in situations where I solve some
| large, long-standing problem with significant impact for the
| company, mostly with a couple of Google searches.
|
| Once I took a batch task that took 11 hours to complete and
| reduced it to about 5s of work by looking up a paper on
| bitemporal indexes. The reactions ranged from "You are awesome!"
| to "You read a... paper????" to "No, we can't have this, this is
| super advanced code and it surely must have bugs. Can't you find
| an open source library that implements this?"
 
| jacobolus wrote:
| One of the most valuable (but often under-appreciated) kinds of
| work people can do is to take hard-won knowledge that exists
| somewhere obscure and make it more accessible, e.g. by writing
| expository survey papers, adding material to Wikipedia, writing
| documentation, ...
 
| zokier wrote:
| > It is very unfortunate that in the world of ubiquitous
| > information, ability to just look up a reasonable solution to a
| > well known and well studied problem that has already been
| > solved long ago is practically a superpower
|
| It's exactly because information is so ubiquitous and plentiful
| that finding a reasonable solution is like looking for a needle
| in a haystack.
 
| lmilcin wrote:
| In most cases in my experience, people did not actually face any
| challenge grepping the internet for the solution.
|
| More likely they just went and rolled their own solution for any
| number of other reasons:
|
| -- For the fun of the intellectual challenge (I am guilty of
| this),
|
| -- For having the hubris to think they are the first to meet the
| challenge (I have been guilty of this in the past; I no longer
| make that mistake),
|
| -- For having the hubris to think they have the uniquely best
| solution to the problem, so that it's not even worth verifying,
|
| -- For feeling superior to everybody else, and hence seeing no
| real reason to dignify other, lesser people by being interested
| in their work,
|
| -- For feeling there is just too much pressure and no time to do
| it right, or even to take a peek at what is right and how much
| time it would take,
|
| -- For not having enough experience to suspect there might be a
| solution to the problem,
|
| -- Books? Only old people read books,
|
| -- Papers? They are for research scientists. I didn't finish CS
| to read papers at work.
|
| and so on.
|
| There have really been very few concrete problems that I could
| not find some solution to on the Internet.
|
| It is more difficult to find solutions to less concrete problems,
| but even with these kinds of problems, if you have a habit of
| regularly seeking knowledge, reading various sources, etc., you
| will probably amass enough experience and knowledge to deal with
| most problems, even complex ones, on your own.
 
| djmips wrote:
| I regularly hear coders say they cannot understand other people's
| code, so they _had_ to write it themselves. Most people can't
| read papers either, it seems.
 
| lmilcin wrote:
| Well, that's another thing.
|
| People have lost the ability to read anything longer than a tweet
| with focus and comprehension. Especially new developers.
|
| I know this because whatever I put in the third and further
| paragraphs of my tickets is not being read or understood.
|
| I sometimes quiz guys at review meetings, and I even sometimes
| put the most important information somewhere in the middle of the
| ticket. Always with the same result -- they are surprised.
|
| Now, to read a paper is to try to imagine being the person who
| wrote it, and to appreciate not just the written text but also a
| lot of things that are not necessarily spelled out.
|
| The same with code -- when I read somebody's code I want to
| recognize their style, how they approach various problems. Once
| you know a little bit about their vocabulary you can really speed
| up reading the rest, because you can anticipate what is going to
| happen next.
 
| nuclearnice1 wrote:
| People may have a preference for certainty.
|
| If they work on the problem, they will make progress and likely
| solve it.
|
| If they search the literature, they will spend time on research
| and quite possibly lose days before emerging with nothing to show
| for it.
 
| [deleted]
 
| refset wrote:
| > bitemporal indexes [...] Can't you find an open source library
| > that implements this?
|
| Undoubtedly too late for you, but
| https://github.com/juxt/crux/blob/master/crux-core/src/crux/...
| :)
 
| kbenson wrote:
| > 30+ years ago ... I was still a kid with a lot to learn (and
| > still am).
|
| We should all strive to stay young at heart like this. ;)
 
| dleslie wrote:
| I find that disregard for education to be tasteless, but perhaps
| warranted due to the modern prevalence of "bootcamps" and similar
| fast-but-shallow programmes.
|
| Personally, I love when I get a chance to apply something from a
| published paper; I will leave a citation as a comment in the
| code, along with a brief explanation of how it works and why it
| was chosen. I have no regrets for having achieved a computing
| degree, so many years ago.
 
| touisteur wrote:
| But: knowing where to look, being able to suss out details from
| papers, having the almost adequate level of conversation to try
| and contact authors (if still alive...), the ability to put
| undecipherable formulas into code, and bearing Stack Overflow's
| requirements for a 'good' question -- that's not a set of skills
| that amounts to nothing. It's rare nowadays that I see any copy
| of a reference textbook on numerical recipes, although we do tons
| of signal processing and HPC stuff.
|
| Recently I rediscovered the ability to compute logs with 64-bit
| integers at better precision than double precision. Man, jumping
| back into those depths after so many years... Not as easy as
| reading other people's code.
 
| teddyh wrote:
| I've seen an odd use of language, mostly from U.S. Americans,
| where "intelligent" = "smart" = "_knowledgeable_". And
| conversely, assuming that knowing many things is what makes you
| smart. I suspect this is what happened there; they thought you
| "brilliant" because you knew many things -- in their minds, there
| might not be a difference between the two concepts.
 
| whatshisface wrote:
| Not brilliant? Use of the written word to pass ideas down between
| generations is one of the most brilliant ideas in human history,
| and since nobody around you was doing it, the only explanation is
| that you independently invented the ethos of the scholar
| yourself.
 
| gumby wrote:
| Heh heh, Ockham's razor provides an explanation!
 
| specialist wrote:
| Nice.
|
| I truly miss foraging the stacks. Such delight.
|
| Browsing scholar.google.com, resulting in 100s of tabs open, just
| isn't the same.
 
| jjgreen wrote:
| I always found that the smell of old books filled me with the
| urge to defecate, to the extent that I had to deliberately go for
| a crap before diving into the stacks. Something of a disadvantage
| as a pre-internet researcher. I thought I was the only person in
| the world with this condition, but have subsequently found a
| couple of other similarly afflicted souls.
 
| _moof wrote:
| No one likes to do lit review, which is a shame. I get it -- it's
| way less fun to find out someone already bested your cool idea
| than it is to write code. But it's definitely more productive to
| read the literature!
 
| prionassembly wrote:
| I wonder whether Padé approximants are well known by this kind of
| researcher. E.g.
| http://www-labs.iro.umontreal.ca/~mignotte/IFT2425/Documents...
 
| nimish wrote:
| Numerical analysts are definitely familiar. Library authors may
| not be, unless they took a course in function approximation.
|
| Personally I prefer Chebyshev approximants, like chebfun, but
| Padé approximants are really good at the cost of a division.
 
| stephencanon wrote:
| A math library implementor will generally be familiar with at
| least minimax and Chebyshev approximations, and generally Padé
| and Caratheodory-Fejer approximations as well. (Source: I
| implemented math libraries for a decade. We used all of these
| frequently.)
 
| zokier wrote:
| I would have liked to see the error analysis section expanded a
| bit, or some sort of test to validate the max error. In
| particular, if the mathematical approximation function arctan*
| has a max error of 1/10000 degrees, then I'd naively expect the
| float-based implementation to have worse error. Furthermore, it's
| not obvious whether additional error could be introduced by the
| division
|
|     float atan_input = (swap ? x : y) / (swap ? y : x);
 
| spaetzleesser wrote:
| I am envious of people who can deal with such problems. The
| problem is defined clearly and can be measured easily.
|
| This is so much more fun than figuring out why some SaaS service
| is misbehaving.
 
| jacobolus wrote:
| If someone wants a fast version of x ↦ tan(πx/2), let me
| recommend the approximation:
|
|     tanpi_2 = function tanpi_2(x) {
|       var y = (1 - x*x);
|       return x * (((-0.000221184 * y + 0.0024971104) * y
|                    - 0.02301937096) * y
|                   + 0.3182994604 + 1.2732402998 / y);
|     }
|
| (valid for -1 <= x <= 1)
|
| https://observablehq.com/@jrus/fasttan with error:
| https://www.desmos.com/calculator/hmncdd6fuj
|
| But even better is to avoid trigonometry and angle measures as
| much as possible. Almost everything can be done better (faster,
| with fewer numerical problems) with vector methods; if you want a
| 1-float representation of an angle, use the stereographic
| projection:
|
|     stereo = (x, y) => y/(x + Math.hypot(x, y));
|     stereo_to_xy = (s) => {
|       var q = 1/(1 + s*s);
|       return !q ? [-1, 0] : [(1 - s*s)*q, 2*s*q];
|     }
 
| leowbattle wrote:
| > Almost everything can be done better ... use the stereographic
| > projection
|
| I was just learning about stereographic projection earlier. Isn't
| it odd how, once you know about something, you notice it
| appearing everywhere?
|
| Can you give an example of an operation that could be performed
| better using stereographic projection rather than angles?
 
| jacobolus wrote:
| Generally you can just stick to storing 2-coordinate vectors and
| using vector operations.
|
| The places where you might want to convert to a 1-number
| representation are when you have a lot of numbers you want to
| store or transmit somewhere. Using the stereographic projection
| (half-angle tangent) instead of angle measure works better with
| the floating point number format and uses only rational
| arithmetic instead of transcendental functions.
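A concrete example of "only rational arithmetic": with s =
tan(theta/2) as the 1-float representation, adding two angles needs
no trig at all. A sketch in C (not from the thread; the mapping
function mirrors the JS above):

    #include <math.h>

    /* (x, y) -> s = tan(theta/2) */
    static inline float stereo(float x, float y) {
        return y / (x + hypotf(x, y));
    }

    /* tan((a+b)/2) from tan(a/2) and tan(b/2): the tangent addition
       formula. s1*s2 == 1 (division by zero) corresponds to a sum
       of exactly pi, i.e. the projection point itself. */
    static inline float add_angles(float s1, float s2) {
        return (s1 + s2) / (1.0f - s1 * s2);
    }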
| hdersch wrote:
| A comparison with one of the many SIMD math libraries would have
| been fairer than with plain libm. A long time ago I wrote such a
| dual-platform library for the PS3 (Cell processor) and x86
| architecture (outdated, but still available here [1]). Depending
| on how standard libm implements atan2f, a speedup of 3x to 15x is
| achieved, without sacrificing accuracy.
|
| 1. https://webuser.hs-furtwangen.de/~dersch/libsimdmath.pdf
 
| drej wrote:
| Nice. Reminds me of an optimisation trick from a while ago: I
| remember being bottlenecked by one of these trigonometric
| functions years ago when working with a probabilistic data
| structure... then I figured the input domain was pretty small (a
| couple dozen values), so I precomputed those and used an array
| lookup instead. A huge win in terms of perf, obviously only
| applicable in these extreme cases.
 
| ThePadawan wrote:
| One of the things I recently learned that sounded the most "that
| can't possibly work well enough" is an optimization for sin(x):
|
| If abs(x) < 0.1, "sin(x)" is approximated really well by "x".
|
| That's it. For small x, just return x.
|
| (Obviously, there is some error involved, but for the speedup
| gained, it's a very good compromise.)
 
| sharikone wrote:
| I think you will find that for subnormal numbers any math library
| will use the identity function for sin(x) and 1 for cos(x).
 
| ThePadawan wrote:
| Right, but the largest subnormal number in single-precision
| floats is ~ 10^-38.
|
| That the sin(x) approximation still works well for 10^-1 (with an
| error of ~0.01%) is the really cool thing!
 
| chriswarbo wrote:
| This is a very common assumption in physics, e.g.
| https://en.wikipedia.org/wiki/Pendulum_(mathematics)#Small-a...
|
| Whether it's appropriate in a numerical calculation obviously
| depends on the possible inputs and the acceptable error bars :)
 
| tzs wrote:
| To put some numbers on it, using N terms of the Taylor series for
| sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot
| exceed [2]:
|
|     N   Error Limit
|     1   0.167%          (1/6 %)
|     2   8.35 x 10^-5%   (1/11980 %)
|     3   1.99 x 10^-8%   (1/50316042 %)
|     4   2.76 x 10^-12%  (1/362275502328 %)
|
| Even for |x| as large as 1, the sin(x) = x approximation is
| within 20%, which is not too bad when you are just doing a rough
| ballpark calculation, and a two-term approximation gets you under
| 1%. Three terms is under 0.024% (1/43 %).
|
| For |x| up to Pi/2 (90 deg) the one-term approximation falls
| apart. The guarantee from the Taylor series is within 65% (in
| reality it does better, but is still off by 36%). Two terms is
| guaranteed to be within 8%, three within 0.5%, and four within
| 0.02%.
|
| Here's a quick and dirty little Python program to compute a table
| of error bounds for a given x [3].
|
| [1] x - x^3/3! + x^5/5! - x^7/7! + ...
|
| [2] In reality the error will usually be quite a bit below this
| upper limit. The Taylor series for a given x is a convergent
| alternating series. That is, it is of the form A0 - A1 + A2 - A3
| + ... where all the A's have the same sign, |Ak| is a decreasing
| sequence past some particular k, and |Ak| has a limit of 0 as k
| goes to infinity. Such a series converges to some value, and if
| you approximate that by just taking the first N terms, the error
| cannot exceed the first omitted term, as long as N is large
| enough to take you to the point where the sequence from there on
| is decreasing. This is the upper bound I'm using above.
|
| [3] https://pastebin.com/thN8B7Gf
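The bound in footnote [2] is easy to check: for an N-term sin(x)
approximation, it is just the first omitted term,
|x|^(2N+1)/(2N+1)!. A small C version of the same computation as the
linked Python script:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 0.1;
        double term = x;  /* Taylor term x^(2k+1)/(2k+1)!, k = 0 */
        for (int n = 1; n <= 4; n++) {
            /* After this step, term is the first omitted term for
               an n-term approximation. */
            term *= x * x / ((2 * n) * (2 * n + 1));
            printf("N=%d  error bound = %.3g%%\n",
                   n, 100.0 * term / sin(x));
        }
        return 0;
    }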
| quietbritishjim wrote:
| That is precisely the technique discussed in the article: it's
| the first term of the Taylor expansion. Except that the article
| used more terms of the expansion, and also used very slightly
| "wrong" coefficients to improve the overall accuracy within the
| small region.
 
| Bostonian wrote:
| Why wouldn't you at least include the x^3 term in the Taylor
| series for abs(x) < 0.1?
 
| ThePadawan wrote:
| That's what I assumed would have been a reasonable optimization!
|
| What I really found amazing was that rather than reducing the
| number of multiplications to 2 (to calculate x^3), you can reduce
| it to 0, and it would still do _surprisingly_ well.
 
| jvz01 wrote:
| Another good hint is the classical half-angle formulas. You can
| often avoid calling sin() and cos() altogether!
 
| MauranKilom wrote:
| Some trivia, partly stolen from Bruce Dawson [0]:
|
| The sin(x) = x approximation is actually exact (in terms of
| doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to
| cover 48.6% of all doubles.
|
| Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
|
| Finally, sin(double(pi)) tells you the error in double(pi) - that
| is, how far the double representation of pi is away from the
| "real", mathematical pi [3].
|
| [0]: https://randomascii.wordpress.com/2014/10/09/intel-underesti...
|
| [1]: https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...
|
| [2]: https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...
|
| [3]: https://randomascii.wordpress.com/2012/02/25/comparing-float...
 
| RicoElectrico wrote:
| Come on, at this point I've seen this "engineering approximation"
| memed so many times, even on Instagram ;)
|
| What is more interesting to me is that this can be one of the
| rationales behind using radians.
|
| And tan(x) ~ x also holds for small angles, greatly easing
| estimation of distance to objects of known size (cf. mil-dot
| reticle).
 
| lapetitejort wrote:
| tan(x) ~ x because cos(x) ~ 1. It's approximations all the way
| down.
 
| mooman219 wrote:
| This would be a decent lookup for the atan2 function:
|
| https://gist.github.com/mooman219/19b18ff07bb9d609a103ef0cd0...
 
| tantalor wrote:
| https://en.wikipedia.org/wiki/Memoization
 
| bluedino wrote:
| _Technically_ it's a _lookup table_ if you pre-compute them.
| Memoization would just be caching them as you do them.
 
| tantalor wrote:
| Not necessarily; you could "cache" them in a compilation step and
| then use the table at runtime.
 
| bruce343434 wrote:
| Tangential at best, but why was the 'r' dropped from that term?
| Or why not call it caching? Why the weird "memo-ization"? It
| makes me think of a mass extinction event where everything is
| turned into a memo.
 
| franciscop wrote:
| It's explained right in the linked Wikipedia page:
|
| > The term "memoization" was coined by Donald Michie in 1968 [3]
| > and is derived from the Latin word "memorandum" ("to be
| > remembered"), usually truncated as "memo" in American English,
| > and thus carries the meaning of "turning [the results of] a
| > function into something to be remembered". While "memoization"
| > might be confused with "memorization" (because they are
| > etymological cognates), "memoization" has a specialized meaning
| > in computing.
 
| wongarsu wrote:
| The term memoization likely precedes the word caching (as related
| to computing; obviously weapon caches are far older). Memoization
| was coined in 1968. CPU caches only came about in the 80s as
| registers became significantly faster than main memory.
|
| As Wikipedia outlines, the 'r' was dropped because of the memo.
| It's derived from the Latin word memorandum, which does contain
| the 'r', just like memory, but apparently it was meant more as an
| analogy to written memos.
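For flavor, a minimal precomputed-table version of the idea -- an
illustration, not the code from mooman219's gist: build the table
once, then each call is an index operation.

    #include <math.h>

    #define TABLE_SIZE 1024
    static float atan_table[TABLE_SIZE + 1];

    /* Tabulate atan over [0, 1] (the reduced range used by atan2
       implementations). */
    void atan_table_init(void) {
        for (int i = 0; i <= TABLE_SIZE; i++)
            atan_table[i] = atanf((float)i / TABLE_SIZE);
    }

    /* Valid for 0 <= t <= 1. Worst-case error without interpolation
       is about 1/(2*TABLE_SIZE) radians, since atan's slope is <= 1. */
    static inline float atan_lookup(float t) {
        return atan_table[(int)(t * TABLE_SIZE + 0.5f)];
    }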
| unemphysbro wrote:
| Coolest blog post I've seen here in a while. :)
 
| aj7 wrote:
| Around 1988, I added phase shift to my optical thin film design
| program, written in Excel 4.0 for the Mac. At the time, this was
| utterly unique: each spreadsheet row represented a layer, and the
| matrices describing each layer could be calculated right in that
| row by squashing them down horizontally. The S- and
| P-polarization matrices could be recorded this way, and the
| running matrix products similarly maintained. Finally, using a
| simple one-input table, the reflectance of a typically 25-31
| layer laser mirror could be calculated -- in less than a second,
| on a 20 MHz 68020 (?) Mac II, for about 50 wavelengths. The best
| part was the graphics, which were instantaneous, beautiful,
| publishable, and pasteable into customer quotations. Semi-
| technical people could be trained to use the whole thing.
|
| Now about the phase shift. In 1988, atan2 didn't exist. Anywhere.
| Not in FORTRAN, Basic, Excel, or a C library. I'm sure phase
| shift calculators implemented it, each working alone. For us, it
| was critical. You see, we actually cared not about the phase
| shift, but its second derivative, the group delay dispersion.
| This was the beginning of the femtosecond laser era, and people
| needed to check whether these broadband laser pulses would be
| inadvertently stretched by reflection off or transmission through
| the mirror coating. So atan2, the QUADRANT-PRESERVING arc
| tangent, is required for a continuous, differentiable phase
| function. An Excel function macro did this, with IF statements
| correcting the quadrant. And the irony of all this?
|
| I CALLED it atan2.
 
| dzdt wrote:
| Fortran did have it much earlier. Per Wikipedia, atan2 appeared
| in Fortran IV from IBM in 1961.
| https://en.wikipedia.org/wiki/Atan2
 
| pklausler wrote:
| And it's been in every ANSI & ISO Fortran standard from F'66
| through F'2018.
 
| Narishma wrote:
| > Around 1988, I added phase shift to my optical thin film design
| > program written in Excel 4.0 for the Mac.
|
| Excel 4 was released in 1992. In 1988 the only version of Excel
| that would have been available on the Mac would be the very first
| one.
 
| floxy wrote:
| Table 8-3, page 8-7 (pdf page 97):
|
| http://www.bitsavers.org/www.computer.museum.uq.edu.au/pdf/D...
 
| raphlinus wrote:
| 1965:
| https://www.ecma-international.org/publications-and-standard...
| pdf page 36
 
| floxy wrote:
| Nice find. Anyone have a way to search for old versions of
| "math.h"? The 1989 version of ANSI C had atan2:
|
| https://web.archive.org/web/20161223125339/http://flash-gord...
|
| ...but I wonder when it first hit the scene.
 
| kimixa wrote:
| According to https://unix.superglobalmegacorp.com/ it was in 3BSD
| libm:
| https://unix.superglobalmegacorp.com/cgi-bin/cvsweb.cgi/3BSD...
|
| 3BSD was released at the end of 1979 according to Wikipedia, and
| that's just the oldest BSD source I could find, so it may have
| been in even older versions.
|
| Before ANSI/ISO C I don't think there was really a "math.h" to
| look for, as it may have been different on different systems, but
| as so many things derive from BSD I wouldn't be surprised if it
| was "de facto" standardised to what that supported.
 
| shoo wrote:
| Related -- there's a 2011 post from Paul Mineiro with fast
| approximations for logarithm, exponential, power, and inverse
| root. http://www.machinedlearnings.com/2011/06/fast-approximate-lo...
|
| Mineiro's faster approximate log2 has < 1.4% relative error for x
| in [1/100, 10]. Here's the simple non-SSE version:
|
|     static inline float fasterlog2 (float x) {
|       union { float f; uint32_t i; } vx = { x };
|       float y = vx.i;
|       y *= 1.1920928955078125e-7f;
|       return y - 126.94269504f;
|     }
|
| The fastapprox library also includes fast approximations of some
| other functions that show up in statistical / probabilistic
| calculations -- gamma, digamma, the Lambert W function. It is BSD
| licensed and originally lived on Google Code; copies live on in
| repositories on GitHub, e.g. https://github.com/etheory/fastapprox
|
| It's also interesting to read through libm. E.g. compare Sun's
| ~1993 atan2 and atan:
|
| https://github.com/JuliaMath/openlibm/blob/master/src/e_atan...
|
| https://github.com/JuliaMath/openlibm/blob/master/src/s_atan...
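A quick way to see what the two magic constants do (a sketch, not
from the post): reinterpreting the float's bits as an integer gives
exponent*2^23 plus the mantissa bits; scaling by 2^-23 and
subtracting the (slightly tuned) exponent bias yields an approximate
log2.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static inline float fasterlog2(float x) {
        union { float f; uint32_t i; } vx = { x };
        float y = (float)vx.i;
        y *= 1.1920928955078125e-7f;  /* 2^-23 */
        return y - 126.94269504f;     /* bias, tuned to reduce error */
    }

    int main(void) {
        for (float x = 0.01f; x <= 10.0f; x *= 3.0f)
            printf("x=%8.4f  approx=%9.5f  log2f=%9.5f\n",
                   x, fasterlog2(x), log2f(x));
        return 0;
    }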
| nice2meetu wrote:
| I did something similar for tanh once, though I found I could get
| to 1 ulp.
|
| Part of the motivation was that I could get 10x faster than libc.
| However, I then tried on my FreeBSD and could only get 4x faster.
| After a lot of head scratching and puzzling it turned out there
| was a bug in the version of libc on my linux box that slowed
| things down. It kind of took the wind out of the achievement, but
| it was still a great learning experience.
 
| azhenley wrote:
| This is pretty similar to my quest to make my own cos() when my
| friend didn't have access to libc. It was fun! Though I don't
| have the math or low-level knowledge that this author does.
|
| https://web.eecs.utk.edu/~azh/blog/cosine.html
 
| pklausler wrote:
| (Undoubtedly) stupid question: would it be any faster to project
| (x, y) to the unit circle (x', y'), then compute acos(x') or
| asin(y'), and then correct the result based on the signs of x &
| y? When converting Cartesian coordinates to polar, the value of
| r = HYPOT(x, y) is needed anyway, so the projection to the unit
| circle would be a single division by r.
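A sketch of what pklausler describes, using libm calls just to show
the identity (a fast version would still need its own acos
approximation, and the division by r can itself introduce error):

    #include <math.h>

    static inline float atan2_via_acos(float y, float x) {
        float r = hypotf(x, y);            /* needed anyway for polar */
        return copysignf(acosf(x / r), y); /* sign of y fixes quadrant */
    }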
| stephencanon wrote:
| > if we're working with batches of points and willing to live
| > with tiny errors, we can produce an atan2 approximation which
| > is 50 times faster than the standard version provided by libc.
|
| Which libc, though? I assume glibc, but it's frustrating when
| people talk about libc as though there were a single
| implementation. Each vendor supplies their own implementation;
| libc is just a common interface defined by the C standard. There
| is no "standard version" provided by libc.
|
| In particular, glibc's math functions are not especially fast --
| Intel's and Apple's math libraries are 4-5x faster for some
| functions [1], and often more accurate as well, for example (and
| both vendors provide vectorized implementations). Even within
| glibc versions, there have been enormous improvements over the
| last decade or so, and for some functions there are big
| performance differences depending on whether or not
| -fno-math-errno is specified. (I would also note that atan2 has a
| lot of edge cases, and more than half the work in a standards-
| compliant libc is in getting those edge cases with zeros and
| infinities right, which this implementation punts on. There's
| nothing wrong with that, but it's a bigger tradeoff for most
| users than the small loss of accuracy, and important to note.)
|
| So what are we actually comparing against here? Comparing against
| a clown-shoes baseline makes for eye-popping numbers, but it's
| not very meaningful.
|
| None of this should really take away from the work presented, by
| the way -- the techniques described here are very useful for
| people interested in this stuff.
|
| [1] I don't know the current state of atan2f in glibc
| specifically; it's possible that it's been improved since I last
| looked at its performance. But the blog post cites "105.98 cycles
| / element", which would be glacially slow on any semi-recent
| hardware, which makes me think something is up here.
 
| rostayob wrote:
| (I'm the author)
|
| You're right, I should have specified -- it is glibc 2.32-48.
| This is the source specifying how glibc is built:
| https://github.com/NixOS/nixpkgs/blob/97c5d0cbe76901da0135b0...
|
| I've amended the article so that it says `glibc` rather than
| `libc`, and added a sidenote specifying the version.
|
| I link to it statically as indicated in the gist, although I
| believe that shouldn't matter. Also see
| https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47ac...
|
| Note that the hardware is not particularly recent (Q3 2017), but
| we tend to rent servers which are not exactly on the bleeding
| edge, so that was my platform.
| stephencanon wrote:
| Thanks for clarifying!
|
| Without benchmarking, I would expect atan2f to be around 20-30
| cycles per element or less with either Intel's or Apple's scalar
| math library, and proportionally faster with their vector libs.
 
| rostayob wrote:
| Thanks for the info.
|
| By the way, your writing on floating point arithmetic is very
| informative -- I even cite a message of yours on FMA in the post
| itself!
 
| rostayob wrote:
| Ah, regarding edge cases: the only edge cases I do not handle are
| infinities and the origin. In my case these are both
| uninteresting: we deal with points from panoramic images, so they
| both make no sense.
|
| I did want to deal with 0s in either coordinate though, and to
| preserve NaNs.
|
| You're right that atan2 has a lot of edge cases, which is why it
| made for an interesting subject. I thought it was neat that the
| edge cases I did care about fell out fairly naturally from the
| efficient bit-twiddling.
|
| That said, the versions presented in the article are _not_
| conforming, as I remark repeatedly. The article is more about
| getting people curious about some topics that I think are neat
| than a guide on how to implement those standard functions
| correctly.
___________________________________________________________________
(page generated 2021-08-17 23:01 UTC)