[HN Gopher] Examples of floating point problems
       ___________________________________________________________________
        
       Examples of floating point problems
        
       Author : grappler
       Score  : 161 points
       Date   : 2023-01-13 14:59 UTC (8 hours ago)
        
 (HTM) web link (jvns.ca)
 (TXT) w3m dump (jvns.ca)
        
       | weakfortress wrote:
       | Used to run into these problems all the time when I was doing
       | work in numerical analysis.
       | 
        | The PATRIOT missile error (it wasn't a _disaster_) was more due
        | to the handling of timestamps than just floating point deviation.
        | There were several concurrent failures that allowed the SCUD to
        | hit its target. IIRC the clock drift was significant and was
        | magnified by being converted to a floating point and,
        | importantly, _truncated_ into a 24-bit register. Moreover, they
        | weren't "slightly off". The clock drift alone put the missile
        | considerably off target.
       | 
        | While I don't claim that floating points didn't have a hand in
        | this error, it's likely that correct handling of timestamps would
        | not have introduced the problem in the first place. Unlike the
        | other examples given, this one is a better example of knowing your
       | system and problem domain rather than simply forgetting to
       | calculate a delta or being unaware of the limitations of IEEE
       | 754. "Good enough for government work" struck again here.
        
         | [deleted]
        
       | ape4 wrote:
       | All numbers in JavaScript are floats, unless you make an array
       | with Int8Array(). https://developer.mozilla.org/en-
       | US/docs/Web/JavaScript/Refe...
       | 
       | I wonder if people sometimes make a one element integer array
        | this way so they can have an integer to work with.
        
       | mochomocha wrote:
       | Regarding denormal/subnormal numbers mentioned as "weird": the
       | main issue with them is that their hardware implementation is
       | awfully slow, to the point of being unusable for most computation
        | cases with even moderate FLOPs.
        
       | dunham wrote:
       | I had one issue where pdftotext would produce different output on
       | different machines (Linux vs Mac). It broke some of our tests.
       | 
       | I tracked down where it was happening (involving an ==), but it
       | magically stopped when I added print statements or looked at it
       | in the debugger.
       | 
       | It turns out the x86 was running the math at a higher precision
       | and truncating when it moved values out of registers - as soon as
       | it hit memory, things were equal. MacOS was defaulting to
       | -ffloat-store to get consistency (their UI library is float
       | based).
       | 
       | There were too many instances of == in that code base (which IMO
       | is a bad idea with floats), so I just added -ffloat-store to the
       | Linux build and called it a day.
        
         | alkonaut wrote:
          | x86 (x87) FP is notoriously inconsistent because of the 80-bit
          | extended precision that may or may not be used. In a JITed
          | language like Java/C# it's even less fun, as it can
          | theoretically be inconsistent even for the same compiled
          | program on different machines.
         | 
         | Thankfully the solution to that problem came when x86 (32 bit)
         | mostly disappeared.
        
       | WalterBright wrote:
       | > NaN/infinity values can propagate and cause chaos
       | 
       | NaN is the most misunderstood feature of IEEE floating point.
       | Most people react to a NaN like they'd react to the dentist
       | telling them they need a root canal. But NaN is actually a very
       | valuable and useful tool!
       | 
       | NaN is just a value that represents an invalid floating point
       | value. The result of any operation on a NaN is a NaN. This means
       | that NaNs propagate from the source of the original NaN to the
       | final printed result.
       | 
       | "This sounds terrible" you might think.
       | 
       | But let's study it a bit. Suppose you are searching an array for
       | a value, and the value is not in the array. What do you return
       | for an index into the array? People often use -1 as the "not
       | found" value. But then what happens when the -1 value is not
       | noticed? It winds up corrupting further attempts to use it. The
       | problem is that integers do not have a NaN value to use for this.
       | 
       | What's the result of sqrt(-1.0)? It's not a number, so it's a
        | NaN. If a NaN appears in your results, you know you've got a
        | mistake in your algorithm or initial values. Yes, I know, it can
       | be clumsy to trace it back to its source, but I submit it is
       | _better_ than having a bad result go unrecognized.
       | 
       | NaN has value beyond that. Suppose you have an array of sensors.
        | One of those sensors goes bad (like they always do). What value
        | do you use for the bad sensor? NaN. Then, when the data is
       | crunched, if the result is NaN, you know that your result comes
       | from bad data. Compare with setting the bad input to 0.0. You
       | never know how that affects your results.
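        | 
        | A minimal Python sketch of that sensor case (plain doubles, not
        | D; the numbers are made up):
        | 
        |     import math
        | 
        |     readings = [1.5, 2.0, float('nan'), 1.8]   # one bad sensor
        |     average = sum(readings) / len(readings)
        |     print(average)             # nan: the bad value can't be missed
        |     print(math.isnan(average)) # True: the clear way to test for it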
       | 
       | This is why D (in one of its more controversial choices) sets
       | uninitialized floating point values to NaN rather than the more
       | conventional choice of 0.0.
       | 
       | NaN is your friend!
        
         | inetknght wrote:
         | > _This means that NaNs propagate from the source of the
         | original NaN to the final printed result._
         | 
         | An exception would be better. Then you immediately get at the
         | first problem instead of having to track down the lifetime of
         | the observed problem to find the first problem.
        
           | insulanus wrote:
            | Definitely. Unfortunately, language implementations that
           | guaranteed exceptions were not in wide use at the time. Also,
           | to have a chance at being implemented on more than one CPU,
           | it had to work in C and assembly.
        
         | pwpwp wrote:
         | I don't find this convincing.
         | 
         | > What do you return for an index into the array?
         | 
         | An option/maybe type would solve this much better.
         | 
         | > Yes, I know, it can be clumsy to trace it back to its source
         | 
         | An exception would be much better, alerting you to the exact
         | spot where the problem occurred.
        
           | WalterBright wrote:
           | > An option/maybe type would solve this much better.
           | 
           | NaN's are already an option type, although implemented in
           | hardware. The checking comes for free.
           | 
           | > An exception would be much better
           | 
           | You can configure the FPU to cause an Invalid Operation
           | Exception, but I personally don't find that attractive.
        
             | pwpwp wrote:
             | Good points!
        
             | omginternets wrote:
             | As far as I'm aware, there's no equivalent to a stack trace
             | with NaN, so finding the origin of a NaN can be extremely
             | tedious.
        
             | ratorx wrote:
              | The missing bit is language tooling. The regular floating
              | point APIs exposed by most languages don't force handling
              | of NaNs.
             | 
              | The benefit of the option type is not necessarily just the
              | extra value, but also the fact that the API forces you to
              | handle the None value. It's the difference between null and
              | Option.
             | 
             | Even if the API was better, I think there's value in
             | expressing it as Option<FloatGuaranteedToNotBeNaN> which
             | compiles down to using NaNs for the extra value to keep it
             | similar to other Option specialisations and not have to
             | remember about this special primitive type that has option
             | built in.
        
             | jcparkyn wrote:
             | > NaN's are already an option type, although implemented in
             | hardware
             | 
             | The compromise with this is that it makes it impossible to
             | represent a _non-optional_ float, which leads to the same
              | issues as null pointers in C++/Java/etc.
             | 
             | The impacts of NaN are almost certainly not as bad (in
             | aggregate) as `null`, but it'd still be nice if more
             | languages had ways to guarantee that certain numbers aren't
             | NaN (e.g. with a richer set of number types).
        
           | jordigh wrote:
           | Exceptions are actually part of floats, they're called
           | "signalling nans".
           | 
           | So technically Python is correct when it decided that 0.0/0.0
           | should raise an exception instead of just quietly returning
           | NaN. Raising an exception is a standards-conforming option.
           | 
           | https://stackoverflow.com/questions/18118408/what-is-the-
           | dif...
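            | 
            | A quick Python sketch of the quiet-vs-raising split (numpy's
            | errstate is used here as a stand-in for the FPU's invalid-
            | operation trap):
            | 
            |     import numpy as np
            | 
            |     print(float('nan') + 1.0)   # nan: a quiet NaN propagates
            |     try:
            |         0.0 / 0.0               # plain Python raises instead
            |     except ZeroDivisionError as e:
            |         print("python:", e)
            |     with np.errstate(invalid='raise'):
            |         try:
            |             np.array(0.0) / np.array(0.0)
            |         except FloatingPointError as e:
            |             print("numpy:", e)  # invalid value encountered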
        
             | WalterBright wrote:
             | In practice, I've found signalling NaNs to be completely
             | unworkable and gave up on them. The trouble is they eagerly
             | convert to quiet NaNs, too eagerly.
        
         | maximilianburke wrote:
          | I think the concept of NaNs is sound, but I think relying on
          | them is fraught with peril, made so by the unobvious test for
          | NaN-ness in many languages (i.e., "if (x != x)"), and the lure of
         | people who want to turn on "fast math" optimizations which do
         | things like assume NaNs aren't possible and then dead-code-
         | eliminate everything that's guarded by an "x != x" test.
         | 
         | Really though, I'm a fan, I just think that we need better
         | means for checking them in legacy languages and we need to
         | entirely do away with "fast math" optimizations.
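          | 
          | For what it's worth, the self-inequality trick in Python looks
          | like this (no fast-math involved here):
          | 
          |     import math
          | 
          |     x = float('nan')
          |     print(x != x)         # True: only NaN is unequal to itself
          |     print(math.isnan(x))  # True: same check, clearer intent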
        
           | WalterBright wrote:
           | I call them "buggy math" optimizations. The dmd D compiler
           | does not have a switch to enable buggy math.
        
         | jordigh wrote:
         | > What's the result of 1.0/0.0? It's not a number, so it's a
         | NaN
         | 
         | It's not often that I get to correct Mr D himself, but 1.0/0.0
         | is...
        
           | WalterBright wrote:
           | You're right. I'll fix it.
        
       | evancox100 wrote:
        | Example 7 really got me - can anyone explain that? I'm not sure
        | how the "modulo" operation would be implemented in hardware, or
        | whether it is a native instruction or not, but one would hope it
        | would give a result consistent with the matching divide operation.
       | 
       | Edit: x87 has FPREM1 which can calculate a remainder (accurately
       | one hopes), but I can't find an equivalent in modern SSE or AVX.
       | So I guess you are at the mercy of your language's library and/or
       | compiler? Is this a library/language bug rather than a Floating
       | Point gotcha?
        
         | adrian_b wrote:
         | This has nothing to do with the definition or implementation of
         | the remainder or modulo function.
         | 
         | It is a problem that appears whenever you compose an inexact
         | function, like the conversion from decimal to binary, with a
         | function that is not continuous, like the remainder a.k.a.
         | modulo function.
         | 
         | In decimal, 13.716 is exactly 3 times 4.572, so any kind of
          | remainder must be zero, but after conversion from decimal to
         | binary that relationship is no longer true, and because the
         | remainder is not a continuous function its value may be wildly
         | different from the correct value.
         | 
          | When you compute with approximate numbers, like floating-point
          | numbers, as long as you compose only continuous functions, the
          | error in the final result remains bounded, and smaller errors
          | in the inputs lead to correspondingly smaller errors in the
          | output.
         | 
          | However, it is enough to insert one discontinuous function into
          | the computation chain to lose any guarantee about the magnitude
          | of the error in the final result.
         | 
         | The conclusion is that whenever computing with approximate
         | numbers (which may also use other representations, not only
         | floating-point) you have to be exceedingly cautious when using
         | any function that is not continuous.
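          | 
          | A quick check of that example in Python (64-bit doubles), using
          | Fraction to see the exact stored values:
          | 
          |     from fractions import Fraction
          | 
          |     print(13.716 / 4.572)   # 3.0: the true ratio rounds to 3.0
          |     print(13.716 % 4.572)   # 4.571999999999999, not 0.0
          |     # the stored doubles are not in an exact 3:1 ratio:
          |     print(Fraction(13.716) == 3 * Fraction(4.572))   # False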
        
         | timerol wrote:
         | Based on the nearest numbers that floats represent, the two
         | numbers are Y = 13.715999603271484375
         | (https://float.exposed/0x415b74bc) and X =
         | 4.57200002670288085938 (https://float.exposed/0x40924dd3).
         | 
         | The division of these numbers is 2.9999998957049091386350361962
         | 468173875300478102103478639802753918, but the nearest float to
         | that is 3. (Exactly 3.) [2]
         | 
         | The modulo operation can (presumably) determine that 3X > Y, so
         | the modulo is Y - 2X, as normal.
         | 
         | This gives inconsistent results, if you don't know that every
         | float is actually a range, and "3" as a float includes some
         | numbers that are smaller than 3.
         | 
         | [1]
         | https://www.wolframalpha.com/input?i=13.715999603271484375+%...
         | [2] https://www.wolframalpha.com/input?i=2.99999989570490913863
         | 5..., then https://float.exposed/0x40400000
        
           | svat wrote:
           | This is useful but note that Python uses 64-bit floats (aka
           | "double"), so the right values are:
           | 
           | * "13.716" means 13.7159999999999993037
           | (https://float.exposed/0x402b6e978d4fdf3b)
           | 
           | * "4.572" means 4.57200000000000006395
           | (https://float.exposed/0x401249ba5e353f7d)
           | 
           | * "13.716 / 4.572" means the nearest representable value to
           | 13.7159999999999993037 / 4.57200000000000006395 which (https:
           | //www.wolframalpha.com/input?i=13.7159999999999993037+...) is
           | 3.0 (https://float.exposed/0x4008000000000000)
           | 
           | * "13.716 % 4.572" means the nearest representable value to
           | 13.7159999999999993037 % 4.57200000000000006395 namely to
           | 4.5719999999999991758 (https://www.wolframalpha.com/input?i=1
           | 3.7159999999999993037+...), which is 4.57199999999999917577
           | (https://float.exposed/0x401249ba5e353f7c) printed as
           | 4.571999999999999.
           | 
           | ----------------
           | 
           | Edit: For a useful analogy (answering the GP), imagine you're
           | working in decimal fixed-point arithmetic with two decimal
           | digits (like dollars and cents), and someone asks you for
           | 10.01/3.34 and 10.01%3.34. Well,
           | 
           | * 10.01 / 3.34 is well over 2.99 (it's over 2.997 in fact) so
           | you'd be justified in answering 3.00 (the nearest
           | representable value).
           | 
           | * 10.01 % 3.34 is 3.33 (which you can represent exactly), so
           | you'd answer 3.33 to that one.
           | 
           | (For an even bigger difference: try 19.99 and 6.67 to get
           | 3.00 as quotient, but 6.65 as remainder.)
        
       | kilotaras wrote:
       | Story time.
       | 
        | Back in university I was taking part in a programming
        | competition. I don't remember the exact details of the problem,
        | but it was expected to be solved as a dynamic programming problem
        | with dp[n][n] as the answer, n < 1000. But, wrangling some numbers
        | around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, and the
        | answer was just the sum of the first N terms of the harmonic
        | series. Unluckily for us the intended solution had worse precision
        | and our solution failed.
        
         | HarryHirsch wrote:
         | They didn't take into account that floats come with an
         | estimated uncertainty, and that values that are the same within
         | the limits of experimental error are identical? That's a really
         | badly set problem!
        
           | kilotaras wrote:
            | I think in that particular case they just didn't do error
            | analysis.
            | 
            | The task was to output the answer with `10^-6` precision,
            | which their solution didn't achieve. Funnily enough a number
            | of other teams went the "correct" route and passed (as they
            | were doing the additions in the same order as the original
            | solution).
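            | 
            | Loosely related, a small Python sketch of how summation order
            | alone moves a float32 result (made-up harmonic sum, not the
            | contest problem):
            | 
            |     import math
            |     import numpy as np
            | 
            |     terms = [1.0 / k for k in range(1, 100_001)]
            |     fwd = np.float32(0.0)
            |     for t in terms:              # large terms first
            |         fwd += np.float32(t)
            |     rev = np.float32(0.0)
            |     for t in reversed(terms):    # small terms first
            |         rev += np.float32(t)
            |     # the two float32 sums typically differ from each other
            |     # and from the near-exact math.fsum
            |     print(fwd, rev, math.fsum(terms))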
        
       | jordigh wrote:
       | One thing that pains me about this kind of zoo of problems is
       | that people often have the takeaway, "floating point is full of
       | unknowable, random errors, never use floating point, you will
       | never understand it."
       | 
       | Floating point is amazingly useful! There's a reason why it's
       | implemented in hardware in all modern computers and why every
       | programming language has a built-in type for floats. You should
        | use it! And you should understand that most of its limitations
        | are inherent, fundamental mathematical limitations; it is
        | logically impossible to do better on most of them:
       | 
       | 1. Numerical error is a fact of life, you can only delay it or
       | move it to another part of your computation, but you cannot get
       | rid of it.
       | 
       | 2. You cannot avoid working with very small or very large things
       | because your users are going to try, and floating point or not,
       | you'd better have a plan ready.
       | 
       | 3. You might not like that floats are in binary, which makes
       | decimal arithmetic look weird. But doing decimal arithmetic does
       | not get rid of numerical error, see point 1 (and binary
       | arithmetic thinks your decimal arithmetic looks weird too).
       | 
       | But sure, don't use floats for ID numbers, that's always a
       | problem. In fact, don't use bigints either, nor any other
       | arithmetic type for something you won't be doing arithmetic on.
        
         | zokier wrote:
         | > One thing that pains me about this kind of zoo of problems is
         | that people often have the takeaway, "floating point is full of
         | unknowable, random errors, never use floating point, you will
         | never understand it."
         | 
         | > Floating point is amazingly useful!
         | 
          | Another thing about floats is that they are for the most part
          | actually very predictable. In particular, all basic operations
          | should produce bit-exact results, correct to the last ulp. Also,
          | because they are a language-independent standard, you can
          | generally get the same behavior in different languages and
          | platforms. This makes learning floats properly worthwhile,
          | because the knowledge is so widely applicable.
        
           | jsmith45 wrote:
            | > In particular, all basic operations should produce bit-exact
            | results, correct to the last ulp.
           | 
            | As long as you are not using a compiler that utilizes x87's
            | extended precision floats for intermediate calculations,
            | silently rounding whenever it transfers values to memory
            | (that used to be a common issue), and as long as you are not
            | doing dumb stuff with compiler math flags.
           | 
            | Also if you have any code anywhere in your program that relies
           | on correct subnormal handling, then you need to be absolutely
           | sure no code is compiled with `-ffast-math`, including in any
           | dynamically loaded code in your entire program, or your math
           | will break: https://simonbyrne.github.io/notes/fastmath/#flus
           | hing_subnor...
           | 
           | And of course if you are doing anything complicated with
            | floating point numbers, there are entire fields of study about
           | creating numerically stable algorithms, and determining the
           | precision of algorithms with floating point numbers.
        
         | gumby wrote:
         | > Floating point is amazingly useful! There's a reason why it's
         | implemented in hardware in all modern computers and why every
         | programming language has a built-in type for floats.
         | 
         | I completely agree with you even though I go out of my way to
         | avoid FP, and even though, due to what I usually work on, I can
         | often get away with avoiding FP (often fixed point works -- for
         | me).
         | 
          | IEEE-754 is a marvelous standard. It's a short, easy-to-
          | understand standard attached to an absolutely mind-boggling
          | number of special cases and explanations as to why certain
          | decisions in the simple standard were actually incredibly
          | important (and often really smart and non-obvious). It's the
         | product of some very smart people who had, through their
         | careers, made FP implementations and discovered why various
         | decisions turned out to have been bad ones.
         | 
         | I'm glad it's in hardware, and not just because FP used to be
         | quite slow and different on every machine. I'm glad it's in
         | hardware because chip designers (unlike most software
         | developers) are anal about getting things right, and
         | implementing FP properly is _hard_ -- harder than using it!
        
         | [deleted]
        
         | carapace wrote:
         | Floating point is a goofy hacky kludge.
         | 
         | > There's a reason why it's implemented in hardware in all
         | modern computers
         | 
         | Yah, legacy.
         | 
         | The reason we used it originally is that computers were small
         | and slow. Now that they're big and fast we could do without it,
         | except that there is already so much hardware and software out
         | there that it will never happen.
        
           | astrange wrote:
           | Turning all your fixed-size numeric types into variable-sized
           | numeric types introduces some really exciting performance and
           | security issues. (At least if you consider DoS security.)
           | 
           | I think fixed-point math is underrated though.
        
           | dahfizz wrote:
           | What replacement would you propose? They all have different
           | tradeoffs.
        
             | carapace wrote:
             | (I just tried to delete my comment and couldn't because of
             | your reply. Such is life.)
             | 
             | ogogmad made a much more constructive comment than mine:
             | https://news.ycombinator.com/item?id=34370745
             | 
             | It really depends on your use case.
        
         | ogogmad wrote:
         | > And you should understand that most of its limitations are an
         | inherent mathematical and fundamental limitation, it is
         | logically impossible to do better on most of its limitations
         | 
         | You can do exact real arithmetic. But this is only done by
         | people who prove theorems with computers - or by the Android
         | calculator! https://en.wikipedia.org/wiki/Computable_analysis
         | 
         | Other alternatives (also niche) are exact rational arithmetic,
         | computer algebra, arbitrary precision arithmetic.
         | 
         | Fixed point sometimes gets used instead of floats because some
         | operations lose no precision over them, but most operations
         | still do.
        
           | saagarjha wrote:
           | These are only relevant in some circumstances. For example, a
           | calculator is typically bounded in the number of operations
           | you can perform to a small number (humans don't add millions
           | of numbers). This allows for certain representations that
           | don't make sense elsewhere.
        
           | lanstin wrote:
           | I wouldn't call computable reals the reals. They are a subset
           | of measure zero. Perhaps all we sentient beings can aspire to
           | use, but still short of the glory of the completed infinities
           | that even one arbitrary real represents.
           | 
           | One half : )
        
           | jordigh wrote:
           | In my opinion, that's in the realm of "you can only delay
           | it". Sure, you can treat real numbers via purely logical
           | deductions like a human mathematician would, but at some
           | point someone's going to ask, "so, where is the number on
           | this plot?" and that's when it's time to pay the fiddler.
           | 
           | Same for arbitrary-precision calculations like big rationals.
           | That just gives you as much precision as your computer can
           | fit in memory. You will still run out of precision, just
            | later rather than sooner.
        
             | ogogmad wrote:
             | > Same for arbitrary-precision calculations like big
             | rationals. That just gives you as much precision as your
             | computer can fit in memory. You will still run out of
              | precision, later rather than sooner.
             | 
             | Oh, absolutely. This actually shows that floats are (in
             | some sense) more rigorous than more idealised mathematical
             | approaches, because they explicitly deal with finite
             | memory.
             | 
             | Oh, I remembered! There's also interval arithmetic, and
             | variants of it like affine arithmetic. At least you _know_
              | when you're losing precision. Why don't these get used
             | more? These seem more ideal, somehow.
        
               | gugagore wrote:
               | If x is the interval [-1, 1], the typical implementation
               | of IA will
               | 
                | evaluate x-x to [-2, 2] (instead of [0, 0]), and
                | 
                | evaluate x*x to [-1, 1] (instead of [0, 1]).
               | 
               | Therefore the intervals become too conservative to be
               | useful.
        
               | genneth wrote:
               | Because the interval, on average, grows exponentially
               | with the number of basic operations. So it quickly
               | becomes practically useless.
        
         | zokier wrote:
         | > 3. You might not like that floats are in binary, which makes
         | decimal arithmetic look weird. But doing decimal arithmetic
         | does not get rid of numerical error, see point 1 (and binary
         | arithmetic thinks your decimal arithmetic looks weird too).
         | 
         | One thing that I suspect trips people a lot is decimal
         | string/literal <-> (binary) float conversions instead of the
         | floating point math itself. This includes the classic 0.1+0.2
         | thing, and many of the problems in the article.
         | 
         | I think these days using floating point hex strings/literals
         | more would help a lot. There are also decimal floating point
         | numbers that people largely ignore despite being standard for
          | over 15 years.
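          | 
          | For instance, Python's decimal module is one readily available
          | (software) take on the idea:
          | 
          |     from decimal import Decimal
          | 
          |     print(0.1 + 0.2 == 0.3)                # False (binary float)
          |     print(Decimal('0.1') + Decimal('0.2')
          |           == Decimal('0.3'))               # True (decimal float)
          |     print(Decimal(1) / Decimal(3))   # rounds to 28 decimal digits,
          |                                      # so decimal still has error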
        
           | jordigh wrote:
           | The only implementation of IEEE754 decimals I've ever seen is
           | in Python's Decimal package. Is there an easily-available
           | implementation anywhere else?
        
             | zokier wrote:
              | I don't think Python's Decimal is IEEE 754; instead it's
              | some sort of arbitrary-precision thingy.
             | 
             | GCC has builtin support for decimal floats:
             | https://gcc.gnu.org/onlinedocs/gcc/Decimal-Float.html
             | 
             | There are also library implementations floating around,
             | some of them are mentioned in this thread:
             | https://discourse.llvm.org/t/rfc-decimal-floating-point-
             | supp...
             | 
              | decNumber also has Rust wrappers if you are so inclined.
        
               | jordigh wrote:
               | Python's decimal absolutely is IEEE 754 (well, based on
               | the older standard, which has now been absorbed into IEEE
               | 754):
               | 
               | https://github.com/python/cpython/blob/main/Lib/_pydecima
               | l.p...
               | 
               | Cool, didn't know that gcc had built-in support. But is
               | it really as incomplete as it says there?
        
               | zokier wrote:
               | Huh, I didn't know it was that close, I'll grant that.
               | But I'd say still no cigar.
               | 
               | One of the most elementary requirements of IEEE754 is:
               | 
               | > A programming environment conforms to this standard, in
               | a particular radix, by implementing one or more of the
               | basic formats of that radix as both a supported
               | arithmetic format and a supported interchange format.
               | 
               | (Section 3.1.2)
               | 
                | While you could argue that you may configure Decimal's
               | context parameters to match those of some IEEE754 format
               | and thus claim conformance as arithmetic format, Python
               | has absolutely no support for the specified interchange
               | formats.
               | 
                | To be honest, seeing this I'm a bit befuddled as to why
                | closer conformance with IEEE754 is not sought. A quick
                | search found e.g. this issue report on adding IEEE754
               | parametrized context, which is a trivial patch, and it
               | has been just sitting there for 10 years:
               | https://github.com/python/cpython/issues/53032
               | 
                | Adding code to import/export BID/DPD formats, while maybe
                | not as trivial, still seems like a comparatively small
                | task and would improve interoperability significantly imho.
        
       | Lind5 wrote:
        | AI has already led to a rethinking of computer architectures, in
       | which the conventional von Neumann structure is replaced by near-
       | compute and at-memory floorplans. But novel layouts aren't enough
       | to achieve the power reductions and speed increases required for
       | deep learning networks. The industry also is updating the
       | standards for floating-point (FP) arithmetic.
       | https://semiengineering.com/will-floating-point-8-solve-ai-m...
        
       | dkarl wrote:
       | I'm not on Mastodon, so I'll share here: I inherited some
       | numerical software that was used primarily to prototype new
       | algorithms and check errors for a hardware product that solved
       | the same problem. It was known that different versions of the
       | software produced slightly different answers, for seemingly no
       | reason. The hardware engineer who handed it off to me didn't seem
       | to be bothered by it. He wasn't using version control, so I
       | couldn't dig into it immediately, but I couldn't stop thinking
       | about it.
       | 
       | Soon enough I had two consecutive releases in hand, which
       | produced different results, and which had _identical numerical
       | code_. The only code I had changed that ran during the numerical
       | calculations was code that ran _between_ iterations of the
       | numerical parts of the code. IIRC, it printed out some status
       | information like how long it had been running, how many
       | calculations it had done, the percent completed, and the
       | predicted time remaining.
       | 
       | How could that be affecting the numerical calculations??? My
       | first thought was a memory bug (the code was in C-flavored C++,
       | with manual memory management) but I got nowhere looking for one.
       | Unfortunately, I don't remember the process by which I figured
       | out the answer, but at some point I wondered what instructions
       | were used to do the floating-point calculations. The Makefile
       | didn't specify any architecture at all, and for that compiler, on
       | that architecture, that meant using x87 floating-point
       | instructions.
       | 
       | The x87 instruction set was originally created for floating point
       | coprocessors that were designed to work in tandem with Intel
       | CPUs. The 8087 coprocessor worked with the 8086, the 287 with the
       | 286, the 387 with the 386. Starting with the 486 generation, the
       | implementation was moved into the CPU.
       | 
       | Crucially, the x87 instruction set includes a stack of eight
       | 80-bit registers. Your C code may specify 64-bit floating point
        | numbers, but since the compiled code has to copy those values into
       | the x87 registers to execute floating-point instructions, the
       | calculations are done with 80-bit precision. Then the values are
       | copied back into 64-bit registers. If you are doing multiple
       | calculations, a smart compiler will keep intermediate values in
       | the 80-bit registers, saving cycles and gaining a little bit of
       | precision as a bonus.
       | 
       | Of course, the number of registers is limited, so intermediate
       | values may need to be copied to a 64-bit register temporarily to
       | make room for another calculation to happen, rounding them in the
       | process. And that's how code interleaved with numerical
        | calculations can affect the results even if it semantically
       | doesn't change any of the values. Calculating percent completed,
       | printing a progress bar -- the compiler may need to move values
       | out of the 80-bit registers to make room for these calculations,
       | and when the code changes (like you decide to also print out an
       | estimated time remaining) the compiler might change which
       | intermediate values are bumped out of the 80-bit registers and
       | rounded to 64 bits.
       | 
       | It was silly that we were executing these ancient instructions in
       | 2004 on Opteron workstations, which supported SSE2, so I added a
       | compiler flag to enable SSE2 instructions, and voila, the
       | numerical results matched exactly from build to build. We also
       | got a considerable speedup. I later found out that there's a bit
       | you can flip to force x87 arithmetic to always round results to
       | 64 bits, probably to solve exactly the problem I encountered, but
       | I never circled back to try it.
        
         | jordigh wrote:
         | Oh man, those 80-bit registers on 32-bit machines were weird. I
         | was very confused as an undergrad when I ran the basic program
         | to find machine epsilon, and was getting a much smaller epsilon
         | than I expected on a 64-bit float. Turns out, the compiler had
         | optimised all of my code to run on registers and I was getting
         | the machine epsilon of the registers instead.
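          | 
          | For reference, the classic little loop in question, in Python
          | (CPython on current hardware uses SSE doubles, so it gives the
          | textbook answer):
          | 
          |     import sys
          | 
          |     eps = 1.0
          |     while 1.0 + eps / 2 > 1.0:
          |         eps /= 2
          |     print(eps)                     # 2.220446049250313e-16
          |     print(sys.float_info.epsilon)  # same value, from the runtime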
        
       | cratermoon wrote:
       | Muller's Recurrence is my favorite example of floating point
       | weirdness. See https://scipython.com/blog/mullers-recurrence/ and
       | https://latkin.org/blog/2014/11/22/mullers-recurrence-roundo...
        
       | lifefeed wrote:
       | My favorite floating point weirdness is that 0.1 can't be exactly
       | represented in floating point.
        
         | jrockway wrote:
         | Isn't it equally weird that 1/3 can't be exactly represented in
         | decimal?
        
           | pitaj wrote:
           | Yep! Too bad humanity has settled on decimal instead of
           | dozenal (base 12).
        
           | kps wrote:
            | Indeed, 0.1 can be represented exactly in _decimal_ floating
            | point, and can't be represented in _binary_ fixed point. It's
            | just that fractional values are currently almost always
            | represented using binary floating point, so the two get
            | conflated.
        
           | layer8 wrote:
           | The reason why the 0.1 case is weird (unexpected) is that we
           | use decimal notation in floating-point constants (in source
           | code, in formats like JSON, and in UI number inputs), but the
           | value that the constant actually ends up representing is
           | really the closest binary number, where in addition the
           | closeness depends on the FP precision used. If we would write
           | FP values in binary or hexadecimal (which some languages
           | support), the issue wouldn't arise.
        
       | dahfizz wrote:
       | > Javascript only has floating point numbers - it doesn't have an
       | integer type.
       | 
       | Can anyone justify this? Do JS developers prefer not having exact
       | integers, or is this something that everyone just kinda deals
       | with?
        
         | thdc wrote:
          | I believe this is technically inaccurate; while Javascript
          | groups most of the number values under, well, "number", modern
          | underlying implementations may resort to performing integer
          | operations when they recognize it is possible. There are also a
          | couple of hacks you can do with bit operations to "work" with
          | integers, although I don't remember them off the top of my head
          | - they were typically used for truncating and whatnot and were
          | mainly a performance thing.
         | 
         | Also there are typed arrays and bigints if we can throw those
         | in, too.
        
           | saagarjha wrote:
           | The way runtimes optimize arithmetic is an implementation
           | detail and must conform to IEEE-754.
        
             | thdc wrote:
             | Fair point, I have been taking smis for granted
        
         | enriquto wrote:
         | > not having exact integers
         | 
         | What do you mean? Floating-point arithmetic is, by design,
         | exact for small integers. The result of adding 2.0 to 3.0 is
         | exactly 5.0. This is one of the few cases where it is perfectly
         | legitimate to compare floats for equality.
         | 
         | In fact, using 64-bit doubles to represent ints you get way
         | more ints than using plain 32-bit ints. Thus, choosing doubles
         | to represent integers makes perfect sense (unless you worry
         | about wasting a bit of memory and performance).
        
         | josefx wrote:
        | You can use doubles to store and calculate exact integer
        | values. You just won't get 2^64 integers; instead you get the
        | range +/-2^53.
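        | 
        | A quick Python (double) illustration of that 2^53 edge:
        | 
        |     big = 2.0 ** 53            # 9007199254740992.0
        |     print(big + 1 == big)      # True: 2**53+1 not representable
        |     print(big - 1 + 1 == big)  # True: still exact just below 2**53
        |     print(big + 2 == big)      # False: spacing there is 2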
        
         | deathanatos wrote:
         | Nowadays, it has BigInt.
         | 
         | If you're very careful, a double can be an integer type. (A
         | 53-bit one, I think?) (I don't love this line of thinking. It
         | has _a lot_ of sharp edges. But JS programmers effectively do
         | this all the time, often without thinking too hard about it.)
         | 
         | (And even before BigInt, there's an odd u32-esque "type" in JS;
         | it's not a real type -- it doesn't appear in the JS type
         | system, but rather an internal one that certain operations will
          | be converted to internally. That's why (0x100000000 | 0) == 0;
         | even though 0x100000000 (and every other number in that
         | expression, and the right answer) is precisely representable as
         | a f64. This doesn't matter for JSON decoding, though, ... and
         | most other things.)
        
       | guyomes wrote:
       | Example 4 mentions that the result might be different with the
       | same code. Here is an example that is particularly counter-
       | intuitive.
       | 
        | Some CPUs have the instruction FMA(a,b,c) = ab + c, and it is
        | guaranteed to be rounded to the nearest float. You might think
       | that using FMA will lead to more accurate results, which is true
       | most of the time.
       | 
        | However, assume that you want to compute a dot product between 2
        | orthogonal vectors, say (u,w) and (v,u) where w = -v. You will
        | write:
       | 
       | p = uv + wu
       | 
       | Without FMA, that amounts to two products and an addition between
       | two opposite numbers. This results in p = 0, which is the
       | expected result.
       | 
       | With FMA, the compiler might optimize this code to:
       | 
       | p = FMA(u, v, wu)
       | 
       | That is one FMA and one product. Now the issue is that wu is
       | rounded to the nearest float, say x, which is not exactly -vu. So
       | the result will be the nearest float to uv + x, which is not
       | zero!
       | 
        | So even for a simple formula like this, testing whether two
        | vectors are orthogonal would not necessarily work by testing if
        | the result is exactly zero. One recommended workaround in this
        | case is to test if the dot product has an absolute value smaller
        | than a small threshold.
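        | 
        | A concrete Python version of this (math.fma needs Python 3.13+;
        | the numbers are arbitrary):
        | 
        |     import math
        | 
        |     u, v = 0.1, 0.3
        |     w = -v
        |     print(u*v + w*u)            # 0.0 exactly: w*u is exactly -(u*v)
        |     print(math.fma(u, v, w*u))  # a tiny nonzero value: the single
        |                                 # rounding of the fused op exposes the
        |                                 # rounding error of u*v instead of
        |                                 # cancelling it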
        
         | [deleted]
        
         | zokier wrote:
         | Note that with gcc/clang you can control the auto-use of fma
         | with compile flags (-ffp-contract=off). It is pretty crazy imho
         | that gcc defaults to using fma
        
           | thxg wrote:
           | > It is pretty crazy imho that gcc defaults to using fma
           | 
           | Yes! Different people can make different performance-vs-
           | correctness trade-offs, but I also think reproducible-by-
           | default would be better.
           | 
           | Fortunately, specifying a proper standard (e.g. -std=c99 or
           | -std=c++11) implies -ffp-contract=off. I guess specifying
           | such a standard is probably a good idea independently when we
           | care about reproducibility.
           | 
            | Edit: Thinking about it, in the days of 80-bit x87 FPUs,
           | strictly following the standard (specifically, always
           | rounding to 64 bits after every operation) may have been
           | prohibitively expensive. This may explain gcc's GNU mode
            | defaulting to -ffp-contract=fast.
        
             | zokier wrote:
              | > Edit: Thinking about it, in the days of 80-bit x87 FPUs,
             | strictly following the standard (specifically, always
             | rounding to 64 bits after every operation) may have been
             | prohibitively expensive
             | 
             | afaik you could just set the precision of x87 to 32/64/80
             | bits and there would not be any extra cost to the
             | operations
        
         | lanstin wrote:
          | In general, with reals with any source of error anywhere, this
          | caution about equality is always correct. The odds of two reals
          | being equal are zero.
        
           | raphlinus wrote:
           | I have an exception that proves the rule. I thought about
           | responding to Julia's call, but decided this was too subtle.
           | But here we go...
           | 
           | A central primitive in 2D computational geometry is the
           | orientation problem; in this case deciding whether a point
           | lies to the left or right of a line. In real arithmetic, the
           | classic way to solve it is to set up the line equation (so
           | the value is zero for points on the line), then evaluate that
           | for the given point and test the sign.
           | 
           | The problem is of course that for points very near the line,
           | roundoff error can give the wrong answer, it is in fact an
           | example of cancellation. The problem has an exact answer, and
           | can be solved with rational numbers, or in a related
           | technique detecting when you're in the danger zone and upping
           | the floating point precision just in those cases. (This
           | technique is the basis of Jonathan Shewchuk's thesis).
           | 
           | However, in work I'm doing, I want to take a different
           | approach. If the y coordinate of the point matches the y
           | coordinate of one of the endpoints of the line, then you can
           | tell orientation exactly by comparing the x coordinates. In
           | other cases, either you're far enough away that you know you
           | won't get the wrong answer due to roundoff, or you can
           | subdivide the line at that y coordinate. Then you get an
           | orientation result that is not necessarily exactly correct
           | wrt the original line, but you can count on it being
           | consistent, which is what you really care about.
           | 
           | So the ironic thing is that if you had a lint that said,
           | "exact floating point equality is dangerous, you should use a
           | within-epsilon test instead," it would break the reasoning
           | outlined above, and you could no longer count on the
           | orientations being consistent.
           | 
           | As I said, though, this is a very special case. _Almost_
           | always, it is better to use a fuzzy test over exact equality,
            | and I can also list times I've been bitten by that
            | (_especially_ in fastmath conditions, which are hard to avoid
            | when you're doing GPU programming).
        
         | thxg wrote:
         | Yes, and this is not just a theoretical concern: There was an
         | article here [1] in 2021 claiming that Apple M1's FMA
         | implementation had "flaws". There was actually no such flaw.
         | Instead, the author was caught off guard by the very phenomenon
         | you are describing.
         | 
         | [1] https://news.ycombinator.com/item?id=27880461
        
       | kloch wrote:
       | > if you add very big values to very small values, you can get
       | inaccurate results (the small numbers get lost!)
       | 
       | There is a simple workaround for this:
       | 
       | https://en.wikipedia.org/wiki/Kahan_summation
       | 
       | It's usually only needed when adding billions of values together
       | and the accumulated truncation errors would be at an unacceptable
       | level.
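        | 
        | A minimal Python sketch of Kahan summation next to a naive sum
        | (math.fsum shown for comparison):
        | 
        |     import math
        | 
        |     def kahan_sum(values):
        |         total, c = 0.0, 0.0      # c holds the lost low-order bits
        |         for x in values:
        |             y = x - c
        |             t = total + y
        |             c = (t - total) - y  # what just got rounded away
        |             total = t
        |         return total
        | 
        |     data = [1e16] + [1.0] * 1000
        |     print(sum(data) - 1e16)        # 0.0: every +1.0 rounds away
        |     print(kahan_sum(data) - 1e16)  # 1000.0
        |     print(math.fsum(data) - 1e16)  # 1000.0 (exact summation)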
        
         | phkahler wrote:
         | It can also come up in simple control systems. A simple low-
         | pass filter can fail to converge to a steady state value if the
         | time constant is long and the sample rate is high.
         | 
         | Y += (X-Y) * alpha * dt
         | 
          | When dt is small and alpha is too, the right-hand side can be
          | too small to affect the 24-bit mantissa of the left-hand side.
         | 
          | I prefer a 16/32-bit fixed point version that guarantees
          | convergence to any 16-bit steady state. This happened in a power
          | conversion system where dt=1/40000 and I needed a filter in the
          | 10's of Hz.
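          | 
          | A rough float32 sketch of that stall (alpha and the values are
          | made up; dt matches the 40 kHz rate):
          | 
          |     import numpy as np
          | 
          |     X = np.float32(1001.0)  # input / desired steady state
          |     Y = np.float32(1000.0)  # filter state, 24-bit mantissa
          |     alpha = np.float32(10.0)
          |     dt = np.float32(1.0 / 40000.0)
          | 
          |     for _ in range(100_000):
          |         Y += (X - Y) * alpha * dt
          | 
          |     # Y stalls around 1000.88 instead of reaching 1001: once the
          |     # update is below half an ulp of Y, it rounds to nothing.
          |     print(Y)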
        
           | jbay808 wrote:
           | This is a very important application and a tougher problem
           | than most would guess. There is a huge variety of ways to
           | numerically implement even a simple transfer function, and
           | they can have very different consequences in terms of
           | rounding and overflow. Especially if you want to not only
           | guarantee that it converges to a steady-state, but
           | furthermore that the steady-state has no error. I spent a lot
           | of time working on this problem for nanometre-accurate servo
           | controls. Floating and fixed point each have advantages
           | depending on the nature and dynamic range of the variable
           | (eg. location parameter vs scale parameter).
        
         | kergonath wrote:
         | > It's usually only needed when adding billions of values
         | together and the accumulated truncation errors would be at an
         | unacceptable level.
         | 
         | OTOH, it's easy to implement, so I have a couple of functions
         | to do it easily, and I got quite a lot of use out of them. It's
         | probably overkill sometimes, but sometimes it's useful.
        
       | Aardwolf wrote:
       | > but I wanted to mention it because:
       | 
       | > 1. it has a funny name
       | 
       | Reasoning accepted!
        
         | [deleted]
        
       | owisd wrote:
        | My 'favourite' is that the quadratic formula
        | (-b +- sqrt(b^2-4ac))/2a falls apart when you solve for the "+"
        | solution (the root of smaller magnitude) using floating point for
        | cases where e = 4ac/b^2 is small, the workaround being to use the
        | binomial expansion -b/2a * (0.5e + 0.125e^2 + O(e^3))
        
       | svat wrote:
       | If you have only a couple of minutes to develop a mental model of
       | floating-point numbers (and you have none currently), the most
       | valuable thing IMO would be to spend them staring at a diagram
       | like this one:
       | https://upload.wikimedia.org/wikipedia/commons/b/b6/Floating...
       | (uploaded to Wikipedia by user Joeleoj123 in 2020, made using
       | Microsoft Paint) -- it already covers the main things you need to
       | know about floating-point, namely there are only finitely many
       | discrete representable values (the green lines), and the gaps
       | between them are narrower near 0 and wider further away.
       | 
       | With just that understanding, you can understand the reason for
       | most of the examples in this post. You avoid both the extreme of
       | thinking that floating-point numbers are mathematical (exact)
       | real numbers, and the extreme of "superstition" like believing
       | that floating-point numbers are some kind of fuzzy blurry values
       | and that any operation always has some error / is "random", etc.
        | You won't find it surprising why 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0
        | will always give 3.0, but 100000000000000000000000.0 +
        | 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-)
       | (Sure this confidence may turn out to be dangerous, but it's
       | better than "superstition".) The second-most valuable thing, if
       | you have 5-10 minutes, may be to go to https://float.exposed/ and
       | play with it for a while.
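        | 
        | That spacing is easy to poke at directly in Python
        | (math.nextafter gives the adjacent representable double):
        | 
        |     import math
        | 
        |     for x in (0.001, 1.0, 1e6, 1e15, 1e23):
        |         gap = math.nextafter(x, math.inf) - x
        |         print(x, gap)
        |     # the gap grows with magnitude; at 1e23 it is about 1.7e7,
        |     # which is part of why sums at that scale miss the decimal
        |     # answer above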
       | 
       | Anyway, great post as always from Julia Evans. Apart from the
       | technical content, her attitude is really inspiring to me as
       | well, e.g. the contents of the "that's all for now" section at
       | the end.
       | 
       | The page layout example ("example 7") illustrates the kind of
       | issue because of which Knuth avoided floating-point arithmetic in
       | TeX (except where it doesn't matter) and does everything with
        | scaled integers (fixed-point arithmetic). (It was even worse back
        | then, before IEEE 754.)
       | 
       | I think things like fixed-point arithmetic, decimal arithmetic,
       | and maybe even exact real arithmetic / interval arithmetic are
       | actually more feasible these days, and it's no longer obvious to
       | me that floating-point should be the default that programming
       | languages guide programmers towards.
        
         | sacrosancty wrote:
         | If you have even less time, just think of them as representing
         | physical measurements made with practical instruments and the
         | math done with analog equipment.
         | 
          | The common cause of floating point problems is usually treating
          | them as a mathematical ideal. The quirks appear at the extremes,
          | when you try to do un-physical things with them. You can't
          | measure exactly 0 V with a voltmeter, or use an instrument for
          | measuring the distance to stars and then add a length obtained
          | from a micrometer without entirely losing the latter's
          | contribution.
        
           | svat wrote:
           | Thanks, I actually edited my post (made the second paragraph
           | longer) after seeing your comment. The "physical" / "analog"
           | idea does help in one direction (prevents us from relying on
           | floating-point numbers in unsafe ways) but I think it brings
           | us too close to the "superstition" end of the spectrum, where
           | we start to think that floating-point operations are non-
           | deterministic, start doubting whether we can rely on (say)
           | the operation 2.0 + 3.0 giving exactly 5.0 (we can!), whether
           | addition is commutative (it is, if working with non-NaN
           | floats) and so on.
           | 
           | You could argue that it's "safe" to distrust floating-point
           | entirely, but I find it more comforting to be able to take at
           | least some things as solid and reason about them, to refine
           | my mental model of when errors can happen and not happen,
           | etc.
           | 
           | Edit: See also the _floating point isn't "bad" or random_
           | section that the author just added to the post
           | (https://twitter.com/b0rk/status/1613986022534135809).
        
       | ogogmad wrote:
       | Related: In numerical analysis, I found the distinction between
       | forwards and backwards numerical error to be an interesting
       | concept. The forwards error initially seems like the only right
       | kind, but is often impossible to keep small in numerical linear
       | algebra. In particular, Singular Value Decomposition cannot be
       | computed with small forwards error. But the SVD can be computed
       | with small backwards error.
       | 
       | Also: The JSON example is nasty. Should IDs then always be
       | strings?
        
         | gugagore wrote:
         | IIRC, forward error: the error between the given answer and the
         | right answer to the given question.
         | 
         | Backward error: the error between the given question, and the
         | question whose right answer is the given answer.
         | 
         | Easier to parse like this: a small forward error means that you
         | give an answer close to the right one.
         | 
         | A small backward error means that the answer you give is the
         | right answer for a nearby question.
        
         | deathanatos wrote:
         | > _The JSON example is nasty._
         | 
         | Specs, vs. their implementations, vs. backwards compat. JSON
         | just defines a number type, and neither the grammar nor the
         | spec places limits on it (though the spec does call out exactly
         | this problem). So the JSON is to-spec valid. But
         | implementations have limits as to what they'll decode: JS's is
         | that it decodes to number (a double) by default, and thus,
         | loses precision.
         | 
         | (I feel like this issue is pretty well known, but I suppose it
         | probably bites everyone once.)
         | 
         | JS does have the BigInt type, nowadays. Unfortunately, while
         | the JSON.parse API includes a "reviver" parameter, the way it
         | ends up working means that it can't actually take advantage of
         | BigInt.
         | 
         | > _Should IDs then always be strings?_
         | 
         | That's a decent-ish solution; as it side-steps the interop
         | issues. String, to me, is not unreasonable for an ID, as you're
         | not going to be doing math on it.
        
       | mikehollinger wrote:
        | Love it. I actually use Excel, which even power users take for
        | granted, to highlight that people _really_ need to understand the
        | underlying system, or the system needs to have guard rails to
        | prevent people from stubbing their toes. Microsoft even had to
        | write a page explaining what might happen [1] with floating point
        | weirdness.
       | 
       | [1] https://docs.microsoft.com/en-
       | us/office/troubleshoot/excel/f...
        
       ___________________________________________________________________
       (page generated 2023-01-13 23:00 UTC)