[HN Gopher] Someone's Been Messing with My Subnormals
       ___________________________________________________________________
        
       Someone's Been Messing with My Subnormals
        
       Author : jpegqs
       Score  : 260 points
       Date   : 2022-09-06 15:14 UTC (7 hours ago)
        
 (HTM) web link (moyix.blogspot.com)
 (TXT) w3m dump (moyix.blogspot.com)
        
       | benreesman wrote:
       | That's...terrifying. This is a fantastic find: big, big respect
       | to @moyix, this is going to save people's ass.
        
       | olliej wrote:
        | Wow, I am surprised that -ffast-math triggers a mode switch in
        | the FPU, partly because of the author's library problem, but
        | also because the documentation for clang, at least[1], does not
        | say it affects the behaviour of denormals. In fact, it has a
        | separate mode switch for that, which is not explicitly called
        | out as being implied by -ffast-math.
       | 
       | [1] https://clang.llvm.org/docs/UsersManual.html#cmdoption-
       | ffast...
        
       | nsajko wrote:
       | -Ofast isn't a good name for the option, but in GCC's defense the
       | manual is pretty clear about all this, and there's no excuse for
       | blindly turning on compiler options - they literally change the
       | semantics of your code.
        
         | bombcar wrote:
          | It's a quirk of language that, for compiler writers and other
          | algorithmic people, "fast" often means "ballpark, but damn
          | quick".
         | 
         | It's hard to come up with a similar name that isn't long.
        
           | cesarb wrote:
           | > It's hard to come up with a similar name that isn't long.
           | 
           | The suggestion given elsewhere in these comments to call it
           | "unsafe math" instead of "fast math" sounds good. It's nearly
           | as short, and properly conveys the "you must know what you're
           | doing" aspect of these flags. It's even better if you're used
           | to Rust.
        
         | actually_a_dog wrote:
          | I agree. I think -ffast-math should actually be called
          | -finexact-math. One would also hope that explicitly disabling
          | an option on the command line would, you know, explicitly
          | disable the option, but maybe that's too much to ask.
        
           | mbauman wrote:
           | I don't think it should exist at all. It's such a crazy grab
           | bag of code changes disguised as "optimizations" that it's
           | completely impossible to reason about, even for folks that
           | "don't care" about the exact floating point arithmetic.
           | 
           | It has global effects like those in TFA, and even locally you
           | no longer know if a line or two of arithmetic will become
           | more precise (e.g., by using higher precision intermediate
           | results), less precise, or become complete gibberish (e.g.,
           | because it thinks it can prove you're now dividing by zero
           | and thus can just return whatever it wants).
        
           | trelane wrote:
           | -fyolo-math?
           | 
           | -fgoodenough-math?
           | 
           | -fbroken-but-fast-math
        
         | mbauman wrote:
          | I wholeheartedly disagree.
          | 
          |     -Ofast
          |     Disregard strict standards compliance. ...
          | 
          | There's strict standards compliance and then there's the crazy
          | grab bag of code changes that is `-ffast-math`. Further, I'd
          | say gevent can defensibly say that -ffast-math is okay for them
          | given what the manual says:
          | 
          |     -ffast-math
          |     ... it can result in incorrect output for programs that
          |     depend on an exact implementation of IEEE or ISO
          |     rules/specifications for math functions. It may, however,
          |     yield faster code for programs that do not require the
          |     guarantees of these specifications.
         | 
         | This is 100% on the compiler people. For the option name, the
         | documentation, and the behavior.
         | 
         | https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Optio...
        
           | nsajko wrote:
           | Well, how would you improve the docs? Both documentation
           | entries seem reasonable to me.
           | 
           | That said, I don't see why the -Ofast option even needs to
           | exist, except backwards compatibility, as -ffast-math and the
           | others can (and should IMO) be specified explicitly.
        
             | mrguyorama wrote:
             | The fact that -ffast-math makes no mention that it will
             | poison any other code executing in your process space is a
              | huge missing piece of info. As the docs are written,
              | anyone not doing scientific math might as well enable
              | that flag, but the reality is that most people have some
              | code somewhere in
             | their process that expects fairly sane floating point math
             | behavior, even if it's just displaying progress bars or
             | something.
        
               | nsajko wrote:
               | > The fact that -ffast-math makes no mention that it will
               | poison any other code executing in your process space
               | 
                | Untrue. The doc entry for -ffast-math says "can result
                | in incorrect output for _programs_ that depend on an
                | exact implementation of IEEE or ISO rules/specifications
                | for math functions". Emphasis mine.
               | 
               | So they clearly say that the entire program can turn
               | invalid when -ffast-math is used.
               | 
               | You and some other people here act like the docs say
               | "translation unit" or something like that, instead of
               | "program", but this is simply not the case.
               | 
               | Furthermore, the entry for -ffast-math points to entries
               | for suboptions that -ffast-math turns on (located right
                | below in the man page), e.g. -funsafe-math-optimizations.
               | These also make clear how dangerous they can be even when
               | turned on one at a time.
        
             | Athas wrote:
             | Consider the documentation for the similar compiler flag in
             | the OpenCL specification:
             | 
             | > -cl-unsafe-math-optimizations
             | 
             | > Allow optimizations for floating-point arithmetic that
             | (a) assume that arguments and results are valid, (b) may
             | violate IEEE 754 standard and (c) may violate the OpenCL
             | numerical compliance requirements as defined in section 7.4
             | for single-precision floating-point, section 9.3.9 for
             | double-precision floating-point, and edge case behavior in
             | section 7.5. This option includes the -cl-no-signed-zeros
             | and -cl-mad-enable options.
             | 
             | While it stops short of saying "this will likely break your
             | code" (maybe because it doesn't have the nonlocal effects
             | of -ffast-math), it makes it much more clear that this flag
             | is generally unsafe and fragile, except under rather
             | specific circumstances. Also, it is reasonably exact about
             | what those circumstances are. I'm not sure -ffast-math is
             | documented with enough precision for a programmer to even
             | know whether it will break their code. Best you can do is
             | try and see if the program still works.
        
               | nsajko wrote:
               | The relevant GCC man page entries are even more clear
               | than the OpenCL spec excerpt.
               | 
               | -ffast-math:
               | 
               | > This option is not turned on by any -O option besides
               | -Ofast since it can result in incorrect output for
               | programs that depend on an exact implementation of IEEE
               | or ISO rules/specifications for math functions.
               | 
                | It also points to the -funsafe-math-optimizations sub-
               | option, where it is said that:
               | 
               | > Allow optimizations for floating-point arithmetic that
               | (a) assume that arguments and results are valid and (b)
               | may violate IEEE or ANSI standards. When used at link
               | time, it may include libraries or startup files that
               | change the default FPU control word or other similar
               | optimizations. [...]
        
             | mbauman wrote:
             | Yes, exactly: I'd deprecate it entirely. It shouldn't be a
             | single flag.
        
         | fweimer wrote:
         | What's missing is that it also affects linking, and results in
         | this strange action-at-a-distance. Maybe disabling the linker
         | part with -shared would be a reasonable compromise.
        
           | nsajko wrote:
           | You're wrong, both the doc entry for -Ofast and the one for
           | -ffast-math say that they can result in incorrect _programs_.
              | Programs are produced by linking, so I don't see what other
           | way to interpret this is possible.
        
             | brigade wrote:
             | Why not simply replace all FP math with a constant zero?
             | That'd be _really_ fast and an equally valid strict
             | interpretation of "can result in incorrect programs."
        
               | nsajko wrote:
               | See https://news.ycombinator.com/newsguidelines.html,
               | e.g.:
               | 
               | > Please don't post shallow dismissals, especially of
               | other people's work. A good critical comment teaches us
               | something.
        
               | brigade wrote:
               | Just because you're shallowly dismissing my comment
               | doesn't make it wrong.
               | 
                | Linking in code with undefined (in this case,
                | _re_defined) behavior doesn't automatically invalidate
                | the entire program. But that's the language used
                | because once the undefined behavior is hit at runtime,
                | the spec no longer defines what the behavior is or what
                | the program will do afterwards.
        
       | Const-me wrote:
       | That thread-local MXCSR register is particularly entertaining in
       | a thread pool environment, such as OpenMP. OSes carefully
       | preserve that piece of thread state across context switches.
       | 
       | I tend to avoid touching that value, even when it means extra
       | instructions like roundpd for specific rounding mode, or shuffles
       | to avoid division by 0 in the unused lanes.
        
       | mananaysiempre wrote:
       | Following the article's links, I fail to find an actual example
       | of anything failing to converge in flush-subnormals mode. I mean,
       | I'm sure one could be squeezed out, but the justification given
       | amounts to "Sterbenz's lemma [the one that rephrases
       | "catastrophic cancellation" as "exact differences"] fails, maybe
       | something somewhere also will". And my (shallow but not
       | nonexistent) experience with numerical analysis is that proofs
       | lump subnormals with underflow, and most of them don't survive
       | even intermediate underflows.
       | 
       | (AFAIU the original Intel justification for pushing subnormals
       | into 754 was gradual underflow, i.e. to give people at least
        | something to look at for debugging when they've run out of
       | precision.)
       | 
       | So, yes, it's not exactly polite to fiddle with floating-point
       | flag bits that are not yours, and it's better that this not
       | happen for reproducibility if nothing else, but I doubt it
       | actually breaks any interesting numerics.
        
         | moyix wrote:
         | The gevent issue has an example:
         | 
         | https://github.com/gevent/gevent/pull/1820
         | 
         | I haven't examined the code of scipy.stats.skellam.sf so I
         | can't say for sure that it's not converging, but it's clearly
         | some kind of pathological behavior.
        
           | mananaysiempre wrote:
           | So somebody tried to calculate, for integer arguments from 0
           | to 99 inclusive, the CDF of the difference of two Poisson
           | variables with means 4e-6 and 1e-6? I... don't know if it is
           | at all reasonable to expect an answer to that question. As
           | in, genuinely don't know--obviously it's an utterly rotten
           | thing to compute, but at the same time maybe somebody got
           | really interested in that and figured out a way to make it
           | work.
           | 
            | Anyhow, my spelunking was cut off by sleep, but as best I
            | can tell that would end up in the CDFLIB[1] routine CUMCHN
            | with X = 8e-6, PNONC = 2e-6, DF from 0 to 99. The insides
            | don't really look like the kind of magic that is held up by
           | Sterbenz's lemma and strategically arranged to take advantage
           | of gradual underflow, so at first glance I wouldn't trust
           | anything subnormal-dependent that it would compute, but maybe
           | it still is? Sleep.
           | 
           | [1]
           | https://people.sc.fsu.edu/~jburkardt/f_src/cdflib/cdflib.f90
        
             | moyix wrote:
             | Yeah, unfortunately I have no idea if that was their
             | original goal (which seems unlikely?) or if this is just a
             | minimal example they came up with after tripping over the
             | actual problem in a more realistic setting.
             | 
             | I think it suffices to show that the behavior of FTZ/DAZ
             | caused an actual problem for someone, though. I agree that
             | the vast majority of numerical code won't care about
             | FTZ/DAZ, but when it's enabled thread-wide you have no idea
             | what kind of code you'll end up affecting.
        
             | UncleEntity wrote:
              | For my last bug report I wrote a small C++ program to put
              | all the values between 0x000 .. 0xfff into a tree
              | structure and then iterate over the tree printing out the
              | values.
             | 
             | I'd have loved if the library author replied with "why
             | don't you just print out the values directly?"
        
       | leni536 wrote:
       | Does this only affect pypi, or should I now worry about shared
       | libraries shipped with my distro as well? Debian is not crazy
       | enough to ship shared libs compiled with -ffast-math, right?
       | RIGHT?
        
         | moyix wrote:
         | Please don't do this to me, I don't know if I have it in me to
         | go on ANOTHER big scrape & scan.
        
         | JonChesterfield wrote:
          | If the package build scripts from upstream have that in them,
          | the Debian packaged versions probably do too.
        
       | cesarb wrote:
       | At a previous company I worked at, we had an issue with our
       | software (Windows-based, written in a proprietary language)
       | randomly crashing. After some debugging, we found that this
       | happened whenever the user made some specific actions, but only
       | if, in that session, the user had previously printed something or
       | opened a file picker. The culprit was either a printer driver or
       | a shell extension which, when loaded, changed the floating point
       | control word to trap. That happened whenever the culprit DLL had
       | been compiled by a specific compiler, which had the offending
       | code in the startup routine it linked into every DLL it produced.
       | 
       | Our solution was the inverse of the one presented in this
       | article: instead of wrapping our routines to temporarily set the
       | floating point control word to sane values, we wrapped the calls
       | to either printing or the file picker, and reset the floating
       | point control word to its previous (and sane) value after these
       | calls.
        
         | becurious wrote:
         | Had this exact same problem. It was a specific color inkjet
         | driver doing this, my guess is to enable dithering or something
         | similar. It's one of those things that infects everything in
         | the code base because the way you print with GDI is to
         | progressively draw parts of the page - so you have to call in
         | and out of code that talks to the printer DC. We also had to
         | render one item using Direct3D retained mode and that added to
         | the fp control word complexity. Things seemed to be more robust
         | on NT based OSes.
        
         | klysm wrote:
         | That is one hell of a war story - I didn't realize that kind of
         | failure was even possible, but it is truly terrifying.
        
           | pavlov wrote:
           | Direct3D used to flip the x87 FPU to single precision mode by
           | default. This produced some amazing bugs when your other C
           | libraries reasonably assumed that a double would be at least
           | 64 bits. (The FPU mode settings affected the thread that
           | called Direct3D, and most programs used to be single-
           | threaded.)
           | 
           | It seems they changed this behavior in Direct3D 10:
           | 
           | https://microsoft.public.win32.programmer.directx.graphics.n.
           | ..
        
             | speeder wrote:
              | I stumbled into this bug in a rather spectacular manner.
             | 
             | I was making a game using D3D, Lua and Chipmunk physics,
             | and some of the behaviour of the game was being odd.
             | 
              | So I started trying to print random stuff with Lua;
              | eventually I just tried print(5+5), and to my surprise my
              | console printed "11".
             | 
             | I went into Lua's irc channel to talk about this, and
             | everyone said I was nuts, that the number was too small to
             | trigger precision issues, that I was a troll and so on.
             | 
              | After a lot of searching I found out about this D3D bug,
              | so I switched the game to use OpenGL instead, and there it
              | was: 5+5 = 10 again!
             | 
             | Now why fiddling with the FPU could make 5+5 become 11, I
             | have no idea.
        
         | titzer wrote:
         | I've heard so many stories akin to this one that I just shake
         | my head. It's a self-inflicted wound that people who prioritize
         | _performance_ above other considerations _keep inflicting on
         | everyone else_.
         | 
         | I _hope_ we learned our lessons on this specific question in
         | the design of Wasm. There are subnormals in Wasm and you can 't
         | turn them off for performance.
        
         | ack_complete wrote:
          | Had to deal with this same issue when I had a program
          | supporting plugins: DLLs compiled with Delphi would turn on
          | all the floating point traps. Took a while to track down what
          | was causing FP faults in comctl32.dll. It got so bad that I
          | had to
         | put in a popup dialog that would name and shame the offending
         | DLL so the authors would fix their broken plugins. It's an ABI
         | violation in Windows since the ABI specifically defines FPU
         | exceptions as masked, so this was more egregious than just
         | turning on FTZ/DAZ (which Intel-compiled DLLs did).
         | 
         | Many of these same DLLs would also hijack
         | SetUnhandledExceptionFilter() for their custom exception
         | support, which would also result in hard fastfail crashes when
         | they failed to unhook properly. Ended up having to hotpatch
         | SetUnhandledExceptionFilter() Detours-style to prevent my crash
         | reporting filter from being overridden. Years later, Microsoft
         | revealed that Office had done the same thing for the same
         | reasons.
         | 
         | The new version of this problem is DLLs that use AVX
         | instructions and then don't execute a VZEROALL/VZEROUPPER
         | instruction before returning. This is more sinister as it
         | doesn't cause a failure, it just causes SSE2 code to run up to
         | four times slower in the thread.
        
           | astrange wrote:
           | You could also get an issue with x87/MMX where floating point
           | code wouldn't work if you wrote some MMX code and didn't do
           | an `emms` instruction afterward.
           | 
           | This is basically the reason compiler autovectorization
           | doesn't do MMX.
        
           | pavon wrote:
           | Yep, I've encountered floating point flag incompatibilities
           | when dynamically loading Borland-compiled libraries into
           | Visual Studio compiled applications, as well as when using
           | C++ code via Java Native Interface.
           | 
           | It is nice that diverse vendor-specific calling conventions
           | and ABIs are less common these days.
        
           | Xorlev wrote:
           | I was interested in the last point about AVX instructions,
           | and found https://john-
           | h-k.github.io/VexTransitionPenalties.html which discusses the
           | problem.
        
       | puffoflogic wrote:
       | Dynamic linking is the root of all kinds of evil, enough said.
        
         | benreesman wrote:
         | As a default (particularly an effectively _mandatory_ default,
         | looking at you glibc) it is indeed insane.
         | 
         | But for something like a Python extension it's what we've got.
         | 
         | Which has the ancillary benefit of surfacing stuff like this.
        
         | woodruffw wrote:
         | The content of this post has nothing to do with the specifics
         | of dynamic linking: it would be just as true if the wheels in
         | question had static binaries instead.
        
           | benreesman wrote:
           | Eh, somewhere in the middle. Someone else put '-ffast-math'
           | in a compile line and it poisons FP math far away with no
           | recompile?
           | 
           | I believe it's a necessary price in this case, but it does
           | highlight how suboptimal it is to pay the price in other
           | cases.
        
             | woodruffw wrote:
             | It's fair to point out that shared objects _surface_ the
             | problem here, but I don 't know if I would lay the blame
             | with them: the underlying problem is that a FPU control
             | register isn't (and can't be) meaningfully isolated. Python
             | needs to use shared objects for loadable extensions, but
             | the contaminating code might be statically linked into that
             | shared object.
             | 
             | (I don't say this because I want to excuse dynamic linking,
             | which I also generally dislike! Only that I think the
             | problem is somewhere else in this particular case.)
        
         | jeroenhd wrote:
         | What is the alternative here? To provide a python.so file with
         | all possible binary Python packages statically linked into it?
         | You'd need to update it every hour to include all the bugfixes
         | in every native library yanked in! To recompile Python itself
         | every time you install a package? Even with a compiler cache
         | you'd have the Gentoo experience of waiting for ages every time
         | you try to use the package manager.
         | 
         | Dynamic linking solves a real problem, especially in this
         | space. It comes with new problems of its own but so does the
         | alternative.
        
           | [deleted]
        
           | [deleted]
        
       | compiler-guy wrote:
       | -funsafe-math is neither fun nor safe.
        
         | kibwen wrote:
         | I hereby propose that we rename "unsafe-math" to "ucking-
         | broken-math".
        
           | tomrod wrote:
            | I approve. Let's get someone with authority to make the
           | change.
        
       | black_knight wrote:
        | I ran Gentoo back in the good old days. The biggest draw was
        | that after about a week of compiling, my system ran a lot
        | faster because of all the compiler optimisations one could
        | enable, since the code only had to work on your own CPU.
       | 
       | I might be misremembering, but I think fastmath was one of the
       | flags explicitly warned against in the Gentoo manual.
        
         | bombcar wrote:
          | It was, and people would still use it because "hey, it says
          | fast".
         | 
          | The CPU flags were less interesting to me compared to being
          | able to disable features like X.
        
         | p_l wrote:
          | There was a big warning that it might produce a broken
          | system, iirc.
        
         | jeffbee wrote:
         | ChromeOS is sort of the successor to Gentoo. The images are
         | built with profile-guided, link-time, and post-link
         | optimization, and they are targeted to the specific CPU in a
         | given Chromebook. Every other Linux leaves a large amount of
         | performance on the table by targeting a common denominator CPU
         | that's 20 years old and not having PGO.
        
           | TazeTSchnitzel wrote:
           | Apple avoid this problem with their OS by having a separate
           | architecture slice for modern x64 (Haswell+).
        
           | yjftsjthsd-h wrote:
            | It's not a successor, it's a derivative. And yes, if you're
            | only targeting specific known hardware then you can and
            | probably should optimize for it, but most distributions fully
           | intend to be usable on very nearly any x86(_64) hardware so
           | they can't do that.
        
             | jerf wrote:
             | It's also a bit less relevant when everything is so fast. I
             | used Gentoo on a cheap-for-the-time Pentium 133MHz. Gentoo
             | was basically the difference between a modestly pleasant
             | system and an unusably slow system if I tried to run a
             | standard still-compiled-for-386 distro on it.
             | 
             | I've long since stopped worrying about it because on the
             | systems I run, which are not top-of-the-line but aren't
             | RPis either, it's not worth worrying about anymore for most
             | programs. At most maybe you should target the one
             | particular program you use that could use a boost.
        
               | yjftsjthsd-h wrote:
               | Yeah, I don't know the breakdown between better hardware
               | and better compiler optimizations (even in the default
               | settings) and less differentiation between processors,
               | but I've done some minor not-very-scientific tests of
               | compiling packages with O3/march=mtune=native and in my
               | limited experience it wasn't particularly useful. Like,
               | not just small benefits, but zero or below the noise
               | floor benefits in my benchmarks. Obviously this is super
               | dependent on your workload and maybe hardware; it's an
               | area where if you care, you _have_ to do your own
               | testing.
        
               | jeffbee wrote:
               | Tune for native sometimes makes a difference but not
               | always. Targeting a platform that is known to have AVX2,
               | instead of detecting AVX2 at runtime and bouncing through
               | the PLT, can make a large difference. PGO remains the
               | largest opportunity.
        
         | hackingthelema wrote:
         | > I might be misremembering, but I think fastmath was one of
         | the flags explicitly warned against in the Gentoo manual.
         | 
         | It is, here:
         | https://wiki.gentoo.org/wiki/GCC_optimization#But_I_get_bett...
        
       | TazeTSchnitzel wrote:
       | Global state is the root of so many evils! FPU rounding mode, FPU
       | flush-to-zero mode, C locale, errno, and probably some other
       | things should all be eliminated. The functionality should still
       | exist but not as global flags.
        
         | leni536 wrote:
         | At least many of those are thread-local. But not C locale, it
         | is truly horrible.
        
       | Tyr42 wrote:
       | Oh man, great job digging through all that. This is exactly the
       | kind of content I want to see.
       | 
       | Don't you love your fun safe math?
        
       | ChrisRackauckas wrote:
       | The Julia package ecosystem has a lot of safeguards against
       | silent incorrect behavior like this. For example, if you try to
       | add a package binary build which would use fast math flags, it
       | will throw an error and tell you to repent:
       | 
       | https://github.com/JuliaPackaging/BinaryBuilderBase.jl/blob/...
       | 
       | In user codes you can do `@fastmath`, but it's at the semantic
       | level so it will change `sin` to `sin_fast` but not recurse down
       | into other people's functions, because at that point you're just
       | asking for trouble. There's also calls to rename it `@unsafemath`
       | in Julia, just to make it explicit. In summary, "Fastmath" is
       | overused and many times people actually want other optimizations
       | (automatic FMA), and people really need to stop throwing global
       | changes around willy-nilly, and programming languages need to
       | force people to avoid such global issues both semantically and
       | within its package ecosystems norms.
        
         | aidenn0 wrote:
         | Automatic FMA can change the result of operations, so it makes
         | (some) sense to be bundled in with fastmath.
        
           | ChrisRackauckas wrote:
           | But if what you want is automatic FMA, then why carry along
           | every other possible behavior with it? Just because you want
           | FMA, suddenly NaNs are turned into Infs, subnormal numbers go
           | to zero, handling of sin(x) at small values is inaccurate,
           | etc? To me that's painting numerical handling in way too
           | broad of strokes. FMA also only increases numerical accuracy,
           | it doesn't decrease numerical accuracy, so bundling it with
           | unsafe transformations makes one uncertain now whether it has
           | improved or decreased accuracy.
           | 
           | For reference, to handle this well we use MuladdMacro.jl
           | which is a semantic transformation that turns x*y+z into
           | muladd expressions, and it does not recurse into functions so
           | it does not change the definitions of the callers inside of
           | the macro scope.
           | 
           | https://github.com/SciML/MuladdMacro.jl
           | 
           | This is something that will always increase performance and
           | accuracy (performance because muladd in Julia is an FMA that
           | is only applied if hardware FMA exists, effectively never
           | resorting to a software FMA emulation) because it's targeted
           | to do only a transformation that has that property.
        
           | eigenspace wrote:
           | This isn't really as valid a comparison as you might think it
            | is. The results of operations varying is not the problem
            | with 'fast-math'; the problem is that it can negatively
            | impact accuracy in catastrophic ways (among other things).
           | 
           | Sure, automatic FMA can change the result, but to my
           | knowledge it always gives a _more_ accurate result, not a
           | less accurate one, and the way in which the results may
           | differ is bounded.
        
       | raymondh wrote:
       | This is a rockstar quality post. It is astonishing how much
       | detective work was involved.
        
       | stabbles wrote:
       | See also https://simonbyrne.github.io/notes/fastmath/ for a
       | similar story in Julia, where -ffast-math is now banned for
       | C/C++/Fortran dependencies.
        
       | jesse__ wrote:
       | 10/10 yak shave. Would certainly read again
        
       | bee_rider wrote:
       | A decorator is a nice idea for this.
       | 
       | I was going to suggest another package that just resets the MXCSR
       | when imported, but I guess... hypothetically... some function
       | might actually want the FTZ behavior.
        
         | jcranmer wrote:
         | If you want that behavior, you should explicitly enable
         | it/disable it at the borders of the region where you want that
         | behavior, rather than screwing over everybody for your own
         | benefit.
        
       | jcranmer wrote:
       | The problem here is that enabling FTZ/DAZ flags involves
       | modifying global (technically thread-local) state that is
       | relatively expensive to do. Ideally, you'd want to twiddle these
       | flags only for code that wants to work in this mode, but given
       | the relative expense of this operation, it's not entirely
       | practicable to auto-add twiddling to every function call, and
       | doing it manually is somewhat challenging because compilers tend
       | to support accessing the floating-point status rather poorly.
       | Also, FTZ/DAZ aren't IEEE 754, so there's no portable function
       | for twiddling these bits as there is for the other rounding-mode
       | and exception controls. I will note that icc's -fp-model=fast
       | and MSVC's /fp:fast correctly do not link in crtfastmath.
       | 
       | As a side note, this kind of thing is why I think a good title
       | for a fast-math writeup would be "Fast math, or how I learned
       | to start worrying and hate floating point."
        
         | [deleted]
        
         | titzer wrote:
         | I don't think flipping these flags is expensive. Can you
         | provide a source for that? AFAICT modern microarchitectures are
         | going to register-rename that into the u-ops issued to the
         | functional units, rather than flush the entire ROB.
        
       | mrtesthah wrote:
       | I thought the purpose of Python was to make development simple
       | and predictable. Needing to track down the compilation and linker
       | flags of every single shared library reveals the fallacy of this
       | abstraction.
        
         | RodgerTheGreat wrote:
         | If a language wishes to reap the rewards of a pre-existing
         | ecosystem, it must pay for the warts and misfeatures of that
         | ecosystem. Python is deeply dependent on C libraries to achieve
         | acceptable performance, and this is the price.
        
       | magicalhippo wrote:
       | Denormalized numbers are one reason why you really want to think
       | carefully before you try to optimize code by rewriting
       | expressions involving multiplication and division.
       | 
       | For example, if you got "x = (a / b) * (c / d)" one might think
       | that rewriting it as "x = (a * c) / (b * d)" will save you a
       | division and gain you speed. It will and it might, respectively.
       | 
       | However, it will also potentially break an otherwise safe
       | operation. If the numbers are _very_ small, but still normal,
       | then the product (b * d) might result in a denormalized number,
       | and dividing by it can result in +/- infinity.
       | 
       | However, the code might guarantee that the ratios (a / b) and (c
       | / d) are not too small or too large, so that multiplying them is
       | guaranteed to lead to a useful result.
        
         | bee_rider wrote:
         | Anyway, since there aren't any dependencies between a, b, c,
         | and d, I would expect the two divisions to end up basically in
         | parallel in the pipeline. So the critical path is a division
         | and a multiplication either way. Of course that is just a
         | guess.
        
       | garaetjjte wrote:
       | > it turns out that when you use -Ofast, -fno-fast-math does not,
       | in fact, disable fast math. lol. lmao.
       | 
       | What about -fno-unsafe-math-optimizations?
        
         | moyix wrote:
         | Nope, it still links in crtfastmath:
         | 
         |     $ gcc -Ofast -fno-unsafe-math-optimizations -fpic -shared foo.c -o foo.so
         |     $ objdump -j .text --disassemble=set_fast_math foo.so
         | 
         |     foo.so:     file format elf64-x86-64
         | 
         |     Disassembly of section .text:
         | 
         |     0000000000001040 <set_fast_math>:
         |       1040: f3 0f 1e fa           endbr64
         |       1044: 0f ae 5c 24 fc        stmxcsr -0x4(%rsp)
         |       1049: 81 4c 24 fc 40 80 00  orl    $0x8040,-0x4(%rsp)
         |       1050: 00
         |       1051: 0f ae 54 24 fc        ldmxcsr -0x4(%rsp)
         |       1056: c3                    retq
        
           | Night_Thastus wrote:
           | Ouch. Two flags that should reasonably stop this, and
           | neither does. This feels a bit like the time I was told
           | "No, -Wall does not in fact add all warnings".
        
             | speeder wrote:
             | Wait, it doesn't? O.o
        
               | moyix wrote:
               | Nope. clang has "-Weverything", and gcc has "-Wextra",
               | both of which go beyond "-Wall".
               | 
               | https://stackoverflow.com/questions/11714827/how-can-i-
               | turn-...
        
         | klysm wrote:
         | Pain. This is so scuffed
        
       ___________________________________________________________________
       (page generated 2022-09-06 23:00 UTC)