[HN Gopher] Floating point visually explained (2017)
       ___________________________________________________________________
        
       Floating point visually explained (2017)
        
       Author : luisha
       Score  : 228 points
       Date   : 2021-11-28 12:49 UTC (10 hours ago)
        
 (HTM) web link (fabiensanglard.net)
 (TXT) w3m dump (fabiensanglard.net)
        
       | mgaunard wrote:
       | I don't understand how M * 2^E (modulo small representation
       | details) is difficult to grasp. Then you have decimal types as M
       | * 10^E.
       | 
       | It's certainly much clearer than the windowing and bucketing in
       | this article.
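        | 
        | For what it's worth, Python exposes that M * 2^E decomposition
        | directly (a small sketch; math.frexp normalizes M into
        | [0.5, 1)):
        | 
        |     import math
        | 
        |     m, e = math.frexp(6.0)  # 6.0 == m * 2**e, with 0.5 <= m < 1
        |     print(m, e)             # 0.75 3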
        
         | ribit wrote:
         | I think windowing and bucketing are a useful conceptual model
         | for understanding the precision of the encoding space. I
          | understand how FP values work in practice, but this article
          | really hits the nail on the head when explaining the less
          | tangible aspects of FP encoding.
        
         | cturner wrote:
          | His approach works well for me. I don't retain arbitrary facts.
          | At least part of this is a fear that if I don't properly
          | understand its dynamics I will misapply it; better to discard
          | it.
         | 
         | The windowing explanation shows me a path from the intent of
         | the designer through to the implementation (the algorithm). Now
         | I can retain that knowledge.
        
           | marcosdumay wrote:
           | > I don't retain arbitrary facts.
           | 
           | I still fail to understand what is arbitrary in scientific
           | notation. The designers almost certainly didn't think in
           | terms of windows when creating the data types, they probably
           | thought about the mantissa and exponent the way it's usually
           | explained, and maybe about information density (entropy per
           | bit) when comparing with other possibilities.
           | 
            | Anyway, if it helps, great. Different explanations are always
            | welcome. But this one is post-hoc.
        
           | wruza wrote:
           | Ahem, exponential forms go back to Archimedes, and no design
           | (or intent thereof) assumed windows or offsets in FP. It's
            | just a fixed-precision floating-point notation, no more, no
           | less. The problem persists with decimal tax calculations like
           | $11111.11 x 12% == $1333.33|32, where 0.0032 is lost in a
           | monetary form. It also persists with on-paper calculations
           | because there may be no room for all the digits that your
           | algorithm produces (think x/7) and you have to choose your
           | final acceptable precision.
           | 
            |  _At least part of this is a fear that if I don't properly
            | understand its dynamics I will misapply it_
           | 
            | The explanation that you like can't save you from it either.
            | It's not something you think through once and then just write
            | the correct code. Quick exercise with the "new" knowledge
            | you've got: you have an array of a million floats; add them
            | up correctly.
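            | 
            | One standard answer to that exercise is Kahan's compensated
            | summation. A minimal Python sketch (not part of the original
            | comment):
            | 
            |     def kahan_sum(xs):
            |         total = 0.0
            |         c = 0.0                   # compensation for lost low bits
            |         for x in xs:
            |             y = x - c             # apply correction from last step
            |             t = total + y         # low bits of y may be lost here...
            |             c = (t - total) - y   # ...recover them algebraically
            |             total = t
            |         return total
            | 
            |     print(sum([0.1] * 1_000_000))        # 100000.00000133288
            |     print(kahan_sum([0.1] * 1_000_000))  # 100000.0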
        
       | ryan-duve wrote:
       | As someone who never learned the "dreadful notation...
       | discouraging legions of programmers", this was really helpful!
       | The "canonical" equation doesn't mean anything to me and the
       | window/offset explanation does.
       | 
       | I'm sure the author intended to keep the article short, but I
       | think it would benefit from more examples, including a number
        | like 5e-5 or something. It isn't clear to me how the
        | window/offset model explains that.
       | 
       | Edit: to clarify, I do not understand how the floating point
       | representation of 0.00005 is interpreted with windows/offsets.
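        | 
        | For reference, the window and offset of 0.00005 can be read off
        | directly (a Python sketch assuming IEEE-754 double precision;
        | not part of the original comment):
        | 
        |     import struct
        | 
        |     bits = struct.unpack('>Q', struct.pack('>d', 0.00005))[0]
        |     exponent = (bits >> 52) & 0x7FF    # biased exponent field
        |     fraction = bits & ((1 << 52) - 1)  # mantissa fraction bits
        | 
        |     window = 2.0 ** (exponent - 1023)  # window is [2^e, 2^(e+1))
        |     offset = fraction / 2.0 ** 52      # position inside the window
        |     print(window)                 # 3.0517578125e-05, i.e. 2^-15
        |     print(window * (1 + offset))  # 5e-05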
        
         | ricardobeat wrote:
         | 5e-5 is not really related to FP, it's just scientific
         | notation. The number itself is still stored as shown in the
         | post, it's just being printed in a more compact form.
         | 
          | The number after _e_ is a power of 10:
          | 
          |     5e2  = 5 * 10^2  = 5 * 100    = 500
          |     5e-5 = 5 * 10^-5 = 5 / 100000 = 0.00005
          | 
          | Once you internalize this, you just read the exponent as "x
          | zeroes to the left".
        
           | boibombeiro wrote:
            | Scientific notation IS a floating-point representation.
        
           | taeric wrote:
            | Scientific notation is easily seen as just floating point in
            | base ten, though. It has the same rules about the leading
            | digit and such. Add in significant digits, and you have
            | almost all of the same concerns.
        
       | TT-392 wrote:
        | It probably differs a lot per person, and I usually do find
        | graphical explanations of stuff easier to grasp. But that formula
        | was way easier to understand for me than the other explanation.
        | Maybe it is related to the fact that I am pretty OK at math, but
        | I have kinda bad dyslexia.
        
       | dang wrote:
       | Some past related threads:
       | 
       |  _Floating Point Visually Explained (2017)_ -
       | https://news.ycombinator.com/item?id=23081924 - May 2020 (96
       | comments)
       | 
       |  _Floating Point Visually Explained (2017)_ -
       | https://news.ycombinator.com/item?id=19084773 - Feb 2019 (17
       | comments)
       | 
       |  _Floating Point Visually Explained_ -
       | https://news.ycombinator.com/item?id=15359574 - Sept 2017 (106
       | comments)
        
       | vlmutolo wrote:
       | The "default" formula as presented in the article seems...
        | strange. Is this really how it's normally taught?
        | 
        |     (-1)^S * 1.M * 2^(E - 127)
       | 
        | This seems unnecessarily confusing. And 1.M isn't notation I've
        | seen before. If we expand 1.M into 1 + M with 0 < M < 1, then we
        | pretty quickly arrive at the author's construction of "windows"
        | and "offsets".
        | 
        |     (-1)^S * (1 + M) * 2^(E - 127)
        |     (-1)^S * 2^log_2(1 + M) * 2^(E - 127)
        | 
        |     let F := log_2(1 + M)
        |     0 < F < 1
        | 
        |     Note: [0 < M < 1] implies [1 < M + 1 < 2]
        |           implies [0 < log_2(1+M) < 1]
        | 
        |     (-1)^S * 2^(E - 127 + F)
       | 
       | Since we know F is between 0 and 1, we can see that F controls
       | where the number lands between 2^(E - 127) and 2^(E - 127 + 1)
       | (ignoring the sign). It's the "offset".
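        | 
        | A quick numeric check of that identity (a Python sketch, not
        | from the original comment):
        | 
        |     import math
        | 
        |     x = 6.0                  # 6 = 1.5 * 2^2, so E - 127 = 2
        |     m, e = math.frexp(x)     # x = m * 2**e with 0.5 <= m < 1
        |     F = math.log2(2 * m)     # the "offset" exponent, 0 <= F < 1
        |     print(2 ** (e - 1 + F))  # ~6.0, up to rounding in log2/pow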
        
       | dirkk0 wrote:
        | I also found this video from SimonDev educational (and
        | entertaining): https://www.youtube.com/watch?v=Oo89kOv9pVk
        
       | baryphonic wrote:
       | I found a tool[0] that helps me debug potential floating point
       | issues when they arise. This one has modes for half-, single- and
       | double-precision IEEE-754 floats, so I can deal with the various
       | issues when converting between 64-bit and 32-bit floats.
       | 
       | [0] https://news.ycombinator.com/item?id=29370883
        
       | ToddWBurgess wrote:
        | Seeing the Intel math coprocessor was a blast from the past. I
        | sold a few of them and installed a few of them when I used to
        | work in computer stores.
        
       | bool3max wrote:
       | This is _the_ article that finally helped me grasp and understand
       | IEEE-754. An excellent read.
        
       | defaultname wrote:
       | Several years ago I made a basic, interactive explanation of the
       | same-
       | 
       | https://dennisforbes.ca/articles/understanding-floating-poin...
       | 
        | At the time it was to educate a group I was working with about
        | why their concern that "every number in JavaScript is a double"
        | doesn't mean that 1 is actually 1.000000001 (everyone knows that
        | floating point numbers are potentially approximations, so there
        | is a widespread belief that every integer so represented must be
        | approximate as well).
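        | 
        | A quick demonstration of that point (Python floats are IEEE-754
        | doubles; not part of the linked article):
        | 
        |     print(1.0 == 1)                      # True: 1 is stored exactly
        |     print(2.0 ** 53 == 2 ** 53)          # True: still exact
        |     print(2.0 ** 53 + 1 == 2 ** 53 + 1)  # False: 2**53 + 1 is the
        |                                          # first unrepresentable int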
        
       | kuharich wrote:
       | Past comments: https://news.ycombinator.com/item?id=15359574,
       | https://news.ycombinator.com/item?id=19084773
        
       | Sirenos wrote:
       | Why is everyone complaining about people finding floats hard?
       | Sure, scientific notation is easy to grasp, but you can't
       | honestly tell me that it's trivial AFTER you consider rounding
       | modes, subnormals, etc. Maybe if hardware had infinite precision
       | like the real numbers nobody would be complaining ;)
       | 
       | One thing I dislike in discussions about floats is this incessant
       | focus on the binary representation. The representation is NOT the
       | essence of the number. Sure it matters if you are a hardware guy,
       | or need to work with serialized floats, or some NaN-boxing
       | trickery, but you can get a perfectly good understanding of
       | binary floats by playing around with the key parameters of a
       | floating point number system:
       | 
       | - precision = how many bits you have available
       | 
       | - exponent range = lower/upper bounds for exponent
       | 
       | - radix = 2 for binary floats
       | 
       | Consider listing out all possible floats given precision=3,
       | exponents from -1 to 1, radix=2. See what happens when you have a
       | real number that needs more than 3 bits of precision. What is the
       | maximum rounding error using different rounding strategies? Then
        | move on to subnormals and see how they open a can of worms in
        | underflow scenarios that you don't see in integer arithmetic.
        | For anyone interested in a short book covering all
       | this, I would recommend "Numerical Computing with IEEE Floating
       | Point Arithmetic" by Overton [1].
       | 
       | [1]: https://cs.nyu.edu/~overton/book/index.html
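        | 
        | The listing exercise is small enough to run in a few lines of
        | Python (a sketch covering the normalized values only, so no
        | subnormals yet):
        | 
        |     # radix 2, precision 3 (1 implicit bit + 2 fraction bits),
        |     # exponent range -1..1
        |     values = sorted((1 + f / 4) * 2.0 ** e
        |                     for f in range(4)       # fraction bits 00..11
        |                     for e in range(-1, 2))  # exponents -1, 0, 1
        |     print(values)
        |     # [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75,
        |     #  2.0, 2.5, 3.0, 3.5]
        |     # The gap between neighbors doubles with each new window.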
        
         | xyzzy_plugh wrote:
         | The binary form is important to understand the implementation
          | details. You even mention underflow. It's difficult for most
          | people to initially understand why a float can't accurately
          | store every large number that an integer of the same size can
          | represent.
         | 
         | The binary form handily demonstrates the limitations.
         | Understanding the floating point instructions is kinda optional
         | but still valuable.
         | 
         | Otherwise everyone should just use varint-encoded arbitrary
         | precision numbers.
        
           | colejohnson66 wrote:
           | It also explains why 0.1+0.2 is _not_ 0.3. With binary
            | IEEE-754 floats, none of those can be represented exactly[a].
            | With decimal IEEE-754 floats, it's possible, but the majority
            | of hardware people interact with works on binary floats.
            | 
            | [a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but
            | it's not possible to express it in binary _exactly_; only
            | after rounding. 0.5, however, _is_ exactly representable.
        
             | jacobolus wrote:
              | Python 3.9.5
              | 
              |     >>> 0.1.hex()
              |     '0x1.999999999999ap-4'
              |     >>> 0.2.hex()
              |     '0x1.999999999999ap-3'
              |     >>> (0.1 + 0.2).hex()
              |     '0x1.3333333333334p-2'
              |     >>> 0.3.hex()
              |     '0x1.3333333333333p-2'
        
               | colejohnson66 wrote:
               | But they are repeating. So, by definition, they are not
               | exactly representable in a (binary) floating point
               | system. Again, that's why 0.1 + 0.2 is not 0.3 in binary
               | floating point.
        
           | jhgb wrote:
           | > It's difficult for most people to initially understand why
           | you can't store a large number that can be represented by an
           | equivalent size integer as a float accurately.
           | 
           | Because you don't have all the digits available just for the
           | mantissa? That seems quite intuitive to me, even if you don't
           | know about the corner cases of FP. This isn't one of them.
        
             | bee_rider wrote:
             | I was going to respond with something like this. I think
             | for getting a general flavor*, just talking about
             | scientific notation with rounding to a particular number of
             | digits at every step is fine.
             | 
              | I guess the one thing that explicitly looking at the bits
              | does bring to the table is the understanding that (of
              | course) the number of bits or digits in the mantissa must
              | be less than the number of bits or digits in an integer of
              | equivalent length. This is pretty obvious if you think
              | about the fact that they are the same size in memory, but
              | if we're talking about sizes in memory, then we're talking
              | about bits implicitly, so we may as well make that
              | explicit.
             | 
             | * actually, much more than just getting a general flavor,
             | most of the time in numerical linear algebra stuff the
             | actual bitwise representation is usually irrelevant, so you
             | can get pretty far without thinking about bits.
        
         | GuB-42 wrote:
         | I think the binary representation is the essence of floating
         | point numbers, and if you go beyond the "sometimes, the result
         | is slightly wrong" stage, you have to understand it.
         | 
          | And so far the explanation in the article is the best I've
          | found, not least because subnormal numbers appear naturally.
          | 
          | There is a mathematical foundation behind it, of course, but it
          | is not easy for a programmer like me. I think it is better to
          | think in terms of bits and the integers they make, because
          | that's what the computer sees. And going this way, you get NaN-
          | boxing and serialization as a bonus.
          | 
          | Now, I tend to be most comfortable with a "machine first",
          | bottom-up, low-level approach to problems. Mathematical and
          | architectural concepts are fine and all, but unless I have some
          | idea of what it looks like in memory and the kind of
          | instructions being run, I tend to feel lost. Some people may be
          | more comfortable with high-level reasoning; we don't all have
          | the same approach. That's what I call real diversity, and it is
          | a good thing.
        
           | Sirenos wrote:
           | Sorry, I didn't mean to downplay the value of using concrete
           | examples. I absolutely agree that everyone learns better from
           | concrete settings, which is why my original comment fixed the
           | parameters for people to play with. I was referring more to
           | the discussions of how exponents are stored biased, the
           | leading bit in the mantissa is implied = 1 (except for
           | subnormals), and so on. All these are distracting features
           | that can (and should) be covered once the reader has a strong
           | intuition of the more fundamental aspects.
        
       | sharikous wrote:
        | As one who understands floats, I really wish there were a better
        | notation for literals; the best would be a float literal in
        | binary representation.
        | 
        | For integers you can write 0x0B, 11, or 0b1011 and have a very
        | precise representation.
        | 
        | For floats you write 1e-1 or 0.1 and you get an ugly truncation.
        | If it were possible to write something like 1eb-1 (for 0.5) and
        | 1eb-2 (for 0.25), people would be incentivized to use nice
        | negative powers of 2, which are much less error-prone than ugly
        | base conversions.
        | 
        | This way you could overcome the fears around floats being
        | inexact and start writing more accurate (bit-for-bit) tests in
        | many cases.
        
         | jacobolus wrote:
         | > _better notation for literals [...] something like 1eb-1 (for
         | 0.5) and 1eb-2 (for 0.25)_
         | 
         | There are floating point hex literals. These can be written as
         | 0x1p-1 == 0.5 and 0x1p-2 == 0.25.
         | 
         | You can use them in C/C++, Java, Julia, Swift, ..., but they
         | are not supported everywhere.
         | 
         | https://observablehq.com/@jrus/hexfloat
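          | 
          | Python has no hex float literal syntax, but the same values
          | are reachable through the built-in conversion methods:
          | 
          |     >>> float.fromhex('0x1p-1')
          |     0.5
          |     >>> (0.25).hex()
          |     '0x1.0000000000000p-2'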
        
           | Aardwolf wrote:
           | C++ hex floats are an interesting combination of 3 numeric
           | bases in one!
           | 
           | the mantissa is written in base 16
           | 
           | the exponent is written in base 10
           | 
            | the exponent itself is a power of 2 (not of 16 or 10), so
            | that's base 2
            | 
            | One can only wonder how that came to be. I think they chose
            | base 10 for the exponent to allow using the 'f' suffix to
            | denote float (as opposed to double).
        
           | adgjlsfhk1 wrote:
            | Julia is missing 32-bit and 16-bit hex float literals,
            | unfortunately.
        
             | simonbyrne wrote:
              | You can just wrap the literal in a conversion function,
              | e.g. Float32(0x1p52), which should get constant-propagated
              | at compile time.
        
               | adgjlsfhk1 wrote:
               | I know, it just isn't as nice to read.
        
         | stncls wrote:
         | I'm not sure this is what you are asking for, but would this be
         | suitable?
         | 
         | > ISO C99 and ISO C++17 support floating-point numbers written
         | not only in the usual decimal notation, such as 1.55e1, but
         | also numbers such as 0x1.fp3 written in hexadecimal format.
         | [...] The exponent is a decimal number that indicates the power
         | of 2 by which the significant part is multiplied. Thus '0x1.f'
         | is 1 15/16, 'p3' multiplies it by 8, and the value of 0x1.fp3
         | is the same as 1.55e1.
         | 
         | https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html
        
         | wruza wrote:
         | Are you asking for pow(2, n)? Or for `float mkfloat(int e, int
         | m)` which literally implements the given formula? I doubt that
         | the notation you're suggesting will be used in code, except for
         | really rare bitwise cases.
         | 
          | The initial precision doesn't really matter, because if you
          | plan to use this value in a computation, it will quickly
          | accumulate an error, which you have to deal with anyway. There
          | are three ways to deal with it: 1) ignore it, 2) account for
          | it, 3) use numeric methods which keep it in a decent range.
          | Sometimes (1) accidentally achieves (3), but the problem
          | doesn't go away in general.
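          | 
          | In Python terms that second helper is essentially math.ldexp,
          | which computes m * 2**e directly:
          | 
          |     >>> import math
          |     >>> math.ldexp(1, -1)   # 1 * 2**-1
          |     0.5
          |     >>> math.ldexp(1, -2)   # 1 * 2**-2
          |     0.25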
        
       | brianzelip wrote:
       | Wild, just got done listening to Coding Blocks podcast episode on
       | data structure primitives where they go in depth on floating
       | point, fixed point, binary floating point, and more. Great
       | listen! See around 54min mark -
       | https://www.codingblocks.net/podcast/data-structures-primiti...
        
       | bullen wrote:
       | Why is there no float type with linear precision?
        
         | adgjlsfhk1 wrote:
          | That's called fixed point. There isn't hardware for it because
          | it is cheap to implement in software using integer math.
        
           | djmips wrote:
           | Some DSP chips had hardware for fixed point. I think it's a
           | shame that C never added support for fixed point.
        
           | bullen wrote:
           | But wouldn't it be sweet to have SIMD acceleration for fixed
           | point?
           | 
           | Or is that something you can do with integers and SIMD today?
           | 
           | Say 4x4 matrix multiplication?
        
             | adgjlsfhk1 wrote:
              | Your add is just an integer add. Your multiply is a mul_hi,
              | 2 shifts, 1 add, and 1 mul.
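              | 
              | A minimal sketch of that arithmetic in Q16.16 fixed point
              | (Python ints stand in for 32-bit registers, so the single
              | big multiply below replaces the mul_hi/shift dance):
              | 
              |     ONE = 1 << 16                    # 1.0 in Q16.16
              | 
              |     def to_fix(x):   return int(round(x * ONE))
              |     def to_float(q): return q / ONE
              | 
              |     def fix_add(a, b): return a + b          # plain integer add
              |     def fix_mul(a, b): return (a * b) >> 16  # multiply, rescale
              | 
              |     a, b = to_fix(1.5), to_fix(2.25)
              |     print(to_float(fix_add(a, b)))  # 3.75
              |     print(to_float(fix_mul(a, b)))  # 3.375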
        
             | adwn wrote:
             | Having implemented a nested PID controller using fixed-
             | point arithmetic in an FPGA, I can tell you that fixed-
             | point is a pain in the ass, and you only use it when
             | floating-point is too slow, too large (in terms of on-chip
             | resources), or consumes too much power, or when you
             | _really_ need to control the precision going into and out
             | of an arithmetic operation.
        
             | mhh__ wrote:
             | Are you asking if integer SIMD exists? (It does)
        
       | jacksonkmarley wrote:
        | Has anyone got a good reference for the practical implications
        | of floating point vs fixed point in measurement calculations
        | (especially timing measurements)? Gotchas, rules of thumb, best
        | practices, etc.?
        
       | wodenokoto wrote:
        | I was kinda hoping for a visualization of which numbers exist in
        | floating point.
        | 
        | While I always knew about
        | 
        |     0.1 + 0.2 -> 0.30000000000000004
        | 
        | it was still kind of an epiphany realizing that floating point
        | numbers don't so much have rounding errors as they are simply
        | discrete numbers.
        | 
        | You can move from one float to the next, which is a meaningful
        | operation on discrete numbers like integers, but not on
        | continuous numbers like the rationals and irrationals.
        | 
        | I also feel like that is a much more important takeaway for a
        | user of floating point than knowing what the mantissa is.
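        | 
        | That stepping is directly observable (a Python 3.9+ sketch,
        | added for illustration):
        | 
        |     import math
        | 
        |     print(math.nextafter(1.0, math.inf))  # 1.0000000000000002
        |     print(math.ulp(1.0))                  # 2.220446049250313e-16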
        
         | rlanday wrote:
          | > You can move from one float to the next, which is a
          | meaningful operation on discrete numbers like integers, but
          | not on continuous numbers like the rationals and irrationals.
         | 
         | What do you mean by continuous? Obviously if you take a number
         | line and remove either the rational or irrational numbers, you
         | will end up with infinitely many holes.
         | 
         | The thing that makes floating point numbers unique is that, for
         | any given representation, there are actually only _finitely_
         | many. There's a largest possible value and a smallest possible
         | value and each number will have gaps on either side of it. I
         | think you meant that the rationals and irrationals are _dense_
         | (for any two distinct numbers, you can find another number
         | between them), which is also false for floating-point numbers.
        
         | bspammer wrote:
         | I highly recommend watching this video, which explains floats
         | in the context of Mario 64:
         | https://www.youtube.com/watch?v=9hdFG2GcNuA
         | 
         | Games are a great way to explain floats because they're so
          | visual. Specifically, check out 3:31, where the game has been
          | hacked so that the grid of possible float values in one of the
          | movement directions is quite coarse.
         | 
         | (If you're curious what a PU is, and would like some completely
         | useless knowledge, this video is also an absolute classic and
         | very entertaining: https://www.youtube.com/watch?v=kpk2tdsPh0A)
        
       | darkest_ruby wrote:
        | Having read this article, I understand floating point even less
        | now.
        
       ___________________________________________________________________
       (page generated 2021-11-28 23:00 UTC)