[HN Gopher] Choosing the Right Integers
       ___________________________________________________________________
        
       Choosing the Right Integers
        
       Author : xmhidalgo
       Score  : 53 points
       Date   : 2022-04-17 15:50 UTC (7 hours ago)
        
 (HTM) web link (www.thecodedmessage.com)
 (TXT) w3m dump (www.thecodedmessage.com)
        
       | jandrewrogers wrote:
       | I am generally in the camp of maximally explicit and consistent
       | semantics for integer types. The old C-style types (short, int,
        | etc) and related promotion rules were the source of numerous bugs
       | in practice, particularly when porting code across architectures
       | with different type models. While not convenient in C, in C++
       | there is almost unlimited ability to define new families of
       | integer types that follow their own rules.
       | 
       | I don't follow the argument for using signed integers outside of
       | contexts where a negative value is valid. In code bases that
       | always use unsigned integers for values that cannot be negative
       | (common in C++), the MAX value is treated as a sentinel. This
       | extends the range of valid values versus signed types, which is
       | sometimes useful, and there will never be undefined behavior
       | (decrementing zero generates the sentinel, for example).
       | 
       | The infinite loop example often given for why to use signed
       | integers doesn't happen in practice for "default unsigned" code
        | bases because that is not an idiomatic expression of that case --
       | and this case will come up even in code bases that are default
       | signed. Writing loops like this is often a code smell.
        
         | andrepd wrote:
         | How will there never be undefined behaviour?
        
           | jandrewrogers wrote:
           | I meant in the sense that unsigned integers have C-specified
           | defined behaviors for conditions like overflow that are
            | undefined for signed integers. That said, if you bother to
            | create these integer types (e.g. the common MAX sentinel
           | idiom), their behavior is often fully defined. If you go
           | crazy with metaprogramming, type inference, etc in recent
           | C++, you can often verify at compile-time that these
           | conditions will not occur across arithmetic operations,
           | though this is a lot less trivial.
        
         | Jensson wrote:
         | > I am generally in the camp of maximally explicit and
         | consistent semantics for integer types
         | 
          | Size 3 minus size 4 being 4 billion is not correct type
          | semantics though. If unsigned integers crashed the program on
          | those operations, or returned something akin to NaN, that
          | would be correct. Signed integers are therefore more correct
          | for sizes: when a signed size goes negative and you try to
          | allocate a negative-length array, the allocation fails and
          | the program crashes, which is what you want. Do the same
          | thing with unsigned integers and you get a very small array
          | instead, and when you write values into that array you write
          | to arbitrary places in memory, causing memory corruption.
          | That is much, much worse than any issue you get with signed
          | integers.
          | 
          | So the reason to use signed everywhere is that it reduces
          | mental overhead and is safer: you only need to learn a
          | single integer math system, and you don't have to worry
          | about all the issues that come with casting unsigned and
          | signed values to each other. Unsigned only makes sense if
          | you explicitly want the wraparound arithmetic of unsigned,
          | and you don't want wraparound math for array sizes, so they
          | should be signed in any sane type system.
        
           | jandrewrogers wrote:
           | A negative size is generally not a valid type instance
           | regardless. In well-behaved code I would expect this to be an
           | unreachable condition because you are explicitly checking for
           | sensible inputs _before_ doing this type of arithmetic. It
           | makes no sense to check for sensible inputs _after_ you do
           | the arithmetic, that is the same amount of code done in the
           | wrong order. It is not safer to use signed in this case nor
           | does it decrease mental overhead.
           | 
           | Assuming an integer is signed can be much _less_ safe in
           | practice because there is often no guarantee that an integer
           | type is always signed or unsigned in a given context.
            | Typedefs are a ubiquitous thing, and you don't always
           | control those definitions. In code designed to be robust and
           | maintainable, it is usually idiomatic to always implement
           | these things in a way that is correct regardless of signed-
           | ness unless that signed-ness is an immutable contract (e.g.
           | some syscall stuff), because whether or not a type is signed
           | can sometimes change unexpectedly.
        
             | MauranKilom wrote:
             | > In well-behaved code I would expect this to be an
             | unreachable condition because you are explicitly checking
             | for sensible inputs before doing this type of arithmetic
             | 
             | You previously claimed "there will never be undefined
             | behavior [by using unsigned types]", but now you also
             | require that the UB scenario is checked beforehand? I'm
             | confused how you conclude unsigned types to be safer than
             | signed ones ("no UB") when you require UB be ruled out in
             | the first place.
             | 
             | Unless you explicitly intend for wraparound to happen, your
             | program has a bug on overflow regardless of signedness. You
             | can (and possibly should) guard against the situation, but
             | that favors neither side.
             | 
             | > Assuming an integer is signed can be much less safe in
             | practice because there is often no guarantee that an
             | integer type is always signed or unsigned in a given
             | context. [...]
             | 
             | That is not an argument for or against using signed types.
             | If types change from under you, you have a host of other
             | issues anyway (as you say). Mixed sign comparisons,
             | unexpected promotions, overflows... You get the exact same
             | issues whether you yourself favor signed or unsigned types
             | when mixing them with (what are, in your description)
             | effectively unknown types. So again, your argument against
             | signed types appears to be "you have to be careful either
             | way", which gives zero reasons to favor unsigned over
             | signed types.
        
         | xscott wrote:
         | > [unsigned] extends the range of valid values versus signed
         | types, which is sometimes useful
         | 
         | When is it useful?
         | 
         | On a 32 bit platform, for file offsets, maybe that last bit was
         | useful for a few years. But as soon as disks got bigger than
         | 4GB, you needed to switch to a 64 bit `off_t` and `lseek`
         | anyways.
         | 
         | For array indexes, on a 32 bit platform, there is really only
         | one VERY CONTRIVED case where that last bit for `unsigned`
         | matters: your program creates a single array of bytes greater
         | than 2GB. As soon as it's an array of 2-byte shorts or anything
         | larger, you never need that last bit. And the configurations
         | where a 32 bit kernel lets a program use more than 2GB of
         | address space were not super common - it was better to switch
         | to a 64 bit platform at that point.
         | 
         | On 64 bit platforms, you never need that last bit for array
         | indexes or file offsets because no system has 10,000 petabytes
         | of memory or 10,000 petabyte file sizes. And they won't for a
         | while. Unless clock speeds get back on Moore's law, that for
         | loop is going to take a long time to run anyways...
         | 
         | > Writing loops like this is often a code smell.
         | 
         | One person's code smell is another person's daily idiom. I
         | think signed array indexes are like complex numbers - people
         | who don't need them can't imagine why anyone else does.
         | (https://github.com/golang/go/issues/19921)
         | 
         | For me, I frequently do math on the array index. Reverse loops
         | are necessary sometimes (I can give examples), but let's ignore
         | that for now. How would you write this simple loop with
          | unsigned loop indices?
          | 
          |     double scale = <non-integer value>;
          |     for (ssize_t ii = 0; ii < len; ii++) {
          |         data[ii] = sinc(scale*(ii - len/2));
          |     }
         | 
         | Note that `len/2` using integer division (truncating) is doing
         | the right thing for both even and odd lengths. I really want
         | that `sinc` function centered on a specific sample.
         | 
         | If either `ii` or `len` is unsigned, C/C++ quietly does
         | something awful here. Stuff like this makes me resent the STL
         | for choosing size_t over ssize_t.
         | 
          | I think it's ugly in Rust too:
          | 
          |     let len = data.len(); // usize because that's the way
          |     for ii in 0..len {
          |         data[ii] = sinc(scale*(ii as isize - len as isize / 2) as f64);
          |     }
         | 
         | Thankfully the `as` operator binds pretty tightly.
         | 
         | I don't think there are any good reasons to use unsigned
         | integers for array indexes or sizes. It doesn't help with bound
         | checking (it's the same assembly to check for unsigned out of
         | bounds as it is to check for signed out of bounds or less than
         | zero). It doesn't help with "large" arrays (except in one very
         | contrived case). And it gets in the way when you need to do
         | math on the indexes.
        
           | tialaramex wrote:
           | (In Rust) the ugliness for me stems from the fact we got this
           | array from somewhere and now we're overwriting it, and I
           | start to think probably the code should actually make the
           | data structure instead, whereupon the need for these to be
           | indices goes away and it's some sort of iterator into collect
           | ?
        
       | reikonomusha wrote:
       | I feel like, ideally, one should never have to face the conundrum
       | of choosing an integer width for general-purpose use.
       | 
       | I _really_ like that the  "normal" integers in Common Lisp are
       | arbitrary length, but they're represented automatically as
       | register-sized quantities when small enough (which, for typical
       | code, is most of the time). That way, at least in terms of space
       | efficiency, and for the most part in terms of time efficiency,
       | you can have your cake and eat it too.
       | 
       | If I really need performance, I can always declare that a
        | variable will never contain a non-machine-sized integer:
        | 
        |     (defun f (x)
        |       (declare (type (unsigned-byte 64) x))
        |       ...)
       | 
       | This is me promising to Lisp that x will always be bound to an
       | unsigned 64-bit integer. If I dial the optimization settings with
       | high speed and low safety (which I can scope to just that file,
        | or even just that function):
        | 
        |     (declaim (optimize (speed 3) (safety 0)))
       | 
       | then Lisp will eliminate bounds checks, bignum-promotions, etc.
       | and produce the native assembly you expect.
       | 
       | My point is that I really appreciate the "safe and general by
       | default" philosophy, while still being able to opt-in to
       | practical efficiency considerations when I absolutely need it.
       | 
       | (If I'm pedantic: The above optimization behavior, while
       | encouraged in various ways by the Common Lisp standard, is
       | technically not _de jure_ required. But any general purpose Lisp
       | compiler from the past 30 years will do these things, like SBCL.)
        
         | cpeterso wrote:
         | Here's Rob Pike's proposal to change Go's int to be arbitrary
         | precision in Go 2.0. But after five years of debate about
         | "performance", the proposal has stalled.
         | 
         | https://github.com/golang/go/issues/19623
        
       | lisper wrote:
       | This should really have been called, "Choosing the right integer
       | types in C/C++" rather than "Choosing the right integers".
        
       | thesneakin wrote:
       | Been thinking the same thing for just as long. "int" was supposed
       | to be a full native word. Nowadays it's best to never use "int".
        
         | cpeterso wrote:
         | Google's C++ coding style recommends using bare "int" by
         | default or "int64_t" if the number might be "big". It also
         | recommends using signed integers over unsigned.
         | 
         | https://google.github.io/styleguide/cppguide.html#Integer_Ty...
        
       | Animats wrote:
       | Too many years ago, I wrote something in-house, now lost, titled
       | "Type integer considered harmful". I was arguing for the idea
       | that everything should have an explicit range, written
        | 
        |     foo: 0..100;
       | 
       | This was around the time Ada was being designed. The way I wanted
       | this to work is that the size of intermediate variables was
       | determined by the compiler, with the following rules:
       | 
       | - Unless a named typed variable result will overflow, no unnamed
       | intermediate variable can overflow. The compiler must size
       | intermediate temporary values accordingly.
       | 
       | - If this requires a larger intermediate variable than the
       | machine can provide, a compile error is reported.
       | 
       | The implication is that you often need longer integers than you'd
        | expect. Expressions such as
        | 
        |     x = a + b - c
        |     x = (a + b) / c
       | 
       | with signed values, all, say, 32-bit integers, can overflow in
       | a+b without overflowing x. So such expressions have to be
       | computed in 64 bits, then checked for overflow. This eliminates
       | hidden overflow in intermediates. An expression with only one
       | operator never overflows in a recoverable way, so it just has to
       | be checked, not expanded.
       | 
       | That was written in an era when there were computers in use with
       | word sizes of 8, 12, 16, 18, 24, 32, 36, 48, 60, and 64 bits. So
       | word size was more of a practical issue than it is now, when non
       | power of 2 machines have died off. Also, machines of that era
       | tended to be slower on longer values. Much slower on some
       | machines which had to do double-length arithmetic in software.
       | Thus, there was a performance hit for this approach, which was
       | the main objection.
        
         | [deleted]
        
         | Someone wrote:
         | > An expression with only one operator never overflows in a
         | recoverable way, so it just has to be checked, not expanded
         | 
         | (Nitpick: I think you intended to write either _always_ or
         | _unrecoverable_ )
         | 
         | _If the target CPU doesn't have status flags indicating
         | overflow_, that may be more work than using a larger
          | intermediate. For _a+b_ , detecting overflow after the fact
          | is very likely faster, but is already a bit convoluted if _a_
          | and _b_ are signed and, thus, can be negative
          | (https://stackoverflow.com/a/45261894)
          | 
          | For _a×b_ , https://stackoverflow.com/questions/1815367/catch-
          | and-comput... has a simple algorithm, but it requires a
          | division by _b_. I don't see that performing well. Possibly,
         | double size multiplication and then testing for overflow beats
         | that (but, if the CPU has a 'count leading zeroes' instruction,
         | I would use that (https://stackoverflow.com/a/59109502. I
         | haven't checked that for correctness, but even if it can't be
         | made correct, it likely can provide a fast path for most
         | multiplications that your program does)
         | 
         | For CPUs with an overflow bit, the compiler can use it, but not
         | all CPUs have that.
        
         | tialaramex wrote:
          | WUFFS calls that type refinement, although you are free to
         | refine much larger types e.g. you can say base.u32[0 ..= 15]
         | and this is a 32-bit unsigned integer... that can't be more
         | than 15.
         | 
          | WUFFS not only refuses overflow (you get a compiler
          | diagnostic); it also infers these refinements from array
          | sizing so as to (at
         | compile time) ensure bounds errors never occur. If you try to
         | index into an array of 426 integers with k, WUFFS will go ahead
         | and refine k to the 0..425 range.
         | 
         | Of course Ada (and presumably your proposal) are for general
         | purpose software, whereas WUFFS is specifically targeted at
         | software for well, Wrangling Untrusted File Formats Safely. So
         | it's fine for some things to be completely impossible in WUFFS
         | (e.g. you cannot write "Hello, World" in WUFFS, because WUFFS
         | intentionally has no idea what a "string" is, or what "console
         | output" is).
        
       ___________________________________________________________________
       (page generated 2022-04-17 23:00 UTC)