[HN Gopher] Choosing the Right Integers
___________________________________________________________________
 
 Choosing the Right Integers
 
 Author : xmhidalgo
 Score  : 53 points
 Date   : 2022-04-17 15:50 UTC (7 hours ago)
 
 (HTM) web link (www.thecodedmessage.com)
 (TXT) w3m dump (www.thecodedmessage.com)
 
| jandrewrogers wrote:
| I am generally in the camp of maximally explicit and consistent semantics for integer types. The old C-style types (short, int, etc.) and the related promotion rules were the source of numerous bugs in practice, particularly when porting code across architectures with different type models. While not convenient in C, in C++ there is an almost unlimited ability to define new families of integer types that follow their own rules.
| 
| I don't follow the argument for using signed integers outside of contexts where a negative value is valid. In code bases that always use unsigned integers for values that cannot be negative (common in C++), the MAX value is treated as a sentinel. This extends the range of valid values versus signed types, which is sometimes useful, and there will never be undefined behavior (decrementing zero generates the sentinel, for example).
| 
| The infinite-loop example often given for why to use signed integers doesn't happen in practice in "default unsigned" code bases, because that is not an idiomatic expression of that case -- and this case will come up even in code bases that are default signed. Writing loops like this is often a code smell.
| andrepd wrote:
| How will there never be undefined behaviour?
| jandrewrogers wrote:
| I meant in the sense that unsigned integers have C-specified defined behaviors for conditions like overflow that are undefined for signed integers. That said, if you bother to create these integer types (e.g. the common MAX sentinel idiom), their behavior is often fully defined.
| If you go crazy with metaprogramming, type inference, etc. in recent C++, you can often verify at compile time that these conditions will not occur across arithmetic operations, though this is a lot less trivial.
| Jensson wrote:
| > I am generally in the camp of maximally explicit and consistent semantics for integer types
| 
| Size 3 minus size 4 being 4 billion is not correct type semantics, though. If unsigned integers crashed the program for those operations, or returned something akin to NaN, that would be correct. This means signed integers are more correct for sizes: when signed arithmetic goes negative, you try to allocate a negative-length array, that doesn't work, and the program crashes -- which is what you want. When you do the same thing with unsigned integers you now get a very small array, and when you try to write values to this array you write them to arbitrary places in memory, causing memory corruption. That is much, much worse than any issue you get with signed integers.
| 
| So the reason to use signed everywhere is that it reduces mental overhead, and it is safer: you only need to learn a single integer math system, and you don't have to worry about all the issues that come with casting unsigned and signed values to each other. Unsigned only makes sense if you explicitly want the overflow mathematics of unsigned; you don't want overflow math for array sizes, though, so they should be signed in any sane type system.
| jandrewrogers wrote:
| A negative size is generally not a valid type instance regardless. In well-behaved code I would expect this to be an unreachable condition, because you are explicitly checking for sensible inputs _before_ doing this type of arithmetic. It makes no sense to check for sensible inputs _after_ you do the arithmetic; that is the same amount of code done in the wrong order. It is not safer to use signed in this case, nor does it decrease mental overhead.
| 
| Assuming an integer is signed can be much _less_ safe in practice, because there is often no guarantee that an integer type is always signed or unsigned in a given context. Typedefs are a ubiquitous thing, and you don't always control those definitions. In code designed to be robust and maintainable, it is usually idiomatic to implement these things in a way that is correct regardless of signedness, unless that signedness is an immutable contract (e.g. some syscall stuff), because whether or not a type is signed can sometimes change unexpectedly.
| MauranKilom wrote:
| > In well-behaved code I would expect this to be an unreachable condition because you are explicitly checking for sensible inputs before doing this type of arithmetic
| 
| You previously claimed "there will never be undefined behavior [by using unsigned types]", but now you also require that the UB scenario is checked beforehand? I'm confused how you can conclude that unsigned types are safer than signed ones ("no UB") when you require that UB be ruled out in the first place.
| 
| Unless you explicitly intend for wraparound to happen, your program has a bug on overflow regardless of signedness. You can (and possibly should) guard against the situation, but that favors neither side.
| 
| > Assuming an integer is signed can be much less safe in practice because there is often no guarantee that an integer type is always signed or unsigned in a given context. [...]
| 
| That is not an argument for or against using signed types. If types change from under you, you have a host of other issues anyway (as you say): mixed-sign comparisons, unexpected promotions, overflows... You get the exact same issues whether you yourself favor signed or unsigned types when mixing them with (what are, in your description) effectively unknown types.
| So again, your argument against signed types appears to be "you have to be careful either way", which gives zero reasons to favor unsigned over signed types.
| xscott wrote:
| > [unsigned] extends the range of valid values versus signed types, which is sometimes useful
| 
| When is it useful?
| 
| On a 32-bit platform, for file offsets, maybe that last bit was useful for a few years. But as soon as disks got bigger than 4GB, you needed to switch to a 64-bit `off_t` and `lseek` anyway.
| 
| For array indexes on a 32-bit platform, there is really only one VERY CONTRIVED case where that last bit of `unsigned` matters: your program creates a single array of bytes greater than 2GB. As soon as it's an array of 2-byte shorts or anything larger, you never need that last bit. And the configurations where a 32-bit kernel lets a program use more than 2GB of address space were not super common -- it was better to switch to a 64-bit platform at that point.
| 
| On 64-bit platforms, you never need that last bit for array indexes or file offsets, because no system has 10,000 petabytes of memory or 10,000-petabyte file sizes. And they won't for a while. Unless clock speeds get back on Moore's law, that for loop is going to take a long time to run anyway...
| 
| > Writing loops like this is often a code smell.
| 
| One person's code smell is another person's daily idiom. I think signed array indexes are like complex numbers -- people who don't need them can't imagine why anyone else does. (https://github.com/golang/go/issues/19921)
| 
| For me, I frequently do math on the array index. Reverse loops are necessary sometimes (I can give examples), but let's ignore that for now. How would you write this simple loop with unsigned loop indices?
|     double scale = <non-integer value>;
|     for (ssize_t ii = 0; ii < len; ii++) {
|         data[ii] = sinc(scale*(ii - len/2));
|     }
| 
| Note that `len/2` using integer division (truncating) is doing the right thing for both even and odd lengths. I really want that `sinc` function centered on a specific sample.
| 
| If either `ii` or `len` is unsigned, C/C++ quietly does something awful here. Stuff like this makes me resent the STL for choosing size_t over ssize_t.
| 
| I think it's ugly in Rust too:
| 
|     let len = data.len(); // usize because that's the way
|     for ii in 0..len {
|         data[ii] = sinc(scale*(ii as isize - len as isize / 2) as f64);
|     }
| 
| Thankfully the `as` operator binds pretty tightly.
| 
| I don't think there are any good reasons to use unsigned integers for array indexes or sizes. It doesn't help with bounds checking (it's the same assembly to check for unsigned out-of-bounds as it is to check for signed out-of-bounds or less-than-zero). It doesn't help with "large" arrays (except in one very contrived case). And it gets in the way when you need to do math on the indexes.
| tialaramex wrote:
| (In Rust) the ugliness for me stems from the fact that we got this array from somewhere and now we're overwriting it, and I start to think the code should probably make the data structure instead, whereupon the need for these to be indices goes away and it's some sort of iterator into collect?
| reikonomusha wrote:
| I feel like, ideally, one should never have to face the conundrum of choosing an integer width for general-purpose use.
| 
| I _really_ like that the "normal" integers in Common Lisp are arbitrary length, but they're represented automatically as register-sized quantities when small enough (which, for typical code, is most of the time). That way, at least in terms of space efficiency, and for the most part in terms of time efficiency, you can have your cake and eat it too.
| 
| If I really need performance, I can always declare that a variable will never contain a non-machine-sized integer:
| 
|     (defun f (x)
|       (declare (type (unsigned-byte 64) x))
|       ...)
| 
| This is me promising to Lisp that x will always be bound to an unsigned 64-bit integer. If I dial the optimization settings to high speed and low safety (which I can scope to just that file, or even just that function):
| 
|     (declaim (optimize (speed 3) (safety 0)))
| 
| then Lisp will eliminate bounds checks, bignum promotions, etc., and produce the native assembly you expect.
| 
| My point is that I really appreciate the "safe and general by default" philosophy, while still being able to opt in to practical efficiency considerations when I absolutely need it.
| 
| (If I'm pedantic: The above optimization behavior, while encouraged in various ways by the Common Lisp standard, is technically not _de jure_ required. But any general-purpose Lisp compiler from the past 30 years, like SBCL, will do these things.)
| cpeterso wrote:
| Here's Rob Pike's proposal to change Go's int to be arbitrary precision in Go 2.0. But after five years of debate about "performance", the proposal has stalled.
| 
| https://github.com/golang/go/issues/19623
| lisper wrote:
| This should really have been called "Choosing the right integer types in C/C++" rather than "Choosing the right integers".
| thesneakin wrote:
| Been thinking the same thing for just as long. "int" was supposed to be a full native word. Nowadays it's best to never use "int".
| cpeterso wrote:
| Google's C++ coding style recommends using bare "int" by default, or "int64_t" if the number might be "big". It also recommends using signed integers over unsigned.
| 
| https://google.github.io/styleguide/cppguide.html#Integer_Ty...
| Animats wrote:
| Too many years ago, I wrote something in-house, now lost, titled "Type integer considered harmful".
| I was arguing for the idea that everything should have an explicit range, written
| 
|     foo: 0..100;
| 
| This was around the time Ada was being designed. The way I wanted this to work is that the size of intermediate variables is determined by the compiler, with the following rules:
| 
| - Unless a named, typed result variable will overflow, no unnamed intermediate variable can overflow. The compiler must size intermediate temporary values accordingly.
| 
| - If this requires a larger intermediate variable than the machine can provide, a compile error is reported.
| 
| The implication is that you often need longer integers than you'd expect. Expressions such as
| 
|     x = a + b - c
|     x = (a + b) / 2
| 
| with signed values, all, say, 32-bit integers, can overflow in a+b without overflowing x. So such expressions have to be computed in 64 bits, then checked for overflow. This eliminates hidden overflow in intermediates. An expression with only one operator never overflows in a recoverable way, so it just has to be checked, not expanded.
| 
| That was written in an era when there were computers in use with word sizes of 8, 12, 16, 18, 24, 32, 36, 48, 60, and 64 bits. So word size was more of a practical issue than it is now, when non-power-of-2 machines have died off. Also, machines of that era tended to be slower on longer values -- much slower on some machines, which had to do double-length arithmetic in software. Thus there was a performance hit for this approach, which was the main objection.
| [deleted]
| Someone wrote:
| > An expression with only one operator never overflows in a recoverable way, so it just has to be checked, not expanded
| 
| (Nitpick: I think you intended to write either _always_ or _unrecoverable_.)
| 
| _If the target CPU doesn't have status flags indicating overflow_, that may be more work than using a larger intermediate.
| For _a+b_, detecting overflow after the fact is very likely faster, but it is already a bit convoluted if _a_ and _b_ are signed and thus can be negative (https://stackoverflow.com/a/45261894).
| 
| For _a×b_, https://stackoverflow.com/questions/1815367/catch-and-comput... has a simple algorithm, but it requires a division by _b_. I don't see that performing well. Possibly, double-size multiplication followed by a test for overflow beats it (but if the CPU has a 'count leading zeroes' instruction, I would use that: https://stackoverflow.com/a/59109502. I haven't checked that for correctness, but even if it can't be made correct, it likely can provide a fast path for most multiplications that your program does).
| 
| For CPUs with an overflow bit, the compiler can use it, but not all CPUs have that.
| tialaramex wrote:
| WUFFS calls that type refinement, although you are free to refine much larger types, e.g. you can say base.u32[0 ..= 15] and this is a 32-bit unsigned integer... that can't be more than 15.
| 
| WUFFS not only refuses overflow (you get a compiler diagnostic), it also infers these refinements from array sizing so as to ensure, at compile time, that bounds errors never occur. If you try to index into an array of 426 integers with k, WUFFS will go ahead and refine k to the 0..425 range.
| 
| Of course Ada (and presumably your proposal) are for general-purpose software, whereas WUFFS is specifically targeted at software for, well, Wrangling Untrusted File Formats Safely. So it's fine for some things to be completely impossible in WUFFS (e.g. you cannot write "Hello, World" in WUFFS, because WUFFS intentionally has no idea what a "string" is, or what "console output" is).
___________________________________________________________________
(page generated 2022-04-17 23:00 UTC)