[HN Gopher] Pointers Are Complicated II, or: We need better lang...
       ___________________________________________________________________
        
       Pointers Are Complicated II, or: We need better language specs
        
       Author : ChrisSD
       Score  : 148 points
       Date   : 2020-12-14 16:53 UTC (6 hours ago)
        
 (HTM) web link (www.ralfj.de)
 (TXT) w3m dump (www.ralfj.de)
        
       | gens wrote:
       | Pointers are "just" integers (unsigned integers of a size, that
       | is). Most languages treat them differently (C assigns them a
       | type, and a stride with it), but that is usually on top of them
       | being integers.
       | 
       | Pointers "point" to (are addresses to) bytes, as he said. If you
       | want you can pack data to a resolution of a bit, and some
       | languages help you do that as well.
       | 
       | You can also do whatever you want with pointers (in some
       | languages, ofc), like do bitwise operations on them. Aliasing
       | them can be useful as well, in more then one way.
       | 
       | Why not just get rid of pointers ? You know you can.
       | 
       | Also, a byte is 0 .. 255. The hardest problem; naming things, and
       | off by one errors.
        
         | steveklabnik wrote:
         | They are not always integers. See this comment from the
         | previous discussions on the other posts, for example:
         | https://news.ycombinator.com/item?id=17607595
        
           | gens wrote:
           | If 'b' is actually used then the compiler has to put 'a' in
           | memory. Theoretically. He says that b is not changed, and
           | does not "escape" (the scope of the code in question). In
           | that case b is useless, and thus it doesn't really even
           | exist. A compiler can "collapse" a lot of code into a
           | fraction of it. That's a part of the "optimizing" part of
           | "optimizing compiler". A compiler can put a value in a
           | register. A compiler can do whatever it wants as long as the
           | results are always the same as if it didn't do anything other
           | then just translate to machine code.
           | 
           | Pointer aliasing in C can be bad for the compiler as it can't
           | be sure that it can optimize away something. That's why
           | Fortran is faster, and why the restrict keyword was added to
           | the C standard.
           | 
           | And yea, i agree with him completely that a context should be
           | made clear. Personally i'd like less voodoo in programming
           | topics, but, on the other hand, this is still a young science
           | and a lot of terminology is all around the place. ..Actually,
           | scratch that, i found what i was interested in and don't
           | really care about everything else.
        
             | klodolph wrote:
             | > If 'b' is actually used then the compiler has to put 'a'
             | in memory.
             | 
             | Theoretically, no--the "as if" rule applies. In practice,
             | also no, because real compilers perform escape analysis and
             | there are aggressive optimization passes for eliminating
             | dead loads and dead stores.
             | 
             | Just like the compiler can optimize out b and replace it
             | with a reference to a, the compiler can optimize out the
             | value stored in a but _keep_ the reference to it stored in
             | b.
             | 
             | > In that case b is useless, and thus it doesn't really
             | even exist.
             | 
             | I don't think this notion of "exist" has legs.
        
               | gens wrote:
               | That is what i said, that it can be optimized away.
               | 
               | "exist" is the wrong word, "is useless" would be better.
               | It is like.. you have a spanner in a box labeled "a" and
               | you have a box "b" that has a piece of paper that says
               | "look in a". You ask for a spanner and you get told to
               | look in box "a". The box labeled "b" is then completely
               | useless. One day some guy, weirdly named "compiler",
               | decides to trow away box "b", and nobody cares.
               | 
               | But we have to assume that a compiler will compile the
               | program just as we wrote it. Otherwise we are not talking
               | about the language, but about the compiler. This is gone
               | off topic, if there even was one.
        
               | klodolph wrote:
               | A variable is useless because it is optimized away? I
               | don't agree with that terminology.
               | 
               | The variable is useful / exists in the source code, and
               | that is enough. Variables / objects only exist
               | conceptually in the original program anyways, we just
               | infer their existence in the compiled program by
               | correspondence with the original code, or educated
               | guesses about the original code.
        
           | badsectoracula wrote:
           | > What happens when 'a' is allocated to a CPU register?
           | 
           | When the & operator is used on 'a' it should mark it as
           | unsafe to place the 'a' on a CPU register - CPU allocation is
           | an optimization and optimizations should _never_ affect how
           | the program behaves (except making it run faster, of course)
           | and as such they should only be applied when the compiler can
           | be sure that they 're safe to do so.
           | 
           | (and yes, the same applies on using & on something like an
           | array or struct element - the use of & should taint any
           | variable)
        
             | steveklabnik wrote:
             | That's addressed by the
             | 
             | > If 'b' never escapes and no (undefined) pointer
             | arithmetic is used
             | 
             | part.
        
       | raverbashing wrote:
       | Great article
       | 
       | It just seems to me there's too much risk in "overoptimizing"
       | especially in a weakly defined language like C with trigger happy
       | optimizers and even more trigger happy "UB means let's go crazy"
       | 
       | I take it as principle that no compilers should "throw their
       | hands up" when detecting UB. Crash the program, don't just take
       | that part out.
        
         | gpderetta wrote:
         | Compilers don't detect UB (at least as optimizations are
         | concerned), they assume no UB.
         | 
         | If you want to detect UB, at least dynamically, compile with
         | sanitizers.
        
       | ChrisSD wrote:
       | This is a standalone sequel to "Pointers Are Complicated, or:
       | What's in a Byte?"
       | (https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html).
       | 
       | Which has been discussed on HN
       | 
       | 2020: https://news.ycombinator.com/item?id=24376797
       | 
       | 2018: https://news.ycombinator.com/item?id=17604402
        
         | ralfj wrote:
         | Oh, there was a repost in 2020, that explains the sudden spike
         | in page hits that I saw earlier this year and couldn't trace
         | back. ;)
        
       | dvt wrote:
       | First of all: _fantastic article_. In-depth, insightful, and the
       | examples are absolutely top-notch. On the razor 's edge between
       | accessible and profound. Hats off to the author.
       | 
       | I will say that the problems seem to lie in a few interesting
       | interlanguage quirks, and not so much on language _specs_. For
       | example, LLVM and C have different definitions of  "undefined
       | behavior"[1] -- this is pointed out when looking at the `poison`
       | identifier. In LLVM this might just be a weird side-effect we
       | don't care about, but actually having an integer overflow is a
       | _big deal_ in C that introduces UB. Given the introduction, LLVM
       | (incorrectly) introduces (C) UB at times.
       | 
       | The second section, on pointer provenance, is much trickier, as
       | there's no "obvious" UB introduced there. Further, I would argue
       | that pointer provenance is a meta-language (or meta-
       | computational) concept (which the Rust docs hint at), and
       | logically speaking, might be intractable in the general case (I'd
       | have to think about this more). Pointer arithmetic can fiddle
       | with meta-state, whereas IRs like LLVM tend to focus purely on
       | symbol manipulation.
       | 
       | As a side note, I'd be curious what happens when LLVM sees
       | something like: `void *p = &p` which is a self-referential
       | pointer. What does it deduce about it? Does it optimize it away?
       | 
       | Cool thought-provoking article.
       | 
       | [1] https://www.cs.utah.edu/~regehr/llvm-ub.pdf
        
         | klodolph wrote:
         | > As a side note, I'd be curious what happens when LLVM sees
         | something like: `void <asterisk>p = &p` which is a self-
         | referential pointer. What does it deduce about it? Does it
         | optimize it away?
         | 
         | Why would it do anything special here? This is not really
         | different from something like this:                   struct a
         | { struct a *p; };         struct a x = { &a };
         | 
         | The only real difference is that the void version involves a
         | cast, because the type is otherwise impossible to express in C.
        
           | dvt wrote:
           | > Why would it do anything special here? This is not really
           | different from something like this:
           | 
           | Yeah, I suppose you're right. I was Googling potential
           | interesting cases and ran across this one[1] which made me
           | think of weird edge cases w.r.t. self-reference.
           | 
           | [1] https://stackoverflow.com/questions/20596856/self-
           | referentia...
        
         | sharpneli wrote:
         | I'm pretty sure there is no UB in the second part. It's just
         | LLVM bug. The pointer comparison is valid (pointer to and
         | object and to an another object that's one past the end MAY
         | compare equal). The write is valid.
         | 
         | Reads and writes to char* always alias everything. Unless the
         | compiler can prove that no writes happened it has to emit a
         | read. So if it "forgets" due to the uintptr_t cast where the
         | char* came from then it must assume that it can be pretty much
         | arbitrary and must emit a read.
        
         | Dylan16807 wrote:
         | > Further, I would argue that pointer provenance is a meta-
         | language (or meta-computational) concept (which the Rust docs
         | hint at), and logically speaking, might be intractable in the
         | general case (I'd have to think about this more).
         | 
         | Unless we define provenance to only work within certain easy-
         | to-calculate boundaries. But for the general case, sure, just
         | don't optimize when there's uncertainty.
        
         | mjevans wrote:
         | As usual; if a compiler knows about undefined behavior I would
         | much rather it throw an error rather than optimize something
         | the programmer didn't intend based on the compiler out-smarting
         | a human's ability to be specific.
        
           | simias wrote:
           | It would sometimes work but I think this (very common
           | comment) misses the general point, there's absolutely fine
           | and non buggy on warning-worthy code that can be optimized
           | away if the compiler relies on UB. A very simple example:
           | void do_stuff(int *some_ptr) {
           | do_substuff(some_ptr);                  *some_ptr += 2;
           | }              static void do_substuff(int *some_ptr) {
           | if (some_ptr != NULL) {                 *some_ptr = 10;
           | }         }
           | 
           | do_stuff calls a subroutine that does a NULL pointer check,
           | then unconditionally dereferences the same pointer.
           | 
           | From this the compiler, if it decides to inline that code,
           | can decide to optimize the NULL check away since if the
           | pointer is NULL it's guaranteed to trigger UB. The logic
           | being "clearly the programmer assumes that the pointer can't
           | be NULL here, so thanks to UB rules I can too".
           | 
           | There's nothing wrong with this code, there's no reason to
           | emit a warning.
           | 
           | Distinguishing between "of course that's a reasonable
           | optimization, that's probably what the developer intended"
           | and "wow this compiler is so dumb, obviously I never meant
           | for that to happen" is a very tough nut to crack.
           | 
           | At this point you can push the blame to the C language not
           | being expressive enough, in this case not giving us a way to
           | express nullable vs. non-nullable pointers within the
           | language, which forces the compiler to lean onto these UB
           | heuristics to optimize.
        
             | dooglius wrote:
             | Even in your heavily contrived example, I can think of
             | cases where the optimization isn't what the programmer
             | wants. For instance, I might have a special handler via
             | userfaultfd(2) that detects if I'm doing an increment of
             | the null pointer and handles it in some special way, but
             | can't handle just setting it to 10. For a more real
             | example, I might have acquired a lock around do_substuff,
             | and I might be okay with the thread segfaulting only if the
             | lock wasn't held.
        
           | klodolph wrote:
           | I've heard this proposal thrown around a lot on HN, but how
           | would that even work? Can you describe how the compiler would
           | do this, what patterns it would look for, and what error
           | messages it would produce? For example, consider this:
           | int factorial(int n) {             int r = 1;             for
           | (int i = 1; i <= n; i++) {                r *= i;
           | }             return r;         }
           | 
           | Since there are two instances of possible UB in the above
           | function, what error messages would you like the compiler to
           | produce?
           | 
           | First of all, the compiler can assume that the loop
           | terminates. This is a reasonable assumption from a human
           | perspective--the loop is almost certainly _intended_ to run N
           | times. However, if n = INT_MAX, then there's UB... and
           | there's no real way to know that n [?] INT_MAX.
           | 
           | You could argue that you want -fsanitize=undefined, where the
           | compiler inserts checks at runtime.
           | 
           | I don't think this proposal is well formulated enough to be
           | viable.
           | 
           | Or consider something like this:                   void
           | puts_safe(const char *s) {           puts(s == NULL ?
           | "(null)" : s)         }              void puts_with_len(const
           | char *s) {           printf("len = %zu\n", strlen(s));
           | puts_safe(s);         }
           | 
           | Do you want something like,                   Warning:
           | puts_safe replaced with puts, because s != NULL, because
           | strlen(NULL) is undefined
           | 
           | I suspect there would be a mountain of false positives, and I
           | would rather just use a static analyzer.
        
       | zokier wrote:
       | Intuitively I feel that the first optimization is incorrect as it
       | introduces out of bound write. I'll admit that this is very much
       | subjective thing
        
       | mehrdadn wrote:
       | If you enjoyed learning about pointer provenance (called "safely-
       | derived pointer" in C++ [1]), you might find the recent addition
       | of std::launder() [2] interesting.
       | 
       | [1] https://en.cppreference.com/w/cpp/memory/gc/pointer_safety
       | 
       | [2] https://en.cppreference.com/w/cpp/utility/launder
        
         | comex wrote:
         | Pointer safety is actually a separate concept from provenance.
         | It was added to C++11 with the intent that it be used by
         | garbage-collected implementations, but I'm not sure if anyone
         | has ever properly implemented it. At least, the big three STL
         | implementations have get_pointer_safety() always return
         | pointer_safety::relaxed ("All pointers are considered valid and
         | may be dereferenced or deallocated"), and there has been a
         | proposal to remove pointer safety from future versions of the
         | standard. [1]
         | 
         | [1] http://www.open-
         | std.org/jtc1/sc22/wg21/docs/papers/2020/p218...
        
       | TrianguloY wrote:
       | I'm sorry but I don't fully understand the problem and that
       | 'provenance' thing. For me the third optimization is the wrong
       | one, for the same reason as this                 char i,j='0';
       | *(&i+1)='1';       cout << j;
       | 
       | can't be optimized to                 cout << '0';
       | 
       | even though the j variable is also never overwritten directly.
       | 
       | This reminds me of paralelization of nested loops with pragmas,
       | where you need to specifically say that two pointers will never
       | point to the same space, otherwise the compiler won't optimize
       | the loop because 'it could happen'.
        
         | Dr_Emann wrote:
         | Your block of code can be (and absolutely is) optimized to
         | output zero directly (in c, because the assembly is easier to
         | read):
         | 
         | https://c.godbolt.org/z/qe3f4K
         | 
         | `*(&i+1)` dereferences the pointer "one past the end" which is
         | UB.
        
           | TrianguloY wrote:
           | And I agree, with an UB the original code is undefined and so
           | any optimization is not right nor wrong, simply keep that UB.
           | 
           | But then, why such a long article for a code whose second
           | line is a UB ('p+1' is undefined)?
        
             | fanf2 wrote:
             | No, p+1 is OK because you are allowed to make a pointer one
             | past the end in C (so that common pointer loops work as
             | expected) but you cannot dereference them.
        
             | gizmo686 wrote:
             | uintptr_t ip = (uintptr_t)(p+1);
             | 
             | Is perfectly defined. What is undefined is converting ip
             | back to a pointer and writing to it, which the original
             | program never does.
        
               | wahern wrote:
               | I'm not sure that _comparing_ uintptr_t values in a
               | pointer access conditional is well defined in C. At the
               | very least, it 's underspecified. The only thing that the
               | C standard specifies regarding uintptr_t is that if you
               | round-trip void pointers the addresses will compare
               | equal. It says nothing about the values in-flight and
               | their relation to underlying address representations.
               | Indeed, it's deliberately silent on that issue because
               | while often criticized for having a weak type system, an
               | informal notion of pointer provenance has always pervaded
               | the specification, if only because of the legacy of
               | physical memory architectures C has been ported to.
               | 
               | Indeed, I think the fact that the example code relies on
               | intptr_t is a reflection that good static analyzers
               | (possibly including those in LLVM or GCC) would already
               | catch the error of comparing addresses from two separate
               | objects, or compile it in a more deterministic way. For
               | example, char pointers can alias _any_ memory, which can
               | disable many kinds of optimizations. (FWIW, this is why
               | replacing char or unsigned char with int8_t or uint8_t
               | can be dangerous.)
               | 
               | Type-punning through intptr_t circumvents this capability
               | because, being underspecified, compilers don't track
               | provenance through such constructs. And I'm not sure they
               | should, because intptr_t is an escape mechanism. There's
               | an argument that for code that relies on such escape
               | mechanisms compilers should just _disable_ optimizations
               | locally and fall back on some simple, lexical code flow
               | model. Of course, the problem with that is nobody can
               | agree on the specifics of that model.
        
               | TrianguloY wrote:
               | Hmm, I think I understand now.
               | 
               | So now I think it's the first optimization the one that
               | is wrong. If you replace a variable with another, don't
               | you need to keep information of the original variable?
               | I mean, if a==b and you do *a=1 you can replace it with
               | *b=1, but then you need to keep the information that 'a'
               | was written to, so other optimizations (the third one)
               | don't think it wasn't. Or am I missing something else
               | here?
               | 
               | Edit: sorry for the code block, otherwise the asterisks
               | are removed.
        
               | Dylan16807 wrote:
               | > If you replace a variable with another, don't you need
               | to keep information of the original variable?
               | 
               | Note that ip and iq are integers. If you can't replace
               | integers freely, it makes it really hard to optimize
               | arithmetic.
        
             | Dr_Emann wrote:
             | Because there is no UB in the original program.
             | 
             | It constructs a pointer to the "one past the end" element,
             | but that is fine, and the original program never
             | dereferences that pointer. Again: there is no UB in the
             | original program.
        
         | fanf2 wrote:
         | "Provenance" is about the object that a pointer refers to. In
         | your example, i and j are different objects and the language
         | does not permit you to modify j using a pointer derived from i.
         | 
         | Your comparison with parallel loops is interesting because in
         | the case of big arrays, all the pointers refer to the same
         | object, so the compiler can't use provenance or strict aliasing
         | to prove that they don't overlap. So you have to use the
         | pragmas to give the compiler permission to treat them as non-
         | aliasing.
        
         | sheepfleece wrote:
         | ``` *(&i+1)='1'; ```
         | 
         | This line is UB, no? You are referencing some random, if
         | adjacent, block of memory. Anything can happen.
        
         | Someone wrote:
         | If you think that, you're giving up lots and lots of
         | optimization opportunities.
         | 
         | There's zero guarantee that _i_ and _j_ are adjacent on the
         | stack or even on the stack (there isn't even a guarantee that
         | there is a stack, but that's a different subject); a compiler
         | can decide to keep _j_ in a register. That is very common in
         | short functions, and essential for performance.
         | 
         | It also would mean the compiler would have to load data from
         | memory way more often, as ?about? every pointer write might
         | overwrite any memory.
        
       | brundolf wrote:
       | Before reading this I thought working on compilers might be a fun
       | career direction
        
       | vlovich123 wrote:
       | Does this have implications about the safety guarantees even safe
       | Rust can make? It seems like incorrect optimization passes could
       | result in bugs/security issues that are a byproduct of the
       | compiler rather than the language itself. I wonder how big of an
       | issue this is for Rust (relatively probably a bigger issue than
       | for C/C++ where basic resource ownership bugs are more common).
        
         | steveklabnik wrote:
         | Yes, absolutely. https://github.com/rust-lang/rust/issues/28728
         | is probably the most famous bug here, but many I-unsound bugs
         | are similar.
         | 
         | This would be the case even without this. Compilers are
         | programs. Programs have bugs.
        
         | jswrenn wrote:
         | Yes, the pointer issues Ralf raises has implications about what
         | operations Rust can make safe. An example Ralf posed to the
         | safe transmute working group: a transmutation from a reference
         | (i.e., a pointer with provenance) to one without (e.g., a "raw"
         | pointer) or to a `usize` number cannot be safe. (Rust _can_
         | safely provide these conversions via the `as` keyword, just not
         | via the more general framework of transmutations).
         | 
         | More broadly: Rust _is_ very sensitive to LLVM bugs that
         | inadvertently generate UB. The longstanding issues with
         | `noalias` optimizations is the prime example of this:
         | https://stackoverflow.com/a/57259339/7501301
        
         | dbaupp wrote:
         | It's enough of an issue for C at least that a fully formally
         | verified compiler exists:
         | https://en.m.wikipedia.org/wiki/CompCert . That is, there's
         | proof that its (limited) optimisation passes are correct.
         | 
         | There's other approaches to assurance of compiled code too,
         | such as seL4 which has proofs that the binary itself is
         | correct, I believe.
        
       | siraben wrote:
       | > However, to make the argument that an optimization is correct,
       | the exact semantics of LLVM IR (what the behavior of all possible
       | programs is and when they have UB) needs to be documented.
       | 
       | This is a great point as to why formal semantics of programming
       | languages matters. Even if an optimization seems "obviously"
       | correct, finicky things like introducing the possibility of UB
       | can, as the post outlines, cascade into compiler passes that
       | change the program more and more drastically. The post mentions
       | that one need not go overly formal with the semantics, but I'll
       | demonstrate what could happen when one does.
       | 
       | One possible formulation of semantics is denotational semantics,
       | which describes how to map a program source syntax into values,
       | i.e. eval : Expr -> Val. So when we have an optimization opt :
       | Expr -> Expr, the desired correctness property for that
       | optimization is that                 Definition opt_correct (opt
       | : Expr -> Expr) := forall (e : Expr), eval e = eval (opt e).
       | 
       | When we want to rewrite, say e + 0 ==> e, for any expression e,
       | the correctness can be stated and proved                 Theorem
       | e_plus_0_to_e_correct : opt_correct e_plus_0_to_e.
       | 
       | One claim in the blog post that correct optimization passes, e.g.
       | opt1 and opt2 compose into another correct optimization pass can
       | be stated as:                 Theorem opt_correct_compose (opt1,
       | opt2 : Expr -> Expr) :            opt_correct opt1 -> opt_correct
       | op2 -> opt_correct (opt1 [?] opt2).
       | 
       | Which means given any two optimization passes opt1 and opt2 such
       | that they are correct, composing them preserves correctness. The
       | proof is simply;                   eval (opt1 (opt2 e))       = {
       | since opt1 is correct }         eval (opt2 e)       = { since
       | opt2 is correct }         eval e
        
         | Dylan16807 wrote:
         | Not sure why you wanted to spend the time to prove "correct
         | optimizations can be combined", but I take issue with that
         | proof. It only works for optimizations that give code _exactly_
         | the same behavior, which is severely limiting.
        
       | samatman wrote:
       | Excellent article.
       | 
       | It seems to me (and this is somewhat off-the-cuff) that compilers
       | have another option: integers _which are cast from pointers_ have
       | provenance.
       | 
       | That is, provenance is a taint: you can't clean it off through
       | casting. So casting from pointer means the user can do integer-
       | things like addition, but it means the compiler can't do integer-
       | things like constant folding.
        
         | gizmo686 wrote:
         | I don't think that works. Consider the following (contrived)
         | program:                 char* q[1] = {0};       int iq =
         | (uintptr_t)q;       int ip = 0;       while (iq>0) {
         | iq--;         ip++;       }       char* p = (char*) ip;
         | 
         | You can make this more efficient (albeit no less contrived) by
         | iterating through iq bit-wise.
         | 
         | Less contrived would be sending a pointer through some IPC
         | mechanism (allowing it to be used as an opaque handle
         | externally, while the program will blindly dereference it when
         | it is passed back in); or even passing it within a program
         | across compilation units.
         | 
         | If you treat provenance like a taint, you need to track every
         | way it could be propagated, and that does not seem solvable.
        
           | samatman wrote:
           | So what are the optimizations here? iq is tainted, you can't
           | hoist the loop out. You've said "I'm doing a weird thing here
           | with an integer which used to be a pointer, you have to
           | assume that it involves memory and treat it more like a
           | pointer than an integer"
        
       | pierrebai wrote:
       | The bug is clearly the writing past the end of an array. That is:
       | *(p+1) = 10;
       | 
       | If you start with a buggy program, that it stays buggy after
       | optimizations is not surprising!
       | 
       | Edit: what;s more, the original form of the program also did not
       | write through the q pointer and could have had optimizer the
       | print zero. In the first form, the author assumes the compiler
       | can track the pointer q through q == qi and avoid the
       | optimization that would print zero, but fail to track pointer q
       | through pi + 1 == qi == q.
        
         | mskslal wrote:
         | No, the original program writes through iq, which is well
         | defined. Only the optimized program writes through p+1.
        
       ___________________________________________________________________
       (page generated 2020-12-14 23:00 UTC)