[HN Gopher] Pointers Are Complicated II, or: We need better lang... ___________________________________________________________________ Pointers Are Complicated II, or: We need better language specs Author : ChrisSD Score : 148 points Date : 2020-12-14 16:53 UTC (6 hours ago) (HTM) web link (www.ralfj.de) (TXT) w3m dump (www.ralfj.de) | gens wrote: | Pointers are "just" integers (unsigned integers of a size, that | is). Most languages treat them differently (C assigns them a | type, and a stride with it), but that is usually on top of them | being integers. | | Pointers "point" to (are addresses to) bytes, as he said. If you | want you can pack data to a resolution of a bit, and some | languages help you do that as well. | | You can also do whatever you want with pointers (in some | languages, ofc), like do bitwise operations on them. Aliasing | them can be useful as well, in more then one way. | | Why not just get rid of pointers ? You know you can. | | Also, a byte is 0 .. 255. The hardest problem; naming things, and | off by one errors. | steveklabnik wrote: | They are not always integers. See this comment from the | previous discussions on the other posts, for example: | https://news.ycombinator.com/item?id=17607595 | gens wrote: | If 'b' is actually used then the compiler has to put 'a' in | memory. Theoretically. He says that b is not changed, and | does not "escape" (the scope of the code in question). In | that case b is useless, and thus it doesn't really even | exist. A compiler can "collapse" a lot of code into a | fraction of it. That's a part of the "optimizing" part of | "optimizing compiler". A compiler can put a value in a | register. A compiler can do whatever it wants as long as the | results are always the same as if it didn't do anything other | then just translate to machine code. | | Pointer aliasing in C can be bad for the compiler as it can't | be sure that it can optimize away something. That's why | Fortran is faster, and why the restrict keyword was added to | the C standard. | | And yea, i agree with him completely that a context should be | made clear. Personally i'd like less voodoo in programming | topics, but, on the other hand, this is still a young science | and a lot of terminology is all around the place. ..Actually, | scratch that, i found what i was interested in and don't | really care about everything else. | klodolph wrote: | > If 'b' is actually used then the compiler has to put 'a' | in memory. | | Theoretically, no--the "as if" rule applies. In practice, | also no, because real compilers perform escape analysis and | there are aggressive optimization passes for eliminating | dead loads and dead stores. | | Just like the compiler can optimize out b and replace it | with a reference to a, the compiler can optimize out the | value stored in a but _keep_ the reference to it stored in | b. | | > In that case b is useless, and thus it doesn't really | even exist. | | I don't think this notion of "exist" has legs. | gens wrote: | That is what i said, that it can be optimized away. | | "exist" is the wrong word, "is useless" would be better. | It is like.. you have a spanner in a box labeled "a" and | you have a box "b" that has a piece of paper that says | "look in a". You ask for a spanner and you get told to | look in box "a". The box labeled "b" is then completely | useless. One day some guy, weirdly named "compiler", | decides to trow away box "b", and nobody cares. | | But we have to assume that a compiler will compile the | program just as we wrote it. Otherwise we are not talking | about the language, but about the compiler. This is gone | off topic, if there even was one. | klodolph wrote: | A variable is useless because it is optimized away? I | don't agree with that terminology. | | The variable is useful / exists in the source code, and | that is enough. Variables / objects only exist | conceptually in the original program anyways, we just | infer their existence in the compiled program by | correspondence with the original code, or educated | guesses about the original code. | badsectoracula wrote: | > What happens when 'a' is allocated to a CPU register? | | When the & operator is used on 'a' it should mark it as | unsafe to place the 'a' on a CPU register - CPU allocation is | an optimization and optimizations should _never_ affect how | the program behaves (except making it run faster, of course) | and as such they should only be applied when the compiler can | be sure that they 're safe to do so. | | (and yes, the same applies on using & on something like an | array or struct element - the use of & should taint any | variable) | steveklabnik wrote: | That's addressed by the | | > If 'b' never escapes and no (undefined) pointer | arithmetic is used | | part. | raverbashing wrote: | Great article | | It just seems to me there's too much risk in "overoptimizing" | especially in a weakly defined language like C with trigger happy | optimizers and even more trigger happy "UB means let's go crazy" | | I take it as principle that no compilers should "throw their | hands up" when detecting UB. Crash the program, don't just take | that part out. | gpderetta wrote: | Compilers don't detect UB (at least as optimizations are | concerned), they assume no UB. | | If you want to detect UB, at least dynamically, compile with | sanitizers. | ChrisSD wrote: | This is a standalone sequel to "Pointers Are Complicated, or: | What's in a Byte?" | (https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html). | | Which has been discussed on HN | | 2020: https://news.ycombinator.com/item?id=24376797 | | 2018: https://news.ycombinator.com/item?id=17604402 | ralfj wrote: | Oh, there was a repost in 2020, that explains the sudden spike | in page hits that I saw earlier this year and couldn't trace | back. ;) | dvt wrote: | First of all: _fantastic article_. In-depth, insightful, and the | examples are absolutely top-notch. On the razor 's edge between | accessible and profound. Hats off to the author. | | I will say that the problems seem to lie in a few interesting | interlanguage quirks, and not so much on language _specs_. For | example, LLVM and C have different definitions of "undefined | behavior"[1] -- this is pointed out when looking at the `poison` | identifier. In LLVM this might just be a weird side-effect we | don't care about, but actually having an integer overflow is a | _big deal_ in C that introduces UB. Given the introduction, LLVM | (incorrectly) introduces (C) UB at times. | | The second section, on pointer provenance, is much trickier, as | there's no "obvious" UB introduced there. Further, I would argue | that pointer provenance is a meta-language (or meta- | computational) concept (which the Rust docs hint at), and | logically speaking, might be intractable in the general case (I'd | have to think about this more). Pointer arithmetic can fiddle | with meta-state, whereas IRs like LLVM tend to focus purely on | symbol manipulation. | | As a side note, I'd be curious what happens when LLVM sees | something like: `void *p = &p` which is a self-referential | pointer. What does it deduce about it? Does it optimize it away? | | Cool thought-provoking article. | | [1] https://www.cs.utah.edu/~regehr/llvm-ub.pdf | klodolph wrote: | > As a side note, I'd be curious what happens when LLVM sees | something like: `void <asterisk>p = &p` which is a self- | referential pointer. What does it deduce about it? Does it | optimize it away? | | Why would it do anything special here? This is not really | different from something like this: struct a | { struct a *p; }; struct a x = { &a }; | | The only real difference is that the void version involves a | cast, because the type is otherwise impossible to express in C. | dvt wrote: | > Why would it do anything special here? This is not really | different from something like this: | | Yeah, I suppose you're right. I was Googling potential | interesting cases and ran across this one[1] which made me | think of weird edge cases w.r.t. self-reference. | | [1] https://stackoverflow.com/questions/20596856/self- | referentia... | sharpneli wrote: | I'm pretty sure there is no UB in the second part. It's just | LLVM bug. The pointer comparison is valid (pointer to and | object and to an another object that's one past the end MAY | compare equal). The write is valid. | | Reads and writes to char* always alias everything. Unless the | compiler can prove that no writes happened it has to emit a | read. So if it "forgets" due to the uintptr_t cast where the | char* came from then it must assume that it can be pretty much | arbitrary and must emit a read. | Dylan16807 wrote: | > Further, I would argue that pointer provenance is a meta- | language (or meta-computational) concept (which the Rust docs | hint at), and logically speaking, might be intractable in the | general case (I'd have to think about this more). | | Unless we define provenance to only work within certain easy- | to-calculate boundaries. But for the general case, sure, just | don't optimize when there's uncertainty. | mjevans wrote: | As usual; if a compiler knows about undefined behavior I would | much rather it throw an error rather than optimize something | the programmer didn't intend based on the compiler out-smarting | a human's ability to be specific. | simias wrote: | It would sometimes work but I think this (very common | comment) misses the general point, there's absolutely fine | and non buggy on warning-worthy code that can be optimized | away if the compiler relies on UB. A very simple example: | void do_stuff(int *some_ptr) { | do_substuff(some_ptr); *some_ptr += 2; | } static void do_substuff(int *some_ptr) { | if (some_ptr != NULL) { *some_ptr = 10; | } } | | do_stuff calls a subroutine that does a NULL pointer check, | then unconditionally dereferences the same pointer. | | From this the compiler, if it decides to inline that code, | can decide to optimize the NULL check away since if the | pointer is NULL it's guaranteed to trigger UB. The logic | being "clearly the programmer assumes that the pointer can't | be NULL here, so thanks to UB rules I can too". | | There's nothing wrong with this code, there's no reason to | emit a warning. | | Distinguishing between "of course that's a reasonable | optimization, that's probably what the developer intended" | and "wow this compiler is so dumb, obviously I never meant | for that to happen" is a very tough nut to crack. | | At this point you can push the blame to the C language not | being expressive enough, in this case not giving us a way to | express nullable vs. non-nullable pointers within the | language, which forces the compiler to lean onto these UB | heuristics to optimize. | dooglius wrote: | Even in your heavily contrived example, I can think of | cases where the optimization isn't what the programmer | wants. For instance, I might have a special handler via | userfaultfd(2) that detects if I'm doing an increment of | the null pointer and handles it in some special way, but | can't handle just setting it to 10. For a more real | example, I might have acquired a lock around do_substuff, | and I might be okay with the thread segfaulting only if the | lock wasn't held. | klodolph wrote: | I've heard this proposal thrown around a lot on HN, but how | would that even work? Can you describe how the compiler would | do this, what patterns it would look for, and what error | messages it would produce? For example, consider this: | int factorial(int n) { int r = 1; for | (int i = 1; i <= n; i++) { r *= i; | } return r; } | | Since there are two instances of possible UB in the above | function, what error messages would you like the compiler to | produce? | | First of all, the compiler can assume that the loop | terminates. This is a reasonable assumption from a human | perspective--the loop is almost certainly _intended_ to run N | times. However, if n = INT_MAX, then there's UB... and | there's no real way to know that n [?] INT_MAX. | | You could argue that you want -fsanitize=undefined, where the | compiler inserts checks at runtime. | | I don't think this proposal is well formulated enough to be | viable. | | Or consider something like this: void | puts_safe(const char *s) { puts(s == NULL ? | "(null)" : s) } void puts_with_len(const | char *s) { printf("len = %zu\n", strlen(s)); | puts_safe(s); } | | Do you want something like, Warning: | puts_safe replaced with puts, because s != NULL, because | strlen(NULL) is undefined | | I suspect there would be a mountain of false positives, and I | would rather just use a static analyzer. | zokier wrote: | Intuitively I feel that the first optimization is incorrect as it | introduces out of bound write. I'll admit that this is very much | subjective thing | mehrdadn wrote: | If you enjoyed learning about pointer provenance (called "safely- | derived pointer" in C++ [1]), you might find the recent addition | of std::launder() [2] interesting. | | [1] https://en.cppreference.com/w/cpp/memory/gc/pointer_safety | | [2] https://en.cppreference.com/w/cpp/utility/launder | comex wrote: | Pointer safety is actually a separate concept from provenance. | It was added to C++11 with the intent that it be used by | garbage-collected implementations, but I'm not sure if anyone | has ever properly implemented it. At least, the big three STL | implementations have get_pointer_safety() always return | pointer_safety::relaxed ("All pointers are considered valid and | may be dereferenced or deallocated"), and there has been a | proposal to remove pointer safety from future versions of the | standard. [1] | | [1] http://www.open- | std.org/jtc1/sc22/wg21/docs/papers/2020/p218... | TrianguloY wrote: | I'm sorry but I don't fully understand the problem and that | 'provenance' thing. For me the third optimization is the wrong | one, for the same reason as this char i,j='0'; | *(&i+1)='1'; cout << j; | | can't be optimized to cout << '0'; | | even though the j variable is also never overwritten directly. | | This reminds me of paralelization of nested loops with pragmas, | where you need to specifically say that two pointers will never | point to the same space, otherwise the compiler won't optimize | the loop because 'it could happen'. | Dr_Emann wrote: | Your block of code can be (and absolutely is) optimized to | output zero directly (in c, because the assembly is easier to | read): | | https://c.godbolt.org/z/qe3f4K | | `*(&i+1)` dereferences the pointer "one past the end" which is | UB. | TrianguloY wrote: | And I agree, with an UB the original code is undefined and so | any optimization is not right nor wrong, simply keep that UB. | | But then, why such a long article for a code whose second | line is a UB ('p+1' is undefined)? | fanf2 wrote: | No, p+1 is OK because you are allowed to make a pointer one | past the end in C (so that common pointer loops work as | expected) but you cannot dereference them. | gizmo686 wrote: | uintptr_t ip = (uintptr_t)(p+1); | | Is perfectly defined. What is undefined is converting ip | back to a pointer and writing to it, which the original | program never does. | wahern wrote: | I'm not sure that _comparing_ uintptr_t values in a | pointer access conditional is well defined in C. At the | very least, it 's underspecified. The only thing that the | C standard specifies regarding uintptr_t is that if you | round-trip void pointers the addresses will compare | equal. It says nothing about the values in-flight and | their relation to underlying address representations. | Indeed, it's deliberately silent on that issue because | while often criticized for having a weak type system, an | informal notion of pointer provenance has always pervaded | the specification, if only because of the legacy of | physical memory architectures C has been ported to. | | Indeed, I think the fact that the example code relies on | intptr_t is a reflection that good static analyzers | (possibly including those in LLVM or GCC) would already | catch the error of comparing addresses from two separate | objects, or compile it in a more deterministic way. For | example, char pointers can alias _any_ memory, which can | disable many kinds of optimizations. (FWIW, this is why | replacing char or unsigned char with int8_t or uint8_t | can be dangerous.) | | Type-punning through intptr_t circumvents this capability | because, being underspecified, compilers don't track | provenance through such constructs. And I'm not sure they | should, because intptr_t is an escape mechanism. There's | an argument that for code that relies on such escape | mechanisms compilers should just _disable_ optimizations | locally and fall back on some simple, lexical code flow | model. Of course, the problem with that is nobody can | agree on the specifics of that model. | TrianguloY wrote: | Hmm, I think I understand now. | | So now I think it's the first optimization the one that | is wrong. If you replace a variable with another, don't | you need to keep information of the original variable? | I mean, if a==b and you do *a=1 you can replace it with | *b=1, but then you need to keep the information that 'a' | was written to, so other optimizations (the third one) | don't think it wasn't. Or am I missing something else | here? | | Edit: sorry for the code block, otherwise the asterisks | are removed. | Dylan16807 wrote: | > If you replace a variable with another, don't you need | to keep information of the original variable? | | Note that ip and iq are integers. If you can't replace | integers freely, it makes it really hard to optimize | arithmetic. | Dr_Emann wrote: | Because there is no UB in the original program. | | It constructs a pointer to the "one past the end" element, | but that is fine, and the original program never | dereferences that pointer. Again: there is no UB in the | original program. | fanf2 wrote: | "Provenance" is about the object that a pointer refers to. In | your example, i and j are different objects and the language | does not permit you to modify j using a pointer derived from i. | | Your comparison with parallel loops is interesting because in | the case of big arrays, all the pointers refer to the same | object, so the compiler can't use provenance or strict aliasing | to prove that they don't overlap. So you have to use the | pragmas to give the compiler permission to treat them as non- | aliasing. | sheepfleece wrote: | ``` *(&i+1)='1'; ``` | | This line is UB, no? You are referencing some random, if | adjacent, block of memory. Anything can happen. | Someone wrote: | If you think that, you're giving up lots and lots of | optimization opportunities. | | There's zero guarantee that _i_ and _j_ are adjacent on the | stack or even on the stack (there isn't even a guarantee that | there is a stack, but that's a different subject); a compiler | can decide to keep _j_ in a register. That is very common in | short functions, and essential for performance. | | It also would mean the compiler would have to load data from | memory way more often, as ?about? every pointer write might | overwrite any memory. | brundolf wrote: | Before reading this I thought working on compilers might be a fun | career direction | vlovich123 wrote: | Does this have implications about the safety guarantees even safe | Rust can make? It seems like incorrect optimization passes could | result in bugs/security issues that are a byproduct of the | compiler rather than the language itself. I wonder how big of an | issue this is for Rust (relatively probably a bigger issue than | for C/C++ where basic resource ownership bugs are more common). | steveklabnik wrote: | Yes, absolutely. https://github.com/rust-lang/rust/issues/28728 | is probably the most famous bug here, but many I-unsound bugs | are similar. | | This would be the case even without this. Compilers are | programs. Programs have bugs. | jswrenn wrote: | Yes, the pointer issues Ralf raises has implications about what | operations Rust can make safe. An example Ralf posed to the | safe transmute working group: a transmutation from a reference | (i.e., a pointer with provenance) to one without (e.g., a "raw" | pointer) or to a `usize` number cannot be safe. (Rust _can_ | safely provide these conversions via the `as` keyword, just not | via the more general framework of transmutations). | | More broadly: Rust _is_ very sensitive to LLVM bugs that | inadvertently generate UB. The longstanding issues with | `noalias` optimizations is the prime example of this: | https://stackoverflow.com/a/57259339/7501301 | dbaupp wrote: | It's enough of an issue for C at least that a fully formally | verified compiler exists: | https://en.m.wikipedia.org/wiki/CompCert . That is, there's | proof that its (limited) optimisation passes are correct. | | There's other approaches to assurance of compiled code too, | such as seL4 which has proofs that the binary itself is | correct, I believe. | siraben wrote: | > However, to make the argument that an optimization is correct, | the exact semantics of LLVM IR (what the behavior of all possible | programs is and when they have UB) needs to be documented. | | This is a great point as to why formal semantics of programming | languages matters. Even if an optimization seems "obviously" | correct, finicky things like introducing the possibility of UB | can, as the post outlines, cascade into compiler passes that | change the program more and more drastically. The post mentions | that one need not go overly formal with the semantics, but I'll | demonstrate what could happen when one does. | | One possible formulation of semantics is denotational semantics, | which describes how to map a program source syntax into values, | i.e. eval : Expr -> Val. So when we have an optimization opt : | Expr -> Expr, the desired correctness property for that | optimization is that Definition opt_correct (opt | : Expr -> Expr) := forall (e : Expr), eval e = eval (opt e). | | When we want to rewrite, say e + 0 ==> e, for any expression e, | the correctness can be stated and proved Theorem | e_plus_0_to_e_correct : opt_correct e_plus_0_to_e. | | One claim in the blog post that correct optimization passes, e.g. | opt1 and opt2 compose into another correct optimization pass can | be stated as: Theorem opt_correct_compose (opt1, | opt2 : Expr -> Expr) : opt_correct opt1 -> opt_correct | op2 -> opt_correct (opt1 [?] opt2). | | Which means given any two optimization passes opt1 and opt2 such | that they are correct, composing them preserves correctness. The | proof is simply; eval (opt1 (opt2 e)) = { | since opt1 is correct } eval (opt2 e) = { since | opt2 is correct } eval e | Dylan16807 wrote: | Not sure why you wanted to spend the time to prove "correct | optimizations can be combined", but I take issue with that | proof. It only works for optimizations that give code _exactly_ | the same behavior, which is severely limiting. | samatman wrote: | Excellent article. | | It seems to me (and this is somewhat off-the-cuff) that compilers | have another option: integers _which are cast from pointers_ have | provenance. | | That is, provenance is a taint: you can't clean it off through | casting. So casting from pointer means the user can do integer- | things like addition, but it means the compiler can't do integer- | things like constant folding. | gizmo686 wrote: | I don't think that works. Consider the following (contrived) | program: char* q[1] = {0}; int iq = | (uintptr_t)q; int ip = 0; while (iq>0) { | iq--; ip++; } char* p = (char*) ip; | | You can make this more efficient (albeit no less contrived) by | iterating through iq bit-wise. | | Less contrived would be sending a pointer through some IPC | mechanism (allowing it to be used as an opaque handle | externally, while the program will blindly dereference it when | it is passed back in); or even passing it within a program | across compilation units. | | If you treat provenance like a taint, you need to track every | way it could be propagated, and that does not seem solvable. | samatman wrote: | So what are the optimizations here? iq is tainted, you can't | hoist the loop out. You've said "I'm doing a weird thing here | with an integer which used to be a pointer, you have to | assume that it involves memory and treat it more like a | pointer than an integer" | pierrebai wrote: | The bug is clearly the writing past the end of an array. That is: | *(p+1) = 10; | | If you start with a buggy program, that it stays buggy after | optimizations is not surprising! | | Edit: what;s more, the original form of the program also did not | write through the q pointer and could have had optimizer the | print zero. In the first form, the author assumes the compiler | can track the pointer q through q == qi and avoid the | optimization that would print zero, but fail to track pointer q | through pi + 1 == qi == q. | mskslal wrote: | No, the original program writes through iq, which is well | defined. Only the optimized program writes through p+1. ___________________________________________________________________ (page generated 2020-12-14 23:00 UTC)