[HN Gopher] "Unexplainable" core dump (2011)
___________________________________________________________________

Author : curling_grad
Score  : 84 points
Date   : 2023-01-03 13:02 UTC (9 hours ago)

(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)

| markus_zhang wrote:
| Sounds fun, wish I had the talent to do some serious debugging.
| Izkata wrote:
| > Our code and compilers are constantly changing, and the problem
| disappeared as suddenly as it appeared ... only to happen again 2
| years later in a completely unrelated executable.
|
| It does not encourage me how much this sounds like the short
| story "Coding Machines". The original post even happened right
| about 2 years after the short story was posted, then in that
| comment recurred after another 2 years.
|
| https://www.teamten.com/lawrence/writings/coding-machines/
| moffkalast wrote:
| Damn, what a story.
|
| Though in reality it definitely would've been some guy on the
| compiler dev team adding it before publishing binaries. I
| wonder if you could set it up to inject a halt into compiled
| code that runs if some conditions are met, crashing most of the
| world's infrastructure on a predetermined date.
| highspeedbus wrote:
| That was a great read, thanks.
| ericbarrett wrote:
| This is great. Reminds me of a crash I saw early in my career. It
| was a null-pointer exception, except it occurred right after
| confirming the address was non-null. This was on a single core
| with a non-preemptible kernel. So the processor just took the
| wrong branch! There was simply no other explanation.
| mcculley wrote:
| What hardware/platform was this? I worked on AIX on POWER a
| long time ago and it had to map the zero page read-only just to
| support speculative execution of dereferencing the NULL
| pointer, if I remember right.
|
| If you were on a platform that did this wrong, speculative
| execution could have been dereferencing the NULL pointer.
| jrpelkonen wrote:
| Interesting, how did you fix it? Negate the comparison with an
| appropriate comment?
| Jiro wrote:
| Are you sure the compiler didn't say "since having a null
| pointer gives undefined behavior, we can optimize out the part
| that confirms the address is non-null"?
| logicchop wrote:
| This is likely your answer. C++ story. I worked at a large
| company that had a "no exceptions" policy and a custom
| operator new. If a new expression failed it would return
| nullptr instead of throwing. So lots of people wrote
| "checking" code to make sure the result wasn't nullptr,
| except that the compiler would always just elide that code
| since the standard mandates that the result cannot be
| nullptr. Many weird crashes ensued.
| aw1621107 wrote:
| There are non-throwing operator new overloads that can
| return nullptr, but I'm not sure if those are a relatively
| recent development. Did the non-throwing operator new
| overloads not exist at the time?
| logicchop wrote:
| Hard to say. Most of the uses probably predated the
| custom operator new and so nobody thought about it. Not
| to mention the places you cannot sneak into to switch to
| std::nothrow.
| JoeAltmaier wrote:
| Got to read the assembly to really know what happened.
|
| E.g. if the architecture has pointers with non-address bits
| (modes or segments or whatever) and those bits were set yet the
| rest of the address was 'null', and the check was for 'all bits
| zero' then you could conceivably get that situation.
| dekhn wrote:
| One of the best bugs I've seen had a description fairly similar
| to this. Hot routine run at scale (floating point math for ads ML
| training) fails at a rate of about 0.000000001. Turned out to be
| a very obscure bug in the context switching code in the Linux
| kernel: the FP registers weren't being restored properly.
| logicchop wrote:
| I suspect that Windows still has a subtle FP restoration bug.
| We do large-scale validation of floating point data and
| occasionally get ever so subtly different results.
| tremon wrote:
| Given that you say "subtly", have you ruled out
| rounding/precision errors? I wouldn't be surprised if some
| processors would play fast-and-loose with the number of
| significant bits they really honour.
| jacooper wrote:
| Debugging that must've been a PITA for sure.
| cybrox wrote:
| If you want to experience this in a slightly more controlled
| way, I'd recommend you give the game "Turing Complete" a spin.
|
| It lets you build your own Turing-complete processor and
| define a simple assembly language, starting from NAND gates, and
| you can create your own arbitrarily wild edge cases for
| specific opcode combinations.
| dekhn wrote:
| Another one: the debugging was aided by the fact that the
| developers ensured that everything was accessed through const
| pointers, so it wasn't their code corrupting their memory.
| sidewndr46 wrote:
| It has been a while, but a switch to kernel mode followed by a
| switch back to the same user mode process doesn't actually mess
| with FP registers. The idea being, the kernel should not be
| using those anyways.
|
| Also minor point: a const pointer is a pointer which always
| points at the same address. You can still change what is
| pointed at. You probably meant "a pointer to const".
| lisper wrote:
| I had one of these back in the 90s that turned out to be a
| compiler bug. It was code that ran a mobile robot with an arm.
| Exact same code running on a Sun workstation never failed, but
| running on an embedded system running VxWorks crashed
| intermittently, but only when the arm was moving. Entire heap
| was corrupted, so by the time the crash occurred there was no
| hope of getting a stack trace or any hint of what went wrong
| upstream. Turned out to be two mis-ordered instructions that
| accessed a value on the stack after the stack pointer had been
| popped. On VxWorks, interrupts used the same stack as the
| currently running process, so if an interrupt occurred exactly
| between these two instructions it would clobber that value, and
| chaos ensued.
|
| Took a full year to figure it out. Good times.
| aw1621107 wrote:
| How did you end up piecing together what happened?
| lisper wrote:
| Long story but the tl;dr is that it happened in two stages.
| First someone figured out a way to reliably reproduce the
| problem. And then I spent a very long time single-stepping
| through machine instructions until I had a eureka moment.
___________________________________________________________________
(page generated 2023-01-03 23:00 UTC)
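
Note on sidewndr46's remark about const pointers versus pointers to const: the distinction can be sketched in C++ roughly as below. The names read_through, bump_through and demo are hypothetical and do not come from the thread; this only illustrates the language rule and is not the code dekhn described. It also shows that a pointer to const only blocks writes made through that pointer, so the memory can still change via another path.

```cpp
#include <cassert>

// Pointer to const: the pointee cannot be modified through p,
// although p itself could be re-pointed inside the function.
int read_through(const int* p) {
    // *p = 0;        // would not compile: pointee is const via p
    return *p;
}

// Const pointer: p always holds the same address,
// but the pointee is still writable through it.
void bump_through(int* const p) {
    // p = nullptr;   // would not compile: p itself is const
    *p += 1;
}

// Hypothetical demo: a read-only view still observes a write
// made through a different, non-const path to the same object.
int demo() {
    int value = 41;
    const int* view = &value;   // read-only view of value
    bump_through(&value);       // write through the const pointer
    assert(*view == 42);        // the view sees the new value
    return read_through(view);
}
```

Calling demo() returns 42. Under these assumptions, accessing everything through pointers to const would rule out writes through those pointers, but not writes from elsewhere in the program.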