[HN Gopher] "Unexplainable" core dump (2011)
       ___________________________________________________________________
        
       "Unexplainable" core dump (2011)
        
       Author : curling_grad
       Score  : 84 points
       Date   : 2023-01-03 13:02 UTC (9 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | markus_zhang wrote:
       | Sounds fun, wish I had the talent to do some serious debugging.
        
       | Izkata wrote:
       | > Our code and compilers are constantly changing, and the problem
       | disappeared as suddenly as it appeared ... only to happen again 2
       | years later in a completely unrelated executable.
       | 
       | It does not encourage me how much this sounds like the short
       | story "Coding Machines". The original post even happened right
       | about 2 years after the short story was posted, then in that
       | comment recurred after another 2 years.
       | 
       | https://www.teamten.com/lawrence/writings/coding-machines/
        
         | moffkalast wrote:
         | Damn, what a story.
         | 
         | Though in reality it definitely would've been some guy on the
         | compiler dev team adding it before publishing binaries. I
         | wonder if you could set it up to inject a halt into compiled
         | code that runs if some conditions are met, crashing most of the
          | world's infrastructure on a predetermined date.
        
         | highspeedbus wrote:
         | That was a great read, thanks.
        
       | ericbarrett wrote:
       | This is great. Reminds me of a crash I saw early in my career. It
       | was a null-pointer exception, except it occurred right after
       | confirming the address was non-null. This was on a single core
       | with a non-preemptible kernel. So the processor just took the
       | wrong branch! There was simply no other explanation.
        
         | mcculley wrote:
         | What hardware/platform was this? I worked on AIX on POWER a
         | long time ago and it had to map the zero page read-only just to
         | support speculative execution of dereferencing the NULL
         | pointer, if I remember right.
         | 
         | If you were on a platform that did this wrong, speculative
         | execution could have been dereferencing the NULL pointer.
        
         | jrpelkonen wrote:
         | Interesting, how did you fix it? Negate the comparison with an
         | appropriate comment?
        
         | Jiro wrote:
         | Are you sure the compiler didn't say "since having a null
         | pointer gives undefined behavior, we can optimize out the part
         | that confirms the address is non-null"?
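
A minimal sketch of the kind of code Jiro's question is about. Dereferencing a null pointer is undefined behavior in C and C++, so once a pointer has been dereferenced an optimizer is permitted to treat a later null check as dead code. The `Packet` type and both function names are hypothetical, and nothing in the thread says the original crash had this shape.

```cpp
#include <cassert>  // used only by the usage example below
#include <cstddef>

// Hypothetical record type, for illustration only.
struct Packet {
    std::size_t length;
};

// Risky ordering (left commented out on purpose): the dereference comes
// first, so a compiler may assume `p` is non-null and drop the check.
// Calling this with nullptr would be undefined behavior.
//
// std::size_t risky_length(const Packet* p) {
//     std::size_t n = p->length;
//     if (p == nullptr) return 0;
//     return n;
// }

// Safe ordering: test the pointer before the first dereference.
std::size_t safe_length(const Packet* p) {
    if (p == nullptr) {
        return 0;
    }
    return p->length;
}
```

For example, `safe_length(nullptr)` returns 0, and `safe_length(&pk)` returns `pk.length` for any valid `Packet pk`. As JoeAltmaier notes further down, only the generated assembly shows which case a given build actually hit.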
        
           | logicchop wrote:
           | This is likely your answer. C++ story. I worked at a large
           | company that had a "no exceptions" policy and a custom
           | operator new. If a new expression failed it would return
           | nullptr instead of throwing. So lots of people wrote
           | "checking" code to make sure the result wasn't nullptr,
           | except that the compiler would always just elide that code
           | since the standard mandates that the result cannot be
           | nullptr. Many weird crashes ensued.
        
             | aw1621107 wrote:
             | There are non-throwing operator new overloads that can
             | return nullptr, but I'm not sure if those are a relatively
             | recent development. Did the non-throwing operator new
             | overloads not exist at the time?
        
               | logicchop wrote:
               | Hard to say. Most of the uses probably predated the
               | custom operator new and so nobody thought about it. Not
               | to mention the places you cannot sneak into to switch to
               | std::nothrow.
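
A short sketch of the `std::nothrow` form mentioned above, which is declared in `<new>` and has been part of the language since C++98. With this form a new-expression may yield `nullptr`, so the null check is meaningful and cannot be assumed away. The function name `try_allocate` is hypothetical, and this is not a reconstruction of the custom operator new logicchop describes.

```cpp
#include <cassert>  // used only by the usage example below
#include <cstddef>
#include <new>

// Allocate `count` zero-initialized ints without throwing.
// Returns nullptr if the allocation fails; the caller owns the buffer
// and must release it with delete[].
int* try_allocate(std::size_t count) {
    int* buffer = new (std::nothrow) int[count]();
    if (buffer == nullptr) {
        return nullptr;  // reachable: the nothrow form may return null
    }
    return buffer;
}
```

A caller would write `int* p = try_allocate(16);`, test `p` against `nullptr`, and later release it with `delete[] p;`.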
        
         | JoeAltmaier wrote:
         | Got to read the assembly to really know what happened.
         | 
         | E.g. if the architecture has pointers with non-address bits
         | (modes or segments or whatever) and those bits were set yet the
         | rest of the address was 'null', and the check was for 'all bits
         | zero' then you could conceivably get that situation.
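
To make that scenario concrete, here is a toy sketch. The layout is an assumption chosen for illustration (a 64-bit value whose top 8 bits hold a tag); it does not describe any particular architecture, and all names are hypothetical.

```cpp
#include <cassert>  // used only by the usage example below
#include <cstdint>

// Hypothetical layout: bits 56..63 carry a tag, bits 0..55 the address.
constexpr std::uint64_t kTagShift = 56;
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << kTagShift) - 1;

// Naive check: any set bit, including a tag bit, makes the value
// look non-null.
bool looks_non_null(std::uint64_t raw) {
    return raw != 0;
}

// Address-aware check: strip the tag bits before comparing.
bool address_is_non_null(std::uint64_t raw) {
    return (raw & kAddressMask) != 0;
}
```

A value with a tag set and a zero address, such as `std::uint64_t{1} << 56`, passes `looks_non_null` but fails `address_is_non_null`, which is the "checked non-null, then faulted on null" situation.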
        
       | dekhn wrote:
       | One of the best bugs I've seen had a description fairly similar
       | to this. Hot routine run at scale (floating point math for ads ML
       | training) fails at a rate of about 0.000000001. Turned out to be
       | a very obscure bug in the context-switching code in the Linux
       | kernel: the FP registers weren't being restored properly.
        
         | logicchop wrote:
          | I suspect that Windows still has a subtle FP restoration bug.
         | We do large scale validation of floating point data and
         | occasionally get ever so subtly different results.
        
           | tremon wrote:
           | Given that you say "subtly", have you ruled out
           | rounding/precision errors? I wouldn't be surprised if some
           | processors would play fast-and-loose with the number of
           | significant bits they really honour.
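
One ordinary source of "ever so subtly different" results is that floating-point addition is not associative, so the same values summed in a different order can give a different total. The sketch below assumes evaluation in IEEE 754 double precision (the usual case on x86-64) without fast-math style optimizations; the function name is hypothetical, and it says nothing about what logicchop's validation actually hits.

```cpp
#include <cassert>  // used only by the usage example below
#include <vector>

// Sum the values strictly left to right, in the order given.
double sum_in_order(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v;
    }
    return total;
}
```

Under that assumption, `sum_in_order({1e16, -1e16, 1.0})` yields 1.0, while `sum_in_order({1e16, 1.0, -1e16})` yields 0.0, because `1e16 + 1.0` rounds back to `1e16`. A parallel or vectorized reduction that changes the summation order can therefore change the low bits of a result without any register-restoration bug.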
        
       | jacooper wrote:
       | Debugging that must've been a PITA for sure.
        
         | cybrox wrote:
         | If you want to experience this in a slightly more controlled
          | way, I'd recommend you give the game "Turing Complete" a spin.
          | 
          | It lets you build your own Turing-complete processor and
          | define a simple assembly language, starting from NAND gates,
          | and you can create your own arbitrarily wild edge cases for
          | specific opcode combinations.
        
       | dekhn wrote:
       | One of the best bugs I've seen had a description fairly similar
       | to this. Hot routine run at scale (floating point math for ads ML
       | training) fails at a rate of about 0.000000001. Turned out to be
       | a very obscure bug in the context-switching code in the Linux
       | kernel: the FP registers weren't being restored properly.
       | 
       | In another one, debugging was aided by the fact that the
       | developers ensured everything was accessed through const
       | pointers, so it wasn't their code corrupting their memory.
        
         | sidewndr46 wrote:
         | It has been a while, but a switch to kernel mode followed by a
         | switch back to the same user mode process doesn't actually mess
         | with FP registers. The idea being, the kernel should not be
         | using those anyways.
         | 
         | Also minor point: a const pointer is a pointer which always
         | points at the same address. You can still change what is
          | pointed at. You probably meant "a pointer to const".
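
The distinction is easy to see in a few lines of C++. The function names below are hypothetical and exist only to show where `const` sits in each declaration.

```cpp
#include <cassert>  // used only by the usage example below

// Const pointer (int* const): the pointer itself cannot be re-aimed,
// but the int it points at can still be modified through it.
void set_through_const_pointer(int* const p, int value) {
    *p = value;      // allowed
    // p = nullptr;  // would not compile: p is const
}

// Pointer to const (const int*): the target cannot be modified through
// this pointer, although the pointer variable could be re-aimed.
int read_through_pointer_to_const(const int* p) {
    // *p = 0;       // would not compile: the target is const via p
    return *p;
}
```

After `int x = 1; set_through_const_pointer(&x, 5);`, `x` is 5, which shows that a const pointer alone does not protect the memory it points at. Only the pointer-to-const form blocks writes through that pointer.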
        
         | lisper wrote:
         | I had one of these back in the 90s that turned out to be a
         | compiler bug. It was code that ran a mobile robot with an arm.
         | Exact same code running on a Sun workstation never failed, but
         | running on an embedded system running vxWorks crashed
         | intermittently, but only when the arm was moving. Entire heap
         | was corrupted, so by the time the crash occurred there was no
         | hope of getting a stack trace or any hint of what went wrong
         | upstream. Turned out to be two mis-ordered instructions that
         | accessed a value on the stack after the stack pointer had been
         | popped. On vxWorks, interrupts used the same stack as the
         | currently running process, so if an interrupt occurred exactly
         | between these two instructions it would clobber that value, and
         | chaos ensued.
         | 
         | Took a full year to figure it out. Good times.
        
           | aw1621107 wrote:
           | How did you end up piecing together what happened?
        
             | lisper wrote:
             | Long story but the tldr is that it happened in two stages.
             | First someone figured out a way to reliably reproduce the
             | problem. And then I spent a very long time single stepping
             | through machine instructions until I had a eureka moment.
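
The race lisper describes can be modeled with a toy stack. This is only a simulation under stated assumptions: the real bug was in compiler-generated machine code on vxWorks, and the `ToyStack` type, its members, and `read_after_release` are hypothetical names invented for the sketch.

```cpp
#include <array>
#include <cassert>  // used only by the usage example below
#include <cstddef>

// Toy model of a downward-growing stack shared by the running task and
// by interrupt handlers.
struct ToyStack {
    std::array<int, 8> slots{};
    std::size_t sp = 8;  // index of the top slot; 8 means empty

    void push(int v) { slots[--sp] = v; }
    void release_top() { ++sp; }  // "pop" by moving the stack pointer only
    int read_slot(std::size_t i) const { return slots[i]; }
};

// Models the mis-ordered sequence: release the slot first, read it second.
// If `interrupt_fires` is true, a handler runs between the two steps and
// reuses the freed slot for its own data.
int read_after_release(ToyStack& stack, bool interrupt_fires) {
    const std::size_t slot = stack.sp;  // where the value lives
    stack.release_top();                // stack pointer moves past the value
    if (interrupt_fires) {
        stack.push(-1);                 // handler overwrites the freed slot
        stack.release_top();            // handler returns
    }
    return stack.read_slot(slot);       // stale read of released memory
}
```

With a value of 42 pushed beforehand, `read_after_release(stack, false)` still returns 42, while `read_after_release(stack, true)` returns the handler's -1. That is why the failure only shows up when an interrupt lands in exactly that window.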
        
       ___________________________________________________________________
       (page generated 2023-01-03 23:00 UTC)