[HN Gopher] My Hardest Bug Ever (2013)
       ___________________________________________________________________
        
       My Hardest Bug Ever (2013)
        
       Author : whack
       Score  : 67 points
       Date   : 2023-03-07 19:52 UTC (3 hours ago)
        
 (HTM) web link (www.gamedeveloper.com)
 (TXT) w3m dump (www.gamedeveloper.com)
        
       | metadat wrote:
       | Previous discussions (slightly tricky to find because the URL has
       | changed)
       | 
       | https://news.ycombinator.com/item?id=6654905 (November 2013; 81
       | comments)
       | 
       | https://news.ycombinator.com/item?id=9738302 (June 2015; 29
       | comments)
       | 
       | https://news.ycombinator.com/item?id=14394095 (May 2017; 7
       | comments)
        
       | ezekg wrote:
       | > As a programmer, you learn to blame your code first, second,
       | and third... and somewhere around 10,000th you blame the
       | compiler. Well down the list after that, you blame the hardware.
       | 
       | I wish this were the case. The average programmer blames whatever
       | library/third-party/etc. they're using, then somewhere around the
       | 10,000th they might blame their own code.
       | 
       | (I run a third-party service and everything is always my fault,
       | even syntax errors.)
        
         | bena wrote:
         | Like everything, it really depends on the person.
         | 
         | I also like to espouse a philosophy that problems should be
         | investigated from inside out. Start with what you had direct
         | control over, assume the issue is with something you did. Then
         | work your way out.
         | 
         | However, I have watched more than one person do the exact
         | opposite: assume everything else was wrong before even looking
         | at their own contributions.
         | 
         | And this holds not just for programming, but for any endeavor.
        
         | yifanl wrote:
         | Why would they make their own life harder?
         | 
         | If there's a bug in 3p code, they'd need to open up a PR to the
         | open source library and be stalled for 3 weeks waiting for the
         | maintainer to see it. If it's a one-line bug in their own code,
         | it's one glance at a stack trace.
        
           | ezekg wrote:
           | I think you misunderstood my comment?
        
           | happytoexplain wrote:
           | There are many strange assumptions here. Why in your example
           | is the library open source? Even when it is, why would the
           | developer be expected to know how to fix it? Why in your
           | example is the bug in the developer's code a "one-line" bug
           | fixable by "one glance at a stack trace"?
           | 
           | The point is that, if the bug's cause is not immediately
           | obvious, some developers tend to jump to "it's the 3rd party
           | library", because in many cases they can then claim to be
           | unable to fix it, or offload the responsibility to the 3rd
           | party.
        
       | gumby wrote:
       | My most memorable hardware bug was nowhere near as hard as this,
       | but I'll never forget it.
       | 
       | Intel was trying to sell the 960s and sent us a dev board with
       | that CPU. Nobody in the company could get it to boot up. It would
       | power up but nothing would show up on the serial port. Eventually
       | it was my turn to look and for some reason I happened to notice a
       | pullup _capacitor_ on the UART VCC. I looked at the schematics
       | and indeed it was there. A simple jumper to bypass it (back in
       | those days we had big, manly components; none of that surface
       | mount shit) and what hey: the serial console responded. It had
       | booted up just fine, but was mute.
       | 
       | After that we could do development but it was immediately clear
       | to me that the 960 was DoA. It's not like we were the first to
       | get that board!
        
       | einpoklum wrote:
       | > As a programmer, you learn to blame your code first, second,
       | and third... and somewhere around 10,000th you blame the
       | compiler. Well down the list after that, you blame the hardware.
       | 
       | So, first - in many settings, the hardware is more likely to be
       | the source of the problem than your compiler; the question is
       | what has more churn - the compiler code or the chip you run on.
       | 
       | But regardless - the compiler is much higher than the 10,000'th
       | item on the blame list. Even mature, popular compilers have bugs!
       | Hell, they have many known, open bugs! The subtle ones, which
       | don't manifest easily, can stay open for quite a long time. See:
       | 
       | https://gcc.gnu.org/bugzilla/
       | 
       | and:
       | 
       | https://bugs.llvm.org/
       | 
       | I personally have encountered and even filed several of them, and
       | it's not like I was trying. Some of these were even the result of
       | "Why does my code not work?" questions on StackOverflow.
       | 
       | One tip, though: Play one compiler against another when you begin
       | suspecting your compiler, or the hardware. The buggy behavior
       | will often be different. And of course run multiple times to
       | check for variation in behavior, like the author had.
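
One way to act on the tip above is sketched here, under stated assumptions. The checksum is a hypothetical stand-in for whatever code is under suspicion, and every function name is invented for illustration. The idea is to build the same file with two compilers and at two optimization levels (for example `gcc -O0`, `gcc -O2`, `clang -O0`, `clang -O2`), call `report` from each build, and diff the printed lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Report which compiler built this file. Clang also defines __GNUC__,
 * so it has to be tested first. */
const char *compiler_name(void)
{
#if defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";
#elif defined(_MSC_VER)
    return "msvc";
#else
    return "unknown";
#endif
}

/* Hypothetical stand-in for the suspect code: a 32-bit FNV-1a hash.
 * It uses only unsigned arithmetic, so every conforming compiler must
 * produce the same value at every optimization level. */
uint32_t suspect_computation(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;
    }
    return hash;
}

/* Repeat the computation and count how many runs disagree with the
 * first one. A nonzero count on deterministic code points at something
 * other than the source: flaky hardware, a data race, or
 * uninitialized memory. */
int count_varying_runs(const unsigned char *data, size_t len, int runs)
{
    assert(runs > 0);
    uint32_t first = suspect_computation(data, len);
    int varying = 0;
    for (int i = 1; i < runs; i++) {
        if (suspect_computation(data, len) != first) {
            varying++;
        }
    }
    return varying;
}

/* Print one line that can be diffed across builds. */
void report(const unsigned char *data, size_t len)
{
    printf("%s result=%08x varying_runs=%d\n",
           compiler_name(),
           (unsigned)suspect_computation(data, len),
           count_varying_runs(data, len, 100));
}
```

If the `result` field differs between builds of well-defined code, the compiler becomes a plausible suspect. If it differs only at higher optimization levels, undefined behavior in the source is still the more likely cause and is worth ruling out first.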
        
         | AshamedCaptain wrote:
         | > But regardless - the compiler is much higher than the
         | 10,000'th item on the blame list. Even mature, popular
         | compilers have bugs! Hell, they have many known, open bugs!
         | 
         | I don't even understand when compilers started being thought of as
         | these perfect, bug-free programs. It's been some kind of
         | gradual change over the decades. A lot of people seem surprised
         | when I mention that around 15 years ago -O3 in gcc was
         | practically unusable. I don't mean "it would actually degrade
         | performance", I mean "it would break your program".
        
           | einpoklum wrote:
           | TBH, I'm surprised by that. I would have thought compiler
           | authors would not have released optimization options in this
           | state - when such breakage is encountered by testers of
           | nightlies or beta releases.
        
       | glonq wrote:
       | Having spent the better part of 30 years working on/with/around
       | embedded systems, I can't even count how many bugs I've bumped
       | into that were hiding in between software and hardware. Or between
       | software and compiler/tools/OS. Or between hardware and spooky RF
       | black magic.
        
       | GlenTheMachine wrote:
       | Oh man.
       | 
       | I was writing the motor controller code for a new submersible
       | robot my PhD lab was building. We had bought one of the very
       | first compact PCI boards on the market, and it was so new we
       | couldn't find any cPCI motor controller cards, so we bought a
       | different format card and a motherboard that converted between
       | compact PCI bus signals and the signals on the controller boards.
       | The controller boards themselves were based around the LM629, an
       | old but widely used motor controller chip.
       | 
       | To interface with the LM629 you have to write to 8-bit registers
       | that are mapped to memory addresses and then read back the
       | result. The 8-bit part is important, because some of the
       | registers are read or write only, and reading or writing to a
       | register that cannot be read from or written to throws the chip
       | into an error state.
       | 
       | LM629s are dead simple, but my code didn't work. It. Did. Not.
       | Work. The chip kept erroring out. I had no idea why. It's almost
       | trivially easy to issue 8-bit reads and writes to specific memory
       | addresses in C. I had been coding in C since I was fifteen years
       | old. I banged my head against it for two weeks.
       | 
       | Eventually we packed up the entire thing in a shipping crate and
       | flew to Minneapolis, the site of the company that made the cards.
       | They looked at my code. They thought it was fine.
       | 
       | After three days the CEO had pity on us poor grad students and
       | detailed his highly paid digital logic analyst to us for an hour.
       | He carted in a crate of electronics that were probably worth
       | about a million dollars. Hooked everything up. Ran my code.
       | 
       | "You're issuing a sixteen-bit read, which is reading both the
       | correct read-only register and the next adjacent register, which
       | is write-only", he said.
       | 
       | I showed him in my code where the read in question was very
       | clearly a *CHAR*. 8 bits.
       | 
       | "I dunno," he said - "I can only say what the digital logic
       | analyzer shows, which is that you're issuing a sixteen bit read."
       | 
       | Eventually, we found it. The Intel bridge chip that did the bus
       | conversion had a known bug, which was clearly documented in an
       | 8-point footnote on page 79 of the manual: 8 bit reads were
       | translated to 16 bit reads on the cPCI bus, and then the 8 most
       | significant bits were thrown away.
       | 
       | In other words, a hardware bug. One that would only manifest in
       | these _very_ specific circumstances.
       | 
       | We fixed it by taking a razor knife to the bus address lines and
       | shifting them to the right by one, and then taking the least
       | significant line and mapping it all the way over to the left, so
       | that even and odd addresses resolved to completely different
       | memory banks. Thus, reads to odd addresses resolved to addresses
       | way outside those the chip was mapped to, and it never saw them.
       | Adjusted the code to the (new) correct address range. Worked like
       | a charm.
       | 
       | But I feel bad for the next grad student who had to work on that
       | robot. "You are not expected to understand this."
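
The rewiring described above can be modeled in a few lines. This is a sketch under assumptions, not the board's actual design: it treats the fix as a rotate-right-by-one of the address lines, and `ADDR_BITS`, `CHIP_REGS`, and every function name are hypothetical, since the comment gives neither the real address width nor the register map.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the card decodes 8 address lines and the chip occupies
 * the lowest 4 addresses. Both numbers are for illustration only. */
#define ADDR_BITS 8u
#define ADDR_MASK ((1u << ADDR_BITS) - 1u)
#define CHIP_REGS 4u

/* Model of the razor-knife fix: every address line moves one position
 * to the right and the old least significant line becomes the most
 * significant one (a rotate right by one). */
uint32_t rewired_address(uint32_t cpu_addr)
{
    uint32_t a = cpu_addr & ADDR_MASK;
    return (a >> 1) | ((a & 1u) << (ADDR_BITS - 1u));
}

/* After rewiring, chip register `reg` lives at an even CPU address. */
uint32_t cpu_address_for_register(uint32_t reg)
{
    return (reg << 1) & ADDR_MASK;
}

/* Nonzero if the address that reaches the card falls inside the chip. */
int chip_sees(uint32_t device_addr)
{
    return device_addr < CHIP_REGS;
}

/* The bridge widens an 8-bit read into a 16-bit read, so the bytes at
 * cpu_addr and cpu_addr + 1 are both accessed. Count how many of them
 * land on a chip register, before and after the rewiring. */
int touched_before_rewiring(uint32_t cpu_addr)
{
    return chip_sees(cpu_addr) + chip_sees(cpu_addr + 1u);
}

int touched_after_rewiring(uint32_t cpu_addr)
{
    return chip_sees(rewired_address(cpu_addr))
         + chip_sees(rewired_address(cpu_addr + 1u));
}

/* An 8-bit register read as it would be written in C. `base` stands in
 * for the memory-mapped window; volatile keeps the compiler from
 * caching or eliding the access. */
uint8_t read_reg8(volatile const uint8_t *base, uint32_t reg)
{
    return base[cpu_address_for_register(reg)];
}

/* Worked example for register 1. */
void check_rewiring(void)
{
    /* Old layout: a widened read at address 1 also hits register 2. */
    assert(touched_before_rewiring(1u) == 2);
    /* New layout: register 1 sits at CPU address 2, and the extra
     * byte at address 3 maps to 129, far outside the chip. */
    assert(cpu_address_for_register(1u) == 2u);
    assert(rewired_address(3u) == 129u);
    assert(touched_after_rewiring(2u) == 1);
}
```

The software side of the fix is the doubling in `cpu_address_for_register`: once odd addresses resolve to a different bank, the adjacent register is never touched by the widened read, whatever the bridge does on the bus.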
        
         | nameoda wrote:
         | It's not a bug! It's a clearly documented feature! /s
        
       | whitewingjek wrote:
       | Previously discussed:
       | 
       | https://news.ycombinator.com/item?id=6654905 (81 comments)
       | 
       | https://news.ycombinator.com/item?id=9738302 (29 comments)
       | 
       | https://news.ycombinator.com/item?id=14394095 (7 comments)
        
         | Ruq wrote:
         | What can you say? It's a classic reading!
        
           | dang wrote:
           | Reposts of classics are most welcome on HN!
           | 
           | We do try to space them out a bit, to avoid too much
           | repetition, but anything up to once a year is fine. This one
           | hasn't had a thread since 2017, so completely ok.
           | 
           | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.
           | ..
        
       | yellow_lead wrote:
       | 2013
        
         | dang wrote:
         | Added. Thanks!
        
       | toolslive wrote:
       | I once (about 10y ago) experienced hardware that got tired. A
       | customer replaced the usual hard disks with shiny new Seagate SMR
       | drives, because they had more storage capacity. Funny thing is
       | that they could not handle the sustained 100MB/s we were feeding
       | them. So after about 20 minutes they started slowing down and
       | after half an hour they stopped working for about 20 minutes and
       | then they were fine again. Obviously the customer complained
       | about our storage product and forgot to mention this small fact.
       | Once we figured it out we had a good laugh.
        
         | _a_a_a_ wrote:
         | That's interesting. My old server about 10 years ago had a
         | Seagate black which died. I replaced it with a Seagate green. I
         | noticed things started slowing down and down when the disc
         | writes got heavy. It could freeze up for minutes at a time,
         | then recover without any errors. It took me weeks to realise
         | what was happening because... Because I don't actually know
         | why. In hindsight it was obvious. Maybe the Seagate green was a
         | SMR drive. Either way, it was nasty and caused a lot of
         | frustration.
         | 
         | A quick check just now and it seems that the Seagate green were
         | SMR. Fuckers never put that on the box did they. Bastards.
        
           | favorited wrote:
           | A couple years ago, Western Digital quietly changed their WD
           | Red line (which is explicitly marketed as being for NAS use)
           | to SMR.
           | 
           | https://www.tomshardware.com/news/wd-addresses-smr-
           | controver...
        
       | oifjsidjf wrote:
       | I've never seen such annoying ads on any website: the ad size
       | changes every ~30 seconds which rearranges the text flow of the
       | article completely and I get lost.
        
         | ezekg wrote:
         | How have you survived the nets this long without an ad blocker?
        
       ___________________________________________________________________
       (page generated 2023-03-07 23:00 UTC)