[HN Gopher] My Hardest Bug Ever (2013)
___________________________________________________________________
My Hardest Bug Ever (2013)

Author : whack
Score  : 67 points
Date   : 2023-03-07 19:52 UTC (3 hours ago)

(HTM) web link (www.gamedeveloper.com)
(TXT) w3m dump (www.gamedeveloper.com)

| metadat wrote:
| Previous discussions (slightly tricky to find because the URL has
| changed)
|
| https://news.ycombinator.com/item?id=6654905 (November 2013; 81
| comments)
|
| https://news.ycombinator.com/item?id=9738302 (June 2015; 29
| comments)
|
| https://news.ycombinator.com/item?id=14394095 (May 2017, 7
| comments)
| ezekg wrote:
| > As a programmer, you learn to blame your code first, second,
| and third... and somewhere around 10,000th you blame the
| compiler. Well down the list after that, you blame the hardware.
|
| I wish this were the case. The average programmer blames whatever
| library/third-party/etc. they're using, then somewhere around the
| 10,000th they might blame their own code.
|
| (I run a third-party service and everything is always my fault,
| even syntax errors.)
| bena wrote:
| Like everything, it really depends on the person.
|
| I also like to espouse a philosophy that problems should be
| investigated from inside out. Start with what you had direct
| control over, assume the issue is with something you did. Then
| work your way out.
|
| However, I have watched more than one person do the exact
| opposite: assume everything else was wrong before even looking
| at their own contributions.
|
| And this holds not just for programming, but for any endeavor.
| yifanl wrote:
| Why would they make their own life harder?
|
| If there's a bug in 3p code, they'd need to open up a PR to the
| open source library and be stalled on 3 weeks for the
| maintainer to see it. If it's a one-line bug in their own code,
| it's one glance at a stack trace.
| ezekg wrote:
| I think you misunderstood my comment?
| happytoexplain wrote:
| There are many strange assumptions here. Why in your example
| is the library open source? Even when it is, why would the
| developer be expected to know how to fix it? Why in your
| example is the bug in the developer's code a "one-line" bug
| fixable by "one glance at a stack trace"?
|
| The point is that, if the bug's cause is not immediately
| obvious, some developers tend to jump to "it's the 3rd party
| library", because in many cases they can then claim to be
| unable to fix it, or offload the responsibility to the 3rd
| party.
| gumby wrote:
| My most memorable hardware bug was nowhere near as hard as this,
| but I'll never forget it.
|
| Intel was trying to sell the 960s and sent us a dev board with
| that CPU. Nobody in the company could get it to boot up. It would
| power up but nothing would show up on the serial port. Eventually
| it was my turn to look and for some reason I happened to notice a
| pullup _capacitor_ on the UART VCC. I looked at the schematics
| and indeed it was there. A simple jumper to bypass it (back in
| those days we had big, manly components; none of that surface
| mount shit) and what hey: the serial console responded. It had
| booted up just fine, but was mute.
|
| After that we could do development but it was immediately clear
| to me that the 960 was DoA. It's not like we were the first to
| get that board!
| einpoklum wrote:
| > As a programmer, you learn to blame your code first, second,
| and third... and somewhere around 10,000th you blame the
| compiler. Well down the list after that, you blame the hardware.
|
| So, first - in many settings, the hardware is more likely to be
| the source of the problem than your compiler; the question is
| what has more churn - the compiler code or the chip you run on.
|
| But regardless - the compiler is much higher than the 10,000'th
| item on the blame list. Even mature, popular compilers have bugs!
| Hell, they have many known, open bugs!
| The subtle ones, which
| don't manifest easily, can stay open for quite a long time. See:
|
| https://gcc.gnu.org/bugzilla/
|
| and:
|
| https://bugs.llvm.org/
|
| I personally have encountered and even filed several of them, and
| it's not like I was trying. Some of these were even the result of
| "Why does my code not work?" questions on StackOverflow.
|
| One tip, though: Play one compiler against another when you begin
| suspecting your compiler, or the hardware. The buggy behavior
| will often be different. And of course run multiple times to
| check for variation in behavior, like the author had.
| AshamedCaptain wrote:
| > But regardless - the compiler is much higher than the
| 10,000'th item on the blame list. Even mature, popular
| compilers have bugs! Hell, they have many known, open bugs!
|
| I don't even understand when compilers started being thought of
| as these perfect, bug-free programs. It's been some kind of
| gradual change over the decades. A lot of people seem surprised
| when I mention that around 15 years ago -O3 in gcc was
| practically unusable. I don't mean "it would actually degrade
| performance", I mean "it would break your program".
| einpoklum wrote:
| TBH, I'm surprised by that. I would have thought compiler
| authors would not have released optimization options in this
| state - when such breakage is encountered by testers of
| nightlies or beta releases.
| glonq wrote:
| Having spent the better part of 30 years working on/with/around
| embedded systems, I can't even count how many bugs I've bumped
| into that were hiding in between software and hardware. Or between
| software and compiler/tools/OS. Or between hardware and spooky RF
| black magic.
| GlenTheMachine wrote:
| Oh man.
|
| I was writing the motor controller code for a new submersible
| robot my PhD lab was building.
| We had bought one of the very
| first compact PCI boards on the market, and it was so new we
| couldn't find any cPCI motor controller cards, so we bought a
| different format card and a motherboard that converted between
| compact PCI bus signals and the signals on the controller boards.
| The controller boards themselves were based around the LM629, an
| old but widely used motor controller chip.
|
| To interface with the LM629 you have to write to 8-bit registers
| that are mapped to memory addresses and then read back the
| result. The 8-bit part is important, because some of the
| registers are read or write only, and reading or writing to a
| register that cannot be read from or written to throws the chip
| into an error state.
|
| LM629s are dead simple, but my code didn't work. It. Did. Not.
| Work. The chip kept erroring out. I had no idea why. It's almost
| trivially easy to issue 8-bit reads and writes to specific memory
| addresses in C. I had been coding in C since I was fifteen years
| old. I banged my head against it for two weeks.
|
| Eventually we packed up the entire thing in a shipping crate and
| flew to Minneapolis, the site of the company that made the cards.
| They looked at my code. They thought it was fine.
|
| After three days the CEO had pity on us poor grad students and
| detailed his highly paid digital logic analyst to us for an hour.
| He carted in a crate of electronics that were probably worth
| about a million dollars. Hooked everything up. Ran my code.
|
| "You're issuing a sixteen-bit read, which is reading both the
| correct read-only register and the next adjacent register, which
| is write-only", he said.
|
| I showed him in my code where the read in question was very
| clearly a *CHAR*. 8 bits.
|
| "I dunno," he said - "I can only say what the digital logic
| analyzer shows, which is that you're issuing a sixteen bit read."
|
| Eventually, we found it.
| The Intel bridge chip that did the bus
| conversion had a known bug, which was clearly documented in an
| 8-point footnote on page 79 of the manual: 8 bit reads were
| translated to 16 bit reads on the cPCI bus, and then the 8 most
| significant bits were thrown away.
|
| In other words, a hardware bug. One that would only manifest in
| these _very_ specific circumstances.
|
| We fixed it by taking a razor knife to the bus address lines and
| shifting them to the right by one, and then taking the least
| significant line and mapping it all the way over to the left, so
| that even and odd addresses resolved to completely different
| memory banks. Thus, reads to odd addresses resolved to addresses
| way outside those the chip was mapped to, and it never saw them.
| Adjusted the code to the (new) correct address range. Worked like
| a charm.
|
| But I feel bad for the next grad student who had to work on that
| robot. "You are not expected to understand this."
| nameoda wrote:
| It's not a bug! It's a clearly documented feature! /s
| whitewingjek wrote:
| Previously discussed:
|
| https://news.ycombinator.com/item?id=6654905 (81 comments)
|
| https://news.ycombinator.com/item?id=9738302 (29 comments)
|
| https://news.ycombinator.com/item?id=14394095 (7 comments)
| Ruq wrote:
| What can you say? It's a classic reading!
| dang wrote:
| Reposts of classics are most welcome on HN!
|
| We do try to space them out a bit, to avoid too much
| repetition, but anything up to once a year is fine. This one
| hasn't had a thread since 2017, so completely ok.
|
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.
| ..
| yellow_lead wrote:
| 2013
| dang wrote:
| Added. Thanks!
| toolslive wrote:
| I once (about 10y ago) experienced hardware that got tired. A
| customer replaced the usual hard disks with shiny new Seagate SMR
| drives, because they had more storage capacity. Funny thing is
| that they could not handle the sustained 100MB/s we were feeding
| them.
| So after about 20 minutes they started slowing down and
| after half an hour they stopped working for about 20 minutes and
| then they were fine again. Obviously the customer complained
| about our storage product and forgot to mention this small fact.
| Once we figured it out we had a good laugh.
| _a_a_a_ wrote:
| That's interesting. My old server about 10 years ago had a
| Seagate black which died. I replaced it with a Seagate green. I
| noticed things started slowing down and down when the disc
| writes got heavy. It could freeze up for minutes at a time,
| then recover without any errors. It took me weeks to realise
| what was happening because... Because I don't actually know
| why. In hindsight it was obvious. Maybe the Seagate green was an
| SMR drive. Either way, it was nasty and caused a lot of
| frustration.
|
| A quick check just now and it seems that the Seagate green were
| SMR. Fuckers never put that on the box did they. Bastards.
| favorited wrote:
| A couple years ago, Western Digital quietly changed their WD
| Red line (which is explicitly marketed as being for NAS use)
| to SMR.
|
| https://www.tomshardware.com/news/wd-addresses-smr-
| controver...
| oifjsidjf wrote:
| I've never seen such annoying ads on any website: the ad size
| changes every ~30 seconds which rearranges the text flow of the
| article completely and I get lost.
| ezekg wrote:
| How have you survived the nets this long without an ad blocker?
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)