[HN Gopher] As winter approaches, here's a story about why hardw...
       ___________________________________________________________________
        
       As winter approaches, here's a story about why hardware is hard
        
       Author : mooreds
       Score  : 125 points
       Date   : 2022-12-17 14:28 UTC (8 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | lostlogin wrote:
       | Thought this was going to be about Iran's drones, with which
       | Russia is smashing up Ukraine. They are reported to be sensitive
       | to low temperatures.
       | 
       | https://www.nytimes.com/2022/12/14/world/europe/ukraine-russ...
        
       | Eleison23 wrote:
       | Chris Siebenmann has been blogging about how cold affects his
       | workstation.
       | https://utcc.utoronto.ca/~cks/space/blog/tech/ColdLockupMach...
        
       | mdorazio wrote:
       | I really wish hardware startups would at least hire people early
       | on who have experience with actual manufacturing. Stuff like this
       | really shouldn't happen and really shouldn't be a surprise,
       | either. Temperature considerations are a standard part of the
       | design criteria and testing in any mechanical device, and
       | standard root cause analysis (did they even do FMEA?) would have
       | forced them to find the problem the first time, not the second.
       | When you're still in the early prototype phase, fine, but this is
       | a 25? person team selling a product with a subscription and
       | everything.
       | 
       | Edit: just realized their stated fix is to remove two resistors.
       | Did they actually test the full effects of that change? Seems
       | like there's a decent chance of a third head-scratcher a few
       | months from now...
        
         | Game_Ender wrote:
         | As a software person, this is the first I have heard of FMEA
         | (Failure Mode and Effects Analysis) [0]. Sounds like a very
         | rigorous way to move through a system and identify all the ways
         | it can break and develop ways to fix it.
         | 
         | 0 -
         | https://en.wikipedia.org/wiki/Failure_mode_and_effects_analy...
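
For readers who are also new to FMEA: a common form of the worksheet scores each failure mode for severity, occurrence, and difficulty of detection (typically 1 to 10 each) and multiplies the three into a Risk Priority Number (RPN) to decide what to investigate first. The sketch below is only an illustration of that arithmetic. The failure modes and every score in it are invented for the example and are not taken from the robot in the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (hazardous)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (never caught before shipping)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity * occurrence * detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet rows; every score here is made up.
WORKSHEET = [
    FailureMode("coupler loosens under vibration", 6, 3, 4),
    FailureMode("sensor reading drifts when cold", 8, 5, 8),
    FailureMode("connector corrodes", 5, 2, 3),
]

if __name__ == "__main__":
    for mode in rank_by_risk(WORKSHEET):
        print(f"{mode.rpn:4d}  {mode.name}")
```

A failure mode that is hard to detect before shipping gets a high detection score, which is how a cold-only fault can end up at the top of the list even when it is not the most frequent one.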
        
         | Tempest1981 wrote:
         | Yep, we used a temperature chamber that could ramp from -40°C
         | to +80°C, along with humidity. We would push the limits until
         | something broke, fix the design, and repeat.
         | 
         | Surprised, but not surprised, that everyone doesn't do at least
         | some stress testing.
         | 
         | I was expecting something more complicated, like
         | electromagnetic interference from a passing vehicle. Or ESD
         | (static discharge).
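
The "push the limits until something broke" loop described above can be sketched as a sweep over chamber setpoints that records the widest temperature span in which the device still passes. This is a minimal sketch under stated assumptions: `passes_at` stands in for whatever would set the chamber, wait for the unit to soak, and run a functional test, and `fake_device_ok` with its -5 to 60 °C window is entirely hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def temperature_steps(start_c: int, stop_c: int, step_c: int) -> List[int]:
    """Chamber setpoints from start_c up to stop_c (inclusive when the step lands on it)."""
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    return list(range(start_c, stop_c + 1, step_c))


def find_working_range(
    passes_at: Callable[[int], bool], setpoints: List[int]
) -> Optional[Tuple[int, int]]:
    """Widest contiguous (low, high) run of passing setpoints, or None if all fail."""
    best: Optional[Tuple[int, int]] = None
    run_start: Optional[int] = None
    for temp_c in setpoints:
        if passes_at(temp_c):
            if run_start is None:
                run_start = temp_c
            if best is None or temp_c - run_start > best[1] - best[0]:
                best = (run_start, temp_c)
        else:
            run_start = None  # a failure ends the current passing run
    return best


def fake_device_ok(temp_c: int) -> bool:
    """Stand-in for a real functional test; the -5..60 °C window is made up."""
    return -5 <= temp_c <= 60


if __name__ == "__main__":
    print(find_working_range(fake_device_ok, temperature_steps(-40, 80, 5)))
```

Comparing the measured span against the range promised on the datasheet shows how much margin the design has, and a gap in the middle of the sweep would point at a fault that only appears within a narrow band.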
        
         | exmadscientist wrote:
         | > standard root cause analysis (did they even do FMEA?) would
         | have forced them to find the problem the first time, not the
         | second
         | 
         | Not to mention that the standard checklist for solving "why is
         | this joint having problems" is "is it cold or does it get worse
         | when we make it cold"....
         | 
         | > just realized their stated fix is to remove two resistors.
         | Did they actually test the full effects of that change? Seems
         | like there's a decent chance of a third head-scratcher a few
         | months from now...
         | 
         | As a half-decent analog guy I can actually believe this. I can
         | imagine a lot of ways something like this _could_ fix a
         | problem, though of course I haven't seen any schematics here.
        
           | pclmulqdq wrote:
           | I'm guessing that a robotics company actually has very few
           | analog-competent engineers. It may well have solved the issue
           | if one of them had built an analog circuit. A lot of
           | engineers don't think about operating conditions when they
           | first learn to build circuits, and run them near the edge of
           | their specified tolerances, so moving the component into a
           | "comfortable" range would fix it.
           | 
           | However, if they were working with a COTS component and took
           | some resistors off that, I doubt it was actually the right
           | fix.
        
         | pifm_guy wrote:
         | What's the betting that in the hot weather their Texas
         | customers with those resistors removed will find their robots
         | nonfunctional again...?
        
         | steve_adams_86 wrote:
         | I'm just a bum with an automated hydroponic garden and I
         | verified and tested all of my components for expected behaviour
         | in the normal conditions the system will be in. I had to
         | exclude several components as a result.
         | 
         | On one hand I'm surprised they didn't do that, but on the
         | other, I have no time constraints and I'm very, very uncertain
         | of what I'm doing most of the time. I knew I wanted to have
         | accurate readings and reliable performance so my garden doesn't
         | die due to malfunctions. So, I goofed around and made sure
         | stuff was right. I'd do the same with code in my spare time,
         | but my employers have me cut corners all the time. People could
         | criticize me for it, but it's not as though I don't know
         | better. It might be similar for this team. By the time the bugs
         | strike, it's not clear what's been properly vetted or who knows
         | what about which components. Debugging becomes harder because
         | the initial spec and how well it was met is no longer clear.
         | 
         | I also discovered the ruggeduino line of arduino boards in the
         | process which are pretty cool. Overkill for my use case, but I
         | hope to have a use for one some day. I'm thinking of making a
         | robot shop vac and metal-picker-upper, and the ruggeduino would
         | be great there I think.
        
       | ThrowawayTestr wrote:
       | Temperature dependent problems are the hardest to reproduce.
       | Often they only occur within a specific range of temps.
        
       | dagw wrote:
       | I met a guy who worked for IBM on the AS/400s. Apparently they
       | had a 'server room' where they could control the temperature and
       | humidity to basically any combination that was likely to occur in
       | the real world and they would test all their hardware there under
       | any extreme condition they could think of.
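
Testing "any combination" of temperature and humidity is often approximated by testing the corners of the specified envelope, meaning every pairing of each parameter's low and high limit. The sketch below enumerates those corners. The limit values are hypothetical and chosen only for illustration, not taken from any IBM specification.

```python
from itertools import product
from typing import Dict, List, Tuple


def corner_cases(limits: Dict[str, Tuple[float, float]]) -> List[Dict[str, float]]:
    """Every combination of each parameter's low and high limit."""
    names = list(limits)
    return [
        dict(zip(names, combo))
        for combo in product(*(limits[name] for name in names))
    ]


# Hypothetical datasheet limits, chosen only for illustration.
LIMITS = {"temp_c": (-20, 50), "humidity_pct": (10, 95)}

if __name__ == "__main__":
    for case in corner_cases(LIMITS):
        print(case)
```

The number of corners doubles with each added parameter (vibration, supply voltage, altitude), which is one reason teams have to choose which environmental factors to qualify against.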
        
         | dboreham wrote:
         | All hardware produced in a professional manner is tested like
         | this.
        
       | h2odragon wrote:
       | Wonder what the chances are that the first iteration of "problem
       | solved!" actually _was_ a lurking problem that would've bitten
       | as well, sooner or later.
        
       | nixpulvis wrote:
       | Reminds me of the time I bought a new iPhone and the touchscreen
       | didn't work properly when it was below about 20 degrees.
       | 
       | Leave it to Californians to forget that winter exists.
        
         | ghaff wrote:
         | In fairness, it's generally known that a lot of consumer
         | electronics have problems in freezing weather unless they're
         | protected somehow (e.g. kept in an inner pocket) if only
         | because of the battery. Some of it is possibly California
         | designers not being especially focused on the sub-zero use case
         | but it's also not clear how much focus there should be on that
         | use case in general if there are costs (money, physical specs)
         | associated with doing so.
        
           | nixpulvis wrote:
           | Let's see... does the object go outside? If yes, it will
           | probably experience below freezing temperature.
           | 
           | This isn't rocket science.
           | 
           | Now, if you happen to be OK with selling a product that isn't
           | reliable, now's a good time! I still need some more toys for
           | the holidays.
        
             | ghaff wrote:
             | Maybe you should start up your 40 below Android phone
             | company. Enjoy!
        
       | gumby wrote:
       | This is why I was shocked by Unix producing core dumps (I'd
       | previously gotten my experience on ITS and Lispms where
       | everything ran in the debugger).
       | 
       | In a core dump all the IO (network connections, files, etc) is
       | closed while a debugger gives you access to the functioning
       | environment.
        
       | joezydeco wrote:
       | Any developer, hardware _or_ software, will tell you that
       | reproducibility is the key to solving a problem.
       | 
       | But reproducibility can be vague and sometimes, when you're under
       | pressure, you can be quick to point to something and declare
       | "aha! that's the root cause!" and be totally wrong.
        
         | JoeAltmaier wrote:
         | Or even be totally right, but there's more to the problem.
         | Peeling the onion.
         | 
         | We hope that fixing one piece of code will solve three or four
         | exhibited problems. But it's often more like, change three or
         | four pieces of code to make one problem go away.
         | 
         | As Heinlein is purported to have written, "If it's not one
         | thing, it's two things."
        
           | eesmith wrote:
           | I've read about 98% of everything Heinlein published[1], and
           | don't recognize that quote.
           | 
           | An archive.org search finds a few examples, like this 1979
           | Harlequin short story
           | https://archive.org/details/romanticshortsto00harl/page/30/m...
           | 
           | If Heinlein did use a phrase like it, I expect my searches
           | would have found it.
           | 
           | It doesn't appear to be a common saying, so I'm curious how
           | you acquired the association between it and Heinlein. It
           | doesn't seem like a common misquote people end up spreading.
           | 
           | [1] I've only read "Tramp Royale" up to the point where they
           | left the US, I haven't read the "stinkeroos", nor his
           | posthumous novels, nor most of what Wikipedia lists under
           | "Other short fiction", nor a couple more non-fiction
           | publications.
        
             | morphle wrote:
             | I agree, having read most Heinlein ever published (even
             | including his unfortunate right wing and conservative and
             | unscientific ramblings that wasted my time), I never came
             | across "If it is not one thing, it is two things."
        
               | joshmarinacci wrote:
               | Maybe Niven?
        
               | morphle wrote:
               | I'm even more sure it was not written by Larry Niven. I
               | read most of Heinlein twice but read Niven at least 4
               | times.
        
               | JoeAltmaier wrote:
               | Maybe Spider Robinson?
        
         | metaphor wrote:
         | Agreed.
         | 
         | While reading the thread, a red flag[1] was immediately raised
         | in my mind when:
         | 
         | > _We couldn't reproduce it, but we did come up with a theory
         | for why it was happening._
         | 
         | ...going right into mechanical subsystem redesign. Surely a
         | cursory review would have challenged such a reactionary
         | proposal: What meaningful steps were taken to falsify the
         | prevailing theory?
         | 
         | There's something implied about discipline when this vacant _QA
         | Tester Hardware/Software_ engineering position description[2]
         | bundles verification/validation test roles on the
         | design/development/production/field support fronts with the
         | following caveat:
         | 
         | > _Initially, you'll be the only QA engineer and will perform
         | active testing of new product releases in our lab and in the
         | field at construction sites._
         | 
         | Also, non-rhetorical question: Selenium for industrial hardware
         | test automation...is that really a thing in the wild?
         | 
         | [1] https://twitter.com/tessalau/status/1604018887603138561
         | 
         | [2] https://boards.greenhouse.io/dustyrobotics/jobs/5373908003
        
       | Workaccount2 wrote:
       | As a hardware guy I look at software with envy. Having to deal
       | with physics is such a huge fucking pain in the ass all the time.
       | Reality really fucking hates low entropy systems and will try and
       | sabotage you at every turn. There is also the inherent opaqueness
       | to reality based systems that makes debugging them a huge pain
       | that can be enormously time consuming and expensive. And scaling
       | is ridiculously difficult and expensive.
       | 
       | And worst of all, for me, there is no money in hardware. At best
       | you make a trinket that requires a $9.99 subscription to really
       | get use out of. At worst you make a cool trinket, get forced by
       | pricing to make it in China, and then end up just having the idea
       | stolen and reproduced to be sold for 1/2 the cost.
       | 
       | Ok rant over.
        
         | green_on_black wrote:
         | I agree with everything except the first sentence. Similar to a
         | meeting, where time used meets time allotted, complexity meets
         | complexity allowed.
         | 
         | https://www.stilldrinking.org/programming-sucks
        
       | Quarrelsome wrote:
       | I once lost about a month to a USB issue reported in the field on
       | some custom hardware. Spent several weeks failing to reproduce it
       | (setting up multiple machines to automatically hammer through
       | typical usage). It eventually transpired that the issue
       | correlated with cold temperatures and the recent outsourcing of
       | assembly had resulted in some poor soldering.
        
       | analog31 wrote:
       | I work in hardware. The picture of the freezer with wires coming
       | out under the door gasket is familiar.
        
       | svnt wrote:
       | I'm gonna snark on this one.
       | 
       | There is no mystery or surprise here. It's basic functional
       | qualification. You buy or make a thermal chamber and cycle
       | release versions of your device before you ship one. This isn't
       | some uncatchable mystery, they just didn't test adequately.
       | 
       | Depending on the product size and cost you may also do this to
       | every individual robot off the line. This is not uncommon.
       | 
       | This isn't "hardware is hard" this is "we thought it was software
       | with screwdrivers."
        
         | SkyPuncher wrote:
         | I'm arm-chairing a bit. This is the type of problem I'd expect
         | on a prototype, but not on a production level device -
         | especially on a construction robot that will be clearly out in
         | the weather.
         | 
         | It reads to me like they didn't properly spec/source components
         | that were appropriate for the weather conditions these robots
         | are likely to see. The fact that this issue was reproducible at
         | refrigerator temperatures is even more shocking. 39F (taken
         | from a photo) is not very cold.
        
         | iancmceachern wrote:
         | Agreed. In many more regulated industries (medical devices,
         | automotive, aerospace) the kind of testing you mention is
         | required by law. In all cases it's good form, good practice to
         | test your product to make sure it works as promised by its
         | labeling. Typically you put an operating temperature and
         | humidity range in your manual or labeling. In many industries
         | testing to those operating parameters is required by law.
        
           | greenbit wrote:
           | "environmental qualification testing", good old EQT.
        
         | bsder wrote:
         | > You buy or make a thermal chamber and cycle release versions
         | of your device before you ship one.
         | 
         | That is a _total_ waste of time in a startup with a small
         | number of units shipped.
         | 
         | A component behaving out of spec due to temperature excursions
         | simply isn't that common nowadays. If my system is mostly ADC
         | to digital to DAC (standard for robotics controllers), testing
         | for temperature is a waste until I'm shipping significant
         | volumes.
         | 
         | There is a video of one of the slightly famous YouTubers who
         | has a high voltage thing that fails at the altitude of his lab.
         | The manufacturer _did_ check it for function at the elevation
         | of Denver, but his lab is higher than that. There are limits to
         | how much engineering effort you should put in until you get an
         | _actual_ failure. (Maybe someone can link the video as I can't
         | remember it at this point.)
         | 
         | You can waste infinite engineering effort covering all
         | possibilities. Or you can ship the thing and fix the failures.
         | "Good engineering" is about balancing the two--you need to
         | ship, but you don't want to have too many failures in the field
         | either.
        
           | mdorazio wrote:
           | Testing thermal performance is as simple as going to a
           | restaurant down the street and asking if you can pay them
           | $100 to borrow their walk-in fridge for an hour for cold, and
           | leaving your device in a hot car for an hour for hot. It's
           | also exactly the kind of thing you should be doing if you're
           | selling an actual product to a customer instead of partnering
           | with someone to test your prototype.
        
           | snovv_crash wrote:
           | No, often components are dependent on the stress they are
           | under. If you have inconsistent reflow, or lots of rework
           | happening, then each device will behave differently due to
           | thermal contractions.
        
           | justin66 wrote:
           | If you specify an operating temperature range for your
           | gadget, some testing at the extremes of that range is called
           | for.
        
         | KennyBlanken wrote:
         | Or even just drive down to Lake Tahoe and find a winter
         | construction site and try it there.
         | 
         | This is a general problem with all these California-based
         | companies and inventors (especially SV inventors cranking out
         | crowd-funded bike stuff.) They seem blissfully unaware of
         | things like cold weather, water, dirt/mud, and road salt...or
         | combinations of them. I laugh at all those stupid fucking
         | delivery bots because they'll fall apart anywhere there's snow,
         | and get completely stuck on the slightest bit of ice.
         | 
         | For many years, driving a Model S in heavy rain would cause
         | water to get into the drive unit via either seals or vents that
         | weren't sufficiently designed to keep water out. It "totals"
         | the drive unit, causing corrosion of the motor control boards.
         | And Tesla denies warranty claims on such repairs, because of
         | course they do - just like they did on the windows that
         | randomly shattered in parked cars.
         | 
         | Raise your hand if you've owned a car that had problems with
         | water ingress issues affecting its transmission. Or windows
         | randomly shattering.
         | 
         | What's that? Nobody? Exactly.
        
           | lostlogin wrote:
           | It is not just SV startups.
           | 
           | I have owned a couple of Philips bread makers. Basic ones and
           | expensive ones.
           | 
           | If you make sourdough with them (ie bread) the coating is
           | stripped off the bowl and the stirrer corrodes.
           | 
           | They will deny replacement and claim you sprayed something
           | acid on it. Yes, fermentation is acid, but they don't believe
           | bread would damage their unit.
        
         | TooSmugToFail wrote:
         | You are 100% right, but I would not be as dismissive of the
         | engineering team.
         | 
         | When you run a hardware startup, you can only hope for an
         | experienced team that would do everything by the book and
         | implement best practices from the very first production unit.
         | Reality is: that's a luxury for most hardware startup teams out
         | there.
         | 
         | Typically, there's a frantic rush to get your device to market
         | that you simply skip, or more likely don't even have time to
         | think about stuff like climate chamber cycling.
         | 
         | One thing I'm almost sure of: these guys have learned something
         | -- the engineer's way. Good chance there's a guy there googling
         | climate chambers to ask the CEO for a budget to buy one.
        
           | toss1 wrote:
           | >>Typically, there's a frantic rush to get your device to
           | market that you simply skip, or more likely don't even have
           | time to think about stuff like climate chamber cycling.
           | 
           | And that, right there, is the difference between a company
           | oriented around myopic management vs a company oriented
           | around robust quality.
           | 
           | Any company trying to build a quality reputation would spec
           | this stuff out AT THE BEGINNING -- what are the operational
           | requirements, what loads will they see, in what environments
           | will they run, etc??? Then spec every component, and test the
           | whole lot against those requirements. Sure, this is more like
           | the dreaded "waterfall" vs "agile", but the result is a
           | quality product from the start that has far fewer of these
           | problems (because they did this whole test & fix routine at
           | the prototype or Alpha test stages), rather than showing up
           | with stories like this of how they recovered from customer-
           | reported problems.
           | 
           | If you're telling your customers that they're the Alpha
           | testers because they get early access, fine. If you're
           | selling it as a finished product, then we know your company
           | isn't prioritizing actual quality.
        
             | mschuster91 wrote:
             | > If you're telling your customers that they're the Alpha
             | testers because they get early access, fine. If you're
             | selling it as a finished product, then we know your company
             | isn't prioritizing actual quality.
             | 
             | Sounds like Tesla... although it is public knowledge there
             | that you _are_ still alpha testers years after a model was
             | introduced.
        
           | spaceywilly wrote:
           | I've worked for a hardware startup. We didn't have the budget
           | for a thermal chamber, but that doesn't mean we didn't test
           | and anticipate temperature related issues. We had plenty of
           | setups like the one in the blog with units in the fridge or
           | on a car dashboard on a summer day.
           | 
           | The difference is we did it before the units reached
           | customers.
        
             | exmadscientist wrote:
             | We skip stuff like the temperature chamber all the time.
             | The difference is that we _know_ what we're skipping and
             | why. This can easily make for a "10x" team: we are
             | experienced enough to know what we can get away with, and
             | experienced enough to know very quickly what happened when
             | we _fail_ to get away with it. (So it's often a pretty
             | quick and direct fix! Well, that or a C-level "you declined
             | this part of our proposal so now it's failing exactly like
             | we told you it would so you're looking at this big of a
             | redo, exactly like we told you...").
        
               | dboreham wrote:
               | But you probably have a cupboard full of freezer spray
               | cans and a heat gun.
        
           | bostik wrote:
           | Sorry, but this is a bit rich:
           | 
           | > _When you run a hardware startup, you can only hope for an
           | experienced team_
           | 
           | If you run a hardware startup and fail to _acknowledge_ that
           | places like Alaska or Ontario exist, you fail before even
           | getting close to merely inexperienced. The most charitable
           | word I can think of is "myopic".
        
             | mindslight wrote:
             | It all looks so simple after someone else has debugged the
             | problem and figured out it was temperature, that it's just
             | too tempting to distill it down to some dismissive
             | statement like "_fail to acknowledge places like Alaska or
             | Ontario exist_". But ultimately it's about unknown
             | unknowns. Sure, you can spend massive amounts of time and
             | money trying to make the known unknowns into knowns (higher
             | background radiation, higher cosmic rays, lower/higher air
             | pressure, higher sun intensity, camera flashes, and so on).
             | But even if you do this for a number of things including
             | temperature, there will still be a bunch of factors you won't
             | be preemptively testing. In which case respecting the
             | general lesson will come in handy, even though it's been
             | explained by someone who failed to do testing that you
             | consider routine.
        
             | mcculley wrote:
             | It is not only startups and not only cold places that are
             | overlooked. The iPhone documentation says that it should
             | not be used outside of 32°-95°F.
             | 
             | https://support.apple.com/en-gb/HT201678
        
               | runnerup wrote:
               | That seems reasonably accurate on the high end for me.
               | For _sustained_ use, iPhones generally stop working well
               | around 100-105F.
               | 
               | I suspect I would be disappointed by the performance of
               | an iPhone that could operate in >125F temperatures
               | (temperatures which I have worked in outdoors for several
               | years)
        
         | Aloha wrote:
         | Based on the pictures of the robots, it looks like they were
         | intended for indoor use, I guess no one planned for an indoor
         | space outside of 50-90 degrees - which is a pretty reasonable
         | supposition, better than 80% of indoor environments are within
         | that range.
        
       | todd8 wrote:
       | My story: my first real job after grad school was at Texas
       | Instruments. It was a good job, and I enjoyed working there.
       | 
       | A fellow new hire and I were tasked with fixing a machine that
       | ran in a clean room where semiconductor wafers were made. On
       | weekends while the line was down, we would go in, crank up one
       | section of the line and let waste wafers travel through the line
       | where very rarely they might get stuck in a multi-lane machine
       | that would etch the wafers with some sort of acid.
       | 
       | The machine had over two dozen asynchronous motors, actuators,
       | pumps, sensors and so forth. All generating interrupts and I/O
       | events that were sent to a computer that ran the whole line and
       | controlled all the machines.
       | 
       | We couldn't slow down the machine, it had to run at full speed.
       | The program controlling the machine was thousands of lines of
       | assembly language--everything was assembly language, including
       | the homemade OS that ran the computer running the line. It took
       | like an hour for us to bring up the line and two more hours to
       | see the machine do something strange.
       | 
       | The computer running all this had no user interface other than
       | some front panel switches and some panel lights that would reveal
       | 16 bits of its 128K of memory at a time. This was in the 1970s
       | before Ethernet had been invented.
       | 
       | It felt a bit like those escape room events where you know there
       | is a solution, but you don't know if you will ever get out.
       | Without my coworker cracking jokes about our plight, I'm not sure
       | we would have ever triumphed over that stupid machine.
        
       | mrkeen wrote:
       | I have an HP Spectre x360 laptop. It wasn't registering some of
       | its keystrokes. The same keys would fail, and for quite a while
       | too - hitting them harder and repeatedly didn't help.
       | 
       | Turns out it was the cold. Now when I take a trip out for coffee
       | & coding, I boot up and let it sit for a while before starting my
       | work.
        
       | TeeMassive wrote:
       | "Turns out that last year's coupler problems had the same root
       | cause. While people were opening up the robot and tightening the
       | coupler, the robot would warm up. By the time they put it back
       | together, the problem would have gone away. It had nothing to do
       | with couplers at all.
       | 
       | By the time we had rolled out the coupler "fix" to all robots,
       | the weather had warmed up enough across the country that the
       | issue didn't reoccur. We thought we had fixed it, when actually
       | spring fixed it."
       | 
       | At first they _correlated_ a possible root cause and then after
       | learning from that mistake they finally _understood_ the root
       | cause.
       | 
       | I've seen it happen many times where people with not enough time
       | and knowledge to debug a huge system had to resort to shotgun
       | debugging. IME taking the time to understand always ends up 1)
       | solving the problem and 2) saving time and money.
       | 
       | This is especially true when the problem is actually caused by
       | two or more root causes.
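
The "spring fixed it" trap described above can be shown with a toy fleet log. In this sketch the failure rule depends on temperature alone, yet a naive before/after comparison makes the fix look perfect because the rollout coincided with warmer weather. All months, temperatures, and the 45 °F threshold are invented for illustration; only the shape of the mistake comes from the story.

```python
# Hypothetical fleet log: (month, ambient temperature in °F, fix installed?).
LOG = [
    ("Dec", 35, False),
    ("Jan", 30, False),
    ("Feb", 38, False),
    ("Mar", 50, True),
    ("Apr", 60, True),
    ("May", 70, True),
    ("Nov", 40, True),  # the following winter, fix still installed
]

FAILURE_THRESHOLD_F = 45  # made-up rule: the robot misbehaves below this


def fails(temp_f: int) -> bool:
    """The invented ground truth: failure depends on temperature only."""
    return temp_f < FAILURE_THRESHOLD_F


def failure_rate(rows) -> float:
    """Fraction of log rows in which the robot fails."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to evaluate")
    return sum(fails(temp_f) for _, temp_f, _ in rows) / len(rows)


first_season = LOG[:6]  # December through May

# Naive comparison: failure rate before vs. after the fix was rolled out.
BEFORE_FIX = failure_rate(r for r in first_season if not r[2])
AFTER_FIX = failure_rate(r for r in first_season if r[2])

# Controlled comparison: fixed robots in weather as cold as the pre-fix months.
COLD_WITH_FIX = failure_rate(r for r in LOG if r[2] and r[1] <= 40)

if __name__ == "__main__":
    print(f"before fix: {BEFORE_FIX:.0%}, after fix: {AFTER_FIX:.0%}")
    print(f"fixed robots in cold weather: {COLD_WITH_FIX:.0%}")
```

The first season suggests the fix removed every failure. Holding temperature constant shows that it changed nothing, which is the check a freezer test provides without waiting for the next winter.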
        
         | rileymat2 wrote:
         | Is there a good name for "two root causes" where they work in
         | tandem? I am blanking on it. Contributing factors? Necessary
         | but not sufficient conditions?
        
           | spaceywilly wrote:
           | 2nd order effects
        
           | jasonwatkinspdx wrote:
           | https://en.wikipedia.org/wiki/Confounding
        
           | itcrowd wrote:
           | Destructive interference?
        
       ___________________________________________________________________
       (page generated 2022-12-17 23:00 UTC)