[HN Gopher] As winter approaches, here's a story about why hardw... ___________________________________________________________________ As winter approaches, here's a story about why hardware is hard Author : mooreds Score : 125 points Date : 2022-12-17 14:28 UTC (8 hours ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | lostlogin wrote: | Thought this was going to be about Iran's drones, with which | Russia is smashing up Ukraine. They are reported to be sensitive | to low temperatures. | | https://www.nytimes.com/2022/12/14/world/europe/ukraine-russ... | Eleison23 wrote: | Chris Siebenmann has been blogging about how cold affects his | workstation. | https://utcc.utoronto.ca/~cks/space/blog/tech/ColdLockupMach... | mdorazio wrote: | I really wish hardware startups would at least hire people early | on who have experience with actual manufacturing. Stuff like this | really shouldn't happen and really shouldn't be a surprise, | either. Temperature considerations are a standard part of the | design criteria and testing in any mechanical device, and | standard root cause analysis (did they even do FMEA?) would have | forced them to find the problem the first time, not the second. | When you're still in the early prototype phase, fine, but this is | a 25? person team selling a product with a subscription and | everything. | | Edit: just realized their stated fix is to remove two resistors. | Did they actually test the full effects of that change? Seems | like there's a decent chance of a third head-scratcher a few | months from now... | Game_Ender wrote: | As a software person this is first I have heard of FMEA | (Failure Mode and Effect Analysis) [0]. Sounds like a very | rigorous way to move through a system and identify all the ways | it can break and develop ways to fix it. | | 0 - | https://en.wikipedia.org/wiki/Failure_mode_and_effects_analy... 
| Tempest1981 wrote: | Yep, we used a temperature chamber that could ramp from -40°C | to +80°C, along with humidity. We would push the limits until | something broke, fix the design, and repeat. | | Surprised, but not surprised, that everyone doesn't do at least | some stress testing. | | I was expecting something more complicated, like | electromagnetic interference from a passing vehicle. Or ESD | (static discharge). | exmadscientist wrote: | > standard root cause analysis (did they even do FMEA?) would | have forced them to find the problem the first time, not the | second | | Not to mention that the standard checklist for solving "why is | this joint having problems" is "is it cold or does it get worse | when we make it cold".... | | > just realized their stated fix is to remove two resistors. | Did they actually test the full effects of that change? Seems | like there's a decent chance of a third head-scratcher a few | months from now... | | As a half-decent analog guy I can actually believe this. I can | imagine a lot of ways something like this _could_ fix a | problem, though of course I haven't seen any schematics here. | pclmulqdq wrote: | I'm guessing that a robotics company actually has very few | analog-competent engineers. It may well have solved the issue | if one of them had built an analog circuit. A lot of | engineers don't think about operating conditions when they | first learn to build circuits, and run them near the edge of | their specified tolerances, so moving the component into a | "comfortable" range would fix it. | | However, if they were working with a COTS component and took | some resistors off that, I doubt it was actually the right | fix. | pifm_guy wrote: | What's the betting that in the hot weather their Texas | customers with those resistors removed will find their robots | nonfunctional again...?
| steve_adams_86 wrote: | I'm just a bum with an automated hydroponic garden and I | verified and tested all of my components for expected behaviour | in the normal conditions the system will be in. I had to | exclude several components as a result. | | On one hand I'm surprised they didn't do that, but on the | other, I have no time constraints and I'm very, very uncertain | of what I'm doing most of the time. I knew I wanted to have | accurate readings and reliable performance so my garden doesn't | die due to malfunctions. So, I goofed around and made sure | stuff was right. I'd do the same with code in my spare time, | but my employers have me cut corners all the time. People could | criticize me for it, but it's not as though I don't know | better. It might be similar for this team. By the time the bugs | strike, it's not clear what's been properly vetted or who knows | what about which components. Debugging becomes harder because | the initial spec and how well it was met is no longer clear. | | I also discovered the ruggeduino line of arduino boards in the | process which are pretty cool. Overkill for my use case, but I | hope to have a use for one some day. I'm thinking of making a | robot shop vac and metal-picker-upper, and the ruggeduino would | be great there I think. | ThrowawayTestr wrote: | Temperature dependent problems are the hardest to reproduce. | Often they only occur within a specific range of temps. | dagw wrote: | I met a guy who worked for IBM on the AS/400s. Apparently they | had a 'server room' where they could control the temperature and | humidity to basically any combination that was likely to occur in | the real world and they would test all their hardware there under | any extreme condition they could think of. | dboreham wrote: | All hardware produced in a professional manner is tested like | this. | h2odragon wrote: | Wonder what the chances are that the first iteration of "problem | solved!" 
actually _was_ a lurking problem that would've bitten | as well, sooner or later. | nixpulvis wrote: | Reminds me of the time I bought a new iPhone and the touchscreen | didn't work properly when it was below about 20 degrees. | | Leave it to Californians to forget that winter exists. | ghaff wrote: | In fairness, it's generally known that a lot of consumer | electronics have problems in freezing weather unless they're | protected somehow (e.g. kept in an inner pocket) if only | because of the battery. Some of it is possibly California | designers not being especially focused on the sub-zero use case | but it's also not clear how much focus there should be on that | use case in general if there are costs (money, physical specs) | associated with doing so. | nixpulvis wrote: | Let's see... does the object go outside? If yes, it will | probably experience below freezing temperature. | | This isn't rocket science. | | Now, if you happen to be OK with selling a product that isn't | reliable, now's a good time! I still need some more toys for | the holidays. | ghaff wrote: | Maybe you should start up your 40 below Android phone | company. Enjoy! | gumby wrote: | This is why I was shocked by Unix producing core dumps (I'd | previously gotten my experience on ITS and Lispms where | everything ran in the debugger). | | In a core dump all the IO (network connections, files, etc) is | closed while a debugger gives you access to the functioning | environment. | joezydeco wrote: | Any developer, hardware _or_ software, will tell you that | reproducibility is the key to solving a problem. | | But reproducibility can be vague and sometimes, when you're under | pressure, you can be quick to point to something and declare | "aha! that's the root cause!" and be totally wrong. | JoeAltmaier wrote: | Or even be totally right, but there's more to the problem. | Peeling the onion. | | We hope that fixing one piece of code will solve three or four | exhibited problems.
But it's often more like, change three or | four pieces of code to make one problem go away. | | As Heinlein is purported to have written, "If it's not one | thing, it's two things." | eesmith wrote: | I've read about 98% of everything Heinlein published[1], and | don't recognize that quote. | | An archive.org search finds a few examples, like this 1979 | Harlequin short story | https://archive.org/details/romanticshortsto00harl/page/30/m... | | If Heinlein did use a phrase like it, I expect my searches | would have found it. | | It doesn't appear to be a common saying, so I'm curious how | you acquired the association between it and Heinlein. It | doesn't seem like a common misquote people end up spreading. | | [1] I've only read "Tramp Royale" up to the point where they | left the US, I haven't read the "stinkeroos", nor his | posthumous novels, nor most of what Wikipedia lists under | "Other short fiction", nor a couple more non-fiction | publications. | morphle wrote: | I agree, having read most Heinlein ever published (even | including his unfortunate right wing and conservative and | unscientific ramblings that wasted my time), I never came | across "If it is not one thing, it is two things." | joshmarinacci wrote: | Maybe Niven? | morphle wrote: | I'm even more sure it was not written by Larry Niven. I | read most of Heinlein twice but read Niven at least 4 | times. | JoeAltmaier wrote: | Maybe Spider Robinson? | metaphor wrote: | Agreed. | | While reading the thread, a red flag[1] was immediately raised | in my mind when: | | > _We couldn't reproduce it, but we did come up with a theory | for why it was happening._ | | ...going right into mechanical subsystem redesign. Surely a | cursory review would have challenged such a reactionary | proposal: What meaningful steps were taken to falsify the | prevailing theory?
| | There's something implied about discipline when this vacant _QA | Tester Hardware/Software_ engineering position description[2] | bundles verification/validation test roles on the | design/development/production/field support fronts with the | following caveat: | | > _Initially, you'll be the only QA engineer and will perform | active testing of new product releases in our lab and in the | field at construction sites._ | | Also, non-rhetorical question: Selenium for industrial hardware | test automation...is that really a thing in the wild? | | [1] https://twitter.com/tessalau/status/1604018887603138561 | | [2] https://boards.greenhouse.io/dustyrobotics/jobs/5373908003 | Workaccount2 wrote: | As a hardware guy I look at software with envy. Having to deal | with physics is such a huge fucking pain in the ass all the time. | Reality really fucking hates low entropy systems and will try and | sabotage you at every turn. There is also the inherent opaqueness | to reality-based systems that makes debugging them a huge pain | that can be enormously time consuming and expensive. And scaling | is ridiculously difficult and expensive. | | And worst of all, for me, there is no money in hardware. At best | you make a trinket that requires a $9.99 subscription to really | get use out of. At worst you make a cool trinket, get forced by | pricing to make it in China, and then end up just having the idea | stolen and reproduced to be sold for 1/2 the cost. | | Ok rant over. | green_on_black wrote: | I agree with everything except the first sentence. Similar to a | meeting, where time used meets time allotted, complexity meets | complexity allowed. | | https://www.stilldrinking.org/programming-sucks | Quarrelsome wrote: | I once lost about a month to a USB issue reported in the field on | some custom hardware. Spent several weeks failing to reproduce it | (setting up multiple machines to automatically hammer through | typical usage).
It eventually transpired that the issue | correlated with cold temperatures and the recent outsourcing of | assembly had resulted in some poor soldering. | analog31 wrote: | I work in hardware. The picture of the freezer with wires coming | out under the door gasket, is familiar. | svnt wrote: | I'm gonna snark on this one. | | There is no mystery or surprise here. It's basic functional | qualification. You buy or make a thermal chamber and cycle | release versions of your device before you ship one. This isn't | some uncatchable mystery, they just didn't test adequately. | | Depending on the product size and cost you may also do this to | every individual robot off the line. This is not uncommon. | | This isn't "hardware is hard" this is "we thought it was software | with screwdrivers." | SkyPuncher wrote: | I'm arm-chairing a bit. This is the type of problem I'd expect | on a prototype, but not on a production level device - | especially on a construction robot that will be clearly out in | the weather. | | It reads to me like they didn't properly spec/source components | that were appropriate for the weather conditions these robots | are likely to see. The fact that this issue was reproducible at | refrigerator temperatures is even more shocking. 39F (taken | from a photo) is not very cold. | iancmceachern wrote: | Agreed. In many more regulated industries (medical devices, | automotive, aerospace) the kind of testing you mention is | required by law. In all cases it's good form, good practice to | test your product to make sure it works as promised by its | labeling. Typically you put an operating temperature and | humidity range in your manual or labeling. In many industries | testing to those operating parameters is required by law. | greenbit wrote: | "environmental qualification testing", good old EQT. | bsder wrote: | > You buy or make a thermal chamber and cycle release versions | of your device before you ship one. 
| | That is a _total_ waste of time in a startup with a small | number of units shipped. | | A component behaving out of spec due to temperature excursions | simply isn't that common nowadays. If my system is mostly ADC | to digital to DAC (standard for robotics controllers), testing | for temperature is a waste until I'm shipping significant | volumes. | | There is a video of one of the slightly famous YouTubers who | has a high voltage thing that fails at the altitude of his lab. | The manufacturer _did_ check it for function at the elevation | of Denver, but his lab is higher than that. There are limits to | how much engineering effort you should put in until you get an | _actual_ failure. (Maybe someone can link the video as I can't | remember it at this point.) | | You can waste infinite engineering effort covering all | possibilities. Or you can ship the thing and fix the failures. | "Good engineering" is about balancing the two--you need to | ship, but you don't want to have too many failures in the field | either. | mdorazio wrote: | Testing thermal performance is as simple as going to a | restaurant down the street and asking if you can pay them | $100 to borrow their walk-in fridge for an hour for cold, and | leaving your device in a hot car for an hour for hot. It's | also exactly the kind of thing you should be doing if you're | selling an actual product to a customer instead of partnering | with someone to test your prototype. | snovv_crash wrote: | No, often components are dependent on the stress they are | under. If you have inconsistent reflow, or lots of rework | happening, then each device will behave differently due to | thermal contractions. | justin66 wrote: | If you specify an operating temperature range for your | gadget, some testing at the extremes of that range is called | for. | KennyBlanken wrote: | Or even just drive down to Lake Tahoe and find a winter | construction site and try it there.
| | This is a general problem with all these California-based | companies and inventors (especially SV inventors cranking out | crowd-funded bike stuff.) They seem blissfully unaware of | things like cold weather, water, dirt/mud, and road salt...or | combinations of them. I laugh at all those stupid fucking | delivery bots because they'll fall apart anywhere there's snow, | and get completely stuck on the slightest bit of ice. | | For many years, driving a Model S in heavy rain would cause | water to get into the drive unit via either seals or vents that | weren't sufficiently designed to keep water out. It "totals" | the drive unit, causing corrosion of the motor control boards. | And Tesla denies warranty claims on such repairs, because of | course they do - just like they did on the windows that | randomly shattered in parked cars. | | Raise your hand if you've owned a car that had problems with | water ingress issues affecting its transmission. Or windows | randomly shattering. | | What's that? Nobody? Exactly. | trasz2 wrote: | lostlogin wrote: | It is not just SV startups. | | I have owned a couple of Philips bread makers. Basic ones and | expensive ones. | | If you make sourdough with them (ie bread) the coating is | stripped off the bowl and the stirrer corrodes. | | They will deny replacement and claim you sprayed something | acid on it. Yes, fermentation is acid, but they don't believe | bread would damage their unit. | TooSmugToFail wrote: | You are 100% right, but I would not be as dismissive to the | engineering team. | | When you run a hardware startup, you can only hope for an | experienced team that would do everything by the book and | implement best practices from the very first production unit. | Reality is: that's a luxury for most hardware startup teams out | there. | | Typically, there's a frantic rush to get your device to market | that you simply skip, or more likely don't even have time to | think about stuff like climate chamber cycling. 
| | One thing I'm almost sure of: these guys have learned something | -- the engineer's way. Good chance there's a guy there googling | climate chambers to ask the CEO for a budget to buy one. | toss1 wrote: | >>Typically, there's a frantic rush to get your device to | market that you simply skip, or more likely don't even have | time to think about stuff like climate chamber cycling. | | And that, right there, is the difference between a company | oriented around myopic management vs a company oriented | around robust quality. | | Any company trying to build a quality reputation would spec | this stuff out AT THE BEGINNING -- what are the operational | requirements, what loads will they see, in what environments | will they run, etc??? Then spec every component, and test the | whole lot against those requirements. Sure, this is more like | the dreaded "waterfall" vs "agile", but the result is a | quality product from the start that has far fewer of these | problems (because they did this whole test & fix routine at | the prototype or Alpha test stages), rather than showing up | with stories like this of how they recovered from customer- | reported problems. | | If you're telling your customers that they're the Alpha | testers because they get early access, fine. If you're | selling it as a finished product, then we know your company | isn't prioritizing actual quality. | mschuster91 wrote: | > If you're telling your customers that they're the Alpha | testers because they get early access, fine. If you're | selling it as a finished product, then we know your company | isn't prioritizing actual quality. | | Sounds like Tesla... although it is public knowledge there | that you _are_ still alpha testers years after a model was | introduced. | spaceywilly wrote: | I've worked for a hardware startup. We didn't have the budget | for a thermal chamber, but that doesn't mean we didn't test | and anticipate temperature related issues. 
We had plenty of | setups like the one in the blog with units in the fridge or | on a car dashboard on a summer day. | | The difference is we did it before the units reached | customers. | exmadscientist wrote: | We skip stuff like the temperature chamber all the time. | The difference is that we _know_ what we're skipping and | why. This can easily make for a "10x" team: we are | experienced enough to know what we can get away with, and | experienced enough to know very quickly what happened when | we _fail_ to get away with it. (So it's often a pretty | quick and direct fix! Well, that or a C-level "you declined | this part of our proposal so now it's failing exactly like | we told you it would so you're looking at this big of a | redo, exactly like we told you..."). | dboreham wrote: | But you probably have a cupboard full of freezer spray | cans and a heat gun. | bostik wrote: | Sorry, but this is a bit rich: | | > _When you run a hardware startup, you can only hope for an | experienced team_ | | If you run a hardware startup and fail to _acknowledge_ that | places like Alaska or Ontario exist, you fail before even | getting close to merely inexperienced. The most charitable | word I can think of is "myopic". | mindslight wrote: | It all looks so simple after someone else has debugged the | problem and figured out it was temperature, that it's just | too tempting to distill it down to some dismissive | statement like "_fail to acknowledge places like Alaska or | Ontario exist_". But ultimately it's about unknown | unknowns. Sure, you can spend massive amounts of time and | money trying to make the known unknowns into knowns (higher | background radiation, higher cosmic rays, lower/higher air | pressure, higher sun intensity, camera flashes, and so on). | But even if you do this for a number of things including | temperature, there will still be a bunch of factors you won't | be preemptively testing.
In which case respecting the | general lesson will come in handy, even though it's been | explained by someone who failed to do testing that you | consider routine. | mcculley wrote: | It is not only startups and not only cold places that are | overlooked. The iPhone documentation says that it should | not be used outside of 32°-95°F. | | https://support.apple.com/en-gb/HT201678 | runnerup wrote: | That seems reasonably accurate on the high end for me. | For _sustained_ use, iPhones generally stop working well | around 100-105F. | | I suspect I would be disappointed by the performance of | an iPhone that could operate in >125F temperatures | (temperatures which I have worked in outdoors for several | years) | Aloha wrote: | Based on the pictures of the robots, it looks like they were | intended for indoor use, I guess no one planned for an indoor | space outside of 50-90 degrees - which is a pretty reasonable | supposition, better than 80% of indoor environments are within | that range. | todd8 wrote: | My story: my first real job after grad school was at Texas | Instruments. It was a good job, and I enjoyed working there. | | A fellow new hire and I were tasked with fixing a machine that | ran in a clean room where semiconductor wafers were made. On | weekends while the line was down, we would go in, crank up one | section of the line and let waste wafers travel through the line | where very rarely they might get stuck in a multi-lane machine | that would etch the wafers with some sort of acid. | | The machine had over two dozen asynchronous motors, actuators, | pumps, sensors and so forth. All generating interrupts and I/O | events that were sent to a computer that ran the whole line and | controlled all the machines. | | We couldn't slow down the machine, it had to run at full speed.
| The program controlling the machine was thousands of lines of | assembly language--everything was assembly language, including | the homemade OS that ran the computer running the line. It took | like an hour for us to bring up the line and two more hours to | see the machine do something strange. | | The computer running all this had no user interface other than | some front panel switches and some panel lights that would reveal | 16 bits of its 128K of memory at a time. This was in the 1970s | before Ethernet had been invented. | | It felt a bit like those escape room events where you know there | is a solution, but you don't know if you will ever get out. | Without my coworker cracking jokes about our plight, I'm not sure | we would have ever triumphed over that stupid machine. | mrkeen wrote: | I have an HP Spectre x360 laptop. It wasn't registering some of | its keystrokes. The same keys would fail, and for quite a while | too - hitting them harder and repeatedly didn't help. | | Turns out it was the cold. Now when I take a trip out for coffee | & coding, I boot up and let it sit for a while before starting my | work. | TeeMassive wrote: | "Turns out that last year's coupler problems had the same root | cause. While people were opening up the robot and tightening the | coupler, the robot would warm up. By the time they put it back | together, the problem would have gone away. It had nothing to do | with couplers at all. | | By the time we had rolled out the coupler "fix" to all robots, | the weather had warmed up enough across the country that the | issue didn't reoccur. We thought we had fixed it, when actually | spring fixed it." | | At first they _correlated_ a possible root cause and then after | learning from that mistake they finally _understood_ the root | cause. | | I've seen it happen many times where people with not enough time | and knowledge to debug a huge system had to resort to shotgun | debugging.
IME taking the time to understand always ends up 1) | solving the problem and 2) saving time and money. | | This is especially true when the problem is actually caused by | two or more root causes. | rileymat2 wrote: | Is there a good name for "two root causes" where they work in | tandem? I am blanking on it. Contributing factors? Necessary | but not sufficient conditions? | spaceywilly wrote: | 2nd order effects | jasonwatkinspdx wrote: | https://en.wikipedia.org/wiki/Confounding | itcrowd wrote: | Destructive interference? ___________________________________________________________________ (page generated 2022-12-17 23:00 UTC)