[HN Gopher] Minesweeper automates root cause analysis as a first...
___________________________________________________________________
Minesweeper automates root cause analysis as a first-line defense against bugs

Author : muglug
Score  : 70 points
Date   : 2021-02-09 19:38 UTC (3 hours ago)

(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)

| nomy99 wrote:
| typo: Engineering not Enginering

| pronoiac wrote:
| I emailed the mods.

| stingraycharles wrote:
| While this is a great strategy for figuring out the cause of a
| bug, I'd argue that "root cause analysis" in engineering is
| typically a much more qualitative analysis, and more about high
| impact failures than mere bug reports.
|
| A more accurate title may be "automatic data collection and
| analysis for bug reports"; I'm also confident that Microsoft has
| been doing this exact same thing for at least a few decades.

| cat199 wrote:
| > and more about high impact failures than mere bug reports.
|
| It's often more about people problems than software problems..

| tehjoker wrote:
| I did like the idea that they recorded low memory conditions.
| That seems incredibly useful for debugging issues that occur in
| the wild. A natural next step would be checking GPU memory as
| well if they haven't already.
|
| Is it possible to measure overall system memory pressure in JS
| or is that sandboxed?

| puttycat wrote:
| I'm pretty confident that they use the same OS data for
| fingerprinting as well.

| schemescape wrote:
| This seems like a great system for isolating steps to reproduce a
| bug, but I'm not sure I would consider this "root cause
| analysis".

| Tarsul wrote:
| It appears to have nothing to do with the game.

| 1_2__4 wrote:
| It's been a disappointing trend that SRE execs have adopted
| magical ML thinking, believing they're just a few short years
| away from ditching all those annoying engineers and replacing
| them with a model. I am skeptical.

| qbasic_forever wrote:
| So any concrete stats on how this has helped shorten bug
| investigation time, improve quality of releases, etc? It looks
| like an interesting data-driven approach to bug finding but
| there's curiously no qualitative analysis of how it's actually
| working in practice. I'd be a little concerned that systems like
| this can fall into the background as a flurry of noise and
| process that doesn't actually improve the quality of the product.

| vladd wrote:
| Seems the article is confusing the "trigger" of an event with its
| "root cause".
|
| I like to give the example of the Concorde airplane crash [0] to
| exemplify the difference: the incident was _triggered_ by debris
| on the runway (which caused the tires to explode, igniting the
| fuel tank above). But _the root cause_ was the placement of fuel
| in proximity of inflatable tire materials.
|
| [0] https://en.wikipedia.org/wiki/Air_France_Flight_4590

| k1t wrote:
| Is there really a difference though?
|
| To me a "trigger" is the initial event that begins a sequence.
| Isn't that also a "root cause"?
|
| Since everything is connected to everything else, it seems like
| the point that you decide is the "root" is fairly arbitrary.
|
| It seems you could easily conclude that the root cause was that
| the runway wasn't cleaned/inspected often enough. Or that the
| departures were scheduled too close together, preventing such
| an inspection, etc.
|
| If anything I would say the _root cause_ was the piece of
| engine cowl falling from the preceding flight - since that
| seems to be the first thing that "went wrong" in the process.

| azinman2 wrote:
| If I said 'screw you', and that caused you to flip out and
| kill everyone around you, you couldn't say the root cause was
| me saying 'screw you'. The root cause might be childhood
| trauma, extreme emotional imbalance, irrational thoughts,
| etc. The statement 'screw you' was the trigger.
|
| Similarly here, a trigger might be uploading a photo to FB,
| but the root cause of an issue might be a bug in encoding
| JPGs.

| breischl wrote:
| One approach to this is the "Five Whys" approach, wherein
| you ask "why" five times. E.g.,
|
| Q1: Why did the plane explode?
|
| A1: The engine cowling fell into the engine
|
| Q2: Why'd that happen?
|
| A2: The tire exploded and damaged it.
|
| etc etc.
|
| Obviously the number 5 is arbitrary and not always
| applicable, it's just a heuristic to get to something
| "root-ish" without getting to ridiculously distant things like
| "the laws of physics prevent two objects from inhabiting the
| same position in space-time".
|
| More generally, defining something as root vs. not is
| somewhat of a judgement call. Usually you try to find
| something that will prevent future problems of this sort and
| call it the root cause. Ideally something that your
| organization can mitigate with a reasonable time/cost.
|
| Note that the actual mitigation is a separate question. If
| runway debris is the root cause, then one mitigation is
| reworking the fuel system. Another would be using tougher
| tires. Perhaps another would be adding a shield between the
| tires and the aircraft body. Another might be an automated
| runway monitoring system that detects debris. etc.

| dathinab wrote:
| Debris on the runway is something to be expected in rare
| cases; given how many flights there are a day, it's just a
| matter of time until it happens.
|
| As such the problem is an airplane which is designed in a way
| too prone to cause (too) fatal accidents in certain "rare but
| guaranteed to happen at some point" situations.
|
| But in the end, whether you say both are triggers which
| together led to the catastrophe, or one is a trigger and the
| other is the root cause, is indeed irrelevant.
|
| The problem is if you do something I will call trigger
| analysis but refer to it as root cause analysis, treating it
| as if it gives you _the_ root cause. It can very easily lead
| to situations where you fix one of the problems but not all,
| and potentially not even the biggest problem.
|
| I.e. you make it slightly less likely that there is debris on
| the runway but you don't fix the problem of the airplane
| being too prone to certain kinds of catastrophic failure.

| kryogen1c wrote:
| > Is there really a difference though?
|
| yes.
|
| > To me a "trigger" is the initial event that begins a
| sequence. Isn't that also a "root cause"?
|
| no. root causes are irreducible, hence the word "root".
|
| if someone is endlessly trained for an event and then fails
| at game time, it's probably a root cause. people make mistakes
| that cannot be avoided (this is why defense in depth is a
| thing)
|
| if the training program is a 5 second sentence before the
| event, the person's mistake is not the root cause, it's the
| training program.

| spockz wrote:
| To me the placement of the tanks or lack of protection would
| be the root cause. Because this failure could have been
| triggered by any other debris as well. So the root cause is
| the thing that, if you fix it, fixes the problem at a
| fundamental level.
___________________________________________________________________
(page generated 2021-02-09 23:00 UTC)