[HN Gopher] Minesweeper automates root cause analysis as a first...
       ___________________________________________________________________
        
       Minesweeper automates root cause analysis as a first-line defense
       against bugs
        
       Author : muglug
       Score  : 70 points
       Date   : 2021-02-09 19:38 UTC (3 hours ago)
        
 (HTM) web link (engineering.fb.com)
 (TXT) w3m dump (engineering.fb.com)
        
       | nomy99 wrote:
       | typo: Engineering not Enginering
        
         | pronoiac wrote:
         | I emailed the mods.
        
       | stingraycharles wrote:
       | While this is a great strategy for figuring out the cause of a
       | bug, I'd argue that "root cause analysis" in engineering is
       | typically a much more qualitative analysis, and more about high
       | impact failures than mere bug reports.
       | 
       | A more accurate title may be "automatic data collection and
       | analysis for bug reports"; I'm also confident that Microsoft has
       | been doing this exact same thing for at least a few decades.
        
         | cat199 wrote:
         | > and more about high impact failures than mere bug reports.
         | 
         | It's often more about people problems than software problems..
        
         | tehjoker wrote:
         | I did like the idea that they recorded low memory conditions.
         | That seems incredibly useful for debugging issues that occur in
         | the wild. A natural next step would be checking GPU memory as
         | well if they haven't already.
         | 
         | Is it possible to measure overall system memory pressure in JS
         | or is that sandboxed?
        
           | puttycat wrote:
           | I'm pretty confident that they use the same OS data for
           | fingerprinting as well.
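
       On the question above about reading memory pressure from JS: as far
       as I know, page scripts cannot see overall system memory pressure.
       A minimal sketch of what a Chromium-based browser does expose,
       assuming `navigator.deviceMemory` (a coarse, rounded RAM figure) and
       the non-standard `performance.memory` (JS heap sizes only) are
       available. `MemoryHints` and `readMemoryHints` are hypothetical
       names used for illustration, not anything from the article.

```typescript
// Hypothetical helper: collects the coarse memory hints a page script can read.
// Neither source reports system-wide memory pressure.
interface MemoryHints {
  deviceMemoryGiB?: number; // navigator.deviceMemory: rounded, capped value
  jsHeapUsedBytes?: number; // performance.memory.usedJSHeapSize (non-standard)
  jsHeapLimitBytes?: number; // performance.memory.jsHeapSizeLimit (non-standard)
}

// The navigator-like and performance-like objects are passed in as parameters
// so the function also works where these properties are missing.
function readMemoryHints(
  nav: { deviceMemory?: number } | undefined,
  perf: { memory?: { usedJSHeapSize: number; jsHeapSizeLimit: number } } | undefined
): MemoryHints {
  const hints: MemoryHints = {};
  if (nav && typeof nav.deviceMemory === "number") {
    hints.deviceMemoryGiB = nav.deviceMemory;
  }
  if (perf && perf.memory) {
    hints.jsHeapUsedBytes = perf.memory.usedJSHeapSize;
    hints.jsHeapLimitBytes = perf.memory.jsHeapSizeLimit;
  }
  return hints;
}
```

       In a page this could be called as
       `readMemoryHints(navigator as any, performance as any)`; fields
       that the browser does not provide are simply left unset.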
        
       | schemescape wrote:
       | This seems like a great system for isolating steps to reproduce a
       | bug, but I'm not sure I would consider this "root cause
       | analysis".
        
       | Tarsul wrote:
       | It appears to have nothing to do with the game.
        
       | 1_2__4 wrote:
       | It's been a disappointing trend that SRE execs have adopted
       | magical ML thinking, believing they're just a few short years
       | away from ditching all those annoying engineers and replacing
       | them with a model. I am skeptical.
        
       | qbasic_forever wrote:
       | So any concrete stats on how this has helped shorten bug
       | investigation time, improve quality of releases, etc? It looks
       | like an interesting data-driven approach to bug finding but
       | there's curiously no qualitative analysis of how it's actually
       | working in practice. I'd be a little concerned that systems like
       | this can fall into the background as a flurry of noise and
       | process that doesn't actually improve the quality of the product.
        
       | vladd wrote:
       | Seems the article is confusing the "trigger" of an event with its
       | "root cause".
       | 
       | I like to give the example of the Concorde airplane crash [0] to
       | exemplify the difference: the incident was _triggered_ by debris
       | on the runway (which caused the tires to explode, igniting the
       | fuel tank above). But _the root cause_ was the placement of fuel
       | in proximity of inflatable tire materials.
       | 
       | [0] https://en.wikipedia.org/wiki/Air_France_Flight_4590
        
         | k1t wrote:
         | Is there really a difference though?
         | 
         | To me a "trigger" is the initial event that begins a sequence.
         | Isn't that also a "root cause"?
         | 
         | Since everything is connected to everything else, it seems like
         | the point that you decide is the "root" is fairly arbitrary.
         | 
         | It seems you could easily conclude that the root cause was that
         | the runway wasn't cleaned/inspected often enough. Or that the
         | departures were scheduled too close together, preventing such
         | an inspection, etc..
         | 
         | If anything I would say the _root cause_ was the piece of
         | engine cowl falling from the preceding flight - since that
         | seems to be the first thing that "went wrong" in the process.
        
           | azinman2 wrote:
           | If I said 'screw you', and that caused you to flip out and
           | kill everyone around you, you couldn't say the root cause was
           | me saying 'screw you'. The root cause might be childhood
           | trauma, extreme emotional imbalance, irrational thoughts,
           | etc. The statement 'screw you' was the trigger.
           | 
           | Similarly here, a trigger might be uploading a photo to FB,
           | but the root cause of an issue might be a bug in encoding
           | JPGs.
        
           | breischl wrote:
           | One approach to this is the "Five Whys" technique, wherein
           | you ask "why" five times, e.g.:
           | 
           | Q1: Why did the plane explode?
           | 
           | A1: The engine cowling fell into the engine
           | 
           | Q2: Why'd that happen?
           | 
           | A2: The tire exploded and damaged it.
           | 
           | etc etc.
           | 
           | Obviously the number 5 is arbitrary and not always
           | applicable; it's just a heuristic to get to something
           | "root-ish" without getting to ridiculously distant things
           | like "the laws of physics prevent two objects from
           | inhabiting the same position in space-time".
           | 
           | More generally, defining something as root vs. not is
           | somewhat of a judgement call. Usually you try to find
           | something that will prevent future problems of this sort and
           | call it the root cause. Ideally something that your
           | organization can mitigate with a reasonable time/cost.
           | 
           | Note that the actual mitigation is a separate question. If
           | runway debris is the root cause, then one mitigation is
           | reworking the fuel system. Another would be using tougher
           | tires. Perhaps another would be adding a shield between the
           | tires and the aircraft body. Another might be an automated
           | runway monitoring system that detects debris. etc.
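
       The "ask why N times" heuristic in the comment above can be
       sketched as a bounded walk over recorded causes. This is only an
       illustration: `CauseMap` and `askWhys` are hypothetical names, the
       data holds just the two answers given in the comment, and the
       default of 5 is as arbitrary as the comment says.

```typescript
// Hypothetical data shape: each observation maps to the answer to "why?".
type CauseMap = Record<string, string>;

// Follow "why?" links from a symptom, stopping after maxWhys steps
// or when no further cause is recorded.
function askWhys(causes: CauseMap, symptom: string, maxWhys = 5): string[] {
  const chain = [symptom];
  let current = symptom;
  for (let i = 0; i < maxWhys; i++) {
    const next = causes[current];
    // Stop at a dead end, or if the chain would loop back on itself.
    if (next === undefined || chain.indexOf(next) !== -1) break;
    chain.push(next);
    current = next;
  }
  return chain;
}

// Only the two answers from the comment above; further links are left out.
const causes: CauseMap = {
  "the plane exploded": "the engine cowling fell into the engine",
  "the engine cowling fell into the engine": "the tire exploded and damaged it",
};

const chain = askWhys(causes, "the plane exploded");
// The last entry is the "root-ish" candidate reached within the limit.
console.log(chain[chain.length - 1]);
```

       Whether the last link counts as the root cause is still the
       judgement call the comment describes; the walk only bounds how
       far back you look.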
        
           | dathinab wrote:
           | Debris on the runway is something to be expected in rare
           | cases; given how many flights there are a day, it's just a
           | matter of time until it happens.
           | 
           | As such, the problem is an airplane designed in a way that
           | makes it too prone to fatal accidents in certain "rare but
           | guaranteed to happen at some point" situations.
           | 
           | But in the end, whether you say both are triggers which
           | together led to the catastrophe, or that one is a trigger and
           | the other is the root cause, is indeed irrelevant.
           | 
           | The problem is that if you do something I will call trigger
           | analysis, but refer to it as root cause analysis and treat it
           | as if it gives you _the_ root cause, it can very easily lead
           | to situations where you fix one of the problems but not all,
           | and potentially not even the biggest problem.
           | 
           | I.e. you make it slightly less likely that there is debris on
           | the runway but you don't fix the problem of the airplane
           | being too prone to certain kinds of catastrophic failure.
        
           | kryogen1c wrote:
           | > Is there really a difference though?
           | 
           | yes.
           | 
           | > To me a "trigger" is the initial event that begins a
           | sequence. Isn't that also a "root cause"?
           | 
           | no. root causes are irreducible, hence the word "root".
           | 
           | if someone is endlessly trained for an event and then fails
           | at game time, it's probably a root cause. people make mistakes
           | that cannot be avoided (this is why defense in depth is a
           | thing)
           | 
           | if the training program is a 5 second sentence before the
           | event, the person's mistake is not the root cause, it's the
           | training program.
        
           | spockz wrote:
           | To me the placement of the tanks or lack of protection would
           | be the root cause. Because this failure could have been
           | triggered by any other debris as well. So the root cause is
           | the thing that, if you fix it, fixes the problem at a
           | fundamental level.
        
       ___________________________________________________________________
       (page generated 2021-02-09 23:00 UTC)