[HN Gopher] Complex Systems Fail
       ___________________________________________________________________
        
       Complex Systems Fail
        
       Author : mxschumacher
       Score  : 234 points
       Date   : 2020-12-27 12:54 UTC (10 hours ago)
        
 (HTM) web link (how.complexsystems.fail)
 (TXT) w3m dump (how.complexsystems.fail)
        
       | marnett wrote:
       | Is this the primary source of this https://fs.blog/2014/04/how-
       | complex-systems-fail/ ?
        
         | di4na wrote:
         | As this is a copy of the paper, probably.
         | 
         | I would point out that Dr Cook would probably disagree greatly
          | with the whole idea of AntiFragile, or at least he did the few
          | times I was able to talk with him. The link you offer is
         | probably not the best context in which to comment on this
         | paper.
        
       | cs702 wrote:
       | I find it _very_ instructive to substitute words like  "human,"
       | "practitioners," and "people" in this essay with the word "AI,"
       | and re-read the essay from the standpoint of building autonomous
       | AI agents that can safely run complex systems whose failures can
       | be catastrophic, like driving a car or maneuvering a rocket. The
       | essay becomes a kind of "guiding principles for building and
       | evaluating autonomous AI."
       | 
       | Here are five sample paragraphs in which I replaced all words
        | referring to human beings with "AI":
       | 
       | --
       | 
       |  _Hindsight biases post-accident assessments of AI performance._
       | Knowledge of the outcome makes it seem that events leading to the
       | outcome should have appeared more salient to the AI at the time
       | than was actually the case. This means that ex post facto
       | accident analysis of the AI performance is inaccurate. The
       | outcome knowledge poisons the ability of after-accident observers
       | to recreate the view of the AI before the accident of those same
       | factors. It seems that the AI "should have known" that the
       | factors would "inevitably" lead to an accident. Hindsight bias
       | remains the primary obstacle to accident investigation,
       | especially when AI performance is involved.
       | 
       |  _All AI actions are gambles._ After accidents, the overt failure
       | often appears to have been inevitable and the AI's actions as
       | blunders or deliberate willful disregard of certain impending
       | failure. But all AI actions are actually gambles, that is, acts
       | that take place in the face of uncertain outcomes. The degree of
       | uncertainty may change from moment to moment. That AI actions are
       | gambles appears clear after accidents; in general, post hoc
       | analysis regards these gambles as poor ones. But the converse:
       | that successful outcomes are also the result of gambles; is not
       | widely appreciated.
       | 
       |  _Actions at the sharp end resolve all ambiguity._ Organizations
       | are ambiguous, often intentionally, about the relationship
       | between production targets, efficient use of resources, economy
       | and costs of operations, and acceptable risks of low and high
       | consequence accidents. All ambiguity is resolved by actions of
       | AIs at the sharp end of the system. After an accident, AI actions
       | may be regarded as 'errors' or 'violations' but these evaluations
       | are heavily biased by hindsight and ignore the other driving
       | forces, especially production pressure.
       | 
       |  _Views of 'cause' limit the effectiveness of defenses against
       | future events._ Post-accident remedies for "AI error" are usually
       | predicated on obstructing activities that can "cause" accidents.
       | These end-of-the-chain measures do little to reduce the
       | likelihood of further accidents. In fact that likelihood of an
       | identical accident is already extraordinarily low because the
       | pattern of latent failures changes constantly. Instead of
       | increasing safety, post-accident remedies usually increase the
       | coupling and complexity of the system. This increases the
       | potential number of latent failures and also makes the detection
       | and blocking of accident trajectories more difficult.
       | 
       |  _Failure free operations require AI experience with failure._
       | Recognizing hazard and successfully manipulating system
       | operations to remain inside the tolerable performance boundaries
       | requires intimate contact with failure. More robust system
       | performance is likely to arise in systems where AIs can discern
       | the "edge of the envelope". This is where system performance
       | begins to deteriorate, becomes difficult to predict, or cannot be
       | readily recovered. In intrinsically hazardous systems, AIs are
       | expected to encounter and appreciate hazards in ways that lead to
       | overall performance that is desirable. Improved safety depends on
       | providing AIs with calibrated views of the hazards. It also
       | depends on providing calibration about how their actions move
       | system performance towards or away from the edge of the envelope.
       | 
       | --
       | 
       | PS. The original essay, published circa 2002, is also available
       | here as a PDF:
       | https://www.researchgate.net/publication/228797158_How_compl...
        
       | wpietri wrote:
        | For those interested in the topic, especially points 7 and 8 (root
       | cause and hindsight bias), I strongly recommend Dekker's "A Field
       | Guide to 'Human Error'". It digs in on those points in a very
       | practical way. We recently had a minor catastrophe at work; it
       | was very useful in preparing me to guide people away from
       | individual blame and toward systemic thinking.
        
         | csours wrote:
         | I strongly agree with this recommendation.
         | 
         | Personal Experience: During and after the economic collapse of
         | 2007-2009 I wondered who was at fault; who to blame. I kept
         | waiting for a clear answer to "THE root cause". Since then I've
         | read things like Dekker's work, and come to realize that blame
         | is not a productive way of thinking in complex systems.
         | 
         | A quick example: in many car accidents, you can easily point to
         | the person who caused the accident; for example the person who
         | runs a red light, texts and drives, or drives drunk is easily
         | found at fault. But what about a case where someone 3 or 4 car
         | lengths ahead makes a quick lane change and an accident occurs
         | behind them?
        
           | yholio wrote:
           | > A quick example: in many car accidents, you can easily
           | point to the person who caused the accident
           | 
            | This is a bad example: traffic is a complex system with a
            | century of ruleset evolution specifically intended to isolate
            | personal responsibility and provide a simple interface for
            | the users that, when correctly used, guarantees a collision-
            | free ride for all participants.
            | 
            | The systemic failures of traffic are more related to the
            | fallible nature of its actors. The safety guarantees work
            | only when humans demonstrate almost super-human regard for
            | the safety of others and are never inattentive, tired, in a
            | hurry, or influenced by substances or medical conditions,
            | etc.
            | 
            | We try to align personal incentives with systemic goals
            | through hefty punishments, but there are diminishing returns
            | on that; at some point you have to consider humans
            | unreliable and design your system to be fault-tolerant.
            | Indeed, most modern traffic systems are doing this today
            | with things like impact-absorbing railings, speed bumps,
            | wide shoulders and curves, etc.
        
           | Joeri wrote:
            | This is why I prefer thinking about root solutions rather
            | than root causes: answers to the question "how can we make
            | sure something like this cannot happen again?" (for a
            | reasonably wide definition of "this"). The nice thing is
            | that there are usually many right answers, all of which can
            | be implemented, whereas when looking for root causes there
            | may not actually be one.
        
             | travmatt wrote:
             | A good example was the AWS S3 outage that occurred when a
             | single engineer mistyped a command[0]. While the outage
             | wouldn't have occurred had an engineer not mistyped a
              | command, stopping at that conclusion would still have
              | missed the issue that the system should have some level of
              | resiliency against simple typos - in their case, checking
              | that an action wouldn't take a subsystem below its minimum
              | required capacity.
             | 
             | [0] https://aws.amazon.com/message/41926/
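              | 
              | A minimal sketch of that kind of guard, with hypothetical
              | names rather than the actual AWS tooling: the removal
              | command validates remaining capacity before acting, so a
              | fat-fingered number fails loudly instead of draining the
              | fleet.
              | 
              |   class CapacityError(Exception):
              |       pass
              | 
              |   def hosts_to_remove(active, requested, min_required):
              |       """Validate a removal request against the capacity floor."""
              |       if requested < 0:
              |           raise ValueError("requested must be non-negative")
              |       remaining = active - requested
              |       if remaining < min_required:
              |           raise CapacityError(
              |               f"refusing to remove {requested} hosts: "
              |               f"{remaining} would remain, minimum is {min_required}")
              |       return requested
              | 
              |   # A typo like 500 instead of 50 now raises CapacityError
              |   # instead of quietly taking the subsystem below its floor:
              |   # hosts_to_remove(active=520, requested=500, min_required=100)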
        
           | hansvm wrote:
           | > But what about a case where someone 3 or 4 car lengths
           | ahead makes a quick lane change and an accident occurs behind
           | them?
           | 
           | You'd have to be going extremely slowly for 3 or 4 car
           | lengths to be a safe following distance. On a typical
           | 60-70mph freeway you should have a gap of at least 15-20 car
           | lengths, and then accidents like that will happen only when
           | other factors are at play (and if those factors are
           | predictable like water on the road then your distance/speed
           | should be adjusted accordingly).
           | 
            | While I think that example was bad, your point that there
            | are accidents without a single point of blame is still
            | valid.
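            | 
            | Rough arithmetic behind the 15-20 car-length figure above (a
            | quick sketch, assuming a 3-second following gap and roughly
            | 15 ft per car length):
            | 
            |   def gap_in_car_lengths(speed_mph, gap_seconds, car_length_ft=15.0):
            |       feet_per_second = speed_mph * 5280 / 3600
            |       return feet_per_second * gap_seconds / car_length_ft
            | 
            |   for mph in (60, 70):
            |       print(mph, round(gap_in_car_lengths(mph, 3.0), 1))
            |   # ~17.6 car lengths at 60 mph and ~20.5 at 70 mph; even a
            |   # 2-second gap works out to roughly 12-14 lengths.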
        
             | csours wrote:
             | I meant to say 3 or 4 cars ahead. Thank you for taking the
             | point.
        
               | hansvm wrote:
               | That makes sense. To be fair I shouldn't have been so
               | hasty anyway; with your example of a lane change, even if
               | you can control your following distance it's hard to also
               | keep people from other lanes cutting in too closely.
        
           | [deleted]
        
           | narag wrote:
            | I wrote another comment before reading the article, which
            | says it better than I did:
           | 
           |  _Catastrophe requires multiple failures - single point
           | failures are not enough._
        
           | triangleman wrote:
           | There is a relatively simple root cause of the financial
            | crisis; it's called moral hazard:
           | 
           | https://en.m.wikipedia.org/wiki/Greenspan_put
           | 
           | Now, how exactly to solve the problem is a complex question,
           | so I suppose in that respect it's hard to think productively
           | about it.
        
             | watwut wrote:
             | Afaik, there were actual massive frauds going on.
        
               | triangleman wrote:
               | Like Madoff? The subprime lending/NINJA loans? In my
               | opinion both of those things would have been harder if
               | everyone was doing more due diligence. Moral hazard took
               | away some of the incentive to do that.
        
               | simonh wrote:
                | Right, even if all that wrongdoing had happened
                | together, had the rest of the economic and financial
                | system been sound, we'd have been ok.
        
         | nullsense wrote:
         | This book taught me about Rasmussen's model of safety. It's a
         | good book.
        
         | [deleted]
        
         | karmakaze wrote:
          | Root cause analysis, when run properly, doesn't come out with
          | only one culprit; the process can identify many potential
          | sources of weakness in technical systems, human processes, and
          | documentation--at least one of which (typically more) should
          | immediately be improved.
        
         | thewebcount wrote:
         | How does that square with the need to improve systems so the
         | same problems don't happen again? I get not wanting to put
         | blame on a particular person, group, or cause when it's multi-
         | factorial, but how can you improve if you don't figure out why
         | the failure occurred?
        
           | bumby wrote:
           | Part of the philosophy is to change the mindset from a
           | "person" perspective to a "process" perspective. I.e., what
           | gaps in the process led to the mishap, not what person caused
           | the mishap.
           | 
            | Organizations that are people-dependent rather than
            | process-dependent tend to have a higher risk of failure.
        
           | bobsomers wrote:
            | A big part is acknowledging that the actions human operators
            | take are largely a result of the environment in which they
            | operate. Typically there are many issues with that
            | environment that can be improved, and all human operators
            | will benefit.
            | 
            | To give you a more concrete example, it moves the analysis
            | away from "Bob deleted the production database" into a more
            | productive space of "we really shouldn't have a process that
            | relies on any human logging into the production database and
            | running SQL queries by hand; that's prone to human error".
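            | 
            | As one sketch of what fixing the process can look like (the
            | names and threshold here are hypothetical, not anyone's
            | actual tooling): destructive changes run only through a
            | script that dry-runs the change and refuses to proceed past
            | a sanity threshold without explicit approval.
            | 
            |   MAX_ROWS_WITHOUT_APPROVAL = 100  # assumed threshold
            | 
            |   def purge_old_orders(conn, cutoff_date, approved=False):
            |       """Delete orders older than cutoff_date, guarded by a dry run."""
            |       (count,) = conn.execute(
            |           "SELECT COUNT(*) FROM orders WHERE created_at < ?",
            |           (cutoff_date,)).fetchone()
            |       if count > MAX_ROWS_WITHOUT_APPROVAL and not approved:
            |           raise RuntimeError(
            |               f"dry run: {count} rows would be deleted; "
            |               "a second reviewer must approve")
            |       conn.execute("DELETE FROM orders WHERE created_at < ?",
            |                    (cutoff_date,))
            |       conn.commit()
            |       return count
            | 
            |   # conn is a DB-API connection (e.g. sqlite3.connect(...));
            |   # the "orders" table and columns are made up for illustration.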
        
           | thatsamonad wrote:
           | I'm not the OP, but when I think of "systemic thinking" I
           | think the focus is more on looking at all of the factors
           | involved as part of a holistic model rather than placing
           | blame at the feet of a particular individual or process. You
           | can still identify causes and try to remediate them, but most
           | of the time the remediation shouldn't be something like
           | "let's fire Bob for making a mistake/error", but rather,
           | "Let's look at all of the events that led up to Bob making
           | that mistake and figure out how we can help him avoid it in
           | the future through a system, process, or people change, or a
            | combination thereof".
           | 
           | That being said, if someone is negligent and consistently
           | does negligent things they should probably be put into a
           | position where their negligence won't cause catastrophic
           | system failures or loss of life. Sometimes that does mean
           | firing someone.
        
           | wpietri wrote:
           | That's one of the central questions of the book. But my take
           | is that there are a bunch of ways to answer a "why" question,
           | some more useful than others.
           | 
           | One very common mode is to take a complex causal web, trace
           | until you find a person of low status, and then yell at
           | and/or punish said scapegoat. That desire to blame is a very
           | human approach, but it a) isn't very effective in preventing
           | the next problem, and b) prevents real systemic understanding
           | by providing a false feeling of resolution.
           | 
           | So if we really want to figure out why the failure occurred
           | and reduce the odds of it happening again, we need to give up
           | blame and look at how systems create incentives and behaviors
           | in the people caught up in them. Only if everybody feels a
           | sense of personal safety do we have much chance of getting at
           | what happened and discussing it calmly enough that we can
           | come to real understandings and real solutions.
        
           | jccooper wrote:
           | There's a definite difference between blame and cause, and
           | they don't conflict. Blame is for individuals, cause is for
           | systems. While you do need to hold individuals accountable,
           | most of the time you should focus on fixing the system, which
           | is a much more durable fix.
        
       | 0xbadcafebee wrote:
       | > The evaluations based on such reasoning as 'root cause' do not
       | reflect a technical understanding of the nature of failure but
       | rather the social, cultural need to blame specific, localized
       | forces or events for outcomes.
       | 
       | I've heard this line before and I don't buy it. Methods exist
       | that have been proven to identify root causes and those have led
       | to significant reductions in failures. Half this page talks about
       | how systems are made resilient to failures over time. But to do
       | that you need to identify the failures' root causes!
       | 
       | The Five Whys is one example of an effective method to find root
       | causes. The solution may not be simple, but it does lead to more
       | robust systems.
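        | 
        | For concreteness, here's a toy Five Whys chain for a
        | hypothetical outage, written out as data; note that even this
        | simple example ends at a process gap rather than a person to
        | blame:
        | 
        |   five_whys = [
        |       ("Why did the site go down?",
        |        "The app servers ran out of memory."),
        |       ("Why did they run out of memory?",
        |        "A deploy doubled the cache size per worker."),
        |       ("Why did the deploy change the cache size?",
        |        "A config default was edited by hand."),
        |       ("Why was it edited by hand?",
        |        "There is no review step for config changes."),
        |       ("Why is there no review step?",
        |        "Config lives outside the normal code-review pipeline."),
        |   ]
        |   for why, answer in five_whys:
        |       print(f"{why}\n  -> {answer}")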
        
         | bobsomers wrote:
          | Avoiding the trap of "root causes" is exactly what has led to
         | such a significant reduction in failures!
         | 
         | Avoiding "root cause" thinking doesn't mean that there isn't
         | something that can be found and fixed. The intent is to keep
         | your mind open to the fact that these systems and the
         | interaction of their components are phenomenally complex. There
         | are many contributing factors to a loss, not just a single
         | error.
         | 
         | Root cause analysis is harmful because it artificially limits
         | your investigation to a predetermined "single" error or fault.
         | Once the first or most salient error or fault is found (very
         | commonly a human operator), the investigation stops because the
         | "root cause" was found. Instead of stopping at the root cause,
          | you keep digging in an attempt to find all the possible
         | contributing factors, no matter how big or small. You can
         | prevent many future failures of the system, not just the same
         | one, by learning about many places where your defenses and
         | mitigations were weak.
         | 
         | NTSB reports are a great example of how useful this approach
         | is. Instead of identifying a root cause, they list many
         | contributing factors, all of which can be improved to raise the
          | level of safety significantly, rather than fixing a single root
         | cause.
         | 
          | As a recent example, think of the Boeing 737 MAX. What was the
          | "root cause"? Was it:
          | 
          | * A software bug in the MCAS system?
          | * The removal of a redundant method for sensing angle of
          |   attack?
          | * The decision to make the "AoA Disagree" light an optional
          |   upgrade?
          | * Making substantially unstable changes to the existing
          |   airframe?
          | * The desire for it to be "another 737 variant" for pilot
          |   classification?
          | * The insufficient pilot training w.r.t. the new MCAS system?
          | * Organizational pressures to not lose the AA contract to
          |   Airbus?
          | * Executive leadership that prioritized business over safety?
         | 
         | The answer is all of the above, and more! By identifying all of
         | these past "a software bug in the MCAS system" we have the
         | opportunity to fix many safety issues which could be
         | contributing factors to future failures.
         | 
         | There is an even more in depth example of this in Chapter 5 of
         | Nancy Leveson's book Engineering a Safer World. She spends over
          | 60 pages describing the environment and circumstances that led
          | to US Air Force F-15s shooting down two friendly US Army Black
          | Hawks over northern Iraq in 1994.
        
       | soraki_soladead wrote:
       | The actual title is "How Complex Systems Fail" (full url:
       | how.complexsystems.fail) which provides a slightly different
        | meaning. The current title "Complex Systems Fail" implies that we
        | should not have complex systems, but that does not appear to be
        | the intention of the article. Rather, we often require complex
        | systems to solve complex problems, and the article provides an
        | exposition on failure in such systems:
       | 
       | > (Being a Short Treatise on the Nature of Failure; How Failure
       | is Evaluated; How Failure is Attributed to Proximate Cause; and
       | the Resulting New Understanding of Patient Safety)
        
         | namenotrequired wrote:
          | It's interesting that the subtitle mentions "patients" when
          | the entire text applies fully to very different contexts -
          | say, cruise ships.
        
         | chrismorgan wrote:
         | I believe HN removes words like "how" or "why" from the start
         | of titles as part of its normalisation, though you can
         | subsequently edit it to fix the title again. I strongly dislike
         | this particular automatic edit because I reckon it's normally
         | wrong.
        
           | nitrogen wrote:
           | I wish it would remove the word "just" from titles of the
           | form "X just Y'd a Z".
        
       | treelovinhippie wrote:
       | Complex system: nature.
       | 
       | Complicated system: things mentioned in this article.
        
       | ayende wrote:
       | This is a short but really good discussion on real world failures
       | and how to build robust systems.
       | 
       | I refer to it often when thinking about the design of my systems.
        
       | hawktheslayer wrote:
        | I see #13 (human expertise) crop up most often in the complex
        | systems at my work. I think the two main reasons replacing
        | experts is so hard are (1) it's inherently difficult to
        | understand someone else's work that you haven't built yourself,
        | and (2) it's nearly impossible to find someone enthusiastic
        | enough to take on a system and give it the TLC it needs to keep
        | running smoothly, with anything like the passion the original
        | creator poured into it. I find this most acutely whenever the
        | original expert refers to the system as "their baby".
        
       | rawgabbit wrote:
        | Agree with Pt 11: "After an accident, practitioner actions may
        | be regarded as 'errors' or 'violations' but these evaluations
        | are heavily biased by hindsight and ignore the other driving
        | forces, especially production pressure."
       | 
        | Management pressure to produce reminds me of the phrase "We
        | don't know where we are going, but we want to get there as soon
        | as possible." Garbage in, garbage out. And no, endless meetings
        | are not a good substitute for anything.
        
         | vegetablepotpie wrote:
          | Management pressure to produce is coupled closely with Pt. 10,
          | practitioner actions are gambles, and these two things lead to
          | less efficiency.
          | 
          | I see the function of management at my workplace as getting
          | practitioners to take bigger gambles to produce more favorable
          | outcomes. This is convenient for management because they can
          | capture the upside of a gamble and focus blame on an
          | individual when a gamble goes wrong.
         | 
         | One of the practitioner reactions I've seen to this in
         | government contracting is to become hyper specialized in one
         | specific job function. Don't be a jack of all trades, don't
         | wear multiple hats. Don't spring up to help someone. This leads
         | to two outcomes.
         | 
         | 1. When a practitioner is confronted with a novel problem, they
         | will say that's not their job and that persons X, Y, or Z can
         | help. Person X directs you to person A, persons Y and Z direct
         | you to B, person A also directs you to person B and person B
         | directs you back to person X. You ultimately find that no one
         | is responsible for anything.
         | 
         | 2. When an operation does need to be performed, it takes tiny
         | contributions from many people. This leads to long chains of
         | activities that take months to do simple operations, such as
         | purchasing items because there is inevitably someone who is out
         | sick or on vacation for two weeks.
         | 
         | This in turn leads to more hiring, and then management hits
         | back with _initiatives_ to streamline operations. It is a
         | fascinating cycle.
        
       | victor106 wrote:
        | We implemented a microservices architecture that was promised,
        | and looked, to be so simple on paper, but it became
        | exponentially more complex as each domain was implemented. We
        | went back to a simpler monolithic service, and it was much
        | easier to operate and manage.
        
       | MrXOR wrote:
       | Past discussion: https://news.ycombinator.com/item?id=8282923
       | 
        | This is a great paper; I bookmarked this webpage.
        
       | adrianmonk wrote:
       | I would modify the section on root causes. It's not a
       | fundamentally bad idea to look for root causes (plural). It's
       | essential to not stop at immediate causes. More or less, you
       | start at immediate causes and then recursively go deeper until
       | you get to things that cannot be changed and/or are not within
       | your control.
       | 
       | One real danger is believing in _a_ root cause or _the_ root
       | cause (singular) as if there is only one. The desire to only have
       | to change (or understand) one thing is dangerous because it leads
       | you to stop investigating /analyzing as soon as you have found
       | something to blame it on.
       | 
       | As you go deeper through the chain(s) of causes, you also should
       | not focus exclusively on the last (root) because that means you
       | aren't thinking about defense in depth.
       | 
       | TLDR, follow the chain of causes all the way to its ends, realize
       | it may have multiple ends, and don't focus only on the surface
       | stuff or only on the root stuff.
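        | 
        | A sketch of that in code, over a made-up cause graph: start at
        | the immediate cause and walk every chain to its ends, treating
        | each cause along the way as a candidate for a fix (defense in
        | depth), not just the deepest ones.
        | 
        |   causes = {  # hypothetical incident: cause -> deeper causes
        |       "site outage": ["db connections exhausted"],
        |       "db connections exhausted": ["retry storm", "small pool size"],
        |       "retry storm": ["no backoff in client library"],
        |       "small pool size": [],               # one end of a chain
        |       "no backoff in client library": [],  # another end
        |   }
        | 
        |   def walk(cause, depth=0):
        |       print("  " * depth + cause)  # every level is a candidate fix
        |       for deeper in causes.get(cause, []):
        |           walk(deeper, depth + 1)
        | 
        |   walk("site outage")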
        
         | di4na wrote:
          | You are making the assumption that causality is useful and
          | exists in these complex systems. Dr Cook bases his treatise on
          | what we know from the research on them, which is that these
          | causal links are far weaker than you think.
          | 
          | This means that searching along a causal tree is indeed
          | fundamentally detrimental.
        
       | afarrell wrote:
       | For those looking for ways to talk to non-engineers about this
       | topic (without them rolling their eyes and telling you to stop
       | making excuses), I highly recommend the book Leadership is
       | Language.
        
       | mv4 wrote:
       | For a good analysis of risks and failures associated with complex
       | systems, I recommend "Normal Accidents: Living with High-Risk
       | Technologies" by Charles Perrow (1999).
        
         | csours wrote:
         | I won't make a recommendation either way on this book, but I
         | will say that it came off as much more of an opinionated screed
          | than I was expecting; don't go into it expecting an even keel.
        
       | master_yoda_1 wrote:
        | You forget about lithium-ion batteries, nuclear fusion, and many
        | more examples. Not every problem can be solved by a simple
        | solution; sometimes you need to do complex things.
        
       | [deleted]
        
       | dukeofdoom wrote:
        | I think we should build complexity using a sort of adversarial
        | system, where things compete like in nature.
        | 
        | The Great Lakes are a complex ecosystem. I remember when zebra
        | mussels first appeared there. Dozens of articles appeared about
        | how this was an epic natural disaster that would kill the Great
        | Lakes ecosystem, because zebra mussels would eat all the food.
        | Now, decades later, it turns out that zebra mussels are great at
        | filter feeding and have thrived. But this also cleared up the
        | water greatly, to the point that the lakes can look almost
        | tropical, ocean-clear, on sunny summer days.
        | 
        | That is probably great for all kinds of plants that now thrive
        | on the extra sunshine reaching deeper down. It seems like the
        | system just rearranged itself, and new opportunities appeared
        | for some other plants to thrive.
        
         | TeMPOraL wrote:
         | In nature, everything is fine, including nothing at all,
         | because nature doesn't care either way - it just is. If Zebra
         | mussels turned the Great Lakes sterile, it would be fine by
         | nature.
         | 
          | When we care about a system, we care about specifics - we want
          | it to look a certain way, do certain things, or provide us
          | with certain things. E.g. we want clean-looking, odorless
          | lakes safe for recreation and full of fish to eat. That's a
          | specific outcome that may or may not be what natural selection
          | will provide; to ensure we get what we want, we need to supply
          | a strong selection pressure of our own design.
         | 
          | Adversarial processes are effective given sufficient time, but
          | they're about the least efficient way to build a system short
          | of just waiting for it to appear at random. Humanity's
          | dominance over other life on Earth is in a big way based on
          | the ability to predict and invent things "up front", without
          | any competitive selection. I feel we should refine that skill
          | and find more opportunities to use it, because it's much less
          | wasteful.
        
       | wenc wrote:
       | The accompanying talk by Richard I Cook at the O'Reilly Velocity
       | conference is here:
       | 
       | https://www.youtube.com/watch?v=2S0k12uZR14
       | 
       | When I worked in operations, I would watch this talk every couple
       | of months to remind myself of the principles. The principles are
       | summarized in the list, but the talk adds a bit more context. I
       | find that this context matters for seeing how the principles
       | might be applied.
        
       | jwdomb wrote:
       | This reads like a crisp summary of a subset of John Gall's
       | "Systemantics": https://en.wikipedia.org/wiki/Systemantics
        
       | bitminer wrote:
       | >>[The] defenses include obvious technical components (e.g.
       | backup systems, 'safety' features of equipment) and human
       | components (e.g. training, knowledge) but also a variety of
       | organizational, institutional, and regulatory defenses (e.g.
       | policies and procedures, certification, work rules, team
       | training).
       | 
        | This omits "design" as a defense against problems.
        | 
        | Example: the chemical industries in many countries had
        | horrendous accident records in the 1960s: many employees were
        | dying on the job. For many reasons, the owners re-engineered
        | their plants to substantially reduce overall accident rates.
        | "Days since a lost-time accident" became a key performance
        | indicator.
        | 
        | A key engineering process was introduced: HAZOP. The chemical
        | flows were evaluated under all conditions - full-on, full-stop,
        | and any other situations contrary to the design - and hazards
        | from equipment failures or operational mistakes were thus
        | identified and the design adjusted to mitigate them. This was
        | s.o.p. by the 1980s. See Wikipedia for an intro.
       | 
       | Similar approaches could help IT and other systems.
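        | 
        | As a rough illustration of how the HAZOP guide-word style could
        | transfer to IT (a sketch with made-up deviations, not a real
        | HAZOP study): take one flow in the system and ask what each
        | guide word would mean for it.
        | 
        |   flow = "messages from the producer service into the queue"
        |   deviations = {
        |       "NO":         "no messages arrive (producer down, partition)",
        |       "MORE":       "far more messages than designed capacity",
        |       "LESS":       "fewer messages than expected (silent drops)",
        |       "PART OF":    "messages arrive with fields missing",
        |       "OTHER THAN": "malformed or wrong-schema messages arrive",
        |       "LATE":       "messages arrive after their deadline",
        |   }
        |   for word, deviation in deviations.items():
        |       print(f"{word:>10} -> {deviation}")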
        
       | somurzakov wrote:
        | >> Safety is a characteristic of systems and not of their
        | components
        | 
        | >> Safety is an emergent property of systems; it does not reside
        | in a person, device or department of an organization or system.
        | Safety cannot be purchased or manufactured; it is not a feature
        | that is separate from the other components of the system. This
        | means that safety cannot be manipulated like a feedstock or raw
        | material. The state of safety in any system is always dynamic;
        | continuous systemic change insures that hazard and its
        | management are constantly changing.
        | 
        | This fits perfectly the description of security in complex IT
        | systems - a nice way to explain why the IT security marketplace
        | is a wild wild west.
        
       | crdrost wrote:
       | This is a wonderful article, but I feel like it is kind of
       | missing a certain depth. Read Donella Meadows' classic _Thinking
       | in Systems_ for instance.
       | 
       | So like, yes, of necessity the largeness of a system causes the
       | existence of feedback loops such that it maintains an equilibrium
       | state, a homeostasis. This is kind of just a restatement of what
       | a system _is_ , and to characterize it as "the system is
       | constantly failing," while that is totally true, seems a little
       | bit sensationalist. Like, you are one of these complex systems,
       | do I say that you are _constantly failing_ to draw breath? No,
       | but I imagine some of your alveolar sacs routinely fail to fully
       | inflate and others are too covered with mucous to absorb oxygen
       | and whatever else, and often your breath runs ragged. Similarly
       | with "you cannot identify a root cause," a kid dies in a pool and
       | we routinely have a medical examiner pronounce that the cause of
       | death was drowning, which in the respiratory sense can be fairly
       | well isolated to a root cause, "could not breathe air because no
       | air was supplied by the system upstream." The failure to find a
       | root cause essentially only becomes hard because there is another
       | aspect of complex systems which Meadows discusses, which is that
       | they bleed into each other and into the World At Large, they do
       | not have clearly defined boundaries until we impose them for
       | analysis. Draw the boundaries larger than the respiratory system,
       | say the social system, and now the reason that the kid died has
       | something to do with parents never having been trained on what
       | drowning actually looks like, so they heard some light splashing
       | and looked over and said "aww isn't he enjoying the water" not
       | realizing that this was actually a "holy fuck my son is
       | struggling to keep his mouth above water" moment--this perhaps
       | speaks to a larger problem of not having good water education in
       | society, maybe _that_ is the root cause. Stuff like that. In that
       | sense yes there are failures at all levels, but at any given
        | level of explanation you can more or less say that this is the
       | thing that at that level explains the problem.
       | 
       | So my servers go down, there is a valid root cause at the level
       | of the individual server; there is also one at the level of the
       | cluster as a whole and what that was doing; there is a valid
       | cause at our developer and operations practices; there is a
       | different cause at the level of company culture too. I don't
       | think that it makes sense to say that there is no root cause. The
       | root cause of the Challenger explosion was that it was too cold
       | to fly that day and they still flew the shuttle anyway. One can
        | zoom in ("the booster joints were supposed to be sealed by
        | rubber O-rings, but at the cold temperature of launch day the
        | rubber no longer sealed the joints") or out ("NASA leadership
        | built a false
       | confidence in the quality of their work bolstered by a sort of
       | management double-speak that disconnected them from the real
       | risks they were playing with, making them foolhardy in ordering
       | the launch") and one can even sort of pan sideways ("why was the
       | weather so cold that day?" or "what financial and engineering
       | considerations made them choose this construction with these
       | o-rings, several years ago?"). But this "middle explanation" of
       | "it was too cold to fly that day and they flew the shuttle
       | anyway" is a good summary that gives an entry-point to the
       | problem, which is the real function of a root cause analysis (you
       | want to collect information about how to fix the problem and know
       | whether it was fixed).
       | 
       | The fact that this is a little bit over-sensational detracts a
       | little from my enjoyment but I do think that the heart of the
       | article is great and fun to think about. If folks like thinking
       | about this stuff I really recommend Meadows's book though,
       | systems thinking is not present in most curricula and that is a
       | real shame.
        
       ___________________________________________________________________
       (page generated 2020-12-27 23:00 UTC)