[HN Gopher] Complex Systems Fail
___________________________________________________________________
Complex Systems Fail
Author : mxschumacher
Score : 234 points
Date : 2020-12-27 12:54 UTC (10 hours ago)
(HTM) web link (how.complexsystems.fail)
(TXT) w3m dump (how.complexsystems.fail)
| marnett wrote:
| Is this the primary source of this https://fs.blog/2014/04/how-complex-systems-fail/ ?
| di4na wrote:
| As this is a copy of the paper, probably.
|
| I would point out that Dr Cook would probably disagree greatly with the whole idea of AntiFragile, or at least he has, the few times I have been able to talk with him. This link you offer is probably not the best context in which to comment on this paper.
| cs702 wrote:
| I find it _very_ instructive to substitute words like "human," "practitioners," and "people" in this essay with the word "AI," and re-read the essay from the standpoint of building autonomous AI agents that can safely run complex systems whose failures can be catastrophic, like driving a car or maneuvering a rocket. The essay becomes a kind of "guiding principles for building and evaluating autonomous AI."
|
| Here are five sample paragraphs in which I replaced all words referring to human beings with "AI":
|
| --
|
| _Hindsight biases post-accident assessments of AI performance._ Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to the AI at the time than was actually the case. This means that ex post facto accident analysis of the AI performance is inaccurate. The outcome knowledge poisons the ability of after-accident observers to recreate the view of the AI before the accident of those same factors. It seems that the AI "should have known" that the factors would "inevitably" lead to an accident. Hindsight bias remains the primary obstacle to accident investigation, especially when AI performance is involved.
|
| _All AI actions are gambles._ After accidents, the overt failure often appears to have been inevitable and the AI's actions as blunders or deliberate willful disregard of certain impending failure. But all AI actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That AI actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.
|
| _Actions at the sharp end resolve all ambiguity._ Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of AIs at the sharp end of the system. After an accident, AI actions may be regarded as 'errors' or 'violations' but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure.
|
| _Views of 'cause' limit the effectiveness of defenses against future events._ Post-accident remedies for "AI error" are usually predicated on obstructing activities that can "cause" accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents.
| In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
|
| _Failure free operations require AI experience with failure._ Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where AIs can discern the "edge of the envelope". This is where system performance begins to deteriorate, becomes difficult to predict, or cannot be readily recovered. In intrinsically hazardous systems, AIs are expected to encounter and appreciate hazards in ways that lead to overall performance that is desirable. Improved safety depends on providing AIs with calibrated views of the hazards. It also depends on providing calibration about how their actions move system performance towards or away from the edge of the envelope.
|
| --
|
| PS. The original essay, published circa 2002, is also available here as a PDF: https://www.researchgate.net/publication/228797158_How_compl...
| wpietri wrote:
| For those interested in the topic, especially points 7 and 8 (root cause and hindsight bias), I strongly recommend Dekker's "A Field Guide to 'Human Error'". It digs in on those points in a very practical way. We recently had a minor catastrophe at work; it was very useful in preparing me to guide people away from individual blame and toward systemic thinking.
| csours wrote:
| I strongly agree with this recommendation.
|
| Personal experience: during and after the economic collapse of 2007-2009 I wondered who was at fault, who to blame. I kept waiting for a clear answer to "THE root cause". Since then I've read things like Dekker's work, and come to realize that blame is not a productive way of thinking in complex systems.
|
| A quick example: in many car accidents, you can easily point to the person who caused the accident; for example, the person who runs a red light, texts and drives, or drives drunk is easily found at fault. But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
| yholio wrote:
| > A quick example: in many car accidents, you can easily point to the person who caused the accident
|
| This is a bad example: traffic is a complex system with a century of ruleset evolution specifically intended to isolate personal responsibility and provide a simple interface for the users that, when correctly used, guarantees a collision-free ride for all participants.
|
| The systemic failures of traffic are more related to the fallible nature of its actors. The safety guarantees work only when humans demonstrate almost super-human regard for the safety of others and are never inattentive, tired, in a hurry, or influenced by substances or medical conditions, etc.
|
| We try to align personal incentives to systemic goals with hefty punishments, but there is a diminishing return on that; at some point you have to consider humans unreliable and design your system to be fault-tolerant.
| Indeed, most modern traffic systems are doing this today with things like impact-absorbing railings, speed bumps, wide shoulders and curves, etc.
| Joeri wrote:
| This is why I prefer thinking about root solutions rather than root causes. The question becomes "how can we make sure something like this cannot happen again?" (for a reasonably wide definition of "this"). The nice thing is that there are usually many right answers, all of which can be implemented, while when looking for root causes there may not actually be one.
| travmatt wrote:
| A good example was the AWS S3 outage that occurred when a single engineer mistyped a command[0]. While the outage wouldn't have occurred had an engineer not mistyped a command, that conclusion still would have missed the issue that the system should have some level of resiliency against simple typos - in their case, checking that actions wouldn't take subsystems below their minimum required capacity.
|
| [0] https://aws.amazon.com/message/41926/
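A minimal sketch of the kind of guard travmatt describes, assuming a hypothetical deployment tool; the subsystem names, minimums, and function names below are illustrative, not AWS's actual implementation:

    # Hypothetical capacity guard: refuse any removal request that would
    # leave a subsystem below its configured minimum, so that a mistyped
    # argument fails loudly instead of quietly removing too many servers.
    MIN_CAPACITY = {"index": 3, "placement": 2}  # minimum healthy hosts per subsystem

    class CapacityError(Exception):
        pass

    def remove_capacity(subsystem: str, current: int, to_remove: int) -> int:
        """Validate a removal request and return the capacity that would remain."""
        if to_remove <= 0:
            raise ValueError("to_remove must be a positive number of hosts")
        floor = MIN_CAPACITY.get(subsystem, 1)
        remaining = current - to_remove
        if remaining < floor:
            raise CapacityError(
                f"refusing to remove {to_remove} hosts from '{subsystem}': "
                f"{remaining} remaining is below the minimum of {floor}"
            )
        return remaining

    # e.g. remove_capacity("index", current=20, to_remove=18) raises instead
    # of taking the index subsystem below its floor of 3.

The specific check matters less than the shape: a capacity-reducing command is validated against an explicit floor before it executes, rather than trusting the operator's input.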
| hansvm wrote:
| > But what about a case where someone 3 or 4 car lengths ahead makes a quick lane change and an accident occurs behind them?
|
| You'd have to be going extremely slowly for 3 or 4 car lengths to be a safe following distance. On a typical 60-70mph freeway you should have a gap of at least 15-20 car lengths, and then accidents like that will happen only when other factors are at play (and if those factors are predictable, like water on the road, then your distance/speed should be adjusted accordingly).
|
| While I think that example was bad, your point about there existing accidents without single points of blame is still valid.
| csours wrote:
| I meant to say 3 or 4 cars ahead. Thank you for taking the point.
| hansvm wrote:
| That makes sense. To be fair I shouldn't have been so hasty anyway; with your example of a lane change, even if you can control your following distance it's hard to also keep people from other lanes cutting in too closely.
| [deleted]
| narag wrote:
| I wrote another comment before reading the article, which says it better than I did:
|
| _Catastrophe requires multiple failures - single point failures are not enough._
| triangleman wrote:
| There is a relatively simple root cause of the financial crisis; it's called moral hazard:
|
| https://en.m.wikipedia.org/wiki/Greenspan_put
|
| Now, how exactly to solve the problem is a complex question, so I suppose in that respect it's hard to think productively about it.
| watwut wrote:
| Afaik, there were actual massive frauds going on.
| triangleman wrote:
| Like Madoff? The subprime lending/NINJA loans? In my opinion both of those things would have been harder if everyone was doing more due diligence. Moral hazard took away some of the incentive to do that.
| simonh wrote:
| Right. Even if all the wrongdoing had happened together, if the rest of the economic and financial system had been sound, we'd have been ok.
| nullsense wrote:
| This book taught me about Rasmussen's model of safety. It's a good book.
| [deleted]
| karmakaze wrote:
| Root cause analysis, when run properly, doesn't only come out with one culprit; the process can identify many potential sources of weakness in technology, human processes, and documentation, at least one of which (typically more) should immediately be improved.
| thewebcount wrote:
| How does that square with the need to improve systems so the same problems don't happen again? I get not wanting to put blame on a particular person, group, or cause when it's multi-factorial, but how can you improve if you don't figure out why the failure occurred?
| bumby wrote:
| Part of the philosophy is to change the mindset from a "person" perspective to a "process" perspective. I.e., what gaps in the process led to the mishap, not what person caused the mishap.
|
| Organizations that are people-dependent rather than process-dependent tend to have higher risks of failures.
| bobsomers wrote:
| A big part is acknowledging that the actions that human operators take are largely a result of the environment in which they operate. Typically there are many issues with that environment that can be improved, and all human operators will benefit.
|
| To give you a more concrete example, it moves the analysis away from "Bob deleted the production database" into a more productive space of "we really shouldn't have a process that relies on any human logging into the production database and running SQL queries by hand, that's prone to human error".
| thatsamonad wrote:
| I'm not the OP, but when I think of "systemic thinking" I think the focus is more on looking at all of the factors involved as part of a holistic model rather than placing blame at the feet of a particular individual or process. You can still identify causes and try to remediate them, but most of the time the remediation shouldn't be something like "let's fire Bob for making a mistake/error", but rather, "Let's look at all of the events that led up to Bob making that mistake and figure out how we can help him avoid it in the future through a system, process, or people change, or a combination thereof".
|
| That being said, if someone is negligent and consistently does negligent things they should probably be put into a position where their negligence won't cause catastrophic system failures or loss of life. Sometimes that does mean firing someone.
| wpietri wrote:
| That's one of the central questions of the book. But my take is that there are a bunch of ways to answer a "why" question, some more useful than others.
|
| One very common mode is to take a complex causal web, trace until you find a person of low status, and then yell at and/or punish said scapegoat. That desire to blame is a very human approach, but it a) isn't very effective in preventing the next problem, and b) prevents real systemic understanding by providing a false feeling of resolution.
|
| So if we really want to figure out why the failure occurred and reduce the odds of it happening again, we need to give up blame and look at how systems create incentives and behaviors in the people caught up in them. Only if everybody feels a sense of personal safety do we have much chance of getting at what happened and discussing it calmly enough that we can come to real understandings and real solutions.
| jccooper wrote:
| There's a definite difference between blame and cause, and they don't conflict. Blame is for individuals, cause is for systems. While you do need to hold individuals accountable, most of the time you should focus on fixing the system, which is a much more durable fix.
| 0xbadcafebee wrote:
| > The evaluations based on such reasoning as 'root cause' do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.
|
| I've heard this line before and I don't buy it. Methods exist that have been proven to identify root causes, and those have led to significant reductions in failures. Half this page talks about how systems are made resilient to failures over time. But to do that you need to identify the failures' root causes!
|
| The Five Whys is one example of an effective method to find root causes. The solution may not be simple, but it does lead to more robust systems.
| bobsomers wrote:
| Avoiding the trap of "root causes" is exactly what has led to such a significant reduction in failures!
|
| Avoiding "root cause" thinking doesn't mean that there isn't something that can be found and fixed. The intent is to keep your mind open to the fact that these systems and the interaction of their components are phenomenally complex. There are many contributing factors to a loss, not just a single error.
|
| Root cause analysis is harmful because it artificially limits your investigation to a predetermined "single" error or fault. Once the first or most salient error or fault is found (very commonly a human operator), the investigation stops because the "root cause" was found. Instead of stopping at the root cause, you keep digging in an attempt to find all the possible contributing factors, no matter how big or small. You can prevent many future failures of the system, not just the same one, by learning about many places where your defenses and mitigations were weak.
|
| NTSB reports are a great example of how useful this approach is. Instead of identifying a root cause, they list many contributing factors, all of which can be improved to raise the level of safety significantly, rather than fixing a single root cause.
|
| As a recent example, think of the Boeing 737 MAX. What was the "root cause"? Was it:
|
| * A software bug in the MCAS system?
| * The removal of a redundant method for sensing angle of attack?
| * The decision to make the "AoA Disagree" light an optional upgrade?
| * Making substantial, destabilizing changes to the existing airframe?
| * The desire for it to be "another 737 variant" for pilot classification?
| * The insufficient pilot training w.r.t. the new MCAS system?
| * Organizational pressures to not lose the AA contract to Airbus?
| * Executive leadership that prioritized business over safety?
|
| The answer is all of the above, and more! By identifying all of these beyond just "a software bug in the MCAS system", we have the opportunity to fix many safety issues which could be contributing factors to future failures.
|
| There is an even more in-depth example of this in Chapter 5 of Nancy Leveson's book Engineering a Safer World. She spends over 60 pages describing the environment and circumstances that led to US Air Force F-15s shooting down two friendly US Army Black Hawks over northern Iraq in 1994.
| soraki_soladead wrote:
| The actual title is "How Complex Systems Fail" (full url: how.complexsystems.fail), which provides a slightly different meaning. The current title "Complex Systems Fail" implies that we should not have complex systems, but that does not appear to be the intention of the article.
| Rather, we often require complex systems to solve complex problems, and the article provides an exposition on failure in such systems:
|
| > (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
| namenotrequired wrote:
| It's interesting the subtitle mentions "patients" when the entire text applies fully to very different contexts, say, cruise ships.
| chrismorgan wrote:
| I believe HN removes words like "how" or "why" from the start of titles as part of its normalisation, though you can subsequently edit it to fix the title again. I strongly dislike this particular automatic edit because I reckon it's normally wrong.
| nitrogen wrote:
| I wish it would remove the word "just" from titles of the form "X just Y'd a Z".
| treelovinhippie wrote:
| Complex system: nature.
|
| Complicated system: things mentioned in this article.
| ayende wrote:
| This is a short but really good discussion on real-world failures and how to build robust systems.
|
| I refer to it often when thinking about the design of my systems.
| hawktheslayer wrote:
| I see #13 (human expertise) crop up the most often in the complex systems at my work. I think the two main reasons replacing experts is so hard are (1) it's inherently difficult to understand another's work that you haven't built yourself, and (2) it's nearly impossible to find someone enthusiastic enough about taking on a system and giving it the TLC it needs to keep it going smoothly to match the passion that the original creator poured into it. I find this most acutely whenever the original expert refers to it as "their baby".
| rawgabbit wrote:
| Agree with Pt 11: "After an accident, practitioner actions may be regarded as 'errors' or 'violations' but these evaluations are heavily biased by hindsight and ignore the other driving forces, especially production pressure."
|
| Management pressure to produce reminds me of the phrase "We don't know where we are going but we want to get there as soon as possible." Garbage in, garbage out. And no, endless meetings are not a good substitute for anything.
| vegetablepotpie wrote:
| Management pressure to produce is coupled closely with Pt. 10, "practitioner actions are gambles", and these two things lead to less efficiency.
|
| I see the function of management at my workplace as getting practitioners to take bigger gambles to produce more favorable outcomes. This is convenient for management because they can capture the upside of a gamble and focus blame on an individual on the downside of a gamble.
|
| One of the practitioner reactions I've seen to this in government contracting is to become hyper-specialized in one specific job function. Don't be a jack of all trades, don't wear multiple hats. Don't spring up to help someone. This leads to two outcomes.
|
| 1. When a practitioner is confronted with a novel problem, they will say that's not their job and that persons X, Y, or Z can help. Person X directs you to person A, persons Y and Z direct you to B, person A also directs you to person B, and person B directs you back to person X. You ultimately find that no one is responsible for anything.
|
| 2. When an operation does need to be performed, it takes tiny contributions from many people.
| This leads to long chains of activities that take months to do simple operations, such as purchasing items, because there is inevitably someone who is out sick or on vacation for two weeks.
|
| This in turn leads to more hiring, and then management hits back with _initiatives_ to streamline operations. It is a fascinating cycle.
| victor106 wrote:
| We implemented a microservices architecture that was promised, and looked, to be so simple on paper, but it became exponentially complex as each domain was implemented, so we went back to a simpler monolithic service, which was much easier to operate and manage.
| MrXOR wrote:
| Past discussion: https://news.ycombinator.com/item?id=8282923
|
| This is a great paper; I bookmarked this webpage.
| adrianmonk wrote:
| I would modify the section on root causes. It's not a fundamentally bad idea to look for root causes (plural). It's essential to not stop at immediate causes. More or less, you start at immediate causes and then recursively go deeper until you get to things that cannot be changed and/or are not within your control.
|
| One real danger is believing in _a_ root cause or _the_ root cause (singular) as if there is only one. The desire to only have to change (or understand) one thing is dangerous because it leads you to stop investigating/analyzing as soon as you have found something to blame it on.
|
| As you go deeper through the chain(s) of causes, you also should not focus exclusively on the last (root) because that means you aren't thinking about defense in depth.
|
| TL;DR: follow the chain of causes all the way to its ends, realize it may have multiple ends, and don't focus only on the surface stuff or only on the root stuff.
| di4na wrote:
| You are making the assumption that causality is useful and exists in these complex systems. Dr Cook bases his treatise on what we know from the research on them, which is that these causal links are far weaker than you think.
|
| This means that searching along a causal tree is indeed fundamentally detrimental.
| afarrell wrote:
| For those looking for ways to talk to non-engineers about this topic (without them rolling their eyes and telling you to stop making excuses), I highly recommend the book Leadership is Language.
| mv4 wrote:
| For a good analysis of risks and failures associated with complex systems, I recommend "Normal Accidents: Living with High-Risk Technologies" by Charles Perrow (1999).
| csours wrote:
| I won't make a recommendation either way on this book, but I will say that it came off as much more of an opinionated screed than I was expecting; don't go into it expecting an even keel.
| master_yoda_1 wrote:
| You forget about lithium-ion batteries, nuclear fusion, and many more examples. Not all problems can be solved by a simple solution; you need to do complex things.
| [deleted]
| dukeofdoom wrote:
| I think we should build complexity using a sort of adversarial system, where things compete like in nature.
|
| The Great Lakes are a complex ecosystem. I remember when zebra mussels first appeared in the Great Lakes. Tens of articles appeared about how this was an epic natural disaster that would kill the Great Lakes ecosystem, because zebra mussels would eat all the food. Now, decades later, it turns out that zebra mussels are great at filter feeding and have thrived. But this has also cleared up the water greatly, to the point that the lakes can look almost tropical, ocean-clear, on sunny summer days.
|
| Which is probably great for all kinds of plants that now thrive on the extra sunshine reaching deeper down. It seems like the system just rearranged itself, and new opportunities appeared for some other plants to thrive.
| TeMPOraL wrote:
| In nature, everything is fine, including nothing at all, because nature doesn't care either way - it just is. If zebra mussels turned the Great Lakes sterile, it would be fine by nature.
|
| When we care about a system, we care about specifics - we want it to look a certain way, do certain things, or provide us with certain things. E.g. we want clean-looking, odorless lakes safe for recreation and full of fish to eat. That's a specific outcome that may or may not be what natural selection will provide; to ensure we get what we want, we need to supply a strong selection pressure of our own design.
|
| Adversarial processes are effective given sufficient time, but they are about the least efficient way to build a system, short of just waiting for it to appear at random. Humanity's dominance over other life on Earth is in a big way based on the ability to predict and invent things "up front", without any competitive selection. I feel we should refine that skill and find more opportunities to use it, because it's much less wasteful.
| wenc wrote:
| The accompanying talk by Richard I. Cook at the O'Reilly Velocity conference is here:
|
| https://www.youtube.com/watch?v=2S0k12uZR14
|
| When I worked in operations, I would watch this talk every couple of months to remind myself of the principles. The principles are summarized in the list, but the talk adds a bit more context. I find that this context matters for seeing how the principles might be applied.
| jwdomb wrote:
| This reads like a crisp summary of a subset of John Gall's "Systemantics": https://en.wikipedia.org/wiki/Systemantics
| bitminer wrote:
| >>[The] defenses include obvious technical components (e.g. backup systems, 'safety' features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training).
|
| This omits "design" from the defenses against problems.
|
| Example: the chemical industries in many countries in the 1960s had horrendous accident records: many employees were dying on the job. (For many reasons) the owners re-engineered their plants to substantially reduce overall accident rates. "Days since a lost-time accident" became a key performance indicator.
|
| A key engineering process was introduced: HAZOP. The chemical flows were evaluated under all conditions: full-on, full-stop, and any other situations contrary to the design. Hazards from equipment failures or operational mistakes are thus identified and the design is adjusted to mitigate them. This was s.o.p. in the 1980s. See Wikipedia for an intro.
|
| Similar approaches could help IT and other systems.
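As a rough illustration of how bitminer's HAZOP suggestion might carry over to IT, here is a toy sketch that applies the standard HAZOP guide words to a few hypothetical system parameters to generate deviation questions for review; the parameters are illustrative examples, not a prescribed checklist:

    # Toy HAZOP-style review for an IT system (parameters are hypothetical
    # examples): apply the standard guide words to each flow or parameter to
    # generate the deviation questions a review team would work through.
    GUIDE_WORDS = ["NO", "MORE", "LESS", "REVERSE", "AS WELL AS", "PART OF", "OTHER THAN"]
    PARAMETERS = ["request flow to the API", "queue depth", "replication lag"]

    def hazop_prompts(parameters, guide_words=GUIDE_WORDS):
        """Yield one deviation question per (parameter, guide word) pair."""
        for param in parameters:
            for word in guide_words:
                yield (f"{word} {param}: what could cause this deviation, "
                       f"what happens downstream, and which safeguard catches it?")

    for prompt in hazop_prompts(PARAMETERS):
        print(prompt)

Each question would then be answered with a plausible cause, the downstream consequence, and the safeguard (or missing safeguard) that applies, which is roughly the structure of a HAZOP worksheet.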
| somurzakov wrote:
| >>Safety is a characteristic of systems and not of their components
| >>Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system. This means that safety cannot be manipulated like a feedstock or raw material. The state of safety in any system is always dynamic; continuous systemic change insures that hazard and its management are constantly changing.
|
| This fits the description of security in complex IT systems perfectly; it's a nice way to explain why the IT security marketplace is a wild, wild west.
| crdrost wrote:
| This is a wonderful article, but I feel like it is kind of missing a certain depth. Read Donella Meadows' classic _Thinking in Systems_, for instance.
|
| So like, yes, of necessity the largeness of a system causes the existence of feedback loops such that it maintains an equilibrium state, a homeostasis. This is kind of just a restatement of what a system _is_, and to characterize it as "the system is constantly failing," while that is totally true, seems a little bit sensationalist. Like, you are one of these complex systems; do I say that you are _constantly failing_ to draw breath? No, but I imagine some of your alveolar sacs routinely fail to fully inflate and others are too covered with mucus to absorb oxygen and whatever else, and often your breath runs ragged. Similarly with "you cannot identify a root cause": a kid dies in a pool and we routinely have a medical examiner pronounce that the cause of death was drowning, which in the respiratory sense can be fairly well isolated to a root cause, "could not breathe air because no air was supplied by the system upstream." Finding a root cause essentially only becomes hard because of another aspect of complex systems which Meadows discusses, which is that they bleed into each other and into the World At Large; they do not have clearly defined boundaries until we impose them for analysis. Draw the boundaries larger than the respiratory system, say the social system, and now the reason that the kid died has something to do with parents never having been trained on what drowning actually looks like, so they heard some light splashing and looked over and said "aww isn't he enjoying the water" not realizing that this was actually a "holy fuck my son is struggling to keep his mouth above water" moment--this perhaps speaks to a larger problem of not having good water education in society, maybe _that_ is the root cause. Stuff like that. In that sense yes there are failures at all levels, but at any given level of explanation you can more or less say that this is the thing that at that level explains the problem.
|
| So my servers go down: there is a valid root cause at the level of the individual server; there is also one at the level of the cluster as a whole and what that was doing; there is a valid cause at the level of our developer and operations practices; there is a different cause at the level of company culture too. I don't think that it makes sense to say that there is no root cause. The root cause of the Challenger explosion was that it was too cold to fly that day and they still flew the shuttle anyway. One can zoom in ("the booster joints were supposed to be sealed by rubber O-rings, but at the cold temperature of launch day the rubber no longer sealed the joints") or out ("NASA leadership built a false confidence in the quality of their work bolstered by a sort of management double-speak that disconnected them from the real risks they were playing with, making them foolhardy in ordering the launch") and one can even sort of pan sideways ("why was the weather so cold that day?"
or "what financial and engineering | considerations made them choose this construction with these | o-rings, several years ago?"). But this "middle explanation" of | "it was too cold to fly that day and they flew the shuttle | anyway" is a good summary that gives an entry-point to the | problem, which is the real function of a root cause analysis (you | want to collect information about how to fix the problem and know | whether it was fixed). | | The fact that this is a little bit over-sensational detracts a | little from my enjoyment but I do think that the heart of the | article is great and fun to think about. If folks like thinking | about this stuff I really recommend Meadows's book though, | systems thinking is not present in most curricula and that is a | real shame. ___________________________________________________________________ (page generated 2020-12-27 23:00 UTC)