[HN Gopher] The Therac-25 Incident
       ___________________________________________________________________
        
       The Therac-25 Incident
        
       Author : dagurp
       Score  : 150 points
       Date   : 2021-02-15 13:18 UTC (9 hours ago)
        
 (HTM) web link (thedailywtf.com)
 (TXT) w3m dump (thedailywtf.com)
        
       | dade_ wrote:
       | The article says the developer was never named, as if that had
       | anything to do with the actual problems. Everything about this
       | project sounds insane & inept.
       | 
        | Some questions in my mind while reading this article (though I
        | couldn't find answers quickly in a search): Who were the
        | executives running the company? This sounds like something that
        | should be taught to MBA as well as CS students. Further, AECL was
        | a crown corporation of the Canadian government. Who were the
        | minister and bureaucrats in charge of the department? What role
        | did they have in solving or covering up the issue?
        
         | strken wrote:
         | The article has a very long explanation of why one developer is
         | not to blame, and why it's entirely the fault of the company
         | for having no testing procedure and no review process.
        
       | nickdothutton wrote:
        | I mentioned this incident in passing in a post of mine from a few
        | years ago. We studied it at university back in the early 90s.
        | https://blog.eutopian.io/the-age-of-invisible-disasters/
        
       | b3lvedere wrote:
        | Interesting story. I was on the testing side of medical hardware
        | many, many moons ago. It's quite amazing what you can and must
        | test. For instance: we had to prove that if our equipment broke
        | from fall damage, any debris flying off could not harm the
        | patient.
       | 
        | I always liked the testing philosophy of institutes like
        | Underwriters Laboratories: your product will fail. This is stated
        | as fact and is not debatable. What kind of fail-safes and
        | protections have you built so that when it fails (and it will),
        | it cannot harm the patient?
        
         | buescher wrote:
          | Yes. It's amazing how many engineers resist doing that
          | analysis - "oh, that part won't ever break". Some safety
          | standards do allow for "reliable components" (i.e. if a
          | component has already been scrutinized for safe failure modes,
          | you don't have to consider it) and for submitting reliability
          | analysis or data. I've never seen reliability analysis or data
          | submitted in place of single-point failure analysis myself,
          | though.
         | 
         | Single-point failure analysis techniques like fault trees,
         | event trees, and especially the tabular "failure modes and
         | effects analysis" (FMEA) are so powerful, especially for
         | safety-critical hardware, that when people learn them they want
         | to apply them to everything, including software.
         | 
          | However, FMEA techniques have been found not to apply well to
          | software below about the block-diagram level. They don't find
          | bugs that wouldn't be found by other methods (static analysis,
          | code review, requirements analysis, etc.), and they're
          | extremely time- and labor-intensive. Here's an NRC report that
          | goes into some detail: https://www.nrc.gov/reading-
          | rm/doc-collections/nuregs/agreem...
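          | 
          | For a flavor of how lightweight the bookkeeping is, here's a
          | minimal sketch of an FMEA worksheet in C, using the
          | conventional risk priority number (RPN = severity x occurrence
          | x detection, each rated 1-10). The rows and the threshold are
          | hypothetical, not from any real analysis:
          | 
          |     #include <stdio.h>
          |     
          |     /* One row of a (hypothetical) FMEA worksheet. */
          |     struct fmea_row {
          |         const char *failure_mode;
          |         int severity;    /* how bad the effect is, 1-10    */
          |         int occurrence;  /* how likely the cause is, 1-10  */
          |         int detection;   /* how likely we FAIL to catch it */
          |     };
          |     
          |     int main(void) {
          |         struct fmea_row rows[] = {
          |             {"position sensor open-circuit", 10, 3, 7},
          |             {"power supply overvoltage", 9, 2, 4},
          |             {"display latch-up", 6, 4, 8},
          |         };
          |         int n = sizeof rows / sizeof rows[0];
          |     
          |         /* Rank by RPN; high scores get mitigated first. */
          |         for (int i = 0; i < n; i++) {
          |             int rpn = rows[i].severity * rows[i].occurrence
          |                       * rows[i].detection;
          |             printf("%-30s RPN=%3d%s\n", rows[i].failure_mode,
          |                    rpn, rpn >= 100 ? "  <-- mitigate" : "");
          |         }
          |         return 0;
          |     }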
        
           | b3lvedere wrote:
           | "Through analysis and examples of several real-life
           | catastrophes, this report shows that FMEA could not have
           | helped in the discovery of the underlying faults. The report
           | concludes that the contribution of FMEA to regulatory
           | assurance of Complex Logic, especially software, in a nuclear
           | power plant safety system is marginal."
           | 
           | Even more interesting! Thank you for this link. I appreciate
           | it. Never too old to learn. :)
        
       | ed25519FUUU wrote:
        | The very worst part of this story is that the manufacturer
        | vigorously defended its machine, threatening individuals and
        | hospitals with lawsuits if they spoke out publicly. I have zero
        | doubt this led to more deaths.
        
       | buescher wrote:
       | Here's a bit more background, from Nancy Leveson (now at MIT):
       | http://sunnyday.mit.edu/papers/therac.pdf
        
         | meristem wrote:
         | Leveson's _Engineering a Safer World_ [1] is excellent, for
         | those interested in safety engineering.
         | 
         | [1] https://mitpress.mit.edu/books/engineering-safer-world
        
           | buescher wrote:
           | The systems safety case studies in that are great. It's also
           | available as "open access" (free-as-in-free-beer) at MIT
           | press:
           | 
           | https://direct.mit.edu/books/book/2908/Engineering-a-
           | Safer-W...
        
       | time0ut wrote:
       | Many years ago, I had an opportunity to work on a similar type of
       | system (though more recent than this). In the final round of
       | interviews, one of the executives asked if I would be comfortable
       | working on a device that could deliver a potentially dangerous
       | dose of radiation to a patient. In that moment, my mind flashed
       | to this story. I try to be a careful engineer and I am sure there
       | are many more safeguards in place now, but, in that moment, I
       | realized I would not be able to live with myself if I harmed
       | someone that way. I answered truthfully and he thanked me and we
       | ended things there.
       | 
       | I do not mean this as a judgement on those who do work on systems
       | that can physically harm people. Obviously, we need good
       | engineers to design potentially dangerous systems. It is just how
       | I realized I really don't have the character to do it.
        
         | mcguire wrote:
         | Good for you to realize that, then.
         | 
         | On the other hand, I'm not entirely sure it's appropriate to be
         | comfortable with that kind of position in any case.
        
         | sho_hn wrote:
         | > In the final round of interviews, one of the executives asked
         | if I would be comfortable working on a device that could
         | deliver a potentially dangerous dose of radiation to a patient
         | 
         | Automotive software engineer here: I've asked the same in
         | interviews.
         | 
         | "We work on multi-ton machines that can kill people" is a
         | frequently uttered statement at work.
        
         | Hydraulix989 wrote:
          | I would have considered the fact that for the vast majority of
          | people suffering from cancer, this device helps them rather
          | than harms them. However, I can also imagine leadership at some
          | places trying to move fast and pressuring ICs into delivering
          | something that isn't completely bulletproof in the name of the
          | bottom line. That is something I would have tried to discern
          | from the executives. Similar tradeoffs have been made before
          | with cars, weighed against expected legal costs.
          | 
          | There is plenty of other high-stakes software that involves
          | human lives (Uber self-driving cars, SpaceX, Patriot missiles),
          | and much of it completely scares me, and morally frustrates me,
          | to the point where I would not want to work on it. But I
          | totally understand if you have a personal profile different
          | from mine.
        
       | jancsika wrote:
       | I found this comment from the article fascinating:
       | 
       | > I am a physician who did a computer science degree before
       | medical school. I frequently use the Therac-25 incident as an
       | example of why we need dual experts who are trained in both
       | fields. I must add two small points to this fantastic summary.
       | 
       | > 1. The shadow of the Therac-25 is much longer than those who
       | remember it. In my opinion, this incident set medical informatics
       | back 20 years. Throughout the 80s and 90s there was just a
       | feeling in medicine that computers were dangerous, even if the
        | individual physicians didn't know why. This is why, when I was a
        | resident in 2002-2006, we were still writing all of our orders
        | and notes on paper. It wasn't until the US federal government
        | slammed down the hammer in the mid-2000s and said no payment
        | unless you adopt electronic health records that computers made
        | real inroads into clinical medicine.
       | 
       | > 2. The medical profession, and the government agencies that
       | regulate it, are accustomed to risk and have systems to manage
       | it. The problem is that classical medicine is tuned to
       | "continuous risks." If the Risk of 100 mg of aspirin is "1 risk
       | unit" and the risk of 200 mg of aspirin is "2 risk units" then
       | the risk of 150 mg of aspirin is strongly likely to be between 1
       | and 2, and it definitely won't be 1,000,000. The mechanisms we
       | use to regulate medicine, with dosing trials, and pharmacokinetic
       | studies, and so forth are based on this assumption that both
       | benefit and harm are continuous functions of prescribed dose, and
       | the physician's job is to find the sweet spot between them.
       | 
       | > When you let a computer handle a treatment you are exposed to a
       | completely different kind of risk. Computers are inherently
       | binary machines that we sometimes make simulate continuous
       | functions. Because computers are binary, there is a potential for
       | corner cases that expose erratic, and as this case shows,
       | potentially fatal behavior. This is not new to computer science,
        | but it is very foreign to medicine. Because of this, medicine has
        | a built-in blind spot in evaluating computer technology.
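        | 
        | The Therac-25 itself had a corner case of exactly this shape:
        | per Leveson's report, a one-byte setup flag was incremented
        | rather than simply set, so on every 256th pass it rolled over to
        | zero and a collimator safety check was silently skipped. A
        | minimal sketch of that failure mode (simplified, with
        | hypothetical names, not the actual PDP-11 code):
        | 
        |     #include <stdint.h>
        |     #include <stdio.h>
        |     
        |     static uint8_t class3 = 0;  /* nonzero means "recheck" */
        |     
        |     static void setup_pass(int pass) {
        |         class3++;               /* BUG: increment, not "= 1" */
        |         if (class3 != 0) {
        |             /* collimator position verified here */
        |         } else {
        |             printf("pass %d: check skipped (rolled over to 0)\n",
        |                    pass);
        |         }
        |     }
        |     
        |     int main(void) {
        |         for (int pass = 1; pass <= 600; pass++)
        |             setup_pass(pass);   /* skips at passes 256 and 512 */
        |         return 0;
        |     }
        | 
        | A dose between two safe doses can't surprise you; the 256th
        | press of a button can.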
        
         | mcguire wrote:
         | I'm not sure I buy that. Or, well, I suppose that those in the
         | medical field believe it, but I don't think they're right.
         | 
          | Consider something like a surgeon nicking an artery while
          | performing some routine surgery, the patient not responding
          | normally to anesthesia, or the anesthetist not getting the
          | mixture right and the patient not coming back the way they went
          | in. Or that subset of patients who have poor responses to a
          | vaccine.
         | 
         | Everybody likes to think of the world as a linear system, but
         | it's not.
        
       | ZuLuuuuuu wrote:
        | This is one of the infamous incidents where a software failure
        | caused harm to humans. I am kind of fascinated by such incidents
        | (since I learn so much by reading about them). Are there any
        | other examples that you guys know of? They don't have to involve
        | harm to humans; any software-failure-related incident with big
        | consequences counts.
        | 
        | Another example that comes to my mind is the Toyota "unintended
        | acceleration" incident. Or the "Mars Climate Orbiter" incident.
        
         | crocal wrote:
         | Google Ariane 501. Enjoy.
        
         | probably_wrong wrote:
          | If big consequences are what you're after, I can think of three
          | typical incidents: the "most expensive hyphen in history" (you
          | can search for it like that), its companion piece, the Mars
          | Climate Orbiter (which I see you added now), and the Denver
          | Airport Baggage System fiasco [1], where bad software planning
          | caused over $500M in delays.
         | 
         | [1] http://calleam.com/WTPF/?page_id=2086
        
       | FiatLuxDave wrote:
        | I find it interesting how often the Therac-25 is mentioned on HN
        | (thanks to Dan for the list), but nobody ever mentions that those
        | kinds of problems never entirely went away. The Therac-25 is just
        | the famous one. You don't have to go back to 1986; there are
        | definitely examples from this century. The root causes are
        | somewhat different, and somewhat the same. But no one seems to be
        | teaching these more modern cases to aspiring programmers in
        | school, at least not to the level where every programmer I know
        | has heard of them.
       | 
       | For example, the issue which caused this in 2007:
       | 
       | https://www.heraldtribune.com/article/LK/20100124/News/60520...
       | 
       | Or the process issues which caused this in 2001:
       | 
       | https://www.fda.gov/radiation-emitting-products/alerts-and-n...
        
         | sebmellen wrote:
         | I quote this from the article so people may take an interest in
         | reading it. This is the opening paragraph:
         | 
         | > _As Scott Jerome-Parks lay dying, he clung to this wish: that
         | his fatal radiation overdose -- which left him deaf, struggling
         | to see, unable to swallow, burned, with his teeth falling out,
         | with ulcers in his mouth and throat, nauseated, in severe pain
         | and finally unable to breathe -- be studied and talked about
         | publicly so that others might not have to live his nightmare._
        
       | jtchang wrote:
        | I can say that when I was doing my CS degree this was definitely
        | covered. In fact, it is one of the lectures that stood out in my
        | mind at the time. My professor (Bill Leahy) drilled into us the
        | importance of understanding the systems we were eventually going
        | to work on.
        | 
        | Not sure if this is still covered today.
        
         | Mountain_Skies wrote:
          | When I was the graduate teaching assistant for a software
          | engineering lab, the students got a week off to do research on
          | software failures that harmed humans. For many of the students
          | it was the first time they had given any thought to the concept
          | of software causing actual physical harm. I'm glad we were able
          | to expose them to this reality, but I was also a bit
          | disheartened, as they should have thought about it well before
          | a fourth-year course in their major.
        
         | lxgr wrote:
         | It was covered in at least one of my classes as well.
         | (Graduated only a few years ago.)
        
         | icelancer wrote:
         | Was definitely covered in my small school program in Embedded
         | Systems in the early 2000s.
        
         | phlyingpenguin wrote:
          | The book I use to teach software engineering includes it, and I
          | do use that chapter.
        
         | Jtsummers wrote:
          | Per younger CS colleagues who went through school in the last
          | six years, it was still being taught at their smaller US
          | colleges.
        
       | lordnacho wrote:
        | How much money was AECL making selling these things? You'd think
        | a second pair of eyes on the code would not cost too much. Do I
        | blame the one person? Not really; who in this world hasn't
        | written a race condition at some point? Race conditions are also
        | one of those things where someone else might spot one a lot
        | sooner than the original writer.
        | 
        | I agree with the sentiment that they took the software for
        | granted. I get the feeling that happens in a lot of settings,
        | most of them less life-threatening than this one. I've come
        | across it myself in finance: somehow someone decides they have
        | invented a brilliant money-making strategy, if only they could
        | get the coders to implement it properly. Of course the coders
        | come back with questions, and then, depending on the
        | environment, it plays out to a resolution. I get the feeling the
        | same thing happened here. Some scientist said "hey, all it needs
        | is to send this beam into the patient" and assumed their
        | description was the only level of abstraction that needed to be
        | understood.
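        | 
        | For anyone who hasn't read the details: the famous Therac-25
        | race was of the check-then-act variety. The setup task sampled
        | the prescription once, spent several seconds driving the bending
        | magnets, and never noticed if the operator re-edited the
        | prescription in the meantime. A minimal sketch of that shape of
        | bug (all names hypothetical, not the actual code):
        | 
        |     #include <pthread.h>
        |     #include <stdio.h>
        |     #include <unistd.h>
        |     
        |     static int prescribed_mode = 0; /* 0 = X-ray, 1 = electron */
        |     
        |     static void *operator_edits(void *arg) {
        |         (void)arg;
        |         usleep(100 * 1000);         /* edit arrives mid-setup */
        |         prescribed_mode = 1;
        |         return NULL;
        |     }
        |     
        |     static void *setup_task(void *arg) {
        |         (void)arg;
        |         int mode = prescribed_mode; /* BUG: sampled once */
        |         usleep(500 * 1000);         /* slow magnet setup */
        |         printf("beam set for mode %d, prescription says %d\n",
        |                mode, prescribed_mode);
        |         return NULL;
        |     }
        |     
        |     int main(void) {
        |         pthread_t op, setup;
        |         pthread_create(&setup, NULL, setup_task, NULL);
        |         pthread_create(&op, NULL, operator_edits, NULL);
        |         pthread_join(setup, NULL);
        |         pthread_join(op, NULL);
        |         return 0;
        |     }
        | 
        | The fix is exactly the boring thing a second pair of eyes would
        | insist on: re-verify the inputs after the slow step, or lock out
        | edits while it runs.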
        
       | ufmace wrote:
       | > The Therac-25 was the first entirely software-controlled
       | radiotherapy device. As that quote from Jacky above points out:
       | most such systems use hardware interlocks to prevent the beam
       | from firing when the targets are not properly configured. The
       | Therac-25 did not.
       | 
        | This makes me think: there was only one developer there, I guess,
        | who was doing everything in assembly. This software, and the
        | process used to produce it, must have been designed in the early
        | days of their devices, when hardware interlocks could be expected
        | to prevent any of the really bad failure modes. I bet they never
        | changed much of the software, or their procedures for developing,
        | testing, qualifying, and releasing it, in light of the change
        | from relying on hardware interlocks to the quality of the
        | software being the only thing preventing something terrible from
        | happening.
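        | 
        | For illustration, a minimal sketch of the defense in depth that
        | was dropped: firing requires both the software's idea of the
        | machine state and an independent hardware interlock that
        | physically senses the turntable position. The sensor function is
        | a hypothetical stand-in, not AECL's design:
        | 
        |     #include <stdbool.h>
        |     #include <stdio.h>
        |     
        |     /* Independent sensor: reads a microswitch, not a software
        |      * state variable, so a software bug can't lie to it. */
        |     static bool turntable_in_position(void) {
        |         return false;   /* hardware: target is NOT in place */
        |     }
        |     
        |     static bool software_says_ready = true;  /* wrong, as in
        |                                                 the accidents */
        |     
        |     static void fire_beam(void) {
        |         if (!software_says_ready) {
        |             printf("blocked by software check\n");
        |             return;
        |         }
        |         if (!turntable_in_position()) {
        |             printf("blocked by hardware interlock\n");
        |             return;
        |         }
        |         printf("beam on\n");
        |     }
        |     
        |     int main(void) {
        |         fire_beam();    /* -> blocked by hardware interlock */
        |         return 0;
        |     }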
        
         | bluGill wrote:
          | The software had been working just fine for years on earlier
          | versions with the interlocks. They never checked how often, or
          | why, the interlocks fired before removing them. It turns out
          | those interlocks fired often because of the same bugs.
        
           | brians wrote:
           | They had two fuses, so they had a 2:1 safety margin! Just
           | like the NASA managers who decided that 30% erosion in an
           | O-ring designed for no erosion meant a 3:1 safety margin.
        
         | Gare wrote:
         | A quote from the report:
         | 
         | > Related problems were found in the Therac-20 software. These
         | were not recognized until after the Therac-25 accidents because
         | the Therac-20 included hardware safety interlocks and thus no
         | injuries resulted.
         | 
         | The safety fuses were occasionally blowing during the operation
         | of Therac-20, but nobody asked why.
        
           | baobabKoodaa wrote:
           | > The safety fuses were occasionally blowing during the
           | operation of Therac-20, but nobody asked why.
           | 
           | Have you tried turning it off and on again?
        
       | joncrane wrote:
       | I feel like this makes it to HN once every few years or so.
       | 
       | I know it well from it being the first and main case study in my
       | software testing class as an undergraduate CS major in Washington
       | DC in 1999.
       | 
       | It will never not be interesting.
        
         | siltpotato wrote:
         | Apparently this is the seventh one. I've never worked on a
         | safety critical system but this is the story that makes me
         | wonder what it's like to do so.
        
           | Jtsummers wrote:
           | It's stressful, but often worthwhile. It requires diligence,
           | deliberate action, and patience.
        
             | matthias509 wrote:
              | I used to work on public safety radio systems. Things that
              | seem like minor issues, like clipping the beginning of a
              | transmission every now and then, are showstopper defects in
              | that space.
              | 
              | That's because it can be the difference between "Shoot" and
              | "Don't shoot."
        
       | at_a_remove wrote:
        | I rather randomly met a woman with a similar sort of background
        | and trajectory to mine: trained in physics, got sucked into
        | computers via the brain drain. She programmed the models for
        | radiation dosing in the metaphorical descendants of the
        | Therac-25. I asked her just how often it was brought up in her
        | work, and she mentioned that she trained under someone who was in
        | the original group of people brought in to analyze and understand
        | just what happened with the Therac-25. Fascinating stuff.
        
       | dang wrote:
       | (For the curious) the Therac-25 stack on HN:
       | 
       | 2019 https://news.ycombinator.com/item?id=21679287
       | 
       | 2018 https://news.ycombinator.com/item?id=17740292
       | 
       | 2016 https://news.ycombinator.com/item?id=12201147
       | 
       | 2015 https://news.ycombinator.com/item?id=9643054
       | 
       | 2014 https://news.ycombinator.com/item?id=7257005
       | 
       | 2010 https://news.ycombinator.com/item?id=1143776
       | 
       | Others?
        
         | kondro wrote:
          | It comes up a lot, but it's an incredibly important story that
          | bears repeating often, especially with similar issues like the
          | 737 MAX occurring pretty recently.
        
       | omginternets wrote:
       | The featured comment is great, for those who missed it:
       | 
       | I am a physician who did a computer science degree before medical
       | school. I frequently use the Therac-25 incident as an example of
       | why we need dual experts who are trained in both fields. I must
       | add two small points to this fantastic summary.
       | 
       | 1. The shadow of the Therac-25 is much longer than those who
       | remember it. In my opinion, this incident set medical informatics
       | back 20 years. Throughout the 80s and 90s there was just a
       | feeling in medicine that computers were dangerous, even if the
        | individual physicians didn't know why. This is why, when I was a
        | resident in 2002-2006, we were still writing all of our orders
        | and notes on paper. It wasn't until the US federal government
        | slammed down the hammer in the mid-2000s and said no payment
        | unless you adopt electronic health records that computers made
        | real inroads into clinical medicine.
       | 
       | 2. The medical profession, and the government agencies that
       | regulate it, are accustomed to risk and have systems to manage
       | it. The problem is that classical medicine is tuned to
       | "continuous risks." If the Risk of 100 mg of aspirin is "1 risk
       | unit" and the risk of 200 mg of aspirin is "2 risk units" then
       | the risk of 150 mg of aspirin is strongly likely to be between 1
       | and 2, and it definitely won't be 1,000,000. The mechanisms we
       | use to regulate medicine, with dosing trials, and pharmacokinetic
       | studies, and so forth are based on this assumption that both
       | benefit and harm are continuous functions of prescribed dose, and
       | the physician's job is to find the sweet spot between them.
       | 
       | When you let a computer handle a treatment you are exposed to a
       | completely different kind of risk. Computers are inherently
       | binary machines that we sometimes make simulate continuous
       | functions. Because computers are binary, there is a potential for
       | corner cases that expose erratic, and as this case shows,
       | potentially fatal behavior. This is not new to computer science,
        | but it is very foreign to medicine. Because of this, medicine has
        | a built-in blind spot in evaluating computer technology.
        
         | beerandt wrote:
          | It's short-sighted of him not to see that forcing medical
          | records to digital/computers so quickly is almost exactly the
          | same problem playing out again: not as directly or
          | dramatically, but with a much wider net and far more short- and
          | long-term problems (including the software/systems trust issue
          | mentioned).
        
           | pessimizer wrote:
            | _'Shocking' hack of psychotherapy records in Finland affects
            | thousands_
           | 
           | https://www.theguardian.com/world/2020/oct/26/tens-of-
           | thousa...
        
             | jbay808 wrote:
              | A similar thing happened in Canada:
             | 
             | https://globalnews.ca/news/6311853/lifelabs-data-hack-
             | what-t...
        
         | [deleted]
        
         | dwohnitmok wrote:
          | I suspect that a large proportion of the ways abstract planning
          | fails are due to discontinuous jumps, foreseen or unforeseen.
          | That may manifest in computer programs, government policy, etc.
         | 
         | Continuity of risk, change, incentives, etc. lend themselves to
         | far easier analysis and confidence in outcomes. And higher
         | degrees of continuity as well as lower values of change only
         | make that analysis easier. Of course it's a trade-off: a flat
         | line is the easiest thing to analyze, but also the least useful
         | thing.
         | 
          | In many ways I view the core enterprise of planning as an
          | exercise in trying to smooth out discontinuous jumps (and their
          | analogues in higher-degree derivatives) to the best of one's
          | ability, especially when they exist naturally. For example,
          | your system's objective response may be continuous, but its
          | interpretation by humans is discontinuous; how are you going to
          | compensate to try to regain as much continuity as possible?
        
       ___________________________________________________________________
       (page generated 2021-02-15 23:01 UTC)