[HN Gopher] The Therac-25 Incident
___________________________________________________________________
The Therac-25 Incident

Author : dagurp
Score  : 150 points
Date   : 2021-02-15 13:18 UTC (9 hours ago)

(HTM) web link (thedailywtf.com)
(TXT) w3m dump (thedailywtf.com)

| dade_ wrote:
| The article says the developer was never named, as if that had
| anything to do with the actual problems. Everything about this
| project sounds insane & inept.
|
| Some questions in my mind while reading this article (though I
| couldn't find answers quickly in a search): Who were the
| executives running the company? This sounds like something that
| should be taught to MBA as well as CS students. Further, AECL
| was a crown corporation of the Canadian government. Who were the
| minister and bureaucrats in charge of the department? What role
| did they have in solving or covering up the issue?
| strken wrote:
| The article has a very long explanation of why one developer is
| not to blame, and why it's entirely the fault of the company
| for having no testing procedure and no review process.
| nickdothutton wrote:
| I mentioned this incident in passing in a post of mine a few
| years ago. We studied it many years ago, in the early 90s, at
| university. https://blog.eutopian.io/the-age-of-invisible-disasters/
| b3lvedere wrote:
| Interesting story. I was on the testing side of medical hardware
| many, many moons ago. It's quite amazing what you can and must
| test. For instance: we had to prove that if our equipment broke
| from fall damage, the potential debris flying off could not harm
| the patient.
|
| I always liked the testing philosophy of institutes like
| Underwriters Laboratories: your product will fail. This is
| stated as fact and is not debatable. What kind of fail-safes and
| protection have you built so that when it fails (and it will) it
| cannot harm the patient?
| buescher wrote:
| Yes. It's amazing how many engineers resist doing that analysis
| - "oh, that part won't ever break". Some safety standards do
| allow for "reliable components" (i.e. if a component has already
| been scrutinized for safe failure modes, you don't have to
| consider it) and for submitting reliability analysis or data.
| I've never seen reliability analysis or data submitted instead
| of single-point failure analysis, though.
|
| Single-point failure analysis techniques like fault trees, event
| trees, and especially the tabular "failure modes and effects
| analysis" (FMEA) are so powerful, especially for safety-critical
| hardware, that when people learn them they want to apply them to
| everything, including software.
|
| However, FMEA techniques have been found not to apply well to
| software below about the block-diagram level. They don't find
| bugs that would not be found by other methods (static analysis,
| code review, requirements analysis, etc.) and they're extremely
| time- and labor-intensive. Here's an NRC report that goes into
| some detail: https://www.nrc.gov/reading-rm/doc-collections/nuregs/agreem...
| b3lvedere wrote:
| "Through analysis and examples of several real-life
| catastrophes, this report shows that FMEA could not have helped
| in the discovery of the underlying faults. The report concludes
| that the contribution of FMEA to regulatory assurance of Complex
| Logic, especially software, in a nuclear power plant safety
| system is marginal."
|
| Even more interesting! Thank you for this link. I appreciate it.
| Never too old to learn.
:)
| ed25519FUUU wrote:
| The very worst part of this story is that the manufacturer
| vigorously defended their machine, threatening individuals and
| hospitals with lawsuits if they spoke out publicly. I have zero
| doubt this led to more deaths.
| buescher wrote:
| Here's a bit more background, from Nancy Leveson (now at MIT):
| http://sunnyday.mit.edu/papers/therac.pdf
| meristem wrote:
| Leveson's _Engineering a Safer World_ [1] is excellent, for
| those interested in safety engineering.
|
| [1] https://mitpress.mit.edu/books/engineering-safer-world
| buescher wrote:
| The systems-safety case studies in it are great. It's also
| available as "open access" (free-as-in-free-beer) from MIT
| Press:
|
| https://direct.mit.edu/books/book/2908/Engineering-a-Safer-W...
| time0ut wrote:
| Many years ago, I had an opportunity to work on a similar type
| of system (though more recent than this one). In the final round
| of interviews, one of the executives asked if I would be
| comfortable working on a device that could deliver a potentially
| dangerous dose of radiation to a patient. In that moment, my
| mind flashed to this story. I try to be a careful engineer, and
| I am sure there are many more safeguards in place now, but in
| that moment I realized I would not be able to live with myself
| if I harmed someone that way. I answered truthfully, and he
| thanked me, and we ended things there.
|
| I do not mean this as a judgement on those who do work on
| systems that can physically harm people. Obviously, we need good
| engineers to design potentially dangerous systems. It is just
| how I realized I really don't have the character to do it.
| mcguire wrote:
| Good for you to realize that, then.
|
| On the other hand, I'm not entirely sure it's appropriate to be
| comfortable with that kind of position in any case.
| sho_hn wrote:
| > In the final round of interviews, one of the executives asked
| if I would be comfortable working on a device that could
| deliver a potentially dangerous dose of radiation to a patient
|
| Automotive software engineer here: I've asked the same in
| interviews.
|
| "We work on multi-ton machines that can kill people" is a
| frequently uttered statement at work.
| Hydraulix989 wrote:
| I would have considered the fact that for the vast majority of
| people suffering from cancer, this device helps them rather than
| harms them. However, I can also imagine leadership at some
| places trying to move fast and pressuring ICs into delivering
| something that isn't completely bulletproof in the name of the
| bottom line. That is something I would have tried to discern
| from the executives. Similar tradeoffs have been made before
| with cars, weighed against expected legal costs.
|
| There is plenty of other high-stakes software that involves
| human lives (Uber self-driving cars, SpaceX, Patriot missiles),
| and much of it completely scares me and morally frustrates me to
| the point where I would not want to work on it, but I totally
| understand if you have a personal profile different from mine.
| jancsika wrote:
| I found this comment from the article fascinating:
|
| > I am a physician who did a computer science degree before
| medical school. I frequently use the Therac-25 incident as an
| example of why we need dual experts who are trained in both
| fields. I must add two small points to this fantastic summary.
|
| > 1. The shadow of the Therac-25 is much longer than those who
| remember it. In my opinion, this incident set medical
| informatics back 20 years.
| Throughout the 80s and 90s there was just a feeling in medicine
| that computers were dangerous, even if the individual physicians
| didn't know why. This is why, when I was a resident in
| 2002-2006, we were still writing all of our orders and notes on
| paper. It wasn't until the US federal government slammed down
| the hammer in the mid-2000s and said no payment unless you adopt
| electronic health records that computers made real inroads into
| clinical medicine.
|
| > 2. The medical profession, and the government agencies that
| regulate it, are accustomed to risk and have systems to manage
| it. The problem is that classical medicine is tuned to
| "continuous risks." If the risk of 100 mg of aspirin is "1 risk
| unit" and the risk of 200 mg of aspirin is "2 risk units," then
| the risk of 150 mg of aspirin is strongly likely to be between 1
| and 2, and it definitely won't be 1,000,000. The mechanisms we
| use to regulate medicine, with dosing trials, and
| pharmacokinetic studies, and so forth, are based on this
| assumption that both benefit and harm are continuous functions
| of prescribed dose, and the physician's job is to find the sweet
| spot between them.
|
| > When you let a computer handle a treatment you are exposed to
| a completely different kind of risk. Computers are inherently
| binary machines that we sometimes make simulate continuous
| functions. Because computers are binary, there is a potential
| for corner cases that expose erratic, and as this case shows,
| potentially fatal behavior. This is not new to computer science,
| but it is very foreign to medicine. Because of this, medicine
| has a built-in blind spot in evaluating computer technology.
| mcguire wrote:
| I'm not sure I buy that. Or, well, I suppose that those in the
| medical field believe it, but I don't think they're right.
|
| Consider something like a surgeon nicking an artery while
| performing some routine surgery, the patient not responding
| normally to anesthesia, the anesthetist not getting the mixture
| right and the patient not coming back the way they went in, or
| that subset of patients who have poor responses to a vaccine.
|
| Everybody likes to think of the world as a linear system, but
| it's not.
| ZuLuuuuuu wrote:
| This is one of the infamous incidents where software failure
| caused harm to humans. I am kind of fascinated by such incidents
| (since I learn so much by reading about them). Are there any
| other examples that you know of? It doesn't have to have harmed
| a human - any software-failure-related incident that resulted in
| big consequences.
|
| Another example that comes to mind is the Toyota "unintended
| acceleration" incident. Or the "Mars Climate Orbiter" incident.
| crocal wrote:
| Google Ariane 501. Enjoy.
| probably_wrong wrote:
| If big consequences are what you're after, I can think of three
| typical incidents: the "most expensive hyphen in history" (you
| can search for it like that), its companion piece the Mars
| Climate Orbiter (which I see you added now), and the Denver
| Airport Baggage System fiasco [1], where bad software planning
| caused over $500M in delays.
|
| [1] http://calleam.com/WTPF/?page_id=2086
| FiatLuxDave wrote:
| I find it interesting how often the Therac-25 is mentioned on HN
| (thanks to Dan for the list), but nobody ever mentions that
| those kinds of problems never entirely went away. The Therac-25
| is just the famous one.
| You don't have to go back to 1986; there are definitely examples
| from this century. The root causes are somewhat different, and
| somewhat the same. But no one seems to be teaching these more
| modern cases to aspiring programmers in school, at least not to
| the level where every programmer I know has heard of them.
|
| For example, the issue which caused this in 2007:
|
| https://www.heraldtribune.com/article/LK/20100124/News/60520...
|
| Or the process issues which caused this in 2001:
|
| https://www.fda.gov/radiation-emitting-products/alerts-and-n...
| sebmellen wrote:
| I quote this from the article so people may take an interest in
| reading it. This is the opening paragraph:
|
| > _As Scott Jerome-Parks lay dying, he clung to this wish: that
| his fatal radiation overdose -- which left him deaf, struggling
| to see, unable to swallow, burned, with his teeth falling out,
| with ulcers in his mouth and throat, nauseated, in severe pain
| and finally unable to breathe -- be studied and talked about
| publicly so that others might not have to live his nightmare._
| jtchang wrote:
| I can say that when I was doing my CS degree this was definitely
| covered. In fact, it is one of the lectures that stood out in my
| mind at the time. My professor (Bill Leahy) definitely drilled
| into us the importance of understanding the systems we were
| eventually going to work on.
|
| Not sure if this is still covered today.
| Mountain_Skies wrote:
| When I was the graduate teaching assistant for a software
| engineering lab, the students got a week off to do research on
| software failures that harmed humans. For many of the students
| it was the first time they gave any thought to the concept of
| software causing actual physical harm. I'm glad we were able to
| expose them to this reality, but I was also a bit disheartened,
| as they should have thought about it well before a fourth-year
| course in their major.
| lxgr wrote:
| It was covered in at least one of my classes as well.
| (Graduated only a few years ago.)
| icelancer wrote:
| Was definitely covered in my small school's Embedded Systems
| program in the early 2000s.
| phlyingpenguin wrote:
| The book I use to teach software engineering covers it, and I do
| use that chapter.
| Jtsummers wrote:
| Per younger CS colleagues who went through school in the last 6
| years, it was still being taught at their smaller US colleges.
| lordnacho wrote:
| How much money was AECL making selling these things? You'd think
| a second pair of eyes on the code would not cost too much. Do I
| blame the one person? Not really; who in this world hasn't
| written a race condition at some point? Race conditions are also
| one of those things where someone else might spot the bug a lot
| sooner than the original writer.
|
| I agree with the sentiment that they took the software for
| granted. I get the feeling that happens in a lot of settings,
| most of them less life-threatening than this one. I've come
| across it myself in finance. Somehow someone decides they have
| invented a brilliant money-making strategy, if only they could
| get the coders to implement it properly. Of course the coders
| come back with questions, and then, depending on the
| environment, it plays out to a resolution. I get the feeling the
| same thing happened here: some scientist said "hey, all it needs
| to do is send this beam into the patient" and assumed their
| description was the only level of abstraction that needed to be
| understood.
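A minimal sketch, in C, of the check-then-act race lordnacho is
describing. This is not the Therac-25 code (which was PDP-11
assembly and was never published); every name and timing below is
hypothetical. The shape of the bug matches the documented one,
though: a setup task snapshots shared treatment state, the operator
edits it concurrently, and the machine acts on the stale snapshot
with no lock and no re-check before the irreversible action.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Shared treatment parameter, read and written with no lock. */
    static int beam_mode = 0;  /* 0 = x-ray, 1 = electron (hypothetical) */

    /* Operator corrects the prescription while setup is running. */
    static void *operator_edit(void *arg)
    {
        (void)arg;
        usleep(1000);          /* the edit lands mid-setup */
        beam_mode = 1;
        return NULL;
    }

    /* Setup task: snapshots the mode early, then fires much later. */
    static void *setup_and_fire(void *arg)
    {
        (void)arg;
        int mode = beam_mode;  /* stale snapshot: the bug */
        usleep(5000);          /* slow magnet/turntable setup */
        printf("firing with mode=%d, but prescription is now mode=%d\n",
               mode, beam_mode);
        return NULL;
    }

    int main(void)             /* build: cc race.c -pthread */
    {
        pthread_t setup, operator;
        pthread_create(&setup, NULL, setup_and_fire, NULL);
        pthread_create(&operator, NULL, operator_edit, NULL);
        pthread_join(setup, NULL);
        pthread_join(operator, NULL);
        return 0;
    }

The fix is as unglamorous as the bug: take a lock, or re-read and
re-validate every parameter immediately before the irreversible
step. That is exactly the kind of thing a second pair of eyes tends
to catch.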
| ufmace wrote:
| > The Therac-25 was the first entirely software-controlled
| radiotherapy device. As that quote from Jacky above points out:
| most such systems use hardware interlocks to prevent the beam
| from firing when the targets are not properly configured. The
| Therac-25 did not.
|
| This makes me think: there was only one developer, I guess, who
| was doing everything in assembly. This software, and the process
| that produced it, must have been designed in the early days of
| their devices, when there would have been hardware interlocks to
| prevent any of the really bad failure modes. I bet they never
| changed much of the software, or their procedures for
| developing, testing, qualifying, and releasing it, in light of
| the change from relying on hardware interlocks to the quality of
| the software being the only thing preventing something terrible
| from happening.
| bluGill wrote:
| The software had been working just fine for years on earlier
| versions with the interlocks. They never checked how often or
| why the interlocks fired before removing them. It turns out
| those interlocks fired often because of the same bugs.
| brians wrote:
| They had two fuses, so they had a 2:1 safety margin! Just like
| the NASA managers who decided that 30% erosion in an O-ring
| designed for no erosion meant a 3:1 safety margin.
| Gare wrote:
| A quote from the report:
|
| > Related problems were found in the Therac-20 software. These
| were not recognized until after the Therac-25 accidents because
| the Therac-20 included hardware safety interlocks and thus no
| injuries resulted.
|
| The safety fuses were occasionally blowing during operation of
| the Therac-20, but nobody asked why.
| baobabKoodaa wrote:
| > The safety fuses were occasionally blowing during operation
| of the Therac-20, but nobody asked why.
|
| Have you tried turning it off and on again?
| joncrane wrote:
| I feel like this makes it to HN once every few years or so.
|
| I know it well from it being the first and main case study in my
| software testing class as an undergraduate CS major in
| Washington DC in 1999.
|
| It will never not be interesting.
| siltpotato wrote:
| Apparently this is the seventh one. I've never worked on a
| safety-critical system, but this is the story that makes me
| wonder what it's like to do so.
| Jtsummers wrote:
| It's stressful, but often worthwhile. It requires diligence,
| deliberate action, and patience.
| matthias509 wrote:
| I used to work on public safety radio systems. Things that seem
| like minor issues, like clipping the beginning of a transmission
| every now and then, are showstopper defects in that space.
|
| It's because it can be the difference between "Shoot" and "Don't
| shoot."
| at_a_remove wrote:
| I rather randomly met a woman with a similar sort of background
| and trajectory to mine: trained in physics, sucked into
| computers via the brain drain. She programmed the models for
| radiation dosing in the metaphorical descendants of the
| Therac-25. I asked her just how often it was brought up in her
| work, and she mentioned that she trained under someone who was
| in the original group of people brought in to analyze and
| understand just what happened with the Therac-25. Fascinating
| stuff.
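The blown fuses bluGill and Gare describe above trace in part to a
second documented failure mode, described in Leveson's report
linked upthread: a one-byte shared variable was incremented on
every pass of the setup loop rather than set to a fixed value, so
on every 256th pass it rolled over to zero, the value that meant
"no check needed." Below is a reconstruction in C for illustration
only; the original was assembly, and the names here are
hypothetical.

    #include <stdint.h>
    #include <stdio.h>

    /* Per Leveson's account, a nonzero "Class3" meant "verify the
     * collimator position before proceeding"; zero meant the check
     * could be skipped. */
    static uint8_t class3;

    static void setup_pass(int collimator_ok)
    {
        class3++;  /* the bug: set via increment, wraps 255 -> 0 */

        if (class3 != 0 && !collimator_ok) {
            printf("interlock: setup fault, treatment paused\n");
            return;
        }
        /* On every 256th pass class3 is 0, so a misconfigured
         * collimator sails through. */
        printf("proceeding (class3=%u, collimator_ok=%d)\n",
               class3, collimator_ok);
    }

    int main(void)
    {
        for (int i = 1; i <= 256; i++)
            setup_pass(0);  /* collimator wrong on every pass */
        return 0;           /* passes 1-255 fault; pass 256 fires */
    }

On the Therac-20, hardware interlocks sat behind slips like these
and blew fuses; on the Therac-25 there was nothing behind the
software.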
| dang wrote:
| (For the curious) the Therac-25 stack on HN:
|
| 2019 https://news.ycombinator.com/item?id=21679287
|
| 2018 https://news.ycombinator.com/item?id=17740292
|
| 2016 https://news.ycombinator.com/item?id=12201147
|
| 2015 https://news.ycombinator.com/item?id=9643054
|
| 2014 https://news.ycombinator.com/item?id=7257005
|
| 2010 https://news.ycombinator.com/item?id=1143776
|
| Others?
| kondro wrote:
| It comes up a lot, but it's an incredibly important story that
| bears repeating often, especially with similar issues like the
| 737 MAX occurring pretty recently.
| omginternets wrote:
| The featured comment is great, for those who missed it:
|
| I am a physician who did a computer science degree before
| medical school. I frequently use the Therac-25 incident as an
| example of why we need dual experts who are trained in both
| fields. I must add two small points to this fantastic summary.
|
| 1. The shadow of the Therac-25 is much longer than those who
| remember it. In my opinion, this incident set medical
| informatics back 20 years. Throughout the 80s and 90s there was
| just a feeling in medicine that computers were dangerous, even
| if the individual physicians didn't know why. This is why, when
| I was a resident in 2002-2006, we were still writing all of our
| orders and notes on paper. It wasn't until the US federal
| government slammed down the hammer in the mid-2000s and said no
| payment unless you adopt electronic health records that
| computers made real inroads into clinical medicine.
|
| 2. The medical profession, and the government agencies that
| regulate it, are accustomed to risk and have systems to manage
| it. The problem is that classical medicine is tuned to
| "continuous risks." If the risk of 100 mg of aspirin is "1 risk
| unit" and the risk of 200 mg of aspirin is "2 risk units," then
| the risk of 150 mg of aspirin is strongly likely to be between 1
| and 2, and it definitely won't be 1,000,000. The mechanisms we
| use to regulate medicine, with dosing trials, and
| pharmacokinetic studies, and so forth, are based on this
| assumption that both benefit and harm are continuous functions
| of prescribed dose, and the physician's job is to find the sweet
| spot between them.
|
| When you let a computer handle a treatment you are exposed to a
| completely different kind of risk. Computers are inherently
| binary machines that we sometimes make simulate continuous
| functions. Because computers are binary, there is a potential
| for corner cases that expose erratic, and as this case shows,
| potentially fatal behavior. This is not new to computer science,
| but it is very foreign to medicine. Because of this, medicine
| has a built-in blind spot in evaluating computer technology.
| beerandt wrote:
| It's short-sighted of him not to see that forcing medical
| records onto computers so quickly is almost exactly the same
| problem playing out again: not as directly or dramatically, but
| with a much wider net and far more short- and long-term problems
| (including the software/systems trust issue he mentions).
| pessimizer wrote:
| _'Shocking' hack of psychotherapy records in Finland affects
| thousands_
|
| https://www.theguardian.com/world/2020/oct/26/tens-of-thousa...
| jbay808 wrote:
| A similar thing happened in Canada:
|
| https://globalnews.ca/news/6311853/lifelabs-data-hack-what-t...
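A toy illustration of the physician's "continuous risk" point
above, not taken from the article: in classical medicine, 150 mg
sits safely between 100 mg and 200 mg, but software makes no such
promise. Here a prescription stored in a hypothetical signed 8-bit
field behaves at 100 and 120 and jumps wildly at 130.

    #include <stdio.h>

    /* Hypothetical firmware field: the dose is stored in one
     * signed byte. */
    static signed char stored_dose;

    static int delivered_dose(int prescribed_mg)
    {
        /* Silent truncation; on common two's-complement platforms
         * 130 becomes -126. */
        stored_dose = (signed char)prescribed_mg;
        return stored_dose;
    }

    int main(void)
    {
        const int prescriptions[] = { 100, 120, 130 };
        for (int i = 0; i < 3; i++)
            printf("prescribed %3d mg -> delivered %4d mg\n",
                   prescriptions[i],
                   delivered_dose(prescriptions[i]));
        /* 100 -> 100, 120 -> 120, 130 -> -126: the response is
         * nowhere near continuous at the corner case. */
        return 0;
    }

The dose-response here is the identity function right up until the
byte overflows; no amount of interpolating between tested inputs
would predict the jump, which is the blind spot the physician
describes.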
| [deleted]
| dwohnitmok wrote:
| I suspect that a large proportion of the ways abstract planning
| fails are due to discontinuous jumps, foreseen or unforeseen.
| These may manifest in computer programs, government policy, etc.
|
| Continuity of risk, change, incentives, etc. lends itself to far
| easier analysis and confidence in outcomes. Higher degrees of
| continuity, and lower rates of change, only make that analysis
| easier. Of course it's a trade-off: a flat line is the easiest
| thing to analyze, but also the least useful.
|
| In many ways I view the core enterprise of planning as an
| exercise in trying to smooth out discontinuous jumps (and their
| analogues in higher-order derivatives) to the best of one's
| ability, especially where they exist naturally (e.g. your
| system's objective response may be continuous, but its
| interpretation by humans is discontinuous; how are you going to
| compensate to try to regain as much continuity as possible?).
___________________________________________________________________
(page generated 2021-02-15 23:01 UTC)