[HN Gopher] Knightmare: A DevOps Cautionary Tale (2014)
___________________________________________________________________
Knightmare: A DevOps Cautionary Tale (2014)
Author : redredhathat
Score  : 30 points
Date   : 2020-02-05 20:50 UTC (2 hours ago)
(HTM) web link (dougseven.com)
(TXT) w3m dump (dougseven.com)
| jmalicki wrote:
| (2014)
| dang wrote:
| Added. Thanks!
| floatingatoll wrote:
| Previous discussions on HN:
|
| 2014: https://news.ycombinator.com/item?id=7652036
|
| 2015: https://news.ycombinator.com/item?id=8994701
| lostlogin wrote:
| Thanks - the top comment from vijucat in the 2015 discussion is
| anxiety inducing.
|
| " - Ctrl-r for reverse-search through history - typing 'ps' to
| find the process status utility (of course) - pressing
| Enter,....and realizing that Ctrl-r actually found
| 'stopserver.sh' in history instead. (There's a ps inside
| stoPServer.sh)"
| t0mas88 wrote:
| I'm not sure the conclusion of the post is the "One and Only
| Answer" because a fully automated deploy process has another risk
| that has bitten both AWS and Google at some point: fully
| automatically taking down huge amounts of instances.
| SteveNuts wrote:
| A lot of times those issues have been "fully automated (but
| with human inputs)" or "fully automated with no guardrails"
| toomuchtodo wrote:
| This is less DevOps and more poor software engineering practices
| (code reviews, unit testing, paying off your technical debt
| through refactoring/removing old code, etc), although properly
| managing and instrumenting deploys might have stemmed the
| bleeding and kept losses manageable.
|
| It's good though; poor decisions must have a cost. The only way
| to enforce good engineering practices that are human time
| intensive is for there to be a cost not to.
| forgottenpass wrote:
| What's to take away from this?
|
| Automate deployment? Fine but boring. That's the prevailing dogma
| today. I don't remember where the devops hype train was in 2012.
| Package management had already been a solved problem for years
| even though it was (and continues to be) regarded as involving
| too much "icky reading" and a repository system using plain
| directories on vanilla webservers; all way too unoptimized for
| resume padding.
|
| Learn how to identify and manage risk like an engineer?
| Understand how business process and software can implement risk
| controls and mitigations?
|
| I kid, so I don't cry.
| Traster wrote:
| There's a whole slew of lessons to learn from this. Leaving
| dead code in your system and then deciding to _repurpose_ it.
| Manually deploying with no verification. No checks in place to
| disable a system during crazy behaviour. No real alerting
| system. No procedures in place for when a system goes wrong. No
| audit log to refer to when rolling back.
|
| The lesson from this article is kind of funny
|
| >It is not enough to build great software and test it; you also
| have to ensure it is delivered to market correctly so that your
| customers get the value you are delivering
|
| While true, I don't see any indication this was great software
| or that it was properly tested.
|
| >Had Knight implemented an automated deployment system -
| complete with configuration, deployment and test automation -
| the error that cause the Knightmare would have been avoided.
|
| Or to put it another way - had Knight implemented a higher
| quality deployment system than the quality of any of their
| other systems, they _might_ have avoided this issue.
|
| These stories are never about a single thing gone wrong. The
| whole point about critical systems is that you _should_ need
| dozens of things to go wrong for them to fail, and then you
| should fail safe.
| wikiman wrote:
| The fundamental truth of software. Your system is only as
| good as its worst component. every. single. time.
|
| Deployment is a component. Monitoring is a component.
| They are also OpEx and therefore "inferior"
___________________________________________________________________
(page generated 2020-02-05 23:00 UTC)