
       [HN Gopher] Knightmare: A DevOps Cautionary Tale (2014)
       ___________________________________________________________________
        
       Knightmare: A DevOps Cautionary Tale (2014)
        
       Author : redredhathat
       Score  : 30 points
       Date   : 2020-02-05 20:50 UTC (2 hours ago)
        
 (HTM) web link (dougseven.com)
        
       | jmalicki wrote:
       | (2014)
        
         | dang wrote:
         | Added. Thanks!
        
       | floatingatoll wrote:
       | Previous discussions on HN:
       | 
       | 2014: https://news.ycombinator.com/item?id=7652036
       | 
       | 2015: https://news.ycombinator.com/item?id=8994701
        
         | lostlogin wrote:
         | Thanks - the top comment from vijucat in the 2015 discussion is
         | anxiety inducing.
         | 
          | "- Ctrl-r for reverse-search through history
          | - typing 'ps' to find the process status utility (of course)
          | - pressing Enter,....and realizing that Ctrl-r actually found
          |   'stopserver.sh' in history instead. (There's a ps inside
          |   stoPServer.sh)"
        
       | t0mas88 wrote:
       | I'm not sure the conclusion of the post is the "One and Only
       | Answer" because a fully automated deploy process has another risk
       | that has bitten both AWS and Google at some point: fully
       | automatically taking down huge numbers of instances.
        
         | SteveNuts wrote:
         | A lot of times those issues have been "fully automated (but
         | with human inputs)" or "fully automated with no guardrails"
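
A minimal sketch of the kind of guardrail discussed above, assuming a
hypothetical fleet of instance IDs and a made-up policy knob
(max_fraction). None of these names come from AWS, Google, or the
article; the idea is only that an automated action refuses to run when
its blast radius exceeds a preset share of the fleet.

```python
class GuardrailError(Exception):
    """Raised when an automated action would touch too much of the fleet."""


def plan_termination(fleet, targets, max_fraction=0.1):
    """Return the targets to terminate, or raise if the blast radius is too large.

    fleet and targets are collections of instance IDs (hypothetical strings).
    max_fraction is an assumed policy value, not taken from any real system.
    """
    fleet = set(fleet)
    targets = set(targets)
    if not fleet:
        raise GuardrailError("empty fleet")
    unknown = targets - fleet
    if unknown:
        # A bad human input (typo, stale inventory) is rejected, not guessed at.
        raise GuardrailError(f"targets not in fleet: {sorted(unknown)}")
    limit = int(len(fleet) * max_fraction)
    if len(targets) > limit:
        raise GuardrailError(
            f"refusing to terminate {len(targets)} of {len(fleet)} instances "
            f"(limit {limit})"
        )
    return sorted(targets)
```

The check sits in front of the automation rather than inside it, so a
wrong input fails loudly before anything is taken down.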
        
       | toomuchtodo wrote:
       | This is less DevOps and more poor software engineering practices
       | (code reviews, unit testing, paying off your technical debt
       | through refactoring/removing old code, etc), although properly
       | managing and instrumenting deploys might have stemmed the
       | bleeding and kept losses manageable.
       | 
       | It's good though; poor decisions must have a cost. The only way
       | to enforce good engineering practices that are human time
       | intensive is for there to be a cost not to.
        
       | forgottenpass wrote:
       | What's to take away from this?
       | 
       | Automate deployment? Fine but boring. That's the prevailing dogma
       | today. I don't remember where the devops hype train was in 2012.
       | Package management had already been a solved problem for years
       | even though it was (and continues to be) regarded as involving
       | too much "icky reading" and a repository system using plain
       | directories on vanilla webservers; all way too unoptimized for
       | resume padding.
       | 
       | Learn how to identify and manage risk like an engineer?
       | Understand how business process and software can implement risk
       | controls and mitigations?
       | 
       | I kid, so I don't cry.
        
         | Traster wrote:
         | There's a whole slew of lessons to learn from this. Leaving
         | dead code in your system and then deciding to _repurpose_ it.
         | Manually deploying with no verification. No checks in place to
         | disable a system during crazy behaviour. No real alerting
         | system. No procedures in place for when a system goes wrong. No
         | audit log to refer to when rolling back.
         | 
         | The lesson from this article is kind of funny:
         | 
         | >It is not enough to build great software and test it; you also
         | have to ensure it is delivered to market correctly so that your
         | customers get the value you are delivering
         | 
         | While true, I don't see any indication this was great software
         | or that it was properly tested.
         | 
         | >Had Knight implemented an automated deployment system -
         | complete with configuration, deployment and test automation -
         | the error that caused the Knightmare would have been avoided.
         | 
         | Or to put it another way - had Knight implemented a higher
         | quality deployment system than the quality of any of their
         | other systems, they _might_ have avoided this issue.
         | 
         | These stories are never about a single thing gone wrong. The
         | whole point about critical systems is that you _should_ need
         | dozens of things to go wrong for them to fail, and then you
         | should fail safe.
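
One way to picture "checks in place to disable a system during crazy
behaviour" plus "fail safe" is a rate-based kill switch. The sketch
below is an illustration only: the class name, thresholds, and the
injectable clock are assumptions, and nothing here describes what
Knight actually ran.

```python
import time
from collections import deque


class KillSwitch:
    """Trips when more than max_events occur within window_seconds; stays tripped."""

    def __init__(self, max_events, window_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.events = deque()
        self.tripped = False

    def allow(self):
        """Record one attempted action; return True only if it may proceed."""
        if self.tripped:
            return False  # fail safe: stay off until a human calls reset()
        now = self.clock()
        self.events.append(now)
        # Drop events that have fallen out of the sliding window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        if len(self.events) > self.max_events:
            self.tripped = True
            return False
        return True

    def reset(self):
        """Manual re-enable, meant to be called after someone has investigated."""
        self.events.clear()
        self.tripped = False
```

The switch does not recover on its own once tripped. That is a
deliberate choice in this sketch: a system that has misbehaved stays
stopped until a person decides otherwise.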
        
           | wikiman wrote:
           | The fundamental truth of software. Your system is only as
           | good as its worst component. every. single. time.
           | 
           | Deployment is a component. Monitoring is a component. They
           | are also OpEx and therefore "inferior"
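
If deployment and monitoring are components, they can be checked like
components. A rough sketch, assuming each host can report the version
it is running (the host names and version strings below are invented
for illustration): after a rollout, compare every report against the
expected version and refuse to call the deploy done if any host
disagrees or did not answer.

```python
def find_stragglers(reported_versions, expected_version):
    """Return {host: version} for hosts not running expected_version.

    reported_versions maps host name -> reported version string, or None
    if the host did not report at all. Both are hypothetical inputs.
    """
    return {
        host: version
        for host, version in sorted(reported_versions.items())
        if version != expected_version
    }


def verify_rollout(reported_versions, expected_version):
    """Raise if any host is on the wrong version; return True otherwise."""
    stragglers = find_stragglers(reported_versions, expected_version)
    if stragglers:
        raise RuntimeError(f"rollout incomplete: {stragglers}")
    return True
```

For example, verify_rollout({"host1": "2.0", "host2": "1.9"}, "2.0")
raises, because host2 is still on the old version.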
        
       ___________________________________________________________________
       (page generated 2020-02-05 23:00 UTC)