[HN Gopher] Fundamentals of Incident Management
       ___________________________________________________________________
        
       Fundamentals of Incident Management
        
       Author : bitfield
       Score  : 109 points
       Date   : 2021-08-09 16:38 UTC (6 hours ago)
        
 (HTM) web link (bitfieldconsulting.com)
 (TXT) w3m dump (bitfieldconsulting.com)
        
       | blamestross wrote:
       | I've done a LOT of incident management and I'm not happy about
       | it. The biggest issue I have run into other than burnout is this:
       | 
       | Thinking and reasoning under pressure are the enemy. Make as many
       | decisions in advance as possible. Make flowcharts and decision
       | trees with "decision criteria" already written down.
       | 
        | If you have to figure something out or make a "decision", then
        | things are really, really bad. That happens sometimes, but when
        | teams don't prep at all for incident management (pre-determined
        | plans for common classes of problem), every incident is "really
        | really bad".
       | 
        | If I have a low-risk, low-cost action with low confidence of a
        | high reward, I'm going to do it and just tell people it
        | happened. Asking means I just lost a half-hour+ worth of money;
        | if I had just done it and been wrong, we would have lost 2
        | minutes of money. When management asks me why I did that, I
        | point at the doc I wrote that my coworkers reviewed and mostly
        | forgot about.
       | 
        | A really common example is "it looks like most of the errors
        | are in datacenter X", so you fail out of the datacenter. Maybe
        | it was sampling bias or some other issue and it doesn't help,
        | maybe the problem follows the traffic, maybe it just suddenly
        | makes things better. No matter what, we get signal. Establish
        | well in advance of a situation what the common "solutions" to
        | problems are, and if you are oncall and responding, just DO
        | them and document+communicate as you do.
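        | 
        | (For illustration only, not the commenter's actual setup: a
        | minimal sketch of a pre-written decision criterion, where the
        | threshold and the error-count input are assumptions. The point
        | is that the rule is reviewed in advance, so the oncall only has
        | to execute it.)
        | 
        |     # Hypothetical pre-agreed decision rule for "fail out of a
        |     # datacenter"; threshold and inputs are illustrative only.
        |     ERROR_SHARE_THRESHOLD = 0.7
        | 
        |     def datacenter_to_fail_out(errors_by_dc):
        |         """Return the datacenter to drain, or None if the
        |         pre-agreed criteria aren't met."""
        |         total = sum(errors_by_dc.values())
        |         if total == 0:
        |             return None
        |         worst_dc, worst = max(errors_by_dc.items(),
        |                               key=lambda kv: kv[1])
        |         if worst / total >= ERROR_SHARE_THRESHOLD:
        |             return worst_dc   # do it, then tell people
        |         return None           # no clear winner: escalate instead
        | 
        |     print(datacenter_to_fail_out({"dc-a": 900, "dc-b": 40}))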
        
       | spa3thyb wrote:
       | There is a month and day, Feb 15, in the header, but no year. I
       | can't figure out if that's ironic or apropos, since this story
       | reads like a thriller from perhaps ten years ago, but the post
       | date appears to have been 2020-02-15 - yikes.
        
       | quartz wrote:
       | Nice to see articles like this describing a company's incident
       | response process and the positive approach to incident culture
       | via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an
       | incident management startup).
       | 
       | Regarding gamedays specifically: I've found that many company
       | leaders don't embrace them because culturally they're not really
       | aligned to the idea that incidents and outages aren't 100%
       | preventable.
       | 
        | It's a mistake to think of the incident management muscle as
        | one you'd like to exercise as little as possible; in reality
        | it's something that should be kept in top form, because doing
        | so brings all kinds of downstream value for the company (a
        | positive culture around resiliency, openness, team building,
        | honesty about technical risk, etc.).
       | 
       | Sadly this can be a difficult mindset to break out of especially
       | if you come from a company mired in "don't tell the exec unless
       | it's so bad they'll find out themselves anyway."
       | 
       | Relatedly, the desire to drop the incident count to zero
       | discourages recordkeeping of "near-miss" incidents, which
       | generally deserve to have the same learning process (postmortem,
       | followup action items, etc) associated with them as the outcomes
       | of major incidents and game days.
       | 
       | Hopefully this outdated attitude continues to die off.
       | 
       | If you're just getting started with incident response or are
       | interested in the space, I highly recommend:
       | 
       | - For basic practices: Google's SRE chapters on incident
       | management [2]
       | 
       | - For the history of why we prepare for incidents and how we
       | learn from them effectively: Sidney Dekker's Field Guide to
       | Understanding Human Error [3]
       | 
       | [1] https://kintaba.com
       | 
       | [2] https://sre.google/sre-book/managing-incidents/
       | 
       | [3] https://www.amazon.com/Field-Guide-Understanding-Human-
       | Error...
        
         | athenot wrote:
         | > Relatedly, the desire to drop the incident count to zero
         | discourages recordkeeping of "near-miss" incidents, which
         | generally deserve to have the same learning process
         | (postmortem, followup action items, etc) associated with them
         | as the outcomes of major incidents and game days.
         | 
          | Zero recorded incidents is a vanity metric in many orgs, and
          | yes, it loses many fantastic learning opportunities. The end
          | result is that these learning opportunities eventually do
          | happen, but with significant impact associated with them.
        
         | pm90 wrote:
         | > Regarding gamedays specifically: I've found that many company
         | leaders don't embrace them because culturally they're not
         | really aligned to the idea that incidents and outages aren't
         | 100% preventable.
         | 
          | So. Much. This. Unless leaders were engineers in the past or
          | have kept abreast of the evolution in technology, the default
          | mindset is still "incidents should never happen" rather than
          | "incidents will happen; how can we handle them better?". This
          | is especially pronounced in politics-heavy environments,
          | since outages are seen as a professional failure, a way to
          | score brownie points over the team that fails. As a result,
          | you often have a culture that tries to avoid being
          | responsible for outages at any cost, which (ironically) leads
          | to worse overall quality of the system, since the root cause
          | is never dealt with.
        
       | denton-scratch wrote:
        | It doesn't match my experience with a real incident.
       | 
       | I was a dev in a small web company (10 staff), moonlighting as
       | sysadmin. Our webserver had 40 sites on it. It was hit by a not-
       | very-clever zero-day exploit, and most of the websites were now
       | running the attacker's scripts.
       | 
       | It fell to me to sort it out - the rest of the crew were to keep
       | on coding websites. The ISP had cut off the server's outbound
       | email, because it was spewing spam. So I spent about an hour
       | trying to find the malicious scripts, before I realised that I
       | could never be certain that I'd found them all.
       | 
       | You get an impulse to panic when you realise that the company's
       | future (and your job) depends on you not screwing up; and you're
       | facing a problem you've never faced before.
       | 
       | So I commissioned a new machine, and configured it. I started
       | moving sites across from the old machine to the new one. After
       | about three sites, I decided to script the moving work. Cool.
       | 
       | But the sites weren't all the same - some were Drupal (different
       | versions), some were Wordpress, some were custom PHP. It worked
       | for about 30 of the sites, with a lot of per-site manual
       | tinkering.
       | 
       | Note that for the most part, the sites weren't under revision
       | control - there were backups in zip files, from various dates,
       | for some of the sites. And I'd never worked on most of those
       | sites, each of which had its own quirks. So I spent the next week
       | making every site deploy correctly from the RCS.
       | 
       | I then spent about a week getting this automated, so that in a
       | future incident we could get running again quickly. Happily we
       | had a generously-configured Xen server, and I could test the
       | process on VMs.
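        | 
        | (Roughly, the per-site move was the sort of thing you could
        | sketch like this; the hosts, paths, and site list below are
        | made up, and a real run needed per-site tinkering. Copying
        | files off a compromised box risks carrying the payload along,
        | which is why deploying from revision control was the eventual
        | fix.)
        | 
        |     #!/usr/bin/env python3
        |     # Illustrative per-site migration loop (hypothetical names).
        |     import subprocess
        | 
        |     OLD_HOST = "old-webserver"   # the compromised machine
        |     NEW_ROOT = "/var/www"        # docroot on the rebuilt server
        |     SITES = ["site-drupal", "site-wordpress", "site-custom"]
        | 
        |     def migrate(site):
        |         # Copy the site files across.
        |         subprocess.run(["rsync", "-a",
        |                         f"{OLD_HOST}:/var/www/{site}/",
        |                         f"{NEW_ROOT}/{site}/"], check=True)
        |         # Dump and reload the database (name assumed = site name).
        |         dump = subprocess.run(["ssh", OLD_HOST, "mysqldump", site],
        |                               check=True, capture_output=True)
        |         subprocess.run(["mysql", site], input=dump.stdout,
        |                        check=True)
        | 
        |     for site in SITES:
        |         migrate(site)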
       | 
        | My colleagues weren't allowed to help out; they were supposed
        | to go on making websites. And I got resistance from my boss,
        | who kept demanding status updates ("are we there yet?").
       | 
       | The happy outcome is that that work became the kernel of a proper
       | CI pipeline, and provoked a fairly deep change in the way the
       | company worked. And by the end, I knew all about every site the
       | company hosted.
       | 
       | We were just a web-shop; most web-shops are (or were) like this.
       | If I was doing routine sysadmin, instead of coding websites, I
       | was watched like a hawk to make sure I wasn't doing anything
       | 'unnecessary'.
       | 
       | This incident gave me the authority to do the sysadmin job
       | properly; and in fact it saved me a lot of sysadmin time -
       | because previously, if a dev wanted a new version of a site
       | deployed, I had to interrupt whatever I was doing to deploy it.
       | With the CI pipeline, provided the site had passed some testing
       | and review stage, it could be deployed to production by the dev
       | himself.
       | 
       | It would have been cool to be able to do recovery drills,
       | rotating roles and so on; but it was enough for my bosses that
       | more than one person knew how to rebuild the server from scratch,
       | and that it could be done in 30 minutes.
       | 
       | Life in a small web-shop could get exciting, occasionally.
        
         | pm90 wrote:
         | It sounds like you're working in a different environment than
         | the author. The environment they describe involves an ops
         | _team_ rather than an ops _individual_ (what you've described).
         | If you had to work with a team to resolve the incident, and had
         | to do so on a fairly regular cadence, processes like this would
         | likely be more useful.
        
           | denton-scratch wrote:
           | I have worked in a properly-organised devops environment
           | (same number of staff, totally different culture).
           | 
           | Anyway, I was just telling a story about a different kind of
           | "incident response".
        
       | rachelbythebay wrote:
       | I may never understand why some places are all about assigning
       | titles and roles in this kind of thing. You need one, maybe two,
       | plus a whole whack of technical skills from everyone else.
       | 
       | Also, conference calls are death.
        
         | dilyevsky wrote:
          | I find the Comms Lead role to be super useful because I don't
          | want to be bogged down replying to customers in the middle of
          | the incident + probably don't even have all the
          | context/access. Everything else except ICM seems like a waste
          | of time to me, especially Recorder.
        
       | mimir wrote:
       | It sort of baffles me how much engineer time is seemingly spent
       | here designing and running these "gamedays" vs just improving and
       | automating the underlying systems. Don't glorify getting paged,
       | glorify systems that can automatically heal themselves.
       | 
       | I spend a good amount of time doing incident management and
       | reliability work.
       | 
        | Red team/blue team gamedays seem like a waste of time. Either
        | you are so early on your reliability journey that trivial
        | things like "does my database failover" are interesting things
        | to test (in which case just fix it), or you're a more
        | experienced team and there's little low-hanging reliability
        | fruit left. In the latter case, gamedays seem unlikely to
        | closely mimic a real-world incident. Since the low-hanging
        | fruit is gone, all your serious incidents tend to be complex
        | failure interactions between various system components. To
        | resolve them quickly, you simply want all the people with deep
        | context on those systems quickly coming up with and testing out
        | competing hypotheses on what might be wrong. Incident
        | management only really matters in the sense that you want to
        | allow the people with the most system context to focus on
        | fixing the actual system. Serious incident management really
        | only comes into play when the issue is large enough to threaten
        | the company + require coordinated work from many orgs/teams.
       | 
        | My team and I spend most of our time thinking about how we can
        | automate any repetitive tasks or failover. In cases where
        | something can't be automated, we think about how we can
        | increase the observability of the system, so that future issues
        | can be resolved faster.
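        | 
        | (A minimal sketch of what "automate the failover instead of
        | drilling a human" can look like; the health endpoint, the
        | threshold, and the promote_standby() hook are hypothetical.)
        | 
        |     # Watchdog: fail over automatically after repeated failed
        |     # health checks, instead of paging someone to do it by hand.
        |     import time
        |     import urllib.request
        | 
        |     HEALTH_URL = "http://primary.internal/healthz"  # assumed
        |     FAILURES_BEFORE_FAILOVER = 3
        | 
        |     def healthy():
        |         try:
        |             with urllib.request.urlopen(HEALTH_URL, timeout=2) as r:
        |                 return r.status == 200
        |         except OSError:
        |             return False
        | 
        |     def promote_standby():
        |         # Placeholder: call your orchestration/DNS/LB API here.
        |         print("failing over to standby")
        | 
        |     failures = 0
        |     while True:
        |         failures = 0 if healthy() else failures + 1
        |         if failures >= FAILURES_BEFORE_FAILOVER:
        |             promote_standby()
        |             break
        |         time.sleep(10)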
        
       | krisoft wrote:
       | > which captures all the key log files and status information
       | from the ailing machine.
       | 
       | Machine? As in singular machine goes down and you wake up 5
       | people? That just sounds like bad planning.
       | 
       | > Pearson is spinning up a new cloud server, and Rawlings checks
       | the documentation and procedures for migrating websites, getting
       | everything ready to run so that not even a second is wasted.
       | 
       | Heroic. But in reality you have already wasted minutes. Why is
       | this not all automated?
       | 
        | I understand that this is a simulated scenario. Maybe the
        | situation was simplified for clarity, but really, if a single
        | machine going down leads to this amount of heroics, then you
        | should work on those fundamentals. In my opinion.
        
         | RationPhantoms wrote:
         | Not only that but they appear to be okay with the fact that a
         | single ISP has knocked them offline. If I was a customer of
         | theirs and found out, I would probably change providers.
        
         | LambdaComplex wrote:
         | Agreed.
         | 
         | While reading this, I was thinking "This is so important that
         | you'll wake all these people up in the middle of the night, but
         | you only have a single ISP? No backup ISP with automated
         | failover?"
        
         | commiefornian wrote:
         | They skipped over a few steps of ICS. ICS starts with a single
         | person playing all roles.
         | 
          | It prescribes a way to scale the team up and down in ways
          | that streamline communication, so everyone knows their role,
          | nothing gets lost when people come in and out of the system,
          | and you don't have all-hands conference calls, multiple
          | people telling the customers multiple things, or multiple
          | people asking each person for status updates.
        
       | gengelbro wrote:
        | The 'cool tactical' framing this article attempts to convey
        | isn't inspiring to me.
       | 
        | I've worked as an oncall for a fundamental backbone service of
        | the internet in the past and been paged into middle-of-the-
        | night outages. It's harrowing and exhausting. Cool names like
        | 'incident commander' do not change this.
       | 
       | We also had a "see ya in the morning" culture. Instead I'd be
       | much more impressed to have a "see ya in the afternoon, get some
       | sleep" culture.
        
         | choeger wrote:
          | It seems to be a bit of a cargo cult, to be honest. They seem
          | to take inspiration from ER teams or the military.
         | 
         | I think that this kind of drill helps a lot for cases where you
         | can take a pre-planned route, like deploying that backup server
         | or rerouting traffic. But the obvious question then is: Why not
         | automate _that_ as well?
         | 
          | When it comes to diagnosis or, worse, triage, in my experience
          | you want independent free agents looking at the system all at
          | once. You don't want a warroom-like atmosphere with a single
          | screen, but rather n+1 hackers focusing on what their first
          | intuition tells them is the root cause. In a second step you
          | want these hackers to convene and discuss their root-cause
          | hypotheses. If necessary, you want them to run experiments to
          | confirm these hypotheses. And _then_ you decide the
          | appropriate reaction.
        
           | joshuamorton wrote:
           | I agree. I think this particular framing gets things slightly
           | wrong. You want parallelism, but you still need central
           | organization (so that you can have clear delegation) and
           | delegation of work to various researchers. For a complex
           | incident, I've seen 5+ subteams researching various threads
           | of the incident. But, importantly, before any of those
           | subteams take any action, they report to the IC so that two
           | groups don't accidentally take actions that might be good in
           | isolation but are harmful when combined.
        
           | 1123581321 wrote:
           | My experience is there's little conflict between a central
           | conference call or room, and multiple independent
           | investigators, since those investigators need to present and
           | compare their findings _somewhere_. It would indeed be a
           | mistake to demand everyone look at one high-level view,
           | though. Based on the organization depicted in the article,
           | this would be the "researcher" role, split among multiple
           | people.
        
           | dvtrn wrote:
           | _They seem to take inspiration from ER teams or the military_
           | 
           | It's probably nothing but overestimation but I feel like I'm
           | seeing more of this later in my career than I did early on,
           | or maybe I'm paying more attention?
           | 
           | Whatever it is: past experience (which includes coming from a
           | military family in the states) has taught me to avoid
           | companies that crib unnecessary amounts of jargon, lingo and
           | colloquialisms from the military.
           | 
            | Curious if others have noticed this or feel the same, and
            | what experiences have led you to feel similarly?
        
             | pinko wrote:
             | Agree completely. It's a strong signal that someone has a
             | military cosplay fetish (which very few people with
             | experience in the actual military do), which in turn tends
             | to come along with other dysfunctional traits. It's a
             | warning for me that the person is not likely to be a good
             | vendor, customer, or collaborator.
        
               | pjc50 wrote:
               | Yup. It's misplaced machismo, with all that implies.
        
               | dvtrn wrote:
                | My favorite one was when a superior was explaining a
                | plan to right-size some new machines as we slowly
                | migrated customers onto the appliance, along with some
                | particularly aggravating memory-consumption issues
                | that, even after inspection and a lot of time spent,
                | made no real sense to us.
               | 
               | "dvtrn you are to take the flank and breach this issue
               | with Paul"
               | 
                | And this ran all the way up to the top of the org.
                | Senior leaders were _constantly_ quoting that Jocko
                | Willink fella. It was...something.
               | 
                | My old man (a former Drill Instructor, which made for
                | an interesting childhood) found it utterly hilarious
                | when I'd call him up randomly with the latest phrase of
                | the day, uttered by some director or another. To my
                | knowledge, and I sure-damn asked, the only affinity
                | anyone on the executive team had with the military was
                | two of them having buddies who served.
        
             | igetspam wrote:
             | I don't know how old you are but my career now exceeds two
             | decades. I definitely see this more now but that's because
             | I institute it. Earlier in my career, we failed at incident
             | management and at ownership. We now share the burden of on-
              | call not just with the operators (sysadmins of old) but
             | also with the people who wrote the code. We've spent a lot
             | of time building better models based on proven methods,
             | quite a few come from work done in high intensity roles
             | paid by tax dollars: risk analysis, disaster recovery,
             | firefighting, command and control, incident management, war
             | games, red teams.
        
               | dvtrn wrote:
               | You've got a couple of years on me, I've been in the game
               | a little over 13 years now.
               | 
               | I support the notion there's a strong difference between
               | lingo that's properly applied to the situation and lingo
               | that is recklessly applied because it "sounds cool".
               | 
                | The examples you gave seem to be fair game for the
                | work being done, in the interest of brief, specific
                | language; the examples I gave in another comment,
                | though ("flanking", "breaching"), are just grating
                | and...weird to use in a work environment.
               | 
               | Your point is nevertheless well met.
        
           | athenot wrote:
            | Yes, it's a map-reduce algorithm. Multiple people check
            | multiple areas of the system in parallel, and then both
            | evidence & rule-outs start to emerge.
        
         | totetsu wrote:
          | The worst part is hearing from your manager the next day that
          | the NOC operator complained about your rude tone of voice
          | when waking up and answering the phone at 3am.
        
         | dmuth wrote:
         | I don't disagree with your post, but one thing I want to
         | mention is the origin of the term "Incident Commander"--it
         | doesn't exist to be cool, but rather derives from how FEMA
         | handles disasters. I suspect its usage in IT became a thing
          | because it was already used in real life, and it made more
         | sense than creating a new term.
         | 
         | If you have two hours, you can take the training that describes
         | the nomenclature behind the Incident Command System, and why it
         | became a thing:
         | 
         | https://training.fema.gov/is/courseoverview.aspx?code=is-100...
         | 
         | This online training takes about 2 hours and is open to the
         | general public. I took it on a Saturday afternoon some years
         | ago and it gave me useful context to why certain things are
         | standardized.
        
           | [deleted]
        
         | jvreagan wrote:
          | > Instead I'd be much more impressed to have a "see ya in the
          | afternoon, get some sleep" culture
         | 
         | I've led teams for over a decade that have oncall duties. One
         | principle we have lived by is that if you are paged outside of
         | hours, you take off the time you need to not hold a grudge
         | against the team/company. Some people don't need it, some
         | people take a day off, some people sleep in, some people cash
         | in on their next vacation. To each their own according to their
         | needs. Seems to work well.
         | 
         | We also swap out oncall in real time if, say, someone gets
         | paged a couple nights in a row.
        
           | athenot wrote:
           | Yup, this is important or people will burn out real quick.
           | And when there are major incidents, as IC it's especially
           | important to dismiss people from bridges as early as possible
           | when I know I'm going to need them sooner rather than later
           | the following day. Or swap with a more junior person so that
           | the senior one is nice and fresh for when the next wave is
           | anticipated.
        
           | [deleted]
        
         | Sebguer wrote:
         | Yeah, the tone of this article is really odd, and like, the
         | bulk of the content is just a narrativization of the incident
         | roles in the Google SRE book. The only 'trick' is running game
         | days?
        
         | lemax wrote:
         | These incident role names are fairly common in product
         | companies these days. I guess you are correct that they do
          | suggest a certain culture around incidents, but in my
          | experience it's definitely a good thing. It's a "don't blame
         | people, let's focus on the root cause, get things back up, and
         | figure out how to prevent this next time" sort of thing. People
         | try to meet SLAs and they treat each other like humans. We
         | focus on improving process/frameworks over blaming individual
         | people. And yup, think this comes along with, "incident
         | yesterday was intense, I'm gonna catch up on sleep".
        
           | pm90 wrote:
           | I agree with your comment. These names are just ways for
           | teams to delegate different responsibilities w.r.t incident
           | management quickly and in a way that's understood by
           | everyone. Having concrete names for such roles is both a good
           | thing (everyone knows who can make the call for hard
           | decisions) and helps you talk intelligently about the
           | evolution of such roles. e.g. "our Incident Commanders used
           | to spend 15% of their time in p0 incidents, but that has
           | reduced to 10% due to improvements in rollout
           | procedures/runbooks/etc."
        
         | zippergz wrote:
         | The concepts and terms of incident command are not from the
         | military (or ER as another poster suggested). It's from the
         | fire service and emergency management in general. I don't know
         | if that changes peoples' perceptions and I agree that no amount
         | of terminology changes how exhausting being on call is. But if
         | people are reacting negatively to "military" connotations, I
         | think that is unwarranted.
         | 
         | https://en.wikipedia.org/wiki/Incident_Command_System
        
           | nucleardog wrote:
           | I think actually learning what the ICS is _for_ might help
           | people understand a bit better why it's not necessarily just
           | "unnecessary tacticool". It's not just a bunch of important-
           | sounding names for things.
           | 
           | ICS, at its core, is a system for helping people self-
           | organize into an effective organization in the face of
           | quickly changing circumstances and emergent problems.
           | 
           | Some simple rules are things like:
           | 
           | * The most senior/qualified person on-site is generally in
           | charge. (How you determine that kinda varies depending on
           | organization.)
           | 
           | * Positions are only created when required. You don't assign
           | people roles unless there's a need for that role.
           | 
           | * Positions are split and responsibilities delegated as the
           | span of control increases beyond a set point.
           | 
           | * Control should stay as local to the problem as it
           | realistically can while still solving the problem.
           | 
            | From there, it goes on to standardize a template hierarchy
           | and defines things like specific colours associated with
           | specific roles so as roles change and chaos ensues, people
           | can continue to operate effectively and in an organized
           | manner. In-person, this means things like the
           | commander/executive roles running around in red vests with
           | their role on the back. If the role changes hands, so does
           | the vest.
           | 
           | Some of the roles in that template organization are things
           | like:
           | 
           | * The "Public Information Officer" who is responsible for
           | preparing and communicating to the public. This makes a
           | single person responsible to ensure conflicting or confusing
           | messaging is not making its way out.
           | 
           | * A "Liason Officer" who is responsible for coordinating with
           | other organizations. This provides another central point of
           | coordination for requests flowing outside of your response.
           | 
           | I think we could all imagine how this starts to become
           | valuable in, say, a building collapse scenario with police,
           | fire, EMS, the gas company, search and rescue, emergency
           | social services, etc all on scene.
           | 
            | In an IT context, what this means is that, generally, the
           | most senior person online is going to be in charge of
           | receiving reports from people and directing them. If there
           | aren't many people around, they'd generally be pitching in to
           | help as well.
           | 
           | As more people show up and the communication and coordination
           | overhead increases, they step out of doing any specific
           | technical work. If enough show up, they may then delegate
           | people out as leaders of specific teams tasked with specific
           | goals (they may also just tell them they're not needed and
           | send them to wait on standby).
           | 
            | All roles, including the "Public Information" and "Liaison"
           | roles fall to the Incident Commander unless delegated out. At
           | some point, if the requests for reporting from management
           | start interfering with their role as Incident Commander, they
           | delegate that role out. If it turns out the incident is going
           | to require heavy communication or coordination with a vendor,
            | they may delegate out the Liaison role to someone else.
           | 
            | ICS is probably largely unnecessary if your response never
            | spans more people than can effectively communicate in a
            | Google Meet call. But as you get more and more people
            | involved, it contains a lot of valuable lessons, learned
            | through real-world experience in situations much more
            | stressful and dangerous than we ever face, that help you
            | effectively manage and coordinate the human resources
            | responding to an incident.
           | 
            | (Disclaimer: That's all basically from memory. The city
            | sent me on an ICS course, one on ICS in an emergency
            | operations centre context, and a few more courses a few
            | years back as part of volunteering with an emergency
            | communications group. It's probably 90% accurate.)
        
         | raffraffraff wrote:
         | EU working time act means "I'll see you in the afternoon",
         | whether they like it or not.
        
         | tetha wrote:
         | > We also had a "see ya in the morning" culture. Instead I'd be
         | much more impressed to have a "see ya in the afternoon, get
         | some sleep" culture.
         | 
         | German labor laws forbid employees from working 10-13 hours
         | after a long on-call situation after a normal work day, just
         | like that. Add in time compensation, and a bad on-call
         | situation at night easily ends up as the next day off paid.
         | 
          | I've found this to take a lot of the edge off on-call. Sure,
          | it /sucks/ to get called at 1am and do stuff until 3, but
          | then that's a day to sleep in and recover. Maybe hop on a
          | call if the team needs input, but that's optional.
        
       | ipaddr wrote:
        | So they are testing against fully awake people at 2:30pm and
        | expecting similar results at 4:30am after heavy drinking.
        
         | x3n0ph3n3 wrote:
         | Do you really drink heavily when you are on-call?
        
           | LambdaComplex wrote:
           | Based on my understanding of the UK's drinking culture: it
           | wouldn't be the most surprising thing ever
        
           | shimylining wrote:
           | Why not? I am not working just to work, I am working to enjoy
           | my life with the people around me. I am also from Europe so
           | it might be different views, but just because I am on call
           | doesn't mean I'm going to stop living my life. Work doesn't
           | define me as a person. :)
        
             | pm90 wrote:
             | If I'm on call (OC), I'm responsible for the uptime of the
              | system even after hours. So if I'm planning on going
             | hiking, I will inform the secondary OC, or delay plans to a
             | weekend when I'm not OC. Generally I do tend to avoid
             | getting heavily inebriated (although of course there are
             | times when this is unavoidable).
             | 
             | I'm not judging, but just pointing out that I've certainly
             | experienced a different OC culture in the US.
        
         | burnished wrote:
         | You're mixed up, they're drilling. You drill so that when an
         | emergency happens and it's 4:30am and you're bleary eyed your
         | hands already know what to do (and are doing it) before your
         | eyes even open all the way.
        
           | candiddevmike wrote:
           | This sounds like a great way to wipe a database accidentally
           | (like GitLab). The worst thing you can do to help fix a
           | problem is having people asleep at the wheel.
        
             | pm90 wrote:
             | Noted, but the point of the drill is precisely to uncover
             | these failure modes and attempt to fix them. e.g. you might
             | have automated runbooks to fix the problem rather than
             | access the DB directly. You might have frequent backups and
             | processes to easily restore from backups in case of
             | database wipes.
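              | 
              | (A minimal sketch, with made-up names, of what such a
              | guarded runbook step might look like: the action is
              | reviewed in advance and refuses to run without a recent
              | backup, instead of the oncall typing ad-hoc commands into
              | the database at 4:30am.)
              | 
              |     # Hypothetical runbook-step wrapper; the policy and
              |     # commands are illustrative only.
              |     import datetime
              |     import subprocess
              | 
              |     MAX_BACKUP_AGE = datetime.timedelta(hours=6)
              | 
              |     def last_backup_age():
              |         # Placeholder: query your backup system here.
              |         return datetime.timedelta(hours=2)
              | 
              |     def run_step(name, command):
              |         if last_backup_age() > MAX_BACKUP_AGE:
              |             raise RuntimeError(
              |                 f"refusing to run {name!r}: no recent backup")
              |         print(f"[runbook] {name}: {' '.join(command)}")
              |         subprocess.run(command, check=True)
              | 
              |     run_step("restart stuck workers",
              |              ["systemctl", "restart", "app-workers"])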
        
       ___________________________________________________________________
       (page generated 2021-08-09 23:00 UTC)