[HN Gopher] Fundamentals of Incident Management
___________________________________________________________________

Fundamentals of Incident Management

Author : bitfield
Score  : 109 points
Date   : 2021-08-09 16:38 UTC (6 hours ago)

(HTM) web link (bitfieldconsulting.com)
(TXT) w3m dump (bitfieldconsulting.com)

| blamestross wrote:
| I've done a LOT of incident management and I'm not happy about it. The biggest issue I have run into, other than burnout, is this:
|
| Thinking and reasoning under pressure are the enemy. Make as many decisions in advance as possible. Make flowcharts and decision trees with "decision criteria" already written down.
|
| If you have to figure something out or make a "decision", then things are really really bad. That happens sometimes, but when teams don't prep at all for incident management (pre-determined plans for common classes of problem), every incident is "really really bad".
|
| If I have a low-risk, low-cost action with low confidence of a high reward, I'm going to do it and just tell people it happened. Asking means I just lost a half-hour+ worth of money, and if I just did it and I was wrong we would have lost 2 minutes of money. When management asks me why I did that, I point at the doc I wrote that my coworkers reviewed and mostly forgot about.
|
| A really common example is "it looks like most of the errors are in datacenter X", so you fail out of the datacenter. Maybe it was sampling bias or some other issue and it doesn't help, maybe the problem follows the traffic, maybe it just suddenly makes things better. No matter what, we get signal. Establish well in advance of a situation what the common "solutions" to problems are, and if you are oncall and responding, then just DO them and document+communicate as you do.
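One way to read the "decision criteria already written down" idea above is as a small table of thresholds and pre-agreed actions that the on-call only has to evaluate, not invent. Below is a minimal sketch, assuming a hypothetical per-datacenter error-count feed from monitoring and a pre-agreed "drain the datacenter" action; the names and numbers are illustrative, not from the article or any particular tool.

    # Sketch only: error_counts would come from whatever monitoring you already
    # have, and "drain <dc>" stands in for your real failover runbook step.
    PREDECIDED = {
        "dc_error_share": 0.60,   # threshold agreed and documented in advance
        "min_total_errors": 500,  # don't act on noise
    }

    def choose_action(error_counts):
        """Return a pre-agreed action, or None if no written criterion is met."""
        total = sum(error_counts.values())
        if total < PREDECIDED["min_total_errors"]:
            return None
        worst_dc, worst = max(error_counts.items(), key=lambda kv: kv[1])
        if worst / total >= PREDECIDED["dc_error_share"]:
            # Low-risk, reversible action: do it, then tell people it happened.
            return "drain " + worst_dc
        return None

    print(choose_action({"dc-east": 900, "dc-west": 40, "dc-north": 60}))  # drain dc-east

The value is not in the code itself but in the fact that the threshold and the action were reviewed in calm daylight, so acting on them at 3am is execution rather than deliberation.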
| spa3thyb wrote:
| There is a month and day, Feb 15, in the header, but no year. I can't figure out if that's ironic or apropos, since this story reads like a thriller from perhaps ten years ago, but the post date appears to have been 2020-02-15 - yikes.
| quartz wrote:
| Nice to see articles like this describing a company's incident response process and the positive approach to incident culture via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an incident management startup).
|
| Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned to the idea that incidents and outages aren't 100% preventable.
|
| It's a mistake to think of the incident management muscle as one you'd like exercised as little as possible, when in reality it's something that should be kept in top form, because doing so comes with all kinds of downstream value for the company (a positive culture towards resiliency, openness, team building, honesty about technical risk, etc.).
|
| Sadly this can be a difficult mindset to break out of, especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."
|
| Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve to have the same learning process (postmortem, followup action items, etc) associated with them as the outcomes of major incidents and game days.
|
| Hopefully this outdated attitude continues to die off.
|
| If you're just getting started with incident response or are interested in the space, I highly recommend:
|
| - For basic practices: Google's SRE chapters on incident management [2]
|
| - For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]
|
| [1] https://kintaba.com
|
| [2] https://sre.google/sre-book/managing-incidents/
|
| [3] https://www.amazon.com/Field-Guide-Understanding-Human-Error...
| athenot wrote:
| > Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve to have the same learning process (postmortem, followup action items, etc) associated with them as the outcomes of major incidents and game days.
|
| Zero recorded incidents is a vanity metric in many orgs, and yes, this loses many fantastic learning opportunities. The end result is that these learning opportunities eventually do happen, but with significant impact associated with them.
| pm90 wrote:
| > Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned to the idea that incidents and outages aren't 100% preventable.
|
| So. Much. This. Unless leaders were engineers in the past or have kept abreast of the evolution of technology, the default mindset is still "incidents should never happen" rather than "incidents will happen, how can we handle them better". This is especially pronounced in politics-heavy environments, since outages are seen as a professional failure, a way to score brownie points over the team that fails. As a result, you often have a culture that tries to avoid being responsible for outages at any cost, which (ironically) leads to worse overall quality of the system, since the root cause is never dealt with.
| denton-scratch wrote:
| It doesn't match my experience, with a real incident.
|
| I was a dev in a small web company (10 staff), moonlighting as sysadmin. Our webserver had 40 sites on it. It was hit by a not-very-clever zero-day exploit, and most of the websites were now running the attacker's scripts.
|
| It fell to me to sort it out - the rest of the crew were to keep on coding websites. The ISP had cut off the server's outbound email, because it was spewing spam. So I spent about an hour trying to find the malicious scripts, before I realised that I could never be certain that I'd found them all.
|
| You get an impulse to panic when you realise that the company's future (and your job) depends on you not screwing up, and you're facing a problem you've never faced before.
|
| So I commissioned a new machine, and configured it. I started moving sites across from the old machine to the new one. After about three sites, I decided to script the moving work. Cool.
|
| But the sites weren't all the same - some were Drupal (different versions), some were Wordpress, some were custom PHP. The script worked for about 30 of the sites, with a lot of per-site manual tinkering.
|
| Note that for the most part, the sites weren't under revision control - there were backups in zip files, from various dates, for some of the sites. And I'd never worked on most of those sites, each of which had its own quirks. So I spent the next week making every site deploy correctly from the RCS.
| I then spent about a week getting this automated, so that in a future incident we could get running again quickly. Happily we had a generously-configured Xen server, and I could test the process on VMs.
|
| My colleagues weren't allowed to help out; they were supposed to go on making websites. And I got resistance from my boss, demanding status updates ("are we there yet?").
|
| The happy outcome is that that work became the kernel of a proper CI pipeline, and provoked a fairly deep change in the way the company worked. And by the end, I knew all about every site the company hosted.
|
| We were just a web-shop; most web-shops are (or were) like this. If I was doing routine sysadmin, instead of coding websites, I was watched like a hawk to make sure I wasn't doing anything 'unnecessary'.
|
| This incident gave me the authority to do the sysadmin job properly; and in fact it saved me a lot of sysadmin time - because previously, if a dev wanted a new version of a site deployed, I had to interrupt whatever I was doing to deploy it. With the CI pipeline, provided the site had passed some testing and review stage, it could be deployed to production by the dev himself.
|
| It would have been cool to be able to do recovery drills, rotating roles and so on; but it was enough for my bosses that more than one person knew how to rebuild the server from scratch, and that it could be done in 30 minutes.
|
| Life in a small web-shop could get exciting, occasionally.
| pm90 wrote:
| It sounds like you're working in a different environment than the author. The environment they describe involves an ops _team_ rather than an ops _individual_ (what you've described). If you had to work with a team to resolve the incident, and had to do so on a fairly regular cadence, processes like this would likely be more useful.
| denton-scratch wrote:
| I have worked in a properly-organised devops environment (same number of staff, totally different culture).
|
| Anyway, I was just telling a story about a different kind of "incident response".
| rachelbythebay wrote:
| I may never understand why some places are all about assigning titles and roles in this kind of thing. You need one, maybe two, plus a whole whack of technical skills from everyone else.
|
| Also, conference calls are death.
| dilyevsky wrote:
| I find the Comms Lead role to be super useful because I don't want to be bogged down replying to customers in the middle of the incident + probably don't even have all the context/access. Everything else except ICM seems like a waste of time to me, especially Recorder.
| mimir wrote:
| It sort of baffles me how much engineer time is seemingly spent here designing and running these "gamedays" vs just improving and automating the underlying systems. Don't glorify getting paged, glorify systems that can automatically heal themselves.
|
| I spend a good amount of time doing incident management and reliability work.
|
| Red team/blue team gamedays seem like a waste of time. Either you are so early on your reliability journey that trivial things like "does my database failover" are interesting things to test (in which case just fix it), or you're a more experienced team and there's little low-hanging reliability fruit left. In the latter case, gamedays seem unlikely to closely mimic a real-world incident. Since the low-hanging fruit is gone, all your serious incidents tend to be complex failure interactions between various system components.
| To resolve them quickly, you simply want all the people with deep context on those systems quickly coming up with and testing out competing hypotheses about what might be wrong. Incident management only really matters in the sense that you want to allow the people with the most system context to focus on fixing the actual system. Serious incident management really only comes into play when the issue is large enough to threaten the company and require coordinated work from many orgs/teams.
|
| My team and I spend most of our time thinking about how we can automate any repetitive tasks or failover. In the case something can't be automated, we think about how we can increase the observability of the system, so that future issues can be resolved faster.
| krisoft wrote:
| > which captures all the key log files and status information from the ailing machine.
|
| Machine? As in, a single machine goes down and you wake up 5 people? That just sounds like bad planning.
|
| > Pearson is spinning up a new cloud server, and Rawlings checks the documentation and procedures for migrating websites, getting everything ready to run so that not even a second is wasted.
|
| Heroic. But in reality you have already wasted minutes. Why is this not all automated?
|
| I understand that this is a simulated scenario. Maybe the situation was simplified for clarity, but really, if a single machine going down leads to this amount of heroics then you should work on those fundamentals. In my opinion.
| RationPhantoms wrote:
| Not only that, but they appear to be okay with the fact that a single ISP has knocked them offline. If I was a customer of theirs and found out, I would probably change providers.
| LambdaComplex wrote:
| Agreed.
|
| While reading this, I was thinking "This is so important that you'll wake all these people up in the middle of the night, but you only have a single ISP? No backup ISP with automated failover?"
| commiefornian wrote:
| They skipped over a few steps of ICS. ICS starts with a single person playing all roles.
|
| It prescribes a way to scale the team up and down in ways that streamline communication, so everyone knows their role, nothing gets lost when people come in and out of the system, and you don't have all-hands conference calls, multiple people telling the customers multiple things, or multiple people asking for status updates from each person.
| gengelbro wrote:
| The 'cool tactical' framing this article attempts to convey is not inspiring to me.
|
| I've worked as an oncall for a fundamental backbone service of the internet in the past and been paged into middle-of-the-night outages. It's harrowing and exhausting. Cool names like 'incident commander' do not change this.
|
| We also had a "see ya in the morning" culture. Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture.
| choeger wrote:
| It seems to be a bit of a cargo cult, to be honest. They seem to take inspiration from ER teams or the military.
|
| I think that this kind of drill helps a lot for cases where you can take a pre-planned route, like deploying that backup server or rerouting traffic. But the obvious question then is: why not automate _that_ as well?
|
| When it comes to diagnosis or, worse, triage, in my experience you want independent free agents looking at the system all at once. You don't want a warroom-like atmosphere with a single screen, but rather n+1 hackers focusing on what their first intuition tells them is the root cause. In a second step you want these hackers to convene and discuss their root-cause hypotheses. If necessary, you want them to run experiments to confirm these hypotheses. And _then_ you decide the appropriate reaction.
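Taking choeger's question above literally: the pre-planned route (deploy the backup server, reroute traffic) can often be wrapped in a single guarded entry point, so the drill and the real response exercise the same code path. A rough sketch under assumptions, with health_check and reroute as hypothetical stand-ins for whatever the actual stack provides (load balancer API, DNS weights, etc.), not any real library:

    import logging

    log = logging.getLogger("runbook")

    def failover_if_unhealthy(primary, backup, health_check, reroute, dry_run=True):
        """Pre-planned route: if the primary fails its health check, move
        traffic to the backup. In dry-run mode, only log what would be done."""
        if health_check(primary):
            log.info("primary %s healthy, nothing to do", primary)
            return False
        log.warning("primary %s unhealthy, failing over to %s", primary, backup)
        if dry_run:
            log.info("dry run: would reroute traffic %s -> %s", primary, backup)
            return True
        reroute(primary, backup)
        return True

Run regularly with dry_run=True, a step like this doubles as much of a gameday: the rehearsal and the 4:30am response share one code path, which is roughly what "automate that as well" would mean in practice.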
| joshuamorton wrote:
| I agree. I think this particular framing gets things slightly wrong. You want parallelism, but you still need central organization (so that you can have clear delegation) and delegation of work to various researchers. For a complex incident, I've seen 5+ subteams researching various threads of the incident. But, importantly, before any of those subteams take any action, they report to the IC so that two groups don't accidentally take actions that might be good in isolation but are harmful when combined.
| 1123581321 wrote:
| My experience is there's little conflict between a central conference call or room and multiple independent investigators, since those investigators need to present and compare their findings _somewhere_. It would indeed be a mistake to demand everyone look at one high-level view, though. Based on the organization depicted in the article, this would be the "researcher" role, split among multiple people.
| dvtrn wrote:
| _They seem to take inspiration from ER teams or the military_
|
| It's probably nothing but overestimation, but I feel like I'm seeing more of this later in my career than I did early on. Or maybe I'm paying more attention?
|
| Whatever it is: past experience (which includes coming from a military family in the states) has taught me to avoid companies that crib unnecessary amounts of jargon, lingo and colloquialisms from the military.
|
| Curious if others have noticed this or feel the same, and what your experiences have been that led you to feel similarly?
| pinko wrote:
| Agree completely. It's a strong signal that someone has a military cosplay fetish (which very few people with experience in the actual military do), which in turn tends to come along with other dysfunctional traits. It's a warning for me that the person is not likely to be a good vendor, customer, or collaborator.
| pjc50 wrote:
| Yup. It's misplaced machismo, with all that implies.
| dvtrn wrote:
| My favorite one was when a superior was explaining a plan to right-size some new machines as we slowly migrated customers onto the appliance, along with some particularly aggravating issues we were having with memory consumption that, upon inspection and a lot of time spent, made no real sense to us as to why it was occurring the way it was.
|
| "dvtrn you are to take the flank and breach this issue with Paul"
|
| And this ran all the way up to the top of the org. Senior leaders were _constantly_ quoting that Jocko Willink fella. It was...something.
|
| My old man (a former Drill Instructor, which made for an interesting childhood) found it utterly hilarious when I'd call him up randomly with the latest phrase of the day, uttered by some director or another. To my knowledge, and I sure-damn asked, the only affinity anyone on the executive team had with the military was two of them having buddies who served.
| igetspam wrote:
| I don't know how old you are, but my career now exceeds two decades. I definitely see this more now, but that's because I institute it. Earlier in my career, we failed at incident management and at ownership.
| We now share the burden of on-call not just with the operators (sysadmins of old) but also with the people who wrote the code. We've spent a lot of time building better models based on proven methods; quite a few come from work done in high-intensity roles paid for by tax dollars: risk analysis, disaster recovery, firefighting, command and control, incident management, war games, red teams.
| dvtrn wrote:
| You've got a couple of years on me; I've been in the game a little over 13 years now.
|
| I support the notion that there's a strong difference between lingo that's properly applied to the situation and lingo that is recklessly applied because it "sounds cool".
|
| The examples you gave seem to be fair game for the work being done, in the interest of brief, specific language; the examples I gave in another comment ("flanking", "breaching"), however, are just grating and...weird to use in a work environment.
|
| Your point is nevertheless well taken.
| athenot wrote:
| Yes, it's a map-reduce algorithm. Multiple people check multiple areas of the system in parallel, and then both evidence & rule-outs start to emerge.
| totetsu wrote:
| The worst part is hearing from your manager the next day that the NOC operator complained about your rude tone of voice when waking up and answering the phone at 3am.
| dmuth wrote:
| I don't disagree with your post, but one thing I want to mention is the origin of the term "Incident Commander": it doesn't exist to be cool, but rather derives from how FEMA handles disasters. I suspect its usage in IT became a thing because it was already used in real life, and it made more sense than creating a new term.
|
| If you have two hours, you can take the training that describes the nomenclature behind the Incident Command System, and why it became a thing:
|
| https://training.fema.gov/is/courseoverview.aspx?code=is-100...
|
| This online training takes about 2 hours and is open to the general public. I took it on a Saturday afternoon some years ago and it gave me useful context for why certain things are standardized.
| [deleted]
| jvreagan wrote:
| > Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture
|
| For over a decade I've led teams that have oncall duties. One principle we have lived by is that if you are paged outside of hours, you take off the time you need to not hold a grudge against the team/company. Some people don't need it, some people take a day off, some people sleep in, some people cash in on their next vacation. To each their own according to their needs. Seems to work well.
|
| We also swap out oncall in real time if, say, someone gets paged a couple of nights in a row.
| athenot wrote:
| Yup, this is important, or people will burn out real quick. And when there are major incidents, as IC it's especially important to dismiss people from bridges as early as possible when I know I'm going to need them sooner rather than later the following day. Or swap with a more junior person so that the senior one is nice and fresh for when the next wave is anticipated.
| [deleted]
| Sebguer wrote:
| Yeah, the tone of this article is really odd, and like, the bulk of the content is just a narrativization of the incident roles in the Google SRE book. The only 'trick' is running game days?
| lemax wrote:
| These incident role names are fairly common in product companies these days.
| I guess you are correct that they do suggest a certain culture around incidents, but in my experience it's definitely a good thing. It's a "don't blame people, let's focus on the root cause, get things back up, and figure out how to prevent this next time" sort of thing. People try to meet SLAs and they treat each other like humans. We focus on improving processes/frameworks over blaming individual people. And yup, I think this comes along with "incident yesterday was intense, I'm gonna catch up on sleep".
| pm90 wrote:
| I agree with your comment. These names are just ways for teams to delegate different responsibilities w.r.t. incident management quickly and in a way that's understood by everyone. Having concrete names for such roles is both a good thing (everyone knows who can make the call for hard decisions) and helps you talk intelligently about the evolution of such roles, e.g. "our Incident Commanders used to spend 15% of their time in p0 incidents, but that has reduced to 10% due to improvements in rollout procedures/runbooks/etc."
| zippergz wrote:
| The concepts and terms of incident command are not from the military (or the ER, as another poster suggested). They come from the fire service and emergency management in general. I don't know if that changes people's perceptions, and I agree that no amount of terminology changes how exhausting being on call is. But if people are reacting negatively to "military" connotations, I think that is unwarranted.
|
| https://en.wikipedia.org/wiki/Incident_Command_System
| nucleardog wrote:
| I think actually learning what the ICS is _for_ might help people understand a bit better why it's not necessarily just "unnecessary tacticool". It's not just a bunch of important-sounding names for things.
|
| ICS, at its core, is a system for helping people self-organize into an effective organization in the face of quickly changing circumstances and emergent problems.
|
| Some simple rules are things like:
|
| * The most senior/qualified person on-site is generally in charge. (How you determine that kinda varies depending on the organization.)
|
| * Positions are only created when required. You don't assign people roles unless there's a need for that role.
|
| * Positions are split and responsibilities delegated as the span of control increases beyond a set point.
|
| * Control should stay as local to the problem as it realistically can while still solving the problem.
|
| From there, it goes on to standardize a template hierarchy and defines things like specific colours associated with specific roles, so that as roles change and chaos ensues, people can continue to operate effectively and in an organized manner. In person, this means things like the commander/executive roles running around in red vests with their role on the back. If the role changes hands, so does the vest.
|
| Some of the roles in that template organization are things like:
|
| * The "Public Information Officer", who is responsible for preparing and communicating information to the public. This makes a single person responsible for ensuring that conflicting or confusing messaging is not making its way out.
|
| * A "Liaison Officer", who is responsible for coordinating with other organizations. This provides another central point of coordination for requests flowing outside of your response.
| I think we could all imagine how this starts to become valuable in, say, a building-collapse scenario with police, fire, EMS, the gas company, search and rescue, emergency social services, etc. all on scene.
|
| In an IT context, what this means is that, generally, the most senior person online is going to be in charge of receiving reports from people and directing them. If there aren't many people around, they'd generally be pitching in to help as well.
|
| As more people show up and the communication and coordination overhead increases, they step out of doing any specific technical work. If enough show up, they may then delegate people out as leaders of specific teams tasked with specific goals (they may also just tell them they're not needed and send them to wait on standby).
|
| All roles, including the "Public Information" and "Liaison" roles, fall to the Incident Commander unless delegated out. At some point, if the requests for reporting from management start interfering with their role as Incident Commander, they delegate that role out. If it turns out the incident is going to require heavy communication or coordination with a vendor, they may delegate out the Liaison role to someone else.
|
| ICS is probably largely unnecessary if your response never spans more than the number of people who can effectively communicate in a Google Meet call. But as you get more and more people involved, it contains a lot of valuable lessons learned through real-world experience in situations much more stressful and dangerous than we ever face, and those lessons help you effectively manage and coordinate the human resources responding to an incident.
|
| (Disclaimer: that's all basically from memory. The city sent me on an ICS course, an ICS-in-an-emergency-operations-centre course, and a few more courses a few years back as part of volunteering with an emergency communications group. It's probably 90% accurate.)
| raffraffraff wrote:
| The EU working time act means "I'll see you in the afternoon", whether they like it or not.
| tetha wrote:
| > We also had a "see ya in the morning" culture. Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture.
|
| German labor laws forbid employees from working 10-13 hours - a long on-call situation on top of a normal work day - just like that. Add in time compensation, and a bad on-call situation at night easily ends up as the next day off, paid.
|
| I've found this to take a lot of the edge off on-call. Sure, it /sucks/ to get called at 1am and do stuff until 3, but that's a day to sleep in and recover. Maybe hop on a call if the team needs input, but that's optional.
| ipaddr wrote:
| So they are testing against fully awake people at 2:30pm and expecting similar results at 4:30am after heavy drinking.
| x3n0ph3n3 wrote:
| Do you really drink heavily when you are on-call?
| LambdaComplex wrote:
| Based on my understanding of the UK's drinking culture: it wouldn't be the most surprising thing ever.
| shimylining wrote:
| Why not? I am not working just to work, I am working to enjoy my life with the people around me. I am also from Europe, so views might differ, but just because I am on call doesn't mean I'm going to stop living my life. Work doesn't define me as a person. :)
| pm90 wrote:
| If I'm on call (OC), I'm responsible for the uptime of the system even after hours.
| So if I'm planning on going hiking, I will inform the secondary OC, or delay plans to a weekend when I'm not OC. Generally I do tend to avoid getting heavily inebriated (although of course there are times when this is unavoidable).
|
| I'm not judging, but just pointing out that I've certainly experienced a different OC culture in the US.
| burnished wrote:
| You're mixed up; they're drilling. You drill so that when an emergency happens and it's 4:30am and you're bleary-eyed, your hands already know what to do (and are doing it) before your eyes even open all the way.
| candiddevmike wrote:
| This sounds like a great way to wipe a database accidentally (like GitLab). The worst thing you can do to help fix a problem is to have people asleep at the wheel.
| pm90 wrote:
| Noted, but the point of the drill is precisely to uncover these failure modes and attempt to fix them. E.g. you might have automated runbooks to fix the problem rather than accessing the DB directly. You might have frequent backups and processes to easily restore from backups in case of database wipes.
___________________________________________________________________
(page generated 2021-08-09 23:00 UTC)