[HN Gopher] Post-incident review on the Atlassian April 2022 outage
___________________________________________________________________
Post-incident review on the Atlassian April 2022 outage
Author : johnmoon
Score  : 104 points
Date   : 2022-04-29 20:53 UTC (1 day ago)
(HTM) web link (www.atlassian.com)
(TXT) w3m dump (www.atlassian.com)

  | andrewstuart wrote:
  | For anyone wanting to attack:
  |
  | * hindsight is 20/20
  |
  | * modern software systems are very complex
  |
  | * have you never made a mistake?

  | BOOSTERHIDROGEN wrote:
  | Which app was used to create that timeline?

  | Andugal wrote:
  | If I understand correctly, since they deleted only a small subset of all their customers, they could not restore a clean backup for those customers without losing data from the other customers not impacted by the outage.
  |
  | So how would one have a clean "partial backup" strategy if something similar were to happen in one's own company?

  | dopylitty wrote:
  | I can only imagine how the person who pulled the trigger on the deletion script felt the moment they realized what had happened.
  |
  | I've been there with much less significant incidents, when a "routine" change turned into a potentially resume-generating event. It's not fun.
  |
  | Ultimately the responsibility lies with the organization that made it possible for an event of that scale to happen, rather than with the individual who happened to trigger it, but that doesn't make it feel any better.

  | llamaLord wrote:
  | As someone who observed this particular incident from the inside (holy shit-balls have the last 3 weeks not been fun), one of the few positive elements of it has been the universal and effectively instinctual agreement internally that it was a massive screw-up in the system that we all have to own, rather than one or several individual screw-ups that need to be put at the feet of individuals.

  | trebligdivad wrote:
  | Well, the one constant of IT is 'shit happens'; mark this one down as something interesting you've seen.

  | SnowHill9902 wrote:
  | That moment when you see DELETE 18388272773 0

  | Kwpolska wrote:
  | Thomas J. Watson famously said:
  |
  | _Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?_

  | originalvichy wrote:
  | I think in this instance it's the person who sent the IDs who feels worse. The people doing the deletion were provided a list in which 30 IDs were the correct app IDs and the rest were site IDs.
  |
  | As they mentioned, the delete script worked for all types of unique IDs, so that hopefully also dilutes the feeling of "it's all my fault".

  | baskethead wrote:
  | Oooof. Passing in Application IDs will delete applications and passing in site IDs will delete sites. That's a really, really bad design. I'm bookmarking this so that I can use it as a showcase going forward.
  |
  | Just this week, I changed a spec in one of our proposed endpoints that did exactly that. We passed in IDs of various types of objects to perform actions, and I changed the API so that it is forced to pass in a struct that contains an object type and an object ID. Explicitness is so much safer in the long run, especially in enterprise apps. (A rough sketch of that kind of payload appears below.)

  | ilayn wrote:
  | I am very curious whether they used a Jira board for issue tracking during this crisis, because then they would have more than 4 lessons learned.

  | civilized wrote:
  | Good thing they acquired Trello.
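A minimal sketch of the explicit-payload idea baskethead describes above, written in TypeScript. The endpoint, type names, and IDs here are illustrative assumptions for this thread, not Atlassian's actual API:

    // Illustrative sketch only: the types and names are assumptions, not
    // Atlassian's actual API.
    // Each delete request must name the kind of resource alongside its ID,
    // so a bare list of IDs can never be silently reinterpreted.
    type DeleteTarget =
      | { kind: "app"; appId: string }
      | { kind: "site"; siteId: string };

    function requestDeletion(target: DeleteTarget): void {
      switch (target.kind) {
        case "app":
          console.log(`queueing deletion of app ${target.appId}`);
          break;
        case "site":
          // Whole-site deletion is the dangerous path; because the type is
          // explicit in the payload, it can be routed through extra approval.
          console.log(`queueing deletion of SITE ${target.siteId} (needs approval)`);
          break;
      }
    }

    // Callers are forced to say what they are deleting:
    requestDeletion({ kind: "app", appId: "app-123" });
    requestDeletion({ kind: "site", siteId: "site-456" });

With a payload shaped like this, a list of site IDs pasted where app IDs were expected fails the type check instead of deleting the wrong thing.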
  | codeflo wrote:
  | What you're basically suggesting is that feature development at Atlassian moves at such a glacial speed because, of course, they're using Jira to manage it. This kind of blows my mind right now.

  | duxup wrote:
  | > The script that was executed followed our standard peer-review process that focused on which endpoint was being called and how. It did not cross-check the provided cloud site IDs to validate whether they referred to the Insight App or to the entire site, and the problem was that the script contained the ID for a customer's entire site.
  |
  | Yup, that deletes something... anyway...
  |
  | > Establish universal "soft deletes" across all systems.
  |
  | It's just easier that way to observe what might happen.

  | pokoleo wrote:
  | Soft deletion always feels at odds with privacy-related "right to have data deleted" laws.
  |
  | Would be super interested in a technical writeup on how they do this.

  | Jcowell wrote:
  | It shouldn't be. These laws at least have the nuance to understand that data can't be immediately deleted from backups, and that in instances where deletion is complicated, the customer is notified.

  | h1fra wrote:
  | "Right to have data deleted" can be 'circumvented' if the data is a critical part of the system or is needed for legal purposes (for example, it can be mandatory to keep 1 year of IP logs and the data associated with them).
  |
  | In previous companies I have worked for, we did an instant soft delete, then hard anonymisation after 15-30 days, and then a hard delete after a year. That meant the data was not recoverable for the customer but could still be recovered for legal purposes.

  | baskethead wrote:
  | There's a time window within which you need to permanently delete the data. A soft delete allows you to delete the data quickly and see what happens. If everything is okay, you can then purge your database of all soft-deleted data.

  | jasonwatkinspdx wrote:
  | IANAL, but the laws have carve-outs for backup retention, etc.
  |
  | A simple technical solution is to store all data with per-user encryption keys, and then just delete the key. This obviously doesn't let you prove to anyone else that you've deleted all copies of the key, but you can use it as a way to have higher confidence that you don't inadvertently leak it.

  | notreallyserio wrote:
  | Ideally they'd encrypt the customer content with a key provided by the customer and destroyed when the customer requests account deletion. The customer would still be able to use their key to decrypt backups that they got prior to the request. If the customer changes their mind, they just upload the key again (along with the backup, if necessary).
  |
  | Of course, this means trusting Atlassian to actually delete the key on request, but there's not much reason for them not to.

  | rob_c wrote:
  | Seriously, where were the -24hr backups that could be rolled back to once it's clear the script is fubar, or why not run it on just 10% of the estate first? ...

  | largbae wrote:
  | It is never that simple. Say the backup existed, and was global. By the time you get everyone briefed on how fubar it is and get agreement to load the backups, there are hours of changes from the unaffected customers that will be wiped by the restore, or have to be reconciled by hand for months. Sure, you can concoct the perfect antidote with hindsight, but their retro and next steps are sound.

  | rob_c wrote:
  | Build it properly.
  | By definition of following best practice, it is, or always should be, not far from being possible to do this. If it's not, someone is to blame.

  | kwertyoowiyop wrote:
  | That lesson really stuck out for me also. My definition of "restore" has been too simplistic.

  | baskethead wrote:
  | What it means is that they never tested their disaster recovery system, because this would have been found right away. Or, someone would have reported it and an upper-level exec would have signed off on it being okay to take 14 days to restore a small subset of users.

  | largbae wrote:
  | Again, not that simple. The customer restore procedure was almost certainly tested (and in active use, as customers blow up their own data often enough). It was _not_ tested on 800 customer stacks simultaneously, as that was considered a sitewide disaster by whoever dreamed up the failure modes to test for. Meanwhile, the actual whole-site disaster restore plan may or may not have been tested, but it was useless for this case, since some customers were unaffected and would be damaged by the whole-site plan.

  | [deleted]

  | joering2 wrote:
  | I wish they could assure us the engineer who pressed the button wasn't fired.

  | devjam wrote:
  | > Atlassian is proud of our incident management process which emphasizes that a blameless culture and a focus on improving our technical systems ...

  | seanwilson wrote:
  | Do any databases have something like native support for soft deletes or the ability to undo (other than SQL transaction rollbacks, where you have to specify the undo checkpoint)? Something like what Git does, where it keeps a history of edits? If this isn't common, is this a neglected area that should be addressed, or is it just too hard of a problem? It feels like with SQL there are minimal guardrails and it's just your own fault if you're not extra careful, compared to, say, using Git with code or "restore from trash" with filesystems.

  | VTimofeenko wrote:
  | Snowflake has time travel[1], keeping the original data for a specified period of time.
  |
  | 1: https://docs.snowflake.com/en/user-guide/data-availability.h...

  | jordanthoms wrote:
  | Quite a few databases support time travel queries; in particular, Oracle has had them for years and CockroachDB has them also. We can query the state of a table as it was at any point in the last 72 hours.

  | tyre wrote:
  | Datomic treats all data as immutable, so it can wind back to any version.
  |
  | When new data is written, the entire block is copied and rewritten rather than changing the data in-place.

  | mdavidn wrote:
  | This is similar to SQL checkpoints in that a rollback is all or nothing. It wouldn't be segmented by tenant unless each tenant has its own transactors.

  | candiddevmike wrote:
  | WAL archiving / point-in-time recovery can help with this.

  | wolf550e wrote:
  | In MVCC systems like PostgreSQL, if you don't vacuum (garbage-collect the old tuples), your database is append-only and you can query as if your transaction had started at some time in the past. I don't know how to set auto-vacuum to have a fixed delay, e.g. keeping 24h of changes, but I bet it can be added if it's not built-in.

  | jasonwatkinspdx wrote:
  | So, the fully fleshed-out form of this in databases is usually called bitemporality. The official SQL standard has included this for some years now, but it's not widely implemented by databases.
  |
  | An intuitive way to think of bitemporality is that it's like MVCC, but with 4 timestamps per row version.
  | One pair describes a range of time in "outside" or "valid" time, i.e. whatever semantic domain the database is modeling; the other pair describes a range of "system" time, which is when this record was present in the database. This lets you capture and reason about the distinction between when a fact the database models was true in the real world, and when the database was updated to reflect this fact (some people call this "as of" vs "as at"... the terms here aren't fully settled, but that's the basic distinction). So you can revise history, do complex time travel queries, all sorts of stuff. It's a very useful model that directly aligns with the sort of questions businesses need to answer in the context of a court case, or when revising their ground source of truth due to a past bug or error.
  |
  | The downside is that your database balloons with row versions, and many queries become far more complicated, perhaps needing additional joins, etc. Also, from the perspective of database implementors, there's a ton more complexity in the code. So that's why it's not widely supported despite the standard.
  |
  | There's also a niche of databases built around this model from the ground up, usually based on Datalog instead of SQL. There's also overlapping work with RDF and Semantic Web thinking (as awry as all that went).
  |
  | In practice, most organizations address this operationally, by keeping generational and incremental backups that let them restore previous database states as needed. Though as the original post we're here for proves, that kind of operational solution can bite back hard when it goes wrong.

  | justinludwig wrote:
  | I can't say that I've ever been a fan of Atlassian or their products, but this blog post makes it sound like they've at least learned the right lessons from this:
  |
  | 1. Establish universal "soft deletes" across all systems.
  |
  | 2. Better DR for multi-site, multi-product incidents.
  |
  | 3. Fix their incident-management process for large-scale incidents.
  |
  | 4. Fix their incident communications.
  |
  | Regarding #4 in particular:
  |
  | "Rather than wait until we had a full picture, we should have been transparent about what we did know and what we didn't know. Providing general restoration estimates (even if directional) and being clear about when we expected to have a more complete picture would have allowed our customers to better plan around the incident....
  |
  | [In the future], we will acknowledge incidents early, through multiple channels. We will release public communications on incidents within hours. To better reach impacted customers, we will improve the backup of key contacts and retrofit support tooling to enable customers... [to make emergency] contact with our technical support team."

  | OrderlyTiamat wrote:
  | I am actually pleasantly surprised at the openness of this response, and their taking responsibility for mistakes and detailing what will change in the future. It's not just corporate speak. I think that speaks well for the company, and it improved my view of them.

  | [deleted]

  | perlgeek wrote:
  | I tend to agree.
  |
  | However, there's one point that makes me skeptical: there are no organizational changes, or changes to leadership, or anything in that direction.
  |
  | This sounds like "the tech guys screwed up, culture and management is fine here". Which it might be, or it might not.
  |
  | I would have loved to see
  | 5.
  | We will stop pushing customers so hard towards using our cloud
  | for one, but that wouldn't be convenient for Atlassian.

  | eastbound wrote:
  | Note that the ToS also forbid Cloud users from "disseminating information" about "the performance of the products".
  |
  | So you can't say it's slow or underperforming.

  | geerlingguy wrote:
  | I thought, in terms of Jira and Confluence at least, it was just accepted that being slow and underperforming was the status quo, and if it was running at a speed you'd consider normal, that's an exception (and cause for alarm... like "did that actually save, or is there a silent JS error not being displayed?").
___________________________________________________________________
(page generated 2022-04-30 23:00 UTC)