[HN Gopher] Post-incident review on the Atlassian April 2022 outage
       ___________________________________________________________________
        
       Post-incident review on the Atlassian April 2022 outage
        
       Author : johnmoon
       Score  : 104 points
       Date   : 2022-04-29 20:53 UTC (1 days ago)
        
 (HTM) web link (www.atlassian.com)
 (TXT) w3m dump (www.atlassian.com)
        
       | andrewstuart wrote:
       | For anyone wanting to attack:
       | 
       | * hindsight is 20/20
       | 
       | * modern software systems are very complex
       | 
       | * have you never made a mistake?
        
       | BOOSTERHIDROGEN wrote:
       | What app was used to create that timeline?
        
       | Andugal wrote:
       | If I understand correctly, since they deleted only a small subset
       | of all their customers, they could not restore a clean backup of
       | those customers without losing data from other customers not
       | impacted by the outage.
       | 
       | So how would one have a clean "partial backup" strategy if
       | something similar happened at one's own company?
        
       | dopylitty wrote:
       | I can only imagine the way the person who pulled the trigger on
       | the deletion script felt the moment they realized what had
       | happened.
       | 
       | I've been there with much less significant incidents when a
       | "routine" change turned into a potentially resume generating
       | event. It's not fun.
       | 
       | Ultimately the responsibility is with the organization that made
       | it possible for an event of that scale to happen rather than the
       | individual person who happened to trigger it but that doesn't
       | make it feel any better.
        
         | llamaLord wrote:
         | As someone who observed this particular incident from the
         | inside (holy shit-balls have the last 3 weeks not been fun),
         | one of the few positive elements of it has been the universal
         | and effectively instinctual agreement internally that it was a
         | massive screw-up in the system that we all have to own, rather
         | than one or several individual screw-ups that need to be put at
         | the feet of individuals.
        
           | trebligdivad wrote:
           | Well, the one constant of IT is 'shit happens'; mark this one
           | down as something interesting you've seen.
        
         | SnowHill9902 wrote:
         | That moment when you see DELETE 18388272773 0
        
         | Kwpolska wrote:
         | Thomas J. Watson famously said:
         | 
         |  _Recently, I was asked if I was going to fire an employee who
         | made a mistake that cost the company $600,000. No, I replied, I
         | just spent $600,000 training him. Why would I want somebody to
         | hire his experience?_
        
         | originalvichy wrote:
         | I think in this instance it's the person who sent the IDs that
         | feels worse. The deleters were provided IDs that had 30 correct
         | app IDs and the rest were site IDs.
         | 
         | Like they mentioned they had a delete script that worked for
         | all types of unique IDs so that also can dilute the feeling of
         | "it's all my fault" hopefully.
        
       | baskethead wrote:
       | Oooof. Passing in Application IDs will delete applications and
       | passing in site IDs will delete sites. That's a really really bad
       | design. I'm bookmarking this so that I can use it as a showcase
       | going forward.
       | 
       | Just this week, I changed a spec in one of our proposed endpoints
       | that did exactly that. We passed in ids of various types of
       | objects to perform actions, and I changed the api so that it
       | would be forced to pass in a struct that contained an object type
       | and object id. Explicitness is so much safer in the long run,
       | especially in enterprise apps.
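The "object type plus object id" idea above can be sketched as follows. This is only an illustration, not Atlassian's API: `ObjectType`, `DeleteRequest`, `REGISTRY`, and `delete` are hypothetical names, and the registry stands in for whatever lookup would tell you what an ID really refers to.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectType(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class DeleteRequest:
    """The caller must state what kind of object it believes the ID names."""
    object_type: ObjectType
    object_id: str


# Hypothetical registry mapping each known ID to the kind of object it names.
REGISTRY = {
    "app-1": ObjectType.APP,
    "site-1": ObjectType.SITE,
}


def delete(request: DeleteRequest, registry: dict) -> str:
    """Delete only when the declared type matches what the ID refers to."""
    actual = registry.get(request.object_id)
    if actual is None:
        raise KeyError(f"unknown id: {request.object_id}")
    if actual is not request.object_type:
        # A site ID passed in as an "app" is rejected instead of deleted.
        raise ValueError(
            f"{request.object_id} is a {actual.value}, "
            f"not a {request.object_type.value}"
        )
    del registry[request.object_id]
    return f"deleted {actual.value} {request.object_id}"
```

With this shape, a list that mixes app IDs and site IDs fails on the first mismatch rather than deleting whatever each ID happens to resolve to.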
        
       | ilayn wrote:
       | I am very curious if they used a Jira board during this crisis
       | for issue tracking. Because then they would have more than 4
       | lessons learned.
        
         | civilized wrote:
         | Good thing they acquired Trello.
        
         | codeflo wrote:
         | What you're basically suggesting is that feature development at
         | Atlassian moves at such a glacial speed because of course
         | they're using Jira to manage it. This kind of blows my mind
         | right now.
        
       | duxup wrote:
       | >The script that was executed followed our standard peer-review
       | process that focused on which endpoint was being called and how.
       | It did not cross-check the provided cloud site IDs to validate
       | whether they referred to the Insight App or to the entire site,
       | and the problem was that the script contained the ID for a
       | customer's entire site.
       | 
       | Yup that deletes something... anyway...
       | 
       | > Establish universal "soft deletes" across all systems.
       | 
       | It's just easier that way to observe what might happen.
        
       | pokoleo wrote:
       | Soft deletion always feels at odds with privacy-related "right to
       | have data deleted" laws.
       | 
       | Would be super interested in a technical writeup on how they do
       | this.
        
         | Jcowell wrote:
          | It shouldn't be. These laws at least have the nuance to
          | understand that data can't be immediately deleted from backups,
          | and that in instances where deletes are complicated the
          | customer is notified.
        
         | h1fra wrote:
          | "Right to have data deleted" can be 'circumvented' if the data
          | is a critical part of the system or is needed for legal
          | purposes (for example, it can be mandatory to keep 1 year of IP
          | logs and the data associated with them).
         | 
          | In previous companies I have worked for, we did an instant
          | soft delete, then hard anonymisation after 15-30 days, and then
          | a hard delete after a year. That means the data was not
          | recoverable by the customer but could still be recovered for
          | legal purposes.
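A minimal sketch of the staged lifecycle described above (soft delete, then anonymisation, then hard delete). The function name and the thresholds are assumptions for illustration: 30 days is taken as the upper end of the "15-30 days" mentioned, and 365 days stands in for "a year".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed thresholds, chosen to mirror the comment above.
ANONYMISE_AFTER = timedelta(days=30)
HARD_DELETE_AFTER = timedelta(days=365)


def lifecycle_stage(deleted_at: Optional[datetime], now: datetime) -> str:
    """Return the stage a record should be in, given when it was soft-deleted."""
    if deleted_at is None:
        return "active"
    age = now - deleted_at
    if age >= HARD_DELETE_AFTER:
        return "hard_deleted"   # row should be physically removed
    if age >= ANONYMISE_AFTER:
        return "anonymised"     # personal fields scrubbed, row kept
    return "soft_deleted"       # hidden from the customer, still restorable
```

A periodic job could call this for every row carrying a deletion timestamp and apply whichever transition is due.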
        
         | baskethead wrote:
         | There's a time period before which you need to permanently
         | delete the data. A soft delete will allow you to delete the
         | data quickly and you can see what happens. If everything is
         | okay you can then purge your database of all soft deleted data.
        
         | jasonwatkinspdx wrote:
         | IANAL but the laws have carve outs for backup retention, etc.
         | 
         | A simple technical solution is to store all data with per user
         | encryption keys, and then just delete the key. This obviously
         | doesn't let you prove to anyone else that you've deleted all
         | copies of the key, but you can use it as a way to have higher
         | confidence you don't inadvertently leak it.
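The "delete the key" approach above can be sketched with the standard library alone. Everything here is hypothetical (`ShreddableStore` and its methods), and the cipher is a toy: a SHAKE-256 keystream XORed with the data, with no authentication. A real system would use a vetted authenticated cipher and a proper key store.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream derived from key and nonce; NOT a production cipher.
    return hashlib.shake_256(key + nonce).digest(length)


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableStore:
    """Stores ciphertext per user; dropping a user's key makes it unreadable."""

    def __init__(self):
        self._keys = {}   # user_id -> 32-byte key
        self._blobs = {}  # user_id -> list of (nonce, ciphertext)

    def put(self, user_id: str, plaintext: bytes) -> None:
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        stream = _keystream(self._keys[user_id], nonce, len(plaintext))
        self._blobs.setdefault(user_id, []).append((nonce, _xor(plaintext, stream)))

    def get_all(self, user_id: str) -> list:
        key = self._keys[user_id]  # raises KeyError once the key is shredded
        return [
            _xor(ct, _keystream(key, nonce, len(ct)))
            for nonce, ct in self._blobs.get(user_id, [])
        ]

    def shred(self, user_id: str) -> None:
        # Only the key is deleted; ciphertext (and any backups of it) may remain.
        self._keys.pop(user_id, None)
```

The point of the design is that backups only ever hold ciphertext, so removing one small key is enough to make every copy unreadable, provided no copy of the key survives elsewhere.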
        
         | notreallyserio wrote:
         | Ideally they'd encrypt the customer content with a key provided
         | by the customer and destroyed when the customer requests
         | account deletion. The customer would still be able to use their
         | key to decrypt backups that they get prior to the request. If
         | the customer changes their mind, they just upload the key again
         | (along with the backup, if necessary).
         | 
         | Of course, this means trusting Atlassian to actually delete the
         | key on request, but there's not much reason for them not to.
        
       | rob_c wrote:
       | Seriously, where were the -24hr backups that could be rolled back
       | to once it's clear the script is fubar, or using it on just 10% of
       | the estate first? ...
        
         | largbae wrote:
         | It is never that simple. Say the backup existed, and was
         | global. By the time you get everyone briefed on how fubar it is
         | and get agreement to load the backups, there are hours of
         | changes from the unaffected customers that will be wiped by the
         | restore, or have to be reconciled by hand for months. Sure, you
         | can concoct the perfect antidote with hindsight, but their
         | retro and next steps are sound.
        
           | rob_c wrote:
            | Build it properly. By definition, if you follow best practice,
            | it IS (or always should be) not far from being able to do
            | this. If it's not, someone is to blame.
        
           | kwertyoowiyop wrote:
           | That lesson really stuck out for me also. My definition of
           | "restore" has been too simplistic.
        
           | baskethead wrote:
           | What it means is that they never tested their disaster
           | recovery system, because this would have been found right
           | away. Or, someone would have reported it and an upper level
           | exec would have signed off on it being okay to take 14 days
           | to restore a small subset of users.
        
             | largbae wrote:
             | Again, not that simple. The customer restore procedure was
              | almost certainly tested (and in active use as customers blow
             | up their own data often enough). It was _not_ tested on 800
             | customer stacks simultaneously, as that was considered a
             | sitewide disaster by whoever dreamed up the failure modes
             | to test for. Meanwhile the actual whole site disaster
             | restore plan may or may not have been tested, but it was
             | useless for this case since some customers were unaffected
             | and would be damaged by the whole site plan.
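The difference between a whole-site restore and a restore scoped to the affected customers can be sketched like this. It assumes, purely for illustration, that both live state and the snapshot are keyed by tenant ID; the thread suggests real systems are rarely partitioned this cleanly. `restore_tenants` is a hypothetical helper.

```python
import copy


def restore_tenants(live: dict, snapshot: dict, tenant_ids) -> dict:
    """Return new state where only the listed tenants come from the snapshot.

    Unlisted tenants keep their current (post-snapshot) data untouched.
    """
    restored = copy.deepcopy(live)
    for tenant_id in tenant_ids:
        if tenant_id not in snapshot:
            raise KeyError(f"tenant {tenant_id} is not in the snapshot")
        restored[tenant_id] = copy.deepcopy(snapshot[tenant_id])
    return restored
```

A global restore would instead replace `live` with `snapshot` wholesale, discarding every change unaffected tenants made after the snapshot was taken.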
        
         | [deleted]
        
       | joering2 wrote:
       | I wish they could assure us the engineer who pressed the button
       | wasn't fired.
        
         | devjam wrote:
         | > Atlassian is proud of our incident management process which
         | emphasizes that a blameless culture and a focus on improving
         | our technical systems ...
        
       | seanwilson wrote:
       | Do any databases have something like native support for soft
       | deletes or ability to undo (other than SQL transactions rollbacks
       | where you're having to specify the undo checkpoint)? Something
       | like what Git does where it keeps a history of edits? If this
       | isn't common, is this a neglected area that should be addressed
       | or it's just too hard of a problem? It feels like with SQL,
       | there's minimal guardrails and it's just your own fault if you're
       | not extra careful, compared to say using Git with code or using
       | "restore from trash" with filesystems.
        
         | VTimofeenko wrote:
         | Snowflake has time travel[1] keeping the original data for
         | specified period of time.
         | 
         | 1: https://docs.snowflake.com/en/user-guide/data-
         | availability.h...
        
         | jordanthoms wrote:
         | Quite a few databases support time travel queries, in
         | particular Oracle has for years and CockroachDB has them also.
         | We can query the state of a table as it was at any point in the
         | last 72hrs.
        
         | tyre wrote:
         | Datomic treats all data as immutable, so it can wind back to
         | any version.
         | 
         | When new data is written, the entire block is copied and
         | rewritten rather than changing the data in-place.
        
           | mdavidn wrote:
           | This is similar to SQL checkpoints in that a rollback is all
            | or nothing. It wouldn't be segmented by tenant unless each
            | tenant has its own transactor.
        
         | candiddevmike wrote:
         | WAL archiving / point in time recovery can help with this.
        
         | wolf550e wrote:
         | In MVCC systems like PostgreSQL, if you don't vacuum (garbage
         | collect the old tuples), your database is append-only and you
         | can query as if your transaction was started at some time in
         | the past. I don't know how to set auto-vacuum to have a fixed
         | delay, e.g. keeping 24h of changes, but I bet it can be added
         | if it's not built-in.
        
         | jasonwatkinspdx wrote:
         | So, the fully fleshed out form of this in databases is usually
         | called Bitemporality. The official SQL standard has included
         | this for some years now, but it's not widely implemented by
         | databases.
         | 
         | An intuitive way to think of bitemporality is it's like MVCC,
         | but with 4 timestamps per row version. One pair describes a
         | range of time in "outside" or "valid" time, ie whatever
         | semantic domain the database is modeling, the other pair
         | describes a range of "system" time, which is when this record
         | was present in the database. This lets you capture and reason
         | about the distinction between when a fact the database models
         | was true in the real world, and when the database was updated
         | to reflect this fact (some people call this "as of" vs "as
         | at"... the terms here aren't fully settled but the basic
         | distinction is). So you can revise history, do complex time
         | travel queries, all sorts of stuff. It's a very useful model
         | that directly aligns with what sort of questions businesses
         | need to answer in the context of a court case or revising their
          | ground source of truth due to a past bug or error.
         | 
         | The downside is your database balloons with row versions, and
         | many queries become far more complicated, perhaps needing
          | additional joins, etc. Also from the perspective of database
         | implementors there's a ton more complexity in the code. So
         | that's why it's not widely supported despite the standard.
         | 
         | There's also a niche of databases built around this model from
         | the ground up, usually based on Datalog instead of SQL. There's
         | also overlapping work with RDF and Semantic Web thinking (as
         | awry as all that went).
         | 
         | In practice how most organizations address this is
         | operationally, by keeping generational and incremental backups
         | that let them restore previous database states as needed.
         | Though as the original post we're here for proves, that kind of
         | operational solution can bite back hard when it goes wrong.
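A minimal in-memory sketch of the "4 timestamps per row version" model described above. `RowVersion` and `lookup` are hypothetical names, plain numbers stand in for timestamps, and ranges are half-open; this is not how any particular database implements it.

```python
from dataclasses import dataclass
from typing import Optional

FOREVER = float("inf")


@dataclass(frozen=True)
class RowVersion:
    key: str
    value: str
    valid_from: float    # when the fact became true in the modeled world
    valid_to: float
    system_from: float   # when this version was present in the database
    system_to: float


def lookup(rows, key: str, valid_at: float, system_at: float) -> Optional[str]:
    """What did the database, as it stood at system_at, say held at valid_at?"""
    for r in rows:
        if (r.key == key
                and r.valid_from <= valid_at < r.valid_to
                and r.system_from <= system_at < r.system_to):
            return r.value
    return None


# At system time 10 we record price=100 from valid time 0 onward.
# At system time 20 we learn the price was really 120 from valid time 5 onward,
# so the old version is closed and two corrected versions are written.
ROWS = [
    RowVersion("price", "100", 0, FOREVER, 10, 20),
    RowVersion("price", "100", 0, 5, 20, FOREVER),
    RowVersion("price", "120", 5, FOREVER, 20, FOREVER),
]
```

Querying valid time 7 as of system time 15 returns the old belief, while the same query as of system time 25 returns the corrected value. Nothing is overwritten, which is also why the row count grows with every revision.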
        
       | justinludwig wrote:
       | I can't say that I've ever been a fan of Atlassian or their
       | products, but this blog post makes it sound like they've at
       | least learned the right lessons from this:
       | 
       | 1. Establish universal "soft deletes" across all systems.
       | 
       | 2. Better DR for multi-site, multi-product incidents.
       | 
       | 3. Fix their incident-management process for large-scale
       | incidents.
       | 
       | 4. Fix their incident communications.
       | 
       | Regarding #4 in particular:
       | 
       | "Rather than wait until we had a full picture, we should have
       | been transparent about what we did know and what we didn't know.
       | Providing general restoration estimates (even if directional) and
       | being clear about when we expected to have a more complete
       | picture would have allowed our customers to better plan around
       | the incident....
       | 
       | [In the future], we will acknowledge incidents early, through
       | multiple channels. We will release public communications on
       | incidents within hours. To better reach impacted customers, we
       | will improve the backup of key contacts and retrofit support
       | tooling to enable customers... [to make emergency] contact with
       | our technical support team."
        
         | OrderlyTiamat wrote:
         | I am actually pleasantly surprised at the openness of this
          | response, and their taking responsibility for mistakes and
         | detailing what will change in the future. It's not just
         | corporate speak. I think that speaks well for the company and
         | it improved my view of them.
        
           | [deleted]
        
         | perlgeek wrote:
         | I tend to agree.
         | 
         | However, there's one point that makes me skeptical: there are
         | no organizational changes, or changes to leadership, or
         | anything in that direction.
         | 
         | This sounds like "the tech guys screwed up, culture and
         | management is fine here". Which it might be, or it might not.
         | 
         | I would have loved to see
         | 
         | 5. We will stop pushing customers so hard towards using our
         | cloud
         | 
         | for one, but that wouldn't be convenient for Atlassian.
        
           | eastbound wrote:
           | Note that the ToS also forbid Cloud users from "disseminating
           | information" about "the performance of the products".
           | 
            | So you can't say it's slow or underperforming.
        
             | geerlingguy wrote:
             | I thought, in terms of Jira and Confluence at least, it was
             | just accepted that being slow and underperforming was the
             | status quo and if it was running at a speed you'd consider
             | normal, that's an exception (and cause for alarm... like
             | "did that actually save or is there a silent JS error not
             | being displayed?").
        
       ___________________________________________________________________
       (page generated 2022-04-30 23:00 UTC)