[HN Gopher] Inside the longest Atlassian outage
       ___________________________________________________________________
        
       Inside the longest Atlassian outage
        
       Author : andyjohnson0
       Score  : 714 points
       Date   : 2022-04-13 15:27 UTC (7 hours ago)
        
 (HTM) web link (newsletter.pragmaticengineer.com)
 (TXT) w3m dump (newsletter.pragmaticengineer.com)
        
       | hsnewman wrote:
       | Sounds like the continuity planners at Atlassian (the fall guys)
       | will be looking for a new job.
        
       | bogomipz wrote:
       | >"The outage is its 9th day, having started on Monday, 4th of
       | April." >"It took until Day 9 for executives at the company to
       | acknowledge the outage."
       | 
       | Just to put this in perspective. These executives would have left
       | on a Friday afternoon to start their weekends without bothering
       | to publicly address an ongoing outage that was by then 5 days
       | old.
       | 
       | This is mind boggling. Like did some C-level exec say something
       | like "Let's just park this whole outage communication discussion
       | until Monday, have a good weekend everyone."?
        
       | gunapologist99 wrote:
       | Trello seems to still be up?
        
       | R0ger wrote:
       | I guess this is a wake-up call for the people rushing to SaaS
       | solutions.
        
         | brianwawok wrote:
         | Is it?
         | 
         | We use JIRA. Not impacted.
         | 
         | If this had hit us.. we would just switch to excel or something
         | for a week/month?
         | 
         | But maybe we are a very light user of JIRA. Nothing in there
         | can't be replaced. It's "nice" to be able to go look up a 3
         | year old bug and which client reported it, but not really
         | crucial for day to day ops.
        
           | ProAm wrote:
           | > We use JIRA. Not impacted.
           | 
           | This time.
        
           | bborud wrote:
           | "Switch to Excel for a week/month"
           | 
           | Right.
        
           | mrits wrote:
           | I wonder why you use Jira if a spreadsheet is sufficient for
           | your use case.
        
             | HeyLaughingBoy wrote:
             | He didn't say it was sufficient; he said they could do it
             | for a short while. I consider myself in the same situation:
             | we depend on Jira, but for a week or so it's not a big deal
             | to use a bunch of Post-It notes.
        
             | function_seven wrote:
             | Same reason I use oil lamps when the power is out, even
             | though electric bulbs are my normal lighting.
             | 
             | A spreadsheet may be sufficient, but it's not as good as a
             | system designed for development workflows.
             | 
             | (This comment sounds like I have a speck of love for JIRA.
             | I don't! :)
        
               | mrits wrote:
               | I don't see this as a valid comparison. There is
               | information loss. This has happened to my team which had
               | about 50 people and it was very chaotic. It took us
               | several days to just create the state our features were
               | in.
               | 
               | Today it would even be more troublesome as we have a lot
               | of integration rules dependent upon the workflow. I'd
               | probably just recommend everyone uses a few weeks for
               | self improvement and only address critical production
               | issues.
        
         | dangus wrote:
         | Outages definitely happen with on-premise software.
         | 
         | At some point the logo on the engineer's badge doesn't really
         | matter.
        
       | kgeist wrote:
       | We use on-premises setups for almost everything (we generally
       | avoid cloud solutions to have full control of our data),
       | sometimes (approximately once a month) it goes down for a few
       | minutes which already feels like a torture because all our
       | processes depend on it, I can't imagine having no access to it
       | for several weeks, all our work would grind to a halt... The
       | office of the guy who administers on-premise servers is literally
       | next door, all it takes is to make a visit to him and everything
       | works again after 5 minutes. Reading horror stories like this
       | (Slack being down, Atlassian being down, no one knows what is
       | happening and when it will end etc.), I wonder why many companies
       | choose cloud solutions for critical business processes. Is it
       | pricing? Ease of use? I can understand why very small companies
       | would choose it, but I don't understand why a medium/large
       | business would choose anything but an on-premises setup.
        
         | lloydatkinson wrote:
         | Cloud solutions can work well. I've used GitHub, Azure Devops,
         | and BitBucket (another wonderful atlassian product /s) and
         | BitBucket frequently craps out, multiple times a week. We need
         | to rerun builds in TeamCity because BitBucket stops talking to
         | it.
        
         | Msw242 wrote:
         | You're assuming every team would have better uptime with in-
         | house solutions
         | 
         | I think many would have worse uptime even with more headcount
        
           | tetha wrote:
           | In our experience, this strongly depends on the services
           | involved, as well as the scale.
           | 
           | For example, for our own service: If you have a hundred or
           | two hundred licenses, you can drop our system on a linux box
           | and usually you have to throw a yum update and one or two
           | service restarts at it every few months and it just works. I
           | honestly wouldn't be surprised if many of our small on-prem
           | solutions have better uptime than the SaaS clusters, or be
           | capped in uptime by some externality, rendering the system
           | downtime irrelevant. If their VMWare cluster is down, our
           | system is down, but no one cares.
           | 
           | This also mirrors a lot of our internal systems. At a small
           | scale, you can just dump chef, jenkins, sonar, nexus,
           | whatever on a linux box and forget about it.
           | 
           | However, this changes with high license counts. We have
           | singular customers in our SaaS offering that are more than 50
           | - 100x bigger than the small on prem systems. At that point,
           | our SaaS offering is better than anything the customer could
           | do on-prem. I'm confident to say this about all of our
           | customers, except maybe 2.
        
           | bob1029 wrote:
           | I find this argument to be totally bs these days.
           | 
           | If anything, a smaller company with smaller footprint and
           | fewer total requirements is going to be more likely to manage
           | a vertical slice of some SAAS product.
           | 
           | The reason things like github go down so often is _because_
           | they are public /shared resources.
        
             | kgeist wrote:
             | >The reason things like github go down so often is because
             | they are public/shared resources.
             | 
             | Very much this. Managing shared resources at scale is
             | pretty hard. We have a bunch of internal sites made by
             | interns as part of their internships, and, funny enough,
             | those sites have much greater uptime and appear more stable
             | than our own multi-tenant SaaS solution made by seasoned
             | devs.
        
           | kgeist wrote:
           | I've heard this argument many times before, but is there
           | research into this? I.e. where they would compare uptime of
           | cloud vs. on-premises across a wide range of companies.
        
             | rirze wrote:
             | I mean, you're going to get biased results, no? Only
             | companies who are confident in self-hosting will self-host
             | it. You won't have any real data about companies who are
             | not confident in self-hosting maintaining their on-premises
             | version of the software.
        
         | NationalPark wrote:
         | What do you do if the on-premises guy gets hit by a car and
         | isn't in his office?
        
           | kgeist wrote:
           | There's IIRC 3 or 4 people in their department, they
           | administer the whole building (wifi, security cams, LDAP,
           | etc.), not only the on-premises servers. From what I
           | gathered, our internal systems usually go down due to lack of
           | disk space or some bug in the software which requires merely
           | a reboot, it's not rocket science. Another thing is that our
           | IT department (for internal systems) and the SRE department
           | (for client-facing systems) have 24/7 on-call duty so it's
           | unlikely that no one will respond.
        
           | oriki wrote:
           | The same thing that the cloud company would do. If there are
           | other people there who share that guy's responsibilities,
           | have them do it. If there aren't, you should have an on-call.
           | 
           | Cloud just outsources that problem to another business. Sure,
           | they have better reasons to actually cover those positions
           | and make sure they have on-calls and backup and a disaster
           | plan, but just because you pay extra money for it doesn't
           | actually make it work better if the company underlying it
           | sucks.
        
         | snark42 wrote:
         | > but I don't understand why a medium/large business would
         | choose anything but an on-premises setup.
         | 
         | Atlassian is in the process of killing the on-premise
         | small/medium business option, already announced an EOL date.
         | 
         | Move to the cloud, buy a 500+ user solution for a much higher
         | price or migrate away are my choices. Of course I use the local
         | database and have local services JIRA/Confluence talk to so
         | it's not really an option to move to the cloud.
         | 
         | I assume lack of competent on-site staff 24/7, having someone
         | else to blame as well as lower costs are why people choose the
         | cloud over on-premise though.
        
         | originalvichy wrote:
         | I am biased but I can tell you what works best for mid-large
         | companies: having a solution provider. Basically a partner that
         | hosts and maintains the instance and has enough Atlassian
         | certified people to help you with any question so that you will
         | never have to hire people to just maintain the beasts or tell
         | you about features, tricks or plugins that could solve problem
         | X.
         | 
         | Experienced people hosting and tuning Atlassian products has a
         | greater success rate than someone doing it alone for a large
         | company. Almost every time I've migrated an old Atlassian
         | installation under our wing it's given me a shock how users have
         | been made to suffer the loading times and perfs that come from
         | underprovisioning (db or actual machine) and messy
         | configuration. I'm not blaming the former admins but it just
         | happens. Usually end users are happy after we clean the mess up
         | and everything feels snappy.
         | 
         | Disclosure: I've worked in this kind of expert role.
        
           | crummy wrote:
           | I can't see the difference between a "solution provider" that
           | hosts your Jira and just getting Atlassian to do it. What's
           | stopping the solution provider from accidentally running a
           | script that deletes some customer's files and struggling to
           | do a partial backup restore?
        
             | snark42 wrote:
             | I would assume the MSP is running a dedicated instance and
             | can do a full/backup restore just for the user they're
             | supporting.
             | 
             | If it's some multi-tenant solution it's no better.
        
               | originalvichy wrote:
               | Correct. There are probably not a lot of MSPs that have
               | so many customers that they need to share that much data,
               | and their customers probably use MSPs for the strict
               | purpose that they don't want to share things with other
               | companies.
        
             | originalvichy wrote:
             | Because you can get the best parts of self-hosted and
             | managed services. And on that backup question: self-hosted
             | Atlassian is vastly easier to protect against disasters.
             | The problem these Atlassian guys had arose from multi-
             | tenant architecture. Usually managed service providers will
             | host your stack on individual databases and VMs, and
             | backing up the software is just a matter of taking pg_dumps
             | and rsyncing certain directories (pretty old school) or
             | just taking disk level snapshots.
             | 
             | Many medium-large corporations have their own cloud
             | environments that their IT Ops control. Solution providers
             | can host Atlassian stacks on their own cloud environment
             | where they are not affected by data privacy concerns (it's
             | in their already green-lit cloud provider's data center) so
             | they can host it behind a firewall with only VPN access
             | allowed. They can also do all the magic you can usually do
             | with web software like put a frontend proxy in front of it,
             | or use more flexible/legacy authentication methods. Not to
             | mention that for example you could have a Jira Cloud that
             | you would need to integrate with a SCM program. Jira data
             | could be "OK" to live in the cloud but code would be a big
             | no-no. These problems can be solved by having them all live
             | behind the firewall.
             | 
             | A competent managed solution provider also has consultants
             | that can train or instruct on usage. It costs but it is
             | simpler and faster than having to go through the forums or
             | send a support ticket for every small issue to Atlassian
             | itself.
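
A minimal sketch of the "pg_dump plus rsync" style of backup mentioned above, for a single-tenant install. The database name and both paths are hypothetical, and the script only builds the commands rather than running them. A backup like this is only useful once a restore from it has been tested.

```python
from typing import List


def build_backup_commands(db_name: str, data_dir: str, dest_dir: str) -> List[List[str]]:
    """Return argv lists for a database dump plus a directory sync."""
    return [
        # pg_dump: custom-format archive (-Fc) written to a file (-f).
        ["pg_dump", "-Fc", "-f", f"{dest_dir}/{db_name}.dump", db_name],
        # rsync: archive mode (-a); the trailing slash copies the directory contents.
        ["rsync", "-a", f"{data_dir}/", f"{dest_dir}/data/"],
    ]


# Hypothetical names and paths, for illustration only.
commands = build_backup_commands("jira", "/var/atlassian/jira", "/backups/jira")
for argv in commands:
    print(" ".join(argv))
```

To actually run a command, each argv list could be passed to `subprocess.run(argv, check=True)`.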
        
           | pphysch wrote:
           | It seems like if you are going to pay for a bunch of SaaS
           | seats AND a team of technicians/engineers to make it work,
           | you might as well just do the latter and roll your own
           | solutions...
           | 
           | A lot of these SaaS are just glorified Rails apps with a
           | patina of professional "security" and "reliability", and
           | loads of extra junk that your co will never use.
        
             | originalvichy wrote:
             | Trust me, if someone could clone Jira and its functionality
             | they would have done so already. Truth is that if you build
             | one product for 20 years you have a giant lead in features.
             | If all it took was having a Kanban board then Jira would
             | have died years ago.
        
         | yibg wrote:
         | What do you do if your on prem setup lost data? There is an
         | implicit assumption here that on prem is more reliable than
         | cloud. Less downtime, less chances of data loss etc. Obviously
         | it depends on which cloud product we're talking about but I
         | don't think a blanket "my on prem goes down less and when it
         | does go down I can get it back up sooner" is true.
        
           | originalvichy wrote:
           | I think for that question we also have to define on-prem just
           | to be clear. To many on-prem means "own cloud subscription".
        
       | kingofpandora wrote:
       | Engineering mistakes happen.
       | 
       | The most inexcusable thing is not communicating with the paying
       | customers who have been affected for over a week.
       | 
       | Atlassian's Global Head of Customer Success probably should have
       | been fired but here she is promoting Atlassian Cloud on LinkedIn
       | three days ago: https://www.linkedin.com/mwlite/in/gertie-
       | rizzo-5b70061
       | 
       | Actually reading a bit more, it seems like their customer team
       | was partying in Las Vegas instead of taking care of business:
       | https://www.linkedin.com/mwlite/feed/hashtag/atlassianteam22
       | 
       | Priorities.
        
         | naoqj wrote:
        
         | _sword wrote:
         | Fair criticisms on response times but regarding Vegas, it was
         | their annual user conference last week in Vegas.
        
         | madmulita wrote:
         | They claim they test backups quarterly yet they don't have a
         | procedure in place to restore the operation. We all know your
         | backup is not tested until you restored everything
         | successfully. This is not an engineering mistake, it is a flat
         | out lie.
        
           | iancarroll wrote:
           | Well, their explanation makes sense. These are multi-tenant
           | environments where not every tenant was affected; sensibly,
           | the backups appear divided by environment, not tenant. You
           | can't blindly revert to an environment's last backup in this
           | scenario, although you'd think they would have done it
           | before.
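
A toy model of the point above, assuming backups are taken per environment and each environment holds several tenants. The tenant names and records are invented. It shows why a blind revert loses the unaffected tenants' newer work, and what a selective restore has to do instead.

```python
from typing import Dict, Set

# Hypothetical model: tenant id -> {record key -> value} within one environment.
Environment = Dict[str, Dict[str, str]]


def selective_restore(live: Environment, backup: Environment, deleted: Set[str]) -> Environment:
    """Restore only the deleted tenants from the backup; keep live data for the rest."""
    restored = {tenant: dict(rows) for tenant, rows in live.items()}
    for tenant in deleted:
        if tenant in backup:
            restored[tenant] = dict(backup[tenant])
    return restored


backup = {"acme": {"ISSUE-1": "open"}, "globex": {"ISSUE-7": "open"}}
# After the backup was taken, globex kept working and acme was deleted by mistake.
live = {"globex": {"ISSUE-7": "closed", "ISSUE-8": "open"}}

blind = dict(backup)  # reverting the whole environment discards globex's newer work
selective = selective_restore(live, backup, {"acme"})
```

In a real system the per-tenant rows are spread across many tables and services, which is presumably where the difficulty lies; the dictionary here hides all of that.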
        
         | julesallen wrote:
         | No argument on the crappy comms.
         | 
         | If I was in customer success at an enterprise vendor I doubt
         | I'd be let anywhere near the tools to get this back up and
         | running. These guys are generally in the way rather than
         | helping in a situation like this.
         | 
         | Head of engineering or some product rather than customer
         | support? That might be a different outcome.
        
         | iamtheworstdev wrote:
         | Can confirm. Saw them there while I was on vacation.
        
           | benreesman wrote:
           | Jesus, if there was ever an example of the internet making
           | the world smaller.
           | 
           | When do execs living it up at the fucking Wynn Encore while
           | the house burns down start to not get another job?
           | 
           | They'll keep pulling this shit until it cost money.
        
             | benreesman wrote:
             | For clarity: I went through a period where some combination
             | of self-indulgence and legitimate life crisis caused me to
             | take my eye off the ball when it mattered.
             | 
             | I'm still trying to kickstart a second act years later,
             | because I'm trailer trash and it's hard work when you're
             | that.
        
         | syshum wrote:
         | Sales never takes the blame. If anyone is fired, it will be
         | scapegoats in engineering: once they have busted their ass to
         | restore everything, their reward will be the door.
        
           | systemvoltage wrote:
           | This is an engineering problem. They should own it and
           | improve things, make sure it doesn't happen again.
           | 
           | Also, GP's quote
           | 
           | > Engineering mistakes happen.
           | 
           | I don't like this statement because it offers consolation at
           | the expense of unintentional normalization.
        
             | nix23 wrote:
             | Ever heard of Space Shuttle Challenger? You can't own it if
             | your management is against it.
        
             | syshum wrote:
             | The deletion of customer data was an engineering mistake; that
             | is not what I was talking about.
             | 
             | The negative fallout was not due to the deletion of
             | customer data. As the story and multiple customers have
             | stated, the negative fallout was the SILENCE, which is Sales
             | / Customer Service, not engineering.
             | 
             | As the comment I was replying to noted, while engineering
             | was trying to recover from what might possibly be the
             | biggest outage in the history of the company, Sales was
             | partying and not handling customer communications.
             | 
             | That (the failure to communicate with customers) should be
             | a resume-generating event for all leadership in customer
             | service / sales. It will not be, however, because sales will
             | simply redirect their failure on to engineering in the
             | exact same manner you just have.
        
             | buscoquadnary wrote:
             | And coders that say all code has bugs are just defeatists
             | that are trying to make excuses for being lazy.
             | 
             | Sometimes manure will always hit the fan. Being robust
             | means being able to handle that.
        
               | jacksnipe wrote:
               | I think this is obviously incorrect.
               | 
               | Human error is probabilistic, and the probability of
               | making an error cannot be zero.
               | 
               | On the flip side, it's infeasible to use only provably
               | correct systems; not lazy, but literally not a practical
               | option due to compute costs, developer time, what formal
               | techniques can even be applied to the problem at hand,
               | etc...
        
               | systemvoltage wrote:
               | A culture where mistakes are taken too seriously or too
               | lightly leads to problems. Also it depends on what stage
               | of the product cycle (Innovation/Rapid Development vs.
               | Robustness/Quality). I'd argue that Atlassian products
               | should err towards robustness and high quality. Not
               | trying to break any new ground.
        
       | xyst wrote:
       | Atlassian about to dip over the next few years as firms around
       | the world slowly remove themselves from their ecosystem of
       | products.
        
         | jlawer wrote:
         | Not to mention a CEO who is more interested in activities
         | outside the company like the green energy transition and
         | politics.
         | 
         | As an Aussie I always wanted Atlassian to succeed as we have so
         | few tech companies at that scale or larger. Now I view them as
         | another Oracle. Now they innovate little, they keep ratcheting
         | up prices, pushing deployments to cloud where they make more
         | money. Nickel and dime you for what should be core features
         | (SAML Auth?). They aren't coming up with anything new to keep
         | the value in the ecosystem. They buy applications in, spend a
         | little to make some cross integration and then drop down to a
         | slower development cadence.
        
         | ekanes wrote:
         | Right. Feels similar in a way to an ongoing conflict
         | elsewhere... There is what happens now, and what happens over
         | the next decade because people have lost fundamental trust in
         | you.
        
         | tpmx wrote:
         | Their core customers are unfortunately just as dysfunctional
         | and slow-learning. Think Boeing, etc. Witness:
         | 
         | https://jobs.boeing.com/job/annapolis-junction/jira-administ...
        
       | xiaodai wrote:
       | Comes across as a jerk. How can an outsider say things with such
       | certainty?
        
       | bluedino wrote:
       | Regarding the backup restores:
       | 
       | I once worked at a company that had a data loss issue. There was
       | nothing else we could do, we had exhausted every option we had
       | over almost 40 hours. At the end of the second day, it was
       | decided to restore from backup.
       | 
       | We had done this before, as a test. It took about 12 hours to
       | restore the data and another 12 hours to import the data and get
       | back up and running.
       | 
       | One small thing was different this time, and it had huge
       | consequences. As a cost-saving measure, an engineer had changed
       | the location of our backups to the cold-storage tier offered by
       | our cloud provider. All backups, not just 'old' ones.
       | 
       | This added 2 additional days to our recovery time, for a total of
       | five days. Interestingly enough, even though we offered a full
       | month's refund to all of our customers, not even half of them
       | took us up on it.
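
The timeline in the comment above can be added up as a rough check, assuming the phases ran strictly one after another and treating "almost 40 hours" as 40.

```python
# Durations in hours, taken from the comment above.
phases = {
    "troubleshooting before deciding to restore": 40,
    "restore data from backup": 12,
    "import data and bring the service back up": 12,
    "extra wait for cold-storage retrieval": 48,
}

total_hours = sum(phases.values())
total_days = total_hours / 24
print(f"{total_hours} hours = {total_days:.1f} days")  # 112 hours = 4.7 days
```

That lands at roughly the five days quoted, with the cold-storage retrieval alone accounting for more than 40% of the total.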
        
         | bombcar wrote:
         | In these cases the best thing to do is just give every customer
         | the full month refund; don't make them ask for it.
        
           | sodality2 wrote:
           | The best thing to do business-wise, or as a good faith move?
        
             | rjmunro wrote:
             | What's the difference?
        
               | treesknees wrote:
               | Good faith would be to lose all of that money to people
               | who are already your customers.
               | 
               | Business-wise would be to stay in their good graces and
               | keep those customers by offering the refund, but you
               | don't lose any money to those who either don't care or
               | won't move to a competitor.
        
               | function_seven wrote:
               | 25 years ago the clutch in my beater truck was slipping.
               | I was 16 years old, making $50 a _week_ and had very
               | little in savings. I took that truck to a shop within
               | walking distance of my job.
               | 
               | 2 hours later I walked back to see what they found. I
               | figured it would be several hundred dollars for a new
               | clutch, and I'd have to borrow money or something to get
               | it done. I talked to the owner who told me it was an
               | adjustment on the cable. Just needed to be scootched up a
               | bit and it was probably good for another 30k miles.
               | 
               | When I asked him how much I owed, he laughed at me and
               | said, "For that? Not worth writing it up. No charge. You
               | want me to show you how to do it yourself next time?"
               | 
               | The shop could very easily have charged me 1 hour of
               | labor at their standard rate, maybe $75 or so. Plus a
               | diagnostic or test drive fee. Whatever. He could have
               | told me, "$123.98" and I would have paid it. I wouldn't
               | even have been mad. But I sure as hell wouldn't have
               | remembered the experience so clearly. Nor would I have
               | told a dozen people over the years to take their cars
               | there. And I definitely would not have driven 20 miles
               | out of my way to return to that shop in the future years.
               | 
               | Being cynical about this stuff will hurt your brand. It's
               | not obvious. It doesn't show up on the earnings report as
               | a line item. This is service segmentation that seems like
               | a no-brainer to a clueless MBA, but actually matters in
               | the long run. How people view your brand is immensely
               | important.
               | 
               | Not forcing customers you already screwed over to then
               | spend more time chasing a refund is not only the right
               | thing to do, it's also good business.
        
               | heisenbit wrote:
               | Reducing the impact analysis within a long running
               | relationship to a single transaction is too narrow.
               | People observe how other people are treated and draw
               | their conclusions even if not impacted. People may
               | tolerate some abuse but it moves them closer to leaving
               | next time. Money lost in the outage may provide for a
               | budget creation to look for an alternative.
        
               | rgj wrote:
               | A lot of people making those decisions don't care about a
               | refund because it's other people's money anyway. In my
               | experience only small companies care about that.
               | 
               | Focussing on communicating openly and honestly allows them
               | to explain the crap they're going through because of your
               | mistakes to their bosses, so in fact you can help them
               | save their asses, and they'll save your ass in return.
               | This is much more important and valuable than a refund.
               | 
               | So you should ALWAYS communicate openly and honestly, and
               | offer the refund as an option for clients who do not have
               | a boss to account to.
        
           | bzxcvbn wrote:
           | Not every business can afford to go one month without income.
           | What's the best thing for customers? Have the business go
           | bankrupt and irremediably lose access to the service?
        
             | function_seven wrote:
             | It's 400 clients, not all their user base. They can handle
             | the lost income from a small slice of their customers for
             | one month.
             | 
             | And if they can't sustain that, then it's even _more_
             | imperative that those customers migrate away.
        
               | enra wrote:
               | Atlassian had almost a billion in free cashflow last year
               | and over a billion in cash. I think they should cover the
               | whole year for these customers.
        
         | miketria wrote:
         | Hi, I'm Mike and I work in Engineering at Atlassian. Here's our
         | approach to backup and data management:
         | https://www.atlassian.com/trust/security/data-management - we
         | certainly have the backups and have a restore process that we
         | keep to. However, this incident stressed our ability to do this
         | at scale, which has led to the very long times to restore.
        
           | sizzle wrote:
           | How's the atmosphere internally Mike? Must be crazy times
           | there. I know this isn't your fault, so hang in there.
           | Cheers!
        
           | encryptluks2 wrote:
           | You mean your poor practices and bad design. The only way to
           | prevent this type of issue in the future is to admit the
           | failures.
        
       | farseer wrote:
       | They have recently killed off on premise offerings, it's cloud
       | only now. And this makes it harder to trust both the security and
       | integrity of your data.
        
         | ocdtrekkie wrote:
         | The fact that a single bad script could delete 400 of their
         | customers should be absolute proof they do not have the
         | processes in place to be a steward of your data in the cloud.
         | On-prem or bust.
        
           | dangus wrote:
           | On-premise just means that your overworked IT person is going
           | to spend 5% of their time keeping your service maintained, at
           | no point gaining any more than baseline familiarity with the
           | product.
           | 
           | On-premise isn't a magic pill guaranteeing 100% uptime and 0
           | data loss.
           | 
           | While on-premise may be a good choice in many cases, it's not
           | like running on-premise business tools has no risk associated
           | with that choice.
           | 
           | Remember that the goal of a company is to sell the most
           | product possible (output) with the lowest cost possible
           | (input).
           | 
           | Any Joe off the street starting their own business can pay
           | Atlassian $0/month for up to 10 users. On-prem doesn't
           | compete with that.
        
             | dzikimarian wrote:
             | On Prem means you have control over spending. I calculated
             | that if we moved to the cloud, we would pay YEARLY as
             | much as we spent on Atlassian licenses in the last 5 years.
             | That easily pays for the maintenance overhead on our devops
             | team.
        
         | [deleted]
        
         | hrpnk wrote:
         | Afaik, the Data Center option still allows for on-premise
         | deployment, incl. Kubernetes and cloud deployments [1, 2, 3].
         | 
         | [1] https://www.atlassian.com/enterprise/data-center
         | 
         | [2] https://confluence.atlassian.com/enterprise/jira-data-
         | center...
         | 
         | [3] https://confluence.atlassian.com/enterprise/deploying-
         | enterp...
        
       | jmondi wrote:
       | What blows my mind is that Atlassian stock has barely taken a
       | hit...
        
         | ferdowsi wrote:
         | The market will react at the next earnings report, not now. And
         | only if customers start to bail.
        
         | NineStarPoint wrote:
         | Unless their revenue takes a long term hit over the outage, no
         | reason for the stock market to care. There isn't news of people
         | actually planning to stop using Atlassian products over this.
         | The only direct consequence is going to be the one time payment
         | of SLA credits. So I guess I'm more surprised by how little
         | impact this looks like it will have on people using their
         | products than by the stock market not caring much about this.
        
           | mountainriver wrote:
           | Yeah Atlassian is a corporate leech, you don't get away that
           | easy
        
         | pigtailgirl wrote:
         | the stock is on a rally today - just goes to show - the market
         | is full of surprises
        
         | capableweb wrote:
         | It's been a long time since individual stocks represented
         | anything grounded in reality. People talk about "fundamentals"
         | and so on, but that's not what the price is based on. I don't
         | think anyone knows why the prices move as they do anymore, as
         | there are so many algorithms involved today, both manual and
         | automatic ones.
        
         | __app_dev__ wrote:
         | Yeah, I placed a Put option order yesterday. By end of day I was
         | up over 50% and now down to 50% of what I originally purchased
         | the Put at because it went up 5% today.
         | 
         | Oh well, better luck next time.
        
         | devmunchies wrote:
         | it was at $317 on the day of the outage and now at $278.5. A
         | ~12% drop. You're right, not much of a drop for such a large
         | outage.
        
           | [deleted]
        
           | __app_dev__ wrote:
           | The outage did not impact the stock, most major tech stocks
           | have taken a large hit in the past week and a half (until
           | today).
           | 
           | This event is not even showing on any financial news site. I'm
           | still hoping it does and the stock goes down because I placed
           | an option order yesterday betting that it goes down by next
           | Friday. Seems like it won't now but the risk was worth taking
           | in my book.
        
       | mdoms wrote:
       | Title is a bit misleading, there's no insider info here. This is
       | all stuff we knew from the official statements, the blog post,
       | reddit and twitter.
        
       | rmbyrro wrote:
       | Are Confluence pages and Jira tickets built like a GPT-3 300
       | Terabyte model?
       | 
       | I mean, I thought they were text.
       | 
       | 5 days to restore text?
       | 
       | They must be generated by a huge complex deep learning voodoo.
       | 
       | Atlassian is working on the bleeding edge of technology. This
       | outage is understandable...
        
         | er4hn wrote:
         | I suppose if they recover what they can and restore the rest
         | using GPT-3 that may make the process easier.
        
         | katbyte wrote:
         | images and other files can be attached to issues or embedded in
         | pages so a single instance can use a lot of storage.
        
       | napolux wrote:
       | Yeah, let's centralize the Internet (born decentralized). This is
       | what the Internet has become.
        
         | Crabber wrote:
         | How do we solve this problem? In other industries based on
         | physical products there is a big incentive to buy goods as
         | locally as possible because of reduced shipping costs, shorter
         | shipping time, no import taxes etc.
         | 
         | But with software it costs nothing to spin up new instances,
         | costs nothing to deliver half way across the world, and has no
         | delivery time. How can you convince a manager to use a software
         | solution provided by a local company when a company in a
         | completely different country 600 miles away offers similar
         | software with 5 extra features?
         | 
         | It seems like the internet is now perfectly set up to create,
         | for each software type, a single company that has a global
         | monopoly.
        
           | barneygale wrote:
           | That's OK in principle, as long as those companies function
           | like governments (i.e. they work to improve things rather
           | than turn a profit, subject to constitutions, public voting,
           | judicial review). As engineers we should embrace the
           | efficiency of scale, but it's quite clear that it can't work
           | under capitalism.
        
       | Alex3917 wrote:
       | A few years ago we didn't renew our subscription on time because
       | we got the email over Christmas break, and iirc they deleted all
       | of our data in less than two weeks. They were eventually able to
       | manually restore it from backups, but they restored it
       | incorrectly so there was a bunch of stuff broken. This whole
       | thing isn't even remotely surprising to me.
        
         | a2800276 wrote:
         | You can sleep soundly: it seems like they back _everything_ up:
         | 
         | > Second, the script we used provided both the "mark for
         | deletion" capability ... (where recoverability is desirable),
         | and the "permanently delete" capability that is required to
         | permanently remove data _when required for compliance reasons_.
         | The script was executed with the wrong execution mode and the
         | wrong list of IDs. The result was that sites for approximately
         | 400 customers were improperly deleted.
         | 
         | > To recover from this incident, our global engineering team
         | has implemented a methodical process for restoring our
         | impacted customers.
         | 
         | [https://www.atlassian.com/engineering/april-2022-outage-
         | upda...]
         | 
         | Anyone else find it disturbing that they are able to restore
         | data that they deleted permanently for "compliance" reasons? If
         | this is true, how were they ever compliant? I guess data is
         | only permanently deleted when the engineering team is following
         | their typical, non-methodical process...
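
The two execution modes described in the quoted statement can be pictured
with a small Python sketch. This is only an illustration under stated
assumptions, not Atlassian's script: `Mode`, `run_delete`, `restore`, the
dict-based `store`/`trash`, and the `confirm_permanent_count` guard are all
hypothetical names introduced here.

```python
from enum import Enum


class Mode(Enum):
    MARK_FOR_DELETION = "mark"       # recoverable: record is moved to a trash area
    PERMANENT_DELETE = "permanent"   # not recoverable from the live store


def run_delete(store, trash, ids, mode, confirm_permanent_count=None):
    """Delete `ids` from `store` using the given mode.

    Hypothetical guard: a permanent delete only proceeds when the caller
    restates how many records they expect to destroy.
    """
    ids = list(ids)
    missing = [i for i in ids if i not in store]
    if missing:
        raise KeyError(f"unknown ids: {missing}")
    if mode is Mode.PERMANENT_DELETE and confirm_permanent_count != len(ids):
        raise ValueError("permanent delete requires confirm_permanent_count == len(ids)")
    for i in ids:
        record = store.pop(i)
        if mode is Mode.MARK_FOR_DELETION:
            trash[i] = record  # kept so the deletion can be undone


def restore(store, trash, ids):
    """Undo a mark-for-deletion by moving records back into the live store."""
    for i in ids:
        store[i] = trash.pop(i)
```

With the wrong mode, nothing lands in `trash`, so the only way back is a
separate backup, which matches the situation the thread is discussing.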
        
           | notatoad wrote:
           | No, I don't think that's disturbing. That's the point of
           | backups - even when something is permanently and completely
           | erased in the production database, it's still in the backup.
           | Eventually it will get rotated out as the backups expire.
           | 
           | Going back and purging things from the backups as part of the
           | delete process would be overdoing it to a ridiculous degree.
        
             | Delitio wrote:
             | Nope it's not ridiculous. If you are only allowed to store
             | data for x months that's it.
             | 
             | It's your job to use techniques which allow you to do this
             | like using encryption on your backup and deleting the keys
             | for it, for example.
        
             | usefulcat wrote:
             | > Going back and purging things from the backups as part of
             | the delete process would be overdoing it to a ridiculous
             | degree.
             | 
             | Also, modifying backups is a great way to inadvertently
             | hose your backups.
        
             | yebyen wrote:
             | I think that depends on what you mean by compliance. Some
             | regulations require you to irreversibly destroy data when
             | they prescribe the destruction of that data.
             | 
             | That can mean as much as "you have to encrypt everything
             | with a separate key, so that you can destroy the key for
             | the given (say, personally identifiable) dataset, making it
             | irrecoverable"
             | 
             | I'm not saying that's the particular compliance reason they
             | had here, or that the analysis you're giving is wrong,
             | either. There is an interpretation where either of these
             | ideas could be the correct one.
        
               | a2800276 wrote:
               | "permanently delete" strongly suggests to me that it was
               | the "medical and financial data" kind of compliance. If
               | data can be restored, it's not permanently deleted. But
               | this was a statement from the CEO, so words can have
               | arbitrary meaning :)
        
               | notatoad wrote:
               | "permanently delete" does not mean the same thing as
               | "immediately delete". deleting from the live database is
               | the first step of a permanent deletion, as long as the
               | data exists somewhere the deletion process is still in-
               | progress.
               | 
               | there's a whole lot of people in here who are way too
               | quick to assume that just because one part of a permanent
               | deletion process was inadvertently triggered and then
               | caught while they still had backups, their whole
               | permanent deletion process is a lie.
        
               | voxic11 wrote:
               | https://ico.org.uk/for-organisations/guide-to-data-
               | protectio...
               | 
               | You seem to be right-ish. While the GDPR in certain
               | circumstances allows you to keep backups of data that
               | should have been deleted, it seems like they are trying to
               | discourage it in the future.
               | 
               | > ...It is, however, important to note that where data
               | put beyond use is still held it might need to be provided
               | in response to a court order. Therefore data controllers
               | should work towards technical solutions to prevent
               | deletion problems recurring in the future.
        
               | jacobsenscott wrote:
               | A better way to do this sort of thing is not an actual
               | "delete", but a "cryptographic delete". The data should
               | be encrypted, and you just delete the key. The data is
               | then unrecoverable everywhere, including backups. Of
               | course you probably don't want to just nuke the key, but
               | disable it for some period of time, and then nuke it.
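
A minimal Python sketch of the "cryptographic delete" idea above, using only
the standard library. Assumptions to flag: the cipher is a toy SHA-256
counter-mode keystream standing in for a real authenticated cipher (a
production system would use a vetted library and a managed key store), and
`TenantVault`, `disable_key`, `enable_key` and `destroy_key` are hypothetical
names.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter-mode keystream (toy cipher, illustration only)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class TenantVault:
    """Per-tenant keys live here; only the ciphertext in `blobs` is backed up."""

    def __init__(self):
        self._keys = {}        # tenant_id -> key (assumed NOT part of any backup)
        self._disabled = set() # keys that are temporarily unusable
        self.blobs = {}        # tenant_id -> (nonce, ciphertext)

    def store(self, tenant_id, plaintext: bytes):
        key = self._keys.setdefault(tenant_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        self.blobs[tenant_id] = (nonce, _keystream_xor(key, nonce, plaintext))

    def read(self, tenant_id, blobs=None):
        """Decrypt from live blobs, or from a backup copy passed as `blobs`."""
        if tenant_id in self._disabled or tenant_id not in self._keys:
            raise KeyError(f"no usable key for {tenant_id}")
        source = self.blobs if blobs is None else blobs
        nonce, ciphertext = source[tenant_id]
        return _keystream_xor(self._keys[tenant_id], nonce, ciphertext)

    def disable_key(self, tenant_id):
        self._disabled.add(tenant_id)      # reversible grace period

    def enable_key(self, tenant_id):
        self._disabled.discard(tenant_id)  # undo an accidental deletion

    def destroy_key(self, tenant_id):
        self._keys.pop(tenant_id, None)    # after this, backups are unreadable too
        self._disabled.discard(tenant_id)
```

The disable step models the grace period mentioned above: a mistaken deletion
can be undone until the key itself is destroyed.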
        
               | nindalf wrote:
               | This is why regulations specify that data must be
               | destroyed within a time period, typically 90 days. It
               | gives enough time for backups to rotate out.
               | 
               | If this weren't a concern, regulations would demand
               | immediate deletion of data.
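
The "backups rotate out" arithmetic can be sketched in Python. Assumptions:
a backup taken on or before the deletion date may still hold the data, every
backup expires a fixed number of days after it was taken, and
`purge_complete_date` is a hypothetical helper, not any vendor's API.

```python
from datetime import date, timedelta


def purge_complete_date(deleted_on, backup_dates, retention_days):
    """Return the date on which no backup can still contain the deleted data."""
    # Backups taken on or before the deletion date may still hold the data.
    holding = [d for d in backup_dates if d <= deleted_on]
    if not holding:
        return deleted_on
    # The data is fully gone once the newest such backup has expired.
    return max(holding) + timedelta(days=retention_days)


deleted_on = date(2022, 4, 5)
daily_backups = [date(2022, 4, day) for day in range(1, 8)]
gone_on = purge_complete_date(deleted_on, daily_backups, retention_days=30)
within_90_days = (gone_on - deleted_on).days <= 90
```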
        
           | deepspace wrote:
           | I asked the same question yesterday, and the responses were
           | food for thought.
           | 
           | If you make backups, you are, almost by definition, unable to
           | perform a full 'Compliance Delete' before the oldest backup
           | in the set has expired.
           | 
           | Compliance-based deletion, if it is offered as a service, is
           | almost always something time-based, like "we guarantee the
           | data will be deleted 7 years from now". And then that
           | deliberate deletion step is baked into the backup process.
           | 
           | So, i.m.o. at best they misrepresented the nature of the
           | compliance deletion process. It never did what it was
           | designed to do.
        
           | luhn wrote:
           | It's generally recognized that deleting data from a backup
           | would violate the integrity of the backup, so allowances are
           | made. Usually you have to make sure the data is deleted as
           | part of the restore process. For example, from CCPA:
           | 
           | > If a business stores any personal information on archived
           | or backup systems, it may delay compliance with the
           | consumer's request to delete, with respect to data stored on
           | the archived or backup system, until the archived or backup
           | system relating to that data is restored to an active system
           | or next accessed or used for a sale, disclosure, or
           | commercial purpose.
        
         | jacquesm wrote:
         | Did you continue as their customer after that?
        
           | Alex3917 wrote:
           | Nope. I exported our data after they restored the backup and
           | then we cancelled less than a month later. Like I obviously
           | understand suspending our logins, but why would you ever
           | delete someone's data when it's literally only 160 KB of
           | text? The whole thing made zero sense.
        
             | Kwpolska wrote:
             | > why would you ever delete someone's data when it's
             | literally only 160 KB of text?
             | 
             | Compliance? The contract has expired, so there's no legal
             | basis for them to keep your data?
        
               | usefulcat wrote:
               | Seems like that could be addressed with some fine print
               | in the initial agreements. "In the event that you stop
               | paying us, we may keep your data for up to N days unless
               | directed otherwise by you"--or similar.
        
               | bzxcvbn wrote:
               | Why would they bother?
        
             | herpderperator wrote:
             | I don't think people write code saying "if accountSize <
             | 160kB { skipDelete() }" - THAT would make zero sense. So,
             | the size is not relevant here. The process was likely to
             | delete data after some event occurred, or lack of event
             | occurred.
        
             | hinkley wrote:
             | Someone somewhere got a promotion sooner because they
             | lowered the slope of a line a little bit.
        
               | hallway_monitor wrote:
               | Or some overzealous engineer said hey guys let's delete
               | all data 7 days after an account is canceled. This is
               | called over optimizing.
        
               | dangrossman wrote:
               | Such a decision is just as likely to have come from the
               | legal/compliance team as an engineer. Data you no longer
               | have clear consent or a legitimate business need to store
               | is a liability, and if you operate in Europe, potentially
               | illegal to continue storing.
        
             | nemo1618 wrote:
             | After I met my now-fiancee on OkCupid, I deactivated my
             | profile, turned off notifications and forgot about it for a
             | while. A while later, I thought it'd be nice to revisit the
             | first messages we sent to each other, only to find that...
             | OkCupid had deleted both of our accounts. They didn't give
             | me any advance warning, either, because I turned off
             | notifications, remember? :^)
             | 
             | I'm still kinda salty about it. I understand why big
             | services can't retain data indefinitely, but like... it's
             | just a few KB of text, and that text happens to have a lot
             | of sentimental value. Besides, OkCupid _knows_ that I
             | deactivated my account _because I am a success story_ --
             | why not hold onto those profiles a bit longer? Or better
             | yet, how about emailing an archive of those messages
             | immediately when you click the "I'm leaving because I'm in
             | a happy relationship now" button? /rant
        
               | dangrossman wrote:
               | With GDPR, privacy regulations and data breach
               | regulations sweeping the globe, holding onto unnecessary
               | data is a huge liability. Getting rid of data you no
               | longer have clear consent to store, or which you're
               | unlikely to have a clear business need to continue
               | storing, is a sign of a good company these days.
        
               | jacquesm wrote:
               | True, but likely not _this_ kind of data.
        
               | callalex wrote:
               | Yes, this kind of data. Your OkCupid account has all
               | kinds of information about who you associate with.
        
               | cto_of_antifa wrote:
        
             | [deleted]
        
       | RomanPushkin wrote:
       | > I've never seen a product outage last this long
       | 
       | Title should be "Inside the longest outage of all time", without
       | "Atlassian" word in it
        
         | k8sToGo wrote:
         | If I remember correctly, many years ago PSN was down for
         | months.
        
       | nemothekid wrote:
       | > _Most of them said they won't leave the Atlassian stack, as
       | long as they don't lose data. This is because moving is complex
       | and they don't see a move would mitigate a risk of a cloud
       | provider going down._
       | 
       | I still don't understand the stranglehold JIRA has on some
       | clients. I can't quickly think of another SaaS product that could
       | be down for almost 2 weeks and not have most customers leave.
        
         | brimble wrote:
         | If they don't lose data, two weeks of downtime every few years
         | might be cheaper than the cost of switching. Plus, it's not
         | like you know the thing you switch to will be any better, if
         | it's another SaaS.
        
           | user22 wrote:
           | Let's say we have an announced release schedule on May 1st.
           | With the tools down, there is no way to meet that date. For a
           | 4 billion dollar company, this can make a huge difference in
           | revenue. For a public company, the stock will definitely drop
           | when it's announced the revenue goals were missed because the
           | tools were down.
           | 
           | For companies of size, the cost of tools being down for 3
           | weeks can easily be in the multi-millions of dollars.
        
             | brimble wrote:
             | Again, part of the trouble is it's hard to gain enough
             | _certainty_ that the thing you switch to--self-hosted, or
             | another service--won't be _at least_ as bad. You can look
             | at their past record, but then, when's the last time
             | Atlassian had _this_ happen? (or maybe they've been having
             | similar issues every year or two and I've just not noticed,
             | in which case, yeah, it's probably a safe bet that
             | switching to almost anything else would be an improvement)
        
         | tyingq wrote:
         | >I still don't understand the strangehold JIRA has on some
         | clients.
         | 
         | - Integrations with things like the source code repos, incident
         | management systems, confluence or other wikis, Slack, etc.
         | Moving away from Jira creates a bunch of dead links.
         | 
         | - Internal dependence on complex workflows and state transition
         | rules that are implemented in Jira.
         | 
         | - Various very customized reports that leaders depend on to
         | make decisions, despite the often dubious value and/or
         | accuracy.
        
           | femto113 wrote:
           | Many years worth of source code filled with comments like
           | 
           |     // if we don't toggle bit 7 here 10% of transactions will
           |     // fail on Thursdays
           |     // see JIRA issue BIGPROJ-12654 for detailed discussion
        
             | DannyBee wrote:
             | Having migrated bug systems for very large, very old code
             | bases before, it's pretty easy to make the URLs and links
             | like this still go to the right place.
             | 
             | This is actually the least difficult thing, I would say ;)
        
             | ReidZB wrote:
             | When we migrated away from JIRA, we scripted it such that
             | the JIRA issue numbers were recorded in the newly migrated
             | issues exactly because of things like this.
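
A rough Python sketch of the approach described above: keep the old issue key
on each migrated issue, then resolve keys found in source comments to the new
tracker. The field names `legacy_key` and `id`, the example URL, and both
helper functions are hypothetical, chosen only for illustration.

```python
import re

# Jira-style keys such as BIGPROJ-12654: project prefix, dash, number.
LEGACY_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def build_redirects(migrated_issues, new_base_url):
    """Map each recorded legacy key to the URL of the migrated issue."""
    return {
        issue["legacy_key"]: f"{new_base_url}/{issue['id']}"
        for issue in migrated_issues
    }


def resolve_references(text, redirects):
    """Find legacy keys in text; unknown keys map to None."""
    return {key: redirects.get(key) for key in LEGACY_KEY.findall(text)}


migrated = [{"id": 981, "legacy_key": "BIGPROJ-12654"}]
redirects = build_redirects(migrated, "https://tracker.example.com/issues")
comment = "// see JIRA issue BIGPROJ-12654 for detailed discussion"
resolved = resolve_references(comment, redirects)
```

The same mapping could back an HTTP redirect for old issue URLs, so links in
wikis and chat history keep working.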
        
         | dirtybirdnj wrote:
         | >I still don't understand the strangehold JIRA has on some
         | clients.
         | 
         | But it's got what plants crave...
        
         | z58 wrote:
         | I imagine most people used something like Google Sheets during
         | the downtime
        
         | chupchap wrote:
         | A lot of companies have integrations to the Atlassian suite
         | which might not be easy to shift from.
         | 
         | Secondly, there are a lot of individual competitors to Jira,
         | Confluence and Bitbucket but which competitor can offer all
         | three under a single invoice? Maybe Microsoft, can't think of
         | anyone else.
         | 
         | Also for such an extended downtime the customers are entitled
         | to a discount or a credit note which a lot of CXOs consider in
         | their decision making.
        
           | krinchan wrote:
           | We are in a similar place with Slack. We moved from HipChat
           | to Slack and that was painful enough. Then the company
           | noticed we get Teams for "free" and they tried to push us
           | over to it. But folks have so much automation (because
           | "ChatOps" is that new new) that is pushing things into Slack
           | the company eventually gave up.
        
           | judge2020 wrote:
           | > May be Microsoft,
           | 
           | Is there a Jira replacement/offering in the Microsoft 365
           | suite?
        
             | iameoin wrote:
             | Microsoft has Azure Devops Boards that is similar:
             | https://azure.microsoft.com/en-us/services/devops/boards/
        
             | ralgozino wrote:
             | GitHub / GitHub Enterprise?
        
             | muricula wrote:
             | Visual Studio Online is what it was called internally, the
             | marketing may have changed. It's okay, and is what
             | was/probably still is used at MS internally to develop
             | Windows.
        
             | lfpeb8b45ez wrote:
             | Azure DevOps is really underrated:
             | https://www.thoughtworks.com/radar/platforms/azure-devops
        
             | travellingprog wrote:
             | Never used it, but looks like Microsoft Project would fit
             | that box.
        
               | HeyLaughingBoy wrote:
               | Oh god, no. We moved from Project to Jira and life became
               | immeasurably better!
        
             | shadowronin42 wrote:
             | Azure DevOps has boards and tickets and whatnot, so
             | probably that?
        
         | encryptluks2 wrote:
         | Atlassian sells to execs and gives kickbacks. You don't want to
         | burn the company that gave you money and that you pushed
         | through although you knew they sucked.
        
         | jasd wrote:
         | Even if they don't, I imagine they will have conversations
         | internally to see what's feasible. It's just really difficult
         | for an organization to move away from a product that everyone
         | has learnt how to use. The company I work for is struggling to
         | move away from something as simple as a collaborative editor,
         | when I feel like I find no difference between the two products.
        
       | travisgriggs wrote:
       | > Atlassian is a tech company, built by engineers, building
       | products for tech professionals.
       | 
       | I am curious if anyone can provide any more insight on this
       | simplification.
       | 
       | I've worked at companies like this. Originally a core of
       | motivated creative individuals make a cool product. As the
       | business grows rapidly, Pournelle's (Iron) Law (of Bureaucracy)
       | takes over. For a variety of reasons, the very capable creators
       | depart and are replaced by less motivated/aware individuals who
       | are glad to have a job and easily compelled to do things to the
       | product that probably should not be done.
       | 
       | My guess is that while Atlassian may have originally been one of
       | those cool founder places, it has probably morphed into the more
       | incompetent version that comes with scale all too often. But I
       | don't know. Thus my question if anyone can speak to the true
       | current tech capabilities of this company.
        
       | cdjk wrote:
       | This isn't the longest outage - last time they couldn't recover
       | and recovered data from email archives.
        
       | elesbao wrote:
       | On a side note that someone else already made: it is interesting
       | to see that many companies that use JIRA also use Slack, but the
       | noise/complaints/mentions are way different compared to when
       | Slack is down. I barely saw people complaining.
        
         | caymanjim wrote:
         | I dunno about everyone else, but I'm generally frustrated and
         | feel blocked when Slack is down, and I celebrate Jira being
         | down because I've never had a pleasant experience using it.
         | Jira is bureaucracy that gets in the way of me getting things
         | done, and Slack is a critical communication path.
        
           | mountainriver wrote:
           | Yup Jira is bureaucracy incarnate. Middle managers love it
           | though
        
           | elxr wrote:
           | Same here. I actively made an effort to tell coworkers how
           | much I hate Jira. Hopefully new startups choose something
           | more sensible.
        
         | upbeat_general wrote:
         | I don't believe slack has been down as long?
         | 
         | Slack is generally much more critical than JIRA in order to
         | keep working.
        
           | adhesive_wombat wrote:
           | Is it though? You can hop onto any of a constellation of
           | other IM platforms, FOSS and not, fairly quickly for an
           | instant comms channel, even if you're missing the history.
           | Having all your issue tickets missing is something you can't
           | really deal with unless you have a very recent dump, and even
           | then you can't just fire up Bugzilla and get something
           | working without a lot of migration and administrative effort.
           | 
           | You can do without JIRA for a week or two as long as managers
           | understand and you all have a good concept of what work
           | needed doing anyway. Then it starts getting dicey unless
           | someone becomes a human JIRA to connect temporary manual bug
           | tracking systems with everyone involved.
        
             | adamc wrote:
             | We have all sorts of slack channels set up to coordinate
             | activity, so that internal customers can talk to engineers
             | easily, or engineers can engage with each other. If slack
             | goes down, we'd have to work all that out. For many days,
             | it would be a huge drag on the process, slowing down
             | interactions.
             | 
             | Other IM platforms wouldn't solve that just by existing.
             | Sure, in principle one could set up such channels
             | elsewhere, but that takes time, and the communication about
             | it takes considerably more time.
        
               | adhesive_wombat wrote:
               | Sounds like having a fallback pre-defined would be
               | prudent if it's that important and you don't feel you
               | could collectively extemporise something. "If Slack goes
               | down, the plan is to use WhatsApp/Teams/Jeff's Matrix
               | homeserver in his garage until service comes back. A list
               | of group channels will be emailed if that happens."
               | 
               | Then if it does go down, you don't have to waste the
               | first day arguing about the plan.
        
       | ineedasername wrote:
       | Something to consider is that Jira can require a great deal of
       | configuration to tailor it to your needs. If you already have a
       | DevOps team of some capacity (not everyone does) then it may only
       | be a small incremental increase to run things on prem. I did it
       | myself: I'm very much not a DevOps person, mostly unfamiliar with
       | optimizing JVM parameters for apps like this, but it still only
       | took me about 5 hours to get things running stable, and then
       | another 2 hours or so a few weeks later to tweak things like heap
       | size to help things go a bit faster (though it was still somewhat
       | slow)
       | 
       | To be completely open though I don't know how much DevOps overhead
       | is involved in maintenance or feature updates. I hated the app
       | and used it for less than a year so I didn't have much exposure.
       | I guess my point though is simply that you may not need to use
       | their SaaS option if you have a decent DevOps team already. After
       | the initial setup time I doubt I spent more than half an hour a
       | month managing the internals and updates.
       | 
       | I _did_ spend more than that on configuring the system for use,
       | which you'll need to do regardless.
        
         | thesh4d0w wrote:
         | Atlassian has EOL'ed their non-cloud products
         | 
         | https://www.atlassian.com/migration/assess/journey-to-cloud
        
           | originalvichy wrote:
           | I have had to correct this too many times already. Server is
           | the name of the deployment type of their on-prem. It means
           | single node non-clustered. Data center is their deployment
           | that supports clustering to multiple nodes (and used to
           | support a few extra features). They are retiring the Server
           | deployment type licenses and pushing everyone to data center
           | or cloud. So no, they aren't EOLing their on-prem.
        
             | bombcar wrote:
             | The cost of server was lower and fixed (buy it once), the
             | cost of datacenter is MUCH higher (minimum 500 users, pay
             | per year).
             | 
             | Which is even more amusing when you realize Server has been
             | Datacenter with a fake mustache for years now.
        
               | tedivm wrote:
               | The datacenter product also seems geared towards people
               | reselling Atlassian stacks. For example there's a company
               | that offers HIPAA compliant Confluence (complete with
               | signing a BAA, so you can actually store PHI on it). It
               | doesn't seem like a great replacement for the server
               | version.
        
             | dzikimarian wrote:
             | Our instance is half the size of minimal Data center
             | license. For us and for many customers this is effectively
             | EOL.
        
         | tyingq wrote:
         | Their on-prem options are being reduced down to one product
         | with pretty high minimum spend numbers:
         | 
         | - 500 users (Jira Software, Confluence, Crowd)
         | 
         | - 50 agents (Jira Service Management)
         | 
         | - 25 users (Bitbucket)
         | 
         | https://www.atlassian.com/migration/assess/journey-to-cloud
         | 
         | https://www.atlassian.com/migration/assess/compare-cloud-dat...
        
         | k8sToGo wrote:
         | I am _Dev_ Ops, and not Ops. So I try to not waste time with
         | self hosting as much as possible.
        
       | h2odragon wrote:
       | > However, if they [restore backups], while the impacted ~400
       | companies would get back all their data, everyone else would lose
       | all data committed since that point
       | 
       | OK, so you restore backups to a separate system, and selectively
       | copy the stomped accounts data back to production. Simple
       | concepts aren't that simple at their scale, sure, but I suspect
       | this is skimping details on some truly horrendous monolithic
       | architecture choices that they're trying to hide.
       | 
       | Not that I ever thought using their products was a good idea; to
       | be clear about my position... But at this point anyone continuing
       | to rely on them for anything is asking for the suffering they'll
       | get. Signing up for their crap for a vital business function is
       | like offering your tonker to a snapping turtle.
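The "restore to a separate system, then copy back only the affected accounts" idea in the comment above can be illustrated with a toy model. This is only a sketch under strong assumptions: each tenant's data is a self-contained entry keyed by a hypothetical tenant id, with no shared tables or cross-tenant references. Nothing here reflects Atlassian's actual architecture, and the names (`selective_restore`, `acme`, `globex`) are invented for illustration.

```python
from copy import deepcopy


def selective_restore(production, backup, affected_tenants):
    """Return a new production state where only the affected tenants
    are taken from the backup; every other tenant keeps its current data."""
    restored = deepcopy(production)
    for tenant in affected_tenants:
        if tenant in backup:
            restored[tenant] = deepcopy(backup[tenant])
    return restored


# Hypothetical data: tenant id -> list of ticket ids.
backup = {"acme": ["A-1", "A-2"], "globex": ["G-1"]}
# "acme" was wiped after the backup; "globex" kept working and added G-2.
production = {"globex": ["G-1", "G-2"]}

result = selective_restore(production, backup, affected_tenants={"acme"})
print(result)
```

In this model the unaffected tenant keeps the ticket it created after the backup was taken. The hard part the comment hints at is everything the toy model leaves out: data that is not cleanly partitioned per tenant.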
        
         | throwaway894345 wrote:
         | I would really like to understand who makes the decision to
         | purchase JIRA. It's like the C++ of ticketing software--it does
         | everything because no one wanted to sit down and think
         | critically about the use cases and instead decided it would be
         | easier to say "yes" to every single feature request. It
         | definitely feels like whoever is buying JIRA is not on the team
         | who is using it (maybe IT or finance) because it ticks their
         | boxes and it has such a huge list of features that _nominally_
         | it appears to tick the product development boxes (ignoring more
         | subjective concerns like "quality", "performance", and
         | "usability").
         | 
         | I would really like to try working in an organization that uses
         | something simpler, like Trello (although now that this is also
         | an Atlassian property, maybe not exactly Trello?).
        
           | robertlagrant wrote:
           | The reason to buy Jira is that loads of stuff integrates with
           | it, and lots of people know it. Maybe not perfect, but that's
           | why. And unless you're in it all the time, which some people
           | may be, its ergonomics are not as important as, say, an
           | IDE's.
        
           | tenacious_tuna wrote:
           | My relatively small team at a massive enterprise built all
           | our report generation tools around JIRA for an entire class
           | of offerings. It's been easier for them to justify continuing
           | to pay for JIRA and keep it propped up than to develop (or
           | migrate to) a new solution.
           | 
           | As the lone dev on the team I've been continually astounded
           | by my leadership's willingness to commit more and more to
           | tech debt laden paths. The notion that _all software_
           | requires maintenance is anathema to them, and it's led us to
           | be 'cornered' into decisions re: what software we can use /
           | where we can invest our discretionary funding.
           | 
           | Moreover, we're constrained by the parent mega-enterprise's
           | software purchase policies; JIRA's already approved (and run
           | elsewhere in the enterprise), whereas off-the-shelf or SaaSy
           | alternatives are significantly harder to get buy-in for. (No
           | using corporate cards for SaaS, all purchases need to go
           | through the quote/purchase-order process, etc).
        
           | Cd00d wrote:
           | Interesting take.
           | 
           | Personally, I like JIRA. I think it adds a ton of
           | transparency in our org, and while I've used Trello for
           | personal and home projects, I don't see how it's good enough
           | for business. Trello doesn't even allow for time estimates
           | (last I tried), which for us is part of planning. Search in
           | JIRA is also really good, so no ticket is ever just lost to
           | the ether.
           | 
           | Sure, it's not perfect, and waiting for a board to load is
           | annoying, but for distributed work and visibility, I haven't
           | seen something as professionally useful.
           | 
           | Open to exploring though.
        
             | stuff4ben wrote:
             | GitHub Enterprise and ZenHub Enterprise work well for us
             | here @IBM, not that I speak on behalf of them, just a drone
             | doing work.
        
               | Cd00d wrote:
               | ZenHub looks really interesting - thank you for pointing
               | it out.
               | 
               | How good is ticket search? I have to be honest, JQL is
               | the superpower that makes or breaks for me.
        
           | x0x0 wrote:
           | I made the decision, unfortunately. The rationale was
           | literally that I hated pivotal tracker -- what a garbage app
           | that is -- and I'd heard of jira, needed something to track
           | bugs / work items, and signed up. It crucially had a zendesk
           | -> jira sync, so all our zendesk requests could end up in
           | jira.
           | 
           | In the beginning, with me plus 2 engineers, I noticed it was
           | slow but since I used it for 20 minutes a week, that didn't
           | really matter. By the time I started using it for an hour a
           | day, we had 10 engineers on 2 teams using it. I got to see a
           | friend using linear, and I had some spare time that I was
           | going to use to switch, but I couldn't get in the beta. By
           | the time they let me in, the opportunity was over and I was
           | too busy.
        
           | BolexNOLA wrote:
           | I really, really like Trello and am dreading the day when
           | atlassian starts tinkering with it in any real capacity. As a
           | content creator, it is the first workflow system I've ever
           | seen that I can effectively share with my client. It's so
           | simple and streamlined and the fact that I've stuck with it
           | despite my ADHD says a lot.
           | 
           | Clients add their notes to the card, I check the boxes as I
           | hit the notes, and I move the card further right as we enter
           | different stages of the post production process. We then have
           | a column of every completed project, which is incredibly easy
           | to sift through if we need to revisit something. It's
           | literally left to right in the workflow, it visually is
           | telling me where we are at all times.
           | 
           | It's incredibly simple and elegant. For fast turnaround,
           | relatively stripped down content (like podcasts) there is
           | nothing like it.
        
           | systemvoltage wrote:
           | What's wrong with C++? Seems unfair to compare it with JIRA.
        
             | throwaway894345 wrote:
             | I was a C++ programmer in a past life and I sorta like it.
             | C++ and JIRA seem to have the same philosophy with respect
             | to choosing which features to admit: "yes". The idea is
             | that by supporting the largest number of features possible,
             | they'll surely build something that everyone likes because
             | it will tick everyone's boxes. What people frequently fail
             | to realize is that the absence of misfeatures or redundant
             | features is an important feature in and of itself.
             | Moreover, the more features you support, the harder it is
             | to control for quality.
        
               | hhmc wrote:
               | > The idea is that by supporting the largest number of
               | features possible, they'll surely build something that
               | everyone likes because it will tick everyone's boxes.
               | 
               | The idea that the C++ committee are unthinking people
               | pleasers is patently false.
               | 
               | C++ does have a lot of cruft, but mostly because it aims
               | to: i) support new features ii) maintain pretty strong
               | backward compatibility guarantees
               | 
               | In general the new features are actually pretty well
               | liked, but in conjunction with (ii) it creates a big
               | language. There's a reasonably decent subset that can be
               | carved out, but it's also clear why newcomers without
               | legacy baggage (e.g. rust) are making inroads.
        
               | throwaway894345 wrote:
               | "unthinking people pleasers" isn't how I would
               | characterize it; rather, I think of it more as a "kitchen
               | sink" or "more is more" philosophy rather than a "less is
               | more" philosophy. I'm sure the committee deliberated
               | extensively, but deliberation within their particular
               | philosophical context still produced an unpleasant
               | result. I think the same is true of JIRA.
               | 
               | EDIT: clarified wording a bit.
        
             | aaaaaaaaaaab wrote:
             | It's a HackerNews meme from people who never bothered to
             | properly learn C++ and are angry that it's not
             | JavaScript/Ruby/Rust/whatever.
        
               | throwaway894345 wrote:
               | I refer you to my sibling comment:
               | https://news.ycombinator.com/item?id=31017079
        
             | uuyi wrote:
             | Ex C++ dev and ex JIRA admin. They are the same class of
             | complete bananas.
        
           | antiterra wrote:
           | In a lot of ways, JIRA disrupted Remedy Action Request
           | System, which had a painful transition from X to Windows
           | client. Remedy was even more admin dependent and unwieldy.
        
           | fnord123 wrote:
           | > like Trello
           | 
           | Maybe Asana or Monday would work for you.
        
           | spookthesunset wrote:
           | I find it helpful to stop thinking of JIRA as a bug tracker
           | or anything like that. In my opinion JIRA is more of a way to
           | create and track workflows. It can be used as a blank slate
           | for quite a lot of things (which I cannot come up with any
           | examples for at the moment!)
           | 
           | That being said, because it can do anything, it doesn't take
           | much effort to make a workflow as painful as possible.
           | Somebody with the "right" mind might make all kinds of
           | checkpoints in a workflow, which makes a lot of operations a
           | pain in the ass because you wind up hopping through a bunch
           | of steps. Pretty sure in our org we just make our workflow
           | "you can hop from any state to any other state"--basically a
           | free-for-all.
           | 
           | Dunno my point, but there you go!
        
           | RandallBrown wrote:
           | I think people buy JIRA because you can set it up however you
           | want. I've seen it almost as simple as Trello and much more
           | complicated. It doesn't have to be terrible, it just usually
           | is.
           | 
           | If JIRA didn't allow you to make it terrible, it wouldn't
           | allow for some of the absurd things that people want it for
           | and those companies might not buy it.
        
             | a4isms wrote:
             | They used to say of Microsoft Word, "Nobody uses more than
             | 5% of its features, but every company uses a different 5%."
             | 
             | The saying is apocryphal and unlikely to be accurate, but
             | the shape of the thing it's describing applies to almost
             | every piece of enterprise software whether installed on-
             | prem or SaaS.
             | 
             | And as another comment points out, at Enterprise scale you
             | can substitute "team" or "group" for customer. Every team
             | might use a different 5%, and unless you standardize their
             | processes, you have to buy the product that can accommodate
             | all of their needs.
        
               | grog454 wrote:
               | >"Nobody uses more than 5% of its features, but every
               | company uses a different 5%."
               | 
               | >The saying is apocryphal and unlikely to be accurate
               | 
               | Well it's mathematically impossible to be accurate as soon
               | as you have > 20 users.
        
               | shukantpal wrote:
               | False. If you have 100 features, there are nCr(100, 5)
               | combinations of 5% features = 75287520.
        
               | a4isms wrote:
               | If it's no more than 5% of the features, it's actually
               | n-choose-k(100,5) + n-choose-k(100,4) + n-choose-k(100,3)
               | + n-choose-k(100,2) + 100!
               | 
               |       75,287,520
               |     +  3,921,225
               |     +    161,700
               |     +      4,950
               |     +        100
               |     ------------
               |       79,375,495
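The counts quoted in the two comments above are easy to check. The sketch below assumes the same hypothetical 100-feature product used in the thread and counts the non-empty feature subsets of size at most 5 with the standard library's `math.comb`.

```python
from math import comb

n_features = 100  # hypothetical feature count used in the thread
max_used = 5      # 5% of 100 features

# Subsets of exactly 5 features.
exactly_five = comb(n_features, max_used)

# Non-empty subsets of 1 to 5 features ("no more than 5%").
at_most_five = sum(comb(n_features, k) for k in range(1, max_used + 1))

print(exactly_five)  # 75287520
print(at_most_five)  # 79375495
```

Either way, the number of distinct "5%" selections is far larger than 20, so the "more than 20 users" objection only holds if the selections have to be disjoint.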
        
               | moonbooth wrote:
               | Only if you assume the 5% of features to be a contiguous
               | block each time.
               | 
               | However, if we assume there are, say, 100 features in
               | Word (the real number is likely much higher), the number
               | of combinations is orders of magnitude higher than 20.
        
               | [deleted]
        
               | robertlagrant wrote:
               | A better counterexample is that one user could use all
               | features.
               | 
               | But your statement doesn't make sense; there might be
               | millions of features, and trillions of ways to combine
               | them to make 5%.
        
               | KronisLV wrote:
               | > Well its mathematically impossible to be accurate as
               | soon as you have > 20 users.
               | 
               | It's probably in the semantics.
               | 
               | Text input and editing is clearly a part of functionality
               | that's probably used by everyone (or at least most
               | users), so it's not possible for "different 5%" to mean
               | what you're alluding to, maybe the phrasing needs work.
               | 
               | In any given 5% there might be 1-4% of overlap with what
               | others are using and the remainder of that is specific to
               | the company.
        
               | grog454 wrote:
               | And the greater the degree of overlap the weaker the
               | implicit argument.
               | 
               | If it's a uniform distribution of discrete features then
               | each feature is equally "important" and worth equal
               | resources and dev time. If 81/100 companies use the exact
               | same 5% of features and the remaining 19 cover the
               | remaining 95%, then all else equal you can probably drop
               | 95% of your features and still do well.
        
               | a4isms wrote:
               | The dynamics of the Enterprise market are such that there
               | are features where having just one customer that will
               | make a buy/no-buy decision based on just one feature will
               | deliver enough incremental ARR to justify the opportunity
               | cost of doing that feature instead of a bunch of others.
               | 
               | Typically you do the most popular features first, but
               | most Enterprise vendors end up working on a long tail of
               | niche features that nevertheless are profitable.
               | 
               | There's a long conversation to be had about how this ends
               | up being a trap where Enterprise software gets bloated
               | and shitty and eventually gets disrupted by a small
               | vendor that does "less," but in a powerful,
               | transformative way that obsoletes the Enterprise
               | "standard," which leads us back to discussing Atlassian
               | :-)
               | 
               | They're a good example of this dynamic, because they have
               | a "constellation" of products to sell. So if they build a
               | niche feature that gets a new customer to buy Jira seats,
               | having "landed" in the account, their salespeople can
               | "expand" by selling OpsGenie and other related products
               | very profitably.
        
           | karaterobot wrote:
           | The way it works is, someone always says "Sure, JIRA is bad
           | out of the box, but you can customize it to work the way you
           | want" and there is nobody around to say "so now you have two
           | problems: a bad system that depends on having an expert to
           | make it work the way it should".
           | 
           | Then, you pay for JIRA, and that expert customizes it the way
           | _they_ like. It still doesn't work very well for most
           | people. Nobody likes it except one stakeholder, and the
           | engineering lead who acts as an admin on it. A while later,
           | those people have left the company, and everyone else is out
           | of luck.
           | 
           | Seen this exact scenario play out at two different companies
           | now. Am witnessing it play out in real time at a third.
        
           | sam0x17 wrote:
           | And yet, it actually is set up in an extremely opinionated
           | annoying way. For example there is no way to actually assign
           | multiple users to the same ticket, which is a big problem if
           | your org legitimately does pair programming (mine does for
           | juniors)
        
             | robertlagrant wrote:
             | Having a single owner for each ticket is not a bad idea.
             | You can see contributors in git.
        
             | Cd00d wrote:
             | Why not just clone the ticket?
        
           | BlargMcLarg wrote:
           | Trickle down and first mover. JIRA was there first being
           | "decently ok", enough people adopted it and now others do the
           | same. Then couple that with what you write, the people in
           | charge of deciding the software are generally the ones who
           | can justify wasting half their day on it.
           | 
           | To this day I still don't know what JIRA does so much better
           | that other products don't which big corps are willing to
           | waste months worth of manhours over. Its biggest selling
           | point is integration with the remainder of the Atlassian
           | stack, not exactly known for being great either.
        
             | vikingerik wrote:
             | Jira's big feature is being widely known. It's the modern
             | version of "nobody ever got fired for buying IBM."
        
           | csours wrote:
           | "I am not a fish" - the people who buy it are not the people
           | who use it. -
           | https://www.ted.com/talks/seth_godin_this_is_broken
        
           | prescriptivist wrote:
           | This reminds me of one of my favorite HN comments of all
           | time: https://news.ycombinator.com/item?id=16424423
        
           | dangus wrote:
           | The answer is medium to large companies. Jira is a tool that
           | can satisfy hundreds of different teams' work management
           | needs without having to buy dozens of different products.
           | 
           | The fact that it's so feature packed and customizable is the
           | point.
           | 
           | I think the complainers are not really investing the time to
           | change project settings to fit their needs.
           | 
           | My only complaint about the Atlassian suite is the
           | performance of Jira and Confluence. The overall page load
           | speed is too slow.
        
             | matwood wrote:
             | I agree. I look at every JIRA killer and think we could
             | maybe move and nope...they're missing something we use. In
             | many ways JIRA is like Excel. On the surface it can appear
             | easy to replicate for a single user, then you realize every
             | user uses 10 different features.
        
             | macintux wrote:
             | How do you change the markup language to be consistent
             | between Jira and Confluence?
             | 
             | How do you eliminate all non-task ticket types in a Jira
             | board and allow any ticket to be a child of any other
             | ticket?
             | 
             | It's hard to configure away complexity from a product if
             | it's designed to be complicated.
        
               | Karunamon wrote:
               | Re 1: I'm not sure why that's a necessity beyond a notion
               | of consistency. I find that major wiki editors are not
               | often major ticket creators, and these are different
               | products with different audiences at the end of the day.
               | Also, Confluence uses a WYSIWYG editor, so it's rare to
               | need to think about the markup.
               | 
               | Re 2: Set the project's issue type scheme to one that
               | only allows tasks and subtasks. That gets you one level
               | of nesting. (And even though task and subtasks are
               | different issue types, changing from one to the other is
               | trivial since they have identical fields.) Allowing epics
               | gets you another at the top level. That's a bit limited,
               | but wouldn't arbitrary nesting be even more complex?
        
               | DocTomoe wrote:
               | > allow any ticket to be a child of any other ticket?
               | 
               | I have no idea why you would want this from a work
               | management point of view, but you can just use issue
               | linking to describe a parent <-> child relationship.
        
               | chrisseaton wrote:
               | > How do you change the markup language to be consistent
               | between Jira and Confluence?
               | 
               | This here is the single most insane thing about
               | Atlassian.
        
           | anecd0te wrote:
           | ime people pick Jira because they've used Jira and have been
           | promoted via the peter principle to the level at which they
           | make purchasing decisions.
        
             | brimble wrote:
             | IBM effect. If you don't care _a whole lot_ about your
             | ticketing system, you just pick Jira because everyone'll
             | nod along with the choice and you won't personally be
             | blamed if/when it sucks, you won't make enemies or have to
             | argue over the choice because it can't do something that
             | someone else in the org "needs" it to do, etc.
        
           | SatvikBeri wrote:
           | At big companies I've worked at, the justification was that
           | JIRA was the only one that met all the regulatory/compliance
           | requirements. I don't know if this is actually true, but
           | smaller companies certainly don't market compliance as well.
        
           | JoBrad wrote:
           | I was part of the decision to purchase Atlassian tools at my
           | company. We had been using a variety of self-hosted and SaaS
           | tools which had varying abilities to integrate with each
           | other. We've had very positive feedback from users since
           | switching to them. We were also able to move some of our help
           | desks to JIRA Service Management, and away from another self-
           | hosted product which is still used by a good portion of our
           | business. The self-hosted product is honestly a nightmare to
           | maintain and keep secure. According to the vendor, the "fix"
           | is to have 1-2 people dedicated to that product, which simply
           | isn't something that my team has the bandwidth or will to do.
           | 
           | JIRA does try to be all things to all people...and mostly
           | succeeds. For instance, we use the same workflow and mostly
           | the same nomenclature across our development and helpdesk
           | teams. Some of our software projects use Kanban-style
           | workflows, while others use sprints, but we can keep track of
           | a project across multiple teams using the same tools. I'm
           | sure other products also offer this, but we liked the
           | integration and overall capability for the price.
           | 
           | There are definitely issues: some feature requests and bugs
           | have languished in their backlog for years. But you can get
           | started very quickly and we've had great feedback from users.
        
           | throwawayboise wrote:
           | The one place I worked that used Jira was a small-but-not-
           | tiny company (about 15 devs at the time). The only people who
           | actually used Jira were the managers. Developers got printed
           | stories. These were used for planning, and were printed on
           | cards and taped to a white board when ready. Developer would
           | pull a card to work on, and return it to the manager when it
           | was complete. The manager did all the status updates and
           | reporting to upper management.
           | 
           | IDK if this was to cheap out on the licensing with a minimal
           | number of users, or if it was to insulate the developers from
           | the experience of using Jira. Perhaps some of both.
           | 
           | Clearly that usage pattern would only scale so far.
        
           | notreallyserio wrote:
           | JIRA is generally fine software that is good enough for most
           | folks, especially if you're willing to adapt your workflow to
           | it. Where it goes wrong is where tools like Jenkins go wrong:
           | folks add too much customization.
           | 
           | That means the tool is often the wrong one for the job, but
           | instead of picking something that's a better match out of the
           | box folks stick with the easy choice (extend what they have).
        
           | closeparen wrote:
           | JIRA is a framework for making assembly lines out of
           | knowledge workers. When you're a middle manager at a decent
           | sized company, a major problem you face is that the mass of
           | knowledge workers beneath you are _opaque_: you have no way
           | of knowing whether they're working or not. Another problem
           | you face is that they're _uppity_: people who went to
           | college and got used to managing their own time now have all
           | kinds of idiosyncratic ideas about how to manage their own
           | time and arrange their own working lives. Since you are a
           | middle manager you despise local differences. Since you are a
           | manager you're pretty sure that only you and your lieutenants
           | can be trusted with this kind of decision making power.
           | Adopting JIRA is a powerful lever to put people back in their
           | place as work item churning machines. Constraints such as
           | only certain people can create or assign tickets, only
           | certain people can mark them completed, only certain states
           | are valid transitions from other states, etc. implement a
           | level of domination over white-collar workforces that
           | managers would be otherwise uncomfortable asserting face to
           | face.
           | 
           | Other ticketing systems do not work nearly as well for this
           | purpose because they are designed mainly as external brains
           | or communication platforms for workers, and they assume a
           | level of worker autonomy in moving tasks through their
           | lifecycle. In Trello you cannot make it so that a PM has to
           | sign off before a card is moved to the in-progress column, or
           | that only in-progress cards can have code reviews associated
           | with them. JIRA eats these kinds of requirements for
           | breakfast.
           | 
           | EDIT: This is not to say you _can't_ use JIRA in a workflow-
           | neutral way, or that everyone uses it for this reason, but I
           | would submit that it's JIRA's differentiated advantage.
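
The constraints described in this comment (a fixed set of states, with each transition gated by role) amount to a small state machine. A minimal sketch, assuming hypothetical state names, roles, and a `move` helper; none of this is Jira's actual API or configuration format:

```python
# Hypothetical workflow: which transitions exist, and which roles may perform each.
ALLOWED = {
    ("todo", "in_progress"): {"pm"},                 # a PM must sign off to start work
    ("in_progress", "in_review"): {"engineer", "pm"},
    ("in_review", "in_progress"): {"engineer", "pm"},
    ("in_review", "done"): {"pm"},                   # only a PM may mark work completed
}


def move(ticket: dict, new_state: str, role: str) -> dict:
    """Return a copy of the ticket in new_state, or raise if the rules forbid it."""
    key = (ticket["state"], new_state)
    if key not in ALLOWED:
        raise ValueError(f"no transition {key[0]} -> {key[1]}")
    if role not in ALLOWED[key]:
        raise PermissionError(f"role {role!r} may not move {key[0]} -> {key[1]}")
    return {**ticket, "state": new_state}
```

A workflow-neutral setup would be the same table with every role allowed on every pair of states, which is roughly the autonomy the comment says lighter tools assume.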
        
             | TheRealDunkirk wrote:
             | Even worse, companies with the resources to buy JIRA will
             | probably hire consultants to set it up, and you wind up
             | with a system 1) bought by people who don't understand how
             | programmers work, 2) configured by people who don't know
             | how your company works. So end users usually wind up with a
             | terrible system that continually generates complaints
             | (along MANY axes), and the people responsible for foisting
             | it on them think they're just being difficult.
        
             | mikepurvis wrote:
             | So I would say that this assessment is on the whole, kind
             | of cynical, however I suppose I have the interesting
             | position of being in an organization where I feel like I
             | actually see _both_ JIRAs.
             | 
             | One JIRA is the project that's used for development of the
             | core product, where there are no constraints-- anyone can
             | add a comment, create links, change assignee, add new tags,
             | push the tickets through whatever state transitions they
             | want, and so on. It works, though it is a little chaotic
             | sometimes as subgroups of people have different preferences
             | for how things should go (eg, for tickets requiring test
             | team validation, should the ticket assignee remain as the
             | person who did the original work so it's clear who has more
             | to do if it fails validation, or should the assignee change
             | to the test team person, so that it's clear that that's the
             | next person who has it as an action item?)
             | 
             | The second JIRA is the IT team's internal support project,
             | which is completely locked down-- no one except them can
             | close tickets or move them around, or even edit the
             | contents, closed tickets can't be commented on any more,
             | and so on. This is the one that gives me the vibes you are
             | talking about. Every time I have to interact with it, I
             | loathe it because every inch of it is transparently a
             | funnel, railroading me along a path toward one of either
             | DONE or WONTFIX. This is absolutely _efficient_, in the
             | sense of meeting the goal of closing all the tickets, but I
             | feel it introduces friction for the larger business goal of
             | actually helping people resolve their problems. To the
             | point where eventually most of the IT support activity
             | moved away from the JIRA project to an informal Slack
             | channel, which is way more accessible, but worse in
             | basically every other way: it's harder to effectively
             | search, impossible to properly link, bad for async, bad for
             | dealing with more than one thing at once, etc.
        
             | codycraven wrote:
             | It sounds like you've been hurt by some terrible
             | management practices, I'm truly sorry that some managers
             | think their job is to control their subordinates.
             | 
             | However, regarding ticketing systems, in team environments,
             | it is very effective and helpful to have a system that
             | manages the data about the work that has been completed, is
             | being worked, and is planned to be worked on.
             | 
             | Part of that system might be defining restrictive workflows
             | for some teams, not for control, but to ensure the agreed
             | upon process is followed for quality or consistency.
             | 
             | One of the many problems Jira has is that if you don't have
             | a Jira admin on your team, it's impossible to build an
             | effective and efficient workflow for your team. Coupled
             | with Jira making many things global by default (it takes a
             | lot of care to make a change that only affects specific
             | Jira projects) most configurations end up being a pile of
             | garbage automatically inherited from changes an admin (that
             | is not part of the team) made when intending to change
             | something for another specific team.
        
               | agalunar wrote:
               | Caveat: this is going to be a meta comment rather than a
               | comment about the topic proper, and so maybe not
               | appropriate for HN, but I think it's worth discussing.
               | 
               | > It sounds like you've been hurt by some terrible
               | management practices, I'm truly sorry that some managers
               | think their job is to control their subordinates.
               | 
               | When we assume someone was hurt, and imply they hold an
               | opinion only because they were hurt, we risk
               | delegitimizing their position. The interpolated message
               | we might be sending is "your experience is personal and
               | not representative of the subject at hand, and so your
               | thoughts are only applicable to your situation; so, after
               | we express our sympathy, your thoughts can be dismissed."
               | Or the message we might be sending can be patronizing:
               | "you hold your opinion for emotional, rather than
               | rational, reasons; I'm sorry that you are so
               | unfortunate."
               | 
               | To be clear, though, I'm sure this wasn't your intent,
               | and it makes me glad to see someone being compassionate
               | (i.e. that you bothered to consider the experiences and
               | feelings of the parent commenter).
               | 
               | A personal story: I was raised devoutly religious but
               | left the church in my twenties. My family and friends
               | assumed I left because I wanted to be free from guilt,
               | had been hurt by a culture that belied the doctrine, and
               | so on (and they said as much). My change of belief
               | occurred after recovering from a few years of mental
               | illness, and while it is true that I may not have left
               | _when_ I did were it not for the opportunity to reexamine
               | my beliefs (while trying to piece back the fragments of
               | my life into a sense of self), the reasons _why_ I left
               | were the result of a lot of research and thinking. It was
               | mildly frustrating when people assumed my decision was
               | made for emotional convenience, when in reality, the
               | research was uncomfortable and contemplating an
               | unfamiliar universe was scary.
               | 
               | I recognize the irony here - the issue I'm highlighting
               | in this comment may be something that only I feel is an
               | issue, born from a personal experience. But I _think_ it's
               | more common than that.
        
               | [deleted]
        
               | liamwire wrote:
               | I sincerely appreciate your articulation of this, thank
               | you for taking the time.
        
               | closeparen wrote:
               | >ensure the agreed upon process is followed for quality
               | or consistency
               | 
               | That is what I mean here by "assembly line" and
               | "control." Making sure that processes lead and
               | individuals follow.
               | 
               | Citing consistency as a terminal value in the same breath
               | as quality is also exactly what I mean by the middle-
               | manager aversion to local differences.
        
               | chousuke wrote:
               | Beyond trivial scale, you need good processes so that
               | individuals can do their jobs. If you have no processes,
               | change and development becomes _extremely_ difficult
               | because people will be hunting for documentation all the
               | time, stepping on each other's toes, and making mistakes
               | that they should not be making because they forgot a
               | trivial procedure that was a prerequisite to solving
               | their actual problem.
               | 
               | I work with a variety of different environments, and
               | depending on the environment I can either solve my
               | problem in minutes and get it deployed in another few
               | minutes _or_ solve the problem in minutes and spend hours
               | figuring out how to safely deploy it without breaking
               | everything. JIRA is terrible if you do anything that it
               | offers by default, but when used properly it can
               | absolutely help with this.
        
               | baq wrote:
               | To add to that, and perhaps educate your downvoters a
               | bit, it can be very hard to imagine why or when such
               | strict processes are helpful without having direct
               | experience with organizations of sufficient scale. It
               | literally boggles the mind but the process truly is king
               | when there are hundreds (or thousands) of individuals
               | working on a single product.
        
               | hn_go_brrrrr wrote:
               | Agreed. An essential part of blameless engineering
               | culture is "the outage isn't any one person's fault, it's
               | the fault of the tooling and processes for allowing them
               | to do that". Good processes prevent everyone from making
               | the same mistakes.
        
               | tyingq wrote:
               | >However, regarding ticketing systems, in team
               | environments, it is very effective and helpful to have a
               | system
               | 
               | I think the point is that Jira is particularly granular
               | in the way that it lets you do things with permissions,
               | workflow rules, roles, metrics, etc. There's a fair
               | number of places that use that granularity to create a
               | weird digital sweatshop.
               | 
               | Meaning the complaint is more about really deep
               | _"micromanagement as a service"_ than what you might get
               | with lighter tools.
        
               | brazzledazzle wrote:
               | Micro managers are everywhere, even in places that may
               | seem culturally incompatible. I've yet to work for a
               | business that prioritizes regularly evaluating managers
               | for their management skills. It's only addressed when
               | shit really hits the fan. Managers are primarily
               | evaluated by their own managers on deliverables. As long
               | as they're getting results and entire teams aren't
               | quitting simultaneously there's no need to question
               | anything. As long as a manager is toxic in ways that
               | don't break the law or violate major company policies any
               | attempt to address this by a direct report carries the
               | risk of termination or retribution. Does it contradict
               | your company's cultural values? Rules for thee.
               | 
               | And I wouldn't assume you're not one of them. The worst
               | cases I've run into aren't even the psychos that embrace
               | micro management as part of their "management style".
               | It's the ones that genuinely believe they aren't engaging
               | in the behavior. They're not micro-ing, they're "helping"
               | their team because they are an awesome manager and their
               | team is _almost_ awesome, they just need to be monitored
               | very carefully and given "suggestions" until they nail
               | it. But they'll never nail it. Because no one is as
               | smart, experienced or does a task "just so". They view
               | themselves as a mentor to all. All decisions must be
               | theirs to make. Jira becomes the perfect tool since the
               | team effectively becomes little boxes that accept tickets
               | or stories and return work both performed and delivered
               | as specified.
               | 
               | For any managers reading this that don't see a problem
               | with this or see some of those behaviors in yourself
               | please understand that you are sacrificing your team's
               | happiness and motivation at the altar of your own
               | insecurities. No one can grow where they're not trusted
               | and no one can improve their skills when they're never
               | given latitude to make meaningful decisions. Your people
               | will make mistakes. They will accomplish things in ways
               | that are different from how you would do them. It might
               | even be objectively worse. That's ok. That's how you grow
               | into a strong team with confident members.
        
               | mistrial9 wrote:
               | I was told by a lifetime manager turned successful
               | consultant, that roughly fifty percent of engineering
               | firms govern their engineers basically using fear.
        
               | ornornor wrote:
               | > using fear
               | 
               | Could you elaborate? What kind of fear? "You're fired"? I
               | wonder how effective it actually is because of the
               | current job market and also because I (and others) react
               | very poorly to this kind of tactic: "you want me to fear
               | getting fired? Joke's on you, please DO fire me, I dare
               | you"
        
               | KronisLV wrote:
               | > I wonder how effective it actually is because of the
               | current job market
               | 
               | Counterpoint: software developers aren't necessarily well
               | paid or highly regarded _everywhere_, since remote
               | working for companies abroad hasn't quite gotten
               | mainstream enough.
               | 
               | So it might just be effective against some people, or in
               | cases where the hiring process itself has become
               | increasingly unreasonable - the job being working on
               | boring CRUD apps but the hiring process being multiple
               | stages of Leetcode and complex interviews.
               | 
               | That's probably not applicable to everyone since plenty
               | of folk can grok Leetcode and find jobs without too much
               | trouble, but I still recall "The Unseen 99%" article:
               | https://www.hanselman.com/blog/dark-matter-developers-
               | the-un...
               | 
               | It probably applies to the industries and companies where
               | devs are treated as a cost center and since those
               | companies aren't all out of business, plenty of people
               | must be working in such environments, with sometimes sub-
               | optimal conditions.
        
               | numpad0 wrote:
               | I'm guessing it's a sort of a nerd shorthand for "various
               | means that are accompanied with self confusion of users
               | but not with strong rational or scientific or technical
               | basis"
        
               | zrail wrote:
               | The perf process is basically one big exercise in fear-
               | based control.
        
               | 52-6F-62 wrote:
               | Kanban, by design, was a tool used in production control.
               | It's one of the ways Toyota made their JIT production
               | function.
               | 
               | I worked on the line (Toyoda Iron Works) and used a real-
               | life Kanban implemented by the plant engineers. It was
               | used for quality control, to broadcast quality control
               | and station output, and was checked regularly against
               | their internal estimates and baselines and used also as a
               | gauge for employee output.
               | 
               | Control is what it's designed to do. The very fact that
               | Kanban is the tool of choice should support at least some
               | of OP's points, objectively.
        
               | [deleted]
        
               | sjtindell wrote:
               | Agreed. This is a problem of scale in my opinion. When we
               | have 10 engineers, it is easy to check in with everyone
               | and know what they are working on and get a status
               | update. When we have 500 engineers, making sure all their
               | tasks are aligning (organizations are one big race
               | condition) is not just hard but impossible without some
               | sort of tracking system. We all want to grow big. To do
               | so, your processes need to change as you add more people.
               | The exceptions (Valve, Netflix, etc.) that can handle
               | being flat or semi-flat are very unique.
        
               | biomcgary wrote:
               | Are they unique because their problem domain allows it or
               | because the leadership is uniquely ideologically driven
               | (and competent) to implement efficient, flat systems?
        
               | malermeister wrote:
               | > ensure the agreed upon process is followed for quality
               | or consistency.
               | 
               | Isn't that just a more corporate way of phrasing
               | "control"?
        
               | robertlagrant wrote:
               | Not in a negative way. You want to trust engineers to
               | always have changes built and tested before they go to
               | production, but when something egregious happens you need
               | to go back and see what went wrong. You can choose to
               | interpret that as control, but really the only
               | alternative (often cited) is "Well that shouldn't ever
               | happen, so you don't need tooling to support that
               | situation".
               | 
               | And that is not a useful way of thinking when you have
               | real engineers writing software that people depend on.
        
               | malermeister wrote:
               | I think the problem is that the processes are often not
               | _mutually agreed_, but instead dictated by middle
               | managers.
               | 
               | JIRA then becomes a tool for enforcing arbitrary rules,
               | i.e. control
        
               | robertlagrant wrote:
               | This is very likely even if engineers come up with the
               | processes, unless all process is scrapped and done from
               | scratch every time an engineer is hired.
        
             | Rimbo wrote:
             | Oh, nonsense. People buy Atlassian because the licensing is
             | cheap, not because it's particularly good at what it does
             | or designed with any particular workflow in mind.
        
               | Viliam1234 wrote:
               | Cheaper than whatever is the open-source alternative?
        
               | chaosite wrote:
               | Sure, if you host it yourself you have to pay someone to
               | admin it (usually significantly more expensive than a
               | license), and if you use a hosted solution you have to
               | pay the host.
        
               | ivan_gammel wrote:
               | Free software has zero acquisition cost, but non-zero
               | TCO, which can measure in millions USD (recurring salary
               | of dedicated IT team), depending on the size of
               | organization and complexity of the setup. You will need
               | to maintain on-premise infrastructure, automate backups
               | and recovery, automate security, automate updates
               | (including testing and rollbacks) etc etc, basically
               | doing all the jobs of the people responsible for the
               | infrastructure at the SaaS provider, but at much smaller
               | scale and not achieving the same efficiency. You will
               | have to do those jobs considerably better to justify the
               | costs.
        
               | mistrial9 wrote:
               | in thirty years of experience, I see this talking point
               | straight from Microsoft anti-Open Source days..
               | 
               | > Free software has zero acquisition cost, but non-zero
               | TCO, which can measure in millions USD
               | 
               | Often a primary driver is exactly the opposite -- for-
               | profit companies are accustomed to paying money for a
               | good or service, with a billing pattern and legal
               | obligations. The company financial deciders do not want a
               | setup that does not have a billing pattern and clear
               | legal obligations. Meanwhile, Open Source Software went
               | from niche to mission-critical in the 2000s via the
               | Internet. For-profit companies (and their publicists)
               | scrambled to explain it, and came up with that exact line
               | repeated again today. I do not blame any person for
               | saying it, it was in print in some reliable place. It
               | does not capture the reality in 2022 IMO.
        
               | ivan_gammel wrote:
               | To be honest, I do not understand your comment.
               | 
               | > The company financial deciders do not want a setup that
               | does not have a billing pattern and clear legal
               | obligations.
               | 
               | I haven't ever met a CTO or CIO who would make budget
               | decisions like that, nor do I do it this way myself. The
               | reality in 2022 is the same as it was in 2012 or in 2002:
               | when you choose a solution, you consider all long term
               | costs. In 2022 TCO for the server software includes
               | everything that I mentioned in my comment and more.
               | There's a lot of use cases for OSS in corporate
               | environment, for sure, but not every OSS solution is
               | cheap or even affordable. Running on-premise open source
               | collaboration tool is certainly not cheap if you do it
               | right.
        
               | ofrzeta wrote:
               | I don't see how it is cheap. Standard may be cheap but
               | then you are missing a lot of features that are announced
               | on the product pages with a small footnote saying "only
               | in premium".
        
             | _dark_matter_ wrote:
             | I feel you here, but I've been at multiple companies that
             | used JIRA and never once had any of those requirements.
             | I've also never seen it come up when deciding which
             | ticketing system to use. Teams have always been free to
             | move tickets at-will.
        
               | KptMarchewa wrote:
               | One very large video game studio has tons of automation
               | for Jira. Imagine someone deciding to add a new weapon. The
               | automation creates 100s of tasks for concept artists, 3d
               | artists, animators, sound artists, software developers
               | with complex dependencies between those. Most importantly,
               | automation creates multiple QA steps for each element of
               | completed work.
               | 
               | The same exists for levels, enemies, quests and tons of
               | other elements.
               | 
               | I would not be surprised if a lot of studios had similar
               | workflows.
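
The fan-out described in this comment (one request expanding into many dependent tasks, each followed by a QA step) can be sketched as a template expanded into a task graph. This is only an illustration: the template contents, the `expand` helper, and the asset name are hypothetical, and the studio's real automation is not described in the thread.

```python
from graphlib import TopologicalSorter

# Hypothetical template: each step lists the steps it depends on.
WEAPON_TEMPLATE = {
    "concept_art": [],
    "3d_model": ["concept_art"],
    "animation": ["3d_model"],
    "sound": ["concept_art"],
    "gameplay_code": ["animation", "sound"],
}


def expand(asset: str, template: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build a task graph (task -> prerequisite tasks) with a QA task per step."""
    graph: dict[str, set[str]] = {}
    for step, deps in template.items():
        task = f"{asset}/{step}"
        # A step may start only after QA has passed on each step it depends on.
        graph[task] = {f"{asset}/{dep}/qa" for dep in deps}
        graph[f"{task}/qa"] = {task}
    return graph


tasks = expand("plasma_rifle", WEAPON_TEMPLATE)
# One valid order in which the tickets could be worked.
order = list(TopologicalSorter(tasks).static_order())
```

The same template mechanism would cover levels, enemies, or quests by swapping in a different step table.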
        
               | BlargMcLarg wrote:
               | See, that is great. Automate what can logically be
               | deduced from the information available and set up
               | templates to provide that information. For developers, it
               | should be automated enough you shouldn't have to write
               | the same info twice, once in commit
               | messages/merges/branch names, once in the ticket itself.
               | If the workflow is so streamlined, all that information
               | can be deduced and the ticket can be advanced
               | automatically. Most information is available and
               | documented for other parties.
               | 
               | However, that's just not what most people go through in
               | companies using JIRA. Worse, they have to toggle between
               | pages multiple times, each taking at least a few decent
               | seconds to reload. I'd like to give JIRA the benefit of
               | the doubt here, but it sounds like the tool is just _very
               | easy_ to misconfigure and abuse.
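
One way to avoid writing the same information twice, as this comment asks, is to deduce the ticket from the branch name or commit message. A minimal sketch, assuming issue keys look like `PROJ-123` (uppercase project key, hyphen, number); the `issue_keys` helper is hypothetical, and the pattern will also match look-alikes such as `UTF-8`:

```python
import re

# Assumed key shape: uppercase project key, hyphen, issue number (e.g. "PROJ-123").
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issue_keys(*texts: str) -> list[str]:
    """Collect distinct issue keys from branch names or commit messages, in order seen."""
    seen: dict[str, None] = {}
    for text in texts:
        for key in ISSUE_KEY.findall(text):
            seen.setdefault(key, None)
    return list(seen)
```

A commit hook or CI step could pass the branch name and commit message through a function like this and then advance or annotate the matching tickets.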
        
               | robertlagrant wrote:
               | This is pretty easy with Jira. There's a GitHub plugin
               | which links PRs and commits to a ticket, and a GitHub
               | plugin that links ticket numbers back to Jira tickets.
               | 
               | And you generally do them both at a lower level than
               | tickets, certainly commits, so you don't want to have too
               | much automation between them as that starts adding
               | constraints.
        
             | theptip wrote:
             | I think you've got part of the answer here, but are selling
             | it short. Jira is the most complex task-processing rule
             | engine that is also easy enough for a small team to
             | operate, and also has the broadest set of integrated tools
             | of any offering.
             | 
             | You can use Jira as a simple Scrum board, a Kanban board,
             | or you can build enforced-process monstrosities. You can
             | build customer-support / internal-helpdesk workflows, or
             | even model internal work-item-oriented business processes,
             | etc. Now, as you point out, just because you can doesn't
             | mean you should, and many orgs fall into the trap of making
             | issue workflows overly-restrictive. But most companies (I
             | believe) choose Jira before they choose those hairy task
             | workflows. Startups with zero process use Jira.
             | 
             | Also, you can integrate it all together to give good-enough
             | dashboards/roadmaps, good-enough (for some, not me) docs
             | integrations with Confluence, Git integration with
             | Bitbucket etc. -- while there are big issues with these
             | systems, I think it would be myopic to ignore the real
             | benefits of working in one integrated stack where every
             | design doc you write has dynamically-updated labels and
             | auto-complete for each issue you type in.
             | 
             | For context, I use Jira for tasks and don't love it, found
             | Confluence to be really annoying and so I don't use it, and
             | prefer Gitlab to Bitbucket, but I think you have to
             | recognize these unique selling points. If all Jira had to
             | offer was the rule engine it would not be as widely used.
        
               | pid-1 wrote:
               | Yeah my team uses Jira to keep track of what we are doing
               | and what we need to do.
               | 
               | Each member can actually organize their sprint and create
               | tasks.
               | 
               | Point assignment is not a big deal, it's just there so we
               | avoid promising more than we can chew.
               | 
               | I've found Jira really pleasant to use for lightweight
               | processes.
        
           | [deleted]
        
           | richardw wrote:
           | I'm just a user but totally happy with all our Atlassian
           | apps. Confluence is a huge win across our multi-thousand
           | person company and the best teams use it very well. I like
           | the integration between Jira and Bitbucket. We don't over
           | complicate things and it works fine.
           | 
           | It's like my taste in wine. I don't want an overdeveloped
           | sense of taste where only a $400 bottle will do. I'm fine
           | with what we have because the work is what excites me and if
           | people are documenting projects and managing workloads and
           | committing code, we're 90% of the way there.
        
             | danielovichdk wrote:
             | Good point.
             | 
             | Wine that costs 400$ is for fun.
             | 
             | You don't drink that professionally.
        
         | spaetzleesser wrote:
         | "horrendous monolithic architecture "
         | 
         | I don't really understand what this has to do with "monolithic"
         | or not.
         | 
         | Atlassian's software is probably very complex and convoluted
         | but from my experience it's almost impossible to keep a clean
         | architecture in a software system that has grown over many
         | years and is used and customized by many customers so you have
         | to avoid breaking backwards compatibility.
        
         | JoBrad wrote:
         | It sounds like that's what they are doing, but it's manual.
        
         | outworlder wrote:
         | > OK, so you restore backups to a separate system, and
         | selectively copy the stomped accounts data back to production
         | 
         | This seems to be exactly what they are doing, as described in
         | the article. They don't have automated tools to do this.
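
The quoted approach (restore the backup to a separate system, then copy only the affected accounts back) can be sketched as follows. This is an illustration under strong assumptions: a single hypothetical `issues(id, tenant_id, title)` table keyed by `id`, and SQLite standing in for whatever Atlassian actually runs. A real product would have many related tables that must be copied consistently.

```python
import sqlite3


def restore_tenants(backup: sqlite3.Connection, prod: sqlite3.Connection,
                    tenant_ids: list[str]) -> int:
    """Copy rows for the given tenants from the restored backup into production.

    Rows whose id already exists in production are left untouched.
    Returns the number of rows read from the backup for those tenants.
    """
    read = 0
    for tenant in tenant_ids:
        rows = backup.execute(
            "SELECT id, tenant_id, title FROM issues WHERE tenant_id = ?", (tenant,)
        ).fetchall()
        with prod:  # one transaction per tenant: all of its rows or none
            prod.executemany(
                "INSERT OR IGNORE INTO issues (id, tenant_id, title) VALUES (?, ?, ?)",
                rows,
            )
        read += len(rows)
    return read
```

Filtering by `tenant_id` only works if every row carries one, which is the isolation question raised in the reply that follows.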
        
         | omoikane wrote:
         | Not being able to selectively restore data for a subset of
         | users might also indicate that users are not fully isolated
         | from each other, which is worrying for technical and
         | nontechnical reasons.
        
           | indymike wrote:
           | There is nothing non-technical that matters. If we start
           | acting like it does, we make incredibly poor decisions that in
           | fact have nothing to do with physical reality, and quickly
           | arrive at unworkable technology.
        
             | omoikane wrote:
             | Non-technical reasons include "legal" and "compliance",
             | which often matters a fair bit. I am not disagreeing that
             | non-technical requirements occasionally lead to poor
             | decisions, for some value of poor.
        
               | indymike wrote:
               | I live in a state that once tried to legislate that pi =
               | 3.15. The results were tragic, and the attempt to
               | legislate a ratio was a failure, much like systems
               | created by regulation and laws often are. Math is much
               | less forgiving than legal prose. Making database
               | decisions based on criteria that don't make any
               | engineering sense one way or the other is not far off
               | from legislating the value of PI.
        
         | ksala_ wrote:
         | Personally, given the multi-day outage, I think I would just
         | restore everything to a separate system, and then only point
         | the affected customers to this new system. Take the hit of
         | having two fully separated systems initially, and work on
         | reconciling them after without having to worry about the on-
         | going outage.
         | 
         | I wonder if they're not doing this due to some tech
         | limitations, to avoid taking the financial cost of running two
         | systems, or to avoid having to reconcile the systems.
        
           | bigtones wrote:
           | That's a really good idea.
        
           | mandevil wrote:
           | At a big multi-tenancy company I used to work at, the problem
           | would have been the accessory machines: we had something like
           | 15-20 different machines around the main DB and API machines,
           | running cron jobs, terminating SSL connections, load
           | balancing, sending alerts to us and customer emails out, etc.
           | And while the backing up and failing over on DB and API
           | machines was a well documented, thoroughly tested process...
           | the other machines were all custom jobs that were very poorly
           | documented, with who knows what scripts running on them, that
           | might or might not be important. Trying to replicate all of
           | that during an emergency would have been a challenge.
           | 
           | For just this sort of problem, we actually had three DB
           | servers running all the time: active, passive, and _hour
           | behind_ with the ability to break _hour behind_'s copying of
           | the write-ahead log of active as the DBA's secret weapon for
           | just this problem. If all customers had accidentally lost an
           | hour's worth of data it would have been embarrassing, but much
           | less than completely shutting out hundreds of paying
           | customers for two weeks, I think?
        
         | underdeserver wrote:
         | > Simple concepts aren't that simple at their scale
         | 
         | It's true that nothing is simple at scale, but it's important
         | to note that simple concepts are the _only_ concepts that work
         | at scale.
        
         | VWWHFSfQ wrote:
         | Most likely the database tables themselves are just a mixture
         | of everyone's data. There's no true multitenancy. So they have
         | to load the backups into a separate database. Then just go
         | through and individually select/insert into the old database.
         | And then you have to worry about things like foreign key
         | constraints complicating the bulk data loading. Are you going
         | to disable constraint enforcement while you bulk load the data?
         | How does that affect existing and new data from customers using
         | the database? Just a guess. But this sounds like a nightmare
         | honestly.
        
           | tetha wrote:
           | Yup. The database schema of one of our products uses a
           | tenant_id in most tables to separate customers logically.
           | 
           | I've eventually gotten a tenant exporter to work.
           | Practically, this requires some deep and nasty digging
           | through the information_schema to build a graph of tables and
           | foreign key constraints. Once it had that, it generates
           | selects with a simple where clause for tables with the
           | tenant_id, and selects with weird joins all over the place
           | for other tables to dump the tenant data.
           | 
           | All of that sounds complex, but that part took a day or two
           | to hammer together to 90% completion, since it's just some
           | graph handling. The other 10% were getting some weird date
           | formatting questions right to produce a properly importable
           | sql dump. And interestingly enough, it's working for more
           | than just that one product.
           | 
           | But that's just where the journey started. After that, it
           | took weeks and months to sort out legacy tables, old
           | tables, tables without indexes, tables no one knew about,
           | tables that were important (but not), tables with
           | inconsistent data, .... And it's just handling a single
           | relational database. And compared to \copy in psql, it's
           | slow. And at times, weird things happen if you import huge
           | chunks of sql into a postgres with deferred foreign keys
           | (because our schema has cyclical references).
           | 
           | Point is, I know how painful it can be to handle that kind of
           | database schema, at a ridiculously smaller scale. I'm kind of
           | happy to not work there.
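
A rough Python sketch of the exporter idea described above: walk the foreign-key graph to the nearest table that carries tenant_id, then emit one SELECT per table. The table names, the FOREIGN_KEYS map and the export_selects helper are all hypothetical; a real tool would build the graph from information_schema rather than hard-code it, and would still face the legacy-table and import problems mentioned above.

```python
from collections import deque

# Hypothetical schema: child table -> list of (fk_column, parent_table, parent_pk).
# A real exporter would read this out of information_schema.
FOREIGN_KEYS = {
    "projects": [],
    "issues": [("project_id", "projects", "id")],
    "comments": [("issue_id", "issues", "id")],
}
# Tables that carry tenant_id directly (an assumption for this sketch).
TENANT_TABLES = {"projects"}


def path_to_tenant_table(table):
    """Breadth-first search along foreign keys to the nearest tenant table.

    Returns a list of (child, fk_column, parent, parent_pk) hops, or None
    when no route exists. The `seen` set keeps cyclic references from looping.
    """
    queue = deque([(table, [])])
    seen = {table}
    while queue:
        current, hops = queue.popleft()
        if current in TENANT_TABLES:
            return hops
        for fk_col, parent, parent_pk in FOREIGN_KEYS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, hops + [(current, fk_col, parent, parent_pk)]))
    return None


def export_selects():
    """Generate one parameterised SELECT per table that can be tied to a tenant."""
    statements = {}
    for table in FOREIGN_KEYS:
        hops = path_to_tenant_table(table)
        if hops is None:
            continue  # no route to a tenant: needs manual handling
        joins = "".join(
            f" JOIN {parent} ON {child}.{fk_col} = {parent}.{parent_pk}"
            for child, fk_col, parent, parent_pk in hops
        )
        owner = hops[-1][2] if hops else table
        statements[table] = (
            f"SELECT {table}.* FROM {table}{joins} WHERE {owner}.tenant_id = %s"
        )
    return statements
```

Tables with tenant_id get the simple WHERE clause; everything else gets a chain of joins back to a table that has one.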
        
             | [deleted]
        
           | radicaldreamer wrote:
           | I can't believe that they would intermix the data in that
           | way... but if they did, godspeed to them, they're likely
           | still overpromising what can be done in this time frame.
        
             | mdoms wrote:
             | They don't, you're responding to speculation which is just
             | outright wrong. Jira and Confluence are single tenanted
             | databases, unless something fundamental has changed at
             | Atlassian in the past 4 years.
             | 
             | Source: worked at Atlassian, on Jira, 4 years ago.
        
               | robertlagrant wrote:
               | Then Atlassian's description of why the restore took so
               | long makes no sense to me.
        
             | dabeeeenster wrote:
             | How else do you run a multitenancy platform?
        
               | kikki wrote:
               | Not quite the same but at Fandom (Wikia), every wiki has
               | its own DB (over 300,000 wikis), and they are clustered
               | across a bunch of servers (usually balanced by traffic).
               | It works well - but we don't ever really need to query
               | across databases. There's a bunch of logic around
               | instance/db selection but that's about as complex as it
               | gets.
        
               | jjice wrote:
               | Interesting architecture. From a design point of view, I
               | like the idea of full isolation. From an infrastructure
               | point of view I'm a little scared. I'd assume it's
               | actually not that bad and there's a good way to manage
               | the individual DBs and scale them individually.
               | 
               | Really interested if you can share any details.
               | 
               | Edit: I know each wiki is on a subdomain. Does each wiki
               | also have its own server?
        
               | kikki wrote:
               | There are _many_ databases on each server, last I checked
               | there were around 8 servers (or: "clusters") - and we have
               | it so the traffic is somewhat evenly distributed across
               | each server. There are reasonable capacity limits, and
               | when servers get full we spin up a new one and start
               | accepting new wikis there. I am not in OPS, and they do a
               | lot of work behind the scenes to make this all run
               | smoothly - but from an eng perspective we rarely have
               | issues with this at scale.
               | 
               | Some of this was open source before we unified all of our
               | wiki products, which has a lot of the selection / db
               | logic, at https://github.com/Wikia/app.
        
               | spookthesunset wrote:
               | How do you update the schema on 300,000 databases?
        
               | msh wrote:
               | At minimum separate tables for each tenant.
        
               | Spivak wrote:
               | At that point you might as well just do separate schemas,
               | it's actually less headache.
        
               | radicaldreamer wrote:
               | Sorry, I'm not actually sure... maybe someone who's
               | experienced in backend db can elucidate here.
               | 
               | Is it not a good idea to spin up separate db instances
               | for each client/company?
        
               | indymike wrote:
               | Answer: it depends on the application. For example, a big
               | social app is not going to provision a new db for every
               | user, or for every customer that runs an ad. Likewise, a
               | lot of enterprise software fits a model where each
               | customer getting its own db makes sense. So, really,
               | just a design decision.
        
               | nemothekid wrote:
               | I believe you can sign up an account for free or
               | incredibly cheap ($5/user). You would potentially have
               | tens of thousands of databases. Imagine trying to do
               | something like a database migration to add a column. I
               | believe the day to day operations would be a nightmare as
               | no RDBMS has probably had that kind of feature stress
               | tested.
        
               | some-guy wrote:
               | The company I work at (Workday) does this, but it's for
               | business / liability reasons.
        
               | robertlagrant wrote:
               | Bearing in mind the licence fees of Workday, the costs of
               | separate databases pale in comparison!
        
               | andyjohnson0 wrote:
               | > Is it not a good idea to spin up separate db instances
               | for each client/company?
               | 
               | It depends, really. There is a trade-off in terms of
               | software and operational complexity vs scalability/perf
               | and isolation. And probably a bunch of other factors.
               | 
               | If you have separate databases for each customer, schema
               | migrations can be staged over time. But that means your
               | software backend needs to be able to work with different
               | schemas concurrently. You can also benefit from
               | resilience and isolation guarantees provided by the dbms.
               | On the other hand, having a dbms manage lots of databases
               | can affect perf. Linking between databases can be a
               | minefield, especially w/r/t foreign keys and distributed
               | transactions.
               | 
               | https://docs.microsoft.com/en-us/azure/azure-
               | sql/database/sa...
        
               | Spivak wrote:
               | > But that means your software backend needs to be able
               | to work with different schemas concurrently.
               | 
               | Not if you're truly multi-tenant and each customer has
               | their own app servers. Then your code and schema version
               | are always in lock-step.
        
               | andyjohnson0 wrote:
               | True. But then you have an additional problem ...
        
               | truffdog wrote:
               | Separate DB instances don't scale as well cost wise,
               | and generally means onboarding takes a few minutes
               | instead of being instant. It is very common though.
        
               | Spivak wrote:
               | The solution that satisfies everyone is having a separate
               | _schema_ per customer and a number of database clusters.
               | Then each customer is assigned to a particular cluster.
               | Always make sure you have excess capacity on your pool of
               | clusters and onboarding is still instant.
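
A minimal Python sketch of the "schema per customer, assigned to a cluster" idea above. The Cluster class, the capacity numbers and the cluster.schema locator format are hypothetical, and real placement would likely weigh traffic rather than a simple count.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    capacity: int  # max tenant schemas this cluster should hold
    tenants: set = field(default_factory=set)

    @property
    def free(self):
        return self.capacity - len(self.tenants)


def assign(tenant, clusters):
    """Place a new tenant's schema on the cluster with the most free slots."""
    target = max(clusters, key=lambda c: c.free)
    if target.free <= 0:
        # Onboarding is only instant while spare capacity exists.
        raise RuntimeError("no spare capacity: add a cluster before onboarding")
    target.tenants.add(tenant)
    return f"{target.name}.tenant_{tenant}"  # hypothetical locator format
```

Onboarding stays instant only while some cluster has free slots, which is why the pool needs to be grown ahead of demand.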
        
               | brightball wrote:
               | There are basically two options for multi-tenancy with
               | their own tradeoffs.
               | 
               | 1. An account/tenant_id field for each table
               | 
               | 2. A schema for each tenant wrapping all of the tables
               | 
               | Option 2 gives you cleaner separation but complicates
               | your deployment process because now you have to run every
               | database change across every schema every time you
               | deploy. This gets more complicated as your code is
               | deploying in case the code itself gets out of sync,
               | there's a rollback or an error mid deploy due to an issue
               | with some specific data.
               | 
               | The benefit of the approach is the option to do different
               | backup policies for different customers, makes moving
               | specific customers to specific instances easier and you
               | avoid the extra index on tenant_id in every table.
               | 
               | Option 1 is significantly easier to shard out
               | horizontally and simplifies the database change process,
               | but you lose space on the extra indexes. Plus in many
               | databases you can partition on the tenant_id.
               | 
               | Most people typically end up with option 1 after dealing
               | with or reading horror stories about the operational
               | complexity of option 2.
        
               | outworlder wrote:
               | Option 2 has many unforeseen consequences.
               | 
               | Business wants to run a query across customers? In most
               | DBs you need either custom code or to create a stored
               | procedure to iterate across schemas.
               | 
               | Every table that you create is multiplied by the number
               | of customers. This has implications for some database
               | systems (like PG's vacuum).
               | 
               | Your migrations will take _forever_ to run.
               | 
               | Etc.
        
               | Spivak wrote:
               | The second problem is mitigated by the fact that schemas
               | are trivially migratable between database servers. Once
               | you grow too big for one cluster just make another.
        
               | eropple wrote:
               | The secret bomb in option 1 is that you generally have to
               | have smarter primary keys that fully embrace multitenancy
               | and while Atlassian hires smart folks and I'm sure they
               | at some level know this--that's a relatively hard
               | retrofit to work into a system.
        
               | [deleted]
        
               | codingdave wrote:
               | It is like any other architectural choice - there are
               | pros and cons both directions. If you have separate db
               | instances, you have to scale up the operations to manage
               | each one - migrations, scripts, etc need to be either run
               | against them all, or you need good tooling in place to
               | automate it. A single instance avoids all that, but is
               | more complex in the actual software and definitely more
               | complex for security. A single DB also would let you
               | share data amongst organizations fairly easily, but
               | whether that is good or bad depends on your product. I've
               | created and run products both ways, and I like separate
               | DBs at small scales, single DBs at medium scale, but
               | separate DBs again at huge scale if you also put
               | management tooling in place.
        
               | stickfigure wrote:
               | I have built multiple multi-tenancy platforms and I never
               | create separate databases for each customer. If you have
               | separate databases, it's almost impossible to run
               | meaningful queries across all of them. That architectural
               | choice creates far more headaches than it solves. Usually
               | people end up with the split-database architecture when
               | they want a quick retrofit for a system that wasn't
               | designed with multiple tenants.
               | 
               | I've also had to restore partial data from backups on a
               | few occasions when customers fat-fingered some data and
               | asked pretty-please to undo. If someone on staff
               | understands the system well, it's not hard. I suspect
               | Atlassian suffers from a complicated schema and a post-
               | IPO brain drain.
        
               | x0x0 wrote:
               | I can't believe anyone would do separate databases.
               | 
               | Just wait until a migration doesn't run on 2 of your 400+
               | customer databases. Or multi-hour migrations.
        
               | tedunangst wrote:
               | Sounds good to me. Now you've got 398 happy customers on
               | the new version, and a small scale issue to resolve with
               | two customers.
        
               | rectang wrote:
               | When all customer data lives in the unified database:
               | Just wait until a bug in a query exposes the data of
               | customers to each other, creating instant regulatory and
               | privacy nightmares for everyone.
        
               | x0x0 wrote:
               | With an orm and customer objects to create scoped
               | queries, I haven't found this to be a problem. It's also
               | very easy to check in code reviews. And not a painful
               | issue from, well, the lack of this happening given it's
               | an extremely common app design.
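
For readers unfamiliar with the "scoped queries" pattern mentioned above, here is a small Python sketch using the standard sqlite3 module. TenantScope, the issues table and its columns are hypothetical; the point is only that application code never writes a query against shared tables without going through an object that adds the tenant_id filter.

```python
import sqlite3


class TenantScope:
    """Wraps a connection so every query on `issues` is filtered by tenant_id."""

    def __init__(self, conn, tenant_id):
        self.conn = conn
        self.tenant_id = tenant_id

    def issues(self, status=None):
        sql = "SELECT id, title FROM issues WHERE tenant_id = ?"
        params = [self.tenant_id]
        if status is not None:
            sql += " AND status = ?"
            params.append(status)
        return self.conn.execute(sql + " ORDER BY id", params).fetchall()


def demo_connection():
    """Build an in-memory shared table holding rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL,"
        " title TEXT, status TEXT)"
    )
    # The extra index on tenant_id is the space cost of the shared-table design.
    conn.execute("CREATE INDEX issues_tenant ON issues (tenant_id)")
    conn.executemany(
        "INSERT INTO issues (tenant_id, title, status) VALUES (?, ?, ?)",
        [(1, "login broken", "open"), (2, "typo", "open"), (1, "slow page", "done")],
    )
    return conn
```

In a code review, any query on a shared table that bypasses the scope object is the thing to look for.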
        
               | sharken wrote:
               | It's likely a mixture of all these factors, the brain
               | drain could absolutely be responsible.
               | 
               | At least it would not be the first time in history that a
               | company has lost the engineering spirit. And instead the
               | business people have taken over, so that details like
               | disaster plans become less of a priority.
               | 
               | A business person and an engineer will always view risk
               | differently, better disaster plans is a kind of insurance
               | that is a lot harder to sell when too many business
               | people run the company.
        
               | ZetaZero wrote:
               | This. It would be an impossible nightmare for every
               | account to have their own DB. Hundreds of thousands of
               | accounts and databases....
        
               | lelandbatey wrote:
               | I worked at a company that architected their multi-tenancy
               | in almost exactly this style. In their particular case,
               | only a few of the very largest customers had their
               | database set aside on their own dedicated instance, but
               | every customer did have their own DB with their own set
               | of tables. Having worked in that world (every customer
               | had their own DB) and on a product where all customers
               | had their data intermingled in one gigantic set of tables
               | in one giant DB on one logical instance, I'd definitely
               | encourage the "every customer gets their own DB".
               | 
               | Giving every customer their own table means you're going
               | to need database administrators. For these folks their
               | _dedicated_ job was maintaining, operating, and changing
               | their fleet of databases, but they were very technical
               | and were _amazing_ to work with.
        
               | david422 wrote:
               | > I'd definitely encourage the "every customer gets their
               | own DB".
               | 
               | Does this extend to services as well? We have a suite of
               | (micro) services. Are they all segregated?
        
               | mdoms wrote:
               | This is the case. I won't comment on your "hundreds of
               | thousands" figure because the number of Cloud customers
               | was a closely guarded secret at least when I worked
               | there, but yes one DB per tenant, dozens to hundreds of
               | DBs per server, and some complicated shuffling of tenant
               | DBs when you run into noisy neighbours.
        
               | mh- wrote:
               | That makes this prolonged restore process all the more
               | confusing, then.
               | 
               | I (and many others) assumed they had to graft in data
               | from backups since a full restore would clobber newer
               | changes from unaffected customers.
               | 
               | If they're all isolated in their own logical per-tenant
               | DBs, I'm really at a loss for what is making restoration
               | take 3 weeks for 400 tenants.
               | 
               | I understand if you'd rather not venture into it, but
               | care to offer any speculation?
        
               | spookthesunset wrote:
               | If they had multi-tenant databases for SaaS it would mean
               | either the self-hosted jira instances also had the same
               | multi-tenant database schema or they'd have to maintain
               | two almost entirely different data access layers for
               | cloud vs. on-prem. Since their cloud offering came from a
               | historically on-prem codebase, I would expect the easiest
               | way to offer cloud stuff is to do a DB per tenant.
               | Otherwise there would be a shit-ton of new code that only
               | applies for cloud stuff....
        
               | [deleted]
        
               | taeric wrote:
               | Wait. Why? This sounds like something that feels hard, if
               | you are used to the giant DBs of old. But you can
               | probably get many many instances of the smaller databases
               | without much trouble.
               | 
               | Would still be some maintenance, don't get me wrong. But
               | far from impossible.
        
               | ezekg wrote:
               | Imagine the database schema migrations...
        
               | jagged-chisel wrote:
               | the good news is by the time you get to the 100th client,
               | you'll likely have run into all possible bugs and the
               | remaining 6900 will be pretty smooth.
        
               | Spivak wrote:
               | Having worked at shops that used this architecture it's
               | really not that bad. Can you write the code to do one
               | schema migration? Great, now you can do 1000. App server
               | boots and runs the schema migrations, drops privs and
               | launches the app. Now you've staved off your scaling
               | issues from "how to have a db large enough to hold all
               | our customer data" to "how to have a db large enough to
               | hold our biggest customer's data." Much easier.
        
               | robertlagrant wrote:
               | You can write the code to do 1000 schema migrations, but
               | the problem is if you've migrated 40% of them and hit an
               | issue. What do?
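
One possible answer to the "migrated 40% and hit an issue" question, sketched in Python with sqlite3: run each tenant's migration in its own transaction, roll back only the tenant that fails, and keep a list of failures to fix afterwards. The helper names and the issues table are hypothetical, and this assumes a database with transactional DDL (SQLite and PostgreSQL have it; not every RDBMS does). It also assumes the application can run against both schema versions while stragglers are repaired.

```python
import sqlite3


def new_tenant_db(extra_sql=()):
    """Create one in-memory database standing in for a single tenant."""
    # isolation_level=None: we issue BEGIN/COMMIT ourselves, so DDL is covered too.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
    for sql in extra_sql:
        conn.execute(sql)
    return conn


def migrate_all(databases, statements):
    """Apply `statements` to every tenant DB, one transaction per tenant.

    Returns (migrated, failed): a list of tenant names that succeeded and a
    dict mapping failed tenant names to the error message.
    """
    migrated, failed = [], {}
    for name, conn in databases.items():
        conn.execute("BEGIN")
        try:
            for sql in statements:
                conn.execute(sql)
            conn.execute("COMMIT")
            migrated.append(name)
        except sqlite3.Error as exc:
            conn.execute("ROLLBACK")  # undo this tenant only, then carry on
            failed[name] = str(exc)
    return migrated, failed


def columns(conn):
    """List the column names of the issues table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(issues)")]
```

A tenant that fails halfway is left on the old schema rather than in a half-migrated state, so the failure list is a to-do list rather than an outage.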
        
               | spookthesunset wrote:
               | One of the many reasons to put good constraints on fields
               | and use referential integrity! If you don't let the
               | database enforce data validity you are gonna get fucked
               | at some point!
               | 
               | source: every single place I've worked at that poo-poos
               | referential integrity has a database that is full of
               | bullshit that "the application code" never cleaned up
               | 
               | Always use referential integrity. The people who are
               | against it almost always are against it for superstitious
               | reasons (eg: "it makes things slow" or "only one codebase
               | calls it so the code can enforce the integrity"). All it
               | takes is exactly one bug in the application code to
               | corrupt the whole damn thing. And that bug _will_ happen
               | over the lifetime of the product regardless of how
               | "good" or "awesome" the programmers think they are....
               | 
               | ... I'll get off my soapbox now!
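
A small illustration of the point above, using Python's sqlite3 module: with foreign keys and CHECK constraints declared, the database itself rejects bad rows no matter which code path tries to write them. The customers and tickets tables are made up for the example.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE tickets ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customers(id),"
        " title TEXT NOT NULL CHECK (length(title) > 0))"
    )
    return conn


def try_insert(conn, customer_id, title):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO tickets (customer_id, title) VALUES (?, ?)",
                (customer_id, title),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A ticket pointing at a customer that does not exist, or one with an empty title, is refused by the database even if the application forgot to check.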
        
               | oauea wrote:
               | You'll quickly run into limitations of how many tcp
               | connections you can hold open. Unless you also want to
               | run separate app servers for each customer, which will
               | cost a lot of $$$
               | 
               | Oh, and just forget about allowing your customers to
               | share their data with each other, which most enterprises
               | want in one way or another.
        
               | kubami wrote:
               | Wait. What? None of the enterprise customers want to
               | share data with each other. And definitely not on a DB
               | level. That should happen in the business logic.
        
               | lalaithion wrote:
               | Lots of companies have consultants, and want to be able
               | to share their consulting-related tickets with their
               | consultants. And the consultants want one system they can
               | log into and see the tickets from all of the companies
               | that are hiring them.
        
               | outworlder wrote:
               | It would be a nightmarish scenario if you have thousands
               | of customers. And completely unnecessary. You can create
               | multiple databases and/or schemas in a single instance.
               | 
               | Don't do any of the above unless you understand the
               | implications.
        
               | hnlmorg wrote:
               | Multiple schemas? You don't need every tenant in the same
               | schema. However I'm not a DBA by trade so there might be
               | some issue with doing this at scale that I'm unaware of.
        
               | doliveira wrote:
               | By segregating as much as you can. Definitely not by putting
               | everything in a single table. At the very least separate
               | databases/schemas with proper permissions so there's no
               | chance of data intermixing.
               | 
               | The best would be multiple separate database instances,
               | which is not even hard to manage, especially for qualified
               | engineers like Atlassian surely has plenty of. The
               | problem is business decisions to ignore the tech debt,
               | usually...
        
               | akie wrote:
               | Now every time you run a database migration, you have to
               | adjust N tables - and in Atlassian's case, N is 200000.
               | Is that better? It depends. There is no "best" way of
               | doing multitenancy.
        
               | doliveira wrote:
               | There is a worst way of doing multitenancy, and that is
               | sharing a single big table.
        
               | hnlmorg wrote:
               | That's just an automation issue. It's not like you have
               | to write a bespoke database migration script per DB.
        
               | robertlagrant wrote:
               | The bug we are mitigating was also just an automation
               | issue.
        
               | hnlmorg wrote:
               | It's also pretty easy to foobar up a single DB instance
               | if you don't have proper guardrails in place.
               | 
               | Automation wasn't the issue here. It's the symptom not
               | the cause.
        
               | doliveira wrote:
               | Way easier, actually.
        
               | robertlagrant wrote:
               | No, the symptom was the loss of customer data.
        
         | tus666 wrote:
         | > OK, so you restore backups to a separate system, and
         | selectively copy the stomped accounts data back to production
         | 
         | You don't think that's exactly what they are doing?
        
         | qiskit wrote:
         | > However, if they [restore backups], while the impacted ~400
         | companies would get back all their data, everyone else would
         | lose all data committed since that point
         | 
         | How would they lose committed data? Even after restoring the
         | backups can't they run the logs so that everyone is caught up?
        
           | drjasonharrison wrote:
           | Are you assuming that they record the events in a way that
           | can be played back?
        
           | mh- wrote:
           | _(There's a tacit assumption here that the data across
           | tenants is commingled in tables, and that's being disputed
           | elsewhere in the thread, but playing along..)_
           | 
           | You wouldn't be able to do that without forcing downtime for
           | all customers, for the duration it takes to restore the
           | snapshot and then replay the logs. Not to mention the risks
           | of the process failing somehow.
           | 
           | You could narrow the window to just the "replay" portion, if
           | you were able to stand up an extra database/infra, to switch
           | over to when it was ready. But at some point you'd probably
           | still have to go read only to checkpoint the logs and begin
           | the replay.
           | 
           | It's of course possible to do something more complicated here
           | and stream the changes then eventually enact a failover, but
           | this would all be too complex and error prone to introduce in
           | their current crisis mode. It's something I'd suggest
           | _considering_ when architecting their DR/BCP, but it's too
           | late for that kind of elegance (and complexity) now.
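
The "restore the snapshot, then run the logs" idea from this subthread can be illustrated with a toy model. This is a minimal sketch, assuming changes are recorded as an ordered, replayable log (the very assumption questioned above). The `replay` function, the tuple layout of the log, and the keys are hypothetical and say nothing about how Atlassian actually stores data.

```python
def replay(snapshot, log, snapshot_seq, skip=frozenset()):
    """Rebuild state from a snapshot plus every later log entry,
    leaving out the entries listed in `skip` (e.g. the faulty delete)."""
    state = dict(snapshot)
    for seq, op, key, value in log:
        if seq <= snapshot_seq or seq in skip:
            continue  # already in the snapshot, or deliberately dropped
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

# Hypothetical history: the snapshot was taken after entry 2.
snapshot = {"acme/1": "A", "globex/1": "G"}
log = [
    (1, "set", "acme/1", "A"),
    (2, "set", "globex/1", "G"),
    (3, "delete", "acme/1", None),    # the faulty script
    (4, "set", "globex/2", "G2"),     # legitimate work after the snapshot
]

recovered = replay(snapshot, log, snapshot_seq=2, skip={3})
```

The sketch deliberately ignores the hard parts raised above: writes must stop (or be streamed) while the replay catches up, and the process itself can fail.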
        
         | more_corn wrote:
         | Yeah, I'm thinking the exact same thing.
         | 
         | Perhaps they don't have the right people on hand to do hard
         | things like this.
         | 
         | They also apparently lack an incident response plan since a
         | critical component of that is comms to affected customers.
         | 
         | They also lack good practices around preventing human error. It
         | should not have even been possible to make the initial mistake.
         | It certainly should have involved multiple steps of "are you
         | sure" and potentially even review.
         | 
         | Sounds like an operations shit show. Glad it's not my circus.
        
           | robertlagrant wrote:
           | They have great practices; they even published them. They
           | just didn't follow them here.
        
       | ChrisMarshallNY wrote:
       | Heh. We have a Confluence account.
       | 
       | That no one uses.
       | 
       | So we didn't notice.
        
         | teh_klev wrote:
         | You probably wouldn't have noticed unless you were in the
         | affected subset of customers. This wasn't a total outage;
         | rather, it affected a group of users who had been running a
         | legacy standalone app called "Insight - Asset Management".
        
       | throwawayHN378 wrote:
       | When in doubt I go on LinkedIn and find an engineer that works
       | for the company and message them directly.
        
         | rjmunro wrote:
         | Does that work? I've sometimes thought about trying it, but
         | never actually done so.
        
       | flaviotsf wrote:
       | I recommend doing disaster recovery steps for your personal data
       | as well, such as Gmail. At one point recently I was creating
       | filters to delete bulk messages and - when the filter got
       | created, it somehow missed the from:@xyz.com domain part and I
       | ended up deleting => delete forever all emails. I noticed the
       | issue right away but it was enough to wipe 2-3 months worth of
       | emails (all of them, even Sent ones).
        
       | Traster wrote:
       | I remember finding out one of the senior managers from my company
       | ended up as head of software at Atlassian. It was at that point I
       | was convinced Atlassian has no idea what the hell they're doing.
       | I think this demonstrates the point nicely.
        
         | celim307 wrote:
         | After this they might have to boomerang back to your company
         | lol
        
       | Cederfjard wrote:
       | PSA, because I'm seeing a lot of JIRA in this thread: Since the
       | 2017 rebranding, Jira is no longer officially written in all
       | caps: https://community.atlassian.com/t5/Feedback-Forum-
       | articles/A...
       | 
       | (You can argue how successful it was when people are still using
       | the old style in 2022).
       | 
       | It also makes more sense, since Jira is not an acronym, it's a
       | truncation of Gojira, inspired by Bugzilla/Mozilla.
        
         | spookthesunset wrote:
         | Yeah, I've never typed it as anything but JIRA. Pretty sure my
         | auto-complete will vouch for that.
        
       | Vaslo wrote:
       | I bet the Shitlassian guy is dancing and singing because of this.
        
       | a-dub wrote:
       | i hate deleting things. prefer flags that hide things instead
       | (like a boolean deleted flag in an rdbms table).
       | 
       | prevents data integrity issues in relational databases, makes
       | debugging easier and prevents disasters.
       | 
       | ideally also include a timestamp, both for bookkeeping and safe
       | tools that only remove things that have been soft deleted for
       | some time and are safe to delete without compromising integrity
       | of anything that is not deleted (this is especially important in
       | relational data models)
        
         | jacquesm wrote:
         | Better still: a field that registers at what date a record was
         | supposedly marked as deleted. Because otherwise you still can't
         | bulk recover from an error.
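
The soft-delete pattern described in this subthread might look like the following. This is a minimal sketch using Python's built-in `sqlite3`. The `issues` table, the `deleted_at` column, and the helper names are hypothetical, chosen only for illustration. A nullable timestamp serves as both the flag and the bookkeeping field, which is what makes bulk recovery possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT)"
)
conn.executemany("INSERT INTO issues (title) VALUES (?)", [("a",), ("b",), ("c",)])

def soft_delete(conn, issue_id, now):
    # Mark the row as deleted instead of removing it.
    conn.execute(
        "UPDATE issues SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, issue_id),
    )

def visible_titles(conn):
    # Every ordinary query filters on the flag.
    rows = conn.execute(
        "SELECT title FROM issues WHERE deleted_at IS NULL ORDER BY id"
    )
    return [r[0] for r in rows]

def undelete_since(conn, since):
    # Bulk recovery: revive everything soft-deleted at or after `since`.
    cur = conn.execute(
        "UPDATE issues SET deleted_at = NULL WHERE deleted_at >= ?", (since,)
    )
    return cur.rowcount

soft_delete(conn, 1, "2022-04-04T10:00:00")   # an old, intentional delete
soft_delete(conn, 2, "2022-04-05T09:00:00")   # deleted by a bad script
```

ISO 8601 timestamps compare correctly as strings, so `undelete_since(conn, "2022-04-05T00:00:00")` revives only the rows hit by the bad run.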
        
           | a-dub wrote:
           | yep. but at least in the rdbms case, and probably in all
           | cases, a flag (and an index on it) tends to be essential for
           | query performance since the state of the flag will appear in
           | most, if not all queries.
           | 
           | that's okay though, queries that reference the timestamp can
           | be slow since they're housekeeping.
        
         | bombcar wrote:
         | The GDPR and various things have made companies more skittish
         | in doing things this way, because they get scared.
         | 
         | Perhaps an effective measure would be to create a key that
         | encrypts a customer's data, and give them a copy of the key,
         | and let them know that after a certain point your copy of the
         | key will be deleted, and if they want a restore past that point
         | they'll need to provide the key.
        
           | brimble wrote:
           | You may as well just delete it, then. I guarantee a high
           | percentage of users won't save that key _and_ be able to find
           | it later. GH (edit: or similarly nerdy sites) might (might!)
           | be able to get away with that, but as soon as part of your
           | process is "give the user a cryptographic key" you've just
           | guaranteed yourself a support nightmare, with normal users.
           | It's why the only cryptographic person-to-person
           | communication systems that've been broadly successful haven't
           | involved keeping track of _anything_, and don't have a setup
           | process more complex than "point camera at QR code".
        
             | bombcar wrote:
             | Yeah, you end up in the case where you "officially" cannot
             | recover after X, but then you make sure that "accidentally"
             | you might be able to recover by keeping copies around
             | somewhere ... until someone realizes and you get sued.
        
           | a-dub wrote:
           | that's an interesting question, i've given a little thought
           | to this multi tenant saas stuff...
           | 
           | not sure if the right way forward is some sort of innovation
           | in operating system and software design where people write
           | and run apps that feel like single tenant apps attached to
           | dedicated per tenant datastores where os and framework magic
           | handle per tenant encryption and segmentation (tenant id as
           | an os level concept)
           | 
           | or... if it makes more sense to encrypt at the record level
           | with keys that only the customers hold using (assuming it's
           | up to the task) homomorphic encryption for things like
           | searches and other backend functions.
           | 
           | either way, for now, soft deleting and following up with an
           | automatic daily hard delete of things soft deleted more than
           | x days ago is a totally reasonable approach.
           | 
           | ops scripts should require typing "yes i know what i'm doing"
           | if someone attempts to hard delete things that have not yet
           | been soft deleted.
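
The two safeguards suggested here (a scheduled purge that only touches rows soft-deleted long enough ago, and an explicit confirmation before hard-deleting live data) could be sketched as follows. This assumes a hypothetical `records` table with a nullable `deleted_at` column; the function names and the confirmation phrase are illustrative only.

```python
import sqlite3
from datetime import datetime, timedelta

CONFIRM_PHRASE = "yes i know what i'm doing"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "old", "2022-01-01T00:00:00"),     # soft-deleted long ago
        (2, "recent", "2022-04-01T00:00:00"),  # soft-deleted recently
        (3, "live", None),                     # never deleted
    ],
)

def purge_soft_deleted(conn, now, retention_days):
    """Hard-delete only rows that were soft-deleted before the cutoff."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM records WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount

def hard_delete(conn, record_id, confirmation=""):
    """Refuse to hard-delete a live row unless the operator confirms."""
    row = conn.execute(
        "SELECT deleted_at FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        return False
    if row[0] is None and confirmation != CONFIRM_PHRASE:
        raise PermissionError("record is live; soft delete it first or confirm")
    conn.execute("DELETE FROM records WHERE id = ?", (record_id,))
    return True
```

With a 30-day retention and `now = datetime(2022, 4, 13)`, only record 1 is purged; record 3 cannot be removed without typing the phrase.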
        
             | bombcar wrote:
             | Yeah, soft delete is the way to go in 99.99% of the cases,
             | with a system setup to eventually hard delete on some
             | schedule (preferably don't hard delete until X number of
             | backups have caught the soft deleted data safely, for
             | example).
        
               | miketria wrote:
               | Hi, this is Mike from Atlassian Engineering. Strongly
               | agree with this. I'd say that if you can afford it, don't
               | do the hard deletes on a schedule though. You never know
               | when there's a system out there referring to soft deleted
               | data that fails once the data is hard deleted. Hard
               | deletes should feel frightening because they are
               | frightening.
        
           | deckard1 wrote:
           | > The GDPR and various things have made companies more
           | skittish in doing things this way, because they get scared.
           | 
           | They may be scared. But are they scared enough to reload
           | every single backup they have, purge the desired records, and
           | resave each and every single backup they have? And not also
           | worry they will corrupt/break the backups in the process.
           | 
           | GDPR compliance is a mess of contradictions and unreasonable
           | asks which all seem to amount to "depends on who you ask."
        
       | yabones wrote:
       | What's a good Jira replacement? Redmine? Phabricator?
       | OpenProject? Just leaving the jira server alone and hoping
       | there's no new and exciting zero-days? One thing is clear, these
       | guys are a bunch of cowboys who can't be trusted with any amount
       | of data.
        
         | elxr wrote:
         | If you're hosting your code on GitHub, then GitHub projects is
         | definitely worth using.
         | 
         | Does everything I used to use Jira for, but feels more modern
         | and lightweight. Also, it has dark mode.
        
         | dzikimarian wrote:
         | I'm in the same boat. Currently the best choice seems to be
         | YouTrack, which has a reasonable licensing model for the
         | self-hosted option.
        
         | 420official wrote:
         | I'm not at all familiar but a tweet linked from the OP and
         | written by the author plugs https://linear.app/
        
           | gkoberger wrote:
           | Linear is phenomenal. Probably built for a different audience
           | than Jira (it's like Superhuman for tickets), but if you want
           | something that works well and is opinionated I highly highly
           | recommend it.
        
           | bloopernova wrote:
           | Linear has a dark mode. I'm already won over! ;)
        
         | kitsune_ wrote:
         | Gitlab would be enough for engineering teams
        
         | nicoburns wrote:
         | We switched from JIRA to Shortcut https://shortcut.com/
         | (formerly Clubhouse), and I'd highly recommend them. It's much
         | better than JIRA ever was, both from a UX perspective and an
         | implementation/performance perspective.
        
         | _dark_matter_ wrote:
         | _Bugzilla_
        
         | [deleted]
        
         | originalvichy wrote:
         | For pure engineering teams it's either Gitlab or Azure Devops.
         | Those are the most common competitors I hear about. If you have
         | non-engineers the choice gets trickier.
        
         | histriosum wrote:
         | I've used Request Tracker for years. It's not pretty, it's
         | written in Perl, but I can fairly easily make it do all the
         | ticket tracking flows I care about and it just runs and runs
         | and runs. My scale is admittedly small, but I put tens of
         | thousands of tickets per year through my instance, and I
         | basically never have to touch it unless I'm setting up a new
         | queue or different flow for something.
        
           | beardbound wrote:
           | Wow, I've never seen anyone mention RT here. I used it for
           | years when I was working IT for my university while in
           | undergrad. It worked pretty well. It didn't have a lot of
           | features but it allowed clients/customers to respond to
           | tickets via email which was pretty cool at the time (late
           | 00s). It also ran pretty fast on the terrible servers we had
           | it on.
        
         | WC3w6pXxgGd wrote:
        
       | Tomsilverberry wrote:
        
       | jacquesm wrote:
       | I suspect - pure speculation - that they _can't_ restore the
       | backups, because if they could then they could easily do this in
       | a way that accounts affected could be restored selectively. In
       | other words: test your backups, if you don't they won't be there
       | for you when you need them.
        
       | digital79 wrote:
        
       | ordiel wrote:
       | All I can say as an Atlassian Server products user is that the
       | moment they said it was Cloud or nothing, I chose nothing.
       | 
       | I would much rather run Gitea on a Raspberry Pi that I CONTROL
       | than be left impotent, able to do nothing for more than a week.
       | Plus, having worked at cloud companies and having been
       | requested to "collect customer data" to hand it over to the
       | government, I would NEVER move critical pieces to anyone else's
       | infra...
       | 
       | (Note: I am not supporting crime, but I would rather have
       | privacy and criminals than live in an authoritarian regime where
       | a dictator who knows everything about everyone keeps "peace"....
       | Yes, I am looking at you, China!)
       | 
       | If mistakes will be made, at least I won't pay others to make
       | them for me....
        
         | Phelinofist wrote:
         | As I understood it, it is not "Cloud or Nothing" but "Cloud or
         | Data Center" - is this wrong?
        
           | rsstack wrote:
           | The on-prem offering of Atlassian was discontinued. Existing
           | contracts are being honored but as of March 2022, that's the
           | end of the line for it. Maybe it will be revived now.
        
           | grnmamba wrote:
           | Unlike server, Data Center starts at 42.000$ per year.
           | 
           | For most SMBs, it's cloud or nothing (or a different vendor,
           | of course).
        
           | yabones wrote:
           | AFAIK the Datacenter pricing starts at 500 users and goes up
           | from there. So a small org could end up paying 5-10x what
           | they were before on the Server license.
        
           | callamdelaney wrote:
           | Where do you think the cloud lives?
        
       | kache_ wrote:
       | return to monke
       | 
       | vi your todolists on an ec2 box
        
       | [deleted]
        
       | mc4ndr3 wrote:
       | They never heard of beta testing, rolling updates, infrastructure
       | as code, federation, customer isolation, or Public Relations.
       | What the heck.
        
       | parentheses wrote:
       | A case for reducing complexity of software. Also, given the
       | recent GitHub incident spree, it's almost debilitating. The
       | entire tech industry takes a hit when companies like these fail
       | at operations.
        
       | oldshatterhand wrote:
       | Random guess, that this is a "we say we make backups, but we
       | actually take snapshots" issue :)
        
       | luckydata wrote:
       | so this is the end of Atlassian as a company right?
        
         | vinay_ys wrote:
         | Depends. Are there strong alternate products to which customers
         | can easily migrate in next 6-12 months? If yes, and they choose
         | to move away, then Atlassian will be in serious trouble. I
         | wonder how many of their customers have long-term locked-in
         | contracts and if they have performance clauses that allow them
         | to exit such contracts.
        
         | lifefeed wrote:
         | Eh. The Exxon Valdez oil spill is a case study in the failure
         | of crisis management, but Exxon weathered it. It's a vastly
         | different industry with huge "economic moats," but it does
         | point to the fact that a company can weather a crisis.
        
         | raincom wrote:
         | I don't think so, as long as investors hold the stock, as long
         | as customers keep paying Atlassian.
        
         | function_seven wrote:
         | I had the same initial thought. _Surely_ a weekslong outage
         | would drive customers away permanently, right?
         | 
         | Nope. From TFA:
         | 
         | > _I asked customers if they would offboard Atlassian as a
         | result of the outage. Most of them said they won't leave the
         | Atlassian stack, as long as they don't lose data. This is
         | because moving is complex and they don't see a move would
         | mitigate a risk of a cloud provider going down._
        
           | luckydata wrote:
           | it doesn't happen overnight, but this is a really bad
           | precedent and it will definitely have an effect on both sales
           | AND renewals. This market is theirs to lose and seems they
           | are doing everything they can to do just that. Github is
           | getting better, and it has mindshare amongst developers, not
           | to mention it's part of a company that like it or not knows
           | how to sell to large enterprises (Microsoft).
        
         | case0x wrote:
         | Why would it? On our end everything works fine. If you're not
         | one of the 400 companies, there's no difference
        
           | asah wrote:
           | "first they came for ..."
        
             | gmfawcett wrote:
             | Poor taste, buddy. Comparing the Atlassian mess-up to the
             | Holocaust diminishes the Holocaust.
        
               | asah wrote:
               | um... the sentiment is universal it's not specific to
               | that particularly awful history. Sorry if it triggered
               | you, HN doesn't offer a delete button.
               | 
               | FYI my ancestors fled oppression on both sides and I'm
               | well aware that it's a miracle I'm alive.
               | 
               | Again, one bad thing leading to another is a common human
               | behavior, and the Holocaust is just an extreme example
               | that I ABSOLUTELY did not intend whatsoever. You make
               | this connection, not me.
        
               | gmfawcett wrote:
               | If I'm the one making this connection, then it should be
               | trivial for you to finish your sentence. "First they came
               | for..." Who are the Jews in your analogy? Who are the
               | communists? the trade unionists? And who is the
               | totalitarian regime?
               | 
               | Suggesting that Niemoller's poem is about "one bad thing
               | leads to another" is like suggesting that Anne Frank's
               | diary is about "sometimes girls have really bad days." I
               | understand you didn't mean any offense to anyone. But
               | that's not a license to be offensive, and then duck for
               | cover.
        
             | [deleted]
        
           | foobiekr wrote:
           | Severe operational issues don't give you pause?
        
           | dividedbyzero wrote:
           | Even among those 400, Jira especially is crazy popular with a
           | lot of scrum masters and the scrum crowd in general. I could see
           | some of those 400 stick with Jira even after this shit show
           | if only to avoid losing all their scrum masters.
        
           | openknot wrote:
           | Yep. The vast majority of users don't follow these outages
           | (aka don't browse forums like Hacker News or r/sysadmin), and
           | thus aren't aware of them.
           | 
           | Many of these users are decision-makers who decide what tools
           | to use, and will continue to use Atlassian out of inertia due
           | to lots of existing documentation on the tool (this is
           | compounded by not knowing about the outages, or not knowing
           | the severity of the outages), and also because large,
           | professional companies use their tools too.
           | 
           | I don't necessarily agree with the perspective to stay with
           | it, but it uses a lot of political capital/innovation
           | tokens/goodwill/etc. to change systems, when there are
           | usually higher-priority things to do (than to get buy-in to
           | switch).
        
       | knbrlo wrote:
       | My current employer uses Jira but we seem to have not been
       | affected by this. Hopefully those customers affected are able to
       | press Atlassian for improvements from notification time, backups,
       | usability etc.
        
       | bitwise101 wrote:
       | This talk from Atlassian aged well
       | https://conferences.oreilly.com/software-architecture/sa-eu-...
        
         | danuker wrote:
         | I am tired of survivor-biased "best practices" advice. I wonder
         | which practices contained there are the _worst practices_.
        
       | nitinagg wrote:
       | Selectively restoring data only for certain rows is super hard.
       | But the communications by Atlassian have been the worst I have
       | ever seen in the industry.
        
         | raincom wrote:
         | So, it must be a bad idea to shove the data of multiple
         | customers in a single table controlled by some column name
         | ('tenant').
        
         | profmonocle wrote:
         | I actually got an email from our Atlassian contact just the
         | other day encouraging us to switch to their cloud service.
         | Crazy that no one thought to pause those. (I assume it _must_
         | have been scheduled.)
        
           | HeyLaughingBoy wrote:
           | This article on HN is the _only_ time I've even heard that
           | Atlassian was having a problem. I suspect that 99% of the
           | tech "community" has absolutely no idea this is happening.
           | 
           | We use Jira, but it's self-hosted for my team. Maybe other
           | teams that have transitioned to the cloud version are aware
           | that there's a problem, but I haven't heard about it.
        
             | LadyCailin wrote:
             | Apparently the self hosted version goes out of support in
             | 2024, so there will only be cloud hosting. Dumb dumb dumb.
        
             | mcintyre1994 wrote:
             | It's only 400 teams affected, but from this article it
             | sounds like they're all really big ones.
        
         | seanwilson wrote:
         | > Selectively restoring data only for certain rows is super
         | hard.
         | 
         | What's the right way to structure your data here that would
         | make restoring more straightforward here? Is this
         | backup/restore scenario niche or they should have designed for
         | it?
        
           | inopinatus wrote:
           | in theory, shard your customer databases 1:1, job done. alas,
           | in practice, many SaaS compromise this two ways:
           | 
           | a) overwhelmed by creeping featuritis, each customer's data
           | has relationships to global tables, and
           | 
           | b) they backup their entire database cluster in one snapshot
           | 
           | and there may be other gotchas for restoration, like relying
           | on denormalized views and caches that have to be rebuilt.
           | they may also have erroneously assumed that data protection's
           | main value driver is whole-of-system disaster recovery, which
           | can lead to pathologies such as "we don't have a single-
           | customer restoration tool".
           | 
           | this is not a niche scenario
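
The "shard your customer databases 1:1" idea can be made concrete with a small experiment. This is a minimal sketch, assuming one SQLite file per tenant purely for illustration; the directory layout, the tenant names, and the helper functions are hypothetical. With one database per customer, restoring a single customer becomes a whole-database copy that never touches anyone else's data.

```python
import sqlite3
import tempfile
from pathlib import Path

ROOT = Path(tempfile.mkdtemp())   # live per-tenant databases
BACKUPS = ROOT / "backups"        # one backup file per tenant
BACKUPS.mkdir()

def db_path(directory, tenant_id):
    return str(directory / f"{tenant_id}.sqlite3")

def run(path, sql, params=()):
    """Open the database at `path`, run one statement, commit, return rows."""
    conn = sqlite3.connect(path)
    try:
        with conn:
            return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

def copy_db(src_path, dst_path):
    """Copy a whole database using SQLite's online backup API."""
    src, dst = sqlite3.connect(src_path), sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        src.close()
        dst.close()

def titles(tenant_id):
    rows = run(db_path(ROOT, tenant_id), "SELECT title FROM issues ORDER BY id")
    return [r[0] for r in rows]

# Seed two tenants, each in its own database file.
for tenant, issue in [("acme", "A-1"), ("acme", "A-2"), ("globex", "G-1")]:
    run(db_path(ROOT, tenant),
        "CREATE TABLE IF NOT EXISTS issues (id INTEGER PRIMARY KEY, title TEXT)")
    run(db_path(ROOT, tenant), "INSERT INTO issues (title) VALUES (?)", (issue,))

# Back up every tenant separately.
for tenant in ("acme", "globex"):
    copy_db(db_path(ROOT, tenant), db_path(BACKUPS, tenant))

# A bad script wipes acme while globex keeps working.
run(db_path(ROOT, "acme"), "DELETE FROM issues")
run(db_path(ROOT, "globex"), "INSERT INTO issues (title) VALUES (?)", ("G-2",))

# Restore acme alone; globex's newer data is untouched.
copy_db(db_path(BACKUPS, "acme"), db_path(ROOT, "acme"))
```

The sketch sidesteps both compromises listed above: there are no global tables here, and the backup is per tenant rather than one cluster-wide snapshot.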
        
             | bpicolo wrote:
             | Heck, it's worse now - if your data deletion tooling did a
             | good job, there are dozens or hundreds of microservice
             | databases to restore.
        
             | seanwilson wrote:
             | > shard your customer databases 1:1
             | 
             | What are the downsides to this?
        
               | inopinatus wrote:
               | * makes it much harder to distribute your tables by any
               | other factor, for whatever reason (usually performance,
               | sometimes archival)
               | 
               | * disaggregates data that the SaaS might be interested in
               | querying/updating as an aggregate
               | 
               | * not all ORM frameworks handle this case well, if at all
               | 
               | * dumps are more than a single trivial command
               | 
               | basically all your data operations gain an additional
               | dimension of complexity, and you may not perceive the
               | benefits until much later
        
               | seanwilson wrote:
               | Would it be fair to estimate that the majority of SaaS
               | companies aren't sharding like this then? Seems like a
               | lot of downsides that impact everything often except for
               | backups, which you'd restore rarely.
        
               | mypalmike wrote:
               | Per-customer is a common sharding strategy for noSQL
               | databases, so it may not be entirely uncommon.
        
               | darkwater wrote:
               | All of your points (minus maybe the first one) should be
               | "easily" solved/implemented in a company the size of
               | Atlassian, and maybe there are newer costumers sharded
               | like this already. IMO what happened in this case is
               | basically tech debt that is now being paid with loooot of
               | interests.
        
               | deckard1 wrote:
               | > not all ORM frameworks handle this case well, if at all
               | 
               | typically this is probably for internal
               | reporting/metrics. But yeah, a custom script with direct
               | SQL is in order. Personally my opinion is avoid ORM at
               | all costs. Never seen a benefit that wasn't trivially
               | done in SQL, and the downsides are incredibly painful.
               | 
               | The big downside of sharding out, per customer, is that's
               | a _lot_ of databases to migrate on upgrades. Or rollback
               | if shit hits the fan.
               | 
               | The upside? You can have customers on different versions
               | of your app if you really wanted to do such a thing.
               | 
               | In any case, proper tooling goes a _long_ way to making
               | it the difference between wonderfully manageable and
               | torturous nightmare. Think idempotent backup scripts that
               | are capable of failing at any time and resuming where
               | they died, etc.
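
A backup script that can "fail at any time and resume where it died" might be structured like this. It is a minimal sketch, assuming progress is recorded in a small JSON checkpoint file; `backup_all`, the checkpoint format, and the simulated `fake_backup` are all hypothetical. The per-tenant backup step itself must also be safe to repeat, since a crash can happen after the backup but before the checkpoint is written.

```python
import json
import tempfile
from pathlib import Path

def backup_all(tenants, backup_one, checkpoint):
    """Back up each tenant once, recording progress so a rerun resumes."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for tenant in tenants:
        if tenant in done:
            continue            # already backed up in an earlier run
        backup_one(tenant)      # may raise; earlier progress is already saved
        done.add(tenant)
        checkpoint.write_text(json.dumps(sorted(done)))
    return done

calls = []                 # records which tenants were actually backed up
pending_failures = {"c"}   # tenants whose first attempt crashes

def fake_backup(tenant):
    if tenant in pending_failures:
        pending_failures.discard(tenant)
        raise RuntimeError(f"simulated crash while backing up {tenant}")
    calls.append(tenant)

checkpoint = Path(tempfile.mkdtemp()) / "backup.checkpoint.json"
tenants = ["a", "b", "c", "d"]

try:
    backup_all(tenants, fake_backup, checkpoint)     # dies at "c"
except RuntimeError:
    pass
done = backup_all(tenants, fake_backup, checkpoint)  # resumes at "c"
```

The second run skips "a" and "b" rather than redoing them, which is what makes rerunning after a failure cheap.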
        
           | oauea wrote:
           | Work out a relationship graph and automate the export/import
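
One way to read "work out a relationship graph" is to order tables so that referenced rows are imported before the rows that point at them. This is a minimal sketch using the standard library's `graphlib` (Python 3.9+); the table names and foreign keys are a made-up schema, not Jira's.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the tables it references.
FOREIGN_KEYS = {
    "tenants": set(),
    "users": {"tenants"},
    "projects": {"tenants"},
    "issues": {"projects"},
    "comments": {"issues", "users"},
}

def import_order(foreign_keys):
    """Return tables ordered so every table comes after the ones it references."""
    return list(TopologicalSorter(foreign_keys).static_order())

order = import_order(FOREIGN_KEYS)
delete_order = list(reversed(order))  # children first when removing rows
```

`TopologicalSorter` raises `graphlib.CycleError` if the schema has circular references, in which case some constraints would need to be deferred or relaxed during the import.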
        
         | ollien wrote:
         | As someone who has never had to perform this kind of recovery:
         | why is it so hard?
        
           | jacquesm wrote:
           | Because it is very difficult to maintain relational integrity
           | during a restore like that.
        
             | ollien wrote:
             | Gotcha. I guess you could be heavy-handed and disable
             | foreign key checks, but who knows what other bugs that
             | would bring into the mix.
        
               | teling2 wrote:
               | The other difficulty is if you don't restore the entire
               | state in a single transaction. Imagine you have partial
               | data restored in Table A but haven't updated Table B
               | correspondingly. Now some other program that consumes
               | Table A and Table B and doesn't have error handling will
               | crash (or worse, mutate state in other weird ways).
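
The partial-restore hazard described here is what a transaction guards against. This is a minimal sketch with Python's `sqlite3`, assuming two hypothetical tables `a` and `b`; if restoring `b` fails, the rows already written to `a` are rolled back, so readers never see the half-restored state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (
        id INTEGER PRIMARY KEY,
        a_id INTEGER NOT NULL REFERENCES a(id),
        note TEXT NOT NULL
    );
""")

def restore(conn, a_rows, b_rows):
    """Restore both tables atomically; return False if anything failed."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany("INSERT INTO a VALUES (?, ?)", a_rows)
            conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_rows)
        return True
    except sqlite3.Error:
        return False

def counts(conn):
    a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
    b = conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]
    return a, b
```

A restore whose `b` rows violate a constraint leaves both tables empty rather than leaving `a` populated on its own.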
        
               | jacquesm wrote:
               | That _is_ relational integrity.
        
         | miketria wrote:
         | Hi, this is Mike from Atlassian Engineering. You are right the
         | communications from us have not lived up to our standard. We
         | will focus on this specifically once we restore service and get
         | the post incident review out there. More details here:
         | https://www.atlassian.com/engineering/april-2022-outage-upda...
        
           | lallysingh wrote:
           | Spamming HN isn't helping your cause man.
        
           | [deleted]
        
         | jacquesm wrote:
         | It is, but between 'hard' and 'impossible' there is the nagging
         | question of whether you actually really still _have_ that data.
        
         | chousuke wrote:
         | If the database schema for Jira on the cloud is anything like
         | the Datacenter version, I'm not surprised they're having a hard
         | time restoring data. I once tried to figure out how to find
         | duplicate / redundant project schemas by querying the database
         | (the required APIs are cloud-only) and could not even find
         | which tables stored half the data, never mind how they referred
         | to each other.
        
         | duxup wrote:
         | As this continues I suspect that this might be one of the few
         | times where a lack of transparency / good communication really
         | ... might not be better or worse because the situation is so
         | bad that transparency would be horrible just the same.
         | 
         | Granted that's how all lies start / what sometimes people
         | assume and they're wrong but ... maybe this is that time?
         | 
         | Maybe it is in fact so bad that honesty would be a push or
         | worse?
        
           | adamc wrote:
           | If so, that itself would be a huge red flag for dealing with
           | Atlassian.
        
             | duxup wrote:
             | I think it is...either way.
        
         | tmpz22 wrote:
         | It's super hard no doubt but I wonder how much of the data was
         | hot vs cold.
        
       | abraae wrote:
       | This is extremely poor for a large SaaS company.
       | 
       | A standard RFP question for SaaS should be:
       | 
       | - Can you restore data for a single customer, and if so, what is
       | the RTO for that operation?
       | 
       | A smaller SaaS could be excused for only thinking about full
       | database restores. When you're a scrappy upstart, thinking about
       | hypotheticals is less important than survival.
       | 
       | But for any decent size multi-tenanted SaaS, it's imperative that
       | you have the ability to selectively restore individual customers.
       | 
       | The usual approach is to do a full database restore into a
       | separate instance, then run your pre-prepared "restore customer"
       | scripts to extract a single customer's data from there and pump
       | it across your prod instance. In Oracle for example you might use
       | database links to give your restore code access to prod and also
       | the restore instance at the same time.
       | 
       | Atlassian - MUST DO BETTER.
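
The "restore into a separate instance, then pump one customer across" approach can be sketched in miniature. This is not Oracle and not Atlassian's tooling: it assumes SQLite, where `ATTACH DATABASE` stands in for a database link, and a hypothetical commingled `issues` table keyed by `tenant_id`.

```python
import sqlite3

# `main` plays the role of production; `restored` is the restored backup.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS restored")
for schema in ("main", "restored"):
    conn.execute(
        f"CREATE TABLE {schema}.issues "
        "(id INTEGER PRIMARY KEY, tenant_id TEXT, title TEXT)"
    )

# The backup still holds acme's data.
conn.executemany(
    "INSERT INTO restored.issues VALUES (?, ?, ?)",
    [(1, "acme", "A-1"), (2, "acme", "A-2"), (3, "globex", "G-1")],
)
# Production lost acme, while globex kept working after the backup.
conn.executemany(
    "INSERT INTO main.issues VALUES (?, ?, ?)",
    [(3, "globex", "G-1"), (4, "globex", "G-2")],
)
conn.commit()

def restore_tenant(conn, tenant_id):
    """Copy one tenant's rows from the restored backup into production."""
    with conn:  # a single transaction for the whole tenant
        cur = conn.execute(
            "INSERT INTO main.issues (id, tenant_id, title) "
            "SELECT id, tenant_id, title FROM restored.issues "
            "WHERE tenant_id = ?",
            (tenant_id,),
        )
    return cur.rowcount
```

A real "restore customer" script would also have to walk every related table and deal with key collisions, which this single-table sketch does not attempt.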
        
         | scottlamb wrote:
         | Is it standard for a RFP to have a long list of questions like
         | this? I've never been involved in an RFP from either side.
         | 
         | Is it standard (in addition or instead) to have something
         | more general/forward-looking like: how do you watch other
         | providers' postmortems and apply the lessons to your own
         | system?
         | 
         | > - Can you restore data for a single customer, and if so, what
         | is the RTO for that operation?
         | 
         | If I were to aim something at this specifically, it'd be: can
         | you restore data for N customers or N% of customers, and if so,
         | what is the RTO for that operation?
         | 
         | I mentioned in another comment that Gmail had a similar outage
         | in which they had to restore from tape.
         | https://news.ycombinator.com/item?id=31017160 They had a tool
         | for restoring a single account but not for restoring N accounts
         | in bulk, which would be significantly more efficient than doing
         | the one-account process N times. (E.g., in the case of tape
         | backups, imagine the difference between pulling data from the
         | tape library sequentially for each user vs all N at once,
         | particularly when one tape may hold data for many of these
         | customers.)
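A rough illustration of why bulk restore can beat running the single-account process N times, assuming a hypothetical index that maps each account to the tapes holding its data. The account names and tape labels are invented; this is not a description of Gmail's actual tooling.

```python
from collections import defaultdict


def mounts_one_at_a_time(account_to_tapes, accounts):
    """Tape mounts needed if each account is restored independently."""
    return sum(len(account_to_tapes[account]) for account in accounts)


def plan_bulk_restore(account_to_tapes, accounts):
    """Group the requested accounts by tape so each tape is read once."""
    by_tape = defaultdict(list)
    for account in accounts:
        for tape in account_to_tapes[account]:
            by_tape[tape].append(account)
    return dict(by_tape)


# Hypothetical index: which tapes hold data for which account.
index = {
    "alice": ["T1", "T2"],
    "bob": ["T1"],
    "carol": ["T2", "T3"],
}
wanted = ["alice", "bob", "carol"]

naive_mounts = mounts_one_at_a_time(index, wanted)
bulk_plan = plan_bulk_restore(index, wanted)
print(naive_mounts, "mounts one at a time;", len(bulk_plan), "mounts in bulk")
```

With many accounts sharing each tape, the gap between the two counts grows with N, which is the efficiency argument made above.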
        
           | taude wrote:
           | Yes, pages of them. Multiple pages of security questions,
           | ciphers used, how data is stored, when is it encrypted, etc.
           | I filled out a 20 pager once. As the company got better and
           | more mature, we had a bunch of canned answers to make it
           | easier and faster....
        
         | drsim wrote:
         | Which SaaS platforms provide account-level restores?
         | 
         | If you contact them and say "please restore our data to as it
         | was last week" those I know do not offer this.
        
           | boardwaalk wrote:
           | I wouldn't expect them to advertise such a thing, but the
           | question is "can they recover from their own mistakes" not
           | "can they recover from mine." I don't care if this is with an
           | "account-level restore" or whatever; it shouldn't be my
           | concern.
        
           | fknorangesite wrote:
           | I wouldn't expect it if I just asked. I think it's reasonable
           | as part of their disaster recovery though.
        
           | Tobani wrote:
           | I accidentally built out this feature at a company once and
           | it totally saved our asses a week later.
        
           | hotpotamus wrote:
           | I actually did this once with Dropbox, though it wasn't a
           | feature they actually published. I clobbered my Dropbox
           | directory accidentally, but I was able to find a script
           | someone wrote to roll it back to a previous point in time and
           | it worked quite well. After that I also took my own snapshots
           | just in case.
        
             | MapleWalnut wrote:
             | Dropbox support can rollback your Dropbox account to a
             | previous point in time too.
        
           | ibejoeb wrote:
           | I did. It was a first-principles architectural decision. A
           | client could request any point-in-time within the contracted
           | period, and it could be either a restoration or a fully
           | operational, parallel instance of the account.
           | 
           | It was initially a cover-my-own-ass design, but it turned out
           | to be an extremely popular feature that was never even used
           | for disaster recovery. Instead, it was used for audit
           | support, trial scenarios, projections, and all kinds of other
           | stuff.
        
         | Animats wrote:
         | Rather, customers must stop using Atlassian cloud services.
        
           | imroot wrote:
           | Which is becoming more and more difficult due to them
           | focusing on Cloud Products (my on-prem renewal jumped almost
           | 8x this year).
           | 
           | I'd rather use request tracker or bugzilla over Atlassian
           | these days
        
       | 1970-01-01 wrote:
       | Interesting note: Atlassian stock (NASDAQ: TEAM) is up 4% as of
       | noon today.
        
         | radicaldreamer wrote:
         | It might be a good short opportunity... I imagine a lot of
         | customers are kicking off their own internal process for
         | migrating away from JIRA. By the time they actually do, it'll
         | be at least a couple of quarters from now, which is when the
         | customer hit will start materializing in quarterly results for
         | the company.
         | 
         | Maybe time to throw a few chips at some long term puts?
        
           | eli wrote:
           | Aren't most customers in 12+ month contracts? A migration
           | seems like it would take many months to select a new vendor
           | and migrate regardless. Be careful about the date on those
           | puts. It's pretty hard to out-think the market on this kind
           | of stuff. I'd just as soon bet the other way: few customers
             | will _actually_ churn and in 6 months this won't really
           | matter.
        
             | __app_dev__ wrote:
             | They might even get some new customers after people who
             | never used it look at their site and offerings.
             | 
             | Disclaimer I have puts that expire 4/22 (purchased
             | yesterday) so I hope they go down in the short term. Seems
             | like a total loss now after being up 50% yesterday.
        
           | dahdum wrote:
           | I wouldn't short. They just slapped 400+ customers and likely
           | hundreds of thousands of users in the face and the C-suite
           | didn't think it was important to even acknowledge.
           | 
           | That might look like incompetence, but I think it's
           | confidence. They know the switching costs for large orgs are
           | so high they can treat these people like trash and few if any
           | will leave. I wouldn't be surprised if the total number of
           | seats among affected customers has gone up in a few months.
           | By failing to acknowledge the problem they've kept it out of
           | the mainstream media and financial press.
           | 
           | They have their customers by the balls and don't respect
           | them. That's a short term bullish signal to me.
        
           | Iolaum wrote:
           | Given how out of sync tech people are with the general
           | population I'd be tempted to think it's a buy opportunity.
           | Time will tell.
        
           | __app_dev__ wrote:
           | I bought puts yesterday morning. Was up 50% by the end of day
           | but now down to 50% of what I paid.
           | 
           | Mine expire 4/22 but I have more calls open at the moment
           | anyways so if I had to choose between this going down or the
           | market up I'll take a full loss on these puts (seems likely
           | at the moment)
        
       | mkl95 wrote:
       | The fact it's been so long and they still haven't revealed and
       | explained the root cause of the outage is going to make it hard
       | to regain trust in their buggy, slow tools. The bright side of
       | the incident is that competitors that somewhat care about users
       | have a unique opportunity to stand out.
        
         | pgwhalen wrote:
         | > The fact it's been so long and they still haven't revealed
         | and explained the root cause of the outage
         | 
         | They did last night:
         | https://www.atlassian.com/engineering/april-2022-outage-upda...
        
           | hu3 wrote:
           | > Faulty script. Second, the script we used provided both the
           | "mark for deletion" capability used in normal day-to-day
           | operations (where recoverability is desirable), and the
           | "permanently delete" capability that is required to
           | permanently remove data when required for compliance reasons.
           | The script was executed with the wrong execution mode and the
           | wrong list of IDs. The result was that sites for
           | approximately 400 customers were improperly deleted.
           | 
           | Ouch. I hope no one person got the blame. This is a systemic
           | failure. Regardless, my regards to the engineers involved.
        
             | gtm1260 wrote:
             | Right? The way this reads it seems like one person set a
             | flag incorrectly, something I'm sure we've all done
             | numerous times. And there were no checks down the line to
             | catch it.
        
               | miketria wrote:
               | Hi, this is Mike from Atlassian Engineering. You are
               | right that the checks need to improve to reduce human
               | error, but that's only half of it. I don't see this as
               | human error though. It's a system error. We will be doing
               | some work to make these kinds of hard deletes impossible
               | in our system.
        
             | TheJoeMan wrote:
             | I suppose that's why you don't combine a taser and gun into
             | 1 device with 2 triggers.
        
               | mrits wrote:
               | If you have a 3rd trigger where the gun turns on the user
               | it would be fairly safe.
        
               | femiagbabiaka wrote:
               | the problem is that sometimes that gun looks like a
               | taser.
        
               | dylan604 wrote:
               | Instead, you make it with one trigger and a PRNG that
               | decides which gets activated. Just hope you've chosen the
               | right PRNG!!
        
               | tadfisher wrote:
               | I will then write a script that calls your script with the
               | PRNG of my choice: PRNG1 always returns "trigger 2", and
               | PRNG2 always returns "trigger 1". This detail will be
               | documented in Confluence.
        
               | rubyist5eva wrote:
               | Considering American police can't even seem to get it
               | right when they have two distinct firearms, and are
               | trained to holster them on specific sides so they know
               | what they are grabbing - and still manage to f*ck it
               | up....this might be an improvement.
        
             | hinkley wrote:
             | If coding is theatrical then ops is operatic. You have to
             | telegraph stuff so over the top that the people in the
             | cheap seats know what's going on.
             | 
             | I think what we've lost in the post-XP world is that just
             | because you build something incrementally doesn't mean it's
             | designed incrementally (read: myopically).
             | 
             | My idiot coworkers are "fixing" redundancy issues by adding
             | caching, which recreates the same problem they're
             | (un?)knowingly trying to avoid, which is having to iterate
             | over things twice to accomplish anything. They've just
             | moved the conditional branches to the cache and added more.
             | 
             | Most of the time, and especially on a concurrent system,
             | you are better off building a plan of action first and then
             | executing it second. You can dedupe while assembling the
             | plan (dynamic programming) and you don't have to worry
             | about weird eviction issues dropping you into a logic
             | problem like an infinite loop.
             | 
             | More importantly, you can build the plan and then explain
             | the plan. You can explain the plan without running it. You
             | can abort the plan in the middle when you realize you've
             | clicked the wrong button. And you can clean up on abort
             | because the plan is not twelve levels deep in a recursive
             | call, where trying to clean up will have bugs you don't see
             | in a Dev sandbox.
             | 
             |     Deleting 500 users...
             | 
             | Versus
             | 
             |     Permanently deleting 500 users...
             | 
             | Maybe with a nice 10 second pause (what's an extra ten
             | seconds for a task that takes five minutes?)
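A small sketch of the "build the plan, explain it, then execute it" idea from the comment above. The `Step` type, the in-memory `users` dict and the abort callback are all assumptions for illustration, and only recoverable deletes are modelled, since a permanent delete has no meaningful undo.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    user_id: str
    run: Callable[[], None]
    undo: Callable[[], None]


def build_plan(user_ids, users):
    """Assemble the steps without touching any data; duplicates drop out here."""
    plan = []
    for uid in dict.fromkeys(user_ids):  # dedupe while keeping order
        previous = users[uid]
        plan.append(Step(
            user_id=uid,
            run=lambda uid=uid: users.__setitem__(uid, "deleted"),
            undo=lambda uid=uid, previous=previous: users.__setitem__(uid, previous),
        ))
    return plan


def explain(plan, permanent):
    """Describe the plan loudly, without running it."""
    verb = "Permanently deleting" if permanent else "Deleting"
    return f"{verb} {len(plan)} users..."


def execute(plan, should_abort=lambda: False):
    """Run the steps in order; on abort, undo the completed ones in reverse."""
    done = []
    for step in plan:
        if should_abort():
            for finished in reversed(done):
                finished.undo()
            return False
        step.run()
        done.append(step)
    return True


users = {"u1": "active", "u2": "active", "u3": "active"}
plan = build_plan(["u1", "u2", "u2", "u3"], users)
banner = explain(plan, permanent=False)
print(banner)

# Simulate the operator hitting abort just before the third step.
answers = iter([False, False, True])
completed = execute(plan, should_abort=lambda: next(answers))
print(completed, users)
```

Because the plan is a flat list rather than a deep call stack, cleanup on abort is a simple reverse walk over the steps already run.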
        
             | deckard1 wrote:
             | I don't want to assume too much, since the details are
             | sparse. But I know for a fact that few of my current
             | coworkers know a thing about writing tooling code. It's
             | becoming a bit of a lost art.
             | 
             | Here's the way such a script should be done. You have a
             | dry-run flag. Or, better yet, make the script dry-run
             | _only_. What this script does is it checks the database,
             | gathers actions, and then sends those actions to stdout.
             | You dump this to a file. These commands are executable.
             | They can be SQL, or additional shell scripts (e.g.
             | "delete-recoverable <customer-id>" vs. "delete-permanent
             | <customer-id>").
             | 
             | The idea is you now have something to verify. You can scan
             | it for errors. You can even put it up on Github for review
             | by stakeholders. You double/triple check the output and
             | then you execute it.
             | 
             | Tooling that enhances visibility by breaking down changes
             | into verifiable commands is incredibly powerful. Making
             | these tools idempotent is also an art form, and important.
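One way the dry-run-only script described above might look, as a sketch. The `delete-recoverable` and `delete-permanent` command names come from the comment; the `current` and `desired` state dictionaries and the `plan_commands` helper are assumptions. The script only prints commands for review and never executes them.

```python
def plan_commands(current, desired):
    """Return the shell commands needed to reach the desired state.

    current / desired map customer id -> "active", "recoverable" or
    "permanent". Nothing is executed here; the output is for review.
    """
    commands = []
    for customer_id in sorted(desired):
        target = desired[customer_id]
        if target not in ("recoverable", "permanent"):
            raise ValueError(f"unknown mode {target!r} for {customer_id}")
        if current.get(customer_id) == target:
            continue  # already in the target state, so emit nothing
        commands.append(f"delete-{target} {customer_id}")
    return commands


current = {"cust-1": "active", "cust-2": "active", "cust-3": "recoverable"}
desired = {"cust-1": "recoverable", "cust-2": "permanent", "cust-3": "recoverable"}

first_run = plan_commands(current, desired)
for line in first_run:
    print(line)  # redirect to a file, review it, then run it separately

# Once the reviewed commands have been applied, a second run emits nothing.
second_run = plan_commands(desired, desired)
```

The empty second run is the idempotence property: re-running the tool after a partial or complete execution never queues the same deletion twice.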
        
             | krooj wrote:
             | This speaks to a lack of operational excellence - when you
             | develop a platform like JIRA, Confluence, etc, the
             | operational tools required to manage the systems are just
             | as important as the features themselves. If all you do is
             | pump out features, you're a feature factory and will suffer
             | these kinds of issues. There's no reasonable explanation
             | for needing a script to do what was described when the
             | necessary tooling to generalize such an operation should
             | have been in existence.
        
           | drc500free wrote:
           | Highlighting the text in any of their lists breaks the page
           | in interesting ways, apparently due to some twitter-sharing
           | functionality.
        
           | mkl95 wrote:
           | > Communication gap. First, there was a communication gap
           | between the team that requested the deactivation and the team
           | that ran the deactivation. Instead of providing the IDs of
           | the intended app being marked for deactivation, the team
           | provided the IDs of the entire cloud site where the apps were
           | to be deactivated.
           | 
           | So what they are saying is that they are not testing scripts
           | at some staging server before running them in production.
           | It's wild that they've managed to scale their products so
           | much before something like this happened.
           | 
           | I hope they've learnt their lesson and they set up some QA
           | process for that stuff.
        
             | notdang wrote:
             | it seems that it worked as intended, thus they have a QA
             | process. The problem was in the wrong IDs provided and I
             | doubt that at their scale they have a staging environment
             | that duplicates the customer data.
        
               | dylan604 wrote:
               | Would it be bad practice to append values to a GUID type
               | of ID that would help a human recognize them? For
               | instance, in this specific case they wanted app IDs as
               | APP-XXXXX-XXXX-blahblah and CLOUD-XXXXX-blahblah.
               | 
               | I'm not looking to help their specific problems, but this
               | is more from a general question I've thought of doing but
               | never have done just because I'm sure I'd get laughed at
               | for blazing my own trail
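A sketch of the prefixed-ID idea, assuming two kinds, `APP` and `CLOUD`, as in the example above. The helper names are hypothetical; the point is that a function expecting app IDs can refuse a site ID before doing anything.

```python
import uuid

KINDS = {"APP", "CLOUD"}


def new_id(kind):
    """Mint an ID whose prefix says what kind of thing it names."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind!r}")
    return f"{kind}-{uuid.uuid4()}"


def require_kind(identifier, kind):
    """Return the identifier unchanged, or raise ValueError if it is the wrong kind."""
    prefix, sep, rest = identifier.partition("-")
    if not sep or prefix != kind:
        raise ValueError(f"expected a {kind} id, got {identifier!r}")
    uuid.UUID(rest)  # raises ValueError if the remainder is not a UUID
    return identifier


def deactivate_apps(app_ids):
    """Validate every ID before acting on any of them."""
    checked = [require_kind(i, "APP") for i in app_ids]
    return [f"deactivated {i}" for i in checked]


app_id = new_id("APP")
site_id = new_id("CLOUD")
try:
    deactivate_apps([app_id, site_id])
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```

Validating the whole list up front means a single wrong-kind ID stops the batch before the first deactivation, rather than halfway through.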
        
               | bombcar wrote:
               | This is recommended in my experience, but you do have
               | some potential issues when a UUID gets reused or
               | repurposed.
               | 
               | WHENEVER a human is involved in the chain, UUIDs can be
               | suspicious because there's no easy way to verify what it
               | is, whereas a human has a good chance of realizing that
               | $1,342.34 is probably not a valid date.
        
               | xeromal wrote:
               | I kind of dig it. Something that helps make things
               | obvious to a human
        
               | mkl95 wrote:
               | > I doubt that at their scale they have a staging
               | environment that duplicates the customer data.
               | 
               | If there is no feasible way of replicating their
               | production environment somewhere else, then there should
               | be some sanity checks in place. Something like "if an
               | abnormally high number of customer sites go down during
               | the script's execution, kill the script". This is a 20/20
               | hindsight approach though and if Atlassian engineers
               | can't solve it, I doubt a random HN user like me can.
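A minimal sketch of the "kill the script if too many sites go down" sanity check, under the assumption that the script can probe each site after changing it. The `run_with_breaker` helper, the health probe and the threshold of 3 are all invented for illustration.

```python
class BreakerTripped(RuntimeError):
    """Raised when a run has taken down more sites than it is allowed to."""


def run_with_breaker(site_ids, apply_change, is_site_up, max_down=3):
    """Apply a change site by site, stopping once too many sites are down."""
    changed = []
    down = 0
    for site_id in site_ids:
        apply_change(site_id)
        changed.append(site_id)
        if not is_site_up(site_id):
            down += 1
            if down > max_down:
                raise BreakerTripped(
                    f"{down} sites down after {len(changed)} changes; stopping"
                )
    return changed


# Demo: a change that is supposed to be harmless but takes every site down.
down_sites = set()


def buggy_change(site_id):
    down_sites.add(site_id)


def is_up(site_id):
    return site_id not in down_sites


try:
    run_with_breaker([f"site-{n}" for n in range(400)], buggy_change, is_up)
    tripped = False
except BreakerTripped as exc:
    tripped = True
    print(exc)
```

In this demo the run stops after 4 of the 400 sites instead of all of them, which turns a mass outage into a small, quickly noticed one.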
        
           | tempest_ wrote:
           | Is it just me or is highlighting on that site broken?
           | 
           | Perhaps my ad blocker is causing that stupid highlight to
           | tweet js they are using to break.
        
         | teh_klev wrote:
         | > and they still haven't revealed and explained the root cause
         | of the outage
         | 
         | They did, this post by Atlassian from yesterday is referenced
         | in the article.
         | 
         | https://www.atlassian.com/engineering/april-2022-outage-upda...
         | 
         | Still doesn't excuse them for the time taken to come clean.
        
       | selimnairb wrote:
       | CTO should be fired.
        
       | scottlamb wrote:
       | Gmail had a vaguely similar outage years ago. [1] tl;dr:
       | 
       | 1. Different root cause. There was a bug in a refactoring of
       | gmail's storage layer (iirc a missing asterisk caused a pointer
       | to an important bool to be set to null, rather than setting the
       | bool to false), which slipped through code review, automated
       | testing, and early test servers dedicated to the team, so it got
       | rolled out to some fraction of real users. Online data was
       | lost/corrupted for 0.02% of users (a huge amount of email).
       | 
       | 2. There were tape backups, but the tooling wasn't ready for a
       | restore at scale. It was all hands on deck to get those accounts
       | back to an acceptable state, and it took four days to get back to
       | basically normal (iirc no lost mail, although some got bounced).
       | 
       | 3. During the outage, some users could log in and see something
       | frightening: an empty/incomplete mailbox, and no banner or
       | anything telling them "we're fixing it".
       | 
       | 4. Google communicated more openly, sooner, [2] which I think
       | helped with customer trust. Wow, Atlassian really didn't say
       | anything publicly for nine days?!?
       | 
       | Aside from the obvious "have backups and try hard to not need
       | them", a big lesson is that you have to be prepared to do a
       | _mass_ restore, and you have to have good communication: not only
       | traditional support and PR communication but also within the UI
       | itself.
       | 
       | [1]
       | https://static.googleusercontent.com/media/www.google.com/en...
       | 
       | [2] https://gmail.googleblog.com/2011/02/gmail-back-soon-for-
       | eve...
        
         | fishnchips wrote:
         | Funny enough, most of what we restored then was spam (ex gTape
         | SRE, remember the outage).
        
       | Aissen wrote:
       | The sad truth is that with 99.8% of customers unaffected, it was
       | probably thought to be a minor issue. If those customers didn't
       | have Gergely's ear we probably wouldn't have heard about it.
        
         | miketria wrote:
         | Hi, this is Mike from Atlassian Engineering. Not a minor issue.
         | Once we knew the extent and severity of the incident, we had
         | hundreds of engineers engaged and working to restore service.
        
         | tpmx wrote:
         | Is there a source on this number?
        
           | Aissen wrote:
           | From the article:
           | 
           | > Atlassian claims the customers impacted were "only" 0.18%
           | of its customer base at 400 companies.
           | 
           | From https://jira-software.status.atlassian.com/ :
           | 
           | > The team is continuing the restoration process for the ~400
           | impacted customers.
        
             | tpmx wrote:
             | > The team is continuing the restoration process for the
             | ~400 impacted customers. We have restored functionality for
             | 45% of impacted users.
             | 
             | If this is truthful it implies more than 400
             | impacted customers.
        
       | jgrahamc wrote:
       | _Communicate directly and transparently_
       | 
       | Yes. Always.
        
       | politelemon wrote:
       | > it takes between 4 and 5 elapsed days to hand a site back to a
       | customer.
       | 
       | Atlassian's SLA page says, Premium Cloud Products 99.9%
       | 
       | That's 43 minutes of downtime per month.
       | 
       | That works out to, Atlassian can't have any more downtime for the
       | next 14 years. Are SLAs even real?
       | 
       | I'm being slightly facetious. From the page text it's just a
       | threshold after which I think you're entitled to some money back
       | for that month.
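The arithmetic above can be checked with a few lines, assuming a 30-day month and treating the 99.9% figure as a monthly downtime budget (which, as noted, is not how the SLA is actually applied). The function names are made up for the sketch.

```python
def monthly_budget_minutes(sla_percent, days_in_month=30):
    """Minutes of downtime per month allowed at a given uptime percentage."""
    return days_in_month * 24 * 60 * (1 - sla_percent / 100)


def years_of_budget(outage_days, sla_percent):
    """How many years of monthly downtime budget an outage would use up."""
    outage_minutes = outage_days * 24 * 60
    return outage_minutes / monthly_budget_minutes(sla_percent) / 12


budget = round(monthly_budget_minutes(99.9), 1)
years_for_5_days = round(years_of_budget(5, 99.9), 1)
years_for_4_days = round(years_of_budget(4, 99.9), 1)
print(budget, years_for_4_days, years_for_5_days)
```

A 5-day restore works out to roughly 14 years of budget and a 4-day restore to roughly 11, so the "14 years" figure matches the upper end of the quoted range.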
        
         | bborud wrote:
         | Think of SLAs as "this is how hard we'll scramble when shit
         | hits the fan".
         | 
         | Except...I don't even believe that.
        
           | chrsig wrote:
           | It's more "this is our contractual obligation, if we're down
           | more than this, then we might not charge you"
        
             | dylan604 wrote:
             | Lawyers are involved, so I'd assume some text about
             | "excluding acts of god, sabotage, etc." to weasel their way
             | out of things. They might even be able to get away with
             | "acts of incompetence" however a lawyer might phrase that
             | to allow their client to weasel.
        
               | mywittyname wrote:
               | That's a good way to get executive approval to replace a
               | system. Google or Apple can get away with this kind of
               | behavior, I doubt Atlassian can.
               | 
               | This outage alone has spurred conversations in slack
               | about how terrible JIRA is and why we should replace it.
               | If this kind of shit was pulled, I can guarantee we'd be
               | on shortcut, linear, or something else in short order.
        
               | [deleted]
        
               | MajorBee wrote:
               | > Google or Apple can get away with this kind of
               | behavior, I doubt Atlassian can
               | 
               | Atlassian absolutely can in enterprise settings. In my
               | company (a large cloud company), if JIRA goes down, large
               | swathes of the business will also stall, including code
               | deployment (deployments are tracked through change
               | management JIRA tickets). We also use the DC version of
               | Atlassian products, so presumably we aren't at the
               | mercy of Atlassian cloud engineers.
        
               | TheCoelacanth wrote:
               | SLA credits are a thing that actually happen in the
               | industry. I wouldn't automatically assume that they will
               | be able to weasel out of it.
               | 
               | They are typically limited to the amount that you
               | actually paid, though, so basically they don't charge you
               | for the time when you couldn't use the product. You
               | usually won't get more than that.
        
             | [deleted]
        
           | mmcgaha wrote:
           | I think of SLAs as how do we design this thing. Ask for a
           | system without an SLA and I will give you a system that is
           | well designed and almost never goes down. As soon as you ask
           | for an SLA, I will give you an over engineered system that
           | costs more, takes longer to implement and is slower to
           | iterate but it will almost never go down either.
        
           | echelon wrote:
           | In some industries, three nines isn't exactly stellar. Every
           | service I've worked on recently has demanded five nines of
           | uptime and tons of reporting on latency and even seconds-long
           | outages.
           | 
           | I've been on-call during a total infrastructure outage whose
           | root cause was a service my team owned [1]. Our CEO was aware
           | of it. Customers and business partners were aware of it.
           | Other CEOs were aware of it. The media, you name it.
           | 
           | Some outages can be "business ending" or "business damaging".
           | That's why we made a practice and process of performing
           | regular disaster recovery exercises, had exceptionally well
           | documented runbooks, had monitoring attached to everything,
           | and engineered for resilience.
           | 
           | Though I'm not familiar with how Atlassian runs, I think this
           | is an "engineering culture" thing or can be mitigated with a
           | proper approach.
           | 
           | [1] The company has only had a few of these in total, and no
           | member of our team was culpable for the complicated failure.
        
           | krinchan wrote:
           | Per the article, if you experience < 95% uptime in any 30 day
           | window you qualify for a 50% discount. On a month or your
           | next year or ... ? it doesn't say.
        
         | hinkley wrote:
         | Basically not counting lost sales their income for this year
         | went down 2%, which is not as big a deal to them as it is to
         | their customers.
        
         | 0xbadcafebee wrote:
         | The typical SLA has no teeth because even if the customer gets
         | their money back, the real harm to the customer may be orders
         | of magnitude greater than what they paid for the service. Some
         | services are contractual or tightly embedded and you know
         | you're not gonna lose the customer if your service goes down
         | frequently. If the service provider doesn't lose money or face,
         | they aren't motivated to prevent the downtime.
         | 
         | One alternative I thought of is the Charity SLA. The service
         | provider pledges to give $5,000 to charity for every minute of
         | downtime. Now everyone within the company knows "if we're down,
         | we're losing thousands of dollars a minute!" and thus will be
         | motivated to ensure the services stay up. But even if the
         | services go down, the company's making tax-free donations,
         | which isn't really bad for anybody. The company could even have
         | a specific downtime goal every year, to make sure their
         | monitoring/alerting/runbooks actually work, and to ensure they
         | donate every year.
        
         | bluedino wrote:
         | > Are SLAs even real?
         | 
         |  _Tommy: Here's the way I see it, Ted. Guy puts a fancy
         | guarantee on a box 'cause he wants you to feel all warm and
         | toasty inside.
         | 
         | Ted Nelson: Yeah, makes a man feel good.
         | 
         | Ted Nelson: But why do they put a guarantee on the box?
         | 
         | Tommy: Because they know all they sold ya was a guaranteed
         | piece of shit. That's all it is, isn't it? Hey, if you want me
         | to take a dump in a box and mark it guaranteed, I will._
        
           | rglover wrote:
           | Haha I needed this, thank you.
        
         | nh2 wrote:
         | > it's just a threshold after which I think you're entitled to
         | some money back for that month
         | 
         | That is exactly what SLAs are.
         | 
         | There are just a lot of people applying the wishful thinking
         | that SLAs are a goal or metric of uptime.
         | 
         | Consider the AWS S3 page on the topic:
         | https://aws.amazon.com/s3/sla/
         | 
         | "Reasonable efforts"; if not met, you get some fraction of the
         | money back.
         | 
         | S3 has worse uptime than my desktop PC over the last years, but
         | affected users got some fraction of their spending back.
        
           | iso1631 wrote:
           | > S3 has worse uptime than my desktop PC over the last years
           | 
           | That's sacrilege on HN
        
         | colechristensen wrote:
         | SLAs aren't real unless there's a contractual consequence for
         | not meeting them.
         | 
         | And a couple of percent discount on services for the extra
         | downtime isn't really a meaningful consequence.
        
           | imglorp wrote:
           | I was just thinking that there's a hysteresis function here:
           | the service is worth much more to your team after you've
           | wired your whole process into it than before you joined.
           | 
           | Offering you a free month or whatever doesn't acknowledge all
           | the person-hours lost.
        
             | colechristensen wrote:
             | There are certainly circumstances where you might have
             | grounds to sue for damages if an SLA is breached. I'm not
             | sure how often this happens but the losses from something
             | like Jira being down could be quite a lot more than anybody
             | pays for it. It's quite likely that defenses against
             | exactly this are written into the contracts you agree to
             | signing up for the service though.
        
         | dxf wrote:
         | >Are SLAs even real?
         | 
         | SLI: Some metric you use to measure a thing (e.g. uptime,
         | latency, etc.)
         | 
         | SLO: Some objective you try to hit, as measured by the SLI
         | (e.g. "99.99% of requests are processed within 3 seconds")
         | 
         | SLA: A promise to a customer that they will meet some SLO, and
         | consequences if they don't. If there aren't consequences for
         | not meeting the SLO, then measuring and tracking the metrics is
         | a pointless exercise.
         | 
         | The SLA is "real" to the extent Atlassian is adhering to any
         | listed consequences.
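
A minimal sketch of how the three terms above fit together, not anything Atlassian publishes: the SLI is a measured number, the SLO is a target for that number, and the SLA maps a miss to a consequence. The function names, the sample latencies, and the credit tiers below are all hypothetical, chosen only for illustration.

```python
def sli_fraction_within(latencies_s, threshold_s):
    """SLI: fraction of requests that completed within threshold_s seconds."""
    if not latencies_s:
        raise ValueError("need at least one request to compute an SLI")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s)


def slo_met(sli, target):
    """SLO: the objective is met when the measured SLI reaches the target."""
    return sli >= target


def sla_credit_pct(sli, tiers, fallback_credit_pct):
    """SLA: map the measured SLI to a service credit.

    tiers is a list of (minimum_sli, credit_pct) pairs sorted from the
    highest minimum to the lowest. The first tier whose minimum the SLI
    reaches applies; below every tier, fallback_credit_pct applies.
    """
    for minimum_sli, credit_pct in tiers:
        if sli >= minimum_sli:
            return credit_pct
    return fallback_credit_pct


# Hypothetical numbers, for illustration only.
HYPOTHETICAL_TIERS = [(0.9999, 0), (0.99, 10), (0.95, 25)]
sample_latencies_s = [0.2, 0.5, 1.0, 4.0]

sli = sli_fraction_within(sample_latencies_s, threshold_s=3.0)
print(sli, slo_met(sli, 0.9999), sla_credit_pct(sli, HYPOTHETICAL_TIERS, 50))
```

Without the last function there is only a measurement and a target, which is the point being made above: the agreement is only as real as the consequence attached to it.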
        
           | aunty_helen wrote:
           | You could use it as a material breach of the contract and
           | possibly get out of any arrangement you have with Atlassian.
        
             | inopinatus wrote:
             | A typical SLA precludes that by specifying the remedy for
             | noncompliance with the performance measure. Only if they
             | fail to apply the remedy is there a material breach. For a
             | month-to-month SLA, this limits liability to one month's
             | subscription, as agreed in black-and-white.
             | 
             | Customers that demand service level agreements often fail
             | to recognise that they cut both ways.
        
           | bombcar wrote:
           | Most SLAs say "if we miss this, you get time for free" which
           | means that these companies will hopefully get a refund ...
           | for the time they can't use the service.
           | 
           | SLAs are mostly aspirational.
        
             | hinkley wrote:
             | Car warranties are also aspirational/virtue signaling, to
             | a point.
             | 
             | If the maintenance costs exceed the margins on the cars you
             | lose money. Do that on too many product lines too often
             | and you're looking at bankruptcy. But some makers clearly
             | are more risk averse than others, so a 6 year warranty from
             | maker X does not translate to a 7 year warranty from maker
             | Y.
        
               | mh- wrote:
               | But Atlassian's (published*) SLA offers a credit of at
               | most 50% of the month.. not really the same as a
               | manufacturer warranty on a car, where the cost of
               | servicing could easily exceed the price paid for the car.
               | 
               | * - their larger customers will have negotiated SLAs.
               | 
               | edit: to be clear, I expect Atlassian will offer
               | concessions beyond their SLA obligations. I'm only
               | responding to the comparison.
        
             | towelrod wrote:
             | The linked article directly talks about this: at this level
             | of downtime customers are promised a 50% discount. That's
             | what the SLA means, effectively.
        
           | profmonocle wrote:
           | > and consequences if they don't.
           | 
           | And these consequences usually just amount to getting some
           | percentage of your service fees back. I'm sure the affected
           | customers will get their entire monthly Atlassian Cloud fees
           | back. Since this is _so_ severe maybe Atlassian will even
           | give them credits for some # of free months.
           | 
           | But there's no way the amount they'll get from Atlassian is
           | going to come close to what they're losing in productivity by
           | not having access to Jira & Confluence. At my company,
           | getting an entire free year of Jira wouldn't be worth Jira
           | being inaccessible for a week.
        
             | bee_rider wrote:
             | Does that indicate it would be preferable to pay more for a
             | more reliable solution, if such a thing were to exist?
             | Although, it definitely would be hard to quantify 'more
             | reliable' there.
        
         | miketria wrote:
         | Hi, this is Mike from Atlassian Engineering. For the customers
         | impacted by this incident covered by an SLA, we will adhere to
         | our contractual terms. However, given the long duration of this
         | outage, we are planning to go above and beyond for our impacted
         | customers. We are currently focused on restoring service, but
         | after that will be discussing how we can make it right for each
         | impacted customer.
        
           | encryptluks2 wrote:
           | It looks like you are focused on Hacker News comments.
        
         | leeoniya wrote:
         | > Atlassian's SLA page says, Premium Cloud Products 99.9%
         | 
         | > That's 43 minutes of downtime per month.
         | 
         | we need a better default way to communicate SLOs than "number
         | of 9s", one that is more human. that the status quo has stayed
         | this way can only be attributed to intentional dark patterns,
         | imho.
        
           | deathanatos wrote:
           | ... honestly, even the "number of 9s" concept is a struggle
           | for some companies. I've seen a number of SLAs that fail to
           | correctly state a unit: it's %/<unit of time>, and I see the
           | "unit of time" get dropped every now and then, and the
           | resulting thing is meaningless absurdity.
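
The "43 minutes" figure and the point about the missing unit of time can both be checked with a little arithmetic. This is a sketch that assumes a 30-day month as the measurement period; the function names are made up for illustration, and the 9-day figure is the outage length quoted elsewhere in this thread.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, period_days):
    """Downtime budget implied by an availability percentage over a period."""
    period_minutes = period_days * MINUTES_PER_DAY
    return (100.0 - availability_pct) / 100.0 * period_minutes


def achieved_availability_pct(downtime_minutes, period_days):
    """Availability actually delivered, given observed downtime in a period."""
    period_minutes = period_days * MINUTES_PER_DAY
    return (1.0 - downtime_minutes / period_minutes) * 100.0


# 99.9% means very different budgets depending on the period it covers,
# which is why dropping the unit of time makes the number meaningless.
for days in (1, 30, 365):
    budget = allowed_downtime_minutes(99.9, days)
    print(f"99.9% over {days} day(s): {budget:.1f} minutes of downtime")

# Nine full days of downtime within a 30-day month.
print(f"{achieved_availability_pct(9 * MINUTES_PER_DAY, 30):.1f}% achieved")
```

Stating the budget directly ("about 43 minutes per 30-day month") carries the same information as "three nines per month" and is harder to misread.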
        
         | mc4ndr3 wrote:
         | I've yet to work at an office that paid sufficient attention to
         | regular backup & restore validation, to scalable design, or
         | proper unit testing, or to basic security updates. Upper
         | management is repeatedly incentivized to produce vaporware, not
         | reliable service.
         | 
         | Suits think a crummy Flash quiz on PII is enough to stop leaks.
         | The automotive industry couldn't stop airbags from acting as
         | claymores. It's even harder to get good code approved in tech.
        
       | hinkley wrote:
       | The longest Atlassian outage _so far_ ...
        
       | anshumankmr wrote:
       | I don't get it. JIRA is working for me.
        
         | vvpan wrote:
         | Article points out that 400 companies were affected.
        
         | politelemon wrote:
         | A subset of their customers are affected (badly). Enumerate
         | your blessings!
        
       | [deleted]
        
       | 1970-01-01 wrote:
       | Wouldn't you love to see the Atlassian internal JIRA epic for
       | this outage?
        
       | captaincaveman wrote:
       | A dumpster fire of a company that has terrible communication with
       | customers outside of outages as well.
        
       | snarkerson wrote:
       | > Most of them said they won't leave the Atlassian stack, as long
       | as they don't lose data. This is because moving is complex and
       | they don't see a move would mitigate a risk of a cloud provider
       | going down. However, all customers said they will invest in
       | having a backup plan in case a SaaS they rely on goes down.
       | 
       | The real key lesson here. Your business is important to you. Not
       | so much to the service provider.
        
       | hougaard wrote:
       | Always judge companies on how they handle a crisis, not on how
       | they do when everything runs smoothly.
        
       | escot wrote:
       | When doing bulk deletes like this, what safeguards do you put in
       | place, other than testing the script up/down in another
       | environment, turning off app servers etc (which I'm guessing they
       | did not do)?
        
         | mh- wrote:
         | Depends how complex the query/procedure is.
         | 
         | Naive approach, replace delete with select and see if you're
         | surprised at the results.
         | 
         | More mature approach, especially in an environment where
         | engineers are running bulk changes against the database, you
         | don't do bulk deletes. You change that delete into an update
         | that marks things for later collection.
         | 
         | One tactic I've seen that worked, assuming you have
         | straightforward relational tables: you add a "marked for
         | deletion" column whose value is an identifier for the single
         | run of the bulk job you just did. Then you can query rows with
         | that value in that column to ensure it had the desired effect.
         | If you're satisfied, you run another bulk job which doesn't re-
         | run your original query.. it just deletes rows with that
         | "marked" value.
         | 
         | Lots of places rely on schema-enforced foreign keys and
         | cascading deletes though. In that case, my recommendation is:
         | don't.
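
The "marked for deletion" tactic described above could look roughly like the sketch below. It is an illustration under stated assumptions, not how Atlassian's tooling works: the `sites` table, its columns, and every function name are hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import sqlite3
import uuid

# Hypothetical schema: marked_for_deletion holds the id of the bulk run
# that flagged the row, or NULL if the row is not flagged.
SCHEMA = """
CREATE TABLE sites (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    active INTEGER NOT NULL,
    marked_for_deletion TEXT
)
"""


def mark_for_deletion(conn, run_id, where_sql, params=()):
    """Step 1: tag matching rows with run_id instead of deleting them.

    where_sql is SQL written by the operator, never user input.
    """
    cur = conn.execute(
        "UPDATE sites SET marked_for_deletion = ? "
        f"WHERE marked_for_deletion IS NULL AND ({where_sql})",
        (run_id, *params),
    )
    return cur.rowcount


def review_marked(conn, run_id):
    """Step 2: list exactly what this run flagged, for a human to check."""
    return conn.execute(
        "SELECT id, name FROM sites WHERE marked_for_deletion = ? ORDER BY id",
        (run_id,),
    ).fetchall()


def unmark(conn, run_id):
    """Rollback: clear the flag if the review shows surprises."""
    cur = conn.execute(
        "UPDATE sites SET marked_for_deletion = NULL "
        "WHERE marked_for_deletion = ?",
        (run_id,),
    )
    return cur.rowcount


def purge_marked(conn, run_id):
    """Step 3: delete by run id only; the original query is never re-run."""
    cur = conn.execute(
        "DELETE FROM sites WHERE marked_for_deletion = ?", (run_id,)
    )
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO sites (name, active) VALUES (?, ?)",
        [("legacy-app", 0), ("customer-a", 1), ("customer-b", 1)],
    )
    run_id = uuid.uuid4().hex
    marked = mark_for_deletion(conn, run_id, "active = ?", (0,))
    rows = review_marked(conn, run_id)
    # Purge only if the flagged set is exactly what was expected.
    if rows == [(1, "legacy-app")]:
        deleted = purge_marked(conn, run_id)
    else:
        deleted = 0
        unmark(conn, run_id)
    remaining = conn.execute("SELECT name FROM sites ORDER BY id").fetchall()
    return marked, rows, deleted, remaining


print(demo())
```

Because the purge step filters only on the run id, a mistake in the original selection shows up during review, while it can still be undone with `unmark`.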
        
       ___________________________________________________________________
       (page generated 2022-04-13 23:00 UTC)