[HN Gopher] Inside the longest Atlassian outage Author : andyjohnson0 Score : 714 points Date : 2022-04-13 15:27 UTC (7 hours ago) (HTM) web link (newsletter.pragmaticengineer.com) | hsnewman wrote: | Sounds like the continuity planners at Atlassian (the fall guys) | will be looking for a new job. | bogomipz wrote: | >"The outage is its 9th day, having started on Monday, 4th of | April." >"It took until Day 9 for executives at the company to | acknowledge the outage." | | Just to put this in perspective. These executives would have left | on a Friday afternoon to start their weekends without bothering | to publicly address an ongoing outage that was by then 5 days | old. | | This is mind-boggling. Like did some C-level exec say something | like "Let's just park this whole outage communication discussion | until Monday, have a good weekend everyone."? | gunapologist99 wrote: | Trello seems to still be up? | R0ger wrote: | I guess this is a wake-up call for the people rushing to SaaS | solutions. | brianwawok wrote: | Is it? | | We use JIRA. Not impacted. | | If this had hit us, we would just switch to Excel or something | for a week/month? | | But maybe we are a very light user of JIRA. Nothing in there | can't be replaced. It's "nice" to be able to go look up a 3 | year old bug and which client reported it, but not really | crucial for day to day ops. | ProAm wrote: | > We use JIRA. Not impacted. | | This time. | bborud wrote: | "Switch to Excel for a week/month" | | Right. | mrits wrote: | I wonder why you use Jira if a spreadsheet is sufficient for | your use case. | HeyLaughingBoy wrote: | He didn't say it was sufficient; he said they could do it | for a short while. I consider myself in the same situation: | we depend on Jira, but for a week or so it's not a big deal | to use a bunch of Post-It notes.
| function_seven wrote: | Same reason I use oil lamps when the power is out, even | though electric bulbs are my normal lighting. | | A spreadsheet may be sufficient, but it's not as good as a | system designed for development workflows. | | (This comment sounds like I have a speck of love for JIRA. | I don't! :) | mrits wrote: | I don't see this as a valid comparison. There is | information loss. This has happened to my team which had | about 50 people and it was very chaotic. It took us | several days to just recreate the state our features were | in. | | Today it would be even more troublesome as we have a lot | of integration rules dependent upon the workflow. I'd | probably just recommend everyone use a few weeks for | self improvement and only address critical production | issues. | dangus wrote: | Outages definitely happen with on-premise software. | | At some point the logo on the engineer's badge doesn't really | matter. | kgeist wrote: | We use on-premises setups for almost everything (we generally | avoid cloud solutions to have full control of our data). | Sometimes (approximately once a month) it goes down for a few | minutes, which already feels like torture because all our | processes depend on it. I can't imagine having no access to it | for several weeks; all our work would grind to a halt... The | office of the guy who administers on-premise servers is literally | next door, all it takes is a visit to him and everything | works again after 5 minutes. Reading horror stories like this | (Slack being down, Atlassian being down, no one knows what is | happening and when it will end etc.), I wonder why many companies | choose cloud solutions for critical business processes. Is it | pricing? Ease of use? I can understand why very small companies | would choose it, but I don't understand why a medium/large | business would choose anything but an on-premises setup. | lloydatkinson wrote: | Cloud solutions can work well.
I've used GitHub, Azure DevOps, | and Bitbucket (another wonderful Atlassian product /s) and | Bitbucket frequently craps out, multiple times a week. We need | to rerun builds in TeamCity because Bitbucket stops talking to | it. | Msw242 wrote: | You're assuming every team would have better uptime with | in-house solutions | | I think many would have worse uptime even with more headcount | tetha wrote: | In our experience, this strongly depends on the services | involved, as well as the scale. | | For example, for our own service: If you have a hundred or | two hundred licenses, you can drop our system on a Linux box | and usually you have to throw a yum update and one or two | service restarts at it every few months and it just works. I | honestly wouldn't be surprised if many of our small on-prem | solutions have better uptime than the SaaS clusters, or are | capped in uptime by some externality, rendering the system | downtime irrelevant. If their VMware cluster is down, our | system is down, but no one cares. | | This also mirrors a lot of our internal systems. At a small | scale, you can just dump chef, jenkins, sonar, nexus, | whatever on a Linux box and forget about it. | | However, this changes with high license counts. We have | single customers in our SaaS offering that are more than | 50-100x bigger than the small on-prem systems. At that point, | our SaaS offering is better than anything the customer could | do on-prem. I'm confident saying this about all of our | customers, except maybe 2. | bob1029 wrote: | I find this argument to be totally BS these days. | | If anything, a smaller company with a smaller footprint and | fewer total requirements is going to be more likely to manage | a vertical slice of some SaaS product. | | The reason things like GitHub go down so often is _because_ | they are public/shared resources. | kgeist wrote: | >The reason things like GitHub go down so often is because | they are public/shared resources.
| | Very much this. Managing shared resources at scale is | pretty hard. We have a bunch of internal sites made by | interns as part of their internships, and, funny enough, | those sites have much greater uptime and appear more stable | than our own multi-tenant SaaS solution made by seasoned | devs. | kgeist wrote: | I've heard this argument many times before, but is there | research into this? I.e. where they would compare uptime of | cloud vs. on-premises across a wide range of companies. | rirze wrote: | I mean, you're going to get biased results, no? Only | companies who are confident in self-hosting will self-host | it. You won't have any real data about companies who are | not confident in self-hosting maintaining their on-premises | version of the software. | NationalPark wrote: | What do you do if the on-premises guy gets hit by a car and | isn't in his office? | kgeist wrote: | There's IIRC 3 or 4 people in their department, they | administer the whole building (wifi, security cams, LDAP, | etc.), not only the on-premises servers. From what I | gathered, our internal systems usually go down due to lack of | disk space or some bug in the software which requires merely | a reboot, it's not rocket science. Another thing is that our | IT department (for internal systems) and the SRE department | (for client-facing systems) have 24/7 on-call duty so it's | unlikely that no one will respond. | oriki wrote: | The same thing that the cloud company would do. If there are | other people there who share that guy's responsibilities, | have them do it. If there aren't, you should have an on-call. | | Cloud just outsources that problem to another business. Sure, | they have better reasons to actually cover those positions | and make sure they have on-calls and backup and a disaster | plan, but just because you pay extra money for it doesn't | actually make it work better if the company underlying it | sucks. 
| snark42 wrote: | > but I don't understand why a medium/large business would | choose anything but an on-premises setup. | | Atlassian is in the process of killing the on-premise | small/medium business option, already announced an EOL date. | | Move to the cloud, buy a 500+ user solution for a much higher | price or migrate away are my choices. Of course I use the local | database and have local services JIRA/Confluence talk to so | it's not really an option to move to the cloud. | | I assume lack of competent on-site staff 24/7, having someone | else to blame as well as lower costs are why people choose the | cloud over on-premise though. | originalvichy wrote: | I am biased but I can tell you what works best for mid-large | companies: having a solution provider. Basically a partner that | hosts and maintains the instance and has enough Atlassian | certified people to help you with any question so that you will | never have to hire people to just maintain the beasts or tell | you about features, tricks or plugins that could solve problem | X. | | Experienced people hosting and tuning Atlassian products has a | greater success rate than someone doing it alone for a large | company. Almost every time I've migrated an old Atlassian | installation under our wing it's given me shock how users have | been made to suffer the loading times and perfs that come from | underprovisioning (db or actual machine) and messy | configuration. I'm not blaming the former admins but it just | happens. Usually end users are happy after we clean the mess up | and everything feels snappy. | | Disclosure: I've worked in this kind of expert role. | crummy wrote: | I can't see the difference between a "solution provider" that | hosts your Jira and just getting Atlassian to do it. What's | stopping the solution provider from accidentally running a | script that deletes some customer's files and struggling to | do a partial backup restore? 
| snark42 wrote: | I would assume the MSP is running a dedicated instance and | can do a full/backup restore just for the user they're | supporting. | | If it's some multi-tenant solution it's no better. | originalvichy wrote: | Correct. There are probably not a lot of MSPs that have | so many customers that they need to share that much data, | and their customers probably use MSPs for the strict | purpose that they don't want to share things with other | companies. | originalvichy wrote: | Because you can get the best parts of self-hosted and | managed services. And on that backup question: self-hosted | Atlassian is vastly easier to protect against disasters. | The problem these Atlassian guys had arose from multi- | tenant architecture. Usually managed service providers will | host your stack on individual databases and VMs, and | backing up the software is just a matter of taking pg_dumps | and rsyncing certain directories (pretty old school) or | just taking disk level snapshots. | | Many medium-large corporations have their own cloud | environments that their IT Ops control. Solution providers | can host Atlassian stacks on their own cloud environment | where they are not affected by data privacy concerns (it's | in their already green-lit cloud providers data center) so | they can host it behind a firewall with only VPN access | allowed. They can also do all the magic you can usually do | with web software like put a frontend proxy in front of it, | or use more flexible/legacy authentication methods. Not to | mention that for example you could have a Jira Cloud that | you would need to integrate with a SCM program. Jira data | could be "OK" to live in the cloud but code would be a big | no-no. These problems can be solved by having them all live | behind the firewall. | | A competent managed solution provider also has consultants | that can train or instruct on usage. 
It costs, but it is | simpler and faster than having to go through the forums or | send a support ticket for every small issue to Atlassian | itself. | pphysch wrote: | It seems like if you are going to pay for a bunch of SaaS | seats AND a team of technicians/engineers to make it work, | you might as well just do the latter and roll your own | solutions... | | A lot of these SaaS are just glorified Rails apps with a | patina of professional "security" and "reliability", and | loads of extra junk that your co will never use. | originalvichy wrote: | Trust me, if someone could clone Jira and its functionality | they would have done so already. Truth is that if you build | one product for 20 years you have a giant lead in features. | If all it took was having a Kanban board then Jira would | have died years ago. | yibg wrote: | What do you do if your on-prem setup loses data? There is an | implicit assumption here that on-prem is more reliable than | cloud. Less downtime, fewer chances of data loss etc. Obviously | it depends on which cloud product we're talking about but I | don't think a blanket "my on-prem goes down less and when it | does go down I can get it back up sooner" is true. | originalvichy wrote: | I think for that question we also have to define on-prem just | to be clear. To many, on-prem means "own cloud subscription". | kingofpandora wrote: | Engineering mistakes happen. | | The most inexcusable thing is not communicating with the paying | customers who have been affected for over a week. | | Atlassian's Global Head of Customer Success probably should have | been fired but here she is promoting Atlassian Cloud on LinkedIn | three days ago: https://www.linkedin.com/mwlite/in/gertie-rizzo-5b70061 | | Actually reading a bit more, it seems like their customer team | was partying in Las Vegas instead of taking care of business: | https://www.linkedin.com/mwlite/feed/hashtag/atlassianteam22 | | Priorities.
| naoqj wrote: | _sword wrote: | Fair criticisms on response times but regarding Vegas, it was | their annual user conference last week in Vegas. | madmulita wrote: | They claim they test backups quarterly yet they don't have a | procedure in place to restore the operation. We all know your | backup is not tested until you restored everything | successfully. This is not an engineering mistake, it is a flat | out lie. | iancarroll wrote: | Well, their explanation makes sense. These are multi-tenant | environments where not every tenant was affected; sensibly, | the backups appear divided by environment, not tenant. You | can't blindly revert to an environment's last backup in this | scenario, although you'd think they would have done it | before. | julesallen wrote: | No argument on the crappy comms. | | If I was in customer success at an enterprise vendor I doubt | I'd be let anywhere near the tools to get this back up and | running. These guys are generally in the way rather than | helping in a situation like this. | | Head of engineering or some product rather than customer | support? That might be a different outcome. | iamtheworstdev wrote: | Can confirm. Saw them there while I was on vacation. | benreesman wrote: | Jesus, if there was ever an example of the internet making | the world smaller. | | When do execs living it up at the fucking Wynn Encore while | the house burns down start to not get another job? | | They'll keep pulling this shit until it cost money. | benreesman wrote: | For clarity: I went through a period where some combination | of self-indulgence and legitimate life crisis caused me to | take my eye off the ball when it mattered. | | I'm still trying to kickstart a second act years later, | because I'm trailer trash and it's hard work when you're | that. | syshum wrote: | sales never takes the blame. 
If anyone is fired, it will be | scapegoats in engineering: once they have busted their ass to | restore, their reward will be the door | systemvoltage wrote: | This is an engineering problem. They should own it and | improve things, make sure it doesn't happen again. | | Also, GP's quote | | > Engineering mistakes happen. | | I don't like this statement because it offers consolation at | the expense of unintentional normalization. | nix23 wrote: | Ever heard of Space Shuttle Challenger? You can't own it if | your management is against it. | syshum wrote: | The deletion of customer data was an engineering mistake; that | is not what I was talking about | | The negative fallout was not due to the deletion of | customer data. As the story and multiple customers have | stated, the negative fallout was the SILENCE, which is Sales | / Customer Service, not engineering | | As the comment I was replying to noted, while engineering | was trying to recover from what might possibly be the | biggest outage in the history of the company, Sales was | partying and not handling customer communications | | That (the failure to communicate with customers) should be | a resume-generating event for all customer service / sales | leadership. It will not be, however, because sales will | simply redirect their failure on to engineering in the | exact same manner you just have | buscoquadnary wrote: | And coders that say all code has bugs are just defeatists | that are trying to make excuses for being lazy. | | Sometimes manure will always hit the fan. Being robust | means being able to handle that. | jacksnipe wrote: | I think this is obviously incorrect. | | Human error is probabilistic, and the probability of | making an error cannot be zero. | | On the flip side, it's infeasible to use only provably | correct systems; not lazy, but literally not a practical | option due to compute costs, developer time, what formal | techniques can even be applied to the problem at hand, | etc...
| systemvoltage wrote: | A culture where mistakes are taken too seriously or too | lightly leads to problems. Also it depends on what stage | of the product cycle (Innovation/Rapid Development vs. | Robustness/Quality). I'd argue that Atlassian products | should err towards robustness and high quality. Not | trying to break any new ground. | xyst wrote: | Atlassian is about to dip over the next few years as firms around | the world slowly remove themselves from their ecosystem of | products. | jlawer wrote: | Not to mention a CEO who is more interested in activities | outside the company like the green energy transition and | politics. | | As an Aussie I always wanted Atlassian to succeed as we have so | few tech companies at that scale or larger. Now I view them as | another Oracle. Now they innovate little, they keep ratcheting | up prices, pushing deployments to cloud where they make more | money. Nickel and dime you for what should be core features | (SAML Auth?). They aren't coming up with anything new to keep | the value in the ecosystem. They buy applications in, spend a | little to make some cross integration and then drop down to a | slower development cadence. | ekanes wrote: | Right. Feels similar in a way to an ongoing conflict | elsewhere... There is what happens now, and what happens over | the next decade because people have lost fundamental trust in | you. | tpmx wrote: | Their core customers are unfortunately just as dysfunctional | and slow-learning. Think Boeing, etc. Witness: | | https://jobs.boeing.com/job/annapolis-junction/jira-administ... | xiaodai wrote: | Comes across as a jerk. How can an outsider say things with such | certainty? | bluedino wrote: | Regarding the backup restores: | | I once worked at a company that had a data loss issue. There was | nothing else we could do, we had exhausted every option we had | over almost 40 hours. At the end of the second day, it was | decided to restore from backup. | | We had done this before, as a test.
It took about 12 hours to | restore the data and another 12 hours to import the data and get | back up and running. | | One small thing was different this time, and it had huge | consequences. As a cost-saving measure, an engineer had changed | the location of our backups to the cold-storage tier offered by | our cloud provider. All backups, not just 'old' ones. | | This added 2 days to our recovery time, for a total of | five days. Interestingly enough, even though we offered a full | month's refund to all of our customers, not even half of them | took us up on it. | bombcar wrote: | In these cases the best thing to do is just give every customer | the full month refund; don't make them ask for it. | sodality2 wrote: | The best thing to do business-wise, or as a good faith move? | rjmunro wrote: | What's the difference? | treesknees wrote: | Good faith would be to lose all of that money to people | who are already your customers. | | Business-wise would be to stay in their good graces and | keep those customers by offering the refund, but you | don't lose any money to those who either don't care or | won't move to a competitor. | function_seven wrote: | 25 years ago the clutch in my beater truck was slipping. | I was 16 years old, making $50 a _week_ and had very | little in savings. I took that truck to a shop within | walking distance of my job. | | 2 hours later I walked back to see what they found. I | figured it would be several hundred dollars for a new | clutch, and I'd have to borrow money or something to get | it done. I talked to the owner who told me it was an | adjustment on the cable. Just needed to be scootched up a | bit and it was probably good for another 30k miles. | | When I asked him how much I owed, he laughed at me and | said, "For that? Not worth writing it up. No charge. You | want me to show you how to do it yourself next time?"
| | The shop could very easily have charged me 1 hour of | labor at their standard rate, maybe $75 or so. Plus a | diagnostic or test drive fee. Whatever. He could have | told me, "$123.98" and I would have paid it. I wouldn't | even have been mad. But I sure as hell wouldn't have | remembered the experience so clearly. Nor would I have | told a dozen people over the years to take their cars | there. And I definitely would not have driven 20 miles | out of my way to return to that shop in the future years. | | Being cynical about this stuff will hurt your brand. It's | not obvious. It doesn't show up on the earnings report as | a line item. This is service segmentation that seems like | a no-brainer to a clueless MBA, but actually matters in | the long run. How people view your brand is immensely | important. | | Not forcing customers you already screwed over to then | spend more time chasing a refund is not only the right | thing to do, it's also good business. | heisenbit wrote: | Reducing the impact analysis within a long running | relationship to a single transaction is too narrow. | People observe how other people are treated and draw | their conclusions even if not impacted. People may | tolerate some abuse but it moves them closer to leaving | next time. Money lost in the outage may provide for a | budget creation to look for an alternative. | rgj wrote: | A lot of people making those decisions don't care about a | refund because it's other people's money anyway. In my | experience only small companies care about that. | | Focussing on communicating open and honestly allows them | to explain the crap they're going through because of your | mistakes to their bosses, so in fact you can help them | save their asses, and they'll save your ass in return. | This is much more important and valuable than a refund. | | So you should ALWAYS communicate open and honestly, and | offer the refund as an option for clients who do not have | a boss to account to. 
| bzxcvbn wrote: | Not every business can afford to go one month without income. | What's the best thing for customers? Have the business go | bankrupt and irremediably lose access to the service? | function_seven wrote: | It's 400 clients, not all their user base. They can handle | the lost income from a small slice of their customers for | one month. | | And if they can't sustain that, then it's even _more_ | imperative that those customers migrate away. | enra wrote: | Atlassian had almost a billion in free cashflow last year | and over a billion in cash. I think they should cover the | whole year for these customers. | miketria wrote: | Hi, I'm Mike and I work in Engineering at Atlassian. Here's our | approach to backup and data management: | https://www.atlassian.com/trust/security/data-management - we | certainly have the backups and have a restore process that we | keep to. However, this incident stressed our ability to do this | at scale, which has led to the very long times to restore. | sizzle wrote: | How's the atmosphere internally Mike? Must be crazy times | there. I know this isn't your fault, so hang in there. | Cheers! | encryptluks2 wrote: | You mean your poor practices and bad design. The only way to | prevent this type of issue in the future is to admit the | failures. | farseer wrote: | They have recently killed off on premise offerings, it's cloud | only now. And this makes it harder to trust both the security and | integrity of your data. | ocdtrekkie wrote: | The fact that a single bad script could delete 400 of their | customers should be absolute proof they do not have the | processes in place to be a steward of your data in the cloud. | On-prem or bust. | dangus wrote: | On-premise just means that your overworked IT person is going | to spend 5% of their time keeping your service maintained, at | no point gaining any more than baseline familiarity with the | product. 
| | On-premise isn't a magic pill guaranteeing 100% uptime and 0 | data loss. | | While on-premise may be a good choice in many cases, it's not | like running on-premise business tools has no risk associated | with that choice. | | Remember that the goal of a company is to sell the most | product possible (output) with the lowest cost possible | (input). | | Any Joe off the street starting their own business can pay | Atlassian $0/month for up to 10 users. On-prem doesn't | compete with that. | dzikimarian wrote: | On-prem means you have control over spending. I calculated | that if we moved to the cloud, we would pay YEARLY as | much as we spent on Atlassian licenses in the last 5 years. | That easily pays for the maintenance overhead on our devops | team. | [deleted] | hrpnk wrote: | Afaik, the Data Center option still allows for on-premise | deployment, incl. Kubernetes and cloud deployments [1, 2, 3]. | | [1] https://www.atlassian.com/enterprise/data-center | | [2] https://confluence.atlassian.com/enterprise/jira-data-center... | | [3] https://confluence.atlassian.com/enterprise/deploying-enterp... | jmondi wrote: | What blows my mind is that Atlassian stock has barely taken a | hit... | ferdowsi wrote: | The market will react at the next earnings report, not now. And | only if customers start to bail. | NineStarPoint wrote: | Unless their revenue takes a long term hit over the outage, no | reason for the stock market to care. There isn't news of people | actually planning to stop using Atlassian products over this. | The only direct consequence is going to be the one time payment | of SLA credits. So I guess I'm more surprised by how | little impact this looks like it will have on people using | their products than by the stock market not caring | much about this.
| mountainriver wrote: | Yeah Atlassian is a corporate leech, you don't get away that | easy | pigtailgirl wrote: | the stock is on a rally today - just goes to show - the market | is full of surprises | capableweb wrote: | It was a long time ago that individual stocks represented anything | grounded in reality. People talk about "fundamentals" and so | on, but that's not what the price is based on. I don't think | anyone knows why the prices move as they do anymore, as there | are so many algorithms involved today, both manual and | automatic ones. | __app_dev__ wrote: | Yeah, I placed a put option order yesterday. By end of day I was | up over 50% and now down to 50% of what I originally purchased | the put at because it went up 5% today. | | Oh well, better luck next time. | devmunchies wrote: | it was at $317 on the day of the outage and now at $278.5. A | ~12% drop. You're right, not much of a drop for such a large | outage. | [deleted] | __app_dev__ wrote: | The outage did not impact the stock, most major tech stocks | have taken a large hit in the past week and a half (until | today). | | This event is not even showing on any financial news site. I'm | still hoping it does and the stock goes down because I placed | an option order yesterday betting that it goes down by next | Friday. Seems like it won't now but the risk was worth taking | in my book. | mdoms wrote: | Title is a bit misleading, there's no insider info here. This is | all stuff we knew from the official statements, the blog post, | Reddit and Twitter. | rmbyrro wrote: | Are Confluence pages and Jira tickets built like a GPT-3 300 | Terabyte model? | | I mean, I thought they were text. | | 5 days to restore text? | | They must be generated by a huge complex deep learning voodoo. | | Atlassian is working on the bleeding edge of technology. This | outage is understandable... | er4hn wrote: | I suppose if they recover what they can and restore the rest | using GPT-3 that may make the process easier.
| katbyte wrote: | images and other files can be attached to issues or embedded in | pages so a single instance can use a lot of storage. | napolux wrote: | Yeah, let's centralize the Internet (born decentralized). This is | what the Internet has become. | Crabber wrote: | How do we solve this problem? In other industries based on | physical products there is a big incentive to buy goods as | locally as possible because of reduced shipping costs, shorter | shipping time, no import taxes etc. | | But with software it costs nothing to spin up new instances, | costs nothing to deliver half way across the world, and has no | delivery time. How can you convince a manager to use a software | solution provided by a local company when a company in a | completely different country 600 miles away offers similar | software with 5 extra features? | | It seems like the internet is now perfectly set up to create, | for each software type, a single company that has a global | monopoly. | barneygale wrote: | That's OK in principle, as long as those companies function | like governments (i.e. they work to improve things rather | than turn a profit, subject to constitutions, public voting, | judicial review). As engineers we should embrace the | efficiency of scale, but it's quite clear that it can't work | under capitalism. | Alex3917 wrote: | A few years ago we didn't renew our subscription on time because | we got the email over Christmas break, and iirc they deleted all | of our data in less than two weeks. They were eventually able to | manually restore it from backups, but they restored it | incorrectly so there was a bunch of stuff broken. This whole | thing isn't even remotely surprising to me. | a2800276 wrote: | You can sleep soundly: it seems like they back _everything_ up: | | > Second, the script we used provided both the "mark for | deletion" capability ... 
(where recoverability is desirable), | and the "permanently delete" capability that is required to | permanently remove data _when required for compliance reasons_. | The script was executed with the wrong execution mode and the | wrong list of IDs. The result was that sites for approximately | 400 customers were improperly deleted. | | > To recover from this incident, our global engineering team | has implemented a methodical process for restoring our | impacted customers. | | [https://www.atlassian.com/engineering/april-2022-outage-upda...] | | Anyone else find it disturbing that they are able to restore | data that they deleted permanently for "compliance" reasons? If | this is true, how were they ever compliant? I guess data is | only permanently deleted when the engineering team is following | their typical, non-methodical process... | notatoad wrote: | No, I don't think that's disturbing. That's the point of | backups - even when something is permanently and completely | erased in the production database, it's still in the backup. | Eventually it will get rotated out as the backups expire. | | Going back and purging things from the backups as part of the | delete process would be overdoing it to a ridiculous degree. | Delitio wrote: | Nope it's not ridiculous. If you are only allowed to store | data for x months, that's it. | | It's your job to use techniques which allow you to do this, | like using encryption on your backup and deleting the keys | for it, for example. | usefulcat wrote: | > Going back and purging things from the backups as part of | the delete process would be overdoing it to a ridiculous | degree. | | Also, modifying backups is a great way to inadvertently | hose your backups. | yebyen wrote: | I think that depends on what you mean by compliance. Some | regulations require you to irreversibly destroy data when | they prescribe the destruction of that data.
| | That can mean as much as "you have to encrypt everything | with a separate key, so that you can destroy the key for | the given (say, personally identifiable) dataset, making it | irrecoverable" | | I'm not saying that's the particular compliance reason they | had here, or that the analysis you're giving is wrong, | either. There is an interpretation where either of these | ideas could be the correct one. | a2800276 wrote: | "permanently delete" strongly suggests to me that it was | the "medical and financial data" kind of compliance. If | data can be restored, it's not permanently deleted. But | this was a statement from the CEO, so words can have | arbitrary meaning :) | notatoad wrote: | "permanently delete" does not mean the same thing as | "immediately delete". Deleting from the live database is | the first step of a permanent deletion; as long as the | data exists somewhere, the deletion process is still in | progress. | | There's a whole lot of people in here who are way too | quick to assume that just because one part of a permanent | deletion process was inadvertently triggered and then | caught while they still had backups, their whole | permanent deletion process is a lie. | voxic11 wrote: | https://ico.org.uk/for-organisations/guide-to-data- | protectio... | | You seem to be right-ish: while the GDPR in certain | circumstances allows you to keep backups of data that | should have been deleted, it seems like they are trying to | discourage it in the future. | | > ...It is, however, important to note that where data | put beyond use is still held it might need to be provided | in response to a court order. Therefore data controllers | should work towards technical solutions to prevent | deletion problems recurring in the future. | jacobsenscott wrote: | A better way to do this sort of thing is not an actual | "delete", but a "cryptographic delete". The data should | be encrypted, and you just delete the key.
The data is | then unrecoverable everywhere, including backups. Of | course you probably don't want to just nuke the key, but | disable it for some period of time, and then nuke it. | nindalf wrote: | This is why regulations specify that data must be | destroyed within a time period, typically 90 days. It | gives enough time for backups to rotate out. | | If this weren't a concern, regulations would demand | immediate deletion of data. | deepspace wrote: | I asked the same question yesterday, and the responses were | food for thought. | | If you make backups, you are, almost by definition, unable to | perform a full 'Compliance Delete' before the oldest backup | in the set has expired. | | Compliance-based deletion, if it is offered as a service, is | almost always something time-based, like "we guarantee the | data will be deleted 7 years from now". And then that | deliberate deletion step is baked into the backup process. | | So, i.m.o. at best they misrepresented the nature of the | compliance deletion process. It never did what it was | designed to do. | luhn wrote: | It's generally recognized that deleting data from a backup | would violate the integrity of the backup, so allowances are | made. Usually you have to make sure the data is deleted as | part of the restore process. For example, from CCPA: | | > If a business stores any personal information on archived | or backup systems, it may delay compliance with the | consumer's request to delete, with respect to data stored on | the archived or backup system, until the archived or backup | system relating to that data is restored to an active system | or next accessed or used for a sale, disclosure, or | commercial purpose. | jacquesm wrote: | Did you continue as their customer after that? | Alex3917 wrote: | Nope. I exported our data after they restored the backup and | then we cancelled less than a month later. 
Like I obviously | understand suspending our logins, but why would you ever | delete someone's data when it's literally only 160 KB of | text? The whole thing made zero sense. | Kwpolska wrote: | > why would you ever delete someone's data when it's | literally only 160 KB of text? | | Compliance? The contract has expired, so there's no legal | basis for them to keep your data? | usefulcat wrote: | Seems like that could be addressed with some fine print | in the initial agreements. "In the event that you stop | paying us, we may keep your data for up to N days unless | directed otherwise by you"--or similar. | bzxcvbn wrote: | Why would they bother? | herpderperator wrote: | I don't think people write code saying "if accountSize < | 160kB { skipDelete() }" - THAT would make zero sense. So, | the size is not relevant here. The process was likely to | delete data after some event occurred, or lack of event | occurred. | hinkley wrote: | Someone somewhere got a promotion sooner because they | lowered the slope of a line a little bit. | hallway_monitor wrote: | Or some overzealous engineer said hey guys let's delete | all data 7 days after an account is canceled. This is | called over optimizing. | dangrossman wrote: | Such a decision is just as likely to have come from the | legal/compliance team as an engineer. Data you no longer | have clear consent or a legitimate business need to store | is a liability, and if you operate in Europe, potentially | illegal to continue storing. | nemo1618 wrote: | After I met my now-fiancee on OkCupid, I deactivated my | profile, turned off notifications and forgot about it for a | while. A while later, I thought it'd be nice to revisit the | first messages we sent to each other, only to find that... | OkCupid had deleted both of our accounts. They didn't give | me any advance warning, either, because I turned off | notifications, remember? :^) | | I'm still kinda salty about it.
I understand why big | services can't retain data indefinitely, but like... it's | just a few KB of text, and that text happens to have a lot | of sentimental value. Besides, OkCupid _knows_ that I | deactivated my account _because I am a success story_ -- | why not hold onto those profiles a bit longer? Or better | yet, how about emailing an archive of those messages | immediately when you click the "I'm leaving because I'm in | a happy relationship now" button? /rant | dangrossman wrote: | With GDPR, privacy regulations and data breach | regulations sweeping the globe, holding onto unnecessary | data is a huge liability. Getting rid of data you no | longer have clear consent to store, or which you're | unlikely to have a clear business need to continue | storing, is a sign of a good company these days. | jacquesm wrote: | True, but likely not _this_ kind of data. | callalex wrote: | Yes, this kind of data. Your OkCupid account has all | kinds of information about who you associate with. | cto_of_antifa wrote: | [deleted] | RomanPushkin wrote: | > I've never seen a product outage last this long | | Title should be "Inside the longest outage of all time", without | the word "Atlassian" in it | k8sToGo wrote: | If I remember correctly, many years ago PSN was down for | months. | nemothekid wrote: | > _Most of them said they won't leave the Atlassian stack, as | long as they don't lose data. This is because moving is complex | and they don't see a move would mitigate a risk of a cloud | provider going down._ | | I still don't understand the stranglehold JIRA has on some | clients. I can't quickly think of another SaaS product that could | be down for almost 2 weeks and not have most customers leave. | brimble wrote: | If they don't lose data, two weeks of downtime every few years | might be cheaper than the cost of switching. Plus, it's not | like you know the thing you switch to will be any better, if | it's another SaaS.
| user22 wrote: | Let's say we have an announced release schedule on May 1st. | With the tools down, there is no way to meet that date. For a | 4 billion dollar company, this can make a huge difference in | revenue. For a public company, the stock will definitely drop | when it's announced the revenue goals were missed because the | tools were down. | | For companies of size, the cost of tools being down for 3 | weeks can easily be in the multi-millions of dollars. | brimble wrote: | Again, part of the trouble is it's hard to gain enough | _certainty_ that the thing you switch to--self-hosted, or | another service--won't be _at least_ as bad. You can look | at their past record, but then, when's the last time | Atlassian had _this_ happen? (or maybe they've been having | similar issues every year or two and I've just not noticed, | in which case, yeah, it's probably a safe bet that | switching to almost anything else would be an improvement) | tyingq wrote: | >I still don't understand the stranglehold JIRA has on some | clients. | | - Integrations with things like the source code repos, incident | management systems, confluence or other wikis, Slack, etc. | Moving away from Jira creates a bunch of dead links. | | - Internal dependence on complex workflows and state transition | rules that are implemented in Jira. | | - Various very customized reports that leaders depend on to | make decisions, despite the often dubious value and/or | accuracy. | femto113 wrote: | Many years worth of source code filled with comments like | // if we don't toggle bit 7 here 10% of transactions will | fail on Thursdays // see JIRA issue BIGPROJ-12654 for | detailed discussion | DannyBee wrote: | Having migrated bug systems for very large, very old code | bases before, it's pretty easy to make the URLs and links | like this still go to the right place.
| | This is actually the least difficult thing, I would say ;) | ReidZB wrote: | When we migrated away from JIRA, we scripted it such that | the JIRA issue numbers were recorded in the newly migrated | issues exactly because of things like this. | dirtybirdnj wrote: | >I still don't understand the stranglehold JIRA has on some | clients. | | But it's got what plants crave... | z58 wrote: | I imagine most people used something like Google Sheets during | the downtime | chupchap wrote: | A lot of companies have integrations to the Atlassian suite which | might not be easy to shift from. | | Secondly, there are a lot of individual competitors to Jira, | Confluence and Bitbucket but which competitor can offer all | three under a single invoice? Maybe Microsoft, can't think of | anyone else. | | Also for such an extended downtime the customers are entitled | to a discount or a credit note which a lot of CXOs consider in | their decision making. | krinchan wrote: | We are in a similar place with Slack. We moved from HipChat | to Slack and that was painful enough. Then the company | noticed we get Teams for "free" and they tried to push us | over to it. But folks have so much automation (because | "ChatOps" is that new new) that is pushing things into Slack | the company eventually gave up. | judge2020 wrote: | > Maybe Microsoft, | | Is there a Jira replacement/offering in the Microsoft 365 | suite? | iameoin wrote: | Microsoft has Azure Devops Boards that is similar: | https://azure.microsoft.com/en-us/services/devops/boards/ | ralgozino wrote: | GitHub / GitHub Enterprise? | muricula wrote: | Visual Studio Online is what it was called internally, the | marketing may have changed. It's okay, and is what | was/probably still is used at MS internally to develop | Windows.
| lfpeb8b45ez wrote: | Azure DevOps is really underrated: | https://www.thoughtworks.com/radar/platforms/azure-devops | travellingprog wrote: | Never used it, but looks like Microsoft Project would fit | that box. | HeyLaughingBoy wrote: | Oh god, no. We moved from Project to Jira and life became | immeasurably better! | shadowronin42 wrote: | Azure DevOps has boards and tickets and whatnot, so | probably that? | encryptluks2 wrote: | Atlassian sells to execs and gives kickbacks. You don't want to | burn the company that gave you money and that you pushed | through although you knew they sucked. | jasd wrote: | Even if they don't, I imagine they will have conversations | internally to see what's feasible. It's just really difficult | for an organization to move away from a product that everyone | has learnt how to use. The company I work for is struggling to | move away from something as simple as a collaborative editor, | when I feel like I find no difference between the two products. | travisgriggs wrote: | > Atlassian is a tech company, built by engineers, building | products for tech professionals. | | I am curious if anyone can provide any more insight on this | simplification. | | I've worked at companies like this. Originally a core of | motivated creative individuals make a cool product. As the | business grows rapidly, Pournelle's (Iron) Law (of Bureaucracy) | takes over. For a variety of reasons, the very capable creators | depart and are replaced by less motivated/aware individuals who | are glad to have a job and easily compelled to do things to the | product that probably should not be done. | | My guess is that while Atlassian may have originally been one of | those cool founder places, it has probably morphed into the more | incompetent version that comes with scale all too often. But I | don't know. Thus my question if anyone can speak to the true | current tech capabilities of this company. 
| cdjk wrote: | This isn't the longest outage - last time they couldn't recover | and recovered data from email archives. | elesbao wrote: | On a side note that someone else already made: it is interesting | to see that many companies that use JIRA also use Slack but the | noise/complaints/mentions when Slack is down are way | different. I barely saw people complaining. | caymanjim wrote: | I dunno about everyone else, but I'm generally frustrated and | feel blocked when Slack is down, and I celebrate Jira being | down because I've never had a pleasant experience using it. | Jira is bureaucracy that gets in the way of me getting things | done, and Slack is a critical communication path. | mountainriver wrote: | Yup Jira is bureaucracy incarnate. Middle managers love it | though | elxr wrote: | Same here. I actively made an effort to tell coworkers how | much I hate Jira. Hopefully new startups choose something | more sensible. | upbeat_general wrote: | I don't believe slack has been down as long? | | Slack is generally much more critical than JIRA in order to | keep working. | adhesive_wombat wrote: | Is it though? You can hop onto any of a constellation of | other IM platforms, FOSS and not, fairly quickly for an | instant comms channel, even if you're missing the history. | Having all your issue tickets missing is something you can't | really deal with unless you have a very recent dump, and even | then you can't just fire up Bugzilla and get something | working without a lot of migration and administrative effort. | | You can do without JIRA for a week or two as long as managers | understand and you all have a good concept of what work | needed doing anyway. Then it starts getting dicey unless | someone becomes a human JIRA to connect temporary manual bug | tracking systems with everyone involved.
| adamc wrote: | We have all sorts of slack channels set up to coordinate | activity, so that internal customers can talk to engineers | easily, or engineers can engage with each other. If slack | goes down, we'd have to work all that out. For many days, | it would be a huge drag on the process, slowing down | interactions. | | Other IM platforms wouldn't solve that just by existing. | Sure, in principle one could set up such channels | elsewhere, but that takes time, and the communication about | it takes considerably more time. | adhesive_wombat wrote: | Sounds like having a fallback pre-defined would be | prudent if it's that important and you don't feel you | could collectively extemporise something. "If Slack goes | down, the plan is to use WhatsApp/Teams/Jeff's Matrix | homeserver in his garage until service comes back. A list | of group channels will be emailed if that happens." | | Then if it does go down, you don't have to waste the | first day arguing about the plan. | ineedasername wrote: | Something to consider is that Jira can require a great deal of | configuration to tailor it to your needs. If you already have a | DevOps team of some capacity (not everyone does) then it may only | be a small incremental increase to run things on prem. I did it | myself: I'm very much not a DevOps person, mostly unfamiliar with | optimizing JVM parameters for apps like this, but it still only | took me about 5 hours to get things running stable, and then | another 2 hours or so a few weeks later to tweak things like heap | size to help things go a bit faster (though it was still somewhat | slow) | | To be completely open though I don't know how much DevOps overhead | is involved in maintenance or feature updates. I hated the app | and used it for less than a year so I didn't have much exposure. | I guess my point though is simply that you may not need to use | their SaaS option if you have a decent DevOps team already.
After | the initial setup time I doubt I spent more than half an hour a | month managing the internals and updates. | | I _did_ spend more than that on configuring the system for use, | which you'll need to do regardless. | thesh4d0w wrote: | Atlassian has EOL'ed their non-cloud products | | https://www.atlassian.com/migration/assess/journey-to-cloud | originalvichy wrote: | I have had to correct this too many times already. Server is | the name of the deployment type of their on-prem. It means | single node non-clustered. Data center is their deployment | that supports clustering to multiple nodes (and used to | support a few extra features). They are retiring the Server | deployment type licenses and pushing everyone to data center | or cloud. So no, they aren't EOLing their on-prem. | bombcar wrote: | The cost of server was lower and fixed (buy it once), the | cost of datacenter is MUCH higher (minimum 500 users, pay | per year). | | Which is even more amusing when you realize Server has been | Datacenter with a fake mustache for years now. | tedivm wrote: | The datacenter product also seems geared towards people | reselling Atlassian stacks. For example there's a company | that offers HIPAA compliant Confluence (complete with | signing a BAA, so you can actually store PHI on it). It | doesn't seem like a great replacement for the server | version. | dzikimarian wrote: | Our instance is half the size of minimal Data center | license. For us and for many customers this is effectively | EOL. | tyingq wrote: | Their on-prem options are being reduced down to one product | with pretty high minimum spend numbers: | | - 500 users (Jira Software, Confluence, Crowd) | | - 50 agents (Jira Service Management) | | - 25 users (Bitbucket) | | https://www.atlassian.com/migration/assess/journey-to-cloud | | https://www.atlassian.com/migration/assess/compare-cloud-dat... | k8sToGo wrote: | I am _Dev_ Ops, and not Ops.
So I try to not waste time with | self hosting as much as possible. | h2odragon wrote: | > However, if they [restore backups], while the impacted ~400 | companies would get back all their data, everyone else would lose | all data committed since that point | | OK, so you restore backups to a separate system, and selectively | copy the stomped accounts data back to production. Simple | concepts aren't that simple at their scale, sure, but I suspect | this is skimping details on some truly horrendous monolithic | architecture choices that they're trying to hide. | | Not that I ever thought using their products was a good idea; to | be clear about my position... But at this point anyone continuing | to rely on them for anything is asking for the suffering they'll | get. Signing up for their crap for a vital business function is | like offering your tonker to a snapping turtle. | throwaway894345 wrote: | I would really like to understand who makes the decision to | purchase JIRA. It's like the C++ of ticketing software--it does | everything because no one wanted to sit down and think | critically about the use cases and instead decided it would be | easier to say "yes" to every single feature request. It | definitely feels like whoever is buying JIRA is not on the team | who is using it (maybe IT or finance) because it ticks their | boxes and it has such a huge list of features that _nominally_ | it appears to tick the product development boxes (ignoring more | subjective concerns like "quality", "performance", and | "usability"). | | I would really like to try working in an organization that uses | something simpler, like Trello (although now that this is also | an Atlassian property, maybe not exactly Trello?). | robertlagrant wrote: | The reason to buy Jira is that loads of stuff integrates with | it, and lots of people know it. Maybe not perfect, but that's | why. 
And unless you're in it all the time, which some people | may be, its ergonomics are not as important as, say, an | IDE's. | tenacious_tuna wrote: | My relatively small team at a massive enterprise built all | our report generation tools around JIRA for an entire class | of offerings. It's been easier for them to justify continuing | to pay for JIRA and keep it propped up than to develop (or | migrate to) a new solution. | | As the lone dev on the team I've been continually astounded | by my leadership's willingness to commit more and more to | tech debt laden paths. The notion that _all software_ | requires maintenance is anathema to them, and it's led us to | be 'cornered' into decisions re: what software we can use / | where we can invest our discretionary funding. | | Moreover, we're constrained by the parent mega-enterprise's | software purchase policies; JIRA's already approved (and run | elsewhere in the enterprise), whereas off-the-shelf or SaaSy | alternatives are significantly harder to get buy-in for. (No | using corporate cards for SaaS, all purchases need to go | through the quote/purchase-order process, etc). | Cd00d wrote: | Interesting take. | | Personally, I like JIRA. I think it adds a ton of | transparency in our org, and while I've used Trello for | personal and home projects, I don't see how it's good enough | for business. Trello doesn't even allow for time estimates | (last I tried), which for us is part of planning. Search in | JIRA is also really good, so no ticket is ever just lost to | the ether. | | Sure, it's not perfect, and waiting for a board to load is | annoying, but for distributed work and visibility, I haven't | seen something as professionally useful. | | Open to exploring though. | stuff4ben wrote: | GitHub Enterprise and ZenHub Enterprise work well for us | here @IBM, not that I speak on behalf of them, just a drone | doing work. | Cd00d wrote: | ZenHub looks really interesting - thank you for pointing | it out.
| | How good is ticket search? I have to be honest, JQL is | the superpower that makes or breaks for me. | x0x0 wrote: | I made the decision, unfortunately. The rationale was | literally that I hated pivotal tracker -- what a garbage app | that is -- and I'd heard of jira, needed something to track | bugs / work items, and signed up. It crucially had a zendesk | -> jira sync, so all our zendesk requests could end up in | jira. | | In the beginning, with me plus 2 engineers, I noticed it was | slow but since I used it for 20 minutes a week, that didn't | really matter. By the time I started using it for an hour a | day, we had 10 engineers on 2 teams using it. I got to see a | friend using linear, and I had some spare time that I was | going to use to switch, but I couldn't get in the beta. By | the time they let me in, the opportunity was over and I was | too busy. | BolexNOLA wrote: | I really, really like Trello and am dreading the day when | atlassian starts tinkering with it in any real capacity. As a | content creator, it is the first workflow system I've ever | seen that I can effectively share with my client. It's so | simple and streamlined and the fact that I've stuck with it | despite my ADHD says a lot. | | Clients add their notes to the card, I check the boxes as I | hit the notes, and I move the card further right as we enter | different stages of the post production process. We then have | a column of every completed project, which is incredibly easy | to sift through if we need to revisit something. It's | literally left to right in the workflow, it visually is | telling me where we are at all times. | | It's incredibly simple and elegant. For fast turnaround, | relatively stripped down content (like podcasts) there is | nothing like it. | systemvoltage wrote: | What's wrong with C++? Seems unfair to compare it with JIRA. | throwaway894345 wrote: | I was a C++ programmer in a past life and I sorta like it. 
| C++ and JIRA seem to have the same philosophy with respect | to choosing which features to admit: "yes". The idea is | that by supporting the largest number of features possible, | they'll surely build something that everyone likes because | it will tick everyone's boxes. What people frequently fail | to realize is that the absence of misfeatures or redundant | features is an important feature in and of itself. | Moreover, the more features you support, the harder it is | to control for quality. | hhmc wrote: | > The idea is that by supporting the largest number of | features possible, they'll surely build something that | everyone likes because it will tick everyone's boxes. | | The idea that the C++ committee are unthinking people | pleasers is patently false. | | C++ does have a lot of cruft, but mostly because it aims | to: i) support new features ii) maintain pretty strong | backward compatibility guarantees | | In general the new features are actually pretty well | liked, but in conjunction with (ii) it creates a big | language. There's a reasonably decent subset that can be | carved out, but it's also clear why newcomers without | legacy baggage (e.g. rust) are making inroads. | throwaway894345 wrote: | "unthinking people pleasers" isn't how I would | characterize it; rather, I think of it more as a "kitchen | sink" or "more is more" philosophy rather than a "less is | more" philosophy. I'm sure the committee deliberated | extensively, but deliberation within their particular | philosophical context still produced an unpleasant | result. I think the same is true of JIRA. | | EDIT: clarified wording a bit. | aaaaaaaaaaab wrote: | It's a HackerNews meme from people who never bothered to | properly learn C++ and are angry that it's not | JavaScript/Ruby/Rust/whatever. | throwaway894345 wrote: | I refer you to my sibling comment: | https://news.ycombinator.com/item?id=31017079 | uuyi wrote: | Ex C++ dev and ex JIRA admin.
They are the same class of | complete bananas. | antiterra wrote: | In a lot of ways, JIRA disrupted Remedy Action Request | System, which had a painful transition from X to Windows | client. Remedy was even more admin dependent and unwieldy. | fnord123 wrote: | > like Trello | | Maybe Asana or Monday would work for you. | spookthesunset wrote: | I find it helpful to stop thinking of JIRA as a bug tracker | or anything like that. In my opinion JIRA is more of a way to | create and track workflows. It can be used as a blank slate | for quite a lot of things (which I cannot come up with any | examples for at the moment!) | | That being said, because it can do anything, it doesn't take | much effort to make a workflow as painful as possible. | Somebody with the "right" mind might make all kinds of | checkpoints in a workflow, which makes a lot of operations a | pain in the ass because you wind up hopping through a bunch | of steps. Pretty sure in our org we just make our workflow | "you can hop from any state to any other state"--basically a | free-for-all. | | Dunno my point, but there you go! | RandallBrown wrote: | I think people buy JIRA because you can set it up however you | want. I've seen it almost as simple as Trello and much more | complicated. It doesn't have to be terrible, it just usually | is. | | If JIRA didn't allow you to make it terrible, it wouldn't | allow for some of the absurd things that people want it for | and those companies might not buy it. | a4isms wrote: | They used to say of Microsoft Word, "Nobody uses more than | 5% of its features, but every company uses a different 5%." | | The saying is apocryphal and unlikely to be accurate, but | the shape of the thing it's describing applies to almost | every piece of enterprise software whether installed | on-prem or SaaS. | | And as another comment points out, at Enterprise scale you | can substitute "team" or "group" for customer.
Every team | might use a different 5%, and unless you standardize their | processes, you have to buy the product that can accommodate | all of their needs. | grog454 wrote: | >"Nobody uses more than 5% of its features, but every | company uses a different 5%." | | >The saying is apocryphal and unlikely to be accurate | | Well it's mathematically impossible to be accurate as soon | as you have > 20 users. | shukantpal wrote: | False. If you have 100 features, there are nCr(100, 5) | combinations of 5% features = 75287520. | a4isms wrote: | If it's no more than 5% of the features, it's actually | n-choose-k(100,5) + n-choose-k(100,4) + n-choose-k(100,3) | + n-choose-k(100,2) + n-choose-k(100,1): | 75,287,520 + 3,921,225 + 161,700 + 4,950 + 100 | = 79,375,495 | moonbooth wrote: | Only if you assume the 5% of features to be a contiguous | block each time. | | However, if we assume there are, say, 100 features in | Word (the real number is likely much higher), the number | of combinations is orders of magnitude higher than 20. | [deleted] | robertlagrant wrote: | A better counterexample is that one user could use all | features. | | But your statement doesn't make sense; there might be | millions of features, and trillions of ways to combine | them to make 5%. | KronisLV wrote: | > Well it's mathematically impossible to be accurate as | soon as you have > 20 users. | | It's probably in the semantics. | | Text input and editing is clearly a part of functionality | that's probably used by everyone (or at least most | users), so it's not possible for "different 5%" to mean | what you're alluding to, maybe the phrasing needs work. | | In any given 5% there might be 1-4% of overlap with what | others are using and the remainder of that is specific to | the company. | grog454 wrote: | And the greater the degree of overlap the weaker the | implicit argument.
| | If it's a uniform distribution of discrete features then | each feature is equally "important" and worth equal | resources and dev time. If 81/100 companies use the exact | same 5% of features and the remaining 19 cover the | remaining 95%, then all else equal you can probably drop | 95% of your features and still do well. | a4isms wrote: | The dynamics of the Enterprise market are such that there | are features where having just one customer that will | make a buy/no-buy decision based on just one feature will | deliver enough incremental ARR to justify the opportunity | cost of doing that feature instead of a bunch of others. | | Typically you do the most popular features first, but | most Enterprise vendors end up working on a long tail of | niche features that nevertheless are profitable. | | There's a long conversation to be had about how this ends | up being a trap where Enterprise software gets bloated | and shitty and eventually gets disrupted by a small | vendor that does "less," but in a powerful, | transformative way that obsoletes the Enterprise | "standard," which leads us back to discussing Atlassian | :-) | | They're a good example of this dynamic, because they have | a "constellation" of products to sell. So if they build a | niche feature that gets a new customer to buy Jira seats, | having "landed" in the account, their salespeople can | "expand" by selling OpsGenie and other related products | very profitably. | karaterobot wrote: | The way it works is, someone always says "Sure, JIRA is bad | out of the box, but you can customize it to work the way you | want" and there is nobody around to say "so now you have two | problems: a bad system that depends on having an expert to | make it work the way it should". | | Then, you pay for JIRA, and that expert customizes it the way | _they_ like. It still doesn't work very well for most | people. Nobody likes it except one stakeholder, and the | engineering lead who acts as an admin on it.
A while later, | those people have left the company, and everyone else is out | of luck. | | Seen this exact scenario play out at two different companies | now. Am witnessing it play out in real time at a third. | sam0x17 wrote: | And yet, it actually is set up in an extremely opinionated | annoying way. For example there is no way to actually assign | multiple users to the same ticket, which is a big problem if | your org legitimately does pair programming (mine does for | juniors) | robertlagrant wrote: | Having a single owner for each ticket is not a bad idea. | You can see contributors in git. | Cd00d wrote: | Why not just clone the ticket? | BlargMcLarg wrote: | Trickle down and first mover. JIRA was there first being | "decently ok", enough people adapted it and now others do the | same. Then couple with that what you write, the people in | charge of deciding the software are generally the ones who | can justify wasting half their day on it. | | To this day I still don't know what JIRA does so much better | that other products don't which big corps are willing to | waste months worth of manhours over. It's biggest selling | point is integration with the remainder of the Atlassian | stack, not exactly known for being great either. | vikingerik wrote: | Jira's big feature is being widely known. It's the modern | version of "nobody ever got fired for buying IBM." | csours wrote: | "I am not a fish" - the people who buy it are not the people | who use it. - | https://www.ted.com/talks/seth_godin_this_is_broken | prescriptivist wrote: | This reminds me of one of my favorite HN comments of all | time: https://news.ycombinator.com/item?id=16424423 | dangus wrote: | The answer is medium to large companies. Jira is a tool that | can satisfy hundreds of different teams' work management | needs without having to buy dozens of different products. | | The fact that it's so feature packed and customizable is the | point. 
| | I think the complainers are not really investing the time in | to change project settings to fit their needs. | | My only complaint about the Atlassian suite is the | performance of Jira and Confluence. The overall page load | speed is too slow. | matwood wrote: | I agree. I look at every JIRA killer and think we could | maybe move and nope...they're missing something we use. In | many ways JIRA is like Excel. On the surface it can appear | easy to replicate for a single user, then you realize every | user uses 10 different features. | macintux wrote: | How do you change the markup language to be consistent | between Jira and Confluence? | | How do you eliminate all non-task ticket types in a Jira | board and allow any ticket to be a child of any other | ticket? | | It's hard to configure away complexity from a product if | it's designed to be complicated. | Karunamon wrote: | Re 1: I'm not sure why that's a necessity beyond a notion | of consistency. I find that major wiki editors are not | often major ticket creators, and these are different | products with different audiences at the end of the day. | Also, Confluence uses a WYSIWYG editor, so it's rare to | need to think about the markup. | | Re 2: Set the project's issue type scheme to one that | only allows tasks and subtasks. That gets you one level | of nesting. (And even though task and subtasks are | different issue types, changing from one to the other is | trivial since they have identical fields.) Allowing epics | gets you another at the top level. That's a bit limited, | but wouldn't arbitrary nesting be even more complex? | DocTomoe wrote: | > allow any ticket to be a child of any other ticket? | | I have no idea why you would want this from a work | management point of view, but you can just use issue | linking to describe a parent <-> child relationship. | chrisseaton wrote: | > How do you change the markup language to be consistent | between Jira and Confluence? 
| | This here is the single most insane thing about | Atlassian. | anecd0te wrote: | ime people pick Jira because they've used Jira and have been | promoted via the peter principle to the level at which they | make purchasing decisions. | brimble wrote: | IBM effect. If you don't care _a whole lot_ about your | ticketing system, you just pick Jira because everyone 'll | nod along with the choice and you won't personally be | blamed if/when it sucks, you won't make enemies or have to | argue over the choice because it can't do something that | someone else in the org "needs" it to do, et c. | SatvikBeri wrote: | At big companies I've worked at, the justification was that | JIRA was the only one that met all the regulatory/compliance | requirements. I don't know if this is actually true, but | smaller companies certainly don't market compliance as well. | JoBrad wrote: | I was part of the decision to purchase Atlassian tools at my | company. We had been using a variety of self-hosted and SaaS | tools which had varying abilities to integrate with each | other. We've had very positive feedback from users since | switching to them. We were also able to move some of our help | desks to JIRA Service Management, and away from another self- | hosted product which is still used by a good portion of our | business. The self-hosted product is honestly a nightmare to | maintain and keep secure. According to the vendor, the "fix" | is to have 1-2 people dedicated to that product, which simply | isn't something that my team has the bandwidth or will to do. | | JIRA does try to be all things to all people...and mostly | succeeds. For instance, we use the same workflow and mostly | the same nomenclature across our development and helpdesk | teams. Some of our software projects use Kanban-style | workflows, while others use sprints, but we can keep track of | a project across multiple teams using the same tools. 
I'm | sure other products also offer this, but we liked the | integration and overall capability for the price. | | There are definitely issues: some feature requests and bugs | have languished in their backlog for years. But you can get | started very quickly and we've had great feedback from users. | throwawayboise wrote: | The one place I worked that used Jira was a small-but-not- | tiny company (about 15 devs at the time). The only people who | actually used Jira were the managers. Developers got printed | stories. These were used for planning, and were printed on | cards and taped to a white board when ready. Developer would | pull a card to work on, and return it to the manager when it | was complete. The manager did all the status updates and | reporting to upper management. | | IDK if this was to cheap out on the licensing with a minimal | number of users, or if it was to insulate the developers from | the experience of using Jira. Perhaps some of both. | | Clearly that usage pattern would only scale so far. | notreallyserio wrote: | JIRA is generally fine software that is good enough for most | folks, especially if you're willing to adapt your workflow to | it. Where it goes wrong is where tools like Jenkins go wrong: | folks add too much customization. | | That means the tool is often the wrong one for the job, but | instead of picking something that's a better match out of the | box folks stick with the easy choice (extend what they have). | closeparen wrote: | JIRA is a framework for making assembly lines out of | knowledge workers. When you're a middle manager at a decent | sized company, a major problem you face is that the mass of | knowledge workers beneath you are _opaque_ : you have no way | of knowing whether they're working or not. 
Another problem | you face is that they're _uppity_ : people who went to | college and got used to managing their own time now have all | kinds of idiosyncratic ideas about how to manage their own | time and arrange their own working lives. Since you are a | middle manager you despise local differences. Since you are a | manager you're pretty sure that only you and your lieutenants | can be trusted with this kind of decision making power. | Adopting JIRA is a powerful lever to put people back in their | place as work item churning machines. Constraints such as | only certain people can create or assign tickets, only | certain people can mark them completed, only certain states | are valid transitions from other states, etc. implement a | level of domination over white-collar workforces that | managers would be otherwise uncomfortable asserting face to | face. | | Other ticketing systems do not work nearly as well for this | purpose because they are designed mainly as external brains | or communication platforms for workers, and they assume a | level of worker autonomy in moving tasks through their | lifecycle. In Trello you cannot make it so that a PM has to | sign off before a card is moved to the in-progress column, or | that only in-progress cards can have code reviews associated | with them. JIRA eats these kinds of requirements for | breakfast. | | EDIT: This is not to say you _can 't_ use JIRA in a workflow- | neutral way, or that everyone uses it for this reason, but I | would submit that it's JIRA's differentiated advantage. | TheRealDunkirk wrote: | Even worse, companies with the resources to buy JIRA will | probably hire consultants to set it up, and you wind up | with a system 1) bought by people who don't understand how | programmers work, 2) configured by people who don't know | how your company works.
So end users usually wind up with a | terrible system that continually generates complaints | (along MANY axes), and the people responsible for foisting | it on them think they're just being difficult. | mikepurvis wrote: | So I would say that this assessment is on the whole, kind | of cynical, however I suppose I have the interesting | position of being in an organization where I feel like I | actually see _both_ JIRAs. | | One JIRA is the project that's used for development of the | core product, where there are no constraints-- anyone can | add a comment, create links, change assignee, add new tags, | push the tickets through whatever state transitions they | want, and so on. It works, though it is a little chaotic | sometimes as subgroups of people have different preferences | for how things should go (eg, for tickets requiring test | team validation, should the ticket assignee remain as the | person who did the original work so it's clear who has more | to do if it fails validation, or should the assignee change | to the test team person, so that it's clear that that's the | next person who has it as an action item?) | | The second JIRA is the IT team's internal support project, | which is completely locked down-- no one except them can | close tickets or move them around, or even edit the | contents, closed tickets can't be commented on any more, | and so on. This is the one that gives me the vibes you are | talking about. Every time I have to interact with it, I | loathe it because every inch of it is transparently a | funnel, railroading me along a path toward one of either | DONE or WONTFIX. This is absolutely _efficient_ , in the | sense of meeting the goal of closing all the tickets, but I | feel it introduces friction for the larger business goal of | actually helping people resolve their problems. 
To the | point where eventually most of the IT support activity | moved away from the JIRA project to an informal Slack | channel, which is way more accessible, but worse in | basically every other way: it's harder to effectively | search, impossible to properly link, bad for async, bad for | dealing with more than one thing at once, etc. | codycraven wrote: | It sounds like you've been hurt by some terrible | management practices; I'm truly sorry that some managers | think their job is to control their subordinates. | | However, regarding ticketing systems, in team environments, | it is very effective and helpful to have a system that | manages the data about the work that has been completed, is | being worked, and is planned to be worked on. | | Part of that system might be defining restrictive workflows | for some teams, not for control, but to ensure the agreed | upon process is followed for quality or consistency. | | One of the many problems Jira has is that if you don't have | a Jira admin on your team, it's impossible to build an | effective and efficient workflow for your team. Coupled | with Jira making many things global by default (it takes a | lot of care to make a change that only affects specific | Jira projects) most configurations end up being a pile of | garbage automatically inherited from changes an admin (who | is not part of the team) made when intending to change | something for another specific team. | agalunar wrote: | Caveat: this is going to be a meta comment rather than a | comment about the topic proper, and so maybe not | appropriate for HN, but I think it's worth discussing. | | > It sounds like you've been hurt by some terrible | management practices; I'm truly sorry that some managers | think their job is to control their subordinates. | | When we assume someone was hurt, and imply they hold an | opinion only because they were hurt, we risk | delegitimizing their position.
The interpolated message | we might be sending is "your experience is personal and | not representative of the subject at hand, and so your | thoughts are only applicable to your situation; so, after | we express our sympathy, your thoughts can be dismissed." | Or the message we might be sending can be patronizing: | "you hold your opinion for emotional, rather than | rational, reasons; I'm sorry that you are so | unfortunate." | | To be clear, though, I'm sure this wasn't your intent, | and it makes me glad to see someone being compassionate | (i.e. that you bothered to consider the experiences and | feelings of the parent commenter). | | A personal story: I was raised devoutly religious but | left the church in my twenties. My family and friends | assumed I left because I wanted to be free from guilt, | had been hurt by a culture that belied the doctrine, and | so on (and they said as much). My change of belief | occurred after recovering from a few years of mental | illness, and while it is true that I may not have left | _when_ I did were it not for the opportunity to reexamine | my beliefs (while trying to piece back the fragments of | my life into a sense of self), the reasons _why_ I left | were the result of a lot of research and thinking. It was | mildly frustrating when people assumed my decision was | made for emotional convenience, when in reality, the | research was uncomfortable and contemplating an | unfamiliar universe was scary. | | I recognize the irony here - the issue I'm highlighting | in this comment may be something that only I feel is an | issue, born from a personal experience. But I _think_ it | 's more common than that. | [deleted] | liamwire wrote: | I sincerely appreciate your articulation of this, thank | you for taking the time. | closeparen wrote: | >ensure the agreed upon process is followed for quality | or consistency | | That is what I mean here by "assembly line" and | "control." 
Making sure that processes lead and | individuals follow. | | Citing consistency as a terminal value in the same breath | as quality is also exactly what I mean by the middle- | manager aversion to local differences. | chousuke wrote: | Beyond trivial scale, you need good processes so that | individuals can do their jobs. If you have no processes, | change and development becomes _extremely_ difficult | because people will be hunting for documentation all the | time, stepping on each other 's toes, and making mistakes | that they should not be making because they forgot a | trivial procedure that was a prerequisite to solving | their actual problem. | | I work with a variety of different environments, and | depending on the environment I can either solve my | problem in minutes and get it deployed in another few | minutes _or_ solve the problem in minutes and spend hours | figuring out how to safely deploy it without breaking | everything. JIRA is terrible if you do anything that it | offers by default, but when used properly it can | absolutely help with this. | baq wrote: | To add to that, and perhaps educate your downvoters a | bit, it can be very hard to imagine why or when such | strict processes are helpful without having direct | experience with organizations of sufficient scale. It | literally boggles the mind but the process truly is king | when there are hundreds (or thousands) of individuals | working on a single product. | hn_go_brrrrr wrote: | Agreed. An essential part of blameless engineering | culture is "the outage isn't any one person's fault, it's | the fault of the tooling and processes for allowing them | to do that". Good processes prevent everyone from making | the same mistakes. 
| tyingq wrote: | >However, regarding ticketing systems, in team | environments, it is very effective and helpful to have a | system | | I think the point is that Jira is particularly granular | in the way that it lets you do things with permissions, | workflow rules, roles, metrics, etc. There's a fair | number of places that use that granularity to create a | weird digital sweatshop. | | Meaning the complaint is more about really deep _" | micromanagement as a service"_ than what you might get | with lighter tools. | brazzledazzle wrote: | Micro managers are everywhere, even in places that may | seem culturally incompatible. I've yet to work for a | business that prioritizes regularly evaluating managers | for their management skills. It's only addressed when | shit really hits the fan. Managers are primarily | evaluated by their own managers on deliverables. As long | as they're getting results and entire teams aren't | quitting simultaneously there's no need to question | anything. As long as a manager is toxic in ways that | don't break the law or violate major company policies any | attempt to address this by a direct report carries the | risk of termination or retribution. Does it contradict | your company's cultural values? Rules for thee. | | And I wouldn't assume you're not one of them. The worst | cases I've run into aren't even the psychos that embrace | micro management as part of their "management style". | It's the ones that genuinely believe they aren't engaging | in the behavior. They're not micro-ing, they're "helping" | their team because they are an awesome manager and their | team is _almost_ awesome, they just need to be monitored | very carefully and given "suggestions" until they nail | it. But they'll never nail it. Because no one is as | smart, experienced or does a task "just so". They view | themselves as a mentor to all. All decisions must be | theirs to make. 
Jira becomes the perfect tool since the | team effectively becomes little boxes that accept tickets | or stories and return work both performed and delivered | as specified. | | For any managers reading this that don't see a problem | with this or see some of those behaviors in yourself | please understand that you are sacrificing your team's | happiness and motivation at the altar of your own | insecurities. No one can grow where they're not trusted | and no one can improve their skills when they're never | given latitude to make meaningful decisions. Your people | will make mistakes. They will accomplish things in ways | that are different from how you would do them. It might | even be objectively worse. That's ok. That's how you grow | into a strong team with confident members. | mistrial9 wrote: | I was told by a lifetime manager turned successful | consultant, that roughly fifty percent of engineering | firms govern their engineers basically using fear. | ornornor wrote: | > using fear | | Could you elaborate? What kind of fear? "You're fired"? I | wonder how effective it actually is because of the | current job market and also because I (and others) react | very poorly to this kind of tactics: "you want me to fear | getting fired? Joke's on you, please DO fire me, I dare | you" | KronisLV wrote: | > I wonder how effective it actually is because of the | current job market | | Counterpoint: software developers aren't necessarily well | paid or highly regarded _everywhere_ , since remote | working for companies abroad hasn't quite gotten | mainstream enough. | | So it might just be effective against some people, or in | cases where the hiring process itself has become | increasingly unreasonable - the job being working on | boring CRUD apps but the hiring process being multiple | stages of Leetcode and complex interviews. 
| | That's probably not applicable to everyone since plenty | of folk can grokk Leetcode and find jobs without too much | trouble, but i still recall "The Unseen 99%" article: | https://www.hanselman.com/blog/dark-matter-developers- | the-un... | | It probably applies to the industries and companies where | devs are treated as a cost center and since those | companies aren't all out of business, plenty of people | must be working in such environments, with sometimes sub- | optimal conditions. | numpad0 wrote: | I'm guessing it's a sort of a nerd shorthand for "various | means that are accompanied with self confusion of users | but not with strong rational or scientific or technical | basis" | zrail wrote: | The perf process is basically one big exercise in fear- | based control. | 52-6F-62 wrote: | Kanban, by design, was a tool used in production control. | It's one of the ways Toyota made their JIT production | function. | | I worked on the line (Toyoda Iron Works) and used a real- | life Kanban implemented by the plant engineers. It was | used for quality control, to broadcast quality control | and station output, and was checked regularly against | their internal estimates and baselines and used also as a | gauge for employee output. | | Control is what it's designed to do. The very fact that | Kanban is the tool of choice should support at least some | of OP's points, objectively. | [deleted] | sjtindell wrote: | Agreed. This is a problem of scale in my opinion. When we | have 10 engineers, it is easy to check in with everyone | and know what they are working on and get a status | update. When we have 500 engineers, making sure all their | tasks are aligning (organizations are one big race | condition) is not just hard but impossible without some | sort of tracking system. We all want to grow big. To do | so, your processes need to change as you add more people. | The exceptions (Valve, Netflix, etc.) that can handle | being flat or semi-flat are very unique. 
| biomcgary wrote: | Are they unique because their problem domain allows it or | because the leadership is uniquely ideologically driven | (and competent) to implement efficient, flat systems? | malermeister wrote: | > ensure the agreed upon process is followed for quality | or consistency. | | Isn't that just a more corporate way of phrasing | "control"? | robertlagrant wrote: | Not in a negative way. You want to trust engineers to | always have changes built and tested before they go to | production, but when something egregious happens you need | to go back and see what went wrong. You can choose to | interpret that as control, but really the only | alternative (often cited) is "Well that shouldn't ever | happen, so you don't need tooling to support that | situation". | | And that is not a useful way of thinking when you have | real engineers writing software that people depend on. | malermeister wrote: | I think the problem is that the processes are often not | _mutually agreed_ , but instead dictated by middle | managers. | | JIRA then becomes a tool for enforcing arbitrary rules, | e.g. control | robertlagrant wrote: | This is very likely even if engineers come up with the | processes, unless all process is scrapped and done from | scratch every time an engineer is hired. | Rimbo wrote: | Oh, nonsense. People buy Atlassisn because the licensing is | cheap, not because it's particularly good at what it does | or designed with any particular workflow in mind. | Viliam1234 wrote: | Cheaper than whatever is the open-source alternative? | chaosite wrote: | Sure, if you host it yourself you have to pay someone to | admin it (usually significantly more expensive than a | license), and if you use a hosted solution you have to | pay the host. | ivan_gammel wrote: | Free software has zero acquisition cost, but non-zero | TCO, which can measure in millions USD (recurring salary | of dedicated IT team), depending on the size of | organization and complexity of the setup. 
You will need | to maintain on-premise infrastructure, automate backups | and recovery, automate security, automate updates | (including testing and rollbacks) etc etc, basically | doing all the jobs of the people responsible for the | infrastructure at the SaaS provider, but at much smaller | scale and not achieving the same efficiency. You will | have to do those jobs considerably better to justify the | costs. | mistrial9 wrote: | in thirty years of experience, I see this talking point | straight from Microsoft anti-Open Source days.. | | > Free software has zero acquisition cost, but non-zero | TCO, which can measure in millions USD | | Often a primary driver is exactly the opposite -- for- | profit companies are accustomed to paying money for a | good or service, with a billing pattern and legal | obligations. The company financial deciders do not want a | setup that does not have a billing pattern and clear | legal obligations. Meanwhile, Open Source Software went | from niche to mission-critical in the 2000s via the | Internet. For-profit companies (and their publicists) | scrambled to explain it, and came up with that exact line | repeated again today. I do not blame any person for | saying it, it was in print in some reliable place. It | does not capture the reality in 2022 IMO. | ivan_gammel wrote: | To be honest, I do not understand your comment. | | > The company financial deciders do not want a setup that | does not have a billing pattern and clear legal | obligations. | | I haven't ever met a CTO or CIO, who would make budget | decisions like that, neither I do it this way myself. The | reality in 2022 is the same as it was in 2012 or in 2002: | when you choose a solution, you consider all long term | costs. In 2022 TCO for the server software includes | everything that I mentioned in my comment and more. | There's a lot of use cases for OSS in corporate | environment, for sure, but not every OSS solution is | cheap or even affordable. 
Running on-premise open source | collaboration tool is certainly not cheap if you do it | right. | ofrzeta wrote: | I don't see how it is cheap. Standard may be cheap but | then you are missing a lot of features that are announced | on the product pages with a small footnote saying "only | in premium". | _dark_matter_ wrote: | I feel you here, but I've been at multiple companies that | used JIRA and never once had any of those requirements. | I've also never seen it come up when deciding which | ticketing system to use. Teams have always been free to | move tickets at-will. | KptMarchewa wrote: | One very large video game studio has tons of automation | for Jira. Imagine someone deciding to add new weapon. The | automation creates 100s of tasks for concept artists, 3d | artists, animators, sound artists, software developers | with complex dependencies better those. Most importantly, | automation creates multiple QA steps for each element of | completed work. | | The same exists for levels, enemies, quests and tons of | other elements. | | I would not be surprised if a lot of studios had similar | workflows. | BlargMcLarg wrote: | See, that is great. Automate what can logically be | deduced from the information available and set up | templates to provide that information. For developers, it | should be automated enough you shouldn't have to write | the same info twice, once in commit | messages/merges/branch names, once in the ticket itself. | If the workflow is so streamlined, all that information | can be deduced and the ticket can be advanced | automatically. Most information is available and | documented for other parties. | | However, that's just not what most people go through in | companies using JIRA. Worse, they have to toggle between | pages multiple times, each taking at least a few decent | seconds to reload. I'd like to give JIRA the benefit of | the doubt here, but it sounds like the tool is just _very | easy_ to misconfigure and abuse. 
| robertlagrant wrote: | This is pretty easy with Jira. There's a GitHub plugin | which links PRs and commits to a ticket, and a GitHub | plugin that links ticket numbers back to Jira tickets. | | And you generally do them both at a lower level than | tickets, certainly commits, so you don't want to have too | much automation between them as that starts adding | constraints. | theptip wrote: | I think you've got part of the answer here, but are selling | it short. Jira is the most complex task-processing rule | engine that is also easy enough for a small team to | operate, and also has the broadest set of integrated tools | of any offering. | | You can use Jira as a simple Scrum board, a Kanban board, | or you can build enforced-process monstrosities. You can | build customer-support / internal-helpdesk workflows, or | even model internal work-item-oriented business processes, | etc. Now, as you point out, just because you can doesn't | mean you should, and many orgs fall into the trap of making | issue workflows overly-restrictive. But most companies (I | believe) choose Jira before they choose those hairy task | workflows. Startups with zero process use Jira. | | Also, you can integrate it all together to give good-enough | dashboards/roadmaps, good-enough (for some, not me) docs | integrations with Confluence, Git integration with | Bitbucket etc. -- while there are big issues with these | systems, I think it would be myopic to ignore the real | benefits of working in one integrated stack where every | design doc you write has dynamically-updated labels and | auto-complete for each issue you type in. | | For context, I use Jira for tasks and don't love it, found | Confluence to be really annoying and so I don't use it, and | prefer Gitlab to Bitbucket, but I think you have to | recognize these unique selling points. If all Jira had to | offer was the rule engine it would not be as widely used. 
| pid-1 wrote: | Yeah my team uses Jira to keep track of what we are doing | and what we need to do. | | Each member can actually organize their sprint and create | tasks. | | Point assignment is not a big deal, it's just there so we | avoid promising more than we can chew. | | I've found Jira really pleasant to use for lightweight | processes. | [deleted] | richardw wrote: | I'm just a user but totally happy with all our Atlassian | apps. Confluence is a huge win across our multi-thousand | person company and the best teams use it very well. I like | the integration between Jira and Bitbucket. We don't over | complicate things and it works fine. | | It's like my taste in wine. I don't want an overdeveloped | sense of taste where only a $400 bottle will do. I'm fine | with what we have because the work is what excites me and if | people are documenting projects and managing workloads and | committing code, we're 90% of the way there. | danielovichdk wrote: | Good point. | | Wine that costs 400$ is for fun. | | You don't drink that professionally. | spaetzleesser wrote: | "horrendous monolithic architecture " | | I don't really understand what this has to do with "monolithic" | or not. | | Atlassian's software is probably very complex and convoluted | but from my experience it's almost impossible to keep a clean | architecture in a software system that has grown over many | years and is used and customized by many customers so you have | to avoid breaking backwards compatibility. | JoBrad wrote: | It sounds like that's what they are doing, but it's manual. | outworlder wrote: | > OK, so you restore backups to a separate system, and | selectively copy the stomped accounts data back to production | | This seems to be exactly what they are doing, as described in | the article. They don't have automated tools to do this. 
| omoikane wrote: | Not being able to selectively restore data for a subset of | users might also indicate that users are not fully isolated | from each other, which is worrying for technical and | nontechnical reasons. | indymike wrote: | There is nothing non-technical that matters. If we start | acting like it does, we make incredibly poor decisions that in | fact have nothing to do with physical reality, and quickly | arrive at unworkable technology. | omoikane wrote: | Non-technical reasons include "legal" and "compliance", | which often matter a fair bit. I am not disagreeing that | non-technical requirements occasionally lead to poor | decisions, for some value of poor. | indymike wrote: | I live in a state that once tried to legislate that pi = | 3.15. The results were tragic, and the attempt to | legislate a ratio was a failure, much like systems | created by regulation and laws often are. Math is much | less forgiving than legal prose. Making database | decisions based on criteria that don't make any | engineering sense one way or the other is not far off | from legislating the value of PI. | ksala_ wrote: | Personally, given the multi-day outage, I think I would just | restore everything to a separate system, and then only point | the affected customers to this new system. Take the hit of | having two fully separated systems initially, and work on | reconciling them afterwards without having to worry about the | ongoing outage. | | I wonder if they're not doing this due to some tech | limitations, to avoid taking the financial cost of running two | systems, or to avoid having to reconcile the systems. | bigtones wrote: | That's a really good idea.
| mandevil wrote: | At a big multi-tenancy company I used to work at, the problem | would have been the accessory machines: we had something like | 15-20 different machines around the main DB and API machines, | running cron jobs, terminating SSL connections, load | balancing, sending alerts to us and customer emails out, etc. | And while the backing up and failing over on DB and API | machines was a well documented, thoroughly tested process... | the other machines were all custom jobs that were very poorly | documented, with who knows what scripts running on them, that | might or might not be important. Trying to replicate all of | that during an emergency would have been a challenge. | | For just this sort of problem, we actually had three DB | servers running all the time: active, passive, and _hour | behind_ with the ability to break _hour behind_ 's copying of | the write-ahead log of active as the DBA's secret weapon for | just this problem. If all customers had accidentally lost an | hours worth of data it would have been embarrassing, but much | less than completely shutting out hundreds of paying | customers for two weeks, I think? | underdeserver wrote: | > Simple concepts aren't that simple at their scale | | It's true that nothing is simple at scale, but it's important | to note that simple concepts are the _only_ concepts that work | at scale. | VWWHFSfQ wrote: | Most likely the database tables themselves are just a mixture | of everyone's data. There's no true multitenancy. So they have | to load the backups into a separate database. Then just go | through and individually select/insert into the old database. | And then you have to worry about things like foreign key | constraints complicating the bulk data loading. Are you going | to disable constraint enforcement while you bulk load the data? | How does that affect existing and new data from customers using | the database? Just a guess. But this sounds like a nightmare | honestly. 
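The foreign-key worry raised above can be illustrated with a small sketch. One alternative to disabling constraint enforcement during a selective copy is to load tables in dependency order, so that every referenced row exists before the rows that point at it. The table names and the `FOREIGN_KEYS` map below are hypothetical and have nothing to do with Atlassian's actual schema; a real tool would read the relationships from the database catalog.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of tables
# that its foreign keys reference.
FOREIGN_KEYS = {
    "tenants": set(),
    "projects": {"tenants"},
    "issues": {"projects", "tenants"},
    "comments": {"issues"},
}


def load_order(foreign_keys):
    """Return tables ordered so referenced tables come before referrers."""
    # TopologicalSorter treats the dict values as predecessors,
    # so parents are emitted before their children.
    return list(TopologicalSorter(foreign_keys).static_order())


print(load_order(FOREIGN_KEYS))
# ['tenants', 'projects', 'issues', 'comments']
```

If the schema contains a reference cycle, `static_order()` raises `graphlib.CycleError`. In that case a plain ordering is not enough, and something like deferred constraints would be needed instead.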
| tetha wrote: | Yup. The database schema of one of our products uses a | tenant_id in most tables to separate customers logically. | | I eventually got a tenant exporter to work. | Practically, this requires some deep and nasty digging | through the information_schema to build a graph of tables and | foreign key constraints. Once it has that, it generates | selects with a simple where clause for tables with the | tenant_id, and selects with weird joins all over the place | for other tables to dump the tenant data. | | All of that sounds complex, but that part took a day or two | to hammer together to 90% completion, since it's just some | graph handling. The other 10% was getting some weird date | formatting questions right to produce a properly importable | sql dump. And interestingly enough, it's working for more | than just that one product. | | But that's just where the journey started. After that, it | took weeks and months to sort out legacy tables, old | tables, tables without indexes, tables no one knew about, | tables that were important (but not), tables with | inconsistent data, .... And it's just handling a single | relational database. And compared to \copy in psql, it's | slow. And at times, weird things happen if you import huge | chunks of sql into a postgres with deferred foreign keys | (because our schema has cyclical references). | | Point is, I know how painful it can be to handle that kind of | database schema, at a ridiculously smaller scale. I'm kind of | happy to not work there. | [deleted] | radicaldreamer wrote: | I can't believe that they would intermix the data in that | way... but if they did, godspeed to them, they're likely | still overpromising what can be done in this time frame. | mdoms wrote: | They don't, you're responding to speculation which is just | outright wrong. Jira and Confluence are single-tenanted | databases, unless something fundamental has changed at | Atlassian in the past 4 years.
| | Source: worked at Atlassian, on Jira, 4 years ago. | robertlagrant wrote: | Then Atlassian's description of why the restore took so | long makes no sense to me. | dabeeeenster wrote: | How else do you run a multitenancy platform? | kikki wrote: | Not quite the same but at Fandom (Wikia), every wiki has | its own DB (over 300,000 wikis), and they are clustered | across a bunch of servers (usually balanced by traffic). | It works well - but we don't ever really need to query | across databases. There's a bunch of logic around | instance/db selection but that's about as complex as it | gets. | jjice wrote: | Interesting architecture. From a design point of view, I | like the idea of full isolation. From an infrastructure | point of view I'm a little scared. I'd assume it's | actually not that bad and there's a good way to manage | the individual DBs and scale them individually. | | Really interested if you can share any details. | | Edit: I know each wiki is on a subdomain. Does each wiki | also have it's own server? | kikki wrote: | There are _many_ databases on each server, last I checked | there was around 8 servers (or: "clusters") - and we have | it so the traffic is somewhat evenly distributed across | each server. There are reasonable capacity limits, and | when servers get full we spin up a new one and start | accepting new wikis there. I am not in OPS, and they do a | lot of work behind the scenes to make this all run | smoothly - but from an eng perspective we rarely have | issues with this at scale. | | Some of this was open source before we unified all of our | wiki products, which has a lot of the selection / db | logic, at https://github.com/Wikia/app. | spookthesunset wrote: | How do you update the schema on 300,000 databases? | msh wrote: | At minimum separate tables for each tenant. | Spivak wrote: | At that point you might as well just do separate schemas, | it's actually less headache. | radicaldreamer wrote: | Sorry, I'm not actually sure... 
maybe someone who's | experienced in backend db can elucidate here. | | Is it not a good idea to spin up separate db instances | for each client/company? | indymike wrote: | Answer: it depends on the application. For example big | social app is not going to provision a new db for every | user, or for every customer that runs an ad. Likewise, a | lot of enterprise software fits a model where each | customer getting it's own db makes sense. So, really, | just a design decision. | nemothekid wrote: | I believe you can sign up an account for free or | incredibly cheap ($5/user). You would potentially have | tens of thousands of databases. Imagine trying to do | something like a database migration to add a column. I | believe the day to day operations would be a nightmare as | no RDBMS has probably had that kind of feature stress | tested. | some-guy wrote: | The company I work at (Workday) does this, but it's for | business / liability reasons. | robertlagrant wrote: | Bearing in mind the licence fees of Workday, the costs of | separate databases pale in comparison! | andyjohnson0 wrote: | > Is it not a good idea to spin up separate db instances | for each client/company? | | It depends, really. There is a trade-off in terms of | software and operational complexity vs scalability/perf | and isolation. And probably a bunch of other factors. | | If you have separate databases for each customer, schema | migrations can be staged over time. But that means your | software backend needs to be able to work with different | schemas concurrently. You can also benefit from | resilience and isolation guarantees provided by the dbms. | On the other hand, having a dbms manage lots of databases | can affect perf. Linking between databases can be a | minefield, especially w/r/t foreign keys and distributed | transactions. | | https://docs.microsoft.com/en-us/azure/azure- | sql/database/sa... 
| Spivak wrote: | > But that means your software backend needs to be able | to work with different schemas concurrently. | | Not if you're truly multi-tenant and each customer has | their own app servers. Then your code and schema version | are always in lock-step. | andyjohnson0 wrote: | True. But then you have an additional problem ... | truffdog wrote: | Separate DB instances doesn't scale as well cost wise, | and generally means onboarding takes a few minutes | instead of being instant. It is very common though. | Spivak wrote: | The solution that satisfies everyone is having a separate | _schema_ per customer and a number of database clusters. | Then each customer is assigned to a particular cluster. | Always make sure you have excess capacity on your pool of | clusters and onboarding is still instant. | brightball wrote: | There are basically two options for multi-tenancy with | their own tradeoffs. | | 1. An account/tenant_id field for each table | | 2. A schema for each tenant wrapping all of the tables | | Option 2 gives you cleaner separation but complicates | your deployment process because now you have to run every | database change across every schema every time you | deploy. This gets more complicated as your code is | deploying in case the code itself gets out of sync, | there's a rollback or an error mid deploy due to an issue | with some specific data. | | The benefit of the approach is the option to do different | backup policies for different customers, makes moving | specific customers to specific instances easier and you | avoid the extra index on tenant_id in every table. | | Option 1 is significantly easier to shard out | horizontally and simplifies the database change process, | but you lose space on the extra indexes. Plus in many | databases you can partition on the tenant_id. | | Most people typically end up with option 1 after dealing | with or reading horror stories about the operational | complexity of option 2. 
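As a rough illustration of option 1 described above, here is a minimal sketch of a shared table with a `tenant_id` column, the extra index the comment mentions, and a query helper that always scopes by tenant. The table, column names and sample rows are made up for the example, and SQLite is used only so the sketch is self-contained.

```python
import sqlite3


def make_db():
    """Build an in-memory database where all tenants share one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues ("
        " id INTEGER PRIMARY KEY,"
        " tenant_id INTEGER NOT NULL,"
        " title TEXT NOT NULL)"
    )
    # The extra per-table index that option 1 costs you.
    conn.execute("CREATE INDEX idx_issues_tenant ON issues (tenant_id)")
    conn.executemany(
        "INSERT INTO issues (tenant_id, title) VALUES (?, ?)",
        [(1, "Login fails"), (1, "Slow search"), (2, "Export broken")],
    )
    return conn


def issues_for_tenant(conn, tenant_id):
    """Return issue titles for exactly one tenant, oldest first."""
    rows = conn.execute(
        "SELECT title FROM issues WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [title for (title,) in rows]


conn = make_db()
print(issues_for_tenant(conn, 1))  # ['Login fails', 'Slow search']
print(issues_for_tenant(conn, 2))  # ['Export broken']
```

The isolation here lives entirely in the `WHERE tenant_id = ?` clause, so every query path has to go through a helper like this one. That is the trade-off against option 2, where the schema boundary does the separating.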
| outworlder wrote: | Option 2 has many unforeseen consequences. | | Business wants to run a query across customers? In most | DBs you need either custom code or to create a stored | procedure to iterate across schemas. | | Every table that you create is multiplied by the number | of customers. This has implications for some database | systems (like PG's vacuum). | | Your migrations will take _forever_ to run. | | Etc. | Spivak wrote: | The second problem is mitigated by the fact that schemas | are trivially migratable between database servers. Once | you grow too big for one cluster just make another. | eropple wrote: | The secret bomb in option 1 is that you generally have to | have smarter primary keys that fully embrace multitenancy | and while Atlassian hires smart folks and I'm sure they | at some level know this--that's a relatively hard | retrofit to work into a system. | [deleted] | codingdave wrote: | It is like any other architectural choice - there are | pros and cons both directions. If you have separate db | instances, you have to scale up the operations to manage | each one - migrations, scripts, etc need to be either run | against them all, or you need good tooling in place to | automate it. A single instance avoids all that, but is | more complex in the actual software and definitely more | complex for security. A single DB also would let you | share data amongst organizations fairly easily, but | whether that is good or bad depends on your product. I've | created and run products both ways, and I like separate | DBs at small scales, single DBs at medium scale, but | separate DBs again at huge scale if you also put | management tooling in place. | stickfigure wrote: | I have built multiple multi-tenancy platforms and I never | create separate databases for each customer. If you have | separate databases, it's almost impossible to run | meaningful queries across all of them. That architectural | choice creates far more headaches than it solves. 
Usually | people end up with the split-database architecture when | they want a quick retrofit for a system that wasn't | designed with multiple tenants. | | I've also had to restore partial data from backups on a | few occasions when customers fat-fingered some data and | asked pretty-please to undo. If someone on staff | understands the system well, it's not hard. I suspect | Atlassian suffers from a complicated schema and a post- | IPO brain drain. | x0x0 wrote: | I can't believe anyone would do separate databases. | | Just wait until a migration doesn't run on 2 of your 400+ | customer databases. Or multi-hour migrations. | tedunangst wrote: | Sounds good to me. Now you've got 398 happy customers on | the new version, and a small scale issue to resolve with | two customers. | rectang wrote: | When all customer data lives in the unified database: | Just wait until a bug in a query exposes the data of | customers to each other, creating instant regulatory and | privacy nightmares for everyone. | x0x0 wrote: | With an orm and customer objects to create scoped | queries, I haven't found this to be a problem. It's also | very easy to check in code reviews. And not a painful | issue from, well, the lack of this happening given it's | an extremely common app design. | sharken wrote: | It's likely a mixture of all these factors, the brain | drain could absolutely be responsible. | | At least it would not be the first time in history that a | company has lost the engineering spirit. And instead the | business people have taken over, so that details like | disaster plans become less of a priority. | | A business person and an engineer will always view risk | differently, better disaster plans is a kind of insurance | that is a lot harder to sell when too many business | people run the company. | ZetaZero wrote: | This. It would be an impossible nightmare for every | account to have their own DB. Hundreds of thousands of | accounts and databases.... 
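The "migration doesn't run on 2 of your 400+ customer databases" scenario above can be sketched as a runner that treats each tenant database independently, records which ones failed, and is safe to re-run. This is a toy under stated assumptions: the `issues` table, the `priority` column and the use of SQLite's `user_version` pragma as a schema version marker are all choices made for the example, not anything from a real product.

```python
import sqlite3

# Hypothetical migration and the schema version it produces.
MIGRATION_SQL = "ALTER TABLE issues ADD COLUMN priority TEXT"
TARGET_VERSION = 1


def migrate_all(tenant_dbs):
    """Migrate every tenant DB; return the names of tenants that failed."""
    failed = []
    for name, conn in tenant_dbs.items():
        (version,) = conn.execute("PRAGMA user_version").fetchone()
        if version >= TARGET_VERSION:
            continue  # already migrated, so re-running is harmless
        try:
            conn.execute(MIGRATION_SQL)
            # Record success only after the schema change went through.
            conn.execute(f"PRAGMA user_version = {TARGET_VERSION}")
        except sqlite3.Error:
            failed.append(name)  # leave this tenant on the old version
    return failed
```

A caller would pass a mapping of tenant name to open connection and retry or investigate whatever comes back in the list. The sketch leaves out the harder part discussed in the thread: the application code then has to cope with tenants sitting on two different schema versions until the stragglers are fixed.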
| lelandbatey wrote: | I worked at company that architected their multi-tenancy | in almost exactly this style. In their particular case, | only a few of the very largest customers had their | database set aside on their own dedicated instance, but | every customer did have their own DB with their own set | of tables. Having worked in that world (every customer | had their own DB) and on a product where all customers | had their data intermingled in one gigantic set of tables | in one giant DB on one logical instance, I'd definitely | encourage the "every customer gets their own DB". | | Giving every customer their own table means you're going | to need database administrators. For these folks their | _dedicated_ job was maintaining, operating, and changing | their fleet of databases, but they where very technical | and were _amazing_ to work with. | david422 wrote: | > I'd definitely encourage the "every customer gets their | own DB". | | Does this extend to services as well? We have a suite of | (micro) services. Are they all segregated? | mdoms wrote: | This is the case. I won't comment on your "hundreds of | thousands" figure because the number of Cloud customers | was a closely guarded secret at least when I worked | there, but yes one DB per tenant, dozens to hundreds of | DBs per server, and some complicated shuffling of tenant | DBs when you run into noisy neighbours. | mh- wrote: | That makes this prolonged restore process all the more | confusing, then. | | I (and many others) assumed they had to graft in data | from backups since a full restore would clobber newer | changes from unaffected customers. | | If they're all isolated in their own logical per-tenant | DBs, I'm really at a loss for what is making restoration | take 3 weeks for 400 tenants. | | I understand if you'd rather not venture into it, but | care to offer any speculation? 
| spookthesunset wrote: | If they had multi-tenant databases for SaaS it would mean | either the self-hosted jira instances also had the same | multi-tenant database schema or they'd have to maintain | two almost entirely different data access layers for | cloud vs. on-prem. Since their cloud offering came from a | historically on-prem codebase, I would expect the easiest | way to offer cloud stuff is to do a DB per tenant. | Otherwise there would a shit-ton of new code that only | applies for cloud stuff.... | [deleted] | taeric wrote: | Wait. Why? This sounds like something that feels hard, if | you are used to the giant DBs of old. But you can | probably get many many instances of the smaller databases | without much trouble. | | Would still be some maintenance, don't get me wrong. But | far from impossible. | ezekg wrote: | Imagine the database schema migrations... | jagged-chisel wrote: | the good news is by the time you get to the 100th client, | you'll likely have run into all possible bugs and the | remaining 6900 will be pretty smooth. | Spivak wrote: | Having worked at shops that used this architecture it's | really not that bad. Can you write the code to do one | schema migration? Great, now you can do 1000. App server | boots and runs the schema migrations, drops privs and | launches the app. Now you've staved off your scaling | issues from "how to have a db large enough to hold all | our customer data" to "how to have a db large enough to | hold our biggest customer's data." Much easier. | robertlagrant wrote: | You can write the code to do 1000 schema migrations, but | the problem is if you've migrated 40% of them and hit an | issue. What do? | spookthesunset wrote: | One of the many reasons to put good constrains on fields | and use referential integrity! If you don't let the | database enforce data validity you are gonna get fucked | at some point! 
| | source: every single place I've worked at that poo-poos | referential integrity has a database that is full of | bullshit that "the application code" never cleaned up | | Always use referential integrity. The people who are | against it almost always are against it for superstitious | reasons (eg: "it makes things slow" or "only one codebase | calls it so the code can enforce the integrity"). All it | takes is exactly one bug in the application code to | corrupt the whole damn thing. And that bug _will_ happen | over the lifetime of the product regardless of how | "good" or "awesome" the programmers think they are.... | | ... I'll get off my soapbox now! | oauea wrote: | You'll quickly run into limitations of how many tcp | connections you can hold open. Unless you also want to | run separate app servers for each customer, which will | cost a lot of $$$ | | Oh, and just forget about allowing your customers to | share their data with each other, which most enterprises | want in one way or another. | kubami wrote: | Wait. What? None of the enterprise customers want to | share data with each other. And definitely not on a DB | level. That should happen in the business logic. | lalaithion wrote: | Lots of companies have consultants, and want to be able | to share their consulting-related tickets with their | consultants. And the consultants want one system they can | log into and see the tickets from all of the companies | that are hiring them. | outworlder wrote: | It would be a nightmarish scenario if you have thousands | of customers. And completely unnecessary. You can create | multiple databases and or schemas in a single instance. | | Don't do any of the above unless you understand the | implications. | hnlmorg wrote: | Multiple schemas? You don't need every tenant in the same | schema. However I'm not a DBA by trade so there might be | some issue with doing this at scale that I'm unaware of. | doliveira wrote: | By segregating as much as you can. 
Definitely not by | putting everything in a single table. At the very least | separate databases/schemas with proper permissions so | there's no chance of data intermixing. | | The best would be multiple separate database instances, | which is not even hard to manage, especially for qualified | engineers like Atlassian surely has plenty of. The | problem is usually business decisions to ignore the tech | debt... | akie wrote: | Now every time you run a database migration, you have to | adjust N tables - and in Atlassian's case, N is 200000. | Is that better? It depends. There is no "best" way of | doing multitenancy. | doliveira wrote: | There is a worst way of doing multitenancy, and that is | sharing a single big table. | hnlmorg wrote: | That's just an automation issue. It's not like you have | to write a bespoke database migration script per DB. | robertlagrant wrote: | The bug we are mitigating was also just an automation | issue. | hnlmorg wrote: | It's also pretty easy to foobar up a single DB instance | if you don't have proper guardrails in place. | | Automation wasn't the issue here. It's the symptom not | the cause. | doliveira wrote: | Way easier, actually. | robertlagrant wrote: | No, the symptom was the loss of customer data. | tus666 wrote: | > OK, so you restore backups to a separate system, and | selectively copy the stomped accounts data back to production | | You don't think that's exactly what they are doing? | qiskit wrote: | > However, if they [restore backups], while the impacted ~400 | companies would get back all their data, everyone else would | lose all data committed since that point | | How would they lose committed data? Even after restoring the | backups can't they run the logs so that everyone is caught up?
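The "restore the backup, then run the logs" idea in the question above can be pictured with a toy model. The snapshot is a dictionary, the log is an ordered list of changes, and replay applies them in order while skipping the entries attributed to the faulty script. Everything here (the entry format, the `skip_ids` parameter, the site names) is invented for illustration.

```python
def replay(snapshot, log, skip_ids=frozenset()):
    """Rebuild state from a snapshot by applying log entries in order."""
    state = dict(snapshot)  # never mutate the backup itself
    for entry_id, op, key, value in log:
        if entry_id in skip_ids:
            continue  # e.g. a write issued by the faulty script
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


backup = {"site-a": "v1", "site-b": "v1"}
changes = [
    (1, "set", "site-b", "v2"),     # legitimate customer write
    (2, "delete", "site-a", None),  # the accidental deletion
    (3, "set", "site-c", "v1"),     # new customer after the backup
]
print(replay(backup, changes, skip_ids={2}))
# {'site-a': 'v1', 'site-b': 'v2', 'site-c': 'v1'}
```

Real database logs are much less forgiving than this. Point-in-time recovery typically replays up to a chosen moment rather than skipping individual entries in the middle, and later writes may depend on the ones you want to drop.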
| drjasonharrison wrote: | Are you assuming that they record the events in a way that | can be played back? | mh- wrote: | _(There's a tacit assumption here that the data across | tenants is commingled in tables, and that's being disputed | elsewhere in the thread, but playing along..)_ | | You wouldn't be able to do that without forcing downtime for | all customers, for the duration it takes to restore the | snapshot and then replay the logs. Not to mention the risks | of the process failing somehow. | | You could narrow the window to just the "replay" portion, if | you were able to stand up an extra database/infra, to switch | over to when it was ready. But at some point you'd probably | still have to go read-only to checkpoint the logs and begin | the replay. | | It's of course possible to do something more complicated here | and stream the changes then eventually enact a failover, but | this would all be too complex and error-prone to introduce in | their current crisis mode. It's something I'd suggest | _considering_ when architecting their DR/BCP, but it's too | late for that kind of elegance (and complexity) now. | more_corn wrote: | Yeah, I'm thinking the exact same thing. | | Perhaps they don't have the right people on hand to do hard | things like this. | | They also apparently lack an incident response plan since a | critical component of that is comms to affected customers. | | They also lack good practices around preventing human error. It | should not have even been possible to make the initial mistake. | It certainly should have involved multiple steps of "are you | sure" and potentially even review. | | Sounds like an operations shit show. Glad it's not my circus. | robertlagrant wrote: | They have great practices; they even published them. They | just didn't follow them here. | ChrisMarshallNY wrote: | Heh. We have a Confluence account. | | That no one uses. | | So we didn't notice.
| teh_klev wrote: | You probably wouldn't if you weren't in the affected subset of | customers who were. This wasn't a total outage, but rather it | affected a group of users who had been running a legacy | standalone app called "Insight - Asset Management". | throwawayHN378 wrote: | When in doubt I go on LinkedIn and find an engineer that works | for the company and message them directly. | rjmunro wrote: | Does that work? I've sometimes thought about trying it, but | never actually done so. | flaviotsf wrote: | I recommend doing disaster recovery steps for your personal data | as well, such as Gmail. At one point recently I was creating | filters to delete bulk messages and - when the filter got | created, it somehow missed the from:@xyz.com domain part and I | ended up deleting => delete forever all emails. I noticed the | issue right away but it was enough to wipe 2-3 months worth of | emails (all of them, even Sent ones). | Traster wrote: | I remember finding out one of the senior managers from my company | ended up as head of software at Atlassian. It was at that point I | was convinced Atlassian has no idea what the hell they're doing. | I think this demonstrates the point nicely. | celim307 wrote: | After this they might have to boomerang back to your company | lol | Cederfjard wrote: | PSA, because I'm seeing a lot of JIRA in this thread: Since the | 2017 rebranding, Jira is no longer officially written in all | caps: https://community.atlassian.com/t5/Feedback-Forum- | articles/A... | | (You can argue how successful it was when people are still using | the old style in 2022). | | It also makes more sense, since Jira is not an acronym, it's a | truncation of Gojira, inspired by Bugzilla/Mozilla. | spookthesunset wrote: | Yeah, I've never typed it as anything but JIRA. Pretty sure my | auto-complete will vouch for that. | Vaslo wrote: | I bet the Shitlassian guy is dancing and singing because of this. | a-dub wrote: | i hate deleting things. 
prefer flags that hide things instead | (like a boolean deleted flag in an rdbms table). | | prevents data integrity issues in relational databases, makes | debugging easier and prevents disasters. | | ideally also include a timestamp, both for bookkeeping and safe | tools that only remove things that have been soft deleted for | some time and are safe to delete without compromising integrity | of anything that is not deleted (this is especially important in | relational data models) | jacquesm wrote: | Better still: a field that registers at what date a record was | supposedly marked as deleted. Because otherwise you still can't | bulk recover from an error. | a-dub wrote: | yep. but at least in the rdbms case, and probably in all | cases, a flag (and an index on it) tends to be essential for | query performance since the state of the flag will appear in | most, if not all queries. | | that's okay though, queries that reference the timestamp can | be slow since they're housekeeping. | bombcar wrote: | The GDPR and various things have made companies more skittish | in doing things this way, because they get scared. | | Perhaps an effective measure would be to create a key that | encrypts a customer's data, and give them a copy of the key, | and let them know that after a certain point your copy of the | key will be deleted, and if they want a restore past that point | they'll need to provide the key. | brimble wrote: | You may as well just delete it, then. I guarantee a high | percentage of users won't save that key _and_ be able to find | it later. GH (edit: or similarly nerdy sites) might (might!) | be able to get away with that, but as soon as part of your | process is "give the user a cryptographic key" you've just | guaranteed yourself a support nightmare, with normal users. 
| It's why the only cryptographic person-to-person | communication systems that've been broadly successful haven't | involved keeping track of _anything_ , and don't have a setup | process more complex than "point camera at QR code". | bombcar wrote: | Yeah, you end up in the case where you "officially" cannot | recover after X, but then you make sure that "accidentally" | you might be able to recover by keeping copies around | somewhere ... until someone realizes and you get sued. | a-dub wrote: | that's an interesting question, i've given a little thought | to this multi tenant saas stuff... | | not sure if the right way forward is some sort of innovation | in operating system and software design where people write | and run apps that feel like single tenant apps attached to | dedicated per tenant datastores where os and framework magic | handle per tenant encryption and segmentation (tenant id as | an os level concept) | | or... if it makes more sense to encrypt at the record level | with keys that only the customers hold using (assuming it's | up to the task) homomorphic encryption for things like | searches and other backend functions. | | either way, for now, soft deleting and following up with an | automatic daily hard delete of things soft deleted more than | x days ago is a totally reasonable approach. | | ops scripts should require typing "yes i know what i'm doing" | if someone attempts to hard delete things that have not yet | been soft deleted. | bombcar wrote: | Yeah, soft delete is the way to go in 99.99% of the cases, | with a system setup to eventually hard delete on some | schedule (preferably don't hard delete until X number of | backups have caught the soft deleted data safely, for | example). | miketria wrote: | Hi, this is Mike from Atlassian Engineering. Strongly | agree with this. I'd say that if you can afford it, don't | do the hard deletes on a schedule though. 
You never know | when there's a system out there referring to soft deleted | data that fails once the data is hard deleted. Hard | deletes should feel frightening because they are | frightening. | deckard1 wrote: | > The GDPR and various things have made companies more | skittish in doing things this way, because they get scared. | | They may be scared. But are they scared enough to reload | every single backup they have, purge the desired records, and | resave each and every single backup they have? And not also | worry they will corrupt/break the backups in the process. | | GDPR compliance is a mess of contradictions and unreasonable | asks which all seem to amount to "depends on who you ask." | yabones wrote: | What's a good Jira replacement? Redmine? Phabricator? | OpenProject? Just leaving the jira server alone and hoping | there's no new and exciting zero-days? One thing is clear, these | guys are a bunch of cowboys who can't be trusted with any amount | of data. | elxr wrote: | If you're hosting your code on GitHub, then GitHub projects is | definitely worth using. | | Does everything I used to use Jira for, but feels more modern | and lightweight. Also, it has dark mode. | dzikimarian wrote: | I'm on the same boat. Currently best choice seems to be | youtrack, which has reasonable licensing model for self hosted | option. | 420official wrote: | I'm not at all familiar but a tweet linked from the OP and | written by the author plugs https://linear.app/ | gkoberger wrote: | Linear is phenomenal. Probably built for a different audience | than Jira (it's like Superhuman for tickets), but if you want | something that works well and is opinionated I highly highly | recommend it. | bloopernova wrote: | Linear has a dark mode. I'm already won over! ;) | kitsune_ wrote: | Gitlab would be enough for engineering teams | nicoburns wrote: | We switched from JIRA to Shortcut https://shortcut.com/ | (formerly Clubhouse), and I'd highly recommend them. 
It's much | better than JIRA ever was, both from a UX perspective and an | implementation/performance perspective. | _dark_matter_ wrote: | _Bugzilla_ | [deleted] | originalvichy wrote: | For pure engineering teams it's either GitLab or Azure DevOps. | Those are the most common competitors I hear about. If you have | non-engineers the choice gets trickier. | histriosum wrote: | I've used Request Tracker for years. It's not pretty, it's | written in Perl, but I can fairly easily make it do all the | ticket tracking flows I care about and it just runs and runs | and runs. My scale is admittedly small, but I put tens of | thousands of tickets per year through my instance, and I | basically never have to touch it unless I'm setting up a new | queue or different flow for something. | beardbound wrote: | Wow, I've never seen anyone mention RT here. I used it for | years when I was working IT for my university while in | undergrad. It worked pretty well. It didn't have a lot of | features but it allowed clients/customers to respond to | tickets via email which was pretty cool at the time (late | 00s). It also ran pretty fast on the terrible servers we had | it on. | jacquesm wrote: | I suspect - pure speculation - that they _can't_ restore the | backups, because if they could then they could easily do this in | a way that accounts affected could be restored selectively. In | other words: test your backups, if you don't they won't be there | for you when you need them. | ordiel wrote: | All I can say as an Atlassian Server products user is that the | moment they said it was Cloud or nothing, I chose nothing. | | I would much rather run Gitea on a Raspberry Pi that I CONTROL | than be left powerless, able to do nothing for more than | a week.
Plus, having worked at cloud companies and having been | requested to "collect customer data" to hand it over to the | government, I would NEVER move critical pieces to anyone else's | infra... | | (Note: I am not supporting crime, but I would rather have privacy | and criminals than live in an authoritarian regime where a | dictator who knows everything about everyone keeps "peace".... Yes | I am looking at you China!) | | If mistakes will be made, at least I won't pay others to make them | for me.... | Phelinofist wrote: | As I understood it, it is not "Cloud or Nothing" but "Cloud or Data | Center" - is this wrong? | rsstack wrote: | The on-prem offering of Atlassian was discontinued. Existing | contracts are being honored but as of March 2022, that's the | end of the line for it. Maybe it will be revived now. | grnmamba wrote: | Unlike server, Data Center starts at 42.000$ per year. | | For most SMBs, it's cloud or nothing (or a different vendor, | of course). | yabones wrote: | AFAIK the Datacenter pricing starts at 500 users and goes up | from there. So a small org could end up paying 5-10x what | they were before on the Server license. | callamdelaney wrote: | Where do you think the cloud lives? | kache_ wrote: | return to monke | | vi your todolists on an ec2 box | [deleted] | mc4ndr3 wrote: | They never heard of beta testing, rolling updates, infrastructure | as code, federation, customer isolation, or Public Relations. | What the heck. | parentheses wrote: | A case for reducing complexity of software. Also, given the | recent GitHub incident spree, it's almost debilitating. The | entire tech industry takes a hit when companies like these fail | at operations. | oldshatterhand wrote: | Random guess, that this is a "we say we make backups, but we | actually take snapshots" issue :) | luckydata wrote: | so this is the end of Atlassian as a company right? | vinay_ys wrote: | Depends.
Are there strong alternate products to which customers | can easily migrate in next 6-12 months? If yes, and they choose | to move away, then Atlassian will be in serious trouble. I | wonder how many of their customers have long-term locked-in | contracts and if they have performance clauses that allows them | to exit such contracts. | lifefeed wrote: | Eh. The Exxon Valdez oil spill is a case study in the failure | of crisis management, but Exxon weathered it. It's a vastly | different industry with huge "economic moats," but it does | point to the fact that a company can weather a crisis. | raincom wrote: | I don't think so, as long as investors hold the stock, as long | as customers keep paying Atlassian. | function_seven wrote: | I had the same initial thought. _Surely_ a weekslong outage | would drive customers away permanently, right? | | Nope. From TFA: | | > _I asked customers if they would offboard Atlassian as a | result of the outage. Most of them said they won't leave the | Atlassian stack, as long as they don't lose data. This is | because moving is complex and they don't see a move would | mitigate a risk of a cloud provider going down._ | luckydata wrote: | it doesn't happen overnight, but this is a really bad | precedent and it will definitely have an effect on both sales | AND renewals. This market is theirs to lose and seems they | are doing everything they can to do just that. Github is | getting better, and it has mindshare amongst developers, not | to mention it's part of a company that like it or not knows | how to sell to large enterprises (Microsoft). | case0x wrote: | Why would it ? On our end everything works fine. If you're not | one of the 400 companies, there's no difference | asah wrote: | "first they came for ..." | gmfawcett wrote: | Poor taste, buddy. Comparing the Atlassian mess-up to the | Holocaust diminishes the Holocaust. | asah wrote: | um... the sentiment is universal it's not specific to | that particularly awful history. 
Sorry if it triggered | you, HN doesn't offer a delete button. | | FYI my ancestors fled oppression on both sides and I'm | well aware that it's a miracle I'm alive. | | Again, one bad thing leading to another is a common human | behavior, and the Holocaust is just an extreme example | that I ABSOLUTELY did not intend whatsoever. You make | this connection, not me. | gmfawcett wrote: | If I'm then one making this connection, then it should be | trivial for you to finish your sentence. "First they came | for..." Who are the Jews in your analogy ? Who are the | communists? the trade unionists? And who is the | totalitarian regime? | | Suggesting that Niemoller's poem is about "one bad thing | leads to another" is like suggesting that Anne Frank's | diary is about "sometimes girls have really bad days." I | understand you didn't mean any offense to anyone. But | that's not a license to be offensive, and then duck for | cover. | [deleted] | foobiekr wrote: | Severe operational issues don't give you pause? | dividedbyzero wrote: | Even those 400, especially Jira is crazy popular with a lot | of scrum masters and the scrum crowd in general. I could see | some of those 400 stick with Jira even after this shit show | if only to avoid losing all their scrum masters. | openknot wrote: | Yep. The vast majority of users don't follow these outages | (aka don't browse forums like Hacker News or r/sysadmin), and | thus aren't aware of them. | | Many of these users are decision-makers who decide what tools | to use, and will continue to use Atlassian out of inertia due | to lots of existing documentation on the tool (this is | compounded by not knowing about the outages, or not knowing | the severity of the outages), and also because large, | professional companies use their tools too. | | I don't necessarily agree with the perspective to stay with | it, but it uses a lot of political capital/innovation | tokens/goodwill/etc. 
to change systems, when there are | usually higher-priority things to do (than to get buy-in to | switch). | knbrlo wrote: | My current employer uses Jira but we seem to have not been | affected by this. Hopefully those customers affected are able to | press Atlassian for improvements from notification time, backups, | usability etc. | bitwise101 wrote: | This talk from Atlassian aged well | https://conferences.oreilly.com/software-architecture/sa-eu-... | danuker wrote: | I am tired of survivor-biased "best practices" advice. I wonder | which practices contained there are the _worst practices_. | nitinagg wrote: | Selectively restoring data only for certain rows is super hard. | But the communications by Atlassian has been the worst I have | ever seen in the industry. | raincom wrote: | So, it must be a bad idea to shove the data of multiple | customers in a single table controlled by some column name | ('tenant'). | profmonocle wrote: | I actually got an email from our Atlassian contact just the | other day encouraging us to switch to their cloud service. | Crazy that no one thought to pause those. (I assume it _must_ | have been scheduled.) | HeyLaughingBoy wrote: | This article on HN is the _only_ time I 've even heard that | Atlassian was having a problem. I suspect that 99% of the | tech "community" has absolutely no idea this is happening. | | We use Jira, but it's self-hosted for my team. Maybe other | teams that have transitioned to the cloud version are aware | that there's a problem, but I haven't heard about it. | LadyCailin wrote: | Apparently the self hosted version goes out of support in | 2024, so there will only be cloud hosting. Dumb dumb dumb. | mcintyre1994 wrote: | It's only 400 teams affected, but from this article it | sounds like they're all really big ones. | seanwilson wrote: | > Selectively restoring data only for certain rows is super | hard. 
| | What's the right way to structure your data here that would | make restoring more straightforward here? Is this | backup/restore scenario niche or they should have designed for | it? | inopinatus wrote: | in theory, shard your customer databases 1:1, job done. alas, | in practice, many SaaS compromise this two ways: | | a) overwhelmed by creeping featuritis, each customer's data | has relationships to global tables, and | | b) they backup their entire database cluster in one snapshot | | and there maybe other gotchas for restoration, like relying | on denormalized views and caches that have to be rebuilt. | they may also have erroneously assumed that data protection's | main value driver is whole-of-system disaster recovery, which | can lead to pathologies such as "we don't have a single- | customer restoration tool". | | this is not a niche scenario | bpicolo wrote: | Heck, it's worse now - if your data deletion tooling did a | good job, there are dozens or hundreds of microservice | databases to restore. | seanwilson wrote: | > shard your customer databases 1:1 | | What are the downsides to this? | inopinatus wrote: | * makes it much harder to distribute your tables by any | other factor, for whatever reason (usually performance, | sometimes archival) | | * disaggregates data that the SaaS might be interested in | querying/updating as an aggregate | | * not all ORM frameworks handle this case well, if at all | | * dumps are more than a single trivial command | | basically all your data operations gain an additional | dimension of complexity, and you may not perceive the | benefits until much later | seanwilson wrote: | Would it be fair to estimate that the majority of SaaS | companies aren't sharding like this then? Seems like a | lot of downsides that impact everything often except for | backups, which you'd restore rarely. | mypalmike wrote: | Per-customer is a common sharding strategy for noSQL | databases, so it may not be entirely uncommon. 
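A minimal sketch of the 1:1 customer sharding discussed above, assuming a hypothetical layout of one SQLite file per tenant. The file naming, the `issues` table, and every function name here are illustrative assumptions, not anything Atlassian is known to use. With that layout, restoring one customer is a file-level operation that never touches another tenant's data:

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def tenant_db_path(root: Path, tenant_id: str) -> Path:
    """Hypothetical layout: every tenant owns exactly one database file."""
    return root / f"{tenant_id}.sqlite3"


def open_tenant(root: Path, tenant_id: str) -> sqlite3.Connection:
    """Open (and lazily create) the database that belongs to one tenant."""
    conn = sqlite3.connect(tenant_db_path(root, tenant_id))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issues (id INTEGER PRIMARY KEY, title TEXT)"
    )
    return conn


def backup_tenant(root: Path, backups: Path, tenant_id: str) -> None:
    """Take a consistent copy of a single tenant's database."""
    src = sqlite3.connect(tenant_db_path(root, tenant_id))
    dst = sqlite3.connect(tenant_db_path(backups, tenant_id))
    src.backup(dst)  # sqlite3's online backup API
    src.close()
    dst.close()


def restore_tenant(root: Path, backups: Path, tenant_id: str) -> None:
    """Restore one tenant by copying its file back; other tenants are untouched."""
    shutil.copyfile(
        tenant_db_path(backups, tenant_id), tenant_db_path(root, tenant_id)
    )


def titles(root: Path, tenant_id: str) -> list:
    conn = open_tenant(root, tenant_id)
    try:
        rows = conn.execute("SELECT title FROM issues ORDER BY id")
        return [row[0] for row in rows]
    finally:
        conn.close()


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root, backups = Path(tmp, "live"), Path(tmp, "backups")
        root.mkdir()
        backups.mkdir()
        for tenant, title in [("acme", "Fix login"), ("globex", "Ship v2")]:
            conn = open_tenant(root, tenant)
            with conn:
                conn.execute("INSERT INTO issues (title) VALUES (?)", (title,))
            conn.close()
            backup_tenant(root, backups, tenant)
        # Simulate an accidental hard delete of one tenant.
        tenant_db_path(root, "acme").unlink()
        # Meanwhile another tenant keeps working after the backup was taken.
        conn = open_tenant(root, "globex")
        with conn:
            conn.execute("INSERT INTO issues (title) VALUES (?)", ("Post-backup work",))
        conn.close()
        restore_tenant(root, backups, "acme")
        return {tenant: titles(root, tenant) for tenant in ("acme", "globex")}
```

Calling `demo()` restores only the deleted tenant and keeps the other tenant's post-backup row. The trade-offs listed in the thread still apply: cross-tenant queries and schema migrations now have to fan out over every file.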
| darkwater wrote: | All of your points (minus maybe the first one) should be | "easily" solved/implemented in a company the size of | Atlassian, and maybe there are newer customers sharded | like this already. IMO what happened in this case is | basically tech debt that is now being paid with a loooot of | interest. | deckard1 wrote: | > not all ORM frameworks handle this case well, if at all | | typically this is probably for internal | reporting/metrics. But yeah, a custom script with direct | SQL is in order. Personally, my opinion is to avoid ORMs at | all costs. Never seen a benefit that wasn't trivially | done in SQL, and the downsides are incredibly painful. | | The big downside of sharding out, per customer, is that's | a _lot_ of databases to migrate on upgrades. Or roll back | if shit hits the fan. | | The upside? You can have customers on different versions | of your app if you really wanted to do such a thing. | | In any case, proper tooling goes a _long_ way to making | it the difference between wonderfully manageable and | torturous nightmare. Think idempotent backup scripts that | are capable of failing at any time and resuming where | they died, etc. | oauea wrote: | Work out a relationship graph and automate the export/import | ollien wrote: | As someone who has never had to perform this kind of recovery: | why is it so hard? | jacquesm wrote: | Because it is very difficult to maintain relational integrity | during a restore like that. | ollien wrote: | Gotcha. I guess you could be heavy-handed and disable | foreign key checks, but who knows what other bugs that | would bring into the mix. | teling2 wrote: | The other difficulty is if you don't restore the entire | state in a single transaction. Imagine you have partial | data restored in Table A but haven't updated Table B | correspondingly. Now some other program that consumes | Table A and Table B and doesn't have error handling will | crash (or worse, mutate state in other weird ways).
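The single-transaction point above can be sketched as follows, assuming a toy two-table schema (`projects` and `issues`, both hypothetical) in SQLite with foreign key checks left on. If any row fails to restore, the whole restore rolls back, so no reader ever sees "Table A" populated without its matching "Table B" rows:

```python
import sqlite3

# Hypothetical schema: issues reference projects, so a half-finished
# restore could otherwise leave rows that point at nothing.
SCHEMA = """
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    tenant TEXT NOT NULL,
    name TEXT NOT NULL
);
CREATE TABLE issues (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id),
    title TEXT NOT NULL
);
"""


def restore_tenant(conn, projects, issues):
    """Restore both tables atomically: everything lands or nothing does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO projects (id, tenant, name) VALUES (?, ?, ?)", projects
        )
        conn.executemany(
            "INSERT INTO issues (id, project_id, title) VALUES (?, ?, ?)", issues
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # keep integrity checks enabled
    conn.executescript(SCHEMA)

    projects = [(1, "acme", "Website")]
    broken_issues = [(10, 1, "Fix login"), (11, 999, "Orphaned issue")]
    try:
        restore_tenant(conn, projects, broken_issues)
    except sqlite3.IntegrityError:
        pass  # project 999 does not exist, so the whole restore is undone

    projects_after_failure = conn.execute(
        "SELECT COUNT(*) FROM projects"
    ).fetchone()[0]

    restore_tenant(conn, projects, [(10, 1, "Fix login")])
    counts_after_success = (
        conn.execute("SELECT COUNT(*) FROM projects").fetchone()[0],
        conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0],
    )
    conn.close()
    return projects_after_failure, counts_after_success
```

In `demo()`, the failed attempt leaves zero projects behind even though the project insert itself succeeded, and the clean retry lands one project and one issue. A real multi-service restore cannot lean on a single database transaction like this, which is part of why the thread calls it hard.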
| jacquesm wrote: | That _is_ relational integrity. | miketria wrote: | Hi, this is Mike from Atlassian Engineering. You are right the | communications from us have not lived up to our standard. We | will focus on this specifically once we restore service and get | the post incident review out there. More details here: | https://www.atlassian.com/engineering/april-2022-outage-upda... | lallysingh wrote: | Spamming HN isn't helping your cause man. | [deleted] | jacquesm wrote: | It is, but between 'hard' and 'impossible' there is the nagging | question of whether you actually really still _have_ that data. | chousuke wrote: | If the database schema for Jira on the cloud is anything like | the Datacenter version, I'm not surprised they're having a hard | time restoring data. I once tried to figure out how to find | duplicate / redundant project schemas by querying the database | (the required APIs are cloud-only) and could not even find | which tables stored half the data, never mind how they referred | to each other. | duxup wrote: | As this continues I suspect that this might be one of the few | times where a lack of transparency / good communication really | ... might not be better or worse because the situation is so | bad that transparency would be horrible just the same. | | Granted that's how all lies start / what sometimes people | assume and they're wrong but ... maybe this is that time? | | Maybe it is in fact so bad that honesty would be a push or | worse? | adamc wrote: | If so, that itself would be a huge red flag for dealing with | Atlassian. | duxup wrote: | I think it is...either way. | tmpz22 wrote: | It's super hard no doubt but I wonder how much of the data was | hot vs cold. | abraae wrote: | This is extremely poor for a large SaaS company. | | A standard RFP question for SaaS should be: | | - Can you restore data for a single customer, and if so, what is | the RTO for that operation? 
| | A smaller SaaS could be excused for only thinking about full | database restores. When you're a scrappy upstart, thinking about | hypotheticals is less important than survival. | | But for any decent size multi-tenanted SaaS, it's imperative that | you have the ability to selectively restore individual customers. | | The usual approach is to do a full database restore into a | separate instance, then run your pre-prepared "restore customer" | scripts to extract a single customer's data from there and pump | it across your prod instance. In Oracle for example you might use | database links to give your restore code access to prod and also | the restore instance at the same time. | | Atlassian - MUST DO BETTER. | scottlamb wrote: | Is it standard for a RFP to have a long list of questions like | this? I've never been involved in an RFP from either side. | | Is it standard to (in addition or instead) to have something | more general/forward-looking like: how do you watch other | providers' postmortems and apply the lessons to your own | system? | | > - Can you restore data for a single customer, and if so, what | is the RTO for that operation? | | If I were to aim something at this specifically, it'd be: can | you restore data for N customers or N% of customers, and if so, | what is the RTO for that operation? | | I mentioned in another comment that Gmail had a similar outage | in which they had to restore from tape. | https://news.ycombinator.com/item?id=31017160 They had a tool | for restoring a single account but not for restoring N accounts | in bulk, which would be significantly more efficient than doing | the one-account process N times. (E.g., in the case of tape | backups, imagine the difference between pulling data from the | tape library sequentially for each user vs all N at once, | particularly when one tape may hold data for many of these | customers.) | taude wrote: | Yes, pages of them. 
Multiple pages of security questions, | ciphers used, how data is stored, when it is encrypted, etc. | I filled out a 20-pager once. As the company got better and | more mature, we had a bunch of canned answers to make it | easier and faster.... | drsim wrote: | Which SaaS platforms provide account-level restores? | | If you contact them and say "please restore our data to how it | was last week" those I know do not offer this. | boardwaalk wrote: | I wouldn't expect them to advertise such a thing, but the | question is "can they recover from their own mistakes" not | "can they recover from mine." I don't care if this is with an | "account-level restore" or whatever; it shouldn't be my | concern. | fknorangesite wrote: | I wouldn't expect it if I just asked. I think it's reasonable | as part of their disaster recovery though. | Tobani wrote: | I accidentally built out this feature at a company once and | it totally saved our asses a week later. | hotpotamus wrote: | I actually did this once with Dropbox, though it wasn't a | feature they actually published. I clobbered my Dropbox | directory accidentally, but I was able to find a script | someone wrote to roll it back to a previous point in time and | it worked quite well. After that I also took my own snapshots | just in case. | MapleWalnut wrote: | Dropbox support can roll back your Dropbox account to a | previous point in time too. | ibejoeb wrote: | I did. It was a first-principles architectural decision. A | client could request any point-in-time within the contracted | period, and it could be either a restoration or a fully | operational, parallel instance of the account. | | It was initially a cover-my-own-ass design, but it turned out | to be an extremely popular feature that was never even used | for disaster recovery. Instead, it was used for audit | support, trial scenarios, projections, and all kinds of other | stuff. | Animats wrote: | Rather, customers must stop using Atlassian cloud services.
| imroot wrote: | Which is becoming more and more difficult due to them | focusing on Cloud Products (my on-prem renewal jumped almost | 8x this year). | | I'd rather use request tracker or bugzilla over Atlassian | these days | 1970-01-01 wrote: | Interesting note: Atlassian stock (NASDAQ: TEAM) is up 4% as of | noon today. | radicaldreamer wrote: | It might be a good short opportunity... I imagine a lot of | customers are kicking off their own internal process for | migrating away from JIRA. By the time they actually do, it'll | be at least a couple of quarters from now, which is when the | customer hit will start materializing in quarterly results for | the company. | | Maybe time to throw a few chips at some long term puts? | eli wrote: | Aren't most customers in 12+ month contracts? A migration | seems like it would take many months to select a new vendor | and migrate regardless. Be careful about the date on those | puts. It's pretty hard to out-think the market on this kind | of stuff. I'd just as soon bet the other way: few customers | will _actually_ churn and in 6 months this won 't really | matter. | __app_dev__ wrote: | They might even get some new customers after people who | never used it look at their site and offerings. | | Disclaimer I have puts that expire 4/22 (purchased | yesterday) so I hope they go down in the short term. Seems | like a total loss now after being up 50% yesterday. | dahdum wrote: | I wouldn't short. They just slapped 400+ customers and likely | hundreds of thousands of users in the face and the C-suite | didn't think it was important to even acknowledge. | | That might look like incompetence, but I think it's | confidence. They know the switching costs for large orgs are | so high they can treat these people like trash and few if any | will leave. I wouldn't be surprised if the total number of | seats among affected customers has gone up in a few months. 
| By failing to acknowledge the problem they've kept it out of | the mainstream media and financial press. | | They have their customers by the balls and don't respect | them. That's a short term bullish signal to me. | Iolaum wrote: | Given how out of sync tech people are with the general | population I 'd be tempted to think it's a buy opportunity. | Time will tell. | __app_dev__ wrote: | I bought puts yesterday morning. Was up 50% by the end of day | but now down to 50% of what I paid. | | Mine expire 4/22 but I have more calls open at the moment | anyways so if I had to choose between this going down or the | market up I'll take a full loss on these puts (seems likely | at the moment) | mkl95 wrote: | The fact it's been so long and they still haven't revealed and | explained the root cause of the outage is going to make it hard | to regain trust on their buggy, slow tools. The bright side of | the incident is that competitors that somewhat care about users | have a unique opportunity to stand out. | pgwhalen wrote: | > The fact it's been so long and they still haven't revealed | and explained the root cause of the outage | | They did last night: | https://www.atlassian.com/engineering/april-2022-outage-upda... | hu3 wrote: | > Faulty script. Second, the script we used provided both the | "mark for deletion" capability used in normal day-to-day | operations (where recoverability is desirable), and the | "permanently delete" capability that is required to | permanently remove data when required for compliance reasons. | The script was executed with the wrong execution mode and the | wrong list of IDs. The result was that sites for | approximately 400 customers were improperly deleted. | | Ouch. I hope no one person got the blame. This is a systemic | failure. Regardless, my regards to the engineers involved. | gtm1260 wrote: | Right? The way this reads it seems like one person set a | flag incorrectly, something I'm sure we've all done | numerous times. 
And there were no checks down the line to | catch it. | miketria wrote: | Hi, this is Mike from Atlassian Engineering. You are | right that the checks need to improve to reduce human | error, but that's only half of it. I don't see this as | human error though. It's a system error. We will be doing | some work to make these kind of hard deletes impossible | in our system. | TheJoeMan wrote: | I suppose that's why you don't combine a tazer and gun into | 1 device with 2 triggers. | mrits wrote: | If you have a 3rd trigger where the gun turns on the user | it would be fairly safe. | femiagbabiaka wrote: | the problem is that sometimes that gun looks like a | taser. | dylan604 wrote: | Instead, you make it with one trigger and a PRNG that | decides which gets activated. Just hope you've chosen the | right PRNG!! | tadfisher wrote: | I will then write a script calls your script with the | PRNG of my choice: PRNG1 always returns "trigger 2", and | PRNG2 always returns "trigger 1". This detail will be | documented in Confluence. | rubyist5eva wrote: | Considering American police can't even seem to get it | right when they have two distinct firearms, and are | trained to holster them on specific sides so they know | what they are grabbing - and still manage to f*ck it | up....this might be an improvement. | hinkley wrote: | If coding is theatrical then ops is operatic. You have to | telegraph stuff so over the top that the people in the | cheap seats know what's going on. | | I think what we've lost in the post-XP world is that just | because you build something incrementally doesn't mean it's | designed incrementally (read: myopically). | | My idiot coworkers are "fixing" redundancy issues by adding | caching, which recreates the same problem they're | (un?)knowingly trying to avoid, which is having to iterate | over things twice to accomplish anything. They've just | moved the conditional branches to the cache and added more. 
| | Most of the time, and especially on a concurrent system, | you are better off building a plan of action first and then | executing it second. You can dedupe while assembling the | plan (dynamic programming) and you don't have to worry | about weird eviction issues dropping you into a logic | problem like an infinite loop. | | More importantly, you can build the plan and then explain | the plan. You can explain the plan without running it. You | can abort the plan in the middle when you realize you've | clicked the wrong button. And you can clean up on abort | because the plan is not twelve levels deep in a recursive | call, where trying to clean up will have bugs you don't see | in a Dev sandbox. Deleting 500 users... | | Versus Permanently deleting 500 users... | | Maybe with a nice 10 second pause (what's an extra ten | seconds for a task that takes five minutes?) | deckard1 wrote: | I don't want to assume too much, since the details are | sparse. But I know for a fact that few of my current | coworkers know a thing about writing tooling code. It's | becoming a bit of a lost art. | | Here's the way such a script should be done. You have a | dry-run flag. Or, better yet, make the script dry-run | _only_. What this script does is it checks the database, | gathers actions, and then sends those actions to stdout. | You dump this to a file. These commands are executable. | They can be SQL, or additional shell scripts (e.g. | "delete-recoverable <customer-id>" vs. "delete-permanent | <customer-id>"). | | The idea is you now have something to verify. You can scan | it for errors. You can even put it up on Github for review | by stakeholders. You double/triple check the output and | then you execute it. | | Tooling that enhances visibility by breaking down changes | into verifiable commands is incredibly powerful. Making | these tools idempotent is also an art form, and important. 
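The dry-run-only tooling described above could look roughly like this. It is a minimal sketch: the `Site` record and `plan_deletions` function are hypothetical, and only the `delete-recoverable` / `delete-permanent` command names come from the comment. The planner never touches data. It emits commands for review, refuses to plan a permanent delete for anything that has not already been soft-deleted, and skips work that is already done, so re-running it is harmless:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical record describing one customer site."""
    site_id: str
    soft_deleted: bool = False


def plan_deletions(sites, requested_ids, permanent=False):
    """Build a reviewable plan. Returns (commands, notes); nothing is executed."""
    by_id = {site.site_id: site for site in sites}
    commands, notes = [], []
    for site_id in requested_ids:
        site = by_id.get(site_id)
        if site is None:
            notes.append(f"# SKIP {site_id}: unknown id")
        elif permanent and not site.soft_deleted:
            # Hard deletes are only ever planned for already soft-deleted sites.
            notes.append(f"# REFUSE {site_id}: not soft-deleted yet")
        elif permanent:
            commands.append(f"delete-permanent {site_id}")
        elif site.soft_deleted:
            # Idempotent: planning the same soft delete twice is a no-op.
            notes.append(f"# SKIP {site_id}: already soft-deleted")
        else:
            commands.append(f"delete-recoverable {site_id}")
    return commands, notes


def render_plan(commands, notes):
    """Text to dump to a file, read, and review before anyone executes it."""
    return "\n".join(notes + commands)


if __name__ == "__main__":
    inventory = [Site("site-1"), Site("site-2", soft_deleted=True)]
    requested = ["site-1", "site-2", "site-9"]
    print(render_plan(*plan_deletions(inventory, requested, permanent=True)))
```

Run with `permanent=True`, the example plans a permanent delete only for `site-2` and prints the refusal for `site-1` as a comment line. The rendered plan is the artifact that gets double-checked before execution.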
| krooj wrote: | This speaks to a lack of operational excellence - when you | develop a platform like JIRA, Confluence, etc, the | operational tools required to manage the systems are just | as important as the features themselves. If all you do is | pump out features, you're a feature factory and will suffer | these kinds of issues. There's no reasonable explanation | for needing a script to do what was described when the | necessary tooling to generalize such an operation should | have been in existence. | drc500free wrote: | Highlighting the text in any of their lists breaks the page | in interesting ways, apparently due to some twitter-sharing | functionality. | mkl95 wrote: | > Communication gap. First, there was a communication gap | between the team that requested the deactivation and the team | that ran the deactivation. Instead of providing the IDs of | the intended app being marked for deactivation, the team | provided the IDs of the entire cloud site where the apps were | to be deactivated. | | So what they are saying is that they are not testing scripts | at some staging server before running them in production. | It's wild that they've managed to scale their products so | much before something like this happened. | | I hope they've learnt their lesson and they set up some QA | process for that stuff. | notdang wrote: | it seems that it worked as intended, thus they have a QA | process. The problem was in the wrong IDs provided and I | doubt that at their scale they have a staging environment | that duplicates the customer data. | dylan604 wrote: | Would it be bad practice to append values to a GUID type | of ID that would help a human recognize them? For | instance, in this specific case they wanted app IDs as | APP-XXXXX-XXXX-blahblah and CLOUD-XXXXX-blahblah. 
| | I'm not looking to help their specific problems, but this | is more from a general question I've thought of doing but | never have done just because I'm sure I'd get laughed at | for blazing my own trail | bombcar wrote: | This is recommended in my experience, but you do have | some potential issues when a UUID gets reused or | repurposed. | | WHENEVER a human is involved in the chain, UUIDs can be | suspicious because there's no easy way to verify what it | is, whereas a human has a good chance of realizing that | $1,342.34 is probably not a valid date. | xeromal wrote: | I kind of dig it. Something that helps make things | obvious to a human | mkl95 wrote: | > I doubt that at their scale they have a staging | environment that duplicates the customer data. | | If there is no feasible way of replicating their | production environment somewhere else, then there should | be some sanity checks in place. Something like "if an | abnormally high amount of customer sites go down during | the script's execution, kill the script". This is a 20/20 | hindsight approach though and if Atlassian engineers | can't solve I doubt a random HN user like me can. | tempest_ wrote: | Is it just me or is highlighting on that site broken, | | Perhaps my ad blocker is causing that stupid highlight to | tweet js they are using to break. | teh_klev wrote: | > and they still haven't revealed and explained the root cause | of the outage | | They did, this post by Atlassian from yesterday is referenced | in the article. | | https://www.atlassian.com/engineering/april-2022-outage-upda... | | Still doesn't excuse them for the time taken to come clean. | selimnairb wrote: | CTO should be fired. | scottlamb wrote: | Gmail had a vaguely similar outage years ago. [1] tl;dr: | | 1. Different root cause. 
There was a bug in a refactoring of | gmail's storage layer (iirc a missing asterisk caused a pointer | to an important bool to be set to null, rather than setting the | bool to false), which slipped through code review, automated | testing, and early test servers dedicated to the team, so it got | rolled out to some fraction of real users. Online data was | lost/corrupted for 0.02% of users (a huge amount of email). | | 2. There were tape backups, but the tooling wasn't ready for a | restore at scale. It was all hands on deck to get those accounts | back to an acceptable state, and it took four days to get back to | basically normal (iirc no lost mail, although some got bounced). | | 3. During the outage, some users could log in and see something | frightening: an empty/incomplete mailbox, and no banner or | anything telling them "we're fixing it". | | 4. Google communicated more openly, sooner, [2] which I think | helped with customer trust. Wow, Atlassian really didn't say | anything publicly for nine days?!? | | Aside from the obvious "have backups and try hard to not need | them", a big lesson is that you have to be prepared to do a | _mass_ restore, and you have to have good communication: not only | traditional support and PR communication but also within the UI | itself. | | [1] | https://static.googleusercontent.com/media/www.google.com/en... | | [2] https://gmail.googleblog.com/2011/02/gmail-back-soon-for- | eve... | fishnchips wrote: | Funny enough, most of what we restored then was spam (ex gTape | SRE, remember the outage). | Aissen wrote: | The sad truth is that with 99.8% of customers unaffected, it was | probably thought to be a minor issue. If those customers didn't | have Gergely's ear we probably wouldn't have heard about it. | miketria wrote: | Hi, this is Mike from Atlassian Engineering. Not a minor issue. | Once we knew the extent and severity of the incident, we had | hundreds of engineers engaged and working to restore service. 
| tpmx wrote: | Is there a source on this number? | Aissen wrote: | From the article: | | > Atlassian claims the customers impacted were "only" 0.18% | of its customer base at 400 companies. | | From https://jira-software.status.atlassian.com/ : | | > The team is continuing the restoration process for the ~400 | impacted customers. | tpmx wrote: | > The team is continuing the restoration process for the | ~400 impacted customers. We have restored functionality for | 45% of impacted users. | | If this is truthful it implies more than 400 | impacted customers. | jgrahamc wrote: | _Communicate directly and transparently_ | | Yes. Always. | politelemon wrote: | > it takes between 4 and 5 elapsed days to hand a site back to a | customer. | | Atlassian's SLA page says, Premium Cloud Products 99.9% | | That's 43 minutes of downtime per month. | | That works out to, Atlassian can't have any more downtime for the | next 14 years. Are SLAs even real? | | I'm being slightly facetious. From the page text it's just a | threshold after which I think you're entitled to some money back | for that month. | bborud wrote: | Think of SLAs as "this is how hard we'll scramble when shit | hits the fan". | | Except...I don't even believe that. | chrsig wrote: | It's more "this is our contractual obligation, if we're down | more than this, then we might not charge you" | dylan604 wrote: | Lawyers are involved, so I'd assume some text about | "excluding acts of god, sabotage, etc." to weasel their way | out of things. They might even be able to get away with | "acts of incompetence" however a lawyer might phrase that | to allow their client to weasel. | mywittyname wrote: | That's a good way to get executive approval to replace a | system. Google or Apple can get away with this kind of | behavior, I doubt Atlassian can. | | This outage alone has spurred conversations in Slack | about how terrible JIRA is and why we should replace it.
| If this kind of shit was pulled, I can guarantee we'd be | on Shortcut, Linear, or something else in short order. | [deleted] | MajorBee wrote: | > Google or Apple can get away with this kind of | behavior, I doubt Atlassian can | | Atlassian absolutely can in enterprise settings. In my | company (a large cloud company), if JIRA goes down, large | swathes of the business will also stall, including code | deployment (deployments are tracked through change | management JIRA tickets). We also use the DC version of | Atlassian products, so presumably we aren't at the | mercy of Atlassian cloud engineers. | TheCoelacanth wrote: | SLA credits are a thing that actually happen in the | industry. I wouldn't automatically assume that they will | be able to weasel out of it. | | They are typically limited to the amount that you | actually paid, though, so basically they don't charge you | for the time when you couldn't use the product. You | usually won't get more than that. | [deleted] | mmcgaha wrote: | I think of SLAs as how do we design this thing. Ask for a | system without an SLA and I will give you a system that is | well designed and almost never goes down. As soon as you ask | for an SLA, I will give you an over-engineered system that | costs more, takes longer to implement and is slower to | iterate but it will almost never go down either. | echelon wrote: | In some industries, three nines isn't exactly stellar. Every | service I've worked on recently has demanded five nines of | uptime and tons of reporting on latency and even seconds-long | outages. | | I've been on-call during a total infrastructure outage whose | root cause was a service my team owned [1]. Our CEO was aware | of it. Customers and business partners were aware of it. | Other CEOs were aware of it. The media, you name it. | | Some outages can be "business ending" or "business damaging".
| That's why we made a practice and process of performing | regular disaster recovery exercises, had exceptionally well | documented runbooks, had monitoring attached to everything, | and engineered for resilience. | | Though I'm not familiar with how Atlassian runs, I think this | is an "engineering culture" thing or can be mitigated with a | proper approach. | | [1] The company has only had a few of these in total, and no | member of our team was culpable for the complicated failure. | krinchan wrote: | Per the article, if you experience < 95% uptime in any 30 day | window you qualify for a 50% discount. On a month or your | next year or ... ? It doesn't say. | hinkley wrote: | Basically, not counting lost sales, their income for this year | went down 2%, which is not as big a deal to them as it is to | their customers. | 0xbadcafebee wrote: | The typical SLA has no teeth because even if the customer gets | their money back, the real harm to the customer may be orders | of magnitude greater than what they paid for the service. Some | services are contractual or tightly embedded and you know | you're not gonna lose the customer if your service goes down | frequently. If the service provider doesn't lose money or face, | they aren't motivated to prevent the downtime. | | One alternative I thought of is the Charity SLA. The service | provider pledges to give $5,000 to charity for every minute of | downtime. Now everyone within the company knows "if we're down, | we're losing thousands of dollars a minute!" and thus will be | motivated to ensure the services stay up. But even if the | services go down, the company's making tax-free donations, | which isn't really bad for anybody. The company could even have | a specific downtime goal every year, to make sure their | monitoring/alerting/runbooks actually work, and to ensure they | donate every year. | bluedino wrote: | > Are SLAs even real? | | _Tommy: Here's the way I see it, Ted.
Guy puts a fancy | guarantee on a box 'cause he wants you to feel all warm and | toasty inside. | | Ted Nelson: Yeah, makes a man feel good. | | Ted Nelson: But why do they put a guarantee on the box? | | Tommy: Because they know all they sold ya was a guaranteed | piece of shit. That's all it is, isn't it? Hey, if you want me | to take a dump in a box and mark it guaranteed, I will._ | rglover wrote: | Haha I needed this, thank you. | nh2 wrote: | > it's just a threshold after which I think you're entitled to | some money back for that month | | That is exactly what SLAs are. | | There are just a lot of people applying the wishful thinking | that SLAs are a goal or metric of uptime. | | Consider the AWS S3 page on the topic: | https://aws.amazon.com/s3/sla/ | | "Reasonable efforts"; if not met, you get some fraction of the | money back. | | S3 has worse uptime than my desktop PC over the last years, but | affected users got some fraction of their spending back. | iso1631 wrote: | > S3 has worse uptime than my desktop PC over the last years | | That's sacrilege on HN | colechristensen wrote: | SLAs aren't real unless there's a contractual consequence for | not meeting them. | | And a couple of percent discount on services for the extra | downtime isn't really a meaningful consequence. | imglorp wrote: | I was just thinking that there's a hysteresis function here: | the service is worth much more to your team after you've | wired your whole process into it than before you joined. | | Offering you a free month or whatever doesn't acknowledge all | the person-hours lost. | colechristensen wrote: | There are certainly circumstances where you might have | grounds to sue for damages if an SLA is breached. I'm not | sure how often this happens but the losses from something | like Jira being down could be quite a lot more than anybody | pays for it.
It's quite likely that defenses against | exactly this are written into the contracts you agree to | signing up for the service though. | dxf wrote: | >Are SLAs even real? | | SLI: Some metric you use to measure a thing (e.g. uptime, | latency, etc.) | | SLO: Some objective you try to hit, as measured by the SLI | (e.g. "99.99% of requests are processed within 3 seconds") | | SLA: A promise to a customer that they will meet some SLO, and | consequences if they don't. If there aren't consequences for | not meeting the SLO, then measuring and tracking the metrics is | a pointless exercise. | | The SLA is "real" to the extent Atlassian is adhering to any | listed consequences. | aunty_helen wrote: | You could use it as a material breach of the contract and | possibly get out of any arrangement you have with Atlassian. | inopinatus wrote: | A typical SLA precludes that by specifying the remedy for | noncompliance with the performance measure. Only if they | fail to apply the remedy is there a material breach. For a | month-to-month SLA, this limits liability to one month's | subscription, as agreed in black-and-white. | | Customers that demand service level agreements often fail | to recognise that they cut both ways. | bombcar wrote: | Most SLAs say "if we miss this, you get time for free" which | means that these companies will hopefully get a refund ... | for the time they can't use the service. | | SLAs are mostly aspirational. | hinkley wrote: | Car warranties are also aspirational/virtue signaling, to | a point. | | If the maintenance costs exceed the margins on the cars you | lose money. Do that on too many product lines for too often | and you're looking at bankruptcy. But some makers clearly | are more risk averse than others, so a 6 year warranty from | maker X does not translate to a 7 year warranty from maker | Y. | mh- wrote: | But Atlassian's (published*) SLA offers a credit of at | most 50% of the month..
not really the same as a | manufacturer warranty on a car, which the costs of | servicing could easily exceed the price paid for the car. | | * - their larger customers will have negotiated SLAs. | | edit: to be clear, I expect Atlassian will offer | concessions beyond their SLA obligations. I'm only | responding to the comparison. | towelrod wrote: | The linked article directly talks about this, at this level | of downtime customers are promised a 50% discount. That's | what the SLA means, effectively | profmonocle wrote: | > and consequences if they don't. | | And these consequences usually just amount to getting some | percentage of your service fees back. I'm sure the affected | customers will get their entire monthly Atlassian Cloud fees | back. Since this is _so_ severe maybe Atlassian will even | give them credits for some # of free months. | | But there's no way the amount they'll get from Atlassian is | going to come close to what they're losing in productivity by | not having access to Jira & Confluence. At my company, | getting an entire free year of Jira wouldn't be worth Jira | being inaccessible for a week. | bee_rider wrote: | Does that indicate it would be preferable to pay more for a | more reliable solution, if such a thing were to exist? | Although, it definitely would be hard to quantify 'more | reliable' there. | miketria wrote: | Hi, this is Mike from Atlassian Engineering. For the customers | impacted by this incident covered by an SLA, we will adhere to | our contractual terms. However, given the long duration of this | outage, we are planning to go above and beyond for our impacted | customers. We are currently focused on restoring service, but | after that will be discussing how we can make it right for each | impacted customer. | encryptluks2 wrote: | It looks like you are focused on Hacker News comments. | leeoniya wrote: | > Atlassian's SLA page says, Premium Cloud Products 99.9% | | > That's 43 minutes of downtime per month. 
| | we need a better default way to communicate SLOs than "number | of 9s", which are more human. how the status quo has stayed | this way can only be attributed to intentional dark patterns, | imho. | deathanatos wrote: | ... honestly, even the "number of 9s" concept is a struggle | for some companies. I've seen a number of SLAs that fail to | correctly state a unit: it's %/<unit of time>, and I see the | "unit of time" get dropped every now and then, and the | resulting thing is meaningless absurdity. | mc4ndr3 wrote: | I've yet to work at an office that paid sufficient attention to | regular backup & restore validation, to scalable design, or | proper unit testing, or to basic security updates. Upper | management is repeatedly incentivized to produce vaporware, not | reliable service. | | Suits think a crummy Flash quiz on PII is enough to stop leaks. | The automotive industry couldn't stop airbags from acting as | claymores. It's even harder to get good code approved in tech. | hinkley wrote: | The longest Atlassian outage _so far_ ... | anshumankmr wrote: | I don't get it. JIRA is working for me. | vvpan wrote: | Article points out that 400 companies were affected. | politelemon wrote: | A subset of their customers are affected (badly). Enumerate | your blessings! | [deleted] | 1970-01-01 wrote: | Wouldn't you love to see the Atlassian internal JIRA epic for | this outage? | captaincaveman wrote: | A dumpster fire of a company that has terrible communication with | customers outside of outages as well. | snarkerson wrote: | > Most of them said they won't leave the Atlassian stack, as long | as they don't lose data. This is because moving is complex and | they don't see a move would mitigate a risk of a cloud provider | going down. However, all customers said they will invest in | having a backup plan in case a SaaS they rely on goes down. | | The real key lesson here. Your business is important to you. Not | so much to the service provider.
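The "99.9% is 43 minutes of downtime per month" figure quoted earlier in the thread, and the point that an availability percentage is meaningless without a unit of time, can both be checked with a few lines of arithmetic. The following is only a sketch: the function names are made up for illustration, and the 30-day month is an assumption, not something taken from Atlassian's SLA text.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted by an availability target over one period.

    The period length is an explicit parameter because "99.9%" alone says
    nothing until it is tied to a window (here assumed to be a 30-day month).
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


def periods_to_absorb(outage_days: float, availability_pct: float,
                      period_days: float = 30.0) -> float:
    """Number of periods of perfect uptime needed to 'pay back' one outage."""
    outage_minutes = outage_days * 24 * 60
    return outage_minutes / downtime_budget_minutes(availability_pct, period_days)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.9)
    months = periods_to_absorb(outage_days=5, availability_pct=99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"A 5-day outage uses about {months / 12:.1f} years of that budget")
```

Under these assumptions the monthly budget is about 43 minutes, and a 5-day outage consumes a little under 14 years of it, which is consistent with the numbers in politelemon's comment. As several replies note, the published SLA only defines when credits are owed, so this is a way of reading the percentage, not a statement of what Atlassian promises.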
| hougaard wrote: | Always judge companies on how they handle a crisis, not on how | they do when everything runs smoothly. | escot wrote: | When doing bulk deletes like this what safeguards do you put in | place, other than testing the script up/down in another | environment, turning off app servers etc (which I'm guessing they | did not do)? | mh- wrote: | Depends how complex the query/procedure is. | | Naive approach, replace delete with select and see if you're | surprised at the results. | | More mature approach, especially in an environment where | engineers are running bulk changes against the database, you | don't do bulk deletes. You change that delete into an update | that marks things for later collection. | | One tactic I've seen that worked, assuming you have | straightforward relational tables: you add a "marked for | deletion" column whose value is an identifier for the single | run of the bulk job you just did. Then you can query rows with | that value in that column to ensure it had the desired effect. | If you're satisfied, you run another bulk job which doesn't | re-run your original query.. it just deletes rows with that | "marked" value. | | Lots of places rely on schema-enforced foreign keys and | cascading deletes though. In that case, my recommendation is: | don't. ___________________________________________________________________ (page generated 2022-04-13 23:00 UTC)
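The "marked for deletion" tactic described in mh-'s comment can be sketched with Python's built-in sqlite3 module. Everything here is hypothetical and for illustration only: the `sites` table, the `legacy_app` and `deletion_run` columns, and the helper names are invented, and this is not a description of how Atlassian's tooling works.

```python
import sqlite3
import uuid


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sites ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " legacy_app INTEGER NOT NULL,"
        " deletion_run TEXT)"  # NULL means the row is not marked
    )
    conn.executemany(
        "INSERT INTO sites (name, legacy_app) VALUES (?, ?)",
        [("alpha", 1), ("beta", 0), ("gamma", 1)],
    )
    conn.commit()
    return conn


def mark_for_deletion(conn, where_sql, params=()):
    """Phase 1: tag matching rows with a unique run id instead of deleting.

    where_sql is trusted, operator-written SQL; values go through params.
    """
    run_id = uuid.uuid4().hex
    conn.execute(
        f"UPDATE sites SET deletion_run = ? "
        f"WHERE deletion_run IS NULL AND ({where_sql})",
        (run_id, *params),
    )
    conn.commit()
    return run_id


def review_marked(conn, run_id):
    """Phase 2: inspect exactly what this run would delete."""
    return conn.execute(
        "SELECT id, name FROM sites WHERE deletion_run = ? ORDER BY id",
        (run_id,),
    ).fetchall()


def unmark(conn, run_id):
    """Undo a run that marked the wrong rows; no data has been lost yet."""
    cur = conn.execute(
        "UPDATE sites SET deletion_run = NULL WHERE deletion_run = ?", (run_id,)
    )
    conn.commit()
    return cur.rowcount


def purge_marked(conn, run_id):
    """Phase 3: delete by run id only; the original query is not re-run."""
    cur = conn.execute("DELETE FROM sites WHERE deletion_run = ?", (run_id,))
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_demo_db()
    run_id = mark_for_deletion(conn, "legacy_app = ?", (1,))
    print("would delete:", review_marked(conn, run_id))
    print("purged rows:", purge_marked(conn, run_id))
```

The point of the run id is that the destructive step operates on a set someone has already looked at, and a bad run can be reversed with `unmark` as long as the purge has not happened. Application queries would also need to ignore marked rows for the mark to behave like a delete, which this sketch leaves out.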