[HN Gopher] Roblox October Outage Postmortem ___________________________________________________________________ Roblox October Outage Postmortem Author : kbuck Score : 235 points Date : 2022-01-20 20:01 UTC (2 hours ago) (HTM) web link (blog.roblox.com) (TXT) w3m dump (blog.roblox.com) | ryanworl wrote: | It seems that Consul does not have the ability to use the newer | hashmap implementation of the freelist that Alibaba implemented for | etcd. I cannot find any reference to setting this option in | Consul's configuration. | | Unfortunate, given it has been around for a while. | | https://www.alibabacloud.com/blog/594750 | throwdbaaway wrote: | I think they just made the switch to the fork that does contain | the freelist improvement in | https://github.com/hashicorp/consul/pull/11720 | | Took a major incident to swallow your pride? (consul, powered | by go.etcd.io/bbolt) | ryanworl wrote: | Is this option enabled by default? I don't think it is, and I | don't think they actually set it manually anywhere. | | EDIT: I think we're talking about two different options. I | meant the ability to leave sync turned on but change the data | structure. | ctvo wrote: | It's a spicy read. Really could have happened to anyone. All very | reasonable assumptions and steps taken. You could argue they | could have more thoroughly load tested Consul, but it's doubtful any | of us would have done more due diligence than they did with the | slow rollout of streaming support. | | (Ignoring the points around observability dependencies on the | system that went down causing the failure to be extended) | yashap wrote: | The main mistake IMO is that, the day before the outage, they | made a significant Consul-related infra change. Then they have | this massive outage, where Consul is clearly the root cause, | but nobody ever tries rolling that recent change back? That's | weird. | | I went into more detail here: | https://news.ycombinator.com/item?id=30015826 | | The outage occurring could certainly happen to anyone, but it | taking 72 hours to resolve seems like a pretty fundamental SRE | mistake. It's also strange that "try rollbacks of changes | related to the affected system" isn't even acknowledged as a | learning in their learnings/action items section. | statguy wrote: | So the outage lasted 3 days and the postmortem took 3 months! | koshergweilo wrote: | Read the article: "It has been 2.5 months since the outage. | What have we been up to? We used this time to learn as much as | we could from the outage, to adjust engineering priorities | based on what we learned, and to aggressively harden our | systems. One of our Roblox values is Respect The Community, and | while we could have issued a post sooner to explain what | happened, we felt we owed it to you, our community, to make | significant progress on improving the reliability of our | systems before publishing." | | They wanted to make sure everything was fixed before publishing | Operyl wrote: | They just got out of their busiest time of year, and taking the | time to write an accurate post mortem with data gleaned | afterwards seems sensible to me. | encryptluks2 wrote: | willcipriano wrote: | I have this little idea I think about called the "status update | chain". When I worked in small organizations and we had issues, | the status update chain looked like this: ceo-->me. As the | organizations got larger the chain got longer: first it was | ceo-->manager-->me, then ceo-->director-->manager-->me, and so on.
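(An aside on the freelist exchange at the top of this thread: in go.etcd.io/bbolt - the etcd fork of BoltDB - the freelist data structure and freelist syncing are two separate knobs on the Options struct, which matches the distinction ryanworl draws. A minimal sketch for illustration, assuming the go.etcd.io/bbolt API; the original github.com/boltdb/bolt exposes neither option, and "raft.db" is only a placeholder path.)

      package main

      import (
          "log"

          bolt "go.etcd.io/bbolt"
      )

      func main() {
          opts := &bolt.Options{
              // Swap the freelist representation from the default sorted
              // array to the hashmap implementation, while leaving
              // freelist syncing enabled.
              FreelistType: bolt.FreelistMapType,
              // The separate, orthogonal option: skip persisting the
              // freelist on commit entirely. Left false here, i.e.
              // "leave sync turned on but change the data structure".
              NoFreelistSync: false,
          }
          db, err := bolt.Open("raft.db", 0o600, opts)
          if err != nil {
              log.Fatal(err)
          }
          defer db.Close()
      }

(Whether Consul's raft storage layer actually surfaces these options in its own configuration is exactly the question raised above.)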
| I wonder how long the status update chains are at companies like | this? How long does a status update take to make it end to end? | tacLog wrote: | I am sorry, I didn't have enough context to understand what | you're saying. | | When you say: status update chain: ceo --> me. What information | is flowing from the CEO to you? or is it the other way around? | willcipriano wrote: | Both directions, he is asking "What is going on" and I am | telling him. | wizwit999 wrote: | > On October 27th at 14:00, one day before the outage, we enabled | this feature on a backend service that is responsible for traffic | routing. As part of this rollout, in order to prepare for the | increased traffic we typically see at the end of the year, we | also increased the number of nodes supporting traffic routing by | 50%. | | Seems like the smoking gun; this should have been identified and | rolled back much earlier. | conorh wrote: | Excellent write-up. Reading a thorough, detailed and open | postmortem like this makes me respect the company. They may have | issues, but it sounds like the type of company that (hopefully) | does not blame, has open processes, and looks to improve - the | type of company I'd want to work for! | sam0x17 wrote: | Too bad they exploit young game developers by taking a 75.5% | cut of their earnings. Big yikes of a red flag for me. | https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp... | badcc wrote: | This % includes the cost of all game server hosting, databases, | memory stores, etc., even with millions of concurrents, app | store fees, etc. All included in that number. The developer gets | effectively pure profit for the sole cost of | programming/designing a great game. Taught me how to program, | & changed my entire future. Disclosure: My game is one of the | most popular on the platform. | ygjb wrote: | And that's a reasonable decision for an adult to make, and | if they were targeting an adult developer community. | | I don't think anyone objects to adults making that choice | over, say, using Unity or Unreal and targeting other | platforms. | | In practice, explaining to my son, who is growing into an | avid developer, why I won't a) help him build on Roblox, or | b) fund his objectives of advertising and promoting his | work in Roblox (by spending Roblox company scrip) on the | platform has necessitated helping him to learn and | understand what exploitation means and how to recognize it. | | It's a learning experience for him, and a challenging issue | for me as a technically proficient and financially literate | parent who actually owns and runs businesses related to | intellectual property. It's got to be much more painful for | parents who lack any of those three areas. | RussianCow wrote: | Are you really suggesting that Roblox's cut should be | lower purely because the target market is children? Why? | If anything, the fact that a kid can code a game in a | high-level environment and immediately start making money | --without any of the normal challenges of setting up | infrastructure, let alone marketing and discovery--is | _amazing_, and a feat for which Roblox should definitely | be rewarded. | | In any case, what's the alternative? To teach your son | how to build the game from scratch in Unity, spin up a | server infrastructure that won't crumble with more than a | few concurrent players (not to mention the cash required | for this), figure out distribution, and then actually get | people to find and play the game? That seems quite
That seems quite | unreasonable for most children/parents. | | If this were easy, a competitor would have come in and | offered the same service with significantly lower fees. | adgjlsfhk1 wrote: | The problem is that robolox essentially lies to kids (by | omission) in an attempt to get free labor out of them. | RussianCow wrote: | Yes, I agree that the deception is a problem, although I | admit I'm not well versed in the issue. (I'm watching the | documentary linked elsewhere now.) But the original claim | was that they were exploiting young developers by taking | a big cut of revenues, which I disagree with. | noobhacker wrote: | Does your son have other alternatives to learn | programming and make money other than Roblox? | | If there are, then it's a great lesson about looking | outside of one's immediate circumstance and striving | towards something better. | lolinder wrote: | > And that's a reasonable decision for an adult to make, | and if they were targeting an adult developer community. | | If it's a reasonable decision for an adult to make | because the trade-offs might be worth it, doesn't that | mean that it would also be reasonable for a child to make | the same decision for the same reason? | | It's either exploitative or it isn't, the age of the | developer doesn't alter the trade-offs involved. | JauntyHatAngle wrote: | No, because a child is not deemed to have the necessary | faculties to make these decisions. | | The question should not be posed to a child, that is the | law for child labour, and why we do not have children | gambling on roulette wheels. | [deleted] | DerArzt wrote: | To add, there is a nice documentary here[1] which also has a | followup[2] that show even more of the issue at hand. Kids | making games and only getting 24.5% of the profit is one | thing, but everything else that Roblox does is much worse. | | [1] https://youtu.be/_gXlauRB1EQ | | [2] https://youtu.be/vTMF6xEiAaY | Qualadore wrote: | The 24.5% cut is fine, you have to consider the 30% app | store fees for a majority mobile playerbase, all hosting is | free, moderation is a major expense, and engine and | platform development. | | Successful games subsidize everyone else, which is not | comparable to Steam or anything else. | | Collectible items are fine and can't be exchanged for USD, | Roblox can't arbitrate developer disputes, "black markets" | are an extremely tiny niche. A lot of misinformation. | | It's annoying to see these videos brought up every single | time Roblox is mentioned anywhere for these reasons. Part | of the blame lies with Roblox for one of the worst PR | responses I have seen in tech, I suppose. | brimble wrote: | > The 24.5% cut is fine, you have to consider the 30% app | store fees for a majority mobile playerbase, all hosting | is free, moderation is a major expense, and engine and | platform development. | | You have successfully made the case for a 45% fee and | being considered approximately normal, or a 60% fee and | being considered pretty high still. 75+% is crazy. | Qualadore wrote: | I can't think of any other platform with comparable | expenses. Traditional game engines have the R&D | component, but not moderation, developer services, or | subsidizing games that don't succeed. | | It helps that seriously launching a Roblox game costs < | $1k USD always, usually < $200 USD. It's not easy to | generate a loss, even when including production costs. | That's the tradeoff. 
| [deleted] | nostrebored wrote: | The idea that these children would otherwise be making their | own games is knowingly, generally wrong. | munk-a wrote: | No matter what the cut is, I think there are some legitimate | social questions to ask about whether we want young people to be | potentially exposed to economic pressure to earn, or whether | we'd rather push back aggressively against youth monetization | to preserve a childhood where, ideally, children get to play. | | I know there are lots of child actors and plenty of household | situations that make enjoying childhood difficult for many | youths - but just because we're already bad at a thing | doesn't mean we should let it get worse. Child labour laws | were some of the first steps of regulation in the industrial | revolution, because inflation works in such a way that | opening the door to child labour can put significant | financial pressure on families that choose not to participate, | once demand adjusts to that participation being normal. | Aunche wrote: | By that logic, Dreams is "exploiting" developers by taking a | 100% cut of their earnings. Making money isn't the point of | either of these platforms. | loceng wrote: | The solution is creating a competing platform and offering a | better cut. You up for the task? | | Edit to add: lazy people downvote. | flippinburgers wrote: | I am naive about the reality on the ground when it comes to | this issue, but doesn't this hinge on transparency? If they | can show they are covering costs + the going market rate, | which seems to be 30% (at best), then wouldn't it be | reasonable? So whether a 45% cut for infra is OK seems to be | the question. | perihelions wrote: | More egregiously, they're (per your article) manipulating | kids into _buying real ads_ for their creations, with the | false promise that "you could get rich if you pay us". | | > _"As there are no discoverability tools, users are only | able to see a tiny selection of the millions of experiences | available. One of the ways to boost discoverability is to pay | to advertise on the platform using its virtual currency, | Robux."_ | | (Note that "virtual" currency is real money, bidirectionally | exchangeable with USD). | | The sales pitch is "get rich fast": | | > _"Under the platform's 'Create' tab, it sells the idea | that users can "make anything", "reach millions of players", | and "earn serious cash", while its official tutorials and | support website both "assume" they are looking for help with | monetisation."_ | | I agree that this doesn't really look like a labor issue. | That's a distracting and contentious tangent; it's easier to | just label it a type of _consumer_ exploitation. (Most of the | people aren't earning money -- but they are all _paying | money_.) It's a scam either way. | tptacek wrote: | Again, as across-thread: this is a tangent unrelated to the | actual story, which is interesting for reasons having nothing | at all to do with Roblox (I'll never use it, but operating | HashiStack at this scale is intensely relevant to me). We | need to be careful with tangents like this, because they're | easier for people to comment on than the vagaries of Raft and | Go synchronization primitives, and --- as you can see here | --- they quickly drown out the on-topic comments. | breakfastduck wrote: | Or how about giving a free platform to get into games | development for young people that otherwise wouldn't have | become interested.
| digitalengineer wrote: | badcc wrote: | As one of the top developers on the platform (& 22 y/o, | taught myself how to program through Roblox, ~13 years ago), | I can say that it seems a majority of us in the developer | community are quite unhappy with the image this video | portrays. We love Roblox. | dan_pixelflow wrote: | That's kind of on Roblox then for not answering their | questions transparently. | duxup wrote: | My son loves it, I think it is a great way to learn. | empressplay wrote: | I think what bothers me the most is the effective 'pay to | play' aspect | [deleted] | tptacek wrote: | This is an interesting debate to have somewhere, but it has | _nothing to do with this thread_. We need to be careful about | tangents like this, because it's a lot easier to have an | opinion about the cut a platform should take from UGC than it | is to have an opinion about Raft, channel-based concurrency, and | on-disk freelists. If we're not careful, we basically can't | have the technical threads, because they're crowded out by | the "easier" debate. | digitalengineer wrote: | True, it is off topic to the postmortem. However, the top | comment talks about wanting to work there, and I get it is very | relevant to see the bigger picture. Personally, I could never | work for them. I have a kid, and the services and culture | they created around their product is sickening and should | be made illegal. | nightpool wrote: | While I personally think digitalengineer's comment was | low-effort and knee-jerk, I think this general thread of | discussion is on topic for the comment replied to, which | was specifically about how the postmortem increased the | commenter's respect for Roblox as a company and made them | want to work there. I think an acceptable compromise | between "ethical considerations drown out any technical | discussion" and "any non-technical discussion gets | downvoted/flagged to oblivion" would be to quarantine | comments about the ethics of Roblox's business model to a | single thread of discussion, and this one seems as good as | any. | pvg wrote: | The guidelines and zillions of moderation comments are | pretty explicit that this doesn't count as 'on topic'. You can | always hang some rage-subthread off the unrelated | perfidy, real or perceived, of some entity or another. | This one is extra tenuous and forced, given that 'the type | of company I'd want to work for' is a generic expression | of approval/admiration. | BolexNOLA wrote: | You've pretty much articulated for me why I've been | commenting on Reddit less and less frequently. | duxup wrote: | I loathe the constant riffing on <related and yet nothing | indicates it is actually related/> topics. | | Sadly it is happening here on HN too, < insert the next | blurb about corporatism/> | BolexNOLA wrote: | Guess we need to find the next space lol | micromacrofoot wrote: | Yeah, as long as Roblox is exploiting children they're just | flat-out not respectable. This video is a good look at a | phenomenon most people are unaware of. | [deleted] | charcircuit wrote: | Players of your game creating content for it is not | exploitation. It's just how it works in the gaming world. | When I was a kid I spent time creating a Minecraft mod that | hundreds of people used. Did Mojang or anyone else ever pay | me? No. I did it because I wanted to. | jawngee wrote: | Mojang was likely not selling you on making a mod with | promises of making money, though. Roblox did that; maybe | they still do it. | digitalengineer wrote: | Please review the video.
The problem is not 'players | creating content'. | [deleted] | micromacrofoot wrote: | The way they're paying kids and what they're telling them | is a big part of the problem... they're pushing a lot of | the game development industry's problematic practices onto | kids who are sometimes as young as 10. | | If this were free content creation when kids want to do | it, then it would be an entirely different story. | ehsankia wrote: | > the type of company I'd want to work for! | | I recommend watching the following: | | https://www.youtube.com/watch?v=_gXlauRB1EQ | | https://www.youtube.com/watch?v=vTMF6xEiAaY | ineedasername wrote: | ">circular dependencies in our observability stack" | | This appears to be why the outage was extended, and was | referenced elsewhere too. It's hard to diagnose something when | part of the diagnostic tool kit is also malfunctioning. | AaronFriel wrote: | This outage has it all: distributed systems, non-uniform memory | access contention (aka "you wanted scale up? how about instead we | make your CPU a distributed system that you have to reason | about?"), a defect in a log-structured merge tree based data | store, malfunctioning heartbeats affecting scheduling, wow wow | wow. | | Big props to the on-calls during this. | tacLog wrote: | > Big props to the on-calls during this. | | Kind of curious about this. I know this is probably company | specific, but how do outages get handled at large orgs? Would | the on-calls have been called in first, and then have called in | the rest of the relevant team? | | Is there a leadership structure that takes command of the | incident to make big coordinated decisions and manage the risk | of different approaches? | | Would this have represented crunch time for all the relevant | people, or would there be a core team with other people helping | as needed? | WaxProlix wrote: | Oncalls get paged first and then escalate. As they assess | impact to other teams and orgs, they usually post their | tickets to a shared space. Once multiple team/org impact is | determined, leadership and relevant ops groups (networking, | e.g.) get pulled in to a call. A single ticket gets designated | the Master Ticket for the Event, and oncalls dump diagnostic | info there. Root cause is found (hopefully), affected teams | work to mitigate while the RC team rushes to fix. | | The largest of these calls I've seen was well into the | hundreds of sw engineers, managers, network engineers, etc. | yazaddaruvala wrote: | Typically: | | Yes. This was a multi-day outage and eventually the oncall | does need sleep, so you need more of the team to help with | it. Typically, at any reasonable team, everyone that chipped | in nights gets to take off equivalent days, and sprint tasks | are all punted. | | Yes. Not just to manage risks, but also to get quick | prioritization from all teams at the company. "You need | legal? Ok, meet ..." "You need string translations? Ok, | escalated to ..." "You need financial approval? Ok, looped in | ..." | | Kinda. Definitely would have represented crunch time, but a | very very demoralizing crunch time. Managers also try to | insulate most of their teams from it, but everyone pays | attention anyways. There is no "core team" other than the | leadership structure from your question 2. Otherwise, it is | very much "people/teams helping as needed". | quirino wrote: | Google has its Site Reliability Engineering book, which might | answer some of your questions: | | https://sre.google/sre-book/table-of-contents/ | sjtindell wrote: | Super interesting.
A setup where per-host ipvs or ebpf rules handle | service discovery seems much more resilient than this | heavy reliance on a functioning Consul service. The team shared a | great postmortem here. I know the feeling well of testing | something like a full redeploy and seeing no improvement...easy | to lose hope at that point. 70+ hours of a full outage, multiple | failed attempts to restore, has got to add years to your life in | stress. Well done to all involved. | johnmarcus wrote: | aaaalllllllll the way down at the bottom is this gem: >Some core | Roblox services are using Consul's KV store directly as a | convenient place to store data, even though we have other storage | systems that are likely more appropriate. | | Yeah, don't use Consul as Redis; they are not the same. | stuff4ben wrote: | But you can... which is what some engineers were thinking. In | my experience they do this because: | | A) they're afraid to ask for permission and would rather ask | for forgiveness | | B) management refused to provision extra infra to support the | engineers' needs, but they needed to do this "one thing" anyways | | C) security was lax and permissions were wide open, so people | just decided to take advantage of it to test a thing that then | became a feature, and so they kept it but "put it on the | backlog" to refactor to something better later | stuff4ben wrote: | Sounds like they need to switch to Kubernetes? | | I kid of course. One of the best post-mortems I've seen. I'm sure | there are K8s horror stories out there of etcd giving up the | ghost in a similar fashion. | spydum wrote: | you joke, but it's precisely this: | | >Critical monitoring systems that would have provided better | visibility into the cause of the outage relied on affected | systems, such as Consul. This combination severely hampered the | triage process. | | which gives me goosebumps whenever I hear people proselytizing | running everything on Kubernetes. At some point, it makes good | sense to keep capabilities isolated from each other, especially | when those functions are key to keeping the lights on. Mapping | out system dependencies (systems, software components, etc.) | is really the soft underbelly of most tech stacks. | YATA0 wrote: | >Sounds like they need to switch to Kubernetes? | | Hah! Good one! | schoolornot wrote: | The one thing you can say about Nomad is that it's generally | incredibly scalable compared to Kubernetes. At 1000+ nodes over | multiple datacenters, things in Kube seem to break down. | tapoxi wrote: | Do they still? GKE supports 15,000 nodes per cluster. | samkone wrote: | Mayhem. Hipsters | chainwax wrote: | Love the "Note on Public Cloud", and their stance on owning and | operating their own hardware in general. I know there have to be | people thinking this could all be avoided/the blame could be | passed if they used a public cloud solution. Directly addressing | that and doubling down on your philosophies is a badass move, | especially after a situation like this. | regnull wrote: | It's weird it took them so long to disable streaming. One of the | first things you do in this case is roll back the last software | and config updates, even innocent-looking ones. | yashap wrote: | That's what stood out to me too. Although they'd been slowly | rolling it out for a while, their last major rollout was quite | close to the outage start: | | > Several months ago, we enabled a new Consul streaming feature | on a subset of our services.
This feature, designed to lower | the CPU usage and network bandwidth of the Consul cluster, | worked as expected, so over the next few months we | incrementally enabled the feature on more of our backend | services. On October 27th at 14:00, one day before the outage, | we enabled this feature on a backend service that is | responsible for traffic routing. As part of this rollout, in | order to prepare for the increased traffic we typically see at | the end of the year, we also increased the number of nodes | supporting traffic routing by 50% | | Consul was clearly the culprit early on, and they'd just made a | significant Consul-related infrastructure change; you'd think | rolling that back would be one of the first things you'd try. | One of the absolute first steps in any outage is "is there any | recent change we could possibly see causing this? If so, try | rolling it back." | | They've obviously got a lot of strong engineers there, and it's | easy to critique from the outside, but this certainly struck me | as odd. Sounds like they never even tried "let's try rolling | back Consul-related changes"; it was more that, 50+ hours into | a full outage, they'd done some deep profiling and discovered | the streaming issue. But IMO root cause analysis is for later; | "resolve ASAP" is the first response, and that often involves | rollbacks. | | I wonder if this actually hindered their response: | | > Roblox Engineering and technical staff from HashiCorp | combined efforts to return Roblox to service. We want to | acknowledge the HashiCorp team, who brought on board incredible | resources and worked with us tirelessly until the issues were | resolved. | | i.e. earlier on, were there HashiCorp peeps saying "naw, we | tested streaming very thoroughly, can't be that"? | notacoward wrote: | In a not-too-distant alternate universe, they made the rookie | assumption that every change to every system is trivially | reversible, only to find that it's not always true | (especially for storage or storage-adjacent systems), and | ended up making things worse. Naturally, people in | alternate-universe HN bashed them for that too. | otterley wrote: | When you're at Roblox's scale, it is often difficult to know | in advance whether you will have a lower MTTR by rolling back | or fixing forward. If it takes you longer to resolve a | problem by rolling back a significant change than by tweaking | a configuration file, then rolling back is not the best | action to take. | | Also, multiple changes may have confounded the analysis. | Adjusting the Consul configuration may have been one of many | changes that happened in the recent past, and certainly | changes in client load could have been a possible culprit. | yashap wrote: | Some changes are extremely hard to roll back, but this | doesn't sound like one of them. From their report, it sounds | like the rollback process involved simply making a config | change to disable the streaming feature; it took a bit to | roll out to all nodes, and then Consul performance almost | immediately returned to normal. | | Blind rollbacks are one thing, but they identified Consul | as the issue early on, and clearly made a significant | Consul config change shortly before the outage started - one | that was also clearly quite reversible. Not even trying to | roll that back is quite strange to me - that's gotta be | something you try within the first hour of the outage, never | mind the first 50 hours.
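(For concreteness on the rollback being discussed: per yashap's reading of the report, disabling streaming was a configuration change rather than a binary downgrade. A hedged sketch of what that toggle looks like in 1.9/1.10-era Consul agent configuration, in HCL; this is an assumption based on the Consul documentation of the time, not Roblox's actual config, which the postmortem does not show.)

      # Client agents: stop using the streaming backend for blocking
      # queries; agents fall back to classic long-polling against the
      # servers.
      use_streaming_backend = false

      # Server agents: stop serving streaming subscriptions over RPC.
      rpc {
        enable_streaming = false
      }

(Flipping flags like these across a large fleet still requires a rolling reload of agents, which is consistent with the description above of a rollback that "took a bit to roll out to all nodes".)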
| [deleted] | Twirrim wrote: | The post indicates they'd been rolling it out for months, and | indicates the feature went live "several months ago". | | With the behaviour matching other types of degradation | (hardware), it's entirely reasonable that it could have taken | quite a while to recognise that software and configuration that | had proven stable for several months, and was still in place and | working, wasn't quite so stable as it seemed. | nightpool wrote: | Right, but it only went live on the DB that failed the day | before. Obviously, hindsight is 20/20, but it's strange that | the oversight didn't rate a mention in the postmortem. | Twirrim wrote: | "We enjoyed seeing some of our most dedicated players figure out | our DNS steering scheme and start exchanging this information on | Twitter so that they could get "early" access as we brought the | service back up." | | Why do I have a feeling "enjoyed" wasn't really enjoyed so much | as "WTF", followed by "oh shit..." at the thought that their main | way to balance load may have gone out the window. | Symbiote wrote: | It's difficult to know how quickly word could have spread, but | I enjoy knowing a few 11-year-olds learned something about the | Internet in order to play a game an hour early. | jandrese wrote: | The BoltDB issue seems like straight-up bad design. Needing a | freelist is fine; needing to sync the entire freelist to disk | after every append is pants-on-head. | benbjohnson wrote: | BoltDB author here. Yes, it is a bad design. The project was | never intended to go to production but rather it was a port of | LMDB so I could understand the internals. I simplified the | freelist handling since it was a toy project. At Shopify, we | had some serious issues at the time (~2014) with either LMDB or | the Go driver that we couldn't resolve after several months, so | we swapped out for Bolt. And alas, my poor design stuck around. | | LMDB uses a regular bucket for the freelist whereas Bolt simply | saved the list as an array. It simplified the logic quite a bit | and generally didn't cause a problem for most use cases. It | only became an issue when someone wrote a ton of data and then | deleted it and never used it again. Roblox reported having 4GB | of free pages, which translated into a giant array of 4-byte | page numbers. | tacLog wrote: | > BoltDB author here. | | How does this happen so often? It's awesome to get the | author's take on things. Also thank you for explaining and | owning it. Were you part of the incident response? | otterley wrote: | I, for one, appreciate you owning this. It takes humility and | strength of character to admit one's errors. And Heaven knows | we all make them, large and small. | kjw wrote: | I would not have guessed Roblox was on-prem with such little | redundancy. Later in the post, they address the obvious "why not | public cloud" question. They argue that running their own | hardware gives them advantages in cost and performance. But those | seem irrelevant if usage and revenue go to zero when you can't | keep a service up. It will be interesting to see how well this | architectural decision ages if they keep scaling to their | ambitions. I wonder about their ability to recruit the level of | talent required to run a service at this scale. | dylan604 wrote: | >I wonder about their ability to recruit the level of talent | required to run a service at this scale.
| | According to this user's comments, it doesn't look like it'll | be that tough for them: | | https://news.ycombinator.com/item?id=30014748 | nomel wrote: | > But those seem irrelevant if usage and revenue go to zero | when you can't keep a service up | | You're assuming the average profits lost are more than the | average cost of doing things differently, which, according to | their statement, is not the case. | noahtallen wrote: | I think the public cloud is a good choice for startups, teams, | and projects which don't have infrastructure experience. Plenty | of companies, though, still have their own infrastructure | expertise and roll their own CDNs, as an example. | | Not only can one save a significant amount of money, it can | also be simpler to troubleshoot and resolve issues when you | have a simpler backend tech stack. Perhaps that doesn't apply | in this case, but there are plenty of use cases which don't | need a hundred microservices on AWS, none of which anyone | fully understands. | otterley wrote: | Since the issue's root cause was a pathological database | software issue, Roblox would have suffered the same issue in | the public cloud. (I am assuming for this analysis that their | software stack would be identical.) Perhaps they would have | been better off with other distributed databases than Consul | (e.g., DynamoDB), but at their scale, that's not guaranteed, | either. Different choices present different potential | difficulties. | | Playing "what-if" thought experiments is fun, but when the | rubber hits the road, you often find that things that are | stable for 99.99%+ of load patterns encounter previously | unforeseen problems once you get into that far-right-hand side | of the scale. And it's not like we've completely mastered | squeezing performance out of huge CPU core counts on NUMA | architectures while avoiding bottlenecking on critical sections | in software. This shit is hard, man. | baskethead wrote: | This is not true, if they had handled the rollout properly. | Companies like Uber have two entirely different data centers, | and during outages they fail over to either datacenter. | | Everything is duplicated, which is potentially wasteful but | ensures complete redundancy; it's an insurance policy. If you | roll out, you roll out to each datacenter separately. So in | this case rolling out in one complete datacenter and waiting | a day for their Consul streaming changes probably would have | caught it. | otterley wrote: | The Consul streaming changes were rolled out months before | the incident occurred. | Symbiote wrote: | > So in this case rolling out in one complete datacenter | and waiting a day for their Consul streaming changes | probably would have caught it. | | But this has nothing to do with cloud vs. colo. | erwincoumans wrote: | >> We are working to move to multiple availability zones and data | centers. | | Surprised it was a single availability zone, without redundancy. | Having multiple fully independent zones seems more reliable and | failsafe. | mbesto wrote: | There have been multiple discussions on HN about cloud vs. not | cloud, and there are endless amounts of opinions of "cloud is a | waste blah blah". | | This is exactly one of the reasons people go cloud. Introducing | an additional AZ is a click of a button and some relatively | trivial infrastructure-as-code scripting, even at this scale | (see the sketch below). | | Running your own data center and AZ on the other hand requires | a very tight relationship with your data center provider at | global scale.
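("A click of a button and some relatively trivial infrastructure-as-code scripting" looks roughly like the following in practice. A hedged Terraform sketch, assuming the AWS provider; the instance count, instance type, and the app_ami variable are all hypothetical.)

      # Hypothetical variable naming the machine image to launch.
      variable "app_ami" {
        type = string
      }

      # Discover every usable AZ in the current region.
      data "aws_availability_zones" "available" {
        state = "available"
      }

      # Spread identical instances round-robin across those AZs.
      resource "aws_instance" "app" {
        count         = 6
        ami           = var.app_ami
        instance_type = "m5.2xlarge"
        availability_zone = data.aws_availability_zones.available.names[
          count.index % length(data.aws_availability_zones.available.names)
        ]
      }

(The catch, as the rest of this thread argues, is everything around the instances: state, replication, and failure modes don't spread across zones this easily.)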
| | For a platform like Roblox where downtime equals money loss | (i.e. every hour of the day people make purchases), there | is a real tangible benefit to using something _like_ AWS. 72 | hours of downtime is A LOT, and we're talking potentially | millions of dollars of real value lost and millions in | potential brand value lost. I'm not saying definitively they | would save money (in this case, profit impact) by going to AWS, | but there is definitely a story to be had here. | treis wrote: | But it wasn't a hardware issue. It was a software one, and | that would have crossed AZ boundaries. | mbesto wrote: | So then why does the post mortem suggest setting up multi-AZ | to address the problems they encountered? | treis wrote: | I took that to mean sharding Roblox instead of spanning | it across data center AZs. | abarringer wrote: | Was on a call with a bank VP that had moved to AWS. Asked how | it was going. Said it was going great after six months, but they | were just learning about availability zones, so they were going | to have to rework a bunch of things. | | Astonishing how our important infrastructure is moved to AWS | with zero knowledge of how AWS works. | kreeben wrote: | >> Having multiple fully independent zones seems more reliable | | I don't think these independent zones exist. See AWS's recent | outages, where east cripples west and vice versa. | count wrote: | That's not how they work. They exist, and work extremely well | within their defined engineering / design goals. It's much | more nuanced than 'everything works independently'. | kreeben wrote: | If the design goal of these zones is that they should be | independent of each other then, no, they do not work | extremely well. | Karrot_Kream wrote: | Availability Zones aren't the same thing as regions. AWS | regions have multiple Availability Zones. Independent | availability zones publish lower reliability SLAs, so you | need to load balance across multiple independent availability | zones in a region to reach higher reliability. Per-AZ SLAs | are discussed in more detail here [1] | | (N.B. I find HN commentary on AWS outages pretty depressing | because it becomes pretty obvious that folks don't understand | cloud networking concepts at all.) | | [1]: https://aws.amazon.com/compute/sla/ | kreeben wrote: | >> you need to load balance across multiple independent | availability zones | | The only problem with that is, there are no independent | availability zones. | | What we do have, though, is an architecture where errors | propagate cross-zone until they can't propagate any | further, because services can't take any more requests, | because they froze, because they weren't designed for a | split-brain scenario, and then, half the internet goes | down. | outworlder wrote: | > The only problem with that is, there are no independent | availability zones. | | There are - they can be as independent as you need them | to be. | | Errors won't necessarily propagate cross-zone. If they | do, someone either screwed up, or they made a trade-off. | Screwing up is easy, so you need to do chaos testing to | make sure your system will survive as intended. | kreeben wrote: | I'm not talking about my global app. I'm talking about | the system I deploy to, the actual plumbing, and how a | huge turd in a western toilet causes east's sewerage | system to overflow. | mlyle wrote: | > (N.B. I find HN commentary on AWS outages pretty | depressing because it becomes pretty obvious that folks | don't understand cloud networking concepts at all.)
| | What he said was perfectly cogent. | | Outages in us-east-1 AZ us-east-1a have caused outages in | us-west-1a, which is a different region _and_ a different | AZ. | | Or, to put it in the terms of reliability engineering: even | though these are abstracted as independent systems, in | reality there are common-mode failures that can cause | outages to propagate. | | So, if you span multiple availability zones, you are not | spared from events that will impact all of them. | Karrot_Kream wrote: | > Or, to put it in the terms of reliability engineering: | even though these are abstracted as independent systems, | in reality there are common-mode failures that can cause | outages to propagate. | | It's up to the _user_ of AWS to design around this level | of reliability. This isn't any different from not using | AWS. I can run my web business on the super cheap by | running it out of my house. Of course, then my site's | availability is based around the uptime of my residential | internet connection, my residential power, my own ability | to keep my server plugged into power, and the general | reliability of my server's components. I can try to make | things more reliable by putting it into a DC, but if a | backhoe takes out the fiber to that DC, then the DC will | become unavailable. | | It's up to the _user_ to architect their services to be | reliable. AWS isn't magic reliability sauce you sprinkle | on your web apps to make them stay up for longer. AWS | clearly states in their SLA pages what their EC2 instance | SLAs are in a given AZ; it's 99.5% availability for a | given EC2 instance in a given region and AZ. This is | roughly ~1.82 days, or ~43.8 hours, of downtime in a | year. If you add a SPOF around a single EC2 instance in a | given AZ, then your system has a 99.5% availability SLA. | Remember, the cloud is all about leveraging large amounts | of commodity hardware instead of large, high-reliability | mainframe-style designs. This isn't a secret. | It's openly called out, like in Nishtala et al's "Scaling | Memcache at Facebook" [1] from 2013! | | The background of all of this is that it costs money, in | terms of knowledgeable engineers (not like the kinds in | this comment thread who are conflating availability zones | and regions) who understand these issues. Most companies | don't care; they're okay with being down for a couple | days a year. But if you want to design high-reliability | architectures, there are plenty of senior engineers | willing to help, _if_ you're willing to pay their | salaries. | | If you want to come up with a lower cognitive overhead | cloud solution for high-reliability services that's | economical for companies, be my guest. I think we'd all | welcome innovation in this space. | | [1]: https://www.usenix.org/system/files/conference/nsdi13/nsdi13... | mlyle wrote: | Yes, but the underlying point you're willfully missing | is: | | You can't engineer around AWS AZ common-mode failures | using AWS. | | The moment that you have failures that are not | independent and common mode, you can't just multiply | together failure probabilities to know your outage times. | roughly wrote: | During a recent AWS outage, the STS service running in | us-east-1 was unavailable. Unfortunately, all of the | other _regions_ - not AZs, but _regions_, rely on the STS | service in us-east-1, which meant that customers which | had built around Amazon's published reliability model had | services in every region impacted by an outage in one | specific availability zone.
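(To make mlyle's point about multiplying failure probabilities concrete, a small sketch in Go. The 99.5% figure is the per-instance SLA cited above; the independence assumption baked into the multiplication is exactly what common-mode failures - such as the STS example here - break.)

      package main

      import "fmt"

      func main() {
          const hoursPerYear = 365 * 24 // 8760

          single := 0.995 // one instance in one AZ, per the SLA cited above

          // If two AZs failed independently, a service load-balanced
          // across both would only be down when both are down at once:
          twoAZ := 1 - (1-single)*(1-single)

          fmt.Printf("one AZ:  %.4f%% -> %5.2f h/yr down\n",
              single*100, (1-single)*hoursPerYear)
          fmt.Printf("two AZs: %.4f%% -> %5.2f h/yr down\n",
              twoAZ*100, (1-twoAZ)*hoursPerYear)
          // Prints roughly 43.80 h/yr down for one AZ and 0.22 h/yr for
          // two - but only if the failures really are independent. A
          // shared dependency used by both zones collapses the math
          // back toward the single-zone number.
      }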
| | This is what kreeben was referring to - not some abstract | misconception about the difference between AZs and | Regions, but an actual real-world incident in which a | failure in one AZ had an impact in other Regions. | Karrot_Kream wrote: | > Unfortunately, all of the other _regions_ - not AZs, | but _regions_, rely on the STS service in us-east-1, | which meant that customers which had built around | Amazon's published reliability model had services in | every region impacted by an outage in one specific | availability zone. | | That's not true. STS offers regional endpoints, for | example if you're in Australia and don't want to pay the | latency cost to transit to us-east-1 [1]. It's up to the | user to opt into them, though. And that goes back to what | I was saying earlier: you need engineers willing to read | their docs closely and architect systems properly. | | [1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti... | otterley wrote: | It's more subtle than that. | | For high availability, STS offers regional endpoints -- | and AWS recommends using them[1] -- but the SDKs don't | use them by default. The author of the client code, or | the person configuring the software, has to enable them. | | [1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credenti... | | (I work for AWS. Opinions are my own and not necessarily | those of my employer.) | johnmarcus wrote: | Yup, so true. People think redundant == 100% uptime, or | that when they advertise 99.9% uptime, it's the same thing | as 100% minus a tiny bit for "glitches". | | It's not. .1% of 365*24 = 87.6 hours of downtime - that's | over 3 days of complete downtime every year! | | For a more complete list of their SLAs for every service: | https://aws.amazon.com/legal/service-level-agreements/?aws-s... | | They only refund 100% when they fall below 95% | availability! Between 95% and 99%, the refund is 30%. I | believe the real target is above 99.9% though, as that | results in 0 refund to the customer. What that means is, | 3 days of downtime is acceptable! | | Alternatively, you can return to your own datacenter and | find out first hand that it's not as easy to | deliver that as you may think. You too will have power | outages, network provider disruptions, and the occasional | "oh shit, did someone just kick that power cord out?" or | complete disk array meltdowns. | | Anywho, they have a lot more room in their published SLAs | than you think. | | Edit: as someone correctly pointed out, I made a typo in | my math; it is only ~9 hours of allotted downtime. Keeping | in mind that this is _per service_ though - meaning each | service can have a different 9 hours of downtime before | they need to pay out 10% on that one service. I still stand | by my statement that their SLAs have a lot of wiggle room | that people should take more seriously. | mqnfred wrote: | Your computation is incorrect; 3 days out of 365 is 1% | downtime, not 0.1%. I believe your error stems from | treating .1% as 1%. Indeed: | | 0.001 (.1%) * 8760 (365d*24h) = 8.76h | | Alternatively, the common industry standard in | infrastructure (at the place I work, at least) is 4 | nines, so 99.99% availability, which is around 52 mins a | year or 4 mins a month iirc. There's not as much room as | you'd think! :) | foobarian wrote: | > Surprised it was a single availability zone, without | redundancy. Having multiple fully independent zones seems more | reliable and failsafe. | | It's also a lot more expensive.
Probably order of magnitude | more expensive than the cost of a 1 day outage | sam0x17 wrote: | Most startups I've worked at literally have a script to | deploy their whole setup to a new region when desired. Then | you just need latency-based routing running on top of it to | ensure people are processed in the closest region to them. | Really not expensive. You can do this with under $200/month | in terms of complexity, and the bandwidth + database costs are | going to be roughly the same as they normally are because | you're splitting your load between regions. Now if you | stupidly just duplicate your current infrastructure entirely, | yes it would be expensive because you'd be massively | overpaying on DB. | | In theory the only additional cost should be the latency- | based routing itself, which is $50/month. Other than that, | you'll probably save money if you choose the right regions. | Symbiote wrote: | So Roblox need a button to press to (re)deploy 18,000 | servers and 170,000 containers? They already have multiple | core data centres, as well as many edge locations. | | You will note the problem was with the software provided | and supported by HashiCorp. | e4e78a06 wrote: | Correctly handling failure edge cases in an active-active | multi-region distributed database requires work. SaaS DBs | do a lot of the heavy lifting, but they are still highly | configurable and you need to understand the impact of the | config you use. Not to mention your scale-up runbooks need | to be established so a stampede from a failure in one | region doesn't cause the other region to go down. You also | need to avoid cross-region traffic even though you might | have stateful services that aren't replicated across | regions. That might mean changes in config or business | logic across all your services. | | It is absolutely not as simple as spinning up a cluster on | AWS at Roblox's scale. | Twirrim wrote: | Roblox is not a startup, and has a significantly sized | footprint (18,000 servers isn't something that's just | available, even within clouds; they're not magically | scalable places, and capacity tends to land just ahead of | demand). It's not even remotely a simple case of "run | a script and whee, we have redundancy". There are _lots_ of | things to consider. | | 18k servers is also not cheap, at all. They suggest at | least some of their clusters are running on 64 cores, some | on 128. I'm guessing they probably have a fair spread of | cores. | | Just to give a sense of cost, AWS's calculator estimates | 18,000 32-core instances would set you back $9m per | month. That's just the EC2 cost, and that's assuming a lower | core count is used by other components in the platform. 64 | cores would bump that to $18m. Per month. Doing nothing but | sitting there, ready and waiting. That's not considering | network bandwidth costs, load balancers etc. etc. | | When you're talking infrastructure on that scale, you have | to contact cloud companies in advance, and work with them | around capacity requirements, or you'll find you're barely | started on provisioning and you won't find capacity | available (you'll want to on that scale _anyway_ because | you'll get discounts, but it's still going to be very | expensive) | bradly wrote: | Are the same services available in all regions? | | Are the same instance sizes available in all regions? | | Are there enough instances of the sizes you need? | | Do you have reserved instances in the other region? | | Are your increased quotas applied to all regions?
| | What region are your S3 assets in? Are you going to migrate | those as well? | | Is it acceptable for all user sessions to be terminated? | | Have you load tested the other region? | | How often are you going to test the region fail over? | Yearly? Quarterly? With every code change? | | What is the acceptable RTO and RPO with executives and | board-members? | | And all of that is without thinking about cache warming, | database migration/mirror/replication, Solr indexing (are | you going to migrate the index or rebuild? Do you know how | long it takes to rebuild your Solr index?). | | The startups you worked at probably had different needs than | Roblox. I was the tech lead on a Rails app that was | embedded in TurboTax and QuickBooks and was rendered on | each TT screen transition, and reading your comment in that | context shows a lot of inexperience with large production | systems. | [deleted] | bradly wrote: | Yes. If you are running in two zones in the hope that you | will be up if one goes down, you need to be handling less | than 50% load in each zone. If you can scale up fast enough | for your use case, great. But when a zone goes down and | everyone is trying to launch in the zone still up, there may | not be instances available for you at that time. Our site had | a billion in revenue or something based on a single day, so | for us it was worth the cost, but it's not easy (or at least | it wasn't at the time). | outworlder wrote: | > It's also a lot more expensive. Probably order of magnitude | more expensive than the cost of a 1 day outage | | Not sure I agree. Yes, network costs are higher, but your | overall costs may not be, depending on how you architect. | Independent services across AZs? Sure. You'll have multiples | of your current costs. Deploying your clusters spanning AZs? | Not that much - you'll pay for AZ traffic though. | adrr wrote: | It is when you run your own data centers and have to shell | out large capital outlays to spin up a new datacenter. | Symbiote wrote: | The usual way this works (and I assume this is the case | for Roblox) is not by constructing buildings, but by | renting space in someone else's datacentre. | | Pretty much every city worldwide has at least one place | providing power, cooling, racks and (optionally) network. | You rent space for one or more servers, or you rent | racks, or parts of a floor, or whole floors. You buy your | own servers, and either install them yourself, or pay the | datacentre staff to install them. | Hamuko wrote: | How expensive? Remember that the Roblox Corporation does | about a billion dollars in revenue per year and takes about | 50% of all revenue developers generate on their platform. | dev_by_day wrote: | Right, outages get more expensive the larger you grow. What | else needs to be thought of is not just the loss of revenue | for the time your service is down, but also its effect on | user trust and usability. Customers will gladly leave you | for a more reliable competitor once they get fed up. | johnmarcus wrote: | Multi-AZ is free at Amazon. Having things split amongst 3 | AZs costs no more than having them in a single AZ. | | Multi-Region is a different story. | otterley wrote: | There are definitely cost and other considerations you have | to think about when going multi-AZ. | | Cross-AZ network traffic has charges associated with it. | Inter-AZ network latency is higher than intra-AZ latency.
| And there are other limitations as well, such as EBS | volumes being attachable only to an instance in the same AZ | as the volume. | | That said, AWS does recommend using multiple Availability | Zones to improve overall availability and reduce Mean Time | to Recovery (MTTR). | | (I work for AWS. Opinions are my own and not necessarily | those of my employer.) | znep wrote: | This is very true; the costs and performance impacts can | be significant if your architecture isn't designed to | account for it. And sometimes even if it is. | | In addition, unless you can cleanly survive an AZ going | down (which can take a bunch more work in some cases), | being multi-AZ can actually reduce your availability by | giving you more things that can fail. | | AZs are a powerful tool, but they are not a no-brainer for | applications at scale that are not designed for them; it | is literally spreading your workload across multiple | nearby data centers, with a bit (or a lot) more tooling | and services to help than if you were doing it in your | own data centers. | suifbwish wrote: | Having bare metal may not be less stressful, but AWS is by no | means cheap. | johnmarcus wrote: | AWS is not cheap, but splitting amongst AZs is of no | additional cost. | orangepurple wrote: | False. | | Data Transfer within the same AWS Region: Data transferred | "in" to and "out" from Amazon EC2, Amazon RDS, Amazon | Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon | ElastiCache instances, Elastic Network Interfaces or VPC | Peering connections across Availability Zones in the same | AWS Region is charged at $0.01/GB in each direction. | | https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_... | vorpalhex wrote: | I'm more impressed that it hasn't been an issue until now. | bob1029 wrote: | > Having multiple fully independent zones seems more reliable | | This also introduces new modes of failure which did not exist | before. There are no silver bullets for this problem. | rhizome wrote: | There are no silver bullets to _any_ problem, but there are | other ways of implementing services and architecture that can | sidestep these things. | maxclark wrote: | Not surprised at all. Multi-AZ is a PITA. You'd be surprised how | much 7-figure+/month infra is single region/AZ. | hedwall wrote: | A guess would be that game servers are distributed across the | globe but backend services are in one place. A common pattern | in game companies. ___________________________________________________________________ (page generated 2022-01-20 23:00 UTC)