[HN Gopher] Roblox October Outage Postmortem
       ___________________________________________________________________
        
       Roblox October Outage Postmortem
        
       Author : kbuck
       Score  : 235 points
       Date   : 2022-01-20 20:01 UTC (2 hours ago)
        
 (HTM) web link (blog.roblox.com)
 (TXT) w3m dump (blog.roblox.com)
        
       | ryanworl wrote:
       | It seems that Consul does not have the ability to use the newer
       | hashmap implementation of freelist that Alibaba implemented for
       | etcd. I cannot find any reference to setting this option in
       | Consul's configuration.
       | 
       | Unfortunate, given it has been around for a while.
       | 
       | https://www.alibabacloud.com/blog/594750
        
         | throwdbaaway wrote:
         | I think they just made the switch to the fork that does contain
         | the freelist improvement in
         | https://github.com/hashicorp/consul/pull/11720
         | 
         | Took a major incident to swallow your pride? (consul, powered
         | by go.etcd.io/bbolt)
        
           | ryanworl wrote:
            | Is this option enabled by default? I don't think it is and I
           | don't think they actually set it manually anywhere.
           | 
           | EDIT: I think we're talking about two different options. I
           | meant the ability to leave sync turned on but change the data
           | structure.
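            | 
            | For context, a minimal sketch of the two knobs at the bbolt
            | layer (assuming the go.etcd.io/bbolt Options API; whether
            | Consul actually exposes either one is exactly the question
            | here):
            | 
            |     package main
            |     
            |     import (
            |         "log"
            |     
            |         bolt "go.etcd.io/bbolt"
            |     )
            |     
            |     func main() {
            |         // FreelistMapType changes only the freelist data
            |         // structure (the hashmap improvement from the Alibaba
            |         // post) while fsync stays on. NoFreelistSync is the
            |         // separate option: skip persisting the freelist
            |         // entirely and rebuild it on startup.
            |         db, err := bolt.Open("data.db", 0600, &bolt.Options{
            |             FreelistType:   bolt.FreelistMapType,
            |             NoFreelistSync: false,
            |         })
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         defer db.Close()
            |     }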
        
       | ctvo wrote:
       | It's a spicy read. Really could have happened to anyone. All very
       | reasonable assumptions and steps taken. You could argue they
        | could have more thoroughly load tested Consul, but it's doubtful
       | of us would have done more due diligence than they did with the
       | slow rollout of streaming support.
       | 
       | (Ignoring the points around observability dependencies on the
       | system that went down causing the failure to be extended)
        
         | yashap wrote:
         | The main mistake IMO is that, the day before the outage, they
         | made a significant Consul-related infra change. Then they have
         | this massive outage, where Consul is clearly the root cause,
         | but nobody ever tries rolling that recent change back? That's
         | weird.
         | 
         | I went into more detail here:
         | https://news.ycombinator.com/item?id=30015826
         | 
         | The outage occurring could certainly happen to anyone, but it
         | taking 72 hours to resolve seems like a pretty fundamental SRE
         | mistake. It's also strange that "try rollbacks of changes
         | related to the affected system" isn't even acknowledged as a
         | learning in their learnings/action items section.
        
       | statguy wrote:
       | So the outage lasted 3 days and the postmortem took 3 months!
        
         | koshergweilo wrote:
         | Read the article " It has been 2.5 months since the outage.
         | What have we been up to? We used this time to learn as much as
         | we could from the outage, to adjust engineering priorities
         | based on what we learned, and to aggressively harden our
         | systems. One of our Roblox values is Respect The Community, and
         | while we could have issued a post sooner to explain what
         | happened, we felt we owed it to you, our community, to make
         | significant progress on improving the reliability of our
         | systems before publishing."
         | 
         | They wanted to make sure everything was fixed before publishing
        
         | Operyl wrote:
         | They just got out of their busiest time of year, and taking the
          | time to write an accurate post mortem with data gleaned
         | afterwards seems sensible to me.
        
         | encryptluks2 wrote:
        
       | willcipriano wrote:
        | I have this little idea I think about called the "status update
        | chain". When I worked in small organizations and we had issues,
        | the status update chain looked like this: ceo-->me. As the
        | organizations got larger the chain got longer: first it was
        | ceo-->manager-->me, then ceo-->director-->manager-->me, and so on.
        | I wonder how long the status update chains are at companies like
        | this? How long does a status update take to make it end to end?
        
         | tacLog wrote:
         | I am sorry, I didn't have enough context to understand what
          | you're saying.
         | 
         | When you say: status update chain: ceo --> me. What information
         | is flowing from the CEO to you? or is it the other way around?
        
           | willcipriano wrote:
           | Both directions, he is asking "What is going on" and I am
           | telling him.
        
       | wizwit999 wrote:
       | > On October 27th at 14:00, one day before the outage, we enabled
       | this feature on a backend service that is responsible for traffic
       | routing. As part of this rollout, in order to prepare for the
       | increased traffic we typically see at the end of the year, we
       | also increased the number of nodes supporting traffic routing by
       | 50%.
       | 
       | Seems like the smoking gun, this should have been identified and
       | rolled back much earlier.
        
       | conorh wrote:
       | Excellent write up. Reading a thorough, detailed and open
       | postmortem like this makes me respect the company. They may have
       | issues but it sounds like the type of company that (hopefully)
       | does not blame, has open processes, and looks to improve - the
       | type of company I'd want to work for!
        
         | sam0x17 wrote:
         | Too bad they exploit young game developers by taking a 75.5%
         | cut of their earnings. Big yikes of a red flag for me.
         | https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp...
        
           | badcc wrote:
           | This % includes cost of all game server hosting, databases,
           | memory stores, etc. even with millions of concurrents, app
           | store fees, etc. All included in that number. Developer gets
           | effectively pure profit for the sole cost of
           | programming/designing a great game. Taught me how to program,
           | & changed my entire future. Disclosure: My game is one of
           | most popular on the platform.
        
             | ygjb wrote:
             | And that's a reasonable decision for an adult to make, and
             | if they were targeting an adult developer community.
             | 
             | I don't think anyone objects to adults making that choice
             | over say, using Unity or Unreal, and targeting other
             | platforms.
             | 
             | In practice, explaining to my son who is growing into an
             | avid developer why I won't a) help him build on Roblox, or
             | b) fund his objectives of advertising and promoting his
             | work in Roblox (by spending Roblox company scrip) on the
             | platform has necessitated helping him to learn and
             | understand what exploitation means and how to recognize it.
             | 
             | It's a learning experience for him, and a challenging issue
             | for me as a technically proficient and financially literate
              | parent who actually owns and runs businesses related to
              | intellectual property. It's got to be much more painful for
              | parents who lack any of those three areas.
        
               | RussianCow wrote:
               | Are you really suggesting that Roblox's cut should be
               | lower purely because the target market is children? Why?
               | If anything, the fact that a kid can code a game in a
               | high-level environment and immediately start making money
               | --without any of the normal challenges of setting up
               | infrastructure, let alone marketing and discovery--is
               | _amazing_ , and a feat for which Roblox should definitely
               | be rewarded.
               | 
               | In any case, what's the alternative? To teach your son
               | how to build the game from scratch in Unity, spin up a
               | server infrastructure that won't crumble with more than a
               | few concurrent players (not to mention the cash required
               | for this), figure out distribution, and then actually get
               | people to find and play the game? That seems quite
               | unreasonable for most children/parents.
               | 
               | If this were easy, a competitor would have come in and
               | offered the same service with significantly lower fees.
        
               | adgjlsfhk1 wrote:
                | The problem is that Roblox essentially lies to kids (by
               | omission) in an attempt to get free labor out of them.
        
               | RussianCow wrote:
               | Yes, I agree that the deception is a problem, although I
               | admit I'm not well versed in the issue. (I'm watching the
               | documentary linked elsewhere now.) But the original claim
               | was that they were exploiting young developers by taking
               | a big cut of revenues, which I disagree with.
        
               | noobhacker wrote:
               | Does your son have other alternatives to learn
               | programming and make money other than Roblox?
               | 
               | If there are, then it's a great lesson about looking
               | outside of one's immediate circumstance and striving
               | towards something better.
        
               | lolinder wrote:
               | > And that's a reasonable decision for an adult to make,
               | and if they were targeting an adult developer community.
               | 
               | If it's a reasonable decision for an adult to make
               | because the trade-offs might be worth it, doesn't that
               | mean that it would also be reasonable for a child to make
               | the same decision for the same reason?
               | 
               | It's either exploitative or it isn't, the age of the
               | developer doesn't alter the trade-offs involved.
        
               | JauntyHatAngle wrote:
               | No, because a child is not deemed to have the necessary
               | faculties to make these decisions.
               | 
               | The question should not be posed to a child, that is the
               | law for child labour, and why we do not have children
               | gambling on roulette wheels.
        
               | [deleted]
        
           | DerArzt wrote:
           | To add, there is a nice documentary here[1] which also has a
            | followup[2] that shows even more of the issue at hand. Kids
           | making games and only getting 24.5% of the profit is one
           | thing, but everything else that Roblox does is much worse.
           | 
           | [1] https://youtu.be/_gXlauRB1EQ
           | 
           | [2] https://youtu.be/vTMF6xEiAaY
        
             | Qualadore wrote:
             | The 24.5% cut is fine, you have to consider the 30% app
             | store fees for a majority mobile playerbase, all hosting is
             | free, moderation is a major expense, and engine and
             | platform development.
             | 
             | Successful games subsidize everyone else, which is not
             | comparable to Steam or anything else.
             | 
             | Collectible items are fine and can't be exchanged for USD,
             | Roblox can't arbitrate developer disputes, "black markets"
             | are an extremely tiny niche. A lot of misinformation.
             | 
             | It's annoying to see these videos brought up every single
             | time Roblox is mentioned anywhere for these reasons. Part
             | of the blame lies with Roblox for one of the worst PR
             | responses I have seen in tech, I suppose.
        
               | brimble wrote:
               | > The 24.5% cut is fine, you have to consider the 30% app
               | store fees for a majority mobile playerbase, all hosting
               | is free, moderation is a major expense, and engine and
               | platform development.
               | 
               | You have successfully made the case for a 45% fee and
               | being considered approximately normal, or a 60% fee and
               | being considered pretty high still. 75+% is crazy.
        
               | Qualadore wrote:
               | I can't think of any other platform with comparable
               | expenses. Traditional game engines have the R&D
               | component, but not moderation, developer services, or
               | subsidizing games that don't succeed.
               | 
               | It helps that seriously launching a Roblox game costs <
               | $1k USD always, usually < $200 USD. It's not easy to
               | generate a loss, even when including production costs.
               | That's the tradeoff.
        
           | [deleted]
        
           | nostrebored wrote:
           | The idea that these children would otherwise be making their
           | own games is knowingly, generally wrong.
        
           | munk-a wrote:
           | No matter what the cut is I think there are some legitimate
            | social questions to ask about whether we want young people to be
           | potentially exposed to economic pressure to earn or whether
           | we'd rather push back aggressively against youth monetization
           | to preserve a childhood where, ideally, children get to play.
           | 
           | I know there are lots of child actors and plenty of household
           | situations that make enjoying childhood difficult for many
           | youths - but just because we're already bad at a thing
           | doesn't mean we should let it get worse. Child labour laws
           | were some of the first steps of regulation in the industrial
           | revolution because inflation works in such a way where
           | opening the door up to child labour can put significant
           | financial pressure on families that choose not to participate
           | when demand adjusts to that participation being normal.
        
           | Aunche wrote:
           | By that logic, Dreams is "exploiting" developers by taking a
           | 100% cut of their earnings. Making money isn't the point of
           | either of these platforms.
        
           | loceng wrote:
           | The solution is creating a competing platform and offering a
           | better cut. You up for the task?
           | 
           | Edit to add: lazy people downvote.
        
           | flippinburgers wrote:
           | I am naive about the reality on the ground when it comes to
           | this issue, but doesn't this hinge on transparency? If they
           | can show they are covering costs + the going market rate,
           | which seems to be 30% (at best), then wouldn't it be
            | reasonable? So whether a 45% cut for infra is ok or not seems
            | to be the question.
        
           | perihelions wrote:
           | More egregiously, they're (per your article) manipulating
           | kids into _buying real ads_ for their creations, with the
           | false promise that  "you could get rich if you pay us".
           | 
           | > _" As there are no discoverability tools, users are only
           | able to see a tiny selection of the millions of experiences
           | available. One of the ways boost to discoverability is to pay
           | to advertise on the platform using its virtual currency,
           | Robux."_
           | 
           | (Note that "virtual" currency is real money, bidirectionally
           | exchangeable with USD).
           | 
           | The sales pitch is "get rich fast":
           | 
           | > _" Under the platform's 'Create' tab, it sells the idea
           | that users can "make anything", "reach millions of players",
           | and "earn serious cash", while its official tutorials and
           | support website both "assume" they are looking for help with
           | monetisation."_
           | 
           | I agree that this doesn't really look like a labor issue.
            | That's a distracting and contentious tangent; it's easier to
            | just label it a type of _consumer_ exploitation. (Most of the
            | people aren't earning money -- but they are all _paying
            | money_). It's a scam either way.
        
           | tptacek wrote:
           | Again, as across-thread: this is a tangent unrelated to the
           | actual story, which is interesting for reasons having nothing
           | at all to do with Roblox (I'll never use it, but operating
           | HashiStack at this scale is intensely relevant to me). We
           | need to be careful with tangents like this, because they're
           | easier for people to comment on than the vagaries of Raft and
           | Go synchronization primitives, and --- as you can see here
           | --- they quickly drown out the on-topic comments.
        
           | breakfastduck wrote:
           | Or how about giving a free platform to get into games
           | development for young people that otherwise wouldn't have
           | become interested.
        
         | digitalengineer wrote:
        
           | badcc wrote:
           | As one of the top developers on the platform (& 22 y/o,
           | taught myself how to program through Roblox, ~13 years ago),
           | I can say that it seems a majority of us in the developer
           | community are quite unhappy with the image this video
           | portrays. We love Roblox.
        
             | dan_pixelflow wrote:
             | That's kind of on Roblox then for not answering their
             | questions transparently.
        
             | duxup wrote:
             | My son loves it, I think it is a great way to learn.
        
             | empressplay wrote:
             | I think what bothers me the most is the effective 'pay to
             | play' aspect
        
             | [deleted]
        
           | tptacek wrote:
           | This is an interesting debate to have somewhere, but it has
           | _nothing to do with this thread_. We need to be careful about
            | tangents like this, because it's a lot easier to have an
            | opinion about the cut a platform should take from UGC than it
            | is to have an opinion about Raft, channel-based concurrency, and
           | on-disk freelists. If we're not careful, we basically can't
           | have the technical threads, because they're crowded out by
           | the "easier" debate.
        
             | digitalengineer wrote:
             | True, it is off topic to the postmortem. However, the top
              | comment talks about wanting to work there. I get it is very
             | relevant to see a bigger picture. Personally, I could never
             | work for them. I have a kid and the services and culture
             | they created around their product is sickening and should
             | be made illegal.
        
             | nightpool wrote:
             | While I personally think digitalengineer's comment was low-
             | effort and knee-jerk, I think this general thread of
             | discussion is on topic for the comment replied to, which
             | was specifically about how the postmortem increased the
             | commenter's respect for Roblox as a company and made them
             | want to work there. I think an acceptable compromise
             | between "ethical considerations drown out any technical
             | discussion" and "any non-technical discussion gets
             | downvoted/flagged to oblivion" would be to quarantine
             | comments about the ethics of Roblox's business model to a
             | single thread of discussion, and this one seems as good as
             | any.
        
               | pvg wrote:
               | The guidelines and zillions of moderation comments are
               | pretty explicit that doesn't count as 'on topic'. You can
               | always hang some rage-subthread off the unrelated
               | perfidy, real or perceived, of some entity or another.
               | This one is extra tenuous and forced given that 'the type
               | of company I'd want to work for' is a generic expression
               | of approval/admiration.
        
             | BolexNOLA wrote:
             | You've pretty much articulated for me why I've been
             | commenting on Reddit less and less frequently.
        
               | duxup wrote:
               | I loathe the constant riffing on <related and yet nothing
               | indicates it is actually related/> topics.
               | 
               | Sadly it is happening here on HN too, < insert the next
               | blurb about corporatism/>
        
               | BolexNOLA wrote:
               | Guess we need to find the next space lol
        
           | micromacrofoot wrote:
           | Yeah as long as Roblox is exploiting children they're just
           | flat-out not respectable. This video is a good look at a
           | phenomenon most people are unaware of.
        
             | [deleted]
        
             | charcircuit wrote:
             | Players of your game creating content for it is not
             | exploitation. It's just how it works in the gaming world.
             | When I was a kid I spent time creating a minecraft mod that
             | hundreds of people used. Did Mojang or anyone else ever pay
             | me? No. I did it because I wanted to.
        
               | jawngee wrote:
               | Mojang was likely not selling you on making a mod with
               | promises of making money though. Roblox did that, maybe
               | they still do it.
        
               | digitalengineer wrote:
               | Please review the video. The problem is not 'players
               | creating content'.
        
               | [deleted]
        
               | micromacrofoot wrote:
               | The way they're paying kids and what they're telling them
               | is a big part of the problem... they're pushing a lot of
               | the problematic game development industry onto kids that
               | are sometimes as young as 10.
               | 
               | If this was free content creation when kids want to do
               | it, then it would be an entirely different story.
        
         | ehsankia wrote:
         | > the type of company I'd want to work for!
         | 
         | I recommend watching the following:
         | 
         | https://www.youtube.com/watch?v=_gXlauRB1EQ
         | 
         | https://www.youtube.com/watch?v=vTMF6xEiAaY
        
       | ineedasername wrote:
       | ">circular dependencies in our observability stack"
       | 
       | This appears to be why the outage was extended, and was
       | referenced elsewhere too. It's hard to diagnose something when
       | part of the diagnostic tool kit is also malfunctioning.
        
       | AaronFriel wrote:
        | This outage has it all: distributed systems, non-uniform memory
       | access contention (aka "you wanted scale up? how about instead we
       | make your CPU a distributed system that you have to reason
       | about?"), a defect in a log-structured merge tree based data
       | store, malfunctioning heartbeats affecting scheduling, wow wow
       | wow.
       | 
       | Big props to the on-calls during this.
        
         | tacLog wrote:
         | > Big props to the on-calls during this.
         | 
         | Kind of curious about this. I know this is probably company
         | specific but how do outages get handled at large orgs? Would
         | the on-calls have been called in first then called in the rest
         | of the relevant team?
         | 
          | Is there a leadership structure that takes command of the
         | incident to make big coordinated decisions to manage the risk
         | of different approaches?
         | 
         | Would this have represented crunch time to all the relevant
         | people or would this be a core team with other people helping
         | as needed?
        
           | WaxProlix wrote:
           | Oncalls get paged first and then escalate. As they assess
           | impact to other teams and orgs, they usually post their
           | tickets to a shared space. Once multiple team/org impact is
           | determined, leadership and relevant ops groups (networking,
           | eg) get pulled in to a call. A single ticket gets designated
           | the Master Ticket for the Event, and oncalls dump diagnostic
           | info there. Root cause is found (hopefully), affected teams
           | work to mitigate while RC team rushes to fix.
           | 
           | The largest of these calls I've seen was well into the
           | hundreds of sw engineers, managers, network engineers, etc.
        
           | yazaddaruvala wrote:
           | Typically:
           | 
           | Yes. This was a multi-day outage and eventually the oncall
           | does need sleep, so you need more of the team to help with
           | it. Typically, at any reasonable team, everyone that chipped
            | in nights gets to take off equivalent days and sprint tasks
           | are all punted.
           | 
           | Yes. Not just to manage risks, but also to get quick
           | prioritization from all teams at the company. "You need
           | legal? Ok, meet ..." "You need string translations? Ok
           | escalated to ..." "You need financial approval? Ok, looped in
           | ..."
           | 
           | Kinda. Definitely would have represented crunch time, but a
           | very very demoralizing crunch time. Managers also try to
           | insulate most of their teams from it, but everyone pays
           | attention anyways. There is no "core team" other than the
           | leadership structure from your question 2. Otherwise, it is
           | very much "people/teams helping as needed".
        
           | quirino wrote:
            | Google has its Site Reliability Engineering book, which might
            | answer some of your questions.
           | 
           | https://sre.google/sre-book/table-of-contents/
        
       | sjtindell wrote:
        | Super interesting. This is a case where per-host ipvs or eBPF
        | rules for service discovery seem much more resilient than this
        | heavy reliance on a functional Consul service. The team shared a
       | great postmortem here. I know the feeling well of testing
       | something like a full redeploy and seeing no improvement...easy
       | to lose hope at that point. 70+ hours of a full outage, multiple
       | failed attempts to restore, has got to add years to your life in
       | stress. Well done to all involved.
        
       | johnmarcus wrote:
       | aaaalllllllll the way down at the bottom is this gem: >Some core
       | Roblox services are using Consul's KV store directly as a
       | convenient place to store data, even though we have other storage
       | systems that are likely more appropriate.
       | 
       | Yeah, don't use consul as redis, they are not the same.
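        | 
        | Part of why this happens: the KV API is right there on the same
        | client every service already has. A minimal sketch (assuming the
        | github.com/hashicorp/consul/api Go client) of how little it takes
        | to start treating Consul as a general-purpose data store:
        | 
        |     package main
        |     
        |     import (
        |         "fmt"
        |         "log"
        |     
        |         "github.com/hashicorp/consul/api"
        |     )
        |     
        |     func main() {
        |         client, err := api.NewClient(api.DefaultConfig())
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         kv := client.KV()
        |     
        |         // Tempting: it looks like a tiny key-value database...
        |         _, err = kv.Put(&api.KVPair{
        |             Key:   "myapp/cache/user:42",
        |             Value: []byte("not-actually-a-cache"),
        |         }, nil)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |     
        |         // ...but every write goes through the same Raft-backed
        |         // servers that service discovery and health checks
        |         // depend on.
        |         pair, _, err := kv.Get("myapp/cache/user:42", nil)
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         fmt.Println(string(pair.Value))
        |     }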
        
         | stuff4ben wrote:
         | But you can... which is what some engineers were thinking. In
         | my experience they do this because:
         | 
         | A) they're afraid to ask for permission and would rather ask
         | for forgiveness
         | 
         | B) management refused to provision extra infra to support the
          | engineers' needs, but they needed to do this "one thing" anyway
         | 
         | C) security was lax and permissions were wide open so people
         | just decided to take advantage of it to test a thing that then
         | became a feature and so they kept it but "put it on the
         | backlog" to refactor to something better later
        
       | stuff4ben wrote:
       | Sounds like they need to switch to Kubernetes?
       | 
       | I kid of course. One of the best post-mortems I've seen. I'm sure
       | there are K8s horror stories out there of etcd giving up the
       | ghost in a similar fashion.
        
         | spydum wrote:
         | you joke, but it's precisely this:
         | 
         | >Critical monitoring systems that would have provided better
         | visibility into the cause of the outage relied on affected
         | systems, such as Consul. This combination severely hampered the
         | triage process.
         | 
          | which gives me goosebumps whenever I hear people proselytizing
          | running everything on Kubernetes. At some point, it makes good
         | sense to keep capabilities isolated from each other, especially
         | when those functions are key to keeping the lights on. Mapping
         | out system dependencies (either systems, software components,
         | etc) is really the soft underbelly of most tech stacks.
        
         | YATA0 wrote:
         | >Sounds like they need to switch to Kubernetes?
         | 
         | Hah! Good one!
        
         | schoolornot wrote:
          | The one thing you can say about Nomad is that it's generally
         | incredibly scalable compared to Kubernetes. At 1000+ nodes over
         | multiple datacenters, things in Kube seem to break down.
        
           | tapoxi wrote:
           | Do they still? GKE supports 15,000 nodes per cluster.
        
       | samkone wrote:
       | Mayhem. Hipsters
        
       | chainwax wrote:
       | Love the "Note on Public Cloud", and their stance on owning and
       | operating their own hardware in general. I know there has to be
       | people thinking this could all be avoided/the blame could be
       | passed if they used a public cloud solution. Directly addressing
       | that and doubling down on your philosophies is a badass move,
       | especially after a situation like this.
        
       | regnull wrote:
       | It's weird it took them so long to disable streaming. One of the
       | first things you do in this case is roll back the last software
       | and config updates, even innocent looking ones.
        
         | yashap wrote:
         | That's what stood out to me too. Although they'd been slowly
          | rolling it out for a while, their last major rollout was quite
         | close to the outage start:
         | 
         | > Several months ago, we enabled a new Consul streaming feature
         | on a subset of our services. This feature, designed to lower
         | the CPU usage and network bandwidth of the Consul cluster,
         | worked as expected, so over the next few months we
         | incrementally enabled the feature on more of our backend
         | services. On October 27th at 14:00, one day before the outage,
         | we enabled this feature on a backend service that is
         | responsible for traffic routing. As part of this rollout, in
         | order to prepare for the increased traffic we typically see at
         | the end of the year, we also increased the number of nodes
         | supporting traffic routing by 50%
         | 
         | Consul was clearly the culprit early on, and you just made a
         | significant Consul-related infrastructure change, you'd think
         | rolling that back would be one of the first things you'd try.
         | One of the absolute first steps in any outage is "is there any
         | recent change we could possibly see causing this? If so, try
         | rolling it back."
         | 
         | They've obviously got a lot of strong engineers there, and it's
         | easy to critique from the outside, but this certainly struck me
         | as odd. Sounds like they never even tried "let's try rolling
         | back Consul-related changes", it was more that, 50+ hours into
         | a full outage, they'd done some deep profiling, and discovered
          | the streaming issue. But IMO root cause analysis is for later,
         | "resolve ASAP" is the first response, and that often involves
         | rollbacks.
         | 
         | I wonder if this actually hindered their response:
         | 
         | > Roblox Engineering and technical staff from HashiCorp
         | combined efforts to return Roblox to service. We want to
         | acknowledge the HashiCorp team, who brought on board incredible
         | resources and worked with us tirelessly until the issues were
         | resolved.
         | 
         | i.e. earlier on, were there HashiCorp peeps saying "naw, we
         | tested streaming very thoroughly, can't be that"?
        
           | notacoward wrote:
           | In a not-too-distant alternate universe, they made the rookie
           | assumption that every change to every system is trivially
           | reversible, only to find that it's not always true
           | (especially for storage or storage-adjacent systems), and
           | ended up making things worse. Naturally, people in alternate-
           | universe HN bashed them for that too.
        
           | otterley wrote:
           | When you're at Roblox's scale, it is often difficult to know
           | in advance whether you will have a lower MTTR by rolling back
           | or fixing forward. If it takes you longer to resolve a
           | problem by rolling back a significant change than by tweaking
           | a configuration file, then rolling back is not the best
           | action to take.
           | 
           | Also, multiple changes may have confounded the analysis.
           | Adjusting the Consul configuration may have been one of many
           | changes that happened in the recent past, and certainly
           | changes in client load could have been a possible culprit.
        
             | yashap wrote:
             | Some changes are extremely hard to rollback, but this
             | doesn't sound like one of them. From their report, sounds
             | like the rollback process involved simply making a config
             | change to disable the streaming feature, it took a bit to
              | roll out to all nodes, and then Consul performance almost
             | immediately returned to normal.
             | 
             | Blind rollbacks are one thing, but they identified Consul
             | as the issue early on, and clearly made a significant
             | Consul config change shortly before the outage started,
             | that was also clearly quite reversible. Not even trying to
             | roll that back is quite strange to me - that's gotta be
             | something you try within the first hour of the outage,
             | nevermind the first 50 hours.
        
               | [deleted]
        
         | Twirrim wrote:
         | The post indicates they'd been rolling it out for months, and
          | indicates the feature went live "several months ago".
         | 
         | With the behaviour matching other types of degradation
         | (hardware), it's entirely reasonable that it could have taken
          | quite a while to recognise that software and configurations
          | that had proven stable for several months, and were still
          | there working, weren't quite so stable as they seemed.
        
           | nightpool wrote:
           | Right, but it only went live on the DB that failed the day
           | before. Obviously, hindsight is 20/20, but it's strange that
           | the oversight didn't rate a mention in the postmortem.
        
       | Twirrim wrote:
       | "We enjoyed seeing some of our most dedicated players figure out
       | our DNS steering scheme and start exchanging this information on
       | Twitter so that they could get "early" access as we brought the
       | service back up."
       | 
       | Why do I have a feeling "enjoyed" wasn't really enjoyed so much
       | as "WTF", followed by "oh shit..." at the thought that their main
       | way to balance load may have gone out the window.
        
         | Symbiote wrote:
         | It's difficult to know how quickly word could have spread, but
         | I enjoy knowing a few 11 year olds learned something about the
         | Internet in order to play a game an hour early.
        
       | jandrese wrote:
       | The BoltDB issue seems like straight up bad design. Needing a
       | freelist is fine, needing to sync the entire freelist to disk
       | after every append is pants on head.
        
         | benbjohnson wrote:
         | BoltDB author here. Yes, it is a bad design. The project was
         | never intended to go to production but rather it was a port of
         | LMDB so I could understand the internals. I simplified the
         | freelist handling since it was a toy project. At Shopify, we
         | had some serious issues at the time (~2014) with either LMDB or
         | the Go driver that we couldn't resolve after several months so
         | we swapped out for Bolt. And alas, my poor design stuck around.
         | 
         | LMDB uses a regular bucket for the freelist whereas Bolt simply
         | saved the list as an array. It simplified the logic quite a bit
         | and generally didn't cause a problem for most use cases. It
         | only became an issue when someone wrote a ton of data and then
         | deleted it and never used it again. Roblox reported having 4GB
         | of free pages which translated into a giant array of 4-byte
         | page numbers.
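          | 
          | A rough back-of-envelope sketch of why that hurts (assuming
          | 4 KiB pages and 8-byte page IDs; halve the result for 4-byte
          | IDs and the order of magnitude is the same):
          | 
          |     package main
          |     
          |     import "fmt"
          |     
          |     func main() {
          |         const (
          |             pageSize    = 4 * 1024 // assumed page size
          |             freeBytes   = 4 << 30  // ~4GB of free pages, as reported
          |             bytesPerPID = 8        // page ID size assumed
          |         )
          |         freePages := freeBytes / pageSize       // ~1M free pages
          |         freelistSize := freePages * bytesPerPID // ~8 MiB
          |     
          |         // With the array-style freelist, roughly this much gets
          |         // re-serialized and fsynced on every commit, no matter
          |         // how small the transaction itself is.
          |         fmt.Printf("%d page IDs, ~%d MiB per commit\n",
          |             freePages, freelistSize>>20)
          |     }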
        
           | tacLog wrote:
           | > BoltDB author here.
           | 
           | How does this happen so often? It's awesome to get the
            | author's take on things. Also thank you for explaining and
            | owning it. Were you part of this incident response?
        
           | otterley wrote:
           | I, for one, appreciate you owning this. It takes humility and
           | strength of character to admit one's errors. And Heaven knows
           | we all make them, large and small.
        
       | kjw wrote:
        | I would not have guessed Roblox was on-prem with so little
        | redundancy. Later in the post, they address the obvious "why not
        | public cloud?" question. They argue that running their own
        | hardware gives them advantages in cost and performance. But those
        | seem irrelevant if usage and revenue go to zero when you can't
        | keep a service up. It will be interesting to see how well this
        | architectural decision ages if they keep scaling to their
       | ambitions. I wonder about their ability to recruit the level of
       | talent required to run a service at this scale.
        
         | dylan604 wrote:
         | >I wonder about their ability to recruit the level of talent
         | required to run a service at this scale.
         | 
         | According to this user's comments, it doesn't look like it'll
         | be that tough for them:
         | 
         | https://news.ycombinator.com/item?id=30014748
        
         | nomel wrote:
         | > But those seem irrelevant if usage and revenue go to zero
         | when you can't keep a service up
         | 
         | You're assuming the average profits lost are more than the
         | average cost of doing things differently, which, according to
         | their statement, is not the case.
        
         | noahtallen wrote:
         | I think the public cloud is a good choice for startups, teams,
         | and projects which don't have infrastructure experience. Plenty
         | of companies still have their own infrastructure expertise and
         | roll their own CDNs, as an example.
         | 
         | Not only can one save a significant amount of money, it can
         | also be simpler to troubleshoot and resolve issues when you
         | have a simpler backend tech stack. Perhaps that doesn't apply
         | in this case, but there are plenty of use cases which don't
         | need a hundred micro services on AWS, none of which anyone
         | fully understands.
        
         | otterley wrote:
         | Since the issue's root cause was a pathological database
         | software issue, Roblox would have suffered the same issue in
         | the public cloud. (I am assuming for this analysis that their
         | software stack would be identical.) Perhaps they would have
         | been better off with other distributed databases than Consul
         | (e.g., DynamoDB), but at their scale, that's not guaranteed,
         | either. Different choices present different potential
         | difficulties.
         | 
         | Playing "what-if" thought experiments is fun, but when the
         | rubber hits the road, you often find that things that are
         | stable for 99.99%+ of load patterns encounter previously
         | unforeseen problems once you get into that far-right-hand side
         | of the scale. And it's not like we've completely mastered
         | squeezing performance out of huge CPU core counts on NUMA
         | architectures while avoiding bottlenecking on critical sections
         | in software. This shit is hard, man.
        
           | baskethead wrote:
           | This is not true, if they handled the rollout properly.
           | Companies like Uber have two entirely different data centers
            | and during outages they fail over to either datacenter.
           | 
           | Everything is duplicated which is potentially wasteful but
           | ensures complete redundancy and it's an insurance policy. If
            | you roll out, you roll out to each datacenter separately. So in
           | this case rolling out in one complete datacenter and waiting
           | a day for their Consul streaming changes probably would have
           | caught it.
        
             | otterley wrote:
             | The Consul streaming changes were rolled out months before
             | the incident occurred.
        
             | Symbiote wrote:
             | > So in this case rolling out in one complete datacenter
             | and waiting a day for their Consul streaming changes
             | probably would have caught it.
             | 
             | But this has nothing to do with cloud vs. colo.
        
       | erwincoumans wrote:
       | >> We are working to move to multiple availability zones and data
       | centers.
       | 
       | Surprised it was a single availability zone, without redundancy.
       | Having multiple fully independent zones seems more reliable and
       | failsafe.
        
         | mbesto wrote:
         | There have been multiple discussions on HN about cloud vs not
          | cloud and there are endless amounts of opinions like "cloud is a
         | waste blah blah".
         | 
         | This is exactly one of the reasons people go cloud. Introducing
         | an additional AZ is a click of a button and some relatively
         | trivial infrastructure as code scripting, even at this scale.
         | 
         | Running your own data center and AZ on the other hand requires
         | a very tight relationship with your data center provider at
         | global scale.
         | 
         | For a platform like Roblox where downtime equals money loss
         | (i.e. every hour of the day people make purchases), then there
         | is a real tangible benefit to using something _like_ AWS. 72
          | hours downtime is A LOT, and we're talking potentially
         | millions of dollars of real value lost and millions of
         | potential in brand value lost. I'm not saying definitively they
         | would save money (in this case profit impact) by going to AWS,
         | but there is definitely a story to be had here.
        
           | treis wrote:
           | But it wasn't a hardware issue. It was a software one and
           | that would have crossed AZ boundaries.
        
             | mbesto wrote:
             | So then why does the post mortem suggest setting up multi-
             | az to address the problems they encountered?
        
               | treis wrote:
               | I took that to mean sharding Roblox instead of spanning
               | it across data center AZs.
        
         | abarringer wrote:
         | Was on a call with a bank VP that had moved to AWS. Asked how
         | it was going. Said it was going great after six months but just
         | learning about availability zones so they were going to have to
         | rework a bunch of things.
         | 
         | Astonishing how our important infrastructure is moved to AWS
         | with zero knowledge of how AWS works.
        
         | kreeben wrote:
         | >> Having multiple fully independent zones seems more reliable
         | 
         | I don't think these independent zones exist. See AWS's recent
         | outages, where east cripples west and vice versa.
        
           | count wrote:
           | That's not how they work. They exist, and work extremely well
           | within their defined engineering / design goals. It's much
           | more nuanced than 'everything works independently'.
        
             | kreeben wrote:
             | If the design goal of these zones is that they should be
             | independent of each other then, no, they do not work
             | extremely well.
        
           | Karrot_Kream wrote:
           | Availability Zones aren't the same thing as regions. AWS
           | regions have multiple Availability Zones. Independent
            | availability zones publish lower reliability SLAs, so you
           | need to load balance across multiple independent availability
           | zones in a region to reach higher reliability. Per AZ SLAs
           | are discussed in more detail here [1]
           | 
           | (N.B. I find HN commentary on AWS outages pretty depressing
           | because it becomes pretty obvious that folks don't understand
           | cloud networking concepts at all.)
           | 
           | [1]: https://aws.amazon.com/compute/sla/
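            | 
            | As a quick illustration of the intended math (a sketch only,
            | and it holds only if AZ failures really are independent,
            | which is exactly what's disputed downthread):
            | 
            |     package main
            |     
            |     import "fmt"
            |     
            |     func main() {
            |         perAZ := 0.995 // published per-instance SLA in one AZ
            |     
            |         // Probability at least one of two AZs is up, assuming
            |         // independent failures and a load balancer in front.
            |         twoAZs := 1 - (1-perAZ)*(1-perAZ) // 0.999975
            |     
            |         // ~43.8 hours/year of expected downtime becomes ~13
            |         // minutes, *if* the independence assumption holds.
            |         fmt.Printf("one AZ: %.2f%%  two AZs: %.4f%%\n",
            |             perAZ*100, twoAZs*100)
            |     }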
        
             | kreeben wrote:
             | >> you need to load balance across multiple independent
             | availability zones
             | 
             | The only problem with that is, there are no independent
             | availability zones.
             | 
             | What we do have, though, is an architecture where errors
             | propagate cross-zone until they can't propagate any
             | further, because services can't take any more requests,
             | because they froze, because they weren't designed for a
             | split brain scenario, and then, half the internet goes
             | down.
        
               | outworlder wrote:
               | > The only problem with that is, there are no independent
               | availability zones.
               | 
               | There are - they can be as independent as you need them
               | to be.
               | 
               | Errors won't necessarily propagate cross-zone. If they
               | do, someone either screwed up, or they made a trade-off.
               | Screwing up is easy, so you need to do chaos testing to
               | make sure your system will survive as intended.
        
               | kreeben wrote:
               | I'm not talking about my global app. I'm talking about
               | the system I deploy to, the actual plumbing, and how a
               | huge turd in a western toilet causes east's sewerage
                | system to overflow.
        
             | mlyle wrote:
             | > (N.B. I find HN commentary on AWS outages pretty
             | depressing because it becomes pretty obvious that folks
             | don't understand cloud networking concepts at all.)
             | 
             | What he said was perfectly cogent.
             | 
             | Outages in us-east-1 AZ us-east-1a have caused outages in
             | us-west-1a, which is a different region _and_ a different
             | AZ.
             | 
             | Or, to put it in the terms of reliability engineering: even
             | though these are abstracted as independent systems, in
             | reality there are common-mode failures that can cause
             | outages to propagate.
             | 
             | So, if you span multiple availability zones, you are not
             | spared from events that will impact all of them.
        
               | Karrot_Kream wrote:
               | > Or, to put it in the terms of reliability engineering:
               | even though these are abstracted as independent systems,
               | in reality there are common-mode failures that can cause
               | outages to propagate.
               | 
               | It's up to the _user_ of AWS to design around this level
               | of reliability. This isn't any different than not using
               | AWS. I can run my web business on the super cheap by
               | running it out of my house. Of course, then my site's
               | availability is based around the uptime of my residential
               | internet connection, my residential power, my own ability
               | to keep my server plugged into power, and general
               | reliability of my server's components. I can try to make
               | things more reliable by putting it into a DC, but if a
               | backhoe takes out the fiber to that DC, then the DC will
               | become unavailable.
               | 
               | It's up to the _user_ to architect their services to be
               | reliable. AWS isn't magic reliability sauce you sprinkle
               | on your web apps to make them stay up for longer. AWS
               | clearly states in their SLA pages what their EC2 instance
               | SLAs are in a given AZ; it's 99.5% availability for a
               | given EC2 instance in a given region and AZ. This is
               | roughly ~1.82 days, or ~ 43.8 hours, of downtime in a
               | year. If you add a SPOF around a single EC2 instance in a
               | given AZ then your system has a 99.5% availability SLA.
                | Remember the cloud is all about leveraging large amounts of
               | commodity hardware instead of leveraging large, high-
               | reliability mainframe style design. This isn't a secret.
               | It's openly called out, like in Nishtala et al's "Scaling
               | Memcache at Facebook" [1] from 2013!
               | 
               | The background of all of this is that it costs money, in
                | terms of knowledgeable engineers (not like the kinds in
               | this comment thread who are conflating availability zones
               | and regions) who understand these issues. Most companies
               | don't care; they're okay with being down for a couple
               | days a year. But if you want to design high reliability
               | architectures, there are plenty of senior engineers
               | willing to help, _if_ you're willing to pay their
               | salaries.
               | 
               | If you want to come up with a lower cognitive overhead
               | cloud solution for high reliability services that's
               | economical for companies, be my guest. I think we'd all
               | welcome innovation in this space.
               | 
               | [1]: https://www.usenix.org/system/files/conference/nsdi1
               | 3/nsdi13...
        
               | mlyle wrote:
               | Yes, but the underlying point you're willfully missing
               | is:
               | 
               | You can't engineer around AWS AZ common-mode failures
               | using AWS.
               | 
               | The moment that you have failures that are not
               | independent and common mode, you can't just multiply
               | together failure probabilities to know your outage times.
        
               | roughly wrote:
               | During a recent AWS outage, the STS service running in
               | us-east-1 was unavailable. Unfortunately, all of the
               | other _regions_ - not AZs, but _regions_, rely on the STS
               | service in us-east-1, which meant that customers which
               | had built around Amazon's published reliability model had
               | services in every region impacted by an outage in one
               | specific availability zone.
               | 
               | This is what kreeben was referring to - not some abstract
               | misconception about the difference between AZs and
               | Regions, but an actual real world incident in which a
               | failure in one AZ had an impact in other Regions.
        
               | Karrot_Kream wrote:
               | > Unfortunately, all of the other _regions_ - not AZs,
               | but _regions_, rely on the STS service in us-east-1,
               | which meant that customers which had built around
               | Amazon's published reliability model had services in
               | every region impacted by an outage in one specific
               | availability zone.
               | 
               | That's not true. STS offers regional endpoints, for
               | example if you're in Australia and don't want to pay the
               | latency cost to transit to us-east-1 [1]. It's up to the
               | user to opt into them though. And that goes back to what
               | I was saying earlier, you need engineers willing to read
               | their docs closely and architect systems properly.
               | 
               | [1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_
               | credenti...
        
               | otterley wrote:
               | It's more subtle than that.
               | 
               | For high availability, STS offers regional endpoints --
               | and AWS recommends using them[1] -- but the SDKs don't
               | use them by default. The author of the client code, or
               | the person configuring the software, has to enable them.
               | 
               | [1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_c
               | redenti...
               | 
               | (I work for AWS. Opinions are my own and not necessarily
               | those of my employer.)
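                | 
                | A sketch of what opting in can look like on the client
                | side (assuming the v1 aws-sdk-go; the same switch also
                | exists as the AWS_STS_REGIONAL_ENDPOINTS environment
                | variable and the sts_regional_endpoints shared-config
                | setting):
                | 
                |     package main
                |     
                |     import (
                |         "fmt"
                |         "log"
                |     
                |         "github.com/aws/aws-sdk-go/aws"
                |         "github.com/aws/aws-sdk-go/aws/endpoints"
                |         "github.com/aws/aws-sdk-go/aws/session"
                |         "github.com/aws/aws-sdk-go/service/sts"
                |     )
                |     
                |     func main() {
                |         // Opt in to the regional STS endpoint instead of
                |         // the default (legacy) global one in us-east-1.
                |         sess := session.Must(session.NewSession(&aws.Config{
                |             Region:              aws.String("ap-southeast-2"),
                |             STSRegionalEndpoint: endpoints.RegionalSTSEndpoint,
                |         }))
                |         out, err := sts.New(sess).GetCallerIdentity(
                |             &sts.GetCallerIdentityInput{})
                |         if err != nil {
                |             log.Fatal(err)
                |         }
                |         fmt.Println(aws.StringValue(out.Account))
                |     }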
        
             | johnmarcus wrote:
             | Yup, so true. People think redundant == 100% uptime, or
             | that when they advertise 99.9% uptime, it's the same thing
             | as 100% minus a tiny bit for "glitches".
             | 
              | It's not. .1% of 365*24 = 87.6 hours of downtime - that's
              | over 3 days of complete downtime every year!
             | 
             | For a more complete list of their SLA's for every service:
             | https://aws.amazon.com/legal/service-level-
             | agreements/?aws-s...
             | 
              | They only refund 100% when they fall below 95% of
              | availability! 95-99% gets a 30% credit. I believe the
              | real target is above 99.9% though, as that results in 0
              | refund to the customer. What that means is, 3 days of
              | downtime is acceptable!
             | 
             | Alternatively, you can return to your own datacenter and
             | find out first hand that it's not particularly as easy to
             | deliver that as you may think. You too will have power
             | outages, network provider disruptions, and the occasional
             | "oh shit, did someone just kick that power cord out?" or
             | complete disk array meltdowns.
             | 
              | Anywho, they have a lot more room in their published SLAs
              | than you think.
             | 
              | Edit: as someone correctly pointed out, I made a typo in
              | my math. It is only ~9 hours of allotted downtime. Keep
              | in mind that this is _per service_ though - meaning each
              | service can have a different 9 hours of downtime before
              | they need to pay out 10% on that one service. I still
              | stand by my statement that their SLAs have a lot of
              | wiggle room that people should take more seriously.
        
               | mqnfred wrote:
                | Your computation is incorrect: 3 days out of 365 is 1%
                | downtime, not 0.1%. I believe your error stems from
                | treating .1% as 0.01 (i.e. 1%). Indeed:
               | 
               | 0.001 (.1%) * 8760 (365d*24h) = 8.76h
               | 
                | Alternatively, the common industry standard in
                | infrastructure (at the place I work, at least) is four
                | nines, i.e. 99.99% availability, which is around 52
                | minutes a year, or about 4 minutes a month IIRC.
                | There's not as much room as you'd think! :)
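                | 
                | A quick Python sketch of that arithmetic for a few
                | common availability targets:
                | 
                |   HOURS_PER_YEAR = 365 * 24  # 8760
                | 
                |   for availability in (0.999, 0.9995, 0.9999):
                |       downtime_h = (1 - availability) * HOURS_PER_YEAR
                |       print(f"{availability:.2%} -> {downtime_h:.2f} h/yr "
                |             f"({downtime_h * 60:.0f} min/yr)")
                | 
                |   # 99.90% -> 8.76 h/yr (526 min/yr)
                |   # 99.95% -> 4.38 h/yr (263 min/yr)
                |   # 99.99% -> 0.88 h/yr (53 min/yr)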
        
         | foobarian wrote:
         | > Surprised it was a single availability zone, without
         | redundancy. Having multiple fully independent zones seems more
         | reliable and failsafe.
         | 
         | It's also a lot more expensive. Probably order of magnitude
         | more expensive than the cost of a 1 day outage
        
           | sam0x17 wrote:
           | Most startups I've worked at literally have a script to
           | deploy their whole setup to a new region when desired. Then
           | you just need latency-based routing running on top of it to
           | ensure people are processed in the closest region to them.
            | Really not expensive. You can do this for under $200/month
            | in added cost, and the bandwidth + database costs are
            | going to be roughly the same as they normally are because
           | you're splitting your load between regions. Now if you
           | stupidly just duplicate your current infrastructure entirely,
           | yes it would be expensive because you'd be massively
           | overpaying on DB.
           | 
           | In theory the only additional cost should be the latency-
           | based routing itself, which is $50/month. Other than that,
           | you'll probably save money if you choose the right regions.
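            | 
            | A rough sketch of the latency-routing piece with boto3 (the
            | zone ID, record name, and addresses are made up, and real
            | setups would more likely point alias records at load
            | balancers):
            | 
            |   import boto3
            | 
            |   route53 = boto3.client("route53")
            | 
            |   # One record per region; Route 53 answers each client
            |   # with the lowest-latency region.
            |   for region, ip in [("us-east-1", "203.0.113.10"),
            |                      ("eu-west-1", "203.0.113.20")]:
            |       route53.change_resource_record_sets(
            |           HostedZoneId="Z0000000EXAMPLE",  # placeholder
            |           ChangeBatch={"Changes": [{
            |               "Action": "UPSERT",
            |               "ResourceRecordSet": {
            |                   "Name": "app.example.com",
            |                   "Type": "A",
            |                   "SetIdentifier": region,
            |                   "Region": region,  # latency-based routing
            |                   "TTL": 60,
            |                   "ResourceRecords": [{"Value": ip}],
            |               },
            |           }]},
            |       )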
        
             | Symbiote wrote:
             | So Roblox need a button to press to (re)deploy 18,000
             | servers and 170,000 containers? They already have multiple
             | core data centres, as well as many edge locations.
             | 
             | You will note the problem was with the software provided
             | and supported by Hashicorp.
        
             | e4e78a06 wrote:
             | Correctly handling failure edge cases in a active-active
             | multi-region distributed database requires work. SaaS DBs
             | do a lot of the heavy lifting but they are still highly
             | configurable and you need to understand the impact of the
             | config you use. Not to mention your scale-up runbooks need
             | to be established so a stampede from a failure in one
             | region doesn't cause the other region to go down. You also
             | need to avoid cross-region traffic even though you might
             | have stateful services that aren't replicated across
             | regions. That might mean changes in config or business
             | logic across all your services.
             | 
             | It is absolutely not as simple as spinning up a cluster on
             | AWS at Roblox's scale.
        
             | Twirrim wrote:
              | Roblox is not a startup, and has a significantly sized
             | footprint (18,000 servers isn't something that's just
             | available, even within clouds. They're not magically
             | scalable places, capacity tends to land just ahead of
             | demand). It's not even remotely a simple case of just "run
              | a script and whee, we have redundancy". There are _lots_ of
             | things to consider.
             | 
             | 18k servers is also not cheap, at all. They suggest at
             | least some of their clusters are running on 64 cores, some
             | on 128. I'm guessing they probably have a fair spread of
             | cores.
             | 
             | Just to give a sense of cost, AWS's calculator estimates
              | 18,000 _32_-core instances would set you back $9m per
              | month. That's just the EC2 cost, and assuming a lower core
             | count is used by other components in the platform. 64 core
             | would bump that to $18m. Per month. Doing nothing but
             | sitting waiting ready. That's not considering network
             | bandwidth costs, load balancers etc. etc.
             | 
             | When you're talking infrastructure on that scale, you have
             | to contact cloud companies in advance, and work with them
             | around capacity requirements, or you'll find you're barely
             | started on provisioning and you won't find capacity
             | available (you'll want to on that scale _anyway_ because
              | you'll get discounts but it's still going to be very
             | expensive)
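              | 
              | Back-of-the-envelope version of that estimate (the hourly
              | rate is an assumed figure for a 32-core-class instance,
              | not a quoted AWS price):
              | 
              |   instances = 18_000
              |   hourly_rate = 0.68        # assumed $/hr, 32-core class
              |   hours_per_month = 730
              | 
              |   monthly = instances * hourly_rate * hours_per_month
              |   print(f"${monthly:,.0f}/month")  # ~$8.9M; ~2x for 64 cores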
        
             | bradly wrote:
             | Are the same services available in all regions?
             | 
             | Are the same instance sizes available in all regions?
             | 
             | Are there enough instances of the sizes you need?
             | 
             | Do you have reserved instances in the other region?
             | 
             | Are your increased quotas applied to all regions?
             | 
             | What region are your S3 assets in? Are you going to migrate
             | those as well?
             | 
             | Is it acceptable for all user sessions to be terminated?
             | 
             | Have you load tested the other region?
             | 
             | How often are you going to test the region fail over?
             | Yearly? Quarterly? With every code change?
             | 
             | What is the acceptable RTO and RPO with executives and
             | board-members?
             | 
             | And all of that is without thinking about cache warming,
             | database migration/mirror/replication, solr indexing (are
             | you going to migrate the index or rebuild? Do you know how
             | long it takes to rebuild your solr index?).
             | 
              | The startups you worked at probably had different needs
              | than Roblox. I was the tech lead on a Rails app that was
              | embedded in TurboTax and QuickBooks and was rendered on
              | each TT screen transition; reading your comment in that
              | context, it shows a lot of inexperience with large
              | production systems.
        
             | [deleted]
        
           | bradly wrote:
           | Yes. If you are running in two zones in the hopes that you
           | will be up if one goes down, you need to be handling less
           | than 50% load in each zone. If you can scale up fast enough
           | for your use case, great. But when a zone goes down and
           | everyone is trying to launch in the zone still up, there may
            | not be instances available for you at that time. Our site
            | did a billion in revenue or something in a single day, so
            | for us it was worth the cost, but it's not easy (or at
            | least it wasn't at the time).
        
           | outworlder wrote:
           | > It's also a lot more expensive. Probably order of magnitude
           | more expensive than the cost of a 1 day outage
           | 
           | Not sure I agree. Yes, network costs are higher, but your
            | overall costs may not be, depending on how you architect.
           | Independent services across AZs? Sure. You'll have multiples
           | of your current costs. Deploying your clusters spanning AZs?
           | Not that much - you'll pay for AZ traffic though.
        
             | adrr wrote:
              | It is when you run your own data centers and have to
              | shell out large capital outlays to spin up a new
              | datacenter.
        
               | Symbiote wrote:
               | The usual way this works (and I assume this is the case
               | for Roblox) is not by constructing buildings, but by
               | renting space in someone else's datacentre.
               | 
               | Pretty much every city worldwide has at least one place
               | providing power, cooling, racks and (optionally) network.
               | You rent space for one or more servers, or you rent
               | racks, or parts of a floor, or whole floors. You buy your
               | own servers, and either install them yourself, or pay the
               | datacentre staff to install them.
        
           | Hamuko wrote:
           | How expensive? Remember that the Roblox Corporation does
           | about a billion dollars in revenue per year and takes about
           | 50% of all revenue developers generate on their platform.
        
             | dev_by_day wrote:
             | Right, outages get more expensive the larger you grow. What
             | else needs to be thought of is not just the loss of revenue
              | for the time your service is down but also its effect on
             | user trust and usability. Customers will gladly leave you
             | for a more reliable competitor once they get fed up.
        
           | johnmarcus wrote:
            | Multi-AZ is free at Amazon. Having things split amongst 3
            | AZs costs no more than having them in a single AZ.
           | 
           | Multi-Region is a different story.
        
             | otterley wrote:
             | There are definitely cost and other considerations you have
             | to think about when going multi-AZ.
             | 
             | Cross-AZ network traffic has charges associated with it.
             | Inter-AZ network latency is higher than intra-AZ latency.
             | And there are other limitations as well, such as EBS
             | volumes being attachable only to an instance in the same AZ
             | as the volume.
             | 
             | That said, AWS does recommend using multiple Availability
             | Zones to improve overall availability and reduce Mean Time
             | to Recovery (MTTR).
             | 
             | (I work for AWS. Opinions are my own and not necessarily
             | those of my employer.)
        
               | znep wrote:
               | This is very true, the costs and performance impacts can
               | be significant if your architecture isn't designed to
               | account for it. And sometimes even if it is.
               | 
                | In addition, unless you can cleanly survive an AZ going
                | down, which can take a bunch more work in some cases,
                | being multi-AZ can actually reduce your availability
                | by adding more things that can fail.
               | 
               | AZs are a powerful tool but are not a no-brainer for
                | applications at scale that are not designed for them; it
               | is literally spreading your workload across multiple
               | nearby data centers with a bit (or a lot) more tooling
               | and services to help than if you were doing it in your
               | own data centers.
        
             | suifbwish wrote:
              | Having bare metal may not be less stressful, but AWS is
              | by no means cheap.
        
               | johnmarcus wrote:
                | AWS is not cheap, but splitting amongst AZs comes at no
                | additional cost.
        
               | orangepurple wrote:
               | False
               | 
                | Data Transfer within the same AWS Region:
                | 
                | Data transferred "in" to and "out" from Amazon EC2,
                | Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator
                | (DAX), and Amazon ElastiCache instances, Elastic Network
                | Interfaces or VPC Peering connections across
                | Availability Zones in the same AWS Region is charged at
                | $0.01/GB in each direction.
               | 
               | https://aws.amazon.com/ec2/pricing/on-
               | demand/#Data_Transfer_...
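                | 
                | A quick sketch of how that adds up (the traffic volume
                | here is hypothetical):
                | 
                |   tb_crossing_azs = 100                  # hypothetical
                |   gb = tb_crossing_azs * 1_000
                |   cost = gb * (0.01 + 0.01)  # $0.01/GB out + $0.01/GB in
                |   print(f"${cost:,.0f}/month")  # $2,000/month at 100 TB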
        
         | vorpalhex wrote:
         | I'm more impressed that it hasn't been an issue until now.
        
         | bob1029 wrote:
         | > Having multiple fully independent zones seems more reliable
         | and failsafe.
         | 
         | This also introduces new modes of failure which did not exist
         | before. There are no silver bullets for this problem.
        
           | rhizome wrote:
           | There are no silver bullets to _any_ problem, but there are
           | other ways of implementing services and architecture that can
           | sidestep these things.
        
         | maxclark wrote:
          | Not surprised at all. Multi-AZ is a PITA. You'd be surprised
          | how many 7-figure+/month infras are single region/AZ.
        
         | hedwall wrote:
         | A guess would be that game servers are distributed across the
          | globe but backend services are in one place. A common pattern
         | in game companies.
        
       ___________________________________________________________________
       (page generated 2022-01-20 23:00 UTC)