[HN Gopher] Summary of the Amazon Kinesis Event in the Northern ...
___________________________________________________________________
Summary of the Amazon Kinesis Event in the Northern Virginia (US-East-1) Region
Author : codesparkle
Score : 263 points
Date : 2020-11-28 07:51 UTC (15 hours ago)
(HTM) web link (aws.amazon.com)
(TXT) w3m dump (aws.amazon.com)
| tmk1108 wrote:
| How does the architecture of Kinesis compare to Kafka? If you scale up the number of Kafka brokers, can you hit a similar problem? Or does Kafka not rely on creating threads to connect to each other broker?
| aloknnikhil wrote:
| Kafka uses a thread pool for request processing. Both the brokers and the consumer clients use the same request processing loop.
|
| This goes a bit more in-depth: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafk...
| ipsocannibal wrote:
| So the cause of the outage boils down to not having a metric on total file descriptors with an alarm if usage gets within 10% of the max, and a faulty scaling plan that should have said "for every N backend hosts we add we must add X frontend hosts". One metric and a couple of lines in a wiki could have saved Amazon what is probably millions in outage-related costs. One wonders if Amazon retail will start hedging its bets and go multicloud to prevent impacts on the retail customers from AWS LSEs.
| lytigas wrote:
| > During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.
|
| Poetry.
|
| Then, to be fair:
|
| > We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues.
|
| I'm curious if anyone here actually got one of these.
| ufmace wrote:
| My employer is a pretty big spender with AWS. I didn't hear anything about anybody getting status updates from a "Personal Health Dashboard" or anywhere else. I can't be 100% sure such an update would have made its way to me, but given the amount of buzzing, it's hard to believe that somebody had info like that and didn't share it.
| loriverkutya wrote:
| I can confirm we got the Personal Health Dashboard notifications.
| newscom59 wrote:
| The PHD is _always_ updated first, long before the global status page is updated. Every single one of my clients that use AWS got updates on the PHD literally hours before the status page was even showing any issues, which is typical. It's the entire point of the PHD.
|
| Through reading Reddit and HN during this event I learned that most people apparently aren't even aware of the existence of the PHD and rely solely on the global status page, despite the fact that there is a giant "View my PHD" button at the very top of the global status page, and additionally there is a notification icon on the header of every AWS console page that lights up and links you directly to the PHD whenever there is an issue.
|
| The PHD is always where you should look first. It is, by design, updated long before the global status page is.
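ipsocannibal's point above (a metric on descriptor usage with an alarm well before the limit) is cheap to approximate from inside a Java process. Below is a minimal sketch, assuming a Unix JVM; the 90% threshold, the one-minute poll, and the emitAlarm hook are illustrative placeholders, not AWS's actual monitoring.

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Poll open file descriptors and live threads, and alarm when descriptor
    // usage creeps within 10% of the process limit.
    public class DescriptorWatchdog {
        private static final double ALARM_THRESHOLD = 0.90; // illustrative

        public static void main(String[] args) throws InterruptedException {
            // Cast is valid on Unix-like JVMs only.
            UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                long open = os.getOpenFileDescriptorCount();
                long max = os.getMaxFileDescriptorCount();
                int threads = ManagementFactory.getThreadMXBean().getThreadCount();
                if (open > max * ALARM_THRESHOLD) {
                    emitAlarm("OpenFileDescriptors", open, max);
                }
                // Thread limits live elsewhere (ulimit -u, /proc/sys/kernel/threads-max),
                // so a real monitor would read the host limit and alarm on that too.
                System.out.printf("fds=%d/%d threads=%d%n", open, max, threads);
                Thread.sleep(60_000);
            }
        }

        private static void emitAlarm(String metric, long value, long max) {
            // Placeholder: publish a metric/alarm instead of printing.
            System.err.printf("ALARM %s at %d of %d%n", metric, value, max);
        }
    }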
| mwarkentin wrote:
| Yes, we had some messages coming through in our PHD.
| 0x11 wrote:
| I can't say for sure that the company I work for didn't, but it certainly didn't make its way to me and there are only 8 of us.
| vishnugupta wrote:
| This won't be a first. The status page was hosted in S3. It is hilarious in hindsight, but understandable.
| capableweb wrote:
| > but understandable
|
| Is it really? I get the value of eating your own dogfood, it improves things a lot.
|
| But your status page? It is such a high-importance, low-difficulty thing to build that dogfooding it gives you a small amount of benefit (dogfood something bigger/more complex instead) in the good case, and a high amount of drawback when things go wrong (like when your infrastructure goes down, so does your status page). So what's the point?
| KingOfCoders wrote:
| Arrogance.
| tpetry wrote:
| I can really imagine what happened: Engineer wants to host the dashboard at a different provider for resilience. Manager argues that they can't do this, it would be embarrassing if anybody found out. And why choose another provider? AWS has multiple AZs and can't be down everywhere at the same moment. Engineer then says "f*** it" and just builds it on a single solution.
| freeone3000 wrote:
| The failure to update the Service Health Dashboard was due to reliance on internal services to update. This also happened in March 2017[0]. Perhaps a general, instead of piecemeal, approach to removing dependencies on running services from the dashboard would be valuable here?
|
| 0: https://aws.amazon.com/message/41926/
| temp0826 wrote:
| us-east-1 is AWS's dirty secret. If ddb had gone down there, there would likely be a worldwide and multi-service interruption.
| fafner wrote:
| From the summary I don't understand why front-end servers need to talk to each other ("continuous processing of messages from other Kinesis front-end servers"). It sounds like this is part of building the shard map or the cache. Well, in the end, an unfortunate design decision. #hugops for the team handling this. Cascading failures are the worst.
| hintymad wrote:
| A tangential question: why would AWS even use the term "microservice"? A service is a service, right? I'm not sure what the term "microservice" signifies here.
| arduinomancer wrote:
| It's because "service" can be confused with "AWS Service", which is not the same as a microservice (a component of a full service).
| londons_explore wrote:
| One requirement on my "production ready" checklist is that any catastrophic system failure can be resolved by starting a completely new instance of the service, and having it ready to serve traffic inside 10 minutes.
|
| That should be tested at least quarterly (but preferably automatically with every build).
|
| If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...
| WJW wrote:
| Kinesis probably runs well over 100k instances. Restarting it might not be so trivial that you can do it in 10 minutes.
| why-el wrote:
| The same OS limits would apply to new instances, unless they knew the root cause and forced new instances to be configured with larger descriptor limits, which is... well, hindsight is 20/20, no?
| WookieRushing wrote:
| This only works for stateless services. If you've got frontends that take longer than 10 mins to serve traffic then you have a problem.
|
| But if you're running a DB or a storage system, 10 mins is a blink of an eye. Storage systems in particular can run a few hundred TB per node, and moving that data to another node can take over an hour.
|
| In this case, the frontends have a shard map, which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until the shard map is fully loaded.
| londons_explore wrote:
| It's possible (albeit much harder) for stateful services too.
|
| It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".
|
| Take Wikipedia as an example. I'd expect them to be able to restore a backup of the latest version of all pages in 10 minutes. It's 20GB of data, and I assume it's sharded at least 10 ways. That means each instance will have to grab 2GB from the backups. Very doable.
|
| As a service gets bigger, you typically scale horizontally, so the problem doesn't get harder.
|
| Restoring all the old page versions and re-enabling editing might take longer, but that's less critical functionality.
| tnolet wrote:
| This is a pretty damn decent post mortem so soon after the outage. It also gives an architectural analysis of how Kinesis works, which is something they did not have to do at all.
| ignoramous wrote:
| root-cause tldr:
|
| _...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [the number of threads spawned is directly proportional to the number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters._
|
| fixes:
|
| ...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.
|
| ...making a number of changes to radically improve the cold-start time for the front-end fleet.
|
| ...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.
|
| ...moving a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.
|
| ...accelerating the cellularization [0] of the front-end fleet to match what we've done with the back-end.
|
| [0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...
| frankietaylr wrote:
| I wonder how many of them are already logged engineering tasks which never got prioritized because of the aggressive push to add features.
| bithavoc wrote:
| They're calling it an "Event"; the title should say "Summary of the Amazon Kinesis Outage..."
| pps43 wrote:
| > the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. [...] We didn't want to increase the operating system limit without further testing
|
| Is it because operating system configuration is managed by a different team within the organization?
| mcqueenjordan wrote:
| Nope. It's just a case of "stop the bleeding before starting the surgery."
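A back-of-the-envelope illustration of the failure mode in the tl;dr above: with one thread per peer, each front-end server's thread count grows linearly with fleet size until it crosses the OS limit, which is also why the stated fix of fewer, larger servers works without code changes. All numbers here are invented for illustration; the write-up does not disclose the fleet size or the actual limit.

    // Hypothetical numbers only: shows why adding front-end capacity pushes
    // every server in the fleet over a per-process thread limit at once.
    public class ThreadBudget {
        public static void main(String[] args) {
            int osThreadLimit = 10_000;  // assumed per-process limit
            int baselineThreads = 2_000; // request handlers, timers, etc. (assumed)

            for (int fleetSize : new int[] {4_000, 6_000, 8_000, 9_000}) {
                int perServer = baselineThreads + (fleetSize - 1); // one thread per peer
                System.out.printf("fleet=%d -> threads/server=%d %s%n",
                        fleetSize, perServer,
                        perServer > osThreadLimit ? "EXCEEDS LIMIT" : "ok");
            }
        }
    }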
| sitharus wrote: | More likely they need to understand what effect changing the | thread limit would have - for example it could increase kernel | memory usage or increase scheduler latency. It's not something | you want to mess with in an outage. | sudhirj wrote: | I've heard AWS follows a you build it, you run it policy, so | that seems unlikely. Just seems prudent to not mess with OS | settings in a hurry. | Androider wrote: | If you start haphazardly changing things while firefighting | without testing, you might make things even worse. And there's | worse things than downtime, for instance if the system appears | to work but you're actually silently corrupting customer data. | joneholland wrote: | Running out of file handles and other IO limits is embarrassing | and happens at every company, but I'm surprised that AWS was not | monitoring this. | | I'm also surprised at the general architecture of Kinesis. What | appears to be their own hand rolled gossip protocol (that is | clearly terrible compared to raft or paxos, a thread per cluster | member? Everyone talking to everyone? An hour to reach | consensus?) and the front end servers being stateful period | breaks a lot of good design choices. | | The problem with growing as fast as Amazon has is that their | talent bar couldn't keep up. I can't imagine this design being | okay 10 years ago when I was there. | marcinzm wrote: | I don't think it's about growing fast so much as, from those I | talked to, Amazon now has a fairly bad reputation in the tech | community. You only go to work there if you don't have a better | option (Google, Facebook, etc) or have some specialty skill | they're willing to pay for. Pay is below other FAANG companies | and the work culture isn't great (toxic even some would say). | | edit: They also had the most disorganized and de-centralized | interview approach from all the FAANG companies I talked with. | Which isn't growing pains this far in, it's just bad management | and process. | ActorNightly wrote: | Just as a general reminder to anyone reading this: forum | comments are incredibly biased and hardly ever represent | reality accurately. | imajoredinecon wrote: | Interesting re interview experience | | I interviewed as a new grad SWE and the process was totally | straightforward, and way lower friction (albeit much less | human interaction, which made it feel even more impersonal) | than almost everywhere else I applied: initial online screen, | online programming task, and then a video call with an | engineer where you explained your answer to the programming | task. | marcinzm wrote: | I was doing machine learning so more specialized than | regular SDE. Other companies it was talk to recruiter, | phone screen with manager, and then virtual onsite | interviews. Hiring was either not team specific or the | recruiter helped manage the process (ie: what does this | role actually need). Very clear directions on what type of | questions will be asked, format of the interviews, what to | prepare for, etc. Amazon the recruiter just told me to look | on the job site and then, despite me being clear, applied | me to the wrong role. Then got one of those automated | coding exercises despite 15 year experience and an internal | referral. Wasn't hard but it also pointless since the | coding exercise was for the wrong role. Finally got a phone | screen and they asked me nothing but pedantic college | textbook questions for 40 minutes. Recruiter provided no | warning for that. 
|
| edit: You could blame the recruiter, but every other company had a well-oiled machine for their recruiters. So even if they provided only generic information there was still a standard process for what they provided.
| akhilcacharya wrote:
| The process for new grads and interns is different from industry hires and is decided by team.
| nixass wrote:
| Very anecdotal
| hedora wrote:
| I've noticed a strong tendency for older systems to accumulate "spaghetti architecture", where newer employees add new subsystems and tenured employees are blind to the early design mistakes they made. For instance, in this system, it sounds like they added a complicated health check mechanism at some point to bounce faulty nodes.
|
| Now, they don't know how it behaves, so they're afraid to take corrective actions in production.
|
| They built that before ensuring that they logged the result of each failed system call. The prioritization seems odd, but most places look at logging as a cost center, and the work of improving it as drudgery, even though it's far more important than shiny things like automatic response to failures, and also takes a person with more experience to do properly.
| karmakaze wrote:
| Kinesis was the worst AWS tech I've ever used. Splitting a stream into shards doesn't increase throughput if you still need to run the same types/number of consumers on every shard. The suggested workaround at the time was to use larger batches and add latency to the processing pipeline.
| jen20 wrote:
| > What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos)
|
| Raft and Paxos are not gossip protocols - they are consensus protocols.
| joneholland wrote:
| Fair. What I meant to say is "hand rolling a way to have consensus on a shared piece of data" by implementing it with a naive gossip system.
| justicezyx wrote:
| I led the storage engine prototyping for Kinesis in 2012 (the best time in my career so far).
|
| Kinesis uses Chain Replication [1], a dead simple fault-tolerant storage algorithm: machines form a chain, data flows from head to tail in one direction, writes always start at the head and reads happen at the tail, new nodes always join at the tail, but nodes can be kicked out at any position.
|
| The membership management of the chain nodes is done through a Paxos-based consensus service like Chubby or ZooKeeper. Allan [2] (the best engineer I personally worked with so far, way better than anyone I encountered) wrote that system. The Java code quality shows itself at first glance. Not to mention the humbleness and openness in sharing his knowledge during early design meetings.
|
| I am not sure what protocol is actually used now. But I would be surprised if it's different, given the protocol's simplicity and performance.
|
| [1] https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf [2] https://www.linkedin.com/in/allan-vermeulen-58835b/
| ryanworl wrote:
| Can you explain why the sequence numbers are so giant? I've never understood that.
| justicezyx wrote:
| I don't remember the size, is it 128 bits?
|
| It was chosen for future expansion. Kinesis was envisioned to be a much larger-scale Kafka + Storm (Storm was the streaming programming framework popular in 2012; it has since fallen out of favor).
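A toy model of the Chain Replication scheme justicezyx describes: writes enter at the head and are forwarded node-to-node toward the tail, while reads are served only by the tail, so anything readable is already on every replica. Failure handling and membership changes, which the real system delegates to a Paxos-backed service, are omitted, and all names are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    public class ChainReplicationSketch {
        static class Node {
            final String name;
            final Map<String, String> store = new HashMap<>();
            Node next; // toward the tail

            Node(String name) { this.name = name; }

            void write(String key, String value) {
                store.put(key, value);                    // apply locally...
                if (next != null) next.write(key, value); // ...then forward down the chain
            }
        }

        public static void main(String[] args) {
            Node head = new Node("head"), mid = new Node("mid"), tail = new Node("tail");
            head.next = mid;
            mid.next = tail;

            head.write("shard-42", "owner=backend-7"); // writes always start at the head
            // Reads always go to the tail: a value is visible only once every
            // replica in the chain has applied it.
            System.out.println("read at tail: " + tail.store.get("shard-42"));
        }
    }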
| ryanworl wrote:
| 128 bits might be accurate. I meant more along the lines of: they are non-contiguous and don't seem to be correlated with the number of records actually being written to a stream.
| crgwbr wrote:
| I always assumed they were sequential per-region rather than per-stream, but that's just a guess.
| justicezyx wrote:
| I recall it was sequential per-shard, but no ordering among the shards of the same stream. But I literally haven't touched Kinesis since 2013.
| hintymad wrote:
| AWS frontend services are usually implemented in Java. If Kinesis' frontend is too, then it's surprising that the threads created by a frontend service would exceed the OS limit. This suggests three possibilities: 1. Kinesis did not impose a max thread count in their app, which is a gross omission; 2. there was a resource leak in their code; or 3. each of their frontend instances stored all the placement information of backend servers, which means their frontend was not scalable by backend size.
| joneholland wrote:
| My understanding is that every front-end server has at least one connection (on a dedicated thread) to every other front-end server.
|
| Assuming they have, say, 5000 front-end instances, that's 5000 file descriptors being used just for this, before you are even talking about whatever threads the application needs.
|
| It's not surprising that they bumped into ulimits, though as part of OS provisioning, you typically have those tuned for the workload.
|
| More concerning is the 5000 x 5000 open TCP sessions across their network to support this architecture. This has to be a lot of fun on any stateful firewall it might cross.
| pentlander wrote:
| The hand-rolled gossip protocol (DFDD) is not used for consensus, it's just used for cluster membership and health information. It's used by pretty much every foundational AWS service. There's a separate internal service for consensus that uses Paxos.
|
| The thread per frontend member definitely sounds like a problematic early design choice. It wouldn't be the first time I heard of an AWS issue due to "too many threads". Unlike gRPC, the internal RPC framework defaults to a thread per request rather than an async model. The async way was pretty painful and error-prone.
| joneholland wrote:
| Are they still using Coral and Codigo as the RPC stack?
| morsma wrote:
| Not Codigo, but Coral, yes.
| 8note wrote:
| Well, there's still Codigo around, but Coral is quite pleasant.
| pentlander wrote:
| Yeah, Amazon still runs on Coral; there were some recent (released a few years ago) advances on it under the hood and ergonomically. I think the "replacement" for it is Smithy[0], though it will likely just replace the XML schema and codegen and not the protocol. Honestly, at this point I think it would be in Amazon's best interest to heavily invest in Java Project Loom rather than trying to convert to async.
|
| [0] https://awslabs.github.io/smithy/
| trhway wrote:
| > Java Project Loom
|
| Sounds like after more than 20 years the green threads are back! Everything new is well-forgotten old (especially the bad parts of that old :)
|
| I think in the Kinesis case it isn't the thread-per-connection model that is the root of the problem. It is the each-to-each flat topology of the front end, which keeps growing and waiting for the various thresholds to get triggered. First the number of threads, next it will be something else, ...
| until they re-architect into something like a layered structure.
|
| Many here mention the quality of the current talent at AMZN. Anecdotally, 2 people they recently hired from our dept were among the weakest. A slightly stronger guy got offers from Amazon and Apple, and went for Apple. A much stronger and more experienced guy failed to get an offer from Amazon.
| [deleted]
| ignoramous wrote:
| Are you sure Kinesis uses DFDD [0]?
|
| [0] Seems like a relic of years gone by https://patents.justia.com/patent/9838240
| cowsandmilk wrote:
| That patent is from when Kinesis Data Streams were originally announced to the public. Any reason not to think it uses it? Seems like it would have been a logical choice in the initial architecture, and change is slow.
| pentlander wrote:
| Though I no longer work for Amazon, I'm reasonably certain they use it from the description. Especially given I know for a fact that other more foundational services use it.
|
| Why is it a "relic of years gone by"? Consul uses a similar, though more advanced, technique[0]. Consul may not be as widely used as etcd, but I don't think most would consider it a relic.
|
| [0] https://www.consul.io/docs/architecture/gossip
| bonfire wrote:
| If you want to eat in a restaurant it's better not to look in the kitchen :-|
| jknoepfler wrote:
| It irks me to this day that AWS all-hands meetings (circa 2015) celebrated an exponential hiring curve (as in, the graph was greeted with applause and touted as evidence of success by the speaker). The next plot would be an exponential revenue curve with approximately the same shape. Meanwhile the median lifespan of an engineer was ~10 months. I don't know, I just couldn't square that one in my head.
| throwaway189262 wrote:
| I can't remember which db, but somebody a while back claimed that one of Amazon's "infinitely scalable" dbs was tons of abstraction on top of a massive heap of MySQL instances.
|
| I don't trust anything outside core services on AWS. Regardless of whether the rumor I heard is true, it's clear they appreciate quantity over quality.
| WookieRushing wrote:
| This is really common and works pretty awesomely. MySQL is extremely battle-tested and there are lots of experts out there for it.
|
| FB built a similar system to maintain their graph: https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-nos...
|
| It's a ton of tiny DBs that look like one massive eventually consistent DB.
| rubiquity wrote:
| Disclosure: I work at AWS, possibly near the system you're describing. Opinions are my own.
|
| If we're talking about the same thing then I think casting stones just because it is based on MySQL is severely misguided. MySQL has decades of optimizations, and this particular system at Amazon has solved scaling problems and brought reliability to countless services without ever being the direct cause of an outage (to the best of my knowledge).
|
| Indeed, MySQL is not without its flaws, but many of these are related to its quirks in transactions and replication, which this system completely solves. The cherry on top is that you have a rock solid database with a familiar querying language and a massive knowledge base to get help from when needed. Oh, and did I mention this system supports multiple storage engines besides just MySQL/InnoDB?
|
| I for one wish we would open source this system, though there are a ton of hurdles both technical and not.
I think it would | do wonders for the greater tech community by providing a much | better option as your needs grow beyond a single node system. | It has certainly served Amazon well in that role and I've | heard Facebook and YouTube have similar systems based on | MySQL. | | To further address your comment about Amazon/AWS lacking | quality: this system is the epitome of our values of | pragmatism and focusing our efforts on innovating where we | can make the biggest impact. Hand rolling your own storage | engines is fun and all but countless others have already | spent decades doing so for marginal gains. | panarky wrote: | _> the system you 're describing_ | | Why all the mysterious cloak-and-dagger? | rubiquity wrote: | It is possible that the person I replied to could be | talking about an entirely different piece of software. | Another reason is that the specifics of the system I'm | referring to are not public knowledge. | | The more important takeaway is that building on top of | MySQL/InnoDB is perfectly fine and that is what I tried | to emphasize. | ignoramous wrote: | I believe someone _allegedly_ from AWS said DynamoDB was | written on top of MySQL (on top of InnoDB, really) [0] which | would be similar to what Uber and Pinterest did as well. [1] | | [0] https://news.ycombinator.com/item?id=18426096 | | [1] https://news.ycombinator.com/item?id=12605026 | blandflakes wrote: | At more than one of my jobs, this has been exactly the right | way to horizontally scale relational workloads and has gone | very well. | derwiki wrote: | Sounds like Dynamo | newscom59 wrote: | What's the problem with this? _Anything_ that's scalable is | "just an abstraction" on top of a heap of shards | /processes/nodes/data centers/whatever. | solatic wrote: | > The problem with growing as fast as Amazon has is that their | talent bar couldn't keep up. I can't imagine this design being | okay 10 years ago when I was there. | | I see where you're coming from with this, but you really have | to wonder. It sounds more like the original architects made | implicit assumptions regarding scale that, likely due to | original architects and engineers moving on, were not re- | evaluated by the current engineers on Kinesis as Kinesis grew. | While it may take an hour now for the front-end cache to sync, | I find it highly unlikely that it needed that much time when | Kinesis first launched. | | The process failure here is organizational, one where Amazon | failed to audit its current systems in a complete and current | manner such that sufficient attention and resources could be | paid to a re-architecture of a critical service before it | caused the service to fail. Even now, vertically scaling the | front-end cache fleet is just a band-aid - eventually, that | won't be possible anymore. Sadly, the postmortem doesn't seem | to identify the organizational failure that was the true root | cause of the outage. | edoceo wrote: | Oof. My little company is refactoring some five year old | architecture design choices. Ugly. Process isn't visible | outside the refactor and the work is tedious. Can't imagine | what a service refactor is like at A. I bet it sucks | Twirrim wrote: | > Can't imagine what a service refactor is like at A. I bet | it sucks | | It's not all that hard. AWS heavily focuses on Service | Oriented Architecture approaches, with specific | knowledge/responsibility domains for each. It's a proven | scalable pattern. The APIs will often be fairly straight- | forward behind the front end. 
With clearly lines of | responsibility between components, you'll almost never have | to worry about what other services are doing. Just fix what | you've got right in front of you. This is an area where | TLA+ can come in handy too. Build a formal proof and | rebuild your service based on it. | | I joined Glacier 9 months after launch, and it was in the | band-aid stage. In cloud services your first years will | roughly look like: | | 1) Band-aids, and emergency refactoring. Customers never do | what you expect or can predict them to do, no matter how | you price your service to encourage particular behaviour. | First year is very much keep the lights on focused. Fixing | bugs and applying band-aids where needed. In AWS, it's | likely they'll target a price decrease for re:invent | instead of new features. | | 2) Scalability, first new feature work. Traffic will | hopefully be picking up by now for your service, you'll | start to see where things may need a bit of help to scale. | You'll start working on the first bits of innovation for | the platform. This is a key stage because it'll start to | show you where you've potentially painted yourself in to a | corner. (AWS will be looking for some bold feature to tout | at Re:Invent) | | 3) Refactoring, feature work starts in earnest. You'll have | learned where your issues are. Product managers, market | research, leadership etc. will have ideas about what new | features to be working on, and have much more of a roadmap | for your service. New features will be tied in to the first | refactoring efforts needed to scale as per customer | workload, and save you from that corner you're painted in | to. | | Year 3 is where some of the fun kicks in. The more senior | engineers will be driving the refactoring work, they know | what and why things were done how they were done, and can | likely see how things need to be. A design proposal gets | created and refined over a few weeks of presentations to | the team and direct leadership. It's a broad spectrum | review, based around constructive criticism. Engineers will | challenge every assumption and design decision. There's no | expectation of perfection. Good enough is good enough. You | just need to be able to justify why you made decisions. | | New components will be built from the design, and plans for | roll out will be worked on. In Glacier's case one mechanism | we'd use was to signal to a service that $x % of requests | should use a new code path. You'd test behind a whitelist, | and then very very slowly ramp up public traffic in one | smaller region towards the new code path while tracking | metrics until you hit 100%, repeat the process on a large | region slightly faster, before turning it on everywhere. | For other bits we'd figure out ways to run things in shadow | mode. Same requests hitting old and new code, with the new | code neutered. Then compare the results. | | side note: One of the key values engineers get evaluated on | before reaching "Principal Engineer" level is "respect what | has gone before". No one sets out to build something crap. | You likely weren't involved in the original decisions, you | don't necessarily know what the thinking was behind various | things. Respect that those before you built something as | best as suited the known constraints at the time. The same | applies forwards. Respect what is before you now, and be | aware that in 3-5 years someone will be refactoring what | you're about to create. 
| The document you present to your team now will help the next engineers when they come to refactor later on down the line. Things like TLA+ models will be invaluable here too.
| pentlander wrote:
| As another ex-Glacier dev, I disagree with it not being "all that hard", but I agree with everything else. Now I'm curious who this is :)
| xiwenc wrote:
| Indeed, your side note should be the first point. I see that very often in practice. As a result, products get rewritten all the time because developers don't want to spend time understanding the current system, believing they can do a better job than the previous one. They will create different problems. And the cycle repeats.
|
| The not-invented-here (or by-me) syndrome is probably also at play here.
| edoceo wrote:
| Your side note should be the first point.
| fafner wrote:
| Yep. Having each front end need to scale with the overall size of the front end is obviously going to hit some scaling limit. It's not clear to me from the summary why they are doing that. Is it for the shard-map or the cache? Maybe if the front end is stateful that's a way to do stickiness? Seems we can only guess.
| steelframe wrote:
| > Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range. This had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed.
|
| Translation: The eng team knew that they had accumulated tech debt by cutting a corner here in order to meet one of Amazon's typical and insane "just get the feature out the door" timelines. Eng warned management about it, and management decided to take the risk and lean on on-call to pull heroics to just fix any issues as they come up. Most of the time yanking a team out of bed in the middle of the night works, so that's the modus operandi at Amazon. This time, the actual problem was more fundamental and wasn't effectively addressable with middle-of-the-night heroics.
|
| Management rolled the "just page everyone and hope they can fix it" dice yet again, as they usually do, and this time they got snake eyes.
|
| I guarantee you that the "cellularization" of the front-end fleet wasn't actually under way, but the teams were instead completely consumed with whatever the next typical and insane "just get the feature out the door" thing was at AWS. The eng team was never going to get around to cellularizing the front-end fleet because they were given no time or incentive to do so by management. During/after this incident, I wouldn't be surprised if management yelled at the eng team, "Wait, you KNEW this was a problem, and you're not done yet?!?", without recognizing that THEY are the ones actually culpable for failing to prioritize payments on tech debt vs. "new shiny" feature work, which is typical of Amazon product development culture.
|
| I've worked with enough former AWS engineers to know what goes on there, and there's a really good reason why anybody who CAN move on from AWS will happily walk away from their 3rd- and 4th-year stock vest schedules (when the majority of your _promised_ amount of your sign-on RSUs actually starts to vest) to flee to a company that fosters a healthy product development and engineering culture.
|
| (Not to mention that, this time, a whole bunch of people's Thanksgiving plans were preempted with the demand to get a full investigation and post-mortem written up, including the public post, ASAP. Was that really necessary? Couldn't it have waited until next Wednesday or something?)
| metaedge wrote:
| I would have started the response with:
|
| First of all, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
|
| Then move on to explain...
| sigzero wrote:
| What they did was fine.
| lend000 wrote:
| Even today I had a few minutes of intermittent networking outages around 9:30am EST (which started on the day of the incident), and compared to other regions, I frequently get timeouts when calling S3 from us-east-1 (although that has been happening since forever).
| [deleted]
| codesparkle wrote:
| From the postmortem:
|
| _At 9:39 AM PST, we were able to confirm a root cause [...] the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration._
| karmakaze wrote:
| Seems to me that the root problem could also be fixed by not using presumably blocking application threads talking to each of the other servers. Any async or poll mechanism wouldn't require N^2 threads across the pool.
| why-el wrote:
| I wonder if the new wonders coming out of Linux (io_uring...) would have made this a better design, but that work in the kernel is still in active development.
| ris wrote:
| The one thing I want to know in cases like this is: why did it affect multiple Availability Zones? Making a resource multi-AZ is a significant additional cost (and often involves additional complexity), and we really need to be confident that typical observed outages would _actually_ have been mitigated in return.
| EdwardDiego wrote:
| Indeed. We're paying (and designing our systems to work on multiple AZs) to reduce the risk of outages, but then their back-end services are reliant on services in a sole region?
| frankietaylr wrote:
| AWS should make this more transparent so that we make better design choices.
| otterley wrote:
| (Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)
|
| Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.
|
| There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.
|
| This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)
| | As a Solutions Architect, I regularly warn customers that | running in multiple Availability Zones is not enough. | Availability Zones protect you from many kinds of physical | infrastructure failures, but not necessarily from regional | service failures. So it is super important to run in multiple | regions as well: not necessarily active-active, but at least | in a standby mode (i.e. "pilot light") so that customers can | shed traffic from the failing region and continue to run | their workloads. | qz2 wrote: | Correct. | | I, as many people have, discovered this when something broke | in one of the golden regions. In my case cloudfront and ACM. | | Realistically you can't trust one provider at all if you have | high availability requirements. | | The justification is apparently that the cloud is taking all | this responsibility away from people but from personal | experience running two cages of kit at two datacenters the | TCO was lower and the reliability and availability higher. | Possibly the largest cost is navigating Harry-Potter-esque | pricing and automation laws. The only gain is scaling past | those two cages. | | Edit: I should point out however that an advantage of the | cloud is actually being able to click a couple of buttons and | get rid of two cages worth of DC equipment instantly if your | product or idea doesn't work out! | freehunter wrote: | >you can't trust one provider at all | | The hard part with multi-cloud is, you're just increasing | your risk of being impacted by someone's failure. Sure if | you're all-in on AWS and AWS goes down, you're all-out. But | if you're on [AWS, GCP] and GCP goes down, you're down | anyway. Even though AWS is up, you're down because Google | went down. And if you're on [AWS, GCP, Azure] and Azure | goes down, it doesn't matter than AWS and GCP are up... | you're down because Azure is down. The only way around that | is architecting your business to run with only one of those | vendors, which means you're paying 3x more than you need to | 99.99999% of the time. | | The probability that one of [AWS, Azure, GCP] is down is | way higher than the probability that just _one_ of them is | down. And the probability that your two cages in your | datacenter is down is way higher than the probability that | any one of the hyperscalers is down. | [deleted] | qz2 wrote: | I disagree. It's about mitigating the risk of a single | provider's failure. Single providers go down all the | time. We've seen it from all three major cloud vendors. | hedora wrote: | How do you test failover from provider-wide outages? | | I've never heard of an untested failover mechanism that | worked. Most places are afraid to invoke such a thing, | even during a major outage. | [deleted] | qz2 wrote: | That's fairly simple. Regular scenario planning, drills | and suitable chaos engineering tooling. | | Being afraid of failures is a massive sign of problems. | I've worked in those sorts of places before. | coredog64 wrote: | ACM and CloudFront being sticky to us-east-1 is | particularly annoying. I'm happy not being multi regional | (I don't have that level of DR requirements), but these | types of services require me to incorporate all the multi | region complexity. | lttlrck wrote: | Harry-Potter-esque pricing? | | Is that a reference to the difficulty of calculating the | cost of visiting all the rides at universal? That's my best | guess... | qz2 wrote: | It's more a stab at the inconsistency of rules around | magic. 
| | _" Well this pricing rule only works on a Tuesday lunch | time if you're standing on one leg with a sausage under | each arm and a traffic cone on your head"_ | | And there are a million of those to navigate. | talawahtech wrote: | Multi-AZ doesn't protect against a software/OS issue like this, | Multi-AZ would be relevant if it was an infrastructure failure | (e.g. underlying EC2 instances or networking). | | The relevant resiliency pattern in this case would be what they | refer to as cell-based architecture, where within an AZ | services are broken down into smaller independent cells to | minimize the blast radius. | | They specifically mention in the write-up that this was a gap | they plan to address, the "backend" portion of Kinesis was | already cellularized but that step had not yet been completed | on the "frontend". | | Celluarization in combination with workload partitioning would | have helped, e.g. don't run Cloudwatch, Cognito and Customer | workloads on the same set of cells. | | It is also important to note that celluarization only helps in | this case if they limit code deployment to a limited number of | cells at a time. | | This YouTube video[1] of a re:invent presentation does a great | job of explaining it. The cell-based stuff, starts around | minute 20. | | 1. https://youtu.be/swQbA4zub20 | talawahtech wrote: | Another relevant point made in the video is that they | restrict cells to a maximum size which then makes it easier | to test behavior at that size. This would have also helped | avoid this specific issue since the number of threads would | have been tied to the number of instances in a cell. | | I definitely recommend checking out the video. Even if you | have seen it before, rewatching it in the context of this | post-mortem really makes it hit home. | ignoramous wrote: | > _Another relevant point made in the video is that they | restrict cells to a maximum size which then makes it easier | to test behavior at that size._ | | Googlers would be quick to point out that Borg does this | natively across all their services: | https://news.ycombinator.com/item?id=19393926 | brown9-2 wrote: | but why does a Kinesis outage due to a capacity increase | affect multiple AZs, if one assumes the capacity increase | (and the frontend servers impacted by it) are in a single | zone? ___________________________________________________________________ (page generated 2020-11-28 23:00 UTC)