[HN Gopher] How Amazon uses chaos engineering to handle 80k requ...
___________________________________________________________________
How Amazon uses chaos engineering to handle 80k requests per second

Author : jbredeche
Score  : 53 points
Date   : 2023-09-05 20:39 UTC (2 hours ago)

(HTM) web link (community.aws)
(TXT) w3m dump (community.aws)

| andrewmcwatters wrote:
| I would love to know what software stack, hardware, and uplink
| connections in total they utilize to accomplish a real-world 80k
| requests per second throughput. How many instances do you guys
| think Amazon runs for its primary e-commerce front-end stack? In
| total, and per region? Assuming they have a multi-region rollout.
|
| If it's the real deal, and not like people saying "Bun.js can
| serve 65k req/s+ (_cough_ _cough_ to localhost)," that's
| impressive.
|
| But I never see anyone talk about real-world numbers. Just
| synthetic poopoo.
|
| I think I read the article correctly, but I think it only talks
| about how they introduce "chaos engineering." I don't recall
| them talking about how they _actually_ handle a volume of traffic
| like 80k req/s.
| tacozilla wrote:
| Amazon doesn't really have a "primary e-commerce front-end
| stack" in any concrete sense. They have hundreds/thousands of
| teams that deploy bits and pieces to a massive pipeline that
| ultimately makes up what you see on Amazon.com, but each team
| can have their own infrastructure backing things. Some teams
| might run everything off a dozen low-end EC2 instances while
| another sibling team has 3k+ instances; it's really all over
| the place, and that's ignoring specific events like Black
| Friday or Prime Day, etc. where teams need to prescale things
| in advance.
|
| Source: Worked at Amazon/AWS for almost a decade.
| baby_souffle wrote:
| > I would love to know what software stack, hardware, and
| uplink connections in total they utilize to accomplish a
| real-world 80k requests per second throughput.
| How many instances do
| you guys think Amazon runs for its primary e-commerce front-end
| stack? In total, and per region? Assuming they have a
| multi-region rollout.
| > But I never see anyone talk about real-world
| numbers. Just synthetic poopoo.
|
| The number probably changes all the time based on load. They'll
| never release these details because it's a competitive
| advantage to have the "how popular are they in $place at
| $time-of-day" data private.
|
| When they do share numbers, it'll always be the most flattering
| and devoid of any context beyond the "wow" factor.
| tacozilla wrote:
| While I can understand the cynicism, the real answer is a lot
| closer to something much more boring, which is most people
| just don't care about the actual numbers, and if they were to
| release them, while interesting to a small few, generally no
| one would actually care.
|
| There's also a common misconception that Amazon.com is somehow
| just this one giant app running on a set of servers, which
| isn't remotely how it's actually deployed, and that's before
| we spend time arguing whether a team's instances even count
| as "primary e-commerce front-end stack" or not. :P
| andrewmcwatters wrote:
| I'm not sure it's the biggest deal in the world, plus
| real-world market presence data is regularly detailed in
| shareholder reports for various companies.
|
| You could take low-end instance specifications and standard
| industry stacks and extrapolate forward how many instances
| they might need to maintain at a maximum, but those numbers
| are going to be off.
|
| Are they running 400 low-end front-end instances across the
| globe? Probably (plus the 40 or so other services they claim
| to need, multiplied by region count at a minimum), and that
| would actually be well below realistic and reasonable for a
| company like Amazon. You can take a bunch of regional
| instances that handle roughly 200 req/s and make that work.
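
For anyone who wants to redo the extrapolation in the comment above, here is a minimal sketch of the arithmetic. It only uses the figures quoted in the thread (80k req/s total, roughly 200 req/s per low-end instance); the function name and the `headroom` parameter are hypothetical, and nothing here reflects how Amazon actually sizes its fleet.

```python
import math


def instances_needed(target_rps: float, per_instance_rps: float,
                     headroom: float = 0.0) -> int:
    """Rough instance count for a target request rate.

    headroom is an illustrative safety margin (0.25 means 25% spare
    capacity); it is an assumption, not a figure from the thread.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    # Round up: a fraction of an instance still means one more instance.
    return math.ceil(target_rps * (1 + headroom) / per_instance_rps)


# Figures quoted in the thread: 80k req/s, ~200 req/s per instance.
print(instances_needed(80_000, 200))        # 400
print(instances_needed(80_000, 200, 0.25))  # 500
```

The 400-instance guess in the comment is exactly 80,000 / 200, so it leaves no spare capacity for peaks or failures, which is one reason such estimates tend to come out low.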
| superfrank wrote:
| I used to work on a system that did about 55k/sec at peak. The
| service was internal; it was only handling gRPC calls which were
| coming from inside our VPC, and it was written in Go. Its main
| job was reading and writing to a SQL db that was sharded
| across 3 or 4 of the biggest instances AWS offered at the time
| (2017ish).
|
| Everything was Dockerized and I think we were using Docker
| Swarm for container orchestration. I don't remember the specs
| for each box, but we had auto scaling set up so at peak we'd
| hit a little over 200 containers.
|
| Looking back now, I'm sure we could have gotten much better
| performance out of that service, but the team was young and
| inexperienced and throwing money at the problem was an easier
| solution.
| hexo wrote:
| only 80k?
| aeturnum wrote:
| I'm pretty unclear on the "how" here - but from what I can
| understand in the article the search resilience team injected
| properly tagged synthetic traffic into their system to do
| testing? That does seem like the kind of practice that could be
| part of a healthy holistic approach - but the article elides a
| ton of details. I suppose the idea is that it promotes AWS
| services (with the idea of suggesting that this kind of
| resiliency comes easier on their platform) - but this is a great
| example of how good writing strips things down to the barest
| details. I would love to take lessons from it but I think the
| details actually aren't here.
| _a_a_a_ wrote:
| > I suppose the idea is that it promotes AWS services
|
| an advert in disguise then?
| catchnear4321 wrote:
| anything that uses the template "How X did Y to improve Z by
| METRIC" and is also hosted on company x domain is an ad.
|
| for that matter, any blog post by a company is an ad. maybe
| not to sell product, but to at least build exposure and
| familiarity with the brand.
| latchkey wrote:
| I feel like Amazon search is one of the worst products I've ever
| used.
| It is a clusterf/ck of paid advertisements and obviously
| gamed results. I don't care how many requests/sec you get. If the
| results are horrible, what does it matter?
| beebmam wrote:
| I'm sick of it being impossible to identify cheaply made
| products from high quality, durable products on Amazon. The
| rating system is flat out broken and there's an entire industry
| built around gaming those ratings.
|
| I'm at the point that I rarely ever buy products on Amazon
| anymore. It's a total disgrace. On an ethical level, I wish I
| had the ability to say "I only want to be presented with
| results that weren't made in China or other slave societies".
| digging wrote:
| These days it should be assumed that the quality of anything
| bought on Amazon is dogshit. Which, even though I canceled
| Prime years ago for ethical reasons, is honestly the biggest
| reason I won't even click an Amazon link.
| baz00 wrote:
| The trick is to find your products somewhere else and then
| look them up on Amazon for a price comparison.
| xnx wrote:
| Is there any alternative front end for Amazon? AI with some
| image-similarity smarts could do a much better job
| grouping/de-duping similar products.
| tkahnoski wrote:
| This would be awesome... When I'm in a rabbit hole trying to
| find a product and I see three or four of roughly the same
| design I know not to bother any further unless I can find a
| reliable manufacturer website.
| RockRobotRock wrote:
| Working as intended.
| waynesonfire wrote:
| source: https://www.wsj.com/articles/amazon-changed-search-algorithm...
| wombat-man wrote:
| 'chaos' is a great way to describe the results, to be fair.
| latchkey wrote:
| technically they said 'chaos engineering'... which obviously
| means they use monkeys slapping their hands on keyboards to
| write the code that returns the results.
| yazaddaruvala wrote:
| Disclaimer: Used to work at Amazon.
|
| Funny story, internally Amazon Search doesn't consider the ads
| products to be part of the "search results". It is tracked and
| accounted to Ads.
|
| The way Ads are handled on Amazon is really poorly done. The
| Ads teams claim to make a lot of money (and, based on the
| internal accounting tricks, they do), and as such have been
| pushing Amazon's leadership to go more into Ads, even though
| every person I've met that worked at Amazon also hated the
| prevalence of Ads.
|
| Literally, Directors and VPs at Amazon are afraid to step on
| the toes of Ads' leadership team because of how well they have
| told the story about "Ads is excessively profitable".
|
| Meanwhile, all of us in the thread can easily say, even if it
| is short term profitable, it most certainly is not long term
| profitable for Amazon.
|
| Both internally and externally it has been very
| disappointing to watch actually.
| acchow wrote:
| Sure the ads aren't great. But that's almost barely a problem
| compared to the gamed reviews
| SteveNuts wrote:
| > But that's almost barely a problem compared to the gamed
| reviews
|
| Which pales in comparison to the problem of counterfeit
| goods, IMO.
|
| I can at least somewhat comb through the reviews to look
| for outliers of well written reviews. Getting something
| that's obviously a fake (has happened to me multiple times)
| is completely unacceptable.
|
| Newegg has this issue too, I got a knock-off Intel CPU
| there once, I was furious.
| malfist wrote:
| You can complain about two things, you know. Just because
| search is bad doesn't mean reviews can't be bad too. Or
| <insert pet peeve>
| [deleted]
| truetraveller wrote:
| Surprised that 80K/second is called "massive" for Amazon.com's
| main search feature.
| rcme wrote:
| That's pretty crazy. At a big social media company, a service I
| ran got 300K+ requests per second directly from end users.
| yazaddaruvala wrote:
| Not all requests per second are made the same.
|
| For example, it is just as true for this title to have said
| "How Amazon uses ... to load 1.6 MM requests per second, from
| just the search page."
|
| Each search page load is 1 request to the search backend,
| but a 20x request fanout to the product's key-value store to
| render the images and titles, etc.
| srcreigh wrote:
| Google search is only 99k qps
| rrdharan wrote:
| Bigtable does 6B QPS though...
|
| https://cloud.google.com/blog/products/databases/youtube-run...
| nextworddev wrote:
| I'm sure there's massive variance and seasonality around that
| number
| tpmx wrote:
| From my experience with that scale of traffic (with Opera
| Mini):
|
| There is surprisingly little seasonal variance. You have
| your daily traffic rhythms based on when your users are
| awake/active based on their geographical distribution.
|
| "World events" also have very little impact - they tend to
| barely make a dent in that massive background noise.
| tpmx wrote:
| Huh.
|
| About a _decade ago_ Opera Mini did 150k transcoded full
| pageloads/s (times about 30 inlines per pageload that was
| the average back then, so about 4.5 million
| requested/loaded/processed/compressed HTTP resources/s).
|
| (All of the public Google Search numbers I've seen have
| seemed one or two orders of magnitude too small. Or maybe
| most people don't use their search engine/browser as much as
| I do, so my perspective is skewed...)
| Scalene2 wrote:
| Did the math, that's about 7 billion searches a day. That
| doesn't sound like a lot.
| WrongAssumption wrote:
| I mean, Google does 8.5 billion per day. What would be a lot
| to you?
| marginalia_nu wrote:
| To be fair, a lot of various URL bars and input fields turn
| other activities into implicit Google queries.
| umpalumpaaa wrote:
| There are 8 billion people on the planet.
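
The figures traded in the comments above are easy to cross-check. The sketch below just redoes the thread's own arithmetic (80k req/s, 20x fanout, 150k pageloads/s at about 30 inlines each, 8.5 billion Google searches per day); the function names are made up for illustration, and the inputs are the commenters' numbers, not verified measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(requests_per_second: float) -> float:
    """Convert a sustained per-second rate into a daily total."""
    return requests_per_second * SECONDS_PER_DAY


def per_second(requests_per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


def fanout(front_end_rps: float, backend_calls_per_request: float) -> float:
    """Backend request rate implied by a front-end rate and a fanout factor."""
    return front_end_rps * backend_calls_per_request


print(per_day(80_000))               # 6,912,000,000 searches/day ("about 7 billion")
print(fanout(80_000, 20))            # 1,600,000 req/s (the "1.6 MM" figure)
print(fanout(150_000, 30))           # 4,500,000 resources/s (Opera Mini)
print(round(per_second(8.5e9)))      # average Google queries/s, roughly 98k
```

Note that the last line is a daily average, so it says nothing about peak load, while the 80k figure in the article title may well be a peak.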
| andrewmcwatters wrote:
| It dawned on me that in web software, people talk about req/s
| from two entirely different perspectives and it's borderline
| fraud:
|
| req/s from localhost to localhost, and req/s from the Internet
| to any user.
|
| The latter is actually interesting. People saying you can get
| 10k req/s from Node.js is stupid. You're not actually getting
| that on say, a single low-end instance over the Internet, which
| is what most developers are actually going to do.
|
| Instead, you'll get two orders of magnitude fewer requests per
| second.
|
| What Amazon is talking about here is most likely non-synthetic,
| real-world 80k requests per second. Which is actually a decent
| job.
| adamckay wrote:
| > People saying you can get 10k req/s from Node.js is stupid.
|
| No, it's not, for exactly the reason you state:
|
| > You're not actually getting that on say, a single low-end
| instance over the Internet
|
| Some languages are, of course, more efficient, but it doesn't
| matter - you can get very good performance out of any
| language/runtime - it's all about your architecture and
| infrastructure.
| dlisboa wrote:
| > Some languages are, of course, more efficient, but it
| doesn't matter - you can get very good performance out of
| any language/runtime - it's all about your architecture and
| infrastructure.
|
| That's more or less true for some value of good
| performance, but what's seldom talked about is how some
| languages take you down a path of inefficiency through
| idiomatic code. You can write efficient code, but you end
| up fighting what the language gives you.
|
| Efficient languages are so because they either forbid or
| make it very awkward to write inefficient code, with the
| trade-off being fewer domain model abstractions
| (abstractions closer to the metal than to the ubiquitous
| language).
|
| They also force the programmer to think about resources
| first hand, before writing anything.
| So you get a lot more
| leeway to make mistakes in architecture and infra, which is
| often the costliest part of software design.
| [deleted]
| dr-detroit wrote:
| [dead]
___________________________________________________________________
(page generated 2023-09-05 23:00 UTC)