[HN Gopher] How Amazon uses chaos engineering to handle 80k requ...
       ___________________________________________________________________
        
       How Amazon uses chaos engineering to handle 80k requests per second
        
       Author : jbredeche
       Score  : 53 points
       Date   : 2023-09-05 20:39 UTC (2 hours ago)
        
 (HTM) web link (community.aws)
 (TXT) w3m dump (community.aws)
        
       | andrewmcwatters wrote:
       | I would love to know what software stack, hardware, and uplink
       | connections in total they utilize to accomplish a real-world 80k
       | request per second throughput. How many instances do you guys
       | think Amazon runs for its primary e-commerce front-end stack? In
       | total, and per region? Assuming they have a multi-region rollout.
       | 
       | If it's the real deal, and not like people saying "Bun.js can
       | serve 65k req/s+ (_cough_ _cough_ to localhost)," that's
       | impressive.
       | 
       | But I never see anyone talk about real-world numbers. Just
       | synthetic poopoo.
       | 
       | I think I read the article correctly, but I think it only talks
       | about how they introduce "chaos engineering"; I didn't recall
       | them talking about how they _actually_ handle a volume of traffic
       | like 80k req/s.
        
         | tacozilla wrote:
         | Amazon doesn't really have a "primary e-commerce front-end
         | stack" in any concrete sense. They have hundreds/thousands of
         | teams that deploy bits and pieces to a massive pipeline that
         | ultimately makes up what you see on Amazon.com, but each team
         | can have their own infrastructure backing things. Some teams
         | might run everything off a dozen low-end EC2 instances while
         | another sibling team has 3k+ instances; it's really all over
         | the place, and that's ignoring specific events like Black
         | Friday or Prime Day, etc. where teams need to prescale things
         | in advance.
         | 
         | Source: Worked at Amazon/AWS for almost a decade.
        
         | baby_souffle wrote:
         | > I would love to know what software stack, hardware, and
         | uplink connections in total they utilize to accomplish a
         | real-world 80k request per second throughput. How many
         | instances do you guys think Amazon runs for its primary
         | e-commerce front-end stack? In total, and per region? Assuming
         | they have a multi-region rollout.
         | 
         | > But I never see anyone talk about real-world numbers. Just
         | synthetic poopoo.
         | 
         | The number probably changes all the time based on load. They'll
         | never release these details because it's a competitive
         | advantage to have the "how popular are they in $place at $time-
         | of-day" data private.
         | 
         | When they do share numbers, it'll always be the most flattering
         | and devoid of any context beyond the "wow" factor.
        
           | tacozilla wrote:
           | While I can understand the cynicism, the real answer is a lot
           | closer to something much more boring, which is most people
           | just don't care about the actual numbers, and if they were to
           | release them, while interesting to a small few, generally no
           | one would actually care.
           | 
           | There's also a common misconception that Amazon.com is
           | somehow just this one giant app running on a set of servers,
           | which isn't remotely how it's actually deployed, and that's
           | before we spend time arguing whether a team's instances even
           | count as "primary e-commerce front-end stack" or not. :P
        
           | andrewmcwatters wrote:
           | I'm not sure it's the biggest deal in the world, plus real-
           | world market presence data is regularly detailed in
           | shareholder reports for various companies.
           | 
           | You could take low-end instance specifications and standard
           | industry stacks and extrapolate forward how many instances
           | they might need to maintain at a maximum, but those numbers
           | are going to be off.
           | 
           | Are they running 400 low-end front-end instances across the
           | globe? Probably (plus the 40 or so other services they claim
           | to need, multiplied by region count at a minimum), and that
           | would actually be well below realistic and reasonable for a
           | company like Amazon. You can take a bunch of regional
           | instances that handle roughly 200 req/s and make that work.
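
The back-of-envelope estimate above (80k req/s served by instances that each handle roughly 200 req/s) can be sketched as follows. This is only an illustration of the arithmetic, not a description of how Amazon sizes anything; the function name and the `headroom` parameter are assumptions introduced here.

```python
import math


def instances_needed(target_rps: float, per_instance_rps: float,
                     headroom: float = 0.0) -> int:
    """Instances required to serve target_rps.

    headroom is a hypothetical safety margin expressed as a fraction
    (0.5 means provisioning for 50% more traffic than the target).
    """
    return math.ceil(target_rps * (1.0 + headroom) / per_instance_rps)


# Figures taken from the comment: 80k req/s total, ~200 req/s per instance.
baseline = instances_needed(80_000, 200)

# Assumed 50% headroom, purely for illustration.
with_headroom = instances_needed(80_000, 200, headroom=0.5)

print(baseline, with_headroom)
```

With the thread's numbers the baseline works out to the 400 instances mentioned in the comment. Any real deployment would also have to account for the other services and regions the comment alludes to.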
        
         | superfrank wrote:
         | I used to work on a system that did about 55k/sec at peak. The
         | service was internal; it only handled gRPC calls coming from
         | inside our VPC, and it was written in Go. Its main job was
         | reading and writing to a SQL db that was sharded across 3 or 4
         | of the biggest instances AWS offered at the time (2017ish).
         | 
         | Everything was Dockerized and I think we were using Docker
         | Swarm for container orchestration. I don't remember the specs
         | for each box, but we had auto scaling set up so at peak we'd
         | hit a little over 200 containers.
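
For a rough sense of scale, the figures in this comment imply a per-container rate. The sketch below assumes load was spread evenly across containers, which the comment does not state, and since the peak was "a little over 200" containers the result is an upper bound.

```python
# Figures from the comment: ~55k req/s at peak, a little over 200 containers.
peak_rps = 55_000
containers_at_peak = 200  # lower bound on the container count

# Assumes an even spread of requests across containers (not stated in the
# comment), so this is an upper bound on the per-container rate.
per_container_rps = peak_rps / containers_at_peak

print(per_container_rps)
```

That lands in the same range as the roughly 200 req/s per instance guessed elsewhere in the thread.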
         | 
         | Looking back now, I'm sure we could have gotten much better
         | performance out of that service, but the team was young and
         | inexperienced and throwing money at the problem was an easier
         | solution.
        
       | hexo wrote:
       | only 80k?
        
       | aeturnum wrote:
       | I'm pretty unclear on the "how" here - but from what I can
       | understand in the article the search resilience team injected
       | properly tagged synthetic traffic into their system to do
       | testing? That does seem like the kind of practice that could be
       | part of a healthy holistic approach - but the article elides a ton
       | of details. I suppose the idea is that it promotes AWS services
       | (with the idea of suggesting that this kind of resiliency comes
       | easier on their platform) - but this is a great example of how
       | good writing strips things down to the barest details. I would
       | love to take lessons from it but I think the details actually
       | aren't here.
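
Since the article reportedly leaves out the details, here is a minimal sketch of what "properly tagged synthetic traffic" could look like in general. Nothing here comes from the article or from Amazon: the header name `x-synthetic-test`, the `Request` type, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical header name; the article does not say how requests are tagged.
SYNTHETIC_HEADER = "x-synthetic-test"


@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)


def make_synthetic(path: str) -> Request:
    """Build a test request that carries the synthetic tag."""
    return Request(path, {SYNTHETIC_HEADER: "true"})


def is_synthetic(req: Request) -> bool:
    """True if the request carries the synthetic tag."""
    return req.headers.get(SYNTHETIC_HEADER) == "true"


def count_by_origin(requests) -> dict:
    """Split traffic counts so test load can be kept out of business metrics."""
    counts = {"real": 0, "synthetic": 0}
    for req in requests:
        counts["synthetic" if is_synthetic(req) else "real"] += 1
    return counts


traffic = [
    Request("/search?q=shoes"),
    make_synthetic("/search?q=probe"),
    Request("/search?q=books"),
]
counts = count_by_origin(traffic)
print(counts)
```

The point of the tag is that downstream services can recognise test requests and treat them differently, for example by excluding them from metrics or order flows, while still exercising the same code paths.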
        
         | _a_a_a_ wrote:
         | > I suppose the idea is that it promotes AWS services
         | 
         | an advert in disguise then?
        
           | catchnear4321 wrote:
           | anything that uses the template "How X did Y to improve Z by
           | METRIC" and is also hosted on company x's domain is an ad.
           | 
           | for that matter, any blog post by a company is an ad. maybe
           | not to sell product, but to at least build exposure and
           | familiarity with the brand.
        
       | latchkey wrote:
       | I feel like Amazon search is one of the worst products I've ever
       | used. It is a clusterf/ck of paid advertisements and obviously
       | gamed results. I don't care how many requests/sec you get. If the
       | results are horrible, what does it matter?
        
         | beebmam wrote:
         | I'm sick of it being impossible to distinguish cheaply made
         | products from high quality, durable products on Amazon. The
         | rating system is flat out broken and there's an entire industry
         | built around gaming those ratings.
         | 
         | I'm at the point that I rarely ever buy products on Amazon
         | anymore. It's a total disgrace. On an ethical level, I wish I
         | had the ability to say "I only want to be presented with
         | results that weren't made in China or other slave societies".
        
           | digging wrote:
           | These days it should be assumed that the quality of anything
           | bought on Amazon is dogshit. Which, even though I canceled
           | Prime years ago for ethical reasons, is honestly the biggest
           | reason I won't even click an Amazon link.
        
           | baz00 wrote:
           | The trick is to find your products somewhere else and then
           | look them up on Amazon for a price comparison.
        
         | xnx wrote:
         | Is there any alternative front end for Amazon? AI with some
         | image-similarity smarts could do a much better job grouping/de-
         | duping similar products.
        
           | tkahnoski wrote:
           | This would be awesome... When I'm in a rabbit hole trying to
           | find a product and I see three or four of roughly the same
           | design I know not to bother any further unless I can find a
           | reliable manufacturer website.
        
         | RockRobotRock wrote:
         | Working as intended.
        
           | waynesonfire wrote:
           | source: https://www.wsj.com/articles/amazon-changed-search-
           | algorithm...
        
         | wombat-man wrote:
         | 'chaos' is a great way to describe the results, to be fair.
        
           | latchkey wrote:
           | technically they said 'chaos engineering'... which obviously
           | means they use monkeys slapping their hands on keyboards to
           | write the code that returns the results.
        
         | yazaddaruvala wrote:
         | Disclaimer: Used to work at Amazon.
         | 
         | Funny story, internally Amazon Search doesn't consider the ads
         | products to be part of the "search results". It is tracked and
         | accounted to Ads.
         | 
         | The way Ads are handled on Amazon is really poorly done. The
         | Ads teams claim to make a lot of money (and based on the
         | internal accounting tricks they do), and as such have been
         | pushing Amazon's leadership to go more into Ads, even tho every
         | person I've met that worked at Amazon also hated the prevalence
         | of Ads.
         | 
         | Literally, Directors and VPs at Amazon are afraid to step on
         | the toes of Ads' leadership team because of how well they have
         | told the story about "Ads is excessively profitable".
         | 
         | Meanwhile, all of us in the thread can easily say, even if it
         | is short term profitable, it most certainly is not long term
         | profitable for Amazon.
         | 
         | Both internally and externally it has been very
         | disappointing to watch actually.
        
           | acchow wrote:
           | Sure the ads aren't great. But that's almost barely a problem
           | compared to the gamed reviews
        
             | SteveNuts wrote:
             | > But that's almost barely a problem compared to the gamed
             | reviews
             | 
             | Which pales in comparison to the problem of counterfeit
             | goods, IMO.
             | 
             | I can at least somewhat comb through the reviews to look
             | for outliers of well written reviews. Getting something
             | that's obviously a fake (has happened to me multiple times)
             | is completely unacceptable.
             | 
             | Newegg has this issue too, I got a knock-off Intel CPU
             | there once, I was furious.
        
             | malfist wrote:
             | You can complain about two things, you know. Just because
             | search is bad doesn't mean reviews can't be bad too. Or
             | <insert pet peeve>
        
         | [deleted]
        
       | truetraveller wrote:
       | Surprised that 80K/second is called "massive" for Amazon.com's
       | main search feature.
        
         | rcme wrote:
         | That's pretty crazy. At a big social media company, a service I
         | ran got 300K+ requests per second directly from end users.
        
           | yazaddaruvala wrote:
           | Not all requests per second are made the same.
           | 
           | For example, it is just as true for this title to have said
           | "How Amazon uses ... to load 1.6 million requests per second,
           | from just the search page."
           | 
           | Each search page load is 1 request to the search backend,
           | but a 20x request fanout to the product's key-value store to
           | render the images and titles, etc.
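
The fanout arithmetic in this comment can be written out explicitly. A small sketch, where the function name is an assumption and the 80k and 20x figures come from the thread:

```python
def backend_rps(page_loads_per_sec: int, fanout_per_page: int) -> int:
    """Requests per second reaching a downstream store, given the fanout."""
    return page_loads_per_sec * fanout_per_page


search_rps = 80_000   # search page loads per second (from the title)
fanout = 20           # key-value lookups per page load (from the comment)

kv_store_rps = backend_rps(search_rps, fanout)
print(kv_store_rps)
```

This is why the same system can honestly be described by two very different "requests per second" numbers, depending on which layer is being counted.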
        
         | srcreigh wrote:
         | Google search is only 99k qps
        
           | rrdharan wrote:
           | Bigtable does 6B QPS though...
           | 
           | https://cloud.google.com/blog/products/databases/youtube-
           | run...
        
           | nextworddev wrote:
           | I'm sure there's massive variance and seasonality around that
           | number
        
             | tpmx wrote:
             | From my experience with that scale of traffic (with Opera
             | Mini):
             | 
             | There is surprisingly little seasonal variance. You have
             | your daily traffic rhythms based on when your users are
             | awake/active based on their geographical distribution.
             | 
             | "World events" also have very little impact - they tend to
             | barely make a dent in that massive background noise.
        
           | tpmx wrote:
           | Huh.
           | 
           | About a _decade ago_ Opera Mini did 150k transcoded full
           | pageloads/s (times about 30 inlines per pageload, which was
           | the average back then, so about 4.5 million
           | requested/loaded/processed/compressed HTTP resources/s).
           | 
           | (All of the public Google Search numbers I've seen have
           | seemed one or two orders of magnitude too small. Or maybe
           | most people don't use their search engine/browser as much as
           | I do, so my perspective is skewed...)
        
         | Scalene2 wrote:
         | Did the math, that's about 7 billion searches a day. That
         | doesn't sound like a lot.
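
The conversion behind "about 7 billion a day" is sketched below, along with the reverse conversion for the 8.5 billion per day figure quoted for Google elsewhere in the thread. It assumes the rate is sustained for a full 24 hours; if 80k req/s is a peak rather than an average, the daily total would be lower.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_day(requests_per_second: float) -> float:
    """Daily total, assuming the rate is held constant for 24 hours."""
    return requests_per_second * SECONDS_PER_DAY


def per_second(requests_per_day: float) -> float:
    """Average rate implied by a daily total."""
    return requests_per_day / SECONDS_PER_DAY


amazon_daily = per_day(80_000)   # 6,912,000,000, i.e. about 7 billion
google_qps = per_second(8.5e9)   # roughly 98k queries per second

print(amazon_daily, round(google_qps))
```

The second figure lines up with the "99k qps" quoted for Google search in a sibling comment.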
        
           | WrongAssumption wrote:
           | I mean, google does 8.5 billion per day. What would be a lot
           | to you?
        
             | marginalia_nu wrote:
             | To be fair, a lot of various URL bars and input fields turn
             | other activities into implicit Google queries.
        
           | umpalumpaaa wrote:
           | There are 8 billion people on the planet.
        
         | andrewmcwatters wrote:
         | It dawned on me that in web software, people talk about req/s
         | from two entirely different perspectives and it's borderline
         | fraud:
         | 
         | req/s from localhost to localhost, and req/s from the Internet
         | to any user.
         | 
         | The latter is actually interesting. People saying you can get
         | 10k req/s from Node.js is stupid. You're not actually getting
         | that on say, a single low-end instance over the Internet, which
         | is what most developers are actually going to do.
         | 
         | Instead, you'll get two orders of magnitude fewer requests per
         | second.
         | 
         | What Amazon is talking about here is most likely non-synthetic,
         | real-world 80k requests per second. Which is actually a decent
         | job.
        
           | adamckay wrote:
           | > People saying you can get 10k req/s from Node.js is stupid.
           | 
           | No, it's not, for exactly the reason you state:
           | 
           | > You're not actually getting that on say, a single low-end
           | instance over the Internet
           | 
           | Some languages are, of course, more efficient, but it doesn't
           | matter - you can get very good performance out of any
           | language/runtime - it's all about your architecture and
           | infrastructure.
        
             | dlisboa wrote:
             | > Some languages are, of course, more efficient, but it
             | doesn't matter - you can get very good performance out of
             | any language/runtime - it's all about your architecture and
             | infrastructure.
             | 
             | That's more or less true for some value of good
             | performance, but what's seldom talked about is how some
             | languages take you down a path of inefficiency through
             | idiomatic code. You can write efficient code, but you end
             | up fighting what the language gives you.
             | 
             | Efficient languages are so because they either forbid or
             | make it very awkward to write inefficient code, with the
             | trade-off being fewer domain model abstractions
             | (abstractions closer to the metal than to the ubiquitous
             | language).
             | 
             | They also force the programmer to think about resources
             | first hand, before writing anything. So you get a lot more
             | leeway to make mistakes in architecture and infra, which is
             | often the costliest part of software design.
        
         | [deleted]
        
       | dr-detroit wrote:
       | [dead]
        
       ___________________________________________________________________
       (page generated 2023-09-05 23:00 UTC)