[HN Gopher] Summary of the Amazon Kinesis Event in the Northern ...
       ___________________________________________________________________
        
       Summary of the Amazon Kinesis Event in the Northern Virginia (US-
       East-1) Region
        
       Author : codesparkle
       Score  : 263 points
       Date   : 2020-11-28 07:51 UTC (15 hours ago)
        
 (HTM) web link (aws.amazon.com)
 (TXT) w3m dump (aws.amazon.com)
        
       | tmk1108 wrote:
        | How does the architecture of Kinesis compare to Kafka? If you
        | scale up the number of Kafka brokers, can you hit a similar
        | problem? Or does Kafka not rely on creating a thread to connect
        | to every other broker?
        
         | aloknnikhil wrote:
         | Kafka uses a thread pool for request processing. Both the
         | brokers and the consumer clients use the same request
         | processing loop.
         | 
         | This goes a bit more in-depth:
         | https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafk...
        
       | ipsocannibal wrote:
        | So the cause of the outage boils down to not having a metric on
        | total file descriptors with an alarm if usage gets within 10%
        | of the max, and a faulty scaling plan that should have said
        | "for every N backend hosts we add, we must add X frontend
        | hosts". One metric and a couple of lines in a wiki could have
        | saved Amazon what is probably millions in outage-related costs.
        | One wonders if Amazon retail will start hedging its bets and go
        | multicloud to prevent impacts on retail customers from AWS
        | LSEs.
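        | 
        | A minimal sketch of the kind of check described above (file
        | descriptor usage vs. the system-wide max, assuming a Linux
        | host; the alarm hook is hypothetical, not anything AWS runs):
        | 
        |     import java.nio.file.Files;
        |     import java.nio.file.Path;
        | 
        |     public class FdUsageCheck {
        |       public static void main(String[] args) throws Exception {
        |         // /proc/sys/fs/file-nr: allocated, unused, max
        |         String[] parts = Files
        |             .readString(Path.of("/proc/sys/fs/file-nr"))
        |             .trim().split("\\s+");
        |         long allocated = Long.parseLong(parts[0]);
        |         long max = Long.parseLong(parts[2]);
        |         double usage = (double) allocated / max;
        |         System.out.printf("fd usage: %d/%d (%.0f%%)%n",
        |                           allocated, max, usage * 100);
        |         if (usage > 0.90) {
        |           // emit a metric / page the on-call here
        |           System.err.println("ALARM: within 10% of fd limit");
        |         }
        |       }
        |     }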
        
       | lytigas wrote:
       | > During the early part of this event, we were unable to update
       | the Service Health Dashboard because the tool we use to post
       | these updates itself uses Cognito, which was impacted by this
       | event.
       | 
       | Poetry.
       | 
       | Then, to be fair:
       | 
       | > We have a back-up means of updating the Service Health
       | Dashboard that has minimal service dependencies. While this
       | worked as expected, we encountered several delays during the
       | earlier part of the event in posting to the Service Health
       | Dashboard with this tool, as it is a more manual and less
       | familiar tool for our support operators. To ensure customers were
       | getting timely updates, the support team used the Personal Health
       | Dashboard to notify impacted customers if they were impacted by
       | the service issues.
       | 
       | I'm curious if anyone here actually got one of these.
        
         | ufmace wrote:
         | My employer is a pretty big spender with AWS. I didn't hear
         | anything about anybody getting status updates from a "Personal
         | Health Dashboard" or anywhere else. I can't be 100% sure such
         | an update would have made its way to me, but given the amount
         | of buzzing, it's hard to believe that somebody had info like
         | that and didn't share it.
        
         | loriverkutya wrote:
         | I can confirm we got the Personal Health Dashboard
         | notifications.
        
         | newscom59 wrote:
         | The PHD is _always_ updated first, long before the global
         | status page is updated. Every single one of my clients that use
         | AWS got updates on the PHD literally hours before the status
         | page was even showing any issues, which is typical. It's the
         | entire point of the PHD.
         | 
         | Through reading Reddit and HN during this event I learned that
         | most people apparently aren't even aware of the existence of
         | the PHD and rely solely on the global status page, despite the
         | fact that there is a giant "View my PHD" button at the very top
         | of the global status page, and additionally there is a
         | notification icon on the header of every AWS console page that
         | lights up and links you directly to the PHD whenever there is
         | an issue.
         | 
         | The PHD is always where you should look first. It is, by
         | design, updated long before the global status page is.
        
         | mwarkentin wrote:
         | Yes, we had some messages coming through in our PHD.
        
         | 0x11 wrote:
          | I can't say for sure that the company I work for didn't, but
          | it certainly didn't make its way to me and there are only 8
          | of us.
        
         | vishnugupta wrote:
          | This wouldn't be the first time. The status page was hosted
          | in S3. It's hilarious in hindsight, but understandable.
        
           | capableweb wrote:
           | > but understandable
           | 
            | Is it really? I get the value of eating your own dogfood;
            | it improves things a lot.
            | 
            | But your status page? It's such a high-importance, low-
            | difficulty thing to build that dogfooding it gives you only
            | a small benefit in the good case (dogfood something bigger
            | and more complex instead), and a big drawback when things
            | go wrong (your infrastructure goes down, and so does your
            | status page). So what's the point?
        
             | KingOfCoders wrote:
             | Arrogance.
        
             | tpetry wrote:
              | I can really imagine what happened: the engineer wants to
              | host the dashboard at a different provider for
              | resilience. The manager argues that they can't do this,
              | it would be embarrassing if anybody found out. And why
              | choose another provider? AWS has multiple AZs and can't
              | be down everywhere at the same moment. The engineer then
              | says "fu__ it" and just builds it on a single solution.
        
       | freeone3000 wrote:
       | The failure to update the Service Health Dashboard was due to
       | reliance on internal services to update. This also happened in
       | March 2017[0]. Perhaps a general, instead of piecemeal, approach
       | to removing dependencies on running services from the dashboard
       | would be valuable here?
       | 
       | 0: https://aws.amazon.com/message/41926/
        
       | temp0826 wrote:
        | us-east-1 is AWS's dirty secret. If DDB had gone down there,
        | there would likely have been a worldwide, multi-service
        | interruption.
        
       | fafner wrote:
       | From the summary I don't understand why front end servers need to
       | talk to each other ("continuous processing of messages from other
       | Kinesis front-end servers"). It sounds like this is part of
        | building the shard map or the cache. Well, in the end, an
        | unfortunate design decision. #hugops for the team handling
        | this.
       | Cascading failures are the worst.
        
       | hintymad wrote:
       | A tangential question, why would AWS even use the term
       | "microservice"? A service is a service, right? I'm not sure what
       | the term "microservice" signifies here.
        
         | arduinomancer wrote:
         | It's because service can be confused with "AWS Service" which
         | is not the same as a microservice (a component of a full
         | service)
        
       | londons_explore wrote:
        | One requirement on my "production ready" checklist is that any
        | catastrophic system failure can be resolved by starting a
        | completely new instance of the service and having it ready to
        | serve traffic within 10 minutes.
       | 
       | That should be tested at least quarterly (but preferably
       | automatically with every build).
       | 
        | If Amazon had done that, this outage would have been reduced to
        | 10 minutes, rather than the 12+ hours that some super slow
        | rolling restarts took...
        
         | WJW wrote:
         | Kinesis probably runs well over 100k instances. Restarting it
         | might not be so trivial that you can do it in 10 minutes.
        
         | why-el wrote:
         | The same OS limits would apply to new instances, unless they
         | knew the root cause and forced new instances to be configured
         | with larger descriptor limits, which is....well, hindsight is
         | 20/20, no?
        
         | WookieRushing wrote:
         | This only works for stateless services. If you've got frontends
         | that take longer than 10 mins to serve traffic then you have a
         | problem.
         | 
         | But if you're running a DB or a storage system, 10 mins is a
         | blink of an eye. Storage systems in particular can run a few
         | hundred TB per node and moving that data to another node can
         | take over an hour.
         | 
          | In this case, the frontends have a shard map, which is
          | definitely not stateless. This is typically okay if you have
          | a fast load operation which blocks other traffic until the
          | shard map is fully loaded.
        
           | londons_explore wrote:
           | It's possible (albeit much harder) for stateful services too.
           | 
           | It basically boils down to "We must be able to restore the
           | minimum necessary parts of a full backup in under 10
           | minutes".
           | 
            | Take Wikipedia as an example. I'd expect them to be able to
            | restore a backup of the latest version of all pages in 10
            | minutes. It's 20GB of data, and I assume it's sharded at
            | least 10 ways. That means each instance has to grab 2GB
            | from the backups in 10 minutes, about 3.5 MB/s sustained.
            | Very doable.
           | 
           | As a service gets bigger, you typically scale horizontally,
           | so the problem doesn't get harder.
           | 
            | Restoring all the old page versions and re-enabling editing
            | might take longer, but that's less critical functionality.
        
       | tnolet wrote:
        | This is a pretty damn decent post mortem so soon after the
        | outage. It also gives an architectural analysis of how Kinesis
        | works, which is something they did not have to do at all.
        
       | ignoramous wrote:
       | root-cause tldr:
       | 
       |  _...[adding] new capacity [to the front-end fleet] had caused
       | all of the servers in the [front-end] fleet to exceed the maximum
       | number of threads allowed by an operating system configuration
       | [number of threads spawned is directly proportional to number of
       | servers in the fleet]. As this limit was being exceeded, cache
       | construction was failing to complete and front-end servers were
       | ending up with useless shard-maps that left them unable to route
       | requests to back-end clusters._
       | 
       | fixes:
       | 
       | ...moving to larger CPU and memory servers [and thus fewer front-
       | end servers]. Having fewer servers means that each server
       | maintains fewer threads.
       | 
       | ...making a number of changes to radically improve the cold-start
       | time for the front-end fleet.
       | 
       | ...moving the front-end server [shard-map] cache [that takes a
       | long time to build, up to an hour sometimes?] to a dedicated
       | fleet.
       | 
       | ...move a few large AWS services, like CloudWatch, to a separate,
       | partitioned front-end fleet.
       | 
       | ...accelerate the cellularization [0] of the front-end fleet to
       | match what we've done with the back-end.
       | 
       | [0] https://www.youtube.com/watch?v=swQbA4zub20 and
       | https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...
        
         | frankietaylr wrote:
         | I wonder how many of them are already logged engineering tasks
         | which never got prioritized because of the aggressive push to
         | add features.
        
       | bithavoc wrote:
       | They're calling it an "Event", title should say "Summary of the
       | Amazon Kinesis Outage..."
        
       | pps43 wrote:
       | > the new capacity had caused all of the servers in the fleet to
       | exceed the maximum number of threads allowed by an operating
       | system configuration. [...] We didn't want to increase the
       | operating system limit without further testing
       | 
       | Is it because operating system configuration is managed by a
       | different team within the organization?
        
         | mcqueenjordan wrote:
         | Nope. It's just a case of "stop the bleeding before starting
         | the surgery."
        
         | sitharus wrote:
         | More likely they need to understand what effect changing the
         | thread limit would have - for example it could increase kernel
         | memory usage or increase scheduler latency. It's not something
         | you want to mess with in an outage.
        
         | sudhirj wrote:
         | I've heard AWS follows a you build it, you run it policy, so
         | that seems unlikely. Just seems prudent to not mess with OS
         | settings in a hurry.
        
         | Androider wrote:
         | If you start haphazardly changing things while firefighting
          | without testing, you might make things even worse. And there
          | are worse things than downtime, for instance if the system
          | appears to work but you're actually silently corrupting
          | customer data.
        
       | joneholland wrote:
        | Running out of file handles or hitting other IO limits is
        | embarrassing and happens at every company, but I'm surprised
        | that AWS was not monitoring this.
       | 
        | I'm also surprised at the general architecture of Kinesis. What
        | appears to be their own hand rolled gossip protocol (that is
        | clearly terrible compared to raft or paxos, a thread per cluster
        | member? Everyone talking to everyone? An hour to reach
        | consensus?) and the front end servers being stateful, period,
        | breaks a lot of good design choices.
       | 
       | The problem with growing as fast as Amazon has is that their
       | talent bar couldn't keep up. I can't imagine this design being
       | okay 10 years ago when I was there.
        
         | marcinzm wrote:
         | I don't think it's about growing fast so much as, from those I
         | talked to, Amazon now has a fairly bad reputation in the tech
         | community. You only go to work there if you don't have a better
         | option (Google, Facebook, etc) or have some specialty skill
         | they're willing to pay for. Pay is below other FAANG companies
         | and the work culture isn't great (toxic even some would say).
         | 
          | edit: They also had the most disorganized and decentralized
          | interview approach of all the FAANG companies I talked with,
          | which isn't growing pains this far in, it's just bad
          | management and process.
        
           | ActorNightly wrote:
           | Just as a general reminder to anyone reading this: forum
           | comments are incredibly biased and hardly ever represent
           | reality accurately.
        
           | imajoredinecon wrote:
           | Interesting re interview experience
           | 
           | I interviewed as a new grad SWE and the process was totally
           | straightforward, and way lower friction (albeit much less
           | human interaction, which made it feel even more impersonal)
           | than almost everywhere else I applied: initial online screen,
           | online programming task, and then a video call with an
           | engineer where you explained your answer to the programming
           | task.
        
             | marcinzm wrote:
             | I was doing machine learning so more specialized than
              | regular SDE. At other companies it was a talk with the
              | recruiter, a phone screen with the manager, and then
              | virtual onsite interviews. Hiring was either not team
              | specific or the recruiter helped manage the process (ie:
              | what does this role actually need). Very clear directions
              | on what type of questions would be asked, the format of
              | the interviews, what to prepare for, etc. At Amazon the
              | recruiter just told me to look on the job site and then,
              | despite me being clear, applied me to the wrong role.
              | Then I got one of those automated coding exercises
              | despite 15 years of experience and an internal referral.
              | It wasn't hard, but it was also pointless since the
              | coding exercise was for the wrong role. I finally got a
              | phone screen and they asked me nothing but pedantic
              | college textbook questions for 40 minutes. The recruiter
              | provided no warning for that.
             | 
             | edit: You could blame the recruiter but every other company
             | had a well oiled machine for their recruiters. So even if
             | they provided only generic information there was still a
             | standard process for what they provided.
        
             | akhilcacharya wrote:
             | The process for new grads and interns is different from
             | industry hires and is decided by team.
        
           | nixass wrote:
           | Very anecdotal
        
         | hedora wrote:
         | I've noticed a strong tendency for older systems to accumulate
         | "spaghetti architecture", where newer employees add new
         | subsystems and tenured employees are blind to the early design
          | mistakes they made. For instance, in this system, it sounds
          | like they added a complicated health check mechanism at some
          | point to bounce faulty nodes.
         | 
         | Now, they don't know how it behaves, so they're afraid to take
         | corrective actions in production.
         | 
         | They built that before ensuring that they logged the result of
         | each failed system call. The prioritization seems odd, but most
         | places look at logging as a cost center, and the work of
         | improving it as drudgery, even though it's far more important
         | than shiny things like automatic response to failures, and also
         | takes a person with more experience to do properly.
        
         | karmakaze wrote:
         | Kinesis was the worst AWS tech I've ever used. Splitting a
         | stream into shards doesn't increase throughput if you still
         | need to run the same types/number of consumers on every shard.
         | The suggested workaround at the time was to use larger batches
         | and add latency to the processing pipeline.
        
         | jen20 wrote:
         | > What appears to be their own hand rolled gossip protocol
         | (that is clearly terrible compared to raft or paxos)
         | 
         | Raft and Paxos are not gossip protocols - they are consensus
         | protocols.
        
           | joneholland wrote:
           | Fair. What I meant to say is "hand rolling a way to have
           | consensus on a shared piece of data" by implementing it with
           | a naive gossip system.
        
         | justicezyx wrote:
         | I led the storage engine prototyping for Kinesis in 2012 (the
         | best time in my career so far).
         | 
          | Kinesis uses Chain Replication [1], a dead simple
          | fault-tolerant storage algorithm: machines form a chain, data
          | flows from head to tail in one direction, writes always start
          | at the head and reads happen at the tail, new nodes always
          | join at the tail, but nodes can be kicked out at any
          | position.
         | 
          | The membership management of chain nodes is done through a
          | Paxos-based consensus service like Chubby or ZooKeeper. Allan
          | [2] (the best engineer I have personally worked with so far,
          | way better than anyone else I've encountered) wrote that
          | system. The quality of the Java code shows itself at first
          | glance. Not to mention his humbleness and openness in sharing
          | his knowledge during early design meetings.
         | 
          | I am not sure what protocol is actually used now. But I would
          | be surprised if it's different, given the protocol's
          | simplicity and performance.
         | 
         | [1] https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf [2]
         | https://www.linkedin.com/in/allan-vermeulen-58835b/
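          | 
          | For anyone unfamiliar with it, a toy sketch of the chain-
          | replication data path (illustrative only, not Kinesis code):
          | 
          |     import java.util.HashMap;
          |     import java.util.Map;
          | 
          |     // Writes enter at the head and are forwarded node by
          |     // node to the tail; reads are served by the tail, so a
          |     // read only sees data the whole chain has stored.
          |     class ChainNode {
          |       private final Map<String, String> kv = new HashMap<>();
          |       private ChainNode next;   // null at the tail
          | 
          |       void write(String k, String v) {
          |         kv.put(k, v);           // apply locally
          |         if (next != null) {
          |           next.write(k, v);     // forward down the chain
          |         } else {
          |           // tail reached: fully replicated, ack the client
          |         }
          |       }
          | 
          |       String read(String k) {
          |         return kv.get(k);       // reads served at the tail
          |       }
          |     }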
        
           | ryanworl wrote:
           | Can you explain why the sequence numbers are so giant? I've
           | never understood that.
        
             | justicezyx wrote:
              | I don't remember the size, is it 128 bits?
              | 
              | It was chosen for future expansion. Kinesis was envisioned
              | to be a much larger-scale Kafka + Storm (Storm was the
              | stream-processing framework popular in 2012; it has since
              | fallen out of favor).
        
               | ryanworl wrote:
                | 128-bit might be accurate. I meant more along the lines
                | of: they are non-contiguous and don't seem to be
                | correlated with the number of records actually being
                | written to a stream.
        
               | crgwbr wrote:
               | I always assumed they were sequential per-region rather
               | than per-stream, but that's just a guess.
        
               | justicezyx wrote:
                | I recall it was sequential per-shard, but with no
                | ordering among the shards of the same stream. But I
                | literally haven't touched Kinesis since 2013.
        
         | hintymad wrote:
          | AWS frontend services are usually implemented in Java. If
          | Kinesis' frontend is too, then it's surprising that the
          | threads created by a frontend service would exceed the OS
          | limit. This suggests three possibilities: 1. Kinesis did not
          | impose a max thread count in their app, which is a gross
          | omission; 2. there was a resource leak in their code; or 3.
          | each of their frontend instances stored all the placement
          | information of the backend servers, which means their
          | frontend did not scale with backend size.
        
           | joneholland wrote:
           | My understanding is that every front end server has at least
           | one connection (on a dedicated thread) to every other front
           | end server.
           | 
            | Assuming they have, say, 5000 front end instances, that's
            | 5000 file descriptors being used just for this, before you
            | are even talking about whatever threads the application
            | needs.
            | 
            | It's not surprising that they bumped into ulimits, though
            | as part of OS provisioning you typically have those tuned
            | for the workload.
            | 
            | More concerning is the 5000 x 5000 open TCP sessions across
            | their network to support this architecture. That has to be
            | a lot of fun on any stateful firewall it might cross.
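            | 
            | Roughly the shape of that pattern (a guess for
            | illustration, not actual Kinesis code):
            | 
            |     import java.io.IOException;
            |     import java.net.Socket;
            |     import java.util.List;
            | 
            |     class PeerMesh {
            |       // One dedicated thread (and socket/fd) per peer.
            |       // With N front-end hosts that is N-1 threads per
            |       // host and roughly N*(N-1) TCP sessions fleet-wide.
            |       static void connect(List<String> peers, int port) {
            |         for (String peer : peers) {
            |           new Thread(() -> {
            |             try (Socket s = new Socket(peer, port)) {
            |               // read shard-map/membership gossip forever
            |             } catch (IOException e) {
            |               // reconnect with backoff
            |             }
            |           }, "peer-" + peer).start();
            |         }
            |       }
            |     }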
        
         | pentlander wrote:
         | The hand rolled gossip protocol (DFDD) is not used for
         | consensus, it's just used for cluster membership and health
         | information. It's used by pretty much every foundational AWS
         | service. There's a separate internal service for consensus that
         | uses paxos.
         | 
         | The thread per frontend member definitely sounds like a
         | problematic early design choice. It wouldn't be the first time
         | I heard of an AWS issue due to "too many threads". Unlike gRPC,
         | the internal RPC framework defaults to a thread per request
         | rather than an async model. The async way was pretty painful
         | and error prone.
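          | 
          | On the DFDD point, a gossip-style membership/health table in
          | spirit looks something like this (purely illustrative, this
          | is not DFDD itself): nodes periodically swap heartbeat tables
          | with a few random peers and mark a node suspect if its
          | heartbeat stops advancing.
          | 
          |     import java.util.Map;
          |     import java.util.concurrent.ConcurrentHashMap;
          | 
          |     class GossipMembership {
          |       // node id -> (heartbeat counter, local time last seen)
          |       record Entry(long beat, long seenAtMillis) {}
          |       final Map<String, Entry> table =
          |           new ConcurrentHashMap<>();
          | 
          |       // merge a table gossiped by a random peer, keeping the
          |       // freshest heartbeat for every node
          |       void merge(Map<String, Long> remote) {
          |         long now = System.currentTimeMillis();
          |         remote.forEach((node, beat) -> table.merge(
          |             node, new Entry(beat, now),
          |             (old, nu) -> nu.beat() > old.beat() ? nu : old));
          |       }
          | 
          |       // a node whose heartbeat hasn't advanced recently is
          |       // considered unhealthy; no consensus round needed
          |       boolean isSuspect(String node, long timeoutMillis) {
          |         Entry e = table.get(node);
          |         return e == null || System.currentTimeMillis()
          |             - e.seenAtMillis() > timeoutMillis;
          |       }
          |     }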
        
           | joneholland wrote:
           | Are they still using Coral and Codigo as the RPC stack?
        
             | morsma wrote:
             | Not Codigo, but Coral, yes.
        
               | 8note wrote:
                | Well, there's still Codigo around, but Coral is quite
                | pleasant.
        
             | pentlander wrote:
              | Yeah, Amazon still runs on Coral; there were some recent
              | (released a few years ago) advances on it under the hood
              | and ergonomically. I think the "replacement" for it is
             | Smithy[0] though it will likely just replace the XML schema
             | and codegen and not the protocol. Honestly at this point I
             | think it would be in Amazon's best interest to heavily
             | invest in Java Project Loom rather than trying to convert
             | to async.
             | 
             | [0] https://awslabs.github.io/smithy/
        
               | trhway wrote:
               | > Java Project Loom
               | 
                | Sounds like after more than 20 years the green threads
                | are back! Everything new is well-forgotten old
                | (especially the bad parts of that old :)
                | 
                | I think in the Kinesis case it isn't the thread-per-
                | connection model that is the root of the problem. It is
                | the each-to-each flat topology of the front end, which
                | keeps growing and hitting one threshold after another.
                | First the number of threads, next it will be something
                | else... until they re-architect into something like a
                | layered structure.
               | 
               | Many here mention the quality of the current talent at
               | AMZN. Anecdotally, 2 people they recently hired from our
                | dept were among the weakest. A slightly stronger guy
                | got offers from Amazon and Apple and went with Apple. A
                | much stronger and more experienced guy failed to get an
                | offer from Amazon.
        
             | [deleted]
        
           | ignoramous wrote:
           | Are you sure Kinesis uses DFDD [0]?
           | 
           | [0] Seems like a relic of years gone by
           | https://patents.justia.com/patent/9838240
        
             | cowsandmilk wrote:
              | That patent is from when Kinesis Data Streams was
              | originally announced to the public. Any reason not to
              | think it uses it? It seems like it would have been a
              | logical choice in the initial architecture, and change is
              | slow.
        
             | pentlander wrote:
             | Though I no longer work for Amazon, I'm reasonably certain
             | they use it from the description. Especially given I know
             | for a fact that other more foundational services use it.
             | 
             | Why is it a "relic of years gone by"? Consul uses a
             | similar, though more advanced technique[0]. Consul may not
             | be as widely used as etcd, but I don't think most would
             | consider it a relic.
             | 
             | [0] https://www.consul.io/docs/architecture/gossip
        
         | bonfire wrote:
         | If you want to eat in a restaurant it's better not to look in
         | the kitchen :-|
        
         | jknoepfler wrote:
         | It irks me to this day that AWS all-hands meetings (circa 2015)
         | celebrated an exponential hiring curve (as in the graph was
         | greeted with applause and touted as evidence of success by the
         | speaker). The next plot would be an exponential revenue curve
         | with approximately the same shape. Meanwhile the median
         | lifespan of an engineer was ~10 months. I don't know, I just
         | couldn't square that one in my head.
        
         | throwaway189262 wrote:
          | I can't remember which DB, but somebody a while back claimed
          | that one of Amazon's "infinitely scalable" DBs was tons of
          | abstraction on top of a massive heap of MySQL instances.
          | 
          | I don't trust anything outside core services on AWS.
          | Regardless of whether the rumor I heard is true, it's clear
          | they value quantity over quality.
        
           | WookieRushing wrote:
           | This is really common and works pretty awesomely. MySQL is
           | extremely battle tested and there's lots of experts out there
           | for it.
           | 
           | FB built a similar system to maintain their graph:
           | https://blog.yugabyte.com/facebooks-user-db-is-it-sql-or-
           | nos...
           | 
           | It's a ton of tiny DBs that look like one massive eventually
           | consistent DB
        
           | rubiquity wrote:
           | Disclosure: I work at AWS, possibly near the system you're
           | describing. Opinions are my own.
           | 
           | If we're talking about the same thing then I think casting
           | stones just because it is based on MySQL is severely
           | misguided. MySQL has decades of optimizations and this
           | particular system at Amazon has solved scaling problems and
           | brought reliability to countless services without ever being
           | the direct cause of an outage (to the best of my knowledge).
           | 
           | Indeed, MySQL is not without its flaws but many of these are
           | related to its quirks in transactions and replication which
           | this system completely solves. The cherry on top is that you
           | have a rock solid database with a familiar querying language
           | and a massive knowledge base to get help from when needed.
           | Oh, and did I mention this system supports multiple storage
           | engines besides just MySQL/InnoDB?
           | 
           | I for one wish we would open source this system though there
           | are a ton of hurdles both technical and not. I think it would
           | do wonders for the greater tech community by providing a much
           | better option as your needs grow beyond a single node system.
           | It has certainly served Amazon well in that role and I've
           | heard Facebook and YouTube have similar systems based on
           | MySQL.
           | 
           | To further address your comment about Amazon/AWS lacking
           | quality: this system is the epitome of our values of
           | pragmatism and focusing our efforts on innovating where we
           | can make the biggest impact. Hand rolling your own storage
           | engines is fun and all but countless others have already
           | spent decades doing so for marginal gains.
        
             | panarky wrote:
             | _> the system you 're describing_
             | 
             | Why all the mysterious cloak-and-dagger?
        
               | rubiquity wrote:
               | It is possible that the person I replied to could be
               | talking about an entirely different piece of software.
               | Another reason is that the specifics of the system I'm
               | referring to are not public knowledge.
               | 
               | The more important takeaway is that building on top of
               | MySQL/InnoDB is perfectly fine and that is what I tried
               | to emphasize.
        
           | ignoramous wrote:
           | I believe someone _allegedly_ from AWS said DynamoDB was
           | written on top of MySQL (on top of InnoDB, really) [0] which
           | would be similar to what Uber and Pinterest did as well. [1]
           | 
           | [0] https://news.ycombinator.com/item?id=18426096
           | 
           | [1] https://news.ycombinator.com/item?id=12605026
        
           | blandflakes wrote:
           | At more than one of my jobs, this has been exactly the right
           | way to horizontally scale relational workloads and has gone
           | very well.
        
           | derwiki wrote:
           | Sounds like Dynamo
        
           | newscom59 wrote:
           | What's the problem with this? _Anything_ that's scalable is
           | "just an abstraction" on top of a heap of shards
           | /processes/nodes/data centers/whatever.
        
         | solatic wrote:
         | > The problem with growing as fast as Amazon has is that their
         | talent bar couldn't keep up. I can't imagine this design being
         | okay 10 years ago when I was there.
         | 
         | I see where you're coming from with this, but you really have
         | to wonder. It sounds more like the original architects made
         | implicit assumptions regarding scale that, likely due to
         | original architects and engineers moving on, were not re-
         | evaluated by the current engineers on Kinesis as Kinesis grew.
         | While it may take an hour now for the front-end cache to sync,
         | I find it highly unlikely that it needed that much time when
         | Kinesis first launched.
         | 
         | The process failure here is organizational, one where Amazon
         | failed to audit its current systems in a complete and current
         | manner such that sufficient attention and resources could be
         | paid to a re-architecture of a critical service before it
         | caused the service to fail. Even now, vertically scaling the
         | front-end cache fleet is just a band-aid - eventually, that
         | won't be possible anymore. Sadly, the postmortem doesn't seem
         | to identify the organizational failure that was the true root
         | cause of the outage.
        
           | edoceo wrote:
           | Oof. My little company is refactoring some five year old
           | architecture design choices. Ugly. Process isn't visible
           | outside the refactor and the work is tedious. Can't imagine
           | what a service refactor is like at A. I bet it sucks
        
             | Twirrim wrote:
             | > Can't imagine what a service refactor is like at A. I bet
             | it sucks
             | 
             | It's not all that hard. AWS heavily focuses on Service
             | Oriented Architecture approaches, with specific
             | knowledge/responsibility domains for each. It's a proven
              | scalable pattern. The APIs will often be fairly
              | straightforward behind the front end. With clear lines of
              | responsibility between components, you'll almost never
              | have to worry about what other services are doing. Just
              | fix what
             | you've got right in front of you. This is an area where
             | TLA+ can come in handy too. Build a formal proof and
             | rebuild your service based on it.
             | 
             | I joined Glacier 9 months after launch, and it was in the
             | band-aid stage. In cloud services your first years will
             | roughly look like:
             | 
              | 1) Band-aids and emergency refactoring. Customers never
              | do what you expect or predict, no matter how you price
              | your service to encourage particular behaviour. The first
              | year is very much keep-the-lights-on focused. Fixing
             | bugs and applying band-aids where needed. In AWS, it's
             | likely they'll target a price decrease for re:invent
             | instead of new features.
             | 
             | 2) Scalability, first new feature work. Traffic will
             | hopefully be picking up by now for your service, you'll
             | start to see where things may need a bit of help to scale.
             | You'll start working on the first bits of innovation for
             | the platform. This is a key stage because it'll start to
              | show you where you've potentially painted yourself into a
             | corner. (AWS will be looking for some bold feature to tout
             | at Re:Invent)
             | 
             | 3) Refactoring, feature work starts in earnest. You'll have
             | learned where your issues are. Product managers, market
             | research, leadership etc. will have ideas about what new
             | features to be working on, and have much more of a roadmap
             | for your service. New features will be tied in to the first
             | refactoring efforts needed to scale as per customer
              | workload, and save you from that corner you're painted
              | into.
             | 
             | Year 3 is where some of the fun kicks in. The more senior
             | engineers will be driving the refactoring work, they know
             | what and why things were done how they were done, and can
             | likely see how things need to be. A design proposal gets
             | created and refined over a few weeks of presentations to
             | the team and direct leadership. It's a broad spectrum
             | review, based around constructive criticism. Engineers will
             | challenge every assumption and design decision. There's no
             | expectation of perfection. Good enough is good enough. You
             | just need to be able to justify why you made decisions.
             | 
             | New components will be built from the design, and plans for
             | roll out will be worked on. In Glacier's case one mechanism
              | we'd use was to signal to a service that $x % of requests
              | should use a new code path (roughly sketched at the end
              | of this comment). You'd test behind a whitelist,
             | and then very very slowly ramp up public traffic in one
             | smaller region towards the new code path while tracking
             | metrics until you hit 100%, repeat the process on a large
             | region slightly faster, before turning it on everywhere.
             | For other bits we'd figure out ways to run things in shadow
             | mode. Same requests hitting old and new code, with the new
             | code neutered. Then compare the results.
             | 
             | side note: One of the key values engineers get evaluated on
             | before reaching "Principal Engineer" level is "respect what
             | has gone before". No one sets out to build something crap.
             | You likely weren't involved in the original decisions, you
             | don't necessarily know what the thinking was behind various
             | things. Respect that those before you built something as
             | best as suited the known constraints at the time. The same
             | applies forwards. Respect what is before you now, and be
             | aware that in 3-5 years someone will be refactoring what
             | you're about to create. The document you present to your
             | team now will help the next engineers when they come to
             | refactor later on down the line. Things like TLA+ models
             | will be invaluable here too.
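              | 
              | Circling back to the "$x %" dial above, a tiny sketch of
              | that kind of ramp (the knob names are hypothetical,
              | nothing Glacier-specific):
              | 
              |     import java.util.Set;
              |     import java.util.concurrent.ThreadLocalRandom;
              | 
              |     class TrafficDial {
              |       // Hypothetical knobs: a whitelist of test
              |       // accounts plus a ramp percentage (0..100).
              |       Set<String> whitelist;
              |       double newPathPercent;   // e.g. 1.0, then 5.0 ...
              | 
              |       boolean useNewPath(String accountId) {
              |         if (whitelist.contains(accountId)) {
              |           return true;   // test accounts always opt in
              |         }
              |         return ThreadLocalRandom.current()
              |             .nextDouble(100.0) < newPathPercent;
              |       }
              |     }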
        
               | pentlander wrote:
               | As another ex-Glacier dev, I disagree with it not being
               | "all that hard" but I agree with everything else. Now I'm
               | curious who this is :)
        
               | xiwenc wrote:
                | Indeed, your side note should be the first point. I see
                | that very often in practice. As a result, products get
                | rewritten all the time because developers don't want to
                | spend time understanding the current system, believing
                | they can do a better job than the previous one. They
                | will create different problems. And the cycle repeats.
               | 
               | The not-invented-here (or by-me) syndrome is probably
               | also at play here.
        
               | edoceo wrote:
               | Your side note should be the first point.
        
         | fafner wrote:
          | Yep. Having each front end need to scale with the overall
          | size of the front-end fleet is obviously going to hit some
          | scaling limit. It's not clear to me from the summary why they
          | are doing that. Is it for the shard map or the cache? Maybe
          | if the front end is stateful, that's a way to do stickiness?
          | Seems we can only guess.
        
       | steelframe wrote:
       | > Cellularization is an approach we use to isolate the effects of
       | failure within a service, and to keep the components of the
       | service (in this case, the shard-map cache) operating within a
       | previously tested and operated range. This had been under way for
       | the front-end fleet in Kinesis, but unfortunately the work is
       | significant and had not yet been completed.
       | 
       | Translation: The eng team knew that they had accumulated tech
       | debt by cutting a corner here in order to meet one of Amazon's
       | typical and insane "just get the feature out the door" timelines.
       | Eng warned management about it, and management decided to take
       | the risk and lean on on-call to pull heroics to just fix any
       | issues as they come up. Most of the time yanking a team out of
       | bed in the middle of the night works, so that's the modus
       | operandi at Amazon. This time, the actual problem was more
       | fundamental and wasn't effectively addressable with middle-of-
       | the-night heroics.
       | 
       | Management rolled the "just page everyone and hope they can fix
       | it" dice yet again, as they usually do, and this time they got
       | snake eyes.
       | 
       | I guarantee you that the "cellularization" of the front-end fleet
       | wasn't actually under way, but the teams were instead completely
       | consumed with whatever the next typical and insane "just get the
       | feature out the door" thing was at AWS. The eng team was never
       | going to get around to cellularizing the front-end fleet because
       | they were given no time or incentive to do so by management.
        | During/after this incident, I wouldn't be surprised if
        | management yelled at the eng team, "Wait, you KNEW this was a
        | problem, and you're not done yet?!?" without recognizing that
        | THEY are the
       | ones actually culpable for failing to prioritize payments on tech
       | debt vs. "new shiny" feature work, which is typical of Amazon
       | product development culture.
       | 
       | I've worked with enough former AWS engineers to know what goes on
       | there, and there's a really good reason why anybody who CAN move
       | on from AWS will happily walk away from their 3rd- and 4th-year
       | stock vest schedules (when the majority of your _promised_ amount
       | of your sign-on RSUs actually starts to vest) to flee to a
       | company that fosters a healthy product development and
       | engineering culture.
       | 
        | (Not to mention that, this time, a whole bunch of people's
        | Thanksgiving plans were preempted with the demand to get a full
        | investigation and post-mortem written up, including the public
        | post, ASAP. Was that really necessary? Couldn't it have waited
        | until next Wednesday or something?)
        
       | metaedge wrote:
       | I would have started the response with:
       | 
       | First of all, we want to apologize for the impact this event
       | caused for our customers. While we are proud of our long track
       | record of availability with Amazon Kinesis, we know how critical
       | this service is to our customers, their applications and end
       | users, and their businesses. We will do everything we can to
       | learn from this event and use it to improve our availability even
       | further.
       | 
       | Then move on to explain...
        
         | sigzero wrote:
         | What they did was fine.
        
       | lend000 wrote:
       | Even today I had a few minutes of intermittent networking outages
       | around 9:30am EST (which started on the day of the incident), and
       | compared to other regions, I frequently get timeouts when calling
       | S3 from us-east-1 (although that has been happening since
       | forever).
        
       | [deleted]
        
       | codesparkle wrote:
       | From the postmortem:
       | 
       |  _At 9:39 AM PST, we were able to confirm a root cause [...] the
       | new capacity had caused all of the servers in the fleet to exceed
       | the maximum number of threads allowed by an operating system
       | configuration._
        
       | karmakaze wrote:
       | Seems to me that the root problem could also be fixed by not
       | using presumably blocking application threads talking to each of
       | the other servers. Any async or poll mechanism wouldn't require
       | N^2 threads across the pool.
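        | 
        | For example, a single NIO selector thread can watch every peer
        | connection, keeping the per-host thread count flat no matter
        | how large the fleet gets (a rough sketch, not Kinesis code):
        | 
        |     import java.net.InetSocketAddress;
        |     import java.nio.channels.SelectionKey;
        |     import java.nio.channels.Selector;
        |     import java.nio.channels.SocketChannel;
        |     import java.util.List;
        | 
        |     class PeerPoller {
        |       static void poll(List<String> peers, int port)
        |           throws Exception {
        |         Selector selector = Selector.open();
        |         for (String peer : peers) {
        |           SocketChannel ch = SocketChannel.open();
        |           ch.configureBlocking(false);
        |           ch.connect(new InetSocketAddress(peer, port));
        |           ch.register(selector, SelectionKey.OP_CONNECT);
        |         }
        |         while (true) {
        |           selector.select();   // one thread, all peers
        |           for (SelectionKey key : selector.selectedKeys()) {
        |             if (key.isConnectable()) {
        |               ((SocketChannel) key.channel()).finishConnect();
        |               key.interestOps(SelectionKey.OP_READ);
        |             } else if (key.isReadable()) {
        |               // read shard-map/membership updates here
        |             }
        |           }
        |           selector.selectedKeys().clear();
        |         }
        |       }
        |     }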
        
         | why-el wrote:
          | I wonder if the new wonders coming out of Linux (io_uring...)
          | would have made this a better design, but that work in the
          | kernel is still in active development.
        
       | ris wrote:
       | The one thing I want to know in cases like this is: why did it
       | affect multiple Availability Zones? Making a resource multi-AZ is
       | a significant additional cost (and often involves additional
       | complexity) and we really need to be confident that typical
       | observed outages would _actually_ have been mitigated in return.
        
         | EdwardDiego wrote:
         | Indeed. We're paying (and designing our systems to work on
         | multiple AZs) to reduce the risk of outages, but then their
         | back-end services are reliant on services in a sole region?
        
           | frankietaylr wrote:
           | AWS should make this more transparent so that we make better
           | design choices.
        
           | otterley wrote:
           | (Disclaimer: I work for AWS but opinions are my own. I also
           | do not work with the Kinesis team.)
           | 
           | Nearly all AWS services are regional in scope, and for many
           | (if not most) services, they are scaled at a cellular level
           | within a region. Accounts are assigned to specific cells
           | within that region.
           | 
           | There are very, very few services that are global in scope,
           | and it is strongly discouraged to create cross-regional
           | dependencies -- not just as applied to our customers, but to
           | ourselves as well. IAM and Route 53 are notable exceptions,
           | but they offer read replicas in every region and are
           | eventually consistent: if the primary region has a failure,
           | you might not be able to make changes to your configuration,
           | but the other regions will operate on read-only replicas.
           | 
           | This incident was regional in scope: us-east-1 was the only
           | impacted region. As far as I know, no other region was
           | impacted by this event. So customers operating in other
           | regions were largely unaffected. (If you know otherwise,
           | please correct me.)
           | 
           | As a Solutions Architect, I regularly warn customers that
           | running in multiple Availability Zones is not enough.
           | Availability Zones protect you from many kinds of physical
           | infrastructure failures, but not necessarily from regional
           | service failures. So it is super important to run in multiple
           | regions as well: not necessarily active-active, but at least
           | in a standby mode (i.e. "pilot light") so that customers can
           | shed traffic from the failing region and continue to run
           | their workloads.
        
           | qz2 wrote:
           | Correct.
           | 
           | I, as many people have, discovered this when something broke
           | in one of the golden regions. In my case cloudfront and ACM.
           | 
           | Realistically you can't trust one provider at all if you have
           | high availability requirements.
           | 
           | The justification is apparently that the cloud is taking all
           | this responsibility away from people but from personal
           | experience running two cages of kit at two datacenters the
           | TCO was lower and the reliability and availability higher.
           | Possibly the largest cost is navigating Harry-Potter-esque
           | pricing and automation laws. The only gain is scaling past
           | those two cages.
           | 
           | Edit: I should point out however that an advantage of the
           | cloud is actually being able to click a couple of buttons and
           | get rid of two cages worth of DC equipment instantly if your
           | product or idea doesn't work out!
        
             | freehunter wrote:
             | >you can't trust one provider at all
             | 
             | The hard part with multi-cloud is, you're just increasing
             | your risk of being impacted by someone's failure. Sure if
             | you're all-in on AWS and AWS goes down, you're all-out. But
             | if you're on [AWS, GCP] and GCP goes down, you're down
             | anyway. Even though AWS is up, you're down because Google
             | went down. And if you're on [AWS, GCP, Azure] and Azure
             | goes down, it doesn't matter than AWS and GCP are up...
             | you're down because Azure is down. The only way around that
             | is architecting your business to run with only one of those
             | vendors, which means you're paying 3x more than you need to
             | 99.99999% of the time.
             | 
             | The probability that one of [AWS, Azure, GCP] is down is
             | way higher than the probability that just _one_ of them is
             | down. And the probability that your two cages in your
             | datacenter is down is way higher than the probability that
             | any one of the hyperscalers is down.
        
               | [deleted]
        
               | qz2 wrote:
               | I disagree. It's about mitigating the risk of a single
               | provider's failure. Single providers go down all the
               | time. We've seen it from all three major cloud vendors.
        
               | hedora wrote:
               | How do you test failover from provider-wide outages?
               | 
               | I've never heard of an untested failover mechanism that
               | worked. Most places are afraid to invoke such a thing,
               | even during a major outage.
        
               | [deleted]
        
               | qz2 wrote:
               | That's fairly simple. Regular scenario planning, drills
               | and suitable chaos engineering tooling.
               | 
               | Being afraid of failures is a massive sign of problems.
               | I've worked in those sorts of places before.
        
             | coredog64 wrote:
             | ACM and CloudFront being sticky to us-east-1 is
             | particularly annoying. I'm happy not being multi regional
             | (I don't have that level of DR requirements), but these
             | types of services require me to incorporate all the multi
             | region complexity.
        
             | lttlrck wrote:
             | Harry-Potter-esque pricing?
             | 
             | Is that a reference to the difficulty of calculating the
             | cost of visiting all the rides at universal? That's my best
             | guess...
        
               | qz2 wrote:
               | It's more a stab at the inconsistency of rules around
               | magic.
               | 
               |  _" Well this pricing rule only works on a Tuesday lunch
               | time if you're standing on one leg with a sausage under
               | each arm and a traffic cone on your head"_
               | 
               | And there are a million of those to navigate.
        
         | talawahtech wrote:
          | Multi-AZ doesn't protect against a software/OS issue like
          | this; Multi-AZ would be relevant if it were an infrastructure
          | failure (e.g. underlying EC2 instances or networking).
         | 
         | The relevant resiliency pattern in this case would be what they
         | refer to as cell-based architecture, where within an AZ
         | services are broken down into smaller independent cells to
         | minimize the blast radius.
         | 
          | They specifically mention in the write-up that this was a gap
          | they plan to address; the "backend" portion of Kinesis was
          | already cellularized, but that step had not yet been
          | completed for the "frontend".
         | 
          | Cellularization in combination with workload partitioning
          | would have helped, e.g. don't run CloudWatch, Cognito and
          | customer workloads on the same set of cells.
          | 
          | It is also important to note that cellularization only helps
          | in this case if they limit code deployments to a small number
          | of cells at a time.
         | 
         | This YouTube video[1] of a re:invent presentation does a great
         | job of explaining it. The cell-based stuff, starts around
         | minute 20.
         | 
         | 1. https://youtu.be/swQbA4zub20
        
           | talawahtech wrote:
           | Another relevant point made in the video is that they
           | restrict cells to a maximum size which then makes it easier
           | to test behavior at that size. This would have also helped
           | avoid this specific issue since the number of threads would
           | have been tied to the number of instances in a cell.
           | 
           | I definitely recommend checking out the video. Even if you
           | have seen it before, rewatching it in the context of this
           | post-mortem really makes it hit home.
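            | 
            | A toy illustration of both ideas, cells capped at a tested
            | size plus accounts pinned to a cell (the numbers and names
            | are made up, this is not AWS code):
            | 
            |     import java.util.ArrayList;
            |     import java.util.List;
            | 
            |     class CellRouter {
            |       // Cells never grow past a size that has been tested;
            |       // capacity is added by opening new cells instead.
            |       static final int MAX_HOSTS_PER_CELL = 64;  // made up
            |       private final List<List<String>> cells =
            |           new ArrayList<>();
            | 
            |       void addHost(String host) {
            |         for (List<String> cell : cells) {
            |           if (cell.size() < MAX_HOSTS_PER_CELL) {
            |             cell.add(host);
            |             return;
            |           }
            |         }
            |         List<String> fresh = new ArrayList<>();
            |         fresh.add(host);        // open a brand-new cell
            |         cells.add(fresh);
            |       }
            | 
            |       // Pin each account to one cell so a bad cell (or a
            |       // bad deploy to it) only hurts its own accounts.
            |       List<String> cellFor(String accountId) {
            |         int i = Math.floorMod(accountId.hashCode(),
            |                               cells.size());
            |         return cells.get(i);
            |       }
            |     }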
        
             | ignoramous wrote:
             | > _Another relevant point made in the video is that they
             | restrict cells to a maximum size which then makes it easier
             | to test behavior at that size._
             | 
             | Googlers would be quick to point out that Borg does this
             | natively across all their services:
             | https://news.ycombinator.com/item?id=19393926
        
           | brown9-2 wrote:
           | but why does a Kinesis outage due to a capacity increase
           | affect multiple AZs, if one assumes the capacity increase
           | (and the frontend servers impacted by it) are in a single
           | zone?
        
       ___________________________________________________________________
       (page generated 2020-11-28 23:00 UTC)