hngopher.com

       [HN Gopher] AWS's us-east-1 region is experiencing issues
       ___________________________________________________________________
        
       AWS's us-east-1 region is experiencing issues
        
       Author : zaltekk
       Score  : 87 points
       Date   : 2022-03-09 20:59 UTC (2 hours ago)
        
 (HTM) web link (health.aws.amazon.com)
 (TXT) w3m dump (health.aws.amazon.com)
        
       | mtrunkat wrote:
       | In our case (Apify.com) there was a complete outage of SQS
       | (15+ minutes), most likely caused by DNS problems, and EC2
       | instances got restarted, probably as a result of the SQS outage.
       | 
       | EDIT: AWS Lambda also seems to be down, and the AWS EC2 APIs have
       | a very high error rate and machines are slow to start.
        
         | BigGreenTurtle wrote:
         | Yep, I saw empty responses for sqs.us-east-1.amazonaws.com for
         | a while. Seems okay now though.
        
       | saltypal wrote:
       | Based on our telemetry, this started as NXDOMAINs for
       | sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43
       | UTC and becoming a total outage at 20:48 UTC. Naturally, it was
       | completely resolved by 20:57, 5 minutes before anything was
       | posted in the "Personal Health Dashboard" in the AWS console.
       | 
       | It takes a while to find a Vice President, I guess.
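
The NXDOMAIN telemetry described above can be approximated with a small resolution probe. This is a minimal sketch, not the commenter's actual tooling: the function name `resolves` and the injectable `resolver` parameter are assumptions made for illustration, and only the standard-library `socket.getaddrinfo` call is real.

```python
import socket


def resolves(hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address.

    `resolver` is a hypothetical injection point so the check can be
    exercised without network access; it defaults to the real lookup.
    """
    try:
        return len(resolver(hostname, port)) > 0
    except socket.gaierror:
        # NXDOMAIN and other resolution failures surface as gaierror.
        return False


# Example use from a monitoring loop (performs a real DNS lookup):
#   if not resolves("sqs.us-east-1.amazonaws.com"):
#       print("SQS endpoint does not resolve")
```

Running a probe like this on a short interval and recording the failure ratio would show the kind of ramp from partial to total failure that the comment describes.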
        
         | mcqueenjordan wrote:
         | Or perhaps triaging, root-causing, and fixing the issue is the
         | highest-order bit?
        
           | nostrebored wrote:
           | It definitely is. For an issue like this, you will see
           | relevant teams and delegates looped in very quickly. Getting
           | approved wording about an outage requires some very senior
           | people though. Often they have to be paged in as well.
           | 
           | Having worked at a few other large tech companies now --
           | Amazon's incident response process is honestly great. It's
           | one of the things I miss about working there.
        
             | saltypal wrote:
             | This. We have a 4-person team and posted our own incident
             | about this 7 minutes before Amazon did. Surely they can aim
             | a little higher.
        
               | ElevenLathe wrote:
               | IME, this actually becomes more challenging as a company
               | gets larger, not less (but that doesn't mean it can't be
               | done).
        
           | smachiz wrote:
           | sure, but if those people are updating the status pages to
           | say something isn't right and we're looking into it, we're
           | doomed.
        
           | viraptor wrote:
           | Different people have different responsibilities. At Amazon
           | scale, the comms and people doing a deep dive to fix stuff
           | will not be the same.
        
           | saltypal wrote:
           | Separate teams. We have a tiny team and even _we_ appoint a
           | group to fix and a group or individual to do nothing but
           | communicate.
        
       | halestock wrote:
       | I can't help but wonder, with the increases in attrition across
       | the industry, are we hitting some kind of tipping point where the
       | institutional knowledge in these massive tech corporations is
       | disappearing?
       | 
       | Mistakes happen all the time but when all the people who
       | intimately know how these systems work leave for other
       | opportunities, disasters are bound to happen more and more.
        
         | nyellin wrote:
         | That's the problem we're out to solve with robusta.dev.
         | 
         | We're slowly but surely converting the world's institutional
         | technical knowledge into re-usable and automated runbooks.
        
           | hughrr wrote:
           | I'm just going to have to spend all day fixing the runbooks
           | as well as the technology ;)
        
         | zwirbl wrote:
         | Just like the tech priests in Warhammer 40k, keeping occult old
         | engineering, that no one could build anymore, running
        
           | hughrr wrote:
           | So today I find out my job title is tech priest. I was happy
           | with necromancer before. Does it come with a pay rise?
        
           | viraptor wrote:
           | Not familiar with 40k. Was it a similar idea to nuclear-
           | power-as-religion from Foundation?
        
             | atty wrote:
             | Not far off. The "golden age" of humanity was shattered
             | long ago, with the mortal wounding of the god emperor, and
             | knowledge of most of the greatest technology was
             | lost. Millennia later, a cult has grown up that both
             | worships and maintains technology as having machine
             | spirits, which are somehow linked to the machine god
             | itself. That god may or may not be the same or related to
             | the god emperor of mankind, depending on the
             | interpretation.
             | 
             | Honestly the lore of w40k is quite fun to read, if you're
             | into dystopian and fantasy sci-fi.
        
             | aaronax wrote:
             | Or how about Anathem, with the Ita class doing computer
             | things and nuclear materials cared for by a select group?
        
           | SketchySeaBeast wrote:
           | If we want to normalize letting long term support people call
           | themselves tech priests I'd very much appreciate it.
           | 
           | "What were your duties at your last position?" "Performing
           | the daily ministrations and singing the praise of the machine
           | god."
        
         | Traster wrote:
         | I think this is a transient issue. When you're in growth mode
         | you make a huge series of hacks to just keep things running and
         | then when you leave.... well, it's a problem. But if the
         | business is robust, and lives beyond you, what replaces your
         | work is better documented, better tested, and maintainable.
         | 
         | That's the dream. Obviously there are companies that sink
         | between v1 and v2, but that's life.
         | 
         | Fundamentally I think the cloud business _is_ robust, it's a
         | fundamentally reasonable way of organising things (for enough
         | people), which is why it attracts customers despite being
         | arguably more expensive.
        
         | fragmede wrote:
         | You're right, but that's been true since the beginning of the
         | tech boom (but isn't exclusive to tech) when no one works for a
         | place for several decades. Companies weather this in different
         | ways but attrition has always been around.
         | 
         | What's causing people to believe that the latest round of
         | attrition is any different?
        
           | hkt wrote:
           | I'd speculate that perhaps more senior people are moving,
           | and/or a greater overall rate of attrition combined with much
           | more complex technologies and organisations. In other words,
           | it might be harder to become good at jobs now, and fewer
           | people stick with them. Just a hunch but definitely seems to
           | be where the incentives point with loyalty penalties and tech
           | bloat.
        
       | lyjackal wrote:
       | noticed issues with SQS for a couple minutes. Errors from java
       | sdk, `com.amazonaws.SdkClientException: Unable to execute HTTP
       | request: sqs.us-east-1.amazonaws.com`
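
Transient errors like the one quoted above are usually absorbed by retrying with backoff. The AWS SDKs ship their own retry logic, so the following is only a hand-rolled illustration of the idea in Python (the original error came from the Java SDK). The name `call_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import random
import time


def call_with_backoff(fn, retries=5, base=0.2, cap=5.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying on transient errors with capped, jittered backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # "Full jitter": sleep a random time up to the capped exponential.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The random jitter spreads retries from many clients over time, which matters when an endpoint is just coming back and would otherwise be hit by every client at once.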
        
       | PeterBarrett wrote:
       | SQS went down for us in us-east-1 and we lost health checks on
       | instances there. Fully recovered now.
        
       | asah wrote:
       | us-east-1 again!
        
       | amar0c wrote:
       | My Aruba Instant On APs show as "offline" (orange) even though
       | they work and I am online. My first thought is that some cloud
       | went into a nirvana state
        
       | etaioinshrdlu wrote:
       | Does AWS have a plan to improve this region?
       | 
       | Do they acknowledge the problem?
       | 
       | It's been a joke for years how bad us-east-1 is.
        
         | consumer451 wrote:
         | Nuke it from orbit
         | 
         | It's the only way to be sure
        
       | xilni wrote:
       | This is why you are strongly urged not to rely on one region or
       | AZ.
        
         | Johnny555 wrote:
         | I would strongly urge not using us-east-1 -- of all the regions
         | we're in, it's by far the most problematic. Use us-east-2 if
         | you need good latency to the East Coast.
        
         | tyingq wrote:
         | Good advice, though AWS still has some services that don't work
         | completely independently. Cloudfront, because of certificates.
         | Route53. The control API for IAM (adding/removing roles, etc).
         | And I wish they didn't have global-looking endpoints (like
         | https://sts.amazonaws.com) that aren't really global or
         | resilient.
        
           | ranman wrote:
           | STS will let you use regional endpoints now, right?
        
             | tyingq wrote:
             | Yes. It's just that the "global endpoint" is misleading.
             | They don't repoint it if it fails. It really shouldn't
             | exist given that's how it functions.
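
The difference between the global and regional STS endpoints discussed above comes down to the hostname a client is pointed at. A minimal sketch, assuming the standard `sts.<region>.amazonaws.com` naming used in the commercial AWS partition (other partitions, such as China, use different domains); the helper name `sts_endpoint` is hypothetical.

```python
def sts_endpoint(region=None):
    """Build an STS endpoint URL for the commercial AWS partition.

    With no region this returns the "global" endpoint from the thread;
    with a region it returns that region's own endpoint.
    """
    if region is None:
        return "https://sts.amazonaws.com"
    return f"https://sts.{region}.amazonaws.com"


# Possible use with boto3, assuming it is installed:
#   import boto3
#   region = "us-east-2"
#   sts = boto3.client("sts", region_name=region,
#                      endpoint_url=sts_endpoint(region))
```

Setting the environment variable `AWS_STS_REGIONAL_ENDPOINTS=regional` is another way to ask the SDKs and CLI to prefer regional STS endpoints without hard-coding a URL.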
        
         | didip wrote:
         | Multi AZ is great and should be by default, but multi Region is
         | expensive.
        
         | pid-1 wrote:
         | Given the total amount of money I've lost due to a single AZ
         | being down, it was totally worth it to NOT go multi-AZ or
         | multi-region so far.
         | 
         | Multi AZ isn't that hard, but generally requires extra costs
         | (one nat gw per az, etc...)
         | 
         | But multi region in AWS is a royal pain in the ass. Many
         | services (like SSO) do not play well with multi region setups,
         | making things really complicated even if you IaCed your whole
         | stack.
        
           | evrydayhustling wrote:
           | Those costs are the actual reason you are encouraged to go
           | multi-AZ!
           | 
           | (I actually love that we have strategies and infrastructure
           | for multi-region... it just tends to come up at scales and
           | for applications where it is not justified.)
        
         | systemvoltage wrote:
         | Seems like it would be a conflict of interest to increase
         | robustness of single AZ (so it never goes down or has its own
         | redundancy) vs. increased revenues from multi AZ deployment.
         | 
         | What's the point of the cloud if we have to manage robustness of
         | their own infrastructure? I can understand if that's due to
         | natural disasters and earthquakes, but the idea should be that
         | a single AZ should never go down barring extraordinary
         | circumstances. AWS should be auto-balancing, handling downtimes
         | of a single AZ without the customer ever noticing it.
         | 
         | It might not be a good analogy, but if a single Cloudflare edge
         | datacenter goes down, it will automatically route traffic
         | through others. Transparent and painless to the customer. I
         | understand AWS is huge, and different services have different
         | redundancy mechanisms, but just conceptually it feels like
         | they're in a conflict of interest to increase robustness of
         | their data centers - "We told you to have multi-AZ deployment,
         | not our fault".
         | 
         | Another way to put this: as an AWS customer, make sure to
         | factor the 3x cost and the management overhead of a multi-AZ
         | deployment into your total costs.
        
           | thedougd wrote:
           | They would simply charge for the privilege. An EC2 'always
           | on' or whatever option that enabled your instance to live
           | migrate between availability zones would be a nice and
           | expensive option.
        
             | systemvoltage wrote:
             | Definitely. Then I wonder why we need the cloud :) if not
             | for services (not EC2). A lot of mid-sized companies are
             | re-evaluating:
             | https://www.economist.com/business/2021/07/03/do-the-costs-o...
        
         | m34 wrote:
         | Might be true for running stuff in different regions/AZs but if
         | the provisioning region is down (e.g. deploying lambda@edge)
         | one does not really have an alternative
        
       | easton wrote:
       | From temuze last time:
       | 
       | "If you're having SLA problems I feel bad for you son
       | 
       | I got two 9 problems cuz of us-east-1"
        
       | didip wrote:
       | This is why us-east-1 is perfect for a chaos-testing, non-prod
       | environment.
        
       | 0xCAP wrote:
       | Is us-east cursed or what?!
        
       | csdvrx wrote:
       | As usual?
        
       | fotta wrote:
       | Somehow AWS managed to make their new status page more opaque
       | than the old one. It's like they want you to scroll through their
       | gigantic list so they can fix the issue before you find the right
       | line.
        
       | operator1 wrote:
       | What's up with all of the multi-platform outages lately? Seems
       | abnormal looking at historical data. Are there issues affecting
       | the internet backbone or something? Or just a coincidence?
        
         | super_linear wrote:
         | Absolutely no way to prove this but maybe Q1 deadlines coming
         | up and people trying to launch things and make changes?
        
         | frays wrote:
         | Increase in attrition across the industry.
         | 
         | A lot of institutional knowledge in these massive tech
         | corporations is disappearing and we're starting to reach the
         | tipping point.
        
           | fragmede wrote:
           | But there's always been attrition. What is different now
           | that is affecting attrition rates and their effects?
        
             | Spooky23 wrote:
             | Things are bigger everywhere. My colleagues and I thought
             | we were hot shit managing 5-7k applications and
             | infrastructure. Amazon probably runs 20,000 orgs like mine.
             | 
             | Also, times are good and rates are crazy. Even at VARs, you
             | can make a lot of cash. I have a buddy who went from $150k
             | to $600k. The guy paid off his mortgage and is at a point
             | where he could burn out and work at Home Depot if he needed
             | to.
        
             | stone-monkey wrote:
             | Probably increased salary and switch to permanent remote.
             | Amazon is notorious for their frugality and they recently
             | doubled their maximum salary cap to 350k. They would only
             | have done this to stay competitive in the current job
             | market. This implies that many of their existing employees
             | are underpaid relative to their peers at comparable
             | companies and they've likely seen a large uptick in
             | attrition. Not to mention attrition begets more attrition,
             | especially if it's "influential" employees who are leaving.
        
         | 300bps wrote:
         | Important to keep in mind that AWS has 250 services in 84
         | Availability Zones in 26 regions.
         | 
         | This outage is reportedly impacting 5 services in 1 region.
         | 
         | For those impacted, pretty terrible. But as a heavy user of
         | AWS, I've seen these notices posted multiple times on HN and
         | haven't been impacted by one yet.
        
           | quxbar wrote:
           | For businesses with uptime guarantees and lots of boxes to
           | spin up in a failover scenario, this has been a very eventful
           | 12 months. At least that's what I'm experiencing.
        
         | xeromal wrote:
         | Russian war is another juicy possibility
        
           | adamrezich wrote:
           | told myself I'd click this submission's comments link, CTRL+F
           | `Russia`, & quit HN for the day if anything came up, thanks
           | for not disappointing
        
             | xeromal wrote:
             | Haha, no problemo.
        
       ___________________________________________________________________
       (page generated 2022-03-09 23:00 UTC)