[HN Gopher] AWS's us-east-1 region is experiencing issues
___________________________________________________________________
Author : zaltekk
Score  : 87 points
Date   : 2022-03-09 20:59 UTC (2 hours ago)

(HTM) web link (health.aws.amazon.com)
(TXT) w3m dump (health.aws.amazon.com)

| mtrunkat wrote:
| In our case (Apify.com) there was a complete outage of SQS (15+ minutes), most likely due to DNS problems, and EC2 instances got restarted, probably as a result of the SQS outage.
|
| EDIT: AWS Lambda also seems to be down, and the AWS EC2 APIs are showing a very high error rate and slow machine startup times.
| BigGreenTurtle wrote:
| Yep, I saw empty responses for sqs.us-east-1.amazonaws.com for a while. Seems okay now though.
| temptemptemp111 wrote:
| saltypal wrote:
| Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.
|
| It takes a while to find a Vice President, I guess.
| mcqueenjordan wrote:
| Or perhaps triaging, root-causing, and fixing the issue is the highest-order bit?
| nostrebored wrote:
| It definitely is. For an issue like this, you will see relevant teams and delegates looped in very quickly. Getting approved wording about an outage requires some very senior people though. Often they have to be paged in as well.
|
| Having worked at a few other large tech companies now -- Amazon's incident response process is honestly great. It's one of the things I miss about working there.
| saltypal wrote:
| This. We have a 4-person team and posted our own incident about this 7 minutes before Amazon did. Surely they can aim a little higher.
| ElevenLathe wrote:
| IME, this actually becomes more challenging as a company gets larger, not less (but that doesn't mean it can't be done).
| smachiz wrote:
| sure, but if those people are updating the status pages to say something isn't right and we're looking into it, we're doomed.
| viraptor wrote:
| Different people have different responsibilities. At Amazon scale, the comms people and the people doing a deep dive to fix stuff will not be the same.
| saltypal wrote:
| Separate teams. We have a tiny team and even _we_ appoint a group to fix and a group or individual to do nothing but communicate.
| halestock wrote:
| I can't help but wonder, with the increases in attrition across the industry, are we hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing?
|
| Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.
| nyellin wrote:
| That's the problem we're out to solve with robusta.dev.
|
| We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.
| hughrr wrote:
| I'm just going to have to spend all day fixing the runbooks as well as the technology ;)
| zwirbl wrote:
| Just like the tech priests in Warhammer 40k, keeping occult old engineering, that no one could build anymore, running
| hughrr wrote:
| So today I find out my job title is tech priest. I was happy with necromancer before. Does it come with a pay rise?
| viraptor wrote:
| Not familiar with 40k. Was it a similar idea to nuclear-power-as-religion from Foundation?
| atty wrote:
| Not far off.
| The "golden age" of humanity was shattered long ago, with the mortal wounding of the god emperor, and knowledge of most of the greatest technology was lost. Millennia later, a cult has grown up that both worships and maintains technology as having machine spirits, which are somehow linked to the machine god itself. That god may or may not be the same or related to the god emperor of mankind, depending on the interpretation.
|
| Honestly the lore of w40k is quite fun to read, if you're into dystopian and fantasy sci-fi.
| aaronax wrote:
| Or how about Anathem, with the Ita class doing computer things and nuclear materials cared for by a select group?
| SketchySeaBeast wrote:
| If we want to normalize letting long term support people call themselves tech priests I'd very much appreciate it.
|
| "What were your duties at your last position?" "Performing the daily ministrations and singing the praise of the machine god."
| Traster wrote:
| I think this is a transient issue. When you're in growth mode you make a huge series of hacks to just keep things running and then when you leave.... well, it's a problem. But if the business is robust, and lives beyond you, what replaces your work is better documented, better tested, and maintainable.
|
| That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.
|
| Fundamentally I think the cloud business _is_ robust, it's a fundamentally reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.
| fragmede wrote:
| You're right, but that's been true since the beginning of the tech boom (but isn't exclusive to tech) when no one works for a place for several decades. Companies weather this in different ways but attrition has always been around.
|
| What's causing people to believe that the latest round of attrition is any different?
| hkt wrote:
| I'd speculate that perhaps more senior people are moving, and/or a greater overall rate of attrition combined with much more complex technologies and organisations. In other words, it might be harder to become good at jobs now, and fewer people stick with them. Just a hunch but definitely seems to be where the incentives point with loyalty penalties and tech bloat.
| lyjackal wrote:
| noticed issues with SQS for a couple minutes. Errors from java sdk, `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
| PeterBarrett wrote:
| SQS went down for us in us-east-1 and we lost health checks on instances there. Fully recovered now.
| asah wrote:
| us-east-1 again!
| [deleted]
| amar0c wrote:
| My Aruba Instant ON APs are "offline" (orange) even tho they work and I am online. My first thought is that some Cloud went nirvana state
| etaioinshrdlu wrote:
| Does AWS have a plan to improve this region?
|
| Do they acknowledge the problem?
|
| It's been a joke for years how bad us-east-1 is.
| consumer451 wrote:
| Nuke it from orbit
|
| It's the only way to be sure
| xilni wrote:
| This is why you are strongly urged not to rely on one region or AZ.
| Johnny555 wrote:
| I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.
| tyingq wrote:
| Good advice, though AWS still has some services that don't work completely independently. Cloudfront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc). And I wish they didn't have global-looking endpoints (like https://sts.amazonaws.com) that aren't really global or resilient.
| ranman wrote:
| STS will let you use regional endpoints now, right?
| tyingq wrote:
| Yes. It's just that the "global endpoint" is misleading. They don't repoint it if it fails.
| It really shouldn't exist given that's how it functions.
| didip wrote:
| Multi AZ is great and should be by default, but multi Region is expensive.
| pid-1 wrote:
| Given the total amount of money I've lost due to a single AZ being down, it was totally worth it to NOT go multi az or multi region so far.
|
| Multi AZ isn't that hard, but generally requires extra costs (one nat gw per az, etc...)
|
| But multi region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi region setups, making things really complicated even if you IaCed your whole stack.
| evrydayhustling wrote:
| Those costs are the actual reason you are encouraged to go multi-AZ!
|
| (I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)
| systemvoltage wrote:
| Seems like it would be a conflict of interest to increase robustness of single AZ (so it never goes down or has its own redundancy) vs. increased revenues from multi AZ deployment.
|
| What's the point of cloud if we have to manage robustness of their own infrastructure? I can understand if that's due to natural disasters and earthquakes, but the idea should be that a single AZ should never go down barring extraordinary circumstances. AWS should be auto-balancing, handling downtimes of a single AZ without the customer ever noticing it.
|
| It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they're in a conflict of interest to increase robustness of their data centers - "We told you to have multi-AZ deployment, not our fault".
|
| Another way to put this: make sure, as an AWS customer, to factor a 3x multiple of all costs, plus the management of a multi-AZ deployment, into your total costs.
| thedougd wrote:
| They would simply charge for the privilege. An EC2 'always on' or whatever option that enabled your instance to live migrate between availability zones would be a nice and expensive option.
| systemvoltage wrote:
| Definitely. Then I wonder why we need the cloud :) if not for services (not EC2). Lot of mid-sized companies are re-evaluating: https://www.economist.com/business/2021/07/03/do-the-costs-o...
| m34 wrote:
| Might be true for running stuff in different regions/AZs but if the provisioning region is down (e.g. deploying lambda@edge) one does not really have an alternative
| easton wrote:
| From temuze last time:
|
| "If you're having SLA problems I feel bad for you son
|
| I got two 9 problems cuz of us-east-1"
| didip wrote:
| This is why us-east-1 is perfect for a chaos-testing, non-prod environment.
| 0xCAP wrote:
| Is us-east cursed or what?!
| csdvrx wrote:
| As usual?
| fotta wrote:
| Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.
| operator1 wrote:
| What's up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?
| super_linear wrote:
| Absolutely no way to prove this but maybe Q1 deadlines coming up and people trying to launch things and make changes?
| frays wrote:
| Increase in attrition across the industry.
|
| A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.
| fragmede wrote:
| But there's always been attrition.
| What are some of the ways things are different now that affect attrition rates and their effects?
| Spooky23 wrote:
| Things are bigger anywhere. My colleagues and I thought we were hot shit managing 5-7k applications and infrastructure. Amazon probably runs 20,000 orgs like mine.
|
| Also, times are good and rates are crazy. Even at VARs, you can make a lot of cash. I have a buddy who went from $150k to $600k. The guy paid off his mortgage and is at a point where he could burn out and work at Home Depot if he needed to.
| stone-monkey wrote:
| Probably increased salary and switch to permanent remote. Amazon is notorious for their frugality and they recently doubled their maximum salary cap to 350k. They would only have done this to stay competitive in the current job market. This implies that many of their existing employees are underpaid relative to their peers at comparable companies and they've likely seen a large uptick in attrition. Not to mention attrition begets more attrition, especially if it's "influential" employees who are leaving.
| 300bps wrote:
| Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.
|
| This outage is reportedly impacting 5 services in 1 region.
|
| For those impacted, pretty terrible. But as a heavy user of AWS, I've seen these notices posted multiple times on HN and haven't been impacted by one yet.
| quxbar wrote:
| For businesses with uptime guarantees and lots of boxes to spin up in a failover scenario, this has been a very eventful 12 months. At least that's what I'm experiencing.
| xeromal wrote:
| Russian war is another juicy possibility
| adamrezich wrote:
| told myself I'd click this submission's comments link, CTRL+F `Russia`, & quit HN for the day if anything came up, thanks for not disappointing
| xeromal wrote:
| Haha, no problemo.
___________________________________________________________________
(page generated 2022-03-09 23:00 UTC)