[HN Gopher] AWS power failure in US-EAST-1 region killed some ha...
___________________________________________________________________

AWS power failure in US-EAST-1 region killed some hardware and instances

Author : bratao
Score  : 83 points
Date   : 2021-12-23 18:34 UTC (4 hours ago)

(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)

| iJohnDoe wrote:
| Not piling on AWS. These things happen and I'm sure everyone
| involved is working to improve things. Yes, everyone should
| deploy in multiple availability zones.
|
| My 2 cents. Outages happen. Network glitches happen. Bad configs
| and bad updates happen. However, power issues should not really
| happen. One of the primary cost-saving areas of going to the
| cloud is not having to do on-prem power, such as UPS, generators,
| maintenance, etc. Not having to do on-prem cooling is another
| thing. These should be solved from the customer's perspective
| when going into a professional data center and are the things you
| don't want to worry about anymore.
| ksec wrote:
| Yes. I expect software and network glitches because they are
| complicated and error-prone. But I fully expect hardware and
| power redundancy. Even if they were down it shouldn't "kill"
| them. At least not on AWS, the biggest and arguably best DC
| operator in the world. But it did.
|
| This suggests to me that the data center, as infrastructure and
| as a building in itself, still has plenty of room for improvement.
| kazen44 wrote:
| Getting power right in a DC takes a ton of preventative
| maintenance (doing black-building tests, making sure working
| load tests occur frequently, managing UPS battery health, etc.).
| spenczar5 wrote:
| Yes, it's hard work... but it's work AWS should be great at.
| Like, _really_ great at.
|
| I used to work at AWS. The most worked-up I ever saw Charlie
| Bell (the de facto head of AWS engineering) was in the weekly
| operations meeting, going over a postmortem which described a
| near-miss as several generators failed to start properly in a
| power interruption. In this meeting - with nearly 1,000
| participants! - Charlie got increasingly irate about the fact
| that this had happened. For the next few weeks, details of
| some sort of flushing subsystem for backup generators were an
| every-time topic.
|
| Sadly, Charlie left AWS and now works at Microsoft. I can't
| help but wonder what that operations meeting is like now
| without him.
| CoastalCoder wrote:
| Any idea why a power _failure_ would cause (or reveal) hardware
| damage?
|
| Leading up to Y2K, I remember concerns about spinning hard disks
| not being able to start up again.
|
| And if the power is _flaky_ with spikes and brown-outs, I
| understand that's a problem.
|
| But is either of those relevant to AWS?
| dilyevsky wrote:
| I worked for one of the largest datacenter companies in the
| world and a significant portion of boxes would never come up
| after being power-cycled.
| mgsouth wrote:
| Slamming on the brakes can be hard on the electronics. It
| causes large, high-speed voltage or current spikes that kill
| (old, tired) components before the spikes can be clamped. Can
| be made worse by old electrolytic caps that no longer filter
| effectively. Especially hard on power supply diodes and
| MOSFETs.
|
| Here are a few references at random:
|
| https://vbn.aau.dk/ws/portalfiles/portal/108034396/ESREF2013...
|
| https://www.infineon.com/dgdl/s30p5.pdf?fileId=5546d46253360...
|
| https://www.sciencedirect.com/science/article/pii/S003811010...
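To put rough numbers on the spikes described above, here is a minimal
back-of-envelope sketch (Python; every component value below is an
assumption chosen for illustration, not a measurement from the article
or from any AWS hardware). Real PSUs limit this inrush with NTC
thermistors or soft-start circuits, but an aged or marginal part still
sees the stress on every cold start:

    # Worst-case inrush into a fully discharged bulk capacitor versus the
    # current the same supply draws in steady state. Illustrative values only.
    V_PEAK = 325.0     # volts, rectified peak of a 230 V AC mains supply
    R_SERIES = 2.0     # ohms, assumed ESR + wiring + bridge resistance
    C_BULK = 470e-6    # farads, assumed bulk capacitor
    P_STEADY = 500.0   # watts, assumed server draw once running
    V_RMS = 230.0      # volts RMS

    i_inrush = V_PEAK / R_SERIES        # capacitor looks like a short at t=0
    i_steady = P_STEADY / V_RMS         # rough RMS input current while running
    tau_ms = R_SERIES * C_BULK * 1e3    # charging time constant, milliseconds

    print(f"peak inrush   ~{i_inrush:.0f} A")   # ~163 A with these numbers
    print(f"steady state  ~{i_steady:.1f} A")   # ~2.2 A
    print(f"time constant ~{tau_ms:.2f} ms")    # ~0.94 ms

Nearly two orders of magnitude between the start-up transient and
normal operation is the kind of stress that finds the weakest diode or
capacitor in a rack.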
| Spivak wrote:
| Which is why, if you realize that your hard drive is failing,
| the worst thing you can do is cut the power [1]. You get your
| data off that box as fast as possible.
|
| [1] Yes, yes, always have backups, and RAID, and replication...
| something about not being able to count that low. Betcha
| most'a'ya don't have any of that on your MBP though, except
| for maybe a backup that lags hours behind your work.
| dahfizz wrote:
| https://news.ycombinator.com/item?id=29666048
|
| TLDR if electronics are near-death, a power cycle is likely to
| kick them over the edge.
| SteveNuts wrote:
| In my experience it's common when there's a full power-down of
| a large data center like this. In 2013 the company I was
| working at had an unplanned power outage which took out 200
| spinning drives and 18 servers. I think the motors in the
| drives are fine running at constant speed but don't have the
| juice to spin up from a dead stop.
|
| I remember IBM was renting private planes to get us replacement
| hardware from around the country (we were a very large
| customer).
| jbergens wrote:
| A very long time ago I worked for a medium-sized company and
| there was a power outage. The servers were in a room in our
| building but they had UPS power. The problem was that the
| UPSes were broken/malfunctioning and the servers needed to do
| a disk check after an unplanned reboot. It took hours before
| we could work again even though most drives survived. This
| was also before git, so we could not reach the central version
| control system and could not commit any code. I think all
| code and workspaces were on servers that were not up yet.
| wdfx wrote:
| It's just the numbers. Eventually all the hardware will fail.
| Some of that hardware will require a specific set of conditions
| to fail. Such a condition may be current inrush from being
| power-cycled. Given the scale of a data centre, and power-cycling
| it all at once, some will fail due to that condition.
|
| Further to that thought, there must be hardware in that centre
| which is failing and being replaced all the time due to other
| conditions; we just won't hear about it because it's part of
| normal maintenance.
| emptybottle wrote:
| It can be a number of things, but oftentimes it's the thermal
| cycle and wear on the components.
|
| After letting devices (especially ones that rarely do this)
| cool, and then bringing them back up to temperature, some
| percentage of components may fail or operate out of spec.
|
| And mechanical parts have a similar issue. A motor that's been
| happily running for months/years may not restart after being
| stopped.
|
| It's also possible that a power fault itself could damage
| hardware, although in my experience this type of issue is far
| less common.
| michaelrpeskin wrote:
| This might be old lore, but I remember many years ago that
| someone told me that when a spinning disk has been up to
| temperature for a long time there's a bit of lubricant that is
| vaporized and evenly distributed everywhere (like the halides
| in your halogen headlamps). But when it stops spinning that
| stuff cools down and gets sticky and ends up sticking the heads
| to the disk so that it can't get going anymore.
|
| Don't know the truth to that, but it seemed like a reasonable
| explanation at the time.
| _3u10 wrote:
| Think of it like a car with a starter wearing out: at some
| point you turn the engine off and it never starts up again,
| until you replace the starter.
| Similarly, bad batteries, corrosion on terminals, etc. Startup
| is a much different workload on electronics (and, in the case
| of HDDs, on mechanical devices) than running.
|
| I have an old fan that will need replacing one day that takes
| about 15 minutes to reach full speed as the bearings are gone.
| Every time I turn it off I wonder if the next time I turn it on
| it will be time to replace it.
| kube-system wrote:
| If I were to make some WAGs about things that could generally
| be described this way:
|
| * increased failure rates of drives (or other hardware) due to
| thermal cycling
|
| * cache power failures leading to corrupted data
|
| * previously unknown failures in recovery processes (like
| firmware bugs that might be described as a "hardware problem")
|
| * cooling failures leading to hardware failure
|
| Most of these are pretty rare issues, but each becomes more
| likely to happen at least once when you're running freakin'
| us-east-1.
| zkirill wrote:
| Have there been any incidents that affected more than one AZ?
|
| AWS RDS in multi-AZ deployment gives you two availability zones.
| Aurora gives you three. What kind of scenario would be used to
| justify three AZs for the purposes of high availability?
| wmf wrote:
| Modern Paxos/Raft HA requires an odd number of nodes, so that's
| how you end up with three AZs.
| Jweb_Guru wrote:
| Aurora isn't really designed to handle the total failure of two
| availability zones (not short term, anyway). It's designed to
| handle one AZ + one additional node failure (which is
| reasonably likely to happen on large instances due to data from
| a single database being striped across up to 12,800 machines
| per AZ). Due to how quorum systems work, the "simplest" way they
| decided to handle that was six replicas per 10 GiB segment
| across 3 AZs, with three of the replicas being log-only and
| three designated as full replicas (which is the lowest number
| of full replicas you can have given their failure model, and
| hitting that lower bound does require you to deploy across 3
| AZs). If any three of the nodes are dead (log-only or
| otherwise), write traffic is stopped until they can bring up a
| fourth node, though there is some support for backup 3-of-4
| quorums for very long-term AZ outages.
| zkirill wrote:
| Amazingly insightful answer. Thank you very much!
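A minimal sketch of the quorum arithmetic in that answer (illustrative
Python, not Aurora's actual code; the 4-of-6 write and 3-of-6 read
thresholds follow the published Aurora design, while the AZ names and
helper functions here are made up):

    # One 6-replica segment spread two-per-AZ across three AZs.
    REPLICAS_PER_AZ = {"az-a": 2, "az-b": 2, "az-c": 2}
    WRITE_QUORUM = 4   # writes need 4 of 6 healthy replicas
    READ_QUORUM = 3    # reads/recovery need 3 of 6

    def surviving(failed_azs, extra_failures=0):
        """Replicas still healthy after losing whole AZs plus N more nodes."""
        alive = sum(n for az, n in REPLICAS_PER_AZ.items() if az not in failed_azs)
        return alive - extra_failures

    def status(alive):
        total = sum(REPLICAS_PER_AZ.values())
        writes = "ok" if alive >= WRITE_QUORUM else "stopped"
        reads = "ok" if alive >= READ_QUORUM else "stopped"
        return f"{alive}/{total} healthy -> writes {writes}, reads {reads}"

    print("one AZ down:           ", status(surviving({"az-a"})))          # 4/6
    print("one AZ down + one node:", status(surviving({"az-a"}, 1)))       # 3/6
    print("two AZs down:          ", status(surviving({"az-a", "az-b"})))  # 2/6

Losing one AZ plus one more node (the "AZ+1" case above) keeps a read
quorum, which is what lets the system rebuild a fourth copy and resume
writes; losing two whole AZs drops below even the read quorum, matching
the "not designed to handle two AZs" caveat.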
| gundmc wrote:
| Rare to see power issues at a modern data center cause downtime.
| All of those racks should have UPS and batteries to sustain the
| load during an outage until the automatic transfer switch can
| fail over to a redundant system or generator. Would be
| interested in reading more about what happened here.
| electroly wrote:
| In previous postmortems[0] they've mentioned their UPSes. They
| definitely have them. They don't seem to write a lot of
| postmortems, though. I'm not sure whether we should expect one
| for this event.
|
| [0] https://aws.amazon.com/premiumsupport/technology/pes/
| 1cvmask wrote:
| A lot more companies will go to a multi-cloud active-active
| architecture with maybe even bare-metal redundancies.
| beermonster wrote:
| As large cloud outages become more frequent and the impact
| greater each time, I feel it's more likely people will
| reconsider moving workloads off-prem.
| [deleted]
| tyingq wrote:
| Multi-cloud is odd to me, unless you're a company selling a
| service to cloud customers.
|
| By definition, you would have to either go lowest-common-
| denominator, or build complicated facades in front of like
| services.
|
| If you're going lowest-common-denominator, then multi-old-
| school-hosting would be far cheaper.
| ransom1538 wrote:
| me: "Yeah! AWS went down a few times, I think we should pay
| double!"
|
| vp: no.
| emodendroket wrote:
| More than double, realistically. But you could achieve most of
| the benefit at much lower cost by going multi-AZ.
| profmonocle wrote:
| Egress fees can make replicating databases, storage buckets,
| etc. between clouds _very_ expensive. Multi-region is a much
| more affordable option. Multi-region outages aren't unheard of
| among the major cloud operators, but they're less common than
| single-AZ or single-region outages.
|
| IMO, most companies just aren't so sensitive to downtime that
| multi-AZ + multi-region deployment within a single cloud
| provider isn't good enough.
| psanford wrote:
| I would stay away from any company that thinks "we need to go
| multi-cloud" as a response to this outage. This affected a
| single AZ in a single region. If it caused you downtime or a
| partial outage, it means you are not fully resilient to single
| AZ failures. The correct thing is to fix your application to
| handle that.
|
| If you can't handle a single AZ failure, there is no way you are
| going to handle failing over across different cloud providers
| correctly.
| dragonwriter wrote:
| > If it caused you downtime or a partial outage, it means you
| are not fully resilient to single AZ failures.
|
| Given the number of AWS global services that have
| dependencies on infra in US-EAST-1 (and, from the impacts of
| this and other past outages, seemingly vulnerable to single-AZ
| failures in US-EAST-1), that's... less avoidable for _certain_
| regions/AZs than one might naively expect. Most clouds seem
| to have at least some degree of this kind of vulnerability.
| CubsFan1060 wrote:
| There will, however, be a lot of executives _talking_ about
| going multi-cloud.
| nonane wrote:
| > This affected a single AZ in a single region. If it caused
| you downtime or a partial outage, it means you are not fully
| resilient to single AZ failures.
|
| This is not true. Amazon is not being upfront about what
| happened here. It was simply not a single-AZ failure. Our
| us-east-1 ELB load balancers were hosed and were unable to
| direct traffic to other AZs - they simply stopped working and
| were dropping traffic. We tried creating load balancers in
| different AZs and that didn't work either.
|
| How can you be resilient to single-AZ failures if load
| balancers stop working region-wide during a single-AZ outage?
| acdha wrote:
| Did your TAM go into any details on that? Over a couple
| hundred load balancers, the only issue we had was taking
| longer to register new instances, and that affected only a
| couple of them. Running services weren't interrupted, latency
| remained steady, etc., which is what I'd expect for a
| single-AZ failure.
| luhn wrote:
| To be fair though, from what I heard the AZ outage caused an
| EC2 API brownout, so people couldn't launch new instances in
| the other AZs. That put a wrench in a lot of multi-AZ
| architectures.
|
| Not advocating for multi-cloud though...
| daneel_w wrote:
| _"As is often the case with a loss of power, there may be some
| hardware that is not recoverable..."_
|
| No. Not even rarely. If they lost hardware because of this,
| something much different than just loss of power happened on
| their servers' mains rails.
| jacquesm wrote:
| To forestall that reply: I did think about it before hitting
| the reply button.
| I've seen a couple of large-scale DC power outages in my 35
| years of IT work. The vast majority of the hardware came
| through just fine. But older data centers with gear that has
| been running uninterrupted for many years tend to have at least
| some hardware that simply won't come up again, due to a variety
| of reasons. One can be that the on-board batteries have gone
| bad, which nobody noticed until the power cycle. Another is
| that some hard drives function well as long as they keep
| spinning, but have worn a nasty little spot on the bearing that
| keeps the spindle aloft. When such a drive spins down,
| depending on the orientation it may not be able to start back
| up again; it may even crash while spinning down. Power-cycle
| enough gear that has been running for years and you will most
| likely lose at least some small fraction of it. You could do it
| again the next day after taking care of those and everything
| would likely be fine.
| vlovich123 wrote:
| Wouldn't a UPS + generator failover mean that the HW never
| spins down in the first place? That's how I interpreted the
| OP's statement anyway.
| jacquesm wrote:
| I've seen plenty of UPSes fail, and generator failover is
| that moment when everybody stands there with their fingers
| crossed.
|
| None of those things are foolproof, and in the case of a
| large-scale outage, especially one that lasts more than a
| couple of minutes, there is a fair chance that you will find
| the limitations of some of your contingency plans.
| acdha wrote:
| Hopefully, but those aren't perfect. Large data centers
| regularly test those because things can go wrong with any
| of the UPS, generator[1], or the distribution hardware
| which switches between the line and generator power. One of
| the longer outages I've seen was when the distribution
| hardware itself failed and burnt out multiple parts for
| which the manufacturer did not have sufficient spares in
| our region.
|
| AWS has a lot of very skilled professionals who are quite
| familiar with those issues, so I'd be quite surprised if it
| turned out to be something that simple, but you always want
| to have a contingency plan you've tested for what happens
| if core infrastructure like that fails and takes multiple
| days to recover.
|
| 1. One nasty example: fuel issues which aren't immediately
| obvious, so someone doing a 5-minute test wouldn't learn
| that the system wasn't going to handle an outage longer
| than that.
| Johnny555 wrote:
| If my home UPS failed, I'd be awfully surprised if my home
| fileserver that was plugged into it failed too.
|
| But if I lost power to thousands of servers, I'd expect some
| number of them to fail. I've even lost servers when losing
| power to a single rack.
| joshuamorton wrote:
| Yes, it's a fairly well-known issue at datacenter scale that if
| you power-cycle everything, some percentage of things won't
| turn back on (I recall something like 1% of HDDs just failing
| to work again being the number quoted at me). This obviously
| isn't the case for fresh new HDDs, but it is for ones that have
| been spinning continuously for years.
|
| Other hardware is similar.
| jdsully wrote:
| Starting up is hard on electronics due to the inrush current.
| There may be marginal devices just barely operating that will
| not survive a reboot. Things like bad capacitors aren't always
| detectable during steady state.
|
| At AWS scale it's highly likely there are more than a few of
| these.
| daneel_w wrote:
| Did you think this over before hitting the reply button?
| Can you imagine any server or networking brand surviving if its
| products were at substantial risk of dying when powering up
| or cycling? Have you ever heard of common consumer-grade
| computer equipment regularly dying from it? No? Not me
| either.
|
| It's _really_ common to find thyristors in devices to limit
| in-rush current, even in cheap electronics. You can bet that
| PSUs in data center equipment use them.
| emodendroket wrote:
| > Have you ever heard of common consumer-grade computer
| equipment regularly dying from powering up?
|
| Yes, absolutely.
| daneel_w wrote:
| You're romanticizing here. Electronics eventually dying
| is not the same as being at substantial risk of dying
| due to power-cycling itself.
| emodendroket wrote:
| No, I am not. I personally have experienced this. Pop,
| whoops, it doesn't work anymore.
| daneel_w wrote:
| Same. What you're romanticizing about is it being a
| common and regular occurrence. It's not. It's a fluke.
| emodendroket wrote:
| Now imagine you have one of the world's largest
| assemblages of electronic devices and they all turn on at
| once. How likely are some flukes?
| dpratt wrote:
| And when you have 500,000 copies of something, most of
| which have been in use for some appreciable fraction of
| their effective lifetime, it doesn't have to be "common"
| to occur to some fraction of them.
|
| The post you're responding to never implied that it
| happens to _most_ of them, or even many of them, just
| that when you have a gigantic farm, it's not unreasonable
| to see a small handful of hardware instances release
| their magic smoke when they're all coming back from being
| powered down.
|
| MTBF is a thing.
| ericd wrote:
| A fluke x massive scale = regular occurrence.
| [deleted]
| [deleted]
| kube-system wrote:
| I am not sure whether or not inrush current (or thermal
| cycling, or something else) is to blame, but server
| equipment in data centers has been fairly well documented
| to exhibit increased hardware failure rates after outages.
| organsnyder wrote:
| Substantial? No. A low percentage that is still large
| enough to be impactful in a datacenter? Absolutely.
| X-Istence wrote:
| When you've got a massive datacenter packed, even 0.01%
| is a meaningful amount of hardware.
| sonofhans wrote:
| It's a scale problem. We're not talking about "substantial
| risk" or "regularly dying." At AWS scale even 0.01%
| cold-start failure is noticeable. It's not possible to
| guarantee perfect operation of each system or component;
| physical things wear out. If you can find a way to avoid
| cold-start failure 100% of the time, your product would be
| valuable to many businesses.
|
| Also:
|
| > Did you think this over before hitting the reply button?
|
| This is ad hominem, and we should avoid it here.
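The "fluke x massive scale" point is just expected-value arithmetic. A
quick sketch using the kinds of figures quoted above (the per-device
cold-start failure rates and fleet sizes are the commenters'
hypotheticals, not AWS data):

    # Expected failures and the chance of at least one, for a simultaneous
    # cold start of a whole fleet. Illustrative numbers from the thread.
    fleet_sizes = [10_000, 100_000, 500_000]
    failure_rates = [0.0001, 0.01]   # 0.01% and 1% per device per cold start

    for n in fleet_sizes:
        for p in failure_rates:
            expected = n * p
            p_at_least_one = 1 - (1 - p) ** n
            print(f"{n:>7,} devices at {p:.2%}: "
                  f"expect ~{expected:,.0f} failures, "
                  f"P(at least one) = {p_at_least_one:.4f}")

Even at the optimistic 0.01% rate, powering a large fleet back up all
at once makes at least a few dead boxes very likely, and at us-east-1
scale it is close to a statistical certainty.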
| jeffbee wrote:
| > Have you ever heard of common consumer-grade computer
| equipment regularly dying from it?
|
| You have outed yourself as someone with zero experience
| operating computers, either individually or at scale.
| Computers have moving parts, whether fans or hard disk
| drives. Just because those _were moving_ doesn't mean they
| will _begin moving again_ from a dead stop. There are also
| things like dead CMOS/NVRAM batteries that can prevent
| machines from automatically starting when power is applied.
| hhh wrote:
| I don't know if it's intentional, but your comment comes
| off very hostile.
|
| Recently we had a violent power outage where I work, due to
| the tornadoes in the Midwest. A few of our facilities have
| industrial hardware with batteries that will last for 72
| hours, and after that point it's a world of unknowns what
| will happen. The batteries died after a lot of effort to
| try and restore power, but everything came up just fine.
|
| However, a random Cisco switch elsewhere in the building
| had at least one power supply fail.
|
| This was in one facility where most of these have been
| running for 4+ years straight (12 years for the system with
| a 72-hour battery).
|
| I find it hard to imagine that, at the scale of even one AZ,
| it wouldn't be possible for at least one system to have this
| happen.
| acdha wrote:
| > Did you think this over before hitting the reply button?
| Can you imagine any server or networking brand surviving if
| its products were at substantial risk of dying when
| powering up or cycling?
|
| This is hostile enough that I'd have trouble squaring it
| with the site guidelines.
|
| It's especially bad because, as others have been saying,
| your belief isn't supported by real lived experience for
| many of us. A data center has enough devices that even a
| low-probability failure will fairly reliably happen, and
| that doesn't usually hurt the manufacturer's reputation
| because they never promised 100% and will replace it under
| warranty. I've seen this with hard drives especially, but
| also things like power supplies and on-board batteries, and
| even things like motherboards where a cooling/heating cycle
| was enough to unseat RAM or cause hairline fractures in
| solder traces.
|
| This can be especially bad with unplanned power outages if
| the power doesn't instantly go out and stay out for the
| duration. Especially for hard drives, hitting the power
| up/down cycle a few times was a good way to have extra
| failures.
|
| Unscheduled power outages can be even worse.
| jacquesm wrote:
| I've seen even planned outages and tests result in failed
| hardware.
| acdha wrote:
| Ditto -- you never do a substantial shutdown without
| learning something.
| nicolaslem wrote:
| It is common for failing hard drives to continue running until
| they are power-cycled, at which point they never start again.
| X-Istence wrote:
| Having had to restore a NetApp cluster from backup because
| the disks would no longer spin up after having been running
| for 4+ years... yup. The hardware was fully working and
| according to all internal metrics was still perfectly good,
| but a power outage and they never came back up.
| X-Istence wrote:
| Having worked in a datacenter where servers were running for
| 5+ years in racks, when the datacenter went dark due to a UPS
| going up in flames there were definitely systems that did not
| come back online.
|
| Most notably we had a large NetApp array that would not boot;
| once we replaced the controllers we had lost 13 out of the 60
| hard drives in the array. They would no longer spin up. Like,
| physically seized. Because they had been spinning for so long
| they would have likely kept spinning just fine, but with the
| power gone, they were done.
|
| Fans are another fun one, anything with ball bearings really.
| Power supplies are another issue: due to the sudden large
| inrush of power when the switch was flipped back on, some
| power supplies had their capacitors go up in smoke.
|
| This is not rare; this is a common occurrence.
| When you have an absolutely massive footprint with thousands
| upon thousands of servers that have been running for a long
| time, there will be things that just don't come back once they
| have stopped running or when the electricity is gone.
___________________________________________________________________
(page generated 2021-12-23 23:01 UTC)