[HN Gopher] AWS power failure in US-EAST-1 region killed some ha...
       ___________________________________________________________________
        
       AWS power failure in US-EAST-1 region killed some hardware and
       instances
        
       Author : bratao
       Score  : 83 points
       Date   : 2021-12-23 18:34 UTC (4 hours ago)
        
 (HTM) web link (www.theregister.com)
 (TXT) w3m dump (www.theregister.com)
        
       | iJohnDoe wrote:
       | Not piling on AWS. These things happen and I'm sure everyone
       | involved is working to improve things. Yes, everyone should
       | deploy in multiple availability zones.
       | 
       | My 2 cents. Outages happen. Network glitches happen. Bad configs
       | and bad updates happen. However, power issues should not really
       | happen. One of the primary cost saving areas of going to the
       | cloud is not having to do on-prem power, such as UPS, generators,
       | maintenance etc. Not having to do on-prem cooling is another
       | thing. These should be solved from the customer's perspective
       | when going into a professional data center and are the things you
       | don't want to worry about anymore.
        
         | ksec wrote:
         | Yes. I expect software and network glitches because they are
         | complicated and error prone. But I fully expect hardware and
          | power redundancy. Even if power did go down, it shouldn't
          | "kill" the hardware. At least not at AWS, the biggest and
          | arguably best-run DC operation in the world. But it did.
          | 
          | This suggests to me that the data center, as infrastructure
          | and as a building in itself, still has plenty of room for
          | improvement.
        
         | kazen44 wrote:
          | Getting power right in a DC takes a ton of preventative
          | maintenance (doing black-building tests, making sure load tests
          | are actually run frequently, managing UPS battery health,
          | etc.).
        
           | spenczar5 wrote:
           | Yes, it's hard work... but it's work AWS should be great at.
           | Like, _really_ great at.
           | 
           | I used to work at AWS. The most worked-up I ever saw Charlie
           | Bell (the de facto head of AWS engineering) was in the weekly
           | operations meeting, going over a postmortem which described a
           | near-miss as several generators failed to start properly in a
           | power interruption. In this meeting - with nearly 1,000
           | participants! - Charlie got increasingly irate about the fact
            | that this had happened. For the next few weeks, details of
            | some sort of flushing subsystem for the backup generators
            | were a standing topic.
           | 
           | Sadly, Charlie left AWS and now works at Microsoft. I can't
           | help but wonder what that operations meeting is like now
           | without him.
        
       | CoastalCoder wrote:
       | Any idea why a power _failure_ would cause (or reveal) hardware
       | damage?
       | 
       | Leading up to Y2K, I remember concerns about spinning hard disks
       | not being able to start up again.
       | 
       | And if the power is _flaky_ with spikes and brown-outs, I
        | understand that's a problem.
       | 
       | But is either of those relevant to AWS?
        
         | dilyevsky wrote:
          | I worked for one of the largest datacenter companies in the
          | world, and a significant portion of boxes would never come up
          | after being power-cycled.
        
         | mgsouth wrote:
         | Slamming on the brakes can be hard on the electronics. It
         | causes large, high-speed voltage or current spikes that kill
         | (old, tired) components before the spikes can be clamped. Can
          | be made worse by old electrolytic caps that no longer filter
         | effectively. Especially hard on power supply diodes and
         | mosfets.
         | 
         | Here's a few references at random:
         | 
         | https://vbn.aau.dk/ws/portalfiles/portal/108034396/ESREF2013...
         | 
         | https://www.infineon.com/dgdl/s30p5.pdf?fileId=5546d46253360...
         | 
         | https://www.sciencedirect.com/science/article/pii/S003811010...
        
           | Spivak wrote:
            | Which is why, if you realize that your hard drive is failing,
            | the worst thing you can do is cut the power [1]. You get your
            | data off that box as fast as possible.
           | 
            | [1] Yes, yes: always have backups, and RAID, and replication;
            | something about not being able to count that low. Betcha
            | most'a'ya don't have any of that on your MBP though, except
            | for maybe a backup that lags hours behind your work.
        
         | dahfizz wrote:
         | https://news.ycombinator.com/item?id=29666048
         | 
          | TLDR: if electronics are near death, a power cycle is likely to
          | kick them over the edge.
        
         | SteveNuts wrote:
         | In my experience it's common when there's a full power down of
         | a large data center like this. In 2013 the company I was
          | working at had an unplanned power outage which took out 200
          | spinning drives and 18 servers. I think the motors in the
         | drives are fine running at constant speed but don't have the
         | juice to spin up from a dead stop.
         | 
         | I remember IBM was renting private planes to get us replacement
         | hardware from around the country (we were a very large
         | customer).
        
           | jbergens wrote:
            | A very long time ago I worked for a medium-sized company and
            | there was a power outage. The servers were in a room in our
            | building, but they had UPS power. The problem was that the
            | UPSes were broken/malfunctioning, and the servers needed to
            | do a disk check after the unplanned reboot. It took hours
            | before we could work again even though most drives survived.
            | This was also before git, so we could not reach the central
            | version control system and could not commit any code. I think
            | all code and workspaces were on servers that were not up yet.
        
         | wdfx wrote:
         | It's just the numbers. Eventually all the hardware will fail.
         | Some of that hardware will require a specific set of conditions
         | to fail. Such a condition may be current inrush from being
          | powercycled. Given the scale of a data centre, powercycling it
          | all at once means some of it will fail due to that condition.
         | 
          | Further to that thought, there must be hardware in that centre
          | which is failing and being replaced all the time due to other
          | conditions; we just won't hear about it because it's part of
          | normal maintenance.
        
         | emptybottle wrote:
          | It can be a number of things, but oftentimes it's the thermal
          | cycle and wear on the components.
          | 
          | After letting devices (especially ones that rarely do this)
          | cool, and then bringing them back up to temperature, some
          | percentage of components may fail or operate out of spec.
          | 
          | And mechanical parts have a similar issue. A motor that's been
          | happily running for months/years may not restart after being
          | stopped.
         | 
         | It's also possible that a power fault itself could damage
         | hardware, although in my experience this type of issue is far
         | less common.
        
         | michaelrpeskin wrote:
         | This might be old lore, but I remember many years ago that
         | someone told me that when a spinning disk has been up to
         | temperature for a long time there's a bit of lubricant that is
         | vaporized and evenly distributed everywhere (like the halides
         | in your halogen headlamps). But when it stops spinning that
         | stuff cools down and gets sticky and ends up sticking the heads
         | to the disk so that it can't get going anymore.
         | 
         | Don't know the truth to that, but it seemed like a reasonable
         | explanation at the time.
        
         | _3u10 wrote:
          | Think of it like a car with a starter wearing out: at some
          | point you turn the engine off and it never starts up again
          | until you replace the starter. Similarly bad batteries,
          | corrosion on terminals, etc. Startup is a much different
          | workload than steady running, both for electronics and, in the
          | case of HDDs, for mechanical parts.
         | 
         | I have an old fan that will need replacing one day that takes
         | about 15 minutes to reach full speed as the bearings are gone.
          | Every time I turn it off I wonder if the next time I turn it on
         | it will be time to replace it.
        
         | kube-system wrote:
         | If I were to make some WAGs about things that could generally
         | be described this way:
         | 
         | * increased failure rates of drives (or other hardware) due to
         | thermal cycling
         | 
         | * cache power failures leading to corrupted data
         | 
         | * previously unknown failures in recovery processes (like
         | firmware bugs that might be described as a "hardware problem")
         | 
         | * cooling failures leading to hardware failure
         | 
          | Most of these are pretty rare issues, but each becomes much more
          | likely to happen at least once when you're running freekin'
          | us-east-1.
        
       | zkirill wrote:
       | Have there been any incidents that affected more than one AZ?
       | 
       | AWS RDS in multi-AZ deployment gives you two availability zones.
       | Aurora gives you three. What kind of scenario would be used to
       | justify three AZ's for the purposes of high availability?
        
         | wmf wrote:
          | Modern Paxos/Raft HA needs a majority quorum, and an odd number
          | of nodes gives the best fault tolerance per node, so that's how
          | you end up with three AZs.
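          | 
          | (Rough sketch of the quorum arithmetic behind that, assuming
          | plain majority quorums; the helper functions are just for
          | illustration, not anything AWS actually ships:)
          | 
          |     def majority_quorum(n: int) -> int:
          |         # a cluster of n voters needs floor(n/2) + 1 in agreement
          |         return n // 2 + 1
          | 
          |     def tolerated_failures(n: int) -> int:
          |         return n - majority_quorum(n)
          | 
          |     for n in (2, 3, 4, 5):
          |         print(n, tolerated_failures(n))  # -> 0, 1, 1, 2
          | 
          | Two nodes tolerate zero failures and four tolerate only one,
          | same as three, so odd counts give the best fault tolerance per
          | node and three is the smallest useful size.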
        
         | Jweb_Guru wrote:
         | Aurora isn't really designed to handle the total failure of two
         | availability zones (not short term, anyway). It's designed to
         | handle one AZ + one additional node failure (which is
         | reasonably likely to happen on large instances due to data from
         | a single database being striped across up to 12,800 machines
         | per AZ). Due to how quorum systems work the "simplest" way they
         | decided to handle that was six replicas per 10 GiB segment
         | across 3 AZs, with three of the replicas being log-only and
         | three designated as full replicas (which is the lowest number
         | of full replicas you can have given their failure model, and
         | hitting that lower bound does require you to deploy across 3
         | AZs). If any three of the nodes are dead (log-only or
         | otherwise), write traffic is stopped until they can bring up a
         | fourth node, though there is some support for backup 3-of-4
         | quorums for very long-term AZ outages.
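          | 
          | (Back-of-the-envelope sketch of that quorum math, going off the
          | published Aurora design of six copies per segment with a 4/6
          | write quorum and 3/6 read/repair quorum; the code is purely
          | illustrative:)
          | 
          |     COPIES, WRITE_QUORUM, READ_QUORUM = 6, 4, 3
          | 
          |     def writable(alive: int) -> bool:
          |         return alive >= WRITE_QUORUM   # need 4 of 6 copies
          | 
          |     def repairable(alive: int) -> bool:
          |         return alive >= READ_QUORUM    # need 3 of 6 copies
          | 
          |     # lose a whole AZ (2 copies) plus one more node:
          |     alive = COPIES - 2 - 1
          |     print(writable(alive), repairable(alive))  # False True
          | 
          | So an AZ outage plus one extra node failure halts writes on the
          | affected segments, but enough copies survive to rebuild.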
        
           | zkirill wrote:
           | Amazingly insightful answer. Thank you very much!
        
       | gundmc wrote:
       | Rare to see power issues at a modern data center cause downtime.
       | All of those racks should have UPS and batteries to sustain
       | during an outage until the automatic transfer switch can fail
       | over to a redundant system or generator. Would be interested in
       | reading more about what happened here.
        
         | electroly wrote:
         | In previous postmortems[0] they've mentioned their UPSes. They
         | definitely have them. They don't seem to write a lot of
         | postmortems, though. I'm not sure whether we should expect one
         | for this event.
         | 
         | [0] https://aws.amazon.com/premiumsupport/technology/pes/
        
       | 1cvmask wrote:
        | A lot more companies will go to a multi-cloud, active-active
        | architecture, maybe even with bare-metal redundancies.
        
         | beermonster wrote:
         | As large cloud outages become more frequent and the impact
         | greater each time, I feel it's more likely people will
         | reconsider moving workloads off-prem.
        
         | [deleted]
        
         | tyingq wrote:
         | Multi-cloud is odd to me, unless you're a company selling a
         | service to cloud customers.
         | 
         | By definition, you would have to either go lowest-common-
         | denominator, or build complicated facades in front of like
         | services.
         | 
         | If you're going lowest-common-denominator, then multi-old-
         | school-hosting would be far cheaper.
        
         | ransom1538 wrote:
         | me: "Yeah! AWS went down a few times, I think we should pay
         | double!"
         | 
         | vp: no.
        
           | emodendroket wrote:
           | More than double realistically. But you could achieve most of
           | the benefit at much lower cost by going multi-AZ.
        
         | profmonocle wrote:
         | Egress fees can make replicating databases, storage buckets,
         | etc. between clouds _very_ expensive. Multi-region is a much
          | more affordable option. Multi-region outages aren't unheard of
         | among the major cloud operators, but they're less common than
         | single-AZ or single-region outages.
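          | 
          | (Toy arithmetic, assuming an illustrative $0.09/GB inter-cloud
          | egress rate and 50 TB/month of replication traffic; real rates
          | and volumes vary a lot:)
          | 
          |     tb_per_month = 50     # assumed replication volume
          |     usd_per_gb = 0.09     # assumed egress rate, varies by provider/tier
          |     print(tb_per_month * 1000 * usd_per_gb)  # ~$4,500/month for egress alone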
         | 
          | IMO, most companies just aren't so sensitive to downtime that
          | multi-AZ + multi-region deployment within a single cloud
          | provider isn't good enough for them.
        
         | psanford wrote:
         | I would stay away from any company that thinks "we need to go
         | multicloud" as a response to this outage. This affected a
         | single az in a single region. If it caused you downtime or a
         | partial outage, it means you are not fully resilient to single
         | az failures. The correct thing is to fix your application to
         | handle that.
         | 
         | If you can't handle a single az failure there is no way you are
         | going to handle failing over across different cloud providers
         | correctly.
        
           | dragonwriter wrote:
           | > If it caused you downtime or a partial outage, it means you
           | are not fully resilient to single az failures.
           | 
            | Given the number of AWS global services that have
            | dependencies on infra in US-EAST-1 (and that, judging from
            | the impacts of this and other past outages, seem vulnerable
            | to single-AZ failures in US-EAST-1), that's...less avoidable
            | for _certain_ regions/AZs than one might naively expect. Most
            | clouds seem to have at least some degree of this kind of
            | vulnerability.
        
           | CubsFan1060 wrote:
           | There will, however, be a lot of executives _talking_ about
           | going multi cloud.
        
           | nonane wrote:
           | > This affected a single az in a single region. If it caused
           | you downtime or a partial outage, it means you are not fully
           | resilient to single az failures.
           | 
           | This is not true. Amazon is not being upfront about what
           | happened here. It was simply not a single AZ failure. Our us-
           | east-1 ELB load balancers were hosed and were unable to
            | direct traffic to other AZs - they simply stopped working and
           | were dropping traffic. We tried creating load balancers in
           | different AZs and that didn't work either.
           | 
           | How can you be resilient to single AZ failures if load
            | balancers stop working region-wide during a single AZ outage?
        
             | acdha wrote:
             | Did your TAM go into any details on that? Over a couple
             | hundred load-balancers, the only issue we had was taking
             | longer to register new instances and that affected only a
              | couple of them. Running services weren't interrupted,
              | latency stayed steady, etc., which is what I'd expect for a
              | single AZ failure.
        
           | luhn wrote:
            | To be fair though, from what I heard the AZ outage caused an EC2
           | API brownout, so people couldn't launch new instances in the
           | other AZs. That put a wrench in a lot of multi-AZ
           | architectures.
           | 
           | Not advocating for multi-cloud though...
        
       | daneel_w wrote:
       | _" As is often the case with a loss of power, there may be some
       | hardware that is not recoverable..."_
       | 
        | No. Not even rarely. If they lost hardware because of this,
        | something much different than just a loss of power happened on
        | their servers' mains rails.
        
         | jacquesm wrote:
         | To forestall that reply: I did think about it before hitting
         | the reply button. I've seen a couple of large scale DC power
         | outages in my 35 years of IT work. The vast majority of the
         | hardware came through just fine. But older data centers with
         | gear that has been running uninterrupted for many years tend to
         | have at least some hardware that simply won't come up again,
         | due to a variety of reasons. One can be that the on-board
         | batteries have gone bad, which nobody noticed until the power
          | cycle. Another is that some hard drives function well as long
          | as they keep spinning but have worn a nasty little spot on the
          | bearing that keeps the spindle aloft. When one spins down,
          | depending on the orientation it may not be able to start back
          | up again; it may even crash while spinning down. Powercycle
         | enough gear that has been running for years and you will most
         | likely lose at least some small fraction of it. You could do it
         | the next day after taking care of those and everything would
         | likely be fine.
        
           | vlovich123 wrote:
           | Wouldn't a UPS + generator failover mean that the HW never
           | spins down in the first place? That's how I interpreted op's
           | statement anyway.
        
             | jacquesm wrote:
             | I've seen plenty of UPS's fail, and generator failover is
             | that moment when everybody stands there with their fingers
             | crossed.
             | 
              | None of those things are foolproof, and in case of a large-
              | scale outage, especially one that lasts more than a couple
              | of minutes, there is a fair chance that you will find the
             | limitations of some of your contingency plans.
        
             | acdha wrote:
             | Hopefully, but those aren't perfect. Large data centers
             | regularly test those because things can go wrong with any
             | of the UPS, generator[1], or the distribution hardware
             | which switches between the line and generator power. One of
             | the longer outages I've seen was when the distribution
              | hardware itself failed and burnt out multiple parts for
              | which the manufacturer did not have sufficient spares in
              | our region.
             | 
             | AWS has a lot of very skilled professionals who are quite
             | familiar with those issues so I'd be quite surprised if it
             | turned out to be something that simple but you always want
             | to have a contingency plan you've tested for what happens
             | if core infrastructure like that fails and takes multiple
             | days to recover.
             | 
             | 1. One nasty example: fuel issues which aren't immediately
             | obvious so someone doing a 5 minute test wouldn't learn
             | that the system wasn't going to handle an outage longer
             | than that.
        
         | Johnny555 wrote:
         | If my home UPS failed, I'd be awfully surprised if my home
         | fileserver that was plugged into it failed too.
         | 
         | But if I lost power to thousands of servers, I'd expect some
         | number of them to fail. I've even lost servers when losing
         | power to a single rack.
        
         | joshuamorton wrote:
          | Yes, it's a fairly well known issue at datacenter scale that if
         | you power-cycle everything, some percentage of things won't
         | turn back on (I recall something like 1% of HDDs just failing
         | to work again being the number quoted at me). This obviously
          | isn't the case for fresh new HDDs, but it is for ones that have
          | been spinning continuously for years.
         | 
         | Other hardware is similar.
        
         | jdsully wrote:
         | Starting up is hard on electronics due to the inrush current.
         | There may be marginal devices just barely operating that will
         | not survive a reboot. Things like bad capacitors aren't always
         | detectable during steady state.
         | 
         | At AWS scale it's highly likely there are more than a few of
         | these.
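          | 
          | (Toy numbers, with a made-up 0.01% per-device chance of not
          | surviving a power cycle, just to show how the scale works:)
          | 
          |     p = 0.0001      # assumed per-device cold-start failure probability
          |     n = 100_000     # assumed devices power-cycled at once
          |     expected = n * p
          |     p_any = 1 - (1 - p) ** n
          |     print(expected, p_any)  # ~10 expected failures, >99.99% chance of at least one
          | 
          | Even a tiny per-device probability becomes a near-certainty of
          | at least a few failures somewhere in the fleet.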
        
           | daneel_w wrote:
           | Did you think this over before hitting the reply button? Can
           | you imagine any server or networking brand surviving if its
           | products were at substantial risk of dying when powering up
           | or cycling? Have you ever heard of common consumer grade
           | computer equipment regularly dying from it? No? Not me
           | either.
           | 
              | It's _really_ common to find thermistors in devices to limit
           | in-rush current, even in cheap electronics. You can bet that
           | PSUs in data center equipment use them.
        
             | emodendroket wrote:
             | > Have you ever heard of common consumer grade computer
             | equipment regularly dying from powering up?
             | 
             | Yes, absolutely.
        
               | daneel_w wrote:
               | You're romanticizing here. Electronics eventually dying
                | is not the same as being at substantial risk of dying
                | due to the power-cycling itself.
        
               | emodendroket wrote:
               | No, I am not. I personally have experienced this. Pop,
               | whoops, it doesn't work anymore.
        
               | daneel_w wrote:
                | Same. What you're romanticizing is the idea that it's a
                | common and regular occurrence. It's not. It's a fluke.
        
               | emodendroket wrote:
               | Now imagine you have one of the world's largest
               | assemblages of electronic devices and they all turn on at
               | once. How likely are some flukes?
        
               | dpratt wrote:
               | And when you have 500,000 copies of something, most of
               | which have been in use for some appreciable fraction of
               | their effective lifetime, it doesn't have to be "common"
               | to occur to some fraction of them.
               | 
               | The post you're responding to never implied that it
               | happens to _most_ of them, or even many of them, just
               | that when you have a gigantic farm, it's not unreasonable
                | to see a small handful of hardware instances release
               | their magic smoke when they're all coming back from being
               | powered down.
               | 
               | MTBF is a thing.
        
               | ericd wrote:
               | A fluke x massive scale = regular occurrence.
        
               | [deleted]
        
               | [deleted]
        
             | kube-system wrote:
             | I am not sure whether or not inrush current (or thermal
             | cycling, or something else) is to blame, but server
             | equipment in data centers has been fairly well documented
             | to exhibit increased hardware failure rates after outages.
        
             | organsnyder wrote:
             | Substantial? No. A low percentage that is still large
             | enough to be impactful in a datacenter? Absolutely.
        
               | X-Istence wrote:
                | When you've got a massive datacenter packed with
                | hardware, even 0.01% is a meaningful amount of it.
        
             | sonofhans wrote:
             | It's a scale problem. We're not talking about "substantial
             | risk" or "regularly dying." At AWS scale even 0.01% cold
             | start failure is noticeable. It's not possible to guarantee
             | perfect operation of each system or component; physical
              | things wear out. If you could find a way to avoid cold-start
              | failures 100% of the time, your product would be valuable to
              | many businesses.
             | 
             | Also:
             | 
             | > Did you think this over before hitting the reply button?
             | 
             | This is ad hominem, and we should avoid it here.
        
             | jeffbee wrote:
             | > Have you ever heard of common consumer grade computer
             | equipment regularly dying from it?
             | 
             | You have outed yourself as someone with zero experience
             | operating computers, either individually or at scale.
             | Computers have moving parts, whether fans or hard disk
              | drives. Just because those _were moving_ doesn't mean they
              | will _begin moving again_ from a dead stop. There are also
              | things like dead CMOS/NVRAM batteries that can prevent
             | machines from automatically starting when power is applied.
        
             | hhh wrote:
             | I don't know if it's intentional, but your comment comes
             | off very hostile.
             | 
              | Recently we had a violent power outage where I work, due to
              | the tornadoes in the midwest. A few of our facilities have
             | industrial hardware with batteries that will last for 72
             | hours and after that point it's a world of unknowns what
             | will happen. The batteries died after a lot of effort to
             | try and restore power, but everything came up just fine.
             | 
             | However, a random Cisco switch elsewhere in the building
             | had at least one power supply fail.
             | 
             | This was in one facility where most of these have been
             | running for 4+ years straight (12 years for the system with
              | a 72-hour battery).
             | 
              | I find it hard to imagine that, at the scale of even one
              | AZ, it wouldn't be possible for at least one system to have
              | this happen.
        
             | acdha wrote:
             | > Did you think this over before hitting the reply button?
             | Can you imagine any server or networking brand surviving if
             | its products were at substantial risk of dying when
             | powering up or cycling?
             | 
             | This is hostile enough that I'd have trouble squaring it
             | with the site guidelines.
             | 
             | It's especially bad because, as others have been saying,
             | your belief isn't supported by real lived experience for
             | many of us. A data center has enough devices that even a
             | low probability error rate will fairly reliably happen, and
             | that doesn't usually hurt the manufacturer's reputation
             | because they never promised 100% and will replace it under
             | warranty. I've seen this with hard drives especially but
             | also things like power supplies and on-board batteries, and
             | even things like motherboards where a cooling/heating cycle
             | was enough to unseat RAM or cause hairline fractures in
             | solder traces.
             | 
             | This can be especially bad with unplanned power outages if
             | the power doesn't instantly go out and stay out for the
              | duration. Especially for hard drives, hitting the power
              | up/down cycle a few times was a good way to get extra
             | failures.
             | 
              | Unscheduled power outages can be even worse.
        
               | jacquesm wrote:
               | I've seen even planned outages and tests result in failed
               | hardware.
        
               | acdha wrote:
               | Ditto -- you never do a substantial shutdown without
               | learning something.
        
         | nicolaslem wrote:
         | It is common for failing hard drives to continue running until
          | they are power cycled, at which point they never start again.
        
           | X-Istence wrote:
           | Having had to restore a NetApp cluster from backup because
           | the disks would no longer spin up after having been running
           | for 4+ years... yup. The hardware was fully working and
           | according to all internal metrics was still perfectly good,
           | but a power outage and they never came back up.
        
         | X-Istence wrote:
          | Having worked in a datacenter where servers had been running
          | for 5+ years in racks, when the datacenter went dark due to a
          | UPS going up in flames there were definitely systems that did
          | not come back online.
         | 
          | Most notably we had a large NetApp array that would not boot;
          | once we replaced the controllers we found we had lost 13 out of
          | the 60 hard drives in the array. They would no longer spin up.
          | Like, physically seized. Because they had been spinning for so
          | long they would likely have kept spinning just fine, but with
          | the power gone, they were done.
         | 
         | Fans are another fun one, anything with ball bearings really.
          | Power supplies are another issue: due to the sudden large
          | inrush of power when the switch was flipped back on, some of
          | them had their capacitors go up in smoke.
         | 
          | This is not rare; this is a common occurrence. When you have an
          | absolutely massive footprint with thousands upon thousands of
          | servers that have been running for a long time, there will be
         | things that just don't come back once they have stopped running
         | or when the electricity is gone.
        
       ___________________________________________________________________
       (page generated 2021-12-23 23:01 UTC)