[HN Gopher] Tell HN: AWS appears to be down again
       ___________________________________________________________________
        
       Tell HN: AWS appears to be down again
        
       Console is flickering between "website is unavailable" and being up
       for my team. This is happening very frequently just now;
       reliability seems to have taken a hit.
        
       Author : riknox
       Score  : 793 points
       Date   : 2021-12-22 12:21 UTC (10 hours ago)
        
       | RobertKerans wrote:
       | Assuming crates.io is AWS-backed? Getting a fun situation
       | where direct dependencies of an application are downloading
       | but then the sub-dependencies aren't.
        
         | lukeqsee wrote:
         | crates.io is directly hosted on GitHub, but I'm sure some
         | dependencies use S3 or other AWS services for things.
        
           | RobertKerans wrote:
           | Yep, S3 possibly the villain here
        
             | withinboredom wrote:
              | I wonder if there's an S3-compatible service with
              | similar pricing that can be used as a fallback? Are
              | DigitalOcean's S3-compatible storage accounts backed by
              | real S3?
        
               | Ancapistani wrote:
               | Would Wasabi.com meet your requirements?
               | 
               | I'm not affiliated with them, and haven't even really
               | used them other than to explore a bit. They come highly
               | recommended by my acquaintances, though.
        
               | RobertKerans wrote:
                | afaik there's nothing tying it specifically to GH
                | (where the metadata is), and the actual code is just
                | in an S3 bucket, so in theory it should be reasonably
                | easy [ha!] to just host anywhere. In theory, I mean:
                | that's a massive lump of stuff, and surely wherever
                | it gets hosted is going to face exactly the same
                | issues (though if it does become very widely used,
                | you'd think every major provider that controls infra
                | could easily host a mirror)
        
             | RobertKerans wrote:
             | Ah, back to normal now. Getting intermittent flickers on
             | some of our apps but all _seems_ solid-ish again
        
           | pietroalbini wrote:
           | The crates.io index is hosted on GitHub, but the
           | application/API is hosted on Heroku (so in the us-east-1 AWS
           | region) and the downloads on S3/CloudFront. And yes crates.io
           | is currently impacted.
        
         | mwcampbell wrote:
         | Yeah, and I can't publish a crate.
        
       | CaptRon wrote:
       | At least HN works.
        
       | sh4un wrote:
       | Damn you, all eggs in one basket.
        
       | potas wrote:
       | Slack seems to have some issues because of that - I'm not sure
       | if anyone is receiving messages, as it has been completely
       | silent for the last 15 minutes or so.
        
         | jenoer wrote:
         | Sending and receiving messages works here, but editing them
         | does not, it throws an error. Statuses such as "calling" also
         | do not seem to be updated any longer.
         | 
         | Edit: Restarting Slack _does_ update the edited messages.
         | 
         | Edit 15:24 CET: Slack is back up.
        
           | jakub_g wrote:
           | Same: only normal text seems kinda working
           | 
           | - edits failing or working with big lag;
           | 
           | - "Threads" view slow;
           | 
           | - can't emoji-react;
           | 
           | - can't upload images;
           | 
           | - people also say they can't join new channels.
        
             | [deleted]
        
         | oneeyedpigeon wrote:
         | New messages seem to be ok for me, but editing old ones and
         | uploading images both seem to be broken right now.
        
         | Pandabob wrote:
         | Uploading images doesn't work for me.
        
         | jakub_g wrote:
         | https://status.slack.com/2021-12/a17eae991fdc437d
         | 
         | > We are experiencing issues with file uploads, message
         | editing, and other services. We're currently investigating the
         | issue and will provide a status update once we have more
         | information.
         | 
         | > Dec 22, 1:58 PM GMT+1
        
         | darkwater wrote:
         | I fail to understand how a big player like Slack can be
         | impacted this way by a failure in a single AZ in a specific AWS
         | region. But at least the main feature (sending and displaying
         | messages) is still working.
        
         | aden1ne wrote:
         | I can't edit messages, nor create channels. Messages are only
         | received with a several minute delay.
        
       | hnarn wrote:
       | Is there a history of AWS downtimes available somewhere? This
       | makes what, three times in as many months?
       | 
       | edit: The question isn't necessarily AWS specific, just any data
       | on amount of downtime per cloud provider on a timeline would be
       | nice.
        
         | LuciusVerus wrote:
          | I'd say three times in as many weeks, give or take
        
         | MatteoFrigo wrote:
         | I don't know about AWS, but both Google Cloud and Oracle Cloud
         | maintain at least a high level history of past outages. See
         | https://status.cloud.google.com/summary and
         | https://ocistatus.oraclecloud.com/history
        
           | dijit wrote:
           | Given the hilariously awful reputation of the AWS status page
           | I would hazard a guess that such a page would also be
           | incredibly inaccurate.
           | 
           | If you can't even admit you're having an issue how can you
           | keep an accurate record?
        
             | cassianoleal wrote:
              | Similar with GCP. We had a pretty bad outage once where
              | the status page was showing all green. Google informed
              | us that because the actual issue was further down the
              | stack and didn't trigger any internal SLOs, the status
              | didn't get an update. It took them hours to acknowledge
              | and fix it.
        
               | dijit wrote:
               | Assuming you have a support contract the rep should send
               | out a post-mortem page.
               | 
               | This is what happens when we've been affected by outages
               | (even without involving support).
        
               | cassianoleal wrote:
                | I think they did eventually, but it took us quite a
                | bit of troubleshooting, then creating a P1 ticket,
                | then their investigation to get to the bottom of it
                | and get it fixed. And the status page never got an
                | update, which is the subject I was adding to.
        
         | colinbartlett wrote:
         | I have tons of this kind of data due to my side project,
         | StatusGator. For some services like the big cloud providers I
         | have data going back 7 years.
         | 
         | There indeed has been an uptick in AWS outages recently. You
         | can see a bit of the history here:
         | https://statusgator.com/services/amazon-web-services
        
           | exikyut wrote:
           | (I was idly curious. It appears this data is available as
           | part of the ~US$280/mo tier, along with a bunch of other
           | things.)
        
         | spmurrayzzz wrote:
         | This is a little more broad, beyond just cloud infra providers,
         | but includes some of the kind of data you're looking for (post-
         | mortems for outage events): https://github.com/danluu/post-
         | mortems
        
       | sydthrowaway wrote:
       | Switch to Azure
        
       | omosubi wrote:
       | I do wonder if the great resignation has anything to do with
       | this. My team (no affiliation with Amazon) was cut in half from
       | last year and we are struggling to keep up with all the work
        
       | fipar wrote:
       | https://downdetector.com/status/aws-amazon-web-services/
        
       | JCM9 wrote:
       | AWS didn't "go down". They had an outage in one AZ, which is why
       | there are multiple AZs in each region. If your app went down then
       | you should be blaming your developers on this one, not AWS. Those
       | having issues are discovering gaps in their HA designs.
       | 
       | Obviously it's not good for an AZ to go down, but it does
       | happen, and that's why any production workload should be
       | architected for seamless failover and recovery to other AZs,
       | typically by just dropping nodes in the down AZ.
       | 
       | People commenting that servers shouldn't go down etc. don't
       | understand how true HA architectures work. You should expect
       | and build for stuff to fail like this. Otherwise it's like
       | complaining that you lost data because a disk failed. Disks
       | fail... build architecture where that won't take you down.
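       | 
       | To illustrate what "build for stuff to fail" can mean (a
       | minimal boto3 sketch, not anyone's actual setup; the group
       | name and subnet IDs are made up): an auto scaling group
       | spread over subnets in three AZs, with ELB health checks, so
       | losing one AZ costs a third of capacity instead of the app.
       | 
       |     import boto3
       | 
       |     asg = boto3.client("autoscaling", region_name="us-east-1")
       |     asg.create_auto_scaling_group(
       |         AutoScalingGroupName="web",  # hypothetical
       |         LaunchTemplate={"LaunchTemplateName": "web-lt"},
       |         MinSize=3,
       |         MaxSize=9,
       |         # one subnet per AZ; lose an AZ, keep two thirds
       |         VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",
       |         HealthCheckType="ELB",
       |         HealthCheckGracePeriod=120,
       |     )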
        
         | [deleted]
        
         | boudin wrote:
          | Issues are across all of us-east-1, not one AZ.
          | 
          | Load balancers are not doing well at all. The only way to
          | avoid an outage in this case is to be cross-region or
          | cross-cloud, which is quite a bit more complex to handle
          | and requires more resources to do well.
          | 
          | And I hope that nobody is listening to your blaming and
          | finger-pointing advice; that's the worst way to solve
          | anything.
          | 
          | It's AWS's job to ensure that things are reliable, that
          | there is redundancy, and that multi-AZ infra is safe
          | enough. The amount of issues in US-EAST-1 lately is really
          | worrying.
        
           | phamilton wrote:
           | Echoing this. We had to manually intervene and cut off the
           | faulty AZ because our ASGs kept spinning up instances in it
           | and our load balancers kept sending traffic to bad hosts.
           | 
           | In the past I've seen both of those systems seamlessly handle
           | an AZ failure. Today was different.
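            | 
            | For the curious, the manual cut-off can be as blunt as
            | dropping the bad AZ's subnet from the ASG so replacement
            | capacity only launches elsewhere. A rough boto3 sketch
            | (names hypothetical):
            | 
            |     import boto3
            | 
            |     asg = boto3.client("autoscaling")
            |     asg.update_auto_scaling_group(
            |         AutoScalingGroupName="web",
            |         # was "subnet-aaa,subnet-bbb,subnet-ccc";
            |         # subnet-aaa sits in the impaired AZ
            |         VPCZoneIdentifier="subnet-bbb,subnet-ccc",
            |     )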
        
           | acdha wrote:
           | Some load balancers may be having issues but I have multiple
           | busy workloads showing no issues all morning. One big
           | challenge can be that some people reporting multi-AZ issues
           | are shifting traffic and competing with everyone else, while
           | workloads which were already running in the other AZs were
           | fine. It can be really hard to accurately tell how much the
           | problems you're seeing generalize to everyone else.
           | 
           | I do agree that the end of this year has been a very bad
           | period for AWS. I wonder whether there's a connection to the
           | pandemic conditions and the current job market - it feels
           | like a lot of teams are running without much slack capacity,
           | which could lead to both mistakes and longer recovery times.
        
             | boudin wrote:
              | I hope AWS will provide some explanation of those
              | issues and what actions they will take to prevent them
              | in the future.
              | 
              | On our side we saw some EC2 VMs totally disconnected
              | from the network in 3 AZs.
        
               | acdha wrote:
               | Yeah, definitely needs good visibility. They're asking
               | customers to trust them to a large degree.
        
         | bencoder wrote:
         | Our API is just appsync (graphql) + lambdas + dynamoDB so,
         | theoretically, we shouldn't have been affected. But about 1 in
         | 3 requests was just hanging and timing out.
         | 
         | As others have said, they are not being forthright about the
         | severity of the issue, as is standard.
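          | 
          | (If you're stuck behind this kind of partial failure, a
          | short timeout plus retry at least turns a hung request
          | into a fast failure. A generic sketch, nothing specific to
          | our stack; the endpoint is made up and the requests
          | library is assumed:)
          | 
          |     import time
          | 
          |     import requests
          | 
          |     def query(payload, attempts=3):
          |         for i in range(attempts):
          |             try:
          |                 r = requests.post(
          |                     "https://api.example.com/graphql",
          |                     json=payload,
          |                     timeout=(2, 5),  # connect, read secs
          |                 )
          |                 r.raise_for_status()
          |                 return r.json()
          |             except requests.RequestException:
          |                 if i == attempts - 1:
          |                     raise
          |                 time.sleep(2 ** i)  # simple backoff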
        
         | strunz wrote:
         | Except many AWS services still route through us-east-1 anyway,
         | which is why they have had huge outages recently. AWS isn't as
         | redundant as people think it is.
        
         | tyingq wrote:
         | >AWS didn't "go down"
         | 
         | The context of the parent seems to be that they intermittently
         | couldn't get to the console. That seems fair to me. If we're
         | blaming developers and finding gaps in HA design, then AWS
         | should also figure out how to make the console url resilient.
         | If it's not, then AWS does appear to be down.
         | 
         | I imagine it's pretty hard to design around these failures,
         | because it's not always clear what to do. You would think, for
         | example, that load balancers would work properly during this
         | outage. They aren't. Or that you could deploy an Elasticache
         | cluster to the remaining AZs. You can't. And I imagine the
         | problems vary based on the AWS outage type.
         | 
         | Similarly, with the earlier broad us-east-1 outage, you
         | couldn't update Route53 records. I don't think that was known
         | beforehand by everyone that uses AWS. You can imagine changing
         | DNS records might be useful during an outage.
        
         | tluyben2 wrote:
         | > People commenting that servers shouldn't go down ect don't
         | understand how true HA architectures work. You should expect
         | and build for stuff to fail like this. Otherwise it's like
         | complaining that you lost data because a disk failed. Disks
         | fail... build architecture where that won't take you down.
         | 
          | Is that comparison fair? If you have two mirrored RAID-5
          | boxes in your room and all the disks fail at the same
          | time, you should complain. And that won't happen. These
          | entire-datacenter failures should be anticipated, but to
          | expect them is a bit too easy I think. There are plenty of
          | hosters that haven't had this happen even once in the last
          | decade, in their only datacenter. I do not find it strange
          | to expect or even demand that level of reliability, while
          | still protecting yourself in case it happens, if that fits
          | your specific project and budget.
          | 
          | Edit: OK, I meant that RAID-5 remark in the same context
          | as the hosting; it _can_ and does happen, but it
          | shouldn't; you should plan for a contingency, but
          | expecting it goes too far. We never had it (1000s of hard
          | drives, decades of hosting, millions of sites) and so we
          | plan for it with backups; if it happens it will take some
          | downtime, but it costs next to nothing over time to do
          | that. If we expected it, we would need to take far
          | different measures. And we had less downtime in a decade
          | than the AWS AZ had in the past months. I have a problem
          | with the word 'expect'.
        
           | acdha wrote:
           | > If you have 2 raid-5 mirrored raid 5 boxes in your room and
           | all disks fail at the same time, you should complain. And
           | that won't happen.
           | 
           | Here are some examples where that happened:
           | 
           | 1. Drive manufacturer had a hardware issue affecting a
           | certain production batch, causing failures pretty reliably
           | after a certain number of power-on hours. A friend learned
           | the hard way that his request to have mixed drives in his
           | RAID array wasn't followed.
           | 
            | 2. AC issues exposed a problem with airflow, causing one
            | row to run warm enough that faults were happening faster
            | than the RAID rebuild time.
           | 
           | 3. UPS took out a couple racks by cycling power off and on
           | repeatedly until the hardware failed.
           | 
            | No, these aren't common, but they were very hard to
            | recover from because even if some of the drives were
            | usable you couldn't trust them. One interesting dynamic
            | of the public clouds is that you tend to have better
            | bounds on the maximum outage duration, which is an
            | interesting trade-off compared to several incidents I've
            | seen where the downtime stretched into weeks due to
            | replacement delays or manual rebuild processes.
        
           | AtNightWeCode wrote:
           | "Won't happen". The 40,000 hours of runtime bug did happen. I
           | would recommend people to take backups and store them offline
           | or at least isolated from the main storage.
        
             | tluyben2 wrote:
             | Sure, I plan for it but I do not expect it. And it never
             | happened for me over decades. But I did plan for it, just
             | not in the way the parent said and it did cost me far less.
        
               | tluyben2 wrote:
                | Good to know ;) still think I am lucky; I have had
                | very few hard disks fail across 1000s. And only very
                | few unrecoverable failures (had to restore from
                | backup); those weren't HD failures but software
                | failures; the HDs were fine.
        
           | dylan604 wrote:
           | >And that won't happen
           | 
            | HA! I had received new 16-bay chassis and all of the
            | drives needed plus cold spares for each chassis. Set
            | them up and started the RAID-5 init on a Friday. Left
            | them running in the rack over the weekend. Returned on
            | Monday to find multiple drives in each chassis had
            | failed. Even with one of the 16 drives dedicated as a
            | hot spare, the volumes would all have failed in an
            | unrecoverable manner.
            | 
            | All drives were purchased at the same time, and happened
            | to all come from a single batch from the manufacturer.
            | The manufacturer confirmed this via serial numbers, and
            | admitted they had an issue during production. All drives
            | were replaced, and at a larger volume size.
            | 
            | TL;DR: Drives will fail, and manufacturing issues
            | happen. Don't buy all of the drives in an array from the
            | same batch! It will happen. To say it won't is just pure
            | inexperience.
        
             | tluyben2 wrote:
              | Guess I was lucky; we ran a lot of these over the
              | decades when things were far less reliable than now
              | and never experienced anything like that.
              | Manufacturing issues, sure, but we always stress-
              | tested everything we bought for 48 hours to see if
              | that killed it; if it didn't, it usually didn't break
              | anymore (I still have many of the machines from the
              | mid-to-late 2000s, and they have no disk failures now
              | even though they ran for many years).
        
               | dannyw wrote:
                | The older HDDs imo are more reliable than the newer
                | ones. They're pushing for density, and mitigating
                | the inherently higher sensitivity through aggressive
                | error correction.
        
             | Tempest1981 wrote:
             | Would like to know the manufacturer and model.
        
               | dylan604 wrote:
               | Sorry, this was back in 2006-2007 time frame. I have no
               | idea on model numbers as that's just not information I
               | ever cared to commit to memory.
        
               | tluyben2 wrote:
                | My time frame is from about the end of the 90s to
                | now. I saw more failures in general in the olden
                | days. I have quite a lot of rented servers at
                | traditional hosters or colo that have not been down,
                | outside kernel security updates, for 10+ years. No
                | hardware issues. I am now swapping them out for new
                | servers which cost less for more, but hardware-wise
                | they have been pounded hard for 10 years without
                | issues. All hot-swap RAID-5 drives, so when one
                | broke, they or we just fixed it without downtime.
        
           | 8note wrote:
           | More generally, any correlation between two items gives
           | potential for a correlated failure.
           | 
            | Same manufacturer, same disk batch, same location, same
            | operator, same maintenance schedule, same legal
            | jurisdiction, same planet, you name it, and there's a
            | common failure mode to match.
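            | 
            | Toy numbers to make that concrete (hypothetical 3%
            | annual failure rate per disk):
            | 
            |     p = 0.03
            |     print(p * p)   # independent: 0.0009, i.e. 0.09%
            |     # perfectly correlated (say, same bad batch): both
            |     # fail together with probability ~p, i.e. 3%, over
            |     # 30x more likely than the independent case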
        
           | phone8675309 wrote:
           | > Is that comparison fair? If you have 2 raid-5 mirrored raid
           | 5 boxes in your room and all disks fail at the same time, you
           | should complain. And that won't happen.
           | 
           | There are plenty of situations where this might happen if
           | they're in your room: a lightning strike can cause a surge
           | that causes the disks to fry, a thief might break in and
           | steal your system, your house might burn down, an earthquake
           | could cause your disks to crash, a flood could destroy the
           | machines, and a sinkhole could open up and swallow your
           | house. You may laugh at some of these as being improbable,
           | but I have seen _all_ of these take out systems between my
           | times in Florida (lightning, thief, sinkhole, and flood) and
           | California (earthquake and house fire).
           | 
            | The fix for this is the same fix as proposed by the
            | parent post: putting physical space between the two
            | systems so if one place becomes unavailable you still
            | have a backup.
        
             | greiskul wrote:
              | I had a job where my small internal debugging tool had
              | to be deployed to a minimum of 3 datacenters. I had 2
              | of them in the US and one in Europe, and was asked to
              | move one of the US ones to a datacenter on the other
              | coast, because who knows, maybe an earthquake will
              | knock out the whole US west coast. That is the
              | paranoia level necessary to achieve crazy-high uptime.
        
         | dkryptr wrote:
          | 100% agree. I'm actually surprised AWS hasn't built a
          | Chaos Monkey into their APIs/console so people can
          | regularly test their resiliency to an AZ going down.
          | 
          | edit: of course, AWS does have this: AWS Fault Injection
          | Simulator
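          | 
          | (A minimal sketch of kicking an experiment off with boto3;
          | the template, defined separately, is what specifies the
          | fault, e.g. stopping instances in one AZ. The template ID
          | here is made up:)
          | 
          |     import boto3
          | 
          |     fis = boto3.client("fis")
          |     exp = fis.start_experiment(
          |         experimentTemplateId="EXT123example",
          |     )
          |     print(exp["experiment"]["state"])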
        
           | stingraycharles wrote:
           | Because then people would complain about AWS being less
           | reliable than Azure / GCP.
        
           | biohax2015 wrote:
           | AWS Fault Injection Simulator does this.
        
             | dkryptr wrote:
             | TIL. Thank you!
        
             | lljk_kennedy wrote:
             | Is that what they call us-east-1 nowadays?
        
               | voidfunc wrote:
               | us-east-1 has always been the canary region hasn't it?
        
         | [deleted]
        
         | TameAntelope wrote:
         | Here's a secret that's now saved me from three outages this
         | month:
         | 
          | Be in multiple AZs, and even multiple regions; but if
          | you're going to be in only one AZ or one region, make it
          | us-east-2.
        
         | matharmin wrote:
          | AWS is under-reporting the severity of the issue though.
          | The primary outage may be in a single AZ, but there are
          | parts of the AWS stack affected across all AZs in
          | us-east-1, and potentially other regions as well. For
          | example, even now I'm unable to create a new ElastiCache
          | cluster in different AZs of us-east-1.
        
           | zymhan wrote:
           | > I'm unable to create a new ElastiCache cluster in different
           | AZs of us-east-1
           | 
           | Isn't that because Elasticache will distribute the cluster
           | across AZs automatically?
           | 
           | https://docs.aws.amazon.com/AmazonElastiCache/latest/red-
           | ug/...
        
             | matharmin wrote:
             | In this case, this was specifically with a single-AZ setup,
             | using an AZ that was supposed to be unaffected.
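              | 
              | (Concretely, a call along these lines was failing even
              | with the AZ pinned away from the broken one; all
              | parameters are illustrative, boto3 assumed:)
              | 
              |     import boto3
              | 
              |     ec = boto3.client("elasticache",
              |                       region_name="us-east-1")
              |     ec.create_cache_cluster(
              |         CacheClusterId="my-redis",  # hypothetical
              |         Engine="redis",
              |         CacheNodeType="cache.t3.micro",
              |         NumCacheNodes=1,
              |         # a supposedly-unaffected AZ
              |         PreferredAvailabilityZone="us-east-1b",
              |     )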
        
       | 13daug wrote:
        | This is S3; how are you gonna get your investment back from it
        
       | [deleted]
        
       | clavicat wrote:
        | How much more frequent do these outages need to become before
        | they start triggering SLA limits?
        
       | rsp1984 wrote:
       | Bitbucket having issues too:
       | https://bitbucket.status.atlassian.com/
        
       | darkwater wrote:
        | Fields of green here: https://status.aws.amazon.com/ Anyway,
        | I can access the web console with no issue (eu-west)
        
         | hnarn wrote:
         | I think it's pretty widely accepted that AWS' own status pages
         | are utterly useless.
        
           | darkwater wrote:
           | Yeah, it was just to confirm that this time was no different
           | :)
        
             | hdjjhhvvhga wrote:
             | In Russia they have a specific name for it:
             | 
             | https://en.wikipedia.org/wiki/Potemkin_village
        
           | s_dev wrote:
            | You would think that, but there are always a few
            | contrarian AWS evangelists in the comments going on
            | about the "difficulty" of operating a status page, as
            | though it were like conjuring a P=NP proof.
            | 
            | Like, how come Downdetector can do a superb job of
            | detecting when AWS goes down and AWS can't? Because AWS
            | doesn't want account managers of SLAs asking for credits
            | for the uptime they're paying for but not getting.
           | 
           | https://downdetector.co.uk/status/aws-amazon-web-services/
        
         | temp0826 wrote:
          | Changes to this page require very high-level management
          | approval (source: used to work at AWS)
        
         | lordnacho wrote:
         | The elite DevOps teams are always assigned to the status page
        
         | JCM9 wrote:
         | Status page says there are issues. It's not all green.
        
           | oneeyedpigeon wrote:
            | _Now_. It took a lot longer for that page to know/admit
            | the problems than it took half the internet.
        
       | debarshri wrote:
       | Hubspot seems to be down too [0].
       | 
       | [0] https://status.hubspot.com/
        
       | temptemptemp111 wrote:
        
       | schnebbau wrote:
       | So, how many execs are going to push to move to self-managed
       | hosting in the new year?
       | 
       | Packaging a way to migrate off AWS could be a unicorn idea.
        
         | dehrmann wrote:
         | Depends on how many customers are ready to move to a different
         | vendor. I suspect most customers are forgiving because either
         | they were also down or half the services they use were down.
         | You don't get fired for hosting in AWS.
        
         | adamm255 wrote:
         | Anyone using VMware Cloud services is probably laughing. Just
         | chuck it at Azure or GCP or back on prem.
        
         | wallacoloo wrote:
          | AWS has its Outposts product for on-prem hosting. Not 100%
          | self-managed, but maybe enough to satisfy the execs and
          | make your market a bit smaller.
        
           | Nextgrid wrote:
           | Does it come with its own locally-hosted console or does it
           | still rely on the main AWS control plane? If the latter then
           | it could be affected too.
        
         | qwertyuiop_ wrote:
          | None. Amazon hired all the ex-VPs, CTOs, and Directors of
          | small, medium, and large companies, with Rolodexes.
        
         | mikece wrote:
          | Would need one hell of a compression algorithm to keep the
          | data exfiltration costs down.
        
           | pm90 wrote:
           | Pied Piper
        
       | joe_chip wrote:
        
       | amai wrote:
        | A problem with log4j/log4shell?
        
       | throwaway81523 wrote:
       | Ok, enough AWS outages to say I'm tired of hearing about low end
       | stuff being flaky.
        
         | BiteCode_dev wrote:
         | "Don't use a self hosted monolithe, it's not reliable! You need
         | a cloud FS with a load balancer under observability and your
         | data in a db that scales horizontally, all orchestrated by
         | kubs."
         | 
         | Meanwhile, I currently have a gig to work on a video service
         | which features a never updated centos 6, an unsupported python
         | 2 blob website, and a push to prod deployment procedure,
         | running a single postgres db serving streaming for 4 millions
         | users a month.
         | 
         | And it's got years of up time, cost 1/100th of AWS, and can be
         | maintained by one dev.
         | 
         | Not saying "cloud is bad", but we got to stop screaming old
         | techs are no good either.
        
           | osrec wrote:
            | Purely out of interest, I'd like to know more about your
            | streaming architecture. I assume postgres just holds the
            | metadata, and the actual video content is stored
            | elsewhere? What strategies have you employed to scale
            | the streaming part of your service? I imagine 4 million
            | users a month is quite a significant amount of traffic!
        
             | BiteCode_dev wrote:
             | 1 - For the last 10 years, servers have been beasts. You
              | have a lot of cores, plenty of HD and RAM. Servers are
              | less expensive than devs. Scaling vertically can go
              | VERY far.
             | 
             | 2 - Caching is life. We have 3 layers of caching:
             | cloudflare, varnish, and redis. Most things don't need to
             | be real time. A lot of things can be a month old and the
              | user doesn't care. Users need immediate feedback to be
              | happy, but not necessarily fresh data.
             | 
             | 3 - if you compile nginx manually, you get to use a lot of
             | plugins that can do stuff super fast, including serving
              | videos. You can script stuff in lua that will just
              | skip the backend completely.
             | 
             | 4 - mind your encoding. We carefully chose how we encode
             | videos. The ffmpeg parameters are pretty insane, but the
              | space / quality ratio is amazing, especially on
              | mobile. It takes a lot of time to experiment with
              | those; nobody shares them :)
             | 
             | 5 - we offload everything we can to cron tasks or task
              | queues. Including, obviously, encoding,
              | screenshotting, etc.
             | 
              | 6 - don't hold data you can't lose. E.G: billing. This
              | way you can have a relaxed attitude toward data. If we
              | ever lose a day of business, users will be in a bad
              | mood for a week, but that won't be the end of the
              | world. We don't need a bulletproof system if bullets
              | can't kill us.
             | 
             | 7 - give money to ffmpeg and opencv, because damn those
             | things are fast. And good.
             | 
              | 8 - servers are hosted across 2 providers. This way,
              | if one goes down, or decides to stop doing business
              | with us Google-style, we have a second one. Happened
              | recently with leaseweb: they shut down a whole room
              | without offering an alternative.
             | 
              | E.G: votes.
              | 
              | They don't hit the backend on write. We pile them from
              | nginx into redis, then once a day we aggregate and
              | store them in postgres, which the backends will
              | consume. We also store each vote in localstorage so
              | that the user feels like it's real time when they
              | vote, but in reality it's updated once a day. But
              | votes don't affect the money side of our business, so
              | if we lose them one day, it does not mean death.
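              | 
              | The daily aggregation is roughly this shape (a sketch,
              | not our real code; redis-py and psycopg2 assumed, key
              | and table names made up):
              | 
              |     import collections
              | 
              |     import psycopg2
              |     import redis
              | 
              |     r = redis.Redis()
              |     pg = psycopg2.connect("dbname=app")
              | 
              |     # drain the buffered votes nginx pushed into redis
              |     counts = collections.Counter()
              |     while (raw := r.lpop("votes")) is not None:
              |         counts[raw.decode()] += 1  # one vote per entry
              | 
              |     # single daily write to postgres per video
              |     with pg, pg.cursor() as cur:
              |         for video_id, n in counts.items():
              |             cur.execute(
              |                 "UPDATE videos SET votes = votes + %s"
              |                 " WHERE id = %s",
              |                 (n, video_id),
              |             )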
             | 
              | P.S: yes, postgres/redis/elasticsearch only hold
              | metadata. Videos are stored on disk. There are no
              | docker images, no microservices; the FS is ext4. Which
              | means that with a lot of RAM, the OS FS cache will
              | have the most popular videos already loaded and ready
              | to be streamed. Everything is raid 0, so if one disk
              | gets corrupted, you lose the server. But we upload
              | each video to several servers, so when a disk gets
              | corrupted, we just replace the whole server. In fact,
              | if anything goes wrong on a server, we replace it.
              | It's not worth it to find the root cause, unless 2
              | servers die in the same way successively.
        
               | truetraveller wrote:
               | Thank you!
        
               | ffritz wrote:
               | This was super interesting to read, thank you very much.
               | 
               | Regarding the ffmpeg parameters and formats in general:
               | Do you use newer formats too, like AV1 and the like?
        
               | BiteCode_dev wrote:
               | No, we use only H264, because nobody has the courage to
               | redo all the work we've done to optimize the encoding
               | with a newer format :)
        
         | jacob019 wrote:
          | Right. I've had an excellent experience with Vultr for the
          | last couple years, for about 1/10th the cost of AWS. I use
          | other small VPS providers as well. I run my own small
          | business and I need to keep costs down to stay
          | competitive. I used to use AWS more, but the bill always
          | creeps up to inappropriate levels. AWS billing is
          | insulting: oh, you forgot to renew your reserved instance?
          | That's going to be double this month. I still use
          | cloudfront, route 53, and a few of the smallest instances
          | for mail servers and asterisk though. It's foolish to go
          | all in with AWS, or with anything really.
        
         | henriquez wrote:
         | Heroku isn't "low end," it's a PaaS built on top of AWS. So
         | you're really just hearing about another AWS outage lol
        
           | christophilus wrote:
           | They're not saying Heroku is low end. They're saying, "I'm
           | tired of hearing that it's irresponsible to run your own
           | servers."
           | 
           | At least, that's what I understood.
        
             | ryanbrunner wrote:
             | Any place I've worked at that managed their own servers (to
             | be fair, the last time I worked at a place like that was
             | 2010) definitely had more protracted downtimes than AWS -
             | it just felt not as bad because we were in control of the
             | situation, but at the end of the day that didn't get us up
             | any faster.
             | 
             | Another side benefit of being with AWS is when you do have
             | an outage, a lot of other people have outages, and so you
             | sort of blend in with the noise. It's not great to be down,
             | but if you're down and also "big service X" who's also an
             | AWS customer is down, it makes your downtime look less like
             | a lack of competence and more like an unavoidable force of
             | nature.
        
               | dijit wrote:
               | I guess it's extremely dependent on an org to org basis.
               | 
              | I worked at a company whose bread and butter was
              | online services (e-commerce SaaS platform, similar to
              | Netsuite) and we had _significantly fewer_ outages
              | than AWS had.
               | 
               | But we had redundancies built in to most things, I'm not
               | saying it was perfect but it worked.
               | 
               | The major difference might be that almost nobody is
               | willing to spend 20% of what they spend on AWS/GCP to
               | have a self-hosted solution.
               | 
               | The reason "cloud is so expensive" is because they're
               | essentially telling you what the price will be and even
               | if they only spend 40% of that on actual hardware and
               | operations: it's more than most companies would invest in
               | themselves.
               | 
               | This is absurd, of course, but it's absolutely true.
        
               | [deleted]
        
             | [deleted]
        
           | mijoharas wrote:
           | This comment doesn't say anything about heroku?
        
         | api wrote:
         | Nobody ever got fired for using AWS.
        
           | debarshri wrote:
            | Today DO also went down. We briefly could not log in.
        
             | jeremyjh wrote:
             | Just the control panel or were your instances down as well?
        
               | debarshri wrote:
                | Just the control panel; we couldn't log in
        
           | pxue wrote:
           | maybe except a team at google? ;)
        
             | falcolas wrote:
             | I have a story from only a few years ago where the finance
             | section, and a good portion of management, of Google had no
             | idea how poor their GAE solution was for uptime, until they
             | tried to do business critical work using software that was
             | hosted on GAE.
             | 
             | Uptime improved rather dramatically after that.
        
               | pxue wrote:
               | yeap. it's sloowwly getting there.
        
           | flatiron wrote:
           | If you rely solely on east 1 maybe?
        
             | omh2 wrote:
             | AWS doesn't follow their own advice about hosting multi-
             | regional.
             | 
             | When us-east-1 is sufficiently borked the management API
             | and IAM services in all regions tend to go down with it.
             | 
              | Static infrastructures usually avoid the fallout, but
              | anyone dependent on the API or on otherwise
              | dynamically created resources often gets caught in the
              | blast regardless of region.
        
               | jeremyjh wrote:
                | I didn't hear any reports of that happening in the
                | most recent outage. The console was inoperable, but
                | you could work around it using regional console host
                | names.
        
               | throwanem wrote:
               | It wasn't reliable. I heard of many more who weren't able
               | to get in that way than who were, and was in the former
               | category myself.
               | 
               | We didn't take any downtime, but if anything had gone
               | wrong there would have been nothing we could do about it
               | until IAM came back up.
        
               | tyingq wrote:
               | There are some services that do have hard US-EAST-1
               | dependencies. Cloudfront, because of certificates.
               | Route53. The control API for IAM (adding/removing roles,
               | etc). And there's also the notion of "global endpoints"
               | like https://sts.amazonaws.com... it's not clear why that
               | exists, because it fails when us-east-1 does. It would be
               | better to only have regional endpoints if the "global"
               | ones are region-specific in reality. The endpoint thing
               | is documented, but it's still confusing to people.
               | 
               | The dependency chains can bite you too. During the us-
               | east-1 outage, a Lambda run by cron-like schedules via
               | EventBridge was itself in an okay state, but the
               | EventBridge events that kick it off were stuck in a queue
               | that was released when the problem was fixed. So if your
               | Lambda wasn't idempotent, and you ran it in another
               | region during the outage, you ended up with problems.
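                | 
                | (Sketch of the usual workaround: pin STS to a
                | regional endpoint instead of the global one; the
                | region choice is illustrative, boto3 assumed:)
                | 
                |     import boto3
                | 
                |     sts = boto3.client(
                |         "sts",
                |         region_name="us-west-2",
                |         endpoint_url=(
                |             "https://sts.us-west-2"
                |             ".amazonaws.com"),
                |     )
                |     print(sts.get_caller_identity()["Account"])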
        
               | omh2 wrote:
               | If you're referring to Dec 7, it absolutely did. Metrics
               | went down nearly across the board, which also means most
               | auto-scaling setups were non-functional. Cloudfront
               | metrics didn't properly recover until the next day
               | 
               | Logging in with root credentials was not possible in any
               | region, and even logging in with IAM creds in other
               | regions yielded an intermittently buggy console
               | 
                | and, as is usual with us-east-1 outages, management
                | API calls were a complete crapshoot regardless of
                | region
        
           | trabant00 wrote:
            | True sad fact. I first thought it was a management
            | problem, but lately I see it is the tech bros who push
            | for fads in the hope of staying relevant and not
            | assuming responsibility for choices.
        
             | datavirtue wrote:
             | Omg, this needs to be on a plaque or something.
             | 
             | "Let's move our internal app with 50 users to k8s in the
             | cloud." --true story
        
               | Jupe wrote:
                | Amazing. As long as "technological progress"
                | sufficiently obscures the impact of such
                | ridiculousness, such projects will continue to
                | occur.
                | 
                | It's a real shame that the collective world of
                | technology does not properly respect the simple
                | solutions that work.
                | 
                | The dichotomy here is almost funny. Most
                | technological people "admire" the simplicity,
                | elegance and extensibility of the command line. But
                | tell those same people that the best data store for
                | the solution is a relational database and their
                | noses crinkle up.
        
               | datavirtue wrote:
                | Yeah, after getting caught up in the hype for ten
                | years I'm running back to proven tech that is
                | fleshed out (Java; Swing, omg it just works; wanting
                | to try Ruby; even PHP is looking good at this
                | point).
                | 
                | Every dependency scrutinized and discarded if
                | possible.
                | 
                | I would probably work for free if someone set up
                | their own on-prem cloud on Tanzu, OpenShift, or
                | Rancher and used old-school proven frameworks for
                | development.
               | 
               | Working in AWS has been a real shitty experience at these
               | large companies. All the nit picky problems (of which
               | there are thousands) get dumped on devs who are trying to
               | deliver working software.
        
             | Jupe wrote:
             | (Accidentally down-voted, apologies! I would upvote to fix,
             | but can't... Update: fixed)
             | 
              | Agreed. Arguably, _not_ using an existing cloud
              | service is a red flag on any new hire. AWS is the
              | primary, but experience using GCS or Azure is at least
              | a viable skill, even if your business is AWS-based.
              | 
              | But the "fad-based development" meme is not going away
              | any time soon. The incentives in the business are
              | built around it (really! No one wants to work on a
              | boring old relational-database solution any more). In
              | the old days it was 4th-generation languages, RUP, XML
              | and Function Point Analysis... today it's functional
              | programming, SDKs, big-three cloud PaaS experience or
              | (shudder) blockchain.
             | 
             | I think back to my much younger self, when I thought that
             | technology was something to be mastered to solve real-world
             | problems, and I laugh. Little did I know the real problem
             | to be solved was to figure out how to solve those same-old
             | business problems but with the technology of the season
             | (Kubernetes, GraphQL or ML).
        
           | alecbz wrote:
           | I wonder to what extent this actually becomes _less_ of a
           | problem the more people use AWS. At this point AWS being down
            | just feels like "the internet is down"; it's hard for
           | customers to be too mad at any company being down when all
           | their competitors are too.
           | 
           | Though I guess there's still probably just lost revenue that
           | could be captured by having better uptime, even if your
           | competitors are down.
        
             | AQuantized wrote:
             | This seems like an interesting pendulum swing where the few
             | companies not reliant on AWS could capture significant
             | enough revenue by maintaining uptime during a potential
             | busy season outage.
        
               | ryanbrunner wrote:
                | That depends on them being up the rest of the time.
                | If they have an equal number of outages as AWS, it
                | risks making them look worse (since all attention is
                | on them when they're out).
        
               | hwers wrote:
               | These two outages have been incredibly anomalous, I doubt
               | you'd get much revenue betting that they'll be a common
               | occurrence.
        
               | FridayoLeary wrote:
                | Maybe they'll just find themselves DDoSed by the
                | sudden influx of visitors? As a small example, I
                | think HN was slightly slower when FB had their
                | outage.
        
               | tapan_jk wrote:
               | Agree. Now viable alternatives exist. The nextgen cloud
               | providers will learn from the weaknesses of incumbents
               | and innovate.
        
       | allocate wrote:
       | Also running a big production app in east-1 and we're
       | experiencing issues.
        
         | sprite wrote:
         | I'm also in east-1 and completely down.
        
       | ChrisMarshallNY wrote:
       | I can't play Borderlands 3 this morning (Epic).
       | 
       | Wonder if it's connected?
        
       | iso1631 wrote:
       | Ahh, the cloud
       | 
       | https://imgflip.com/i/5yrt24
        
       | ClumsyPilot wrote:
        | Now that everyone and their dog is on AWS, it is not just
        | that 'a website stops working'; half the world, from
        | telephones to security doors and IoT equipment, stops
        | working.
        | 
        | I am not sure if the movement to the cloud has reduced the
        | amount of failures, but it definitely has made the failures
        | more catastrophic.
        | 
        | Our profession is busy making the world less reliable and
        | more fragile; we will have our reckoning just like the
        | shipping industry did.
        
         | dehrmann wrote:
         | It's more like it's making downtimes correlated rather than
         | random. For everything other than urgent communication, I'm not
         | sure if this is a big deal.
        
         | madeofpalk wrote:
          | All I've noticed is Slack was a bit unreliable for a
          | little bit, but I just carried on and otherwise ignored
          | it. My world did not stop working.
        
           | ClumsyPilot wrote:
            | My apartment block has a dialing system that, instead of
            | using a cable that goes to your apartment, relies on IP
            | telephony and calls your mobile phone. It stops working
            | if there is no internet, or your phone is out of
            | battery, or you are not home but your wife is.
        
           | KronisLV wrote:
           | Same, maybe that was a related issue.
           | 
            | Today, on Slack I could not edit messages, could not
            | edit statuses and could not post attachments. Pretty
            | annoying!
        
       | kingsloi wrote:
        | Of all the AWS outages, my team and I have dodged them all
        | except this one. 3 instances down and unavailable.
       | 
       | > Due to this degradation your instance could already be
       | unreachable
       | 
       | >:(
        
         | electroly wrote:
         | FWIW I don't think that message has anything to do with this
         | outage. I think it's just a coincidence that you got some
         | degraded hosts. They didn't send out emails like that for this
         | AZ outage (nor would I expect them to -- that email is for when
         | host machines die).
        
       | Demcox wrote:
       | Imgur is suffering from this too, I think.
        
       | sctgrhm wrote:
        | Invision image uploads are down too because of this:
        | https://status.invisionapp.com/
        
       | streamofdigits wrote:
       | Somebody call the IT department
        
       | stunt wrote:
        | It seems that it's due to power loss.
       | 
       | [05:01 AM PST] We can confirm a loss of power within a single
       | data center within a single Availability Zone (USE1-AZ4) in the
       | US-EAST-1 Region. This is affecting availability and connectivity
       | to EC2 instances that are part of the affected data center within
       | the affected Availability Zone. We are also experiencing elevated
       | RunInstance API error rates for launches within the affected
       | Availability Zone. Connectivity and power to other data centers
       | within the affected Availability Zone, or other Availability
       | Zones within the US-EAST-1 Region are not affected by this issue,
       | but we would recommend failing away from the affected
       | Availability Zone (USE1-AZ4) if you are able to do so. We
       | continue to work to address the issue and restore power within
       | the affected data center.
        
       | captn3m0 wrote:
        | 4:35 AM PST We are investigating increased EC2 launch failures
       | and networking connectivity issues for some instances in a single
       | Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other
       | Availability Zones within the US-EAST-1 Region are not affected
       | by this issue.
       | 
       | via https://stop.lying.cloud/
        
         | snth wrote:
         | What is this website? Is there an "about" or something? What is
         | it doing differently from the official AWS status page?
        
         | junon wrote:
         | Can anyone explain the affiliation of stop.lying.cloud to
          | Amazon? All of the legalese in the header/footer seems to
         | indicate it's actually owned and run by Amazon. If so... why?
         | Why not just... use the real status page?
         | 
         | I mean I'm glad it exists, don't get me wrong. Just weird that
         | they'd have two status pages, one seemingly existing only to
         | sort of 'mock' themselves...
        
           | IceWreck wrote:
            | Amazon's own status page sort of lies. So someone
            | probably wget-ed the status page, kept the same HTML and
            | CSS, and hooked it to their own API to display correct
            | info.
        
           | deadbunny wrote:
           | FWIW `lying.cloud` is registered with Namecheap.
           | `amazon.com`/`aws.com`/`amazon.ca` are all registered with
            | Mark Monitor. And I know that AWS uses Gandi behind the
           | scenes for domain reg. Given that, I'd hazard a guess that
           | it's not owned by Amazon. Definitely not a guarantee though.
        
           | anpat wrote:
           | Corey Quinn (https://twitter.com/quinnypig) runs it.
           | 
           | He also has a decent newsletter and witty commentary, for all
           | things AWS.
        
           | bithavoc wrote:
           | I think it was built[0] by @quinnypig
           | 
           | [0] https://twitter.com/quinnypig/status/1468331194471178241?
           | s=2...
        
           | andyjih_ wrote:
           | It's not official. The people making the page probably just
           | copied everything, including the legalese.
        
           | taspeotis wrote:
           | The people who maintain the unofficial site would have, at
           | some point, used their CTRL and C keys followed not
           | immediately, but closely by, their CTRL and V keys.
        
             | junon wrote:
             | But that is copyright infringement. You're not allowed to
             | copy some work, modify it, then slap the original copyright
             | on it. This is an illegal website, prone to being taken
             | down by AWS.
             | 
             | It's just strange.
        
               | Acestus wrote:
               | I would be worried because getting taken down is Amazon's
               | speciality.
        
               | bsenftner wrote:
               | Actually, having a satire site taken down over copyright
               | is one of the best ways to extort large amounts of money
               | from the copyright holder, because constitutional
               | attorneys will seep from the floorboards and appear in
               | your shower trying to be an attorney on that case. Satire
               | is extremely protected speech.
        
               | taspeotis wrote:
               | Yes mate, Internet Police Officer Jeff Bezos has been
               | dispatched and will take this illegal website down right
               | away.
               | 
               | (Using copyrighted material is permitted under fair use;
               | this website is a parody. I'm not a lawyer but at some
               | level preserving the copyright notice is probably better
               | than claiming it as their own.)
        
               | junon wrote:
               | This website is not a parody, and fair use does NOT
               | permit you to retain the original copyright notice on the
               | derived work.
               | 
               | You may say that the _original_ work is copyright of the
               | respective owners and that this is a parody work. But
                | that's not what the site is doing. The footer contains
               | the original, unaltered copyright, creating confusion as
               | to who owns the derived work. Amazon does not own this,
               | nor do they endorse it, so you're not allowed to say it's
               | copyrighted by Amazon.
        
               | [deleted]
        
               | sparkling wrote:
               | oh no :o
        
               | isoprophlex wrote:
               | Can you imagine that some might see your position as one
               | of unmitigated pedantry unfitting in any discussion of
               | this - clearly jocular - website?
        
               | junon wrote:
               | Nope. Because these laws also protect people who make
               | such websites from the corporations they're commenting
               | on, too. The respect must be mutual, else e.g. OSS has no
               | basis for legal protection, either.
               | 
               | Being concerned for the proper respect of IP laws is
               | something that benefits everyone.
        
               | gilrain wrote:
               | You make more friends defending humans from big companies
               | than you do defending big companies from humans.
               | 
               | Your argument would have a small amount of merit if you
               | acknowledged that the laws DO NOT protect people like
               | they do corporations. That is a hollow ideal, not
               | reality.
        
               | junon wrote:
               | I don't really see your point. My original comment was
               | more pointing out that the operators, if not Amazon,
               | could be seen as infringing their rights. They should
               | update their legalese if they want to be truly protected.
               | How is that defending Amazon?
               | 
               | Regardless of your pointed comment, I'm operating in the
               | land of legal objectivity. The law doesn't care about
               | your feelings much.
        
               | gilrain wrote:
               | Your argument is that IP law should be equally observed
               | by everyone because it protects individuals and
               | corporations alike:
               | 
               | > Nope. Because these laws also protect people who make
               | such websites from the corporations they're commenting
               | on, too.
               | 
               | My response is that your assumption is very obviously
               | wrong: the law does not protect individuals and
               | corporations alike.
               | 
               | That's all.
        
               | junon wrote:
               | > Your argument is that IP law should be equally observed
               | by everyone because it protects individuals and
               | corporations alike
               | 
               | That is a weird understanding of what I said, and I don't
               | really think you're arguing in good faith here. There's a
               | lot of bias so I am choosing to not further this
               | conversation.
        
               | willcipriano wrote:
               | Let them sue. I see the headline now, "Amazon sues status
               | page website for accurately reporting on outages".
               | Followed by lots of people hosting mirrors to
               | stop.lying.cloud and saying things like "We are all
               | stop.lying.cloud now"
        
               | junon wrote:
               | This has nothing to do with the validity of the case,
               | though. Just because it's a ridiculous court case doesn't
               | mean it's not legally sound...
        
               | willcipriano wrote:
               | I have a legally sound case to divorce my wife (anyone
               | does, you can divorce for no reason), however she need
               | not worry as that would be colossally stupid on my part.
               | The same goes here.
        
               | junon wrote:
               | This is a reductio ad absurdum comparison.
        
               | [deleted]
        
               | [deleted]
        
               | colejohnson66 wrote:
               | Fair use is not a right. It's a _defense_. When you are
               | sued for copyright infringement, you have to argue that
               | you're doing it in fair use. It's not the "get out of
               | jail free" card people think it is.
        
               | yholio wrote:
               | > It's a defense.
               | 
                | That's a warped view of the world. A corporation can
                | always take you to court and harass the crap out of you,
                | but the court will side with your defense because you
                | were _right_ and the claimant was _wrong_: you had the
                | _right_ to do what you did.
        
               | bsenftner wrote:
                | Satire is the loophole of Copyright. If you satirize
                | ANYTHING you can use their copyrights in the satire. One
                | could safely and legally drive an entire nation's
                | transportation industry through that loophole.
        
               | junon wrote:
               | No, fair use does NOT allow you to retain the original
               | copyright. That would be passing off a derived work as
               | the original copyright holder's work, which could be very
               | damaging. This is a violation of fair use, if it could
               | even be considered that to begin with.
        
               | bsenftner wrote:
                | Fair use in the case of satire is not retaining the
                | original copyright; it is referencing the copyright. It
                | is legal hair-splitting, but it stands up in court.
        
               | junon wrote:
               | I think you misunderstand. The website in question has
               | "Copyright (c) Amazon, Inc" in the bottom, when in fact
               | the _derived work_ (the site) is _not_ created by Amazon,
               | but by a third party. IANAL but my understanding is that
               | this copyright notice being retained without any
               | clarification of the owner of the derived work can be
               | seen as endorsement, which is an infringement of
               | copyright unless Amazon has expressly permitted such use
               | (which is usually indicated as such, anyway).
               | 
               | It is also clearly _not_ satire. That would _not_ hold up
               | in court, and there are _many_ instances where they have
               | tried that angle and failed.
        
               | azernik wrote:
               | That's more of a trademark issue, and would require a
               | reasonable consumer to be likely to be deceived. Which
               | they're not.
        
               | junon wrote:
               | No, it's not a trademark issue. They copied the work
               | verbatim (including code, which is not covered by
               | trademark law, but by copyright law), modified it, and
               | then put the _original_ copyright notice in the legalese.
               | This is copyright infringement.
               | 
                | And consumers are clearly deceived - hence my original
                | comment asking about it, which has several upvotes.
        
               | azernik wrote:
               | The direct copyright is covered by the satire exceptions.
               | If you want to argue that copying the legalese is
               | different, it's going to be on those confusion grounds,
               | which IIUC aren't a copyright concern.
        
               | throwaway2331 wrote:
               | Drugs are illegal too, yet people do them all the time.
               | 
                | Speeding? Basically a national pastime at this point.
               | 
               | Misrepresentation, common fraud, and misappropriation?
               | Par for the course in most small businesses.
               | 
               | It's only a crime if someone gives enough of a shit to do
               | something about it; otherwise, it's just life.
        
               | roenxi wrote:
               | AWS might be smart enough not to make that strategic
               | blunder. They won't want to draw attention to the
               | inaccuracies of their status page.
        
               | junon wrote:
               | Perhaps, but the status page could exist within legal
               | boundaries with a few trivial changes anyway. Why not
               | just do that?
        
               | chrisan wrote:
               | No one cares, including Amazon. This webpage isn't
               | profiting off them. There is no valuable IP here being
               | stolen.
               | 
               | Amazon has much bigger legal issues to focus on than some
               | satire.
        
               | junon wrote:
               | Amazon has taken down trivial things before. This is a
               | dangerous assumption to make - "it's not important to
               | them so I shouldn't worry about the law". Those are
               | famous last words.
        
           | jrumbut wrote:
           | I was curious too. An HN user takes credit for it here:
           | https://news.ycombinator.com/item?id=24499159
           | 
           | Apparently it does some simple transformations of the actual
           | status page, which is why the Amazon copyright stuff is in
           | there.
        
       | richardfey wrote:
        | As far as I understand, a whole availability zone went down;
        | today is also the day a lot of people learn why "multi-AZ"
        | matters, so I don't think it's fair to say that services are
        | down because the whole of AWS is down.
        
       | joelbondurant wrote:
        
       | sreitshamer wrote:
       | Console is sluggish for me, but S3 (us-east-1) seems to work
       | fine.
        
       | loudtieblahblah wrote:
       | Yay! Adult snowday!
        
         | RobertKerans wrote:
          | Apropos of nothing, but a few Christmases ago the place I
          | worked had a dedicated fibre line that some workmen doing gas
          | line repairs sawed straight through; it took out everything. I
          | was just a drone worker at the time & it was a beautiful thing.
        
       | bobviolier wrote:
        | Seems illogical that this is just a single AZ in a single US
        | region. We are having issues pulling images from public.ecr.aws
        | from an EU region.
        
         | saxonww wrote:
          | I don't know what's still true, but at one point us-east-1
          | seemed more critical than other regions because some things
          | had to be there. One that comes to mind is ACM certificates
          | used with things like API Gateway (and probably CloudFront):
          | they had to be in us-east-1 no matter where the rest of your
          | infrastructure was.
         | 
         | So it's not shocking to me that something going down in us-
         | east-1 could have impact on other regions.
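          | 
          | A minimal sketch of how that pinning leaks into otherwise
          | region-agnostic code (Python/boto3; the domain is a
          | placeholder): the ACM client has to be hard-coded to us-
          | east-1 for a certificate CloudFront can use.
          | 
          |     import boto3
          | 
          |     # CloudFront only accepts ACM certificates issued in
          |     # us-east-1, no matter where the rest of the stack
          |     # lives, so this client is pinned to that region.
          |     acm = boto3.client("acm", region_name="us-east-1")
          |     response = acm.request_certificate(
          |         DomainName="example.com",
          |         ValidationMethod="DNS",
          |     )
          |     print(response["CertificateArn"])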
        
       | 300bps wrote:
       | Can we please stop saying, "AWS is down"?
       | 
       | AWS consists of over 200 services offered in 86 availability
       | zones in 26 regions each with their own availability.
       | 
       | If one service in one availability zone being impaired equals a
       | post about "AWS is down" we might as well auto-post that every
       | day.
        
         | omh2 wrote:
          | AWS doesn't follow their own advice about hosting multi-
          | regionally, so every time us-east-1 has significant issues
          | pretty much every AZ and region is affected.
          | 
          | Specifically, large parts of the management API and the IAM
          | service are seemingly hosted centrally in us-east-1.
          | 
          | If your infrastructure is static you'll largely avoid the
          | fallout, but if you rely on API calls or dynamically created
          | resources you can get caught in the blast regardless of
          | region.
        
         | KptMarchewa wrote:
         | Would be cool if this wasn't the region where AWS hosts their
         | internals, making other regions unusable, right?
        
         | satya71 wrote:
         | Seems enough services in us-east-1 are down to cause most apps
         | to fail. My simple app uses 10s of AWS services, at least some
         | of which are out.
        
           | 300bps wrote:
           | I may have seen more of these posts than you. The last one I
           | saw where "AWS is down" was us-west-1.
        
         | sawmurai wrote:
         | It's like my grandma saying "Honey, the internet is broken
         | again." xD
        
       | exabrial wrote:
       | Stat That.
        
       | aledalgrande wrote:
       | If you haven't seen yet, news is it was a power loss:
       | 
       | > 5:01 AM PST We can confirm a loss of power within a single data
       | center within a single Availability Zone (USE1-AZ4) in the US-
       | EAST-1 Region. This is affecting availability and connectivity to
       | EC2 instances that are part of the affected data center within
       | the affected Availability Zone. We are also experiencing elevated
       | RunInstance API error rates for launches within the affected
       | Availability Zone. Connectivity and power to other data centers
       | within the affected Availability Zone, or other Availability
       | Zones within the US-EAST-1 Region are not affected by this issue,
       | but we would recommend failing away from the affected
       | Availability Zone (USE1-AZ4) if you are able to do so. We
       | continue to work to address the issue and restore power within
       | the affected data center.
        
         | GrumpyNl wrote:
            | How come they don't have power backups?
        
           | redm wrote:
            | Some datacenter failures aren't related to redundancy. Some
            | examples: 1) a transfer switch failure where you can't
            | switch over to backup generators and the UPS runs out, 2)
            | someone accidentally hits the EPO, 3) maintenance work goes
            | wrong, such as turning off the wrong circuits, 4) cooling
            | doesn't switch over fully to backups and, while your systems
            | have power, it's too hot to run. The list can go on and on.
            | 
            | I'm not sure why this is a big deal though; this is why
            | Amazon has multiple AZs. If you're in one AZ, you take your
            | chances.
        
           | Spooky23 wrote:
           | Their datacenter(s) aren't magic because they are AWS. That
           | facility is probably a decade old and like anything else as
           | it ages the technical and maintenance debt makes management
           | more challenging.
        
           | trelane wrote:
           | Anything can fail, even your backup, and especially if it's
           | mechanical.
        
             | rdines wrote:
             | The battery backups (called uninterruptible power supplies)
             | are only meant to bridge the gap between the power going
             | out and the generator turning on, which is a few minutes.
             | Did they say power was the issue this time? I suspect it's
             | actually something else (ahem network)
        
           | chkhd wrote:
           | "When a fail-safe system fails, it fails by failing to fail-
           | safe." - https://en.wikipedia.org/wiki/Systemantics
        
             | 2-718-281-828 wrote:
             | is that just playing with words?
        
               | losvedir wrote:
                | No. For example, train signalling, which controls whether
                | a train can go onto a section of track, operates in a
                | fail-safe manner: if something goes wrong, the signal
                | fails into a safe "closed" state rather than an unsafe
                | "open" state. This means trains are incorrectly being
                | told to stop even though technically the tracks are
                | clear, rather than incorrectly being told to go even
                | though there is another train ahead.
               | 
               | "fail-safe" doesn't mean "doesn't fail", it means that
               | the failure mode chooses false negatives or false
               | positives (depending on the context) to be on the safe
               | side.
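                | 
                | A toy sketch of the idea (Python; the names are made
                | up): every failure path, including a sensor exception,
                | resolves to the restrictive aspect.
                | 
                |     from enum import Enum
                | 
                |     class Aspect(Enum):
                |         DANGER = "stop"
                |         CLEAR = "proceed"
                | 
                |     def signal_aspect(block_is_clear):
                |         # Fail-safe: only a positive "clear" reading
                |         # shows CLEAR; any error becomes a false
                |         # negative (a stopped train on clear track),
                |         # never a false positive.
                |         try:
                |             if block_is_clear():
                |                 return Aspect.CLEAR
                |         except Exception:
                |             pass
                |         return Aspect.DANGER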
        
               | itsoktocry wrote:
               | > _is that just playing with words?_
               | 
               | It conveys reality, that "fail-safe" isn't literal, as if
               | anyone believed that.
        
               | 2-718-281-828 wrote:
                | I mean it has to be wordplay or tongue-in-cheek, simply
                | because the premise of a fail-safe system failing is
                | already contradictory. So you cannot say anything smart
                | about that beyond: there are no fail-safe systems that
                | fail.
        
               | seeking_future wrote:
               | The real world is the play. Words are just catching up.
        
               | frupert52 wrote:
               | Do you mean in that it fails by failing to be the thing
               | that it purports to be? Making it no longer that thing?
               | At what point does bread become toast?
        
               | Talanes wrote:
               | https://en.wikipedia.org/wiki/Gare_de_Lyon_rail_accident
               | 
               | Fail safes do fail. Often due to severe user error.
        
               | NovemberWhiskey wrote:
               | I think it's predicated on a misunderstanding of what
               | "fail-safe" actually means.
               | 
               | For example, in railway signaling, drivers are trained to
               | interpret a signal with no light as the most restrictive
               | aspect (e.g. "danger"). That way, any failure of a bulb
               | in a colored light signal, or a failure of the signal as
               | a whole, results in a safe outcome (albeit that the train
               | might be delayed while the driver calls up the signaler).
               | 
               | Or, in another example from the railways, the air brake
               | system on a train is configured such that a loss of air
               | pressure causes emergency brake activation.
               | 
               | Fail-safe doesn't mean "able to continue operation in the
               | presence of failures"; it means "systematically safe in
               | the presence of failure".
               | 
               | Systems which require "liveness" (e.g. fly-by-wire for a
               | relaxed stability aircraft) need different safety
               | mechanisms because failure of the control law is never
               | safe.
        
               | pdpi wrote:
               | > "systematically safe in the presence of failure".
               | 
               | And even then, you still need to define "safe". Imagine a
               | lock powered by an electromagnet. What happens if you
               | lose power?
               | 
               | The safety-first approach is almost always for the
               | unpowered lock to default to the open state -- allow
               | people to escape in case of emergency.
               | 
               | Conversely, the security-first approach is to keep the
               | door locked -- nothing goes in or out until the situation
               | is under control.
               | 
               | A more complex solution is to design the lock to be
               | bistable. During operating hours when the door is
               | unlocked, failure keeps it unlocked. Outside operating
               | hours, when the door is set to locked, it stays locked.
               | 
               | The common factor with all these scenarios is that you
               | have a failure mode (power outage), and a design for how
               | the system ensures a reasonable outcome in the face of
               | said failure.
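                | 
                | A compact way to state those three designs (a Python
                | sketch; the failure mode modeled is loss of power):
                | 
                |     def locked_after_power_loss(design, was_locked):
                |         # True means the door stays locked once
                |         # power is gone.
                |         if design == "fail-open":    # safety-first
                |             return False
                |         if design == "fail-closed":  # security-first
                |             return True
                |         if design == "bistable":     # hold the last
                |             return was_locked        # commanded state
                |         raise ValueError(design)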
        
               | jsmith99 wrote:
               | Or nuclear reactors that fail safe by dropping all the
               | control rods into the core to stop all activity. The
               | reactor may be permanently ruined after that (with a cost
               | of hundreds of millions or billions to revert) but there
               | will be no risk of meltdown.
        
               | NovemberWhiskey wrote:
               | I don't know enough about reactor control systems to be
               | sure on that one. The idea of a fail-safe system is not
               | that there's an easy way to shut them down, but more that
               | the ways we expect the component parts of a system to
               | fail result in the safe state.
               | 
               | e.g. consider a railway track circuit - this is the way
               | that a signaling system knows whether a particular block
               | of a track is occupied by a train or not. The wheels and
               | axle are conductive so you can measure this electrically
               | by determining whether there's a circuit between the
               | rails or not.
               | 
               | The naive way to do this would be to say something like
               | "OK, we'll apply a voltage to one rail, and if we see a
               | current flowing between the rails we'll say the block is
               | occupied." This is not fail-safe. Say the rail has a
                | small break, or power is interrupted: no current will
               | flow, so the track always looks unoccupied even if
               | there's a train.
               | 
               | The better way is to say "We'll apply a voltage to one
               | rail, but we'll have the rails connected together in a
               | circuit during normal operation. That will energize a
               | relay which will cause the track to indicate clear. If a
               | train is on the track, then we'll get a short circuit,
               | which will cause the relay to de-energize, indicating the
               | track is occupied."
               | 
               | If the power fails, it shows the track occupied because
               | the relay opens. If the rail develops a crack, the
               | circuit opens, again causing the relay to open and
               | indicate the track is occupied. If the relay fails, then
               | as long as it fails open (which is the predominant
               | failure mode of relays) the track is also indicated as
               | occupied.
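                | 
                | The same logic as a quick sketch (Python; heavily
                | simplified): the relay is energized only when
                | everything is healthy, so every expected failure
                | reads as "occupied".
                | 
                |     def relay_energized(power_on, rail_intact,
                |                         train_present):
                |         # Current reaches the relay only with power
                |         # on, unbroken rails, and no axle shunting
                |         # the circuit ahead of it.
                |         return (power_on and rail_intact
                |                 and not train_present)
                | 
                |     def indication(power_on, rail_intact,
                |                    train_present):
                |         # A de-energized relay, whatever the cause,
                |         # is read as the restrictive state.
                |         if relay_energized(power_on, rail_intact,
                |                            train_present):
                |             return "clear"
                |         return "occupied"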
        
               | sgarland wrote:
                | Sort of. A failsafe reactor design can include things
                | like:
               | 
               | * Negative temperature coefficient of reactivity: as
               | temperature increases, the neutron flux is reduced, which
               | both makes it more controllable, and tends to prevent
               | runaway reactions.
               | 
                | * Negative void coefficient of reactivity: as voids
                | (steam pockets) increase, the neutron flux is reduced.
                | 
                | * Control rods constructed solely of neutron absorber.
                | The RBMK reactor (Chernobyl) in particular used graphite
                | followers (tips), which _increased_ reactivity initially
                | when being lowered.
               | 
               | It's also worth noting that nuclear reactors are designed
               | to be operated within certain limits. The RBMK reactor
               | would have been fine had it been operated as designed.
               | 
               | Source: was a nuclear reactor operator on a submarine.
        
               | the-dude wrote:
               | An unknown unknown.
        
               | marcosdumay wrote:
               | You mean to ask if it's a joke? Yes, it's a joke.
               | 
               | Or you ask if it's a lesson about how real systems
               | operate? Because yes, it's a very serious lesson about
               | how real systems operate.
               | 
                | Anyway, you seem to be out of your depth on systems
                | engineering. Your reply downthread isn't applicable (of
                | course fail-safes can fail; anything can fail). If you
                | want to learn more about this area (not everybody wants
                | to, and that's OK), following that link of systems theory
                | books on the wiki may be a good idea. Or maybe start at
                | the root:
               | 
               | https://en.wikipedia.org/wiki/Systems_theory
               | 
               | Notice that there is a huge amount of handwaving in
               | system engineering. I don't think this is good, but I
               | don't think it's avoidable either.
        
               | jerf wrote:
               | "Notice that there is a huge amount of handwaving in
               | system engineering. I don't think this is good, but I
               | don't think it's avoidable either."
               | 
               | In my experience, you can be specific, but then you get
               | the problem that people think that if they just 'what if'
               | a narrow solution to the particular problem you're
               | presenting they've invalidated the example, when the
               | point was 1. that this is a representative problem, not
               | this specific problem and 2. in real life you don't get a
               | big arrow pointing at the exact problem 3. in real life
               | you don't have one of these problems, your entire system
               | is _made_ out of these problems, because you can 't help
               | but have them, and 4. availability bias: the fact that
               | I'm pointing an arrow at this problem for demonstration
               | purposes makes it very easy to see, but in real life, you
               | wouldn't have a guarantee that the problem you see is the
               | most important one.
               | 
               | There's a certain mindset that can only be acquired
               | through experience. Then you can talk systems engineering
               | to other systems engineers and it makes sense. But prior
               | to that it just sounds like people making excuses or
               | telling silly stories or something.
               | 
               | "(of course fail-safes can fail, anything can fail)"
               | 
               | Another way to think of it is the correlation between
               | failure. In principle, you want all your failures to be
               | uncorrelated, so you can do analysis assuming they're all
               | independent events, which means you can use high school
               | statistics on them. Unfortunately, in real life there's a
               | long tail (but a completely real tail) of correlation you
               | can't get rid of. If nothing else, things are physically
               | correlated by virtue of existing in the same physical
               | location... if a server catches fire, you're going to
               | experience _all sorts_ of highly correlated failures in
               | that location. And  "just don't let things catch fire"
               | isn't terribly practical, unfortunately.
               | 
               | Which reiterates the theme that in real life, you
               | generally have very incomplete data to be operating on. I
               | don't have a machine that I can take into my data center
               | and point at my servers and get a "fire will start in
               | this server in 89 hours" readout. I don't get a heads up
               | that the world's largest DDOS is about to be fired at my
               | system in ten minutes. I don't get a heads up that a
               | catastrophic security vulnerability is about to come out
               | in the largest logging library for the largest language
               | and I'm going to have a never-before-seen random rolling
               | restart on half the services in my company with who knows
               | what consequences. All the little sample problems I can
               | give in order to demonstrate systems engineering problems
               | imply a degree of visibility you don't get in real life.
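                | 
                | A tiny illustration of how much correlation moves the
                | numbers (Python; the probabilities are made up):
                | 
                |     p = 0.01  # each of two "redundant" parts fails
                | 
                |     # Independent failures: both down together with
                |     # probability p * p.
                |     print(p * p)  # 0.0001
                | 
                |     # Perfectly correlated failures (same rack, same
                |     # fire): both down whenever one is.
                |     print(p)      # 0.01, a 100x worse outage rate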
        
           | TrueDuality wrote:
            | According to the SOC certifications they give their
            | customers, they do.
        
           | thetinguy wrote:
           | They do. I remember watching one of their sessions where they
           | showed every rack having its own battery backup.
        
             | tyingq wrote:
             | An article on that: https://datacenterfrontier.com/aws-
             | designs-in-rack-micro-ups...
             | 
             | Interesting quote:
             | 
             |  _"This is exactly the sort of design that lets me sleep
             | like a baby," said DeSantis. "And indeed, this new design
             | is getting even better availability" - better than "seven
             | nines" or 99.99999 percent uptime, DeSantis said._
        
           | chousuke wrote:
           | Sometimes, you have a component which fails in such a way
           | that your redundancies can't really help.
           | 
           | I once had to prepare for a total blackout scenario in a
           | datacenter because there was a fault in the power supply
           | system that required bypassing major systems to fix. Had some
           | mistake or fault happened during those critical moments, all
           | power would've been lost.
           | 
           | Well-designed redundancy makes high-impact incidents less
           | likely, but you're not immune to Murphy's law.
        
             | macintux wrote:
              | To my mind, among the more frustrating aspects of
             | implementing protection against failure is that the
             | mechanisms to be added can themselves cause failure.
             | 
             | It's turtles all the way down.
        
               | chousuke wrote:
               | You need to pick your battles and choose what you want to
               | protect against to mitigate risk and enable day-to-day
               | operations.
               | 
               | For example, too often people will set up clustered
               | databases and whatnot because "they need HA" without much
               | thought about all the other potential effects of using a
               | cluster, such as much more complicated recovery
               | scenarios.
               | 
               | In the vast majority of cases, an active-passive
               | replicated database with manual failover is likely to
               | have fewer pitfalls and gives you the same operational HA
               | a clustered database would, even though in the case of a
               | (rare) real failure it would not automatically recover
               | like a cluster _might_.
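                | 
                | On RDS, for example, the manual failover can be as
                | small as promoting the replica and repointing the
                | application. A boto3 sketch (the identifier is
                | hypothetical):
                | 
                |     import boto3
                | 
                |     rds = boto3.client("rds",
                |                        region_name="us-east-1")
                |     # Promotion breaks replication and makes the
                |     # replica a standalone writable instance; the
                |     # app still has to be repointed by hand, which
                |     # is the "manual" part.
                |     rds.promote_read_replica(
                |         DBInstanceIdentifier="app-db-replica")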
        
           | taf2 wrote:
            | It was not a total power loss. Out of 40 instances we had
            | running at the time of the incident, only 5 appeared to be
            | lost to the power outage. The bigger issue for us was that
            | the EC2 API to stop/start these instances appeared to be
            | unavailable (probably due to the rack these instances were
            | in having no power). The other issue that was impactful to
            | us was that many of the remaining running instances in the
            | zone had intermittent connectivity out to the internet.
            | Additionally, the incident was made worse by many of our
            | supporting vendors being impacted as well...
            | 
            | IMO it was handled rather well and fast by AWS... not saying
            | we shouldn't beat them up (for a discount), but honestly
            | this wasn't that bad.
        
             | res0nat0r wrote:
              | If the rack your instances are running in is totally
              | offline, then the EC2 API unfortunately can't talk to the
              | dom0 and tell the instances to stop/start, so you get
              | annoying "stuck instances" and really can't do anything
              | until the rack is back online and able to respond to API
              | calls.
        
         | SCdF wrote:
         | So dumb question from someone who hasn't maintained large
         | public infrastructure:
         | 
          | Isn't the whole point of availability zones that you deploy to
          | more than one and support failing over if one fails?
         | 
         | IE why are we (consumers) hearing about this or being obviously
         | impacted (eg Epic Games Store is very broken right now)? Is my
         | assessment wrong, or are all these apps that are failing built
         | wrong? Or something in between?
        
           | sprite wrote:
            | I thought I was multi-AZ, but something failed. I am mostly
            | running EC2 + RDS, both across 2 availability zones. I will
            | have to dig into the problem, but I think the issue is that
            | my RDS setup is one writer instance and one reader instance,
            | each in a different AZ. There was nothing for it to fail
            | over to, since my other instance was the writer. So I guess
            | I need to keep a 3rd instance available, preferably in a 3rd
            | AZ?
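            | 
            | Edit: for anyone in the same spot, my understanding is
            | that a cross-AZ read replica is not the same thing as RDS
            | Multi-AZ, which keeps a synchronous standby that RDS
            | fails over to automatically. A boto3 sketch (the
            | identifier is hypothetical):
            | 
            |     import boto3
            | 
            |     rds = boto3.client("rds", region_name="us-east-1")
            |     # MultiAZ provisions a standby in another AZ with
            |     # automatic failover; a read replica only becomes
            |     # a writer if you promote it yourself.
            |     rds.modify_db_instance(
            |         DBInstanceIdentifier="my-db",
            |         MultiAZ=True,
            |         ApplyImmediately=True,
            |     )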
        
           | TruthWillHurt wrote:
           | Amazon shifts the responsibility for multi-AZ deployment to
           | us customers, saving themselves complexity and charging us
           | extra - win-win for them.
        
           | fulafel wrote:
            | IME people rarely test and drill the failovers; it's just a
            | checkbox in a high-level plan. Maybe they have a todo item
            | for it somewhere, but it never seems very important, as AZ
            | failures are usually quite rare. After ignoring the issue
            | for a while it starts to seem risky to test: you might cause
            | an outage due to the bugs it's likely to uncover.
        
           | robjan wrote:
           | That's the theory but in practice very few companies bother
           | because it's expensive, complicated and most workloads or
           | customers can tolerate less than 100% uptime.
        
           | gpm wrote:
           | > or are all these apps that are failing built wrong
           | 
           | Deploying to multiple places is more expensive, it's not
           | wrong to choose not to, it's trading off reliability for
           | cost.
           | 
            | It's also unclear to me how often things fail in a way that
            | actually affects only one AZ, but I haven't seen any good
            | statistics either way on that one.
        
           | _joel wrote:
            | You're supposed to build your app across multiple AZs, but I
            | know a lot of companies that don't do this and shove
            | everything in a single AZ. It's not just about deploying an
            | instance there but ensuring the consistency of data and
            | state across the AZs.
        
           | peeters wrote:
           | As I understand it for something like SQS, Lambda etc, AWS
           | should automatically tolerate an AZ going down. They're
           | responsible for making the service highly available. For
           | something like EC2 though, where a customer is just running a
           | node on AWS, there's no automatic failover. It's a lot more
           | complicated to replicate a running, stateful virtual machine
           | and have it seamlessly failover to a different host. So
           | typically it's up to the developers to use EC2 in a way that
           | makes it easy to relaunch the nodes on a different AZ.
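            | 
            | The usual pattern is an Auto Scaling group spread across
            | subnets in several AZs, so replacement nodes come up in a
            | healthy zone. A boto3 sketch (all names are
            | hypothetical):
            | 
            |     import boto3
            | 
            |     asg = boto3.client("autoscaling",
            |                        region_name="us-east-1")
            |     asg.create_auto_scaling_group(
            |         AutoScalingGroupName="web-asg",
            |         MinSize=2,
            |         MaxSize=6,
            |         LaunchTemplate={
            |             "LaunchTemplateName": "web-template"},
            |         # One subnet per AZ: losing a zone leaves the
            |         # group able to launch in the others.
            |         VPCZoneIdentifier="subnet-aaa,subnet-bbb",
            |     )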
        
             | luhn wrote:
              | It sounds like the EC2 API is having a brownout due to
              | this, so a lot of people _can't_ fail over to a new AZ.
        
         | xyst wrote:
          | This region in general is a clusterfuck. If by now companies do
          | not have a disaster recovery and resiliency strategy in place,
          | they are just shooting themselves in the foot.
        
           | philsnow wrote:
            | In today's world of stitching together dozens of services,
            | each of which probably does the same thing, how is one to
            | avoid a dependency on us-east-1? Add yet another bullet to
            | the vendor questionnaire (ugh) about whether they are
            | singly-homed / have a failover plan?
           | 
           | It's turtles all the way down, and underneath all the turtles
           | is us-east-1.
        
             | [deleted]
        
         | vinay_ys wrote:
         | This is quite interesting as they claim their datacenter design
         | does better than Uptime's Tier3+ design requirements which
         | require redundant power supply paths.
         | [https://aws.amazon.com/compliance/uptimeinstitute/]. I really
         | hope they publish a thorough RCA for this incident.
        
           | tyingq wrote:
           | _" Electrical power systems are designed to be fully
           | redundant so that in the event of a disruption,
           | uninterruptible power supply units can be engaged for certain
           | functions, while generators can provide backup power for the
           | entire facility."_ https://aws.amazon.com/compliance/data-
           | center/infrastructure...
           | 
           | So they have 2 different sources of power coming in. And
           | generators. They do mention the UPS is only for _" certain
           | functions"_, so I guess it's not enough to handle full load
           | while generators spin up if the 2 primaries go out. Or
           | perhaps some failure in the source switching equipment
           | (typically called a "static transfer switch").
           | 
           | Some detail on different approaches:
           | https://www.donwil.com/wp-content/uploads/white-
           | papers/Using...
        
             | dylan604 wrote:
             | The generators should be powering up as soon as one of the
             | 2 different sources goes down. It takes generators a few
             | minutes to power up and get "warmed up". If they don't
             | start this process until both mains sources are down, then
                | oops, there's a power outage.
             | 
             | I used to work next door to a "major" cable TV station's
             | broadcast location. They had multiple generators on-site,
             | and one of them was running 24/7 (they rotated which one
             | was hot). A major power outage hit, and there was a
             | thunderous roar as all of the generators fired up. The
             | channel never went off the air.
        
               | tyingq wrote:
               | There are setups where the UPS is designed to last long
               | enough for generator spin up as well. I believe it's the
               | most common setup if you have both. I assume spinning up
               | the generators for very short-lived line power blips
               | might be undesirable.
        
               | hughesjj wrote:
               | I thought running a generator full time was illegal AF
               | due to environmental regulations?
        
               | BenjiWiebe wrote:
               | Are you sure about the few minutes part? The standby
               | generators I've seen take seconds to go from off to full
                | load. We have an 80 kW model, but I've also seen videos
                | of load tests of much larger generators, and they also
                | take only seconds to go to full load.
        
               | reaperducer wrote:
               | It might depend on when the backup system was built. No
               | company updates their system every year.
               | 
               | A few minutes seems correct for one place I worked.
               | 
               | This was back in the 90's, before UPS technology got
               | really interesting. Our system was two large rooms with
               | racks and racks and racks of car batteries wired
               | together. When the power went out, the batteries took
               | over until the diesel generator could come online.
               | 
               | I saw it work during several hurricanes and other flood
               | events.
               | 
               | I always found the idea of running an entire building off
               | of car batteries amusing. The engineers didn't share my
               | mirth.
        
               | idiotsecant wrote:
               | Lead acid batteries are still industry standard in many
               | applications where you are OK with doing regular
               | maintenance and you just need them to work, full stop. I
               | think you'd be surprised how much of your power
                | generation infrastructure, for example, has a 125VDC
               | battery system for blackouts.
        
               | dylan604 wrote:
               | Lead acid batteries in that form factor were the staple
               | for many UPS systems, and the thing most people didn't
               | really appreciate was how expensive they were to
               | maintain. If you didn't do regular maintenance, you'd
               | find out that one of the cells in one of the batts was
               | dead causing the whole thing to be unable to deliver
               | power at precisely the worst time. Financially strapped
               | companies cut maintenance contracts at the first sign of
               | trouble.
               | 
               | Edit to add: I was at a place that took over a company
               | that had one of these. With all of the dead batteries, it
               | was just a really really large inverter taking the
               | 3-phase AC to DC back to AC with a really nice and clean
               | sine wave.
        
             | AtlasBarfed wrote:
             | Has datacenter power redundancy undergone any sort of
             | revolution with grid storage becoming industrial scale?
             | 
              | I wonder if a lot of AWS DC design in this area predates
              | the grid battery storage revolution, which (my impression
              | is) has a far faster switchover time than a generator
              | spin-up, possibly with software systems that detect and
              | switch over quickly.
              | 
              | AWS can claim it will be best of breed, but they aren't
              | going to throw out a DC power redundancy investment (or
              | threaten downtime) while they can still wring ROI out of
              | it.
        
               | tyingq wrote:
                | I'd be surprised. Data centers eat a lot of energy, and
                | it's hard to beat the energy density of diesel (~46
                | MJ/kg vs ~1 for batteries) and the ability to have
                | nearby tanks or scheduled trucks.
               | 
               | Tesla apparently did some early pilot stuff:
               | https://www.datacenterdynamics.com/en/analysis/teslas-
               | powerp...
        
             | vinay_ys wrote:
             | Usually when someone claims T3+ they mean they have UPS
             | clusters in 3+1 (or such) configuration and two different
             | such UPS clusters power two power-strips in a rack. Then,
             | would also have incoming grid power supply from two
             | different HV sub-stations with non-intersecting cable
             | paths. They would also have diesel power generators in 3+1
             | or 5+2 configurations with automatic startup time in
             | seconds. The UPS's power storage (chemical or potential
             | energy based devices) can hold enough energy to handle full
             | load for several minutes. If these are design and
             | maintained correctly, even while concurrent scheduled
             | maintenance is ongoing, an unexpected component failure
             | should not cause catastrophic outage. At each layer (grid
             | incomers, generator incomers, UPS power incomers) there are
             | switches to switch over whenever there's a need
             | (maintenance or failure).
             | 
             | If they claim tier4, then they basically have everything in
             | n+n configuration.
        
               | tyingq wrote:
               | Though that doesn't match very well with _"
               | uninterruptible power supply units can be engaged for
               | certain functions"_. It sounds worded to convey that the
               | UPS is limited in some way. An interesting old summary of
               | their 2012 us-east-1 incident with
               | power/generators/ups/switching:
               | https://aws.amazon.com/message/67457/
        
             | rainbowzootsuit wrote:
             | Likely the UPS can't run HVAC, and you are in an overheat
             | condition in about two minutes with a fully loaded data
             | center without cooling. Proportionately longer as load is
             | reduced.
        
         | notyourday wrote:
          | We are being told that there are still issues in USE1-AZ4 and
          | some of the instances are stuck in the wrong state as of 16:15
          | EST. There's no ETA for resolution.
        
         | codeduck wrote:
         | another example of a single dc in a single AZ rendering an
         | entire region almost unusable. This has shades of eu-central-1
         | all over again.
        
           | nightpool wrote:
           | Amazon is claiming the failure is limited to a single AZ. Are
           | you seeing failures for instances outside of that AZ? If not,
           | how has this rendered "the entire region almost unusable"?
        
             | matharmin wrote:
             | Yes, I've seen issues that affected the entire region. In
             | my specific case, I happened to have an ElastiCache cluster
             | in the affected AZ that became unreachable (my fault for
             | single AZ). But even now, I'm unable to create any new
             | ElastiCache clusters in different AZs (which I wanted to
             | use for manual failover). And there were a lot of errors on
             | the AWS console during the outage.
             | 
             | "almost unusable" is maybe exaggerating, but there were
             | definitely issues affecting more than just the single AZ.
        
               | jedberg wrote:
               | Probably because you aren't the only one trying to do
               | that. The folks who successfully fail over a zone are the
               | ones who have already automated the process and are
               | running active/active configurations so everything is set
               | up and ready to go.
        
               | wizwit999 wrote:
                | That seems acceptable. The data plane failure is
                | contained to an AZ; the control plane often is not.
        
               | [deleted]
        
             | codeduck wrote:
             | We've had alerts for packet loss and had issues in
             | recovering region-spanning services (both AWS and 3rd
             | party).
             | 
             | Yes, some of these we should be better at handling
             | ourselves, but... it's all very well to say "expect to lose
             | an AZ" but during this outage it's not been physically
             | possible to remove the broken AZ instances from multi-AZ
             | services because we cannot physically get them to respond
             | to or acknowledge commands.
             | 
             | edit: just to short circuit any "well, why aren't you
             | running redundant regions" - we run redundant regions at
             | all times. But for reasons of latency, many customers will
             | bind to their closest region, and the nature of our
              | technology is highly location-bound. It is not possible for
             | us to move active sessions to an alternate region. So
             | something like this is... unpleasant.
        
               | mentat wrote:
               | You don't have health checks?
        
             | londons_explore wrote:
              | A lot of people will automatically fail over jobs to other
              | AZs. That often involves spinning up lots more EC2
              | instances and moving PBs of data. The end result is that
              | all capacity in the other AZs gets used up and networks
              | fill to capacity, so even if those other zones are
              | technically working, practically they aren't really usable.
        
               | Godel_unicode wrote:
                | That doesn't appear to have happened, though; I haven't
                | seen issues outside AZ4.
        
               | reilly3000 wrote:
                | While there may be more machines provisioned, many orgs
                | run active/active setups for failover, so they aren't as
                | affected. In terms of data transfer, it should already be
                | there. Where would it come from? Certainly not the dead
                | AZ.
        
               | manquer wrote:
                | It is Amazon's own services, the ones advertised as
                | multi-AZ, that would generate the bulk of this
                | thundering-herd traffic.
        
             | tyingq wrote:
              | Perspective, I would guess. Unless you spend a lot of time
              | on retry/timeout/fail logic around AWS APIs, your app could
              | be stuck/blocked in the RunInstances() API, for example.
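              | 
              | A sketch of the kind of guard rails meant (boto3; the
              | values are illustrative): bound timeouts and retries so
              | a call fails fast instead of hanging.
              | 
              |     import boto3
              |     from botocore.config import Config
              | 
              |     cfg = Config(
              |         connect_timeout=5,  # seconds
              |         read_timeout=10,
              |         retries={"max_attempts": 3,
              |                  "mode": "standard"},
              |     )
              |     ec2 = boto3.client("ec2",
              |                        region_name="us-east-1",
              |                        config=cfg)
              |     try:
              |         ec2.run_instances(ImageId="ami-12345678",
              |                           InstanceType="t3.micro",
              |                           MinCount=1, MaxCount=1)
              |     except Exception as exc:
              |         # Alert / fail over instead of blocking.
              |         print(f"launch failed: {exc}")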
        
         | alostpuppy wrote:
         | Why do folks host their stuff in us-East? Is there a draw other
         | than organizational momentum?
        
           | superdug wrote:
           | It's the cheapest.
        
             | deanCommie wrote:
             | us-east-2 has exactly the same prices as us-east-1.
        
               | res0nat0r wrote:
                | Most likely inertia. us-east-1 was the first AWS region,
                | gets new features released there first, and is the
                | largest in the USA, so many companies have been running
                | there for many years, and the cost of moving to us-east-2
                | > the cost of occasional AWS-created downtime.
        
           | dragonwriter wrote:
           | > Why do folks host their stuff in us-East?
           | 
           | Off the top of my head, US-EAST-1 is:
           | 
           | (1) topologically closer to certain customers than other
           | regions (this applies to all regions for different
           | customers),
           | 
           | (2) consistently in the first set of regions to get new
           | features,
           | 
           | (3) usually in the lowest price tier for features whose
           | pricing varies by region,
           | 
           | (4) where certain global (notionally region agnostic)
           | services are effectively hosted and certain interactions with
           | them in region-specific services need to be done.
           | 
           | #4 is a unique feature of US-East-1, #2-#3 are factors in
           | region selection that can also favor other regions, e.g., for
           | users in the West US, US-West-2 beats US-West-1 on them, and
           | is why some users topologically closer to US-West-1 favor US-
           | West-2.
        
             | alostpuppy wrote:
             | Thank you! This one is why I love HN.
        
       | reactive55 wrote:
       | Bitbucket is down as well
        
       | mule1 wrote:
       | Feel for devops peeps who are just trying to chill for Christmas
        
       | reactive55 wrote:
       | Bitbucket is down as well because of this.
       | https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g...
        
       | sswaner wrote:
        | Not down as of 7:40 EST. US-EAST-1 hosted site (athene.com):
        | Cognito, API Gateway, Lambda, S3, DynamoDB, RDS, CloudFront.
        
       | andyjih_ wrote:
       | The most hilarious irony of not being able to acknowledge a 4AM
       | page in the PagerDuty mobile app because AWS is down.
        
         | exikyut wrote:
         | (Which was about AWS being down?)
        
       | antihero wrote:
       | I wonder how many 9s AWS is going for. Can't be a lot of 9s
       | anymore.
        
         | arh68 wrote:
         | 89.9999 % has a lot of 9s, dare I say military-grade.
        
         | yabones wrote:
         | Nine Fives is the new Five Nines!
        
       | amai wrote:
        | Thank goodness we host all IT services in the same cloud. Imagine
        | the chaos we'd have if everything didn't fail at the same time.
        
       | sascha_sl wrote:
        | quay.io is also dead, as well as giphy and some parts of Slack
        | 
        | just the weekly internet apocalypse, happy holidays fellow SREs
        
       | camdenreslink wrote:
       | Who needs chaos monkey? Just host on AWS for a similar effect.
        
       | anonu wrote:
        | Better dust off your BCP docs. People will be asking for them
        | quite a bit more in the new year.
        
       | exabrial wrote:
        | As an industry, can we please stop making products like vacuums
        | that can't operate unless someone else's computer is working in a
        | field in Virginia? There's literally no reason for it.
        
       | bob1029 wrote:
       | 2 of our servers are fucked right now. VOIP services down.
       | 
        | Only with AWS and GitHub do I seem to get panicked text messages
        | on my phone first thing in the morning... Our workloads on Azure
        | typically only have faults when everyone is in bed.
        
       | throwaway875487 wrote:
       | Our RDS instances have completely packed up. Hell knows what's
       | going on. Here come the customer support tickets.
        
       | ItsBob wrote:
       | I've built out many 42U racks in DC's in my time and there were a
       | couple of rules that we never skipped:
       | 
        | 1. Dual power in each server/device - one PSU was powered by one
        | outlet, the other PSU by a different one with a different
        | source, meaning we could lose a single power supply/circuit and
        | nothing would happen.
        | 
        | 2. Dual network (at minimum) - for the same reasons as above,
        | since the switches didn't always have dual power in them.
       | 
        | I've only had a DC fail once, when an engineer performing work
        | on the DC's power circuitry thought he was taking down one
        | circuit, but it was in fact the wrong one, and he took both
        | power circuits down at the same time.
       | 
       | However, a power cut (in the traditional sense where the supplier
       | has a failure so nothing comes in over the wire) should have
       | literally zero effect!
       | 
       | What am I missing?
       | 
        | I've never worked anywhere with Amazon's budget, so why are they
        | not handling this? Is it more than just the incoming supply
        | being down?
        
         | Bluecobra wrote:
         | > What am I missing?
         | 
          | My guess is that they cheaped out on redundant PSUs to get you
          | to use multiple availability zones. (More zones = more
          | revenue.)
          | 
          | Even a single PSU shouldn't be an issue if they plugged it
          | into an ATS, though.
        
           | Godel_unicode wrote:
           | Unless the ATS breaks, which happens.
        
             | Bluecobra wrote:
              | For sure; in my context I meant an ATS in a single
              | rack/cabinet. If that went bad, the blast radius would be
              | contained to a single cabinet. But yeah, anything can and
              | will happen. At another place I worked, a site UPS took
              | down an entire server room. It was a pretty nice Eaton
              | system, but some event fried the whole thing. Eaton had to
              | send a specialist to investigate the matter, as those
              | events are pretty rare.
        
             | mnordhoff wrote:
             | Yup. I'm still upset (but not angry) about
             | https://status.linode.com/incidents/kqhypy8v5cm8.
        
         | notyourday wrote:
         | > I've only had a DC fail once, when an engineer performing
         | work on the power circuitry for the DC thought he was taking
         | down one circuit, but in fact had the wrong one and took both
         | power circuits down at the same time.
         | 
         | This is all local scale. Your setup would not survive a data
         | center scale power outage, and at scale, power outages _are_
         | datacenter scale.
         | 
         | Data centers lose supply lines. They lose transformers.
         | Sometimes they lose the primary feed and the secondary feed
         | at the same time. Automatic transfer switches _cannot be
         | tested periodically_ i.e. they are typically tested _once_.
         | Testing them is not "fire up a generator and see if we can
         | draw from it".
         | 
         | For a system that must stay up, it is cheaper to design it to
         | tolerate a data center being totally down and a portion of
         | the system being totally unavailable than to add more
         | datacenter mitigations.
        
           | ClumsyPilot wrote:
           | "It is cheaper to design a system that must be up which
           | accounts for a data center being totally down and a portion
           | of the system being totally unavailable than to add more
           | datacenter mitigations."
           | 
           | Citation needed - the same issues with testing, data races
           | and expensive bandwidth come up.
        
             | notyourday wrote:
             | At high energy the lead time for the components is measured
             | not in days but in years.
        
               | ClumsyPilot wrote:
               | And so is development time of any distributed software
               | system, and training time required to operate it
               | correctly
        
               | notyourday wrote:
               | > And so is development time of any distributed software
               | system, and training time required to operate it
               | correctly
               | 
                | Software is much easier than hardware. If you were to
                | start a project today with this kind of hardware, you
                | would be operating it in 2029, without changes.
        
           | vel0city wrote:
           | The only full datacenter outage I've personally experienced
           | was a power maintenance tech testing the transfer switch
           | between systems where the power was 90 degrees out of phase.
           | Big oof.
        
           | ItsBob wrote:
           | Yes but if you have reliable power from two different sources
           | then the biggest risk (I'd imagine) is the failover
           | circuitry! Something that should be tested tbh.
           | 
           | Also, there are banks of batteries and generators in between
           | the power company cables and the kit: did they not kick-in?
           | 
           | Again, this is all pure speculation: I have absolutely no
           | idea of the exact failure, nor how their infrastructure is
           | held together - this is all just speculation for the hell of
           | it :)
        
             | notyourday wrote:
             | > Yes but if you have reliable power from two different
             | sources then the biggest risk (I'd imagine) is the failover
             | circuitry! Something that should be tested tbh.
             | 
              | That's the ATS. It is not really advisable to test its
              | under-load performance, because the failure of an ATS
              | would be catastrophic. An ATS would typically be tested
              | at installation, and after that its parameters would be
              | monitored.
             | 
              | Replacing a functional in-line ATS would be a 9-12 month
              | project.
             | 
             | > Also, there are banks of batteries and generators in
             | between the power company cables and the kit: did they not
             | kick-in?
             | 
             | At high energy you are pretty much always going to use an
             | ATS.
        
               | belfalas wrote:
               | > the failure of an ATS would be catastrophic
               | 
               | Because that would mean no power at all to the DC and no
               | way to get it back? (I am completely ignorant on this
               | topic)
        
               | notyourday wrote:
               | > Because that would mean no power at all to the DC and
               | no way to get it back? (I am completely ignorant on this
               | topic)
               | 
                | While most of the smarts in the ATS are in the
                | electronics, the really nasty failures come from the
                | mechanical part.
                | 
                | At the end of the day a high energy ATS looks just like
                | a switch behind a meter in your house. There's a lip
                | that goes from one position to another, except in a
                | high energy ATS the lip is big, and when the transfer
                | occurs it slams from one source to another.
                | 
                | There are only so many of those physical slams that it
                | can withstand to begin with, so you want to minimize
                | that number.
               | 
                | The second failure mode is that after a transfer to the
                | non-main source, the lip can get stuck there, making it
                | impossible to switch back to the main. [Once I saw the
                | lip _melt_ into the secondary position. While I thought
                | it was weird, the guys from the power company said it
                | is not that uncommon.] This creates a massive problem,
                | as the non-main source is typically not designed for
                | long term 24x7 operation. So now you are stuck on a
                | secondary feeding system and you can't just transfer
                | back to main without de-energizing the system, i.e.
                | taking the power out of the entire data center.
        
             | merlyn wrote:
              | Frying hardware can affect a much wider scope.
              | 
              | I've had bad power supplies fry out, taking the whole
              | power circuit with them, and thus half (or whatever
              | fraction) of the rack's power. I've also had bad power
              | supplies bring down the whole machine as they shunted
              | everything internally too.
             | 
             | When things go bad, anything can happen. You can provide
             | the best effort, and it'll usually work as expected, but
             | there will always be something that can overcome your best
             | efforts.
        
           | bombcar wrote:
           | The datacenter we were in had dual-sourced grid power (two
           | separate grid connections on opposite sides of the block,
           | coming from different substations) along with a room of
           | batteries (good for iirc 1hr total runtime for the whole
           | datacenter, setup in quad banks, two on each "rail"), _and_
           | multiple independent massive diesel generators, _which they
           | ran and switched power to every month for at least an hour_.
           | 
           | And to top it off each rack had its own smaller UPS at the
           | bottom and top, fed off both rails, and each server was fed
           | from both.
           | 
           | We _never_ had a power issue there; in fact SDGE would _ask_
           | them to throw to the generators during potential brown-out
           | conditions.
           | 
           | Of course this was a datacenter that was a former General
           | Atomics setup iirc ...
        
             | notyourday wrote:
             | We were in a triple sourced data center. Fed by three
             | different substations. Everything worked like a charm.
             | Until Sandy hit. It did not affect us at all. But it
             | affected the power company. And everything still worked
             | fine, until one of the transfer switches transferred into
             | UPS position and stopped working in that position.
        
           | theideaofcoffee wrote:
           | Transfer switches at any facility that's worth being
           | colocated in are exercised as regularly as the generators to
           | which they connect. In all of the facilities I have had
           | systems in (>20MW total steady state IT load), that meant
           | once per month at minimum to keep generators happy (and to
           | ensure the transfer functionality works), and more often if
           | the local grid demands it, e.g. ComEd in Chicago, or
           | Dominion in NoVA asking for load shedding.
        
         | bob1029 wrote:
         | > I've never worked anywhere with Amazon's budget so why are
         | they not handling this?
         | 
         | Perhaps we are going to discover how AWS produces such lofty
         | margins by way of their next RCA publication.
        
         | [deleted]
        
         | lordnacho wrote:
         | What about a UPS/battery thingy? That's saved me a few times,
         | though it normally just gives enough time for a short outage.
         | Is it uncommon in cloud infra?
        
           | vel0city wrote:
           | Even regular datacenters will often have UPS systems the
           | size of a small car, usually several of them, to power the
           | entire datacenter for the few minutes needed to get the
           | diesel generators started.
        
         | growse wrote:
         | > 1. Dual power in each server/device - One PSU was powered
         | by one outlet, the other PSU by a different one with a
         | different source, meaning that we can lose a single power
         | supply/circuit and nothing happens.
         | 
         | Nothing happens _if_ you remember that your new capacity
         | limit per DC supply is 50% of the actual limit, _and_ you're
         | 100% confident that either of your supplies can seamlessly
         | handle their load suddenly increasing by 100%.
         | 
         | I've seen more than one failure in a DC where they wired it up
         | as you described, had a whole power side fail, followed by the
         | other side promptly also failing because it couldn't handle the
         | sudden new load placed on it.
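         | 
         | To put hypothetical numbers on it: a rack drawing 18A split
         | across two 16A feeds (9A each) looks comfortably within
         | limits, right up until feed A dies and feed B is suddenly
         | asked for all 18A, tripping its breaker. True redundancy
         | means the total draw stays under the rating of a _single_
         | feed, i.e. under 50% of combined capacity.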
        
           | dijit wrote:
           | EDIT: I misunderstood you were talking about power feeds, the
           | normal case is the run "48% as if it's 100%" (because of
           | power spikes, but also most types of transformers run more
           | efficiently under specific levels of load (40-60).
           | 
           | Normally this is factored into the rack you buy from a
           | hardware provider: they will tell you that you have 10A or
           | 16A on each feed. If you exceed that, it will work, but you
           | are overloading their feed and they might complain about it.
        
             | vel0city wrote:
              | The poster was speaking more of the power delivery going
              | _to_ the power supplies, not the server's power supplies
              | themselves. So say each PSU 1 is wired to circuit A, and
              | each PSU 2 is wired to circuit B. Circuit A experiences a
              | failure. All servers instantly switch all their load over
              | to their PSU 2's on circuit B. Suddenly circuit B's load
              | is roughly double what it was just moments ago. If proper
              | planning wasn't done or followed, this might overload
              | circuit B, meaning all PSU 2's go dark regardless of
              | whether the servers could handle the changeover or not.
        
               | dijit wrote:
               | Yeah I understand on re-reading: but that's also not how
               | people run datacenters.
               | 
                | Obviously people can operate things however they want,
                | but you won't get a Tier 3 classification with that
                | setup.
        
             | praseodym wrote:
             | OP is talking about the DC power feed, not a single server
             | PSU.
        
               | dijit wrote:
               | You don't get fed DC power, you get fed AC power.
               | 
               | But, point taken: yes your power feed should be running
               | at <50%. But that just means you treat 50% as 100% just
               | like any resource.
               | 
                | Mostly this is outsourced to the datacenter provider;
                | they'll give you a per-side rating (usually 10A or
                | 16A), which also matches the cooling profile of the
                | cabinet.
        
               | vel0city wrote:
                | I mean, in some datacenters they run DC power to each
                | rack. It's definitely more esoteric than having each
                | device run on AC, but some people do it.
               | 
               | However, with their comment DC == Data Center, not Direct
               | Current.
        
               | dijit wrote:
                | Yeah, I got thrown off by the "per DC supply is 50% of
                | the actual limit".
                | 
                | DC = datacenter made no sense to me there, so my head
                | replaced it with "power supply" instead of "DC supply";
                | the second sentence does make sense as datacenter
                | though.
        
         | uluyol wrote:
         | Why pay the cost of dual X and Y when you can fail over to
         | another cluster?
         | 
         | For big DC workloads, it is usually, though not always, better
         | to take the higher failure rate than add redundancy.
        
           | ItsBob wrote:
           | Really? You'd think at Amazon's scale an additional PSU in a
           | 1U custom-built server (I assume they're custom) would be a
           | few tens of $ at most.
           | 
           | Actually, now that I type that it makes sense. Scaling a few
           | tens of dollars to a bajillion servers on the off-chance that
           | you get an inbound power failure (quite rare I'd reckon)
           | might cost more than what they'd lose if it does actually
           | fail.
           | 
           | So yeah, they're potentially just balancing the risk here and
           | minimising cost on the hardware.
           | 
           | Edit: changed grammar a bit.
        
             | [deleted]
        
             | vel0city wrote:
             | At big cloud provider scale like Amazon, Azure, and Google
             | they probably aren't even running PSUs at each server,
              | they're probably doing DC at the rack these days. No
              | point in having a million little transformers everywhere;
              | it's far easier to centralize those for maintenance and
              | have multiple feeds to the bus bars going to each rack.
        
               | rainbowzootsuit wrote:
                | The ones I'm seeing designed have been moving the DC
                | out to the cabinets, with A/B 480VAC power feeds on the
                | bus and integrated DC inverters/rectifiers/batteries at
                | the rack level.
               | 
               | More modular and a lot less copper at 10x the voltage.
               | Still a lot of copper.
        
         | [deleted]
        
       | exogenousdata wrote:
       | Looks like the SEC's Edgar website is affected. This is the site
       | the SEC uses to post the filings of public companies. Normally
       | there are a hundred or more company filings in the morning
       | starting at 6am ET. This morning there are two.
       | 
       | https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent
        
       | networkisfine wrote:
       | Isn't the point of designing an availability zone with multiple
       | data centers that services aren't affected if a single data
       | center in the zone fails?
        
         | temptemptemp111 wrote:
        
       | sprite wrote:
       | My app running on AWS is currently down. Having intermittent
       | problems with console as well.
        
         | stevehawk wrote:
         | also having console issues in us-east-1, bitbucket is randomly
         | throwing bad gateways at me
        
         | dugmartin wrote:
         | I'm getting a plain "504 Gateway Time-out" page when trying
         | to access anything past the console homepage in us-east-1.
        
       | izietto wrote:
       | I guess that's why I'm experiencing weird issues with Heroku:
       | 
       |     remote: Compressing source files... done.
       |     remote: Building source:
       |     remote: 
       |     remote: ! Heroku Git error, please try again shortly.
       |     remote: ! See http://status.heroku.com for current Heroku platform status.
       |     remote: ! If the problem persists, please open a ticket
       |     remote: ! on https://help.heroku.com/tickets/new
        
         | dijit wrote:
         | Yes.
         | 
         | Another thread: https://news.ycombinator.com/item?id=29648325
        
       | IceWreck wrote:
       | Honestly my server at home has more uptime than US-East-1
        
         | TacticalCoder wrote:
         | I should blog about this one day but...
         | 
         | I have a server at OVH (not affiliated with them) which, at
         | this point, I keep only for fun. It has 3,162 days of uptime
         | as I type this.
         | 
         | 3,162 days. That's 8+ years of uptime.
         | 
         | Does it have the traffic of Amazon? No.
         | 
         | Is it secure? Very likely not: it's running an old Debian
         | version (Debian 7, which came out in, well, 2013).
         | 
         | It only has one port open though: SSH. And with quite a
         | hardened SSH setup at that.
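         | 
         | Roughly this kind of thing in /etc/ssh/sshd_config
         | (illustrative, not the exact file):
         | 
         |     # key-only logins, no root, a single allowed user
         |     PasswordAuthentication no
         |     ChallengeResponseAuthentication no
         |     PermitRootLogin no
         |     AllowUsers backupuser
         |     MaxAuthTries 3
         |     X11Forwarding no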
         | 
         | I installed all the security patches I could install without
         | rebooting it (so, yes, I know, this means I didn't install
         | _all_ the security patches, since some required rebooting).
         | 
         | This server is, by now, a statement. It's about how stable
         | Linux can be. It's about how amazingly stable Debian is. It's
         | also about OVH: at times they had part of their datacenter burn
         | (yup), at times they had full racks that had to be
         | moved/disconnected. But somehow my server never got affected.
         | At most, OVH may have had connectivity issues at one point,
         | but the server itself never went down.
         | 
         | I "gave back" many of my servers I didn't need anymore. But
         | this one I keep just because...
         | 
         | I still use it, but only as an additional online/off-site
         | backup where I send encrypted backups. It's not as if it gets
         | zero use: I typically push backups to it daily.
         | 
         | They're only backups, and they're encrypted. Even if my
         | server is "owned" by some bad guys, the damage they could do
         | is limited. Never seen anything suspicious on it though.
         | 
         | I like to do "silly" stuff like that. Like that one time I
         | solve LCS35 by computing for about four years on commodity
         | hardware at home.
         | 
         | I think it's about time I start to do some archeology on that
         | server, to see what I can find. Apparently I installed Debian 7
         | on it in mid-October 2013.
         | 
         | I've created a temporary user account on it, and at times
         | I've handed out the password (before resetting it) to people
         | just so they could SSH in and type "uptime".
         | 
         | It is a thing of beauty.
         | 
         | Eight. Years. Of. Uptime.
        
           | plandis wrote:
           | Your server could just be an outlier. Doesn't really say
           | anything about AWS or any cloud provider.
        
           | kasey_junk wrote:
           | I read this as a cautionary tale. Here we have a server
           | that only through the grace of god is still up, and is
           | likely owned. If it isn't, it's because of how little is
           | going on with it.
           | 
           | At its current use, it's likely not a major issue, but
           | imagine if someone saw this uptime, took it as a statement
           | of reliability, and built a service on it. I, for one,
           | would want that disclosed because this is a disaster waiting
           | to happen. I'd much rather someone disclose that they had a
           | few servers, each with no longer than 7 days of uptime,
           | because they'd been fully imaged and cycled in that time...
        
             | TacticalCoder wrote:
             | It works both ways: it is also a cautionary tale for those
             | who are prone to believe it's all unreliable cattle that
             | needs constant restart because nothing is stable nor
             | reliable...
        
           | nextaccountic wrote:
           | > Like that one time I solve LCS35 by computing for about
           | four years on commodity hardware at home.
           | 
           | Awesome! Are you Bernard Fabrot [0]?
           | 
           | [0] https://www.csail.mit.edu/news/programmers-solve-
           | mits-20-yea...
        
             | TacticalCoder wrote:
             | Yup that's me... I fear this (old by now) story blew my
             | "tacticalcoder" cover.
        
         | BossingAround wrote:
         | Does your server at home handle similar traffic to that of US-
         | East-1 since you're comparing uptime?
         | 
         | Similarly, my laptop, if I keep it plugged into the wall and
         | enable httpd on localhost, will surely have better uptime
         | than any of the top clouds. I'd bet that it'd have 100%
         | uptime if I plugged in a UPS and cared about traffic on my
         | local network only.
        
           | christophilus wrote:
           | Most people don't need to handle the traffic of US-East-1.
           | They just need a single, simple, mostly reliable server. But
           | they're often told, "Don't do that. It's too hard, and
           | irresponsible, and what if you get a spike in traffic, and
           | what if you need to add 5 new servers, and security is really
           | hard."
           | 
           | In reality, most people don't need to scale. An occasional
           | spike in traffic is a nuisance, but not the end of the world,
           | and security is not terribly hard, if you keep your servers
           | patched (which is trivial to automate).
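           | 
           | On Debian, for instance, automated security patching is a
           | couple of commands (a sketch; tune to taste):
           | 
           |     sudo apt-get install unattended-upgrades
           |     sudo dpkg-reconfigure -plow unattended-upgrades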
           | 
           | I really don't understand why there's so much FUD around
           | running your own stuff.
        
             | [deleted]
        
             | ryanbrunner wrote:
             | I think most people on here are coming from the perspective
             | of startups, which scale out of a single server setup
              | pretty quickly. At a bare minimum, most will have
              | dedicated purpose-built servers for things like Redis or
              | a DB, and often there are separate background workers, or
              | a load balancer with a couple of web servers.
             | 
             | When your server requirements get into needing 5-6 servers
             | (not at all atypical for a startup in their first year of
             | being launched), running your own stuff becomes more of a
             | challenge pretty quickly. Factor in 2-3x growth a year, and
             | the challenges just mount.
        
               | doublerabbit wrote:
               | > running your own stuff becomes more of a challenge
               | pretty quickly. Factor in 2-3x growth a year, and the
               | challenges just mount.
               | 
               | What challenges are you thinking of? You buy a full-rack
               | in colocation and then just buy servers/hardware when
               | required.
               | 
                | If a company has the budget for AWS or some other cloud
                | provider then they have the budget for colocation,
                | which in the long term is cheaper. I see no additional
                | challenge other than maintaining X amount of hardware
                | rather than just one box.
        
               | manquer wrote:
                | Long term is unknown to the startup; they may fail or
                | pivot.
                | 
                | Buying hardware upfront is not feasible even with the
                | cash (which most don't have); I don't know if the
                | company would last that long or would still be doing
                | things that require x servers.
                | 
                | What you are saying is similar to saying it may be
                | cheaper to buy the building/floor instead of renting
                | office space - most small businesses cannot afford to
                | do that, or expect their business to change (fail/take
                | off) within the time frame in which the ROI would
                | justify that commitment.
                | 
                | This is all assuming the startup has the skill to set
                | up and manage physical servers and that there is no
                | opportunity cost (delayed features) in doing so;
                | neither is a given.
                | 
                | Small companies (and poor people) typically don't buy
                | low quality stuff or buy into rent seeking business
                | models because they are dumb; it is usually because
                | they cannot afford to do long term thinking.
        
           | loopdoend wrote:
           | Your home ISP has 100% uptime? That's incredible.
        
             | adamm255 wrote:
              | Mine's had 100% uptime for the past 2 months. I've had
              | better value for money using a NUC for personal projects
              | than from public cloud subscriptions over the past few
              | years.
        
             | BossingAround wrote:
             | I mentioned local network, didn't I...
        
             | omh2 wrote:
              | Let's be real here, we don't need anywhere near 100% ISP
              | uptime to beat AWS over the last couple months...
        
               | acdha wrote:
               | That depends on what you mean by AWS. I had production
               | workloads in us-east-1 which haven't been affected by any
               | of these, and others which had only modest degradation.
               | We had control plane issues but the running services were
               | fine.
               | 
               | Put another way: even if your home ISP has had 100%
               | uptime, are you comfortable saying that was true for all
               | of their customers?
        
           | IceWreck wrote:
           | No but I access my home-server remotely from my university
           | all the time and it hasn't gone down once.
           | 
           | Better uptime than paying for EC2 on AWS US-East-1.
           | 
           | Obviously this approach isn't scalable but it serves me well.
        
             | amelius wrote:
             | > Obviously this approach isn't scalable but it serves me
             | well.
             | 
             | It's perfectly scalable. Just give everybody their own home
             | server.
        
           | Sammi wrote:
           | > Does your server at home handle similar traffic to that of
           | US-East-1 since you're comparing uptime?
           | 
           | Of course it doesn't. Why are you asking antagonistic
           | questions?
        
             | grumple wrote:
             | He asked it to demonstrate the point that uptime is trivial
             | for one server with no traffic, and much harder at scale
             | with auto scaling.
        
               | dijit wrote:
               | Then don't host with so many people?
               | 
                | I don't think people care that AWS has other customers;
                | they want _their workload_ to work. If it doesn't, then
                | that's a today issue.
        
             | akyoan wrote:
             | > Honestly my server at home has more uptime than US-East-1
             | 
              | Is this not antagonistic? It's pointless to make these
              | statements, which is what your parent comment pointed
              | out. Go downvote the first one instead.
        
       | jorgeudajer wrote:
        
       | RONROC wrote:
       | The prevailing wisdom throughout the last couple of years was:
       | 
       | "ditch your on-prem infrastructure and migrate to a major cloud
       | provider"
       | 
       | And it's starting to seem like it could be something like:
       | 
       | "ditch your on-prem infrastructure and spin up your own managed
       | cloud"
       | 
       | This is probably untenable for larger orgs where convenience gets
       | the blank check treatment, but for smaller operations that can't
       | realize that value at scale and are spooked by these outages,
       | what are the alternatives?
        
         | TameAntelope wrote:
         | I don't think it's reasonable to be spooked by these outages,
         | and to think your resolution would be to leave AWS entirely.
         | 
         | A _much_ faster and more effective solution that doesn't have
         | you trading cloud problems for on-prem problems (the power
         | outage still happens, except now it's your team that has to
         | handle it) would be to update your services to run in
         | multiple AZs and multiple regions.
         | 
         | Get out of AWS if you want, but don't get out of AWS because
         | of outages. You should be able to mitigate these relatively
         | easily.
        
         | paulryanrogers wrote:
         | Spread the risk? Smaller on prem and cloud / rented bare metal?
        
           | Spivak wrote:
           | Nah, it's actually better to concentrate the risk in this
           | case.
           | 
           | If your app depends on a few 3rd party services -- SendGrid,
           | Twilio, Okta and they're all hosted on different infra then
           | congrats! You're gonna have issues when any one of them is
           | down, yayyy.
           | 
           | Also the marketing benefit can't be downplayed. If your
           | postmortem is "AWS was having issues" then your execs and
           | customers just accept that as the cost of doing business
           | because there's a built-in assumption that AWS, Azure, GCP
           | are world class and any in-house team couldn't do it better.
        
             | aflag wrote:
             | > Also the marketing benefit can't be downplayed. If your
             | postmortem is "AWS was having issues" then your execs and
             | customers just accept that as the cost of doing business
             | because there's a built-in assumption that AWS, Azure, GCP
             | are world class and any in-house team couldn't do it
             | better.
             | 
              | In my experience, execs and customers don't treat an
              | outage differently because AWS is at fault. Though the
              | developers do often have the attitude that it's "someone
              | else's problem", which can actually make execs more
              | worried than if the problem was well known and under the
              | company's control.
        
         | Victerius wrote:
         | I'm tempted to found a startup to help businesses migrate from
         | cloud providers to on-prem infrastructure.
        
           | datavirtue wrote:
            | Slinging some of that sweet Tanzu or Rancher?
        
         | xyst wrote:
         | "Hybrid and multi cloud" is the future. In other words, give us
         | more fucking money.
        
         | [deleted]
        
         | f6v wrote:
         | Self-managed infrastructure doesn't fail now?
        
           | RONROC wrote:
            | We're going to be having this same tired, pedantic, round-
            | about conversation when Teslas routinely decide to take out
            | a family of four because they mistook a plastic bag for an
            | off-ramp.
           | 
           | Commenters will show up like clockwork and say shit like:
           | 
           | "What man, it's not like cars didn't crash before? Haha"
           | 
           | Don't be dense dude. And definitely don't pursue a leadership
           | position anytime in the future.
        
             | deanCommie wrote:
             | Tesla fans are annoying, but it is absolutely valid that
             | the safety bar for self-driving cars can't be "100%
             | perfectly safe" - it needs to be "safer than the
             | alternative".
             | 
             | The problem with both this example, and the AWS one (it
             | needs to have better availability than your personal home-
             | spun solution, and it does), is that people are amazing at
             | deluding themselves.
             | 
             | "Yes, cars are dangerous, because other people can't drive.
             | But I'm a better than average driver"
             | 
             | "Yes, other people will build unreliable systems. But I
             | know how to architect for my use case and ensure that for
             | my needs the availability will be higher than AWS's"
             | 
             | Both are true* in the micro sense and false in the macro
             | sense.
             | 
              | * Not really. 88% of Americans think they are "above
              | average" drivers.
        
             | f6v wrote:
              | Well, anyone non-dense will tell you that the most dense
              | thing you can do is say "oh my god, run for your lives"
              | whenever there's an outage. No statistics, no cost-
              | benefit analysis. Just commenting "Haha you can't manage
              | your own RAIDs and Ciscos, what a noob" makes you a
              | thought leader, yes.
        
           | iso1631 wrote:
           | Not at this rate.
           | 
            | I remember we had a power outage in 2006; it actually took
            | one of my services off air. Since then, of course, that has
            | been rectified, and the loss of a building wouldn't impact
            | any of the critical, essential or important services I
            | provide.
        
             | ctvo wrote:
             | > Not at this rate.
             | 
             | And what rate is this? It gets attention because it impacts
             | more people, but AWS / GCP / Azure uptime is still better
             | than what I've seen for small / mid size businesses trying
             | to manage their own infrastructure.
        
               | iso1631 wrote:
               | Multiple outages in a single month.
        
             | mbesto wrote:
             | > Not at this rate.
             | 
             | Source? Has there ever been an industry wide survey that
             | compares availability from "insert average colo/data center
             | operations" with the cloud ones?
             | 
             | And I'm not talking about "we have 12 SREs who are based in
             | Cupertino and are all paid top dollar to support a
             | colo"...I'm talking _average_.
        
               | Spooky23 wrote:
                | Running a multi-tenant datacenter or hyperscale cloud
                | datacenter is a different business than running your
                | own datacenter. The myth on HN about the cost of
                | running facilities is insane - it's like saying you
                | can't drive a car unless you hire a Formula 1 driver.
               | 
               | I worked through the ranks at a large enterprise that ran
               | a "big" datacenter for a decade. The facilities team was
               | about 6 people, average salary around $90k. I can only
               | remember one power interruption affecting more than a
               | rack, caused by a failure during a maintenance event that
               | requires a shutdown for safety reasons. The rest is like
               | any other industrial facility - you have service
               | contracts for the equipment, etc and maintain things.
               | 
               | There's a cost/capability curve that you need to plan
               | around for these matters. You need to make business and
               | engineering decisions based on your actual circumstances.
               | If the answer is automatically "AWS <whatever>", you're
               | making a decision to burn dollars for convenience.
        
               | f6v wrote:
                | That's not what the parent was asking about. The
                | question is whether a company of a certain size is more
                | likely to suffer from an outage on AWS compared to on
                | its own hardware.
                | 
                | I've been deploying to AWS for years and can't remember
                | an outage on their side in my region. But this is
                | anecdotal and doesn't necessarily reflect the
                | statistics.
        
               | mbesto wrote:
               | > The facilities team was about 6 people, average salary
               | around $90k.
               | 
               | Ok so $540k salaries + benefits, so ~$700k. Then you have
               | transaction costs:
               | 
               | - Annual salary increases
               | 
               | - Any cost associated with people leaving (severance,
               | hiring, recruiters, HR, HR systems)
               | 
               | - Systems that run in the data center (logging,
               | monitoring, etc.)
               | 
               | - Procurement costs with changing costs in hardware
               | (silicon shortages, etc.)
               | 
               | - Security compliance overhead and associated risks
               | 
               | - Finance resources required to capitalize and manage
               | asset allocation
               | 
               | - etc. etc.
               | 
               | Versus
               | 
               | - Click a button and voila it works.
               | 
               | - Hire way less engineers to manage the system
               | administrative portion
               | 
               | > If the answer is automatically "AWS <whatever>", you're
               | making a decision to burn dollars for convenience.
               | 
                | 100% AGREE. The answer is always "it depends", but just
                | like people are saying "just put it in the cloud", the
                | opposite, "well it worked for us using a data center",
                | isn't that simple either.
        
               | Spooky23 wrote:
               | Say all of those costs are $2,000,000, and you have
               | 25,000 billable endpoints in the datacenter... you're
               | looking at less than $0.01/hour for that overhead on a
               | unit basis.
               | 
               | Obviously, there's a huge capital investment component
               | too that has to be incorporated. Those costs may be
               | really high if you're in a growth phase as you need to
               | overbuy capacity.
               | 
               | Just to be clear, I'm not arguing that on-prem is
               | magically cheap. :) But it has its place too!
        
               | mbesto wrote:
               | Agreed on all accounts.
        
               | jandrewrogers wrote:
               | I've worked at a few small companies over the years that
               | had their own significant colos and/or data centers built
               | on the cheap, and only a sysadmin or two to run them.
               | Anecdotally, if the infrastructure is setup right,
               | outages are very rare. Some of these were serving up
               | massive loads at the time. I've done these build-outs a
               | few times in my career and it isn't that hard to do,
               | reliable software is more likely to be an issue. The only
               | significant outage I remember is when the redundant power
               | systems in one DC both failed at the same time for
               | different reasons, which can happen.
               | 
               | It is as if the software industry has collectively
               | forgotten how to run basic data center operations.
               | Something that used to be a blue collar skill is now
               | treated like arcane magic.
        
           | dijit wrote:
           | What an absolutely pointless comment.
           | 
           | Everything fails, we can argue the rate. But I would argue
           | that understanding your constraints is better.
           | 
            | if you _know_ that your secret storage system can't survive
            | if a machine goes away: well, you wire redundant paths to
            | the hardware and do memory mirroring and RAID the hell out
            | of the disks. And if it fails you have a standby in place.
           | 
           | But if you use AWS Cognito.
           | 
           | And it goes down.
           | 
           | You're fucked mate.
        
             | plandis wrote:
             | If you think you can do better than AWS, GCP, Azure there
             | is a lot of money to be made, for sure.
        
             | f6v wrote:
              | It's pointless to discuss how crappy the cloud is
              | whenever AWS goes down. Most of the businesses relying on
              | automatic RDS backups or EC2 auto scaling just don't have
              | time to think about all the underlying tech. I mean, I
              | don't manually allocate memory for variables anymore
              | either. Do I get screwed when there's a memory leak? Yes.
              | What do I do about it? Move on.
        
               | dijit wrote:
               | Then don't host anything, don't do software and don't
               | pretend to be "the future".
        
               | Spivak wrote:
               | This makes no sense. This has nothing to do with the tech
               | and more to do with every team's natural push and pull
               | with build over buy. It's completely pointless to respond
               | to someone who didn't get their DoorDash order with "see
               | this is why you should just make food at home." It
               | completely ignores the reason someone chose to order
               | takeout in the first place.
        
       | biznickman wrote:
       | Why isn't Heroku showing a status error despite being offline?
        
         | mikece wrote:
         | Because it's built on AWS and uses the AWS status page for
         | its status info?
        
       | kemals wrote:
       | Here is The Internet Report episode on the topic of the recent
       | AWS outages, covering the outage and root causes:
       | https://youtu.be/N68pQy8r1DI
        
       | bognition wrote:
       | What a way to start my day
        
       | Hippocrates wrote:
       | Every time a major cloud provider has an outage, Infra people and
       | execs cry foul and say we need to move to <the other one>. But
       | does anyone really have an objective measure of how clouds stack
       | up reliability-wise? I doubt it, since outages and their effects
       | are nuanced. The other move is that they want to go multi-
       | cloud... But I've been involved in enough multi-cloud initiatives
       | to know how much time and effort those soak up, not to mention
       | the overhead costs of maintaining two sets of infra sub-
       | optimally. I would say that for most businesses, these costs
       | far exceed those of the occasional six-hour-long outage.
        
         | mongrelion wrote:
         | I agree with you. I think that having multi-AZ is the first
         | thing to figure out before wanting to do multi-cloud, which is
         | just another buzzword taken out of management's bullshit bucket
         | :)
        
           | Hippocrates wrote:
           | Agree, and multi-AZ is usually easy. IME with AWS and GCP
           | the control plane is the same, the scaling works across
           | AZs, bandwidth is free, and latency is near zero. The level
           | of effort to do that is simply ticking the right boxes at
           | setup time.
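           | 
           | E.g. on AWS, spreading an auto scaling group across AZs is
           | just a matter of listing subnets in different zones (names
           | illustrative):
           | 
           |     aws autoscaling create-auto-scaling-group \
           |       --auto-scaling-group-name web \
           |       --launch-template LaunchTemplateName=web-lt \
           |       --min-size 2 --max-size 6 \
           |       --vpc-zone-identifier "subnet-aaa,subnet-bbb,subnet-ccc"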
        
           | jtc331 wrote:
           | I've seen at least half a dozen full region AWS issues in the
           | past 8 months.
           | 
            | You really need multi-region, and also to not rely on any
            | AWS service that's located only in us-east-1 (including
            | everything from creating new S3 buckets to IAM's STS).
        
         | metadat wrote:
         | I know the Oracle OCI cloud has a reputation for never going
         | hard-down, but also realize HN seems to loathe Big Red
         | (understandably, to a degree, though OCI is pretty nice IME and
         | _very_ predictable).
        
         | sdevonoes wrote:
         | Perhaps it is us, the customers (and our customers, and the
         | customers of our customers, ...), who should get used to the
         | idea that things can go wrong. Except for some specific
         | scenarios (medical-related stuff, for instance), if my
         | favourite online shopping place is down, well, it's down;
         | I'll buy later.
        
         | mijoharas wrote:
         | I mean from the explanation[0], assuming that is correct (I
         | don't have evidence to suggest it's false) - you don't need to
         | be multi-cloud, and you don't even need to be multi-region. As
         | long as you're spread out over multiple availability zones in a
         | region you should be resilient to this failure.
         | 
         | Somewhat surprising to see how many things are failing
         | though, which implies either that a lot of services aren't
         | able to fail over to a different availability zone, or that
         | something else is going wrong.
         | 
         | [0] https://news.ycombinator.com/item?id=29648992
        
           | zeckalpha wrote:
           | That's true for this failure but the prior two for AWS were
           | region wide and the one for GCP last month was global.
        
           | omh2 wrote:
           | AWS doesn't follow their own advice about hosting multi-
           | region, so every time us-east-1 has significant issues,
           | pretty much every AZ and region is affected.
           | 
           | Specifically, large parts of the management API and the IAM
           | service are seemingly centrally hosted in us-east-1. So-
           | called global endpoints are also dependent on us-east-1, as
           | are parts of AWS' internal event queues (e.g. EventBridge
           | triggers).
           | 
           | If your infrastructure is static you'll largely avoid the
           | fallout, but if you rely on API calls or dynamically
           | created resources you can get caught in the blast
           | regardless of region.
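           | 
           | For STS specifically, the SDKs and CLI can at least be told
           | to use a regional endpoint instead of the global us-east-1-
           | backed one:
           | 
           |     # opt out of the global sts.amazonaws.com endpoint
           |     export AWS_STS_REGIONAL_ENDPOINTS=regional
           |     export AWS_DEFAULT_REGION=eu-west-1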
        
             | spmurrayzzz wrote:
             | Your last comment is really important, I think. I have
             | always petitioned for "passive over active" design in
             | distributed cloud systems. The recent outages, and also
             | ones from the past, demonstrate why.
             | 
             | The fewer API calls you need to make in-band with whatever
             | throughput is generated via your customer demand, the
             | better. Related to that, I have been critical of
             | lambda/FaaS/serverless infrastructure patterns for similar
             | reasons. Always felt like a brittle house of cards to me
             | (N.B. I do still use aws lambda, but keep it constrained to
             | non-critical workloads).
        
               | pm90 wrote:
               | > The fewer API calls you need to make in-band with
               | whatever throughput is generated via your customer
               | demand, the better.
               | 
               | Agreed; however, this is somewhat difficult to do
               | correctly. There are all sorts of systems that might have
               | hidden dependencies on managed services. e.g. AWS IAM
               | roles will almost always be checked at some point if your
               | services need to interact with AWS managed services.
               | 
               | I think cloud providers could meet developers half way
               | here, by providing ways to reduce API usage; but I'm not
               | sure if it aligns with their incentives.
        
               | electroly wrote:
               | AWS IAM is designed with a control plane / data plane
               | dichotomy. Even if the control plane is completely dead
               | and all API requests are failing, services in a steady-
               | state (i.e. not responding to changes via API calls) can
               | still rely on IAM roles using their cached information.
               | For example, in the recent us-east-1 outage when you
               | couldn't start new services because IAM checks would
               | fail, existing EC2 instances that rely on IAM instance
               | profiles to access services like S3 could still do so
               | even though IAM was down.
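                | 
                | You can see that data plane in action from any
                | instance: the role credentials are served from the
                | local instance metadata service, with no IAM control-
                | plane call involved (role name illustrative):
                | 
                |     # IMDSv2: get a session token, then read the
                |     # cached role credentials
                |     TOKEN=$(curl -s -X PUT \
                |       "http://169.254.169.254/latest/api/token" \
                |       -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
                |     curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
                |       http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role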
        
               | spmurrayzzz wrote:
               | I was gonna respond with the same commentary here. That
               | has been my experience beyond just IAM controls and why I
               | advocate for passive systems for critical workloads.
               | 
                | Sometimes this _can_ be costly. For example, with
                | something like autoscaling, that's an active system
                | I've seen fail when seemingly unrelated systems are
                | failing. The result is scaling out systems
                | intentionally ahead of time to deal with
                | oversubscription or burst traffic, which can leave you
                | with (costly) idle compute.
               | 
               | I don't mind this tradeoff personally, but can understand
               | that budget constraints are going to be different org to
               | org.
        
               | 0xbadcafebee wrote:
               | It's not an AWS incentive thing really, it's a
               | developer/consumer incentive thing.
               | 
               | It's like the duality of modular code. If you want to
               | manage one change in a lot of places, it's _easiest_ to
               | change it in the one module that everything else sources.
               | But that means that one change to that module can take
               | down everything. The alternative where you copy+paste the
               | same change everywhere is the most resilient to failure,
               | but also the most difficult and expensive.
               | 
               | AWS provides a lot of modular, dynamic things because
               | that's what their customers want to use. But using each
               | of those things increases the probability of failure.
               | It's up to the customer to decide how they want to design
               | their system using the components available.... and the
               | customers always chose the easy path rather than the
               | resilient path.
               | 
                | The great thing is that with AWS, at least you have the
                | _option_ to design a super freaking reliable system.
                | But ultimately there's no way to make it easy, short of
                | a sort of "Heroku for super reliable systems". (I know
                | there are a few, but I don't know anything about them.)
        
           | Hippocrates wrote:
            | Yeah, my thought is not specific to this scenario. Indeed,
            | multi-AZ is low cost and probably a good idea because you
            | often have shared service management, a shared control
            | plane, and cheap bandwidth between things. Of course, when
            | things fail they often ripple, as may be the case here.
           | clouds have their blast radius perfectly contained and they
           | certainly don't communicate those details well.
           | 
            | One incident I recall involved our GCP regional storage
            | buckets, which we were using to achieve multi-region
            | redundancy. One day, both regions went down simultaneously.
            | Google told us that the data was safe, but the control
            | plane and API for the service are global. Now I always
            | wonder, when I read about MR, what that actually means...
        
         | indigomm wrote:
         | > I doubt it, since outages and their effects are nuanced.
         | 
         | Your point here deserves highlighting. A failure such as a zone
         | failing is nowadays a relatively simple problem to have. But
         | cloud services do have bugs, internal limits or partial
         | failures that are much more complex. They often require support
         | assistance, which is where the expertise of their staff comes
         | into play. Having a single provider that you know well and
         | trust is better than having multiple providers where you need
         | to keep track of disparate issues.
        
       | dolibasija wrote:
       | One of our EC2 instances in us-east-1c is unavailable and stuck
       | in "stopping" state after a force stop. Interestingly enough, EC2
       | instances in us-east-1b don't seem to be affected.
       | 
       | The console is throwing errors from time to time. As usual no
       | information on AWS status page.
        
         | crescentfresh wrote:
         | The affected zone is use1-az4. Whatever that maps to (1a, 1b,
         | 1c) is different per customer.
        
           | benedikt wrote:
            | you can find out which zone is mapped to use1-az4 for your
            | account with awscli:
            | 
            |     aws ec2 describe-availability-zones | jq -r \
            |       '.AvailabilityZones[] | select(.ZoneId == "use1-az4") | .ZoneName'
        
             | mnordhoff wrote:
             | Or if you open the EC2 console (it's up this time!) and
             | scroll down to the bottom.
             | 
             | https://console.aws.amazon.com/ec2/v2/home?region=us-
             | east-1#...:
             | 
             | (Edit: I hope I didn't sound sarcastic. I don't open random
             | console pages and scroll all the way down to check for new
             | features. Some people will have noticed, some won't.)
        
               | [deleted]
        
         | JshWright wrote:
         | Instances stuck in the "stopping" state is pretty common, in my
         | experience.
        
         | crescentfresh wrote:
         | Was stuck on stopping in us-east-1b. Cannot start now.
        
         | 300bps wrote:
         | The 1c part is meaningless. Those letters are randomized per
         | customer to prevent letter biases from leading to more people
         | in 1a for instance.
        
         | chrishynes wrote:
          | I had the same "unavailable" issue, but on an instance in
          | us-east-1b. I finally got the force stop to go through a
          | minute ago, and it's now running and available again.
        
           | mike-cardwell wrote:
            | Your us-east-1b may be the parent's us-east-1c.
            | 
            | The letters are randomised per AWS account so that
            | instances are spread evenly and biases towards certain
            | letters don't lead to biases towards certain zones.
        
             | chrishynes wrote:
             | Huh, that's interesting. Didn't know that, but makes sense.
        
               | ciceryadam wrote:
                | You can check which zone ID maps to which zone name
                | for your account with:
                | 
                |     aws ec2 describe-availability-zones --region us-east-1
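                | 
                | Or, as a minimal sketch to see the whole mapping at a
                | glance (the --query path is JMESPath over the
                | documented output shape):
                | 
                |     aws ec2 describe-availability-zones --region us-east-1 \
                |       --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
                |       --output table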
        
               | thrtythreeforty wrote:
               | It's pretty cool. If I recall, they call it "shuffle
               | sharding."
        
           | throwaway984393 wrote:
           | I'm not sure if we should say "AWS is down" if only us-east-1
           | is down. That region is more unstable than Marjorie Taylor
           | Greene on a one-legged stool.
        
             | CubsFan1060 wrote:
             | And only one AZ in us-east-1. But... it's clearly having a
             | large impact as well.
        
             | fivea wrote:
             | > I'm not sure if we should say "AWS is down" if only us-
             | east-1 is down.
             | 
             | The thing is, us-east-1 represents the whole AWS for the
             | majority of us.
        
               | flatiron wrote:
                | Can you expand on that? What do you use in us-east-1
                | that isn't available everywhere else, such that it's
                | your whole implementation?
        
               | fivea wrote:
                | > Can you expand on that? What do you use in us-east-1
                | that isn't available everywhere else, such that it's
                | your whole implementation?
               | 
                | Your question reads as a strawman. It hardly matters
                | that EC2 is also available in Mumbai or Hong Kong if,
                | by default, the whole world deploys everything and
                | anything to us-east-1, and us-east-1 alone.
               | 
               | https://www.reddit.com/r/aws/comments/nztxa5/why_useast1_
               | reg...
        
               | throwaway984393 wrote:
                | It's not a strawman. There's a huge difference between
                | "AWS is down" and "customers don't know how to use AWS".
                | People who use AWS correctly only saw some degraded
                | service, not downtime.
        
               | manquer wrote:
                | There are many AWS services that have only global
                | endpoints rather than geo-specific ones, and all of
                | these are hosted in us-east-1.
        
       | pkulak wrote:
       | I used to think it was silly to have your own hardware (like a
       | NAS) in your house. What makes you think you can do it better
       | than AWS?
       | 
       | Santa is bringing me a Synology in three days.
        
         | darkstar999 wrote:
         | Why not both? I just got a Synology NAS and it makes cloud sync
         | dead simple. Now the most important things are on my PC,
         | mirrored on 2 drives in my NAS, and on AWS S3 (or any other
         | cloud storage).
        
           | pkulak wrote:
           | Oh yeah. My plan is to migrate everything to the NAS, then
           | have that back up to Glacier and/or Rsync.net. By S3, do you
           | mean Glacier?
        
             | darkstar999 wrote:
             | I have some in glacier, some in Infrequent Access.
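              | 
              | A minimal sketch of targeting those tiers from the CLI
              | (bucket name and files are placeholders):
              | 
              |     aws s3 cp backup.tar s3://my-backups/ --storage-class GLACIER
              |     aws s3 cp photos.tar s3://my-backups/ --storage-class STANDARD_IA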
        
       | lukeqsee wrote:
       | I can't get to the console either, receiving a "Temporarily
       | unavailable" notice without branding.
        
       | anshumankmr wrote:
       | If AWS, GCP and Azure go down, we will be back in the stone ages,
       | right?
        
         | dijit wrote:
          | Most things you'd hope would keep working will probably
          | depend on AWS in some form.
          | 
          | The exceptions belong to the people who never took the "if
          | AWS goes down then lots of people will have a problem, so
          | we'll be fine" line seriously; there are few such cases.
        
       | [deleted]
        
       | vegai_ wrote:
       | 5ish years ago it was common knowledge that us-east-1 is
       | generally the worst place to put anything that needs to be
       | reliable. I guess this is still true?
        
         | thow-58d4e8b wrote:
         | Unfortunately, the fact that us-east-1 is roughly 10% cheaper
         | than other regions usually overrides any other concerns
        
         | beermonster wrote:
         | us-east-1 seems to be AWS's not so well kept little dark
         | secret!
         | 
         | In all seriousness though - even non-regional AWS services seem
         | to have ties to us-east-1 as evidenced by the recent outages.
         | So you might be impacted even if it looks like (on paper at
         | least) you're not using any services tied to that region.
        
         | taf2 wrote:
          | I don't know about that. It was more like common knowledge
          | that one availability zone in us-east-1 was a problem - you
          | usually had to figure out which one by spinning up instances
          | in all 4 zones (now 6)... and that it was the largest of all
          | regions, making it an ideal place to put your service if you
          | wanted to be close to other vendors/partners in AWS...
        
       | rswail wrote:
        | So why are people not migrating out of us-east-1? Operating in
        | ap-southeast, we weren't that affected by the us-east-1
        | downtime, although our system is reasonably static and doesn't
        | make lots of IAM calls (which seem to be a large SPOF rooted in
        | us-east-1).
        
         | taf2 wrote:
         | latency. us-east-1 is positioned very nicely relative to many
         | large businesses in North America and Europe. This gives you
         | pretty good access to a very large percentage of the economies
         | of the world with good latency... while not requiring you to
         | architect your application around multiple regions...
        
         | dijit wrote:
         | Some "global" systems run in us-east1 even if you're not hosted
         | there a service you depend on might be.
         | 
         | Notably: cognito, r53 and the default web UI. (You can work
         | around the webui one I'm told, by passing a different domain
         | instead of just console.aws.amazon.com)
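          | 
          | (Presumably that means the region-scoped console domains,
          | e.g. us-west-2.console.aws.amazon.com, though I haven't
          | verified it.)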
        
           | watermelon0 wrote:
           | Don't forget about CloudFront, which can only be configured
           | via us-east-1.
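            | 
            | For example, the TLS certificate for a CloudFront
            | distribution has to come from ACM in us-east-1, wherever
            | the rest of your stack lives (domain name below is a
            | placeholder):
            | 
            |     aws acm request-certificate \
            |       --domain-name example.com \
            |       --validation-method DNS \
            |       --region us-east-1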
        
       | gtsop wrote:
        | Question to the sysadmins here: Is it really that outrageous
        | for Amazon to have such issues, or are people way too spoiled
        | to appreciate the effort that goes into maintaining such a
        | service?
        | 
        | Edit: Not supporting Amazon, I generally dislike the company. I
        | just don't understand the extent to which the criticism is
        | justified
        
         | dsr_ wrote:
         | The issue is in three parts:
         | 
         | 1. Did AMZN build an appropriate architecture?
         | 
         | 2. Did AMZN properly represent that architecture in both
         | documentation and sales efforts?
         | 
         | 3. What the heck is going on with AMZN?
         | 
         | Let's say that they build an environment in which power is not
         | fully redundant and tested at the rack level, but is fully
         | redundant and tested across multiple availability zones. Did
         | they then issue statements of reliability to their prospective
         | and existing customers saying that a single availability zone
         | does not have redundant power, and customers must duplicate
         | functionality in at least 2 AZs to survive a SPOF?
        
       | pawelduda wrote:
       | Bitbucket is affected, pages randomly take forever to load or
       | return 500
        
         | Pandabob wrote:
         | Yep, just botched a merge likely because of this.
        
         | el_duderino wrote:
         | Bitbucket just completed their migration to AWS too. Rough
         | start.
        
       | sprite wrote:
        | My Elastic Beanstalk instances are completely unreachable. It
        | seems at the very least ELB is down. Looking at Down Detector,
        | this is taking a bunch of sites down with it. As usual, the AWS
        | status page shows all green.
        
       | quantumfissure wrote:
        | Me: _Hesitation at last job moving absolutely everything
        | (including backups) to AWS because if it goes down it's a
        | problem_ I'm a firm believer in _some kind of_ physical/easily
        | accessible backup.
       | 
       | Coworkers: "You're an f'n idiot. Amazon and Facebook don't go
       | down, you're holding us back!" <-Quite literally their words.
       | 
       | Me: _leaves cause that treatment was the final straw_
       | 
        | Amazon and Facebook both go down within a month of each other,
        | and suddenly they needed backups
       | 
       | Them: _shocked pikachu face_
        
         | numbsafari wrote:
          | Today's gentle reminder: things other than network or
          | service outages can and do occur that might necessitate an
          | outside backup.
         | 
         | What happens if AWS or [insert other megacloud] decides your
         | account needs to be nuked from orbit due to a hack or some
         | other confusion? We almost had this happen over the summer
         | because of a problem with our bank's ability to process ACH
         | payments. Very frustrating experience. Still isn't fully
         | resolved.
         | 
         | What happens if an admin account is taken over and your account
         | gets screwed up?
         | 
         | What happens if an admin loses his shit and blows up your
         | account?
         | 
         | What happens if your software has a bug that destroys a bunch
         | of your data or fubars your account?
         | 
         | There's a ton of cases where having at least a simple replica
         | of your S3 buckets into a third-party cloud could prove
         | _highly_ valuable.
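          | 
          | As a minimal sketch of that last point (bucket names, local
          | path, endpoint and profile are all placeholders; Wasabi is
          | just one S3-compatible example):
          | 
          |     aws s3 sync s3://my-prod-bucket /mnt/backup/my-prod-bucket
          |     aws s3 sync /mnt/backup/my-prod-bucket s3://my-replica-bucket \
          |       --endpoint-url https://s3.wasabisys.com --profile wasabi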
        
           | btown wrote:
           | Would you be able to expand at all about the ACH/AWS
           | connection, obviously without identifying details?
           | 
           | Was it just a miscommunication around AWS billing and them
           | thinking you weren't paying? Or did AWS somehow put itself in
           | the middle of, or react to, your use of ACH payment
           | processing for *non-AWS* receivables or payables?
           | 
            | If the latter, that's a business risk I'd never even thought
            | about. I'm not even sure how they'd know. But I'm mindful
            | that things like the MATCH list [0] exist, and how easily a
            | merchant can accidentally wind up on these lists from either
            | human error or a small number of high-value chargebacks. If
           | cloud providers are somehow paying attention to merchant
           | services reputation, that would be very scary for many
           | businesses!
           | 
           | [0] https://www.merchantmaverick.com/learning-terminated-
           | merchan...
        
             | numbsafari wrote:
             | Like most of these things, it was a series of unfortunate
             | events.
             | 
             | In our case {LargeCloud} acquired {SaaSVendor}. We were
             | already using {LargeCloud}, with an existing billing
             | arrangement. When {LargeCloud} got around to integrating
             | the {SaaSVendor} into their billing system, it exposed
             | multiple bugs in {LargeCloud}'s billing system, and
             | ultimately limitations in our bank's internal systems--a
             | well known establishment and it would blow your mind to
             | learn how much manual crap they do.
             | 
             | Traditionally, we received favor from {SaaSVendor} through
              | Invoices. But when {SaaSVendor} was subsumed by
             | {LargeCloud}, we stopped receiving invoices. Our internal
             | ops reached out to {LargeCloud} about this two days before
             | we got our first "You will experience Dire Consequences"
              | email from {LargeCloud}'s Robot Overlords. Our attempts to
              | contact {LargeCloud} regarding this concerning message were
              | always routed to a Robot Overlord who only spoke in tongues
              | and could not solve our problems. Eventually, we were able
              | to get the Robot Overlord to escalate us to a Robot
              | Superlord
             | that would only tell us to "follow the instructions in this
             | handy dandy web page thing", except following the
             | instructions always summoned a "Server 500" Demon, which
              | {LargeCloud} claimed was impossible because their Robots
             | are Divine and Holy.
             | 
             | Finally circling back through random Human Actors we were
             | able to avert the countdown to destruction. Some Robot
             | Necromancer was able to resurrect our billing account from
             | the "Server 500" Demon, but we would now need to setup
             | automatic ACH payments, as whatever fix was implemented
             | could only persist with regular monthly succor upon the
             | alters of the Federal Reserve Automated Clearing
             | WaffleHouse. Invoices, payments arranged through Our Lady
             | of Visa and The Master Card would no longer suffice.
             | 
             | We believed we had made the appropriate incantations before
             | FratBoy 3000 at our local branch of the Federal Reserve
             | Chapel. However, we eventually received another threat of
             | Dire Consequences from {LargeCloud}, indicating that our
             | prayers were not received. It took significant supplication
             | in order to get FratBoy 3000 to confirm that our Federal
             | Reserve Chapel had misrouted our prayers, deducting them
             | from our account, but sending them to the wrong Demon,
             | through no fault of our own.
             | 
             | The whole time this was going on, we kept getting threats
             | of Dire Consequences. We were told by Human Actors to have
              | great faith, that the {LargeCloud} Robot Overlords had
             | been placated through their secret prostrations. FratBoy
             | 3000 was replaced by our Federal Reserve Chaplain, who
             | informed us that they had no robots, this was all the
             | result of Human Actor failures, but that, forthwith, all of
             | our prayers could be answered if we moved all of our faith
             | into a New Account which itself required additional monthly
             | supplication, but would ensure divine routing of our
             | prayers would always be successful.
             | 
             | To this day, we continue to make our monthly pilgrimage to
             | our local Federal Reserve Chapel, supplicating upon all
             | necessary altars. The threats of Dire Consequences from
             | {LargeCloud} have subsided. But we have cast ourselves out
             | onto the trail, seeking refuge from a more receptive and
             | responsive Federal Reserve Chapel.
             | 
             | Everybody focuses on "what if us-east-X goes down", but,
             | literally, sometimes it's a combination of billing and
             | payment issues that can keep you up at night.
        
           | hinkley wrote:
           | I would make a friendly wager that AWS user IDs don't contain
            | check digits, let alone bulletproof ones (simple check
            | digits don't guard against transposition errors). And that
           | somewhere, someone can manually enter an account to delete,
           | and that one of us will eventually have an account numbered
           | XXX1234 and some idiot with account XXX1243 will legitimately
           | earn an account deletion, but we'll be the ones who wake up
           | to bad news.
        
         | mattl wrote:
         | Backup to rsync.net
        
         | [deleted]
        
         | dookahku wrote:
          | Send your former colleagues a group email asking how it's going
        
         | kburman wrote:
          | AWS, Google, or any other reputable cloud provider is still a
          | far better option than your local backup. The only way I see
          | you losing your data is your account getting locked.
        
         | fatnoah wrote:
         | My last startup migrated from Verizon Terremark after the
         | healthcare.gov fiasco several years ago. We also suffered from
         | that massive outage and that was the final straw in migrating
         | to AWS.
         | 
          | At AWS, we built a few layers of redundant infrastructure
          | with multi-AZ availability within a region and then global
          | availability across multiple regions. All this was done at
          | roughly half the cost of the traditional hosting, even when
          | including the additional person-hours required to maintain it
          | on our end.
          | 
          | Keeping our infra simple helped that work, and it's literally
          | been years since we've had an outage caused by any AWS issue,
          | even though there have been several large AWS events.
        
           | zymhan wrote:
            | Indeed, if you only deploy resources in us-east-1, or any
            | other single region, you're risking the occasional downtime.
           | 
           | I'd wager that will still give you more uptime than a
           | physically-hosted solution for the same cost.
        
             | hinkley wrote:
             | Honestly, I have an app in production that isn't completely
             | hardened against single zone outages. There was pressure to
             | turn off some redundancy in our caching infra, and not
             | every backend service we call is free of tenant affinity,
             | so we could well lose at least 1/3rd of our customers in a
             | single AZ failure in the wrong region, or have huge latency
             | issues for all of our tenants based on high cache miss
             | rates.
             | 
             | Having written this, I'm going to ping our SME on the cache
             | replication and remind him that since the last time he
             | benchmarked it, we've upgraded to a newer generation of EC2
             | instances that has lower latency, and could he please run
             | those numbers again.
        
           | hinkley wrote:
            | Every time one of these conversations happens I end up
            | thinking to myself that Oxide Computer needs three more
            | competitors and a big pile of money.
           | 
           | AWS maintains a fiction of turnkey infrastructure, and the
           | reality of building your own is so starkly different that I
           | haven't seen an IT group for some time that could
           | successfully push back on these sorts of discussions.
           | 
           | Building your own datacenter is still too much like
           | maintaining a muscle car, fiddly bits and grease under your
           | fingernails all the time, meanwhile the world has moved on,
           | and we now have several options in soccer mom EVs that can
           | challenge a classic Corvette in the quarter mile, and
           | obliterate its 0-60-0 time. There is no Hyundai for the
           | operations people, and there should be.
           | 
           | I don't know the physics of shipping such a thing, but I
           | think we really do need to be able to buy a populated and
           | pre-wired rack and slot it into the data center. Literally
           | slot it in. If you've ever been curious about maritime
           | shipping, you know that they have a system for securing
           | containers to cranes, trailers, each other, and I don't see a
           | reason you couldn't steal that same design for mounting a
           | server rack to the floor. Other than the pins would need to
           | be removable (eg, a bolt that screws into a threaded hole in
           | the floor) so you don't trip on them.
           | 
           | In a word, we need to make physical servers fungible. There
           | are any number of things that we need to do to get there, but
           | I think we can. Honestly I'm surprised we haven't heard more
           | of this sort of talk from Dell, especially after they bought
           | VMWare. This just seems like a huge failure of imagination.
           | Or maybe it's simply a revolution lacking a poster child. At
           | this rate that 'child' has already been born, and we are just
           | waiting to see who it is.
        
             | jeremyjh wrote:
             | I don't think putting the hardware into the rack is really
             | the sticking point; what people like about the cloud is
             | that it abstracts all kinds of details away for them and
             | provides a cohesive system to manage it. AWS, Azure and
             | Google are actually selling something like what you are
             | talking about now [1], where for whatever
             | legal/legacy/performance reason you need it on-prem but
             | still want to pay AWS 5x the cost just to give you the same
             | management interface, and they have some kind of pod they
             | slap into your data-center.
             | 
              | What does it tell you that there is a market for this,
              | where essentially what you are buying from them is a
              | management and control plane, when other companies like BMC
              | have been selling that as a standalone product for decades
              | (and for the most part failing to live up to their
              | customers' actual expectations)?
             | 
             | [1] https://www.bizety.com/2020/06/28/aws-outposts-google-
             | anthos...
             | 
             | edit: I actually think a big pull of the cloud is also
             | about shutting down archaic internal IT organizations that
             | have been slowing people down so that it takes weeks and
             | weeks to launch a simple new webservice. Better to give
             | your programmers a cloud account and let them get shit
             | done.
        
         | jmartrican wrote:
          | Seems like a multi-cloud solution might be the way to go.
        
           | thedougd wrote:
           | I doubt it. The complexity of multi-cloud will also give you
           | downtime.
           | 
           | Most of the folks impacted by cloud outages do not have
           | highly available systems in place. Perhaps, for their
           | business, the cost doesn't justify the outcome.
           | 
            | If you need high uptime for instances, build your system to
            | be highly available and leverage the fault domain constructs
            | your provider offers (placement groups, availability zones,
            | regions, load balancing, DNS routing, autoscaling groups,
            | service discovery, etc). Double down and use spot instances
            | and maximum lifetimes in your groups so that you're
            | continuously validating your application can recover from
            | instance interruptions.
            | 
            | If you're heavy on applications that leverage cloud APIs,
            | as is often the case with lambdas, then strongly consider
            | multi-region active/active, as API outages tend to cross
            | AZs and impact the entire region.
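            | 
            | A minimal sketch of the DNS-failover piece (the zone ID,
            | record name, health check ID and IP are all placeholders):
            | 
            |     aws route53 change-resource-record-sets \
            |       --hosted-zone-id Z0000000000000 \
            |       --change-batch '{"Changes":[{"Action":"UPSERT",
            |         "ResourceRecordSet":{"Name":"app.example.com","Type":"A",
            |         "SetIdentifier":"primary","Failover":"PRIMARY","TTL":60,
            |         "HealthCheckId":"11111111-1111-1111-1111-111111111111",
            |         "ResourceRecords":[{"Value":"192.0.2.10"}]}}]}'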
        
             | jmartrican wrote:
              | Agreed, it is hard for the reasons you specified.
              | 
              | To do it, first I would not use any cloud features that
              | cannot be easily set up in another cloud. So no lambdas.
              | Just k8s clusters, and maybe DBs if they can be set up to
              | back up between clouds. I was able to migrate from AWS k8s
              | to DO k8s very easily... just pointed my k8s configs to
              | the new cluster (plus configuring the DO load balancers).
              | 
              | In my case, I still need the dynamic DNS (haven't looked
              | into it yet), auto-scaling is already set up with k8s, and
              | the DB backups between the DBs (next project).
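              | 
              | A sketch of how small that switch can be in practice
              | (context name and manifest path are placeholders):
              | 
              |     kubectl config use-context do-nyc1-backup
              |     kubectl apply -f k8s/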
        
           | nier wrote:
           | All while making sure that these cloud solutions are not
           | inter-dependent and that there are redundant paths to access
           | these services.
        
         | lmilcin wrote:
         | Think about it this way:
         | 
         | 1) Can you make your on prem infrastructure go down less than
         | Amazon's?
         | 
         | 2) Is it worth it?
         | 
         | In my experience most people grossly underestimate how
         | expensive it is to create reliable infrastructure and at the
         | same time overestimate how important it is for their services
         | to run uninterrupted.
         | 
         | --
         | 
          | EDIT: I am not arguing you shouldn't build your own, more
          | reliable infrastructure. AWS is just a point on a spectrum of
          | possible compromises between cost and reliability. It might
          | not be right for you. If it is too expensive -- go for
          | cheaper options with less reliability.
         | 
          | If it is too unreliable -- go build your own, but make sure
          | you are not making a huge mistake, because you may not
          | understand what it actually costs to build to AWS's level.
         | 
         | For example, personally, not having to focus on infra
         | reliability makes it possible for me to focus on other things
         | that are more important to my company. Do I care about outages?
          | Of course I do, but I understand that doing this better than
          | AWS does would cost me a huge amount of focus on something
          | that is not the core goal of what we are doing. I would
          | rather spend that time thinking about how to hire/retain
          | better people and how to make my product better.
         | 
          | And adding all the complexity of running this infra would
          | make the entire organisation less flexible, which is also a
          | cost.
         | 
          | So you can't look at the cost of running the infra like a
          | bill of materials for parts and services.
         | 
          | And if there is an outage, it is good to know there is a huge
          | organisation trying to fix it while my small organisation can
          | focus on preparing for what to do when it comes back up.
        
           | Retric wrote:
            | It really depends on how reliable you need to be. Don't
            | forget you get downtime from both AWS and your own issues,
            | so even 4 9's is off the table with pure AWS. If you _need_
            | to be more reliable than AWS, you need to run a hybrid
            | inside and outside of AWS, which means most of the
            | advantages of running on AWS go away.
        
             | nostrebored wrote:
             | Very untrue. Many businesses with 4 9 SLAs are all in on
             | AWS. It requires active/active setups though!
        
               | Retric wrote:
                | Many businesses claim 4 9 SLAs on AWS, but that doesn't
                | mean they actually provide it. It's simply a question of
                | what the penalty for failing to meet their SLA is.
        
           | pkulak wrote:
           | > Can you make your on prem infrastructure go down less than
           | Amazon's?
           | 
           | Over the last two years, my track record has destroyed AWS.
           | I've got a single Mac Mini with two VMs on it, plugged in to
           | a UPS with enough power to keep it running for about three
           | hours. It's never had a second of unplanned downtime.
           | 
           | About 15 years ago I got sick of maintaining my own stuff. I
           | stopped building Linux desktops and bought an Apple laptop. I
           | moved my email, calendars, contacts, chat, photos, etc, to
           | Google. But lately I've swung 180 degrees and have been
           | undoing all those decisions. It's not as much of a PITA as I
           | remember. Maybe I'm better at it now? Or maybe it will become
           | a PITA and I'll swing right back.
           | 
           | EDIT: I realize you're talking in a commercial sense and I'm
           | talking about a homelab sense. Still, take my anecdote for
           | what it's worth. :D
        
           | StreamBright wrote:
           | 3) Could you hire talent that can build the thing?
           | 
           | In my experience problem number 3 is the hardest to solve.
        
           | jtc331 wrote:
           | You're missing a huge factor: agency.
        
           | autosharp wrote:
           | Also, you can just take two different amazon regions and hope
           | they don't both go down at the same time.
           | 
           | For extra safety, and extra work, you could even take Azure
           | as a backup if you're not locked in with AWS.
        
             | dijit wrote:
              | forgive me repeating myself: AWS regions are not truly
              | independent of each other.
             | 
             | Global services such as route53, Cognito, the default cloud
             | console and Cloudfront are managed out of US-East-1.
             | 
             | If us-east-1 is unavailable, as is commonly the case, and
             | you depend on those systems, you are also down.
             | 
             | it does not matter if you're in timbuktu-1, you are dead in
             | the water.
             | 
              | it is a myth that amazon regions are truly independent.
             | 
             | please stop blaming the victim, because you can do
             | everything right and still fail if you are not aware of
             | this; and you are perpetuating that unawareness.
        
               | autosharp wrote:
               | Of course that depends on what services you use and yes,
               | even then there is some remaining correlation just
               | because it is the same host.
               | 
               | > are not truly independent of each other
               | 
               | Indeed. They are even on the same planet!
               | 
               | > please stop blaming the victim
               | 
               | Excuse me?
        
               | dijit wrote:
               | >> are not truly independent of each other
               | 
               | > Indeed. They are even on the same planet!
               | 
               | Clever bastard, aren't you.
               | 
               | >> please stop blaming the victim
               | 
               | > Excuse me?
               | 
               | "If you're affected by us-east-1 outages then you're not
               | hosting in other regions and you're doing it wrong".
               | 
               | Except: You can be affected by this outage if you did
               | everything right. You're putting blame on people being
               | down for not being hosted in different regions when it
               | would not help them. You've effectively shifted blame
               | away from Amazon and onto the person who cannot control
               | their uptime by doing what you said.
        
               | autosharp wrote:
               | > "If you're affected by us-east-1 outages then you're
               | not hosting in other regions and you're doing it wrong".
               | 
               | You are attributing a quote to me which I never
               | expressed, nor was that expressed elsewhere in this
               | thread. You are even using quotation marks....
               | 
               | I certainly didn't mean to blame anyone. You appear to
               | see this AWS issue as one of victims and victimizers. I
               | was just trying to point out an agency that people may
               | have in some situations.
        
               | dijit wrote:
               | I was not quoting you, I was echoing your sentiment back
               | to you with different words to see if you disagreed with
               | it.
               | 
               | I was just re-wording the sentiment.
               | 
               | Let me quote you properly.
               | 
               | > Also, you can just take two different amazon regions
               | and hope they don't both go down at the same time.
               | 
               | Do you see how replacing that in my comments does not
               | change the sentiment?
        
           | Nextgrid wrote:
           | > Can you make your on prem infrastructure go down less than
           | Amazon's?
           | 
           | Obviously depends on what you need, but for a small to medium
           | web app that needs a load-balancer, a few app servers, a
           | database and a cache, yes absolutely - all of these have been
           | solved problems for over a decade and aren't rocket science
           | to install & maintain.
           | 
           | > Is it worth it?
           | 
           | I'd argue that the "worth" would be less about immunity to
           | occasional outages but the continuous savings when it comes
           | to price per performance & not having to pay for bandwidth.
           | 
           | > overestimate how important it is for their services to run
           | uninterrupted.
           | 
           | Agreed. However when running on-prem, should your service go
           | down and you need it back up, you can do something about it.
           | With the cloud, you have no choice but to wait.
        
             | laumars wrote:
              | I have run high availability (HA) systems on prem and your
              | statement vastly understates the difficulty and expense.
             | 
              | You need multiple physical links running to different
              | ISPs, because builders working on properties further down
              | the street could accidentally cut through your fibre. Or
              | the ISP themselves could suffer an outage.
             | 
             | You need a back up generator and to be a short distance
             | away from a petrol station so you can refuel quickly and
             | regularly when suffering from longer durations of power
             | outages. You absolutely do not want to run out of diesel!
             | 
             | You need redundancy of every piece of hardware AND you need
             | to test that failover works as expected because the last
             | thing you need is a core switch to fail and traffic not to
             | route over secondary core switch like expected.
             | 
              | You need multiple air con units, powered off different
              | mains inputs so that if the electrics fail on one unit it
              | doesn't take out the others. I guarantee you that if the
              | air con fails, it will be on the hottest day of the year,
              | and no amount of portable units will stop your servers
              | from overheating.
             | 
              | You need a beefy UPS with multiple batteries. Ideally
             | multiple UPSs with each UPS powering a different rail on
             | your racks so that if one UPS fails your hardware is still
             | powered from the other rail. And you need to regularly
             | check the battery status and loads on the UPS. Remember
             | that the back up generator takes a second or two to kick in
             | so you need something to keep the power to the servers and
             | networking hardware to be uninterrupted. And since all your
             | hardware is powered via the UPS, if that dies you still
             | lose power even if the building is powered.
             | 
              | And you then need to duplicate all of the above in a
              | second location, just in case the first location still
              | goes down.
             | 
             | By the way, all of the possible failure points I've raised
             | above HAVE failed on me when managing HA on prem.
             | 
             | The reason people move to the cloud for HA is because
             | rolling your own is like rolling your own encryption: it's
             | hard, error prone, expensive, and even when you have the
             | right people on the team there's still a good chance you'll
             | fuck it up. AWS, for all its faults, does make this side of
             | the job easier.
        
               | Nextgrid wrote:
               | That's true for _on-prem_ infrastructure, but is all
               | already handled for you if you rent servers from hosting
                | providers such as OVH/Hetzner or even rent colocation
               | space in an existing DC, and is _still_ cheaper than the
               | cloud equivalent (and as we saw recently, actually more
               | reliable as well).
        
               | nightpool wrote:
                | I've had way more networking and availability failures
                | from Hetzner _this year alone_ than I've ever seen from
                | AWS. They regularly replace their networking switches
                | without any redundancy, leaving entire DCs offline for
                | hours. They're okay for hobby projects, but I would never
                | host a business-critical site with them
        
               | nh2 wrote:
               | Cannot confirm, do you have details?
               | 
               | Yes, Hetzner upgrades DCs (datacenter buildings), but
               | they are the equivalent to AWS AZs (Availability Zones).
               | When they upgrade a DC, they notify way in advance, and
               | if you set up your services to span multiple DCs as is
               | recommended, it does not affect you.
               | 
               | We run high-availability Ceph, Postgres, and Consul,
               | across 3 Hetzner DCs, and have not had a Hetzner-induced
               | service downtime in the 5 years that we do so.
        
               | nightpool wrote:
               | That's fair enough, I was comparing single-AZ AWS outages
               | to single-DC Hetzner outages, since that seems to be what
               | people are focusing on. For multi-DC deployments, I think
               | laumars' sibling response to mine makes a much better
               | argument--ultimately, you're still choosing who to pay
               | and who to trust, and if things go down, there's nothing
               | you can do to fix it. "Low-tech" cloud providers like
               | Hetzner, Colo providers, amazon, PaaS--in a physical
               | downtime event like this one, they're all the same.
        
               | abujazar wrote:
               | @nostrebored Well, Hetzner never went down for the 7
               | years I managed a HA setup spanning three of their data
               | centers. One of the DCs was unavailable for a few hours
               | during a planned moving op, but we had no outages. None.
        
               | nostrebored wrote:
                | That's not what you've seen recently. When Hetzner goes
                | down nobody cares, because nobody with important
                | workloads and brain cells is running there.
               | 
               | Colo space assumes that the colo is operating more
               | efficiently than AWS/Azure/GCP when in reality you're
               | comparing apples and oranges.
        
               | laumars wrote:
               | But then you're still reliant on those hosting providers
               | not fscking up; just like with cloud providers. Literally
               | the same complaint the GP was making about AWS applies
               | for OVH et al too.
               | 
               | In fact I used to run some hobby projects in OVH (as an
               | aside, I really liked their services) so I'm aware that
               | they have their own failures too.
        
               | Nextgrid wrote:
               | Old-school hosting providers have a lot less moving parts
               | than cloud providers. They have their outages, but
               | they're usually less frequent.
        
               | laumars wrote:
               | Are they though? Let's look at what the recent AWS
               | outages have been: a single region (but AWS makes multi-
               | region easy). The biggest impact to most people is the
               | AWS console, something that one seldom actually needs
               | given AWS is API driven. If the same type of outage
               | happened on OVH then you'd lose KVM to your physical
               | servers. But you seldom need those either.
               | 
               | The Azure outage was just AD service but you can roll
               | your own there if you wanted.
               | 
               | Plus if you want to talk about SaaS then OVH et al have
               | their own SaaS too. In fact the difference between OVH
               | and AWS is more about scale than it is about reliability
               | (with AWS you can buy hardware and rack it in AWS just
               | like with OVH too).
               | 
               | Or maybe by "old skool" you mean the few independent
               | hosts that don't offer SaaS. However they're usually
               | pretty small fry and this outages are less likely to be
               | reported. Whereas any AWS service going down is massive
               | news.
               | 
               | I'm not a cloud-fanboy by any means (I actually find AWS
               | the least enjoyable to manage from a purely superficial
               | perspective) but I've worked across a number of different
               | hosting providers as well as building out HA systems on
               | prem and the anti-cloud sentiment here really misses the
               | pragmatic reality of things.
        
               | [deleted]
        
               | bcrosby95 wrote:
               | Most people using AWS aren't using multi-region, as
               | evidenced by the wide array of problems on the internet
               | when a region goes down.
               | 
               | I would also argue many aren't even using multiple
               | availability zones, as evidenced by the wide array of
               | problems on the internet when a single AZ goes down.
               | 
               | I think you're vastly over-estimating how most companies
               | are using AWS, and are substituting your own requirements
               | for theirs.
               | 
               | Which is very common in tech. It's part of why people
               | shit on cloud, microservices, and other techniques large
               | mega-corps use on HN. People write posts with lots of
               | assumptions and few details, then people that don't know
               | any better just carbon copy it because hey its what
               | Google does. Meanwhile their lambda microservice system
               | serving a blazing 60 requests per minute has more
               | downtime than if I just hosted it on my laptop with my
               | dialup internet connection.
        
               | laumars wrote:
                | I fully believe some people are doing AWS wrong. But you
                | cannot compare the worst offenders on AWS against the
                | best practitioners of on-prem deployment; it's just not a
                | fair like-for-like comparison, pitting the worst against
                | the best.
                | 
                | Hence why I compare doing HA in AWS correctly vs doing HA
                | on prem correctly.
        
               | [deleted]
        
               | H1Supreme wrote:
               | > You need a back up generator and to be a short distance
               | away from a petrol station
               | 
               | My building has a natural gas backup generator.
        
               | Johnny555 wrote:
               | Does it have its own gas well? My sister has a home
               | backup generator, they lost power during some cold snap
               | and some pumping component failed and her neighborhood
               | lost gas too. The only house in the neighborhood that had
               | heat/power had a big propane tank because it was built
               | before the neighborhood got gas.
               | 
               | I've never seen a data center with natural gas backup
               | power. But I don't know if that's because of reliability
               | or if it's too expensive for a big natural gas hookup
               | that's used rarely. Though I have heard of the opposite
               | -- using natural gas turbines as primary power and
               | utility power as backup.
        
               | tzs wrote:
               | > You need a back up generator and to be a short distance
               | away from a petrol station so you can refuel quickly and
               | regularly when suffering from longer durations of power
               | outages.
               | 
               | I don't see why the petrol station needs to be a short
               | distance away. Unless the plan is to walk to the petrol
               | station and back (which should not be the plan[1]),
               | anyplace within reasonable driving distance should do.
               | 
               | [1] long duration electrical outages will often take out
               | everything a short distance away, and the petrol stations
               | usually have electric pumps.
        
               | laumars wrote:
                | Because there are laws on what containers you can and
                | cannot fill with fuel. So you may find you have to make
                | smaller but more frequent visits.
                | 
                | Also, buying fuel from a petrol station is going to be
                | more expensive than having a commercial tanker refill
                | your storage. So ideally you wouldn't be making large
                | top-ups from the local petrol station except during
                | exceptional outages.
               | 
                | As for wider power outages affecting the fuel pumps, I
                | suspect they might have their own generators too. But
                | even if they don't, outages can still be localised (eg
                | road works accidentally cutting through the mains for
                | that street - I've had that happen before too). So
                | there's still a benefit in having a petrol station
                | nearby.
               | 
               | To be clear, I'm not suggesting those petrol stations
               | should be 5 minutes walking distance. Just close enough
               | to drive there and back in under half an hour.
        
               | vinay_ys wrote:
                | A typical multi-MW power-hungry high-tech facility
                | (datacenters, manufacturing, hospitals etc) will have
                | large underground fuel storage tanks, big enough to run
                | the full load on generators for a couple of days, and
                | they are continuously kept refilled via fuel tanker
                | trucks through contracts with bulk fuel distributors.
                | They usually have an SLA of a 40KL tanker on 4 hours'
                | notice. In case of advance warning of heavy rains,
                | floods or other natural disasters that can disrupt road
                | networks, they can have more fuel trucks situated
                | close by as stand-by. Depending on your contract, you may
                | have priority over other customers in the area. These
                | are fairly standard practices.
        
               | laumars wrote:
                | Indeed, but that wasn't the type of facility the GP was
                | talking about when they said running web services was a
                | solved problem.
                | 
                | If you do move to an established data centre then you're
                | back to my earlier point: you're still dependent on
                | their services instead of having the ownership to fix
                | all the problems yourself (which was the original
                | argument the GP made in favour of switching away from
                | the cloud).
        
               | sigstoat wrote:
               | > I don't see why the petrol station needs to be a short
               | distance away
               | 
               | some natural disasters can render driving trickier than
               | walking. extremely large snow storms, for instance. you
               | can still walk a block, but you might be hard pressed to
               | drive 5 miles.
               | 
               | (i don't have a bone in this particular cautiousness-
               | fight; personally i'd just suggest folks producing DR
               | plans cover the relevant natural disasters for the area
               | they live in, while balancing management desires, and a
               | realistic assessment of their own willingness to come to
               | work to execute a DR plan during a natural disaster.)
        
               | vinay_ys wrote:
                | It is much easier than you think. There are well-defined
                | standards, trained trades people, and a whole host of
                | companies who make great products and provide after-sales
                | service. Every major financial services, telecom and
                | high-precision manufacturing company runs its infra this
                | way. It is definitely less niche than rolling your own
                | encryption.
        
               | nightpool wrote:
               | financial services, telecom and high-precision
               | manufacturing companies
               | 
               | _One of these things is not like the other, one of these
               | things is not the same..._
               | 
               | What use does a CNC shop have for an extensive on-prem
               | multi-DC with failover and high availability? It'd be
               | like buying your own snowplows to make sure that the road
               | is clear so your employees can get to work. Maybe
               | necessary if you live in a place with very bad snowplows
               | and no existing infrastructure, but in most places, just
               | a waste of money.
        
               | laumars wrote:
               | My analogy wasn't saying it's niche. It was comparing the
               | difficulty. And yes, there are trained people (I'm one of
               | them ;) ) but that doesn't make it easy, cheap, nor even
               | less error prone than using cloud services. Which was my
               | point.
               | 
               | Also the reasons those companies usually run their own
               | infra is historically down to legislation more than
               | preference. At least that's been the case with almost all
               | of the companies I've built on prem HA systems for.
        
               | jrockway wrote:
               | > You need multiple physical links in running to
               | different ISPs because builders working on properties
               | further down the street could accidentally cut through
               | your fibre.
               | 
               | At my last job we provided redundant paths (including
               | entry to your building) as an add-on service. So you
               | might not need two ISPs if you're only worried about
               | fiber cuts. You could still be worried about things like
               | "we think all Juniper routers in the world will die at
               | the exact same instant", in which case you need to make
               | sure you pick an ISP that uses Cisco equipment. And of
               | course, it's possible that your ISP pushes a bad route
               | and breaks the entirety of their link to the rest of the
               | Internet.
        
               | phil21 wrote:
               | You really don't need almost any of this stuff. If you
               | have small on-prem needs just grab a couple fiber links,
               | try for diversity on paths for them (good luck), add some
               | power backup if it fits your needs, and be done.
               | 
               | If you are going to the level of the above, you go with
               | co-location in purpose built centers at a wholesale
               | level. The "layer1" is all done to the specs you state
               | and you don't have to worry about it.
               | 
               | On-prem rarely actually means physically on-prem at any
               | scale beyond a small IT office room. It means co-locating
               | in purpose built datacenters.
               | 
               | I'm sure examples exist, but the days of large corporate
               | datacenters are pretty much long over - just inertia
               | keeping the old ones going before they move to somewhere
               | like Equinix or DRT. With the wholesalers you can
               | basically design things to spec, and they build out
               | 10ksqft 2MW critical load room for you a few months
               | later.
               | 
               | A few organizations will find it worthwhile to continue
               | to build at this scale (e.g. Visa, the government) but
               | it's exceptionally small.
        
               | laumars wrote:
               | > You really don't need almost any of this stuff. If you
               | have small on-prem needs just grab a couple fiber links,
               | try for diversity on paths for them (good luck), add some
               | power backup if it fits your needs, and be done.
               | 
               | Then you're not running HA and thus the argument about
               | cloud downtime being "worse" than on prem is moot.
               | 
               | Obviously if your SLA is basically "we will do our best"
               | then there are all sorts of short cuts one can take. ;)
        
             | badams2527 wrote:
              | The human capital side would disagree with that, I think.
              | You're assuming the organization which owns this
              | small/medium web app already has the personnel on staff to
              | handle such a thing.
             | 
             | If you're outsourcing that, you'd likely have to pay a
             | boatload just for someone to be available for help, let
             | alone the actual tasks themselves. Like you said, if you're
             | on-prem and something goes down, you can do something. But
             | you've gotta have the personnel to actually do something.
             | 
             | That said, I think you're spot-on as long as you have the
             | skillset already.
        
               | Nextgrid wrote:
               | > Human capital side would disagree with that I think
               | 
               | I hear this argument a lot, but every startup I've been
               | involved with had a full-time DevOps engineer wrangling
               | Terraform & YAML files - that same engineer can be
               | assigned to manage the bare-metal infrastructure.
        
               | dragonwriter wrote:
               | > I hear this argument a lot, but every startup I've been
               | involved with had a full-time DevOps engineer wrangling
               | Terraform & YAML files - that same engineer can be
               | assigned to manage the bare-metal infrastructure.
               | 
               | Bare metal infrastructure requires a lot more management
               | at any given scale. I mean, you can run stuff that lets
               | you do part of the management the same as cloud
               | resources, but you also have to then manage _that_
               | software and manage the hardware.
        
               | bcrosby95 wrote:
               | Define "a lot".
               | 
               | We colocate about 20 servers and on the average month, no
               | one spends any time managing them. At all.
        
               | [deleted]
        
               | Retric wrote:
               | You still need to pay someone to manage AWS
               | infrastructure. It's possible to save money using AWS,
               | but things often get more expensive.
        
               | nostrebored wrote:
               | Of SMBs I've worked with, about 5% had a dedicated AWS
               | engineer
        
           | jerf wrote:
           | I think if you put a bit of effort into classifying
           | importance, you can likely justify backing up certain
           | critical systems in more than one way. Let "the cloud" handle
           | everyone's desktop backups and all the ancillary systems you
           | don't really _need_ immediately to do business, but certain
           | important systems should perhaps be backed up both to the
           | cloud _and_ locally, like Windows Domain Controllers and
            | other things you can't do anything without.
           | 
            | Backup is cheap when you're focused on what you're backing
           | up.
           | 
           | In this case, the game isn't "going down less than Amazon",
            | it's about going down _uncorrelated_ to Amazon. Though
            | that's getting harder!
           | 
           | "In more than one way" doesn't have to be _local_ , but it
           | may be across multiple cloud services. Still, "local" is nice
           | in that it doesn't require the Internet. ("The Internet"
           | doesn't tend to go down, but the portion you are on certainly
           | can.) Of course, as workers disperse, "local" means less and
           | less nowadays.
        
             | kaashif wrote:
             | > In this case, the game isn't "going down less than
             | Amazon", it's about going down uncorrelated to Amazon.
             | 
             | It's possible to go down in a mostly uncorrelated way to
             | Amazon by just being down all the time.
             | 
             | Obviously this is implicit in your comment, but I'll say it
             | anyway: your backups need to actually work when you need
              | them. You need to test them (_really_ test them) to make
              | sure they're not secretly non-functional in some subtle
              | way when Amazon is really down.
        
           | woodruffw wrote:
           | Not my company, but I work with another company that does
           | (nearly?) all of their infrastructure on premise. They have
            | pretty great uptime, in large part because they're not
           | dependent on the 3-4 global state mechanisms that
           | consistently cause outages with cloud providers (DNS, BGP,
           | AWS's role management/control plane, &c.).
           | 
           | I think you're right about what we over- & under-estimate,
           | but that we _also_ under-estimate the inflection point for
           | when it makes sense to begin relying on major cloud services.
           | Put another way: we over-estimate our requirements, causing
           | us to pessimistically reach for services that have problems
            | that we'd otherwise never have.
        
           | dgudkov wrote:
           | 1) Can you make your on prem infrastructure go down less than
           | Amazon's?
           | 
           | It's now hard to say how frequently Amazon's infrastructure
            | goes down. The incident rate seems to have increased.
        
           | patentatt wrote:
           | On the other hand, perhaps the large cloud providers bring a
           | level of complexity that outweighs their skills at keeping
           | everything up. What I mean is, a basic redundancy and
           | failover setup with two data centers is kind of
           | straightforward. Sure you need a person on call 24/7 to
           | oversee it, but it's conceptually not that complicated. And
           | if you're running bare metal, you get a surprising level of
           | performance per dollar and rack unit. On the other hand, the
           | big clouds are immensely complex with multiple layers of
           | software defined networking, millions of tenants, thousands
           | of employees, acres of floor space, org charts, etc. If
           | you're running your own infra as one competent sysadmin, you
           | know nobody else in another department will push a networking
           | code change that will break your shit in the middle of the
           | night. Maybe it's not right for everyone, but it's not
           | unreasonable to go on prem in 2021 despite the popular
           | opinions otherwise. Source: my company runs on prem and
           | routinely has 100% uptime years. Most unplanned downtime
           | occurs early on a Sunday morning following a planned action
           | during a maintenance window.
        
             | sgarland wrote:
             | I was and continue to be surprised how reliable even old
             | servers are. I run a small homelab (Debian VMs on Proxmox;
             | a Docker host, a jumpbox, a NAS running ZFS, etc.) on seven
             | year old hardware, and all of my problems are self-imposed.
             | If I leave everything alone, it just works.
             | 
             | As a counterpoint, though, my last place had a large Java
             | app, split between colo'd metal and AWS. Seemed like the
             | colo'd stuff failed more (bad RAM mostly, a few CPUs, and
             | an occasional PSU). Entirely anecdotal.
        
           | ocdtrekkie wrote:
           | My on prem infrastructure goes down drastically less than
           | Amazon's.
           | 
           | ...My home Internet even is scoring better than Amazon right
           | now, in fact. Yours probably is too.
        
             | lmilcin wrote:
             | I have a bolt lying on my desk.
             | 
             | It hasn't failed since 1970 when it was produced.
             | 
              | It must have been built better than the Space Shuttle,
              | then.
        
               | dijit wrote:
               | Sysadmin 102: Simple Systems Fail Less Often.
        
         | hinkley wrote:
         | Have you contacted them to see how things are going?
         | 
         | Maybe a cheery note asking how the team is doing, sent right in
         | the middle of an outage.
         | 
         | Passive aggressive? As hell. Cathartic? Damn skippy.
        
         | kalleth wrote:
         | I'd be surprised if they needed backups for a few hours of
         | downtime with (reportedly) complete recovery where no data was
         | corrupted. There are industries where this would be required,
         | and it's _possible_ I guess, but neither of these downtime
          | events were "data loss" events, just availability events for
         | short-ish periods of time that wouldn't - for me - result in
         | activating our DR plans.
         | 
         | I must admit that I do always try and maintain a separate data
         | backup for true disaster recovery scenarios - but those are
         | mainly focused around AWS locking me out of our AWS account
         | (and hence we can't access our data or backups) or recovering
         | from a crypto scam hack that also corrupts on-platform backups,
         | for example.
        
           | aeonflux wrote:
            | I once had to argue that we still do need backups even
            | though S3 has redundancy. They laughed when I mentioned a
            | possible lockout by AWS (even due to a mistake or
            | whatever). I asked what if we delete data from the app by
            | mistake? They told me we need to be careful not to do
            | that. I guess I am getting more and more tired of
            | arrogant 25-year-old programmers with 1-2 years in the
            | industry and no experience.
        
             | silon42 wrote:
              | Next time also mention that it might be a problem to get
              | a consistent backup across microservices...
        
             | sidpatil wrote:
              | > I asked what if we delete data from the app by
              | mistake? They told me we need to be careful not to do
              | that.
             | 
             | Ah, the Vinnie Boombatz treatment.
        
             | swid wrote:
             | One thing you should absolutely not count on, but might be
              | a course of action for large clients, is to contact
             | support and ask them to restore accidentally / maliciously
             | deleted files.
             | 
             | I would never use this as part of the backup and restore
             | plan; but I was lucky when a bunch of customer files were
             | deleted due to a bug in a release. Something like 100k
              | files were deleted from Google Cloud Storage without us
              | having a backup. In a panic we contacted GCP. We were
              | able to provide
             | a list of all the file names from our logs. In the end, all
             | but 6 files were recovered.
             | 
             | I think it took around 2-3 days to get all the files
             | restored, which was still a big headache and impactful to
             | people.
        
               | gime_tree_fiddy wrote:
               | This is not a reliable mechanism btw. There will be times
               | when they won't be able to restore the data for you.
               | Their product has options to avoid this situation like
               | object versioning.
        
             | smiths1999 wrote:
             | Maybe they are getting tired of arrogant older programmers
             | assuming they cannot possibly be wrong. God forbid a 25
             | year old might actually have a good idea (and I am far
             | removed from my 20s).
             | 
             | Maybe having S3 redundancy wasn't the most important thing
             | to be tackled? Does your company really need that
             | complexity? Are you so big and such an important service
             | that you cannot possibly risk going down or losing data?
        
               | javagram wrote:
               | Sounds like the kind of short-term thinking that leads to
               | companies being completely wiped out by ransomware. Who
               | needs backups anyway?
        
               | exdsq wrote:
               | I'd love to know what someone works on when the risk of
                | losing data is not worth one or two days of
                | engineering work.
        
               | lostcolony wrote:
               | But that's just it; you can't even have that discussion
               | if the response to "hey, should we be backing up beyond
               | S3 redundancy?" is "No. Why would we? S3 is infallible"
        
               | smiths1999 wrote:
               | Sure you can. As the experienced engineer in that setting
               | it is a great opportunity to teach the less experienced
               | engineers. For example, "I have seen data loss on S3 at
               | my last job. If X, Y, or Z happen then we will lose data.
               | Is this data we can lose? And actually, it is pretty easy
               | to replicate - I think we could get this done in a day or
               | two."
               | 
               | It's also possible the response was "That's an excellent
               | point! I think we should put that on the backlog. Since
               | this data is already a backup of our DB data, I think we
               | should focus on getting the feature out rather than
               | replicating to GCP."
               | 
               | Those are two plausible conversations. Instead, what we
               | have is "these arrogant 25 year olds that have 1-2 years
               | of experience and know it all." That's a red flag to me.
        
               | Beltiras wrote:
                | Losing data usually means losing customers -- usually
                | more customers than just those whose data you lost.
        
               | smiths1999 wrote:
               | I suppose the caveat is you have to have customers to
               | lose them :) We don't know what the data is or the size
               | of the company.
        
               | oblio wrote:
               | It's not a lot of complexity.
               | 
               | Add object versioning for your bucket (1 click) and
               | mirror/sync your bucket to another bucket (a few more
               | clicks).
               | 
               | Yes, your S3 costs will double, but usually they're
               | peanuts compared to all the other costs, anyway.
               | 
               | Debating it takes longer than configuring it.
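                | 
                | A minimal sketch of both steps with boto3 (bucket
                | names are hypothetical; S3's built-in bucket
                | replication is the more hands-off option for the
                | mirroring step):
                | 
                |     import boto3
                | 
                |     s3 = boto3.client("s3")
                | 
                |     # 1. Versioning: overwrites and deletes become
                |     # recoverable old versions.
                |     s3.put_bucket_versioning(
                |         Bucket="my-bucket",
                |         VersioningConfiguration={"Status": "Enabled"},
                |     )
                | 
                |     # 2. Mirror every object into a second bucket.
                |     pages = s3.get_paginator("list_objects_v2")
                |     for page in pages.paginate(Bucket="my-bucket"):
                |         for obj in page.get("Contents", []):
                |             s3.copy_object(
                |                 Bucket="my-bucket-mirror",
                |                 Key=obj["Key"],
                |                 CopySource={"Bucket": "my-bucket",
                |                             "Key": obj["Key"]},
                |             )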
        
               | nightpool wrote:
               | As I understand it, Aeonflux was talking about redundant
               | backups _outside_ of S3, which are much more complex.
        
               | exdsq wrote:
               | Cron-ran SFTP from s3:// to digitalOcean:// ?
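                | 
                | (Spaces speaks the S3 API directly, so no SFTP needed.
                | A rough cron-able sketch with boto3; the endpoint,
                | keys, and bucket names are hypothetical:)
                | 
                |     import boto3
                | 
                |     src = boto3.client("s3")  # AWS creds from env
                |     dst = boto3.client(
                |         "s3",
                |         endpoint_url=
                |             "https://nyc3.digitaloceanspaces.com",
                |         aws_access_key_id="SPACES_KEY",
                |         aws_secret_access_key="SPACES_SECRET",
                |     )
                | 
                |     pages = src.get_paginator("list_objects_v2")
                |     for page in pages.paginate(Bucket="my-bucket"):
                |         for obj in page.get("Contents", []):
                |             body = src.get_object(
                |                 Bucket="my-bucket",
                |                 Key=obj["Key"])["Body"]
                |             dst.upload_fileobj(
                |                 body, "my-space", obj["Key"])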
        
               | ramraj07 wrote:
               | You really chose to die on "backups are for old people"
               | as a hill?
        
               | smiths1999 wrote:
               | I'm not sure how you got "backups are for old people"
               | from my post. My point is that there are two sides to
                | this. Perhaps the data being stored on S3 _was_ backup
                | data and this engineer was proposing replicating
               | the backup data to GCP. That's probably not the highest
               | priority for most companies. Maybe the OP was right and
               | the other engineers were wrong. Who knows.
               | 
               | In my experience, the kind of person that argues about
               | "arrogant 25 year olds that know everything" is the kind
               | of person that only sees their side of a discussion and
               | refuses to understand the whole context. Maybe OP was in
               | the right, maybe they weren't. But the fact that they are
               | focusing on age and making ad hominem attacks is a red
               | flag in my book.
        
               | FpUser wrote:
               | >"Maybe they are getting tired of arrogant older
               | programmers..."
               | 
                | And this is of course a valid reason to ignore basic
                | data preservation approaches.
               | 
               | Myself I am an old fart and I realize that I am too
               | independent / cautious. But I see way too many young
                | programmers who just read the sales pitch and honestly
               | believe that once data is on Amazon/Azure/Google it is
               | automatically safe, their apps are automatically
               | scalable, etc. etc.
        
               | smiths1999 wrote:
               | Yes - the point of that line was to be ridiculous. Age
               | has nothing to do with it. Anyone at any age can have
               | good ideas and bad ideas. There are some really
                | incredible _older_ and highly experienced engineers out
               | there. But there are others that think that experience
               | means they are never wrong. Age has nothing to do with
               | this - what is important is your past experience, your
                | understanding of the problem and the context of the
               | problem, and how you work with your team.
               | 
               | And again, my point isn't that you never need backups. My
               | point is that it is entirely plausible that at that point
               | in time backups from S3 weren't a priority.
        
               | wly_cdgr wrote:
               | Would you put the one and only copy of your family photo
               | album up on AWS, where AWS going down would mean losing
               | it? Because your customers' data is more important than
                | that.
        
               | smiths1999 wrote:
               | AWS going down means I've lost it or temporarily lost
               | access to it? Those are two very different scenarios. Of
               | course S3 could lose data - a quick Google search shows
               | it has happened to at least one account. My guess is it
               | is rare enough that it seems like a reasonable decision
                | to not prioritize backing up your S3 data. I'm not saying
               | "never ever backup S3 data" only that it seems reasonable
               | to argue it's not the most important thing our team
               | should be working on at this moment.
               | 
               | I have my family photos on a RAIDed NAS. It took me years
               | to get that setup simply because there were higher
               | priority things in my life. I never once thought "ahh I
               | don't need backups of our data" I just had more important
               | things to do.
        
             | tonto wrote:
             | I had this experience when I asked about s3 backup also
             | (after a junior programmer deleted a directory in our s3
              | bucket...). The response from r/aws was "just don't let
              | that happen" (or "use IAM roles").
        
               | AceyMan wrote:
                | FYI: at the latest re:Invent, AWS announced a preview
                | of AWS Backup for S3 (right now in us-west-2 only).
               | 
               | Relevant blog post,
               | https://aws.amazon.com/blogs/aws/preview-aws-backup-adds-
               | sup...
        
             | manquer wrote:
              | S3 (and others) have version history that can be
              | enabled.
              | 
              | If you have to take care of availability, redundancy,
              | delete protection, and backups yourself anyway, why pay
              | the premium S3 is charging?
              | 
              | Either you don't trust the cloud, and you can run a NAS
              | or equivalent (with S3 APIs, easily, today) much
              | cheaper, or you trust them to keep your data safe and
              | available.
              | 
              | No point in investing in S3 and then doing it again
              | yourself.
        
               | scurvy wrote:
               | Backup on site and store tertiary copies in a cloud.
               | Storing all backups in AWS wouldn't meet a lot of
               | compliance requirements. Even multiple AZs in AWS would
               | not pass muster as there are single points of failure
               | (API, auth, etc).
        
               | kalleth wrote:
               | In most startups? You're mostly correct.
               | 
               | But you still have some risks here, yes, with a super low
               | probability, but a company-killing impact.
               | 
               | In some industries - banking, finance, anything
               | regulated, or really (I'd argue) anywhere where losing
               | all of your data is company killing - you will need a
               | disaster recovery strategy in place.
               | 
               | The risks requiring non-AWS backups are things like:
               | 
                | - A failed payment goes unnoticed and AWS locks you
                | out of your AWS account, which also goes unnoticed,
                | and the account and data are deleted
               | 
                | - A bad actor gains access to the root account by
                | faxing Amazon a fake notarized letter, finding a
                | leaked AWS key, or social-engineering one of your
                | DevOps team, and encrypts all of your data while
                | removing your AWS-based backups
               | 
               | - An internal bad actor deletes all of your AWS data
               | because they know they're about to be fired
               | 
               | ...and so on.
               | 
                | There are so many non-technical scenarios that can
                | make a single-vendor dependency for your entire
                | business unwise.
               | 
               | A storage array in a separate DC somewhere where your
               | platform can send (and only send! not access or modify)
               | backups of your business critical data ticks off those
               | super low probability but company-killing impact risks.
               | 
               | This is why risk matrices have separate probability and
                | impact sections. Minuscule probability but "the company
               | directors go to jail" impact? Better believe I'm spending
               | some time on that.
        
               | cameronh90 wrote:
               | Just to add that S3 supports a compliance object lock
               | that can't even be overridden by the root user. Also AWS
               | doesn't delete your account or data until 90 days after
               | the account is closed.
               | 
               | Between these two protections, it's pretty hard to lose
               | data from S3 if you really want to keep it. I would guess
               | they are better protections than you could achieve in
               | your own self managed DC.
               | 
               | I'm guessing AWS has some clause in their contract that
                | means they can refuse to deal with you, or even to
                | return any of your data, if they feel like it. Not
                | sure if that's ever happened, but still worth
                | considering.
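                | 
                | A rough sketch of that lock with boto3 (bucket name is
                | hypothetical); note Object Lock has to be requested
                | when the bucket is created, and region configuration
                | is omitted here:
                | 
                |     import boto3
                | 
                |     s3 = boto3.client("s3")
                | 
                |     # Object Lock can only be enabled at creation.
                |     s3.create_bucket(
                |         Bucket="my-compliance-bucket",
                |         ObjectLockEnabledForBucket=True,
                |     )
                | 
                |     # COMPLIANCE mode: nobody, including root, can
                |     # shorten or remove retention until it expires.
                |     s3.put_object_lock_configuration(
                |         Bucket="my-compliance-bucket",
                |         ObjectLockConfiguration={
                |             "ObjectLockEnabled": "Enabled",
                |             "Rule": {"DefaultRetention": {
                |                 "Mode": "COMPLIANCE", "Days": 90}},
                |         },
                |     )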
        
               | manquer wrote:
                | Yes, threat models are the obvious qualifier: if you
                | have a business that requires a backup on the moon in
                | case of an asteroid collision, then by all means go
                | for it.[1]
                | 
                | For most companies, what AWS or Azure offers is more
                | than adequate.
                | 
                | An internal bad actor with that level of privileged
                | access can delete your local backups too, and anything
                | an external one can do to AWS they can likely do more
                | easily to your company's storage DC.
                | 
                | Bottom line: if customers will pay for all this
                | low-probability stuff that can only happen on the
                | cloud and not on-prem, sure, go ahead. Half the things
                | customers pay for they don't need or use anyway.
                | 
                | [1] assuming your business model allows the expense
                | outlay you need for that threat model
        
               | chrisandchris wrote:
                | Nope. 3-2-1 strategy: 3 copies, 2 media, 1 offsite.
                | Now try to delete files from the media in my safe.
                | Only I have a key.
                | 
                | Sure, your threat model may vary. But relying on the
                | cloud alone for your backups is simply not enough. If
                | you split access to your AWS backup and your DC backup
                | between two different people, you have mitigated your
                | threat model. If you only have 1 backup location,
                | that's going to be very hard.
        
               | manquer wrote:
                | All of these are questions asked and solved 10 years
                | ago by bean counters whose only job is risk
                | mitigation.
                | 
                | Every cloud provider has compliance locks which even
                | the root user cannot disable, plus version history,
                | and you can set up your own copy workflow from one
                | storage container to a second one, with delete/update
                | access split between two different people, or
                | whatever.
                | 
                | You don't need to do any of it offsite.
        
               | JackFr wrote:
               | Not sure I agree about the usefulness of different media.
               | 
               | Having had to restore databases from tapes and removable
               | drives for a compliance/legal incident, we had a failure
               | rate of >50% on the tapes and about 33% for the removable
               | drives.
               | 
                | I came away not trusting any backup that wasn't online.
        
               | lambic wrote:
               | We have AWS backups, "offsite" backups on another cloud
               | provider, and air-gapped backups in a disconnected hard
               | drive in a safe.
               | 
               | The extra expense outlay for the 2 additional backups is
               | approximately $50/month, so it's not going to break the
               | bank.
        
               | manquer wrote:
                | Egress from AWS is not cheap.
                | 
                | At $50/month scale a lot of things are possible. Most
                | companies cannot store their data on a hard disk in a
                | safe. If you can, then the cloud is a convenience, not
                | a necessity, for you. I.e. you are perfectly fine
                | running your own storage stack for the most part.
                | 
                | My company is not very big (100ish employees) and we
                | pay $200k+ for AWS in storage alone, and AWS is not
                | even our primary cloud. If we had to do what you do,
                | bandwidth costs alone would probably be another
                | $500k. Add running costs in another cloud, recurring
                | bandwidth for transfers, and retrieval from Glacier
                | for older data on top of that.[1]
                | 
                | Over 3 years that would easily be $1-$1.5 million in
                | net new expenses at our scale.
                | 
                | No sane business is going to sign off on 3x+ storage
                | costs for a risk that cannot be easily modeled[2] and
                | costs that cannot be priced into the product, just so
                | one sysadmin can sleep better at night.
                | 
                | [1] Your hard-disk-in-a-safe third component is not a
                | sensible discussion point at reasonable scale.
                | 
                | [2] This would be: probability of data loss with AWS *
                | business cost of losing that data > cost of a
                | secondary system.
                | 
                | Or: probability of a data availability event (like
                | now) * business cost of that > cost of an active
                | secondary system.
                | 
                | For almost no business in the world would either
                | inequality hold.
                | 
                | For example, even at $100B in revenue, with 6 nines of
                | durability the expected loss would be only $100,000
                | ($100B * 0.000001); a secondary system is much
                | costlier than that.
        
               | hinkley wrote:
               | Whether you realize it or not, you believe in the
               | Scapegoat Effect, and it's going to get you into a
               | shitload of trouble some day.
               | 
                | Customers don't care if it's your fault or not, they
               | only care that your stuff is broken. That safety blanket
               | of having a vendor to blame for the problem might feel
               | like it'll protect your job but the fact is that there
               | are many points in your career where there is one
               | customer we can't afford to lose for financial or
               | political reasons, and if your lack of pessimistic
               | thinking loses us that customer, then you're boned. You
               | might not be fired, but you'll be at the top of the list
               | for a layoff round (and if the loss was financial,
               | that'll happen).
               | 
               | In IT, we pay someone else to clean our offices and
               | restock supplies because it's not part of our core
               | business. It's fine to let that go. If I work at a hotel
               | or a restaurant, though, 'we' have our own people that
               | clean the buildings and equipment. Because a hotel is a
               | clean, dry building that people rent in increments of 24
               | hours. Similarly, a restaurant has to build up a core
               | competency in cleanliness or the health department will
               | shut them down. If we violate that social contract, we
               | take it in the teeth, and then people legislate away our
               | opportunities to cut those corners.
               | 
               | For the life of me I can't figure out why IT companies
               | are running to AWS. This is the exact same sort of
               | facilities management problem that physical businesses
               | deal with internally.
               | 
               | I have saved myself and my teams from a few architectural
               | blunders by asking the head of IT or Operations what they
               | think of my solution. Sometimes the answer starts with,
               | "nobody would ever deploy a solution that looked like
               | that". Better to get that feedback in private rather than
               | in a post-mortem or via veto in a launch meeting. But I
               | have had less and less access to that sort of domain
               | knowledge over the last decade, between Cloud Services
               | and centralized, faceless IT at some bigger companies.
               | It's a huge loss of wisdom, and I don't know that the
               | consequences are entirely outweighed by the advantages.
        
               | jeremyjh wrote:
               | There are completely independent risks that you are
               | dealing with here. If you are a small company there is a
               | non-insignificant risk that your cloud account will be
               | closed and it will be impossible to find out why or to
                | fix it in a timely manner. There have been several that
               | were only fixed after being escalated to the front page
               | of Hacker News, and we haven't heard about the ones that
               | didn't get enough upvotes to get our attention and were
               | never fixed.
               | 
               | Also, what we saw on Dec 7th was that the complexity of
               | Amazon's infrastructure introduces risks of downtime that
               | simply cannot be fully mitigated by Amazon, or by any
               | other single provider. More redundancy introduces more
               | complexity at both the micro level and macro level.
               | 
               | It doesn't really cost that much to at least store
               | replicated data in an independent cloud, particularly a
               | low-cost one like Digital Ocean.
        
               | bbarnett wrote:
               | Erm.
               | 
               | In some orgs, recreating lost data, code, deployment and
               | more is literally hundreds of thousands of hours of work.
               | 
               | In a smaller org, the devastation can be just as stark.
                | Losing hundreds of hours of work can be a death knell.
               | 
                | Anyone advocating placing an entire org's future on one
               | provider is literally, completely incompetent.
               | 
               | It's the equiv of a home user thinking all their baby
               | pics will be safe on google or facebook. It is just plain
               | _dumb_.
        
               | ncallaway wrote:
               | > No point in investing in S3 and then doing it again
               | yourself.
               | 
               | I mean that's just obviously wrong, though.
               | 
               | There is a point.
               | 
                | > Either you don't trust the cloud, and you can run a
                | NAS or equivalent (with S3 APIs, easily, today) much
                | cheaper, or you trust them to keep your data safe and
                | available.
               | 
                | What if you trust the cloud 90%, and you trust
                | yourself 90%, and you think the failure cases between
                | the two are likely to be independent? Then it seems
                | like the smart decision would be to do both.
               | 
                | Your position is basically arguing that redundant
                | systems are never necessary, because "either you trust
                | A _or_ you trust B, why do both?" If it's absolutely
                | critical that you don't suffer a particular failure,
                | then having redundant systems is very wise.
        
               | manquer wrote:
                | My point is, if your redundancy is better than AWS's,
                | then why pay for them? If it is not, then why invest
                | in your own?
                | 
                | You can argue that you protect against different
                | threats than AWS does. So far I have not seen a
                | meaningful argument for threats that on-prem protects
                | against differently than the cloud, such that you need
                | _both_.
                | 
                | Say, for example, your solution is to put all your
                | data backups on the moon; then it makes sense to do
                | both, since AWS does not protect against planet-wide
                | issues.
                | 
                | However, if you are both protecting against the exact
                | same risks, provider redundancy only protects against
                | events like AWS going down for days/months or going
                | bankrupt.
                | 
                | All business decisions have some risk, and provider
                | redundancy does not seem a risk worth mitigating,
                | given what it would cost most businesses I have seen.
                | 
                | Even Amazon.com and Google's apps host on their own
                | clouds and don't use multi-cloud, after all; their
                | regular businesses are much bigger than their cloud
                | businesses, yet they still risk those to stick to
                | their own cloud/services only.
        
               | miked85 wrote:
                | > _Even Amazon.com and Google's apps host on their own
                | clouds and don't use multi-cloud, after all; their
                | regular businesses are much bigger than their cloud
                | businesses_
               | 
               | This is probably true with Google, but AWS contributes >
               | 50% of Amazon's operating income. [1]
               | 
               | [1] https://www.techradar.com/news/aws-is-now-a-bigger-
               | part-of-a...
        
               | bee_rider wrote:
                | If you trust your airbag, why bother with the seatbelt?
        
               | xboxnolifes wrote:
                | > My point is, if your redundancy is better than
                | AWS's, then why pay for them? If it is not, then why
                | invest in your own?
                | 
                | This is a really confusing question. Redundancy
                | _requires_ more than one option. It's not about it
                | being better than AWS, it's that in order to have it
                | you need
               | something besides just AWS. AWS may provide redundant
               | drives, but they don't provide a redundant AWS. AWS can
               | protect against many things, but it cannot protect
               | against AWS being unavailable.
        
               | RA_Fisher wrote:
                | True: with two independent servers at 90% each, that's
                | 0.1^2 = 1% chance both fail -- so redundancy can add a
                | lot of reliability.
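                | 
                | In code, for anyone who wants to play with the numbers
                | (independence assumed):
                | 
                |     p_fail = 0.10             # each fails 10% of time
                |     p_both = p_fail * p_fail  # 0.01 -> 1% both down
                |     print(f"{1 - p_both:.0%}")  # 99% availability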
        
               | JackFr wrote:
               | You assume failures are uncorrelated. Which, depending on
               | what you think you are protecting yourself from, might or
               | might not be true.
               | 
               | (Consider a buggy software release which incorrectly
               | deletes a backup. Depending on the bug it's very possible
               | it will delete in both places.)
        
               | bee_rider wrote:
               | If one buggy software release can delete both copies,
               | then you don't have actual redundancy from the point of
               | view of that issue.
        
               | manquer wrote:
                | Only if they are truly independent of each other.
                | 
                | You and AWS are using similar chips and similar hard
                | disks, even with similar failure rates.
                | 
                | If you both use the same hardware from, say, the same
                | batch, both can have defects and fail at similar
                | times. Or you use the same file system, which, say,
                | corrupts both your backups.
                | 
                | 90% is not a magic number; you need to know AWS's
                | supply chains and practices thoroughly, and keep yours
                | different enough not to share the same risks, for your
                | system to have an independent probability of failure.
        
             | hayd wrote:
              | Having an additional AWS account that S3 backs up to,
              | with write-only permissions (no delete), in an account
              | that is not used by anyone day-to-day, seems like a good
              | idea for this type of situation/concern.
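              | 
              | A rough sketch of the vault-side bucket policy with
              | boto3 (the account ID and names are hypothetical): the
              | production account gets s3:PutObject and nothing else,
              | and versioning on the vault keeps history even if the
              | prod side overwrites a key.
              | 
              |     import json
              |     import boto3
              | 
              |     policy = {
              |         "Version": "2012-10-17",
              |         "Statement": [{
              |             "Sid": "WriteOnlyFromProd",
              |             "Effect": "Allow",
              |             # hypothetical production account ID
              |             "Principal": {"AWS":
              |                 "arn:aws:iam::111122223333:root"},
              |             "Action": "s3:PutObject",
              |             "Resource":
              |                 "arn:aws:s3:::my-backup-vault/*",
              |         }],
              |     }
              | 
              |     # run with the backup account's credentials
              |     s3 = boto3.client("s3")
              |     s3.put_bucket_policy(
              |         Bucket="my-backup-vault",
              |         Policy=json.dumps(policy),
              |     )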
        
           | hinkley wrote:
           | AWS has had at least one documented incident where a region
           | had an S3 failure that was not recoverable. They lost about
           | 2% of all data. That might not sound like much but if you
           | have a lot of data, partial restoration of that data doesn't
           | necessarily leave your system in a functional state. If it
           | loses my compiled CSS files I might be able to redeploy my
           | app to fix it. Then again if I'm a SaaS company and that file
           | was generated in part from user input, it might be more
           | difficult to reconstruct that data.
        
             | Johnny555 wrote:
             | Which incident is this? I can't find it online. The closest
             | I can recall is when they lost some number of EBS volumes.
             | We were affected by that, but ran snapshots (to s3) to
             | recover the affected servers.
        
         | whydoyoucare wrote:
         | It reminds me of the old adage: "Two is one, one is none. Have
         | a backup. Always."
        
         | uvdn7 wrote:
          | You could have just shown them historical data of both
          | companies being unavailable for extended amounts of time.
          | What happened in the past few months is not new.
        
           | joana035 wrote:
           | "just", as if you never had to argument against aws
           | fanboys...
        
             | meshaneian wrote:
             | As a _pragmatic_ AWS fan, +1 this. Disposable distributed
             | hybrid multi-cloud architecture FTW.
        
         | davewritescode wrote:
          | You're not wrong, but there are ways to do backups properly
          | in AWS, and I'm not aware of there ever being an incident
          | where AWS has lost data.
          | 
          | It's not a bad idea to store backups offline, but costs
          | might make that an expensive proposition.
        
           | numbsafari wrote:
           | S3 isn't perfect. Read the fine print.
           | 
           | I've had buckets and objects disappear into the ether.
           | 
           | It is exceedingly rare, but it's not impossible.
           | 
           | Offline/alt-cloud backups are probably a lot cheaper than you
           | think, and will win you points during any audit.
        
             | thraxil wrote:
             | > Offline/alt-cloud backups are probably a lot cheaper than
             | you think, and will win you points during any audit.
             | 
             | With the caveat that you're going to have to implement all
             | your access controls, monitoring and compliance mechanisms
             | on those alternate backups. No point winning points during
             | an audit for having backups outside AWS if you lose even
             | more points for "backups weren't properly secured against
             | unauthorized access".
             | 
             | And you're regularly restoring from those alternate backups
             | as well to check their integrity, right?
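              | 
              | A minimal restore-check sketch (bucket and manifest
              | names are hypothetical): pull each object back and
              | compare it against a checksum recorded at backup time.
              | 
              |     import boto3
              |     import hashlib
              |     import json
              | 
              |     s3 = boto3.client("s3")
              | 
              |     # manifest.json: object key -> sha256 digest,
              |     # written out when the backup was taken
              |     with open("manifest.json") as f:
              |         manifest = json.load(f)
              | 
              |     for key, expected in manifest.items():
              |         data = s3.get_object(
              |             Bucket="my-alt-backup",
              |             Key=key)["Body"].read()
              |         if hashlib.sha256(data).hexdigest() != expected:
              |             print(f"FAILED INTEGRITY CHECK: {key}")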
        
               | numbsafari wrote:
               | Well, obviously, it goes without saying.
               | 
               | But none of that changes the fact that you shouldn't put
               | all your eggs in one basket.
        
         | rafale wrote:
          | Did you file a complaint about the use of swear words?
        
         | xwdv wrote:
         | You're still in the wrong, don't be so smug. These few
         | downtimes are no big deal in the grand scheme of things, and
         | your proposed solution would have been more work and headaches
         | for little to no realizable gains, and not to mention the
         | cybersecurity ramifications. Quite frankly, they are probably
         | glad that you're gone and not around to gloat about every
         | trivial bit of downtime.
        
           | locallost wrote:
           | They're not gloating and also not smug. There's not even a
           | 'hehe' in the post.
        
       | whoomp12342 wrote:
       | the cloud is great they said...
        
       | jakub_g wrote:
       | Where are you located? "X is down" without location is only
       | moderately useful.
       | 
        | I'm having issues with Slack from central EU (Poland) -- I
        | can't upload images or send emoji reactions to posts;
        | curiously, text works fine. Wondering if it's linked.
        
         | riknox wrote:
          | The AWS Console runs in us-east-1, so IIRC that points to at
          | least that region having issues. I am also having Slack
          | issues in the EU.
        
         | hdjjhhvvhga wrote:
          | You should complain to Slack then. It's their problem to
          | choose a reliable provider, and AWS seems to be having
          | trouble living up to that status.
        
       | l0b0 wrote:
       | Meta: I posted a "PyPI is down" link a few days ago, and the post
       | got insta-flagged. Is there some rule about this sort of thing?
        
       | devoutsalsa wrote:
       | We'll never really know the answer, but I have to wonder what
       | percentage of comments on this thread are from Amazon downplaying
       | the severity & other cloud providers hyping it up.
        
         | mongrelion wrote:
         | You give HN too much credit.
        
       ___________________________________________________________________
       (page generated 2021-12-22 23:00 UTC)