hngopher.com

       [HN Gopher] Google cloud outage
       ___________________________________________________________________
        
       Google cloud outage
        
       Author : thomassharoon
       Score  : 45 points
       Date   : 2020-03-27 20:42 UTC (2 hours ago)
        
 (HTM) web link (status.cloud.google.com)
 (TXT) w3m dump (status.cloud.google.com)
        
       | nammi wrote:
       | We were seeing timeouts in east-1. I don't know what "normal"
       | looks like, but Pingdom's map seems to show the whole east coast
       | as affected https://livemap.pingdom.com/
        
       | svacko wrote:
       | yeah, our GKE pods running in us-east1 were dying ~90 minutes ago
       | like crazy... hope they are gonna resolve this soon. not the
       | luckiest day for Google, nor us
        
       | qmarchi wrote:
       | Heyo Googler here.
       | 
       | The problem was a mix between another cloud provider and GCP.
       | 
       | Dare I say, there should be little customer impact as of 13:37
       | PST.....
       | 
       | The status dashboard is going to be your best source of
       | information.
        
         | gigatexal wrote:
         | Oh man I had no idea the big cloud providers have dependencies
         | on other clouds like this.
        
           | lima wrote:
           | They do not. According to the dashboard, this issue merely
           | affected connectivity between GCP and other cloud providers.
           | 
           | There was a different outage yesterday, which has nothing to
           | do with the one discussed in this thread.
        
           | dodobirdlord wrote:
           | Given how much trans-continental/trans-oceanic network cable
           | the major cloud providers own, they almost certainly have
           | special trans-cloud network traffic infrastructure.
           | Especially since so much of "The Cloud" is within a few tens
           | of square miles in a field in Virginia. I can easily see how
           | one provider could majorly disrupt another provider by
           | accidentally breaking inbound traffic on one of those links.
        
             | gigatexal wrote:
             | Yeah, I see that now. Makes total sense.
        
             | qmarchi wrote:
             | The bigger issue is that there are a lot of customers with
             | split cloud deployments, which means those customers are
             | hurt even if they are stable within the clouds themselves.
        
               | thedance wrote:
               | If you are deployed in such a way that both GCP and AWS
               | need to be up you're doing it backwards. Multi-cloud
               | strategy is supposed to result in the intersection of
               | cloud failures, not the union of them.
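
The intersection-versus-union point above can be illustrated with a little arithmetic. This is a minimal sketch, not anything posted in the thread: it assumes the two clouds fail independently, and the 0.1% per-cloud failure probability and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical illustration: compare a deployment that needs BOTH clouds up
# (downtime is the union of failures) with one that needs EITHER cloud up
# (downtime is the intersection of failures).
# Assumption: the two clouds fail independently of each other.

HOURS_PER_YEAR = 24 * 365


def unavailability_both_required(p_fail_a: float, p_fail_b: float) -> float:
    """Union of failures: the service is down if either cloud is down."""
    return 1 - (1 - p_fail_a) * (1 - p_fail_b)


def unavailability_either_suffices(p_fail_a: float, p_fail_b: float) -> float:
    """Intersection of failures: the service is down only if both are down."""
    return p_fail_a * p_fail_b


if __name__ == "__main__":
    p = 0.001  # assumed 0.1% unavailability per cloud, not a real SLA figure
    for label, fn in [
        ("both required (union)", unavailability_both_required),
        ("either suffices (intersection)", unavailability_either_suffices),
    ]:
        u = fn(p, p)
        print(f"{label}: {u:.6%} down, ~{u * HOURS_PER_YEAR:.2f} h/year")
```

Under these assumptions, requiring both clouds roughly doubles the expected downtime of a single cloud, while tolerating the loss of either one cuts it by orders of magnitude. Real failures are rarely fully independent, as this outage of the links between providers shows.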
        
         | [deleted]
        
         | the-dude wrote:
         | This can't be real.
        
         | svacko wrote:
         | Is the other cloud provider AWS? I could see tons of
         | connection timeouts between GCP & S3/Elasticsearch service.
         | 
         | Hope everything is resolved now for good.
        
           | judge2020 wrote:
           | Seems to be AWS; connections to Gmail's SMTP relay also
           | started timing out.
        
           | [deleted]
        
         | unbeli wrote:
         | [removed]
        
           | packetslave wrote:
           | Next time YOU are about to spout off about something, perhaps
           | think about reading the f'ing page being linked to?
           | 
           | "The issue with connectivity between the GCP us-east1,
           | us-east4, and us-central1 regions to other Cloud Providers
           | has been resolved for all affected projects as of Friday,
           | 2020-03-27 13:37 US/Pacific."
        
             | unbeli wrote:
             | Perhaps you're right. I'd better discuss it privately.
        
       | tagux wrote:
       | "We had a router failure in Atlanta".
       | 
       | WHAT? You kidding us?
       | 
       | Urs Hölzle, senior vice president of technical infrastructure at
       | Google Cloud, said, "We're very sorry about that! We had a router
       | failure in Atlanta, which affected traffic routed through that
       | region. Things should be back to normal now. Just to make sure:
       | This wasn't related to traffic levels or any kind of overload,
       | our network is not stressed by COVID-19."
        
         | ocdtrekkie wrote:
         | Was it like... a hardware failure? If you serve more than 100
         | people you probably should have redundant routers. Was it a
         | configuration issue that replicated over to multiple devices at
         | least, I hope?
        
           | AdamJacobMuller wrote:
           | It's not that simple, as you sometimes need to manually isolate
           | the faulty hardware and remove it from service.
        
           | toast0 wrote:
           | Have you worked with redundant routers? They certainly reduce
           | the number of outages, but sometimes the hardware (or
           | software) fails in exciting ways that don't engage the
           | redundancy, or don't engage it properly, and you still get
           | an outage (or you get an outage that wouldn't otherwise have
           | happened). Or sometimes, one circuit is out of service for
           | repair or upgrade, and the other circuit is connected to the
           | router that failed. And routing for the AS that travels on
           | that circuit was set not to fall back to transit because the
           | last time that happened, it caused major issues.
           | 
           | I have no specific knowledge of today's events, but this sort
           | of thing happens. You can get the number of incidents down
           | pretty low, but not to zero.
        
             | ocdtrekkie wrote:
             | I have. I am just highlighting that the problem must surely
             | be more complex than described, or that their redundancy
             | for these events was not adequately devised.
        
           | dodobirdlord wrote:
           | Networks are harder than everyone thinks. The 2018
           | CenturyLink outage on the west coast was caused by one bad
           | network card that started writing malformed packets.
           | 
           | https://www.geekwire.com/2018/report-huge-centurylink-
           | outage...
        
           | thanksforfish wrote:
           | Surely the "100 people" metric is too low, although I agree
           | that at some point (and certainly at Google scale) a
           | redundant router makes sense.
        
           | packetslave wrote:
           | yes, because OBVIOUSLY Google is too stupid to know about
           | redundant routers. /s
           | 
           | https://twitter.com/uhoelzle/status/1243259280410554368
           | 
           | "When routers fail cleanly (say, power out) failover is
           | quick, so you never hear about these. This wasn't such a
           | simple case. We have "many" (not just two) routers in Atlanta
           | so it wasn't an issue of missing redundancy."
        
         | neonate wrote:
         | https://twitter.com/uhoelzle/status/1243217659690278912
        
       ___________________________________________________________________
       (page generated 2020-03-27 23:00 UTC)