[HN Gopher] Google cloud outage
___________________________________________________________________
Google cloud outage

Author : thomassharoon
Score  : 45 points
Date   : 2020-03-27 20:42 UTC (2 hours ago)

(HTM) web link (status.cloud.google.com)
(TXT) w3m dump (status.cloud.google.com)

| nammi wrote:
| We were seeing timeouts in east-1. I don't know what "normal"
| looks like, but Pingdom's map seems to show the whole east coast
| as affected: https://livemap.pingdom.com/

| svacko wrote:
| yeah, our GKE pods running in us-east1 were dying like crazy
| ~90 minutes ago... hope they are gonna resolve this soon. not the
| luckiest day for Google, nor for us

| qmarchi wrote:
| Heyo, Googler here.
|
| The problem was a mix between another cloud provider and GCP.
|
| Dare I say, there should be little customer impact as of 13:37
| PST...
|
| The status dashboard is going to be your best source of
| information.

| gigatexal wrote:
| Oh man, I had no idea the big cloud providers have dependencies
| on other clouds like this.

| lima wrote:
| They do not. According to the dashboard, this issue merely
| affected connectivity between GCP and other cloud providers.
|
| There was a different outage yesterday, which has nothing to
| do with the one discussed in this thread.

| dodobirdlord wrote:
| Given how much trans-continental/trans-oceanic network cable
| the major cloud providers own, they almost certainly have
| special trans-cloud network traffic infrastructure, especially
| since so much of "The Cloud" is within a few tens of square
| miles in a field in Virginia. I can easily see how one provider
| could majorly disrupt another provider by accidentally breaking
| inbound traffic on one of those links.

| gigatexal wrote:
| Yeah, I see that now. Makes total sense.

| qmarchi wrote:
| The bigger issue is that a lot of customers have split cloud
| deployments, which means those customers are hurt even if they
| are stable within the clouds themselves.
| thedance wrote:
| If you are deployed in such a way that both GCP and AWS
| need to be up, you're doing it backwards. A multi-cloud
| strategy is supposed to result in the intersection of
| cloud failures, not the union of them.

| [deleted]

| the-dude wrote:
| This can't be real.

| svacko wrote:
| Is the other cloud provider AWS? I could see tons of
| connection timeouts between GCP and S3/Elasticsearch Service.
|
| Hope everything is resolved now for good.

| judge2020 wrote:
| Seems to be AWS; connections to Gmail's SMTP relay also started
| timing out.

| [deleted]

| unbeli wrote:
| [removed]

| packetslave wrote:
| Next time YOU are about to spout off about something, perhaps
| think about reading the f'ing page being linked to?
|
| "The issue with connectivity between the GCP us-east1,
| us-east4, and us-central1 regions to other Cloud Providers has
| been resolved for all affected projects as of Friday,
| 2020-03-27 13:37 US/Pacific."

| unbeli wrote:
| Perhaps you're right. I'd better discuss it privately.

| tagux wrote:
| "We had a router failure in Atlanta".
|
| WHAT? You kidding us?
|
| Urs Hölzle, senior vice president of technical infrastructure
| at Google Cloud, said, "We're very sorry about that! We had a
| router failure in Atlanta, which affected traffic routed
| through that region. Things should be back to normal now. Just
| to make sure: This wasn't related to traffic levels or any kind
| of overload, our network is not stressed by COVID-19."

| ocdtrekkie wrote:
| Was it like... a hardware failure? If you serve more than 100
| people you probably should have redundant routers. Was it a
| configuration issue that replicated over to multiple devices at
| least, I hope?

| AdamJacobMuller wrote:
| It's not that simple, as you sometimes need to manually isolate
| the faulty hardware and remove it from service.

| toast0 wrote:
| Have you worked with redundant routers?
| They certainly reduce the number of outages, but sometimes the
| hardware (or software) fails in exciting ways that don't engage
| the redundancy, or don't engage it properly, and you still get
| an outage (or you get an outage that wouldn't have happened
| otherwise). Or sometimes one circuit is out of service for
| repair or upgrade, and the other circuit is connected to the
| router that failed. And routing for the AS that travels on that
| circuit was set not to fall back to transit because the last
| time that happened, it caused major issues.
|
| I have no specific knowledge of today's events, but this sort
| of thing happens. You can get the number of incidents down
| pretty low, but not to zero.

| ocdtrekkie wrote:
| I have. I am just highlighting that the problem surely must
| be more complex than described, or else their redundancy for
| these events was not adequately devised.

| dodobirdlord wrote:
| Networks are harder than everyone thinks. The 2018
| CenturyLink outage on the west coast was caused by one bad
| network card that started writing malformed packets.
|
| https://www.geekwire.com/2018/report-huge-centurylink-outage...

| thanksforfish wrote:
| Surely the "100 people" metric is too low, although I agree
| that at some point (and certainly at Google scale) a redundant
| router makes sense.

| packetslave wrote:
| yes, because OBVIOUSLY Google is too stupid to know about
| redundant routers. /s
|
| https://twitter.com/uhoelzle/status/1243259280410554368
|
| "When routers fail cleanly (say, power out) failover is
| quick, so you never hear about these. This wasn't such a
| simple case. We have "many" (not just two) routers in Atlanta
| so it wasn't an issue of missing redundancy."

| neonate wrote:
| https://twitter.com/uhoelzle/status/1243217659690278912
___________________________________________________________________
(page generated 2020-03-27 23:00 UTC)