[HN Gopher] Post Mortem of Google Outage on 14 December 2020
___________________________________________________________________
  Post Mortem of Google Outage on 14 December 2020

  Author : saifulwebid
  Score  : 141 points
  Date   : 2020-12-18 21:46 UTC (1 hour ago)

  (HTM) web link (status.cloud.google.com)
  (TXT) w3m dump (status.cloud.google.com)

  | physicsgraph wrote:
  | My favorite quote: "prevent fast implementation of global changes." This is how large organizations become slower compared to small "nimble" companies. Everyone fails, but the noticeability of failure incentivizes large organizations to be more careful.
  |
  | inopinatus wrote:
  | It doesn't have to be this way, but that's partly a matter of culture. By aspiring to present/think/act as a monoplatform, Google substantially increases the blast radius of individual component failure. A global quota system mediating every other service sounds both totally on brand, and also the antithesis of everything I learned about public cloud scaling at AWS. There we made jokes, that weren't jokes, about service teams essentially DoS'ing each other, and this being the natural order of things that every service must simply be resilient to.
  |
  | Having been impressed upon by that mindset, my design reflex is instead "eliminate global dependencies", rather than trying to globally rate-limit a global rate-limiter. I see this sort of thing as a generalised Conway's Law.
  |
  | Xorlev wrote:
  | Slowing down actuation of prod changes to happen over hours vs. seconds is a far cry from the large org / small org problem. Ultimately, when the world depends on you, limiting the blast radius to X% of the world vs. 100% of it is a significant improvement.
  |
  | blaisio wrote:
  | Hmm, one thing that jumped out at me was the organizational mistake of having a very long automated "grace period". This is actually bad system architecture. Whenever you have a timeout for something that involves a major config change like this, the timeout must be short (like less than a week). Otherwise, it is very likely people will forget about it, and it will take a while for people to recognize and fix the problem. The alternative is to just use a calendar and have someone manually flip the switch when they see the reminder pop up. Over-reliance on automated timeouts like this is indicative of a badly designed software ownership structure.
  |
  | x87678r wrote:
  | Does anyone know if SREs in Europe fixed the problem or whether it relied on people in Mountain View? When an outage like this hits, do devs get involved?
  |
  | I've worked as an SRE and it sucks to be fixing developers' bugs in the middle of the night.
  |
  | [deleted]
  |
  | sabujp wrote:
  | s/post mortem/incident analysis/
  |
  | gojomo wrote:
  | In this domain, the two terms are synonyms. And neither "post-mortem" nor "incident report" appears on the log page, making either an equally fair synthesized title.
  |
  | b0afc375b5 wrote:
  | What is the difference between the two? I tried searching for 'post mortem vs incident analysis' but couldn't find anything.
  |
  | gojomo wrote:
  | Some org might have decreed internal fine distinctions, but in common parlance, both terms are used for the same kind of after-event writeups.
  |
  | ph0rque wrote:
  | Well, post mortem means "after death" in Latin. So it would seem the difference is, one can recover from an incident...
  |
  | papito wrote:
  | Post-mortem is just a gruesome term.
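The "prevent fast implementation of global changes" remediation quoted by physicsgraph, and Xorlev's point about actuating prod changes over hours instead of seconds, usually cash out as some form of staged rollout. Below is a minimal sketch of that idea in Python; it is an illustration only, not Google's tooling, and the slice list, the apply/health-check callables, and the soak interval are all hypothetical.

    import time
    from typing import Callable, Sequence

    def staged_rollout(slices: Sequence[str],
                       apply_change: Callable[[str], None],
                       healthy: Callable[[str], bool],
                       soak_seconds: int = 3600) -> None:
        """Actuate a config change one slice (cell/region) at a time.

        A failed health check halts the rollout, so a bad change can only
        affect the slices already touched, not the whole fleet at once.
        """
        for i, cell in enumerate(slices):
            apply_change(cell)
            time.sleep(soak_seconds)   # let problems surface before moving on
            if not healthy(cell):
                raise RuntimeError(
                    f"rollout halted at {cell} ({i + 1}/{len(slices)})")

The trade-off is exactly the one discussed above: the change takes much longer to land everywhere, but the blast radius of a bad change is bounded by however far the loop has progressed.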
  | polote wrote:
  | off-topic: Do we know what happened a day after that, when Gmail returned "this email doesn't exist"?
  |
  | rubyron wrote:
  | I'd say that's solidly on-topic, and it was more damaging to my business than the previous outage.
  |
  | azornathogron wrote:
  | Here is the incident report for the Gmail problem:
  |
  | https://static.googleusercontent.com/media/www.google.com/en...
  |
  | It is linked from the Google Workspace status page here:
  |
  | https://www.google.com/appsstatus#hl=en-GB&v=issue&sid=1&iid...
  |
  | polote wrote:
  | Thanks. So a change of an env variable put the most used email system in the world down for 6 hours? Not sure I can believe that.
  |
  | xyzelement wrote:
  | >> so a change of an env variable put the most used email system in the world down for 6 hours, not sure I can believe that
  |
  | I can totally believe it. In my experience, the bigger the outage, the stupider-seeming the cause.
  |
  | jeffbee wrote:
  | Google has had a bunch of notorious outages caused by similar things, including pushing a completely blank front-end load balancer config to global production. The post-mortem action items for these are always really deep thoughts about the safety of config changes, but in my experience there, nobody ever really fixes them because the problem is really hard.
  |
  | For this kind of change I would probably have wanted some kind of shadow system that loaded the new config, received production inputs, produced responses that were monitored but discarded, and had no other observable side effects. That's such a pain in the ass that most teams aren't going to bother setting that up, even when the risks are obvious.
  |
  | jeffbee wrote:
  | Damn, that's awful. It should lead to a deep questioning of people who claim to be migrating your junk to "best practices" if the old config system worked fine for 15 years and the new one caused a massive data-loss outage on day 1.
  |
  | gojomo wrote:
  | That Google is choosing to use a PDF (!) as the official incident-reporting medium is as confidence-destroying as the outage was.
  |
  | pdkl95 wrote:
  | > A configuration change during this migration shifted the formatting behavior of a service option so that it incorrectly provided an invalid domain name, instead of the intended "gmail.com" domain name, to the Google SMTP inbound service.
  |
  | Wow... how was this even possible? Did they do _any testing whatsoever_ before migrating the live production system? The misformatted domain name should have broken even basic functionality tests.
  |
  | I wonder if they _didn't_ actually test the literal "gmail.com" configuration, due to dev/testing environments using a different domain name? I had that problem on my first Ruby on Rails project, due to subtle differences between the development/test/production settings in config/environments/. Running "rake test" is not a substitute for an actual test of the real production system.
  |
  | sabujp wrote:
  | > "As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0."
  |
  | It's quite hilarious how we have to hide internals by saying stuff like "new quota system"... oh well, carry on, no further comment.
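A rough illustration of the kind of pre-push check pdkl95 is asking about: validate the inbound-routing domain in a candidate config against both a syntax rule and the value currently serving traffic, before anything reaches production. This is a hypothetical sketch in Python; the key name smtp_inbound_domain and the config shape are invented here for illustration, not taken from the incident report.

    import re

    DOMAIN_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$")

    def check_inbound_domain(candidate: dict, live: dict) -> None:
        """Pre-push smoke test for a mail-routing config change.

        Rejects a candidate config whose inbound domain is syntactically
        invalid, and flags any change from the live value so it has to be
        acknowledged explicitly rather than slipping through a migration.
        """
        new = candidate.get("smtp_inbound_domain", "")
        old = live.get("smtp_inbound_domain", "")
        if not DOMAIN_RE.match(new):
            raise ValueError(f"invalid inbound domain in candidate config: {new!r}")
        if new != old:
            raise ValueError(f"inbound domain changed {old!r} -> {new!r}; "
                             "requires explicit sign-off")

Per the incident report quoted above, the migration produced an invalid domain instead of the intended "gmail.com"; a check along these lines, or a staging environment that exercises the real production domain as pdkl95 suggests, seems like the sort of thing that could have caught it.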
  | AdrianB1 wrote:
  | I am happy to see Google makes mistakes too, even if theirs are in areas a lot more complex than what I see at my job :)
  |
  | Now on a serious note, the increasing complexity of systems and architectures makes them more challenging to manage and makes failures a lot harder to prevent.
  |
  | ak217 wrote:
  | I've worked on a number of systems where code was pushed to a staging environment (a persistent functional replica of production where integration tests happen) and sat there for a week before being allowed into production. A staging setup might have prevented this scenario, since the quota enforcement grace period would expire in staging a week before prod and give the team a week to notice and push an emergency fix.
  |
  | papito wrote:
  | Staging is never hammered with production traffic, so it often fails to expose problems. It's a sanity check for developers and QA, essentially.
  |
  | At the scale of Google, you test in production, but with careful staged canary deployments. Even a 1% rollout is more than most of us have ever dealt with.
  |
  | [deleted]
  |
  | tpmx wrote:
  | Thought: There should be a status dashboard for "free" (paid for via personal ad targeting metadata) services like Search, Gmail, Drive, etc. With postmortems, too.
  |
  | Splendor wrote:
  | There are several. Here's one example: https://downdetector.com/
  |
  | [deleted]
  |
  | whermans wrote:
  | Existing dashboards showed the services as being up, since the frontends worked fine - as long as you did not try to authenticate.
  |
  | jpxw wrote:
  | TLDR: the team forgot to update the resource quota requirements for a critical component of Google's authentication system while transitioning between quota systems.
  |
  | jrockway wrote:
  | That's not the TL;DR, is it? It seems that the quota system detected current usage as "0", and thus adjusted the quota downwards to 0 until the Paxos leader couldn't write, which caused all of the data to become stale, which caused downstream systems to fail because they reject outdated data.
  |
  | basicneo wrote:
  | Is that an accurate tl;dr?
  |
  | "As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0."
  |
  | [deleted]
  |
  | tpolm wrote:
  | It is interesting that in both cases (the recent Gmail one and this one) it was a "migration":
  |
  | "As part of an ongoing migration of the User ID Service to a new quota system"
  |
  | "An ongoing migration was in effect to update this underlying configuration system"
  |
  | It was not a new feature, not a massive hardware failure; it was migrating part of a working product for some unclear reason of "best practices".
  |
  | Both of those migrations failed with symptoms suggesting that whoever was performing them did not have a deep understanding of systems architecture or safety practices, and there was no one to stop them from failing. Signs of slow degradation of engineering culture at Google. There will be more to come. Sad.
  |
  | cpncrunch wrote:
  | > both of those migrations failed with symptoms suggesting that whoever was performing them did not have a deep understanding of systems architecture or safety practices, and there was no one to stop them from failing.
  |
  | Can any single person at Google have a full understanding of all the dependencies for even a single system?
  | I have no idea, as I've never worked there, but I would imagine that there is a lot of complexity.
  |
  | jamisteven wrote:
  | That's one hell of a post mortem.
  |
  | gregw2 wrote:
  | Anyone know if there's a similar public outage report for that late November AWS us-east-1 outage?
  |
  | ignoramous wrote:
  | That was a quota outage imposed by Linux defaults: https://news.ycombinator.com/item?id=25236057
  |
  | mehrdadn wrote:
  | I'm skimming this, but so far I still don't understand how any kind of authentication failure can or should lead to an SMTP server returning "this address doesn't exist". Does anyone see an explanation?
  |
  | Edit 1: Oh wow, I didn't even realize there were multiple incidents. Thanks!
  |
  | Edit 2 (after reading the Gmail post-mortem): AH! It was a messed-up domain name! When this event happened, I said senders need to avoid taking "invalid address" at face value when they've recently succeeded in delivering to the same addresses. But despite the RFC saying senders "should not" repeat requests (rather than "must not"), many people had a _lot_ of resistance to this idea, and instead just blamed Google for messing up implementing the RFC. People seemed to think every other server was right to treat this as a permanent error. But this post-mortem makes it crystal clear how that completely missed the point, and in fact wasn't the case at all. The part of the service concluding "address not found" _was_ working correctly! They _didn't_ mess up implementing the spec--they messed up their domain name in the configuration. Which is an operational error you just _can't_ assume will never happen--it's _exactly_ like the earlier analogy to someone opening the door to the mailman and not recognizing the name of a resident. A robust system (in this case the sender) needs to be able to recognize sudden anomalies--in this case, failure to deliver to an address that was accepting mail very recently--and retry things a bit later. You just can't assume nobody will make operational mistakes, even if you somehow assume the software is bug-free.
  |
  | klysm wrote:
  | If I had to guess, I would say that sometimes when you request something you aren't authorized to see, you get a 404 because they don't want you to be able to tell what exists or not without any creds.
  |
  | mehrdadn wrote:
  | Sending an email doesn't require "seeing" anything other than whether the server is willing to receive email for that address though, right?
  |
  | Also, that problem occurs due to lack of authorization, not due to authentication failure, right? Authentication failure is a different kind of error to the client--if the server fails to authenticate you, clearly you already know that it's not going to show you anything?
  |
  | jeffbee wrote:
  | That's a different outage. SMTP 550 was the next day.
  |
  | adrianmonk wrote:
  | Some outage information showing the SMTP 550 incident was at a different time:
  |
  | "Gmail - Service Details" (https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...) shows 12/15/20 as the date.
  |
  | That links to the "Google Cloud Issue Summary" for "Gmail - 2020-12-14 and 2020-12-15" (https://static.googleusercontent.com/media/www.google.com/en...), which mentions the 550 error code.
  |
  | ryanobjc wrote:
  | You ask the central database of accounts "does this exist?", and if it doesn't say yes, you bounce the email.
  |
  | Obviously an error condition should not result in this. But complex systems. It happens.
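mehrdadn's argument above amounts to a sender-side policy: treat a "no such user" rejection as suspect when the same address accepted mail recently, and requeue instead of bouncing permanently. Here is a minimal sketch of that policy in Python; the window length, function name, and return labels are illustrative assumptions, not anything prescribed by the SMTP RFCs or by Google's report.

    import time
    from typing import Optional

    RECENT_SUCCESS_WINDOW = 3 * 24 * 3600  # seconds; illustrative choice

    def classify_rejection(smtp_code: int,
                           last_success_ts: Optional[float],
                           now: Optional[float] = None) -> str:
        """Decide what a sender should do with an SMTP reply.

        4xx replies are always retried. A 5xx "permanent" rejection is
        downgraded to a delayed retry if we successfully delivered to the
        same address within the recent-success window, on the theory that
        a sudden flip is more likely an operational error than a vanished
        mailbox.
        """
        now = time.time() if now is None else now
        if smtp_code < 400:
            return "delivered"
        if smtp_code < 500:
            return "retry"
        recently_ok = (last_success_ts is not None
                       and now - last_success_ts < RECENT_SUCCESS_WINDOW)
        return "retry_later" if recently_ok else "bounce"

Whether this is wise is exactly what the thread disputes; as mehrdadn notes, the RFC's "should not" (rather than "must not") on repeating rejected requests is what leaves room for a heuristic like this.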
  | antoinealb wrote:
  | It is not the same incident. This was for the global outage of Google, not for the Gmail incident.
  |
  | echelon wrote:
  | This is a massive distributed system.
  |
  | Can't get quota? Certain pieces accidentally turn "TooManyRequests / 429s" or related semantics into 500s, 400s, 401s, etc. as you percolate upstream.
  |
  | Auth is one of the most central components of any system, so there would be cascading failures everywhere.
  |
  | jrockway wrote:
  | Different outage. The Gmail postmortem is linked in another thread, but the gist was that "gmail.com" is a configuration value that can be changed at runtime, and someone changed the configuration. Thus, *@gmail.com stopped being a valid address, and they returned "that mailbox is unavailable".
  |
  | If you don't want to scroll to the other thread, here's the postmortem: https://static.googleusercontent.com/media/www.google.com/en...
  |
  | [deleted]
  |
  | [deleted]
  |
  | cpncrunch wrote:
  | Indeed, it seems to be a completely unrelated issue. Two major Google outages on the same day for different reasons.
___________________________________________________________________
(page generated 2020-12-18 23:00 UTC)