[HN Gopher] Post Mortem of Google Outage on 14 December 2020
       ___________________________________________________________________
        
       Post Mortem of Google Outage on 14 December 2020
        
       Author : saifulwebid
       Score  : 141 points
       Date   : 2020-12-18 21:46 UTC (1 hour ago)
        
 (HTM) web link (status.cloud.google.com)
 (TXT) w3m dump (status.cloud.google.com)
        
       | physicsgraph wrote:
        | My favorite quote: "prevent fast implementation of global
        | changes." This is how large organizations become slower than
        | small "nimble" companies. Everyone fails, but the noticeability
        | of failure incentivizes large organizations to be more careful.
        
         | inopinatus wrote:
         | It doesn't have to be this way, but that's partly a matter of
         | culture. By aspiring to present/think/act as a monoplatform,
         | Google substantially increases the blast radius of individual
         | component failure. A global quota system mediating every other
         | service sounds both totally on brand, and also the antithesis
         | of everything I learned about public cloud scaling at AWS.
         | There we made jokes, that weren't jokes, about service teams
         | essentially DoS'ing each other, and this being the natural
         | order of things that every service must simply be resilient to.
         | 
         | Having been impressed upon by that mindset, my design reflex is
         | instead, "eliminate global dependencies", rather than trying to
         | globally rate-limit a global rate-limiter. I see this sort of
         | thing as a generalised Conway's Law.
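          | 
          | To make that reflex concrete, a minimal sketch (hypothetical
          | names, nothing from the report): consult the global quota
          | service, but fall back to a bounded local limit rather than
          | failing requests when it misbehaves.
          | 
          |     import threading
          | 
          |     class LocalCounter:
          |         # Crude in-process fallback limiter; no window
          |         # reset shown, just a hard cap.
          |         def __init__(self, limit):
          |             self.limit = limit
          |             self.count = 0
          |             self.lock = threading.Lock()
          | 
          |         def try_acquire(self):
          |             with self.lock:
          |                 self.count += 1
          |                 return self.count <= self.limit
          | 
          |     FALLBACK = LocalCounter(limit=1000)
          | 
          |     def allowed(user_id, quota_client):
          |         # Consult the global quota service, but never let
          |         # it become a hard dependency of the request path.
          |         try:
          |             # hypothetical client API, short timeout
          |             return quota_client.check(user_id, timeout=0.05)
          |         except Exception:
          |             # fail open, bounded by the local counter
          |             return FALLBACK.try_acquire()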
        
         | Xorlev wrote:
         | Slowing down actuation of prod changes to be over hours vs.
         | seconds is a far cry from the large org / small org problem.
         | Ultimately, when the world depends on you, limiting the blast
         | radius to X% of the world vs. 100% of it is a significant
         | improvement.
        
       | blaisio wrote:
       | Hmm one thing that jumped out at me was the organizational
       | mistake of having a very long automated "grace period". This is
       | actually bad system architecture. Whenever you have a timeout for
       | something that involves a major config change like this, the
       | timeout must be short (like less than a week). Otherwise, it is
       | very likely people will forget about it, and it will take a while
       | for people to recognize and fix the problem. The alternative is
       | to just use a calendar and have someone manually flip the switch
        | when they see the reminder pop up. Over-reliance on automated
       | timeouts like this is indicative of a badly designed software
       | ownership structure.
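        | 
        | A minimal sketch of "don't let the deadline pass silently"
        | (hypothetical dates and names, not how Google's quota system
        | actually works): make the enforcement deadline explicit and page
        | the owning team well before it hits.
        | 
        |     from datetime import datetime, timedelta, timezone
        | 
        |     GRACE_PERIOD_ENDS = datetime(2020, 12, 14, tzinfo=timezone.utc)
        |     ALERT_LEAD_TIME = timedelta(days=7)
        | 
        |     def grace_period_status(now=None):
        |         # Page the owning team before enforcement starts,
        |         # not after the outage.
        |         now = now or datetime.now(timezone.utc)
        |         remaining = GRACE_PERIOD_ENDS - now
        |         if remaining <= timedelta(0):
        |             return "ENFORCING"
        |         if remaining <= ALERT_LEAD_TIME:
        |             return f"ALERT: enforcement in {remaining.days} days"
        |         return "OK"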
        
       | x87678r wrote:
        | Does anyone know if the SREs in Europe fixed the problem, or if
        | it relied on people in Mountain View? When an outage like this
        | hits, do devs get involved?
       | 
        | I've worked as an SRE and it sucks to be fixing developers' bugs
       | in the middle of the night.
        
       | [deleted]
        
       | sabujp wrote:
       | s/post mortem/incident analysis/
        
         | gojomo wrote:
         | In this domain, the two terms are synonyms. And neither "post-
         | mortem" nor "incident report" appear on the log page, making
         | either an equally fair synthesized title.
        
         | b0afc375b5 wrote:
         | What is the difference between the two? I tried searching for
         | 'post mortem vs incident analysis' but couldn't find anything.
        
           | gojomo wrote:
           | Some org might have decreed internal fine distinctions, but
           | in common parlance, both terms are used for the same kind of
           | after-event writeups.
        
           | ph0rque wrote:
           | Well, post mortem means "after death" in Latin. So it would
           | seem the difference is, one can recover from an incident...
        
           | papito wrote:
           | Post-mortem is just a gruesome term.
        
       | polote wrote:
        | Off-topic: do we know what happened a day after that, when Gmail
        | returned "this email doesn't exist"?
        
         | rubyron wrote:
         | I'd say that's solidly on-topic, and was more damaging to my
          | business than the previous outage.
        
         | azornathogron wrote:
         | Here is the incident report for the Gmail problem:
         | 
         | https://static.googleusercontent.com/media/www.google.com/en...
         | 
         | It is linked from the Google Workspace status page here:
         | 
         | https://www.google.com/appsstatus#hl=en-GB&v=issue&sid=1&iid...
        
           | polote wrote:
           | thanks, so a change of an env variable put the most used
            | email system in the world down for 6 hours, not sure I can
           | believe that
        
             | xyzelement wrote:
             | >> so a change of an env variable put the most used email
              | system in the world down for 6 hours, not sure I can
             | believe that
             | 
             | I can totally believe it. In my experience, the bigger the
             | outage the stupider-seeming the cause.
        
               | jeffbee wrote:
               | Google has had a bunch of notorious outages caused by
               | similar things, including pushing a completely blank
               | front-end load balancer config to global production. The
               | post mortem action items for these are always really deep
                | thoughts about the safety of config changes, but in my
               | experience there, nobody ever really fixes them because
               | the problem is really hard.
               | 
               | For this kind of change I would probably have wanted some
               | kind of shadow system that loaded the new config,
               | received production inputs, produced responses that were
               | monitored but discarded, and had no other observable side
               | effects. That's such a pain in the ass that most teams
               | aren't going to bother setting that up, even when the
               | risks are obvious.
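                | 
                | A bare-bones sketch of that shadow pattern
                | (hypothetical names): evaluate the new config
                | against live inputs, log any disagreement, and
                | always return the old path's result.
                | 
                |     import logging
                | 
                |     log = logging.getLogger("config_shadow")
                | 
                |     def handle(request, old_cfg, new_cfg):
                |         primary = old_cfg.evaluate(request)
                |         try:
                |             # shadow result is never returned
                |             shadow = new_cfg.evaluate(request)
                |             if shadow != primary:
                |                 log.warning("shadow mismatch: %r vs %r",
                |                             shadow, primary)
                |         except Exception:
                |             log.exception("shadow config raised")
                |         return primary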
        
           | jeffbee wrote:
           | Damn, that's awful. Should lead to a deep questioning of
           | people who claim to be migrating your junk to "best
           | practices" if the old config system worked fine for 15 years
            | and the new one caused a massive data-loss outage on Day 1.
        
           | gojomo wrote:
           | That Google is choosing to use a PDF (!) as the official
            | incident-reporting medium is as confidence-destroying as was
           | the outage.
        
           | pdkl95 wrote:
           | > A configuration change during this migration shifted the
           | formatting behavior of a service option so that it
           | incorrectly provided an invalid domain name, instead of the
           | intended "gmail.com" domain name, to the Google SMTP inbound
           | service.
           | 
            | Wow... how was this even possible? Did they do _any testing
            | whatsoever_ before migrating the live production system?
            | Misformatting the domain name should have broken even basic
            | functionality tests.
           | 
            | I wonder if they _didn't_ actually test the literal
            | "gmail.com" configuration, due to dev/testing environments
            | using a different domain name? I had that problem on my first
            | Ruby on Rails project, due to subtle differences between the
            | development/test/production settings in config/environments/.
            | Running "rake test" is not a substitute for an actual test of
            | the real production system.
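            | 
            | A tiny sketch of the kind of check that would have caught it
            | (hypothetical config shape, not Google's actual system):
            | assert on the literal production value, not on whatever
            | domain the test environment happens to use.
            | 
            |     import unittest
            | 
            |     def inbound_domain(config):
            |         # stand-in for however the service option is
            |         # formatted into a domain name
            |         return config["inbound_domain"].strip().lower()
            | 
            |     class ProdConfigTest(unittest.TestCase):
            |         def test_gmail_domain_is_literal(self):
            |             # load the real production config here
            |             prod = {"inbound_domain": "gmail.com"}
            |             self.assertEqual(inbound_domain(prod),
            |                              "gmail.com")
            | 
            |     if __name__ == "__main__":
            |         unittest.main()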
        
       | sabujp wrote:
       | > "As part of an ongoing migration of the User ID Service to a
       | new quota system, a change was made in October to register the
       | User ID Service with the new quota system, but parts of the
       | previous quota system were left in place which incorrectly
       | reported the usage for the User ID Service as 0. "
       | 
       | It's quite hilarious how we have to hide internals by saying
        | stuff like "new quota system"... oh well, carry on, no further
       | comment.
        
       | AdrianB1 wrote:
       | I am happy to see Google makes mistakes too, even if theirs are
       | in areas a lot more complex than what I see at my job :)
       | 
        | Now on a serious note, the increasing complexity of systems and
        | architectures makes them more challenging to manage and makes
        | failures a lot harder to prevent.
        
       | ak217 wrote:
       | I've worked on a number of systems where code was pushed to a
       | staging environment (a persistent functional replica of
       | production where integration tests happen) and sat there for a
       | week before being allowed in production. A staging setup might
       | have prevented this scenario, since the quota enforcement grace
       | period would expire in staging a week before prod and give the
       | team a week to notice and push an emergency fix.
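        | 
        | Roughly the gate that enforces such a policy (a sketch with
        | hypothetical build metadata, not any particular CI system): block
        | promotion until the same build has baked in staging long enough.
        | 
        |     from datetime import datetime, timedelta, timezone
        | 
        |     MIN_BAKE_TIME = timedelta(days=7)
        | 
        |     def may_promote_to_prod(staged_at, now=None):
        |         # staged_at: when this exact build reached staging
        |         now = now or datetime.now(timezone.utc)
        |         return now - staged_at >= MIN_BAKE_TIME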
        
         | papito wrote:
          | Staging is never hammered with production traffic, so it often
          | fails to expose problems. It's essentially a sanity check for
          | developers and QA.
         | 
         | On the scale of Google, you test in production, but with
         | careful staged canary deployments. Even a 1% rollout is more
         | than most of us have ever dealt with.
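          | 
          | A minimal illustration of such a staged gate (made-up numbers,
          | not Google's rollout tooling): widen the audience only after
          | the current stage has been healthy for long enough.
          | 
          |     # fraction of traffic at each stage
          |     ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]
          |     MIN_SOAK_HOURS = 24
          | 
          |     def next_fraction(current, healthy_hours):
          |         # advance one stage only after a clean soak
          |         if healthy_hours < MIN_SOAK_HOURS:
          |             return current
          |         later = [f for f in ROLLOUT_STAGES if f > current]
          |         return later[0] if later else current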
        
         | [deleted]
        
       | tpmx wrote:
       | Thought: There should be a status dashboard for "free" (paid for
       | via personal ad targeting metadata) services like Search, Gmail,
       | Drive, etc. With postmortems, too.
        
         | Splendor wrote:
         | There are several. Here's one example:
         | https://downdetector.com/
        
         | [deleted]
        
         | whermans wrote:
         | Existing dashboards showed the services as being up, since the
         | frontends worked fine - as long as you did not try to
         | authenticate.
        
       | jpxw wrote:
       | TLDR: the team forgot to update the resource quota requirements
       | for a critical component of Google's authentication system while
       | transitioning between quota systems.
        
         | jrockway wrote:
          | That's not the TL;DR, is it? It seems that the quota system
         | detected current usage as "0", and thus adjusted the quota
         | downwards to 0 until the Paxos leader couldn't write, which
         | caused all of the data to become stale, which caused downstream
         | systems to fail because they reject outdated data.
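          | 
          | If that reading is right, the missing safeguard is a sanity
          | check on the adjustment itself; a sketch (hypothetical numbers,
          | not the actual system): never ratchet a limit down past recent
          | observed demand, and treat a sudden report of zero usage as
          | suspect.
          | 
          |     def safe_new_limit(reported_usage, recent_peak,
          |                        current_limit):
          |         # Guard against implausible downward adjustments.
          |         if reported_usage == 0 and recent_peak > 0:
          |             # A busy service suddenly reporting zero usage
          |             # looks like a broken reporting pipeline, not a
          |             # real drop in demand: keep the old limit.
          |             return current_limit
          |         # keep headroom above recent demand
          |         floor = int(recent_peak * 1.5)
          |         return max(reported_usage, floor)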
        
         | basicneo wrote:
         | Is that an accurate tl;dr?
         | 
         | "As part of an ongoing migration of the User ID Service to a
         | new quota system, a change was made in October to register the
         | User ID Service with the new quota system, but parts of the
         | previous quota system were left in place which incorrectly
         | reported the usage for the User ID Service as 0."
        
       | [deleted]
        
       | tpolm wrote:
        | it is interesting that in both cases (the recent Gmail incident
        | and this one) it was a "migration":
       | 
       | "As part of an ongoing migration of the User ID Service to a new
       | quota system"
       | 
       | "An ongoing migration was in effect to update this underlying
       | configuration system"
       | 
        | it was not a new feature, not a massive hardware failure; it was
        | a migration of part of a working product, for some unclear reason
        | of "best practices".
       | 
        | both of those migrations failed with symptoms suggesting that
        | whoever was performing them did not have a deep understanding of
        | systems architecture or safety practices, and there was no one to
        | stop them from failing. Signs of slow degradation of engineering
        | culture at Google. There will be more to come. Sad.
        
         | cpncrunch wrote:
          | >both of those migrations failed with symptoms suggesting that
          | whoever was performing them did not have a deep understanding
          | of systems architecture or safety practices, and there was no
          | one to stop them from failing.
         | 
         | Can any single person at Google have a full understanding of
         | all the dependencies for even a single system? I have no idea,
         | as I've never worked there, but I would imagine that there is a
         | lot of complexity.
        
       | jamisteven wrote:
        | that's one hell of a post mortem.
        
       | gregw2 wrote:
       | Anyone know if there's a similar public outage report for that
       | late November AWS us-east-1 outage?
        
         | ignoramous wrote:
          | That was a quota outage imposed by Linux defaults:
         | https://news.ycombinator.com/item?id=25236057
        
       | mehrdadn wrote:
       | I'm skimming this but so far I still don't understand how any
       | kind of authentication failure can or should lead to an SMTP
       | server returning "this address doesn't exist". Does anyone see an
       | explanation?
       | 
       | Edit 1: Oh wow I didn't even realize there were multiple
       | incidents. Thanks!
       | 
        | Edit 2 (after reading the Gmail post-mortem): AH! It was a messed
        | up domain name! When this event happened, I said senders need to
        | avoid taking "invalid address" at face value when they've recently
        | succeeded in delivering to the same addresses. But despite the RFC
        | saying senders "should not" repeat requests (rather than "must
        | not"), many people had a _lot_ of resistance to this idea, and
        | instead just blamed Google for messing up implementing the RFC.
        | People seemed to think every other server was right to treat this
        | as a permanent error. But this post-mortem makes it crystal clear
        | how that completely missed the point, and in fact wasn't the case
        | at all. The part of the service concluding "address not found"
        | _was_ working correctly! They _didn't_ mess up implementing the
        | spec--they messed up their domain name in the configuration. That
        | is an operational error you just _can't_ assume will never
        | happen--it's _exactly_ like the earlier analogy to someone opening
        | the door to the mailman and not recognizing the name of a
        | resident. A robust system (in this case the sender) needs to be
        | able to recognize sudden anomalies--in this case, failure to
        | deliver to an address that was accepting mail very recently--and
        | retry things a bit later. You just can't assume nobody will make
        | operational mistakes, even if you somehow assume the software is
        | bug-free.
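        | 
        | The retry heuristic argued for here is easy to sketch
        | (sender-side pseudologic of mine, not anything from the RFC or
        | the report): downgrade a 550 "no such mailbox" to a retryable
        | failure when the same address accepted mail recently.
        | 
        |     from datetime import datetime, timedelta, timezone
        | 
        |     RECENT_SUCCESS_WINDOW = timedelta(days=7)
        | 
        |     def classify_bounce(code, last_success, now=None):
        |         # returns "retry" or "permanent" for a rejection
        |         now = now or datetime.now(timezone.utc)
        |         if (code == 550 and last_success
        |                 and now - last_success < RECENT_SUCCESS_WINDOW):
        |             # the address worked days ago; treat this as a
        |             # possible receiver-side misconfiguration and
        |             # retry with backoff instead of bouncing
        |             return "retry"
        |         return "permanent" if 500 <= code < 600 else "retry"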
        
         | klysm wrote:
         | If I had to guess I would say sometimes when you request
         | something you aren't authorized to see you get a 404 because
         | they don't want you to be able to tell what exists or not
         | without any creds.
        
           | mehrdadn wrote:
           | Sending an email doesn't require "seeing" anything other than
           | whether the server is willing to receive email for that
           | address though, right?
           | 
           | Also, that problem occurs due to lack of authorization, not
           | due to authentication failure, right? Authentication failure
           | is a different kind of error to the client--if the server
           | fails to authenticate you, clearly you already know that it's
           | not going to show you anything?
        
         | jeffbee wrote:
         | That's a different outage. SMTP 550 was the next day.
        
           | adrianmonk wrote:
           | Some outage information showing the SMTP 550 incident was at
           | a different time:
           | 
           | "Gmail - Service Details" ( https://www.google.com/appsstatus
           | #hl=en&v=issue&sid=1&iid=a8...) shows 12/15/20 as the date.
           | 
           | That links to "Google Cloud Issue Summary" "Gmail -
           | 2020-12-14 and 2020-12-15" (https://static.googleusercontent.
           | com/media/www.google.com/en...) which mentions the 550 error
           | code.
        
         | ryanobjc wrote:
         | You ask the central database of accounts "does this exist?" And
         | if it doesn't say yes you bounce the email.
         | 
         | Obviously an error condition should not result in this. But
         | complex systems. It happens.
        
         | antoinealb wrote:
         | It is not the same incident. This was for the global outage of
          | Google, not for the Gmail incident.
        
         | echelon wrote:
         | This is a massive distributed system.
         | 
         | Can't get quota? Certain pieces accidentally turn
         | "TooManyRequests / 429s" or related semantics into 500s, 400s,
          | 401s, etc. as the errors percolate upstream.
         | 
         | Auth is one of the most central components of any system, so
         | there would be cascading failures everywhere.
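          | 
          | A tiny sketch of how that semantic loss happens and how to
          | avoid it (hypothetical error type, not Google's internals):
          | keep the "retryable" bit when translating an upstream quota
          | error instead of collapsing everything into a generic 500.
          | 
          |     class UpstreamError(Exception):
          |         def __init__(self, status, retryable):
          |             super().__init__(f"upstream status {status}")
          |             self.status = status
          |             self.retryable = retryable
          | 
          |     def translate(err):
          |         # preserve retry semantics across the hop
          |         if err.status == 429 or err.retryable:
          |             return 503  # still says "try again later"
          |         return 500      # genuine internal error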
        
         | jrockway wrote:
         | Different outage. The Gmail postmortem is linked in another
         | thread, but the gist was that "gmail.com" is a configuration
         | value that can be changed at runtime, and someone changed the
         | configuration. Thus, *@gmail.com stopped being a valid address,
         | and they returned "that mailbox is unavailable".
         | 
         | If you don't want to scroll to the other thread, here's the
         | postmortem:
         | https://static.googleusercontent.com/media/www.google.com/en...
        
           | [deleted]
        
           | [deleted]
        
           | cpncrunch wrote:
           | Indeed, it seems to be a completely unrelated issue. Two
           | major Google outages on the same day for different reasons.
        
       ___________________________________________________________________
       (page generated 2020-12-18 23:00 UTC)