[HN Gopher] Understanding how Facebook disappeared from the inte...
       ___________________________________________________________________
        
       Understanding how Facebook disappeared from the internet
        
       Author : jgrahamc
       Score  : 299 points
       Date   : 2021-10-04 21:11 UTC (1 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | cwilkes wrote:
       | Updating BGP configs should go through a flowchart like this:
       | 
       | Do you want to update BGP?
       | 
       | No: exit
       | 
       | Yes: type this random 100-character phrase to continue, no
       | copy-paste
        
         | toomuchtodo wrote:
         | "Are you sure? Are the people you need to recover from this
         | change already in the building?"
        
           | _joel wrote:
           | Maybe a timed rollback, with the previous state stored on the
           | device that needs to be rolled back, although if you're doing
           | this at Facebook scale I'm sure that's a little more
           | difficult than it sounds.
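A minimal sketch of the timed-rollback idea, assuming a toy `Device` class (hypothetical, not any vendor's API) whose config is a plain dict. It is similar in spirit to the "commit confirmed" feature some routers offer: the change reverts by itself unless someone confirms it in time.

```python
import threading


class Device:
    """Toy stand-in for a router that keeps its previous config on-box."""

    def __init__(self, config):
        self.config = config
        self._previous = None
        self._timer = None

    def commit_with_rollback(self, new_config, timeout_s):
        # Save the known-good state locally, then apply the change.
        self._previous = self.config
        self.config = new_config
        # Arm a timer that reverts unless confirm() is called in time.
        self._timer = threading.Timer(timeout_s, self._rollback)
        self._timer.daemon = True
        self._timer.start()

    def confirm(self):
        # The operator can still reach the device, so keep the new config.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        self._previous = None

    def _rollback(self):
        # Nobody confirmed: assume we locked ourselves out and revert.
        self.config = self._previous
        self._previous = None
        self._timer = None
```

The point of keeping `_previous` on the device is that the revert needs no network access at all, which is exactly the situation after a bad routing change.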
        
       | nabaraz wrote:
       | Is there a way to secure BGP? Some kind of BGP table backup and
       | restore?
        
         | Wolfspirit wrote:
         | I think it is possible to restore the previous state, but the
         | question is whether it makes sense. When do you decide that it
         | was a failure? Facebook explicitly (even if automatically) told
         | everyone else that they shouldn't use those routes anymore.
         | 
         | On Facebook's side, I guess they do have backups of their BGP
         | config. Applying them (probably remotely), however, seems to be
         | harder than expected when the whole infrastructure is down.
        
       | mnordhoff wrote:
       | "Because of this Cloudflare's 1.1.1.1 DNS resolver could no
       | longer respond to queries asking for the IP address of
       | facebook.com or instagram.com."
       | 
       | The instagram.com zone itself uses a third-party DNS service and
       | didn't go down. (But e.g. www.instagram.com is a CNAME to a zone
       | on FB DNS.)
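A rough sketch of why the apex name can keep resolving while the `www` name fails. Everything here is made up for illustration: the provider labels, the `fb-hosted.example` zone standing in for "a zone on FB DNS", and the documentation-range IPs.

```python
# Which (hypothetical) DNS provider is authoritative for each zone.
ZONE_PROVIDER = {
    "instagram.com": "third-party",
    "fb-hosted.example": "fb-dns",  # placeholder for a zone on FB DNS
}

# Records per name: ("A", ip) or ("CNAME", target). All values are made up.
RECORDS = {
    "instagram.com": ("A", "192.0.2.10"),
    "www.instagram.com": ("CNAME", "www.fb-hosted.example"),
    "www.fb-hosted.example": ("A", "192.0.2.20"),
}


def zone_of(name):
    # Longest matching zone suffix wins.
    matches = [z for z in ZONE_PROVIDER if name == z or name.endswith("." + z)]
    return max(matches, key=len)


def resolve(name, reachable, max_hops=8):
    """Follow CNAMEs; return an IP, or "SERVFAIL" if a needed provider is down."""
    for _ in range(max_hops):
        if ZONE_PROVIDER[zone_of(name)] not in reachable:
            return "SERVFAIL"
        rtype, value = RECORDS[name]
        if rtype == "A":
            return value
        name = value  # CNAME: restart resolution at the target name
    return "SERVFAIL"
```

With only the third-party provider reachable, `resolve("instagram.com", {"third-party"})` still returns an address, while `resolve("www.instagram.com", {"third-party"})` fails at the CNAME target.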
        
         | yftsui wrote:
         | That's pretty much why visiting instagram.com during the
         | downtime showed a 503 from AWS instead.
        
           | mschuster91 wrote:
           | Wonder why they're still using AWS given that FB operates its
           | own data centers...
        
             | mnordhoff wrote:
             | No idea. I'd speculate that it's for some kind of
             | historical reason from before FB acquired IG.
        
       | [deleted]
        
       | thezapxs wrote:
       | The thing is, because the BGP routes were deleted, they have
       | ruled out a cyberattack, but what happened doesn't make sense.
        
       | andp97 wrote:
       | No one.
       | 
       | Cloudflare writing a report on someone else's network outage!
       | 
       | AWESOME COMPANY!!!
        
         | phoe-krk wrote:
         | Do you really think that Facebook is going to write one right
         | now?
         | 
         | It's good that we have some coverage from companies that have
         | some stake in the game because they are also affected by the
         | outage, even if only partially.
        
           | powera wrote:
           | Facebook has to make some public statement; the shareholders
           | will demand it.
           | 
           | I expect the detail level to be roughly "an automated system
           | pushed a broken configuration"; that is to say, there
           | probably won't be any interesting information at all for the
           | Hacker News crowd.
           | 
           | I doubt that this was caused by "hackers" or "hostile
           | governments" or "dissident employees upset about Facebook
           | privacy issues", and also doubt that Facebook would admit
           | such if it were true unless they were legally required to do
           | so.
        
             | 0des wrote:
             | >Facebook has to make some public statement
             | 
             | History has shown us they can give us zero response, or an
             | incorrect response, and we (via our representatives) will
             | accept it and continue living life as before.
        
             | bilbo0s wrote:
             | The fact is, it's altogether likely that they could be
             | legally required NOT to make such a statement outlining the
             | cause if it was a hostile actor. I've felt a distinct
             | change recently. The US government is not messing around
             | about cyber security anymore.
             | 
             | If the guys with the blue windbreakers show up, I'd pretty
             | much say "yes, sir." Of course, I don't have FB's power,
             | but I don't think it matters.
        
           | Wolfspirit wrote:
           | Even if they wanted to write one they have no way to host it
           | :D
        
       | clemenspw wrote:
       | Thanks, great write up!
        
       | amachefe wrote:
       | Facebook.com is up now; it looks like the issue now is the
       | billions of requests effectively DDoSing the DNS servers
        
       | AlbertCory wrote:
       | It sorta sounds like Facebook is Too Big To Fail.
       | 
       | Yet another reason to dismantle it.
        
         | literallyWTF wrote:
         | I would rather purge it from this planet and throw Zuck into
         | jail.
        
           | 0des wrote:
           | Side note: I wonder if this comment would do better with a
           | different username
        
         | [deleted]
        
       | robaato wrote:
       | From article: [This chart shows] the availability of the DNS name
       | 'facebook.com' on Cloudflare's DNS resolver 1.1.1.1. It stopped
       | being available at around 15:50 UTC and returned at 21:20 UTC.
        
       | Wolfspirit wrote:
       | Seems like someone found the post-it with the admin password for
       | the BGP routers now and they're back online
        
       | ruoso wrote:
       | > ... but as of 22:28 UTC Facebook appears to be ...
       | 
       | Someone assumed London==UTC, when London is 1 hour ahead :) that
       | was actually 21:28 UTC
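The conversion is easy to check, assuming London was on British Summer Time (UTC+1) on 2021-10-04. The fixed offset below is that assumption written out, not a full timezone database lookup.

```python
from datetime import datetime, timedelta, timezone

# Assumption: London was on British Summer Time (UTC+1) on 2021-10-04.
BST = timezone(timedelta(hours=1), "BST")

# 22:28 on a London wall clock, converted to UTC.
local = datetime(2021, 10, 4, 22, 28, tzinfo=BST)
as_utc = local.astimezone(timezone.utc)

print(as_utc.strftime("%H:%M UTC"))  # prints "21:28 UTC"
```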
        
         | interestica wrote:
         | No matter what time of year it is, people tend to use 'EST' for
         | 'Eastern Time' even when we might be in Eastern Daylight Time
         | rather than Standard.
         | 
         | It's especially annoying when dealing with multiple countries
         | that may or may not be using Daylight Saving Time.
        
           | Wolfspirit wrote:
           | Even google isn't quite sure about the summer time. Not sure
           | if that is just a Google German thing...
           | 
           | A few weeks ago I tried to find out the current time in CET.
           | Asking Google for "CET" gave me "23:27 CET". Asking Google
           | for "CET time" (I know "time" is redundant there) gave me
           | "00:27 CET".
           | 
           | The last one is wrong: it should either be labelled CEST or,
           | more correctly, give the same result as the first query,
           | since CET is what I asked for.
         | Wolfspirit wrote:
         | Timezones are the most annoying thing... right after encoding
        
           | paxys wrote:
           | I personally find timezones more annoying. At least with
           | encoding once you figure things out it will work
           | indefinitely. Timezones can simply change from under you with
           | or without notice.
        
           | doublerabbit wrote:
           | > right after encoding
           | 
           | No joke. Today I ended up writing a whole essay explaining
           | the issue I was having and sending it off to the core
           | developers because I thought I had discovered an issue with
           | the actual language. The bug was that I had forgotten to
           | convert to & from utf-8 in these two procedures:
           | 
           |   # Converts string data to base16 (hex)
           |   proc 2Hex { input } {
           |     binary encode hex [encoding convertto utf-8 "$input"]
           |   }
           |   # Converts base16 (hex) data back to a string
           |   proc 2Base { input } {
           |     encoding convertfrom utf-8 [binary decode hex "$input"]
           |   }
           | 
           | On the plus side, I now have written documentation of the
           | internals of my program.
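For comparison, a sketch of the same round trip in Python. The function names `to_hex` and `from_hex` are made up here; the explicit UTF-8 step on each side is the part the comment says was missing.

```python
def to_hex(text: str) -> str:
    # Encode to UTF-8 bytes first, then render those bytes as hex.
    return text.encode("utf-8").hex()


def from_hex(hex_text: str) -> str:
    # Parse the hex back into bytes, then decode those bytes as UTF-8.
    return bytes.fromhex(hex_text).decode("utf-8")


print(to_hex("é"))             # prints "c3a9"
print(from_hex(to_hex("é")))   # prints "é"
```

Plain ASCII input hides this kind of bug, because its bytes look the same under most encodings; it only shows up once non-ASCII text goes through.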
        
             | VBprogrammer wrote:
             | Oh TCL. I didn't miss you.
        
       | throwdecro wrote:
       | This is a great write-up, but one thing I don't understand is why
       | the effect of withdrawing the BGP prefixes was instantaneous (if
       | I understand that correctly), but it's taking hours (so far) to
       | re-announce the prefixes. Why would it take so long to flip the
       | switch back the other way?
        
         | adamcharnock wrote:
         | I'm pretty new to BGP, but I'd imagine that cutting off access
         | to an AS is fast because all it takes is for the neighbouring
         | routers to update their routes. At that point any traffic that
         | makes it that far is simply dropped.
         | 
         | Whereas to make an announcement, the entire internet (or at
         | least all routers between the AS and the user) needs to pick
         | up the new announcement.
         | 
         | (Note: I still need to read the article)
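A toy illustration of the "traffic is simply dropped" part, assuming a single router with a longest-prefix-match table. The `RoutingTable` class, the `peer-A` label and the documentation prefix are all invented for the example; it says nothing about how fast real BGP updates propagate in either direction.

```python
import ipaddress


class RoutingTable:
    def __init__(self):
        self.routes = {}  # ip_network -> next-hop label

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        candidates = [net for net in self.routes if addr in net]
        if not candidates:
            return None  # no matching route: the packet is dropped
        # The most specific (longest) matching prefix wins.
        best = max(candidates, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-A")
print(table.lookup("192.0.2.53"))  # prints "peer-A"
table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.53"))  # prints "None"
```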
        
           | bsedlm wrote:
           | (I'm trying to better understand this)
           | 
           | I think it's not so simple because authoritative DNS systems
           | are involved.
           | 
           | So it's not just a BGP error. It's a BGP error which
           | disconnected authoritative DNS for all of Facebook. I'm not
           | quite sure why that makes it so slow to fix. Is it just
           | because of internal difficulties due to having no DNS at all?
        
         | dec0dedab0de wrote:
         | I haven't been following closely, but I think once they removed
         | the prefixes they could no longer access the routers. Couple
         | that with barebones staff at the data center due to the
         | pandemic and all internal communication being disrupted. Though
         | I really expected it to be up within an hour or two.
        
           | dr_orpheus wrote:
           | Yeah, I think that is true. If you look at the Update near
           | the end of the Cloudflare article there is a huge spike in
           | BGP activity (I assume re-announcing all of the routes). So
           | that part was relatively instantaneous once they got all of
           | their ducks in a row: actually getting to the routers and
           | locating an earlier version of the BGP config, from before
           | it went offline this morning, that they could use.
        
           | rifkiamil wrote:
           | We have had out-of-band management ports and network designs
           | for decades! I know the feeling of driving 8 hours because I
           | lost connection to the device I was configuring.
           | https://en.wikipedia.org/wiki/Out-of-band_management
        
         | Wolfspirit wrote:
         | I'm not sure if this is true (and I hope it is not, because
         | that would be fatal), but I read somewhere that Facebook being
         | down also means all of Facebook's internal infrastructure is
         | unavailable at the moment (chats, communication), including the
         | remote control tools for the BGP routers. Therefore they need
         | people to get physical access to the routers while many people
         | are working from home because of the pandemic.
        
         | kube-system wrote:
         | Given my experience with DNS issues, I am guessing that they
         | are running into dependencies along the way that assume/require
         | DNS be available to function.
        
           | bink wrote:
           | With routing it's even worse than that. If they had no out-
           | of-band method to connect to these routers and they botched
           | the routing config then they had no way to route any traffic
           | to them at all. At least with DNS you can still connect to
           | the IPs.
           | 
           | I would find it a bit surprising if Facebook didn't have OOB
           | access to their data centers, however.
        
             | withinboredom wrote:
             | Assuming you don't need DNS to get authorization to enter
             | the OOB access...
        
           | theshadowknows wrote:
           | Yeah as part of my job I often have to work with our DNS team
           | to provision say a subdomain or get some domain verified.
           | They've got like...three people...trying to service thousands
           | of teams across the enterprise. I do not envy their job at
           | all.
        
         | PeterCorless wrote:
         | If an authoritative DNS entry was removed, it can take up to 72
         | hours for that change to be propagated around the world, though
         | usually just a few hours for some other authoritative DNS
         | systems to get you mostly back:
         | 
         | https://ns1.com/resources/dns-propagation#:~:text=DNS%20prop....
        
           | tester756 wrote:
           | Why does it take this long?
        
             | withinboredom wrote:
             | Caching
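To make "caching" concrete, here is a minimal sketch of a resolver that honours a TTL. The `CachingResolver` class, the 300-second TTL and the IPs are assumptions for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
class CachingResolver:
    def __init__(self, upstream):
        self.upstream = upstream  # callable: name -> (value, ttl_seconds)
        self.cache = {}           # name -> (value, expires_at)

    def resolve(self, name, now):
        hit = self.cache.get(name)
        if hit is not None and now < hit[1]:
            return hit[0]  # still fresh: upstream is not consulted
        value, ttl = self.upstream(name)
        self.cache[name] = (value, now + ttl)
        return value


authoritative = {"example.com": "192.0.2.1"}
resolver = CachingResolver(lambda name: (authoritative[name], 300))

print(resolver.resolve("example.com", now=0))    # prints "192.0.2.1"
authoritative["example.com"] = "192.0.2.2"       # the record changes upstream
print(resolver.resolve("example.com", now=100))  # prints "192.0.2.1" (cached)
print(resolver.resolve("example.com", now=300))  # prints "192.0.2.2"
```

Each resolver along the way keeps its own copy until its TTL runs out, so a change (or a removal) shows up at different times for different users.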
        
         | Hikikomori wrote:
         | Restoring is just as simple as flipping the switch again, but
         | access to that switch is another matter when your internal
         | network is also down and you cannot even get access to your
         | office or datacenters.
        
       ___________________________________________________________________
       (page generated 2021-10-04 23:00 UTC)