[HN Gopher] Understanding how Facebook disappeared from the inte...
___________________________________________________________________

Understanding how Facebook disappeared from the internet

Author : jgrahamc
Score  : 299 points
Date   : 2021-10-04 21:11 UTC (1 hours ago)

(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)

| cwilkes wrote:
| Updating BGP configs should go through a flowchart like this:
|
| Do you want to update BGP?
|
| No: exit
|
| Yes: type this random 100-character phrase to continue, no
| copy-paste
| toomuchtodo wrote:
| "Are you sure? Are the people you need to recover from this
| change already in the building?"
| _joel wrote:
| Maybe a timed rollback with the previous state stored on the
| device that needs to be rolled back, although if you're doing
| this at Facebook scale I'm sure that's a little more
| difficult than it sounds, perhaps.
| nabaraz wrote:
| Is there a way to secure BGP? Some kind of BGP table backup and
| restore?
| Wolfspirit wrote:
| I think that it is possible to restore the previous state, but
| the question is whether it makes sense. When do you decide that
| it was a failure? Facebook explicitly (even if automatically)
| told all others that they shouldn't use those routes anymore.
|
| When it comes to Facebook's side, I guess they do have backups
| of their BGP config. Applying them (probably remotely), however,
| seems to be harder than expected when the whole infrastructure
| is down.
| mnordhoff wrote:
| "Because of this Cloudflare's 1.1.1.1 DNS resolver could no
| longer respond to queries asking for the IP address of
| facebook.com or instagram.com."
|
| The instagram.com zone itself uses a third-party DNS service and
| didn't go down. (But e.g. www.instagram.com is a CNAME to a zone
| on FB DNS.)
| yftsui wrote:
| That's pretty much why, during the downtime, visiting
| instagram.com showed a 503 from AWS instead.
| mschuster91 wrote:
| Wonder why they're still using AWS given that FB operates its
| own data centers...
| mnordhoff wrote:
| No idea. I'd speculate that it's for some kind of historical
| reasons from before FB acquired IG.
| [deleted]
| thezapxs wrote:
| The thing is, because of the deletion of the BGP routes, they
| have ruled out a cyberattack, but what happened doesn't make
| sense
| andp97 wrote:
| No one.
|
| Cloudflare writes a report on someone else's network outage!
|
| AWESOME COMPANY!!!
| phoe-krk wrote:
| Do you really think that Facebook is going to write one right
| now?
|
| It's good that we have some coverage from companies that have
| some stake in the game because they are also affected by the
| outage, even if only partially.
| powera wrote:
| Facebook has to make some public statement; the shareholders
| will demand it.
|
| I expect the detail level to be roughly "an automated system
| pushed a broken configuration"; that is to say, there
| probably won't be any interesting information at all for the
| Hacker News crowd.
|
| I doubt that this was caused by "hackers" or "hostile
| governments" or "dissident employees upset about Facebook
| privacy issues", and also doubt that Facebook would admit
| such if it were true unless they were legally required to do
| so.
| 0des wrote:
| >Facebook has to make some public statement
|
| History has shown us they can give us zero response, or an
| incorrect response, and we (via our representatives) will
| accept it and continue living life as before.
| bilbo0s wrote:
| The fact is, it's altogether likely that they could be
| legally required NOT to make such a statement outlining the
| cause if it was a hostile actor. I've felt a distinct
| change recently. The US government is not messing around
| about cyber security anymore.
|
| The guys with the blue windbreakers show up, I'd pretty
| much say "yes, sir." Of course, I don't have FB's power,
| but I don't think it matters.
| Wolfspirit wrote:
| Even if they wanted to write one they have no way to host it
| :D
| clemenspw wrote:
| Thanks, great write up!
| amachefe wrote:
| Facebook.com is up now, looks like the issue now is the billions
| of requests that are DDoSing the DNS servers
| AlbertCory wrote:
| It sorta sounds like Facebook is Too Big To Fail.
|
| Yet another reason to dismantle it.
| literallyWTF wrote:
| I would rather purge it from this planet and throw Zuck into
| jail.
| 0des wrote:
| Side note: I wonder if this comment would do better with a
| different username
| [deleted]
| robaato wrote:
| From article: [This chart shows] the availability of the DNS name
| 'facebook.com' on Cloudflare's DNS resolver 1.1.1.1. It stopped
| being available at around 15:50 UTC and returned at 21:20 UTC.
| Wolfspirit wrote:
| Seems like someone found the post-it with the admin password for
| the BGP routers now and they're back online
| ruoso wrote:
| > ... but as of 22:28 UTC Facebook appears to be ...
|
| Someone assumed London==UTC, when London is 1 hour ahead :)
| That was actually 21:28 UTC
| interestica wrote:
| No matter what time of year it is, people tend to use 'EST' for
| 'Eastern Time' even when we might be in Eastern Daylight Time
| rather than Standard.
|
| It's especially annoying when dealing with multiple countries
| that may or may not be using Daylight Saving Time.
| Wolfspirit wrote:
| Even Google isn't quite sure about summer time. Not sure
| if that is just a Google German thing...
|
| A few weeks ago I tried to find out what the current time in
| CET is. Asking Google for "CET" gave me: "23:27 CET". Asking
| Google for "CET time" (I know that "time" is twice in this
| case) gave me "00:27 CET".
|
| The last one is wrong and should be CEST, or even more correct
| would be just the same result for CET, as I asked for
| Wolfspirit wrote:
| Timezones are the most annoying thing... right after encoding
| paxys wrote:
| I personally find timezones more annoying.
| At least with encoding, once you figure things out it will work
| indefinitely. Timezones can simply change from under you with
| or without notice.
| doublerabbit wrote:
| > right after encoding
|
| No joke. Today I ended up writing a whole essay explaining
| the issue I was having and sending it off to the core
| developers because I thought I had discovered an issue with
| the actual language. The bug was because I had forgotten to
| convert to and from utf-8 in these two procedures:
|
|     # Converts string data to hex (base16)
|     proc 2Hex { input } {
|         binary encode hex [encoding convertto utf-8 "$input"]
|     }
|
|     # Converts hex (base16) data back to a string
|     proc 2Base { input } {
|         encoding convertfrom utf-8 [binary decode hex "$input"]
|     }
|
| On the plus side, I now have written documentation of the
| internals of my program.
| VBprogrammer wrote:
| Oh TCL. I didn't miss you.
| throwdecro wrote:
| This is a great write-up, but one thing I don't understand is why
| the effect of withdrawing the BGP prefixes was instantaneous (if
| I understand that correctly), but it's taking hours (so far) to
| re-announce the prefixes. Why would it take so long to flip the
| switch back the other way?
| adamcharnock wrote:
| I'm pretty new to BGP, but I'd imagine that cutting off access
| to an AS is fast because all it takes is for the neighbouring
| routers to update their routes. At which point any traffic that
| makes it that far is simply dropped.
|
| Whereas to make an announcement, the entire internet (or at
| least all routers between the AS and the user) needs to pick up
| the new announcement.
|
| (Note: I still need to read the article)
| bsedlm wrote:
| (I'm trying to better understand this)
|
| I think it's not so simple because authoritative DNS systems
| are involved.
|
| So it's not just a BGP error. It's a BGP error which
| disconnected authoritative DNS for all of Facebook. I'm not
| quite sure why that makes it so slow to fix.
| Is it just because of internal difficulties due to having no
| DNS at all?
| dec0dedab0de wrote:
| I haven't been following closely, but I think once they moved
| the prefixes they could no longer access the routers. Coupled
| with barebones staff at the data center due to the pandemic,
| and all internal communication being disrupted. Though I really
| expected it to be up within an hour or two.
| dr_orpheus wrote:
| Yeah, I think that is true. If you look at the Update near
| the end of the Cloudflare article there is a huge spike in
| the BGP activity (I assume re-announcing all of the routes).
| So that part of it was relatively instantaneous after they
| got all of their ducks in a row: actually getting to the
| routers and locating the BGP config from some earlier version,
| before it went offline this morning, that they could use.
| rifkiamil wrote:
| We have had out-of-band management ports & network designs
| for decades! I know the feeling of driving 8 hours because I
| lost connection to the device I was configuring.
| https://en.wikipedia.org/wiki/Out-of-band_management
| Wolfspirit wrote:
| I'm not sure if that is true (and I hope it is not, because that
| would be fatal) but I read somewhere that Facebook being
| down also means all internal infrastructure of Facebook isn't
| available at the moment (chats, communication), including remote
| control tools for the BGP routers. Therefore they require people
| to get physical access to the routers while many people are
| working from home because of the pandemic.
| kube-system wrote:
| Given my experience with DNS issues, I am guessing that they
| are running into dependencies along the way that assume/require
| DNS be available to function.
| bink wrote:
| With routing it's even worse than that. If they had no
| out-of-band method to connect to these routers and they botched
| the routing config then they had no way to route any traffic
| to them at all. At least with DNS you can still connect to
| the IPs.
|
| I would find it a bit surprising if Facebook didn't have OOB
| access to their data centers, however.
| withinboredom wrote:
| Assuming you don't need DNS to get authorization to enter
| the OOB access...
| theshadowknows wrote:
| Yeah, as part of my job I often have to work with our DNS team
| to provision, say, a subdomain or get some domain verified.
| They've got like...three people...trying to service thousands
| of teams across the enterprise. I do not envy their job at
| all.
| PeterCorless wrote:
| If an authoritative DNS entry was removed, it can take up to 72
| hours for that change to be propagated around the world, though
| usually just a few hours for some other authoritative DNS
| systems to get you mostly back:
|
| https://ns1.com/resources/dns-propagation#:~:text=DNS%20prop....
| tester756 wrote:
| Why does it take this long?
| withinboredom wrote:
| Caching
| Hikikomori wrote:
| Restoring is just as simple as flipping the switch again, but
| access to that switch is another matter when your internal
| network is also down and you cannot even get access to your
| office or datacenters.
___________________________________________________________________
(page generated 2021-10-04 23:00 UTC)