[HN Gopher] Every device with FB app is now DDoSing recursive DN...
___________________________________________________________________
Every device with FB app is now DDoSing recursive DNS resolvers

Author : doener
Score  : 285 points
Date   : 2021-10-04 19:26 UTC (3 hours ago)

(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)

| lrem wrote:
| Isn't this "every device with an app using the FB SDK"?
| Fordec wrote:
| Is it just me, or does the FB SDK absolutely damage any software
| that has it installed as a dependency every time there's an issue
| on Facebook's server side? The thought put into failover states
| under the hood is very lacking. It speaks to an ethos that if
| developers aren't a working data pipeline to Facebook services,
| they and their own products can go pound sand.
| agilob wrote:
| Every app with the Facebook SDK used for whatever reason, like
| login or metrics or ads...
| littlecranky67 wrote:
| Oh great, and at some point when key ISP DNS servers crack under
| the load, more and more websites will appear to "go down" from a
| user's perspective - suddenly gmail.com and outlook.com not
| working. More and more people reload websites, restart devices,
| etc. and increase the load even further. People fall back to using
| SMS/telephone, but since it is not used to the heavy load of 2021,
| soon phone calls fail. With FB, WA, email and phone "down",
| engineers can't be reached to fix this. And if they can, they fail
| to call the Uber to get them somewhere. And even if they could,
| the streets are congested with people who cannot communicate
| remotely and so try to get somewhere to convey messages in person.
|
| Hope they will just go back to fix FB and this is just in my head
| :)
| littlecranky67 wrote:
| I wrote this as a pure joke, but now that I have learned that
| SERVFAIL is not cached by browsers, clients, intermediate DNS
| servers [0], etc., I am curious what will be going on.
| It is not only FB apps, it is basically every website request
| (that uses FB JS for ads, tracking, etc.) that triggers a DNS
| request, which will be forwarded 1:1 from the ISP's DNS to the
| null-routed FB subnet. This should put orders of magnitude more
| load on resolving DNS servers than usual.
|
| [0]: https://serverfault.com/questions/479367/how-long-a-dns-time...
| bowaggoner wrote:
| I enjoyed the comment. You might like my short story along the
| same lines from a few years ago:
| https://bowaggoner.com/writeups/robust.html
| tarsius wrote:
| For those who understand Swiss German: Mani Matter - I han es
| Zundholzli azundt https://www.youtube.com/watch?v=PkGatIgXERI
| Tainnor wrote:
| I lit a match / And it made a fire. / And for my cigarette / I
| wanted to take the fire from the match. / But the match slipped
| from my hand and landed on the rug. / And it almost made a hole in
| the rug.
|
| Well, you know what can happen / If you're not careful with fire.
| / And for the light on a cigarette / A rug feels rather too
| expensive. / And from the rug the fire, alas, / Might have spread
| to the whole house / And who knows what would have happened
| thereafter?
|
| There would have been a fire in the district / And the fire
| fighters would have had to come. / They would have honked in the
| streets / And unloaded their pipes. / And they would have sprayed
| fire / And it would have been in vain / And the whole city would
| have burned with nothing to protect it.
|
| And the people would have jumped around / Fearing for their
| possessions. / They would have thought somebody had started a
| fire. / They would have grabbed assault rifles. / Everyone would
| have shouted: "Whose fault is it?" / The whole country would have
| rioted. / And they would have shot at the ministers behind the
| lecterns.
|
| The UN would have become involved / And the UN enemies as well. /
| To preserve peace in Switzerland / Both would have come with
| tanks.
| / It would have spread, little by little, / To Europe, to Africa.
| / There would have been a World War and humans would be no more.
|
| I lit a match / And it made a fire. / And for my cigarette / I
| wanted to take the fire from the match. / But the match slipped
| from my hand and landed on the rug. / Thankfully, I picked it back
| up.
| asdff wrote:
| If it's any consolation, I find SMS and telephone to be remarkably
| robust in these situations. In college during football games, the
| campus population would swell to probably 200k people within a few
| square miles, each with a cell phone in a pocket. 3G and LTE would
| be worthless. Campus area wifi would be worthless. The only thing
| that would work is shutting all that off on your phone and
| resorting to SMS and calling people over EDGE, but it worked
| flawlessly even with all the people stressing out a handful of
| towers at once.
| littlecranky67 wrote:
| Football games are planned events, and the carriers plan capacity
| accordingly. I remember SMS and cell phone service going down at
| several music festivals with 20,000+ participants (especially when
| they ended at around midnight). Only some carriers supported those
| "spontaneous" gatherings in the middle of nowhere with mobile cell
| towers that would keep connectivity going - but low-cost carriers
| never did.
| mschuster91 wrote:
| German carrier O2 was/is notorious for offering somewhat
| decent-ish service in urban areas under normal conditions - but
| major events that happen to have lots of people moving around the
| city, like fans congregating for a soccer game via public
| transport, political rallies or your average drunkard festival
| (=Oktoberfest)? Instant collapse...
| IgorPartola wrote:
| Isn't there an easy fix to just add a bullshit record for FB to
| DNS until this blows over?
|
| I am also thinking that all the poorly coded sites that don't work
| unless the Share on Facebook button loads are also going to
| hemorrhage money.
| So are all the e-commerce sites that rely on Login with FB.
|
| I hope this results in everyone rethinking adding all that shit to
| their infrastructure before the next time.
| noahtallen wrote:
| My understanding is that the Facebook network itself was
| unreachable from the internet because of BGP. So even if an IP was
| resolved from DNS, that IP wouldn't get routed to Facebook,
| because it withdrew its routes from the peers where it connects to
| ISPs via BGP.
| AnimalMuppet wrote:
| Not if the IP was 127.0.0.1...
| jbotz wrote:
| The problem isn't (wasn't) facebook.com's A records, it's that the
| authoritative nameservers for facebook.com are (were) unreachable.
| In theory someone could change the NS records for Facebook on the
| .com nameservers to point somewhere else and serve up a fake
| facebook.com domain, but... 1) those NS records have a 6 hour TTL,
| so it would take a while to be effective, and 2) who has the
| authority to do that?
| unanswered wrote:
| ISPs keep floating the idea of charging Netflix for their own
| customers' data traffic; by the same logic DNS operators should be
| charging Facebook.
| walrus01 wrote:
| If you're an ISP that is anything bigger than a mom-and-pop
| operation, you should have at least 3 or 4 geographically
| distributed anycast recursive resolvers.
|
| Recursive DNS is pretty easy to do for really large volumes on a
| $600 1U server. It's not like the days of 15 years ago...
| fmajid wrote:
| Ah, but what about all the marketing analytics ISPs deploy on
| their DNS servers so that if your browser ever looked up
| viagra.com, forever will it grace your web browsing retargeting ad
| units? /s
| Jumb0 wrote:
| Could anyone explain this so people with no DNS knowledge could
| understand?
| Jumb0 wrote:
| Could anyone explain this for people with no DNS knowledge?
| ricardo81 wrote:
| BGP is like the mail service.
|
| DNS is a translation from human-readable addresses to machine
| addresses.
|
| BGP determines how to find those addresses from your server to
| theirs.
| MrStonedOne wrote:
| The internet phone book that converts .coms to IP addresses only
| scales up to the load level of the internet because results are
| cached at multiple layers.
|
| Your browser asks your computer, which asks your router, which
| asks your ISP, which asks .com's DNS servers, which ask Facebook's
| DNS servers for Facebook's IP address.
|
| Each layer will cache the results, so even if 100,000 people in
| Seattle want to know facebook.com's IP address, only the 5 or so
| ISPs who provide internet in Seattle have to ask for
| facebook.com's IP address: 100,000 requests, but only 5 actual
| requests.
|
| Even the per-device and per-home cache is helpful, because 500
| page loads in 15 minutes still only result in 1 actual DNS
| request.
|
| Here's the issue:
|
| Failures aren't cached.
|
| So while 100,000 people in 1 second trying to get Facebook's IP
| only resulted in 5 requests going to the core DNS servers, it now
| results in 100,000 requests in 1 second going to the core servers.
| Rd6n6 wrote:
| DDoS is a strong way to put this. Are we talking malware that is
| sending thousands of requests per device, or a bug from a
| connectivity issue?
| Jtsummers wrote:
| It's still a DDoS, just not an _attack_. Slashdotting a site is a
| DDoS, but usually not intended as a deliberate attack. Now take
| every WhatsApp, Facebook, Instagram, and Facebook Messenger user,
| every app that uses FB for user authentication, every _site_ that
| does the same, every app and site that serves FB ads, every app
| and site that uses FB for metrics, and we have an unintentional
| DDoS just waiting to happen.
| tantalor wrote:
| The word "attack" has multiple definitions, including "act
| harmfully on", e.g., a heart attack. It does not require intent or
| aggression.
| protomyth wrote:
| Yeah, I now know every user with the FB app installed.
| It's just wild to watch the log of all the phones asking for
| facebook.com.
| sschueller wrote:
| Is there no response code for a DNS server to say "I don't have
| what you want right now, come back later but wait at least xxx
| seconds"?
|
| I guess alternatively you could return garbage (127.0.0.1) with a
| 5 min TTL or so to get clients to back off, but that is also
| problematic.
| a1369209993 wrote:
| > you could return garbage (127.0.0.1) with a 5 min ttl
|
| I use 0.0.0.0, though I'm not sure if some layer in that mess
| would interpret it creatively. Has worked on my machine for years
| at least:
|
|     $ ping facebook.com
|     connect: Invalid argument
|
| (If you do this, please set the TTL to at least a month, and
| preferably upwards of a decade.)
| asddubs wrote:
| Lots of ISPs straight up ignore TTL.
| Computeiful wrote:
| Clients should really be using something like exponential backoff,
| Ethernet-style.
| xenonite wrote:
| And even Hacker News is strangely slow to respond.
| mrep wrote:
| Lots of people are interested in the event, and a normal web
| response queries their SQL database for user data like upvotes,
| which is their limiter.
|
| Logging out, which I believe they just forced for all users, sends
| you to the cache and it'll be fast. dang@ mentioned it during the
| giant S3 outage a few years ago.
| munk-a wrote:
| Ah nice - I noticed it just got a lot more snappy to respond.
| bifrost wrote:
| Yep, with FB and IG down, people are spending time on a much more
| important site.... This site.
| reayn wrote:
| Yeah, I thought I was the only one, never really noticed this on
| HN before...
| adtac wrote:
| Really lol? HN goes down every full moon day or something like
| that.
| littlecranky67 wrote:
| I'd thank anybody who would post a tutorial/config file to set up
| a DNS server (dnsmasq?) that forcefully caches even failed
| requests for a configurable timeout, with large cache sizes.
| We might need them in case DNS servers go down under the load of
| requests from "smart devices" :)
| agilob wrote:
| > forcefully caching even failed requests to a configurable
| timeout
|
| I've been doing ~SRE for 1.5 years and I've worked or helped on 3
| outages related to negative DNS caching. Please don't use a
| negative cache if you don't know enough about DNS and can't
| monitor it.
| littlecranky67 wrote:
| I wouldn't suggest any ISP should do that (and I am not one), but
| I would probably host this for my own personal usage/home network.
| If recursive DNS servers go down under the load of "smart
| devices", having a local copy of a larger set of IPs I usually
| visit might come in handy (and none of my requests would worsen
| the issue of server overload).
| agilob wrote:
| This is 'cache', my OpenWRT router has this for thousands of
| records, but negative cache means: "remember this domain doesn't
| exist and don't retry asking other DNS providers". This is very
| dangerous.
|
| Your browser AND operating system AND router already provide DNS
| caching, it's not something the average user should even think
| about. You might want to consider it when things at your ISP go
| wrong (hello BT), or when the majority of computers request the
| same domains frequently, but then again, your router should do it
| already.
| jaywalk wrote:
| This does a good job explaining how SERVFAIL caching works:
| https://serverfault.com/questions/479367/how-long-a-dns-time...
| littlecranky67 wrote:
| From your linked SO post, the accepted answer concludes:
|
| "In summary, SERVFAIL is unlikely to be cached, but even if
| cached, it'll be at most a double- or even a single-digit number
| of seconds."
|
| That would be fatal right now, wouldn't it? That would mean every
| major ISP's DNS server right now forwards millions of _identical_
| DNS resolve requests to the (currently null-routed) Facebook DNS
| servers.
| There must be millions, as, heck, every larger website uses FB
| tracking tools, "like buttons", etc. Are they at least smart
| enough to throttle based on a domain/IP hash? Otherwise, could the
| DNS servers of major ISPs soon be overloaded, as (constantly
| failing and thus uncached) requests to FB DNS eat up all
| bandwidth/resources?
| toast0 wrote:
| Not that fatal. I think at least some recursive servers will do
| 'collapsed forwarding', where additional requests to resolve the
| same name while the first request is in progress will wait for the
| first request to finish and send the same results to all clients
| at that point. Although, perhaps that's just wishful thinking on
| my part.
|
| Then you have port limits: usually each request goes out on a new
| port, so a recursive resolver can only have 64k requests
| outstanding to any given authoritative (or upstream) server IP for
| each IP the recursive uses. Facebook runs with 4 hostnames listed,
| so that's a limit of 256k requests outstanding, 512k if your
| recursive does IPv4 and v6 (and 1M if they're also making WhatsApp
| requests).
|
| DNS services for both domains appear to be back up, by the way.
|
| On the authoritative side, it's not too hard to manage this load.
| If you can't handle the big crush to start with, drop all
| requests, then accept all the requests from 1.0.0.0/8, and add one
| /8 at a time as CPU permits until you're allowing everything. Once
| you handle the initial crush from a resolver, it should go back to
| normal load, and there should be some distribution of load across
| the various /8s. I wouldn't expect it to be evenly distributed,
| but it should be even enough.
|
| Disclosure: I worked at WhatsApp, but left August 2019. I don't
| know anything about this outage other than idle speculation. I
| don't know if FB has a procedure to slow start DNS, but the theory
| is simple; the practice is complicated by the DNS IPs being used
| in anycast.
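
The slow-start idea in the comment above (drop everything, accept 1.0.0.0/8, then add one /8 at a time as CPU permits) can be sketched roughly as below. This is only an illustration in Python, not anything Facebook or a real DNS server is known to run: the class name `SlowStartAdmission`, its methods, and the boolean CPU signal are all hypothetical.

```python
import ipaddress


class SlowStartAdmission:
    """Hypothetical gate: admit queries only from the source /8 blocks
    that have been enabled so far, starting from none."""

    def __init__(self) -> None:
        # Number of /8 blocks currently admitted (first octets 1..N).
        self.allowed_blocks = 0

    def widen(self, cpu_has_headroom: bool) -> None:
        """Admit one more /8, but only while the server has spare CPU."""
        if cpu_has_headroom and self.allowed_blocks < 255:
            self.allowed_blocks += 1

    def admits(self, source_ip: str) -> bool:
        """Return True if a query from source_ip should be processed."""
        first_octet = ipaddress.IPv4Address(source_ip).packed[0]
        return 1 <= first_octet <= self.allowed_blocks


gate = SlowStartAdmission()
gate.widen(cpu_has_headroom=True)       # now serving 1.0.0.0/8 only
print(gate.admits("1.2.3.4"), gate.admits("2.0.0.1"))
```

Everything outside the admitted range is simply dropped, so the resolvers behind those addresses time out and retry later, by which point their block may have been enabled.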
| littlecranky67 wrote:
| > servers will do 'collapsed forwarding', [...] perhaps that's
| just wishful thinking on my part
|
| I think it is wishful thinking, because that would basically be
| caching, which is not allowed by the RFC. In 2017 the BIND
| implementation changed to a default cache time of 1s, which would
| certainly ease the problem.
|
| > then you have port limits, usually each request goes out on a
| new port, a recursive resolver can only have 64k
|
| I'm unsure if this helps or worsens the situation, depending on
| whether the 'collapsed forwarding'/1s caching is in place. If it
| is not, ephemeral port exhaustion would kick in, at which point
| the DNS server will not be able to serve other requests.
|
| > On the authoritative side, it's not too hard to manage this load
|
| Of course not, all you need to do is just present _any_ response,
| which will be cached by downstream resolvers. No
| smartphone/end-user device will query the authoritative side as
| long as there is just any (even stale) response.
| toast0 wrote:
| > If this is not the case, ephemeral port exhaustion would kick
| in, at which point the DNS server will not be able to server other
| requests.
|
| You can use the same local IP/port to contact multiple server
| IP/ports, so filling up connections to FB IPs shouldn't prevent
| you from connecting to others (but there are plenty of ways to do
| that wrong, I guess).
|
| >> On the authoritative side, it's not too hard to manage this
| load
|
| > Of course not, all you need to do is just present any response
| which will be cached by downstream resolvers.
|
| You need to present a response before the resolver times out. One
| can certainly imagine a situation where the incoming packet
| processing results in enough delay that the responses arrive too
| late and are discarded. In the right conditions, this queuing
| delay would never clear and things just get worse.
| If it doesn't happen, great, but if it does, dropping most of the
| requests so you can handle the few you accept in a timely manner
| is a good way to get moving.
| fmajid wrote:
| It's not so much the load as the DNS servers having to maintain
| state for all those queries until they time out. That must consume
| tremendous RAM, and servers that are not event-driven could also
| be generating large numbers of threads.
| fermentation wrote:
| Excellent opportunity to set up a DNS-by-mail service. Just send
| me a letter with the names you want and I'll get back to you
| within 3 to 5 business days!
| idatum wrote:
| Never having explicitly queried fb.com before, I never noticed how
| clever they got (face:b00c) with their IPv6 address:
|
| 2a03:2880:f1ff:83:face:b00c:0:25de
| ceejayoz wrote:
| Facebook also used brute force to get facebookcorewwwi.onion on
| Tor a while back.
| https://en.wikipedia.org/wiki/Facebook_onion_address
| ipaddr wrote:
| facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion
| is the latest.
| john37386 wrote:
| At this point people hope it will just restart so that we can all
| resume a normal life. There can only be 2 options. It will be
| fixed very soon or it will be a hell of a night with a lot of
| coffee.
| Ajedi32 wrote:
| Is this specifically an issue with the Facebook app? Or is it just
| a predictable consequence of DNS responses no longer being cached
| due to query failures for a site as popular as Facebook?
| treesknees wrote:
| It is certainly not specific to Facebook, but the scale at which
| Facebook is referenced across websites and apps is pretty unique
| (I can only think of a few key players like Google who would cause
| a similar load.)
|
| And to clarify a bit, the queries aren't "no longer being cached
| due to query failures", it's because their TTL expires and the
| resulting SERVFAIL from the next query (which fails) isn't cached
| at all.
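
To make the caching point above concrete, here is a toy model in Python of a resolver cache that stores successful answers for their TTL but, by default, does not store failures. It is a sketch under stated assumptions rather than how any real resolver is implemented: `CachingResolver`, `count_upstream_queries`, the `servfail_ttl` knob, and the `healthy`/`broken` upstream functions are all made up for illustration, and `192.0.2.1` is a documentation-only address.

```python
from typing import Callable, Dict, Optional, Tuple

# An upstream returns (address, ttl_seconds) on success, None on failure.
Upstream = Callable[[str], Optional[Tuple[str, float]]]


class CachingResolver:
    """Toy cache: answers live for their TTL, failures for servfail_ttl."""

    def __init__(self, upstream: Upstream, servfail_ttl: float = 0.0) -> None:
        self.upstream = upstream
        self.servfail_ttl = servfail_ttl  # 0 means failures are not cached
        self.cache: Dict[str, Tuple[Optional[str], float]] = {}
        self.upstream_queries = 0

    def resolve(self, name: str, now: float) -> Optional[str]:
        entry = self.cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]               # still fresh, no upstream traffic
        self.upstream_queries += 1
        result = self.upstream(name)
        if result is None:                # upstream failed (think SERVFAIL)
            if self.servfail_ttl > 0:
                self.cache[name] = (None, now + self.servfail_ttl)
            else:
                self.cache.pop(name, None)
            return None
        address, ttl = result
        self.cache[name] = (address, now + ttl)
        return address


def healthy(name: str) -> Optional[Tuple[str, float]]:
    return ("192.0.2.1", 300.0)


def broken(name: str) -> Optional[Tuple[str, float]]:
    return None


def count_upstream_queries(upstream: Upstream, servfail_ttl: float,
                           clients: int) -> int:
    """Send `clients` lookups at the same instant; count upstream queries."""
    resolver = CachingResolver(upstream, servfail_ttl)
    for _ in range(clients):
        resolver.resolve("example.com", now=0.0)
    return resolver.upstream_queries


print(count_upstream_queries(healthy, 0.0, 1000))  # answer is cached
print(count_upstream_queries(broken, 0.0, 1000))   # failures are not cached
print(count_upstream_queries(broken, 1.0, 1000))   # failures cached for 1s
```

In this model, even a one-second failure cache collapses a burst of identical lookups into a single upstream query, which is the effect the thread attributes to BIND's short SERVFAIL cache.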
| [deleted]
| justahuman1 wrote:
| Is it possible for Facebook to instead rely on an anycast IP
| rather than DNS for their (non-web) phone apps?
| bifrost wrote:
| No. But FB's DNS is anycasted.
|
| FB's eggs were all in one basket, and the basket broke.
| treesknees wrote:
| Yes and no. Yes, they could technically hardcode an anycasted IP
| address, however it'd be less reliable. Also you'd run into issues
| with TLS certificates. It'd be very inflexible and would probably
| result in more outages.
|
| But even if they did hardcode an IP, the underlying infrastructure
| for Facebook was also down, not just DNS resolution of
| facebook.com. So even if the FB app didn't need to resolve a
| hostname, it would still be broken.
| [deleted]
| earth2mars wrote:
| Is it just the people trying to connect, or does the app itself
| keep polling and trying to send information from devices to
| Facebook servers continuously?
| slg wrote:
| This technology problem is a good metaphor for Facebook overall as
| a company. There is nothing fundamentally wrong with having your
| app regularly polling for DNS records when it can't find them, but
| that can be an actively harmful approach when you are the size of
| Facebook. Being that size comes with a whole swath of extra
| responsibilities to ensure that your behavior doesn't end up
| harming society as a whole.
| jmalicki wrote:
| There is something inherently wrong with that - it's why
| exponential backoff exists.
___________________________________________________________________
(page generated 2021-10-04 23:00 UTC)
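
The exponential backoff mentioned in the thread can be sketched as follows. This is a minimal Python illustration that assumes the common "full jitter" variant (each wait is drawn at random below a ceiling that doubles per attempt); the function name `backoff_delays` and its parameters are hypothetical, and nothing here reflects what the FB SDK actually does.

```python
import random
from typing import List, Optional


def backoff_delays(base: float = 1.0, cap: float = 300.0, attempts: int = 8,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the waits (in seconds) before each retry.

    The ceiling doubles on every attempt up to `cap`; the actual wait is
    drawn uniformly below the ceiling so that many clients failing at the
    same moment do not all retry at the same moment.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


for attempt, delay in enumerate(backoff_delays(), start=1):
    print(f"retry {attempt}: wait {delay:.1f}s")
```

A client that retries on a fixed short interval sends the same number of queries every minute for as long as the outage lasts; with a doubling ceiling the retry rate falls off quickly, which is the property the commenters are asking for.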