[HN Gopher] Every device with FB app is now DDoSing recursive DN...
       ___________________________________________________________________
        
       Every device with FB app is now DDoSing recursive DNS resolvers
        
       Author : doener
       Score  : 285 points
       Date   : 2021-10-04 19:26 UTC (3 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | lrem wrote:
       | Isn't this "every device with an app using the FB SDK"?
        
         | Fordec wrote:
          | Is it just me, or does the FB SDK absolutely damage any
          | software that has it installed as a dependency every time
          | there's an issue on Facebook's server side? The thought put
          | into failover states under the hood is very lacking. It
          | speaks to an ethos of: if developers aren't a working data
          | pipeline to Facebook services, they and their products can go
          | pound sand.
        
         | agilob wrote:
          | Every app with the Facebook SDK, used for whatever reason:
          | login, metrics, ads...
        
       | littlecranky67 wrote:
        | Oh great, and at some point when key ISP DNS servers crack under
        | the load, more and more websites will appear to "go down" from a
        | user's perspective - suddenly gmail.com and outlook.com stop
        | working. More and more people reload websites, restart devices,
        | etc., and increase the load even further. People fall back to
        | using SMS/telephone, but since those aren't built for 2021's
        | heavy loads, soon phone calls fail. With FB, WA, email and phone
        | "down", engineers can't be reached to fix this. And if they can,
        | they fail to call the Uber to get them somewhere. And even if
        | they could, the streets are congested with people who cannot
        | communicate remotely and so try to get somewhere to convey
        | messages in person.
        | 
        | Hope they just go back and fix FB, and this is all in my head
        | :)
        
         | littlecranky67 wrote:
          | I wrote this as a pure joke, but now that I've learned that
          | SERVFAIL is not cached by browsers, clients, intermediate DNS
          | servers [0], etc., I am curiously wondering what is going to
          | happen. It is not only FB apps: basically every request to a
          | website that uses FB JS for ads, tracking, etc. triggers a
          | DNS request, which is forwarded 1:1 from the ISP's resolver
          | to the null-routed FB subnet. This should put orders of
          | magnitude more load on recursive DNS servers than usual.
         | 
         | [0]: https://serverfault.com/questions/479367/how-long-a-dns-
         | time...
        
           | bowaggoner wrote:
           | I enjoyed the comment. You might like my short story along
           | the same lines from a few years ago:
           | https://bowaggoner.com/writeups/robust.html
        
         | tarsius wrote:
          | For those who understand Swiss German: Mani Matter - I han es
          | Zundholzli azundt https://www.youtube.com/watch?v=PkGatIgXERI
        
           | Tainnor wrote:
           | I lit a match / And it made a fire. / And for my cigarette /
           | I wanted to take the fire from the match. / But the match
           | slipped from my hand and landed on the rug. / And it almost
           | made a hole in the rug.
           | 
           | Well, you know what can happen / If you're not careful with
           | fire. / And for the light on a cigarette / A rug feels rather
           | too expensive. / And from the rug the fire, alas, / Might
           | have spread to the whole house / And who knows what would
           | have happened thereafter?
           | 
           | There would have been a fire in the district / And the fire
           | fighters would have had to come. / They would have honked in
            | the streets / And unrolled their hoses. / And they would
            | have sprayed at the fire / And it would have been in vain /
            | And the whole city would have burned with nothing to
            | protect it.
           | 
           | And the people would have jumped around / Fearing for their
           | possessions. / They would have thought somebody had started a
           | fire. / They would have grabbed assault rifles. / Everyone
           | would have shouted: "Whose fault is it?" / The whole country
           | would have rioted. / And they would have shot at the
           | ministers behind the lecterns.
           | 
           | The UN would have become involved / And the UN enemies as
           | well. / To preserve peace in Switzerland / Both would have
           | come with tanks. / It would have spread, little by little, /
           | To Europe, to Africa. / There would have been a World War and
           | humans would be no more.
           | 
           | I lit a match / And it made a fire. / And for my cigarette /
           | I wanted to take the fire from the match. / But the match
           | slipped from my hand and landed on the rug. / Thankfully, I
           | picked it back up.
        
         | asdff wrote:
          | If it's any consolation, I find SMS and telephone to be
          | remarkably robust in these situations. In college during
          | football games, the campus population would swell to probably
          | 200k people within a few square miles, each with a cell phone
          | in a pocket. 3G and LTE would be worthless. Campus-area wifi
          | would be worthless. The only thing that worked was shutting
          | all that off on your phone and resorting to SMS and calling
          | people over EDGE - that worked flawlessly, even with all
          | those people stressing a handful of towers at once.
        
           | littlecranky67 wrote:
            | Football games are planned events, and the carriers plan
            | capacity accordingly. I remember SMS + cell phone service
            | going down at several music festivals with 20,000+
            | participants (especially when they ended around midnight).
            | Only some carriers supported those "spontaneous" gatherings
            | in the middle of nowhere with mobile cell towers that would
            | keep connectivity going - but low-cost carriers never did.
        
           | mschuster91 wrote:
            | German carrier O2 was/is notorious for offering somewhat
            | decent-ish service in urban areas under normal conditions -
            | but major events with lots of people moving around the
            | city, like fans taking public transport to a soccer game,
            | political rallies, or your average drunkard festival
            | (=Oktoberfest)? Instant collapse...
        
         | IgorPartola wrote:
         | Isn't there an easy fix to just add a bullshit record for FB to
         | DNS until this blows over?
         | 
         | I am also thinking that all the poorly coded sites that don't
         | work unless the Share on Facebook button loads are also going
         | to hemorrhage money. So are all the e-commerce sites that rely
         | on Login with FB.
         | 
         | I hope this results in everyone rethinking adding all that shit
         | to their infrastructure before the next time.
        
           | noahtallen wrote:
            | My understanding is that the Facebook network itself was
            | unreachable from the internet because of BGP. So even if an
            | IP was resolved from DNS, that IP wouldn't get routed to
            | Facebook, because Facebook withdrew its routes from the
            | peers where it connects to ISPs via BGP.
        
             | AnimalMuppet wrote:
             | Not if the IP was 127.0.0.1...
        
           | jbotz wrote:
            | The problem isn't (wasn't) facebook.com's A records; it's
            | that the authoritative nameservers for facebook.com are
            | (were) unreachable. In theory someone could change the NS
            | records for facebook.com on the .com nameservers to point
            | somewhere else and serve up a fake facebook.com domain,
            | but... 1) those NS records have a 6-hour TTL, so it would
            | take a while to be effective, and 2) who has the authority
            | to do that?
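            | 
            | (For the curious, this is easy to watch from the outside
            | with standard dig queries; a.gtld-servers.net is just one
            | of the .com servers:)
            | 
            |     # ask a .com server for the delegation; this is
            |     # where the 6-hour TTL lives
            |     $ dig @a.gtld-servers.net facebook.com NS
            | 
            |     # ask Facebook's own nameserver directly; during
            |     # the outage these were the queries timing out
            |     $ dig @a.ns.facebook.com facebook.com A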
        
       | unanswered wrote:
        | ISPs keep floating the idea of charging Netflix for their own
        | customers' data traffic; by the same logic, DNS operators
        | should be charging Facebook.
        
       | walrus01 wrote:
       | If you're an ISP that is anything bigger than a mom-and-pop
       | operation, you should have at least 3 or 4 geographically
       | distributed anycast recursive resolvers.
       | 
       | Recursive DNS is pretty easy to do for really large volumes on a
       | $600 1U server. It's not like the days of 15 years ago...
        
         | fmajid wrote:
         | Ah, but what about all the marketing analytics ISPs deploy on
         | their DNS servers so that if your browser ever looked up
         | viagra.com, forever will it grace your web browsing retargeting
         | ad units? /s
        
       | Jumb0 wrote:
       | Could anyone explain this for people with no DNS knowledge?
        
         | ricardo81 wrote:
          | BGP is like the mail service.
          | 
          | DNS is a translation from human-readable addresses to machine
          | addresses.
          | 
          | BGP determines how traffic finds a route to those addresses,
          | from your machine to theirs.
        
         | MrStonedOne wrote:
          | The internet phone book that converts .coms to IP addresses
          | only scales up to the load level of the internet because
          | results are cached at multiple layers.
          | 
          | Your browser asks your computer, which asks your router,
          | which asks your ISP, which asks .com's DNS servers, which ask
          | Facebook's DNS servers for Facebook's IP address.
          | 
          | Each layer caches the results. So even if 100,000 people in
          | Seattle want to know facebook.com's IP address, only the 5 or
          | so ISPs who provide internet in Seattle have to ask for
          | facebook.com's IP address: 100,000 requests, but only 5
          | actual requests.
          | 
          | Even the per-device and per-home cache is helpful, because
          | 500 page loads in 15 minutes still only result in 1 actual
          | DNS request.
          | 
          | Here's the issue:
          | 
          | Failures aren't cached.
          | 
          | So while 100,000 people in 1 second trying to get Facebook's
          | IP used to result in only 5 requests reaching the core DNS
          | servers, it now results in 100,000 requests in 1 second going
          | to the core servers.
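          | 
          | A toy sketch of that difference (hypothetical Python, not
          | any real resolver; "upstream" stands in for the core
          | servers):
          | 
          |     import time
          | 
          |     class Resolver:
          |         def __init__(self):
          |             self.cache = {}  # name -> (expiry, answer)
          |             self.upstream_queries = 0
          | 
          |         def resolve(self, name, upstream_healthy=True):
          |             hit = self.cache.get(name)
          |             if hit and hit[0] > time.time():
          |                 return hit[1]           # served from cache
          |             self.upstream_queries += 1  # must go upstream
          |             if upstream_healthy:
          |                 answer = "157.240.0.35"  # illustrative IP
          |                 # successes are cached for their TTL
          |                 self.cache[name] = (time.time() + 300, answer)
          |                 return answer
          |             return "SERVFAIL"  # failures are NOT cached
          | 
          |     r = Resolver()
          |     for _ in range(100_000):
          |         r.resolve("facebook.com", upstream_healthy=True)
          |     print(r.upstream_queries)  # 1: cache absorbs the rest
          | 
          |     r = Resolver()
          |     for _ in range(100_000):
          |         r.resolve("facebook.com", upstream_healthy=False)
          |     print(r.upstream_queries)  # 100000: all go upstream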
        
       | Rd6n6 wrote:
        | DDoS is a strong way to put this. Are we talking malware that is
       | sending thousands of requests per device, or a bug from a
       | connectivity issue?
        
         | Jtsummers wrote:
         | It's still a DDoS, just not an _attack_. Slashdotting a site is
         | a DDoS, but usually not intended as a deliberate attack. Now
         | take every WhatsApp, Facebook, Instagram, and Facebook
         | Messenger user, every app that uses FB for user authentication,
         | every _site_ that does the same, every app and site that serves
         | FB ads, every app and site that uses FB for metrics, and we
         | have an unintentional DDoS just waiting to happen.
        
           | tantalor wrote:
           | The word "attack" has multiple definitions, including "act
           | harmfully on" e.g., a heart attack. It does not require
           | intent or aggression.
        
       | protomyth wrote:
        | Yeah, I now know every user with the FB app installed. It's
        | just wild to watch the log of all the phones asking for
        | facebook.com.
        
       | sschueller wrote:
        | Is there no response code for a DNS server to say "I don't have
        | what you want right now, come back later but wait at least xxx
        | seconds"?
        | 
        | I guess alternatively you could return garbage (127.0.0.1) with
        | a 5 min TTL or so to get clients to back off, but that's also
        | problematic.
        
         | a1369209993 wrote:
         | > you could return garbage (127.0.0.1) with a 5 min ttl
         | 
         | I use 0.0.0.0, though I'm not sure if some layer in that mess
         | would interpret it creatively. Has worked on my machine for
          | years at least:
          | 
          |     $ ping facebook.com
          |     connect: Invalid argument
         | 
         | (If you do this, please set the TTL to at least a month, and
         | preferably upwards of a decade.)
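          | 
          | For reference, a sketch of both forms (the hosts-file line
          | works anywhere but has no TTL concept; for dnsmasq, my
          | understanding is that local-ttl sets the TTL on answers
          | synthesized from config):
          | 
          |     # /etc/hosts - per-machine, no TTL involved
          |     0.0.0.0  facebook.com
          | 
          |     # dnsmasq equivalent (also covers subdomains)
          |     address=/facebook.com/0.0.0.0
          |     local-ttl=2592000    # 30 days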
        
           | asddubs wrote:
           | lots of ISPs straight up ignore ttl
        
         | Computeiful wrote:
          | Clients should really be using something like Ethernet-style
          | exponential backoff.
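          | 
          | Something like this generic sketch with full jitter
          | (do_request stands in for whatever call the SDK makes):
          | 
          |     import random, time
          | 
          |     def fetch_with_backoff(do_request, base=1.0, cap=300.0,
          |                            retries=8):
          |         """Retry with exponential backoff and full jitter."""
          |         for attempt in range(retries):
          |             try:
          |                 return do_request()
          |             except OSError:
          |                 # sleep in [0, min(cap, base * 2^attempt))
          |                 delay = min(cap, base * 2 ** attempt)
          |                 time.sleep(random.uniform(0, delay))
          |         raise RuntimeError("still failing; give up for now")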
        
       | xenonite wrote:
        | And even Hacker News is strangely slow to respond.
        
         | mrep wrote:
          | Lots of people are interested in the event, and a normal
          | (logged-in) web response queries their SQL database for user
          | data like upvotes, which is their limiter.
          | 
          | Logged out - which I believe they just forced for all users -
          | you're served from the cache and it'll be fast. dang
          | mentioned it during the giant S3 outage a few years ago.
        
           | munk-a wrote:
           | Ah nice - I noticed it just got a lot more snappy to respond.
        
         | bifrost wrote:
          | Yep, with FB and IG down, people are spending time on a much
          | more important site... this site.
        
         | reayn wrote:
         | yeah I thought I was the only one, never really noticed this on
         | HN before...
        
           | adtac wrote:
           | really lol? HN goes down every full moon day or something
           | like that
        
       | littlecranky67 wrote:
        | I'd thank anybody who would post a tutorial/config file to set
        | up a DNS server (dnsmasq?) that forcefully caches even failed
        | requests for a configurable timeout, with large cache sizes. We
        | might need them in case DNS servers go down under the load of
        | requests from "smart devices" :)
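        | 
        | Something like this untested sketch is what I have in mind
        | (note: as far as I can tell, dnsmasq's neg-ttl only covers
        | NXDOMAIN/NODATA answers lacking an SOA record - nothing will
        | cache a SERVFAIL):
        | 
        |     # /etc/dnsmasq.conf - untested sketch
        |     cache-size=10000     # keep plenty of records around
        |     min-cache-ttl=300    # floor TTLs at 5 minutes
        |                          # (dnsmasq caps this option at 3600)
        |     neg-ttl=3600         # cache SOA-less negative answers 1h
        |     server=9.9.9.9       # upstream resolver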
        
         | agilob wrote:
          | > forcefully caches even failed requests for a configurable
          | timeout
          | 
          | I've been doing ~SRE for 1.5 years, and I've worked or helped
          | on 3 outages related to negative DNS caching. Please don't
          | use a negative cache if you don't know enough about DNS and
          | can't monitor it.
        
           | littlecranky67 wrote:
            | I wouldn't suggest any ISP should do that (and I am not
            | one), but probably host this for my own personal usage/home
            | network. If recursive DNS servers go down under the load of
            | "smart devices", having a local copy of a larger set of IPs
            | I usually visit might come in handy (and none of my
            | requests would worsen the issue of server overload).
        
             | agilob wrote:
              | That's just 'cache' - my OpenWRT router has this for
              | thousands of records. But negative cache means: "remember
              | this domain doesn't exist and don't retry asking other
              | DNS providers". That is very dangerous.
              | 
              | Your browser AND operating system AND router already
              | provide DNS caching; it's not something the average user
              | should even think about. You might want to consider it
              | when things at your ISP go wrong (hello BT), or when the
              | majority of your computers request the same domains
              | frequently - but then again, your router should do that
              | already.
        
         | jaywalk wrote:
         | This does a good job explaining how SERVFAIL caching works:
         | https://serverfault.com/questions/479367/how-long-a-dns-time...
        
           | littlecranky67 wrote:
            | From your linked SO post, the accepted answer concludes:
            | 
            | "In summary, SERVFAIL is unlikely to be cached, but even if
            | cached, it'll be at most a double- or even a single-digit
            | number of seconds."
            | 
            | That would be fatal right now, wouldn't it? It would mean
            | every major ISP's DNS servers are currently forwarding
            | millions of _identical_ DNS resolve requests to the
            | (currently null-routed) Facebook DNS servers. These must
            | number in the millions - heck, every larger website uses FB
            | tracking tools, "like buttons", etc. Are they at least
            | smart enough to throttle based on a domain/IP hash?
            | Otherwise the DNS servers of major ISPs could soon be
            | overloaded, as (constantly failing and thus uncached)
            | requests to FB DNS eat up all bandwidth/resources.
        
             | toast0 wrote:
             | Not that fatal. I think at least some recursive servers
             | will do 'collapsed forwarding', where additional requests
             | to resolve the same name while the first request is in
             | progress will wait for the first request to finish and send
             | the same results to all clients at that point. Although,
             | perhaps that's just wishful thinking on my part.
             | 
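              | (The pattern I mean is roughly this - a sketch using
              | Python threads, not any particular resolver's
              | implementation:)
              | 
              |     import threading
              | 
              |     class Coalescer:
              |         """Collapse concurrent lookups for one name
              |         into a single upstream query."""
              |         def __init__(self, upstream):
              |             self.upstream = upstream  # real lookup fn
              |             self.lock = threading.Lock()
              |             self.inflight = {}        # name -> entry
              | 
              |         def resolve(self, name):
              |             with self.lock:
              |                 entry = self.inflight.get(name)
              |                 leader = entry is None
              |                 if leader:
              |                     entry = {"done": threading.Event(),
              |                              "result": None}
              |                     self.inflight[name] = entry
              |             if leader:
              |                 try:
              |                     # the one real upstream query
              |                     entry["result"] = self.upstream(name)
              |                 finally:
              |                     with self.lock:
              |                         del self.inflight[name]
              |                     entry["done"].set()
              |             else:
              |                 entry["done"].wait()  # ride along
              |             return entry["result"]
              | 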
              | Then you have port limits. Usually each request goes out
              | on a new port, so a recursive resolver can only have 64k
              | requests outstanding to any given authoritative (or
              | upstream) server IP for each IP the recursive uses.
              | Facebook runs with 4 hostnames listed, so that's a limit
              | of 256k requests outstanding, 512k if your recursive does
              | IPv4 and v6 (and 1M if they're also making WhatsApp
              | requests).
             | 
             | DNS services for both domains appear to be back up by the
             | way.
             | 
             | On the authoritative side, it's not too hard to manage this
             | load. If you can't handle the big crush to start with, drop
             | all requests, and then accept all the requests from
             | 1.0.0.0/8, and add one /8 at a time as CPU permits until
             | you're allowing everything. Once you handle the initial
             | crush from a resolver, it should go back to normal load,
             | and there should be some distribution of load across the
             | various /8s. I wouldn't expect it to be evenly distributed,
             | but it should be even enough.
             | 
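              | (A sketch of that ramp - cpu_headroom is a hypothetical
              | health check, not a real function:)
              | 
              |     def accept(src_ip: str, open_slash8s: int) -> bool:
              |         # admit only 1.0.0.0/8 .. N.0.0.0/8 for now
              |         return int(src_ip.split(".")[0]) <= open_slash8s
              | 
              |     # elsewhere, as capacity allows:
              |     # while cpu_headroom() and open_slash8s < 255:
              |     #     open_slash8s += 1
              | 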
             | Disclosure: I worked at WhatsApp, but left August 2019. I
             | don't know anything about this outage other than idle
              | speculation. I don't know if FB has a procedure to slow-
              | start DNS, but the theory is simple; the practice is
              | complicated by the DNS IPs being anycast.
        
               | littlecranky67 wrote:
               | > servers will do 'collapsed forwarding', [...] perhaps
               | that's just wishful thinking on my part
               | 
                | I think it is wishful thinking, because that would
                | basically be caching, which is not allowed by the RFC.
                | In 2017 the BIND implementation changed to a default
                | cache time of 1s, which would certainly ease the
                | problem.
               | 
               | > then you have port limits, usually each request goes
               | out on a new port, a recursive resolver can only have 64k
               | 
                | I'm unsure if this helps or worsens the situation,
                | depending on whether the 'collapsed forwarding'/1s
                | caching is in place. If it is not, ephemeral port
                | exhaustion would kick in, at which point the DNS server
                | will not be able to serve other requests.
               | 
               | > On the authoritative side, it's not too hard to manage
               | this load
               | 
                | Of course not - all you need to do is present _any_
                | response, which will be cached by downstream resolvers.
                | No smartphone/end-user device will query the
                | authoritative side as long as there is just any (even
                | stale) response.
        
               | toast0 wrote:
                | > If it is not, ephemeral port exhaustion would kick
                | in, at which point the DNS server will not be able to
                | serve other requests.
               | 
                | You can use the same local IP/port to contact multiple
                | server IP/ports, so filling up connections to FB IPs
                | shouldn't prevent you from connecting to others (but
                | there are plenty of ways to do that wrong, I guess).
               | 
               | >> On the authoritative side, it's not too hard to manage
               | this load
               | 
               | > Of course not, all you need to do is just present any
               | response which will be cached by downstream resolvers.
               | 
               | You need to present a response before the resolver times
               | out. One can certainly imagine a situation where the
               | incoming packet processing results in enough delay that
               | the responses arrive too late and are discarded. In the
                | right conditions, this queuing delay would never clear
                | and things would just get worse. If that doesn't
                | happen, great, but if it does, dropping most of the
                | requests so you can handle the few you accept in a
                | timely way is a good way to get moving again.
        
             | fmajid wrote:
              | It's not so much the load as the DNS servers having to
              | maintain state for all those queries until they time out.
              | That must consume tremendous RAM, and servers that are
              | not event-driven could also be spawning large numbers of
              | threads.
        
       | fermentation wrote:
       | Excellent opportunity to set up a DNS-by-mail service. Just send
       | me a letter with the names you want and I'll get back to you
       | within 3 to 5 business days!
        
       | idatum wrote:
        | Never having explicitly queried fb.com before, I'd never
        | noticed how clever they (face:b00c) got with their IPv6
        | address:
       | 
       | 2a03:2880:f1ff:83:face:b00c:0:25de
        
         | ceejayoz wrote:
         | Facebook also used brute force to get facebookcorewwwi.onion on
         | Tor a while back.
         | https://en.wikipedia.org/wiki/Facebook_onion_address
        
           | ipaddr wrote:
            | facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion
            | is the latest
        
       | john37386 wrote:
        | At this point people just hope it will restart so that we can
        | all resume a normal life. There are only two options: it will
        | be fixed very soon, or it will be a hell of a night with a lot
        | of coffee.
        
       | Ajedi32 wrote:
       | Is this specifically an issue with the Facebook app? Or is it
        | just a predictable consequence of DNS responses no longer being
       | cached due to query failures for a site as popular as Facebook?
        
         | treesknees wrote:
         | It is certainly not specific to Facebook, but the scale at
         | which Facebook is referenced across websites and apps is pretty
         | unique (I can only think of a few key players like Google who
         | would cause a similar load.)
         | 
         | And to clarify a bit, the queries aren't "no longer being
         | cached due to query failures", it's because their TTL expires
         | and the resulting SERVFAIL from the next query (which fails)
         | isn't cached at all.
        
       | [deleted]
        
       | justahuman1 wrote:
       | Is it possible for facebook to instead rely on an anycast IP
       | rather than DNS for their (non-web) phone apps?
        
         | bifrost wrote:
         | No. But FB's DNS is anycasted.
         | 
         | FB's eggs were all in one basket, and the basket broke.
        
         | treesknees wrote:
          | Yes and no. Yes, they could technically hardcode an anycasted
          | IP address; however, it'd be less reliable, and you'd run
          | into issues with TLS certificates. It'd be very inflexible
          | and would probably result in more outages.
          | 
          | But even if they did hardcode an IP, the underlying
          | infrastructure for Facebook was also down, not just DNS
          | resolution of facebook.com. So even if the FB app didn't need
          | to resolve a hostname, it would still be broken.
        
       | [deleted]
        
       | earth2mars wrote:
        | Is it just the people trying to connect, or does the app itself
        | keep polling and trying to send information from devices to
        | Facebook's servers continuously?
        
       | slg wrote:
       | This technology problem is a good metaphor for Facebook overall
       | as a company. There is nothing fundamentally wrong with having
       | your app regularly polling for DNS records when it can't find
       | them, but that can be an actively harmful approach when you are
       | the size of Facebook. Being that size comes with a whole swath of
       | extra responsibilities to ensure that your behavior doesn't end
       | up harming society as a whole.
        
         | jmalicki wrote:
         | There is something inherently wrong with that - it's why
         | exponential backoff exists.
        
       ___________________________________________________________________
       (page generated 2021-10-04 23:00 UTC)