[HN Gopher] Cloudflare servers don't own IPs anymore so how do t...
       ___________________________________________________________________
        
       Cloudflare servers don't own IPs anymore so how do they connect to
       the internet?
        
       Author : jgrahamc
       Score  : 213 points
       Date   : 2022-11-25 14:16 UTC (8 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | danrl wrote:
       | As an industry, we are bad at deprecating old protocols like
        | IPv4. This is a genius hack for a problem we have because IPv6
        | isn't adopted widely enough for serving legacy-IP users to
        | become a droppable liability to the business. The ROI is still
       | high enough for us to "innovate" here. I applaud the solution but
       | mourn the fact that we still need this.
       | 
       | I guess ingress is next, then? Two layers of Unimog to achieve
       | stability before TCP/TLS termination maybe.
        
         | dopylitty wrote:
         | I've been thinking a lot about this in my own enterprise and
         | I've increasingly come to the conclusion that IP itself is the
         | wrong abstraction for how the majority of modern networked
         | compute works. IPv6, as a (quite old itself) iteration on top
          | of IPv4 with a bunch of byzantine processes and acronyms
          | tacked on, is solving the wrong problem.
         | 
         | Originally IP was a way to allow discrete physical computers in
         | different locations owned by different organizations to find
         | each other and exchange information autonomously.
         | 
         | These days most compute actually doesn't look like that. All my
         | compute is in AWS. Rather than being autonomous it is
         | controlled by a single global control plane and uniquely
         | identified within that control plane.
         | 
          | So when I want my services to connect to each other within AWS
         | why am I still dealing with these complex routing algorithms
         | and obtuse numbering schemes?
         | 
         | AWS knows exactly which physical hosts my processes are running
         | on and could at a control plane level connect them directly.
         | And I, as someone running a business, could focus on the higher
         | level problem of 'service X is allowed to connect to service Y'
         | rather than figuring out how to send IP packets across
         | subnets/TGWs and where to configure which ports in NACLs and
         | security groups to allow the connection.
         | 
         | Similarly my ISP knows exactly where Amazon and CloudFlare's
         | nearest front doors are so instead of 15 hops and DNS
         | resolutions my laptop could just make a request to Service X on
         | AWS. My ISP could drop the message in AWS' nearest front door
         | and AWS could figure out how to drop the message on the right
         | host however they want to.
         | 
         | I know there's a lot of legacy cruft and also that there are
         | benefits of the autonomous/decentralized model vs central
         | control for the internet as a whole but given the centralized
         | reality we're in, especially within the enterprise, I think
         | it's worth reevaluating how we approach networking and whether
          | the continuing focus on IP is the best use of our time.
        
           | ec109685 wrote:
           | The IP addresses you see as an AWS customer aren't the same
           | used to route packets between hosts. That said, there's a
           | huge amount of commodity infrastructure built up that
           | understands IP addresses and routing layers, so unless a new
           | scheme offers tremendous benefits, it won't get adoption.
           | 
            | At least from a security perspective, though, IP ACLs are
            | falling out of favor compared to service-based identities,
            | which is a good thing.
           | 
           | You can see how AWS internally does networking here:
           | https://m.youtube.com/watch?v=ii5XWpcYYnI
        
           | wpietri wrote:
           | > my laptop could just make a request to Service X on AWS
           | 
           | I was looking for the "just" that handwaves away the
           | complexity and I was not disappointed.
           | 
           | How do you imagine your laptop expressing a request in a way
           | that it makes it through to the right machine? Doing a
           | traceroute to amazon.com, I count 26 devices between me and
           | it. How will those devices know which physical connection to
           | pass the request over? Remember that some of them will be
           | handling absurd amounts of traffic, so your scheme will need
           | to work with custom silicon for routing as well as doing ok
           | on the $40 Linksys home unit. What are you imagining that
           | would be so much more efficient that it's worth the enormous
           | switching costs?
           | 
           | I also have questions about your notion of "centralization".
           | Are you saying that Google, Microsoft, and other cloud
           | vendors should just... give up and hand their business to
           | AWS? Is that also true for anybody who does hosting,
           | including me running a server at home? If so, I invite you to
           | read up on the history of antitrust law, as there are good
           | reasons to avoid a small number of people having total
           | control over key economic sectors.
        
             | dopylitty wrote:
             | > How do you imagine your laptop expressing a request in a
             | way that it makes it through to the right machine? Doing a
             | traceroute to amazon.com, I count 26 devices between me and
             | it. How will those devices know which physical connection
             | to pass the request over?
             | 
             | That's my whole point. You're thinking of it from an IP
             | perspective where there are individual devices in some
             | chain and they all need to autonomously figure out a path
             | from my laptop to AWS. The reality is every device between
             | me and AWS is owned by my ISP. They know exactly which
             | physical path ahead of time will get a message from my
             | laptop to AWS. So why waste all the time on the IP
             | abstraction?
             | 
             | > I also have questions about your notion of
             | "centralization". Are you saying that Google, Microsoft,
             | and other cloud vendors should just... give up and hand
             | their business to AWS?
             | 
             | AWS is just an example. Realistically a huge amount of
             | traffic on the internet is going to 6 places and my ISP
             | already has direct physical connections to those places.
             | Maintaining this complex and byzantine abstraction to
             | figure out how to get a message from my laptop to compute
             | in those companies' infrastructure should not be necessary.
             | 
             | And in general the more important part is within AWS' (or
             | Microsoft's or enterprise X's) network why waste time on IP
             | when the network owner knows exactly which host every
             | compute process is running on?
             | 
             | Instead of thinking of an enterprise network as a set of
             | autonomous hosts that need to figure out a path between
             | each other think of it as a set of processes running on the
             | same OS (the virtual infrastructure). Linux doesn't need to
             | do BGP to figure out how to connect two processes so why
             | does your network?
        
               | scarmig wrote:
               | > The reality is every device between me and AWS is owned
               | by my ISP. They know exactly which physical path ahead of
               | time will get a message from my laptop to AWS.
               | 
               | None of these are true.
        
           | akira2501 wrote:
           | > Rather than being autonomous it is controlled by a single
           | global control plane and uniquely identified within that
           | control plane.
           | 
           | By default, sure. You can easily bring your own IPs into AWS
           | and use them instead, and I don't think it's hard to imagine
           | the pertinent use cases and risk management this brings.
        
       | mike256 wrote:
        | Wouldn't it be better if all those big CDNs just switched off
        | IPv4 and forced the sleeping ISPs to enable IPv6? Maybe we
        | should introduce some IPv6-only days as a first step...
        
       | subarctic wrote:
       | Pretty interesting article. TLDR: they're now using anycast for
       | egress, not just ingress.
       | 
       | Each data center has a single IP for each country code (so that
       | they can make outgoing requests that are geolocated in any
       | country). In order to achieve that, they have a /24 or larger
       | range for each country, and announce it from all their data
       | centers, and then they route the traffic over their backbone to
       | the appropriate data center for that IP.
       | 
       | Then in the data center, they share the single IP across all
       | their servers by giving each server a range of TCP/UDP port space
       | (instead of doing stateful NAT).
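The port-slicing scheme described above can be sketched in a few lines. This is a hypothetical illustration (the slice size, server count, and function names are invented, not Cloudflare's actual layout):

```python
# Hypothetical sketch of "soft-unicast": one egress IP shared by many
# servers, each owning a disjoint, contiguous slice of the 16-bit port
# space. No per-connection NAT state is needed anywhere: the port alone
# identifies the owning server.

PORTS_PER_SERVER = 2048  # illustrative slice size

def owning_server(port: int, num_servers: int) -> int:
    """Map a port on the shared egress IP to a server index."""
    if not (0 <= port < 65536):
        raise ValueError("port out of range")
    return (port // PORTS_PER_SERVER) % num_servers

# Return traffic to the shared IP on port 4100 lands on the server
# that owns ports [4096, 6144).
assert owning_server(4100, num_servers=16) == 2
```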
        
         | ec109685 wrote:
         | It's not a single IP address per data center. Otherwise they'd
         | only be able to make 64k simultaneous egress connections, nor
          | would their scheme of different IP addresses per "geo" and
         | product work.
        
       | Terretta wrote:
       | I quite like what CloudFlare has done here.
       | 
       | There's a fourth way to resolve this, that works for the core use
       | case, is less engineering, and was in production 20 years ago,
       | but I can't fit it within the margins of this comment box.
       | 
       | // CF's approach has additional feature advantages though.
        
       | [deleted]
        
       | dfawcus wrote:
       | What they describe sounds a lot like a distributed static RSIP
       | scheme.
       | 
       | https://en.wikipedia.org/wiki/Realm-Specific_IP
       | 
        | With port ranges allocated statically per server within a
        | locale, rather than being 'leased'.
        | 
        | So the IP goes to the locale, and the port range is the static
        | RSIP assignment to the server within that locale.
        
       | martinohansen wrote:
       | Am I missing something here or did they just reinvent a NAT
       | gateway with static rules?
       | 
        | I understand that they started using anycast for the egress IPs
        | as well, but that's unrelated to the NAT problem.
        
         | [deleted]
        
       | xg15 wrote:
       | > _However, while anycast works well in the ingress direction, it
       | can 't operate on egress. Establishing an outgoing connection
       | from an anycast IP won't work. Consider the response packet. It's
       | likely to be routed back to a wrong place - a data center
       | geographically closest to the sender, not necessarily the source
       | data center!_
       | 
       | Slightly OT question, but why wouldn't this be a problem with
       | ingress, too?
       | 
       | E.g. suppose I want to send a request to https://1.2.3.4. What I
       | don't know is that 1.2.3.4 is an anycast address.
       | 
       | So my client sends a SYN packet to 1.2.3.4:443 to open the
       | connection. The packet is routed to data center #1. The data
       | center duly replies with a SYN/ACK packet, which my client
       | answers with an ACK packet.
       | 
       | However, due to some bad luck, the ACK packet is routed to data
       | center #2 which is also a destination for the anycast address.
       | 
       | Of course, data center #2 doesn't know anything about my
       | connection, so it just drops the ACK or replies with a RST. In
       | the best case, I can eventually resend my ACK and reach the right
       | data center (with multi-second delay), in the worst case, the
       | connection setup will fail.
       | 
       | Why does this not happen on ingress, but is a problem for egress?
       | 
       | Even if the handshake uses SYN cookies and got through on data
       | center #2, what would keep subsequent packets that I send on that
       | connection from being routed to random data centers that don't
       | know anything about the connection?
        
         | matsur wrote:
         | This is a problem in theory. In practice (and through
         | experience) we see very little routing instability in the way
         | you describe.
        
           | xg15 wrote:
           | You mean, it's just luck?
        
             | Brian_K_White wrote:
             | right? also seems like load should or at least could be
              | changing all the time. Are geo or hop proximity really
              | the only things that decide a route? Not load also?
             | 
             | But although I would be surprised if load were not also
             | part of the route picker, I would also be surprised if the
             | routers didn't have some association or state tracking to
             | actively ensure related packets get the same route.
             | 
             | But I guess this is saying exactly that, that it's relying
             | on luck and happenstance.
             | 
             | It may be doing the job well enough that not enough people
             | complain, but I wouldn't be proud of it myself.
        
               | remram wrote:
               | Anycast is implemented by BGP and doesn't take load into
               | account in any way. You will reach the closest location
               | announcing that address (well, prefix).
        
               | ignoramous wrote:
               | TFA claims that _Anycast_ is an advantage when dealing
               | with DDoS because it helps spread the load? A regional
               | DDoS (where it consistently hits a small set of DCs) is
               | not a common scenario, I guess?
        
               | csande17 wrote:
               | Basically yes. Large-scale DDoS attacks rely on
               | compromising random servers and devices, either directly
               | with malware or indirectly with reflection attacks. Those
               | hosts aren't all going to be located in the same place.
               | 
               | An attacker could choose to only compromise devices
               | located near a particular data center, but that would
               | really reduce the amount of traffic they could generate,
               | and also other data centers would stay online and serve
               | requests from users in other places.
        
               | toast0 wrote:
               | Your intuition is more or less all wrong here, sorry.
               | 
                | Most routers with multiple viable paths pass way too
                | much traffic to do state tracking of individual flows.
                | Most typically, the default metric is BGP path length:
                | for a given prefix, send packets through the route that
                | has the most specific prefix; if there's a tie, use the
                | route that transits the fewest networks to get there;
                | if there's still a tie, use the route that has been up
                | the longest (which maybe counts as state tracking).
                | Routing
               | like this doesn't take into account any sort of load
               | metric, although people managing the routers might do
               | traffic engineering to try to avoid overloaded routes
               | (but it's difficult to see what's overloaded a few hops
               | beyond your own router).
               | 
                | For the most part, an anycast operation is going to
                | work best if all sites can handle all the foreseeable
                | load, because it's easy to move all the traffic but not
                | easy to move only some. Everything you can do to try to
                | move some traffic is likely to either not be effective
                | or to move too much.
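The selection order described above (most specific prefix, then shortest AS path, then oldest route) can be modeled as a toy sort key. Real BGP has more tie-breakers (local-pref, MED, and so on); everything here is illustrative only:

```python
# A toy model of default BGP route selection: longest-prefix match
# first, then shortest AS path, then oldest route. Note that load is
# not an input anywhere, which is the point of the comment above.

from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass
class Route:
    prefix: str       # e.g. "1.2.3.0/24"
    as_path_len: int  # number of networks transited
    age: int          # seconds the route has been up

def best_route(dest: str, routes: list[Route]) -> Route:
    candidates = [r for r in routes
                  if ip_address(dest) in ip_network(r.prefix)]
    # Most specific prefix wins, then fewest AS hops, then oldest.
    return max(candidates,
               key=lambda r: (ip_network(r.prefix).prefixlen,
                              -r.as_path_len,
                              r.age))

routes = [Route("1.2.0.0/16", 2, 900),
          Route("1.2.3.0/24", 4, 100),
          Route("1.2.3.0/24", 4, 500)]  # same prefix/path: older wins
assert best_route("1.2.3.4", routes).age == 500
```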
        
               | richieartoul wrote:
               | Why shouldn't they be proud of a massive system like
                | Cloudflare that works extremely well? As a commenter
               | below described, it's not luck or happenstance, it's a
               | natural consequence of how BGP works. Seems pretty
               | elegant to me.
        
             | rizky05 wrote:
        
             | [deleted]
        
         | tonyb wrote:
         | It works because the route to 1.2.3.4 is relatively stable. The
         | routes would only change and end up at data center #2 if data
         | center #1 stopped announcing the routes. In that case the
         | connection would just re-negotiate to data center #2.
        
           | xg15 wrote:
           | Ah, ok, that makes sense. So for a given point of origin,
           | anycast generally routes to the same server?
        
             | majke wrote:
             | Correct. From a single place, you're likely to BGP-reach
             | one Cloudflare location, and it doesn't change often.
        
         | ratorx wrote:
         | As others have mentioned, this is not often a problem because
         | routing is normally fairly stable (at least compared to the
         | lifetime of a typical connection). For longer lived connections
         | (e.g. video uploads), it's more of a problem.
         | 
         | Also, there are a fair number of ASes that attempt to load
         | balance traffic between multiple peering points, without
         | hashing (or only using the src/dst address and not the port).
         | This will also cause the problem you described.
         | 
         | In practice it's possible to handle this by keeping track of
         | where the connections for an IP address typically ingress and
         | sending packets there instead of handling them locally. Again,
          | since only a few ASes cause problems for typical
          | connections, it's also possible to figure out which IP prefixes
         | experience the most instability and only turn on this overlay
         | for them.
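The overlay described above can be sketched as a small affinity table: remember which data center a client prefix usually ingresses at, and forward stray packets there instead of handling them locally. All names and the /24 granularity are hypothetical:

```python
# Minimal sketch of an ingress-affinity overlay. A packet arriving at
# the "wrong" data center is redirected to the prefix's usual home DC
# over the backbone, instead of being terminated locally.

from ipaddress import ip_network

ingress_affinity: dict = {}  # client /24 -> data center id

def record_ingress(client_ip: str, local_dc: str) -> None:
    """Note that this client prefix normally ingresses at local_dc."""
    prefix = ip_network(f"{client_ip}/24", strict=False)
    ingress_affinity[prefix] = local_dc

def handle_packet(client_ip: str, local_dc: str) -> str:
    """Return the DC that should terminate this packet."""
    prefix = ip_network(f"{client_ip}/24", strict=False)
    return ingress_affinity.get(prefix, local_dc)

record_ingress("198.51.100.7", "AMS")
# A later packet from the same /24 that strays to FRA is sent home.
assert handle_packet("198.51.100.99", "FRA") == "AMS"
```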
        
         | grogers wrote:
         | Yep, it can happen that your packet gets routed to a different
         | DC from a prior packet. But the routers in between the client
         | and the anycast destination will do the same thing if the
         | environment is the same. So to get routed to a new location,
         | you would usually need either:
         | 
         | * A new (usually closer) DC comes online. That will probably be
         | your destination from now on.
         | 
         | * The prior DC (or a critical link on the path to it) goes
         | down.
         | 
         | The crucial thing is that the client will typically be routed
         | to the closest destination to it. In the egress case the
         | current DC may not be the closest DC to the server it is trying
         | to reach so the return traffic would go to the wrong place.
         | This system of identifying a server with unique IP/port(s)
         | means that CF's network can forward the return traffic to the
         | correct place.
        
         | ignoramous wrote:
         | Yes, as others have mentioned, route flapping is a problem.
         | But, in practice, not as big a problem as DNS-based routing.
         | 
         | - See: https://news.ycombinator.com/item?id=10636547
         | 
         | - And: https://news.ycombinator.com/item?id=17904663
         | 
         | Besides, SCTP / QUIC aware load balancers (or proxies) are
         | detached from IPs and should continue to hum along just fine
         | regardless of which server IP the packet ends up at.
        
       | Thorentis wrote:
       | The fact that we haven't yet adopted IPv6 tells me that IPv6
       | isn't actually that great of a solution. We need an Internet
       | Protocol that solves modern problems and that has a good
       | migration path.
        
         | wpietri wrote:
         | 40% of Google's traffic comes via IPv6. Up from 1% a decade
         | ago. https://www.google.com/intl/en/ipv6/statistics.html
         | 
         | If you think you can do better than that, I look forward to
         | hearing your plan. Personally, I think that's huge progress.
        
         | eastdakota wrote:
         | Fun fact: the first product we announced to celebrate
          | Cloudflare's launch day anniversary was an IPv4<->IPv6 gateway:
         | 
         | https://blog.cloudflare.com/introducing-cloudflares-automati...
         | 
         | The success of that convinced us we should do something to
         | improve the Internet every year to celebrate our "birthday."
          | Over time we ended up with more than one product that met
          | those criteria and that timing, so it went from a day of
          | celebration to a week. That became our Birthday Week. Then we
          | saw how well bundling a set of announcements into a week
          | worked, so we decided to do it at other times of the year.
          | And that's how Cloudflare
         | Innovation Weeks got started, explicitly with us delivering
         | IPv6 support back in 2011.
        
         | growse wrote:
         | You need an IPv4 src address to connect out to an IPv4 origin.
        
         | zekica wrote:
         | Where do they say that they haven't adopted IPv6? All their
         | offerings support IPv6.
        
       | inopinatus wrote:
       | TLDR: Cloudflare is using five bits from the port number as a
       | subnetting & routing scheme, with optional content policy
       | semantics, for hosts behind anycast addressing and inside their
       | network boundary.
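The "five bits of the port number" observation above amounts to a tiny bit of arithmetic: if a location slices its port space into 32 equal ranges, the top five bits of a 16-bit port identify the server. This is purely an illustration of the comment, not Cloudflare's documented layout:

```python
# Top 5 bits of a 16-bit port select one of 32 per-location port
# slices, i.e. the port number doubles as an intra-network "subnet".

def server_bits(port: int) -> int:
    """Extract the slice index encoded in the high bits of the port."""
    return (port >> 11) & 0b11111

assert server_bits(0) == 0        # ports 0-2047    -> slice 0
assert server_bits(2048) == 1     # ports 2048-4095 -> slice 1
assert server_bits(65535) == 31   # last slice
```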
        
       | Ptchd wrote:
       | If you don't need an IP to be connected to the internet, sign me
       | up... I think they are full of it though... Even if you only have
       | one IP.... you still have an IP
       | 
       | > PING cloudflare.com (104.16.133.229) 56(84) bytes of data.
       | 
       | > 64 bytes from 104.16.133.229 (104.16.133.229): icmp_seq=1
       | ttl=52 time=10.6 ms
       | 
       | With a ping like this, you know that I am not using Musk's
       | Internet....
        
       | cesarb wrote:
       | All this wonderful complexity, just because a few servers insist
       | on behaving as if the location of the IP address and the location
       | of the user should always match.
        
       | jesuspiece wrote:
        
       | ronnier wrote:
        | Spammers are exploiting Cloudflare by creating thousands of new
        | domains on free TLDs (like .ml), hosting the sites behind
        | Cloudflare, and spamming social media apps with links to scam
        | dating sites. CPA scammers.
       | 
       | If anyone from CF sees this, I can work with you and give you
       | data on this. I'm dealing with this at one of the large social
       | media companies.
       | 
       | Here's an example, this is NSFW - https://atragcara.ga
        
         | elorant wrote:
         | So why aren't social media platforms blocking the domains?
        
           | ronnier wrote:
            | We do. But with free TLDs, spammers and scammers can create
           | an unlimited number of new domains at zero cost. That's the
           | problem. They can send a single spam URL to a single person
           | and scale that out, each person gets a unique domain and URL.
        
             | elorant wrote:
             | So how about blocking the users then? Or limit their
             | ability to post links.
        
               | ronnier wrote:
               | That's done too. But it's not just a few, it's literally
               | 10s of thousands of individuals from places like
               | Bangladesh who do this as their source of income. They
               | are smart, on real devices, will solve any puzzle you
               | throw at them, and will adapt to any blocks or locking.
                | It's not an easy problem to solve, which is why no
                | platform has solved it (oddly, spam is pretty much
                | nonexistent on HN)
        
               | elorant wrote:
               | I don't think there's any benefit in spamming HN. There
                | aren't that many users in here, and it could lead to a
                | backlash considering the technical expertise of most
                | people.
        
             | gnfargbl wrote:
             | OK, but why don't you block Freenom domains entirely?
             | 
             | Apart from perhaps a couple of sites like gob.gq, there's
             | essentially nothing of any value on those TLDs. Allow-list
             | the handful of good sites, if you must, and default block
             | the rest.
        
               | ronnier wrote:
                | I could. But we are talking about one of the world's
                | largest social media platforms used by hundreds of
                | millions of people daily. There are legit websites hosted
               | on these free domains and I don't want to kill those
               | along with the scam sites. I've mostly got the scam sites
               | blocked at this point though. Just took me a week or so
               | to adapt.
        
               | gnfargbl wrote:
               | > There's legit websites hosted on these free domains
               | 
               | Are there though, really? Can you give some examples?
               | 
               | To a first approximation, I contend that essentially
               | everything on Freenom is bad. There are maybe a _handful_
               | of good sites (the one I listed, https://koulouba.ml/,
               | etc) but you can find those on Google in a few minutes
               | with some _site:_ searches.
               | 
               | I commend your efforts in blocking the scam sites, but
               | also honestly believe that it would be better for you,
               | your customers and the internet at large to default block
               | Freenom. Freenom sites are junk, wherever they are
               | hosted.
        
               | ronnier wrote:
                | Here are NSFW scam sites behind CF that use free TLDs. I
               | could post 10s of thousands of these.
               | 
               | * https://atragcara.ga
               | 
               | * https://donaga.tk
               | 
               | * https://snatemhatzemerbedc.tk
        
               | gnfargbl wrote:
               | Yep, I know. I monitor these as they appear in
               | Certificate Transparency logs and DNS PTR records.
               | 
               | Freenom TLDs are just junk. Save yourself the hassle and
               | default block :-).
        
               | ronnier wrote:
                | Seems these sites should be blocked on CF, at the
                | root, not at all the leaf-node apps. It's pretty easy
                | for me to automate it at my company. Seems CF could?
        
         | sschueller wrote:
         | Same goes for DDoS attacks. I am not sure how they do it but we
         | get hit by CF IPs with synfloods etc.
        
           | gnfargbl wrote:
           | Anyone can set the source IP on their packets to be anything.
           | I can send you TCP SYNs which are apparently from Cloudflare.
           | 
           | There was a proposal (BCP38) which said that networks should
           | not allow outbound packets with source IPs which could not
           | originate from that network, but it didn't really get a lot
           | of traction -- mainly due to BGP multihoming, I think.
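The BCP38 idea described above (drop outbound packets whose source address couldn't legitimately originate from the customer's network) is simple to illustrate. The prefix list and function names here are invented for the example:

```python
# Toy illustration of BCP38-style egress filtering at a customer edge:
# only packets sourced from the customer's own assigned prefixes are
# allowed out, so spoofing someone else's address (e.g. Cloudflare's)
# is dropped at the first hop.

from ipaddress import ip_address, ip_network

customer_prefixes = [ip_network("203.0.113.0/24")]  # illustrative

def permit_outbound(src: str) -> bool:
    """Allow the packet only if its source belongs to the customer."""
    return any(ip_address(src) in p for p in customer_prefixes)

assert permit_outbound("203.0.113.55")        # legitimate source
assert not permit_outbound("104.16.133.229")  # spoofed Cloudflare IP
```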
        
             | toast0 wrote:
             | BCP38 has gotten some traction, but it's not super
             | effective until all the major tier-1 ISPs enforce it
             | against their customers. But it's hard to pressure tier-1
              | ISPs; you can't drop connections with them because
              | they're too useful, and anyway, if you did, the traffic
              | would just flow through another tier-1 ISP, since it's
              | not really realistic for tier-1s to prefix-filter
              | peerings between themselves. Besides, the customer
              | that's spoofing could be spoofing sources their ISP
              | legitimately handles, and there's a lot of those.
             | 
             | Some tier-1s do follow BCP38 though, so one day maybe?
             | Still, there's plenty of abuse to be done without spoofing,
             | so while it would be an improvement, it wouldn't usher in
             | an era of no abuse.
        
           | slothsarecool wrote:
           | You do not get attacked from Cloudflare with TCP attacks.
            | Somebody is spoofing the IP header and making it seem like
           | Cloudflare is DDoSing you.
           | 
           | The only way for somebody to DDoS from Cloudflare would be
           | using workers, however, this isn't practical as workers have
            | a very limited IP range.
        
             | fncivivue7 wrote:
        
             | cmeacham98 wrote:
             | The reason people do this, by the way, is because it's
             | common if you're hosting via CF to whitelist their IPs and
             | block the rest. This allows their SYN flood to bypass that.
        
         | [deleted]
        
       | uvdn7 wrote:
       | This is a wonderful article. Thanks for sharing. As always,
       | Cloudflare blog posts do not disappoint.
       | 
       | It's very interesting that they are essentially treating IP
        | addresses as "data". Once you look at the problem through a
        | distributed-systems lens, the solution here can be mapped to
       | distributed systems almost perfectly.
       | 
       | - Replicating a piece of data on every host in the fleet is
       | expensive, but fast and reliable. The compromise is usually to
       | keep one replica in a region; same as how they share a single /32
       | IP address in a region.
       | 
       | - "sending datagram to IP X" is no different than "fetching data
       | X from a distributed system". This is essentially the underlying
       | philosophy of the soft-unicast. Just like data lives in a
        | distributed system/cloud, you no longer know where an IP
        | address is located.
       | 
       | It's ingenious.
       | 
       | They said they don't like stateful NAT, which is understandable.
        | But the load balancer still has to be stateful to perform the
        | routing correctly. It would be an interesting follow-up blog
        | post on how they coordinate port/data movements (moving a
       | port from server A to server B), as it's state management (not
       | very different from moving data in a distributed system again).
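The "moving a port from server A to server B" problem mentioned above is, at its core, an update to a replicated slice map. A hypothetical sketch (all names invented, and real systems would replicate the write before routers act on it):

```python
# Moving a port slice between servers as state management: the
# authoritative owner map is updated, and return traffic follows the
# new owner once the change propagates to the routing layer.

port_slice_owner = {  # (egress_ip, slice_start) -> owning server
    ("203.0.113.1", 4096): "server-a",
}

def move_slice(egress_ip: str, slice_start: int, new_owner: str) -> None:
    """Reassign a port slice to a new server; in a real system this
    write would be durably replicated before taking effect."""
    port_slice_owner[(egress_ip, slice_start)] = new_owner

move_slice("203.0.113.1", 4096, "server-b")
assert port_slice_owner[("203.0.113.1", 4096)] == "server-b"
```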
        
         | remram wrote:
         | I have a lot of trouble mapping your comment to the content of
         | the article. It is about the _egress addresses_ , the ones
         | CloudFlare use as source when fetching from origin servers.
         | Those addresses need to be separated by the region of the end-
         | user ("eyeball"/browser) and the CloudFlare service they are
         | using (CDN or WARP).
         | 
         | The cost they are working around is the cost of IPv4 addresses,
         | versus the combinatorial explosion in their allocation scheme
         | (they need number of services * number of regions * whatever
         | dimension they add next, because IP addresses are nothing like
         | data).
         | 
         | I am not sure where you see data replication in this scheme?
        
           | uvdn7 wrote:
           | It's not meant to be a perfect analogy. The replication
           | analogy is mostly talking about the tradeoff between
           | performance and cost. So it's less about "replicating" the ip
           | addresses (which is not happening). On that front, maybe
           | distribution would be a better term. Instead of storing a
           | single piece of data on a single host (unicast), they are
           | distributing it to a set of hosts.
           | 
           | Overall, it seems like they are treating ip addresses as data
           | essentially, which becomes most obvious when they talk about
           | soft-unicast.
           | 
           | Anyway, I just found it interesting to look at this through
           | this lens.
        
             | majke wrote:
             | "Overall, it seems like they are treating ip addresses as
             | data essentially"
             | 
             | Spot on!
             | 
             | In past:
             | 
             | * /24 per datacenter (BGP), /32 per server (local network)
             | (all 64K ports)
             | 
             | New:
             | 
             | * /24 per continent (group of colos), /32 per colo, port-
             | slice per server
             | 
             | This is totally hierarchical. All we did is build a tech to
             | change the "assignment granularity". Now with this tech we
             | can do... anything we want. We're not tied to BGP, or IP's
             | belonging to servers, or adjacent IP's needing to be
             | nearby.
             | 
             | The cost is the memory cost of global topology. We don't
             | want a global shared-state NAT (each 2 or 4-tuple being
             | replicated globally on all servers). We don't want zero-
             | state (a machine knowing nothing about routing, just BGP
             | does the job). We want to select a reasonable mix. Right
             | now it's /32 per datacenter.... but we can change it if we
             | want and be more, or less specific than that.
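             | 
             | A toy sketch of what that "assignment granularity" buys
             | you (the slice sizes and port numbers below are made up,
             | not Cloudflare's actual layout): once each server owns a
             | fixed port slice of the shared egress IP, any machine can
             | route a returning packet statelessly, with pure
             | arithmetic and no shared connection table.

```python
# Hypothetical sketch of stateless port-slice routing; the constants
# are invented for illustration, not Cloudflare's real configuration.

PORTS_PER_SERVER = 2048    # size of each server's port slice
FIRST_EGRESS_PORT = 10240  # ports below this stay reserved

def owning_server(dst_port: int) -> int:
    """Map a port on the shared /32 egress IP to a server index."""
    if not FIRST_EGRESS_PORT <= dst_port <= 65535:
        raise ValueError("port outside the shared egress range")
    return (dst_port - FIRST_EGRESS_PORT) // PORTS_PER_SERVER

print(owning_server(14500))  # -> 2 (third slice: ports 14336-16383)
```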
        
       | superkuh wrote:
       | Yikes. More cloudflare breakage of the internet model. Pretty
       | soon we might as well all just live within cloudflare's WAN
       | entirely.
        
         | eastdakota wrote:
          | ¯\_(ツ)_/¯
         | 
         | Another perspective is that the connection of an IP to specific
         | content or individuals was a bug of the Internet's original
         | design and thankfully we're finally finding ways to
         | disassociate them.
        
         | AlphaSite wrote:
          | The internet's a set of abstractions; as long as they still
          | implement some common protocols and don't create a walled
         | garden, is there any real social or technical issue with them
         | doing unusual things in their network?
         | 
         | I can totally see an argument against their CDN being too
         | pervasive and problematic for TOR users, but this seems fine
         | IMO.
        
         | wrs wrote:
         | What's breaking the internet model is the internet becoming too
         | popular and running out of addresses. There's nothing specific
         | to Cloudflare here. You're free to do the same thing to
         | conserve your own address space. It's sort of a super-fancy
         | NAT.
        
         | majke wrote:
         | Author here, I know this is a dismissive comment, but I'll bite
         | anyway.
         | 
         | As far as I understand the history of the IP protocol,
         | initially an IP address pointed to a host. (/etc/hosts file
         | seems that way)
         | 
         | Then it was realized a single entity might have multiple
         | network interfaces, and an IP started to point to a network
         | card on a host (a host can have many IP's). Then came all
         | the VRFs, dummy devices, tuntaps, VETHs and containers. I
         | guess an IP is now pointing to a container or VM. But there
         | is more. For performance you can (almost should!) have a
         | unique IP address per NUMA node, or even per logical CPU.
         | 
         | On the modern Internet, a server IP points to a single CPU
         | in a container in a VM on a host.
         | 
         | Then consider Anycast, like 1.1.1.1 or 8.8.8.8. An IP means
         | something else... it means a resource.
         | 
         | On the "client" side we have customer NATs, CG NATs and
         | VPNs. An IP means similarly little.
         | 
         | The IP's are really expensive, so in some cases there is a
         | strong advantage to save them. Take a look at
         | https://blog.cloudflare.com/addressing-agility/
         | 
         | "So, test we did. From a /20 address set, to a /24 and then,
         | from June 2021, to an address set of one /32, and equivalently
         | a /128 (Ao1). It doesn't just work. It really works"
         | 
         | We're able to serve "all cloudflare" from /32.
         | 
         | There is this whole trend of getting denser and denser IP
         | usage. It's not avoidable. It's not "breaking the Internet" in
         | any way more than "NAT's are breaking the Internet". The
         | network evolves, because it has to. And for one, I don't think
         | this is inherently bad.
        
           | superkuh wrote:
           | >It's not avoidable. It's not "breaking the Internet" in any
           | way more than "NAT's are breaking the Internet".
           | 
           | I agree. NATs, particularly the carrier-grade NAT that
           | smartphone users are behind, have broken the internet.
           | They've made it so most people do not have ports and
           | cannot participate in the internet. So now software
           | developers cannot write software that uses the internet
           | (without depending on third parties). This is bad. So is
           | what you've done.
           | 
           | Someday ipv6 will save us.
        
       | remram wrote:
       | TLDR:
       | 
       | > To avoid geofencing issues, we need to choose specific egress
       | addresses tagged with an appropriate country, depending on WARP
       | user location. (...) Instead of having one or two egress IP
       | addresses for each server, now we require dozens, and IPv4
       | addresses aren't cheap.
       | 
       | > Instead of assigning one /32 IPv4 address for each server, we
       | devised a method of assigning a /32 IP per data center, and then
       | sharing it among physical servers (...) splitting an egress IP
       | across servers by a port range.
        
         | majke wrote:
         | Ha, I guess this is one way of summarizing it :) Author here. I
         | wanted to share more subtleties of the design, but maybe I
         | failed.
         | 
         | Indeed, the starting point is sharing IP's across servers with
         | port-ranges.
         | 
         | But there is more:
         | 
         | * awesome performance allowed by anycast.
         | 
         | * ability to route /32 instead of /24 per datacenter.
         | 
         | Generally, with this tech we can have _much_ better IP usage
         | density, without sacrificing reliability or performance. You
         | can call it  "global anycast-based stateless NAT" but that
         | often implies some magic router configuration, which we don't
         | have.
         | 
         | Here's one example of problems we run into - the lack of
         | connectx() syscall on Linux - makes it hard to actually select
         | port range to originate connections from:
         | 
         | https://blog.cloudflare.com/how-to-stop-running-out-of-ephem...
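          | 
          | The usual Linux workaround is bind-before-connect: bind the
          | socket to the shared source IP and an explicit port from
          | your slice, then connect. A rough sketch (using loopback as
          | a stand-in for the shared egress address, with a made-up
          | port slice):

```python
import socket

# Stand-ins: loopback instead of a shared egress IP, and an invented
# port slice; a real deployment would use its assigned values.
EGRESS_IP = "127.0.0.1"
PORT_SLICE = range(10240, 12288)

def connect_from_slice(dst_host: str, dst_port: int) -> socket.socket:
    """Bind-before-connect: take a free source port from our slice."""
    for src_port in PORT_SLICE:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((EGRESS_IP, src_port))    # pin the source 2-tuple
            s.connect((dst_host, dst_port))  # kernel completes 4-tuple
            return s
        except OSError:
            s.close()  # port already taken; try the next one
    raise OSError("egress port slice exhausted")
```

          | With connectx()-style semantics the kernel could pick a
          | free port inside the range atomically; with plain bind()
          | you pay extra syscalls on every collision.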
        
           | chatmasta wrote:
           | I was surprised IPv6 was only briefly mentioned! Is that
           | something you're looking at next, or are you already running
           | an IPv6 egress network?
           | 
           | Of course not every destination is an IPv6 host, so IPv4
           | remains necessary, but at least IPv6 can avoid the need for
           | port slicing, since you can encode the same bucketing
           | information in the IP address itself.
           | 
           | I've seen this idea used as a cool trick [0] to implement a
           | SOCKS proxy that randomizes outbound IPv6 address to be
           | within a publicly routed prefix for the host (commonly a
           | /64).
           | 
           | I guess as long as you need to support IPv4, then port
           | slicing is a requirement and IPv6 won't confer much benefit.
           | (Maybe it could help alleviate port exhaustion if IPv6
           | addresses can use dynamic ports from any slice?)
           | 
           | Either way, thanks for the blog post, I enjoyed it!
           | 
           | [0] https://github.com/blacklanternsecurity/TREVORproxy
        
             | miyuru wrote:
             | I was also interested to know how this was handled for
             | IPv6, but it was only briefly mentioned.
             | 
             | Probably they didn't need to do much work with IPv6, since
             | half of the post is solving IPv4 exhaustion problems.
        
               | chriscappuccio wrote:
               | Cloudflare wants to make money. The IPv6 features can
               | come second as v6 usage increases.
        
               | dknecht wrote:
                | All of Cloudflare's services ship with IPv6 on day 1.
                | IPv6 is not an issue, as we have enough IPv6 space
                | for each machine to have its own IPs.
        
           | pencilcode wrote:
           | Is the geofencing country level only? So if, using warp, I
           | use trip advisor and go and see nearby restaurants it will
           | have no idea of what city I'm in? Guessing that's not so but
           | wondering how it works
        
             | aeyes wrote:
             | This blog post has some info:
             | https://blog.cloudflare.com/geoexit-improving-warp-user-
             | expe...
             | 
             | Warp uses its own set of egress IPs and their geolocation
             | is close to your real location.
        
           | remram wrote:
           | From your article it seemed that your use of anycast was more
           | accident than feature, due to the limit of BGP prefix sizes.
           | If you could route those IPs to their correct destination,
           | you would, you only go to the closest data center and route
           | again because you have no choice.
           | 
           | Maybe this ends up reducing cost on customers though, because
           | the international transit happens in your backbone network
           | rather than on the internet (customer-side).
        
         | elp wrote:
          | In English: they now do carrier-grade NAT.
        
           | cm2187 wrote:
           | well, vanilla NAT really.
        
             | [deleted]
        
       | immibis wrote:
       | This is a horrible way to avoid upgrading the world to IPv6.
        
         | xnyan wrote:
         | The industry will not transition to v6 unless: 1) The cost of
         | not doing so is higher than the cost of sticking with v4.
         | Because of all the numerous clever tricks and products designed
         | to mitigate v4's limitations, the cost argument still favors v4
         | for most people in most situations.
         | 
         | or
         | 
         | 2) We admit that v6 needs to be rethought and rethink it. I
         | understand why v6 does not just increase IP address bits from
         | 32 to 128, but at this point I think everyone has admitted that
         | v6 is simply too difficult for most IT departments to
         | implement. In particular, the complexity of the new assignment
          | schemes like prefix delegation and SLAAC needs to be pared
          | back. Offer a minimum set of features and spin off everything
         | else.
        
         | Animats wrote:
         | I'm surprised that Cloudflare isn't all IPv6 when Cloudflare is
         | the client. That would solve their address problems. Maybe
         | charge more if your servers can't talk IPv6. Or require it for
         | the free tier.
         | 
          | It's useful that they use client-side certificates. (They
          | call this "authenticated origin pull", but it seems to be
          | client-side certs.)
        
           | ec109685 wrote:
            | They also have to egress to third-party servers, since
            | they are a CDN and support things like serverless
            | functions.
        
       ___________________________________________________________________
       (page generated 2022-11-25 23:00 UTC)