[HN Gopher] BPF, XDP, Packet Filters and UDP
       ___________________________________________________________________
        
       BPF, XDP, Packet Filters and UDP
        
       Author : dochtman
       Score  : 135 points
       Date   : 2020-10-21 14:47 UTC (8 hours ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | dochtman wrote:
       | So presumably this will also open up avenues for doing QUIC and
       | thus HTTP/3 on Fly?
        
         | mrkurt wrote:
         | Yep! We have a "Firecracker that accepts QUIC" running with
         | this.
         | 
         | People usually want HTTP + TLS handled for them, though. So
         | when we ship QUIC + HTTP3 as a first class feature, we'll
         | terminate QUIC and give people whatever their app process can
         | accept.
        
           | dochtman wrote:
           | Any insight into what QUIC/H3 stack you'll be using for the
           | proxy?
        
             | jeromegn wrote:
             | To be determined. We're hoping to contribute and use what's
             | going to come out of hyper's h3 efforts (we use Rust for
             | our reverse-proxy). There's not much there yet though:
             | https://github.com/hyperium/h3
             | 
             | We're not in a huge hurry to support QUIC / H3 given its
             | current adoption. However, our users' apps will be able to
             | support it once UDP is fully launched, if they want to.
        
               | anderspitman wrote:
               | Are you using a custom reverse proxy? For a recent
               | project I started with Caddy but ended up needing some
               | functionality it didn't have, and didn't need most of
               | what it did have. I'm currently using a custom proxy
               | layer, but I'm concerned I might end up having to
               | implement more than I want to (I know I'll at least need
               | gzip). Curious what your experience at fly has been with
               | this.
        
           | eptcyka wrote:
           | Unrelated, but could you please expand on how firecracker
           | fits within your stack?
        
             | tptacek wrote:
             | You could describe our job as "taking Dockerfiles from
             | customers and running them globally"; the way we actually
             | "run" Docker images is to convert them to root filesystems
             | and run them inside Firecracker. Firecracker is the core of
             | our stack.
        
       | edf13 wrote:
       | Could this be opened up to other (non-HTTP) protocols, and
       | also over UDP?
        
         | Isamu wrote:
         | You can filter pretty much any packet. The downside with using
         | TCP is you are looking at individual packets, which may be out
         | of order, that sort of thing.
        
           | tptacek wrote:
           | For CDN purposes, you assume that _something_ on each end of
           | the TCP connection --- something outside of BPF --- is going
           | to be running a full TCP. In our case, that something is
           | Linux's TCP/IP stack running in a Firecracker VM (we could
           | load XDP programs into our VMs, but we don't).
           | 
           | You can do a lot with TCP, and be tolerant to out-of-order
           | delivery and drops, just by shuttling the individual packets.
           | So we can in fact "cut through" TCP sessions directly to
           | Firecracker, avoiding our proxies. We don't, though: our
           | "tcp" handlers actually route through our Rust proxies, both
           | because that's what they've always done, and because in most
           | cases there isn't much of a win to bypassing the proxies,
           | which have a lot more load balancing and resiliency logic
           | than the BPF-based UDP data path does.
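
A minimal sketch of why shuttling individual packets can tolerate reordering and drops: if the forwarding decision depends only on fields carried in every packet, each packet of a connection lands on the same backend no matter what order it arrives in. The `flow` struct, the FNV-1a hash, and `pick_backend` are assumptions made for illustration; this is not Fly's data path, which (as noted above) sends TCP through the Rust proxies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The fields that identify a TCP or UDP flow (host byte order here). */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* FNV-1a over a byte range, continuing from hash state h. */
static uint32_t fnv1a(uint32_t h, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Choose a backend index from the 4-tuple alone. No per-connection state is
 * kept, so out-of-order or retransmitted packets map to the same backend. */
uint32_t pick_backend(const struct flow *f, uint32_t n_backends)
{
    assert(n_backends > 0);
    /* Serialize field by field so struct padding never enters the hash. */
    const uint8_t bytes[12] = {
        (uint8_t)(f->src_ip >> 24), (uint8_t)(f->src_ip >> 16),
        (uint8_t)(f->src_ip >> 8),  (uint8_t)f->src_ip,
        (uint8_t)(f->dst_ip >> 24), (uint8_t)(f->dst_ip >> 16),
        (uint8_t)(f->dst_ip >> 8),  (uint8_t)f->dst_ip,
        (uint8_t)(f->src_port >> 8), (uint8_t)f->src_port,
        (uint8_t)(f->dst_port >> 8), (uint8_t)f->dst_port,
    };
    return fnv1a(2166136261u, bytes, sizeof bytes) % n_backends;
}
```

The trade-off is the one the comment names: a stateless mapping like this has none of the load balancing and resiliency logic a full proxy provides, and the TCP state machine still has to run somewhere on each end.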
        
         | tptacek wrote:
         | Should work for any UDP. We can do the same thing for non-UDP
         | protocols, I guess, too.
        
       | bogomipz wrote:
       | The post states:
       | 
       | >"You can make any protocol work with a custom proxy. Take DNS:
       | your edge servers listen for UDP packets, slap PROXY headers on
       | them, relay the packets to worker servers, unwrap them, and
       | deliver them to containers. You can intercept all of UDP with
       | AF_PACKET sockets, and write the last hop packet that way too to
       | fake addresses out. And at first, that's how I implemented this
       | for Fly."
       | 
       | This is really interesting. I looked at the linked blog post and
       | was hoping there were more implementation details. Does your Fly
       | Pi-hole use HAProxy and the PROXY headers then? Is the config for
       | that available anywhere I could see?
        
         | tptacek wrote:
         | No, the Pi-hole example uses the XDP UDP scheme this blog post
         | talks about: DNS packets arrive on edge servers, XDP intercepts
         | them before they reach the IP stack, puts a proxy header on the
         | message (we don't use HAProxy's proxy protocol, to conserve
         | space), and relays it out WireGuard; TC BPF attached to the
         | WireGuard interface on the other end (the worker server) strips
         | off the header, fixes the addresses accordingly, and relays to
         | the tap interface for the right worker.
         | 
         | The first cut of this feature I built, without BPF, used
         | NFQueue (diverting packets based on iptables rules to
         | userspace), did a sockets-based proxy from edge to worker, and
         | used a simple raw socket to fix the addresses and write the
         | packet to its destination. NFQueue was annoying to work with, I
         | looked at BPF filters instead, and ultimately wound up just
         | doing the whole thing in BPF.
         | 
         | You don't need to know anything about this to use UDP on
         | Fly.io; you can just add UDP ports the same way you'd add TCP
         | ports (the `fly.toml` in the Pi-hole blog post shows an
         | example).
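
The "put a header on at the edge, strip it at the worker" step described above can be sketched in plain userspace C. The 8-byte layout here (client IPv4 address, source port, destination port, all big-endian) is hypothetical; the thread only says the real header is more compact than HAProxy's PROXY protocol, not what it contains. `encap` and `decap` are illustrative names, not functions from any Fly code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN 8  /* hypothetical: 4-byte IPv4 address + two 2-byte ports */

/* What the worker needs in order to restore the original addresses. */
struct orig_addr {
    uint32_t src_ip;    /* client address, host byte order */
    uint16_t src_port;
    uint16_t dst_port;
};

/* Edge side: prepend the header to the UDP payload.
 * Returns the total length written, or 0 if `out` is too small. */
size_t encap(const struct orig_addr *a, const uint8_t *payload, size_t len,
             uint8_t *out, size_t out_cap)
{
    assert(a && payload && out);
    if (out_cap < HDR_LEN || len > out_cap - HDR_LEN)
        return 0;
    out[0] = (uint8_t)(a->src_ip >> 24);
    out[1] = (uint8_t)(a->src_ip >> 16);
    out[2] = (uint8_t)(a->src_ip >> 8);
    out[3] = (uint8_t)a->src_ip;
    out[4] = (uint8_t)(a->src_port >> 8);
    out[5] = (uint8_t)a->src_port;
    out[6] = (uint8_t)(a->dst_port >> 8);
    out[7] = (uint8_t)a->dst_port;
    memcpy(out + HDR_LEN, payload, len);
    return HDR_LEN + len;
}

/* Worker side: read the header back and return a pointer to the payload.
 * Returns NULL if the message is too short to hold a header. */
const uint8_t *decap(const uint8_t *msg, size_t len,
                     struct orig_addr *a, size_t *payload_len)
{
    assert(msg && a && payload_len);
    if (len < HDR_LEN)
        return NULL;
    a->src_ip = (uint32_t)msg[0] << 24 | (uint32_t)msg[1] << 16 |
                (uint32_t)msg[2] << 8 | (uint32_t)msg[3];
    a->src_port = (uint16_t)(msg[4] << 8 | msg[5]);
    a->dst_port = (uint16_t)(msg[6] << 8 | msg[7]);
    *payload_len = len - HDR_LEN;
    return msg + HDR_LEN;
}
```

In the setup described above the same two steps run in BPF (XDP on the edge, TC on the worker's WireGuard interface) rather than in a userspace process, and the worker then rewrites the packet's addresses from the recovered values before delivering it.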
        
           | bogomipz wrote:
           | I see. Thanks for the clarification. I need to read up more
           | on XDP Schemas and headers. Might you or anyone else have any
           | resources you found helpful?
        
             | tptacek wrote:
             | There's not much to know! "XDP" is really just the Linux
             | term of art for "BPF running directly off the network
             | driver". Your BPF program --- ordinarily, just a C program
             | you compiled with clang --- is given a struct with pointers
             | to the beginning and end of a packet, and your program can
             | return OK, DROP, or REDIRECT, in addition to modifying the
             | packet.
             | 
             | The XDP project itself has a pretty excellent tutorial:
             | 
             | https://github.com/xdp-project/xdp-tutorial
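
To make the "pointers to the beginning and end of a packet" model concrete, here is a userspace mock in ordinary C. It is not a loadable XDP program: `pkt_ctx` and the `VERDICT_*` names are local stand-ins for the kernel's context struct and its `XDP_PASS`, `XDP_DROP`, and `XDP_REDIRECT` return codes, and the policy (redirect IPv4/UDP to one port, drop truncated packets, pass the rest) is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel's XDP_PASS / XDP_DROP / XDP_REDIRECT. */
enum verdict { VERDICT_PASS, VERDICT_DROP, VERDICT_REDIRECT };

/* Stand-in for the context handed to the program: the packet's bounds. */
struct pkt_ctx {
    uint8_t *data;      /* first byte of the packet */
    uint8_t *data_end;  /* one past the last byte */
};

#define ETH_HLEN       14
#define ETHERTYPE_IPV4 0x0800
#define PROTO_UDP      17

/* Redirect IPv4/UDP packets addressed to want_port, drop packets that are
 * too short to parse, and pass everything else up the stack untouched. */
enum verdict classify(const struct pkt_ctx *ctx, uint16_t want_port)
{
    assert(ctx && ctx->data <= ctx->data_end);
    const uint8_t *p = ctx->data;
    const uint8_t *end = ctx->data_end;

    /* Every read is preceded by a bounds check against `end`; in real BPF
     * the verifier rejects programs that skip checks like these. */
    if (end - p < ETH_HLEN)
        return VERDICT_DROP;
    uint16_t ethertype = (uint16_t)(p[12] << 8 | p[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return VERDICT_PASS;
    p += ETH_HLEN;

    if (end - p < 20)                       /* minimum IPv4 header */
        return VERDICT_DROP;
    size_t ihl = (size_t)(p[0] & 0x0f) * 4; /* header length in bytes */
    if (ihl < 20 || (size_t)(end - p) < ihl)
        return VERDICT_DROP;
    if (p[9] != PROTO_UDP)
        return VERDICT_PASS;
    p += ihl;

    if (end - p < 8)                        /* UDP header */
        return VERDICT_DROP;
    uint16_t dport = (uint16_t)(p[2] << 8 | p[3]);
    return dport == want_port ? VERDICT_REDIRECT : VERDICT_PASS;
}
```

A real program would take the kernel's own context type, be compiled with clang for the BPF target, and have to satisfy the verifier; the tutorial linked above covers that toolchain.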
        
       | DSingularity wrote:
       | _"If you're just looking to play around with this stuff, by the
       | way, I can give you a Dockerfile that will get you a janky build
       | environment, which is how I did my BPF development before I
       | started using perf, which I couldn't get working under macOS
       | Docker"_
       | 
       | Anyone find this?
        
         | tptacek wrote:
         | Haven't published it! If nobody else has a good one, I'll post
         | mine; the only reason I haven't is that it's janky as fuck (it
         | installs extra stuff, and I pull a lot of my own kernel BPF
         | header deps in).
        
           | DSingularity wrote:
           | I haven't seen one! It would be nice to have one. Btw very
           | nice write up.
        
           | bogomipz wrote:
           | Yes please do publish it. It would make a great addition to
           | the article. Great post by the way.
        
       | keithalewis wrote:
       | This article is remarkably well written. The first paragraph lays
       | out why you would want to read it, or not. It then presents a
       | well documented history of the problems it is solving to
       | illustrate the whys and wherefores of the product. Well done!
       | Thanks.
        
       | ignoramous wrote:
       | So... this post casually outlines how one could go about building a
       | _Global Network Load Balancer_ at Google-scale. Amazing!
       | 
       | A few naive questions:
       | 
       | > _You can make any protocol work with a custom proxy. Take DNS:
       | your edge servers listen for UDP packets, slap PROXY headers on
       | them, relay the packets to worker servers, unwrap them, and
       | deliver them to containers._
       | 
       | Curious: Wouldn't SOCKS5 here be a like-for-like replacement for
       | PROXY? Why would one choose one over the other?
       | 
       | > _WireGuard doesn't have link-layer headers, and XDP wants it
       | to_
       | 
       | Is the gist here that WireGuard doesn't because it is Layer 3?
       | And that XDP sits one layer below it?
       | 
       | > _Jason Donenfeld even wrote a patch, but the XDP developers
       | were not enthused, just about the concept of XDP running on
       | WireGuard at all_
       | 
       | Could someone please explain this? Is it that XDP here didn't
       | want to add support for delegating routing to WireGuard?
       | 
       | > _It's a little hard to articulate how weird it is writing eBPF
       | code. You're in a little wrestling match with the verifier_
       | 
       | Would NetMap or Intel's dpdk instead make for a non-enterprising
       | choice here? Don't they have a similar profile in terms of
       | throughput? I guess, one has to use a userspace TCP/IP stack like
       | gVisor's NetStack or LwIP to go with NetMap/dpdk?
       | 
       | > _Those configurations are fed into distributed service
       | discovery; our servers listen on changes and, when they occur,
       | they update a routing map_
       | 
       | How is this system implemented? Curious because uptime,
       | availability, durability, and latency must be of prime importance
       | for such a service. Is there a blog about this detailing the
       | challenges inherent here? Or, does it use consul/etcd or some
       | such out-of-the-box solution?
       | 
       | > _a simple map of addresses to actions and next-hops; the Linux
       | bpf(2) system call lets you update these maps on the fly._
       | 
       | Clarification: does this mean the maps are already in a format
       | the bpf(2) call understands, or is something else going on
       | here?
       | 
       | Thanks.
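
On the last question: BPF maps are kernel-resident key/value stores with fixed-size keys and values, and bpf(2) has commands such as `BPF_MAP_UPDATE_ELEM`, `BPF_MAP_LOOKUP_ELEM`, and `BPF_MAP_DELETE_ELEM` for changing them from userspace while the program keeps running. The plain-C mock below mimics that update/lookup/delete shape for "a simple map of addresses to actions and next-hops". The key and value layouts, the `ACT_*` names, and the `route_*` functions are assumptions for illustration; the article does not give Fly's actual map format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum action { ACT_PASS, ACT_DROP, ACT_FORWARD };

/* Hypothetical key: the public address and port a packet was sent to. */
struct route_key { uint32_t dst_ip; uint16_t dst_port; };
/* Hypothetical value: what to do, and where to send it next. */
struct route_val { enum action action; uint32_t next_hop; };

#define MAX_ROUTES 64

/* Userspace stand-in for a fixed-capacity BPF hash map. */
struct route_map {
    struct route_key keys[MAX_ROUTES];
    struct route_val vals[MAX_ROUTES];
    size_t used;
};

/* Index of k, or m->used if k is absent. */
static size_t find(const struct route_map *m, struct route_key k)
{
    for (size_t i = 0; i < m->used; i++)
        if (m->keys[i].dst_ip == k.dst_ip && m->keys[i].dst_port == k.dst_port)
            return i;
    return m->used;
}

/* Insert or overwrite. Returns 0, or -1 when the map is full. */
int route_update(struct route_map *m, struct route_key k, struct route_val v)
{
    assert(m);
    size_t i = find(m, k);
    if (i == m->used) {
        if (m->used == MAX_ROUTES)
            return -1;
        m->keys[m->used++] = k;
    }
    m->vals[i] = v;
    return 0;
}

/* Returns the stored value, or NULL when the key is absent. */
const struct route_val *route_lookup(const struct route_map *m, struct route_key k)
{
    assert(m);
    size_t i = find(m, k);
    return i < m->used ? &m->vals[i] : NULL;
}

/* Remove k by moving the last entry into its slot. Returns 0, or -1 if absent. */
int route_delete(struct route_map *m, struct route_key k)
{
    assert(m);
    size_t i = find(m, k);
    if (i == m->used)
        return -1;
    m->used--;
    m->keys[i] = m->keys[m->used];
    m->vals[i] = m->vals[m->used];
    return 0;
}
```

In the arrangement the article describes, the process listening to service discovery would play the role of the caller of `route_update`, and the packet-path program would only ever do lookups.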
        
       | bogomipz wrote:
       | I was curious what the fly.io container orchestrator is that
       | runs this edge architecture, and whether there were any
       | challenges implementing this on it. Cheers.
        
       | Aaronstotle wrote:
       | I really enjoyed this post. As someone who doesn't possess much
       | programming prowess, I am fascinated with eBPF/kernel sub-systems
       | and I am always eager to learn more. I might have to take the
       | author's advice and build an emulator soon.
        
       | austinpena wrote:
       | I've had nothing but good experiences using Fly
        
       | otoburb wrote:
       | >> _Linux kernel developers quickly come to the same conclusion
       | the DTrace people came to 15 years ago: if you're going to have
       | a compiler and a kernel-resident VM, you might as well use it for
       | everything. So, the seccomp system call filter gets eBPF. Kprobes
       | get eBPF. Kernel tracepoints gets eBPF. Userland tracing gets
       | eBPF. If it's in the Linux kernel and it's going to be
       | programmable (even if it shouldn't be), it's going to be
       | programmed with eBPF soon._
       | 
       | Feels like Oprah Winfrey's September 13th, 2004 show: "YOU get a
       | car! YOU get a car! And YOU get a car! Everybody gets a car!"[1]
       | 
       | [1] https://www.youtube.com/watch?v=pviYWzu0dzk
        
       ___________________________________________________________________
       (page generated 2020-10-21 23:00 UTC)