[HN Gopher] BPF, XDP, Packet Filters and UDP
___________________________________________________________________
BPF, XDP, Packet Filters and UDP

Author : dochtman
Score : 135 points
Date : 2020-10-21 14:47 UTC (8 hours ago)

(HTM) web link (fly.io)
(TXT) w3m dump (fly.io)

| dochtman wrote:
| So presumably this will also open up avenues for doing QUIC and
| thus HTTP/3 on Fly?
| mrkurt wrote:
| Yep! We have a "Firecracker that accepts QUIC" running with
| this.
|
| People usually want HTTP + TLS handled for them, though. So
| when we ship QUIC + HTTP3 as a first class feature, we'll
| terminate QUIC and give people whatever their app process can
| accept.
| dochtman wrote:
| Any insight into what QUIC/H3 stack you'll be using for the
| proxy?
| jeromegn wrote:
| To be determined. We're hoping to contribute and use what's
| going to come out of hyper's h3 efforts (we use Rust for
| our reverse-proxy). There's not much there yet though:
| https://github.com/hyperium/h3
|
| We're not in a huge hurry to support QUIC / H3 given its
| current adoption. However, our users' apps will be able to
| support it once UDP is fully launched, if they want to.
| anderspitman wrote:
| Are you using a custom reverse proxy? For a recent
| project I started with Caddy but ended up needing some
| functionality it didn't have, and didn't need most of
| what it did have. I'm currently using a custom proxy
| layer, but I'm concerned I might end up having to
| implement more than I want to (I know I'll at least need
| gzip). Curious what your experience at fly has been with
| this.
| eptcyka wrote:
| Unrelated, but could you please expand on how firecracker
| fits within your stack?
| tptacek wrote:
| You could describe our job as "taking Dockerfiles from
| customers and running them globally"; the way we actually
| "run" Docker images is to convert them to root filesystems
| and run them inside Firecracker. Firecracker is the core of
| our stack.
| edf13 wrote:
| Could this be opened up to other (non-HTTP) protocols and
| also over UDP?
| Isamu wrote:
| You can filter pretty much any packet. The downside with using
| TCP is you are looking at individual packets, which may be out
| of order, that sort of thing.
| tptacek wrote:
| For CDN purposes, you assume that _something_ on each end of
| the TCP connection --- something outside of BPF --- is going
| to be running a full TCP. In our case, that something is
| Linux's TCP/IP stack running in a Firecracker VM (we could
| load XDP programs into our VMs, but we don't).
|
| You can do a lot with TCP, and be tolerant to out-of-order
| delivery and drops, just by shuttling the individual packets.
| So we can in fact "cut through" TCP sessions directly to
| Firecracker, avoiding our proxies. We don't, though: our
| "tcp" handlers actually route through our Rust proxies, both
| because that's what they've always done, and because in most
| cases there isn't much of a win to bypassing the proxies,
| which have a lot more load balancing and resiliency logic
| than the BPF-based UDP data path does.
| tptacek wrote:
| Should work for any UDP. We can do the same thing for non-UDP
| protocols, I guess, too.
| bogomipz wrote:
| The post states:
|
| >"You can make any protocol work with a custom proxy. Take DNS:
| your edge servers listen for UDP packets, slap PROXY headers on
| them, relay the packets to worker servers, unwrap them, and
| deliver them to containers. You can intercept all of UDP with
| AF_PACKET sockets, and write the last hop packet that way too to
| fake addresses out. And at first, that's how I implemented this
| for Fly."
|
| This is really interesting. I looked at the linked blog post and
| was hoping there were more implementation details. Does your Fly
| pi-hole use HAProxy and the PROXY headers then? Is the config for
| that available anywhere I could see?
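
The relay step in the passage quoted above (put a small header on each UDP payload at the edge, strip it again on the worker) can be sketched in a few lines of Python. This is only an illustration and is not taken from the thread or the post: the 12-byte header layout and the names `HEADER`, `wrap`, and `unwrap` are assumptions, and this is not HAProxy's PROXY protocol format.

```python
import socket
import struct

# Hypothetical compact header (an assumption, not a real wire format):
# original source IPv4, source port, destination IPv4, destination port.
HEADER = struct.Struct("!4sH4sH")  # 12 bytes, network byte order


def wrap(payload: bytes, src: tuple, dst: tuple) -> bytes:
    """Edge side: prepend the original addresses to the UDP payload."""
    src_ip, src_port = src
    dst_ip, dst_port = dst
    header = HEADER.pack(
        socket.inet_aton(src_ip), src_port,
        socket.inet_aton(dst_ip), dst_port,
    )
    return header + payload


def unwrap(message: bytes) -> tuple:
    """Worker side: recover the original addresses and the payload."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than proxy header")
    src_ip, src_port, dst_ip, dst_port = HEADER.unpack_from(message)
    src = (socket.inet_ntoa(src_ip), src_port)
    dst = (socket.inet_ntoa(dst_ip), dst_port)
    return src, dst, message[HEADER.size:]
```

The worker needs the original source address so that the reply can be sent back as if it came from the address the client first contacted; carrying it in front of the payload is the simplest way to preserve it across the relay.
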
| tptacek wrote:
| No, the Pi-hole example uses the XDP UDP scheme this blog post
| talks about: DNS packets arrive on edge servers, XDP intercepts
| them before they reach the IP stack, puts a proxy header on the
| message (we don't use HAProxy's proxy protocol, to conserve
| space), and relays it out WireGuard; TC BPF attached to the
| WireGuard interface on the other end (the worker server) strips
| off the header, fixes the addresses accordingly, and relays to
| the tap interface for the right worker.
|
| The first cut of this feature I built, without BPF, used
| NFQueue (diverting packets based on iptables rules to
| userspace), did a sockets-based proxy from edge to worker, and
| used a simple raw socket to fix the addresses and write the
| packet to its destination. NFQueue was annoying to work with, I
| looked at BPF filters instead, and ultimately wound up just
| doing the whole thing in BPF.
|
| You don't need to know anything about this to use UDP on
| Fly.io; you can just add UDP ports the same way you'd add TCP
| ports (the `fly.toml` in the Pi-hole blog post shows an
| example).
| bogomipz wrote:
| I see. Thanks for the clarification. I need to read up more
| on XDP Schemas and headers. Might you or anyone else have any
| resources you found helpful?
| tptacek wrote:
| There's not much to know! "XDP" is really just the Linux
| term of art for "BPF running directly off the network
| driver". Your BPF program --- ordinarily, just a C program
| you compiled with clang --- is given a struct with pointers
| to the beginning and end of a packet, and your program can
| return OK, DROP, or REDIRECT, in addition to modifying the
| packet.
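
As a rough illustration of the model described above (a program handed the start and end of a packet that returns a verdict), here is a userspace Python sketch. It is not loadable BPF and not anyone's production code: real XDP programs are usually C compiled with clang. The function names and the "redirect UDP to chosen ports" policy are assumptions for illustration; the verdict values mirror the kernel's `XDP_DROP`, `XDP_PASS` (the "OK" verdict), and `XDP_REDIRECT`.

```python
import struct
from enum import Enum

ETH_HLEN = 14        # Ethernet header length
ETH_P_IP = 0x0800    # EtherType for IPv4
IPPROTO_UDP = 17
UDP_HLEN = 8


class XdpAction(Enum):
    DROP = 1      # mirrors XDP_DROP
    PASS = 2      # mirrors XDP_PASS
    REDIRECT = 4  # mirrors XDP_REDIRECT


def classify(packet: bytes, intercepted_ports: frozenset) -> XdpAction:
    """Return a verdict for one Ethernet frame.

    Every read is preceded by a bounds check against data_end, the same
    discipline the BPF verifier enforces on real XDP programs.
    """
    data_end = len(packet)
    if ETH_HLEN > data_end:
        return XdpAction.PASS
    (ethertype,) = struct.unpack_from("!H", packet, 12)
    if ethertype != ETH_P_IP:
        return XdpAction.PASS
    ip_off = ETH_HLEN
    if ip_off + 20 > data_end:
        return XdpAction.PASS
    ihl = (packet[ip_off] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or packet[ip_off + 9] != IPPROTO_UDP:
        return XdpAction.PASS
    udp_off = ip_off + ihl
    if udp_off + UDP_HLEN > data_end:
        return XdpAction.PASS
    (dport,) = struct.unpack_from("!H", packet, udp_off + 2)
    if dport in intercepted_ports:
        return XdpAction.REDIRECT
    return XdpAction.PASS


def build_udp_frame(dport: int, payload: bytes = b"") -> bytes:
    """Build a minimal Ethernet/IPv4/UDP frame for trying out classify()."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    total_len = 20 + UDP_HLEN + len(payload)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, total_len, 0, 0, 64, IPPROTO_UDP, 0,
        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
    udp = struct.pack("!HHHH", 40000, dport, UDP_HLEN + len(payload), 0)
    return eth + ip + udp + payload
```

For example, `classify(build_udp_frame(53), frozenset({53}))` yields `XdpAction.REDIRECT`, while a frame for any other port, or one truncated before the UDP header, is passed up to the normal stack.
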
|
| The XDP project itself has a pretty excellent tutorial:
|
| https://github.com/xdp-project/xdp-tutorial
| [deleted]
| DSingularity wrote:
| _"If you're just looking to play around with this stuff, by the
| way, I can give you a Dockerfile that will get you a janky build
| environment, which is how I did my BPF development before I
| started using perf, which I couldn't get working under macOS
| Docker"_
|
| Anyone find this?
| tptacek wrote:
| Haven't published it! If nobody else has a good one, I'll post
| mine; the only reason I haven't is that it's janky as fuck (it
| installs extra stuff, and I pull a lot of my own kernel BPF
| header deps in).
| DSingularity wrote:
| I haven't seen one! It would be nice to have one. Btw very
| nice write up.
| bogomipz wrote:
| Yes please do publish it. It would make a great addition to
| the article. Great post by the way.
| keithalewis wrote:
| This article is remarkably well written. The first paragraph lays
| out why you would want to read it, or not. It then presents a
| well documented history of the problems it is solving to
| illustrate the whys and wherefores of the product. Well done!
| Thanks.
| ignoramous wrote:
| So... this post casually outlines how one could go about building
| a _Global Network Load Balancer_ at Google-scale. Amazing!
|
| A few naive questions:
|
| > _You can make any protocol work with a custom proxy. Take DNS:
| your edge servers listen for UDP packets, slap PROXY headers on
| them, relay the packets to worker servers, unwrap them, and
| deliver them to containers._
|
| Curious: Wouldn't SOCKS5 here be a like-for-like replacement for
| PROXY? Why would one choose one over the other?
|
| > _WireGuard doesn't have link-layer headers, and XDP wants it
| to_
|
| Is the gist here that WireGuard doesn't because it is Layer 3?
| And that XDP sits one layer below it?
|
| > _Jason Donenfeld even wrote a patch, but the XDP developers
| were not enthused, just about the concept of XDP running on
| WireGuard at all_
|
| Could someone please explain this? Is it that XDP here didn't
| want to add support to delegate routing onto WireGuard?
|
| > _It's a little hard to articulate how weird it is writing eBPF
| code. You're in a little wrestling match with the verifier_
|
| Would NetMap or Intel's dpdk instead make for a non-enterprising
| choice here? Don't they have a similar profile in terms of
| throughput? I guess, one has to use a userspace TCP/IP stack like
| gVisor's NetStack or LwIP to go with NetMap/dpdk?
|
| > _Those configurations are fed into distributed service
| discovery; our servers listen on changes and, when they occur,
| they update a routing map_
|
| How is this system implemented? Curious because uptime,
| availability, durability, and latency must be of prime importance
| for such a service. Is there a blog about this detailing the
| challenges inherent here? Or, does it use consul/etcd or some
| such out-of-the-box solution?
|
| > _a simple map of addresses to actions and next-hops; the Linux
| bpf(2) system call lets you update these maps on the fly._
|
| Clarification: does this mean the maps are already in a format
| the bpf/2 command understands, or is something else going on
| here?
|
| Thanks.
| [deleted]
| bogomipz wrote:
| I was curious what the fly.io container orchestrator that runs
| this edge architecture is, and whether there were any challenges
| implementing this on it. Cheers.
| [deleted]
| Aaronstotle wrote:
| I really enjoyed this post. As someone who doesn't possess much
| programming prowess, I am fascinated with eBPF/kernel sub-systems
| and I am always eager to learn more. I might have to take the
| author's advice and build an emulator soon.
| austinpena wrote:
| I've had nothing but good experiences using Fly
| otoburb wrote:
| >> _Linux kernel developers quickly come to the same conclusion
| the DTrace people came to 15 years ago: if you're going to have
| a compiler and a kernel-resident VM, you might as well use it for
| everything. So, the seccomp system call filter gets eBPF. Kprobes
| get eBPF. Kernel tracepoints gets eBPF. Userland tracing gets
| eBPF. If it's in the Linux kernel and it's going to be
| programmable (even if it shouldn't be), it's going to be
| programmed with eBPF soon._
|
| Feels like Oprah Winfrey's September 13th, 2004 show: "YOU get a
| car! YOU get a car! And YOU get a car! Everybody gets a car!"[1]
|
| [1] https://www.youtube.com/watch?v=pviYWzu0dzk
___________________________________________________________________
(page generated 2020-10-21 23:00 UTC)