[HN Gopher] Userspace isn't slow, some kernel interfaces are
___________________________________________________________________

Userspace isn't slow, some kernel interfaces are

Author : jeffhenson
Score  : 151 points
Date   : 2022-12-13 17:25 UTC (5 hours ago)

(HTM) web link (tailscale.com)
(TXT) w3m dump (tailscale.com)

| rwmj wrote:
| For something completely different, you might want to look at
| Unikernel Linux: https://arxiv.org/abs/2206.00789. You could run
| all the code without switching between userspace and the kernel,
| and you can call into kernel functions directly (with the usual
| caveats about the kernel ABI not being stable).
|
| There is a v1 patch posted on LKML, and I think they're hoping
| to get a v2 patch posted by January. If you are interested in a
| chat with the team, email me: rjones redhat.com.
|
| bradfitz wrote:
| Fun! We have support for running on gokrazy
| (https://gokrazy.org/) already, and that's probably where
| Unikernel Linux is most applicable for us, for when people just
| want a "Tailscale appliance" image.
|
| I'll email you.
|
| ignoramous wrote:
| Hi: Is there any possibility of TSO and GRO working on Android?
|
| raggi wrote:
| One of the authors here: it could; all the kernel code is
| present. Right now the Android SELinux policy blocks some of
| the necessary ioctls (at least on the Pixel builds I tested).
|
| pjdesno wrote:
| There are a few gotchas with GRO, although I'm not sure they're
| applicable to WireGuard - in particular, there used to be a
| line in the kernel vswitch code that dropped a packet if it had
| been processed by GRO. A while back I spent a long time
| debugging a network problem caused by that particular
| "feature"...
|
| mattpallissard wrote:
| Title is clickbait-y. This has next to nothing to do with
| kernel interfaces and is all about network tuning and
| encapsulation. Not sure why the authors went with the title, as
| networking is interesting enough.
|
| Also, the "slow" thing about kernel interfaces (if you aren't
| doing IO, which is nearly always the slowest thing) usually
| isn't a given syscall; it's the transition from user to kernel
| space and back. There is a lot going on there, such as flushing
| caches and buffers due to security concerns these days.
|
| bradfitz wrote:
| (Tailscale employee)
|
| We certainly didn't try to make it clickbait-y. The point of
| the title is that people assume Tailscale is slower than kernel
| WireGuard because the kernel must be intrinsically faster
| somehow. The point of the blog post is to say, "no, code can
| run fast on either side... you just have to cross the boundary
| less." The blog post is all about how we then cross that
| boundary less, using a less obvious kernel interface.
|
| cycomanic wrote:
| Just some feedback: that's not what I expected from the title,
| and I would agree with the previous poster that the title is a
| little clickbait-y (quite minor, though).
|
| wpietri wrote:
| For what it's worth, the combination of the source and the
| title made sense to me, so I think it's fine as is.
|
| karmakaze wrote:
| Thanks for the clarifying reply. I thought most folks who cared
| knew it was about context switches and not speed on one side vs
| the other. Now I'm really interested to read the full article.
|
| jesboat wrote:
| I disagree. The two main points of the article are "nothing is
| inherently slow about doing stuff in userland (as shown by the
| fact that we made a fast implementation)" and "kernel
| interfaces, e.g. particular methods of boundary crossing, can
| be (as shown by the fact that the way they made it faster was
| in large part by doing the boundary crossings better)".
|
| The title gave me a reasonably decent idea of what to expect,
| and the article delivered.
|
| raggi wrote:
| One of the authors here: What I was going for with the title is
| that singular read/write switching (the before case) is very
| slow for packet-sized work, and batching (~>=64kb) is much
| faster - it's about amortizing the cost of the transition, as
| you rightly point out. That's the point the title is making -
| some interfaces do not provide the ability to amortize that
| cost, and others do!
|
| sublinear wrote:
| Similar to how JavaScript isn't slow, networking in general
| is. :)
|
| adtac wrote:
| It'd be interesting to see the benchmark environment's raw
| ChaCha20Poly1305 throughput (the x/crypto implementation) in
| the analysis. My hunch is it's several times greater than the
| network's, which would further support the argument.
|
| wmf wrote:
| I noticed that very little of the flame graph is crypto, which
| implies that the system under test could do 20-30 Gbps of
| ChaCha20Poly1305.
|
| adtac wrote:
| Yeah, the flame graphs show ~9% of the time being spent in
| golang.org/x/crypto/chacha20poly1305, so you're probably right,
| but flame graphs and throughput aren't always a one-to-one
| mapping. Flame graphs just tell you where time was spent per
| unit packet, but depending on the workload, there are some
| things in the life of a packet that you can parallelise and
| some things you can't.
|
| Just thought it'd be interesting to see the actual throughput
| along with the rest for the benchmarked environment.
|
| raggi wrote:
| One of the authors here: yeah, it's very interesting. The flame
| graphs here don't do a great job of highlighting an aspect of
| the challenge, which is that crypto fans out across many CPUs.
| I think the hunch that 20-30gbps is attainable (on a fast
| system) is accurate - it'll take more work to get there.
|
| What's interesting is that the cost of x/crypto on these system
| classes is prohibitive for serial decoding at 10gbps. I was
| ballparking with a 1280-byte MTU: a 1280-byte packet is ~10,000
| bits, so at 10gbps you have about 1000ns to process each
| packet, and it takes about 2000ns to encrypt one. The fan-out
| is critical at these levels, and will always introduce its own
| additional costs, with synchronization, memory management and
| so on.
|
| [deleted]
|
| lost_tourist wrote:
| If userspace were really exceedingly slow then we wouldn't
| bother using it.
|
| yjftsjthsd-h wrote:
| Leaving aside the general question (which a sibling comment
| covers), there's an unwritten qualification of "userspace is
| generally seen as slow _for drivers_ (network, disk,
| filesystem)", and... we generally _don't_ bother using it for
| those things, or at least we try to move the data path into the
| kernel when we care about performance.
|
| throwaway09223 wrote:
| I can explain why this isn't correct.
|
| We have a concept of userspace for safety. Systems without
| protected memory are very unstable. The tradeoff is speed.
|
| Trading speed for safety is extremely commonplace:
|
| * Every assert or runtime validation
|
| * Every time we guard against speculative execution attacks
|   (enormous perf hits)
|
| * Every time we implement a "safe" runtime like Java
|
| * Process safety models using protected memory
|
| Efficiency is one of many competing concerns in a complex
| system.
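
As a rough sanity check on the ChaCha20Poly1305 numbers discussed
above, here is a minimal single-core throughput probe against the
same golang.org/x/crypto/chacha20poly1305 package. This is an
illustrative sketch, not the article's benchmark: the 1280-byte
packet size follows raggi's ballpark, the iteration count is
arbitrary, and the fixed all-zero nonce is acceptable only because
nothing is transmitted.

    package main

    import (
        "crypto/rand"
        "fmt"
        "time"

        "golang.org/x/crypto/chacha20poly1305"
    )

    func main() {
        key := make([]byte, chacha20poly1305.KeySize) // 32-byte key
        if _, err := rand.Read(key); err != nil {
            panic(err)
        }
        aead, err := chacha20poly1305.New(key)
        if err != nil {
            panic(err)
        }

        const pktSize = 1280 // MTU from the ballpark above
        const rounds = 1 << 20
        nonce := make([]byte, aead.NonceSize()) // fixed nonce: benchmark only!
        pkt := make([]byte, pktSize)
        out := make([]byte, 0, pktSize+aead.Overhead())

        start := time.Now()
        for i := 0; i < rounds; i++ {
            out = aead.Seal(out[:0], nonce, pkt, nil)
        }
        nsPerPkt := float64(time.Since(start).Nanoseconds()) / rounds

        // bits per nanosecond is numerically Gbit/s
        fmt.Printf("%.0f ns/packet, ~%.1f Gbps on one core\n",
            nsPerPkt, float64(pktSize*8)/nsPerPkt)
    }
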
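To make the amortization point from raggi's comment above concrete,
here is a toy comparison, an assumption-laden sketch rather than a
real benchmark: /dev/null stands in for a TUN fd, so the measured
difference is almost pure user/kernel boundary-crossing cost rather
than real I/O.

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    func main() {
        f, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        buf := make([]byte, 64<<10) // one 64 KiB batch
        const rounds = 10000

        // One write(2) per 1 KiB "packet": 64 boundary crossings
        // per batch.
        start := time.Now()
        for i := 0; i < rounds; i++ {
            for off := 0; off < len(buf); off += 1 << 10 {
                if _, err := f.Write(buf[off : off+1<<10]); err != nil {
                    panic(err)
                }
            }
        }
        perPacket := time.Since(start)

        // One write(2) per 64 KiB batch: a single boundary crossing.
        start = time.Now()
        for i := 0; i < rounds; i++ {
            if _, err := f.Write(buf); err != nil {
                panic(err)
            }
        }
        batched := time.Since(start)

        fmt.Printf("64x1KiB writes: %v   1x64KiB write: %v\n",
            perPacket, batched)
    }
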
| 0xQSL wrote:
| Would it be an option to use io_uring to further reduce syscall
| overhead? Perhaps there's also a way to do zero-copy?
|
| bradfitz wrote:
| That was previously explored in
| https://github.com/tailscale/tailscale/issues/2303 and will
| probably still happen.
|
| When Josh et al tried it, they hit some fun kernel bugs on
| certain kernel versions, and that soured them on it for a bit,
| knowing it wouldn't be as widely usable as we'd hoped based on
| what kernels were in common use at the time. It's almost
| certainly better nowadays.
|
| Hopefully the Go runtime starts doing it instead:
| https://github.com/golang/go/issues/31908
|
| majke wrote:
| I can chime in with some optimizations (Linux).
|
| For normal UDP sockets, UDP_GRO and UDP_SEGMENT can be even
| faster than sendmmsg/recvmmsg.
|
| In gVisor they decided that read/write from tun is slow, so
| they did PACKET_MMAP on a raw socket instead. AFAIU they just
| ignore the tap device and run a raw socket on it. Dumping
| packets from a raw socket is a faster interface than the device
| itself.
|
| https://github.com/google/gvisor/blob/master/pkg/tcpip/link/...
| https://github.com/google/gvisor/issues/210
|
| Matthias247 wrote:
| It can not only be a lot faster, it definitely is.
|
| I did a lot of work on QUIC protocol efficiency improvements in
| the last 3 years. The use of sendmmsg/recvmmsg yields maybe a
| 10% efficiency improvement, because it only helps with reducing
| the system call overhead. Once the data is inside the kernel,
| these calls just behave like a loop of sendmsg/recvmsg calls.
|
| The syscall overhead, however, isn't the bottleneck - all the
| other work in the network stack is. E.g. looking up routes for
| each packet, applying iptables rules, applying BPF calls, etc.
|
| Using segmentation offloads means the packets will also
| traverse the remaining path as a single unit. This can allow
| for efficiency improvements of somewhere between 200% and 500%
| depending on the overall application. It's vastly useful to
| look at GSO/GRO if you are doing anything which requires bulk
| UDP datagram transmission.
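
To illustrate the UDP_SEGMENT (GSO) mechanism majke and Matthias247
describe, here is a minimal Linux-only sketch in Go. The socket
option and constants are real golang.org/x/sys/unix identifiers;
the loopback destination, port, and segment count are placeholders,
and error handling is abbreviated.

    package main

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Placeholder destination; in real use this is the peer.
        conn, err := net.DialUDP("udp4", nil,
            &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 9999})
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        raw, err := conn.SyscallConn()
        if err != nil {
            panic(err)
        }

        // Ask the kernel (Linux >= 4.18) to split large sends into
        // 1280-byte datagrams after traversing the stack once.
        const segSize = 1280
        var serr error
        if err := raw.Control(func(fd uintptr) {
            serr = unix.SetsockoptInt(int(fd), unix.SOL_UDP,
                unix.UDP_SEGMENT, segSize)
        }); err != nil || serr != nil {
            panic("setting UDP_SEGMENT failed")
        }

        // One syscall and one stack traversal; the kernel fans this
        // out into 32 datagrams on the wire.
        payload := make([]byte, 32*segSize)
        if _, err := conn.Write(payload); err != nil {
            panic(err)
        }
    }
___________________________________________________________________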