[HN Gopher] Userspace isn't slow, some kernel interfaces are
       ___________________________________________________________________
        
       Userspace isn't slow, some kernel interfaces are
        
       Author : jeffhenson
       Score  : 151 points
       Date   : 2022-12-13 17:25 UTC (5 hours ago)
        
 (HTM) web link (tailscale.com)
 (TXT) w3m dump (tailscale.com)
        
       | rwmj wrote:
        | For something completely different, you might want to look at
        | Unikernel Linux: https://arxiv.org/abs/2206.00789. It lets you
        | run all the code without switching between userspace and the
        | kernel, and call into kernel functions directly (with the usual
        | caveats about the kernel ABI not being stable).
       | 
       | There is a v1 patch posted on LKML, and I think they're hoping to
       | get a v2 patch posted by January. If you are interested in a chat
       | with the team, email me rjones redhat.com.
        
         | bradfitz wrote:
          | Fun! We already have support for running on gokrazy
          | (https://gokrazy.org/), and that's probably where Unikernel
          | Linux would be most applicable for us: when people just want
          | a "Tailscale appliance" image.
         | 
         | I'll email you.
        
           | ignoramous wrote:
           | Hi: Is there any possibility of TSO and GRO working on
           | Android?
        
             | raggi wrote:
              | One of the authors here: it could; all the kernel code is
              | present. Right now the Android SELinux policy blocks some
              | of the necessary ioctls (at least on the Pixel builds I
              | tested).
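              | 
              | For reference, a minimal, illustrative sketch of the kind
              | of ioctl involved, assuming the golang.org/x/sys/unix
              | constants (this is the sort of call an SELinux policy can
              | deny):
              | 
              |   // Hedged sketch: ask the kernel to enable checksum
              |   // and TCP segmentation offloads on a TUN descriptor.
              |   package main
              | 
              |   import "golang.org/x/sys/unix"
              | 
              |   func enableOffloads(tunFd int) error {
              |       flags := unix.TUN_F_CSUM |
              |           unix.TUN_F_TSO4 |
              |           unix.TUN_F_TSO6
              |       // TUNSETOFFLOAD is the ioctl that may be blocked
              |       // by the device's SELinux policy.
              |       return unix.IoctlSetInt(tunFd,
              |           unix.TUNSETOFFLOAD, flags)
              |   }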
        
       | pjdesno wrote:
       | There are a few gotchas with GRO, although I'm not sure they're
        | applicable to WireGuard - in particular, there used to be a line
       | in the kernel vswitch code that dropped a packet if it had been
       | processed by GRO. A while back I spent a long time debugging a
       | network problem caused by that particular "feature"...
        
       | mattpallissard wrote:
        | The title is click-baity. This has next to nothing to do with
        | kernel interfaces and is all about network tuning and
        | encapsulation. I'm not sure why the authors went with that
        | title, since the networking is interesting enough on its own.
        | 
        | Also, the "slow" thing about kernel interfaces (if you aren't
        | doing I/O, which is nearly always the slowest part) usually
        | isn't a given syscall; it's the transition from user to kernel
        | space and back. There's a lot going on there, such as flushing
        | caches and buffers due to security mitigations these days.
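        | 
        | As a rough illustration (a sketch, not from the post): timing a
        | cheap syscall in a tight loop gives a ballpark for that pure
        | round-trip cost on a given machine.
        | 
        |   // Hedged sketch: measure the user->kernel->user transition
        |   // cost by timing a cheap syscall (getpid) in a loop.
        |   package main
        | 
        |   import (
        |       "fmt"
        |       "syscall"
        |       "time"
        |   )
        | 
        |   func main() {
        |       const n = 1_000_000
        |       start := time.Now()
        |       for i := 0; i < n; i++ {
        |           syscall.Getpid() // one real syscall per iteration
        |       }
        |       fmt.Printf("%v per syscall round trip\n",
        |           time.Since(start)/n)
        |   }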
        
         | bradfitz wrote:
         | (Tailscale employee)
         | 
          | We certainly didn't try to make it click-baity. The point of
          | the title is that people assumed Tailscale was slower than
          | kernel WireGuard because the kernel must be intrinsically
          | faster somehow. The point of the blog post is to say, "no,
          | code can run fast on either side... you just have to cross
          | the boundary less." The rest of the post is about how we
          | cross that boundary less, using a less obvious kernel
          | interface.
        
           | cycomanic wrote:
            | Just some feedback: that's not what I expected from the
            | title, and I'd agree with the previous poster that it's a
            | little clickbaity (only mildly, though).
        
           | wpietri wrote:
           | For what it's worth, the combination of the source and the
           | title made sense to me, so I think it's fine as is.
        
           | karmakaze wrote:
           | Thanks for the clarifying reply. I thought most folks who
           | cared knew it was about context switches and not speed on one
           | side vs the other. Now I'm really interested to read the full
           | article.
        
         | jesboat wrote:
          | I disagree. The two main points of the article are "nothing
          | is inherently slow about doing stuff in userland (as shown by
          | the fact that we made a fast implementation)" and "kernel
          | interfaces, i.e. particular methods of boundary crossing, can
          | be (as shown by the fact that the way they made it faster was
          | in large part by doing the boundary crossings better)".
         | 
         | The title gave me a reasonably decent idea of what to expect,
         | and the article delivered.
        
         | raggi wrote:
          | One of the authors here: what I was going for with the title
          | is that singular read/write switching (the before case) is
          | very slow for packet-sized work, and batching (roughly >=64
          | KB) is much faster - it's about amortizing the cost of the
          | transition, as you rightly point out. That's the point the
          | title is making: some interfaces don't provide a way to
          | amortize that cost, and others do!
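          | 
          | A minimal sketch of that amortization effect (illustrative
          | only, not the Tailscale code): pushing the same bytes through
          | a pipe in MTU-sized writes vs. 64 KiB writes, so only the
          | number of kernel crossings changes.
          | 
          |   // Hedged sketch: same payload, different syscall counts.
          |   package main
          | 
          |   import (
          |       "fmt"
          |       "io"
          |       "os"
          |       "time"
          |   )
          | 
          |   func pump(chunk, total int) time.Duration {
          |       r, w, _ := os.Pipe()
          |       go io.Copy(io.Discard, r) // drain in the background
          |       buf := make([]byte, chunk)
          |       start := time.Now()
          |       for sent := 0; sent < total; sent += chunk {
          |           w.Write(buf) // roughly one write syscall per chunk
          |       }
          |       w.Close()
          |       return time.Since(start)
          |   }
          | 
          |   func main() {
          |       const total = 64 << 20 // 64 MiB
          |       fmt.Println("1280 B writes: ", pump(1280, total))
          |       fmt.Println("64 KiB writes: ", pump(64<<10, total))
          |   }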
        
       | sublinear wrote:
        | Similar to how JavaScript isn't slow, networking in general is.
       | :)
        
       | adtac wrote:
        | It'd be interesting to see the benchmark environment's raw
        | ChaCha20Poly1305 throughput (the x/crypto implementation) in
        | the analysis. My hunch is that it's several times the network
        | bandwidth, which would further support the argument.
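        | 
        | A quick sketch of how you might ballpark that number with the
        | same x/crypto package (packet size and iteration count are
        | assumptions, not from the post):
        | 
        |   // Hedged sketch: single-core ChaCha20-Poly1305 seal
        |   // throughput on WireGuard-ish packet sizes.
        |   package main
        | 
        |   import (
        |       "fmt"
        |       "time"
        | 
        |       "golang.org/x/crypto/chacha20poly1305"
        |   )
        | 
        |   func main() {
        |       key := make([]byte, chacha20poly1305.KeySize)
        |       aead, err := chacha20poly1305.New(key)
        |       if err != nil {
        |           panic(err)
        |       }
        |       nonce := make([]byte, chacha20poly1305.NonceSize)
        |       pkt := make([]byte, 1280) // assumed packet size
        |       dst := make([]byte, 0, len(pkt)+aead.Overhead())
        | 
        |       const iters = 200_000
        |       start := time.Now()
        |       for i := 0; i < iters; i++ {
        |           dst = aead.Seal(dst[:0], nonce, pkt, nil)
        |       }
        |       el := time.Since(start)
        |       gbps := float64(len(pkt)*iters) * 8 / el.Seconds() / 1e9
        |       fmt.Printf("%.1f Gb/s on one core, %v per packet\n",
        |           gbps, el/iters)
        |   }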
        
         | wmf wrote:
          | I noticed that very little of the flame graph is crypto,
          | which implies that the system under test could do 20-30 Gbps
          | of ChaCha20Poly1305.
        
           | adtac wrote:
            | Yeah, the flame graphs show ~9% of the time being spent in
            | golang.org/x/crypto/chacha20poly1305, so you're probably
            | right, but flame graphs and throughput aren't always a one-
            | to-one mapping. Flame graphs just tell you where time was
            | spent per packet, but depending on the workload, there are
            | some things in the life of a packet that you can
            | parallelise and some things you can't.
           | 
           | Just thought it'd be interesting to see the actual throughput
           | along with the rest for the benchmarked environment.
        
             | raggi wrote:
              | One of the authors here: yeah, it's very interesting. The
              | flame graphs here don't do a great job of highlighting
              | one aspect of the challenge, which is that crypto fans
              | out across many CPUs. I think the hunch that 20-30 Gbps
              | is attainable (on a fast system) is accurate - it'll take
              | more work to get there.
              | 
              | What's interesting is that the cost of x/crypto on these
              | system classes is prohibitive for serial decoding at 10
              | Gbps. Ballparking with a 1280-byte MTU, you have about
              | 1000 ns to process each packet, and it takes about 2000
              | ns to encrypt one. The fan-out is critical at these
              | levels, and will always introduce its own additional
              | costs: synchronization, memory management, and so on.
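              | 
              | (Spelling out that ballpark: 10 Gb/s divided across
              | 1280-byte packets is roughly 977k packets per second,
              | i.e. about 1 microsecond of budget per packet. At ~2000
              | ns per encryption you need at least two cores on crypto
              | alone, before counting syscalls and memory traffic.)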
        
       | [deleted]
        
       | lost_tourist wrote:
        | If userspace were really exceedingly slow, then we wouldn't
        | bother using it.
        
         | yjftsjthsd-h wrote:
          | Leaving aside the general question (which the sibling comment
          | covers), there's an unwritten qualification of "userspace is
          | generally seen as slow _for drivers_ (network, disk,
          | filesystem)", and... we generally _don't_ bother using it for
          | those things, or at least we try to move the data path into
          | the kernel when we care about performance.
        
         | throwaway09223 wrote:
         | I can explain why this isn't correct.
         | 
         | We have a concept of userspace for safety. Systems without
         | protected memory are very unstable. The tradeoff is speed.
         | 
         | Trading speed for safety is extremely commonplace.
         | 
         | * Every assert or runtime validation
         | 
         | * Every time we guard against speculative execution attacks
         | (enormous perf hits)
         | 
          | * Every time we implement a "safe" runtime like Java
          | 
          | * Process safety models using protected memory
         | 
         | Efficiency is one of many competing concerns in a complex
         | system.
        
       | 0xQSL wrote:
        | Would it be an option to use io_uring to further reduce syscall
        | overhead? Perhaps there's also a way to do zero-copy?
        
         | bradfitz wrote:
         | That was previously explored in
         | https://github.com/tailscale/tailscale/issues/2303 and will
         | probably still happen.
         | 
          | When Josh et al. tried it, they hit some fun kernel bugs on
          | certain kernel versions, and that soured them on it for a
          | bit, knowing it wouldn't be as widely usable as we'd hoped,
          | given what kernels were in common use at the time. It's
          | almost certainly better nowadays.
         | 
         | Hopefully the Go runtime starts doing it instead:
         | https://github.com/golang/go/issues/31908
        
       | majke wrote:
        | I can chime in with some optimizations (Linux).
        | 
        | For normal UDP sockets, UDP_GRO and UDP_SEGMENT can be even
        | faster than sendmmsg/recvmmsg.
        | 
        | In gVisor they decided that read/write on the tun device is
        | slow, so they use PACKET_MMAP on a raw socket instead. AFAIU
        | they just ignore the tap device and run a raw socket on it.
        | Dumping packets from a raw socket is a faster interface than
        | reading from the device itself.
       | 
       | https://github.com/google/gvisor/blob/master/pkg/tcpip/link/...
       | https://github.com/google/gvisor/issues/210
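        | 
        | For the UDP socket side, a rough sketch of what enabling those
        | options can look like, assuming the golang.org/x/sys/unix
        | constants (Linux only; UDP GSO needs 4.18+, UDP GRO needs
        | 5.0+):
        | 
        |   // Hedged sketch: turn on UDP GSO (UDP_SEGMENT) and GRO
        |   // (UDP_GRO) for an existing *net.UDPConn.
        |   package main
        | 
        |   import (
        |       "net"
        | 
        |       "golang.org/x/sys/unix"
        |   )
        | 
        |   func enableUDPOffloads(c *net.UDPConn, seg int) error {
        |       raw, err := c.SyscallConn()
        |       if err != nil {
        |           return err
        |       }
        |       var serr error
        |       err = raw.Control(func(fd uintptr) {
        |           // Kernel splits each large send into seg-sized
        |           // datagrams.
        |           serr = unix.SetsockoptInt(int(fd), unix.SOL_UDP,
        |               unix.UDP_SEGMENT, seg)
        |           if serr != nil {
        |               return
        |           }
        |           // Kernel may hand back coalesced datagrams on
        |           // receive.
        |           serr = unix.SetsockoptInt(int(fd), unix.SOL_UDP,
        |               unix.UDP_GRO, 1)
        |       })
        |       if err != nil {
        |           return err
        |       }
        |       return serr
        |   }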
        
         | Matthias247 wrote:
          | Not only can it be a lot faster; it definitely is.
          | 
          | I did a lot of work on QUIC protocol efficiency improvements
          | over the last 3 years. The use of sendmmsg/recvmmsg yields
          | maybe a 10% efficiency improvement, because it only helps
          | with reducing the system call overhead. Once the data is
          | inside the kernel, these calls just behave like a loop of
          | sendmsg/recvmsg calls.
          | 
          | The syscall overhead, however, isn't the bottleneck - all the
          | other work in the network stack is: looking up routes for
          | each packet, applying iptables rules, applying BPF programs,
          | etc.
          | 
          | Using segmentation offloads means the packets also traverse
          | the rest of the path as a single unit. That can allow for
          | efficiency improvements of somewhere between 200% and 500%,
          | depending on the overall application. It's very much worth
          | looking at GSO/GRO if you are doing anything that requires
          | bulk UDP datagram transmission.
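          | 
          | For reference, a minimal sketch of the sendmmsg-style
          | batching being compared here, via golang.org/x/net/ipv4's
          | WriteBatch (which uses sendmmsg on Linux); the helper name is
          | made up:
          | 
          |   // Hedged sketch: send many UDP datagrams with one
          |   // sendmmsg call per batch. Only the syscall count
          |   // changes; the kernel still processes each datagram
          |   // individually, which is why the win tops out around
          |   // the ~10% mentioned above.
          |   package main
          | 
          |   import (
          |       "net"
          | 
          |       "golang.org/x/net/ipv4"
          |   )
          | 
          |   func sendBatch(c *net.UDPConn, dst *net.UDPAddr,
          |       payloads [][]byte) error {
          | 
          |       pc := ipv4.NewPacketConn(c)
          |       msgs := make([]ipv4.Message, len(payloads))
          |       for i, p := range payloads {
          |           msgs[i] = ipv4.Message{
          |               Buffers: [][]byte{p},
          |               Addr:    dst,
          |           }
          |       }
          |       for len(msgs) > 0 {
          |           n, err := pc.WriteBatch(msgs, 0)
          |           if err != nil {
          |               return err
          |           }
          |           msgs = msgs[n:] // WriteBatch may send a partial batch
          |       }
          |       return nil
          |   }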
        
       ___________________________________________________________________
       (page generated 2022-12-13 23:00 UTC)