[HN Gopher] Lord of the io_uring: io_uring tutorial, examples an...
       ___________________________________________________________________
        
       Lord of the io_uring: io_uring tutorial, examples and reference
        
       Author : shuss
       Score  : 216 points
       Date   : 2020-05-10 13:49 UTC (9 hours ago)
        
 (HTM) web link (unixism.net)
 (TXT) w3m dump (unixism.net)
        
       | jra_samba wrote:
       | io_uring still has its wrinkles.
       | 
        | We are scrambling right now to fix a problem due to a change
        | in the behavior io_uring exposes to user-space in later
        | kernels.
       | 
       | Turns out that in earlier kernels (Ubuntu 19.04 5.3.0-51-generic
       | #44-Ubuntu SMP) io_uring will not return short reads/writes
       | (that's where you ask for e.g. 8k, but there's only 4k in the
       | buffer cache, so the call doesn't signal as complete and blocks
       | until all 8k has been transferred). In later kernels (not sure
       | when the behavior changed, but the one shipped with Fedora 32 has
       | the new behavior) io_uring returns partial (short) reads to user
       | space. e.g. You ask for 8k but there's only 4k in the buffer
       | cache, so the call signals complete with a return of only 4k
       | read, not the 8k you asked for.
       | 
        | Userspace code now has to cope with this where it didn't
        | before. You could argue (and kernel developers did :-) that
        | this was always possible, so user code needs to be aware of
        | it. But it didn't use to do that :-). Change for user space
        | is _bad_, mkay :-).
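        | 
        | As an illustration (a hedged sketch with a made-up helper,
        | not our actual smbd code), a userspace re-read loop on top
        | of liburing that copes with short read completions could
        | look something like this, assuming an already initialized
        | ring and a 5.6+ kernel for io_uring_prep_read() (older
        | kernels would use io_uring_prep_readv() with an iovec):
        | 
        |   #include <errno.h>
        |   #include <liburing.h>
        |   #include <sys/types.h>
        |   
        |   /* Keep re-submitting until 'len' bytes, EOF or an error. */
        |   static ssize_t read_fully(struct io_uring *ring, int fd,
        |                             char *buf, size_t len, off_t off)
        |   {
        |       size_t done = 0;
        |       while (done < len) {
        |           struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        |           if (!sqe)
        |               return -EBUSY;
        |           io_uring_prep_read(sqe, fd, buf + done,
        |                              len - done, off + done);
        |           io_uring_submit(ring);
        |   
        |           struct io_uring_cqe *cqe;
        |           int ret = io_uring_wait_cqe(ring, &cqe);
        |           if (ret < 0)
        |               return ret;
        |           int res = cqe->res;
        |           io_uring_cqe_seen(ring, cqe);
        |           if (res < 0)
        |               return res;   /* I/O error (-errno) */
        |           if (res == 0)
        |               break;        /* EOF */
        |           done += res;      /* short read: go around again */
        |       }
        |       return done;
        |   }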
        
         | jra_samba wrote:
         | It was really interesting how this was found.
         | 
         | A user started describing file corruption when copying to/from
         | Windows with the io_uring VFS module loaded.
         | 
         | Tests using the Linux kernel cifsfs client and the Samba
         | libsmbclient libraries/smbclient user-space transfer utility
         | couldn't reproduce the problem, neither could running Windows
         | against Samba on Ubuntu 19.04.
         | 
         | What turned out to be happening was a combination of things.
          | Firstly, the kernel changed so that an SMB2_READ request
          | against Samba with io_uring loaded was _sometimes_ hitting
          | a short read, where some of the file data was already in
          | the buffer cache, so io_uring now returned a short read to
          | smbd.
         | 
          | We returned this to the client, as in the SMB2 protocol it
          | isn't an error to return a short read; the client is
          | supposed to check read returns and then re-issue another
          | read request for any missing bytes. The Linux kernel cifsfs
          | client and Samba libsmbclient/smbclient did this correctly.
         | 
          | But it turned out that Windows 10 and MacOSX Catalina
          | clients (maybe earlier client versions too, I don't have
          | access to those) have a _horrible_ bug, where they're not
          | checking read returns when doing pipelined reads.
         | 
         | When trying to read a 10GB file for example, they'll issue a
         | series of 1MB reads at 1MB boundaries, up to their SMB2 credit
         | limit, without waiting for replies. This is an excellent way to
         | improve network file copy performance as you fill the read pipe
         | without waiting for reply latency - indeed both Linux cifsfs
         | and smbclient do exactly the same.
         | 
          | But if one of those reads returns a short value, Windows 10
          | and MacOSX Catalina _DON'T GO BACK AND RE-READ THE MISSING
          | BYTES FROM THE SHORT READ REPLY_!!!! This is catastrophic,
          | and will corrupt any file read from the server (the local
          | client buffer cache fills in the missing file contents, I'm
          | assuming with zeros - I haven't checked, but the files are
          | corrupt as checked by SHA256 hashing anyway).
         | 
          | That's how we discovered the behavior, and it ended up
          | leading back to the io_uring behavior change. And that's
          | why I hate it when kernel interfaces expose changes to
          | user-space :-).
        
           | jstarks wrote:
           | > in the SMB2 protocol it isn't an error to return a short
           | read, the client is supposed to check read returns and then
           | re-issue another read request for any missing bytes
           | 
           | This is interesting and somewhat surprising, since Windows IO
           | is internally asynchronous and completion based, and AFAIK
           | file system drivers are not allowed to return a short read
           | except for EOF.
           | 
           | And actually, even on Linux file systems are not supposed to
           | return short reads, right? Even on signal? Since user apps
           | don't expect it? (And thus it's not surprising that
           | io_uring's change broke user apps.)
           | 
           | So it wouldn't be surprising to learn that the Windows SMB
           | server never returns short reads, and thus it's interesting
           | that the protocol would allow it. Do you know what the
           | purpose of this is?
        
           | dirtydroog wrote:
           | Well, Linus did have an infamous rant about never breaking
           | userspace; it's surprising this happened.
        
             | loeg wrote:
             | He's not especially consistent about it. Linus was totally
             | prepared to break userspace re: getrandom() in recent
             | history.
        
           | andoma wrote:
            | So it's a Windows and MacOS bug then? I.e., no shadow
            | should fall on io_uring, really?
            | 
            | That said, nice dig figuring this out. These types of
            | bugs can be really frustrating to round up.
        
         | magicalhippo wrote:
          | I know nothing about io_uring, but looking at the man
          | page[1] of readv I see it returns the number of bytes read.
          | For me as a developer that's an unmistakable flag that
          | partial reads are possible.
          | 
          | Was readv changed? The man page also states that partial
          | reads are possible, but I guess that might have been added
          | later?
         | 
          | Even if it did always return the full number of bytes
          | requested, it would hardly be the first case where the
          | current behavior is mistaken for the specification. My
          | fondest memory of that is all the OpenGL 1.x programs that
          | broke when OpenGL 2.x was released.
         | 
         | [1]: http://man7.org/linux/man-pages/man2/readv.2.html
        
           | jra_samba wrote:
           | Also, note the preadv2 man page which has a flags field with
           | one flag defined as:
           | 
           | -------------------------------
           | 
           | RWF_NOWAIT (since Linux 4.14) Do not wait for data which is
           | not immediately available. If this flag is specified, the
           | preadv2() system call will return instantly if it would have
           | to read data from the backing storage or wait for a lock. If
           | some data was successfully read, it will return the number of
           | bytes read.
           | 
           | -------------------------------
           | 
            | This implies that "standard" pread/preadv/preadv2 without
            | that flag (which is only available for preadv2) will
            | block waiting for all bytes (or return short on EOF), and
            | you need to set a flag to get the non-blocking behavior
            | you're describing here. Otherwise the flag would be the
            | inverse, RWF_WAIT, implying the standard behavior is the
            | non-blocking one rather than the blocking one.
           | 
           | The blocking behavior is what we were expecting (and
           | previously got) out of io_uring, so it was an unpleasant
           | surprise to see the behavior change visible to user-space in
           | later kernels.
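            | 
            | For reference, the minimal shape of a preadv2() call with
            | that flag (a hypothetical snippet, assuming recent-enough
            | glibc for preadv2/RWF_NOWAIT and a 4.14+ kernel) is:
            | 
            |   #define _GNU_SOURCE
            |   #include <fcntl.h>
            |   #include <stdio.h>
            |   #include <sys/uio.h>
            |   #include <unistd.h>
            |   
            |   int main(void)
            |   {
            |       int fd = open("somefile", O_RDONLY); /* any path */
            |       if (fd < 0)
            |           return 1;
            |   
            |       char buf[8192];
            |       struct iovec iov = {
            |           .iov_base = buf, .iov_len = sizeof(buf)
            |       };
            |   
            |       /* Don't block on backing storage: may return fewer
            |        * bytes than asked for (whatever is already in the
            |        * page cache), or -1 with errno == EAGAIN if
            |        * nothing is cached at all. */
            |       ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
            |       printf("preadv2 returned %zd\n", n);
            |   
            |       close(fd);
            |       return 0;
            |   }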
        
           | jra_samba_org wrote:
           | Well pread/pwrite have the same return values, and
           | historically for disk reads they block or return a device
           | error.
           | 
           | pread only returns a short value on EOF.
        
             | loeg wrote:
             | Well, or EINTR if your signal handlers are not SA_RESTART.
        
             | magicalhippo wrote:
             | Well, the man page does say that "The readv() system call
             | works just like read(2) except that multiple buffers are
             | filled".
             | 
             | If we go to read(2) we find "It is not an error if [the
             | return value] is smaller than the number of bytes
             | requested; this may happen for example because fewer bytes
             | are actually available right now [...], or because read()
             | was interrupted by a signal."
             | 
              | As an outsider, I'd never rely on this returning the
              | requested number of bytes. If I required N bytes, I'd
              | use a read loop.
             | 
             | But I do agree that the RWF_NOWAIT flag mentioned in your
             | other comment doesn't help, as it suggests the default is
             | to block.
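              | 
              | The usual shape of that loop, for completeness (plain
              | read(2), nothing io_uring-specific; read_n is just an
              | illustrative name):
              | 
              |   #include <errno.h>
              |   #include <sys/types.h>
              |   #include <unistd.h>
              |   
              |   /* Read exactly 'len' bytes unless EOF or error. */
              |   ssize_t read_n(int fd, void *buf, size_t len)
              |   {
              |       size_t done = 0;
              |       while (done < len) {
              |           ssize_t n = read(fd, (char *)buf + done,
              |                            len - done);
              |           if (n < 0) {
              |               if (errno == EINTR)
              |                   continue;  /* interrupted, retry */
              |               return -1;     /* real error */
              |           }
              |           if (n == 0)
              |               break;         /* EOF */
              |           done += n;         /* short read: loop */
              |       }
              |       return done;
              |   }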
        
       | qubex wrote:
       | Unfortunately I misread the title as "Lord of the Urine" and...
       | was concerned.
        
       | geofft wrote:
       | One thing this writeup made me realize is, if I have a
        | _misbehaving_ I/O system (NFS or remote block device over a
       | flaky network, dying SSD, etc.), in the pre-io_uring world I'd
       | probably see that via /proc/$pid/stack pretty clearly - I'd see a
       | stack with the read syscall, then the particular I/O subsystem,
       | then the physical implementation of that subsystem. Or if I
       | looked at /proc/$pid/syscall I'd see a read call on a certain fd,
       | and I could look in /proc/$pid/fd/ and see which fd it was and
       | where it lived.
       | 
       | However, in the post-io_uring world, I think I won't see that,
       | right? If I understand right, I'll at most see a call to
       | io_uring_enter, and maybe not even that.
       | 
       | How do I tell what a stuck io_uring-using program is stuck on? Is
       | there a way I can see all the pending I/Os and what's going on
       | with them?
       | 
       | How is this implemented internally - does it expand into one
       | kernel thread per I/O, or something? (I guess, if you had a silly
       | filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each
       | read, and you used io_uring to submit 100 reads from it, what
       | actually happens?)
        
         | dirtydroog wrote:
         | Use timeouts?
        
         | Matthias247 wrote:
          | I think that's a very reasonable concern. It isn't really
          | about io_uring, though - it applies to all "async"
          | solutions. Even today, if you are running async IO in
          | userspace (e.g. using epoll), it's not very obvious where
          | something went wrong, because no task is seemingly blocked.
          | If you attach a debugger, you'll most likely see something
          | blocked on epoll - but a callstack into the problematic
          | application code is nowhere in sight.
          | 
          | Even if you pause execution while inside the application
          | code, there might not be a great stack which contains all
          | the relevant data. It will only contain the information
          | since the last task resumption (e.g. through a callback).
          | Depending on your solution (C callbacks, C++ closures, C#
          | or Kotlin async/await, Rust async/await) the information
          | will be somewhere between not very helpful and somewhat
          | understandable, but never on par with a synchronous call.
        
         | shuss wrote:
          | This is such a great point. I never thought about how async
          | I/O could be a problem this way. In the SQ polling example,
          | I used BPF to "prove" that the process does not make system
          | calls:
         | 
         | https://unixism.net/loti/tutorial/sq_poll.html
         | 
         | Could be a good idea to use BPF to expose what io_uring is
         | doing. Just a wild thought.
        
         | matheusmoreira wrote:
         | Good point. Would be great if the submission and completion
         | ring buffers were accessible via procfs.
        
       | rwmj wrote:
       | By coincidence I asked a few questions on the mailing list about
       | io_uring this morning: https://lore.kernel.org/io-
       | uring/20200510080034.GI3888@redha...
        
       | tyingq wrote:
       | There are some benchmarks that show io_uring as a significant
       | boost over aio:
       | https://www.phoronix.com/scan.php?page=news_item&px=Linux-5....
       | 
       | I see that nginx accepted a pull request to use it, mid last
       | year: https://github.com/hakasenyang/openssl-patch/issues/21
       | 
       | Curious if it's also been adopted by other popular IO intensive
       | software.
        
         | jandrewrogers wrote:
          | I have not adopted io_uring yet because it isn't clear that
          | it will provide useful performance improvements over Linux
          | AIO in cases where the disk I/O subsystem is already highly
          | optimized. Where io_uring seems to show a benefit relative
          | to Linux AIO is with more naive software designs, which
          | adds a lot of value but is a somewhat different value
          | proposition than the one that has been expressed.
          | 
          | For software that is already capable of driving storage
          | hardware at its theoretical limit, the benefit is less
          | immediate and is offset by the requirement of having a very
          | recent Linux kernel.
        
           | tyingq wrote:
           | If I understand the premise right, it should be fewer
           | syscalls per IO. So even if it doesn't improve disk I/O, it
           | might reduce CPU utilization.
        
             | jandrewrogers wrote:
             | This is true, but for most intentionally optimized storage
             | engines that syscall overhead is below the noise floor in
             | practice, even on NVMe storage. A single core can easily
             | drive gigabytes per second using the old Linux AIO
             | interface.
             | 
             | It appears to primarily be an optimization for storage that
             | was not well-optimized to begin with. It is not obvious
             | that it would make high-performance storage engines faster.
        
             | matheusmoreira wrote:
             | It could also be nearly zero system calls per I/O
             | operation. The kernel can poll the submission queue for new
             | entries. This eliminates system call overhead at the cost
             | of higher CPU utilization.
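              | 
              | With liburing that mode is just a setup flag; a rough
              | sketch (hypothetical, and note that SQ polling has
              | privilege and fixed-file requirements on older
              | kernels):
              | 
              |   #include <liburing.h>
              |   #include <stdio.h>
              |   #include <string.h>
              |   
              |   int main(void)
              |   {
              |       struct io_uring ring;
              |       struct io_uring_params p;
              |       memset(&p, 0, sizeof(p));
              |   
              |       /* Ask for a kernel SQ polling thread; it goes
              |        * to sleep after 2000 ms of inactivity. */
              |       p.flags = IORING_SETUP_SQPOLL;
              |       p.sq_thread_idle = 2000;
              |   
              |       int ret = io_uring_queue_init_params(8, &ring, &p);
              |       if (ret < 0) {
              |           fprintf(stderr, "init: %s\n", strerror(-ret));
              |           return 1;
              |       }
              |   
              |       /* Prepare SQEs as usual; while the poller thread
              |        * is awake, io_uring_submit() skips the
              |        * io_uring_enter() syscall entirely. */
              |   
              |       io_uring_queue_exit(&ring);
              |       return 0;
              |   }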
        
           | shuss wrote:
            | For regular files, aio works asynchronously only if they
            | are opened in unbuffered (O_DIRECT) mode. I think this is
            | a huge limitation. io_uring, on the other hand, can
            | provide a uniform interface for all file descriptors,
            | whether they are sockets or regular files. This should be
            | a decent win, IMO.
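            | 
            | As a small (hypothetical) illustration of that uniform
            | interface, the same ring can carry a buffered file read
            | and a socket recv side by side (assumes a 5.6+ kernel for
            | IORING_OP_READ and IORING_OP_RECV; names are made up):
            | 
            |   #include <liburing.h>
            |   
            |   /* file_fd: regular file opened *without* O_DIRECT.
            |    * sock_fd: a connected socket. Both on one ring. */
            |   void queue_mixed_io(struct io_uring *ring, int file_fd,
            |                       int sock_fd, char *fbuf, char *sbuf,
            |                       size_t len)
            |   {
            |       struct io_uring_sqe *sqe;
            |   
            |       sqe = io_uring_get_sqe(ring);
            |       io_uring_prep_read(sqe, file_fd, fbuf, len, 0);
            |       io_uring_sqe_set_data(sqe, fbuf);
            |   
            |       sqe = io_uring_get_sqe(ring);
            |       io_uring_prep_recv(sqe, sock_fd, sbuf, len, 0);
            |       io_uring_sqe_set_data(sqe, sbuf);
            |   
            |       /* One submission covers both operations. */
            |       io_uring_submit(ring);
            |   }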
        
             | jandrewrogers wrote:
             | That was kind of my point. While all of this is true, these
             | are not material limitations for the implementation of
             | high-performance storage engines. For example, using
             | unbuffered file descriptors is a standard design element of
             | databases for performance reasons that remain true.
             | 
             | Being able to drive networking over io_uring would be a big
             | advantage but my understanding from people using it is that
             | part is still a bit broken.
        
               | shuss wrote:
                | True. Have to agree here. Although one advantage
                | io_uring will still have over aio for block I/O is
                | its polling mode, which can almost completely avoid
                | system calls.
        
               | g8oz wrote:
               | The ScyllaDB developers wrote up their take here:
               | https://www.scylladb.com/2020/05/05/how-io_uring-and-
               | ebpf-wi...
        
               | jabl wrote:
               | Those benchmark results are pretty impressive. In
               | particular, io_uring gets the best performance both when
               | the data is in the page cache and when bypassing the
               | cache.
        
             | wbl wrote:
              | And "works" is used advisedly. Certain filesystem edge
              | conditions, particularly metadata changes due to block
              | allocation, can result in blocking behavior.
        
           | loeg wrote:
           | aio doesn't have facilities for synchronous filesystem
           | metadata operations like open, rename, unlink, etc. If your
           | workload is largely metadata-static, aio is ok. If you need
           | to do even a little filesystem manipulation, io_uring seems
           | like it can provide some benefits.
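            | 
            | For example (a rough sketch with a made-up helper,
            | assuming a 5.6+ kernel where IORING_OP_OPENAT exists), an
            | open can be queued like any other request and its fd
            | shows up in the completion:
            | 
            |   #include <fcntl.h>
            |   #include <liburing.h>
            |   
            |   /* The resulting fd (or -errno) arrives in cqe->res. */
            |   void queue_open(struct io_uring *ring, const char *path)
            |   {
            |       struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            |       io_uring_prep_openat(sqe, AT_FDCWD, path,
            |                            O_RDONLY, 0);
            |       io_uring_submit(ring);
            |   }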
        
         | shuss wrote:
         | Oh, yeah. QEMU 5.0 already uses io_uring. In fact, it uses
         | liburing. Check out the changelog:
         | https://wiki.qemu.org/ChangeLog/5.0
        
           | Twirrim wrote:
           | To save people time, there's a single reference to it on the
           | changelog:
           | 
           | > The file-posix driver can now use the io_uring interface of
           | Linux with aio=io_uring
           | 
            | side note: I did notice that a change we built made it
            | into a released version of qemu:
           | 
           | > qemu-img convert -n now understands a --target-is-zero
           | option, which tells it that the target image is completely
           | zero, so it does not need to be zeroed again.
           | 
            | That's saving us so much time and I/O.
        
         | frevib wrote:
         | Echo server benchmarks, io_uring vs epoll:
         | https://github.com/frevib/io_uring-echo-server/blob/io-uring...
        
         | jra_samba wrote:
         | Samba can optionally use it if you explicitly load the
         | vfs_io_uring module, but it exposed a bug for us (see my
         | comment above). We're fixing it right now.
        
       | jcoffland wrote:
        | Maybe I just missed this, but can anyone tell me what kernel
        | versions support io_uring? I ran the following test program
        | on 4.19.0 and it is not supported:
        | 
        |   #include <stdio.h>
        |   #include <stdlib.h>
        |   #include <sys/utsname.h>
        |   #include <liburing.h>
        |   #include <liburing/io_uring.h>
        |   
        |   static const char *op_strs[] = {
        |     "IORING_OP_NOP", "IORING_OP_READV", "IORING_OP_WRITEV",
        |     "IORING_OP_FSYNC", "IORING_OP_READ_FIXED",
        |     "IORING_OP_WRITE_FIXED", "IORING_OP_POLL_ADD",
        |     "IORING_OP_POLL_REMOVE", "IORING_OP_SYNC_FILE_RANGE",
        |     "IORING_OP_SENDMSG", "IORING_OP_RECVMSG",
        |     "IORING_OP_TIMEOUT", "IORING_OP_TIMEOUT_REMOVE",
        |     "IORING_OP_ACCEPT", "IORING_OP_ASYNC_CANCEL",
        |     "IORING_OP_LINK_TIMEOUT", "IORING_OP_CONNECT",
        |     "IORING_OP_FALLOCATE", "IORING_OP_OPENAT",
        |     "IORING_OP_CLOSE", "IORING_OP_FILES_UPDATE",
        |     "IORING_OP_STATX", "IORING_OP_READ", "IORING_OP_WRITE",
        |     "IORING_OP_FADVISE", "IORING_OP_MADVISE",
        |     "IORING_OP_SEND", "IORING_OP_RECV", "IORING_OP_OPENAT2",
        |     "IORING_OP_EPOLL_CTL", "IORING_OP_SPLICE",
        |     "IORING_OP_PROVIDE_BUFFERS", "IORING_OP_REMOVE_BUFFERS",
        |   };
        |   
        |   int main() {
        |     struct utsname u;
        |     uname(&u);
        |   
        |     struct io_uring_probe *probe = io_uring_get_probe();
        |     if (!probe) {
        |       printf("Kernel %s does not support io_uring.\n",
        |              u.release);
        |       return 0;
        |     }
        |   
        |     printf("List of kernel %s's supported io_uring "
        |            "operations:\n", u.release);
        |     for (int i = 0; i < IORING_OP_LAST; i++) {
        |       const char *answer =
        |         io_uring_opcode_supported(probe, i) ? "yes" : "no";
        |       printf("%s: %s\n", op_strs[i], answer);
        |     }
        |   
        |     free(probe);
        |     return 0;
        |   }
        
         | shuss wrote:
         | io_uring_get_probe() needs v5.6 at least.
        
       | eMSF wrote:
       | A comment from the cat example:
       | 
       | >/* For each block of the file we need to read, we allocate an
       | iovec struct which is indexed into the iovecs array. This array
       | is passed in as part of the submission. If you don't understand
       | this, then you need to look up how the readv() and writev()
       | system calls work. */
       | 
        | I have to say, I don't really understand why the author chose
        | to individually allocate (up to millions of) single-kilobyte
        | buffers for each file. Perhaps there is a reason for it, but
        | I think they should elaborate on the choice. Anyway, I guess
        | the first example is too simplified, which is why what
        | follows is not built on top of it in any way; hence they feel
        | disjointed.
        | 
        | The bigger problem here is that I don't know the author, or
        | how talented they are. Choices like that, or writing
        | non-async-signal-safe signal handlers, don't help in
        | estimating it, either. Is the rest of the advice sound?
        
         | shuss wrote:
          | The author here: All examples in the guide are aimed at
          | throwing light on the io_uring and liburing interfaces.
          | They are not very useful or very real-worldish examples.
          | The idea with this example in particular is to show the
          | difference between how readv/writev work synchronously and
          | how they would be "called" via io_uring. Maybe I should
          | call out more clearly in the text that these programs are
          | tuned towards explaining the io_uring interface. Thanks for
          | the feedback.
        
       | beagle3 wrote:
        | Is there any intention to optimize the work done, rather than
        | just the calling interface?
        | 
        | E.g., running rsync on a hierarchy of 10M files usually
        | requires 10M synchronous stat calls. Using io_uring would
        | make them asynchronous, but they could potentially be done
        | more efficiently (e.g. convert file names to inodes in blocks
        | of 20k, and then stat those 20k inodes in a batch).
        | 
        | That would require e.g. the VFS layer to support batch
        | operations. But io_uring would actually allow that without a
        | user-space interface change.
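        | 
        | Today the batching only happens at the submission boundary; a
        | rough sketch of what that already looks like with liburing
        | (hypothetical helper, assumes a 5.6+ kernel for
        | IORING_OP_STATX):
        | 
        |   #define _GNU_SOURCE
        |   #include <fcntl.h>
        |   #include <liburing.h>
        |   #include <sys/stat.h>
        |   
        |   /* Queue one statx per path, submit them as one batch
        |    * (a single syscall), then reap the completions.
        |    * Results land in bufs[i]; cqe->res is 0 or -errno. */
        |   int stat_batch(struct io_uring *ring, const char **paths,
        |                  struct statx *bufs, unsigned n)
        |   {
        |       unsigned queued = 0;
        |   
        |       for (unsigned i = 0; i < n; i++) {
        |           struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        |           if (!sqe)
        |               break;            /* ring full */
        |           io_uring_prep_statx(sqe, AT_FDCWD, paths[i], 0,
        |                               STATX_BASIC_STATS, &bufs[i]);
        |           io_uring_sqe_set_data(sqe, (void *)paths[i]);
        |           queued++;
        |       }
        |       io_uring_submit(ring);
        |   
        |       for (unsigned i = 0; i < queued; i++) {
        |           struct io_uring_cqe *cqe;
        |           if (io_uring_wait_cqe(ring, &cqe) < 0)
        |               return -1;
        |           /* io_uring_cqe_get_data(cqe) is the path here. */
        |           io_uring_cqe_seen(ring, cqe);
        |       }
        |       return 0;
        |   }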
        
       | ignoramous wrote:
        | Can anyone familiar with Infiniband's approach to exposing
        | I/O via rx/tx queues [0] comment on whether it seems similar
        | to io_uring's ring buffers [1]? How do these contrast with
        | each other?
       | 
       | [0]
       | https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10...
       | 
       | [1] https://news.ycombinator.com/item?id=19846261
        
         | DmitryOlshansky wrote:
         | Very limited experience with Infiniband but it seems similar, a
         | bit more flexible (esp recently with more syscalls supported).
         | 
         | Also similar to but more general than RIO Sockets of Win8+:
         | 
         | https://docs.microsoft.com/en-us/previous-versions/windows/i...
        
       | throw7 wrote:
       | The site pushes really hard that you shouldn't use the low-level
       | system calls in your code and that you should (always?) be using
       | a library (liburing).
       | 
       | What exactly is liburing bringing to the table that I shouldn't
       | be using the uring syscalls directly?
        
         | andoma wrote:
         | io_uring requires userspace to access it using a well-defined
         | load/store memory ordering. Care must be taken to make sure the
         | compiler does not reorder instructions but also to use the
         | correct load/store instructions so hardware doesn't reorder
         | loads and stores. This is easier to (accidentally) get correct
         | on x86 as it has stronger ordering guarantees. In other words,
         | if you are not careful your code might be correct on x86 but
         | fail on Arm, etc. Needless to say the library handles all of
         | this correctly.
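          | 
          | For a feel of what liburing is hiding, reaping a completion
          | from the raw mmap'd CQ ring looks roughly like this (a
          | simplified sketch with GCC/Clang __atomic builtins and a
          | made-up helper, not liburing's actual code):
          | 
          |   #include <linux/io_uring.h>
          |   
          |   /* The kernel produces CQEs and advances *cq_tail;
          |    * userspace consumes them and advances *cq_head.
          |    * Returns 1 and copies a CQE into *out, or 0 if empty. */
          |   int reap_one(unsigned *cq_head, unsigned *cq_tail,
          |                unsigned cq_mask, struct io_uring_cqe *cqes,
          |                struct io_uring_cqe *out)
          |   {
          |       unsigned head = *cq_head;
          |   
          |       /* Acquire: see the CQE contents the kernel wrote
          |        * before it advanced the tail. */
          |       unsigned tail = __atomic_load_n(cq_tail,
          |                                       __ATOMIC_ACQUIRE);
          |       if (head == tail)
          |           return 0;                 /* ring is empty */
          |   
          |       *out = cqes[head & cq_mask];  /* copy before publish */
          |   
          |       /* Release: only hand the slot back to the kernel
          |        * after we've finished reading it. */
          |       __atomic_store_n(cq_head, head + 1, __ATOMIC_RELEASE);
          |       return 1;
          |   }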
        
         | matheusmoreira wrote:
         | You absolutely can use system calls in your code. The kernel
         | has an awesome header that makes this easy and allows you to
         | eliminate all dependencies:
         | 
         | https://github.com/torvalds/linux/blob/master/tools/include/...
         | 
         | This system call avoidance dogma exists because libraries
         | generally have more convenient interfaces and are therefore
         | easier to use. They're not strictly necessary though.
         | 
          | It should be noted that using certain system calls may
          | cause problems with the libraries you're using. For
          | example, glibc needs to maintain complete control over the
          | threading model in order to implement thread-local storage.
          | If you issue a clone system call directly, the glibc
          | threading model is broken and even something simple like
          | errno is likely to break.
         | 
         | In my opinion, libraries shouldn't contain thread-local or
         | global variables in the first place. Unfortunately, the C
         | language is old and these problems will never be fixed. It's
         | possible to create better libraries in freestanding C or even
         | freestanding Rust but replacing what already exists is a
         | lifetime of work.
         | 
         | > What exactly is liburing bringing to the table that I
         | shouldn't be using the uring syscalls directly?
         | 
         | It's easier to use compared to the kernel interface. For
         | example, it handles submission queue polling automatically
         | without any extra code.
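          | 
          | For example, the raw entry point is just a syscall away (a
          | sketch of the setup call only, assuming kernel headers new
          | enough to define __NR_io_uring_setup; the ring mmap() and
          | SQE/CQE bookkeeping that liburing hides is where the real
          | work is):
          | 
          |   #include <linux/io_uring.h>
          |   #include <stdio.h>
          |   #include <string.h>
          |   #include <sys/syscall.h>
          |   #include <unistd.h>
          |   
          |   int main(void)
          |   {
          |       struct io_uring_params p;
          |       memset(&p, 0, sizeof(p));
          |   
          |       /* Raw io_uring_setup(2): returns a ring fd. Next you
          |        * would mmap() the SQ/CQ rings and the SQE array at
          |        * the offsets returned in p - the part liburing's
          |        * io_uring_queue_init() does for you. */
          |       int ring_fd = syscall(__NR_io_uring_setup, 8, &p);
          |       if (ring_fd < 0) {
          |           perror("io_uring_setup");
          |           return 1;
          |       }
          |   
          |       printf("ring fd %d, sq %u entries, cq %u entries\n",
          |              ring_fd, p.sq_entries, p.cq_entries);
          |       close(ring_fd);
          |       return 0;
          |   }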
        
         | shuss wrote:
          | The raw io_uring interface, once you ignore the boilerplate
          | initialization code, is actually a super-simple interface
          | to use. liburing is itself only a very thin wrapper on top
          | of io_uring. I feel that if you ever use io_uring directly,
          | after a while you'll end up with a bunch of convenience
          | functions of your own. liburing looks more like a
          | collection of those functions to me today.
          | 
          | One place where liburing provides a slightly higher-level
          | interface is the function io_uring_submit(). It determines,
          | among other things, whether there is a need to call the
          | io_uring_enter() system call, depending on whether you are
          | in SQ polling mode, for example. You can read more about it
          | here:
         | 
         | https://unixism.net/loti/tutorial/sq_poll.html
         | 
         | Otherwise, at least at this time, liburing is a simple wrapper.
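          | 
          | To give a feel for how thin it is, a complete liburing
          | round trip with a NOP request (which the kernel completes
          | immediately) fits in a few lines:
          | 
          |   #include <liburing.h>
          |   #include <stdio.h>
          |   
          |   int main(void)
          |   {
          |       struct io_uring ring;
          |       if (io_uring_queue_init(4, &ring, 0) < 0)
          |           return 1;
          |   
          |       /* Queue a no-op request and submit it. */
          |       struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          |       io_uring_prep_nop(sqe);
          |       io_uring_submit(&ring);
          |   
          |       /* Wait for the completion and mark it seen. */
          |       struct io_uring_cqe *cqe;
          |       io_uring_wait_cqe(&ring, &cqe);
          |       printf("nop completed, res = %d\n", cqe->res);
          |       io_uring_cqe_seen(&ring, cqe);
          |   
          |       io_uring_queue_exit(&ring);
          |       return 0;
          |   }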
        
       | matheusmoreira wrote:
       | So awesome... The ring buffer is like a generic asynchronous
       | system call submission mechanism. The set of supported operations
       | is already a subset of available Linux system calls:
       | 
       | https://github.com/torvalds/linux/blob/master/include/uapi/l...
       | 
       | It almost gained support for ioctl:
       | 
       | https://lwn.net/Articles/810414/
       | 
       | Wouldn't it be cool if it gained support for other types of
       | system calls? Something this awesome shouldn't be restricted to
       | I/O...
        
         | diegocg wrote:
          | The author seems to be planning to expand it to be usable
          | as a generic way of doing asynchronous syscalls.
        
       ___________________________________________________________________
       (page generated 2020-05-10 23:00 UTC)