[HN Gopher] Lord of the io_uring: io_uring tutorial, examples an... ___________________________________________________________________ Lord of the io_uring: io_uring tutorial, examples and reference Author : shuss Score : 216 points Date : 2020-05-10 13:49 UTC (9 hours ago) (HTM) web link (unixism.net) (TXT) w3m dump (unixism.net) | jra_samba wrote: | io_uring still has its wrinkles. | | We are scrambling right now to fix a problem due to a change in | behavior exposed to user-space by the io_uring kernel module in | later kernels. | | Turns out that in earlier kernels (Ubuntu 19.04 5.3.0-51-generic | #44-Ubuntu SMP) io_uring will not return short reads/writes | (that's where you ask for e.g. 8k, but there's only 4k in the | buffer cache; instead of signaling completion, the call blocks | until all 8k has been transferred). In later kernels (not sure | when the behavior changed, but the one shipped with Fedora 32 has | the new behavior) io_uring returns partial (short) reads to user | space. E.g. you ask for 8k but there's only 4k in the buffer | cache, so the call signals complete with a return of only 4k | read, not the 8k you asked for. | | Userspace code now has to cope with this where it didn't have to | before. You could argue (and kernel developers did :-) that this | was always possible, so user code needs to be aware of it. But it | didn't use to do that :-). Change for user space is _bad_, mkay | :-). | jra_samba wrote: | It was really interesting how this was found. | | A user started describing file corruption when copying to/from | Windows with the io_uring VFS module loaded. | | Tests using the Linux kernel cifsfs client and the Samba | libsmbclient libraries/smbclient user-space transfer utility | couldn't reproduce the problem, and neither could running Windows | against Samba on Ubuntu 19.04. | | What turned out to be happening was a combination of things. | Firstly, the kernel changed so that an SMB2_READ request against | Samba with io_uring loaded was _sometimes_ hitting a short | read, where some of the file data was already in the buffer | cache, so io_uring now returned a short read to smbd. | | We returned this to the client, as in the SMB2 protocol it | isn't an error to return a short read; the client is supposed | to check read returns and then re-issue another read request | for any missing bytes. The Linux kernel cifsfs client and Samba | libsmbclient/smbclient did this correctly. | | But it turned out that Windows10 clients and MacOSX Catalina | clients (maybe earlier versions of the clients too, I don't have | access to those) have a _horrible_ bug, where they're not | checking read returns when doing pipelined reads. | | When trying to read a 10GB file, for example, they'll issue a | series of 1MB reads at 1MB boundaries, up to their SMB2 credit | limit, without waiting for replies. This is an excellent way to | improve network file copy performance, as you fill the read pipe | without waiting for reply latency - indeed both Linux cifsfs | and smbclient do exactly the same. | | But if one of those reads returns a short value, Windows10 and | MacOSX Catalina _DON'T GO BACK AND RE-READ THE MISSING BYTES | FROM THE SHORT READ REPLY_!!!! This is catastrophic, and will | corrupt any file read from the server (the local client buffer | cache fills in the file contents, I'm assuming with zeros - I | haven't checked, but the files are corrupt as checked by SHA256 | hashing anyway). | | That's how we discovered the behavior and ended up tracing it | back to the io_uring behavior change.
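    A minimal sketch of the kind of coping code described above - not
    the actual Samba fix - assuming liburing on a 5.6+ kernel. The
    helper name read_fully() is hypothetical, and error handling is
    trimmed: if a completion reports fewer bytes than requested,
    resubmit a read for the remainder.

        #include <liburing.h>

        /* Read exactly len bytes at offset off, resubmitting after
         * short reads. Assumes the ring is already initialized. */
        static int read_fully(struct io_uring *ring, int fd, char *buf,
                              size_t len, off_t off)
        {
            while (len > 0) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                struct io_uring_cqe *cqe;

                /* IORING_OP_READ needs kernel 5.6+; use
                 * io_uring_prep_readv() on older kernels */
                io_uring_prep_read(sqe, fd, buf, len, off);
                io_uring_submit(ring);
                io_uring_wait_cqe(ring, &cqe);

                int res = cqe->res;
                io_uring_cqe_seen(ring, cqe);

                if (res < 0)
                    return res;   /* kernel returns -errno */
                if (res == 0)
                    break;        /* EOF */

                /* short read: advance and ask for the rest */
                buf += res;
                off += res;
                len -= (size_t)res;
            }
            return 0;
        }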
And that's why I hate it when | kernel interfaces expose changes to user-space :-). | jstarks wrote: | > in the SMB2 protocol it isn't an error to return a short | read; the client is supposed to check read returns and then | re-issue another read request for any missing bytes | | This is interesting and somewhat surprising, since Windows IO | is internally asynchronous and completion based, and AFAIK | file system drivers are not allowed to return a short read | except for EOF. | | And actually, even on Linux file systems are not supposed to | return short reads, right? Even on signal? Since user apps | don't expect it? (And thus it's not surprising that | io_uring's change broke user apps.) | | So it wouldn't be surprising to learn that the Windows SMB | server never returns short reads, and thus it's interesting | that the protocol would allow it. Do you know what the | purpose of this is? | dirtydroog wrote: | Well, Linus did have an infamous rant about never breaking | userspace; it's surprising this happened. | loeg wrote: | He's not especially consistent about it. Linus was totally | prepared to break userspace re: getrandom() in recent | history. | andoma wrote: | So it's a Windows and MacOS bug then? I.e., no shadow should | fall on io_uring really? | | That said, nice dig figuring this out. These types of bugs can | be really frustrating to round up. | magicalhippo wrote: | I know nothing about io_uring, but looking at the man page[1] of | readv I see it returns the number of bytes read. For me as a | developer that's an unmistakable flag that partial reads are | possible. | | Was readv changed? The man page also states that partial reads | are possible, but I guess that might have been added later? | | If it previously always returned the full requested count, it | would hardly be the first case where the current behavior is | mistaken for the specification. My fondest memory of that is all | the OpenGL 1.x programs that broke when OpenGL 2.x was released. | | [1]: http://man7.org/linux/man-pages/man2/readv.2.html | jra_samba wrote: | Also, note the preadv2 man page, which has a flags field with | one flag defined as: | | ------------------------------- | | RWF_NOWAIT (since Linux 4.14) Do not wait for data which is | not immediately available. If this flag is specified, the | preadv2() system call will return instantly if it would have | to read data from the backing storage or wait for a lock. If | some data was successfully read, it will return the number of | bytes read. | | ------------------------------- | | This implies that "standard" pread/preadv/preadv2 without | that flag (which is only available for preadv2) will block | waiting for all bytes (or return short on EOF), and that you need | to set a flag to get the non-blocking behavior you're | describing here. Otherwise the flag would be the inverse - | RWF_WAIT, implying the standard behavior is the non-blocking | one, not the blocking one. | | The blocking behavior is what we were expecting (and | previously got) out of io_uring, so it was an unpleasant | surprise to see the behavior change visible to user-space in | later kernels. | jra_samba_org wrote: | Well, pread/pwrite have the same return values, and | historically for disk reads they block or return a device | error. | | pread only returns a short value on EOF. | loeg wrote: | Well, or EINTR if your signal handlers are not SA_RESTART. | magicalhippo wrote: | Well, the man page does say that "The readv() system call | works just like read(2) except that multiple buffers are | filled".
| | If we go to read(2) we find: "It is not an error if [the | return value] is smaller than the number of bytes | requested; this may happen for example because fewer bytes | are actually available right now [...], or because read() | was interrupted by a signal." | | As an outsider, I'd never rely on this returning the | requested number of bytes. If I required N bytes, I'd use | a read loop. | | But I do agree that the RWF_NOWAIT flag mentioned in your | other comment doesn't help, as it suggests the default is | to block. | qubex wrote: | Unfortunately I misread the title as "Lord of the Urine" and... | was concerned. | geofft wrote: | One thing this writeup made me realize is, if I have a | _misbehaving_ I/O system (NFS or a remote block device over a | flaky network, a dying SSD, etc.), in the pre-io_uring world I'd | probably see that via /proc/$pid/stack pretty clearly - I'd see a | stack with the read syscall, then the particular I/O subsystem, | then the physical implementation of that subsystem. Or if I | looked at /proc/$pid/syscall I'd see a read call on a certain fd, | and I could look in /proc/$pid/fd/ and see which fd it was and | where it lived. | | However, in the post-io_uring world, I think I won't see that, | right? If I understand right, I'll at most see a call to | io_uring_enter, and maybe not even that. | | How do I tell what a stuck io_uring-using program is stuck on? Is | there a way I can see all the pending I/Os and what's going on | with them? | | How is this implemented internally - does it expand into one | kernel thread per I/O, or something? (I guess, if you had a silly | filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each | read, and you used io_uring to submit 100 reads from it, what | actually happens?) | dirtydroog wrote: | Use timeouts? | Matthias247 wrote: | I think that's a very reasonable concern. However, it isn't | really about io_uring - it applies to all "async" solutions. | Even today, if you are running async IO in userspace (e.g. using | epoll), it's not very obvious where something went wrong, | because no task is seemingly blocked. If you attach a debugger, | you will most likely see something being blocked on epoll - | but a callstack to the problematic application code is nowhere | in sight. | | Even if you pause execution while inside the application code, | there might not be a great stack which contains all relevant | data. It will only contain the information since the last task | resumption (e.g. through a callback). Depending on your | solution (C callbacks, C++ closures, C# or Kotlin async/await, | Rust async/await) the information will be somewhere between not | very helpful and somewhat understandable, but never on par with a | synchronous call. | shuss wrote: | This is such a great point. Never thought about how async I/O | could be a problem this way. In the SQ polling example, I used | BPF to "prove" that the process does not make system calls: | | https://unixism.net/loti/tutorial/sq_poll.html | | Could be a good idea to use BPF to expose what io_uring is | doing. Just a wild thought. | matheusmoreira wrote: | Good point. Would be great if the submission and completion | ring buffers were accessible via procfs. | rwmj wrote: | By coincidence, I asked a few questions on the mailing list about | io_uring this morning: https://lore.kernel.org/io- | uring/20200510080034.GI3888@redha...
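    To make the read-loop pattern from the subthread above concrete:
    a minimal sketch of a helper that tolerates both short reads and
    EINTR. The name read_exact() is hypothetical.

        #include <errno.h>
        #include <unistd.h>

        /* Keep calling read(2) until len bytes arrive, EOF, or a
         * real error. A short return just means "go around again". */
        static ssize_t read_exact(int fd, void *buf, size_t len)
        {
            size_t done = 0;
            while (done < len) {
                ssize_t n = read(fd, (char *)buf + done, len - done);
                if (n < 0) {
                    if (errno == EINTR)
                        continue;   /* interrupted by a signal: retry */
                    return -1;      /* real error, errno is set */
                }
                if (n == 0)
                    break;          /* EOF: caller gets a short count */
                done += (size_t)n;
            }
            return (ssize_t)done;
        }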
| tyingq wrote: | There are some benchmarks that show io_uring as a significant | boost over aio: | https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.... | | I see that nginx accepted a pull request to use it, mid last | year: https://github.com/hakasenyang/openssl-patch/issues/21 | | Curious if it's also been adopted by other popular IO-intensive | software. | jandrewrogers wrote: | I have not adopted io_uring yet because it isn't clear that it | will provide useful performance improvements over linux aio in | cases where the disk I/O subsystem is already highly optimized. | Where io_uring seems to show a benefit relative to linux aio is | with more naively designed software, which adds a lot of value | but is a somewhat different value proposition than the one being | expressed. | | For software that is already capable of driving storage | hardware at its theoretical limit, the benefit is less | immediate and is offset by the requirement of having a very | recent Linux kernel. | tyingq wrote: | If I understand the premise right, it should be fewer | syscalls per IO. So even if it doesn't improve disk I/O, it | might reduce CPU utilization. | jandrewrogers wrote: | This is true, but for most intentionally optimized storage | engines that syscall overhead is below the noise floor in | practice, even on NVMe storage. A single core can easily | drive gigabytes per second using the old Linux AIO | interface. | | It appears to primarily be an optimization for storage that | was not well-optimized to begin with. It is not obvious | that it would make high-performance storage engines faster. | matheusmoreira wrote: | It could also be nearly zero system calls per I/O | operation. The kernel can poll the submission queue for new | entries. This eliminates system call overhead at the cost | of higher CPU utilization. | shuss wrote: | For regular files, aio works async only if they are opened in | unbuffered (O_DIRECT) mode. I think this is a huge limitation. | io_uring, on the other hand, can provide a uniform interface for | all file descriptors, whether they are sockets or regular files. | This should be a decent win, IMO. | jandrewrogers wrote: | That was kind of my point. While all of this is true, these | are not material limitations for the implementation of | high-performance storage engines. For example, using | unbuffered file descriptors is a standard design element of | databases, for performance reasons that remain true. | | Being able to drive networking over io_uring would be a big | advantage, but my understanding from people using it is that | that part is still a bit broken. | shuss wrote: | True. Have to agree here. Although one advantage io_uring | will still have over aio for block I/O is its polling mode, | which almost completely avoids system calls. | g8oz wrote: | The ScyllaDB developers wrote up their take here: | https://www.scylladb.com/2020/05/05/how-io_uring-and- | ebpf-wi... | jabl wrote: | Those benchmark results are pretty impressive. In | particular, io_uring gets the best performance both when | the data is in the page cache and when bypassing the | cache. | wbl wrote: | And "works" is used advisedly. Certain filesystem edge | conditions, particularly metadata changes due to block | allocation, can result in blocking behavior. | loeg wrote: | aio doesn't have facilities for synchronous filesystem | metadata operations like open, rename, unlink, etc. If your | workload is largely metadata-static, aio is ok.
If you need | to do even a little filesystem manipulation, io_uring seems | like it can provide some benefits. | shuss wrote: | Oh, yeah. QEMU 5.0 already uses io_uring. In fact, it uses | liburing. Check out the changelog: | https://wiki.qemu.org/ChangeLog/5.0 | Twirrim wrote: | To save people time, there's a single reference to it in the | changelog: | | > The file-posix driver can now use the io_uring interface of | Linux with aio=io_uring | | Side note: I did notice that a change we built made it into a | released version of qemu: | | > qemu-img convert -n now understands a --target-is-zero | option, which tells it that the target image is completely | zero, so it does not need to be zeroed again. | | That's saving us so much time and I/O. | frevib wrote: | Echo server benchmarks, io_uring vs epoll: | https://github.com/frevib/io_uring-echo-server/blob/io-uring... | jra_samba wrote: | Samba can optionally use it if you explicitly load the | vfs_io_uring module, but it exposed a bug for us (see my | comment above). We're fixing it right now. | jcoffland wrote: | Maybe I just missed this, but can anyone tell me which kernel | versions support io_uring? I ran the following test program on | 4.19.0 and it is not supported:

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/utsname.h>
        #include <liburing.h>
        #include <liburing/io_uring.h>

        static const char *op_strs[] = {
            "IORING_OP_NOP",
            "IORING_OP_READV",
            "IORING_OP_WRITEV",
            "IORING_OP_FSYNC",
            "IORING_OP_READ_FIXED",
            "IORING_OP_WRITE_FIXED",
            "IORING_OP_POLL_ADD",
            "IORING_OP_POLL_REMOVE",
            "IORING_OP_SYNC_FILE_RANGE",
            "IORING_OP_SENDMSG",
            "IORING_OP_RECVMSG",
            "IORING_OP_TIMEOUT",
            "IORING_OP_TIMEOUT_REMOVE",
            "IORING_OP_ACCEPT",
            "IORING_OP_ASYNC_CANCEL",
            "IORING_OP_LINK_TIMEOUT",
            "IORING_OP_CONNECT",
            "IORING_OP_FALLOCATE",
            "IORING_OP_OPENAT",
            "IORING_OP_CLOSE",
            "IORING_OP_FILES_UPDATE",
            "IORING_OP_STATX",
            "IORING_OP_READ",
            "IORING_OP_WRITE",
            "IORING_OP_FADVISE",
            "IORING_OP_MADVISE",
            "IORING_OP_SEND",
            "IORING_OP_RECV",
            "IORING_OP_OPENAT2",
            "IORING_OP_EPOLL_CTL",
            "IORING_OP_SPLICE",
            "IORING_OP_PROVIDE_BUFFERS",
            "IORING_OP_REMOVE_BUFFERS",
        };

        int main() {
            struct utsname u;
            uname(&u);

            struct io_uring_probe *probe = io_uring_get_probe();
            if (!probe) {
                printf("Kernel %s does not support io_uring.\n",
                       u.release);
                return 0;
            }

            printf("List of kernel %s's supported io_uring operations:\n",
                   u.release);
            for (int i = 0; i < IORING_OP_LAST; i++) {
                const char *answer =
                    io_uring_opcode_supported(probe, i) ? "yes" : "no";
                printf("%s: %s\n", op_strs[i], answer);
            }

            free(probe);
            return 0;
        }

| shuss wrote: | io_uring_get_probe() needs v5.6 at least. | eMSF wrote: | A comment from the cat example: | | > /* For each block of the file we need to read, we allocate an | iovec struct which is indexed into the iovecs array. This array | is passed in as part of the submission. If you don't understand | this, then you need to look up how the readv() and writev() | system calls work. */ | | I have to say, I don't really understand why the author chose to | individually allocate (up to millions of) single-kilobyte buffers | for each file. Perhaps there is a reason for it, but I think they | should elaborate on the choice. Anyway, I guess the first example | is too simplified, which is why what follows it is not built on | top of it in any way, hence the examples feel disjointed. | | The bigger problem here is that I don't know the author, or how | talented they are.
Choices like that, or writing non-async-signal-safe | signal handlers, don't help in estimating it, either. | Is the rest of the advice sound? | shuss wrote: | The author here: all examples in the guide are aimed at | throwing light on the io_uring and liburing interfaces. They | are not very useful or very real-worldish examples. The idea | with this example in particular is to show the difference between | how readv/writev work synchronously and how they would be | "called" via io_uring. Maybe I should call out in the text that | these programs are tuned more towards explaining the io_uring | interface. Thanks for the feedback. | beagle3 wrote: | Is there any intention to optimize the work done, rather than | just the calling interface? | | E.g., running rsync on a 10M-file hierarchy usually requires | 10M synchronous stat calls. Using io_uring would make them | asynchronous, but they could potentially be done more efficiently | (e.g. convert file names to inodes in blocks of 20k, and then | stat those 20k inodes in a batch). | | That would require e.g. the VFS layer to support batch | operations. But io_uring would actually allow that without a | user-space interface change. | ignoramous wrote: | Could anyone familiar with Infiniband's approach to exposing IO | via rx/tx queues [0] comment on whether it seems similar to | io_uring's ring buffers [1]? How do these contrast with each | other? | | [0] | https://www.cisco.com/c/en/us/td/docs/server_nw_virtual/2-10... | | [1] https://news.ycombinator.com/item?id=19846261 | DmitryOlshansky wrote: | Very limited experience with Infiniband, but it seems similar, a | bit more flexible (esp. recently, with more syscalls supported). | | Also similar to, but more general than, RIO Sockets of Win8+: | | https://docs.microsoft.com/en-us/previous-versions/windows/i... | throw7 wrote: | The site pushes really hard that you shouldn't use the low-level | system calls in your code and that you should (always?) be using | a library (liburing). | | What exactly is liburing bringing to the table such that I | shouldn't be using the io_uring syscalls directly? | andoma wrote: | io_uring requires userspace to access it using a well-defined | load/store memory ordering. Care must be taken to make sure the | compiler does not reorder instructions, but also to use the | correct load/store instructions so the hardware doesn't reorder | loads and stores. This is easier to (accidentally) get correct | on x86, as it has stronger ordering guarantees. In other words, | if you are not careful your code might be correct on x86 but | fail on Arm, etc. Needless to say, the library handles all of | this correctly. | matheusmoreira wrote: | You absolutely can use system calls in your code. The kernel | has an awesome header that makes this easy and allows you to | eliminate all dependencies: | | https://github.com/torvalds/linux/blob/master/tools/include/... | | This system-call avoidance dogma exists because libraries | generally have more convenient interfaces and are therefore | easier to use. They're not strictly necessary, though. | | It should be noted that using certain system calls may cause | problems with the libraries you're using. For example, glibc | needs to maintain complete control over the threading model in | order to implement thread-local storage. Issuing a clone | system call directly breaks the glibc threading model, and | even something as simple as errno is likely to break.
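    For readers wondering what "using the io_uring syscalls directly"
    looks like in practice: the raw entry points are ordinary system
    calls with no glibc wrappers. A minimal, illustrative probe via
    syscall(2) - assuming a kernel with io_uring support and headers
    that define __NR_io_uring_setup:

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/io_uring.h>

        int main(void)
        {
            struct io_uring_params p;
            memset(&p, 0, sizeof(p));

            /* io_uring_setup(2): returns a ring fd on success,
             * or -1 with errno set (ENOSYS on pre-5.1 kernels) */
            int fd = syscall(__NR_io_uring_setup, 4, &p);
            if (fd < 0) {
                printf("io_uring not available: %s\n", strerror(errno));
                return 1;
            }
            printf("got io_uring fd %d; mmap the rings next\n", fd);
            return 0;
        }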
| | In my opinion, libraries shouldn't contain thread-local or | global variables in the first place. Unfortunately, the C | language is old and these problems will never be fixed. It's | possible to create better libraries in freestanding C or even | freestanding Rust, but replacing what already exists is a | lifetime of work. | | > What exactly is liburing bringing to the table such that I | shouldn't be using the io_uring syscalls directly? | | It's easier to use compared to the raw kernel interface. For | example, it handles submission queue polling automatically | without any extra code. | shuss wrote: | The raw io_uring interface, once you ignore the boilerplate | initialization code, is actually a super-simple interface to | use. liburing is itself only a very thin wrapper on top of | io_uring. I feel that if you ever use io_uring, after a while | you'll end up with a bunch of convenience functions. liburing | looks more like a collection of those functions to me today. | | One place where a slightly higher-level interface is provided by | liburing is in the function io_uring_submit(). Among other | things, it determines whether there is a need to call the | io_uring_enter() system call, depending on whether you are in | polling mode, for example. You can read more about it here: | | https://unixism.net/loti/tutorial/sq_poll.html | | Otherwise, at least at this time, liburing is a simple wrapper. | matheusmoreira wrote: | So awesome... The ring buffer is like a generic asynchronous | system call submission mechanism. The set of supported operations | is already a subset of the available Linux system calls: | | https://github.com/torvalds/linux/blob/master/include/uapi/l... | | It almost gained support for ioctl: | | https://lwn.net/Articles/810414/ | | Wouldn't it be cool if it gained support for other types of | system calls? Something this awesome shouldn't be restricted to | I/O... | diegocg wrote: | The author seems to be planning to expand it to be usable as a | generic way of doing asynchronous syscalls. ___________________________________________________________________ (page generated 2020-05-10 23:00 UTC)