[HN Gopher] Io_uring is not an event system ___________________________________________________________________ Io_uring is not an event system Author : ot Score : 217 points Date : 2021-06-17 14:56 UTC (8 hours ago) (HTM) web link (despairlabs.com) (TXT) w3m dump (despairlabs.com) | ayanamist wrote: | So linux people find that the model of windows iocp used is | better? | asveikau wrote: | I think it's long been understood that it's better for disk | I/O. I'm not sure the consensus is as clear for sockets. | hermanradtke wrote: | As a Linux person: Yes, and I have thought so for a long time. | | This is why many of us are excited about io_uring. | spullara wrote: | Now if they could allocate the memory for you when needed | rather than having a bunch of buffers allocated when they | aren't yet needed. | ot wrote: | Yes, and now things are going full circle and windows is | adopting the ring model too. | | https://windows-internals.com/i-o-rings-when-one-i-o-operati... | muststopmyths wrote: | huh, interesting. As far as I can tell the main advantage | this has over IOCP is that you can get one completion for | multiple read requests. | | Looks like they took a lot of the concepts from Winsock RIO | and applied them to file I/O. Which is fascinating because | with network traffic you can't predict packet boundaries and | thus your I/O rate can be unpredictable. RIO helps you get | the notification rate under control, which can help if your | packet rate is very high. | | With files, I would think you can control the rate at which | you request data, as well as the memory you allocate for it. | | The other thing it saves just like RIO is the overhead of | locking/unlocking buffers by preregistering them. Is that the | main reason for this API then ? | | I would be very interested to hear from people who have | actually run into limits with overlapped file reads and are | therefore excited about IoRings | volta83 wrote: | Yes. | | Unfortunately Rust went the exact other way. | ginsmar wrote: | Very very interesting. | grok22 wrote: | Isn't locking a problem with io_uring? Won't you block the kernel | when it's trying to do the event completion stuff and the | completion work tries to take a lock? Or is the completion stuff | done entirely in user-space and blocking is not a problem? Maybe | I need to read up on this a bit more... | zxzax wrote: | The submissions and completions are stored in a lock-free ring | buffer, hence the name "uring." | legulere wrote: | lock-free ring buffers still have locks for the case when | they are full, so it would be interesting to see how the | kernel behaves when you never read from the completion ring. | zxzax wrote: | You will get an error for that upon submit. The application | can then buffer the submissions somewhere else and wait for | a few completions to finish. | mzs wrote: | bandwidth exceeded alternative: | https://web.archive.org/web/20210617150204/https://despairla... | zootboy wrote: | It seems that their server is rewriting the 451 error to a 403, | which caused Archive.org to drop the page from its archives. | Unfortunate... | bilalhusain wrote: | Currently serving Bandwidth Restricted page - 451. | | Cached version of the write up https://archive.is/VgHkW | joshmarinacci wrote: | I'm curious how this handles the case where the calling program | dies or wants to cancel the request before it's actually | happened. | ww520 wrote: | The do_exit() function in kernel/exit.c is responsible for | general cleanup on a process [1]. 
Whether a process dies | gracefully or abruptly, the kernel calls do_exit() to clean up | all the resources owned by the process, like opened files or | acquired locks. I would imagine the io_uring related stuff is | cleaned up there as well. | | [1] | https://elixir.bootlin.com/linux/v5.13-rc6/source/kernel/exi... | | Edit: I just looked up at the latest version of the source [1]. | Yes, it does clean up io_uring related files. | asdfasgasdgasdg wrote: | I don't know the answer but I would assume if you have | submitted an operation to the kernel, you should assume it's in | an indeterminate state until you get the result. If the program | dies then the call may complete or not. | | For cancellation there is an API. Example call: | https://github.com/axboe/liburing/blob/c4c280f31b0e05a1ea792... | PaulDavisThe1st wrote: | Death: same way it handles a program with an open socket and | data arriving and unread and the program dies. It's just part | of the overall resource set of the process and has to be | cleaned up when the process goes away. | raphlinus wrote: | Something I've been thinking about that maybe the HN hivemind can | help with; I know enough about io_uring and GPU each to be | dangerous. | | The roundtrip for command buffer submission in GPU is huge by my | estimation, around 100us. On a 10TFLOPS card, which is nowhere | near the top of the line, that's 1 billion operations. I don't | know exactly where all the time is going, but suspect it's a | bunch of process and kernel transitions between the application, | the userland driver, and the kernel driver. | | My understanding is that games mostly work around this by | batching up a lot of work (many dozens of draw calls, for | example) in one submission. But it's still a problem if CPU | readback is part of the workload. | | So my question is: can a technique like io_uring be used here, to | keep the GPU pipeline full and only take expensive transitions | when absolutely needed? I suspect the programming model will be | different and in some cases harder, but that's already part of | the territory with GPU. | boardwaalk wrote: | The communication between the host and the GPU already works on | a ring buffer on any modern GPU I believe. | | It's why graphics APIs are asynchronous until you synchronize | f.e. by flipping a frame buffer or reading something back. | | APIs like Vulkan are very explicit about this and have fences | and semaphores. Older APIs will just block if you do something | that requires blocking. | api wrote: | What's funny about io_uring is that the blocking syscall | interface was always something critics of Unix pointed out as | being a major shortcoming. We had OSes that did this kind of | thing way back in the 1980s and 1990s, but Unix with its | simplicity and generalism and free implementations took over. | | Now Unix is finally, in 2021, getting a syscall queue construct | where I can interact with the kernel asynchronously. | CodesInChaos wrote: | io_uring is Linux specific. BSD offers kqueue instead, | introduced in 2000. | | I believe both are limited to specific operations, mostly IO, | and aren't fully general asynchronous syscall interfaces. | rapsey wrote: | The entire point of the article is that it is not just for | IO. | binarycrusader wrote: | The OP said _mostly_ IO not only. | CodesInChaos wrote: | > The entire point of the article is that it is not just | for IO | | I went through the list of operations supported by | `io_uring_enter`. 
Almost all of them are for IO, the
| remainder (NOP, timeout, madvise) are useful for supporting
| IO, though madvise might have some non-IO uses as well.
| While io_uring could form the basis of a generic async
| syscall interface in the future, in its current state it
| most certainly is not.
|
| The article mostly talks about io_uring enabling completion
| based IO instead of readiness based IO.
|
| AFAIK kqueue also supports completion based IO using
| aio_read/aio_write together with sigevent.
| [deleted]
| coder543 wrote:
| > AFAIK kqueue also supports completion based IO using
| aio_read/aio_write together with sigevent.
|
| If you can point to a practical example of a program
| doing it this way and seeing a performance benefit, I
| would be curious to see it. I did some googling and
| didn't really even find any articles mentioning this as
| a possibility.
|
| kqueue is widely considered to be readiness based, just
| like epoll, not completion based.
|
| What you wrote sounds like an interesting hack, but I'm
| not sure it counts for much if it is impractical to use.
| CodesInChaos wrote:
| I don't know if it offers any practical benefits for
| sequential/socket IO. But AFAIK it's the way to go if you
| want to do async random-access/file IO.
| binarycrusader wrote:
| Solaris had this over a decade ago now with event ports, which
| are basically a variant of Windows IOCP:
|
| https://web.archive.org/web/20110719052845/http://developers...
|
| So at least one UNIX system had them a while ago.
| wahern wrote:
| People keep saying this but IME that's not how the Solaris
| Event Ports API works _at_ _all_. The semantics of Solaris
| Event Ports is nearly identical to both epoll+family and
| kqueue. And like with both those others (ignoring io_uring),
| I/O _completion_ is done using the POSIX AIO interface,
| which signals completion through the Event Port descriptor.
|
| I've written at least two wrapper libraries for I/O
| readiness, POSIX signal, file event, and user-triggered event
| polling that encompass epoll, kqueue, and Solaris Event
| Ports. Supporting all three is _relatively_ trivial from an
| API perspective because they work so similarly. In fact,
| notably all _three_ let you poll on the epoll, kqueue, or
| Event Port descriptor itself. So you can have event queue
| _trees_, which is very handy when writing composable
| libraries.
| ww520 wrote:
| I remember IPX had a similar communication model. To read from
| the network, you post an array of buffer pointers to the IPX
| driver and can continue to do whatever. When the buffers are
| filled, the driver calls your completion function.
| PaulDavisThe1st wrote:
| Maybe Linux will get scheduler activations in the near future,
| another OS feature from the 90s that ended up in Solaris and
| more or less nowhere else. "Let my user space thread scheduler
| do its work!"
| jlokier wrote:
| We talked about adding that to Linux in the 90s too. A
| simple, small scheduler-hook system call that would allow
| userspace to cover different asynchronous I/O scheduling
| cases efficiently.
|
| The sort of thing Go and Rust runtimes try to approximate in
| a hackish way nowadays. They would both be improved by an
| appropriate scheduler-activation hook.
|
| Back then the idea didn't gain support. It needed a champion,
| and nobody cared enough. It seemed unnecessary, complicated.
| What was done instead seemed to be driven by interests that
| focused on one kind of task or another, e.g. networking or
| databases.
| | It doesn't help that the understandings many people have of | performance around asynchronous I/O, stackless and stackful | coroutines, userspace-kernel interactions, CPU-hardware | interactions and so on are not particularly deep. For example | I've met a few people who argued that "async-await" is the | modern and faster alternative to threads in every scenario, | except for needing N threads to use N CPU cores. But that is | far from correct. Stackful coroutines doing blocking I/O with | complex logic (such as filesystems) are lighter than async- | await coroutines doing the same thing, and "heavy" fair | scheduling can improve throughput and latency statistics over | naive queueing. | | It's exciting to see efficient userspace-kernel I/O | scheduling getting attention, and getting better over the | years. Kudos to the implementors. | | But it's also kind of depressing that things that were on the | table 20-25 years ago take this long to be evaluated. It's | almost as if economics and personal situations governs | progress much more than knowledge and ideas... | PaulDavisThe1st wrote: | Actually, I think the biggest obstacle is that as cool as | scheduler activations are, it turns out that not many | applications are really in a position to benefit from them. | The ones that can found other ways ("workarounds") to | address the fact that the kernel scheduler can't know which | user space thread to run. They did so because it was | important to them. | zaphar wrote: | It's almost as if economics and personal situations | governs progress much more than knowledge and ideas... | | That has always been the case and will probably always be | the case. | aseipp wrote: | There's already plans for a new futex-based swap_to | primitive, for improving userland thread scheduling | capabilities. There was some work done on it last year, but | it was rejected on LKML. At this rate, it looks like it will | not move forward until the new futex2 syscall is in place, | since the original API is showing its age. | | So, it will probably happen Soon(tm), but you're probably | still ~2 years out before you can reliably depend on it, I'd | say. | PaulDavisThe1st wrote: | Scheduler activations don't require swap_to. | | The kernel wakes up the user space scheduler when it | decides to put the process onto a cpu. The user space | scheduler decides which _user space_ thread executes in the | kernel thread context that it runs in, and does a user | space thread switch (not a full context switch) to it. It | 's a combination of kernel threads and user space (aka | "green") threads. | gpderetta wrote: | I think some of the *BSDs have (or had) it. Linux almost got | it at the turn of the millennium, with the Next Generation | Posix Threading project, but then the much simpler and faster | NPTL won. | rektide wrote: | It might! I'm not sure if it's an exact fit or not but the | User Managed Concurrency Groups work[1] Google is trying to | upstream with their Fibers userland-scheduling library sounds | like it could be a match, and perhaps it could get the | upstreaming itcs seeking. | | [1] https://www.phoronix.com/scan.php?page=news_item&px=Googl | e-F... | tele_ski wrote: | I think this might be the best explanation I've read of why | io_uring should be better than epoll since it effectively | collapses the 'tell me when this is ready' with the 'do action' | part. That was the really enlightening part for me. 
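| In code terms, as I understand it, the 'do action' part gets
| handed to the kernel up front instead of us being told "the fd
| is ready, now go call read() yourself". A rough sketch with
| liburing (untested, and assuming fd and buf already exist):
|
|     struct io_uring ring;
|     io_uring_queue_init(8, &ring, 0);
|
|     struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|     io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); /* the action */
|     io_uring_submit(&ring);             /* hand it to the kernel */
|
|     struct io_uring_cqe *cqe;
|     io_uring_wait_cqe(&ring, &cqe);     /* buf is already filled here */
|     /* cqe->res is the byte count, or -errno on failure */
|     io_uring_cqe_seen(&ring, cqe);
|
| With epoll you'd be woken up at the "ready" point and would
| still have to make the read() syscall yourself.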
| | I have to say though, the name io_uring seems unfortunate and I | think the author touches on this in the article... the name is | really an implementation detail but io_uring's true purpose is a | generic asynchronous syscall facility that is currently tailored | towards i/o. syscall_queue or async_queue or something else...? A | descriptive api name and not an implementation detail would | probably go a long way in helping the feature be easier to | understand. Even window's IOCP seems infinitely better named than | 'uring'. | pydry wrote: | I'm still confused coz this is exactly what I always thought | the difference between epoll and select was. | | "what if, instead of the kernel telling us when something is | ready for an action to be taken so that we can take it, we tell | the kernel what action to we want to take, and it will do it | when the conditions become right." | | The difference between select and epoll was that select would | keep checking in until the conditions were right while epoll | would send _you_ a message. That was gamechanging. | | - I'm not really sure why this is seen as such a fundamental | change. It's changed from the kernel triggering a callback | to... a callback. | asveikau wrote: | select, poll, epoll, are all the same model of blocking and | signalling for readiness. | | The problem with the former occurs with large lists of file | descriptors. Calling from user to kernel, the kernel needs to | copy and examine N file descriptors. When user mode comes | back, it needs to scan its list of file descriptors to see | what changed. That's 2 O(n) scans at every syscall, one | kernel side, one user side, even if only zero or one file | descriptors has an event. | | epoll and kqueue make it so that the kernel persists the list | of interesting file descriptors between calls, and only | returns back what has actually changed, without either side | needing to scan an entire list. | | By contrast, the high level programming model of io_uring | seems pretty similar to POSIX AIO or Windows async I/O [away | from readiness and more towards "actually do the thing"], but | with the innovation being a new data structure that allows | reduction in syscall overhead. | coder543 wrote: | epoll: tell me when any of these descriptors are ready, then | I'll issue another syscall to actually read from that | descriptor into a buffer. | | io_uring: when any of these descriptors are ready, read into | any one of these buffers I've preallocated for you, then let | me know when it is done. | | Instead of waking up a process just so it can do the work of | calling back into the kernel to have the kernel fill a | buffer, io_uring skips that extra syscall altogether. | | Taking things to the next level, io_uring allows you to chain | operations together. You can tell it to read from one socket | and write the results into a different socket or directly to | a file, and it can do that without waking your process | pointlessly at any intermediate stage. | | A nearby comment also mentioned opening files, and that's | cool too. You could issue an entire command sequence to | io_uring, then your program can work on other stuff and check | on it later, or just go to sleep until everything is done. 
| You could tell the kernel that you want it to open a | connection, write a particular buffer that you prepared for | it into that connection, then open a specific file on disk, | read the response into that file, close the file, then send a | prepared buffer as a response to the connection, close the | connection, then let you know that it is all done. You just | have to prepare two buffers on the frontend, issue the | commands (which could require either 1 or 0 syscalls, | depending on how you're using io_uring), then do whatever you | want. | | You can even have numerous command sequences under kernel | control in parallel, you don't have to issue them one at a | time and wait on them to finish before you can issue the next | one. | | With epoll, you have to do every individual step along the | way yourself, which involves syscalls, context switches, and | potentially more code complexity. Then you realize that epoll | doesn't even support file I/O, so you have to mix multiple | approaches together to even approximate what io_uring is | doing. | | (Note: I've been looking for an excuse to use io_uring, so | I've read a ton about it, but I don't have any practical | experience with it yet. But everything I wrote above should | be accurate.) | throwaway81523 wrote: | Being able to open files with io_uring is important because | there is no other way to do it without an unpredictable | delay. Some systems like Erlang end up using separate OS | threads just to be able to open files without blocking the | main interpreter thread. | zxzax wrote: | If you're looking for an excuse to work on io_uring, please | consider helping get it implemented and tested in your | favorite event loop or I/O abstraction library. Here's some | open issues and PRs: | | https://github.com/golang/go/issues/31908 | | https://github.com/libuv/libuv/pull/2322 | | https://github.com/tokio-rs/mio/issues/923 | | https://gitlab.gnome.org/GNOME/glib/-/issues/2084 | | https://github.com/libevent/libevent/issues/1019 | coder543 wrote: | Oh, trust me... that Go issue is top of mind for me. I | have the fifth comment on that issue, along with several | other comments in there, and I'd love to implement it... | I'm just not familiar enough with working on Go runtime | internals, and motivation for volunteer work is sometimes | hard to come by for the past couple of years. | | Maybe someday I'll get it done :) | zxzax wrote: | Haha nice, I just noticed that :) I think supporting | someone else to help work on it and even just offering to | help test and review a PR is a great and useful thing to | do. | jra_samba wrote: | io_uring has been a game-changer for Samba IO speed. | | Check out Stefan Metzmacher's talk at SambaXP 2021 | (online event) for details: | | https://www.youtube.com/watch?v=eYxp8yJHpik | surrealize wrote: | The performance comparisons start here | https://youtu.be/eYxp8yJHpik?t=1421 | | Looks like the bandwidth went from 3.8 GB/s to 22 GB/s, | with the client being the bottleneck. | pydry wrote: | This makes it much clearer. Thanks! | tele_ski wrote: | What you're describing sounds awesome, I hadn't thought | about being able to string syscall commands together like | that. I wonder how well that will work in practice? Is | there a way to be notified if one of the commands in the | sequence fails like for instance the buffer wasn't large | enough to write all the incoming data into? | touisteur wrote: | I'm looking at the evolution in the chaining capabilities | of io_uring. 
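| Today a basic chain looks roughly like this (a sketch with
| liburing, untested; ring, sock_fd, file_fd and buf are made-up
| names):
|
|     /* read from sock_fd, then write the buffer to file_fd;
|        the second step only runs if the first fully succeeds,
|        otherwise the link is broken */
|     struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|     io_uring_prep_read(sqe, sock_fd, buf, sizeof(buf), 0);
|     sqe->flags |= IOSQE_IO_LINK;
|     sqe = io_uring_get_sqe(&ring);
|     io_uring_prep_write(sqe, file_fd, buf, sizeof(buf), 0);
|     io_uring_submit(&ring);        /* both steps, one syscall */
|
| Note the second step can't see how much the first one actually
| read; the chain is all-or-nothing for now.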
Right now it's a bit basic but I'm guessing | in 5 or 6 kernel versions people will have built a micro | kernel or a web server just by chaining things in | io_uring and maybe some custom chaining/decision blocks | in ebpf :-) | coder543 wrote: | BPF, you say? https://lwn.net/Articles/847951/ | | > The obvious place where BPF can add value is making | decisions based on the outcome of previous operations in | the ring. Currently, these decisions must be made in user | space, which involves potential delays as the relevant | process is scheduled and run. Instead, when an operation | completes, a BPF program might be able to decide what to | do next without ever leaving the kernel. "What to do | next" could include submitting more I/O operations, | moving on to the next in a series of files to process, or | aborting a series of commands if something unexpected | happens. | touisteur wrote: | BPF is going to change so many things... At the moment | I'm having lots of trouble with the tooling but hey, | let's just write BPF bytecode by hand or with a macro- | asm. Reduce the ambitions... | touisteur wrote: | Also wondering whether we should rethink language | runtimes for this. Like write everything in SPARK (so all | specs are checked), target bpf bytecode through gnatllvm. | OK you've written the equivalent of a cuda kernel or | tbb::flow block. Now for the chaining y'all have this | toolbox of task-chainers (barriers, priority queues, | routers...) and you'll never even enter userland? I'm | thinking /many/ programs could be described as such. | touisteur wrote: | Yes exactly what I had in mind. I'm also thinking of a | particular chain of syscalls [0][1][2][3] (send netlink | message, setsockopt, ioctls, getsockopts, reads, then | setsockopt, then send netlink message) grouped so as to | be done in one sequence without ever surfacing up to | userland (just fill those here buffers, who's a good | boy!). So now I'm missing ioctls and getsockopts but all | in good time! | | [0] https://github.com/checkpoint- | restore/criu/blob/7686b939d155... | | [1] https://github.com/checkpoint- | restore/criu/blob/7686b939d155... | | [2] https://github.com/checkpoint- | restore/criu/blob/7686b939d155... | | [3] https://www.infradead.org/~tgr/libnl/doc/api/group__q | disc__p... | coder543 wrote: | According to a relevant manpage[0]: | | > Only members inside the chain are serialized. A chain | of SQEs will be broken, if any request in that chain ends | in error. io_uring considers any unexpected result an | error. This means that, eg, a short read will also | terminate the remainder of the chain. If a chain of SQE | links is broken, the remaining unstarted part of the | chain will be terminated and completed with -ECANCELED as | the error code. | | So it sounds like you would need to decide what your | strategy is. It sounds like you can inspect the step in | the sequence that had the error, learn what the error | was, and decide whether you want to re-issue the command | that failed along with the remainder of the sequence. For | a short read, you should still have access to the bytes | that were read, so you're not losing information due to | the error. | | There is an alternative "hardlink" concept that will | continue the command sequence even in the presence of an | error in the previous step, like a short read, as long as | the previous step was correctly submitted. | | Error handling gets in the way of some of the fun, as | usual, but it is important to think about. 
| | [0]: https://manpages.debian.org/unstable/liburing- | dev/io_uring_e... | zxzax wrote: | Yes, check the documentation for the IOSQE_IO_LINK flag | to see exactly how this works. | dataflow wrote: | epoll is based on a "readiness" model (i.e. it tells when | when you can _start_ I /O). io_uring is based on a | "completion" model (i.e. it tells you when I/O is _finished_ | ). The latter is like Windows IOCP, where the C stands for | Completion. Readiness models are rather useless for a local | disk because, unlike with a socket, the disk is more or less | always ready to receive a command. | simcop2387 wrote: | Io uring can in theory be built to subscribe to any syscall | (though it hasn't yet). I don't believe epoll can do things | like stat, opening files, closing files, and syncing though. | [deleted] | eloff wrote: | I've seen mixed results so far. In theory it should perform | better than epoll, but I'm not sure it's quite there yet. The | maintainer of uWebSockets tried it with an earlier version and | it was slower. | | Where it really shines is disk IO because we don't have an | epoll equivalent there. I imagine it would also be great at | network requests that go to or from disk in a simple way | because you can chain the syscalls in theory. | zxzax wrote: | The main benefit to me has been in cases that previously | required a thread pool on top of epoll, it should be safe to | get rid of the thread pool now and only use io_uring. Socket | I/O of course doesn't really need a thread pool in a lot of | cases, but disk I/O does. | infogulch wrote: | syscall_uring would be my preference. | [deleted] | hawski wrote: | Now I wonder if my idea of having a muxcall or a batchcall as I | thought about it a few years ago is something similar to | io_uring, but on a lesser scale and without eBPF goodies. | | My idea was to have a syscall like this: struct | batchvec { unsigned long batchv_callnr; unsigned | long long batchv_argmask; }; asmlinkage long | sys_batchcall(struct batchvec *batchv, int batchvcnt, | long args[16], unsigned flags); | | You were supposed to give in a batchvec a sequence of system | call numbers and a little mapping to arguments you provided in | args. batchv_argmask is a long long - 64 bit type, this mask is | divided to 4 bit fields, every field can address a long from | args table. AFAIR Linux syscalls have up to 6 arguments. 6 | fields for arguments and one for return value, that gives 7 | fields - 28 bits and now I don't remember why I thought I need | a long long. | | It would go like this pseudo code: int i = 0; | for(; i < batchvcnt; i++) { args[batchv[i].argmask[6]] | = sys_call_table[batchv[i].callnr](args[batchv[i].argmask[0]], | args[batchv[i].argmask[1]], args[batchv[i].argmask[2]], | args[batchv[i].argmask[3]], args[batchv[i].argmask[4]], | args[batchv[i].argmask[5]]); | if(args[batchv[i].argmask[6]] < 0) { break; } | } return i; | | It would return a number of successfully run syscalls. It would | stop on first failed one. The user would have to pick up the | error code out of args table. | | I would be interested to know why it wouldn't work. | | I started implementing it against Linux 4.15.12, but never went | to test it. I have some code, but I don't believe it is my last | version of the attempt. | pkghost wrote: | W/r/t a more descriptive name, I disagree--though until fairly | recently I would have agreed. 
| | I would guess that the desire for something more "descriptive" | reflects the fact that you are not in the weeds with io_uring | (et al), and as such a name that's tied to specifics of the | terrain (io, urings) feels esoteric and unfamiliar. | | However, to anyone who is an immediate consumer of io_uring or | its compatriots, "io" obviously implies "syscall", but is | better than "syscall", because it's more specific; since | io_uring doens't do anything other than io-related syscalls | (there are other kinds of syscalls), naming it "syscall_" would | make it harder for its immediate audience to remember what it | does. | | Similarly, "uring" will be familiar to most of the immediate | audience, and is better than "queue", because it also | communicates some specific features (or performance | characteristics? idk, I'm also not in the weeds) of the API | that the more generic "_queue" would not. | | So, while I agree that the name is mildly inscrutable to us | distant onlookers, I think it's the right name, and indeed | reflects a wise pattern in naming concepts in complex systems. | The less ambiguity you introduce at each layer of indirection | or reference, the better. | | I recently did a toy project that has some files named in a | similar fashion: `docker/cmd`, which is what the CMD directive | in my Dockerfile points at, and `systemd/start`, which is what | the ExecStart line of my systemd service file points at. | They're mildly inscrutable if you're unfamiliar with either | docker or systemd, as they don't really say much about what | they _do_ , but this is a naming pattern that I can port to | just about any project, and at the same time stop spending | energy remembering a unique name for the app's entry point, or | the systemd script. | | Some abstract observations: - naming for | grokkability-a-first-glance is at odds with naming for utility- | over-time; the former is necessarily more ambiguous - | naming for utility over time seems like obviously the better | default naming strategy; find a nice spot in your readme for | onboarding metaphors and make sure the primary consumers of | your name don't have to work harder than necessary to make | sense of it - if you find a name inscrutable, perhaps | you're just missing some context | tele_ski wrote: | Thanks for your detailed thoughts on this, I am definitely | not involved at all and just onlooking from the sidelines. | The name does initially seem quite esoteric but I can | understand why it was picked. Thinking about it more 'io' | rather than 'syscall' does make sense, and Windows also does | use IO in IOCP. | wtallis wrote: | The io specificity is expected to be a temporary situation, | and that part of the name may end up being an anachronism in | a few years once a usefully large subset of another category | of syscalls has been added to io_uring. The shared ring | buffers aspect definitely is an implementation detail, but | one that does explain why performance is better than other | async IO methods (and also avoids misleading people into | thinking that it has something to do with the async/await | paradigm). | | If the BSDs hadn't already claimed the name, it would | probably have been fine to call this kqueue or something like | that. | justsomeuser wrote: | If my process has async/await and uses epoll, wouldn't this have | about the same performance as io_uring? | | E.g. with io_uring, the event triggers the kernel to read into a | buffer. 
|
| With async/await, epoll wakes up my process which does the
| syscall to read the file.
|
| In both cases you still need to read from the device and get
| the data to the user process?
| Matthias247 wrote:
| The answer is "go ahead and benchmark". There are some
| theoretical advantages to uring, like having to do fewer
| syscalls and fewer userspace <-> kernelspace transitions.
|
| In practice implementation differences can offset that
| advantage, or it just might not make any difference for the
| application at all since it's not the hotspot.
|
| I think for socket IO a variety of people did some synthetic
| benchmarks for epoll vs uring, and got all kinds of results
| from either one being a bit faster to both being roughly the
| same.
| Misdicorl wrote:
| Imagine your process, instead of getting woken up to make a
| syscall, gets woken up with a pointer to a filled data buffer.
| justsomeuser wrote:
| But some process (kernel or user space) still needs to spend
| the computer's finite resources to read it?
|
| If the work happens in the kernel process or user process it
| still costs the same?
| Misdicorl wrote:
| Yes, the syscall work still happens and the raw work has
| not changed. The overhead has dropped dramatically though,
| since you've eliminated at least 2 context switches per
| data fill cycle (and probably more).
| wtallis wrote:
| The transition from userspace to the kernel and back takes
| a similar amount of time to actually doing a small read
| from the fastest SSDs, or issuing a write to most
| relatively fast SSDs. So avoiding one extra syscall is a
| meaningful performance improvement even if you can't always
| eliminate a memcpy operation.
| Veserv wrote:
| That sounds unlikely. Syscall hardware overhead (entry +
| exit) on a modern x86 is only on the order of 100 ns,
| which is approximately main memory access latency. I am
| not familiar with Linux internals or what the fastest
| SSDs are capable of these days, but I am fairly sure that
| for your statement to be true Linux would need to be
| adding 1 to 2 orders of magnitude in software overhead.
| This occurs in the context switch pathway due to
| scheduling decisions, but it is fairly unlikely it occurs
| in the syscall pathway unless they are doing something
| horribly wrong.
| vlovich123 wrote:
| My understanding is that's the naive measurement of the
| cost of just the syscall operation (i.e. if you measure
| from issue until the kernel is executing). Does this
| actually account for the performance loss of cache
| inefficiency? If I'm not mistaken, at a minimum the CPU
| needs to flush various caches to enter the kernel, fill
| them up as the kernel is executing, & then repopulate them
| when executing back in userspace. In that case (even if
| it's not a full flush), you have a hard-to-measure slowdown
| on the code processing the request in the kernel & in
| userspace after the syscall, because the locality
| assumptions that caches rely on are invalidated. With an
| io_uring model, since there are no context switches,
| temporal & spatial locality should provide an outsized
| benefit beyond just removing the syscall itself.
|
| Additionally, as noted elsewhere, you can chain syscalls
| pretty deeply so that the entire operation occurs in the
| kernel & never schedules your process. This also benefits
| spatial & temporal locality AND removes the cost of
| needing to schedule the process in the first place.
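|
| To make the "fewer transitions" point concrete, a sketch
| (liburing, untested; the ring, fds, bufs and BUF_SIZE are
| assumed to be set up elsewhere, with the ring sized for the
| batch): you pay one transition for a whole batch of
| operations, or none at all if the kernel-side submission
| polling mode (IORING_SETUP_SQPOLL) is enabled.
|
|     for (int i = 0; i < 64; i++) {
|         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|         io_uring_prep_read(sqe, fds[i], bufs[i], BUF_SIZE, 0);
|     }
|     io_uring_submit_and_wait(&ring, 64); /* one syscall, 64 reads */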
| dundarious wrote: | Not necessarily, as you can use io_uring in a zero-copy way, | avoiding copies of network/disk data from kernel space to user | space. | diegocg wrote: | IIRC the author of io_uring wants to support using it with any | system call, not just the current supported subset. Not sure how | these efforts are going. | swiley wrote: | Supporting exec sounds like it could be a security issue. I | also have a hard time imagining why on earth you would call | exec that way. | tptacek wrote: | Say more about this? You're probably right but I haven't | thought about it before and I'd be happy to have you do that | thinking for me. :) | touisteur wrote: | Start a user-provided compression background task at file | close? Think tcpdump with io_uring 'take this here buffer | you've just read and send it to disk without bothering | userland. | the_duke wrote: | > Supporting exec sounds like it could be a security issue. | | Why would it be any more of an issue than calling a blocking | exec? | | > why on earth you would call exec that way | | The same reason as why you would want to do anything else | with io_uring? In an async runtime you have to delegate | blocking calls to a thread pool. Much nicer if the runtime | can use the same execution system as for other IO. | swiley wrote: | >Why would it be any more of an issue than calling a | blocking exec? | | Now you turn everything doing fast I/O into something | that's essentially an easier rwx situation WRT memory | corruption. | | >Much nicer | | In other words: the runtime should handle it because the | kernel doesn't need to. | the_duke wrote: | I really don't follow your thought process here. Can you | expand on your objections? | | io_uring can really be thought of as a generic | asynchronous syscall interface. | | It uses a kernel thread pool to run operations. If | something is blocking on a kernel-level it can just be | run as a blocking operation on that thread pool. | hinkley wrote: | If you can emulate blocking calls in user space for | backward compatibility, that is probably best. But I wonder | about process scheduling. Blocked threads don't get | rescheduled until the kernel unblocks them. Can you tell | the kernel to block a thread until an io_uring state change | occurs? | mikedilger wrote: | > io_uring is not an event system at all. io_uring is actually a | generic asynchronous syscall facility. | | I would call it a message passing system, not an asynchronous | syscall facility. A syscall, even one that doesn't block | indefinitely, transfers control to the kernel. io_uring, once | setup, doesn't. Now that we have multiple cores, there is no | reason to context switch if the kernel can handle your request on | some other core on which it is perhaps already running. | hinkley wrote: | It becomes a question of how expensive the message passing is | between cores versus the context switch overhead - especially | in a world where switching privilege levels has been made | expensive by multiple categories of attack against processor | speculation. | | Is there a middle ground where io_uring doesn't require the | mitigations but other syscalls do? | volta83 wrote: | This is how I think about it as well. | | Which makes me wonder if Mach wasn't right all along. | rjzzleep wrote: | I found this presentation to be quite informative | https://www.youtube.com/watch?v=-5T4Cjw46ys | vkka wrote: | ... _So it has the potential to make a lot of programs much | simpler_. | | More efficient? Yes. Simpler? Not really. 
A synchronous program | would be simpler, everyone who has done enough of these know it. | masklinn wrote: | I'm guessing they mean that programs which did not want to | block on syscalls and had to deploy workarounds can now just... | do async syscalls. | Filligree wrote: | Which is every program that wants to be fast, on modern | computers. So... | masklinn wrote: | Nonsense. | | Let's say your program wants to list a directory, if it has | nothing to do during that time then there is no point to | using an asynchronous model, that only adds costs and | complexity. | wtallis wrote: | True. But as soon as your program wants to list _two_ | directories and knows both names ahead of time, you have | an opportunity to fire off both operations for the kernel | to work on simultaneously. | | And even if your program doesn't have any opportunity for | doing IO in parallel, being able to chain a sequence of | IO operations together and issue them with at most one | syscall may still get you improved latency. | touisteur wrote: | Yes and even clustering often-used-together syscalls... | An interesting 2010 thesis | https://os.itec.kit.edu/deutsch/2211.php and | https://www2.cs.arizona.edu/~debray/Publications/multi- | call.... for something called 'multi-calls' | | Interesting times. | [deleted] | ot wrote: | Let's say your program wants to list a directory and sort | by mtime, or size. You need to stat all those files, | which means reading a bunch of inodes. And your directory | is on flash, so you'll definitely want to pipeline all | those operations. | | How do you do that without an async API? Thread pool and | synchronous syscalls? That's not simpler. | the8472 wrote: | You can use io_uring as a synchronous API too. Put a bunch of | commands on the submission queue (think of it as a submission | "buffer"), call io_uring_enter() with min_complete == number of | commands and once the syscall returns extract results from the | competion buffer^H^H^Hqueue. Voila, a perfectly synchronous | batch syscall interface. | | You can even choose between executing them sequentially and | aborting on the first error or trying to complete as many as | possible. | aseipp wrote: | My read of that paragraph was that they meant existing | asynchronous programs can be simplified, due to the need for | less workarounds for the Linux I/O layer (e.g. thread pools to | make disk operations appear asynchronous are no longer | necessary.) And I agree with that; asynchronous I/O had a lot | of pitfalls on Linux until io_uring came around, making things | much worse than strictly necessary. | | In general I totally agree that a synchronous program will be | way simpler than an equivalent asynchronous one, though. ___________________________________________________________________ (page generated 2021-06-17 23:00 UTC)