[HN Gopher] Io_uring is not an event system
       ___________________________________________________________________
        
       Io_uring is not an event system
        
       Author : ot
       Score  : 217 points
       Date   : 2021-06-17 14:56 UTC (8 hours ago)
        
 (HTM) web link (despairlabs.com)
 (TXT) w3m dump (despairlabs.com)
        
       | ayanamist wrote:
        | So Linux people find that the model Windows IOCP uses is
        | better?
        
         | asveikau wrote:
         | I think it's long been understood that it's better for disk
         | I/O. I'm not sure the consensus is as clear for sockets.
        
         | hermanradtke wrote:
         | As a Linux person: Yes, and I have thought so for a long time.
         | 
         | This is why many of us are excited about io_uring.
        
         | spullara wrote:
          | Now if only they could allocate the memory for you when it's
          | needed, rather than having a bunch of buffers allocated
          | before they're needed.
        
         | ot wrote:
          | Yes, and now things are going full circle and Windows is
          | adopting the ring model too.
         | 
         | https://windows-internals.com/i-o-rings-when-one-i-o-operati...
        
           | muststopmyths wrote:
           | huh, interesting. As far as I can tell the main advantage
           | this has over IOCP is that you can get one completion for
           | multiple read requests.
           | 
           | Looks like they took a lot of the concepts from Winsock RIO
           | and applied them to file I/O. Which is fascinating because
           | with network traffic you can't predict packet boundaries and
           | thus your I/O rate can be unpredictable. RIO helps you get
           | the notification rate under control, which can help if your
           | packet rate is very high.
           | 
           | With files, I would think you can control the rate at which
           | you request data, as well as the memory you allocate for it.
           | 
           | The other thing it saves just like RIO is the overhead of
           | locking/unlocking buffers by preregistering them. Is that the
            | main reason for this API, then?
           | 
           | I would be very interested to hear from people who have
           | actually run into limits with overlapped file reads and are
           | therefore excited about IoRings
        
         | volta83 wrote:
         | Yes.
         | 
         | Unfortunately Rust went the exact other way.
        
       | ginsmar wrote:
       | Very very interesting.
        
       | grok22 wrote:
       | Isn't locking a problem with io_uring? Won't you block the kernel
       | when it's trying to do the event completion stuff and the
       | completion work tries to take a lock? Or is the completion stuff
       | done entirely in user-space and blocking is not a problem? Maybe
       | I need to read up on this a bit more...
        
         | zxzax wrote:
         | The submissions and completions are stored in a lock-free ring
         | buffer, hence the name "uring."
        
           | legulere wrote:
           | lock-free ring buffers still have locks for the case when
           | they are full, so it would be interesting to see how the
           | kernel behaves when you never read from the completion ring.
        
             | zxzax wrote:
             | You will get an error for that upon submit. The application
             | can then buffer the submissions somewhere else and wait for
             | a few completions to finish.
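              | 
              | Roughly like this with liburing (untested sketch;
              | handle_cqe is a stand-in for whatever your completion
              | handler is):
              | 
              |     #include <liburing.h>
              | 
              |     /* stand-in for your own completion handler */
              |     extern void handle_cqe(struct io_uring_cqe *cqe);
              | 
              |     /* Sketch: on kernels with IORING_FEAT_NODROP, the
              |      * submit fails with -EBUSY while completions are
              |      * piling up, so back off, reap one, and retry. */
              |     static int submit_with_backoff(struct io_uring *ring)
              |     {
              |         int ret = io_uring_submit(ring);
              |         while (ret == -EBUSY) {
              |             struct io_uring_cqe *cqe;
              |             if (io_uring_wait_cqe(ring, &cqe) == 0) {
              |                 handle_cqe(cqe);
              |                 io_uring_cqe_seen(ring, cqe);
              |             }
              |             ret = io_uring_submit(ring);
              |         }
              |         return ret;   /* SQEs submitted, or -errno */
              |     }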
        
       | mzs wrote:
       | bandwidth exceeded alternative:
       | https://web.archive.org/web/20210617150204/https://despairla...
        
         | zootboy wrote:
         | It seems that their server is rewriting the 451 error to a 403,
         | which caused Archive.org to drop the page from its archives.
         | Unfortunate...
        
       | bilalhusain wrote:
       | Currently serving Bandwidth Restricted page - 451.
       | 
       | Cached version of the write up https://archive.is/VgHkW
        
       | joshmarinacci wrote:
       | I'm curious how this handles the case where the calling program
       | dies or wants to cancel the request before it's actually
       | happened.
        
         | ww520 wrote:
         | The do_exit() function in kernel/exit.c is responsible for
         | general cleanup on a process [1]. Whether a process dies
         | gracefully or abruptly, the kernel calls do_exit() to clean up
         | all the resources owned by the process, like opened files or
         | acquired locks. I would imagine the io_uring related stuff is
         | cleaned up there as well.
         | 
         | [1]
         | https://elixir.bootlin.com/linux/v5.13-rc6/source/kernel/exi...
         | 
          | Edit: I just looked at the latest version of the source [1].
         | Yes, it does clean up io_uring related files.
        
         | asdfasgasdgasdg wrote:
          | I don't know the answer, but I would assume that once you have
          | submitted an operation to the kernel, it's in an indeterminate
          | state until you get the result. If the program dies, the call
          | may or may not complete.
         | 
         | For cancellation there is an API. Example call:
         | https://github.com/axboe/liburing/blob/c4c280f31b0e05a1ea792...
        
         | PaulDavisThe1st wrote:
          | Death: the same way it handles a program that dies with an
          | open socket and unread data still arriving. It's just part
          | of the overall resource set of the process and has to be
          | cleaned up when the process goes away.
        
       | raphlinus wrote:
       | Something I've been thinking about that maybe the HN hivemind can
       | help with; I know enough about io_uring and GPU each to be
       | dangerous.
       | 
       | The roundtrip for command buffer submission in GPU is huge by my
       | estimation, around 100us. On a 10TFLOPS card, which is nowhere
       | near the top of the line, that's 1 billion operations. I don't
       | know exactly where all the time is going, but suspect it's a
       | bunch of process and kernel transitions between the application,
       | the userland driver, and the kernel driver.
       | 
       | My understanding is that games mostly work around this by
       | batching up a lot of work (many dozens of draw calls, for
       | example) in one submission. But it's still a problem if CPU
       | readback is part of the workload.
       | 
       | So my question is: can a technique like io_uring be used here, to
       | keep the GPU pipeline full and only take expensive transitions
       | when absolutely needed? I suspect the programming model will be
       | different and in some cases harder, but that's already part of
       | the territory with GPU.
        
         | boardwaalk wrote:
         | The communication between the host and the GPU already works on
         | a ring buffer on any modern GPU I believe.
         | 
          | It's why graphics APIs are asynchronous until you synchronize,
          | e.g. by flipping a frame buffer or reading something back.
         | 
         | APIs like Vulkan are very explicit about this and have fences
         | and semaphores. Older APIs will just block if you do something
         | that requires blocking.
        
       | api wrote:
       | What's funny about io_uring is that the blocking syscall
       | interface was always something critics of Unix pointed out as
       | being a major shortcoming. We had OSes that did this kind of
       | thing way back in the 1980s and 1990s, but Unix with its
       | simplicity and generalism and free implementations took over.
       | 
       | Now Unix is finally, in 2021, getting a syscall queue construct
       | where I can interact with the kernel asynchronously.
        
         | CodesInChaos wrote:
         | io_uring is Linux specific. BSD offers kqueue instead,
         | introduced in 2000.
         | 
         | I believe both are limited to specific operations, mostly IO,
         | and aren't fully general asynchronous syscall interfaces.
        
           | rapsey wrote:
           | The entire point of the article is that it is not just for
           | IO.
        
             | binarycrusader wrote:
              | The OP said _mostly_ IO, not _only_ IO.
        
             | CodesInChaos wrote:
             | > The entire point of the article is that it is not just
             | for IO
             | 
             | I went through the list of operations supported by
             | `io_uring_enter`. Almost all of them are for IO, the
             | remainder (NOP, timeout, madvise) are useful for supporting
             | IO, though madvise might have some non-IO uses as well.
             | While io_uring could form the basis of a generic async
             | syscall interface in the future, in its current state it
             | most certainly is not.
             | 
             | The article mostly talks about io_uring enabling completion
             | based IO instead of readiness based IO.
             | 
             | AFAIK kqueue also supports completion based IO using
             | aio_read/aio_write together with sigevent.
        
               | [deleted]
        
               | coder543 wrote:
               | > AFAIK kqueue also supports completion based IO using
               | aio_read/aio_write together with sigevent.
               | 
               | If you can point to a practical example of a program
               | doing it this way and seeing a performance benefit, I
               | would be curious to see it. I did some googling and
                | didn't really even find any articles mentioning this as
                | a possibility.
               | 
               | kqueue is widely considered to be readiness based, just
               | like epoll, not completion based.
               | 
               | What you wrote sounds like an interesting hack, but I'm
               | not sure it counts for much if it is impractical to use.
        
               | CodesInChaos wrote:
               | I don't know if it offers any practical benefits for
               | sequential/socket IO. But AFAIK it's the way to go if you
               | want to do async random-access/file IO.
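                | 
                | For reference, on FreeBSD the mechanism looks roughly
                | like this (a from-memory sketch, no error handling;
                | see aio(4) and aio_read(2) for the exact kevent
                | fields that get filled in):
                | 
                |     #include <sys/types.h>
                |     #include <sys/event.h>
                |     #include <aio.h>
                |     #include <unistd.h>
                | 
                |     /* Sketch: completion-based file read, with the
                |      * completion delivered to a kqueue rather than
                |      * a signal. */
                |     static ssize_t kq_aio_read(int fd, void *buf,
                |                                size_t len)
                |     {
                |         int kq = kqueue();
                | 
                |         struct aiocb cb = {0};
                |         cb.aio_fildes = fd;
                |         cb.aio_buf    = buf;
                |         cb.aio_nbytes = len;
                |         cb.aio_offset = 0;
                |         cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
                |         cb.aio_sigevent.sigev_notify_kqueue = kq;
                |         cb.aio_sigevent.sigev_value.sival_ptr = &cb;
                | 
                |         aio_read(&cb);   /* returns immediately */
                | 
                |         /* Block until the EVFILT_AIO completion
                |          * event shows up on the kqueue. */
                |         struct kevent ev;
                |         kevent(kq, NULL, 0, &ev, 1, NULL);
                | 
                |         ssize_t n = aio_return(&cb);
                |         close(kq);
                |         return n;
                |     }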
        
         | binarycrusader wrote:
         | Solaris had this over a decade ago now with event ports which
         | are basically a variant of Windows IOCP:
         | 
         | https://web.archive.org/web/20110719052845/http://developers...
         | 
         | So at least one UNIX system had them a while ago.
        
           | wahern wrote:
           | People keep saying this but IME that's not how the Solaris
           | Event Ports API works _at_ _all_. The semantics of Solaris
           | Event Ports is nearly identical to both epoll+family and
           | kqueue. And like with both those others (ignoring io_uring),
            | I/O _completion_ is done using the POSIX AIO interface,
           | which signals completion through the Event Port descriptor.
           | 
           | I've written at least two wrapper libraries for I/O
           | readiness, POSIX signal, file event, and user-triggered event
           | polling that encompass epoll, kqueue, and Solaris Event
           | Ports. Supporting all three is _relatively_ trivial from an
           | API perspective because they work so similarly. In fact,
           | notably all _three_ let you poll on the epoll, kqueue, or
           | Event Port descriptor itself. So you can have event queue
            | _trees_, which is very handy when writing composable
           | libraries.
        
         | ww520 wrote:
          | I remember IPX had a similar communication model. To read from
          | the network, you post an array of buffer pointers to the IPX
          | driver and can continue doing whatever. When the buffers are
          | filled, the driver calls your completion function.
        
         | PaulDavisThe1st wrote:
         | Maybe Linux will get scheduler activations in the near future,
         | another OS feature from the 90s that ended up in Solaris and
         | more or less nowhere else. "Let my user space thread scheduler
         | do its work!"
        
           | jlokier wrote:
           | We talked about adding that to Linux in the 90s too. A
           | simple, small scheduler-hook system call that would allow
           | userspace to cover different asynchronous I/O scheduling
           | cases efficiently.
           | 
            | The sort of thing Go and Rust runtimes try to approximate in
            | a hackish way nowadays. They would both be improved by an
            | appropriate scheduler-activation hook.
           | 
           | Back then the idea didn't gain support. It needed a champion,
           | and nobody cared enough. It seemed unnecessary, complicated.
           | What was done instead seemed to be driven by interests that
           | focused on one kind of task or another, e.g. networking or
           | databases.
           | 
            | It doesn't help that many people's understanding of
            | performance around asynchronous I/O, stackless and stackful
            | coroutines, userspace-kernel interactions, CPU-hardware
            | interactions and so on is not particularly deep. For example
           | I've met a few people who argued that "async-await" is the
           | modern and faster alternative to threads in every scenario,
           | except for needing N threads to use N CPU cores. But that is
           | far from correct. Stackful coroutines doing blocking I/O with
           | complex logic (such as filesystems) are lighter than async-
           | await coroutines doing the same thing, and "heavy" fair
           | scheduling can improve throughput and latency statistics over
           | naive queueing.
           | 
           | It's exciting to see efficient userspace-kernel I/O
           | scheduling getting attention, and getting better over the
           | years. Kudos to the implementors.
           | 
           | But it's also kind of depressing that things that were on the
           | table 20-25 years ago take this long to be evaluated. It's
            | almost as if economics and personal situations govern
            | progress much more than knowledge and ideas...
        
             | PaulDavisThe1st wrote:
             | Actually, I think the biggest obstacle is that as cool as
             | scheduler activations are, it turns out that not many
             | applications are really in a position to benefit from them.
             | The ones that can found other ways ("workarounds") to
             | address the fact that the kernel scheduler can't know which
             | user space thread to run. They did so because it was
             | important to them.
        
             | zaphar wrote:
              | > It's almost as if economics and personal situations
              | govern progress much more than knowledge and ideas...
             | 
             | That has always been the case and will probably always be
             | the case.
        
           | aseipp wrote:
            | There are already plans for a new futex-based swap_to
            | primitive for improving userland thread scheduling
           | capabilities. There was some work done on it last year, but
           | it was rejected on LKML. At this rate, it looks like it will
           | not move forward until the new futex2 syscall is in place,
           | since the original API is showing its age.
           | 
           | So, it will probably happen Soon(tm), but you're probably
           | still ~2 years out before you can reliably depend on it, I'd
           | say.
        
             | PaulDavisThe1st wrote:
             | Scheduler activations don't require swap_to.
             | 
             | The kernel wakes up the user space scheduler when it
             | decides to put the process onto a cpu. The user space
             | scheduler decides which _user space_ thread executes in the
             | kernel thread context that it runs in, and does a user
             | space thread switch (not a full context switch) to it. It
             | 's a combination of kernel threads and user space (aka
             | "green") threads.
        
           | gpderetta wrote:
           | I think some of the *BSDs have (or had) it. Linux almost got
           | it at the turn of the millennium, with the Next Generation
           | Posix Threading project, but then the much simpler and faster
           | NPTL won.
        
           | rektide wrote:
           | It might! I'm not sure if it's an exact fit or not but the
           | User Managed Concurrency Groups work[1] Google is trying to
           | upstream with their Fibers userland-scheduling library sounds
           | like it could be a match, and perhaps it could get the
            | upstreaming it's seeking.
           | 
           | [1] https://www.phoronix.com/scan.php?page=news_item&px=Googl
           | e-F...
        
       | tele_ski wrote:
        | I think this might be the best explanation I've read of why
        | io_uring should be better than epoll: it effectively collapses
        | the 'tell me when this is ready' step into the 'do this action'
        | step. That was the really enlightening part for me.
       | 
       | I have to say though, the name io_uring seems unfortunate and I
       | think the author touches on this in the article... the name is
       | really an implementation detail but io_uring's true purpose is a
       | generic asynchronous syscall facility that is currently tailored
       | towards i/o. syscall_queue or async_queue or something else...? A
        | descriptive API name rather than an implementation detail would
        | probably go a long way toward making the feature easier to
        | understand. Even Windows' IOCP seems infinitely better named than
        | 'uring'.
        
         | pydry wrote:
         | I'm still confused coz this is exactly what I always thought
         | the difference between epoll and select was.
         | 
         | "what if, instead of the kernel telling us when something is
         | ready for an action to be taken so that we can take it, we tell
          | the kernel what action we want to take, and it will do it
         | when the conditions become right."
         | 
         | The difference between select and epoll was that select would
         | keep checking in until the conditions were right while epoll
         | would send _you_ a message. That was gamechanging.
         | 
         | - I'm not really sure why this is seen as such a fundamental
         | change. It's changed from the kernel triggering a callback
         | to... a callback.
        
           | asveikau wrote:
           | select, poll, epoll, are all the same model of blocking and
           | signalling for readiness.
           | 
            | The problem with the first two occurs with large lists
            | of file descriptors. Calling from user to kernel, the
            | kernel needs to copy and examine N file descriptors.
            | When user mode comes back, it needs to scan its list of
            | file descriptors to see what changed. That's two O(n)
            | scans at every syscall, one kernel side, one user side,
            | even if only zero or one file descriptor has an event.
           | 
           | epoll and kqueue make it so that the kernel persists the list
           | of interesting file descriptors between calls, and only
           | returns back what has actually changed, without either side
           | needing to scan an entire list.
           | 
           | By contrast, the high level programming model of io_uring
           | seems pretty similar to POSIX AIO or Windows async I/O [away
           | from readiness and more towards "actually do the thing"], but
           | with the innovation being a new data structure that allows
           | reduction in syscall overhead.
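            | 
            | For the epoll side, the register-once, wait-many split
            | looks like this (bare-bones sketch, no error handling;
            | handle_readable stands in for whatever your program does
            | with a ready descriptor):
            | 
            |     #include <sys/epoll.h>
            | 
            |     /* whatever your program does with a readable fd */
            |     extern void handle_readable(int fd);
            | 
            |     static void serve(int sock)
            |     {
            |         /* Register once; the kernel keeps the interest
            |          * list around between calls. */
            |         int ep = epoll_create1(0);
            |         struct epoll_event ev = { .events = EPOLLIN,
            |                                   .data.fd = sock };
            |         epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);
            | 
            |         for (;;) {
            |             /* Each wait returns only the descriptors
            |              * that actually have events; no O(n) rescan
            |              * of the whole list as with select()/poll(). */
            |             struct epoll_event ready[64];
            |             int n = epoll_wait(ep, ready, 64, -1);
            |             for (int i = 0; i < n; i++)
            |                 handle_readable(ready[i].data.fd);
            |         }
            |     }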
        
           | coder543 wrote:
           | epoll: tell me when any of these descriptors are ready, then
           | I'll issue another syscall to actually read from that
           | descriptor into a buffer.
           | 
           | io_uring: when any of these descriptors are ready, read into
           | any one of these buffers I've preallocated for you, then let
           | me know when it is done.
           | 
           | Instead of waking up a process just so it can do the work of
           | calling back into the kernel to have the kernel fill a
           | buffer, io_uring skips that extra syscall altogether.
           | 
           | Taking things to the next level, io_uring allows you to chain
           | operations together. You can tell it to read from one socket
           | and write the results into a different socket or directly to
           | a file, and it can do that without waking your process
           | pointlessly at any intermediate stage.
           | 
           | A nearby comment also mentioned opening files, and that's
           | cool too. You could issue an entire command sequence to
           | io_uring, then your program can work on other stuff and check
           | on it later, or just go to sleep until everything is done.
           | You could tell the kernel that you want it to open a
           | connection, write a particular buffer that you prepared for
           | it into that connection, then open a specific file on disk,
           | read the response into that file, close the file, then send a
           | prepared buffer as a response to the connection, close the
           | connection, then let you know that it is all done. You just
           | have to prepare two buffers on the frontend, issue the
           | commands (which could require either 1 or 0 syscalls,
           | depending on how you're using io_uring), then do whatever you
           | want.
           | 
            | You can even have numerous command sequences under kernel
            | control in parallel; you don't have to issue them one at a
            | time and wait for them to finish before you can issue the
            | next one.
           | 
           | With epoll, you have to do every individual step along the
           | way yourself, which involves syscalls, context switches, and
           | potentially more code complexity. Then you realize that epoll
           | doesn't even support file I/O, so you have to mix multiple
           | approaches together to even approximate what io_uring is
           | doing.
           | 
           | (Note: I've been looking for an excuse to use io_uring, so
           | I've read a ton about it, but I don't have any practical
           | experience with it yet. But everything I wrote above should
           | be accurate.)
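            | 
            | To make that concrete, a single read looks roughly like
            | this with liburing (an untested sketch; "data.bin" and
            | the buffer size are just placeholders, and the read op
            | needs a 5.6+ kernel if I remember right):
            | 
            |     #include <liburing.h>
            |     #include <fcntl.h>
            |     #include <stdio.h>
            | 
            |     int main(void)
            |     {
            |         struct io_uring ring;
            |         io_uring_queue_init(8, &ring, 0);
            | 
            |         int fd = open("data.bin", O_RDONLY);
            |         static char buf[4096];
            | 
            |         /* Describe the operation ("read into this
            |          * buffer"). Nothing has been executed yet. */
            |         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            |         io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
            | 
            |         /* One syscall submits it; the kernel fills the
            |          * buffer while we're free to do other work. */
            |         io_uring_submit(&ring);
            | 
            |         /* ... other work ... */
            | 
            |         /* The completion means the read has already
            |          * happened; no second syscall to fetch data. */
            |         struct io_uring_cqe *cqe;
            |         io_uring_wait_cqe(&ring, &cqe);
            |         printf("read %d bytes\n", cqe->res);
            |         io_uring_cqe_seen(&ring, cqe);
            | 
            |         io_uring_queue_exit(&ring);
            |         return 0;
            |     }
            | 
            | With epoll, the analogous loop would get a "readable"
            | event at that point and still have to call read() itself
            | (and epoll wouldn't help for a plain file at all).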
        
             | throwaway81523 wrote:
             | Being able to open files with io_uring is important because
             | there is no other way to do it without an unpredictable
             | delay. Some systems like Erlang end up using separate OS
             | threads just to be able to open files without blocking the
             | main interpreter thread.
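              | 
              | For what it's worth, with liburing that looks something
              | like this (sketch only; the async open op needs a 5.6+
              | kernel if I remember right):
              | 
              |     #include <liburing.h>
              |     #include <fcntl.h>
              | 
              |     /* Sketch: ask the kernel to open a file
              |      * asynchronously. The new fd (or -errno) comes
              |      * back in cqe->res, so no thread ever blocks on
              |      * a slow open. */
              |     static void submit_open(struct io_uring *ring,
              |                             const char *path)
              |     {
              |         struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              |         io_uring_prep_openat(sqe, AT_FDCWD, path,
              |                              O_RDONLY, 0);
              |         io_uring_submit(ring);
              |     }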
        
             | zxzax wrote:
             | If you're looking for an excuse to work on io_uring, please
             | consider helping get it implemented and tested in your
             | favorite event loop or I/O abstraction library. Here's some
             | open issues and PRs:
             | 
             | https://github.com/golang/go/issues/31908
             | 
             | https://github.com/libuv/libuv/pull/2322
             | 
             | https://github.com/tokio-rs/mio/issues/923
             | 
             | https://gitlab.gnome.org/GNOME/glib/-/issues/2084
             | 
             | https://github.com/libevent/libevent/issues/1019
        
               | coder543 wrote:
               | Oh, trust me... that Go issue is top of mind for me. I
               | have the fifth comment on that issue, along with several
               | other comments in there, and I'd love to implement it...
               | I'm just not familiar enough with working on Go runtime
                | internals, and motivation for volunteer work has
                | sometimes been hard to come by for the past couple of
                | years.
               | 
               | Maybe someday I'll get it done :)
        
               | zxzax wrote:
                | Haha nice, I just noticed that :) I think supporting
                | someone else who wants to work on it, or even just
                | offering to help test and review a PR, is a great and
                | useful thing to do.
        
               | jra_samba wrote:
               | io_uring has been a game-changer for Samba IO speed.
               | 
               | Check out Stefan Metzmacher's talk at SambaXP 2021
               | (online event) for details:
               | 
               | https://www.youtube.com/watch?v=eYxp8yJHpik
        
               | surrealize wrote:
               | The performance comparisons start here
               | https://youtu.be/eYxp8yJHpik?t=1421
               | 
               | Looks like the bandwidth went from 3.8 GB/s to 22 GB/s,
               | with the client being the bottleneck.
        
             | pydry wrote:
             | This makes it much clearer. Thanks!
        
             | tele_ski wrote:
             | What you're describing sounds awesome, I hadn't thought
             | about being able to string syscall commands together like
              | that. I wonder how well that will work in practice. Is
              | there a way to be notified if one of the commands in the
              | sequence fails, for instance if the buffer wasn't large
              | enough to hold all the incoming data?
        
               | touisteur wrote:
               | I'm looking at the evolution in the chaining capabilities
               | of io_uring. Right now it's a bit basic but I'm guessing
               | in 5 or 6 kernel versions people will have built a micro
               | kernel or a web server just by chaining things in
               | io_uring and maybe some custom chaining/decision blocks
               | in ebpf :-)
        
               | coder543 wrote:
               | BPF, you say? https://lwn.net/Articles/847951/
               | 
               | > The obvious place where BPF can add value is making
               | decisions based on the outcome of previous operations in
               | the ring. Currently, these decisions must be made in user
               | space, which involves potential delays as the relevant
               | process is scheduled and run. Instead, when an operation
               | completes, a BPF program might be able to decide what to
               | do next without ever leaving the kernel. "What to do
               | next" could include submitting more I/O operations,
               | moving on to the next in a series of files to process, or
               | aborting a series of commands if something unexpected
               | happens.
        
               | touisteur wrote:
               | BPF is going to change so many things... At the moment
               | I'm having lots of trouble with the tooling but hey,
               | let's just write BPF bytecode by hand or with a macro-
               | asm. Reduce the ambitions...
        
               | touisteur wrote:
               | Also wondering whether we should rethink language
               | runtimes for this. Like write everything in SPARK (so all
               | specs are checked), target bpf bytecode through gnatllvm.
               | OK you've written the equivalent of a cuda kernel or
               | tbb::flow block. Now for the chaining y'all have this
               | toolbox of task-chainers (barriers, priority queues,
               | routers...) and you'll never even enter userland? I'm
               | thinking /many/ programs could be described as such.
        
               | touisteur wrote:
               | Yes exactly what I had in mind. I'm also thinking of a
               | particular chain of syscalls [0][1][2][3] (send netlink
               | message, setsockopt, ioctls, getsockopts, reads, then
               | setsockopt, then send netlink message) grouped so as to
               | be done in one sequence without ever surfacing up to
               | userland (just fill those here buffers, who's a good
               | boy!). So now I'm missing ioctls and getsockopts but all
               | in good time!
               | 
               | [0] https://github.com/checkpoint-
               | restore/criu/blob/7686b939d155...
               | 
               | [1] https://github.com/checkpoint-
               | restore/criu/blob/7686b939d155...
               | 
               | [2] https://github.com/checkpoint-
               | restore/criu/blob/7686b939d155...
               | 
               | [3] https://www.infradead.org/~tgr/libnl/doc/api/group__q
               | disc__p...
        
               | coder543 wrote:
               | According to a relevant manpage[0]:
               | 
               | > Only members inside the chain are serialized. A chain
               | of SQEs will be broken, if any request in that chain ends
               | in error. io_uring considers any unexpected result an
               | error. This means that, eg, a short read will also
               | terminate the remainder of the chain. If a chain of SQE
               | links is broken, the remaining unstarted part of the
               | chain will be terminated and completed with -ECANCELED as
               | the error code.
               | 
               | So it sounds like you would need to decide what your
               | strategy is. It sounds like you can inspect the step in
               | the sequence that had the error, learn what the error
               | was, and decide whether you want to re-issue the command
               | that failed along with the remainder of the sequence. For
               | a short read, you should still have access to the bytes
               | that were read, so you're not losing information due to
               | the error.
               | 
               | There is an alternative "hardlink" concept that will
               | continue the command sequence even in the presence of an
               | error in the previous step, like a short read, as long as
               | the previous step was correctly submitted.
               | 
               | Error handling gets in the way of some of the fun, as
               | usual, but it is important to think about.
               | 
               | [0]: https://manpages.debian.org/unstable/liburing-
               | dev/io_uring_e...
        
               | zxzax wrote:
               | Yes, check the documentation for the IOSQE_IO_LINK flag
               | to see exactly how this works.
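                | 
                | A linked pair looks roughly like this (sketch; the
                | flag is the documented bit, the rest is illustrative
                | and ignores the short-read case discussed above):
                | 
                |     #include <liburing.h>
                | 
                |     /* Sketch: read from in_fd, then write that
                |      * buffer to out_fd. IOSQE_IO_LINK makes the
                |      * write wait for the read; if the read errors
                |      * or comes up short, the write completes with
                |      * -ECANCELED instead of running. */
                |     static void queue_copy(struct io_uring *ring,
                |                            int in_fd, int out_fd,
                |                            char *buf, unsigned len)
                |     {
                |         struct io_uring_sqe *sqe;
                | 
                |         sqe = io_uring_get_sqe(ring);
                |         io_uring_prep_read(sqe, in_fd, buf, len, 0);
                |         sqe->flags |= IOSQE_IO_LINK;
                | 
                |         sqe = io_uring_get_sqe(ring);
                |         io_uring_prep_write(sqe, out_fd, buf, len, 0);
                | 
                |         io_uring_submit(ring);  /* one syscall */
                |     }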
        
           | dataflow wrote:
            | epoll is based on a "readiness" model (i.e. it tells you
            | when you can _start_ I/O). io_uring is based on a
            | "completion" model (i.e. it tells you when I/O is
            | _finished_). The latter is like Windows IOCP, where the C
            | stands for Completion. Readiness models are rather useless
            | for a local
           | disk because, unlike with a socket, the disk is more or less
           | always ready to receive a command.
        
           | simcop2387 wrote:
            | io_uring can in theory be extended to support any syscall
            | (though it hasn't been yet). I don't believe epoll can do
            | things like stat, opening files, closing files, or syncing,
            | though.
        
           | [deleted]
        
         | eloff wrote:
         | I've seen mixed results so far. In theory it should perform
         | better than epoll, but I'm not sure it's quite there yet. The
         | maintainer of uWebSockets tried it with an earlier version and
         | it was slower.
         | 
         | Where it really shines is disk IO because we don't have an
         | epoll equivalent there. I imagine it would also be great at
         | network requests that go to or from disk in a simple way
         | because you can chain the syscalls in theory.
        
           | zxzax wrote:
            | The main benefit to me has been in cases that previously
            | required a thread pool on top of epoll: it should be safe to
            | get rid of the thread pool now and only use io_uring. Socket
           | I/O of course doesn't really need a thread pool in a lot of
           | cases, but disk I/O does.
        
         | infogulch wrote:
         | syscall_uring would be my preference.
        
         | [deleted]
        
         | hawski wrote:
          | Now I wonder if my idea of a muxcall or batchcall, as I
          | thought of it a few years ago, is something similar to
          | io_uring, but on a smaller scale and without the eBPF
          | goodies.
         | 
          | My idea was to have a syscall like this:
          | 
          |     struct batchvec {
          |         unsigned long batchv_callnr;
          |         unsigned long long batchv_argmask;
          |     };
          | 
          |     asmlinkage long sys_batchcall(struct batchvec *batchv,
          |                                   int batchvcnt,
          |                                   long args[16],
          |                                   unsigned flags);
         | 
          | You were supposed to pass, in batchvec, a sequence of
          | system call numbers and a little mapping to the arguments
          | you provided in args. batchv_argmask is a long long (a
          | 64-bit type); the mask is divided into 4-bit fields, and
          | every field can address a long from the args table. AFAIR
          | Linux syscalls have up to 6 arguments. 6 fields for
          | arguments and one for the return value gives 7 fields, or
          | 28 bits, and now I don't remember why I thought I needed a
          | long long.
         | 
          | It would go like this pseudo code (argmask[k] here meaning
          | the k-th 4-bit field of batchv_argmask):
          | 
          |     int i = 0;
          |     for (; i < batchvcnt; i++) {
          |         args[batchv[i].argmask[6]] =
          |             sys_call_table[batchv[i].callnr](
          |                 args[batchv[i].argmask[0]],
          |                 args[batchv[i].argmask[1]],
          |                 args[batchv[i].argmask[2]],
          |                 args[batchv[i].argmask[3]],
          |                 args[batchv[i].argmask[4]],
          |                 args[batchv[i].argmask[5]]);
          |         if (args[batchv[i].argmask[6]] < 0) {
          |             break;
          |         }
          |     }
          |     return i;
         | 
          | It would return the number of successfully run syscalls,
          | stopping at the first failed one. The user would have to pick
          | the error code out of the args table.
         | 
         | I would be interested to know why it wouldn't work.
         | 
          | I started implementing it against Linux 4.15.12 but never got
          | around to testing it. I have some code, but I don't believe it
          | is the last version of the attempt.
        
         | pkghost wrote:
         | W/r/t a more descriptive name, I disagree--though until fairly
         | recently I would have agreed.
         | 
         | I would guess that the desire for something more "descriptive"
         | reflects the fact that you are not in the weeds with io_uring
         | (et al), and as such a name that's tied to specifics of the
         | terrain (io, urings) feels esoteric and unfamiliar.
         | 
         | However, to anyone who is an immediate consumer of io_uring or
         | its compatriots, "io" obviously implies "syscall", but is
         | better than "syscall", because it's more specific; since
          | io_uring doesn't do anything other than io-related syscalls
         | (there are other kinds of syscalls), naming it "syscall_" would
         | make it harder for its immediate audience to remember what it
         | does.
         | 
         | Similarly, "uring" will be familiar to most of the immediate
         | audience, and is better than "queue", because it also
         | communicates some specific features (or performance
         | characteristics? idk, I'm also not in the weeds) of the API
         | that the more generic "_queue" would not.
         | 
         | So, while I agree that the name is mildly inscrutable to us
         | distant onlookers, I think it's the right name, and indeed
         | reflects a wise pattern in naming concepts in complex systems.
         | The less ambiguity you introduce at each layer of indirection
         | or reference, the better.
         | 
         | I recently did a toy project that has some files named in a
         | similar fashion: `docker/cmd`, which is what the CMD directive
         | in my Dockerfile points at, and `systemd/start`, which is what
         | the ExecStart line of my systemd service file points at.
         | They're mildly inscrutable if you're unfamiliar with either
         | docker or systemd, as they don't really say much about what
          | they _do_, but this is a naming pattern that I can port to
         | just about any project, and at the same time stop spending
         | energy remembering a unique name for the app's entry point, or
         | the systemd script.
         | 
          | Some abstract observations:
          | 
          | - naming for grokkability-at-first-glance is at odds with
          | naming for utility-over-time; the former is necessarily more
          | ambiguous
          | 
          | - naming for utility over time seems like obviously the better
          | default naming strategy; find a nice spot in your readme for
          | onboarding metaphors and make sure the primary consumers of
          | your name don't have to work harder than necessary to make
          | sense of it
          | 
          | - if you find a name inscrutable, perhaps you're just missing
          | some context
        
           | tele_ski wrote:
            | Thanks for your detailed thoughts on this, I am definitely
            | not involved at all, just looking on from the sidelines.
            | The name does initially seem quite esoteric, but I can
            | understand why it was picked. Thinking about it more, 'io'
            | rather than 'syscall' does make sense, and Windows also
            | uses IO in IOCP.
        
           | wtallis wrote:
           | The io specificity is expected to be a temporary situation,
           | and that part of the name may end up being an anachronism in
           | a few years once a usefully large subset of another category
           | of syscalls has been added to io_uring. The shared ring
           | buffers aspect definitely is an implementation detail, but
           | one that does explain why performance is better than other
           | async IO methods (and also avoids misleading people into
           | thinking that it has something to do with the async/await
           | paradigm).
           | 
           | If the BSDs hadn't already claimed the name, it would
           | probably have been fine to call this kqueue or something like
           | that.
        
       | justsomeuser wrote:
       | If my process has async/await and uses epoll, wouldn't this have
       | about the same performance as io_uring?
       | 
       | E.g. with io_uring, the event triggers the kernel to read into a
       | buffer.
       | 
       | With async/await epoll wakes up my process which does the syscall
       | to read the file.
       | 
       | In both cases you still need to read from the device and get the
       | data to the user process?
        
         | Matthias247 wrote:
         | The answer is "go ahead and benchmark". There are some
          | theoretical advantages to uring, like having to do fewer
          | syscalls and fewer userspace <-> kernelspace transitions.
         | 
         | In practice implementation differences can offset that
         | advantage, or it just might not make any difference for the
          | application at all since it's not the hotspot.
         | 
         | I think for socket IO a variety of people did some synthetic
         | benchmarks for epoll vs uring, and got all kinds of results
         | from either one being a bit faster to both being roughly the
         | same.
        
         | Misdicorl wrote:
         | Imagine your process instead of getting woken up to make a
         | syscall gets woken up with a pointer to a filled data buffer.
        
           | justsomeuser wrote:
            | But some process (kernel or user space) still needs to spend
            | the computer's finite resources to read it?
            | 
            | Whether the work happens in the kernel or in the user
            | process, it still costs the same?
        
             | Misdicorl wrote:
             | Yes, the syscall work still happens and the raw work has
             | not changed. The overhead has dropped dramatically though
             | since you've eliminated at least 2 context switches per
              | data fill cycle (and probably more).
        
             | wtallis wrote:
             | The transition from userspace to the kernel and back takes
             | a similar amount of time to actually doing a small read
             | from the fastest SSDs, or issuing a write to most
             | relatively fast SSDs. So avoiding one extra syscall is a
             | meaningful performance improvement even if you can't always
             | eliminate a memcpy operation.
        
               | Veserv wrote:
               | That sounds unlikely. syscall hardware overhead (entry +
               | exit) on a modern x86 is only on the order of 100 ns
               | which is approximately main memory access latency. I am
               | not familiar with Linux internals or what the fastest
               | SSDs are capable of these days, but I am fairly sure that
               | for your statement to be true Linux would need to be
               | adding 1 to 2 orders of magnitude in software overhead.
               | This occurs in the context switch pathway due to
               | scheduling decisions, but it is fairly unlikely it occurs
               | in the syscall pathway unless they are doing something
               | horribly wrong.
        
               | vlovich123 wrote:
                | My understanding is that's the naive measurement of the
                | cost of just the syscall operation (i.e. measuring from
                | when the syscall is issued to when the kernel is
                | executing it). Does this actually account for the
                | performance loss from cache inefficiency? If I'm not
                | mistaken, at a minimum the CPU needs to flush various
               | caches to enter the kernel, fill them up as the kernel is
               | executing, & then repopulate them when executing back in
               | userspace. In that case (even if it's not a full flush),
               | you have a hard to measure slowdown on the code
               | processing the request in the kernel & in userspace after
               | the syscall because the locality assumptions that caches
               | rely on are invalidated. With an io_uring model, since
                | there are no context switches, temporal & spatial locality
               | should provide an outsized benefit beyond just removing
               | the syscall itself.
               | 
               | Additionally, as noted elsewhere, you can chain syscalls
               | pretty deeply so that the entire operation occurs in the
               | kernel & never schedules your process. This also benefits
               | spatial & temporal locality AND removes the cost of
               | needing to schedule the process in the first place.
        
         | dundarious wrote:
         | Not necessarily, as you can use io_uring in a zero-copy way,
         | avoiding copies of network/disk data from kernel space to user
         | space.
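          | 
          | Registered ("fixed") buffers are one piece of that: you hand
          | the kernel a buffer up front and then reference it by index,
          | and with O_DIRECT the device can DMA straight into it. Rough
          | liburing sketch, untested:
          | 
          |     #include <liburing.h>
          |     #include <sys/uio.h>
          | 
          |     static void fixed_read(struct io_uring *ring, int fd)
          |     {
          |         /* Aligned buffer, registered once up front so the
          |          * kernel doesn't have to pin and map it on every
          |          * request. */
          |         static char buf[64 * 1024]
          |             __attribute__((aligned(4096)));
          |         struct iovec iov = { .iov_base = buf,
          |                              .iov_len  = sizeof(buf) };
          |         io_uring_register_buffers(ring, &iov, 1);
          | 
          |         /* "Fixed" read referencing registered buffer 0. */
          |         struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          |         io_uring_prep_read_fixed(sqe, fd, buf,
          |                                  sizeof(buf), 0, 0);
          |         io_uring_submit(ring);
          |     }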
        
       | diegocg wrote:
       | IIRC the author of io_uring wants to support using it with any
       | system call, not just the current supported subset. Not sure how
       | these efforts are going.
        
         | swiley wrote:
         | Supporting exec sounds like it could be a security issue. I
         | also have a hard time imagining why on earth you would call
         | exec that way.
        
           | tptacek wrote:
           | Say more about this? You're probably right but I haven't
           | thought about it before and I'd be happy to have you do that
           | thinking for me. :)
        
             | touisteur wrote:
             | Start a user-provided compression background task at file
             | close? Think tcpdump with io_uring 'take this here buffer
             | you've just read and send it to disk without bothering
              | userland'.
        
           | the_duke wrote:
           | > Supporting exec sounds like it could be a security issue.
           | 
           | Why would it be any more of an issue than calling a blocking
           | exec?
           | 
           | > why on earth you would call exec that way
           | 
           | The same reason as why you would want to do anything else
           | with io_uring? In an async runtime you have to delegate
           | blocking calls to a thread pool. Much nicer if the runtime
           | can use the same execution system as for other IO.
        
             | swiley wrote:
             | >Why would it be any more of an issue than calling a
             | blocking exec?
             | 
             | Now you turn everything doing fast I/O into something
             | that's essentially an easier rwx situation WRT memory
             | corruption.
             | 
             | >Much nicer
             | 
             | In other words: the runtime should handle it because the
             | kernel doesn't need to.
        
               | the_duke wrote:
               | I really don't follow your thought process here. Can you
               | expand on your objections?
               | 
               | io_uring can really be thought of as a generic
               | asynchronous syscall interface.
               | 
               | It uses a kernel thread pool to run operations. If
               | something is blocking on a kernel-level it can just be
               | run as a blocking operation on that thread pool.
        
             | hinkley wrote:
             | If you can emulate blocking calls in user space for
             | backward compatibility, that is probably best. But I wonder
             | about process scheduling. Blocked threads don't get
             | rescheduled until the kernel unblocks them. Can you tell
             | the kernel to block a thread until an io_uring state change
             | occurs?
        
       | mikedilger wrote:
       | > io_uring is not an event system at all. io_uring is actually a
       | generic asynchronous syscall facility.
       | 
       | I would call it a message passing system, not an asynchronous
       | syscall facility. A syscall, even one that doesn't block
       | indefinitely, transfers control to the kernel. io_uring, once
        | set up, doesn't. Now that we have multiple cores, there is no
       | reason to context switch if the kernel can handle your request on
       | some other core on which it is perhaps already running.
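        | 
        | That's essentially what SQPOLL mode gives you: a kernel-side
        | thread polls the submission ring, so in the steady state
        | writing an SQE into the shared memory is all it takes, no
        | syscall. Rough, untested setup sketch with liburing (older
        | kernels want extra privileges for SQPOLL, IIRC):
        | 
        |     #include <liburing.h>
        | 
        |     /* Sketch: ask for a kernel polling thread. While that
        |      * thread is awake, putting an SQE in the shared ring is
        |      * enough to submit it; liburing only falls back to
        |      * io_uring_enter() when the thread has gone idle and
        |      * needs waking up. */
        |     static int init_sqpoll_ring(struct io_uring *ring)
        |     {
        |         struct io_uring_params p = {0};
        |         p.flags = IORING_SETUP_SQPOLL;
        |         p.sq_thread_idle = 2000;   /* ms before it naps */
        |         return io_uring_queue_init_params(256, ring, &p);
        |     }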
        
         | hinkley wrote:
         | It becomes a question of how expensive the message passing is
         | between cores versus the context switch overhead - especially
         | in a world where switching privilege levels has been made
         | expensive by multiple categories of attack against processor
         | speculation.
         | 
         | Is there a middle ground where io_uring doesn't require the
         | mitigations but other syscalls do?
        
         | volta83 wrote:
         | This is how I think about it as well.
         | 
         | Which makes me wonder if Mach wasn't right all along.
        
       | rjzzleep wrote:
       | I found this presentation to be quite informative
       | https://www.youtube.com/watch?v=-5T4Cjw46ys
        
       | vkka wrote:
       | ... _So it has the potential to make a lot of programs much
       | simpler_.
       | 
       | More efficient? Yes. Simpler? Not really. A synchronous program
        | would be simpler; everyone who has done enough of these knows it.
        
         | masklinn wrote:
         | I'm guessing they mean that programs which did not want to
         | block on syscalls and had to deploy workarounds can now just...
         | do async syscalls.
        
           | Filligree wrote:
           | Which is every program that wants to be fast, on modern
           | computers. So...
        
             | masklinn wrote:
             | Nonsense.
             | 
              | Let's say your program wants to list a directory. If it
              | has nothing to do during that time, then there is no
              | point in using an asynchronous model; that only adds cost
              | and complexity.
        
               | wtallis wrote:
               | True. But as soon as your program wants to list _two_
               | directories and knows both names ahead of time, you have
               | an opportunity to fire off both operations for the kernel
               | to work on simultaneously.
               | 
               | And even if your program doesn't have any opportunity for
               | doing IO in parallel, being able to chain a sequence of
               | IO operations together and issue them with at most one
               | syscall may still get you improved latency.
        
               | touisteur wrote:
               | Yes and even clustering often-used-together syscalls...
               | An interesting 2010 thesis
               | https://os.itec.kit.edu/deutsch/2211.php and
               | https://www2.cs.arizona.edu/~debray/Publications/multi-
               | call.... for something called 'multi-calls'
               | 
               | Interesting times.
        
               | [deleted]
        
               | ot wrote:
               | Let's say your program wants to list a directory and sort
               | by mtime, or size. You need to stat all those files,
               | which means reading a bunch of inodes. And your directory
               | is on flash, so you'll definitely want to pipeline all
               | those operations.
               | 
               | How do you do that without an async API? Thread pool and
               | synchronous syscalls? That's not simpler.
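                | 
                | Roughly: queue one statx per entry and submit the
                | whole batch at once. Untested sketch (assumes the
                | entries fit in the ring, and statx support needs a
                | 5.6+ kernel IIRC):
                | 
                |     #include <liburing.h>
                |     #include <sys/stat.h>   /* struct statx */
                | 
                |     /* Sketch: one statx SQE per name, one submit;
                |      * the flash device sees all the inode reads in
                |      * flight at once instead of one at a time. */
                |     static void stat_batch(struct io_uring *ring,
                |                            int dirfd, char **names,
                |                            struct statx *out,
                |                            unsigned n)
                |     {
                |         struct io_uring_sqe *sqe;
                |         for (unsigned i = 0; i < n; i++) {
                |             sqe = io_uring_get_sqe(ring);
                |             io_uring_prep_statx(sqe, dirfd, names[i],
                |                                 0, STATX_BASIC_STATS,
                |                                 &out[i]);
                |         }
                |         io_uring_submit_and_wait(ring, n);
                |         /* then walk the CQEs, check cqe->res, and
                |          * sort by stx_mtime or stx_size */
                |     }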
        
         | the8472 wrote:
         | You can use io_uring as a synchronous API too. Put a bunch of
         | commands on the submission queue (think of it as a submission
         | "buffer"), call io_uring_enter() with min_complete == number of
         | commands and once the syscall returns extract results from the
          | completion buffer^H^H^Hqueue. Voila, a perfectly synchronous
         | batch syscall interface.
         | 
         | You can even choose between executing them sequentially and
         | aborting on the first error or trying to complete as many as
         | possible.
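          | 
          | With liburing that's more or less io_uring_submit_and_wait()
          | (rough sketch, assuming the SQEs were already queued up with
          | io_uring_get_sqe() and the io_uring_prep_* helpers):
          | 
          |     #include <liburing.h>
          | 
          |     /* Sketch: run everything already queued on the SQ as
          |      * one synchronous batch; the single syscall returns
          |      * once 'queued' completions are available. */
          |     static void run_batch(struct io_uring *ring,
          |                           unsigned queued)
          |     {
          |         io_uring_submit_and_wait(ring, queued);
          | 
          |         struct io_uring_cqe *cqe;
          |         unsigned head, seen = 0;
          |         io_uring_for_each_cqe(ring, head, cqe) {
          |             /* inspect cqe->res per command here */
          |             seen++;
          |         }
          |         io_uring_cq_advance(ring, seen);
          |     }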
        
         | aseipp wrote:
         | My read of that paragraph was that they meant existing
         | asynchronous programs can be simplified, due to the need for
          | fewer workarounds for the Linux I/O layer (e.g. thread pools to
         | make disk operations appear asynchronous are no longer
         | necessary.) And I agree with that; asynchronous I/O had a lot
         | of pitfalls on Linux until io_uring came around, making things
         | much worse than strictly necessary.
         | 
         | In general I totally agree that a synchronous program will be
         | way simpler than an equivalent asynchronous one, though.
        
       ___________________________________________________________________
       (page generated 2021-06-17 23:00 UTC)