[HN Gopher] Linux: What Can You Epoll? ___________________________________________________________________ Linux: What Can You Epoll? Author : todsacerdoti Score : 142 points Date : 2022-10-22 16:22 UTC (6 hours ago) (HTM) web link (darkcoding.net) (TXT) w3m dump (darkcoding.net) | sylware wrote: | I wrote many of my own programs on elf/linux: I do epoll as much as I can. | | The only troubling thingy is the lack of classification of signals: those which are synchronous by nature and those which are not. For instance, in a single-threaded application, a segfault won't be delivered via epoll... | | At the same time, it is still important to keep the asynchronous API for signals for lower latency, but then only the realtime behaviour should be kept, since that is where latency really matters. | emilfihlman wrote: | Regular files not having a non-blocking mode is one of the biggest and gravest idiocies in Linux land. | | And there's one even worse: even having the concept of uninterruptible sleep (D). | bitwize wrote: | Why epoll when you can io_uring? In Rust? | karthikmurkonda wrote: | Yep | tlsalmin wrote: | Just skimmed through the article, since I'm just here to testify that the most important revelation for me on writing APIs was that you can put an epoll_fd in an epoll_fd. This allows the API to have e.g. a single epoll_fd that contains all the outbound connections, timers, signalfds and inotifys mentioned in the article. Then e.g. a daemon using the APIs can have an epoll_fd per library it is using and just sit in the epoll_wait loop, ready to fire a library_x_process() call when events arrive. | kentonv wrote: | Another use case for this: Say you have a set of "jobs" each composed of many "tasks" (each waiting for some event).
The "jobs" are able to run concurrently on different threads, but the "tasks" must not run concurrently with other tasks in the same job because they might share data structures without synchronization. | | (This is a pretty common pattern in a lot of big servers.) | | Now you want to make sure you utilize multiple cores effectively. The naive approaches are: | | 1. Create a thread per job, each waiting on its own epoll specific to the job. This may be expensive if there are many jobs, and could allow too much concurrency. | | 2. Have a single epoll and a pool of threads waiting on it. Each thread must lock a mutex for the job that owns the task it's going to run. But a thread could receive an event for a task belonging to a job that's already running on another thread, in which case it has to synchronize with that other thread somehow, which is a pain. Be careful not to create a situation where all threads are blocked on the mutex for one job while other jobs are starved. | | Epoll nesting presents a clean solution: | | 3. Create an epoll per job, plus an outer epoll that waits on the other epolls. A pool of threads waits on the outer epoll, which signals when a per-job epoll becomes ready. The thread receiving that event then takes ownership of the per-job epoll until its event queue is empty. | Matthias247 wrote: | > On Linux write to a regular file never blocks. Writing to a file copies data from our user space buffer to the kernel buffer and returns immediately. At some later point in time the kernel will send it to the disk. A regular file is hence always ready for writing and epoll wouldn't add anything. | | Is that true? If it were, the amount of data the kernel would need to buffer would be unbounded. I assumed there is a limit on the amount of buffered and not-yet-committed data, and when that is crossed the call would block until more data is flushed to disk.
| Which is kind of the same as happens for TCP sockets. The `write()` call there doesn't really send data to the peer, it just submits data to the kernel's send buffer, from where it will be asynchronously transmitted. | | Edit: Actually I will answer my own question and say I know it will block. I had deployed IO-heavy applications in the past with instrumented read/write calls for IO operations in a threadpool. Even though typical IO times are well below 1ms, under extremely high load latencies of more than 1s could be observed, which is far from "not blocking". | kentonv wrote: | Yes, file I/O can block. However, there is an assumption that file I/O will never block "indefinitely" -- unless something is severely broken, the kernel will always finish the operation in finite time, probably measured in milliseconds at most. The same is not true of network communications, where you may be waiting for an event that never happens. | | There is a temptation to say that, well, milliseconds are a long time, so wouldn't we like to do this in a non-blocking way so we can work on other stuff in the meantime? | | But... consider this: Reads and writes of memory _also_ may block. If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. And with modern SSDs that time difference is not as large as it used to be. | | So do you want to be able to access memory in a non-blocking way? Well... you can make the same logical arguments as you do with file I/O, but in practice, almost no one tries to do this. Instead, you separate work into threads, and let the CPU switch (hyper)threads whenever it needs to wait for memory. | | In fact, memory reads may very well block on disk, if you use swap! | | Given all this, it stops being so clear that async file I/O really makes sense.
| | Meanwhile, as it happens, the Linux kernel was never really designed for async file I/O in the first place. When you perform file I/O, the kernel may need to execute filesystem driver code, and it does so within the same thread that invoked the operation from userspace. That filesystem code is blocking. For the kernel to deliver true async file I/O, either all this code needs to be rewritten to be non-blocking (which would probably slow it down in most cases!), or the kernel needs to start a thread behind the scenes to perform the work. | | But... you can just as easily start a thread in userspace. So... maybe just do that? | | (Or, the modern answer: Use io_uring, which is explicitly designed to allow a userspace thread to request work performed on a separate kernel thread, and get notified of completion later.) | jeffbee wrote: | io_uring just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. The idea that you can just hand off infinite amounts of work for the kernel to do on your behalf is pretty fundamentally broken. It is a concrete implementation of wishful thinking. | tankenmate wrote: | All "work" you want to do that interfaces with anything on an OS is handed off to the kernel; want to read a file? kernel, want to sleep for a while? kernel, etc. Besides, things like network traffic are also asynchronous, like io_uring (even if the socket() interfaces make it look somewhat synchronous). Outside of toy systems, asynchronicity is always a thing, especially when running on multiple cores. | | I kind of get where you are coming from, but at the same time the kernel always gets the last say, so as long as io_uring has a good design and implementation it will always be just as good or bad as the OS as a whole. Whether run-of-the-mill programmers are up to the task of being able to properly conceptualise and use such an OS is probably not the same thing.
| jeffbee wrote: | Yeah but it's not well-designed, that's my point. It has obliviously shrugged off the tricky question of object lifetime; that's why it has already collected 16 different CVEs for things like use-after-free. Considering its short history, io_uring has already rocketed to the top of the list of dangerous kernel features. | nathants wrote: | with linux 6.0, lsm got the ability to filter io_uring. deny all and carry on. | vlovich123 wrote: | That analysis would seem smart, but let's try a game of Mad Libs: | | The Linux kernel just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | KDE just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | Firefox just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | Chrome just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | Windows just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | Photoshop just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | All CPUs just racked up another CVE, so I kinda feel that its severely under-designed nature will always haunt it. | | What's the theme? Racking up CVEs is something all software & hardware does. Mistakes can happen in design and in implementation and no one is immune. Using the presence of CVEs as an indication of immaturity / fundamental design flaw isn't helpful. In fact, it's probably the opposite: software that has no CVEs probably just means no one is paying attention to it. Sure, in a theoretical case where you've built a formal proof and translated that into a memory-safe language somehow (& you assume you've made no mistakes modelling your entire system in your proof), then maybe.
However, that encompasses 0% of all software. | | > The idea that you can just hand off infinite amounts of work for the kernel to do on your behalf is pretty fundamentally broken. It is a concrete implementation of wishful thinking | | How is that any different from a file descriptor? The kernel is free to set up limits on how much work you can have outstanding at any given time (maybe those bits are missing right now, but it doesn't feel like an intractable problem). | [deleted] | loeg wrote: | > For the kernel to deliver true async file I/O, either all this code needs to be rewritten to be non-blocking | | This is, I believe, the NT model. | abiloe wrote: | > If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. | | This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block? | | The reason memory blocks is because it needs to page in or out from secondary storage - which makes this statement "the only real difference between main memory blocking and disk blocking is the amount of time they may block." not really true | tremon wrote: | _Why do you say that reads and writes may also block?_ | | Let's define "may block" first, perhaps? What do we mean when we say "network I/O may block"? Usually, this means that the kernel may see your network request and raise you a context switch while it waits for the network response on your behalf. In your last sentence you appear to argue that the reason _why_ the kernel performs a context switch is relevant in determining if an operation "may block", and the GP is arguing that that's a distinction without a difference.
| | If the definition of "may block" is really just "the kernel may decide to context-switch away from your program", then yes, the GP's assertion that file I/O, memory I/O (mmap) and memory access (swap) are all operations that may block is correct -- the only difference is in degree: from microsecond delays for NVM-backed swap to multi-second delays for network transfers. | | Or, of course, I may have misunderstood the GP's train of thought. | [deleted] | jesboat wrote: | >> If you really think about it, the only real difference between main memory blocking and disk blocking is the amount of time they may block. | > | > This is a somewhat confusing analysis you have here. Direct read/write from memory for all intents and purposes doesn't block. Why do you say that reads and writes may also block? | | Reads and writes from actual, physical, hardware memory might block, depending on how you define "block", in the sense that some reads may miss CPU cache. But once you get to that point, you could argue that every branch might block if the branch misprediction causes a pipeline stall. This is not a useful definition of "block". | | The thing is, most programs are almost never low-level enough to be dealing with memory in that sense: they read and write _virtual_ memory. And virtual memory can block for any number of reasons, including some pretty non-obvious ones. For example: | | - the system is under memory pressure and that page is no longer in RAM because it got written to a swap file | | - the system is under memory pressure and that page is no longer in RAM because it was a read-only mapping from a file and could be purged | | -- e.g.
it's part of your executable's code | | - this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now | | - you're in a VM and the VMM can do whatever it wants | | - the page is COW from another process | kentonv wrote: | > This is not a useful definition of "block". | | I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond". | | > this is your first access to a page of anonymous virtual memory and the kernel hadn't needed to allocate a physical page until now | | And said allocation could block on all sorts of things you might not expect. Once upon a time I helped debug a problem where memory allocation would block waiting for the XFS filesystem driver to flush dirty inodes to disk. Our system generated lots of dirty inodes, and we were seeing programs randomly hang on allocation for minutes at a time. | abiloe wrote: | > I think what I'm saying is that calling file I/O "blocking" is also not a useful definition of "block". Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond". | | In addition to the point made elsewhere that you're sort of implicitly denying the magnitude of the differences here - the latency differences are on the order of 1000x. | | The other way of separating them is whether the OS (or some kind of software trap handler more generally) has to get involved. A main memory read to a non-faulting address doesn't involve the OS - i.e. it doesn't ever block. However faulting reads, calls to "disk" IO, and networking IO (i.e. just I/O in general) involving the OS/monitor/what have you are all potentially blocking operations.
| dahfizz wrote: | > Because I don't really see the fundamental difference between "we have to wait for main memory to respond" and "we have to wait for disk to respond". | | The difference, conservatively, is a factor of 1000. | | There are plenty of times in software engineering where scaling 1000x will force you to reconsider your architecture. | kentonv wrote: | > Direct read/write from memory for all intents and purposes doesn't block. | | Sure it does! Main memory is much slower than cache, so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is). But if there isn't another hyperthread ready, the CPU sits idle, wasting resources. | | It's not a form of blocking implemented by the OS scheduler, but it's pretty similar conceptually. | | > The reason memory blocks is because it needs to page in or out from secondary storage | | Nope, that's not what I was referring to (other than in the line mentioning swap). | bch wrote: | With the utmost respect, I've never heard "blocking" described as "takes some measurable amount of time" (which is how I'm reading your above statement); by that definition, async blocks to a degree too. | | You're throwing traditional blocking/non-blocking distinctions on their ear. | Volundr wrote: | Blocking in this case is referring to the CPU thread sitting idle whilst the operation is performed. This is what it means when you're blocked on a network request, blocked on a disk operation, or blocked on a memory request. It's all blocking.
| | A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening. | | EDIT: to maybe make this a little clearer for those who might not be aware: the CPU doesn't go fetch something from RAM. It puts in a request to the memory controller (handwaving modern architecture a bit here) then has to wait ~100-1000 CPU cycles before the controller gets back to it with the data. Depending on the circumstances the kernel may let that core sit idle, or it may do a context switch to another thread. The only difference between this process and, say, a network request is how many CPU cycles pass before you get your results. In the meantime the thread isn't progressing and is blocked. | bch wrote: | > A cache miss and going to RAM is usually fast enough that we as software engineers don't care about it, and in fact our programming language of choice may not even give us a way of telling the difference between a piece of data coming from a CPU register or L1 cache vs going to RAM, but that doesn't mean the blocking isn't happening. | | Yes, this is the line being discussed, and I guess (historically) I've just considered it "a cost" without dragging "blocking" into the equation. We know that _not_ accessing memory is cheaper than accessing it, and we can tune (pack our structs, mind thrashing the cache), but calling that blocking is still new to me. I'll have to consider what it means. And also, does it imply the existence of non-blocking memory (I don't think DMA is typically in a developer's toolkit, but...)? | Volundr wrote: | > And also, does it imply the existence of non-blocking memory | | Yes actually!
If you know you're going to need a block of memory before you actually need it, you can put in a request to the memory controller before you need it, then proceed to do some other work and check back in when you're ready for the data or when the memory controller signals you it's done. It's just that this kind of thing is usually the domain of compiler optimizations or hyper-optimized software like Varnish Cache rather than something your average web developer thinks about. It's again conceptually the same as an async network request, but you bother with one while considering the other just "a cost" because of the different timescales. | jmalicki wrote: | > And also, does it imply the existence of non-blocking memory | | Prefetching instructions, to tell the processor to load before you use it! | | The first google hit [0] even calls it non-blocking memory access! | | In [1] you can see some of the available prefetching instructions, and in [2] some analysis on how they deal with TLB misses (another _extremely_ expensive way memory access can be blocking short of a page fault). | | Another thing not mentioned above is that accessing a page of newly allocated memory often causes a page fault, since allocation is often delayed until use of each page, for overcommitting behavior - same for writing to memory that is copy-on-write from a fork! | | [0] https://www.sciencedirect.com/topics/computer-science/prefet.... | | [1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmpw.html | | [2] https://stackoverflow.com/a/52377359/435796 | [deleted] | abiloe wrote: | > Sure it does! Main memory is much slower than cache so on a cache miss the CPU has to stop and wait for main memory to respond. The CPU may even switch to executing some other thread in the meantime (that's what hyperthreading is). | | Cache is memory too. And which cache, by the way? Even L1 cache on modern processors doesn't have 0 latency.
And this is a rather poor way of describing hyperthreading - the CPU doesn't really "switch"; the context for the alternate process is already available, and the resource stealing can occur for any kind of stall (including cache loads), not just memory. Calling this a "switch", suggesting it is like a context switch, is very misleading. It's not similar conceptually. | | In any event, by this definition even a mispredicted branch or a divide becomes "blocking" - which sort of tortures any meaningful definition of blocking. | | The essential difference is: memory accesses to paged-in memory (and branch mispredictions, cache misses) are not something you typically or reasonably trap outside of debugging. mmaps, swaps, disk I/O, network accesses are all something delegated to an OS - and at that point are orders of magnitude more expensive than even most NUMA memory accesses. I sort of see where you're coming from - but I don't think it's a useful point. | kentonv wrote: | None of this seems to contradict my point? | | My argument is that disk I/O is more like memory I/O than it is like network I/O, and so for concurrency purposes it may make more sense to treat it like you would memory I/O (use threads) than like you would network I/O (where you'd use non-blocking APIs and event queues). | abiloe wrote: | > My argument is that disk I/O is more like memory I/O than it is like network I/O | | It depends on your network and disk - and yes, SSD and "slow" ethernet are the common case, but there is enough variation (say a relatively sluggish embedded eMMC on one end and 100 GbE for the networking case) that there's no point in making some distinction about disk vs network latency - for a general concurrency abstraction they're both slow IO and you might as well have a common abstraction like IOCP or io_uring.
| | > concurrency purposes it may make more sense to treat it | like you would memory I/O (use threads) than like you | would network I/O (where you'd use non-blocking APIs and | event queues). | | No, case in point, Windows had IOCP for years such that | you could use the same common abstraction for network and | disk. The fact that the POSIX/UNIX world was far behind | the times in getting its shit together doesn't mean much. | | And why, fundamentally, can you not use blocking APIs | with threads for networking? | p12tic wrote: | It's complicated, memory accesses can really block for | relatively long periods of time. | | Consider that regular memory access via cache takes around | 1 nanosecond. | | If the data is not in top-level cache, then we're looking | at roughly 10 nanoseconds access latency. | | If the data is not in cache at all, we are looking into | 50-150 nanoseconds access latency. | | If the data is in memory, but that memory is attached to | another CPU socket, it's even more latency. | | Finally, if the data access is via atomic instruction and | there are many other CPUs accessing the same memory | location, then the latency can be as high as 3000 | nanoseconds. | | It's not very hard to find NVMe attached storage that has | latencies of tens of microseconds, which is not very far | off memory access speeds. | eloff wrote: | I just want to add to your explanation, that even in the | absence of hard paging from disk, you can have soft page | faults where the kernel modifies the page table entries | or assigns a memory page, or copies a copy on write page, | etc. | | In addition to the cache misses you mention there's also | TLB misses. | | Memory is not actually random access, locality matters a | lot. SSDs reads, on the other hand, are much closer to | random access, but much more expensive. | caf wrote: | The term "blocking" in UNIX-like OSes is jargon with a | particular meaning. It means an interruptible wait. 
| | Disk files do not block - they may Disk Wait instead, which is an uninterruptible wait (this is what the 'D' process status stands for). Disk Wait doesn't interact with O_NONBLOCK, select(2), poll(2) etc. | | (Back in the bad old days it wasn't even possible for a Disk Waiting process to wake up to process a SIGKILL and die, which was the bane of system administrators everywhere when NFS introduced the idea of disks that could disappear when the network went down. Now it's common for OSes to make some kinds of Disk Waits at least killable). | Snild wrote: | > I assumed there is a limit on the amount of buffered and not yet committed data, and when that is crossed the call would block until more data is flushed to disk. | | There is. It's tunable through /proc/sys/vm/dirty_ratio. When there is that much write cache, application writes will start to write back synchronously. | | There is also dirty_background_ratio, which is the threshold at which writeback starts happening in the background (that is, in a kernel thread). | throwaway09223 wrote: | No - as you reasoned out, it is absolutely incorrect. Write calls to regular files will block until they are complete, unless some kind of error situation is encountered. | | This effect is often particularly pronounced with NFS, where calls might block for _hours_ or even indefinitely if the underlying network filesystem goes away. | tankenmate wrote: | Just in case anyone isn't aware, there is a mount flag called "soft" that allows the NFS client (and some other network filesystems) to timeout or be interrupted, i.e. the process won't get stuck in 'D' (device wait) state.
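(For example, an /etc/fstab entry along these lines - the server name, export path, and tunables are placeholders - lets NFS operations return an error after a few retries instead of leaving processes stuck in 'D':)

```
# soft: give up and return an error after `retrans` retries, each waiting
# `timeo` tenths of a second. (The old `intr` option is still accepted but
# has been a no-op since kernel 2.6.25.)
nfsserver:/export  /mnt/data  nfs  soft,timeo=100,retrans=3  0  0
```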
| inetknght wrote: | > _This effect is often particularly pronounced with NFS, where calls might block for hours or even indefinitely if the underlying network filesystem goes away._ | | I can't tell you how many times I've had to debug a stuck process and it turns out that the logs indicated the NFS had a hiccup a day or two ago during a file read or write and the process was never notified of a file error. It's f!@#ing frustrating. Worse, though, was CIFS. | lanstin wrote: | I routinely have to run file system scans on a giant NFS filer, and even without a hiccup, out of 100M stat or read calls, ten or so will just never finish. In Go, I have to wrap the call with a channel thing and a timeout, and hope I don't run out of threads before scanning all 400M files. | kotlin2 wrote: | The write call returns how many bytes were accepted: https://man7.org/linux/man-pages/man2/write.2.html | | > The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).) | wtallis wrote: | That doesn't answer the question. Blocking isn't a matter of how much data is written, but a matter of when the system call completes. Other parts of that man page imply that write(2) may block, unless the fd was opened with O_NONBLOCK (in which case you'll get an EAGAIN error instead of it blocking). | icedchai wrote: | "It's complicated." Generally, with a regular file, write(2) will complete as soon as the data makes it to filesystem buffers/cache. The data is _probably_ not on disk when the call completes. This depends on how the file was opened (O_FSYNC, O_DIRECT, etc.) and the underlying filesystem itself.
There are many other details at work, | like actual file system, memory pressure (there may not be | enough buffers), cache in the physical disk device or | controller, etc. So the write call itself is "blocking", | but the physical writes are (generally) not synchronous | with the call. | wtallis wrote: | Yes, whether a write blocks is really about whether the | application can do anything else while the write is | processed; whether the application is told the write is | done when it lands in a cache or when it is actually on | stable storage is a separate question. | throwaway09223 wrote: | > (in which case you'll get an EAGAIN error instead of it | blocking). | | You won't. O_NONBLOCK cannot be used with regular files. | That part of the manpage is discussing other non-socket | file types. | | Disk i/o via write(2) is always a blocking call. Always. | 100% of the time, no exceptions. | cout wrote: | It is bounded by available memory. Writes to a socket go to a | FIFO queue (the socket's write buffer), but writes to disk are | different; they go through the page cache | (https://www.kernel.org/doc/html/latest/admin- | guide/mm/concep...): | | > The physical memory is volatile and the common case for | getting data into the memory is to read it from files. Whenever | a file is read, the data is put into the page cache to avoid | expensive disk access on the subsequent reads. Similarly, when | one writes to a file, the data is placed in the page cache and | eventually gets into the backing storage device. The written | pages are marked as dirty and when Linux decides to reuse them | for other purposes, it makes sure to synchronize the file | contents on the device with the updated data. | | There are many advantages to doing it this way. One is that | multiple writes to the same page will result in a single | physical write, if the page has not yet been flushed to disk. | | There are many reasons that you might have seen a write to a | file block. 
One is that the number of dirty pages has reached the threshold (nr_dirty_threshold in /proc/vmstat). After that happens, any process doing disk IO will block. | | Another reason is memory pressure. Since all writes go through the page cache, the kernel must first allocate a page before the call to write(2) can be completed. If there are many pages in the page cache, this can take a long time (I once witnessed an old kernel bug cause all page allocations to result in kswapd attempting to reclaim pages, due to active pages being placed ahead of inactive pages in the LRU lists). | | In general, if you are writing a lot to disk but you have no intention of reading it in the near future, it is a good idea to call posix_fadvise(2) with FADV_DONTNEED to ensure the pages will be reused for something else more quickly. | lanstin wrote: | It is pretty easy to completely hork a large box with a very disk-intensive process; hit a local file system hard enough and you can get a majority of the processes into D state, uninterruptible IO disk wait. Maybe not from inside a container, haven't seen it, but definitely on a box with shared processes. Even just too much logging can harm unrelated processes that aren't even doing much with the disk. | rwmj wrote: | It's weird that (according to this document) you can epoll Unix domain sockets but not sockets created by socketpair(2). I thought socketpair created essentially two pre-connected Unix domain sockets. | kentonv wrote: | Hmm, I don't think that's what it says (unless they edited it since your post?). It mentions socketpair explicitly as something that _is_ epoll-friendly, and which you can use to communicate with another thread, in the case where you must create a thread to perform some blocking task but still want to get completion notification in the main thread via epoll. | ajross wrote: | Indeed, I am all but certain you can epoll on socketpairs.
That | sounds like a mistake in the article. | kentonv wrote: | I highly recommend that you do NOT use signalfd to get | notification of signals through epoll. Instead, block (mask) the | signal, set a signal handler, and use epoll_pwait() to atomically | unblock it while you wait for events. Note that in this setup, | your signal handler callback need not be async-signal-safe, since | you know the precise state of the calling thread: it's invoking | epoll_pwait(). This sidesteps most of the pain of using signals, | which might otherwise make you think you want signalfd. | | Two reasons not to use signalfd: | | 1. signalfd has weird semantics that don't match what you'd | normally expect from a file descriptor. When you read from a | signalfd, it tells you signals queued on the thread that called | read(), NOT the thread that created the signalfd. Worse, if you | add signalfd to an epoll, the epoll will report readiness based | on the thread that used epoll_ctl() to add the signalfd, which | may be different from the thread that is reading from the epoll. | So you might get a notification that the signalfd is ready, but | then read the signalfd and find there are no signals, and then | wait on the epoll again just to have it tell you again that this | signalfd is ready. | | 2. It turns out that signalfd's implementation has some severe | lock contention issues. I learned this through my own | experimentation recently. In my experiment, I had 5000 threads | each waiting on an epoll that included a signalfd. When | delivering a thread-specific signal to each of the 5000 threads | at once, the process spent 2+ MINUTES of CPU time spinning on | spinlocks in the kernel before completing all the event | deliveries. The time spent was O(n^2) in the number of threads. | When I switched to an epoll_pwait()-based implementation, the | same task took a few milliseconds.
| | Here's the PR where I switched KJ's event loop (used in Cap'n | Proto and Cloudflare Workers) to use epoll_pwait(): | https://github.com/capnproto/capnproto/pull/1511 | kelnos wrote: | The big downside of using a traditional signal handler is that | the only way to get your own data into the handler function is | through global variables (or thread locals). While you can | certainly make an exception just for that one thing, it feels | gross to do so. And you can also just defer processing to your | main loop by setting a flag or writing to a pipe, but those | things still need to be global variables. | | I didn't know about signalfd's limitations before reading your | post, and was happy that signalfd could eliminate the need for | global variables when doing signal handling. Shame that's not | really the case. | kentonv wrote: | In my case I use a thread_local pointer that I initialize | right before epoll_pwait and set back to null immediately | after. The pointer points to the same data structures that I | would otherwise use to handle signalfd events. Yeah it's a | little icky to use the global but I think it ends up | semantically equivalent. | [deleted] | wahern wrote: | Unfortunately, thread-local storage is not async-signal | safe. You're relying (knowingly, I presume, but others | should be warned) on implementation details. | | But, yeah, signalfd leaves much to be desired. *BSD kqueue | EVFILT_SIGNAL has much saner semantics. | kentonv wrote: | > Unfortunately, thread-local storage is not async-signal | safe. | | Doesn't matter, because the signal handler in this case | is strictly called "during" invocation of epoll_pwait, so | there's no risk of it interrupting the initialization of | a TLS object. The usual rules about async signal safety | do not need to be followed here; it's as if | epoll_wait()'s implementation made a plain old function | call to the signal handler. 
| | (Also, since we're talking about epoll, we can assume | Linux, which means we can assume ELF, which means it's | pretty easy to use thread_local in a way that requires no | initialization by allocating it in the ELF TLS section. | But yes, that's relying on implementation details I | suppose.) | | > kqueue EVFILT_SIGNAL | | Having recently implemented kqueue support in my event loop | I have to say I'm disappointed by EVFILT_SIGNAL. It does | not play well with signals that target a specific thread | (pthread_kill()) -- on FreeBSD, all threads will get the | kqueue event, while on macOS, none of them do. | Fortunately EVFILT_USER provides a reasonable alternative | for efficient inter-thread signaling. | | (I don't like using a pipe or socketpair as that involves | allocating a whole two file descriptors and a kernel | buffer, and it requires a redundant syscall on the | receiving end to read the dummy byte out of the buffer. | If you're just trying to tell another thread "hey I added | something to your work queue, please wake up and check", | that's a waste.) | kelnos wrote: | Makes sense, and is probably the "safest" you can get. | Since, as you say, you know exactly the state of everything | on that thread when you're in your handler, you can also | know that your thread local was set properly before the | epoll_pwait() call. | | It's probably code I'd want to isolate somewhere, with big | warnings so any future reader understands why it is how it | is, but I agree it's probably the safest way to do it. | FPGAhacker wrote: | You should do a write-up of item 2. | tlsalmin wrote: | I have to disagree here. Not recommending signalfd for the | mentioned use cases might be reasonable, just as reasonable as | it is to use threads for a specific use case. For a | single-threaded client/server using non-blocking FDs, signalfd | removes the risk of doing too much in the signal handler and | brings signals nicely into the event loop.
This just happens to be 99% | of the functionality I have to do. | | I'd only use more than one signalfd if each signalfd only | catches a specific signal. E.g. the main context handles SIGTERM | and a background process library handles SIGCHLD. | guenthert wrote: | Thanks for the reminder that there is no non-blocking i/o for | files residing on block devices. | yxhuvud wrote: | But there is, io_uring. | m00dy wrote: | io_uring, a magical keyword I used to use in job | interviews... | healthandsafety wrote: | Care to elaborate? | kortilla wrote: | Everyone says it's better on paper but you rarely get to | actually use it in real code. | guenthert wrote: | That is async i/o afaiu and not classic Unix non-blocking | i/o (O_NONBLOCK given to open(2)). | yxhuvud wrote: | Sure. But why does the difference matter? It is not as if | epoll is classic Unix either. | guenthert wrote: | epoll might not be, but poll is (depending on how one | would interpret 'classic'). | | Anyhow, I wrongly assumed the difference mattered with | respect to whether one could use io_uring in combination | with epoll(). It turns out, one can [1] or [2]. | | [1] https://stackoverflow.com/questions/70132802/waiting- | for-epo... | | [2] | https://unixism.net/loti/tutorial/register_eventfd.html | yxhuvud wrote: | Having done my own share of uring bindings I wish I had | found workplaces that appreciated that. | bfrog wrote: | why epoll at all, the new hotness is io_uring, fire away your | iovecs, check back later | rwmj wrote: | You can go from select/poll to epoll relatively easily, but | I've found that to use io_uring you have to substantially | rearchitect your whole program (if you want any performance | benefit). | | Actually I'd love to be wrong about this, but I've not found a | way to easily retrofit io_uring into programs/libraries that | are already using either synchronous operations or poll(2). | jasonzemos wrote: | io_uring is basically a drop-in for epoll.
It has an | intrinsic performance benefit because multiple operations can | be both submitted and completed in a single action. | Rearchitecting is only needed when going further by | replacing standalone syscalls with io_uring operations. In | the case of poll(2) I believe it should be no more difficult | than refactoring for epoll. | wahern wrote: | With io_uring, _every_ line in an application that calls | read/recv needs to be refactored, along with much of the | surrounding context. io_uring doesn't replace poll/epoll, | it effectively replaces typical event loop frameworks. You | can integrate io_uring into pre-existing event loop | frameworks, but the event loop framework will end up as a | 99% superfluous wrapper, at least on Linux. | | Note that many applications don't use event loop | frameworks. For simple applications they can be overkill. | Even for more complex applications, it may be cleaner to | use restartable semantics (i.e. same semantics as read -- | just call me again), especially for libraries or components | that want to be event loop agnostic. | gavinray wrote: | You can use userspace coroutine/fiber implementations to | wire async io_uring into existing synchronous code and | maintain the facade of the code still being synchronous. | | How easy/feasible this is depends on the language. | | In C++, Rust, Zig, Java (Loom fibers), and Kotlin I know for | a fact it's doable. | | For other languages I'm not sure what the experience is like. | drpixie wrote: | Does anyone feel that the Linux API (and so the kernel) is | slowly getting more and more complex and cumbersome? ___________________________________________________________________ (page generated 2022-10-22 23:00 UTC)