[HN Gopher] Linux: What Can You Epoll?
       ___________________________________________________________________
        
       Linux: What Can You Epoll?
        
       Author : todsacerdoti
       Score  : 142 points
       Date   : 2022-10-22 16:22 UTC (6 hours ago)
        
 (HTM) web link (darkcoding.net)
 (TXT) w3m dump (darkcoding.net)
        
       | sylware wrote:
        | I wrote many of my own programs on elf/linux: I use epoll as
        | much as I can.
        | 
        | The only troubling thing is the lack of classification of
        | signals into those that are synchronous by nature and the rest.
        | For instance, in a single-threaded application, a segfault won't
        | be delivered via epoll...
        | 
        | At the same time, it is still worth keeping the asynchronous API
        | for signals for lower latency, but then only the realtime
        | behaviour should be kept, since that is really where latency
        | matters.
        
       | emilfihlman wrote:
        | Regular files not having a non-blocking mode is one of the
        | biggest and gravest idiocies in Linux land.
        | 
        | And there's one even worse: the very concept of uninterruptible
        | sleep (D).
        
       | bitwize wrote:
       | Why epoll when you can io_uring? In Rust?
        
       | karthikmurkonda wrote:
       | Yep
        
       | tlsalmin wrote:
        | Just skimmed through the article, since I'm just here to testify
        | that the most important revelation for me on writing APIs was
        | that you can put an epoll_fd in an epoll_fd. This allows the API
        | to have e.g. a single epoll_fd that contains all the outbound
        | connections, timers, signalfds and inotifys mentioned in the
        | article. Then e.g. a daemon using the APIs can have an epoll_fd
        | per library it is using and just sit in the epoll_wait loop,
        | ready to fire a library_x_process() call when events arrive.
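        | 
        | A minimal sketch of the nesting idea (needs <sys/epoll.h>;
        | lib_epfd, main_epfd and library_x_process() are placeholders,
        | error handling omitted):
        | 
        |     /* library side: everything the library waits on lives
        |        behind one fd */
        |     int lib_epfd = epoll_create1(EPOLL_CLOEXEC);
        |     /* epoll_ctl(lib_epfd, EPOLL_CTL_ADD, sock/timerfd/..., ...) */
        | 
        |     /* daemon side: nest the library's epoll fd in the main
        |        loop's epoll */
        |     int main_epfd = epoll_create1(EPOLL_CLOEXEC);
        |     struct epoll_event ev = { .events = EPOLLIN,
        |                               .data.fd = lib_epfd };
        |     epoll_ctl(main_epfd, EPOLL_CTL_ADD, lib_epfd, &ev);
        | 
        |     /* main loop: when lib_epfd is reported ready, hand control
        |        to the library, which drains its own epoll_fd */
        |     struct epoll_event out[16];
        |     int n = epoll_wait(main_epfd, out, 16, -1);
        |     for (int i = 0; i < n; i++)
        |         if (out[i].data.fd == lib_epfd)
        |             library_x_process();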
        
         | kentonv wrote:
         | Another use case for this: Say you have a set of "jobs" each
         | composed of many "tasks" (each waiting for some event). The
         | "jobs" are able to run concurrently on different threads, but
         | the "tasks" must not run concurrently with other tasks in the
         | same job because they might share data structures without
         | synchronization.
         | 
         | (This is a pretty common pattern in a lot of big servers.)
         | 
         | Now you want to make sure you utilize multiple cores
         | effectively. The naive approaches are:
         | 
         | 1. Create a thread per job, each waiting on its own epoll
         | specific to the job. This may be expensive if there are many
         | jobs, and could allow too much concurrency.
         | 
         | 2. Have a single epoll and a pool of threads waiting on it.
         | Each thread must lock a mutex for the job that owns the task
         | it's going to run. But a thread could receive an event for a
         | task belonging to a job that's already running on another
         | thread, in which case it has to synchronize with that other
         | thread somehow, which is a pain. Be careful not to create a
         | situation where all threads are blocked on the mutex for one
         | job while other jobs are starved.
         | 
         | Epoll nesting presents a clean solution:
         | 
         | 3. Create an epoll per job, plus an outer epoll that waits on
         | other epolls. A pool of threads waits on the outer epoll, which
         | signals when a per-job epoll becomes ready. The thread
         | receiving that event then takes ownership of the per-job epoll
         | until the event queue is empty.
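          | 
          | A rough sketch of option 3 (struct job, run_task(), outer_epfd
          | and j are placeholders; using EPOLLONESHOT on the outer epoll
          | is one possible way to keep the "takes ownership" step
          | exclusive, not necessarily the only one):
          | 
          |     /* register each job's epoll fd in the outer epoll */
          |     struct epoll_event reg;
          |     reg.events = EPOLLIN | EPOLLONESHOT;
          |     reg.data.ptr = j;                  /* j is a struct job* */
          |     epoll_ctl(outer_epfd, EPOLL_CTL_ADD, j->epfd, &reg);
          | 
          |     /* each worker thread runs this loop */
          |     for (;;) {
          |         struct epoll_event got, task_ev;
          |         if (epoll_wait(outer_epfd, &got, 1, -1) != 1)
          |             continue;
          |         struct job *job = got.data.ptr;
          | 
          |         /* this thread now owns job->epfd; drain it with a
          |            zero timeout so tasks of one job never run
          |            concurrently */
          |         while (epoll_wait(job->epfd, &task_ev, 1, 0) == 1)
          |             run_task(task_ev.data.ptr);
          | 
          |         /* re-arm the job in the outer epoll */
          |         struct epoll_event rearm;
          |         rearm.events = EPOLLIN | EPOLLONESHOT;
          |         rearm.data.ptr = job;
          |         epoll_ctl(outer_epfd, EPOLL_CTL_MOD, job->epfd, &rearm);
          |     }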
        
       | Matthias247 wrote:
       | > On Linux write to a regular file never blocks. Writing to a
       | file copies data from our user space buffer to the kernel buffer
       | and returns immediately. At some later point in time the kernel
       | will send it to the disk. A regular file is hence always ready
       | for writing and epoll wouldn't add anything.
       | 
        | Is that true? If it were, the amount of data the kernel would
        | need to buffer would be unbounded. I assumed there is a limit on
        | the amount of buffered and not-yet-committed data, and when that
        | is crossed the call would block until more data is flushed to
        | disk. Which is kind of the same as what happens for TCP sockets.
        | The `write()` call there doesn't really send data to the peer,
        | it just submits data to the kernel's send buffer, from where it
        | will be asynchronously transmitted.
       | 
       | Edit: Actually I will answer my own question and say I know it
       | will block. I had deployed IO heavy applications in the past with
       | instrumented read/write calls for IO operations in a threadpool.
       | Even though typical IO times are well below 1ms, under extremely
       | high load latencies of more than 1s could be observed, which is
       | far from "not blocking".
        
         | kentonv wrote:
         | Yes, file I/O can block. However, there is an assumption that
         | file I/O will never block "indefinitely" -- unless something is
         | severely broken, the kernel will always finish the operation in
          | finite time, probably measured in milliseconds at most. The same
         | is not true of network communications, where you may be waiting
         | for an event that never happens.
         | 
         | There is a temptation to say that, well, milliseconds are a
         | long time, so wouldn't we like to do this in a non-blocking way
         | so we can work on other stuff in the meantime?
         | 
         | But... consider this: Reads and writes of memory _also_ may
         | block. If you really think about it, the only real difference
         | between main memory blocking and disk blocking is the amount of
         | time they may block. And with modern SSDs that time difference
         | is not as large as it used to be.
         | 
         | So do you want to be able to access memory in a non-blocking
         | way? Well... you can make the same logical arguments as you do
         | with file I/O, but in practice, almost no one tries to do this.
         | Instead, you separate work into threads, and let the CPU switch
         | (hyper)threads whenever it needs to wait for memory.
         | 
         | In fact, memory reads may very well block on disk, if you use
         | swap!
         | 
         | Given all this, it stops being so clear that async file I/O
         | really makes sense.
         | 
         | Meanwhile, as it happens, the Linux kernel was never really
         | designed for async file I/O in the first place. When you
         | perform file I/O, the kernel may need to execute filesystem
         | driver code, and it does so within the same thread that invoked
         | the operation from userspace. That filesystem code is blocking.
         | For the kernel to deliver true async file I/O, either all this
         | code needs to be rewritten to be non-blocking (which would
         | probably slow it down in most cases!), or the kernel needs to
         | start a thread behind the scenes to perform the work.
         | 
         | But... you can just as easily start a thread in userspace.
         | So... maybe just do that?
         | 
         | (Or, the modern answer: Use io_uring, which is explicitly
         | designed to allow a userspace thread to request work performed
         | on a separate kernel thread, and get notified of completion
         | later.)
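          | 
          | For reference, a minimal liburing sketch of that last approach
          | (assuming liburing is installed; error handling omitted):
          | 
          |     #include <liburing.h>
          |     #include <fcntl.h>
          |     #include <stdio.h>
          |     #include <unistd.h>
          | 
          |     int main(void) {
          |         struct io_uring ring;
          |         io_uring_queue_init(8, &ring, 0);
          | 
          |         int fd = open("/etc/hostname", O_RDONLY);
          |         char buf[256];
          | 
          |         /* submit an async read; the kernel does the blocking
          |            filesystem work on our behalf */
          |         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          |         io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
          |         io_uring_submit(&ring);
          | 
          |         /* ...do other work here... */
          | 
          |         /* reap the completion whenever we're ready */
          |         struct io_uring_cqe *cqe;
          |         io_uring_wait_cqe(&ring, &cqe);
          |         printf("read %d bytes\n", cqe->res);
          |         io_uring_cqe_seen(&ring, cqe);
          | 
          |         close(fd);
          |         io_uring_queue_exit(&ring);
          |         return 0;
          |     }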
        
           | jeffbee wrote:
           | io_uring just racked up another CVE, so I kinda feel that its
           | severely under-designed nature will always haunt it. The idea
           | that you can just hand off infinite amounts of work for the
           | kernel to do on your behalf is pretty fundamentally broken.
           | It is a concrete implementation of wishful thinking.
        
             | tankenmate wrote:
             | All "work" you want to do that interfaces with anything on
              | an OS is handed off to the kernel; want to read a file?
              | kernel. Want to sleep for a while? kernel, etc. Besides,
              | things like network traffic are also asynchronous, much
              | like io_uring (even if the socket() interfaces make them
              | look somewhat synchronous). Outside of toy systems,
              | asynchronicity is always a thing, especially when running
              | on multiple cores.
             | 
             | I kind of get where you are coming from but at the same
             | time, the kernel always gets the last say, so as long as
             | io_uring has a good design and implementation it will
             | always be just as good or bad as the OS as a whole. Whether
              | run-of-the-mill programmers are up to the task of being
             | able to properly conceptualise and use such an OS is
             | probably not the same thing.
        
               | jeffbee wrote:
               | Yeah but it's not well-designed, that's my point. It has
               | obliviously shrugged off the tricky question of object
               | lifetime, that's why it has already collected 16
               | different CVEs for things like use-after-free.
               | Considering its short history, io_uring has already
               | rocketed to the top of the list of dangerous kernel
               | features.
        
               | nathants wrote:
               | with linux 6.0, lsm got the ability to filter io_uring.
               | deny all and carry on.
        
             | vlovich123 wrote:
             | That analysis would seem smart but let's try a game of Mad
             | Libs:
             | 
             | The Linux Kernel just racked up another CVE, so I kinda
             | feel that its severely under-designed nature will always
             | haunt it.
             | 
             | KDE just racked up another CVE, so I kinda feel that its
             | severely under-designed nature will always haunt it.
             | 
             | Firefox just racked up another CVE, so I kinda feel that
             | its severely under-designed nature will always haunt it.
             | 
             | Chrome just racked up another CVE, so I kinda feel that its
             | severely under-designed nature will always haunt it.
             | 
             | Windows just racked up another CVE, so I kinda feel that
             | its severely under-designed nature will always haunt it.
             | 
             | Photoshop just racked up another CVE, so I kinda feel that
             | its severely under-designed nature will always haunt it.
             | 
              | All CPUs just racked up another CVE, so I kinda feel that
             | its severely under-designed nature will always haunt it.
             | 
             | What's the theme? Racking up CVEs is something all software
             | & hardware does. Mistakes can happen in design and in
             | implementation and no one is immune. Using presence of CVEs
             | as an indication of immaturity / fundamental design flaw
             | isn't helpful. In fact, it's probably the opposite.
             | Software that has no CVEs probably just means no one is
             | paying attention to it. Sure, in a theoretical case maybe
             | you've built a formal proof and translated that into a
             | memory safe language somehow (& you assume you've made no
             | mistakes modelling your entire system in your proof), then
             | maybe. However, that encompasses 0% of all software.
             | 
             | > The idea that you can just hand off infinite amounts of
             | work for the kernel to do on your behalf is pretty
             | fundamentally broken. It is a concrete implementation of
             | wishful thinking
             | 
             | How is that any different from a file descriptor? The
              | kernel is free to set up limits on how much work you can
              | have outstanding at any given time (maybe those bits are
              | missing right now, but it doesn't feel like an intractable
              | problem).
        
           | [deleted]
        
           | loeg wrote:
           | > For the kernel to deliver true async file I/O, either all
           | this code needs to be rewritten to be non-blocking
           | 
           | This is, I believe, the NT model.
        
           | abiloe wrote:
           | > If you really think about it, the only real difference
           | between main memory blocking and disk blocking is the amount
           | of time they may block.
           | 
           | This is a somewhat confusing analysis you have here. Direct
           | read/write from memory for all intents and purposes doesn't
           | block. Why do you say that reads and writes may also block?
           | 
           | The reason memory blocks is because it needs to page in or
           | out from secondary storage - which makes this statement "the
           | only real difference between main memory blocking and disk
           | blocking is the amount of time they may block." not really
           | true
        
             | tremon wrote:
             | _Why do you say that reads and writes may also block?_
             | 
             | Let's define "may block" first, perhaps? What do we mean
             | when we say "network I/O may block"? Usually, this means
             | that the kernel may see your network request and raise you
             | a context switch while it waits for the network response on
             | your behalf. In your last sentence you appear to argue that
             | the reason _why_ the kernel performs a context switch is
             | relevant in determining if an operation  "may block", and
             | the GP is arguing that that's a distinction without a
             | difference.
             | 
             | If the definition of "may block" is really just "the kernel
             | may decide to context-switch away from your program", then
             | yes, the GP's assertion that file I/O, memory I/O (mmap)
             | and memory access (swap) are all operations that may block
             | is correct -- the only difference is in degree: from
             | microsecond delays for nvm-backed swap to multi-second
             | delays for network transfers.
             | 
             | Or, of course, I may have misunderstood the GP's train of
             | thought.
        
               | [deleted]
        
             | jesboat wrote:
              | >> If you really think about it, the only real difference
              | between main memory blocking and disk blocking is the
              | amount of time they may block.
              | 
              | > This is a somewhat confusing analysis you have here.
              | Direct read/write from memory for all intents and purposes
              | doesn't block. Why do you say that reads and writes may
              | also block?
             | 
             | Reads and writes from actual, physical, hardware memory
             | might block, depending on how you define "block", in the
             | sense that some reads may miss CPU cache. But once you get
             | to that point, you could argue that every branch might
             | block if the branch misprediction causes a pipeline stall.
             | This is not a useful definition of "block".
             | 
             | The thing is, most programs are almost never low-level
             | enough to be dealing with memory in that sense: they read
             | and write _virtual_ memory. And virtual memory can block
              | for any number of reasons, including some pretty
              | non-obvious ones. For example:
             | 
             | - the system is under memory pressure and that page is no
             | longer in RAM because it got written to a swap file
             | 
             | - the system is under memory pressure and that page is no
             | longer in RAM because it was a read-only mapping from a
             | file and could be purged
             | 
             | -- e.g. it's part of your executable's code
             | 
             | - this is your first access to a page of anonymous virtual
             | memory and the kernel hadn't needed to allocate a physical
             | page until now
             | 
             | - you're in a VM and the VMM can do whatever it wants
             | 
             | - the page is COW from another process
        
               | kentonv wrote:
               | > This is not a useful definition of "block".
               | 
               | I think what I'm saying is that calling file I/O
               | "blocking" is also not a useful definition of "block".
               | Because I don't really see the fundamental difference
               | between "we have to wait for main memory to respond" and
               | "we have to wait for disk to respond".
               | 
               | > this is your first access to a page of anonymous
               | virtual memory and the kernel hadn't needed to allocate a
               | physical page until now
               | 
               | And said allocation could block on all sorts of things
               | you might not expect. Once upon a time I helped debug a
               | problem where memory allocation would block waiting for
               | the XFS filesystem driver to flush dirty inodes to disk.
               | Our system generated lots of dirty inodes, and we were
               | seeing programs randomly hang on allocation for minutes
               | at a time.
        
               | abiloe wrote:
               | > I think what I'm saying is that calling file I/O
               | "blocking" is also not a useful definition of "block".
               | Because I don't really see the fundamental difference
               | between "we have to wait for main memory to respond" and
               | "we have to wait for disk to respond".
               | 
                | In addition to the point made elsewhere that you're sort
                | of implicitly denying the magnitude of the differences
                | here - the latency differences are on the order of 1000x.
                | 
                | The other way of separating them is whether the OS (or
                | some kind of software trap handler more generally) has to
                | get involved. A main memory read to a non-faulting
                | address doesn't involve the OS - i.e. it doesn't ever
                | block. However, faulting reads, calls to "disk" IO, and
                | networking IO (i.e. just I/O in general) involving the
                | OS/monitor/what have you are all potentially blocking
                | operations.
        
               | dahfizz wrote:
               | > Because I don't really see the fundamental difference
               | between "we have to wait for main memory to respond" and
               | "we have to wait for disk to respond".
               | 
               | The difference, conservatively, is a factor of 1000.
               | 
               | There are plenty of times in software engineering where
               | scaling 1000x will force you to reconsider your
               | architecture.
        
             | kentonv wrote:
             | > Direct read/write from memory for all intents and
             | purposes doesn't block.
             | 
             | Sure it does! Main memory is much slower than cache so on a
             | cache miss the CPU has to stop and wait for main memory to
             | respond. The CPU may even switch to executing some other
             | thread in the meantime (that's what hyperthreading is). But
             | if there isn't another hyperthread ready, the CPU sits
             | idle, wasting resources.
             | 
             | It's not a form of blocking implemented by the OS
             | scheduler, but it's pretty similar conceptually.
             | 
             | > The reason memory blocks is because it needs to page in
             | or out from secondary storage
             | 
             | Nope, that's not what I was referring to (other than in the
             | line mentioning swap).
        
               | bch wrote:
               | With the utmost respect, I've never heard "blocking"
               | described as "takes some measurable amount of time"
               | (which is how I'm reading your above statement); by that
               | definition, async blocks to a degree too.
               | 
               | You're throwing traditional blocking/non-blocking
               | distinctions on their ear.
        
               | Volundr wrote:
               | Blocking in this case is referring to the CPU thread
               | sitting idle whilst the operation is performed. This is
               | what it means when your blocked on a network request,
               | blocked on a disk operation, or blocked on a memory
               | request. It's all blocking.
               | 
               | A cache miss and going to RAM is usually fast enough that
               | we as software engineers don't care about it, and in fact
               | our programming language of choice may not even give us a
               | way of telling the difference between a piece of data
               | coming from a CPU register or L1 cache vs going to RAM,
               | but that doesn't mean the blocking isn't happening.
               | 
               | EDIT: to maybe make this a little clearer for those who
               | might not be aware the CPU doesn't go fetch something
               | from RAM. It puts in a request to the memory controller
               | (handwaving modern architecture a bit here) then has to
               | wait ~100-1000 CPU cycles before the controller gets back
               | to it with the data. Depending on the circumstances the
               | kernel may let that core sit idle, or it may do a context
               | switch to another thread. The only difference between
               | this process and say a network request is how many CPU
               | cycles before you get your results. In the meantime the
               | thread isn't progressing and is blocked.
        
               | bch wrote:
               | > A cache miss and going to RAM is usually fast enough
               | that we as software engineers don't care about it, and in
               | fact our programming language of choice may not even give
               | us a way of telling the difference between a piece of
               | data coming from a CPU register or L1 cache vs going to
               | RAM, but that doesn't mean the blocking isn't happening.
               | 
               | Yes, this is the line being discussed, and I guess
                | (historically) I've just considered it "a cost" without
               | dragging "blocking" into the equation. We know that _not_
               | accessing memory is cheaper than accessing it, and we can
               | tune (pack our structs, mind thrashing the cache), but
               | calling that blocking is still new to me. I'll have to
               | consider what it means. And also, does it imply the
               | existence of non-blocking memory (I don't think DMA is
                | typically in a developer's toolkit, but...)?
        
               | Volundr wrote:
               | > And also, does it imply the existence of non-blocking
               | memory
               | 
                | Yes actually! If you know you're going to need a block of
                | memory before you actually need it, you can put in a
                | request to the memory controller ahead of time, then
                | proceed to do some other work and check back in when
                | you're ready for the data or when the memory controller
                | signals that it's done. It's just that this kind of thing
                | is usually the domain of compiler optimizations or hyper-
                | optimized software like Varnish cache rather than
                | something your average web developer thinks about. It's
                | again conceptually the same as an async network request,
                | but you bother with one while treating the other as just
                | "a cost" because of the different timescales.
        
               | jmalicki wrote:
               | > And also, does it imply the existence of non-blocking
               | memory
               | 
               | Prefetching instructions, to tell the processor to load
               | before you use it!
               | 
               | The first google hit [0] even calls it non-blocking
               | memory access!
               | 
               | In [1] you can see some of the available prefetching
               | instructions, and in [2] some analysis on how they deal
               | with TLB misses (another _extremely_ expensive way memory
               | access can be blocking short of a page fault).
               | 
               | Another thing not mentioned above is that accessing a
               | page of newly allocated memory often causes a page fault,
               | since allocation is often delayed until use of each page,
               | for overcommitting behavior - same for writing to memory
               | that is copy-on-write from a fork!
               | 
               | [0] https://www.sciencedirect.com/topics/computer-
               | science/prefet....
               | 
               | [1] https://docs.oracle.com/cd/E36784_01/html/E36859/epmp
               | w.html
               | 
               | [2] https://stackoverflow.com/a/52377359/435796
        
               | [deleted]
        
               | abiloe wrote:
               | > Sure it does! Main memory is much slower than cache so
               | on a cache miss the CPU has to stop and wait for main
               | memory to respond. The CPU may even switch to executing
               | some other thread in the meantime (that's what
               | hyperthreading is).
               | 
               | Cache is a memory. And which cache, by the way? Even L1
               | cache on modern processors doesn't have 0 latency. And
               | this is a rather poor way of describing hyperthreading -
               | the CPU doesn't really "switch" - the context for the
               | alternate process is already available and the resource
               | stealing can occur for any kind of stall (including cache
                | loads), not just memory. Calling this a "switch",
                | suggesting it is like a context switch, is very
                | misleading. It's not similar conceptually.
               | 
               | In any event, by this definition even a mispredicted
               | branch or a divide becomes "blocking" - which sort of
               | tortures any meaningful definition of blocking.
               | 
               | The essential difference is - memory accesses to paged in
               | memory (and branch mispredictions, cache misses) are not
               | something you typically or reasonably trap outside of
               | debugging. mmaps, swaps, disk I/O, network accesses are
               | all something delegated to an OS - and at that point are
               | orders of magnitude more expensive than even most NUMA
               | memory accesses. I sort of see where you're coming from -
               | but I don't think it's a useful point.
        
               | kentonv wrote:
               | None of this seems to contradict my point?
               | 
               | My argument is that disk I/O is more like memory I/O than
               | it is like network I/O, and so for concurrency purposes
               | it may make more sense to treat it like you would memory
               | I/O (use threads) than like you would network I/O (where
               | you'd use non-blocking APIs and event queues).
        
               | abiloe wrote:
               | > My argument is that disk I/O is more like memory I/O
               | than it is like network I/O
               | 
               | It depends on your network and disk - and yes SSD and
               | "slow" ethernet are the common case, but there is enough
                | variation (say a relatively sluggish embedded eMMC on
               | one end and 100 GbE for the networking case), that
               | there's no point in making some distinction about disk vs
               | network latency - for a general concurrency abstraction
               | they're both slow IO and you might as well have a common
               | abstraction like IOCP or io_uring.
               | 
               | > concurrency purposes it may make more sense to treat it
               | like you would memory I/O (use threads) than like you
               | would network I/O (where you'd use non-blocking APIs and
               | event queues).
               | 
               | No, case in point, Windows had IOCP for years such that
               | you could use the same common abstraction for network and
               | disk. The fact that the POSIX/UNIX world was far behind
               | the times in getting its shit together doesn't mean much.
               | 
               | And why, fundamentally, can you not use blocking APIs
               | with threads for networking?
        
             | p12tic wrote:
             | It's complicated, memory accesses can really block for
             | relatively long periods of time.
             | 
             | Consider that regular memory access via cache takes around
             | 1 nanosecond.
             | 
             | If the data is not in top-level cache, then we're looking
             | at roughly 10 nanoseconds access latency.
             | 
              | If the data is not in cache at all, we are looking at
              | 50-150 nanoseconds of access latency.
             | 
             | If the data is in memory, but that memory is attached to
             | another CPU socket, it's even more latency.
             | 
             | Finally, if the data access is via atomic instruction and
             | there are many other CPUs accessing the same memory
             | location, then the latency can be as high as 3000
             | nanoseconds.
             | 
             | It's not very hard to find NVMe attached storage that has
             | latencies of tens of microseconds, which is not very far
             | off memory access speeds.
        
               | eloff wrote:
               | I just want to add to your explanation, that even in the
               | absence of hard paging from disk, you can have soft page
               | faults where the kernel modifies the page table entries
               | or assigns a memory page, or copies a copy on write page,
               | etc.
               | 
               | In addition to the cache misses you mention there's also
               | TLB misses.
               | 
               | Memory is not actually random access, locality matters a
               | lot. SSDs reads, on the other hand, are much closer to
               | random access, but much more expensive.
        
         | caf wrote:
         | The term "blocking" in UNIX-like OSes is jargon with a
         | particular meaning. It means an interruptible wait.
         | 
         | Disk files do not block - they may Disk Wait instead, which is
         | an uninterruptible wait (this is what the 'D' process status
         | stands for). Disk Wait doesn't interact with O_NONBLOCK,
          | select(2), poll(2) etc.
         | 
         | (Back in the bad old days it wasn't even possible for a Disk
         | Waiting process to wake up to process a SIGKILL and die, which
         | was the bane of system administrators everywhere when NFS
         | introduced the idea of disks that could disappear when the
         | network went down. Now it's common for OSes to make some kinds
         | of Disk Waits at least killable).
        
         | Snild wrote:
         | > I assumed there is a limit on the amount of buffered and not
         | yet committed data, and when that is cross the call would block
         | until more data is flushed to disk.
         | 
         | There is. It's tunable through /proc/sys/vm/dirty_ratio. When
         | there is that much write cache, application writes will start
          | to write back synchronously.
         | 
         | There is also dirty_background_ratio, which is the threshold at
         | which writeback starts happening in the background (that is, in
         | a kernel thread).
        
         | throwaway09223 wrote:
          | No, as you reasoned out, it is absolutely incorrect. Write calls
         | to regular files will block until they are complete, unless
         | some kind of error situation is encountered.
         | 
         | This effect is often particularly pronounced with NFS, where
         | calls might block for _hours_ or even indefinitely if the
         | underlying network filesystem goes away.
        
           | tankenmate wrote:
           | Just in case anyone isn't aware there is a mount flag called
           | "soft" that allows the NFS client (and some other network
            | filesystems) to time out or be interrupted, i.e. the process
           | won't get stuck in 'D' (device wait) state.
        
           | inetknght wrote:
           | > _This effect is often particularly pronounced with NFS,
           | where calls might block for hours or even indefinitely if the
           | underlying network filesystem goes away._
           | 
           | I can't tell you how many times I've had to debug a stuck
           | process and it turns out that the logs indicated the NFS had
           | a hiccup a day or two ago during a file read or write and the
           | process was never notified of a file error. It's f!@#ing
           | frustrating. Worse, though, was CIFS.
        
             | lanstin wrote:
             | I routinely have to run file system scans on a giant NFS
             | filer, and even without a hiccup, out of a 100M stat or
             | read calls, ten or so will just never finish. In Go, I have
             | to wrap the call with a channel thing and a time out and
             | hope I don't run out of threads before scanning all 400 M
             | files.
        
         | kotlin2 wrote:
         | The write call returns how many bytes were accepted:
         | https://man7.org/linux/man-pages/man2/write.2.html
         | 
         | > The number of bytes written may be less than count if, for
         | example, there is insufficient space on the underlying physical
         | medium, or the RLIMIT_FSIZE resource limit is encountered (see
         | setrlimit(2)), or the call was interrupted by a signal handler
         | after having written less than count bytes. (See also pipe(7).)
        
           | wtallis wrote:
           | That doesn't answer the question. Blocking isn't a matter of
           | how much data is written, but a matter of when the system
           | call completes. Other parts of that man page imply that
           | write(2) may block, unless the fd was opened with O_NONBLOCK
           | (in which case you'll get an EAGAIN error instead of it
           | blocking).
        
             | icedchai wrote:
             | "It's complicated." Generally, with a regular file,
              | write(2) will complete as soon as the data makes it to
             | filesystem buffers/cache. The data is _probably_ not on
             | disk when the call completes. This depends on how the file
             | was opened (O_FSYNC, O_DIRECT, etc.) and the underlying
             | filesystem itself. There are many other details at work,
             | like actual file system, memory pressure (there may not be
             | enough buffers), cache in the physical disk device or
             | controller, etc. So the write call itself is  "blocking",
             | but the physical writes are (generally) not synchronous
             | with the call.
        
               | wtallis wrote:
               | Yes, whether a write blocks is really about whether the
               | application can do anything else while the write is
               | processed; whether the application is told the write is
               | done when it lands in a cache or when it is actually on
               | stable storage is a separate question.
        
             | throwaway09223 wrote:
             | > (in which case you'll get an EAGAIN error instead of it
             | blocking).
             | 
             | You won't. O_NONBLOCK cannot be used with regular files.
             | That part of the manpage is discussing other non-socket
             | file types.
             | 
             | Disk i/o via write(2) is always a blocking call. Always.
             | 100% of the time, no exceptions.
        
         | cout wrote:
         | It is bounded by available memory. Writes to a socket go to a
         | FIFO queue (the socket's write buffer), but writes to disk are
         | different; they go through the page cache
         | (https://www.kernel.org/doc/html/latest/admin-
         | guide/mm/concep...):
         | 
         | > The physical memory is volatile and the common case for
         | getting data into the memory is to read it from files. Whenever
         | a file is read, the data is put into the page cache to avoid
         | expensive disk access on the subsequent reads. Similarly, when
         | one writes to a file, the data is placed in the page cache and
         | eventually gets into the backing storage device. The written
         | pages are marked as dirty and when Linux decides to reuse them
         | for other purposes, it makes sure to synchronize the file
         | contents on the device with the updated data.
         | 
         | There are many advantages to doing it this way. One is that
         | multiple writes to the same page will result in a single
         | physical write, if the page has not yet been flushed to disk.
         | 
         | There are many reasons that you might have seen a write to a
         | file block. One is that the number of dirty pages has reached
         | the threshold (nr_dirty_threshold in /proc/vmstat). After that
         | happens, any process doing disk IO will block.
         | 
         | Another reason is memory pressure. Since all writes go through
         | the page cache, the kernel must first allocate a page before
         | the call to write(2) can be completed. If there are many pages
         | in the page cache, this can take a long time (I once witnessed
         | an old kernel bug cause all page allocations to result in
         | kswapd attempting to reclaim pages, due to active pages being
         | placed ahead of inactive pages in the LRU lists).
         | 
         | In general, if you are writing a lot to disk but you have no
         | intention of reading it in the near future, it is a good idea
         | to call posix_fadvise(2) with FADV_DONTNEED to ensure the pages
         | will be reused for something else more quickly.
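          | 
          | For example (fd, buf and len are placeholders; includes and
          | error handling omitted):
          | 
          |     /* write data we don't expect to read back soon */
          |     write(fd, buf, len);
          | 
          |     /* dirty pages aren't dropped, so flush them first... */
          |     fsync(fd);
          | 
          |     /* ...then tell the kernel the cached pages can be
          |        reclaimed (offset 0, len 0 means the whole file) */
          |     posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);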
        
           | lanstin wrote:
           | It is pretty easy to completely hork a large box with a very
           | disk intensive process; hit a local file system hard enough
           | and you can get a majority of the processes into D state,
           | uninterruptible IO Disk wait. Maybe not from inside a
            | container, haven't seen it, but definitely on a box with
           | shared processes. Even just too much logging can harm
           | unrelated processes that aren't even doing much with the
           | disk.
        
       | rwmj wrote:
       | It's weird that (according to this document) you can epoll Unix
       | domain sockets but not sockets created by socketpair(2). I
       | thought socketpair created essentially two pre-connected Unix
       | domain sockets.
        
         | kentonv wrote:
          | Hmm, I don't think that's what it says (unless they edited it since
         | your post?). It mentions socketpair explicitly as something
         | that _is_ epoll-friendly, and which you can use to communicate
         | with another thread, in the case where you must create a thread
         | to perform some blocking task but still want to get completion
         | notification in the main thread via epoll.
        
         | ajross wrote:
         | Indeed, I am all but certain you can epoll on socketpairs. That
         | sounds like a mistake in the article.
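          | 
          | A quick way to check (includes and error handling omitted; I'd
          | expect every call here to succeed):
          | 
          |     int sv[2];
          |     socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
          | 
          |     int ep = epoll_create1(0);
          |     struct epoll_event ev = { .events = EPOLLIN,
          |                               .data.fd = sv[0] };
          |     epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);  /* expect 0 */
          | 
          |     write(sv[1], "x", 1);
          |     int n = epoll_wait(ep, &ev, 1, 0);         /* expect 1 */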
        
       | kentonv wrote:
       | I highly recommend that you do NOT use signalfd to get
       | notification of signals through epoll. Instead, block (mask) the
       | signal, set a signal handler, and use epoll_pwait() to atomically
       | unblock it while you wait for events. Note that in this setup,
       | your signal handler callback need not be async-signal-safe, since
       | you know the precise state of the calling thread: it's invoking
       | epoll_pwait(). This sidesteps most of the pain of using signals
       | which might otherwise make you think you want signalfd.
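        | 
        | A sketch of that pattern (on_signal(), epfd, events and
        | MAX_EVENTS are placeholders; includes and error handling
        | omitted):
        | 
        |     /* block the signal and install a handler; the handler will
        |        only ever run inside epoll_pwait() below */
        |     sigset_t block, orig;
        |     sigemptyset(&block);
        |     sigaddset(&block, SIGUSR1);
        |     pthread_sigmask(SIG_BLOCK, &block, &orig);
        | 
        |     struct sigaction sa = {0};
        |     sa.sa_handler = on_signal;
        |     sigaction(SIGUSR1, &sa, NULL);
        | 
        |     /* wait with SIGUSR1 atomically unblocked; if it fires, the
        |        handler runs and epoll_pwait() returns -1 with EINTR */
        |     sigset_t wait_mask = orig;
        |     sigdelset(&wait_mask, SIGUSR1);
        |     int n = epoll_pwait(epfd, events, MAX_EVENTS, -1, &wait_mask);
        |     if (n < 0 && errno == EINTR) {
        |         /* the handler already ran; react to it here */
        |     }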
       | 
       | Two reasons not to use signalfd:
       | 
       | 1. signalfd has weird semantics that don't match what you'd
       | normally expect from a file descriptor. When you read from a
       | signalfd, it tells you signals queued on the thread that called
       | read(), NOT the thread that created the signalfd. Worse, if you
       | add signalfd to an epoll, the epoll will report readiness based
       | on the thread that used epoll_ctl() to add the signalfd, which
       | may be different from the thread that is reading from the epoll.
       | So you might get a notification that the signalfd is ready, but
       | then read the signalfd and find there are no signals, and then
       | wait on the epoll again just to have it tell you again that this
       | signalfd is ready.
       | 
       | 2. It turns out that signalfd's implementation has some severe
       | lock contention issues. I learned this through my own
       | experimentation recently. In my experiment, I had 5000 threads
       | each waiting on an epoll that included a signalfd. When
       | delivering a thread-specific signal to each of the 5000 threads
       | at once, the process spent 2+ MINUTES of CPU time spinning on
       | spinlocks in the kernel before completing all the event
       | deliveries. The time spent was O(n^2) to the number of threads.
       | When I switched to an epoll_pwait()-based implementation, the
       | same task took a few milliseconds.
       | 
       | Here's the PR where I switched KJ's event loop (used in Cap'n
       | Proto and Cloudflare Workers) to use epoll_pwait():
       | https://github.com/capnproto/capnproto/pull/1511
        
         | kelnos wrote:
         | The big downside of using a traditional signal handler is that
         | the only way to get your own data into the handler function is
         | through global variables (or thread locals). While you can
         | certainly make an exception just for that one thing, it feels
         | gross to do so. And you can also just defer processing to your
         | main loop by setting a flag or writing to a pipe, but those
         | things still need to be global variables.
         | 
         | I didn't know about signalfd's limitations before reading your
         | post, and was happy that signalfd could eliminate the need for
         | global variables when doing signal handling. Shame that's not
         | really the case.
        
           | kentonv wrote:
           | In my case I use a thread_local pointer that I initialize
           | right before epoll_pwait and set back to null immediately
           | after. The pointer points to the same data structures that I
           | would otherwise use to handle signalfd events. Yeah it's a
           | little icky to use the global but I think it ends up
           | semantically equivalent.
        
             | [deleted]
        
             | wahern wrote:
             | Unfortunately, thread-local storage is not async-signal
             | safe. You're relying (knowingly, I presume, but others
             | should be warned) on implementation details.
             | 
             | But, yeah, signalfd leaves much to be desired. *BSD kqueue
             | EVFILT_SIGNAL has much saner semantics.
        
               | kentonv wrote:
               | > Unfortunately, thread-local storage is not async-signal
               | safe.
               | 
               | Doesn't matter, because the signal handler in this case
               | is strictly called "during" invocation of epoll_pwait, so
               | there's no risk of it interrupting the initialization of
               | a TLS object. The usual rules about async signal safety
               | do not need to be followed here; it's as if
               | epoll_wait()'s implementation made a plain old function
               | call to the signal handler.
               | 
               | (Also, since we're talking about epoll, we can assume
               | Linux, which means we can assume ELF, which means it's
               | pretty easy to use thread_local in a way that requires no
               | initialization by allocating it in the ELF TLS section.
               | But yes, that's relying on implementation details I
               | suppose.)
               | 
               | > kqueue EVFILT_SIGNAL
               | 
                | Having recently implemented kqueue support in my event loop
               | I have to say I'm disappointed by EVFILT_SIGNAL. It does
               | not play well with signals that target a specific thread
               | (pthread_kill()) -- on FreeBSD, all threads will get the
               | kqueue event, while on MacOS, none of them do.
               | Fortunately EVFILT_USER provides a reasonable alternative
               | for efficient inter-thread signaling.
               | 
               | (I don't like using a pipe or socketpair as that involves
               | allocating a whole two file descriptors and a kernel
               | buffer, and it requires a redundant syscall on the
               | receiving end to read the dummy byte out of the buffer.
               | If you're just trying to tell another thread "hey I added
               | something to your work queue, please wake up and check",
               | that's a waste.)
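                | 
                | A rough sketch of the EVFILT_USER wakeup (kq here is a
                | placeholder for an existing kqueue descriptor; needs
                | <sys/event.h>):
                | 
                |     struct kevent kev;
                | 
                |     /* one-time registration of user event #1 */
                |     EV_SET(&kev, 1, EVFILT_USER,
                |            EV_ADD | EV_CLEAR, 0, 0, NULL);
                |     kevent(kq, &kev, 1, NULL, 0, NULL);
                | 
                |     /* another thread wakes the loop: no extra fds,
                |        no dummy byte to read afterwards */
                |     EV_SET(&kev, 1, EVFILT_USER,
                |            0, NOTE_TRIGGER, 0, NULL);
                |     kevent(kq, &kev, 1, NULL, 0, NULL);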
        
             | kelnos wrote:
             | Makes sense, and is probably the "safest" you can get.
             | Since, as you say, you know exactly the state of everything
             | on that thread when you're in your handler, you can also
             | know that your thread local was set properly before the
             | epoll_pwait() call.
             | 
             | It's probably code I'd want to isolate somewhere, with big
             | warnings so any future reader understands why it is how it
             | is, but I agree it's probably the safest way to do it.
        
         | FPGAhacker wrote:
         | You should do a write up of item 2.
        
         | tlsalmin wrote:
         | I have to disagree here. Not recommending signalfd for the
         | mentioned use cases might be reasonable, just as reasonable as
          | it is to use threads for a specific use case. For a single-
          | threaded client/server using non-blocking FDs, signalfd removes
          | the risk of doing too much in the signal handler and brings
          | signals nicely into the event loop. This just happens to be 99%
          | of the functionality I have to do.
         | 
         | I'd only use more than one signalfd if each signalfd only
          | catches a specific signal. E.g. the main context handles
          | SIGTERM and a background process library handles SIGCHLD.
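          | 
          | For anyone unfamiliar, the setup is roughly this (epfd is a
          | placeholder for the loop's epoll fd; includes and error
          | handling omitted):
          | 
          |     sigset_t mask;
          |     sigemptyset(&mask);
          |     sigaddset(&mask, SIGTERM);
          |     sigaddset(&mask, SIGCHLD);
          |     /* block normal delivery so the fd gets the signals */
          |     sigprocmask(SIG_BLOCK, &mask, NULL);
          | 
          |     int sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
          |     struct epoll_event ev = { .events = EPOLLIN,
          |                               .data.fd = sfd };
          |     epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);
          | 
          |     /* in the event loop, when sfd is readable: */
          |     struct signalfd_siginfo si;
          |     read(sfd, &si, sizeof si);  /* si.ssi_signo says which */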
        
       | guenthert wrote:
       | Thanks for the reminder that there is no non-blocking i/o for
       | files residing on block devices.
        
         | yxhuvud wrote:
         | But there is, io_uring.
        
           | m00dy wrote:
           | io_uring, a magical keyword I used to use in job
           | interviews...
        
             | healthandsafety wrote:
             | Care to elaborate?
        
               | kortilla wrote:
               | Everyone says it's better on paper but you rarely get to
               | actually use it in real code.
        
             | guenthert wrote:
             | That is async i/o afaiu and not classic Unix non-blocking
              | i/o (O_NONBLOCK given to open(2)).
        
               | yxhuvud wrote:
               | Sure. But why does the difference matter? It is not as if
               | epoll is classic Unix either.
        
               | guenthert wrote:
               | epoll might not be, but poll is (depending on how one
               | would interpret 'classic').
               | 
                | Anyhow, I wrongly assumed the difference mattered with
                | respect to whether one could use io_uring in combination
                | with epoll(). It turns out one can [1] or [2].
               | 
               | [1] https://stackoverflow.com/questions/70132802/waiting-
               | for-epo...
               | 
               | [2]
               | https://unixism.net/loti/tutorial/register_eventfd.html
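                | 
                | The eventfd approach in [2] looks roughly like this
                | (liburing; ring and epfd are placeholders, error
                | handling omitted): register an eventfd with the ring,
                | add that eventfd to the existing epoll, and reap
                | completions when it becomes readable.
                | 
                |     int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
                |     io_uring_register_eventfd(&ring, efd);
                | 
                |     struct epoll_event ev = { .events = EPOLLIN,
                |                               .data.fd = efd };
                |     epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
                | 
                |     /* when epoll reports efd readable: */
                |     uint64_t cnt;
                |     read(efd, &cnt, sizeof cnt);  /* clear the counter */
                |     struct io_uring_cqe *cqe;
                |     while (io_uring_peek_cqe(&ring, &cqe) == 0) {
                |         /* handle cqe->res / cqe->user_data */
                |         io_uring_cqe_seen(&ring, cqe);
                |     }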
        
             | yxhuvud wrote:
             | Having done my own share of uring bindings I wish I had
              | found workplaces that appreciated that.
        
       | bfrog wrote:
       | why epoll at all, the new hotness is io_uring, fire away your
       | iovecs, check back later
        
         | rwmj wrote:
         | You can go from select/poll to epoll relatively easily, but
         | I've found that to use io_uring you have to substantially
         | rearchitect your whole program (if you want any performance
         | benefit).
         | 
         | Actually I'd love to be wrong about this, but I've not found a
         | way to easily retrofit io_uring into programs/libraries that
         | are already using either synchronous operations or poll(2).
        
           | jasonzemos wrote:
           | io_uring is basically a drop-in for epoll. It has an
           | intrinsic performance benefit because multiple operations can
           | be both submitted and completed in a single action.
           | Rearchitecting is only optional when going further by
           | replacing standalone syscalls with io_uring operations. In
           | the case of poll(2) I believe it should be no more difficult
           | than refactoring for epoll.
        
             | wahern wrote:
             | With io_uring, _every_ line in an application that calls
              | read/recv needs to be refactored, along with much of the
             | surrounding context. io_uring doesn't replace poll/epoll,
             | it effectively replaces typical event loop frameworks. You
             | can integrate io_uring into pre-existing event loop
             | frameworks, but the event loop framework will end up as a
             | 99% superfluous wrapper, at least on Linux.
             | 
             | Note that many applications don't use event loop
             | frameworks. For simple applications they can be overkill.
             | Even for more complex applications, it may be cleaner to
              | use restartable semantics (i.e. the same semantics as read:
              | just call me again), especially for libraries or components
             | that want to be event loop agnostic.
        
           | gavinray wrote:
            | You can use userspace coroutine/fiber implementations to
            | wire async io_uring into existing synchronous code and
            | maintain the facade of the code still being synchronous.
           | 
           | How easy/feasible this is depends on the language.
           | 
           | In C++, Rust, Zig, Java (Loom fibers), and Kotlin I know for
           | a fact it's doable
           | 
           | Other languages I'm not sure what the experience is like
        
         | drpixie wrote:
         | Does anyone feel that the Linux API (and so the kernel) is
         | slowly getting more and more complex and cumbersome?
        
       ___________________________________________________________________
       (page generated 2022-10-22 23:00 UTC)