[HN Gopher] Why mmap is faster than system calls
       ___________________________________________________________________
        
       Why mmap is faster than system calls
        
       Author : vinnyglennon
       Score  : 226 points
       Date   : 2021-01-09 16:53 UTC (6 hours ago)
        
 (HTM) web link (sasha-f.medium.com)
 (TXT) w3m dump (sasha-f.medium.com)
        
       | layoutIfNeeded wrote:
       | >Further, since it is unsafe to directly dereference user-level
       | pointers (what if they are null -- that'll crash the kernel!) the
       | data referred to by these pointers must be copied into the
       | kernel.
       | 
       | False. If the file was opened with O_DIRECT, then the kernel uses
       | the user-space buffer directly.
       | 
       | From man write(2):
       | 
       | O_DIRECT (Since Linux 2.4.10) Try to minimize cache effects of
       | the I/O to and from this file. In general this will degrade
       | performance, but it is useful in special situations, such as when
       | applications do their own caching. File I/O is done directly
       | to/from user-space buffers. The O_DIRECT flag on its own makes an
       | effort to transfer data synchronously, but does not give the
       | guarantees of the O_SYNC flag that data and necessary metadata
       | are transferred. To guarantee synchronous I/O, O_SYNC must be
       | used in addition to O_DIRECT. See NOTES below for further
       | discussion.
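        | 
        | For concreteness, a minimal sketch of that path (the file name
        | is made up; O_DIRECT requires the buffer, length and offset to
        | be aligned to the device's logical block size, assumed to be
        | 4096 here):
        | 
        |     #define _GNU_SOURCE          /* for O_DIRECT on Linux */
        |     #include <fcntl.h>
        |     #include <stdio.h>
        |     #include <stdlib.h>
        |     #include <unistd.h>
        | 
        |     int main(void) {
        |         int fd = open("data.bin", O_RDONLY | O_DIRECT);
        |         if (fd < 0) { perror("open"); return 1; }
        | 
        |         void *buf;
        |         if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
        | 
        |         /* the kernel transfers straight to/from this user
        |            buffer, bypassing the page cache */
        |         ssize_t n = read(fd, buf, 4096);
        |         if (n < 0) perror("read");
        | 
        |         free(buf);
        |         close(fd);
        |         return 0;
        |     }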
        
         | wtallis wrote:
         | I don't think O_DIRECT makes any guarantees about zero-copy
         | operation. It merely disallows kernel-level caching of that
         | data. But the kernel may make a private copy that isn't
         | caching.
        
           | layoutIfNeeded wrote:
           | Who said it was guaranteed to be zero-copy?
           | 
           | The original article said that the data _must_ be copied
           | based on some bogus handwavy argument, and I've pointed out
           | that the manpage of write(2) contradicts this when it says
           | the following:
           | 
           | >File I/O is done directly to/from user-space buffers.
        
       | jstimpfle wrote:
       | > Why can't the kernel implementation use AVX? Well, if it did,
       | then it would have to save and restore those registers on each
       | system call, and that would make domain crossing even more
       | expensive. So this was a conscious decision in the Linux kernel.
       | 
       | I don't follow. So a syscall that could profit from AVX can't use
       | it because then _all_ syscalls would have to restore AVX
        | registers? Why can't the restoring just happen specifically in
       | those syscalls that make use of AVX?
        
         | xymostech wrote:
         | I think by "each system call" she meant it like "every time it
         | calls read()", since it would be read() that was using the AVX
         | registers. Since the example program just calls read() over and
         | over, this could add a significant amount of overhead.
        
         | PaulDavisThe1st wrote:
         | It's not just syscalls. It's every context switch. If the
         | process is in the midst of using AVX registers in kernel code,
         | but is suddenly descheduled, those registers have to be
         | saved/restored. You can't know if the task is using AVX or not,
         | so you have to either always save/restore them, or adopt the
         | policy that these registers are not saved/restored.
        
         | jabl wrote:
         | I vaguely recall that the Linux kernel has used lazy
         | save/restore of FP registers since way back when.
        
         | jeffbee wrote:
         | You'd have to have a static analysis of which syscalls can
         | transitively reach which functions, which is probably not
         | possible because linux uses tables of function pointers for
         | many purposes. Also if thread 1 enters the kernel, suspends
         | waiting for some i/o, and the kernel switches to thread 2, how
         | would it know it needed to restore thread 2's registers because
         | of AVX activity of thread 1? And if it did how would it have
         | known to save them?
        
           | jstimpfle wrote:
           | Not a kernel person, but how about a flag for the thread data
           | structure?
        
             | jeffbee wrote:
             | Yeah actually now that I'm part way through that first cup
             | of coffee, the 2nd part of my comment doesn't make sense,
             | the kernel already has to do a heavier save of a task's
             | register state when it switches tasks.
        
       | CyberRabbi wrote:
       | I believe if you turn PTI off the syscall numbers for sequential
       | copies would be a lot higher.
        
       | CodesInChaos wrote:
        | Memory-mapped files are very tricky outside the happy path, in
        | particular recovering from errors and from concurrent
        | modification, which can lead to undefined behaviour. It's a
        | good choice for certain
       | use-cases, such as reading assets shipped with the application,
       | where no untrusted process can write to the file and errors can
       | be assumed to not happen.
       | 
       | For high performance code I'd use io_uring.
        
       | jws wrote:
       | Summary: Mostly syscalls and mmap do the same things just
       | substituting a page fault for a syscall to get to kernel mode,
       | but... In user space her code is using AVX optimized memory copy
       | instructions which are not accessible in kernel mode yielding a
       | significant speed up.
       | 
       | Bonus summary: She didn't use the mmapped data in place in order
       | to make a more apples-to-apples comparison. If you can use the
       | data in place then you will get even better performance.
        
         | spockz wrote:
         | Why doesn't the kernel have access to AVX optimised memory copy
         | instructions?
        
           | jws wrote:
           | The size of the state required to be saved and restored on
           | each system call makes it a losing proposition.
        
             | PaulDavisThe1st wrote:
             | Each context switch, not syscall.
        
           | topspin wrote:
           | The kernel does have access to these instructions. It is a
           | deliberate choice by kernel developers not to use them in the
           | case discussed here. In other cases the kernel does use such
           | instructions.
        
         | anaisbetts wrote:
         | *She, not he
        
           | jws wrote:
           | Thanks. Curse this language. I just want to refer to people!
           | It's simple encapsulation and abstraction. I shouldn't have
           | to care about implementation details irrelevant to the
           | context.
        
             | damudel wrote:
             | Don't worry about it. Some people lose their marbles
             | because they think females get erased when male language is
             | used. Just erase both genders and you'll be fine. Use
             | singular they.
        
             | [deleted]
        
             | ryanianian wrote:
             | "They" is an acceptable gender-neutral pronoun.
        
               | FentanylFloyd wrote:
               | it's an idiotic newspeakism, lol
               | 
               | it's well enough that 'you' can be both singular and
               | plural, we don't need another one
        
               | kortilla wrote:
               | Acceptable to some, still not frequent enough though to
               | be normalized.
        
               | lolc wrote:
               | I don't even notice it anymore.
        
               | itamarst wrote:
               | It's been used since the time of Jane Austen (by Jane
               | Austen, in fact), it's perfectly normal:
               | https://pemberley.com/janeinfo/austheir.html
        
               | [deleted]
        
             | cpach wrote:
             | How about "they"...?
        
             | throw_away wrote:
             | singular they is the generalization you're looking for
        
               | jws wrote:
               | I'm old enough that "they" is not singular, it is a
               | grammatical error punishable by red ink and deducted
               | points.
        
               | Spivak wrote:
               | But the usage of they to refer a single person is older
               | than anyone alive if that makes you feel better about
               | sticking it to your picky grade school teachers.
        
               | jfk13 wrote:
               | So is the use of "he" to refer to an individual of
               | unspecified gender.
               | 
               | (The OED quotations for sense 2(b) "In anaphoric
               | reference to a singular noun or pronoun of undetermined
               | gender" go back to at least 1200AD.)
        
             | jfim wrote:
             | You can use "the author" or refer to the article or paper.
             | 
             | [name of paper] mentions that X is faster than Y. The
             | author suggests the cause of the speed up is Z, while we
             | believe it is W.
        
         | usrnm wrote:
         | Just a nitpick: "Alexandra" is the female version of the name
         | "Alexander" in Russian, so it's a "she", not "he".
        
           | andi999 wrote:
            | And 'Sasha' is the nickname for 'Alexander'... man, who thinks
            | this up? This is like calling 'Richard' 'Dick'.
        
             | tucnak wrote:
             | Sasha is universally applied to both males and females,
              | although to be fair, in Russian, it's culturally much more
              | acceptable to call Alexander Sasha in any context
              | whatsoever, whereas Sasha as in the female Alexandra is
              | reserved for informal communication.
             | 
             | Disclaimer: I speak Russian.
        
               | LudwigNagasena wrote:
               | Sasha is an informal version for both genders, I don't
               | think there is any difference.
               | 
               | Source: I am Russian
        
               | FpUser wrote:
               | Second that. Was born in Russia as well
        
               | whidden wrote:
               | As someone who grew up in a former Russian territory that
               | speaks no Russian, even I knew that.
        
             | enedil wrote:
              | In Polish, Aleksandra (also the female version) is shortened
              | to Ola, good luck with that ;)
        
             | eps wrote:
              | Sasha is derived from Alexander via its diminutive, but
              | obsolete form - Aleksashka - shortened to Sashka, further
             | simplified to Sasha as per established name format of Masha
             | (Maria), Dasha (Daria), Pasha (Pavel, Paul), Glasha
             | (Glafira), Natasha (Natalia), etc.
        
       | dmytroi wrote:
       | Did some research on the topic of high bandwidth/high IOPS file
       | accesses, some of my conclusions could be wrong though, but as I
       | discovered modern NVMe drives need to have some queue pressure on
       | them to perform at advertised speeds, as in hardware level they
       | are essentially just a separate CPU running in background that
       | has command queue(s). They also need to have requests align with
        | flash memory hierarchy to perform at advertised speeds. So that
        | puts a quite finicky limitation on your access patterns: 64-256kb
       | aligned blocks, 8+ accesses in parallel. To see that just try
       | CrystalDiskMark and put queue depth at 1-2, and/or block size on
       | something small, like 4kb, and see how your random speed
       | plummets.
       | 
       | So given the limitations on the access pattern, if you just mmap
       | your file and memcpy the pointer, you'll get like ~1 access
       | request in flight if I understand right. And also as default page
       | size is 4kb, that will be 4kb request size. And then your mmap
       | relies on IRQ's to get completion notifications (instead of
       | polling the device state), somewhat limiting your IOPS. Sure
       | prefetching will help of course, but it is relying on a lot of
       | heuristic machinery to get the correct access pattern, which
       | sometimes fails.
       | 
       | As 7+GB/s drives and 10+Gbe networks become more and more
       | mainstream, the main point where people will realize these
       | requirements are in file copying, for example Windows explorer
        | struggles to copy files at 10-25GbE+ rates simply because of how
        | its file access architecture is designed. And hopefully then we
        | will be better equipped to reason about "mmap" vs "read" (really
        | should be pread here to avoid the offset sem in the kernel).
        
         | wtallis wrote:
         | Yep, mmap is really bad for performance on modern hardware
         | because you can only fault on one page at a time (per thread),
         | but SSDs require a high queue depth to deliver the advertised
         | throughput. And you can't overcome that limitation by using
         | more threads, because then you spend all your time on context
         | switches. Hence, io_uring.
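          | 
          | As a flavor of what that looks like, a minimal liburing
          | sketch that keeps several reads in flight at once (file
          | name, queue depth and block size are arbitrary here):
          | 
          |     #include <liburing.h>
          |     #include <fcntl.h>
          |     #include <stdio.h>
          |     #include <unistd.h>
          | 
          |     #define QD 8               /* requests in flight */
          |     #define BS (256 * 1024)    /* bytes per request */
          | 
          |     static char bufs[QD][BS];
          | 
          |     int main(void) {
          |         struct io_uring ring;
          |         int fd = open("data.bin", O_RDONLY);
          |         if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
          |             return 1;
          | 
          |         for (int i = 0; i < QD; i++) {   /* queue QD reads */
          |             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          |             io_uring_prep_read(sqe, fd, bufs[i], BS,
          |                                (unsigned long long)i * BS);
          |         }
          |         io_uring_submit(&ring);
          | 
          |         for (int i = 0; i < QD; i++) {   /* reap completions */
          |             struct io_uring_cqe *cqe;
          |             io_uring_wait_cqe(&ring, &cqe);
          |             printf("request %d: %d bytes\n", i, cqe->res);
          |             io_uring_cqe_seen(&ring, cqe);
          |         }
          | 
          |         io_uring_queue_exit(&ring);
          |         close(fd);
          |         return 0;
          |     }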
        
           | kccqzy wrote:
           | Can't you just use MAP_POPULATE which asks the system to
           | populate the entire mapped address range, which is kind of
           | like page-faulting on every page simultaneously?
        
           | astrange wrote:
           | If you're reading sequentially this shouldn't be a problem
           | because the VM system can pick up hints, or you can use
           | madvise.
           | 
           | If you're reading randomly this is true and you want some
           | kind of async I/O or multiple read operation.
           | 
           | mmap is also dangerous because there's no good way to return
           | errors if the I/O fails, like if the file is resized or is on
           | an external drive.
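            | 
            | The madvise route looks roughly like this (just a sketch;
            | as noted below, the kernel treats the hint as advisory and
            | may not honor it):
            | 
            |     #include <fcntl.h>
            |     #include <stddef.h>
            |     #include <sys/mman.h>
            |     #include <sys/stat.h>
            |     #include <unistd.h>
            | 
            |     /* map a file read-only and hint sequential access */
            |     void *map_sequential(const char *path, size_t *len) {
            |         int fd = open(path, O_RDONLY);
            |         if (fd < 0) return NULL;
            | 
            |         struct stat st;
            |         if (fstat(fd, &st) < 0) { close(fd); return NULL; }
            | 
            |         void *p = mmap(NULL, st.st_size, PROT_READ,
            |                        MAP_PRIVATE, fd, 0);
            |         close(fd);          /* the mapping outlives the fd */
            |         if (p == MAP_FAILED) return NULL;
            | 
            |         /* ask for aggressive readahead on this range */
            |         madvise(p, st.st_size, MADV_SEQUENTIAL);
            |         *len = st.st_size;
            |         return p;
            |     }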
        
             | jandrewrogers wrote:
             | Even if you use madvise() for a large sequential read, the
             | kernel will often restrict its behavior to something
             | suboptimal with respect to performance on modern hardware.
        
           | im3w1l wrote:
           | If I _read_ with a huge block size, say 100mb. Will the OS
           | request things in a sane way?
        
         | foota wrote:
         | Typically reviews of drives publish rates at different queue
         | depths, or at least specify the queue depths tested.
        
       | silvestrov wrote:
        | Why don't Intel CPUs implement a modern version of Z80's LDIR
       | instruction (a memmove in a single instruction)?
       | 
       | Then the kernel wouldn't have to save any registers. (I'd really
       | like if she had documented exactly which CPU/system she used for
       | benchmarking).
        
         | beagle3 wrote:
         | It's called REP MOVSB (or MOVSW, MOVSD, maybe also MOVSQ?). It
          | has existed since the 8086 days; and for reasons I don't know,
         | it supposedly works well for big blocks these days (>1K or so)
         | but is supposedly slower than register moves for small blocks.
        
           | JoshTriplett wrote:
           | > it supposedly works well for big blocks these days (>1K or
           | so) but is supposedly slower than register moves for small
           | blocks.
           | 
           | On current processors with Fast Short REP MOVSB (FSRM), REP
           | MOVSB is the fastest method for all sizes. On processors
           | without FSRM, but with ERMS, REP MOVSB is faster for anything
           | longer than ~128 bytes.
        
             | beagle3 wrote:
             | Thanks! Is there a simple rule-of-the-thumb about when can
             | one rely on FSRM?
        
               | JoshTriplett wrote:
               | You should check the corresponding CPUID bit, but in
               | general, Ice Lake and newer.
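                | 
                | Something along these lines with gcc/clang's cpuid.h
                | (ERMS is leaf 7 EBX bit 9, FSRM is leaf 7 EDX bit 4):
                | 
                |     #include <cpuid.h>
                |     #include <stdio.h>
                | 
                |     int main(void) {
                |         unsigned a, b, c, d;
                |         if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
                |             return 1;
                |         /* Enhanced REP MOVSB/STOSB */
                |         printf("ERMS: %u\n", (b >> 9) & 1);
                |         /* Fast Short REP MOVSB */
                |         printf("FSRM: %u\n", (d >> 4) & 1);
                |         return 0;
                |     }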
        
         | sedatk wrote:
         | LDIR is slower than unrolling multiple LDI instructions by the
         | way.
        
         | jeffbee wrote:
         | Intel CPUs have REP MOVS, which is basically the same thing.
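          | 
          | A rough gcc/clang inline-asm sketch, if you want to play
          | with it yourself (x86-64; the instruction consumes
          | rdi/rsi/rcx):
          | 
          |     #include <stddef.h>
          | 
          |     /* copy n bytes from src to dst with REP MOVSB */
          |     static void copy_rep_movsb(void *dst, const void *src,
          |                                size_t n) {
          |         asm volatile("rep movsb"
          |                      : "+D"(dst), "+S"(src), "+c"(n)
          |                      :
          |                      : "memory");
          |     }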
        
       | aleden wrote:
       | The boost.interprocess library presents the capability to keep
       | data structures (std::list, std::vector, ...) in shared memory
       | (i.e. a memory-mapped file)- "offset pointers" are key to that. I
       | can think of no other programming language that can pull this
       | off, with such grace.
        
       | justin_ wrote:
       | I'm not sure the conclusion that vector instructions are
       | responsible for the speed-up is correct. Both implementations
       | seem to use ERMS (using REP MOVSB instructions)[0]. Looking at
       | the profiles, the syscall implementation spends time in the [xfs]
       | driver (even in the long test), while the mmap implementation
       | does not. It appears the real speed-up is related to how memory-
       | mapped pages interact with the buffer cache.
       | 
       | I might be misunderstanding things. What is really going on here?
       | 
       | [0] Lines 56 and 180 here:
       | http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
        
         | petters wrote:
         | I thought this strange as well. The author even directly links
         | to the source code where REP MOVSB is used.
        
       | pjmlp wrote:
        | In the context of Linux.
        
       | [deleted]
        
       | jFriedensreich wrote:
       | Made me think about LMMD
       | (https://en.m.wikipedia.org/wiki/Lightning_Memory-Mapped_Data...)
        | and wonder why mmap didn't seem to have caught on more in
       | storage engines
        
         | ricardo81 wrote:
         | *LMDB
         | 
          | I use it a bit. The transactional aspect of it requires a bit
          | of consideration but generally the performance is good. I'd
         | originally used libJudy in a bunch of places for fast lookups
         | but the init time for programs was being slowed by having to
         | preload everything. Using an mmap/LMDB is a decent middle
         | ground.
        
         | jandrewrogers wrote:
         | For storage engines that prioritize performance and
         | scalability, mmap() is a poor choice. Not only is it slower and
         | less scalable than alternatives but it also has many more edge
         | cases and behaviors you have to consider. Compared to a good
         | O_DIRECT/io_submit storage engine design, which is a common
         | alternative, it isn't particularly close. And now we have
         | io_uring as an alternative too.
         | 
         | If your use case is quick-and-dirty happy path code then mmap()
         | works fine. In more complex and rigorous environments, like
         | database engines, mmap() is not well-behaved.
        
       | utopcell wrote:
       | Last year we were migrating part of YouTube's serving to a new
       | system and we were observing unexplainable high tail latency. It
       | was eventually attributed to mlock()ing some mmap()ed files,
       | which ended up freezing the whole process for significant amounts
       | of time.
       | 
        | Be wary of powerful abstractions.
        
       | AshamedCaptain wrote:
       | Claiming "mmap is faster than system calls" is dangerous.
       | 
       | I once worked for a company where they also heard someone say
       | "mmap is faster than read/write" and as a consequence rewrite
       | their while( read() ) loop into the following monstrosity:
       | 
       | 1. mmap a 4KB chunk of the file
       | 
       | 2. memcpy it into the destination buffer
       | 
       | 3. munmap the 4KB chunk
       | 
       | 4. repeat until eof
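        | 
        | Roughly, reconstructed from memory (names and error handling
        | are mine, not theirs):
        | 
        |     #include <string.h>
        |     #include <sys/mman.h>
        |     #include <sys/types.h>
        | 
        |     /* copy a file into dst, one 4KB mapping at a time */
        |     int copy_by_tiny_mmaps(int fd, char *dst, size_t size) {
        |         for (size_t off = 0; off < size; off += 4096) {
        |             size_t chunk = size - off < 4096 ? size - off : 4096;
        |             /* 1. mmap a 4KB chunk of the file */
        |             void *p = mmap(NULL, chunk, PROT_READ,
        |                            MAP_PRIVATE, fd, (off_t)off);
        |             if (p == MAP_FAILED) return -1;
        |             /* 2. memcpy it into the destination buffer */
        |             memcpy(dst + off, p, chunk);
        |             /* 3. munmap the 4KB chunk */
        |             munmap(p, chunk);
        |         }   /* 4. repeat until eof */
        |         return 0;
        |     }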
       | 
       | This is different from the claim in the article -- the above
       | monstrosity is individually mmaping each 4KB block, while I
       | presume the article's benchmark is mmaping the entire file in
       | memory at once, which makes much more sense.
       | 
       | After I claimed the "monstrosity" was absurdly stupid, someone
       | pointed to a benchmark they made and found that the "monstrosity"
       | version was actually faster. To me, this made no sense. The
       | monstrosity has triple the syscall overhead vs the read()
       | version, requires manipulating page tables for every 4KB block
       | and as a consequence had several page faults for each 4KB block
       | of the file. Yet it was true: their benchmarks showed the
       | monstrosity version to be slightly faster.
       | 
       | The idealist in me couldn't stand this and I reverted this
       | change, using for my own (unrelated) experiments a binary which
       | used the older, classic, read() loop instead of mmap.
       | 
       | Eventually I noticed I was getting results much faster using my
       | build on my single-socket Xeon than they were getting on their
       | $$$ server farms. Despite what the benchmark said.
       | 
       | Turns out, the "monstrosity" was indeed faster, but if you had
       | several of these binaries running concurrently on the same
       | machine, they would all slow down each other, as if the kernel
       | was having scale issues with multiple processes constantly
       | changing their page tables. The thing would slow down to single-
       | core levels of performance.
       | 
       | I still have no idea why the benchmark was apparently slightly
       | faster, but obviously they were checking it either isolated or on
        | machines where the other processes were running read() loops. I
       | guess that by wasting more kernel CPU time on yourself you may
       | starve other processes in the system leaving more user time for
       | yourself. But once every process does it, the net result is still
       | significantly lowered performance for everyone.
       | 
       | Just yet another anecdote for the experience bag...
        
         | piyh wrote:
         | Out of curiosity, what was the use case where they were trying
         | to get these marginal gains out of their program?
        
         | jabl wrote:
         | On Linux mmap_sem contention is a well-known concurrency
         | bottleneck, you may have been hitting that. Multiple efforts
         | over the years have failed to fix it, IIRC. I guess one day
         | they'll find a good solution, but until then, take care.
        
         | beached_whale wrote:
         | mmap is a more ergonomic interface than read too. How often are
         | people copying a file to a local buffer, or the whole file a
         | chunk at a time, in order to use the file like an array of
          | bytes? mmap gives us a range of bytes right off the start. Even
         | if not optimal, the simplicity in usage often means less room
         | for bugs.
        
         | alexchamberlain wrote:
         | The kernel can, in theory, estimate what page you're going to
         | load next, so loading 4KB at a time may not have the page
         | faults you'd expect.
        
         | searealist wrote:
         | Your anecdote doesn't follow your warning.
         | 
         | Using mmap in an unusual way (to read chunks) on presumably
         | legacy hardware doesn't generalize to using it in the obvious
         | way (mmap entire files or at least larger windows) on modern
         | hardware.
        
           | Mathnerd314 wrote:
           | I think the story is to always benchmark first, and also to
           | make sure your benchmarks reflect real-world use. What's
           | dangerous is assuming something is faster without
           | benchmarking.
        
             | searealist wrote:
             | I think many people reading that anecdote may come away
             | with the idea that mmap is bad (and a monstrosity even) and
             | read is good rather than your interpretation that you
             | should benchmark better.
             | 
             | I dislike this kind of muddying the waters and I hope my
             | comment provides another perspective for readers.
        
             | anonunivgrad wrote:
             | Best place to start is to have a good mental model of how
             | things work and why they would be performant or not for a
             | particular use case. Otherwise you're just taking shots in
             | the dark.
        
           | AshamedCaptain wrote:
           | Indeed, my warning is about being cautious when making
           | generalized claims.
        
             | searealist wrote:
             | If someone claimed running was faster than walking and then
             | I told a story about how I once saw someone running in snow
             | shoes on the grass and it was slower than walking then that
             | would just be muddying the waters.
        
               | segfaultbuserr wrote:
               | I remember a quote, I cannot find the source for now, but
               | it basically says "A book can either be completely
               | correct, or be readable, but not both."
        
           | pletnes wrote:
           | It seems a more apples-to-apples comparison would be to open
           | a file, seek(), read() a block, then close() the file. Just
           | as bizarre as the repeated mmap, of course.
        
             | segfaultbuserr wrote:
             | Regardless of how bizarre it is, I've seen this in real
             | code in embedded applications before. It's a workaround of
             | buggy serial port drivers (flow control or buffering is
             | probably broken): You open the port, read/write a line,
             | close it, and open it again...
        
               | craftinator wrote:
               | Hah I came here to say pretty much the same thing!
               | Recently ran into it and coding that workaround on a
               | resource constrained system felt absolutely bonkers.
        
         | stanfordkid wrote:
         | Isn't the whole point of mmap to randomly access the data
         | needed in memory? Did they think memcpy is a totally free
         | operation or something, without any side effects?
        
         | labawi wrote:
         | Was it perhaps a multi-threaded task? Because that would almost
         | definitely crawl.
         | 
          | In general, unmapping is expensive, much more expensive than
         | mapping memory, because you need to do a TLB-
         | shootdown/flush/whatever to make sure a cached version of the
         | old mapping is not used. A read/write does a copy, so no need
         | to mess with mappings and TLBs, hence it can scale very well.
        
         | CyberDildonics wrote:
         | If someone hears "mmap is faster than system calls" and then
         | mmaps and munmaps 4KB chunks at a time in a loop, not realizing
         | that mmap and munmap are actually system calls and that the
         | benefit is not about calling those functions as much as
         | possible, there is no saving them.
         | 
         | That's not the fault of a 'dangerous' claim, that's the fault
         | of people who go heads first into something without taking 20
         | minutes to understand what they are doing or 20 minutes to
         | profile after.
        
           | 411111111111111 wrote:
            | You'd need significantly more than 20 minutes to form an
            | informed opinion on the topic if you don't already know
            | basically everything about it.
           | 
           | The only thing you could do in that timespan is reading a
           | single summary on the topic and hope that this includes all
           | relevant information. Which is unlikely and the reason why
           | things are often mistakenly taken out of context.
           | 
           | And as the original comment mentioned: they _did_ benchmark
            | and it showed an improvement. They just didn't stress-test
            | it, but that's unlikely to be doable within 20 minutes either.
        
             | CyberDildonics wrote:
              | In 20 minutes you can read what mmap does and see that you
              | can map a file and copy it like memory.
             | 
             | In another 20 minutes you can compile and run a medium
             | sized program.
             | 
             | Neither of those is enough time for someone to go deep into
             | something, but you can look up the brand new thing you're
             | using and see where your bottlenecks are.
        
           | anonunivgrad wrote:
           | Yep, there's no substitute for a qualitative understanding of
           | the system.
        
       | tucnak wrote:
       | Am I right to assume that Alexandra is well-known in the field?
       | I've never heard the name.
        
         | lionsdan wrote:
         | https://www.ece.ubc.ca/faculty/alexandra-sasha-fedorova
        
         | stingraycharles wrote:
         | Apparently she's a researcher and MongoDB consultant.
         | 
         | https://www.ece.ubc.ca/~sasha/
        
         | [deleted]
        
         | bzb6 wrote:
         | It's an ad.
        
       | noncoml wrote:
       | The meat of it is:
       | 
       | the functions (for copying data) used for syscall and mmap are
       | very different, and not only in the name.
       | 
       | __memmove_avx_unaligned_erms, called in the mmap experiment, is
       | implemented using Advanced Vector Extensions (AVX) (here is the
       | source code of the functions that it relies on).
       | 
       | The implementation of copy_user_enhanced_fast_string, on the
       | other hand, is much more modest. That, in my opinion, is the huge
       | reason why mmap is faster. Using wide vector instructions for
       | data copying effectively utilizes the memory bandwidth, and
       | combined with CPU pre-fetching makes mmap really really fast.
       | 
       | Why can't the kernel implementation use AVX? Well, if it did,
       | then it would have to save and restore those registers on each
       | system call, and that would make domain crossing even more
       | expensive. So this was a conscious decision in the Linux kernel
        
         | fangyrn wrote:
         | I'm a bit of an idiot, when I think of AVX I think of something
          | that speeds up computation (specifically matrix stuff), not
         | memory access. How wrong am I?
        
           | aliceryhl wrote:
           | AVX is useful for both.
        
           | jstimpfle wrote:
           | It's a set of SIMD (single instruction, multiple data)
           | extensions to the amd64 instruction set. They allow you to
           | operate on larger chunks of data with a single instruction -
           | for example, do 16 integer multiplications in parallel, etc.
        
           | jeffbee wrote:
           | Its registers are just larger. The way x86 moves memory is
           | through registers, register-to-register or register-to/from-
           | memory. The AVX registers move up to 64 bytes in one move. A
           | general purpose register moves at most 8 bytes.
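            | 
            | A toy version of the idea (glibc's
            | __memmove_avx_unaligned_erms is far more careful; compile
            | with -mavx):
            | 
            |     #include <immintrin.h>
            |     #include <stddef.h>
            | 
            |     void copy_avx(void *dst, const void *src, size_t n) {
            |         char *d = dst;
            |         const char *s = src;
            |         size_t i = 0;
            |         /* 32 bytes per iteration via a YMM register */
            |         for (; i + 32 <= n; i += 32) {
            |             __m256i v = _mm256_loadu_si256(
            |                 (const __m256i *)(s + i));
            |             _mm256_storeu_si256((__m256i *)(d + i), v);
            |         }
            |         for (; i < n; i++)   /* byte-at-a-time tail */
            |             d[i] = s[i];
            |     }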
        
         | cwzwarich wrote:
         | Wouldn't REP MOVSB be as fast as an AVX memcpy for 4 KB sizes
         | on recent Intel CPUs?
        
           | adzm wrote:
           | It should be, I think, though it's a complicated question
           | whose answer varies on so much, cpu architecture and how it
           | is used. There is a great discussion on it here, too.
           | 
           | https://stackoverflow.com/questions/43343231/enhanced-rep-
           | mo...
        
           | justin_ wrote:
           | The glibc implementation[0] uses Enhanced REP MOVSB when the
           | array is long enough. It takes a few cycles to start up the
           | ERMS feature, so it's only used on longer arrays.
           | 
           | Edit: Wait a minute... if this is true, then how can AVX be
           | responsible for the speed up? Is it related to the size of
           | the buffers being copied internally?
           | 
           | [0] Line 48 here: http://sourceware.org/git/?p=glibc.git;a=bl
           | ob;f=sysdeps/x86_...
        
             | JoshTriplett wrote:
             | > The glibc implementation[0] uses Enhanced REP MOVSB when
             | the array is long enough. It takes a few cycles to start up
             | the ERMS feature, so it's only used on longer arrays.
             | 
             | That isn't true anymore either, on sufficiently recent
             | processors with "Fast Short REP MOVSB (FSRM)". If the FSRM
             | bit is set (which it is on Ice Lake and newer), you can
             | just always use REP MOVSB.
        
               | jabl wrote:
               | Still waiting for the "Yes, This Time We Really Mean It
               | Fast REP MOVSB" (YTTWRMIFRM) bit.
               | 
               | More seriously, if REP MOVSB can be counted on always
               | being the fastest method that's fantastic. One thing that
               | easily gets forgotten in microbenchmarking is I$
               | pollution by those fancy unrolled SIMD loops with 147
               | special cases.
        
       | amluto wrote:
       | This is a poor explanation and poor benchmarking. Let's see:
       | 
       | copy_user_enhanced_fast_string uses a CPU feature that
       | (supposedly) is very fast. Benchmarking it against AVX could be
       | interesting, but it would need actual benchmarking instead of
       | handwaving. It's worth noting that using AVX at all carries
       | overhead, and it's not always the right choice even if it's
       | faster in a tight loop.
       | 
       | Page faults, on x86_64, are much slower than syscalls. KPTI and
       | other mitigations erode this difference to some extent. But
       | surely the author should have compared the number of page faults
       | to the number of syscalls. Perf can do this.
       | 
       | Finally, munmap() is very, very expensive, as is discarding a
       | mapped page. This is especially true on x86. Workloads that do a
       | lot of munmapping need to be careful, especially in multithreaded
       | programs.
        
       | bigdict wrote:
       | Hold up. Isn't mmap a system call?
        
         | chrisseaton wrote:
         | > Hold up. Isn't mmap a system call?
         | 
         | That's not what they mean. You set up a memory map with the
         | mmap system call, yes, but that's not the point.
         | 
         | The point is then that you can read and write mapped files by
         | reading and writing memory addresses directly - you do not have
         | to use a system call to perform each read and write.
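          | 
          | To make the contrast concrete, a small sketch (error
          | handling trimmed, file name made up): one read() per chunk
          | versus a single mmap() followed by plain loads:
          | 
          |     #include <fcntl.h>
          |     #include <stdint.h>
          |     #include <stdio.h>
          |     #include <sys/mman.h>
          |     #include <sys/stat.h>
          |     #include <unistd.h>
          | 
          |     int main(void) {
          |         int fd = open("data.bin", O_RDONLY);
          |         struct stat st;
          |         fstat(fd, &st);
          | 
          |         /* style 1: one system call per 4KB chunk */
          |         uint64_t sum1 = 0;
          |         unsigned char buf[4096];
          |         ssize_t n;
          |         while ((n = read(fd, buf, sizeof buf)) > 0)
          |             for (ssize_t i = 0; i < n; i++)
          |                 sum1 += buf[i];
          | 
          |         /* style 2: one mmap(), then ordinary memory reads */
          |         unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
          |                                 MAP_PRIVATE, fd, 0);
          |         uint64_t sum2 = 0;
          |         for (off_t i = 0; i < st.st_size; i++)
          |             sum2 += p[i];
          | 
          |         printf("%llu %llu\n",
          |                (unsigned long long)sum1,
          |                (unsigned long long)sum2);
          |         munmap(p, st.st_size);
          |         close(fd);
          |         return 0;
          |     }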
        
           | bzb6 wrote:
           | So like DMA?
        
             | ndesaulniers wrote:
             | DMA is more like a bulk memory transfer operation usually
             | facilitated by specific hardware that generally is
             | asynchronous and requires manual synchronization. Usually
             | hardware devices perform DMAs of memory regions, like a
             | memcpy() but between physical memories.
             | 
             | Memory mappings established via mmap() more so set up the
             | kernel to map in pages when faults on accesses occur. In
              | this case you're not calling into the kernel; the MMU
              | raises a page fault when you go to read an address
              | referring to memory that's not yet paged in, which the
              | kernel then handles and restores control flow to userspace
             | without userspace being any wiser (unless userspace is
             | keeping track of time). Handling page faults is faster than
             | the syscalls involved in read() calls, it would seem.
        
             | _0ffh wrote:
             | I think that comparison would be more confusing than
             | helpful.
        
           | [deleted]
        
       | jabl wrote:
       | As a word of warning, mmap is fine if the semantics match the
       | application.
       | 
       | mmap is not a good idea for a general purpose read()/write()
       | replacement, e.g. as advocated in the 1994 "alloc stream
       | facility" paper by Krieger et al. I worked with an I/O library
       | that followed this strategy, and we had no end of trouble how to
       | robustly deal with resizing files, and also how to do the
       | windowing in a good way (this was in the time where we needed to
       | care about systems with 32-bit pointers, VM space getting tight,
       | but still needed to care about files larger than 2 GB). And then
       | we needed the traditional read/write fallback path anyway, in
       | order to deal with special files like tty's, pipes etc. In the
       | end I ripped out the mmap path, and we saw a perf improvement in
       | some benchmark by x300.
        
         | searealist wrote:
         | What year / hardware / kernel version are you talking about?
        
           | jabl wrote:
           | Oh uh, IIRC 2004/2005 or thereabouts. Personally I was using
           | PC HW running an up to date Linux distro, as I guess was the
           | vast majority of the userbase, but there was a long tail of
           | all kinds of weird and wonderful targets where the software
           | was deployed.
        
             | [deleted]
        
         | iforgotpassword wrote:
         | Also error handling. read and write can return errors, but what
         | happens when you write to a mmaped pointer and the underlying
         | file system has some issue? Assigning a value to a variable
         | cannot return an error.
         | 
         | So you get a fine SIGBUS to your application and it crashes.
         | Just the other day I used imagemagick and it always crashed
         | with a SIGBUS and just when I started googling the issue I
         | remembered mmap, noticed that the partition ran out of space,
         | freed up some more and the issue was gone.
         | 
         | So you might want to set up a handler for that signal, but now
         | the control flow suddenly jumps to another function if an error
         | occurs, and you have to somehow figure out where in your
         | program the error occurred and then what? Then you remember
         | that longjmp exists and you end up with a steaming pile of
         | garbage code.
         | 
         | Only use mmap if you absolutely must. Don't just "mmap all teh
         | files" as it's the new cool thing you learned about.
        
           | chrchang523 wrote:
           | Yeah, this is the biggest reason I stay the hell away from
           | mmap now. Signal handlers are a much worse minefield than
           | error handling in any standard file I/O API I've seen.
        
           | klodolph wrote:
           | You don't have to longjmp, you can remap the memory and set a
           | flag, return from the signal handler, handle the error later,
           | if you like.
        
           | jabl wrote:
           | Indeed. The issue with file resizing I mentioned was mostly
           | related to error handling (what if another
           | process/thread/file descriptor/ truncates the file, etc.).
           | But yes, there are of course other errors as well, like the
           | fs running out of space you mention.
        
           | justin66 wrote:
           | There's nothing wrong with using a read only mmap in
           | conjunction with another method for writes.
        
             | iforgotpassword wrote:
             | You have exactly the same problem on a read error.
        
               | justin66 wrote:
               | Not the problem you described in your second paragraph.
        
         | nlitened wrote:
         | Is it still the case in 64-bit systems?
        
           | jabl wrote:
           | Except for running out of VM space, all the other issues are
           | still there. And even if you have (for the time being)
           | practically unlimited VM space, you may still not want to
           | mmap a file of unbounded size, since setting up all those
           | mappings takes quite a lot of time if you're using the
           | default 4 kB page size. So you probably want to do some kind
           | of windowing anyway. But then if the access pattern is random
           | and the file is large, you have to continually shift the
           | window (munmap + mmap) and performance goes down the drain.
           | So I don't think going to 64-bit systems tilts the balance in
           | favor of mmap.
        
             | pocak wrote:
             | Linux allocates page tables lazily, and fills them lazily.
             | The only upfront work is to mark the virtual address range
             | as valid and associated with the file. I'd expect mapping
             | giant files to be fast enough to not need windowing.
        
               | jabl wrote:
               | Good point, scratch that part of my answer.
               | 
               | There are still some cases where you'd not want unlimited
               | VM mapping, but those are getting a bit esoteric and at
               | least the most obvious ones are in the process of getting
               | fixed.
        
       | Matthias247 wrote:
       | The 4-16kB buffer sizes are all rather tiny and inefficient for
       | high throughput use-cases, which makes those results not that
       | relevant. Something between 64kB to 1MB seems more applicable.
        
       | aloknnikhil wrote:
       | Previous discussion:
       | https://news.ycombinator.com/item?id=24842648
        
       | baybal2 wrote:
       | There is a third option on the table! Using DMA controller.
       | 
        | People ask: what is the difference between copying memory with
        | the CPU and with a DMA controller? The difference is exactly who
        | does the copy.
       | 
       | You rarely need that, but in some cases you have:
       | 
       | 1. You do very long copies, and want to have full CPU
       | performance.
       | 
        | 2. You do very long copies, and want the caches not to be
        | flushed during the copy.
       | 
        | 3. You care about power consumption, as the DMA controller may
        | let the CPU core enter low-power mode sooner.
       | 
        | 4. On some CPU architectures, you can gather widely scattered
        | pages quicker than with the CPU/software.
        
         | beagle3 wrote:
         | The DMA needs to cooperate with the MMU which is on the CPU
         | these days (and has been for almost 3 decades now). It's a lot
         | of work to set up DMA correctly given physical<->logical memory
         | mapping - so it's only worth it if you have a really big chunk
         | to copy.
        
           | GeorgeTirebiter wrote:
           | This is quite interesting. This, to me, seems like a systems
           | bug. In the Embedded World, it is exceedingly common to use
           | DMA for all high-speed transfers -- it's effectively a
           | specialized parallel hardware 'MOV' instruction. Also, I have
           | never had an occasion on modern PC hw to need mmap; read()
           | lseek() are clean and less complex overall. Maybe I lack
           | imagination.
        
             | astrange wrote:
              | mmap is being used by libraries under you; it's useful for
              | files on internal drives that won't be deleted, that you want
              | to access randomly, and that you don't want to allocate
              | buffers to copy things out of.
             | 
             | For instance, anytime you call a shared library it's been
             | mmapped by dyld/ld.so.
        
       | xoo1 wrote:
        | It's all very benchmark-chasing and theoretical. In practice
        | performance is more complicated, and mmap is this weird corner-
        | case, over-engineered, inconsistent optimization thing that often
        | wastes or even "leaks" memory which could otherwise go to caches
        | that actually matter for performance; it's also awful at error
        | handling and so on. I had to literally patch LevelDB to disable
        | mmap on amd64 once, which eliminated OOMs on those servers,
        | allowed me to run way more LevelDB instances, and improved
        | overall performance so significantly that I had to write this
        | comment.
        
         | jstimpfle wrote:
         | Yup, I don't like using mmap() for the reason alone that it
         | means giving up a lot of control.
        
         | BikiniPrince wrote:
         | One mechanism we developed was to build a variant of our
         | storage node that could run in isolation. This meant that
         | synthetic testing would give us some optimal numbers for
         | hardware vetting and performance changes.
         | 
          | I proved quite quickly that our application was thread-poor
          | and that the cost of fixing it was well worth it, using other
          | synthetic benchmarks to compare what the systems were capable
          | of.
         | 
         | I was gone before that was finished, but it was quite an
         | improvement. It also allowed cold volumes to exist in an over
         | subscription model.
         | 
          | None of this excuses you from good real-world telemetry and
          | evaluation of your outliers.
        
         | CalChris wrote:
         | Neither _mmap()_ nor _read() /write()_ leak memory.
        
           | jstimpfle wrote:
           | But they might "leak" it. What parent meant is that as an
           | mmap() user you have no control how much of a mapping takes
           | actual memory while you're visiting random memory-mapped file
           | locations. Is that documented somewhere?
        
             | vlovich123 wrote:
             | madvise gives you pretty good control over the paging, no?
              | Generally I think you can use MADV_DONTNEED to page out
             | content if you need to do it more aggressively, no? The
             | benefit is that the kernel understands this enough that it
             | can evict those page buffers, things it can't do when those
             | buffers live in user-space.
        
               | jandrewrogers wrote:
               | No, madvise() does not give good control over paging
               | behavior. As the syscall indicates, it is merely a
               | suggestion. The kernel is free to ignore it and
               | frequently does. This non-determinism makes it nearly
               | useless for many types of page scheduling optimizations.
               | For some workloads, the kernel consistently makes poor
               | paging choices and there is no way to force it to make
               | good paging choices. You have much better visibility into
               | the I/O behavior of your application than the kernel
               | does.
               | 
               | In my experience, at least for database-y workloads, if
               | you care enough about paging behavior to use madvise(),
               | you might as well just use any number of O_DIRECT
               | alternatives that offer deterministic paging control. It
               | is much easier than trying to cajole the kernel into
               | doing what you need via madvise().
        
             | rowanG077 wrote:
             | That's called a space leak not a memory leak.
        
             | AnotherGoodName wrote:
             | The paging system will only page in what's being used right
             | now though and paging out has zero cost. Old data will
             | naturally be paged out. To put it directly the answer is
             | each mmap file will need 1 page of physical memory (the
             | area currently being read/written). There may be old pages
             | left around since there's no reason for the OS to page
             | anything out unless some other application asked for the
             | memory. But if they do mmap will go to 1 page just fine and
             | there's zero cost to paging out.
             | 
             | I feel mmap gets a bad reputation when people look at
             | memory usage tools that look at total virtual memory
             | allocated.
             | 
             | I can mmap a 100GB of files, use 0 physical memory and a
             | lot of memory usage tools will report 100GB of memory usage
             | of a certain type (virtual memory allocated). You then get
             | articles about application X using GB of memory. Anyone
             | trying to correct this is ignored.
             | 
             | Google Chrome is somewhat unfairly hit by this. All those
             | articles along the lines of "Why is Google using 4GB with
             | no tabs after i viewed some large PDFs". The answer is that
             | it reserved 4GB of 'addresses' that it has mapped to files.
             | If another application wants to use that memory there's
             | zero cost to paging out those files from memory. The OS is
             | designed to do this and it's what mmap is for.
        
               | labawi wrote:
               | > paging out has zero cost
               | 
               | Paging out, as in removing a mapping, can be surprisingly
               | costly, because you need to invalidate any cached TLB
               | entries, possibly even in other CPUs.
               | 
               | > each mmap file will need 1 page of physical memory
               | 
               | Technically, a lower limit would be about 2 or so usable
               | pages, because you can't use more than that
               | simultaneously. However unmaps are expensive, so the
               | system won't be too eager to page out.
               | 
               | Also, for pages to be accessible, they need to be
               | specified in the page table (actually tree, of virtual ->
               | physical mappings). A random address may require about
               | 1-3 pages for page table aside from the 1 page of actual
               | data (but won't need more page tables for the next MB).
               | 
               | > application X using GB of memory
               | 
               | I think there is a difference between reserved, allocated
               | and file-backed mmapped memory. Address space, file-
               | backed mmapped memory is easily paged-out, not sure what
               | different types of reserved addresses/memory are, but
               | chrome probably doesn't have lots of mmapped memory that
               | can be paged out. If it's modified, then it must be
               | swapped, otherwise it's just reserved and possibly
               | mapped, but _never_ used memory.
        
               | AnotherGoodName wrote:
               | I'd argue the costs with paging out are already accounted
               | for by the other process paging in though. The other
               | process that paged in and led to the need to page out had
               | already led to the need to change the page table and
               | flush cache.
        
               | labawi wrote:
               | Paging in free memory (adding a mapping) is cheap (no
               | need to flush). Removing a mapping is expensive (need to
               | flush). Also, processes have their own (mostly)
               | independent page tables.
               | 
               | I don't think it would be reasonable accounting, when
               | paging-in is cheap, but only if there is no need to page
               | out (available free memory). Especially when trying to
               | argue that paging out is zero-cost.
        
           | silon42 wrote:
           | mmap has less deterministic memory pressure and more complex
           | interactions with overcommit (if enabled).
        
         | jeffbee wrote:
         | LevelDB is kinda like a single-tablet bigtable, but because of
         | that its mmap i/o is not a result of battle hardening in
         | production systems. bigtable doesn't use local unix i/o for any
         | purpose at all, so I'm not surprised to hear that leveldb's
         | local i/o subsystem is half baked.
        
         | btown wrote:
         | Curious now - were you running an unconventional workload that
         | stressed LevelDB, or do you think some version of this advice
         | could be applicable to typical workloads?
        
       | einpoklum wrote:
       | The author says that in userspace memcpy, AVX is used, but
       | 
       | > The implementation of copy_user_enhanced_fast_string, on the
       | other hand, is much more modest.
       | 
       | Why is that? I mean, if you compiled your kernel for a wide range
       | of machines, then fine, but if you compiled targeting your actual
       | CPU, why would the kernel functions not use AVX?
        
       | lrossi wrote:
       | From the mouth of Linus:
       | 
       | https://marc.info/?l=linux-kernel&m=95496636207616&w=2
       | 
       | It's a bit old, but it should still apply. I remember that in
       | general he was annoyed when seeing people recommend mmap instead
       | of read/write for basic I/O usecases.
       | 
       | In general, it's almost always better to use the specialized API
       | (read, write etc.) instead of reinventing the wheel on your own.
        
         | beagle3 wrote:
         | LMDB (and its modern fork MDBX), and kdb+/shakti make
         | incredibly good use of mmap - I suspect it is possible to get
         | similar performance from read(), but probably at 10x the
         | implementation complexity.
        
         | nabla9 wrote:
         | Yes. If you do lots of sequential/local reads you can reduce
         | the number of context switches dramatically if you do something
          | like:
          | 
          |     #include <stdio.h>
          | 
          |     /* to reduce context switches */
          |     int bsize = 16*BUFSIZ;
          | 
          |     FILE *fopen_bsize(const char *filename) {
          |         FILE *fp = fopen(filename, "r");
          |         if (fp) setvbuf(fp, NULL, _IOFBF, bsize);
          |         return fp;
          |     }
        
       ___________________________________________________________________
       (page generated 2021-01-09 23:00 UTC)