[HN Gopher] Why mmap is faster than system calls
___________________________________________________________________

Why mmap is faster than system calls

Author : vinnyglennon
Score  : 226 points
Date   : 2021-01-09 16:53 UTC (6 hours ago)

(HTM) web link (sasha-f.medium.com)
(TXT) w3m dump (sasha-f.medium.com)

| layoutIfNeeded wrote:
| >Further, since it is unsafe to directly dereference user-level
| pointers (what if they are null -- that'll crash the kernel!) the
| data referred to by these pointers must be copied into the
| kernel.
|
| False. If the file was opened with O_DIRECT, then the kernel uses
| the user-space buffer directly.
|
| From man write(2):
|
| O_DIRECT (Since Linux 2.4.10) Try to minimize cache effects of
| the I/O to and from this file. In general this will degrade
| performance, but it is useful in special situations, such as when
| applications do their own caching. File I/O is done directly
| to/from user-space buffers. The O_DIRECT flag on its own makes an
| effort to transfer data synchronously, but does not give the
| guarantees of the O_SYNC flag that data and necessary metadata
| are transferred. To guarantee synchronous I/O, O_SYNC must be
| used in addition to O_DIRECT. See NOTES below for further
| discussion.
| wtallis wrote:
| I don't think O_DIRECT makes any guarantees about zero-copy
| operation. It merely disallows kernel-level caching of that
| data. But the kernel may make a private copy that isn't
| caching.
| layoutIfNeeded wrote:
| Who said it was guaranteed to be zero-copy?
|
| The original article said that the data _must_ be copied
| based on some bogus handwavy argument, and I've pointed out
| that the manpage of write(2) contradicts this when it says
| the following:
|
| >File I/O is done directly to/from user-space buffers.
| jstimpfle wrote:
| > Why can't the kernel implementation use AVX? Well, if it did,
| then it would have to save and restore those registers on each
| system call, and that would make domain crossing even more
| expensive. So this was a conscious decision in the Linux kernel.
|
| I don't follow. So a syscall that could profit from AVX can't use
| it because then _all_ syscalls would have to restore AVX
| registers? Why can't the restoring just happen specifically in
| those syscalls that make use of AVX?
| xymostech wrote:
| I think by "each system call" she meant it like "every time it
| calls read()", since it would be read() that was using the AVX
| registers. Since the example program just calls read() over and
| over, this could add a significant amount of overhead.
| PaulDavisThe1st wrote:
| It's not just syscalls. It's every context switch. If the
| process is in the midst of using AVX registers in kernel code,
| but is suddenly descheduled, those registers have to be
| saved/restored. You can't know if the task is using AVX or not,
| so you have to either always save/restore them, or adopt the
| policy that these registers are not saved/restored.
| jabl wrote:
| I vaguely recall that the Linux kernel has used lazy
| save/restore of FP registers since way back when.
| jeffbee wrote:
| You'd have to have a static analysis of which syscalls can
| transitively reach which functions, which is probably not
| possible because Linux uses tables of function pointers for
| many purposes. Also if thread 1 enters the kernel, suspends
| waiting for some i/o, and the kernel switches to thread 2, how
| would it know it needed to restore thread 2's registers because
| of AVX activity of thread 1? And if it did, how would it have
| known to save them?
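A minimal sketch of the O_DIRECT path discussed above, assuming
Linux and a filesystem that supports direct I/O. Note that O_DIRECT
requires the buffer, file offset, and transfer length to be suitably
aligned (typically to the logical block size); "data.bin" here is a
placeholder name:

      /* Sketch: an O_DIRECT read into an aligned user buffer.
         The kernel transfers straight into buf, bypassing the
         page cache. Alignment requirements vary by filesystem
         and device; 4096 is a common safe choice. */
      #define _GNU_SOURCE          /* for O_DIRECT on glibc */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void)
      {
          const size_t align = 4096, len = 4096;
          void *buf;

          if (posix_memalign(&buf, align, len) != 0)
              return 1;

          int fd = open("data.bin", O_RDONLY | O_DIRECT);
          if (fd < 0) { perror("open"); return 1; }

          ssize_t n = read(fd, buf, len);
          if (n < 0) perror("read");

          close(fd);
          free(buf);
          return 0;
      }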
| jstimpfle wrote:
| Not a kernel person, but how about a flag for the thread data
| structure?
| jeffbee wrote:
| Yeah actually now that I'm part way through that first cup
| of coffee, the 2nd part of my comment doesn't make sense; the
| kernel already has to do a heavier save of a task's register
| state when it switches tasks.
| CyberRabbi wrote:
| I believe if you turn PTI off the syscall numbers for sequential
| copies would be a lot higher.
| CodesInChaos wrote:
| Memory mapped files are very tricky outside the happy path. In
| particular, recovery from errors and concurrent modification
| leading to undefined behaviour. It's a good choice for certain
| use-cases, such as reading assets shipped with the application,
| where no untrusted process can write to the file and errors can
| be assumed to not happen.
|
| For high performance code I'd use io_uring.
| jws wrote:
| Summary: Mostly syscalls and mmap do the same things, just
| substituting a page fault for a syscall to get to kernel mode,
| but... in user space her code is using AVX-optimized memory copy
| instructions which are not accessible in kernel mode, yielding a
| significant speed up.
|
| Bonus summary: She didn't use the mmapped data in place in order
| to make a more apples-to-apples comparison. If you can use the
| data in place then you will get even better performance.
| spockz wrote:
| Why doesn't the kernel have access to AVX-optimised memory copy
| instructions?
| jws wrote:
| The size of the state required to be saved and restored on
| each system call makes it a losing proposition.
| PaulDavisThe1st wrote:
| Each context switch, not syscall.
| topspin wrote:
| The kernel does have access to these instructions. It is a
| deliberate choice by kernel developers not to use them in the
| case discussed here. In other cases the kernel does use such
| instructions.
| anaisbetts wrote:
| *She, not he
| jws wrote:
| Thanks. Curse this language. I just want to refer to people!
| It's simple encapsulation and abstraction. I shouldn't have
| to care about implementation details irrelevant to the
| context.
| damudel wrote:
| Don't worry about it. Some people lose their marbles
| because they think females get erased when male language is
| used. Just erase both genders and you'll be fine. Use
| singular they.
| [deleted]
| ryanianian wrote:
| "They" is an acceptable gender-neutral pronoun.
| FentanylFloyd wrote:
| it's an idiotic newspeakism, lol
|
| it's well enough that 'you' can be both singular and
| plural, we don't need another one
| kortilla wrote:
| Acceptable to some, still not frequent enough though to
| be normalized.
| lolc wrote:
| I don't even notice it anymore.
| itamarst wrote:
| It's been used since the time of Jane Austen (by Jane
| Austen, in fact), it's perfectly normal:
| https://pemberley.com/janeinfo/austheir.html
| [deleted]
| cpach wrote:
| How about "they"...?
| throw_away wrote:
| singular they is the generalization you're looking for
| jws wrote:
| I'm old enough that "they" is not singular, it is a
| grammatical error punishable by red ink and deducted
| points.
| Spivak wrote:
| But the usage of they to refer to a single person is older
| than anyone alive, if that makes you feel better about
| sticking it to your picky grade school teachers.
| jfk13 wrote:
| So is the use of "he" to refer to an individual of
| unspecified gender.
|
| (The OED quotations for sense 2(b) "In anaphoric
| reference to a singular noun or pronoun of undetermined
| gender" go back to at least 1200 AD.)
| jfim wrote:
| You can use "the author" or refer to the article or paper.
|
| [name of paper] mentions that X is faster than Y. The
| author suggests the cause of the speed up is Z, while we
| believe it is W.
| usrnm wrote:
| Just a nitpick: "Alexandra" is the female version of the name
| "Alexander" in Russian, so it's a "she", not "he".
| andi999 wrote:
| And 'Sasha' is the nickname of 'Alexander'.... man, who can think
| this up? This is like calling 'Richard' 'Dick'.
| tucnak wrote:
| Sasha is universally applied to both males and females,
| although to be fair, in Russian, it's culturally much more
| acceptable to call Alexander Sasha in any context
| whatsoever, whereas Sasha as-in female Alexandra is
| reserved for informal communication.
|
| Disclaimer: I speak Russian.
| LudwigNagasena wrote:
| Sasha is an informal version for both genders, I don't
| think there is any difference.
|
| Source: I am Russian
| FpUser wrote:
| Second that. Was born in Russia as well
| whidden wrote:
| As someone who grew up in a former Russian territory that
| speaks no Russian, even I knew that.
| enedil wrote:
| In Polish, Aleksandra (also the female version) is shortened
| to Ola, good luck with that ;)
| eps wrote:
| Sasha is derived from Alexander via its diminutive, but
| obsolete, form - Aleksashka - shortened to Sashka, further
| simplified to Sasha as per the established name format of Masha
| (Maria), Dasha (Daria), Pasha (Pavel, Paul), Glasha
| (Glafira), Natasha (Natalia), etc.
| dmytroi wrote:
| Did some research on the topic of high bandwidth/high IOPS file
| accesses; some of my conclusions could be wrong, but as I
| discovered, modern NVMe drives need to have some queue pressure
| on them to perform at advertised speeds, as at the hardware level
| they are essentially just a separate CPU running in the
| background that has command queue(s). They also need to have
| requests aligned with the flash memory hierarchy to perform at
| advertised speeds. That puts quite a finicky limitation on your
| access patterns: 64-256kb aligned blocks, 8+ accesses in
| parallel. To see that, just try CrystalDiskMark and put queue
| depth at 1-2, and/or block size at something small, like 4kb, and
| see how your random speed plummets.
|
| So given the limitations on the access pattern, if you just mmap
| your file and memcpy the pointer, you'll get ~1 access request in
| flight if I understand right. And as the default page size is
| 4kb, that will be a 4kb request size. And then your mmap relies
| on IRQs to get completion notifications (instead of polling the
| device state), somewhat limiting your IOPS. Sure, prefetching
| will help of course, but it relies on a lot of heuristic
| machinery to get the correct access pattern, which sometimes
| fails.
|
| As 7+GB/s drives and 10+GbE networks become more and more
| mainstream, the main place people will notice these requirements
| is file copying; for example, Windows Explorer struggles to copy
| files at 10-25GbE+ rates simply because of how its file access
| architecture is designed. And hopefully then we will be better
| equipped to reason about "mmap" vs "read" (really should be pread
| here to avoid the offset semaphore in the kernel).
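Picking up the queue-pressure point above (and the io_uring
suggestion earlier in the thread): a minimal liburing sketch that
keeps several reads in flight at once, which is roughly what an NVMe
drive wants to see. The file name, queue depth, and block size are
illustrative values, and error handling is trimmed:

      /* Sketch: keep QD reads in flight with io_uring.
         Assumes liburing is installed; link with -luring. */
      #include <fcntl.h>
      #include <liburing.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define QD  8             /* queue depth: 8+ in parallel */
      #define BLK (256 * 1024)  /* 256kb blocks, as suggested  */

      int main(void)
      {
          struct io_uring ring;
          int fd = open("big.file", O_RDONLY);  /* placeholder */
          if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
              return 1;

          char *bufs[QD];
          for (int i = 0; i < QD; i++) {
              bufs[i] = malloc(BLK);
              struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
              io_uring_prep_read(sqe, fd, bufs[i], BLK,
                                 (off_t)i * BLK);
          }
          io_uring_submit(&ring);  /* all QD reads queued at once */

          for (int i = 0; i < QD; i++) {
              struct io_uring_cqe *cqe;
              io_uring_wait_cqe(&ring, &cqe);
              if (cqe->res < 0)
                  fprintf(stderr, "read failed: %d\n", cqe->res);
              io_uring_cqe_seen(&ring, cqe);
          }
          io_uring_queue_exit(&ring);
          return 0;
      }

A real reader would resubmit a new request as each completion
arrives, keeping the queue full instead of draining it in batches.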
| wtallis wrote:
| Yep, mmap is really bad for performance on modern hardware
| because you can only fault on one page at a time (per thread),
| but SSDs require a high queue depth to deliver the advertised
| throughput. And you can't overcome that limitation by using
| more threads, because then you spend all your time on context
| switches. Hence, io_uring.
| kccqzy wrote:
| Can't you just use MAP_POPULATE, which asks the system to
| populate the entire mapped address range, which is kind of
| like page-faulting on every page simultaneously?
| astrange wrote:
| If you're reading sequentially this shouldn't be a problem
| because the VM system can pick up hints, or you can use
| madvise.
|
| If you're reading randomly this is true and you want some
| kind of async I/O or multiple read operations.
|
| mmap is also dangerous because there's no good way to return
| errors if the I/O fails, like if the file is resized or is on
| an external drive.
| jandrewrogers wrote:
| Even if you use madvise() for a large sequential read, the
| kernel will often restrict its behavior to something
| suboptimal with respect to performance on modern hardware.
| im3w1l wrote:
| If I _read_ with a huge block size, say 100mb, will the OS
| request things in a sane way?
| foota wrote:
| Typically reviews of drives publish rates at different queue
| depths, or at least specify the queue depths tested.
| silvestrov wrote:
| Why don't Intel CPUs implement a modern version of the Z80's LDIR
| instruction (a memmove in a single instruction)?
|
| Then the kernel wouldn't have to save any registers. (I'd really
| like it if she had documented exactly which CPU/system she used
| for benchmarking.)
| beagle3 wrote:
| It's called REP MOVSB (or MOVSW, MOVSD, maybe also MOVSQ?). It
| has existed since the 8086 days; and for reasons I don't know,
| it supposedly works well for big blocks these days (>1K or so)
| but is supposedly slower than register moves for small blocks.
| JoshTriplett wrote:
| > it supposedly works well for big blocks these days (>1K or
| so) but is supposedly slower than register moves for small
| blocks.
|
| On current processors with Fast Short REP MOVSB (FSRM), REP
| MOVSB is the fastest method for all sizes. On processors
| without FSRM, but with ERMS, REP MOVSB is faster for anything
| longer than ~128 bytes.
| beagle3 wrote:
| Thanks! Is there a simple rule of thumb about when one can
| rely on FSRM?
| JoshTriplett wrote:
| You should check the corresponding CPUID bit, but in
| general, Ice Lake and newer.
| sedatk wrote:
| LDIR is slower than unrolling multiple LDI instructions, by the
| way.
| jeffbee wrote:
| Intel CPUs have REP MOVS, which is basically the same thing.
| aleden wrote:
| The boost.interprocess library presents the capability to keep
| data structures (std::list, std::vector, ...) in shared memory
| (i.e. a memory-mapped file) - "offset pointers" are key to that.
| I can think of no other programming language that can pull this
| off with such grace.
| justin_ wrote:
| I'm not sure the conclusion that vector instructions are
| responsible for the speed-up is correct. Both implementations
| seem to use ERMS (using REP MOVSB instructions)[0]. Looking at
| the profiles, the syscall implementation spends time in the [xfs]
| driver (even in the long test), while the mmap implementation
| does not. It appears the real speed-up is related to how memory-
| mapped pages interact with the buffer cache.
|
| I might be misunderstanding things. What is really going on here?
|
| [0] Lines 56 and 180 here:
| http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
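For "check the corresponding CPUID bit": a small sketch using the
GCC/Clang <cpuid.h> helpers. To my understanding, ERMS is reported
in CPUID leaf 7 (subleaf 0) as EBX bit 9 and FSRM as EDX bit 4 of
the same leaf; the bit positions are worth double-checking against
the Intel SDM:

      /* Sketch: detect ERMS and FSRM via CPUID (GCC/Clang
         builtins). Bit positions assumed from the Intel SDM:
         leaf 7 subleaf 0, EBX bit 9 = ERMS, EDX bit 4 = FSRM. */
      #include <cpuid.h>
      #include <stdio.h>

      int main(void)
      {
          unsigned eax, ebx, ecx, edx;

          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
              return 1;  /* leaf 7 not supported */

          printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
          printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
          return 0;
      }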
| petters wrote:
| I thought this strange as well. The author even directly links
| to the source code where REP MOVSB is used.
| pjmlp wrote:
| On the context of Linux.
| [deleted]
| jFriedensreich wrote:
| Made me think about LMMD
| (https://en.m.wikipedia.org/wiki/Lightning_Memory-Mapped_Data...)
| and wonder why mmap didn't seem to have caught on more in
| storage engines
| ricardo81 wrote:
| *LMDB
|
| I use it a bit. The transactional aspect of it requires a bit of
| consideration but generally the performance is good. I'd
| originally used libJudy in a bunch of places for fast lookups
| but the init time for programs was being slowed by having to
| preload everything. Using an mmap/LMDB is a decent middle
| ground.
| jandrewrogers wrote:
| For storage engines that prioritize performance and
| scalability, mmap() is a poor choice. Not only is it slower and
| less scalable than alternatives but it also has many more edge
| cases and behaviors you have to consider. Compared to a good
| O_DIRECT/io_submit storage engine design, which is a common
| alternative, it isn't particularly close. And now we have
| io_uring as an alternative too.
|
| If your use case is quick-and-dirty happy path code then mmap()
| works fine. In more complex and rigorous environments, like
| database engines, mmap() is not well-behaved.
| utopcell wrote:
| Last year we were migrating part of YouTube's serving to a new
| system and we were observing unexplainably high tail latency. It
| was eventually attributed to mlock()ing some mmap()ed files,
| which ended up freezing the whole process for significant amounts
| of time.
|
| Be wary of powerful abstractions.
| AshamedCaptain wrote:
| Claiming "mmap is faster than system calls" is dangerous.
|
| I once worked for a company where they also heard someone say
| "mmap is faster than read/write" and as a consequence rewrote
| their while( read() ) loop into the following monstrosity:
|
| 1. mmap a 4KB chunk of the file
|
| 2. memcpy it into the destination buffer
|
| 3. munmap the 4KB chunk
|
| 4. repeat until eof
|
| This is different from the claim in the article -- the above
| monstrosity is individually mmapping each 4KB block, while I
| presume the article's benchmark is mmapping the entire file in
| memory at once, which makes much more sense.
|
| After I claimed the "monstrosity" was absurdly stupid, someone
| pointed to a benchmark they made and found that the "monstrosity"
| version was actually faster. To me, this made no sense. The
| monstrosity has triple the syscall overhead vs the read()
| version, requires manipulating page tables for every 4KB block,
| and as a consequence had several page faults for each 4KB block
| of the file. Yet it was true: their benchmarks showed the
| monstrosity version to be slightly faster.
|
| The idealist in me couldn't stand this and I reverted this
| change, using for my own (unrelated) experiments a binary which
| used the older, classic read() loop instead of mmap.
|
| Eventually I noticed I was getting results much faster using my
| build on my single-socket Xeon than they were getting on their
| $$$ server farms. Despite what the benchmark said.
|
| Turns out, the "monstrosity" was indeed faster, but if you had
| several of these binaries running concurrently on the same
| machine, they would all slow each other down, as if the kernel
| was having scaling issues with multiple processes constantly
| changing their page tables. The thing would slow down to single-
| core levels of performance.
|
| I still have no idea why the benchmark was apparently slightly
| faster, but obviously they were checking it either isolated or on
| machines where the other processes were running read() loops. I
| guess that by wasting more kernel CPU time on yourself you may
| starve other processes in the system, leaving more user time for
| yourself. But once every process does it, the net result is still
| significantly lowered performance for everyone.
|
| Just yet another anecdote for the experience bag...
| piyh wrote:
| Out of curiosity, what was the use case where they were trying
| to get these marginal gains out of their program?
| jabl wrote:
| On Linux mmap_sem contention is a well-known concurrency
| bottleneck, you may have been hitting that. Multiple efforts
| over the years have failed to fix it, IIRC. I guess one day
| they'll find a good solution, but until then, take care.
| beached_whale wrote:
| mmap is a more ergonomic interface than read, too. How often are
| people copying a file to a local buffer, or the whole file a
| chunk at a time, in order to use the file like an array of
| bytes? mmap gives us a range of bytes right from the start. Even
| if not optimal, the simplicity in usage often means less room
| for bugs.
| alexchamberlain wrote:
| The kernel can, in theory, estimate what page you're going to
| load next, so loading 4KB at a time may not have the page
| faults you'd expect.
| searealist wrote:
| Your anecdote doesn't follow your warning.
|
| Using mmap in an unusual way (to read chunks) on presumably
| legacy hardware doesn't generalize to using it in the obvious
| way (mmap entire files or at least larger windows) on modern
| hardware.
| Mathnerd314 wrote:
| I think the story is to always benchmark first, and also to
| make sure your benchmarks reflect real-world use. What's
| dangerous is assuming something is faster without
| benchmarking.
| searealist wrote:
| I think many people reading that anecdote may come away
| with the idea that mmap is bad (and a monstrosity even) and
| read is good, rather than your interpretation that you
| should benchmark better.
|
| I dislike this kind of muddying the waters and I hope my
| comment provides another perspective for readers.
| anonunivgrad wrote:
| The best place to start is to have a good mental model of how
| things work and why they would be performant or not for a
| particular use case. Otherwise you're just taking shots in
| the dark.
| AshamedCaptain wrote:
| Indeed, my warning is about being cautious when making
| generalized claims.
| searealist wrote:
| If someone claimed running was faster than walking and then
| I told a story about how I once saw someone running in snow
| shoes on the grass and it was slower than walking, then that
| would just be muddying the waters.
| segfaultbuserr wrote:
| I remember a quote, I cannot find the source for now, but
| it basically says "A book can either be completely
| correct, or be readable, but not both."
| pletnes wrote:
| It seems a more apples-to-apples comparison would be to open
| a file, seek(), read() a block, then close() the file. Just
| as bizarre as the repeated mmap, of course.
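For concreteness, the chunk-at-a-time pattern from the anecdote
above might have looked roughly like this -- a hypothetical
reconstruction, not the actual code:

      /* Hypothetical reconstruction of the "monstrosity":
         map, copy, and unmap each 4 KB chunk of the file.
         Three syscalls plus a page fault per chunk, versus
         one read() per chunk in the loop it replaced. */
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/types.h>

      /* copy 'size' bytes from file 'fd' into 'dst', 4 KB at
         a time */
      static int copy_by_mmap_chunks(int fd, char *dst, size_t size)
      {
          const size_t CHUNK = 4096; /* offsets stay page-aligned */

          for (size_t off = 0; off < size; off += CHUNK) {
              size_t len = size - off < CHUNK ? size - off : CHUNK;
              void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE,
                             fd, (off_t)off);
              if (p == MAP_FAILED)
                  return -1;
              memcpy(dst + off, p, len); /* faults the page in   */
              munmap(p, len);            /* page-table churn and */
          }                              /* TLB invalidation     */
          return 0;
      }

The per-chunk munmap() is the part that scales badly across
processes: every unmap may force TLB invalidation, which plausibly
matches the mmap_sem contention and slowdown described above.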
| segfaultbuserr wrote:
| Regardless of how bizarre it is, I've seen this in real
| code in embedded applications before. It's a workaround for
| buggy serial port drivers (flow control or buffering is
| probably broken): You open the port, read/write a line,
| close it, and open it again...
| craftinator wrote:
| Hah, I came here to say pretty much the same thing!
| Recently ran into it, and coding that workaround on a
| resource-constrained system felt absolutely bonkers.
| stanfordkid wrote:
| Isn't the whole point of mmap to randomly access the data
| needed in memory? Did they think memcpy is a totally free
| operation or something, without any side effects?
| labawi wrote:
| Was it perhaps a multi-threaded task? Because that would almost
| definitely crawl.
|
| In general, unmapping is expensive, much more expensive than
| mapping memory, because you need to do a TLB
| shootdown/flush/whatever to make sure a cached version of the
| old mapping is not used. A read/write does a copy, so no need
| to mess with mappings and TLBs, hence it can scale very well.
| CyberDildonics wrote:
| If someone hears "mmap is faster than system calls" and then
| mmaps and munmaps 4KB chunks at a time in a loop, not realizing
| that mmap and munmap are actually system calls and that the
| benefit is not about calling those functions as much as
| possible, there is no saving them.
|
| That's not the fault of a 'dangerous' claim, that's the fault
| of people who go head first into something without taking 20
| minutes to understand what they are doing or 20 minutes to
| profile after.
| 411111111111111 wrote:
| You'd need significantly more time than 20 minutes to form an
| informed opinion on the topic if you don't already know
| basically everything about it.
|
| The only thing you could do in that timespan is read a
| single summary on the topic and hope that it includes all
| relevant information. Which is unlikely, and the reason why
| things are often mistakenly taken out of context.
|
| And as the original comment mentioned: they _did_ benchmark
| and it showed an improvement. They just didn't stress-test
| it, but that's unlikely to be doable within 20 minutes either.
| CyberDildonics wrote:
| In 20 minutes you can read what mmap does and see that you
| can map a file and copy it like memory.
|
| In another 20 minutes you can compile and run a medium-
| sized program.
|
| Neither of those is enough time for someone to go deep into
| something, but you can look up the brand new thing you're
| using and see where your bottlenecks are.
| anonunivgrad wrote:
| Yep, there's no substitute for a qualitative understanding of
| the system.
| tucnak wrote:
| Am I right to assume that Alexandra is well-known in the field?
| I've never heard the name.
| lionsdan wrote:
| https://www.ece.ubc.ca/faculty/alexandra-sasha-fedorova
| stingraycharles wrote:
| Apparently she's a researcher and MongoDB consultant.
|
| https://www.ece.ubc.ca/~sasha/
| [deleted]
| bzb6 wrote:
| It's an ad.
| noncoml wrote:
| The meat of it is:
|
| the functions (for copying data) used for syscall and mmap are
| very different, and not only in the name.
|
| __memmove_avx_unaligned_erms, called in the mmap experiment, is
| implemented using Advanced Vector Extensions (AVX) (here is the
| source code of the functions that it relies on).
|
| The implementation of copy_user_enhanced_fast_string, on the
| other hand, is much more modest. That, in my opinion, is the huge
| reason why mmap is faster. Using wide vector instructions for
| data copying effectively utilizes the memory bandwidth, and
| combined with CPU pre-fetching makes mmap really really fast.
|
| Why can't the kernel implementation use AVX? Well, if it did,
| then it would have to save and restore those registers on each
| system call, and that would make domain crossing even more
| expensive. So this was a conscious decision in the Linux kernel
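To make the comparison in the quoted passage concrete: a sketch of
the two copy paths being contrasted, a plain read() loop versus a
single whole-file mmap() followed by a user-space memcpy(). Buffer
size is an illustrative value and error handling is simplified:

      /* Sketch of the article's two copy paths (callers, file
         setup, and error handling omitted for brevity). */
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Path 1: read() loop -- the copy happens inside the
         kernel on every call. */
      static void copy_with_read(int fd, char *dst, size_t size)
      {
          size_t done = 0;
          while (done < size) {
              ssize_t n = read(fd, dst + done, 8192); /* small buf */
              if (n <= 0) break;
              done += (size_t)n;
          }
      }

      /* Path 2: map the whole file once, then copy in user
         space, where glibc's AVX memcpy can run. */
      static void copy_with_mmap(int fd, char *dst, size_t size)
      {
          void *src = mmap(NULL, size, PROT_READ, MAP_PRIVATE,
                           fd, 0);
          if (src == MAP_FAILED) return;
          memcpy(dst, src, size); /* __memmove_avx_unaligned_erms
                                     territory */
          munmap(src, size);
      }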
| fangyrn wrote:
| I'm a bit of an idiot; when I think of AVX I think of something
| that speeds up computation (specifically matrix stuff), not
| memory access. How wrong am I?
| aliceryhl wrote:
| AVX is useful for both.
| jstimpfle wrote:
| It's a set of SIMD (single instruction, multiple data)
| extensions to the amd64 instruction set. They allow you to
| operate on larger chunks of data with a single instruction -
| for example, do 16 integer multiplications in parallel, etc.
| jeffbee wrote:
| Its registers are just larger. The way x86 moves memory is
| through registers, register-to-register or register-to/from-
| memory. The AVX registers move up to 64 bytes in one move. A
| general purpose register moves at most 8 bytes.
| cwzwarich wrote:
| Wouldn't REP MOVSB be as fast as an AVX memcpy for 4 KB sizes
| on recent Intel CPUs?
| adzm wrote:
| It should be, I think, though it's a complicated question
| whose answer varies on so much: CPU architecture and how it
| is used. There is a great discussion on it here, too.
|
| https://stackoverflow.com/questions/43343231/enhanced-rep-mo...
| justin_ wrote:
| The glibc implementation[0] uses Enhanced REP MOVSB when the
| array is long enough. It takes a few cycles to start up the
| ERMS feature, so it's only used on longer arrays.
|
| Edit: Wait a minute... if this is true, then how can AVX be
| responsible for the speed up? Is it related to the size of
| the buffers being copied internally?
|
| [0] Line 48 here:
| http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...
| JoshTriplett wrote:
| > The glibc implementation[0] uses Enhanced REP MOVSB when
| the array is long enough. It takes a few cycles to start up
| the ERMS feature, so it's only used on longer arrays.
|
| That isn't true anymore either, on sufficiently recent
| processors with "Fast Short REP MOVSB (FSRM)". If the FSRM
| bit is set (which it is on Ice Lake and newer), you can
| just always use REP MOVSB.
| jabl wrote:
| Still waiting for the "Yes, This Time We Really Mean It
| Fast REP MOVSB" (YTTWRMIFRM) bit.
|
| More seriously, if REP MOVSB can be counted on always
| being the fastest method, that's fantastic. One thing that
| easily gets forgotten in microbenchmarking is I$
| pollution by those fancy unrolled SIMD loops with 147
| special cases.
| amluto wrote:
| This is a poor explanation and poor benchmarking. Let's see:
|
| copy_user_enhanced_fast_string uses a CPU feature that
| (supposedly) is very fast. Benchmarking it against AVX could be
| interesting, but it would need actual benchmarking instead of
| handwaving. It's worth noting that using AVX at all carries
| overhead, and it's not always the right choice even if it's
| faster in a tight loop.
|
| Page faults, on x86_64, are much slower than syscalls. KPTI and
| other mitigations erode this difference to some extent. But
| surely the author should have compared the number of page faults
| to the number of syscalls. Perf can do this.
|
| Finally, munmap() is very, very expensive, as is discarding a
| mapped page. This is especially true on x86. Workloads that do a
| lot of munmapping need to be careful, especially in multithreaded
| programs.
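As an illustration of the SIMD copy being discussed: a tiny AVX2
copy loop using compiler intrinsics. This assumes AVX2 hardware and
building with -mavx2, and it ignores the alignment, tail, and
size-threshold handling that glibc's real memcpy does:

      /* Minimal AVX2 copy: 32 bytes per load/store pair.
         Assumes n is a multiple of 32; real memcpy handles
         alignment, tails, and size thresholds. */
      #include <immintrin.h>
      #include <stddef.h>

      static void copy_avx2(void *dst, const void *src, size_t n)
      {
          const char *s = (const char *)src;
          char *d = (char *)dst;

          for (size_t i = 0; i < n; i += 32) {
              __m256i v =
                  _mm256_loadu_si256((const __m256i *)(s + i));
              _mm256_storeu_si256((__m256i *)(d + i), v);
          }
      }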
| bigdict wrote:
| Hold up. Isn't mmap a system call?
| chrisseaton wrote:
| > Hold up. Isn't mmap a system call?
|
| That's not what they mean. You set up a memory map with the
| mmap system call, yes, but that's not the point.
|
| The point is then that you can read and write mapped files by
| reading and writing memory addresses directly - you do not have
| to use a system call to perform each read and write.
| bzb6 wrote:
| So like DMA?
| ndesaulniers wrote:
| DMA is more like a bulk memory transfer operation, usually
| facilitated by specific hardware, that generally is
| asynchronous and requires manual synchronization. Usually
| hardware devices perform DMAs of memory regions, like a
| memcpy() but between physical memories.
|
| Memory mappings established via mmap() more so set up the
| kernel to map in pages when faults on accesses occur. In
| this case you're not calling into the kernel; the TLB is
| generating an interrupt when you go to read an address
| referring to memory that's not yet paged in, which the
| kernel then handles and restores control flow to userspace
| without userspace being any wiser (unless userspace is
| keeping track of time). Handling page faults is faster than
| the syscalls involved in read() calls, it would seem.
| _0ffh wrote:
| I think that comparison would be more confusing than
| helpful.
| [deleted]
| jabl wrote:
| As a word of warning, mmap is fine if the semantics match the
| application.
|
| mmap is not a good idea for a general purpose read()/write()
| replacement, e.g. as advocated in the 1994 "alloc stream
| facility" paper by Krieger et al. I worked with an I/O library
| that followed this strategy, and we had no end of trouble with
| how to robustly deal with resizing files, and also how to do the
| windowing in a good way (this was in the time when we needed to
| care about systems with 32-bit pointers, VM space getting tight,
| but still needed to care about files larger than 2 GB). And then
| we needed the traditional read/write fallback path anyway, in
| order to deal with special files like ttys, pipes etc. In the
| end I ripped out the mmap path, and we saw a perf improvement in
| some benchmark by 300x.
| searealist wrote:
| What year / hardware / kernel version are you talking about?
| jabl wrote:
| Oh uh, IIRC 2004/2005 or thereabouts. Personally I was using
| PC HW running an up-to-date Linux distro, as I guess was the
| vast majority of the userbase, but there was a long tail of
| all kinds of weird and wonderful targets where the software
| was deployed.
| [deleted]
| iforgotpassword wrote:
| Also error handling. read and write can return errors, but what
| happens when you write to an mmapped pointer and the underlying
| file system has some issue? Assigning a value to a variable
| cannot return an error.
|
| So you get a fine SIGBUS to your application and it crashes.
| Just the other day I used imagemagick and it always crashed
| with a SIGBUS, and just when I started googling the issue I
| remembered mmap, noticed that the partition had run out of
| space, freed up some more, and the issue was gone.
|
| So you might want to set up a handler for that signal, but now
| the control flow suddenly jumps to another function if an error
| occurs, and you have to somehow figure out where in your
| program the error occurred, and then what? Then you remember
| that longjmp exists and you end up with a steaming pile of
| garbage code.
|
| Only use mmap if you absolutely must. Don't just "mmap all teh
| files" as it's the new cool thing you learned about.
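A sketch of the SIGBUS dance being described, using
sigsetjmp/siglongjmp. It is shown to illustrate why this gets ugly
rather than as a recommendation -- a real program would need
thread-local jump buffers and care around async-signal-safety:

      /* Sketch: surviving a SIGBUS raised by a failing
         mmap-backed access. Single-threaded only. */
      #include <setjmp.h>
      #include <signal.h>
      #include <stdio.h>

      static sigjmp_buf recover;

      static void on_sigbus(int sig)
      {
          (void)sig;
          siglongjmp(recover, 1); /* jump out of the faulting
                                     access */
      }

      int read_mapped_byte(const volatile char *p, char *out)
      {
          struct sigaction sa;
          sa.sa_handler = on_sigbus;
          sigemptyset(&sa.sa_mask);
          sa.sa_flags = 0;
          sigaction(SIGBUS, &sa, NULL);

          if (sigsetjmp(recover, 1)) {
              /* the "error path" for a plain memory load */
              fprintf(stderr, "I/O error surfaced as SIGBUS\n");
              return -1;
          }
          *out = *p; /* may fault if the backing I/O fails */
          return 0;
      }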
| chrchang523 wrote:
| Yeah, this is the biggest reason I stay the hell away from
| mmap now. Signal handlers are a much worse minefield than
| error handling in any standard file I/O API I've seen.
| klodolph wrote:
| You don't have to longjmp; you can remap the memory and set a
| flag, return from the signal handler, and handle the error
| later, if you like.
| jabl wrote:
| Indeed. The issue with file resizing I mentioned was mostly
| related to error handling (what if another
| process/thread/file descriptor truncates the file, etc.).
| But yes, there are of course other errors as well, like the
| fs running out of space you mention.
| justin66 wrote:
| There's nothing wrong with using a read-only mmap in
| conjunction with another method for writes.
| iforgotpassword wrote:
| You have exactly the same problem on a read error.
| justin66 wrote:
| Not the problem you described in your second paragraph.
| nlitened wrote:
| Is it still the case in 64-bit systems?
| jabl wrote:
| Except for running out of VM space, all the other issues are
| still there. And even if you have (for the time being)
| practically unlimited VM space, you may still not want to
| mmap a file of unbounded size, since setting up all those
| mappings takes quite a lot of time if you're using the
| default 4 kB page size. So you probably want to do some kind
| of windowing anyway. But then if the access pattern is random
| and the file is large, you have to continually shift the
| window (munmap + mmap) and performance goes down the drain.
| So I don't think going to 64-bit systems tilts the balance in
| favor of mmap.
| pocak wrote:
| Linux allocates page tables lazily, and fills them lazily.
| The only upfront work is to mark the virtual address range
| as valid and associated with the file. I'd expect mapping
| giant files to be fast enough to not need windowing.
| jabl wrote:
| Good point, scratch that part of my answer.
|
| There are still some cases where you'd not want unlimited
| VM mapping, but those are getting a bit esoteric and at
| least the most obvious ones are in the process of getting
| fixed.
| Matthias247 wrote:
| The 4-16kB buffer sizes are all rather tiny and inefficient for
| high-throughput use-cases, which makes those results not that
| relevant. Something between 64kB and 1MB seems more applicable.
| aloknnikhil wrote:
| Previous discussion:
| https://news.ycombinator.com/item?id=24842648
| baybal2 wrote:
| There is a third option on the table! Using a DMA controller.
|
| People ask: what's the difference between copying memory by
| CPU or by DMA controller? The difference is exactly that.
|
| You rarely need that, but in some cases you have:
|
| 1. You do very long copies, and want to have full CPU
| performance.
|
| 2. You do very long copies, and want the caches to not be
| flushed during it.
|
| 3. You care about power consumption, as the DMA controller may
| let the CPU core enter low power mode quicker.
|
| 4. On some CPU architectures, you can get wildly spread-out
| pages quicker than with CPU/software.
| beagle3 wrote:
| The DMA needs to cooperate with the MMU, which is on the CPU
| these days (and has been for almost 3 decades now). It's a lot
| of work to set up DMA correctly given physical<->logical memory
| mapping - so it's only worth it if you have a really big chunk
| to copy.
| GeorgeTirebiter wrote:
| This is quite interesting.
This, to me, seems like a systems
| bug. In the Embedded World, it is exceedingly common to use
| DMA for all high-speed transfers -- it's effectively a
| specialized parallel hardware 'MOV' instruction. Also, I have
| never had an occasion on modern PC hw to need mmap; read() and
| lseek() are clean and less complex overall. Maybe I lack
| imagination.
| astrange wrote:
| mmap is being used by libraries under you; it's useful for
| files on internal drives, that won't be deleted, and want
| to be accessed randomly, and you don't want to allocate
| buffers to copy things out of them.
|
| For instance, anytime you call a shared library it's been
| mmapped by dyld/ld.so.
| xoo1 wrote:
| It's all very benchmark-chasing theoretical. In practice
| performance is more complicated, and mmap is this weird corner-
| case, over-engineered, inconsistent optimization thing that
| often wastes or even "leaks" memory which could be used for
| caches that actually matter for performance; it's also awful at
| error handling, and so on. I had to literally patch LevelDB to
| disable mmap on amd64 once, which eliminated OOMs on those
| servers, allowed me to run way more LevelDB instances, and
| overall improved performance so significantly that I had to
| write this comment.
| jstimpfle wrote:
| Yup, I don't like using mmap() for the reason alone that it
| means giving up a lot of control.
| BikiniPrince wrote:
| One mechanism we developed was to build a variant of our
| storage node that could run in isolation. This meant that
| synthetic testing would give us some optimal numbers for
| hardware vetting and performance changes.
|
| I proved quite quickly that our application was quite
| thread-poor and that the cost of fixing it was quite worth it,
| using other synthetic benchmarks to compare what the systems
| were capable of.
|
| I was gone before that was finished, but it was quite an
| improvement. It also allowed cold volumes to exist in an over-
| subscription model.
|
| None of this excuses skipping good real-world telemetry and
| evaluation of your outliers.
| CalChris wrote:
| Neither _mmap()_ nor _read()/write()_ leak memory.
| jstimpfle wrote:
| But they might "leak" it. What the parent meant is that as an
| mmap() user you have no control over how much of a mapping
| takes actual memory while you're visiting random memory-mapped
| file locations. Is that documented somewhere?
| vlovich123 wrote:
| madvise gives you pretty good control over the paging, no?
| Generally I think you can use MADV_DONTNEED to page out
| content if you need to do it more aggressively, no? The
| benefit is that the kernel understands this enough that it
| can evict those page buffers, things it can't do when those
| buffers live in user-space.
| jandrewrogers wrote:
| No, madvise() does not give good control over paging
| behavior. As the syscall name indicates, it is merely a
| suggestion. The kernel is free to ignore it and
| frequently does. This non-determinism makes it nearly
| useless for many types of page scheduling optimizations.
| For some workloads, the kernel consistently makes poor
| paging choices and there is no way to force it to make
| good paging choices. You have much better visibility into
| the I/O behavior of your application than the kernel
| does.
|
| In my experience, at least for database-y workloads, if
| you care enough about paging behavior to use madvise(),
| you might as well just use any number of O_DIRECT
| alternatives that offer deterministic paging control.
It
| is much easier than trying to cajole the kernel into
| doing what you need via madvise().
| rowanG077 wrote:
| That's called a space leak, not a memory leak.
| AnotherGoodName wrote:
| The paging system will only page in what's being used right
| now though, and paging out has zero cost. Old data will
| naturally be paged out. To put it directly, the answer is that
| each mmapped file will need 1 page of physical memory (the
| area currently being read/written). There may be old pages
| left around, since there's no reason for the OS to page
| anything out unless some other application asked for the
| memory. But if they do, mmap will go to 1 page just fine, and
| there's zero cost to paging out.
|
| I feel mmap gets a bad reputation when people look at
| memory usage tools that report total virtual memory
| allocated.
|
| I can mmap 100GB of files, use 0 physical memory, and a
| lot of memory usage tools will report 100GB of memory usage
| of a certain type (virtual memory allocated). You then get
| articles about application X using GBs of memory. Anyone
| trying to correct this is ignored.
|
| Google Chrome is somewhat unfairly hit by this. All those
| articles along the lines of "Why is Google using 4GB with
| no tabs after I viewed some large PDFs?" The answer is that
| it reserved 4GB of 'addresses' that it has mapped to files.
| If another application wants to use that memory there's
| zero cost to paging out those files from memory. The OS is
| designed to do this and it's what mmap is for.
| labawi wrote:
| > paging out has zero cost
|
| Paging out, as in removing a mapping, can be surprisingly
| costly, because you need to invalidate any cached TLB
| entries, possibly even in other CPUs.
|
| > each mmap file will need 1 page of physical memory
|
| Technically, a lower limit would be about 2 or so usable
| pages, because you can't use more than that
| simultaneously. However, unmaps are expensive, so the
| system won't be too eager to page out.
|
| Also, for pages to be accessible, they need to be
| specified in the page table (actually a tree of virtual ->
| physical mappings). A random address may require about
| 1-3 pages for the page table aside from the 1 page of actual
| data (but won't need more page tables for the next MB).
|
| > application X using GB of memory
|
| I think there is a difference between reserved, allocated,
| and file-backed mmapped memory. Address space and file-
| backed mmapped memory are easily paged out; not sure what
| the different types of reserved addresses/memory are, but
| Chrome probably doesn't have lots of mmapped memory that
| can be paged out. If it's modified, then it must be
| swapped; otherwise it's just reserved and possibly
| mapped, but _never_ used memory.
| AnotherGoodName wrote:
| I'd argue the costs of paging out are already accounted
| for by the other process paging in, though. The other
| process that paged in and led to the need to page out had
| already led to the need to change the page table and
| flush the cache.
| labawi wrote:
| Paging in free memory (adding a mapping) is cheap (no
| need to flush). Removing a mapping is expensive (need to
| flush). Also, processes have their own (mostly)
| independent page tables.
|
| I don't think it would be reasonable accounting, when
| paging-in is cheap, but only if there is no need to page
| out (available free memory). Especially when trying to
| argue that paging out is zero-cost.
| silon42 wrote:
| mmap has less deterministic memory pressure and more complex
| interactions with overcommit (if enabled).
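The virtual-vs-resident distinction above can be checked directly
with mincore(), which reports which pages of a mapping are actually
in RAM. A small sketch, with "big.file" as a placeholder:

      /* Sketch: map a file and count how many of its pages are
         actually resident in physical memory. */
      #define _DEFAULT_SOURCE  /* for mincore() on glibc */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      int main(void)
      {
          int fd = open("big.file", O_RDONLY);
          struct stat st;
          if (fd < 0 || fstat(fd, &st) < 0) return 1;

          char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
                         fd, 0);
          if (p == MAP_FAILED) return 1;

          long pagesz = sysconf(_SC_PAGESIZE);
          size_t npages = (st.st_size + pagesz - 1) / pagesz;
          unsigned char *vec = malloc(npages);

          /* Right after mmap(), expect ~0 resident pages: the
             "usage" a tool reports is just reserved addresses. */
          if (vec && mincore(p, st.st_size, vec) == 0) {
              size_t resident = 0;
              for (size_t i = 0; i < npages; i++)
                  resident += vec[i] & 1;
              printf("%zu of %zu pages resident\n", resident,
                     npages);
          }
          return 0;
      }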
| jeffbee wrote:
| LevelDB is kinda like a single-tablet bigtable, but because of
| that its mmap i/o is not a result of battle hardening in
| production systems. bigtable doesn't use local unix i/o for any
| purpose at all, so I'm not surprised to hear that leveldb's
| local i/o subsystem is half baked.
| btown wrote:
| Curious now - were you running an unconventional workload that
| stressed LevelDB, or do you think some version of this advice
| could be applicable to typical workloads?
| einpoklum wrote:
| The author says that in userspace memcpy, AVX is used, but
|
| > The implementation of copy_user_enhanced_fast_string, on the
| other hand, is much more modest.
|
| Why is that? I mean, if you compiled your kernel for a wide range
| of machines, then fine, but if you compiled targeting your actual
| CPU, why would the kernel functions not use AVX?
| lrossi wrote:
| From the mouth of Linus:
|
| https://marc.info/?l=linux-kernel&m=95496636207616&w=2
|
| It's a bit old, but it should still apply. I remember that in
| general he was annoyed when seeing people recommend mmap instead
| of read/write for basic I/O usecases.
|
| In general, it's almost always better to use the specialized API
| (read, write etc.) instead of reinventing the wheel on your own.
| beagle3 wrote:
| LMDB (and its modern fork MDBX), and kdb+/shakti make
| incredibly good use of mmap - I suspect it is possible to get
| similar performance from read(), but probably at 10x the
| implementation complexity.
| nabla9 wrote:
| Yes. If you do lots of sequential/local reads you can reduce
| the number of context switches dramatically if you do something
| like:
|
|     /* to reduce context switches: open with a large
|        stdio buffer so each read() fetches more data */
|     #include <stdio.h>
|
|     static const int bsize = 16 * BUFSIZ;
|
|     FILE *fopen_bsize(const char *filename)
|     {
|         FILE *fp = fopen(filename, "r");
|         if (fp)
|             setvbuf(fp, NULL, _IOFBF, bsize);
|         return fp;
|     }
___________________________________________________________________
(page generated 2021-01-09 23:00 UTC)