[HN Gopher] But how, exactly, do databases use mmap?
___________________________________________________________________

But how, exactly, do databases use mmap?

Author : brunoac
Score  : 168 points
Date   : 2021-01-23 13:06 UTC (9 hours ago)

(HTM) web link (brunocalza.me)
(TXT) w3m dump (brunocalza.me)

| rcgorton wrote:
| Some of the 'sizing' snippets in the example came across as
| disingenuous: if you KNOW the size of the file, mmap it initially
| using that size, without the looping overhead. And you presumably
| know how much memory you have on a given system. The description
| (at least as I read the article) implies bolt is a truly naive
| implementation of a key/value DB.
| perbu wrote:
| The author notices that Bolt doesn't use mmap for writes. The
| reason is surprisingly simple, once you know how it works. Say you
| want to overwrite a page at a location that isn't present in
| memory. You'd write to it and think that is that. But when this
| happens the CPU triggers a page fault, the OS steps in and reads
| the underlying page into memory, and then relinquishes control
| back to the application. The application then continues
| overwriting that page.
|
| So every write to a page that isn't mapped into memory triggers a
| read. Bad.
|
| Early versions of Varnish Cache struggled with this, and it was
| the reason they made a malloc-based backend instead. mmaps are
| great for reads, but you really shouldn't write through them.
| tayo42 wrote:
| Is the trade-off in Varnish worth it? Workloads for a cache should
| be pretty read-heavy; writes should be infrequent unless it's
| being filled for the first time.
| KMag wrote:
| I think the main problem with mmap'd writes is that they're
| blocking and synchronous.
|
| I presume most database record writes are smaller than a page. In
| that case, other methods (unless you're using O_DIRECT, which adds
| its own difficulties) still have the kernel read a whole page into
| the page cache before writing the selected bytes. So, unless
| you're using O_DIRECT for your writes, you're still triggering the
| exact same read-modify-write; it's just that with the file APIs
| you can use async I/O, or use select/poll/epoll/kqueue, etc., to
| keep these necessary reads from blocking your writer thread.
| cperciva wrote:
| There's an even better reason for databases not to write to
| memory-mapped pages: pages get synced out to disk at the kernel's
| leisure. This can be OK for a cache, but it's definitely not what
| you want for a database!
| [deleted]
| eqvinox wrote:
| That's what msync() is for.
| monocasa wrote:
| Right, but the kernel can also sync arbitrary ranges sooner, which
| is awful for consistency.
| reader_mode wrote:
| Shouldn't your write strategy be resilient to that kind of thing
| (e.g. shutdown during a partial update)?
| gmueckl wrote:
| Don't you need exact guarantees on write ordering to achieve that?
| jorangreef wrote:
| Yes, for almost all databases, although there was a cool paper
| from the University of Wisconsin-Madison a few years ago that
| showed how to design something that could work without write
| barriers, and under the assumption that disks don't always fsync
| correctly:
|
| "the No-Order File System (NoFS), a simple, lightweight file
| system that employs a novel technique called backpointer-based
| consistency to provide crash consistency without ordering writes
| as they go to disk"
|
| http://pages.cs.wisc.edu/~vijayc/nofs.htm
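Absent something like NoFS, the only portable way to get the ordering
gmueckl asks about is to put an fsync barrier between dependent
writes. A minimal C sketch; the single-byte commit mark and the
journal layout are illustrative, not taken from any database in the
thread:

    #include <fcntl.h>
    #include <unistd.h>

    /* Append a journal record, then a commit mark, with an fsync
     * barrier between them so the commit mark can never reach disk
     * before the record it covers. */
    int journal_append(int fd, const void *rec, size_t len, off_t off)
    {
        static const char commit = 1;
        if (pwrite(fd, rec, len, off) != (ssize_t)len)
            return -1;
        if (fdatasync(fd) != 0)      /* barrier: the record is durable */
            return -1;
        if (pwrite(fd, &commit, 1, off + (off_t)len) != 1)
            return -1;
        return fdatasync(fd);        /* barrier: the commit mark is durable */
    }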
| vlovich123 wrote:
| Does that generalize to databases? My understanding is that file
| systems are a restricted case of databases that don't necessarily
| support all operations (e.g. transactions are smaller, you can't
| do arbitrary queries within a transaction, etc.).
| bonzini wrote:
| You can do write/sync/write/sync in order to achieve that. It
| would be nicer to have FUA support in system calls (or you can
| open the same file through two descriptors, one with O_SYNC and
| one without).
| dooglius wrote:
| I think you mean mlock
| cperciva wrote:
| If you're tracking what needs to be flushed to disk when, you
| might as well just be making explicit pwrite syscalls.
| cma wrote:
| Isn't there a way around this? When writing to GPU-mapped memory
| in graphics code, people usually take pains to turn off compiler
| optimizations that might XOR memory against itself to zero it out,
| or AND it against 0, and thereby cause a read, and other things
| like that.
|
| https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-...
|
| > Even the following C++ code can read from memory and trigger the
| performance penalty because the code can expand to the following
| x86 assembly code.
|
| C++ code:                *((int*)MappedResource.pData) = 0;
| x86 assembly code:       AND DWORD PTR [EAX],0
|
| > Use the appropriate optimization settings and language
| constructs to help avoid this performance penalty. For example,
| you can avoid the xor optimization by using a volatile pointer or
| by optimizing for code speed instead of code size.
|
| I guess mmapped files may still need a read to know whether to do
| copy-on-write, whereas CPU-visible mapped GPU memory in that case
| is specifically marked upload-only and gets written regardless of
| whether there is a change; but maybe mmap has something similar?
|
| (edit: this seems to say nothing similar is possible with mmap on
| x86: https://stackoverflow.com/questions/31014515/write-only-
| mapp...
|
| but how does it work for GPUs? Something to do with fixed PCIe
| support on the CPU? (base address register,
| https://en.wikipedia.org/wiki/PCI_configuration_space)
| ww520 wrote:
| I believe GPUs solve this by having read-only and write-only
| buffers in the rendering pipeline.
| alaties wrote:
| The answer is that it works pretty similarly, but GPUs usually do
| this in specialized hardware, whereas mmap'ing of files for
| DMA-style access is implemented mostly in software.
|
| https://insujang.github.io/2017-04-27/gpu-architecture-overv...
| has a pretty good visual of what's doing what for GPU DMA. You can
| imagine that much of what happens there is almost pure software
| for mmap'd files.
| monocasa wrote:
| As others have said, you need hardware support to do this the way
| GPUs do it.
|
| That being said, that hardware support exists with NVDIMMs.
| remram wrote:
| You'd need a way to indicate when you start and end overwriting
| the page. You need to avoid the page being swapped out
| mid-overwrite and not read back in. You'd also pay a penalty for
| zeroing it when it gets mapped pre-overwrite. The map primitives
| are just not meant for this.
| rini17 wrote:
| I think on Linux there's a madvise syscall with a "remove" flag,
| which you can issue on memory pages you intend to completely
| overwrite. I have no idea about performance or other practical
| issues.
| icedchai wrote:
| Yes, this can definitely be a problem. I worked on a transaction
| processing system that was entirely based on an in-house
| memory-mapped database. All reads and writes went through mmap. At
| startup, it read through all X gigabytes of data to "ensure"
| everything was hot in memory, and also built the in-memory
| indexes.
|
| This actually worked fine in production, since the systems were
| properly sized and dedicated to this. On dev systems with low
| memory, often running into swap, you'd see crazy delays...
| sometimes a second or two for something that would normally take a
| few milliseconds.
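A sketch of that kind of startup prewarming, assuming Linux; madvise
is only a hint, and the page-touch loop is what actually forces the
data in:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Fault in every page of a read-only mapping so that later
     * accesses hit the page cache instead of blocking on disk. */
    void prewarm(const void *map, size_t len)
    {
        long page = sysconf(_SC_PAGESIZE);
        madvise((void *)map, len, MADV_WILLNEED);  /* ask for readahead */
        volatile uint8_t sink;
        for (size_t off = 0; off < len; off += (size_t)page)
            /* volatile read of one byte per page; the compiler
             * cannot optimize the access away */
            sink = ((const volatile uint8_t *)map)[off];
        (void)sink;
    }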
| ramoz wrote:
| Perhaps a part 2 could dive a bit deeper into OS caching and
| hardware (SSDs, their interfaces, etc.).
| shoo wrote:
| See also: the Sublime HQ blog post about the complexities of
| shipping a desktop application using mmap [1], and the
| corresponding 200+ comment HN thread [2]:
|
| > When we implemented the git portion of Sublime Merge, we chose
| to use mmap for reading git object files. This turned out to be
| considerably more difficult than we had first thought. Using mmap
| in desktop applications has some serious caveats [...]
|
| > you can rewrite your code to not use memory mapping. Instead of
| passing around a long lived pointer into a memory mapped file all
| around the codebase, you can use functions such as pread to copy
| only the portions of the file that you require into memory. This
| is less elegant initially than using mmap, but it avoids all the
| problems you're otherwise going to have.
|
| > Through some quick benchmarks for the way Sublime Merge reads
| git object files, pread was around 2/3 as fast as mmap on linux.
| In hindsight it's difficult to justify using mmap over pread, but
| now the beast has been tamed and there's little reason to change
| any more.
|
| [1] https://www.sublimetext.com/blog/articles/use-mmap-with-care
| [2] https://news.ycombinator.com/item?id=19805675
| minitoar wrote:
| Interana mmaps the heck out of stuff. I've found that relying on
| the file cache works great, though our access patterns are
| admittedly pretty simple.
| 29athrowaway wrote:
| malloc is implemented using mmap.
|
| You map memory manually when you need very low-level control over
| memory.
| jeffbee wrote:
| `malloc` is not one thing. Some mallocs use mmap and others use
| brk. Some implementations use both.
| kevin_thibedeau wrote:
| Some use neither.
| PaulHoule wrote:
| I like mmap and I don't.
|
| It is incompatible with non-blocking I/O, since your process will
| be stopped if it tries to access part of the file that is not
| mapped in; this isn't a syscall blocking (which you might work
| around) but any attempt to access the mapped memory.
|
| I like mmap for tasks like seeking into ZIP files, where you can
| look at the back 1% of the file, then locate and extract one of
| the subfiles. The trouble is that the really fun case is to do
| this over the network with HTTP (say, to resolve Python
| dependencies by extracting the metadata from wheel files), in
| which case this method doesn't work.
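A rough sketch of that ZIP trick: map the whole file, but only ever
touch its tail while hunting for the end-of-central-directory
signature. Error handling is elided; the 0x06054b50 signature and the
65557-byte search window (22-byte minimum record plus a comment of up
to 65535 bytes) come from the ZIP specification:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Find the offset of the ZIP end-of-central-directory record,
     * which lives near the end of the file, without ever faulting
     * in the front of the archive. */
    long find_eocd(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        long n = (long)st.st_size;
        const unsigned char *p =
            mmap(NULL, (size_t)n, PROT_READ, MAP_SHARED, fd, 0);
        static const unsigned char sig[4] =
            { 0x50, 0x4b, 0x05, 0x06 };          /* 0x06054b50 on disk */
        long found = -1;
        for (long i = n - 22; i >= 0 && i >= n - 65557; i--)
            if (memcmp(p + i, sig, 4) == 0) { found = i; break; }
        /* only the last pages of the file were ever faulted in */
        munmap((void *)p, (size_t)n);
        close(fd);
        return found;
    }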
| Sesse__ wrote:
| mmap is great for rapid prototyping. For anything I/O-heavy, it's
| a mess. You have zero control over how large your I/Os are (you're
| very much at the mercy of heuristics that are optimized for
| loading executables), readahead is spotty at best (madvise
| implementations are a mess in practice), async I/O doesn't exist,
| you can't interpose compression in the page cache, there's no way
| of handling errors (I/O error = SIGBUS/SIGSEGV), and write
| ordering is largely inaccessible. You also get issues such as page
| table overhead for very large files, and address space limitations
| on 32-bit systems.
|
| In short, it's a solution that looks enticing at first but rapidly
| costs more than it's worth. As systems grow more complex, they
| almost inevitably have to throw out mmap.
| rapsey wrote:
| Will the process be stopped, or the thread?
| ithkuil wrote:
| The thread.
| codetrotter wrote:
| > The trouble is that the really fun case is to do this over the
| network with HTTP (say, to resolve Python dependencies by
| extracting the metadata from wheel files), in which case this
| method doesn't work
|
| If the web server can tell you the total size of the file by
| responding to a HEAD request, and it supports range requests, then
| it is possible.
|
| https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...
|
| Or am I missing something?
| johndough wrote:
| You are correct; this works. There is even a file system built
| around this idea: https://github.com/fangfufu/httpdirfs
| remram wrote:
| You can't do this with mmap, though; you can't instruct _the OS_
| to fetch pages via HTTP range requests.
| kccqzy wrote:
| Write a FUSE layer.
| amelius wrote:
| > It is incompatible with non-blocking I/O since your process will
| be stopped if it tries to access part of the file that is not
| mapped
|
| Yeah, but the same problem occurs in normal memory when the OS has
| swapped out the page.
|
| So perhaps non-blocking I/O (and cooperative multitasking) is the
| problem here.
| loeg wrote:
| > Yeah, but the same problem occurs in normal memory when the OS
| has swapped out the page.
|
| I'd argue that swapping is an orthogonal problem which can be
| solved in a number of ways: disable swap at the OS level, mlock()
| in the application, maybe others.
|
| mmap is really a bad API for I/O: it hides synchronous I/O and
| doesn't produce useful error statuses on access.
|
| > So perhaps non-blocking I/O (and cooperative multitasking) is
| the problem here.
|
| I'm not sure how non-blocking I/O is "the problem." It's something
| Windows has had forever, and Unix-y platforms have wanted for
| quite a long time. (Hence the long history of poll, epoll, kqueue,
| aio, and now io_uring.)
| amelius wrote:
| > it hides synchronous I/O and doesn't produce useful error
| statuses on access.
|
| You can trap I/O errors if necessary, e.g. they can raise signals
| just like segfaults do.
|
| > I'm not sure how non-blocking I/O is "the problem."
|
| The point is that non-blocking I/O wants to abstract away the
| hardware, but the abstraction is leaky. Most programs which use
| non-blocking I/O actually want to implement multitasking without
| relying on threads. But that turns out to be the wrong approach.
| loeg wrote:
| > The point is that non-blocking I/O wants to abstract away the
| hardware, but the abstraction is leaky.
|
| Why do you say it doesn't match hardware? Basically all hardware
| is asynchronous: submit a request, get a completion interrupt
| whose completion context carries a success or failure status.
| Non-blocking I/O is fundamentally a good fit for hardware; it's
| blocking I/O that is a poor abstraction of it.
|
| > Most programs which use non-blocking I/O actually want to
| implement multitasking without relying on threads. But that turns
| out to be the wrong approach.
|
| Why is that the wrong approach? Approximately every
| high-performance httpd of the last decade or two has used a
| multitasking, non-blocking network I/O model rather than
| thread-per-request; the overhead of threads is just very high.
| They would like to use the same model for non-network I/O, but
| Unix and Unix-alikes have historically not exposed non-blocking
| disk I/O to applications. io_uring is a step towards a unified
| non-blocking I/O interface for applications, and it is also very
| similar to how the operating system itself interacts with most
| high-performance devices (i.e., a bunch of queues).
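A minimal sketch of a single non-blocking disk read through io_uring,
assuming liburing is installed (link with -luring); the file name and
buffer size are illustrative:

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);        /* 8-entry queue */

        int fd = open("data.bin", O_RDONLY);     /* illustrative name */
        static char buf[4096];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
        io_uring_submit(&ring);                  /* submit, don't block */

        /* ... the thread is free to do other work here ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);          /* reap the completion */
        printf("read returned %d\n", cqe->res);  /* bytes read or -errno */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }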
| amelius wrote:
| > Why do you say it doesn't match hardware?
|
| Because the CPU itself can block, in this case on a memory access.
| Most (all?) async software assumes the CPU can't block. A modern
| CPU has a pipelining mechanism whose parts can simply stall,
| waiting for e.g. memory to return. If you want to handle this
| nicely, you have to go through the OS: for example, while waiting
| for your memory page to be loaded, the OS can run another thread
| (which it can't in the async case, because there isn't any other
| thread).
| quotemstr wrote:
| You use mmap whether you want to or not: the system executes your
| program by mmapping your executable and jumping into it! You can
| always take a hard fault at any time, because the kernel is
| allowed to evict your code pages on demand even if you studiously
| avoid mmap for your data files. And it can do this eviction even
| if you have swap turned off.
|
| If you want to guarantee that your program doesn't block, you need
| to use mlockall.
| geofft wrote:
| This is technically true, but the use case we're talking about is
| programs that are much smaller than their data. Postgres, for
| instance, is under 50 MB, but is often used to handle databases in
| the gigabyte or terabyte range. You can mlockall() the binary if
| you want, but you probably can't fit the entire database into RAM
| even if you wanted to.
|
| Also, when processing a large data file (say you're walking a
| B-tree, or even just doing a search on an unindexed field), the
| code you're running tends to be a small loop within the same few
| pages, so it might not even leave the CPU's cache, let alone get
| swapped out of RAM; but you need to access a very large amount of
| data, so it's much more likely that the data you want could be
| swapped out. If you know some things about the data structure
| (e.g., there's an index or lookup table somewhere you care about,
| but you're traversing each node once), you can use that to
| optimize which things are flushed from your cache and which
| aren't.
| jorangreef wrote:
| But that's a different order-of-magnitude problem: control plane
| vs. data plane.
|
| At some point, we could also say that the line fill buffer blocks
| our programs (more often than we realize).
|
| All of this is accurate, but at different scales.
| PaulHoule wrote:
| Also, many systems in 2021 have a lot of RAM and hardly ever swap.
| loeg wrote:
| You're not wrong. Applications and libraries that want to be
| non-blocking should mlock their pages and avoid mmap for further
| data access. ntpd does this, for example.
|
| After application startup, you _can_ avoid _additional_ mmap.
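The mlockall guarantee mentioned above is essentially a one-liner; a
minimal sketch:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Pin everything mapped now, and everything mapped later,
         * into RAM, so neither code nor data pages can be evicted
         * and fault back in. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");  /* needs CAP_IPC_LOCK or a large
                                    RLIMIT_MEMLOCK on Linux */
        /* ... latency-sensitive work ... */
        return 0;
    }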
| amelius wrote:
| This is one area where Rust, a modern systems language, has
| disappointed me. You can't allocate data structures inside
| mmap'ed areas and expect them to work when you load them again
| (i.e., the mmap'ed area's base address might have changed). I hope
| that future languages take this use case into account.
| simias wrote:
| I'm not sure I see the issue. This approach (putting raw binary
| data into files) is filled with footguns. What if you add, remove
| or reorder fields? What if your file was externally modified and
| now doesn't match the expected layout? What if the data contains
| things like file descriptors or pointers that can't meaningfully
| be mapped that way? Even changing the compilation flags can
| produce binary incompatibilities.
|
| I'm not saying that it's not sometimes very useful, but it's
| tricky and low-level enough that some unsafe low-level plumbing
| is, I think, warranted. You have to know what you're doing if you
| decide to go down that route; otherwise you're much better off
| using something like Serde to explicitly handle serialization.
| There's some overhead of course, but 99% of the time it's the
| right thing to do.
| amelius wrote:
| The footguns can be solved in part by the type system (preventing
| certain types from being stored), and (if necessary) by
| cooperation with the OS (e.g. to guarantee that a file is not
| modified between runs).
|
| How else would you lazy-load a database of (say) 32GB into memory
| almost instantly?
|
| And why require everybody to write serialization code when just
| allocating the data inside an mmap'ed file is so much easier? We
| should be focusing on new problems rather than reinventing the
| wheel all the time. Persistence has been an issue in computing
| since the start, and it's about time we put it behind us.
| simias wrote:
| > How else would you lazy-load a database of (say) 32GB into
| memory, almost instantly?
|
| By using an existing database engine that will do it for me. If
| you need to deal with that amount of data and performance is
| really important, you have a lot more to worry about than having
| to use unsafe blocks to map your data structures.
|
| Maybe we just have different experiences and work on different
| types of projects, but I feel like being able to seamlessly dump
| and restore binary data transparently is both very difficult to
| implement reliably and quite niche.
|
| Note in particular that the machine representation is not
| necessarily the most efficient way to store data. For instance,
| any kind of Vec or String in Rust uses 3 usize to store the
| length, capacity and data pointer, which on 64-bit architectures
| is 24 bytes. If you store many small strings and vectors, it adds
| up to a huge amount of waste. Enum variant tags are also 64 bits
| on 64-bit architectures, if I recall correctly.
|
| For instance, I use bincode with Serde to serialize data between
| instances of my application; bincode maps the objects almost 1:1
| to their binary representation. I noticed that by implementing a
| trivial RLE encoding scheme on top of bincode for runs of zeroes,
| I could divide the average message size by a factor of 2 to 3. And
| bincode only encodes the length, not the capacity.
|
| My point being that I'm not sure 32GB of memory-mapped data would
| necessarily load faster than <16GB of lightly serialized data. Of
| course in some cases it might, but that's sort of my point: you
| really need to know what you're doing if you decide to do this.
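A sketch of the kind of trivial zero-run RLE described above. The
encoding here, a 0x00 marker byte followed by a run length, is made
up for illustration and is not bincode's or anyone's actual format:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode runs of zero bytes as {0x00, run_length}; all other
     * bytes are copied through. The output buffer must hold the
     * worst case of 2x the input (alternating single zeros). */
    size_t rle_zeros(const uint8_t *in, size_t n, uint8_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            if (in[i] == 0) {
                size_t run = 0;
                while (i < n && in[i] == 0 && run < 255) { i++; run++; }
                out[o++] = 0x00;
                out[o++] = (uint8_t)run;  /* run length, capped at 255 */
            } else {
                out[o++] = in[i++];       /* literal non-zero byte */
            }
        }
        return o;
    }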
| burntsushi wrote:
| > How else would you lazy-load a database of (say) 32GB into
| memory, almost instantly?
|
| That's what the fst crate[1] does. It's likely working at a lower
| level of abstraction than you intend, but the point is that it
| works, is portable, and doesn't require any cooperation from the
| OS other than the ability to memory-map files. My imdb-rename
| tool[2] uses this technique to build an on-disk database for
| instantaneous searching. And then there is the regex-automata
| crate[3], which permits deserializing a regex instantaneously from
| any kind of slice of bytes.[4]
|
| I think you should maybe provide some examples of what you're
| suggesting to make it more concrete.
|
| [1] - https://crates.io/crates/fst
|
| [2] - https://github.com/BurntSushi/imdb-rename
|
| [3] - https://crates.io/crates/regex-automata
|
| [4] - https://docs.rs/regex-automata/0.1.9/regex_automata/#example...
| geofft wrote:
| I had a use case recently for serializing C data structures in
| Rust (i.e., being compatible with an existing protocol defined as
| "compile this C header, and send the structs down a UNIX socket"),
| and I was a little surprised that the straightforward way to do it
| is to unsafely cast a #[repr(C)] structure to a byte slice, and
| that there isn't a Serde serializer for C layouts. (Which would
| even let you serialize C layouts for a different platform!)
|
| I think you could also do something Serde-ish that handles the
| original use case, where you can derive something on a structure
| as long as it contains only plain data types (no pointers) or
| nested such structures. Then it would be safe to "serialize" and
| "deserialize" the structure by just translating it into memory
| (via either mmap or direct reads/writes), without going through a
| copy step.
|
| The other complication here is multiple readers: you might want
| your accessor functions to be atomic operations, and you might
| want to figure out some way for multiple processes accessing the
| same file to coordinate ordering updates.
|
| I kind of wonder what Rust's capnproto and Arrow bindings do,
| now....
| burntsushi wrote:
| It's likely that the "safe transmute" working group[1] will help
| facilitate this sort of thing. They have an RFC[2]. See also the
| bytemuck[3] and zerocopy[4] crates, which predate the RFC; at
| least the latter has 'derive' functionality.
|
| [1] - https://github.com/rust-lang/project-safe-transmute
|
| [2] - https://github.com/jswrenn/project-safe-transmute/blob/rfc/r...
|
| [3] - https://docs.rs/bytemuck/1.5.0/bytemuck/
|
| [4] - https://docs.rs/zerocopy/0.3.0/zerocopy/index.html
| comonoid wrote:
| Yes, you can.
|
| You cannot with the standard data structures, but you can with
| your own custom ones.
|
| It's all about trade-offs, anyway; there is no magic bullet.
| the8472 wrote:
| Work on custom allocators is underway; some of the std data
| structures already support them on nightly.
|
| https://github.com/rust-lang/wg-allocators/issues/7
| remram wrote:
| What about Rust makes this more difficult than doing the same
| thing in C++?
| quotemstr wrote:
| You can't do that in C++ or any language. You need to do your own
| relocations and remember enough information to do them. You can't
| count on any particular virtual address being available on a
| modern system, not if you want to take advantage of ASLR.
|
| The trouble is that we have to mark relocated pages dirty, because
| the kernel isn't smart enough to understand that it could demand-
| fault and relocate on its own. Well, either that, or do the
| relocation anew on each access.
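One way around those relocations is to never store absolute pointers
inside the mapped region at all, only offsets from the base of the
mapping, so the file is position-independent by construction. A
minimal C sketch (this is essentially what the offset_ptr approach
discussed below automates):

    #include <stddef.h>
    #include <stdint.h>

    /* Nodes inside the mapped file refer to each other by offset
     * from the start of the mapping, so the file stays valid at
     * whatever base address mmap returns on a given run. */
    struct node {
        uint64_t next_off;  /* 0 means "null"; else offset of next node */
        int32_t  value;
    };

    static inline struct node *deref(void *base, uint64_t off)
    {
        return off ? (struct node *)((char *)base + off) : NULL;
    }

    /* Example traversal over a list stored in the mapping. */
    static int64_t sum_list(void *base, uint64_t head_off)
    {
        int64_t sum = 0;
        for (struct node *n = deref(base, head_off); n;
             n = deref(base, n->next_off))
            sum += n->value;
        return sum;
    }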
| whimsicalism wrote:
| I don't see what the issue in doing this is in C++.
|
| The only thing that'll break will be the pointers and references
| to things outside of the mmap'd area.
| simias wrote:
| By that logic you can do it in unsafe Rust as well. Obviously, in
| safe Rust, having potentially dangling "pointers and references to
| things outside of the mmap'd area" is a big no-no.
|
| And note that even intra-area pointers would have to be adjusted
| if the base address changes, unless you go through the trouble of
| only storing relative offsets to begin with; but the performance
| overhead of that might be significant.
| Hello71 wrote:
| libsigsegv (slow) or userfaultfd (less slow) can be used for this
| purpose.
| secondcoming wrote:
| It works in C++ if you use boost::interprocess. Its data
| structures use offset_ptr internally rather than assuming every
| pointer points into the heap.
| quotemstr wrote:
| Sure. But that counts as "doing your own relocations". Unsafe Rust
| could do the same, yes?
| whimsicalism wrote:
| What is being relocated?
| ithkuil wrote:
| If you use offsets instead of pointers, you're doing relocations
| "on the fly".
| secondcoming wrote:
| I don't know enough about Rust to say. If it doesn't have the
| concept of a 'fancy pointer' then I assume no; you'd have to
| essentially reproduce what boost::interprocess does.
| amelius wrote:
| That introduces different data types, rather than using the
| existing ones (instantiated with different pointer types).
| secondcoming wrote:
| Indeed. I don't know if there's a plan for the standard types to
| move to offset pointers, or if there's even a std::offset_ptr, but
| it would be great if there was.
|
| For us, some of the 'different data type' pain was alleviated with
| transparent comparators. YMMV.
|
| Edit: It seems C++11 added some form of support for this... 'fancy
| pointers':
|
| https://en.cppreference.com/w/cpp/named_req/Allocator#Fancy_...
| jnwatson wrote:
| There's no placement new in Rust? That's disappointing.
| steveklabnik wrote:
| Not in stable yet, no. It's desired, but has taken a while to
| design, as there have been higher-priority things for a while.
| We'll get there!
| turminal wrote:
| This is impossible without significant performance impact. No
| language can change that.
|
| Edit: except, theoretically, for data structures with certain
| characteristics known in advance.
| amelius wrote:
| Well, one approach is to parameterize your data types such that
| they are fast in the usual case, but become perhaps slightly
| slower (though still on par with hand-written code) in the more
| versatile case.
| waynesonfire wrote:
| Thanks for diving into this DB! I find it interesting that many
| databases share such similar architectural principles. NIH: it's
| super fun to build a database, so why not?
|
| Also, don't beat yourself up over how deep you dive into the
| design. Why apologize for it? Those who want a deeper exposition
| would quickly move on.
| rossmohax wrote:
| mmap is not as free as people think; the VM subsystem is full of
| inefficient locks. Here is a very good writeup of a problem the
| BBC encountered with Varnish:
| https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-4...
| jeffbee wrote:
| Apparently in a way that the author of the article, and probably
| the authors of bolt, do not really understand.
| bonzini wrote:
| The right answer is that they shouldn't. A database has much more
| information than the operating system about what, how, and when to
| cache, so it should handle its own I/O caching, using O_DIRECT on
| Linux or the equivalent on Windows or other Unixes.
|
| The article at https://www.scylladb.com/2017/10/05/io-access-
| methods-scylla... is a bit old (2017), but it explains the
| trade-offs.
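A sketch of the O_DIRECT discipline described above, assuming Linux
and a 4096-byte logical sector size; a real system should query the
alignment from the device (see the sketch further down) rather than
hard-coding it, and the file name is illustrative:

    #define _GNU_SOURCE           /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Bypass the page cache: the buffer, offset and length must
         * all be aligned to the logical sector size. */
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned length, offset */
        /* ... the database, not the kernel, now decides what stays
         * cached in memory ... */
        free(buf);
        close(fd);
        return (n == 4096) ? 0 : 1;
    }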
| quotemstr wrote:
| Yep. Every mature, high-performing, non-embedded database evolves
| towards getting the underlying operating system out of the way as
| much as possible.
| natmaka wrote:
| > A database has much more information than the operating system
| about what, how and when to cache information
|
| Yes, on a dedicated server. However, many DB engine instances run
| on non-dedicated servers, for example alongside a web server
| flanked by various processes that read the local filesystem or use
| RAM (Varnish, memcached...), plus frequently-run tasks (tempfile
| purges, log aggregation, monitoring probes, an MTA...). In such a
| case, letting the DB engine use too much RAM limits the buffer
| cache size and may therefore (all other things being equal) imply
| more 'read' operations, reducing overall performance.
| sradman wrote:
| Great point. The RDBMS page cache size is a key performance
| parameter that is near impossible to get right on a mixed-use
| host, whether a non-dedicated server or a client desktop/laptop.
| SQL Anywhere, which emphasizes zero administration and whose
| bread and butter is/was exactly this mixed-use case, has long
| supported _Dynamic Cache Sizing_ [1] specifically for it. I don't
| know if any other RDBMSes do the same (MS SQL?).
|
| As a side note, Apache Arrow's main use case is similar: a
| column-oriented data store shared by one or more client processes
| (Python, R, Julia, Matlab, etc.) on the same general-purpose host.
| This is also now a key distinction between the Apple M1, with its
| big.LITTLE ARM SoC, and the Amazon Graviton, built for server-side
| virtualized/containerized instances. We should not conflate the
| two use cases, and should understand that the best solution for
| one may not be the best for the other.
|
| [1] http://dcx.sybase.com/1200/en/dbusage/perform-bridgehead-405...
| jorangreef wrote:
| Yes, and it's not only about performance but also safety, because
| O_DIRECT is the only safe way to recover from the journal after an
| fsync failure (when the page cache can no longer be trusted by the
| database to be coherent with the disk):
| https://www.usenix.org/system/files/atc20-rebello.pdf
|
| From a safety perspective, O_DIRECT is now table stakes. There's
| simply no control over the granularity of read/write EIO errors
| when your syscalls only touch memory and you have no visibility
| into background flush errors.
| formerly_proven wrote:
| Around four years ago I was working on a transactional data store
| and ran into these issues; virtually no one tells you how durable
| I/O is supposed to work. There were very few articles on the
| internet that went beyond the basics (e.g. create file => fsync
| directory), and perhaps one article explaining what needs to be
| considered when using sync_file_range. The docs and POSIX were
| useless. I noticed that there seemed to be inherent problems with
| I/O error handling when using the page cache, i.e. whenever
| something other than the app itself caused write I/O, you really
| didn't know any more whether all the data got there.
|
| Some two years later fsyncgate happened, and since then I/O error
| handling on Linux has finally gotten at least some attention;
| people seem to have woken up to the fact that this is a genuinely
| hard thing to do.
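The "create file => fsync directory" step mentioned above, sketched
as the classic write-temp-then-rename pattern (paths and the helper
name are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Durably replace dst with new contents: the data reaches disk
     * before the rename, and the rename itself is made durable by
     * fsyncing the containing directory. Error paths leak fds; this
     * is a sketch. */
    int durable_replace(const char *dir, const char *tmp,
                        const char *dst, const void *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, data, len) != (ssize_t)len ||
            fsync(fd) != 0)
            return -1;             /* 1. file contents are durable   */
        close(fd);
        if (rename(tmp, dst) != 0)
            return -1;             /* 2. atomically swap into place  */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        int rc = fsync(dfd);       /* 3. directory entry is durable  */
        close(dfd);
        return rc;
    }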
| sradman wrote:
| O_DIRECT prevents the file being double-buffered by the OS and the
| DBMS page cache. mmap removes the need for the DBMS page cache
| entirely and relies on the OS's paging algorithm; the gain is zero
| memory copies and the ability for multiple processes to access the
| same data efficiently.
|
| Apache Arrow takes advantage of mmap to share data across
| processes in different languages, and to enable fast startup for
| short-lived processes that re-access the same OS-cached data.
| geofft wrote:
| Yes, but the claim is that the buffer you should remove is the
| OS's, not the DBMS's, because for the DBMS use case (one very
| large file with deep internal structure, generally accessed by one
| long-running process), the DBMS has information the OS doesn't.
|
| Arrow is a different use case, for which mmap makes sense. For
| something like a short-lived process that stores config or caches
| in SQLite, it probably is actually closer to Arrow than to (e.g.)
| Postgres, so mmap likely also makes sense there. (Conversely, if
| you're not relying on Arrow's sharing properties and you have a
| big Python notebook doing some math on an extremely large data
| file on disk in a single process, you might actually get better
| results from O_DIRECT than mmap.)
|
| In particular, "zero memory copy" only applies if you are
| accessing the same data from multiple processes (either at once or
| sequentially). If you have a single long-running database server,
| you have to copy the data from disk to RAM _anyway_. O_DIRECT
| means there's one copy, from disk to a userspace buffer; mmap
| means there's one copy, from disk to a kernel buffer. If you can
| arrange for a long-lived userspace buffer, there's no performance
| advantage to using the kernel buffer.
| sradman wrote:
| > but the claim is that the buffer you should remove is the OS's
|
| I was not trying to minimize O_DIRECT; I was trying to emphasize
| its key advantage succinctly, and also to explain the Apache Arrow
| use case for mmap, which the article does not discuss.
| masklinn wrote:
| > Therefore the database should handle its own I/O caching using
| O_DIRECT on Linux or the equivalent on Windows or other Unixes.
|
| That's not wrong, but at the same time it adds complexity and
| requires effort which can't be spent elsewhere, unless you've got
| someone who really only wants to do DIO and wouldn't work on
| anything else anyway.
|
| Postgres has never used DIO, and while there have been rumblings
| about moving to it (especially following the fsync mess), as
| Andres Freund noted:
|
| > efficient DIO usage is a metric ton of work, and you need a
| large amount of differing logic for different platforms. It's just
| not realistic to do so for every platform. Postgres is developed
| by a small number of people, isn't VC backed etc. The amount of
| resources we can throw at something is fairly limited. I'm hoping
| to work on adding linux DIO support to pg, but I'm sure as hell
| not going to do be able to do the same on windows (solaris, hpux,
| aix, ...) etc.
| jorangreef wrote:
| I have found that planning for DIO from the start makes for a
| better, simpler design when building storage systems, because it
| keeps the focus on logical/physical sector alignment, latent
| sector error handling, and caching from the beginning. Better
| still is to design the data layout to work with block devices
| directly.
|
| Retrofitting DIO onto a non-DIO design, and doing it
| cross-platform, is going to be more work, but I don't think that's
| the fault of DIO (when you're already building a database, that
| is).
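The sector alignment mentioned above can be queried per device on
Linux; a small sketch (the device path is illustrative and requires
read permission on the device):

    #include <fcntl.h>
    #include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(void)
    {
        /* Query the logical and physical sector sizes that direct
         * I/O alignment must respect. */
        int fd = open("/dev/sda", O_RDONLY);      /* illustrative */
        int logical = 0;
        unsigned int physical = 0;
        ioctl(fd, BLKSSZGET, &logical);   /* unit for O_DIRECT alignment */
        ioctl(fd, BLKPBSZGET, &physical); /* unit that avoids RMW in the drive */
        printf("logical=%d physical=%u\n", logical, physical);
        return 0;
    }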
| jandrewrogers wrote:
| PostgreSQL has two main challenges with direct I/O. The basic one
| is that it adversely impacts portability, as mentioned, and is
| complicated to implement because file system behavior under direct
| I/O is not always consistent.
|
| The bigger challenge is that PostgreSQL is not architected like a
| database engine designed to use direct I/O effectively. Adding
| even the most rudimentary support will be a massive code change
| and implementation effort, and the end result won't be comparable
| to what you would expect from a modern database kernel designed
| around direct I/O. This raises questions about the return on
| investment.
| api wrote:
| You can also mount a file system in synchronous mode on most OSes,
| which may make sense for a DB storage volume (but not for other
| parts of the system).
| jnwatson wrote:
| In theory that's true. In practice, using the highly optimized,
| already-in-kernel-mode page cache can produce tremendous
| performance. LMDB, for example, is screaming fast, and doesn't use
| DIO.
| the8472 wrote:
| There was a patch set (introducing the RWF_UNCACHED flag) to get
| buffered I/O with most of the benefits of O_DIRECT and without its
| drawbacks, but it looks like it hasn't landed.
|
| There are also newer options for giving the kernel better page
| cache hints, via the MADV_COLD and MADV_PAGEOUT flags. Those did
| land.
| nullsense wrote:
| I think that, of the major database vendors, only Postgres uses
| mmap; everyone else does their own I/O cache management.
___________________________________________________________________
(page generated 2021-01-23 23:00 UTC)