[HN Gopher] Are you sure you want to use MMAP in your database m... ___________________________________________________________________ Are you sure you want to use MMAP in your database management system? [pdf] Author : Aissen Score : 103 points Date : 2022-01-14 16:03 UTC (6 hours ago) (HTM) web link (db.cs.cmu.edu) (TXT) w3m dump (db.cs.cmu.edu) | [deleted] | assface wrote: | RavenDB's response to this paper: | https://ayende.com/blog/196161-C/re-are-you-sure-you-want-to... | 10000truths wrote: | You don't _want_ the OS to take care of reading from disk and | page caching/eviction. You want the DB itself to have explicit | control over that, because the DB has information on access | patterns and table format that the OS is not aware of. It is | better equipped than the OS to anticipate what portions of | tables/indices need to be cached in memory. It is better | equipped to calculate when/where/what/how much to prefetch from | disk. It is better equipped to determine when to buffer writes | and when to flush to disk. | | Sure, it might be more work than using mmap. But it's also more | correct, forces you to handle edge cases, and much more | amenable to platform-specific improvements a la | kqueue/io_uring. | matheusmoreira wrote: | > the DB has information on access patterns and table format | that the OS is not aware of | | Aren't system calls such as madvise supposed to allow user | space to let the kernel know precisely that information? | jandrewrogers wrote: | The madvise() functions and similar are a blunt and | imprecise instrument. The kernel is free to ignore them, | and frequently does in practice. It also does not prevent | the kernel from proactively doing things you don't want it | to do with your buffer pool at the worst possible time. | | A user space buffer pool gives you precise and | deterministic control of many of these behaviors.
| ayende wrote: | That is an interesting statement, when discussing what | you want to do | | In this case, there is the issue of who the you in | question is | | For the database in isolation, maybe not ideal | | For a system where db and app run on the same machine? The | OS can make sure you are on friendly terms and not | fighting | | Same for trying to SSH to a busy server and the OS can | balance things out | masklinn wrote: | > precisely | | Madvise is discussed in the paper, and it notes | specifically that: | | * madvise is not _precise_ | | * madvise is... advice, which the system is completely | free to disregard | | * madvise is error-prone: providing the wrong hint can have | dire consequences | sltkr wrote: | The counterargument to this is that the kernel can make | decisions based on nonlocal information about the system. | | If your database server is the only process in the system | that is using significant memory, then sure, you might as | well manage it yourself. But if there are multiple processes | competing for memory, the kernel is better equipped to decide | which processes' pages should be paged out or kept in | memory. | electricshampo1 wrote: | Generally for perf-critical use cases you dedicate the | machine to the database. This simplifies many things | (avoiding having to reason about sharing, etc etc). | munchler wrote: | This makes me wonder whether there would be value in an | OS that is also a DBMS (or vice versa). In other words, | if the DBMS has total control over the hardware, perhaps | performance can be maximized without too much additional | complexity. | sedachv wrote: | This is a bad idea from the 1960s: IBM TPF, MUMPS, Pick. | As soon as the hardware changes it becomes slower and | more complicated. | ayende wrote: | I'm the author (well, one of them) of RavenDB | | You are correct to an extent, but there are a few things to | note.
| | * you can design your system so the access pattern that the | OS is optimized for matches your needs | | * you can use madvise() to give some useful hints | | * the amount of complexity you don't have to deal with is | staggering | tytso wrote: | OTOH, if you care about that last 5 percent or so of | performance there is the complexity that what the OS has | optimized for might be different between different OS's | (e.g., MacOS, Linux, FreeBSD, etc.) and indeed, might | change between different versions of Linux, or even, in the | case of buffered writeback, between different _filesystems_ | on the same version of Linux. This is probably historically | one of the most important reasons why enterprise databases | like Oracle DB, DB2, etc., have used direct I/O, and not | buffered I/O or mmap. | | Speaking as an OS developer, we're not going to try to | optimize buffered I/O for a particular database. We'll be | using benchmarks like compilebench and postmark to optimize | our I/O, and if your write patterns, or readahead patterns, | or caching requirements, don't match those workloads, | well.... sucks to be you. | | I'll also point out that those big companies that actually | pay the salaries of us file system developers (e.g., | Oracle, Google, etc.) for the most part use Direct I/O for | our performance critical workloads. If database companies | that want to use mmap want to hire file system developers | and contribute benchmarks and performance patches for ext4, | xfs, etc., speaking as the ext4 maintainer, I'll welcome | that, and we do have a weekly video conference where I'd | love to have your engineers join to discuss your | contributions. :-) | ayende wrote: | The key from my perspective is that I CAN design my | access patterns to match what you've optimized. | | Another aspect to remember is that mmap being even | possible for databases as the primary mechanism is quite | new. | | Go 15 years ago and you are in 32 bit land.
That rules out | mmap as your approach. | | At this point, I might as well skip the OS and go direct | I/O. | | As for differing OS behavior, I generally find that they all | roughly optimize for the same thing. | | I need best perf on Linux and Windows. On other systems I | can get away with just being pretty good | tyingq wrote: | The really key part seems to be this: | | _"If you aren't using mmap, on the other hand, you still need | to handle of all those issues"_ | | Which seems like a reasonable statement. Is it less work to | make your own top-to-bottom buffer pool, and would that | necessarily avoid similar issues? Or is it less work to use | mmap(), but address the issues? | bluestreak wrote: | Questdb's author here. I do share Ayende's sentiment. There | are things that the OP paper doesn't mention, which can help | mitigate some of the disadvantages: | | - single-threaded calls to 'fallocate' will help avoid | sparse files and SIGBUS during memory writes - | over-allocating, caching memory addresses and minimizing OS calls | - transactional safety can be implemented via a shared-memory | model - hugetlb can minimize TLB shootdowns | | I personally do not have any regrets using mmap because of | all the benefits it provides | AdamProut wrote: | I suppose. Some problems with mmap() are a bit hard to fix | from user land though. You will hit contention on locks | inside the kernel (mmap_sem) if the database does concurrent | high throughput mmap()/munmap(). I don't follow Linux kernel | development closely enough to know if this has been improved | recently, but it was easy to reproduce 4-5 years ago. | ayende wrote: | Almost no one is going to have a lot of map calls. | | You map the file once, then fault it in | tyingq wrote: | That makes sense. I wasn't going right to the conclusion | that working around mmap() issues was easier, but it didn't | seem to be explored much. Is the contention around having | one file mmap()ed, or is it reduced if you use more files?
| dboreham wrote: | When I worked on/with BerkeleyDB in the late 90s we came to | the conclusion that the various OS mmap() implementations had | been tweaked/fixed to the point where they worked for the | popular high-profile applications (in those days: Oracle). So | it can appear like everything is fine, but that probably | means your code behaves the same way as <popular database du | jour>. | tytso wrote: | Um... Oracle (and other enterprise databases like DB2) | don't use mmap. They use Direct I/O. Oracle does have | anonymous (non-file-backed) memory which is mmap'ed and | shared across various Oracle processes, called the Shared | Global Area (SGA), but it's not used for I/O. | ayende wrote: | Yes, isn't that wonderful? | | You get to take advantage of literally decades of | experience. | | What is more, if you can match the profile of the | optimization, you can benefit even more | jandrewrogers wrote: | Some issues with mmap() can be avoided entirely if you have | your own buffer pool. Others are easier to handle because | they are made explicit and more buffer state is exposed to | the program logic. That's the positive side. | | The downside is that writing an excellent buffer pool is not | trivial, especially if you haven't done it before. There are | many cross-cutting design concerns that have to be accounted | for. In my experience, an excellent C++ implementation tends | to be on the order of 2,000 lines of code -- someone has to | write that. It also isn't simple code; the logic is | relatively dense and subtle. | tptacek wrote: | From that article: the whole fsyncgate thing seems like a | pretty strong counterargument to "mmap adds more complexity | than it removes": | | https://danluu.com/fsyncgate/ | ayende wrote: | That actually doesn't matter. This is orthogonal to mmap | ikawe wrote: | > Off the top of my head, most embedded databases implement a | single writer model.
LMDB, Voron (RavenDB's storage engine), | LevelDB, Lucene | | And let's not forget sqlite! | | > There can only be a single writer at a time to an SQLite | database. | | (from https://www.sqlite.org/isolation.html) | moonbug wrote: | why settle for errno when you can have a segfault. | kabdib wrote: | Yup. Why use the operating system's async I/O system when you | can simply burn a thread and do blocking I/O? </snark> | | Been down that primrose path, have the road rash to prove it. | mmap() is great until you realize that pretty much all you've | avoided is some buffer management that you probably need to do | anyway. The OS just doesn't have the information it needs to do | a great (or even correct) job of caching database pages. | wahern wrote: | > Why use the operating system's async I/O system when you | can simply burn a thread and do blocking I/O? </snark> | | mmap isn't non-blocking; page faults are blocking, no | different from a read or write to a (non-direct I/O) file | using a syscall. | | Until recently io_uring literally burned a thread (from a | thread pool) for every read or write regular file operation, | too. Though now it finally has hooks into the buffer cache so | it can opportunistically perform the operation from the same | thread that dequeued the command, pushing it to a worker | thread if it would need to wait for a cache fault.[1] | | [1] Technically the same behavior could be implemented in | user space using userfaultfd, but the latency would likely be | higher on faults. | randbox wrote: | A user process doesn't have the information it needs to do a | good job of coordinating updates from multiple writers to | database pages and indices. With MMAP, writers have access to | shared atomics which they can update using compare-exchange | operations to prevent data races which would be common when | using read() and write() without locks. | jstimpfle wrote: | Are you saying that without mmap() there will be data | races?? 
| randbox wrote: | icedchai wrote: | I worked at a company that developed its own proprietary database | for a financial application. The entire database, several | gigabytes, not large by today's standards, was mmap'd and read at | startup to warm the page cache. We also built in-memory "indexes" | at startup. | | This was back in the early 2000's, when having 4 gigabytes of RAM | would be considered large. The "database server" was single | threaded, and all changes were logged to a WAL-ish file before | updating the mmap'd database. It was fun stuff to work on. It | worked well, but it wasn't a general purpose DB. | jasone wrote: | Check out the creative use of emoji in the running header! | [deleted] | jnwatson wrote: | I have a great deal of experience in running very large | memory-mapped databases using LMDB. | | The default Linux settings dealing with memory mapped files are | pretty horrible. The observed poor performance is directly | related to not configuring several very important kernel | parameters. | rectang wrote: | One possible advantage of using mmap over a buffer pool can be | programmer ergonomics. | | Reading data into a buffer pool in process RAM takes time to warm | up, and the pool can only be accessed by a single process. In | contrast, for an mmap-backed data structure, assuming that files | are static once written (which can be the case for a | multi-version concurrency control (MVCC) architecture), you open an | mmap read-only connection from any process and, so long as the | data is already in the OS cache, you get instant fast reads. This | makes managing database connections much easier, since | connections are cheap and the programmer can just open as many as | they want whenever and wherever they want. | | It is true that the cache eviction strategy used by the OS is likely | to be suboptimal. So if you're in a position to only run a single | database process, you might decide to make different tradeoffs.
| apavlo wrote: | This link should be to this page that includes more info and the | accompanying video: | | https://db.cs.cmu.edu/mmap-cidr2022/ | dsiegel2275 wrote: | Thank you for sharing your DB course(s) videos on YouTube. | I'm a CMU staff member (Open Learning Initiative) who would | never be able to enroll on-site, given my likely lower priority | for getting a seat, but watching your videos online has been | fantastic. | lmwnshn wrote: | Since you have a CMU ID, AFAIK you should be able to enroll | in Piazza in addition to following along with the | assignments/projects (if you wanted to). | jandrewrogers wrote: | The pragmatic consideration that usually influences the decision | to use mmap() is the large discontinuity in skill and expertise | required to replace it. Writing your own alternative to mmap() | can be significantly superior in terms of performance and | functionality, and often lends itself to a cleaner database | architecture. However, this presumes a sufficiently sophisticated | design for an mmap() replacement. The learning curve is steep and | the critical nuances of sophisticated and practical designs are | poorly explored in readily available literature, providing little | in the way of "how-to" guides that you can lean on. | | As a consequence, early attempts to replace mmap() are often | quite poor. You don't know what you don't know, and details of | the implementation that are often glossed over turn out to be | critical in practice. For example, most people eventually figure | out that LRU cache replacement is a bad idea, but many of the | academic alternatives cause CPU cache thrashing in real systems, | replacing one problem with another. There are clever and | non-obvious design elements that can greatly mitigate this but they | are treated as implementation details in most discussions of | cache replacement and largely not discoverable if you are writing | one for the first time.
| | While mmap() is a mediocre facility for a database, I think we | also have to be cognizant that replacing it competently is not a | trivial ask for most software engineers. My own learning curve | went from mmap(), to designing obvious | alternatives with many poorly handled edge cases, to eventually | figuring out how to design non-obvious alternatives that could | smoothly handle very diverse workloads. That period of "poor | alternatives" in the middle doesn't produce great databases but | it almost feels necessary to properly grok the design problem. | Most people would rather spend their time working on other parts | of a database. | stingraycharles wrote: | When you say "replacing mmap()", could you elaborate a bit on | it? The way you write it sounds like you're describing a | reimplementation of mmap() with the same API, while I believe | the actual goal would be to completely rewrite the persistence | and caching layer to be like a "real" database. | jandrewrogers wrote: | The implementation is essentially a complete replacement for | the kernel page cache and I/O scheduler, much of the behavior | of which is hidden behind the mmap() functions. It is never a | drop-in replacement and you wouldn't want it to be, but it is | functionally quite similar. | | For example, while the storage will usually have a linear | address space, the "pointer" to that address space won't be a | literal pointer even though it may behave much like one. | There may be stricter invariants around write-back and page | fault behavior, and madvise()-like calls have deterministic | effects. You often have cheap visibility into details of | page/buffer state that you don't get with mmap() that can be | used in program logic. And so on. Different but similar. | lmilcin wrote: | The task is deceptively simple.
| | You have a file and a bunch of memory and you need to make | sure data is being moved from file to memory when needed and | from memory to file when needed. | | mmap() is one algorithm to do it, and the idea is that it is | not necessarily the best one. | | Knowing more about your data and application and needs should | theoretically enable you to design an algorithm that will be | more efficient at moving data back and forth. | dicroce wrote: | I have written a couple of mmap()-based time series databases. | In my case, these were databases for holding video. For my | uses, mmap() has been great. I strongly agree with your | comment. Maybe mmap() isn't the greatest, but it has worked for | me. | tmountain wrote: | The original version of MongoDB used mmap, and I worked at a | company that had a ton of issues with cache warmup and the | cache getting trashed by competing processes. Granted this was | a long time ago, but the main issue was the operating system's | willingness to reallocate large swaths of memory from the | address space to whatever process was asking for memory right | now. | | Once the working set got trashed, performance would go through | the floor, and our app would slow to a crawl while the cache | went through the warmup cycle. | | Long story short, with that model, Mongo couldn't "own" the | memory it was using, and this led to chronic problems. | WiredTiger fixed this completely, but I still think this is a | cautionary tale for anyone considering building a DB without a | dedicated memory manager. | whatshisface wrote: | One might say that to rely on your operating system is to | rely on your system operators. | pengaru wrote: | Choosing mmap() gets you something that works sooner rather than later. | | But then you have a pile of blocking-style synchronous code | likely exploiting problematic assumptions to rewrite when you | realize you want something that doesn't just work, but works | _well_.
| tptacek wrote: | Interesting parallels in this work to Tanenbaum's "RPC Considered | Harmful"+; in both cases, you've got an abstraction that papers | over a huge amount of complexity, and it ends up burning you | because a lot of that complexity turns out to be pretty important | and the abstraction has cost you control over it. | | + | _https://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pd..._ | ncmncm wrote: | Yes. | | Whenever you need better performance and reliability, identify | an abstraction beloved of CS professors, and bypass it. | | When I last checked, libtorrent was utterly failing to use | O_DIRECT semantics. I started making a patch, but there are | several places that do file ops, and the main one was more | complicated than I could afford to dive into at the time. | PaulHoule wrote: | Most of the times I used mmap, I wasn't happy in the end. | | I went through a phase when I thought it was fun to do extreme | random access on image files, archives and things like that. At | some point I think "I want to do this for a file I fetch over the | network" and that needs a rewrite. | randbox wrote: | How do you implement lockless atomic updates for multiple writers | across multiple threads & processes without mmap? | | With mmap it is straightforward for processes to open persistent | arrays of atomics as a file, and use compare-and-exchange | operations to prevent data races when multiple threads or | processes update the same page without any file locks, advisory | locks, or mutexes. | | With manual read() and write() calls, the data may be overwritten | by another writer before the update is committed. | jstimpfle wrote: | Why do you need lockless atomic updates to a file-backed memory | area? Genuinely curious. | pizlonator wrote: | This is a great write-up! | | Makes me wonder if there is an alternative universe in which | there is a syscall with semantics similar to mmap that avoids | these pitfalls.
It's not like | mmap's semantics are the only | semantics that we could have for memory-mapped IO. | sprachspiel wrote: | This would be exactly the kind of innovation we would need in | computer science. Instead we often get stuck in local minima | (in this case a 40-year-old POSIX interface) without realizing | how much pain this causes. | dboreham wrote: | Do they mean: mmap() ? | dang wrote: | Url changed from https://db.cs.cmu.edu/mmap-cidr2022/, which has | the abstract and a link to this video: | | https://www.youtube.com/watch?v=1BRGU_AS25c | | and this code: https://github.com/viktorleis/mmapbench | gnabgib wrote: | I personally prefer the abstract to jumping straight into a | full paper, especially since it's quite rich (not one of those | two-line entries like some arXiv paper abstracts). After | reading the abstract I did end up opening the PDF... but I'm | hesitant to pay the PDF tax early. Is this one of those | "original source" type decisions? | dang wrote: | Yes. I hear you about the downside, but the downside of the | more superficially accessible 'home page' is that people will | not read any further, and instead simply respond generically. ___________________________________________________________________ (page generated 2022-01-14 23:00 UTC)