[HN Gopher] Are you sure you want to use MMAP in your database management system? [pdf]
       ___________________________________________________________________
        
       Are you sure you want to use MMAP in your database management
       system? [pdf]
        
       Author : Aissen
       Score  : 103 points
       Date   : 2022-01-14 16:03 UTC (6 hours ago)
        
 (HTM) web link (db.cs.cmu.edu)
 (TXT) w3m dump (db.cs.cmu.edu)
        
       | assface wrote:
       | RavenDB's response to this paper:
       | https://ayende.com/blog/196161-C/re-are-you-sure-you-want-to...
        
         | 10000truths wrote:
         | You don't _want_ the OS to take care of reading from disk and
          | page caching/eviction. You want the DB itself to have explicit
         | control over that, because the DB has information on access
         | patterns and table format that the OS is not aware of. It is
         | better equipped than the OS to anticipate what portions of
         | tables/indices need to be cached in memory. It is better
         | equipped to calculate when/where/what/how much to prefetch from
         | disk. It is better equipped to determine when to buffer writes
         | and when to flush to disk.
         | 
         | Sure, it might be more work than using mmap. But it's also more
         | correct, forces you to handle edge cases, and much more
         | amenable to platform-specific improvements a la
         | kqueue/io_uring.
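          | 
          | To make that concrete, a minimal sketch of the explicit
          | alternative (hypothetical page size; plain POSIX pread) --
          | the engine, not the kernel, decides which pages to fetch,
          | cache, and evict:
          | 
          |     #include <stdint.h>
          |     #include <sys/types.h>
          |     #include <unistd.h>
          |     
          |     #define PAGE_SZ 4096  /* hypothetical page size */
          |     
          |     /* Fetch one page into a buffer the engine manages
          |        itself, at a moment the engine chooses. */
          |     static ssize_t read_page(int fd, uint64_t pageno,
          |                              void *buf)
          |     {
          |         return pread(fd, buf, PAGE_SZ,
          |                      (off_t)(pageno * PAGE_SZ));
          |     }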
        
           | matheusmoreira wrote:
           | > the DB has information on access patterns and table format
           | that the OS is not aware of
           | 
           | Aren't system calls such as madvise supposed to allow user
           | space to let the kernel know precisely that information?
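            | 
            | (For reference, a minimal sketch of what such hints look
            | like on an existing mapping:)
            | 
            |     #include <stddef.h>
            |     #include <sys/mman.h>
            |     
            |     /* Advisory hints on a mapping `p` of length `len`. */
            |     static void hint_access(void *p, size_t len)
            |     {
            |         madvise(p, len, MADV_RANDOM);   /* random access */
            |         madvise(p, len, MADV_WILLNEED); /* prefetch      */
            |     }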
        
             | jandrewrogers wrote:
             | The madvise() functions and similar are a blunt and
             | imprecise instrument. The kernel is free to ignore them,
             | and frequently does in practice. It also does not prevent
             | the kernel from proactively doing things you don't want it
             | to do with your buffer pool at the worst possible time.
             | 
             | A user space buffer pool gives you precise and
             | deterministic control of many of these behaviors.
        
               | ayende wrote:
                | That is an interesting statement when discussing
                | what you want to do.
                | 
                | In this case, there is the issue of who the "you"
                | in question is.
                | 
                | For the database in isolation, maybe not ideal.
                | 
                | For a system where the db and app run on the same
                | machine? The OS can make sure they are on friendly
                | terms and not fighting.
                | 
                | Same for trying to SSH into a busy server; the OS
                | can balance things out.
        
             | masklinn wrote:
             | > precisely
             | 
             | Madvise is discussed in the paper, and it notes
             | specifically that:
             | 
             | * madvise is not _precise_
             | 
              | * madvise is... advice, which the system is completely
              | free to disregard
             | 
             | * madvise is error-prone, providing the wrong hint can have
             | dire consequences
        
           | sltkr wrote:
           | The counterargument to this is that the kernel can make
           | decisions based on nonlocal information about the system.
           | 
           | If your database server is the only process in the system
           | that is using significant memory, then sure, you might as
           | well manage it yourself. But if there are multiple processes
           | competing for memory, the kernel is better equipped to decide
           | which processes' pages should be paged out or kept into
           | memory.
        
             | electricshampo1 wrote:
             | Generally for perf critical use cases you dedicate the
             | machine to the database. This simplifies many things
             | (avoiding having to reason about sharing, etc etc).
        
               | munchler wrote:
               | This makes me wonder whether there would be value in an
               | OS that is also a DBMS (or vice versa). In other words,
               | if the DBMS has total control over the hardware, perhaps
               | performance can be maximized without too much additional
               | complexity.
        
               | sedachv wrote:
               | This is a bad idea from the 1960s: IBM TPF, MUMPS, Pick.
               | As soon as the hardware changes it becomes slower and
               | more complicated.
        
           | ayende wrote:
            | I'm the author (well, one of them) of RavenDB.
            | 
            | You are correct to an extent, but there are a few
            | things to note.
           | 
           | * you can design your system so the access pattern that the
           | OS is optimized for matches your needs
           | 
           | * you can use madvise() to give some useful hints
           | 
           | * the amount of complexity you don't have to deal with is
           | staggering
        
             | tytso wrote:
             | OTOH, if you care about that last 5 percent or so of
             | performance there is the complexity that what the OS has
             | optimized for might be different between different OS's
              | (e.g., MacOS, Linux, FreeBSD, etc.) and indeed, might
             | change between different versions of Linux, or even, in the
             | case of buffered writeback, between different _filesystems_
             | on the same version of Linux. This is probably historically
             | one of the most important reasons why enterprise databases
              | like Oracle DB, DB2, etc., have used direct I/O, and not
             | buffered I/O or mmap.
             | 
             | Speaking as an OS developer, we're not going to try to
             | optimize buffered I/O for a particular database. We'll be
              | using benchmarks like compilebench and postmark to optimize
             | our I/O, and if your write patterns, or readahead patterns,
             | or caching requirements, don't match those workloads,
             | well.... sucks to be you.
             | 
             | I'll also point out that those big companies that actually
              | pay the salaries of us file system developers (e.g.,
             | Oracle, Google, etc.) for the most part use Direct I/O for
             | our performance critical workloads. If database companies
             | that want to use mmap want to hire file system developers
             | and contribute benchmarks and performance patches for ext4,
             | xfs, etc., speaking as the ext4 maintainer, I'll welcome
             | that, and we do have a weekly video conference where I'd
             | love to have your engineers join to discuss your
             | contributions. :-)
        
               | ayende wrote:
                | The key from my perspective is that I CAN design my
                | access patterns to match what you've optimized.
               | 
               | Another aspect to remember is that mmap being even
               | possible for databases as the primary mechanism is quite
               | new.
               | 
                | Go back 15 years and you are in 32-bit land, where a
                | process can address at most 4 GiB (in practice 2-3 GiB
                | usable). That rules out mmap as your approach for any
                | database larger than that.
               | 
               | At this point, I might as well skip the OS and go direct
               | IO.
               | 
                | As for differing OS behavior, I generally find that
                | they all roughly optimize for the same things.
               | 
                | I need the best perf on Linux and Windows. On other
                | systems I can get away with just being pretty good.
        
         | tyingq wrote:
         | The really key part seems to be this:
         | 
         |  _" If you aren't using mmap, on the other hand, you still need
         | to handle of all those issues"_
         | 
         | Which seems like a reasonable statement. Is it less work to
         | make your own top-to-bottom buffer pool, and would that
         | necessarily avoid similar issues? Or is it less work to use
         | mmap(), but address the issues?
        
           | bluestreak wrote:
           | Questdb's author here. I do share Ayende's sentiment. There
           | are things that the OP paper doesn't mention, which can help
           | mitigate some of the disadvantages:
           | 
            | - single-threaded calls to 'fallocate' will help avoid
            | sparse files and SIGBUS during memory writes
            | 
            | - over-allocating, caching memory addresses, and
            | minimizing OS calls
            | 
            | - transactional safety can be implemented via a shared
            | memory model
            | 
            | - hugetlb can minimize TLB shootdowns
            | 
            | I personally do not have any regrets about using mmap,
            | because of all the benefits it provides.
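            | 
            | For instance, the fallocate step might look like this
            | (a minimal sketch; the path and length are illustrative):
            | 
            |     #include <fcntl.h>
            |     #include <stddef.h>
            |     #include <sys/mman.h>
            |     #include <sys/types.h>
            |     #include <unistd.h>
            |     
            |     /* Allocate real blocks up front so a write through
            |        the mapping can't hit a hole and raise SIGBUS. */
            |     static void *map_for_write(const char *path, off_t len)
            |     {
            |         int fd = open(path, O_RDWR | O_CREAT, 0644);
            |         if (fd < 0) return NULL;
            |         if (posix_fallocate(fd, 0, len) != 0) {
            |             close(fd);
            |             return NULL;
            |         }
            |         void *p = mmap(NULL, (size_t)len,
            |                        PROT_READ | PROT_WRITE,
            |                        MAP_SHARED, fd, 0);
            |         close(fd);  /* the mapping remains valid */
            |         return p == MAP_FAILED ? NULL : p;
            |     }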
        
           | AdamProut wrote:
           | I suppose. Some problems with mmap() are a bit hard to fix
           | from user land though. You will hit contention on locks
           | inside the kernel (mmap_sem) if the database does concurrent
            | high-throughput mmap()/munmap(). I don't follow Linux
            | kernel development closely enough to know if this has
            | been improved recently, but it was easy to reproduce
            | 4-5 years ago.
        
             | ayende wrote:
              | Almost no one is going to have a lot of map calls.
              | 
              | You map the file once, then fault it in.
        
             | tyingq wrote:
             | That makes sense. I wasn't going right to the conclusion
             | that working around mmap() issues was easier, but it didn't
             | seem to be explored much. Is the contention around having
             | one file mmap()ed, or is it reduced if you use more files?
        
           | dboreham wrote:
           | When I worked on/with BerkeleyDB in the late 90s we came to
           | the conclusion that the various OS mmap() implementations had
           | been tweaked/fixed to the point where they worked for the
           | popular high profile applications (in those days: Oracle). So
           | it can appear like everything is fine, but that probably
           | means your code behaves the same way as <popular database du
           | jour>.
        
             | tytso wrote:
             | Um... Oracle (and other enterprise databases like DB2)
             | don't use mmap. They use Direct I/O. Oracle does have
             | anonymous (non-file-backed) memory which is mmap'ed and
             | shared across various Oracle processes, called the Shared
             | Global Area (SGA), but it's not used for I/O.
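              | 
              | (For the curious: direct I/O means opening the file so
              | transfers bypass the page cache entirely -- a minimal
              | sketch; alignment rules are device- and filesystem-
              | specific:)
              | 
              |     #define _GNU_SOURCE  /* O_DIRECT on Linux */
              |     #include <fcntl.h>
              |     #include <stdlib.h>
              |     #include <unistd.h>
              |     
              |     /* Read the first 4 KiB of a file, bypassing the
              |        kernel page cache. O_DIRECT requires an aligned
              |        buffer and offset. */
              |     static void *read_first_block(const char *path)
              |     {
              |         void *buf = NULL;
              |         int fd = open(path, O_RDONLY | O_DIRECT);
              |         if (fd < 0) return NULL;
              |         if (posix_memalign(&buf, 4096, 4096) != 0) {
              |             close(fd);
              |             return NULL;
              |         }
              |         if (pread(fd, buf, 4096, 0) != 4096) {
              |             free(buf);
              |             buf = NULL;
              |         }
              |         close(fd);
              |         return buf;
              |     }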
        
             | ayende wrote:
             | Yes, isn't that wonderful?
             | 
             | You get to take advantage of literally decades of
             | experience
             | 
             | What is more, if you can match the profile of the
             | optimization, you can benefit even more
        
           | jandrewrogers wrote:
           | Some issues with mmap() can be avoided entirely if you have
           | your own buffer pool. Others are easier to handle because
           | they are made explicit and more buffer state is exposed to
           | the program logic. That's the positive side.
           | 
           | The downside is that writing an excellent buffer pool is not
           | trivial, especially if you haven't done it before. There are
           | many cross-cutting design concerns that have to be accounted
           | for. In my experience, an excellent C++ implementation tends
           | to be on the order of 2,000 lines of code -- someone has to
           | write that. It also isn't simple code, the logic is
           | relatively dense and subtle.
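            | 
            | To give a flavor of the surface area involved, a
            | hypothetical interface for such a pool might look like
            | this (names are invented; the ~2,000 lines live behind
            | these few calls):
            | 
            |     #include <stdbool.h>
            |     #include <stddef.h>
            |     #include <stdint.h>
            |     
            |     typedef struct pool pool_t;
            |     
            |     pool_t *pool_open(const char *path, size_t frames);
            |     
            |     /* Pin a page into a frame, reading it from disk if
            |        absent; pinned frames are never evicted. */
            |     void *pool_pin(pool_t *p, uint64_t pageno);
            |     
            |     /* Drop the pin; `dirty` schedules write-back at a
            |        moment the engine, not the kernel, chooses. */
            |     void pool_unpin(pool_t *p, uint64_t pageno, bool dirty);
            |     
            |     /* Explicit, deterministic counterparts to the
            |        madvise() hints discussed above. */
            |     void pool_prefetch(pool_t *p, uint64_t first, int n);
            |     void pool_discard(pool_t *p, uint64_t first, int n);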
        
         | tptacek wrote:
         | From that article: the whole fsyncgate thing seems like a
         | pretty strong counterargument to "mmap adds more complexity
         | than it removes":
         | 
         | https://danluu.com/fsyncgate/
        
           | ayende wrote:
            | That actually doesn't matter. This is orthogonal to mmap.
        
         | ikawe wrote:
         | > Off the top of my head, most embedded databases implement a
         | single writer model. LMDB, Voron (RavenDB's storage engine),
         | LevelDB, Lucene
         | 
         | And let's not forget sqlite!
         | 
         | > There can only be a single writer at a time to an SQLite
         | database.
         | 
         | (from https://www.sqlite.org/isolation.html)
        
       | moonbug wrote:
       | why settle for errno when you can have a segfault.
        
         | kabdib wrote:
         | Yup. Why use the operating system's async I/O system when you
         | can simply burn a thread and do blocking I/O? </snark>
         | 
         | Been down that primrose path, have the road rash to prove it.
         | mmap() is great until you realize that pretty much all you've
         | avoided is some buffer management that you probably need to do
         | anyway. The OS just doesn't have the information it needs to do
         | a great (or even correct) job of caching database pages.
        
           | wahern wrote:
           | > Why use the operating system's async I/O system when you
           | can simply burn a thread and do blocking I/O? </snark>
           | 
           | mmap isn't non-blocking; page faults are blocking, no
           | different from a read or write to a (non-direct I/O) file
           | using a syscall.
           | 
            | Until recently io_uring literally burned a thread (from a
            | thread pool) for every regular-file read or write operation,
           | too. Though now it finally has hooks into the buffer cache so
           | it can opportunistically perform the operation from the same
           | thread that dequeued the command, pushing it to a worker
           | thread if it would need to wait for a cache fault.[1]
           | 
           | [1] Technically the same behavior could be implemented in
           | user space using userfaultfd, but the latency would likely be
           | higher on faults.
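            | 
            | (For reference, a single read through liburing looks like
            | this -- a minimal sketch with a throwaway ring; a real
            | engine would keep a long-lived ring per thread and batch
            | submissions:)
            | 
            |     #include <liburing.h>
            |     
            |     static int uring_read(int fd, void *buf,
            |                           unsigned len, off_t off)
            |     {
            |         struct io_uring ring;
            |         struct io_uring_cqe *cqe;
            |         int res;
            |     
            |         io_uring_queue_init(8, &ring, 0);
            |         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            |         io_uring_prep_read(sqe, fd, buf, len, off);
            |         io_uring_submit(&ring);
            |         io_uring_wait_cqe(&ring, &cqe);
            |         res = cqe->res;  /* bytes read, or -errno */
            |         io_uring_cqe_seen(&ring, cqe);
            |         io_uring_queue_exit(&ring);
            |         return res;
            |     }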
        
           | randbox wrote:
           | A user process doesn't have the information it needs to do a
           | good job of coordinating updates from multiple writers to
           | database pages and indices. With MMAP, writers have access to
           | shared atomics which they can update using compare-exchange
           | operations to prevent data races which would be common when
           | using read() and write() without locks.
        
             | jstimpfle wrote:
             | Are you saying that without mmap() there will be data
             | races??
        
       | icedchai wrote:
       | I worked at a company that developed its own proprietary database
       | for a financial application. The entire database, several
       | gigabytes, not large by today's standards, was mmap'd and read at
       | startup to warm the page cache. We also built in-memory "indexes"
       | at startup.
       | 
       | This was back in the early 2000's, when having 4 gigabytes of RAM
       | would be considered large. The "database server" was single
       | threaded, and all changes were logged to a WAL-ish file before
       | updating the mmap'd database. It was fun stuff to work on. It
       | worked well, but it wasn't a general purpose DB.
        
       | jasone wrote:
       | Check out the creative use of emoji in the running header!
        
       | jnwatson wrote:
       | I have a great deal of experience in running very large memory-
       | mapped databases using LMDB.
       | 
       | The default Linux settings dealing with memory mapped files are
       | pretty horrible. The observed poor performance is directly
       | related to not configuring several very important kernel
       | parameters.
        
       | rectang wrote:
       | One possible advantage of using mmap over a buffer pool can be
       | programmer ergonomics.
       | 
       | Reading data into a buffer pool in process RAM takes time to warm
       | up, and the pool can only be accessed by a single process. In
       | contrast, for an mmap-backed data structure, assuming that files
        | are static once written (which can be the case for a multi-
        | version concurrency control (MVCC) architecture), you open an
        | mmap read-only connection from any process and, so long as the
        | data is already in the OS cache, you get instant fast reads. This
       | makes managing database connections much easier, since
       | connections are cheap and the programmer can just open as many as
       | they want whenever and wherever they want.
       | 
        | It is true that the cache eviction strategy used by the OS is
        | likely
       | to be suboptimal. So if you're in a position to only run a single
       | database process, you might decide to make different tradeoffs.
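        | 
        | Concretely, a "connection" in this model can be as small as the
        | following (a sketch; it assumes the file is immutable once
        | written):
        | 
        |     #include <fcntl.h>
        |     #include <sys/mman.h>
        |     #include <sys/stat.h>
        |     #include <unistd.h>
        |     
        |     /* Any process may do this, and if the pages are already
        |        in the OS cache the reads are warm from the first
        |        access -- no per-process pool to fill. */
        |     static void *open_snapshot(const char *path, size_t *len)
        |     {
        |         struct stat st;
        |         int fd = open(path, O_RDONLY);
        |         if (fd < 0 || fstat(fd, &st) != 0) return NULL;
        |         *len = (size_t)st.st_size;
        |         void *p = mmap(NULL, *len, PROT_READ,
        |                        MAP_SHARED, fd, 0);
        |         close(fd);  /* the mapping remains valid */
        |         return p == MAP_FAILED ? NULL : p;
        |     }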
        
       | apavlo wrote:
       | This link should be to this page that includes more info and the
       | accompanying video:
       | 
       | https://db.cs.cmu.edu/mmap-cidr2022/
        
         | dsiegel2275 wrote:
          | Thank you for sharing your DB course(s) videos on YouTube.
          | I'm a CMU staff member (Open Learning Initiative) who would
          | never be able to enroll on-site, given my likely lower
          | priority for getting a seat, but watching your videos online
          | has been fantastic.
        
           | lmwnshn wrote:
           | Since you have a CMU ID, AFAIK you should be able to enroll
           | in Piazza in addition to following along with the
           | assignments/projects (if you wanted to).
        
       | jandrewrogers wrote:
       | The pragmatic consideration that usually influences the decision
       | to use mmap() is the large discontinuity in skill and expertise
       | required to replace it. Writing your own alternative to mmap()
       | can be significantly superior in terms of performance and
       | functionality, and often lends itself to a cleaner database
       | architecture. However, this presumes a sufficiently sophisticated
       | design for an mmap() replacement. The learning curve is steep and
       | the critical nuances of sophisticated and practical designs are
       | poorly explored in readily available literature, providing little
       | in the way of "how-to" guides that you can lean on.
       | 
       | As a consequence, early attempts to replace mmap() are often
       | quite poor. You don't know what you don't know, and details of
       | the implementation that are often glossed over turn out to be
       | critical in practice. For example, most people eventually figure
       | out that LRU cache replacement is a bad idea, but many of the
       | academic alternatives cause CPU cache thrashing in real systems,
       | replacing one problem with another. There are clever and non-
       | obvious design elements that can greatly mitigate this but they
       | are treated as implementation details in most discussions of
       | cache replacement and largely not discoverable if you are writing
       | one for the first time.
       | 
       | While mmap() is a mediocre facility for a database, I think we
       | also have to be cognizant that replacing it competently is not a
       | trivial ask for most software engineers. If their learning curve
       | is anything like mine, I went from mmap() to designing obvious
       | alternatives with many poorly handled edge cases, and eventually
        | figuring out how to design non-obvious alternatives that could
        | smoothly handle very diverse workloads. That period of "poor
       | alternatives" in the middle doesn't produce great databases but
       | it almost feels necessary to properly grok the design problem.
       | Most people would rather spend their time working on other parts
       | of a database.
        
         | stingraycharles wrote:
         | When you say "replacing mmap()", could you elaborate a bit on
         | it? The way you write it sounds like you're describing a
         | reimplementation of mmap() with the same API, while I believe
         | the actual goal would be to completely rewrite the persistence
         | and caching layer to be like a "real" database.
        
           | jandrewrogers wrote:
           | The implementation is essentially a complete replacement for
           | the kernel page cache and I/O scheduler, much of the behavior
           | of which is hidden behind the mmap() functions. It is never a
           | drop-in replacement and you wouldn't want it to be but it is
           | functionally quite similar.
           | 
           | For example, while the storage will usually have a linear
           | address space, the "pointer" to that address space won't be a
           | literal pointer even though it may behave much like one.
           | There may be stricter invariants around write back and page
           | fault behavior, and madvise()-like calls have deterministic
           | effects. You often have cheap visibility into details of
           | page/buffer state that you don't get with mmap() that can be
           | used in program logic. And so on. Different but similar.
        
           | lmilcin wrote:
            | The task is deceptively simple.
           | 
           | You have a file and a bunch of memory and you need to make
           | sure data is being moved from file to memory when needed and
           | from memory to file when needed.
           | 
           | mmap() is one algorithm to do it, and the idea is that it is
           | not necessarily the best one.
           | 
           | Knowing more about your data and application and needs should
           | theoretically enable you to design an algorithm that will be
           | more efficient at moving data back and forth.
        
         | dicroce wrote:
         | I have written a couple of mmap() based time series databases.
         | In my case, these were databases for holding video. For my
         | uses, mmap() has been great. I strongly agree with your
         | comment. Maybe mmap() isn't the greatest, but it has worked for
         | me.
        
         | tmountain wrote:
         | The original version of MongoDB used mmap, and I worked at a
         | company that had a ton of issues with cache warmup and the
         | cache getting trashed by competing processes. Granted this was
         | a long time ago, but the main issue was the operating system's
         | willingness to reallocate large swaths of memory from the
         | address space to whatever process was asking for memory right
         | now.
         | 
         | Once the working set got trashed, performance would go through
         | the floor, and our app would slow to a crawl while the cache
         | went through the warmup cycle.
         | 
         | Long story short, with that model, Mongo couldn't "own" the
          | memory it was using, and this led to chronic problems.
         | Wiredtiger fixed this completely, but I still think this is a
         | cautionary tale for anyone considering building a DB without a
         | dedicated memory manager.
        
           | whatshisface wrote:
           | One might say that to rely on your operating system is to
           | rely on your system operators.
        
       | pengaru wrote:
        | Choosing mmap() gets you something that works sooner rather
        | than later.
       | 
        | But then you have a pile of blocking-style synchronous code,
        | likely exploiting problematic assumptions, that you'll have to
        | rewrite when you realize you want something that doesn't just
        | work, but works _well_.
        
       | tptacek wrote:
       | Interesting parallels in this work to Tanenbaum's "RPC Considered
       | Harmful"+; in both cases, you've got an abstraction that papers
       | over a huge amount of complexity, and it ends up burning you
       | because a lot of that complexity turns out to be pretty important
       | and the abstraction has cost you control over it.
       | 
       | +
       | _https://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pd..._
        
         | ncmncm wrote:
         | Yes.
         | 
         | Whenever you need better performance and reliability, identify
         | an abstraction beloved of CS professors, and bypass it.
         | 
         | When I last checked, libtorrent was utterly failing to use
         | O_DIRECT semantics. I started making a patch, but there are
         | several places that do file ops, and the main one was more
         | complicated than I could afford to dive into at the time.
        
       | PaulHoule wrote:
       | Most of the times I used mmap I wasn't happy in the end.
       | 
       | I went through a phase when I thought it was fun to do extreme
       | random access on image files, archives and things like that. At
        | some point I'd think "I want to do this for a file I fetch over
        | the network", and that needs a rewrite.
        
       | randbox wrote:
       | How do you implement lockless atomic updates for multiple writers
       | across multiple threads & processes without mmap?
       | 
        | With mmap it is straightforward for processes to open persistent
       | arrays of atomics as a file, and use compare and exchange
       | operations to prevent data races when multiple threads or
       | processes update the same page without any file locks, advisory
       | locks, or mutexes.
       | 
       | With manual read() and write() calls, the data may be overwritten
       | by another writer before the update is committed.
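        | 
        | A minimal sketch of that pattern with C11 atomics (assumes
        | `slot` points into a MAP_SHARED mapping and is naturally
        | aligned):
        | 
        |     #include <stdatomic.h>
        |     #include <stdint.h>
        |     
        |     /* Retry the CAS until our update wins; no file locks,
        |        advisory locks, or mutexes involved. */
        |     static void bump(_Atomic uint64_t *slot)
        |     {
        |         uint64_t old = atomic_load(slot);
        |         while (!atomic_compare_exchange_weak(
        |                    slot, &old, old + 1))
        |             ;  /* `old` reloads on failure */
        |     }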
        
         | jstimpfle wrote:
         | Why do you need lockless atomic updates to a file-backed memory
         | area? Genuinely curious.
        
       | pizlonator wrote:
       | This is a great write-up!
       | 
       | Makes me wonder if there is an alternative universe in which
       | there is a syscall with semantics similar to mmap that avoids
       | these pitfalls. It's not like mmap's semantics are the only
       | semantics that we could have for memory-mapped IO.
        
         | sprachspiel wrote:
          | This is exactly the kind of innovation we need in computer
          | science. Instead, we often get stuck in local minima (in
          | this case, a 40-year-old POSIX interface) without realizing
          | how much pain this causes.
        
       | dboreham wrote:
       | Do they mean: mmap() ?
        
       | dang wrote:
       | Url changed from https://db.cs.cmu.edu/mmap-cidr2022/, which has
       | the abstract and a link to this video:
       | 
       | https://www.youtube.com/watch?v=1BRGU_AS25c
       | 
       | and this code: https://github.com/viktorleis/mmapbench
        
         | gnabgib wrote:
         | I personally prefer the abstract to jumping straight into a
         | full paper, especially since it's quite rich (not one of those
         | two line entries like some arXiv paper abstracts). After
          | reading the abstract I did end up opening the PDF... but I'm
         | hesitant to pay the PDF tax early. Is this one of those
         | "original source" type decisions?
        
           | dang wrote:
            | Yes. I hear you about the downside, but the downside of
            | the more superficially accessible 'home page' is that
            | people will not read any further, and instead simply
            | respond generically.
        
       ___________________________________________________________________
       (page generated 2022-01-14 23:00 UTC)