[HN Gopher] But how, exactly, do databases use mmap?
       ___________________________________________________________________
        
       But how, exactly, do databases use mmap?
        
       Author : brunoac
       Score  : 168 points
       Date   : 2021-01-23 13:06 UTC (9 hours ago)
        
 (HTM) web link (brunocalza.me)
 (TXT) w3m dump (brunocalza.me)
        
       | rcgorton wrote:
        | I found that some of the 'sizing' snippets in the example came
        | across as disingenuous: if you KNOW the size of the file, mmap it
       | initially using that without the looping overhead. And you
       | presumably know how much memory you have on a given system. The
        | description (at least as I read the article) implies bolt is
        | a truly naive implementation of a key/value DB.
        
       | perbu wrote:
       | The author notices that Bolt doesn't use mmap for writes. The
       | reason is surprisingly simple, once you know how it works. Say
        | you want to overwrite a page at some location that isn't present
       | in memory. You'd write to it and you'd think that is that. But
       | when this happens the CPU triggers a page fault, the OS steps in
       | and reads the underlying page into memory. It then relinquishes
       | control back to the application. The application then continues
       | to overwrite that page.
       | 
       | So for each write that isn't mapped into memory you'll trigger a
       | read. Bad.
       | 
       | Early versions of Varnish Cache struggled with this and this was
       | the reason they made a malloc-based backend instead. mmaps are
       | great for reads, but you really shouldn't write through them.
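        | 
        | A minimal sketch of the failure mode (Linux/POSIX; untested,
        | error handling omitted):
        | 
        |     #include <fcntl.h>
        |     #include <sys/mman.h>
        |     #include <unistd.h>
        | 
        |     int main(void) {
        |         int fd = open("data.db", O_RDWR);
        |         off_t len = lseek(fd, 0, SEEK_END);
        |         char *p = mmap(NULL, (size_t)len,
        |                        PROT_READ | PROT_WRITE,
        |                        MAP_SHARED, fd, 0);
        |         /* This store faults: the kernel must READ the
        |          * whole page from disk before the one-byte
        |          * write can land. */
        |         p[len / 2] = 'x';
        |         munmap(p, (size_t)len);
        |         return close(fd);
        |     }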
        
         | tayo42 wrote:
          | Is the trade-off in Varnish worth it? Workloads for a cache
          | should be pretty read-heavy; writes should be infrequent
          | unless it's being filled for the first time.
        
         | KMag wrote:
         | I think the main problem with mmap'd writes is that they're
         | blocking and synchronous.
         | 
         | I presume most database record writes are smaller than a page.
         | In that case, other methods (unless you're using O_DIRECT,
          | which adds its own difficulties) still have the kernel read a
         | whole page of memory into the page cache before writing the
         | selected bytes. So, unless you're using O_DIRECT for your
         | writes, you're still triggering the exact same read-modify-
         | write, it's just that with the file APIs you can use async I/O
          | or use select/poll/epoll/kqueue, etc. to keep these necessary
          | reads from blocking your writer thread.
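          | 
          | Roughly the same sub-page write via the file API (sketch;
          | without O_DIRECT the kernel still does the page-sized read
          | underneath, but with a file descriptor you can also switch
          | to aio/io_uring instead of taking a synchronous fault):
          | 
          |     #include <unistd.h>
          | 
          |     /* The kernel reads the surrounding page into the
          |      * page cache, patches in these bytes, and writes
          |      * the page back later. */
          |     int write_record(int fd, const void *rec, size_t n,
          |                      off_t where) {
          |         return pwrite(fd, rec, n, where) == (ssize_t)n
          |                    ? 0 : -1;
          |     }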
        
         | cperciva wrote:
         | There's an even better reason for databases to not write to
          | memory mapped pages: pages get synced out to disk at the
          | kernel's leisure. This can be OK for a cache but it's
         | definitely not what you want for a database!
        
           | [deleted]
        
           | eqvinox wrote:
           | That's what msync() is for.
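            | 
            | E.g. (sketch; msync wants a page-aligned address):
            | 
            |     #include <sys/mman.h>
            | 
            |     /* Flush one dirty range of a MAP_SHARED mapping
            |      * now, not at the kernel's leisure. */
            |     int flush(char *map, size_t off, size_t len,
            |               size_t pagesz) {
            |         size_t a = off & ~(pagesz - 1); /* align down */
            |         return msync(map + a, len + (off - a),
            |                      MS_SYNC);
            |     }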
        
             | monocasa wrote:
             | Right, but it can sync arbitrary ranges sooner, which is
             | also awful for consistency.
        
               | reader_mode wrote:
                | Shouldn't your write strategy be resilient to that kind
                | of stuff (e.g. shutdown during a partial update)?
        
               | gmueckl wrote:
               | Don't you need exact guarantees on write ordering to
               | achieve that?
        
               | jorangreef wrote:
               | Yes, for almost all databases, although there was a cool
               | paper from the University of Wisconsin Madison a few
               | years ago that showed how to design something that could
               | work without write barriers, and under the assumption
               | that disks don't always fsync correctly:
               | 
               | "the No-Order File System (NoFS), a simple, lightweight
               | file system that employs a novel technique called
               | backpointer based consistency to provide crash
               | consistency without ordering writes as they go to disk"
               | 
               | http://pages.cs.wisc.edu/~vijayc/nofs.htm
        
               | vlovich123 wrote:
               | Does that generalize to databases? My understanding is
               | that file systems are a restricted case of databases that
                | don't necessarily support all operations (e.g.
                | transactions are smaller, you can't do arbitrary
                | queries within a transaction, etc.).
        
               | bonzini wrote:
               | You can do write/sync/write/sync in order to achieve
               | that. It would be nicer to have FUA support in system
               | calls (or you can open the same file to two descriptors,
               | one with O_SYNC and one without).
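                | 
                | Sketch of that ordering, with fsync as the
                | only barrier:
                | 
                |     #include <unistd.h>
                | 
                |     /* The record must be durable before the
                |      * commit mark that makes it valid. */
                |     int commit(int fd, const void *rec,
                |                size_t n, off_t at,
                |                char mark, off_t mark_at) {
                |         if (pwrite(fd, rec, n, at) !=
                |                 (ssize_t)n)
                |             return -1;
                |         if (fsync(fd) != 0)  /* barrier */
                |             return -1;
                |         if (pwrite(fd, &mark, 1, mark_at) != 1)
                |             return -1;
                |         return fsync(fd);    /* barrier */
                |     }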
        
             | dooglius wrote:
             | I think you mean mlock
        
             | cperciva wrote:
             | If you're tracking what needs to be flushed to disk when,
             | you might as well just be making explicit pwrite syscalls.
        
         | cma wrote:
          | Isn't there a way around this? When coding for graphics
          | stuff, writing to GPU-mapped memory, people usually take
          | pains to turn off compiler optimizations that might XOR
          | memory against itself to zero it out, or AND it against 0,
          | and thereby cause a read, and other things like that.
         | 
         | https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-...
         | 
          | > Even the following C++ code can read from memory and trigger
          | the performance penalty because the code can expand to the
          | following x86 assembly code. C++ code:
          | 
          |     *((int*)MappedResource.pData) = 0;
          | 
          | x86 assembly code:
          | 
          |     AND DWORD PTR [EAX],0
         | 
         | > Use the appropriate optimization settings and language
         | constructs to help avoid this performance penalty. For example,
         | you can avoid the xor optimization by using a volatile pointer
         | or by optimizing for code speed instead of code size.
         | 
         | I guess mmapped files still may need a read to know whether to
         | do copy on write, where mapped memory for the CPU in that case
         | is specifically marked for upload only and gets something
          | flagged that writes it regardless of whether there is a
          | change, but mmap maybe has something similar?
         | 
         | (edit: this seems to say nothing similar is possible with mmap
         | on x86 https://stackoverflow.com/questions/31014515/write-only-
         | mapp...
         | 
          | but how does it work for GPUs? Something to do with fixed
          | PCIe support on the CPU (base address register
          | https://en.wikipedia.org/wiki/PCI_configuration_space)?
        
           | ww520 wrote:
            | I believe GPUs solve this by having read-only and write-
            | only buffers in the rendering pipeline.
        
           | alaties wrote:
           | The answer is that it works pretty similarly, but GPUs
           | usually do this in specialized hardware whereas mmap'ing of
           | files for DMA-style access is implemented mostly in software.
           | 
           | https://insujang.github.io/2017-04-27/gpu-architecture-
           | overv... has a pretty good visual of what's doing what for
           | GPU DMA. You can imagine much of what happens here is almost
           | pure software for mmap'd files.
        
           | monocasa wrote:
            | As others have said, you need hardware support to do this
           | similarly to how GPUs do it.
           | 
           | That being said, that hardware support exists with NVDIMMs.
        
           | remram wrote:
           | You'd need a way to indicate when you start and end
           | overwriting the page. You need to avoid the page being
           | swapped out mid-overwrite and not read back in. You'd also
           | pay a penalty for zeroing it when it gets mapped pre-
           | overwrite. The map primitives are just not meant for this.
        
           | rini17 wrote:
            | I think on Linux there's the madvise syscall with a
            | "remove" flag, which you can issue on memory pages you
            | intend to completely overwrite. I have no idea about
            | performance or other practical issues.
        
         | icedchai wrote:
          | Yes, this can definitely be a problem. I worked on a
          | transaction processing system that was entirely based on an
          | in-house memory-mapped database. All reads and writes went
          | through mmap. At startup, it read through all X gigabytes of
          | data to "ensure" everything was hot in memory, and also
          | built the in-memory indexes.
         | 
         | This actually worked fine in production, since the systems were
         | properly sized and dedicated to this. On dev systems with low
         | memory and often running into swap, you'd run into cases with
          | crazy delays... sometimes a second or two for something that
          | would normally take a few milliseconds.
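          | 
          | The warm-up pass presumably amounted to something like this
          | (sketch):
          | 
          |     #include <stdint.h>
          |     #include <unistd.h>
          | 
          |     /* Touch one byte per page so everything is faulted
          |      * in before the system starts serving traffic. */
          |     uint64_t prefault(const volatile uint8_t *map,
          |                       size_t len) {
          |         uint64_t sum = 0;
          |         size_t pg = (size_t)sysconf(_SC_PAGESIZE);
          |         for (size_t i = 0; i < len; i += pg)
          |             sum += map[i];
          |         return sum; /* keeps the loop from being elided */
          |     }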
        
       | ramoz wrote:
        | Perhaps a part 2 would dive a bit deeper into OS caching and
        | hardware (SSDs, their interfaces, etc.).
        
       | shoo wrote:
       | See also: sublime HQ blog about complexities of shipping a
       | desktop application using mmap [1] and corresponding 200+ comment
       | HN thread [2]:
       | 
       | > When we implemented the git portion of Sublime Merge, we chose
       | to use mmap for reading git object files. This turned out to be
       | considerably more difficult than we had first thought. Using mmap
       | in desktop applications has some serious caveats [...]
       | 
       | > you can rewrite your code to not use memory mapping. Instead of
       | passing around a long lived pointer into a memory mapped file all
       | around the codebase, you can use functions such as pread to copy
       | only the portions of the file that you require into memory. This
       | is less elegant initially than using mmap, but it avoids all the
       | problems you're otherwise going to have.
       | 
       | > Through some quick benchmarks for the way Sublime Merge reads
        | git object files, pread was around 2/3 as fast as mmap on
        | linux. In hindsight it's difficult to justify using mmap over
       | pread, but now the beast has been tamed and there's little reason
       | to change any more.
       | 
       | [1] https://www.sublimetext.com/blog/articles/use-mmap-with-care
       | [2] https://news.ycombinator.com/item?id=19805675
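        | 
        | The pread pattern they describe is roughly this (my sketch,
        | not code from the blog):
        | 
        |     #include <unistd.h>
        | 
        |     /* Copy just the bytes you need into a caller-owned
        |      * buffer instead of passing a long-lived pointer into
        |      * a mapping around the codebase. */
        |     ssize_t read_at(int fd, void *buf, size_t n, off_t off) {
        |         size_t done = 0;
        |         while (done < n) {
        |             ssize_t r = pread(fd, (char *)buf + done,
        |                               n - done, off + done);
        |             if (r <= 0)
        |                 return r;  /* 0 = EOF, -1 = error */
        |             done += (size_t)r;
        |         }
        |         return (ssize_t)done;
        |     }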
        
       | minitoar wrote:
       | Interana mmaps the heck out of stuff. I've found that relying on
       | the file cache works great. Though our access patterns are
       | admittedly pretty simple.
        
       | 29athrowaway wrote:
       | malloc is implemented using mmap.
       | 
       | You map memory manually when you need very low level control over
       | memory.
        
         | jeffbee wrote:
         | `malloc` is not one thing. Some mallocs use mmap and others use
         | brk. Some implementations use both.
        
           | kevin_thibedeau wrote:
           | Some use neither.
        
       | PaulHoule wrote:
       | I like mmap and I don't.
       | 
       | It is incompatible with non-blocking I/O since your process will
       | be stopped if it tries to access part of the file that is not
        | mapped -- this isn't a syscall blocking (which you might work
       | around) but rather any attempt to access mapped memory.
       | 
       | I like mmap for tasks like seeking into ZIP files, where you can
       | look at the back 1% of the file, then locate and extract one of
       | the subfiles; the trouble there is that the really fun case is to
        | do this over the network with HTTP (say to solve Python
        | dependencies, to extract the metadata from wheel files), in
        | which case this method doesn't work.
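        | 
        | The ZIP trick, roughly (sketch; scans backwards for the end-
        | of-central-directory signature, no ZIP64 or error handling):
        | 
        |     #include <fcntl.h>
        |     #include <sys/mman.h>
        |     #include <unistd.h>
        | 
        |     long find_eocd(const char *path) {
        |         int fd = open(path, O_RDONLY);
        |         off_t len = lseek(fd, 0, SEEK_END);
        |         const unsigned char *p =
        |             mmap(NULL, (size_t)len, PROT_READ,
        |                  MAP_PRIVATE, fd, 0);
        |         close(fd);  /* the mapping survives the close */
        |         /* Only the tail pages ever get faulted in: */
        |         for (off_t i = len - 22; i >= 0; i--)
        |             if (p[i] == 'P' && p[i+1] == 'K' &&
        |                 p[i+2] == 5 && p[i+3] == 6)
        |                 return (long)i;
        |         return -1;
        |     }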
        
         | Sesse__ wrote:
         | mmap is great for rapid prototyping. For anything I/O-heavy,
         | it's a mess. You have zero control over how large your I/Os are
         | (you're very much at the mercy of heuristics that are optimized
         | for loading executables), readahead is spotty at best
         | (practical madvise implementation is a mess), async I/O doesn't
         | exist, you can't interleave compression in the page cache,
         | there's no way of handling errors (I/O error = SIGBUS/SIGSEGV),
         | and write ordering is largely inaccessible. Also, you get
         | issues such as page table overhead for very large files, and
         | address space limitations for 32-bit systems.
         | 
         | In short, it's a solution that looks so enticing at first, but
         | rapidly costs much more than it's worth. As systems grow more
         | complex, they almost inevitably have to throw out mmap.
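          | 
          | For reference, the hints that do exist, both best-effort
          | (sketch):
          | 
          |     #include <sys/mman.h>
          | 
          |     /* Ask for aggressive readahead on a scan, then ask
          |      * for the range to be paged in. The kernel is free
          |      * to ignore both. */
          |     void hint_scan(void *map, size_t len) {
          |         madvise(map, len, MADV_SEQUENTIAL);
          |         madvise(map, len, MADV_WILLNEED);
          |     }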
        
         | rapsey wrote:
         | Process will be stopped or thread?
        
           | ithkuil wrote:
           | Thread
        
         | codetrotter wrote:
         | > the trouble there is that the really fun case is to do this
          | over the network with HTTP (say to solve Python dependencies,
          | to extract the metadata from wheel files), in which case this
          | method doesn't work
         | 
         | If the web server can tell you the total size of the file by
          | responding to a HEAD request, and it supports range requests,
         | then it will be possible.
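          | 
          | Roughly, with libcurl (sketch; fetch_tail is a made-up
          | helper, and you'd want to check for a 206 response):
          | 
          |     #include <stdio.h>
          |     #include <curl/curl.h>
          | 
          |     /* Read only the last 64 KiB of a remote file -- the
          |      * HTTP analogue of touching just the tail pages. */
          |     CURLcode fetch_tail(const char *url, long long size) {
          |         CURL *h = curl_easy_init();
          |         char range[64];
          |         snprintf(range, sizeof range, "%lld-",
          |                  size > 65536 ? size - 65536 : 0);
          |         curl_easy_setopt(h, CURLOPT_URL, url);
          |         curl_easy_setopt(h, CURLOPT_RANGE, range);
          |         CURLcode rc = curl_easy_perform(h);
          |         curl_easy_cleanup(h);
          |         return rc;
          |     }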
         | 
         | https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...
         | 
         | Or am I missing something?
        
           | johndough wrote:
            | You are correct, this works. There is even a file system
           | built around this idea: https://github.com/fangfufu/httpdirfs
        
           | remram wrote:
           | You can't do this with mmap though, you can't instruct _the
           | OS_ to grab pages via HTTP range requests.
        
             | kccqzy wrote:
              | Write a FUSE layer.
        
         | amelius wrote:
         | > It is incompatible with non-blocking I/O since your process
         | will be stopped if it tries to access part of the file that is
         | not mapped
         | 
         | Yeah, but the same problem occurs in normal memory when the OS
         | has swapped out the page.
         | 
         | So perhaps non-blocking I/O (and cooperative multitasking) is
         | the problem here.
        
           | loeg wrote:
           | > Yeah, but the same problem occurs in normal memory when the
           | OS has swapped out the page.
           | 
           | I'd argue that swapping is an orthogonal problem which can be
           | solved in a number of ways: disable swap at the OS level,
           | mlock() in the application, maybe others.
           | 
           | mmap is really a bad API for IO -- it hides synchronous IO
           | and doesn't produce useful error statuses at access.
           | 
           | > So perhaps non-blocking I/O (and cooperative multitasking)
           | is the problem here.
           | 
           | I'm not sure how non-blocking IO is "the problem." It's
           | something Windows has had forever, and unix-y platforms have
           | wanted for quite a long time. (Long history of poll, epoll,
           | kqueue, aio, and now io_uring.)
        
             | amelius wrote:
             | > it hides synchronous IO and doesn't produce useful error
             | statuses at access.
             | 
             | You can trap IO errors if necessary. E.g. you can raise
             | signals just like segfaults generate signals.
             | 
             | > I'm not sure how non-blocking IO is "the problem."
             | 
             | The point is that non-blocking IO wants to abstract away
             | the hardware, but the abstraction is leaky. Most programs
              | which use non-blocking IO actually want to implement
              | multitasking without relying on threads. But that turns out to
             | be the wrong approach.
        
               | loeg wrote:
               | > The point is that non-blocking IO wants to abstract
               | away the hardware, but the abstraction is leaky.
               | 
               | Why do you say it doesn't match hardware? Basically all
               | hardware is asynchronous -- submit a request, get a
               | completion interrupt, completion context has some success
               | or failure status. Non-blocking IO is fundamentally a
               | good fit for hardware. It's blocking IO that is a poor
               | abstraction for hardware.
               | 
               | > Most programs which use non-blocking IO actualy want to
               | implement multitasking without relying threads. But that
               | turns out to be the wrong approach.
               | 
               | Why is that the wrong approach? Approximately every high-
               | performance httpd for the last decade or two has used a
               | multitasking, non-blocking network IO model rather than
               | thread-per-request. The overhead of threads is just very
               | high. They would like to use the same model for non-
               | network IO, but Unix and unix-alikes have historically
               | not exposed non-blocking disk IO to applications.
               | io_uring is a step towards a unified non-blocking IO
               | interface for applications, and also very similar to how
               | the operating system interacts with most high-performance
               | devices (i.e., a bunch of queues).
        
               | amelius wrote:
               | > Why do you say it doesn't match hardware?
               | 
               | Because the CPU itself can block. In this case on memory
               | access. Most (all?) async software assumes the CPU can't
               | block. A modern CPU has a pipelining mechanism, where
               | parts can simply block, waiting for e.g. memory to
               | return. If you want to handle this all nicely, you have
                | to respect the API of this process, which happens to go
               | through the OS. So for example, while waiting for your
               | memory page to be loaded, the OS can run another thread
               | (which it can't in the async case because there isn't any
               | other thread).
        
         | quotemstr wrote:
         | You use mmap whether you want to or not: the system executes
         | your program by mmaping your executable and jumping into it!
         | You can always take a hard fault at any time because the kernel
         | is allowed to evict your code pages on demand even if you
         | studiously avoid mmap for your data files. And it can do this
         | eviction even if you have swap turned off.
         | 
         | If you want to guarantee that your program doesn't block, you
         | need to use mlockall.
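          | 
          | i.e. (sketch; needs CAP_IPC_LOCK or enough RLIMIT_MEMLOCK):
          | 
          |     #include <sys/mman.h>
          | 
          |     /* Pin everything mapped now and everything mapped
          |      * later, so not even code pages can be evicted. */
          |     int pin_all(void) {
          |         return mlockall(MCL_CURRENT | MCL_FUTURE);
          |     }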
        
           | geofft wrote:
           | This is technically true, but the use case we're talking
           | about is programs that are much smaller than their data.
           | Postgres, for instance, is under 50 MB, but is often used to
           | handles databases in the gigabytes or terabytes range. You
           | can mlockall() the binary if you want, but you probably can't
           | actually fit the entire database into RAM even if you wanted
           | to.
           | 
           | Also, when processing a large data file (say you're walking a
           | B-tree or even just doing a search on an unindexed field),
           | the code you're running tends to be a small loop, within the
           | same few pages, so it might not even leave the CPU's cache,
           | let alone get swapped out of RAM, but you need to access a
           | very large amount of data, so it's much more likely the data
           | you want could be swapped out. If you know some things about
           | the data structure (e.g., there's an index or lookup table
           | somewhere you care about, but you're traversing each node
           | once), you can use that to optimize which things are flushed
           | from your cache and which aren't.
        
           | jorangreef wrote:
           | But that's a different order of magnitude problem: control
           | plane vs data plane.
           | 
           | At some point, we could also say that the line fill buffer
           | blocks our programs (more often than we realize).
           | 
           | All of this is accurate, but at different scales.
        
             | PaulHoule wrote:
             | Also many systems in 2021 have a lot of RAM and hardly ever
             | swap.
        
           | loeg wrote:
           | You're not wrong. Applications and libraries that want to be
           | non-blocking should mlock their pages and avoid mmap for
           | further data access. ntpd does this, for example.
           | 
           | After application startup, you _can_ avoid _additional_ mmap.
        
       | amelius wrote:
       | This is one area where Rust, a modern systems language, has
       | disappointed me. You can't allocate data structures inside
       | mmap'ed areas, and expect them to work when you load them again
       | (i.e., the mmap'ed area's base address might have changed). I
        | hope that future languages take this use case into account.
        
         | simias wrote:
         | I'm not sure I see the issue. This approach (putting raw binary
         | data into files) is filled with footguns. What if you add,
         | remove or reorder fields? What if your file was externally
         | modified and now doesn't match the expected layout? What if the
         | data contains things like file descriptors or pointers that
         | can't meaningfully be mapped that way? Even changing the
         | compilation flags can produce binary incompatibilities.
         | 
         | I'm not saying that it's not sometimes very useful but it's
         | tricky and low level enough that some unsafe low level plumbing
         | is, I think, warranted. You have to know what you're doing if
         | you decide to go down that route, otherwise you're much better
         | off using something like Serde to explicitly handle
         | serialization. There's some overhead of course, but 99% of the
         | time it's the right thing to do.
        
           | amelius wrote:
           | The footguns can be solved in part by the type-system
           | (preventing certain types from being stored), and (if
           | necessary) by cooperation with the OS (e.g. to guarantee that
           | a file is not modified between runs).
           | 
           | How else would you lazy-load a database of (say) 32GB into
           | memory, almost instantly?
           | 
           | And why require everybody to write serialization code when
           | just allocating the data inside a mmap'ed file is so much
           | easier? We should be focusing on new problems rather than
           | reinventing the wheel all the time. Persistence has been an
           | issue in computing since the start, and it's about time we
           | put it behind us.
        
             | simias wrote:
             | >How else would you lazy-load a database of (say) 32GB into
             | memory, almost instantly?
             | 
             | By using an existing database engine that will do it for
             | me. If you need to deal with that amount of data and
             | performance is really important you have a lot more to
             | worry about than having to use unsafe blocks to map your
             | data structures.
             | 
             | Maybe we just have different experiences and work on
             | different types of projects but I feel like being able to
             | seamlessly dump and restore binary data transparently is
             | both very difficult to implement reliably and quite niche.
             | 
             | Note in particular that machine representation is not
             | necessarily the most optimal way to store data. For
             | instance any kind of Vec or String in rust will use 3 usize
             | to store length, capacity and the data pointer which on 64
             | bit architectures is 24 bytes. If you store many small
             | strings and vectors it adds up to a huge amount of waste.
             | Enum variants are also 64 bits on 64 bit architectures if I
             | recall correctly.
             | 
              | For instance I use bincode with serde to serialize data
              | between instances of my application; bincode maps objects
              | almost 1:1 to their binary representation. I noticed that
              | by implementing a trivial RLE encoding scheme on top of
              | bincode for runs of zeroes I can divide the average
              | message size by a factor of 2 to 3. And bincode only
              | encodes length, not capacity.
             | 
             | My point being that I'm not sure that 32GB of memory-mapped
             | data would necessarily load faster than <16GB of lightly
             | serialized data. Of course in some cases it might, but
             | that's sort of my point, you really need to know what
             | you're doing if you decide to do this.
        
             | burntsushi wrote:
             | > How else would you lazy-load a database of (say) 32GB
             | into memory, almost instantly?
             | 
             | That's what the fst crate[1] does. It's likely working at a
             | lower level of abstraction than you intend. But the point
             | is that it works, is portable and doesn't require any
             | cooperation from the OS other than the ability to memory
             | map files. My imdb-rename tool[2] uses this technique to
             | build an on-disk database for instantaneous searching. And
             | then there is the regex-automata crate[3] that permits
             | deserializing a regex instantaneously from any kind of
             | slice of bytes.[4]
             | 
             | I think you should maybe provide some examples of what
             | you're suggesting to make it more concrete.
             | 
             | [1] - https://crates.io/crates/fst
             | 
             | [2] - https://github.com/BurntSushi/imdb-rename
             | 
             | [3] - https://crates.io/crates/regex-automata
             | 
             | [4] - https://docs.rs/regex-
             | automata/0.1.9/regex_automata/#example...
        
           | geofft wrote:
           | I had a use case recently for serializing C data structures
           | in Rust (i.e., being compatible with an existing protocol
           | defined as "compile this C header, and send the structs down
           | a UNIX socket"), and I was a little surprised that the
           | straightforward way to do it is to unsafely cast a #[repr(C)]
           | structure to a byte-slice, and there isn't a Serde serializer
           | for C layouts. (Which would even let you serialize C layouts
           | for a different platform!)
           | 
           | I think you could also do something Serde-ish that handles
           | the original use case where you can derive something on a
           | structure as long as it contains only plain data types (no
           | pointers) or nested such structures. Then it would be safe to
           | "serialize" and "deserialize" the structure by just
           | translating it into memory (via either mmap or direct
           | reads/writes), without going through a copy step.
           | 
           | The other complication here is multiple readers - you might
           | want your accessor functions to be atomic operations, and you
           | might want to figure out some way for multiple processes
           | accessing the same file to coordinate ordering updates.
           | 
           | I kind of wonder what Rust's capnproto and Arrow bindings do,
           | now....
        
             | burntsushi wrote:
             | It's likely that the "safe transmute" working group[1] will
             | help facilitate this sort of thing. They have an RFC[2].
             | See also the bytemuck[3] and zerocopy[4] crates which
             | predate the RFC, where at least the latter has 'derive'
             | functionality.
             | 
             | [1] - https://github.com/rust-lang/project-safe-transmute
             | 
             | [2] - https://github.com/jswrenn/project-safe-
             | transmute/blob/rfc/r...
             | 
             | [3] - https://docs.rs/bytemuck/1.5.0/bytemuck/
             | 
             | [4] - https://docs.rs/zerocopy/0.3.0/zerocopy/index.html
        
         | comonoid wrote:
         | Yes, you can.
         | 
         | You cannot with standard data structures, but you can with your
         | custom ones.
         | 
         | That's all about trade-offs, anyway, there is no magic bullet.
        
         | the8472 wrote:
         | Work on custom allocators is underway, some of the std data
         | structures already support them on nightly.
         | 
         | https://github.com/rust-lang/wg-allocators/issues/7
        
         | remram wrote:
         | What about Rust makes this more difficult than doing the same
         | thing in C++?
        
         | quotemstr wrote:
         | You can't do that in C++ or any language. You need to do your
         | own relocations and remember enough information to do them. You
         | can't count on any particular virtual address being available
         | on a modern system, not if you want to take advantage of ASLR.
         | 
         | The trouble is that we have to mark relocated pages dirty
         | because the kernel isn't smart enough to understand that it can
         | demand fault and relocate on its own. Well, either that, or do
         | the relocation anew on each access.
        
           | whimsicalism wrote:
           | I don't see what the issue in doing this is in C++.
           | 
           | The only thing that'll break will be the pointers and
           | references to things outside of the mmap'd area.
        
             | simias wrote:
             | By that logic you can do it in unsafe Rust as well then.
             | Obviously in safe Rust having potentially dangling
             | "pointers and references to things outside of the mmap'd
             | area" is a big no-no.
             | 
             | And note that even intra-area pointers would have to be
             | offset if the base address changes. Unless you go through
             | the trouble of only storing relative offsets to begin with,
             | but the performance overhead might be significant.
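              | 
              | In C the relative-offset idea looks roughly like
              | this (sketch):
              | 
              |     #include <stdint.h>
              | 
              |     /* Children are byte offsets relative to the
              |      * node itself, so the structure stays valid
              |      * at whatever address the file is mapped. */
              |     struct node {
              |         int32_t key;
              |         int32_t left;   /* 0 = none */
              |         int32_t right;
              |     };
              | 
              |     static struct node *child(struct node *n,
              |                               int32_t rel) {
              |         return rel
              |             ? (struct node *)((char *)n + rel)
              |             : (struct node *)0;
              |     }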
        
           | Hello71 wrote:
           | libsigsegv (slow) or userfaultfd (less slow) can be used for
           | this purpose.
        
           | secondcoming wrote:
           | It works with C++ if you use boost::interprocess. Its data
           | structures use offset_ptr internally rather than assuming
           | every pointer is on the heap.
        
             | quotemstr wrote:
             | Sure. But that counts as "doing your own relocations".
             | Unsafe Rust could do the same, yes?
        
               | whimsicalism wrote:
               | What is being relocated?
        
               | ithkuil wrote:
               | If you use offsets instead of pointers you're doing
               | relocations "on the fly"
        
               | secondcoming wrote:
               | I don't know enough about Rust to say. If it doesn't have
               | the concept of a 'fancy pointer' then I assume no, you'd
               | have to essentially reproduce what boost::interprocess
               | does.
        
             | amelius wrote:
             | That introduces different data-types, rather than using the
             | existing ones (instantiated with different pointer-types).
        
               | secondcoming wrote:
               | Indeed. I don't know if there's a plan for the standard
               | type to move to offset-ptr, or if there's even a
               | std::offset_ptr, but it would be great if there was.
               | 
               | For us, some of the 'different data type' pain was
               | alleviated with transparent comparators. YMMV.
               | 
               | Edit: It seems C++11 has added some form of support for
               | it... 'fancy pointers'
               | 
               | https://en.cppreference.com/w/cpp/named_req/Allocator#Fan
               | cy_...
        
         | jnwatson wrote:
         | There's no placement new in Rust? That's disappointing.
        
           | steveklabnik wrote:
           | Not in stable yet, no. It's desired, but has taken a while to
           | design, as there have been higher priority things for a
           | while. We'll get there!
        
         | turminal wrote:
         | This is impossible without significant performance impact. No
         | language can change that.
         | 
         | Edit: except theoretically for data structures that have
         | certain characteristics known in advance
        
           | amelius wrote:
           | Well, one approach is to parameterize your data-types such
           | that they are fast in the usual case, but become perhaps
           | slightly slower (but still on par with hand-written code) in
           | the more versatile case.
        
       | waynesonfire wrote:
       | Thanks for diving into this DB! I find it interesting that many
       | databases share such similar architectural principles. NIH. It's
       | super fun to build a database so why not.
       | 
        | Also, don't beat yourself up over how deep you'll be diving
        | into the design. Why apologize for this? Those that want a
        | deeper exposition would quickly move on.
        
       | rossmohax wrote:
        | mmap is not as free as people think. The VM subsystem is full
        | of inefficient locks. Here is a very good writeup on a problem
        | the BBC encountered with Varnish:
       | https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-4...
        
       | jeffbee wrote:
       | Apparently in a way that the author of the article, and probably
       | the authors of bolt, do not really understand.
        
       | bonzini wrote:
       | The right answer is that they shouldn't. A database has much more
       | information than the operating system about what, how and when to
       | cache information. Therefore the database should handle its own
       | I/O caching using O_DIRECT on Linux or the equivalent on Windows
       | or other Unixes.
       | 
       | The article at https://www.scylladb.com/2017/10/05/io-access-
       | methods-scylla... is a bit old (2017) but it explains the trade-
        | offs.
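        | 
        | The catch with O_DIRECT is the alignment contract: buffer,
        | offset and length must all be suitably aligned (sketch,
        | Linux):
        | 
        |     #define _GNU_SOURCE  /* for O_DIRECT */
        |     #include <fcntl.h>
        |     #include <unistd.h>
        | 
        |     /* Read bypassing the page cache entirely; the DB's own
        |      * cache decides what stays resident. buf must come from
        |      * posix_memalign(&buf, 4096, len), and len/off must be
        |      * multiples of the device block size. */
        |     ssize_t direct_read(const char *path, void *buf,
        |                         size_t len, off_t off) {
        |         int fd = open(path, O_RDONLY | O_DIRECT);
        |         if (fd < 0) return -1;
        |         ssize_t r = pread(fd, buf, len, off);
        |         close(fd);
        |         return r;
        |     }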
        
         | quotemstr wrote:
          | Yep. Every mature, high-performing, non-embedded database
         | evolves towards getting the underlying operating system out of
         | the way as much as possible.
        
         | natmaka wrote:
         | > A database has much more information than the operating
         | system about what, how and when to cache information
         | 
          | Yes, on a dedicated server. However, many DB engine instances
         | run on non-dedicated servers, for example along a web server
         | flanked with various processes sometimes reading the local
         | filesystem or using RAM (Varnish, memcached...), and often-run
         | tasks (tempfiles purge, log aggregation, monitoring probes,
          | MTA...). In such a case letting the DB engine use too much
          | RAM, thereby limiting the OS buffer cache size and reducing
          | its global efficiency, may (all other things being equal)
          | imply more 'read' operations, reducing overall performance.
        
           | sradman wrote:
            | Great point. The RDBMS page cache size is a key performance
            | parameter that is nearly impossible to get right on a
            | mixed-use host, whether a non-dedicated server or a client
            | desktop/laptop. SQL Anywhere, which emphasizes zero-admin,
           | has long supported _Dynamic Cache Sizing_ [1] specifically
            | for this mixed-use case which is/was its bread-and-butter. I
           | don't know if any other RDBMSes do the same (MS SQL?).
           | 
           | As a side note, Apache Arrow's main use case is similar, a
           | column oriented data store shared by one-or-more client
           | processes (Python, R, Julia, Matlab, etc.) on the same
           | general purpose host. This is also now a key distinction
           | between the Apple M1 and its big.LITTLE ARM SoC vs. Amazon
           | Graviton built for server-side virtualized/containerized
           | instances. We should not conflate the two use-cases and
           | understand that the best solution for one use case may not be
           | the best for the other.
           | 
           | [1] http://dcx.sybase.com/1200/en/dbusage/perform-
           | bridgehead-405...
        
         | jorangreef wrote:
         | Yes, and it's not only about performance, but also safety
         | because O_DIRECT is the only safe way to recover from the
         | journal after fsync failure (when the page cache can no longer
         | be trusted by the database to be coherent with the disk):
         | https://www.usenix.org/system/files/atc20-rebello.pdf
         | 
         | From a safety perspective, O_DIRECT is now table stakes.
         | There's simply no control over the granularity of read/write
         | EIO errors when your syscalls only touch memory and where you
         | have no visibility into background flush errors.
        
           | formerly_proven wrote:
           | Around four years ago I was working on a transactional data
            | store and ran into the issue that virtually no one tells
           | you how durable I/O is supposed to work. There were very few
           | articles on the internet that went beyond some of the basic
           | stuff (e.g. create file => fsync directory) and perhaps one
           | article explaining what needs to be considered when using
           | sync_file_range. Docs and POSIX were useless. I noticed that
           | there seemed to be inherent problems with I/O error handling
           | when using the page cache, i.e. whenever something that
           | wasn't the app itself caused write I/O you really didn't know
           | any more if all the data got there.
           | 
           | Some two years later fsyncgate happened and since then I/O
           | error handling on Linux has finally gotten at least some
           | attention and people seemed to have woken up to the fact that
           | this is a genuinely hard thing to do.
        
         | sradman wrote:
         | O_DIRECT prevents file double buffering by the OS and DBMS page
         | cache. MMAP removes the need for the DBMS page cache and relies
         | on the OS's paging algorithm. The gain is zero memory copy and
         | the ability for multiple processes to access the same data
         | efficiently.
         | 
         | Apache Arrow takes advantage of mmap to share data across
         | different language processes and enables fast startup for short
         | lived processes that re-access the same OS cached data.
        
           | geofft wrote:
           | Yes, but the claim is that the buffer you should remove is
           | the OS's one, not the DBMS's one, because for the DBMS use
           | case (one very large file with deep internal structure,
           | generally accessed by one long-running process), the DBMS has
           | information the OS doesn't.
           | 
           | Arrow is a different use case, for which mmap makes sense.
           | For something like a short-lived process that stores config
           | or caches in SQLite, it probably is actually closer to Arrow
           | than to (e.g.) Postgres, so mmap likely also makes sense for
           | that. (Conversely, if you're not relying on Arrow's sharing
           | properties and you have a big Python notebook that's doing
           | some math on an extremely large data file on disk in a single
           | process, you might actually get better results from O_DIRECT
           | than mmap.)
           | 
           | In particular, "zero memory copy" only applies if you are
           | accessing the same data from multiple processes (either at
           | once or sequentially). If you have a single long-running
           | database server, you have to copy the data from disk to RAM
            | _anyway_. O_DIRECT means there's one copy, from disk to a
           | userspace buffer; mmap means there's one copy, from disk to a
           | kernel buffer. If you can arrange for a long-lived userspace
           | buffer, there's no performance advantage to using the kernel
           | buffer.
        
             | sradman wrote:
             | > but the claim is that the buffer you should remove is the
             | OS's one
             | 
             | I was not trying to minimize O_DIRECT, I was trying to
             | emphasize the key advantage succinctly and also explain the
             | Apache Arrow use case of mmap which the article does not
             | discuss.
        
         | masklinn wrote:
         | > Therefore the database should handle its own I/O caching
         | using O_DIRECT on Linux or the equivalent on Windows or other
         | Unixes.
         | 
         | That's not wrong, but at the same time it adds complexity and
         | requires effort which can't be spent elsewhere unless you've
         | got someone who really only wants to DIO and wouldn't work on
         | anything else anyway.
         | 
         | Postgres has never used DIO, and while there have been rumbling
         | about moving to DIO (especially following the fsync mess) as
         | Andres Freund noted:
         | 
         | > efficient DIO usage is a metric ton of work, and you need a
         | large amount of differing logic for different platforms. It's
         | just not realistic to do so for every platform. Postgres is
         | developed by a small number of people, isn't VC backed etc. The
         | amount of resources we can throw at something is fairly
         | limited. I'm hoping to work on adding linux DIO support to pg,
         | but I'm sure as hell not going to do be able to do the same on
         | windows (solaris, hpux, aix, ...) etc.
        
           | jorangreef wrote:
           | I have found that planning for DIO from the start makes for a
           | better, simpler design when designing storage systems,
           | because it keeps the focus on logical/physical sector
           | alignment, latent sector error handling, and caching from the
           | beginning. And even better to design data layouts to work
           | with block devices.
           | 
           | Retrofitting DIO onto a non-DIO design and doing this cross-
           | platform is going to be more work, but I don't think that's
           | the fault of DIO (when you're already building a database
           | that is).
        
           | jandrewrogers wrote:
           | PostgreSQL has two main challenges with direct I/O. The basic
           | one is that it adversely impacts portability, as mentioned,
           | and is complicated in implementation because file system
           | behavior under direct I/O is not always consistent.
           | 
           | The bigger challenge is that PostgreSQL is not architected
           | like a database engine designed to use direct I/O
           | effectively. Adding even the most rudimentary support will be
           | a massive code change and implementation effort, and the end
           | result won't be comparable to what you would expect from a
           | modern database kernel designed to use direct I/O. This
           | raises questions about return on investment.
        
         | api wrote:
         | You can also mount a file system in synchronous mode on most
         | OSes, which may make sense for a DB storage volume (but not
         | other parts of the system).
        
         | jnwatson wrote:
         | In theory that's true. In practice, utilizing the highly-
         | optimized already-in-kernel-mode page cache can produce
         | tremendous performance. LMDB, for example, is screaming fast,
         | and doesn't use DIO.
        
         | the8472 wrote:
         | There was a patch set (introducing the RWF_UNCACHED flag) to
         | get buffered IO with most of the benefits of O_DIRECT and
         | without its drawbacks, but it looks like it hasn't landed.
         | 
          | There are also new options to give the kernel better page cache
         | hints via the new MADV_COLD or MADV_PAGEOUT flags. These ones
         | did land.
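          | 
          | Usage is the same madvise call as before (sketch, Linux
          | 5.4+):
          | 
          |     #include <sys/mman.h>
          | 
          |     void cool_range(void *map, size_t len) {
          |         /* Deactivate: evict these pages before others
          |          * under memory pressure... */
          |         madvise(map, len, MADV_COLD);
          |         /* ...or reclaim them immediately:
          |          * madvise(map, len, MADV_PAGEOUT); */
          |     }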
        
         | nullsense wrote:
        | I think of the major database vendors only Postgres uses mmap
        | and everyone else does their own I/O caching management.
        
       ___________________________________________________________________
       (page generated 2021-01-23 23:00 UTC)