[HN Gopher] In-memory vs. disk-based DB: Why do you need a large...
       ___________________________________________________________________
        
       In-memory vs. disk-based DB: Why do you need a larger than memory
       architecture?
        
       Author : taubek
       Score  : 47 points
       Date   : 2023-09-02 17:33 UTC (5 hours ago)
        
 (HTM) web link (memgraph.com)
 (TXT) w3m dump (memgraph.com)
        
       | moomoo11 wrote:
        | I use LevelDB for a cache on low-cost resources, i.e. 1 vCPU,
        | 0.5 GB RAM, 5-10 GB SSD.
       | 
       | Gives me great performance (especially paired with Go), and I'm
       | able to deliver 6-10k rps for a few dollars a month.
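        | 
        | Roughly, that setup looks like the sketch below (paths and
        | keys are made up for illustration; it assumes the pure-Go
        | port at github.com/syndtr/goleveldb/leveldb):
        | 
        |   package main
        | 
        |   import (
        |       "log"
        | 
        |       "github.com/syndtr/goleveldb/leveldb"
        |   )
        | 
        |   func main() {
        |       // Open (or create) the cache directory on the SSD.
        |       db, err := leveldb.OpenFile("/var/cache/app.ldb", nil)
        |       if err != nil {
        |           log.Fatal(err)
        |       }
        |       defer db.Close()
        | 
        |       // Fill on a miss, then serve reads from disk-backed
        |       // LevelDB instead of keeping the data in RAM.
        |       key := []byte("user:42")
        |       if _, err := db.Get(key, nil); err == leveldb.ErrNotFound {
        |           val := []byte(`{"plan":"free"}`)
        |           if err := db.Put(key, val, nil); err != nil {
        |               log.Fatal(err)
        |           }
        |       }
        |       val, err := db.Get(key, nil)
        |       if err != nil {
        |           log.Fatal(err)
        |       }
        |       log.Printf("cached: %s", val)
        |   }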
        
       | cartweheel wrote:
       | They've misunderstood what an in-memory database is though. The
       | distinction only makes sense in a historical perspective, with
       | on-disk formats and associated algorithms that worked well for
        | execution from spinning iron. In fact Neo4j is also an in-
        | memory database, since it will slow to a complete crawl on an
        | HDD. The fact that it supports swapping pages in and out of
        | disk is more akin to virtual memory than to any specific
        | support for executing queries efficiently from disk.
        
       | hakunin wrote:
       | Not 100% what the article is about, just a short story.
       | 
       | In one of my old jobs we had megabytes of infrequently accessed
       | static key-value data. If we simply loaded it into a const (i.e.
        | a hash table), it would blow up RAM usage so much that we
        | would need to upgrade to bigger VPSes. If we put it in a
        | database, it would be annoying to keep these tables up to
        | date and track changes in them.
       | 
       | I figured this was one of those in-between use cases, where the
       | best solution is to have zero-RAM lookups from SSD. In my case, I
        | wrote a little Ruby library[1] that arranges data in equal-
        | size cells in a file and performs binary searches via
        | `pread`. This was
       | perfect for us, because we kept data in our repos, sacrificed no
       | RAM at runtime, SSD lookups were fast enough, and we didn't have
       | to support a more elaborate db.
       | 
       | [1]: https://github.com/maxim/wordmap
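        | 
        | For the curious, the core trick sketched in Go rather than
        | Ruby (the 32-byte key/value widths and file name here are
        | made up; the real gem is linked above):
        | 
        |   package main
        | 
        |   import (
        |       "bytes"
        |       "fmt"
        |       "os"
        |       "sort"
        |   )
        | 
        |   // Each record is a fixed-width cell: a padded key followed
        |   // by a padded value, sorted by key at build time.
        |   const keyW, valW = 32, 32
        |   const cell = keyW + valW
        | 
        |   // lookup binary-searches the file, reading one cell per
        |   // probe with ReadAt (pread), so no index lives in RAM.
        |   func lookup(f *os.File, key string) (string, bool) {
        |       info, err := f.Stat()
        |       if err != nil {
        |           return "", false
        |       }
        |       n := int(info.Size() / cell)
        |       buf := make([]byte, cell)
        |       idx := sort.Search(n, func(i int) bool {
        |           f.ReadAt(buf, int64(i)*cell)
        |           k := bytes.TrimRight(buf[:keyW], "\x00")
        |           return string(k) >= key
        |       })
        |       if idx >= n {
        |           return "", false
        |       }
        |       if _, err := f.ReadAt(buf, int64(idx)*cell); err != nil {
        |           return "", false
        |       }
        |       if string(bytes.TrimRight(buf[:keyW], "\x00")) != key {
        |           return "", false
        |       }
        |       return string(bytes.TrimRight(buf[keyW:], "\x00")), true
        |   }
        | 
        |   func main() {
        |       f, err := os.Open("data.cells") // built offline
        |       if err != nil {
        |           panic(err)
        |       }
        |       defer f.Close()
        |       if v, ok := lookup(f, "hello"); ok {
        |           fmt.Println(v)
        |       }
        |   }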
        
         | teaearlgraycold wrote:
         | Why not commit a sqlite db?
        
           | hakunin wrote:
            | I considered that, but couldn't find a way to precisely
            | control the amount of RAM used when reading from it.
        
         | jerrygenser wrote:
          | I have a similar use case and have had success reading from
          | SQLite files embedded in my repo.
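          | 
          | In Go terms (the table, column, and file names are made up;
          | this assumes the github.com/mattn/go-sqlite3 driver), that
          | pattern is roughly:
          | 
          |   package main
          | 
          |   import (
          |       "database/sql"
          |       "fmt"
          |       "log"
          | 
          |       _ "github.com/mattn/go-sqlite3" // "sqlite3" driver
          |   )
          | 
          |   func main() {
          |       // Open the repo-committed database file read-only.
          |       dsn := "file:data/lookup.db?mode=ro"
          |       db, err := sql.Open("sqlite3", dsn)
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |       defer db.Close()
          | 
          |       var v string
          |       err = db.QueryRow(
          |           "SELECT value FROM kv WHERE key = ?",
          |           "hello").Scan(&v)
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |       fmt.Println(v)
          |   }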
        
         | boywitharupee wrote:
         | Basically, your solution ended up residing on-disk, but the
         | data was fragmented in such a way that efficient lookup was
         | possible? Did this copy from kernel space to userspace, also?
        
           | hakunin wrote:
           | >the data was fragmented in such a way that efficient lookup
           | was possible
           | 
           | Yeah, there's a build step that sorts and arranges data into
           | "cells", making binary search possible.
           | 
           | Not sure about kernel/user space. I'm just calling `pread`
            | from Ruby, so only a few bytes are loaded per lookup.
        
         | rubiquity wrote:
         | > sacrificed no RAM at runtime
         | 
         | Those lookups eventually made their way into the kernel's page
         | cache.
        
           | [deleted]
        
           | hakunin wrote:
            | Are you saying that the file would eventually be loaded
            | entirely into the page cache? I imagine, even if true, it
           | still wouldn't result in OOM killing the server/worker
           | daemons on account of this data?
        
             | rubiquity wrote:
              | Depending on the file size and available memory, yes.
              | When memory needs to be reclaimed because processes need
              | more anonymous memory, Linux's memory management will
              | free pages from the page cache first.
        
         | CraigJPerry wrote:
          | I think I'm missing something here: could you not just
          | mmap() the file in and let the kernel take care of memory
          | pressure for you?
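          | 
          | For reference, a bare-bones Go version of that idea
          | (Linux-only; the file name is made up), with the caveats
          | discussed in the replies below:
          | 
          |   package main
          | 
          |   import (
          |       "fmt"
          |       "os"
          |       "syscall"
          |   )
          | 
          |   func main() {
          |       f, err := os.Open("data.cells")
          |       if err != nil {
          |           panic(err)
          |       }
          |       defer f.Close()
          |       info, err := f.Stat()
          |       if err != nil {
          |           panic(err)
          |       }
          | 
          |       // Map the whole file read-only. Pages are faulted in
          |       // on first access and reclaimed by the kernel under
          |       // memory pressure, so "RAM usage" is whatever the
          |       // page cache happens to hold.
          |       data, err := syscall.Mmap(int(f.Fd()), 0,
          |           int(info.Size()), syscall.PROT_READ,
          |           syscall.MAP_SHARED)
          |       if err != nil {
          |           panic(err)
          |       }
          |       defer syscall.Munmap(data)
          | 
          |       // Lookups become plain slice reads over data, e.g.
          |       // the binary search over fixed-width cells above.
          |       fmt.Println("mapped", len(data), "bytes")
          |   }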
        
           | lisper wrote:
            | Sure -- until your system crashes and data is left in an
            | inconsistent state, because some of it was written to
            | non-volatile storage and some of it wasn't, and your OS
            | had no concept of transactional consistency, because mmap
            | is a leaky abstraction.
        
             | posnet wrote:
             | Not a problem here. Read the top level comment.
             | 
             | "static key-value data. ... a const (i.e. a hash table)"
        
           | sschnei8 wrote:
           | https://db.cs.cmu.edu/mmap-cidr2022/
        
             | rubiquity wrote:
              | Good paper and overall the right opinion, but not very
              | relevant to a relatively simple system like the one OP
              | described, since they weren't dealing with safety,
              | transactions, etc.
        
           | hakunin wrote:
            | I'm no systems programmer, but I remember trying to
            | research the mmap approach. It's been 3 years, so I'm not
            | sure what stopped me, but something didn't feel right.
            | Perhaps it was the lack of control over how memory is
            | used. I could clearly see how not to use it, and didn't
            | want any fluctuations.
            | 
            | Edit: oh, and I think I did come across some article like
            | the one linked in the neighboring comment. It's starting
            | to come back.
        
       | andersa wrote:
       | > Fetching data from disk is something that everyone strives to
       | avoid since it takes approximately 10x more time than using it
       | from main memory.
       | 
       | Did they mean 10 _thousand_ times? Or is the in-memory version
       | that inefficient?
        
         | Darych wrote:
          | Based on this resource,
          | https://gist.github.com/hellerbarde/2843375, a 1 MB
          | sequential read from SSD is just 4x slower than the same
          | read from main memory. For random reads, main memory is
          | faster of course. I believe the author meant some average
          | value.
        
         | pavlov wrote:
         | Yes, even with SSDs it seems like 10x is very optimistic. It
         | should be several orders of magnitude.
        
         | Moto7451 wrote:
         | I think they're playing it safe with their points of
         | comparison. DDR5 supports 64,000 MB/s per channel and 4x NVMe
         | PCIe5 SSDs support 10,000 MB/s. Depending on how many memory
         | channels and what RAID you use, I think 10x improvement over
         | high performance storage is unimpeachable. Memory latency being
         | better than SSD latency will really benefit memory depending on
         | workload, but I don't think you can just throw one number out
         | there to represent that.
         | 
         | Now if you're comparing to spinning rust, memory is definitely
         | going to blow it away, but commodity hardware isn't running
         | tens or hundreds of TBs of memory. Memory to SSD comparisons
         | seem right.
        
           | semi-extrinsic wrote:
            | For example, the latest AMD Genoa has 12 memory channels
            | per socket; at dual socket and with enough DIMMs that's a
            | 75x speed advantage even if you compare with RAID0 of
            | high-performance NVMe.
        
             | wtallis wrote:
              | > even if you compare with RAID0 of high-performance
              | NVMe.
             | 
             | Do you mean a RAID0 of just two or four NVMe SSDs? It's
             | absolutely ridiculous to count aggregate DRAM bandwidth
             | across two CPU sockets and not do the same for PCIe lanes.
             | A fair comparison is that Genoa has about twice the DRAM
             | bandwidth as it has PCIe bandwidth, though in a fully-
             | loaded database server some portion of the PCIe bandwidth
             | will be used for networking rather than storage.
        
               | zamadatix wrote:
                | FWIW I think you have the aggregation backwards:
                | dual-socket Genoa would have 24 channels of DRAM but
                | sacrifice some of the PCIe lanes for the interconnect.
                | Your numbers work out right though, as single-socket
                | is actually about 1:1 in RAM vs PCIe bandwidth, so
                | dual-socket would still come out roughly 2:1.
               | 
               | I think 10x is a fair rough number though, depending on
               | your access pattern.
        
               | wtallis wrote:
               | Dual-socket Genoa would be 24 channels at DDR5-4800 (38.4
               | GB/s) for a total of ~921.6 GB/s. Typical PCIe
               | configurations are 64 or 80 lanes per socket for dual-
                | socket, so 128 or 160 lanes total; at PCIe 5.0 speed
                | that's ~504 GB/s for 128 lanes or ~630 GB/s for
                | 160-lane configurations.
               | 
               | Single-socket Genoa would be 12 channels of DRAM (~460.8
               | GB/s) and 128 lanes of PCIe 5.0 (~504 GB/s), but none of
               | the previous comments were specifically about single-
               | socket Genoa and I wasn't going to silently switch from
               | considering dual-socket in one sentence to single-socket
               | in the next sentence.
        
               | zamadatix wrote:
               | Ah yes, I forgot about the 48 lane 2 socket interconnect
               | mode which does allow you to still aggregate some
               | additional lanes.
        
           | convolvatron wrote:
            | It also may be limited on IOPS. You _can_ really just
            | work with latency, but you need to consider the pipeline
            | depth.
        
         | [deleted]
        
         | paulddraper wrote:
          | 10x is reasonable-ish for bandwidth, not for latency.
        
         | mbuda wrote:
          | I guess when you compare pure hardware performance, 10x is
          | very optimistic from the perspective of disk. The author
          | probably based this number on some specific
          | application/database context / bias in the measurements.
          | But yeah, the pure hardware difference might be hugely
          | different compared to that number.
        
         | jandrewrogers wrote:
         | Recent database engine designs tend to be bandwidth bound. The
         | difference in bandwidth between memory and modern storage
         | hardware is much smaller than you might expect. Really taking
          | advantage of that storage bandwidth is more difficult than
          | with memory, as it requires some top-notch scheduler design.
        
           | refset wrote:
           | In a similar vein, this was news to me recently:
           | 
           | > here's a chart comparing the throughputs of typical memory,
           | I/O and networking technologies used in servers in 2020
           | against those technologies in 2023
           | 
           | > Everything got faster, but the relative ratios also
           | completely flipped
           | 
           | > memory located remotely across a network link can now be
           | accessed with no penalty in throughput
           | 
           | The graphs demonstrate it very clearly:
           | https://blog.enfabrica.net/the-next-step-in-high-
           | performance...
        
         | pclmulqdq wrote:
         | 10,000 is for spinning rust. With SSDs on NVMe it's about 50x,
         | so pick 10x or 100x to be your number.
         | 
         | This should have been a revolution in DB design, IMO, but it
         | kind of hasn't been.
        
       | bullen wrote:
        | The whole point of a DB is to remember the data after a power
        | cycle.
        | 
        | So whichever way you turn it, both solutions go to disk?
        
         | CyberDildonics wrote:
          | _The whole point of a DB is to remember the data after a
          | power cycle._
         | 
         | You are thinking of a file system.
        
           | DaiPlusPlus wrote:
           | Databases don't have to use a file-system - look at old Big
           | Iron systems where the database _is_ the file-system.
           | 
           | In the context of Linux/Unix/etc, the file-system is just
           | another API for the OS - consider /dev/null or /proc - those
           | are certainly in the file-system but they aren't tied to
           | persistent storage.
        
             | CyberDildonics wrote:
             | _Databases don't have to use a file-system_
             | 
             | I didn't say they did.
             | 
             |  _consider /dev/null or /proc - those are certainly in the
             | file-system but they aren't tied to persistent storage._
             | 
             | I'm not sure what point you're trying to make here.
        
         | avianlyric wrote:
          | Sure, but querying that data is also important, which means
          | the question of "do you optimise for in-memory queries, or
          | on-disk queries" matters.
          | 
          | Simply saying you want the data to be durable isn't that
          | interesting or hard to achieve; there are plenty of ways of
          | achieving durable storage. The hard part is doing durable
          | storage while also solving problems like query speed and
          | concurrency control.
        
           | Guvante wrote:
            | It depends on what you mean by durable. Claiming that
            | persistence to disk is trivial compared to returning
            | values is a stretch IMHO. Modern DBs still have bugs where
            | the data returned is different from what is permanently
            | persisted, leading to inconsistency errors.
           | 
           | And that is ignoring performance queries that don't require
           | any guarantees.
        
       ___________________________________________________________________
       (page generated 2023-09-02 23:00 UTC)