[HN Gopher] In-memory vs. disk-based DB: Why do you need a large...
___________________________________________________________________
In-memory vs. disk-based DB: Why do you need a larger than memory
architecture?

Author : taubek
Score  : 47 points
Date   : 2023-09-02 17:33 UTC (5 hours ago)

(HTM) web link (memgraph.com)
(TXT) w3m dump (memgraph.com)

  | moomoo11 wrote:
  | I use LevelDB for cache on low-cost resources, i.e. 1 vCPU,
  | 0.5 GB RAM, 5-10 GB SSD.
  |
  | Gives me great performance (especially paired with Go), and I'm
  | able to deliver 6-10k rps for a few dollars a month.

  | cartweheel wrote:
  | They've misunderstood what an in-memory database is, though. The
  | distinction only makes sense from a historical perspective, with
  | on-disk formats and associated algorithms that worked well for
  | execution from spinning iron. In fact, Neo4j is also an in-memory
  | database, since it will slow to a complete crawl on an HDD. The
  | fact that it supports swapping pages to and from disk is better
  | likened to virtual memory than to any specific support for
  | executing queries efficiently from disk.

  | hakunin wrote:
  | Not 100% what the article is about, just a short story.
  |
  | In one of my old jobs we had megabytes of infrequently accessed
  | static key-value data. If we simply loaded it into a const (i.e.
  | a hash table), it would blow up the RAM so much that we would
  | need to upgrade to bigger VPSes. If we put it in a database, it
  | would be annoying to keep those tables up to date and track
  | changes in them.
  |
  | I figured this was one of those in-between use cases where the
  | best solution is to have zero-RAM lookups from SSD. In my case, I
  | wrote a little Ruby library[1] that arranges data in equal cells
  | in a file and performs binary searches via `pread`. This was
  | perfect for us, because we kept the data in our repos, sacrificed
  | no RAM at runtime, SSD lookups were fast enough, and we didn't
  | have to support a more elaborate DB.
  |
  | [1]: https://github.com/maxim/wordmap

  | teaearlgraycold wrote:
  | Why not commit a SQLite DB?

  | hakunin wrote:
  | I considered that, but couldn't find a way to precisely control
  | the amount of RAM used when you read from it.

  | jerrygenser wrote:
  | I have a similar use case and I have had success reading from
  | SQLite files embedded in my repo.

  | boywitharupee wrote:
  | Basically, your solution ended up residing on disk, but the data
  | was fragmented in such a way that efficient lookup was possible?
  | Did this also copy from kernel space to user space?

  | hakunin wrote:
  | > the data was fragmented in such a way that efficient lookup
  | > was possible
  |
  | Yeah, there's a build step that sorts and arranges data into
  | "cells", making binary search possible.
  |
  | Not sure about kernel/user space. I'm just calling `pread` from
  | Ruby, so only a few bytes are loaded per lookup.

  | rubiquity wrote:
  | > sacrificed no RAM at runtime
  |
  | Those lookups eventually made their way into the kernel's page
  | cache.

  | [deleted]

  | hakunin wrote:
  | Are you saying that the file would eventually be loaded into the
  | page cache in its entirety? Even if true, I imagine it still
  | wouldn't result in the OOM killer taking down the server/worker
  | daemons on account of this data?

  | rubiquity wrote:
  | Depending on the file size and available memory, yes. In the
  | event memory needs to be reclaimed, Linux's memory management
  | system will free pages from the page cache first if processes
  | need more anonymous memory.
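For concreteness, here is a minimal Go sketch of the fixed-cell
lookup hakunin describes above: a build step sorts records and pads
them to a fixed width, so a reader can binary-search the file with
positional reads and never hold the dataset in process memory.
wordmap itself is Ruby; the file name, cell width, and key layout
below are illustrative assumptions, not its actual format.

    // Fixed-size-cell lookup: each record occupies cellSize bytes
    // (a NUL-padded key of keyWidth bytes, then a NUL-padded value),
    // and records are pre-sorted by key, so the file can be
    // binary-searched with positional reads. os.File.ReadAt is
    // pread(2) on Linux: no seek state, no buffering, no heap growth.
    package main

    import (
        "bytes"
        "fmt"
        "os"
        "sort"
    )

    const (
        cellSize = 64 // total bytes per record (illustrative)
        keyWidth = 16 // NUL-padded key prefix of each cell (illustrative)
    )

    func lookup(f *os.File, key string) (string, bool) {
        info, err := f.Stat()
        if err != nil {
            return "", false
        }
        n := int(info.Size() / cellSize)

        // Pad the probe key the same way the build step pads stored
        // keys (keys longer than keyWidth would be truncated).
        probe := make([]byte, keyWidth)
        copy(probe, key)

        cell := make([]byte, cellSize)
        // Find the first cell whose key is >= probe.
        i := sort.Search(n, func(i int) bool {
            if _, err := f.ReadAt(cell, int64(i)*cellSize); err != nil {
                return true // converge; the post-check below catches it
            }
            return bytes.Compare(cell[:keyWidth], probe) >= 0
        })
        if i == n {
            return "", false
        }
        if _, err := f.ReadAt(cell, int64(i)*cellSize); err != nil {
            return "", false
        }
        if !bytes.Equal(cell[:keyWidth], probe) {
            return "", false
        }
        return string(bytes.TrimRight(cell[keyWidth:], "\x00")), true
    }

    func main() {
        f, err := os.Open("data.cells") // hypothetical build-step output
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if v, ok := lookup(f, "hello"); ok {
            fmt.Println("value:", v)
        } else {
            fmt.Println("not found")
        }
    }

The fixed width is what makes this work without any in-memory index:
record i always lives at byte offset i*cellSize, so offset arithmetic
replaces the index entirely.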
  | CraigJPerry wrote:
  | I think I'm missing something here: could you not just mmap() the
  | file in and let the kernel take care of memory pressure for you?

  | lisper wrote:
  | Sure -- until your system crashes and data is left in an
  | inconsistent state, because some of it was written to
  | non-volatile storage and some of it wasn't, and your OS had no
  | concept of transactional consistency, because mmap is a leaky
  | abstraction.

  | posnet wrote:
  | Not a problem here. Read the top-level comment.
  |
  | "static key-value data. ... a const (i.e. a hash table)"

  | sschnei8 wrote:
  | https://db.cs.cmu.edu/mmap-cidr2022/

  | rubiquity wrote:
  | Good paper, and overall the right opinion, but not very relevant
  | to a system as simple as the one OP described, since they weren't
  | dealing with safety, transactions, etc.

  | hakunin wrote:
  | I'm no systems programmer, but I remember trying to research the
  | mmap approach. It's been 3 years, so I'm not sure what stopped
  | me, but something didn't feel right. Perhaps it was the lack of
  | control over how memory is used. I could clearly see how not to
  | use it, and didn't want any fluctuations.
  |
  | Edit: oh, and I think I did come across some article like the
  | one linked in the neighbor comment. It's starting to come back.
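A sketch of the variant CraigJPerry suggests, using the same
illustrative cell layout as above. For hakunin's read-only case the
crash-consistency objection doesn't apply, since nothing is ever
written back; the mapping is just another view of the page cache,
which the kernel can reclaim under memory pressure. Unix-only, since
it uses syscall.Mmap:

    // mmap() the read-only cell file and binary-search it in place.
    // Pages are faulted in on first touch and reclaimed by the
    // kernel like any other page cache; the process heap never grows
    // with the file.
    package main

    import (
        "bytes"
        "fmt"
        "os"
        "sort"
        "syscall"
    )

    const (
        cellSize = 64 // same illustrative layout as the pread sketch
        keyWidth = 16
    )

    func main() {
        f, err := os.Open("data.cells") // hypothetical build-step output
        if err != nil {
            panic(err)
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            panic(err)
        }

        // Map the whole file read-only.
        data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        n := len(data) / cellSize
        probe := make([]byte, keyWidth)
        copy(probe, "hello")

        // Binary search directly over the mapping; each comparison
        // faults in at most one page.
        i := sort.Search(n, func(i int) bool {
            return bytes.Compare(data[i*cellSize:i*cellSize+keyWidth],
                probe) >= 0
        })
        if i < n && bytes.Equal(data[i*cellSize:i*cellSize+keyWidth], probe) {
            value := data[i*cellSize+keyWidth : (i+1)*cellSize]
            fmt.Printf("value: %q\n", bytes.TrimRight(value, "\x00"))
        } else {
            fmt.Println("not found")
        }
    }

Compared with the pread version, the code is simpler, but how much of
the file stays resident is entirely up to the kernel -- which is
exactly the lack of control hakunin was wary of.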
  | andersa wrote:
  | > Fetching data from disk is something that everyone strives to
  | > avoid since it takes approximately 10x more time than using it
  | > from main memory.
  |
  | Did they mean 10 _thousand_ times? Or is the in-memory version
  | that inefficient?

  | Darych wrote:
  | Based on this resource,
  | https://gist.github.com/hellerbarde/2843375, a 1 MB sequential
  | read from SSD is just 4x slower than the same read from main
  | memory. For random reads, main memory is faster, of course. I
  | believe the author meant some average value.

  | pavlov wrote:
  | Yes, even with SSDs it seems like 10x is very optimistic. It
  | should be several orders of magnitude.

  | Moto7451 wrote:
  | I think they're playing it safe with their points of comparison.
  | DDR5 supports 64,000 MB/s per channel and x4 PCIe 5.0 NVMe SSDs
  | support 10,000 MB/s. Depending on how many memory channels and
  | what RAID you use, I think a 10x improvement over
  | high-performance storage is unimpeachable. Memory latency being
  | better than SSD latency will really benefit memory depending on
  | workload, but I don't think you can just throw one number out
  | there to represent that.
  |
  | Now if you're comparing to spinning rust, memory is definitely
  | going to blow it away, but commodity hardware isn't running tens
  | or hundreds of TBs of memory. Memory-to-SSD comparisons seem
  | right.

  | semi-extrinsic wrote:
  | For example, the latest AMD Genoa has 12 memory channels per
  | socket; at dual socket and with enough DIMMs, that's a 75x speed
  | advantage even if you compare with RAID0 of high-performance
  | NVMe.

  | wtallis wrote:
  | > even if you compare with RAID0 of high performance NVMe.
  |
  | Do you mean a RAID0 of just two or four NVMe SSDs? It's
  | absolutely ridiculous to count aggregate DRAM bandwidth across
  | two CPU sockets and not do the same for PCIe lanes. A fair
  | comparison is that Genoa has about twice as much DRAM bandwidth
  | as PCIe bandwidth, though in a fully-loaded database server some
  | portion of the PCIe bandwidth will be used for networking rather
  | than storage.

  | zamadatix wrote:
  | FWIW I think you have the aggregation backwards: dual-socket
  | Genoa would have 24 channels of DRAM but sacrifice some of the
  | PCIe lanes for the interconnect. Your numbers work out right
  | though, as single socket is actually about 1:1 in RAM vs PCIe
  | bandwidth, so dual socket would still come out roughly 2:1.
  |
  | I think 10x is a fair rough number though, depending on your
  | access pattern.

  | wtallis wrote:
  | Dual-socket Genoa would be 24 channels at DDR5-4800 (38.4 GB/s
  | each) for a total of ~921.6 GB/s. Typical PCIe configurations
  | are 64 or 80 lanes per socket for dual-socket, so 128 or 160
  | lanes total; at PCIe 5.0 speed that's ~504 GB/s for 128-lane
  | configurations or ~630 GB/s for 160-lane configurations.
  |
  | Single-socket Genoa would be 12 channels of DRAM (~460.8 GB/s)
  | and 128 lanes of PCIe 5.0 (~504 GB/s), but none of the previous
  | comments were specifically about single-socket Genoa and I
  | wasn't going to silently switch from considering dual-socket in
  | one sentence to single-socket in the next sentence.

  | zamadatix wrote:
  | Ah yes, I forgot about the 48-lane 2-socket interconnect mode,
  | which does allow you to still aggregate some additional lanes.

  | convolvatron wrote:
  | It may also be limited on IOPS. You _can't_ really just work
  | with latency; you need to consider the pipeline depth.

  | [deleted]

  | paulddraper wrote:
  | 10x is reasonable-ish for bandwidth, not for latency.

  | mbuda wrote:
  | I guess when you compare pure performance of hardware, 10x is
  | very optimistic from the perspective of disk. Probably the
  | author based this number on some specific application/database
  | context or bias in the measurements. But yeah, the pure hardware
  | difference might be hugely different compared to that number.

  | jandrewrogers wrote:
  | Recent database engine designs tend to be bandwidth-bound. The
  | difference in bandwidth between memory and modern storage
  | hardware is much smaller than you might expect. Really taking
  | advantage of that storage bandwidth is more difficult than with
  | memory, as it requires some top-notch scheduler design.

  | refset wrote:
  | In a similar vein, this was news to me recently:
  |
  | > here's a chart comparing the throughputs of typical memory,
  | > I/O and networking technologies used in servers in 2020
  | > against those technologies in 2023
  |
  | > Everything got faster, but the relative ratios also
  | > completely flipped
  |
  | > memory located remotely across a network link can now be
  | > accessed with no penalty in throughput
  |
  | The graphs demonstrate it very clearly:
  | https://blog.enfabrica.net/the-next-step-in-high-performance...

  | pclmulqdq wrote:
  | 10,000 is for spinning rust. With SSDs on NVMe it's about 50x,
  | so pick 10x or 100x to be your number.
  |
  | This should have been a revolution in DB design, IMO, but it
  | kind of hasn't been.
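For readers who want to check the arithmetic in the wtallis/zamadatix
exchange above, a small Go calculation of those ratios. It assumes
DDR5-4800 (38.4 GB/s per channel) and PCIe 5.0 at 32 GT/s per lane
with 128b/130b encoding; real systems lose a little more to protocol
overhead, so treat the ratios as upper bounds:

    // Back-of-the-envelope DRAM vs. PCIe bandwidth for the Genoa
    // configurations discussed in the thread.
    package main

    import "fmt"

    func main() {
        const (
            dramPerChannel = 38.4                 // GB/s, DDR5-4800
            pciePerLane    = 32.0 * 128 / 130 / 8 // GB/s, ~3.94
        )

        configs := []struct {
            name            string
            channels, lanes int
        }{
            {"single-socket Genoa", 12, 128},
            {"dual-socket Genoa (128 lanes)", 24, 128},
            {"dual-socket Genoa (160 lanes)", 24, 160},
        }

        for _, c := range configs {
            dram := dramPerChannel * float64(c.channels)
            pcie := pciePerLane * float64(c.lanes)
            fmt.Printf("%-30s DRAM %6.1f GB/s  PCIe %6.1f GB/s  ratio %.2f:1\n",
                c.name, dram, pcie, dram/pcie)
        }
    }

This prints ratios of roughly 0.9:1 for single socket and 1.8:1 for
dual socket with 128 lanes, in line with the ~1:1 and ~2:1 estimates
in the thread.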
  | bullen wrote:
  | The whole point of a DB is to remember the data after a power
  | cycle.
  |
  | So whichever way you turn it, both solutions go to disk?

  | CyberDildonics wrote:
  | _The whole point of a DB is to remember the data after power
  | cycle._
  |
  | You are thinking of a file system.

  | DaiPlusPlus wrote:
  | Databases don't have to use a file-system - look at old Big Iron
  | systems where the database _is_ the file-system.
  |
  | In the context of Linux/Unix/etc., the file-system is just
  | another API for the OS - consider /dev/null or /proc - those are
  | certainly in the file-system but they aren't tied to persistent
  | storage.

  | CyberDildonics wrote:
  | _Databases don't have to use a file-system_
  |
  | I didn't say they did.
  |
  | _consider /dev/null or /proc - those are certainly in the
  | file-system but they aren't tied to persistent storage._
  |
  | I'm not sure what point you're trying to make here.

  | avianlyric wrote:
  | Sure, but querying that data is also important. Which means the
  | question of "do you optimise for in-memory queries, or on-disk
  | queries?" is important.
  |
  | Simply saying you want the data to be durable isn't that
  | interesting or hard to achieve; there are plenty of ways of
  | achieving durable storage. The hard part is providing durable
  | storage while also solving problems like query speed and
  | concurrency control.

  | Guvante wrote:
  | It depends on what you mean by durable. Claiming that
  | persistence to disk is trivial when compared to returning values
  | is a stretch, IMHO. Modern DBs still have bugs where the data
  | returned differs from what was permanently persisted, leading to
  | inconsistency errors.
  |
  | And that is ignoring the performance of queries that don't
  | require any guarantees.
___________________________________________________________________
(page generated 2023-09-02 23:00 UTC)