[HN Gopher] Does my data fit in RAM?
___________________________________________________________________

Does my data fit in RAM?

Author : louwrentius
Score  : 92 points
Date   : 2022-08-02 19:49 UTC (3 hours ago)

(HTM) web link (yourdatafitsinram.net)
(TXT) w3m dump (yourdatafitsinram.net)

| nsxwolf wrote:
| 64 TiB fits in a Dell PowerEdge R840 with 6 TB max RAM... how
| exactly?

| louwrentius wrote:
| Please take a look further down the list. For your use case, a
| Power System E980 may be just enough, or too small. :-)

| mciancia wrote:
| Maybe someone mixed up regular DRAM and Optane DIMMs?

| staticassertion wrote:
| mine does not

| louwrentius wrote:
| The original site made by lukegb inspired me because of its
| down-to-earth simplicity. Scaling vertically is often so much
| easier and better in so many dimensions than creating a complex
| distributed computing setup.
|
| This is why I recreated the site when it went down quite a while
| ago.
|
| The recent article "Use One Big Server"[0] inspired me to
| (re)submit this website to HN because it addresses the same
| topic. I like this new article so much because in this day and
| age of the cloud, people tend to forget how insanely fast and
| powerful modern servers have become.
|
| And if you don't have the budget for new equipment, the
| second-hand stuff from a few years back is still beyond amazing,
| and the prices are very reasonable compared to cloud costs.
| Sure, running bare metal co-located somewhere has its own cost,
| but it's not that big of a deal, and many issues can be dealt
| with using 'remote hands' services.
|
| To be fair, the article admits that in the end it's really about
| your organisation's specific circumstances and thus your
| requirements. Physical servers and/or vertical scaling may not
| (always) be the right answer. That said, do yourself a favour
| and take this option seriously, or at least consider it. You can
| even do an experiment: buy some second-hand gear just to gain
| some experience with hardware, if you don't have it already, and
| do a trial in a co-location.
|
| Now that we are talking: yourdatafitsinram.net runs on a
| Raspberry Pi 4, which in turn is running on solar power.[1] (The
| blog and this site are both running on the same host.)
|
| [0]: https://news.ycombinator.com/item?id=32319147
|
| [1]: https://louwrentius.com/this-blog-is-now-running-on-solar-
| po...

| karamanolev wrote:
| > many issues can be dealt with using 'remote hands' services.
|
| I have a few second-hand HP/Dell/Supermicro systems running
| colocated. I find that for all software issues, remote
| management / IPMI / KVM over IP is perfectly sufficient. Remote
| hands are needed only for actual hardware issues, most of which
| are "replace this component with an identical one". Usually an
| HDD, if you're running those. Overall, I'm quite happy with the
| setup, and it's very high on the value/$ spectrum.

| louwrentius wrote:
| Yes, I bet a lot of people aren't even aware of the IPMI / KVM
| over IP capabilities that servers have had for decades, which
| make hardware management (manual or automated!) much easier.
|
| Remote hands is for the inevitable hardware failure (disk, PSU,
| fan) or human error (you somehow locked yourself out of IPMI
| remotely).
|
| P.S. I have an HP ProLiant DL380 G8 with 128 GB of memory and 20
| physical cores as a lab system for playing with many virtual
| machines. I turn it on and off on demand using IPMI.

| toast0 wrote:
| IPMI is nice, although the older you go, the more _particular_
| it gets. I had professional experience with the Supermicro Xeon
| E5-2600 series v1-4, and recently started renting a
| previous-generation server[1], and it's worse than the ones I
| used before. It's still serviceable, though, but I'm not sure if
| it's using a dedicated LAN, because the KVM and the SOL drop out
| when the OS starts or ends; they come back, but you miss early
| boot messages.
|
| It's definitely worth the effort to script starting the KVM, and
| maybe even the SOL. If you've got a bunch of servers, you should
| script the power management as well; if nothing else, you want
| to rate-limit power commands across your fleet to prevent
| accidental mass restarts. Intentional mass restarts can probably
| happen through the OS, so one power command per second across
| your fleet is probably fine. (You can always hack out the rate
| limit if you're really sure.)
|
| [1] I don't need a whole server, but for $30/month, when I
| wanted to leave my VPS behind for a few reasons anyway...
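A rough sketch of the scripted, rate-limited power management toast0
describes, in Python. It assumes ipmitool is installed; the BMC host
names and credentials are placeholders, and one command per second is
the suggested fleet-wide limit:

    import subprocess
    import time

    # Placeholder fleet and credentials; a real deployment would read
    # these from an inventory file and a secret store.
    BMC_HOSTS = ["bmc-01.example.net", "bmc-02.example.net"]

    for host in BMC_HOSTS:
        # 'power status' is a read-only check; 'power on/off/cycle'
        # would go through the same loop and rate limit.
        subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", host,
             "-U", "admin", "-P", "secret", "power", "status"],
            check=False,
        )
        time.sleep(1)  # at most one power command per second fleet-wide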
| baisq wrote:
| Why is table.html loaded as an external resource instead of
| being in index.html proper?

| [deleted]

| louwrentius wrote:
| I can't remember why I did that, probably to keep the data
| separate from the rest of the code.

| game-of-throws wrote:
| I just confirmed that 640 KB fits in RAM. That's enough for me.

| rmetzler wrote:
| 640K ought to be enough for anybody.

| mech422 wrote:
| Thanks Bill!

| tester756 wrote:
| > Gates himself has strenuously denied making the comment. In a
| newspaper column that he wrote in the mid-1990s, Gates responded
| to a student's question about the quote: "I've said some stupid
| things and some wrong things, but not that. No one involved in
| computers would ever say that a certain amount of memory is
| enough for all time." Later in the column, he added, "I keep
| bumping into that silly quotation attributed to me that says
| 640K of memory is enough. There's never a citation; the
| quotation just floats like a rumor, repeated again and again."

| vlunkr wrote:
| Amazing. This has been the solution to Postgres issues for me.
| Just add enough memory that everything, or at least everything
| that is accessed frequently, can fit in RAM. Suddenly everything
| is cached and fast.

| hyperman1 wrote:
| Funny, it lets you click to negative amounts of RAM. My -1 PiB
| fits in RAM, so having it as a unit is not useless. (It also
| accepts fractions, but not octal.)

| antisthenes wrote:
| If you're wondering, the cutoff is 64 TiB.
|
| That's the amount of RAM on an IBM Power E980 system.

| baisq wrote:
| How much does that cost?

| edmundsauto wrote:
| $8.5 million, according to a sibling comment.

| dang wrote:
| Related:
|
| _Does my data fit in RAM?_ -
| https://news.ycombinator.com/item?id=22309883 - Feb 2020 (162
| comments)

| Cwizard wrote:
| Anyone have any recommendations for a SQL engine that works on
| in-memory data and has a simple/monolithic architecture? Our
| data is about 50-100 GB (uncompressed) and thus easily fits into
| memory. I am sure we could do our processing with something like
| polars or pandas in memory quite quickly, but we prefer a SQL
| interface. Using Postgres is still quite slow, even when it has
| more than enough memory available, compared to something like
| DuckDB. DuckDB has other limitations, however. I've been eyeing
| MemSQL, but that also seems to be targeted more towards multi-
| machine deployments.

| chaxor wrote:
| SQLite is almost always the answer

| mritchie712 wrote:
| what limit are you hitting with duckdb?
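For readers who haven't tried DuckDB, a minimal sketch of the
in-process, in-memory route Cwizard is weighing; the Parquet file and
column names are made up for illustration:

    import duckdb

    # An in-memory database inside the current process; no server to run.
    con = duckdb.connect(":memory:")

    # Load the (hypothetical) dataset once, then query it with plain SQL.
    con.execute(
        "CREATE TABLE events AS SELECT * FROM read_parquet('events.parquet')"
    )
    rows = con.execute(
        "SELECT user_id, count(*) AS n FROM events GROUP BY user_id ORDER BY n DESC"
    ).fetchall()
    print(rows[:10])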
| giraffe_lady wrote:
| sqlite?

| somekyle wrote:
| Is the point of this that you can do large-scale data processing
| without the overhead of distribution if you're willing to pay
| for the kind of hardware that can give you fast random access to
| all of it?

| nattaylor wrote:
| Yes, take a look at the "inspired by" tweet [0]
|
| [0] https://twitter.com/garybernhardt/status/600783770925420546

| civilized wrote:
| Has anyone tried firing up Pandas or something to load a
| multi-TB table? Would be interested to see if you run into some
| hidden snags.

| jdeaton wrote:
| I've done this, though the data in the table was split across
| DataFrames in many concurrent processes.
| https://stackoverflow.com/questions/49438954/python-shared-m...

| itamarst wrote:
| There's just a huge amount of waste in many cases, which is very
| easy to fix. For example, if we have a list of fractions
| (0.0-1.0):
|
| * Python list of N Python floats: 32xN bytes (approximate; each
| Python float is 24 bytes, plus an 8-byte pointer for each item
| in the list)
|
| * NumPy array of N double floats: 8xN bytes
|
| * Hey, we don't need that much precision, let's use 32-bit
| floats in NumPy: 4xN
|
| * Actually, values of 0-100 are good enough, let's just use
| uint8 in NumPy and divide by 100 if necessary to get the
| fraction: N bytes
|
| And now we're down to 3% of the original memory usage, quite
| possibly with no meaningful impact on the application.
|
| (See e.g.
| https://pythonspeed.com/articles/python-integers-memory/ and
| https://pythonspeed.com/articles/pandas-reduce-memory-lossy/ for
| longer prose versions that approximate the above.)

| deckard1 wrote:
| Interesting. Python doesn't use tagged pointers? I would think
| most dynamic languages would store an immediate char/float/int
| in a single tagged 32-bit/64-bit word. That's some crazy
| overhead.

| acdha wrote:
| This has been talked about for years, but I believe it's still
| complicated by C API compatibility. The most recent discussion I
| see is here:
|
| https://github.com/faster-cpython/ideas/discussions/138
|
| Victor Stinner's experiment showed some performance regressions,
| too:
|
| https://github.com/vstinner/cpython/pull/6#issuecomment-6561...

| nneonneo wrote:
| Absolutely everything in CPython is a PyObject, and that can't
| be changed without breaking the C API. A PyObject contains
| (among other things) a type pointer, a reference count, and a
| data field; none of these things can be changed without (again)
| breaking the C API.
|
| There have definitely been attempts to modernize; the HPy
| project (https://hpyproject.org/), for instance, moves towards a
| handle-oriented API that keeps implementation details private
| and thus enables certain optimizations.

| [deleted]

| BLanen wrote:
| You're describing operations done on data in memory to save
| memory. That list of fractions still needs to be in memory at
| some point. And if you're batching, this whole discussion goes
| out of the window.

| rcoveson wrote:
| Why would the whole original dataset need to be in memory all at
| once to operate on it value-by-value and put it into an array?

| BLanen wrote:
| If the whole original dataset doesn't need to be in memory all
| at once, there isn't even an issue to begin with.

| saltcured wrote:
| I think the point is that you can use a streaming IO approach to
| transcode or load data into the compact representation in
| memory, which is then used by whatever algorithm actually needs
| the in-memory access. You don't have to naively load the entire
| serialization from disk into memory.
|
| This is one reason projects like Twitter popularized
| serializations like json-stream in the past, to make it even
| easier to incrementally load a large file with basic software.
| Formats like TSV and CSV are also trivially easy to load with
| streaming IO.
|
| I think the mark of good data formats and libraries is that they
| allow for this. They should not force an in-memory,
| all-or-nothing approach, even if applications may want to put
| all their data in memory. If for no other reason, the
| application developer should be allowed to commit most of the
| system RAM to their actual data, not the temporary buffers
| needed during the IO process.
|
| If I want to push a machine to its limits on some large data, I
| do not want to be limited to 1/2, 1/3, or worse of the machine
| size because some IO library developers have all read an article
| like this and think "my data fits in RAM"! It's not "your data"
| nor your RAM when you are writing a library. If a user's actual
| end data might just barely fit in RAM, it will certainly fail if
| the deep call-stack of typical data analysis tools is cavalier
| about allocating additional whole-dataset copies during some
| synchronous load step...
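A minimal sketch combining the two ideas above, itamarst's dtype
downsizing and saltcured's streaming load; the row count, file name,
and one-value-per-line format are assumptions:

    import numpy as np

    N = 10_000_000  # hypothetical row count

    # float64: 8 bytes per value, vs. ~32 bytes per value for a Python
    # list of floats (24-byte float object plus an 8-byte list pointer).
    fractions = np.random.rand(N)
    print(fractions.nbytes)                     # 80,000,000 bytes

    # Half the storage, usually plenty for a 0.0-1.0 fraction.
    print(fractions.astype(np.float32).nbytes)  # 40,000,000 bytes

    # Integer percent (0-100), one byte per value: ~3% of the list version.
    percents = (fractions * 100).astype(np.uint8)
    print(percents.nbytes)                      # 10,000,000 bytes

    # Stream straight into the compact representation, so the full
    # float64 serialization never sits in memory all at once.
    out = np.empty(N, dtype=np.uint8)
    with open("fractions.csv") as f:            # placeholder file
        for i, line in enumerate(f):
            out[i] = round(float(line) * 100)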
| adamsmith143 wrote:
| OK, now I have 100s of columns. I should do this for every
| single one in every single dataset I have?

| staticassertion wrote:
| Yes?

| itamarst wrote:
| It takes like 5 minutes, and once you are in the habit it's
| something you do automatically as you write the code, so it
| doesn't actually cost you extra time.
|
| Efficient representation should be something you build into your
| data model; it will save you time in the long run.
|
| (Also, if you have 100s of columns you're hopefully already
| benefiting from something like NumPy or Arrow or whatever, so
| you're already doing better than you could be...)

| maerF0x0 wrote:
| > It takes like 5 minutes, and once you are in the habit it's
| something you do automatically as you write the code, so it
| doesn't actually cost you extra time.
|
| This is the argument I've been having my whole career with
| people who claim the better way is "too hard and too slow".
|
| I'm like, "Gee, funny how the thing you do the most often you're
| fastest at... could it be that you'd be just as fast at a better
| thing if you did it more than never?"

| dahfizz wrote:
| Hey, programmer time is expensive. It is our duty to always do
| the easiest, most wasteful thing. /s

| maerF0x0 wrote:
| Future me's time is free to today me. :wink:

| chaps wrote:
| Hah, I'd love to work with the datasets you work with if it
| takes five minutes to do this. Or maybe you're just suggesting
| it takes five minutes to write out "TEXT" for each column type?
|
| The data I work with is messy: from handwritten notes, multiple
| sources, millions of rows, etc., etc. A single data point that's
| written as "one" instead of 1 makes your whole idea fall on its
| face.

| itamarst wrote:
| For pile-of-strings data, there are still things you can do.
| E.g. in Pandas, if there are a small number of distinct values,
| switch to categoricals
| (https://pythonspeed.com/articles/pandas-load-less-data/, item
| 3). And there's a new column type for strings that uses less
| memory
| (https://pythonspeed.com/articles/pandas-string-dtype-memory/).
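A rough illustration of the categorical trick itamarst mentions, with
a made-up low-cardinality column; exact byte counts vary by pandas
version:

    import pandas as pd

    # A hypothetical messy, hand-entered column with few distinct values.
    s = pd.Series(["one", "1", "one", "two", "1"] * 1_000_000)

    # object dtype: a full Python string object per row.
    print(s.memory_usage(deep=True))

    # category dtype: one small integer code per row, plus a tiny
    # dictionary of the distinct values.
    print(s.astype("category").memory_usage(deep=True))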
| bee_rider wrote:
| Is enough data generated from handwritten notes that the memory
| cost is a serious problem? I was under the impression that
| hundreds of books' worth of text fit in a gigabyte.

| dvfjsdhgfv wrote:
| You'll need to decide on a case-by-case basis. Many datasets I
| work with are being generated by machines, come from network
| cards, etc. - these are quite consistent. Occasionally I deal
| with datasets prepared by humans, and these are mediocre at
| best; in those cases I spend a lot of time cleaning them up.
| Once that's done, I can clearly see whether some columns can be
| stored in a more efficient way or not. If the dataset is large,
| I do it, because it gives me extra freedom if I can fit
| everything in RAM. If it's small, I don't bother; my time is
| more expensive than the potential gains.

| [deleted]

| Goz3rr wrote:
| Am I the only one here using Chrome, or is everyone else just
| ignoring the table being broken? The author used an <object>
| tag, which just results in Chrome displaying "This plugin is not
| supported". I'm unsure why they didn't just use an iframe
| instead.

| louwrentius wrote:
| I can only state for myself that on my Mac running Chrome, the
| site works OK. I don't get any plugin messages.

| AdamJacobMuller wrote:
| https://www.redbooks.ibm.com/redpapers/pdfs/redp5510.pdf
|
| I want one of these.
|
| A system with 1 TB of RAM is $133k; $8.5 million for a system
| with 64 TB of RAM?

| chaxor wrote:
| Absolutely not. You can purchase a system with 1 TB of RAM and
| some decent CPUs, etc., for ~$25k. My lab just did this. That's
| far overpriced. $133k is closer to what you would spend if you
| used a machine with 1 TB "in the cloud".

| didgetmaster wrote:
| I still remember the first advertisement I saw for 1 TB of disk
| space. I think it was about 1997, and about the biggest
| individual drive you could buy was 2 GB. The system was the size
| of a couple of server racks, and they put 500 of those disks in
| it. It cost over $1M for the whole system.

| nimish wrote:
| That's insanely overpriced. A 128 GB LRDIMM is $1,000, so a TB
| on a commodity board with 8 memory slots would be $8k, plus a
| few thousand for the CPU and chassis.

| sophacles wrote:
| I find it mind-boggling that one can purchase a server with more
| RAM than the sum of all working storage media in my house.

| boredumb wrote:
| It would be neat if I could enter, say, 6 GB and see the
| machines that are closest in size instead of only the upper
| limit.

| bob1029 wrote:
| This kind of realization that "yes, it probably will" has
| recently inspired me to hand-build various database engines
| wherein the entire working set lives in memory. I do realize
| others have worked on this idea too, but I always wanted to play
| with it myself.
|
| My most recent prototypes use a hybrid mechanism that
| dramatically increases the supported working set size. Any
| property larger than a specific cutoff becomes a separate read
| operation against the durable log. For these properties, only
| the log's 64-bit offset is stored in memory. There is an
| alternative heuristic that allows the developer to add
| attributes which signify whether properties are to be maintained
| in memory or permitted to be secondary lookups.
|
| As a consequence, that 2 TB worth of RAM can properly track
| hundreds or even thousands of TB worth of effective data.
|
| If you are using modern NVMe storage, those reads from disk are
| stupid-fast even in the worst case. There's still a really good
| chance you will get a hit in the IO cache if your application
| isn't ridiculous and has some predictable access patterns.
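A toy Python version of the hybrid scheme bob1029 describes, to make
the idea concrete: values up to a cutoff live in RAM, while larger
ones are appended to a durable log and only their 64-bit offset is
kept in memory. The cutoff, file name, and record framing are all
assumptions:

    import os

    CUTOFF = 256  # bytes; properties larger than this spill to the log

    class HybridStore:
        def __init__(self, path="store.log"):
            self.log = open(path, "a+b")
            self.index = {}  # key -> bytes (inline value) or int (log offset)

        def put(self, key, value: bytes):
            if len(value) <= CUTOFF:
                self.index[key] = value  # small property: stays in memory
                return
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            # Length-prefixed record so get() knows how much to read back.
            self.log.write(len(value).to_bytes(8, "little") + value)
            self.log.flush()
            self.index[key] = offset  # large property: offset only

        def get(self, key):
            v = self.index[key]
            if isinstance(v, bytes):
                return v  # served from RAM
            self.log.seek(v)  # one read against the log; fast on NVMe
            n = int.from_bytes(self.log.read(8), "little")
            return self.log.read(n)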
| saltcured wrote:
| I don't mean to discourage personal exploration in any way, but
| when doing this sort of thing it can also be illuminating to
| consider the null hypothesis... what happens if you let the
| conventional software use a similarly enlarged RAM budget or
| fast storage?
|
| SQLite or PostgreSQL can be given some configuration/hints to be
| more aggressive about using RAM while still having their
| built-in capability to spill to storage rather than hit a hard
| limit. Or on Linux (at least), just allowing the OS page cache
| to sprawl over a large-RAM system may make the IO so fast that
| the database doesn't need to worry about special RAM usage. For
| PostgreSQL, this can just be hints to the optimizer to adjust
| the cost model and consider random access to be cheaper when
| comparing possible query plans.
|
| Once you do some sanity-check benchmarks of different systems
| like that, you might find different bottlenecks than expected,
| and this might highlight new performance optimization quests you
| hadn't even considered before. :-)

| none_to_remain wrote:
| Several years ago, my job at the time got dev and prod servers
| with a terabyte of RAM. I liked the dev server because a few
| times I found myself thinking "this would be easy to debug if I
| had an insane amount of RAM", and then I would remember I did.

| ailef wrote:
| Basically everything fits in RAM up to 24 TB.

| donkarma wrote:
| 64 TB, because of the mainframe

| jhbadger wrote:
| I was disappointed that the page didn't start offering vintage
| computers for very small datasets, given that it has bytes and
| kilobytes as options ("your data is too large for a VIC-20, but
| a Commodore 64 should handle it").

| louwrentius wrote:
| That is actually a funny idea; I didn't think about that. I only
| revived and refreshed what somebody else came up with and made
| before me.

| louwrentius wrote:
| Extra anecdote:
|
| Around 2000, a guy told me he was asked to support a server
| running a critical application that had very significant
| performance issues. He quickly figured out that the server had
| run out of memory. Option 1 was to rewrite the application to
| use less memory. He chose option 2: increase the server memory,
| going from 64 MB to 128 MB (yes, MB).
|
| At that time, 128 MB was an ungodly amount of memory, and memory
| was very expensive. But it was still cheaper to just throw RAM
| at the problem than to spend many hours rewriting the
| application.

| z3t4 wrote:
| Your data might even fit in the CPU L3 cache... But most likely
| you want your data to be persistent. Then again, how often do
| you actually "pull the plug" on your servers? And what happens
| when SSDs are fast enough? Will we see a whole new architecture
| where the working memory is integrated into the CPU and the main
| memory is persistent?

| mnd999 wrote:
| That was the promise of Optane. Unfortunately, nobody bought it.

| tester756 wrote:
| What do you mean by "nobody"?
|
| A significant percentage of the Fortune 500 used it.

| bee_rider wrote:
| On one hand, it would be cool to have some persistence in the
| CPU. On the other hand, imagine if rebooting a computer didn't
| make the problems all go away. What a nightmare.

| rob_c wrote:

| marcinzm wrote:
| We went with this approach. Pandas hit GIL limits, which made it
| too slow. Then we moved to Dask and hit GIL limits on the
| scheduler process. Then we moved to Spark and hit JVM GC
| slowdowns on the amount of allocated memory. Then we burned it
| all down and became hermits in the woods.

| [deleted]

| mritchie712 wrote:
| Did you consider ClickHouse? Joins are slow, but if your data is
| in a single table, it works really well.

| marcinzm wrote:
| We were trying to keep everything on one machine in (mostly)
| memory for simplicity. Once you open up the Pandora's box of
| distributed compute, there are a lot of options, including other
| ways of running Spark. But yes, in retrospect, we should have
| opened that box first.

| anko wrote:
| I have solved a similar problem in a similar way, and I've found
| polars <https://www.pola.rs/> to solve this quite well without
| needing ClickHouse. It has a Python library but does most
| processing in Rust, across multiple cores. I've used it for data
| sets up to about 20 GB, no worries, but my computer's RAM became
| the issue, not polars itself.

| marcinzm wrote:
| We were using 500+ GB of memory at peak and were expecting that
| to grow. If I remember correctly, we didn't go with polars
| because we needed to run custom apply functions on DataFrames.
| Polars had them, but the function took a tuple (not a DF or
| dict), which, when you've got 20+ columns, makes for really
| error-prone code. Dask and Spark both supported a batch
| transform operation, so the function took a Pandas DataFrame as
| input and output.
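A minimal polars sketch of the expression-based style that avoids the
GIL bottlenecks described above; the data is made up, and note the API
naming caveat in the comment:

    import polars as pl

    # Placeholder data; polars runs expressions in Rust across cores.
    # (groupby() is the 2022-era spelling; newer polars releases
    # renamed it to group_by().)
    df = pl.DataFrame({"user": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
    out = df.groupby("user").agg(pl.col("value").mean().alias("mean_value"))
    print(out)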
| mumblemumble wrote:
| I have decided that all solutions to questions of scale fall
| into one of two general categories. Either you can spend all
| your money on computers, or you can spend all your money on
| C/C++/Rust/Cython/Fortran/whatever developers.
|
| There's one severely under-appreciated factor that favors the
| first option: computers are commodities that can be acquired
| very quickly. Almost instantaneously, if you're in the cloud.
| Skilled lower-level programmers are very definitely not
| commodities, and growing your pool of them can easily take
| months or years.

| jbverschoor wrote:
| Buying hardware won't give you the same performance benefits as
| a better implementation/architecture.
|
| And if the problem is big enough, buying hardware will cause
| operational problems, so you'll need more people. And most
| likely you're not gonna wanna spend on people, so you get a
| bunch of people who won't fix the problem, but buy more
| hardware.

| mumblemumble wrote:
| Ayup.
|
| And yet, people still regularly choose to go down a path that
| leads there. Because business decisions are about satisficing,
| not optimizing. So "I'm 90% sure I will be able to cope with
| problems of this type, but it might cost as much as $10,000,000"
| is often favored over "I am 75% sure I might be able to solve
| problems of this type for no more than $500,000", when the
| hypothetical downside of not solving it is, "We might go out of
| business."

| marcinzm wrote:
| > And if the problem is big enough, buying hardware will cause
| operational problems, so you'll need more people. And most
| likely you're not gonna wanna spend on people, so you get a
| bunch of people who won't fix the problem, but buy more
| hardware.
|
| That's why people love the cloud.
___________________________________________________________________
(page generated 2022-08-02 23:01 UTC)