[HN Gopher] Does my data fit in RAM?
       ___________________________________________________________________
        
       Does my data fit in RAM?
        
       Author : louwrentius
       Score  : 92 points
       Date   : 2022-08-02 19:49 UTC (3 hours ago)
        
 (HTM) web link (yourdatafitsinram.net)
 (TXT) w3m dump (yourdatafitsinram.net)
        
       | nsxwolf wrote:
        | 64 TiB fits in a Dell PowerEdge R840 with 6 TB max RAM... how
        | exactly?
        
         | louwrentius wrote:
         | Please take a look further down the list. For your use case, a
         | Power System E980 may just be enough or too small. :-)
        
         | mciancia wrote:
          | Maybe someone mixed up regular DRAM and Optane DIMMs?
        
       | staticassertion wrote:
       | mine does not
        
       | louwrentius wrote:
       | The original site made by lukegb inspired me because of the down-
       | to-earth simplicity. Scaling vertically is often so much easier
       | and better in so many dimensions than creating a complex
       | distributed computing setup.
       | 
       | This is why I recreated the site when it went down quite a while
       | ago.
       | 
       | The recent article "Use One Big Server"[0] inspired me to
       | (re)submit this website to HN because it addresses the same
       | topic. I like this new article so much because in this day and
       | age of the cloud, people tend to forget how insanely fast and
       | powerful modern servers have become.
       | 
        | And if you don't have the budget for new equipment, the second-
        | hand stuff from a few years back is still beyond amazing and the
        | prices are very reasonable compared to cloud cost. Sure, running
        | bare metal co-located somewhere has its own cost, but it's not
        | that big of a deal and many issues can be dealt with using
        | 'remote hands' services.
       | 
       | To be fair, the article admits that in the end it's really about
       | your organisation's specific circumstances and thus your
       | requirements. Physical servers and/or vertical scaling may not
       | (always) be the right answer. That said, do yourself a favour,
       | and do take this option seriously and at least consider it. You
       | can even do an experiment: buy some second-hand gear just to gain
       | some experience with hardware if you don't have it already and do
       | a trial in a co-location.
       | 
        | While we're on the topic: yourdatafitsinram.net runs on a
        | Raspberry Pi 4, which in turn runs on solar power.[1] (The
        | blog and this site are both running on the same host.)
       | 
       | [0]: https://news.ycombinator.com/item?id=32319147
       | 
       | [1]: https://louwrentius.com/this-blog-is-now-running-on-solar-
       | po...
        
         | karamanolev wrote:
         | > many issues can be dealt with using 'remote hands' services.
         | 
         | I have a few second-hand HP/Dell/Supermicro systems running
         | colocated. I find that for all software issues, remote
         | management / IPMI / KVM over IP is perfectly sufficient. Remote
          | hands are needed only for actual hardware issues, most of
          | which are "replace this component with an identical one".
          | Usually HDDs,
         | if you're running those. Overall, I'm quite happy with the
         | setup and it's very high on the value/$ spectrum.
        
           | louwrentius wrote:
            | Yes, I bet a lot of people aren't even aware of the IPMI/KVM-
            | over-IP capabilities that servers have had for decades, which
            | make hardware management (manual or automated!) much easier.
           | 
           | Remote hands is for the inevitable hardware failure (Disk,
           | PSU, Fan) or human error (you locked yourself out somehow
           | remotely from IPMI).
           | 
            | P.S. I have an HP ProLiant DL380 G8 with 128 GB of memory and
            | 20 physical cores as a lab system for playing with many
            | virtual machines. I turn it on and off on demand using IPMI.
        
           | toast0 wrote:
            | IPMI is nice, although the older you go, the more
            | _particular_ it gets. I had professional experience with the
            | SuperMicro Xeon E5-2600 series v1-v4, and recently started
            | renting a previous-generation server[1] and it's worse than
            | the ones I used before. It's still serviceable, though I'm
            | not sure if it's using a dedicated LAN, because the KVM and
            | the SOL drop out when the OS starts or ends; they'll come
            | back, but you miss early boot messages.
           | 
            | It's definitely worth the effort to script starting the KVM,
            | and maybe even the SOL. If you've got a bunch of servers, you
            | should script the power management as well; if nothing else,
            | you want to rate-limit power commands across your fleet to
            | prevent accidental mass restarts. Intentional mass restarts
            | can probably happen through the OS, so 1 power command per
            | second across your fleet is probably fine. (You can always
            | hack out the rate limit if you're really sure.)
           | 
           | [1] I don't need a whole server, but for $30/month when I
           | wanted to leave my VPS behind for a few reasons anyway...
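            | 
            | A minimal sketch of that rate limiting (not anyone's actual
            | tooling; it assumes ipmitool over lanplus, and the BMC
            | hostnames and credentials below are placeholders):
            | 
            |     import subprocess, time
            | 
            |     BMCS = ["bmc-01.example.net", "bmc-02.example.net"]
            | 
            |     for host in BMCS:
            |         # one IPMI power command per host, via its BMC
            |         subprocess.run(
            |             ["ipmitool", "-I", "lanplus", "-H", host,
            |              "-U", "admin", "-P", "secret",
            |              "chassis", "power", "cycle"],
            |             check=True)
            |         # at most one power command per second, fleet-wide
            |         time.sleep(1)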
        
       | baisq wrote:
       | Why is table.html loaded as an external resource instead of being
       | in index.html proper?
        
         | [deleted]
        
         | louwrentius wrote:
         | I can't remember why I did that, probably to keep the data
         | separate from the rest of the code.
        
       | game-of-throws wrote:
       | I just confirmed that 640 KB fits in RAM. That's enough for me.
        
         | rmetzler wrote:
         | 640K ought to be enough for anybody.
        
         | mech422 wrote:
         | Thanks Bill!
        
           | tester756 wrote:
           | >Gates himself has strenuously denied making the comment. In
           | a newspaper column that he wrote in the mid-1990s, Gates
           | responded to a student's question about the quote: "I've said
           | some stupid things and some wrong things, but not that. No
           | one involved in computers would ever say that a certain
           | amount of memory is enough for all time." Later in the
           | column, he added, "I keep bumping into that silly quotation
           | attributed to me that says 640K of memory is enough. There's
           | never a citation; the quotation just floats like a rumor,
           | repeated again and again."
        
       | vlunkr wrote:
        | Amazing. This has been the solution to Postgres issues for me.
        | Just add enough memory that everything, or at least everything
        | that is accessed frequently, can fit in RAM. Suddenly everything
       | is cached and fast.
        
       | hyperman1 wrote:
       | Funny, it lets you click to negative amounts of RAM. My -1 PiB
       | fits in RAM, so having it as a unit is not useless. (It also
       | accepts fractions but not octal)
        
       | antisthenes wrote:
       | If you're wondering, the cutoff is 64 TiB.
       | 
       | That's the amount of RAM on an IBM Power E980 System.
        
         | baisq wrote:
         | How much does that cost?
        
           | edmundsauto wrote:
           | $8.5 million according to a sibling comment.
        
       | dang wrote:
       | Related:
       | 
       |  _Does my data fit in RAM?_ -
       | https://news.ycombinator.com/item?id=22309883 - Feb 2020 (162
       | comments)
        
       | Cwizard wrote:
        | Anyone have any recommendations for a SQL engine that works on
        | in-memory data and has a simple/monolithic architecture? Our data
        | is about 50-100 GB (uncompressed) and thus easily fits into
        | memory. I am sure we could do our processing using something like
        | Polars or Pandas in memory quite quickly, but we prefer a SQL
        | interface. Using Postgres is still quite slow even when it has
        | more than enough memory available, compared to something like
        | DuckDB. DuckDB has other limitations, however. I've been eyeing
        | MemSQL, but that also seems to be targeted more towards multi-
        | machine deployments.
        
         | chaxor wrote:
         | SQLite is almost always the answer
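          | 
          | A minimal sketch of that route (hypothetical table and data;
          | opening ":memory:" keeps the whole database in RAM):
          | 
          |     import sqlite3
          | 
          |     con = sqlite3.connect(":memory:")
          |     con.execute("CREATE TABLE t (k TEXT, v REAL)")
          |     con.executemany("INSERT INTO t VALUES (?, ?)",
          |                     [("a", 1.0), ("b", 2.0), ("a", 3.0)])
          |     rows = con.execute(
          |         "SELECT k, SUM(v) FROM t GROUP BY k").fetchall()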
        
         | mritchie712 wrote:
         | what limit are you hitting with duckdb?
        
         | giraffe_lady wrote:
         | sqlite?
        
       | somekyle wrote:
       | Is the point of this that you can do large-scale data processing
       | without the overhead of distribution if you're willing to pay for
       | the kind of hardware that can give you fast random access to all
       | of it?
        
         | nattaylor wrote:
         | Yes, take a look at the "inspired by" tweet [0]
         | 
         | [0] https://twitter.com/garybernhardt/status/600783770925420546
        
       | civilized wrote:
       | Has anyone tried firing up Pandas or something to load a multi-TB
       | table? Would be interested to see if you run into some hidden
       | snags.
        
         | jdeaton wrote:
         | I've done this though the data in the table was split across
         | DataFrames in many concurrent processes.
         | https://stackoverflow.com/questions/49438954/python-shared-m...
        
       | itamarst wrote:
       | There's just a huge amount of waste in many cases which is very
       | easy to fix. For example, if we have a list of fractions
       | (0.0-1.0):
       | 
       | * Python list of N Python floats: 32xN bytes (approximate, the
       | Python float is 24 bytes + 8-byte pointer for each item in the
       | list)
       | 
       | * NumPy array of N double floats: 8xN bytes
       | 
       | * Hey, we don't need that much precision, let's use 32-bit floats
       | in NumPy: 4xN
       | 
       | * Actually, values of 0-100 are good enough, let's just use uint8
       | in NumPy and divide by 100 if necessary to get the fraction: N
       | bytes
       | 
       | And now we're down to 3% of original memory usage, and quite
       | possibly with no meaningful impact on the application.
       | 
       | (See e.g. https://pythonspeed.com/articles/python-integers-
       | memory/ and https://pythonspeed.com/articles/pandas-reduce-
       | memory-lossy/ for longer prose versions that approximate the
       | above.)
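        | 
        | A quick way to check those numbers yourself (a rough sketch;
        | exact sizes depend on the Python build, and NumPy is assumed to
        | be installed):
        | 
        |     import sys
        |     import numpy as np
        | 
        |     n = 1_000_000
        |     fracs = [i / n for i in range(n)]
        | 
        |     # list: 8-byte pointer per item + ~24-byte float objects
        |     list_bytes = (sys.getsizeof(fracs)
        |                   + sum(sys.getsizeof(x) for x in fracs))
        | 
        |     a64 = np.array(fracs)             # 8 bytes per value
        |     a32 = a64.astype(np.float32)      # 4 bytes per value
        |     a8 = (a64 * 100).astype(np.uint8)  # 1 byte per value
        | 
        |     print(list_bytes, a64.nbytes, a32.nbytes, a8.nbytes)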
        
         | deckard1 wrote:
         | interesting. Python doesn't use tagged pointers? I would think
         | most dynamic languages would store immediate char/float/int in
         | a single tagged 32-bit/64-bit word. That's some crazy overhead.
        
           | acdha wrote:
           | This has been talked about for years but I believe it's still
           | complicated by C API compatibility. The most recent
           | discussion I see is here:
           | 
           | https://github.com/faster-cpython/ideas/discussions/138
           | 
           | Victor Stinner's experiment showed some performance
           | regressions, too:
           | 
           | https://github.com/vstinner/cpython/pull/6#issuecomment-6561.
           | ..
        
           | nneonneo wrote:
           | Absolutely everything in CPython is a PyObject, and that
           | can't be changed without breaking the C API. A PyObject
           | contains (among other things) a type pointer, a reference
           | count, and a data field; none of these things can be changed
           | without (again) breaking the C API.
           | 
           | There have definitely been attempts to modernize; the HPy
           | project (https://hpyproject.org/), for instance, moves
           | towards a handle-oriented API that keeps implementation
           | details private and thus enables certain optimizations.
        
           | [deleted]
        
         | BLanen wrote:
         | You're describing operations done on data in memory to save
         | memory. That list of fractions still needs to be in memory at
         | some point. And if you're batching, this whole discussion goes
         | out of the window.
        
           | rcoveson wrote:
           | Why would the whole original dataset need to be in memory all
           | at once to operate on it value-by-value and put it into an
           | array?
        
             | BLanen wrote:
             | If the whole original dataset doesn't need to be in memory
             | all at once, there isn't even an issue to begin with.
        
               | saltcured wrote:
               | I think the point is that you can use a streaming IO
               | approach to transcode or load data into the compact
               | representation in memory, which is then used by whatever
               | algorithm actually needs the in-memory access. You don't
               | have to naively load the entire serialization from disk
               | into memory.
               | 
               | This is one reason projects like Twitter popularized
               | serializations like json-stream in the past, to make it
               | even easier to incrementally load a large file with basic
               | software. Formats like TSV and CSV are also trivially
               | easy to load with streaming IO.
               | 
               | I think the mark of good data formats and libraries is
               | that they allow for this. They should not force an in-
               | memory all or nothing approach, even if applications may
               | want to put all their data in memory. If for no other
               | reason, the application developer should be allowed to
               | commit most of the system RAM to their actual data, not
               | the temporary buffers needed during the IO process.
               | 
               | If I want to push a machine to its limits on some large
               | data, I do not want to be limited to 1/2, 1/3 or worse of
               | the machine size because some IO library developers have
               | all read an article like this and think "my data fits in
               | RAM"! It's not "your data" nor your RAM when you are
               | writing a library. If a user's actual end data might just
               | barely fit in RAM, it will certainly fail if the deep
               | call-stack of typical data analysis tools is cavalier
               | about allocating additional whole-dataset copies during
               | some synchronous load step...
        
         | adamsmith143 wrote:
         | Ok now I have 100s of columns. I should do this for every
         | single one in every single dataset I have?
        
           | staticassertion wrote:
           | Yes?
        
           | itamarst wrote:
           | It takes like 5 minutes, and once you are in the habit it's
           | something you do automatically as you write the code and so
           | it doesn't actually cost you extra time.
           | 
           | Efficient representation should be something you build into
           | your data model, it will save you time in the long run.
           | 
           | (Also if you have 100s of columns you're hopefully already
           | benefiting from something like NumPy or Arrow or whatever, so
           | you're already doing better than you could be... )
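            | 
            | For instance, in Pandas the narrowing can happen right at
            | load time (a sketch only; the file and column names here
            | are made up):
            | 
            |     import pandas as pd
            | 
            |     df = pd.read_csv(
            |         "events.csv",
            |         dtype={
            |             "fraction": "float32",  # 4 bytes, not 8
            |             "count": "uint8",       # values 0-255
            |             "label": "category",    # few distinct strings
            |         },
            |     )
            |     print(df.memory_usage(deep=True))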
        
             | maerF0x0 wrote:
             | > It takes like 5 minutes, and once you are in the habit
             | it's something you do automatically as you write the code
             | and so it doesn't actually cost you extra time.
             | 
              | This is the argument I've been having my whole career with
              | people who claim the better way is "too hard and too
              | slow".
              | 
              | I'm like "gee, funny how the thing you do the most often
              | you're fastest at... could it be that you'd be just as
              | fast at a better thing if you did it more than never?"
        
               | dahfizz wrote:
               | Hey, programmer time is expensive. It is our duty to
               | always do the easiest, most wasteful thing. /s
        
               | maerF0x0 wrote:
               | Future me's time is free to today me. :wink:
        
             | chaps wrote:
             | Hah, I'd love to work with the datasets you work with if it
             | takes five minutes to do this. Or maybe you're just
             | suggesting it takes five minutes to write out "TEXT" for
             | each column type?
             | 
             | The data I work with is messy, from hand written notes,
             | multiple sources, millions of rows, etc etc. A single point
             | that's written as "one" instead of 1 makes your whole idea
             | fall on its face.
        
               | itamarst wrote:
               | For pile-of-strings data, there are still things you can
               | do. E.g. in Pandas, if there are a small number of
               | different values, switch to categoricals
               | (https://pythonspeed.com/articles/pandas-load-less-data/
               | item 3). And there's a new column type for strings that
               | uses less memory
               | (https://pythonspeed.com/articles/pandas-string-dtype-
               | memory/).
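                | 
                | Rough illustration of the categorical savings (a
                | sketch; the ratio depends on how repetitive the
                | strings are):
                | 
                |     import pandas as pd
                | 
                |     s = pd.Series(["one", "two", "one"] * 100_000)
                |     cat = s.astype("category")
                |     print(s.memory_usage(deep=True))
                |     print(cat.memory_usage(deep=True))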
        
               | bee_rider wrote:
               | Is enough data generated from handwritten notes that the
               | memory cost is a serious problem? I was under the
               | impression that hundreds of books worth of text fit in a
               | gigabyte.
        
           | dvfjsdhgfv wrote:
            | You'll need to decide on a case-by-case basis. Many datasets
            | I work with are generated by machines, come from network
            | cards, etc. - these are quite consistent. Occasionally I
            | deal with datasets prepared by humans and these are mediocre
            | at best, and in those cases I spend a lot of time cleaning
            | them up. Once that's done, I can clearly see whether some
            | columns can be stored in a more efficient way or not. If the
            | dataset is large, I do it, because it gives me extra freedom
            | if I can fit everything in RAM. If it's small, I don't
            | bother; my time is more expensive than the potential gains.
        
           | [deleted]
        
       | Goz3rr wrote:
       | Am I the only one here using Chrome or is everyone else just
       | ignoring the table being broken? The author used an <object> tag
       | which just results in Chrome displaying "This plugin is not
       | supported". I'm unsure why they didn't just use an iframe
       | instead.
        
         | louwrentius wrote:
          | I can only speak for myself, but on my Mac running Chrome the
          | site works OK. I don't get any plugin messages.
        
       | AdamJacobMuller wrote:
       | https://www.redbooks.ibm.com/redpapers/pdfs/redp5510.pdf
       | 
       | I want one of these.
       | 
        | A system with 1 TB of RAM is $133k; $8.5M for a system with
        | 64 TB of RAM?
        
         | chaxor wrote:
          | Absolutely not. You can purchase a system with 1 TB of RAM and
          | some decent CPUs etc. for ~$25k. My lab just did this. That's
          | far overpriced. $133k is closer to what you would spend if you
          | used a machine with 1 TB "in the cloud".
        
         | didgetmaster wrote:
          | I still remember the first advertisement I saw for 1 TB of
          | disk space. I think it was around 1997, when about the biggest
          | individual drive you could buy was 2 GB. The system was the size
         | of a couple of server racks and they put 500 of those disks in
         | it. It cost over $1M for the whole system.
        
         | nimish wrote:
          | That's insanely overpriced. A 128 GB LRDIMM is $1,000, so a TB
          | on a commodity 8-memory-slot board would be $8k plus a few
          | thousand for the CPU and chassis.
        
       | sophacles wrote:
       | I find it mind boggling that one can purchase a server with more
       | RAM than the sum of all working storage media in my house.
        
       | boredumb wrote:
        | It would be neat if I could enter, say, 6 GB and see the
        | machines that are closest in size instead of only the upper
        | limit.
        
       | bob1029 wrote:
       | This kind of realization that "yes, it probably will" has
       | recently inspired me to hand-build various database engines
       | wherein the entire working set lives in memory. I do realize
       | others have worked on this idea too, but I always wanted to play
       | with it myself.
       | 
        | My most recent prototypes use a hybrid mechanism that
        | dramatically increases the supported working set size. Any
        | property larger than a specific cutoff becomes a separate read
        | operation against the durable log; for these properties, only
        | the log's 64-bit offset is stored in memory. There is also an
        | alternative heuristic that lets the developer add attributes
        | signifying whether properties are kept in memory or permitted
        | to be secondary lookups.
       | 
        | As a consequence, that 2 TB worth of RAM can properly track
        | hundreds or even thousands of TB worth of effective data.
        | 
        | If you are using modern NVMe storage, those reads to disk are
        | stupid-fast even in the worst case. There's still a really good
        | chance you will get a hit in the IO cache if your application
        | isn't ridiculous and has some predictable access patterns.
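        | 
        | A toy sketch of the general idea, not the actual engine (the
        | cutoff is arbitrary and durability/fsync handling is omitted):
        | 
        |     import os
        | 
        |     CUTOFF = 256  # bytes; larger values live only in the log
        | 
        |     class HybridStore:
        |         def __init__(self, path):
        |             self.log = open(path, "a+b")
        |             # key -> bytes (small) or (offset, length) (large)
        |             self.index = {}
        | 
        |         def put(self, key, value: bytes):
        |             self.log.seek(0, os.SEEK_END)
        |             offset = self.log.tell()
        |             self.log.write(value)  # append to the log
        |             if len(value) <= CUTOFF:
        |                 self.index[key] = value  # small: keep in RAM
        |             else:
        |                 # large: keep only the 64-bit offset in RAM
        |                 self.index[key] = (offset, len(value))
        | 
        |         def get(self, key):
        |             v = self.index[key]
        |             if isinstance(v, bytes):
        |                 return v
        |             offset, length = v
        |             self.log.seek(offset)
        |             return self.log.read(length)  # one NVMe read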
        
         | saltcured wrote:
         | I don't mean to discourage personal exploration in any way, but
         | when doing this sort of thing it can also be illuminating to
         | consider the null hypothesis... what happens if you let the
         | conventional software use a similarly enlarged RAM budget or
         | fast storage?
         | 
         | SQLite or PostgreSQL can be given some configuration/hints to
         | be more aggressive about using RAM while still having their
         | built-in capability to spill to storage rather than hit a hard
         | limit. Or on Linux (at least), just allowing the OS page cache
         | to sprawl over a large RAM system may make the IO so fast that
         | the database doesn't need to worry about special RAM usage. For
         | PostgreSQL, this can just be hints to the optimizer to adjust
         | the cost model and consider random access to be cheaper when
         | comparing possible query plans.
         | 
         | Once you do some sanity check benchmarks of different systems
         | like that, you might find different bottlenecks than expected,
         | and this might highlight new performance optimization quests
         | you hadn't even considered before. :-)
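          | 
          | Concretely, the "hints" are just a few settings. A sketch for
          | SQLite from Python (PostgreSQL has analogous knobs such as
          | shared_buffers, effective_cache_size and random_page_cost):
          | 
          |     import sqlite3
          | 
          |     con = sqlite3.connect("big.db")
          |     # negative cache_size is in KiB: roughly 8 GB of cache
          |     con.execute("PRAGMA cache_size = -8000000")
          |     # let SQLite mmap up to 16 GiB of the database file
          |     con.execute("PRAGMA mmap_size = 17179869184")
          |     # keep temp tables and indices in RAM
          |     con.execute("PRAGMA temp_store = MEMORY")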
        
       | none_to_remain wrote:
        | Several years ago, my job at the time got a dev and a prod
        | server with a terabyte of RAM. I liked the dev server because a
        | few times I found myself thinking "this would be easy to debug
        | if I had an insane amount of RAM" and then I would remember I
        | did.
        
       | ailef wrote:
        | Basically everything fits in RAM up to 24 TB.
        
         | donkarma wrote:
         | 64TB because of the mainframe
        
         | jhbadger wrote:
         | I was disappointed that the page didn't start offering vintage
         | computers for very small datasets given that it has bytes and
         | kilobytes as options ("your data is too large for a VIC-20, but
         | a Commodore 64 should handle it")
        
           | louwrentius wrote:
           | That is actually a funny idea, I didn't think about that. I
           | only revived and refreshed what somebody else came up with
           | and made before me.
        
       | louwrentius wrote:
       | Extra anecdote:
       | 
       | Around 2000, a guy told me he was asked to support very
       | significant performance issues with a server running a critical
       | application. He quickly figured out that the server ran out of
        | memory. Option 1 was to rewrite the application to use less
        | memory. He chose option 2: increase the server memory, going
        | from 64 MB to 128 MB (yes, MB).
       | 
       | At that time, 128 MB was an ungodly amount of memory and memory
       | was very expensive. But it was still cheaper to just throw RAM at
       | the problem than to spend many hours rewriting the application.
        
       | z3t4 wrote:
        | Your data might even fit in the CPU L3 cache... But most likely
        | you want your data to be persistent. Then again, how often do
        | you actually "pull the plug" on your servers!? And what happens
        | when SSDs are fast enough? Will we see a whole new architecture
        | where the working memory is integrated into the CPU and the main
        | memory is persistent?
        
         | mnd999 wrote:
          | That was the promise of Optane. Unfortunately, nobody bought it.
        
           | tester756 wrote:
           | What do you mean by "nobody"?
           | 
            | A significant % of the Fortune 500 used it.
        
         | bee_rider wrote:
         | On one hand it would be cool to have some persistence in the
         | CPU. On the other -- imagine if rebooting a computer didn't
         | make the problems all go away. What a nightmare.
        
       | rob_c wrote:
        
       | marcinzm wrote:
       | We went with this approach. Pandas hit GIL limits which made it
       | too slow. Then we moved to Dask and hit GIL limits on the
       | scheduler process. Then we moved to Spark and hit JVM GC
       | slowdowns on the amount of allocated memory. Then we burned it
       | all down and became hermits in the woods.
        
         | [deleted]
        
         | mritchie712 wrote:
          | Did you consider ClickHouse? Joins are slow, but if your data
          | is in a single table, it works really well.
        
           | marcinzm wrote:
            | We were trying to keep everything on one machine in (mostly)
            | memory for simplicity. Once you open up the Pandora's box of
            | distributed compute, there are a lot of options, including
            | other ways of running Spark. But yes, in retrospect, we
            | should have opened that box first.
        
             | anko wrote:
              | I have solved a similar problem in a similar way, and I've
              | found Polars <https://www.pola.rs/> to solve this quite
              | well without needing ClickHouse. It has a Python library
              | but does most processing in Rust, across multiple cores.
              | I've used it for datasets up to about 20 GB, no worries,
              | but my computer's RAM became the issue, not Polars itself.
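              | 
              | A minimal sketch of what that looks like (assuming a
              | recent Polars release; the method was spelled groupby in
              | older versions, and the file/columns here are made up):
              | 
              |     import polars as pl
              | 
              |     df = pl.read_csv("data.csv")
              |     out = (df
              |            .filter(pl.col("v") > 0)
              |            .group_by("k")
              |            .agg(pl.col("v").sum()))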
        
               | marcinzm wrote:
                | We were using 500+ GB of memory at peak and were
                | expecting that to grow. If I remember correctly, we
                | didn't go with Polars because we needed to run custom
                | apply functions on DataFrames. Polars had them, but the
                | function took a tuple (not a DF or dict), which when
                | you've got 20+ columns makes for really error-prone
                | code. Dask and Spark both supported a batch transform
                | operation, so the function took a Pandas DataFrame as
                | input and output.
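                | 
                | For reference, the Dask flavor of that batch transform
                | looks roughly like this (a sketch; the path and column
                | name are made up):
                | 
                |     import dask.dataframe as dd
                | 
                |     ddf = dd.read_parquet("events/")
                | 
                |     def clean(pdf):
                |         # pdf is a plain pandas DataFrame
                |         pdf["v"] = pdf["v"].clip(lower=0)
                |         return pdf
                | 
                |     out = ddf.map_partitions(clean).compute()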
        
         | mumblemumble wrote:
         | I have decided that all solutions to questions of scale fall
         | into one of two general categories. Either you can spend all
         | your money on computers, or you can spend all your money on
         | C/C++/Rust/Cython/Fortran/whatever developers.
         | 
         | There's one severely under-appreciated factor that favors the
         | first option: computers are commodities that can be acquired
         | very quickly. Almost instantaneously if you're in the cloud.
         | Skilled lower-level programmers are very definitely not
         | commodities, and growing your pool of them can easily take
         | months or years.
        
           | jbverschoor wrote:
           | Buying hardware won't give you the same performance benefits
           | as a better implementation/architecture.
           | 
           | And if the problem is big enough, buying hardware will cause
           | operational problems, so you'll need more people. And most
           | likely you're not gonna wanna spend on people, so you get a
           | bunch of people who won't fix the problem, but buy more
           | hardware.
        
             | mumblemumble wrote:
             | Ayup.
             | 
             | And yet, people still regularly choose to go down a path
             | that leads there. Because business decisions are about
             | satisficing, not optimizing. So "I'm 90% sure I will be
             | able to cope with problems of this type but it might cost
             | as much as $10,000,000" is often favored above, "I am 75%
             | sure I might be able to solve problems of this type for no
             | more than $500,000," when the hypothetical downside of not
             | solving it is, "We might go out of business."
        
             | marcinzm wrote:
             | >And if the problem is big enough, buying hardware will
             | cause operational problems, so you'll need more people. And
             | most likely you're not gonna wanna spend on people, so you
             | get a bunch of people who won't fix the problem, but buy
             | more hardware.
             | 
             | That's why people love the cloud.
        
       ___________________________________________________________________
       (page generated 2022-08-02 23:01 UTC)