[HN Gopher] The simple joys of scaling up ___________________________________________________________________ The simple joys of scaling up Author : eatonphil Score : 78 points Date : 2023-05-18 15:04 UTC (7 hours ago) (HTM) web link (motherduck.com) (TXT) w3m dump (motherduck.com) | ThePhysicist wrote: | FWIW I really like this new "neo-brutalist" website style with | hard shadows, clear, solid lines and simple typography and | layout. | andrewstuart wrote: | Moore's law freezes in the cloud. | winrid wrote: | The i4i instances also have crazy fast disks to go along with | that 1 TB of RAM. I hope to move all our stuff off i3 instances | this year to i4. | waynesonfire wrote: | it's cloud companies raking in the profits of these hardware | improvements. "Widely-available machines now have 128 cores and a | terabyte of RAM." and I'm still paying five bucks for a couple | of cores. | dangoodmanUT wrote: | I've been following DuckDB for a while now, and even tinkered | with a layer on top called "IceDB" (totally needs a rewrite: | https://blog.danthegoodman.com/introducing-icedb--a-serverle...) | | The issue I see now is that there is no good way to know which | files will match well when reading from remote (decoupled) | storage. | | While it does support hive partitioning (thank god), and S3 list | calls, if you are doing inserts frequently you need | some way to merge these parquet files. | | The MergeTree engine is my favorite thing about ClickHouse, and | why it's still my go-to. I think if there were a serverless way to | merge parquet (which was the aim of IceDB) that would make DuckDB | massively more powerful as a primary OLAP db.
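The "serverless way to merge parquet" the comment wishes for breaks into two steps: planning which small files to compact together, then rewriting each batch as one file. Below is a toy sketch of just the planning half, in the spirit of MergeTree-style size-tiered compaction; the file names, the 128 MB target, and the `(name, size)` listing format are all illustrative, and the actual parquet rewrite is omitted.

```python
def plan_merges(files, target_bytes=128 * 2**20):
    """Greedily batch small parquet files into merge groups of roughly
    target_bytes each, smallest first (size-tiered compaction style)."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1]):
        if size >= target_bytes:
            continue  # already big enough, no need to rewrite it
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if len(current) > 1:  # a single leftover file has nothing to merge with
        batches.append(current)
    return batches

# Hypothetical listing, e.g. from an S3 list call: (key, size in bytes)
files = [("a.parquet", 40 * 2**20), ("b.parquet", 50 * 2**20),
         ("c.parquet", 60 * 2**20), ("d.parquet", 200 * 2**20)]
print(plan_merges(files))  # → [['a.parquet', 'b.parquet', 'c.parquet']]
```

The missing half, actually rewriting each batch and atomically retiring the inputs against concurrent readers, is the hard part that IceDB and ClickHouse's MergeTree are really about; the planning above is the easy part.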
| LewisJEllis wrote: | Yea, DuckDB is a slam dunk when you have a relatively static | dataset - object storage is your durable primary SSOT, and | ephemeral VMs running duckdb pointed at the object storage | parquet files are your scalable stateless replicas - but the | story gets trickier in the face of frequent ongoing writes / | inserts. ClickHouse handles that scenario well, but I suspect | the MotherDuck folks have answers for that in mind :) | marsupialtail_2 wrote: | You will always be limited by network throughput. Sure that wire | is getting bigger, but so is your data. | brundolf wrote: | Probably biased given that it's on the DuckDB site, but well-reasoned | and referenced, and my gut agrees with the overall | philosophy | | This feels like the kicker: | | > In the cloud, you don't need to pay extra for a "big iron" | machine because you're already running on one. You just need a | bigger slice. Cloud vendors don't charge proportionally more for | a larger slice, so your cost per unit of compute doesn't change | if you're working on a tiny instance or a giant one. | | It's obvious once you think about it: you aren't choosing between | a bunch of small machines and one big machine, you may very well | be choosing between a bunch of small slices of a big machine and | one big slice of a big machine. The only difference would be in | how your software sees it: as a complex distributed system, or as | a single system (that can e.g. share memory with itself instead of | serializing and deserializing data over network sockets) | LeifCarrotson wrote: | The reason this feels non-obvious is that people like to think | that they're choosing a variable number of small slices of a | big _datacenter_, scaling up and down hour-by-hour or minute- | by-minute to get maximum efficiency. | | Really, though, you're generating enormous overhead while | turning on and off small slices of a 128-core monster with a | terabyte of RAM.
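The "bigger slice" claim quoted above is easy to sanity-check with arithmetic, using the on-demand prices paulddraper quotes later in the thread (figures copied from that comment, not independently verified):

```python
# On-demand prices quoted in the thread: (USD/hr, vCPUs, GB of RAM)
instances = {
    "m5.large":        (0.096,   2,    8),
    "m5.24xlarge":     (4.608,  96,  384),
    "x2iedn.32xlarge": (26.676, 128, 4096),
}

base_price, base_vcpu, base_ram = instances["m5.large"]

# Within the m5 family the slice price is flat: 48x the vCPUs and RAM
# for 48x the hourly rate, i.e. cost per unit of compute is unchanged.
price, vcpu, ram = instances["m5.24xlarge"]
assert round(price / base_price) == 48
assert vcpu // base_vcpu == 48 and ram // base_ram == 48

# The memory-heavy x2iedn.32xlarge: 64x the vCPUs and 512x the RAM of
# an m5.large for roughly 277.9x the cost.
price, vcpu, ram = instances["x2iedn.32xlarge"]
cost_ratio = price / base_price  # ~277.9
assert vcpu // base_vcpu == 64 and ram // base_ram == 512
```

So within a family the per-vCPU rate really is flat, and even the largest memory-optimized box charges a premium for RAM, not for being one machine.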
| JohnMakin wrote: | That's not the only difference - there are many more facets of | reliability guarantees than the brief hand-waving the author | does about them in the article. | paulddraper wrote: | This is absolutely correct (and gets more correct every year). | | An m5.large (2 vCPU, 8GB RAM) is $0.096/hr. m5.24xlarge (96 vCPU, | 384GB RAM) is $4.608/hr. | | Exactly 1:48 scale up, in capacity and cost. | | The largest AWS instance is x2iedn.32xlarge (128 vCPU, 4096GB | RAM) for $26.676/hr. Compared to m5.large, a 64x increase in | compute and 512x increase in memory for 277x the cost. | | Long story short.....you can scale up linearly for a long time in | the cloud. | samsquire wrote: | This is an interesting post, thank you. | | In my toy barebones SQL database, I distribute rows across | different replicas based on a consistent hash. I also have a | "create join" statement that keeps join keys colocated. | | Then when a join query is issued, I can always join because | the join keys are available locally, and the join query can be executed on | each replica and returned to the client to be aggregated. | | I want building distributed high-throughput systems to be easier | and less error-prone. I wonder if a mixture of scale up and scale | out could be a useful architecture. | | You want as few network round trips and cross-thread handoffs | (synchronization cost) as you can get. ___________________________________________________________________ (page generated 2023-05-18 23:01 UTC)
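The colocated-join placement samsquire describes can be sketched in a few lines. Plain hash-modulo placement stands in for a real consistent-hash ring here, and the table names, key names, and replica count are made up for illustration:

```python
import hashlib

N_REPLICAS = 4

def shard_for(join_key: str, n: int = N_REPLICAS) -> int:
    """Place a row by hashing only its join key, so every row sharing
    that key (from any table) lands on the same replica."""
    digest = hashlib.sha256(join_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

# Two hypothetical tables declared joined on user_id via "create join"
users  = [("u1", "Alice"), ("u2", "Bob")]
orders = [("u1", "order-a"), ("u1", "order-b"), ("u2", "order-c")]

shards = [{"users": [], "orders": []} for _ in range(N_REPLICAS)]
for key, row in users:
    shards[shard_for(key)]["users"].append((key, row))
for key, row in orders:
    shards[shard_for(key)]["orders"].append((key, row))

# Each replica can now answer its share of the join with no network
# hops; the client just concatenates the per-replica results.
joined = [
    (uk, name, order)
    for shard in shards
    for uk, name in shard["users"]
    for ok, order in shard["orders"]
    if ok == uk
]
assert len(joined) == 3  # every matching pair was joinable locally
```

A genuine consistent-hash ring (replica IDs hashed onto a circle) would additionally keep most keys in place when replicas are added, which plain modulo does not; the colocation property shown here is the same either way.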