[HN Gopher] We saved millions in SSD costs by upgrading our file...
       ___________________________________________________________________
        
       We saved millions in SSD costs by upgrading our filesystem
        
       Author : kmdupree
       Score  : 334 points
       Date   : 2021-11-09 17:41 UTC (5 hours ago)
        
 (HTM) web link (heap.io)
 (TXT) w3m dump (heap.io)
        
       | londons_explore wrote:
        | Postgres is copy-on-write (a modified record is copied to a new
        | page when written, and the old record is left in place until a
        | vacuum).
        | 
        | ZFS is copy-on-write (one byte written in a block requires the
        | whole block to be rewritten, with the old one scheduled for
        | reclamation).
        | 
        | The underlying SSD wear-levelling algorithm is copy-on-write
        | (writing a single byte involves writing the data from the page to
        | a new block, and then erasing the old one sometime later).
       | 
        | That means a tiny 1-byte modification to a Postgres record
        | involves creating many unnecessary new copies of a lot of data...
        | 
        | I imagine that if the three layers could be combined, dramatic
        | performance benefits could follow, since the data written might
        | go down by at least an order of magnitude....
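        | 
        | As a rough sketch of how the amplification compounds (made-up
        | numbers, just to show the multiplication):
        | 
        |     # Hypothetical write amplification across stacked COW layers.
        |     logical_change = 1              # bytes actually modified
        |     pg_page = 8 * 1024              # Postgres rewrites the 8K heap page
        |     zfs_record = 64 * 1024          # ZFS rewrites the 64K record
        |     ssd_gc = 3                      # assumed flash GC amplification
        | 
        |     flash_bytes = max(pg_page, zfs_record) * ssd_gc
        |     print(f"{logical_change} B change -> ~{flash_bytes // 1024} KB "
        |           f"written to flash ({flash_bytes // logical_change}x)")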
        
         | ComodoHacker wrote:
         | > A modified record is copied to a new page when written
         | 
         | AFAIK a modified record is copied to the same page if there is
         | enough space (which you can tune).
        
         | storyinmemo wrote:
         | This is where RocksDB or other SSTable storage systems really
         | line up well with SSDs. The write amplification reduction can
         | improve throughput and disk life by multiples.
         | 
         | https://engineering.fb.com/2016/08/31/core-data/myrocks-a-sp...
        
         | touisteur wrote:
          | Oh, a Postgres disk controller / fs / storage engine would be
          | an interesting project.
        
         | jbverschoor wrote:
          | That's what you get when everything is hidden behind 10 layers
          | of abstraction.
          | 
          | The same thing happens with all the layers of virtual machines.
        
           | josteink wrote:
            | To be fair, some virtualisation stacks (like QEMU/KVM) go in
            | the opposite direction:
            | 
            | They provide VMs with drivers for purely virtual "virtio"
            | devices (storage, network, etc.) with no effort or overhead
            | put into mimicking the mechanics of a real SAS/SATA/SCSI
            | device or whatever.
           | 
           | The result is less complexity and a much more direct data-
           | path so things should hopefully not only be more performant,
           | but also more stable.
        
         | strainer wrote:
          | Not sure about SSDs, but filesystem COW doesn't only entail
          | writing and freeing whole blocks; it entails "don't make an
          | _actual_ copy _until_ write".
          | 
          | A COW filesystem means you can make (virtual) copies of
          | files/blocks without writing (duplicating) them on the media.
          | They only write metadata for bare copies and delay duplicating
          | data until a virtual duplicate is altered.
          | 
          | Actual writes go into a free block, and then the old block is
          | marked clear. The old block is not copied and then written
          | over. In my understanding that's not what COW characterizes -
          | it's referring to how copying data is almost free (it only
          | costs metadata changes in COW filesystems) until copies are
          | altered (written to).
        
           | londons_explore wrote:
           | These semantics are true at all three layers.
           | 
           | In Postgres, Transactions work on a "snapshot" of the data
           | that existed at one point in time. That snapshot is logically
           | a copy of the data, but in reality uses copy-on-write of
           | records to avoid having to make a copy of the entire database
           | at the start of any transaction.
           | 
           | In ZFS, it works as described.
           | 
            | In SSDs, operating-system 'write' commands are treated as
            | transactions - i.e. certain ordering semantics must be
           | preserved in case of a power failure. Since performance is
           | improved by having extra parallelism and not doing the actual
           | operations in the order they are presented by the OS, a copy-
           | on-write model is used to ensure that an incomplete
           | transaction can be rolled back. This isn't supposed to be
           | user-visible, but occasionally in a badly broken SSD, you
           | hear users complaining of 'it works fine, but then when I
           | reboot my computer everything I did is undone'! Well that's
           | because no transactions are committing...
        
         | tmikaeld wrote:
         | If anything, it would drastically increase the lifetime of the
         | SSDs.
        
           | jandrese wrote:
           | SSDs already do copy-on-write internally, so this doesn't
           | change much. The files are slightly smaller so it may save a
           | few writes here and there, but I wouldn't expect a drastic
           | change one way or another.
        
         | wtallis wrote:
         | I'm not sure it's an order of magnitude reduction in real flash
         | writes, but there have been some pretty promising results from
         | experiments with collapsing those three layers. Moving it all
         | to the SSD gets you a Key-Value SSD, and moving it all to the
         | host system gets you a Zoned SSD. Both ideas have seen enough
         | interest from storage vendors and hyperscale cloud to end up
         | standardized. So at this point it's mostly a matter of getting
         | support added to common software like Postgres so that it's
         | easy for smaller operations to adopt.
        
         | blibble wrote:
          | wear levelling isn't really copy-on-write, is it?
         | 
         | the original block will be removed from the mapping entirely
         | (not referred to by a copy)
        
           | londons_explore wrote:
           | Typically flash devices can't overwrite data directly, and
           | can only erase very large blocks (ie. 1MB or more).
           | 
           | That means any write of data must be made to a new, freshly
           | erased, location.
           | 
            | The old version of the data, sitting in the middle of a large
            | eraseblock, is no longer used, but it cannot be reused until
            | everything else in the eraseblock is either unused or copied
            | elsewhere.
           | 
           | In the worst case (of a nearly full SSD), it means that for
           | every 1 byte write into a 4k page, a full 1Mbyte of data
           | needs to be copied. Typical cases are better (when the drive
           | has plenty of spare space, so can delay the reclamation of
           | the eraseblock for as long as possible, hoping that other
           | things in the block are invalidated, reducing the amount that
           | needs to be copied).
           | 
           | TL;DR: It's complex, but a lot of copying is involved with
           | most writes...
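            | 
            | To put rough numbers on it (the sizes and the naive greedy-GC
            | model here are just illustrative assumptions):
            | 
            |     ERASE_BLOCK = 1024 * 1024       # 1 MB eraseblock
            |     PAGE = 4096                     # 4 KB program unit
            | 
            |     def copied_per_freed_page(live_fraction):
            |         pages = ERASE_BLOCK // PAGE
            |         live = int(pages * live_fraction)
            |         freed = pages - live
            |         # bytes relocated to free a single page in the victim
            |         return live * PAGE / max(freed, 1)
            | 
            |     for f in (0.50, 0.90, 0.99):
            |         kb = copied_per_freed_page(f) / 1024
            |         print(f"{f:.0%} live -> ~{kb:.0f} KB copied per 4 KB freed")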
        
       | rsync wrote:
       | I am genuinely curious how a ZFS "special" device[1] which
       | absorbs all metadata for a pool would help this kind of a
       | workload.
       | 
       | The "special" device is a full-blown vdev that you add to a pool
       | (it is _not_ a cache) - typically a 3 or 4-way mirror of SSDs.
       | 
        | Now all metadata reads and writes happen at SSD speeds.
       | 
       | We wrote about this in the _rsync.net Technical Notes_ for Q3 of
       | this year[2].
       | 
        | I know what this kind of SSD-based vdev does to typical mixed
       | file performance but I'm not sure how metadata-heavy a postgres
       | implementation is ...
       | 
       | [1] Yes, they really are called that.
       | 
       | [2]
       | https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...
        
         | aftbit wrote:
         | Thanks for posting about this. TIL about special devices.
        
         | latk wrote:
          | Would separate SSD metadata devices help if the pool, as in
          | Heap.io's case, already consists entirely of SSDs? It's
          | obviously a win for a use case like Rsync.net's, where the data
          | is less "hot" and therefore uses much more cost-effective HDDs.
        
           | franga2000 wrote:
           | Would be interesting to see if Optane or even just some
           | faster SSDs for the metadata would give any noticeable
           | improvement. I imagine latency would be more important for
           | metadata than throughput, so perhaps SSDs would be more or
           | less equivalent, but I'd be really interested in seeing the
            | numbers for Optane.
        
             | buildbot wrote:
              | This is actually one of the golden use cases for Optane!
              | Back in the pre-Optane days, the server company I worked for
             | would back SSD/HDD zpools with a device called a ZeusRAM
             | drive: https://www.ebay.com/itm/234256677232
             | 
             | It's all about the latency.
        
           | withinboredom wrote:
            | The benefit would be that metadata-type FS queries are out of
            | band (OOB) from the actual data, theoretically leaving more
            | IOPS on your data disks for actual data.
        
             | smallnamespace wrote:
             | If you just add metadata SSDs you're also adding IOPS to
             | the pool. The question then becomes whether that improves
             | performance more than if you added those same SSDs without
             | the split (my guess would be not splitting is better, since
             | the IO load will be better balanced across drives).
        
       | kevvok wrote:
        | I'm not sure how much I buy their claim that ZFS being COW is
        | worse for SSDs, given that SSDs themselves are COW to allow for
        | wear-leveling: they have to erase whole blocks and rewrite them
        | if even one byte is changed.
        
         | sipos wrote:
         | I thought the same thing as I was reading it, but I think they
         | are probably using larger block sizes than the SSD's blocks for
         | better compression. I'm not certain though.
        
         | toast0 wrote:
         | Given that they mention snapshots, that's probably the bigger
         | issue. Almost any sort of storage works better when you have
         | more free space, and having a snapshot means you need to keep
         | all that data as well as any data that changed since then, so
         | you have less free space.
         | 
         | Using a COW filesystem adds at least some amount of usage,
         | since instead of modifying in place, you'd write a new block
         | and only trim the old block sometime after the new block is
         | committed; but if you don't have snapshots and you have zfs
         | autotrim (and it trims all your old blocks), the commit
         | interval is short (5 seconds by default?), so I wouldn't expect
         | a big difference in effective free space here.
        
         | bbarnett wrote:
         | They say their fs block size is 64kb. How large are SSD blocks?
        
           | jeffbee wrote:
           | SSD erase blocks vary, could be anywhere from 64KiB to
           | several MiB. Their logical blocks (visible to the host
           | operating system) are usually either 512 or 4096 bytes.
        
             | wtallis wrote:
             | NAND flash erase blocks passed 1MB a long time ago. For
             | mainstream TLC NAND, 16 to 24 MB is currently typical, and
             | QLC NAND has gone as high as 48 and 96 MB. NAND page sizes
             | are usually 16kB, but often with support for faster partial
             | page programming for the sake of 4kB operations. Logical
             | block sizes of 512 bytes are purely a fiction for the sake
             | of compatibility and SSDs don't actually track allocations
             | at that granularity. They do track things at 4kB
             | granularity even though that's not quite large enough to be
             | a good fit for today's flash.
        
         | bob1029 wrote:
         | My experience with SSDs tells me the only way to beat the
         | system is to employ some append-only log storage structure.
         | Potentially with segmentation done at the device level, so that
         | you can have a large fleet of drives in "append" mode while
         | others are reconciling or cleaning all their blocks in
         | anticipation of taking another full sequential fill-up.
         | Throwing mixed workloads at individual devices is just asking
         | for trouble if you are trying to maintain some razor-thin SLA.
          | Dirty pages and all the weird tricks employed to hide this
          | concern result in side effects that break more complex schemes
          | on top.
         | 
          | Taking it to the next level - batching your I/O in software is
          | how you can start saying things like "transactions per disk
          | I/O", not just "fewer I/Os for those transactions which now fit
          | in fewer blocks due to compression". Batching doesn't have to
          | mean "nightly processing". It can mean "all requests which
          | occurred over the last 100us". From a user's perspective, this
          | can effectively still be a real-time RPC experience. For
          | systems with very heavy load, this sort of micro-batching can
          | add many orders of magnitude of improvement in throughput. Also
          | bear in mind that the more transactions you have available to
          | compress in each batch, the better your odds when dealing with
          | entropy.
         | 
          | I have personally developed software that can insert at 2-5x
          | the stated write IOPS figure with these sorts of tactics. On
          | modern NVMe devices this can mean you start tickling 8-figure
          | transactions per second if the size of each request is very
          | modest.
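          | 
          | A minimal sketch of the micro-batching idea (placeholder
          | choices throughout - the 100us window, the file name, and the
          | one-fsync-per-batch policy):
          | 
          |     import os, queue, threading, time
          | 
          |     log = open("data.log", "ab")
          |     pending = queue.Queue()
          | 
          |     def submit(record: bytes) -> threading.Event:
          |         done = threading.Event()
          |         pending.put((record, done))
          |         return done                      # caller waits on this
          | 
          |     def batcher(window=100e-6):
          |         while True:
          |             batch = [pending.get()]      # block for the first request
          |             deadline = time.monotonic() + window
          |             while (left := deadline - time.monotonic()) > 0:
          |                 try:
          |                     batch.append(pending.get(timeout=left))
          |                 except queue.Empty:
          |                     break
          |             log.write(b"".join(rec for rec, _ in batch))
          |             log.flush()
          |             os.fsync(log.fileno())       # one fsync covers the batch
          |             for _, done in batch:
          |                 done.set()
          | 
          |     threading.Thread(target=batcher, daemon=True).start()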
        
         | notacoward wrote:
          | The fact that there are two layers both doing this kind of thing
         | (especially oblivious to one another) is a known and studied
         | issue. Most of that study has been in the context of virtual-
         | machine filesystems atop host filesystems but it's also
         | applicable here. I'm pretty sure there was a paper at FAST
         | about this several years ago, but can't find it right now. It's
         | entirely likely that COW-over-COW is a bad choice just like
         | TCP-over-TCP is.
        
           | klodolph wrote:
           | Are you thinking of the classic, "Don't Stack Your Log On My
           | Log"?
           | 
           | https://www.usenix.org/conference/inflow14/workshop-
           | program/...
        
             | notacoward wrote:
             | Yeah, seems about right. Thanks!
        
           | the8472 wrote:
           | But the upper-level CoW can coordinate with the lower level
           | via TRIM, that's not quite the same as TCP-in-TCP where the
           | congestion algorithms don't talk to each other.
        
             | notacoward wrote:
             | Yes, TRIM exists, but it's a very limited form of
             | coordination at best and implementations don't even do all
             | they can with that. Every analogy is imperfect.
        
       | paladin314159 wrote:
       | We switched from LZ4 to Zstd in almost all of our compression use
       | cases to great effect. Reducing the data on disk or over the
       | network is a huge win with only a minor loss in decompression
       | speed (using the appropriate level of Zstd). E.g. data in Kafka:
       | https://amplitude.engineering/reducing-kafka-costs-with-z-ta...
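        | 
        | A quick way to compare the two on your own data (assumes the
        | third-party zstandard and lz4 Python packages; the sample file
        | path is just a placeholder):
        | 
        |     import time
        |     import lz4.frame
        |     import zstandard
        | 
        |     data = open("sample.json", "rb").read()  # any representative sample
        | 
        |     def bench(name, compress):
        |         t0 = time.perf_counter()
        |         out = compress(data)
        |         dt = time.perf_counter() - t0
        |         print(f"{name:8s} ratio {len(data)/len(out):5.2f}x"
        |               f"  {len(data)/dt/1e6:7.1f} MB/s")
        | 
        |     bench("lz4", lz4.frame.compress)
        |     bench("zstd-3", zstandard.ZstdCompressor(level=3).compress)
        |     bench("zstd-9", zstandard.ZstdCompressor(level=9).compress)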
        
       | djanogo wrote:
        | Can somebody enlighten me on why you would use ZFS for a DB? It
        | seems like there would be an overlap/conflict of features. The
        | only benefit that I know of would be a quick restore to a
        | previous state, but how often would you need to restore?
        
         | jdhawk wrote:
          | They state it: compression.
        
           | Aachen wrote:
           | Even NTFS has compression, so it doesn't seem to be that
           | simple.
        
         | toast0 wrote:
         | Data integrity is nice to have.
        
         | jhk727 wrote:
         | Author here - It's explained in the post but the primary driver
          | is cost. An 80%+ reduction in storage is massive when you're
          | storing petabytes of data on SSDs.
        
       | deeblering4 wrote:
       | It's wild to me that sites managing storage clusters at petabyte
       | scale are doing it on AWS. I would think that by then you could
       | save millions more by migrating to your own colocated hardware.
        
         | thehappypm wrote:
          | It's not always the case that the dollars you spend are evil.
          | Sometimes the more expensive or slower solution is better
          | because it means
         | you can have fewer people working on it, or have lower-skilled
         | people managing it. And new people are easier to hire and make
         | productive because it's a shared skill. And you get updates for
         | free. And outages are likely to be shorter. And.. the list goes
         | on.
        
         | teknopurge wrote:
         | Or save even more using decentralized options like Storj.io or
         | Filecoin.
         | 
          | IMO, the market pendulum is swinging (in motion) right now:
         | 
         | cloud ---(cost drivers)---> dedicated HW/colo with hybrid or
         | custom cloud---(operational drivers)---> web3 decentralization
         | [^--- we are here ---^]
        
           | tmikaeld wrote:
           | You must be kidding? Those solutions are not even close to
           | running any database of this scale, did you even read the
           | article?
        
             | teknopurge wrote:
              | Yep. Those solutions aren't there yet, and are arguably too
              | early, but the market direction is clear. In five years I
              | would not be surprised to see some form of transactional
              | data store (high TPS, blob store) on a decentralized layer.
              | 
              | It's crazy to talk about different FS compression schemes
              | when trying to optimize business/app logic higher up the
              | stack. It should be abstracted away by now. (Yeah, I know
              | it's not, but it should be.)
        
         | jhk727 wrote:
          | Author here - as others have noted, there are a lot of benefits
         | to a company of our size operating infrastructure on AWS vs.
         | managing physical hardware. A couple of the highlights for
         | managing our primary database cluster include:
         | 
         | - Automation - this was noted by another commenter, but with
         | AWS we can fully automate the instance replacement procedure
         | using autoscaling groups. On hardware failure, the relevant
         | database is removed from its autoscaling group and we
         | automatically start restoring a fresh instance from latest
         | backup. This would be much more difficult if we were to manage
         | our own hardware.
         | 
         | - Flexibility - we have the ability to easily change instance
         | classes via selling/buying reservations. Some of our biggest
         | wins historically have come from AWS releasing new instance
         | families - we've been able to swap out the hardware for our
         | entire cluster over a week or so, for negligible cost (often
         | saving money in the process due to the cost per unit of
         | hardware decreasing on new instance classes). While we could
         | leverage the same developments in a self-managed environment,
         | it would be more difficult, and likely more expensive due to
          | how capital-intensive self-hosting is.
         | 
         | Additionally - there's a ton of value in the integration of the
         | AWS ecosystem. We use many AWS managed services, including
         | heavy use of RDS, Kinesis, S3, and others. For a company with a
         | relatively small engineering team managing a large
         | infrastructure footprint, it hasn't made financial sense yet to
         | invest in moving to self-hosted infrastructure.
        
           | tinco wrote:
           | The things you mention are ostensibly true, yet still they
           | don't make sense to me. It might make sense when you're a
           | startup that's growing, but when your SSD costs are so large
            | you can save _millions_ on them, then the numbers just don't
            | add up.
           | 
           | In my experience, doing things in the cloud is about as
           | expensive per 12-18 months as buying the hardware up front
           | is. That's super interesting for a fast growing startup that
           | could go bust any minute and wants to spend every second of
           | their time on growing, expanding and marketing.
           | 
           | But when you're spending so much on AWS you can save millions
           | just by reducing filesystem overhead by 20%, it should have
           | stopped making sense a while ago. $2 million should get you a
           | team of 10 sysadmins and devops engineers. Sure automation
           | would be more difficult, but you'd have the manpower to
           | achieve it. Isn't that what running a business is about?
           | 
            | Flexibility: when you're growing quickly it's nice that you
            | can provision new hardware instantly, but AWS is so expensive
            | you could continuously over-provision your hardware by 50%
            | and still always be ahead of the AWS price curve. And as I
           | said, you could fully swap out your hardware every 18 months
           | and be at the same price basically. You could even hire a
            | merchant to offload your old hardware and recoup 50% of
           | those costs.
           | 
            | And I'm not saying to throw AWS overboard altogether; having
            | your core business outside of AWS's datacenters doesn't
            | preclude you from buying into RDS, Kinesis, S3.
           | 
           | Is AWS just cutting you more financial slack than we're
           | getting as a tiny company? Or am I underestimating the costs
           | of getting that sysadmin team on board?
        
             | stef25 wrote:
             | > $2 million should get you a team of 10 sysadmins and
             | devops engineers
             | 
             | Managing staff vs managing AWS ... I know what I'd choose
             | (without really knowing the numbers)
        
             | gwright wrote:
             | > In my experience, doing things in the cloud is about as
             | expensive per 12-18 months as buying the hardware up front
             | is.
             | 
             | It seems like you are glossing over the other costs: staff
             | to implement and manage, development time and maintenance
             | for automation to re-implement everything that AWS
             | includes, data center costs (not clear if you were thinking
             | of hardware ownership only or data-center also).
             | 
             | I'm not saying you didn't think about those things just
             | saying that they can't be ignored in these types of
             | comparisons.
        
               | tinco wrote:
               | Where did I gloss over them? I literally suggest spending
               | 2 million a year on staff.
        
               | gwright wrote:
               | Without specific numbers it is a bit difficult to be as
               | clear as I would like but I read your comment as
               | suggesting that the savings from owning/hosting your own
               | equipment could pay for the team needed to operate that
               | solution -- but then what was the point of switching?
               | 
               | The devil is in the details, and I wouldn't say that it
               | never makes sense to bring operations in-house, but your
               | post didn't make a clear case from my point of view.
        
               | aflag wrote:
                | You don't need to reimplement everything AWS provides.
                | AWS provides services on demand for a huge customer base.
                | The sysadmin/devops team you set up needs to solve only
                | your particular problem. You can often get a better,
                | easier to use and maintain system this way. The downside
                | is that you have to pay for that extra team, but if only
                | 20% of your data already costs millions, your scale is
                | big enough that you'll likely save money by hiring a
                | sysadmin team.
        
               | lazide wrote:
               | Sometimes, sometimes not. A lot of common 'prod bricks'
                | (S3, managed Kubernetes, etc.) get used because they are
               | convenient, and while they could be implemented in some
               | other, more bespoke way, it's rarer and rarer that it
               | actually pencils out as a net win. You also have to deal
               | with the complexity of managing your own version of it,
                | which is non-trivial over the full lifecycle of
               | something.
               | 
               | If it is your core business to provide that thing?
               | Sometimes or even often worth it. Otherwise, often not.
        
               | manquer wrote:
                | Unless you are literally Amazon or another big co,
                | spending tens of millions (if 20% is millions, then at
                | least 5-10 million) on just SSDs _alone_ should make it
                | pretty much a core business problem to solve?
                | 
                | My sense is that in the last decade startups have now
                | lost the skills to do co-location setups that they had in
                | the 2000s, and think it is more complex than it actually
                | is. Co-lo hardware management is hard, yes, but if it
                | were not worth doing even at 10+ million/year budgets we
                | would never have had SaaS companies pre-cloud at all.
        
               | merb wrote:
                | A lot of people underestimate how expensive it is just to
                | have a 10G/25G/100G network and maintain it. That alone
                | is extremely expensive, especially when you want it
                | across multiple locations. If you want to connect two
                | datacenters with a low-latency, high-throughput network
                | you would probably go to AWS since that is cheaper. And
                | that is just the network; you also need to maintain other
                | stuff like storage. Maintaining a storage network is
                | extremely hard - like S3/block storage, etc.
        
             | PragmaticPulp wrote:
             | > In my experience, doing things in the cloud is about as
             | expensive per 12-18 months as buying the hardware up front
             | is.
             | 
             | The fallacy is comparing hardware costs to services cost.
             | The hardware is the cheap part.
             | 
             | When you run your own system, you have to develop the
             | entire system up front and maintain it on the backend. The
             | hardware is cheap by comparison to the salaries and
             | development costs you pay.
             | 
             | > $2 million should get you a team of 10 sysadmins and
             | devops engineers.
             | 
             | Probably double that once you add in fully-loaded costs as
             | well as the compensation for ~2 managers to manage them.
        
               | mbreese wrote:
               | But if you're paying X in OpEx to AWS, at some point Y in
               | CapEx (hardware) and Z for OpEx (your people) becomes
               | more compelling. What I believe the comment above is
               | arguing is that if you are saving millions for 20%
               | savings on SSD costs, X >> Y + Z.
               | 
               | Yes, you have to manage the hardware, and Y doesn't
               | automatically go to zero for year two, but the
               | convenience of the cloud isn't always cost effective.
               | Don't get caught up with the details. The 2 million
               | figure doesn't matter as much. It's finding that
               | inflection point and making the better business decision.
        
               | starfallg wrote:
               | >When you run your own system, you have to develop the
               | entire system up front and maintain it on the backend.
               | The hardware is cheap by comparison to the salaries and
               | development costs you pay
               | 
                | The development costs are one-time costs (OTC) that are
                | amortised over the life of the solution, whereas in xAAS
                | they are monthly recurring costs (MRC). The longer your
                | tech refresh cycle, the cheaper it is. It's inherent to
                | the pricing model.
                | 
                | Also, when you get down to the crux of it, these
                | solutions (like OpenStack or vSphere) are software
                | platforms that provide similar features. There's not much
                | development cost; it's just software licensing and
                | professional services (PS).
                | 
                | In terms of operations, it's not like you can get rid of
                | sysadmins; they just morphed into DevOps.
               | 
               | >Probably double that once you add in fully-loaded costs
               | as well as the compensation for ~2 managers to manage
               | them.
               | 
               | You might as well add all sorts of additional costs such
               | as egress charges on exit and cloud consultancy.
        
               | manquer wrote:
                | There are plenty of financing options (with pretty
                | competitive interest rates) that will help you amortize
                | your upfront costs over the project lifetime if your
                | credit is good (at this kind of range any startup gets
                | access to them fairly easily), so both options can be
                | monthly recurring if need be.
        
               | aflag wrote:
                | I think he compared hardware + team salary vs. AWS. After
               | a certain size that starts tipping over to the side of
               | having things on prem. Running things on prem is nothing
               | scary. You just need people with the skill set needed to
               | do it. But when you're spending millions in
               | infrastructure, that's hardly a problem.
        
           | [deleted]
        
           | jes wrote:
           | Thank you for writing this article.
           | 
           | I'm curious: The engineers that brought these significant
           | cost savings to your company, did they receive a share of the
           | money saved?
        
             | AstroDogCatcher wrote:
             | Thank you - haven't laughed like that in a while.
        
               | Aachen wrote:
               | I'm not sure how appropriate it is to take a serious
               | comment where the author has a genuinely unpopular
               | opinion and say you laughed really hard at it.
        
             | Johnny555 wrote:
             | how would you even allocate the money fairly besides
             | rolling it into the company to keep it alive and successful
             | (and maybe the added profit increases the bonus pool if
             | that exists)?
             | 
             | The engineers didn't do it in isolation - how much of a
             | share goes to the office receptionist that answered the
             | phones and kept visitors out of the way of the engineers?
             | How much goes to the Finance department that kept the
             | engineering paychecks coming while they did they work? How
             | much goes to the salespeople who kept the deals flowing and
             | money coming in that filled the disks in the first place...
             | and so on and so on.
             | 
             | Once a company exits the "a few devs in a garage" stage,
             | many people contribute to the company's success.
        
               | franga2000 wrote:
               | Let's face it, increasing your employer's profit margin
               | will only benefit the employees if the company is
               | struggling and they were about to be laid off because
               | costs were starting to eat into the margins. The only
               | case where it does is with a bonus or with workers owning
               | shares with dividends. "keeping it alive and successful"
               | only matters if that wouldn't have happened otherwise and
               | even then not much if you have good job mobility.
               | 
               | No need to allocate. Something like "our clever engineers
               | managed to save us $2M per year, so we're giving everyone
                | a $500 bonus this month" seems entirely reasonable.
        
           | shrubble wrote:
            | It seems shocking to me that you haven't yet migrated;
            | however, you know your cost/benefit ratio better than I do.
            | Have you ever examined a split model, where some parts of the
            | load are run on your own or rented dedicated servers, and
            | some run on AWS?
            | 
            | Separately, regarding your comment about LVM... an LVM
            | snapshot requires that a separate part of the volume be set
            | aside to hold the snapshot data.
            | 
            | If the snapshot volume fills up with changes being made to
            | the volume that holds your data before the snapshot
            | completes, then the snapshot will fail.
            | 
            | This does not occur with ZFS, as you have noticed.
        
           | newsclues wrote:
            | Are you in the gap Oxide Computer is trying to fill?
        
         | _nickwhite wrote:
         | AWS is more often than not a better solution than colo when
         | factoring in the on-site engineers, techs, and operational
         | complexity costs a company will pay to monitor and respond to
         | hardware-related events. One could build out a datacenter
          | management team with on-call engineers, or they could pay AWS
          | to handle all that, and focus on innovation and products that
          | make their company unique and (hopefully) profitable. AWS makes
          | a lot of sense for companies that wish to insulate themselves
          | from the hardware layer, and it would probably take a company
         | many magnitudes larger than Heap to realize any real benefits
         | from self-hosting at a colo. This isn't even considering the
         | fact that uptime matters, and you'll need more than 1 colo to
         | really do it right.
         | 
         | I say this as someone who built, manages and operates
         | datacenters and colo spaces.
        
           | runlevel1 wrote:
           | I'm not saying it never happens, but I've never seen moving
           | to AWS (or Azure, GCP, etc.) save in people costs at any tech
           | company with a large resource footprint. It just shifted
           | where the time is spent and who had to spend it.
           | 
           | The public cloud and managed services work great for the most
           | common use cases, but go outside those and you start having
           | to engineer around limitations.
           | 
           | If you have a sizable footprint in any given dimension you're
           | trading one complexity for another.
        
             | bombcar wrote:
             | The "who had to spend it" is huge - companies _love_ paying
              | providers and _hate_ paying people/depreciating costs.
        
               | markus_zhang wrote:
               | Curiously our company moved from AWS to on-premise a
               | couple of years ago. Something about CAPEX -> OPEX was
               | mentioned back then.
        
               | bombcar wrote:
               | That's the second part of the loop, when the cost of AWS
               | is high enough that you can show immediate dollar savings
               | by bringing it in-house.
        
               | markus_zhang wrote:
               | Yeah, that was a bit more than one year before we filed
               | for IPO :)
        
               | walrus01 wrote:
               | > hate paying depreciating costs.
               | 
               | One possible solution for this if you want to do it as
               | bare metal you control, is leased equipment (even with $1
               | buyout at end of term), which can be accounted for
               | differently than purchasing it up front.
        
           | [deleted]
        
           | gsliepen wrote:
           | For a while I did maintain a storage cluster that had close
           | to 0.5 PB of data, and had a capacity of up to 3 PB (if you
           | filled all the slots with the largest disks you can buy). You
           | want to ensure you have a lot of redundancy and spare
           | capacity if you are managing it yourself. Luckily, the
           | hardware is relatively cheap. It's the manpower that costs a
           | lot. Still, I think it was only 0.1 FTE to manage this
           | storage cluster, including the network, file systems, user
           | access, swapping out bad harddisks and storage pods (but
            | granted, its I/O load was very light). Also, while AWS takes
           | away the burden of handling the physical drives and the
           | filesystem for you, now you have to handle interfacing with
           | AWS. That means you need an engineer that knows how to
           | integrate your application with AWS. If you can leverage more
           | AWS services, maybe even avoid needing your own server rooms
           | because everything is running there, it might pay off.
           | European vs. American salaries might also change the
           | equation. But if you just use it for storage, I don't think
           | it's worth it at any scale.
        
           | nickstinemates wrote:
            | Our company is not many magnitudes larger than any company,
            | and it is not remotely cost-competitive for us to run any of
            | our stack on AWS (heavy data ingest with hundreds of billions
            | of inserts a day, many disks / a 'big data' backend, hundreds
            | of customers accessing the data). Even one-time deals to get
            | us in the door are not cost-competitive, let alone the long-
            | tail economics. Cloud for fast storage and high-bandwidth use
            | cases is extremely expensive.
        
           | _3u10 wrote:
           | It isn't. I have the same overhead running my server as I do
           | a VM. What I don't get is a $3000 bill for $100 worth of
           | server.
           | 
           | You can generally buy whatever you are renting from AWS for 1
           | to 3 months of an AWS bill.
           | 
           | The only thing I don't get from colo is a bunch of other
            | customers thrashing the cache on my CPUs.
           | 
            | Databases are not web servers - there's no way to run a
            | database on smaller / fewer instances when running at
            | non-peak times. Instant scaling is the only possible
           | advantage AWS could bring. However with the prices they
           | charge it's simpler and cheaper to just buy/rent your own
           | hardware. Especially if you have to pay egress fees.
           | (bandwidth is really the biggest ripoff)
        
             | midasuni wrote:
              | The argument isn't about using AWS to run a VM (which can
              | be cheaper than colo'ing your own kit, depending on how
              | many you want and for how long); it's all the extra stuff.
              | Spin up an AWS load balancer rather than running and
              | maintaining your own, for example.
              | 
              | I don't like lock-in, but the prevailing view has always
              | been in favour of lock-in, be it IBM mainframes, Oracle
              | databases, Windows servers, etc., and if you swing that way
              | AWS has tempting offers.
              | 
              | Oh, and databases do scale. Say you want to run end-of-
              | quarter financials that require a lot of processing for a
              | day: you bring up tons of read replicas and away you go.
        
               | _3u10 wrote:
               | If you bought hardware with what you pay for AWS RDS you
               | could run your entire DB in RAM. Hell you could probably
               | put the data in memory on a GPU.
               | 
               | Also, this is generally why you run financials overnight.
               | If your hardware is serving transactions during the day
               | it can easily run your quarterlies at night.
               | 
                | nginx is far easier to maintain than AWS load balancers -
                | which is what their load balancers are anyway. The best
                | part about nginx configs? They are cloud agnostic and
                | will work on everything from a Raspberry Pi to a 128-core
                | EPYC server.
                | 
                | I'll tell you something about RDS reliability: your
               | monthly maintenance window brings your DB down far more
               | often than a single unreplicated server ever fails. EBS
               | (like the entire thing) has failed more times in the last
               | year on us-east-1 than my colo RAID.
               | 
               | The selling point of AWS is that if you pick AWS and it
               | fails you can say, well the richest guy in the world
               | can't figure this stuff out so it must be impossible,
               | when in reality high school kids could make a more
               | reliable system. If you pick AWS you have the
               | unreliability of the base software / hardware of their
               | systems plus whatever the AWS engineers fuck up. At this
               | point it's pretty clear that they can't even keep a SAN
               | working.
        
               | spookthesunset wrote:
               | What about all the config management infrastructure to
               | manage those nginx instances?
               | 
               | What is the amount of work required every time some team
               | wants to spin up new stuff? What is the turnaround time
               | between when they file the ticket and the work being
               | completed?
        
           | saiya-jin wrote:
            | That's a nice statement that sounds like it came straight
            | from Amazon sales reps, and it can actually work for some
            | companies, maybe. But for our bank (top 10 globally), the
            | only way to even come out financially equal to our own farms
            | is to have aggressive downtime every night, which negatively
            | affects the productivity of our global teams. Pricing is
            | really not that great if you deal at scale.
            | 
            | You wanna push one evening a bit late to deliver something
            | valuable for the project? Sorry, no can do.
            | 
            | I don't even factor in the horribly expensive migration
            | projects that actually brought zero added business value for
            | the type of apps we use. We still have to keep our network,
            | Windows and Unix admins, various app support personnel,
            | etc.; there is plenty of work for them with AWS. Not one
            | single IT guy was made redundant.
            | 
            | No cost savings - on the contrary.
        
           | IgorPartola wrote:
            | That argument doesn't hold up. How could AWS be cheaper
           | than doing it yourself at that scale? Like if you are a tiny
           | company that can't afford your own DC, your own engineers,
           | etc. then yes AWS is cheaper in absolute costs but not in
           | per-byte costs. But at scale you should be able to hire
           | engineers and build out a DC at which point you aren't paying
           | the AWS margin, which is how you save money.
           | 
           | In other words your assessment would only be true if AWS had
           | a 0% or a negative margin.
        
             | kevincox wrote:
             | There is still an economy of scale. You are right, as you
             | use more resources your economy of scale will increase, but
             | it will never match AWS's (unless you are huge). So the
             | math is if the AWS margin is less than the difference
             | between the two different economies of scale then it makes
             | sense to run your own datacenters (ignoring the opportunity
             | cost of the transition costs). Of course at some point the
             | margin _will_ exceed that difference, but depending on what
             | type of infrastructure you need it can be at a very high
             | point.
        
               | IgorPartola wrote:
               | Your last point is the important part: depending on the
               | type of infrastructure you need you might be able to save
               | money. If you want a cheap place to dump your files, B2
               | is cheaper than S3 and raw storage hardware pays for
               | itself in about a year. If you need a sophisticated CDN
               | then yeah you'll need to be huge before it pays for
               | itself. I would consider ditching S3 at the point where I
               | can hire two full time engineers to worry about my
               | storage layer.
        
           | walrus01 wrote:
           | If you have petabytes of data in AWS you already have a
           | number of on-staff engineers with significant six figure
           | salaries.
           | 
           | If the problem is that your group of six-figure salary people
           | only know how to put data into AWS, or other cloud services,
           | and not design/engineer/maintain your own bare metal
           | infrastructure as well, then that would definitely be a
           | limitation.
           | 
           | For reference, a few petabytes of data is not actually that
           | many systems these days, if you have something like a bunch
           | of 72-drive supermicros or equivalent with 14-16TB drives in
           | them. Set up properly this can be administered by one FTE (of
           | course with additional staffing/tech resources for when that
           | FTE is on vacation/unavailable, and appropriate training for
           | other persons who might have admin on the setup).
           | 
            | My very rough calculation here says that a 36-drive ZFS
            | RAIDZ2 composed of 16TB drives is something like 492TB
            | (447TiB) of usable storage capacity.
            | 
            | So five such arrays would be 2460TB.
            | 
            | Compared to the monthly AWS bill for 2.0 to 2.5PB of data you
            | could probably afford to entirely duplicate the whole setup
            | in a twin identical set of hardware at a geographically
            | diverse off-site location.
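            | 
            | One way to land near those figures, assuming a single
            | 36-wide raidz2 vdev and roughly 10% lost to padding, metadata
            | and the slop reservation (the real overhead depends on ashift
            | and recordsize):
            | 
            |     drives, drive_tb, parity = 36, 16, 2
            |     data_tb = (drives - parity) * drive_tb    # 544 TB raw data
            |     usable_tb = data_tb * 0.90                # ~490 TB
            |     usable_tib = usable_tb * 1e12 / 2**40     # ~445 TiB
            |     print(f"{usable_tb:.0f} TB (~{usable_tib:.0f} TiB) per array,"
            |           f" {5 * usable_tb:.0f} TB across five arrays")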
        
             | chucky_z wrote:
             | Engineers who can put together and maintain this hardware
             | don't exist on the job market.
        
               | walrus01 wrote:
               | Engineers who can put together and maintain 2PB of
               | storage don't exist on the job market? I'd say you don't
               | know the right people or aren't looking in the right
               | places.
               | 
               | 2PB is really not that much stuff these days. It's less
               | than two cabinets of equipment and that amount of space
               | (80RU or so) includes routers, switches, AC power
               | distribution, OOB, etc.
        
               | spookthesunset wrote:
               | They exist but finding and recruiting those people costs
                | a non-trivial amount of time and money, both of which
               | could be spent on whatever secret sauce your company
               | does.
        
             | __turbobrew__ wrote:
             | I work in a very data heavy HPC space and you are glossing
             | over many things.
             | 
              | * Getting performant access to a storage cluster is non-
              | trivial: there are many different variables in place which
             | must be correctly tuned to get good performance. Network
             | topology, high quality NICs and switches, a tuned Linux
             | kernel, client side caching settings, network packet sizes,
             | file system block sizes, erasure coding settings, etc.
             | 
              | * Your solution mentions nothing of backups, offsite
             | failovers, or disaster recovery plans.
             | 
              | * Your solution mentions nothing of a physical datacenter:
             | fire suppression, battery backups, hvac, power supplies,
             | backup generators, server racks, sound isolation,
             | workspaces for hardware maintenance, network cable routing.
             | 
              | * If you have multiple geolocations you need to have dark
              | fiber or IP transit between locations, with multiple ISPs,
              | to have high-speed connections between sites without
              | downtime.
             | 
             | In addition to the raw costs you have to factor in the lead
             | time of building a qualified infrastructure team, building
             | out the requirements, provisioning hardware and datacenter
             | space, setting everything up, and then tuning everything.
              | With infinite money this is probably still a 2-year lead
             | time at minimum.
             | 
             | I do agree that running multi petabyte workloads in AWS is
             | probably not optimal, but when you are a startup in the
              | growth stage it is probably a better use of your time to throw
             | VC money at AWS and building out your product. Eventually,
             | the business should probably migrate to self managed
             | infrastructure once the right product fit has been found
             | and the business is looking to streamline.
        
             | PragmaticPulp wrote:
             | > If the problem is that your group of six-figure salary
             | people only know how to put data into AWS, or other cloud
             | services, and not design/engineer/maintain your own bare
             | metal infrastructure as well, then that would definitely be
             | a limitation.
             | 
             | It's not about whether or not the engineers can make the
             | colocated setup work.
             | 
              | It's that you're going to pay _a lot_ of hidden costs with
              | a colocated setup. Engineers can't set up, maintain, and do
              | on-call for the colocated setup without subtracting from
              | their primary working hours.
             | 
             | Each additional engineer you have to hire to help with the
             | colocated setup is $200-400K fully loaded out of your
             | company's budget. If you have to hire 3 additional
             | engineers to fill out your colocated on-call schedule and
             | help set up and maintain the system, that's easily an extra
             | $1 million per year on your budget. Cloud is expensive, but
             | $1 million goes a long way.
             | 
             | It's easy to look at a potential AWS bill and a potential
             | colocation and hardware bill and declare colocation the
             | winner, but then you still have to set up and maintain it
             | all as well as constantly train everyone on it.
             | 
             | With AWS, you can hire engineers with AWS experience and
             | they'll understand the big picture of how to work with
             | things on day 1. With a custom setup, you're at the whims
             | of whichever employees set up the system because they know
             | it best.
             | 
              | Colocated systems tend to work very well _at first_, when
              | the original engineers who set it up are all still at the
              | company and it hasn't run long enough to start
             | encountering rare failure modes. They quickly become a
             | nightmare when your engineering staff turns over multiple
             | times and nobody can remember who knows how to do what on
             | the colocated system or if the documentation is up to date
             | or not.
        
               | walrus01 wrote:
               | > Colocated systems tend to work very well at first when
               | the original engineers who set it up are all still at the
               | company and it hasn't run long enough to start
               | encountering rare failure modes. They quickly become a
               | nightmare when your engineering staff turns over multiple
               | times and nobody can remember who knows how to do what on
               | the colocated system or if the documentation is up to
               | date or not.
               | 
               | Everything above really sounds like it's just
               | regurgitating AWS sales person talking points.
               | 
               | Sounds like a systemic management / CTO-level problem to
               | me if a company isn't willing to put in place the hiring
               | practices and compensation, documentation systems and
               | operational procedures to deal with that sort of concern.
               | 
               | If your core engineering staff is turning over multiple
               | times for arbitrary reasons you have other problems to
               | deal with.
               | 
               | > Engineers can't set up, maintain, and do on-call for
               | the colocated setup without subtracting from their
               | primary working hours.
               | 
               | If a company can't hire datacenter techs to install
               | hardware, cables, and swap hardware as smart remote
               | hands, maintaining as little as a couple of 45RU cabinets
               | of gear, you also have other management/systemic problems
               | to deal with. I'm looking at this from the point of view
               | of a facilities based bare metal ISP that owns/runs all
               | of its own hardware, and can tell you it's not rocket
               | science.
        
               | hagy wrote:
               | I worked at a company that migrated a 100 PB Hadoop
               | cluster to GCP for assorted reasons despite many years of
               | success with colocation. I wasn't involved in any of
               | this, but the team's decision process makes sense. You
               | can read through their decision making in these blog
               | posts:
               | 
               | * https://liveramp.com/developers/blog/google-cloud-
               | platform-g... *
               | https://liveramp.com/developers/blog/migrating-a-big-
               | data-en...
               | 
                | One big point was the challenge of maintaining multiple
                | colocation sites, with cross replication, for disaster
                | recovery. Since Hadoop triple-replicates all data within
                | one DC, dual DCs require 6 times the data size in disk
                | capacity. In contrast, cloud object
               | storage pricing includes replication within a region with
               | very high availability such that storing once in cloud
               | storage may be acceptable. Further, you also need double
               | the compute, with one of the DCs always standing by
               | should the other fail.
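                | 
                | The storage math alone is stark. A back-of-envelope
                | sketch (illustrative Python only, using the 100 PB
                | logical dataset size from above):
                | 
                |   # Rough sketch of the replication math; numbers
                |   # are illustrative only.
                |   data_pb = 100      # logical dataset size, in PB
                |   hdfs_copies = 3    # HDFS default replication
                |   dcs = 2            # second DC for DR
                |   raw_colo_pb = data_pb * hdfs_copies * dcs
                | 
                |   # Cloud object storage bundles replication into
                |   # the price, so you provision roughly the
                |   # logical size once.
                |   raw_cloud_pb = data_pb
                | 
                |   print(raw_colo_pb, "PB raw in colo")    # 600
                |   print(raw_cloud_pb, "PB raw in cloud")  # 100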
        
               | loriverkutya wrote:
               | If I have to pay the same amount of money and I can
               | choose to deal with people or deal with a decent sized
               | company (and leave them to deal with their people), I
                | always choose the latter.
        
               | PragmaticPulp wrote:
                | > If your core engineering staff is turning over multiple
               | times for arbitrary reasons you have other problems to
               | deal with.
               | 
               | People leave for all sorts of reasons: Moving for family
               | reasons, becoming stay-at-home parents, moving for a
               | spouse's job, retiring, starting their own companies, or
               | even just getting bored and wanting to do something
               | different. Or it could be as simple as getting promoted
               | to a different role.
               | 
               | It's unrealistic to make engineering decisions with the
               | assumption that the same engineers will be around and
               | stuck on the same project forever.
               | 
               | Like the OP said: Every hour they have to spend working
               | on the colocation setup is an hour they aren't spending
               | on your company's competitive advantage, so you have to
               | hire more engineers (and more managers) to compensate.
               | 
               | > If a company can't hire datacenter techs...
               | 
               | How many techs do you think you need for reasonable on-
               | call coverage? 3? 6? Add a manager in the mix because you
               | need someone to manage them.
               | 
               | The costs add up quickly.
               | 
               | It's weird to see people championing colocation as a cost
               | saver and then pivoting to arguments that you just need
               | to hire more engineers and techs and manage them.
               | 
               | Employees are expensive. One of the primary benefits of
               | cloud is that you don't have to hire and manage all of
               | these employees to do all of these things at the colo.
        
               | midasuni wrote:
               | My team of three looks after hundreds of bits of diverse
               | kit in dozens of locations across the world.
               | 
               | I can't remember the last time I took a call outside of
               | office hours, and even in hours it's very rare. There's
               | enough resilience built in that any issues can wait until
               | morning
               | 
               | The last major outage was in 2017, before we had a third
               | member of the team. I was on the other side of the world
               | installing a new system, the other was on leave. We had a
               | network issue, OSPF melted and knocked out some services,
               | we were down for about half an hour as I rebooted the
               | core switch pair remotely.
               | 
               | (We've since redesigned so that doesn't happen)
               | 
               | We get paid nowhere near six figures either.
               | 
               | Sure you can be ridiculous, I remember one team I worked
               | on that employed a full time unix contractor (on 3 times
               | the staff wage) to look after 6 servers and deploy a tar
               | ball every few months. I replaced him with a small shell
               | script. Another was a DBA looking after a small oracle
                | database (oracle - which of course is that generation's
                | "just use amazon")
        
               | walrus01 wrote:
               | I think a number of people who are taking the position of
               | "but it's so HARD, and so EXPENSIVE!" to own and run bare
               | metal network infrastructure may not have ever seen a
               | proper OOB management setup, with dedicated OOB network,
               | serial consoles on stuff, management routers and switch
               | at site, things like cradlepoint LTE radios stuck to the
               | top of colocation cages, etc.
               | 
               | And then basic other things like having remote smart
               | hands ready to go, and common failure items like fans,
               | power supplies, fan trays, hard drives pre-positioned and
               | ready to swap in. With MOPs for swapping them. Stocks of
               | basic things like fiber patch cables, commonly used
               | transceivers, copper patch cables, stored in every cage.
        
               | PragmaticPulp wrote:
               | > I think a number of people who are taking the position
               | of "but it's so HARD, and so EXPENSIVE!" to own and run
               | bare metal network infrastructure may not have ever seen
               | a proper OOB management setup, with dedicated OOB
               | network, serial consoles on stuff, management routers and
               | switch at site, things like cradlepoint LTE radios stuck
               | to the top of colocation cages, etc.
               | 
               | Or we have seen all of this and that's exactly why we
               | don't want it.
               | 
               | Building a company is hard enough. Adding the overhead of
               | developing, maintaining, recruiting for, and staffing our
               | own datacenter is madness when I can click a few buttons
               | and get the same thing from a cloud provider _without
               | hiring anyone extra_ to manage the datacenter.
               | 
               | No one is denying that a proper data center management
               | system can exist. We all know it can exist.
               | 
               | The issue is that it's a huge distraction with a lot of
               | potential pitfalls. Your network infrastructure with
               | Cradlepoint LTE radios in the colocation cages sounds
               | great after it works, gets set up, stays documented, and
               | all the bugs have been ironed out. But that's a lot of
               | hidden work that could have been allocated to launching
               | the product faster.
        
               | walrus01 wrote:
               | I think the difference in point of view here, is that
               | from my own perspective, owning and running the bare
               | metal things is a basic core competency of being an ISP.
               | Which is what I do for a living. The infrastructure I've
               | described up thread _is_ the product.
               | 
                | If the use case is somebody developing a software
                | product, that is a totally different scenario.
        
               | spookthesunset wrote:
               | For an ISP, you are probably correct.
        
               | spookthesunset wrote:
               | Been there, done that. No thanks. Every place that has
               | their own hardware also has a huge bureaucratic process
               | to get more hardware for your project. Not to mention
               | almost always the software stack is old as dirt. For
               | example mongo might be two or three major versions behind
               | current... and IT wants nothing to do with supporting the
               | new version.
               | 
               | People move to the cloud to escape their company's IT
               | process... there might be some unicorn company out there
               | that does infrastructure "right" but I've yet to work
               | there.
        
               | _s wrote:
               | ^^ This 1000x.
               | 
               | Humans are incredibly expensive and notoriously
               | unreliable when compared against "machines"; or in this
               | case an API.
               | 
                | It's usually worth paying 2-3x the cost to have someone
                | else manage something for you with a given SLA, because
                | that's roughly what it will end up costing you to bring
                | it in house once you take into account the time and
                | effort needed as well.
               | 
               | A "good" & "reliable" Systems Engineering team, that can
               | offer 24/7 support will take around a year to hire and
               | setup, and they need roughly the same amount of time to
               | transition you off AWS in to your system. They probably
               | need closer to 3-5yrs to give the same level of
               | documentation, API's, tooling, processes, UI's and
               | training that AWS already provides.
               | 
               | Let's call it 5 years to get to the level of AWS when you
               | started the transition.
               | 
               | A decent team of 5-7, including engineers + PO/PM + UX
               | and so forth, is at least $1.5M/yr. That's $7.5M over 5
               | years, not including your new hardware and networking
               | costs. Let's call it $10M. You're also 5 years behind AWS
               | now, and over that transition you're still paying AWS,
               | and your development speed has halved as you wait for
               | your new team to build or transition infrastructure.
               | 
               | You can trade cost and quality for speed and have
               | everything ready in 2-3 years by setting up a few teams.
               | Add HR support, more contractors etc etc. you're looking
               | at a $10M+ outlay again, regardless.
               | 
               | Or you can keep paying AWS $5M/yr, renegotiate fees often
               | and literally not worry about that headache and focus on
               | your product.
        
               | Dylan16807 wrote:
               | You act like AWS doesn't require you to have a team, or
               | carefully transition infrastructure. Because of that your
               | cost estimate is _much_ higher than the couple additional
               | people you should actually need.
               | 
               | > They probably need closer to 3-5yrs to give the same
               | level of documentation, API's, tooling, processes, UI's
               | and training that AWS already provides.
               | 
               | You don't need to build an internal AWS to manage your
               | own servers.
        
               | bserge wrote:
                | This all sounds like excuses. Insert that two-dog meme:
                | one builds a freaking datacenter using commodity
                | hardware in a barn, and the other uses AWS and complains
                | about lock-in and expenses.
               | 
               | As an example, imagine if the founders and engineers of
               | Backblaze thought like that.
        
               | Spooky23 wrote:
               | Sorry, that's bullshit.
               | 
               | It all depends on the size of the investment and how you
               | need to run it. I built a "new" environment on a company
               | premises due to some compliance requirements that would
               | be cost-prohibitive in AWS or GCP. The gear was procured
               | through a leasing vehicle, and the hardware vendor had an
               | SLA for delivering compute and storage. HPE happened to
               | win the bid.
               | 
               | There is very little difference operationally. From a
               | costing perspective, it's about 40% less than an AWS
               | solution. But in fairness, the customer had an existing
               | investment in a facility - you'd reduce the savings if
               | you had to lease appropriate space in a colo. There are
               | some differences in terms of headcount, but those staff
               | aren't in NYC/SFO/BOS, so they are very cheap -- senior
               | level engineers for $80-120k, fully loaded.
               | 
                | Startups do stupid shit like buying supermicro
                | computers and cobbling together hardware that gets them
                | into trouble when the mad scientist moves on to a new
                | gig. Makes sense
               | when you're drowning in VC money and need to hire people,
               | but doesn't make sense in most other ways. You avoid that
               | by doing competitive procurements and paying marginally
               | more for HPE/Dell/Lenovo/etc.
        
               | walrus01 wrote:
               | > cobbling together hardware that gets them into trouble
               | when the mad scientist moves on to a new gig.
               | 
               | If you think that standards based x86-64 hardware running
               | Linux and ZFS, or FreeBSD and ZFS is something that is
               | super unreliable and requires a "mad scientist", then
               | yes, you are definitely in HPE and Dell's target market.
        
               | spookthesunset wrote:
               | I laughed at the "mad scientist" comment because every
               | startup I've worked for has had a "mad scientist". They
               | are always very opinionated and have a lot of political
               | capital because of seniority. The weird concoctions they
               | create... the minute they leave all the remaining
               | engineers immediately replace most of it with off the
               | shelf stuff.
               | 
               | Home built web frameworks (which apparently aren't
               | "bloated" and "slow"), piles of bash scripts because they
               | never heard of Salt (or whatever is the latest config
               | management tool)...
               | 
                | Almost always they think they are "saving money" by
                | doing what they do; rarely do they ever consider the
                | opportunity cost of rolling the entire stack, from the
                | hardware to the web framework, on their own.
               | 
               | Good times.
        
               | mercurialuser wrote:
                | the op was talking about hardware, I think, but you are
                | spot on about the mad scientist. we are just in the
                | process of de-cluttering all the non-standard, hand-
                | made, highly customized, never documented stuff
                | produced by a coworker who, unfortunately, passed away.
               | 
               | btw, we also wrote a HA cluster software for sun solaris
               | in year 2000...
        
               | [deleted]
        
               | mercurialuser wrote:
                | I run 10-year-old HP servers, something like DL580 G5,
                | with care packs (extended, paid warranty). We needed to
                | flash firmware, 2 motherboards broke, and they sent
                | spare parts for the replacements. With less
                | enterprise-y server firms it may be difficult to find
                | spare parts after 10 years..
        
             | markus_zhang wrote:
              | We are probably looking at a future in which Cloud
              | computing == the Mainframes of the 50s~80s and fewer and
              | fewer people even know how to run the whole scene on-
              | premise. People who got into cloud computing early
              | (mostly by luck) get to win big bucks and better
              | lifestyles while others try to dispel the magic from left
              | and right.
        
               | walrus01 wrote:
               | ultimately, though, somebody has to own, house and run
               | those mainframes, so it's just abstracting the work away
               | to some other group of people. lots of people made
               | careers out of running mainframes and minicomputers in
               | the 1955-1985 time frame. in the case of things like aws,
               | azure, etc, it's just a lot more centralized in a smaller
               | number of gargantuan companies.
        
               | markus_zhang wrote:
                | Since the future is pretty much set, I think it's more
                | relevant to try to obtain those skills (albeit harder
                | to obtain because fewer companies have them) and jump
                | on the bandwagon.
        
             | amluto wrote:
             | You have to factor in the cost of egress from AWS to your
             | nice colocated drive.
        
               | walrus01 wrote:
               | welcome to the hotel california...
               | 
               | Last thing I remember, I was
               | 
               | Running for the door
               | 
               | I had to find the passage back
               | 
               | To the place I was before
               | 
               | "Relax, " said the night man,
               | 
               | "We are programmed to receive.
               | 
               | You can check-out any time you like,
               | 
               | But you can never leave! "
        
               | amluto wrote:
               | And this is why I don't think AWS will lower egress fees
               | in response to R2. AWS may be more interested in
               | discouraging people from using egress than in capturing
               | the revenue from egress. I predict that, at most, we'll
               | see a narrowly tailored reduction in egress fees that is
               | designed to be entirely useless for communication between
               | server applications.
        
         | sorenjan wrote:
         | "Nobody Ever Got Fired for Buying IBM"
         | 
         | Maybe it's worth a couple of million to not have to deal with
          | the risk, and just keep the status quo.
        
           | dilyevsky wrote:
            | Exactly, most management would prefer to just set
            | investors' money on fire and keep their risk profile low if
            | their business
           | model can support it
        
         | mwcampbell wrote:
         | I would hope that for a high-throughput DB cluster like this,
         | they're using instance-local storage rather than EBS. If that's
         | the case, then they're probably already taking advantage of EC2
         | reserved instances to save a lot compared to the on-demand
         | prices that we usually see.
        
           | jhk727 wrote:
           | We are, though we started out using EBS. As you mentioned,
           | NVMe instance storage performs much better for our workload.
           | We work around the lack of durability through strong
           | automation of point in time restore/swapping in of new nodes
           | in case of hardware failures.
           | 
           | And yes, reservations make a massive difference economically.
        
         | enginaar wrote:
         | Bank of America saves $2 billion per year
         | https://www.google.com/search?client=safari&rls=en&q=bank+of...
        
         | dayjah wrote:
         | I feel it's fair that they're on AWS right now. Generally the
         | arc of MVP->IPO involves using the cloud to find product market
         | fit, and as that fit improves your revenues should also. Moving
         | from the cloud to a colo would then be driven by capital
          | investment to bring down COGS; to either improve PPS or get to
         | cash-flow positive.
         | 
         | Heap using AWS just means they've not yet reached a point on
         | that trajectory where the capital investment moves the needle
         | enough to warrant it. That could be for any number of reasons.
        
         | Stevvo wrote:
          | Far lower risk and capital investment than colocation. I've
         | never had to store petabytes of data, but I would imagine the
         | considerations are not too different to smaller scales.
        
         | [deleted]
        
         | jasode wrote:
         | _> petabyte scale are doing it on AWS. I would think that by
         | then you could save millions more by migrating to your own
         | colocated hardware._
         | 
         | Usually, those types of judgements are based on thinking of AWS
         | as a "dumb datacenter" such as a bunch of harddrives or just
         | bare cpu.
         | 
         | AWS is more cost-effective _if you use high-level AWS services_
         | instead of just storing files in the cloud. In this case, it
         | looks like Heap is also using _AWS Redshift_ and probably a
         | bunch of other services in the AWS portfolio. A similar comment
         | I made previously:
         | https://news.ycombinator.com/item?id=28288352
         | 
          | So for self-hosting hardware, Heap would not only have to
          | build up the petabytes of diskspace, they would also have to
          | replicate Redshift functionality and the entire AWS _services
          | portfolio_ they're
         | using. If you use enough AWS _services_, it _becomes cheaper_
         | than self-hosting because you don 't have to reinvent the
         | wheel.
        
           | jjav wrote:
           | > AWS is more cost-effective if you use high-level AWS
           | services instead of just storing files in the cloud.
           | 
           | Mostly this only works when your utilization is low(ish).
           | Once you have high load 24x7, the AWS profit margin will
           | quickly overtake the self-hosted solution.
        
             | cestith wrote:
             | With spiky utilization, you're buying and powering a lot of
             | hardware to sit idle a good portion of the time.
        
           | deeblering4 wrote:
            | My main takeaways from this are that cloud vendor lock-ins
           | are real, and they can be hard to break free from.
           | 
           | Perhaps that's more of a cautionary tale for new projects
           | than a justification for the expense though.
        
             | jasode wrote:
             | _> Perhaps that's more of a cautionary tale for new
             | projects than a justification for the expense though._
             | 
             | You can find case studies for both positions:
             | 
             | - migrate to AWS to save money: Netflix, Guardian newspaper
             | [1]
             | 
             | - migrate away from AWS to save money: E.g. Dropbox [2]
             | 
             | A lot of companies (especially non-tech businesses) don't
             | have the technical skills to run internal datacenters at
             | the same competency as AWS. Thus, they don't want to be
             | "locked in" to their own IT department that's slow and
             | handicaps their business.
             | 
              | Dropbox, Facebook, and Walmart would be among the very
              | few that can competently run their own datacenters with
              | advanced services like AWS.
             | 
             | [1] https://web.archive.org/web/20160319022029/https://www.
             | compu...
             | 
             | [2] https://www.google.com/search?q=dropbox+migrates+off+aw
             | s+sav...
        
               | Tehnix wrote:
               | And then Dropbox shifted kinda back again, at least
               | partially [0], it's interesting to see the ebb and flow
               | :)
               | 
               | [0]: https://aws.amazon.com/solutions/case-
               | studies/dropbox-s3/
        
               | ignoramous wrote:
               | Wait. Why? How? Their in-house system (Magic Pocket /
               | Diskotech) _seemed_ so promising.
               | 
               | https://dropbox.tech/tag-results.magic-pocket
        
               | nosefrog wrote:
               | They're for different use cases. Magic Pocket is for
               | storing file block data, and according to the AWS
               | article, they just moved their analytics data to AWS.
        
               | jasode wrote:
               | _> Their in-house system (Magic Pocket / Diskotech)
               | seemed so promising._
               | 
               | The story described Dropbox moving _" 34 PB of analytics
               | data (Hadoop)"_ to AWS.
               | 
                | My reading is that Dropbox's Magic Pocket / Diskotech
                | is storage for _customer raw data_ -- similar to
                | BackBlaze-style raw storage.
               | 
               | It's 2 different use cases so it's not surprising Dropbox
               | found AWS to be effective for analytics workloads. AWS
               | has an _extensive portfolio of software services to
               | analyze data_ so Dropbox may have concluded paying AWS
               | would _cost less_ than reinventing the analytics pipeline
               | in-house.
        
             | r3trohack3r wrote:
             | What you're calling "vendor lock-ins" I'm calling
             | "providing sufficient value to justify cost."
             | 
             | It's not that migrating out isn't possible, it's that
             | Amazon is providing "Engineering/SiteOps Departments as a
             | Service" at a price that's hard to compete with in house.
        
               | whydoyoucare wrote:
                | I am not sure of the size of your company or its budget
                | for in-house infrastructure, but we realized AWS is not
                | just a lock-in, but also a permanent money drain.
        
               | hackerfromthefu wrote:
               | What's the newspeak for the high egress fees?
        
         | aaronblohowiak wrote:
         | The amount of staff you have to have on-hand and amount of pre-
         | planning (and up-front capital commitment) can all make that
         | very unattractive long after the basic per-GB price would seem
         | to make it attractive.
        
           | bbarnett wrote:
           | No. You already have 24x7 staff at this scale. Hardware
           | requires thought and skill, but then so does software. It
           | isn't voodoo.
        
             | outworlder wrote:
             | > No. You already have 24x7 staff at this scale. Hardware
             | requires thought and skill, but then so does software. It
             | isn't voodoo.
             | 
             | Not necessarily. Hardware requires people to physically
             | replace failed drives and otherwise do on-site maintenance.
             | 
             | In the _unlikely_ event that an AWS volume fails, I can
             | (and have) automation to fix that. While everyone sleeps.
        
               | Robotbeat wrote:
                | Okay, but it's not hard to set up redundancy and warm
               | spares as well to make it automatic. You don't need
               | someone physically there.
        
               | deeblering4 wrote:
               | > Hardware requires people to physically replace failed
               | drives and otherwise do on-site maintenance.
               | 
               | This is the premise of colocation (as opposed to building
               | your own server room). A colo is a secure building with
               | round the clock staff. Hardware vendors offer rapid on-
               | site parts replacements and can gain access via the on-
               | site staff, and the colo has services to perform on-site
               | work like "remote hands" as well.
               | 
               | > In the unlikely event that an AWS volume fails, I can
               | (and have) automation to fix that. While everyone sleeps.
               | 
                | Fault tolerant architectures can be deployed on
                | colocated hardware too.
        
               | outworlder wrote:
               | The point is - I can do any changes I need to the
               | underlying resources programmatically and near instantly,
               | without ever having to talk to anyone. Including cloud
               | provider staff. Or rather, automation can.
               | 
               | There may exist some colo where I can get a server(or
               | storage, or network cards or anything else) added in
               | minutes over an API call but I haven't heard of any.
               | That's usually found on the VPS side.
               | 
                | > Fault tolerant architectures can be deployed on
                | colocated hardware too.
               | 
                | They can, though usually requiring that you
                | specifically set up redundancies and the like. Which is
                | something you
               | _already_ have for many cloud offerings. Your automation
               | and redundancies sit on top of the vendor 's existing
               | redundancies.
               | 
               | For instance, the EBS volume I mention. It is not a disk.
                | It's not even just an array, but a far more sophisticated
               | abstraction. If there are issues, it can automatically
                | fetch blocks from your snapshots (if the blocks are
               | unmodified, something they also keep track of). Not happy
               | with spinning disks and want a SSD? No need to place a
               | service order to your colo provider, just send an API
               | call and this will be automatically migrated to SSDs
               | without your applications ever noticing the difference
               | (other than the response time) and with zero downtime.
               | Your software could even do this if it notices that the
               | workloads require it.
               | 
               | If an AWS datacenter goes up in flames the systems I
               | manage will still function (and will self-heal, assuming
               | they even get affected, which for big zones they might
               | not be). I don't have to talk to anyone. I can be
               | sleeping and this will still happen.
               | 
               | It's a completely different level of abstraction.
               | 
               | If you want to compare a big cloud provider with either
               | your own datacenter or colocation facility, there's a big
               | disparity in scale. At a minimum, you would have to
               | compare with several interconnected datacenters or colos.
               | You still don't get the abstraction layer.
               | 
               | It's all missing the point though - I was pointing out
               | that software doesn't necessarily need to have 24x7
               | staff, as the parent poster was pointing out, even for
               | exceptional (but predictable) issues. Sure, you need
               | someone on-call to handle completely unexpected events,
               | but I don't think that was the point being made.
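                | 
                | To make that concrete, the volume-type change is a
                | single API call (boto3 sketch below; the volume id
                | is a placeholder, and Elastic Volumes support plus
                | IAM permissions are assumed):
                | 
                |   # Switch an attached EBS volume from spinning
                |   # disk (st1) to SSD (gp3) in place.
                |   import boto3
                | 
                |   ec2 = boto3.client("ec2")
                |   resp = ec2.modify_volume(
                |       VolumeId="vol-0123456789abcdef0",  # fake id
                |       VolumeType="gp3",
                |   )
                |   mod = resp["VolumeModification"]
                |   print(mod["ModificationState"])  # "modifying"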
        
               | jjav wrote:
               | > There may exist some colo where I can get a server(or
               | storage, or network cards or anything else) added in
               | minutes over an API call but I haven't heard of any.
               | That's usually found on the VPS side.
               | 
               | To be fair, you can't get hardware added in AWS via API
               | call either. What you can do is spin up
               | instances/storage/etc via API call, as long as that spare
               | capacity hardware is already set up, available and ready
               | to be allocated to you. Which you can also do on on-prem
               | hardware.
               | 
               | If you're saying that your utilization is so peaky or
               | unpredictable that you end up needing an order of
               | magnitude more, or fewer, resources available day to day,
               | then you are absolutely correct that provisioning so much
                | spare capacity on-prem would be prohibitive. This is a
                | use case where AWS excels.
               | 
               | But if your utilization doesn't have dramatic peaks and
               | growth is mostly predictable, then it becomes practical
               | to provision for it on-prem and it'll be a lot cheaper.
        
             | koolba wrote:
             | It is a different skill though. Going from zero to one for
             | physical infrastructure is a significant leap in both cost
             | and operational process. You need to manage inventory,
             | provide 24/7 physical access, and set up supply chains to
             | ensure you have ongoing availability.
        
               | z3t4 wrote:
                | You don't have to do everything in house; you can, for
                | example, buy servers with an on-site support agreement.
               | Then you just have to buy new servers at regular
               | intervals, you don't need to have a guy that can fix a
               | server with a soldering pen.
               | 
               | Same for internet connection, you can buy transit, no
               | need to become your own ISP. So you don't need people who
               | deal with peering agreements, etc.
               | 
               | For electricity you can make a support deal with a local
               | electrician company. You don't need a guy who can build
               | and maintain a custom power supply unit.
               | 
               | It does help however to have someone with basic sysadmin
               | and network skills. But if you don't have that, you will
               | sooner or later screw up your AWS infrastructure too.
        
               | tomnipotent wrote:
               | > support deal with a local electrician company
               | 
               | Maybe if you're wiring a closet in your office, but no
               | colo facility is letting you within ten feet of their
               | power infrastructure. The best you're getting is a
                | rack-mounted UPS.
        
               | manquer wrote:
                | I think the OP means setting up your own DC; for colo,
                | most of this is offered by the DC partner anyway, and
                | you would only go with something else if there is a
                | very good reason not to use what they offer.
        
               | Spivak wrote:
               | Y'all are making "buy an asset tag printer", "have a rep
               | from Dell/HP" and "use the data center's remote hands if
               | you need it" sound crazy complicated.
        
             | fwip wrote:
             | Not always true. Some data is intrinsically bigger than
             | others.
             | 
             | If you have a petabyte of chatlogs, sure, you have 24x7
             | obligations to millions of people. If you have a petabyte
             | of astronomy data, you have like 3 research scientists
             | using it.
        
               | Robotbeat wrote:
               | The research scientists DEFINITELY can't afford to run
               | petabytes of astronomy data on AWS. Source: am a research
               | scientist.
        
               | fwip wrote:
               | Oh, for sure. Just that you don't usually have the amount
               | of dedicated staff that was implied.
        
         | namdnay wrote:
         | Keep in mind that nobody at large scale is paying the sticker
         | price for AWS (or Google or Azure)
        
         | ksec wrote:
         | >storage clusters at petabyte scale are doing it on AWS.
         | 
          | I had to double check just in case, but a petabyte is only
          | 1000 terabytes. It may be big in database terms, but rather
          | small in absolute terms. You could fit a single Petabyte in a
          | 1U server.
         | 
          | I doubt they pay listed price. And AWS is now mostly an
          | Enterprise and Sales game. So once you factor in the other
          | costs involved in managing it, I would think you need to be
          | at multi-rack scale before the costs break down better for
          | your own hardware.
         | 
          | And that is excluding the other benefits of sitting inside
          | the AWS ecosystem. The only thing I think AWS isn't so good
          | at is the low cost, sub-$1000 per month spending scenario,
          | where you are paying a lot more just for staying inside the
          | ecosystem for things you may not be using. Those tend to
          | favour Linode or DO.
        
           | toast0 wrote:
           | > You could fit a single Petabyte in a 1U server.
           | 
           | That seems a bit over the top. I see 18 TB drives available,
           | but let's posit 20 TB drives, so you need 50 of them. I don't
           | think you can fit 50 3.5" drives in a 1U space, even if
           | there's no motherboard or power supply. 50+ drive storage
           | chassis are generally 4U. I did see some 16 drive 1U servers
           | though, so I'm pretty sure you could fit that much storage
           | into 3U even though I also didn't see any 3U storage chassis.
        
         | WJW wrote:
         | The discounts you can get from doing anything at big enough
         | scale will push your costs back to colocation prices. Don't
          | assume that anyone with a cloud bill over 200k is paying
          | anywhere near the price you read on the pricing page.
        
           | Robotbeat wrote:
           | "Will"? I doubt it. Definitely not with that level of
           | certainty.
        
             | tomnipotent wrote:
             | > certainty
             | 
             | Considering the number of people here commenting about
             | costs but have never managed a P&L, I don't think certainty
             | is high on the list.
        
           | FpUser wrote:
            | Nope. I had a chance to compare what one org got for 600k
            | of real money after all the discounts. Not even remotely
            | close to what one can get for rented dedicated servers.
        
       | magicalhippo wrote:
       | > For these reasons, it's generally recommended not to let ZFS go
       | past 80% disk utilization.
       | 
       | There's another reason why you don't want to go beyond 80%
       | utilization, and that's because the block allocator will switch
       | behavior to a more involved search, which can take a lot more
       | time.
       | 
       | Thus allocating new blocks can get really slow once you get past
       | 80%.
        
         | jhk727 wrote:
         | Thank you for the clarification - I had heard from a few
         | sources that the block allocator algorithm actually changes at
         | higher utilization, but was previously unable to find anything
         | concrete in the documentation. This helped clear up a
         | longstanding curiosity for me.
        
         | k8sToGo wrote:
         | Does the problem go away again if you go back to below 80%?
        
           | magicalhippo wrote:
           | After a bit of digging, yes but no.
           | 
            | So it turns out it's a bit more involved than the commonly
            | repeated straight-up 80% == bad scenario. ZFS by
           | default divides[1] each vdev (RAIDZ or mirror set) into ~200
           | allocation regions called metaslabs[2].
           | 
           | When allocating from a metaslab[3] it will check if the free
           | space _in that metaslab_ is below the threshold defined by
           | metaslab_df_free_pct. It seems the threshold was changed to
           | 4% free space at some point[4].
           | 
           | If the free space is above the limit it will use the fast
           | first-fit search, if not it will use the expensive best-fit
           | search.
           | 
           | However, as noted that threshold is per metaslab. So if the
           | pool is fragmented, even though the overall free space in the
           | pool is above the 4% threshold, there might be metaslabs with
           | less than that free, which will lead to the expensive best-
           | fit search.
           | 
           | So it's not a hard limit, but it should start to be
           | noticeable above 80%.
           | 
           | [1]: https://www.delphix.com/blog/delphix-
           | engineering/openzfs-cod...
           | 
           | [2]: http://dtrace.org/blogs/ahl/2012/11/08/zfs-trivia-
           | metaslabs/
           | 
           | [3]: https://github.com/openzfs/zfs/blob/master/module/zfs/me
           | tasl... (note metaslab_df_free_pct)
           | 
           | [4]: https://www.truenas.com/community/threads/zfs-tweak-for-
           | firs...
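            | 
            | As a toy illustration of that per-metaslab decision (not
            | the real OpenZFS code, just the shape of it, using the 4%
            | threshold):
            | 
            |   # Toy model: the first-fit/best-fit switch happens
            |   # per metaslab, not per pool.
            |   METASLAB_DF_FREE_PCT = 4
            |   SLAB = 1 << 30  # pretend each metaslab is 1 GiB
            | 
            |   def strategy(free_bytes):
            |       free_pct = 100.0 * free_bytes / SLAB
            |       if free_pct >= METASLAB_DF_FREE_PCT:
            |           return "first-fit (fast)"
            |       return "best-fit (slow search)"
            | 
            |   # A pool above 80% full overall can have unevenly
            |   # full metaslabs, so some allocations stay fast
            |   # while others get slow:
            |   for free_frac in (0.30, 0.10, 0.05, 0.02):
            |       print(free_frac, strategy(free_frac * SLAB))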
        
         | GhettoComputers wrote:
          | I've read that any TRIM-supporting SSD prevents this, and all
          | SSDs have extra blocks (some more than others) that aren't
          | utilized by default and are designed to replace any bad
          | blocks. https://www.truenas.com/community/resources/some-
          | differences... It seems like ZFS might be special, and larger
          | SSDs will have different cell blocks and be faster because
          | it's 2x256 boards that can be used concurrently versus 1x256,
          | which will have half the write speed. SSDs also complicate it
          | further with RAM caches and mixed storage (like SLC+TLC/QLC),
          | so a cheaper drive with no RAM or SLC cache, or a smaller
          | size with fewer memory chips, will be affected more. I
          | remember getting the Evo 850 because it had great firmware
          | with RAM and SLC cache; it was 3D TLC but its speed was
          | still excellent.
        
       | bbarnett wrote:
       | Good grief.
       | 
        | Talk of rsync backups on live DB systems. zfs. On-the-fly disk
        | encryption.
       | 
        | All I can think is that these guys have clearly never worked
        | with spinning disks and large datasets.
       | 
       | So much headroom to waste with SSDs, people are spoiled today.
        
         | laumars wrote:
         | Nothing you've described there wasn't possible with spinning
         | disks storing large data sets. In fact if anything, ZFS is
         | ideally suited to exactly that scenario.
        
           | bbarnett wrote:
            | I assure you, rsync can tank a highly tuned db environment.
            | Even 'updatedb', which updates the db that 'locate' uses
            | under linux, can cause issues when its cron job runs.
           | 
           | It all depends upon how much headroom you have, the type of
           | io activity, etc. I've operated systems under consistent 80%
           | io load with massive datasets at the time, under spinning
           | disks.
           | 
           | Running rsync would be madness on such a system. I know. I
           | only did it once.
        
             | ComodoHacker wrote:
              | Let me guess, your system didn't have ionice back then?
        
             | whitepoplar wrote:
             | They never said they were using rsync on their datasets.
             | They're likely using ZFS snapshots.
        
         | netizen-936824 wrote:
         | I'm not sure I understand your comment. Is this extra headroom
         | a bad thing?
        
           | bbarnett wrote:
            | More of a jealous thing, and astonishment at how much extra
            | hardware is used to give that headroom.
           | 
           | Over the years, I've run comparatively larger datasets, on
           | significantly less hardware.
           | 
           | edit:
           | 
           | When I switched to SSDs for the first time, to give you a
           | performance example, on some read queries I saw a 1000x to
           | 10000x speed improvement.
           | 
           | This of course was on a read only secondary, long running
           | reporting queries, no one runs queries of that nature on a
           | primary.
           | 
           | SSD were an insane game changer.
        
             | terr-dav wrote:
             | I think you're feeling envy, not jealousy. ;]
        
       | nine_k wrote:
       | A semi-related story from ancient past.
       | 
       | Back in the university days we've built an information retrieval
       | system. It ran on an IBM PC XT, with a 20MB HDD, which was pretty
       | slow.
       | 
       | The heaviest queries to the information system involved full
       | scans. They were too slow, slower than had been agreed with the
       | customer.
       | 
       | So we installed a disk compression program, maybe Stacker or
       | something similar. It ate some of the already slow 8088 CPU, and
       | some of the scarce RAM. But crucially it compressed the data to
       | about 50% of the original size.
       | 
       | This made the number of blocks to read, and, most importantly, to
       | seek twice as low. The query speed increased twofold. We
       | successfully completed the (tiny) software development contract.
        
         | [deleted]
        
         | bombcar wrote:
         | Whenever CPU speeds surpass disk speeds, compression becomes
         | king; when the opposite happens it dies away. I don't know if
         | we'll ever see disk speeds compete with CPU speeds again,
         | however.
        
       | peremasip wrote:
        | This is pretty interesting, because the effects of migrating
        | from lz4 to zstd were:
        | 
        | - Total storage usage reduced by ~21% (for our dataset, this is
        | on the order of petabytes)
       | 
       | - Average write operation duration decreased by 50% on our
       | fullest machines
       | 
       | - No observable query performance effects
       | 
       | It seems like the better compression ratio and resulting reduced
       | IO more than makes up for increased CPU compared to lz4. I wish
       | they had mentioned the actual effect on CPU.
       | 
       | Compare to the recent thread "The LZ4 introduced in PostgreSQL 14
       | provides faster compression" [0] where the loudest voices were
       | saying that zstd would not work due to increased CPU. This is a
       | different layer (filesystem compression vs db comrpession), but
       | this article represents an interesting data point in the
       | conversation.
       | 
       | [0]: https://news.ycombinator.com/item?id=29147656
        
         | pirata99 wrote:
         | hey you copied this comment!!
         | 
         | that's mean >:(
        
           | peremasip wrote:
           | hey you copied this comment!! that's mean >:(
        
       | whitepoplar wrote:
       | Kinda unrelated, but I have a question for anyone who is
       | knowledgeable about running Postgres on ZFS...does setting a
       | large-ish ZFS block size (e.g. 64kB) for use with Postgres
       | (default 8kB blocks) cause a great deal of write-amplification
       | even when ZFS `full_page_writes = off`?
        
       | infogulch wrote:
       | This is pretty interesting, because the effects of migrating from
       | lz4 to zstd were:
       | 
       | - Total storage usage reduced by ~21% (for our dataset, this is
       | on the order of petabytes)
       | 
       | - Average write operation duration decreased by 50% on our
       | fullest machines
       | 
       | - No observable query performance effects
       | 
       | It seems like the better compression ratio and resulting reduced
       | IO more than makes up for increased CPU compared to lz4. I wish
       | they had mentioned the actual effect on CPU.
       | 
       | Compare to the recent thread "The LZ4 introduced in PostgreSQL 14
       | provides faster compression" [0] where the loudest voices were
       | saying that zstd would not work due to increased CPU. This is a
       | different layer (filesystem compression vs db compression), but
       | this article represents an interesting data point in the
       | conversation.
       | 
       | [0]: https://news.ycombinator.com/item?id=29147656
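        | 
        | If you want a rough feel for this tradeoff on your own data
        | before touching a filesystem, the userspace Python bindings
        | are an easy approximation (these are not the in-kernel ZFS
        | code paths, and "sample.bin" is whatever file you want to
        | test with):
        | 
        |   # Directional comparison of lz4 vs zstd ratio and CPU.
        |   # pip install lz4 zstandard
        |   import time
        |   import lz4.frame
        |   import zstandard
        | 
        |   def bench(name, compress, data):
        |       t0 = time.perf_counter()
        |       out = compress(data)
        |       dt = time.perf_counter() - t0
        |       ratio = len(data) / len(out)
        |       print(f"{name}: ratio {ratio:.2f}, {dt*1000:.0f} ms")
        | 
        |   data = open("sample.bin", "rb").read()
        |   zc = zstandard.ZstdCompressor(level=3)
        |   bench("lz4 ", lz4.frame.compress, data)
        |   bench("zstd", zc.compress, data)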
        
         | walrus01 wrote:
         | I wonder what, if any, further improvement would be had by
          | comparing xz vs zstd.
         | 
          | Obviously you need a LOT of CPU to throw at xz if you want to
          | use it.
         | 
         | zstd is very much more optimized for compression at speeds
         | comparable to traditional gzip.
         | 
          | I use xz primarily for things that will get compressed to
         | long term storage and the time to create the archive isn't a
         | really important factor.
         | 
         | in this test: https://sysdfree.wordpress.com/2020/01/04/293/
         | 
         | zstd level 19 wins on time vs. xz levels 5 through 9, but the
         | xz ultimate compressed file size is definitely smaller.
        
           | ncmncm wrote:
            | If your system experiences periods of greater and lesser
            | load, then using whatever load capacity is left over during
            | the quieter periods to further compress its contents might
            | be worth the bother.
           | 
           | Perhaps better than stepping to a different compression
           | algorithm, zstd has multiple levels of compression that might
           | be used at different times. The advantage there is that the
           | same decompression algorithm works for all.
           | 
           | One might reasonably hope that decompression tables may be
           | shared amongst multiple of the 64k raw blocks, to further
           | squeeze usage.
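            | 
            | For what it's worth, that works because one zstd
            | decompressor reads frames written at any level, so the
            | read path never has to change. A small sketch with the
            | userspace binding ("sample.bin" is any test file):
            | 
            |   import zstandard
            | 
            |   data = open("sample.bin", "rb").read()
            |   light = zstandard.ZstdCompressor(level=3)
            |   heavy = zstandard.ZstdCompressor(level=19)
            |   fast = light.compress(data)   # busy periods
            |   slow = heavy.compress(data)   # idle periods
            | 
            |   dctx = zstandard.ZstdDecompressor()
            |   assert dctx.decompress(fast) == data
            |   assert dctx.decompress(slow) == data
            |   print(len(fast), len(slow))  # 19 usually smaller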
        
         | jhk727 wrote:
         | Author here - it's difficult to provide a single number to
         | summarize what we've observed re: CPU, but one data point is
         | that average CPU utilization across our cluster increased from
         | ~40% to ~50%. This effect is more pronounced during NA daylight
         | hours.
         | 
         | Worth noting that part of the reason this is relatively low
         | impact for our read queries is that the hot portion of our
         | dataset is usually in Postgres page cache where the data is
         | already decompressed (we see a 95-98% cache hit rate under
         | normal conditions). We've noticed the impact more for
         | operations that involve large scans - in particular, backups
         | and index builds have become more expensive.
        
           | TedDoesntTalk wrote:
           | how/why did you choose Postgres over MariaDB? I am facing
           | such a decision now.
        
           | infogulch wrote:
           | Hey thanks for the clarification. That seems like a
           | worthwhile tradeoff in your case.
           | 
           | For backups in particular, are ZFS snapshots alone not
           | suitable to serve as a backup? Is there something else that
           | the pg backup process does that is not covered by a "dumb"
           | snapshot?
        
             | jhk727 wrote:
             | We use wal-g and extensively leverage its archive/point-in-
             | time restore capabilities. I think it would be tricky to
             | manage similar functionality with snapshots (and possibly
             | more expensive if archival involved syncing to a remote
             | pool).
             | 
             | That being said, wal-g has worked well enough for us that
             | we haven't put a ton of time into investigating
             | alternatives yet, so I can't say for sure whether snapshots
             | would be a better option.
        
         | matsur wrote:
         | https://blog.cloudflare.com/squeezing-the-firehose/ is our
         | story of how we moved from lz4 to zstd (with a stop at snappy
         | in between) in our kafka deployments. Results are/were similar
         | to what Heap is reporting here.
        
       | willis936 wrote:
       | For anyone like me: home usage workloads are read-heavy with
       | files that are predominantly already compressed. Moving to
       | Zstandard might be interesting to toy with if you have more
       | compute and disk I/O than network throughput, but the benefits
       | would likely be smaller.
        
       | wanderer2323 wrote:
       | ... from ZFS (lz4) to ZFS 2.x (Zstandard).
        
         | chungy wrote:
         | It is an upgrade, but don't mistake ZFS 2.x as making zstandard
         | mandatory. The default for compression=on is still lz4.
        
         | [deleted]
        
       | B1FF_PSUVM wrote:
       | I don't have a dog in that race, but I've seen it said that the
       | DB architecture itself should be reviewed, because SSDs make it
       | possible to use databases in higher "normal form", with more
       | tables that require more lookups, but less data volume.
       | 
       | E.g. https://drcoddwasright.blogspot.com: _" In a time of SSD,
       | multi-core/processor, two terabyte memory and Optane App Direct
       | Mode machines, there is no reason not to build from BCNF data.
       | Time to do what Dr. Codd demonstrated. Technology has finally
       | caught up with the maths."_
        
       | otterley wrote:
       | [deleted]
        
         | whitepoplar wrote:
         | The post is literally about how ZFS compression saves them
         | millions of dollars.
        
           | pengaru wrote:
           | > The post was literally about how ZFS compression saves them
           | millions of dollars.
           | 
           | ... relative to their previous ZFS configuration.
           | 
           | They didn't evaluate alternatives to ZFS, did they? They're
           | still incurring copy-on-write FS overhead, and the
           | compression is just helping reduce the pain there, no?
        
             | drob wrote:
             | Zstandard gets us 5.5x compression. The previous ZFS config
             | got us 4.4x compression.
             | 
             | XFS, which we ran on for years before rolling out ZFS, does
             | not compress.
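              | 
              | For what it's worth, those two ratios are consistent
              | with the ~21% reduction quoted elsewhere in the thread:
              | 
              |   # physical size = logical size / compression ratio
              |   logical = 1.0           # any fixed dataset size
              |   old = logical / 4.4     # physical under lz4
              |   new = logical / 5.5     # physical under zstd
              |   print(f"{100 * (1 - new / old):.0f}% less")  # ~20%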
        
               | pengaru wrote:
               | Thanks for the clarification
        
       | nwmcsween wrote:
       | OK a one thing that stands out here and please correct me if I'm
       | wrong:
       | 
       | > ... multi-petabyte cluster of Postgres instances... blocksize
       | relatively high at 64 kb ...
       | 
        | The dataset recordsize should match the Postgresql "page size",
        | which IIRC is 8KB; the reasoning is that RMW cycles will read
        | 64KB, modify 8KB, and write out the full 64KB, amplifying
        | writes 8-fold.
       | 
       | Also IIRC Postgresql will automatically use TOAST when needed?
        
         | jhk727 wrote:
         | Good callout - we use a higher blocksize than Postgres page
         | size because it gives us a much higher compression ratio, at
         | the cost of some read/write amplification.
         | 
         | And yes - Postgres will automatically TOAST oversized tuples
         | and compress the relevant data (if you configure it to do so).
         | This is much lower impact for us than filesystem level
         | compression, as it doesn't affect the main relation heap space
         | (or any indexes).
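         | 
         | For anyone wanting to try this at home, the knobs involved
         | are just dataset properties (the pool/dataset name below is
         | made up, and changed properties only apply to newly written
         | records):
         | 
         |     zfs set recordsize=64K tank/pgdata
         |     zfs set compression=zstd tank/pgdata  # or e.g. zstd-3
         | 
         | Existing data keeps its old recordsize and compression until
         | it is rewritten.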
        
           | nwmcsween wrote:
           | What about: https://people.freebsd.org/~seanc/postgresql/scal
           | e15x-2017-p...
           | 
            | A 16k recordsize gives 2x amplification and still (?)
            | allows compression w/ lz4.
        
             | jhk727 wrote:
             | We tested this extensively a few years back. We saw a
             | compression ratio of ~1.9 with 8k recordsize/lz4, ~2.7 with
             | 16k/lz4, and now ~5.5 with 64k/zstd.
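              | 
              | If you want the equivalent number for your own pool,
              | it's exposed as a read-only dataset property (dataset
              | name below is made up):
              | 
              |     zfs get compressratio tank/pgdata
              | 
              | It reflects data as written, so after a settings change
              | it only converges once old records are rewritten.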
        
           | nwmcsween wrote:
            | There has to be something better than a potential 8-fold
            | write amplification in exchange for compression.
        
       | mikewarot wrote:
       | My understanding of SSD architecture is that you can't flip
       | bits in place: writes happen a page at a time, and pages can
       | only be erased in much larger erase blocks. Erasing is far
       | slower than writing, so an SSD stalls whenever it has to erase
       | blocks on demand to free up pages. Thus a full SSD (which
       | internally reserves a few % of spare capacity the customer
       | isn't supposed to be able to access) is a _slow_ SSD.
       | 
       | It would seem to me that if you keep the utilization of the
       | disk under 80%, and support TRIM (which lets the SSD know which
       | blocks can be erased), you should be able to get really high
       | performance out of them with a copy-on-write file system.
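       | 
       | For ZFS specifically, TRIM is supported but off by default; a
       | rough sketch of what enabling it looks like (pool name is made
       | up):
       | 
       |     zpool set autotrim=on tank   # trim freed space as you go
       |     zpool trim tank              # or a one-shot manual trim
       |     zpool status -t tank         # show per-vdev trim state
       | 
       | That only helps the drive's internal garbage collection,
       | though; it doesn't change how full the pool itself is.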
        
         | Neil44 wrote:
         | I was about to post the same: at high utilisation you will
         | quickly run out of ready-trimmed blocks, which the article
         | doesn't mention.
        
       | ddlutz wrote:
       | I wonder how well suited Postgres is to analytical queries.
       | Most people use Postgres for OLTP; maybe they are running some
       | version of it that uses a column store?
        
         | jhk727 wrote:
         | Postgres is not designed for OLAP, but you can push it a lot
         | farther than one would expect with the correct schema and
         | indexing strategy. See https://heap.io/blog/running-10-million-
         | postgresql-indexes-i... for a little more detail about how we
         | schematize for distributed OLAP queries on Postgres at scale.
        
         | __s wrote:
         | They use Citus: https://www.citusdata.com/customers/heap
         | 
         | Who recently iterated on cstore_fdw to create columnar:
         | https://www.citusdata.com/blog/2021/03/06/citus-10-columnar-...
         | 
         | But I don't think Heap's using columnar
        
           | jhk727 wrote:
           | We aren't using cstore_fdw, though we've looked into it in
           | the past. cstore tables don't support deletes or updates, and
           | we still rely on updates for some key parts of our write
           | pipeline. Additionally, we rely heavily on btree partial
           | indexes, while cstore tables only support skip indexes.
        
       | KennyBlanken wrote:
       | Since the title is clickbaity: they had issues with ZFS due to
       | too-high a blocksize and too-high a filesystem utilization, so
       | they upgraded to ZFS 2 for the Zstandard compression and saw an
       | improvement.
        
         | ziddoap wrote:
         | Is it clickbaity if they actually _did_ save millions in SSD
         | costs? What would you suggest as a non-clickbaity title? Just
         | "Upgrading Our Filesystem" leaves out the important parts (the
         | why and the results).
         | 
         | A lot of the time I agree that titles can be quite clickbaity.
         | But this one doesn't really seem to be... At least to me. The
         | company upgraded their filesystem and it saved them a bunch of
         | money. Title feels appropriate.
         | 
         | If the title were "Top ways to save millions on SSDs!" or
         | similar, I'd wholeheartedly agree.
        
           | jaclaz wrote:
           | But:
           | 
           | >Total storage usage reduced by ~21% (for our dataset, this
           | is on the order of petabytes)
           | 
            | If a 21% reduction is worth "millions", they must be
            | spending in excess of 10 million (per what?
            | week/month/year?), assuming a linear relationship between
            | storage usage and SSD costs (failed drives needing
            | replacement?).
        
         | slownews45 wrote:
         | There's a reason I click to the comments first most of the
         | time.
        
       | chrisaycock wrote:
       | TL;DR They switched compression from lz4 to Zstandard.
       | 
       | The latter compresses more tightly (and therefore requires less
       | storage and less I/O), but is slower at decompression. Results
       | show that query (read) performance did not actually change,
       | whereas write operations needed only half as much time, and
       | storage usage dropped by ~20%. So it was a win-win all around.
        
         | LolWolf wrote:
         | Thanks !
         | 
         | I love some of the articles, but in this case I definitely went
         | the "I'm happy for you or sad it happened, but I ain't about to
         | read all of that" route.
        
         | jandrese wrote:
         | Most of the savings seemed to come from freeing up enough
         | headroom in each drive to prevent block collating slowdowns on
         | write. This smells like a temporary workaround to me, as data
         | tends to grow over time.
         | 
         | It probably saved them from having to buy more storage this
         | quarter, but it is a one-time savings.
        
           | turbocon wrote:
            | To the contrary! This decreases their storage need by
            | ~20%, which will keep increasing their cost savings over
            | time. Yes, they pushed off increasing their storage
            | footprint in the short term, but in the long term they
            | also decreased the rate at which their total storage
            | grows.
        
       ___________________________________________________________________
       (page generated 2021-11-09 23:00 UTC)