[HN Gopher] From S3 to R2: An economic opportunity
       ___________________________________________________________________
        
       From S3 to R2: An economic opportunity
        
       Author : dangoldin
       Score  : 93 points
       Date   : 2023-11-02 19:15 UTC (3 hours ago)
        
 (HTM) web link (dansdatathoughts.substack.com)
 (TXT) w3m dump (dansdatathoughts.substack.com)
        
       | simonsarris wrote:
       | Cloudflare has been attacking the S3 egress problem by creating
       | Sippy: https://developers.cloudflare.com/r2/data-migration/sippy/
       | 
       | It allows you to _incrementally_ migrate off of providers like S3
       | and onto the egress-free Cloudflare R2. Very clever idea.
       | 
       | He calls R2 an undiscovered gem and IMO this is the gem's
       | undiscovered gem. (Understandable since Sippy is very new and
       | still in beta)
        
         | ravetcofx wrote:
          | What are the economics whereby Amazon and other providers
          | have egress fees and R2 doesn't? Is it acting as a loss
          | leader, or does this model still make money for Cloudflare?
        
           | NicoJuicy wrote:
           | You pay for the capacity of your network.
           | 
           | Cloudflare has huge ingress, because they need it to protect
           | sites against DDOS.
           | 
           | They basically already pay for their R2 bandwidth ( = egress)
           | because of that.
           | 
           | Additionally, with their SDN ( software defined networking)
           | they can fine-tune some of the Data-Flow/bandwidth too.
           | 
           | That's how I understood it, fyi.
           | 
            | Some more info can be found from when they started (or co-
            | founded, not sure) the Bandwidth Alliance.
           | 
           | Eg.
           | 
           | https://blog.cloudflare.com/aws-egregious-egress/
           | 
           | https://blog.cloudflare.com/bandwidth-alliance/
        
             | miselin wrote:
             | Also, for the CDN case that R2 seems to be targeting -
             | regardless of the origin of the data (R2 or S3), chances
             | are pretty good that Cloudflare is already paying for the
             | egress anyway.
        
               | NicoJuicy wrote:
               | I'm not sure about that.
               | 
               | A CDN keeps the data nearby, reducing the need to pay
               | egress to the big bandwidth providers.
               | 
               | ( not an expert though)
        
               | ilc wrote:
               | Let's say you want to use cloudflare, or another CDN. The
               | process is pretty simple.
               | 
               | You setup your website and preferably DON'T have it talk
               | to anyone other than the CDN.
               | 
               | You then point your DNS to wherever the CDN tells you to.
               | (Or let them take over DNS. Depends on the provider.)
               | 
               | The CDN then will fetch data from your site and cache it,
               | as needed.
               | 
               | Your site is the "origin", in CDN speak.
               | 
                | If Cloudflare can move the origin within their network,
                | there are huge cost savings and reliability gains
                | there. This is game-changing stuff. Do not
                | underestimate it.
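The "point your DNS to wherever the CDN tells you" step above can be sketched as a zone-file fragment (the hostnames are illustrative placeholders, not real CDN endpoints):

```
; Route site traffic through the CDN by aliasing your hostname
; to the target hostname the CDN provider hands you.
www.example.com.   300  IN  CNAME  example-com.cdn-provider.net.
; On a cache miss, the CDN then fetches from the "origin"
; (your real server) and caches the response.
```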
        
               | kkielhofner wrote:
               | It's actually worse than that.
               | 
               | In the CDN case Cloudflare has to fetch it from the
               | origin, cache (store) it anyway, and then egress it. By
               | charging for R2 they're moving that cost center to a
               | profit one.
        
              | swyx wrote:
              | somebody more knowledgeable please correct me if i'm
             | mistaken, but i think the bandwidth alliance is really the
             | lynchpin of the whole thing. basically get all the non-AWS
             | players in the room and agree on zero rating traffic
             | between each other, to provide a credible alternative to
             | AWS networks
        
           | Nextgrid wrote:
           | _Completely_ free egress is a loss leader, but the true cost
           | is so little (at least 90x less than what AWS charges) that
            | it pays for itself in the form of more Cloudflare
            | marketshare/mindshare.
        
             | WJW wrote:
             | I know from personal experience that "big" customers can
             | negotiate incredible discounts on egress bandwidth as well.
             | 90-95% discount is not impossible, only "retail" customers
             | pay the sticker price.
        
               | martinald wrote:
               | That's still a 3-10x markup though. And it's also very
               | dependent on your relationship with AWS. What happens if
               | they don't offer the discount on renewal?
        
           | candiddevmike wrote:
           | Greed on the cloud providers part, I think. You'd expect
           | egress fees to enable cheaper compute, but there are other
           | cloud providers out there like Hetzner with cheaper compute
           | and egress, so the economics don't really add up.
        
             | vidarh wrote:
             | Indeed, Hetzner is so much cheaper that if you have high S3
             | egress fees you can rent Hetzner boxes to sit in front of
             | your S3 deployment as caching proxies and get a lot of
             | extra "free" compute on top.
             | 
             | It's an option that's often been attractive if/when you
             | didn't want the hassle of building out something that could
             | provide S3 level durability yourself. But with more/cheaper
             | S3 competitors it's becoming a significantly less
             | attractive option.
        
           | kazen44 wrote:
            | also, egress fees are a sort of vendor lock-in, because
            | getting data out of the cloud is vastly more expensive than
            | putting new data into the cloud.
        
             | oaktowner wrote:
             | Exactly this. Data has gravity, and this increases the
             | gravity around data stored at Amazon...making it more
             | likely for you to buy more compute/services at Amazon.
        
             | kkielhofner wrote:
             | The big cloud providers are Hotel California - you can
             | check in but you can't check out.
             | 
             | Of course you can (like Snap) but it's a MASSIVE
             | engineering effort and initial expense.
        
           | chatmasta wrote:
           | Amazon doesn't have unit cost for egress. They charge you for
           | the stuff you put through their pipe, while paying their
           | transit providers only for the size of the pipe (or more
           | often, not paying them anything since they just peer directly
           | with them at an exchange point).
           | 
           | Amazon uses $/gb as a price gouging mechanism and also a QoS
           | constraint. Every bit you send through their pipe is
           | basically printing money for them, but they don't want to
           | give you a reserved fraction of the pipe because then other
           | people can't push their bits through that fraction. So they
           | get the most efficient utilization by charging for the stuff
           | you send through it, ripping everybody off equally.
           | 
           | Also, this way it's not cost effective to build a competitor
           | to Amazon (or any bandwidth intensive business like a CDN or
           | VPN) on top of Amazon itself. You fundamentally need to
           | charge more by adding a layer of virtualization, which means
           | "PaaS" companies built on Amazon are never a threat to AWS
           | and actually symbiotically grow the revenue of the ecosystem
           | by passing the price gouging onto their own customers.
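The gap between pipe-based transit pricing and AWS's per-GB model can be put in rough numbers. This is an illustrative sketch only: the $0.50/Mbps/month transit rate, the 30-day month, and full utilization are all assumptions, not quoted prices.

```python
# Rough comparison: what a fully utilized flat-rate transit pipe
# costs per GB delivered, versus AWS's ~$0.09/GB egress sticker
# price. All inputs here are illustrative assumptions.

def transit_cost_per_gb(mbps, dollars_per_mbps_month, utilization=1.0):
    """Effective cost per decimal GB for a pipe billed per Mbps."""
    seconds_per_month = 30 * 86400
    bytes_moved = mbps * 1e6 / 8 * seconds_per_month * utilization
    gb_moved = bytes_moved / 1e9
    return (mbps * dollars_per_mbps_month) / gb_moved

# A 10 Gbps commit at an assumed $0.50/Mbps/month, run flat out:
pipe = transit_cost_per_gb(10_000, 0.50)
aws_egress = 0.09  # top us-east-1 rate cited in the thread

print(f"pipe: ${pipe:.4f}/GB, AWS: ${aws_egress}/GB, "
      f"markup: {aws_egress / pipe:.0f}x")
```

Under these assumptions the markup lands in the 50-60x range; with cheaper transit or peering (which Amazon has), the "at least 90x" figure upthread follows easily.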
        
             | specialp wrote:
              | You don't get charged for transit if you are sending stuff
             | IN from the internet or to any other AWS resource in that
             | region. So there is no QOS constraint inside except for
             | perhaps paying for the S3 GET/SELECT/LIST costs.
             | 
             | It is pretty much exclusively to lock you into their
             | services. It heavily impacts multi-cloud and outside of AWS
             | service decisions when your data lives in AWS and is taxed
             | at 5-9 cents a GB to come out. We have settled for inferior
             | AWS solutions at times because the cost of moving things
                | out is prohibitive (e.g. AWS Backup vs. other providers)
        
               | dangoldin wrote:
               | Author here - have you tried using R2? As others
               | mentioned there's also Sippy
               | (https://developers.cloudflare.com/r2/data-
               | migration/sippy/) which makes this easy to try.
        
               | martinald wrote:
               | It also makes things like just using RDS for your managed
               | database and having compute nearby but with another
               | provider often incredibly expensive.
        
             | kkielhofner wrote:
             | AWS egress charges blatantly take advantage of people who
             | have never bought transit or done peering.
             | 
             | To them "that's just what bandwidth costs" but anyone who's
             | worked with this stuff (sounds like you and I both) can do
             | the quick math and see what kind of money printing machine
             | this scheme is.
        
             | pests wrote:
             | Honest question, how is this different than a toll road? An
             | entity creates a road network with a certain size (lanes,
              | capacity/hour, literal traffic) and pays for it by
              | charging individual cars that pass through it.
        
           | dotnet00 wrote:
           | There has to be more to it than a pure loss leader, since
           | there's also the Bandwidth Alliance Cloudflare is in, which
           | allows R2 competitors like Backblaze B2 to also offer free
           | egress, which benefits those competitors while weakening the
           | incentive for R2 somewhat.
        
           | jmarbach wrote:
           | Cloudflare wrote a blog post about their bandwidth egress
           | charges in different parts of the world:
           | https://blog.cloudflare.com/the-relative-cost-of-
           | bandwidth-a...
           | 
           | The original post also includes a link to a more recent
           | Cloudflare blog post on AWS bandwidth charges:
           | https://blog.cloudflare.com/aws-egregious-egress/
        
         | dangoldin wrote:
          | Author here - really cool link to Sippy. I love the idea,
          | since you're migrating data as needed, so the cost you incur
          | is a function of the workload. It's basically acting as a
          | caching layer.
        
         | paulddraper wrote:
         | Clever
        
       | nik736 wrote:
       | S3 and R2 aside, OVHs object storage offering is really robust
       | and great. It performs better than S3 and is way cheaper, in
       | storage and egress cost.
        
         | drewnick wrote:
         | Agree. We've used it for two years with solid performance and
         | reliability.
        
         | 9dev wrote:
         | You might even say their offering is... on fire
        
       | threatofrain wrote:
       | Cloudflare has been building a micro-AWS/Vercel competitor and I
       | love it; i.e., serverless functions, queues, sqlite, kv store,
       | object store (R2), etc.
        
         | davidjfelix wrote:
         | FWIW, Vercel is at least partially backed by cloudflare
         | services under the hood.
        
           | zwily wrote:
           | Right - Vercel's edge functions are just cloudflare workers
           | with a massive markup.
        
         | camkego wrote:
          | I would love to see a good blog post or article on
          | Cloudflare's KV store. I just checked it out, and it reports
         | consistency, so it sounds like it might be based upon CRDTs,
         | but I'm just guessing.
        
         | chatmasta wrote:
         | Vercel doesn't offer any of that, without major caveats (e.g.
         | must use Next.js to get a serverless endpoint). And to the
         | degree they do offer any of it, it's mostly built on
         | infrastructure of other companies, including Cloudflare.
        
       | paulgb wrote:
       | Since I know there will be Cloudflare people reading this (hi!),
        | I'm begging you: please wrest control of the blob storage API
       | standard from AWS.
       | 
       | AWS has zero interest in S3's API being a universal standard for
       | blob storage and you can tell from its design. What happens in
        | practice is that everybody (including R2) implements some subset
        | of the S3 API, so everyone ends up with a jagged API surface
        | where developers can use a standard API library but then have to
        | refer to each S3-compatible vendor's docs to figure out whether
        | the subset of the S3 API they need is actually supported by a
        | given vendor.
       | 
       | This makes it harder than it needs to be to make vendor-agnostic
       | open source projects that are backed by blob storage, which would
       | otherwise be an excellent lowest-common-denominator storage
       | option.
       | 
       | Blob storage is the most underused cloud tech IMHO largely
       | because of the lack of a standard blob storage API. Cloudflare is
       | in the rare position where you have a fantastic S3 alternative
       | that people love, and you would be doing the industry a huge
       | service by standardizing the API.
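One symptom of the jagged surface described above: even connecting a generic S3 client means hand-assembling per-vendor details. A minimal sketch of that bookkeeping (the endpoint URL patterns for R2 and B2 are assumptions based on their public docs; `ACCOUNT_ID` and region values are placeholders):

```python
# Build connection kwargs for a generic S3-style client (e.g. boto3),
# per vendor. Endpoint patterns here are illustrative assumptions --
# checking each vendor's docs is exactly the per-vendor lookup the
# comment above is complaining about.

def s3_client_kwargs(vendor, account_id=None, region=None):
    if vendor == "aws":
        return {"service_name": "s3", "region_name": region or "us-east-1"}
    if vendor == "r2":
        return {
            "service_name": "s3",
            "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
            "region_name": "auto",
        }
    if vendor == "b2":
        return {
            "service_name": "s3",
            "endpoint_url": f"https://s3.{region}.backblazeb2.com",
        }
    raise ValueError(f"unknown vendor: {vendor}")

print(s3_client_kwargs("r2", account_id="ACCOUNT_ID")["endpoint_url"])
```

A true standard would make this lookup table (and the per-vendor feature matrix behind it) unnecessary.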
        
         | londons_explore wrote:
         | I think the subtle API differences reflect bigger and deeper
         | implementation differences...
         | 
         | For example, "Can one append to an existing blob/resume an
         | upload?" leads to lots of questions about data immutability,
         | cacheability of blobs, etc.
         | 
         | "What happens if two things are uploaded with the same name at
         | the same time" leads into data models, mastership/eventual
         | consistency, etc.
         | 
         | Basically, these 'little' differences are in fact huge
         | differences on the inside, and fixing them probably involves a
         | total redesign.
        
           | paulgb wrote:
           | This is a good point, but just a standard for the standard
           | create/read/update (replace)/delete operations combined with
           | some baseline guarantees (like approximately-last-write-wins
           | eventual consistency) would probably cover a whole lot of
           | applications that currently use S3 (which doesn't support
           | appends anyway).
           | 
           | Heck, HTTP already provides verbs that would cover this, it
           | would just require a vendor to carve out a subset of HTTP
           | that a standard-compliant server would support, plus
           | standardize an auth/signing mechanism.
        
       | maxclark wrote:
       | R2 and Sippy solve a specific pipeline issue: Storage -> CDN ->
       | Eyeball
       | 
        | The real issue is how that data gets into S3 in the first place
       | and what else you need to do with it.
       | 
       | S3 and DynamoDB are the real moats for AWS.
        
       | tehlike wrote:
        | If you are storing large amounts of data: E2 is the cheapest
        | ($20/TB/year, 3x egress for free)
        | 
        | If you have lots of egress: R2 is the cheapest ($15/TB/month,
        | free egress)
       | 
       | R2 can get somewhat expensive if you have lots of mutations,
       | which is not a typical use case for most.
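Taking the figures above at face value, the crossover between the two is easy to sketch. Note the per-GB overage egress rate for the cheaper provider ($0.01/GB) is a placeholder assumption, since no rate is quoted in the comment:

```python
# Monthly cost sketch for 1 TB stored, varying egress volume.
# Storage rates come from the comment above; the $0.01/GB rate for
# egress beyond the 3x-stored free allowance is a placeholder.

def cheap_store_cost(tb_stored, tb_egress, overage_per_gb=0.01):
    storage = tb_stored * 20 / 12          # $20/TB/year
    free_egress_tb = 3 * tb_stored         # 3x stored is free
    overage = max(0.0, tb_egress - free_egress_tb) * 1000 * overage_per_gb
    return storage + overage

def r2_cost(tb_stored, tb_egress):
    return tb_stored * 15                  # $15/TB/month, egress free

for egress_tb in (1, 10, 100):
    print(egress_tb, round(cheap_store_cost(1, egress_tb), 2),
          r2_cost(1, egress_tb))
```

Storage-heavy workloads favor the cheap-storage provider; once egress is a large multiple of data stored, R2's flat rate wins.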
        
         | vladvasiliu wrote:
         | What's E2? Top google result for "e2 blob storage" is azure,
         | but that can't be it since the pricing table comes at around
         | $18/TB/month.
        
           | maccam912 wrote:
            | I imagine it was a typo for Backblaze B2? The callout that
           | egress is free for the first 3x of what you have stored
           | matches up.
        
             | wfleming wrote:
             | That's what I thought they meant as well, but B2 is more
             | like $72/TB/yr. Maybe relevant to another story on the
             | front page right now, they have a very unusual custom
             | keyboard layout that makes it easy to typo e for b and 2
             | for 7 ;)?
        
           | natrys wrote:
           | Seems to be this one: https://www.idrive.com/object-
           | storage-e2/
        
           | leiferik wrote:
           | I think Backblaze B2 is probably the reference (which has
           | free egress up to 3x data stored -
           | https://www.backblaze.com/blog/2023-product-announcement/). I
            | don't know of any public S3-compatible provider that is as
            | cheap as $20/TB/year (roughly ~$0.0016/GB/mo).
        
       | arghwhat wrote:
       | I wish the R2 access control was similar to S3 - able to issue
       | keys with specific accesses to particular prefixes, and ability
       | to delegate ability to create keys.
       | 
       | It currently feels a little limited and... bolted on to the
       | Cloudflare UI.
        
         | andrewstuart wrote:
         | I think the idea is to use Cloudflare Workers to add more
         | sophisticated functionality.
        
           | slig wrote:
           | But then you start paying for Worker's bandwidth, correct?
        
       | meowface wrote:
       | Is there any reason to _not_ use R2 over a competing storage
        | service? I already use Cloudflare for lots of other things, and
        | don't personally care all that much about the "Cloudflare's
       | near-monopoly as a web intermediary is dangerous" arguments or
       | anything like that.
        
         | Hasz wrote:
          | As far as I know, R2 offers no storage tiers. Most of my S3
          | usage is archival and sits in Glacier. From Cloudflare's
         | pricing page, S3 is substantially cheaper for that type of
         | workload.
        
         | gurchik wrote:
         | 1. This is the most obvious one, but S3 access control is done
         | via IAM. For better or for worse, IAM has a lot of
         | functionality. I can configure a specific EC2 instance to have
         | access to a specific file in S3 without the need to deal with
         | API keys and such. I can search CloudTrail for all the times a
         | specific user read a certain file.
         | 
         | 2. R2 doesn't support file versioning like S3. As I understand
         | it, Wasabi supports it.
         | 
          | 3. R2's storage pricing is designed for frequently accessed
          | files. They charge a flat $0.015 per GB-month stored. This is
          | a lot cheaper than S3 Standard pricing ($0.023 per GB-month),
          | but more expensive than Glacier and marginally more
          | expensive than S3 Standard - Infrequent Access. Wasabi is even
         | cheaper at $0.0068 per GB-month but with a 1 TB billing
         | minimum.
         | 
         | 4. If you want public access to the files in your S3 bucket
         | using your own domain name, you can create a CNAME record with
         | whatever DNS provider you use. With R2 you cannot use a custom
         | domain unless the domain is set up in Cloudflare. I had to
         | register a new domain name for this purpose since I could not
         | switch DNS providers for something like this.
         | 
         | 5. If you care about the geographical region your data is
         | stored in, AWS has way more options. At a previous job I needed
         | to control the specific US state my data was in, which is easy
         | to do in AWS if there is an AWS Region there. In contrast R2
         | and Wasabi both have few options. R2 has a "Jurisdictional
         | Restriction" feature in Beta right now to restrict data to a
         | specific legal jurisdiction, but they only support EU right
         | now. Not helpful if you need your data to be stored in Brazil
         | or something.
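Point 3's rates are easy to compare at scale (rates as quoted in the comment above; 10 TB is an arbitrary example size):

```python
# Monthly storage bill for 10 TB at the per-GB-month rates quoted
# in the comment above.
rates = {
    "S3 Standard": 0.023,
    "R2": 0.015,
    "Wasabi": 0.0068,   # subject to a 1 TB billing minimum
}
gb = 10_000  # 10 TB (decimal)
for name, per_gb in rates.items():
    print(f"{name}: ${per_gb * gb:,.0f}/month")
```

So for hot data R2 undercuts S3 Standard by about a third, while Wasabi is cheaper still, before factoring in egress or request pricing.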
        
         | paulddraper wrote:
         | If you already use Cloudflare for lots of other things, no.
         | 
         | If you already use AWS for lots of other things, yes.
        
       | benjaminwootton wrote:
       | The other hidden cost when you are working with data hosted on S3
        | is the LIST requests. Some data tools seem very chatty with S3,
        | and you end up with thousands of requests when you have small
        | files buried in folders, at a not insignificant cost. I need to
        | dig into it more, but they are always up there towards the top
        | of my AWS bills.
        
       | thedaly wrote:
       | > In fact, there's an opportunity to build entire companies that
       | take advantage of this price differential and I expect we'll see
       | more and more of that happening.
       | 
       | Interesting. What sort of companies can take advantage of this?
        
         | diamondap wrote:
         | Basically any company offering special services that work with
         | very large data sets. That could be a consumer backup system
         | like Carbonite or a bulk photo processing service. In either
         | case, legal agreements with customers are key, because you
         | ultimately don't control the storage system on which your
         | business and their data depend.
         | 
         | I work for a non-profit doing digital preservation for a number
         | of universities in the US. We store huge amounts of data in S3,
         | Glacier and Wasabi, and provide services and workflows to help
         | depositors comply with legal requirements, access controls,
         | provable data integrity, archival best practices, etc.
         | 
         | There are some for-profits in this space as well. It's not a
         | huge or highly profitable space, but I do think there are other
         | business opportunities out there where organizations want to
         | store geographically distributed copies of their data (for
         | safety) and run that data through processing pipelines.
         | 
         | The trick, of course, is to identify which organizations have a
         | similar set of needs and then build that. In our case, we've
         | spent a lot of time working around data access costs, and there
         | are some cases where we just can't avoid them. They can really
         | be considerable when you're working with large data sets, and
         | if you can solve the problem of data transfer costs from the
         | get-go, you'll be way ahead of many existing services built on
         | S3 and Glacier.
        
         | dangoldin wrote:
          | Author here but some ideas I was thinking about:
          | 
          | - An open source data pipeline built on top of R2. A way of
          | keeping data on R2/S3 but then having execution handled in
          | Workers/Lambda. Inspired by what https://www.boilingdata.com/
          | and https://www.bauplanlabs.com/ are doing.
          | 
          | - Related to the above: taking data that's stored in the
          | various big data formats (Parquet, Iceberg, Hudi, etc),
          | generating many more combinations of the datasets, and
          | choosing optimal ones based on the workload. You can do this
          | with existing providers but I think the cost element just
          | makes this easier to stomach.
          | 
          | - Abstracting some of the AI/ML products out there and
          | choosing the best one for the job by keeping the data on R2
          | and then shipping it to the relevant providers (since data
          | ingress to them is free) for specific tasks.
        
         | gen220 wrote:
         | I'm building a "media hosting site". Based on somewhat
         | reasonable forecasts of egress demand vs total volume stored,
         | using R2 means I'll be able to charge a low take rate that
         | should (in theory) give me a good counterposition to
         | competitors in the space.
         | 
         | Basically, using R2 allows you to undercut competitors'
         | pricing. It also means I don't need to build out a separate CDN
         | to host my files, because Cloudflare will do that for me, too.
         | 
          | Competitors built out and maintain their own equivalent CDNs
          | and storage solutions that are ~10x more expensive to
          | maintain and operate than going through Cloudflare. Basically,
         | Cloudflare is doing to CDNs and storage what AWS and friends
         | did to compute.
        
       | xrd wrote:
       | I just love minio. It is a drop-in replacement for S3. I have
       | never done a price comparison for TOC to S3 or R2, but I have a
       | good backup story and run it all inside docker/dokku so it is
       | easy to recover.
        
       | hipadev23 wrote:
        | OP is missing that a correct implementation of Databricks or
        | Snowflake will have those instances running inside the same
        | AWS region as the data. That's not to say R2 isn't an amazing
        | product, but the egregious costs don't apply since egress is
        | $0 on both sides.
        
         | dangoldin wrote:
          | Author here. It is true that transfers within a region are
          | free, and if you design your system appropriately you can
          | take advantage of that. But I've seen accidental cases where
          | someone tries to access data from another region, and it's
          | nice to not even have to worry about it. Even that can be
          | handled with better tooling/processes, but the bigger point
          | is wanting your data to be available across clouds to take
          | advantage of their different capabilities. I used AI as an
          | example, but imagine you have all your data in S3 and want
          | to use Azure due to the OpenAI partnership. It's that use
          | case that's enabled by R2.
        
           | hipadev23 wrote:
           | Yeah, for greenfield work building up on R2 is generally a
           | far better deal than S3, but if you have a massive amount of
           | data already on S3, especially if it's small files, you're
           | going to pay a massive penalty to move the data. Sippy is
           | nice but it just spreads the pain over time.
        
         | cmgriffing wrote:
         | I could be mistaken, but I believe AWS would still charge for
         | one direction of an S3 to Databricks/Snowflake
         | instance/cluster.
        
           | hipadev23 wrote:
           | AWS S3 Egress charges are $0.00 when the destination is AWS
           | within the same region. When you setup your Databricks or
           | Snowflake accounts, you need to correctly specify the same
           | region as your S3 bucket(s) otherwise you'll pay egress.
        
       | drexlspivey wrote:
       | If I understand correctly when storing data to vanilla S3 (not
       | their edge offering) the data live in a single zone/datacenter
       | right? While on R2 they could potentially be replicated in tens
       | of locations. If that is true how can Cloudflare afford the
       | storage cost with basically the same pricing?
        
       | leiferik wrote:
       | As an indie dev, I recommend R2 highly. No egress is the killer
       | feature. I started using R2 earlier this year for my AI
       | transcription service TurboScribe (https://turboscribe.ai/).
       | Users upload audio/video files directly to R2 buckets (sometimes
       | many large, multi-GB files), which are then transferred to a
       | compute provider for transcription. No vendor lock-in for my
       | compute (ingress is free/cheap pretty much everywhere) and I can
       | easily move workloads across multiple providers. Users can even
       | re-download their (again, potentially large) files with a simple
       | signed R2 URL (again, no egress fees).
       | 
       | I'm also a Backblaze B2 customer, which I also highly recommend
       | and has slightly different trade-offs (R2 is slightly faster in
        | my experience, but B2 is 2-3x cheaper for storage, so I use it
        | mostly for backups and other files that I'm likely to store for
        | a long time).
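The signed-URL flow described above can be illustrated in miniature. This is a generic HMAC-signed link, not the actual AWS SigV4 presigning that R2's S3-compatible API uses; the secret, hostname, and path are all placeholders:

```python
# Minimal illustration of a signed download URL: the server signs
# (path, expiry) with a secret; the storage edge recomputes the same
# signature before serving. This is NOT AWS SigV4 -- just the shape
# of the idea, using only the standard library.
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # placeholder

def sign_url(base, path, ttl_seconds, now=None):
    expires = int(now if now is not None else time.time()) + ttl_seconds
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{base}{path}?" + urlencode({"expires": expires, "sig": sig})

def verify(path, expires, sig, now=None):
    now = now if now is not None else time.time()
    if now > int(expires):
        return False  # link has expired
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

print(sign_url("https://files.example.com", "/uploads/audio.mp3",
               3600, now=0))
```

The attraction for the workflow above is that once the link is signed, the (potentially multi-GB) bytes flow straight between the user and the storage provider, and with R2 that transfer carries no egress fee.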
        
       | jokethrowaway wrote:
       | It blows my mind that anyone would consider S3 cheap.
       | 
        | Plenty of space was always available on dedicated servers for
        | way cheaper, even before the cloud.
       | 
       | You could make an argument about the API being nicer than dealing
       | with a linux server - but is AWS nice? I think it's pretty awful
        | and requires tons of (different, specific, non-transferable)
       | knowledge.
       | 
       | Hype, scalability buzzwords thrown around by startups with 1000
       | users and 1M contract with AWS.
       | 
       | Sure R2 is cheaper but it's still not a low cost option. You are
       | paying for a nice shiny service.
        
         | gen220 wrote:
         | I think it all depends on the volume of data you're storing,
         | access requirements, and how much value you plan to generate
         | per GB.
         | 
         | It's certainly quite cheap for a set of typical "requirements"
         | for media hosting companies.
         | 
         | But yeah, if you're storing data for mainly archival purposes,
         | you shouldn't be paying for R2 or S3.
        
       | sgammon wrote:
       | We absolutely love R2, especially when paired with Workers.
        
       | johnklos wrote:
       | Should we simply ignore the tremendous amount of phishing hosted
       | using r2.dev? Or is this also part of "an economic opportunity"?
       | 
       | Cloudflare may well be on their way to becoming a monopoly, but
       | they certainly show they don't care about abuse. Even if it
       | weren't a simple matter of principle, in case they aren't
       | successful in forcing themselves down everyone's throats, I
       | wouldn't want to host anything on any service that hosts phishers
       | and scammers without even a modicum of concern.
        
       | andrewstuart wrote:
       | >> you're paying anywhere from $0.05/GB to $0.09/GB for data
       | transfer in us-east-1. At big data scale this adds up.
       | 
       | At small data scale this adds up.
       | 
       | And..... it's 11 cents a GB from Australia and 15 cents a GB from
       | Brazil.
       | 
        | If you have S3 facing the Internet, a hacker can bankrupt your
        | company in minutes with a simple load-testing application. Not
        | even a hacker - a bug in a web page could do the same thing.
        
         | paulddraper wrote:
         | 200 TB in minutes is impressive.
         | 
         | (Assuming your company can be bankrupted for ~$20k.)
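The back-of-envelope behind that skepticism, using the $0.09/GB us-east-1 rate quoted upthread:

```python
# How much egress a $20k bill buys at AWS's top us-east-1 tier, and
# what sustained rate moving it "in minutes" would actually require.
price_per_gb = 0.09
budget = 20_000
gb = budget / price_per_gb
print(f"{gb / 1000:.0f} TB")        # ~222 TB

# Pushing that volume in 10 minutes implies an absurd sustained rate:
gbps = gb * 8 / (10 * 60)           # gigabits per second
print(f"~{gbps:,.0f} Gbps sustained")
```

Roughly 3 Tbps sustained for ten minutes, so "minutes" is hyperbole, but the broader point stands: unmetered public egress is an unbounded liability over hours or days.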
        
       ___________________________________________________________________
       (page generated 2023-11-02 23:00 UTC)