[HN Gopher] From S3 to R2: An economic opportunity ___________________________________________________________________ From S3 to R2: An economic opportunity Author : dangoldin Score : 93 points Date : 2023-11-02 19:15 UTC (3 hours ago) (HTM) web link (dansdatathoughts.substack.com) (TXT) w3m dump (dansdatathoughts.substack.com) | simonsarris wrote: | Cloudflare has been attacking the S3 egress problem by creating | Sippy: https://developers.cloudflare.com/r2/data-migration/sippy/ | | It allows you to _incrementally_ migrate off of providers like S3 | and onto the egress-free Cloudflare R2. Very clever idea. | | He calls R2 an undiscovered gem and IMO this is the gem's | undiscovered gem. (Understandable since Sippy is very new and | still in beta) | ravetcofx wrote: | What are the economics that Amazon and other providers have | egress fees and R2 doesn't? Is it acting as a loss leader or | does this model still make money for CloudFlare? | NicoJuicy wrote: | You pay for the capacity of your network. | | Cloudflare has huge ingress, because they need it to protect | sites against DDOS. | | They basically already pay for their R2 bandwidth ( = egress) | because of that. | | Additionally, with their SDN ( software defined networking) | they can fine-tune some of the Data-Flow/bandwidth too. | | That's how I understood it, fyi. | | Some more info could be found when they started ( or co- | founded, not sure) the bandwidth alliance. | | Eg. | | https://blog.cloudflare.com/aws-egregious-egress/ | | https://blog.cloudflare.com/bandwidth-alliance/ | miselin wrote: | Also, for the CDN case that R2 seems to be targeting - | regardless of the origin of the data (R2 or S3), chances | are pretty good that Cloudflare is already paying for the | egress anyway. | NicoJuicy wrote: | I'm not sure about that. | | A CDN keeps the data nearby, reducing the need to pay | egress to the big bandwidth providers. 
| | ( not an expert though) | ilc wrote: | Let's say you want to use cloudflare, or another CDN. The | process is pretty simple. | | You set up your website and preferably DON'T have it talk | to anyone other than the CDN. | | You then point your DNS to wherever the CDN tells you to. | (Or let them take over DNS. Depends on the provider.) | | The CDN then will fetch data from your site and cache it, | as needed. | | Your site is the "origin", in CDN speak. | | If Cloudflare can move the origin within their network, | there are huge cost savings and reliability increases | there. This is game changing stuff. Do not underestimate | it. | kkielhofner wrote: | It's actually worse than that. | | In the CDN case Cloudflare has to fetch it from the | origin, cache (store) it anyway, and then egress it. By | charging for R2 they're moving that cost center to a | profit one. | swyx wrote: | somebody more knowledgeable please correct me if i'm | mistaken, but i think the bandwidth alliance is really the | linchpin of the whole thing. basically get all the non-AWS | players in the room and agree on zero rating traffic | between each other, to provide a credible alternative to | AWS networks | Nextgrid wrote: | _Completely_ free egress is a loss leader, but the true cost | is so little (at least 90x less than what AWS charges) that | it pays for itself in the form of more CloudFlare marketshare | /mindshare. | WJW wrote: | I know from personal experience that "big" customers can | negotiate incredible discounts on egress bandwidth as well. | 90-95% discount is not impossible, only "retail" customers | pay the sticker price. | martinald wrote: | That's still a 3-10x markup though. And it's also very | dependent on your relationship with AWS. What happens if | they don't offer the discount on renewal? | candiddevmike wrote: | Greed on the cloud providers' part, I think.
You'd expect | egress fees to enable cheaper compute, but there are other | cloud providers out there like Hetzner with cheaper compute | and egress, so the economics don't really add up. | vidarh wrote: | Indeed, Hetzner is so much cheaper that if you have high S3 | egress fees you can rent Hetzner boxes to sit in front of | your S3 deployment as caching proxies and get a lot of | extra "free" compute on top. | | It's an option that's often been attractive if/when you | didn't want the hassle of building out something that could | provide S3 level durability yourself. But with more/cheaper | S3 competitors it's becoming a significantly less | attractive option. | kazen44 wrote: | also, egress fees are a sort of vendor lock-in, because | getting data out of the cloud is vastly more expensive than | putting new data into the cloud. | oaktowner wrote: | Exactly this. Data has gravity, and this increases the | gravity around data stored at Amazon...making it more | likely for you to buy more compute/services at Amazon. | kkielhofner wrote: | The big cloud providers are Hotel California - you can | check in but you can't check out. | | Of course you can (like Snap) but it's a MASSIVE | engineering effort and initial expense. | chatmasta wrote: | Amazon doesn't have a unit cost for egress. They charge you for | the stuff you put through their pipe, while paying their | transit providers only for the size of the pipe (or more | often, not paying them anything since they just peer directly | with them at an exchange point). | | Amazon uses $/GB as a price gouging mechanism and also a QoS | constraint. Every bit you send through their pipe is | basically printing money for them, but they don't want to | give you a reserved fraction of the pipe because then other | people can't push their bits through that fraction. So they | get the most efficient utilization by charging for the stuff | you send through it, ripping everybody off equally.
| | Also, this way it's not cost effective to build a competitor | to Amazon (or any bandwidth intensive business like a CDN or | VPN) on top of Amazon itself. You fundamentally need to | charge more by adding a layer of virtualization, which means | "PaaS" companies built on Amazon are never a threat to AWS | and actually symbiotically grow the revenue of the ecosystem | by passing the price gouging onto their own customers. | specialp wrote: | You don't get charged for transit if you are sending stuff | IN from the internet or to any other AWS resource in that | region. So there is no QoS constraint inside except for | perhaps paying for the S3 GET/SELECT/LIST costs. | | It is pretty much exclusively to lock you into their | services. It heavily impacts multi-cloud and outside-of-AWS | service decisions when your data lives in AWS and is taxed | at 5-9 cents a GB to come out. We have settled for inferior | AWS solutions at times because the cost of moving things | out is prohibitive (i.e. AWS Backup vs other providers) | dangoldin wrote: | Author here - have you tried using R2? As others | mentioned there's also Sippy | (https://developers.cloudflare.com/r2/data-migration/sippy/) | which makes this easy to try. | martinald wrote: | It also makes things like just using RDS for your managed | database and having compute nearby but with another | provider often incredibly expensive. | kkielhofner wrote: | AWS egress charges blatantly take advantage of people who | have never bought transit or done peering. | | To them "that's just what bandwidth costs" but anyone who's | worked with this stuff (sounds like you and I both) can do | the quick math and see what kind of money printing machine | this scheme is. | pests wrote: | Honest question, how is this different from a toll road? An | entity creates a road network with a certain size (lanes, | capacity/hour, literal traffic) and pays for it by charging | individual cars put through the road.
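The "quick math" mentioned upthread is easy to sketch. The numbers below are the ones quoted in this thread (a worst-case us-east-1 sticker price of $0.09/GB, and a wholesale transit cost "at least 90x less"); they are illustrative estimates, not published rates, and real contracts vary widely.

```python
# Back-of-envelope comparison of retail egress billing vs. an
# estimated wholesale transit cost, using figures from the thread.

def egress_cost(gb: float, price_per_gb: float) -> float:
    """Cost of moving `gb` gigabytes out at a flat $/GB rate."""
    return gb * price_per_gb

AWS_RETAIL = 0.09            # $/GB, the upper us-east-1 tier quoted above
WHOLESALE = AWS_RETAIL / 90  # the thread's "at least 90x less" estimate

tb_moved = 100
gb_moved = tb_moved * 1000

retail = egress_cost(gb_moved, AWS_RETAIL)
wholesale = egress_cost(gb_moved, WHOLESALE)

print(f"Moving {tb_moved} TB: ${retail:,.0f} retail vs ~${wholesale:,.0f} at transit rates")
```

At these assumed rates, 100 TB of egress bills at $9,000 retail against roughly $100 of underlying transit, which is the gap the commenters above are describing.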
| dotnet00 wrote: | There has to be more to it than a pure loss leader, since | there's also the Bandwidth Alliance Cloudflare is in, which | allows R2 competitors like Backblaze B2 to also offer free | egress, which benefits those competitors while weakening the | incentive for R2 somewhat. | jmarbach wrote: | Cloudflare wrote a blog post about their bandwidth egress | charges in different parts of the world: | https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a... | | The original post also includes a link to a more recent | Cloudflare blog post on AWS bandwidth charges: | https://blog.cloudflare.com/aws-egregious-egress/ | dangoldin wrote: | Author here and really cool link to Sippy. I love the idea here | since you're really migrating data as needed so the cost you | incur is really a function of the workload. It's basically | acting as a caching layer. | paulddraper wrote: | Clever | nik736 wrote: | S3 and R2 aside, OVH's object storage offering is really robust | and great. It performs better than S3 and is way cheaper, in | storage and egress cost. | drewnick wrote: | Agree. We've used it for two years with solid performance and | reliability. | 9dev wrote: | You might even say their offering is... on fire | threatofrain wrote: | Cloudflare has been building a micro-AWS/Vercel competitor and I | love it; i.e., serverless functions, queues, sqlite, kv store, | object store (R2), etc. | davidjfelix wrote: | FWIW, Vercel is at least partially backed by cloudflare | services under the hood. | zwily wrote: | Right - Vercel's edge functions are just cloudflare workers | with a massive markup. | camkego wrote: | I would love to see a good blog post or article on Cloudflare's | KV store. I just checked it out, and it reports eventual | consistency, so it sounds like it might be based upon CRDTs, | but I'm just guessing. | chatmasta wrote: | Vercel doesn't offer any of that, without major caveats (e.g. | must use Next.js to get a serverless endpoint).
And to the | degree they do offer any of it, it's mostly built on | infrastructure of other companies, including Cloudflare. | paulgb wrote: | Since I know there will be Cloudflare people reading this (hi!), | I'm begging you: please wrest control of the blob storage API | standard from AWS. | | AWS has zero interest in S3's API being a universal standard for | blob storage and you can tell from its design. What happens in | practice is that everybody (including R2) implements some subset | of the S3 API, so everyone ends up with a jagged API surface | where developers can use a standard API library but then have to | refer to docs of each S3-compatible vendor to figure out | whether the subset of the S3 API they need will be compatible with | different vendors. | | This makes it harder than it needs to be to make vendor-agnostic | open source projects that are backed by blob storage, which would | otherwise be an excellent lowest-common-denominator storage | option. | | Blob storage is the most underused cloud tech IMHO largely | because of the lack of a standard blob storage API. Cloudflare is | in the rare position where you have a fantastic S3 alternative | that people love, and you would be doing the industry a huge | service by standardizing the API. | londons_explore wrote: | I think the subtle API differences reflect bigger and deeper | implementation differences... | | For example, "Can one append to an existing blob/resume an | upload?" leads to lots of questions about data immutability, | cacheability of blobs, etc. | | "What happens if two things are uploaded with the same name at | the same time" leads into data models, mastership/eventual | consistency, etc. | | Basically, these 'little' differences are in fact huge | differences on the inside, and fixing them probably involves a | total redesign.
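The "lowest common denominator" API paulgb is asking for can be sketched in a few lines. The interface below is hypothetical (there is no such standard today, which is the complaint); a real S3- or R2-backed implementation would speak SigV4 to the vendor's endpoint, while callers code only against the small portable surface.

```python
# Sketch of a minimal vendor-agnostic blob interface: CRUD over
# named blobs with last-write-wins semantics. Hypothetical, not a
# real standard. MemoryBlobStore is a reference implementation so
# the semantics are concrete and testable.
from typing import Optional, Protocol


class BlobStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...
    def delete(self, key: str) -> None: ...
    def list(self, prefix: str = "") -> list[str]: ...


class MemoryBlobStore:
    """In-memory reference implementation (last write wins)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data  # unconditional overwrite: last write wins

    def get(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)

    def delete(self, key: str) -> None:
        self._blobs.pop(key, None)  # idempotent, like S3 DeleteObject

    def list(self, prefix: str = "") -> list[str]:
        return sorted(k for k in self._blobs if k.startswith(prefix))
```

Anything beyond this (appends, versioning, conditional writes) is exactly where, as londons_explore notes above, the vendors' internals diverge, so a standard would likely have to stop at roughly this surface.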
| paulgb wrote: | This is a good point, but just a standard for the standard | create/read/update (replace)/delete operations combined with | some baseline guarantees (like approximately-last-write-wins | eventual consistency) would probably cover a whole lot of | applications that currently use S3 (which doesn't support | appends anyway). | | Heck, HTTP already provides verbs that would cover this, it | would just require a vendor to carve out a subset of HTTP | that a standard-compliant server would support, plus | standardize an auth/signing mechanism. | maxclark wrote: | R2 and Sippy solve a specific pipeline issue: Storage -> CDN -> | Eyeball | | The real issue is how that data gets into S3 in the first place | and what else you need to do with it. | | S3 and DynamoDB are the real moats for AWS. | tehlike wrote: | If you are storing a large amount of data: E2 is the cheapest | ($20/TB/year, 3x egress for free) | | If you have lots of egress: R2 is the cheapest | ($15/TB/month, free egress) | | R2 can get somewhat expensive if you have lots of mutations, | which is not a typical use case for most. | vladvasiliu wrote: | What's E2? Top google result for "e2 blob storage" is azure, | but that can't be it since the pricing table comes at around | $18/TB/month. | maccam912 wrote: | I imagine it was a typo for backblaze B2? The call out that | egress is free for the first 3x of what you have stored | matches up. | wfleming wrote: | That's what I thought they meant as well, but B2 is more | like $72/TB/yr. Maybe relevant to another story on the | front page right now, they have a very unusual custom | keyboard layout that makes it easy to typo e for b and 2 | for 7 ;)? | natrys wrote: | Seems to be this one: https://www.idrive.com/object-storage-e2/ | leiferik wrote: | I think Backblaze B2 is probably the reference (which has | free egress up to 3x data stored - | https://www.backblaze.com/blog/2023-product-announcement/).
I | don't know of any public S3-compatible provider that is as | cheap as $20/TB/year (roughly ~$0.0016/GB/mo). | arghwhat wrote: | I wish the R2 access control was similar to S3 - able to issue | keys with specific access to particular prefixes, and the | ability to delegate key creation. | | It currently feels a little limited and... bolted on to the | Cloudflare UI. | andrewstuart wrote: | I think the idea is to use Cloudflare Workers to add more | sophisticated functionality. | slig wrote: | But then you start paying for Workers' bandwidth, correct? | meowface wrote: | Is there any reason to _not_ use R2 over a competing storage | service? I already use Cloudflare for lots of other things, and | don't personally care all that much about the "Cloudflare's | near-monopoly as a web intermediary is dangerous" arguments or | anything like that. | Hasz wrote: | As far as I know, R2 offers no storage tiers. Most of my S3 | usage is archival and sits in Glacier. From Cloudflare's | pricing page, S3 is substantially cheaper for that type of | workload. | gurchik wrote: | 1. This is the most obvious one, but S3 access control is done | via IAM. For better or for worse, IAM has a lot of | functionality. I can configure a specific EC2 instance to have | access to a specific file in S3 without the need to deal with | API keys and such. I can search CloudTrail for all the times a | specific user read a certain file. | | 2. R2 doesn't support file versioning like S3. As I understand | it, Wasabi supports it. | | 3. R2's storage pricing is designed for frequently accessed | files. They charge a flat $0.015 per GB-month stored. This is a | lot cheaper than S3 Standard pricing ($0.023 per GB-month), | but more expensive than Glacier and marginally more | expensive than S3 Standard - Infrequent Access. Wasabi is even | cheaper at $0.0068 per GB-month but with a 1 TB billing | minimum. | | 4.
If you want public access to the files in your S3 bucket | using your own domain name, you can create a CNAME record with | whatever DNS provider you use. With R2 you cannot use a custom | domain unless the domain is set up in Cloudflare. I had to | register a new domain name for this purpose since I could not | switch DNS providers for something like this. | | 5. If you care about the geographical region your data is | stored in, AWS has way more options. At a previous job I needed | to control the specific US state my data was in, which is easy | to do in AWS if there is an AWS Region there. In contrast R2 | and Wasabi both have few options. R2 has a "Jurisdictional | Restriction" feature in Beta right now to restrict data to a | specific legal jurisdiction, but they only support EU right | now. Not helpful if you need your data to be stored in Brazil | or something. | paulddraper wrote: | If you already use Cloudflare for lots of other things, no. | | If you already use AWS for lots of other things, yes. | benjaminwootton wrote: | The other hidden cost when you are working with data hosted on S3 | is the LIST requests. Some of the data tools seem very chatty | with S3, and you end up with thousands of them when you have | small files buried in folders, at a not insignificant cost. I | need to dig into it more, but they are always up there towards | the top of my AWS bills. | thedaly wrote: | > In fact, there's an opportunity to build entire companies that | take advantage of this price differential and I expect we'll see | more and more of that happening. | | Interesting. What sort of companies can take advantage of this? | diamondap wrote: | Basically any company offering special services that work with | very large data sets. That could be a consumer backup system | like Carbonite or a bulk photo processing service.
In either | case, legal agreements with customers are key, because you | ultimately don't control the storage system on which your | business and their data depend. | | I work for a non-profit doing digital preservation for a number | of universities in the US. We store huge amounts of data in S3, | Glacier and Wasabi, and provide services and workflows to help | depositors comply with legal requirements, access controls, | provable data integrity, archival best practices, etc. | | There are some for-profits in this space as well. It's not a | huge or highly profitable space, but I do think there are other | business opportunities out there where organizations want to | store geographically distributed copies of their data (for | safety) and run that data through processing pipelines. | | The trick, of course, is to identify which organizations have a | similar set of needs and then build that. In our case, we've | spent a lot of time working around data access costs, and there | are some cases where we just can't avoid them. They can really | be considerable when you're working with large data sets, and | if you can solve the problem of data transfer costs from the | get-go, you'll be way ahead of many existing services built on | S3 and Glacier. | dangoldin wrote: | Author here but some ideas I was thinking about: - An open | source data pipeline built on top of R2. A way of keeping data | on R2/S3 but then having execution handled in Workers/Lambda. | Inspired by what https://www.boilingdata.com/ and | https://www.bauplanlabs.com/ are doing. - Related to the above | but taking data that's stored in the various big data formats | (Parquet, Iceberg, Hudi, etc) and generating many more | combinations of the datasets and choosing optimal ones based on | the workload. You can do this with existing providers but I | think the cost element just makes this easier to stomach.
- | Abstracting some of the AI/ML products out there and choosing | the best one for the job by keeping the data on R2 and then | shipping it to the relevant providers (since data ingress to | them is free) for specific tasks. | gen220 wrote: | I'm building a "media hosting site". Based on somewhat | reasonable forecasts of egress demand vs total volume stored, | using R2 means I'll be able to charge a low take rate that | should (in theory) give me a good counterposition to | competitors in the space. | | Basically, using R2 allows you to undercut competitors' | pricing. It also means I don't need to build out a separate CDN | to host my files, because Cloudflare will do that for me, too. | | Competitors built out and maintain their own equivalent CDNs | and storage solutions that are ~10x more expensive to | maintain and operate than going through Cloudflare. Basically, | Cloudflare is doing to CDNs and storage what AWS and friends | did to compute. | xrd wrote: | I just love minio. It is a drop-in replacement for S3. I have | never done a price comparison for TCO vs S3 or R2, but I have a | good backup story and run it all inside docker/dokku so it is | easy to recover. | hipadev23 wrote: | OP is missing that a correct implementation of Databricks or | Snowflake will have those instances running inside the same | AWS region as the data. That's not to say R2 isn't an amazing | product, but the egregious costs aren't as high since egress is | $0 on both sides. | dangoldin wrote: | Author here and it is true that costs within a region are free | and if you do design your system appropriately you can take | advantage of it but I've seen accidental cases where someone | will try to access in another region and it's nice to not even | have to worry about it. Even that can be handled with better | tooling/processes but the bigger point is if you want to have | your data be available across clouds to take advantage of the | different capabilities.
I used AI as an example but imagine you | have all your data in S3 but want to use Azure due to the | OpenAI partnership. It's that use case that's enabled by R2. | hipadev23 wrote: | Yeah, for greenfield work building up on R2 is generally a | far better deal than S3, but if you have a massive amount of | data already on S3, especially if it's small files, you're | going to pay a massive penalty to move the data. Sippy is | nice but it just spreads the pain over time. | cmgriffing wrote: | I could be mistaken, but I believe AWS would still charge for | one direction of an S3 to Databricks/Snowflake | instance/cluster. | hipadev23 wrote: | AWS S3 Egress charges are $0.00 when the destination is AWS | within the same region. When you setup your Databricks or | Snowflake accounts, you need to correctly specify the same | region as your S3 bucket(s) otherwise you'll pay egress. | drexlspivey wrote: | If I understand correctly when storing data to vanilla S3 (not | their edge offering) the data live in a single zone/datacenter | right? While on R2 they could potentially be replicated in tens | of locations. If that is true how can Cloudflare afford the | storage cost with basically the same pricing? | leiferik wrote: | As an indie dev, I recommend R2 highly. No egress is the killer | feature. I started using R2 earlier this year for my AI | transcription service TurboScribe (https://turboscribe.ai/). | Users upload audio/video files directly to R2 buckets (sometimes | many large, multi-GB files), which are then transferred to a | compute provider for transcription. No vendor lock-in for my | compute (ingress is free/cheap pretty much everywhere) and I can | easily move workloads across multiple providers. Users can even | re-download their (again, potentially large) files with a simple | signed R2 URL (again, no egress fees). 
| | I'm also a Backblaze B2 customer, which I also highly recommend | and has slightly different trade-offs (R2 is slightly faster in | my experience, but B2 is 2-3x cheaper storage, so I use it mostly | for backups and other files that I'm likely to store a long time). | jokethrowaway wrote: | It blows my mind that anyone would consider S3 cheap. | | Before the cloud, you could always get plenty of space on | dedicated servers for way cheaper. | | You could make an argument about the API being nicer than dealing | with a linux server - but is AWS nice? I think it's pretty awful | and requires tons of (different, specific, non-transferable) | knowledge. | | Hype, scalability buzzwords thrown around by startups with 1000 | users and a $1M contract with AWS. | | Sure R2 is cheaper but it's still not a low cost option. You are | paying for a nice shiny service. | gen220 wrote: | I think it all depends on the volume of data you're storing, | access requirements, and how much value you plan to generate | per GB. | | It's certainly quite cheap for a set of typical "requirements" | for media hosting companies. | | But yeah, if you're storing data for mainly archival purposes, | you shouldn't be paying for R2 or S3. | sgammon wrote: | We absolutely love R2, especially when paired with Workers. | johnklos wrote: | Should we simply ignore the tremendous amount of phishing hosted | using r2.dev? Or is this also part of "an economic opportunity"? | | Cloudflare may well be on their way to becoming a monopoly, but | they certainly show they don't care about abuse. Even if it | weren't a simple matter of principle, in case they aren't | successful in forcing themselves down everyone's throats, I | wouldn't want to host anything on any service that hosts phishers | and scammers without even a modicum of concern. | andrewstuart wrote: | >> you're paying anywhere from $0.05/GB to $0.09/GB for data | transfer in us-east-1. At big data scale this adds up.
| | At small data scale this adds up. | | And..... it's 11 cents a GB from Australia and 15 cents a GB from | Brazil. | | If you have S3 facing the Internet a hacker can bankrupt your | company in minutes with a simple load-testing application. Not even | a hacker. A bug in a web page could do the same thing. | paulddraper wrote: | 200 TB in minutes is impressive. | | (Assuming your company can be bankrupted for ~$20k.) ___________________________________________________________________ (page generated 2023-11-02 23:00 UTC)