[HN Gopher] Uses and abuses of cloud data warehouses ___________________________________________________________________ Uses and abuses of cloud data warehouses Author : Malp Score : 126 points Date : 2023-08-16 13:10 UTC (9 hours ago) (HTM) web link (materialize.com) (TXT) w3m dump (materialize.com) | spullara wrote: | These reasons are why Snowflake is building hybrid tables (under | the Unistore umbrella). Those tables keep recent data in an | operational store and historical data in their typical data | warehouse storage systems. Best of both worlds. Still in private | preview but definitely the answer to how you build applications | that need both without using multiple databases and syncing. | | https://www.snowflake.com/guides/htap-hybrid-transactional-a... | datavirtue wrote: | Conveniently leaving out the issue of cost. Snowflake is piling | on features that encourage more compute. Customers abuse the | system and they (Snowflake) respond by helping cement them into | continuing the abuse (spending more) by developing features to | make bad habits and horrible engineering decisions look like | something they should be doing. Typical. | disgruntledphd2 wrote: | Snowflake are the Oracle of the cloud. | ed_elliott_asc wrote: | Oh come on, Snowflake isn't cheap, but there is none of the | license auditing nonsense. | | (Also, isn't Oracle the Oracle of the cloud?) | tcoff91 wrote: | Oracle is so far beyond any other major player in | crookedness that it's not even funny. | | Oh, you happened to run your Oracle database in a VM and | got audited? They'll try to shake you down to pay for | however many Oracle licenses for every other box you are | running hypervisors on, because they claim that you could | have transferred the database to any of those other | boxes. So if you have a datacenter with 1000 boxes | running VMware, and you ran Oracle on one of them, they | try to shake you down for not buying 1000 licenses. 
Then they say: "But if you just buy a bunch of | cloud credits, we can make your 1000x license violation | go away." | mrbungie wrote: | I remember one time I was working as a Data & Analytics Lead | (almost a Chief Data Officer but without the title) in a company | where I don't work anymore and I was "challenged" by our parent | company's CDO about our data tech stack and operations. Just for | context, my team at the time was me working as the lead and main | Data Engineer plus 3 Data Analysts that I was coaching/teaching | to convert into DEngs/DScientists. | | At the time we were mostly a batch data shop, based on Apache | Airflow + K8S + BigQuery + GCS in Google Cloud Platform, with | BigQuery + GCS as the central datalake techs for analytics and | processing. We still had RT capabilities due to also having some | Flink processes running in the K8S cluster, and also having | time-critical (time, not latency) processes running in microbatches of | minutes for NRT. It was pretty cheap and sufficiently reliable, | with both Airflow and Flink having self-healing capabilities at | least at the node/process level (and even cluster/region level | should we need it and be willing to increase the costs), while | also allowing for some changes down the road like moving out of | BQ if the costs scaled up too much. | | What they wanted us to implement was, according to them, the | industry "best practices" circa 2021: a Kafka-based Datalake | (KSQL and co.), at least 4 other engines (Trino, Pinot, Postgres | and Flink) and an external object storage, with most of the stuff | running inside Docker containers orchestrated by Ansible in N | compute instances manually controlled from a bastion instance. | For some reason, they insisted on having a real time datalake | based on Kafka. It was an insane mix of cargo cult, FOMO, high | operational complexity and low reliability in one package. | | I resisted the idea until the last second I was in that place. 
I | met up with some of my team members for drinks months after my | departure and they told me the new CDO was already | convinced that said "RT-based" datalake was the way to go | forward. I still shudder every time I remember the architectural | diagram and I hope they didn't end up following that terrible | advice. | | tl;dr: I will never understand the cargo cult around real time | data and analytics but it is a thing that appeals to both | decision makers and "data workers". Most businesses and | operations (especially those whose main focus is not IT by | itself) won't act or decide in hours, but rather in days. Build | around your main use case and then make exceptions, not the other | way around. | chuckhend wrote: | I agree that is a great approach - build around the main use | cases and then make exceptions. I think a lot of companies have | legitimate use cases for real-time analytics (outside of their | internal decision making), but, as you mention, preemptively | optimizing for the aspiration leads them towards unnecessary | tool and tech sprawl. For example, a marketplace application | that shows you the quantity of an item currently available -- | you as a consumer use that information to make a decision in | seconds, so it's a great use case. Internally, the org probably | uses that data for weekly or quarterly forecasting. I've seen | use cases like that lead to the "let's make everything | real-time", but not every use case benefits the same from real-time. | kitanata wrote: | [flagged] | mritchie712 wrote: | I caught myself wondering how Google, Microsoft and Amazon let | Snowflake win. You can argue they haven't won, but let's assume | they have. Two things: | | 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over $1T. | Owning Snowflake would be a drop in the bucket for any of them | (let alone if they were splitting the revenue). | | 2. 
Snowflake runs on AWS, GCP or Azure (customer's choice), so a | good chunk of their revenue goes back to these services. | | Looking at these two points as the CEO of GOOGL, MSFT, or AMZN, | I'd shrug away Snowflake "beating us". It's crazy that you can | build a $50B company that your largest competitors barely care | about. | riku_iki wrote: | > 1. SNOW's market cap is $50B. GOOGL, MSFT, AMZN are all over | $1T. Owning Snowflake would be a drop in the bucket for any of | them (let alone if they were splitting the revenue). | | FAANG can't utilize their market cap to buy SNOW, they would | need to pay cash, and $50B is a very large amount for any of these | companies (it's about Google's annual net income). | | Also, SNOW stock is very inflated now: it is heavily | income-negative, revenue is not that high, and the stock price rests | on growth expectations. | mritchie712 wrote: | My point is that they wouldn't want to buy it (or have | focused much on building a competitive product) if it's only | worth $50B. | hnthrowaway0328 wrote: | I agree. The cloud providers are basically the guys who sell | shovels in a gold rush. Snowflake still needs to build on top of the | clouds, so MAG never loses. I heard that SNOW is offering its own | cloud services but I could be wrong -- and even if I'm correct | they have a super long way to catch up. | pm90 wrote: | What I heard is that AWS got there first with Redshift but then | didn't really invest as much as was required by users, so | Snowflake found an opening and pounced on it. | | BigQuery in GCP is a pretty great alternative and I know that | GCP invests/promotes it heavily, but they were slightly late to | the market. | datadrivenangel wrote: | BigQuery is pretty great. The serverless-by-default setup | works very well for most BI use cases. There are some weird | issues when you're a heavy user and start hitting the | normally hidden quotas. 
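[Editor's note: the "hidden quotas" surprise mentioned above usually shows up as an unexpectedly large bytes-scanned bill. One way to catch it early is BigQuery's dry-run mode, which reports how much data a query would scan without executing it. A minimal sketch, assuming the `google-cloud-bigquery` client library and default application credentials; the $6.25/TiB on-demand rate is an assumption, so check your region's current price sheet.]

```python
def estimate_on_demand_cost_usd(bytes_processed, usd_per_tib=6.25):
    """BigQuery on-demand pricing is charged per TiB scanned.

    The default rate here is an assumption -- verify it against your
    region's current pricing before relying on it.
    """
    return bytes_processed / 1024**4 * usd_per_tib


def dry_run_cost(sql, project=None):
    """Estimate a query's cost without executing it, via a dry run."""
    from google.cloud import bigquery  # lazy import: optional dependency

    client = bigquery.Client(project=project)
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    # A dry-run job completes immediately and reports bytes it would scan.
    return estimate_on_demand_cost_usd(job.total_bytes_processed)
```

Gating scheduled pipelines on a dry-run estimate like this is one way to notice a runaway query before it hits a quota rather than after.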
| dsaavy wrote: | There are some ways around the heavy user issues that | aren't ideal but will work for BI-oriented heavy users. | hodgesrm wrote: | This article uses an either/or definition that leaves out a big | set of use cases that combine operational _and_ analytic usage: | | > First, a working definition. An operational tool facilitates | the day-to-day operation of your business. Think of it in | contrast to analytical tools that facilitate historical analysis | of your business to inform longer term resource allocation or | strategy. | | Security information and event management (SIEM) is a typical | example. You want fast notification on events _combined with_ the | ability to sift through history extremely quickly to assess | problems. This is precisely the niche occupied by real-time | analytic databases like ClickHouse, Druid, and Pinot. | dontupvoteme wrote: | The random bolding of words reeks of adtech. | | Is the usage of such an old HTML tag itself now a trigger to send | something to /dev/null? | atwong wrote: | There are other databases today that do real time analytics | (ClickHouse, Apache Druid, and StarRocks, along with Apache Pinot). | I'd look at the ClickHouse Benchmark to see who are the | competitors in that space and their relative performance. | slotrans wrote: | Yeah ClickHouse is definitely the way to go here. Its ability | to serve queries with low latency and high concurrency is in an | entirely different league from Snowflake, Redshift, BigQuery, | etc. | biggestdummy wrote: | StarRocks handles latency and concurrency as well as | ClickHouse but also does joins. Less denormalization, and you | can use the same platform for traditional BI/ad-hoc queries. | riku_iki wrote: | ClickHouse also does joins. | | Somehow StarRocks dudes appear in every relevant post with | this false claim. | biggestdummy wrote: | There's a difference between "supports the syntax for | joins" and "does joins efficiently enough that they are | useful." 
| | My experience with ClickHouse is that its joins are not | performant enough to be useful. So the best practice in | most cases is to denormalize. I should have been more | specific in my earlier comment. | riku_iki wrote: | Ack that an anonymous user on the internet said he couldn't make | ClickHouse joins perform well in a case he didn't | describe. | albert_e wrote: | Aren't a lot of businesses being sold on "real time analytics" | these days? | | That mixes the use cases of analytics and operations because | everyone is led to believe that things that happened in the last 10 | minutes must go through the analytics lens and yield actionable | insights in real time so their operational systems can | react/adapt instantly. | | Most business processes probably don't need anywhere near such | real time analytics capability but it is very easy to think (or | be convinced that) we do. Especially if I am the owner of a given | business process (with an IT budget), why wouldn't I want the | ability to understand trends in real-time and react to them, if not | get ahead of them and predict/be prepared? Anything less than | that is seen as being shamefully behind on the tech curve. | | In this context, the section in the article where it says present | data is of virtually zero importance to analytics is no longer | true. We need a real solution even if we apply those (presumably | complex and costly) solutions to only the most deserving use | cases (and not abuse them). | | What is the current thinking in this space? I am sure there are | technical solutions here, but what is the framework to evaluate | which use case actually deserves pursuing such a setup? | | Curious to hear. | riordan wrote: | > In this context, the section in the article where it says | present data is of virtually zero importance to analytics is no | longer true. We need a real solution even if we apply those | (presumably complex and costly) solutions to only the most | deserving use cases (and not abuse them). 
| | Totally agreed, though where real-time data is being put | through an analytics lens is where CDWs start to creak and get | costly. In my experience, these real-time uses shift the burden | from human decision-makers to automated | decision-making and it becomes more a part of the product. And that's | cool, but it gets costly, fast. | | It also makes perfect sense to fake-it-til-you-make-it for | real-time use cases on an existing Cloud Data Warehouse/dbt | style _modern data stack_ if your data team's already using it | for the rest of their data platform; after all they already | know it and it's allowed that team to scale. | | But a huge part of the challenge is that once you've made it, | the alternative for a data-intensive use case is a bespoke | microservice or a streaming pipeline, often in a language or on | a platform that's foreign to the existing data team who's built | the thing. If most of your code is dbt SQL and Airflow jobs, | working with Kafka and streaming Spark is pretty foreign (not | to mention entirely outside of the observability infrastructure | your team already has in place). Now we've got rewrites across | languages/platforms, leaving teams with the cognitive | overhead of multiple architectures & toolchains (and split | focus). The alternative would be having a separate team to hand | off real-time systems to, and that's only if the company can | afford to have that many engineers. Might as well just allocate | that spend to your cloud budget and let the existing data team | run up a crazy bill on Snowflake or BigQuery as long as it's | less than the cost of a new engineering team. | | ------ | | There's something incredible about the ruthless efficiency of | SQL data platforms that allows data teams to scale the number | of components per engineer. Once you have a Modern-Data-Stack | system in place, the marginal cost of new pipelines or | transformations is negligible (and they build atop one | another). 
That platform-enabled compounding effect doesn't | really occur with data-intensive microservices/streaming | pipelines, which means only the biggest business-critical | applications (or skunk works shadow projects) will get the | data-intensive applications[1] treatment, and business | stakeholders will be hesitant to greenlight it. | | I think Materialize is trying to build that Modern-Data-Stack | type platform for real-time use cases: one that doesn't come | with the cognitive cost of a completely separate architecture | or the divide of completely separate teams and tools. If I | already had a go-to system in place for streaming data that | could be prototyped with the data warehouse, then shifted over | to a streaming platform, the same teams could manage it and | we'd actually get that cumulative compounding effect. Not to | mention it becomes a lot easier to then justify using a | real-time application the next time. | slotrans wrote: | Just gonna keep linking this til the heat death of the | universe: https://mcfunley.com/whom-the-gods-would-destroy-they-first-... | | Real-time analytics are _worse than useless_. At best they are | a distracting resource sink, at worst they directly harm the | quality of decision-making. | jandrewrogers wrote: | The term "real-time" is much abused in marketing copy. It is | often treated like a technical metric but it is actually a | business metric: am I making operational decisions with the | most recent data available? For many businesses, "most recent | data available" can be several days old and little operational | efficiency would be gained by investing in reducing that | latency. | | For some businesses, "real-time" can be properly defined as | "within the last week". While there are many businesses where | reducing that operational tempo to seconds would have an | impact, it is by no means universal. 
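[Editor's note: the point above, that "real-time" is a business freshness requirement rather than a technical metric, can be made concrete as a freshness-SLO check. A toy sketch with made-up numbers; the function name and thresholds are hypothetical, not from any tool discussed in the thread.]

```python
from datetime import datetime, timedelta, timezone

def meets_freshness_slo(last_loaded_at, slo, now=None):
    """True if the newest loaded data is recent enough for the business SLO."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= slo

# Made-up numbers: data last landed by a nightly batch 20 hours ago.
now = datetime(2023, 8, 16, 13, 0, tzinfo=timezone.utc)
loaded = now - timedelta(hours=20)

# A weekly operational tempo is comfortably served by daily batch...
assert meets_freshness_slo(loaded, timedelta(days=7), now)
# ...while a minutes-level "real-time" SLO is the case that would
# actually justify streaming infrastructure.
assert not meets_freshness_slo(loaded, timedelta(minutes=5), now)
```

Writing the SLO down this way forces the business conversation first: pick the window, then choose batch or streaming to meet it, not the other way around.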
| Eumenes wrote: | From my experience (mostly startups), real time analytics is | generally overkill, esp. from a BI perspective. Unless your | business is very focused on real time data and transactional | processing, you can generally get away with ETL/batch jobs. | Showing executives, product, and downstream teams some metrics | that update a few times per day saves a ton of money over | things like Snowflake/Databricks/Redshift. While cloud | services can be pricey, tools like dbt are really useful and | can be administered by savvy business people or analyst types. | Those candidates are way easier to hire compared to data | engineers, SQL experts, etc. | lmkg wrote: | I work as a web analyst (think Google Analytics). | | One time I ran an A/B test on the color of a button. After the | conclusion of the test, with a clear winner in hand, it took | _eleven months_ for all involved stakeholders to approve the | change. The website in question got a few thousand visits a | month and was not critical to any form of business. | | This organization does not benefit from real-time analytics. | | Now that's an extreme outlier, but my experience is that _most_ | organizations are in that position. The feedback loop from | collecting data to making a decision is long, and real-time | analytics shortens a part that's already not the bottleneck. | The _technical_ part of real-time analytics provides no value | unless the org also has the _operational_ capacity to use that | data quickly. | | I have seen this! I have, for example, seen a news site that | looked at web analytics data from the morning and was able to | publish new opinion pieces that afternoon if something was | trending. They had a _dedicated process_ built around that data | pipeline. Critically, they had a _specific_ idea of _what they | could do_ with that data when they received it. 
| | So if you want a framework, I would start from a single, simple | question: What can you actually _do_ with real-time data? Name | one (1) action your organization could take based on that data. | | I think it's also useful to separate _what data_ benefits from | realtime and _which users_ can make use of it. Even if you have | real-time data, some consumers don't benefit from immediacy. | coredog64 wrote: | Generally speaking "What questions do you hope to answer with | this data?" is a good filter for all kinds of operational | data. | iamacyborg wrote: | Hate to say it, but if your site was only getting a few | thousand visitors a month, your test was likely vastly | underpowered and therefore irrelevant anyway. | mrbungie wrote: | Power is not just about sample size, but also about | effect size (expected, or previously informed by some other | evidence). You can't make that conclusion without it. | iamacyborg wrote: | For sure, but you'd need one hell of a good CTA to be | getting a sufficient effect size to warrant small | samples. | ozim wrote: | For me it mostly is that business people don't understand OLAP | vs OLTP, and that if they add 5 items to the database and they are | visible in the system, their "dashboard" will not update | instantly but only after the data pipelines run. | | Which is hard to explain, because if it is not instant | everywhere they think it is a bug and the system is crappy. Later | on they will use the dashboard once a week or once a month, so | a 5-item update is not relevant at all. | atwebb wrote: | Real-time generally means near-real-time, and even then I liken | it to availability. | | If asked, people would say "I need to always be up" until they | see the costs associated with it; then being out for a few | hours a year tends to be ok. | datadrivenangel wrote: | This is a great way of looking at it. The cost starts going up | rapidly from daily and approaches infinity as you get to | ultra-low latency realtime analytics. 
| | There is a minimum cost though (systems, engineers, etc), so | for medium data there's often very little marginal cost up | until you start getting to hourly refreshes. This is not true | for larger datasets though. | higeorge13 wrote: | I work in a real time subscription analytics company | (chartmogul.com). We fetch, normalize and aggregate various | billing systems' data and eventually visualize them into graphs | and tables. | | I had this discussion with key people and I would say it | depends on multiple factors. Small companies really like and | require real-time analytics: they want to see how a couple of | invoices translate into updated SaaS metrics or why they didn't | get a Slack/email notification as soon as it happened. Larger | ones will check their data less frequently per day or week, but | again it depends on the people and their role. Most of them are | happy with getting their data once per day into their mailboxes | or warehouses. | | But we try to make everyone happy so we aim for real time | analytics. | mrbungie wrote: | I think GP's point is that it is not about the perceived value | of real time data/analytics, but rather, its actual value. | Decision makers may ask for RT or NRT, but most of the time | won't make a decision or take action in a timeframe that actually | justifies RT/NRT data/analytics. | | For most operations RT/NRT data stuff normally is about | novelty/vanity rather than a real existing business need. | andrenotgiant wrote: | The article is separating "operational" and "analytical" | use-cases. | | IIUC analytical = "what question are you trying to answer" | and in analytics, RT/NRT is absolutely novelty/vanity. | Operational = "what action are you trying to take" and it | makes sense to want to have up-to-date data when, for | example, running ML models, triggering notifications, | etc... | mrbungie wrote: | Yeah, totally. 
I should've specified "analytical | operations", as in, updating dashboards and other | non-time-critical data processing that eventually feed into | decision making. That's where devs or decision makers | asking for RT/NRT makes no sense. | debarshri wrote: | 15 years ago, when I joined the workforce, business intelligence was | all the rage. The data world was pretty much straightforward. You | had transactional data in OLTP databases which would be shipped | to operational data stores, then rolled into the data warehouse. | Data warehouses were actual specialised hardware appliances | (Netezza et al.); reporting tools were robust too. | | Every time I moved from one org to another, these data | warehouse concepts somehow got muddled. | andrenotgiant wrote: | It seems like Snowflake is going all-in on building features and | doing marketing that encourage their customers to build | applications, serving operational workloads, etc... on them. | Things like in-product analytics, usage-based billing, | personalization, etc... | | Anyone here taking them up on it? I'm genuinely curious how it's | going. | weego wrote: | After a series of calls, examples and explanations with them we | never managed to get close to a reasonable projection of what | our monthly costs would be like on Snowflake. I understand why | companies in this field use abstract notions of | 'processing'/'compute' units, but it's a no-go finance-wise. | | Without some close-to-real-world projections we don't have time | to consider implementation to find out for ourselves. | benjaminwootton wrote: | Snowflake is one of the easier tools to measure because it's | a simple function of region, instance size, and uptime. If you | can simulate some real loads and understand the usage then | you do have a shot at forecasting. | | Of course the number is going to be high, but you have to | remember it rolls up compute and requires less manpower. 
This | is also a win for finance if they are comfortable with | usage-based billing. | code_biologist wrote: | Whose finance team likes usage-based billing? It makes | sense for elastic use cases and is definitely "fair", but | there are a lot of issues: forecasting is hard, and there | are "dev team had an oops" situations. | | I had a frog-getting-boiled situation at one job that was | exactly the process described in the posted article: usage | of the cloud data warehouse grew as people trusted the | infrastructure and used more fresh data for more and more | use cases. They were all good, sane use cases. I repeatedly | under-forecast our cost growth until we made large changes | and it really frustrated the finance people, rightly so. | GeneralAntilles wrote: | Yeah, they're providing a path-of-least-resistance for getting | stuff done in your existing data environment. | | A common challenge in a lot of organizations is IT as a | roadblock to deployment of internal tools coming from data | teams. Snowflake is answering this with Streamlit. You get an | easy platform for data people to use and deploy on, and it can | all be done within the business firewall under data governance | within Snowflake. | politelemon wrote: | I've noticed that too. I think the marketing is definitely | working, I'm seeing a few organisations starting to shift more | and more workloads onto them, and some are also publishing | datasets on their marketplace. | | One of their most interesting offerings coming up is Snowpark, | which lets you run a Python function as a UDF within | Snowflake. This way you don't have to transfer data around | everywhere, just run it as part of your normal SQL statements. | It's also possible to pickle a function and send it over... so | conceivably one could train a data science model and run that | as part of a SQL statement. This could get very interesting. | jamesblonde wrote: | In theory, fine. 
Then you look at the walled garden that is | Snowpark - only "approved" Python libraries are allowed | there. It will be a very restrictive set of models you can | train, and very restrictive feature engineering in Python. | And, wait, aren't Python UDFs super-slow (GIL) - what about | Pandas UDFs (wait, that's PySpark...) | Pils wrote: | Having worked with a team using Snowpark, there are a | couple of things that bother me about it as a platform. For | example, it only supported Python 3.8 until 3.9/10 recently | entered preview mode. It feels a bit like a rushed project | designed to compete with Databricks/Spark at the bullet | point level, but not quite at the same quality level. | | But that's fine! It has only existed for around a year in | public preview, and appears to be improving quickly. My | issue was with how aggressively Snowflake sales tried to | push it as a production-ready ML platform. Whenever I asked | questions about version control/CI, model versioning/ops, | package managers, etc. the sales engineers and data | scientists consistently oversold the product. | disgruntledphd2 wrote: | Yeah it's definitely not ready for modelling. It's pretty | rocking for ETL though, and much easier to test and | abstract than regular SQL. Granted it's a PySpark clone, | but our data is already in Snowflake. | noazdad wrote: | Disclaimer: Snowflake employee here. You can add any Python | library you want - as long as its dependencies are also | 100% Python. Takes about a minute: pip install the package, | zip it up, upload it to an internal Snowflake stage, then | reference it in the IMPORTS=() directive in your Python. I | did this with pydicom just the other day - worked a treat. | So yes, not the depth and breadth of the entire Python | ecosystem, but 1500+ native packages/versions on the | Anaconda repo, plus this technique? Hardly a "walled | garden". | jamesblonde wrote: | Good luck with trying to install any non-trivial Python | library this way. 
And with AI moving so fast, do you | think people will accept that they can't use the | libraries they need, because you haven't approved them | yet?!? | lokar wrote: | Also containers: | | https://www.snowflake.com/blog/snowpark-container-services-d... | atwebb wrote: | > run a Python function as a UDF | | Is that a differentiator? I'm unfamiliar with Snowpark's | actual implementation, but I know SQL Server introduced Python/R | in-engine around 2016, something like that. | ramraj07 wrote: | Snowflake is capturing a large market share in analytics | industries thanks to its "just works" nature. I'm a massive | fan. | | But in the end, Snowflake stores the data in S3 as partitions. | If you want to update a single value you have to replace the | entire S3 partition. Similarly you need to read a reasonable | amount of S3 data to retrieve even a single record. Thus you're | never going to get responses shorter than half a second (at | best). As long as you don't try and game around that limitation | it works great. | | Materialize up here also follows the same model in the end, | FWIW. | munchor wrote: | Disclaimer: I work at SingleStoreDB. | | Building a database that can handle both analytics and | operations is what we've been working on for the past 10+ | years. Our customers use us to build applications with a strong | analytical component to them (all of the use cases you | mentioned and many more). | | How's it going? It's going really well! And we're working on | some really cool things that will expand our offering from | being a pure data storage solution to much more of a | platform[1]. | | If you want to learn more about our architecture, we published | a paper about it at SIGMOD in late 2022[2]. | | [1]: https://davidgomes.com/databases-cant-be-just-databases-anym... 
| | [2]: https://dl.acm.org/doi/pdf/10.1145/3514221.3526055 | datadrivenangel wrote: | I assume they're angling for a Salesforce acquisition as they | move towards being a micro-hosting service like Salesforce. | rubiquity wrote: | Snowflake is worth at least 25% of Salesforce, so such an | acquisition is very unlikely unless Salesforce has $60 | billion or more burning a hole in their pocket. ___________________________________________________________________ (page generated 2023-08-16 23:01 UTC)