[HN Gopher] Databricks is an RDBMS
___________________________________________________________________
Author : georgewfraser
Score  : 95 points
Date   : 2021-02-01 19:21 UTC (3 hours ago)

(HTM) web link (fivetran.com)

| anonu wrote:
| So is Databricks like AWS RDS with some Lambda functions built
| around it?

| pbourke wrote:
| No, Databricks is a distribution of Apache Spark with some
| value-added features such as the Delta Lake data format and a
| clean UI for hosting notebooks, doing cluster admin, etc

| fs111 wrote:
| delta lake is open source too btw

| rymurr wrote:
| only in name. check out the lack of traction for a lot of
| PRs on https://github.com/delta/delta-oss . Iceberg
| (https://iceberg.apache.org) is comparable but actually
| OSS.

| rymurr wrote:
| Check out
| https://searchdatamanagement.techtarget.com/news/252495619/A...
| The idea of convergence between data lake and data warehouse is
| starting to spread rapidly.

| jpau wrote:
| I'm deeply disappointed in Databricks as an RDBMS.
|
| As a DS/DE, there's a lot to love (not all, but a lot). The easy
| provision of Spark clusters. The jobs API. DeltaLake (mostly).
| Easy notebooks (please don't create a prod system from these..).
| And Spark itself continues to improve, albeit in an increasingly
| crowded field.
|
| But I've worked closely with BigCo SQL analysts on Azure
| Databricks, and their experience was terrible. For example:
|
| - You cannot browse the data structure without an active cluster
| - Starting a cluster can take ~5 minutes and, since you missed
|   that moment, you may not submit your first query for 10-15
|   minutes.
| - The SQL error messages are often (perhaps usually?) nonsense,
|   so you have to operate without them.
| - An unfortunate amount of downtime, followed by bizarre excuses.
| - It's so darn slow, relative to equivalent queries on BigQuery
|   or Snowflake.
| - Even submitting a query can take a weird amount of time.
|
| If Databricks-as-an-RDBMS were competing against Teradata, sure,
| let's have a chat.
|
| But we're in 2021, and there's just no comparing the experience
| of the SQL analyst on Databricks-as-an-RDBMS vs.
| Snowflake/BigQuery.
|
| I'm excited for the potential of Snowflake's SnowPark (though I
| know little about it). Calling UDFs from SQL means you can create
| great features for SQL analysts, provided that they can build the
| momentum to need it.

| mrbungie wrote:
| I've seen and have compared Databricks clusters to a 10-15yo
| Teradata cluster and no way in hell I would use Databricks.
|
| Teradata is a lot faster for interactive workloads than
| Databricks.
|
| PS: I agree there's no comparing Databricks vs
| Snowflake/BigQuery.

| nattaylor wrote:
| I got excellent performance in Databricks with well partitioned
| Parquet and Spark 2.4. What is making the queries slow? Data
| scanning?

| jpau wrote:
| They use DeltaLake + Spark 3.0, and are mostly careful to
| partition well.
|
| Their datasets are small. Most tables are ~50GB, the odd
| table up to ~2TB. The clusters typically are nothing shabby
| for this size, defaults to ~[4-12]x32GB.
|
| The queries that I have seen are typically not written well.
| Think view-on-view-on-view (there's a BigCo policy against
| them materialising data..), and where the filter is applied
| in the last step. The stuff of horrors, but something I've
| seen in more-than-one-BigCo.
|
| But we have compared some of those same queries on BigQuery
| vs. Databricks, and, I don't know if BigQuery's execution
| optimiser is better? Or if the BigQuery storage is better at
| organising the data? Or if BigQuery is simply throwing more
| resources their way?

| agambrahma wrote:
| Sigma Computing (https://www.sigmacomputing.com) might be a
| good fit here too.
|
| (disclaimer: plug)

| bkandel wrote:
| Yes, and I would add to this that the (nearly) complete lack of
| IDE support makes working with Spark SQL quite painful.
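
A minimal sketch of the query shape jpau describes above (views stacked on views, with the filter applied only in the last step), next to an equivalent query that filters at the source. SQLite stands in for Spark SQL purely so the example runs anywhere; the `orders` table, the `v_*` view names and the data are hypothetical. Whether an engine pushes the late filter down through the views is up to its optimizer, so this illustrates the pattern and is not a benchmark.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a base table and three stacked views.

    All names here are hypothetical and exist only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (region TEXT, day TEXT, amount REAL);
        INSERT INTO orders VALUES
            ('emea', '2021-01-01', 10.0),
            ('emea', '2021-01-02', 5.0),
            ('apac', '2021-01-01', 7.0),
            ('apac', '2021-01-02', 3.0);

        -- view on the table
        CREATE VIEW v_clean AS
            SELECT region, day, amount FROM orders WHERE amount > 0;
        -- view on a view
        CREATE VIEW v_daily AS
            SELECT region, day, SUM(amount) AS total
            FROM v_clean GROUP BY region, day;
        -- view on a view on a view
        CREATE VIEW v_report AS
            SELECT region, SUM(total) AS total
            FROM v_daily GROUP BY region;
        """
    )
    return conn


# The filter is applied only at the very end, on top of the view stack.
LATE_FILTER = "SELECT region, total FROM v_report WHERE region = ?"

# The same result, with the filter applied directly against the base table.
EARLY_FILTER = """
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE region = ? AND amount > 0
    GROUP BY region
"""


def run(conn, sql, region):
    """Run one of the two queries for a single region and return all rows."""
    return conn.execute(sql, (region,)).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(run(conn, LATE_FILTER, "emea"))
    print(run(conn, EARLY_FILTER, "emea"))
```

Both queries return the same rows; the difference is how much work the engine may do before the filter takes effect if it cannot push the predicate down through the views.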

| kfk wrote:
| These are great innovations but can we please take a moment to
| realize 99% of companies are still stuck with a blend of Excel,
| Access and SQL Server? Why is adoption of this new tech so poor?
| Maybe it has something to do with the amount of confusion all the
| sales pitches about data lakes, big data, AI and company are
| generating.

| phoe-krk wrote:
| > Why is adoption of this new tech so poor?
|
| Because it's unnecessary to those 99% of companies. If company
| data fits in an Excel spreadsheet, Access database, or a single
| MySQL/Postgres instance, then introducing all this new tech
| with all of the associated costs and little return is a
| net loss.

| vmsp wrote:
| I know Excel runs the world but are there really that many
| people using Access?

| QuesnayJr wrote:
| I actually think the effect of all of this hype will move
| people past the Excel/Access era. There are simple analyses
| every company could do with R or scikit-learn that would save
| or make them money, and they just don't know how. Someone with
| AI expertise is over-qualified to do this, but they are at
| least qualified.

| t0mas88 wrote:
| Because most data isn't "big" data. If it fits in Excel on a
| laptop, why bother to roll out a distributed data lake system
| like Spark with all the associated ops work?

| hztar wrote:
| The ones I have been talking to stick to their Excel because
| their little part of the puzzle can be solved by Excel. These
| technologies usually demand data to be collected, but the
| amount of incentive to do so is low in any classic balkanized
| F500 organization.

| bpodgursky wrote:
| Hmm, I agree there's a market here, but I don't know why I
| wouldn't just use Snowflake or BigQuery if what I really wanted
| was a big-data RDBMS.
|
| Everyone I know who uses Databricks (and they all like it) uses
| it as hosted Spark with S3 integrations, or write... directly to
| Snowflake.
| I'm a little skeptical they're going to get traction as a true
| data lake model.

| georgewfraser wrote:
| It's a lot simpler to use a single system as both your data
| lake, and your data warehouse. As Databricks gets better and
| better at the core data warehouse features, it becomes feasible
| to use it for both. Meanwhile, Snowflake and BQ are coming from
| the other direction, implementing data lake features. AWS's
| strategy seems to be: just make it easier to have 2 systems and
| move data back and forth.

| MikeDelta wrote:
| Their Delta also has SSD caching, which turns out to be logic
| that stores a local copy of the file you queried for faster
| re-query. Going to call my LRU cache function like that as
| well...
|
| My company loves them. I think they only do a few things well,
| of which marketing is the best, and are not worth the money for
| data science teams with devops skills. Happy to hear from
| others if I am wrong.

| pram wrote:
| Depends. If you only have a couple workspaces, then no. It's
| worth the money for a company with lots of data science
| teams, and one team who is janitoring all the workspace
| infra. Our company has 20 workspaces for 10 teams already and
| it would be a nightmare if we expected everyone to fix and
| manage their own AWS stuff.

| snidane wrote:
| Snowflake and BigQuery will bite you in the ass later on. You
| can do 80% of the things you will need - which is great for
| some newbie stuff or for sales presentations. Once you need
| something complicated, you're on your own, while being stuck in
| a proprietary environment that you cannot extend.
|
| You will have to develop some kind of data lake to store
| unstructured data anyway. You will end up with a Snowflake data
| warehouse and a data lake. Why not just go with a data lake
| first, then?
|
| Databricks/Spark are just good platforms to help you do
| something with structured data in your lake.
| With the recent additions to its execution engine and Delta
| (strange naming tbh) it will be pretty much the same as
| Snowflake for you.

| mrbungie wrote:
| BigQuery/Snowflake can process Parquet and multiple other
| formats in Object Storage. You can use them more "freely" if
| you keep your raw data in open formats.
|
| You need something more complicated than what can be done
| using BigQuery/Snowflake (that remaining 20%, though I would
| say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever
| and process it with anything, including
| Dataproc/HDInsight/EMR or even Databricks. That's actually a
| common pattern.

| willvarfar wrote:
| The reason snowflake has such a high market cap is because its
| customers aren't paying much now, but they'll be paying and
| paying monthly forever. It's lock-in on a massive scale.
|
| Delta lake is something you can run on data you feel you still
| have some semblance of control over.

| ACow_Adonis wrote:
| data lake, delta lake, lake house. snowflake, snowpark. cloud,
| data warehouse. I just want to take a second to thank these
| companies for trying to turn my profession (data science +
| analytics) into the living hell of your everyday enterprise and
| tech culture cluster-fuck (pun not intended).
|
| I'm still coming to terms with the fact that there's actually a
| technology called Kafka...
|
| /sorry, I'm just particularly bitter after having to do some
| databricks training yesterday... it was basically 80% rote
| recitation of the sales literature about why they're so
| great/enterprisy and parroting company and platform specific
| jargon. I don't like seeing my profession turn into an obsession
| with tech and platforms when 99.9% of companies and people can't
| reason properly or operate their current tools/resources
| efficiently. obviously just my own opinion.

| arafa wrote:
| The training is quite heavy on the sales pitch, I agree (and
| expensive). There were some useful bits if you dig around
| though.

| MikeDelta wrote:
| I read a wonderful comment in another thread the other day. It
| was about the obsession of devs to collect a whole range of
| technologies on their resume. It went something like: "When I
| hire a carpenter I won't hire him for what he has in his
| toolbox. I want to know what he can do with only a hammer and
| chisel (= Linux machine). The rest he can learn."
|
| I will try to look up the link and obviously 'he' can be 'she'
| as well.

| willvarfar wrote:
| The key gap as I see it is that databricks doesn't support
| multi-table transactions. And that is why you can't treat it as
| though it is a drop-in functional replacement for an RDBMS.
|
| (Why no vector clocks in the manifest files or something?)

| georgewfraser wrote:
| Interestingly, BigQuery is also missing multi-table
| transactions. There are ways to live without this feature, but
| I agree it's a gap.

| rymurr wrote:
| Check out https://projectnessie.org it adds multi-table
| transactions to Databricks Delta.
|
| disclaimer: an author of Nessie

| drej wrote:
| I have seen several deployments of Databricks (including
| first-hand experience) and... most use cases could be better
| served by Postgres (or Redshift, Athena, Snowflake for larger
| scale). It has honestly been such an overkill for so many
| workloads, it was quite astonishing. I've seen people move from
| Excel to Spark... to handle the same volume of data. That's
| obviously not Databricks' fault, but their PR is pretty much
| "please do all your data work in our product, it's well suited
| for it".
|
| Yes, it's very good if you don't like setting up clusters (few
| do/can) and the UI is rather useful for getting up and running
| (not so much for writing code though). But you need to really
| understand the platform before adopting it. Please, please don't
| just adopt it because it's popular.
___________________________________________________________________
(page generated 2021-02-01 23:00 UTC)