[HN Gopher] Databricks is an RDBMS
       ___________________________________________________________________
        
       Author : georgewfraser
       Score  : 95 points
       Date   : 2021-02-01 19:21 UTC (3 hours ago)
        
 (HTM) web link (fivetran.com)
 (TXT) w3m dump (fivetran.com)
        
       | anonu wrote:
       | So is Databricks like AWS RDS with some Lambda functions built
       | around it?
        
         | pbourke wrote:
         | No, Databricks is a distribution of Apache Spark with some
         | value-added features such as the Delta Lake data format and a
         | clean UI for hosting notebooks, doing cluster admin, etc
        
           | fs111 wrote:
           | delta lake is open source too btw
        
             | rymurr wrote:
             | only in name. check out the lack of traction for a lot of
             | PRs on https://github.com/delta/delta-oss . Iceberg
             | (https://iceberg.apache.org) is comparable but actually
             | OSS.
        
       | rymurr wrote:
       | Check out
       | https://searchdatamanagement.techtarget.com/news/252495619/A...
       | the convergence between data lake and data warehouse idea is
       | starting to spread rapidly.
        
       | jpau wrote:
       | I'm deeply disappointed in Databricks as an RDBMS.
       | 
       | As a DS/DE, there's a lot to love (not all, but a lot). The easy
       | provision of Spark clusters. The jobs API. DeltaLake (mostly).
       | Easy notebooks (please don't create a prod system from these..).
       | And Spark itself continues to improve, albeit in an increasingly
       | crowded field.
       | 
       | But I've worked closely with BigCo SQL analysts on Azure
       | Databricks, and their experience was terrible. For example:
       | - You cannot browse the data structure without an active cluster
       | - Starting a cluster can take ~5 minutes and, since you missed
       | that moment, you may not submit your first query until 10-15
       | minutes.
       | - The SQL error messages are often (perhaps usually?) nonsense,
       | so you have to operate without them.
       | - An unfortunate amount of downtime, followed by bizarre excuses.
       | - It's so darn slow, relative to equivalent queries on BigQuery
       | or Snowflake.
       | - Even submitting a query can take a weird amount of time.
       | 
       | If Databricks-as-an-RDBMS were competing against Teradata, sure,
       | let's have a chat.
       | 
       | But we're in 2021, and there's just no comparing the experience
       | of the SQL analyst on Databricks-as-an-RDBMS vs.
       | Snowflake/BigQuery.
       | 
       | I'm excited for the potential of Snowflake's SnowPark (though
       | know little about it). Calling UDFs from SQL means you can create
       | great features for SQL analysts, provided that they can build the
       | momentum to need it.
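
To make the "calling UDFs from SQL" idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Snowpark's API. The function `domain_of` and the `users` table are hypothetical names chosen for illustration.

```python
import sqlite3


def domain_of(email):
    """Hypothetical UDF: return the lowercased domain of an email, or None."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@Example.com",), ("b@example.com",), ("c@other.org",)],
)

# An analyst can now use the UDF like any built-in SQL function.
rows = conn.execute(
    "SELECT domain_of(email) AS domain, COUNT(*) FROM users "
    "GROUP BY domain ORDER BY domain"
).fetchall()
print(rows)
```

The point is only that the analyst stays in SQL while the logic lives in a general-purpose language.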
        
         | mrbungie wrote:
         | I've seen and have compared Databricks clusters to a 10-15yo
         | Teradata cluster and no way in hell I would use Databricks.
         | 
         | Teradata is a lot faster for interactive workloads than
         | Databricks.
         | 
         | PS: I agree there's no comparing on Databricks vs
         | Snowflake/BigQuery.
        
         | nattaylor wrote:
         | I got excellent performance in Databricks with well partitioned
         | Parquet and Spark 2.4. What is making the queries slow? Data
         | scanning?
        
           | jpau wrote:
           | They use DeltaLake + Spark 3.0, and are mostly careful to
           | partition well.
           | 
           | Their datasets are small. Most tables are ~50GB, the odd
           | table up to ~2TB. The clusters typically are nothing shabby
           | for this size, defaults to ~[4-12]x32GB.
           | 
           | The queries that I have seen are typically not written well.
           | Think view-on-view-on-view (there's a BigCo policy against
           | them materialising data..), and where the filter is applied
           | in the last step. The stuff of horrors, but something I've
           | seen in more-than-one-BigCo.
           | 
           | But we have compared some of those same queries on BigQuery
           | vs. Databricks, and, I don't know if BigQuery's execution
           | optimiser is better? Or if the BigQuery storage is better
           | organising the data? Or if BigQuery is simply throwing more
           | resource their way?
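
A small sketch of the query shape described above, using Python's sqlite3 module purely as a stand-in (the `events` table and the views `v1` to `v3` are hypothetical). It stacks views on views and filters in the last step, then computes the same result with the filter applied first. Whether a given engine pushes the late filter down on its own depends on its optimiser, and this sketch does not measure that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL);
    INSERT INTO events VALUES
        ('2021-01-30', 1, 10.0),
        ('2021-01-31', 1, 5.0),
        ('2021-01-31', 2, 7.5),
        ('2021-02-01', 2, 2.5);

    -- View-on-view-on-view: each layer reads the one below it.
    CREATE VIEW v1 AS SELECT day, user_id, amount FROM events;
    CREATE VIEW v2 AS SELECT day, user_id, SUM(amount) AS total
                      FROM v1 GROUP BY day, user_id;
    CREATE VIEW v3 AS SELECT day, COUNT(*) AS users, SUM(total) AS total
                      FROM v2 GROUP BY day;
""")

# Filter applied in the last step, on top of the whole view stack.
late = conn.execute(
    "SELECT day, users, total FROM v3 WHERE day = '2021-01-31'"
).fetchall()

# Same logic with the filter applied before any aggregation.
early = conn.execute("""
    SELECT day, COUNT(*) AS users, SUM(total) AS total
    FROM (SELECT day, user_id, SUM(amount) AS total
          FROM events
          WHERE day = '2021-01-31'
          GROUP BY day, user_id)
    GROUP BY day
""").fetchall()

print(late, early)
```

Both queries return the same rows. The difference is how much data the engine has to aggregate if it cannot move the filter down through the views by itself.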
        
         | agambrahma wrote:
         | Sigma Computing (https://www.sigmacomputing.com) might be a
         | good fit here too.
         | 
         | (disclaimer: plug)
        
         | bkandel wrote:
         | Yes, and I would add to this the (nearly) complete lack of IDE
         | support makes working with Spark SQL quite painful.
        
       | kfk wrote:
       | These are great innovations but can we please take a moment to
       | realize 99% of companies are still stuck with a blend of excels,
       | access and sql servers? Why is adoption of this new tech so poor?
       | Maybe it has something to do with the amount of confusion all the
       | sales pitches about data lakes, big data, ai and company are
       | generating
        
         | phoe-krk wrote:
         | > Why is adoption of this new tech so poor?
         | 
         | Because it's unnecessary to those 99% of companies. If company
         | data fits in an Excel spreadsheet, Access database, or a single
         | MySQL/Postgres instance, then introducing all this new tech
         | with all of the associated costs and little return gain is a
         | net loss.
        
         | vmsp wrote:
         | I know Excel runs the world but are there really that many
         | people using Access?
        
         | QuesnayJr wrote:
         | I actually think the effect of all of this hype will move
         | people past the Excel/Access era. There are simple analyses
         | every company could do with R or scikit-learn that would save
         | or make them money, and they just don't know how. Someone with
         | AI expertise is over-qualified to do this, but they are at
         | least qualified.
        
         | t0mas88 wrote:
         | Because most data isn't "big" data. If it fits in Excel on a
         | laptop, why bother to roll out a distributed data lake system
         | like Spark with all the associated ops work.
        
         | hztar wrote:
         | The ones I have been talking to stick to their Excel because
         | their little part of the puzzle can be solved by Excel. These
         | technologies usually demand data to be collected, but the
         | amount of incentive to do so is low in any classic balkanized
         | F500 organization.
        
       | bpodgursky wrote:
       | Hmm, I agree there's a market here, but I don't know why I
       | wouldn't just use Snowflake or Bigquery if what I really wanted
       | was a big-data RDBMS.
       | 
       | Everyone I know who uses Databricks (and they all like it) uses it
       | as hosted Spark with S3 integrations, or write... directly to
       | Snowflake. I'm a little skeptical they're going to get traction
       | as a true data lake model
        
         | georgewfraser wrote:
         | It's a lot simpler to use a single system as both your data
         | lake, and your data warehouse. As Databricks gets better and
         | better at the core data warehouse features, it becomes feasible
         | to use it for both. Meanwhile, Snowflake and BQ are coming from
         | the other direction, implementing data lake features. AWS
         | strategy seems to be, just make it easier to have 2 systems and
         | move data back and forth.
        
         | MikeDelta wrote:
         | Their Delta also has SSD caching, which turns out to be logic
         | that stores a local copy of the file you queried for faster re-
         | query. Going to call my lru cache function like that as well...
         | 
         | My company loves them, I think they only do a few things well,
         | of which marketing is the best, and are not worth the money for
         | data science teams with devops skills. Happy to hear from
         | others if I am wrong.
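
A minimal sketch of the idea as the comment describes it: keep a local copy of whatever was read so that a repeat query skips the slow fetch. This is an illustration with Python's `functools.lru_cache`, not Databricks' implementation. `fetch_remote`, `read_file`, and the in-memory `REMOTE` store are hypothetical stand-ins for object storage access.

```python
from functools import lru_cache

# Hypothetical stand-in for remote object storage.
REMOTE = {"s3://bucket/part-0.parquet": b"rows..."}
stats = {"remote_reads": 0}  # counts how often "remote" storage is hit


def fetch_remote(path):
    """Simulate a slow remote read and record that it happened."""
    stats["remote_reads"] += 1
    return REMOTE[path]


@lru_cache(maxsize=128)
def read_file(path):
    """Return file bytes, keeping up to 128 recently used files locally."""
    return fetch_remote(path)


read_file("s3://bucket/part-0.parquet")  # first read goes to remote storage
read_file("s3://bucket/part-0.parquet")  # repeat read is served from the cache
print(stats["remote_reads"], read_file.cache_info().hits)
```

A real disk cache also has to deal with invalidation when the underlying file changes, which this sketch ignores.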
        
           | pram wrote:
           | Depends. If you only have a couple workspaces, then no. It's
           | worth the money for a company with lots of data science
           | teams, and one team who is janitoring all the workspace
           | infra. Our company has 20 workspaces for 10 teams already and
           | it would be a nightmare if we expected everyone to fix and
           | manage their own AWS stuff.
        
         | snidane wrote:
         | Snowflake and Bigquery will bite you in the ass later on. You
         | can do 80% of the things you will need - which is great for
         | some newbie stuff or for sales presentations. Once you need
         | something complicated, you're on your own, while being stuck in
         | a proprietary environment that you cannot extend.
         | 
         | You will have to develop some kind of data lake to store
         | unstructured data anyway. You will end up with a Snowflake data
         | warehouse and a data lake. Why not just go with data lake first
         | then.
         | 
         | Databricks/Spark are just good platforms to help you do
         | something with structured data in your lake. With the recent
         | additions to its execution engine and Delta (strange naming
         | tbh) it will be pretty much the same as Snowflake for you.
        
           | mrbungie wrote:
           | BigQuery/Snowflake can process Parquet and multiple other
           | formats in Object Storage. You can use them more "freely" if
           | you keep your raw data in open formats.
           | 
           | You need something more complicated than what can be done
           | using BigQuery/Snowflake (that remaining 20%, though I would
           | say 10%)? Export the dataset to CSV/Parquet/Avro/ORC/whatever
           | and process it with anything, including
           | Dataproc/HDInsight/EMR or even Databricks. That's actually a
           | common pattern.
        
         | willvarfar wrote:
         | The reason snowflake has such a high market cap is because its
         | customers aren't paying much now, but they'll be paying and
         | paying monthly forever. It's lock-in on a massive scale.
         | 
         | Delta lake is something you can run on data you feel you still
         | have some semblance of control over.
        
       | ACow_Adonis wrote:
       | data lake, delta lake, lake house. snowflake, snowpark. cloud,
       | data warehouse. I just want to take a second to thank these
       | companies for trying to turn my profession (data science +
       | analytics) into the living hell of your everyday enterprise and
       | tech culture cluster-fuck (pun not intended).
       | 
       | I'm still coming to terms with the fact that there's actually a
       | technology called Kafka...
       | 
       | /sorry, I'm just particularly bitter after having to do some
       | databricks training yesterday... it was basically 80% trying to
       | rote the sales literature about why they're so great/enterprisy
       | and parroting company and platform specific jargon. I don't like
       | seeing my profession turn into an obsession with tech and
       | platforms when 99.9% of companies and people can't reason
       | properly or operate their current tools/resources efficiently.
       | obviously just my own opinion.
        
         | arafa wrote:
         | The training is quite heavy on the sales pitch, I agree (and
         | expensive). There were some useful bits if you dig around
         | though.
        
         | MikeDelta wrote:
         | I read a wonderful comment in another thread the other day. It
         | was about the obsession of devs to collect a whole range of
         | technologies on their resume. It went something like: "When I
         | hire a carpenter I won't hire him for what he has in his
         | toolbox. I want to know what he can do with only a hammer and
         | chisel (= Linux machine). The rest he can learn."
         | 
         | I will try to look up the link and obviously 'he' can be 'she'
         | as well.
        
       | willvarfar wrote:
       | The key gap as I see it is that databricks doesn't support
       | multi-table transactions. And that is why you can't treat it as
       | though it is a drop-in functional replacement for an rdbms.
       | 
       | (Why no vector clocks in the manifest files or something?)
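
For readers unfamiliar with the term, a multi-table transaction makes changes to several tables commit or roll back as one unit. Below is a minimal sketch of that guarantee, using Python's sqlite3 module as a conventional RDBMS stand-in. The `orders` and `inventory` tables and the `place_order` helper are hypothetical, and the sketch says nothing about how Databricks or Delta could implement the feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0));
    CREATE TABLE orders (item TEXT, qty INTEGER);
    INSERT INTO inventory VALUES ('widget', 1);
""")


def place_order(item, qty):
    """Insert the order and decrement stock atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back both writes on error
            conn.execute("INSERT INTO orders VALUES (?, ?)", (item, qty))
            conn.execute(
                "UPDATE inventory SET qty = qty - ? WHERE item = ?", (qty, item)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_order("widget", 1)      # succeeds: stock goes from 1 to 0
failed = place_order("widget", 1)  # CHECK fails, so the order row is undone too
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, failed, order_count)
```

Without this guarantee, the second call could leave an order row behind with no matching stock change, and the application would have to detect and repair that itself.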
        
         | georgewfraser wrote:
         | Interestingly, BigQuery is also missing multi-table
         | transactions. There are ways to live without this feature, but
         | I agree it's a gap.
        
         | rymurr wrote:
         | Check out https://projectnessie.org it adds multi-table
         | transactions to Databricks Delta.
         | 
         | disclaimer: an author of Nessie
        
       | drej wrote:
       | I have seen several deployments of Databricks (including first
       | hand experience) and... most use cases could be better served by
       | Postgres (or Redshift, Athena, Snowflake for larger scale). It
       | has honestly been such an overkill for so many workloads, it was
       | quite astonishing. I've seen people move from Excel to Spark...
       | to handle the same volume of data. That's obviously not
       | Databricks' fault, but their PR is pretty much "please do all
       | your data work in our product, it's well suited for it".
       | 
       | Yes, it's very good if you don't like setting up clusters (few
       | do/can) and the UI is rather useful for getting up and running
       | (not so much for writing code though). But you need to really
       | understand the platform before adopting it. Please, please don't
       | just adopt it because it's popular.
        
       ___________________________________________________________________
       (page generated 2021-02-01 23:00 UTC)