[HN Gopher] Launch HN: Data Mechanics (YC S19) - The Simplest Wa...
       ___________________________________________________________________
        
       Launch HN: Data Mechanics (YC S19) - The Simplest Way to Run Apache
       Spark
        
       Hi HN,

       We're JY & Julien, co-founders of Data Mechanics
       (https://www.datamechanics.co), a big data platform striving to
       offer the simplest way to run Apache Spark.

       Apache Spark is an open-source distributed computing engine and
       one of the most widely used technologies in big data. First,
       because it's fast (10-100x faster than Hadoop MapReduce).
       Second, because it offers simple, high-level APIs in Scala,
       Python, SQL, and R. In a few lines of code, data scientists and
       engineers can explore data, train machine learning models, and
       build batch or streaming pipelines over very large datasets
       (from tens of gigabytes to petabytes).
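       To make that concrete, here is a minimal PySpark sketch of the
       kind of job we mean - the paths and column names are made up
       for illustration:

         # Minimal PySpark sketch: read, aggregate, and write a large dataset.
         # The paths and column names below are illustrative only.
         from pyspark.sql import SparkSession, functions as F

         spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

         events = spark.read.parquet("s3a://my-bucket/events/2020-05-11/")
         revenue = (events
                    .filter(F.col("event_type") == "purchase")
                    .groupBy("country")
                    .agg(F.sum("amount").alias("revenue")))
         revenue.write.mode("overwrite").parquet("s3a://my-bucket/reports/revenue/")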
       While writing Spark applications is pretty easy, managing their
       infrastructure, deploying them, and keeping them performant and
       stable in production over time is hard. You need to learn how
       Apache Spark works under the hood, become an expert with YARN
       and the JVM, manually choose dozens of infrastructure
       parameters and Spark configurations, and go through painfully
       slow iteration cycles to develop, debug, and productionize your
       app.

       As you can tell, before starting Data Mechanics, we were
       frustrated Spark developers. Julien was a data scientist and
       data engineer at BlaBlaCar and ContentSquare. JY was the Spark
       infrastructure team lead at Databricks, the data science
       platform founded by the creators of Spark. We've designed Data
       Mechanics so that our peer data scientists and engineers can
       focus on their core mission - building models and pipelines -
       while the platform handles the mechanical DevOps work.

       To realize this goal, we needed a way to tune infrastructure
       parameters and Spark configurations automatically. There are
       dozens of such parameters, but the most critical ones are the
       amount of memory and CPU allocated to each node, Spark's degree
       of parallelism, and the way Spark handles all-to-all data
       transfer stages (called shuffles). It takes a lot of expertise
       and trial-and-error to tune those parameters by hand.
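       As a rough illustration (these are standard Spark settings, not
       our platform's API), here are the kinds of knobs involved - the
       values below are arbitrary, and picking them well for a given
       app is the hard part:

         # The most critical settings we tune, with arbitrary example values.
         from pyspark.sql import SparkSession

         spark = (SparkSession.builder
                  .appName("etl-pipeline")
                  .config("spark.executor.memory", "8g")         # memory per executor
                  .config("spark.executor.cores", "4")           # CPUs per executor
                  .config("spark.executor.instances", "10")      # number of executors
                  .config("spark.default.parallelism", "80")     # parallelism (RDD API)
                  .config("spark.sql.shuffle.partitions", "80")  # parallelism of shuffles
                  .getOrCreate())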
       To do it automatically, we first run the logs and metadata
       produced by Spark through a set of heuristics that determine
       whether the application is stable and performant. A Bayesian
       optimization algorithm then uses this analysis, along with data
       from recent runs, to choose the parameters for the next run.
       It's not perfect - it needs a few iterations, like an engineer
       would. But the impact is huge, because this happens
       automatically for every application running on the platform
       (which would be too time-consuming for an engineer). Take the
       example of an application gradually becoming unstable as its
       input data grows over time. Without us, the application crashes
       on a random day, and an engineer must spend a day remediating
       the impact of the outage and debugging the app. Our platform
       can often anticipate and avoid the outage altogether.
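       For intuition only, here is a minimal sketch of how such a
       per-application tuning loop could be structured, using
       scikit-optimize's ask/tell interface. It is not our actual
       implementation: the search space is truncated, and
       run_spark_job() is a stand-in for launching a real run and
       measuring its outcome.

         # Sketch of a Bayesian-optimization tuning loop (illustration only).
         from skopt import Optimizer
         from skopt.space import Integer

         search_space = [
             Integer(2, 16, name="executor_memory_gb"),
             Integer(1, 8, name="executor_cores"),
             Integer(20, 400, name="shuffle_partitions"),
         ]

         def run_spark_job(params):
             # Stand-in: in reality this would launch a Spark run with the
             # given configuration and return its cost (e.g. duration in
             # seconds, or a large penalty if the run failed).
             mem, cores, partitions = params
             return abs(partitions - 200) / 10 + 600 / (mem * cores)

         opt = Optimizer(search_space)
         for _ in range(10):
             params = opt.ask()            # suggest a configuration to try
             cost = run_spark_job(params)  # observe how the run went
             opt.tell(params, cost)        # feed the result back to the model

         print("Best configuration found so far:", opt.get_result().x)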
       The other way we differentiate is by integrating with the
       popular tools of the data stack. Enterprise data science
       platforms tend to require their users to abandon their tools
       and adopt an end-to-end suite of proprietary solutions: their
       hosted notebooks, their scheduler, their way of packaging
       dependencies and version-controlling code. Instead, our users
       can connect their Jupyter notebook, their Airflow scheduler,
       and their favourite IDE directly to the platform. This enables
       a seamless transition from local development to running at
       scale on the platform.
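       As an example of the kind of workflow this enables, here is a
       generic Airflow DAG submitting a nightly Spark job with the
       open-source SparkSubmitOperator. This is not our specific
       integration - the DAG id, script path, and connection are
       hypothetical:

         # Generic Airflow DAG scheduling a nightly Spark job (illustration only).
         from datetime import datetime
         from airflow import DAG
         from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

         dag = DAG("nightly_revenue_agg",
                   start_date=datetime(2020, 5, 1),
                   schedule_interval="0 2 * * *",  # every night at 02:00 UTC
                   catchup=False)

         aggregate = SparkSubmitOperator(
             task_id="aggregate_revenue",
             application="/jobs/aggregate_revenue.py",  # the PySpark script to run
             conn_id="spark_default",                   # points at the Spark cluster
             conf={"spark.executor.memory": "8g"},
             dag=dag,
         )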
       We also deploy Spark directly on Kubernetes, which wasn't
       possible until recently (Spark 2.3) - most Spark platforms run
       on YARN instead. This means our users can package their code
       dependencies in a Docker image and use a lot of k8s-compatible
       projects for free (for example around secrets management and
       monitoring). Kubernetes does have its inherent complexity. We
       hide it from our users by deploying Data Mechanics in their
       cloud account, on a Kubernetes cluster that we manage for them.
       Our users can simply interact with our web UI and our API/CLI -
       they don't need to poke around Kubernetes unless they really
       want to.
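       For reference, this is roughly what running against Kubernetes
       looks like with plain open-source Spark (our platform abstracts
       this away - the API server URL, namespace, and image name below
       are placeholders):

         # Plain open-source Spark session targeting a Kubernetes cluster
         # (client mode). URL, namespace, and image below are placeholders.
         from pyspark.sql import SparkSession

         spark = (SparkSession.builder
                  .master("k8s://https://my-cluster.example.com:6443")
                  .appName("etl-on-k8s")
                  .config("spark.kubernetes.namespace", "spark-jobs")
                  .config("spark.kubernetes.container.image", "myrepo/spark-py:2.4.5")
                  .config("spark.executor.instances", "5")
                  .getOrCreate())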
       The platform is available on AWS, GCP, and Azure. Many of our
       customers use us for their ETL pipelines; they appreciate the
       ease of use of the platform and the performance boost from
       automated tuning. We've also helped companies start their first
       Spark project: a startup is using us to parallelize chemistry
       computations and accelerate the discovery of drugs. This is our
       ultimate goal - to make distributed data processing accessible
       to all.

       Of course, we share this mission with many companies out there,
       but we hope you'll find our angle interesting! We're excited to
       share our story with the HN community today, and we look
       forward to hearing about your experience in the data
       engineering and data science spaces. Have you used Spark, and
       did you feel the frustrations we talked about? If you're
       considering Spark for your next project, does our platform look
       appealing? We don't offer self-service deployment yet, but you
       can schedule a demo with us from the website and we'll be happy
       to give you free trial access in exchange for your feedback.

       Thank you!
        
       Author : jstephan
       Score  : 90 points
       Date   : 2020-05-11 14:58 UTC (8 hours ago)
        
       | apoverton wrote:
       | I've thought about solving this problem with an ML approach like
       | you all are taking but as you say never had the bandwidth because
       | I was focusing on my "core missions". I'm no longer a heavy spark
       | user but am very happy to see you all working on this!
       | 
       | It always seemed so inefficient to me to spend all this time hand
       | tuning jobs only to have the data change and need to do the same
       | thing again.
       | 
       | Good luck!
        
         | jstephan wrote:
          | Thanks for the wishes! Indeed, it's rarely worth it to build
          | an automated tuning tool:
          | 
          |  - unless you operate at a massive scale (e.g. the Dr Elephant
          |    + TuneIn projects, originally developed at LinkedIn),
          |  - or you operate a big data platform yourself.
         | 
         | If you're curious about our ML approach, we gave a tech talk
         | about it at last year's Spark Summit:
         | https://databricks.com/session_eu19/how-to-automate-performa...
        
       | soumyadeb wrote:
       | >Many of our customers use us for their ETL pipelines, they
       | appreciate the ease of use of the platform and the performance
       | boost from automated tuning.
       | 
       | This is quite interesting. Founder of RudderStack here (we are a
       | CDI or simply an open-source Segment equivalent). I have seen a
       | similar pain point across some of our customers. They use
       | RudderStack to get data into S3 (or equivalent) and then run some
       | kind of post-processing Spark jobs for analytics/machine-learning
       | use cases. Managing two setups (RudderStack on Kubernetes +
       | Spark) is a pain.
       | 
        | A single managed solution with Spark on Kubernetes makes so much
       | sense. Would love to figure out how to integrate with you guys.
        
         | jstephan wrote:
         | Congrats for RudderStack, what you're saying makes a lot of
         | sense. Reaching out to you directly to follow up on a potential
         | integration!
        
           | soumyadeb wrote:
           | Thanks a lot. Will follow up with you.
        
       | blancothewhite wrote:
        | Very interesting topics in good hands!
        
       | sg47 wrote:
       | I saw that dynamic allocation is enabled by default. I thought
       | dynamic allocation does not work well on k8s if the executors
       | need to be kept around for serving shuffle files. How does it
       | work in your case?
        
         | jstephan wrote:
          | Thanks, great question!
         | 
         | Dynamic allocation is only enabled on our Spark 3.0 image (from
         | the 3.0-preview branch, since the official 3.0 isn't released
         | yet). It works by tracking which executors are storing active
         | shuffle files. These executors will not be removed when
         | downscaling. More info here:
         | https://issues.apache.org/jira/browse/SPARK-27963
         | 
         | It's not perfect, but there are more improvements for dynamic
         | allocation being worked on (remote shuffle service for
         | Kubernetes).
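          | 
          | For anyone who wants to try it on a vanilla Spark 3.0-preview
          | build, the relevant open-source settings look roughly like
          | this (an illustration of the standard configs, not a Data
          | Mechanics-specific API):
          | 
          |   from pyspark.sql import SparkSession
          | 
          |   spark = (
          |       SparkSession.builder
          |       .config("spark.dynamicAllocation.enabled", "true")
          |       .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
          |       .config("spark.dynamicAllocation.minExecutors", "1")
          |       .config("spark.dynamicAllocation.maxExecutors", "20")
          |       .getOrCreate()
          |   )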
        
       | ev0xmusic wrote:
       | Congrats guys, what you are doing is awesome :)
        
       | knes wrote:
       | Awesome! Making Spark more approachable is good news for the wave
       | of new data engineers.
       | 
        | Do you have any recorded demo you can share where we can see how
        | a user would set up and integrate with the other tools? That
        | would be neat.
        
         | jstephan wrote:
         | Thanks for the feedback! We're preparing a demo for the
         | upcoming Spark Summit next month... Stay tuned :)
         | 
         | In the meantime you can book a time with one of our data
         | engineers through the website to get a live demo:
         | https://www.datamechanics.co
        
       | ojnabieoot wrote:
       | Speaking as someone who might be in your target audience: my
       | experience with Databricks (back in 2017/2018, without
       | Kubernetes) is that their product is just as unreliable and
       | frustrating as deploying a Spark cluster manually, but also more
       | expensive and more time-consuming. It was so bad that I was
       | wondering if the entire company was a scam - which isn't true, of
       | course. I suspect a big part of our problem was a shuffle-heavy
       | workload hitting a relatively new product. But it left a really
       | bad taste in my mouth about the entire business model of "Spark
       | as a Service."
       | 
       | My impulse reaction to your sales pitch is "their product
       | probably doesn't work very well and is way too expensive." I know
       | that's unfair, but this entire idea of "our platform automates
       | away the tedium of Spark clusters" just strikes me as a bag of
       | magic beans.
       | 
       | What would help a lot with drawing cynical, bitter people like
        | me: _case studies_ on your website. I know that's a lot to ask
       | for a young startup. But actual details about either money or
       | developer time saved with Data Mechanics - specific pains your
       | customers were having and how Data Mechanics addressed them, or
       | specific analyses your customers were able to do now that they're
       | spending less time managing Spark. Running a big Spark job in the
       | cloud is a huge financial risk, and many Spark users are much
       | more concerned about this than the headaches involved with
       | management - and again, my last experience with Databricks
       | resulted in more cost and more headaches. I do not think I am
       | alone here.
       | 
       | I am wondering if you're considering selling your Spark
       | telemetry/parameter tuning/etc software, or offering it as a
       | service, etc. Speaking personally, I would be much more open to
       | using Data Mechanics's tools on my own Spark cluster rather than
       | outsource the actual management. At my organization, in addition
       | to AWS, we also have a local Hadoop cluster with Spark installed;
       | commercial software that gives better insight into its
       | performance could be very useful.
        
         | jstephan wrote:
          | Thanks for the detailed feedback. Spark can sometimes be
          | frustrating. Automated tuning has a major impact, but it is no
          | silver bullet - sometimes a stability/performance problem lies
          | in the code or the input data (partitioning).
          | 
          | That's why we're working on a new monitoring solution (think
          | Spark UI + node metrics) to give Spark developers the
          | much-needed high-level feedback on the stability and
          | performance of their apps. We'd like to make this work on top
          | of other data platforms (at least the monitoring part; the
          | automated tuning would be much harder).
         | 
         | Case studies: Thanks, we're working on them. Check our Spark
         | Summit 2019 talk (How to automate performance tuning for Apache
         | Spark) for the analysis of the impact at one of our customers.
        
         | _so_why_not_ wrote:
          | Over the last year there have been a significant number of
          | low-level changes in the proprietary versions of Spark (aka
          | EMR and Databricks) designed to address reliability and
          | stability. Out of curiosity, what exceptions did you run into?
        
         | glapark wrote:
         | Shuffling in Spark works well for small datasets, but is not
         | reliable for large datasets because fault tolerance in Spark is
         | incomplete. For example, check this Jira:
         | 
         | https://issues.apache.org/jira/browse/SPARK-20178
         | 
         | So, if your problem was mainly due to shuffle-heavy workload,
         | then I guess no managed Spark service would be able to
         | alleviate/eliminate it by automatic parameter tuning. In other
         | words, your pain might be due to a fundamental problem in Spark
         | itself.
         | 
          | IMO, Spark is great, but its speed is no longer its key
          | strength. For example, Hive is much faster than Spark SQL
          | these days.
        
       | izyda wrote:
        | What do you see as your key differentiator from Databricks?
        | What's the key pain point they didn't or couldn't solve that you
        | are solving?
        
         | jstephan wrote:
         | (Former Databricks software engineer speaking) The pain point
         | they didn't solve (well enough) is Spark cluster management and
         | configuration. From our experience and user interviews, it's
         | the critical pain point that still slows down Spark adoption.
         | Through our automated tuning feature, we're going further than
         | them to provide a _serverless experience_ to our users.
         | 
         | This being said, Databricks is a great end-to-end data science
         | platform, with notable features we lack like collaborative
         | hosted notebooks. A lot of people don't want/need the full
         | proprietary feature set of Databricks though. They choose to
         | build on EMR, Dataproc, and other platforms instead. We hope
         | they'll try Data Mechanics now :)
        
           | __vb__ wrote:
            | Databricks has other optimizations on top of the open source
            | Spark version. Are you maintaining your own version of Spark
            | or using the vanilla version?
            | 
            | One thing I constantly deal with is how to optimize Spark -
            | how to use Ganglia and the Spark UI to dig into what is
            | causing data skew and slowness while running jobs. Is this
            | something that you do better than Databricks?
        
             | jstephan wrote:
             | Spark versions: Only vanilla (open source) Spark. But we
             | offer a list of pre-packaged Docker images with useful
             | libraries (e.g. for ML or for efficient data access) for
             | each major Spark version. You can use them directly or
              | build your own Docker image on top of them.
             | 
             | Optimization/Monitoring: This topic is very important to
             | us, thanks for bringing it up. Indeed we automatically tune
             | configurations, but developers still need to understand the
             | performance of their app to write better code. We're
             | working on a Spark UI + Ganglia improvement (well,
             | replacement really), which we could potentially open
             | source.
             | 
             | Would you mind emailing me (jy@datamechanics.co) or even
             | scheduling a call with me
              | (https://calendly.com/b/datamechanics/avk7bhxq) so I can
              | show you what we have in mind and get your feedback? Anyone
              | else interested is welcome to do the same.
        
       | flowerlad wrote:
       | Running Spark on a Kubernetes cluster is already pretty easy, so
       | it is unclear what value this is adding. Controlling cost is the
       | hard part. You may only need a cluster for 1 hour per day for a
       | nightly aggregation job. Kubernetes clusters are not easy to
        | provision and de-provision, so you end up paying for a cluster
        | 24 hours a day while using it for only 1 hour. If someone comes
        | up with a way to pay for pre-provisioned Kubernetes clusters
        | only for the duration you actually use them, that would be
        | interesting.
        
         | jstephan wrote:
          | Thanks for the feedback! It's _possible_ to run Spark on
          | Kubernetes using just open-source tools - in fact our platform
          | builds upon and contributes to many of these tools. But it's
          | not easy _enough_, in our humble opinion: you need to build a
          | decent level of expertise on Spark and k8s just to get
          | started, and even more to keep it
          | operational/stable/cost-efficient/secure in the long term.
         | 
          | Regarding costs: by autoscaling the cluster size and
          | minimising our service footprint, the fixed cost for using our
          | platform is around $100/month, which is negligible compared to
          | the cost of most big data projects. We have some ideas on how
          | to drive this fixed cost to zero and offer a free hosted
          | version of our platform too. It's on the roadmap!
        
         | quadrature wrote:
          | The problem being solved here is resource tuning, which is a
          | problem you will eventually encounter as your data org grows
          | big. Specifically, in our case the authors of our Spark jobs
          | understand the data modelling well but might not know how to
          | tweak the Spark parameters to optimize execution. As mentioned
          | in the post, even if you do know what you're doing, the
          | process is long and time-consuming. So I definitely see the
          | value add here.
          | 
          | If you need ephemeral Spark clusters, Dataproc in GCP will
          | give that to you; there's probably a similar service in AWS
          | and Azure.
        
           | waffletower wrote:
            | AWS EMR is a fairly straightforward and reasonably
            | cost-effective way to manage ephemeral Spark clusters on
            | Amazon Web Services.
        
       ___________________________________________________________________
       (page generated 2020-05-11 23:00 UTC)