[HN Gopher] We Need DevOps for ML Data
       ___________________________________________________________________
        
       We Need DevOps for ML Data
        
       Author : amargvela
       Score  : 78 points
       Date   : 2020-04-28 20:09 UTC (2 hours ago)
        
 (HTM) web link (tecton.ai)
 (TXT) w3m dump (tecton.ai)
        
       | ska wrote:
       | "We need fewer data scientists and more data janitors" - anon
        
       | remmargorp64 wrote:
       | I was the main data science engineer at one of my previous
        | companies. We used tools like Airflow for running Python
        | scripts to import data, clean/transform it, train models, and
        | even test various models against datasets. We also used Azure
        | for similar things.
       | 
       | It's easy to do "dev ops" for machine learning. Basically, just
       | automate everything and implement gatekeeping mechanisms along
       | with active monitoring.
       | 
       | It's true, though. I had to cobble together a lot of custom
       | things at the time, but it wasn't that hard to do.
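        | 
        | A minimal sketch of the kind of gatekeeping step I mean --
        | promote a candidate model only if it clears the current
        | production metric (the threshold, metric, and data here are
        | all hypothetical):
        | 
        |     # Hypothetical quality gate for a CI step: block
        |     # deployment unless the candidate beats production.
        |     from sklearn.datasets import make_classification
        |     from sklearn.linear_model import LogisticRegression
        |     from sklearn.metrics import roc_auc_score
        |     from sklearn.model_selection import train_test_split
        | 
        |     PRODUCTION_AUC = 0.80  # assumed live-model metric
        | 
        |     X, y = make_classification(n_samples=2000, random_state=0)
        |     X_tr, X_val, y_tr, y_val = train_test_split(
        |         X, y, test_size=0.25, random_state=0)
        | 
        |     model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        |     auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        | 
        |     if auc >= PRODUCTION_AUC:
        |         print(f"gate passed (AUC={auc:.3f}); promote")
        |     else:
        |         raise SystemExit(f"gate failed (AUC={auc:.3f})")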
        
         | nik_s wrote:
          | I'm the CTO at a data science company, and this has been my
          | experience too. I've been lucky enough to have quite a few
          | engineers go from zero practical experience to being able to
          | train and deploy complex ML solutions, and the most
          | successful solutions have always involved a combination of
          | just a handful of tools:
          | 
          | - Airflow and/or Celery for running data extraction and
          | transformation jobs
          | 
          | - pandas and NumPy for data wrangling
          | 
          | - scikit-learn, XGBoost, LightGBM, PyTorch, or TensorFlow
          | for training/inference
          | 
          | - Flask or Django to serve results
          | 
          | It's a handful of technologies, but they're (generally)
          | mature, battle-tested, and well documented.
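          | 
          | A minimal sketch of how those pieces chain together,
          | assuming Airflow 1.10-style imports (the callables are
          | stand-ins for your own extract/transform/train jobs):
          | 
          |     from datetime import datetime
          |     from airflow import DAG
          |     from airflow.operators.python_operator import \
          |         PythonOperator
          | 
          |     def extract():
          |         print("pull raw data")        # stand-in
          | 
          |     def transform():
          |         print("clean + featurize")    # pandas/numpy here
          | 
          |     def train():
          |         print("fit + persist model")  # sklearn etc. here
          | 
          |     dag = DAG("ml_pipeline", schedule_interval="@daily",
          |               start_date=datetime(2020, 4, 1),
          |               catchup=False)
          | 
          |     t1 = PythonOperator(task_id="extract",
          |                         python_callable=extract, dag=dag)
          |     t2 = PythonOperator(task_id="transform",
          |                         python_callable=transform, dag=dag)
          |     t3 = PythonOperator(task_id="train",
          |                         python_callable=train, dag=dag)
          | 
          |     t1 >> t2 >> t3  # run in order; retries via Airflow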
        
       | Tehchops wrote:
       | I'm reminded of: https://blog.acolyer.org/2019/06/03/ease-ml-ci/
        
       | gas9S9zw3P9c wrote:
        | Wow, I've probably seen 10 of these kinds of companies over
        | the past few months. Personally I believe (and hope) the
        | winners in this space are going to be modular open-source
        | companies/products as opposed to "all-in-one enterprise
        | solutions".
        
         | jdoliner wrote:
         | Pachyderm is probably one of the companies you've seen in this
         | space. Full disclosure: I'm the founder, but I feel that we've
         | stayed pretty true to the idea of being a modular open-source
         | tool. We have customers who just use our filesystem, and
         | customers who just use our pipeline system, and of course many
          | more who use both. We've also integrated best-in-class
          | open-source projects; for example, Kubeflow's TFJob is now
          | the standard way of doing TensorFlow training on Pachyderm,
          | and we're working on integrating Seldon as the serving
          | component.
         | We find this architecture a lot more appealing than an all-in-
         | one web interface that you load your data into.
        
           | gas9S9zw3P9c wrote:
            | I haven't used your product yet, but IMO this is the way
            | it should be done. Once I get around to cleaning up my
            | current custom k8s pipelines I'll give it a spin :)
        
         | fizixer wrote:
          | I wonder what the business model is for teams/startups
          | offering open-source solutions that they developed in-house.
        
         | _mdb wrote:
         | CEO of Tecton here, and happy to give more context. Tecton is
         | specifically focused on solving a few key data problems to make
          | it easier to deploy and manage ML in production, e.g.:
         | 
         | - How can I deliver these features to my model in production?
         | 
          | - How do I make sure the data I'm serving to my model is
          | similar to what it was trained on?
         | 
          | - How can I construct my training data with point-in-time
          | accuracy for every example? (see the sketch below)
         | 
         | - How can I reuse features that another DS on my team built?
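          | 
          | As a minimal sketch of what point-in-time accuracy means --
          | in plain pandas with toy data, not Tecton's API -- each
          | label row picks up the latest feature value known at or
          | before its own timestamp, so no future information leaks
          | into training:
          | 
          |     import pandas as pd
          | 
          |     labels = pd.DataFrame({
          |         "user_id": [1, 1, 2],
          |         "event_time": pd.to_datetime(
          |             ["2020-04-01", "2020-04-10", "2020-04-05"]),
          |         "label": [0, 1, 0]})
          | 
          |     features = pd.DataFrame({
          |         "user_id": [1, 1, 2],
          |         "feature_time": pd.to_datetime(
          |             ["2020-03-28", "2020-04-08", "2020-04-01"]),
          |         "purchases_7d": [3, 5, 1]})
          | 
          |     # backward as-of join = point-in-time correctness
          |     train = pd.merge_asof(
          |         labels.sort_values("event_time"),
          |         features.sort_values("feature_time"),
          |         left_on="event_time", right_on="feature_time",
          |         by="user_id", direction="backward")
          |     print(train)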
         | 
         | We've found that there's a ton of complexity getting data right
         | for real-time production use cases. These problems can be
         | solved, but require a lot of care and are hard to get right.
         | We're building production-ready feature infrastructure and
         | managed workflows that "just work" for teams that can't or
         | don't want to dedicate large engineering teams to these
         | problems.
         | 
          | At the core of Tecton are a managed feature store, feature
         | pipeline automation, and a feature server. We're building the
         | platform to integrate with existing tools in the ML ecosystem.
         | 
         | We're going to share more about the platform in the next few
         | months. Happy to answer any questions. I'd also love to hear
         | what challenges folks on this thread have encountered when
         | putting ML into production.
        
           | bogomipz wrote:
            | The application forms for the open positions listed on
            | your careers page appear to be broken. There is no field
            | to upload or attach a CV when applying to any of the
            | roles. Also, why would a LinkedIn
           | Profile be mandatory in order to apply for a role? There are
           | many qualified people who have simply chosen not to be a part
           | of that social network.
        
             | _mdb wrote:
             | Ah. We're on it. LinkedIn shouldn't be required. Thanks for
             | flagging.
        
         | yanovskishai wrote:
          | Could you please mention the other solutions you've seen in
          | this space?
        
           | simonw wrote:
           | https://angel.co/companies?keywords=machine+learning+models+.
           | .. lists a whole bunch of them.
        
           | verdverm wrote:
            | https://dolthub.com is the cool kid right now. There are
            | also Pachyderm, Git LFS, and IPFS.
            | 
            | Really, what we need is version control for data; it's
            | not just an ML data problem. It's a little different,
            | though, because you would like to move computation to the
            | data rather than the other way around.
        
             | wenc wrote:
              | The utility of version controlling production-sized
              | data (as opposed to code, or sample training data) is
              | something I'm having trouble grasping, unless I'm
              | missing something here -- and I may be, so please
              | enlighten me.
              | 
              | It seems to me that to be able to time-travel in data,
              | you almost need to store the write-ahead log of
              | database transactions and be able to replay it.
              | Debezium captures the CDC information, but it's an
              | infrastructure-level tool rather than a version
              | control tool.
             | 
              | In data science, most time-travel issues are worked
              | around using bitemporal data modeling, which is a fancy
              | way of saying "add a separate timestamp column to the
              | table recording when each row was written". Then you
              | can roll things back to any ETL point in a performant
              | fashion. This is particularly useful for debugging
              | recursive algorithms that get retrained every day.
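              | 
              | A toy illustration of that rollback in pandas rather
              | than SQL (column names are hypothetical):
              | 
              |     import pandas as pd
              | 
              |     rows = pd.DataFrame({
              |         "key": ["a", "a", "b"],
              |         "value": [1, 2, 9],
              |         "recorded_at": pd.to_datetime(
              |             ["2020-04-01", "2020-04-03",
              |              "2020-04-02"])})
              | 
              |     # replay the table as of any past ETL run:
              |     # filter on the write timestamp, keep the
              |     # latest row per key
              |     as_of = pd.Timestamp("2020-04-02")
              |     snap = (rows[rows.recorded_at <= as_of]
              |             .sort_values("recorded_at")
              |             .groupby("key").last())
              |     print(snap)  # 'a' -> 1, 'b' -> 9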
             | 
              | But these are infrastructure-level approaches. I'm not
              | sure that it's a problem for a version control tool.
        
               | timsehn wrote:
               | Tim , CEO of Liquidata, the company that built Dolt and
               | DoltHub here. This is how we store the version controlled
               | rows so that we get structural sharing across versions
               | (ie. 50M + one row chgange becomes 50M+1 entries in the
               | database not 100M with no need to replay logs):
               | 
               | https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-
               | tabl...
        
               | wenc wrote:
               | Thanks, that looks like an interesting approach. I may
               | have missed this in the article, but let's say I have a
               | SQL database with 600m records, and an ETL process does
               | massive upserts (20m records) every day, with many
               | UPDATEs on 1-2 fields.
               | 
               | Wouldn't discovering what those changes are still entail
               | heavy database queries? Unless Dolt has a hook into most
               | SQL databases' internal data structures? Or WALs?
        
               | zachmu wrote:
               | One of the cool things about Dolt is that you can query
               | the diff between two commits. This functionality is
               | available through special system tables. You specify two
               | commits in the WHERE clause, and the query only returns
               | the rows that changed between the commits. The syntax
               | looks like:
               | 
                |     SELECT * FROM dolt_diff_$table
                |     WHERE from_commit = '230sadfo98'
                |       AND to_commit = 'sadf9807sdf'
        
               | sgt101 wrote:
               | I worry about retraining every day. Isn't that a flag
               | that says "It hasn't learned a thing and actually I'm
               | just improving my backfitting score"?
        
               | wenc wrote:
               | Not really -- in many forecasting applications in fast-
               | changing markets, it is fairly common to dynamically
               | retrain your recursive model to a moving window of
               | historical data in order to adapt to your current
               | environment (with some regularization). The length of the
               | window depends on how fast the market changes.
               | 
               | For these types of recursive model applications, you
               | cannot just fit the model once and forget about it.
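                | 
                | A toy sketch of that moving-window retraining
                | loop (synthetic data; the window length is a
                | hypothetical tuning knob):
                | 
                |     import numpy as np
                |     from sklearn.linear_model import Ridge
                | 
                |     rng = np.random.default_rng(0)
                |     X = rng.normal(size=(500, 3))
                |     coef = np.array([1.0, -2.0, 0.5])
                |     y = X @ coef + rng.normal(size=500)
                | 
                |     window = 100  # match market speed
                |     for day in range(window, 500, 50):
                |         model = Ridge(alpha=1.0)  # regularized
                |         model.fit(X[day - window:day],
                |                   y[day - window:day])
                |         pred = model.predict(X[day:day + 1])
                |         print(f"day {day}: {pred[0]:.2f}")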
        
               | somurzakov wrote:
                | As long as it works well on out-of-sample data at
                | deployment time, it is okay.
                | 
                | Until some major data drift happens -- but you would
                | notice that anyway.
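                | 
                | A minimal sketch of one way to notice it, assuming
                | a two-sample KS test on a single feature (synthetic
                | data; the threshold is arbitrary):
                | 
                |     import numpy as np
                |     from scipy.stats import ks_2samp
                | 
                |     rng = np.random.default_rng(0)
                |     train_f = rng.normal(0.0, 1.0, 5000)
                |     live_f = rng.normal(0.3, 1.0, 5000)  # shifted
                | 
                |     stat, p = ks_2samp(train_f, live_f)
                |     if p < 0.01:
                |         print(f"drift (KS={stat:.3f}); retrain?")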
        
             | yanovskishai wrote:
              | Thanks! There are indeed many new players in the data
              | versioning space (DVC and Quilt are also probably worth
              | mentioning).
             | 
              | I totally agree that data management problems are not
              | just ML-related. But I personally think there are
              | additional challenges in the space beyond version
              | control for data -- the whole area of data quality
              | management and monitoring, for example. I liked the
              | analogy to DevOps: source version control was a super
              | critical problem to solve in software development, but
              | it didn't stop there, with things like CI/CD etc. I
              | believe we'll see a similar evolution in the data
              | space.
        
           | dttos wrote:
            | Composable https://composable.ai is another tool in this
            | space.
        
           | timsehn wrote:
           | Here's a list of companies/tools in the Git for Data space:
           | 
           | https://www.dolthub.com/blog/2020-03-06-so-you-want-git-
           | for-...
        
           | SirOibaf wrote:
           | https://logicalclocks.com with their ML + Feature Store open
           | source platform Hopsworks and their managed cloud version
           | https://hopsworks.ai
        
             | jamesblonde wrote:
              | Disclaimer: I am a co-founder of Logical Clocks. There
              | are loads of interesting technical challenges in this
              | "Feature Store" space. Here are just a few we address
              | in Hopsworks:
             | 
             | 1. To replicate models (needed for regulatory reasons), you
             | need to commit both data and code. If you have only a few
              | models, fine: just archive the training data. But if
              | you have lots of models (dev+prod) and lots of data,
              | you can't
             | use git-based approaches where you commit metadata and make
             | immutable copies of data. It scales (your data!) badly. We
             | are following the ACID datalake approach (Apache Hudi),
             | where you store diffs of your data and can issue queries
             | like "Give me training data for these features as it was on
             | this date".
             | 
             | 2. You want one feature pipeline to compute features (not
             | one for training and a different one when serving
             | features). Your feature store should scale to store TBs/PBs
             | of cached features to generate train/test data, but should
              | also return feature vectors at single-millisecond
              | latency for online
             | apps to make predictions. What DB has those
             | characteristics? We say none, and we adopt a dual-DB
             | approach with one DB for low-latency and one for scale-out
              | SQL. We use open-source NDB and Hive on our HopsFS
              | filesystem, where both DBs and the filesystem share the
              | same unified, scale-out metadata layer (a "rm -rf
              | feature_group" on the filesystem also automatically
              | cleans up the Hive and feature metadata).
             | 
             | 3. You want to be able to catalog/search for features using
             | free-text search and have good exploratory data analysis.
             | The systems challenge here is how to allow search on your
             | production DB with your features. Our solution is that we
             | provide a CDC API to our Feature Store, and automatically
             | sync extended metadata to Elastic with an eventually
             | consistent replication protocol. So when you 'rm -rf ..' on
             | your filesystem, even the extended metadata in Elastic is
             | automatically cleaned up.
             | 
             | 4. You need to support reuse of features in different
             | training datasets. Otherwise, what's the point? We do that
             | using Spark as a compute engine to join features from
             | tables containing normalized features.
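              | 
              | A rough sketch of the kind of join in point 4, in
              | generic PySpark rather than the Hopsworks API (toy
              | feature groups):
              | 
              |     from pyspark.sql import SparkSession
              | 
              |     spark = (SparkSession.builder
              |              .appName("feature-join")
              |              .getOrCreate())
              | 
              |     clicks = spark.createDataFrame(
              |         [(1, 12), (2, 3)],
              |         ["user_id", "clicks_7d"])
              |     spend = spark.createDataFrame(
              |         [(1, 40.0), (2, 9.5)],
              |         ["user_id", "spend_30d"])
              | 
              |     # one row per entity, features joined on key
              |     train = clicks.join(spend, "user_id", "inner")
              |     train.show()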
             | 
             | References:
             | 
              | * https://www.logicalclocks.com/blog/mlops-with-a-feature-
              | stor...
              | 
              | * https://ieeexplore.ieee.org/document/8752956 (CDC
              | HopsFS to Elastic)
              | 
              | * http://kth.diva-
              | portal.org/smash/get/diva2:1149002/FULLTEXT0... (Hive
              | on HopsFS)
        
           | chaoyu wrote:
           | I'm actually building a "modular open-source company/product"
           | in the MLOps space:
           | 
           | BentoML https://docs.bentoml.org/en/latest/
        
           | mmq wrote:
            | Polyaxon is an open source machine learning automation
            | platform. It lets you schedule notebooks, TensorBoards,
            | and container workloads for training ML and DL models. It
            | also has native integration with Kubeflow's operators for
            | distributed training.
           | 
           | https://github.com/polyaxon/polyaxon
        
         | minimaxir wrote:
         | Additionally, all of Google, Amazon, and Microsoft are pushing
         | _very_ heavily in the ML DevOps space. And if you are training
         | /deploying ML models at such a frequency that you _need_ to
         | utilize DevOps, chances are you are already using their
         | platforms for server compute.
        
       | iddan wrote:
       | This startup is trying to build the next GitHub for ML Data:
       | https://dagshub.com/
        
         | gunshai wrote:
         | This seems pretty cool.
        
       | smeeth wrote:
       | I really find it difficult to put into words just how little I
       | care to pay for a web ui so I can "manage" my data.
       | 
        | Data pipelines are a real problem though, and I'm very
        | interested in what startups do in this space.
        
         | factorialboy wrote:
         | > Data pipelines are a real problem though
         | 
            | Can you please elaborate? Thanks.
        
           | prions wrote:
            | It's not trivial to create and manage Data Pipelines if
            | you care about scale, serving a wide range of inputs and
            | outputs,
           | or making this data easy to surface and spread throughout
           | your org (i.e. making it actually useful to regular people).
           | 
           | "Static ETL" like running the same database load every day at
           | 1:00am isn't a super challenging problem. Doing it across
           | many tables with complex transformations and multiple steps
            | easily can be. You really have to consider reliability,
            | processing speed, failure modes, and other problems that
            | don't really arise until you hit a certain scale.
           | 
            | There's also the issue that what people want out of a
            | pipeline is changing. If you want people to be ""data
            | driven"", then that means they need easy access to
            | potentially all of your company's data on an ad hoc
            | basis. So now your boring 1am ETL pipeline isn't really
            | serving any of these new use cases.
           | 
            | How do you build flexible pipelines that can be spun up
            | for any dataset on an ad hoc basis? This is where tools
            | like Airflow or Prefect come in. Creating a platform that
            | can create these types of pipelines is a real problem.
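            | 
            | A minimal sketch of that kind of flexible pipeline,
            | assuming Airflow 1.10-style imports (the table list and
            | loader are stand-ins): one load task is generated per
            | dataset instead of hand-writing a job for each.
            | 
            |     from datetime import datetime
            |     from airflow import DAG
            |     from airflow.operators.python_operator import \
            |         PythonOperator
            | 
            |     TABLES = ["orders", "users", "events"]
            | 
            |     def load(table, **_):
            |         print(f"extract + transform {table}")
            | 
            |     # schedule_interval=None -> trigger on demand
            |     dag = DAG("adhoc_loads", schedule_interval=None,
            |               start_date=datetime(2020, 4, 1),
            |               catchup=False)
            | 
            |     for table in TABLES:
            |         PythonOperator(task_id=f"load_{table}",
            |                        python_callable=load,
            |                        op_kwargs={"table": table},
            |                        dag=dag)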
           | 
            | And before you even ask yourself _how_ to process this
            | data, you need to also ask _where_. If you want to do
            | what I outlined above - making your data more accessible
            | and easy to use - then you probably need to rework how
            | you're storing your data. But Data Lakes (and others) are
            | a whole topic in and of themselves.
        
         | dttos wrote:
         | Suggest you check out https://composable.ai for building out
         | robust data pipelines
        
         | seddonm1 wrote:
          | We have been thinking about these problems for a few years
          | now and have built Arc https://arc.tripl.ai (fully open
          | source), an abstraction layer on top of Apache Spark that
          | helps end users rapidly build and deploy data pipelines
          | without having to know about #dataops. Ultimately we
          | decided that giving users a decent interface
          | https://github.com/tripl-ai/arc-starter (based on Jupyter
          | notebooks) and encouraging a 'SQL first' approach means we
          | can give users flexibility but also have a standardised way
          | of deploying jobs with many of the devops attributes (like
          | logging and reliability). You can run Arc with a standard
          | docker run command, or on Kubernetes with Argo Workflows
          | https://argoproj.github.io/ as the orchestrator; Argo plays
          | nicely with Arc and makes it easy to build resilient
          | pipelines (retries etc.)
        
       | beckingz wrote:
        | Data is hard to automate, and standardized pipelines and
        | processes are really helpful. This is interesting.
        
       | moandcompany wrote:
       | Fig 4 looks like it's derived from Hidden Technical Debt in
       | Machine Learning (2015).
       | 
       | https://papers.nips.cc/paper/5656-hidden-technical-debt-in-m...
       | 
       | As someone else says in this comment thread, this is very much an
       | organizational problem, and cannot be viewed as just a technology
       | problem.
       | 
        | The common behavior of individuals and teams is the pursuit
        | of solutions that solve problems for them. The problem here
        | with ML, as we've seen with "Data Science" and other magic
        | technologies, is that having an appreciation for the domain
        | or context goes a long way. Being familiar with the entire
        | process, or "pipeline," is valuable, and role/functional
        | silos often lead to the problems people experience.
       | 
       | For some classes of machine learning problems and associated
       | data, sourcing solutions from vendors can work, but as with any
       | tools you can procure, you need the right people to use them
       | appropriately. This also applies to "DevOps" which is used for
       | comparison in the blog post.
       | 
       | --> DevOps example -- the philosophy seems to be about having
       | software developers also share build/release and infrastructure
       | responsibility. But some organizations have made "DevOps" teams
       | to silo build/release and infrastructure work... they ended up
       | renaming what used to be called their Build/Release or SysAdmin
       | teams. Siloing things to be "someone else's" problem doesn't
       | result in the major transformations that are needed.
       | 
       | Now imagine what happens if we substitute MLDevOps for DevOps
       | above.
       | 
       | I'll continue to say "The Role of a Data Engineer on a Team is
       | Complementary and Defined By The Tasks That Others Don't (Want
       | To) Do (Well)"
        
       | simonw wrote:
       | I see this as more of an organizational challenge than a
       | technology challenge.
       | 
        | Getting ML models into production isn't particularly hard...
        | if you put an engineering team on it that knows how to write
        | automated release procedures, design architecture that can
        | scale, and build robust APIs to surface the data.
       | 
       | But in many companies the engineers with those operations-level
       | skills and the researchers who work on machine learning live
       | completely separate lives. And then the researchers are expected
       | to deploy and scale their models to production!
       | 
        | That's not to say that this organizational problem cannot be
        | solved with technology/entrepreneurship. If a company can
        | afford it, it's likely much cheaper to pay an external
        | company to solve your "ML in production" problems than to
        | re-design your organization so that your internal ML teams
        | have the skills they need to go to prod.
        
         | mmq wrote:
         | > if you put an engineering team on it that know how to write
         | automated release procedures
         | 
          | I think surfacing the data is just the first step.
          | Oftentimes data scientists need to do some data
          | exploration; the process is generally iterative, so they
          | need to run several experiments, resume or restart some of
          | them, scale training across several machines with
          | distributed learning, or run hyperparameter tuning (which
          | means handling failures), and visualize and debug results
          | before deciding whether to deploy a model. Once a model is
          | deployed the story does not end there, because models
          | become stale and need to be retrained. There are other
          | issues that need to be handled as well, related to
          | compliance, governance, a/b testing, ...
         | 
          | The good news is that there are several open source
          | initiatives to solve several of these problems; at Polyaxon
          | [0] we are trying to address the aspects related to the
          | experimentation and automation phases.
         | 
         | [0] https://github.com/polyaxon/polyaxon
        
         | Cacti wrote:
          | I disagree. It's not about getting the data where it needs
          | to be. It's about data version control at a very fine level
          | with very large datasets (in a way that is efficient). It's
          | about detecting changes in model results based on changes
          | in data. It's about tracking the provenance of data in the
          | datasets. It's about potentially controlled access to the
          | data (e.g. allowing models to use health care data without
          | actually knowing the underlying data). It's about detecting
          | bias in datasets over time.
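          | 
          | A toy sketch of the "model results vs. data changes" point:
          | fingerprint the dataset and key every metric to that
          | fingerprint, so a silent data change becomes visible (the
          | hashing scheme here is illustrative only):
          | 
          |     import hashlib
          |     import json
          | 
          |     def fingerprint(rows):
          |         blob = json.dumps(rows, sort_keys=True).encode()
          |         return hashlib.sha256(blob).hexdigest()[:12]
          | 
          |     rows = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}]
          |     metrics = {fingerprint(rows): {"auc": 0.91}}
          | 
          |     rows[1]["x"] = 1.6  # an unannounced upstream change
          |     print(fingerprint(rows) in metrics)  # False: data moved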
         | 
         | It's actually quite complex, which is why generally speaking
         | very few people do anything like this. I am unaware of any
         | general solution to this problem, either in industry or
         | academia.
        
         | calebkaiser wrote:
          | I agree that a lot of the challenges around production ML
          | are organizational, but I think in many companies it has
          | more to do with a lack of engineering resources than with
          | the separation of eng and data science (though that
          | certainly happens).
         | 
         | Building and maintaining ML infrastructure from scratch is a
         | big project. That's why you see FAANG companies hiring for ML
         | infrastructure/platform engineers. Most startups don't have the
         | extra cycles for that big of an undertaking, and so you see a
         | lot of slapped-together, hacky solutions to putting models into
         | production.
         | 
         | I'm biased in that I work on Cortex (
         | https://github.com/cortexlabs/cortex ), but I think that open
         | source, modular tooling that removes the need to reinvent the
         | wheel is going to have a big impact in terms of making
         | production ML more accessible.
        
       ___________________________________________________________________
       (page generated 2020-04-28 23:00 UTC)