[HN Gopher] We Need DevOps for ML Data ___________________________________________________________________ We Need DevOps for ML Data Author : amargvela Score : 78 points Date : 2020-04-28 20:09 UTC (2 hours ago) (HTM) web link (tecton.ai) (TXT) w3m dump (tecton.ai) | ska wrote: | "We need fewer data scientists and more data janitors" - anon | remmargorp64 wrote: | I was the main data science engineer at one of my previous | companies. We used tools like airflow for running python scripts | to import data, clean/transform it, train models, and even test | various models against datasets. We also used Azure for similar | things. | | It's easy to do "dev ops" for machine learning. Basically, just | automate everything and implement gatekeeping mechanisms along | with active monitoring. | | It's true, though. I had to cobble together a lot of custom | things at the time, but it wasn't that hard to do. | nik_s wrote: | I'm the CTO at a data science company, and this has been my | experience too. I've been lucky enough to have quite a few | engineers go from zero practical experience to being able to | train and deploy complex ML solutions, and the most successful | solutions have always involved a combination of just a couple | of tools: - airflow and/or celery for running data extraction | and transformation jobs - pandas and numpy for data wrangling - | sklearn, xgboost, lightgbm, pytorch or tensorflow for | training/inference - flask or Django to serve results | | It's a handful of technologies, but they're (generally) mature, | battle-tested, and well documented. | Tehchops wrote: | I'm reminded of: https://blog.acolyer.org/2019/06/03/ease-ml-ci/ | gas9S9zw3P9c wrote: | Wow, I probably have seen 10 of these kinds of companies over the | past few months.
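The "automate everything and implement gatekeeping mechanisms" approach remmargorp64 describes can be made concrete with a small promotion gate: a candidate model is only deployed when it beats the current production model on a held-out metric by some margin. A minimal sketch in plain Python — the 1% margin and the function names are illustrative assumptions, not anything from the thread:

```python
# Minimal model-promotion gate: deploy a candidate only if it beats the
# production model on a held-out metric by a required margin.
# The 1% margin is an arbitrary illustrative choice.

def should_promote(prod_metric: float, candidate_metric: float,
                   min_improvement: float = 0.01) -> bool:
    """Return True if the candidate beats production by min_improvement."""
    return candidate_metric >= prod_metric + min_improvement

def gatekeep(prod_metric: float, candidate_metric: float) -> str:
    if should_promote(prod_metric, candidate_metric):
        return "deploy"   # promote candidate to production
    return "reject"       # keep the current model and alert the team
```

In a real pipeline this check would sit between the training and deployment steps, with the "reject" branch feeding the active monitoring the comment mentions.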
Personally I believe (and hope) the winners in | this space are going to be modular open-source companies/products | as opposed to the "all-in-one enterprise solutions" | jdoliner wrote: | Pachyderm is probably one of the companies you've seen in this | space. Full disclosure: I'm the founder, but I feel that we've | stayed pretty true to the idea of being a modular open-source | tool. We have customers who just use our filesystem, and | customers who just use our pipeline system, and of course many | more who use both. We've also integrated best-in-class open- | source projects; for example, Kubeflow's TFJob is now the | standard way of doing TensorFlow training on Pachyderm, and | we're working on integrating Seldon as the serving component. | We find this architecture a lot more appealing than an all-in- | one web interface that you load your data into. | gas9S9zw3P9c wrote: | I haven't used you yet, but IMO this is the way it should be | done. Once I get around to cleaning up my current custom k8s | pipelines I will give you a spin :) | fizixer wrote: | I wonder what's the business model for teams/startups offering | open-source solutions that they developed in-house. | _mdb wrote: | CEO of Tecton here, and happy to give more context. Tecton is | specifically focused on solving a few key data problems to make | it easier to deploy and manage ML in production. e.g.: | | - How can I deliver these features to my model in production? | | - How do I make sure the data I'm serving to my model is | similar to what it was trained on? | | - How can I construct my training data with point-in-time | accuracy for every example? | | - How can I reuse features that another DS on my team built? | | We've found that there's a ton of complexity getting data right | for real-time production use cases. These problems can be | solved, but require a lot of care and are hard to get right.
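The "point-in-time accuracy" question above has a concrete shape: when constructing a training row for an event at time t, you must use the latest feature value recorded at or before t, never a later one, or future information leaks into training. A hedched plain-Python sketch — the feature name and timestamps are invented for illustration:

```python
import bisect

# feature history: per-entity list of (timestamp, value), sorted by timestamp.
# For a label event at time t, pick the newest value with timestamp <= t,
# so training never sees data from the future (no label leakage).

def point_in_time_value(history, t):
    """Latest feature value recorded at or before t, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    i = bisect.bisect_right(timestamps, t)
    return history[i - 1][1] if i else None

purchases_7d = [(10, 1), (20, 3), (35, 4)]  # (event_time, feature value)
assert point_in_time_value(purchases_7d, 25) == 3    # value as of t=25
assert point_in_time_value(purchases_7d, 5) is None  # no data recorded yet
```

Feature stores generalize this into a point-in-time join across every training example and every feature.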
| We're building production-ready feature infrastructure and | managed workflows that "just work" for teams that can't or | don't want to dedicate large engineering teams to these | problems. | | At the core of Tecton is a managed feature store, feature | pipeline automation, and a feature server. We're building the | platform to integrate with existing tools in the ML ecosystem. | | We're going to share more about the platform in the next few | months. Happy to answer any questions. I'd also love to hear | what challenges folks on this thread have encountered when | putting ML into production. | bogomipz wrote: | All of the open positions listed on your careers page appear | to be broken. There is no field to upload or attach a CV when | applying to any of the roles. Also, why would a LinkedIn | profile be mandatory in order to apply for a role? There are | many qualified people who have simply chosen not to be a part | of that social network. | _mdb wrote: | Ah. We're on it. LinkedIn shouldn't be required. Thanks for | flagging. | yanovskishai wrote: | Could you please mention what other solutions you've | gotten to see in this space? | simonw wrote: | https://angel.co/companies?keywords=machine+learning+models+. | .. lists a whole bunch of them. | verdverm wrote: | https://dolthub.com is the cool kid right now. There is | Pachyderm, git lfs, IPFS. | | Really what we need is version control for data; it's not | just an ML data problem. It's a little different though, | because you would like to move computation to data, rather | than the other way around. | wenc wrote: | The utility of version controlling production-sized data (as | opposed to code, or sample training data) is | something I'm having trouble grasping unless I'm missing | something here -- and I may be, so please enlighten me. | | It seems to me that to be able to time-travel in data you almost | need to store the Write-Ahead Log of database transactions | and be able to replay that.
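The WAL-replay idea just mentioned — time travel by replaying a log of row-level changes up to some point — can be sketched in a few lines of plain Python. The log format here is invented for illustration:

```python
# Time travel by log replay: rebuild table state as of time t by applying
# every change with timestamp <= t. Log entries (format invented for this
# sketch) are (timestamp, op, key, value) with op in {"upsert", "delete"},
# and the log is ordered by timestamp.

def state_as_of(log, t):
    state = {}
    for ts, op, key, value in log:
        if ts > t:
            break            # log is ordered; everything after is the future
        if op == "upsert":
            state[key] = value
        else:                # "delete"
            state.pop(key, None)
    return state

log = [(1, "upsert", "a", 10), (2, "upsert", "b", 20),
       (3, "delete", "a", None), (4, "upsert", "b", 25)]
assert state_as_of(log, 2) == {"a": 10, "b": 20}
assert state_as_of(log, 4) == {"b": 25}
```

Full replay is O(log length) per query, which is exactly why real systems checkpoint snapshots or store structural diffs instead of replaying from the beginning.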
Debezium captures the CDC | information, but it's an infrastructure-level tool rather | than a version control tool. | | In data science, most time-travel issues are worked around | using bitemporal data modeling, which is a fancy way of | saying "add a separate timestamp column to the table to | record when the data was written". Then you can roll things | back to any ETL point in a performant fashion. This is | particularly useful for debugging recursive algorithms that | get retrained every day. | | But these are infrastructure-level approaches. I'm not sure | that it's a problem for a version control tool. | timsehn wrote: | Tim, CEO of Liquidata, the company that built Dolt and | DoltHub, here. This is how we store the version-controlled | rows so that we get structural sharing across versions | (i.e. 50M rows + one row change becomes 50M+1 entries in the | database, not 100M, with no need to replay logs): | | https://www.dolthub.com/blog/2020-04-01-how-dolt-stores- | tabl... | wenc wrote: | Thanks, that looks like an interesting approach. I may | have missed this in the article, but let's say I have a | SQL database with 600m records, and an ETL process does | massive upserts (20m records) every day, with many | UPDATEs on 1-2 fields. | | Wouldn't discovering what those changes are still entail | heavy database queries? Unless Dolt has a hook into most | SQL databases' internal data structures? Or WALs? | zachmu wrote: | One of the cool things about Dolt is that you can query | the diff between two commits. This functionality is | available through special system tables. You specify two | commits in the WHERE clause, and the query only returns | the rows that changed between the commits. The syntax | looks like: | | `SELECT * FROM dolt_diff_$table where from_commit = | '230sadfo98' and to_commit = 'sadf9807sdf'` | sgt101 wrote: | I worry about retraining every day.
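The bitemporal workaround wenc describes above — an extra timestamp column plus an as-of filter — looks like this in SQLite. The table, column names, and data are made up for the sketch:

```python
import sqlite3

# Bitemporal-style query: every write appends a row with a recorded_at
# timestamp; "as of t" means the newest row per key with recorded_at <= t.
# Schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (item TEXT, price REAL, recorded_at INTEGER);
    INSERT INTO prices VALUES ('ore', 10.0, 1), ('ore', 12.5, 3);
""")

AS_OF = """
    SELECT price FROM prices
    WHERE item = ? AND recorded_at <= ?
    ORDER BY recorded_at DESC LIMIT 1
"""
assert conn.execute(AS_OF, ("ore", 2)).fetchone()[0] == 10.0  # before the update
assert conn.execute(AS_OF, ("ore", 3)).fetchone()[0] == 12.5  # after the update
```

Because nothing is ever overwritten, rolling back to any ETL point is just a different parameter in the query, which is the "performant fashion" the comment refers to.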
Isn't that a flag | that says "It hasn't learned a thing and actually I'm | just improving my backfitting score"? | wenc wrote: | Not really -- in many forecasting applications in fast- | changing markets, it is fairly common to dynamically | retrain your recursive model to a moving window of | historical data in order to adapt to your current | environment (with some regularization). The length of the | window depends on how fast the market changes. | | For these types of recursive model applications, you | cannot just fit the model once and forget about it. | somurzakov wrote: | as long as it works well on out-of-sample data at | deployment time, it is okay. | | Until some major data drift happens, but you would notice | it anyway | yanovskishai wrote: | Thanks! There are indeed many new players in the data | versioning space (DVC and Quilt also probably worth | mentioning). | | I totally agree that data management problems are not just | ML-related. But I personally think that there are | additional challenges in the space beyond just | version control for data: the whole area of data quality | management and monitoring, for example. I liked the analogy | to devops; source version control was a critical problem to | solve in software development, but it didn't stop there, | with things like CI/CD etc. I believe we'll see a similar | evolution in the data space. | dttos wrote: | Composable https://composable.ai is another tool in this | space | timsehn wrote: | Here's a list of companies/tools in the Git for Data space: | | https://www.dolthub.com/blog/2020-03-06-so-you-want-git- | for-... | SirOibaf wrote: | https://logicalclocks.com with their ML + Feature Store open | source platform Hopsworks and their managed cloud version | https://hopsworks.ai | jamesblonde wrote: | Disclaimer: I am a co-founder of Logical Clocks. There are | loads of interesting technical challenges in this "Feature | Store" space. Here are just a few we address in Hopsworks: | | 1.
To replicate models (needed for regulatory reasons), you | need to commit both data and code. If you have only a few | models, fine, just archive the training data. But, if you | have lots of models (dev+prod) and lots of data, you can't | use git-based approaches where you commit metadata and make | immutable copies of data. It scales (your data!) badly. We | are following the ACID data lake approach (Apache Hudi), | where you store diffs of your data and can issue queries | like "Give me training data for these features as it was on | this date". | | 2. You want one feature pipeline to compute features (not | one for training and a different one when serving | features). Your feature store should scale to store TBs/PBs | of cached features to generate train/test data, but should | also return feature vectors in single-millisecond latency for online | apps to make predictions. What DB has those | characteristics? We say none, and we adopt a dual-DB | approach with one DB for low latency and one for scale-out | SQL. We use open-source NDB and Hive on our HopsFS | filesystem, where both DBs and the filesystem share the | same unified, scale-out metadata layer (a "rm -rf | feature_group" on the filesystem also automatically cleans | up Hive and feature metadata). | | 3. You want to be able to catalog/search for features using | free-text search and have good exploratory data analysis. | The systems challenge here is how to allow search on your | production DB with your features. Our solution is that we | provide a CDC API to our Feature Store, and automatically | sync extended metadata to Elastic with an eventually | consistent replication protocol. So when you 'rm -rf ..' on | your filesystem, even the extended metadata in Elastic is | automatically cleaned up. | | 4. You need to support reuse of features in different | training datasets. Otherwise, what's the point? We do that | using Spark as a compute engine to join features from | tables containing normalized features.
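Point 4 above — reusing features across training datasets by joining normalized feature groups on an entity key — is, at its core, a keyed join. Hopsworks runs it on Spark; here is a toy plain-Python equivalent, with feature groups and names invented for illustration:

```python
# Reusing features across training sets = joining feature groups on an
# entity key. Real feature stores do this with Spark/SQL; this is a toy
# dict-based join with invented feature groups.

def join_features(entity_ids, *feature_groups):
    """Assemble one training row per entity by merging all feature groups."""
    rows = []
    for eid in entity_ids:
        row = {"id": eid}
        for group in feature_groups:
            row.update(group.get(eid, {}))  # missing features simply absent
        rows.append(row)
    return rows

profile = {1: {"age": 31}, 2: {"age": 44}}        # feature group 1
activity = {1: {"clicks_7d": 12}, 2: {"clicks_7d": 3}}  # feature group 2
rows = join_features([1, 2], profile, activity)
assert rows[0] == {"id": 1, "age": 31, "clicks_7d": 12}
```

Each feature group is computed once by its own pipeline and then joined into as many training datasets as need it, which is the reuse the comment argues for.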
| | References: | | * https://www.logicalclocks.com/blog/mlops-with-a-feature- | stor... * https://ieeexplore.ieee.org/document/8752956 (CDC | HopsFS to Elastic) * http://kth.diva- | portal.org/smash/get/diva2:1149002/FULLTEXT0... (Hive on | HopsFS) | chaoyu wrote: | I'm actually building a "modular open-source company/product" | in the MLOps space: | | BentoML https://docs.bentoml.org/en/latest/ | mmq wrote: | Polyaxon is an open-source machine learning automation | platform. It lets you schedule notebooks, tensorboards, and | container workloads for training ML and DL. It also has | native integration with Kubeflow's operators for distributed | training. | | https://github.com/polyaxon/polyaxon | minimaxir wrote: | Additionally, all of Google, Amazon, and Microsoft are pushing | _very_ heavily in the ML DevOps space. And if you are training/deploying | ML models at such a frequency that you _need_ to | utilize DevOps, chances are you are already using their | platforms for server compute. | iddan wrote: | This startup is trying to build the next GitHub for ML Data: | https://dagshub.com/ | gunshai wrote: | This seems pretty cool. | smeeth wrote: | I really find it difficult to put into words just how little I | care to pay for a web UI so I can "manage" my data. | | Data pipelines are a real problem though, and I'm very interested | in what startups do with this space. | factorialboy wrote: | > Data pipelines are a real problem though | | Can you please elaborate? Thanks. | prions wrote: | It's not trivial to create and manage Data Pipelines if you | care about scale, serving a wide range of inputs and outputs, | or making this data easy to surface and spread throughout | your org (i.e. making it actually useful to regular people). | | "Static ETL" like running the same database load every day at | 1:00am isn't a super challenging problem. Doing it across | many tables with complex transformations and multiple steps | easily can be.
You really have to consider reliability, | processing speed, failure modes, and other problems that | don't really arise until you hit a certain scale. | | There's also the issue that what people want out of a Pipeline | is changing. If you want people to be ""data driven"", | then that means they need easy access to potentially all of | your company's data on an ad hoc basis. So now your boring | 1:00am ETL pipeline isn't really serving any of these new | use cases. | | How do you create flexible pipelines that can be created from | any dataset on an ad hoc basis? This is where tools like | Airflow or Prefect come in. Creating a platform that can | create these types of Pipelines is a real problem. | | And before you even ask yourself _how_ to process this data, | you also need to ask _where_. If you want to do what I | outlined above - making your data more accessible and easy to | use - then you probably need to rework how you're storing | your data. But Data Lakes (and others) are a whole topic in | and of themselves. | dttos wrote: | Suggest you check out https://composable.ai for building out | robust data pipelines | seddonm1 wrote: | We have been thinking about these problems for a few years now | and have built Arc https://arc.tripl.ai (fully open source), | which is an abstraction layer on top of Apache Spark to help | end-users rapidly build and deploy data pipelines without | having to know about #dataops. Ultimately we decided that | giving users a decent interface https://github.com/tripl- | ai/arc-starter (based on Jupyter Notebooks) and encouraging a | 'SQL first' approach means we can give users flexibility but | also have a standardised way of deploying jobs with many of the | devops attributes (like logging and reliability). You can run | Arc as a standard docker run command or using Argo Workflows | https://argoproj.github.io/ on Kubernetes as the orchestrator, | as it plays nicely with Arc and makes it easy to build resilient | pipelines (retries etc.)
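The reliability and failure-handling concerns above largely reduce to the primitive that orchestrators like Airflow and Argo expose as "retries": re-run a failing task a few times with a delay, then fail loudly so the scheduler can alert. A bare-bones sketch of just that retry semantics in plain Python (scheduling omitted; the function names are invented):

```python
import time

# Bare-bones task retry, the primitive under orchestrator "retries" settings:
# re-run a failing step up to `retries` times with a delay between attempts,
# then re-raise the final error so the scheduler can mark the task failed.

def run_with_retries(task, retries=3, delay=0.0):
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise          # retries exhausted: surface the failure
            time.sleep(delay)  # real pipelines use exponential backoff

# A task that fails twice before succeeding, to exercise the retry path.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

assert run_with_retries(flaky) == "ok"
assert calls["n"] == 3  # two failures, one success
```

Retries only help with transient failures; the scale problems the comment describes (slow queries, bad data) still need monitoring and idempotent tasks so a re-run doesn't double-load data.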
| beckingz wrote: | Data is hard to automate, and standardized pipelines and processes | are really helpful. This is interesting. | moandcompany wrote: | Fig 4 looks like it's derived from Hidden Technical Debt in | Machine Learning (2015). | | https://papers.nips.cc/paper/5656-hidden-technical-debt-in-m... | | As someone else says in this comment thread, this is very much an | organizational problem, and cannot be viewed as just a technology | problem. | | The common behavior of individuals and teams is the pursuit of | solutions that solve problems for them. The problem with ML, | as we've seen with "Data Science" and other magic | technologies, is that having an appreciation for the domain or | context goes a long way. Being familiar with the entire process, or | "pipeline," is valuable, and role/functional silos often lead to the | problems people experience. | | For some classes of machine learning problems and associated | data, sourcing solutions from vendors can work, but as with any | tools you can procure, you need the right people to use them | appropriately. This also applies to "DevOps," which is used for | comparison in the blog post. | | --> DevOps example -- the philosophy seems to be about having | software developers also share build/release and infrastructure | responsibility. But some organizations have made "DevOps" teams | to silo build/release and infrastructure work... they ended up | renaming what used to be called their Build/Release or SysAdmin | teams. Siloing things to be "someone else's" problem doesn't | result in the major transformations that are needed. | | Now imagine what happens if we substitute MLDevOps for DevOps | above. | | I'll continue to say "The Role of a Data Engineer on a Team is | Complementary and Defined By The Tasks That Others Don't (Want | To) Do (Well)" | simonw wrote: | I see this as more of an organizational challenge than a | technology challenge.
| | Getting ML models into production isn't particularly hard... if | you put an engineering team on it that knows how to write | automated release procedures, design architecture that can scale, | and build robust APIs to surface the data. | | But in many companies the engineers with those operations-level | skills and the researchers who work on machine learning live | completely separate lives. And then the researchers are expected | to deploy and scale their models to production! | | That's not to say that this organizational problem cannot be | solved with technology/entrepreneurship. If a company can afford | it, it's likely much cheaper to pay an external company to solve | your "ML in production" problems than to re-design your | organization such that you equip your internal ML teams with the | skills they need to go to prod. | mmq wrote: | > if you put an engineering team on it that knows how to write | automated release procedures | | I think surfacing the data is just the first step. Often, | data scientists need to run some data exploration; the process | is generally iterative, so they need to run several | experiments, resume or restart some experiments, scale training | with distributed learning across several machines, or run hyper- | parameter tuning, which means handling failures and visualizing and | debugging results before deciding if they should deploy a model. | Once a model is deployed the story does not end there, because | models become stale and need to be retrained. There are other | issues related to compliance that need to be handled as well, | and many other problems related to governance, a/b testing, ... | | The good news is that there are several open-source initiatives | to solve these problems; at Polyaxon [0] we are | trying to solve some of the aspects related to the | experimentation and the automation phase. | Cacti wrote: | I disagree.
It's not about getting the data where it needs to | be. It's about data version control at a very fine level with | very large datasets (in a way that is efficient). It's about | detecting changes in model results based on changes in data. | It's about tracking provenance of data in the datasets. It's | about potentially controlled access to the data (e.g. allowing | models to use health care data without actually knowing the | underlying data). It's about detecting bias in datasets over | time. | | It's actually quite complex, which is why, generally speaking, | very few people do anything like this. I am unaware of any | general solution to this problem, either in industry or | academia. | calebkaiser wrote: | I agree that a lot of the challenges around production ML are | organizational, but I think in many companies, it has more to | do with a lack of engineering resources than with the | separation of eng and data science (though that certainly | happens). | | Building and maintaining ML infrastructure from scratch is a | big project. That's why you see FAANG companies hiring for ML | infrastructure/platform engineers. Most startups don't have the | extra cycles for that big of an undertaking, and so you see a | lot of slapped-together, hacky solutions to putting models into | production. | | I'm biased in that I work on Cortex ( | https://github.com/cortexlabs/cortex ), but I think that open- | source, modular tooling that removes the need to reinvent the | wheel is going to have a big impact in terms of making | production ML more accessible. ___________________________________________________________________ (page generated 2020-04-28 23:00 UTC)