[HN Gopher] Launch HN: Replicate (YC W20) - Version control for ... ___________________________________________________________________ Launch HN: Replicate (YC W20) - Version control for machine learning Author : bfirsh Score : 120 points Date : 2020-11-19 15:45 UTC (7 hours ago) (HTM) web link (replicate.ai) (TXT) w3m dump (replicate.ai) | m0sth8 wrote: | Congratulations on the launch. | | We've used https://github.com/iterative/dvc for a long time and | are quite happy with it. What's the main difference between replicate.ai and | dvc? | mwnivek wrote: | I'd be curious about a comparison with | https://github.com/mlflow/mlflow | bfirsh wrote: | We talked to a bunch of MLflow users, and the general | impression we got is that it is heavyweight and hard to set | up. MLflow is an all-encompassing "ML platform". Which is | fine if you need that, but we're trying to do just one thing | well. (Imagine if Git called itself a "software platform".) | | In terms of features, Replicate points directly at an S3 | bucket (so you don't have to run a server and Postgres DB), | it saves your training code (for reproducibility and to | commit to Git after the fact), and it has a nice API for | reading and analyzing your experiments in a notebook. | Jugurtha wrote: | Congrats on the launch! | | > _MLflow is an all-encompassing "ML platform"_ | | Not really. We're trying to use MLflow with our "ML | platform"[0]. Namely, it can save a model that expects high-dimensional | inputs, which describes most non-trivial models I've seen, | but it can "deploy" the model only with an expectation of | two-dimensional DataFrame inputs. | Apparently, they're working on that. | | There are also many ambiguities concerning Keras and | Tensorflow stemming from "What is a Keras model? Is it a | Tensorflow model now that they're integrated? Why are Keras | models logged with the tensorflow model logger when you use | the autolog functionality?".
These are shared ambiguities, | as there are several ways to save and load models with | Tensorflow, and we're looking into the Keras/Tensorflow | integration closely. MLflow uses `cloudpickle`, and | unpickling expects not only the same 'protocol', but the | same Python _version_. We had to dig deeper than necessary. | | One other problem is when a model relies on ancillary | functions, which you must be able to ship somehow. You end | up tinkering with its guts, too. | | Could you shed some light on how you deal with these | matters? Namely, high-dimensional inputs for models, | pre-processing/post-processing functions, serialization | brittleness, and the Keras/Tensorflow "duality". | | We have to inherit that complexity to spare our users from | having to think about saving their experiments (we do | that automatically to save models, metrics, params). The | workflow is data --> collaborative notebooks with | scheduling features and jobs --> (generate appbooks) --> | automatically tracked models/params/metrics --> one click | deployment --> 'REST' API or form to invoke model. | | Aaaaaand again, congrats on the launch! | | - [0]: https://iko.ai | bfirsh wrote: | Thanks! | | DVC is closely tied to Git. We've heard people find that quite | heavyweight when running experiments. | | We think we can build a much better experience if we detach | ourselves from Git. With Replicate, you just run your training | script as usual, and it automatically tracks everything from | within Python. You don't have to run any additional commands to | track things. | | DVC is really good for storing data sets though, and we see | potential for integration there: | https://github.com/replicate/replicate/issues/359 | ishcheklein wrote: | Hey, one of the DVC maintainers here! | | TL;DR: I think it should be compared with the upcoming DVC | feature - https://github.com/iterative/dvc/wiki/Experiments . | Stay tuned - it'll be released very soon, but you can try it | now in beta.
| | First of all, congrats on the launch! I really like the | aesthetics of the website, and the overall approach. It | resonates with our vision and philosophy! | | Good feedback on experiments feeling heavyweight! We've been | focused on building a great foundation for managing data and | pipelines in the previous DVC versions and were aware of | this problem (https://github.com/iterative/dvc/issues/2799). | As I mentioned, the Experiments feature is already there in beta | testing. It means that users don't have to make commits | until they are ready, can still share experiments (it's a | long topic and we'll write a blog post at some point, since I'm | really excited about the way it'll be implemented using | custom Git refs), get support for the DL workflow (auto-checkpoints), | and more. Would love to discuss and share details; it | would be great to compare the approaches. | bfirsh wrote: | Would love to chat -- I'll shoot you an email. :) | gidim wrote: | Hey! I'm one of the founders at Comet.ml. We believe that Git | should continue to be the approach for managing code (similar | to dvc), but we adapted it to the ML workflow. Our approach is | to compute a git patch on every run so later you can 'git | apply' if you'd like (https://www.comet.ml/docs/user-interface/#the-reproduce-butt...). | edolev wrote: | Congrats on the launch! This looks exciting. My company has | been using Comet.ml and they cover a few use cases that are | missing here. Specifically, things like real-time visualizations | and sharing experiments, which is key when working in a team. | Are you planning on adding those? | fagerhult wrote: | Thank you! We have an issue on the roadmap for adding a web | GUI: https://github.com/replicate/replicate/issues/295 | | We haven't thought about it in great detail yet, so I'd be | curious to hear your thoughts and ideas if you'd like to add | a comment to that issue!
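The "git patch on every run" idea gidim mentions above can be sketched with nothing but Python's stdlib `difflib` — a hypothetical illustration, not Comet's actual implementation (the file names and contents below are made up): diff the working copy of a tracked file against its last committed contents when a run starts, and store the unified-diff patch with the experiment so the exact code state can later be restored with `git apply` or `patch`.

```python
import difflib

# Contents of train.py at the last commit (stand-in values).
committed = "lr = 0.01\nepochs = 10\n"
# Contents of train.py in the working tree when the run starts.
working = "lr = 0.001\nepochs = 20\n"

# Produce a unified diff in the same a/ b/ path convention git uses,
# so the saved patch can be reapplied against the committed tree.
patch = "".join(difflib.unified_diff(
    committed.splitlines(keepends=True),
    working.splitlines(keepends=True),
    fromfile="a/train.py",
    tofile="b/train.py",
))
print(patch)
```

In practice a tool would compute this per tracked file (or shell out to `git diff`) and upload the patch alongside the run's metrics, so "reproduce" means checking out the recorded commit and applying the stored patch.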
| mfDjB wrote: | As someone who tried to use git to do this for large sets of | data, I'm very glad this exists. Will be trying this out in the | future. | kevlar1818 wrote: | You may also be interested in a simple tool I'm building that | works in concert with source control to store, version, and | reproduce large data: https://github.com/kevin-hanselman/dud | | My project is in its infancy (open-sourced less than a month | ago), but I'm pleased with its UX thus far. There's lots to add | in terms of documentation, but Dud currently uses Rclone[1] for | remote syncing. | | [1]: https://rclone.org/ | breck wrote: | nit: "Throw away your spreadsheet" scares me a little. I love | spreadsheets, and think there are 100x+ more users of | spreadsheets than notebooks (though the overlap of notebook users | and ML users is probably close to 1, so I see your point). I | would always save my experiment results so they were ready to | analyze in spreadsheets (and other vis tools). | fagerhult wrote: | Hi, Andreas here. Yes, spreadsheets are great, and better than | notebooks in many cases. But I always felt like I was doing | something wrong when I used spreadsheets and markdown files to | manually record metrics and hyperparameters for my experiments. | It's error-prone, and it's easy to forget to update the spreadsheet | with new experiments. | | So we're trying to automate recording this metadata, and then | give you that metadata in various ways so you can inspect it. | One of those ways is actually spreadsheets: | https://github.com/replicate/replicate/issues/289 | rq1 wrote: | I was looking for something similar today. I just adopted it. :) | | Thank you for your amazing work! | | Do you intend to integrate it with PT Lightning, or offer it as | a PT Lightning Logger? | | It would be nice to have it maintained there. | | It's used all over the huggingface-Transformers examples. | fagerhult wrote: | Fantastic, thank you for those kind words!
| | And great idea to integrate with PT Lightning. I just opened an | issue: https://github.com/replicate/replicate/issues/367, feel | free to add more detail and comments! -andreas | mindhash wrote: | Congrats on the launch. | | I built an open source tool (called hyperML) for a similar | problem some time back. I think the problem is not just storing things in | version control but being able to quickly retrieve them in live | /test systems or containers. | | Mounting and loading datasets and models is painful. That's kind of | what makes local training a better option. | | If the weights were version-controlled by the libraries (tf, | pytorch or scikit), this whole problem would be much easier to | solve. | mkuklik wrote: | Congrats on the launch. Have you looked at https://comet.ml ? If | so, how do you compare to them? | bfirsh wrote: | Thanks! It's open source and you're in control of your own | data. See https://news.ycombinator.com/item?id=25151741 | bfirsh wrote: | Hello HN! | | We're Ben & Andreas, and we made Replicate. It's a lightweight | open-source tool for tracking and analyzing your machine learning | experiments: https://replicate.ai/ | | Andreas used to do machine learning at Spotify. He built a lot of | ML infrastructure there (versioning, training, deployment, etc). | I used to be the product manager for Docker's open source projects, | and created Docker Compose. | | We built https://www.arxiv-vanity.com/ together for fun, which | led to us teaming up to build more tools for ML. | | We spent a year talking to lots of people in the ML community and | building all sorts of prototypes, but we kept coming back to a | foundational problem: not many people in machine learning use | version control. | | This causes all sorts of problems: people are manually keeping | track of things in spreadsheets, model weights are scattered on | S3, and results can't be reproduced. | | So why isn't everyone using Git? Git doesn't work well with | machine learning.
It can't store trained machine learning models, | it can't handle key/value metadata, and it's not designed to | record information automatically from a training script. There | are some solutions for these things, but they feel like | band-aids. | | We came to the conclusion that we need a native version control | system for ML. It's sufficiently different from normal software | that we can't just put band-aids on Git. | | We believe the tool should be small, easy to use, and extensible. | We found people struggling to migrate to "AI Platforms". A tool | should do one thing well and combine with other tools to produce | the system you need. | | Finally, we also believe it should be open source. There are a | number of proprietary solutions, but something so foundational | needs to be built by and for the ML community. | | Replicate is a first cut at something we think is useful: it is a | Python library that uploads your files and metadata (like | hyperparameters) to Amazon S3 or Google Cloud Storage. You can | get back to any point in time using the command-line interface, | analyze your results inside a notebook using the Python API, and | load your models in production systems. | | We'd love to hear your feedback, and hear your stories about how | you've done this before. | | Also - building a version control system is rather complex, and | to make this a reality we need your help. Join us in Discord if | you want to be involved in the early design and help build it: | https://discord.gg/QmzJApGjyE | nemoniac wrote: | Wait, so you can upload to Amazon or Google but nowhere else? | Like to your own servers, for example? | bfirsh wrote: | You can save data to a path on the filesystem, so one way to | do this is with a network mount. Lots of academic departments | have their own GPU clusters, and they tend to have a shared | network filesystem. | | We want to have more ways to do this, though. We were close to | adding SFTP support, but didn't get round to it.
Another | method could be to implement our own server, but we're trying | to keep it simple for now. I'd be curious to hear your | feedback here: | https://github.com/replicate/replicate/issues/366 | mritchie712 wrote: | > it can't handle key/value metadata | | What do you mean by that? Is JSON no good? I guess you mean | the diffs will be unordered? | bfirsh wrote: | Yep, and we can do lots of other nice things. We can produce | nice tables with the key/value data, filter it ("show me all | experiments with an accuracy greater than 0.9"), produce | well-formatted diffs across an arbitrary number of things, | give you a nice Python API for analyzing the data in a | notebook, and so on. | | There are some examples of these things on the home page, all | of which would be very fiddly to do with JSON files in Git: | https://replicate.ai/#features | flc-anonym wrote: | Congrats on the launch! This looks interesting; however, I feel | like this space is quite crowded. You mentioned that your most | important feature is the fact that you are open-source, but off | the top of my head I can think of several projects: | | * Kubeflow: https://github.com/kubeflow/kubeflow | | * MLFlow: https://github.com/mlflow/mlflow | | * Pachyderm: https://github.com/pachyderm/pachyderm | | * DVC: https://github.com/iterative/dvc | | * Polyaxon: https://github.com/polyaxon/polyaxon | | * Sacred: https://github.com/IDSIA/sacred | | * pytorch-lightning + grid: | https://github.com/PyTorchLightning/pytorch-lightning | | * DeterminedAI: https://github.com/determined-ai/determined | | * Metaflow: https://github.com/Netflix/metaflow | | * Aim: https://github.com/aimhubio/aim | | * And so many more... | | In addition to this list, several other hosted platforms offer | experiment tracking and model management. How do you compare to | all of these tools, and why do you think users should move from | one of them to Replicate? Thank you. | gitgud wrote: | Congrats on the launch!
What's the business concept behind the | tool? I can't find anything on the homepage or in the post. I | didn't know YC funded open-source tools like this; it's kind of | refreshing. | bfirsh wrote: | Yeah, YC is funding lots of open source projects. PostHog[0] | was in our batch. GitLab, Docker, Mattermost, and CoreOS come | to mind as other open source YC companies. | | There are a number of businesses we could build around the | project. A cloud service or enterprise products/support are the | obvious ones. Right now, we're focused on community building, | because a potential open source business can't be successful | without a healthy open source project. | | [0] https://news.ycombinator.com/item?id=22376732 | gitgud wrote: | > _There are a number of businesses we could build around the | project._ | | So you got funding from Y Combinator without a concrete plan | to make a business? That's pretty interesting; I always | thought they wanted profitable businesses, and turned down | ideas they didn't think would work. | | Great to hear they're betting on open source more! | lcap wrote: | How does this compare to tools like neptune.ai, Weights & | Biases, and so on? I can see the advantage of having control of | one's data, whereas these tools use their own servers. | | However, what I love about them is the amazing UI that allows me | to compare experiments. | bfirsh wrote: | This came out of a practical problem: at Spotify, Andreas | couldn't let any data leave their network. He wasn't going to | go through procurement to buy an enterprise version of one of | those products, so his only option left was open source | software. | | But it's also out of principle: we think such a foundational | thing needs to be open source. There is a reason most people | use Git and not Perforce. | | Replicate can work alongside visualization tools -- your data | is safe in your own S3 bucket, but you can use a hosted | visualization tool to complement that.
| | You could also imagine visualization tools built on top of | Replicate. One thing we've been thinking about is doing | visualization inside notebooks. It's like a programmable | Tensorboard: | https://colab.research.google.com/drive/18sVRE4Zi484G2rBeOYj... | | I'd be curious to hear your thoughts about that. It's pretty | primitive so far, but we've got to start somewhere I suppose. | :) ___________________________________________________________________ (page generated 2020-11-19 23:00 UTC)