[HN Gopher] Launch HN: Replicate (YC W20) - Version control for ...
       ___________________________________________________________________
        
       Launch HN: Replicate (YC W20) - Version control for machine
       learning
        
       Author : bfirsh
       Score  : 120 points
       Date   : 2020-11-19 15:45 UTC (7 hours ago)
        
 (HTM) web link (replicate.ai)
 (TXT) w3m dump (replicate.ai)
        
       | m0sth8 wrote:
        | Congratulations on the launch.
       | 
        | We've used https://github.com/iterative/dvc for a long time
        | and are quite happy with it. What's the main difference
        | between replicate.ai and dvc?
        
         | mwnivek wrote:
         | I'd be curious about comparison with
         | https://github.com/mlflow/mlflow
        
           | bfirsh wrote:
           | We talked to a bunch of MLflow users, and the general
           | impression we got is that it is heavyweight and hard to set
           | up. MLflow is an all-encompassing "ML platform". Which is
           | fine if you need that, but we're trying to just do one thing
           | well. (Imagine if Git called itself a "software platform".)
           | 
           | In terms of features, Replicate points directly at an S3
           | bucket (so you don't have to run a server and Postgres DB),
           | it saves your training code (for reproducibility and to
           | commit to Git after the fact), and it has a nice API for
           | reading and analyzing your experiments in a notebook.
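            | 
            | As a rough sketch of the notebook side (illustrative
            | only; exact names are in our docs):
            | 
            |     import replicate
            | 
            |     # Lists the experiments stored in your S3/GCS bucket
            |     experiments = replicate.experiments.list()
            | 
            |     for exp in experiments:
            |         # Each experiment carries its hyperparameters
            |         print(exp.id, exp.params)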
        
             | Jugurtha wrote:
             | Congrats on the launch!
             | 
             | > _MLflow is an all-encompassing "ML platform"_
             | 
              | Not really. We're trying to use MLflow with our "ML
              | platform"[0]. Namely, it can save a model that expects
              | high-dimensional inputs, which describes most non-
              | trivial models I've seen, but it can only "deploy" the
              | model with the expectation of two-dimensional DataFrame
              | inputs. Apparently, they're working on that.
             | 
              | There are also many ambiguities concerning Keras and
              | Tensorflow stemming from "What is a Keras model? Is it
              | a Tensorflow model now that they're integrated? Why are
              | Keras models logged with the tensorflow model logger
              | when you use the autolog functionality?". These are
              | shared ambiguities, as there are several ways to save
              | and load models with Tensorflow, and we're looking into
              | the Keras/Tensorflow integration closely. MLflow uses
              | `cloudpickle`, and unpickling expects not only the same
              | pickle 'protocol' but the same Python _version_. We had
              | to dig deeper than necessary.
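              | 
              | For instance, the defensive pattern we ended up with
              | looks like this (a sketch; `build_model_fn` is a
              | hypothetical stand-in for our own code):
              | 
              |     import pickle
              |     import sys
              | 
              |     import cloudpickle
              | 
              |     # Model plus pre/post-processing closure
              |     # (hypothetical helper)
              |     model_fn = build_model_fn()
              | 
              |     # cloudpickle makes no cross-version guarantees,
              |     # so we record the interpreter version next to
              |     # the artifact and check it again at load time.
              |     payload = cloudpickle.dumps(model_fn)
              |     meta = {"python": list(sys.version_info[:3]),
              |             "protocol": pickle.DEFAULT_PROTOCOL}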
             | 
             | One other problem is when a model relies on ancillary
             | functions, which you must be able to ship somehow. You end
             | up tinkering with its guts, too.
             | 
              | Could you shed some light on how you deal with these
              | matters? Namely: high-dimensional inputs for models,
              | pre-processing/post-processing functions, serialization
              | brittleness, and the Keras/Tensorflow "duality".
             | 
              | We have to inherit that complexity to spare our users
              | from having to think about saving their experiments (we
              | automatically save models, metrics, and params). The
              | workflow is: data --> collaborative notebooks with
              | scheduling features and jobs --> (generate appbooks)
              | --> automatically tracked models/params/metrics -->
              | one-click deployment --> 'REST' API or form to invoke
              | the model.
             | 
             | Aaaaaand again, congrats on the launch!
             | 
             | - [0]: https://iko.ai
        
         | bfirsh wrote:
         | Thanks!
         | 
          | DVC is closely tied to Git. We've heard that people find
          | this quite heavyweight when running experiments.
         | 
         | We think we can build a much better experience if we detach
         | ourselves from Git. With Replicate, you just run your training
         | script as usual, and it automatically tracks everything from
         | within Python. You don't have to run any additional commands to
         | track things.
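          | 
          | To give a feel for it, a training script looks roughly
          | like this (a sketch; see the docs for the exact API):
          | 
          |     import replicate
          | 
          |     def train():
          |         # Records your code and hyperparameters up front
          |         experiment = replicate.init(
          |             path=".", params={"learning_rate": 0.01})
          |         for epoch in range(100):
          |             loss = 1.0 / (epoch + 1)  # stand-in step
          |             # Records metrics (and, if you save one, a
          |             # weights file) at this point in time
          |             experiment.checkpoint(
          |                 path="model.pth", metrics={"loss": loss})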
         | 
         | DVC is really good for storing data sets though, and we see
         | potential for integration there:
         | https://github.com/replicate/replicate/issues/359
        
           | ishcheklein wrote:
           | Hey, one of the DVC maintainers here!
           | 
            | TL;DR: I think Replicate should be compared with the
            | upcoming DVC Experiments feature -
            | https://github.com/iterative/dvc/wiki/Experiments . Stay
            | tuned - it'll be released very soon, but you can try it
            | in beta now.
           | 
           | First of all, congrats on the launch! I do really like the
           | aesthetics of the website, and the overall approach. It
           | resonates with our vision and philosophy!
           | 
            | Good feedback on experiments feeling heavyweight! We
            | focused on building a solid foundation for managing data
            | and pipelines in previous DVC versions, and we were aware
            | of this problem
            | (https://github.com/iterative/dvc/issues/2799). As I
            | mentioned, the Experiments feature is already in beta
            | testing. It means users no longer have to make commits
            | until they are ready, can still share experiments (it's a
            | long topic and we'll write a blog post at some point,
            | since I'm really excited about the way it's implemented
            | using custom Git refs), get support for DL workflows
            | (auto-checkpoints), and more. I would love to discuss and
            | share details; it would be great to compare the
            | approaches.
        
             | bfirsh wrote:
             | Would love to chat -- I'll shoot you an email. :)
        
           | gidim wrote:
           | Hey! I'm one of the founders at Comet.ml. We believe that Git
           | should continue to be the approach for managing code (similar
           | to dvc) but we adapted it to the ML workflow. Our approach is
           | to compute a git patch on every run so later you can 'git
           | apply' if you'd like (https://www.comet.ml/docs/user-
           | interface/#the-reproduce-butt...).
        
         | edolev wrote:
          | Congrats on the launch! This looks exciting. My company has
          | been using Comet.ml, and they cover a few use cases that
          | are missing here: specifically, real-time visualizations
          | and sharing experiments, which are key when working in a
          | team. Are you planning on adding those?
        
           | fagerhult wrote:
           | Thank you! We have an issue on the roadmap for adding a web
           | GUI: https://github.com/replicate/replicate/issues/295
           | 
           | We haven't thought about it in great detail yet, so I'd be
           | curious to hear your thoughts and ideas if you'd like to add
           | a comment to that issue!
        
       | mfDjB wrote:
       | As someone who tried to use git to do this for large sets of
       | data, I'm very glad this exists. Will be trying this out in the
       | future.
        
         | kevlar1818 wrote:
         | You may also be interested in a simple tool I'm building that
         | works in concert with source control to store, version, and
         | reproduce large data: https://github.com/kevin-hanselman/dud
         | 
          | My project is in its infancy (open-sourced less than a month
          | ago), but I'm pleased with its UX thus far. There's lots to
          | add in terms of documentation. For remote syncing, Dud
          | currently uses Rclone[1].
         | 
         | [1]: https://rclone.org/
        
       | breck wrote:
       | nit: "Throw away your spreadsheet" scares me a little. I love
       | spreadsheets, and think there are 100x+ more users of
       | spreadsheets than notebooks (though the overlap of notebook users
       | and ML users is probably close to 1, so I see your point). I
       | would always save my experiment results so they were ready to
       | analyze in spreadsheets (and other vis tools).
        
         | fagerhult wrote:
          | Hi, Andreas here. Yes, spreadsheets are great, and better
          | than notebooks in many cases. But I always felt like I was
          | doing something wrong when I used spreadsheets and markdown
          | files to manually record metrics and hyperparameters for my
          | experiments. It's error-prone, and it's easy to forget to
          | update the spreadsheet with new experiments.
         | 
          | So we're trying to automate the recording of this metadata,
          | and then give it back to you in various ways to inspect it.
          | One of those ways is actually spreadsheets:
         | https://github.com/replicate/replicate/issues/289
        
       | rq1 wrote:
       | I was looking for something similar today. I just adopted it. :)
       | 
       | Thank you for your amazing work!
       | 
        | Do you intend to integrate it with PT Lightning, perhaps as
        | a PT Lightning Logger?
       | 
       | It would be nice to have it maintained there.
       | 
       | It's used all over huggingface-Transformers examples.
        
         | fagerhult wrote:
         | Fantastic, thank you for those kind words!
         | 
         | And great idea to integrate with PT Lightning. I just opened an
         | issue: https://github.com/replicate/replicate/issues/367, feel
         | free to add more detail and comments! -andreas
        
       | mindhash wrote:
       | Congrats on the launch.
       | 
        | I built an open-source tool (called hyperML) for a similar
        | problem some time back. I think the problem is not just
        | storing things in version control but also being able to
        | quickly retrieve them in live/test systems or containers.
        | 
        | Mounting and loading datasets and models is painful. That is
        | part of what makes local training a better option.
        | 
        | If only the weights were version-controlled by the libraries
        | (tf, pytorch, or scikit), this whole problem would be much
        | easier to solve.
        
       | mkuklik wrote:
        | Congrats on the launch. Have you looked at https://comet.ml ?
        | If so, how do you compare to them?
        
         | bfirsh wrote:
         | Thanks! It's open source and you're in control of your own
         | data. See https://news.ycombinator.com/item?id=25151741
        
       | bfirsh wrote:
       | Hello HN!
       | 
       | We're Ben & Andreas, and we made Replicate. It's a lightweight
       | open-source tool for tracking and analyzing your machine learning
       | experiments: https://replicate.ai/
       | 
       | Andreas used to do machine learning at Spotify. He built a lot of
       | ML infrastructure there (versioning, training, deployment, etc).
       | I used to be product manager for Docker's open source projects,
       | and created Docker Compose.
       | 
       | We built https://www.arxiv-vanity.com/ together for fun, which
       | led to us teaming up to build more tools for ML.
       | 
       | We spent a year talking to lots of people in the ML community and
       | building all sorts of prototypes, but we kept on coming back to a
       | foundational problem: not many people in machine learning use
       | version control.
       | 
       | This causes all sorts of problems: people are manually keeping
       | track of things in spreadsheets, model weights are scattered on
       | S3, and results can't be reproduced.
       | 
       | So why isn't everyone using Git? Git doesn't work well with
       | machine learning. It can't store trained machine learning models,
       | it can't handle key/value metadata, and it's not designed to
       | record information automatically from a training script. There
       | are some solutions for these things, but they feel like band-
       | aids.
       | 
        | We came to the conclusion that we need a native version
        | control system for ML. It's sufficiently different from
        | normal software that we can't just put band-aids on Git.
       | 
       | We believe the tool should be small, easy to use, and extensible.
       | We found people struggling to migrate to "AI Platforms". A tool
       | should do one thing well and combine with other tools to produce
       | the system you need.
       | 
       | Finally, we also believe it should be open source. There are a
       | number of proprietary solutions, but something so foundational
       | needs to be built by and for the ML community.
       | 
       | Replicate is a first cut at something we think is useful: It is a
       | Python library that uploads your files and metadata (like
       | hyperparameters) to Amazon S3 or Google Cloud Storage. You can
       | get back to any point in time using the command-line interface,
       | analyze your results inside a notebook using the Python API, and
       | load your models in production systems.
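        | 
        | For a flavor of the production side, getting a trained model
        | back looks something like this (a sketch; the experiment ID
        | is made up and method names may differ slightly from the
        | docs):
        | 
        |     import replicate
        | 
        |     # replicate.yaml in your project points at your storage,
        |     # e.g. an S3 bucket.
        |     experiment = replicate.experiments.get("e45a203")
        |     checkpoint = experiment.best()  # or .latest()
        |     print(checkpoint.metrics)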
       | 
       | We'd love to hear your feedback, and hear your stories about how
       | you've done this before.
       | 
       | Also - building a version control system is rather complex, and
       | to make this a reality we need your help. Join us in Discord if
       | you want to be involved in the early design and help build it:
       | https://discord.gg/QmzJApGjyE
        
         | nemoniac wrote:
          | Wait, so you can upload to Amazon or Google but nowhere
          | else? Like to your own servers, for example?
        
           | bfirsh wrote:
           | You can save data to a path on the filesystem, so one way to
           | do this is with a network mount. Lots of academic departments
           | have their own GPU clusters, and they tend to have a shared
           | network filesystem.
           | 
           | We want to have more ways to do this though. We were close to
           | adding SFTP support, but didn't get round to it. Another
           | method could be to implement our own server, but we're trying
           | to keep it simple for now. I'd be curious to hear your
           | feedback here:
           | https://github.com/replicate/replicate/issues/366
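            | 
            | Concretely, you'd point the repository at a path on that
            | mount, something like this (illustrative path, and the
            | config key name may differ across versions):
            | 
            |     # replicate.yaml
            |     repository: "file:///mnt/shared-gpu-fs/replicate"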
        
         | mritchie712 wrote:
         | > it can't handle key/value metadata
         | 
         | What do you mean by that? Is a JSON no good? I guess you mean
         | the diffs will be unordered?
        
           | bfirsh wrote:
           | Yep, and we can do lots of other nice things. We can produce
           | nice tables with the key/value data, filter it ("show me all
           | experiments with an accuracy greater than 0.9"), produce
           | well-formatted diffs across an arbitrary number of things,
           | give you a nice Python API for analyzing the data in a
           | notebook, and so on.
           | 
           | There are some examples of these things on the home page, all
           | of which would be very fiddly to do with JSON files in Git:
           | https://replicate.ai/#features
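            | 
            | E.g., in a notebook, that accuracy filter is just plain
            | Python over the returned objects (a sketch, assuming the
            | Python API and an "accuracy" metric you logged):
            | 
            |     import replicate
            | 
            |     experiments = replicate.experiments.list()
            |     good = [
            |         exp for exp in experiments
            |         if exp.latest() is not None
            |         and exp.latest().metrics.get("accuracy", 0) > 0.9
            |     ]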
        
       | flc-anonym wrote:
        | Congrats on the launch! This looks interesting; however, I
        | feel like this space is quite crowded. You mentioned that
        | your most important feature is being open-source, but off the
        | top of my head I can think of several projects:
       | 
       | * Kubeflow: https://github.com/kubeflow/kubeflow
       | 
       | * MLFlow: https://github.com/mlflow/mlflow
       | 
       | * Pachyderm: https://github.com/pachyderm/pachyderm
       | 
       | * DVC: https://github.com/iterative/dvc
       | 
       | * Polyaxon: https://github.com/polyaxon/polyaxon
       | 
       | * Sacred: https://github.com/IDSIA/sacred
       | 
       | * pytorch-lightning + grid:
       | https://github.com/PyTorchLightning/pytorch-lightning
       | 
       | * DeterminedAI: https://github.com/determined-ai/determined
       | 
       | * Metaflow: https://github.com/Netflix/metaflow
       | 
       | * Aim: https://github.com/aimhubio/aim
       | 
       | * And so many more...
       | 
        | In addition to this list, several other hosted platforms
        | offer experiment tracking and model management. How do you
        | compare to all of these tools, and why do you think users
        | should move from one of them to Replicate? Thank you.
        
       | gitgud wrote:
        | Congrats on the launch! What's the business concept behind
        | the tool? I can't find anything on the homepage or in the
        | post. I didn't know YC funded open-source tools like this;
        | it's kind of refreshing.
        
         | bfirsh wrote:
         | Yeah, YC is funding lots of open source projects. PostHog[0]
         | was in our batch. GitLab, Docker, Mattermost, and CoreOS come
         | to mind as other open source YC companies.
         | 
         | There are a number of businesses we could build around the
         | project. A cloud service or enterprise products/support are the
         | obvious ones. Right now, we're focused on community building,
         | because a potential open source business can't be successful
          | without a healthy open source project.
         | 
         | [0] https://news.ycombinator.com/item?id=22376732
        
           | gitgud wrote:
           | > _There are a number of businesses we could build around the
           | project._
           | 
            | So you got funding from Y Combinator without a concrete
            | plan to make a business? That's pretty interesting; I
            | always thought they wanted profitable businesses and
            | turned down ideas they didn't think would work.
           | 
           | Great to hear they're betting on open-source more!
        
       | lcap wrote:
        | How does this compare to tools like neptune.ai, Weights &
        | Biases, and so on? I can see the advantage of having control
        | of one's data, whereas these tools use their own servers.
        | 
        | However, what I love about them is the amazing UI that allows
        | me to compare experiments.
        
         | bfirsh wrote:
         | This came out of a practical problem: at Spotify, Andreas
         | couldn't let any data leave their network. He wasn't going to
         | go through procurement to buy an enterprise version of one of
         | those products, so his only option left was open source
         | software.
         | 
         | But it's also out of principle: we think such a foundational
         | thing needs to be open source. There is a reason most people
         | use Git and not Perforce.
         | 
         | Replicate can work alongside visualization tools -- your data
         | is safe in your own S3 bucket, but you can use the hosted
         | visualization tool to complement that.
         | 
         | You could also imagine visualization tools built on top of
         | Replicate. One thing we've been thinking about is doing
         | visualization inside notebooks. It's like a programmable
         | Tensorboard:
         | https://colab.research.google.com/drive/18sVRE4Zi484G2rBeOYj...
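          | 
          | As a rough sketch of the idea (assuming experiments expose
          | their checkpoints and logged metrics, as in our docs):
          | 
          |     import matplotlib.pyplot as plt
          |     import replicate
          | 
          |     experiment = replicate.experiments.list()[0]
          |     losses = [chk.metrics["loss"]
          |               for chk in experiment.checkpoints]
          |     plt.plot(losses)
          |     plt.xlabel("checkpoint")
          |     plt.ylabel("loss")
          |     plt.show()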
         | 
         | I'd be curious to hear your thoughts about that. It's pretty
         | primitive so far, but we've got to start somewhere I suppose.
         | :)
        
       ___________________________________________________________________
       (page generated 2020-11-19 23:00 UTC)