[HN Gopher] Show HN: Orchest - Data Science Pipelines
       ___________________________________________________________________
        
       Show HN: Orchest - Data Science Pipelines
        
       Hello Hacker News! We are Rick & Yannick from Orchest
       (https://www.orchest.io - https://github.com/orchest/orchest).
       We're building a visual pipeline tool for data scientists. The
       tool can be considered high-code because you write your own
       Python/R notebooks and scripts, but we manage the underlying
       infrastructure to make it 'just work(tm)'. You can think of it
       as a simplified version of Kubeflow.  We created Orchest to
       free data scientists from the tedious engineering-related tasks
       of their job, similar to how companies like Netflix, Uber and
       Booking.com support their data scientists with internal tooling
       and frameworks to increase productivity. When we worked as data
       scientists ourselves we noticed how heavily we had to depend on
       our software engineering skills to perform all kinds of tasks,
       from configuring cloud instances for distributed training to
       optimizing the networking and storage for processing large
       amounts of data. We believe data scientists should be able to
       focus on the data and the domain-specific challenges.  Today we
       are just at the very beginning of
       making better tooling available for data science and are launching
       our GitHub project that will give enhanced pipelining abilities to
       data scientists using the PyData/R stack, with deep integration of
       Jupyter Notebooks.  Currently Orchest supports:
        
       1) visually and interactively editing a pipeline that is
          represented using a simple JSON schema (a sketch of such a
          definition is included below);
       2) running remote container-based kernels through the Jupyter
          Enterprise Gateway integration;
       3) scheduling experiments by launching parameterized pipelines
          on top of our Celery task scheduler;
       4) configuring local and remote data sources to separate code
          versioning from the data passing through your pipelines.
        
       We are here to learn and get feedback from the community. As
       youngsters we don't have all the answers and are always looking
       to improve.
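        
       To give a feel for item 1), here is a hypothetical sketch of a
       step-based pipeline definition written out from Python. The
       field names and the file name are placeholders, not the actual
       Orchest schema:
        
           import json

           # Hypothetical illustration only: field names and the file
           # name are placeholders, not the actual Orchest schema.
           pipeline = {
               "name": "example-pipeline",
               "steps": {
                   "load-data": {
                       "file_path": "load_data.ipynb",
                       "incoming_connections": [],
                   },
                   "train-model": {
                       "file_path": "train_model.py",
                       "incoming_connections": ["load-data"],
                       "parameters": {"learning_rate": 0.01},
                   },
               },
           }

           with open("pipeline.json", "w") as f:
               json.dump(pipeline, f, indent=2)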
        
       Author : ricklamers
       Score  : 31 points
       Date   : 2020-08-12 12:24 UTC (10 hours ago)
        
       | vasinov wrote:
       | This looks cool! A couple of questions:
       | 
       | 1. Currently, if I install something in the notebook, does it get
       | re-installed every time the pipeline is run? Is there any way to
       | "snapshot" the state of the container?
       | 
       | 2. Where is the data stored between the steps?
       | 
       | 3. How well-integrated is it with AWS cloud primitives such as
       | EC2 instances, EFS, and S3?
        
         | ricklamers wrote:
         | Thanks!
         | 
          | 1. Right now, additional dependencies for the container need
          | to be re-installed whenever you run the pipeline. Throughout
          | a Jupyter kernel session, though, the container state and
          | thus any installed dependencies remain available. We're
          | working on either supporting container snapshots or custom
          | container images (with desired dependencies pre-installed).
          | We'll likely go with snapshots as they'll be easier from an
          | end-user perspective.
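          | 
          | As a small example, installing a package from a notebook
          | cell goes into the running kernel's container, so it stays
          | available for the rest of the session but has to be repeated
          | on the next pipeline run (the package name is just a
          | placeholder):
          | 
          |     # Run in a notebook cell; persists for this kernel
          |     # session only, not across pipeline runs.
          |     %pip install --quiet lightgbm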
         | 
          | 2. During step execution, data is stored either inside the
          | pipeline directory (which contains, for example, the
          | .ipynb/.py/.R/.sh files) or in any of the mounted
          | directories (through data sources).
         | 
          | When you run the pipeline as part of an experiment, a copy
          | is created so that any state the steps generate inside the
          | pipeline directory is isolated from the 'working copy' of
          | the pipeline.
         | 
         | Edit: forgot to mention that we support memory-based data
         | transfer between steps which is faster and doesn't "pollute"
         | your pipeline directory. It does require your data to fit in
         | memory though. We use Apache Arrow's Plasma for this.
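          | 
          | Roughly speaking, the memory-based transfer boils down to
          | putting an object into a shared Plasma object store in one
          | step and reading it back in a later step. A bare-bones
          | sketch using pyarrow directly (outside of Orchest; it
          | assumes a plasma_store process is serving the given socket):
          | 
          |     import pyarrow.plasma as plasma
          | 
          |     # Assumes a plasma_store process is running, e.g.:
          |     #   plasma_store -m 1000000000 -s /tmp/plasma
          |     client = plasma.connect("/tmp/plasma")
          | 
          |     # Producing step: put the object in shared memory.
          |     object_id = client.put({"rows": [1, 2, 3]})
          | 
          |     # Consuming step: read it back without touching disk.
          |     data = client.get(object_id)
          | 
          | In Orchest the plumbing above is handled for you, so steps
          | just pass data to the next step.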
         | 
          | 3. AWS S3 and AWS Redshift are currently supported as data
          | sources. Some light docs at
          | https://orchest-sdk.readthedocs.io/en/latest/python.html#dat...
          | (to be improved!) and the relevant SDK source
          | (https://github.com/orchest/orchest-sdk/blob/master/python/or...).
          | We should look into EFS. Do you have a use case in mind?
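          | 
          | As a point of reference, independent of the Orchest SDK,
          | here is a plain boto3 sketch of what reading an object from
          | an S3 data source boils down to (bucket and key names are
          | placeholders):
          | 
          |     import boto3
          | 
          |     # Plain boto3 sketch; bucket/key are placeholders. The
          |     # SDK exposes configured data sources through its own
          |     # interface (see the docs linked above).
          |     s3 = boto3.client("s3")
          |     obj = s3.get_object(Bucket="my-bucket", Key="data.csv")
          |     raw_bytes = obj["Body"].read()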
        
       | Obinkhorst wrote:
       | Thanks for sharing, this is super helpful. I'm endlessly jealous
       | of the teams at Uber and Booking and their fancy tools
        
         | ricklamers wrote:
         | We'll make sure the rest of the world gets those great tools
         | too!
        
       | rgmvisser wrote:
       | Really cool! I can't wait to start playing with it.
       | 
       | Can two people collaborate on the same project at the same time?
        
         | ricklamers wrote:
          | That's great to hear! Right now, editing the same pipeline
          | at the same time isn't fully supported. We're moving towards
          | a git-based async collaboration approach where you can fork
          | and merge pipelines, so that changes you make to
          | code/notebooks aren't going to surprise you in your
          | analysis/models.
        
       ___________________________________________________________________
       (page generated 2020-08-12 23:00 UTC)