[HN Gopher] Launch HN: Sematic (YC S22) - Open-source framework ...
       ___________________________________________________________________
        
       Launch HN: Sematic (YC S22) - Open-source framework to build ML
       pipelines faster
        
       Hi HN - I'm Emmanuel, founder of Sematic (https://sematic.dev).
       Sematic is an open-source framework to prototype and productionize
       end-to-end Machine Learning (ML) and Data Science (DS) pipelines in
       days instead of weeks or months. The idea is to do for ML
       development what Rails and Heroku did for web development.  I
       started my career searching for Supersymmetry and the Higgs boson
       on the Large Hadron Collider at CERN, then moved to industry. I
       spent the last four years building ML infrastructure at Cruise. In
       both academia and industry, I witnessed researchers, data
       scientists, and ML engineers spending an absurd share of their time
       building makeshift tooling, stitching up infrastructure, and
       battling obscure systems, instead of focusing on their core area of
       expertise: extracting insights and predictions from data.  This was
       painfully apparent at Cruise where the ML Platform team needed to
       grow linearly with the number of users to support and models to
       ship to the car. What should have just taken a click (e.g.
       retraining a model when world conditions change - COVID parklets,
       road construction sites, deployment to new cities) often required
       weeks of painstaking work. Existing tools for prototyping and
       productionizing ML/DS models did not enable developers to become
       autonomous and tackle new projects instead of babysitting current
       ones.  For example, a widely adopted tool such as Kubeflow
       Pipelines requires users to learn an obscure Python API, package
       and deploy their code and dependencies by hand, and does not offer
       exhaustive tracking and visualization of artifacts beyond simple
       metadata.  In order to become autonomous, users needed a dead-
       simple way to iterate seamlessly between local and cloud
       environments (change code, validate locally, run at scale in the
       cloud, repeat) and visualize objects (metrics, plots, datasets,
       configs) in a UI. Strong guarantees around dependency packaging,
       traceability of artifact lineage, and reproducibility would have to
       be provided out-of-the-box.  Sematic lets ML/DS developers build
       and run pipelines of arbitrary complexity with nothing more than
       minimalistic Python APIs. Business logic, dynamic pipeline graphs,
       configurations, resource requirements, etc. -- all with only
       Python. We are bringing the lovable aspects of Jupyter Notebooks
       (iterative development, visualizations) to the actual pipeline.
       How it works: Sematic resolves dynamic nested graphs of pipeline
       steps (simple Python functions) and intercepts all inputs and
       outputs of each step to type-check, serialize, version, and track
       them. Individual steps are orchestrated as Kubernetes jobs
       according to required resources (e.g. GPU, high-memory), and all
       tracking and visualization information is surfaced in a modern UI.
       Build assets (user code, third-party dependencies, drivers, static
       libraries) are packaged and shipped to remote workers at runtime,
       which enables a fast and seamless iterative development experience.
       Sematic lets you achieve results much faster by not wasting time on
       packaging dependencies, foraging for output artifacts to visualize,
       investigating obscure failures in black-box container jobs,
       bookkeeping configurations, writing complex YAML templates to run
       multiple experiments, etc.  It can run on a local machine or be
       deployed to leverage cloud resources (e.g. GPUs, high-memory
       instances, map/reduce clusters, etc.) with minimal external
       dependencies: Python, PostgreSQL, and Kubernetes.  Sematic is open-
       source and free to use locally or self-hosted in your own cloud. We
       will provide a SaaS offering to enable access to cloud resources
       without the hassle of maintaining a cloud deployment. To get
       started, simply run `$ pip install sematic; sematic start`. Check
       us out at https://sematic.dev, star our Github repo, and join our
       Discord for updates, feature requests, and bug reports.  We would
       love to hear from everyone about your experience building reliable
       end-to-end ML training pipelines, and anything else you'd like to
       share in the comments!
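
        As a loose illustration of the "steps as typed Python functions
        with intercepted I/O" idea described above, here is a framework-
        agnostic sketch in plain Python. The `step` decorator and its
        tracking logic are illustrative stand-ins, not Sematic's actual
        API:

```python
import functools
from typing import get_type_hints

ARTIFACTS = []  # a real system would serialize, version, and store these

def step(fn):
    """Illustrative decorator: type-check and record a step's I/O."""
    hints = get_type_hints(fn)
    params = [(name, tp) for name, tp in hints.items() if name != "return"]

    @functools.wraps(fn)
    def wrapper(*args):
        # Check positional inputs against the function's annotations
        for value, (name, expected) in zip(args, params):
            if not isinstance(value, expected):
                raise TypeError(f"{fn.__name__}.{name}: expected {expected.__name__}")
        output = fn(*args)
        if "return" in hints and not isinstance(output, hints["return"]):
            raise TypeError(f"{fn.__name__} returned {type(output).__name__}")
        ARTIFACTS.append((fn.__name__, args, output))  # tracked for lineage
        return output

    return wrapper

@step
def load_dataset(n: int) -> list:
    return list(range(n))

@step
def train(dataset: list) -> float:
    return sum(dataset) / len(dataset)  # stand-in for real training

# The "pipeline" is plain function composition; every hop is intercepted
metric = train(load_dataset(10))
```

        Passing a wrongly-typed value (e.g. `load_dataset("ten")`) fails
        fast with a `TypeError` instead of deep inside a remote job.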
        
       Author : neutralino1
       Score  : 80 points
       Date   : 2022-08-10 14:56 UTC (8 hours ago)
        
       | boredumb wrote:
       | > with minimal external dependencies: Python, PostgreSQL, and
       | _Kubernetes_
       | 
       | What a time to be alive.
       | 
       | All jokes aside, this is really awesome and i'm glad to see more
       | and more tools to make ML more developer friendly and accessible.
       | Out of curiosity do you guys come from a TF, pytorch, jax, etc
       | background?
        
         | neutralino1 wrote:
         | Indeed, Kubernetes is hardly "light" :D
         | 
         | However, it's only required when running pipelines in the
         | cloud. When running locally (`$ sematic start`), nothing else
         | than Python is required.
         | 
         | Regarding background, we are folks with experience building ML
         | tooling. We've built infra around TF and Pytorch, but none with
         | Jax. That being said, Sematic is agnostic to the framework you
         | want to use.
        
           | nemoniac wrote:
           | How about if "locally" is a handful of servers reachable by
           | ssh?
        
             | neutralino1 wrote:
             | You can certainly deploy the web app on one server, and run
             | your pipelines on another (or the same). In this case,
             | "locally" would mean that the pipeline and all its steps
             | run on the same host machine. This is totally sufficient in
             | many cases.
             | 
             | Kubernetes becomes interesting when using heterogeneous
             | resources (e.g. GPU nodes for training, high-memory for
             | data processing, etc.), but is not a necessity.
        
       | edublancas wrote:
       | > The idea is to do for ML development what Rails and Heroku did
       | for web development.
       | 
       | I think this is a great way to explain what you're doing. I'm
        | working in the same space (ML/DS tooling) and I feel like we, as
        | the ML/DS community, haven't cracked exactly what Rails for data
        | looks like. I actually wrote some ideas on this a while ago
        | (https://ploomber.io/blog/rails4ml/).
       | 
       | Congrats on the launch and best of luck with the product!
        
         | neutralino1 wrote:
         | Thanks! I was a pretty heavy user of Rails in past jobs. They
         | have a nice mix of good abstractions (Model/View/Controller),
         | tooling (CLI, local web server), and best practices (how to
         | name tables, fields, lifecycle timestamps, etc.).
         | 
         | We think that is a good way to go: solid abstractions that
         | experts can build on top of, but also more junior folks can get
         | started quickly by following best practices.
         | 
         | Ploomber looks great too, I like the breakdown of your SaaS
         | offering.
        
       | benjismith wrote:
       | Sounds great! Very interested when the SaaS offering opens up.
       | Definitely not keen on running a Kubernetes cluster for the sake
       | of simplifying ML operations.
        
         | josh-sematic wrote:
         | Glad it looks interesting to you! Regarding running a
         | Kubernetes cluster, that's only required if you want the steps
         | in your pipeline to all execute in their own containers. If
         | your workflow is such that everything can run on a single
         | machine, you can still use Sematic to track your experiments.
         | One advantage here is that if you ever do need to scale up to
         | containerized workflows, you can do so without changing the
         | code for your pipeline.
        
       | brochington wrote:
       | Just a note to say that even though the name "Sematic" is the
       | same, this is not the same open source project as mine that I
       | posted to Show HN about a week ago here:
       | https://news.ycombinator.com/item?id=32364193.
        
         | neutralino1 wrote:
         | As they say, naming things is one of the two hardest problems
         | in Computer Science :)
         | 
         | Your project is cool and has a cool name!
        
           | brochington wrote:
           | Haha, thank you, and yours has a great name too!
           | 
           | Both our projects are ML-adjacent. I don't want to cause any
           | confusion, and will start thinking of some alternate names
           | for my project.
        
             | dang wrote:
              | Independently of the name collision, your Show HN looks
              | good but didn't get any attention! If you email me at
             | hn@ycombinator.com, I'll send you a repost invite for it.
        
       | ricklamers wrote:
       | For people in this thread interested in what this tool is an
       | alternative to: Airflow, Luigi, Kubeflow, Kedro, Flyte, Metaflow,
       | Sagemaker Pipelines, GCP Vertex Workbench, Azure Data Factory,
       | Azure ML, Dagster, DVC, ClearML, Prefect, Pachyderm, and Orchest.
       | 
       | Disclaimer: author of Orchest https://github.com/orchest/orchest
        
         | josh-sematic wrote:
         | Yup, most of these tools fall into the definition of
          | "orchestration" in one way, shape, or form, though not all of
         | them are targeted at Machine Learning pipelines, and not all of
         | them focus as much on the low barrier to entry space we're
         | aiming at. In general they also don't optimize for local
         | workflows while still enabling cloud access. We understand
         | there are a lot of tools that may superficially look similar,
          | so we've put together a page to describe the things we do
          | differently from other tools:
          | https://docs.sematic.dev/sematic-vs
         | 
         | Btw, orchest looks like a cool way to orchestrate notebooks!
         | 
         | Also, hi! I'm founding engineer here at Sematic. Happy to
         | answer any questions I can!
        
           | ricklamers wrote:
           | Nice comparison section! And thank you kindly for the
           | compliment.
           | 
           | Our approach has been supporting both Python scripts (R,
           | Julia, JS and Bash too for that matter) and notebooks as it
           | would give users the ability to choose the right tool for the
           | job and in case of migrations from notebooks to scripts make
           | the process more incremental.
           | 
           | Welcome to the thread :wave:
        
           | troiskaer wrote:
            | How does Sematic compare to Metaflow? It optimizes for
            | many of the same goals as Sematic: local workflows, cloud
            | access, lineage tracking, state transfer, etc.
        
             | josh-sematic wrote:
             | There are several differences, but I'd say these are some
             | of the main ones:
             | 
             | UI: whereas Metaflow provides the ability to build your own
             | result visualizations explicitly in your workflow (via
             | their "cards" feature), Sematic makes it so that your
              | outputs ( _and_ inputs) get automatic rich visualizations
              | based on the type of the data being passed around.
             | 
              | API: Instead of being based around explicitly building up
              | a graph, where you have to spell out the I/O connections
              | between steps, Sematic makes defining your steps look like
              | writing and calling Python functions.
             | 
              | Packaging: Whereas Metaflow requires you to include
             | packaging information in the code defining your steps (the
             | @conda decorator, etc.), Sematic plugs into your existing
             | dependency management to bundle up dependencies for
             | execution in the cloud.
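              | 
              | As a toy, framework-agnostic sketch of that call-style
              | API idea: step invocations can return lazy futures, so
              | the graph is implied by ordinary function calls rather
              | than declared edge by edge. The `func` and `Future`
              | names below are illustrative, not Sematic's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Future:
    """A recorded step invocation; nested futures imply the DAG edges."""
    fn: Callable
    args: tuple

    def resolve(self) -> Any:
        # Depth-first: resolve upstream futures before running this step
        concrete = tuple(
            a.resolve() if isinstance(a, Future) else a for a in self.args
        )
        return self.fn(*concrete)

def func(fn: Callable) -> Callable:
    # Calling a decorated step records it instead of executing it
    def wrapper(*args) -> Future:
        return Future(fn, args)
    return wrapper

@func
def add(a: int, b: int) -> int:
    return a + b

@func
def square(x: int) -> int:
    return x * x

# Reads like plain Python, but builds a two-node graph resolved on demand
future = square(add(1, 2))
result = future.resolve()
```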
        
       | Smergnus wrote:
       | I have an idea where I want to build an ML system that generates
       | different sets of board game rules (think tic-tac-toe type
       | games), then trains models to play that game, and scores each set
       | of rules based on a set of criteria. For example: no side should
       | always win, the skill ceiling should be high (models should keep
       | improving when trained more). A less skilled (trained) model
       | should sometimes be able to beat a more skilled model. The games
       | should end within a reasonable number of turns. Etc. The high
       | level system should then generate new rulesets, searching for a
       | ruleset that scores optimally on the criteria. Would Sematic be
       | good for this?
        
         | neutralino1 wrote:
         | Thanks for your question! Yes, Sematic has a neat feature that
         | can help: Dynamic Graphs. Because Sematic uses simple Python to
         | declare the control and data flow of your graph, you can simply
         | loop over configurations (in your case different sets of board
         | game rules) and train a model for each config, and eventually
         | aggregate results to determine the winner.
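          | 
          | In stand-in Python (every function body below is a
          | placeholder for your actual generation, training, and
          | scoring logic, not Sematic code), the dynamic-graph shape
          | of that search is just a loop:

```python
import random

def generate_ruleset(seed: int) -> dict:
    # Placeholder for an ML-generated set of board-game rules
    rng = random.Random(seed)
    return {"board_size": rng.choice([3, 4, 5]), "win_length": 3}

def train_and_score(ruleset: dict) -> float:
    # Placeholder for training agents on the game and scoring the
    # ruleset on criteria (balance, skill ceiling, game length, ...)
    return 1.0 / ruleset["board_size"]

def search_rulesets(n: int) -> dict:
    # Dynamic graph: how many training steps run is decided at runtime
    scored = [(train_and_score(generate_ruleset(s)), s) for s in range(n)]
    _, best_seed = max(scored)
    return generate_ruleset(best_seed)

best = search_rulesets(5)
```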
         | 
          | Join our Discord if you want to discuss this further -
         | https://discord.gg/4KZJ6kYVax
        
           | Smergnus wrote:
           | Awesome. Thanks!
        
       | rubenfiszel wrote:
        | This is amazing. Long live open-source platforms that let
        | developers and data scientists focus on the interesting parts
        | of their jobs and make them more productive.
        
       | kajecounterhack wrote:
       | Looks cool!
       | 
       | > Sematic makes I/O between steps in your pipelines as simple as
       | passing an output of one python function as the input of another.
       | Airflow provides APIs which can pass data between tasks, but
       | involves some boilerplate around explicitly pushing/pulling data
       | around, and coupling producers and consumers via named data keys.
       | 
       | In robotics you sometimes need high performance data
       | transformation e.g. convert pile of raw robot log data protos -->
       | pile of simulation inputs --> pile of extracted data --> munged
       | into net input format
       | 
        | Does Sematic support this if the communication between tasks
       | uses python functions? Like if my simulator is C++, will I have
       | to use SWIG?
       | 
       | In some of the competing systems, the input/output between nodes
       | are just produced files as side effects, which is nice because it
       | doesn't care what language / infra you use as long as you produce
       | the required input/output.
        
         | neutralino1 wrote:
         | Thank you!
         | 
         | What we have seen done in the past is use things like pybind11
         | to expose C++ APIs in Python, which I guess is a similar
         | concept to SWIG. If you are using build tools such as Bazel,
         | you can even get the C++ compiled at run-time when submitting
         | your pipeline.
         | 
          | Regarding I/O artifacts, Sematic lets users choose how they are
         | serialized. We offer reasonable baseline defaults, but certain
         | artifacts require serialization formats that are cross-language
         | (e.g. ROS messages since you mention robotics).
         | 
          | As a last resort, users are free to serialize and persist
          | artifacts by hand as part of their pipeline functions (e.g.
          | storing in a cloud bucket) and return only a reference to
         | said artifact (e.g. a Python dataclass with artifact location
         | and metadata).
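          | 
          | A minimal sketch of that last pattern, with hypothetical
          | names and a temp file standing in for a cloud bucket:

```python
import json
import os
import tempfile
from dataclasses import dataclass

@dataclass
class ArtifactRef:
    """Small reference passed between steps instead of the payload."""
    uri: str          # in a real pipeline, e.g. an s3:// or gs:// path
    format: str       # serialization format consumers should expect
    size_bytes: int

def persist_records(records: list) -> ArtifactRef:
    # Stand-in for writing cross-language data (e.g. ROS messages)
    # to a cloud bucket; here, a local temp file
    path = os.path.join(tempfile.mkdtemp(), "records.json")
    with open(path, "w") as f:
        json.dump(records, f)
    return ArtifactRef(uri=path, format="json",
                       size_bytes=os.path.getsize(path))

def consume(ref: ArtifactRef) -> list:
    # A downstream step only ever sees the lightweight reference
    with open(ref.uri) as f:
        return json.load(f)

ref = persist_records([{"topic": "/imu", "value": 1}])
roundtrip = consume(ref)
```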
        
       | llaolleh wrote:
       | I will check it out after work. Let me just say that this is
        | indeed a legitimate problem. After you train a model, it takes
        | me at least 3x the effort to deploy it and push it to
        | production.
       | 
        | I wish it were as easy as dragging and dropping the model onto
        | target servers after building it.
        
         | neutralino1 wrote:
         | Absolutely. Software development has many nice CI/CD patterns
         | to generate assets, test, deploy. It enables software engineers
         | to work fast with a safety net and have all pipelines
         | automated.
         | 
          | I think ML development is where web development was in the
          | late 2000s. Clear patterns and best practices have not yet
          | emerged.
         | For example, many ML developers work without reproducibility
         | and traceability enabled. This does not allow for fast and safe
         | work.
        
           | fisf wrote:
           | Ok, I'll bite.
           | 
           | disclaimer: this is a nice framework, will happily try it.
           | 
           | Imho: The underlying patterns are quite clear, and there are
           | various approaches to build stable pipelines.
           | 
            | I have used automation with basic containers + GitLab
            | actions / custom runners, ClearML, Earthly pipelines,
            | Kubeflow, etc. for this.
           | 
           | All of those can give reproducibility (experiment tracking,
           | code & dependencies, etc.) without much effort.
           | 
           | The last mile (model deployment) is often very specific, so
           | let's keep that out of scope.
           | 
           | But: The basic problem is cultural, not technical.
           | 
           | One stated goal of this project hits close to the root cause:
           | "Facilitating the transition from Jupyter Notebook prototype
           | code to steady production-grade pipelines".
           | 
           | As ML developers, we have to stop regarding notebooks as
           | anything that produces acceptable output (apart from initial
           | exploration). Work has to happen in structured, tracked, and
           | versioned codebases (~production code).
           | 
           | Anything that happened locally/in a notebook might as well
           | not exist from my point of view.
        
             | neutralino1 wrote:
             | That is very true. In our careers we have repeatedly
             | incentivized users to exit Notebooks early on.
             | 
             | Of course we understand why Notebooks are so appealing and
             | useful, and we think they have their place in the toolbox
             | (like a Python console for developers).
             | 
             | And you are correct that there already are many tools to
             | build reproducible traceable pipelines. What we have found
             | is that they are still too difficult to adopt. Which is why
             | we are trying to greatly lower the barrier to entry.
        
       | [deleted]
        
       | pottertheotter wrote:
       | How is this different from or complementary to Tecton?
        
         | josh-sematic wrote:
         | I'll confess I haven't used Tecton myself, but reading through
         | their documentation it seems that they are much more focused on
         | ETL style data pipelines with the final output being to a
         | feature store. Whereas Sematic is looking at the general end-
         | to-end ML pipelines (e.g. not just dataset
         | transformations/feature extractions, but also model training
         | and evaluation). In case it's helpful, we do have a page that
         | compares Sematic with some other tools in a similar domain to
         | us: https://docs.sematic.dev/sematic-vs#...-mlflow-pipelines
        
       | buntha wrote:
        | How is it different from MLFlow? Does the recent MLFlow
        | Pipelines product have any similarities?
        
         | josh-sematic wrote:
          | MLFlow in general focuses on the "lineage tracking" piece
          | alone, i.e. registering your models and such.
         | They do have an experimental pipeline product as you mention,
         | but it's different in a number of ways. I'd say the biggest one
          | is that MLFlow Pipelines has fixed, pre-determined structures
         | for your pipelines while Sematic lets you build your own. We
         | also have a more python-dev-friendly API and better access to
         | cloud computing resources. You can find more details on our
          | comparison page:
          | https://docs.sematic.dev/sematic-vs#...-mlflow-pipelines
        
       | wodenokoto wrote:
       | Do I still need to manage a kubernetes cluster?
        
         | neutralino1 wrote:
         | In the current open-source product, yes. But you can also
         | simply use Sematic locally without any infrastructure (`$
         | sematic start`).
         | 
         | In the coming months, we will provide a fully-hosted SaaS
         | offering to save you the hassle of maintaining your own
         | infrastructure.
        
       ___________________________________________________________________
       (page generated 2022-08-10 23:00 UTC)