[HN Gopher] Launch HN: Evidently AI (YC S21) - Track and Debug M...
       ___________________________________________________________________
        
       Launch HN: Evidently AI (YC S21) - Track and Debug ML Models in
       Production
        
       Hi HN, we are Elena and Emeli, co-founders of Evidently AI
       http://evidentlyai.com. We're building monitoring for machine
       learning models in production. The tool is open source and
       available on GitHub: https://github.com/evidentlyai/evidently. You
       can use it locally in a Jupyter notebook or in a Bash shell.
       There's a video showing how it works in Jupyter here:
        https://www.youtube.com/watch?v=NPtTKYxm524.
         
        Machine learning models can stop working as expected, often for
        non-obvious reasons. If this happens to a marketing
        personalization model, you might spam your customers by mistake.
        If this happens to a credit scoring model, you might face legal
        and reputational risks. And so on. To catch issues with the
        model, it is not enough to just look at service metrics like
        latency. You also have to track data quality, data drift (did
        the inputs change too much?), underperforming segments (does the
        model fail only for users in a certain region?), model metrics
        (accuracy, ROC AUC, mean error, etc.), and so on.
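         
        For illustration, this is roughly what such checks boil down to
        in ad hoc form: a quality metric per segment and a basic data
        quality stat, computed with pandas and scikit-learn (the file
        and column names here are hypothetical):
         
            import pandas as pd
            from sklearn.metrics import accuracy_score
         
            # Hypothetical prediction log with features, model output,
            # and ground truth.
            logs = pd.read_csv("prediction_logs.csv")
         
            # Overall vs. per-segment quality: a model can look fine on
            # average while failing badly for one region.
            overall = accuracy_score(logs["target"], logs["prediction"])
            print("overall accuracy:", overall)
            per_region = logs.groupby("region").apply(
                lambda g: accuracy_score(g["target"], g["prediction"])
            )
            print(per_region.sort_values().head())
         
            # Basic data quality: share of missing values per column.
            print(logs.isna().mean().sort_values(ascending=False))
         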
        Emeli and I have been friends for many years. We first met when
        we both worked at Yandex (the company behind CatBoost and
        ClickHouse). We worked on creating ML systems for large
        enterprises. We then co-founded a startup focused on ML for
        manufacturing. Overall we've worked on more than 50 real-world
        ML projects, from e-commerce recommendations to steel production
        optimization. We faced the monitoring problem ourselves when we
        put models in production and had to build custom dashboards.
        Emeli is also an ML instructor on Coursera (she co-authored the
        most popular ML course in Russian) and at a number of offline
        courses. She knows first-hand how many data scientists implement
        the same things over and over. There is no reason why everyone
        should have to build their own version of something like drift
        detection.
         
        We spent a couple of months talking to ML teams from different
        industries. We learned that there are no good, standard
        solutions for model monitoring. Some told us horror stories
        about broken models that went unnoticed and led to $100K+ in
        losses. Others showed us home-grown dashboards and complained
        that they were hard to maintain. Some said they simply have a
        recurring task to look at the logs once a month, and often catch
        issues late. It is surprising how often models are not monitored
        until the first failure. Many teams told us they only started
        thinking about monitoring after the first breakdown. Some never
        do, and failures go undetected.
        If you want to calculate a couple of performance metrics on top
        of your data, it is easy to do ad hoc. But if you want stable
        visibility into different models, you need to consider edge
        cases, choose the right statistical tests and implement them,
        design visuals, define thresholds for alerts, etc. That is a
        harder problem that combines statistics and engineering. Beyond
        that, monitoring often involves sharing the results with
        different teams, from domain experts to developers. In practice,
        data scientists often end up sharing screenshots of their plots
        and sending files here and there. Building a maintainable
        software system that supports these workflows is a project in
        itself, and machine learning teams usually do not have the time
        or resources for it.
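         
        To make the "statistics" part concrete, here is a minimal sketch
        of a drift check for one numerical and one categorical feature,
        using SciPy; the file names, column names, and the 0.05
        threshold are purely illustrative:
         
            import pandas as pd
            from scipy import stats
         
            reference = pd.read_csv("training_data.csv")  # reference data
            current = pd.read_csv("last_week_logs.csv")   # recent inputs
         
            # Numerical feature: two-sample Kolmogorov-Smirnov test.
            _, p_num = stats.ks_2samp(reference["price"], current["price"])
         
            # Categorical feature: chi-squared test on the frequency table.
            counts = pd.concat(
                [reference["country"].value_counts(),
                 current["country"].value_counts()],
                axis=1,
            ).fillna(0)
            _, p_cat, _, _ = stats.chi2_contingency(counts.values)
         
            ALPHA = 0.05  # alert threshold, tuned per feature in practice
            for name, p in [("price", p_num), ("country", p_cat)]:
                if p < ALPHA:
                    print(f"possible drift in {name}: p-value={p:.4f}")
         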
        Since there is no standard open-source solution, we decided to
        build one. We want to automate as much as possible to help
        people focus on the modeling work that matters, not boilerplate
        code.
         
        Our main tool is an open-source Python library that generates
        interactive reports on ML model performance. To generate a
        report, you provide the model logs (input features, predictions,
        and ground truth if available) and reference data (usually from
        training). Then you choose the report type and we generate a set
        of dashboards. We have pre-built several reports that detect
        things like data drift and prediction drift, visualize
        performance metrics, and help you understand where the model
        makes errors. We can display these in a Jupyter notebook or
        export them as HTML. We can also generate a JSON profile instead
        of a visual report. You can then integrate this output with any
        external tool (like Grafana) and build whatever workflow you
        want, e.g. to trigger retraining or alerts. Under the hood, we
        perform the needed calculations (e.g. a Kolmogorov-Smirnov or
        chi-squared test to detect drift) and generate multiple
        interactive tables and plots (using Plotly on the backend).
        Right now it works with tabular data only. In the future, we
        plan to add more data types and reports, and make it easier to
        customize metrics. Our goal is to make it dead easy to
        understand all aspects of model performance and monitor them.
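         
        For a sense of the workflow, here is a minimal sketch of
        building a data drift report and a JSON profile. The module and
        class names below follow the docs at the time of writing and may
        change, so please check the README for the exact imports and
        arguments:
         
            import pandas as pd
         
            # Names as in the current docs; they may change - check the
            # README if this does not import as written.
            from evidently.dashboard import Dashboard
            from evidently.tabs import DriftTab
            from evidently.model_profile import Profile
            from evidently.profile_sections import DataDriftProfileSection
         
            reference = pd.read_csv("reference.csv")      # e.g. training data
            current = pd.read_csv("production_logs.csv")  # recent model inputs
         
            # Interactive data drift report, saved as HTML (use .show()
            # to display it inline in a Jupyter notebook instead).
            dashboard = Dashboard(tabs=[DriftTab])
            dashboard.calculate(reference, current, column_mapping=None)
            dashboard.save("data_drift_report.html")
         
            # The same calculations as a machine-readable JSON profile,
            # to feed into Grafana, alerting, or a retraining job.
            profile = Profile(sections=[DataDriftProfileSection])
            profile.calculate(reference, current, column_mapping=None)
            drift_json = profile.json()
         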
        We differ from other approaches in a couple of ways. There are
        end-to-end ML platforms on the market that include monitoring
        features. These work for teams who are ready to trade
        flexibility for an all-in-one tool. But most teams we spoke to
        have custom needs and prefer to build their own platform from
        open components. We want to create a tool that does one thing
        well and is easy to integrate with whatever stack you use. There
        are also some proprietary ML monitoring solutions on the market,
        but we believe that tools like these should be open,
        transparent, and available for self-hosting. That is why we are
        building it as open source.
         
        We launched under the Apache 2.0 license so that everyone can
        use the tool. For now, our focus is to get adoption for the
        open-source project. We don't plan to charge individual users or
        small teams. We believe that the open-source project should
        remain open and be highly valuable. Later on, we plan to make
        money by providing a hosted cloud version for teams that do not
        want to run it themselves. We're also considering an open-core
        business model where we charge for features that large companies
        care about, like single sign-on, security, and audits.
         
        If you work at a tech company, you might think that many ML
        infrastructure problems are already solved. But in more
        traditional industries like manufacturing, retail, finance,
        etc., ML adoption is only just taking off. Their ML needs and
        environments are often very different due to legacy IT systems,
        regulations, and the types of use cases they work with. Now that
        many are moving from ML proof-of-concept projects to production,
        they will need tools to help run their models reliably.
         
        We are super excited to share this early release, and we'd love
        it if you could give it a try:
        https://github.com/evidentlyai/evidently. If you run models in
        production, let us know how you monitor them and whether
        anything is missing. If you need help testing the tool, we're
        happy to chat! We want to build this open-source project
        together with the community, so let us know if you have any
        thoughts or feedback.
        
       Author : elenasamuylova
       Score  : 61 points
       Date   : 2021-07-07 13:02 UTC (9 hours ago)
        
       | shcheklein wrote:
        | Hey, Elena, Emeli, congrats on the launch! Question: am I
        | right that you don't "dictate" specific monitoring
        | infrastructure? You provide a tool that can take a model and
        | calculate/plot certain important characteristics of it (e.g.
        | drift)? Do you have a plan to make connectors, templates, or
        | some additional infrastructure to simplify integration with
        | Grafana? What other monitoring infrastructure tools are you
        | considering integrating with?
        
         | elenasamuylova wrote:
          | True, we do not want to force users into a rigid workflow.
          | We built the tool so that it can integrate with other parts
          | of the ML stack. Right now you can, for example, use
          | Evidently to calculate whether the model has drifted and
          | then log the results in JSON format, or take only the parts
          | of the JSON you need (for example, specific metrics) and
          | send them to an external dashboarding tool. As long as you
          | can process a JSON output or store an HTML file, you can
          | build any workflow around that.
         | 
         | We plan to work on tutorials and prepare native integrations
         | with some popular tools like Grafana, MLflow, DVC and some
         | others.
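          | 
          | For illustration, the glue code for that kind of JSON
          | workflow can be as simple as the sketch below (the JSON
          | field names and the endpoint are hypothetical; see the docs
          | for the actual profile structure):
          | 
          |     import json
          |     import requests  # or any client for your metrics backend
          | 
          |     with open("drift_profile.json") as f:
          |         profile = json.load(f)
          | 
          |     # Hypothetical field names - pick out the metrics you
          |     # care about from the profile.
          |     n_drifted = profile["data_drift"]["n_drifted_features"]
          |     share = profile["data_drift"]["share_drifted_features"]
          | 
          |     # Push them wherever your monitoring lives: a time-series
          |     # DB behind Grafana, or a simple webhook for alerts.
          |     requests.post(
          |         "https://example.com/metrics",
          |         json={"n_drifted": n_drifted, "share_drifted": share},
          |     )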
        
       | rehabemam wrote:
       | That will be very helpful, excited to contribute and see this
       | growing, thanks for sharing.
        
         | elenasamuylova wrote:
         | Thanks! Looking forward to your contributions! <3
        
       | streetcat1 wrote:
        | Thanks for the info, so can you please elaborate more on how
        | you access the prediction logs? Is there a specific log
        | format?
       | How do you know the model input schema?
        
         | elenasamuylova wrote:
          | Right now we ask the user to prepare the logs on their side
          | (or schedule a job to push the logs to the tool). We learned
          | that most teams store the prediction logs anyway, since they
          | are usually used for retraining. So we thought this is the
          | simplest and most universal integration interface for now.
         | 
         | The tool now works with tabular data. Depending on the report
         | type you can include only the input features (e.g. for data
         | drift report), or also add the prediction and target column to
         | the table (e.g. for model performance report). So you might
         | need to perform some basic transformations (e.g. to add the
         | target column if this data comes later) to prepare the input.
         | 
         | To specify the schema, you need to configure a simple column
         | mapping (basically show where the target or prediction columns
         | are, and optionally specify which features are categorical and
         | numerical).
         | 
         | You can check the requirements for each report in the
         | documentation https://docs.evidentlyai.com/
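          | 
          | Roughly, the column mapping looks like the sketch below (the
          | column names are just an example; see the docs for the exact
          | keys and options):
          | 
          |     column_mapping = {
          |         "target": "churned",          # ground truth, if available
          |         "prediction": "churn_proba",  # model output column
          |         "numerical_features": ["tenure", "monthly_charges"],
          |         "categorical_features": ["plan", "region"],
          |     }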
        
         | emelidral wrote:
          | To add to this, if the column_mapping is not provided, we
          | try to parse the data automatically, assuming that the
          | schema is standard (e.g. you use column names like "target"
          | and "prediction"). We also infer feature types from the
          | pandas data types. In the future we want to make it super
          | easy to avoid writing extra configuration, so we will try to
          | parse as much as possible automatically, but of course give
          | the user the opportunity to override it.
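          | 
          | Conceptually, the dtype-based inference is along these lines
          | (a simplified sketch, not the actual implementation):
          | 
          |     import pandas as pd
          | 
          |     def infer_feature_types(df: pd.DataFrame):
          |         # Numeric dtypes -> numerical features; everything
          |         # else is treated as categorical.
          |         numerical = list(df.select_dtypes(include="number").columns)
          |         categorical = [c for c in df.columns if c not in numerical]
          |         return numerical, categorical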
        
       | fighterpilot wrote:
       | If the ground truth is unavailable in real time, are you still
       | able to detect anomalies with the inputs?
        
         | elenasamuylova wrote:
          | We do not directly detect anomalies, but we have two types
          | of checks to run when there are no actuals or ground truth
          | labels: 1) data drift, to compare the statistical
          | distributions of the input features to the past, and 2)
          | prediction drift, to compare the distribution of the model
          | predictions to the past.
         | 
         | To control the sensitivity of monitoring you can manually
         | decide if you want to monitor all features, or maybe only the
         | most important ones. This is not automated yet.
         | 
         | We also generate a few dashboards to show the relationship
         | between the features and predicted values - to help with visual
         | debugging.
         | 
         | We plan to add some unsupervised approaches like outlier
         | detection later on. But for the moment we do not have checks on
         | the level of individual objects in the data.
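          | 
          | As an illustration, the prediction drift check is
          | conceptually the same statistical comparison applied to the
          | prediction column (a sketch with made-up data and an
          | illustrative threshold):
          | 
          |     import numpy as np
          |     from scipy import stats
          | 
          |     # In practice these come from your logs: model scores
          |     # from the reference period vs. recent traffic.
          |     reference_preds = np.random.beta(2, 5, size=5000)
          |     current_preds = np.random.beta(3, 4, size=5000)
          | 
          |     _, p_value = stats.ks_2samp(reference_preds, current_preds)
          |     if p_value < 0.05:  # illustrative threshold
          |         print(f"prediction drift detected (p={p_value:.4f})")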
        
       | rodrigorivera wrote:
        | Hi Evidently team, congratulations on the launch! I like what
        | you are doing a lot. I work primarily with time-series data,
        | so I hope we will see a module in the future to handle
        | time-series problems such as forecasting. Also, please include
        | integration with Google Colab. My colleagues and I work
        | extensively with it, and it is the platform we use to test new
        | libraries. Otherwise, keep up the great work!
        
         | elenasamuylova wrote:
          | Thanks - Colab integration is on the near-term roadmap. You
          | can already use it to generate JSON profiles and export HTML
          | reports. We are now working on making it possible to display
          | all the interactive plots directly inside Colab.
          | 
          | Longer term, we also plan to add more reports for specific
          | problem types, such as time series or recommendation
          | systems.
         | Keep an eye on the repo!
        
       | billconan wrote:
        | How do you obtain ground truth in production?
        
         | elenasamuylova wrote:
         | Depends on the use case and available instrumentation!
         | 
         | If the feedback is available almost instantly (e.g. you
         | recommend something to a user based on a model prediction and
         | you know if they clicked on it or not), you can log the user
         | action in your data warehouse to have the ground truth easily
         | available for further analysis. Then you run the performance
         | reports on top of complete logs.
         | 
         | In other cases you might have to wait for the ground truth
         | (e.g. you predict the demand for some future period and then
         | wait for it to materialize, or you need to label the data
         | first). In this case you can log the ground truth to the data
         | warehouse once it becomes available and join with the
         | prediction logs. You can then run complete performance
         | monitoring with error analysis as a batch job. In the meantime,
         | you can still monitor the data drift.
         | 
         | Could you describe a specific use case and environment? We can
         | brainstorm how to best arrange it.
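          | 
          | For the delayed ground truth case, the batch job is
          | typically a join on a prediction id followed by the metric
          | calculation - a sketch with hypothetical table and column
          | names:
          | 
          |     import pandas as pd
          |     from sklearn.metrics import mean_absolute_error
          | 
          |     # Logs written at serving time: id, features, prediction.
          |     preds = pd.read_parquet("prediction_logs.parquet")
          |     # Ground truth that arrived later: id, target.
          |     actuals = pd.read_parquet("actuals.parquet")
          | 
          |     joined = preds.merge(actuals, on="id", how="inner")
          |     mae = mean_absolute_error(joined["target"],
          |                               joined["prediction"])
          |     print("MAE on matured predictions:", mae)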
        
       ___________________________________________________________________
       (page generated 2021-07-07 23:00 UTC)