[HN Gopher] Launch HN: Sarus (YC W22) - Work on sensitive data w...
       ___________________________________________________________________
        
       Launch HN: Sarus (YC W22) - Work on sensitive data with
       differential privacy
        
       Hi HN! Maxime, Nicolas, and Vincent here, founders of Sarus
       (https://www.sarus.tech). Sarus is privacy engineering software
       that lets data scientists work on data without needing to access
       it. It acts as a proxy between the practitioner and the data: all
       queries and data processing jobs are executed on the original
       data with the privacy guarantees of differential privacy.

       When data is sensitive, getting access can be a huge pain. It
       means going through a long manual validation process that
       includes designing and implementing an appropriate data
       anonymization scheme. It takes weeks to months, and some data
       utility may be lost to the masking requirements.

       Sarus makes all of this irrelevant by letting analysts work on
       data that is never accessed. Analysts only see the outputs of
       their data jobs, and those can be protected with appropriate
       privacy measures.

       With past lives in healthtech, finance, and marketing, we've
       experienced first-hand that data governance has come to take up
       a huge part of data operations. Protecting data is a legitimate
       objective, but it should not hamstring all innovation. For most
       data science or analytics objectives, the analyst has no interest
       in the information of a given individual. They look for patterns
       that are valid across the dataset. Access to user-level
       information is just an unfortunate way to get there. We decided
       to build Sarus so that data access is no longer a requirement.

       The Sarus API proxies all queries, compiles them into a
       privacy-safe version, runs them on the original data (which never
       moves outside of our clients' infrastructure), and returns the
       protected results to the practitioner. The protection relies on
       differential privacy, a mathematical definition of privacy
       already used by leading tech companies. Differential privacy
       works by adding calibrated randomness to outputs so that the
       information of any given individual cannot be inferred. One of
       its main benefits is that it makes no assumptions about what is
       sensitive in the data or about what the recipient of the output
       may already know or do. This makes it the ideal candidate for
       replacing all manual data governance processes with something
       fully automated. Each query gets rewritten by Sarus in a way
       that implements its core principles.
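
       As a toy illustration of that idea (this is not our production
       code), here is how the Laplace mechanism protects a simple count
       query by adding noise calibrated to the query's sensitivity:

           import numpy as np

           def dp_count(values, epsilon=1.0):
               # Adding or removing one individual changes a count by
               # at most 1 (sensitivity = 1), so Laplace(1/epsilon)
               # noise hides any single person's contribution.
               noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
               return len(values) + noise

           # Two neighboring datasets (differing by one record) yield
           # statistically indistinguishable outputs.
           print(dp_count(range(1000), epsilon=0.5))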

       For the core primitives of differential privacy, we leverage the
       latest research (Dwork & Roth 2014, Abadi 2016, Dong 2019,
       Koskela 2020, Wilson 2019) and open-source implementations
       (tensorflow-privacy, Google Differential Privacy, OpenDP,
       Smartnoise). Our key contribution is to bundle everything into
       an API that can be queried without seeing the data in the first
       place. It requires proper privacy accounting (we use PLD
       accounting as in Koskela 2020) but also setting all the
       technical parameters required by the framework (estimating the
       range of input data, allocating the privacy budget across
       computation steps...). We also optimize the privacy-utility
       trade-off by memoizing previous queries as much as possible.
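
       Very roughly, the accounting piece can be pictured as a budget
       tracker plus a query cache. The sketch below uses naive epsilon
       addition and made-up names purely for illustration; the real
       accountant relies on PLD composition, which is much tighter:

           class ToyAccountant:
               def __init__(self, total_epsilon):
                   self.total_epsilon = total_epsilon
                   self.spent = 0.0
                   self.cache = {}  # query text -> released answer

               def run(self, query, epsilon, execute):
                   if query in self.cache:
                       # Re-serving a past answer costs no extra budget.
                       return self.cache[query]
                   if self.spent + epsilon > self.total_epsilon:
                       raise RuntimeError("privacy budget exhausted")
                   answer = execute(query, epsilon)  # DP execution
                   self.spent += epsilon
                   self.cache[query] = answer
                   return answer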

       Wait, but the first thing data scientists do is check out the
       data. How do I do that now? Not a problem: by default, the API
       provides synthetic data samples with the same schema and
       statistical distribution. They effectively replace the need to
       see any record, and data scientists can still do feature
       engineering, test, and debug code with them. Of course,
       synthetic data is not something you would want to build insights
       or ML models on; you'd use the API to do that on the original
       data.

       How it works: the app is deployed in your cloud infrastructure
       (any cloud vendor is compatible). The data admin lists relevant
       data sources from the UI or the API, and grants learning access
       to practitioners by applying a privacy policy chosen from
       predefined templates. The synthetic data sample is generated
       automatically. From there, data scientists can run their
       analyses with their usual tools (pandas, numpy, TF,
       scikit-learn, Metabase, Redash, Tableau...), whether from a
       Python SDK or a hiveSQL connector.
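
       To give a flavor of the workflow, here is what a session might
       look like. The module name, methods, and dataset below are made
       up for illustration; they are not the exact SDK API:

           import sarus  # hypothetical import, for illustration only

           client = sarus.Client(url="https://sarus.internal.example")
           dataset = client.dataset("patients")  # access granted by admin

           # Explore the synthetic sample: same schema, fake records.
           sample = dataset.synthetic().as_pandas()
           print(sample.describe())

           # Submit the real job: it runs on the original data behind
           # the proxy, and only a privacy-safe result comes back.
           result = dataset.sql(
               "SELECT age_band, COUNT(*) FROM patients GROUP BY age_band"
           )
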
       Curious? We have released a self-serve demo for you to try it out.
       It lets you make a dataset available from the Sarus proxy, set up
       access policies and then, as a data practitioner, use it for
       analytics and machine learning. It is limited to a handful of
       datasets but should give you a good understanding of Sarus. You can
       sign up at https://demo.sarus.tech/signup and begin using Sarus for
       free, no credit card required (tutorial on
       https://www.sarus.tech/post/we-just-released-an-open-demo-tr...).

       Our model is a software license to run on our clients' cloud.
       Our pricing is on a per-dataset per-month basis and starts at
       $600/month.

       Please let us know what you think! We look forward to hearing
       your questions, feedback, ideas, and experience!
        
       Author : maximeago
       Score  : 102 points
       Date   : 2022-03-16 13:00 UTC (9 hours ago)
        
       | flyingyeti wrote:
       | The demo signup form has a required field "Token" that's blocking
       | me from signing up
        
         | maximeago wrote:
         | You can use Google SSO without a token. If you don't have a
         | Google account, can you contact us with the contact form? we'll
         | send you a token.
        
       | XCSme wrote:
       | It's unclear to me from the landing page hero section what the
       | product is/does or what problem it solves:
       | 
       | "PRIVACY-BY-DESIGN
       | 
       | Time-to-data: from months to minutes
       | 
       | Organizations that use Sarus outperform their peers at execution
       | speed for machine learning and analytics while being more secure
       | "
        
         | maximeago wrote:
         | The product solves the problem of the time it takes to access
         | sensitive data for analytics and machine learning. When you
         | work in a large healthcare or financial organization, each
         | dataset is highly protected. Each time a data practitioner
         | needs to work on it, they may have to wait for months for
         | compliance processes to opine on a data masking strategy and
         | engineering teams to prepare a data lab and implement this
         | strategy. With Sarus, data practitioners no longer need to
         | access data to do analytics or machine learning on sensitive
         | data assets.
         | 
         | When internal access to personal data is not a concern within
         | an organization, data sharing with external partners certainly
         | is. This process can be avoided just the same.
         | 
         | Hence the promise of taking time-to-data from months to
         | minutes.
         | 
         | Hope that helps clarify.
        
           | XCSme wrote:
           | Thanks, this makes it a lot more clear!
           | 
           | Maybe the hero text could be more clear, explaining in
           | summary what it does (similar to this comment). "Get instant
           | access to sensitive data for analytics and machine learning."
        
       | amzl wrote:
       | Pretty impressive. What do you use for synthetic data generation?
       | Also, you say in the blog post that it works with any type of
       | data. Can you tell us a bit more? Does it work for text and
       | images?
        
         | maximeago wrote:
         | We developed our own generative model for synthetic data
         | generation. It is an autoregressive model where each variable
         | is derived from previously generated ones using Transformer
         | networks. If you are interested, there are more details in
         | https://arxiv.org/pdf/2202.02145.pdf. When we say it works on
         | any type of data, we mean numerical, categorical, text, and
         | images, and compositions of those types (see the paper).
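         | 
         | Schematically, the generation loop looks like the sketch
         | below (a bare-bones illustration of the autoregressive idea,
         | not our actual model; in the paper each conditional is
         | parameterized with a Transformer):
         | 
         |     def sample_record(column_models, columns):
         |         # column_models[col] is any conditional sampler for
         |         # p(col | previously generated columns); hypothetical
         |         # interface, for illustration only.
         |         record = {}
         |         for col in columns:
         |             record[col] = column_models[col].sample(record)
         |         return record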
        
       | tnzk wrote:
       | Interesting. Is the API hosted with Sarus supposed to be used by
       | in-house analysts? Or 3rd party?
       | 
       | > Our key contribution is to bundle everything into an API that
       | can be queried without seeing the data in the first place.
       | 
       | Without being seen by whom?
        
         | maximeago wrote:
         | The API is designed to be hosted by our clients so that the
         | software runs directly on their data infrastructure and no
         | sensitive data leaves their systems. In this demo, it is
         | obviously hosted by us.
         | 
         | A big innovation is that, with Sarus, the data practitioner
         | does not need to see the data and can still manipulate it. Most
         | DP libraries are designed for researchers who have access to
         | the data. They can prepare the data however they like, tune
         | the libraries all they want, and eventually use the library
         | to produce protected outputs from the data. With Sarus,
         | someone who has never seen the data can achieve the same.
        
       | umanwizard wrote:
       | FYI: The headings on your Careers page are all in French, e.g.
       | "Qui sommes-nous ? ", but the actual content is all in English.
        
         | maximeago wrote:
         | Thanks for catching it! We'll fix it.
        
       | ZeroCool2u wrote:
       | So, I work in an org that has truly sensitive data and this has
       | been a barrier for us more times than I can count, so this is
       | obviously very interesting to us and something we've thought
       | about a lot. A couple questions I have are:
       | 
       | 1. How well does Sarus work with data that is not in a database,
       | like unstructured data such as documents/text?
       | 
       | 2. How does Sarus handle 'legacy' DB's, where the schema for a
       | table might not be quite right, but due to operational
       | constraints these schemas can't be easily corrected? The
       | canonical example I'm thinking of is date times that have been
       | specified as strings and no one bothered to change them.
       | 
       | 3. What kind of language support exists for interacting with the
       | Sarus proxy? Obviously, you have Python support but for large
       | enterprises that might need Sarus oftentimes there are a few
       | languages that are popular internally and all need equal support.
       | I think the comprehensive list of analytics languages in use in
       | large orgs would look something like, [python, R, Julia, Matlab,
       | SAS]. Rust/C++ support would be ideal as well as they're commonly
       | used in Python/R to accelerate hot code. Do you have plans to
       | develop SDK's? Would they be hand crafted or do you plan to
       | develop generated SDK's similar to how GCP does it?
       | 
       | 4. Are you moving to get any security certs? Of course you're a
       | startup right now, but I know from experience enterprise orgs
       | will still blindly ask questions like, "Are you FedRamp
       | Moderate/High certified?" (This doesn't even make sense for your
       | sales model and I'm certain you'll still have to answer this
       | question and explain why over and over.) or "Do you have a Soc 2
       | Type 2 report we can look at?". The orgs that actually need
       | something like this are going to be asking these questions pretty
       | quickly.
       | 
       | 5. When I use Sarus, do I have to use your IDE/interface? One of
       | the things I noticed when looking at your demo gifs is there is a
       | lot of use of notebooks, which of course are popular, but you'll
       | be met with a lot of resistance if your users can't use the
       | tooling they prefer (PyCharm Pro / DataGrip plugin to interact
       | with DB's in my team's case).
       | 
       | 6. How exactly is Sarus deployed? Terraform? Is it a
       | containerized application? Does it scale vertically or
       | horizontally? Can its logging mechanism integrate with
       | StackDriver, Splunk, or Cloudtrail?
       | 
       | 7. Have you proved out the technology with more complex time
       | series data? I'm thinking of sensitive trading data.
       | 
       | 8. Do you provide benchmarks for showing that a model trained on
       | a real dataset is equivalent in performance to a model trained
       | on the synthetic dataset?
       | 
       | Super cool product and you're in a great position to make a ton
       | of money if you nail the execution and get some large customers!
        
         | maximeago wrote:
         | 1. Sarus works on data that is organized in records. The
         | intuition is that no single record should show through in the
         | results (hence protecting privacy), but studying all records
         | jointly should be possible. It may be flat files, parquet
         | files, etc., but we do need this record-level organization.
         | In a given record, there may be columns that are text or
         | images; Sarus will work fine. We have never worked on PDF
         | documents. Conceptually it could work, but this is quite far
         | down the road.
         | 
         | 2. Sarus has connectors to the main databases and we add more
         | as we encounter them. The basic assumption is that the
         | experience should be the same as working on the data in its
         | original form. For instance, if your data is in a CSV with a
         | weird date format, you will be able to (i) get synthetic data
         | with this same weird date format, and (ii) apply Python code
         | that transforms this weird date format into something more
         | conventional and use that reformatted version. When running
         | your data job, Sarus will apply your preprocessing code and
         | take it from there.
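         | 
         | For example, such a preprocessing step could be as simple as
         | the generic pandas function below (the column name and date
         | format are made up for illustration):
         | 
         |     import pandas as pd
         | 
         |     def fix_dates(df):
         |         # Parse a legacy string column like "16-Mar-2022
         |         # 13:00" into a proper datetime; bad values -> NaT.
         |         df["admission_date"] = pd.to_datetime(
         |             df["admission_date"],
         |             format="%d-%b-%Y %H:%M",
         |             errors="coerce",
         |         )
         |         return df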
         | 
         | 3. Today we have a python SDK and a SQL connector. Both
         | leverage the same low-level API. We may build other SDKs for
         | other languages but haven't started doing so.
         | 
         | 4. Indeed, we don't have any cert yet but we are looking into
         | getting some soon. We are about to start Soc2 for instance.
         | This is somewhat less of a requirement as we never host any of
         | our clients' data. Of course, everything that helps get the
         | green light of the ITSec team is useful.
         | 
         | 5. The Python SDK is standard Python code, so you can use it
         | in any Python env. The notebook is just there to make demos
         | more user-friendly. Same for SQL: you can use any SQL
         | querying tool; we did the demo with Metabase.
         | 
         | 6. The easiest way is to deploy a Docker image with Docker
         | Compose. It does not scale across multiple machines yet (stay
         | tuned). In that sense, big data sources are only partially
         | supported: if the source is Redshift and you submit a SQL
         | query to the API, we'll rewrite it and send it to Redshift
         | (which scales), but if you want to do ML on the same data, we
         | won't be able to scale the same way.
         | 
         | 7. Complex time series is not a problem for the remote
         | execution part provided it is stored in a traditional format.
         | That being said, we don't have a specific synthetic data model
         | for time series yet, so that part of the experience will be a
         | bit different.
         | 
         | 8. This is a debate we leave to researchers because there is
         | not a single answer. It depends directly on the number of
         | records in your dataset and the dimensionality of your data.
         | However, you can set up privacy policies so that the weights
         | of an ML model trained without DP are allowed to be shared.
         | This is considered acceptable by 99% of compliance teams in
         | the world today, so it's not a huge compromise. If you use
         | Sarus this way, you are guaranteed to get exactly the same
         | performance.
         | 
         | Would love to continue the conversation offline of course!
        
       | amitport wrote:
       | The blog post, demo, and website are incredibly uninformative
       | (maybe informative, but not about your own product's details).
       | Eventually, clicking "getting started" leads to a sign-up-for-
       | updates page.
        
         | ganzuul wrote:
         | The PySyft project is well-documented and researched if you
         | want to learn about the technology.
        
           | amitport wrote:
           | I'm very familiar with PySyft and TensorFlow Federated (and
           | Duet, which may be the open-source basis for this kind of
           | product). I have a lot of interest in the topic, and that's
           | why I was seriously scanning the website and trying to
           | understand what the product is exactly. I failed.
        
           | maximeago wrote:
           | Yes, this is a very rich resource. Thx
        
           | tnzk wrote:
           | Is PySyft used under the hood?
        
             | maximeago wrote:
             | No, we do not. PySyft was primarily designed for
             | federated learning. Sarus targets organizations that have
             | their data in one central repository, in a
             | trusted-curator model. It lets external data
             | practitioners query that data with all sorts of data jobs
             | (not just ML, but also SQL analysis, and soon Spark).
        
         | maximeago wrote:
         | The first link is the corporate website; it may not include
         | all the product details you expected, sorry about that. You
         | should get a lot more detail on how it works if you try the
         | tutorial and play with it yourself. It is at the bottom of
         | the post; hopefully it satisfies your curiosity, but I'm
         | happy to answer outstanding questions here of course.
        
           | amitport wrote:
           | Is this a productized Duet[1]? Are you using it under the
           | hood?
           | 
           | (As far as I'm concerned, if the answer is yes to both, this
           | has much potential. I'm just trying to figure out what I'm
           | looking at)
           | 
           | Thank you!
           | 
           | [1]: https://blog.openmined.org/duet-demo-how-to-do-data-
           | science-...
        
             | maximeago wrote:
             | Yes, there are many parallels with Duet; you can look at
             | Sarus as a productized version of it.
             | 
             | There are some differences though:
             | 
             | - we designed for the trusted curator model, whereas Duet
             | mostly has federated learning tasks in mind
             | 
             | - the privacy policies are based on principles (such as:
             | "DP-outputs with epsilon < 2 can be shared",
             | "DP-synthetic data can be shared", or "weights of ML
             | models can be shared"), and the gateway applies those
             | principles to any query, whether it is a SQL query, an
             | ML model, or something else. In Duet, it's all about
             | manual validation of given queries.
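             | 
             | In pseudo-code, such a policy gate looks roughly like
             | this (the field and policy names are made up for
             | illustration, not our actual schema):
             | 
             |     def is_releasable(output, policy):
             |         if output.kind == "dp_aggregate":
             |             return output.epsilon <= policy.max_epsilon
             |         if output.kind == "dp_synthetic":
             |             return policy.allow_synthetic
             |         if output.kind == "ml_weights":
             |             return policy.allow_model_weights
             |         return False  # block anything else by default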
        
         | shmatt wrote:
         | It looks like someone asked an AI model to generate a website
         | for the next YC company
         | 
         | this page specifically https://www.sarus.tech/solutions just
         | screams "UX is an afterthought"
        
       | nicolasmesselet wrote:
       | Very interesting indeed. I read that Sarus has been designed
       | with the data scientist persona in mind. Would it also easily
       | solve internal access for other engineering teams? Basically,
       | allowing engineering teams to create new features, including
       | sensitive data masking for all the local/staging/production
       | environments? Is the Sarus approach also validated by privacy
       | or security authorities?
        
         | maximeago wrote:
         | Sarus is designed for all data use cases, provided that access
         | to a given user's information is not the objective. This is the
         | case for all of BI, analytics, or machine learning. It also
         | works for testing or debugging, building APIs, etc. It
         | resonates with organizations' aspiration for the
         | democratization of data.
         | 
         | Differential privacy provides much better protection than
         | data masking, but most importantly, it does not require any
         | manual decisions (which column to mask, how, etc.). This is
         | what makes it easy to apply at scale to all datasets in the
         | data warehouse or data lake, instead of requiring
         | dataset-by-dataset decision making.
         | 
         | Differential privacy is used by Apple, Google, Microsoft, and
         | the US Census. When used properly, the data protection it
         | provides does not need to be proven to regulators or security
         | teams anymore. That being said, regulators do not require DP
         | protection per se. They require organizations to put in place
         | the best practices in terms of data governance, data
         | minimization, or data security as a whole. This is part of the
         | answer.
        
           | fluidcruft wrote:
           | I think this is interesting but I'm having trouble seeing how
           | it would apply to the sorts of machine learning tasks that
           | are drawing heavy interest in a radiology department. How
           | does it apply to, say, development or testing of image
           | segmentation tools? Quite often vendors want to sell us
           | software and we would very much like to test it at scale on
           | our own data to see whether it's trash or not because
           | procurement is a beast. Does this sort of tool provide that
           | sort of an interface somehow? I can see how it works for
           | tablular data, I'm just not sure how you can guarantee PHI is
           | fuzzed sufficiently in images.
        
             | maximeago wrote:
             | Here is how it would work in theory (not including the
             | scalability question of working with heavy DICOM files and
             | huge DNN). I'm assuming your data is made of records
             | composed of an image and some information about the image
             | or the patient.
             | 
             | The system will generate a fake dataset with the exact same
             | structure and schema (the information on patients is
             | realistic, the images look reasonable and, importantly,
             | have the right encoding, size, etc.). The purpose of this
             | fake
             | data is for the vendor to adjust their algorithm to be able
             | to consume your data as it is. The vendor builds up the
             | preprocessing on the fake data and then submits their data
             | job to the API (say a preprocessing function to be applied
             | on each record and a Tensorflow model to be fitted on the
             | data, or just to measure the performance on the data). The
             | preprocessing code runs on the original records, the model
             | would be trained or validated against the real data. In the
             | end they can prove the value of their model without having
             | to get their hands on the real data.
        
               | fluidcruft wrote:
               | The problem we generally have is that plugging the
               | vendor's [insert tensorflow model component] into our
               | network seems to always become an operational no-go prior
               | to purchase due to a variety of reasons including
               | intrusiveness and questions about privacy and the
               | vendor's ability to manipulate the process to get access
               | to datasets. So it's actually the preprocessing step
               | that we keep hitting as the pain point. In some cases
               | we generate de-identified datasets for demonstration and
               | testing but it can be very labor intensive.
               | 
               | I've not encountered differential privacy in my work
               | before now, but at least for dealing with metadata in the
               | DICOM it could probably be helpful for some datasets. But
               | it could still be challenging to ensure the IODs are
               | correct (or that known quirks are preserved). Anyway this
               | is very interesting. I have a colleague who is working on
               | some utilization/value research using billing records and
               | I'll show him this.
        
               | maximeago wrote:
               | Thanks! Our goal is that no matter what preprocessing
               | function they pass, they only end up accessing outputs
               | that comply with the privacy policies. The code gets
               | access to the real data but it is shielded from the
               | vendor, who can only see protected outputs. It should
               | address the risk of private information being exposed
               | to them, but for sure, the more sophisticated the
               | preprocessing code, the more challenging it becomes.
               | Deep learning on DICOM data pushes the system to the
               | edge a bit.
        
       | leecarraher wrote:
       | for synthetic data generation, what methods are they using to
       | sample data from the distribution? What assumptions about the
       | distribution are being made? Does it model correlations between
       | sample attributes that could adversely affect some ML methods
       | (multicollinearity can cause problems)?
        
         | maximeago wrote:
         | We developed our own generative model for synthetic data
         | generation. It is an autoregressive model where each
         | variable/attribute is derived from previously generated ones
         | using Transformer networks (more details here:
         | https://arxiv.org/pdf/2202.02145.pdf). So yes, correlations
         | are modelled, although exact multicollinearity (when there is
         | a linear relationship among a set of attributes) would be a
         | bit blurry in the synthetic data.
         | 
         | That being said, the goal of Sarus is to enable analysis on
         | the original data with privacy guarantees on the results
         | (synthetic data is merely used as a tool and a fallback when
         | there is no better solution), so you can write a statistical
         | test to detect multicollinearity and run it on the original
         | data within Sarus.
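         | 
         | For example, a standard variance-inflation-factor check,
         | like the generic sketch below (in practice you would submit
         | it through the API so that it runs on the original data):
         | 
         |     import pandas as pd
         |     from statsmodels.stats.outliers_influence import (
         |         variance_inflation_factor,
         |     )
         | 
         |     def vif_table(df):
         |         # VIF well above 5-10 usually signals problematic
         |         # multicollinearity among numeric columns.
         |         X = df.select_dtypes("number").dropna()
         |         return pd.Series(
         |             [variance_inflation_factor(X.values, i)
         |              for i in range(X.shape[1])],
         |             index=X.columns,
         |         )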
        
       ___________________________________________________________________
       (page generated 2022-03-16 23:00 UTC)