[HN Gopher] Launch HN: Synth (YC S20) - Realistic, synthetic tes...
       ___________________________________________________________________
        
       Launch HN: Synth (YC S20) - Realistic, synthetic test data for your
       app
        
       Hey! Christos, Damien and Nodar here, and we're the co-founders
       of Synth (https://getsynth.com) - an API which allows you to
       quickly and easily provision test databases with realistic data
       with which to test your application.

       We started our company about
       a year ago, after working at a quantitative hedge fund in London
       where we built models to trade US equities. Strangely, instead of
       spending time developing models or building the trading system, a
       large portion of our time was spent on just sourcing and
       onboarding datasets to train and feed our models. The process of
       testing and onboarding datasets was archaic; one data
       provider served us XML files over FTP which we then had to spend
       weeks transforming for our models to ingest. A different provider
       asked us to spin up our own database and then sent us a binary
       which was used to load the data. We had to whitelist their API's
       IP address and set up a cron job to make sure the dataset was
       never out of date. The binary took interactive input, so it
       couldn't be scripted without something to mock the interactive
       parameters. All this took a junior developer on the team a good
       3-4 days to figure out and set up. Furthermore, after our trial
       expired we decided we didn't actually need this dataset so those
       3-4 days were essentially wasted. Our frustration around the
       status-quo in data distribution is what drove us to start our
       company.

       We spent the first 6 months building a privacy-aware
       query engine (think Presto but with built in privacy primitives),
       but software developers we talked to would frequently divert the
       topic to the lack of high quality, sanitised testing data during
       the software development lifecycle. It was strange - most of us
       developers and data scientists constantly use some sort of testing
       data for different reasons. Maybe you want a local development
       environment which is representative of production but clean from
       customer data. Or a staging environment which contains a much
       smaller, representative database so that tests run faster. You
       could want the dataset to be much bigger to test how your
       application scales. Maybe you want to share your database with 3rd
       party contractors who you don't necessarily trust. Whichever way
       you put it, it's strange that for a problem most of us face every
       day, we have no idiomatic solution. We write bespoke scripts and
       pipelines which often break. They are time consuming to write and
       maintain and every time your schema changes you need to update them
       manually. Or we get lazy and copy/paste production.

       We finally listened to all this feedback, dropped the previous
       product, and built Synth instead. Synth is a platform for
       provisioning databases with completely synthetic data.

       The way Synth works can be broken
       into 3 main steps. You first download our CLI tool (a bunch of
       python wrapped up in a container) and point it at your database to
       create a model (we host the models on the Synth platform). This
       model encodes your schema and foreign-key relationships, as well as
       a semantic representation of your types. We currently use simple
       regular expressions to classify the semantic types (for example an
       address or license plate). The whole model is represented as a JSON
       object - if the classifier gets something wrong you can easily
       change the semantic type. Once the model has been created, the next
       step is to train the model. Under the hood we use a combination of
       copulas and deep-learning models to model the distributions and
       correlations in your dataset (the intuition here is that it's much
       more useful for developers to have realistic data than just sample
       from a random number generator). The final step is to use the
       trained model to generate synthetic data. You can either sample
       directly from the model or we can spin up a database for you and
       fill it with as much data as you need. The generation step samples
       from the trained model to create realistic data, as well as
       utilising bespoke generators for sensitive fields (credit card
       numbers, names, addresses, etc.).

       You can run the entire lifecycle in a single command: you point
       the CLI tool at your database (currently Postgres, MySQL and
       MSSQL) and in ~1 minute you get an IP address and credentials
       for your new database with completely synthetic data.

       We're long-time fans of HN and are eagerly looking
       forward to feedback from the community (especially criticism).
       We've made a free version available for this week so you can try it
       with no strings attached. We hope some of you will find Synth
       useful. If you have any questions we'll be around throughout the
       day. Also feel free to get in touch via the site.  Thanks! ~
       Christos, Damien & Nodar
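       (The regex-based semantic-type classification described above
       can be pictured with a small sketch. The patterns and type
       names here are illustrative, not Synth's actual classifier.)

```python
import re

# Illustrative patterns for a few semantic types; a real classifier
# would use many more patterns and sample several rows per column.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "license_plate": re.compile(r"^[A-Z]{1,3}[- ]?\d{3,4}$"),
    "credit_card": re.compile(r"^\d{4}([- ]?\d{4}){3}$"),
}

def classify(sampled_values):
    """Return the first semantic type matching every sampled value,
    falling back to plain 'text'."""
    for semantic_type, pattern in SEMANTIC_PATTERNS.items():
        if all(pattern.match(v) for v in sampled_values):
            return semantic_type
    return "text"

print(classify(["alice@example.com", "bob@test.org"]))  # email
print(classify(["ABC-1234", "XY 987"]))                 # license_plate
```

       If the classifier guesses wrong, the resulting type is just a
       field in the model's JSON object, so it can be edited by hand.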
        
       Author : openquery
       Score  : 77 points
       Date   : 2020-08-18 13:09 UTC (9 hours ago)
        
       | joshAg wrote:
       | For testing i care a lot about repeatability.
       | 
       | Specifically, i'm interested in testing a web dashboard/app. So
       | if I use synth to populate my db, how would I know whether the
       | backend's endpoints are giving me good data? Is there a way to
       | guarantee a specific set of test data each time (so i can
       | precompute what the values should be), or will i need to start a
       | test run by querying the data base a bunch to see what's in it to
       | figure out what i should expect the test results to be?
       | 
       | Also, is there a way to prepare data for import into an existing
       | db? Right now for some of our testing we have a single staging
       | instance and we deconflict multiple tests by including a
       | randomized 8 character string in all the relevant IDs for
       | precomputed data we insert as part of the testing initialization.
       | For this testing it's not as important that the data is
       | repeatable, but the testers have a few different scenarios they
       | want to test, so I'd need a way to make a low-data, medium-data,
       | and high-data test set where the backing data fit within some
       | ranges.
        
         | openquery wrote:
         | Hey!
         | 
         | > Is there a way to guarantee a specific set of test data each
         | time
         | 
         | Absolutely. You can seed the model so that the data you get
         | each time is completely reproducible.
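         | (A toy sketch of the idea - not Synth's actual sampler -
         | seeding the generator pins down the whole output:)

```python
import random

def sample_rows(seed, n):
    """Toy generator: the same seed always yields the same rows."""
    rng = random.Random(seed)
    return [(rng.randint(1, 10**6), rng.choice(["a", "b", "c"]))
            for _ in range(n)]

# Two runs with the same seed produce identical test data.
assert sample_rows(42, 5) == sample_rows(42, 5)
```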
         | 
         | > For this testing it's not as important that the data is
         | repeatable, but the testers have a few different scenarios they
         | want to test, so I'd need a way to make a low-data, medium-
         | data, and high-data test set where the backing data fit within
         | some ranges.
         | 
         | This is a great use-case for Synth. With the upcoming Firehose
         | API you can point it at an existing database and specify how
         | much synthetic data you want to generate and pump into your db.
         | 
         | For now you can either create a database and write the ETL, or
         | do `synth model sample <model-id> --output <some-directory>
         | --sample-size <number-of-rows>` to sample directly from the
         | model into a directory of CSV files and use that to load your
         | database
         | 
         | Feel free to get in touch if you would like to learn more :)
        
       | svsaraf wrote:
       | For those of you who feel this solution is a bit too complex for
       | your workflow, there are a couple of lightweight alternatives,
       | including Sudopoint (https://www.sudopoint.com) which lets you
       | specify what you need and download a CSV, in and out in a few
       | seconds.
       | 
       | To the Synth team, awesome product! Great to see that more tools
       | are getting built to help testing / QA workflows. I think this is a
       | huge area for the future. Welcome to the competition. :)
       | 
       | [Disclaimer] I'm the (solo, bootstrapped) founder of Sudopoint
        
       | withinboredom wrote:
       | Does this work with unstructured data (such as cosmosdb?)
        
         | openquery wrote:
         | Not yet - but it's on our roadmap. Feel free to get in touch if
         | you would like this to be accelerated and we can find out more
         | about your use case :)
        
       | brosky117 wrote:
       | Congrats on shipping Christos, Damien, and Nodar! I really like
       | this idea. I have this problem at my company.
       | 
       | Two questions:
       | 
       | First, we're using Postgres and some of our tables use JSON.
       | Would Synth be able to generate realistic JSON? Sometimes this is
       | configuration (which would need to be straight copied) and other
       | times it would be data (which would need to keep the same keys
       | but have generated values). Is this use case supported?
       | 
       | Second, I'm concerned about giving Synth access to my data as
       | much of it is sensitive. I understand that you need access to
       | production data to offer the service. What can you tell me about
       | your data security to help me feel more comfortable? (i.e. What
       | kind of data would you have stored on your end? How does the CLI
       | work? etc)
       | 
       | Congrats again and good luck!
        
         | openquery wrote:
         | Thanks and great questions!
         | 
         | > First, we're using Postgres and some of our tables use
         | JSON...
         | 
         | We've seen this before when talking to a company we were
         | considering for a pilot pre-launch - it's on our roadmap.
         | Currently the JSON text would be treated as a string, i.e. it
         | is classified as a categorical type or text.
         | 
         | What we would want is for the classifier to traverse the JSON
         | object instead of treating it like text. This feature is going
         | to be implemented when we extend to NoSQL databases.
         | 
         | > Second, I'm concerned about giving Synth access to my data as
         | much of it is sensitive.
         | 
         | Absolutely. This has been one of the guiding principles in
         | building Synth. We've built it so that our servers _never_ have
         | to see any sensitive information. (Hence why you can use Synth
         | via a CLI tool instead of an API)
         | 
         | Also:
         | 
         | 1) The CLI is soon to be OSS, giving full visibility into
         | exactly what's happening when you use it. (Really it's OSS now
         | since you can just take a look at the source code running in
         | the container, we just haven't had the time to make our repo
         | public)
         | 
         | 2) The models are designed to be transparent. You can inspect
         | them by running `synth model inspect <model-id>`. This gives
         | you visibility into exactly what the model looks like. (Looking
         | at the data which has been sampled is still a WIP)
         | 
         | 3) If something goes wrong and sensitive information is
         | uploaded to the Synth platform, you can easily purge all traces
         | of it using `synth model rm <model-id>`
        
           | sbecker wrote:
           | > We've built it so that our servers never have to see any
           | sensitive information.
           | 
           | If true, this is a key selling point and should probably be
           | somewhere near the top of the homepage. I didn't get that
           | point from reading any of the copy.
        
             | openquery wrote:
             | Thanks for the feedback. I'll make sure this is clear.
             | 
             | Why is this important for you?
        
       | lukeqsee wrote:
       | As billed, this is good stuff.
       | 
       | I have a client who has millions of rows of data in production--
       | and we have to run our test suite against production because they
       | have no curated staging data set. This would allow us to save
       | multiple minutes every dev pipeline and local test run (which are
       | typically too slow to even run locally).
       | 
       | Looking forward to see you growing!
        
         | lukeqsee wrote:
         | This same client is a bit of a penny-pincher.
         | 
         | Are there any plans to open this up so we could host the
         | infrastructure and then pull a SQL import dump or something
         | along those lines after running the CLI part? This would reduce
         | your ongoing costs to reduce our monthly fee? ($130 would be a
         | very tough sell, even though I think the business value is
         | there.)
        
           | openquery wrote:
           | Hey!
           | 
           | So we are soon introducing the Firehose API. Basically this
           | allows you to point at an arbitrary database and fill it up
           | with as much data as you need from the model.
           | 
           | The Firehose should work for your use-case and be much
           | more cost effective.
           | 
           | A hackier solution for right now: you can spin up a
           | database and run a `select * ...` dump.
        
             | lukeqsee wrote:
             | That's perfect! I'll keep an eye out for that.
        
               | openquery wrote:
               | If you can't wait, you can always run `synth model sample
                | <model-id> --output <some-directory> --sample-size
               | <number-of-rows>` which will generate synthetic data
               | directly into your directory as CSV files. You can then
               | ETL that into your database.
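                | (A minimal version of that ETL step might look like
                | the sketch below - table and column names are made
                | up, and sqlite3 stands in for the target database:)

```python
import csv
import io
import sqlite3

def load_csv(conn, table, fh):
    """Bulk-insert a CSV (with a header row) into an existing table."""
    reader = csv.reader(fh)
    header = next(reader)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {table} ({', '.join(header)}) "
        f"VALUES ({placeholders})",
        reader,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
load_csv(conn, "users", io.StringIO("id,name\n1,alice\n2,bob\n"))
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```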
               | 
               | Hope this helps :)
        
       | sleepygardener wrote:
       | Sorry to say this, but the name "synth" is terribly misleading
       | and generic. The word "synth" is widely used for the electronic
       | musical instrument, the synthesizer.
        
         | openquery wrote:
         | I wouldn't say it's misleading but I see where you're coming
         | from. I play the piano so this was what inspired the name.
         | 
         | It turns out that picking a name for a startup/product which is
         | representative of what you do is hard!
        
       | trulala wrote:
       | How does it compare to Delphix?
        
       | silverlake wrote:
       | I implemented a similar system a while ago, including
       | differential privacy. The data at my firm was so messy the models
       | failed miserably. You really need an analysis phase that can tell
       | a customer whether their data will work or not. I.e. weird
       | distributions, crazy foreign keys, difficult data types.
        
         | openquery wrote:
         | Yes - you're absolutely right in that data is a messy business.
         | 
         | Even in the early days we've seen crazy data types and
         | constraints that make our job of completely automating the
         | process hard. However, every instance of this makes the product
         | better and this transfers to the next customer.
         | 
         | > You really need an analysis phase that can tell a customer
         | whether their data will work or not
         | 
         | This is part of the roadmap, but it's a non-trivial piece of
         | engineering. In the meantime you can try it for free and see if
         | it works for you :)
        
       | Tarrosion wrote:
       | This looks really cool. One question I have is about how much the
       | synthetic data can protect privacy. For example, my company has
       | geospatial event data from our customers. We're very protective
       | of customer identities, and wouldn't want to expose which cities
       | our customers are in. If a model trained on our database notices
       | that the "longitude" column marginal distribution has a spike
       | around (just as an example) -71 degrees (longitude of Boston,
       | where we're located), then presumably the synthetic data would
       | also include a bunch of longitudes near -71 degrees? But there
       | aren't that many cities at longitude -71 degrees, so even the
       | marginal distribution of the synthetic longitudes would reveal
       | something private about our data.
       | 
       | Second question is whether y'all support geospatial data? Both in
       | the sense of "the topology of latitudes and longitudes is not a
       | plane" and "can the model be trained on databases which encode
       | geometries as a single column?"
        
       | hans_castorp wrote:
       | Can this be installed on premise? Especially in the light of GDPR
       | it might not be possible to do something like this with data
       | stored "on the outside" (even if it's only a "model").
       | 
       | I know for sure, our customers wouldn't allow this.
        
         | openquery wrote:
         | Hey - great question!
         | 
         | We've been careful to design Synth such that the model doesn't
         | contain any sensitive information. That being said I completely
         | understand where you're coming from.
         | 
         | We do offer the enterprise version for on-prem deployments.
         | Basically, if you have a Kubernetes cluster you can run Synth
         | on-prem :)
        
       | graerg wrote:
       | > Under the hood we use a combination of copulas and deep-
       | learning models to model the distributions and correlations in
       | your dataset (the intuition here is that it's much more useful
       | for developers to have realistic data than just sample from a
       | random number generator)
       | 
       | This is neat, but do users have the option of just doing vanilla
       | RNG if they want?
        
         | openquery wrote:
         | Hey - good question.
         | 
         | Not right now, but it shouldn't be hard to implement. Is
         | there a specific use-case this would address?
        
           | graerg wrote:
           | > it shouldn't be hard to implement
           | 
           | Yeah it seems like it's just a flat/un-informed probability
           | distribution and I'd guess your models are general enough to
           | accommodate that.
           | 
           | A couple use cases come to mind:
           | 
           | 1. If I have no data but want to test out various/arbitrary
           | schemas with just a bunch of dummy data. Of course, I could
           | generate it myself (either with ad hoc scripts or building a
           | more general CLI that does this for me), but if Synth just
           | makes it a one-liner in the command line, that's appealing.
           | 
           | 2. If it's too burdensome to convince others in my org that
           | you've "built it so that our servers never have to see any
           | sensitive information". Even if I trust you, I then have to
           | make arguments for others to also trust you, when really if
           | all I need is some random data for an empty schema, then
           | that's a whole can of worms I don't need to open.
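           | For use case 1, the generator really is just a few lines
           | (a sketch - the schema format and type names are made up):

```python
import random
import string

def random_value(col_type, rng):
    """Uninformed sampling per column type - no model involved."""
    if col_type == "int":
        return rng.randint(0, 1000)
    if col_type == "float":
        return rng.random()
    # default: a short random string
    return "".join(rng.choices(string.ascii_lowercase, k=8))

def dummy_rows(schema, n, seed=0):
    """Fill an arbitrary schema with vanilla-RNG dummy data."""
    rng = random.Random(seed)
    return [{col: random_value(t, rng) for col, t in schema.items()}
            for _ in range(n)]

rows = dummy_rows({"id": "int", "score": "float", "name": "str"}, 3)
```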
        
       | nartz wrote:
       | Hey guys - here's just some critical feedback from a fellow dev -
       | here's my n of 1 perspective - of course this could be a very
       | different perspective for e.g. large enterprise companies
       | struggling with this.
       | 
       | Feedback:
       | 
       | It seems overly complicated. You lost me when you said i have to
       | train models? Are you assuming that software developers want to
       | train machine learning models to do something as simple as
       | creating some test data? In reality - I reach for tools that make
       | things easier for me, which includes not having to read a ton of
       | documentation, download new external tools, and things that 'just
       | work'.
       | 
       | It is 100% easier for me to export a little production data to
       | test on (and maybe sanitize), or to write a small script to
       | generate a few users and those things I need to test. Plus - then
       | I know exactly what I'm going to get. A lot of times, after I've
       | done this once, it will work for a good while as well - if I do
       | change the schema, I can add some additional data for that
       | column, and go from there, or otherwise.
       | 
       | For those companies who have 'messy' fixture data - is the _tool_
       | the issue? My take is that the difficulty with maintaining the
       | data could contribute to this issue, but is also more an issue of
       | simply bad housekeeping - e.g. rushing and not tending the
       | garden. While your system might handle this, your system also
       | seems to require a different skillset (e.g. specific
       | training/knowledge) than the standard QA developer might have.
       | 
       | If I did use it, I'd prefer it to be much easier to use - if I
       | could include a Ruby gem and incorporate it into the testing
       | process, e.g. an 'after' hook after migrating the db, that
       | would be ideal. Then I don't really need to know much. However,
       | I would still be concerned about whether this is
       | deterministically creating data or if it's random.
       | 
       | Good luck!
        
         | openquery wrote:
         | Thanks for the feedback. This is exactly what we're looking
         | for.
         | 
         | > It is 100% easier for me to export a little production data
         | to test on (and maybe sanitize), or to write a small script to
         | generate a few users and those things I need to test.
         | 
         | In your case it may very well be. But when you are an
         | organization with a schema of 100+ tables, with sensitive
         | information scattered across them, this can become a
         | nightmare to manage. I've seen this first hand. Furthermore, if
         | you are trying to generate more than 'a little' data this can
         | get more complex as you have to create factories and write a
         | lot of code to make the whole thing coherent and tell a story.
         | I think undertaking the added complexity of Synth is a trade-
         | off one should consider depending on the sophistication of the
         | testing data they require.
         | 
         | > If I did use it, i'd prefer it to be much easier to use
         | 
         | I think this misconception may be attributed to the fact that
         | we use machine learning under the hood. We've spent a lot of
         | time abstracting the developer away from this. In fact you can
         | run the whole lifecycle with 1 line of code:
         | 
         | `synth model new --from-database <database-uri> --train
         | --deploy`
         | 
         | > I would still be concerned about whether this is
         | deterministically creating data or if its random?
         | 
         | At this point you can choose. You can either pick a seed with
         | which the whole generation process starts (this may not be in
         | production yet) or elect to randomly seed it.
         | 
         | Thanks for the great questions :)
        
         | treis wrote:
         | >things I need to test
         | 
         | I think this is the biggest problem. I don't need a lot of
         | random data in my database. I need a lot of specific scenarios
         | set up. And a way to get those scenarios back after I test
         | something.
         | 
         | I've definitely been in a lot of situations where test data is
         | a problem. A particularly egregious one that comes to mind is
         | the poor developer that had to develop the fraud functionality.
         | Marking an account as fraud nuked it in the back end. Lots of
         | angry testers/developers when their favorite test account got
         | marked as fraud.
        
       | cowb0yl0gic wrote:
       | This is almost identical to a project idea I've had banging
       | around for...um...6 years now. :) Glad to see someone is running
       | with it, and also that you have data privacy as a 1st-class
       | citizen. One idea for the data model: domain-specific descriptors
       | (ex., not just a date, but a human birthdate with specific
       | parameters (think healthcare applications: pediatrics vs general
       | inpatient); this could be derived from sample/production data,
       | but when designing a new application, one might need to have
       | finer control over things like distribution (normal vs. skewed),
       | min/max, etc.). If someone is designing a new report for an
       | existing application, but wants synthetic data to use for
       | dev/testing and UAT, the report "target data profile" may diverge
       | from historical production data in very specific ways (ex.,
       | introducing new types/classes of products).
        
         | openquery wrote:
         | Thanks for your comment :)
         | 
         | These are all very good points. We are in the process of
         | figuring out a natural way to express user-specified semantic
         | types. We have some ideas but more on this coming soon!
        
       | iforiq wrote:
       | One use case I've seen for this is compliance. For SOC2 and other
       | compliance standards, I _think_ you aren't allowed to use
       | production data for dev/staging environments. An automated way to
       | generate a database with synthetic data would make life much
       | better in such cases.
        
         | openquery wrote:
         | Absolutely! We spent a bunch of time in the data privacy space
         | before pivoting to Synth. Synth has utility as a dev tool but
         | really does address exactly this issue.
         | 
         | This also ties into GDPR and CCPA compliance - we think that as
         | regulations tighten (which seems almost inevitable) this sort
         | of tooling will empower developers to go quicker and focus on
         | their applications instead of compliance.
        
       | sqs wrote:
       | Anyone know how this compares to https://www.tonic.ai/? Tonic
       | lets you generate data for safe local dev/testing, and they're
       | also open source and have some big customers.
        
       | carlps wrote:
       | I'm curious how the model handles text data. Does it use the
       | actual input text from the source db to generate new synthetic
       | data? If I have a column of a bunch of sensitive text that I need
       | sanitized, how will that appear in the output? What is the risk
       | of leaking something sensitive?
        
         | openquery wrote:
         | Thanks for the question!
         | 
         | For now text data will be marked as `categorical` or `text`.
         | When you have sensitive data you want to use `text` which will
         | provide a lorem-ipsum type generator.
         | 
         | If the model has classified that column with the semantic type
         | `text`, no information from the column should be leaked :)
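         | (The `text` generator is essentially this sketch - the word
         | list and sizing here are made up, not Synth's:)

```python
import random

LOREM = ("lorem ipsum dolor sit amet consectetur adipiscing elit "
         "sed do eiusmod tempor incididunt ut labore").split()

def fake_text(n_words, seed=None):
    """Replace a sensitive text field with same-shaped filler;
    nothing from the original column appears in the output."""
    rng = random.Random(seed)
    return " ".join(rng.choice(LOREM) for _ in range(n_words))
```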
        
       ___________________________________________________________________
       (page generated 2020-08-18 23:00 UTC)