       ___________________________________________________________________
        
       Launch HN: Syndetic (YC W20) - Software for explaining datasets
        
       Hi HN,
        
       We're Allison and Steve of Syndetic
       (https://www.getsyndetic.com). Syndetic is a web app that data
       providers use to explain their datasets to their customers.
       Think ReadMe, but for datasets instead of APIs.
        
       Every exchange of data ultimately comes down to a person at
       one company explaining their data to a person at another. Data
       buyers need to understand what's in the dataset (what are the
       fields and what do they mean?) as well as how valuable it can
       be to them (how complete is it? how relevant?). Data providers
       solve this problem today with a "data dictionary": a
       meta-spreadsheet explaining the dataset, shared alongside some
       sample data over email. These artifacts are constantly going
       stale as the underlying data changes.
        
       Syndetic replaces this with software connected directly to the
       data being exchanged. We scan the data and automatically
       summarize it through statistics (e.g., cardinality), coverage
       rates, frequency counts, and sample sets. We do this
       continuously to monitor data quality over time. If a field
       gets removed from the file, or goes from 1% null to 20% null,
       we automatically alert the provider so they can take a look.
       For an example of what we produce, applied to an open dataset,
       check out the results for the 2015 NYC Tree Census at
       https://www.getsyndetic.com/publish/datasets/f1691c5d-56a9-4....
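        
       To make the scanning concrete, here's a toy version of the
       per-field scan (an illustrative sketch only: the file name is
       made up, and our production scanner, a fork of xsv, does far
       more). It uses the `csv` crate that xsv itself builds on:
        
           // Single pass over a CSV, collecting per-field null
           // rates, cardinality, and value frequencies.
           use std::collections::HashMap;
           use std::error::Error;
        
           #[derive(Default)]
           struct FieldStats {
               rows: u64,
               nulls: u64,                   // empty string == null
               counts: HashMap<String, u64>, // value -> frequency
           }
        
           fn main() -> Result<(), Box<dyn Error>> {
               let mut rdr = csv::Reader::from_path("trees.csv")?;
               let headers = rdr.headers()?.clone();
               let mut stats: Vec<FieldStats> = (0..headers.len())
                   .map(|_| FieldStats::default())
                   .collect();
               for record in rdr.records() {
                   for (i, value) in record?.iter().enumerate() {
                       let s = &mut stats[i];
                       s.rows += 1;
                       if value.is_empty() {
                           s.nulls += 1;
                       } else {
                           *s.counts.entry(value.to_string())
                               .or_insert(0) += 1;
                       }
                   }
               }
               for (name, s) in headers.iter().zip(&stats) {
                   println!("{}: {:.1}% null, cardinality {}", name,
                       100.0 * s.nulls as f64 / s.rows.max(1) as f64,
                       s.counts.len());
               }
               Ok(())
           }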
       We met at SevenFifty, a tech startup connecting the three
       tiers of the beverage alcohol trade in the United States.
       SevenFifty integrates with the backend systems of 1,000+
       beverage wholesalers to produce a complete dataset of what a
       restaurant can buy wholesale, at what price, in any ZIP code
       in America. While the core business is a marketplace between
       buyers and sellers of alcohol, we built a side product
       providing data feeds back to beverage wholesalers about their
       own data. Syndetic grew out of the problems we experienced
       doing that. Allison kept a spreadsheet of our data schema in
       Dropbox, which was very difficult to maintain, especially
       across a distributed team of data engineers and account
       managers. We pulled sample sets ad hoc and ran stats over the
       samples to make sure the quality was good. We spent hours on
       the phone with our customers putting it all together to convey
       the meaning and the value of our data. We wondered why there
       was no software out there built specifically for
       data-as-a-service.
        
       We also have backgrounds in quantitative finance (D. E. Shaw,
       Tower Research, BlackRock) at firms that are large purchasers
       of external data, so we've seen the other side of this
       problem. Data purchasers spend a lot of time up-front
       evaluating the quality of a dataset, but they often don't
       monitor how the quality changes over time. They also have a
       hard time assessing the intersection of external datasets with
       data they already have. We're focusing on data providers first
       but expect to expand to purchasers down the road.
        
       Our tech stack is one monolithic repo split into the frontend
       web app and the backend data scanning. The frontend is a Rails
       app and the data scanning is written in Rust (we forked the
       amazing library xsv). One quirk is that we want to run the
       scanning in the same region as our customers' data to keep
       bandwidth costs and transfer times down, so we're actually
       running across both GCP and AWS.
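        
       The alerting described above boils down to diffing two scans
       of the same dataset. A minimal sketch, with a made-up 5%
       threshold (not our production alerting):
        
           use std::collections::HashMap;
        
           // One scan's summary: field name -> fraction null.
           struct Scan {
               null_rate: HashMap<String, f64>,
           }
        
           // Alert when a field disappears or its null rate jumps.
           fn drift_alerts(prev: &Scan, curr: &Scan) -> Vec<String> {
               let mut alerts = Vec::new();
               for (field, old) in &prev.null_rate {
                   match curr.null_rate.get(field) {
                       None => alerts.push(
                           format!("field `{}` was removed", field)),
                       Some(new) if new - old > 0.05 => alerts.push(
                           format!("`{}` went {:.0}% -> {:.0}% null",
                               field, old * 100.0, new * 100.0)),
                       _ => {}
                   }
               }
               alerts
           }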
        
       If you're interested in this field you might enjoy the paper
       "Datasheets for Datasets"
       (https://arxiv.org/pdf/1803.09010.pdf), which proposes a
       standardized method for documenting datasets, modeled after
       the spec sheets that come with electronics. The authors
       propose that "for dataset creators, the primary objective is
       to encourage careful reflection on the process of creating,
       distributing, and maintaining a dataset, including any
       underlying assumptions, potential risks or harms, and
       implications of use." We agree with them that as more and more
       data is sold, the chance of misunderstanding what's in the
       data increases. We think we can help here by building
       qualitative questions into Syndetic alongside the automation.
        
       We have lots of ideas about where we could go with this, like
       fancier type detection (e.g., is this field a phone number?),
       validations, visualizations, anomaly detection, stability
       scores, configurable sampling, and benchmarking. We'd love
       feedback and to hear about your challenges working with
       datasets!
        
       Author : stevepike
       Score  : 71 points
       Date   : 2020-02-24 18:08 UTC (4 hours ago)
        
       | mason55 wrote:
       | Any plans for a self-hosted version? Either regular on-prem,
       | or something that can be privately hosted on e.g. AWS, like
       | Databricks or Snowflake?
       | 
       | I love the idea, but we could never expose most of our data
       | to a public SaaS. We have all kinds of restrictions on
       | things like data privacy and data needing to stay in
       | specific regions.
        
         | stevepike wrote:
         | Yeah, on-premise (or at least private cloud) has come up a
         | few times. Beyond the data privacy and licensing
         | requirements, it'd also just be plain faster in some
         | cases. We haven't offered it yet just because we're a
         | small company and are rapidly adding features. Our backend
         | mostly runs in k8s, so I don't think it'll take a _huge_
         | rewrite to get it running in a private cloud. Frankly, I
         | just don't have experience supporting software running
         | outside my control and want to make sure we take the time
         | to do it correctly.
        
           | mason55 wrote:
           | Cool. We're an enterprise B2B SaaS company that does a ton of
           | data interchange with our customers. Each project burns a
           | bunch of engineering hours and calendar time because
           | customers send us data that doesn't match the spec or we just
           | don't have anyone outside the engineering org who can confirm
           | that the data looks how it's supposed to.
           | 
           | Something which simplified the process of analyzing sample
           | data and provided a view of it to a non-technical user would
           | be very valuable. But as I mentioned in my previous comment
           | we could never expose any of our data outside of our private
           | infrastructure, so we can't use this until there are other
           | options for hosting.
        
       | loganfrederick wrote:
       | This (or something like it) makes a lot of sense to me. I've
       | been at multiple organizations where there have been efforts
       | to create these "data dictionaries" explaining the meaning
       | of the data, especially when the schemas or APIs are not
       | well designed.
       | 
       | But manually writing documentation is tedious, and it can
       | typically only be done by the data team that knows the
       | underlying data well, which is not always the best use of
       | their time.
       | 
       | I'll definitely be following Syndetic and hope they can help
       | crack this problem.
        
         | aswihart wrote:
         | (This is Allison) - thank you! It's interesting for us to see
         | how the problem is handled at different types of organizations,
         | because as you point out the data team knows the underlying
         | data best but is not often customer-facing.
        
       | adampgreen wrote:
       | Awesome! Congrats on the launch! Excited to check it out.
        
       | pplonski86 wrote:
       | What data types do you plan to support? Have you considered
       | supporting datasets with images+labels used in computer vision?
       | How would you like to handle them?
       | 
       | Are you going to support data labeling tasks?
        
         | stevepike wrote:
         | We've talked with some image data providers who are
         | creating datasets for use in machine learning. We're not
         | running any of our own models on image data right now, so
         | I think the place we can be most useful is in summarizing
         | metadata about the images, in cases where the dataset
         | isn't just an image file. For example, if the dataset is
         | images of intersections plus bounding-box coordinates of
         | street signs, we can tell a prospective consumer what % of
         | images have a street sign in them. If you have a little
         | more metadata (e.g., what time of day the photo was
         | taken), the stats get much more useful out of the box.
         | 
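         | That street-sign stat is just a coverage rate over the
         | label metadata. A toy sketch (the record type is made up):
         | 
         |     // One image record: pixels live elsewhere, labels
         |     // travel inline as metadata.
         |     struct ImageRecord {
         |         uri: String,
         |         street_sign_boxes: Vec<[f32; 4]>, // x, y, w, h
         |     }
         | 
         |     // % of images with at least one street-sign box.
         |     fn street_sign_coverage(imgs: &[ImageRecord]) -> f64 {
         |         let hits = imgs.iter()
         |             .filter(|r| !r.street_sign_boxes.is_empty())
         |             .count();
         |         100.0 * hits as f64 / imgs.len().max(1) as f64
         |     }
         | 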
         | I'm not sure I follow the data labeling question, though.
         | How would you imagine us getting involved there?
        
       | aripickar wrote:
       | This looks really cool! I've worked with large datasets
       | before, and one of the most annoying things was when they
       | were split up into multiple files. Do you currently support
       | statistics across multiple datasets?
       | 
       | Also, how did you come up with the pricing? $500/month and
       | "call us" seems like a lot.
        
         | stevepike wrote:
         | Thanks! We can definitely combine multiple files into one
         | dataset so long as they share the same fields. We've got
         | one customer that keeps their data in an S3 bucket as one
         | JSON file per record, so for them we're scanning ~480K
         | files to construct the stats (see the sketch below). If
         | you've got multiple different datasets, we've got a
         | concept of "collections" for organization.
         | 
         | We came up with the pricing strategy based on
         | conversations with early customers. We want to be able to
         | say yes to integrations with whatever system they're using
         | to store their data, so we need flexibility at the early
         | stage.
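         | 
         | The many-files case reduces to one aggregation pass. A toy
         | local version (a directory stands in for the S3 bucket;
         | paths and layout are made up; assumes the serde_json
         | crate):
         | 
         |     use std::collections::HashMap;
         |     use std::error::Error;
         |     use std::fs;
         | 
         |     // Count how often each top-level field appears
         |     // across one-record-per-file JSON documents.
         |     fn main() -> Result<(), Box<dyn Error>> {
         |         let mut present: HashMap<String, u64> =
         |             HashMap::new();
         |         let mut records = 0u64;
         |         for entry in fs::read_dir("records/")? {
         |             let path = entry?.path();
         |             let ext = path.extension()
         |                 .and_then(|e| e.to_str());
         |             if ext != Some("json") {
         |                 continue;
         |             }
         |             let text = fs::read_to_string(&path)?;
         |             let v: serde_json::Value =
         |                 serde_json::from_str(&text)?;
         |             if let Some(obj) = v.as_object() {
         |                 records += 1;
         |                 for key in obj.keys() {
         |                     *present.entry(key.clone())
         |                         .or_insert(0) += 1;
         |                 }
         |             }
         |         }
         |         for (field, n) in &present {
         |             println!("{}: {:.1}% coverage", field,
         |                 100.0 * *n as f64 / records.max(1) as f64);
         |         }
         |         Ok(())
         |     }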
        
       | SamuelAdams wrote:
       | I wonder if you could combine your service with freely
       | available datasets like Google Dataset Search [1] to demo
       | what a large number of varied datasets would look like under
       | your service.
       | 
       | [1]:
       | https://datasetsearch.research.google.com/search?query=puppi...
        
         | aswihart wrote:
         | We would love to do that at some point. There is tons of
         | open data out there, but not a lot of it has useful
         | descriptions at the field level (only at the dataset
         | level), so it would take some time to put together a
         | robust collection. Also, the demo on our splash page only
         | shows the artifact we create (i.e., the published
         | dictionary). The other side of the web app is a management
         | layer for bundling datasets into collections, annotating
         | fields, configuring sample sets, and sharing the
         | artifacts. We'll work on fleshing out our demo to show
         | what the system looks like when there are hundreds or
         | thousands of datasets.
        
       | gkoberger wrote:
       | Hey! I'm Greg of ReadMe... think Syndetic but for APIs rather
       | than datasets!
       | 
       | Congrats on the launch :) I'll find you at Alumni Demo Day, or
       | feel free to reach out if I can help with anything! And welcome
       | to the war on PDFs :)
        
         | aswihart wrote:
         | This made our day : ) Thank you!
        
       | [deleted]
        
       | shostack wrote:
       | What industries do you see this being most useful to?
        
         | aswihart wrote:
         | We think data-as-a-service is a new and growing category.
         | It includes startups we would consider "pure" data
         | companies (e.g., a company that sells data on airfares
         | across the web) and companies that have been around for
         | decades selling CSVs delivered over FTP (e.g., a giant
         | company like ADP that sells payroll processing data).
         | Based on our backgrounds we have a fair amount of
         | experience in the alternative data space, which is
         | basically any non-market data that might carry some signal
         | for a hedge fund. I'm finding that the providers in that
         | space are interested in expanding their customer base to
         | corporates (e.g., Walmart, McDonald's), whereas the
         | providers currently selling to corporates are interested
         | in expanding into alternative data.
        
       | coolsank wrote:
       | Very cool product! I've worked on much smaller datasets with
       | pandas, and even there its built-in profiling report can
       | slow things to a crawl! Hoping to see more from you guys :)
        
         | stevepike wrote:
         | A very reductionist version of our company that Allison
         | hates when I use it is "csvstat, but on the internet" :-).
         | I think auto-summarizing datasets has hit a kind of local
         | maximum in what pandas dataframe summaries (csvstat is a
         | similar Python tool) can do on one machine. We'll be able
         | to add much fancier things, like sophisticated type
         | classification (e.g., is this field a stock ticker?),
         | without burning your CPU.
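         | 
         | For a flavor of the type classification, here's a toy
         | sniffing rule (the regex, the 90% vote, and the function
         | are made up for illustration; assumes the regex crate):
         | 
         |     use regex::Regex;
         | 
         |     // Call a column "US phone number" if 90%+ of its
         |     // non-empty sampled values match a phone shape.
         |     fn looks_like_us_phone(sample: &[String]) -> bool {
         |         let re = Regex::new(
         |             r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"
         |         ).unwrap();
         |         let non_empty: Vec<&String> = sample.iter()
         |             .filter(|v| !v.is_empty())
         |             .collect();
         |         if non_empty.is_empty() {
         |             return false;
         |         }
         |         let hits = non_empty.iter()
         |             .filter(|v| re.is_match(v.as_str()))
         |             .count();
         |         hits as f64 / non_empty.len() as f64 >= 0.9
         |     }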
        
           | coolsank wrote:
           | Hah! This is a very interesting area, though. You're
           | right that auto-summarizing is becoming a problem as
           | datasets get larger. Data versioning is also starting
           | to become a bigger problem, and I saw that you've
           | already addressed it in your enterprise product. Hoping
           | to see some sort of API for comparing data troves from
           | different points in time in the future.
        
       | dodata wrote:
       | Neat! Congrats on the launch - the demo is very helpful for
       | understanding the product. Having consumed long, painful PDF
       | data dictionaries in the past, this is a big breath of fresh
       | air. Excited to see where Syndetic goes!
       | 
       | For me, the most painful part of working with 3rd-party data
       | was actually figuring out the "match rate" against internal
       | data. For example, you might be a consumer-facing company
       | that hopes to add more context to your internal data by
       | pulling in 3rd-party information for existing clients. To
       | match your internal data to a 3rd-party dataset, you usually
       | match on a hashed email (or similar identifier) to see what
       | percentage of your consumer records are available in the
       | 3rd-party dataset. Have you thought about something like
       | that with your tool? Maybe you could upload a sample of
       | hashed emails and see how different match rates pan out.
        
         | stevepike wrote:
         | Yes! This has come up across multiple industries and is
         | probably the feature on our roadmap I'm most excited about. The
         | implementation is tricky but customers definitely care about
         | the intersection of a provider's data with their own. Some more
         | sophisticated providers have internal tools for generating
         | things like sample sets customized to a prospect.
         | 
         | We're going to be adding a feature where we can flag fields as
         | identifying keys and index them. We'll start with a simple
         | intersection count ("upload 100 stock tickers, see how many
         | records match"). Then we'll add an interactive feature to let a
         | prospective customer generate all of the stats in the
         | dictionary scoped down to the subset of data they care about.
         | It's important to be able to answer questions like "for the 100
         | tickers I care about, how many NULLs are there for this other
         | column?".
         | 
         | Maybe someday we'll even get into the more general record
         | linkage problem when there's no reliable matching key.
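         | 
         | To be concrete, that first version is just a keyed
         | intersection. A toy sketch (hypothetical function, not the
         | shipped feature):
         | 
         |     use std::collections::HashSet;
         | 
         |     // What fraction of a prospect's keys (hashed emails,
         |     // tickers, ...) appear in the provider's dataset?
         |     fn match_rate(provider: &HashSet<String>,
         |                   prospect: &[String]) -> f64 {
         |         if prospect.is_empty() {
         |             return 0.0;
         |         }
         |         let hits = prospect.iter()
         |             .filter(|k| provider.contains(k.as_str()))
         |             .count();
         |         hits as f64 / prospect.len() as f64
         |     }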
        
       | knes wrote:
       | Super great idea. I've been talking to data scientists and
       | engineers in the B2B SaaS space about how we should bring
       | best practices like this to the sales/marketing/business-ops
       | world too.
       | 
       | What would you say are the differences between Syndetic and
       | qri.io? (Not affiliated in any way.)
        
         | stevepike wrote:
         | Thanks for the kind words! I hadn't seen qri.io before, but
         | from a read of their website I think it's broadly similar to
         | Dolt (https://www.liquidata.co/) which is git for datasets.
         | Kaggle has a similar data hub at
         | https://www.kaggle.com/datasets that's not open source but is
         | in the same space.
         | 
         | Our approach is to scan the datasets wherever they currently
         | live in production rather than being a new way to store the
         | data. The industry seems to have settled on FTP and S3 for now,
         | and we think it's important that we connect to the same exact
         | thing a customer would access. That lets us keep the dictionary
         | up to date automatically without the data providers needing to
         | change their storage infrastructure.
        
       | nocitrek wrote:
       | Does this support more complex data formats - for example,
       | Parquet files?
        
         | stevepike wrote:
         | The manual data upload is restricted to well-formed CSV
         | files with headers on the first row right now. For the
         | "contact us" higher tier we'll handle any file format we
         | can extract columns from, so Parquet would be fine.
        
       ___________________________________________________________________