[HN Gopher] Launch HN: Prequel (YC W21) - Sync data to your cust...
       ___________________________________________________________________
        
       Launch HN: Prequel (YC W21) - Sync data to your customer's data
       warehouse
        
       Hey HN! We're Conor and Charles from Prequel (https://prequel.co).
       We make it easy for B2B companies to send data to their customers.
       Specifically, we help companies sync data directly to their
       customers' data warehouses, on an ongoing basis.  We're building
       Prequel because we think the current ETL paradigm isn't quite
       right. Today, it's still hard to get data out of SaaS tools:
       customers have to write custom code to scrape APIs, or procure
       third-party tools like Fivetran to get access to their data. In
       other words, the burden of data exports is on the customer.  We
       think this is backwards! Instead, _vendors_ should make it seamless
       for their customers to export data to their data warehouse. Not
       only does this make the customer's life easier, it benefits the
       vendor too: they now have a competitive advantage, and they get to
       generate new revenue if they choose to charge for the feature. This
       approach is becoming more popular: companies like Stripe, Segment,
       Heap, and most recently Salesforce offer some flavor of this
       capability to their customers.  However, just as it doesn't make
       sense for each customer to write their own API-scraping code, it
       doesn't make sense for every SaaS company to build their own sync-
       to-customer-warehouse system. That's where Prequel comes in. We
       give SaaS companies the infrastructure they need to easily connect
       to their customers' data warehouses, start writing data to them, and
       keep that data updated on an ongoing basis. Here's a quick demo:
       https://www.loom.com/share/da181d0c83e44ef9b8c5200fa850a2fd.
       Prequel takes less than an hour to set up: you (the SaaS vendor)
       connect Prequel to your source database/warehouse, configure your
       data model (aka which tables to sync), and that's pretty much it.
       After that, your customers can connect their database/warehouse and
       start receiving their data in a matter of minutes. All of this can
       be done through our API or in our admin UI (there's a rough sketch
       of the API flow at the end of this post).  Moving all this data
       accurately and in a timely manner is a nontrivial technical
       problem. We potentially have to transfer billions of rows /
       terabytes of data per day, while guaranteeing that transfers are
       completely accurate. Since companies might use this data to drive
       business decisions or in financial reporting, we really can't
       afford to miss a single row.  There are a few things that make this
       particularly tricky. Each data warehouse speaks a slightly
       different dialect of SQL and has a different type system (which is
       not always well documented, as we've come to learn!). Each
       warehouse also has slightly different ingest characteristics (for
       example, Redshift has a hard cap of 16MB on any statement), meaning
       you need different data loading strategies to optimize throughput.
       Finally, most of the source databases we read data from are multi-
       tenant -- meaning they contain data from multiple end customers,
       and part of our job is to make sure that the right data gets routed
       to the right customer. Again, it's pretty much mission-critical
       that we don't get this wrong, not even once.  As a result, we've
       invested in extensive testing a lot earlier than it makes sense for
       most startups to. We also tend to write code fairly defensively: we
       always try to think about the ways in which our code could fail (or
       anticipate what bugs might be introduced in the future), and make
       sure that the failure path is as innocuous as possible. Our backend
       is written in Go, our frontend is in React + TypeScript (we're big
       fans of compiled languages!), we use Postgres as our application
       db, and we run the infra on Kubernetes.  The last piece we'll touch
       on is security and privacy. Since we're in the business of moving
       customer data, we know that security and privacy are paramount.
       We're SOC 2 Type II certified, and we go through annual white-box
       pentests to make sure that all our code is up to snuff. We also
       offer on-prem deployments, so data never has to touch our servers
       if our customers don't want it to.  It's kind of surreal to launch
       on here - we're long time listeners, first time callers, and have
       been surfing HN since long before we first started dreaming about
       starting a company. Thanks for having us, and we're happy to answer
       any questions you may have! If you wanna take the product for a
       spin, you can sign up on our website or drop us a line at hn (at)
       prequel.co. We look forward to your comments!
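       P.S. To make the setup flow above a bit more concrete, here's a
       rough sketch in Go (our backend language) of what the three API
       calls could look like. The endpoint paths, field names, and auth
       header below are illustrative rather than our exact API - see the
       docs for the real thing.
        
         // Hypothetical sketch of the vendor-side setup flow. Endpoint
         // paths, JSON fields, and the API key header are assumptions
         // for illustration, not our exact API.
         package main

         import (
             "bytes"
             "encoding/json"
             "log"
             "net/http"
         )

         func post(path string, body map[string]any) {
             b, _ := json.Marshal(body)
             req, _ := http.NewRequest("POST",
                 "https://api.prequel.example"+path, bytes.NewReader(b))
             req.Header.Set("Content-Type", "application/json")
             req.Header.Set("X-Api-Key", "YOUR_API_KEY") // assumed auth
             resp, err := http.DefaultClient.Do(req)
             if err != nil {
                 log.Fatal(err)
             }
             defer resp.Body.Close()
             log.Println(path, resp.Status)
         }

         func main() {
             // 1. You (the vendor) connect your source db/warehouse.
             post("/sources", map[string]any{
                 "type": "postgres",
                 "host": "db.internal.example.com",
                 "name": "analytics",
             })

             // 2. Configure the data model: which table to sync, which
             // column identifies the tenant, which columns to expose.
             post("/models", map[string]any{
                 "table":         "invoices",
                 "tenant_column": "customer_id",
                 "columns":       []string{"id", "amount", "created_at"},
             })

             // 3. Each customer connects their own destination.
             post("/destinations", map[string]any{
                 "type":      "bigquery",
                 "tenant_id": "customer_123",
             })
         }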
        
       Author : ctc24
       Score  : 90 points
       Date   : 2022-09-26 15:07 UTC (7 hours ago)
        
       | soumyadeb wrote:
       | Congrats on the launch. Very cool idea and always wondered why
       | this hasn't been done before.
       | 
       | One question though - don't you see Snowflake (or the cloud data
       | warehouse vendors) building this? Snowflake has to build native
       | support for CDC from some production databases like Postgres,
       | MySQL, Oracle etc. Once the data has landed in the SaaS vendor's
       | Snowflake, it can be shared (well the relevant rows which a
       | customer should have access to) with each customer.
       | 
        | Isn't that the right long-term solution? Or am I missing something
       | here?
        
         | conormccarter wrote:
         | Snowflake has made it really easy to share to other Snowflake
         | instances, and the other major cloud data warehouses are
         | working on similar warehouse specific features as well. This
         | makes sense because it can be a major driver of growth for
         | them. The way we see the space play out is that every cloud
         | warehouse develops some version of same-vendor sharing, while
         | neglecting competitive warehouse support.
         | 
         | Long term, we'd like to be the interoperable player focused
         | purely on data sharing that plays nicely with any of the
         | upstream sources, but also facilitates connecting to any data
         | destination. (This also means we can spend more time building
         | thoughtful interfaces - API and UI - for onboarding and
         | managing source and destination connections)
        
           | soumyadeb wrote:
           | Got it. Makes sense.
           | 
           | However, don't you think it makes sense for Snowflake to
           | support "Replicate my Postgres/MySQL/Oracle" to Snowflake?
           | Given how much they are investing in making it easier to get
           | data into Snowflake.
        
             | conormccarter wrote:
             | Oh, yeah, it probably does make sense for the warehouses to
             | make that part easier, at least for the more popular
             | transactional db choices. You may have seen Google/BigQuery
             | recently announced their off the shelf replication service
             | for Oracle and MySQL. As far as Prequel goes, we connect to
             | either (db or data warehouse sources), so we're largely
             | agnostic to how the data moves around internally before it
             | gets sent to customers.
        
       | yevpats wrote:
        | It's an interesting take, but I worry this might not be possible
        | (in a lot of cases) because not everything is database-backed,
        | and sometimes you have logic behind the API whose results are
        | what you actually want in your data warehouse. Take the AWS APIs,
        | for example: even their own AWS Config team uses the same AWS
        | APIs, because there is no one place where the data just resides
        | (of course it would be great if that were the case - it would
        | make life very easy).
       | 
       | I think this is a good tweet explaining the problem that will
       | never be solved -
       | https://twitter.com/mattrickard/status/1542193426979909634
       | 
        | Full disclosure: I'm the founder and original author of
       | https://github.com/cloudquery/cloudquery
        
         | ctc24 wrote:
         | Very much hear you about the amount of logic contained within
         | API layers. That said, it actually hasn't come up a lot so far.
         | What we're finding is that a lot of teams that have complex
         | API-layer logic also end up landing their data in a data
         | warehouse and replicating the API's data model there. In those
         | cases, we can use that as the source. For those that don't have
          | a dwh, we do actually support connecting to an API as a source;
          | it just requires a bit more config and is less efficient, so we
          | don't promote it quite as much :).
         | 
          | As far as the tweet goes, there are a couple of points we might
         | disagree with. One of the assumptions made is that the problem
         | is solved by an outsourced third-party. That's actually exactly
         | what we want to change! We think the problem should be solved
         | by a first-party (the software vendor), and we want to give
         | them the tools to accomplish that. Another assumption in there
         | is that companies who charge for dashboards wouldn't do this
         | because they'd lose money. If they charge for dashboards, they
         | can also choose to charge for data exports.
        
           | yevpats wrote:
            | I very much hope this works out for you and that it succeeds.
            | It could definitely be an amazing turning point for the data
            | integration and ELT world.
            | 
            | I still have a hard time understanding how this is technically
            | feasible. Say a company has their data stored in PostgreSQL,
            | but the data model there isn't exactly the one in their API,
            | because they apply some minimal transformation before exposing
            | it to the user. How will the user know what to expect? The
            | internal data model isn't documented at all, and users usually
            | want what is documented and exposed via the API. So it seems
            | you still have to go via the API, and if you're there already
            | then you're in the ELT space.
            | 
            | On your point that the burden currently falls on the user and
            | should fall on the vendor - I agree, but if it's solved in the
            | ELT space then vendors should just maintain a plugin for
            | CloudQuery (https://github.com/cloudquery/cloudquery) or
            | Airbyte or something similar, the same way vendors maintain
            | Terraform or Pulumi plugins.
        
           | cayleyh wrote:
            | The API-based part sounds a lot like https://www.singer.io,
           | the API-based data ETL tooling Stitch Data developed -- is
           | that accurate?
        
             | ctc24 wrote:
              | There are definitely some similarities (both are generic-ish
             | ways to get data out of an API). As mentioned in another
             | comment, we're still deciding whether to adopt an existing
             | protocol or roll our own for those API connections. We've
             | done the latter so far but it's a work in progress.
        
       | sgammon wrote:
       | This is very cool. I applied! We have a use for it now (we're
       | starting a B2B thing). Good luck!
        
         | conormccarter wrote:
         | Just saw and sent you a note - would love to hear more!
        
       | mfrye0 wrote:
       | Very cool. We've just started exploring customer requests to sync
       | our data to their warehouses, so great timing.
       | 
       | What sort of scale can you guys handle? One of our DBs is in the
       | billions of rows.
       | 
       | I assume we'd need to potentially create custom sources for each
       | destination as well? Or does your system automatically figure out
       | the best "common" schema across all destinations? For example, an
       | IP subnet column.
        
         | conormccarter wrote:
         | Re: scale - we do handle billions of rows! As you can imagine,
         | exact throughput depends on the source and destination database
         | (as well as the width of the rows) but to give you a rough
         | sense - on most warehouses, we can sync on the order of 5M rows
         | per minute for each customer that you want to send data to. In
         | practice, for a source with billions of rows, the initial
         | backfill might take a few hours, and each incremental sync
         | thereafter will be much faster. We can hook you up with a
         | sandbox account if you want to run your own speed test!
         | 
         | Re: configuration - you would create a config file for each
         | "source" table that you want to make available to customers,
         | including which columns should be sent over. Then at the
         | destination level, you can specify the subset of tables you'd
         | like to sync. This could be a single common schema for all
         | customers, or different schemas based on the products the
         | customer uses.
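          | 
          | If it helps, here's roughly the shape of that configuration,
          | written out as Go structs just to show the idea - the field
          | names are illustrative, not our exact config format:
          | 
          |   // Illustrative shapes only - the real config format may differ.
          |   package config
          | 
          |   // TableConfig describes one source table made available to
          |   // customers, including which columns get sent over.
          |   type TableConfig struct {
          |       Name         string   // e.g. "events"
          |       TenantColumn string   // routes rows to the right customer
          |       Columns      []string // columns to sync
          |   }
          | 
          |   // DestinationConfig picks the subset of tables a given
          |   // customer receives (e.g. based on the products they use).
          |   type DestinationConfig struct {
          |       TenantID string
          |       Tables   []string
          |   }
          | 
          |   var Example = TableConfig{
          |       Name:         "events",
          |       TenantColumn: "org_id",
          |       Columns:      []string{"id", "type", "occurred_at"},
          |   }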
        
       | thdxr wrote:
       | Wow I've been looking for this for years! Always thought SaaS
       | companies waste time building yet another mediocre analytics
       | dashboard when they should just sync their data
       | 
        | My main thing is I don't want to think in terms of raw data going
        | from my database to the customer database; I have higher-level
        | API concepts.
       | 
       | Would be cool if there was some kind of sync protocol I could
       | implement where prequel sent a request with a "last fetched"
       | timestamp and the endpoint replied with all data to be updated.
       | 
       | Kind of like this: https://doc.replicache.dev/server-pull
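        | 
        | Roughly what I'm imagining on the vendor side - endpoint name,
        | field names, and shapes here are all made up, just a sketch:
        | 
        |   // Hypothetical pull endpoint: the sync service sends the last
        |   // fetched cursor, the vendor replies with everything that
        |   // changed since then, plus a new cursor.
        |   package main
        | 
        |   import (
        |       "encoding/json"
        |       "log"
        |       "net/http"
        |       "time"
        |   )
        | 
        |   type PullRequest struct {
        |       LastFetched time.Time `json:"last_fetched"`
        |   }
        | 
        |   type PullResponse struct {
        |       Rows       []map[string]any `json:"rows"`
        |       NextCursor time.Time        `json:"next_cursor"`
        |   }
        | 
        |   func pull(w http.ResponseWriter, r *http.Request) {
        |       var req PullRequest
        |       if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        |           http.Error(w, err.Error(), http.StatusBadRequest)
        |           return
        |       }
        |       // Answer in terms of the higher-level API model, not the
        |       // raw tables - this stub is where that logic would live.
        |       rows := changedSince(req.LastFetched)
        |       json.NewEncoder(w).Encode(PullResponse{
        |           Rows:       rows,
        |           NextCursor: time.Now().UTC(),
        |       })
        |   }
        | 
        |   func changedSince(since time.Time) []map[string]any {
        |       return nil // stub: look up rows updated after `since`
        |   }
        | 
        |   func main() {
        |       http.HandleFunc("/sync/pull", pull)
        |       log.Fatal(http.ListenAndServe(":8080", nil))
        |   }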
        
         | ctc24 wrote:
         | That's actually pretty close to how we connect to APIs today,
         | but I love the idea of streamlining it as a protocol. We're
         | still working on improving the dev experience, especially
         | around APIs, and would love to chat more if you have more
         | feedback!
        
         | evantahler wrote:
         | Check out the Airbyte Protocol - this is exactly the kind of
         | thing we made it for! https://docs.airbyte.com/understanding-
         | airbyte/airbyte-proto...
        
       | blakeburch wrote:
       | Love the idea! I really see the value in shifting the
       | conversation towards the vendor themselves being responsible for
       | pushing the data to customers and it makes a lot of sense to do
       | it directly from DB -> DB.
       | 
       | However, building a data product myself (Shipyard), we really try
       | to encourage the idea of "connecting every data touchpoint
       | together" so you can get an end-to-end view of how data is used
       | and prevent downstream issues from ever occurring. This raised a
       | few questions:
       | 
       | 1. If the vendor owns the process of when the data gets
       | delivered, how would a data team be able to have their pipelines
       | react to the completion or failure of that specific vendor's
       | delivery? Or does the ingestion process just become more of a
       | black box?
       | 
       | While relying on a 3rd party ingestion platform or running
       | ingestion scripts on your own orchestration platform isn't ideal,
       | it at least centralizes the observability of ongoing ingestion
       | processes into a single location.
       | 
       | 2. From a business perspective, do you see a tool like Prequel
       | encouraging businesses to restrict their data exports behind
       | their own paywall rather than making the data accessible via
       | external APIs?
       | 
       | --
       | 
       | Would love to connect and chat more if you're interested! Contact
       | is in bio.
        
         | ctc24 wrote:
         | 1. I think in practice, people are already using a mix of
         | sources for ingestion today. It's rare that a data team would
         | rely on a single tool to ingest all their data - instead, they
         | might get some data via one or more ETL tools, some data via a
         | script they wrote themselves, and some other data from their
         | own db. So in that regard, I don't think a world where the
         | vendor provides the data pipeline makes this a lot more
         | complex.
         | 
         | One way we're hoping to pre-empt some of this is by helping
         | vendors to surface more observability primitives in the schema
         | that they write data to. To give an example: Prequel writes a
         | _transfer_status table in each destination with some metadata
         | about the last time data was updated. The goal there is to
          | decouple the means of moving data from the observability piece.
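          | 
          | For instance, a downstream job could gate on something like the
          | following - the schema and column names here are made up for
          | illustration, not the exact shape of _transfer_status:
          | 
          |   // Hypothetical freshness check a data team could run before
          |   // kicking off downstream jobs.
          |   package main
          | 
          |   import (
          |       "database/sql"
          |       "log"
          |       "time"
          | 
          |       _ "github.com/lib/pq"
          |   )
          | 
          |   func main() {
          |       db, err := sql.Open("postgres",
          |           "postgres://user:pass@warehouse/db?sslmode=require")
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |       defer db.Close()
          | 
          |       var lastSync time.Time
          |       row := db.QueryRow(
          |           `SELECT max(completed_at) FROM vendor._transfer_status`)
          |       if err := row.Scan(&lastSync); err != nil {
          |           log.Fatal(err)
          |       }
          |       if time.Since(lastSync) > 2*time.Hour {
          |           log.Fatal("vendor data looks stale; holding jobs")
          |       }
          |       log.Println("vendor data fresh as of", lastSync)
          |   }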
         | 
         | We can also help vendors expose hooks that people's data
         | pipelines & observability tools plug into (think webhooks and
         | the like).
         | 
         | 2. We don't really - anecdotally, companies that offer data
         | warehouse syncs tend to be pretty focused on providing a great
         | user-experience. At the end of the day, that decision is pretty
         | much entirely with the business. We see a pretty wide range
         | today: some teams choose to make exports widely available, some
         | choose to reserve it for their pro or enterprise tier, and some
         | choose to sell it as a standalone SKU. It's pretty similar to
         | what already happens with APIs designed for data exports.
         | 
         | Would love to continue the convo and hear more about your take
         | on obs! Sending you a note now.
        
       | leetrout wrote:
       | Love it.
       | 
        | As we continue to move toward more and more composable
        | architectures combining SaaS, an offering like this is really
        | going to give your users a leg up.
       | 
       | Edit: deleted redundant question about SOC 2. I missed the whole
       | paragraph on mobile
        
       | dwiner wrote:
       | Brilliant idea. Can see many use cases for this with our company.
       | Congrats on the launch!
        
       | sails wrote:
       | This is excellent (the idea, I don't know anything about
       | prequel!), and a much needed tool to support a reasonable trend.
       | I fully support B2B companies taking on the responsibility to
       | make data more readily available for analytics, beyond just
       | exposing a fragile API.
       | 
       | For those unaware, this is a relatively recently established
       | practice (direct to warehouse instead of via 3rd party ETL)
       | 
       | https://techcrunch.com/2022/09/15/salesforce-snowflake-partn...
       | 
       | https://stripe.com/en-gb-es/data-pipeline
        
       | gourabmi wrote:
       | How do you deal with incompatible data types between the source
        | and destination systems? For example, the source might have a
       | timestamp with timezone data type and the destination could just
       | support timestamp in UTC.
        
         | ctc24 wrote:
         | That's something that we spend a lot of time working on. Our
          | general approach is to do the thing that's both simple and
          | predictable (i.e., what format would I want to get the data in
          | if I were the data team receiving it). For the specific example
          | you give, we'd convert the timestamptz into a timestamp before
          | writing it to the destination.
         | 
         | Another type where this comes up is JSON. Some warehouses
         | support the type whereas others don't. In those cases, we
         | typically write the data to a text/varchar column instead.
         | 
         | The way it works under the hood is we've effectively written
         | our own query planner. It takes in the source database flavor
          | (e.g., Snowflake) and types, the destination database flavor, and
         | figures out the happy path for handling all the specified
         | types.
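          | 
          | Here's a very stripped-down sketch of the kind of mapping the
          | planner does - a toy example; the real thing covers far more
          | flavors and types:
          | 
          |   // Toy version of the type-mapping step: pick a destination
          |   // column type given a source type and whether the target
          |   // warehouse has a native JSON type.
          |   package main
          | 
          |   import "fmt"
          | 
          |   func mapType(sourceType string, destHasJSON bool) string {
          |       switch sourceType {
          |       case "timestamptz":
          |           // Normalize to UTC and write a plain timestamp.
          |           return "timestamp"
          |       case "json", "jsonb":
          |           if !destHasJSON {
          |               return "varchar" // fall back to text
          |           }
          |           return "json"
          |       default:
          |           return sourceType // types both sides understand
          |       }
          |   }
          | 
          |   func main() {
          |       fmt.Println(mapType("timestamptz", true)) // timestamp
          |       fmt.Println(mapType("jsonb", false))      // varchar
          |   }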
        
       | r_thambapillai wrote:
       | Would you guys support the inverse of this - sending data to your
        | vendors? If you buy a SaaS tool that needs to ingest certain data
        | from your data warehouse (or SaaS tools) into the vendor's
        | Snowflake / data warehouse, would Prequel be worth looking at for
        | achieving that?
        
         | ctc24 wrote:
         | As you can imagine, the underlying tech is pretty much the same
         | - we're still moving bits from one db/dwh to another - so we
         | can use our existing transfer infra. The only difference is the
         | UX and API we surface on top of it.
         | 
         | We got this request from a couple teams we work with, and so we
         | have an alpha that's live with them. It'll probably go live in
         | GA in Q4 or Q1 (and if you want access to it before then, drop
         | us a line at hn (at) prequel.co!).
        
       | buremba wrote:
        | Congrats on your launch! What data sources can I add? Do they
        | need to be one of the databases listed here (1)? Does that mean
        | I need to move the data into one of those databases in order to
        | sync our customer data to their data warehouses?
       | 
       | (1) https://docs.prequel.co/reference/post_sources
        
         | conormccarter wrote:
         | Thank you! And exactly right - Postgres, Snowflake, BigQuery,
         | Redshift, and Databricks are the sources we support today. We
         | also have some streaming sources (like Kafka) in beta with a
         | couple pilot users. At this point, it's fairly negligible work
         | for us to add support for new SQL based sources, so we can add
         | new sources quickly as needed.
        
       | ianbutler wrote:
       | Hey guys we spoke a while back, glad to see you're still going at
       | it! Good luck with everything.
        
         | conormccarter wrote:
         | Hey Ian, great to hear from you - thanks for the note, and hope
         | you're doing well!
        
       | sv123 wrote:
       | Very cool, we've had great success using Snowflake's sharing to
       | give our customers access to their data... That obviously falls
       | apart if the customer wants data in BigQuery or somewhere else.
        
       | mrwnmonm wrote:
        | Since a lot of services do integrations these days, I wonder
        | whether there are some common connectors they're using?
        
       | ztratar wrote:
        | One of the first Show HNs I've read where my first thought was
       | "would invest based purely on the single sentence description"
       | 
       | Nice idea. Have fun executing!
        
       | wasd wrote:
       | Awesome product.
       | 
       | 1. Do you expect to support SQL Server? If so, do you know when?
       | 
       | 2. Watched the Loom video. How should we handle multi-tenant data
        | that requires a join? For example, let's say I want to send data
        | specific to a school. A Student would belong to a Teacher, who
        | belongs to a School.
        
         | conormccarter wrote:
         | Thanks for watching and for the kind words! Re 1. - it's
         | definitely on the roadmap - we're planning on getting to it in
         | Q4/Q1, but we can move it up depending on customer need. Re 2.
         | - for tables without a tenant ID column, we suggest creating a
         | view on top of that table that performs the join to add the
         | tenant column (e.g., "school_id") - it's a pretty common pre-
         | req.
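          | 
          | Here's a sketch of that pre-req view for the school example -
          | table and column names are just the ones from your question,
          | and it assumes a Postgres source:
          | 
          |   // Sketch only: attach the tenant column (school_id) to a
          |   // table that doesn't have one by joining through teachers,
          |   // then point the synced table config at the view instead.
          |   package main
          | 
          |   import (
          |       "database/sql"
          |       "log"
          | 
          |       _ "github.com/lib/pq"
          |   )
          | 
          |   func main() {
          |       db, err := sql.Open("postgres",
          |           "postgres://user:pass@source-db/app?sslmode=require")
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |       defer db.Close()
          | 
          |       _, err = db.Exec(`
          |           CREATE VIEW students_with_school AS
          |           SELECT s.*, t.school_id
          |           FROM students AS s
          |           JOIN teachers AS t ON t.id = s.teacher_id`)
          |       if err != nil {
          |           log.Fatal(err)
          |       }
          |   }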
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-09-26 23:00 UTC)