[HN Gopher] Launch HN: PeerDB (YC S23) - Fast, Native ETL/ELT for Postgres
       ___________________________________________________________________
        
       Launch HN: PeerDB (YC S23) - Fast, Native ETL/ELT for Postgres
        
        Hi HN! I'm Sai, the co-founder and CEO of PeerDB
        (https://www.peerdb.io/), a Postgres-first data-movement platform
        that makes moving data in and out of Postgres fast and simple.
        PeerDB is free and open source
        (https://github.com/PeerDB-io/peerdb), we provide a Docker stack
        for trying it out, and there's a 5-minute quickstart here:
        https://docs.peerdb.io/quickstart.
        
        For the past 8 years, first at Citus Data and then at Microsoft
        working on Postgres on Azure, I've worked closely with customers
        running Postgres at the heart of their data stack, storing
        anywhere from 10s of GB to 10s of TB of data. That is where I got
        exposed to the challenges customers faced when moving data in and
        out of Postgres.
        
        Usually they would try existing ETL tools, fail, and decide to
        build in-house solutions. Common issues with these tools were:
        painfully slow syncs (moving 100s of GB of data took days); flaky
        and unreliable behavior (frequent crashes, loss of data precision
        on the target); and limited features (lack of configurability,
        unsupported data types, and so on).
        
        I remember a specific scenario where a tool didn't support
        something as simple as Postgres' COPY command to ingest data,
        which would have improved throughput by orders of magnitude. The
        customer and I reached out to that company to request the
        feature. They couldn't prioritize it because it wasn't easy for
        them: their tech stack was designed to support 100s of connectors
        rather than native Postgres features.
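        
        For a sense of scale: COPY ingests a whole stream of rows in one
        command instead of one round trip per INSERT, which is where the
        orders-of-magnitude win comes from. A minimal sketch with a
        made-up table (this is standard Postgres, not PeerDB syntax):
        
          -- Bulk-load CSV rows from the client in a single command.
          COPY events (id, payload, created_at) FROM STDIN WITH (FORMAT csv);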
        
        After multiple such occurrences, I thought: why not build a tool
        specialized for Postgres, making the lives of many Postgres users
        easier? I reached out to my long-time buddy Kaushik, who was
        building operating systems at Google and had led data teams at
        SafeGraph and Palantir. We spent a few weeks building an MVP that
        streamed data in real time from Postgres to BigQuery. It was 10
        times faster than existing tools and maintained data freshness of
        less than 30 seconds. We realized there were many Postgres-native
        and infrastructural optimizations we could make to provide a rich
        data-movement experience for Postgres users. This is when we
        decided to start PeerDB!
        
        We started with two main use cases: real-time Change Data Capture
        from Postgres (demo:
        https://docs.peerdb.io/usecases/realtime-cdc#demo) and real-time
        streaming of query results from Postgres (demo:
        https://docs.peerdb.io/usecases/realtime-streaming-of-query-...).
        The 2nd demo shows PeerDB streaming a table with 100M rows from
        Postgres to Snowflake.
        
        We implement multiple optimizations to provide a fast, reliable,
        feature-rich experience. For performance, we parallelize the
        initial load of a large table while still ensuring consistency,
        taking syncs of 100s of GB from days to minutes. We do this by
        logically partitioning the table based on internal tuple
        identifiers (CTID) and streaming those partitions in parallel
        (inspired by this DuckDB blog:
        https://duckdb.org/2022/09/30/postgres-scanner.html#parallel...).
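        
        As a rough sketch of the idea (the table name is made up and the
        exact queries PeerDB issues are internal; the parallel readers
        also need to share one snapshot, e.g. via Postgres' exported
        snapshots, to stay consistent):
        
          -- Each worker scans a disjoint range of heap pages via CTID.
          -- The '(page,tuple)'::tid bounds are computed from the table's
          -- page count; Postgres 14+ runs these as efficient TID range
          -- scans.
          SELECT * FROM big_table
          WHERE ctid BETWEEN '(0,0)'::tid AND '(100000,0)'::tid;
          
          SELECT * FROM big_table
          WHERE ctid BETWEEN '(100000,0)'::tid AND '(200000,0)'::tid;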
        
        For CDC, we don't use Debezium; rather, we handle replication
        more natively: reading the slot, replicating the changes, keeping
        state, etc. We made this choice mainly for flexibility. Staying
        native helps us use existing and future Postgres enhancements
        more effectively. For example, if the order of rows across tables
        on the target is not important, we can parallelize reading of a
        single slot across multiple tables and improve performance. Our
        architecture is designed for real-time syncs, which enables data
        freshness of a few 10s of seconds even at large throughputs (10k+
        TPS).
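        
        To make "reading the slot" concrete, this is the standard
        Postgres machinery underneath (object names here are just
        illustrative; PeerDB creates and manages these for you):
        
          -- A publication plus a logical replication slot that decodes
          -- changes with pgoutput, Postgres' built-in plugin.
          CREATE PUBLICATION peerdb_pub FOR ALL TABLES;
          SELECT pg_create_logical_replication_slot('peerdb_slot', 'pgoutput');
          
          -- Peek at pending changes without consuming them.
          SELECT lsn, xid
          FROM pg_logical_slot_peek_binary_changes(
                 'peerdb_slot', NULL, NULL,
                 'proto_version', '1', 'publication_names', 'peerdb_pub');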
        
        We have fault-tolerance mechanisms for reliability
        (https://blog.peerdb.io/using-temporal-to-scale-data-synchron...)
        and support multiple features, including log-based (CDC) and
        query-based streaming, efficient syncing of tables with large
        (TOAST) columns, and configurable batching and parallelism to
        prevent OOMs and crashes.
        
        For usability, we provide a Postgres-compatible SQL layer for
        data movement. This makes the life of data engineers much easier.
        They can develop pipelines using a framework they are familiar
        with, without needing to deal with custom UIs and REST APIs, and
        they can use Postgres' 100s of integrations to build and manage
        ETL. We extend Postgres' SQL grammar with a few new, intuitive
        SQL commands to enable real-time data streaming across stores.
        Because of this, we were able to add dbt integration via Dagster
        (in private preview) in a few hours! We expect data engineers to
        unravel similar integrations with PeerDB easily, and plan to make
        this grammar richer as we evolve.
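        
        For a flavor of the grammar (the option names and mapping syntax
        in this sketch are illustrative, not the exact spec; see the
        CREATE MIRROR docs for that):
        
          -- Define two peers over psql, then a CDC mirror between them.
          CREATE PEER source_pg FROM POSTGRES WITH
            (host = 'source-host', port = '5432', user = 'postgres',
             password = 'postgres', database = 'app');
          CREATE PEER target_pg FROM POSTGRES WITH
            (host = 'target-host', port = '5432', user = 'postgres',
             password = 'postgres', database = 'analytics');
          CREATE MIRROR cdc_mirror FROM source_pg TO target_pg
            WITH TABLE MAPPING (public.events:public.events);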
        
        PeerDB consists of the following components to handle data
        replication:
        
        (1) PeerDB Server: uses the pgwire protocol to mimic a PostgreSQL
        server and is responsible for query routing, generating gRPC
        requests to the Flow API. It relies on AST analysis to make
        informed routing decisions.
        
        (2) Flow API: an API layer that handles gRPC commands,
        orchestrating the data-sync operations.
        
        (3) Flow Workers: execute the data read/write operations from the
        source to the destination. Built to scale horizontally, they
        interact with Temporal for increased resilience.
        
        The types of data replication supported are CDC streaming
        replication and query-based batch replication. Workers do all of
        the heavy lifting and have data-store-specific optimizations.
        
        Currently we support 6 target data stores (BigQuery, Snowflake,
        Postgres, S3, Kafka, etc.) for data movement from Postgres. This
        doc captures the current status of the connectors:
        https://docs.peerdb.io/sql/commands/supported-connectors.
        
        As we spoke to more customers, we realized that getting data into
        PostgreSQL at scale is equally important and equally hard. For
        example, one of our customers wants to periodically sync data
        from multiple SQL Server instances (running on the edge) to their
        centralized Postgres database. Requests for Oracle-to-Postgres
        migrations are also common. So now we're also supporting source
        data stores with Postgres as the target (currently SQL Server and
        Postgres itself, with more to come).
        
        We are actively working with customers to onboard them to our
        self-hosted enterprise offering. Our fully hosted offering on the
        cloud is in private preview. We haven't yet decided on pricing.
        One common concern we've heard from customers is that existing
        tools are expensive and charge based on the amount of data
        transferred. To address this, we are considering a more
        transparent way of pricing, for example based on provisioned
        hardware (CPU, memory, disk). We're open to feedback on this!
        
        Check out our GitHub repo (https://github.com/PeerDB-io/peerdb)
        and give it a spin (5-minute quickstart:
        https://docs.peerdb.io/quickstart).
        
        We want to provide the world's best data-movement experience for
        Postgres. We would love your feedback on the product experience,
        our thesis, and anything else that comes to mind. It would be
        super useful for us. Thank you!
        
       Author : saisrirampur
       Score  : 137 points
       Date   : 2023-07-27 15:39 UTC (7 hours ago)
        
       | roanakb wrote:
       | Looks really cool! Nice work.
        
       | xwowsersx wrote:
       | "Frustratingly simple"...that is a very strange adverb to use
       | there in my opinion. When I'm assessing a tool, I don't think
       | frustrating is a word I want to see used to describe it. Just me?
        
         | chzblck wrote:
          | I agree, it's definitely something I would notice when
          | reading.
        
         | codegeek wrote:
         | It is cute but since it is a negative word, I would suggest not
         | using it. Instead use something like "Amazingly Simple" or
         | "Incredibly Simple".
        
           | dang wrote:
           | Maybe "surprisingly simple"?
        
             | saisrirampur wrote:
              | Very interesting. One of our colleagues/friends also
              | proposed that one! :) We'll take it as input if we change
              | the tagline. Thanks!
        
               | arcticfox wrote:
               | Another vote for "surprisingly". Although "frustratingly"
               | is kind of fun because it's so weird...just too negative.
               | If you want to keep a provocative adverb, maybe "oddly"
               | or "weirdly" would be fun too...not normal, but not so
               | negative.
        
         | saisrirampur wrote:
          | Good point! We got mixed feedback on this. Some people really
          | loved it and some felt similar to what you mentioned. To be
          | transparent, we left it as is because it intrigued the
          | audience. Point taken, though; we'll take it as input for
          | future changes, if any :)
        
         | iknownothow wrote:
         | It makes me think of "cute aggression"
        
         | craigkerstiens wrote:
          | Having dealt with ETL tooling that starts out "simple" and
          | next thing you know needs a few dedicated hires for what was
          | supposed to be a simple pipeline, the phrase resonates a lot.
          | If you've already built out a team to do this, tried multiple
          | different tools, and then found something that just works,
          | I'd be all for it; and it would have been just that to me:
          | frustrating to have gone down some other path first.
         | 
         | As a vision statement to me it resonates, now curious to give
         | it a try and see if it fulfills on that vision.
        
       | kaushik92 wrote:
       | Moving data in and out of Postgres in a fast and reliable way is
       | exactly what my startup needs. I am looking forward to trying
       | PeerDB!
        
       | leononame wrote:
       | Congratulations on the launch. Does PostgreSQL to PostgreSQL
       | streaming using PeerDB have any benefit over just using Streaming
       | Replication?
       | 
       | Could this be used as a sort of "live backup" of your data? (i.e.
       | just making sure that data isn't lost if the server dies down
       | completely, not thinking of HA)
       | 
       | Sorry if it's a bit of a stupid question, I realize it's not the
       | main focus of PeerDB.
        
         | saisrirampur wrote:
          | Postgres streaming replication is very robust and has been in
          | Postgres for a long time. Logical replication/decoding (which
          | PeerDB uses) is more recent, introduced in the last decade.
          | However, streaming replication is harder to set up and manage
          | and a bit restrictive: most cloud providers don't give access
          | to the WAL, so you cannot use streaming replication to
          | replicate data across cloud providers.
          | 
          | Sure, you can use PeerDB for backing up data, using either
          | CDC-based or query-based replication; both are pretty fast
          | with PeerDB. You can have cold backups (store data to S3,
          | blob storage, etc.) or hot backups (another Postgres
          | database). However, note that the replication is async and
          | there is some lag (a few 10s of seconds) on the target data
          | store. So if you are expecting zero data loss, this won't be
          | the right approach for backups/HA. With streaming
          | replication, replication can be synchronous (the
          | synchronous_commit setting), which helps with zero data loss.
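          | 
          | For reference, synchronous replication is configuration on
          | the primary; the standby name here is illustrative:
          | 
          |   # postgresql.conf on the primary
          |   synchronous_standby_names = 'replica1'  # standby that must ack
          |   synchronous_commit = on  # commit waits for standby flush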
        
       | Multrex wrote:
        | Hope that in the future you introduce more source and target
        | DBs, just like the ReplicaDB open-source software.
        
         | saisrirampur wrote:
          | Thanks for the comment! As we work with customers we will add
          | more source and target DBs. A couple of things: our scope as
          | of now is data movement in/out of Postgres, and as we add
          | more data stores as sources/targets for Postgres, the primary
          | focus will be providing a high-quality experience rather than
          | expanding coverage.
        
       | hankchinaski wrote:
        | Any plans on supporting Redshift as a target?
        
         | saisrirampur wrote:
          | Redshift should work as it is PostgreSQL-based; under the
          | hood we use simple DML, DDL, and COPY commands. We haven't
          | yet tested it, but it's worth giving it a shot! We have a
          | user testing PeerDB with a Redshift-like database and it
          | works.
        
           | hankchinaski wrote:
            | Not as trivial, since some data types are different (jsonb,
            | array, uuid, etc.), but I will give it a try.
        
             | saisrirampur wrote:
              | Gotcha, worth giving it a shot! If any data type behaves
              | finicky, let us know (via a GitHub issue); we should be
              | able to add support quickly.
        
       | trust_coder wrote:
       | Pretty cool stuff, I would use it just for mirroring the data
       | itself. Curious if you are planning to have change events for
       | e.g. add/update/delete to the records? I would love to get them
       | in a stream and directly dumped into a data-store like bigquery.
        
         | saisrirampur wrote:
          | Yep, change events (CDC) are already supported! PeerDB
          | replicates any DML (insert/update/delete) efficiently to the
          | target data store (incl. BigQuery).
        
       | Kwpolska wrote:
       | So, you've got funding from Hacker News's side hustle, great, but
       | what's the business model beyond that? Why would anyone pay you
       | for hosting if all the code is open-source and anyone can host it
       | on their own, in their own cloud of choice?
       | 
       | (Also, not having pricing when launching seems like a very
       | strange choice, since potential buyers might pass and never come
       | back.)
        
         | saisrirampur wrote:
          | Thanks for the feedback! Many dev-tools and infra products,
          | specifically in the Postgres space, are open source. Citus
          | (my previous gig) is an example here. Customers still pay for
          | these products for 2 main reasons: a/ operationalizing an
          | open-source tool to support production workloads requires a
          | good amount of effort (ex: setting up HA, advanced
          | metrics/monitoring, etc.) and they want to offload that by
          | buying a paid offering that is more plug-and-play; b/ they
          | want to work with a team that empathizes with their
          | challenges, has expertise in the area, and helps make them
          | successful. With PeerDB, we are expecting something similar
          | and are committed to making our customers successful.
          | 
          | On the pricing side, valid feedback. We are actively working
          | with customers and coming up with custom (reasonable) pricing
          | based on their use case and usage level of the product.
          | Through this process we are getting a ton of feedback. As
          | mentioned in the post, a common concern we heard from
          | customers is that the existing tools are expensive (pricing
          | is a black box): they charge based on the amount of data
          | transferred. We are thinking of ways to make pricing more
          | transparent (see the post for what our thinking has been so
          | far), but haven't landed on the right strategy yet. We didn't
          | want to rush through publishing any pricing.
        
           | Kwpolska wrote:
           | Many businesses, big or small, are cheapskates. If the paid
           | offerings don't get you much over the free one, many
           | companies will just take the open-source thing and make it
           | work for them. Does the "Cloud" offering even get you any
           | support?
        
             | saisrirampur wrote:
              | We anticipate both groups of businesses. Considering we
              | are building a product for ETL/data movement, which
              | innately has multiple moving parts and is fragile, we
              | anticipate a good chunk of businesses preferring to
              | offload the management effort to us!
        
       | jwilber wrote:
       | Awesome!
        
       | michaelmior wrote:
       | On the supported connectors page[0], the link to the right of the
       | row for Postgres/S3 is to localhost instead of the docs.
       | 
       | [0] https://docs.peerdb.io/sql/commands/supported-
       | connectors#:~:...
        
         | AmoghBharadwaj wrote:
         | Thanks a lot! Fixed it
        
       | calcsam wrote:
       | This is amazing!
        
       | kobaruon wrote:
       | Looks good! Do you have any benchmark against Debezium for CDC?
        
         | saisrirampur wrote:
          | Not yet, but very soon. A few benefits of PeerDB vs Debezium
          | include: 1/ easier to set up and work with - no dependence on
          | Kafka, ZooKeeper, or Kafka Connect; 2/ a managed experience
          | for CDC from PostgreSQL through our enterprise & hosted
          | offerings; 3/ performance-wise, with the optimizations we are
          | doing (parallelized initial loads, parallelized reading of
          | slots, a leaner signature of CDC on the target), I'm
          | expecting PeerDB to be better, though I'm not sure by how
          | much. Stay tuned for a future post on this :)
        
       | tarun_anand wrote:
       | Hi - Congratulations! In the streaming use case, does it restart
       | from where it left off in case the target peer or source peer is
       | down/restarts etc?
        
         | saisrirampur wrote:
         | Great question. Yes it does. PeerDB keeps track of what rows
         | have been streamed and what are yet to be streamed. During
         | failures (restarts, crashes etc), it uses this to resume from
         | where it left off. More details on how we do it can be found in
         | this blog - https://blog.peerdb.io/using-temporal-to-scale-
         | data-synchron...
        
       | netcraft wrote:
       | Meta: These links are redirecting to
       | https://docs.peerdb.io/introduction#demo for me:
       | 
       | Real-time Change Data Capture from Postgres (demo:
       | https://docs.peerdb.io/usecases/realtime-cdc#demo)
       | 
       | Real-time Streaming of query results from Postgres (demo:
       | https://docs.peerdb.io/usecases/realtime-streaming-of-query-...)
        
         | saisrirampur wrote:
          | Thank you for pointing this out! Just fixed it.
        
       | oskarpearson wrote:
       | Seems like a really useful tool. Would your system support
       | Postgres Aurora on AWS as a source database? Or does it require
       | some lower-level access to Postgres server?
       | 
       | We are currently using DMS to send data to S3 and from there to
       | Snowflake.
        
         | saisrirampur wrote:
          | PeerDB should work for Aurora PostgreSQL, for both log-based
          | (CDC) and query-based replication. Log-based works because
          | Aurora supports the pgoutput plugin. Curious: are you
          | leveraging CDC to move data to S3, or is it more query
          | (batch) based?
        
           | oskarpearson wrote:
           | We use DMS in continuous replication mode, which appears to
           | use CDC under the hood according to https://docs.aws.amazon.c
           | om/dms/latest/userguide/CHAP_Task.C...
           | 
            | In our setup, DMS pushes Parquet files to S3. Snowflake
            | then loads data from there.
           | 
           | We've occasionally had to do a full table sync from scratch,
           | which is painfully slow. We are going to have to do that in
           | the very near future - when we are upgrading from Postgres 11
           | to Postgres 15.
           | 
           | The S3 step also seems unnecessarily complicated, since we
           | have to expire data from the bucket.
           | 
            | How does PeerDB handle things like schema changes? Would
            | the change replicate to Snowflake? (I'm sure this is in the
            | docs, but I'm supposed to be on holiday this week.) Thanks
            | for the quick reply.
        
             | saisrirampur wrote:
              | Gotcha, that really helps. The schema changes feature is
              | coming soon - we are actively working on it. This thread
              | captures our thinking around it:
              | https://news.ycombinator.com/item?id=36895220. Also, have
              | a good holiday! :)
        
               | netvarun wrote:
               | I think you were referring to this thread:
               | https://news.ycombinator.com/item?id=36897010
        
       | arcticfox wrote:
       | Yes please! I love this. The abstraction required for more
       | generic ETL solutions makes them a real pain for my two use-
       | cases: Postgres-to-Postgres (online instance to analytics
       | instance) and Postgres-to-Bigquery (online WAL change data to
        | BigQuery).
       | 
       | I cannot wait to try this to see if I can remove Meltano
       | (Postgres-to-Postgres) and my custom Postgres-to-Bigquery code.
        
       | ozgune wrote:
       | Congrats on the launch!
       | 
       | I've worked with Sai for years, so I just wanted to put in a good
       | word for PeerDB and its founders. Sai is resourceful and
       | relentless; his energy and optimism are contagious. Kaushik
       | complements that with deep backend and analysis skills.
       | 
       | Data movement is a big pain point with different players. I think
       | it's time that there's a Postgres-centric solution out there
       | built by a team who gets Postgres. Best of luck!
        
       | furkansahin wrote:
       | Congratulations on the launch Sai. Having worked with him over
       | the years, I know that Sai knows what postgres migration means. I
       | have seen him deal with countless migrations in and out of our
        | services. I am excited to see what they have built.
        
       | mritchie712 wrote:
       | Good stuff! It's pretty nuts Snowflake doesn't offer an
       | integration like this out of the box. BigQuery kind of supports
       | this[1], but it's not easy to set up or monitor.
       | 
       | Good luck!
       | 
       | 1 - https://www.youtube.com/watch?v=ZNvuobLvL6M
        
         | saisrirampur wrote:
          | Thanks for the comment! Yep, Google provides Datastream. We
          | tried it out and the experience was pretty good! However, it
          | was very much tied to the GCP ecosystem. With PeerDB, our
          | goal is to be open - community-driven rather than
          | cloud-driven. Also, as called out in the post, there are more
          | features beyond CDC (query-based streaming, Postgres as the
          | target, etc.) that we will keep adding to help Postgres
          | users.
        
       | seigel wrote:
       | Looks very intriguing! Tried to get something quickly going with
        | a small db setup I have. Just ran into a `peer type not
       | supported` error and was wondering which three databases are
       | supported of the ones you have listed. See the attached picture.
       | https://d.pr/i/HYIk0Z+
        
         | saisrirampur wrote:
         | That is a typo, you should be able to create all those peers.
         | For which type of peer did you run into this issue?
        
       | yawgmoth wrote:
       | Can this be used with Citus or Hydra?
        
         | saisrirampur wrote:
          | Both should be supported as target data stores.
          | 
          | As a source, PeerDB should work with any Postgres-based
          | database (like Citus). Query-based replication should work!
          | Log-based (CDC) replication could have a few quirks - i.e.
          | the source database needs to support the "pgoutput" format
          | for change data capture. As we evolve, we do plan to enable a
          | native data-movement experience for Postgres-based databases
          | (both extensions and Postgres-compatible)!
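          | 
          | A quick way to probe a Postgres-compatible source is with
          | standard Postgres commands (the slot name is just an
          | example):
          | 
          |   SHOW wal_level;  -- must be 'logical' for logical decoding
          |   SELECT pg_create_logical_replication_slot('probe', 'pgoutput');
          |   SELECT pg_drop_replication_slot('probe');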
        
         | [deleted]
        
       | samaysharma wrote:
       | Nice! I like the focus on Postgres. Most ETL tools end up trying
        | to build for a larger matrix of sources and targets, which
        | limits
       | using database specific features and optimizations. Is the CDC
       | built primarily on top of the logical replication / logical
       | decoding infrastructure in Postgres? If so, what are the
       | limitations in that infrastructure which you'd like to see
       | addressed in future Postgres versions?
        
         | saisrirampur wrote:
          | That is a really good question! A few that come to mind:
          | 
          | 1/ Logical replication support for schema (DDL) changes.
          | 
          | 2/ A native logical replication plugin (not wal2json) that is
          | easier to read from the client side. pgoutput is fast, but
          | reading/parsing it from the client side is not as
          | straightforward.
          | 
          | 3/ Improved decoding perf - I've observed pgoutput cap at
          | 10-15k changes per sec for an average use case, and that's
          | after a good amount of tuning (ex: logical_decoding_work_mem).
          | Enabling larger throughput (50k+ TPS) would be great. This
          | matters for Postgres, considering the diverse variety of
          | workloads users run. For example, at Citus I saw customers
          | doing 500k rps (with COPY); I am not sure logical replication
          | can handle those cases.
          | 
          | 4/ Logical replication slots in remote storage. One big risk
          | with slots is that they can grow in size (if not read
          | properly) and use up storage on the source. Allowing slots to
          | be shipped to remote storage would really help. I think
          | Oracle allows something like this, but I'm not 100% sure.
          | 
          | 5/ Logical decoding on standbys. It is coming in Postgres 16!
          | We will aim to support it in PeerDB right after it is
          | available.
          | 
          | I can think of many more, but sharing a few top ones that
          | came to mind!
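          | 
          | (An aside on 4/: the WAL a slot retains can be watched with a
          | standard catalog query, so at least the growth is observable:
          | 
          |   SELECT slot_name,
          |          pg_size_pretty(pg_wal_lsn_diff(
          |            pg_current_wal_lsn(), restart_lsn)) AS retained_wal
          |   FROM pg_replication_slots;
          | 
          | If the consumer stalls, retained_wal keeps growing until the
          | slot is read or dropped.)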
        
       | [deleted]
        
       | throwaway99431 wrote:
       | What happens if there's a table schema change on a mapped table
       | on the source side? What about on the target side?
        
         | cauchyk wrote:
         | Hi there, I'm Kaushik, one of the co-founders of PeerDB. PeerDB
         | doesn't handle schema changes today.
         | 
          | For CDC, the change stream does give us events for schema
          | changes; we would have to replay them on the destination.
          | Schema changes on the destination are not supported; the
          | general recommendation is to build new tables/views and let
          | PeerDB manage the destination table.
          | 
          | For streaming the results of a query, as long as the query
          | itself can still execute (say, a few columns were added or
          | untouched columns were edited), the mirror job will continue
          | to run. Otherwise, some manual intervention is needed to
          | account for the schema changes.
         | 
         | Thanks for the question, this is a requested feature and on our
         | roadmap.
        
           | jmg_ wrote:
           | Calling out limitations like this in the documentation would
           | go a long way in building confidence in the project. Better
           | yet, if there's an example of how to deal with "day-2"
           | operational concerns like this.
           | 
            | Simply looking at the docs on these two pages, it's unclear
            | to me whether there's a way to update the mirror definition
            | when a schema change occurs or if I need to drop & recreate
            | the mirror (and what the effects of that are on the
            | destination):
           | 
           | - https://docs.peerdb.io/sql/commands/create-mirror
           | 
           | - https://docs.peerdb.io/usecases/Streaming%20Query%20Replica
           | t...
           | 
           | All-in-all, very excited to see this project and will be
           | watching it closely!
        
             | saisrirampur wrote:
              | Thanks for the feedback; I agree on making these missing
              | features more visible in our documentation! We did it
              | here -
              | https://docs.peerdb.io/usecases/Real-time%20CDC/postgres-
              | to-... - but will make it more visible soon, i.e. in the
              | streaming query, CDC, and CREATE MIRROR docs. We were
              | thinking of something along the lines of ALTER MIRROR, or
              | a new OPTION in CREATE MIRROR that would automatically
              | pick up schema changes. The exact spec is not yet
              | finalized.
        
       ___________________________________________________________________
       (page generated 2023-07-27 23:00 UTC)