[HN Gopher] Launch HN: Artie (YC S23) - Real time data replicati...
       ___________________________________________________________________
        
       Launch HN: Artie (YC S23) - Real time data replication to data
       warehouses
        
        Hey, I'm Robin and I'm the founder/CTO at Artie
        (https://www.artie.so/). We solve the problem of stale production
        data in the data warehouse. We're open source, and you can try it
        for free here: https://github.com/artie-labs/transfer.
        Specifically, we do real time data replication from databases to
        data warehouses. We leverage change data capture (CDC) and stream
        processing to perform data transfers efficiently, enabling sub-
        minute latency and significant cost savings. Here's a quick demo
        (jumping straight to the important part):
        https://www.youtube.com/watch?v=uAi1tm4gd9U#t=81s.

        I encountered this problem when I was a heavy data warehouse user
        at prior jobs. The data in our data warehouse was super lagged
        and analytics were always stale. Imagine a fintech company
        performing anti-fraud/transaction monitoring with batched ETLs
        and finding out that fraudulent transactions occurred 24 hours
        ago. This was very frustrating to me! Since my background was in
        distributed systems and database architecture, I knew that there
        was a better way to perform data transfers.

        The more teams I spoke with, the more I realized this was a real
        pain point. People wanted real time data for analytics, fraud
        alerting, transaction monitoring, and training AI/ML models; but
        there wasn't an easy out-of-the-box solution. Companies were
        constrained on either latency, schema integrity/data usability,
        or data accuracy. Companies started telling me that if I built a
        tool that was robust and near real time, but also maintained
        schema integrity and data accuracy, they would very happily pay
        for it.

        So I built Artie, a real time, open-source data streaming
        solution to transfer data from databases to data warehouses and
        handle data and schema changes automatically in-flight (DML and
        DDL).

        Typical ETL solutions leverage batched processes that run on a
        schedule (DAGs, Airflow), which cannot achieve real time data
        syncs. This means that when companies aggregate production data
        into their data warehouse, the underlying data is always stale.

        Streaming change data capture (CDC) logs is a more efficient way
        to transfer data, and helps lower networking/data ingestion costs
        as well. However, building data pipelines with CDC streaming is
        complicated. I wanted Artie to be the tool that abstracts away
        that complexity so that any company can benefit from having real
        time data.

        A big challenge is implementing CDC streaming for stateful data
        (i.e. row updates/deletes) and schema changes (DDL), which most
        streaming solutions just don't address, meaning that complexity
        is passed down to the customer. A lot of in-house streaming
        solutions leverage some combination of Debezium + Kafka/Kinesis +
        Apache Flink and are able to achieve near real time syncs, but
        they only handle append-only events (inserts) and don't handle
        updates/deletes (DML) or schema changes/schema evolution (DDL).
        Not handling these means the data at the destination doesn't look
        exactly like the production database, which obscures the source
        of truth. You end up having to do additional work to make the
        data warehouse tables match the source DB.

        So how do we offer a robust CDC streaming solution? We grab CDC
        logs using Debezium and/or our custom connector (which solves for
        certain edge cases that Debezium doesn't handle) and push them
        into Kafka (or Google Pub/Sub). Kafka helps ensure ordering and
        ease of recovery upon an outage - we use one topic per table, and
        the partition key is the primary key(s) to ensure no out of order
        writes. Artie then consumes these events from Kafka into an
        in-memory DB with our typing library, which can infer schemas
        (DML and DDL), perform optimizations like deduplication, and then
        flush the data to the data warehouse. When the data warehouse
        confirms a successful merge, we then commit the offset within
        Kafka. This all happens with sub-minute data latency even at high
        volumes (several TBs or billions of rows).

        However the data looks in your database, it should look exactly
        the same in your data warehouse. We also strive to handle all
        data types with no exceptions - i.e. supporting TOAST columns,
        composite keys as primary keys, arrays, large rows (>1 MB), etc.
        (I say "strive" because I'm sure we haven't seen all possible
        data types yet!).
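
        To make the "flush" step concrete, the warehouse write is
        conceptually a MERGE keyed on the primary key(s). Here's a rough
        sketch of the general shape in Snowflake (the table, column, and
        delete-flag names are made up for illustration - the exact SQL
        differs per destination):

            MERGE INTO analytics.public.users AS t
            USING users_cdc_batch AS s        -- deduplicated CDC batch
              ON t.id = s.id                  -- primary key(s)
            WHEN MATCHED AND s.__is_deleted THEN DELETE
            WHEN MATCHED THEN UPDATE SET
              name = s.name, updated_at = s.updated_at
            WHEN NOT MATCHED AND NOT s.__is_deleted THEN
              INSERT (id, name, updated_at)
              VALUES (s.id, s.name, s.updated_at);

        Because the merge is keyed on the primary key(s) and we only
        commit the Kafka offset after it succeeds, replaying a batch
        after a failure doesn't double-apply rows.
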
        We've been live for a while now. Several companies use us to
        update all their analytic dashboards in real time (with
        dashboards built on top of their data warehouse). Fintech
        platforms use us to perform financial transaction monitoring. A
        utilities software platform uses us to grab video/photo data to
        perform risk/hazard assessment against ML models.

        Artie is live and supports PostgreSQL, MySQL, MongoDB sources and
        Snowflake, BigQuery, Redshift destinations. We make money from
        our hosted service. We charge based on usage (# of rows
        transferred per month, not including initial snapshots).

        We'd love for you to try it out! You can get started with the
        open source version here: https://github.com/artie-labs/transfer.
        We have a small OSS Slack community (www.artie.so/slack) - feel
        free to ping us for help or any other requests.

        For our hosted version, we need to ensure that we have enough
        storage and compute capacity provisioned, so we're asking cloud
        users to hop on a quick call with us before we activate your
        account. Eventually we'll have easy self-serve functionality but
        that's not 100% built yet, so for now we set up a Slack channel
        to ensure smooth deployments. If you're willing to work with us
        on that, we'll be super excited to show you what we've got. Just
        email us at founders@artie.so or request access from
        https://www.artie.so.

        We'd love for you to try the OSS or hosted solution and give us
        feedback! We're eager to improve the product and test it against
        various workloads and data types :)
        
       Author : tang8330
       Score  : 80 points
       Date   : 2023-07-24 13:21 UTC (9 hours ago)
        
       | bpicolo wrote:
       | > the partition key is the primary key(s) to ensure no out of
       | order writes.
       | 
       | Can you get out of order writes if the pk for a row changes?
        
         | tang8330 wrote:
         | Great question! Say this happened:
         | 
         | * CREATE TABLE user_test (id int primary key, name text);
         | 
         | * INSERT INTO user_test (id, name) VALUES (1, 'foo');
         | 
         | * UPDATE user_test set id = 2 where id = 1;
         | 
         | When the UPDATE call is invoked, there will be 2 events emitted
         | to Kafka. 1/ A DELETE event for id = 1
         | 
         | 2/ A CREATE event for id = 2
         | 
         | Here's the detailed events that get emitted:
         | https://gist.github.com/Tang8330/7a9450f95fbae486a4393abdd49...
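          | 
          | Since events are keyed and merged on the primary key at the
          | destination, the net effect in the warehouse is roughly
          | equivalent to (sketch, not the literal SQL we run):
          | 
          |   DELETE FROM user_test WHERE id = 1;
          | 
          |   INSERT INTO user_test (id, name) VALUES (2, 'foo');
          | 
          | i.e. the old row goes away and the row re-appears under the
          | new key.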
        
           | bpicolo wrote:
           | Interesting! I'm both surprised and also not surprised it's
           | modeled that way.
        
             | tang8330 wrote:
             | Eventual consistency FTW!
        
       | davej wrote:
       | Very cool, are you guys focused on databases as sources or do you
        | plan to add API-based sources too (e.g. Stripe Customers)?
       | Currently using Airbyte but something more real-time would be
       | beneficial to us.
        
         | tang8330 wrote:
         | For now, we're super focused on databases as sources. We really
         | want to do this well before we move on to other data sources
         | such as APIs.
        
         | jgraettinger1 wrote:
         | estuary.dev may be a fit for you (am CTO).
         | 
         | (Competitive product and I feel weird about replying in Artie's
         | thread, but tang8330 has said they're not serving this segment)
        
       | halfcat wrote:
       | Is the value here primarily in low latency updates? i.e. the
       | fraud detection scenario
       | 
       | My main pain points with ETLs are with systems that don't provide
       | good CDC metrics to begin with, which are not Postgres/etc, which
       | are usually a bolt-on ODBC driver or rate-limited APIs.
        
         | tang8330 wrote:
         | Main value props are low latency OLTP data in your DWH and
         | lower DWH ingestion costs.
        
       | joelhaasnoot wrote:
       | Do you support PostGIS? This seems to be an issue with a lot of
       | solutions I've seen...
        
         | tang8330 wrote:
         | Not at the moment, but no reason(s) why we shouldn't. Debezium
         | already supports this [1] and it'd be a minor refactor on our
          | end to support it. If that's something you'd want, LMK!
         | 
         | [1]
         | https://debezium.io/documentation/reference/stable/connector...
        
           | btown wrote:
           | Seconding this! We're in the short-term rental space, and
            | while properties don't change locations frequently (though
           | mistakes get corrected, and also https://arkup.com/
           | exists...), the polygon definitions and hierarchical
           | relationships of regions-of-interest for search and discovery
           | are often being updated, and being able to have real-time
           | geometry intersections rather than rerunning scripts
           | periodically to rebuild our analyses would be incredibly
           | powerful. And certainly for use cases like trucking fleet
           | management etc., real-time analytics on location data across
           | the entire gamut is vital.
        
             | tang8330 wrote:
              | That's fascinating! Thanks for providing more color;
              | support for geometric shapes is coming!
             | https://github.com/artie-labs/transfer/issues/155
        
       | yevpats wrote:
       | Congrats!! Check out also CloudQuery
       | (https://github.com/cloudquery/cloudquery) an open source ELT
        | that supports DB->DB without Kafka or Debezium (Founder here) :)
        
         | tang8330 wrote:
         | Thank you!!
        
       | RyanHamilton wrote:
       | Good to see more entrants in the real time analytics space. I've
        | been waiting years for most databases to embrace streaming. I'm
        | betting on these real time data platforms needing a faster
        | dashboard than Tableau... That's what I'm working on
       | https://www.timestored.com/pulse/. You may find it useful to show
       | off demos. We are partnering with some startups like yourself.
       | Email me if interested. And good luck!
        
       | BenderV wrote:
       | Congrats on your project. Reliable CDC is hard to set up.
       | 
        | I worked a bit in the space, focusing on APIs instead of DBs,
        | which is in the end a quite different problem. My goal was to
        | leverage LLMs to auto-build connectors on the fly.
        | 
        | I started by building a lightweight ETL that I open-sourced
        | (https://github.com/BenderV/universal-data).
        | 
        | I left the space because I realized that I didn't want to work
        | on this problem, even though I believe in the "AI" approach,
        | and think simplifying data transfer (or distributing compute)
        | is one of the key factors to scale data usage.
        
       | EDEdDNEdDYFaN wrote:
       | How many of these tools are out there? This is like the 4th CDC
       | to Data warehouse tool I've seen this year and I'm struggling to
       | understand why they all exist in parallel.
       | 
       | I just found estuary with a colleague yesterday which didn't work
       | and previously have seen at least two other tools promising the
       | same CDC to Data Warehouse in-a-box pipelines. What makes this
       | tool different?
        
         | tang8330 wrote:
         | For us, what we are striving to do differently is:
         | 
          | 1/ DFS vs BFS. We are planning on rolling out connectors slowly
          | vs. building tens / hundreds / thousands of connectors to try
          | to attract a broader audience. Those that have tried to
          | replicate data between OLTP and OLAP know how painful it is and
          | we really want to solve this pain point before we move on to
          | new sources. In addition, we're planning on providing more
          | value than just replicating data from source to destination.
          | We're planning on integrating our telemetry library [1] with
          | Datadog so that customers can:
          | 
          | * Centralize all metrics - see Artie metrics without coming to
          |   our dashboard; instead it's integrated with your existing
          |   tools.
          | 
          | * Get cookie cutter monitors for anomaly detection.
          | 
          | * Get better table quality checks (something we also want to
          |   provide).
         | 
         | 2/ We do not want to change user behavior. We are using pretty
         | standard tools to solve this problem such as Kafka, Pub/Sub and
         | Red Hat Debezium. If you are already using one of these tools,
         | we can just integrate vs. standing up a whole suite of services
          | just for the data pipeline.
          | 
          | * If you have CDC events being emitted for event-driven
          |   architecture already, we'll skip deploying Debezium and just
          |   deploy our consumer.
          | 
          | * If you have Kafka enabled and want to also consume CDC
          |   events, we'll just deploy Debezium to publish to your Kafka
          |   vs. ours.
         | 
         | 3/ Ease of use. This goes without saying, but plenty of tools
         | out there have broken links in their documentation, extremely
         | confusing UI, etc. We really look up to Fivetran in this regard
         | and we try to make the onboarding process for new connectors as
         | simple as possible.
         | 
          | Do you think there is anything missing from the other CDC or
          | data replication tools out in the market? Let me know and I'm
          | happy to see how we can help!
         | 
         | [1] https://docs.artie.so/telemetry/overview
        
         | joelhaasnoot wrote:
         | Many of them really just suck (depending on your workload)
        
         | jgraettinger1 wrote:
         | Hi, I'm Estuary's CTO (https://estuary.dev). Mind speaking a
         | bit more about what didn't work?
         | 
         | We put quite a bit of effort into our CDC connectors, as it's a
         | core competency. We have numerous customers using them at scale
         | successfully, but they can be a bit nuanced to get configured.
         | We're constantly trying to make our onboarding experience more
         | intuitive and seamless... it's a hard problem.
        
       | Aaronstotle wrote:
        | The blocker for testing this out and using it at my place of
        | employment is the lack of support for MySQL's binary data type;
        | any plans to add this?
        
         | tang8330 wrote:
         | Definitely. Do you expect the resulting data to also be in
         | binary / bytes format in your DWH?
         | 
         | I ask because there's a workaround by setting
         | `binary.handling.mode` to a STRING type [1].
         | 
         | Transfer will then automatically pick this up and write this as
         | a B64 string to the DWH.
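          | 
          | And if you do need actual bytes on the warehouse side, you
          | can decode the string there. For example, in Snowflake
          | (column name is just for illustration):
          | 
          |   SELECT TO_BINARY(payload_b64, 'BASE64') AS payload
          |   FROM my_table;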
         | 
         | [1]
         | https://debezium.io/documentation/reference/stable/connector...
        
           | Aaronstotle wrote:
           | Thanks for the reply! We'll be keeping an eye on it
        
       | cgio wrote:
       | I've been in this space and can appreciate your design decisions.
       | They are meaningful, but a couple of comments on the
        | differentiating factors. DML is tackled in traditional pipelines
        | very similarly, i.e. execution of the log at the target, whether
        | that is materialised or as a view. DDL is a nice-to-have from
        | a technical achievement perspective, but changes to a data schema
       | are not live. They have to be deployed, approved etc. so the
       | flexibility of accommodating them on the pipeline is a benefit
       | from an ease of deployment perspective but could also be alarming
       | for data officers who may feel like they are losing a control
       | lever by not getting the pipeline to block when a schema changes
       | outside normal deployment processes. Finally, the biggest issue
       | with CDC always ends up being the seed loads, recoveries and the
       | incremental snapshot strategies.
        
         | tang8330 wrote:
         | Thanks for the comment!
         | 
         | Your comment regarding DDL is interesting.
         | 
         | Today, this is what happens:
         | 
         | 1/ Column doesn't exist in the destination, we'll create it
         | based on our typing inference from the data type (important:
         | not the data value).
         | 
          | 2/ Certain tools will handle automatic column data type
          | conversion if a change like this is detected at the source. We
          | do not do this. We will simply hard fail and cause head-of-line
          | blocking. The reasoning: this is an anti-pattern and should be
          | rare, in which case it's okay to raise an error and require
          | manual intervention for this breaking change.
         | 
         | 3/ If the column has been dropped from the source, you as the
         | end user can decide whether this column should be also dropped
         | in the destination, or not. The default is not to drop it.
         | 
          | ^ We hear from more customers that they explicitly don't want
          | columns to be dropped because it could cause downstream errors,
          | such as other views / tables not compiling due to referencing a
          | non-existent column.
         | 
         | We haven't heard much from folks that don't even want columns
         | to be added. If there is a need, we can definitely add that as
         | a config option to provide maximum configurability.
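          | 
          | As a concrete example of 1/ (table / column names made up),
          | if the source runs:
          | 
          |   ALTER TABLE users ADD COLUMN phone_number text;
          | 
          | then the first event carrying that column leads us to run
          | something along the lines of:
          | 
          |   ALTER TABLE prod.users ADD COLUMN phone_number varchar;
          | 
          | against the destination - driven by the source column's data
          | type, not by inspecting the values.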
         | 
         | > Finally, the biggest issue with CDC always ends up being the
         | seed loads, recoveries and the incremental snapshot strategies.
         | 
         | Yep totally. On the recovery bit, this is exactly why we are
         | leveraging Kafka. If there are any particular issues, we simply
         | don't commit the offset and cause head-of-line blocking.
         | 
         | On the incremental snapshot and recoveries bit, we primarily
         | leverage Debezium's DDD-3 high watermark strategy [1] for MySQL
         | and MongoDB. Postgres has a different issue in that replication
         | slots can grow really fast, esp on AWS! [2]. We ended up
         | writing our own custom snapshotter for Postgres that is
         | Debezium compatible to onboard customers that have a massive
         | dataset and cannot afford to have a read lock on their WAL.
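          | 
          | (For anyone keeping an eye on their own slots, a quick way to
          | see how much WAL a slot is retaining on Postgres 10+:
          | 
          |   SELECT slot_name,
          |          pg_size_pretty(
          |            pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
          |          ) AS retained_wal
          |   FROM pg_replication_slots;
          | 
          | If that number keeps growing, the slot's consumer has fallen
          | behind.)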
         | 
         | [1] https://github.com/debezium/debezium-design-
         | documents/blob/m... [2]
         | https://www.morling.dev/blog/insatiable-postgres-replication...
        
       | PeterZaitsev wrote:
        | Hi Robin, this looks like a very cool project.
        | 
        | Note though that the Elastic License 2.0 is not an Open Source
        | license, so if this is your license choice I'd avoid calling the
        | code you have on GitHub OSS.
        
       | danpalmer wrote:
       | Hey, brief feedback, but "5,000,000,000+" rows processed might be
       | a red flag for some. Many individual customers might do this per
       | day, so to say that's all the company has done so far might put
       | them off.
       | 
       | You've said that initial imports are free, but anecdotally,
       | initial imports ended up being a somewhat regular occurrence as
       | we found issues required a re-import, table rewrites, breaking
       | schema changes, and so on. Do you pay full cost for subsequent
       | full-imports?
       | 
       | Lastly, pricing. I feel like you really need a dollar figure on
       | this even just to start the discussion. "Call us for pricing" is
       | fine for enterprise plans, but in a ~100 person startup I would
       | have just passed immediately on to others that provide pricing
       | details. I'm not going to spend time trialling something without
       | knowing if the pricing is in the right ballpark as it's a waste
       | of my time if it isn't. My perception is that this would be
       | substantially more expensive than batch replication competitors
       | such as Stitch, but I don't know.
        
         | tang8330 wrote:
         | > Hey, brief feedback, but "5,000,000,000+" rows processed
         | might be a red flag for some. Many individual customers might
         | do this per day, so to say that's all the company has done so
         | far might put them off.
         | 
         | Appreciate the feedback! Totally agree and it's something we
          | are working towards :). We learned a ton from going from
          | nothing -> billions; certain functions, backfill strategies,
          | etc. just don't work at that scale. I'm sure we'll have
         | another equally exciting learning curve when scaling from
         | B/month -> B/day!
         | 
         | > You've said that initial imports are free, but anecdotally,
         | initial imports ended up being a somewhat regular occurrence as
         | we found issues required a re-import, table rewrites, breaking
         | schema changes, and so on. Do you pay full cost for subsequent
         | full-imports?
         | 
         | Thanks for pointing this out. We plan to make all backfills
         | free of charge.
         | 
         | > Lastly, pricing. I feel like you really need a dollar figure
         | on this even just to start the discussion. "Call us for
         | pricing" is fine for enterprise plans, but in a ~100 person
         | startup I would have just passed immediately on to others that
         | provide pricing details. I'm not going to spend time trialling
         | something without knowing if the pricing is in the right
         | ballpark as it's a waste of my time if it isn't. My perception
         | is that this would be substantially more expensive than batch
         | replication competitors such as Stitch, but I don't know.
         | 
          | Makes sense. A lot of folks think that streaming is more
          | expensive than batched, but what we've found with our initial
          | customers is the opposite. Streaming is able to distribute the
          | load to the DWH and as a result, customers on Snowflake can use
          | a much smaller vDWH when working with Artie. On the "call us
          | for pricing" front, we are planning to provide a calculator /
          | graph to estimate costs before talking to a human. However, we
          | are learning that a lot of companies have bespoke 1yr / 2yr
          | contracts with their data providers, so we're still trying to
          | figure out our own pricing and packaging at the moment.
        
           | GordonS wrote:
           | I just wanted to second the GP's point on pricing - like
           | them, "call me pricing" is a hard no from me. IMO it's fine
           | for the highest tier (Enterprise or whatever), but lower
           | tiers should have fixed and public pricing.
        
       | slotrans wrote:
       | (None of the below is meant to diminish the work done by the
       | author/poster.)
       | 
       | > This means that when companies aggregate production data into
       | their data warehouse, the underlying data is always stale.
       | 
       | This is intentional and desirable.
       | 
       | The classic piece on this is this one by Dan McKinley
       | https://mcfunley.com/whom-the-gods-would-destroy-they-first-...
       | 
       | Something McKinley doesn't address is that it's quite
        | advantageous if the values in your data warehouse _don't change
       | intra-day_ because this lets business users reach consensus.
       | Whereas if Bob runs a report and gets $X, and Alice runs the same
       | report 5 minutes later and gets $Y, that creates confusion (much
       | more than you would expect). I recall a particular system I built
       | that refreshed every 6 hours (limited by upstream), that
       | eventually Marketing asked me to dial back to every 24 hours
       | because they couldn't stand things changing in the middle of the
       | day.
       | 
       | Now of course I see you're targeting more real-time use cases
        | like fraud detection. That's great! But why would you run a fraud
       | detection process out of your data warehouse, which likely
       | doesn't even have a production-grade uptime SLA? Run it out of
       | your production database, that's what it's for!
        
         | tang8330 wrote:
         | Thanks for your feedback!
         | 
         | > Something McKinley doesn't address is that it's quite
         | advantageous if the values in your data warehouse don't change
         | intra-day because this lets business users reach consensus.
         | Whereas if Bob runs a report and gets $X, and Alice runs the
         | same report 5 minutes later and gets $Y, that creates confusion
         | (much more than you would expect). I recall a particular system
         | I built that refreshed every 6 hours (limited by upstream),
         | that eventually Marketing asked me to dial back to every 24
         | hours because they couldn't stand things changing in the middle
         | of the day.
         | 
         | If they want to see a consistent view of the report, you could
         | bound this.
         | 
         | 1/ SELECT * FROM FOO WHERE DATE_TRUNC('day', updated_at) <
         | DATE_TRUNC('day', DATEADD(day, -1, CURRENT_DATE()));
         | 
          | If your dataset doesn't contain an updated_at column, you can
          | turn on `artie_updated_at`, which will provide an additional
          | column with the updated_at field to support incremental
          | ingestion.
          | 
          | 2/ If you had stateful data, you could also explore creating a
          | Snowflake task and leveraging the time travel function to
          | create a "snapshot" if your workload depended on it (rough
          | sketch below).
         | 
         | 3/ Also, if you _did_ want this to be more lagged, you can
         | actually increase the flushIntervalSeconds [1] to 6h, 24h,
         | whichever time interval you fancy. You as the customer should
         | have maximum flexibility when it comes to when to flush to DWH.
         | 
         | 4/ You can also choose to refresh the analytical report on
         | Looker / Mode to be daily. [2]
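          | 
          | For 2/, a rough sketch of what that could look like in
          | Snowflake (warehouse / table names are placeholders):
          | 
          |   CREATE OR REPLACE TASK refresh_orders_snapshot
          |     WAREHOUSE = REPORTING_WH
          |     SCHEDULE = 'USING CRON 0 0 * * * UTC'
          |   AS
          |     CREATE OR REPLACE TABLE orders_daily AS
          |       SELECT * FROM orders;
          | 
          |   ALTER TASK refresh_orders_snapshot RESUME;
          | 
          |   -- or, ad hoc, read the table as it looked ~24h ago:
          |   SELECT * FROM orders AT (OFFSET => -60*60*24);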
         | 
         | > Now of course I see you're targeting more real-time use cases
         | like fraud detection. That's great! But why you would run a
         | fraud detection process out of your data warehouse, which
         | likely doesn't even have a production-grade uptime SLA? Run it
         | out of your production database, that's what it's for!
         | 
          | You can certainly do this in the production DB (that was our
          | original hypothesis as well!). However, after talking to more
          | companies, it has become more obvious to us that folks that
         | are running fraud algos actually want to join this across
         | various data sets. Further, by using a DWH - it provides a nice
         | visualization layer on top.
         | 
         | Of course, you could go with something even more bespoke by
         | utilizing real-time DBs such as Materialize / Rockset /
          | RisingWave. They just come with trade-offs such as increased
          | architectural complexity.
         | 
          | There are also plenty of additional use cases this can unlock
          | given that the DWH is a platform: any post-DWH application can
          | benefit from less lag, such as reverse ETLs.
         | 
         | [1] https://docs.artie.so/running-transfer/options
         | 
         | [2] https://mode.com/help/articles/report-scheduling-and-
         | sharing...
        
           | debarshri wrote:
            | I agree with the parent's point. I also don't think DWH is
            | the primary use case for your platform.
           | 
           | I have seen architectures where databases are siloed within
           | departments and data has to be replicated across department
           | physical databases in the same network or different, mostly
           | in banks, insurances and old school industries. In this
           | scenario, a daily batch would run that would replicate and
           | populate the tables and kick start business processes. A
            | platform like this would make sense. Another use case I can
            | think of is reverse ETL, but there are many tools custom made
            | for that.
           | 
            | As for fraud analysis, there are many vendor tools that do
            | exactly that; asking people to visualize and implement a
            | full-blown use case is hard.
           | 
            | I might be naive, but I don't see the USP between Artie and
            | Airbyte, Hevodata, Fivetran, Stitch, etc. and others from a
            | distance.
        
           | mbesto wrote:
           | I think you missed the parent's point - your USP is real-time
           | replication. So everything you're proposing makes it _not_
           | real time. Your USP is now worthless (in that context) and
           | you 're competitors are numerous.
        
             | tang8330 wrote:
             | Hm, perhaps I wasn't being clear, apologies for that.
             | 
              | What I am proposing above are ways to provide a view to
             | teams that do not want real-time data while keeping your
             | underlying dataset in real-time.
        
       ___________________________________________________________________
       (page generated 2023-07-24 23:00 UTC)