[HN Gopher] Launch HN: Airbyte (YC W20) - Open-Source ELT (Fivet...
       Launch HN: Airbyte (YC W20) - Open-Source ELT (Fivetran/Stitch
       Hi HN!  Michel here with John, Shrif, Jared, Charles, and Chris. We
       are building an open-source ELT platform that replicates data from
       any applications, APIs, databases, etc. into your data warehouses,
       data lakes or databases: https://airbyte.io.  I've been in data
       engineering for 11 years. Before Airbyte, I was the head of
       integrations at Liveramp, where we built and scaled over 1,000 data
       ingestion connectors to replicate 100TB worth of data every day.
       John, on the other end, has already built 3 startups with 2 exits.
       His latest one didn't work out, though. He spent almost a year
       building ETL pipelines for an engineering management platform, but
       he eventually ran out of money before reaching product-market fit.
       By late 2019, we had known each other for 7 years, and always
       wanted to work together. When John's third startup shut down, it
       was finally the right timing for both of us. And we knew which
       problem we wanted to address: data integration, and ELT more
       specifically.  We started interviewing Fivetran, Stitchdata, and
       Matillion's customers, in order to see if the existing solutions
       were solving their problems. We learned they all fell short, and
       always with the same patterns.  Some limitations we identified are
       due to the fact that they are closed source. This prevents them
       from addressing the long tail of integrations because they will
       always have a ROI consideration when building and maintaining new
       connectors. A good example is Fivetran which, after 8 years, offers
       around 150 connectors. This is not a lot when you look at the
       number of existing tools out there (more than 10,000). In fact, all
       their customers that we talked to are building and maintaining
       their own connectors (along with orchestration, scheduling,
       monitoring, etc.) in-house, as the connectors they needed were
       either not supported in the way they needed or not supported at
       all.  Some of those customers also tried to leverage existing open-
       source solutions, but the quality of the existing connectors is
       inconsistent, as many haven't been updated in years. Plus, they are
       not usable out of the box.  That's when we knew we wanted Airbyte
       to be open-source (MIT license), usable out of the box, and cover
       the long tail of integrations. By making it trivial to build new
       connectors on Airbyte in any language (they run as Docker
       containers), we hope the community will help us build and maintain
       the long tail of connectors. While open-source also enables us to
       address all use cases (including internal DBs and APIs), it also
       allows us to solve the problem inherent to cloud-based solutions:
       the security and privacy of your data. Companies don't need to
       trust yet another 3rd-party vendor. Because it is self-hosted, it
       will disrupt the pricing of existing solutions.  Here's a 2-minute
       demo video if you want to check out how it looks:
       https://www.youtube.com/watch?v=sKDviQrOAbU  Airbyte can run on a
       single node without any external infrastructure. We also integrate
       with Kubernetes (alpha), and will soon integrate with Airflow so
       you can run replication tasks across your cluster.  Today, our
       early version supports about 41 sources and 6 destinations
       (https://docs.airbyte.io/integrations/destinations). We're
       releasing new connectors
       (https://docs.airbyte.io/changelog/connectors) every week (6 of
       them have already been contributed by the community). We
       bootstrapped some connectors using the highest-quality ones from
       Singer. Our connectors will always remain open-source.  Our goal is
       to solve data integration for as many companies as possible, and
       the success of Airbyte is predicated on the open-source project
       becoming loved and ubiquitous. For this reason, we will focus the
       entirety of 2021 on strengthening the open-source edition; we are
       dedicated to making it amazing for all users. We will eventually
       create a paid edition (open core model) with enterprise-level
       features (support, SLA, hosting and management, privacy compliance,
       role and access management, SSO, etc.) to address the needs of our
       most demanding users.  Give it a spin:
       https://github.com/airbytehq/airbyte/ & https://demo.airbyte.io.
       Let us know what you think. This is our first time building an
       open-source technology, so we know we have a lot to learn!
       Author : mtricot
       Score  : 104 points
       Date   : 2021-01-26 16:08 UTC (6 hours ago)
       | georgewfraser wrote:
       | The fundamental challenge of open-source ETL is that high-quality
       | connectors require understanding and working around all kinds of
       | corner cases in the API of each data source. It's very hard to
       | get open source contributors to do this kind of work; it's a real
       | slog. Hence at Fivetran we've always stuck with the commercial
       | route.
         | cgardens wrote:
         | We couldn't agree more that producing high-quality connectors
         | requires a lot of work. The hardest part about this task is
         | that connectors must evolve quickly (due to changes in the API,
         | new corner cases, etc). The quality of the connector is not
         | just how well the first version works but how well it works
         | throughout its entire lifetime.
         | Our perspective is that by providing these connectors as open
         | source we can arrive at higher quality connectors. For a closed
         | source solution, a user has to go through customer service and
         | persuade them that there is indeed a problem. A story we have
         | heard countless times, is that SaaS ETL providers are slow to
         | fix corner cases discovered by users leading to extended
         | downtime. With an OSS solution, a user can fix a problem
         | themselves and be back online immediately.
         | We proactively maintain all connectors, but we believe that by
         | sharing that responsibility with the OSS community, we can
         | achieve the highest quality connectors.
         | One of the main focuses of Airbyte is to provide a very strong
         | open-source MIT standard for testing and developing (base
         | packages, standard tests, best practices...) connectors in
         | order to achieve the highest quality.
         | NavyDish wrote:
         | Similar thoughts (btw I came here looking for your comment
         | ha!).
         | I guess you had mentioned in one of the videos that at
         | Fivetran, it is your responsibility to ensure data integrity
         | across all of the sources/integrations, and has been since the
         | early days. This led the customers to trust the product in the
         | early days and the team to draw learnings from abstract
         | patterns across sources.
         | Have come to believe that it is THE MOST important thing to
         | have an explicit ownership for issues whenever there is
         | physical movement of data across an org's ecosystem.
           | vianneychevalie wrote:
           | How do you determine this explicit ownership for issues? I've
           | come across many governance problems linked to a lack of
           | transparency in "bug ownership", but I've often failed to
           | find a common ground for clients and third parties: who's
           | responsible? Who should pay for it?
           | Quite often it's the one with the loudest mouth or the
           | biggest sponsor who wins.
         | noahmbarr wrote:
         | A customer perspective from a mid-stage CFO: I like saving
         | money, but prefer to pay for software solutions like this
         | directly. I pay you, and you make sure this set of connectors
         | {in and out} continues to durably work. Meanwhile, our
         | engineers can focus on building our product.
           | jeanlaf wrote:
           | This is something we will definitely offer as well, with an
           | SLA. And because the maintenance is not only done by us, but
           | the community as well, fixes will be propagated throughout
           | all users much faster than if it has to go through customer
           | support.
           | Open-source doesn't mean you can't have both. You can check
           | how Databricks or Confluent are doing.
         | DouweM wrote:
         | I think that if the wider open source community can maintain
         | API client libraries for every imaginable SaaS API and every
         | popular programming language, there's no reason that it can't
         | maintain open source ELT connectors for all of these sources as
         | well.
         | I work at GitLab as project lead of Meltano
         | (https://meltano.com/) -- which embraces Singer instead of
         | abandoning it -- and we've seen a lot of interest from data
         | consultancies looking for mature tooling around deploying and
         | developing Singer taps, many of whom have expressed that they'd
         | be happy to maintain open source ELT connectors for data
         | sources that are commonly used by their clients, if they can
         | significantly save on ELT costs that would otherwise get passed
         | on to those clients.
         | Of course, only one data consultancy (or data team at a
         | company) would need to maintain an open source tap, and others
         | that need the same source for _their_ clients can contribute
         | and help keep it up to date.
         | marcinzm wrote:
         | Yeah, Fivetran seems often to have issues that come down to "we
         | had a discussion with data source/sink provider and found they
         | had a bug in their latest release." Even if an open source
         | contributor gets to that point they won't have the strong arm
         | ability to force the provider to fix the bug ASAP.
         | renewiltord wrote:
         | Personally, the most infuriating thing about a tool is where I
         | can fix the damn thing given the source code but I have to go
         | through the support staff to the engineering team and then wait
         | for "this is on our roadmap but not something we're currently
         | prioritizing". Right, I know. I don't expect other people to do
         | work for me. I just need them to let me do the work myself.
         | The massive advantage of the OSS route isn't that you can ask
         | the community to build a tool for you; it's that when you
         | inevitably have a corner case or some behaviour you want to
         | encode, you can just make RenesPostgres connector and copy in
         | the Postgres connector and fix it.
         | I don't understand why anyone keeps their source all closed.
         | Even one of those "you can't release this but you can edit it"
         | licenses is better.
         | Half of why I use Kong as an API Gateway is that I can just
         | edit the source code of their plugins. Thank fuck for that.
         | cpard wrote:
         | That's an excellent point and not easy to demonstrate until
         | someone does experience an edge case with their connectors. The
         | main value of open sourcing a framework for integrations (e.g.
         | Singer), is to allow customers to easily support a large number
         | of long tail integrations that exist out there.
       | tejasmanohar wrote:
       | Congrats! We at Hightouch [0] ("reverse ETL") are excited to see
       | Airbyte here on HN. We've been following Michel & John for a
       | while now since the YC days, and from the outside, it seems like
       | they've been consistently shipping incredibly quickly ever since
       | the open-source project launch.
       | @mtricot -- You mention that a big value prop of Airbyte is
       | providing an interface for building custom connectors. Have there
       | been interesting learnings on designing an ideal "interface" to
       | provide developers? How does the interface you provide compare to
       | that of Fivetran's Functions offering [1]?
       | [0]: https://hightouch.io
       | [1]: https://fivetran.com/docs/functions
         | mtricot wrote:
         | Answering your first question: When talking about the interface
         | we need to separate: the data protocol and the developer
         | experience (DX) creating & maintaining a connector. We believe
         | the data protocol we have in place should address 95% of the
         | use cases and, as we get more sophisticated use cases we will
         | evolve the protocol (for example for more scale). Regarding the
         | DX, we are continuously working on it to make it a breeze and
         | ensure super high quality.
         | Answering your second question: Fivetran functions are a nice
         | escape hatch but none of the users we talked to mentioned
         | those. They always mentioned building in-house for missing
         | connectors. My interpretation is that this is too much of a
         | vendor lock-in for a cloud-based product.
       | yclurker wrote:
       | The Open Source Fivetran alternative. Yay, it was about time! A
       | simple license : MIT. Clear differentiation between free & paid
       | plans. I am liking what I am seeing so far. One of our clients is
       | in the advertising industry and is syncing data from 20 different
       | API vendors to Postgres. So I am one of your potential customers.
       | However, there is a big problem I'm noticing with "Open source
       | alternatives" lately on HN. I had to mention this.
       | Even a simple installation of airbyte on my local machine fails
       | :( I tried docker-compose up!
       | I simply wanna know why a basic example is not working on an
       | important day of your company ? :) Is this a genuine mistake ?
       | Sorry, this feedback will sound harsh but companies are taking
       | words 'open source' for a complete ride. It's a great marketing
       | trick. Gets you plenty of eyeballs, good will & trust to begin
       | with. Then later we figure it's not even self hostable.
       | Here is a bad example that you may not want to follow : Supabase
       | "The Open Source Firebase Alternative". The product is not self
       | hostable despite calling themselves open source firebase all over
       | internet. The Founders of Supabase have been disingenuous not to
       | address self hosting[1][2] and it's been a long time since their
       | launch. The self hosting section on their website[3] doesn't
       | provide any details on how to self host and they are careless
       | enough to even mention "how to migrate away" from Supabase in
       | that section.
       | [1] :
       | https://github.com/supabase/supabase/discussions/219#discuss...
       | [2] :
       | https://github.com/supabase/supabase/issues/85#issuecomment-...
       | [3] : https://supabase.io/docs/guides/platform#self-hosting
         | mtricot wrote:
         | Sorry that it doesn't work :( Murphy's law on launch day, I would
         | say...
         | Right now all our users self-host and the whole project is
         | meant to be self-hosted for data privacy and security reasons.
         | Do you want to join our slack (https://slack.airbyte.io)? We
         | can help you on the resolution!
           | yclurker wrote:
           | Traceback (most recent call last):
           |   File "site-packages/urllib3/connectionpool.py", line 677, in urlopen
           |   File "site-packages/urllib3/connectionpool.py", line 392, in _make_request
           |   File "http/client.py", line 1252, in request
           |   File "http/client.py", line 1298, in _send_request
           |   File "http/client.py", line 1247, in endheaders
           |   File "http/client.py", line 1026, in _send_output
           |   File "http/client.py", line 966, in send
           |   File "site-packages/docker/transport/unixconn.py", line 43, in connect
             | mtricot wrote:
             | Thanks for posting! I don't have enough context to help you
             | solve.
             | Do you want to send a screenshot of your terminal? michel
             | [@] airbyte.io
       | Dnguyen wrote:
       | I've done a lot of ETL over the years, and I watched the video and
       | read the intro. I apologize, but I haven't had a chance to read
       | through the full documentation. My question is: a lot of the data I
       | pull is relational and hierarchical, so how would I pass variables
       | along to related connectors? And is there a way to wait for a
       | parent connection to finish before the child connector runs? I've
       | built many ETL pipelines over the years and it's not easy to keep
       | up with the changes. The bottleneck was always the engineering
       | adapting to the changing schema.
         | mtricot wrote:
         | It is not possible as of now but we are about to start
         | integrating with DAG managers (Airflow, Dagster, Prefect...).
         | That will give the possibility to schedule the connectors with
         | the proper dependencies.
         | See: https://github.com/airbytehq/airbyte/issues/836
           | Dnguyen wrote:
           | Correct me if I'm wrong, but Airbyte is mainly an EL to be
           | used with other schedulers? The scheduler will have to
           | provide, internally or externally, the logic and parameters
           | to the nodes that drive Airbyte connectors?
             | mtricot wrote:
             | we are still figuring out how we want to integrate with
             | other schedulers.
             | One option would be that you configure your
             | source/destination with the Airbyte UI or API and with the
             | external scheduler you just reference to a connection
             | object with an id.
             | We need to run some experiments and talk to the
             | community to see what makes the most sense. If you have
             | some opinion / scenarios, do you want to write them in the
             | ticket?
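
The option described above (configure the connection in Airbyte, then have the external scheduler refer to it only by id) could be sketched as follows. This is a minimal sketch, not a confirmed description of the Airbyte API: the endpoint path `connections/sync`, the payload field `connectionId`, the base URL, and the id value are all assumptions for illustration. The request is built but never sent.

```python
import json
import urllib.request


def build_sync_request(base_url: str, connection_id: str) -> urllib.request.Request:
    """Build (but do not send) a request asking Airbyte to run one connection."""
    # Assumed endpoint and payload shape; check the API docs for your version.
    url = base_url.rstrip("/") + "/connections/sync"
    body = json.dumps({"connectionId": connection_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: the scheduler task stores only the connection id,
# while sources and destinations stay configured inside Airbyte.
request = build_sync_request("http://localhost:8001/api/v1/", "my-connection-id")
print(request.get_method(), request.full_url)
```

A scheduler task (an Airflow operator, for instance) would send such a request and then poll for completion; how job status is reported is left out here because the thread does not describe it.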
       | bithavoc wrote:
       | Is there a guide to set it up behind a reverse proxy with https?
       | I have it running with Nginx but it won't load; it insists on
       | using /8001/api. How do I change the API endpoint?
         | mtricot wrote:
         | You can. You can play with the env variable: API_URL
         | For example if you have a reverse proxy that serves both the
         | webapp and the api you can just launch with: API_URL=/api/v1/
         | docker-compose up
           | bithavoc wrote:
           | I just effortlessly synced Freshdesk to Postgres in 20
           | minutes. Thank you.
       | rilut wrote:
       | Will it support change data capture (CDC) from SQL databases?
         | mtricot wrote:
         | Yes, it is on our roadmap. We will likely have an alpha version
         | in the next two months.
         | https://github.com/airbytehq/airbyte/issues/957
           | rilut wrote:
           | Fantastic. We've been using both a managed service and an
           | in-house solution (debezium + kafka) for CDC from transactional
           | to analytics. Looking forward to seeing this on Airbyte.
       | influx wrote:
       | How do you differentiate from Meltano?
         | jeanlaf wrote:
         | The main difference is that Meltano is based on Singer while
         | Airbyte is based on the Airbyte Protocol.
         | We don't believe Singer is a good building block, because it
         | requires a significant time investment from its users to
         | compensate for the absence of centralized enforcement of the
         | Singer protocol. Since it is not enforced, there is often no
         | guarantee that any pair of Singer connectors are compatible. It
         | defeats the point of a specification. All taps live in their
         | own repo, and all contributions are made to address the
         | contributor's case, not the general use case. The lack of
         | standard makes it very difficult to maintain all those
         | connectors, and you end up with a majority of Singer taps being
         | out of date.
         | Airbyte doesn't have the same data protocol as Singer (but we
         | are compatible). Our goal is to make building and maintaining
         | new connectors a lot easier than it is with Singer, and
         | therefore Meltano. That's why we were able to ramp up our
         | connectors (46 now) within just 5 months, while Meltano is
         | focused on fixing the issues with Singer. We think it's much
         | harder to patch over Singer and reverse course on an
         | abandonware project than it is to start from the ground up with
         | these issues in mind. We wouldn't be surprised if Meltano
         | starts supporting Airbyte connectors in the future.
         | We detail these differences here:
         | https://docs.airbyte.io/faq/differences-with.../meltano-vs-a...
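
To illustrate what "centralized enforcement" of a connector protocol can mean in practice, here is a minimal sketch of a runner-side check that rejects any connector output that does not match the expected message shape. The message types and required fields below are hypothetical and simplified; they are not the actual Airbyte or Singer specification.

```python
import json

# Hypothetical, simplified message shapes used only for this illustration.
REQUIRED_FIELDS = {
    "RECORD": {"stream", "data"},
    "STATE": {"data"},
}


def validate_message(line: str) -> dict:
    """Parse one line of connector output and reject anything off-spec."""
    message = json.loads(line)
    if not isinstance(message, dict):
        raise ValueError("a message must be a JSON object")
    kind = message.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown message type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(message)
    if missing:
        raise ValueError(f"{kind} message is missing fields: {sorted(missing)}")
    return message


# A well-formed record passes through unchanged.
ok = validate_message('{"type": "RECORD", "stream": "users", "data": {"id": 1}}')
print(ok["stream"])

# A record without its payload is rejected before reaching any destination.
try:
    validate_message('{"type": "RECORD", "stream": "users"}')
except ValueError as error:
    print("rejected:", error)
```

Because every source goes through the same check, any source that passes it can be paired with any destination that reads the same shapes, which is the compatibility guarantee the comment says is missing when a spec is not enforced.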
           | theboat wrote:
           | Does this imply airbyte only supports connectors that airbyte
           | validates and integrates into the platform? Can I use an
           | airbyte connector that lives in a repo on my private github?
           | This solves the problem of getting high quality connectors
           | built, but how do you plan to maintain them? What if the
           | original contributor falls off the face of the earth?
             | mtricot wrote:
             | You can also use an external connector if you want to. This
             | is a very valid use-case, especially if you connect
             | internal APIs or private sources that wouldn't make sense
             | for the community.
             | If the original contributor falls off the face of the
             | earth, it is OK! That's the beauty of Open-Source. Another
             | person who is using it can jump in. We can also jump in.
       | rilut wrote:
       | Will Airbyte support MySQL as destination?
         | mtricot wrote:
         | Yes! it is on our short term roadmap:
         | https://github.com/airbytehq/airbyte/issues/1483
         | Also, we are OSS so if you want to contribute, we can guide you
         | through it!
       | jameslk wrote:
       | It's nice to see some competition in this space, especially open
       | source. I'm a little confused though I thought Singer was the
       | open source version of Stitch (which you mention briefly):
       | https://www.singer.io. But maybe Singer doesn't have all the UI
       | features that Stitch does and that's where Airbyte is different?
       | I would love to know more about the differences
         | mtricot wrote:
         | I am one of Airbyte's founders. The initial version was
         | actually fully based on Singer, and this is when we realized it
         | wouldn't be possible for us to depend on it.
         | Amongst the main reasons: Singer seems to have been abandoned
         | by StitchData (after they got acquired by Talend), the quality
         | of the connectors is too unpredictable, and Singer connectors
         | are not usable out of the box.
         | We would have preferred to use an existing standard if one
         | already existed. It was a tough decision for us to create
         | something from scratch but now we are very satisfied with the
         | decision. It is way easier for the community and for us to
         | build connectors that meet quality standards and we can make it
         | MIT so the community can have control on the evolution of the
         | protocol.
         | We actually wrote a few articles about it:
         | https://docs.airbyte.io/faq/differences-with.../singer-vs-ai...
         | https://airbyte.io/articles/data-engineering-thoughts/airbyt...
           | mtricot wrote:
           | Forgot to mention: we have a compatibility layer with Singer,
           | so it is possible to run Singer taps in Airbyte. A few of our
           | sources are actually some of the high-quality Singer taps.
         | DouweM wrote:
         | At GitLab, we're not ready to give up on the Singer spec,
         | community, and ecosystem yet, which is why I've been working on
         | Meltano for the past year: https://meltano.com/
         | We think that the biggest things holding back Singer are the
         | lack of documentation and tooling around taking existing taps
         | and targets to production, and around building, debugging,
         | maintaining, and testing new or existing high-quality taps and
         | targets.
         | Meltano itself addresses the first problem, and provides a
         | robust and reliable platform for building, running &
         | orchestrating Singer- and dbt-based ELT pipelines. It's built
         | for developers who are comfortable with CLIs and YML files, and
         | want their pipelines to be defined in a Git repository so that
         | they get the benefits of DevOps best practices like code review
         | and CI/CD.
         | At the same time, we have been working with some members of the
         | community on a new framework for building taps and targets:
         | https://gitlab.com/meltano/meltano/-/issues/2401, which we have
         | decided to call the Singer SDK:
         | https://gitlab.com/meltano/singer-sdk. We are moving as many
         | Singer specification-specific details around things like
         | incremental state replication and stream/field selection into
         | the framework, so that individual taps only need to worry about
         | getting the data from the source and can be expected to behave
         | more consistently and correctly across the board.
         | tehalex wrote:
         | Stitch is partially open as stitch - many of the integrations
         | they list on their website are hosted versions of OSS singer,
         | but a couple are not OSS.
         | I'm not sure how Stitch's acquisition will affect Singer
         | contributions and such going forward.
         | Also, Singer has no UI, it's all CLI.
       | polskibus wrote:
       | Some hard questions that I need to ask as someone that does set
       | up complex data pipelines for customers as part of his job.
       | How do you want to monetize your product? What is your runway as
       | of today? When do you project you will be self-sustainable?
       | It's all great to have an open source solution for pushing the
       | data around, but I don't want to invest in learning a new tool
       | only to see it vanish in 2-3 years or so.
         | jeanlaf wrote:
         | Sure! Those are actually great questions!
         | Regarding monetization, you can see more details here:
         | https://airbyte.io/pricing We consider 2 monetization
         | approaches:
         | - Open core (connectors staying open-source forever) with
         | premium features such as hosting & management (cloud-based
         | control panel without access to your data plane) and enterprise
         | features (privacy compliance, SSO, user access management, etc.)
         | - What we call "Powered by Airbyte", where we enable you to
         | offer integrations to your own customers using our API
         | Regarding our runway, with the team as is, mid 2025. We intend
         | to grow the team though, given the adoption growth we have.
         | We've already been approached for a Series-A, but will consider
         | it in mid 2022.
         | Regarding self-sustainability, do you mean financially?
         | Possibly at the end of 2023 or 2024.
         | How does that sound to you? Genuinely curious.
           | polskibus wrote:
           | Your pricing page does not have any prices so it's hard to
           | tell. I'll definitely keep an eye on your progress though.
           | Your connector as container approach is interesting although
           | i think IT departments may have issues with an on prem
           | solution that spawns other containers by itself.
           | One more thing - how will you protect yourself from let's say
           | AWS forking you and selling a managed airbyte version?
       | giovannibonetti wrote:
       | Shouldn't it be ETL (extract, transform, load) [1]?
       | [1] https://en.wikipedia.org/wiki/Extract%2C_transform%2C_load
         | scapecast wrote:
         | no, it should not. The point of modern data warehouses like
         | Snowflake is that you run the transformations in the warehouse,
         | vs. some external transformation layer (think Informatica).
         | In the old approach, you would run the transform BEFORE loading
         | data into the warehouse. The disadvantage of that approach is
         | that you lose all fidelity of the raw data.
         | In the new approach (Airbyte's approach), you load the raw data
         | into the warehouse, and then run your transform jobs in the
         | warehouse. You can do that because modern warehouses are cheap
         | and scalable. The benefit of that approach is that you keep
         | your raw data with all its fidelity, opening up endless
         | opportunities for exploratory slicing and dicing.
         | That's why it's called "ELT" (new) these days, to distinguish
         | from "ETL" (old).
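
The load-then-transform order described above can be sketched in a few lines. This is a toy illustration only: `sqlite3` stands in for a real warehouse such as Snowflake, and the table names, columns, and sample records are made up.

```python
import sqlite3

# sqlite3 stands in for the warehouse; all names and rows are made up.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)

# Extract + Load: copy the source records into the warehouse untouched.
source_records = [
    (1, 1250, "paid"),
    (2, 400, "refunded"),
]
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_records)

# Transform: build the derived table inside the warehouse with SQL.
# The raw table is left in place, so it can be re-transformed later.
warehouse.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

paid = warehouse.execute("SELECT id, amount FROM paid_orders").fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(paid, raw_count)
```

In an ETL pipeline the filter on `status` would run before the load, and the refunded order would never reach the warehouse. Here it is still available in `raw_orders` for any later analysis.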
         | llampx wrote:
         | ETL and ELT are similar but separate things. ELT has been
         | popular for a few years now, and the data lake approach as well
         | as cheap storage has solidified it as the current preferred way
         | to do a data warehouse.
         | cgardens wrote:
         | If you are interested, John (one of the co-founders) wrote an
         | article about how we are imagining ELT evolving.
         | https://airbyte.io/articles/data-engineering-thoughts/why-th...
       | satyrnein wrote:
       | I notice dbt integration on your roadmap. As a current Stitch and
       | dbt Cloud customer, can you give some insight into what you're
       | considering there?
         | mtricot wrote:
         | Yes, we love DBT too!
         | It's great for handling transformations and since we want to
         | focus on the EL part, we think there's good synergy there.
         | Airbyte is already using the DBT CLI internally and as we
         | provide more transformations of the data during syncs, we'll
         | make it easier to give a better integration with DBT projects
         | downstream:
         | - Native Transformations as part of sync process: for example,
         | schema migrations for source data changes, un-nesting complex
         | objects columns (from APIs), etc
         | - Customizable models to override or extend further Airbyte's
         | proposed transformations to be executed in the same sync
         | pipeline
         | - Seamless DX between custom downstream transformation and
         | transformations made by Airbyte
         | - Integration with external orchestrators (Airflow, DBT Cloud
         | jobs) with webhook triggers?
         | We're happy to hear more ideas/needs to build this roadmap
         | though!
         | You can have a look at the current state of Airbyte with DBT
         | here: https://docs.airbyte.io/tutorials/connecting-el-with-t-
         | using...
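
As an illustration of the "un-nesting complex objects columns" item in the list above, here is a minimal sketch that flattens a nested API record into flat column names. It is not how Airbyte implements its transformations (the thread says those run through the DBT CLI); the function name, the underscore separator, and the sample record are assumptions.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Turn nested objects into flat columns, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Made-up record, shaped like a typical nested API response.
nested = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Paris"}}}
flat_row = flatten(nested)
print(flat_row)
```

Lists are left as-is in this sketch; a real pipeline would also have to decide whether arrays become child tables or serialized columns.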
         | [deleted]
       (page generated 2021-01-26 23:00 UTC)