[HN Gopher] Using an ETL framework vs. writing yet another ETL s... ___________________________________________________________________ Using an ETL framework vs. writing yet another ETL script Author : Arimbr Score : 81 points Date : 2021-12-17 16:03 UTC (6 hours ago) (HTM) web link (airbyte.io) (TXT) w3m dump (airbyte.io) | johanneskanybal wrote: | Well this is just an ad. At a glance it looks like an Airflow | knock-off. | hadlock wrote: | Reminds me of the xkcd "how standards proliferate" | | Arguably you could use Kubernetes as a scheduler, or "ETL | framework/kit": it supports cron jobs, has a RESTful API, local | and remote secrets storage, native support for cloud storage, | support for multiple logging solutions, distributed workloads, | etc. | | Years ago I worked for a financial services company and they | would run their batch ETL jobs via a product called Tidal that | later got bought by Cisco. I really liked using Tidal for the | longest time, but 100% of what Tidal does, you can replicate with | the scheduler features of Kubernetes. | kartoshechka wrote: | I just got out of a job where we were working on a legacy ETL | "script" in Elixir, and terrible code architecture decisions | aside, I think the pattern where you have to launch and monitor | long-lasting jobs is a breeze in BEAM. You just spawn one process | for each job, report back via mailbox and monitor via Supervisor. | Unfortunately, making changes to that system where all sources | were hardcoded was, to say the least, abysmal, but the job running | core was quite elegant. | | Hopefully Elixir and other BEAM-compiled languages will gain | enough traction; I can't imagine rewriting something that's | available in Erlang out of the box in OOP languages with mutable | objects. | chasers wrote: | Definitely feels like the BEAM was built for this kind of work. | at_a_remove wrote: | I am just an amateur, but gosh I love doing ETL.
There's | something about the specification of the data you're supposed to | be getting, and then the harsh reality of the filthy trash you | will actually be fed, and making something that can carefully | examine each of the assumptions in the spec to test for how it | won't go right, making tests for things that will, you are told, | "never happen" (only to get an email from your program three | years later that this has, in fact, happened), interpolating data | where you can, and in general making a "simple load process" into | one of those grinders people can feed cows and tungsten carbide | rods into. | | I feel like Data, mentioning that he just ... loves scanning for | life forms. | eatonphil wrote: | I had this same kind of impression writing product/business | reports as an engineering manager at Oracle and elsewhere. But my | way of solving it was to build a smarter IDE that helps you do | this stuff without ETL. | | You shouldn't need to waste time every time you build a report | figuring out again how to query Postgres/MySQL/Elastic in Python | and how to generate graphs and where to host it or how to set up | recurring alerts on changes. The only part you should care about | is the actual queries to write, the joins to do between datasets, | and the fields to graph. The actual integration code (connecting | to the database, copying a file from a remote server, etc.) | should be handled for you. | | The tool I'm building to solve this is open source if you're | curious! | | https://github.com/multiprocessio/datastation | elchief wrote: | I gave up on all these frameworks and just use Cron and Make and | Python | johanneskanybal wrote: | Depends on what you need. It's the classic trap of thinking you | need a bigger tool than you do. I can imagine that setup working | great for many cases. | Cwizard wrote: | Do you use some kind of library to help with transformations? | If you have some volume, using plain Python can become a | bottleneck, no?
Any tips in this area would be appreciated. | WhyCause wrote: | I'm not the person you were asking, but I, too, use mostly | Python for ETL tasks as a good chunk of my job. | | Python's speed, or lack thereof, is rarely an issue. Most | tasks that I write/run finish in less than a minute, | including the time for transformations. Some of those can get | fairly hairy, as well. Most of the time is spent querying the | DB and uploading to the remote. | | For the tasks where the transformations are too complex, you | can greatly decrease the time necessary by tweaking the | queries (_i.e._, limiting rows to just changed rows, _etc._). | And frankly, once you realize most data doesn't need to | update in real time, it doesn't matter how long the | transformation step takes (as long as it's fewer than about | 23 1/2 hours). | elchief wrote: | pyspark if big-ol' data sets | | pandas otherwise | MichaelRazum wrote: | Please correct me if I'm doing it wrong: | | Scripts are collecting data from external sources. Scripts are | inserting data into the DB. Other scripts are querying the DB and | adding data to other tables. | | Sure, I mean, it was mentioned that data can be missing and so on. | But in a mature project you basically already have some | monitoring, so you could just use the existing solution instead of | a new framework. | | Is there something wrong with such an approach? | jumpman500 wrote: | I've always found ETL frameworks to have their own problems. They | seem great on paper but usually they don't account for a specific | source system, APIs, applications, data size, data distribution | or scheduling situations. If your project is using it then | developers end up hacking the frameworks instead of writing | simple code that does the specific thing they need to do. | | Before you know it you have super long and super inefficient code | just to fit the framework.
It takes about the same time to read | and understand an ETL framework as it does to write your own | python/bash script, and at least with your own code it's easier | to see bottlenecks. | mcguire wrote: | Couple that with the way ETL frameworks quickly become | undocumented, featuriferous messes that are opaque to anyone who | isn't deeply embedded into the framework, yeah. | aarreedd wrote: | I agree. I've used AWS Data Pipelines for some jobs but there | is a steep learning curve. It is good for launching servers on | demand to run your ETL jobs if you need that. | | The best solution I have found is writing ETL scripts in | Laravel which I use for most projects anyway. The framework has | built-in scheduling and error reporting. | dillondoyle wrote: | Not sure if this provides any insight or value. | | But I had this same experience. | | First example was connecting Iterable - which looks like | Airbyte supports - to BigQuery. | | In the past I had someone help me set up Snowflake, which was too | complicated for me to maintain/understand myself; AWS especially | is too complicated for me compared to the simpler Google Cloud. | | Have also tried Stitch and Fivetran at different times. Mostly | to try to save time setting up non-webhook syncs from FB | marketing, Front. The volume of Iterable data would be | prohibitively expensive for us on those as paid | platforms. | | In the end I was able to do FB Marketing myself in less than 1k | lines of Python modified from a script I found on GitHub, which | used Google Cloud Scheduler & Functions. I don't know Python so | that felt somewhat satisfying. | | Another nuance in favor of a hosted/paid platform is that it | looks like Airbyte uses an API lookup sync instead of webhooks. | That lets Airbyte get some more metadata to join to that I | don't collect. That's valuable! | | For Iterable I ended up making a GAE app to accept incoming | webhook data -> push to Pub/Sub -> pushes to a function -> which | writes to BigQuery.
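The webhook -> Pub/Sub -> function -> BigQuery relay described above might be sketched as below; the event shape (`email`, `eventName`) and the injectable `publish` callback are hypothetical stand-ins, not Iterable's or Google's actual API:

```python
import json

def make_webhook_handler(publish):
    """Build a handler that validates an incoming webhook payload and
    forwards each usable event to a publish callback.  In production the
    callback would wrap a Pub/Sub publisher client; here it is injected
    so the relay logic stays testable on its own."""
    def handle(body: bytes) -> int:
        events = json.loads(body)
        if isinstance(events, dict):  # a single event or a batch may arrive
            events = [events]
        forwarded = 0
        for event in events:
            if "email" not in event:  # drop malformed rows before they reach BQ
                continue
            publish(json.dumps(event).encode("utf-8"))
            forwarded += 1
        return forwarded
    return handle

# Stub publisher standing in for a real Pub/Sub client's publish call.
queued = []
handler = make_webhook_handler(queued.append)
count = handler(b'[{"email": "a@b.co", "eventName": "open"}, {"bad": 1}]')
```

Keeping the publish step behind a callback is one way to absorb the burstiness mentioned below: the handler stays cheap, and Pub/Sub buffers the spike before BigQuery sees it.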
| | The latency for BQ writes was too much to try and do it all at | once and I don't think Iterable does webhook retries. Also | Iterable is MEGA bursty; I've seen our GAE scale up to | sometimes 40+ instances within minutes after we hit send on a | campaign. That was the hardest problem to figure out: getting | the latency down for cold starts and scaling. Cloud Functions | didn't work. It's not perfect but it's good enough for our | needs. The simpler FB function grabs data 100% correctly each day, | which feels good. Last I talked to some of the paid ETL vendors it | was a flat $500 minimum a month; not worth it. | | From learning all this I've been able to reuse this GAE, | Pub/Sub, function, BQ/Spanner pattern for other stuff I build | and it has saved a lot of time and headache. | ibains wrote: | Prophecy.io lets you create visual components from any Spark | function. Same with Airflow. So you can use standard components | (built-in or your new ones) without being restricted. | | Founder here - we're working to solve this exact problem. | wobblykiwi wrote: | What's the difference between Prophecy and the multitude of | other ETL tools out there, like StreamSets, Talend, Ab | Initio, and plenty more? | ibains wrote: | We're very different from the ETL tools in that we're bringing | software development best practices to data. | | When you do visual drag and drop - Prophecy is generating | high-quality code on git that is 100% open source (Spark, | Airflow), you have tests and CI/CD - so you're visually | doing solid data engineering. | | You can toggle between code and visual - so if you change | the code (some), the visual graph updates - so small edits | directly to git don't break the visual layer. | | All visual components are generated from a spec - think a | Spark function with some more info. So the data platform | teams will create their own library/framework and roll it | out to the wider teams.
How it works is that in the visual | editor, you start with the standard Spark library, but can load | visual components for delta, or encryption or data quality. | | Our customers are typically fed up with these ETL tools and | moving to us. We can also import the ETL formats (AbInitio, | Informatica, ...) in an automated way (we reverse- | engineered their formats and created source-to-source | compilers) | switch007 wrote: | Indeed. A classic one is dealing with oauth2... | | Airbyte docs: | | > Note that the OAuth2Authenticator currently only supports | refresh tokens and not the full OAuth2.0 loop. | sherifnada wrote: | I think this is saying that particular class expects to | receive a refresh token as input. The "full oauth loop" means | the UI needs to produce a refresh token via user consent in | the browser. | huetius wrote: | I started writing, and never completed, a dead simple ETL | framework that left most of the work up to the programmer. It | was basically just an HTTP server that you could hit with cron | jobs, a DAG data structure for mapping the process, and some | annotations that you could use to specify that a node/program | step was concurrency-safe, blocking, etc. You're entirely right | that there's way more to ETL than meets the eye, but this still | makes me want to dig it back up. | travisgriggs wrote: | ETL was actually a new acronym for me: Extract, Transform, Load. | | https://en.m.wikipedia.org/wiki/Extract,_transform,_load | mcguire wrote: | Welcome to the enterprise! :-) | elchief wrote: | It really stands for Excel Taking Longtime | cheriot wrote: | Can any of these ETL frameworks kick off an ETL script without a | rewrite? Something that would handle scheduling, retries, emit | metrics around those actions, but let me use my own tools for the | actual data manipulation. | jasonpbecker wrote: | We use Jenkins for essentially this right now, and it works | fine (especially if you have someone who knows Jenkins | already).
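The wrap-an-existing-script approach asked about above (retries, backoff, and metrics around an untouched script) can be sketched in a few lines of Python; the command, retry counts, and metric line format here are illustrative, not any particular framework's API:

```python
import subprocess
import sys
import time

def run_with_retries(cmd, attempts=3, base_delay=1.0, emit=print):
    """Run an existing ETL script unchanged, retrying failures with
    exponential backoff and emitting a simple timing metric per attempt."""
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        result = subprocess.run(cmd)
        elapsed = time.monotonic() - start
        emit(f"etl.duration={elapsed:.2f}s attempt={attempt} rc={result.returncode}")
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))
    return False

# Cron (or Jenkins) then only has to invoke the wrapper; the wrapped
# script stays exactly as it was.
ok = run_with_retries([sys.executable, "-c", "raise SystemExit(0)"])
```

Swapping `emit` for a StatsD or Prometheus push keeps the metrics concern out of the script itself, which is the separation the question is after.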
| johanneskanybal wrote: | One of the standard patterns is Airflow with the Kubernetes | executor, and then the pod can do whatever. There's the occasional | side effect and some setup but more or less it works as advertised. | timwis wrote: | Celery? | 0xbadcafebee wrote: | The work companies put into ETL is absolutely bizarre from a | business standpoint. | | Say you want to build a new kind of hammer. Normally what you do | is you pay a company for access to some facility that has a forging | process, and you put your effort into designing the hammer, | working with the facility to get it forged the way you want, and | selling it. | | Building ETL pipelines from scratch is like building an iron ore | mining facility, building diggers, trucks, shipping containers, | trains, boats, iron smelting and forging equipment, and | warehouses, on top of designing and selling the hammer. People | today are so brainwashed by the idea that they need to be writing | software that they go to these extreme lengths to build | everything from scratch, when they should almost always be paying | somebody else to do most of it, and only spend time working on | their business's actual problem. Building factories and logistics | chains should not be a problem your business needs to solve if | all you need to do is build and sell a hammer. | davinchia wrote: | Airbyte Engineer here. | | I think some of the points made here about ETL scripts being just | 'ETL scripts' are very relevant. Definitely been on the other | side of the table arguing for a quick 3-hour script. | | Having written plenty of ETL scripts - in Java with Hadoop/Spark, | Python with Airflow and pure Bash - that later morphed into tech | debt monsters, I think many people underestimate how quickly | these can snowball into proper products with actual | requirements.
| | Unless one is extremely confident an ETL script will remain a | non-critical nice-to-have part of the stack, I believe evaluating | and adopting a good ETL framework, especially one with pre-built | integrations, is a good case of 'sharpening the axe before cutting | the tree' and well worth the time. | | We've been very careful to minimise Airbyte's learning curve. | Starting up Airbyte is as easy as checking out the git repo and | running 'docker compose up'. A UI allows users to select, | configure and schedule jobs from a list of 120+ supported | connectors. It's not uncommon to see users successfully using | Airbyte within tens of minutes. | | If a connector is not supported, we offer a Python CDK that lets | anyone develop their own connectors in a matter of hours. We have | a commitment to supporting community-contributed connectors so | there is no worry about contributions going to waste. | | Everything is open source, so anyone is free to dive as deep as | they need or want to. | | We also build in the open and have single-digit hour Slack | response time on weekdays. Do check us out - | https://github.com/airbytehq/airbyte! | 0xbadcafebee wrote: | _"Definitely been on the other side of the table arguing for a | quick 3-hour script."_ | | But that's probably the time it took to write it, right? 80% of | the cost of software is in maintenance, so there's another 12 | hours' worth of maintenance left to account for. If you know | you're going to spend 15 hours on it, then you might as well | use a system you know will cost less to extend or scale over | time. | | _"We've been very careful to minimise Airbyte's learning | curve."_ | | That's good for quickly onboarding new customers, but not | necessarily for the system to be scalable or extensible. | | _"Starting up Airbyte is as easy as checking out the git repo | and running 'docker compose up'."_ | | I'm always curious about this.
Docker-compose doesn't run on | more than a single host, unless you're using it with the AWS | ECS integration (and maybe Swarm?). So sure, the developer can | "get it running" quickly to look at it, but to actually deploy | something to production they'll have to rewrite the | docker-compose thing in something else. If you provide them with | Terraform modules or a Helm chart, that would get them into | production faster. And maybe even a canned CI/CD pipeline in a | container so they can start iterating on it immediately. It's | more work for your company, but it reduces the friction for | the developers to get to production, and enables businesses to | start using your product in production _immediately_, which I | think is a pretty big differentiator of business value. | unklefolk wrote: | Of course, the flip side is that sometimes that initial step of: | | > "We are doing a prototype on our new mapping product, we need | to write a one-off script to pull in some map data." | | ... is just that - a one-off script. And it can prove to be a lot | quicker to write a one-off script than getting involved with an | ETL framework. I am not arguing against ETL frameworks (Airbyte | etc). Just that over-engineering carries its own costs, just like | under-engineering does. | devnull255 wrote: | At our company, we actually built ETL-framework-agnostic | wrappers, monitoring, logging and scheduling tooling around the | four different ETL products we used: Microfocus COBOL, Torrent | Orchestrate, Datastage (which incorporated Torrent) and Abinitio. | The wrappers invoked the ETL command, reformatted and consolidated | logs. For scheduling, we relied mostly on CA Autosys, instead of | whatever scheduling mechanisms came with the ETL product. | | We found this approach made it easier to transition from one | product to another.
It was consistently faster to plug the ETL | product into the supporting framework than to implement | everything a new ETL product offered. | | As we move from our on-prem environment to the cloud, we hope we | can implement a similar strategy even if we have to switch the | support frameworks. | cgardens wrote: | Hiya, I'm the original author. tl;dr for those deciding whether | or not to read it: | | If you are thinking about build versus buy for your ETL solution, | in this day and age, there are enough great tools out there where | buy is almost always the right option. You may think you can | write a "simple" little ETL script to solve your problem, but | invariably it grows into a monster that will be a reliability | liability and engineering time suck. The post goes into more | depth on why that is. Enjoy! | bob1029 wrote: | I've just been spinning up new C# console projects in VS and | pulling down Dapper to do most nasty ad-hoc things between | databases. | | I never bother to save any of these to source control because the | boilerplate is negligible and it's usually a one-time deal. | computershit wrote: | I have been working a lot in this space for the last two years | but especially in the last 6 months. I believe we're about to | enter a phase where much more elegant and less restrictive ETL | platforms or frameworks are as commonplace as modern software | CI/CD offerings. Prefect and Dagster both stand out to me as | viable replacements for Airflow. | mrbungie wrote: | We are currently using Airflow for ELTs/ETLs to ingest from | different Postgres databases to BigQuery/Google Cloud Storage. | Airbyte looks sweet for the same task and would free us from a | big effort burden, but its Postgres source only supports SELECT * | statements (i.e. you can't deselect columns). | | That's kind of a dealbreaker for us, because for security reasons | our Postgres users' permissions are granularly configured with | column-based security.
I hope the Airbyte team solves this | eventually because the software is looking great. | lloydatkinson wrote: | That's crazy, why would it have the approach of "select all | columns" when you might never need all of them? Could you | create a view for it to select from instead? | | Either way, if someone told me an ETL supports only literally all | the columns or nothing then I'd assume it was an MVP product to | demonstrate the idea... not a production system. | davinchia wrote: | We are working on it. This should be out next quarter! | lixtra wrote: | It's strange to advertise a framework that doesn't support | more than "select *". A framework should make easy things | easy and complicated things possible. Ideally there is a | natural way to do stuff, so different people understand | each other's approach. | | The whole article sounds like you target total amateurs in | the ETL domain. Every decent ETL engineer knows all of | what is revealed after "a few months". | | That said, I always welcome new competition in the field. | Tools still suck. | jermaustin1 wrote: | I was once tasked with replacing a dying set of ETLs composed in | "ScribeSoft"; apparently the built-in scheduling and speed left | too much to be desired, and calling other jobs from inside the | job would halt the current job. Ended up replacing everything | with a C# console application that ran every 1 minute unless it | was currently running. There were a lot of bugs on both ends, but | they were tired of paying $5k/yr for the ETL to run. | | After I wrote the initial application, they handed it off to | their South African dev team to maintain it. | weego wrote: | Their standard scenario to avoid is actually a perfectly | acceptable process to grow through prior to wedging another bit of | infrastructure into your org that needs its own provisioning, | maintenance and support, redundancy planning etc.
| | In that scenario the situation is so nebulous that the original | implementers had no way to know at what point the business/use | case would end up, so why would they immediately jump to a | belt-and-braces solution. | | It's the infrastructure version of my go-to phrase: don't code | for every future. | pphysch wrote: | My favorite ETL framework is CAP. | | C - curl+jq or CLI - extract | | A - awk/jq - transform | | P - postgres/prometheus(VictoriaMetrics) - load | | Recently reduced an inherited 1000-line PHP ETL nonsense into a | CAP "1-liner" (my awk script is a dozen lines unrolled). | | Frankly, this covers a lot of ETL cases. You can do basic | cleaning in AWK (drop nulls/outliers) and format the output | however you please, and it's blazingly fast. | 89vision wrote: | Argo Workflows looks like the best of both worlds to me. You can | easily build up complex ETL/data processing DAGs where each step | is a Docker container, so you can choose the best tool for the | job. Argo has all the retry/backoff logic built into it and you | can plug the workflows into Argo Events. It runs on Kubernetes, so | you can scale your compute according to your workflow demands. | mcguire wrote: | Is this an advertisement? ___________________________________________________________________ (page generated 2021-12-17 23:00 UTC)