[HN Gopher] Using an ETL framework vs. writing yet another ETL s... ___________________________________________________________________ Using an ETL framework vs. writing yet another ETL script Author : Arimbr Score : 81 points Date : 2021-12-17 16:03 UTC (6 hours ago) (HTM) web link (airbyte.io) (TXT) w3m dump (airbyte.io) | johanneskanybal wrote: | Well this is just an ad. At a glance it looks like an Airflow | knock-off. | hadlock wrote: | Reminds me of the xkcd "how standards proliferate" | | Arguably you could use Kubernetes as a scheduler, or "ETL | framework/kit": it supports cron jobs, has a RESTful API, local | and remote secrets storage, native support for cloud storage, | support for multiple logging solutions, distributed workloads, | etc. | | Years ago I worked for a financial services company and they | would run their batch ETL jobs via a product called Tidal that | later got bought by Cisco. I really liked using Tidal for the | longest time, but 100% of what Tidal does, you can replicate with | the scheduler features of Kubernetes. | kartoshechka wrote: | I just got out of a job where we were working on a legacy ETL | "script" in Elixir, and terrible code architecture decisions | aside, I think the pattern where you have to launch and monitor | long-lasting jobs is a breeze in BEAM. You just spawn one process | for each job, report back via mailbox and monitor via Supervisor. | Unfortunately, making changes to that system where all sources | were hardcoded was, to say the least, abysmal, but the job running | core was quite elegant. | | Hopefully Elixir and other BEAM-compiled languages will gain | enough traction; I can't imagine rewriting something that's | available in Erlang out of the box in OOP languages with mutable | objects. | chasers wrote: | Definitely feels like the BEAM was built for this kind of work. | at_a_remove wrote: | I am just an amateur, but gosh I love doing ETL.
There's | something about the specification of the data you're supposed to | be getting, and then the harsh reality of the filthy trash you | will actually be fed, and making something that can carefully | examine each of the assumptions in the spec to test for how it | won't go right, making tests for things that will, you are told, | "never happen" (only to get an email from your program three | years later that this has, in fact, happened), interpolating data | where you can, and in general making a "simple load process" into | one of those grinders people can feed cows and tungsten carbide | rods into. | | I feel like Data, mentioning that he just ... loves scanning for | life forms. | eatonphil wrote: | I had this same kind of impression writing product/business | reports as an engineering manager at Oracle and elsewhere. But my | way of solving it was to build a smarter IDE that helps you do | this stuff without ETL. | | You shouldn't need to waste time every time you build a report | figuring out again how to query Postgres/MySQL/Elastic in Python | and how to generate graphs and where to host it or how to set up | recurring alerts on changes. The only part you should care about | is the actual queries to write, the joins to do between datasets, | and the fields to graph. The actual integration code (connecting | to the database, copying a file from a remote server, etc.) | should be handled for you. | | The tool I'm building to solve this is open source if you're | curious! | | https://github.com/multiprocessio/datastation | elchief wrote: | I gave up on all these frameworks and just use Cron and Make and | Python | johanneskanybal wrote: | Depends on what you need. It's the classic trap of thinking you | need a bigger tool than you do. I can imagine that setup working | great for many cases. | Cwizard wrote: | Do you use some kind of library to help with transformations? | If you have some volume, using plain Python can become a | bottleneck, no?
Any tips in this area would be appreciated. | WhyCause wrote: | I'm not the person you were asking, but I, too, use mostly | Python for ETL tasks as a good chunk of my job. | | Python's speed, or lack thereof, is rarely an issue. Most | tasks that I write/run finish in less than a minute, | including the time for transformations. Some of those can get | fairly hairy, as well. Most of the time is spent querying the | DB and uploading to the remote. | | For the tasks where the transformations are too complex, you | can greatly decrease the time necessary by tweaking the | queries (_i.e._, limiting rows to just changed rows, _etc._). | And frankly, once you realize most data doesn't need to | update in real time, it doesn't matter how long the | transformation step takes (as long as it's fewer than about | 23 1/2 hours). | elchief wrote: | pyspark if big-ol' data sets | | pandas otherwise | MichaelRazum wrote: | Please correct me if I'm doing it wrong: | | Scripts are collecting data from external sources. Scripts are | inserting data into the DB. Other scripts are querying the DB and | adding data to other tables. | | Sure, I mean, it was mentioned that data can be missing and so on. | But in a mature project you basically already have some | monitoring, so you could just use the existing solution instead of | a new framework. | | Is there something wrong with such an approach? | jumpman500 wrote: | I've always found ETL frameworks to have their own problems. They | seem great on paper but usually they don't account for a specific | source system, APIs, applications, data size, data distribution | or scheduling situations. If your project is using it then | developers end up hacking the frameworks instead of writing | simple code that does the specific thing they need to do. | | Before you know it you have super long and super inefficient code | just to fit the framework.
It takes about the same time to read | and understand an ETL framework as it does to write your own | python/bash script, and at least with your own code it's easier | to see bottlenecks. | mcguire wrote: | Couple that with the way ETL frameworks quickly become | undocumented, featuriferous messes that are opaque to anyone who | isn't deeply embedded into the framework, yeah. | aarreedd wrote: | I agree. I've used AWS Data Pipelines for some jobs but there | is a steep learning curve. It is good for launching servers on | demand to run your ETL jobs if you need that. | | The best solution I have found is writing ETL scripts in | Laravel which I use for most projects anyway. The framework has | built-in scheduling and error reporting. | dillondoyle wrote: | Not sure if this provides any insight or value. | | But I had this same experience. | | First example was connecting Iterable - which looks like | Airbyte supports - to BigQuery. | | In the past I had someone help me set up Snowflake, which was too | complicated for me to maintain/understand myself; AWS especially | is too complicated for me compared to the simpler Google Cloud. | | Have also tried Stitch and Fivetran at different times. Mostly | to try to save time setting up non-webhook syncs from FB | marketing, Front. The volume of Iterable data would be | prohibitively expensive for us on those as paid | platforms. | | In the end I was able to do FB Marketing myself in less than 1k | lines of Python modified from a script I found on GitHub, which | used Google Cloud Scheduler & Functions. I don't know Python so | that felt somewhat satisfying. | | Another nuance in favor of a hosted/paid platform is that it | looks like Airbyte uses an API lookup sync instead of webhooks. | That lets Airbyte get some more metadata to join to that I | don't collect. That's valuable! | | For Iterable I ended up making a GAE app to accept incoming | webhook data -> push to Pub/Sub -> pushes to a function -> which | writes to BigQuery.
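The webhook -> Pub/Sub -> function -> BigQuery relay described above might be sketched as below; the event shape (`email`, `eventName`) and the injectable `publish` callback are hypothetical stand-ins, not Iterable's or Google's actual API:

```python
import json

def make_webhook_handler(publish):
    """Build a handler that validates an incoming webhook payload and
    forwards each usable event to a publish callback.  In production the
    callback would wrap a Pub/Sub publisher client; here it is injected
    so the relay logic stays testable on its own."""
    def handle(body: bytes) -> int:
        events = json.loads(body)
        if isinstance(events, dict):  # a single event or a batch may arrive
            events = [events]
        forwarded = 0
        for event in events:
            if "email" not in event:  # drop malformed rows before they reach BQ
                continue
            publish(json.dumps(event).encode("utf-8"))
            forwarded += 1
        return forwarded
    return handle

# Stub publisher standing in for a real Pub/Sub client's publish call.
queued = []
handler = make_webhook_handler(queued.append)
count = handler(b'[{"email": "a@b.co", "eventName": "open"}, {"bad": 1}]')
```

Keeping the publish step behind a callback is one way to absorb the burstiness mentioned below: the handler stays cheap, and Pub/Sub buffers the spike before BigQuery sees it.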
| | The latency for BQ writes was too much to try and do it all at | once and I don't think Iterable does webhook retries. Also | Iterable is MEGA bursty; I've seen our GAE scale up to | sometimes 40+ instances within minutes after we hit send on a | campaign. That was the hardest problem to figure out: getting | the latency down for cold starts and scaling. Cloud Functions | didn't work. It's not perfect but it's good enough for our | needs. The simpler FB function grabs data 100% correctly each day, | which feels good. Last I talked to some of the paid ETL vendors it | was a flat $500 minimum a month; not worth it. | | From learning all this I've been able to reuse this GAE, | Pub/Sub, function, BQ/Spanner pattern for other stuff I build | and it has saved a lot of time and headache. | ibains wrote: | Prophecy.io lets you create visual components from any Spark | function. Same with Airflow. So you can use standard components | (built-in or your new ones) without being restricted. | | Founder here - we're working to solve this exact problem. | wobblykiwi wrote: | What's the difference between Prophecy and the multitude of | other ETL tools out there, like StreamSets, Talend, Ab | Initio, and plenty more? | ibains wrote: | We're very different from the ETL tools in that we're bringing | software development best practices to data. | | When you do visual drag and drop - Prophecy is generating | high-quality code on git that is 100% open source (Spark, | Airflow), you have tests and CI/CD - so you're visually | doing solid data engineering. | | You can toggle between code and visual - so if you change | the code (some), the visual graph updates - so small edits | directly to git don't break the visual layer. | | All visual components are generated from a spec - think a | Spark function with some more info. So the data platform | teams will create their own library/framework and roll it | out to the wider teams.
How it works is that in the visual | editor, you start with the standard Spark library, but can load | visual components for delta, or encryption or data quality. | | Our customers are typically fed up with these ETL tools and | moving to us. We can also import the ETL formats (AbInitio, | Informatica, ...) in an automated way (we reverse- | engineered their formats and created source-to-source | compilers) | switch007 wrote: | Indeed. A classic one is dealing with oauth2... | | Airbyte docs: | | > Note that the OAuth2Authenticator currently only supports | refresh tokens and not the full OAuth2.0 loop. | sherifnada wrote: | I think this is saying that particular class expects to | receive a refresh token as input. The "full oauth loop" means | the UI needs to produce a refresh token via user consent in | the browser. | huetius wrote: | I started writing, and never completed, a dead simple ETL | framework that left most of the work up to the programmer. It | was basically just an HTTP server that you could hit with cron | jobs, a DAG data structure for mapping the process, and some | annotations that you could use to specify that a node/program | step was concurrency-safe, blocking, etc. You're entirely right | that there's way more to ETL than meets the eye, but this still | makes me want to dig it back up. | travisgriggs wrote: | ETL was actually a new acronym for me: Extract, Transform, Load. | | https://en.m.wikipedia.org/wiki/Extract,_transform,_load | mcguire wrote: | Welcome to the enterprise! :-) | elchief wrote: | It really stands for Excel Taking Longtime | cheriot wrote: | Can any of these ETL frameworks kick off an ETL script without a | rewrite? Something that would handle scheduling, retries, emit | metrics around those actions, but let me use my own tools for the | actual data manipulation. | jasonpbecker wrote: | We use Jenkins for essentially this right now, and it works | fine (especially if you have someone who knows Jenkins | already).
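The wrap-an-existing-script approach asked about above (retries, backoff, and metrics around an untouched script) can be sketched in a few lines of Python; the command, retry counts, and metric line format here are illustrative, not any particular framework's API:

```python
import subprocess
import sys
import time

def run_with_retries(cmd, attempts=3, base_delay=1.0, emit=print):
    """Run an existing ETL script unchanged, retrying failures with
    exponential backoff and emitting a simple timing metric per attempt."""
    for attempt in range(1, attempts + 1):
        start = time.monotonic()
        result = subprocess.run(cmd)
        elapsed = time.monotonic() - start
        emit(f"etl.duration={elapsed:.2f}s attempt={attempt} rc={result.returncode}")
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(base_delay * 2 ** (attempt - 1))
    return False

# Cron (or Jenkins) then only has to invoke the wrapper; the wrapped
# script stays exactly as it was.
ok = run_with_retries([sys.executable, "-c", "raise SystemExit(0)"])
```

Swapping `emit` for a StatsD or Prometheus push keeps the metrics concern out of the script itself, which is the separation the question is after.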
| johanneskanybal wrote: | One of the standard patterns is Airflow with the Kubernetes | executor, and then the pod can do whatever. There's the occasional | side effect and some setup but more or less it works as advertised. | timwis wrote: | Celery? | 0xbadcafebee wrote: | The work companies put into ETL is absolutely bizarre from a | business standpoint. | | Say you want to build a new kind of hammer. Normally what you do | is you pay a company for access to some facility that has a forging | process, and you put your effort into designing the hammer, | working with the facility to get it forged the way you want, and | selling it. | | Building ETL pipelines from scratch is like building an iron ore | mining facility, building diggers, trucks, shipping containers, | trains, boats, iron smelting and forging equipment, and | warehouses, on top of designing and selling the hammer. People | today are so brainwashed by the idea that they need to be writing | software that they go to these extreme lengths to build | everything from scratch, when they should almost always be paying | somebody else to do most of it, and only spend time working on | their business's actual problem. Building factories and logistics | chains should not be a problem your business needs to solve if | all you need to do is build and sell a hammer. | davinchia wrote: | Airbyte Engineer here. | | I think some of the points made here about ETL scripts being just | 'ETL scripts' are very relevant. Definitely been on the other | side of the table arguing for a quick 3-hour script. | | Having written plenty of ETL scripts - in Java with Hadoop/Spark, | Python with Airflow and pure Bash - that later morphed into tech | debt monsters, I think many people underestimate how quickly | these can snowball into proper products with actual | requirements.
| | Unless one is extremely confident an ETL script will remain a | non-critical nice-to-have part of the stack, I believe evaluating | and adopting a good ETL framework, especially one with pre-built | integrations, is a good case of 'sharpening the axe before cutting | the tree' and well worth the time. | | We've been very careful to minimise Airbyte's learning curve. | Starting up Airbyte is as easy as checking out the git repo and | running 'docker compose up'. A UI allows users to select, | configure and schedule jobs from a list of 120+ supported | connectors. It's not uncommon to see users successfully using | Airbyte within tens of minutes. | | If a connector is not supported, we offer a Python CDK that lets | anyone develop their own connectors in a matter of hours. We have | a commitment to supporting community-contributed connectors so | there is no worry about contributions going to waste. | | Everything is open source, so anyone is free to dive as deep as | they need or want to. | | We also build in the open and have single-digit hour Slack | response time on weekdays. Do check us out - | https://github.com/airbytehq/airbyte! | 0xbadcafebee wrote: | _"Definitely been on the other side of the table arguing for a | quick 3-hour script."_ | | But that's probably the time it took to write it, right? 80% of | the cost of software is in maintenance, so there's another 12 | hours' worth of maintenance left to account for. If you know | you're going to spend 15 hours on it, then you might as well | use a system you know will cost less to extend or scale over | time. | | _"We've been very careful to minimise Airbyte's learning | curve."_ | | That's good for quickly onboarding new customers, but not | necessarily for the system to be scalable or extensible. | | _"Starting up Airbyte is as easy as checking out the git repo | and running 'docker compose up'."_ | | I'm always curious about this.
Docker-compose doesn't run on | more than a single host, unless you're using it with the AWS | ECS integration (and maybe Swarm?). So sure, the developer can | "get it running" quickly to look at it, but to actually deploy | something to production they'll have to rewrite the | docker-compose thing in something else. If you provide them with | Terraform modules or a Helm chart, that would get them into | production faster. And maybe even a canned CI/CD pipeline in a | container so they can start iterating on it immediately. It's | more work for your company, but it reduces the friction for | the developers to get to production, and enables businesses to | start using your product in production _immediately_, which I | think is a pretty big differentiator of business value. | unklefolk wrote: | Of course, the flip side is that sometimes that initial step of: | | > "We are doing a prototype on our new mapping product, we need | to write a one-off script to pull in some map data." | | ... is just that - a one-off script. And it can prove to be a lot | quicker to write a one-off script than getting involved with an | ETL framework. I am not arguing against ETL frameworks (Airbyte | etc). Just that over-engineering carries its own costs, just like | under-engineering does. | devnull255 wrote: | At our company, we actually built ETL-framework-agnostic | wrappers, monitoring, logging and scheduling tooling around the | four different ETL products we used: Microfocus COBOL, Torrent | Orchestrate, Datastage (which incorporated Torrent) and Abinitio. | The wrappers invoked the ETL command, reformatted and consolidated | logs. For scheduling, we relied mostly on CA Autosys, instead of | whatever scheduling mechanisms came with the ETL product. | | We found this approach made it easier to transition from one | product to another.
It was consistently faster to plug the ETL | product into the supporting framework than to implement | everything a new ETL product offered. | | As we move from our on-prem environment to the cloud, we hope we | can implement a similar strategy even if we have to switch the | support frameworks. | cgardens wrote: | Hiya, I'm the original author. tl;dr for those deciding whether | or not to read it: | | If you are thinking about build versus buy for your ETL solution, | in this day and age, there are enough great tools out there where | buy is almost always the right option. You may think you can | write a "simple" little ETL script to solve your problem, but | invariably it grows into a monster that will be a reliability | liability and engineering time suck. The post goes into more | depth on why that is. Enjoy! | bob1029 wrote: | I've just been spinning up new C# console projects in VS and | pulling down Dapper to do most nasty ad-hoc things between | databases. | | I never bother to save any of these to source control because the | boilerplate is negligible and it's usually a one-time deal. | computershit wrote: | I have been working a lot in this space for the last two years | but especially in the last 6 months. I believe we're about to | enter a phase where much more elegant and less restrictive ETL | platforms or frameworks are as commonplace as modern software | CI/CD offerings. Prefect and Dagster both stand out to me as | viable replacements for Airflow. | mrbungie wrote: | We are currently using Airflow for ELTs/ETLs to ingest from | different Postgres databases to BigQuery/Google Cloud Storage. | Airbyte looks sweet for the same task and would free us from a | big effort burden, but its Postgres source only supports SELECT * | statements (i.e. you can't deselect columns). | | That's kind of a dealbreaker for us, because for security reasons | our Postgres users' permissions are granularly configured with | column-based security.
I hope the Airbyte team solves this | eventually because the software is looking great. | lloydatkinson wrote: | That's crazy, why would it have the approach of "select all | columns" when you might never need all of them? Could you | create a view for it to select from instead? | | Either way, if someone told me an ETL supports only literally all | the columns or nothing then I'd assume it was an MVP product to | demonstrate the idea... not a production system. | davinchia wrote: | We are working on it. This should be out next quarter! | lixtra wrote: | It's strange to advertise a framework that doesn't support | more than "select *". A framework should make easy things | easy and complicated things possible. Ideally there is a | natural way to do stuff, so different people understand | each other's approach. | | The whole article sounds like you target total amateurs in | the ETL domain. Every decent ETL engineer knows all of | what is revealed after "a few months". | | That said, I always welcome new competition in the field. | Tools still suck. | jermaustin1 wrote: | I was once tasked with replacing a dying set of ETLs composed in | "ScribeSoft"; apparently the built-in scheduling and speed left | too much to be desired, and calling other jobs from inside the | job would halt the current job. Ended up replacing everything | with a C# console application that ran every 1 minute unless it | was currently running. There were a lot of bugs on both ends, but | they were tired of paying $5k/yr for the ETL to run. | | After I wrote the initial application, they handed it off to | their South African dev team to maintain it. | weego wrote: | Their standard scenario to avoid is actually a perfectly | acceptable process to grow through prior to wedging another bit of | infrastructure into your org that needs its own provisioning, | maintenance and support, redundancy planning etc.
| | In that scenario the situation is so nebulous that the original | implementers had no way to know at what point the business/use | case would end up, so why would they immediately jump to a | belt-and-braces solution. | | It's the infrastructure version of my go-to phrase: don't code | for every future. | pphysch wrote: | My favorite ETL framework is CAP. | | C - curl+jq or CLI - extract | | A - awk/jq - transform | | P - postgres/prometheus(VictoriaMetrics) - load | | Recently reduced an inherited 1000-line PHP ETL nonsense into a | CAP "1-liner" (my awk script is a dozen lines unrolled). | | Frankly, this covers a lot of ETL cases. You can do basic | cleaning in AWK (drop nulls/outliers) and format the output | however you please, and it's blazingly fast. | 89vision wrote: | Argo Workflows looks like the best of both worlds to me. You can | easily build up complex ETL/data processing DAGs where each step | is a Docker container, so you can choose the best tool for the | job. Argo has all the retry/backoff logic built into it and you | can plug the workflows into Argo Events. It runs on Kubernetes, so | you can scale your compute according to your workflow demands. | mcguire wrote: | Is this an advertisement? ___________________________________________________________________ (page generated 2021-12-17 23:00 UTC)