[HN Gopher] Lessons learned from running Apache Airflow at scale ___________________________________________________________________ Lessons learned from running Apache Airflow at scale Author : datafan Score : 218 points Date : 2022-05-23 15:31 UTC (7 hours ago) (HTM) web link (shopify.engineering) (TXT) w3m dump (shopify.engineering) | rr808 wrote: | Surely there is a really simple distributed scheduler out there. | Do I need to write one? I.e. has dependencies, no database, | flat files, single instance but trivial to fail over to a backup. | I can even live without history or output. | 0xbadcafebee wrote: | What are you trying to do? Distributed scheduler with a single | instance? No database? Are you sure you don't just mean "a | scheduler" ala Luigi? https://github.com/spotify/luigi | | And what kind of scheduler? Again, for "a single instance" it | doesn't need to be distributed. For distributed operation, | Nomad is as simple and generic as you can get. If you need to | define a DAG, that's never going to be simple. | qkhhly wrote: | Airflow is one piece of software that I hate very much, | especially the aspect that my job definition is intertwined with | the actual job code. If my job depends on something that | conflicts with Airflow's dependencies, it gets ugly. | | I actually like Azkaban a lot better. Of course, writing a plain | text job config could also be painful. I think ideally you could | write the job def in Python or another language, but it gets | translated to a plain text config and does not interfere with | your job code in any way. | Mayzie wrote: | The biggest pain point in Airflow I have experienced is the | horrible and completely lacking documentation. The community | support (Slack) won't (or can't) help with anything beyond basic | DAG writing. | | That sore point makes running and using the software needlessly | frustrating, and honestly I won't ever be using it again because | of it. | 8589934591 wrote: | I agree with this.
The Slack is just the core developers | discussing further development and tickets. The documentation | is lacking big time. The only response to this is to raise PRs | to improve the docs. | idomi wrote: | Make sure to check out Ploomber; our support is seamless, there are tons of | docs (https://docs.ploomber.io/), and we take our users | seriously. P.S. We integrate with Airflow and other | orchestrators if you still need to tackle those. | pid-1 wrote: | Agreed, I just would like to add that the documentation got a lot | better in the past couple of years. | artwr wrote: | Can I ask more about your use case that you could not find an | answer for? | jlaneve wrote: | That's one of the things we're working on at Astronomer - check | out the Astronomer Registry! registry.astronomer.io | higeorge13 wrote: | I am wondering why they still use the Celery executor while the | Kubernetes executor is the go-to one for large deployments. I | have used the Celery executor, had so many issues and stuck | tasks in the past, and frequently had to fine-tune the Celery | configuration in the Airflow config. | mcqueenjordan wrote: | I think Airflow ends up creating as many problems as it solves | and kind of warps future development patterns/designs into its | black hole when it wouldn't otherwise be the natural choice. | There's the sort of promise of network effects -- "well of course | it's better if /everything/ is represented and executed within | the DAG of DAGs, right?" -- but it ends up being the case that | the inherent problems it creates plus the externalities of using | Airflow for the wrong use cases start to compound, especially as | the org grows.
| | I think it slowly ends up being sort of isomorphic to the set of | problems that sharing database access across service and | ownership boundaries has, and my view is increasingly of the | "convince me this can't be an RPC call, please" camp, and when it | really can't (for throughput reasons, for example), "ok, how | about this big S3 bucket as the interface, with object | notification on writes?" | trumpeta wrote: | We operate a (small?) Airflow instance with ~20 DAGs, but one of | those DAGs has ~1k tasks. It runs on a k8s/AWS setup with a MySQL | database backing it. | | We package all the code in 1-2 different Docker images and then | create the DAG. We've faced many issues (logs out of order, | missing, random race conditions, random task failures, etc.) | | But what annoys me the most is that for that 1 big DAG, the UI is | completely useless: the tree view has insane duplication, the graph view | is super slow and hard to navigate, and answering basic | questions like what exactly failed and what nodes are around it | is not easy. | artwr wrote: | At Airbnb, we were using SubDAGs to try to manage a large number | of tasks in a single DAG. This allowed organizing tasks and | drilling down into failures more easily but came with its own | challenges. | | In more recent versions of Airflow, TaskGroups | (https://airflow.apache.org/docs/apache- | airflow/stable/concep..., | https://www.astronomer.io/guides/task-groups/ ) were made to | help this a little bit. Hopefully that helps. | | At ~1k nodes in the graph, introspection becomes hard anyway; as | others have suggested, breaking it down if possible might be a | good idea. | mywittyname wrote: | Also, the @task annotation provides no facilities to name | tasks. So if you like to build reusable tasks (as I do), you | end up with my_generic_task__1, my_generic_task__2, | my_generic_task__n. I've tried a few hacks to dynamically | rename these, but I just ended up bringing down my entire | staging cluster.
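[Editor's note] The auto-numbered names described above come from the task id being deduplicated with a counter suffix. A standalone sketch of that behavior (this is illustrative pure Python, not Airflow's actual implementation):

```python
from collections import defaultdict

def make_unique_id(seen_counts, task_id):
    """Return task_id unchanged on first use, then task_id__1, task_id__2, ...

    Mirrors the suffixing behavior described in the thread; an
    illustrative standalone sketch, not Airflow's internal code."""
    n = seen_counts[task_id]
    seen_counts[task_id] += 1
    return task_id if n == 0 else f"{task_id}__{n}"

counts = defaultdict(int)
ids = [make_unique_id(counts, "my_func") for _ in range(3)]
# ids == ["my_func", "my_func__1", "my_func__2"]
```

This is why calling the same `@task`-decorated function repeatedly in one DAG yields `my_func`, `my_func__1`, `my_func__2`, and why any renaming hack has to assign ids before deduplication happens.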
| artwr wrote: | `your_task.override(task_id="your_generated_name")` not | working for you? | mywittyname wrote: | I got pretty excited when I read this response, but no, it | doesn't work. I'm not sure how this would work, since | annotated tasks return an XCom object. | | Can you point me to the documentation on this function? | It's possible I'm not using it correctly. | | I can do something like this, which works locally, but | breaks when deployed: res = | annotated_task_function(...) res.operator.task_id = | 'manually assigned task id' | flowair wrote: | @task.python(task_id="this_is_my_task_name") | | def my_func(): | | ... | mywittyname wrote: | This still has the problem that, when you call my_func | multiple times in the same DAG, the resulting tasks will | be labelled my_func, my_func__1, my_func__2, ... | suifbwish wrote: | Does this imply file metadata content can affect the access | performance of those files, even for operations that do not | directly concern the metadata? | rockostrich wrote: | We had a similar DAG that was the result of migrating a single | daily Luigi pipeline to Airflow. I started identifying isolated | branches and breaking them off with external task sensors back | to the main DAG. This worked but it's a pain in the ass. My | coworker ended up exporting the graph to graphviz and started | identifying clusters of related tasks that way. | mywittyname wrote: | I've not had the best luck with ExternalTaskSensors. There | have been some odd errors, like execution failing at 22:00:00 | every day (despite the external task running fine). | vbezhenar wrote: | Can someone enlighten me whether Apache Airflow is suitable as a | business process engine? | | We have something like orders. So people put orders into our | system, and some orders are imported from an external system. We have | somewhere around 100-1000 orders per day, I think. Each order | goes through several states.
Like CREATED, SOME_INFO_ADDED, | REVIEWED, CONCLUSION_CREATED, CONCLUSION_SENT_TO_EXTERNAL_SYSTEM | and so on. Some state changes are quick, like a few | milliseconds to call some web services; some take 5 minutes of an | operator's time; some take a few days. This logic is encoded | in our program code. We have plenty of timers, and every timer | usually transfers orders from one state to another. This is | further complicated by the fact that this processing is done via | several services, so it's not a single monolith but some kind of | service architecture. | | Our management wants something with clear monitoring, so you | can find a given task by some property values, monitor its | lifetime, check logs for every step, find out why it's failing, | etc. | | What I usually see is that Apache Airflow is used more as a cron | replacement. I've read some articles but it's still not clear | whether it could be used as a business process engine. I had some | experience with Java BPMN engines in the past, and it was not very | pleasant, but I guess time has moved on. | subsaharancoder wrote: | A friend of mine wanted an ETL (SQL Server to BQ for analysis and | dashboarding) set up and I ended up stumbling across Airflow. I | spun up two VMs on GCP, one for Airflow and the other for the | Postgres DB to store the metadata. | | - A few things I've noticed: Airflow generates a tonne of | logs that will fill up your disk quite fast. I started with 100GB | and I'm now at 500GB; granted, disk space isn't expensive, but | still, even with a few DAGs I'm surprised at how quickly it fills up. | Apparently you need a DAG to run to clear those logs, but I was | too lazy so I just purge the logs using a cron job. | | - The SQL Server operator is buggy; I filed an issue with the | Airflow team, but I had to do some hacky stuff to get it to work.
| | - Even with a few DAGs, Airflow will spike the CPU utilization of | the VM to 100% for X minutes (in my case about 15 minutes), which | is quite interesting. My tasks basically query SQL Server -> dump | to CSV (stored on GCS) -> import to BQ. | | - My DAGs execute every hour, and if Airflow is down for X hours | and I resolve the issue, it will try to run all the tasks for the | hours it was down, which isn't ideal because it will take hours to | catch up. So I've had to delete tasks and only run the most | recent ones. | | Granted, my setup is pretty simple and YMMV, but Airflow has done | what it needs to do, albeit with some pain. | veeti wrote: | FWIW if you don't need Airflow to catch up and backfill missed | tasks, you can either set catchup=False on the DAG or use a | LatestOnlyOperator. | subsaharancoder wrote: | I have catchup=False set in the DAG, but that hasn't stopped | Airflow from backfilling missed tasks; not sure why this is | the case? | awild wrote: | > Even with a few DAGs, Airflow will spike the CPU utilization | of the VM to 100% for X minutes (in my case about 15 minutes) | which is quite interesting. My tasks basically query SQL Server | -> dump to CSV (stored on GCS) -> import to BQ. | | Have you checked why that is? Airflow re-imports DAG files every few | seconds. We've had an issue where it didn't honor the | airflowignore file, making it execute our tests every few | seconds. The easy solution was to put them into the docker | ignore. | | You might also have too much logic at the root level of your DAG | files. It's recommended to not even import at root level, to make | importing faster. | | Not saying it's not an odd tool though. | vvladymyrov wrote: | Airflow was one of the best tools, with a nice UI, for running | pipelines back in 2014-2016. But nowadays engineers should be | aware of easier-to-use options and not choose Airflow | blindly as the default choice.
IMHO for 80-90% of cases an orchestration | system should not use code at all - it should be DAGs as | configuration. Airflow is popular, and teams keep choosing it for building | simple DAGs and incurring otherwise-avoidable Airflow maintenance | costs. | | Databricks orchestration pipelines and AWS Step Functions are good | examples of DAGs as configuration. | carschno wrote: | Do you have more examples of better tools, ideally open source | (unlike AWS Step Functions)? | jonpon wrote: | We at magniv.io are building an alternative. | | Our core is open source https://github.com/MagnivOrg/magniv- | core | | We can set you up with our hosted offering if you would like to poke | around! | gadflyinyoureye wrote: | Flowable is a BPMN system. You can do a lot of async calls | with it. https://www.flowable.com/open-source | | We use it for a complex pricing process that invokes 30-40 | microservices that can take up to minutes per step. | qw wrote: | Kamelets and the Karavan UI in combination with k8s and | Knative for "serverless" integrations look interesting. | antupis wrote: | We are using Prefect+dbt and I like it. Although they are | doing a huge rewrite at the moment. | [deleted] | avemg wrote: | I've used AWS Step Functions extensively over the past several | years, and give me code every day of the week over the | Step Functions JSON config. Once you get beyond a few simple | steps, it gets very hard to look at the config and understand | what's going on. Especially true when you haven't looked | at the config in a while. The DAG visualizer definitely helps, | but as soon as things get beyond the trivial I long for a | different tool. | coolsunglasses wrote: | I was until a week or two ago part of a team that built | datasets with extensive dependencies (thus, complicated DAGs). | | v1 of the system, built before I joined, was Step Functions and | the like. It gets hairy just as you say.
| | v2 I built and designed with the lead data engineer; we | called it Coriaria originally. We're hoping/planning to open | source it eventually, although it's a little wrapped around | our company's internal needs & systems. | | It chooses neither "config" strictly speaking nor "code" for | the DAG; instead the primary representation/state is all in | the PostgreSQL database, which tracks the dataset dependencies | and how each dataset is built. It's a DAG in PostgreSQL as | well. | | To make dataset creation and management easier, I also wrote | a custom Terraform provider for Coriaria. This made migrating | datasets into the new system dramatically faster. The | provider is really nice, supports `terraform import` and all | that. Currently we have it set up so that there are separate | roles/accounts that can modify an existing dataset, but | reading state only requires authentication, not | authorization. This enables one team to depend on another | team's dataset as an upstream data source for their datasets | without granting permission to modify it or creating a | potentially stale copy of the dataset. Terraform's internal | DAG representation of the resource dependencies is leveraged | because "parent_datasets" references the upstream datasets | directly, including the ones we don't build. | | We're able to depend on datasets we don't build ourselves | because the system has support for Glue catalog backends to | track and register partition availability. | | Currently, it builds most of the datasets using AWS Athena & | S3; however, this is abstracted over a single step function. | There's no DAG of step functions, it's just a convenient | wrapper for the Athena query execution. | | The system also explicitly understands dataset builds and | validations as separate steps. The dashboard makes it easy to | trace the DAG and see which datasets are blocking a dataset | build.
| | We're adding more integrations to it soon so that other ways | of kicking off dataset builds and validations are available. | | If people are interested in this I can begin lobbying for | open sourcing the system. My colleague wanted to open source | it as well. | | If all else fails, I'll rebuild it from scratch, because I don't | like the existing solutions for managing datasets. We've been | calling it a data-flow orchestration system or ETL | orchestration system; not sure which would be most meaningful | to people. | | I think the main caveat to this system is that I'm not sure | how much use it'd be for streaming data pipelines, but it | could manage the discretization of streaming into validated | partitions wherever streamed data lands. Our operating | assumptions are that you want validated datasets to drive | business decisions, not raw event data streamed in from | Kafka. Making sure the right data is located in each daily | (or hourly) partition is part of that validation. | latchkey wrote: | Why not just model the JSON as objects in (insert favorite | language) and then use that code to generate the JSON? | entropicdrifter wrote: | Ah yes, a home-made framework to generate configurations | for your framework that's supposed to make your life | easier. That way you can maintain your code that maintains | your configs that make it easier to run your code that you | have to maintain! | latchkey wrote: | Actually, yes. It allows for easier unit and integration | testing as well. The original complaint was that things | were getting hard to read and they wished there was code | for this. It seems logical to create a framework for the | JSON configuration files so that they can be easily | mocked and tested. As someone who greatly values spending | time on automated testing, it seems weird to not think of | it this way. | | A quick Google shows that others have done things like this | already...
| | [1] https://noise.getoto.net/2021/10/14/using-jsonpath- | effective... | | [2] https://aws.amazon.com/about-aws/whats- | new/2022/01/aws-step-... | | [3] https://docs.aws.amazon.com/step- | functions/latest/dg/sfn-loc... | pharmakom wrote: | This can be a much better approach than upgrading the DAG | description language to a true programming language. It | forces anything complex to happen at build time, where it | can do less damage. Plus, we can often use the same | library to do static analysis on the output. | savin-goyal wrote: | Metaflow provides a similar concept to interface with | Step Functions and Argo Workflows in Python - | https://docs.metaflow.org/going-to-production-with- | metaflow/... | [deleted] | TYPE_FASTER wrote: | AWS offers a service for managed Airflow: | https://aws.amazon.com/managed-workflows-for-apache-airflow/ | | Makes me wonder if Amazon internally was using Step | Functions, ran into issues trying to scale to larger graphs, | realized multiple teams were using Airflow, and created the | Managed Airflow service. | nomilk wrote: | > teams keep choosing it for building simple DAGs | | I am part of one such team. We were using Windows Task | Scheduler on a Windows VM to run jobs; we figured it would be a | nice idea to (dramatically) modernise and move to Airflow, but | we grossly underestimated the complexity, learning curve, and | surrounding tools it requires. In the end we (the data science | team) didn't get a single production task up and running. The | data engineers had much more success with it though, probably | because they dedicated much more time to it. | | Will look forward to trying AWS Step Functions. | commandlinefan wrote: | I tried installing Airflow locally to just play around with | it and make sense of what it's good for, and finally gave up | after a few days - the install alone is insanely complicated, | with lots of tricky hidden dependencies. | always_left wrote: | Did you try installing with Docker?
You just download | Docker, run `docker-compose up --build`, and you'll be good to | go locally (usually). | idleprocess wrote: | I can second this. We were up and running with Docker on | our dev machines in just a few minutes. A native | installation involves substantially more setup (Python, | databases, Redis and/or Rabbit, etc.). The published | docker-compose file will handle all of that for you. We | have a very small data engineering team and have been | able to move very quickly with Docker and AWS ECS (for | orchestrating containers in test and prod environments). | anonymousDan wrote: | Can anyone ELI5 the value proposition of Airflow? | lysecret wrote: | Well-written article. One question I always have when reading | such an article: is it really worth it for these kinds of | companies to run Airflow on Kubernetes? You could also run it, for | example, on AWS Batch with Spot instances. | beckingz wrote: | Running Airflow on Kubernetes has been one of the most painful | data engineering challenges I've worked on. | ricklamers wrote: | We kept hearing this from our users. We've just released our | k8s-operator-based deployment of Orchest that should give you | a good experience running an orchestration tool on k8s | without much trouble. https://github.com/orchest/orchest | | (We extended Argo; it works fantastically well, by the way!) | marcinzm wrote: | How so? Did you have any existing Kubernetes knowledge? We | found it fairly easy to deploy using the community Helm chart | (the official chart wasn't out yet). | mrbungie wrote: | Did you have any previous experience running workloads in k8s | before? | | Running the Airflow Helm chart is pretty straightforward, even with | more "complex" use cases like heterogeneous pods for different | task sizes. | kbd wrote: | I'm bullish about Dagster nowadays. Though I don't have a lot of | experience with Airflow. Figured I'd ask if anyone has switched | from Airflow to Dagster and has any comments?
| perfect_kiss wrote: | I participated in migrating around 100 fairly complicated | pipelines from Airflow to Dagster over six months in 2021. We | used the k8s launcher, so this feedback does not apply to other | launchers, e.g. Celery. | | Key takeaways, roughly: | | - Dagster's integration with k8s really shines compared to | Airflow's; it is also based on extendable Python code, so it is | easy to add custom features to the k8s launcher if needed. | | - It is super easy to scale the UI/server component horizontally, | and since DAGs were running as pods in k8s, there was no | problem scaling those as well. The scheduling component is | more complicated; e.g. built-in scheduling primitives like | sensors are not easily integrated with state-of-the-art message | queue systems. We ended up writing a custom scheduling component | that read messages from Kafka and created DAG runs via the | network API. It was like 500 lines of Python including tests, | and worked rock-solid. | | - The network API is GraphQL while Airflow's is REST; both are | really straightforward, however Dagster's felt better | designed, maybe due to tighter governance by Dagster's authors | over the design. | | - The DAG-definition Python API, e.g. solid/pipeline, or op/graph | in the newer Dagster API, is somewhat complicated compared to | Airflow's operators; however, it is easy to build a custom DSL on | top of it. One would need a custom DSL for complicated logic in | Airflow as well, and in the case of Dagster it felt easier to | generate its primitives than doing never-ending operator | combinations in the case of Airflow. | | - Unit and integration testing are much easier in Dagster; the | authors treat testing as a first-class citizen, so mocks are | supported everywhere, and code tested with the local runner is | guaranteed to execute in the same way on the k8s launcher. We never | had any problems with test environment drift.
| | The biggest caveat was the full change of internal APIs in 0.13, | which forced the team to execute a fairly complicated refactor, | due to deprecation of the features we were depending on, e.g. | execution modes. Had we spent more time on the Elementl Slack, it | would have been easier to take fewer dependencies on those features ^__^ | doom2 wrote: | At my previous employer, we were running self-hosted Airflow in | AWS, which really was a nightmare. The engineer who set it up | didn't account for any kind of scaling, and all the code was a | mess. We would also get issues like logs not syncing correctly | in our environment, or transient networking issues that somehow | didn't fail the given Airflow task. Eventually, we did a dual | migration: temporarily switching to AWS managed Airflow (their | Amazon Managed Workflows for Apache Airflow product) while also | rewriting the DAGs in Dagster. | | Dagster was a great solution for us. Their notion of software- | defined assets allowed us to track metadata of the Redshift and | Snowflake tables we were working with. Working with re-runs and | partitioned data was a breeze. It did take a while to onboard | the whole team and get things working smoothly, which was a bit | difficult because Dagster is still young and they were often | making changes to how parts of the system worked (although | nothing that was immediately backwards incompatible). | | We also enjoyed some of the out-of-the-box features like | resources and unit testing jobs. Overall, I think it made our | team focus more on our data and what we wanted to do with it, | rather than feeling like we had to wrangle with Airflow just to | get things running. | kbd wrote: | Thanks for your comment! Ditto: last time I ran Airflow | locally it took like 5 Docker containers. Then I forgot about | the project and for a while was furious at Docker for | randomly taking 100% CPU. Then I realized it was because of | the Airflow containers that would restart along with Docker.
| I didn't get much further with Airflow. | | Dagster, on the other hand, seems to let you scale from using | it locally as a library all the way to running on ECS/K8s, | etc. Along with that there's unfortunately a ton of | complexity in setting it up, but that's not Dagster's fault, | and it seems like Dagster works once you get it set up. Agree | about it being young and there being some rough spots, but | it's got lots of good ideas. We were nearly done setting it | up but got pulled off onto more urgent things, so I haven't | run it in production yet. I'm glad to hear it worked well for | you! | computershit wrote: | Dagster is extremely nice to work with. I did a bakeoff of | Prefect vs Dagster internally at my current employer, and while | we ended up going with Prefect for reasons, I am still so | impressed with the way Dagster approaches certain pain points | in the orchestration of data pipelines and its solutions for | them. | theptip wrote: | > for reasons | | I'd love to hear more on this. I've not evaluated Prefect, | and am currently keeping an eye on Dagster. On what trade-offs | does Prefect win? | 64StarFox64 wrote: | I did a baby bakeoff internally in my prior role ~18mo ago | now. Prefect felt nicer to write code in but perhaps not as | easy to find answers in the docs (though their Slack is | phenomenal). Ended up going with Prefect so I could focus | on biz/ETL logic with less boilerplate, but I'm sure | Dagster is not a bad choice either. Curious to hear about | the parent's experience. | simo7 wrote: | I think the main lesson should be not to use it, especially at | scale. | 0xbadcafebee wrote: | If you have the headcount for people just to build/support | Airflow, please do yourself a favor and give that money to | Astronomer.io. Their offering is _stupid good_. There's 20 | different reasons why paying them is a much better idea than | managing Airflow yourself (including using MWAA), and it's dirt | cheap considering what you get.
| [deleted] | pid-1 wrote: | Last time I checked, they asked for a significant minimum $ + 1 | year commitment. | | I wish they had a "start small", self-service, clear-pricing | option. | emef wrote: | We've also been running Airflow for the past 2-3 years at a | similar scale (~5000 DAGs, 100k+ task executions daily) for our | data platform. We weren't aware of a great alternative when we | started. Our DAGs are all config-driven and populate a few | different templates (e.g. ingestion = ingest > validate > publish | > scrub PII > publish), so we really don't need all the | flexibility that Airflow provides. We have had SO many headaches | operating Airflow over the years, and each time we invest in | fixing an issue I feel more and more entrenched. We've hit | scaling issues at the k8s level, scheduling overhead in Airflow, | random race conditions deep in the Airflow code, etc. Considering | we have a pretty simplified DAG structure, I wish we had gone | with a simpler, more robust/scalable solution (even if just | rolling our own scheduler) for our specific needs. | | Upgrades have been an absolute nightmare and so disruptive. The | scalability improvements in Airflow 2 were a boon for our | runtimes, since before we would often have 5-15 minutes of | overhead between task scheduling, but man, it was a bear of an | upgrade. We've since tried multiple times to upgrade past the 2.0 | release and hit issues every time, so we are just done with it. | We'll stay at 2.0 until we eventually move off Airflow | altogether. | | I stood up a Prefect deployment for a hackathon and found that | it solved a ton of the issues with Airflow (sane deployment | options, not the insane file-based polling that Airflow does). We | looked into it ~1 year ago or so; I haven't heard a lot about it | lately. I wonder if anyone has had success with it at scale.
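[Editor's note] The "config-driven DAGs populating templates" setup described above can be sketched without any orchestrator. A minimal, hypothetical illustration (the template name and step list come from the comment; the function, naming scheme, and `orders_daily` job are invented):

```python
# Hypothetical sketch of config-driven pipeline generation, modeled on the
# "templates populated from config" approach described above. Not Airflow API.
TEMPLATES = {
    # step names taken from the comment (ingest > validate > publish >
    # scrub PII > publish); indexed so the repeated "publish" stays unique
    "ingestion": ["ingest", "validate", "publish", "scrub_pii", "publish"],
}

def build_pipeline(job_name, template):
    """Expand a template into (task_name, upstream_task) pairs forming a chain."""
    steps = TEMPLATES[template]
    tasks = [f"{job_name}.{i:02d}_{s}" for i, s in enumerate(steps)]
    # each task depends on the previous one; the first has no upstream
    return list(zip(tasks, [None] + tasks[:-1]))

pipeline = build_pipeline("orders_daily", "ingestion")
# [('orders_daily.00_ingest', None),
#  ('orders_daily.01_validate', 'orders_daily.00_ingest'), ...]
```

The point of the sketch: when every DAG is a linear template instance like this, the orchestrator only needs to walk a chain, which is why a much simpler scheduler can suffice.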
| pweissbrod wrote: | If your team is comfortable writing pure Python and you're | familiar with the concept of a Makefile, you might find Luigi a | much lighter and less opinionated alternative for workflows. | | Luigi doesn't force you into using a central orchestrator for | executing and tracking the workflows. Tracking and updating | task state is left as open functions for the programmer to fill | in. | | It's probably geared toward more expert programmers who work close | to the metal and don't care about GUIs as much as high degrees | of control and flexibility. | | It's one of those frameworks where the code that is not written | is sort of a killer feature in itself. But definitely not for | everyone. | teej wrote: | It's worth noting that Luigi is no longer actively maintained | and hasn't had a major release in a year. | pyrophane wrote: | Very similar experience to yours. Adopted Airflow about 3 years | ago. Was aware of Prefect but it seemed a bit immature at the | time. Checked back in on it recently and they were approaching | alpha for what looked like a pretty substantial rewrite (now in | beta). Maybe once the dust has settled from that I'll give it | another look. | throwusawayus wrote: | The creator of Prefect was an early major Airflow committer. | Anyone know what motivated the substantial rewrite of | Prefect? I had assumed the original version of Prefect was | already supposed to fix some design issues in Airflow? | timost wrote: | I think you mean Prefect Orion/v2[0]. I'm curious too. | | [0] https://www.prefect.io/orion/ | dopamean wrote: | If you could go back and use something else instead, what would | you choose? | emef wrote: | It's a good question. I believe Airflow was probably the | right choice at the time we started. We were a small team, | and deploying Airflow was a major shortcut that more or less | handled orchestration so we could focus on other problems.
| With the aid of hindsight, we would have been better off | spinning up our own scheduler some time in the first year of | the project. Like I mentioned in my OP, we have a set of | well-defined workflows that are just templatized for | different jobs. A custom-built orchestration system that | could perform those steps in sequence and trigger downstream | workflows would not be that complicated. But this is how | software engineering goes; sometimes you take on tech debt, | and it can be hard to know when it's time to pay it off. We | did eventually get to a stable steady state, but with lots of | hair-pulling along the way. | hbarka wrote: | dbt tool. getdbt.com | mywittyname wrote: | Can dbt run arbitrary code? If it can, it's not well | advertised in the documentation. Every time I've looked | into dbt, I found that it's mostly a scheduled SQL runner. | | The primary reason we run Airflow is because it can execute | Python code natively, or other programs via Bash. It's very | rare that a DAG I write is entirely SQL-based. | hbarka wrote: | You're right. I think the strength of dbt is in the T | part of ELT. I wrote ELT to make a distinction in | principle from the traditional ETL. (E)xtract and (L)oad | is the data ingestion phase that would probably be better | served by Dagster, where you could use Python. | | (T)ransform is decoupled and would be served by set- | based operations managed by dbt. | igrayson wrote: | dbt has just opened a serious conversation about | supporting Python models. I'm sure they'd value your | viewpoint! https://github.com/dbt-labs/dbt- | core/discussions/5261 | KptMarchewa wrote: | dbt is great, but solves only a small part of what Airflow | does. | digisign wrote: | Is Airflow good for an ETL pipeline? Right now a client uses | Jenkins, but it is quite clunky and difficult to automate, though | they've managed to. Cloud is not an option.
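[Editor's note] For a small ETL with clear dependencies, like the ones asked about above, a dependency-ordered runner can be prototyped from the standard library alone before committing to Airflow. A minimal sketch (the task names and graph are invented for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Invented example dependency graph: each key runs after its listed deps.
deps = {
    "extract_orders": [],
    "extract_customers": [],
    "load_staging": ["extract_orders", "extract_customers"],
    "transform_marts": ["load_staging"],
}

def run_in_order(graph, runner):
    """Execute tasks in a dependency-respecting order.

    A toy stand-in for a scheduler: no retries, no parallelism,
    no persistence - exactly the features Airflow adds on top."""
    order = list(TopologicalSorter(graph).static_order())
    for task in order:
        runner(task)  # in real use: shell out, call a function, etc.
    return order

ran = []
order = run_in_order(deps, ran.append)
# load_staging runs after both extracts; transform_marts runs last
```

What this sketch lacks (retries, backfills, sub-graph re-runs, a UI) is roughly the checklist for deciding whether Jenkins-style scripting is enough or a DAG tool is warranted.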
| theptip wrote: | Airflow is generally brought in when you have a DAG of jobs | with many edges, and where you might want to re-run a sub- | graph, or have sub-graphs run on different cadences. | | In a simplistic ETL/ELT pipeline you can model things as | "Extract everything, then Load everything, then Transform | everything", in which case you'll add a bunch of unnecessary | complexity with Airflow. | | If you're looking for a framework to make the plumbing of ELT | itself easier, but don't need sub-graph dependency modeling, | Meltano is a good option to consider. | skrtskrt wrote: | Could anyone comment on Temporal vs Airflow? | | After having a lot of pain points with an (admittedly older and | probably not best-practices) Airflow setup, I am now at a | different job running similar types of workflows on Temporal - | we're pretty happy with it so far, but haven't done anything | crazy with it. | matesz wrote: | I know airbyte.io (elt platform) is built on top of Temporal, | but I haven't used it. | tomwheeler wrote: | Yes, Airbyte is using Temporal. Here is a blog post they | wrote a few weeks ago that goes into more detail about it: | https://airbyte.com/blog/scale-workflow-orchestration- | with-t... | Serow225 wrote: | I'd love to hear that too :) | tomwheeler wrote: | Hi, Tom from Temporal here. I don't have a lot of experience | with Apache Airflow personally, but I was at Cloudera when it | was added to our Data Engineering service, so I learned about | it at the time. Here are a few things that come to mind: | | * Both Apache Airflow and Temporal are open source | | * Both create workflows from code, but the approach is | different. With Airflow, you write some code and then | generate a DAG that Airflow can execute. With Temporal, your | code _is_ your workflow, which means you can use your | standard tools for testing, debugging, and managing your | code. | | * With Airflow, you must write Python code. 
Temporal has SDKs | for several languages, including Go, Java, TypeScript, and | PHP. The Python SDK is already in beta and there's work | underway for a .NET SDK. | | * Airflow is pretty focused on the data pipeline use case, | while Temporal is a more general solution for making code run | reliably in an unreliable world. You can certainly run data | pipeline workloads on Temporal, but those are a small | fraction of what developers are doing with Temporal (more | here: https://temporal.io/use-cases). | claytonjy wrote: | Do you see Temporal as being a super-set of DAG managers | like Airflow/Dagster/Prefect, or do you see uses where | those tools would be a better choice than Temporal? | claytonjy wrote: | I'm also curious about this. The folks I hear about Temporal | from seem to be very disjoint from Airflow users, and | Temporal's python client is still alpha-stage. | | It seems notable to me that the big Prefect rewrite mentioned | elsewhere [0] leans into the same "workflow" terminology that | Temporal uses. I have to wonder if Prefect saw Temporal as | superseding the DAG tools in coming years and this is them | trying to head that off. | | That post's discussion of DAG vs workflow also sounds a _lot_ | like why PyTorch was created and has seen so much success. | Tensorflow was static graphs, pytorch gave us dynamism. | | [0] https://www.prefect.io/blog/announcing-prefect-orion/ | encoderer wrote: | Is anybody out there doing anything interesting with Airflow | monitoring? | | At my startup Cronitor we have an Airflow sdk[0] that makes it | pretty easy to provision monitoring for each DAG, but essentially | we are only monitoring that a DAG started on time and the total | time taken. I keep thinking about how we could improve this and | it would be great to hear about what's working well today for | monitoring.
| | [0] https://github.com/cronitorio/cronitor-airflow | rozhok wrote: | I'm working at https://databand.ai -- a full-fledged solution | for Apache Airflow monitoring, data observability and lineage. | We have airflow sync, integrations with | Spark/Databricks/EMR/Dataproc/Snowflake, configurable alerts, | dashboards, and much more. Check it out. | AtlasBarfed wrote: | ... the service seems to be centrally managed. A lot of the pain | points are clearly "everyone running in the same instance" or | kind of similar. Sure makes for big brag points in the numbers. | | Sounds like basic SaaS needs to be provided as a capability, | while the teams spin up their instances and shard to their needs. | | One of the problems with enterprise workflows is putting | everything together. Workflows are already cacophonous. A | cacophony of cacophonies is madness. | taude wrote: | Tangentially to this thread....what sites, sources, etc. are | people who work on modern data pipelines (engineering and | analysts) going to follow the latest news, products, techniques, | etc. It's been hard to keep up without having Meetups and such | the last couple years. I'm finding a lot of people's comments | here pretty interesting, and showing me things I haven't heard | of. Thanks. | kderbe wrote: | I follow the Analytics Engineering Roundup weekly email. It's | published by dbt Labs but isn't overtly promotional. | | https://roundup.getdbt.com/ | taude wrote: | Thanks. We're starting to use DBT, too. I know the forums | over at DBT are pretty good, too. | mcnnowak wrote: | I'm also interested in this topic, but can't find anything | other than "Top 10 things you should STOP doing as a data | engineer" etc. content-mill, clickbait on Medium and other | sites. | taude wrote: | Yes, this. I'd like to get less of the "Marketing sales | stuff", and more in the trenches with the actual engineering | teams.
| blakeburch wrote: | I've had really great success from engaging with the Locally | Optimistic Slack community. | | Also, Christophe Blefari has an excellent data newsletter. | https://www.blef.fr/ | | And Modern Data Stack has a newsletter, tool information, Q&A | www.moderndatastack.xyz | jonpon wrote: | Data Twitter and Linkedin are great, there are a lot of people | putting out some really good content. There are also a lot of | substacks you can sign up for. Data Engineering Weekly is my | fave | dtjohnnyb wrote: | Slack groups have filled in the meetup space in my life, | mlops.community and locally optimistic are two of the best for | what it sounds like you're looking for | tinco wrote: | What sort of workflows do you run in Apache Airflow? Are they | automating interactions with partners/clients or internal | communications? How can it become so scaled up that they (and | many people in the comments here as well) have trouble managing | the hardware? How can it become so complex that the workflows | need to be expressed in DAGs? What's a workflow? | | I don't think I ever worked anywhere that had automated | workflows, though I've only worked for small startups so far. | ldjkfkdsjnv wrote: | Unless you have extremely complex dependency graphs, I really | don't think airflow is worth it. It's very easy to end up | essentially writing an "orchestrator" using airflow, it allows | for very flexible low level operations. The added complexity has | minimal benefit, and, as with apache spark, what looks | simple becomes hard to reason about in real world scenarios. You | need to understand how it works under the hood, and get the best | practices right. | | As mentioned elsewhere, AWS step functions are really the best in | orchestration.
| arinlen wrote: | > _As mentioned elsewhere, AWS step functions are really the | best in orchestration._ | | AWS Step Functions is a proprietary service provided | exclusively by AWS, which reacts to events from AWS services | and calls AWS Lambdas. | | Unless you're already neck-deep in AWS, and are already | comfortable paying through the nose for trivial things you can | run yourself for free, it's hardly appropriate to even bring up | AWS Step Functions as a valid alternative. For instance, | Shopify's articles explicitly mention they are running their | services in Google Cloud. Would it be appropriate to tell them | to just migrate their whole services to AWS just because you | like AWS Step Functions? | Jugurtha wrote: | That was one of the reasons we do "bring your own compute" with | https://iko.ai so people who already have a billing account | on AWS, GCP, Azure, DigitalOcean, can just get the config for | their Kubernetes clusters and link them to iko.ai and their | machine learning workloads will run on whichever cluster they | select. | | If you get a good deal from one cloud provider, you can get | started quickly. | | It's useful even for individuals such as students who get | free credits from these providers: create a cluster and | you're up and running in no time. | | Our rationale was that we didn't want to be tied to one | cloud provider. | parsnips wrote: | https://github.com/checkr/states-language-cadence allows you | to define workflows in states language over cadence. | literallyWTF wrote: | This is another symptom of a person who doesn't know what | they're talking about really. | | It's like those stackoverflow answers that tell the user to | stop using PHP and rewrite it in Python or something.
| riku_iki wrote: | > already comfortable paying through the nose for trivial | things you can run yourself for free | | But a fault tolerant workflow engine is not a trivial thing, it | may cost you many engineer hours to build, monitor and | maintain, so outsourcing it to someone else is a totally | viable solution. | arinlen wrote: | > _But a fault tolerant workflow engine is not a trivial | thing,_ | | The complexity and risk of migrating cloud providers | eclipses whatever problem you assign to "fault tolerant | workflow engines". | | Any mention of AWS Step Functions makes absolutely no sense | at all and reads at best like a non-sequitur. | thorum wrote: | I read it as a comment on the UX / developer experience, | which can be superior with Step Functions vs the competition | regardless of whether Step Functions is an appropriate | (or even physically possible) option for non-AWS | projects. | nojito wrote: | I'll never understand why individuals always default to cloud | offerings when they are extremely expensive compared to a | dedicated tool. | wussboy wrote: | I don't need to manage the cloud offering and that management | time is expensive. Your befuddlement at this simple economic | calculation is, well, befuddling. | mywittyname wrote: | It's easy to get started and you don't need to worry about | infra. | | I've been a one-man army at places because leveraging these | cloud offerings allows me to crank out working software that | scales to the moon without much thought. | | I'd rather pay AWS/GCP to handle infra, so that I can get | 2-3x as many projects done. | nojito wrote: | None of these problems in airflow in this thread are due to | infrastructure so how does using a cloud service solve | anything? | benjamoon wrote: | It's easy to understand when you have lots of money, but no | time. Cloud is simple and expensive, self managed is complex | and cheap. Time's money and all that! | nojito wrote: | Except for this workflow.
| | You won't get around any of the problems of airflow by | moving to a cloud offering. | 0xbadcafebee wrote: | Why hire an expensive janitorial service to clean your | office? Why hire a mechanic to fix your car? | serial_dev wrote: | It's shocking that some people cannot fathom that in | certain scenarios cloud offerings make sense. | | They don't always make sense, in certain scenarios it is | worth taking an open source, cloud independent tool, in | some scenarios you can roll your own, but there are | circumstances where it's a good choice using a tool your | cloud provider gives you. | literallyWTF wrote: | Because they don't know what they're doing and aren't the | ones paying the bill. | | "Oh I have to learn how to use and set up this tool? I think | I'll just pay the equivalent salaries and be locked in..." | waynesonfire wrote: | You're going to pay the bill regardless, whether the | employee is hired by your team or hired by the cloud | vendor. | | I don't know how management works through this math, maybe | managing people gets exhausting and they just want to | outsource it so leadership doesn't have to deal with it and | then they can just focus on the core product. | | And the above "I don't want to deal with it" reason isn't | spoken of, the more commonly touted benefit is cloud's | "flexibility". Sure, but this is actually _really_ | expensive. Every cloud migration effort I've experienced is | only worth talking about because the | costs are based on long-term contracts of cloud resources, | not the per-hour fees. Nice flexibility. | | With that said, the cloud may be a good place for | prototyping where the infrastructure isn't the core value | add and it's uncertain. A start-up is a prototype and so | here we are. But, for an established company to migrate to | the cloud and fire the staff that's maintaining the | on-premise resources... I'm skeptical.
More than likely, this | leads to maintaining both cloud and on-premise resources, | not firing anyone, and thus, actually increasing costs for | an uncomfortably long time. | | And for the folks on the ground, who don't pay the bills, | the increase in accidental complexity is rather painful. | saimiam wrote: | I'm very much paying my own cloud bills but there is no | chance I would be able to orchestrate some of the workflows | I want to orchestrate if it were not for Step Functions. | | For a one person shop like me, AWS is a force multiplier. | With it, I can do (say) 30% of what a dedicated engineer in | a specific role could do. Without it, I'd be doing 0%. | | I really like this tradeoff for my particular situation. | [deleted] | taude wrote: | Fast time to market with a fraction of the effort? | pyrophane wrote: | > As mentioned elsewhere, AWS step functions are really the | best in orchestration. | | Why? Where else is this mentioned? | tootie wrote: | I read some of these massively complex data architecture | posts and I almost always come away asking "What the hell is | this for?" I know Shopify is a huge business but I see this | kind of engineering complexity and all I think is it has to | cost tens of millions to build and operate and how could they | possibly be getting ROI. There are ten boxes on that diagram | and none of them have a user interface for anyone except other | developers. | generalpf wrote: | A lot of times this is used for data warehousing so product | managers and others can query the database of one app | joined with another, especially in an environment with | microservices. You might join a table containing orders with | another table that was from a totally different DB, like | payments, to find out which kinds of items are best to offer | BNPL or something. | | The author also mentions that it's used for machine learning | models which will ultimately feed back into Shopify's front | end, for instance.
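A toy version of the cross-service join described above might look like the following. The table names and rows are invented for illustration; in practice this query would run in a warehouse like BigQuery or Snowflake rather than in-process sqlite.

```python
# Illustrative warehouse-style join (hypothetical schema): orders from one
# service joined to payments from another, once both land in one database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (order_id INTEGER, item TEXT);
    CREATE TABLE payments (order_id INTEGER, method TEXT);
    INSERT INTO orders   VALUES (1, 'book'), (2, 'lamp');
    INSERT INTO payments VALUES (1, 'bnpl'), (2, 'card');
""")

# Which items were bought with "buy now, pay later"?
rows = conn.execute("""
    SELECT o.item
    FROM orders o
    JOIN payments p ON p.order_id = o.order_id
    WHERE p.method = 'bnpl'
""").fetchall()
```

The point is that the join itself is trivial; the heavy machinery in the article exists to get the two tables out of separate services and into one place on a schedule.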
| thefourthchime wrote: | > AWS step functions are really the best in orchestration. | | At our company, AWS Step is a disaster. You're effectively | writing code in JSON/YAML. Anything beyond very simple steps | becomes 2 pages of YAML that's very hard to read or write. | There is no way to debug, polling is mostly unusable. Changes | need to be deployed with CF, which can take forever or, worse, | hang. | | It's one of the most annoying technologies I've used | in my 20+ years of engineering. | fmakunbound wrote: | Definitely this. | | Our teams have many 1 or 2 step DAGs that are idempotent. They | could have been lambdas, and they're already pulling from SQS. | It could be just my misfortune, but in AWS, MWAA is | kind of janky. It's difficult to track down problems in the | logs (task failures look fine there) and the Airflow UI is | randomly unavailable ("document returns empty", "connection | reset" kind of things). | mywittyname wrote: | Lambdas have resource constraints that Airflow DAGs don't. | Most notably, Airflow DAGs can run for any arbitrary length | of time. And the local storage attached to the Airflow | cluster is actual disk space, and not just a fake in-memory | disk, making it possible to process files larger than the amount | of memory allocated to the DAG. | | There's certainly some functionality overlap, but I don't see | Lambda and Airflow as competitors. Each has capabilities that | the other doesn't. | xtracto wrote: | I remember reading that you can attach EFS to lambdas. That | would solve some of the storage issues. | byteflip wrote: | So interesting, a lot of comments seem to be negative | experiences. I haven't used Airflow at scale yet but would love | to convert our extremely limited, internally built orchestrator + | jobs over to Airflow. I think it would allow us TO scale, at | least for some time. I think a lot of companies are still really | behind the times.
Our DAGs are fairly simple, and Airflow has | been a major improvement in my testing. The UI is great for | helping me debug jobs / monitoring feed health / backfilling. DAG | writing has been a bit frustrating but is a much-improved format | over the internal systems we have. Am I just naive? Is everyone | writing extremely complex graphs? Is this operational complexity | due mostly to K8s (I've just been playing with Celery)? Anyone | enjoying using Airflow? | jonpon wrote: | The problems in this article and in the comments are some of the | stuff we have heard at Magniv in the past few months when | talking to data practitioners. We are focused on solving some subset | of these problems. | | Personally, I think Airflow is currently being un-bundled and | will continue to be with more task specific tools. | | At the very least, if un-bundling doesn't occur, Prefect and | Dagster are working hard to solve lots of these issues with | Airflow. | | Evolution of products and engineering practices is not linear and | sometimes doesn't even make sense when looked at a posteriori (as | much as I would like it to follow some logical process). It will be | interesting to see how this space develops in the next year or so. | jimmytucson wrote: | I've used Airflow for a few years and here's what I don't like | about it: | | - Configuration as code. Configuration should be a way to change | an application's behavior _without_ changing the code. Make me | write a workflow as JSON or XML. If I need a for-loop, I'll write | my own script to generate the JSON. | | - It's complicated. You almost need a dedicated Airflow expert to | handle minor version upgrades or figure out why a task isn't | running when you think it should. | | - Operators often just add an API layer over top of existing | ones. For example, to start a transaction on Spanner, Google has | a Python SDK with methods to call their API.
But with Airflow, | you need to figure out what _Airflow_ operator and method wraps | the Google SDK method you're trying to call. Sometimes the | operator author makes "helpful" (opinionated) changes that | refactor or rewrite the native API. | | I would love a framework that just orchestrates tasks (defined as | a command + an image) according to a schedule, or based on the | outcome of other tasks, and gives me a UI to view those outcomes | and restart tasks, etc. And as configuration, not code! | atombender wrote: | What you're asking for is basically Argo Workflows, I think. | | Not that I recommend it. It's quite lovely in principle, but | really flawed in practice. It's YAML hell on top of Kubernetes | hell (and I say that as someone who loves Kubernetes and uses | it for everything every day). | | Having worked with some of these tools, what I've started to | wish for is a system where pipelines are written in just plain | code. I'd like to run and debug my pipeline as a normal, | compiled program that I can run on my own machine using the | tools I already use to build software, including things like | debuggers and unit testing tools. Then, when I'm ready to put | it into production, I want a super scalable scheduler to take | my program and run it across dozens of autoscaling nodes in | Kubernetes or whatever. | | The only thing I've come across that uses this model is | Temporal, but it's got a rather different execution model than | a straightforward pipeline scheduler. | ricklamers wrote: | The flexibility of code as configuration is indeed somewhat of | a footgun at times. That's why with Orchest we went with a | declarative JSON config approach. | | We take inspiration from the Kubeflow project and run tasks as | containers. With a GUI for editing pipelines and managing | scheduled runs we come pretty close to what you're asking for | (bring an image and run a command). And it's OSS, of course.
| | https://github.com/orchest/orchest | KptMarchewa wrote: | >If I need a for-loop, I'll write my own script to generate the | JSON. | | That's how you end up with an extreme mess in logs, UI and metrics. | | > But with Airflow, you need to figure out what _Airflow_ | operator and method wraps the Google SDK method you're trying | to call. | | Or you can use PythonOperator with hooks, which generally | integrate external APIs with Airflow's connection system. | | https://github.com/apache/airflow/blob/main/airflow/provider... | | I think the bigger problem with Airflow's Operator concept is | the N*M problem of integrating multiple systems. That's how you end | up with GoogleCloudStorageToS3Operator and stuff like that. | rubenfiszel wrote: | If your flow is more linear looking than a complex DAG and you | want to get a full featured webeditor (with lsp), automatic | dependency handling, typescript(deno) and python support, I am | building an OSS, self-hostable airflow/airplane alternative at: | https://github.com/windmill-labs/windmill | | You write the modules as normal python/deno scripts, we infer the | inputs by statically analyzing your script parameters and we take | care of the rest. You can also reuse modules made by the | community (building the script hub atm). | slig wrote: | Thank you! Exactly what I was looking for. | AtlasBarfed wrote: | Isn't there an Uber workflow product? Also that scales on top of | Cassandra? | tomwheeler wrote: | You're probably thinking of Temporal (https://temporal.io/), | which is a fork of the Cadence project originally developed at | Uber. | 8589934591 wrote: | I echo other comments. Running and managing Airflow beyond simple | jobs is complicated. But if you are running and managing | Airflow for simpler jobs, you might not need Airflow. | | One data center company that I know of uses airflow at scale with | docker and k8s. They have a huge team of devops just to manage | the orchestrator.
They in turn have to fine-tune the orchestrator | to run smoothly and efficiently. Similar to what shopify has | noted here, they have built on top of and extended airflow to | take care of pain points like point 4. For companies like this it | makes sense to run airflow. | | Another issue I see with companies/engineers who adopt airflow is that | they use it as a substitute for a script rather than as an orchestrator. | For example, say you want to download files from an API, upload | to s3, load it to your warehouse (say snowflake) and do some | transformations to get your final table - instead of writing | separate scripts for each step of fetch/upload/ingest/transform | and calling each step from the dag, they end up writing everything | as a task in a dag. A huge disadvantage is there is a lot of code | duplication. If you had a script as a CLI, all your dag/task has | to do is call the script with the respective args. I agree that | airflow comes with a lot of convenience wrappers to create tasks | for many things but I feel this results in losing flexibility. | | This also results in them tying their workflow to airflow: for | any change they need, they have to modify their airflow code | directly. If you want to modify how/what you upload to s3, you | end up writing/modifying python functions in the respective dags' | code. This removes the flexibility to modify/substitute any | component of the workflow with something else or even change the | orchestrator from airflow to something else. Additionally, | different teams might write workflows in different ways - | standardization of practice is really hard. This in turn results | in pouring more investment into maintaining and hiring "airflow | data engineers". Companies fall into steep tech debt. | | Prefect/dagster are new orchestrators in town. I'm yet to try | them out but I've heard mixed reviews about them. | | EDIT: Forgot about upgrades. A lot of upgrades are breaking changes, | esp. the recent change from 1->2.
You end up spending a lot of | time just trying to debug what went wrong. Just installing and | running it is a pain. | blakeburch wrote: | Love your observation about tying the workflow to Airflow. | | One of my biggest annoyances in the orchestration space is that | teams are mixing business logic with platform logic, while | still touting "lack of vendor lock-in" because it's open | source. At the point that you're importing Airflow-specific | operators into your script and changing the underlying code to | make sure it works for the platform (XCom, task decorators, | etc.), you are directly locking yourself in and making edits | down the road even more difficult. | | While some of the other players do a better job, their method | of "code as workflow" still results in the same problems, where | workflows get built as a "mega-script" instead of as modular | components. | | I'm a co-founder at Shipyard, a light-weight hosted | orchestrator for data teams. One of our core principles is | "Your code should run the same locally as it does on our | platform". That means 0 changes to your code. | | You can define the workflow in a drag and drop editor or with | YAML. Each task is its own independent script. At runtime, we | automatically containerize each task and spin up ephemeral file | storage for the workflow, letting you run scripts one after | the other, each in their own virtual environment, while still | sharing generated files as if you were running them on your | local machine. In practice, that means that individual tasks | can be updated (in app or through GitHub sync) without having | to touch the entire workflow. | | I'm biased, but it seems crazy to me that so many engineers are | willing to spend hours fighting the configuration of their | orchestration platform rather than focusing on solving the | problems at hand with code.
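The "script as a CLI" pattern discussed above can be sketched like this. The `pipeline` name and the stage bodies are hypothetical stubs; the point is that the business logic lives in an ordinary script, not in the DAG.

```python
# Hypothetical standalone pipeline CLI. Business logic lives here, not in
# the DAG; Airflow (or any other orchestrator) just shells out to a stage.
import argparse

def fetch(date):
    # Stub: download files from the source API for `date`.
    return f"fetched {date}"

def upload(date):
    # Stub: push the downloaded files to s3.
    return f"uploaded {date}"

STAGES = {"fetch": fetch, "upload": upload}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="stage", required=True)
    for name in STAGES:
        sub.add_parser(name).add_argument("--date", required=True)
    args = parser.parse_args(argv)
    return STAGES[args.stage](args.date)

if __name__ == "__main__":
    print(main())
```

An Airflow task then reduces to something like `BashOperator(task_id="fetch", bash_command="pipeline fetch --date {{ ds }}")`, and swapping orchestrators never touches the logic.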
| rockostrich wrote: | We've established a rule that all "custom" code (anything that | isn't a preexisting operator in airflow) needs to be contained | in a docker image and run through the k8s pod operator. What's | resulted is most folks do exactly what you said. They create a | repo with a simple CLI that runs a script and the only thing | that gets put in our airflow repo is the dependency | graph/configuration for the k8s jobs. | claytonjy wrote: | AFAICT this is the now-recommended way to use Airflow: as a | k8s task orchestrator. Even the Astronomer team (original | Airflow authors) will tell you to do it this way. | idomi wrote: | When it comes to scale and DS work I'd use the ploomber open- | source (https://github.com/ploomber/ploomber). It allows an easy | transition between dev and production, incrementally building the | DAG so you avoid expensive compute time and costs. It's easier to | maintain and integrates seamlessly with Airflow, generating the | DAGs for you. | dekhn wrote: | I tried to run airflow; I found pretty much everything about it | to be wrong for my usecase. Why can't I easily upload a workflow | through the UI? Why doesn't it handle S3 file staging for me? | jonpon wrote: | Would love to hear more about your use case and your issues -- | you can sign up on our website (magniv) or send me an email jon | at our domain. | hatware wrote: | It definitely takes some time getting used to the quirks of | Airflow. I know it took 6 months of running it at my last gig | to really understand what was happening underneath the UI. | | With great control comes great responsibility. | dekhn wrote: | actually I concluded it was just not that great a workflow | engine. It's probably just intended for a different use case | than mine. | ashtonbaker wrote: | we run airflow with ... considerably more dags than this. our | main "lesson learned" is that airflow should not be used "at | scale". 
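The "airflow repo holds only the dependency graph/configuration" rule above can be illustrated with a stdlib-only sketch. The image and command names are invented, and a plain topological sort stands in for what the k8s pod operator would actually schedule.

```python
# Illustrative only: the "DAG repo" holds nothing but a declarative
# graph of (image, command) pairs; a generic runner executes them in
# dependency order. graphlib is in the stdlib from Python 3.9.
from graphlib import TopologicalSorter

# What gets checked into the orchestration repo: pure configuration.
TASKS = {
    "fetch":     {"image": "team/etl-cli:1.0", "cmd": ["etl", "fetch"],     "after": []},
    "load":      {"image": "team/etl-cli:1.0", "cmd": ["etl", "load"],      "after": ["fetch"]},
    "transform": {"image": "team/etl-cli:1.0", "cmd": ["etl", "transform"], "after": ["load"]},
}

def run_order(tasks):
    # Map each task to its predecessors and let the stdlib sort them.
    ts = TopologicalSorter({name: spec["after"] for name, spec in tasks.items()})
    return list(ts.static_order())
```

Because the graph is plain data, changing a step means changing an image tag or a command, never the orchestrator's code.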
| blakeburch wrote: | Lots of the comments here recount their authors' own | experiences of complexity and frustration with Airflow, but I'd | venture to say that's true of most data orchestration tools. In fact, | that sort of feedback is so consistent that I'm half tempted to | start a podcast for "orchestration horror stories" (contact if | interested). | | What I've found while building out Shipyard, a hosted lightweight | orchestration platform, is that teams want something that "just | works". Servers that "just scale". Observability that doesn't | require digging. Notifications and retries that work | automatically. Workflows that don't mix business logic with | platform logic. Code and workflows that sync with git. Deployment | that only takes a few minutes. | | For the straightforward use cases, where you need to run tasks A | -> G daily, with a bit of branching logic, Airflow is overkill. | Yes, Airflow has a lot of great complex functionality that can | help you down the road. But Airflow keeps getting suggested to | everyone even if it's not best suited to their use case, | resulting in lots of lost time and engineering overhead. | | While I definitely have bias, there are a lot of other | high-quality alternatives out there to explore nowadays! ___________________________________________________________________ (page generated 2022-05-23 23:00 UTC)