[HN Gopher] Lessons learned from running Apache Airflow at scale
       ___________________________________________________________________
        
       Lessons learned from running Apache Airflow at scale
        
       Author : datafan
       Score  : 218 points
       Date   : 2022-05-23 15:31 UTC (7 hours ago)
        
 (HTM) web link (shopify.engineering)
 (TXT) w3m dump (shopify.engineering)
        
       | rr808 wrote:
        | Surely there is a distributed scheduler out there that is
        | really simple. Do I need to write one? I.e. it has dependencies,
        | no database, flat files, a single instance but trivial to fail
        | over to a backup. I can even live without history or output.
        
         | 0xbadcafebee wrote:
         | What are you trying to do? Distributed scheduler with a single
         | instance? No database? Are you sure you don't just mean "a
         | scheduler" ala Luigi? https://github.com/spotify/luigi
         | 
         | And what kind of scheduler? Again, for "a single instance" it
         | doesn't need to be distributed. For distributed operation,
         | Nomad is as simple and generic as you can get. If you need to
         | define a DAG, that's never going to be simple.
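          | 
          | For a sense of how light Luigi is, a minimal pipeline is
          | roughly this (a sketch; task and file names are made up):
          | 
          |     import luigi
          | 
          |     class Extract(luigi.Task):
          |         date = luigi.DateParameter()
          | 
          |         def output(self):
          |             # a flat file doubles as the "done" marker
          |             return luigi.LocalTarget(f"data/raw-{self.date}.csv")
          | 
          |         def run(self):
          |             with self.output().open("w") as f:
          |                 f.write("some,raw,data\n")
          | 
          |     class Transform(luigi.Task):
          |         date = luigi.DateParameter()
          | 
          |         def requires(self):
          |             # dependencies: no database, just targets on disk
          |             return Extract(date=self.date)
          | 
          |         def output(self):
          |             return luigi.LocalTarget(
          |                 f"data/transformed-{self.date}.csv")
          | 
          |         def run(self):
          |             with self.input().open() as src, \
          |                     self.output().open("w") as dst:
          |                 dst.write(src.read().upper())
          | 
          | Run with `luigi --module mymodule Transform --date
          | 2022-05-23 --local-scheduler` and it walks the dependency
          | graph for you.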
        
       | qkhhly wrote:
        | airflow is one piece of software that i hate very much,
        | especially the aspect that my job definition is intertwined with
        | the actual job code. if my job depends on something that
        | conflicts with airflow's dependencies, it gets ugly.
        | 
        | i actually like azkaban a lot better. of course, writing a plain
        | text job config could also be painful. i think ideally you could
        | write the job def in python or another lang but have it get
        | translated to a plain text config that does not interfere with
        | your job code in any way.
        
       | Mayzie wrote:
       | The biggest pain point in Airflow I have experienced is the
       | horrible and completely lacking documentation. The community
       | support (Slack) won't (or can't) help with anything beyond basic
       | DAG writing.
       | 
       | That sore point makes running and using the software needlessly
       | frustrating, and honestly I won't ever be using it again because
       | of it.
        
         | 8589934591 wrote:
          | I agree with this. The Slack is just the core developers
          | discussing further development and tickets. The documentation
          | is lacking big time. The only recourse is to raise PRs to
          | improve the docs.
        
         | idomi wrote:
          | Make sure to check out Ploomber: our support is seamless,
          | there are tons of docs (https://docs.ploomber.io/), and we
          | take our users seriously. P.S. We integrate with Airflow and
          | other orchestrators if you still need to tackle those.
        
         | pid-1 wrote:
          | Agreed; I would just like to add that the documentation has
          | gotten a lot better in the past couple of years.
        
         | artwr wrote:
         | Can I ask more about your use case that you could not find an
         | answer for?
        
         | jlaneve wrote:
         | That's one of the things we're working on at Astronomer - check
         | out the Astronomer Registry! registry.astronomer.io
        
       | higeorge13 wrote:
        | I am wondering why they still use the Celery executor when the
        | Kubernetes executor is the go-to one for large deployments. I
        | have used the Celery executor in the past, had so many issues
        | and stuck tasks, and frequently had to fine-tune the Celery
        | configuration in the Airflow config.
        
       | mcqueenjordan wrote:
       | I think airflow ends up creating as many problems as it solves
       | and kind of warps future development patterns/designs into its
       | black hole when it wouldn't otherwise be the natural choice.
       | There's the sort of promise of network effects -- "well of course
       | it's better if /everything/ is represented and executed within
       | the DAG of DAGs, right?" -- but it ends up being the case that
       | the inherent problems it creates plus the externalities of using
       | airflow for the wrong use cases start to compound, especially as
       | the org grows.
       | 
       | I think it slowly ends up being sort of isomorphic to the set of
       | problems that sharing database access across service and
       | ownership boundaries has, and my view is increasingly of the
       | "convince me this can't be an RPC call, please" camp, and when it
       | really can't (for throughput reasons, for example), "ok, how
       | about this big S3 bucket as the interface, with object
       | notification on writes?"
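        | 
        | The latter is pleasantly small to set up, too; a sketch with
        | boto3 (bucket, queue, and prefix are all hypothetical):
        | 
        |     import boto3
        | 
        |     s3 = boto3.client("s3")
        | 
        |     # Producers write under the agreed prefix; consumers get
        |     # an SQS message per object instead of sharing a database
        |     # or a scheduler.
        |     s3.put_bucket_notification_configuration(
        |         Bucket="team-interface-bucket",
        |         NotificationConfiguration={
        |             "QueueConfigurations": [{
        |                 "QueueArn": "arn:aws:sqs:us-east-1:"
        |                             "123456789012:consumer-queue",
        |                 "Events": ["s3:ObjectCreated:*"],
        |                 "Filter": {"Key": {"FilterRules": [
        |                     {"Name": "prefix",
        |                      "Value": "exports/orders/"},
        |                 ]}},
        |             }]
        |         },
        |     )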
        
       | trumpeta wrote:
        | We operate a (small?) Airflow instance with ~20 DAGs, but one
        | of those DAGs has ~1k tasks. It runs on a k8s/AWS setup with
        | MySQL backing it.
       | 
       | We package all the code in 1-2 different Docker images and then
       | create the DAG. We've faced many issues (logs out of order,
       | missing, random race conditions, random task failures, etc.)
       | 
        | But what annoys me the most is that for that 1 big DAG, the UI
        | is completely useless: the tree view has insane duplication,
        | the graph view is super slow and hard to navigate, and
        | answering basic questions like what exactly failed and what
        | nodes are around it is not easy.
        
         | artwr wrote:
          | At Airbnb, we were using SubDAGs to try to manage large
          | numbers of tasks in a single DAG. This allowed organizing
          | tasks and drilling down into failures more easily, but came
          | with its own challenges.
          | 
          | In more recent versions of Airflow, TaskGroups
          | (https://airflow.apache.org/docs/apache-
          | airflow/stable/concep...,
          | https://www.astronomer.io/guides/task-groups/ ) were
          | introduced to help with this. Hopefully that helps a bit.
         | 
          | At ~1k nodes in the graph, introspection becomes hard anyway;
          | as others have suggested, breaking it down if possible might
          | be a good idea.
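          | 
          | A rough sketch of the TaskGroup pattern (names are
          | illustrative; EmptyOperator is Airflow 2.3+, use
          | DummyOperator on older versions):
          | 
          |     from datetime import datetime
          |     from airflow import DAG
          |     from airflow.operators.empty import EmptyOperator
          |     from airflow.utils.task_group import TaskGroup
          | 
          |     with DAG(dag_id="big_pipeline",
          |              schedule_interval="@daily",
          |              start_date=datetime(2022, 1, 1),
          |              catchup=False) as dag:
          |         start = EmptyOperator(task_id="start")
          | 
          |         # collapses to a single node in the UI until expanded
          |         with TaskGroup(group_id="ingest") as ingest:
          |             for source in ("orders", "payments", "refunds"):
          |                 EmptyOperator(task_id=f"load_{source}")
          | 
          |         start >> ingest >> EmptyOperator(task_id="done")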
        
         | mywittyname wrote:
         | Also, the @task annotation provides no facilities to name
         | tasks. So if you like to build reusable tasks (as I do), you
         | end up with my_generic_task__1, my_generic_task__2,
         | my_generic_task__n. I've tried a few hacks to dynamically
         | rename these, but I just ended up bringing down my entire
         | staging cluster.
        
           | artwr wrote:
           | `your_task.override(task_id="your_generated_name")` not
           | working for you?
        
             | mywittyname wrote:
             | I got pretty excited when I read this response, but no, it
             | doesn't work. I'm not sure how this would work since
             | annotated tasks return an xcom object.
             | 
             | Can you point me to the documentation on this function?
             | It's possible I'm not using it correctly.
             | 
              | I can do something like this, which works locally, but
              | breaks when deployed:
              | 
              |     res = annotated_task_function(...)
              |     res.operator.task_id = 'manually assigned task id'
        
               | flowair wrote:
                |     @task.python(task_id="this_is_my_task_name")
                |     def my_func():
                |         ...
        
               | mywittyname wrote:
               | This still has the problem that, when you call my_func
               | multiple times in the same dag, the resulting tasks will
               | be labelled, my_func, my_func__1, my_func__2, ...
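                | 
                | For reference, the pattern under discussion looks
                | roughly like this (whether the explicit ids stick
                | seems to depend on the Airflow version):
                | 
                |     from airflow.decorators import task
                | 
                |     @task
                |     def load(source: str):
                |         print(f"loading {source}")
                | 
                |     # inside a DAG context: name each call instead of
                |     # relying on load, load__1, load__2, ...
                |     for src in ("orders", "payments", "refunds"):
                |         load.override(task_id=f"load_{src}")(src)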
        
         | suifbwish wrote:
          | Does this imply file metadata content can affect the access
          | performance of those files, even for operations that do not
          | directly concern the metadata?
        
         | rockostrich wrote:
          | We had a similar DAG that was the result of migrating a single
          | daily Luigi pipeline to Airflow. I started identifying
          | isolated branches and breaking them off, with external task
          | sensors back to the main DAG. This worked but it's a pain in
          | the ass. My coworker ended up exporting the graph to graphviz
          | and started identifying clusters of related tasks that way.
        
           | mywittyname wrote:
           | I've not had the best luck with ExternalTaskSensors. There
           | have been some odd errors like execution failing at 22:00:00
           | every day (despite the external task running fine).
        
       | vbezhenar wrote:
       | Can someone enlighten me whether Apache Airflow is suitable as a
       | business process engine?
       | 
       | We have something like orders. So people put orders into our
       | system, some orders are imported from external system. We have
       | something around 100-1000 orders per day, I think. Each order
       | goes through several states. Like CREATED, SOME_INFO_ADDED,
       | REVIEWED, CONCLUSION_CREATED, CONCLUSION_SENT_TO_EXTERNAL_SYSTEM
        | and so on. Some states are simple to change, taking a few
        | milliseconds to call some web services; some take 5 minutes of
        | an operator's time; some take a few days. This logic is encoded
       | into our program code. We have plenty of timers, every timer
       | usually transfers orders from one state to another. This is
       | further complicated by the fact that this processing is done via
       | several services, so it's not a single monolith but some kind of
       | service architecture.
       | 
       | Our management wants something to have clear monitoring, so you
       | can find a given task by some property values, monitor its
       | lifetime, check logs for every step, find out why it's failing,
       | etc.
       | 
        | What I usually see is that Apache Airflow is used more as a
        | cron replacement. I've read some articles, but it's still not
        | clear whether it could be used as a business process engine. I
        | had some experience with Java BPMN engines in the past; it was
        | not very pleasant, but I guess time has moved on.
        
       | subsaharancoder wrote:
       | A friend of mine wanted an ETL (SQL Server to BQ for analysis and
       | dashboarding) set up and I ended up stumbling across Airflow. I
       | spun up two VMs on GCP, one for Airflow and the other for the
       | Postgres DB to store the metadata.
       | 
        | - A few things I've noticed: Airflow generates a ton of logs
        | that will fill up your disk quite fast. I started with 100GB
        | and I'm now at 500GB; granted, disk space isn't expensive, but
        | even with a few DAGs I'm surprised at how quickly it fills up.
        | Apparently you need to run a maintenance DAG to clear those
        | logs, but I was too lazy so I just purge the logs using a cron
        | job.
       | 
       | - The SQL Server Operator is buggy, I filed an issue with the
       | Airflow team but I had to do some hacky stuff to get it to work.
       | 
       | - Even with a few DAGs, Airflow will spike the CPU utilization of
       | the VM to 100% for X minutes (in my case about 15 minutes) which
       | is quite interesting. My tasks basically query SQL Server -> dump
       | to CSV (stored on GCS) -> import to BQ.
       | 
        | - My DAGs execute every hour, and if Airflow is down for X
        | hours and I resolve the issue, it will try to run all the tasks
        | for the hours it was down, which isn't ideal because it will
        | take hours to catch up. So I've had to delete tasks and only
        | run the most recent ones.
       | 
       | Granted my set up is pretty simple and YMMV, but Airflow has done
       | what it needs to do albeit with some pain.
        
         | veeti wrote:
         | FWIW if you don't need Airflow to catch up and backfill missed
         | tasks, you can either set catchup=False on the DAG or use a
         | LatestOnlyOperator.
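          | 
          | Roughly (a sketch):
          | 
          |     from datetime import datetime
          |     from airflow import DAG
          |     from airflow.operators.bash import BashOperator
          |     from airflow.operators.latest_only import (
          |         LatestOnlyOperator,
          |     )
          | 
          |     with DAG(
          |         dag_id="hourly_export",
          |         schedule_interval="@hourly",
          |         start_date=datetime(2022, 1, 1),
          |         # don't re-run the hours missed while Airflow was down
          |         catchup=False,
          |     ) as dag:
          |         # or gate tasks so only the newest run does real work
          |         gate = LatestOnlyOperator(task_id="latest_only")
          |         export = BashOperator(task_id="export",
          |                               bash_command="echo export")
          |         gate >> export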
        
           | subsaharancoder wrote:
            | I have catchup=False set in the DAG, but that hasn't
            | stopped Airflow from backfilling missed tasks. I'm not sure
            | why this is the case?
        
         | awild wrote:
         | > Even with a few DAGs, Airflow will spike the CPU utilization
         | of the VM to 100% for X minutes (in my case about 15 minutes)
         | which is quite interesting. My tasks basically query SQL Server
         | -> dump to CSV (stored on GCS) -> import to BQ.
         | 
          | Have you checked why that is? Airflow re-imports DAG files
          | every few seconds. We've had an issue where it didn't honor
          | the .airflowignore file, making it execute our tests every
          | few seconds. The easy solution was to put them into the
          | .dockerignore.
          | 
          | You might also have too much logic at the root level of your
          | DAG files. It's recommended to not even import at the root
          | level, to make parsing faster.
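          | 
          | e.g. the deferred-import version looks like this (a sketch):
          | 
          |     from airflow.decorators import task
          | 
          |     # bad: a top-level `import pandas` here would run on
          |     # every parse, every few seconds
          | 
          |     @task
          |     def transform():
          |         # good: heavy imports only run when the task does
          |         import pandas as pd
          |         return pd.DataFrame({"x": [1, 2, 3]}).sum().to_dict()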
         | 
         | Not saying it's not an odd tool though.
        
       | vvladymyrov wrote:
        | Airflow brought one of the best tools with a nice UI for
        | running pipelines back in 2014-2016. But nowadays engineers
        | should be aware of easier-to-use options and not choose Airflow
        | blindly as the default. IMHO, for 80-90% of cases the
        | orchestration system should not use code at all - it should be
        | DAGs as configuration. Airflow is popular, and teams keep
        | choosing it for building simple DAGs, incurring otherwise
        | avoidable Airflow maintenance costs.
        | 
        | Databricks orchestration pipelines and AWS Step Functions are
        | good examples of DAGs as configuration.
        
         | carschno wrote:
         | Do you have more examples for better tools, ideally open source
         | (unlike AWS Step functions)?
        
           | jonpon wrote:
           | We at magniv.io are building an alternative.
           | 
           | Our core is open source https://github.com/MagnivOrg/magniv-
           | core
           | 
            | We can set you up with our hosted version if you would like
            | to poke around!
        
           | gadflyinyoureye wrote:
            | Flowable is a BPMN system. You can do a lot of async calls
            | with it. https://www.flowable.com/open-source
            | 
            | We use it for a complex pricing process that invokes 30-40
            | microservices that can take up to minutes per step.
        
           | qw wrote:
           | Kamelets and the Karavan UI in combination with k8s and
           | Knative for "serverless" integrations looks interesting.
        
           | antupis wrote:
            | We are using Prefect+dbt and I like it, although they are
            | doing a huge rewrite at the moment.
        
         | [deleted]
        
         | avemg wrote:
          | I've used AWS Step Functions extensively over the past
          | several years, and give me code every day of the week over
          | the Step Functions JSON config. Once you get beyond a few
          | simple steps, it gets very hard to look at the config and
          | understand what's going on with it. That's especially true
          | when you haven't looked at the config in a while. The DAG
          | visualizer definitely helps, but as soon as things get beyond
          | the trivial I long for a different tool.
        
           | coolsunglasses wrote:
            | I was, until a week or two ago, part of a team that builds
            | datasets with extensive dependencies (thus, complicated
            | DAGs).
            | 
            | v1 of the system, built before I joined, was Step Functions
            | and the like. It gets hairy just as you say.
           | 
           | v2 I built and designed with the lead data engineer, we
           | called it Coriaria originally. We're hoping/planning to open
           | source it eventually, although it's a little wrapped around
           | our company's internal needs & systems.
           | 
           | It chooses neither "config" strictly speaking nor "code" for
           | the DAG, instead the primary representation/state is all in
           | the PostgreSQL database which tracks the dataset dependencies
           | and how each dataset is built. It's a DAG in PostgreSQL as
           | well.
           | 
           | To make dataset creation and management easier, I also wrote
           | a custom Terraform provider for Coriaria. This made migrating
           | datasets into the new system dramatically faster. The
           | provider is really nice, supports `terraform import` and all
            | that. Currently we have it set up so that there are
            | separate roles/accounts that can modify an existing dataset,
            | but
           | reading state only requires authentication, not
           | authorization. This enables one team to depend on another
           | team's dataset as an upstream data source for their datasets
           | without granting permission to modify it or create a
           | potentially stale copy of the dataset. Terraform's internal
           | DAG representation of the resource dependencies is leveraged
           | because "parent_datasets" references the upstream datasets
           | directly, including the ones we don't build.
           | 
           | We're able to depend on datasets we don't build ourselves
           | because the system has support for Glue catalog backends to
           | track and register partition availability.
           | 
           | Currently, it builds most of the datasets using AWS Athena &
           | S3, however this is abstracted over a single step function.
           | There's no DAG of step functions, it's just a convenient
           | wrapper for the Athena query execution.
           | 
           | The system also explicitly understands dataset builds and
           | validations as separate steps. The dashboard makes it easy to
           | trace the DAG and see which datasets are blocking a dataset
           | build.
           | 
           | We're adding more integrations to it soon so that other ways
           | of kicking off dataset builds and validations are available.
           | 
           | If people are interested in this I can begin lobbying for
           | open sourcing the system. My colleague wanted to open source
           | it as well.
           | 
           | All else fails, I'll rebuild it from scratch because I don't
           | like the existing solutions for managing datasets. We've been
           | calling it a data-flow orchestration system or ETL
           | orchestration system, not sure what would be most meaningful
           | to people.
           | 
           | I think the main caveat to this system is that I'm not sure
           | how much use it'd be for streaming data pipelines, but it
            | could manage the discretization of streaming into validated
            | partitions wherever streamed data lands. Our operating
           | assumptions are that you want validated datasets to drive
           | business decisions, not raw event data streamed in from
           | Kafka. Making sure the right data is located in each daily
           | (or hourly) partition is part of that validation.
        
           | latchkey wrote:
           | Why not just model the json as objects in (insert favorite
           | language) and then use that code to generate the json?
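            | 
            | A minimal sketch of what I mean (state names and ARNs are
            | made up; the output is Amazon States Language JSON):
            | 
            |     import json
            |     from dataclasses import dataclass
            |     from typing import List, Optional
            | 
            |     @dataclass
            |     class TaskState:
            |         name: str
            |         resource: str
            |         next_state: Optional[str] = None
            | 
            |         def to_asl(self):
            |             state = {"Type": "Task",
            |                      "Resource": self.resource}
            |             if self.next_state:
            |                 state["Next"] = self.next_state
            |             else:
            |                 state["End"] = True
            |             return state
            | 
            |     def to_json(states: List[TaskState]) -> str:
            |         return json.dumps({
            |             "StartAt": states[0].name,
            |             "States": {s.name: s.to_asl() for s in states},
            |         }, indent=2)
            | 
            |     # unit-testable long before it hits CloudFormation
            |     print(to_json([
            |         TaskState("Extract", "arn:aws:lambda:us-east-1:"
            |                   "123456789012:function:extract", "Load"),
            |         TaskState("Load", "arn:aws:lambda:us-east-1:"
            |                   "123456789012:function:load"),
            |     ]))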
        
             | entropicdrifter wrote:
             | Ah yes, a home-made framework to generate configurations
             | for your framework that's supposed to make your life
             | easier. That way you can maintain your code that maintains
             | your configs that make it easier to run your code that you
             | have to maintain!
        
               | latchkey wrote:
               | Actually, yes. It allows for easier unit and integration
               | testing as well. The original complaint is that things
               | were getting hard to read and they wished there was code
               | for this. It seems logical to create a framework for the
               | json configuration files so that they can be easily
               | mocked and tested. As someone who greatly values spending
               | time on automated testing, it seems weird to not think of
               | it this way.
               | 
               | Quick google shows that others have done things like this
               | already...
               | 
               | [1] https://noise.getoto.net/2021/10/14/using-jsonpath-
               | effective...
               | 
               | [2] https://aws.amazon.com/about-aws/whats-
               | new/2022/01/aws-step-...
               | 
               | [3] https://docs.aws.amazon.com/step-
               | functions/latest/dg/sfn-loc...
        
               | pharmakom wrote:
               | This can be a much better approach than upgrading the DAG
               | description language to a true programming language. It
               | forces anything complex to happen at build time where it
                | can do less damage. Plus, we can often use the same
                | library to do static analysis on the output.
        
               | savin-goyal wrote:
               | Metaflow provides a similar concept to interface with
               | Step Functions and Argo Workflows in Python -
               | https://docs.metaflow.org/going-to-production-with-
               | metaflow/...
        
             | [deleted]
        
           | TYPE_FASTER wrote:
           | AWS offers a service for managed Airflow:
           | https://aws.amazon.com/managed-workflows-for-apache-airflow/
           | 
           | Makes me wonder if Amazon internally was using Step
           | Functions, ran into issues trying to scale to larger graphs,
           | realized multiple teams were using Airflow, and created the
           | Managed Airflow service.
        
         | nomilk wrote:
         | > teams keep choosing it for building simple DAGs
         | 
          | I am part of one such team. We were using Windows Task
          | Scheduler on a Windows VM to run jobs, and we figured it
          | would be a nice idea to (dramatically) modernise and move to
          | Airflow, but we grossly underestimated the complexity,
          | learning curve, and surrounding tools it requires. In the end
          | we (the data science team) didn't get a single production
          | task up and running. The data engineers had much more success
          | with it though, probably because they dedicated much more
          | time to it.
         | 
         | Will look forward to trying AWS Step Functions.
        
           | commandlinefan wrote:
           | I tried installing Airflow locally to just play around with
           | it and make sense of what it's good for and finally gave up
           | after a few days - the install alone is insanely complicated,
           | with lots of tricky hidden dependencies.
        
             | always_left wrote:
             | Did you try installing with docker? You would just download
             | docker, `docker-compose up --build` and you'll be good to
             | go locally (usually)
        
               | idleprocess wrote:
               | I can second this. We were up-and-running with Docker on
               | our dev machines in just a few minutes. A native
               | installation involves substantially more setup (Python,
               | databases, Redis and/or Rabbit, etc.). The published
               | docker-compose file will handle all of that for you. We
               | have a very small data engineering team and have been
               | able to move very quickly with Docker and AWS ECS (for
               | orchestrating containers in test and prod environments).
        
       | anonymousDan wrote:
       | Can anyone ELI5 the value proposition of airflow?
        
       | lysecret wrote:
        | Well written article. One question I always have when reading
        | such an article: is it really worth it for these kinds of
        | companies to run Airflow on Kubernetes? You could also run it,
        | for example, on AWS Batch with Spot instances.
        
         | beckingz wrote:
         | Running Airflow on Kubernetes has been one of the most painful
         | data engineering challenges I've worked on.
        
           | ricklamers wrote:
           | We kept hearing this from our users. We've just released our
           | k8s operator based deployment of Orchest that should give you
           | a good experience running an orchestration tool on k8s
           | without much trouble. https://github.com/orchest/orchest
           | 
           | (We extended Argo, works fantastically well by the way!)
        
           | marcinzm wrote:
           | How so? Did you have any existing Kubernetes knowledge? We
           | found it fairly easy to deploy using the community Helm chart
           | (official chart wasn't out yet).
        
           | mrbungie wrote:
           | Did you have any previous experience running workloads in k8s
           | before?
           | 
            | Running the Airflow Helm chart is pretty straightforward,
            | even with more "complex" use cases like heterogeneous pods
            | for different task sizes.
        
       | kbd wrote:
       | I'm bullish about Dagster nowadays. Though, I don't have a lot of
       | experience with Airflow. Figured I'd ask if anyone has switched
       | from Airflow to Dagster and has any comments?
        
         | perfect_kiss wrote:
          | I participated in migrating around 100 fairly complicated
          | pipelines from Airflow to Dagster over six months in 2021. We
          | used the k8s launcher, so this feedback does not apply to
          | other launchers, e.g. Celery.
          | 
          | Key takeaways were roughly these:
         | 
          | - Dagster's integration with k8s really shines compared to
          | Airflow's. It is also based on extendable Python code, so it
          | is easy to add custom features to the k8s launcher if needed.
         | 
          | - It is super easy to scale the UI/server component
          | horizontally, and since DAGs were running as pods in k8s,
          | there was no problem scaling those as well. The scheduling
          | component is more complicated; e.g. built-in scheduling
          | primitives like sensors are not easily integrated with
          | state-of-the-art message queue systems. We ended up writing a
          | custom scheduling component that read messages from Kafka and
          | created DAG runs via the networked API (see the sketch at the
          | end of this comment). It was like 500 lines of Python
          | including tests, and worked rock-solid.
         | 
          | - The networked API is GraphQL while Airflow's is REST. Both
          | are really straightforward; however, Dagster's felt better
          | designed, maybe due to the tighter governance of Dagster's
          | authors over the design.
         | 
          | - The DAG definition Python API, e.g. solid/pipeline (or
          | op/graph in the newer Dagster API), is somewhat complicated
          | compared to Airflow's operators; however, it is easy to build
          | a custom DSL on top of it. One would need a custom DSL for
          | complicated logic in Airflow as well, and in the case of
          | Dagster it felt easier to generate its primitives than to do
          | never-ending operator combinations as in Airflow.
         | 
          | - Unit and integration testing are much easier in Dagster.
          | The authors treat testing as a first-class citizen, so mocks
          | are supported everywhere, and code tested with the local
          | runner is guaranteed to execute the same way on the k8s
          | launcher. We never had any problems with test environment
          | drift.
         | 
          | The biggest caveat was the full change of internal APIs in
          | 0.13, which forced the team to execute a fairly complicated
          | refactor due to the deprecation of features we were depending
          | on, e.g. execution modes. Had we spent more time on the
          | Elementl Slack, it would have been easier to take fewer
          | dependencies on those features ^__^
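          | 
          | The Kafka-to-DAG-runs component mentioned above was in
          | spirit just this (a sketch; the exact GraphQL mutation shape
          | varies across Dagster versions, so treat it as illustrative):
          | 
          |     import json
          |     import requests
          |     from kafka import KafkaConsumer  # kafka-python
          | 
          |     DAGIT = "http://dagit.internal:3000/graphql"  # made up
          | 
          |     MUTATION = """
          |     mutation Launch($executionParams: ExecutionParams!) {
          |       launchPipelineExecution(
          |           executionParams: $executionParams) { __typename }
          |     }
          |     """
          | 
          |     consumer = KafkaConsumer("dataset-events",
          |                              bootstrap_servers="kafka:9092")
          |     for msg in consumer:
          |         event = json.loads(msg.value)
          |         params = {
          |             "selector": {
          |                 "repositoryLocationName": "etl",
          |                 "repositoryName": "main",
          |                 "pipelineName": event["pipeline"],
          |             },
          |             "runConfigData": event.get("run_config", {}),
          |             "mode": "default",
          |         }
          |         resp = requests.post(DAGIT, json={
          |             "query": MUTATION,
          |             "variables": {"executionParams": params},
          |         })
          |         resp.raise_for_status()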
        
         | doom2 wrote:
         | At my previous employer, we were running self-hosted Airflow in
         | AWS, which really was a nightmare. The engineer that set it up
         | didn't account for any kind of scaling and all the code was a
         | mess. We would also get issues like logs not syncing correctly
         | in our environment or transient networking issues that somehow
         | didn't fail the given Airflow task. Eventually, we did a dual
         | migration: temporarily switching to AWS managed Airflow (their
         | Amazon Managed Workflows for Apache Airflow product) while also
         | rewriting the DAGs in Dagster.
         | 
         | Dagster was a great solution for us. Their notion of software
         | defined assets allowed us to track metadata of the Redshift and
         | Snowflake tables we were working with. Working with re-runs and
         | partitioned data was a breeze. It did take a while to onboard
         | the whole team and get things working smoothly, which was a bit
         | difficult because Dagster is still young and they were often
         | making changes to how parts of the system worked (although
         | nothing that was immediately backwards incompatible).
         | 
         | We also enjoyed some of the out of the box features like
         | resources and unit testing jobs. Overall, I think it made our
         | team focus more on our data and what we wanted to do with it
         | rather than feeling like we had to wrangle with Airflow just to
         | get things running.
        
           | kbd wrote:
            | Thanks for your comment! Ditto; last time I ran Airflow
            | locally it took like 5 Docker containers. Then I forgot
            | about the project and for a while was furious at Docker for
            | randomly taking 100% CPU. Then I realized it was because of
            | the Airflow containers that would restart along with
            | Docker. I didn't get much further with Airflow.
           | 
           | Dagster, on the other hand, seems to let you scale from using
           | it locally as a library all the way to running on ECS/K8s
           | etc. Along with that there's unfortunately a ton of
           | complexity in setting it up but that's not Dagster's fault
           | and it seems like Dagster works once you get it set up. Agree
           | about it being young and there being some rough spots but
           | it's got lots of good ideas. We were nearly done setting it
           | up but got pulled off onto more urgent things, so I haven't
           | run it in production yet. I'm glad to hear it worked well for
           | you!
        
         | computershit wrote:
          | Dagster is extremely nice to work with. I did a bakeoff of
          | Prefect vs Dagster internally at my current employer, and
          | while we ended up going with Prefect for reasons, I am still
          | so impressed with the way Dagster approaches certain pain
          | points in the orchestration of data pipelines and its
          | solutions for them.
        
           | theptip wrote:
           | > for reasons
           | 
           | I'd love to hear more on this. I've not evaluated Prefect,
           | and am currently keeping an eye on Dagster. What trade-offs
           | does Prefect win?
        
             | 64StarFox64 wrote:
             | I did a baby bakeoff internally in my prior role ~18mo ago
             | now. Prefect felt nicer to write code in but perhaps not as
             | easy to find answers in the docs (though their Slack is
             | phenomenal). Ended up going with Prefect so I could focus
             | on biz/ETL logic with less boilerplate, but I'm sure
             | Dagster is not a bad choice either. Curious to hear about
             | parent's experience
        
       | simo7 wrote:
       | I think the main lesson should be not to use it, especially at
       | scale.
        
       | 0xbadcafebee wrote:
       | If you have the headcount for people just to build/support
       | Airflow, please do yourself a favor and give that money to
        | Astronomer.io. Their offering is _stupid good_. There are 20
        | different reasons why paying them is a much better idea than
       | managing Airflow yourself (including using MWAA), and it's dirt
       | cheap considering what you get.
        
         | [deleted]
        
         | pid-1 wrote:
         | Last time I checked, they asked for a significant minimum $ + 1
         | year commitment.
         | 
         | I wish they had a "start small", self service, clear pricing
         | option.
        
       | emef wrote:
       | We've also been running airflow for the past 2-3 years at a
       | similar scale (~5000 dags, 100k+ task executions daily) for our
       | data platform. We weren't aware of a great alternative when we
       | started. Our DAGs are all config-driven which populate a few
       | different templates (e.g. ingestion = ingest > validate > publish
       | > scrub PII > publish) so we really don't need all the
       | flexibility that airflow provides. We have had SO many headaches
       | operating airflow over the years, and each time we invest in
       | fixing the issue I feel more and more entrenched. We've hit
       | scaling issues at the k8s level, scheduling overhead in airflow,
       | random race conditions deep in the airflow code, etc. Considering
       | we have a pretty simplified DAG structure, I wish we had gone
       | with a simpler, more robust/scalable solution (even if just
       | rolling our own scheduler) for our specific needs.
       | 
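        | For context, "config-driven" here means roughly the usual
        | pattern of generating DAGs in a loop (a sketch; the config
        | layout is hypothetical):
        | 
        |     from datetime import datetime
        |     import yaml
        |     from airflow import DAG
        |     from airflow.operators.bash import BashOperator
        | 
        |     with open("/opt/airflow/configs/pipelines.yaml") as f:
        |         configs = yaml.safe_load(f)
        | 
        |     for cfg in configs:
        |         with DAG(dag_id=cfg["name"],
        |                  schedule_interval=cfg.get("schedule", "@daily"),
        |                  start_date=datetime(2022, 1, 1),
        |                  catchup=False) as dag:
        |             prev = None
        |             for step in ("ingest", "validate", "publish"):
        |                 t = BashOperator(
        |                     task_id=step,
        |                     bash_command=f"run-step {step} {cfg['name']}")
        |                 if prev:
        |                     prev >> t
        |                 prev = t
        |         # Airflow discovers DAGs via module globals
        |         globals()[cfg["name"]] = dag
        | 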
       | Upgrades have been an absolute nightmare and so disruptive. The
       | scalability improvements in airflow 2 were a boon for our
       | runtimes since before we would often have 5-15 minutes of
       | overhead between task scheduling, but man it was a bear of an
       | upgrade. We've since tried multiple times to upgrade past the 2.0
       | release and hit issues every time, so we are just done with it.
       | We'll stay at 2.0 until we eventually move off airflow
       | altogether.
       | 
       | I stood up a prefect deployment for a hackathon and I found that
       | it solved a ton of the issues with airflow (sane deployment
       | options, not the insane file-based polling that airflow does). We
        | looked into it ~1 year ago or so; I haven't heard a lot about
        | it lately. I wonder if anyone has had success with it at scale.
        
         | pweissbrod wrote:
          | If your team is comfortable writing pure Python and you're
          | familiar with the concept of a makefile, you might find Luigi
          | a much lighter and less opinionated alternative for
          | workflows.
          | 
          | Luigi doesn't force you into using a central orchestrator for
          | executing and tracking the workflows. Tracking and updating
          | task state is left as open functions for the programmer to
          | fill in.
          | 
          | It's probably geared toward more expert programmers who work
          | close to the metal and don't care about GUIs as much as about
          | high degrees of control and flexibility.
         | 
         | It's one of those frameworks where the code that is not written
         | is sort of a killer feature in itself. But definitely not for
         | everyone.
        
           | teej wrote:
           | It's worth noting that Luigi is no longer actively maintained
           | and hasn't had a major release in a year.
        
         | pyrophane wrote:
         | Very similar experience to yours. Adopted Airflow about 3 years
         | ago. Was aware of Prefect but it seemed a bit immature at the
         | time. Checked back in on it recently and they were approaching
         | alpha for what looked like a pretty substantial rewrite (now in
         | beta). Maybe once the dust has settled from that I'll give it
         | another look.
        
           | throwusawayus wrote:
            | creator of prefect was an early major airflow committer.
            | anyone know what motivated the substantial rewrite of
            | prefect? i had assumed the original version of prefect was
            | already supposed to fix some design issues in airflow?
        
             | timost wrote:
             | I think you mean prefect orion/v2[0]. I'm curious too.
             | 
             | [0] https://www.prefect.io/orion/
        
         | dopamean wrote:
         | If you could go back and use something else instead what would
         | you choose?
        
           | emef wrote:
           | It's a good question. I believe airflow was probably the
           | right choice at the time we started. We were a small team,
           | and deploying airflow was a major shortcut that more or less
           | handled orchestration so we could focus on other problems.
           | With the aid of hindsight, we would have been better off
           | spinning off our own scheduler some time in the first year of
           | the project. Like I mentioned in my OP, we have a set of
           | well-defined workflows that are just templatized for
           | different jobs. A custom-built orchestration system that
           | could perform those steps in sequence and trigger downstream
           | workflows would not be that complicated. But this is how
           | software engineering goes, sometimes you take on tech debt
           | and it can be hard to know when it's time to pay it off. We
           | did eventually get to a stable steady state, but with lots of
           | hair pulling along the way.
        
           | hbarka wrote:
           | dbt tool. getdbt.com
        
             | mywittyname wrote:
             | Can dbt run arbitrary code? If it can, it's not well
             | advertised in the documentation. Every time I've looked
             | into dbt, I found that it's mostly a scheduled SQL runner.
             | 
             | The primary reason we run Airflow is because it can execute
             | Python code natively, or other programs via Bash. It's very
             | rare that a DAG I write is entirely SQL-based.
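              | 
              | i.e. the bread and butter is just this kind of thing (a
              | sketch; the binary path is made up):
              | 
              |     from datetime import datetime
              |     from airflow import DAG
              |     from airflow.operators.bash import BashOperator
              |     from airflow.operators.python import PythonOperator
              | 
              |     def enrich():
              |         ...  # arbitrary Python, not just SQL
              | 
              |     with DAG(dag_id="mixed",
              |              schedule_interval="@daily",
              |              start_date=datetime(2022, 1, 1),
              |              catchup=False) as dag:
              |         py = PythonOperator(task_id="enrich",
              |                             python_callable=enrich)
              |         sh = BashOperator(
              |             task_id="run_binary",
              |             bash_command="/opt/bin/etl --full")
              |         py >> sh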
        
               | hbarka wrote:
               | You're right. I think the strength of dbt is in the T
               | part of ELT. I wrote ELT to make a distinction in
               | principle from the traditional ETL. (E)xtract and (L)oad
               | is the data ingestion phase that would probably be better
               | served by Dagster, where you could use Python.
               | 
                | (T)ransform is decoupled and would be served by set-
                | based operations managed by dbt.
        
               | igrayson wrote:
               | dbt has just opened a serious conversation about
               | supporting Python models. I'm sure they'd value your
               | viewpoint! https://github.com/dbt-labs/dbt-
               | core/discussions/5261
        
             | KptMarchewa wrote:
             | Dbt is great, but solves only a small part of what Airflow
             | does.
        
       | digisign wrote:
        | Is Airflow good for an ETL pipeline? Right now a client uses
        | Jenkins, but it is quite clunky and difficult to automate,
        | though they've managed to. Cloud is not an option.
        
         | theptip wrote:
         | Airflow is generally brought in when you have a DAG of jobs
         | with many edges, and where you might want to re-run a sub-
         | graph, or have sub-graphs run on different cadences.
         | 
         | In a simplistic ETL/ELT pipeline you can model things as
         | "Extract everything, then Load everything, then Transform
         | everything", in which case you'll add a bunch of unnecessary
         | complexity with Airflow.
         | 
         | If you're looking for a framework to make the plumbing of ELT
         | itself easier, but don't need sub-graph dependency modeling,
         | Meltano is a good option to consider.
        
       | skrtskrt wrote:
       | Could anyone comment on Temporal vs Airflow?
       | 
       | After having a lot of pain points with an (admittedly older and
       | probably not best-practices) Airflow setup, I am now at a
       | different job running similar types of workflows on Temporal -
       | we're pretty happy with it so far, but haven't done anything
       | crazy with it.
        
         | matesz wrote:
          | I know airbyte.io (ELT platform) is built on top of Temporal,
          | but I haven't used it.
        
           | tomwheeler wrote:
           | Yes, Airbyte is using Temporal. Here is a blog post they
           | wrote a few weeks ago that goes into more detail about it:
           | https://airbyte.com/blog/scale-workflow-orchestration-
           | with-t...
        
         | Serow225 wrote:
         | I'd love to hear that too :)
        
           | tomwheeler wrote:
           | Hi, Tom from Temporal here. I don't have a lot of experience
           | with Apache Airflow personally, but I was at Cloudera when it
           | was added to our Data Engineering service, so I learned about
           | it at the time. Here are a few things that come to mind:
           | 
           | * Both Apache Airflow and Temporal are open source
           | 
           | * Both create workflows from code, but the approach is
           | different. With Airflow, you write some code and then
           | generate a DAG that Airflow can execute. With Temporal, your
           | code _is_ your workflow, which means you can use your
           | standard tools for testing, debugging, and managing your
           | code.
           | 
           | * With Airflow, you must write Python code. Temporal has SDKs
           | for several languages, including Go, Java, TypeScript, and
           | PHP. The Python SDK is already in beta and there's work
           | underway for a .NET SDK.
           | 
           | * Airflow is pretty focused on the data pipeline use case,
           | while Temporal is a more general solution for making code run
           | reliably in an unreliable world. You can certainly run data
           | pipeline workloads on Temporal, but those are a small
           | fraction of what developers are doing with Temporal (more
           | here: https://temporal.io/use-cases).
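            | 
            | To make the "your code is your workflow" point concrete,
            | here's a rough sketch with the beta Python SDK (details
            | may shift before it stabilizes):
            | 
            |     from datetime import timedelta
            |     from temporalio import activity, workflow
            | 
            |     @activity.defn
            |     async def load_partition(date: str) -> str:
            |         return f"loaded {date}"
            | 
            |     @workflow.defn
            |     class EtlWorkflow:
            |         @workflow.run
            |         async def run(self, date: str) -> str:
            |             # plain code; retries, timeouts, and history
            |             # are handled by the Temporal server
            |             return await workflow.execute_activity(
            |                 load_partition,
            |                 date,
            |                 start_to_close_timeout=timedelta(minutes=5),
            |             )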
        
             | claytonjy wrote:
             | Do you see Temporal as being a super-set of DAG managers
             | like Airflow/Dagster/Prefect, or do you see uses where
             | those tools would be a better choice than Temporal?
        
         | claytonjy wrote:
         | I'm also curious about this. The folks I hear about Temporal
         | from seem to be very disjoint from Airflow users, and
         | Temporal's python client is still alpha-stage.
         | 
          | It seems notable to me that the big Prefect rewrite mentioned
          | elsewhere [0] leans into the same "workflow" terminology that
          | Temporal uses. I have to wonder if Prefect saw Temporal as
          | superseding the DAG tools in coming years and this is them
          | trying to head that off.
         | 
         | That post's discussion of DAG vs workflow also sounds a _lot_
         | like why PyTorch was created and has seen so much success.
         | Tensorflow was static graphs, pytorch gave us dynamism.
         | 
         | [0] https://www.prefect.io/blog/announcing-prefect-orion/
        
       | encoderer wrote:
       | Is anybody out there doing anything interesting with Airflow
       | monitoring?
       | 
       | At my startup Cronitor we have an Airflow sdk[0] that makes it
       | pretty easy to provision monitoring for each DAG, but essentially
       | we are only monitoring that a DAG started on time and the total
       | time taken. I keep thinking about how we could improve this and
       | it would be great to hear about what's working well today for
       | monitoring.
       | 
       | [0] https://github.com/cronitorio/cronitor-airflow
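        | 
        | For anyone rolling their own, DAG-level callbacks are the
        | obvious hook; a rough sketch (the metrics client is
        | hypothetical):
        | 
        |     from datetime import datetime, timezone
        |     from airflow import DAG
        |     from airflow.operators.bash import BashOperator
        |     import my_metrics  # hypothetical stats client
        | 
        |     def report_duration(context):
        |         run = context["dag_run"]
        |         end = run.end_date or datetime.now(timezone.utc)
        |         elapsed = (end - run.start_date).total_seconds()
        |         my_metrics.gauge(
        |             f"airflow.{run.dag_id}.duration", elapsed)
        | 
        |     with DAG(
        |         dag_id="orders_etl",
        |         schedule_interval="@hourly",
        |         start_date=datetime(2022, 1, 1),
        |         catchup=False,
        |         on_success_callback=report_duration,
        |         on_failure_callback=lambda ctx: my_metrics.incr(
        |             "airflow.orders_etl.failed"),
        |     ) as dag:
        |         BashOperator(task_id="run", bash_command="echo run")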
        
         | rozhok wrote:
          | I'm working at https://databand.ai -- a full-fledged solution
          | for Apache Airflow monitoring, data observability, and
          | lineage. We have Airflow sync, integrations with
          | Spark/Databricks/EMR/Dataproc/Snowflake, configurable alerts,
          | dashboards, and much more. Check it out.
        
       | AtlasBarfed wrote:
       | ... the service seems to be centrally managed. A lot of the pain
       | points are clearly "everyone running in the same instance" or
       | kind of similar. Sure makes for big brag points in the numbers.
       | 
       | Sounds like basic SaaS needs to be provided as a capability,
       | while the teams spin up their instances and shard to their needs.
       | 
       | One of the problems with enterprise workflows is putting
       | everything together. Workflows are already cacophonous. A
       | cacophony of cacophonies is madness.
        
       | taude wrote:
        | Tangential to this thread: what sites, sources, etc. do people
        | who work on modern data pipelines (engineering and analysts) go
        | to to follow the latest news, products, techniques, etc.? It's
        | been hard to keep up without having Meetups and such the last
        | couple years. I'm finding a lot of people's comments here
        | pretty interesting, and they're showing me things I haven't
        | heard of. Thanks.
        
         | kderbe wrote:
         | I follow the Analytics Engineering Roundup weekly email. It's
         | published by dbt Labs but isn't overtly promotional.
         | 
         | https://roundup.getdbt.com/
        
           | taude wrote:
           | Thanks. We're starting to use DBT, too. I know the forums
           | over at DBT are pretty good, too.
        
         | mcnnowak wrote:
         | I'm also interested in this topic, but can't find anything
         | other than "Top 10 things you should STOP doing as a data
         | engineer" etc. content-mill, clickbait on Medium and other
         | sites.
        
           | taude wrote:
           | Yes, this. I'd like to get less of the "Marketing sales
           | stuff", and more in the trenches with the actual engineering
           | teams.
        
         | blakeburch wrote:
         | I've had really great success from engaging with the Locally
         | Optimistic Slack community.
         | 
         | Also, Cristophe Blefari has an excellent data newsletter.
         | https://www.blef.fr/
         | 
         | And Modern Data Stack has a newsletter, tool information, Q&A
         | www.moderndatastack.xyz
        
         | jonpon wrote:
         | Data Twitter and Linkedin are great, there are a lot of people
         | putting out some really good content. There are also a lot of
         | substacks you can sign up for. Data Engineering Weekly is my
         | fave
        
         | dtjohnnyb wrote:
         | Slack groups have filled in the meetup space in my life,
         | mlops.community and locally optimistic are two of the best for
         | what it sounds like you're looking for
        
       | tinco wrote:
       | What sort of workflows do you run in Apache Airflow? Are they
       | automating interactions with partners/clients or internal
       | communications? How can it become so scaled up that they (and
       | many people in the comments here as well) have trouble managing
       | the hardware? How can it become so complex that the workflows
       | need to be expressed in DAG's? What's a workflow?
       | 
        | I don't think I ever worked anywhere that had automated
        | workflows, though I have only worked for small startups so far.
        
       | ldjkfkdsjnv wrote:
        | Unless you have extremely complex dependency graphs, I really
        | don't think Airflow is worth it. It's very easy to end up
        | essentially writing an "orchestrator" using Airflow, since it
        | allows for very flexible low-level operations. The added
        | complexity has minimal benefit, and, like Apache Spark, what
        | looks simple becomes hard to reason about in real-world
        | scenarios. You need to understand how it works under the hood
        | and get the best practices right.
       | 
       | As mentioned elsewhere, AWS step functions are really the best in
       | orchestration.
        
         | arinlen wrote:
         | > _As mentioned elsewhere, AWS step functions are really the
         | best in orchestration._
         | 
         | AWS Step Functions is a proprietary service provided
         | exclusively by AWS, which reacts to events from AWS services
         | and calls AWS Lambdas.
         | 
         | Unless you're already neck-deep in AWS, and are already
         | comfortable paying through the nose for trivial things you can
         | run yourself for free, it's hardly appropriate to even bring up
         | AWS Step Functions as a valid alternative. For instance,
         | Shopify's articles explicitly mention they are running their
         | services in Google Cloud. Would it be appropriate to tell them
         | to just migrate their whole services to AWS just because you
         | like AWS Step Functions?
        
           | Jugurtha wrote:
            | That was one of the reasons we do "bring your own compute"
            | with https://iko.ai so people who already have a billing
            | account on AWS, GCP, Azure, or DigitalOcean can just get
            | the config for their Kubernetes clusters, link them to
            | iko.ai, and their machine learning workloads will run on
            | whichever cluster they select.
           | 
           | If you get a good deal from one cloud provider, you can get
           | started quickly.
           | 
           | It's useful even for individuals such as students who get
           | free credits from these providers: create a cluster and
           | you're up and running in no time.
           | 
            | Our rationale was that we didn't want to be tied to one
            | cloud provider.
        
           | parsnips wrote:
            | https://github.com/checkr/states-language-cadence allows
            | you to define workflows in States Language over Cadence.
        
           | literallyWTF wrote:
           | This is another symptom of a person who doesn't know what
           | they're talking about really.
           | 
           | It's like those stackoverflow answers that tell the user to
           | stop using PHP and rewrite it in Python or something.
        
           | riku_iki wrote:
           | > already comfortable paying through the nose for trivial
           | things you can run yourself for free
           | 
            | But a fault-tolerant workflow engine is not a trivial
            | thing; it may cost you many engineering hours to build,
            | monitor, and maintain, so outsourcing it to someone else is
            | a totally viable solution.
        
             | arinlen wrote:
             | > _But fault tolerant workflow engine is not trivial
             | thing,_
             | 
             | The complexity and risk of migrating cloud providers
             | eclipses whatever problem you assign to "fault tolerant
             | workflow engines".
             | 
             | Any mention of AWS Step Functions makes absolutely no sense
             | at all and reads at best like a non-sequitur.
        
               | thorum wrote:
                | I read it as a comment on the UX / developer
                | experience, which can be superior with Step Functions
                | vs the competition, regardless of whether Step
                | Functions is an appropriate (or even physically
                | possible) option for non-AWS projects.
        
         | nojito wrote:
         | I'll never understand why individuals always default to cloud
         | offerings when they are extremely expensive compared to a
         | dedicated tool.
        
           | wussboy wrote:
           | I don't need to manage the cloud offering and that management
           | time is expensive. Your befuddlement at this simple economic
           | calculation is, well, befuddling.
        
           | mywittyname wrote:
           | It's easy to get started and you don't need to worry about
           | infra.
           | 
           | I've been a one-man army at places because leveraging these
           | cloud offerings allows me to crank out working software that
           | scales to the moon without much thought.
           | 
           | I'd rather pay AWS/GCP to handle infra, so that I can get
           | 2-3x as many project done.
        
             | nojito wrote:
              | None of the Airflow problems in this thread are due to
              | infrastructure, so how does using a cloud service solve
              | anything?
        
           | benjamoon wrote:
           | It's easy to understand when you have lots of money, but no
           | time. Cloud is simple and expensive, self managed is complex
           | and cheap. Time's money and all that!
        
             | nojito wrote:
             | Except for this workflow.
             | 
             | You won't get around any of the problems of airflow by
             | moving to a cloud offering.
        
           | 0xbadcafebee wrote:
           | Why hire an expensive janitorial service to clean your
           | office? Why hire a mechanic to fix your car?
        
             | serial_dev wrote:
             | It's shocking that some people cannot fathom that in
             | certain scenarios cloud offerings make sense.
             | 
             | They don't always make sense, in certain scenarios it is
             | worth taking an open source, cloud independent tool, in
             | some scenarios you can roll your own, but there are
             | circumstances where it's a good choice using a tool your
             | cloud provider gives you.
        
           | literallyWTF wrote:
           | Because they don't know what they're doing and aren't the
           | ones paying the bill.
           | 
           | "Oh I have to learn how to use and setup this tool? I think
           | I'll just pay the equivalent salaries and be locked in..."
        
             | waynesonfire wrote:
             | You're going to pay the bill regardless, whether the
             | employee is hired by your team or hired by the cloud
             | vendor.
             | 
             | I don't know how management works through this math, maybe
             | managing people gets exhausting and they just want to out-
             | source it so leadership doesn't have to deal with it and
             | then they can just focus on the core product.
             | 
             | And the above "I don't want to deal with it" reason isn't
             | spoken of, the more more commonly touted benefit is cloud's
             | "flexibility". Sure, but this is actually _really_
             | expensive. Every cloud migration effort I've experienced is
             | only just worthwhile to begin to talk about because the
             | costs are based on long-term contracts of cloud resources,
             | not the per-hour fees. Nice flexibility.
             | 
             | With that said, the cloud may be a good place for
             | prototyping where the infrastructure isn't the core value
             | add and it's uncertain. A start-up is a prototype and so
              | here we are. But for an established company to migrate to
              | the cloud and fire the staff that's maintaining the on-
              | premise resources... I'm skeptical. More than likely, this
              | leads to maintaining both cloud and on-premise resources,
              | not firing anyone, and thus actually increasing costs for
              | an uncomfortably long time.
             | 
             | And for the folks on the ground, who don't pay the
             | bills, the increase in accidental complexity is rather
             | painful.
        
             | saimiam wrote:
             | I'm very much paying my own cloud bills but there is no
             | chance I would be able to orchestrate some of the workflows
             | I want to orchestrate if it were not for Step Functions.
             | 
             | For a one person shop like me, AWS is a force multiplier.
             | With it, I can do (say) 30% of what a dedicated engineer in
             | a specific role could do. Without it, I'd be doing 0%.
             | 
             | I really like this tradeoff for my particular situation.
        
               | [deleted]
        
           | taude wrote:
           | Fast time to market with a fraction of the effort?
        
         | pyrophane wrote:
         | > As mentioned elsewhere, AWS step functions are really the
         | best in orchestration.
         | 
         | Why? Where else is this mentioned?
        
         | tootie wrote:
         | I read some of these massively complex data architecture
         | posts and I almost always come away asking "What the hell is
         | this for?" I know Shopify is a huge business but I see this
         | kind of engineering complexity and all I think is it has to
         | cost tens of millions to build and operate and how could they
         | possibly be getting ROI. There are ten boxes on that diagram
         | and none of them have a user interface for anyone except other
         | developers.
        
           | generalpf wrote:
           | A lot of times this is used for data warehousing so product
           | managers and others can query the database of one app
           | joined with another, especially in an environment with
           | microservices. You might join a table containing orders with
           | another table that was from a totally different DB, like
           | payments, to find out which kinds of items are best to offer
           | BNPL or something.
           | 
           | The author also mentions that it's used for machine learning
           | models which will ultimately feed back into Shopify's front
           | end, for instance.
        
         | thefourthchime wrote:
         | > AWS step functions are really the best in orchestration.
         | 
         | At our company, AWS Step Functions is a disaster. You're
         | effectively writing code in JSON/YAML. Anything beyond very
         | simple steps becomes two pages of YAML that's very hard to
         | read or write. There is no way to debug, and polling is
         | mostly unusable. Changes need to be deployed with
         | CloudFormation, which can take forever or, worse, hang.
         | 
         | It's one of the most annoying technologies I've used in my
         | 20+ years of engineering.
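         | 
         | To make the verbosity concrete, here is a minimal sketch of
         | a two-step state machine in Amazon States Language (the
         | Lambda ARNs are made-up placeholders):
         | 
         |   {
         |     "StartAt": "Extract",
         |     "States": {
         |       "Extract": {
         |         "Type": "Task",
         |         "Resource": "arn:aws:lambda:us-east-1:111122223333:function:extract",
         |         "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
         |         "Next": "Load"
         |       },
         |       "Load": {
         |         "Type": "Task",
         |         "Resource": "arn:aws:lambda:us-east-1:111122223333:function:load",
         |         "End": true
         |       }
         |     }
         |   }
         | 
         | Two trivial tasks and one retry policy already take ~20
         | lines; add Choice states and error handlers and it balloons
         | fast.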
        
         | fmakunbound wrote:
         | Definitely this.
         | 
         | Our teams have many 1- or 2-step DAGs that are idempotent.
         | They could have been Lambdas; they're already pulling from
         | SQS. It could be just my misfortune, but in AWS, MWAA is
         | kind of janky. It's difficult to track down problems in the
         | logs (task failures look fine there) and the Airflow UI is
         | randomly unavailable ("document returns empty", "connection
         | reset" kind of things).
        
           | mywittyname wrote:
           | Lambdas have resource constraints that Airflow DAGs don't.
           | Most notably, Airflow DAGs can run for any arbitrary length
           | of time. And the local storage attached to the Airflow
           | cluster is actual disk space, and not just a fake in-memory
           | disk, making it possible to process files larger than the
           | amount of memory allocated to the DAG.
           | 
           | There's certainly some functionality overlap, but I don't see
           | Lambda and Airflow as competitors. Each has capabilities that
           | the other doesn't.
        
             | xtracto wrote:
             | I remember reading that you can attach EFS to lambdas. That
             | would solve some of the storage issues.
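             | 
             | A sketch of wiring that up with boto3 (the function
             | name and access-point ARN are made-up placeholders):
             | 
             |   import boto3
             | 
             |   client = boto3.client("lambda")
             |   client.update_function_configuration(
             |       FunctionName="my-etl-fn",
             |       FileSystemConfigs=[{
             |           # EFS access point mounted at runtime
             |           "Arn": "arn:aws:elasticfilesystem:us-east-1:"
             |                  "111122223333:access-point/fsap-0abc123",
             |           "LocalMountPath": "/mnt/data",
             |       }],
             |   )
             | 
             | Note the function must also be attached to the VPC that
             | hosts the filesystem for the mount to work.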
        
       | byteflip wrote:
       | So interesting, a lot of comments seem to be negative
       | experiences. I haven't used Airflow at scale yet but would love
       | to convert our extremely limited, internally built orchestrator +
       | jobs over to Airflow. I think it would allow us TO scale, at
       | least for some time. I think a lot of companies are still really
       | behind the times. Our DAGs are fairly simple, and Airflow has
       | been a major improvement in my testing. The UI is great for
       | helping me debug jobs / monitoring feed health / backfilling. DAG
       | writing has been a bit frustrating but is a much improved format
       | over the internal systems we have. Am I just naive? Is everyone
       | writing extremely complex graphs? Is this operational complexity
       | due mostly to K8s (I've just been playing with Celery)? Anyone
       | enjoying using Airflow?
        
       | jonpon wrote:
       | The problems in this article and in the comments are some of the
       | stuff we have heard at Magniv in the past few months when
       | talking to data practitioners. We are focused on solving
       | some subset of these problems.
       | 
       | Personally, I think Airflow is currently being un-bundled and
       | will continue to be with more task-specific tools.
       | 
       | At the very least, if un-bundling doesn't occur, Prefect and
       | Dagster are working hard to solve lots of these issues with
       | Airflow.
       | 
       | Evolution of products and engineering practices is not linear
       | and sometimes doesn't even make sense when viewed a posteriori
       | (as much as I would like it to follow some logical process).
       | It will be interesting to see how this space develops in the
       | next year or so.
        
       | jimmytucson wrote:
       | I've used Airflow for a few years and here's what I don't like
       | about it:
       | 
       | - Configuration as code. Configuration should be a way to change
       | an application's behavior _without_ changing the code. Make me
       | write a workflow as JSON or XML. If I need a for-loop, I'll write
       | my own script to generate the JSON.
       | 
       | - It's complicated. You almost need a dedicated Airflow expert to
       | handle minor version upgrades or figure out why a task isn't
       | running when you think it should.
       | 
       | - Operators often just add an API layer on top of existing
       | ones. For example, to start a transaction on Spanner, Google has
       | a Python SDK with methods to call their API. But with Airflow,
       | you need to figure out what _Airflow_ operator and method wraps
       | the Google SDK method you're trying to call. Sometimes the
       | operator author makes "helpful" (opinionated) changes that
       | refactor or rewrite the native API.
       | 
       | I would love a framework that just orchestrates tasks (defined as
       | a command + an image) according to a schedule, or based on the
       | outcome of other tasks, and gives me a UI to view those outcomes
       | and restart tasks, etc. And as configuration, not code!
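       | 
       | Purely as illustration, a hypothetical config for that kind
       | of tool (every field name here is invented) could be as
       | simple as:
       | 
       |   {
       |     "schedule": "0 6 * * *",
       |     "tasks": {
       |       "extract": {
       |         "image": "acme/extract:1.4",
       |         "command": ["extract", "--date", "$RUN_DATE"]
       |       },
       |       "load": {
       |         "image": "acme/load:2.1",
       |         "command": ["load"],
       |         "depends_on": ["extract"]
       |       }
       |     }
       |   }
       | 
       | If I need 50 variations of this, a ten-line script can emit
       | them, and the workflow definition itself stays declarative.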
        
         | atombender wrote:
         | What you're asking for is basically Argo Workflows, I think.
         | 
         | Not that I recommend it. It's quite lovely in principle, but
         | really flawed in practice. It's YAML hell on top of Kubernetes
         | hell (and I say that as someone who loves Kubernetes and uses
         | it for everything every day).
         | 
         | Having worked with some of these tools, what I've started to
         | wish for is a system where pipelines are written in just plain
         | code. I'd like to run and debug my pipeline as a normal,
         | compiled program that I can run on my own machine using the
         | tools I already use to build software, including things like
         | debuggers and unit testing tools. Then, when I'm ready to put
         | it into production, I want a super scalable scheduler to take
         | my program and run it across dozens of autoscaling nodes in
         | Kubernetes or whatever.
         | 
         | The only thing I've come across that uses this model is
         | Temporal, but it's got a rather different execution model than
         | a straightforward pipeline scheduler.
        
         | ricklamers wrote:
         | The flexibility of code as configuration is indeed somewhat of
         | a footgun at times. That's why with Orchest we went with a
         | declarative JSON config approach.
         | 
         | We take inspiration from the Kubeflow project and run tasks as
         | containers. With a GUI for editing pipelines and managing
         | scheduled runs we come pretty close to what you're asking for
         | (bring an image and run a command). And it's OSS, of course.
         | 
         | https://github.com/orchest/orchest
        
         | KptMarchewa wrote:
         | >If I need a for-loop, I'll write my own script to generate the
         | JSON.
         | 
         | That's how you end up with an extreme mess in the logs, UI,
         | and metrics.
         | 
         | > But with Airflow, you need to figure out what _Airflow_
         | operator and method wraps the Google SDK method you're trying
         | to call.
         | 
         | Or you can use PythonOperator with hooks, which generally
         | integrate external APIs with Airflow's connection system.
         | 
         | https://github.com/apache/airflow/blob/main/airflow/provider...
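         | 
         | As a minimal sketch of that pattern (the connection ID,
         | bucket, and paths are made up; this assumes the Amazon
         | provider package is installed):
         | 
         |   from airflow.operators.python import PythonOperator
         |   from airflow.providers.amazon.aws.hooks.s3 import S3Hook
         | 
         |   def upload_report(**context):
         |       # The hook pulls credentials from the Airflow
         |       # connection, so no secrets live in the DAG code.
         |       hook = S3Hook(aws_conn_id="aws_default")
         |       hook.load_file(
         |           filename="/tmp/report.csv",
         |           key="reports/report.csv",
         |           bucket_name="my-bucket",
         |           replace=True,
         |       )
         | 
         |   # Registered inside a `with DAG(...)` block:
         |   upload = PythonOperator(
         |       task_id="upload_report",
         |       python_callable=upload_report,
         |   )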
         | 
         | I think the bigger problem with Airflow's Operator concept is
         | N*M problem of integrating multiple systems. That's how you
         | end up with GoogleCloudStorageToS3Operator and stuff like
         | that.
        
       | rubenfiszel wrote:
       | If your flow is more linear-looking than a complex DAG and you
       | want a full-featured web editor (with LSP), automatic
       | dependency handling, and TypeScript (Deno) and Python support,
       | I am building an OSS, self-hostable airflow/airplane
       | alternative at: https://github.com/windmill-labs/windmill
       | 
       | You write the modules as normal python/deno scripts; we infer
       | the inputs by statically analyzing your script parameters and
       | take care of the rest. You can also reuse modules made by the
       | community (we're building the script hub atm).
        
         | slig wrote:
         | Thank you! Exactly what I was looking for.
        
       | AtlasBarfed wrote:
       | Isn't there an Uber workflow product? One that also scales on
       | top of Cassandra?
        
         | tomwheeler wrote:
         | You're probably thinking of Temporal (https://temporal.io/),
         | which is a fork of the Cadence project originally developed at
         | Uber.
        
       | 8589934591 wrote:
       | I echo other comments. Running and managing Airflow beyond simple
       | jobs is complicated. But if you are only running and managing
       | Airflow for simpler jobs, then you might not need Airflow.
       | 
       | One data center company that I know of uses airflow at scale with
       | docker and k8s. They have a huge team of devops just to manage
       | the orchestrator. They in turn have to fine-tune the orchestrator
       | to run smoothly and efficiently. Similar to what shopify has
       | noted here, they have built on top of and extended airflow to
       | take care of pain points like point 4. For companies like this it
       | makes sense to run airflow.
       | 
       | Another issue I see with companies/engineers who adopt airflow
       | is that they use it as a substitute for a script rather than
       | as an orchestrator.
       | For example, say you want to download files from an API, upload
       | to s3, load it to your warehouse (say snowflake) and do some
       | transformations to get your final table - instead of writing
       | separate scripts for each step of fetch/upload/ingest/transform
       | and call each step from the dag, they end up writing everything
       | as a task in a dag. A huge disadvantage is there is a lot of code
       | duplication. If you had a script as a CLI, all your dag/task has
       | to do is call the script with the respective args. I agree that
       | airflow comes with a lot of convenience wrappers to create tasks
       | for many things but I feel this results in losing flexibility.
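       | 
       | As a rough sketch, the dag then reduces to thin wrappers
       | around that CLI (the `etl` command here is hypothetical):
       | 
       |   from datetime import datetime
       |   from airflow import DAG
       |   from airflow.operators.bash import BashOperator
       | 
       |   with DAG(
       |       dag_id="api_to_snowflake",
       |       start_date=datetime(2022, 1, 1),
       |       schedule_interval="@daily",
       |   ) as dag:
       |       # Each step is an independently testable CLI call.
       |       fetch = BashOperator(
       |           task_id="fetch",
       |           bash_command="etl fetch --date {{ ds }}")
       |       upload = BashOperator(
       |           task_id="upload",
       |           bash_command="etl upload-s3 --date {{ ds }}")
       |       ingest = BashOperator(
       |           task_id="ingest",
       |           bash_command="etl load-snowflake --date {{ ds }}")
       |       transform = BashOperator(
       |           task_id="transform",
       |           bash_command="etl transform --date {{ ds }}")
       | 
       |       fetch >> upload >> ingest >> transform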
       | 
       | This also results in them tying their workflow to airflow: for
       | any change they need, they have to modify their airflow code
       | directly. If you want to modify how/what you upload to s3, you
       | end up writing/modifying python functions in the respective dags'
       | code. This removes the flexibility to modify/substitute any
       | component of the workflow with something else or even change the
       | orchestrator from airflow to something else. Additionally,
       | different teams might write workflows in different ways -
       | standardization of practice is really hard. This in turn results
       | in pouring more investments to maintaining and hiring "airflow
       | data engineers". Companies fall into steep tech debts.
       | 
       | Prefect/dagster are new orchestrators in town. I'm yet to try
       | them out but I've heard mixed reviews about them.
       | 
       | EDIT: Forgot about upgrades. A lot of upgrades are breaking
       | changes, especially the recent change from 1->2. You end up
       | spending a lot of time just trying to debug what went wrong.
       | Just installing and running it is a pain.
        
         | blakeburch wrote:
         | Love your observation about tying the workflow to Airflow.
         | 
         | One of my biggest annoyances in the orchestration space is that
         | teams are mixing business logic with platform logic, while
         | still touting "lack of vendor lock-in" because it's open
         | source. At the point that you're importing Airflow specific
         | operators into your script and changing the underlying code to
         | make sure it works for the platform (XCom, task decorators,
         | etc.), you are directly locking yourself in and making edits
         | down the road even more difficult.
         | 
         | While some of the other players do a better job, their method
         | of "code as workflow" still results in the same problems, where
         | workflows get built as a "mega-script" instead of as modular
         | components.
         | 
         | I'm a co-founder at Shipyard, a light-weight hosted
         | orchestrator for data teams. One of our core principles is
         | "Your code should run the same locally as it does on our
         | platform". That means 0 changes to your code.
         | 
         | You can define the workflow in a drag and drop editor or with
         | YAML. Each task is its own independent script. At runtime, we
         | automatically containerize each task and spin up ephemeral file
         | storage for the workflow, letting you run scripts one after
         | the other, each in their own virtual environment, while still
         | sharing generated files as if you were running them on your
         | local machine. In practice, that means that individual tasks
         | can be updated (in app or through GitHub sync) without having
         | to touch the entire workflow.
         | 
         | I'm biased, but it seems crazy to me that so many engineers are
         | willing to spend hours fighting the configuration of their
         | orchestration platform rather than focusing on solving the
         | problems at hand with code.
        
         | rockostrich wrote:
         | We've established a rule that all "custom" code (anything that
         | isn't a preexisting operator in airflow) needs to be contained
         | in a docker image and run through the k8s pod operator. The
         | result is that most folks do exactly what you said. They
         | create a
         | repo with a simple CLI that runs a script and the only thing
         | that gets put in our airflow repo is the dependency
         | graph/configuration for the k8s jobs.
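         | 
         | In practice a DAG entry then ends up as little more than
         | this sketch (image name, namespace, and args are made up):
         | 
         |   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
         |       KubernetesPodOperator,
         |   )
         | 
         |   score_users = KubernetesPodOperator(
         |       task_id="score_users",
         |       name="score-users",
         |       namespace="airflow-jobs",
         |       image="registry.internal/score-users:1.7.0",
         |       # The image's own CLI does the real work; airflow
         |       # only schedules it and tracks the exit code.
         |       cmds=["score-users"],
         |       arguments=["--date", "{{ ds }}"],
         |       get_logs=True,
         |   )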
        
           | claytonjy wrote:
           | AFAICT this is the now-recommended way to use Airflow: as a
           | k8s task orchestrator. Even the Astronomer team (which
           | employs many of the core Airflow maintainers) will tell you
           | to do it this way.
        
       | idomi wrote:
       | When it comes to scale and DS work, I'd use the Ploomber open-
       | source library (https://github.com/ploomber/ploomber). It
       | allows an easy
       | transition between dev and production, incrementally building the
       | DAG so you avoid expensive compute time and costs. It's easier to
       | maintain and integrates seamlessly with Airflow, generating the
       | DAGs for you.
        
       | dekhn wrote:
       | I tried to run airflow; I found pretty much everything about it
       | to be wrong for my usecase. Why can't I easily upload a workflow
       | through the UI? Why doesn't it handle S3 file staging for me?
        
         | jonpon wrote:
         | Would love to hear more about your use case and your issues --
         | you can sign up on our website (magniv) or send me an email jon
         | at our domain.
        
         | hatware wrote:
         | It definitely takes some time getting used to the quirks of
         | Airflow. I know it took 6 months of running it at my last gig
         | to really understand what was happening underneath the UI.
         | 
         | With great control comes great responsibility.
        
           | dekhn wrote:
           | actually I concluded it was just not that great a workflow
           | engine. It's probably just intended for a different use case
           | than mine.
        
       | ashtonbaker wrote:
       | we run airflow with ... considerably more dags than this. our
       | main "lesson learned" is that airflow should not be used "at
       | scale".
        
       | blakeburch wrote:
       | Lots of the comments here describe personal experiences of
       | complexity and frustration with Airflow, but I'd venture to
       | say that's true of most data orchestration tools. In fact,
       | that sort of feedback is so consistent that I'm half tempted to
       | start a podcast for "orchestration horror stories" (contact if
       | interested).
       | 
       | What I've found while building out Shipyard, a hosted lightweight
       | orchestration platform, is that teams want something that "just
       | works". Servers that "just scale". Observability that doesn't
       | require digging. Notifications and retries that work
       | automatically. Workflows that don't mix business logic with
       | platform logic. Code and workflows that sync with git. Deployment
       | that only takes a few minutes.
       | 
       | For the straightforward use cases, where you need to run tasks A
       | -> G daily, with a bit of branching logic, Airflow is overkill.
       | Yes, Airflow has a lot of great complex functionality that can
       | help you down the road. But Airflow keeps getting suggested to
       | everyone even if it's not best suited to their use case,
       | resulting in lots of lost time and engineering overhead.
       | 
       | While I definitely have bias, there are a lot of other high
       | quality alternatives out there to explore nowadays!
        
       ___________________________________________________________________
       (page generated 2022-05-23 23:00 UTC)