[HN Gopher] Nextflow: Data-Driven Computational Pipelines
       ___________________________________________________________________
        
       Author : brianzelip
       Score  : 58 points
       Date   : 2023-08-08 14:03 UTC (2 days ago)
        
 (HTM) web link (nextflow.io)
 (TXT) w3m dump (nextflow.io)
        
       | anyoneamous wrote:
       | It's kind of a shame this is based on Groovy, rather than Python
       | which is much more familiar to people in the HCLS space. I've
       | always been stuck on the fence between wanting to use NF (since
       | it's the most popular) and Snakemake (which feels like less of an
       | oddity development-wise).
        
         | bafe wrote:
         | Perhaps it is not as popular, but I found the groovy syntax
         | ideal for a DSL like that
        
       | dekhn wrote:
       | In theory I want to like Nextflow, but my main criticism is that
       | it's really, really hard to debug programs that pass around lists
       | of Promises (nextflow's dag uses promises as handles on future
       | computations, and functions receive promises and can't easily
       | materialize them and print them).
       | 
       | The caching is often more trouble than it's worth. Also, the
       | little bash scripts that integrate with AWS break if your AWS
       | environment isn't vanilla (our enterprise AWS has a lot of
       | restrictions).
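
A minimal sketch of the debugging problem described above, using Python's standard `concurrent.futures` as a stand-in. This is an analogy only, not how Nextflow implements its channels; `align`, `describe`, and `materialize` are hypothetical names. A function that receives futures sees opaque handles until something blocks on them.

```python
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor


def align(sample: str) -> str:
    """Stand-in for a real pipeline step (hypothetical)."""
    return f"{sample}.bam"


def describe(handles: list[Future]) -> list[str]:
    """What a function sees when it is handed futures instead of values."""
    return [repr(h) for h in handles]


def materialize(handles: list[Future]) -> list[str]:
    """Block until every future resolves, then return the real values."""
    return [h.result() for h in handles]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        handles = [pool.submit(align, s) for s in ("s1", "s2")]
        print(describe(handles))     # opaque handle reprs, not file names
        print(materialize(handles))  # the actual values, after blocking
```

Printing the handles tells you little about the data; only an explicit blocking call such as `result()` yields something inspectable, which is roughly the friction the comment describes.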
        
         | nonrepeating wrote:
         | Using Tower for Nextflow can help streamline it on AWS, it's
         | pretty powerful (but costs money for anything beyond trivial
         | use cases): tower.nf
        
       | adolph wrote:
       | Another day, another workflow DSL.
       | 
       | * looks like yaml
       | * has curly braces to look programmery
       | * whitespace might be meaningful
       | * has pipes like a bash script
       | 
       | https://www.commonwl.org/
       | 
       | https://github.com/common-workflow-language/common-workflow-...
       | 
       | mea culpa: The above was based on a first look at something
       | titled "A DSL for parallel and scalable computational
       | pipelines" as opposed to "Java workflow manager with Groovy
       | language scripting." The presented screenshot still looks to me
       | like an unholy union of yaml, js, py and sh. If that sounds
       | groovy to you; have fun.
        
         | esafak wrote:
         | In all fairness, they predate the competition (2013):
         | https://github.com/nextflow-io/nextflow/releases?page=25
        
           | geoffjentry wrote:
           | As GP referenced CWL: while NF had appeared first, in terms of
           | the bioinformatics world Nextflow, CWL, Snakemake, and WDL
           | all erupted close enough to each other to be equal-ish. The
           | people were aware of each other but they were all so nascent
           | that it wasn't clear if it was worth joining in or not. At
           | the end of the day these all came from groups trying to
           | scratch particular itches, and not everyone agreed on the
           | right way to scratch.
           | 
           | However all of them were rejections of prior models as well
           | as the workflow solutions prominent in the business space.
        
             | adolph wrote:
             | Yeah, the thing that I find disappointing is that there is
             | a lot of science value locked into the different systems of
             | describing a workflow, pipeline or DAG. Like you said, they
             | all had different itches to scratch and even some barebones
             | "standards" like csv have flavors/extensions/etc.
        
             | bafe wrote:
             | They try to address similar problems, but comparing
             | snakemake and nextflow doesn't do either tool a favour.
             | They use different computation models: nextflow is based on
             | dataflow programming and therefore schedules processes
             | dynamically as new data comes in, while snakemake is pull-
             | based and schedules the processes based on the dag defined
             | by the dependencies. Anyhow they are both great tools.
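
The push/pull distinction above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the scheduler of either tool: `run_push`, `run_pull`, the stage names, and the file names are all hypothetical.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


def run_push(items: list[str], stages: list[tuple[str, Callable[[str], str]]]) -> list[str]:
    """Dataflow style: each arriving item is pushed on to the next stage."""
    log = []
    queue = deque((0, item) for item in items)
    while queue:
        stage_idx, value = queue.popleft()
        if stage_idx == len(stages):
            continue  # item has left the last stage
        name, fn = stages[stage_idx]
        log.append(f"{name}({value})")
        queue.append((stage_idx + 1, fn(value)))
    return log


def run_pull(target: str, rules: dict[str, list[str]]) -> list[str]:
    """Pull style: start from the requested target and walk back through its dependencies."""
    order: list[str] = []
    done: set[str] = set()

    def build(t: str) -> None:
        if t in done:
            return
        for dep in rules.get(t, []):
            build(dep)
        done.add(t)
        order.append(t)

    build(target)
    return order
```

In the push version nothing needs to know the final outputs in advance; work is scheduled whenever a value shows up. In the pull version the whole plan is derived from the requested target before anything runs.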
        
           | mbreese wrote:
           | In fairness, this is an old problem with many other
           | contenders. This issue is as old as batch schedulers. FWIW, I
           | was at an ISMB conference in 2005 that had at least 2-3
           | workflow managers presented.
        
         | bafe wrote:
         | If you refer to nextflow, the syntax is basically groovy
        
           | adolph wrote:
           | Interesting. From the docs:
           | 
           |  _The Nextflow scripting language is an extension of the
           | Groovy programming language. Groovy is a powerful programming
           | language for the Java virtual machine. The Nextflow syntax
           | has been specialized to ease the writing of computational
           | pipelines in a declarative manner._
           | 
           | https://www.nextflow.io/docs/latest/script.html?highlight=gr.
           | ..
        
           | notQuiteEither wrote:
           | Indeed, it's a fully functional scripting language that is an
           | extension of groovy. So "looks programmery" is more than a
           | little reductive.
           | 
           | Edit: functional as in, it's not half-hearted, not functional
           | as in functional programming.
        
       | firecraker wrote:
       | So my question to the non-bioinformaticians - is this already a
       | solved problem?
       | 
       | You have tasks which require resources based on the input
       | parameters, these are run in docker containers to ensure the
       | environment and you want to track the output of each step. Often
       | these are embarrassingly parallel operations (e.g. I have 200
       | samples to do the same thing on).
       | 
       | Something like dask perhaps, but can specify a docker image for
       | the task?
       | 
       | What is the go-to in DevOps for similar tasks? GitHub actions
       | comes pretty close...
       | 
       | To the bioinformaticians: what is the unique selling point of
       | Nextflow over, say, WDL/Cromwell?
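
The shape of the problem described above (per-task resources derived from the input, one container image per task, the same step fanned out over many samples) might be sketched as follows. Everything here is an assumption for illustration: the `Task` fields, the image name `example/aligner:1.0`, and the sizing heuristic are made up, and `run_task` only pretends to launch a container.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    sample: str
    image: str       # container image the step would run in (hypothetical)
    cpus: int
    memory_gb: int


def plan_task(sample: str, input_gb: float) -> Task:
    """Derive resource requests from the input size (made-up heuristic)."""
    memory_gb = max(4, int(input_gb * 2))
    cpus = 8 if input_gb > 10 else 2
    return Task(sample, "example/aligner:1.0", cpus, memory_gb)


def run_task(task: Task) -> str:
    """Stand-in for launching the container; returns the output path it would write."""
    return f"results/{task.sample}.out"


def run_all(inputs: dict[str, float], max_parallel: int = 8) -> dict[str, str]:
    """Fan the same step out over every sample and record each output."""
    tasks = [plan_task(sample, gb) for sample, gb in inputs.items()]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        outputs = list(pool.map(run_task, tasks))
    return {task.sample: out for task, out in zip(tasks, outputs)}
```

A real workflow manager adds what this sketch leaves out: submitting to a cluster or cloud backend, retrying failures, and caching finished steps.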
        
       | radus wrote:
       | I've considered using Nextflow for bioinformatics pipelines but
       | have yet to take the plunge.
       | 
       | At work, I develop a proteomics pipeline that is composed of
       | huey [1] tasks (Python library; simple alternative to Celery) which
       | either use subprocess to call out to some external tool, or are
       | just pure python. It runs in a worker container which is managed
       | by Docker swarm, and all containers pull jobs from redis. For our
       | scale, it works great. However, I don't have control over the
       | resource utilization of individual steps, and in the past I've
       | had issues with the pipeline blocking as a result of how I was
       | chaining tasks together. I think something like Nextflow would
       | remove these limitations, but one thing I think I would miss is
       | the ability to debug individual pipeline steps locally with an
       | interactive debugger. As far as I can tell, Nextflow has
       | logging/tracing facilities but nothing quite like an interactive
       | debugger. I'd be happy to be told I'm wrong, or even that I'm
       | doing it wrong.
       | 
       | Other reasons I'd like to start using Nextflow:
       | 
       | - my homebrew pipeline would be easier to setup/share
       | 
       | - there are some efforts in the proteomics community to develop
       | Nextflow pipelines (e.g. QuantMS [2]). I think it would help to
       | have a shared language to express pipelines, and it would make
       | benchmarking simpler.
       | 
       | ___
       | 
       | [1] https://github.com/coleifer/huey/
       | 
       | [2] https://docs.quantms.org/en/latest/
        
         | nonrepeating wrote:
         | The closest I've gotten to local debugging is having the Python
         | scripts that are launched by Nextflow steps connect to a remote
         | debugger process ("remote" but running on the same
         | workstation). PyCharm makes this fairly painless to
         | orchestrate. I've never been able to debug the Groovy script in
         | a Nextflow pipeline itself; I think you'd need a debug build of
         | the nextflow executable for that.
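
One way the remote-debugger approach above could look inside a Python script launched by a pipeline step. This is a hedged sketch: the environment variable `PIPELINE_DEBUG_PORT` and the helper name are hypothetical, and it assumes the `pydevd-pycharm` package is installed and a PyCharm debug server is already listening on that port.

```python
import os


def maybe_attach_debugger() -> bool:
    """Connect to a waiting debug server if PIPELINE_DEBUG_PORT is set.

    PIPELINE_DEBUG_PORT is a made-up variable name for this sketch.
    Returns True only if a connection attempt was made.
    """
    port = os.environ.get("PIPELINE_DEBUG_PORT")
    if not port:
        return False  # normal runs: do nothing
    try:
        import pydevd_pycharm  # provided by the pydevd-pycharm package
    except ImportError:
        return False  # debugger package not available in this environment
    pydevd_pycharm.settrace(
        "localhost",
        port=int(port),
        stdoutToServer=True,
        stderrToServer=True,
    )
    return True


if __name__ == "__main__":
    maybe_attach_debugger()
    # ... the actual work of the pipeline step would follow here ...
```

Guarding the call behind an environment variable keeps the script unchanged for production runs while letting a single step be debugged from the workstation.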
        
       | suslik wrote:
       | I develop bioinformatics pipelines for a living and am very
       | opinionated on the topic.
       | 
       | Having enough experience with snakemake as well as nextflow in
       | production for many years now, I would always opt for
       | snakemake for anything but extremely large DAGs (which is quite
       | rare for bioinformatics pipelines). The fact that nextflow
       | still doesn't allow deleting temporary files during execution or
       | re-running the workflow from a set of intermediary files is an
       | insane deal breaker (not to mention other strange things like an
       | arbitrary limit of 1000 parallel jobs etc.). AWS runners, which
       | once were nextflow's selling point, are not an advantage anymore
       | given Amazon Genomics CLI.
       | 
       | Subjectively, writing pipelines that incorporate conditional
       | logic is much nicer in python+snakemake than groovy+nextflow, but
       | maybe there is someone out there who prefers groovy.
       | 
       | Is a proper dry run possible in nextflow already btw?
        
         | mbreese wrote:
         | I do too.. and have similar opinions. I wrote my own tool years
         | back for pipelines because it was always frustrating (started
         | roughly around the same time as Nextflow).
         | 
         | Allowing for files to be marked as transient (temp) and
         | re-running from arbitrary time points are definitely among the
         | things I support... as is conditional logic within the pipeline
         | for job definition and resource usage. For me though, one of
         | the biggest things is that I like having composable pipelines,
         | so each part of the larger workflow can be developed
         | independently. They can interact with each other (DAG) and use
         | existing dependencies, but they don't have to exist in the same
         | document/script. I work on large WGS datasets, so 1000's of
         | jobs per patient isn't uncommon.
         | 
         | Happy to talk more if you're interested.
         | 
         | https://github.com/compgen-io/cgpipe
         | 
         | And yes, you can dry run the entire thing. It will write out a
         | bash script if you want to see exactly what is going to run
         | without submitting jobs. It's a full language for pipelines,
         | but heavily inspired by Makefiles (with conditional logic).
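
A Makefile-inspired dry run of the kind described above might be sketched like this. It is not cgpipe's implementation: the `Step` type, `is_stale`, and `dry_run` are hypothetical names, and steps are assumed to be listed in dependency order.

```python
from __future__ import annotations

import os
from dataclasses import dataclass


@dataclass
class Step:
    output: str
    inputs: list[str]
    command: str


def is_stale(step: Step) -> bool:
    """Make-style rule: rebuild if the output is missing or older than any input."""
    if not os.path.exists(step.output):
        return True
    out_mtime = os.path.getmtime(step.output)
    return any(
        os.path.exists(path) and os.path.getmtime(path) > out_mtime
        for path in step.inputs
    )


def dry_run(steps: list[Step]) -> str:
    """Render the commands that would run as a bash script, without running them."""
    lines = ["#!/bin/bash", "set -euo pipefail"]
    rebuilt: set[str] = set()
    for step in steps:  # assumed to be in dependency order
        # A step also reruns when one of its inputs is about to be rebuilt.
        if is_stale(step) or rebuilt.intersection(step.inputs):
            lines.append(step.command)
            rebuilt.add(step.output)
    return "\n".join(lines) + "\n"
```

Because the decision depends only on which files exist and their timestamps, the same check gives re-running from an arbitrary intermediate point for free: touch or delete a file and only the downstream steps appear in the script.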
        
         | mribeirodantas wrote:
         | You have been able to rerun/resume Nextflow pipelines for a
         | while now, and yes, you can have dry runs in Nextflow. I have
         | no idea what you're referring to with the 'arbitrary limit of
         | 1000 parallel jobs' though. As for deleting temporary files,
         | there are features that allow you to do a few things related to
         | that, and other features being implemented.
        
       ___________________________________________________________________
       (page generated 2023-08-10 23:00 UTC)