[HN Gopher] Nextflow: Data-Driven Computational Pipelines
___________________________________________________________________

Nextflow: Data-Driven Computational Pipelines

Author : brianzelip
Score  : 58 points
Date   : 2023-08-08 14:03 UTC (2 days ago)

(HTM) web link (nextflow.io)
(TXT) w3m dump (nextflow.io)

| anyoneamous wrote:
| It's kind of a shame this is based on Groovy, rather than Python
| which is much more familiar to people in the HCLS space. I've
| always been stuck on the fence between wanting to use NF (since
| it's the most popular) and Snakemake (which feels like less of an
| oddity development-wise).

| bafe wrote:
| Perhaps it is not as popular, but I found the Groovy syntax
| ideal for a DSL like that.

| dekhn wrote:
| In theory I want to like Nextflow, but my main criticism is that
| it's really, really hard to debug programs that pass around lists
| of Promises (Nextflow's DAG uses promises as handles on future
| computations, and functions receive promises and can't easily
| materialize them and print them).
|
| The caching is often more trouble than it's worth. Also, the
| little bash scripts that integrate with AWS break if your AWS
| environment isn't vanilla (our enterprise AWS has a lot of
| restrictions).

| nonrepeating wrote:
| Using Tower for Nextflow can help streamline it on AWS; it's
| pretty powerful (but costs money for anything beyond trivial
| use cases): tower.nf

| adolph wrote:
| Another day, another workflow DSL.
|
| * looks like yaml
| * has curly braces to look programmery
| * whitespace might be meaningful
| * has pipes like a bash script
|
| https://www.commonwl.org/
|
| https://github.com/common-workflow-language/common-workflow-...
|
| mea culpa: The above was based on a first look at something
| titled "A DSL for parallel and scalable computational
| pipelines" as opposed to "Java workflow manager with Groovy
| language scripting." The presented screenshot still looks to me
| like an unholy union of yaml, js, py and sh.
| If that sounds groovy to you, have fun.

| esafak wrote:
| In all fairness, they predate the competition (2013):
| https://github.com/nextflow-io/nextflow/releases?page=25

| geoffjentry wrote:
| Since GP referenced CWL: while NF appeared first, in the
| bioinformatics world Nextflow, CWL, Snakemake, and WDL all
| erupted close enough to each other to be equal-ish. The
| people were aware of each other but they were all so nascent
| that it wasn't clear if it was worth joining in or not. At
| the end of the day these all came from groups trying to
| scratch particular itches, and not everyone agreed on the
| right way to scratch.
|
| However, all of them were rejections of prior models as well
| as the workflow solutions prominent in the business space.

| adolph wrote:
| Yeah, the thing that I find disappointing is that there is
| a lot of science value locked into the different systems of
| describing a workflow, pipeline or DAG. Like you said, they
| all had different itches to scratch and even some barebones
| "standards" like csv have flavors/extensions/etc.

| bafe wrote:
| They try to address similar problems, but comparing
| Snakemake and Nextflow doesn't do either tool a favour.
| They use different computation models: Nextflow is based on
| dataflow programming and therefore schedules processes
| dynamically as new data comes in, while Snakemake is
| pull-based and schedules the processes based on the DAG
| defined by the dependencies. Anyhow, they are both great tools.

| mbreese wrote:
| In fairness, this is an old problem with many other
| contenders. This issue is as old as batch schedulers. FWIW, I
| was at an ISMB conference in 2005 that had at least 2-3
| workflow managers presented.

| bafe wrote:
| If you refer to Nextflow, the syntax is basically Groovy.

| adolph wrote:
| Interesting. From the docs:
|
| _The Nextflow scripting language is an extension of the
| Groovy programming language.
| Groovy is a powerful programming language for the Java virtual
| machine. The Nextflow syntax has been specialized to ease the
| writing of computational pipelines in a declarative manner._
|
| https://www.nextflow.io/docs/latest/script.html?highlight=gr...

| notQuiteEither wrote:
| Indeed, it's a fully functional scripting language that is an
| extension of Groovy. So "looks programmery" is more than a
| little reductive.
|
| Edit: functional as in, it's not half-hearted, not functional
| as in functional programming.

| firecraker wrote:
| So my question to the non-bioinformaticians: is this already a
| solved problem?
|
| You have tasks which require resources based on the input
| parameters, these are run in Docker containers to ensure a
| consistent environment, and you want to track the output of each
| step. Often these are embarrassingly parallel operations (e.g. I
| have 200 samples to do the same thing on).
|
| Something like Dask perhaps, but where you can specify a Docker
| image for the task?
|
| What is the go-to in DevOps for similar tasks? GitHub Actions
| comes pretty close...
|
| To the bioinformaticians: what is the unique selling point of
| Nextflow over, say, WDL/Cromwell?

| radus wrote:
| I've considered using Nextflow for bioinformatics pipelines but
| have yet to take the plunge.
|
| At work, I develop a proteomics pipeline that is composed of
| huey [1] tasks (Python library; simple alternative to Celery)
| which either use subprocess to call out to some external tool, or
| are just pure Python. It runs in a worker container which is
| managed by Docker Swarm, and all containers pull jobs from Redis.
| For our scale, it works great. However, I don't have control over
| the resource utilization of individual steps, and in the past
| I've had issues with the pipeline blocking as a result of how I
| was chaining tasks together.
| I think something like Nextflow would remove these limitations,
| but one thing I think I would miss is the ability to debug
| individual pipeline steps locally with an interactive debugger.
| As far as I can tell, Nextflow has logging/tracing facilities but
| nothing quite like an interactive debugger. I'd be happy to be
| told I'm wrong, or even that I'm doing it wrong.
|
| Other reasons I'd like to start using Nextflow:
|
| - my homebrew pipeline would be easier to set up/share
|
| - there are some efforts in the proteomics community to develop
| Nextflow pipelines (e.g. QuantMS [2]). I think it would help to
| have a shared language to express pipelines, and it would make
| benchmarking simpler.
|
| ___
|
| [1] https://github.com/coleifer/huey/
|
| [2] https://docs.quantms.org/en/latest/

| nonrepeating wrote:
| The closest I've gotten to local debugging is having the Python
| scripts that are launched by Nextflow steps connect to a remote
| debugger process ("remote" but running on the same
| workstation). PyCharm makes this fairly painless to
| orchestrate. I've never been able to debug the Groovy script in
| a Nextflow pipeline itself; I think you'd need a debug build of
| the nextflow executable for that.

| suslik wrote:
| I develop bioinformatics pipelines for a living and am very
| opinionated on the topic.
|
| Having enough experience with Snakemake as well as Nextflow in
| production for many years now, I would always opt for
| Snakemake for anything but extremely large DAGs (which is quite
| rare for bioinformatics pipelines). The fact that Nextflow
| still doesn't allow deleting temporary files during execution or
| re-running the workflow from a set of intermediary files is an
| insane deal breaker (not to mention other strange things like an
| arbitrary limit of 1000 parallel jobs etc.). AWS runners, which
| once were Nextflow's selling point, are not an advantage anymore
| given Amazon Genomics CLI.
|
| Subjectively, writing pipelines that incorporate conditional
| logic is much nicer in python+snakemake than groovy+nextflow, but
| maybe there is someone out there who prefers Groovy.
|
| Is a proper dry run possible in Nextflow already btw?

| mbreese wrote:
| I do too... and have similar opinions. I wrote my own tool years
| back for pipelines because it was always frustrating (started
| roughly around the same time as Nextflow).
|
| Allowing for files to be marked as transient (temp) and
| re-running from arbitrary time points are definitely among
| the things I support... as is conditional logic within the pipeline
| for job definition and resource usage. For me though, one of
| the biggest things is that I like having composable pipelines,
| so each part of the larger workflow can be developed
| independently. They can interact with each other (DAG) and use
| existing dependencies, but they don't have to exist in the same
| document/script. I work on large WGS datasets, so 1000s of
| jobs per patient isn't uncommon.
|
| Happy to talk more if you're interested.
|
| https://github.com/compgen-io/cgpipe
|
| And yes, you can dry run the entire thing. It will write out a
| bash script if you want to see exactly what is going to run
| without submitting jobs. It's a full language for pipelines,
| but heavily inspired by Makefiles (with conditional logic).

| mribeirodantas wrote:
| It's been a while since you can rerun/resume Nextflow
| pipelines, and yes, you can have dry runs in Nextflow. I have
| no idea what you're referring to with the 'arbitrary limit of
| 1000 parallel jobs' though. As for deleting temporary files,
| there are features that allow you to do a few things related to
| that, and other features being implemented.
___________________________________________________________________
(page generated 2023-08-10 23:00 UTC)