Supporting @sdwfrost’s suggestion of CWL is this paper:
Vivian, J., Rao, A., Nothaft, F. A., Ketchum, C., Armstrong, J., Novak, A., … Paten, B. (2016, January 1). Rapid and efficient analysis of 20,000 RNA-seq samples with Toil. bioRxiv. http://doi.org/10.1101/062497
Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.
Sounds cool:
A workflow is composed of a set of tasks, or jobs, that are orchestrated by specification of a set of dependencies that map the inputs and outputs between jobs. In addition to CWL and draft WDL support, Toil provides a Python API that allows workflows to be declared statically, or generated dynamically, so that jobs can define further jobs as needed (Supplementary Note 1). The jobs defined in either CWL or Python can consist of Docker containers, which permit sharing of a program without requiring individual tool installation or configuration within a specific environment. Open-source workflows that invoke containers can therefore be run precisely and reproducibly, regardless of environment. We provide a repository of workflows as examples [8]. Toil also integrates with Apache Spark [9] (Supplementary Note 6, Supplementary Fig. 4), and can be used to rapidly create containerized Spark clusters within the context of a larger workflow [10].
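To make the "jobs can define further jobs as needed" idea concrete, here is a toy sketch in plain Python. To be clear, this is not Toil's API: the `Job` class, `add_child`, and `run_workflow` are hypothetical names I made up for illustration, and the sample names are invented. It only shows the pattern where the shape of the workflow is not known until a parent job actually runs.

```python
from collections import deque


class Job:
    """Hypothetical stand-in for a workflow job (NOT Toil's Job class)."""

    def __init__(self, name, fn, *args):
        self.name = name
        self.fn = fn
        self.args = args
        self.children = []

    def add_child(self, name, fn, *args):
        # Called from inside a running job to declare more work dynamically.
        child = Job(name, fn, *args)
        self.children.append(child)
        return child


def run_workflow(root):
    """Run jobs breadth-first; children exist only after their parent has run."""
    order = []
    queue = deque([root])
    while queue:
        job = queue.popleft()
        job.fn(job, *job.args)  # the job body may call job.add_child(...)
        order.append(job.name)
        queue.extend(job.children)
    return order


def align(job, sample):
    # Placeholder for real per-sample work (e.g. launching a container).
    pass


def split_samples(job, samples):
    # The number of child jobs depends on the input, so it is decided at run time.
    for sample in samples:
        job.add_child(f"align:{sample}", align, sample)


root = Job("split", split_samples, ["s1", "s2", "s3"])
order = run_workflow(root)
print(order)  # ['split', 'align:s1', 'align:s2', 'align:s3']
```

A static declaration would build the whole tree before calling `run_workflow`; the dynamic version above only knows about `split` up front and discovers the three `align` jobs once `split` has executed.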
Thoughts, @dooley and @laserson?