Sponsored by the AIRR Community

Containers for repertoire software?

Dear All (esp. Erick),

Do you think it’s worthwhile setting up a GitHub/Docker Hub organisation, bcr_tcr, as a place for automated container builds? Here’s an IgBLAST image that I set up today:

https://hub.docker.com/r/sdwfrost/igblast/

I’ve been playing with Common Workflow Language and with little extra effort, one can use Docker containers for programs rather than having them installed locally. I’d be happy to post a wrapper for IgBLAST, if people are interested.
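For a sense of what such a wrapper involves, here is a minimal CWL CommandLineTool sketch (illustrative only: the germline database path and input names are placeholders, though `-query` and `-germline_db_V` are real igblastn options):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: igblastn
hints:
  DockerRequirement:
    dockerPull: sdwfrost/igblast
inputs:
  query:                 # query sequences in FASTA format
    type: File
    inputBinding:
      prefix: -query
  germline_db_V:         # path to the germline V-gene database
    type: string
    inputBinding:
      prefix: -germline_db_V
outputs:
  result:
    type: stdout
stdout: igblast_output.txt
```

With a file like this, `cwl-runner` pulls the container and runs the tool, so nothing needs to be installed locally.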

Best,
Simon

Hi @sdwfrost. Yes, I think this is a very important effort!

Could you contrast the approach of using CWL+Docker with that of using a project such as Bioboxes? My thought was that we would pack a variety of software into these Bioboxes, but with a uniform input (reads and germline genes) and uniform output (e.g. VDJ annotations). Then we could just build a single interface for sequence analysis and be able to drop in various tools, and do a comparison. IIUC in the CWL approach one would have the per-program information in CWL and Docker would just serve to run the software. Are there advantages to this?

Currently we have btcr on GitHub, and I have just added a btcr account on Docker Hub. I have invited you to be an owner of both. Anyone else who wants to join, just PM me here or drop me an email and I’ll invite you too.

So far, a high school student (who will be returning this summer) has built out two Bioboxes which you can see on the btcr github page, one for partis and one for MiXCR.

Thanks for thinking about this with us.

Dear @ematsen,

CWL doesn’t depend on Docker: you can use a local executable, or let Docker supply it. It doesn’t assume a default command, so a single container can hold several tools. CWL is more about composing workflows than about packaging a particular tool. There’s usually more to a pipeline than VDJ annotation anyway: sequence QC (e.g. FastQC), trimming, read merging (for overlapping reads), and so on, and having the repertoire tools wrapped in CWL as well would allow all the steps to be combined into a single pipeline.
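As a rough sketch of stitching those steps together in CWL (the step files named here are hypothetical placeholders for per-tool wrappers):

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  raw_reads: File
outputs:
  annotations:
    type: File
    outputSource: annotate/result
steps:
  qc:
    run: fastqc.cwl        # hypothetical wrapper for FastQC
    in: {reads: raw_reads}
    out: [report]
  trim:
    run: trim.cwl          # hypothetical read-trimming wrapper
    in: {reads: raw_reads}
    out: [trimmed]
  annotate:
    run: igblast.cwl       # hypothetical VDJ annotation wrapper
    in: {query: trim/trimmed}
    out: [result]
```

Each step can run from its own Docker container, and swapping in a different annotation tool just means pointing `annotate` at a different wrapper.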

I’ll set up a repo on btcr on GitHub so you can take a look.

Best
Simon


Thanks for the pointers, @sdwfrost. Here are some responses.

  • Will CWL become the universal workflow language for bioinformatics? Maybe. Or maybe not. (I’d love to hear about evidence that it will.) But if it doesn’t, where are we if we have written everything up in CWL? In contrast, I would argue that with Docker we have a technology that is already an essential part of infrastructure, and is only growing in popularity.
  • There’s definitely a lot more to BCR/TCR sequence analysis than annotation. My thought was that each one of the standard steps (preprocessing, annotation, clustering) could be containerized with standardized in/out which would allow the various building blocks to be substituted in and out.

It’s worth checking out End of Day (EoD), a project based on Nextflow that lets you define workflows of Docker containers, both local and remote, as a YAML document and then run them reproducibly. EoD is not currently CWL compliant, but the developers are on the CWL working group and are helping to sort through the gaps that make CWL challenging for broad use in real-world scenarios.

@ematsen Generally speaking, this is a harder problem than it seems on the surface because of the need for consistent interface definitions for an image. The runC spec doesn’t have anything in place at the moment to address this, so it’s really convention over configuration until the community settles on an approach. I like the direction Bioboxes is heading. Hopefully more of the community can rally behind it, put it through its paces, and harden the schema and metadata in the images.


Dear @dooley

Thanks for the EoD pointer - I wasn’t aware of that one. It looks as though many of the workflow solutions are converging on a fairly similar interface anyway (see, e.g., some of my Snakemake workflows).

I think that @ematsen and I are talking at crossed purposes; deploying a single app would be simpler if a uniform interface was adopted (e.g. BioBoxes), but there’s the bigger picture in terms of stitching things together into a workflow. I’m encouraged by the number of participating organisations in CWL, as well as their focus on the spec rather than the implementation.

For the btcr repo, I’m happy to go along with @ematsen in building Bioboxes, then having workflows in different engines until things settle down.


Supporting @sdwfrost’s suggestion of CWL is this paper:

Vivian, J., Rao, A., Nothaft, F. A., Ketchum, C., Armstrong, J., Novak, A., … Paten, B. (2016, January 1). Rapid and efficient analysis of 20,000 RNA-seq samples with Toil. bioRxiv. http://doi.org/10.1101/062497

Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.

Sounds cool:

A workflow is composed of a set of tasks, or jobs, that are orchestrated by specification of a set of dependencies that map the inputs and outputs between jobs. In addition to CWL and draft WDL support, Toil provides a Python API that allows workflows to be declared statically, or generated dynamically, so that jobs can define further jobs as needed (Supplementary Note 1). The jobs defined in either CWL or Python can consist of Docker containers, which permit sharing of a program without requiring individual tool installation or configuration within a specific environment. Open-source workflows that invoke containers can therefore be run precisely and reproducibly, regardless of environment. We provide a repository of workflows as examples. Toil also integrates with Apache Spark (Supplementary Note 6, Supplementary Fig. 4), and can be used to rapidly create containerized Spark clusters within the context of a larger workflow.

Thoughts, @dooley and @laserson?


We have dockerized all of the components of VDJServer in public images:

https://hub.docker.com/u/vdjserver/dashboard/

We use many of them for running VDJServer itself. The exception tends to be tools that run on the supercomputer, as many supercomputers do not support Docker, so for those we still do the old-fashioned compile and install. Regardless, we still use the Docker images internally for debugging and testing code changes. Another tool we use is Docker Compose, which lets you define and run multiple Docker containers together as a single application.
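To illustrate the Compose pattern (the service and image names below are hypothetical, not VDJServer’s actual images):

```yaml
# docker-compose.yml -- service and image names are illustrative
version: '2'
services:
  api:
    image: example/repertoire-api   # hypothetical web API image
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: mongo:3.2                # backing database for the API
```

A single `docker-compose up` then starts the whole stack with the dependency ordering declared above, which is handy for debugging a multi-service system locally.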

@ematsen I’m unclear what you mean by building a single interface. If you mean that each tool can be run with the same set of parameters, then I think this is challenging: most tools are trying to differentiate themselves by providing unique functionality, so a single interface seems to imply settling for the lowest common denominator.

Perhaps this is better for the repertoire analysis API thread, but I’m not quite following you here. VDJ alignment requires

  • a sequence file
  • a germline set
  • potentially some other blob of information (e.g. for Simon’s tool it would be a multiple sequence alignment).

Being able to run a tool with a specified interface even at this level of resolution would be helpful. No?
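Sketching that level of resolution as a biobox-style YAML manifest (everything here, field names and mount paths alike, is hypothetical; it is just to show the shape of a shared interface):

```yaml
# hypothetical manifest for a standardized VDJ-annotation container
version: 0.1.0
arguments:
  - sequences:
      type: fastq
      value: /bbx/input/reads.fastq
  - germline:
      type: fasta
      value: /bbx/input/germline.fasta
  - extra:                 # tool-specific blob, e.g. an MSA for Simon's tool
      type: msa
      value: /bbx/input/alignment.sto
```

Any annotation container that accepted a manifest like this could be dropped into the same harness, with tool-specific extras confined to the `extra` slot.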

Haven’t used Toil much myself, but I know some of the people who developed it. (They’ve also been working on the ADAM project for genomics on Spark.) I’m relatively ambivalent when it comes to workflow managers. They are generally designed with some particular set of applications in mind, and always have warts which are discovered once the system is deployed. I guess I would just make sure my tools are easily adaptable to any workflow manager.

Another workflow manager that was recommended to me, btw, is airflow by Airbnb.

Finally, in the interest of being compatible with “big data” in the future, it’d be best to pick a workflow manager that is compatible and integrates well with the Hadoop ecosystem. Both Toil and airflow do that, to my knowledge.

@sdwfrost I’m fine with Bioboxes as a catchall. I’m working with several communities on similar efforts, and the biggest problem with the current trend of building images like that, in addition to the bloat, is the amount of extraneous software bundled alongside the actual code. That makes it impossible to properly track versions and dependencies, or to convey to users what makes two containers different or alike. It also means the invocation interfaces become difficult to define in any meaningful way. With a dozen possible binaries to run in a single image, all you can really do is take arbitrary argument strings. That makes machine automation much harder, and all the good work done by CWL and others to make meaningful use of such information is lost.

@ematsen Toil looks good. I’m not sure why they didn’t use Luigi as a starting point, given the Python slant and Luigi having about 100x the adoption, but it looks like they’ve committed to going their own way. The fact that they are working to maintain CWL support, however, means they have a shot at maintaining relevance in their target community.

@laserson I’m with you. Hundreds of them out there, most would get the job done if people would just stop reinventing the wheel long enough to try them out. Would you say it’s more desirable to shoot for Hadoop support than Spark?

wrt Luigi: I worked with Luigi quite a bit and found it to be quite deficient in general for expressing more complex pipelines. (IIRC, no multiple outputs, no dynamic definition of the DAG, among other issues).

wrt Hadoop vs Spark: there is no distinction here. Supporting the Hadoop ecosystem means supporting the Spark ecosystem.