Defining an API for repertoire sequence analysis tools

ematsen · October 11, 2016, 6:32pm

This is a recap of a virtual meeting we had this morning. @caschramm, @laserson, @dooley, @javh and I were in attendance.

The most important part was @dooley making a strong case for CWL rather than Bioboxes. The essence of the argument was that Bioboxes is great but depends on one particular technology, namely Docker, and that we will be using something else in five years. In contrast, CWL gives a specification grammar that abstracts away how the tool is actually executed.

But, one might respond, Bioboxes is lovely because I can simply pull down a Biobox and run it with a single command, without having to install any software other than the image. Here is, to the best of my limited understanding, the moral equivalent with CWL.

Consumer

1. Install a Python package: pip install cwlref-runner
2. Download the CWL file describing the tool, as described above
3. Put your parameters into a YAML file as described below
4. Run the tool: cwl-runner the-tool.cwl my-parameters.yml

Here’s what an example parameter specification file looks like, taken from the docs:

tarfile:
  class: File
  path: hello.tar
extractfile: goodbye.txt

This is exactly analogous to making the file needed to run a biobox.

Assembling such a file is not hard, but if we did want to have some meta-AIRR tool for some specific task (capable of running a variety of implementations of VDJ alignment, for example) it would be easy to write a little shim that would build a YAML parameter file on the fly given command line arguments.
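To give a feel for how small such a shim could be, here is a minimal sketch in Python. The function names and the `--file` / `--param` / `--out` flags are my own invention for illustration, not part of CWL or of any existing AIRR tool. It writes JSON rather than hand-rolled YAML, on the assumption that the runner accepts it (JSON is a subset of YAML), and it treats every non-file parameter as a string.

```python
import argparse
import json


def parse_pairs(pairs):
    """Turn ["name=value", ...] into {"name": "value", ...}."""
    result = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {pair!r}")
        result[name] = value
    return result


def build_job(files, params):
    """Build a CWL input object: file inputs get class/path, the rest pass through."""
    job = {name: {"class": "File", "path": path} for name, path in files.items()}
    job.update(params)
    return job


def main(argv=None):
    # Hypothetical command line interface; the flag names are assumptions.
    parser = argparse.ArgumentParser(
        description="Write a CWL parameter file from command line arguments."
    )
    parser.add_argument("--file", action="append", default=[], metavar="NAME=PATH")
    parser.add_argument("--param", action="append", default=[], metavar="NAME=VALUE")
    parser.add_argument("--out", default="my-parameters.yml")
    args = parser.parse_args(argv)

    job = build_job(parse_pairs(args.file), parse_pairs(args.param))
    with open(args.out, "w") as handle:
        # JSON is valid YAML, so the standard library is enough here.
        json.dump(job, handle, indent=2)
    return job
```

Calling `main(["--file", "tarfile=hello.tar", "--param", "extractfile=goodbye.txt"])` should write a `my-parameters.yml` with the same content as the example parameter file above, ready to hand to cwl-runner. A real meta-tool would also need to convert numeric and boolean parameters to the types the tool description expects.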

Producer

If the producer wants to have their tool run in a Docker container to eliminate non-Docker dependencies, this is also easily supported. Best to look at the corresponding documentation, but the essence is that one can simply specify a command to be run inside of a specific container, and the CWL magic assembles the docker run command for you.
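To make the Docker option concrete, here is a minimal sketch of a tool description that asks the runner to execute a command inside a container. It is written as a Python dict and dumped to JSON, which as far as I understand a CWL runner reads just as it does YAML. The structure follows the `DockerRequirement` / `dockerPull` pattern from the CWL v1.0 documentation; the tool name `vdj-align`, the image `example/vdj-align:latest`, and the `reads` input are hypothetical placeholders.

```python
import json


def docker_tool(base_command, image, inputs):
    """Return a CWL v1.0 CommandLineTool description as a plain dict."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # DockerRequirement asks the runner to execute baseCommand inside this image.
        "hints": {"DockerRequirement": {"dockerPull": image}},
        "inputs": inputs,
        "outputs": [],
    }


# Hypothetical tool, image, and input names, for illustration only.
tool = docker_tool(
    base_command="vdj-align",
    image="example/vdj-align:latest",
    inputs={"reads": {"type": "File", "inputBinding": {"position": 1}}},
)

# The resulting text could be saved as the-tool.cwl.
tool_json = json.dumps(tool, indent=2)
```

Listing the container under `hints` rather than `requirements` means a runner without Docker may still try to run the command directly, which fits the argument that the description should not be tied to one container technology.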

I’m pretty convinced, but would like to hear other thoughts. Also, we are back to wanting to come up with shared parameters to be used by the various tools of various categories (see above).

There was also discussion of Agave, which sounds lovely, as well as End of Day. We can chat about these later, but the sense is that by adopting CWL we will be able to integrate with a variety of workflow execution engines.

This post belongs in this thread and the containers thread.