Defining an API for repertoire sequence analysis tools

ematsen · July 2, 2016, 3:23pm

We’ve discussed file formats, which standardize what results look like, at length here and at meetings. Could we go a step further and standardize how we call programs? Of course, each program has its own special features, but there is a lot of common functionality and it would be great to standardize that via wrappers.

I am inspired by the http://nucleotid.es/ project in general, and by their API in particular. We’ve discussed the mechanics of how we can get software to run on various computers using Docker in another thread, but perhaps we can step back and enumerate the important elements without being too explicit about how the software side will work?

To start things off, for VDJ annotation I first thought that this would be obvious:

a fasta file of input sequences
a fasta file with the germline database

but then I realized that @sdwfrost’s IgSCUEAL requires a multiple sequence alignment of germlines, and I think also a tree. Should we accommodate that via an arbitrary tarball of “other germline information”?

ematsen · July 2, 2016, 11:30pm

@psathyrella has made some nice notes about standardizing the format of a germline set as a separate topic.

sdwfrost · July 5, 2016, 9:58am

One might imagine more evolution-based approaches for inferring germlines. What about taking an argument for a NEXUS formatted file for those methods that might use this information?

I also spoke to Peter Belmann (one of the Bioboxes devs), and he said that they were open to pull requests for APIs for other tools.

dooley · August 1, 2016, 11:07pm

I’m still a fan of the Common Workflow Language (CWL). It’s actively worked on, covers this exact problem, and is mature enough to handle this use case without requiring any extensions to the existing definition.

From a practical standpoint, there are existing third-party implementations for running CWL tasks locally, remotely, or in a hybrid scenario. The same CWL definition file is just YAML or JSON and can be published and shared as a standard text document.

ematsen · August 10, 2016, 12:24am

Sounds good.

For folks not familiar with CWL, there’s a gentle introduction here. The idea is that one writes a .cwl file that describes how to make a command line string from a list of parameters. One then specifies this list of parameters in another JSON or YAML file like so:

example_flag: true
example_string: hello
example_int: 42
example_file:
  class: File
  path: whale.txt

This is a great framework for formalizing things and then running them, but there are two things that still require some decision-making.

1. As before, what do we want to have as shared parameters for various tools?
2. Do we want to get behind some container-based framework for encapsulating tools and running them? This would sure save a lot of time on the part of folks who want to run the tools by avoiding complex dependencies.

For #1, in the case of VDJ annotation as in my original post, it appears to me that CWL nicely handles optional parameters:

When the parameter type ends with a question mark ? it indicates that the parameter is optional.

So with this modification for an optional alignment, perhaps we are in good shape for #1 for VDJ alignment tools? I would, however, suggest something other than NEXUS, which is a pretty wild format itself. We also need to agree on output, and it seems like the Change-O data standard is, for the time being, the way to go.
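
To make the optional-alignment idea concrete, here is a rough sketch that builds a CWL CommandLineTool description as a Python dict and prints it as JSON (CWL descriptions can be JSON as well as YAML). The executable name `vdj-annotate`, its flags, the input names, and the output file name are all hypothetical placeholders, not the interface of any existing tool. Only the CWL field names and the trailing `?` for optional types are taken from CWL itself.

```python
import json


def make_vdj_tool_description():
    """Return a CWL CommandLineTool description as a plain dict.

    The command name, flags, and file names are hypothetical; they only
    illustrate two required FASTA inputs plus one optional alignment.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "vdj-annotate",  # hypothetical executable
        "inputs": {
            "sequences": {
                "type": "File",
                "inputBinding": {"prefix": "--sequences"},
            },
            "germline_db": {
                "type": "File",
                "inputBinding": {"prefix": "--germlines"},
            },
            # The trailing "?" marks this parameter as optional.
            "germline_alignment": {
                "type": "File?",
                "inputBinding": {"prefix": "--germline-alignment"},
            },
        },
        "outputs": {
            "annotations": {
                "type": "File",
                # Hypothetical output file name written by the tool.
                "outputBinding": {"glob": "annotations.tsv"},
            },
        },
    }


def optional_inputs(tool):
    """List the input names whose type ends with '?' (optional in CWL)."""
    return sorted(
        name for name, spec in tool["inputs"].items()
        if spec["type"].endswith("?")
    )


tool = make_vdj_tool_description()
print(json.dumps(tool, indent=2))
print("optional:", optional_inputs(tool))
```

A tool such as IgSCUEAL could then be handed the alignment, while tools that do not need it would simply be run without that entry in the parameter file.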

For #2, do we have some votes for Rabix over Bioboxes? Something else? Or does it matter?

schristley · August 10, 2016, 6:24pm

An HPC friend mentioned Singularity, http://singularity.lbl.gov/. It apparently resolves the security issues that prevent HPC from using Docker (root level daemon, root escalation path).

laserson · September 7, 2016, 3:39pm

Thinking about DSLs for workflows, one thing to keep in mind is to prefer an “embedded DSL” so that your configuration is actually code for some programming language. This ensures that you have the power of the programming language at your disposal. For example, I don’t think that CWL has conditionals or loops.

ematsen · September 7, 2016, 3:46pm

Could you please provide examples of what you have in mind? (But I’m guessing you are talking about, say, PySpark?)

biocrusoe · September 10, 2016, 5:55pm

Hello, I’m the CWL Community Engineer and one of the co-founders.

This is a great use-case for CWL. You can describe the underlying tools but do so using the same exterior interface. For tools that need a more complicated transformation you can even have an entire workflow to do the job – in CWL the description of tools and workflows are interchangeable.

Here’s an example from CWL itself: a definition of how to invoke a generic CWL implementation: v1.1 cwl-runner interface by mr-c · Pull Request #278 · common-workflow-language/common-workflow-language · GitHub

Since we support unlimited user defined (and preferably namespaced & ontology backed) metadata you can even include a machine readable pointer to which “API” that a particular description is fulfilling. Such a convention would be useful to others and would likely be pulled into the standard at some point, so your participation would be welcome.

As for DSLs, we explicitly didn’t go that route, as generating graphical user interfaces and being able to reason about the description are core use-cases. You can always call a wrapper script if needed. As for conditionals and loops, we might add them someday but there isn’t consensus around how to represent those. We do allow for JavaScript in certain controlled instances.

My personal hope is that such discussions here and elsewhere bring more attention to the need for command line tool interfaces to receive the same design thought that goes into other human-computer interactions. In the meantime we all can help authors of related tools converge on common semantics.

I’m happy to provide guidance and feedback; we have a chat room at https://gitter.im/common-workflow-language/common-workflow-language

Cheers!

ematsen · October 11, 2016, 6:32pm

This is a recap of a virtual meeting we had this morning. @caschramm, @laserson, @dooley, @javh and I were in attendance.

The most important part was @dooley making a strong case for CWL rather than Bioboxes. The essence of this argument was that Bioboxes is great but depends on a particular technology, namely Docker, and that we will be using something else in five years. In contrast, CWL gives a specification grammar that completely abstracts the underlying processing.

But, one might respond, Biobox is lovely because I can simply pull down a Biobox and run it using a single command without having to install any software other than the image. Here is, with my limited understanding, the moral equivalent of that.

Consumer

1. Install a Python package: pip install cwlref-runner
2. Download the CWL file describing the tool, as described above
3. Put your parameters into a YAML file as described below
4. Run: cwl-runner the-tool.cwl my-parameters.yml

Here’s what an example parameter specification file looks like, taken from the docs:

tarfile:
  class: File
  path: hello.tar
extractfile: goodbye.txt

This is exactly analogous to making the file needed to run a biobox.

Assembling such a file is not hard, but if we did want to have some meta-AIRR tool for some specific task (capable of running a variety of implementations of VDJ alignment, for example) it would be easy to write a little shim that would build a YAML parameter file on the fly given command line arguments.
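
A minimal sketch of such a shim, assuming hypothetical option and parameter names (`--sequences`, `--germlines`, `--germline-alignment`, `germline_db`, and so on are made up for illustration). It emits JSON rather than YAML to avoid a third-party dependency; JSON is generally accepted wherever a YAML parameter file is, but that is worth checking against the runner you use.

```python
import argparse
import json


def build_job(argv):
    """Turn command line arguments into a CWL-style parameter dict.

    All option and parameter names here are hypothetical.
    """
    parser = argparse.ArgumentParser(
        description="Hypothetical shim: CLI arguments -> CWL parameter file"
    )
    parser.add_argument("--sequences", required=True)
    parser.add_argument("--germlines", required=True)
    parser.add_argument("--germline-alignment", default=None)
    args = parser.parse_args(argv)

    def as_file(path):
        # Same File object shape as in the parameter file examples above.
        return {"class": "File", "path": path}

    job = {
        "sequences": as_file(args.sequences),
        "germline_db": as_file(args.germlines),
    }
    # Leave the optional parameter out entirely when it was not given.
    if args.germline_alignment is not None:
        job["germline_alignment"] = as_file(args.germline_alignment)
    return job


def render_job(job):
    """Serialize the parameter dict as JSON text."""
    return json.dumps(job, indent=2) + "\n"


# Demo with an explicit argument list instead of sys.argv.
demo = build_job(["--sequences", "reads.fasta", "--germlines", "germlines.fasta"])
print(render_job(demo))
```

Writing the rendered text to my-parameters.yml and passing that file to cwl-runner alongside the tool's .cwl file would complete the consumer steps listed above.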

Producer

If the producer wants to have their tool run in a Docker container to eliminate non-Docker dependencies, this is also easily supported. Best to look at the corresponding documentation, but the essence is that one can simply specify a command to be run inside of a specific container, and the CWL magic assembles the docker run command for you.
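
As a hedged illustration of the producer side, the snippet below attaches a `DockerRequirement` hint to a tool description held as a Python dict. The tool skeleton, the executable name, and the image name `example/vdj-annotate:latest` are hypothetical, and the helper assumes `hints` is written in list form; the corresponding CWL documentation remains the authority on the details.

```python
import json


def with_docker(tool, image):
    """Return a copy of a CWL tool description that requests a Docker image.

    Assumes any existing "hints" entry uses the list form. An existing
    DockerRequirement is replaced rather than duplicated.
    """
    updated = dict(tool)
    hints = [
        hint for hint in tool.get("hints", [])
        if hint.get("class") != "DockerRequirement"
    ]
    hints.append({"class": "DockerRequirement", "dockerPull": image})
    updated["hints"] = hints
    return updated


# Hypothetical tool skeleton; inputs and outputs omitted for brevity.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "vdj-annotate",
    "inputs": {},
    "outputs": {},
}

containerized = with_docker(tool, "example/vdj-annotate:latest")
print(json.dumps(containerized, indent=2))
```

The command itself is untouched; only the hint changes, and it is then up to the CWL runner to assemble the container invocation.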

I’m pretty convinced, but would like to hear other thoughts. Also, we are back to wanting to come up with shared parameters to be used by the various tools of various categories (see above).

There was also discussion of Agave, which sounds lovely, as well as End of Day. We can chat about these later, but the sense is that by adopting CWL we will be able to integrate with a variety of workflow execution engines.

This post belongs in this thread and the containers thread.

sdwfrost · October 11, 2016, 8:21pm

For Python based tools that use argparse, one can use gxargparse, which provides a flag --generate_cwl_tool that generates a CWL description automatically. I tried to use it with pRESTO, however, with no success - anyone else want to give it a go?

javh · October 11, 2016, 9:19pm

That’s neat!

I can take a look at it. I suspect it has to do with some of these unsupported things in gxargparse:

Argument groups don’t work in CWL as arguments are sorted with a special algorithm.
Some features like templating of the version string are unsupported.
nargs=N. Number of arguments can not be specified in CWL (yet).
Mutual exclusion is not supported.

Or it’s missing the ability to do multiple inheritance of RawDescriptionHelpFormatter and ArgumentDefaultsHelpFormatter or some other such oddity.

biocrusoe · February 7, 2017, 8:34am

FYI: gxargparse has been renamed to argparse2tool and received a few upgrades: GitHub - hexylena/argparse2tool: transparently build CWL and Galaxy XML tool definitions for any script that uses argparse

@sdwfrost If that still doesn’t work with pRESTO could you file an issue?