As discussed at the last Germline DB Working Group, I have put together a Docker container for IgBLAST in order to help us understand what would be involved in defining an API for repertoire sequence analysis tools. Documentation on the container is here and the container itself is downloadable from docker.com (see documentation).
Here are some notes:
I adopted the Bioboxes format: the container expects input and output directories to be mapped, takes a command on the command line, and takes parameters in a .yaml file. There is a sample template for a Biobox, which I used as a starting point. The format allows other directories to be mapped as well, and for the IgBLAST Biobox I have mapped a cache directory, which holds the germline files in IgBLAST format (the IMGT files are downloaded on first use). I think such a directory will be generally useful for repertoire sequence parsers: all the IgBLAST files are in a subdirectory, so the cache could be shared between parsers. I found the Bioboxes format sensible and easy to work with. If we didn't use it, we would need to come up with an equivalent format, and the template is a useful starting point.
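To make the directory mapping concrete, here is a minimal Python sketch that assembles a docker run command for a Biobox-style container. The /bbx/input and /bbx/output mount points follow the usual Bioboxes convention as I understand it; the /bbx/cache mount point, the image name and the task name are assumptions made for illustration, not the names used by the IgBLAST container.

```python
import shlex
from pathlib import Path


def build_docker_command(image, task, input_dir, output_dir, cache_dir=None):
    """Return the argv list for running a Biobox-style container.

    image and task are placeholders: substitute the real image name and
    the command that the container documents.
    """
    mounts = [
        (input_dir, "/bbx/input", "ro"),    # holds the .yaml file and input data
        (output_dir, "/bbx/output", "rw"),  # results are written here
    ]
    if cache_dir is not None:
        # Assumed mount point for the shared germline cache.
        mounts.append((cache_dir, "/bbx/cache", "rw"))

    argv = ["docker", "run", "--rm"]
    for host, container, mode in mounts:
        # Docker needs absolute host paths for bind mounts.
        host_path = Path(host).resolve()
        argv.append(f"--volume={host_path}:{container}:{mode}")
    argv += [image, task]
    return argv


if __name__ == "__main__":
    cmd = build_docker_command(
        "example/igblast-biobox",  # hypothetical image name
        "default",                 # hypothetical task name
        "input", "output", "cache",
    )
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a shell string means the command can be handed straight to subprocess.run without quoting problems.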
If we want something that will just get people up and running, a simple .yaml file along the lines of the one I have defined for this example may do the job. But if we want containerised parsers to be more widely useful, I think it will be necessary to allow parser-specific parameters and other information to be passed to the parser. This could be done conveniently through the .yaml file: put a section in there headed 'parser-specific', for example, and parameters can be put under that heading without interfering with the rest of the file.
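As a rough sketch of how a 'parser-specific' section could sit alongside the common parameters, the Python below works on a dictionary of the kind a YAML loader would return. Every key and value other than the 'parser-specific' heading itself is hypothetical and chosen only for illustration; it is not the layout of the actual IgBLAST .yaml file.

```python
# Hypothetical parameters, as they might look once the .yaml file has been
# loaded into a dictionary. Only the 'parser-specific' heading comes from
# the proposal; all other keys and values are made up.
EXAMPLE_CONFIG = {
    "version": "0.1.0",
    "arguments": {
        "sequences": "/bbx/input/reads.fasta",
        "species": "human",
    },
    "parser-specific": {
        "num_alignments_V": 3,
    },
}


def split_parameters(config):
    """Separate the common parameters from the parser-specific ones.

    A parser that has no special options can ignore the second value;
    a file with no 'parser-specific' section yields an empty dictionary.
    """
    common = {k: v for k, v in config.items() if k != "parser-specific"}
    specific = dict(config.get("parser-specific") or {})
    return common, specific


if __name__ == "__main__":
    common, specific = split_parameters(EXAMPLE_CONFIG)
    print("common:", common)
    print("parser-specific:", specific)
```

Because the extra parameters live under their own heading, a tool that only understands the common part can read the same file unchanged.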
Bioboxes at the moment seem to be used exclusively for assemblers. There is talk of persuading reviewers and publishers to insist on Bioboxes where new assemblers are described in a publication, which seems in line with our thinking. There was a lot of activity on the Bioboxes GitHub last year, but not much this year.
If we wanted to become a 'true' user of bioboxes, we would need to define an RFC for repertoire sequence parsers along the lines of this one and submit it for approval. We would also need to submit RFCs for the output file format, I think. There is a simplified biobox command line interface which can be used instead of calling docker directly. It is implemented by embedding code for each supported kind of biobox (architecturally, I find this a little ugly). I guess that we would be able to push support code for our biobox if our RFCs were accepted. But it's really not too difficult to call Docker, in my opinion.