A Container for Repertoire Sequence Parsers

w.lees · September 17, 2016, 11:15am

As discussed at the last Germline DB Working Group, I have put together a Docker container for IgBLAST in order to help us understand what would be involved in defining an API for repertoire sequence analysis tools. Documentation on the container is here and the container itself is downloadable from docker.com (see documentation).

Here are some notes:

I adopted the Bioboxes format: the container expects input and output directories to be mapped, takes a command on the command line, and takes parameters in a .yaml file. There is a sample template for a Biobox which I used as a starting point. This allows other directories to be mapped, and for the IgBLAST BioBox I have mapped a cache directory, which is used to hold the germline files in IgBLAST format - the IMGT files are downloaded on first use. I think such a directory will be generally useful for repertoire sequence parsers (all the IgBLAST files are in a subdirectory so that the cache could be shared between parsers). I found the Bioboxes format sensible and easy to work with. If we didn’t use it, we would need to come up with an equivalent format, and the template is a useful starting point.
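
To make the directory mapping concrete, here is a minimal sketch of how such a container might be invoked from Python. The mount points `/bbx/input` and `/bbx/output` follow my understanding of the Bioboxes convention; the `/bbx/cache` mount point, the image name and the task name are assumptions made for illustration and are not taken from the actual IgBLAST container.

```python
from pathlib import Path


def biobox_command(image, task, input_dir, output_dir, cache_dir=None):
    """Build the argv list for running a Biobox-style container with Docker."""
    argv = [
        "docker", "run", "--rm",
        # The input directory holds the sequences and the .yaml parameter
        # file, so it is mounted read-only.
        f"--volume={Path(input_dir).resolve()}:/bbx/input:ro",
        f"--volume={Path(output_dir).resolve()}:/bbx/output:rw",
    ]
    if cache_dir is not None:
        # Hypothetical mount point for a germline cache shared between parsers.
        argv.append(f"--volume={Path(cache_dir).resolve()}:/bbx/cache:rw")
    # The image name is followed by the command (task) the container should run.
    argv += [image, task]
    return argv


if __name__ == "__main__":
    # "example/igblast-biobox" and "default" are placeholder names.
    print(" ".join(biobox_command(
        "example/igblast-biobox", "default", "input", "output", "cache")))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where Docker is installed.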

If we want something that will just get people up and running, a simple yaml file along the lines of the one I have defined for this example may do the job, but if we’re looking for containerised parsers to be more widely useful, I think it will be necessary to allow parser-specific parameters and other information to be passed to the parser. This could be done conveniently through the .yaml file: put a section in there headed ‘parser-specific’, for example, and parameters can be put under that heading without interfering with the rest of the file.
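
As a rough sketch of the idea, the code below treats the parsed .yaml file as a plain dictionary and separates a ‘parser-specific’ section from the common parameters. Every key name other than ‘parser-specific’ is hypothetical, and the dictionary stands in for whatever a YAML library would return.

```python
def split_parameters(config, section="parser-specific"):
    """Return (common, specific) parameter dicts from a parsed config."""
    # Everything outside the named section is shared by all parsers.
    common = {key: value for key, value in config.items() if key != section}
    # A missing or empty section simply yields no parser-specific parameters.
    specific = dict(config.get(section) or {})
    return common, specific


# Stand-in for the parsed .yaml file; all field names here are illustrative.
config = {
    "version": "0.1.0",
    "arguments": {"sequences": "/bbx/input/reads.fasta", "species": "human"},
    "parser-specific": {"example_threshold": 0.5},
}

common, specific = split_parameters(config)
```

A generic front end could validate `common` against the shared schema and hand `specific` to the parser unchanged, so that adding a parser-specific option never requires a change to the shared format.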

Bioboxes at the moment seem to be used exclusively for assemblers. There is talk of persuading reviewers/publishers to insist on bioboxes where new assemblers are described in a publication, which seems in line with our thinking. There was a lot of activity on the biobox GitHub last year: not too much this year.

If we wanted to become a ‘true’ user of bioboxes, we would need to define an RFC for repertoire sequence parsers along the lines of this one and submit it for approval. We would also need to submit RFCs for the output file format, I think. There is a simplified biobox command line interface which can be used instead of calling docker directly. It is implemented by embedding code for each supported kind of biobox (architecturally, I find this a little ugly). I guess that we would be able to push support code for our biobox if our RFCs were accepted. But it’s really not too difficult to call Docker, in my opinion.

sdwfrost · September 24, 2016, 3:14pm

Hi @w.lees

I also set up IgBLAST a while ago (http://antibodyo.me/), although I didn’t use the Bioboxes format. I did speak to Peter Belmann of Bioboxes fame at a CAMI workshop (https://www.cami-challenge.org/), and he backed up what you said - that if we could propose an API, we can just submit a PR to the bioboxes repository.

Best
Simon

w.lees · September 25, 2016, 8:54am

Thanks Simon. I think it’s worth following the Bioboxes format for our API. Perhaps @psathyrella could be persuaded to implement it for partis?

All the best

William

psathyrella · September 28, 2016, 12:37am

oops, sorry, missed this thread. Yes! I definitely plan to make a partis biobox, in fact we had a summer student working on it last year, and there’s an issue as well. Maybe I should poke at that this week…

ematsen · October 11, 2016, 6:37pm

Action on a thread relevant to this one:

javh · October 6, 2017, 6:10pm

We finally made some progress on this on our end (see here). Some gotchas we encountered along the way:

1. Singularity support. Because Docker requires sudo access, it’s generally not an option on computing clusters. You can pull images from Docker Hub into Singularity 2.3+ fairly easily and run them unprivileged.
2. We needed a meta-versioning system, both for the image itself and to track the versions of bundled software. We ended up going with a fairly non-portable solution involving a couple of custom scripts: version numbers for software are kept in a yaml file alongside the Dockerfile, and these files are parsed during build to determine which software versions to install.
3. We used version number tags on our repositories and mapped them to image tags on Docker Hub to create versioned builds. It works fine, but it’s not as automated as I’d like.
4. We did something similar for tracking build information, such as the image build date and changesets for installations from repositories that are unversioned, storing the information inside the image as a /Build.yaml file.
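
For readers wondering what the versions-file approach might look like, here is a minimal sketch. It is not the scripts described above: it assumes a flat file of `name: version` lines, and the image name, package names and version numbers are all made up. The only Docker features used are the `--tag` and `--build-arg` options of `docker build`.

```python
def read_versions(text):
    """Parse flat 'name: version' lines; blank lines and # comments are skipped."""
    versions = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition(":")
        if not sep or not version.strip():
            raise ValueError(f"expected 'name: version', got {raw!r}")
        # Drop optional quotes around the version string.
        versions[name.strip()] = version.strip().strip("'\"")
    return versions


def build_command(image, image_version, versions):
    """Turn pinned versions into a 'docker build' argv list."""
    argv = ["docker", "build", "--tag", f"{image}:{image_version}"]
    for name, version in sorted(versions.items()):
        # Build-arg names are derived from package names, e.g. MY_TOOL_VERSION.
        arg_name = name.upper().replace("-", "_") + "_VERSION"
        argv += ["--build-arg", f"{arg_name}={version}"]
    argv.append(".")
    return argv


# Illustrative file contents; the packages and numbers are placeholders.
example = """
# pinned software versions
igblast: 1.2.3
my-tool: '0.4.0'
"""

if __name__ == "__main__":
    print(" ".join(build_command("example/suite", "2.0.0", read_versions(example))))
```

For the build arguments to take effect, the Dockerfile would need a matching `ARG` declaration for each name. The same dictionary could also be written into the image to record what was installed.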

It’s my understanding, possibly incorrect, that cwltool is currently dependent upon docker and won’t work with singularity images. Does anyone know anything about this? We haven’t gotten very far into CWL yet, but this would be a deal breaker for us if that’s the case, because we couldn’t use it on our computing cluster.

Also, I’m not terribly thrilled with how we implemented the meta-versioning system (items 2-3), because it’s not very generalized. I couldn’t think of anything better at the time. If anyone knows of a more robust solution that isn’t too complicated, I’d love to hear about it.