Draft AIRR Software WG standard


#1

Hello everyone–

Below is the draft standard recommendation from the Software WG. It’s a group effort, but the heavy lifting was done by @w.lees.

We are soliciting comments on this draft. Please reply to comment! No comment too big or too small.


AIRR Software WG - Guidance for AIRR Software Tools

Version 1.0 (when finalised)

Introduction

The adaptive immune-receptor repertoire (AIRR) Community will benefit greatly from cooperation among groups developing tools and resources for AIRR research. The goal of the AIRR Software Working Group is to promote standards for AIRR software tools and resources in order to enable rigorous and reproducible immune repertoire research at the largest scale possible. As one contribution to this goal, we have established the following standards for software tools. Authors whose tools comply with this standard will, subject to ratification from the AIRR Software WG, be permitted to advertise their tools as being AIRR compliant.

Requirements

Tools must:

  1. Be published in source code form, and hosted on a publicly available repository with a clear versioning system.
  2. Be published under a licence that permits free access, use, modification, and sharing (under terms no more restrictive than the original licence) for non-commercial purposes. Examples of suitable licences, from that point of view, are the Creative Commons Attribution and Attribution-ShareAlike licences, in both commercial and non-commercial forms.
  3. Support community-curated standard file formats and strive for modularity and interoperability with other tools. In particular, tools must read and write the AIRR Data Standards formats that apply to the data they handle.
  4. Include example data (in AIRR standard formats where applicable), and checks for expected output from that data, in order to provide a minimal example of functionality allowing users to check that the software is performing as described.
  5. Provide information about run parameters as part of output.
  6. Provide a Dockerfile that enables a Docker image to be built such that the tool can be used within a container running that image.
  7. Ensure that the Dockerfile is kept up-to-date by providing a container on Docker Hub that is automatically maintained in sync with the current release version of the tool.
  8. Provide at least some level of user support, making it clear what level of support users can expect, and how to obtain it.

Notes

Open Source Licences, Versioned Repositories

Tools in this field are evolving rapidly. In the interests of reproducibility and transparency, published work should be based on tools (and versions of tools) that can be obtained easily by other researchers in the future. To that end, AIRR compliant tools must be published in open repositories such as GitHub or Bitbucket, and we encourage publishing users to provide specifics on the version and configuration of tools that they employ.
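One way a tool could make its version and configuration easy to report (see also Requirement 5) is to write a small record of each run next to its output. The following is only a minimal sketch: the tool name, version string, parameter names, and output file name are hypothetical and are not part of the draft.

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical tool name and version, used for illustration only.
TOOL_NAME = "example-airr-tool"
TOOL_VERSION = "0.1.0"


def build_run_record(parameters, argv=None):
    """Collect the details a reader would need to repeat this run."""
    return {
        "tool": TOOL_NAME,
        "version": TOOL_VERSION,
        # Default to the real command line; allow an explicit one for testing.
        "command_line": list(sys.argv if argv is None else argv),
        "parameters": dict(parameters),
        "started_utc": datetime.now(timezone.utc).isoformat(),
    }


def write_run_record(record, path):
    """Write the record as JSON so it can be stored next to the main output."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

A tool might call `write_run_record(build_run_record({"min_identity": 0.9}), "run_parameters.json")` at start-up; the resulting file can then be archived with the results or quoted in a methods section.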

Community-Curated File Formats

The AIRR Data Representation Working Group has defined standards for the submission of immune receptor repertoire sequencing datasets. Tool authors are requested to support these standards as much as possible, for applicable data sets. The currently implemented standard covers submission to NCBI repositories (BioProject, BioSample, SRA and GenBank). Tool authors can assist by making the submission process as easy and well guided as possible.

Example Data and Checks

Because the installation and operation of the tools in this field can often be complex, we require example data and details of expected output, so that users can confirm that their installation is functioning correctly. Likewise, dependencies (such as germline libraries and other metadata) should be checked when the tool runs, and informative error messages issued if necessary.
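As an illustration of both points, the sketch below checks that required resources exist before a run and compares the output produced from the example data with a stored reference. It is an assumption about how a tool might do this, not something the draft prescribes; the function names, the `DependencyError` class, and the resource labels are hypothetical.

```python
import hashlib
from pathlib import Path


class DependencyError(RuntimeError):
    """Raised when a required resource is missing or empty."""


def check_dependencies(required_files):
    """Fail early, naming every missing or empty resource in one message."""
    problems = []
    for label, path in required_files.items():
        path = Path(path)
        if not path.is_file():
            problems.append(f"{label}: '{path}' was not found")
        elif path.stat().st_size == 0:
            problems.append(f"{label}: '{path}' is empty")
    if problems:
        raise DependencyError(
            "Cannot start because required resources are unavailable:\n  "
            + "\n  ".join(problems)
        )


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def output_matches_expected(actual_path, expected_path):
    """True when the output from the example data is byte-identical to the reference."""
    return sha256_of(actual_path) == sha256_of(expected_path)
```

A byte-for-byte comparison is the simplest check but is only suitable when the output is deterministic; a tool that writes timestamps or run parameters into its output would need a field-by-field comparison instead.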

Dependencies and Containers

Containers

Containers encapsulate everything needed to run a piece of software into a single convenient executable that is largely independent of the user’s environment. Providers of AIRR-related tools must provide a Docker implementation (based on a published Dockerfile) as one download option that users can choose:

  • Containers allow users to evaluate a tool easily, without the need to resolve dependencies, configure the environment, etc.
  • They also provide a way for users to examine a working implementation, reproduce results, and understand the fine details of installation.

To ensure that they are up to date, containers must be built automatically when the current release version of the tool is updated. We recommend the use of Docker Hub for this purpose. Dockerfiles document dependencies clearly, and make it easy for the maintainer to keep the container’s dependencies up to date in subsequent releases.

An example Docker container is provided by the Software WG on GitHub. This example encapsulates IgBLAST, and implements the bioboxes command-line standard.

Workflow

  • At the moment, we do not endorse a specific workflow standard:
    • Technology is evolving too rapidly for us to commit to a particular workflow
    • Typically, AIRR analysis tools have many options and modes, which would make it difficult to support a ‘plug and play’ environment without unduly restricting functionality
  • As tools and workflows evolve, we will keep the position under review and may make stronger technology recommendations in the future.
  • We strongly encourage authors of tools to provide concrete, documented, examples of workflows that employ their tools, together with sample input and output data
  • Likewise we encourage authors of research publications to provide documented workflows that will enable interested readers to reproduce the results: see, for example, https://github.com/cdebourcy/PNAS_immune_aging which embodies the workflow for de Bourcy et al., PNAS, 2017.

Standard Data Sets

The WG is working separately on the development and evaluation of simulated data sets. Lists of published real-world datasets are maintained in the AIRR Forum Wiki.

Support Statements

Tool authors must provide some level of support for the tool. They must state explicitly what level of support is provided, and explain how support should be obtained. We recommend a method, such as the issue tracker on GitHub, that publishes support requests transparently and links resolutions to specific versions or releases. Users are advised to check that the level of support and the frequency of software updates match their expectations before committing to a tool.

Ratification

Authors may submit tools to the AIRR Software WG requesting ratification against the standard. The submission must include reviewable and itemised evidence of compliance with each Requirement listed above.
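To make "itemised evidence" concrete, here is one possible way an author could organise a submission so that each Requirement is paired with something a reviewer can check. This is purely a sketch: the draft does not define a submission format, the requirement labels are paraphrases of the list above, and the `Evidence` class and function names are hypothetical.

```python
from dataclasses import dataclass

# Short labels paraphrasing Requirements 1 to 8 of the draft.
REQUIREMENTS = {
    1: "Source code in a public, versioned repository",
    2: "Licence permitting free access, use, modification and sharing",
    3: "Reads and writes the relevant AIRR Data Standards",
    4: "Example data with checks for expected output",
    5: "Run parameters reported as part of output",
    6: "Dockerfile provided",
    7: "Container on Docker Hub kept in sync with releases",
    8: "Stated level of user support",
}


@dataclass
class Evidence:
    requirement: int  # number of the Requirement this evidence addresses
    location: str     # where a reviewer can verify it, e.g. a URL or file path
    note: str = ""


def missing_requirements(evidence_items):
    """Return the Requirement numbers that have no evidence location yet."""
    covered = {item.requirement for item in evidence_items if item.location.strip()}
    return sorted(set(REQUIREMENTS) - covered)


def format_submission(tool, version, evidence_items):
    """Render one line per Requirement so a reviewer can tick each one off."""
    by_requirement = {item.requirement: item for item in evidence_items}
    lines = [f"Tool: {tool} (version {version})"]
    for number, label in REQUIREMENTS.items():
        item = by_requirement.get(number)
        location = item.location if item and item.location.strip() else "NO EVIDENCE SUPPLIED"
        lines.append(f"{number}. {label}: {location}")
    return "\n".join(lines)
```

Running `missing_requirements` before submitting would show at a glance which Requirements still lack evidence.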

The Software WG will, where appropriate, issue a Certificate of Compliance, stating the version of the tool reviewed and the version of the Standard with which compliance was ratified. After receiving a Certificate, authors will be entitled to claim compliance with the Standard, and may incorporate any artwork provided by AIRR for that purpose.

The Software WG will maintain and publish a list of compliant software.

If a tool does not achieve ratification, the Software WG will provide an explanation, and encourages resubmission once issues have been resolved.

Authors must re-submit tools for ratification following major upgrades or substantial modifications. The Software WG may, at its discretion, request resubmission at any time. If a certified tool subsequently fails ratification, or is not re-submitted in response to a request from the Software WG, compliance may no longer be claimed and the associated artwork may no longer be used.

The Software WG may, at its discretion, issue a new version of this standard at any time. Tools certified against previous version(s) of the standard may continue to claim compliance with those versions and to use the associated artwork. Authors wishing to claim compliance with the new version must submit a new request for certification and may not claim compliance with the new version, or use associated artwork, until and unless certification is obtained.


#2

Great job! Two comments on point 2:

  • Is there any reason why you specifically mentioned the CC licenses? They are great for general material but were never meant to be used for software, and even CC itself recommends against using CC licenses for software.
  • Restricting commercial use (like in the NC variants of the CC licenses) creates substantial legal insecurity for the user and thereby will negatively affect adoption. In addition, it is at variance with the definitions of both Open Source (per OSI) and Free Software (per FSF). Was there any specific motivation behind this?

#3

For some extra context on licensing see this issue on the AIRR standards repo:

We ultimately decided to stick with the CC license for the contents of the repo because it was a mix of software and non-software, with a significant amount of non-software content. Also, we didn’t include the copyleft requirement, which appears to be the major hiccup with applying CC licenses to software.

The reason we use a CC license for Immcantation is largely historical. A dedicated software license would probably be more appropriate, but I don’t know that it’s worth the trouble to change it now.

As CC licenses are a bit of a grey area, I think we could consider them acceptable, but not recommend them specifically (for software).

As for non-commercial licenses, I tend to agree with @bussec. Promoting open source practices fits my scientific sensibilities.


#4

We just met as a WG and discussed these issues. A short update:

Someone made the important point that not everyone is at liberty to choose the licensing terms of their software, and others pointed out that we want to be inclusive.

Thus the current proposal is to have two versions of compliance: one that is basically as above (for those who want a commercial option), and another consisting of those plus an open-source license.

Anyone want to suggest names for these two options? Perhaps AIRR-SW-compliant and AIRR-SW-compliant-open? Acronyms? :thinking:

The commercial-license restriction is thorny. Are there better licenses for open-source? I note that the Rosetta modeling software is an example of such a product, and their licence is here.

And someone told me once that requiring an OSI-approved license is a bad idea, but I can’t remember why.


#5

AIRR-free? This implies that “the users have the freedom to run, copy, distribute, study, change and improve the software” (https://www.gnu.org/philosophy/free-sw.en.html) but does not restrict commercial use. It is stated that “Free software” does not mean “noncommercial.”


#6

I am in favor of not creating any roadblocks to commercial entities who are interested in adopting our standards. However, from my point of view the Free Software/Open Source requirement is a rather integral part of the draft standard above as well as the AIRR Commons ecosystem in general and thus should not be waived that easily.

I assume that AIRR Data Standards compliance is the most important point that commercial entities would be interested in implementing and this is clearly a win-win situation for everyone. We could therefore think about certifying AIRR Standards/Formats/API compliance separately and keep the Free Software requirement for the AIRR Software Standard.

Not sure what you mean by “better” :slight_smile: . If you want to have a permissive license that can be easily included into commercial software, Apache 2.0 is usually a good choice. If you want to prevent commercial use, then AGPL 3.0 provides a strong copyleft clause, even for web applications.

It is an example, for better or for worse: their license is at variance with the freedom to run, distribute, study and modify, as it prohibits redistribution and appropriates the results you generate with it…


#7

Hi All,

Nice work!!!

My main comment (and question) would be around being specific about a container/installation mechanism. Currently, Docker is widely used, but there are other mechanisms for managing software installations. Requiring Docker specifically, in preference to other methods, should be done with a clear understanding that not everyone will believe that Docker is the best mechanism for managing/deploying their software… In particular, Docker has some security issues around user escalation to root on the host system that prevent its use on some platforms. HPC platforms, for example, often do not support Docker as a container mechanism, which may be an issue for analysis tools that would be used on such a platform.

Don’t get me wrong, I like the fact that you are making strong recommendations on using best practices from a software development and deployment perspective. In fact, we are working on a “Turnkey AIRR Repository” and we have used Docker containers for that. So I am a fan of your approach.

I suspect your goal with Docker is to make things easy to use for the general user. If so, then it is probably a good way forward. I just wanted to make sure that you were aware of some of the limitations of Docker.

I suppose it is also worth noting that people have done a lot of work to make it possible to use Docker containers within other container systems (e.g. Singularity, which is typically used on HPC systems) so that the Docker container ecosystem can be taken advantage of… So this may not be a huge issue…

Brian


#8

Thanks for the comment, Brian. Our main aim in requiring a Docker implementation was to ensure that a demonstrable, working model of the software was available for evaluation, and also for consultation should there be issues installing the software in a local environment. We did initially incorporate a reference to Singularity, but decided to remove it as we’re not really seeking to influence how software gets run or installed (which is a point we should perhaps emphasise).


#9

Hi everyone, great work!

Comment: The link to the PNAS workflow throws a 404.

JP


#10

Thanks JP. The PNAS workflow pipeline seems to have been removed from GitHub, which I can’t help seeing as a sad step in the wrong direction. I’ve removed the reference from the draft, but if anyone’s aware of another example please let me know. The idea is to encourage authors to publish code that goes all the way from sequences to figures and tables in the resulting paper, making the entire analysis transparent.


#11

Hi All,

In the Common Repository working group we are planning on creating a list of repositories, the list having two levels:

  1. Listing repositories that store AIRR-seq data in some form.
  2. Listing repositories that store AIRR-seq data in an AIRR compliant form (with AIRR compliant yet to be defined from a repository perspective)

I have a couple of questions in this regard:

a) Do you think that a “repository” falls under your definition of an “AIRR software tool or resource”?

b) Are you planning on creating a list of AIRR compliant software (I assume so), and if so should we use this list to list all of the repositories that are AIRR compliant (rather than creating a separate list)? I am not sure that your AIRR compliance tool requirements completely encompass the definition of AIRR compliance for a repository. Should we try and close the loop on that?

c) Are you planning on listing AIRR related, but not AIRR compliant, software tools? This would be a list provided as a service to the community… Essentially two lists: a list of tools and a list of AIRR compliant tools…

Trying to figure out where the overlaps are…

Brian


#12

Hi Brian,

I’m not sure that a repository falls naturally into our definition, given that we are placing an emphasis on published source code and continuous build, which may not be the most important issues for repositories. Perhaps it would be better to have a similar but parallel standard for them. But happy to discuss if this doesn’t sound right to you.

We do plan to maintain a list of AIRR compliant tools. We would like to give this some prominence on the AIRR website as it develops. We also plan to have a ‘badge’ that developers of AIRR compliant tools can display on their website.

There are already lists of various types of tool (and repository!) on b-t.cr at https://b-t.cr/c/wiki . These are community maintained. It would definitely make sense to indicate AIRR compliance in those lists - and it would also be worth giving them some prominence so that people know they exist and help to keep them up to date.