How to decide on an example data set to use for testing software?


I could be wrong, but I think what Erick’s getting at is that we want to handle all cases that we might see in real repertoires. Throwing all known alleles into one sample is readily achievable in the sense that it’s easy to simulate, but it’s simulating a situation which we will never see in real data. It thus tells us little about the performance that we care about on real repertoires. As Erick says, the way to test performance on the entire germline set is to simulate many different samples with many different germline repertoires, not to use all the germlines in one sample.


I thought the ultimate goal was to make sure a tool can handle all alleles in any sample. As such, I would focus on what the performance is on all alleles regardless of their frequencies in a population. If we can do that, then it would not matter what context these alleles are in.

On the other hand, if you base your simulation on a few real subjects (I assume you wouldn’t have too many such subjects whose genotypes are known), then you will end up over-testing high-frequency alleles and under-testing (or not testing) low-frequency ones. As for assessing performance on real repertoires, I am not sure how much you’ll get from such simulations, since the sequences are simulated anyway. I’d think testing on real sequences isolated from subjects with known genotypes is much more valuable.


Hello Jian, I think that I’ve understood a basic difference in perspective between us.

Our perspective is simply that sequences sit in repertoires, and that it’s most powerful to understand them in this way. We can use characteristics of the entire repertoire to tell us about each individual sequence. As a simple example, if we have a large repertoire sample and only one sequence is inferred to use a given germline V gene, that inference is probably incorrect: it would be very improbable for there to be only one rearranged receptor sequence using that gene. Most modern applications of immune receptor computational tools are to an entire sample at a time or more, so leveraging properties of the whole repertoire can be useful in practice.
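
To make the singleton example concrete, here is a minimal Python sketch of that kind of repertoire-level check. The function name, the `min_support` threshold and the example calls are all made up for illustration; real tools use more careful statistics than a raw count.

```python
from collections import Counter


def flag_singleton_genes(v_calls, min_support=2):
    """Return the V gene calls supported by fewer than min_support sequences."""
    counts = Counter(v_calls)
    return {gene for gene, n in counts.items() if n < min_support}


# Illustrative (made-up) per-sequence V gene calls for one repertoire sample.
calls = ["IGHV1-69*01"] * 500 + ["IGHV3-23*01"] * 800 + ["IGHV4-59*08"]

suspect = flag_singleton_genes(calls)
print(suspect)  # {'IGHV4-59*08'}
```

The point is only that the check needs the whole sample: no single sequence, viewed alone, tells you that its gene call is an outlier.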

So, if we are going to be testing methods that use whole-repertoire information to say something about each individual sequence, we need to be simulating whole repertoires. Because the result of such a simulation is a function of all of the sequences in the sample, it becomes difficult to disentangle the contribution to accuracy of each allele in the simulated repertoire. For example, simulating using two similar alleles will make for a more difficult problem than simulating from two very different ones. If we want to score a per-allele performance, do we do so in a simulated repertoire with other such “distracting” alleles or not?

There is also a more practical reason: at the end of the day people are going to want one or a few numbers summarizing performance. If we score performance on each allele individually, that’s hundreds of V alleles, but then we will need to test those in combination with the various D and J alleles. The number of combinations is quite large (of course, that’s the point :smile:) and too much for us to really digest. This motivates summaries across all these trials. I would argue that the most natural summary would be an average performance weighted by the frequency of occurrence of the various alleles, and we are back to looking at the allele frequency again! By simulating from repertoires with these weights built in, we don’t have to try all combinations or do any such post-summarization: it comes for free.
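
As a rough illustration of the weighting argument, the sketch below compares an unweighted mean of per-allele accuracies with a frequency-weighted one. The allele labels, accuracies and frequencies are invented for the example and do not come from any real benchmark.

```python
def weighted_accuracy(per_allele_accuracy, allele_frequency):
    """Average per-allele accuracy, weighting each allele by its frequency."""
    total = sum(allele_frequency.values())
    weighted_sum = sum(
        per_allele_accuracy[allele] * freq
        for allele, freq in allele_frequency.items()
    )
    return weighted_sum / total


# Made-up numbers: allele C is rare and hard, allele A is common and easy.
accuracy = {"A": 0.99, "B": 0.90, "C": 0.50}
frequency = {"A": 0.70, "B": 0.25, "C": 0.05}

unweighted = sum(accuracy.values()) / len(accuracy)
weighted = weighted_accuracy(accuracy, frequency)

print(round(unweighted, 3))  # 0.797
print(round(weighted, 3))    # 0.943
```

If sequences are simulated with these allele frequencies, the per-sequence accuracy over the sample estimates the weighted figure directly, which is the sense in which the summary comes for free.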

It is true that there aren’t many individuals that have been genotyped by direct sequencing of their unrearranged germline genes. However, there are now many tools that can be used to infer genotypes directly from rearranged sequences. Although these tools are not perfect, I think that we can all agree that a whole-repertoire simulation based on germline inference using these tools is going to be a lot more realistic than simulating a repertoire using all of the alleles in IMGT together. No single individual has ever had all of these alleles.

Thanks for thinking about this with me. Any other voices out there? @javh, @a.collins, @laserson?


I agree with you here, Erick. I think tests of such tools, including inference tools, should do their best to model and test on “real life” repertoires. And I think this includes understanding how well certain tools do under different scenarios.


On the more mundane side of things, there is a question of how to just share these sets of sequences.

My proposal is not surprising: GitHub. These sequence files, when compressed, are not big. Although storing compressed files in git is an abomination, it has the following advantages.

  • machine-downloadable directly with stable URLs
  • free
  • familiar and easy
  • we already have an AIRR presence there
  • as likely to be around in 5 years as anything else besides NCBI (though watch those funding cuts!)

If people want to propose something else, such as figshare, say so.


I have recently been looking in more detail into Zenodo and think it would fit this purpose very well. Zenodo was created for sharing sets of binary files that do not profit from a full-fledged versioning system like git. It automatically creates a DOI upon public release of a dataset. Basic versioning is provided via the DOI metadata record, which links e.g. the previous and the updated version of a dataset. Otherwise:

  • download via stable URL or DOI, simple web-based upload procedure
  • free and open
  • open REST API for deposition and retrieval available
  • Supports “communities” (curated collections of materials), e.g. AIRR
  • Hosted and funded by CERN, so it should be around until the LHC finally manages to create black holes…


As follow-up on my previous post, Zenodo has now implemented DOI versioning:

The implementation is based on “concept DOIs”, which collect and link DOIs to the individual versions. The concept DOI itself will always refer a user to the most recent version of a document. The version-specific DOIs will not change; however, a user following one will receive a message that an updated version is available.


OK, with help from @javh and @krdav, I’ve uploaded a data set to Zenodo:

It was pretty painless. You can see that I was asked lots of questions, and did a somewhat OK job filling it out. For example, I have citations in the data set description, but didn’t bother putting every author on every related paper. I also didn’t cite any grants.

I also didn’t think terribly hard about licensing, but here were the options:

Feel free to express your opinion about what should be used.


Hi Erick,

As promised during our last discussion, I’ve uploaded some of our data to Zenodo. Hope this dataset will be of use.


Thanks @mikhail.shugay. If you want, you can add this data set to the AIRR “community” on Zenodo, which we can use to collect everything in one place.


Done. Note that the dataset doesn’t contain any description among the uploaded files, just the Zenodo-based description. Once we agree on the description and metadata formats I’ll update the upload with a


Hello everyone–

@javh has sent a nice README in Markdown format for his deposition that can be a template:

Processed sequencing data from BioProject PRJNA349143.

### Study Design

Samples were collected from human volunteers as described in Laserson and Vigneault et al, 2014 (1).
Briefly, blood samples were collected from three individuals both pre- and post-vaccination for
seasonal influenza. Samples were collected for sequencing at time points
-8 days, -2 days, -1 hour, +1 hour, +1 day, +3 days, +7 days, +14 days,
+21 days and +28 days relative to injection with seasonal influenza vaccine.

### Library Preparation and Sequencing

The original samples from Laserson and Vigneault et al, 2014 (1) were re-sequenced as described in
Gupta et al, 2017 (2). Briefly, sequencing libraries were prepared from mRNA using 5'RACE with
addition of 17-nucleotide unique molecular identifiers (UMIs). Amplification was performed using
constant region primers specific to IGHA, IGHD, IGHE, IGHG, IGHM, IGKC and
IGLC. Sequencing was conducted on the Illumina MiSeq platform using the 600 cycle kit
with 325 cycles for read 1 and 275 cycles for read 2. A 10% PhiX spike-in was added
for sequencing.

### Data Processing

Sequences were processed using the pRESTO (3) and
Change-O (4) toolkits as described in Gupta et al, 2017 (2).

Note, the provided data has been filtered significantly, including the removal of
sequences that fail V(D)J alignment and the exclusion of non-functional sequences.

### Format

Processed sequences are provided in FASTA format annotated using the
pRESTO scheme.

Annotations included are as follows:

+ **CONSCOUNT:**  Raw read count from which UMI consensus sequences were generated,
  summed over all UMIs for the given unique sequence.
+ **DUPCOUNT:** UMI count for the given unique sequence.
+ **PRCONS:**  Constant region primer (isotype).
+ **SUBJECT:**  Subject identifier.
+ **TIME_POINT:**  Time point label.

### Citations

1. Laserson U and Vigneault F, et al. High-resolution antibody dynamics of vaccine-induced immune responses. Proc Natl Acad Sci USA 111, 4928-33 (2014).
2. Gupta NT, et al. Hierarchical Clustering Can Identify B Cell Clones with High Confidence in Ig Repertoire Sequencing Data. J Immunol 1601850 (2017).
3. Vander Heiden JA and Yaari G, et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930–2 (2014).
4. Gupta NT and Vander Heiden JA, et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–8 (2015).

I’ve updated it here:

Note that you can process the Markdown to HTML and paste that into the Zenodo text box (header tags don’t work, but oh well).

In general, we discussed the following guidelines:

  • Once you upload a data set, apply to have it be part of
  • In the absence of many choices (public domain might be the most appropriate) we decided on Creative Commons Attribution 4.0
  • Don’t forget a README!


Next we’d like to get some simulated data up as well. For a README it seems sensible to get

  • an overall summary of what this data is supposed to represent
  • code version (preferably a git commit or equivalent)
  • command used to produce the simulation
  • pointers to any other files that were used to make the simulation (e.g. what germline database was used)

Anything else?
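
One way to keep these README items from being forgotten is to generate the file from the same script that runs the simulation. Below is a minimal sketch, assuming the code lives in a git checkout; `build_readme`, the section names, the `my-simulator` command line and the file names are all hypothetical, and only `git rev-parse HEAD` is a real command.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the commit hash checked out in repo_dir (needs git on PATH)."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def build_readme(summary, commit, command, input_files):
    """Assemble a Markdown README covering the items listed above."""
    lines = [
        summary, "",
        "### Code Version", "", commit, "",
        "### Command", "", "    " + command, "",
        "### Input Files", "",
    ]
    lines += ["+ " + name for name in input_files]
    return "\n".join(lines) + "\n"


# Hypothetical values; in practice commit would come from current_git_commit().
readme = build_readme(
    summary="Simulated heavy-chain repertoire for benchmarking (illustrative).",
    commit="PLACEHOLDER_COMMIT_HASH",
    command="my-simulator --n-seqs 1000 --germline-dir germlines/",
    input_files=["germlines/ (germline database used for the simulation)"],
)
print(readme)
```

The headings follow the `###` style of the README template above so that simulated and real depositions look alike.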


It would probably be a good idea to also codify the minimum simulation truth info that needs to be included. That is, presumably everyone would put the correct V, D, and J genes, but it’s also important to get the number of deleted bases for 5’ and 3’ deletions, the VD and DJ inserted bases, and, as you say, the germline set used. Also, to reduce errors in translating from the germline set, we should probably explicitly require the mutated positions and SHM indels. Perhaps also a column telling us which other sequences in the sample are clonally related to this sequence, as well as tree info for each family.
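
To give people something concrete to argue about, here is one possible shape for a per-sequence truth record, written out as a tab-separated table. Every column name below is a suggestion of mine rather than an agreed standard, the example values are invented, and tree info is left out because it probably belongs in a separate per-family file.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SimTruth:
    """Hypothetical minimum truth info for one simulated sequence."""
    sequence_id: str
    v_gene: str
    d_gene: str
    j_gene: str
    v_5p_del: int
    v_3p_del: int
    d_5p_del: int
    d_3p_del: int
    j_5p_del: int
    j_3p_del: int
    vd_insertion: str       # inserted bases between V and D
    dj_insertion: str       # inserted bases between D and J
    mutated_positions: str  # comma-separated positions of SHM point mutations
    shm_indels: str         # free-text description, empty if none
    clone_id: str           # sequences sharing a clone_id are clonally related
    germline_set: str       # name/version of the germline set used


def write_truth_table(records, handle):
    """Write records as a tab-separated table with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=[f.name for f in fields(SimTruth)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# One made-up record, written to an in-memory buffer.
example = SimTruth(
    sequence_id="seq1", v_gene="IGHV3-23*01", d_gene="IGHD3-10*01",
    j_gene="IGHJ4*02", v_5p_del=0, v_3p_del=2, d_5p_del=4, d_3p_del=3,
    j_5p_del=5, j_3p_del=0, vd_insertion="ACG", dj_insertion="TT",
    mutated_positions="12,57,201", shm_indels="", clone_id="clone7",
    germline_set="hypothetical-germline-set-v1",
)
buffer = io.StringIO()
write_truth_table([example], buffer)
print(buffer.getvalue())
```

Storing the clone as a shared `clone_id` column avoids listing every related sequence on every row.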