The idea of having a data set that could be used as a basic test of software functionality has come up several times. What sorts of features do we think this data set should have? Any nominations for one specific data set or another?
Some considerations (just thinking out loud):
–It’s important to specify what test(s) a set is meant to help evaluate—for example, a test might be that an annotator will assign a certain V, D, and J at a certain probability (perhaps constrained by a certain set of germline V, D, and J segments). A corollary is to be aware that some test sets might not be useful for some tests. The implied action item is to write down a set of things we would want to test, and only then think about the datasets that would help test them.
–Consequently, it will be useful to think about test sets as tuples of dataset-test-result. So in the VDJ example above, you would have the dataset - your favorite VDJ annotator (with flags and settings) - VDJ calls. The analogy is to unit testing (from software development). When Felix writes EvenBetterAIRRnnotator, he can benchmark it against the VDJ calls from the previous specific set. Thinking about this as tuples prevents falling into the trap of thinking about those calls as “the right answer.”
–Relatedly, it will be useful to think in terms of the standard measures of and issues related to test performance, including gold standards and ROC curves (true and false positives). This thinking includes the realization that “gold standard” is not the same as “right answer,” and so a better method may well look worse than the gold standard (before, hopefully, supplanting it). This won’t happen with simple outputs (e.g. speed—faster is better) but could easily happen with more complicated ones.
–One can divide datasets into synthetic vs. experimental. Synthetic sets have three advantages: (1) you more often know the right answer(s)—that parenthetical (s) handles probabilisms (see the example above); (2) you can exercise fine control, for example, dialing up the mutation rate in a set of heavy-chain sequences from none to substantial; and (3) they are easier to generate than experimental datasets. Experimental datasets, though, will include things that we have not yet noticed/modeled well enough to make it into the generative models that produce the synthetic datasets. For these reasons, I imagine we will want a mix of synthetic and experimental datasets.
–Folks are unlikely to want to generate test sets for the sake of test sets, and predicting the future is tough. In the network theory world, there are a few venerable datasets that are used over and over to test new methods (karate club membership, condensed-matter physics coauthorship, etc.). They are used because others used them. This puts an emphasis on making it easy—hopefully automatic—to make datasets publicly available, so that some (many? scripting is powerful) will organically, through use, become reference sets.
@rarnaout – thank you for the very detailed post!
I must confess that I wasn’t sufficiently specific in my original post. I think that what you describe is very important, but I was thinking much more modestly: just finding a set of sequences on which we would expect all software to produce some output without crashing, rather than saying anything about what exactly that output should be.
Let’s see if we can head in both directions at once in this thread, unless folks see a need to split them apart.
@ematsen, How about choosing from some dataset that has been published previously, such as Grieff et al @ presto/changeo for example? I would suggest we do a pilot run with each “volunteer scientist” taking the same set of FASTQ file(s) and performing the analysis with the tools agreed upon, and then arranging a time to view/discuss the outcomes. Just a thought, as this could be a relatively easy way to achieve a first pass.
Great! Again, the focus of this direction is modest (see previous post).
With that in mind, how many sequences? A random sample of 10K sequences from Grieff?
Simon Frost’s paper, below, lists three test datasets which I think would be valuable - even if we set the bar low, initially, it would be good to have some datasets that we could do more with over the course of time.
This was a primary topic in the most recent software WG phone call.
Real data sets
Here are the suggestions made for real data sets, with priority given to sequences from genotyped individuals:
- Stanford S22 available here courtesy of @sdwfrost
- An upcoming data set that the @steven.kleinstein group is going to be submitting soon to the SRA, which is paired-end Illumina with UMIs from Personal Genome Project individuals
Simulated data sets
It’s time to start pulling together a list of folks who would be willing to simulate sequences and present the correct answers in an AIRR-determined format for easy comparison. I’d also add a request that people are willing to write a short summary of the means of simulation if it’s not already in a publication. I’m guessing @psathyrella and @sdwfrost would be in given previous conversations, but perhaps others as well? @tbkepler? @rarnaout?
If the primary goal is to run some kind of test (“produce some output without crashing”), it is best to use a sample of the naive T-cell repertoire (human, both TRA and TRB), as it is the most diverse set that also contains non-coding and incomplete rearrangements, and to specify a certain CDR3 extraction ratio. I would also consider sampling uniformly across the spectratype to include both very short and very long CDR3s. We currently have some naive TCR data; it is unpublished, but I can construct this kind of dataset from it, as doing so will not convey any meaningful biological information.
PS. Is there any kind of dedicated repository for posting it? I can run benchmarks for all our tools and provide results as well.
PPS. As for other species: in my experience, monkey TCRs are extremely similar to human, while those of mouse have shorter CDR3s. I’m unaware of the situation with other species and unsure if we need to extend this far.
Happy to simulate, either blindly, or provide code to do the simulations!
Regarding the PGP influenza vaccination subjects, this data is a rerun of the data from @laserson and Francois Vigneault’s flu vaccination paper. I suspect it’ll be posted in about a month, but it’s not up yet. However, the paired TCR data won’t be posted soon, so that lessens the appeal somewhat. Also, I talked to @Daniel_Gadala-Maria briefly about the quality of the Ig locus sequencing, and we don’t know. We’ll have to look into it.
Another good option, which @initoby suggested during the call, is to use the data from Florian and Chris’ twins study, which has UMIs and both TCR and BCR. The raw data is on SRA under accession SRP065626, but it’s a bit trickier to preprocess. However, the processed data is available from ImmPort under accession SDY675.
Thoughts on whether we need both human and mouse test cases?
Although the post was made a month ago, I thought it might be of interest to explain the S22 dataset, and why we believed it was a useful dataset for testing software. This is a dataset that came out of Scott Boyd’s lab, and working with Scott, we inferred that this individual was homozygous for a deletion polymorphism involving six contiguous IGHD genes. This was subsequently confirmed by shotgun sequencing, and was reported in Boyd et al., J Immunol 2010; 184: 6986-92.
I first became interested in the challenges relating to the partitioning of VDJ genes almost 15 years ago, when we realized that VQUEST was not calling D genes accurately. After some years, we came up with iHMMune-align, which did seem to perform better, but there was no gold standard by which to measure the performance of different utilities.
Since the S22 dataset comes from an individual who does not carry the six genes from IGHD3-3 to IGHD2-8, the identification of these genes in alignments of S22 sequences is clearly in error. I therefore still believe that this dataset is a good way of testing D gene alignments.
It was not intended to assess the performance of V and J alignments, and I always expected that in time, other datasets would be identified that could be used in conjunction with S22 to test D gene alignment capabilities. I remain convinced that multiple sets are required if we are to properly test any software. We need datasets of known IGHV gene genotype if we are to test V gene identification properly.
Andrew, you touched on an important issue here, but I am not sure we can have a good solution for it.
In the example you have, when you search using a germline database that includes all D genes, reporting a match to the D gene that is missing from the genome is indeed technically an error. However, this does not seem to be avoidable. We know many D genes are similar (some are highly similar). It is easy to imagine that a rearranged D gene might undergo some somatic mutations such that its sequence is now more similar to the missing D gene than to other D genes that exist in the genome. In a typical situation where the genotype is not presented to a software tool, how can the tool assign the rearranged D gene to a less similar germline D gene (i.e., the “correct” D gene that is present in the genome) rather than a more similar germline D gene (i.e., the “incorrect” missing one)?
Even under the ideal situation where the genotypes are known and you limit your search to a germline database that contains the known genotypes only, you’d still face the problem mentioned above (i.e., a rearranged gene could mutate such that its sequence is now more similar to a germline gene that is not the one that this rearranged gene originates from).
You are right @jianye that the strategy I outlined will include reasonable but incorrect alignments, as mutations can truly make one D gene look like another. Still, I think it is a step in the right direction, because a utility that is aligning poorly will have more such ‘misalignments’. I have always thought that the S22 dataset was just a first step along a path that could lead to robust testing of utility performance for D identification, and that would probably require a number of different sets, with different characteristics.
I’m very late to this discussion, but I thought I’d mention that when I’m testing things I find it really useful to have sequences from a variety of data sets mashed together into a single testing sample.
In other words, all the data sets above sound great, but what about using some sequences from each? Perhaps the biggest problems that this helps to catch relate to different read lengths from different sequencing protocols. I.e., making sure you’re properly handling reads that don’t extend all the way through either V or J, as well as reads that extend well beyond either V or J.
But this also makes it easier to cover a wide range of different mutation levels, which is equally important, and hard to get from one data set. I also find it useful to inject a number of sequences that are completely pathological, garbage in one way or another (typically just not BCR sequences).
I believe you are referring to the genomic data here? If so, I have had a look at these data (phased variant calls anyway). They may be usable for some validation of variants seen in the repseq data, but I would not trust these gDNA calls as “truth” on their own.
As part of the Software Working Group, we are moving towards developing a collaborative validation exercise. We’ve talked a little about various aspects of how we might do this on this thread, but given some recent conversations I wanted to back up and see if we could establish consensus on the large-scale ideas of what this might accomplish.
In my mind, the overall goal of validation should be to estimate the expected accuracy of a tool on an actual biological data set sampled from a real subject (human or otherwise). Here, the definition of accuracy depends on what exactly is being tested; more on that below.
The principle underlying the tests is that for any given observed set of immune receptor sequences, biological or simulated, there is a true underlying process that has led to those observed sequences. These processes are V(D)J rearrangement, clonal expansion and somatic hypermutation (for B cells), then sequence prep and sequencing. Again, our goal is to estimate the deviation of the inferred processes from the underlying processes.
This goal and principle, then, guide us on data set choice for validation experiments. For biological validation data sets for which we know (perhaps partial information about) the history, such as mice with a reduced collection of germline genes, the rest of the process should as much as possible replicate the process happening in wild-type animals. Similarly, when simulating sequences on the computer, we should devise simulation procedures that mimic our understanding of the underlying processes as well as possible.
For example, I would imagine that a high-quality in silico BCR simulation process would consist of
- A subject’s allelic genotype is simulated by drawing germline alleles from the complete set of germline alleles
- Naive cell BCRs are simulated from that genotype using a realistic in silico rearrangement process, including realistic per-gene trimming and insertion distributions
- Experienced cell BCRs are simulated by a process of diversification and nucleotide substitution; this should include known biases of the mutation process such as context/position sensitivity
- Sequencing errors are simulated
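The four steps above could be skeletonized as follows. Every distribution here (uniform trimming, uniform insertion lengths, flat substitution rates) is a stand-in for the realistic, data-derived models the post calls for, clonal expansion is skipped for brevity, and all germline sequences are placeholders.

```python
import random

BASES = "ACGT"

def simulate_subject(germline_v, germline_d, germline_j, n_naive, rng):
    # 1. Allelic genotype: a subset drawn from the complete germline sets.
    genotype = {
        "V": rng.sample(germline_v, 2),
        "D": rng.sample(germline_d, 2),
        "J": rng.sample(germline_j, 1),
    }

    def trim(seq, max_trim=3):
        # Placeholder for realistic per-gene trimming distributions.
        return seq[: len(seq) - rng.randrange(max_trim + 1)]

    def n_insert(max_n=5):
        # Placeholder for realistic N-insertion length distributions.
        return "".join(rng.choice(BASES) for _ in range(rng.randrange(max_n + 1)))

    # 2. Naive BCRs: trimmed V-D-J segments joined with N insertions.
    naive = [
        trim(rng.choice(genotype["V"])) + n_insert()
        + trim(rng.choice(genotype["D"])) + n_insert()
        + trim(rng.choice(genotype["J"]))
        for _ in range(n_naive)
    ]

    def substitute(seq, rate):
        # Uniform point substitution; a real model would be context-sensitive.
        return "".join(
            rng.choice([b for b in BASES if b != c]) if rng.random() < rate else c
            for c in seq
        )

    # 3. Experienced BCRs: somatic hypermutation as point substitutions.
    experienced = [substitute(s, 0.05) for s in naive]
    # 4. Sequencing error: a second, lower-rate round of substitutions.
    reads = [substitute(s, 0.005) for s in experienced]
    return genotype, reads

rng = random.Random(1)
gl_v = ["ACGTACGTACGTACGT", "TGCATGCATGCATGCA", "AACCGGTTAACCGGTT"]
gl_d = ["GGGTTTAAAGGG", "CCCAAATTTCCC", "GGGCCCAAACCC"]
gl_j = ["TTTTCCCCGGGG", "AAAAGGGGCCCC"]
genotype, reads = simulate_subject(gl_v, gl_d, gl_j, 10, rng)
```

The truth (genotype, per-read rearrangement) is recorded by the simulator, which is exactly what the accuracy comparisons below need.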
When assessing accuracy, we measure the deviation of inferences on the simulated sequences compared to the true elements that went into the sequences. For example, compare inferred allelic genotype to the actual genotype, compare inferred per-sequence rearrangement event to the true rearrangement, and compare inferred clonal families to the true clonal families.
By repeating this among many simulated subjects, with large sets of sequences, we can get an idea of how these tools perform on real data.
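In sketch form, with invented calls, the comparisons described here might be scored as per-sequence annotation accuracy and genotype overlap:

```python
def per_sequence_accuracy(true_calls, inferred_calls):
    """Fraction of sequences whose (V, D, J) call exactly matches the truth."""
    hits = sum(1 for t, i in zip(true_calls, inferred_calls) if t == i)
    return hits / len(true_calls)

def genotype_recovery(true_genotype, inferred_genotype):
    """Jaccard overlap between true and inferred allele sets."""
    t, i = set(true_genotype), set(inferred_genotype)
    return len(t & i) / len(t | i)

true_calls = [("IGHV1-2", "IGHD3-3", "IGHJ4")] * 3
inferred   = [("IGHV1-2", "IGHD3-3", "IGHJ4"),
              ("IGHV1-2", "IGHD2-2", "IGHJ4"),   # wrong D call
              ("IGHV1-2", "IGHD3-3", "IGHJ4")]
acc = per_sequence_accuracy(true_calls, inferred)   # -> 0.666...
```

Clonal-family comparison would need a partition-comparison measure rather than exact matching, but the structure is the same: inference vs. recorded truth.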
I know that this perspective isn’t universally shared. @jianye, would you like to present an alternate perspective?
Erick, I think I agree with most of the things you said here. My main point was that the test should reflect what happens in a population, not just a few subjects. As such, we have two options in theory. One is to measure real samples from a very large number of subjects under a large number of different conditions (say, immune responses to all kinds of antigens) to approximate the population. But we know we’ll not have enough such samples. Therefore the other option is to use simulations. Again, to simulate population data, we need to generate digital sequences from the entire germline repertoire we currently have (i.e., not based on genotypes from a few subjects).
Jian, thanks for your thoughts. I’m glad that we agree about many points.
However, we differ here:
First, as you know, a number of the germline alleles present in germline databases are artifacts. For example, see
I bet we can agree we wouldn’t want to use these alleles for simulation.
Second, the objective function we’d like to optimize is the expected error on a real sample. Thus it makes sense to skew simulation in the same way that the germline usage is skewed in real data. For example, if there is a very rare germline allele, then in my opinion bad performance on that gene is less important than bad performance on a common allele. Do you disagree?
I do agree that in order for this strategy to work we need many draws of simulated repertoires to explore the whole space.
I’d think a software tool should handle all cases if that’s at all possible, without leaving any holes in the floor. Certainly a rare allele is less important, but it is still important. Obviously you don’t want someone to use that tool to analyze a sample with rare alleles and get wrong results.
Since sequence similarity search algorithms typically don’t depend on particular features of a sequence, I don’t see why we wouldn’t set our goal for a tool to handle all cases, since this is readily achievable.
Sure, we don’t need to be concerned about fake alleles. But their presence would not really bother me for testing purposes, since they are just sequences that are a little different from real ones and can be handled by the same sequence similarity search algorithm.