Summary statistics for repertoires

ematsen · April 15, 2017, 5:27pm

Immune repertoire sequence data sets are large and complex. Hence a lot of effort has gone into finding ways of characterizing repertoires. One way to do so is with summary statistics, meaning ways of turning a pile of sequences into collections of numbers that can be easily compared.

An important application of summary statistics is in comparing simulations to data. When we are simulating data to benchmark performance of immune repertoire sequence analysis tools, we must ensure that the simulated data recapitulates characteristics of real data sets. We can do this by checking whether the summary statistics of the simulated data have values similar to those of the real data. In this way we can benchmark the benchmark data sets.

Here I’ve started a list of summary statistics that might be useful. I’ve focused on BCR repertoires, because they have the additional complication of being affinity matured. One could pick a set of TCR summary statistics by just taking those that are relevant to the TCR case.

When considering summary statistics, my general sense is that more is better, at least initially. Later on, we can identify the most discriminative summary statistics, or run principal components analysis in this space to aggregate them. We should aspire to find statistics that are robust to noise.

Of course, I’d love to hear any general feedback on this approach.

“Naked” sequences (only error-corrected & paired-end assembled)

pairwise distance distribution
distributions of distance to closest, second closest, etc.
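
A minimal sketch of the two "naked sequence" statistics above, assuming equal-length sequences and Hamming distance (the list does not fix a distance, so both the metric and the function names here are illustrative choices). It computes all pairs directly, so it is only practical for small samples or subsamples.

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    # Assumption: sequences are equal length; other metrics can be passed in instead.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def pairwise_distance_counts(seqs, dist=hamming):
    # Histogram of distances over all unordered pairs of sequences.
    return Counter(dist(a, b) for a, b in combinations(seqs, 2))

def kth_nearest_distances(seqs, k=1, dist=hamming):
    # For each sequence, the distance to its k-th closest other sequence.
    out = []
    for i, a in enumerate(seqs):
        ds = sorted(dist(a, b) for j, b in enumerate(seqs) if j != i)
        out.append(ds[k - 1])
    return out

seqs = ["ACGT", "ACGA", "TCGA", "GGGG"]
pair_counts = pairwise_distance_counts(seqs)
closest = kth_nearest_distances(seqs, k=1)
second_closest = kth_nearest_distances(seqs, k=2)
```

Comparing the resulting histograms between real and simulated data (for example with a two-sample distance between distributions) is one way to turn these into a single number.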

VDJ-aligned sequences

joint distribution of germline gene use
distribution of distance from naive to mature
distribution of distance to second closest naive sequence (perhaps with a different V gene?)
distribution of junction lengths
distribution of distances between CDR3s

Clustered sequences

cluster size distribution
estimated total clonal diversity (many estimators following work of Efron and Thisted)
mutation models (e.g. S5F)
selection estimates (e.g. synonymous/nonsynonymous rate)

Trees

tree balance (Mooers and Heard)
graph theoretical features (many papers of @ddw and Ramit Mehr)

mikhail.shugay · April 16, 2017, 2:55am

distribution of distances between CDR3s

You mean Hamming distances?

Most of the points from VDJ-aligned sequences can also be summarized using VDJ rearrangement models (see Murugan 2012 and all following works, one of them covers BCRs). These models put V/D+J segment usage, V/J deletions and insertions into a single probabilistic model and run EM to estimate probabilities. This way one can get all the aforementioned statistics + remove biases coming from correlations in data e.g. high IGHVX usage -> IGHVX is prone to 3' deletions -> shorter CDR3s.

A couple of other things to consider:

Isotype usage and isotype switching
Physicochemical features of CDRs: GRAVY, etc
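
As a small illustration of the GRAVY suggestion: GRAVY is the mean Kyte-Doolittle hydropathy over the residues of a peptide. The sketch below assumes the input is an amino acid string containing only the 20 standard residues; the function name and the example CDR3 are made up for illustration.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(peptide):
    # Grand average of hydropathy: sum of residue scores divided by length.
    if not peptide:
        raise ValueError("empty peptide")
    try:
        return sum(KYTE_DOOLITTLE[aa] for aa in peptide.upper()) / len(peptide)
    except KeyError as err:
        # Assumption: ambiguous or non-standard residues are treated as errors.
        raise ValueError(f"non-standard residue: {err}") from None

# Hypothetical CDR3 amino acid sequence, for illustration only.
example_score = gravy("CARDYW")
```

The distribution of this score across all CDR3s in a repertoire would then be the summary statistic.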

ematsen · April 16, 2017, 8:26pm

Thanks for pointing that out, Mike. Indeed I didn’t specify what I meant by sequence distance.

Yes, distances could be Hamming for sequences of equal length, or one could use the Levenshtein distance for ones that are not. Another perspective will come from a distance that is BCR specific, such as the HLP17 model or something else that takes hotspots into account.
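
To make the Hamming/Levenshtein choice concrete, here is a minimal sketch. The dispatcher `sequence_distance` is a hypothetical helper that follows the rule stated above (Hamming for equal lengths, Levenshtein otherwise); a BCR-specific distance would need a separate implementation.

```python
def levenshtein(a, b):
    # Standard dynamic program over edit operations (insert, delete, substitute),
    # keeping only the previous row of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def sequence_distance(a, b):
    # Hypothetical helper: Hamming when lengths match, Levenshtein otherwise.
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b))
    return levenshtein(a, b)
```

One caveat with mixing the two: for equal-length sequences that differ by a shift, Hamming can be much larger than Levenshtein (for "ACGT" and "CGTA" the values are 4 and 2), so using Levenshtein throughout may give a more consistent distribution.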

I agree with you about the excellent work from Murugan & co, though I note in passing that partis builds analogous HMM models.

psathyrella · April 16, 2017, 11:08pm

Those all seem reasonable. I’d suggest also

VDJ-aligned sequences

distance from naive to mature (assuming this means mutation freq/rate?) I find it’s important to subset down to V, D, and J, and also within-CDR3. And in all cases you have to decide whether you’re dividing number of mutations by sequence length. I find I generally need to look at both.
per-gene, and per-gene-per-position mutation rates
distributions of insertion and deletion lengths
if we’re talking about validating simulation, and validating inference on that simulation, distance between inferred and true naive is probably the most informative statistic.

Clustered sequences

for cases where we’re validating simulation, or validating inference on simulation, the purity/completeness of clusters (defined here, and alternatively called precision/sensitivity) is a good way to measure the similarity of two partitions.
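
A sketch of purity and completeness under one common per-sequence formulation, which is an assumption here and may differ in detail from the linked definition: for each sequence, purity is the fraction of its inferred cluster that truly belongs to its clonal family, and completeness is the fraction of its true family that landed in its inferred cluster. Both are then averaged over sequences. The input format (dicts from sequence id to cluster id) is also an illustrative choice.

```python
from collections import defaultdict

def _members(labels):
    # Invert {seq_id: cluster_id} into {cluster_id: set of seq_ids}.
    groups = defaultdict(set)
    for seq_id, cluster_id in labels.items():
        groups[cluster_id].add(seq_id)
    return groups

def purity_completeness(true_labels, inferred_labels):
    if set(true_labels) != set(inferred_labels):
        raise ValueError("both partitions must cover the same sequences")
    true_groups = _members(true_labels)
    inferred_groups = _members(inferred_labels)
    purity = completeness = 0.0
    for seq_id in true_labels:
        family = true_groups[true_labels[seq_id]]
        cluster = inferred_groups[inferred_labels[seq_id]]
        overlap = len(family & cluster)
        purity += overlap / len(cluster)
        completeness += overlap / len(family)
    n = len(true_labels)
    return purity / n, completeness / n

# Toy example: one true family is split and partly merged with another.
true_labels = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
inferred_labels = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
purity, completeness = purity_completeness(true_labels, inferred_labels)
```

Over-splitting keeps purity high but lowers completeness, and over-merging does the reverse, which is why it helps to report the two together.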

javh · April 24, 2017, 3:45pm

Naked sequences

GC content
Hot/cold spot frequencies
Nei’s nucleotide diversity (not realistic for large data sets)

VDJ-aligned sequences

Motif substitution and targeting rates (mutation models)

Clustered sequences

Evenness of clone size distributions derived from Hill numbers at various values of q (e.g., ~1, 2, 3, 4).
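
For reference, a minimal sketch of Hill numbers computed from clone sizes. The Hill number of order q is (sum_i p_i^q)^(1/(1-q)), with the q -> 1 limit equal to the exponential of Shannon entropy. Defining evenness as the Hill number divided by observed richness is one common convention and is an assumption here, as are the function names.

```python
import math

def hill_number(sizes, q):
    # sizes: clone sizes (counts); zero-size clones are ignored.
    total = sum(sizes)
    p = [s / total for s in sizes if s > 0]
    if abs(q - 1.0) < 1e-9:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

def evenness(sizes, q):
    # Assumed convention: Hill number relative to observed richness (q = 0).
    richness = sum(1 for s in sizes if s > 0)
    return hill_number(sizes, q) / richness

clone_sizes = [8, 1, 1]
profile = {q: evenness(clone_sizes, q) for q in (1, 2, 3, 4)}
```

A perfectly even repertoire gives evenness 1 at every q, and dominance by a few large clones pulls the values down faster as q grows.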

ematsen · April 24, 2017, 5:47pm

Thanks, Jason!

The original paper is dense, so just to clarify, the Nei-Li nucleotide diversity is just the average (across pairs of sequences) of the average (across sites) number of nucleotide differences (Wikipedia).
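
In code, that definition looks roughly like the sketch below, which assumes the sequences are already aligned to a common length and weights every pair equally. The function name is illustrative.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    # Average over pairs of the per-site fraction of differing nucleotides.
    # Assumption: all sequences are aligned and have the same length.
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        total += sum(x != y for x, y in zip(a, b)) / length
        n_pairs += 1
    return total / n_pairs

pi = nucleotide_diversity(["ACGT", "ACGA", "TCGA"])
```

The loop over all pairs is quadratic in the number of sequences, which is the practical limitation for large data sets; averaging over a random subsample of pairs would be one way to approximate it.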

ematsen · April 25, 2017, 1:46am

In discussion with @caschramm @bussec @ctwatson @initoby @jianye today, people brought up some nice points.

Orthogonality: if we can do a perfect job of fitting each summary statistic, then that’s great. However, we will probably not be able to do so. In this case it’s worth thinking about which statistics overlap with which others, so that we do not double-count in some aggregate loss function.

Implementation: we’d like to work together towards finding implementations of these statistics and assembling them somewhere public.

Robustness: I mentioned above that we’d like statistics that are robust to noise, but the discussants today had a nice formulation. We should look for statistics that do a good job of separating data sets, but are similar between (technical and biological) replicate experiments.
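
One possible way to make this formulation quantitative, offered only as a sketch and not as something agreed in the discussion: for a scalar statistic, compare the variance of per-data-set means to the average variance among replicates within a data set. The score and its name are assumptions for illustration.

```python
from statistics import mean, pvariance

def separation_score(values_by_dataset):
    # values_by_dataset: {dataset name: [statistic value for each replicate]}.
    # High score: data sets differ a lot while replicates agree with each other.
    group_means = [mean(v) for v in values_by_dataset.values()]
    between = pvariance(group_means)
    within = mean(pvariance(v) for v in values_by_dataset.values())
    return between / within if within > 0 else float("inf")

# Made-up values of one statistic for two data sets with two replicates each.
score = separation_score({"A": [1.0, 1.2], "B": [3.0, 3.2]})
```

Ranking candidate statistics by a score like this would favor those that separate data sets while staying stable across technical and biological replicates.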

Thanks everyone, and let me know if you’d like to join the conversation.

brandenolson · June 27, 2017, 11:24pm

This may be a vague or ill-defined question, but is it possible to obtain sufficient statistics for repertoires or phylogenies, under common models? I ask because ABC performs optimally when the difference metric operates on sufficient statistics of the data. The only seemingly relevant paper I’ve found is this one, though it’s not particularly well-cited.
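
For readers unfamiliar with the setup, here is a toy rejection-ABC sketch in which the summary statistic is known to be sufficient: the sample mean of normal draws with known variance. Everything in it (the model, the uniform prior, the tolerance) is an illustrative assumption and has nothing repertoire-specific in it.

```python
import random
from statistics import mean

def simulate_summary(mu, n, rng):
    # Toy model: n draws from Normal(mu, 1). With known variance,
    # the sample mean is a sufficient statistic for mu.
    return mean(rng.gauss(mu, 1.0) for _ in range(n))

def abc_rejection(observed, n, n_draws, tol, rng):
    # Keep prior draws whose simulated summary lands within tol of the observed one.
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)  # assumed flat prior on mu
        if abs(simulate_summary(mu, n, rng) - observed) < tol:
            accepted.append(mu)
    return accepted

rng = random.Random(0)
accepted = abc_rejection(observed=2.0, n=50, n_draws=2000, tol=0.2, rng=rng)
```

The accepted values approximate the posterior for mu. With a non-sufficient summary the same procedure still runs, but it targets the posterior given the summary instead of the posterior given the full data, which is the concern raised in the question.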