Summary statistics for repertoires


Immune repertoire sequence data sets are large and complex, so a lot of effort has gone into finding ways of characterizing repertoires. One way to do so is with summary statistics: ways of turning a pile of sequences into collections of numbers that can be easily compared.

An important application of summary statistics is in comparing simulations to data. When we simulate data to benchmark the performance of immune repertoire sequence analysis tools, we must ensure that the simulated data recapitulate the characteristics of real data sets. We can do this by checking whether the summary statistics of the simulated data have values similar to those of the real data. In this way we can benchmark the benchmark data sets.
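As a minimal sketch of such a comparison, assuming the summary statistic yields one number per sequence (for example a junction length), the two-sample Kolmogorov-Smirnov statistic measures the largest gap between the empirical distributions of real and simulated values. The function name `ks_statistic` is hypothetical, and the KS statistic is just one possible choice of discrepancy, not something prescribed above.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples.

    Returns 0.0 for identical empirical distributions and 1.0 for
    samples whose ranges do not overlap at all.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to evaluate the gap at every observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A small value for, say, `ks_statistic(real_junction_lengths, simulated_junction_lengths)` would suggest the simulation reproduces that particular statistic; both argument names are placeholders.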

Here I’ve started a list of summary statistics that might be useful. I’ve focused on BCR repertoires, because they have the additional complication of being affinity matured. One could pick a set of TCR summary statistics by just taking those that are relevant to the TCR case.

When considering summary statistics, my general sense is that more is better, at least initially. Later on, we can look for the most discriminative summary statistics, or run a principal component analysis in this space to aggregate them. We should aspire to find statistics that are robust to noise.

Of course, I’d love to hear any general feedback on this approach.

“Naked” sequences (only error-corrected & paired-end assembled)

  • pairwise distance distribution
  • distributions of distance to closest, second closest, etc.

VDJ-aligned sequences

  • joint distribution of germline gene use
  • distribution of distance from naive to mature
  • distribution of distance to second closest naive sequence (perhaps with a different V gene?)
  • distribution of junction lengths
  • distribution of distances between CDR3s

Clustered sequences

  • cluster size distribution
  • estimated total clonal diversity (many estimators following work of Efron and Thisted)
  • mutation models (e.g. S5F)
  • selection estimates (e.g. synonymous/nonsynonymous rate)



distribution of distances between CDR3s

You mean Hamming distances?

Most of the points from VDJ-aligned sequences can also be summarized using VDJ rearrangement models (see Murugan 2012 and the works that followed, one of which covers BCRs). These models put V/D+J segment usage, V/J deletions and insertions into a single probabilistic model and run EM to estimate the probabilities. This way one can get all the aforementioned statistics and also remove biases coming from correlations in the data, e.g. high IGHVX usage -> IGHVX is prone to 3' deletions -> shorter CDR3s.

A couple of other things to consider:

  • Isotype usage and isotype switching
  • Physicochemical features of CDRs: GRAVY, etc
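For reference, GRAVY (grand average of hydropathy) is the mean Kyte-Doolittle hydropathy value over the residues of an amino acid sequence. A minimal sketch, assuming the CDR has already been translated to amino acids with no stop codons or ambiguous residues (the function name `gravy` is illustrative):

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(aa_seq):
    """Mean hydropathy of an amino acid sequence.

    Raises KeyError on characters outside the 20 standard residues,
    and ValueError on an empty sequence.
    """
    if not aa_seq:
        raise ValueError("sequence must be non-empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in aa_seq.upper()) / len(aa_seq)
```

Positive values indicate a more hydrophobic CDR, negative values a more hydrophilic one.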


Thanks for pointing that out, Mike. Indeed I didn’t specify what I meant by sequence distance.

Yes, distances could be Hamming for sequences of equal length, or one could use the Levenshtein distance for ones that are not. Another perspective will come from a distance that is BCR specific, such as the HLP17 model or something else that takes hotspots into account.
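To make the two generic options concrete, here is a minimal pure-Python sketch of Hamming and Levenshtein distance, together with the pairwise and k-th nearest neighbor distance distributions from the "naked" sequence list. All function names are illustrative, and the quadratic all-pairs loops are only practical for small samples; a real repertoire would need subsampling or a dedicated library.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def pairwise_distances(seqs, dist=levenshtein):
    """Distances for all unordered pairs of sequences."""
    return [dist(a, b) for a, b in combinations(seqs, 2)]


def kth_nearest_distances(seqs, k=1, dist=levenshtein):
    """For each sequence, the distance to its k-th closest other sequence."""
    result = []
    for i, s in enumerate(seqs):
        others = sorted(dist(s, t) for j, t in enumerate(seqs) if j != i)
        result.append(others[k - 1])
    return result
```

Passing `dist=hamming` restricts the comparison to equal-length sequences such as CDR3s of a fixed length; a BCR-specific distance could be swapped in through the same `dist` argument.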

I agree with you about the excellent work from Murugan & co, though note in passing that partis builds analogous HMM models.


Those all seem reasonable. I’d suggest also

VDJ-aligned sequences

  • distance from naive to mature (assuming this means mutation freq/rate?) I find it’s important to subset down to V, D, and J, and also within-CDR3. And in all cases you have to decide whether you’re dividing number of mutations by sequence length. I find I generally need to look at both.
  • per-gene, and per-gene-per-position mutation rates
  • distributions of insertion and deletion lengths
  • if we’re talking about validating simulation, and validating inference on that simulation, distance between inferred and true naive is probably the most informative statistic.
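The count-versus-frequency point in the first bullet can be sketched as follows, assuming the naive and mature sequences are already aligned to the same length. The function name `mutation_summary`, the `region` argument (a half-open `(start, end)` index pair standing in for V, D, J or CDR3 boundaries), and the choice to skip `N` and `-` positions are all assumptions for illustration.

```python
def mutation_summary(naive, mature, region=None):
    """Return (mutation count, mutation frequency) between aligned sequences.

    Positions where either sequence has 'N' or '-' are skipped, so the
    frequency is the count divided by the number of comparable positions.
    """
    if len(naive) != len(mature):
        raise ValueError("sequences must be aligned to the same length")
    start, end = region if region is not None else (0, len(naive))
    pairs = [(n, m) for n, m in zip(naive[start:end], mature[start:end])
             if n not in "N-" and m not in "N-"]
    count = sum(n != m for n, m in pairs)
    freq = count / len(pairs) if pairs else float("nan")
    return count, freq
```

Calling this once per region gives both the raw counts and the length-normalized frequencies, so the two views can be examined side by side.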

Clustered sequences

  • for cases where we’re validating simulation, or validating inference on simulation, the purity/completeness of clusters (defined here, and alternatively called precision/sensitivity) is a good way to measure the similarity of two partitions.
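A minimal sketch of one common per-sequence formulation, which may differ in detail from the linked definition: for each sequence, purity is the fraction of its inferred cluster that truly belongs to its clone, and completeness is the fraction of its true clone that ended up in its inferred cluster; both are then averaged over sequences. The function name and the list-of-lists partition format are assumptions.

```python
def purity_completeness(true_partition, inferred_partition):
    """Mean per-sequence purity and completeness of an inferred partition.

    Each partition is an iterable of clusters, and each cluster is an
    iterable of sequence identifiers. Both must cover the same identifiers.
    """
    true_of = {s: frozenset(c) for c in true_partition for s in c}
    inferred_of = {s: frozenset(c) for c in inferred_partition for s in c}
    if not true_of or set(true_of) != set(inferred_of):
        raise ValueError("partitions must cover the same non-empty id set")
    purity_sum = 0.0
    completeness_sum = 0.0
    for s in true_of:
        shared = len(true_of[s] & inferred_of[s])
        purity_sum += shared / len(inferred_of[s])
        completeness_sum += shared / len(true_of[s])
    n = len(true_of)
    return purity_sum / n, completeness_sum / n
```

Over-merging lowers purity while keeping completeness high, and over-splitting does the reverse, which is why it helps to report both numbers.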


Naked sequences

VDJ-aligned sequences

  • Motif substitution and targeting rates (mutation models)

Clustered sequences

  • Evenness of clone size distributions derived from Hill numbers at various values of q (e.g., ~1, 2, 3, 4).
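For concreteness, the Hill number of order q for clone frequencies p_i is (sum_i p_i^q)^(1/(1-q)), with the q -> 1 limit equal to the exponential of the Shannon entropy. The sketch below computes it from clone sizes and, as one possible evenness definition (an assumption, since several exist), divides by the observed richness, i.e. the Hill number at q = 0.

```python
import math


def hill_number(clone_sizes, q):
    """Hill number (effective number of clones) of order q."""
    total = sum(clone_sizes)
    if total <= 0:
        raise ValueError("need at least one non-empty clone")
    p = [s / total for s in clone_sizes if s > 0]
    if abs(q - 1.0) < 1e-9:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def evenness(clone_sizes, q):
    """Hill number of order q relative to observed richness (q = 0)."""
    return hill_number(clone_sizes, q) / hill_number(clone_sizes, 0)
```

Evaluating `evenness` over a grid of q values gives a profile in which larger q puts more weight on the largest clones.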


Thanks, Jason!

The original paper is dense, so just to clarify, the Nei-Li nucleotide diversity is just the average (across pairs of sequences) of the average (across sites) number of nucleotide differences (Wikipedia).
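That description translates almost directly into code. This is a minimal sketch assuming the sequences are aligned to a common length and that every sequence in the list counts equally (so duplicates should be repeated or handled by weighting, which is not shown); the function name is illustrative.

```python
from itertools import combinations


def nucleotide_diversity(aligned_seqs):
    """Average over sequence pairs of the per-site fraction of differences."""
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must share one non-zero length")
    per_pair = [
        sum(x != y for x, y in zip(a, b)) / length  # average across sites
        for a, b in combinations(aligned_seqs, 2)
    ]
    return sum(per_pair) / len(per_pair)            # average across pairs
```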


In discussion with @caschramm @bussec @ctwatson @initoby @jianye today, people brought up some nice points.

Orthogonality: if we can do a perfect job of fitting each summary statistic, then that’s great. However, we will probably not be able to do so. In this case it’s worth thinking about which statistics overlap with which others, so that we do not double-count them in some aggregate loss function.

Implementation: we’d like to work together towards finding implementations of these statistics and assembling them somewhere public.

Robustness: I mentioned above that we’d like statistics that are robust to noise, but the discussants today had a nice formulation. We should look for statistics that do a good job of separating data sets, but are similar between (technical and biological) replicate experiments.
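One way to make that formulation operational, offered purely as an illustrative assumption rather than anything agreed in the discussion, is to score a scalar statistic by the ratio of its variance between data sets to its average variance across replicates within a data set. The function name `separation_score` and the dictionary input format are hypothetical.

```python
from statistics import mean, pvariance


def separation_score(values_by_dataset):
    """Between-data-set variance divided by mean within-replicate variance.

    `values_by_dataset` maps a data set name to the list of values that
    one scalar statistic takes on that data set's replicates.
    """
    groups = list(values_by_dataset.values())
    if len(groups) < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need >= 2 data sets with >= 2 replicates each")
    between = pvariance([mean(g) for g in groups])
    within = mean([pvariance(g) for g in groups])
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within
```

A high score means the statistic separates data sets while staying stable across replicates; statistics could then be ranked by this score.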

Thanks everyone, and let me know if you’d like to join the conversation.


This may be a vague or ill-defined question, but is it possible to obtain sufficient statistics for repertoires or phylogenies, under common models? I ask because ABC performs optimally when the difference metric operates on sufficient statistics of the data. The only seemingly relevant paper I’ve found is this one, though it’s not particularly well-cited.