What are the outstanding computational challenges in repertoire sequence analysis?

ematsen · June 16, 2016, 6:47pm

In an online discussion, @w.lees suggested that it would be worth enumerating still-unsolved challenges in repertoire sequence analysis. Following the mold of previous such posts, please make suggestions as replies and I’ll add them to the list here:

Practical issues

Make current tools more flexible in the germline sets they can use.

Computational challenges

Personal germline gene databases: how to infer a personal collection of germline genes from a repertoire sample. This includes adding genes that are not in germline gene databases, and restricting to the subset of germline genes present in an individual.
Clonal family inference, a.k.a. finding clones in repertoires.
Specialized phylogenetic inference tools for BCRs. This includes the following two settings: first, sequences from the peripheral repertoire for which we don’t have especially dense sampling of a given lineage, and second, very dense sampling of a given lineage, say extracted from a single germinal center.

w.lees · June 17, 2016, 8:18am

Erick,

Here are some thoughts on clonal inference. Apologies in advance that these are half-formed at the moment and may well be misdirected or overlook points that I’m not aware of, but I hope they might form a basis for discussion.

I’d suggest two challenges for the community - one experimental or developmental, and one computational.

Experimental/developmental: current NGS analyses typically infer clonal relatedness by assuming that the junction ‘fingerprint’ is unique to a clonal family, usually utilising a combination of junction sequence distance and V/J germline identity to match sequences against that fingerprint. A variety of approaches are taken, for example nucleotide vs protein sequence, categorical assignment of germlines vs a germline identity metric, but the foundations of this approach have only been tested against small, hand-curated experimental samples. How valid is the assumption of uniqueness? Assuming that it is valid, what is best practice in terms of approach (or what are the tradeoffs between the different approaches)? What clustering algorithms or other approaches provide the best approximations in practice, both for assessing a complete repertoire, and when focussing in detail on a smaller set of sequences?
Computational: alongside the developmental challenge, there’s a need for tools that are capable of applying current methods (and extending to new ones over time) to repertoires of, say, 250,000-500,000 parsed sequences, in order to provide overall assessments of clonal diversity in a sample.
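
To make the junction ‘fingerprint’ idea above concrete, here is a minimal Python sketch of one common variant: sequences are grouped by categorical V/J assignment and junction length, then joined by single linkage when their junction nucleotide Hamming distance falls under a threshold. The function names, the input tuple layout, and the 10% threshold are illustrative assumptions, not the method of any particular tool.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_clones(seqs, max_frac=0.1):
    """Assign a clone label to each (v_gene, j_gene, junction) tuple.

    Two sequences are linked when they share V gene, J gene and junction
    length, and their junctions differ at no more than max_frac of
    positions (an assumed, illustrative threshold). Linked sequences are
    merged transitively (single linkage) using union-find.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Bucket by the categorical part of the fingerprint so that pairwise
    # comparisons only happen within a bucket.
    buckets = defaultdict(list)
    for idx, (v, j, junction) in enumerate(seqs):
        buckets[(v, j, len(junction))].append(idx)

    for (_, _, length), members in buckets.items():
        for a, b in combinations(members, 2):
            if hamming(seqs[a][2], seqs[b][2]) <= max_frac * length:
                parent[find(a)] = find(b)

    return [find(i) for i in range(len(seqs))]


seqs = [
    ("IGHV1-2", "IGHJ4", "TGTGCGAGAGATTACTGG"),
    ("IGHV1-2", "IGHJ4", "TGTGCGAGAGATTATTGG"),   # one mismatch from the first
    ("IGHV3-23", "IGHJ4", "TGTGCGAGAGATTACTGG"),  # same junction, different V
    ("IGHV1-2", "IGHJ4", "TGTGCGAAACCCGGGTGG"),   # same V/J, distant junction
]
labels = cluster_clones(seqs)
```

Here the first two sequences end up with the same label and the other two are kept separate. The comparison within each bucket is still quadratic, so for repertoires of the size mentioned above a large bucket could dominate the run time; that cost, and the choice of threshold and linkage, are exactly the tradeoffs the questions above are asking about.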