Here are some thoughts on clonal inference. Apologies in advance that these are half-formed at the moment and may well be misdirected or overlook points that I’m not aware of, but I hope they might form a basis for discussion.
I’d suggest two challenges for the community - one experimental or developmental, and one computational.
Experimental/developmental: current NGS analyses typically infer clonal relatedness by assuming that the junction ‘fingerprint’ is unique to a clonal family, usually utilising a combination of junction sequence distance and V/J germline identity to match sequences against that fingerprint. Approaches vary: some compare nucleotide sequences and others protein sequences, and some use a categorical germline assignment while others use a germline identity metric. The foundations of the approach, however, have only been tested against small, hand-curated experimental samples. How valid is the assumption of uniqueness? Assuming that it is valid, what is best practice in terms of approach (or what are the tradeoffs between the different approaches)? What clustering algorithms or other approaches provide the best approximations in practice, both for assessing a complete repertoire, and when focussing in detail on a smaller set of sequences?
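To make the discussion concrete, here is a minimal Python sketch of one variant of the fingerprint approach described above: categorical V/J assignment, nucleotide junctions, and single-linkage grouping on normalised Hamming distance. It is an illustration under stated assumptions rather than any particular tool's method. The function names, the requirement that junctions have equal length, the choice of single linkage and the 0.15 threshold are all my own placeholders.

```python
from collections import defaultdict
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    if not a:
        return 0.0  # two empty junctions are treated as identical
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_clones(records, threshold=0.15):
    """Return one integer clone label per record.

    records:   list of (v_gene, j_gene, junction_nt) tuples (hypothetical layout).
    threshold: maximum normalised Hamming distance at which two junctions
               are linked; 0.15 is an illustrative value, not a recommendation.
    """
    # Step 1: partition by categorical V/J assignment and junction length,
    # so only sequences sharing all three are ever compared.
    groups = defaultdict(list)
    for idx, (v, j, junction) in enumerate(records):
        groups[(v, j, len(junction))].append(idx)

    # Step 2: single-linkage clustering within each partition (union-find).
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for members in groups.values():
        for a, b in combinations(members, 2):
            if hamming_fraction(records[a][2], records[b][2]) <= threshold:
                parent[find(a)] = find(b)

    # Step 3: relabel union-find roots as consecutive clone ids.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(records))]
```

Calling `assign_clones` on a list of `(v_gene, j_gene, junction)` tuples returns a list of clone ids in the same order. Each choice baked into the sketch (nucleotide rather than protein, categorical rather than graded germline matching, single linkage, a fixed threshold) is exactly the kind of decision whose tradeoffs the questions above are asking about. Note also that the pairwise comparison is quadratic within each partition.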
Computational: alongside the developmental challenge, there’s a need for tools that are capable of applying current methods (and extending to new ones over time) to repertoires of, say, 250,000-500,000 parsed sequences, in order to provide overall assessments of clonal diversity in a sample…