Sequences not described in literature

E.Rosati · March 28, 2017, 6:17am

The majority of the CDR3 sequences in my TCR data sets are not described in the literature. How can I reduce the number of unknown sequences and actually find out what they are?
Recently, I have found many of the sequences I searched for on the internet here:

GitHub - smangul1/TAIR: The Atlas of Immune Repertoires (TAIR)

and/or in publications and their supplementary tables.

What is the best-organized way to make your sequences public, so that other people can check whether the sequences in their dataset were already observed before? Are there other repositories for collecting and uploading CDR3 sequences?

And, in particular, how can one find out whether these sequences are considered “public”, disease-associated, and so on? Now with VDJdb we can check for antigen specificity, but what about collecting information on sequences with unknown antigen?

Thank you for any feedback

mikhail.shugay · March 29, 2017, 7:01pm

Googling does indeed work; see Example 6 from our VDJviz paper for quite an interesting case.

The thing with public clonotypes is that clonotypes can be rare or frequent in the population. Ignoring individual disease history, there are two factors influencing this:

1. The probability of VDJ rearrangement, see Elhanati 2014. Namely, you are far more likely to find a clonotype with a single added N-nucleotide in the V-J junction than one with 10 N-nucleotides.
2. The probability of passing thymic selection and proliferating in the peripheral blood. These are far more complex. For thymic selection see Košmrlj 2008: a high fraction of strongly interacting amino acids in the CDRs lowers the probability of passing selection.

Computing 1. and 2. yields results that are in good agreement with simply taking a big dataset of RepSeq samples and computing the incidence of each clonotype, so there are not many problems with getting public sequences in general.

Note that public clonotypes make an ideal “reference set” for comparing samples: overlapping them across various samples will give you a dense incidence matrix that is far easier to handle.
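To make the incidence idea concrete, here is a minimal Python sketch of counting in how many samples each clonotype occurs and building a presence/absence matrix restricted to the public ones. Everything in it is an assumption for illustration: the function names, the `min_samples` threshold, and the toy CDR3 strings are hypothetical, and clonotypes are matched by exact CDR3 amino acid sequence only (no V/J segment information).

```python
from collections import Counter


def clonotype_incidence(samples):
    """Count the number of samples in which each CDR3 sequence occurs."""
    incidence = Counter()
    for cdr3_list in samples.values():
        # Use a set so a clonotype is counted once per sample.
        incidence.update(set(cdr3_list))
    return incidence


def incidence_matrix(samples, min_samples=2):
    """Build a presence/absence matrix for clonotypes seen in >= min_samples samples.

    min_samples is a hypothetical cutoff for calling a clonotype "public".
    """
    incidence = clonotype_incidence(samples)
    public = sorted(c for c, n in incidence.items() if n >= min_samples)
    sample_ids = sorted(samples)
    sample_sets = {s: set(samples[s]) for s in sample_ids}
    # Rows are public clonotypes, columns are samples.
    matrix = [
        [1 if cdr3 in sample_sets[s] else 0 for s in sample_ids]
        for cdr3 in public
    ]
    return public, sample_ids, matrix


# Toy input: placeholder CDR3 strings, not real data.
samples = {
    "donor_A": ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CASSPGTGGYEQYF"],
    "donor_B": ["CASSLGQAYEQYF", "CASSIRSSYEQYF"],
    "donor_C": ["CASSLGQAYEQYF", "CASSQDRGNTEAFF"],
}

public, sample_ids, matrix = incidence_matrix(samples)
for cdr3, row in zip(public, matrix):
    print(cdr3, row)
```

Clonotypes private to a single sample are dropped before the matrix is built, which is what keeps it dense.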

The things that I believe are of interest here:

Donor MHC - the absence/presence of certain public clonotypes should be a good predictor of MHC haplotype
Disease associated public clonotypes - see http://friedmanlab.weizmann.ac.il/McPAS-TCR/
Tissue specificity - the database you’ve referenced

Right now none of these are published, so for the moment the easiest way to make something public is to share it on GitHub as plain text so that Google can index it.