We tried Hamming distance and edit distance. I have to say it is hard to decide between them. What metrics are you using?
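For reference, the two metrics mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not anyone's production code: the function names `hamming_distance` and `edit_distance` are chosen here for clarity, and the example CDR3 strings in the usage note are made up.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions; only defined for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Hamming distance only compares CDR3s of the same length, while edit distance also handles length differences, e.g. `edit_distance("CASSLGQAYEQYF", "CASSLGAYEQYF")` counts the single deleted residue.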
A very complex topic that depends on the question you’re asking:
- How likely is it that a pair of CDR3s is the result of somatic hypermutation(s) rather than independent assembly? For example, a pair of junctions with matching V/D/J germline parts and 5 out of 6 N-nucleotides in common is likely to result from a single hypermutation. This also applies to “public” clonotypes: if a CDR3 amino acid sequence is shared by individuals in a given group, it can be due to convergent recombination (the probability of the corresponding V-D-J rearrangement is high) or to some biological context.
- Do those CDR3s have the same antigen-binding properties? In this case edit distances can become really high, and certain amino acid property profiles should be compared instead.
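To make the second point more concrete, here is one possible sketch of comparing property profiles instead of raw sequences. Everything about the design is an assumption made for illustration: the choice of features (length, mean Kyte-Doolittle hydropathy, and a crude net charge that counts K/R as +1 and D/E as -1), the unweighted Euclidean distance, and the names `property_profile` and `profile_distance`. It is not a validated predictor of binding similarity.

```python
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude side-chain charge (assumption: histidine is treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def property_profile(cdr3: str) -> tuple:
    """Summarise a CDR3 amino acid sequence as (length, mean hydropathy, net charge)."""
    if not cdr3:
        raise ValueError("empty CDR3 sequence")
    hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in cdr3) / len(cdr3)
    charge = sum(CHARGE.get(aa, 0) for aa in cdr3)
    return (float(len(cdr3)), hydropathy, float(charge))


def profile_distance(a: str, b: str) -> float:
    """Unweighted Euclidean distance between two property profiles."""
    return math.dist(property_profile(a), property_profile(b))
```

With this sketch a conservative substitution (I to L) moves two sequences less far apart than a charge-changing one (I to D), even though both have an edit distance of 1. In practice the features would need scaling and weighting, which this example does not attempt.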
Do you have some good reference suggestions for the second point, i.e. functional similarity in terms of antigen binding? Is there any principled way of choosing, maybe based on sequences with known affinity such as the ones you have in your VDJdb?
I think the two options that Mikhail is setting up do relate to each other somewhat. Essentially, if you can find all the sequences in a clonal family, then they should also bind the same target on the same epitope surface, but with a range of different binding affinities. The problem is that inferring clonal relationships is not at all trivial, and in many cases not even possible based on the CDR3 sequence.
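As a rough illustration of the first question above (matching germline segments and nearly identical N-nucleotides), a crude candidate filter could look like the sketch below. It is only a heuristic pre-filter under assumptions made here, not a clonal inference method: the `Junction` record, the `possibly_related` name, the requirement of equal junction length, and the default threshold of one nucleotide mismatch are all hypothetical choices.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    """Hypothetical minimal record for one rearrangement."""
    v_gene: str
    j_gene: str
    nt_seq: str  # junction nucleotide sequence


def possibly_related(a: Junction, b: Junction, max_mismatches: int = 1) -> bool:
    """Flag pairs that share V and J genes and differ by few nucleotides."""
    if (a.v_gene, a.j_gene) != (b.v_gene, b.j_gene):
        return False
    if len(a.nt_seq) != len(b.nt_seq):
        return False
    mismatches = sum(x != y for x, y in zip(a.nt_seq, b.nt_seq))
    return mismatches <= max_mismatches
```

A pair that passes this filter is merely a candidate; as noted, the CDR3 alone is often not enough to establish a clonal relationship.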
@andim A practical solution would be to focus on structural similarities (given that you have the full AB sequence) and not only on sequence similarities. E.g. you could look at the modelled structure by using homology modelling servers like:
http://www.cbs.dtu.dk/services/LYRA/cite.php
(my biased choice since this is made by a colleague of mine, but go look for other alternatives…)
For structural similarities, I wonder if it is possible with CDR3 alone? I think I read a paper about this. The problem is that the approach is not very mature, and it is hard to know whether the claims hold. Besides, it is not easy to set up, and it is slow. Imagine you have millions of CDR3s.
Even if the full antibody sequence is available, the problem above still exists. Anyway, thank you for the reference.
Thanks for your answer. Is it possible to know the antigen-binding properties from the CDR3 alone, without wet-lab work?
Maybe for anti-DNA? Otherwise, no.
There is a new preprint on predicting Fv structures from sequences here: http://biorxiv.org/content/early/2016/08/16/069930
Much of the validation data is currently missing though, so it’s hard to evaluate the accuracy at the moment. There are other programs available, too, but they are all probably too computationally intensive to be of much use on large datasets.
In certain rare cases, you might be able to predict the antigen from sequence alone, if it matches known stereotypic antibodies against that antigen. But as a general rule, no.