So many great thoughts and ideas in this thread! Apologies for the delay in replying; I've been at an all-consuming small workshop (on phylogenetics).
This is a great intuition, and an important direction. I believe that the best long-term way to implement such ideas is to develop probabilistic models of B cell diversification that incorporate features such as how frequently a DNA sequence is sampled, synonymous/nonsynonymous mutation, and tree structure. We're beginning to work hard in this direction (in fact we've just hired @wsdewitt and this will be his primary project). An analogy to something that already exists is the "skyline plot" that people often infer using BEAST, which is an inference of ancestral population size. In your case you are looking to estimate sampling time relative to when the GC would naturally terminate.
Chaim is right on the mark here; however, I wanted to point out that this assumption is baked into every phylogenetics program in common use (including PhyML and derivatives). So the point of that paper was to infer the best possible substitution matrix that could be plugged into current software.
Here are the major differences in my mind between the standard assumptions of phylogenetics versus features of BCR development:
- Phylogenetic methods (with one recent exception) assume that we only sample from the tips of the tree. For B cells we sample parent cells along with their descendants.
- Mutation is context-sensitive. Doing this properly in a likelihood framework is hard (though stay tuned!)
- We can do a reasonable job of inferring the "root" sequence, namely the naive sequence. It's worth doing the best we can in this inference, because this will determine the root of the phylogenetic tree.
- To a first approximation, all of the sequences in a lineage are optimizing the same objective function (binding antigen), and we can obtain measurements of how well they have done. The exception is when a BCR is "chasing" a moving target in a chronic infection such as HIV. Regarding measurements of antigen binding, there are also complications, such as getting the right T cell help.
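
To make the context-sensitivity point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the motif set, the rate values, and the function name `site_rates` are made up and are not taken from any fitted model or existing software. It only shows that when a site's rate depends on its neighbors, a mutation at one site changes the rate at another, which is why the usual "sites evolve independently" likelihood factorization no longer applies.

```python
# Hypothetical context-dependent mutation rates (illustration only).
# The motifs and rate values below are placeholders, not fitted parameters.
BASE_RATE = 1.0
HOTSPOT_RATE = 5.0
HOTSPOT_MOTIFS = {"AGC", "AGT"}  # placeholder 3-mers, centered on the middle base


def site_rates(seq):
    """Return a relative mutation rate for each position in seq.

    The rate at position i is looked up from the 3-mer seq[i-1:i+2],
    so it can change whenever a neighboring base mutates.
    Positions at the ends have a truncated context and get the base rate.
    """
    rates = []
    for i in range(len(seq)):
        context = seq[max(i - 1, 0): i + 2]
        rates.append(HOTSPOT_RATE if context in HOTSPOT_MOTIFS else BASE_RATE)
    return rates


before = "TAGCA"
after = "TAGAA"  # only position 3 has mutated (C -> A)

# The rate at position 2 differs between the two sequences even though
# the base at position 2 (G) is unchanged.
print(site_rates(before))
print(site_rates(after))
```

In a standard phylogenetic likelihood each column of the alignment is scored separately under a fixed substitution matrix; in the toy above, the rate at one column is a function of the current state of the adjacent columns, so the columns cannot be scored separately.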
Also, as described above, as far as I know there are no generative tree models for which likelihoods can be computed efficiently enough to do fitting. If you know of some, let me know!
The great thing is that there's tons of data that can be used to develop new methods and models, and that we can borrow clever methods from more thoroughly developed areas. None of these challenges are insurmountable, and we're working hard on all of them.