Phylogeny on somatic hyper mutations


As difficult as tree building already appears to be, let me make a suggestion that would make it even more complicated. To add to the last post of @ematsen, a standard assumption of BCR tree building is that there is a single tree. I suspect that activation of multiple identical naive B cells may be a common event, where public clonotypes are involved. This would lead to multiple trees that may need to be disentangled. Although it might seem counter-intuitive, I think this problem might be particularly common in the mouse, but less so in the human. Fortunately the relatively small trees of the mouse should facilitate disentanglement.


Important point. Likely multiple trees may develop in parallell also in cases beyond those represented by public stereotyped responses as any responding cell may divide prior to initiation of hypermutation, resulting in possibilities for multiple trees developing in parallell either in single GC (with competition) or in separate GC (without effective competition for antigen).


After reading your paper I think that your model is a real step in the right direction. Now I also understand what you mean when you said “mean-field”. The idea to include AID hotspots in the model is obvious but the way to do it is not and the way it is done in the paper is simple and elegant.

What I would like to hear your opinion on is the use of the transition vs. transversion parameter. How is the kappa parameter in you model compared to the transition/transversion bias consistently seen in mamalian genomes? I would imagine that the AID would result in more transitions because of the nature of uracil repair:
C->T (transition)
G->A (transition)


The UG mismatch can be processed in a number of ways. Replication
over the lesion results in a C3T transition on the newly synthesized strand.
Recognition and excision by UDG leads to production of abasic sites that are
unfaithfully repaired (40, 41). In S regions, the generation of multiple abasic
sites on both strands can cause double-stranded DNA breaks to initiate switch
recombination. Mouse and human deficiencies in UDG result in a shift in
transition/transversion ratios at C/G nucleotides from about 50% each to
greater than 95% transitions, whereas mutations at A/T nucleotides are unaffected

It looks like the Ts/Tv ratio is very much intertwined with the chance of being a hotspot. Maybe the model could be improved simply by introducing transition/transversion specific h, h_ts and h_tv, to reflect this. I guess it would be the same if e.g. doing like this:
k*(pi_j + b_ijh) --> pi_j + b_ijh*k

Or maybe even better include different probabilities for all different nucleotide substitutions, though this will add many new parameters to the model.



Thanks for sparking off such an interesting discussion!

A couple of simple observations on your mails which might help others, or help me if my thinking is wrong - I am not an expert on phylogeny and really appreciate the insights in this thread.

therefore no need to worry about multiple sequence alignments

In practice I find that the sequences I deal with often don’t start at exactly the same nucleotide and sometimes have differing lengths, in the latter case, most likely because of sequencing errors. I find it easiest to start with an MSA and then remove any sequences that are grossly out of alignment. The approach I find works best is codon-aligned nucleotide alignment. (you align the aas, and then revert back to the underlying nucleotide sequences once the codon alignment is complete). To me this makes sense from a biological point of view and in practice it eliminates some frustrating issues I’ve had with alignments at the nucleotide level, particularly in rabbit sequences, which feature a certain amount of gene conversion. TranslatorX ( does the codon alignment nicely, working with a choice of a number of aligners.

just more NGS errors

I also have to work with datasets that do not include UMIs. Although these are Illumina sets, I tend to cluster the nucleotide sequences at around 97% using UCLUST ( with the intention of removing the noisy reads that creep in from oversampling. While there is a danger that this will remove useful sequences that would help infer an accurate phylogeny, my intuition is that it removes a lot more noise. But I;d be interested to hear what other people think about this. Another approach would be to remove singletons. At the moment, the particular group I have been working with has not optimised read coverage against B-cell count, we are working in this direction but there are some practical issues, and I think we would lose just far too much detail if we eliminated singletons.

only the synonymous mutations will be neutral

In our studies, the variation we see in a clonal family after two or three trial immunizations tends to feature quite a few nonsynonymous mutations that are neutral in terms of binding affinity. I think this is because there are typically a number of positions in the CDRs that are under neutral selection, at least for some possible substitutions, so what you tend to see is the family ‘cycling’ between these, if you will, until, eventually, a fitter strain emerges. And even in the case where the antigen is not evolving, there may well be some natural variation that encourages such minor variation in the receptor - malformed sequences and so on.


As @w.lees says, this is an interesting discussion, so let me throw in another idea that I have been thinking about for years. Selection pressure is always discussed in terms of antigen binding, but there are other forces at work that can produce selection pressures, and an example is a situation that arises later in a response. Mutation is initiated by the focusing of the mutational machinery at RGYW mutational hotspots. As a response proceeds, affinity maturation will be achieved, and there will be pressure to prevent further mutation. This can be achieved by synonymous mutation of hotspot motifs. So what we should see is selection for silent mutations.


I agree that codon alignment is by far is the most realistic way of making an MSA if you cannot trim your reads to the same length. Notice from my approach, that the reason I get all same length reads is because these reads are already partitioned out into (likely) clonal families.

You write that you are using UCLUST (I assume you mean USEARCH?) to remove faulty reads. Can you elaborate a little on the idea behind this? Which reads do you remove, based on clusters, based on counts, cluster size??? I am also getting Illumina reads so I would be very interested in hearing how this is done.

This is just super interesting. In protein-protein interaction modelling we are also talking about “hotspots” and “coldspots” (not to be confused with the mutational hotspots in the BCR nucleotide sequence) this is something that we need to address in phylogenetic reconstruction e.g. by fitting a site specific rate parameter on the fly.


Okay so if I understand your post correctly you propose that late in maturation there will be a selection for more mutations in hotspot motifs?

If this is true, this is an effect that should be possible to simply observe in the affinity maturation datasets out there. Somebody else in this forum might have a go on that. My intuition also tell me that there will be a clear accumulation of synonymous mutations but I had not thought about them being positively selected to occur inside a hotspot motif. However the argument for your hypothesis also makes intuitive sense so I am inclined to believe that this is also an effect, if not measurable, then theoretically there.


@krdav yes, specifically the cluster_otus command of USEARCH: This preserves read counts, but regards all reads within an ‘identity radius’ as identical and replaces them with a consensus, the idea being to minimise noise associated with read errors. As mentioned, although it’s an option that cluster_otus supports, I don’t typically remove singletons. In my data I see too many singleton reads that have a high identity difference to other sequences in the set. I suspect that a better read depth (for the same number of B-cells) would address this, but it’s hard when you don’t have good control or measurement of the B-cell count in the first place. As I understand it, counting B-cells can be difficult with animal models because the markers aren’t as well understood as they are for humans.


Ah, I see. I think that this will indeed remove a lot of noise however it will not be very specific towards NGS errors vs. real mutations.

Assuming that lots of synonymous mutations will accumulate in the late phase of maturation the clustering will not deal with this very well. It will probably reflect the most mature sequence well, but the information about accumulation of synonymous mutations will be lost.


A bit of investigation has been done in that direction, w.r.t. selection pressures at the trunk vs canopy of B cell lineages:

Yaari, G., Benichou, J. I. C., Vander Heiden, J. A., Kleinstein, S. H. & Louzoun, Y. The mutation patterns in B-cell immunoglobulin receptors reflect the influence of selection acting at multiple time-scales. Philos. Trans. R. Soc. B Biol. Sci. 370, 20140242 (2015).

Still a lot of interesting directions to be explored.


Glad you think so! The kappa we’ve observed in the bNAb lineages we’ve looked at has been pretty consistently around 2, regardless of whether or not we include the hotspot adjustment. I’m not sure how that compares to other taxa, but it doesn’t seem too out there. We could try that adjustment out, though.


Interesting. Do you have a feeling of whether that there actually could be a nucleotide substitution bias in those codons with high probability of being hotspots? This would make sense when looking at the AID mechanism.


@krdav, yes I agree with your points re. clustering, but I’d suggest that, without some other approach to read errors, it is going to be difficult or even impossible to separate noise from signal when it comes to analysing those synonymous substitutions. Perhaps we have to carry out such work only on UMI-coded samples For less demanding work, such as tracing the development of a particular Ab of interest, I think the reduction of noise is probably worthwhile - but I wish I could put this on a firmer foundation.


@w.lees May I suggest the VSEARCH program instead of USEARCH:

They (the developers) say it is more sensitive than USEARCH. Surely I also get much less clusters at the same minimum identity with VSEARCH. Also it is 64bit, the license is permissive, there is better documentation and some features that are lacking in USEARCH e.g. writing fasta files without wrapping the lines.


+1 on vsearch over usearch – the background is someone smart reverse-engineered and open-sourced usearch after becoming frustrated with the license/lack of documentation. He (torognes) now prefers his newer project swarm, which I’ve played around with, but haven’t fully switched to (lots of explanation on the vsearch/swarm pages).


Thanks for the info on vsearch/swarm. The usearch licensing is certainly frustrating. @krdav to your earlier point I agree that clustering is not a discriminating technique in this application - at least as I am applying it currently. UMI barcoding looks to me to be the best answer, but the problem is what to do with sequence sets already in the pipeline. It seems to me that one should reduce the noise even at the expense of losing some signal, given the large amount of noise that will creep in with 0.1% - 1% read error rates, particularly if the sequencing depth gets too high as a result of low B-cell count.


@Kenneth_Hoehn Would your implementation of the codonPhyML also work for ancestral state reconstruction?


Yes, though I’m not aware of any publicly available software that will work with our model yet. I have scripts for doing it though. If you let me know your email I can send them, though I’ll need a day or two to prep them - they require a little more work than just using IgPhyML out of the box. Also we’ve expanded IgPhyML and released a new version if you’re interested:


Thank you @Kenneth_Hoehn you can send it to:

When you write “scripts doing it”, does this mean that they are doing the reconstruction based on the same mean field ML model described in your paper, or is it something more simple? When I asked I actually meant whether the --ancestral flag in PhyML is also working in IgPhyML? But then you are dependent on codonPhyML which might not even have this feature either?

expanded IgPhyML and released a new version if you’re interested

Sure I am! But I don’t see the big difference to the old paper? Maybe this is the version I read already.


Cool. Yes - the scripts use the same model. I wouldn’t expect the --ancestral flag to work. You might have seen the newer version. Currently there are two versions out on biorxiv- the biggest change is that the latter expands the model to include more hotspot motifs (WRC/GYW, and WA/TW) as well as coldspot motifs (SYC/GRS). You can also constrain certain motifs to use the same hotness parameter and test symmetric and asymmetric motif models. The newer version is also much faster than the initial release (and I believe I’ve fixed the issue you pointed out with BLAS as well).