Phylogeny on somatic hyper mutations

Dear all,

I was wondering if anybody in here have experience in doing phylogeny on BCR data? Specifically creating a tree view of the somatic hyper mutation process of a set of sequences having the same clonotype. I am currently giving it a shot by first partitioning my NGS sequences into clonal families (thanks to partis) and then using the derived naive sequence and the VH sequences from my NGS data as input in the phylogenetic analysis. Notice that since these partitioned sequences have same clonality they should also have the same length and therefore no need to worry about multiple sequence alignments.

This I then use as input to MrBayes where is split the codon into three partitions, one for each position in the codon, which will get its own model parameters. Then I set a broadly covering substitution model with 6 variables and run lots of iterations until convergence. I have also considered doing it on amino acid level and then use the substitution model appropriate for this.

I think it can be argued that the DNA level information is important in terms of recreating the right phylogenetic picture of the shm process, but what really drives the selection pressure is the amino acid identities on some key positions and therefore I fell that there are some important information that are left out when running phylogeny on DNA level.

I would like to hear the opinion/experience from anybody that have tried working on this problem. Are there a better method that I am unaware of (there very likely is…), am I doing anythings wrong etc. Also if/when there is a good solution to this practical problem these experiences should be written down for others to replicate them, something which I will be committed to do if I find the magic formula.


A fascinating question! And one that we’re thinking about a lot these days.

A key preliminary question is: what is it that you want to do with these trees? Do you care most about getting the discrete tree structure right? The branch lengths? The relative timings of events?

Ahh, good to hear that I am not the only one thinking down those lanes…

Of course I care about getting it the best, so that includes all of the things you have listed. That said I am definitely most interested in getting the overall structure right. Especially using it to show/find evolutionary dead ends in shm.

Also I would like a add an observation that I have seen (but not yet properly validated); highly expressed AB’s have many more different observations on DNA level than lowly expressed AB’s. And this is in the context of AB’s extracted directly from lymph nodes after immunization and therefore expected to be highly enriched in B cells from germinal centers.

One obvious reason for this is just more NGS errors. Simply the more times a given VH have been sequences the more sequencing errors will also sneak into the dataset, and these will be very hard to filter out without using UMI’s (which unfortunately is the case yet for the data I get). However my apparent felling is that the number of DNA variants of the same AB is not proportional with its expression though.

Another reason for this observation could be that the more affinity matured an AB gets the more of the possible amino acid substitutions will be lowering affinity and thereby causing counter selection. At some point in the shm most missense mutations by will be selected against and only the synonymous mutations will be neutral and thereby accumulate. In conclusion I am trying to derive a logic relationship between the number of different DNA observations coding for the same AB and its evolutionary progress aka. maturation inside a germinal center.

Is this completely bonkers?

At times of Sanger sequencing it was almost impossible to pick up immunoglobulin sequences from active alleles with crippling mutations. May be the new approaches are more sensitive and you are able to see low frequency sequences that are present for example in a lymph node for a very short time and missed in the past. So any low level somatic mutation can lead to antibodies with lower affinity to the antigen.

One interesting analysis to do is to compare the number of mutation in the conseved positions of V regions using the different sequence generation procedures.

I would agree that you will get the most accurate estimate of the “true” biological phylogeny from DNA sequences. Amino acid information would mostly be helpful if you had other data indicating that certain mutations couldn’t survive functional selection. At the simplest level, if MrBayes produces a tree with a sequence containing a stop codon as an intermediate node (as opposed to a leaf), you can rule out that possibility.

We have shown that this does appear to be the case in our paper on antibody evolutionary rates:

Specifically, the section titled “Functional selection pressure on CDRs and Framework regions”

@caschramm Thank you for the article. I will read it right away.

I just read another article in which they focus on phylogeny on protein levels and therefore parametrise an AA substitution matrix specific for antibodies:

We have struggled to understand the results from that paper --if you look at figure 3, the bias to Histidine is just weird. Their final substitution matrix doesn’t empirically match what we see at all.

(Full disclosure: we have a paper in review that also looks at amino acid substitutions in antibodies, though from a very different perspective.)

As I understand they describe a strong bias against histidine which they explain by its change in charge around the physiological pH. Whether this is good argument or not I can’t say but I just took a look at my VH sequences and sure enough, they are highly depleted from histidine.

The paper you are talking about is that in a preprint archive somewhere? I would be very interested in reading it.

Yes, His is ~4x rarer than the next rarest residue. The problem is that their model is symmetric, though, so the numbers equally imply that all other amino acids are more likely to mutate to His than anything else, which is obviously false. And, like I said, the overall results don’t match what we see, either.

Unfortunately, no preprint. I tried, but the decision ultimately wasn’t mine to make.

Just to be on the same page I will paste in one of the key figures from the article:

Maybe I am missing something here, but I think the whole point of having a symmetric substitution matrix comes from the fact that the substitutionrates are derived from the a matrix of “exchangeabilities” multiplied by the amino acid frequencies: r_ij = s_ij*pi_j with r_ij being the substitution rate, s_ij being the “exchangeability” and pi_j being the frequency.
At least this is what also applies for the BLOSUM/PAM matrices:

Therefore r_hn >> r_nh because pi_n >> pi_h which makes total sense.

Then again if the problem is more that the results doesn’t match what you are producing then there might still be something worrying about the method.

  1. In this case, the symmetry gives an obviously bad result, suggesting at least the need for a correction of some sort.
  2. The assumption of symmetry is more about mathematics, not an inherent property of the biology. When you have an MSA of “unrelated” (ie unordered) proteins and are looking strictly for functional exchangeability, this assumption makes sense. But when you have phylogenetic information and/or you know the originating (in this case, germ line) sequences, then you can empirically test the symmetry assumption. Our data shows that, in antibodies, many substitutions are asymmetric, sometimes drastically so.

The other thing we tested is to simulate SHM and compare to real sequences. S5F does reasonably well, despite being a nucleotide model. AB is just hopelessly different.

I’ve actually been doing some phylogenetic methods development work on B cell lineages, which might be useful to you. We’ve developed a codon substitution model tailored to SHM, which largely accounts for the effect of increased mutability of AID hotspot motifs, and doesn’t assume the substitution process is symmetric (which, as pointed out above, is important). Further, this effect of hypermutation is estimated as a parameter for each specific data set, along with the usual features of GY94-type codon models like dN/dS ratio (omega), transition/transversion ratio (kappa), and codon frequencies. It’s under review, and a version is on biorxiv if you want to check it out, though the model has changed slightly:

The implementation is also available, and though it’s still little rough around the edges, is pretty easy to use (at least on Linux systems). If you want to try it out, you can get it here:

1 Like

Cool I will check it out.
Since the IgPhyML is build on PhyML do you also have support for running with MPI? or is it OpenMP only?

Great - let me know if you have any issues with it. It’s based more closely off of codonPhyML, so just OpenMP.

In response

So many great thoughts and ideas in this thread! Apologies for the delay in replying, I’ve been at an all-consuming small workshop (on phylogenetics).

This is a great intuition, and an important direction. I believe that the best long-range solution to implement such ideas is to develop probabilistic models of B cell diversification that incorporate features such as frequency of sampling a DNA sequence, synonymous/nonsynonymous mutation, and tree structure. We’re beginning to work hard in this direction (in fact we’ve just hired @wsdewitt and this will be his primary project). A metaphor for something that already exists is say the “skyline plot” that people often infer using BEAST, which is an inference of ancestral population size. In your case you are looking to estimate sampling time relative to when the GC would naturally terminate.

Chaim is right on the mark here, however I wanted to point out that this assumption is baked into every phylogenetics program in common use (including phyML and derivatives). So the point of that paper was to infer the best substitution matrix one can that could be plugged into current software.

More generally

Here are the major differences in my mind between the standard assumptions of phylogenetics versus features of BCR development:

  1. Phylogenetic methods (with one recent exception) assume that we only sample from the tips of the tree. For B cells we sample parent cells along with their descendants.
  2. Mutation is context-sensitive. Doing this properly in a likelihood framework is hard (though stay tuned!)
  3. We can do a reasonable job of inferring the “root” sequence, namely the naive sequence. It’s worth doing the best we can in this inference, because this will determine the root of the phylogenetic tree.
  4. To first approximation, all of the sequences in a lineage are optimizing the same objective function (binding antigen), and we can obtain measurements of how well they have done. The exception is when a BCR is “chasing” a moving target in chronic infection such as HIV. Re measurements of antigen binding here are also complications such getting the right T cell help.

Also, as described above, as far as I know are also no generative tree models for which likelihoods can be computed efficiently enough to do fitting. If you know of some, let me know!

The great thing is that there’s tons of data that can be used to develop new methods and models, and that we can borrow clever methods from more thoroughly developed areas. None of these challenges are insurmountable, and we’re working hard on all of them.


I agree with those four points - sounds like we have similar thoughts on what parts of standard phylogenetic models need to be changed. On point 2), the (asymmetric) model implemented in IgPhyML deals with context dependence of SHM through a mean-field approximation, which preserves the indepedence of sites (otherwise, as you said, the likelihood calculations get out of hand). That’s how it stays within the likelihood framework, so you can still do model fitting, hypothesis testing, and ancestral state reconstruction in the usual ways. Though it is an approximation, it seems to largely capture the effect, at least in the respects that we’ve tested. It’ll be interesting to see how the other points all end up being addressed as well.

1 Like

One issue I had while compiling was that it requires “atlas_aux.h” from BLAS, and that this was deprecated years ago. Instead I have it compiled just with OpenMP and it works fine. It seems the reason for the deprecated BLAS header is that it is not updated either in codonPhyML.
Do you have any idea what the speed-up is using BLAS? In other words; is it time worth to find a solution to the problem?

This is an awesome thread! Thanks for correcting me on the symmetry of substitution matrices, I am learning as I go right now.

For the points made by @ematsen:

  1. Here I would like to use frequencies (observations of one AB in the repertoire relative to the total amount of AB observations) as proxy for the distance on a tree. In such a context high frequency would be a tip while lower frequencies would be middle branches. Of course, even better would be to have actual samples a different time points.
  2. Well this is just really hard - I don’t even want to try doing the combinatorics. I guess one easy way to approach it is to make some coarse grained discretizations of these contexts. I need to look into the article from @Kenneth_Hoehn to understand the mean field approximation.
  3. This seems to be the only problem where we are already a long way. Thanks to better clonal inference (e.g. by partis and others) the limit of accuracy is getting pushed. Ultimately a phylogenetic reconstruction of the entire clonal family could help pushing it even further. I imagine an approach where a maximum likelihood inferred naive sequence is used as an initial root and then after one iteration of phylogenetic reconstruction the naive sequence is updated, these iterations are then repeated until convergence.
  4. Solving problem 1-3 is essential for this one. If one can make a reasonable model with a single objective function then making the objective function time dependent and kicking this into the machinery should no really be that big of a deal (of course it might be very much of target because the additive error of a changing objective function).

Right - that is one thing due for a fix that I should just explain in the manual. For me, on a quad core desktop, the biggest speedup by far comes from OpenMP, so the version without BLAS is fine. BLAS will speed it up a little from there, but probably not enough to look for a proper solution right now. It’s on the to-do list, though. Until then, if you’re using Ubuntu and want to try the BLAS version out, a quick fix is to install an older version of libatlas-headers: