I would like to discuss here how the VBASE2 database was constructed, which parts of it are still useful, and what I would change if I were generating the database now. I will split the discussion according to the flow of arguments in the VBASE2 paper: http://www.ncbi.nlm.nih.gov/pubmed/?term=15608286 .
The aim of the VBASE2 database is to provide a list of germline V genes. For each germline V gene, the evidence that it is a germline gene is listed. For this, all entries of the public DNA databases are searched and the matching entries are linked to a particular germline gene. Only non-mutated sequences are linked; somatically mutated sequences are discarded.
The VBASE2 database uses three classes of sequences. The definition is as follows:
"Class 1 holds sequences for which a genomic sequence and a rearranged sequence are known.
Class 2 contains sequences that have not been found in a rearrangement, thus lacking evidence of functionality. This class includes pseudogenes and orphans, but it might also contain V genes of rare usage or V genes for which rearrangements are known only in a somatic mutated version.
Class 3 contains sequences, which have been observed in different V(D)J rearrangements that give strong evidence of the absence of mutations, but lack a genomic reference."
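The three class definitions above can be read as a small decision rule. The following is only a sketch of that rule, not the code used for VBASE2: the names `Evidence` and `vbase2_class` are made up for illustration, and the minimum of two independent rearrangements for class 3 is my assumption for what "different V(D)J rearrangements" means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of database hits for one candidate V gene."""
    genomic_hits: int       # unrearranged (germline-configured) entries
    rearranged_hits: int    # distinct V(D)J rearrangements with an unmutated V region


def vbase2_class(ev: Evidence, min_rearrangements: int = 2) -> Optional[int]:
    """Return 1, 2 or 3 following the class definitions, or None if evidence is insufficient."""
    if ev.genomic_hits > 0 and ev.rearranged_hits > 0:
        return 1  # genomic sequence and rearranged sequence both known
    if ev.genomic_hits > 0:
        return 2  # genomic only: pseudogene, orphan, or rarely used V gene
    if ev.rearranged_hits >= min_rearrangements:
        return 3  # several different rearrangements, no genomic reference
    return None   # a single rearrangement alone is not treated as evidence here


print(vbase2_class(Evidence(genomic_hits=1, rearranged_hits=4)))
```

A sequence seen in only one rearrangement and never in genomic DNA falls through to `None`, which is where a somatically mutated sequence would usually end up.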
For a new germline database I would like to suggest a class 4 for sequences whose evidence is based on next-generation sequencing data. Because the number of sequences is now much higher than at the time VBASE2 was constructed, I expect there is a high chance that “germline sequences” appear that are just “noise”. So I expect that class 4 would contain both noise and true germline sequences.
The procedure by which the VBASE2 database is constructed can be found in the publication.
Important for the database is that only V sequences, not V, D and J sequences together, are part of the database.
"The DNAPLOT software is used to align, sort and compare the V gene sequences, identify J elements, RSS elements and pseudogenes. Synthetic sequences are detected and removed. All germ-line configured V genes are matched to the rearranged sequences. To assign a rearrangement to a germ-line sequence a 100% match in the V gene region is required. Thus, the sequence comparison is restricted to the FR1–FR3 region, excluding potential N nucleotides in CDR3."
Of course, the VBASE2 database also maintains a list of D elements and J elements (based on genomic data only).
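The 100% match rule quoted above from the paper can be illustrated with a toy example. This is a simplified sketch and not how DNAPLOT works internally: I assume here that the sequences are already aligned so that position 0 is the start of FR1, and that the end of FR3 is known for each germline gene. The function name `assign_germline` and the toy sequences are invented for illustration.

```python
def assign_germline(rearranged, germlines):
    """Return the names of germline V genes that match the rearranged
    sequence exactly over FR1-FR3.

    germlines maps a gene name to (sequence, fr3_end), where fr3_end is the
    0-based index of the first nucleotide after FR3. Everything from fr3_end
    onwards (CDR3, possible N nucleotides) is ignored.
    """
    matches = []
    for name, (seq, fr3_end) in germlines.items():
        if len(rearranged) < fr3_end:
            continue  # rearranged sequence does not cover the full FR1-FR3 region
        if rearranged[:fr3_end].upper() == seq[:fr3_end].upper():
            matches.append(name)
    return matches


# Toy data only, not real V gene sequences.
toy_germlines = {"toyV1": ("ATGCATGCAAGA", 9), "toyV2": ("ATGCTTGCAAGA", 9)}
unmutated = "ATGCATGCATTTGGG"   # identical to toyV1 up to position 9
mutated = "ATGGATGCATTTGGG"     # one mismatch inside FR1-FR3

print(assign_germline(unmutated, toy_germlines))
print(assign_germline(mutated, toy_germlines))
```

A single mismatch inside FR1-FR3 is enough to leave the rearrangement unassigned, which is how somatically mutated sequences drop out.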
One important feature of the VBASE2 database is that the V genes are grouped into V gene families. These families are constructed based on the procedure outlined by Dildrop.
This procedure simply defines a similarity threshold between V genes that decides whether they are grouped within the same V gene family or not. The V gene families get simple numbers, and the procedure can be used for all V genes and for all species. Subsequently the numbered V gene families can be mapped to other V gene family names for cross-referencing.
I suggest that V gene families in the various species and V gene classes are constructed using the Dildrop method.
She used the procedure successfully again a few years later for V genes in fish.
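To make the threshold idea concrete, here is a minimal sketch of grouping V genes into numbered families. Several things in it are my assumptions and not taken from the text above: the 80% nucleotide identity cut-off (the figure usually quoted for mouse VH families), the use of single-linkage grouping, and the requirement that the input sequences are pre-aligned and of equal length. The function names are invented.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_into_families(genes, threshold=0.80):
    """Single-linkage grouping: two genes share a family if they are linked
    by a chain of pairs with identity >= threshold.

    genes maps gene name -> aligned sequence.
    Returns gene name -> family number (1, 2, ... in order of first appearance).
    """
    names = list(genes)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(genes[a], genes[b]) >= threshold:
                parent[find(a)] = find(b)

    numbering = {}
    families = {}
    for n in names:
        root = find(n)
        if root not in numbering:
            numbering[root] = len(numbering) + 1
        families[n] = numbering[root]
    return families


# Toy sequences only.
toy = {"a": "AAAAAAAAAA", "b": "AAAAAAAAAT", "c": "CCCCCCCCCC"}
print(group_into_families(toy))
```

The family numbers produced this way carry no meaning beyond being unique, so a separate table would map them to established names such as VHJ558.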
The VBASE2 entries can be found on the VBASE2 web page. My favourite one is the entry for the V186.2 gene.
You can look at it here http://www.vbase2.org/vgene.php?id=musIGHV057&ref=all
I will just go through all the fields of this sample entry; it would be worth discussing what to add in the new germline database.
VBASE2 ID (unique number): Nothing is really associated with this number; it is just unique in the database. (This is not quite true: I added mus for mouse and IGH for the type of the V gene, but you could treat the letters like numbers.) If I were to do it again, I would use just a number and store the species and the type of V gene elsewhere.
Class (as discussed before): The example above is a class 1 gene (the best).
Functionality (of course functional, as it is a class 1 gene): functionality is tested by inspecting the sequence at the time of database construction; anything unusual, for example impaired conserved positions, could make the gene potentially non-functional.
V gene names (a list of all names known, like in this case V186.2)
V gene family (here I display the trivial name but should display the number; scientists working with mouse sequences know the VHJ558 family so well that they would not remember the number, VH1).
Last update (as you can see, I did the latest update in 2007, about 9 years ago).
The nucleotide sequence (in fasta format with name and sequence length)
The protein sequence (in fasta format with name; I would now give the length in aa rather than bp).
Nucleotide sequence structure: lists key positions within the sequence, like
FR1, FR2, FR3, CDR1, CDR2, CDR3 (more important for the light chains…)
Conserved positions: 1st and 2nd cysteine and conserved tryptophan in FR2.
Genomic sequence(s): all entries of the DNA databases where the genomic sequence can be found (evidence for a class 1 sequence)
All rearranged sequences in the EMBL database
and cross references
- All rearranged sequences in the IMGT database
- All rearranged sequences in the KABAT database
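The fields walked through above could be held in a record like the following. This is only a sketch of one possible layout, following my remark that the ID should be a plain number with species and V gene type stored separately. All class and field names are invented, the sequences are short placeholders and not the real V186.2 sequence, and the number 57 is simply read off the ID in the URL above.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str    # e.g. "FR1", "CDR1"
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


@dataclass
class GermlineEntry:
    number: int                # plain unique number, no meaning attached
    species: str               # stored separately instead of inside the ID
    locus: str                 # type of V gene, e.g. "IGH"
    vbase2_class: int          # 1, 2 or 3
    functional: bool
    names: list                # all known names of the gene
    family_number: int         # Dildrop-style family number
    family_trivial_name: str   # name familiar to the community
    last_update: str
    nt_sequence: str
    aa_sequence: str
    regions: list = field(default_factory=list)           # FR/CDR boundaries
    genomic_refs: list = field(default_factory=list)      # accession numbers
    rearranged_refs: dict = field(default_factory=dict)   # database name -> accessions

    def nt_fasta(self):
        return f">{self.names[0]} {len(self.nt_sequence)} bp\n{self.nt_sequence}"

    def aa_fasta(self):
        # Length given in amino acids rather than bp.
        return f">{self.names[0]} {len(self.aa_sequence)} aa\n{self.aa_sequence}"


entry = GermlineEntry(
    number=57, species="mouse", locus="IGH", vbase2_class=1, functional=True,
    names=["V186.2"], family_number=1, family_trivial_name="VHJ558",
    last_update="2007",
    nt_sequence="GAGGTG",   # placeholder, not the real sequence
    aa_sequence="EV",       # placeholder
)
print(entry.aa_fasta())
```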
When the database was constructed, the use of GO terms was not yet fully developed.
If I were constructing the database now, I would make use of GO terms (for example for species).
Maybe develop germline-gene-specific GO terms (and maybe move the class definition there).
I would like to have much more information in the database now. What to include depends on the purpose of the database. My main rationale was to have clearly defined building blocks to describe rearranged sequences for the DNAPLOT analysis software as a first step for repertoire analyses.
For (my) history book:
The first publication with output from “my software” is the paper from 1991.
For example, figure 1 was generated using the DNAPLOT software (yes, it can also produce homology plots).
The sequence tables in the publication were also generated by the software and it was also used to generate clonal trees.
I think it should be very easy to prevent ‘noise’. Each VDJ from NGS data is evidence for the existence of the germline gene, and if there are enough rearrangements that appear to utilize an unmutated V gene, the evidence is much stronger than the evidence from a single Sanger sequence. How many Sanger sequences represent an equal weight of evidence to 10,000 VDJ sequences?
I don’t see why NGS data would lead to a fourth category. In fact there may be an argument for some, though not all, of these sequences being in the highest category. There are certainly many inferred polymorphisms about which there really can be no doubt. Some of them are even very commonly expressed (e.g. IGHV1-2*p06), and their absence from earlier Sanger sequencing studies must just be down to chance. Whether such sequences are given the highest rating will depend on whether the rating is a sign of completeness (we accept the existence of this sequence, and we know everything we want to know about it: RSS, genomic position etc.) or a sign of certainty (we accept the existence of this sequence).
Noise could, for example, come from low-level somatically mutated V genes at hot spots. Even in the days of Sanger sequencing, there were cases of VDJs with identical somatically mutated V genes. One would have to look out for “masked” hotspots. (This was very rare, but one could find such cases when looking at many V genes.) Ideally one identifies the unrearranged germline V gene at the genomic level.
In species where we see examples of gene conversion, it could again be difficult to identify the germline gene from rearranged sequences only.
Were the regions annotated automatically or did they require manual input?
Automatic by the program, no manual input.