Sponsored by the AIRR Community

Standardizing the format of a germline set

I like the way this is going but I wonder if it could be made simpler.

If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution: if I want to use a different set of names and positions, I have to properly extract six files into the three respective directories.

What about using one FASTA file with all the sequences, with each set of sequence annotations provided by a pair of files describing the positions and names, each of which has a column stating whether that sequence is an H, K, or L?
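A minimal sketch of how a tool might consume that layout, assuming a single FASTA file plus one annotation CSV with a locus column. The sequence names, the toy sequences, and the column names (`name`, `locus`, `cys_position`) are made up for illustration and are not part of any agreed format.

```python
import csv
import io

# Hypothetical toy data: names, sequences, and column names are
# illustrative only, not a proposed standard.
FASTA_TEXT = """>toy-IGHV-1
CAGGTGCAGCTG
>toy-IGKV-1
GACATCCAGATG
"""

POSITIONS_CSV = """name,locus,cys_position
toy-IGHV-1,H,9
toy-IGKV-1,K,6
"""


def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def group_by_locus(fasta_text, positions_csv):
    """Join each annotation row to its sequence and group rows by locus."""
    sequences = read_fasta(fasta_text)
    grouped = {}
    for row in csv.DictReader(io.StringIO(positions_csv)):
        entry = dict(row, sequence=sequences[row["name"]])
        grouped.setdefault(row["locus"], []).append(entry)
    return grouped


by_locus = group_by_locus(FASTA_TEXT, POSITIONS_CSV)
print(sorted(by_locus))
```

Because the locus lives in a column rather than in a directory name, swapping in a different annotation set means replacing one file, not several.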

Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.

This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.

Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.

So I don’t feel strongly, but the rationale for separate h/k/l directories is partly for human readability, and partly to make the catastrophic error of aligning, for example, an igk sequence against an igh gene (which is a mistake I’ve made) much harder. In my imagination the typical use case is someone who isn’t familiar with the format, and needs to interface it with their software. If h/k/l are in subdirs, then they look in the tgz, or whatever they’ve downloaded, see a fasta file in a k/ directory, and they’ve already got what they need for a minimal interface.
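If everything did end up in one file, a similar safeguard could be approximated in code. The sketch below assumes IMGT-style names whose first three letters give the locus (for example IGH, IGK, IGL); names that do not follow that convention, such as VBASE2 identifiers, would need a different rule. The helper names `locus_of` and `check_same_locus` are hypothetical.

```python
def locus_of(gene_name):
    """Return the locus prefix of an IMGT-style name, e.g. 'IGH' for 'IGHV1-69*01'.

    Assumption: the first three characters encode the locus.
    """
    return gene_name[:3].upper()


def check_same_locus(query_locus, germline_names):
    """Raise ValueError if any germline name belongs to a different locus."""
    expected = query_locus.upper()
    wrong = [name for name in germline_names if locus_of(name) != expected]
    if wrong:
        raise ValueError("germline genes from another locus: " + ", ".join(wrong))


# Passes silently: every gene is from the heavy chain locus.
check_same_locus("IGH", ["IGHV1-69*01"])

# A mixed set is rejected before any alignment is attempted.
try:
    check_same_locus("IGK", ["IGKV1-39*01", "IGHV1-69*01"])
    mixed_set_rejected = False
except ValueError:
    mixed_set_rejected = True
```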

I definitely agree that minimizing the number of files should be a design criterion, but I’d say that argues for smooshing all the non-sequence info into a single .csv.

Here’s a schematic regarding Ig content in different species, from a 2014 review by Rita Pettinello and Helen Dooley in ‘Biomolecules’.


Clearly the H/K/L designations will not suffice if we are aiming for an inclusive system.
I personally prefer a single fasta file for each gene type (one fasta for all IGHV, one for all IGKV etc) since it makes it easier to port to other analysis tools that allow you to upload custom database files.


Really interesting discussion. One point, though: computationally efficient approaches do not always translate into outputs that are easily interpreted by the wider community beyond the computational environment. I believe there is a need for standardized nomenclature in numbering schemes when data is presented in final form, as that greatly facilitates interpretation of a study’s outcomes. If different numbering schemes are used during computation, the outputs should in the end be transformed into results based on such standard nomenclature. This greatly facilitates discussion beyond the computational environment, a mode of communication that remains, and likely will remain, an important platform for scientific discussion and the exchange of ideas. I suggest that the AIRR recommendations include the use of a standard final output numbering scheme, and that AIRR and other partners in antibody research work with other sources of scientific information (such as PDB) to have them adopt the same final numbering nomenclature. Any thoughts?

I’m coming in a bit late here, but in terms of file format, this data is going to be pretty small, and it seems it may need some flexibility. Storing it across many files in multiple directories seems complex to me, especially for distribution or putting together custom versions. What do people think about coming up with a JSON schema for this, and simply putting all necessary data into a single JSON file?

Basically, the file would contain a bunch of JSON objects where each one is a single germline gene element. Each object could then include top-level fields for all the information of interest (e.g., gene name, species, locus, ungapped sequence, IMGT-gapped sequence, Cys pos, etc.). Furthermore, each object could include an array that contains any additional annotations that need to be located against the sequence (e.g., the equivalent of a BED file inside each JSON record).

This would make the data very easy to read/write/edit (it’s text based), easy to distribute (single file), easy to support new features (it’s semi-structured and easy to add custom fields), and very easy to work with for developers, as there exist JSON parsers for virtually every language out there.
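To make the idea concrete, here is a rough sketch of what one such record might look like and how a tool could read it back. The field names (`gene_name`, `annotations`, `start`, `end`) and the toy sequence are guesses for illustration only, and the BED-style zero-based, half-open coordinates are an assumption rather than anything agreed in this thread.

```python
import json

# One JSON object per germline gene element. All field names and values
# here are hypothetical placeholders, not a proposed schema.
germline_set = [
    {
        "gene_name": "toy-IGHV-1",
        "species": "toy species",
        "locus": "IGH",
        "sequence": "CAGGTGCAGCTGGTGCAG",
        # BED-like annotations: zero-based start, exclusive end (assumed).
        "annotations": [
            {"label": "region_a", "start": 0, "end": 9},
            {"label": "region_b", "start": 9, "end": 18},
        ],
    }
]

# Round-trip through text, as a file on disk would.
text = json.dumps(germline_set, indent=2)
loaded = json.loads(text)


def annotated_subsequences(record):
    """Map each annotation label to the subsequence it covers."""
    return {
        ann["label"]: record["sequence"][ann["start"]:ann["end"]]
        for ann in record["annotations"]
    }


regions = annotated_subsequences(loaded[0])
print(regions)
```

Adding a new kind of annotation then only means appending another entry to the array, without changing the top-level fields.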

I’m not certain what to do with additional inferred/aggregated germline info (e.g., alignments/trees). But this could surely be encoded in a second JSON file with appropriate structure.


For future reference, here’s the paper @martin_corcoran referenced.

Great point, @mats.ohlin. I think at this point we are simply describing the format of such output rather than exactly what should get put in it. I assume that we all agree that the format shouldn’t dictate the content, but let me know otherwise.

@laserson, I agree with many of your points, and it would be lovely to have everyone using JSON for everything. However, people can effortlessly throw a FASTA file into other sequence analysis tools, and open CSVs in Excel.

I don’t think that it would be crazy to have the fundamental format be a unified JSON file, and then have downloads available in CSV/FASTA that get auto-generated to keep up with the JSON. In any case, specifying a schema would be a good way to agree on logical organization first. Do you have any suggested tools for working with schemas? E.g. https://github.com/Julian/jsonschema ?
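As a rough illustration of the auto-generation step, the sketch below derives FASTA and CSV text from a JSON document using only the Python standard library. The record fields and toy sequences are hypothetical, and schema validation (for example with the jsonschema package linked above) is deliberately left out.

```python
import csv
import io
import json

# Hypothetical master file content; field names are placeholders.
JSON_TEXT = """[
  {"gene_name": "toy-IGHV-1", "locus": "IGH", "sequence": "CAGGTGCAGCTG"},
  {"gene_name": "toy-IGKV-1", "locus": "IGK", "sequence": "GACATCCAGATG"}
]"""


def to_fasta(records):
    """Render records as FASTA text, one header line and one sequence line each."""
    return "".join(
        ">{}\n{}\n".format(rec["gene_name"], rec["sequence"]) for rec in records
    )


def to_csv(records, fields=("gene_name", "locus")):
    """Render the chosen non-sequence fields as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(fields),
        extrasaction="ignore",  # skip fields not listed, such as the sequence
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = json.loads(JSON_TEXT)
fasta_text = to_fasta(records)
csv_text = to_csv(records)
print(fasta_text)
print(csv_text)
```

Since both downloads are pure functions of the JSON, regenerating them whenever the JSON changes keeps the three representations in step.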

I think it would be very useful and helpful to define a schema that covers all the information that we expect a germline parser to need in order to do its job.

As I see it, the data will come from multiple sources: the raw sequences, the collection of raw sequences into a sequence set, the naming, and the identification of CDRs could all be contributed by different groups. Call these ‘classes of information’ for want of a better description. It’s the combination of these classes that forms a germline set. I think it’s important that germline sets can be defined and modified flexibly by combining and adding classes, so that the field can develop and build on what’s there already, though I fully accept @mats.ohlin’s point that we should encourage generally accepted standards that can change over time.

It might well make sense for all the data for a given germline set to be held in a single file that can be conveniently published for tool users, but in general it should be straightforward to add and modify classes separately: for example to change the naming convention, or add an additional numbering scheme. I think that’s a question of having a clear separation between the classes, and tracking versioning and authorship of each class independently. Provided the schema allows for that, the representation in one file or many isn’t so much of a problem.
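One way to picture that separation is sketched below, assuming each class carries its own author and version and is keyed by a shared sequence identifier. The class name `InfoClass`, the kind labels, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InfoClass:
    """One independently authored and versioned 'class of information'."""
    kind: str      # hypothetical labels, e.g. "sequence", "name"
    author: str
    version: str
    data: dict     # values keyed by a shared sequence identifier


def build_germline_set(*classes):
    """Merge classes per sequence identifier and record each class's provenance."""
    entries = {}
    provenance = {}
    for info in classes:
        provenance[info.kind] = {"author": info.author, "version": info.version}
        for seq_id, value in info.data.items():
            entries.setdefault(seq_id, {})[info.kind] = value
    return {"entries": entries, "provenance": provenance}


sequences = InfoClass("sequence", "group A", "1.0", {"s1": "CAGGTGCAGCTG"})
names_v1 = InfoClass("name", "group B", "1.0", {"s1": "toy-V1"})
names_v2 = InfoClass("name", "group C", "2.0", {"s1": "toy-IGHV-1"})

# Changing the naming convention swaps one class; the sequences are untouched.
original = build_germline_set(sequences, names_v1)
renamed = build_germline_set(sequences, names_v2)
```

Whether the merged result is then written to one file or several becomes a packaging decision rather than a modelling one.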

Following yesterday’s meeting, I’ve volunteered to put together a draft schema - probably just an organized list of fields - which we can review and discuss. Hope to have it up here in a day or two.

William

In VBASE2 the sequence files are stored as fasta files without gaps. During the construction of the VBASE2 database the alignment process provides the aligned sequence with the gaps. The process is so fast that it can be done on the fly.

Just to mention that the program that generates the VBASE2 database can also automatically generate files that look like GenBank entries. It is very easy to generate flat files like that, and with a proper vocabulary all the information can be added to the file structure. A lot of software can read GenBank or sequence files.
In my experience it is very fast to analyse V gene segments, so there is no need to store the analysis results in the FASTA file itself.

The VBASE2 page generates an output that does just that.
At the end of the page you find a text block that is comma delimited and can be pasted into an Excel sheet.
It contains a lot of structured information, including whether the conserved positions are present in the analysed sequence. Please try it out at http://www.vbase2.org.
If you would like to get a different type of output, please let me know.

Werner,

Thanks very much for this.

I searched for a human gene (humIGHV034 / VH1-69*01) and got this output: http://www.vbase2.org/vgene.php?id=humIGHV034 . How do I obtain the comma-delimited output you mention?

For a well-known gene like this, how do you determine the FR and CDR assignments? Are they based on IMGT alignments, or on underlying properties of the sequence?

William

This is very easy (two steps).

From the VBASE2 entry you are interested in, copy the sequence, go to quick search, paste the sequence, and analyse it. At the bottom of the output you will find various comma-delimited outputs.

Of course you can look up your own sequence as well. For more than 10 sequences, please use the DNAPLOT Query tool.

By the way, for your sequence, here are the comma-delimited outputs.
The first line is always the header.

This output uses the IMGT alignment, but I could easily produce other alignments as well (like Kabat or Chothia). The IMGT alignment format is, by the way, mostly based on my “vset alignment format” (it was identical apart from one difference at position 11). In the IMGT consortium we then agreed on the IMGT alignment format at the time.


Amino Acid Table in comma-separated values file format (file extension: .csv)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,FR1:,CDR1:,FR2:,CDR2:,FR3:,CDR3:,FR4:,Amino acid sequence:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,—,---,not found,QVQLVQSGAEVKKPGSSVKVSCKAS,GGTFSSYA,ISWVRQAPGQGLEWMGG,IIPIFGTA,NYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYC,AR,—,QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR

Nucleotide Table in comma-separated values file format (file extension: .csv)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,V-FR1:,V-CDR1:,V-FR2:,V-CDR2:,V-FR3:,V-CDR3:,P1-CDR3:,N1-CDR3:,P2-CDR3:,D-CDR3:,P3-CDR3:,N2-CDR3:,P4-CDR3:,J-CDR3:,J-FR4:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,—,---,not found,CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCT,GGAGGCACCTTCAGCAGCTATGCT,ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGG,ATCATCCCTATCTTTGGTACAGCA,AACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGT,—,---,—,---,—,---,—,---,—,---

Mutation Table (Beginning with the 15. Amino Acid)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,Mutated Sequence:,V-Gene:,D-Gene:,J-Gene:,FR1:,CDR1:,FR2:,CDR2:,FR3:,V-CDR3:,D-CDR3:,J-CDR3:,FR4:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,—,---,not found,no,0,—,0,0,0,0,0,0,—,---

As you can see in the output, it is a V gene segment only. You can try a rearranged sequence yourself. Please let me know how it works for you.
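For anyone wanting to script against this kind of output, here is a small sketch of reading a header-plus-row block and checking that the region columns reassemble the full amino acid sequence. The region and sequence values are copied from the amino acid table above; reducing the table to just these columns, and the helper names, are simplifications made for this example.

```python
import csv
import io

# Subset of the amino acid table shown above: only the region columns
# and the full sequence, with values copied from the thread.
CSV_TEXT = (
    "FR1:,CDR1:,FR2:,CDR2:,FR3:,CDR3:,Amino acid sequence:\n"
    "QVQLVQSGAEVKKPGSSVKVSCKAS,GGTFSSYA,ISWVRQAPGQGLEWMGG,IIPIFGTA,"
    "NYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYC,AR,"
    "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTA"
    "NYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR\n"
)

REGIONS = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3"]


def read_rows(text):
    """Read comma-delimited text whose first line is the header.

    Trailing colons are stripped from the column names.
    """
    reader = csv.reader(io.StringIO(text))
    header = [name.rstrip(":").strip() for name in next(reader)]
    return [dict(zip(header, row)) for row in reader]


def regions_match_sequence(row):
    """Check that the region columns concatenate to the full sequence."""
    return "".join(row[region] for region in REGIONS) == row["Amino acid sequence"]


rows = read_rows(CSV_TEXT)
print(regions_match_sequence(rows[0]))
```

A check like this is a cheap way to catch column shifts when pasting the output into other tools.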

Thanks very much for the explanation. I tried it on some complete sequences from my datasets and it worked very nicely. Do you have any thoughts/experience on how the germline assignment and junction translation compares with IMGT and IgBLAST?

William

I of course think that “my way” is the best way.
As my software was the basis for the IMGT software, I expect that the junctions are very similar between the dnaplot analysis in VBASE2 and IMGT.
I did not test IgBLAST.

Of course :) I was wondering, really, whether you saw VBASE2 as an alternative to IMGT or IgBLAST for large-scale analysis, and, if so, whether you might be planning to create a version that could be downloaded and run locally? For my own work I have moved away from High V-Quest because I found the analyses were taking too long and the timing was unpredictable (at one point, two summers ago, I waited several weeks for analyses to complete). Precise control over the germline set is another factor we’ve discussed in this group. My current work is with rabbits, where there are, I think, quite a few problems with currently available sets.


The original VBASE server is older than the IMGT or IgBLAST servers (and their analysis pipelines). It still works and uses the Kabat and/or Chothia alignment (in case you are curious, here is the entry page: http://www2.mrc-lmb.cam.ac.uk/vbase/dnaplot2.php ).

The VBASE2 server was established as a complementary service to the IMGT server and is able to run large-scale analyses. I can run the programs either through the web interface or from the command line on my local computer, so it is built to deal with large datasets.

The server can handle all kinds of sequences. I also still have a version for T cell receptor sequences, which are much easier to handle as there is no somatic hypermutation.

The VBASE2 server is used by others, and the way I do it at the moment is to provide a server for an individual group that can be modified to the needs of the users. I looked into other species as well, and it would be relatively easy to set up a server for a particular species.
Such a server could easily be called from a script through a simple call (with secure connections and password protection).

I also have at least one local setup at a collaborator’s site, where they have attached a database server that captures the parsing results.

It is interesting to learn of the history of VQUEST, which I have wondered about for many years.

Of course, in recent years many other alignment utilities have been developed, and our utility IHMMUNE-ALIGN was certainly developed to better deal with gene ends and IGHD identification. The V end problems with VQUEST are exacerbated by the CDR3 nucleotides of the V gene being excluded from the V gene alignment. This issue is most easily seen with IGHV3-30*03 and IGHV3-30*18, which are identical except for a single nucleotide within the CDR3 region. Even today, if you align the IGHV3-30*18 sequence in VQUEST, it returns both *03 and *18. The expansion of the (non-IMGT) germline repertoire in recent years has led to additional allele pairs like this.
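The effect can be shown with a toy example. The sketch below uses two made-up 12-nucleotide "alleles" that differ only at the final base, and plain prefix comparison as a stand-in for alignment; it is not how VQUEST or any other tool works internally, and the names, sequences, and `CDR3_START` value are all hypothetical.

```python
def matching_alleles(query, germlines, compare_length=None):
    """Return names of germline alleles identical to the query over the compared span.

    compare_length=None compares the whole germline sequence; a number
    compares only that many leading nucleotides.
    """
    hits = []
    for name, seq in germlines.items():
        span = len(seq) if compare_length is None else compare_length
        if query[:span] == seq[:span]:
            hits.append(name)
    return sorted(hits)


# Toy alleles: identical except for the last nucleotide, which we pretend
# lies in the CDR3 portion of the V gene.
TOY_GERMLINES = {
    "toy*03": "ACGTACGTACGA",
    "toy*18": "ACGTACGTACGG",
}
CDR3_START = 9  # assumed boundary for this toy example

query = TOY_GERMLINES["toy*18"]

# Comparing the full V gene identifies a single allele.
full_hits = matching_alleles(query, TOY_GERMLINES)

# Excluding the CDR3 nucleotides leaves the two alleles indistinguishable.
truncated_hits = matching_alleles(query, TOY_GERMLINES, compare_length=CDR3_START)

print(full_hits, truncated_hits)
```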

VQUEST still has many fans: people who are used to it and don’t notice the problems because, most of the time, its output is right. It could continue to improve, but from my experience (many years ago), it is very difficult to encourage modifications to VQUEST. Perhaps if people turn to DNAPLOT, they will find a familiar interface, and Werner might be more willing to respond to any critical examinations of the DNAPLOT performance that arise.