Sponsored by the AIRR Community

Standardizing the format of a germline set

This is great, @w.lees!

To be clear, would a given “germline database” have the option of containing a subset of this master list?

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

To be clear, would a given “germline database” have the option of containing a subset of this master list?

Yes. I’d see the names_xxx.csv files containing a subset of the master list, and this subset would comprise the ‘germline database’ xxx.

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

Yes. That way we don’t have to worry about an existing numbering scheme changing, or not being applicable to a particular species.


I like many of the ideas expressed so far. I’m currently digging into the details of VDJServer’s germline db so I can better describe issues or differences to the group. One thing I notice is that we store a “hierarchy” of four levels, i.e. gene type, gene family, gene, and allele. This is because users commonly want usage counts at these different levels. With codified naming, this hierarchy can be parsed from the name itself, but moving to neutral names means that the information needs to be stored separately. If the depth of the “hierarchy” is fixed at four levels, then this can be handled with four columns in the names_xxx.csv file.

Is the h, k, l separation valid for all organisms? And is that just for BCR? What about TCR, and alpha/beta? Should we be using directory structure to designate these “types”, or might it be better to put this information in the names_xxx.csv?
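A minimal sketch of the four-column idea, assuming a names_xxx.csv keyed by a neutral sequence id. The column names, the ids, and the inline CSV content are hypothetical, chosen only for illustration. Usage counts at any level then become a lookup rather than a name-parsing exercise:

```python
import csv
import io
from collections import Counter

# Hypothetical names_xxx.csv content; column names and ids are illustrative only.
NAMES_CSV = """sequence_id,gene_type,gene_family,gene,allele
seq0001,IGHV,IGHV1,IGHV1-69,IGHV1-69*01
seq0002,IGHV,IGHV1,IGHV1-69,IGHV1-69*02
seq0003,IGHV,IGHV3,IGHV3-23,IGHV3-23*01
"""

LEVELS = ("gene_type", "gene_family", "gene", "allele")


def load_names(text):
    """Map each neutral sequence id to its four hierarchy labels."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["sequence_id"]: {lvl: row[lvl] for lvl in LEVELS} for row in reader}


def usage_counts(assigned_ids, names, level):
    """Count assignments at one hierarchy level without parsing any names."""
    return Counter(names[seq_id][level] for seq_id in assigned_ids)


names = load_names(NAMES_CSV)
# Pretend these are the germline ids assigned to four reads.
assignments = ["seq0001", "seq0002", "seq0001", "seq0003"]
for level in LEVELS:
    print(level, dict(usage_counts(assignments, names, level)))
```

If the hierarchy depth were not fixed, the same lookup would still work with a variable list of level columns, at the cost of a less predictable file layout.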

I like the way this is going but I wonder if it could be made simpler.

If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution-- if I want to use a different set of names and positions, I have to properly extract six files into the 3 respective directories.

What about using one fasta file with all the sequences, and then providing each set of sequence annotations as a pair of files describing the positions and names, each of which has a column describing whether that sequence is an H, K, or L?

Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.

This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.

Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.

So I don’t feel strongly, but the rationale for separate h/k/l directories is partly for human readability, and partly to make the catastrophic error of aligning, for example, an igk sequence against an igh gene (which is a mistake I’ve made) much harder. In my imagination the typical use case is someone who isn’t familiar with the format, and needs to interface it with their software. If h/k/l are in subdirs, then they look in the tgz, or whatever they’ve downloaded, see a fasta file in a k/ directory, and they’ve already got what they need for a minimal interface.

I definitely agree that minimizing the number of files should be a design criterion, but I’d say that argues for smooshing all the non-sequence info into a single .csv.

Here’s a schematic regarding Ig content in different species, from a 2014 review by Rita Pettinello and Helen Dooley in ‘Biomolecules’.


Clearly the H/K/L designations will not suffice if we are aiming for an inclusive system.
I personally prefer a single fasta file for each gene type (one fasta for all IGHV, one for all IGKV etc) since it makes it easier to port to other analysis tools that allow you to upload custom database files.


Really interesting discussion. One point though. Computationally efficient approaches do not always translate into outputs that are easily interpreted by the wider community beyond the computational environment. I believe that there is a need for standardized nomenclature in numbering schemes when data is presented in final form, as that greatly facilitates interpretation of a study’s outcomes. If different numbering schemes are used during computation, the outputs should in the end be transformed into results based on such standard nomenclature, because discussion beyond the computational environment remains an important mode of scientific communication and is likely to remain an important platform for the exchange of ideas. I suggest that AIRR recommendations include the use of a standard final output numbering scheme, and that AIRR and other partners in antibody research work with other sources of scientific information (such as PDB) to make them adopt the same type of final numbering nomenclature. Any thoughts?

I’m coming in a bit late here, but in terms of file format, this data is going to be pretty small, and it seems it may need some flexibility. Storing it across many files in multiple directories seems complex to me, especially for distribution or putting together custom versions. What do people think about coming up with a JSON schema for this, and simply putting all necessary data into a single JSON file?

Basically, the file would contain a bunch of JSON objects where each one is a single germline gene element. Each object could then include top-level fields for all the information of interest (e.g., gene name, species, locus, ungapped sequence, IMGT-gapped sequence, Cys pos, etc.). Furthermore, each object could include an array that contains any additional annotations that need to be located against the sequence (e.g., the equivalent of a BED file inside each JSON record).

This would make the data very easy to read/write/edit (it’s text based), easy to distribute (single file), easy to support new features (it’s semi-structured and easy to add custom fields), and very easy to work with for developers, as there exist JSON parsers for virtually every language out there.

I’m not certain what to do with additional inferred/aggregated germline info (e.g., alignments/trees). But this could surely be encoded in a second JSON file with appropriate structure.
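To make the idea concrete, here is a rough sketch of what one such record might look like and how a tool could pull an annotated feature out of it. Every field name, the toy gene name, and the toy sequence are assumptions for illustration; nothing here is an agreed schema. Annotation coordinates follow the BED convention (0-based start, end-exclusive).

```python
import json

# Hypothetical single-file germline set: field names and the toy sequence
# are illustrative only, not a proposed standard.
GERMLINE_JSON = """
[
  {
    "name": "toy_gene_1",
    "species": "human",
    "locus": "IGH",
    "segment": "V",
    "sequence": "CAGGTGCAGCTGTGTGCGAGA",
    "annotations": [
      {"label": "example_fr", "start": 0, "end": 12},
      {"label": "example_cys", "start": 12, "end": 15}
    ]
  }
]
"""


def feature(record, label):
    """Return the subsequence covered by one annotation (0-based, end-exclusive)."""
    for ann in record["annotations"]:
        if ann["label"] == label:
            return record["sequence"][ann["start"]:ann["end"]]
    raise KeyError(label)


records = json.loads(GERMLINE_JSON)
for rec in records:
    for ann in rec["annotations"]:
        print(rec["name"], ann["label"], feature(rec, ann["label"]))
```

Adding a custom field is then just another key on the object, and parsers that do not know about it can ignore it.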


For future reference, here’s the paper @martin_corcoran referenced.

Great point, @mats.ohlin. I think at this point we are simply describing the format of such output rather than exactly what should get put in it. I assume that we all agree that the format shouldn’t dictate the content, but let me know otherwise.

@laserson-- I agree with many of your points, and it would be lovely to have everyone using JSON for everything. However, people can effortlessly throw a FASTA file into other sequence analysis tools, and open CSVs in Excel.

I don’t think that it would be crazy to have the fundamental format be a unified JSON file, and then have downloads available in CSV/FASTA that get auto-generated to keep up with the JSON. In any case specifying a schema would be a good way to agree on logical organization first. Do you have any suggested tools for working with schema? E.g. https://github.com/Julian/jsonschema ?
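As a rough illustration of the auto-generation step, the sketch below derives a FASTA file (optionally restricted to one locus) and a CSV table from a unified JSON document using only the Python standard library. The record fields (`name`, `locus`, `sequence`) and the toy entries are hypothetical; a real pipeline would presumably validate the JSON against the agreed schema first.

```python
import csv
import io
import json

# Hypothetical unified JSON; field names and toy records are illustrative only.
GERMLINE_JSON = """
[
  {"name": "toy_gene_1", "locus": "IGH", "sequence": "CAGGTGCAGCTG"},
  {"name": "toy_gene_2", "locus": "IGK", "sequence": "GACATCCAGATG"}
]
"""


def to_fasta(records, locus=None):
    """Render records as FASTA text, optionally keeping a single locus."""
    lines = []
    for rec in records:
        if locus is None or rec["locus"] == locus:
            lines.append(">" + rec["name"])
            lines.append(rec["sequence"])
    return "\n".join(lines) + "\n" if lines else ""


def to_csv(records, fields):
    """Render the chosen fields of each record as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for rec in records:
        writer.writerow([rec[f] for f in fields])
    return buffer.getvalue()


records = json.loads(GERMLINE_JSON)
print(to_fasta(records, locus="IGK"), end="")
print(to_csv(records, ("name", "locus")), end="")
```

Filtering by locus at export time would also give the one-FASTA-per-gene-type layout some people prefer, without making it the storage format.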

I think it would be very useful and helpful to define a schema that covers all the information that we expect a germline parser to need in order to do its job.

As I see it, the data will come from multiple sources: the raw sequences, the collection of raw sequences into a sequence set, the naming, the identification of CDRs could all be contributed by different groups. Call these ‘classes of information’ for want of a better description. It’s the combination of these classes that forms a germline set. I think it’s important that germline sets can be defined and modified flexibly by combining and adding classes, so that the field can develop and build on what’s there already, though I fully accept @mats.ohlin’s point that we should encourage generally accepted standards, that can change over time.

It might well make sense for all the data for a given germline set to be held in a single file that can be conveniently published for tool users, but in general it should be straightforward to add and modify classes separately: for example to change the naming convention, or add an additional numbering scheme. I think that’s a question of having a clear separation between the classes, and tracking versioning and authorship of each class independently. Provided the schema allows for that, the representation in one file or many isn’t so much of a problem.
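One way to picture that separation is sketched below: each class of information carries its own version and author, and a germline set is just a combination of classes keyed by a shared neutral sequence id. The class name `InfoClass`, the field names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InfoClass:
    """One independently versioned class of information (hypothetical layout)."""
    kind: str      # e.g. "sequence", "name", "cdr_positions"
    version: str
    author: str
    data: dict     # keyed by neutral sequence id


def build_germline_set(*classes):
    """Combine classes into per-sequence records plus per-class provenance."""
    merged = {}
    for cls in classes:
        for seq_id, value in cls.data.items():
            merged.setdefault(seq_id, {})[cls.kind] = value
    provenance = {cls.kind: (cls.version, cls.author) for cls in classes}
    return merged, provenance


sequences = InfoClass("sequence", "1.0", "group_a", {"seq0001": "CAGGTGCAGCTG"})
names_v1 = InfoClass("name", "1.0", "group_b", {"seq0001": "toy_gene_1"})
names_v2 = InfoClass("name", "2.0", "group_c", {"seq0001": "toy_gene_1_renamed"})

# Swapping the naming class leaves the sequence class untouched.
set_a, prov_a = build_germline_set(sequences, names_v1)
set_b, prov_b = build_germline_set(sequences, names_v2)
print(set_a["seq0001"], prov_a)
print(set_b["seq0001"], prov_b)
```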

Following yesterday’s meeting, I’ve volunteered to put together a draft schema - probably just an organized list of fields - which we can review and discuss. Hope to have it up here in a day or two.

William

In VBASE2 the sequence files are stored as fasta files without gaps. During the construction of the VBASE2 database the alignment process provides the aligned sequence with the gaps. The process is so fast that it can be done on the fly.

Just to mention that the program that generates the VBASE2 database can also automatically generate files that look like Genbank entries. It is very easy to generate flat files like that and with a proper vocabulary all the information can be added to the file structure. A lot of software can read Genbank or sequence files.
In my experience it is very fast to analyse V gene segments, so there is no need to store the result in the fasta file itself.

The vbase2 page generates an output that does just that.
At the end of the page you find a text block that is comma delimited and can be pasted into an excel sheet.
It contains a lot of structured information including the information if the conserved positions are present in the sequence analysed. Please try it out at http://www.vbase2.org.
If you would like to get a different type of output, please let me know.

Werner,

Thanks very much for this.

I searched for a human gene (humIGHV034 / VH1-69*01) and got this output: http://www.vbase2.org/vgene.php?id=humIGHV034 . How do I obtain the comma-delimited output you mention?

For a well-known gene like this, how do you determine the FR and CDR assignments? Are they based on IMGT alignments, or on underlying properties of the sequence?

William

This is very easy (two steps)

From the VBASE2 entry you are interested in, copy the sequence, go to quick search, paste the sequence, and analyse it. At the bottom of the output you will find various comma-delimited outputs.

Of course you can look up your own sequence as well. For more than 10 sequences, please use the DNAPLOT Query tool.

By the way, for your sequence, here are the comma-delimited outputs:
The first line is always the header.

This output uses the IMGT alignment, but I could easily produce other alignments as well (like Kabat or Chothia). The IMGT alignment format is, by the way, mostly based on my “vset alignment format” (it was identical apart from one difference at position 11). In the IMGT consortium we then agreed on the IMGT alignment format at the time.


Amino Acid Table in comma-separated values file format (file extension: .csv)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,FR1:,CDR1:,FR2:,CDR2:,FR3:,CDR3:,FR4:,Amino acid sequence:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,---,---,not found,QVQLVQSGAEVKKPGSSVKVSCKAS,GGTFSSYA,ISWVRQAPGQGLEWMGG,IIPIFGTA,NYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYC,AR,---,QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR

Nucleotide Table in comma-separated values file format (file extension: .csv)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,V-FR1:,V-CDR1:,V-FR2:,V-CDR2:,V-FR3:,V-CDR3:,P1-CDR3:,N1-CDR3:,P2-CDR3:,D-CDR3:,P3-CDR3:,N2-CDR3:,P4-CDR3:,J-CDR3:,J-FR4:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,---,---,not found,CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCT,GGAGGCACCTTCAGCAGCTATGCT,ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGG,ATCATCCCTATCTTTGGTACAGCA,AACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGT,---,---,---,---,---,---,---,---,---,---

Mutation Table (Beginning with the 15. Amino Acid)

Name:,Notes / Problems:,V-Gene (VBASE2):,V-Gene (IMGT):,D-Gene (VBASE2):,D-Gene (IMGT):,J-Gene (VBASE2):,J-Gene (IMGT):,Mutated Sequence:,V-Gene:,D-Gene:,J-Gene:,FR1:,CDR1:,FR2:,CDR2:,FR3:,V-CDR3:,D-CDR3:,J-CDR3:,FR4:
humIGHV034 294 bp,humIGHV034,IGHV1-69*01,---,---,not found,no,0,---,0,0,0,0,0,0,---,---

As you can see in the output, it is a V gene segment only. You can try a rearranged sequence yourself. Please let me know how it works for you.
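For anyone wanting to turn this kind of output into sequence-relative positions (the positions_xxx.csv idea discussed earlier in the thread), here is a small sketch. It takes the FR/CDR strings from the amino acid table above and derives each region's coordinates in the sequence's own numbering. The 0-based, end-exclusive convention and the function name are my assumptions, not anything VBASE2 defines.

```python
from itertools import accumulate

# Region strings copied from the amino acid table above (humIGHV034 / IGHV1-69*01).
REGIONS = [
    ("FR1", "QVQLVQSGAEVKKPGSSVKVSCKAS"),
    ("CDR1", "GGTFSSYA"),
    ("FR2", "ISWVRQAPGQGLEWMGG"),
    ("CDR2", "IIPIFGTA"),
    ("FR3", "NYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYC"),
    ("CDR3", "AR"),
]
FULL = (
    "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG"
    "IIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR"
)


def region_positions(regions):
    """Return (label, start, end) tuples, 0-based and end-exclusive."""
    ends = list(accumulate(len(seq) for _, seq in regions))
    starts = [0] + ends[:-1]
    return [(label, s, e) for (label, _), s, e in zip(regions, starts, ends)]


positions = region_positions(REGIONS)
for label, start, end in positions:
    # Each coordinate pair should recover exactly the region it describes.
    assert FULL[start:end] == dict(REGIONS)[label]
    print(label, start, end)
```

Because the coordinates refer only to the sequence itself, they stay valid regardless of which external numbering scheme is later layered on top.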

Thanks very much for the explanation. I tried it on some complete sequences from my datasets and it worked very nicely. Do you have any thoughts/experience on how the germline assignment and junction translation compares with IMGT and IgBLAST?

William

I of course think that “my way” is the best way.
As my software was the basis for the IMGT software, I expect that the junctions are very similar between the dnaplot analysis in VBASE2 and IMGT.
I did not test IgBLAST.