Sponsored by the AIRR Community

Standardizing the format of a germline set

Great, that sounds good. Thanks for commenting.

I’m totally on board with you: it makes more sense to have the standard be unaligned, but with the CDR[123] or J frame information, or whatever else (which is, as far as I can tell, the main non-subjective part of the alignment) in the CSVs.

I’d just add that the redundant information in separate aligned and unaligned files can be trouble (this is the way my gl directories are set up now, and I’d like to change it), particularly since IMGT has a habit of changing the sequences that correspond to a given allele name (I just got screwed by this yesterday, in fact…).

So, uh, I think I’m just totally agreeing with your second message.

It probably doesn’t matter for germ line databases because they are relatively small, but in general I don’t like having to match a fasta against a csv. I do the following in the header line:
>seqid property=value property=value1,value2 property=....
USearch output does something similar:
>id;size=n;...
with a final semicolon mandatory.
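
As a rough illustration of the two header conventions above, here is a minimal Python sketch that parses each into an id plus a dictionary of properties. The function names and the example property names (`species`, `aliases`) are made up for illustration; only the `>seqid property=value` and `>id;size=n;` shapes come from the post.

```python
def parse_keyvalue_header(header):
    """Parse '>seqid prop=value prop=v1,v2' into (seqid, properties)."""
    fields = header.lstrip(">").split()
    seqid, props = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        # Comma-separated values become a list; single values stay strings.
        props[key] = value.split(",") if "," in value else value
    return seqid, props


def parse_usearch_header(header):
    """Parse a USearch-style '>id;size=n;' header into (id, properties)."""
    # The mandatory trailing semicolon yields an empty last piece; drop it.
    parts = [p for p in header.lstrip(">").split(";") if p]
    seqid, props = parts[0], {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        props[key] = value
    return seqid, props


# Hypothetical headers, for illustration only.
print(parse_keyvalue_header(">seq1 species=human aliases=a,b"))
print(parse_usearch_header(">seq1;size=42;"))
```

Values are kept as strings here; converting `size` to an integer is left to the caller.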

It’s a little messy if you want to drop in additional information, for example to add Kabat numbering to the sequences at a later date. It feels as though that would be cleaner if the files were separate and a bit easier to manipulate: personally I’d rather add annotation to a separate file than change the headers. But one could argue that it’s easier for things to get out of sync that way. I suppose the really important thing is that the information is versioned sufficiently that you can tell you’re using a consistent set of files. At the end of the day the representation’s not as important; it’s easy enough to manipulate into something else.

I’m sorry to say I don’t know what the markers are that one would use (analogous to Cys/Trp) to demarcate these. Could you clarify, @w.lees?

I’d like to hear from @javh and other tool developers on the needed information (though I think we can work out details such as CSV vs fasta header later).

Erick,

I don’t think there are foolproof ways of identifying CDR1 and 2 from sequence (or CDR3!)

Here’s a guide to doing it by hand - you’ll see there are some exceptions noted.

IMGT works out the CDR1 and 2 from the gap-aligned germline sequences. There are tables listing the positions in the alignment for each species - this is the one for the human heavy chain. As far as I know the alignments are verified by hand.

IgBLAST has files in the internal_data directory. Here’s a line from internal_data/human/human.ndm.imgt:

IGHV1-2*04 1 75 76 99 100 150 151 174 175 288 VH 0

The numbers are the first and last nt positions of each field; for example, the last position of FR1 is 75. If you de-gap the corresponding sequence from the IMGT library, the positions line up.
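
To make the layout concrete, here is a small Python sketch that reads a line like the one above into per-region boundaries and slices a region out of a de-gapped sequence. It assumes the ten numbers are (start, end) pairs for FR1, CDR1, FR2, CDR2 and FR3 in that order, 1-based and inclusive; the post only confirms that FR1 ends at 75, so the region labels are an assumption, and the two trailing columns are ignored.

```python
# Assumed order of the five (start, end) pairs on each line.
REGIONS = ("FR1", "CDR1", "FR2", "CDR2", "FR3")


def parse_ndm_line(line):
    """Return (name, {region: (start, end)}) with 1-based inclusive positions."""
    tokens = line.split()
    name = tokens[0]
    coords = [int(t) for t in tokens[1:11]]
    bounds = {
        region: (coords[2 * i], coords[2 * i + 1])
        for i, region in enumerate(REGIONS)
    }
    return name, bounds


def extract_region(ungapped_seq, bounds, region):
    """Slice a region out of an ungapped sequence using 1-based bounds."""
    start, end = bounds[region]
    return ungapped_seq[start - 1:end]


name, bounds = parse_ndm_line(
    "IGHV1-2*04 1 75 76 99 100 150 151 174 175 288 VH 0"
)
print(name, bounds["FR1"], bounds["CDR1"])
# -> IGHV1-2*04 (1, 75) (76, 99)
```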

IgBLAST also has kabat alignments. Here’s a line from human.ndm.kabat (it does not list alleles):

VH1-2 1 90 91 105 106 147 148 198 199 294 VH 0

I can’t immediately make sense of these numbers! I would have expected something closer to the IMGT file. But I hope this casts light on the needed information, for a couple of parsers at least.

p.s. - having checked the reference I gave again, I see that the Kabat CDR1 is 15nt upstream of the IMGT CDR1 - which explains the largest variation in the two IgBLAST lines above.
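
For anyone who wants to see the differences side by side, this short Python sketch subtracts the IMGT boundaries from the Kabat ones for the two lines quoted above. It assumes both lines describe the same underlying sequence and use the same column order, which the post does not confirm (one line names an allele, the other a gene).

```python
def boundaries(line):
    """Return the ten boundary positions from an IgBLAST ndm-style line."""
    return [int(t) for t in line.split()[1:11]]


imgt = boundaries("IGHV1-2*04 1 75 76 99 100 150 151 174 175 288 VH 0")
kabat = boundaries("VH1-2 1 90 91 105 106 147 148 198 199 294 VH 0")

# Kabat position minus IMGT position, column by column.
offsets = [k - i for i, k in zip(imgt, kabat)]
print(offsets)
# -> [0, 15, 15, 6, 6, -3, -3, 24, 24, 6]
```

The third column (the CDR1 start under the assumed layout) differs by 15 nt, matching the shift noted in the p.s.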

William

Does it make sense to others to be combining sequences and inferences about sequence features into one unit? To me it seems strange: I could imagine two different ways of identifying these features for the same underlying data, but I wouldn’t say that these are two different germline sets.

An option could be to have two layers: one, the sequences themselves, and two, inferences using those sequences. I certainly classify multiple sequence alignments as being inferences.

To me that’s a powerful argument.

Should we also consider names and collections to be inferences? Perhaps we should have a base collection of germline genes and provenance, and a defined format that allows one or more people to collect them into sets for different purposes, define names according to some convention, and add annotations such as CDR locations and alignments.

I really strongly agree with Erick that we should keep germline sequences and inference on them as separate as we can. So how about the model is: a github repo with the minimal information necessary in the master branch, but in a format that allows arbitrary new information to be added, so e.g. there could be different branches for different MSAs.

It sounds like CDR1 and CDR2 are pretty ambiguous, so I’d say leave those out of the minimal instance. I might also leave CDR3 out, but users all expect to get functional information, and for that you need the cyst/tryp/phen positions. And as long as we need them, it seems better to make sure we’re all using the same ones.

I’ve thrown up a prototype/example of what this could look like here.

I would suggest keeping the versioning of the underlying sequences separate from versioning of inferences. They will likely be managed by different, separate people according to different rules.

Here’s a modified suggestion for the file structure:

  • h, k, l directories as in your github
  • within each directory:
    germline.fasta - a single file containing the master set of germline sequences that we create. These would have neutral names, probably just an incrementally assigned index number.
    names_xxx.csv - one or more files defining names that are used under a particular naming scheme, mapping them to the ‘neutral’ names in the germline file. For example, names_imgt.csv might, for the human germlines, map IGHV1-69*01 to something, and so on.
    positions_xxx.csv - one or more files defining key positions (as in your extras.csv) for a particular numbering scheme, e.g. positions_imgt.csv, positions_kabat.csv
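
A minimal Python sketch of how a tool might combine the three kinds of file is below. The column names (`neutral_id`, `name`, `cys_pos`), the index numbers, and the sequences are all placeholders invented for illustration; in particular the sequences are not real germline genes and the mapping of IGHV1-69*01 to index 1 is hypothetical.

```python
import csv
import io

# Placeholder file contents; real files would be read from disk.
GERMLINE_FASTA = """>1
CAGGTGCAGCTGGTG
>2
GAGGTGCAGCTGTTG
"""
NAMES_IMGT_CSV = """neutral_id,name
1,IGHV1-69*01
"""
POSITIONS_IMGT_CSV = """neutral_id,cys_pos
1,12
"""


def read_fasta(text):
    """Return {id: sequence} from FASTA text (id = first word of header)."""
    seqs, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = ""
        elif current is not None:
            seqs[current] += line
    return seqs


def read_csv_by_id(text, key="neutral_id"):
    """Return {neutral_id: row} from CSV text."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def build_named_set(fasta_text, names_text, positions_text):
    """Join sequences, names and positions on the neutral id."""
    seqs = read_fasta(fasta_text)
    positions = read_csv_by_id(positions_text)
    named = {}
    for neutral_id, row in read_csv_by_id(names_text).items():
        pos_row = positions.get(neutral_id, {})
        named[row["name"]] = {
            "neutral_id": neutral_id,
            "sequence": seqs[neutral_id],
            "positions": {
                k: int(v) for k, v in pos_row.items() if k != "neutral_id"
            },
        }
    return named


named = build_named_set(GERMLINE_FASTA, NAMES_IMGT_CSV, POSITIONS_IMGT_CSV)
print(named)
```

Only the genes listed in the names file end up in the result, so the names file also acts as the definition of the subset.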

How this would work in practice:

AIRR would sponsor or manage a central repository for germlines: the germline.fasta files for each species, and the underlying data showing who contributed them, what their origins were and so on. Each gene would be allocated an index number, and existing entries would never change. Like a genbank for germlines.

UNSWIG, IMGT and others - maybe AIRR as well in some cases - would publish names_xxx.csv files, mapping these underlying genes into their favoured naming scheme, and bringing together the collections that they feel are suitable for particular purposes. Because entries in germline.fasta never change, the version dependency is quite simple: you just need a germline.fasta file that contains all the genes referenced in names_xxx.csv.
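
The version dependency described here reduces to a set check, roughly as in the Python sketch below. The `neutral_id` column name and the example ids are assumptions for illustration.

```python
def missing_ids(fasta_ids, names_rows, key="neutral_id"):
    """Return ids referenced by a names file but absent from germline.fasta."""
    referenced = {row[key] for row in names_rows}
    return sorted(referenced - set(fasta_ids))


# Hypothetical ids: the fasta holds genes 1 and 2, the names file wants 1 and 3.
fasta_ids = {"1", "2"}
names_rows = [
    {"neutral_id": "1", "name": "IGHV1-69*01"},
    {"neutral_id": "3", "name": "IGHV1-2*04"},
]
print(missing_ids(fasta_ids, names_rows))
# -> ['3']
```

An empty result would mean the germline.fasta in hand is recent enough for that names file.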

The same (or different) people would publish positions_xxx.csv files for numbering schemes of interest. These would need to be kept in step with the names_xxx.csv files.

It would be easy to convert existing IMGT-aligned germline sets into this format. IMGT definition files could be maintained in this format automatically, just being updated whenever the IMGT libraries changed. Hopefully we could persuade UNSWIG and other maintainers of germlines to adopt the format, or automate conversion to it from whatever they use at the moment.

Great. That all seems sensible.

This is great, @w.lees!

To be clear, would a given “germline database” have the option of containing a subset of this master list?

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

To be clear, would a given “germline database” have the option of containing a subset of this master list?

Yes. I’d see the names_xxx.csv files containing a subset of the master list, and this subset would comprise the ‘germline database’ xxx.

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

Yes. That way we don’t have to worry about an existing numbering scheme changing, or not being applicable to a particular species.

I like many of the ideas expressed so far. I’m currently digging into the details of VDJServer’s germline db so I can better describe issues or differences to the group. One thing I notice is that we store a “hierarchy” of four levels, i.e. gene type, gene family, gene, and allele. This is because users commonly want usage counts at these different levels. With codified naming this can be parsed from the name, but getting away from that with neutral names means that this information needs to be stored separately. If the depth of the “hierarchy” is fixed at four levels, then this can be handled with 4 columns in the names_xxx.csv file.
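
As an illustration of parsing the four levels out of a codified name, here is a Python sketch using a regular expression for simple IMGT-style names such as IGHV1-69*01. The pattern is an assumption that covers only names of that general shape; the resulting four values are what would go into the 4 columns of names_xxx.csv.

```python
import re

# Assumed shape: three letters + V/D/J, a family number, an optional
# remainder, then '*' and the allele number (e.g. IGHV1-69*01).
NAME_RE = re.compile(
    r"^(?P<gene_type>[A-Z]{3}[VDJ])(?P<family>\d+)(?P<rest>[^*]*)\*(?P<allele>\d+)$"
)


def hierarchy(name):
    """Split an IMGT-style allele name into the four hierarchy levels."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError("name does not fit the assumed pattern: " + name)
    gene_type = match.group("gene_type")
    family = gene_type + match.group("family")
    gene = family + match.group("rest")
    return {
        "gene_type": gene_type,
        "gene_family": family,
        "gene": gene,
        "allele": name,
    }


print(hierarchy("IGHV1-69*01"))
# -> {'gene_type': 'IGHV', 'gene_family': 'IGHV1', 'gene': 'IGHV1-69', 'allele': 'IGHV1-69*01'}
```

Names that do not fit the pattern raise an error rather than being guessed at, which is one more argument for storing the levels explicitly.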

Is the h, k, l separation valid for all organisms? And is that just for BCR; what about TCR? What about alpha/beta? Should we be using directory structure to designate these “types”, or might it be better for this information to be in the names_xxx.csv?

I like the way this is going but I wonder if it could be made simpler.

If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution-- if I want to use a different set of names and positions, I have to properly extract six files into the 3 respective directories.

What about using one fasta file with all the sequences, and then having each set of sequence annotations provided by a pair of files describing the positions and names, each of which has a column describing whether that sequence is an H, K, or L?

Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.

This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.

Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.

So I don’t feel strongly, but the rationale for separate h/k/l directories is partly for human readability, and partly to make the catastrophic error of aligning, for example, an igk sequence against an igh gene (which is a mistake I’ve made) much harder. In my imagination the typical use case is someone who isn’t familiar with the format, and needs to interface it with their software. If h/k/l are in subdirs, then they look in the tgz, or whatever they’ve downloaded, see a fasta file in a k/ directory, and they’ve already got what they need for a minimal interface.

I definitely agree that minimizing the number of files should be a design criterion, but I’d say that argues for smooshing all the non-sequence info into a single .csv

Here’s a schematic regarding Ig content in different species, from a 2014 review by Rita Pettinello and Helen Dooley in ‘Biomolecules’.


Clearly the H/K/L designations will not suffice if we are aiming for an inclusive system.
I personally prefer a single fasta file for each gene type (one fasta for all IGHV, one for all IGKV etc) since it makes it easier to port to other analysis tools that allow you to upload custom database files.

Really interesting discussion. One point though. Computationally efficient approaches do not always translate into outputs that are easily interpreted by the wider community beyond the computational environment. I believe that there is a need for standardized nomenclature in numbering schemes when data is presented in final form, as that greatly facilitates interpretation of a study’s outcomes. If different numbering schemes are used during computation, the outputs should in the end be transformed into results based on such standard nomenclature, since discussion beyond the computational environment remains, and likely will remain, an important platform for the exchange of ideas and the discussion of study results. I suggest that AIRR recommendations include the use of a standard final output numbering scheme, and that AIRR and other partners in antibody research work with other sources of scientific information (such as PDB) to make them adopt the same type of final numbering nomenclature. Any thoughts?

I’m coming in a bit late here, but in terms of file format, this data is going to be pretty small, and it seems it may need some flexibility. Storing it across many files in multiple directories seems complex to me, especially for distribution or putting together custom versions. What do people think about coming up with a JSON schema for this, and simply putting all necessary data into a single JSON file?

Basically, the file would contain a bunch of JSON objects where each one is a single germline gene element. Each object could then include top-level fields for all the information of interest (e.g., gene name, species, locus, ungapped sequence, IMGT-gapped sequence, Cys pos, etc.). Furthermore, each object could include an array inside of it that contains any additional annotations that need to be located against the sequence (e.g., the equivalent of a BED file inside each JSON record).

This would make the data very easy to read/write/edit (it’s text based), easy to distribute (single file), easy to support new features (it’s semi-structured and easy to add custom fields), and very easy to work with for developers, as there exist JSON parsers for virtually every language out there.
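
To give a feel for what one such record might look like, here is a Python sketch that builds a single record, round-trips it through JSON, and uses the embedded annotation array to slice the sequence. Every field name and value is a placeholder chosen for illustration (the sequence is not a real germline gene), and the annotation coordinates are assumed to be BED-like, i.e. 0-based and half-open.

```python
import json

# One hypothetical germline record; all values are placeholders.
record = {
    "gene_name": "IGHV1-69*01",
    "species": "human",
    "locus": "IGH",
    "sequence": "CAGGTGCAGCTGGTG",
    "cys_pos": 12,
    # BED-like annotations: 0-based start, exclusive end (an assumption).
    "annotations": [
        {"feature": "example_region", "start": 0, "end": 6, "scheme": "imgt"},
    ],
}


def feature_sequence(rec, feature):
    """Return the subsequence covered by the named annotation."""
    for ann in rec["annotations"]:
        if ann["feature"] == feature:
            return rec["sequence"][ann["start"]:ann["end"]]
    raise KeyError(feature)


# The whole germline set would be a list of such records in a single file.
text = json.dumps([record], indent=2)
loaded = json.loads(text)

print(loaded[0]["gene_name"], feature_sequence(loaded[0], "example_region"))
# -> IGHV1-69*01 CAGGTG
```

New annotation types only add entries to the array, so older parsers that ignore unknown fields keep working.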

I’m not certain what to do with additional inferred/aggregated germline info (e.g., alignments/trees). But this could surely be encoded in a second JSON file with appropriate structure.

For future reference, here’s the paper @martin_corcoran referenced.

Great point, @mats.ohlin. I think at this point we are simply describing the format of such output rather than exactly what should get put in it. I assume that we all agree that the format shouldn’t dictate the content, but let me know otherwise.

@laserson-- I agree with many of your points, and it would be lovely to have everyone using JSON for everything. However, people can effortlessly throw a FASTA file into other sequence analysis tools, and open CSVs in Excel.

I don’t think that it would be crazy to have the fundamental format be a unified JSON file, and then have downloads available in CSV/FASTA that get auto-generated to keep up with the JSON. In any case specifying a schema would be a good way to agree on logical organization first. Do you have any suggested tools for working with schema? E.g. https://github.com/Julian/jsonschema ?
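
As a starting point for agreeing on logical organization, here is a Python sketch of what a small schema for one germline record could look like, written in JSON Schema style. To stay dependency-free it uses a tiny hand-rolled `check` function that understands only the `required` and `type` keywords; it is a stand-in for a real validator library, not an implementation of the JSON Schema standard. All field names are assumptions.

```python
# Hypothetical schema for one germline record (JSON Schema style).
SCHEMA = {
    "type": "object",
    "required": ["gene_name", "locus", "sequence"],
    "properties": {
        "gene_name": {"type": "string"},
        "locus": {"type": "string"},
        "sequence": {"type": "string"},
        "cys_pos": {"type": "integer"},
    },
}

# Mapping from schema type names to Python types (subset only).
PY_TYPES = {"string": str, "integer": int, "object": dict, "array": list}


def check(record, schema):
    """Return a list of problems; handles 'required' and 'type' only."""
    if not isinstance(record, PY_TYPES[schema["type"]]):
        return ["record is not of type " + schema["type"]]
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append("missing required field: " + field)
    for field, spec in schema.get("properties", {}).items():
        if field in record and not isinstance(record[field], PY_TYPES[spec["type"]]):
            errors.append(field + " should be of type " + spec["type"])
    return errors


good = {"gene_name": "IGHV1-69*01", "locus": "IGH", "sequence": "CAGGTG"}
bad = {"gene_name": "IGHV1-69*01", "locus": "IGH", "cys_pos": "12"}
print(check(good, SCHEMA))
print(check(bad, SCHEMA))
```

In practice the same `SCHEMA` dictionary could be handed to a full validator, and the auto-generated CSV/FASTA downloads would only be produced from records that pass.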