Sponsored by the AIRR Community

Standardizing the format of a germline set

Does it make sense to others to be combining sequences and inferences about sequence features into one unit? To me it seems strange: I could imagine two different ways of identifying these features for the same underlying data, but I wouldn’t say that these are two different germline sets.

An option could be to have two layers: one, the sequences themselves, and two, inferences using those sequences. I certainly classify multiple sequence alignments as being inferences.

To me that’s a powerful argument.

Should we also consider names and collections to be inferences? Perhaps we should have a base collection of germline genes and provenance, and a defined format that allows one or more people to collect them into sets for different purposes, define names according to some convention, and add annotations such as CDR locations and alignments.

I really strongly agree with Erick that we should keep germline sequences and inference on them as separate as we can. So how about the model is: a github repo with the minimal information necessary in the master branch, but in a format that allows arbitrary new information to be added, so e.g. there could be different branches for different MSAs.

It sounds like CDR1 and CDR2 are pretty ambiguous, so I’d say leave those out of the minimal instance. I might also leave CDR3 out, but users all expect to get functional information, and for that you need the cyst/tryp/phen positions. And as long as we need them, it seems better to make sure we’re all using the same ones.

I’ve thrown up a prototype/example of what this could look like here.

I would suggest keeping the versioning of the underlying sequences separate from the versioning of inferences. They will likely be managed by different people according to different rules.

Here’s a modified suggestion for the file structure:

  • h, k, l directories as in your github
  • within each directory:
    germline.fasta - a single file containing the master set of germline sequences that we create. These would have neutral names, probably just an incrementally assigned index number.
    names_xxx.csv - one or more files defining names that are used under a particular naming scheme, mapping them to the ‘neutral’ names in the germline file. For example, names_imgt.csv might, for the human germlines, map IGHV1-69*01 to something, and so on.
    positions_xxx.csv - one or more files defining key positions (as in your extras.csv) for a particular numbering scheme, e.g. positions_imgt.csv, positions_kabat.csv
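
As a rough sketch of how a parser might combine the three kinds of file: the CSV layouts are not defined in this thread, so the column names `index`, `name` and `cyst_position` are assumptions, and the sequences and index numbers are toy values.

```python
import csv
import io

# Toy stand-ins for germline.fasta, names_imgt.csv and positions_imgt.csv.
# Sequences, index numbers and column names are illustrative assumptions.
GERMLINE_FASTA = ">1\nACGTACGTAC\n>2\nTTGACCGTGA\n"
NAMES_IMGT_CSV = "index,name\n1,IGHV1-69*01\n2,IGHV3-23*01\n"
POSITIONS_IMGT_CSV = "index,cyst_position\n1,6\n2,4\n"


def read_fasta(text):
    """Return {header: sequence} for a FASTA string."""
    seqs = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            seqs[header] = ""
        elif header is not None:
            seqs[header] += line
    return seqs


def read_column(text, key, value):
    """Return {row[key]: row[value]} for a CSV string with a header row."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}


def build_named_set(fasta_text, names_text, positions_text):
    """Join neutral indices to scheme-specific names and key positions."""
    seqs = read_fasta(fasta_text)
    names = read_column(names_text, "index", "name")
    positions = read_column(positions_text, "index", "cyst_position")
    return {
        name: {"sequence": seqs[idx], "cyst_position": int(positions[idx])}
        for idx, name in names.items()
    }


named_set = build_named_set(GERMLINE_FASTA, NAMES_IMGT_CSV, POSITIONS_IMGT_CSV)
```

Only the names file decides which genes appear in the resulting set, so swapping in a different names_xxx.csv yields a different collection from the same germline.fasta.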

How this would work in practice:

AIRR would sponsor or manage a central repository for germlines: the germline.fasta files for each species, and the underlying data showing who contributed them, what their origins were and so on. Each gene would be allocated an index number, and existing entries would never change. Like a genbank for germlines.

UNSWIG, IMGT and others - maybe AIRR as well in some cases - would publish names_xxx.csv files, mapping these underlying genes into their favoured naming scheme, and bringing together the collections that they feel are suitable for particular purposes. Because entries in germline.fasta never change, the version dependency is quite simple: you just need a germline.fasta file that contains all the genes referenced in names_xxx.csv.
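
The dependency check described here could be as small as a set difference. This is only a sketch; the function name `missing_genes` and the index values are hypothetical.

```python
def missing_genes(germline_indices, referenced_indices):
    """Return the indices a names file references that germline.fasta lacks."""
    return sorted(set(referenced_indices) - set(germline_indices))


# Hypothetical index numbers for illustration.
germline_indices = {"1", "2", "3"}
compatible_names = ["1", "3"]
too_new_names = ["2", "4"]

compatible_missing = missing_genes(germline_indices, compatible_names)
too_new_missing = missing_genes(germline_indices, too_new_names)
```

An empty result means the germline.fasta in hand is recent enough for that names file. Because entries are never changed or removed, no other version comparison should be needed.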

The same (or different) people would publish positions_xxx.csv files for numbering schemes of interest. These would need to be kept in step with the names_xxx.csv files.

It would be easy to convert existing IMGT-aligned germline sets into this format. IMGT definition files could be maintained in this format automatically, just being updated whenever the IMGT libraries changed. Hopefully we could persuade UNSWIG and other maintainers of germlines to adopt the format, or automate conversion to it from whatever they use at the moment.


Great. That all seems sensible.

This is great, @w.lees!

To be clear, would a given “germline database” have the option of containing a subset of this master list?

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

To be clear, would a given “germline database” have the option of containing a subset of this master list?

Yes. I’d see the names_xxx.csv files containing a subset of the master list, and this subset would comprise the ‘germline database’ xxx.

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

Yes. That way we don’t have to worry about an existing numbering scheme changing, or not being applicable to a particular species.
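
To illustrate what a position defined against the underlying sequence buys you, here is a minimal sketch. It assumes positions are 0-based offsets into the ungapped nucleotide sequence (a convention not settled in this thread) and uses a made-up sequence; the only hard fact relied on is that TGT and TGC encode cysteine in the standard genetic code.

```python
CYS_CODONS = {"TGT", "TGC"}  # cysteine codons in the standard genetic code


def codon_at(sequence, position):
    """Return the codon starting at a 0-based offset into the ungapped sequence."""
    return sequence[position:position + 3].upper()


def has_conserved_cys(sequence, position):
    """Check that the recorded position really points at a cysteine codon."""
    return codon_at(sequence, position) in CYS_CODONS


# Toy sequence, not a real germline gene.
toy_sequence = "GCCGTGTATTACTGTGCGAGA"
cys_codon = codon_at(toy_sequence, 12)
```

No alignment or external numbering scheme is consulted, so the check works the same way for any species.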


I like many of the ideas expressed so far. I’m currently digging into the details of VDJServer’s germline db so I can better describe issues or differences to the group. One thing I notice is that we store a “hierarchy” of four levels, i.e. gene type, gene family, gene, and allele. This is because users commonly want usage counts at these different levels. With codified naming this can be parsed from the name, but moving to neutral names means that this information needs to be stored separately. If the depth of the “hierarchy” is fixed at four levels, then this can be handled with 4 columns in the names_xxx.csv file.
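
A sketch of how four hierarchy columns would support usage counts at any level. The column names (`gene_type`, `family`, `gene`, `allele`), the index numbers and the list of assignments are assumptions for illustration, not VDJServer's actual layout.

```python
import csv
import io
from collections import Counter

# Hypothetical names_xxx.csv with one column per hierarchy level.
NAMES_CSV = """index,gene_type,family,gene,allele
1,IGHV,IGHV1,IGHV1-69,IGHV1-69*01
2,IGHV,IGHV1,IGHV1-69,IGHV1-69*02
3,IGHV,IGHV3,IGHV3-23,IGHV3-23*01
"""


def load_hierarchy(text):
    """Return {neutral index: row dict} for the names file."""
    return {row["index"]: row for row in csv.DictReader(io.StringIO(text))}


def usage_counts(assigned_indices, hierarchy, level):
    """Count assignments at one level: gene_type, family, gene or allele."""
    return Counter(hierarchy[idx][level] for idx in assigned_indices)


hierarchy = load_hierarchy(NAMES_CSV)
assignments = ["1", "2", "2", "3"]  # neutral indices assigned to four reads
family_counts = usage_counts(assignments, hierarchy, "family")
gene_counts = usage_counts(assignments, hierarchy, "gene")
```

Nothing is parsed out of the name strings here, so the same code would work for a naming scheme that does not encode family or gene in the allele name.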

Is the h, k, l separation valid for all organisms? And is that just for BCR, what about TCR? What about alpha/beta? Should we be using directory structure to designate these “types” or might it be better that this information is in the names_xxx.csv?

I like the way this is going but I wonder if it could be made simpler.

If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution-- if I want to use a different set of names and positions, I have to properly extract six files into the 3 respective directories.

What about using one fasta file with all the sequences, and then each set of sequence annotations being provided by a pair of files describing the positions and names, each of which has a column describing whether that sequence is a H, K, or L.

Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.

This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.
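
The processing step might look something like the sketch below: one FASTA, one names file with a locus column, and a generated archive that still contains per-locus FASTA files. The column names, the toy data and the function `build_bundle` are all hypothetical.

```python
import csv
import io
import tarfile

# Toy inputs: a single master FASTA and a names file carrying a locus column.
GERMLINE_FASTA = ">1\nACGTACGTAC\n>2\nTTGACCGTGA\n"
NAMES_XXX_CSV = "index,name,locus\n1,geneA*01,h\n2,geneB*01,k\n"


def read_fasta(text):
    """Return {header: sequence} for a FASTA string."""
    seqs = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            seqs[header] = ""
        elif line and header is not None:
            seqs[header] += line
    return seqs


def build_bundle(fasta_text, names_text):
    """Return gzipped tar bytes holding one named FASTA file per locus."""
    seqs = read_fasta(fasta_text)
    per_locus = {}
    for row in csv.DictReader(io.StringIO(names_text)):
        record = f">{row['name']}\n{seqs[row['index']]}\n"
        per_locus.setdefault(row["locus"], []).append(record)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for locus in sorted(per_locus):
            data = "".join(per_locus[locus]).encode("utf-8")
            info = tarfile.TarInfo(name=f"{locus}/germline.fasta")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# The bytes could be written out as, say, db-aaa-xxx.tgz.
bundle = build_bundle(GERMLINE_FASTA, NAMES_XXX_CSV)
```

This keeps distribution to a pair of flat files while the download still has the per-locus layout that makes it hard to align against the wrong chain.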

Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.

So I don’t feel strongly, but the rationale for separate h/k/l directories is partly for human readability, and partly to make the catastrophic error of aligning, for example, an igk sequence against an igh gene (which is a mistake I’ve made) much harder. In my imagination the typical use case is someone who isn’t familiar with the format, and needs to interface it with their software. If h/k/l are in subdirs, then they look in the tgz, or whatever they’ve downloaded, see a fasta file in a k/ directory, and they’ve already got what they need for a minimal interface.

I definitely agree that minimizing the number of files should be a design criterion, but I’d say that argues for smooshing all the non-sequence info into a single .csv

Here’s a schematic regarding Ig content in different species, from a 2014 review by Rita Pettinello and Helen Dooley in ‘Biomolecules’.


Clearly the H/K/L designations will not suffice if we are aiming for an inclusive system.
I personally prefer a single fasta file for each gene type (one fasta for all IGHV, one for all IGKV etc) since it makes it easier to port to other analysis tools that allow you to upload custom database files.


Really interesting discussion. One point though. Computationally efficient approaches do not always translate into outputs that are easily interpreted by the wider community beyond the computational environment. I believe that there is a need for standardized nomenclature in numbering schemes when data is presented in final form, as that greatly facilitates interpretation of a study’s outcomes. If different numbering schemes are used during computation, the outputs should in the end be transformed into results based on such standard nomenclature. That facilitates discussion beyond the computational environment, which remains an important platform for scientific discussion of study results and exchange of ideas, and likely will stay so. I suggest that AIRR recommendations include the use of a standard final output numbering scheme, and that AIRR and other partners in antibody research work with other sources of scientific information (such as PDB) to make them adopt the same type of final numbering nomenclature. Any thoughts?

I’m coming in a bit late here, but in terms of file format, this data is going to be pretty small, and it seems it may need some flexibility. Storing it across many files across multiple directories seems complex to me, especially for distribution or putting together custom versions. What do people think about coming up with a JSON schema for this, and simply putting all necessary data into a single JSON file?

Basically, the file would contain a bunch of JSON objects where each one is a single germline gene element. Each object could then include top-level fields for all the information of interest (e.g., gene name, species, locus, ungapped sequence, IMGT-gapped sequence, Cys pos, etc.). Furthermore, each object could include an array inside of it that contains any additional annotations that need to be located against the sequence (e.g., the equivalent of a BED file inside each JSON record).
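
For concreteness, one such record might look like the sketch below. The field names, the label `conserved_cys` and the sequence are all made up; the only convention borrowed is BED's 0-based, half-open intervals for the annotation array.

```python
import json

# One hypothetical germline gene record; all field names are assumptions.
record = {
    "id": "1",
    "species": "hypothetical",
    "locus": "IGH",
    "sequence": "GCCGTGTATTACTGTGCGAGA",  # toy ungapped sequence
    "annotations": [
        # BED-like interval: 0-based start, exclusive end
        {"label": "conserved_cys", "start": 12, "end": 15},
    ],
}


def annotated_region(rec, label):
    """Return the subsequence covered by the first annotation with this label."""
    for ann in rec["annotations"]:
        if ann["label"] == label:
            return rec["sequence"][ann["start"]:ann["end"]]
    return None


# Round-trip through text, as the file would be written and read back.
text = json.dumps([record], indent=2)
loaded = json.loads(text)
```

New annotation types can then be added as extra entries in `annotations` without changing the top-level fields that existing parsers rely on.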

This would make the data very easy to read/write/edit (it’s text based), easy to distribute (single file), easy to support new features (it’s semi-structured and easy to add custom fields), and very easy to work with for developers, as there exist JSON parsers for virtually every language out there.

I’m not certain what to do with additional inferred/aggregated germline info (e.g., alignments/trees). But this could surely be encoded in a second JSON file with appropriate structure.


For future reference, here’s the paper @martin_corcoran referenced.

Great point, @mats.ohlin. I think at this point we are simply describing the format of such output rather than exactly what should get put in it. I assume that we all agree that the format shouldn’t dictate the content, but let me know otherwise.

@laserson-- I agree with many of your points, and it would be lovely to have everyone using JSON for everything. However, people can effortlessly throw a FASTA file into other sequence analysis tools, and open CSVs in Excel.

I don’t think that it would be crazy to have the fundamental format be a unified JSON file, and then have downloads available in CSV/FASTA that get auto-generated to keep up with the JSON. In any case specifying a schema would be a good way to agree on logical organization first. Do you have any suggested tools for working with schema? E.g. https://github.com/Julian/jsonschema ?
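
Auto-generating the FASTA and CSV downloads from a unified JSON file would not take much. A sketch using only the standard library, with assumed field names (`id`, `name`, `locus`, `sequence`) and toy data:

```python
import csv
import io
import json

# Hypothetical unified JSON file; field names and values are illustrative.
GERMLINE_JSON = json.dumps([
    {"id": "1", "name": "geneA*01", "locus": "IGH", "sequence": "ACGTACGTAC"},
    {"id": "2", "name": "geneB*01", "locus": "IGK", "sequence": "TTGACCGTGA"},
])


def to_fasta(records):
    """Render records as FASTA, using the neutral id as the header."""
    return "".join(f">{r['id']}\n{r['sequence']}\n" for r in records)


def to_names_csv(records):
    """Render the naming information as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["index", "name", "locus"])
    for r in records:
        writer.writerow([r["id"], r["name"], r["locus"]])
    return out.getvalue()


records = json.loads(GERMLINE_JSON)
fasta_text = to_fasta(records)
names_text = to_names_csv(records)
```

Since both outputs are derived, they can be regenerated whenever the JSON changes, and only the JSON needs to be validated against whatever schema is agreed.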

I think it would be very useful and helpful to define a schema that covers all the information that we expect a germline parser to need in order to do its job.

As I see it, the data will come from multiple sources: the raw sequences, the collection of raw sequences into a sequence set, the naming, the identification of CDRs could all be contributed by different groups. Call these ‘classes of information’ for want of a better description. It’s the combination of these classes that forms a germline set. I think it’s important that germline sets can be defined and modified flexibly by combining and adding classes, so that the field can develop and build on what’s there already, though I fully accept @mats.ohlin’s point that we should encourage generally accepted standards, that can change over time.

It might well make sense for all the data for a given germline set to be held in a single file that can be conveniently published for tool users, but in general it should be straightforward to add and modify classes separately: for example to change the naming convention, or add an additional numbering scheme. I think that’s a question of having a clear separation between the classes, and tracking versioning and authorship of each class independently. Provided the schema allows for that, the representation in one file or many isn’t so much of a problem.

Following yesterday’s meeting, I’ve volunteered to put together a draft schema - probably just an organized list of fields - which we can review and discuss. Hope to have it up here in a day or two.

William

In VBASE2 the sequence files are stored as fasta files without gaps. During the construction of the VBASE2 database the alignment process provides the aligned sequence with the gaps. The process is so fast that it can be done on the fly.

Just to mention that the program that generates the VBASE2 database can also automatically generate files that look like Genbank entries. It is very easy to generate flat files like that and with a proper vocabulary all the information can be added to the file structure. A lot of software can read Genbank or sequence files.
In my experience it is very fast to analyse V gene segments, so no need to store it in the fasta file itself.

The vbase2 page generates an output that does just that.
At the end of the page you find a text block that is comma delimited and can be pasted into an Excel sheet.
It contains a lot of structured information, including whether the conserved positions are present in the sequence analysed. Please try it out at http://www.vbase2.org.
If you would like to get a different type of output, please let me know.