Standardizing the format of a germline set

psathyrella · July 2, 2016, 10:46pm

Splitting out the discussion of germline (gl) set formats (from here) at Erick’s request:

I think it’s not just igscueal, but most methods need alignment info, in effect – because we need the cyst and tryp positions for each V and J in order to determine functionality. The only reliable way I know of to get the cyst position for a V allele is to have the imgt alignment and take codon 104 (1-indexed). The tryp position you seem to be able to just do your own msa (say, with mafft) and line all the Js up. In either case, I waste a lot of time figuring out alignments and cyst/tryp positions, and I’d imagine most of you folks do, as well. It would be really nice if this hypothetical online database always gave you alignments and cyst/tryp positions, so we knew we were all using all the same info.
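As a rough illustration of the "take codon 104" step above, here is a minimal sketch that converts the codon's column in an IMGT-gapped V sequence into a position in the ungapped sequence. The function name, the 0-indexed return convention, and the toy sequence are assumptions made for this example, not part of any existing tool.

```python
def cyst_position(gapped_seq, codon=104, gap_chars=".-"):
    """Return the 0-indexed position, in the ungapped sequence, of the first
    base of the given 1-indexed codon of an IMGT-gapped V sequence."""
    start = 3 * (codon - 1)  # 0-indexed alignment column of the codon's first base
    if len(gapped_seq) < start + 3:
        raise ValueError("sequence is too short to contain codon %d" % codon)
    if any(c in gap_chars for c in gapped_seq[start:start + 3]):
        raise ValueError("codon %d is gapped in this sequence" % codon)
    # Count only real bases to the left of the codon.
    return sum(1 for c in gapped_seq[:start] if c not in gap_chars)


# Toy gapped sequence (not a real allele): 103 codons, 4 of them gaps, then TGT.
toy = "CAG" * 26 + "..." * 4 + "GTG" * 73 + "TGT" + "GCG"
pos = cyst_position(toy)
print(pos, toy.replace(".", "")[pos:pos + 3])
```

Storing the ungapped position is what would let a csv of cyst positions be used alongside unaligned fasta files.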

I think the other main point is that in the long run we want to be able to run on paired heavy/light chain data, so we should have a gl set format that contains info for both chains.

Given this, a structure that seems reasonable to me would be h/, k/, and l/ subdirectories. Each of these would have aligned sequences in ig[hkl][vdj].fasta, and csvs for the codon info. For instance h/ might contain:

ighv.fasta
ighd.fasta
ighj.fasta
cyst-positions.csv
tryp-positions.csv

I think it makes sense to have the separate csv files, because at least in all the fasta files I’ve dealt with it seems like the only way to store extra info is like so:

>sequence_name | with | more | info!

but then to figure out what, say, with means, you have to go and find some documentation on someone’s web page or something to tell you the ordering. But csvs tell you right in the file.
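To make the contrast concrete, here is a small sketch: splitting a pipe-delimited header yields unnamed positional fields, while a csv read with Python's csv.DictReader carries its column names in the file. The column names gene and cyst_position, and the rows themselves, are placeholders invented for this example.

```python
import csv
import io

# A pipe-delimited header only gives positional fields; their meaning lives elsewhere.
header = ">sequence_name | with | more | info!"
fields = [f.strip() for f in header.lstrip(">").split("|")]

# Hypothetical cyst-positions.csv; column names and values are placeholders.
CYST_CSV = """gene,cyst_position
example_v1,285
example_v2,288
"""


def read_positions(handle, value_column):
    """Map each gene name to an integer position, using the csv's own header row."""
    return {row["gene"]: int(row[value_column]) for row in csv.DictReader(handle)}


cyst = read_positions(io.StringIO(CYST_CSV), "cyst_position")
print(fields)
print(cyst)
```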

psathyrella · July 3, 2016, 12:58am

Well, honestly I think in the long term it makes more sense to have unaligned sequences in the fasta files, since different folks will want to use different alignments. For instance, with Erick’s phylogenetics stuff, he needed to use his own alignments. Whereas the cyst position is usually pretty much the same with any sensible alignment. But the folks that actually use the alignments would have to comment if that would be more trouble than it’s worth for them.

w.lees · July 3, 2016, 3:27pm

Agree that it is best not to try storing this information in the FASTA header.

I think there might be some other needed information:

The positions of CDR1 and CDR2 in the V-gene, so that the parser can provide CDR1 and 2 as well as 3
The frame alignment of each J-gene (although perhaps that can be inferred from the tryp-position)

I think it would be good to have aligned sequences (so that IMGT-gapped output can be provided) as well as unaligned sequences. We could accommodate this either by adding an additional file ighv-gapped.fasta, or by allowing ighv.fasta to contain gaps. In that latter case, one would build two germline sets: one gapped and one not.

w.lees · July 3, 2016, 3:37pm

Sorry, I realized that the CDR 1 and 2 positions are not needed if the V-genes are IMGT-aligned. But I think it’s probably better if they are not aligned: we don’t know for sure that the IMGT alignment will work well for all species going forward, or if better alignments will emerge. So I’d suggest non-aligned V-genes, explicit positions of key fields in other file(s), but also as in my previous message the option to provide additional aligned files, so that parsers can produce gapped output according to the alignment.

psathyrella · July 3, 2016, 4:09pm

Great, that sounds good. Thanks for commenting.

I’m totally onboard with you, it makes more sense to have the standard be unaligned, but with the CDR[123] or J frame information, or whatever else (which is as far as I can tell the main non-subjective part of the alignment) in the csvs.

I’d just add that the redundant information in separate aligned and unaligned files can be trouble (this is the way my gl directories are set up now, and I’d like to change it), particularly since imgt has a habit of changing the sequences that correspond to a given allele name (I just got screwed by this yesterday, in fact…).

So, uh, I think I’m just totally agreeing with your second message.

caschramm · July 4, 2016, 2:06am

It probably doesn’t matter for germline databases because they are relatively small, but in general I don’t like having to match a fasta against a csv. I do the following in the header line:
>seqid property=value property=value1,value2 property=....
USearch output does something similar:
>id;size=n;...
with a final semicolon mandatory.
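Both header conventions are easy to parse. The sketch below assumes whitespace-separated key=value pairs for the first style and semicolon-separated fields for the USearch-like style. The function names and the example property names (cyst, cdr1) are made up for illustration.

```python
def parse_header(line):
    """Parse '>seqid key=value key=v1,v2' into (seqid, {key: [values]})."""
    fields = line.lstrip(">").split()
    seqid, props = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        props[key] = value.split(",")  # single values become one-element lists
    return seqid, props


def parse_usearch_header(line):
    """Parse '>id;size=n;' style labels into (seqid, {key: value})."""
    fields = [f for f in line.lstrip(">").strip().split(";") if f]
    seqid = fields[0]
    # partition() returns (key, '=', value); [::2] keeps key and value.
    props = dict(f.partition("=")[::2] for f in fields[1:])
    return seqid, props


print(parse_header(">IGHV1-2*04 cyst=285 cdr1=76,99"))
print(parse_usearch_header(">read1;size=12;"))
```

Note that neither parser checks that a property name is known, so the meaning of each key still has to be agreed on somewhere.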

w.lees · July 4, 2016, 8:57pm

It’s a little messy if you want to drop in additional information - for example to add kabat numbering to the sequences at a later date. It feels as though that would be cleaner if the files were separate and a bit easier to manipulate: personally I’d rather add annotation to a separate file than change the headers. But one could argue that it’s easier for things to get out of sync that way. I suppose the really important thing is that the information is versioned sufficiently that you can tell you’re using a consistent set of files. At the end of the day the representation’s not as important; it’s easy enough to manipulate into something else.

ematsen · July 6, 2016, 6:10pm

I’m sorry to say I don’t know what the markers are that one would use (analogous to Cys/Trp) to demarcate these. Could you clarify, @w.lees?

I’d like to hear from @javh and other tool developers on the needed information (though I think we can work out details such as CSV vs fasta header later).

w.lees · July 6, 2016, 8:44pm

Erick,

I don’t think there are foolproof ways of identifying CDR1 and 2 from sequence (or CDR3!)

Here’s a guide to doing it by hand - you’ll see there are some exceptions noted.

IMGT works out the CDR1 and 2 from the gap-aligned germline sequences. There are tables listing the positions in the alignment for each species - this is the one for the human heavy chain. As far as I know the alignments are verified by hand.

IgBLAST has files in the internal_data directory. Here’s a line from internal_data/human/human.ndm.imgt:

IGHV1-2*04 1 75 76 99 100 150 151 174 175 288 VH 0

the numbers are the first and last nt position of each field, for example the last position of FR1 is 75. If you de-gap the corresponding sequence from the IMGT library, the positions line up.
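A minimal sketch of reading such a line follows. It assumes the ten numbers are start/end pairs for FR1, CDR1, FR2, CDR2 and FR3 in that order, 1-indexed and inclusive. Only the FR1 end of 75 is confirmed by the description above; the rest of the ordering and the helper names are assumptions. The two trailing fields are passed through uninterpreted.

```python
from collections import namedtuple

Region = namedtuple("Region", "start end")  # 1-indexed, inclusive nt positions

# Assumed order of the five start/end pairs.
REGIONS = ("FR1", "CDR1", "FR2", "CDR2", "FR3")


def parse_ndm_line(line):
    """Split one annotation line into (name, {region: Region}, trailing fields)."""
    fields = line.split()
    numbers = [int(x) for x in fields[1:11]]
    regions = {
        region: Region(numbers[2 * i], numbers[2 * i + 1])
        for i, region in enumerate(REGIONS)
    }
    return fields[0], regions, fields[11:]


def extract(ungapped_seq, region):
    """Slice a region out of the de-gapped sequence."""
    return ungapped_seq[region.start - 1:region.end]


name, regions, rest = parse_ndm_line(
    "IGHV1-2*04 1 75 76 99 100 150 151 174 175 288 VH 0"
)
print(name, regions["CDR1"], rest)
```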

IgBLAST also has kabat alignments. Here’s a line from human.ndm.kabat (it does not list alleles):

VH1-2 1 90 91 105 106 147 148 198 199 294 VH 0

I can’t immediately make sense of these numbers! I would have expected something closer to the IMGT file. But I hope this casts light on the needed information, for a couple of parsers at least.

p.s. - having checked the reference I gave again, I see that the Kabat CDR1 is 15nt upstream of the IMGT CDR1 - which explains the largest variation in the two IgBLAST lines above.

William

ematsen · July 7, 2016, 2:28pm

Does it make sense to others to be combining sequences and inferences about sequence features into one unit? To me it seems strange: I could imagine two different ways of identifying these features for the same underlying data, but I wouldn’t say that these are two different germline sets.

An option could be to have two layers: one, the sequences themselves, and two, inferences using those sequences. I certainly classify multiple sequence alignments as being inferences.

w.lees · July 11, 2016, 7:14pm

To me that’s a powerful argument.

Should we also consider names and collections to be inferences? Perhaps we should have a base collection of germline genes and provenance, and a defined format that allows one or more people to collect them into sets for different purposes, define names according to some convention, and add annotations such as CDR locations and alignments.

psathyrella · July 11, 2016, 9:38pm

I really strongly agree with Erick that we should keep germline sequences and inference on them as separate as we can. So how about the model is: a github repo with the minimal information necessary in the master branch, but in a format that allows arbitrary new information to be added, so e.g. there could be different branches for different MSAs.

It sounds like CDR1 and CDR2 are pretty ambiguous, so I’d say leave those out of the minimal instance. I might also leave CDR3 out, but users all expect to get functional information, and for that you need the cyst/tryp/phen positions. And as long as we need them, it seems better to make sure we’re all using the same ones.

I’ve thrown up a prototype/example of what this could look like here.

w.lees · July 12, 2016, 11:32am

I would suggest keeping the versioning of the underlying sequences separate from versioning of inferences. They will likely be managed by different people according to different rules.

Here’s a modified suggestion for the file structure:

h, k, l directories as in your github
within each directory:
germline.fasta - a single file containing the master set of germline sequences that we create. These would have neutral names, probably just an incrementally assigned index number.
names_xxx.csv - one or more files defining names that are used under a particular naming scheme, mapping them to the ‘neutral’ names in the germline file. For example, names_imgt.csv might, for the human germlines, map IGHV1-69*01 to something, and so on.
positions_xxx.csv - one or more files defining key positions (as in your extras.csv) for a particular numbering scheme, e.g. positions_imgt.csv, positions_kabat.csv
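To show how the layers in this layout could fit together, here is a minimal sketch that resolves scheme-specific names to sequences through the neutral ids. The csv column names (name, germline_id), the toy sequences and the placeholder allele name are all assumptions, since no column layout has been fixed in this thread.

```python
import csv
import io


def read_fasta(handle):
    """Read a fasta file into {id: sequence}, using the first word of each header."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def load_named_set(fasta_handle, names_handle):
    """Return {scheme name: sequence} for every row of a names_xxx.csv."""
    seqs = read_fasta(fasta_handle)
    return {row["name"]: seqs[row["germline_id"]]
            for row in csv.DictReader(names_handle)}


# Toy stand-ins for germline.fasta and names_imgt.csv (not real data).
GERMLINE_FASTA = ">1\nCAGGTGCAGCTG\n>2\nGAGGTGCAGCTG\n"
NAMES_CSV = "name,germline_id\nexample_v1*01,2\n"
named = load_named_set(io.StringIO(GERMLINE_FASTA), io.StringIO(NAMES_CSV))
print(named)
```

Because the names file only lists the ids it uses, the same germline.fasta can back several named sets at once.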

How this would work in practice:

AIRR would sponsor or manage a central repository for germlines: the germline.fasta files for each species, and the underlying data showing who contributed them, what their origins were and so on. Each gene would be allocated an index number, and existing entries would never change. Like a genbank for germlines.

UNSWIG, IMGT and others - maybe AIRR as well in some cases - would publish names_xxx.csv files, mapping these underlying genes into their favoured naming scheme, and bringing together the collections that they feel are suitable for particular purposes. Because entries in germline.fasta never change, the version dependency is quite simple: you just need a germline.fasta file that contains all the genes referenced in names_xxx.csv.
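The version dependency described here reduces to a set difference: every id referenced by a names file must exist in the germline file. A minimal sketch, again assuming a hypothetical germline_id column in names_xxx.csv:

```python
import csv
import io


def fasta_ids(handle):
    """Collect the first word of every non-empty fasta header line."""
    return {line[1:].split()[0]
            for line in handle
            if line.startswith(">") and len(line.strip()) > 1}


def missing_germline_ids(fasta_handle, names_handle, id_column="germline_id"):
    """Return the sorted ids referenced by the names file but absent from the fasta."""
    available = fasta_ids(fasta_handle)
    referenced = {row[id_column] for row in csv.DictReader(names_handle)}
    return sorted(referenced - available)


# Toy data: the names file refers to id 3, which the fasta does not contain.
fasta = io.StringIO(">1\nACGT\n>2\nGGCC\n")
names = io.StringIO("name,germline_id\nexample_a*01,1\nexample_b*01,3\n")
print(missing_germline_ids(fasta, names))
```

An empty result means the germline file is new enough for that names file.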

The same (or different) people would publish positions_xxx.csv files for numbering schemes of interest. These would need to be kept in step with the names_xxx.csv files.

It would be easy to convert existing IMGT-aligned germline sets into this format. IMGT definition files could be maintained in this format automatically, just being updated whenever the IMGT libraries changed. Hopefully we could persuade UNSWIG and other maintainers of germlines to adopt the format, or automate conversion to it from whatever they use at the moment.

psathyrella · July 12, 2016, 9:02pm

Great. That all seems sensible.

ematsen · July 13, 2016, 11:57am

This is great, @w.lees!

To be clear, would a given “germline database” have the option of containing a subset of this master list?

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

w.lees · July 13, 2016, 12:23pm

To be clear, would a given “germline database” have the option of containing a subset of this master list?

Yes. I’d see the names_xxx.csv files containing a subset of the master list, and this subset would comprise the ‘germline database’ xxx.

The format of these positions_xxx.csv files would be numberings based on the actual underlying sequences themselves, not with respect to an existing numbering such as IMGT, right?

Yes. That way we don’t have to worry about an existing numbering scheme changing, or not being applicable to a particular species.

schristley · July 13, 2016, 4:20pm

I like many of the ideas expressed so far. I’m currently digging into the details of VDJServer’s germline db so I can better describe issues or differences to the group. One thing I notice is that we store a “hierarchy” of four levels, i.e. gene type, gene family, gene, and allele. This is because users commonly want usage counts at these different levels. With codified naming this can be parsed, but getting away from that with neutral names means that this information needs to be stored separately. If the depth of the “hierarchy” is fixed at four levels, then this can be handled with 4 columns in the names_xxx.csv file.
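As an illustration of parsing the four levels out of a codified name, here is a rough sketch for IMGT-style names such as IGHV1-69*01. The regular expression and the rule "family = gene type plus the leading digits" are simplifying assumptions and will not cover every name in use. This is not VDJServer's implementation.

```python
import re

# Locus and segment (e.g. IGHV), then everything up to an optional '*allele'.
NAME_RE = re.compile(r"^(?P<gene_type>[A-Z]{3}[VDJ])(?P<rest>[^*]+)(?:\*(?P<allele>\w+))?$")


def hierarchy(name):
    """Split a name into gene type, family, gene and allele."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError("unrecognised name: %r" % name)
    gene_type, rest = m.group("gene_type"), m.group("rest")
    gene = gene_type + rest
    digits = re.match(r"\d+", rest)
    # Fall back to the gene itself when there is no leading family number.
    family = gene_type + digits.group(0) if digits else gene
    allele = name if m.group("allele") else None
    return {"gene_type": gene_type, "family": family, "gene": gene, "allele": allele}


print(hierarchy("IGHV1-69*01"))
print(hierarchy("IGHJ4*02"))
```

With neutral names, the four values returned here are what would have to be stored explicitly as columns.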

Is the h, k, l separation valid for all organisms? And is that just for BCR, what about TCR? What about alpha/beta? Should we be using directory structure to designate these “types” or might it be better that this information is in the names_xxx.csv?

ematsen · July 22, 2016, 1:07pm

I like the way this is going but I wonder if it could be made simpler.

If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution-- if I want to use a different set of names and positions, I have to properly extract six files into the 3 respective directories.

What about using one fasta file with all the sequences, and then each set of sequence annotations being provided by a pair of files describing the positions and names, each of which has a column describing whether that sequence is a H, K, or L?

Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.
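The processing step could look roughly like the sketch below: split the single set of sequences by a locus column and pack one fasta per locus into a .tgz. The column names (name, germline_id, locus), the per-locus file names and the toy data are assumptions for this example; only the db-aaa-xxx.tgz naming comes from the post above.

```python
import csv
import io
import tarfile


def split_by_locus(seqs, annotation_handle):
    """seqs is {germline_id: sequence}; return {locus: fasta text}."""
    by_locus = {}
    for row in csv.DictReader(annotation_handle):
        record = ">%s\n%s\n" % (row["name"], seqs[row["germline_id"]])
        by_locus[row["locus"]] = by_locus.get(row["locus"], "") + record
    return by_locus


def build_bundle(by_locus):
    """Pack one <locus>.fasta per locus into an in-memory gzipped tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for locus, text in sorted(by_locus.items()):
            data = text.encode("ascii")
            info = tarfile.TarInfo(name="%s.fasta" % locus.lower())
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Toy inputs (not real data); the bytes could be written out as db-aaa-xxx.tgz.
seqs = {"1": "CAGGTGCAGCTG", "2": "GACATCCAGATG"}
names = io.StringIO("name,germline_id,locus\nexample_h1*01,1,H\nexample_k1*01,2,K\n")
bundle = build_bundle(split_by_locus(seqs, names))
```

Nothing in this sketch restricts the locus column to H, K and L, so other identifiers would pass through unchanged.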

This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.

Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.

psathyrella · July 22, 2016, 3:50pm

So I don’t feel strongly, but the rationale for separate h/k/l directories is partly for human readability, and partly to make the catastrophic error of aligning, for example, an igk sequence against an igh gene (which is a mistake I’ve made) much harder. In my imagination the typical use case is someone who isn’t familiar with the format, and needs to interface it with their software. If h/k/l are in subdirs, then they look in the tgz, or whatever they’ve downloaded, see a fasta file in a k/ directory, and they’ve already got what they need for a minimal interface.

I definitely agree that minimizing the number of files should be a design criterion, but I’d say that argues for smooshing all the non-sequence info into a single .csv.

martin_corcoran · July 23, 2016, 9:21am

Here’s a schematic regarding Ig content in different species, from a 2014 review by Rita Pettinello and Helen Dooley in ‘Biomolecules’.

Clearly the H/K/L designations will not suffice if we are aiming for an inclusive system.
I personally prefer a single fasta file for each gene type (one fasta for all IGHV, one for all IGKV etc) since it makes it easier to port to other analysis tools that allow you to upload custom database files.