Splitting out the discussion of germline (gl) set formats (from here) at Erick’s request:
I think it’s not just igscueal: most methods need alignment info, in effect, because we need the cyst and tryp positions for each V and J in order to determine functionality. The only reliable way I know of to get the cyst position for a V allele is to take codon 104 (1-indexed) from the imgt alignment. For the tryp position, it seems you can just do your own msa (say, with mafft) and line all the Js up. In either case, I waste a lot of time figuring out alignments and cyst/tryp positions, and I’d imagine most of you folks do as well. It would be really nice if this hypothetical online database always gave you alignments and cyst/tryp positions, so we’d know we were all using the same info.
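To make the codon-104 step concrete, here is a minimal sketch of how the cyst position could be pulled out of an imgt-gapped V sequence. It assumes the gapped nucleotide sequence starts at codon 1 and marks gaps with "." (or "-"); the function name cyst_position is made up for illustration and is not from any existing tool.

```python
def cyst_position(gapped_seq, codon_number=104, gap_chars=".-"):
    """Return the 0-indexed start of an IMGT codon in the UNGAPPED sequence.

    Assumes gapped_seq is an IMGT-gapped nucleotide sequence whose first
    character is the first base of codon 1 (an assumption, not checked here).
    """
    # Codons are 1-indexed, so codon 104 starts at gapped index 3 * 103 = 309.
    gapped_start = 3 * (codon_number - 1)
    if len(gapped_seq) < gapped_start + 3:
        raise ValueError("sequence too short to contain codon %d" % codon_number)

    codon = gapped_seq[gapped_start:gapped_start + 3]
    if any(c in gap_chars for c in codon):
        raise ValueError("codon %d is gapped: %r" % (codon_number, codon))

    # Every gap character before the codon shifts it left in the ungapped sequence.
    n_gaps = sum(1 for c in gapped_seq[:gapped_start] if c in gap_chars)
    return gapped_start - n_gaps
```

The returned value indexes into the sequence with gaps removed, which is presumably what would go in a cyst-positions file. Checking that the codon actually translates to a cysteine is left out here.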
I think the other main point is that in the long run we want to be able to run on paired heavy/light chain data, so we should have a gl set format that contains info for both chains.
Given this, a structure that seems reasonable to me would be h/, k/, and l/ subdirectories. Each of these would have aligned sequences in ig[hkl][vdj].fasta, and csvs for the codon info. For instance h/ might contain:
ighv.fasta
ighd.fasta
ighj.fasta
cyst-positions.csv
tryp-positions.csv
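As a rough sketch of what this layout implies, the snippet below lists the files each subdirectory would be expected to hold and reports any that are missing. The file names come from the listing above; the helper names (expected_files, missing_files) are hypothetical, and it assumes that k/ and l/ simply have no d fasta, since light chains have no D segment.

```python
import os

# Regions per locus: heavy chain has V, D and J; light chains only V and J
# (assumption: light-chain directories just omit the d file).
REGIONS = {"h": "vdj", "k": "vj", "l": "vj"}
CODON_FILES = ["cyst-positions.csv", "tryp-positions.csv"]


def expected_files(locus):
    """File names expected inside the subdirectory for one locus."""
    fastas = ["ig%s%s.fasta" % (locus, region) for region in REGIONS[locus]]
    return fastas + CODON_FILES


def missing_files(base_dir):
    """Return the paths under base_dir that the layout expects but are absent."""
    missing = []
    for locus in REGIONS:
        for fname in expected_files(locus):
            path = os.path.join(base_dir, locus, fname)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Something like missing_files("path/to/gl-set") could then serve as a quick sanity check before any method tries to read the set.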
I think it makes sense to have the separate csv files, because at least in all the fasta files I’ve dealt with it seems like the only way to store extra info is like so:
>sequence_name | with | more | info!
but then to figure out what, say, "with" means, you have to go and find some documentation on someone’s web page or something to tell you the ordering. But csvs tell you right in the file, in the header row.
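A small illustration of the difference, using only the standard library. The csv column names (gene, position) and the row values are invented placeholders, not a proposed schema.

```python
import csv
import io

# Splitting a fasta header gives the fields, but nothing in the file says
# what each one means: you only get them by position.
fasta_header = ">sequence_name | with | more | info!"
fields = [f.strip() for f in fasta_header.lstrip(">").split("|")]

# A csv carries its own column names in the header row.
# (Column names and values here are placeholders for illustration.)
csv_text = "gene,position\nexample_v_1,300\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(fields)               # values identified only by their order
print(rows[0]["position"])  # value looked up by column name
```

With the csv, reordering or adding columns does not break anyone who reads by name, whereas the fasta header breaks silently if the ordering changes.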