I like the way this is going but I wonder if it could be made simpler.
If I understand the current proposal correctly, it’s to have a variety of these positions_xxx.csv and names_xxx.csv files nested in the three directories. This seems like it could be convenient for using this information, but a little annoying for distribution: if I want to use a different set of names and positions, I have to properly extract six files into the three respective directories.
What about using one FASTA file with all the sequences, with each set of sequence annotations provided by a pair of files describing the positions and names, each of which has a column indicating whether that sequence is an H, K, or L?
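To make the single-file idea concrete, here is a rough sketch of how a tool might read one positions file and split it by chain. Everything about the layout is an assumption for illustration: the column names (`sequence_id`, `chain`, `region`, `start`, `end`) and the example rows are made up, not part of the proposal.

```python
import csv
import io
from collections import defaultdict

# Made-up example content; the column names are hypothetical.
POSITIONS_CSV = """sequence_id,chain,region,start,end
seq1,H,regionA,1,10
seq2,K,regionA,1,12
seq3,L,regionA,1,11
seq4,H,regionA,1,10
"""


def group_by_chain(csv_text):
    """Group annotation rows by the value in the 'chain' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["chain"]].append(row)
    return dict(groups)


by_chain = group_by_chain(POSITIONS_CSV)
for chain in sorted(by_chain):
    ids = [row["sequence_id"] for row in by_chain[chain]]
    print(chain, ids)
```

With the chain stored as a column, one file replaces three directories, and the three-way split can still be recovered on demand.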
Then, for a given combination of sequences and positions/names, the database could process these things and offer downloads in a format that would be easily pulled into software tools. E.g. if the sequences were described with aaa and the positions/names were described with xxx, then it would be available as db-aaa-xxx.tgz or something.
This would of course be done with publicly available code such that a user could do the same processing using their own collection of files as desired.
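As a minimal sketch of what that processing step could look like, assuming the "processing" is nothing more than packing the three input files into one archive named after the two identifiers: the function names `bundle_name` and `build_bundle` are hypothetical, and a real pipeline would presumably validate and reshape the files first.

```python
import tarfile
from pathlib import Path


def bundle_name(seq_id, annot_id):
    """Archive name for a sequence set and an annotation set, e.g. db-aaa-xxx.tgz."""
    return f"db-{seq_id}-{annot_id}.tgz"


def build_bundle(fasta, positions, names, seq_id, annot_id, out_dir):
    """Pack one FASTA file and its positions/names files into a gzipped tarball.

    Files are stored flat (no directories) under their own base names.
    Returns the path of the archive that was written.
    """
    out_path = Path(out_dir) / bundle_name(seq_id, annot_id)
    with tarfile.open(out_path, "w:gz") as tar:
        for path in (fasta, positions, names):
            tar.add(str(path), arcname=Path(path).name)
    return out_path
```

A user with their own files could then call something like `build_bundle("sequences_aaa.fasta", "positions_xxx.csv", "names_xxx.csv", "aaa", "xxx", ".")`, where the FASTA file name is only an example.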
Good point. I don’t see a reason we couldn’t allow an arbitrary identifier rather than H/K/L, including alpha and beta for TCRs. Hopefully @martin_corcoran can inform us concerning other organisms.