
Metadata storage with VDJ rearrangement data

In addition to storing the VDJ rearrangements themselves, we need to store some metadata about the data set. This could include the version of the format we’re using and the software used to generate the data (including versions and command-line options). What should go into the metadata?

In addition, we need to propose a format for storing the metadata with the data. It could go in a header of the data file or in a separate file. What are people’s preferences here?

I vote for a file header.

A stupid question:

is a VDJ-rearranged sequence as such never a germline sequence?
(Maybe part of it is…)

And if it is not, why include VDJ sequences directly in a germline database
(and not only the germline sequences)?

I don’t think I have ever seen a heavy chain sequence without at least some N insertions, but I assume it would be technically possible. There are definitely light chain sequences with only germline-encoded nucleotides.

However, this thread is not about the germline database that is being discussed elsewhere on B-T.CR. Here, the question is how software tools should output their analysis of actual rearranged sequences, so that programs from different authors can be made interoperable.

Assuming we are talking about a TSV file or some other flat format without explicit metadata support, then I also vote for a header. Probably using either # (R default), ## (VCF) or @ (SAM) as the comment character(s).

Also, I think it might be best to require metadata to be named fields with only one field per line. For example:

#RUN=A40BD
#DATE=2016.08.26
#ALIGNER=igblast-1.4.0
#REFERENCE=Human.IMGT-2015.07.15
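For illustration, here is a minimal sketch of how a tool could read such a header (assuming #-prefixed KEY=VALUE fields, one per line, before the first data row; the function name is hypothetical):

```python
def parse_header(path, comment_char="#"):
    """Collect KEY=VALUE metadata from comment lines at the top of a flat file."""
    metadata = {}
    with open(path) as handle:
        for line in handle:
            if not line.startswith(comment_char):
                break  # header ends at the first non-comment line
            key, _, value = line.lstrip(comment_char).strip().partition("=")
            if key:
                metadata[key] = value
    return metadata

# parse_header("rearrangements.tsv")
# -> {'RUN': 'A40BD', 'DATE': '2016.08.26', 'ALIGNER': 'igblast-1.4.0', ...}
```

Since lstrip(comment_char) removes any number of leading comment characters, the same code would also tolerate ## (VCF-style) headers.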

It might be helpful to include a field that references a URL to the full metadata, provenance chain, rerun form, etc., as well as authorship, ownership, license, an internal timestamp, and a signature for validation.

Yes, similar to the REMARK records in the PDB text format.

What about using the “data package” spec?

http://frictionlessdata.io/data-packages/

Using JSON for metadata ensures that we don’t have to implement any custom parsers or do any weird hacks in the header. The spec here also supports having a single “data set” include multiple TSV files. Finally, it really shouldn’t be much of a hassle, as BAM files are regularly passed around with companion index files.

One additional issue (and an advantage of the data package approach) is that in the Hadoop/cloud world, data sets are often paths to directories that contain multiple “part” files. For example, if I use Spark to analyze some VDJ data and my computation uses 10 reducers, then the output “TSV” file is actually 10 different TSV files. Writing the metadata into one or all of them would be a hassle, but having a small JSON file in that directory would be easy and also very expressive.

I should also say that we don’t have to follow that spec exactly. Rather, I am more suggesting that we simply have a companion JSON file along with the data file itself.

Having the separate file also makes it easier to modify the metadata if necessary.
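To make this concrete, here is a rough sketch of what writing and reading such a companion file could look like (the file name metadata.json and the field names are just placeholders; the actual schema is still to be decided):

```python
import json
from pathlib import Path

def write_metadata(data_dir, metadata):
    """Drop a companion metadata.json next to the TSV part file(s)."""
    directory = Path(data_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "metadata.json").write_text(json.dumps(metadata, indent=2))

def read_metadata(data_dir):
    return json.loads((Path(data_dir) / "metadata.json").read_text())

# Hypothetical fields, loosely following the example header above.
write_metadata("vdj_run_A40BD", {
    "run": "A40BD",
    "date": "2016-08-26",
    "aligner": "igblast-1.4.0",
    "reference": "Human.IMGT-2015.07.15",
    "files": ["part-00000.tsv", "part-00001.tsv"],
})
```

Because the metadata lives outside the data files, this works equally well whether the “data set” is a single TSV or a directory of Spark part files.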

I like the thumbnail approach you’re describing.

  • Who would have the responsibility of importing, validating, scrubbing, and generating data packages from the raw datasets people provide?
  • Would it be reasonable to also provide a summary or indexed companion file to speed up searching and filtering across all known sets?
  • What are your suggestions for storing the various bits (no pun intended)?
  • Should we assume everyone is running Hadoop/Spark and make federated search the norm, or does it make sense to aggregate each of the derived files into an efficient storage mechanism that one could query across more effectively?
  • Are we reliving the debate over semantic correctness of Resource representations and their content-type values in the HTTP spec?