Nested vs flat data for VDJ rearrangements

javh · October 12, 2016, 5:23pm

I think my preference would be to leave the decision about how to resolve ties up to the alignment software, because then we don’t get into the business of deciding how to draw conclusions from alignment scores. (And I’m okay with it if the aligner chooses to put both genes in the same cell if an ambiguous alignment cannot be resolved, as @w.lees described.)

The way I see it, having two separate files partitions the output of the reference aligner into two pieces:

- The “truth” in the main TSV file. Or rather, the conclusions drawn from alignment inference about the genomic origin of the V(D)J segments.
- The details of the inference in the hit table (“vdj alignments” file).

I think we’ll hit this snag again in later analysis steps, because there are other common outputs that aren’t compatible with one row per observed sequence, such as annotated lineage trees. We may need to define a standard for those at some point as well.

To me, it doesn’t seem like a big burden for an alignment tool to (optionally) output both:

- Top alignment information, as additional columns appended to an input table.
- A hit table with a matching column of sequence IDs, with the number of hits reported determined by a user-defined max-hits parameter. Preferably in a tidy data structure, to simplify analysis and merging.
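
To make the two-output idea concrete, here is a minimal sketch of how the main table and a tidy hit table could be merged on a shared sequence ID column. The column names (`sequence_id`, `segment`, `rank`, `gene`, `score`), the scores, and the helper names are all assumptions made up for illustration; nothing here is a proposed schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical main table: one row per sequence with the aligner's top call.
MAIN_TSV = (
    "sequence_id\tv_call\tj_call\n"
    "seq1\tIGHV1-2*01\tIGHJ4*02\n"
    "seq2\tIGHV3-23*01,IGHV3-23*04\tIGHJ6*01\n"
)

# Hypothetical tidy hit table: one row per (sequence, segment, hit).
HITS_TSV = (
    "sequence_id\tsegment\trank\tgene\tscore\n"
    "seq1\tV\t1\tIGHV1-2*01\t410\n"
    "seq1\tV\t2\tIGHV1-2*02\t395\n"
    "seq2\tV\t1\tIGHV3-23*01\t402\n"
    "seq2\tV\t2\tIGHV3-23*04\t402\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def hits_by_sequence(hit_rows):
    """Index hit rows by sequence ID so they can be joined to the main table."""
    grouped = defaultdict(list)
    for row in hit_rows:
        grouped[row["sequence_id"]].append(row)
    return grouped


def is_tied(hit_rows):
    """Return True if more than one hit shares the best score."""
    scores = [int(h["score"]) for h in hit_rows]
    return bool(scores) and scores.count(max(scores)) > 1


main_rows = read_tsv(MAIN_TSV)
hits = hits_by_sequence(read_tsv(HITS_TSV))

# Join: for every sequence in the main table, look up its V hits.
for row in main_rows:
    v_hits = [h for h in hits[row["sequence_id"]] if h["segment"] == "V"]
    status = "tied" if is_tied(v_hits) else "resolved"
    print(row["sequence_id"], row["v_call"], status)
```

Because the hit table is tidy, the join is a plain lookup on `sequence_id`, and the main table stays at one row per sequence even when the aligner reports a tie.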

laserson · October 25, 2016, 4:15am

Properties of single file:

- no need to join metadata or alignments to the “truth” file to do analyses
- slightly simplifies piping output, if that’s important to you
- one file for the data set is appealing
- data is inherently nested, requiring multiple records per read. This can be accommodated by ensuring that all records for a given read are grouped together (e.g., sorted). (For better and worse)
- multiple records per read requires a global sort/groupby on the data, which is a very strong constraint. Iterating over files requires some trickier, error-prone logic.
- perhaps the rearrangement schema contains a column specifying whether a record should be considered “primary”
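
As a rough sketch of what iterating over a single grouped file might involve, the snippet below reads a file with multiple records per read, checks that each read's records are contiguous, and picks out the record flagged as primary. The `primary` column with `T`/`F` values, the other column names, and the function names are hypothetical, chosen only to illustrate the grouping constraint.

```python
import csv
import io
from itertools import groupby

# Hypothetical single-file layout: several records per read, grouped together,
# with a "primary" flag marking the record to treat as the main call.
NESTED_TSV = (
    "sequence_id\tprimary\tv_call\tscore\n"
    "seq1\tT\tIGHV1-2*01\t410\n"
    "seq1\tF\tIGHV1-2*02\t395\n"
    "seq2\tT\tIGHV3-23*01\t402\n"
    "seq2\tF\tIGHV3-23*04\t402\n"
)


def iter_reads(handle):
    """Yield (sequence_id, records) pairs from a file grouped by read.

    groupby only merges adjacent rows, so an ungrouped file would silently
    split one read into several groups. The `seen` set catches that, at the
    cost of memory proportional to the number of reads.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    seen = set()
    for seq_id, records in groupby(reader, key=lambda r: r["sequence_id"]):
        if seq_id in seen:
            raise ValueError(f"records for {seq_id} are not contiguous")
        seen.add(seq_id)
        yield seq_id, list(records)


def primary_record(records):
    """Return the single record flagged as primary for one read."""
    primaries = [r for r in records if r["primary"] == "T"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary record")
    return primaries[0]


calls = {
    seq_id: primary_record(records)["v_call"]
    for seq_id, records in iter_reads(io.StringIO(NESTED_TSV))
}
print(calls)
```

The contiguity check is the "trickier, error-prone logic" in miniature: without it, an unsorted file would produce wrong groups rather than an error.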

Properties of multiple files:

- can adhere to relational semantics more easily. Makes it much easier to do dataframe-style operations. Easier to follow data modeling best practices. Standard tools expect these types of structures.
- simplifies interpretation for people who aren’t experts: one read should have a “correct” rearrangement
- requires tracking multiple files and performing joins across them (per @schristley’s point of looking at metadata)

If we’re willing to do multiple files, we should probably be willing to have an extra file for the metadata that is shared across all the reads (e.g., aligner properties, information about the samples).

If we’re doing a single file, and we put the metadata into a header, I would vote strongly in favor of making the header a standard format as well. In this case, I’d guess we’d want to support nested data. I’d suggest something like YAML, which has very broad programming-language support and is human-readable and quite intuitive.
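
For illustration only, here is one way a single file with a YAML header could be read: the header is separated from the tabular body so each part can go to its own parser. The `---` delimiter lines, the metadata keys, and the `split_header` function are assumptions invented for this sketch, not an agreed layout.

```python
import csv
import io

# Hypothetical file: a YAML header fenced by '---' lines, then a TSV body.
EXAMPLE = (
    "---\n"
    "aligner:\n"
    "  name: example-aligner\n"
    "  version: 0.1\n"
    "samples:\n"
    "  - id: S1\n"
    "---\n"
    "sequence_id\tv_call\n"
    "seq1\tIGHV1-2*01\n"
)


def split_header(handle):
    """Return (header_text, rows) from a file with a '---'-fenced header."""
    first = handle.readline()
    if first.rstrip("\n") != "---":
        raise ValueError("file does not start with a '---' header marker")
    header_lines = []
    for line in handle:
        if line.rstrip("\n") == "---":
            break
        header_lines.append(line)
    else:
        # The loop ran out of lines without finding the closing marker.
        raise ValueError("header is not terminated by '---'")
    # Whatever remains in the handle is the tabular body.
    rows = list(csv.DictReader(handle, delimiter="\t"))
    return "".join(header_lines), rows


header_text, rows = split_header(io.StringIO(EXAMPLE))
print(header_text)
print(rows)
```

The returned `header_text` can then be handed to any YAML parser (for example `yaml.safe_load`, if PyYAML is installed), which would give the nested metadata as ordinary dictionaries and lists.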