My preference would be to leave the decision about how to resolve ties up to the alignment software, so that we don’t get into the business of deciding how to draw conclusions from alignment scores. (And I’m okay with the aligner putting both genes in the same cell when an ambiguous alignment cannot be resolved, as @w.lees described.)
The way I see it, having two separate files partitions the output of the reference aligner into two pieces:
- The “truth” in the main TSV file. Or rather, the conclusions drawn from alignment inference about the genomic origin of the V(D)J segments.
- The details of the inference in the hit table (“vdj alignments” file).
I think we’ll hit this snag again in later analysis steps, because there are other common outputs that aren’t compatible with one row per observed sequence, such as annotated lineage trees. We may need to define a standard for those at some point as well.
To me, it doesn’t seem like a big burden for an alignment tool to (optionally) output both:
- Top alignment information, as additional columns appended to an input table.
- A hit table with a matching column of sequence IDs, with the number of hits reported determined by a user-defined max hits parameter. Preferably in a tidy data structure, to simplify analysis/merging.
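To make the two outputs concrete, here is a rough sketch of how they could relate. Everything in it is an assumption for illustration only: the column names (`sequence_id`, `segment`, `gene`, `score`, `v_call`, and so on), the `max_hits` parameter name, the example scores, and the comma-joined tie convention are all hypothetical, not a proposed standard or any existing tool's behavior. The tie handling stands in for whatever the aligner itself decides.

```python
from collections import defaultdict

# Hypothetical input table: one row per observed sequence.
sequences = [
    {"sequence_id": "seq1", "sequence": "CAGGTGCAGCTG"},
    {"sequence_id": "seq2", "sequence": "GAGGTGCAGCTG"},
]

# Hypothetical tidy hit table: one row per (sequence, segment, candidate gene).
# Scores are made up for illustration.
hits = [
    {"sequence_id": "seq1", "segment": "V", "gene": "IGHV1-2*01", "score": 250.0},
    {"sequence_id": "seq1", "segment": "V", "gene": "IGHV1-2*02", "score": 250.0},
    {"sequence_id": "seq1", "segment": "V", "gene": "IGHV1-3*01", "score": 231.0},
    {"sequence_id": "seq1", "segment": "J", "gene": "IGHJ4*02", "score": 80.0},
    {"sequence_id": "seq2", "segment": "V", "gene": "IGHV3-23*01", "score": 240.0},
    {"sequence_id": "seq2", "segment": "J", "gene": "IGHJ6*02", "score": 75.0},
    {"sequence_id": "seq2", "segment": "J", "gene": "IGHJ6*03", "score": 70.0},
]


def group_hits(hit_rows):
    """Group hit rows by (sequence_id, segment)."""
    grouped = defaultdict(list)
    for row in hit_rows:
        grouped[(row["sequence_id"], row["segment"])].append(row)
    return grouped


def limit_hits(hit_rows, max_hits):
    """Keep at most max_hits best-scoring rows per sequence and segment."""
    kept = []
    for rows in group_hits(hit_rows).values():
        rows.sort(key=lambda r: r["score"], reverse=True)
        kept.extend(rows[:max_hits])
    return kept


def top_calls(hit_rows):
    """Map (sequence_id, segment) to the top gene(s); tied genes share one cell."""
    calls = {}
    for key, rows in group_hits(hit_rows).items():
        best = max(r["score"] for r in rows)
        tied = sorted(r["gene"] for r in rows if r["score"] == best)
        calls[key] = ",".join(tied)
    return calls


def annotate(seq_rows, hit_rows):
    """Append top-alignment columns to a copy of each input row."""
    calls = top_calls(hit_rows)
    annotated = []
    for row in seq_rows:
        new_row = dict(row)
        for segment in ("V", "D", "J"):
            column = segment.lower() + "_call"
            new_row[column] = calls.get((row["sequence_id"], segment), "")
        annotated.append(new_row)
    return annotated


hit_table = limit_hits(hits, max_hits=2)     # the "vdj alignments" file
main_table = annotate(sequences, hit_table)  # the main TSV: one row per sequence

for row in main_table:
    print(row["sequence_id"], row["v_call"], row["d_call"], row["j_call"])
```

Because both tables carry the same `sequence_id` column, anyone who wants more than the top call can join the hit table back onto the main table without the main table having to grow extra rows.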