I don’t think they’re semantically identical. To make it easier, let me call the “primary” output file the “rearrangements file” and the “secondary” file with all the alternative alignments the “alignments file”.
In the rearrangements file, the read_id functions as a primary key and should be unique within the file, whereas in the alignments file the read_id is not unique. I would also expect the schemas not to be identical (though they can be if we want), for several reasons:
- It would likely be wasteful to replicate the same read/rearrangement-associated fields into each row that’s associated with the read. The alignments file may also be much larger, as there can be tens of rows per read. The alignments file is more “specialized” anyway, so it would not necessarily need the extra fields. And it should be easy enough to join the two files if necessary (e.g., using a data.frame in R or pandas in Python, or (Py)Spark for huge files).
- I could imagine that if we were defining different types for V, D, and J alignments, they might actually have slightly different schemas, so even having a single file designed for “alignments” is a bit of an abuse from a pure data modeling perspective. You could imagine you’d want a separate file for each. Put another way, each row in the alignments file would be either a V alignment, D alignment, or J alignment.
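To make the join concrete, here is a minimal sketch using only the Python standard library. The sample rows and every column name other than read_id (segment, gene, score, junction, and so on) are assumptions made up for illustration, not a proposed schema. It checks that read_id is unique in the rearrangements file and then attaches each alignment row to its parent rearrangement.

```python
import csv
import io

# Made-up sample data; all column names except read_id are assumptions.
REARRANGEMENTS_CSV = """read_id,v_gene,d_gene,j_gene,junction
r1,IGHV1-2,IGHD3-10,IGHJ4,TGTGCGAGA
r2,IGHV3-23,IGHD2-2,IGHJ6,TGTGCGAAA
"""

ALIGNMENTS_CSV = """read_id,segment,gene,score
r1,V,IGHV1-2,250
r1,V,IGHV1-3,231
r1,D,IGHD3-10,40
r1,J,IGHJ4,88
r2,V,IGHV3-23,260
r2,J,IGHJ6,91
"""


def load_rearrangements(text):
    """Index rearrangement rows by read_id, which must be unique."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["read_id"] in rows:
            raise ValueError(f"duplicate read_id: {row['read_id']}")
        rows[row["read_id"]] = row
    return rows


def join_alignments(rearrangements, alignments_text):
    """Yield each alignment row merged with its parent rearrangement row."""
    for aln in csv.DictReader(io.StringIO(alignments_text)):
        parent = rearrangements[aln["read_id"]]
        # Prefix alignment columns so they cannot clash with parent columns.
        extra = {f"aln_{k}": v for k, v in aln.items() if k != "read_id"}
        yield {**parent, **extra}


rearrangements = load_rearrangements(REARRANGEMENTS_CSV)
joined = list(join_alignments(rearrangements, ALIGNMENTS_CSV))
```

The same join is a one-liner with a pandas merge on read_id; the point here is only that the read-level fields need to be stored once and can be re-attached on demand.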
I would also prefer to have just a single file as output, but to successfully do that and capture the multiple alignments, the structure of the record for a single read would have to include 3 nested subtables (one each for V, D, and J), and this is not possible to do with CSV (unless you do things that are generally considered dangerous).
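For comparison, this is roughly what a single-file record with the three nested subtables could look like. It is a sketch only: the field names are hypothetical, and I'm using JSON purely to show the shape, not proposing it as the format. Each read carries a list of V, D, and J alignments, which a flat CSV row has no native way to hold.

```python
import json

# Hypothetical nested record for one read; field names are illustrative only.
record = {
    "read_id": "r1",
    "junction": "TGTGCGAGA",
    "v_alignments": [
        {"gene": "IGHV1-2", "score": 250},
        {"gene": "IGHV1-3", "score": 231},
    ],
    "d_alignments": [{"gene": "IGHD3-10", "score": 40}],
    "j_alignments": [{"gene": "IGHJ4", "score": 88}],
}

# One record per line round-trips cleanly through JSON.
line = json.dumps(record)
restored = json.loads(line)
```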
The way I imagine it, an analysis tool that’s writing out this data would by default, say, output only the rearrangements file. You could instead add a flag which specifies that you should output both the rearrangements file and the alignments file. Any tool implementor would need to support both of these, but since it’s just simple CSV, it should be pretty simple.
That said, if you do really think that many people would at least want access to the top 2 or 3 alignments for V, D, and J, that could easily be supported in the rearrangements file with fields like v_gene_1, v_gene_2, etc.
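As a rough sketch of how a tool could fill such fields, the snippet below ranks alignments per read and segment by score and keeps the top n. The input tuples, the score-based ranking, and the d_gene_N / j_gene_N names are assumptions that extend the v_gene_1 pattern for illustration.

```python
from collections import defaultdict

# Made-up alignments as (read_id, segment, gene, score) tuples.
alignments = [
    ("r1", "V", "IGHV1-3", 231),
    ("r1", "V", "IGHV1-2", 250),
    ("r1", "D", "IGHD3-10", 40),
    ("r1", "J", "IGHJ4", 88),
    ("r2", "V", "IGHV3-23", 260),
]


def top_n_fields(alignments, n=2):
    """Return {read_id: {"v_gene_1": ..., "v_gene_2": ..., ...}}."""
    grouped = defaultdict(list)
    for read_id, segment, gene, score in alignments:
        grouped[(read_id, segment)].append((score, gene))

    wide = defaultdict(dict)
    for (read_id, segment), hits in grouped.items():
        # Highest score first; keep at most n hits per segment.
        hits.sort(key=lambda hit: hit[0], reverse=True)
        for rank, (_score, gene) in enumerate(hits[:n], start=1):
            wide[read_id][f"{segment.lower()}_gene_{rank}"] = gene
    return dict(wide)


wide = top_n_fields(alignments, n=2)
```

Reads with fewer than n hits for a segment simply lack the higher-ranked fields, so a CSV writer would need to emit those columns as empty.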
Finally, I agree that the rearrangements and alignments files can have identical schemas. It would just mean that people would have to be more careful about interpreting the results in such a file. You would also have to reproduce all the read-associated fields for each individual V, D, and J alignment for that read.