Properties of single file:
- no need to join metadata or alignments to the “truth” file to do analyses
- slightly simplifies piping output, if that’s important to you
- one file for the data set is appealing
- data is inherently nested, requiring multiple records per read. This can be accommodated, for better and worse, by ensuring that all records for a given read are grouped together (e.g., sorted).
- multiple records per read requires a global sort/groupby on the data, which is a very strong constraint. Iterating over such files requires trickier, error-prone logic.
- perhaps the rearrangement schema contains a column specifying whether a record should be considered “primary”
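To make the grouping constraint concrete, here's a minimal sketch of the iteration logic a single-file layout would force on consumers. The field names (`sequence_id`, `primary`) are hypothetical, and the correctness of the whole thing hinges on the file already being sorted by read:

```python
import itertools

# Hypothetical records: multiple rearrangement records per read,
# already sorted by sequence_id (the global sort/groupby constraint).
records = [
    {"sequence_id": "read1", "v_call": "IGHV1-2*01", "primary": True},
    {"sequence_id": "read1", "v_call": "IGHV1-2*02", "primary": False},
    {"sequence_id": "read2", "v_call": "IGHV3-23*01", "primary": True},
]

def iter_reads(records):
    """Yield (sequence_id, records_for_that_read) pairs.

    NOTE: itertools.groupby only groups *adjacent* records, so an
    unsorted file would silently split a read into multiple groups --
    exactly the error-prone failure mode described above.
    """
    for seq_id, group in itertools.groupby(records, key=lambda r: r["sequence_id"]):
        yield seq_id, list(group)

for seq_id, recs in iter_reads(records):
    # With a "primary" column, each read's one preferred record
    # can be picked out of its group.
    primary = [r for r in recs if r["primary"]]
```

The silent-failure mode on unsorted input is the main reason this iteration pattern is harder to get right than plain per-row processing.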
Properties of multiple files:
- can adhere to relational semantics more easily, which makes it much easier to do dataframe-style operations and to follow data-modeling best practices. Standard tools expect these types of structures.
- Simplifies interpretation for people who aren't experts: one read has a single "correct" rearrangement
- Requires tracking multiple files and performing joins across them (per @schristley’s point of looking at metadata)
If we’re willing to do multiple files, we should probably be willing to have an extra file for the metadata that is shared across all the reads (e.g., aligner properties, information about the samples, etc).
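For concreteness, the join cost amounts to something like the sketch below, here with plain dicts rather than a dataframe library. The file split and all field names (`sample_id`, `subject`, `tissue`) are hypothetical:

```python
# Rows from a hypothetical rearrangements file, one record per read.
rearrangements = [
    {"sequence_id": "read1", "sample_id": "s1", "v_call": "IGHV1-2*01"},
    {"sequence_id": "read2", "sample_id": "s2", "v_call": "IGHV3-23*01"},
]

# Contents of the shared metadata file, keyed by sample_id.
samples = {
    "s1": {"subject": "subj01", "tissue": "PBMC"},
    "s2": {"subject": "subj02", "tissue": "spleen"},
}

def join_metadata(rearrangements, samples):
    """Left-join sample metadata onto each rearrangement record."""
    return [{**r, **samples.get(r["sample_id"], {})} for r in rearrangements]
```

This is the bookkeeping the multi-file layout asks every consumer to do, though it maps directly onto a one-line merge in any dataframe tool.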
If we're doing a single file, and we put the metadata into a header, I would vote strongly in favor of making the header a standard format as well. In this case, I'd guess we'd want to support nested data. I'd suggest something like YAML, which has very broad programming language support, and is very human-readable and quite intuitive.
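As an illustration of what such a header could hold, here is a hypothetical YAML fragment (all keys are made up for the sake of example, not a proposal for specific field names):

```yaml
# Hypothetical header: metadata shared across all reads in the file.
aligner:
  name: some-aligner
  version: "1.2.3"
samples:
  - sample_id: s1
    subject: subj01
    tissue: PBMC
  - sample_id: s2
    subject: subj02
    tissue: spleen
```

The nesting here (aligner properties, a list of samples) is exactly the kind of structure that's awkward to flatten into columns but trivial to express in YAML.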