What about using the “data package” spec?
Using JSON for metadata means we don’t have to implement any custom parsers or do any weird hacks in the header. The spec here also supports having a single “data set” include multiple TSV files. Finally, shipping a companion metadata file really shouldn’t be much of a hassle: BAM files are already regularly passed around with companion index files.
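To make the "no custom parser" point concrete, here is a rough sketch of what reading such a descriptor could look like with nothing but Python's standard `json` module. The dataset name, resource names, and file names are all made up for illustration, and the descriptor only uses the basic `name`/`resources`/`path` fields as I understand them from the spec.

```python
import json

# Hypothetical descriptor: the dataset name, resource names and file names
# are invented for this example and are not part of any real data set.
DESCRIPTOR = """
{
  "name": "vdj-example",
  "resources": [
    {"name": "clones", "path": "clones.tsv", "format": "tsv"},
    {"name": "samples", "path": "samples.tsv", "format": "tsv"}
  ]
}
"""


def resource_paths(descriptor_text):
    """Map each resource name to its file path using only the stdlib JSON parser."""
    descriptor = json.loads(descriptor_text)
    return {res["name"]: res["path"] for res in descriptor["resources"]}


paths = resource_paths(DESCRIPTOR)
```

One descriptor covers both TSV files, and any language with a JSON library can read it the same way.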
One additional issue (and advantage of the data package approach) is that in the Hadoop/cloud world, data sets are often paths to directories that contain multiple “part” files. For example, if I use Spark to analyze some VDJ data, and my computation uses 10 reducers, then the output “TSV” file would actually be 10 separate TSV files. Writing metadata into one or all of them would then be a hassle. But dropping a small JSON file into that directory would be easy, and also very expressive.
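As a rough sketch of how cheap this would be on the writing side: after the job finishes, something like the following could drop a descriptor next to the part files. The `write_descriptor` helper, the `vdj-output` name, and the `part-*` naming pattern are my assumptions here. It also relies on my reading of the spec that a resource's `path` may be an array when the data is split across files, which is worth double-checking (along with how header rows should be handled across parts).

```python
import json
import tempfile
from pathlib import Path


def write_descriptor(output_dir, name):
    """Write a datapackage.json listing every part file found in output_dir."""
    out = Path(output_dir)
    # Assumes Spark/Hadoop-style names such as "part-00000"; marker files
    # like "_SUCCESS" and hidden checksum files do not match this pattern.
    parts = sorted(p.name for p in out.glob("part-*") if p.is_file())
    descriptor = {
        "name": name,
        # One logical resource whose data is spread over all part files.
        "resources": [{"name": name, "path": parts, "format": "tsv"}],
    }
    target = out / "datapackage.json"
    target.write_text(json.dumps(descriptor, indent=2) + "\n", encoding="utf-8")
    return target


# Usage: simulate a job output directory with three part files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"part-{i:05d}").write_text(f"clone{i}\t{i}\n", encoding="utf-8")
    (Path(tmp) / "_SUCCESS").write_text("", encoding="utf-8")
    descriptor_path = write_descriptor(tmp, "vdj-output")
    loaded = json.loads(descriptor_path.read_text(encoding="utf-8"))
```

The part files themselves are never touched, so it doesn't matter how many reducers the job used.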