How to incorporate clustering info in AIRR-compliant files

Daron · January 5, 2021, 6:20am

Hi AIRR Community,

Our BCR/TCR clustering software, InterClone, is in the final stages of development. InterClone accepts AIRR-compliant TSV files containing BCR or TCR amino acid (AA) sequences and returns clusters under various similarity thresholds. We think it would be easiest for users if the output consisted of the original AIRR files with cluster IDs added as additional columns. There does not seem to be a concept of “cluster” in the current schema, but given the rate of data growth, I think there is already a need for one. Any thoughts on how to handle this within the current standards?

Thanks in advance for your thoughts.

Daron

caschramm · January 5, 2021, 1:40pm

If the clusters are supposed to represent clones, you should use the clone_id field. Otherwise, I think a custom field is fine for now (this is what SONAR does). We are pretty open to revisiting what might need to be a reserved field, though, if you’d like to make the case in more detail.
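For illustration, the custom-field route could look roughly like the sketch below. It is not InterClone's or SONAR's actual code. It assumes the input is a tab-separated AIRR rearrangement file keyed by sequence_id, and the column name interclone_cluster_id and the helper add_cluster_column are hypothetical names chosen for this example. It uses only the Python standard library rather than any AIRR-specific package.

```python
import csv
import io


def add_cluster_column(airr_tsv, cluster_ids, column="interclone_cluster_id"):
    """Return the TSV text with one custom column appended.

    airr_tsv    -- contents of an AIRR-style TSV file, as a string
    cluster_ids -- dict mapping sequence_id -> cluster label
    column      -- name of the custom column (hypothetical, not a reserved AIRR field)
    """
    reader = csv.DictReader(io.StringIO(airr_tsv), delimiter="\t")
    fieldnames = list(reader.fieldnames or [])
    if column in fieldnames:
        # Refuse to overwrite an existing column of the same name.
        raise ValueError(f"column {column!r} already exists")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames + [column], delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Sequences without a cluster assignment get an empty value.
        row[column] = cluster_ids.get(row["sequence_id"], "")
        writer.writerow(row)
    return out.getvalue()


example = "sequence_id\tjunction_aa\nseq1\tCARDYW\nseq2\tCARGYW\n"
print(add_cluster_column(example, {"seq1": "c1"}))
```

All original columns pass through unchanged, so the file stays readable by tools that ignore unknown fields. Clustering at several similarity thresholds could be handled by calling the helper once per threshold with a different column name.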

In general, this type of inquiry would be better to post as an issue on the AIRR GitHub repo.

caschramm · January 5, 2021, 1:41pm

Also, please consider these guidelines in developing your product and consider submitting it to the Software WG for AIRR-compliance certification when you are ready for release!

Daron · January 5, 2021, 2:10pm

Thanks Chaim,
No, they are not clones (hence the name InterClone). I will post any further questions on the GitHub repo, as you suggest!

Daron · January 6, 2021, 12:59pm

We don’t plan to make (all of) this software open source in the short term, but we do intend to keep it free for academics always. Is there any flexibility within the AIRR Community on the open-source issue?