The following set of criteria for including an inferred gene into a database came out of a late-night wine-infused discussion at the AIRR meeting. Thus, it’s not a fully formed proposal, but it could be a starting point for discussion:
- The sequence must be full length and included in a peer-reviewed publication
- The absolute number of unique, productive, un-mutated sequences containing the novel segment must be significantly above background (≥10 sequences)
- These sequences must constitute a significant % of the total unique sequences in the repertoire (≥0.01% of the total unique un-mutated repertoire)
- They must constitute a significant % of the total unique rearrangements to the closest known segment in the repertoire (≥10% of the total unique reads)
- The segment must show a significant association with multiple known J genes that do not contain mutations (≥2 J genes)
- UMIs must match at the 3’ and 5’ ends, or an alternative test must be used, to exclude chimeric amplicons
- There must be ≤3 other alleles of the closest gene in the (inferred) individual genotype
- The novel segment must be found in ≥2 independent biological replicates or sequencing runs (including public data sets)
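
To make the thresholds easier to argue about, here is a rough Python sketch of how the list above could be applied to one candidate. Everything in it is hypothetical: the `CandidateEvidence` fields, the `failed_criteria` function, and the check names are made up for illustration and don't come from any existing tool. It also assumes the 10% criterion uses the unique reads assigned to the closest known segment as its denominator, which the list leaves open.

```python
from dataclasses import dataclass


@dataclass
class CandidateEvidence:
    """Hypothetical summary counts for one inferred segment in one repertoire."""
    full_length: bool
    peer_reviewed: bool
    n_unique_unmutated: int        # unique, productive, un-mutated sequences with the novel segment
    total_unique_unmutated: int    # all unique un-mutated sequences in the repertoire
    closest_gene_unique: int       # unique reads for the closest known segment (assumed denominator)
    n_unmutated_j_genes: int       # distinct un-mutated J genes seen with the segment
    chimera_excluded: bool         # matching 5'/3' UMIs or an alternative chimera test passed
    other_alleles_in_genotype: int # other alleles of the closest gene in the inferred genotype
    n_independent_runs: int        # independent biological replicates or sequencing runs


def failed_criteria(e: CandidateEvidence) -> list[str]:
    """Return the names of the criteria the candidate does not meet."""
    # Guard against empty denominators rather than dividing by zero.
    repertoire_fraction = (
        e.n_unique_unmutated / e.total_unique_unmutated if e.total_unique_unmutated else 0.0
    )
    closest_fraction = (
        e.n_unique_unmutated / e.closest_gene_unique if e.closest_gene_unique else 0.0
    )
    checks = {
        "full_length_and_published": e.full_length and e.peer_reviewed,
        "min_count": e.n_unique_unmutated >= 10,
        "repertoire_fraction": repertoire_fraction >= 0.0001,  # 0.01%
        "closest_gene_fraction": closest_fraction >= 0.10,     # 10%
        "j_gene_association": e.n_unmutated_j_genes >= 2,
        "chimera_excluded": e.chimera_excluded,
        "genotype_allele_count": e.other_alleles_in_genotype <= 3,
        "replication": e.n_independent_runs >= 2,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up numbers, purely to show the call.
example = CandidateEvidence(
    full_length=True, peer_reviewed=True,
    n_unique_unmutated=25, total_unique_unmutated=100_000,
    closest_gene_unique=120, n_unmutated_j_genes=3,
    chimera_excluded=True, other_alleles_in_genotype=1,
    n_independent_runs=2,
)
print(failed_criteria(example))
```

Returning the list of failed checks rather than a single yes/no seemed more useful for discussion, since it shows which threshold is doing the excluding for a given candidate. Note that "significant" is reduced here to the bare cutoffs in parentheses; no statistical test is implied.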
There’s plenty to not like here, but perhaps @ctwatson or @tbkepler would like to express their reservations?