Naming conventions for AIRR identifiers

laserson · August 24, 2016, 4:34pm

People have strong opinions on how to name objects, and those opinions often reflect their background and programming language of choice. Regardless, it’s important to pick a single naming convention and stick to it, so that the format spec is easier to understand and evolve. The choices include snake_case versus camelCase, how to deal with abbreviations (VDJ or Vdj), capitalization, etc.
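To make the options concrete, here is a minimal Python sketch of converting mixed styles to lowercase snake_case. The function name `to_snake_case` and the example field names are made up for illustration; nothing here is part of any spec.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert camelCase, PascalCase or UPPER_CASE names to lowercase snake_case."""
    # Split a run of capitals from a following capitalized word:
    # "VDJScore" -> "VDJ_Score"
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital:
    # "vdjScore" -> "vdj_Score"
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()

# Hypothetical field names, written in different styles
for example in ["vdjScore", "VdjScore", "VDJScore", "VDJscore", "V_CALL"]:
    print(example, "->", to_snake_case(example))
```

The first three spellings all become `vdj_score`, but `VDJscore` comes out as `vd_jscore`, because nothing marks where the abbreviation ends. That is one reason the abbreviation question (VDJ or Vdj) matters if names ever need to be converted mechanically.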

Daniel_Gadala-Maria · August 25, 2016, 12:40am

I would strongly prefer snake case over camel case and slightly prefer all lowercase over all upper or mixed.

javh · August 26, 2016, 8:10pm

My preference is also underscore delimited field names without mixed case.

I have no real preference as to whether field names should be all lowercase or all uppercase. The Change-O standard uses all uppercase for a reason, but it’s not a particularly important reason. I think best practice is to make your applications case-insensitive w.r.t. field names anyway.
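As a rough sketch of what case-insensitive handling could look like in Python (the helper name `read_rows_case_insensitive` is made up, and the two uppercase column names are only sample data in the Change-O style):

```python
import csv
import io

def read_rows_case_insensitive(handle):
    """Yield each row as a dict keyed by the lowercased field name."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Fold the header case once per row so callers can always use lowercase keys
        yield {key.lower(): value for key, value in row.items()}

# Tiny in-memory tab-delimited file with uppercase headers (illustrative only)
sample = io.StringIO("SEQUENCE_ID\tV_CALL\nseq1\tIGHV1-2*01\n")
rows = list(read_rows_case_insensitive(sample))
print(rows[0]["v_call"])
```

The same reader would accept a file whose header is written as `sequence_id` and `v_call`, so the application does not care which case the producer chose.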

Also, I personally prefer the use of abbreviations wherever possible (e.g., “seq” over “sequence”). However, clarity should always take priority over abbreviation. I.e., keep field names as short as possible without making them ambiguous.

ematsen · August 28, 2016, 12:17pm

And what is this reason?

javh · August 29, 2016, 4:01pm

It helps keep field names visually distinct. We started that way for fasta header annotations and log files, and it just carried through to the columnar format.

schristley · August 29, 2016, 4:30pm

Any reason not to be case-insensitive? It is straightforward to handle this in programs. I personally prefer snake_case as the underscores make it easier to read.

laserson · August 29, 2016, 4:37pm

I am personally strongly opposed to ALL CAPS. In programming convention for many languages, all caps is reserved exclusively for constants. It’s also harder to read, and looks really angry.

Regarding @schristley’s point, most programming languages are case-sensitive, so it’s a source of bugs/confusion to allow case-insensitive field-names. It’s also not hard IMO to simply use the specified case exactly.
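A minimal Python sketch of the strict approach, assuming the convention ends up being lowercase snake_case. The pattern and the function name `check_field_names` are my own illustration, not anything from a spec:

```python
import re

# Assumed convention: lowercase letters and digits, words joined by single underscores
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")

def check_field_names(names):
    """Return a list of problems; an empty list means every name is acceptable."""
    problems = []
    seen = {}  # case-folded name -> first spelling encountered
    for name in names:
        if not SNAKE_CASE.fullmatch(name):
            problems.append(f"{name!r} is not lowercase snake_case")
        folded = name.lower()
        if folded in seen and seen[folded] != name:
            # Two names that collide once case is ignored are a likely source of bugs
            problems.append(f"{name!r} differs from {seen[folded]!r} only by case")
        seen.setdefault(folded, name)
    return problems

for problem in check_field_names(["v_call", "V_CALL", "junctionLength"]):
    print(problem)
```

Checking names exactly like this is cheap, and it flags the case where `v_call` and `V_CALL` would silently be treated as two different fields in a case-sensitive language.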

I’m ambivalent between snake case and camel case, as I’ve done a lot of Python and JVM programming.

schristley · August 29, 2016, 4:54pm

Just as a follow-up, my comment was only in terms of the metadata names for the VDJ alignment format. When it comes to actual programming, I tend to prefer different conventions depending upon the language being used. I like GNU standards for straight C (or other procedural languages) and camelCase for object-oriented ones.

javh · August 29, 2016, 4:55pm

So, it looks like we have consensus by apathy on all lowercase snake_case?

psathyrella · August 29, 2016, 5:10pm

very in favor of all-lowercase snakes. google can’t be all wrong, right?

mats.ohlin · August 30, 2016, 7:08am

Well, I believe this matter reaches beyond programming and into the field of scientific communication. If we want to have the same style in programming and in general text (not necessary but maybe the preferred option) I believe there is a strong case for ALL CAPS, as that format allows the reader to easily find such information in a regular text.

laserson · August 30, 2016, 2:08pm

That’s an interesting point, @mats.ohlin. What’s an example you envision where this would be necessary? I will also mention that it’s very common to format code-like entities `like so`, which gives them visual distinction and makes it even clearer that we’re talking about something machine-readable.

mats.ohlin · August 30, 2016, 3:36pm

I believe it might be confusing if different annotations and standards are used in different situations (data files, publications, presentations etc.). If certain formats are used in programming they will likely end up in, say, publications as well, even if the official standard says something else. It has proven very difficult for the antibody community to stick to a defined nomenclature, and not even current standards are implemented in all situations (PDB is a prime example). It is better to have one format type across the board.
The second matter relates to future standards. Are we to introduce a second standard (in which case we had better deviate as much as possible from current IMGT standards to avoid confusion), or is the intention in the long run (as discussed on B-T.CR and as recently discussed by Andrew and Menno, correct me if I’m wrong) to have one joint standard in the field? In my view (maybe not an opinion shared by everyone) it is important to work towards one standard to avoid confusion such as that caused by the variety of Kabat, Chothia, AHo and IMGT naming/numbering conventions that exist (indeed even IMGT numbering conventions have changed).

laserson · August 30, 2016, 6:44pm

“it might be confusing if different annotations and standards are used in different situations”

That’s a good point, but I think it probably applies more to the content of the file format we’re talking about (e.g., the gene names) than the schema definitions themselves (e.g., the field “gene_name”). This file format will be “new”, and there should be only a single, unambiguous way to refer to the fields that we define. That said, I imagine that referring to the field names in our schema will be relatively rare compared with referring to the content in the files.

I think your second point is more a matter for the germline team to decide. Whatever they go with, our file format will be able to represent.