
Re: [netcdfgroup] legal NetCDF variable names



Hi Dave,

> May I request that Netcdf names beginning with a digit be handled as
> a special exception, requiring some kind of permission mode to be
> set before the API will allow these variables to be created?  In
> this way the default behavior for most users would be the same as in
> previous versions.
> 
> I think it is generally desirable to have Netcdf names follow the 
> same rigor and restrictions as variable names in common programming 
> languages.  Indeed Fortran, NCL, C/C++, Grads, and one or more 
> Netcdf conventions all require the first character alphabetic or 
> sometimes underscore.  The original authors of the Netcdf specs had 
> something like this in mind when they formalized this restriction.

You're right that our intent was to provide easy mappings between netCDF
variables and variables in programming languages.  However, there are
also good reasons to support use of names such as "4LFTX" for a 4-layer
lifted index and "5WAVH" for a 5-wave geopotential height.  These come
directly from a well-known table of model output parameter names, and
they have apparently been in use for some time in netCDF archives,
although we didn't realize this.

After the recent release that returned an error when asked to create
such names, we also heard from another user who wrote:

  ... This breaks our application, which uses variable names that are
  completely numeric. This is partly for historical reasons, partly
  because numbers aren't language-specific (they are mapped to a
  language-specific string when they are displayed), and partly
  because it allows us to create more performant and space-efficient
  indices using integers (we have a lot of index information!).

The C-based netCDF libraries have always supported reading netCDF data
with arbitrary names.  On defining names in a new dataset, the libraries
have enforced restrictions on names.  The absence of a check on creating
names beginning with a number ("numeric names" for short) was
inadvertent.  We didn't realize enforcing this restriction in the new
software would cause problems, but now that we know there are
operational data collections that make use of numeric names, we have to
consider our commitment to backward compatibility:

  
http://www.unidata.ucar.edu/software/netcdf/workshops/2007/netcdf4/Compatibility.html

The problem of mapping numeric names to variable names in C and
Fortran has a fairly simple solution, for example mapping the name
"4LFTX" to the variable name "DIGIT_4_LFTX" in the C or Fortran
programs generated by ncgen from CDL input.
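
As a rough illustration of that kind of mapping, here is a minimal
Python sketch.  The function name to_identifier is hypothetical, and the
rule is an assumption generalized from the single "4LFTX" example above;
it is not necessarily what ncgen does.

```python
ASCII_DIGITS = "0123456789"


def to_identifier(name: str) -> str:
    """Map a netCDF name to a name usable as a C or Fortran identifier.

    Assumed rule, generalized from the one example "4LFTX" -> "DIGIT_4_LFTX":
    a leading ASCII digit d is rewritten as "DIGIT_d_" and the rest of the
    name is kept unchanged.  Names that do not start with a digit pass
    through untouched.
    """
    if name and name[0] in ASCII_DIGITS:
        return "DIGIT_" + name[0] + "_" + name[1:]
    return name


print(to_identifier("4LFTX"))  # DIGIT_4_LFTX
print(to_identifier("5WAVH"))  # DIGIT_5_WAVH
print(to_identifier("temp"))   # temp
```

A mapping like this is not collision-free: a dataset containing both
"4LFTX" and "DIGIT_4_LFTX" would produce the same identifier twice, so a
real generator would need some additional check.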

The ncdump utility can display numeric names by escaping the first
character, so that ncgen can parse the resulting CDL and easily
distinguish names from numbers.  For example, when ncgen sees

  variables:
    float \4LFTX(time, level, lat, lon) ;

it will interpret this as a declaration for a variable named "4LFTX".
This CDL support is already implemented and will be in the next daily
snapshot release.
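
The escaping idea can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions: cdl_escape and
cdl_unescape are hypothetical helper names, and the sketch handles only
a leading ASCII digit, whereas the real utilities may escape other
characters as well.

```python
ASCII_DIGITS = "0123456789"


def cdl_escape(name: str) -> str:
    """Prefix a backslash when the name starts with an ASCII digit."""
    if name and name[0] in ASCII_DIGITS:
        return "\\" + name
    return name


def cdl_unescape(token: str) -> str:
    """Drop one leading backslash that precedes an ASCII digit."""
    if len(token) >= 2 and token[0] == "\\" and token[1] in ASCII_DIGITS:
        return token[1:]
    return token


escaped = cdl_escape("4LFTX")
print(escaped)                # \4LFTX
print(cdl_unescape(escaped))  # 4LFTX
print(cdl_escape("temp"))     # temp
```

With the escape in place, a parser can treat any token that starts with
a bare digit as a number and any token that starts with a backslash
followed by a digit as a name.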

> Now I think the alpha restriction is important in practice for 
> several diffuse reasons.  Primarily, it will reduce compatibility 
> problems over time between producers and consumers of Netcdf data, 
> because of conventions.  Also, it's common practice to copy variable 
> names between files and program code, aiding clarity.  Also, the 
> occurrence of non-letters at the start of a name can be helpful as 
> an early warning of malfunctions.

The role of conventions is important.  The CF Conventions, for
example, still require that variable, dimension, and attribute names
begin with a letter or "_", so numeric names technically should not be
used in data for which CF compliance is important.  In practice some
software that requires CF-compliant data may work fine with numeric
names, because the netCDF libraries don't check name syntax when
reading.
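
An application that cares about CF compliance can do its own check on
names, since the library will not do it on reading.  The following
Python sketch assumes the CF-style rule is "first character a letter or
underscore, remaining characters letters, digits, or underscores"; only
the first-character part is stated above, and is_cf_style_name is a
hypothetical helper, not part of any netCDF API.

```python
import re

# Assumed pattern: ASCII letter or "_" first, then letters, digits, or "_".
CF_STYLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_cf_style_name(name: str) -> bool:
    """Return True if the whole name matches the assumed CF-style pattern."""
    return CF_STYLE_NAME.fullmatch(name) is not None


for name in ["temperature", "_FillValue", "4LFTX", "12345"]:
    print(name, is_cf_style_name(name))
```

Here "temperature" and "_FillValue" pass, while "4LFTX" and "12345" are
rejected.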

A larger change that has elicited few comments but that is relevant to
this issue is the addition of support for Unicode names in the most
recent release.  The first character of a name does not have to be in
the US-ASCII character set.  It can also be a non-ASCII Unicode
character, like the rest of the characters in a name, to permit data
to be more self-describing in contexts that use other alphabets.

This is relevant because Unicode actually includes duplicate encodings
of numerals within several other code blocks.  Trying to extend rules
for ASCII names to cover Unicode characters is complicated enough that
we opted to allow any legal non-ASCII Unicode characters (UTF-8 encoded)
in names, in addition to the rules for ASCII characters.  The ncdump and
ncgen utilities in the most recent netCDF releases handle UTF-8 Unicode
in names.
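
To see why an ASCII rule such as "must not start with a digit" does not
extend cleanly to Unicode, consider this small Python sketch.  It uses
the standard unicodedata module; the three non-ASCII sample characters
are my own choice of illustration and do not come from the netCDF
software.

```python
import unicodedata

ASCII_DIGITS = "0123456789"

# Four different code points that all denote the numeral four:
# ASCII, Arabic-Indic, Devanagari, and fullwidth forms.
samples = ["4", "\u0664", "\u096a", "\uff14"]


def describe(ch: str) -> dict:
    """Collect a few facts about one character."""
    return {
        "name": unicodedata.name(ch),
        "value": unicodedata.digit(ch),      # numeric digit value
        "ascii_digit": ch in ASCII_DIGITS,   # what an ASCII-only check sees
        "utf8": ch.encode("utf-8"),          # bytes as stored in a UTF-8 name
    }


for ch in samples:
    print(describe(ch))
```

All four characters have the digit value 4, but only the first is caught
by an ASCII-only test, and the other three occupy two or three bytes
each in UTF-8.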

 ...
> I see this as a small amount of work now to save more in distributed 
> problems later.  Thanks for your consideration.

You may be right: we may be opening a can of worms by extending the
character set and loosening the rules for netCDF names.

Applications and archives need not support the new character sets or
names.  If you avoid use of numeric names or non-ASCII Unicode
characters in names, previous versions of ncgen should continue to
work with your data.  If you specify that your applications or
archives require CF-compliant netCDF, for example, you should not
encounter problems introduced by the new, looser rules for names.
Postel's principle that you should "be liberal in what you accept and
conservative in what you send" applies here, as a guide for
application developers and data providers, respectively.

We will continue to try to maintain backward compatibility while
supporting current users, existing data archives, and new standards
and technologies.  As always, we welcome your feedback.

--Russ