Re: Seaspace satellite formats
- Subject: Re: Seaspace satellite formats
- Date: Mon, 01 May 1995 10:10:06 -0600
Susan,
> I got a note from Dave Johnson that Serafin has
> okayed the satellite receive station. Do you have
> any knowledge of the Terascan format they use...
> isn't it some variant of netCDF?
It's TDF, the TeraScan Data Format. It was developed by Joe Fahle, one of
the early contributors to the netCDF interface design. TDF and netCDF have
diverged quite a bit since then, with TDF not being constrained by the
necessity for a Fortran interface or use of XDR for storing data portably.
TDF is a more advanced interface in many ways, implementing
- a way to subset data without copying it and to reference data in other
files ("assemblies");
- a way to import ASCII or binary data from files in other formats without
copying or converting it ("instant import");
- built-in handling of time;
- built-in support for georeferencing using rectangular, Mercator, UTM,
polyconic, oblique stereographic, and polar stereographic projections;
- support for "relations" (ordered lists of variables) besides dimensions,
variables, and attributes;
- support for a "string" data type;
- built-in (required) attributes (e.g. units, badval, scale, offset, ...).
Here are some comments from Joe in recent correspondence with him:
> We have an hdf-to-tdf converter (and vice-versa). Dave Wilensky has
> done a lot of netcdf <-> tdf conversions at LSU.
>
> Because of the XDR issue, we probably cannot do an 'uninstantiated'
> import of a netcdf dataset, which would be the best solution.
>
> The converter approach pretty much is the way to go.
>
> Joe
>
> PS. Not having the XDR stuff has kept Terascan from running on PC's.
> I have programmed up a solution that takes care of this without using
> XDR's, but no questions, XDR's are the way to go for data portability
>
> As far as a C++ interface, that wouldn't take more than a day to whip
> that up. But I'll never go back to Fortran.
I've appended a man-page overview of TDF.
--Russ
DATASETS(7) TeraScan Overview DATASETS(7)
NAME
datasets - TeraScan common data format (TDF)
SYNOPSIS
lib/libcdf.a
DESCRIPTION
Introduction
Each TeraScan dataset is a separate UNIX file organized in
the TeraScan common data format (TDF). The TDF is an
extremely versatile file format that is capable of assimi-
lating a wide variety of data types, shapes and sizes. For
example, a single dataset could contain satellite image
data, random in-situ data, and 3-D model data.
The TDF was developed during the same period that NASA
developed the Common Data Format (CDF) [Treinish and Gough,
1987], and served as a basis for the UNIDATA Network Common
Data Format (netCDF) [Rew, 1988]. The TDF has been substan-
tially upgraded since then.
Dimensions, variables, relations and attributes are the
basic dataset components. Variables are simply arrays of
data; dimensions define the sizes of these arrays. Rela-
tions are ordered lists of variables. Attributes hold
information about the dataset as a whole, or about indivi-
dual variables, dimensions or relations. Only datasets,
variables, and relations can currently have application
defined attributes.
The following datatypes may be used to define variables and
attributes: byte, short, long, float, double, and string.
Codes and ranges for these datatypes are defined in
include/gp.h. String is a variable-width datatype, i.e.,
the number of bytes required to store one element is appli-
cation defined. Applications can implement a complex-valued
variable by adding an extra dimension of length 2 to the
variable.
Normally all dataset definitions and data are stored in a
single UNIX file. However, a dataset can reference variables
from several files using links. Links allow rapid import of
non-TDF data, and support lightweight dataset subsets and
assemblies.
Programming Interface
TDF access routines are independent of any other TeraScan
software component except lib/libutils.a. See dirfile(3),
misc(3), and terrno(3). Therefore, TDF applications can be
written without using TeraScan user interface or earth
TeraScan Last change: 1/13/93 1
transform facilities. TDF calls can be embedded in existing
non-TeraScan applications as desired.
TDF datatypes, constants, and error status codes are defined
in include/gp.h.
Object Pointers
The basic TDF objects are sets, dimensions, variables, and
relations. Application-defined attributes are not con-
sidered objects, even though they can be treated as such.
Files are secondary objects, and are of only passing concern
to applications.
Pointers to objects (actually object data structures) are
returned by search or definition functions. These pointers
are used as arguments to other functions. All data structures
have magic numbers and alignment criteria, which help
to identify bogus pointers. A pointer to an object's data
structure is "pinned" (i.e., can never change) until the
object is no longer available (i.e. the containing dataset
is closed).
All application accessible data structures exist in memory
that is allocated using UNIX malloc(). malloc is used sparingly
and in an unfragmented manner, so as not to impact
applications which also use (and possibly abuse) malloc.
Applications cannot be prevented from modifying data struc-
tures, even for datasets opened as readonly. Given this, it
was decided to let applications perform all operations
except variable I/O for readonly datasets, including defin-
ing new variables, relations, and attributes.
One obvious disadvantage of having application accessible
data structures is that applications will undoubtedly trash
them more easily than if they were hidden. All data struc-
ture components should be considered readonly, unless other-
wise specified.
Applications can loop through a list of similar objects
(e.g. all dimensions belonging to a dataset) using
while (pointer != NULL) pointer = pointer->next;
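As a minimal sketch of that loop, the fragment below walks a list of
dimensions and counts them. The structure shown is a simplified stand-in:
the real layouts live in include/gp.h and are not reproduced here, so
struct dim_sketch and count_dims are hypothetical names used only for
illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a TDF dimension. The real structure is
 * defined in include/gp.h and carries more fields than shown here. */
struct dim_sketch {
    const char        *name;
    long               size;
    struct dim_sketch *next;   /* next dimension in the dataset, or NULL */
};

/* Walk the list from its head (e.g. set->dim in the real interface)
 * and return the number of dimensions found. */
static int count_dims(const struct dim_sketch *dim)
{
    int n = 0;

    while (dim != NULL) {
        n++;
        dim = dim->next;
    }
    return n;
}
```

The same pattern would apply to variables, relations, and attributes,
each of which is described as having its own next pointer.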
Attributes
Attributes refine the definitions of datasets and their com-
ponents. There are two kinds of attributes:
- Built-in attributes, i.e., fields in application acces-
sible data structures (See include/gp.h)
- Application-defined attributes, created using the define
or copy attribute functions
Dimensions and files do not have application-defined attributes.
The only file attribute of any interest to applications is
file->path, which is built-in. Application-defined dimension
attributes may be added in the future.
Note that applications are free to change names directly and
potentially generate name conflicts within a dataset. This
is the least harmful of all the ways applications can damage
datasets.
Different objects can have attributes with the same name,
but with different datatypes or lengths. This new flexibil-
ity should be used cautiously; two attributes with different
meaning should never have the same name.
The following built-in attributes are intended for use by
applications; only those marked (*) can be set directly by
applications.
* dim->name       - dimension name
  dim->unlimited  - non-zero if dimension can grow
  dim->size       - current size
* dim->coord      - dimension coordinate
* dim->scale      - orig index = index * scale + offset
* dim->offset
* var->name       - variable name
* var->units      - units
  var->type       - datatype
* var->badval     - missing value as stored on disk
* var->usemin     - minimum valid stored value
* var->usemax     - maximum valid stored value
* var->scale      - true value = scale * stored value + offset
* var->offset
* rel->name       - relation name
* rel->kind       - relation kind (analogous to variable units)
* att->name       - attribute name
* att->units      - attribute units
  att->type       - datatype
  att->size       - number of elements in attribute
  file->path      - file path name
Application defined attributes are normally not accessed
like objects. Their values are set and retrieved by name,
rather than by pointer. Pointers to attribute definitions
are available for getting attribute datatype, lengths, and
units, as well as looping through lists of attributes.
Application Defined Relationships
The new abstraction, "relation", has been added to datasets.
A relation consists of an ordered list of variables all
belonging to the same dataset. Relations have built-in
attributes "name" and "kind", where relation kind is analo-
gous to variable units. Relations also can have
application-defined attributes. The number and order of the
variables associated by a relation, as well as its
application-defined attributes, are determined by its kind.
The following is an example of how relations can be used:
Given a variable "date" that contains an ordered list of
dates, a variable "year" that contains an ordered list of
years, and a variable "year_index" that is defined as fol-
lows:
year_index[i] = j if k > j => date[k] >= year[i]
define the relation "year_index" of kind "sparse_index",
consisting of the ordered tuple (date, year, year_index).
(Obviously, "date" and "year" must have the same units for
this to work.)
Builtin Relationships
The following relationships are built-in to application
accessible data structures; only those marked (*) can be
changed directly by applications:
var->dim[], var->ndims - variable has dimensions
rel->var[], rel->nvars - relation relates variables
* dim->var - a dimension can get its values from a variable
i.e., value corresponding to dim=i is var[i]
var->file - a variable's data is stored in a file
set->natts, set->att, att->next - dataset has attributes
var->natts, var->att, att->next - variable has attributes
rel->natts, rel->att, att->next - relation has attributes
set->ndims, set->dim, dim->next - a dataset has dimensions
set->nvars, set->var, var->next - a dataset has variables
set->nrels, set->rel, rel->next - a dataset has relations
firstset, set->next - a program has a list of datasets
dim->owner - a dimension belongs to a dataset
var->owner - a variable belongs to a dataset
rel->owner - a relation belongs to a dataset
att->owner - an attribute belongs to a dataset, variable,
or relation
Pointers are used to represent all built-in relationships.
Linked lists are used for all "has" relationships except
two: var->dim[] and rel->var[]. In both cases, these asso-
ciations are "many-to-many". Linked lists are impractical
due to multi-threading. Instead, variable dimensions and
relation variables are stored in arrays. The number of vari-
able dimensions is limited (e.g., GP_VAR_DIMS = 5). There
is no limit on the number of relation variables.
Some built-in relationships are circular; e.g. var->dim[]
and dim->var, or set->var and var->owner. Due to the
hierarchical nature of declarations in C, some of these
pointers have to be declared of type "char", which is unfor-
tunate.
Scaled Variable Data
In original TeraScan datasets, information for converting
8-bit or 16-bit data to real values was stored in applica-
tion defined scaling attributes. Now, scaling attributes
are built-in to all variables, regardless of datatype.
var->scale and var->offset are used to convert stored data
to its true form:
true value = var->scale * stored value + var->offset
Note, built-in attributes var->badval, var->usemin and var-
>usemax all refer to stored values. When presenting these
attributes to users, applications may want to apply scaling
to at least var->usemin and var->usemax.
The most common use of scaling is to store real-valued data
with a minimum yet appropriate number of significant bits.
However, scaling can be used to help change variable units
without changing actual data; e.g., to change from degrees
Celsius to degrees Fahrenheit:
gpputname(var->units, C_FAHRENHEIT);
var->scale *= 1.8;
var->offset = var->offset * 1.8 + 32.;
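The arithmetic behind that unit change can be checked with a small
sketch. It assumes a stripped-down structure holding only scale and
offset (struct var_sketch, true_value, and celsius_to_fahrenheit are
hypothetical names, not part of the TDF interface). Since
F = 1.8 * C + 32 and C = scale * stored + offset, any existing offset
has to be multiplied by 1.8 as well before 32 is added.

```c
/* Hypothetical subset of the built-in variable attributes; the real
 * variable structure is defined in include/gp.h. */
struct var_sketch {
    double scale;
    double offset;
};

/* true value = scale * stored value + offset */
static double true_value(const struct var_sketch *var, double stored)
{
    return var->scale * stored + var->offset;
}

/* Re-express a Celsius variable in Fahrenheit without touching the
 * stored data: F = 1.8 * (scale * stored + offset) + 32. */
static void celsius_to_fahrenheit(struct var_sketch *var)
{
    var->scale  = var->scale * 1.8;
    var->offset = var->offset * 1.8 + 32.0;
}
```

For a variable with scale 0.5 and offset -10, a stored value of 40 is
10 degrees Celsius; after the conversion the same stored value reads as
50 degrees Fahrenheit.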
Another benefit of builtin scaling is that it allows appli-
cations to pretend they are working with a single type of
data: double precision. Variable read and write routines,
that respectively scale and unscale data, are provided as
part of the standard interface. This does not preclude the
writing of applications that treat each type of variable
differently.
Dimension Coordinates
Applications may use the coord, scale, and offset builtin
dimension attributes to relate different dimensions. For
example, if two dimensions have the same coord attribute,
applications may choose to decide that the two dimensions
are parallel. The scale and offset attribute can then be
used to determine the exact correspondence between the two
dimensions, assuming that correspondence is linear.
Coordinate types GP_X_COORD, GP_Y_COORD, GP_Z_COORD,
GP_TIME_COORD, and GP_NO_COORD are defined in include/gp.h
for this purpose. Applications are not restricted to these
coordinate types.
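A sketch of how an application might use these attributes to translate
an index in one dimension into the corresponding index in a parallel
one, using the relation orig index = index * scale + offset. All names
here are hypothetical (the real GP_*_COORD values and dimension
structure are in include/gp.h), and treating dimensions without a
coordinate type as never parallel is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical coordinate codes standing in for the GP_*_COORD values. */
enum { SKETCH_NO_COORD = 0, SKETCH_X_COORD = 1, SKETCH_Y_COORD = 2 };

/* Hypothetical subset of the built-in dimension attributes. */
struct dim_coord_sketch {
    int    coord;
    double scale;
    double offset;
};

/* orig index = index * scale + offset */
static double orig_index(const struct dim_coord_sketch *dim, double index)
{
    return index * dim->scale + dim->offset;
}

/* Map an index along `from` to the matching index along `to`.
 * Returns 0 and stores the result on success, or -1 if the two
 * dimensions are not parallel or `to` has a zero scale. */
static int map_index(const struct dim_coord_sketch *from,
                     const struct dim_coord_sketch *to,
                     double index, double *result)
{
    if (from == NULL || to == NULL || result == NULL)
        return -1;
    if (from->coord != to->coord || from->coord == SKETCH_NO_COORD)
        return -1;
    if (to->scale == 0.0)
        return -1;
    *result = (orig_index(from, index) - to->offset) / to->scale;
    return 0;
}
```

With a full-resolution dimension (scale 1, offset 0) and a subsampled
one (scale 4, offset 2), index 3 of the subsampled dimension corresponds
to index 14 of the full-resolution dimension.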
Unlimited (Growing) Dimensions
Unlimited dimensions can be defined using a size of
GP_UNLIMITED, found in include/gp.h. The following guide-
lines apply when working with datasets with unlimited dimen-
sions:
- Only one dimension in a dataset can be growing; defining
a second unlimited dimension will fix the size of the
former growing dimension.
- If a variable is defined with a growing dimension, that
dimension must be the variable's leading dimension.
- All variables to be defined with an unlimited leading
dimension must be defined prior to writing any data
corresponding to that dimension. The size of the unlim-
ited dimension will be fixed at the point where the new
variable is defined.
Cloning Objects
Cloning an object refers to the process of creating a like
object with the same attributes, optionally with a new name.
When a variable is cloned, the new variable is created with
the same named dimensions. These dimensions must exist in
the output dataset, but do not have to have the same sizes
as the corresponding dimensions of the original variable.
Similarly, when a relation is cloned, the new relation is
created, associating the same named variables.
When a dimension is cloned, its corresponding variable (if
one is defined) is not carried over to the new dimension.
This would present a chicken and egg problem, because the
dimension could not be created without the variable, and the
variable could not be created without the dimension.
Definitions vs. Variable Data
Everything about a dataset with the exception of variable
data is maintained in virtual memory until the dataset is
closed or synced. If a dataset is opened for read access
and then is closed, nothing is written to disk regardless of
whether the application changed attribute values or defined
new objects.
If a dataset is opened with write access and then is closed,
all object definitions and attributes are saved to disk.
Saving definition and attribute changes can be suppressed by
aborting the dataset rather than closing it.
However, changes to variable data occur at the whim of the
underlying file system. Variable data is not maintained in
virtual memory, but is written directly to the file system.
Aborting a dataset in the midst of writing variable
data will leave the dataset in an undefined, probably
unreadable state.
TeraScan datasets support random hypercube access to variable
data. A hypercube is defined by a starting 0-relative
coordinate (i1,i2,...) and a cube size (n1,n2,...). Variable
indexing is similar to array indexing under C; i.e.,
the index of the last dimension is the fastest moving.
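To make the indexing convention concrete, here is a sketch that
extracts a hypercube from an in-memory array whose last dimension moves
fastest. It is not the TDF read routine (see gpio(3) for the real
interface); linear_offset, read_hypercube, and SKETCH_MAX_DIMS are
hypothetical names, and the data is assumed to be fully contiguous
doubles.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DIMS 5   /* mirrors the GP_VAR_DIMS = 5 example */

/* Linear element offset of a 0-relative index tuple in an array whose
 * last dimension varies fastest (C ordering). */
static size_t linear_offset(int ndims, const size_t *dims,
                            const size_t *index)
{
    size_t off = 0;
    int d;

    assert(ndims > 0 && ndims <= SKETCH_MAX_DIMS);
    for (d = 0; d < ndims; d++) {
        assert(index[d] < dims[d]);
        off = off * dims[d] + index[d];
    }
    return off;
}

/* Copy the hypercube with origin start[] and extent count[] from src
 * into the packed buffer dst. Returns the number of elements copied. */
static size_t read_hypercube(const double *src, int ndims,
                             const size_t *dims, const size_t *start,
                             const size_t *count, double *dst)
{
    size_t idx[SKETCH_MAX_DIMS];
    size_t pos[SKETCH_MAX_DIMS] = {0};
    size_t n = 0, total = 1;
    int d;

    assert(ndims > 0 && ndims <= SKETCH_MAX_DIMS);
    for (d = 0; d < ndims; d++) {
        assert(start[d] + count[d] <= dims[d]);
        total *= count[d];
    }
    while (n < total) {
        for (d = 0; d < ndims; d++)
            idx[d] = start[d] + pos[d];
        dst[n++] = src[linear_offset(ndims, dims, idx)];
        /* Advance like an odometer, last dimension fastest. */
        for (d = ndims - 1; d >= 0; d--) {
            if (++pos[d] < count[d])
                break;
            pos[d] = 0;
        }
    }
    return n;
}
```

For a 3 x 4 array holding 0 through 11, the hypercube starting at (1,1)
with size (2,2) contains the elements 5, 6, 9, and 10, in that order.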
Link Subsets and Assemblies
Any array data that can support random hypercube access can
be linked to a TDF variable. For example, data for a vari-
able or variable hypercube in one TeraScan dataset can be
linked to a variable in another (or the same) TeraScan
dataset. This link mechanism allows data from one or more
datasets to be linked to a single dataset without instantiation,
i.e., without moving any data around.
The following TeraScan applications take advantage of this
link mechanism.
subset    Creates a variable and/or dimension subset of input
          datasets.
assemble  Gathers selected variables from input datasets into a
          single output dataset.
burst     Slices variables along any dimension, creating link
          variables for each of the slices.
impbin    Imports structured array data from non-TDF files.
This link mechanism is similar to the UNIX facility for
creating symbolic file links. One drawback of using links
is that links can be orphaned. If data in file X is linked
to a variable V in dataset A, and then X is removed, then
the link variable V is orphaned.
As a special case, a NULL file can be linked to a TDF variable.
In this case, all stored values for the variable are
assumed to be 0.
Automatic Uncompression
Datasets that have been compressed using the UNIX compress
function can be uncompressed automatically by TeraScan.
TeraScan uses the UNIX zcat function to uncompress datasets,
redirecting the output to the scratch directory defined by
the environment variable UNCOMPRESSDIR. If UNCOMPRESSDIR is
undefined, uncompression is not attempted.
A list of automatically uncompressed files is kept in the
Registry file in the UNCOMPRESSDIR. This file is ASCII but
is not intended to be edited. For each automatically
uncompressed file, the following information is shown: true
path name of original, full path name of uncompressed copy,
last modification time of original in seconds, and the max
idle time in seconds.
Idle time is defined to be the difference between the current
time and the last access time of the original. The environ-
ment variable UNCOMPRESSIDLE specifies the maximum idle time
in minutes for automatically uncompressed files. If
UNCOMPRESSIDLE is not set, the maximum idle time is assumed
to be 60 minutes. Different files can have different max-
imum idle times.
The environment variable UNCOMPRESSMAX specifies the maximum
space in megabytes to be allocated in the UNCOMPRESSDIR for
automatically uncompressed files. If UNCOMPRESSMAX is not
set, the maximum is assumed to be 10 megabytes. This max-
imum is only a rough limit; see the algorithm outlined
below:
Given input compressed file F:
- If UNCOMPRESSDIR is not defined, can't uncompress F.
- If F is in Registry, F's last modification time matches
what's in the Registry, and F's uncompressed copy still
exists, use it.
- Delete all entries in Registry if original no longer
exists, original's last modification time does not
match Registry, uncompressed copy does not exist, or
idle time (i.e., current time - last access time of
original) exceeds the max idle time.
- While the total space occupied by uncompressed copies
plus the size of F (not its uncompressed copy!) exceeds
UNCOMPRESSMAX, delete the entry in Registry closest to
exceeding its max idle time.
- Uncompress F and put it in the Registry, setting its
max idle time to UNCOMPRESSIDLE.
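The idle-time and space-limiting steps of the algorithm above can be
sketched over an in-memory table. This is only an illustration under
stated assumptions: the Registry's on-disk layout is not described
here, so struct cache_entry and the three functions are hypothetical,
and "closest to exceeding its max idle time" is read as the smallest
remaining margin (max idle time minus current idle time).

```c
#include <stddef.h>

/* Hypothetical in-memory view of one Registry entry. */
struct cache_entry {
    long copy_bytes;     /* space used by the uncompressed copy */
    long idle_secs;      /* now - last access time of the original */
    long max_idle_secs;  /* limit recorded when the file was uncompressed */
    int  live;           /* nonzero while the entry is still registered */
};

/* Drop every entry whose idle time exceeds its maximum.
 * Returns the number of entries dropped. */
static int drop_idle(struct cache_entry *e, int n)
{
    int i, dropped = 0;

    for (i = 0; i < n; i++) {
        if (e[i].live && e[i].idle_secs > e[i].max_idle_secs) {
            e[i].live = 0;
            dropped++;
        }
    }
    return dropped;
}

/* Total space occupied by the uncompressed copies still registered. */
static long live_bytes(const struct cache_entry *e, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        if (e[i].live)
            total += e[i].copy_bytes;
    return total;
}

/* Evict entries until the live copies plus the size of the incoming
 * compressed file fit under max_bytes, or nothing is left to evict.
 * Returns the number of entries evicted. */
static int make_room(struct cache_entry *e, int n,
                     long incoming_bytes, long max_bytes)
{
    int evicted = 0;

    while (live_bytes(e, n) + incoming_bytes > max_bytes) {
        int i, victim = -1;

        for (i = 0; i < n; i++) {
            if (!e[i].live)
                continue;
            if (victim < 0 ||
                e[i].max_idle_secs - e[i].idle_secs <
                e[victim].max_idle_secs - e[victim].idle_secs)
                victim = i;
        }
        if (victim < 0)
            break;              /* limit is only a rough one */
        e[victim].live = 0;
        evicted++;
    }
    return evicted;
}
```

Because the check uses the size of the compressed input rather than of
its uncompressed copy, the space actually used can end up above the
limit, which matches the description of UNCOMPRESSMAX as a rough limit.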
Hard Limits
There are currently only two hard limits for TeraScan
datasets: length of names and number of variable dimensions.
The name length limit applies not only to names, but to such
built-in attributes as var->units and rel->kind. Arbitrary
name lengths were not implemented for the following reasons:
- Applications are invariably written assuming a maximum
name length, which may as well be constant across custo-
mer sites.
- If names have unlimited length, built-in attributes
var->units and rel->kind also would have unlimited
length.
- Unlimited length names mean more extensive use of mal-
loc, which has been avoided.
Error Handling
Pipeline processing applications, interactive display appli-
cations, and application subsystems (e.g., TeraScan earth
transform) have very different error handling requirements:
- Pipeline processing applications typically take a very
brutal approach to errors; i.e., abort!
- Interactive display applications must always return control
to the user, even on such show-stopping errors as running
out of disk space or memory.
- Application subsystems must always return control to the
application after converting lower level error codes
into higher level ones (e.g., no such attribute =>
dataset does not have earth location).
In order to support these different cases, a switchable
error handler is used by all the dataset interface routines.
(See CALLING SEQUENCES.) An application subsystem can switch
its own error handler in and out several times while an
application is running.
The default error handler simply sets the TeraScan global
variable terrno to the appropriate error code. In addition
to UNIX file open and memory allocation errors, the follow-
ing errors may be encountered. These error codes are
defined in include/gp.h.
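The switchable-handler idea can be sketched with a function pointer
that a subsystem swaps in and later restores. Every name below is
hypothetical (the real calls are documented in gperr(3) and the real
error codes in include/gp.h); sketch_errno merely stands in for the
global terrno.

```c
#include <stddef.h>

typedef void (*err_handler_fn)(int code);

/* Hypothetical error codes for illustration only. */
enum { ERR_NO_SUCH_ATT = 1, ERR_NO_EARTH_LOCATION = 100 };

static int sketch_errno = 0;      /* stands in for terrno */
static int subsystem_status = 0;  /* higher-level status of a subsystem */

/* Default behavior: just record the error code. */
static void default_handler(int code)
{
    sketch_errno = code;
}

static err_handler_fn current_handler = default_handler;

/* Install a new handler and return the previous one so the caller
 * can switch it back in later. NULL selects the default handler. */
static err_handler_fn switch_handler(err_handler_fn handler)
{
    err_handler_fn old = current_handler;

    current_handler = (handler != NULL) ? handler : default_handler;
    return old;
}

/* What an interface routine would call when something goes wrong. */
static void raise_error(int code)
{
    current_handler(code);
}

/* A subsystem handler that converts a lower-level code into a
 * higher-level one and leaves control with the caller. */
static void subsystem_handler(int code)
{
    subsystem_status = (code == ERR_NO_SUCH_ATT) ? ERR_NO_EARTH_LOCATION
                                                 : code;
}
```

A subsystem would call switch_handler(subsystem_handler) on entry, keep
the returned pointer, and pass it back to switch_handler on exit; a
pipeline application could instead install a handler that aborts.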
FILES
include/gp.h, lib/libcdf.a, lib/libutils.a,
/usr/include/errno.h
SEE ALSO
gpatt(3), gpdim(3), gperr(3), gpio(3), gplink(3), gpname(3),
gprel(3), gpset(3), gptype(3), gpvar(3), dirfile(3),
misc(3), terrno(3), open(2), close(2), malloc(3),
compress(1)

One of the strong points of the TDF and its programming
interface is that applications do not depend on the
physical layout of data on disk. The physical layout of a
typical dataset is as follows:
- dataset header of 644 bytes (historical)
- data for non-link variables
- file descriptions for link variables
- dataset attributes
- dimension descriptions
- variable descriptions and attributes
- relation descriptions and attributes
The start of data for a given variable is defined by
var->datastart. Data for non-link variables is guaranteed
either to be completely contiguous or row-wise contiguous.
The ith row of array A is defined to be all elements of A
with leading index i. The distance between rows is
var->dimdist[0].