We are having issues with NcML dataset aggregation caching
and would like to report some of our findings.
We are hosting datasets of daily climate predictions. There are
4 models, each with output for 2 scenarios, and each scenario has
3 variables (precip, temp_min and temp_max). Each file holds a
single variable for a 10-year period. All datasets use the same
time units, days since 12-31-1959. Each scenario contains ~120 GB
of data stored in NetCDF 3 format following the CF gridded
dataset convention. We've been tasked with presenting the data
for each model/scenario pair as a single dataset.
For clarity, here is how we are representing this data on
disk (1 model and 1 scenario, all variables and time periods):
temp_min/
  model1.scenario1.temp_min.1960.1969.nc
  model1.scenario1.temp_min.1970.1979.nc
  ...
  model1.scenario1.temp_min.2090.2099.nc
temp_max/
  model1.scenario1.temp_max.1960.1969.nc
  model1.scenario1.temp_max.1970.1979.nc
  ...
  model1.scenario1.temp_max.2090.2099.nc
Attempt #1:
Description: Aggregate as a single NcML file with internal
nested aggregations. Assume the file's path is
/data/model1.scenario1.internal.ncml
=== /data/model1.scenario1.internal.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/precip/" suffix=".nc"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/temp_min/" suffix=".nc"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/temp_max/" suffix=".nc"/>
      </aggregation>
    </netcdf>
  </aggregation>
</netcdf>
======
Observation: No caching. The same behavior occurs whether
using <scan /> or multiple explicit <netcdf
location="..." /> elements.
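For reference, the explicit-member variant of one leaf aggregation (temp_min) would look roughly like the fragment below. This is only a sketch: the locations are assumed from the directory layout and file names described earlier, and the decades between the listed files are elided rather than filled in.

```xml
<netcdf>
  <aggregation type="joinExisting" dimName="time">
    <!-- Explicit members in place of <scan/>; paths assumed from the on-disk layout -->
    <netcdf location="model1.scenario1/temp_min/model1.scenario1.temp_min.1960.1969.nc"/>
    <netcdf location="model1.scenario1/temp_min/model1.scenario1.temp_min.1970.1979.nc"/>
    <!-- ... remaining 10-year files elided ... -->
    <netcdf location="model1.scenario1/temp_min/model1.scenario1.temp_min.2090.2099.nc"/>
  </aggregation>
</netcdf>
```

This fragment takes the place of one of the three nested <netcdf> elements inside the union aggregation; the precip and temp_max leaves follow the same pattern.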
Investigation: Each nested (leaf) aggregation (for precip,
temp_min or temp_max) triggers a cache read attempt on the
same cache file, named
file-data-model1.scenario1.internal.ncml#null.
If this file exists, it is most likely unusable because the
netcdf dataset ids stored in the cache will not match those of
the aggregation being read. This happens because every nested
aggregation uses the same cache file name: as each one is
processed, it overwrites the cached result of the previous one.