Hey John,
Regarding Attempt 1 (internally nested aggregations): I think the root cause is in ucar.nc2.ncml.NcMLReader.NcmlElementReader (a private class). The constructor takes an ncmlLocation and a location. I think 'location' is supposed to be the attribute from <netcdf location="..." />, but for these internally nested datasets the location is null. I can't see how to generate a unique id that stays stable across different runtimes.
public class NcMLReader {
  [snip]
  private class NcmlElementReader {
    [snip]
    public NetcdfFile open(String cacheName, int buffer_size, CancelTask cancelTask, Object spiObject) throws IOException {
      if (debugAggDetail)
        System.out.println(" NcmlElementReader open nested dataset " + cacheName);
      NetcdfFile result = readNcML(ncmlLocation, location, netcdfElem, cancelTask);
      result.setLocation(ncmlLocation + "#" + location);
      // maybe could be like this, but 'id' attribute referenced below is not required...
      // if (location != null) {
      //   result.setLocation(ncmlLocation + "#" + location);
      // } else {
      //   result.setLocation(ncmlLocation + "#" + netcdfElem.getAttribute("id").getValue());
      // }
      return result;
    }
    [snip]
  }
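One possible way to get a name that is stable across runs, sketched here under assumptions: use 'location' when it is set, fall back to the optional 'id' attribute, and if both are missing, hash the serialized XML of the nested <netcdf> element. The class NestedCacheNames and its methods are hypothetical and not part of netcdf-java; the caller is assumed to supply the element's serialized text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, NOT part of netcdf-java.
final class NestedCacheNames {

  // Returns "<ncmlLocation>#<suffix>", where suffix is the first available of:
  // the location attribute, the id attribute, or a SHA-1 hash of the element XML.
  static String cacheName(String ncmlLocation, String location, String id, String elementXml) {
    String suffix;
    if (location != null) {
      suffix = location;
    } else if (id != null) {
      suffix = id;
    } else {
      // Same element text gives the same hash on every run.
      suffix = sha1Hex(elementXml);
    }
    return ncmlLocation + "#" + suffix;
  }

  // Hex-encoded SHA-1 digest of the UTF-8 bytes of the text.
  static String sha1Hex(String text) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 must be supported by every Java platform implementation.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String ncml = "/data/model1.scenario1.internal.ncml";
    String precip = "<netcdf><aggregation type=\"joinExisting\" dimName=\"time\">"
        + "<scan location=\"model1.scenario1/precip/\" suffix=\".nc\"/></aggregation></netcdf>";
    System.out.println(cacheName(ncml, null, null, precip));
  }
}
```

The hash only changes when the nested element's text changes, which is arguably when the cached result should be invalidated anyway. Two nested elements with identical text would still collide, so this is a sketch of the idea rather than a complete fix.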
For Attempt 2 (externally referenced nested aggregations), the cache name arguably shouldn't change between the cache read and write attempts. It would be nice if the cache could be reused whether the file is referenced as a stand-alone NcML file or from another NcML file. I couldn't track down where the effective cacheName (result.getLocation()) was being changed...
Another oddity: if one uses "timeUnitsChange", the benefits of caching are somewhat negated (I guess you save one read). The code goes to disk again, and the calculation needed for "timeUnitsChange" is not cached. This might be something to look at...
Tom

On Mar 22, 2011, at 8:05 AM, John Caron wrote:
Hi Tom:
Hmm, that's weird. I didn't realize that I wasn't generating unique
names for these internal NcML datasets. I'll see if I can fix this.
I'm wondering if these files can be opened as grid datasets? If so,
then can you use FMRC aggregation? Even better is FMRC feature
collection. The intention is for this to replace the older
aggregation types, which are, um, clumsy.
In FMRC feature collection, you just name the entire collection of
files and it's supposed to figure everything out. These nested
aggregations are nasty.
John
On 3/21/2011 3:47 PM, Tom Kunicki wrote:
Hi John,
We are having issues with NcML dataset aggregation caching
and would like to report some of our findings.
Background:
We are attempting to host some datasets representing daily
climate predictions. We have datasets from 4 models, each with
the output of 2 scenarios and 3 variables per scenario (precip,
temp_min and temp_max). Each file contains a single variable
with data for a 10-year period. All datasets use the same time
units, days since 12-31-1959.
Each scenario contains ~120 GB of data stored in
NetCDF 3 format using a CF gridded dataset convention. We've
been tasked with presenting the data for each model/scenario
pair as a single dataset.
For clarity, here is how we are representing this data on
disk (1 model and 1 scenario, all variables and time periods):
model1.scenario1/
  precip/
    model1.scenario1.precip.1960.1969.nc
    model1.scenario1.precip.1970.1979.nc
    ...
    model1.scenario1.precip.2090.2099.nc
  temp_min/
    model1.scenario1.temp_min.1960.1969.nc
    model1.scenario1.temp_min.1970.1979.nc
    ...
    model1.scenario1.temp_min.2090.2099.nc
  temp_max/
    model1.scenario1.temp_max.1960.1969.nc
    model1.scenario1.temp_max.1970.1979.nc
    ...
    model1.scenario1.temp_max.2090.2099.nc
Attempt #1:
Description: Aggregate as a single NcML file with internal
nested aggregations, let's assume the file's path is
/data/model1.scenario1.internal.ncml
=== /data/model1.scenario1.internal.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/precip/" suffix=".nc"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/temp_min/" suffix=".nc"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation type="joinExisting" dimName="time">
        <scan location="model1.scenario1/temp_max/" suffix=".nc"/>
      </aggregation>
    </netcdf>
  </aggregation>
</netcdf>
======
Observation: No caching. The same behavior occurs whether
using <scan /> or multiple explicit <netcdf
location="..." /> elements.
Investigation: Each nested/leaf aggregation (for precip,
temp_min or temp_max) results in a cache read attempt on a
cache file named file-data-model1.scenario1.internal.ncml#null.
If this file exists it is most likely unusable as the
netcdf cache dataset ids will not match. This is a result of
each nested/leaf cache utilizing the same file name. As each
nested/leaf aggregation is processed it overwrites the cached
result of the prior nested/leaf aggregation.
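The overwrite described above can be reproduced in isolation. The sketch below reuses only the naming expression ncmlLocation + "#" + location quoted earlier in the thread; the class NullLocationCollision and the in-memory map standing in for the disk cache are assumptions made for illustration, and the library's mangling of the path into a "file-data-..." file name is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, NOT part of netcdf-java.
final class NullLocationCollision {

  // Same expression as in NcmlElementReader.open(): a null location
  // is concatenated as the literal text "null".
  static String nestedName(String ncmlLocation, String location) {
    return ncmlLocation + "#" + location;
  }

  // Stores one cache entry per nested aggregation in a map that stands in
  // for the cache directory. Every nested <netcdf> has a null location.
  static Map<String, String> persistAll(String ncmlLocation, String[] variables) {
    Map<String, String> cache = new LinkedHashMap<>();
    for (String variable : variables) {
      cache.put(nestedName(ncmlLocation, null), "cached joinExisting for " + variable);
    }
    return cache;
  }

  public static void main(String[] args) {
    String[] variables = {"precip", "temp_min", "temp_max"};
    Map<String, String> cache = persistAll("/data/model1.scenario1.internal.ncml", variables);
    // Three writes, but only one surviving entry: the last one written.
    System.out.println(cache);
  }
}
```

All three writes land on the same key ending in "#null", so only the temp_max entry survives, which matches the mismatched dataset ids seen on the next read.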
Attempt #2:
Description: Aggregate with multiple NcML files with
nested aggregations contained in separate files.
=== /data/model1.scenario.external.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <netcdf id="precip" location="model1.scenario.precip.ncml"/>
    <netcdf id="temp_min" location="model1.scenario.temp_min.ncml"/>
    <netcdf id="temp_max" location="model1.scenario.temp_max.ncml"/>
  </aggregation>
</netcdf>
=== /data/model1.scenario.precip.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinExisting" dimName="time">
    <scan location="model1.scenario1/precip/" suffix=".nc"/>
  </aggregation>
</netcdf>
======
Observation: No caching.
Investigation: The caching code attempts to read file-data-model1.scenario1.precip.ncml
but then writes to file-data-model1.scenario1#file-data-model1.scenario1.precip.ncml.
You can rename the created cache files, or generate a
cache by hitting the external NcML files individually, but unless
you catch this your performance will suffer. Somehow the cache
name of the instance changes in between
AggregationExisting.persistRead() and persistWrite(). The
same object instance is used for both of these calls, but
somewhere the cache name changes...
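One way to rule out this kind of drift, sketched under assumptions: compute the cache key once when the aggregation object is created and use that fixed key for both the read and the write, regardless of later changes to the dataset location. The class PinnedCacheKey is hypothetical and does not reflect how AggregationExisting is actually implemented; the map stands in for the cache directory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, NOT the netcdf-java implementation.
final class PinnedCacheKey {
  private final String cacheKey; // fixed at construction, never reassigned
  private String location;       // may be rewritten later, e.g. by an enclosing NcML

  PinnedCacheKey(String location) {
    this.location = location;
    this.cacheKey = location;
  }

  void setLocation(String location) {
    this.location = location; // does not touch cacheKey
  }

  String getLocation() {
    return location;
  }

  // Both persist methods use the pinned key, so a write is always
  // findable by a later read for the same original location.
  String persistRead(Map<String, String> cache) {
    return cache.get(cacheKey);
  }

  void persistWrite(Map<String, String> cache, String payload) {
    cache.put(cacheKey, payload);
  }

  public static void main(String[] args) {
    Map<String, String> cache = new HashMap<>();
    PinnedCacheKey agg = new PinnedCacheKey("/data/model1.scenario1.precip.ncml");
    System.out.println("first read: " + agg.persistRead(cache));
    agg.setLocation("/data/outer.ncml#/data/model1.scenario1.precip.ncml");
    agg.persistWrite(cache, "joinExisting coordinate values");
    System.out.println("second read: " + agg.persistRead(cache));
  }
}
```

Pinning the key this way would also let the same cache entry be reused whether the NcML file is opened stand-alone or through an enclosing file, since the key no longer depends on how the dataset was reached.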
Tom Kunicki
Center for Integrated Data Analytics
U.S. Geological Survey
8505 Research Way
Middleton, WI 53562