This is great info, Mike! I've just been testing Union vs. joinExisting, and I really appreciate hearing about your strategy for large datasets. Fortunately, my largest netCDF datasets are static, so I plan on using your suggestions and getting things set up that way. Thanks again!! -kevin

On 10/28/13 11:48 AM, Michael McDonald wrote:
Kevin,

> I have been triggering this initial scan by clicking on the services for
> the aggregated dataset. Is there another way to perform the initial
> indexing of netCDF aggregations (like is done with GRIB Collections)
> besides clicking on a service link?

We trigger all of our initial catalog scans via ongoing Nagios (http://www.nagios.org/) queries that check the most frequently accessed datasets (really, you only need to query the datasets that change, i.e., forecast datasets and the large aggregations). We set the Nagios queries to very high timeout values (5-10 minutes) and then just let them run normally. We occasionally get false positives from this when the Tomcat server is reset/synchronized on a daily basis. All of the other misc datasets get triggered by users when requested; these misc/smaller datasets are usually quick to scan/generate on the fly.

All of your static datasets should have the "recheckEvery" value *excluded* from their catalog files. That way, once the cache/agg file is created, it will only be removed when the NetcdfFileCache scour value elapses. This is a tricky balance to get right; we are still trying to fine-tune it on our servers.

> Also, I assume that the scouring of NetcdfFileCache would not remove this
> index file from cache/agg, correct? Otherwise users would be in for a long
> wait each time they click on an aggregated service. According to
> http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html,
> the cache/agg dir is only for joinExisting. I'm trying to use Union right now.

Assume that anything in the cache/agg folder is fair game for removal/scrubbing: everything and anything in cache/agg older than the scour value will be deleted! We were testing a btsync between our two THREDDS servers, and this Tomcat scour was deleting dot-files/folders unrelated to THREDDS. So we now do our sync one directory level higher ("cache") and exclude all directories except the "agg" folder.
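For reference, the cache/agg scour behavior discussed above is controlled in threddsConfig.xml. A minimal sketch, assuming a TDS 4.x-style configuration (the interval values shown here are illustrative, not Mike's actual settings):

```xml
<!-- threddsConfig.xml fragment: controls scouring of the cache/agg directory -->
<AggregationCache>
  <!-- how often the scour task runs -->
  <scour>24 hours</scour>
  <!-- cached aggregation index files older than this are deleted;
       set to multiple days for static datasets to avoid re-scans -->
  <maxAge>30 days</maxAge>
</AggregationCache>
```

As the thread warns, the scour removes anything in cache/agg older than maxAge, so raising maxAge trades disk space for fewer expensive initial scans.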
If your dataset does not change and you want it to stay cached for a while (avoiding the initial scan), then you need to set the NetcdfFileCache scour value to multiple days. Make sure you have plenty of disk space for the cache/agg folder, since all other datasets will now be cached for much longer. That said, all of our catalogs in cache/agg typically occupy less than 25 MB of space; the real cache consumer is NCSS (which has a separate scour value/schedule)!

I don't think unions are stored in cache/agg. The best test is to look in this folder for a file resembling the dataset name; inspect the file and note its size, timestamp, and contents. Nearly all of our aggregations are nested joinExisting (like variables) + union (top level). I see all of the joinExisting cache files in this cache/agg folder, but zero files with the "union" type.

Are you sure you should be performing a union on this dataset rather than a joinExisting (time series data)? What we do is many small/manageable joinExisting scans of like data, then a union at the top level of these netCDF datasets. This way all of the components get cached, and the top-level union is simply a combination of the cached data (see the latest.xml attachment). This idea was in one of the advanced THREDDS examples (or on the forum), and it has helped significantly reduce our initial scan times.

/mike
--
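The latest.xml attachment Mike refers to is not included in the archive, but the nested joinExisting-under-union pattern he describes typically looks like this in NcML (the scan locations and variable groupings below are hypothetical, for illustration only):

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- top-level union combines the already-cached component aggregations -->
  <aggregation type="union">
    <!-- each inner joinExisting scans one group of "like" variables along time;
         its index file is what appears in cache/agg -->
    <netcdf>
      <aggregation dimName="time" type="joinExisting">
        <scan location="/data/model/temperature/" suffix=".nc"/>
      </aggregation>
    </netcdf>
    <netcdf>
      <aggregation dimName="time" type="joinExisting">
        <scan location="/data/model/winds/" suffix=".nc"/>
      </aggregation>
    </netcdf>
  </aggregation>
</netcdf>
```

Because only the joinExisting layers produce cache/agg index files, keeping them small and splitting by variable group is what makes the initial scan manageable; the union itself is cheap once its components are cached.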
Kevin Manross
NCAR/CISL/Data Support Section
Phone: (303) 497-1218
Email: address@hidden
Web: http://rda.ucar.edu
_______________________________________________
thredds mailing list
address@hidden
For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/