This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi Benno,

> I have a question for you, something inadequately tested, I am afraid.
>
> I am reading MODIS tiles from the original HDF files using netcdf 4.1.2 32bit. There is a square array (18x36) of tiles, of which 317 are present in the 1km resolution data, and somewhat fewer in the 250m resolution.
>
> I have numbers for the 1km dataset. When I roughly profiled my data processing program, I got
>
>   Run             open     DOmultiple   PScolor
>   profile2        21.0s    22.5s        18.6s
>   profile2a        7.4s    20.7s        18.5s
>   profile2ms      64.8s    20.4s        21.3s
>   profile2ams      7.2s    18.2s        19.6s
>   profile2cache    0        0           17.4s
>
> The runs are in pairs, with the 2a runs immediately after the 2 runs, to separate out disk access time (1 variable with all the 1km tiles decompressed is about 1.7 GB, which is also roughly the total file size, compressed with 317/648 tiles). There are multiple fields in the files, and I am only reading one field.
>
> The "open" routine runs netcdf open, the DOmultiple routine reads the data into arrays, and PScolor does the processing, irrelevant to this question.
>
> 1) "Opening the files" takes 21s (directly attached disk) and 65s (NFS mounted) the first time, 7s the second time, i.e. from memory. "DOmultiple" always takes the same amount of time, indicating that there are no disk accesses during the reads, only during the "opens". So it looks like the open hits all the disk blocks needed to read the first field at least, if not more. I assume DOmultiple is doing the decompression, which on my machine is slow.
>
> The last line is reading the uncompressed data with tiles sequentially arranged in a pair of binary files -- it takes about 6s the first time, 2s from memory. Much faster, though obviously much more disk space.
>
> I presume 250m will be 16x slower, though I have not run the numbers yet.
>
> So my questions are
>
> 1) Is the open reading the whole file, or at least bytes from all the blocks in the file? Clearly this is a weird case for netcdf, with so many files and only reading a record from each.

No, but a netCDF open does read all the metadata in a file (the file schema, with dimension, variable, and attribute definitions and attribute values). The metadata is kept in memory as long as the file is open. This is in contrast to the HDF5 library, which reads metadata only as required when data is accessed ("lazy" evaluation). As a result, netCDF takes significantly longer than HDF5 to initially open a file that has a lot of metadata, but netCDF is often faster than HDF5 at subsequent accesses to metadata and to data that requires associated metadata.

> 2) Are there linking choices to improve the decompression performance?

Decompression happens the first time a compressed variable is read. If the variable is stored using chunking, only the needed chunks are decompressed. There is a chunk cache that can prevent the same data chunk from being decompressed multiple times when multiple reads access data in the same chunk, but the default chunk cache may not be adequate. The size of the chunk cache can be queried and set with calls to nc_get_var_chunk_cache() and nc_set_var_chunk_cache(); a sketch of what that looks like follows below. The guidance in our documentation is currently inadequate for how to configure the chunk cache when the defaults don't work well. Some experimentation may be required, since the right size also depends on how much memory you can devote to the chunk cache.

Chunk sizes and shapes were fixed when a file was written, but data can be "rechunked" to improve access using new parameters in the nccopy utility, if copying such a large file is practical just to improve read performance; an example nccopy command is also sketched below.
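For what it's worth, here is a minimal sketch of what using those two calls might look like in C. The file name, variable name, and cache numbers are made-up placeholders for experimentation, not values taken from your program:

    /* Sketch: enlarge the per-variable chunk cache before reading a
     * compressed variable.  File name, variable name, and cache sizes
     * are hypothetical placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    #define CHECK(e) do { int rc_ = (e); if (rc_ != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(rc_)); exit(1); } } while (0)

    int main(void)
    {
        int ncid, varid;
        size_t cache_bytes, cache_slots;
        float preemption;

        CHECK(nc_open("tile.nc", NC_NOWRITE, &ncid));      /* placeholder file name */
        CHECK(nc_inq_varid(ncid, "some_field", &varid));   /* placeholder variable name */

        /* Report the current (default) chunk cache settings for this variable. */
        CHECK(nc_get_var_chunk_cache(ncid, varid, &cache_bytes, &cache_slots, &preemption));
        printf("default cache: %zu bytes, %zu slots, preemption %.2f\n",
               cache_bytes, cache_slots, preemption);

        /* Enlarge the cache so more uncompressed chunks stay in memory,
         * e.g. 64 MB and 1009 slots -- numbers to experiment with. */
        CHECK(nc_set_var_chunk_cache(ncid, varid, 64 * 1024 * 1024, 1009, preemption));

        /* ... read the variable here, e.g. with nc_get_vara_float() ... */

        CHECK(nc_close(ncid));
        return 0;
    }

The cache only helps when reads revisit the same chunks, so whether it makes a difference depends on your access pattern.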
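And here is the kind of nccopy rechunking command I mean; the dimension names, chunk lengths, and file names below are placeholders, which you would replace with the actual dimensions in your files and chunk shapes that match the way you read the data:

    nccopy -c YDim/400,XDim/400 original.nc rechunked.nc

You can then check the resulting chunk shapes with "ncdump -s -h rechunked.nc", which reports them in the "_ChunkSizes" special attribute.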
> 3) Is the slowness inherent in the process, or am I just doing something really inefficiently?

It's hard to tell from the brief description you've given. We recently worked with another user to greatly improve access to data on a server that had been stored in a poor order for its most common access pattern, by rechunking the data to match that pattern. I'm planning to eventually distill that effort into better guidance for improving access performance by rechunking.

> Some 1km MODIS tiles are at
>
> ftp://e4ft101.cr.usgs.gov/MOLA/NYD11A2.005/
>
> if that helps.

That may help, thanks. It would at least make it possible to see the current sizes and shapes of the chunks, though you could also see these yourself by running "ncdump -h -s" on one of the MODIS tile files.

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                                   http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: IDT-559068
Department: Support netCDF
Priority: Normal
Status: Closed