This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi Benno,

> We have been looking at our netcdf read performance again, particularly with
> hdf4/hdf5 files.
>
> We do not have a clear story for the most part, but there seems to be a clear
> problem with compression in hdf5-based netcdf files.
>
> We would appreciate any insight.

You are seeing artifacts of:

- chunking with a chunk cache that's too small for the chunk shapes used for
  compression
- poor default chunk shapes in early netCDF-4 versions (such as 4.1.2)
- measuring I/O performance with the ncdump utility, which is not designed for
  high performance

A chunk (or tile) is the smallest unit of HDF5 data compression and access.
The ncdump utility just uses the default chunk cache size, which in netCDF
version 4.1.2 was small (4194304 bytes). The temperature variable in your test
file has 9 chunks, each of size 1 x 1196 x 1196 shorts, so each chunk is
2860832 bytes. That means only 1 uncompressed chunk will fit in the default
chunk cache. Reading all the values in each row of 2500 values will read and
uncompress 3 chunks, and since the chunk cache only holds one of those chunks,
the same chunks will be re-read and uncompressed repeatedly until all the data
is read!

I don't think ncdump is a very good program for testing read performance. It
was not designed to be high-performance, as it spends much of its time
comparing each value with the fill value before converting it to ASCII and
formatting output a row at a time. The ncdump utility doesn't have an option
for specifying the size of chunk cache to use for compressed files.

The nccopy utility is more appropriate for timing I/O with compression and
chunking, as it's designed to be efficient. It uses only the netCDF library to
read and write, so it's testing the efficiency of the netCDF software.
However, nccopy was not available for early versions of netCDF-4, such as
4.0.1. Here's the current man page:

  http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html

Later versions of netCDF, such as 4.2.x and 4.3.x, have better default
chunking strategies, so they perform better on your file. For example, in
netCDF 4.3.0, better chunk sizes are used (1 x 1250 x 1250), so there are only
4 chunks rather than 9, and compression works better, even with the same level
of deflation:

  $ nccopy -d1 spv.nc spv-d1.nc
  $ ls -l spv-d1.nc
  -rw-rw-r-- 1 russ ustaff 2832831 Nov 26 14:44 spv-d1.nc

which is better than the 3538143 bytes of the compressed file you sent. The
time for the above compression was about 0.8 sec on my Linux desktop.

A pretty good timing test for reading is to read, uncompress, and copy the
compressed file using nccopy. Before running any such test, you should make
sure you aren't just reading a cached copy of the input file in system memory.
See "A note about timings" at the end of my blog "Chunking Data: Why it
Matters" for how to do this:

  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters

That blog also has some advice about choosing chunk shapes and sizes for good
performance.
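[Editor's note: for reference, here is a minimal C sketch of how a reading
program could enlarge the per-variable chunk cache itself, using the netCDF-4
function nc_set_var_chunk_cache, so that each compressed chunk is read and
inflated only once. The variable name "temperature", the file name, and the
chunk sizes come from the message above; the overall 1 x 2500 x 2500 shape is
inferred from the row length and file size mentioned there, so treat the exact
dimensions as assumptions rather than facts about the original file.]

  #include <stdio.h>
  #include <stdlib.h>
  #include <netcdf.h>

  #define CHECK(e) do { int _s = (e); if (_s != NC_NOERR) { \
      fprintf(stderr, "netCDF error: %s\n", nc_strerror(_s)); exit(1); } } while (0)

  int main(void) {
      int ncid, varid;
      /* Shape assumed to be 1 x 2500 x 2500 shorts (about 12.5 MB uncompressed). */
      size_t start[3] = {0, 0, 0};
      size_t count[3] = {1, 2500, 2500};
      short *data = malloc((size_t)2500 * 2500 * sizeof(short));
      if (data == NULL) return 1;

      CHECK(nc_open("spv-199901011900_compressed.nc", NC_NOWRITE, &ncid));
      CHECK(nc_inq_varid(ncid, "temperature", &varid));

      /* 32 MiB per-variable cache with ~1009 chunk slots: enough to hold all
       * nine ~2.8 MB uncompressed chunks, instead of the 4 MiB default that
       * holds only one chunk at a time. */
      CHECK(nc_set_var_chunk_cache(ncid, varid, 32 * 1024 * 1024, 1009, 0.75f));

      CHECK(nc_get_vara_short(ncid, varid, start, count, data));
      CHECK(nc_close(ncid));
      free(data);
      return 0;
  }

[Compile against the netCDF C library, e.g. "cc read_cache.c -o read_cache
-lnetcdf".]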
My follow-up blog, "Chunking Data: Choosing Shapes", has more specific advice:

  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

Anyway, here's how much time it takes to copy and uncompress the two versions
of your compressed file, one using the 1 x 1196 x 1196 chunks from the old
defaults in netCDF 4.1.2, and the other using the 1 x 1250 x 1250 chunks in
the current netCDF release:

  $ clear_cache.sh; time nccopy -d0 -k1 spv-199901011900_compressed.nc tmp.nc
  real 0m1.98s
  user 0m0.27s
  sys  0m0.06s

  $ clear_cache.sh; time nccopy -d0 -k1 spv-d1.nc tmp.nc
  real 0m1.83s
  user 0m0.19s
  sys  0m0.07s

In each case, the output is a netCDF-3 classic format file matching the
uncompressed file you sent.

And just FYI, here's the time for running ncdump on the two versions of the
compressed data:

  $ clear_cache.sh; time ncdump spv-199901011900_compressed.nc > /dev/null
  real 4m5.91s
  user 3m58.97s
  sys  0m4.43s

  $ clear_cache.sh; time ncdump spv-d1.nc > /dev/null
  real 3m29.15s
  user 3m25.79s
  sys  0m0.86s

Both of those would be much faster if ncdump reserved enough chunk cache in
memory to hold all the chunks in a row of a variable when dumping it. I could
add that optimization option if you really need ncdump to be faster, but it
would use a lot more memory than it does now.

--Russ

> Benno
>
> ---------- Forwarded message ----------
> From: Igor Khomyakov <address@hidden>
> Date: Thu, Nov 14, 2013 at 4:53 PM
> Subject: netcdf 4.1.2+ issue
> To: Benno Blumenthal <address@hidden>
> Cc: John del Corral <address@hidden>
>
> Benno, here's the test case for netcdf developers. Please let me know if you
> need more information. Attached, please find the sample data files and the
> strace log.
>
> Igor
>
> THE DATA FILES: The compressed version of the netcdf file was produced using
> nccopy (option -d). The uncompressed file is 12.5MB, the compressed file is
> 3.5MB. Attached, you may find datafiles.tgz that contains both data files.
>
> THE PROBLEM: ncdump 4.1.2+ of the compressed file takes 50 times more time
> than ncdump of the original netcdf file. Ncdump 4.0.1 doesn't appear to have
> this issue.
>
> $ time ncdump spv-199901011900.nc >/dev/null
>
> real 0m1.652s
> user 0m1.605s
> sys  0m0.017s
>
> $ time ncdump spv-199901011900_compressed.nc >/dev/null
>
> real 1m28.273s
> user 1m11.460s
> sys  0m16.681s
>
> THE STRACE LOG: We straced ncdump 4.1.2 of the compressed file and found
> that it calls the 'read' function 7,526 times and reads 3,384,680,557 bytes!
> This is about 1000 times more than the size of the file. Attached, please
> find the strace log.
>
> --
> Dr. M. Benno Blumenthal          address@hidden
> International Research Institute for climate and society
> The Earth Institute at Columbia University
> Lamont Campus, Palisades NY 10964-8000   (845) 680-4450
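[Editor's note: as a companion to the read-side sketch above, here is a
hedged C sketch of writing a file with the 1 x 1250 x 1250 chunk shape and
deflate level 1 that the reply describes, using the netCDF-4 C API directly
rather than nccopy -d1. The dimension names, their 2500-element lengths, the
stand-in data values, and the output filename are illustrative assumptions,
not taken from the original files.]

  #include <stdio.h>
  #include <stdlib.h>
  #include <netcdf.h>

  #define CHECK(e) do { int _s = (e); if (_s != NC_NOERR) { \
      fprintf(stderr, "netCDF error: %s\n", nc_strerror(_s)); exit(1); } } while (0)

  int main(void) {
      int ncid, dimids[3], varid;
      size_t chunks[3] = {1, 1250, 1250};   /* 1 x 2 x 2 = 4 chunks instead of 9 */
      short *data = calloc((size_t)2500 * 2500, sizeof(short));  /* stand-in values */
      if (data == NULL) return 1;

      CHECK(nc_create("spv-chunked.nc", NC_CLOBBER | NC_NETCDF4, &ncid));
      CHECK(nc_def_dim(ncid, "time", 1, &dimids[0]));
      CHECK(nc_def_dim(ncid, "y", 2500, &dimids[1]));
      CHECK(nc_def_dim(ncid, "x", 2500, &dimids[2]));
      CHECK(nc_def_var(ncid, "temperature", NC_SHORT, 3, dimids, &varid));
      CHECK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks));
      CHECK(nc_def_var_deflate(ncid, varid, 0, 1, 1));  /* no shuffle, deflate level 1 */
      CHECK(nc_enddef(ncid));

      CHECK(nc_put_var_short(ncid, varid, data));
      CHECK(nc_close(ncid));
      free(data);
      return 0;
  }

[With real data, the resulting file should compress along the lines of the
2832831-byte spv-d1.nc quoted above, since the chunk shape and deflation level
match what the newer nccopy defaults produced.]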