Hi James,

I wrote:

> The problem is revealed by running "ncdump -s -h" on the netCDF-4 files,
> which shows that the variables that use the unlimited dimension nsets get
> "chunked" into 3-D tiles, and the netCDF-4 library chooses default chunk
> sizes that cause the file expansion you see.
>
> One simple solution would be to write the netCDF-4 data with the unlimited
> dimension nsets changed instead to a fixed-size dimension, an operation
> supported by the "-u" option of the nccopy utility. Then the variable data
> would all be stored contiguously instead of chunked, as is required when a
> variable uses the unlimited dimension.

When I tested this by using "nccopy -u" on your file, it changed nsets from
an unlimited dimension to a fixed-size dimension, but the result was still
chunked the same as the input file, rather than stored contiguously, so it
didn't make the file smaller. However, when I used a new nccopy capability
to rechunk the input in a way that made it contiguous (1 chunk) or used
chunk sizes that evenly divided the dimension sizes, it worked as expected,
reducing the size to essentially the same as the netCDF-3 file:

$ ls -l Fluid_Meas.*
-rw-rw-r-- 1 russ ustaff 42186296 Dec 22 13:26 Fluid_Meas.snc
-rw-rw-r-- 1 russ ustaff 95528366 Dec 22 13:27 Fluid_Meas.snc-nccopy-k3
$ nccopy -k3 -c nsets/,n_variables/,npoints/ Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 42251686 Jan  6 15:21 tmp.nc

This shows that the netCDF-4 version of the netCDF-3 file Fluid_Meas.snc,
with the right chunk sizes, has only slightly more overhead than the
netCDF-3 file, even without compression. The "-c nsets/,n_variables/,npoints/"
option says to chunk all variables that use the dimensions "nsets",
"n_variables", or "npoints" with a chunksize the same as the dimension
length.

The time for nccopy to read and rechunk the original 42 MB file to the
netCDF-4 output file was fast on my desktop machine:

real 0m0.34s
user 0m0.07s
sys  0m0.19s

and the times were similar if nccopy instead used the 95 MB netCDF-4 file
that used the default chunk sizes:

real 0m0.25s
user 0m0.08s
sys  0m0.14s

> Another possibility would be to explicitly set the chunksizes for the
> output to better values than those determined by the current library
> algorithm for selecting default chunk sizes. We're discussing whether we
> could fix the default chunk size algorithm to avoid extreme file size
> expansion, such as you have demonstrated in this case.
>
> For example, the library currently sets the default chunksizes for the
> measurements variable as this output from ncdump -h -s shows:
>
>     float measurements(nsets, n_variables, npoints) ;
>         measurements:_Storage = "chunked" ;
>         measurements:_ChunkSizes = 1, 9, 120669 ;
>
> resulting in 20 chunks, each of size 1*9*120669*4 = 4344084 bytes, for a
> total of 86881680 bytes, about 87 Mbytes.
>
> Better choices of chunksizes would be (1, 11, 152750) with 5 chunks,
> (1, 1, 152750) with 55 chunks, or (1, 1, 76375) with 110 chunks, for
> example, none of which would waste any space in the chunks, and all of
> which would result in total storage of 33605000 bytes, about 34 Mbytes.
>
> It looks like the current default chunking can result in a large amount
> of wasted space in cases like this.
>
> Thanks for pointing out this problem. In summary, to work around it
> currently you either have to avoid using the unlimited dimension for
> these netCDF-4 files, or you have to explicitly set the chunk sizes
> using the appropriate API call so as not to waste as much space as the
> current choice of default chunk sizes does.
>
> I'm currently working on making it easy to specify chunksizes in the
> output of nccopy, but I don't know whether that will make the upcoming
> 4.1.2 release. If not, it will be available separately in subsequent
> snapshot releases and should help deal with problems like this, if we
> don't find a better algorithm for selecting default chunksizes.
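For completeness, here is a minimal sketch of what setting chunk sizes
explicitly through the API looks like in a C program that creates such a
file, using the netCDF-4 function nc_def_var_chunking(). This is only an
illustration, not code from this ticket: the output filename, the CHECK
macro, and the choice of 1 x 1 x 76375 chunks are my assumptions, while the
dimension and variable names mirror your file.

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

/* Illustrative error check: bail out on any netCDF error. */
#define CHECK(e) do { int _s = (e); if (_s != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(_s)); \
        exit(1); } } while (0)

int main(void) {
    int ncid, dimids[3], varid;
    size_t chunks[3] = {1, 1, 76375};   /* explicit chunk sizes: 110 chunks total */

    /* Create a netCDF-4 classic model file. */
    CHECK(nc_create("chunked_example.nc", NC_NETCDF4 | NC_CLASSIC_MODEL, &ncid));

    /* Dimensions as in Fluid_Meas.snc: nsets is unlimited, the others fixed. */
    CHECK(nc_def_dim(ncid, "nsets", NC_UNLIMITED, &dimids[0]));
    CHECK(nc_def_dim(ncid, "n_variables", 11, &dimids[1]));
    CHECK(nc_def_dim(ncid, "npoints", 152750, &dimids[2]));

    CHECK(nc_def_var(ncid, "measurements", NC_FLOAT, 3, dimids, &varid));

    /* Override the library's default chunking with explicit chunk sizes. */
    CHECK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks));

    /* Optional: shuffle plus level-1 deflate, like nccopy's -s and -d 1
     * options shown below. */
    CHECK(nc_def_var_deflate(ncid, varid, 1, 1, 1));

    CHECK(nc_enddef(ncid));
    /* ... write the data with nc_put_vara_float() as usual ... */
    CHECK(nc_close(ncid));
    return 0;
}

Rewriting the data with a program along these lines is equivalent to what
the nccopy command lines below do, just with the chunking specified in code
instead of on the command line.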
Options for specifying custom chunksizes are now supported in the latest
nccopy snapshot release, so they will be in the upcoming version 4.1.2. All
the choices of chunksizes described above work fine and result in a data
file essentially as small as the netCDF-3 data file. For example, to specify
110 chunks, each of size 1 x 1 x 76375, for variables using the "nsets",
"n_variables", and "npoints" dimensions of size 5, 11, and 152750
respectively, an nccopy invocation would be:

$ nccopy -c nsets/1,n_variables/1,npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
[TODO]
$ ncdump -h -s tmp.nc | grep "measurements"
[TODO]

and the output file is a netCDF-4 classic model file if you didn't specify
an output file type and are copying a netCDF-3 classic file. Also, for the
case of copying a netCDF classic format file with specified chunk sizes, a
default chunksize of 1 is assumed for any dimension not mentioned, so the
above could be done more simply with just:

$ nccopy -c npoints/76375 Fluid_Meas.snc tmp.nc

Note that you can do considerably better with output file size if you
specify compression, as can be accomplished with the -d (deflate) option to
nccopy:

$ nccopy -d 1 -c npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 32686447 Jan 10 09:17 tmp.nc

and you can do even better by specifying the "shuffle" option (-s), which
improves the compression (still lossless) by reordering the data so that
all the first bytes of the values in each chunk are stored in a single
block, followed by all the second bytes, all the third bytes, and all the
fourth bytes:

$ nccopy -d 1 -s -c npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 27864220 Jan 10 09:17 tmp.nc

You can verify with ncdump that all the data is still identical to the
original. Furthermore, by chunking the data, the compression is more useful
for accessing a subset of the data, as each chunk is compressed
independently and the library only uncompresses the chunks needed for the
data requested, rather than all the data in the file. Uncompressed chunks
are cached, so accessing the same data again usually doesn't incur the cost
of uncompressing it a second time.

Anyway, the new version of nccopy is in the current snapshot. More complete
documentation and more testing will be in the upcoming 4.1.2 release. We're
still considering tweaking the current default chunk size algorithm to
avoid the file size increase in your example. Your example has been quite
helpful for testing and refining the chunking/rechunking functionality in
the nccopy utility.

--Russ

Russ Rew
UCAR Unidata Program
address@hidden
http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: AIQ-275071
Department: Support netCDF
Priority: Normal
Status: Closed