This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
--- Begin Message ---
- Subject: Re: sample files for nccopy
- Date: Tue, 16 Jul 2013 14:26:37 -0600
Hi Russ. We were able to duplicate your results. Thanks again.

...Rob

On 07/15/2013 07:41 PM, Russ Rew wrote:

Thanks Russ. Your results are encouraging. We seem to have some sort of build problem here that is giving us much poorer results. It's good to know that we should be able to get satisfactory results for our files.

I duplicated the poor results you saw when I didn't use either the "-u" flag to fix the size of the unlimited dimension or the "-c time/3157" argument to set the chunk length to something better than the default of 1 used for unlimited dimensions. So maybe there's nothing wrong with your build, unless you've tried those arguments and still get poor compression.

I also verified that you don't have to do each of these approaches in two separate nccopy calls using an intermediate file, as in my examples. Each of them can be done with just one nccopy call using the options from the separate calls, and get the same compression:

  $ nccopy -u -d9 proc.nc proc-u-d9.nc
  $ nccopy -c time/3157 -d9 proc.nc proc-c-d9.nc
  $ ls -l proc.nc proc-u-d9.nc proc-c-d9.nc
  -rw-rw-r-- 1 russ ustaff 3843892 Jul 15 15:31 proc.nc
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 19:36 proc-u-d9.nc
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 19:36 proc-c-d9.nc

Also, the -k4 is not needed, as nccopy can figure out the type of the output file.

--Russ

...Rob

On 07/15/2013 04:45 PM, Russ Rew wrote:

Rob,

You may have been a little puzzled by my last response. For some reason, I thought that nccopy was achieving significantly less compression than just gzipping the original file, but now I see that's not the case with the proc.nc file:

  $ ls -l proc.nc proc.nc.gz proc-u-d9.nc proc-c-d9.nc
  -rw-rw-r-- 1 russ ustaff 3843892 Jul 15 15:31 proc.nc
  -rw-rw-r-- 1 russ ustaff 1368904 Jul 15 16:35 proc.nc.gz
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 16:07 proc-u-d9.nc
  -rw-rw-r-- 1 russ ustaff 1355552 Jul 15 16:14 proc-c-d9.nc

So in the response I sent, I was trying to explain why nccopy compression didn't do as well as gzip, but in fact it does a little bit better! Maybe I was thinking of the smaller "raw" file, for which the compression using nccopy is definitely worse:

  $ ls -l raw.nc raw.nc.gz raw-u-d9.nc raw-c-d9.nc
  -rw-rw-r-- 1 russ ustaff 1795940 Jul 15 15:31 raw.nc
  -rw-rw-r-- 1 russ ustaff 212845 Jul 15 16:40 raw.nc.gz
  -rw-rw-r-- 1 russ ustaff 337520 Jul 15 16:43 raw-u-d9.nc
  -rw-rw-r-- 1 russ ustaff 337520 Jul 15 16:42 raw-c-d9.nc

--Russ

Hi Rob,

Thanks for taking the time to look at this. There are a couple of files at

  ftp://ftp.ngdc.noaa.gov/pub/outgoing/prentice

  daily.2013-096.N19.proc.nc
  daily.2013-096.N19.raw.nc

We downloaded the latest version of netcdf and ran the nccopy command with both -k 3 and -k 4, with and without -d 9, on a Redhat Linux platform:

  nccopy -k 3 -d 9 infile outfile

I just looked at the files quickly, to see what's going on, and verified the results you reported: gzip of the whole raw file provides 8.4 to 1 compression, and gzip of the whole proc file yields about 2.8 to 1 compression. However, using nccopy with -d1 does poorly (and is very slow), resulting in making the files larger by factors of about 7.0 and 6.7, respectively.

I think use of the unlimited time dimension is the root of the problem, because it means that each variable is divided into chunks for compression, with one record per chunk, and what you are seeing is the HDF5 space overhead for storing lots of tiny chunks, each supposedly compressed.
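One quick way to see that per-record chunking is to run ncdump with its -s flag on the deflated copy (the outfile from the nccopy -d run above); -s prints the special virtual attributes such as _Storage, _ChunkSizes, and _DeflateLevel. The variable names and values will depend on your file, but each variable that uses the unlimited time dimension should show a chunk length of 1 along that dimension:

  $ ncdump -s -h outfile    # "outfile" here stands for whatever name was given to the deflated copy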
Two solutions come to mind:

1. If you don't need the unlimited dimension any more, perhaps because no more data will be appended to the files, then convert the unlimited dimension into a fixed-size dimension, resulting in all the values of each variable being stored contiguously, which should be more compressible.

2. If you still need the unlimited dimension, then rechunk the data before compressing it, so the compression can work on larger chunks.

The nccopy utility can be used for both of these approaches.

For approach 1:

  $ nccopy -u proc.nc proc-u.nc         # makes unlimited dimension fixed size
  $ nccopy -d9 proc-u.nc proc-u-d9.nc   # compresses result, 2.8 to 1

and similarly for the raw.nc file.

For approach 2:

  $ nccopy -c time/3157 proc.nc proc-c.nc   # chunks time dimension
  $ nccopy -d9 proc-c.nc proc-c-d9.nc       # compresses result, 2.8 to 1

Both of these achieve the same modest amount of compression, which isn't as good as gzip because

  - each chunk of each variable is separately compressed, whereas gzip compresses the whole file as a single chunk of data
  - the file metadata in the header is not compressed, only the data
  - the HDF5 overhead is a larger portion of the file for relatively small files like these

If your variables were a lot larger, or you had fewer variables per file, or the variables were multidimensional, nccopy might be able to achieve better compression. But the benefit of compressing each chunk of each variable separately is that you can read a small amount of data out of the file without uncompressing the whole file: only the compressed chunks of the desired variable need to be uncompressed.

And congratulations on your daughter's marriage. Our daughter, Rachel, got married last September. It's an exciting time.

Congratulations to her, too!

--Russ
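As a footnote to the nccopy examples above: the same ncdump -s flag can be used afterward to confirm that a copy really was rechunked and compressed as requested. For a file produced by the approach-2 commands, each deflated variable should report _DeflateLevel = 9 and a _ChunkSizes entry of 3157 along the time dimension (the exact output depends on the variables in the file):

  $ ncdump -s -h proc-c-d9.nc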
--- End Message ---