> daily.2013-096.N19.proc.nc
> daily.2013-096.N19.raw.nc
> We downloaded the latest version of netCDF and ran the nccopy
> command with both -k 3 and -k 4, with and without -d 9, on a
> Red Hat Linux platform.
>
>   nccopy -k 3 -d 9 infile outfile
I just took a quick look at the files to see what's going on, and
verified the results you reported: gzip of the whole raw file provides
8.4 to 1 compression, and gzip of the whole proc file yields about
2.8 to 1.
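For reference, the ratios are easy to check by compressing a copy and
comparing sizes (ordinary gzip and ls invocations, nothing
netCDF-specific):

$ gzip -c daily.2013-096.N19.raw.nc > raw.nc.gz  # compress a copy; original untouched
$ ls -l daily.2013-096.N19.raw.nc raw.nc.gz      # ratio = raw size / compressed size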
However, using nccopy with -d1 does poorly (and is very slow), making
the files larger by factors of about 7.0 and 6.7, respectively.
I think use of the unlimited time dimension is the root of the problem:
it means each variable is divided into chunks for compression, with one
record per chunk, and what you are seeing is the HDF5 space overhead of
storing lots of tiny chunks, each supposedly compressed.
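You can see the one-record-per-chunk layout directly with ncdump's -s
option, which adds the special virtual attributes for storage
information to the header output (proc-d1.nc here is just a stand-in
for whatever your -d1 output file was called):

$ ncdump -h -s proc-d1.nc | grep _ChunkSizes  # shows one record per chunk along time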
Two solutions come to mind:
1. If you don't need the unlimited dimension any more, perhaps
because no more data will be appended to the files, then convert
the unlimited dimension into a fixed-size dimension, resulting in
all the values of each variable being stored contiguously, which
should be more compressible.
2. If you still need the unlimited dimension, then rechunk the data
before compressing it, so the compression can work on larger
chunks.
The nccopy utility can be used for both of these approaches.
For approach 1:
$ nccopy -u proc.nc proc-u.nc # makes unlimited dimension fixed size
$ nccopy -d9 proc-u.nc proc-u-d9.nc # compresses result, 2.8 to 1
and similarly for the raw.nc file.
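I believe the two steps can also be combined into a single invocation:

$ nccopy -u -d9 proc.nc proc-u-d9.nc  # fix unlimited dimension and compress in one step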
For approach 2:
$ nccopy -c time/3157 proc.nc proc-c.nc # chunks time dimension
$ nccopy -d9 proc-c.nc proc-c-d9.nc # compresses result, 2.8 to 1
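Again, the rechunking and compression should combine into one step:

$ nccopy -c time/3157 -d9 proc.nc proc-c-d9.nc  # rechunk and compress in one step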
Both of these achieve the same modest amount of compression, which isn't
as good as gzip because
- each chunk of each variable is separately compressed, whereas gzip
compresses the whole file as a single chunk of data
- the file metadata in the header is not compressed, only the data
- the HDF5 overhead is a larger portion of the file for relatively
small files like these
If your variables were a lot larger, or you had fewer variables per
file, or the variables were multidimensional, nccopy might be able to
achieve better compression. But the benefit of compressing each chunk
of each variable separately is that you can read a small amount of data
out of the file without uncompressing the whole file: only the
compressed chunks of the desired variable need to be uncompressed.
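For example, dumping a single variable (the name "tb" here is just a
stand-in for one of your variables) only has to uncompress that
variable's chunks:

$ ncdump -v tb proc-c-d9.nc | head  # uncompresses only the chunks of tb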
And congratulations on your daughter's marriage. Our daughter, Rachel,
got married last September. It's an exciting time.
Congratulations to her, too!
--Russ