[netCDF #AIQ-275071]: [netcdf-hdf] Unexpected overall file size jump
- Subject: [netCDF #AIQ-275071]: [netcdf-hdf] Unexpected overall file size jump
- Date: Mon, 10 Jan 2011 09:27:40 -0700
Hi James,
I wrote:
> The problem is revealed by running "ncdump -s -h" on the netCDF-4 files,
> which shows that the variables that use the unlimited dimension nsets get
> "chunked" into 3-D tiles, and the netCDF-4 library chooses default chunk
> sizes that cause the file expansion you see.
>
> One simple solution would be to write the netCDF-4 data with the unlimited
> dimension nsets changed instead to a fixed-size dimension, an operation
> supported by the "-u" option of the nccopy utility. Then the variable data
> would all be stored contiguously instead of chunked, as is required when a
> variable uses the unlimited dimension.
When I tested this by using "nccopy -u" on your file, it changed nsets from an
unlimited dimension to a fixed-size dimension, but the result was still chunked
the same as for the input file, rather than stored contiguously, so it didn't
make the file smaller. However, when I used a new nccopy capability to rechunk
the input in a way that made it contiguous (1 chunk) or used chunk sizes that
evenly divided the dimension sizes, it worked as expected, reducing the size to
essentially the same as the netCDF-3 file:
$ ls -l Fluid_Meas.*
-rw-rw-r-- 1 russ ustaff 42186296 Dec 22 13:26 Fluid_Meas.snc
-rw-rw-r-- 1 russ ustaff 95528366 Dec 22 13:27 Fluid_Meas.snc-nccopy-k3
$ nccopy -k3 -c nsets/,n_variables/,npoints/ Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 42251686 Jan 6 15:21 tmp.nc
This shows that the netCDF-4 version of the netCDF-3 file Fluid_Meas.snc, with
the right chunk sizes, has only slightly more overhead than the netCDF-3 file,
even without compression. The "-c nsets/,n_variables/,npoints/" option says to
chunk all variables that use the dimensions "nsets", "n_variables", or
"npoints" with a chunksize the same as the dimension length. The time for
nccopy to read and rechunk the original 42 MB file to the netCDF-4 output file
was fast on my desktop machine:
real 0m0.34s
user 0m0.07s
sys 0m0.19s
and the times were similar if nccopy instead used the 95 MB netCDF-4 file that
used the default chunk sizes:
real 0m0.25s
user 0m0.08s
sys 0m0.14s
> Another possibility would be to explicitly set the chunksizes for the output
> to better values than those determined by the current library algorithm for
> selecting default chunk sizes. We're discussing whether we could fix the
> default chunk size algorithm to avoid extreme file size expansion, such as
> you have demonstrated in this case.
>
> For example, the library currently sets the default chunksizes for the
> measurements variable as this output from "ncdump -h -s" shows:
>
>     float measurements(nsets, n_variables, npoints) ;
>         measurements:_Storage = "chunked" ;
>         measurements:_ChunkSizes = 1, 9, 120669 ;
>
> resulting in 20 chunks, each of size 1*9*120669*4 = 4344084 bytes, for a
> total of 86881680 bytes, about 87 Mbytes.
>
> Better choices of chunksizes would be (1, 11, 152750) with 5 chunks, or
> (1, 1, 152750) with 55 chunks, or (1, 1, 76375) with 110 chunks, for
> example, none of which would waste any space in the chunks and all of which
> would result in total storage of 33605000 bytes, about 34 Mbytes.
>
> It looks like the current default chunking can result in a large amount
> of wasted space in cases like this.
>
> Thanks for pointing out this problem. In summary, to work around it
> currently you either have to avoid using the unlimited dimension for these
> netCDF-4 files or you have to explicitly set the chunk sizes using the
> appropriate API call, so as not to waste as much space as the current
> choice of default chunk sizes does.
>
> I'm currently working on making it easy to specify chunksizes in the output
> of nccopy, but I don't know whether that will make the upcoming 4.1.2
> release. If not, it will be available separately in subsequent snapshot
> releases and should help deal with problems like this, if we don't find a
> better algorithm for selecting default chunksizes.
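For reference, if you're writing the files yourself with the netCDF-4 C
library, the "appropriate API call" mentioned above is nc_def_var_chunking(),
used in define mode before any data is written. Here's a minimal sketch along
those lines; the output file name is just for illustration and the error
handling is reduced to a macro:

    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    /* Abbreviated error handling for this sketch */
    #define CHECK(call) do { int stat = (call); if (stat != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(stat)); \
        exit(1); } } while (0)

    int main(void)
    {
        int ncid, dimids[3], varid;
        /* 1 x 1 x 76375 chunks: each evenly divides the fixed dimensions */
        size_t chunks[3] = {1, 1, 76375};

        /* Create a netCDF-4 file (illustrative name) */
        CHECK(nc_create("measurements_chunked.nc", NC_NETCDF4, &ncid));

        /* Dimensions as in your file: nsets unlimited, the others fixed */
        CHECK(nc_def_dim(ncid, "nsets", NC_UNLIMITED, &dimids[0]));
        CHECK(nc_def_dim(ncid, "n_variables", 11, &dimids[1]));
        CHECK(nc_def_dim(ncid, "npoints", 152750, &dimids[2]));

        CHECK(nc_def_var(ncid, "measurements", NC_FLOAT, 3, dimids, &varid));

        /* Override the library's default chunk sizes while still in
           define mode */
        CHECK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks));

        CHECK(nc_enddef(ncid));
        /* ... write the data with nc_put_vara_float() ... */
        CHECK(nc_close(ncid));
        return 0;
    }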
Options for specifying custom chunksizes to nccopy are now supported in the
latest snapshot release, so they will be in the upcoming version 4.1.2. All
the choices for chunksizes described above work fine and result in a data file
essentially as small as the netCDF-3 data file. For example, to specify 110
chunks, each of size 1 x 1 x 76375, for variables using the "nsets",
"n_variables", and "npoints" dimensions of size 5, 11, and 152750 respectively,
an nccopy invocation would be:
$ nccopy -c nsets/1,n_variables/1,npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
[TODO]
$ ncdump -h -s tmp.nc | grep "measurements"
[TODO]
and the output file is a netCDF-4 classic model file if you didn't specify an
output file type and are copying a netCDF-3 classic file. Also, when copying a
netCDF classic format file with specified chunk sizes, a default chunksize of
1 is assumed for any dimension not mentioned, so the above could be done more
simply with just:
$ nccopy -c npoints/76375 Fluid_Meas.snc tmp.nc
Note that you can do considerably better with output file size if you specify
compression, as can be accomplished with the -d (deflate) option to nccopy:
$ nccopy -d 1 -c npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 32686447 Jan 10 09:17 tmp.nc
and you can do even better by specifying the "shuffle" option to improve the
compression (which is still lossless) by reordering the data with all the
first bytes for each chunk in a single block, followed by all the second
bytes, all the third bytes, and all the fourth bytes:
$ nccopy -d 1 -s -c npoints/76375 Fluid_Meas.snc tmp.nc
$ ls -l tmp.nc
-rw-rw-r-- 1 russ ustaff 27864220 Jan 10 09:17 tmp.nc
You can verify with ncdump that all the data is still identical to the original.
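The same shuffle and deflate settings can also be applied programmatically
when a variable is created. If you go the API route sketched earlier, the
call would look like this (define mode only, after nc_def_var() and before
nc_enddef()):

    /* Equivalent of "nccopy -d 1 -s": shuffle filter plus zlib deflate
       level 1 for this variable */
    int shuffle = 1;        /* 1 enables the shuffle (byte-reordering) filter */
    int deflate = 1;        /* 1 enables zlib compression */
    int deflate_level = 1;  /* 1-9; 1 is usually a good space/time tradeoff */
    CHECK(nc_def_var_deflate(ncid, varid, shuffle, deflate, deflate_level));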
Furthermore, by chunking the data, the compression is more useful for accessing
a subset of the data, as each chunk is independently compressed and the library
only uncompresses the chunks needed for the data requested, rather than all the
data in the file. Uncompressed chunks are cached, so accessing the same data
usually doesn't incur the cost of uncompressing the data again.
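If you know the read access pattern ahead of time, the per-variable chunk
cache can also be adjusted from a C program with nc_set_var_chunk_cache().
The numbers below are only placeholders to show the call, not tuned
recommendations:

    /* Hypothetical tuning: a 32 MB cache with 1009 chunk slots and
       preemption 0.75 (how readily fully-read chunks are evicted) */
    CHECK(nc_set_var_chunk_cache(ncid, varid, 32 * 1024 * 1024, 1009, 0.75));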
Anyway, the new version of nccopy is in the current snapshot. More complete
documentation and more testing will be in the upcoming 4.1.2 release. We're
still considering tweaking the current default chunk size algorithm to avoid
the file size increase in your example. Your example has been quite helpful
for testing and refining the chunking/rechunking functionality in the nccopy
utility.
--Russ
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: AIQ-275071
Department: Support netCDF
Priority: Normal
Status: Closed