[netCDFJava #BNA-191717]: chunking in Java
- Date: Wed, 07 May 2014 06:22:22 -0600
Hi Jeff,
> From those articles the purpose of chunking is to improve performance for
> large multi-dimensional data sets. It seems like it won't really provide
> any benefit in our situation since we only have one dimension. I know that
> NetCDF4 added chunking, but are all NetCDF4 files chunked, i.e., is there
> such a thing as a non-chunked NetCDF4 file? Or is that a contradiction in
> terms somehow?
No, not all netCDF-4 files are chunked. The simpler alternative, contiguous
layout, is better if you don't need compression, unlimited dimensions, or
the support for multiple patterns of access that chunking makes possible in
netCDF-4 files. A netCDF-4 variable can use contiguous layout if it doesn't
use an unlimited dimension or any sort of filter such as compression or
checksums.
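For concreteness, here's a minimal netCDF-Java sketch of a variable that
qualifies for contiguous layout: a fixed-size dimension and no filters. It
assumes the 4.x NetcdfFileWriter API (writing netCDF-4 from Java also needs
the native C library), and the file and variable names are just illustrative:

    import ucar.ma2.Array;
    import ucar.ma2.DataType;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    public class ContiguousExample {
      public static void main(String[] args) throws Exception {
        // Create a netCDF-4 format file; "obs.nc" is an illustrative name.
        NetcdfFileWriter writer =
            NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4, "obs.nc");
        // Fixed-size (not unlimited) dimension and no compression or checksum
        // filters, so the variable is eligible for contiguous storage.
        writer.addDimension(null, "time", 10000);
        Variable v = writer.addVariable(null, "value", DataType.DOUBLE, "time");
        writer.create();
        writer.write(v, Array.factory(DataType.DOUBLE, new int[] {10000}));
        writer.close();
      }
    }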
> Given that NetCDF4 readers are backwards-compatible with NetCDF3 files, is
> there any reason not to use a NetCDF3 file from your perspective? My
> suspicion is that our requirement is just being driven by "use the latest
> version" rather than any technical reasons.
I think I agree with you. With only one unlimited dimension, and if you
don't need the transparent compression that netCDF-4 makes possible,
there's no reason not to just use the default contiguous layout that a
netCDF-3 format file provides. However, you should still use the netCDF-4
library; just don't specify the netCDF-4 format when you create the file.
That's because the netCDF-4 software includes bug fixes, performance
enhancements, portability improvements, and remote access capabilities not
available in the old netCDF-3.6.3 version software.
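In netCDF-Java terms that's just a matter of which version constant you
pass when creating the writer; a sketch, again assuming the 4.x
NetcdfFileWriter API (file and dimension names are illustrative):

    import ucar.nc2.NetcdfFileWriter;

    // Current library, but classic (netCDF-3) format on disk:
    NetcdfFileWriter writer =
        NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf3, "records.nc");
    // Unlimited dimensions work fine in the classic format:
    writer.addUnlimitedDimension("record");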
The reason you were seeing a 7-fold increase in size is exactly as Ethan
pointed out: it's due to the way the HDF5 storage layer implements
unlimited dimensions, using chunking implemented with B-tree data
structures and indices, rather than the simpler contiguous storage used in
the classic netCDF format. The recent netcdf-4.3.2 version improves the
default chunking for 1-dimensional variables with an unlimited dimension,
as in your case, so it may be sufficient to provide both smaller files and
the benefits of netCDF-4 chunking, but without testing I can't predict how
close it comes to the simpler netCDF classic format in this case. Maybe I
can get time later today to try it ...
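If you do test netCDF-4 again, netCDF-Java also lets you pass an explicit
chunking strategy instead of relying on the defaults; a hedged sketch using
the Nc4Chunking classes from recent 4.x releases (the strategy and deflate
level here are just illustrative choices):

    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.write.Nc4Chunking;
    import ucar.nc2.write.Nc4ChunkingStrategy;

    // "standard" chunking heuristics, deflate level 0 (no compression),
    // no shuffle filter:
    Nc4Chunking chunker =
        Nc4ChunkingStrategy.factory(Nc4Chunking.Strategy.standard, 0, false);
    NetcdfFileWriter writer = NetcdfFileWriter.createNew(
        NetcdfFileWriter.Version.netcdf4, "records.nc", chunker);

Comparing the resulting file size against a classic-format file would show
how close the new defaults come.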
> I couldn't find anything on the NetCDF website regarding "choosing the
> right format for you". I was hoping there'd be something along those lines
> in the FAQ, but no luck.
The FAQ section on "Formats, Data Models, and Software Releases"
http://www.unidata.ucar.edu/netcdf/docs/faq.html
is intended to clarify the somewhat complex situation with multiple
versions of netCDF data models, software, and formats, but evidently it
doesn't help much in your case of choosing whether to use the default
classic netCDF format, the netCDF-4 classic model format, or the netCDF-4
format.
Thanks for pointing out the need for improving this section, and in
particular the answer to the FAQ "Should I get netCDF-3 or netCDF-4?",
which should really address the question "When should I use the netCDF
classic format?".
--Russ
> address@hidden> wrote:
>
> > Hi Jeff,
> >
> > How chunking and compression affect file size and read/write performance
> > is a complex issue. I'm going to pass this along to our chunking expert
> > (Russ Rew) who, I believe, is back in the office on Monday and should be
> > able to provide you with some better advice than I can give.
> >
> > In the meantime, here's an email he wrote in response to a conversation
> > on the effect of chunking on performance that might be useful:
> >
> >
> > http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2013/msg00498.html
> >
> > Sorry I don't have a better answer for you.
> >
> > Ethan
> >
> > Jeff Johnson wrote:
> > > Ethan-
> > >
> > > I made the changes you suggested with the following result:
> > >
> > > 10000 records, 8 bytes / record = 80000 bytes raw data
> > >
> > > original program (NetCDF4, no chunking): 537880 bytes (6.7x)
> > > file size with chunk size of 2000 = 457852 bytes (5.7x)
> > >
> > > So a little better, but still not good. I then tried different chunk
> > > sizes of 10000, 5000, 200, and even 1, which I would've thought would
> > > give me the original size, but all gave the same resulting file size
> > > of 457852.
> > >
> > > Finally, I tried writing more records to see if it's just a symptom of a
> > > small data set. With 1M records:
> > >
> > > 8MB raw data, chunk size = 2000
> > > 45.4MB file (5.7x)
> > >
> > > This is starting to seem like a lost cause given our small data records.
> > > I'm wondering if you have information I could use to go back to the
> > > archive group and try to convince them to use NetCDF3 instead.
> > >
> > > jeff
> >
>
>
> --
> Jeff Johnson
> DSCOVR Ground System Development
> Space Weather Prediction Center
> address@hidden
> 303-497-6260
>
>
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: BNA-191717
Department: Support netCDF
Priority: Normal
Status: Closed