[netCDFJava #BNA-191717]: chunking in Java
- Date: Wed, 07 May 2014 06:22:22 -0600
Hi Jeff,
> From those articles the purpose of chunking is to improve performance for
> large multi-dimensional data sets. It seems like it won't really provide
> any benefit in our situation since we only have one dimension. I know that
> NetCDF4 added chunking, but are all NetCDF4 files chunked, i.e., is there
> such a thing as a non-chunked NetCDF4 file? Or is that a contradiction in
> terms somehow?
No, not all netCDF-4 files are chunked. The simpler alternative, contiguous
layout, is better if you don't need compression, unlimited dimensions, or
the support for multiple patterns of access that chunking makes possible in
netCDF-4 files. A netCDF-4 variable can use contiguous layout if it doesn't
use an unlimited dimension or any sort of filter such as compression or
checksums.
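For concreteness, here's a minimal netCDF-Java sketch of a variable that
qualifies for contiguous layout: a fixed-size dimension and no filters. It
assumes the 4.x NetcdfFileWriter API (writing netCDF-4 from Java also needs
the native C library), and the file and variable names are just illustrative:

    import ucar.ma2.Array;
    import ucar.ma2.DataType;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    public class ContiguousExample {
      public static void main(String[] args) throws Exception {
        // Create a netCDF-4 format file; "obs.nc" is an illustrative name.
        NetcdfFileWriter writer =
            NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4, "obs.nc");
        // Fixed-size (not unlimited) dimension and no compression or checksum
        // filters, so the variable is eligible for contiguous storage.
        writer.addDimension(null, "time", 10000);
        Variable v = writer.addVariable(null, "value", DataType.DOUBLE, "time");
        writer.create();
        writer.write(v, Array.factory(DataType.DOUBLE, new int[] {10000}));
        writer.close();
      }
    }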
> Given that NetCDF4 readers are backwards-compatible with NetCDF3 files, is
> there any reason not to use a NetCDF3 file from your perspective? My
> suspicion is that our requirement is just being driven by "use the latest
> version" rather than any technical reasons.
I think I agree with you. With only one unlimited dimension, and if you
don't need the transparent compression that netCDF-4 makes possible,
there's no reason not to just use the default contiguous layout that a
netCDF-3 format file provides. However, you should still use the netCDF-4
library; just don't specify the netCDF-4 format when you create the file.
That's because the netCDF-4 software includes bug fixes, performance
enhancements, portability improvements, and remote access capabilities not
available in the old netCDF-3.6.3 version software.
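In netCDF-Java terms that's just a matter of which version constant you
pass when creating the writer; a sketch, again assuming the 4.x
NetcdfFileWriter API (file and dimension names are illustrative):

    import ucar.nc2.NetcdfFileWriter;

    // Current library, but classic (netCDF-3) format on disk:
    NetcdfFileWriter writer =
        NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf3, "records.nc");
    // Unlimited dimensions work fine in the classic format:
    writer.addUnlimitedDimension("record");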
The reason you were seeing a 7-fold increase in size is exactly as Ethan
pointed out: it's due to the way the HDF5 storage layer implements
unlimited dimensions, using chunking implemented with B-tree data
structures and indices, rather than the simpler contiguous storage used in
the classic netCDF format. The recent netcdf-4.3.2 version improves the
default chunking for 1-dimensional variables with an unlimited dimension,
as in your case, so it may be sufficient to provide both smaller files and
the benefits of netCDF-4 chunking, but without testing I can't predict how
close it comes to the simpler netCDF classic format in this case. Maybe I
can get time later today to try it ...
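If you do test netCDF-4 again, netCDF-Java also lets you pass an explicit
chunking strategy instead of relying on the defaults; a hedged sketch using
the Nc4Chunking classes from recent 4.x releases (the strategy and deflate
level here are just illustrative choices):

    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.write.Nc4Chunking;
    import ucar.nc2.write.Nc4ChunkingStrategy;

    // "standard" chunking heuristics, deflate level 0 (no compression),
    // no shuffle filter:
    Nc4Chunking chunker =
        Nc4ChunkingStrategy.factory(Nc4Chunking.Strategy.standard, 0, false);
    NetcdfFileWriter writer = NetcdfFileWriter.createNew(
        NetcdfFileWriter.Version.netcdf4, "records.nc", chunker);

Comparing the resulting file size against a classic-format file would show
how close the new defaults come.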
> I couldn't find anything on the NetCDF website regarding "choosing the
> right format for you". I was hoping there'd be something along those lines
> in the FAQ, but no luck.
The FAQ section on "Formats, Data Models, and Software Releases"
http://www.unidata.ucar.edu/netcdf/docs/faq.html
is intended to clarify the somewhat complex situation with multiple
versions of netCDF data models, software, and formats, but evidently it
doesn't help much in your case of choosing whether to use the default
classic netCDF format, the netCDF-4 classic model format, or the netCDF-4
format.
Thanks for pointing out the need for improving this section, and in
particular the answer to the FAQ "Should I get netCDF-3 or netCDF-4?",
which should really address the question "When should I use the netCDF
classic format?".
--Russ
> address@hidden> wrote:
>
> > Hi Jeff,
> >
> > How chunking and compression affect file size and read/write performance
> > is a complex issue. I'm going to pass this along to our chunking expert
> > (Russ Rew) who, I believe, is back in the office on Monday and should be
> > able to provide you with some better advice than I can give.
> >
> > In the meantime, here's an email he wrote in response to a conversation
> > on the effect of chunking on performance that might be useful:
> >
> >
> > http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2013/msg00498.html
> >
> > Sorry I don't have a better answer for you.
> >
> > Ethan
> >
> > Jeff Johnson wrote:
> > > Ethan-
> > >
> > > I made the changes you suggested with the following result:
> > >
> > > 10000 records, 8 bytes / record = 80000 bytes raw data
> > >
> > > original program (NetCDF4, no chunking): 537880 bytes (6.7x)
> > > file size with chunk size of 2000 = 457852 bytes (5.7x)
> > >
> > > So a little better, but still not good. I then tried different chunk
> > > sizes of 10000, 5000, 200, and even 1, which I would've thought would
> > > give me the original size, but all gave the same resulting file size
> > > of 457852.
> > >
> > > Finally, I tried writing more records to see if it's just a symptom of a
> > > small data set. With 1M records:
> > >
> > > 8MB raw data, chunk size = 2000
> > > 45.4MB file (5.7x)
> > >
> > > This is starting to seem like a lost cause given our small data records.
> > > I'm wondering if you have information I could use to go back to the
> > > archive group and try to convince them to use NetCDF3 instead.
> > >
> > > jeff
> >
>
>
> --
> Jeff Johnson
> DSCOVR Ground System Development
> Space Weather Prediction Center
> address@hidden
> 303-497-6260
>
>
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: BNA-191717
Department: Support netCDF
Priority: Normal
Status: Closed