[netCDF #AWT-862217]: nccopy chunking argument

Subject: [netCDF #AWT-862217]: nccopy chunking argument
Date: Thu, 19 Jan 2012 11:29:22 -0700

Mark,

> I've been looking closer at the cause of the second problem, and have a 
> hypothesis. When you look at how nccopy iterates through a variable when 
> making the copy (i.e. in up_start_by_chunks() in nciter.c), it goes in reverse 
> order of the dimensions, e.g. for CHL1_mean[date,lon,lat] it scans 
> through lat first, then lon, then date. However, this can be very memory 
> inefficient in the situation where you are trying to make the rearrangement 
> along the date dimension - you essentially have to load the entire file to 
> get enough data to write an entire date chunk....
> 
> I could see two solutions.
> 
> 1. automagically work out which dimension to scan in (hard to implement 
> robustly)
> 2. infer the scan direction from the -c argument i.e. if you only specify 
> date/5186 (and nothing else), and you have a variable with 
> date/1,lat/30,lon/30, then the most efficient way to rechunk it would be to 
> read along the date dimension first, then the lon and lats.....
> 
> Hmmm. I'm not sure that makes any sense - it's kind of hard to explain. Can 
> you follow my logic?
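
To make the scan order described above concrete, here is a minimal Python
sketch. It is not the code in nciter.c: the function name chunk_starts and the
dimension sizes are made up for illustration, and it only assumes that chunks
are visited with the last dimension varying fastest.

```python
from itertools import product


def chunk_starts(shape, chunks):
    """Return chunk start indices with the last dimension varying fastest."""
    ranges = [range(0, n, c) for n, c in zip(shape, chunks)]
    # itertools.product advances its rightmost argument fastest, which
    # matches a scan through lat first, then lon, then date.
    return list(product(*ranges))


# Hypothetical variable laid out as [date, lon, lat], chunked date/1,lon/30,lat/30.
starts = chunk_starts(shape=(3, 60, 60), chunks=(1, 30, 30))
for position, start in enumerate(starts):
    print(position, start)

# Position in the scan at which the last date slab is first reached.
print(starts.index((2, 0, 0)))
```

In this toy case the last date is not reached until two thirds of the way
through the scan, so an output chunk spanning every date cannot be completed
until most of the variable has been read.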

Yes, but I see some complications that make my head hurt.

If you want to rechunk a variable, it's not clear whether it's better
to read the input one input chunk at a time and write the output in an 
inefficient order, or to read the input in an inefficient order so that
you can write the output one output chunk at a time.

Currently the nc_next_iter() function in nciter.c does the former, but it 
sounds like you think it would be better if it did the latter. I think you
can construct examples where either strategy is efficient or horribly 
inefficient, depending on the shapes of chunks in the input and output 
files.
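
As a rough way to compare the two strategies, the Python sketch below counts
chunk accesses for each one. It is a simplified cost model and not anything
nccopy does: it assumes no chunk cache, so every overlap between an input
chunk and an output chunk costs one partial access. The function names and
the example sizes are hypothetical.

```python
from math import prod


def overlaps_along_dim(length, src, dst):
    """Count (src chunk, dst chunk) pairs that overlap along one dimension."""
    count = 0
    for start in range(0, length, src):
        stop = min(start + src, length)  # the last chunk may be partial
        count += (stop - 1) // dst - start // dst + 1
    return count


def rechunk_access_counts(shape, in_chunks, out_chunks):
    """Chunk accesses per strategy, assuming no caching at all."""
    n_in = prod(-(-n // c) for n, c in zip(shape, in_chunks))    # ceil division
    n_out = prod(-(-n // c) for n, c in zip(shape, out_chunks))
    # Chunks are boxes, so N-d overlaps are the product of per-dimension ones.
    pairs = prod(
        overlaps_along_dim(n, ci, co)
        for n, ci, co in zip(shape, in_chunks, out_chunks)
    )
    return {
        # Read each input chunk once, scatter pieces into output chunks.
        "by_input_chunk": {"reads": n_in, "writes": pairs},
        # Gather pieces from input chunks, write each output chunk once.
        "by_output_chunk": {"reads": pairs, "writes": n_out},
    }


# Hypothetical [date, lon, lat] variable: date/1 chunks rechunked to date/10.
print(rechunk_access_counts((10, 60, 60), (1, 30, 30), (10, 30, 30)))
```

For these made-up sizes the first strategy does 40 reads and 40 partial
writes, while the second does 40 reads and 4 whole-chunk writes. Other chunk
shapes tip the balance the other way, and a real comparison would also have
to account for the chunk cache and for compression.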

I think the right thing to do would be to determine, from the chunk shapes
of input and output, which strategy to use, or even whether to use
a hybrid strategy involving multiple passes and an intermediate file or
in-memory structure.  I tried to find out whether this research has
already been done, but couldn't find a paper that provided a clear solution.

Maybe it's easier than I'm making it out to be, and there's a clear and simple
solution.  If so, I'd like to implement it!

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: AWT-862217
Department: Support netCDF
Priority: Normal
Status: Closed
