[netCDF #UAU-670796]: Rechunking of a huge NetCDF file
- Subject: [netCDF #UAU-670796]: Rechunking of a huge NetCDF file
- Date: Tue, 28 Jan 2014 10:02:55 -0700
Hi Henri,
> I have a 200GB uncompressed NetCDF file with 5 variables (+lat,lon,time) of
> ECMWF ERA-Interim data like this:
>
> dimensions(sizes): lon(480), lat(241), time(99351)
>
> I need to access all time instants of the data, one gridpoint at a time.
> Unfortunately the data is organized inefficiently for this, and retrieving
> one slice takes 10 minutes or so. I have tried to rechunk the data with this
> command:
>
> nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
>
> but the processing has taken 9 days already (I have allocated 1 CPU and 250GB
> of memory to it). Is there some way to estimate how it’s doing and how long
> this will take? I ran the same command with a test file of only 9 grid
> points, and estimated that if the process scaled perfectly, the full data
> would be finished in 2 days.
>
> Alternatively, is there some smarter way to do this? I suppose I should have
> done this in smaller pieces, but I’d hate to kill the process now if it’s
> close to finishing.
You might want to read these blog posts, if you haven't already:
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
You haven't mentioned whether the 200 GB source file is in netCDF-4
classic model format with compression. That could make a big
difference: with a small chunk cache, you may be spending an enormous
amount of time uncompressing the same source chunks over and over
again.
Even if the source data is not compressed, you probably need to
specify a chunk cache so that the same source data doesn't have to be
reread from disk for each of the 480x241 points. I would also advise
using a different shape for the output chunks, something more like
time/10000,lat/10,lon/20. With that shape you get the data for one
point with 10 disk accesses instead of 1, which is probably still fast
enough. Such a shape also stores data for 200 adjacent points together
in each chunk, so if the chunks are cached, nearby queries will be very
fast after the first.
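The arithmetic behind those two chunk shapes can be checked with a short Python sketch. The dimension sizes come from the question; the helper name chunk_stats and the 4 bytes per value (32-bit floats) are assumptions made only for illustration, so the byte counts should be rescaled if the variables use another type.

```python
import math

# Dimension sizes taken from the question: lon(480), lat(241), time(99351).
DIMS = {"time": 99351, "lat": 241, "lon": 480}


def chunk_stats(chunk, dims=DIMS, bytes_per_value=4):
    """Summarize a chunk shape for one variable.

    bytes_per_value=4 is an assumption (32-bit floats), not something
    stated in the original question.
    """
    # Number of chunks needed along each dimension (last one may be partial).
    per_dim = {name: math.ceil(dims[name] / chunk[name]) for name in dims}
    return {
        # Chunks that must be read to get the full time series at one point.
        "chunks_per_point": per_dim["time"],
        # Grid points whose data share a chunk.
        "points_per_chunk": chunk["lat"] * chunk["lon"],
        # Uncompressed size of one full chunk.
        "chunk_bytes": math.prod(chunk.values()) * bytes_per_value,
        # Total chunks per variable.
        "total_chunks": math.prod(per_dim.values()),
    }


original = chunk_stats({"time": 99351, "lat": 1, "lon": 1})
suggested = chunk_stats({"time": 10000, "lat": 10, "lon": 20})

for label, stats in (("original", original), ("suggested", suggested)):
    print(label, stats)
```

Under these assumptions the suggested shape needs 10 chunk reads per point, each chunk holding 200 grid points in about 8 MB, while the original request produces 115680 chunks per variable, each covering a single grid point.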
I would also advise just giving up on the current nccopy run, which may
well take a year to finish! Spend a little time experimenting with
some of the advanced nccopy options, such as -w (diskless output),
-m (copy buffer size), -h (chunk cache size), and -e (number of chunk
cache elements), which could make a significant difference in
rechunking time:
http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html
What works best is platform-specific, but you may be able to get
something close to optimum by timing with smaller examples. I'd be
interested in knowing what turns out to be practical!
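As a starting point for that kind of experiment, here is a minimal Python sketch that assembles an nccopy command line from a chunk shape and cache settings. The helper name nccopy_command is hypothetical, and the default values for -h, -e, and -m are placeholders to be tuned by timing, not recommendations; only the option letters and the file names come from this thread.

```python
import shlex


def nccopy_command(src, dst, chunks, cache="4G", cache_elems=1009,
                   copy_buf="1G", diskless=False):
    """Build an nccopy argument list for rechunking src into dst.

    The cache, cache_elems and copy_buf defaults are placeholder values
    chosen for illustration; tune them by timing on a smaller file.
    """
    # -c expects dim/size pairs separated by commas, e.g. time/10000,lat/10
    chunkspec = ",".join(f"{name}/{size}" for name, size in chunks.items())
    cmd = ["nccopy"]
    if diskless:
        cmd.append("-w")  # keep the output in memory until it is closed
    cmd += [
        "-c", chunkspec,
        "-h", cache,             # chunk cache size
        "-e", str(cache_elems),  # number of elements in the chunk cache
        "-m", copy_buf,          # copy buffer size
        src, dst,
    ]
    return cmd


cmd = nccopy_command("all5.nc", "all5_T.nc",
                     {"time": 10000, "lat": 10, "lon": 20})
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run and wrapped in a timer, so that several combinations of settings can be compared on a small subset of the data before committing to the full 200 GB file.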
--Russ
Russ Rew UCAR Unidata Program
address@hidden http://www.unidata.ucar.edu
Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: Normal
Status: Closed