This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi Henri,

> I have a 200GB uncompressed NetCDF file with 5 variables (+lat,lon,time) of
> ECMWF ERA-Interim data like this:
>
>   dimensions(sizes): lon(480), lat(241), time(99351)
>
> I need to access all time instants of the data, one gridpoint at a time.
> Unfortunately the data is organized inefficiently for this, and retrieving
> one slice takes 10 minutes or so. I have tried to rechunk the data with this
> command:
>
>   nccopy -w -c time/99351,lat/1,lon/1 all5.nc all5_T.nc
>
> but the processing has taken 9 days already (I have allocated 1 CPU and 250GB
> of memory to it). Is there some way to estimate how it's doing and how long
> this will take? I ran the same command with a test file of only 9 grid
> points, and estimated that if the process scaled perfectly, the full data
> would be finished in 2 days.
>
> Alternatively, is there some smarter way to do this? I suppose I should have
> done this in smaller pieces, but I'd hate to kill the process now if it's
> close to finishing.

You might want to read these blog posts, if you haven't already:

  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
  http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

You haven't mentioned whether the 200 GB source file is in netCDF-4 classic
model format with compression. That could make a big difference: with a small
chunk cache, you may be spending an enormous amount of time uncompressing the
same source chunks over and over again.

Even if the source data is not compressed, you probably need to specify a
chunk cache to make sure the same source data doesn't have to be reread from
disk repeatedly for each of the 480 x 241 points.

I would also advise using a different shape for the output chunks, something
more like time/10000,lat/10,lon/20, so that reading all the data for one
point takes 10 disk accesses instead of 1, which is probably still fast
enough. Such a shape also stores the data for 200 adjacent points together in
one chunk, so once a chunk is cached, nearby queries will be very fast after
the first.

Finally, I would advise just giving up on the current nccopy, which may well
take a year to finish! Spend a little time experimenting with some of the
advanced nccopy options, such as -w, -m, -h, and -e, which can make a
significant difference in rechunking time:

  http://www.unidata.ucar.edu/netcdf/docs/nccopy-man-1.html

What works best is platform-specific, but you may be able to get something
close to optimal by timing smaller examples. I'd be interested in knowing
what turns out to be practical!

--Russ

Russ Rew                                         UCAR Unidata Program
address@hidden                                   http://www.unidata.ucar.edu

Ticket Details
===================
Ticket ID: UAU-670796
Department: Support netCDF
Priority: Normal
Status: Closed
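For archive readers who want to try the advice above, here is a minimal
sketch of what the two steps might look like on the command line, reusing the
file names all5.nc and all5_T.nc from the question. The cache, element, and
buffer sizes shown are illustrative starting points for timing experiments on
a small test file, not tuned values.

  # First, check whether the source is a netCDF-4 file and whether its
  # variables are compressed and how they are chunked; -s shows the special
  # virtual attributes such as _ChunkSizes, _DeflateLevel, and _Storage.
  ncdump -k all5.nc
  ncdump -h -s all5.nc

  # Then rechunk with the suggested output chunk shape, a larger chunk
  # cache (-h), more cache elements (-e), a larger copy buffer (-m), and
  # diskless output writing (-w). Note that -w keeps the output in memory
  # until the file is closed, so it needs enough RAM to hold the whole
  # output file.
  nccopy -w -m 1G -h 16G -e 50000 \
         -c time/10000,lat/10,lon/20 all5.nc all5_T.nc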