Hi Keith,

> > We don't have an explanation for this.  I just checked with Glenn Davis,
> > who implemented it, and he agrees that it should take essentially the
> > same amount of time to access a value at the beginning of each record as
> > at the end.  If you have a small test case that demonstrates otherwise,
> > we'd be interested, because it would either indicate a bug or a behavior
> > of lseek() we don't understand.  The offset for each record variable
> > within a record is calculated once and stored in the header with other
> > information about that variable, so the only difference between the two
> > situations should be adding a zero offset vs. adding a nonzero offset to
> > the record offset before the seek.
>
> We've been running more tests on a variety of file sizes and can no
> longer duplicate this phenomenon.  We're definitely at a loss to
> explain this.  We thought a passage in the netCDF manual might be
> referring to the ordering effect we saw (on p. 138):
>
>   "The order in which the data in the fixed-size data part and in each
>   record appears is the same as the order in which the variables were
>   defined, in increasing numerical order by netCDF variable ID.  This
>   knowledge can sometimes be used to enhance data access performance,
>   since the best data access is currently achieved by reading or writing
>   the data in sequential order."
>
> But, now we're not sure why we saw an ordering performance difference
> before, but none now.  What does this passage really mean?

It's referring to the fact that if you read the data in the same order it's
written on the disk, you can take advantage of the read-ahead done in
systems like UNIX: when you read a disk block, the system reads the next
block into a memory buffer so it can be accessed quickly if you are reading
sequentially.  Also, the netCDF library doesn't make unnecessary lseek()
calls if it notices it's already at the right offset in a file for a
requested read, which is the case when reading the data in the same order
in which it was written.  On the other hand, if you were to read the
variables in the reverse of the order in which they're written, you would
end up doing an lseek() call before each read, flushing the read-ahead
cache, and getting no benefit from the read-ahead buffers provided by the
operating system.  Whether these differences are actually significant
depends on many things, including the record size and data type.

> > Our understanding is that there should be no need for creating smaller
> > files, and that caching the header should be enough to get the performance
> > you want.  We'd be interested in hearing about your progress in diagnosing
> > this problem, especially if it indicates a problem with netCDF performance
> > that we can't currently explain.
>
> We prefer using big files as it reduces the number of files to open and
> keep track of.  We'll keep you posted if we figure out any more on what
> is happening with the juke box.  Thanks.

One other thought I had was that you might be linking to the HDF/netCDF
software (-lmfhdf) from NCSA rather than the Unidata XDR-based netCDF
library (-lnetcdf).  The mfhdf library uses a completely different I/O
implementation, and that might explain the performance differences you are
seeing.

--Russ
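
[Editor's note: the following is a minimal, hypothetical C sketch of the access
pattern described above -- reading record variables record by record, in
increasing variable-ID order, so the reads follow the on-disk layout and benefit
from operating-system read-ahead.  It assumes a classic-format file whose record
variables are 1-D doubles defined only over the unlimited dimension, and it omits
most error handling; it is not code from the original correspondence.]

    /* Read record variables in on-disk order: outer loop over records,
     * inner loop over variables in increasing varid order. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    int main(int argc, char **argv)
    {
        int ncid, nvars, unlimdimid, status;
        size_t numrecs;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file.nc\n", argv[0]);
            return 1;
        }
        if ((status = nc_open(argv[1], NC_NOWRITE, &ncid)) != NC_NOERR) {
            fprintf(stderr, "%s\n", nc_strerror(status));
            return 1;
        }
        nc_inq_nvars(ncid, &nvars);
        nc_inq_unlimdim(ncid, &unlimdimid);
        if (unlimdimid == -1) {            /* no record dimension */
            nc_close(ncid);
            return 0;
        }
        nc_inq_dimlen(ncid, unlimdimid, &numrecs);

        for (size_t rec = 0; rec < numrecs; rec++) {
            for (int varid = 0; varid < nvars; varid++) {
                int ndims, dimids[NC_MAX_VAR_DIMS];
                nc_inq_varndims(ncid, varid, &ndims);
                nc_inq_vardimid(ncid, varid, dimids);
                if (ndims != 1 || dimids[0] != unlimdimid)
                    continue;              /* skip fixed-size variables */
                size_t start[1] = { rec }, count[1] = { 1 };
                double value;
                nc_get_vara_double(ncid, varid, start, count, &value);
                /* ... use value ... */
            }
        }
        nc_close(ncid);
        return 0;
    }

Reading in the reverse variable order within each record would force a seek
before every read, which is the situation the reply above warns about.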