This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Tushar Mohan <address@hidden> writes:

> Hi Ed,
>
> I've been probing the performance issue in mppnccombine. The program
> combines records across multiple netCDF input files into a single
> output file.

Interesting! I am forwarding this to the HDF5 programmers for some
attention. I note that the latest HDF5 is 1.8.4, not 1.8.2, so you
might want to give that a try...

> My experiments seem to show that the performance drops off due to
> excessive TLB misses (thrashing of some sort) on the
> Montecito/Montvale chips (Itanium-64) in certain HDF5 functions,
> notably H5I_register. This happens while processing certain input
> files in the HDF5 format -- v3-classic input files do not show this
> problem. The output file format used (v3-classic or HDF5) makes no
> difference in performance. The problem shows up, however, at the
> point when the output file hits the 2 GB mark, and remains from then
> on. We are using a netCDF chunk size of 64K. Worth noting is that if
> certain input files are left out of the combine, the problem doesn't
> show up at all. I've tried with the latest HDF5 release, 1.8.2, and
> netcdf-4.0.1. Snapshot development versions of netCDF, while slightly
> better in performance than the 4.0.1 used for this experiment,
> exhibit a similar performance profile: the performance-challenged
> runs are overwhelmingly dominated by time spent in a few HDF5
> functions.
>
> Using a performance tool, hpcrun (part of HPCToolkit), I was able to
> obtain a statistical profile of total cycles and TLB misses for a
> partial run. The total cycles (or time spent) closely tracked the TLB
> misses, and leapt exponentially once the output size hit the 2 GB
> mark. I'm attaching the pruned profile. The full one is a few MB, and
> is at:
>
> http://www.samaratechnologygroup.com/pub/mppnccombine-hdf5-profile.txt.gz
>
> It seems that a region of code performing a pointer traversal is
> exhibiting poor locality and is causing thrashing in the TLB.
> If you study the H5I_register code in the attached profile, you'll
> see that when an "excessive" number of objects are registered using
> H5I_register, an extremely expensive operation is performed to guard
> against duplicate IDs being parceled out. This test, a dozen lines of
> source code, consumes most of the application runtime.
>
> In fact, it's conceivable that the 2 GB point is a red herring: it
> may simply be that the number of objects being registered triggers
> the "duplicate ID check", and this happened to coincide with the 2 GB
> point in the output file. My observation that certain input files
> cause the problem may be explained by their having enough objects to
> register that the "duplicate ID check" code is triggered, although I
> don't understand enough of the file contents to confirm this
> hypothesis.
>
> This problem seems to be in the province of the HDF5 developers;
> however, I thought I'd check with the netCDF team in case you have a
> clue as to whether certain parameters used for HDF5 by the netCDF
> library can affect the performance. While the code did not seem to
> imply so, it's possible that increasing the HDF5 type hash size may
> help, for example.
>
> In the attached profile, the first column of percentages is CPU
> cycles, shown as a percentage of total application CPU time. The
> second column shows the percentage of D-TLB misses.
>
> If you know of anybody on the HDF5 developer list who might be able
> to help, I'd appreciate it if you could include them in the mail
> thread.
>
> Please contact me if you have any questions, and thanks in advance
> for your help.
>
> Regards,
> Tushar
>
> On Wed, Dec 9, 2009 at 12:10 AM, Ed Hartnett <address@hidden> wrote:
>> "V. Balaji" <address@hidden> writes:
>>
>>> Hi Ed, one of our developers has noticed interesting (disturbing)
>>> behaviour in one of our homegrown netCDF tools.
>>>
>>> I don't want to drag you into the melee by ccing you into this
>>> group, but I wonder if the sudden performance cliff at 2 GB rings
>>> any bells for you or colleagues at Unidata, either in terms of
>>> changes to libnetcdf or changes to the way we're invoking it.
>>>
>>> Thanks,
>>
>> Howdy all!
>>
>> I have read Jeff's description of the problem, and a few facts may
>> help clarify the situation...
>>
>> * Certainly you should be testing with the netCDF snapshot release;
>>   it has some performance improvements:
>>   ftp://ftp.unidata.ucar.edu/pub/snapshot/netcdf-4-daily.tar.gz
>>
>> * When creating netCDF-4 (and netCDF-4 classic model) files, the
>>   output should be in HDF5 format. One way to tell is to look at the
>>   first 4 bytes of the file: a netCDF-4 file will start with "HDF".
>>   If you see "CDF" instead, then you are not creating a netCDF-4
>>   file. (Check your nc_create call - by default netCDF-4 still
>>   produces classic format files, not netCDF-4/HDF5 files.) From what
>>   Jeff says, it seems that you are not actually using netCDF-4/HDF5
>>   files:
>>
>>   "My very last experiment showed that the output of the 4.0.1
>>   mppnccombine produces by default a file that does not seem to be
>>   in the hdf5 format (or at least has a format different from the
>>   *.000? files). How did I deduce that? When you do "od -a -N 10"
>>   on the output files, you will see "CDF ...", which is the format
>>   for the netcdf classic or the netcdf 64-bit offset format, but is
>>   different from the format if you do "od" on the .000* files,
>>   which show up as "hdf5..."."
>>
>> * Performance will be the same for netCDF-4 files with or without
>>   the classic model turned on. That affects what you can add to the
>>   file, but not how (and how fast) data are read or written.
>>
>> * NetCDF-4 can easily handle files and variables larger than 2 GB.
>>   The NC_CLASSIC_MODEL flag doesn't matter for this.
>>
>> * Chunk sizes are an important consideration. Chunk sizes are chosen
>>   by default if you don't specify them, and the defaults work pretty
>>   poorly for larger variables. (But it seems that you are not
>>   producing netCDF-4 files anyway; see above.)
>>
>> Please let me know if this doesn't help. You should not see any
>> serious performance problems with netCDF-4 if we are doing
>> everything right...
>>
>> Thanks,
>>
>> Ed
>>
>> --
>> Ed Hartnett -- address@hidden

--
Ed Hartnett -- address@hidden
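The "duplicate ID check" cost described in the thread is a classic
linear-scan blow-up: checking each new ID against all n previously
registered IDs costs O(n) per registration, O(n^2) overall. The toy
sketch below is illustrative only; it is not HDF5's actual H5I code,
and both function names are invented here. It contrasts that scan
with a hash-based membership test:

```python
def register_with_scan(ids, new_id):
    """Linear duplicate check: O(n) per registration, O(n^2) overall.
    'ids' is a plain list of previously registered IDs."""
    for existing in ids:          # scan every registered ID
        if existing == new_id:
            raise ValueError("duplicate ID")
    ids.append(new_id)

def register_with_set(ids, new_id):
    """Hash-based duplicate check: O(1) expected per registration.
    'ids' is a set of previously registered IDs."""
    if new_id in ids:
        raise ValueError("duplicate ID")
    ids.add(new_id)
```

With many registered objects, the first version spends essentially
all of its time in the scan, which matches the profile's observation
that a dozen lines of checking code dominate the runtime.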
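The magic-byte test Ed describes can be scripted. The sketch below is
not part of the original thread; it classifies a file by its leading
bytes and assumes the HDF5 signature sits at offset 0, which holds
for files written by the netCDF-4 library (the HDF5 format also
allows the signature at offsets 512, 1024, and so on):

```python
def classify_netcdf(path):
    """Rough classification of a netCDF file by its magic bytes."""
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic.startswith(b"CDF\x01"):
        return "netCDF classic"
    if magic.startswith(b"CDF\x02"):
        return "netCDF 64-bit offset"
    if magic.startswith(b"\x89HDF\r\n\x1a\n"):
        return "netCDF-4/HDF5"
    return "unknown"
```

This is the same check as "od -a -N 10": classic and 64-bit offset
files begin with "CDF", while netCDF-4/HDF5 files begin with the
8-byte HDF5 signature.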
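On the chunk-size point: the 64K figure mentioned earlier in the
thread can be turned into a simple sizing rule. The helper below is a
hypothetical heuristic, not netCDF's actual default-chunking
algorithm: given a variable's dimension lengths and element size, it
halves the slowest-varying dimensions first until one chunk fits
under a target byte count:

```python
from math import prod

def suggest_chunks(dim_lens, elem_size, target_bytes=64 * 1024):
    """Shrink dimensions, slowest-varying first, until a single chunk
    occupies at most target_bytes. Crude and illustrative only."""
    chunks = list(dim_lens)
    for i in range(len(chunks)):
        # Halve this dimension while the chunk is still too large.
        while chunks[i] > 1 and elem_size * prod(chunks) > target_bytes:
            chunks[i] = chunks[i] // 2
        if elem_size * prod(chunks) <= target_bytes:
            break
    return tuple(chunks)
```

For example, a 1000 x 500 x 500 variable of 4-byte values would get
its record dimension collapsed first, keeping whole rows of the
fastest-varying dimension together in each chunk.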