John,

> I assume that bit shifting and comparison, etc. is fast. In IEEE floating
> point, you use "illegal" exponent values to signal exceptional numbers. So
> we need to decide on an exponent scheme.

I think a simple biased-exponent scheme in which the last exponent (all 1 bits) is used for exceptional values such as the _FillValue might be OK. Unfortunately, if there is a _FillValue, it requires devoting one of a small number of exponent values to representing just one special value. This is somewhat wasteful of space when the exponent range is an exact power of 2, since it then requires an extra bit that is used only for this special value. But it has the virtue of simplicity, and I'm willing to see whether it yields adequate packing in most cases.

> > Your idea of compressing the exponents (and perhaps the sign bits)
> > separately from the mantissas is a good one, but in my opinion we can't
> > use it for netCDF data access. As you have pointed out, it makes direct
> > access to a single unpacked value depend on first unpacking all the
> > values for a variable. Similarly, it makes writing values in a
> > different order from the order in which they are stored or from the
> > order in which they will be read difficult. It also imposes
> > requirements for large memory allocations on applications that may only
> > need one value for each slice of a variable with very large slices.
>
> I'm not convinced about this yet. I would say that the current, unpacked
> design is a reasonable way to allow "direct access to a single value" and
> "writing values in a different order from the order in which they are
> stored", etc. Now what's motivating packing? Basically, certain efficiency
> considerations for very large datasets. So the design needs to answer
> those efficiency considerations, but not necessarily all the original
> design goals need to remain intact. Ideally, we'd like to have some
> options that allow these tradeoffs to be made by the application layer.

I guess I disagree. A fundamental characteristic of the netCDF API is the ability to read general array cross sections, so that small subsets of large datasets can be accessed efficiently. This is used in data visualization applications supported by netCDF and was the main reason we chose to implement netCDF data access in terms of direct access (seeks) instead of sequential access (read next). Giving up this feature is too high a price to pay for the addition of (semi-)transparent packing.

Ideally, applications that read netCDF data should not have to know whether the data is packed or not. That should be determined by the writer at variable definition time, but an application that wants to read a small subset of a large dataset should not pay the penalty of much slower access, or of having to malloc a Gbyte of space, just because the data happens to have been packed. The access time and memory space used to read a small subset of data may depend on the order in which the data is written (which the reader can predict from the variable declaration), but they should be proportional to the size of the subset requested, not the size of the dataset out of which it is extracted.
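To make the biased-exponent idea above concrete, here is a minimal sketch of such a packing in which the all-ones exponent is reserved for the _FillValue. The field widths (1 sign bit, 5 exponent bits, 10 mantissa bits), the bias of 15, and the names pack16/unpack16 are illustrative assumptions, not part of any netCDF design; note how an entire exponent code is spent on the single fill value, which is the cost discussed above.

    /* A sketch of a biased-exponent small-float packing: 1 sign bit,
     * 5 exponent bits (bias 15), 10 mantissa bits.  The all-ones exponent
     * is reserved for the _FillValue; there are no denormals, NaNs, or
     * infinities in this illustration. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define EXP_BITS    5
    #define MANT_BITS   10
    #define EXP_BIAS    15
    #define EXP_FILL    ((1u << EXP_BITS) - 1)          /* all-ones exponent */
    #define PACKED_FILL ((uint16_t)(EXP_FILL << MANT_BITS))
    #define MANT_MASK   ((1u << MANT_BITS) - 1)

    static uint16_t pack16(float x, float fill)
    {
        if (x == fill)
            return PACKED_FILL;                         /* reserved encoding */

        unsigned sign = (x < 0.0f);
        float ax = fabsf(x);
        if (ax == 0.0f)
            return (uint16_t)(sign << (EXP_BITS + MANT_BITS));

        int e;
        float frac = frexpf(ax, &e);      /* ax = frac * 2^e, frac in [0.5, 1) */
        int be = e + EXP_BIAS;            /* biased exponent */
        if (be <= 0)                      /* too small for the format: flush to zero */
            return (uint16_t)(sign << (EXP_BITS + MANT_BITS));
        if (be >= (int)EXP_FILL)          /* too large: clamp to largest finite value */
            return (uint16_t)((sign << (EXP_BITS + MANT_BITS)) |
                              ((EXP_FILL - 1) << MANT_BITS) | MANT_MASK);

        /* drop the implicit leading 1 bit; truncate extra mantissa bits */
        unsigned mant = (unsigned)(frac * (1 << (MANT_BITS + 1))) & MANT_MASK;
        return (uint16_t)((sign << (EXP_BITS + MANT_BITS)) |
                          ((unsigned)be << MANT_BITS) | mant);
    }

    static float unpack16(uint16_t p, float fill)
    {
        unsigned be   = (p >> MANT_BITS) & EXP_FILL;
        unsigned mant = p & MANT_MASK;
        float sign    = (p >> (EXP_BITS + MANT_BITS)) ? -1.0f : 1.0f;

        if (be == EXP_FILL)               /* reserved exponent marks the fill value */
            return fill;
        if (be == 0)                      /* zero */
            return sign * 0.0f;
        /* restore the implicit leading 1 bit and undo the bias */
        float frac = (float)(mant | (1u << MANT_BITS)) / (1 << (MANT_BITS + 1));
        return sign * ldexpf(frac, (int)be - EXP_BIAS);
    }

    int main(void)
    {
        const float fill = -999.0f;
        float vals[] = { 0.0f, 1.0f, -3.14159f, 1013.25f, fill };
        for (int i = 0; i < (int)(sizeof vals / sizeof vals[0]); i++) {
            uint16_t p = pack16(vals[i], fill);
            printf("%10g -> 0x%04x -> %10g\n", vals[i], (unsigned)p, unpack16(p, fill));
        }
        return 0;
    }

In this sketch, values too large for the format are clamped to the largest representable magnitude and values too small are flushed to zero; a real design would also have to decide how to handle those cases.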
> It's instructive to consider what the CCM2 people have now, and compare
> that to possible netCDF extensions. They have certainly given up "direct
> access to a single value" and "writing values in a different order from
> the order in which they are stored". Indeed, they have probably optimized
> writing datasets as opposed to reading them (I am guessing at the read
> access pattern), which is probably the wrong thing to do from a long-term
> point of view (assume many reads per write).
>
> Anyway, they use "latitude slices" as their basic array, which is
> nlon x nlev. Then they have a separate scale/offset for each level, so
> they pack nlon values at a time. nlon is typically 128. The read accesses
> that seem most common to me are horizontal slices (nlon x nlat) or zonal
> slices (nlat x nlev). So both read accesses need a lot more disk accesses
> than optimal, though in the first case you don't actually have to unpack
> anything you don't need. Also, they have all fields for one lat slice
> together, and I assume that read accesses more often deal with one field
> over the entire volume, at least for visualization. However, it may be a
> good way to do it for scientific processing, where you need to calculate
> using many of the fields at a certain location.
>
> So you might imagine a design that gives CCM2 the ability to store data
> in the way they are already doing it, so that the current read and write
> (in)efficiencies are preserved, but with the advantages of the netCDF API
> and machine independence.

And that should be possible by specifying the order of dimensions and the packing parameters (number of bits of precision, range of values) for each netCDF variable when it is defined. But I'm not convinced that the benefits of the extra compression you get by giving up the ability to read small subsets of the data efficiently are worth the cost. If this means the CCM2 can't use netCDF because its storage scheme is not optimized enough for that particular application, its developers will have to live with the other trade-offs involved in developing an application-specific interface and format that is more optimally suited to that purpose.

If, however, the problem you have described for the CCM2 is just an example of a more general problem that occurs over and over in scientific data access, I might be convinced of the importance of developing a solution that favors optimal compression at the expense of read-access performance. Currently, I'm leaning toward preserving this fundamental property of netCDF access for the thousands of other users who are satisfied with even the current meager packing capabilities.

> More generally, you might imagine some way to allow users to specify
> implementation (storage strategy), with no change in API (obvious change
> in efficiency of API), with high-level tools to reorganize a file without
> the interface looking any different (like DB managers do). I assume that
> will be the thrust of the HDF implementation of the "data chunking" idea.
> Data compression might fall into this category.

I'll have to think about this. DB managers generally base storage reorganizations that preserve the database schema on usage and access patterns, because the patterns and frequencies of database transactions and queries can't necessarily be anticipated when a database is created. This is also true of scientific data, but caching and multi-level storage hierarchies have been the main way of addressing it in the past. This may be a laudable goal, but it seems like it would be difficult to implement.
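As a concrete illustration of the per-level scale/offset packing described for the CCM2 above, the sketch below packs each nlon-long level of a latitude slice into unsigned 16-bit integers with its own scale and offset. The 16-bit width, the function names, and the synthetic data are assumptions made for illustration only; this is not the actual CCM2 format.

    /* A sketch (not the CCM2 format) of per-level scale/offset packing:
     * each nlon-long level of an nlon x nlev latitude slice is packed into
     * unsigned 16-bit integers using its own scale and offset. */
    #include <float.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NSTEPS 65535               /* largest 16-bit code; packed values are 0..NSTEPS */

    /* Pack one level: out[i] = round((in[i] - offset) / scale). */
    static void pack_level(const float *in, int nlon,
                           uint16_t *out, float *scale, float *offset)
    {
        float lo = FLT_MAX, hi = -FLT_MAX;
        for (int i = 0; i < nlon; i++) {
            if (in[i] < lo) lo = in[i];
            if (in[i] > hi) hi = in[i];
        }
        *offset = lo;
        *scale = (hi > lo) ? (hi - lo) / NSTEPS : 1.0f;
        for (int i = 0; i < nlon; i++) {
            long q = lroundf((in[i] - *offset) / *scale);
            if (q < 0) q = 0;          /* guard against float round-off */
            if (q > NSTEPS) q = NSTEPS;
            out[i] = (uint16_t)q;
        }
    }

    /* Unpack one level: value = packed * scale + offset. */
    static void unpack_level(const uint16_t *in, int nlon,
                             float *out, float scale, float offset)
    {
        for (int i = 0; i < nlon; i++)
            out[i] = in[i] * scale + offset;
    }

    int main(void)
    {
        enum { NLON = 128 };           /* typical nlon mentioned above */
        float level[NLON], restored[NLON], scale, offset, maxerr = 0.0f;
        uint16_t packed[NLON];

        for (int i = 0; i < NLON; i++) /* synthetic data for one level */
            level[i] = 250.0f + 30.0f * sinf(0.05f * i);

        pack_level(level, NLON, packed, &scale, &offset);
        unpack_level(packed, NLON, restored, scale, offset);

        for (int i = 0; i < NLON; i++) {
            float err = fabsf(restored[i] - level[i]);
            if (err > maxerr) maxerr = err;
        }
        printf("scale=%g offset=%g max error=%g (bound scale/2=%g)\n",
               scale, offset, maxerr, 0.5f * scale);
        return 0;
    }

With this layout, reading a horizontal or zonal slice has to touch many packed rows and their scale/offset pairs, which is the extra-disk-access cost noted above, and the worst-case quantization error for a row is about half its scale.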
> Or perhaps netCDF should stay lean and clean, and these complexities be
> implemented in a larger system like HDF, which seems to have a lot of
> funding? I don't know what your vision of netCDF is, and its relation to
> other systems. With HDF having a netCDF interface, one could argue that
> large datasets should move to HDF, and netCDF remain a small,
> understandable system.
>
> > I'm still hoping we can work out the details of a packed floating-point
> > representation such as you have suggested, because I think it's
> > superior to my idea of using arrays of scales and offsets. Please let
> > me know if you have any other thoughts on this.
>
> Perhaps you could give me a thumbnail sketch of your "array of scales and
> offsets" design, so I can think about it concretely. I remain undecided
> as to the advantages of scale and offset vs. small floating point.

I'll have to send this tomorrow.

--Russ
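For readers unfamiliar with the scale-and-offset style of packing being weighed against small floating point here, the sketch below shows the widely used netCDF scale_factor/add_offset attribute convention, with a single scale and offset per variable stored alongside short integer data. It is written against the current netCDF-C interface; the file name, variable name, and parameter values are invented, and this is not the per-slice "array of scales and offsets" design that Russ defers to a later message.

    /* A sketch of the netCDF scale_factor/add_offset attribute convention:
     * one scale and offset per variable, data stored as 16-bit shorts.
     * File name, variable name, and values are invented for illustration. */
    #include <math.h>
    #include <netcdf.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHECK(s) do { int _e = (s); if (_e != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(_e)); exit(1); } } while (0)

    int main(void)
    {
        enum { NLON = 128 };
        int ncid, lon_dim, t_var;
        short packed[NLON];
        float scale = 0.01f, offset = 273.15f;   /* chosen by the writer */

        /* packed = round((unpacked - add_offset) / scale_factor) */
        for (int i = 0; i < NLON; i++)
            packed[i] = (short)lroundf((280.0f + 0.1f * i - offset) / scale);

        CHECK(nc_create("packed_example.nc", NC_CLOBBER, &ncid));
        CHECK(nc_def_dim(ncid, "lon", NLON, &lon_dim));
        CHECK(nc_def_var(ncid, "T", NC_SHORT, 1, &lon_dim, &t_var));
        CHECK(nc_put_att_float(ncid, t_var, "scale_factor", NC_FLOAT, 1, &scale));
        CHECK(nc_put_att_float(ncid, t_var, "add_offset", NC_FLOAT, 1, &offset));
        CHECK(nc_enddef(ncid));
        CHECK(nc_put_var_short(ncid, t_var, packed));
        CHECK(nc_close(ncid));
        return 0;
    }

A reader recovers approximate values as packed * scale_factor + add_offset.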