This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hello David,

Thank you for the comprehensive description of the issue, and the proposed solution! After consulting with Russ, I have created a ticket for this in our JIRA system, https://bugtracking.unidata.ucar.edu/browse/NCF-337, and am going to try to integrate the fix before the next netCDF release. We're currently preparing for our annual Python workshop, being held next week, but I will be able to turn my attention to this shortly thereafter.

The fix seems pretty straightforward. The only complication will be how to test it, since the issue seems difficult to trigger, but I'm sure I can come up with something. We also don't have access to Cray hardware or a Lustre file system, but as you point out this is not limited to that environment.

Thanks again for the comprehensive information!

Have a great day,

-Ward

> Full Name: David Knaak
> Email Address: address@hidden
> Organization: Cray Inc.
> Package Version: 4.3.3.1
> Operating System:
> Hardware:
> Description of problem: This ticket is directly related to these tickets:
>
> 08 Apr 2015
> [netCDF #KZJ-320086]: Short read are not managed?
> http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg13072.html
>
> 23 Mar 2015
> [netCDF #PDZ-683250]: Short read are not managed?
> http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg13053.html
>
> Cray has analyzed the short read situation. We believe we understand
> the problem and have a proposed fix for NetCDF.
>
> Due to a combination of factors, this short read issue has shown up on
> Cray systems with Lustre file systems. But this issue is not limited to Cray
> systems nor is it limited to Lustre file systems. I will start with the
> specifics for Cray and Lustre but will then generalize it.
>
> The first factor is that a major change introduced with Lustre 2.5 has
> caused behavior that is, by POSIX standards, legal, but is not the intended
> Lustre behavior.
> The behavior is that a race condition can occur in Lustre
> that sometimes causes a read request to be only partially satisfied with
> a single read call. This race condition is more likely to occur on large
> and very busy file systems but could occur on any Lustre 2.5 file system.
> Technically speaking, this is not a bug because POSIX semantics allows
> this (see below). But this is not the intended behavior of Lustre and
> Lustre will be modified in a future release so that this does not happen.
>
> The second factor is that not all programs and libraries properly handle
> the case of a short POSIX read or POSIX write. This is the case with UCAR
> NetCDF when the creation mode is NC_CLASSIC_MODEL. It may also be the
> case in other libraries and many user programs that are not properly coded.
>
> In general, if a program calls read or write without checking for the
> number of bytes actually transferred and reading again if necessary
> then the program is exposed to the issue. POSIX does not guarantee
> that a single read call will read all of the bytes requested or that a
> single write call will write all of the bytes requested. Quoting from
> opengroup.org:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html
>
>     Upon successful completion, where nbyte is greater than 0, read() shall
>     mark for update the st_atime field of the file, and shall return the
>     number of bytes read. This number shall never be greater than nbyte. The
>     value returned may be less than nbyte if the number of bytes left in
>     the file is less than nbyte, if the read() request was interrupted by a
>     signal, or if the file is a pipe or FIFO or special file and has fewer
>     than nbyte bytes immediately available for reading. For example, a read()
>     from a file associated with a terminal may return one typed line of data.
>
>     If a read() is interrupted by a signal before it reads any data, it
>     shall return -1 with errno set to [EINTR].
>
>     If a read() is interrupted by a signal after it has successfully read
>     some data, it shall return the number of bytes read.
>
> The issue for POSIX write is essentially the same. See:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
>
> So if a read returns some but not all bytes, read should be called again.
> The code in libsrc/posixio.c shows that for the NC_CLASSIC_MODEL path,
> read is not called again if there is a short read:
>
>     errno = 0;
>     nread = read(nciop->fd, vp, extent);
>     if(nread != (ssize_t) extent)
>     {
>         status = errno;
>         if(nread == -1 || status != ENOERR)
>             return status;
>         /* else it's okay we read less than asked for */
>         (void) memset((char *)vp + nread, 0, (ssize_t)extent - nread);
>     }
>     *nreadp = nread;
>     *posp += nread;
>
>     return ENOERR;
>
> With this code, if the POSIX read does not read the full number of bytes,
> the read is not retried but rather "memset" zeroes out the rest of the
> user's buffer even though there may still be more bytes in the file to read.
> This is the exact behavior that some of our users have experienced when
> using NetCDF.
>
> With some local modifications to the NetCDF library and some test cases,
> Cray verified that the NC_CLASSIC_MODEL path does in fact pass through the
> above code. But for creation mode NC_NETCDF4 it does not. For this mode,
> HDF5 I/O is called and HDF5 I/O properly handles short reads.
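
The message notes that the POSIX write case is essentially the same as the read case. A minimal sketch of a write loop that tolerates short writes might look like the following. The names `write_full` and `write_full_demo` are hypothetical and are not part of netCDF; treating a zero-byte return as `EIO` is an assumption made here only to guarantee the loop terminates.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not netCDF code): call write() repeatedly until all
 * `count` bytes are written or a real error occurs.
 * Returns 0 on success or an errno value on failure; *nwritten always
 * receives the number of bytes actually written. */
int write_full(int fd, const void *buf, size_t count, size_t *nwritten)
{
    const char *p = buf;
    size_t done = 0;
    int status = 0;

    while (done < count) {
        ssize_t n = write(fd, p + done, count - done);
        if (n > 0) {
            done += (size_t)n;      /* short write: loop for the rest */
        } else if (n == -1 && errno == EINTR) {
            continue;               /* interrupted before any data: retry */
        } else if (n == -1) {
            status = errno;         /* genuine error */
            break;
        } else {
            status = EIO;           /* assumption: no progress is an error */
            break;
        }
    }
    *nwritten = done;
    return status;
}

/* Small round trip through a pipe; returns 0 if the data survived intact. */
int write_full_demo(void)
{
    int fds[2];
    const char msg[] = "short write";
    char back[sizeof msg] = {0};
    size_t nwritten = 0;

    if (pipe(fds) != 0)
        return -1;
    int status = write_full(fds[1], msg, sizeof msg, &nwritten);
    close(fds[1]);
    ssize_t n = read(fds[0], back, sizeof back);
    close(fds[0]);

    assert(status == 0 && nwritten == sizeof msg);
    return (n == (ssize_t)sizeof msg && memcmp(msg, back, sizeof msg) == 0) ? 0 : -1;
}
```

The demo only exercises the normal path; it does not force a short write, which would need a nearly full pipe or a non-blocking descriptor.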
>
> Since a short read can potentially happen on any POSIX compliant file
> system, code calling read should handle this possibility with code
> something like this:
>
>     /* fd is the file descriptor */
>     /* buf is the initial address of the user buffer */
>     /* request_count is the initial number of bytes requested */
>     char *p = buf;
>     size_t read_count;
>     ssize_t nread;            /* signed: read() returns -1 on error */
>     size_t bytes_xfered = 0;
>
>     do {
>         read_count = request_count - bytes_xfered;
>         nread = read(fd, p, read_count);
>         if (nread > 0) {
>             bytes_xfered += (size_t)nread;
>             p += nread;
>         }
>     } while ((nread > 0 && bytes_xfered < request_count) ||
>              (nread == -1 && errno == EINTR));
>
> Other examples of this method of reading again can be seen for HDF5 I/O
> in HDF5 source and for MPI I/O in ANL MPICH2 source.
>
> After analyzing the issues, we provided one of our users who was seeing
> the issue with a wrapper routine for the POSIX read call. This wrapper
> reads again when necessary as shown above. With the wrapper, the user
> no longer had any failures, verifying both the path and the fix.
>
> As stated at the beginning, this issue is not unique to Cray systems or
> to Lustre file systems. Lustre will eventually be modified so that it
> behaves as Lustre is intended to. That is, Lustre will eventually do the
> additional reads such that POSIX read and POSIX write of a Lustre file will
> never return a short read. But that doesn't remove the responsibility
> of program developers and library developers to handle the short read
> and short write cases. Other file systems may exhibit the short read or
> write behavior.
>
> We are informing our customers of the issue and encouraging them to
> correct their own calls to POSIX read and write if necessary. Cray is not
> intending to provide our customers with a locally modified NetCDF library.
> We leave it to UCAR to provide the appropriate fixes for NetCDF.
> When UCAR applies an appropriate fix and releases the new version, Cray will
> build it for our systems and release it to our customers.
>
> Please contact me with any questions, comments, or concerns.
>
> David Knaak
>
>

Ticket Details
===================
Ticket ID: KZJ-320086
Department: Support netCDF
Priority: High
Status: Closed
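
On the question of how to test for the issue without Cray hardware or a Lustre file system, one possible approach is to inject a reader that deliberately returns short counts, so the retry logic can be exercised deterministically. The following is only a sketch under stated assumptions: `read_fn`, `fake_short_read`, `read_full`, and `read_full_self_test` are hypothetical names that do not exist in netCDF, and the zero-fill at true end of file is modeled on the libsrc/posixio.c excerpt quoted in the message rather than on the fix that was eventually applied.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Same signature as POSIX read(), so a test can substitute a fake reader. */
typedef ssize_t (*read_fn)(int fd, void *buf, size_t count);

/* Hypothetical test double: serves bytes from an in-memory "file" but never
 * more than fake_max_chunk per call, imitating short reads. */
static const char *fake_data;
static size_t fake_len, fake_pos, fake_max_chunk;

ssize_t fake_short_read(int fd, void *buf, size_t count)
{
    (void)fd;                               /* descriptor is ignored */
    size_t left = fake_len - fake_pos;
    size_t n = count < left ? count : left;
    if (n > fake_max_chunk)
        n = fake_max_chunk;
    memcpy(buf, fake_data + fake_pos, n);
    fake_pos += n;
    return (ssize_t)n;                      /* 0 signals end of file */
}

/* Retry until `extent` bytes are read, end of file is reached, or a real
 * error occurs. Only bytes past true end of file are zero-filled.
 * Returns 0 on success or an errno value. */
int read_full(read_fn reader, int fd, void *vp, size_t extent, size_t *nreadp)
{
    char *p = vp;
    size_t done = 0;

    while (done < extent) {
        ssize_t n = reader(fd, p + done, extent - done);
        if (n > 0)
            done += (size_t)n;              /* short read: ask for the rest */
        else if (n == 0)
            break;                          /* genuine end of file */
        else if (errno == EINTR)
            continue;                       /* interrupted: retry */
        else
            return errno;
    }
    memset(p + done, 0, extent - done);
    *nreadp = done;
    return 0;
}

int read_full_self_test(void)
{
    char buf[16];
    size_t nread = 0;

    fake_data = "0123456789";
    fake_len = 10;
    fake_pos = 0;
    fake_max_chunk = 3;

    /* Request 8 of the 10 available bytes, delivered at most 3 at a time. */
    assert(read_full(fake_short_read, -1, buf, 8, &nread) == 0);
    assert(nread == 8 && memcmp(buf, "01234567", 8) == 0);

    /* Request 8 more: only 2 remain, so the other 6 are zero-filled. */
    memset(buf, 'x', sizeof buf);
    assert(read_full(fake_short_read, -1, buf, 8, &nread) == 0);
    assert(nread == 2 && memcmp(buf, "89\0\0\0\0\0\0", 8) == 0);
    return 0;
}
```

Passing the reader as a function pointer is a design choice made for testability; production code could pass `read` itself, since it has the same signature.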