This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Brief comment on the obvious: It is less important what the agreed
definition of a "data set" (etc. for "collection", "catalog",
"directory", etc.) is than that there BE an agreed definition. I suggest
that someone should circulate an authoritative DODS glossary before the
meeting. It could save hours of definitional confusion. (Personally I
like the simple definition "In a DODS server, a dataset is something you
can get a DAS and DAP from." Maybe this should be the def'n of a "DODS
data set".)

John: Any thoughts you'd care to share prior to the meeting about the
potential for a DODS web crawler ("harvester", "scanner", ... more
glossary issues) automatically to produce a single giant thematic "DODS
collection" in the THREDDS framework?

    - steve

===========================================

Peter Cornillon wrote:
> John Caron wrote:
> >
> > Peter Cornillon wrote:
> >
> > >>Just to make sure i understand your terminology:
> > >>
> > >>files = physical files
> > >>
> > >
> > > YUP
> > >
> > >>datasets = logical files we want the user to see
> > >>
> > >
> > > I don't think about datasets in a file concept. It could be a group of
> > > files, a single file,... I guess that the reason that I don't think
> > > about it that way is that the data need not be in digital form to be
> > > grouped in a data set. Beach profiles that have been collected over
> > > the past 50 years and consist of pages of numbers - monthly values of
> > > depth below mean low water at specified distances from a marker in a
> > > given direction would qualify. I suppose that your definition is
> > > correct from a computer perspective, I just don't think of it that way.
> >
> > ok, i didnt really mean to use the word "file". how about:
> >
> > "a dataset is a logical grouping of data, associated in some meaningful
> > way from the user's perspective."
>
> Yup.
>
> > In a DODS server, a dataset is something you can get a DAS and DAP from.
>
> Well not really.
> You can only get a DDS and DAS from a data set IF it is
> either a single file or has a description in a file server or now in the
> Aggregation Server.
>
> > in THREDDS, a "collection" is a collection of datasets, for which the above
> > definition also works just fine. so whats the difference between a dataset
> > and a collection?
>
> At URI we have a half dozen SST datasets derived from the AVHRR sensors:
> one for the area off of Cape Hatteras, another for the Great Lakes, ...
> Each has on the order of 15,000 passes in it. I assume that you would
> call the ensemble of these a collection?
>
> > this is the same issue that Benno has pointed out: in his DODS
> > server, there is no distinction between collections and datasets, because
> > the server seamlessly moves between collections, physical files, and the
> > fields in the files, presenting a uniform API of datasets with their DAP
> > and DAS.
>
> But, you would be hard pressed to aggregate the things that I call datasets
> at URI (the Hatteras one with the Great Lakes one) with your Aggregation
> Server.
> As I noted in my previous e-mail the actual grouping of data into a dataset is
> arbitrary, so one could call the collection of datasets at URI a dataset or
> one could refer to each one as a dataset. One could call all data at a site a
> data set, or in the extreme, all earth science data accessible via DODS as
> a dataset.
>
> > (I am not going to try to answer the question of what's the difference
> > between a catalog and a collection yet; hopefully others might have some
> > ideas)
> >
> > in THREDDS, a dataset has a URI, and is the smallest choosable thing in the
> > catalog.
>
> I think that this is pretty much what we refer to as a directory, although
> we are still working on making a single URL for each dataset described in
> the various directories.
>
> > our goal as middleware is to present the list of dataset choices to the
> > user very quickly, without having to actually contact the server. once the
> > user selects a dataset, then the user can expect some delay while a
> > connection is made to the server, and the "real" dataset metadata is
> > collected. This implies that the catalog metadata may not be exactly right
> > at all times (eg the list of available times of the dataset), which makes
> > life easier for implementors.
> >
> > >>inventory = listing of datasets
> > >>
> > >
> > > No, a listing of datasets is what I refer to as a directory (not a
> > > directory on a computer). The GCMD is an example of same. An
> > > inventory is a listing of elements in a data set, it could be a
> > > list of times for satellite images in an archive along with the
> > > physical location of the data (tape C18341 on a rack, or
> > > N861230147.hat in a computer directory on my machine) or a list
> > > of times and locations of each XBT in an XBT archive.
> >
> > so is an inventory an internal thing that the server uses to construct the
> > datasets that are visible to the outside world?
>
> I don't think so. First, it need not be internal. For a long time
> we maintained inventories of the data sets at JPL. The inventory
> is simply a list of the contents of a dataset. A dataset can
> exist without an inventory, in that the dataset is a logical
> grouping of the data. The GCMD identifies a lot of datasets
> that to the best of my knowledge do not have inventories. Well,
> in a sense they do in that they might often comprise all of the
> files in a directory on a computer, so the directory listing is
> to some extent an inventory of the data in the dataset.
>
> > >>question:
> > >>what does it mean to "group files into data sets"? like the agg server?
> > >>
> > >
> > > One might say that all images in this projection, from this satellite,
> > > processed this way form a data set.
> > > Or one could say that all images in
> > > this projection, from this suite of satellites processed this way
> > > form a data set. Or... This is the trouble with data sets, different
> > > people call different groupings of the data a data set. This caused
> > > a lot of blood letting between NASA and NOAA a number of years back.
> > > The idea is NOT to call every granule or every file in the system a
> > > data set, you know the difference between lumpers and splitters. In
> > > order for us to make progress, we have to back off a bit and look at
> > > the big picture, grouping things into data sets allows us to do that.
> > > This is exactly the problem that the DODS crawler has. When it crawls
> > > a site such as our satellite archive, it ends up with thousands of
> > > entries and the system or the person viewing the results struggles
> > > with a data overload, more information than s/he/it (humm... have
> > > to be careful with these gender neutral versions) wants or needs to
> > > locate the group of files that define the object of interest. Given
> > > that there is no precise definition for how to group files into a
> > > data set, I think that we can reduce the amount of information that
> > > we have to deal with to a reasonable view of all the data on the
> > > system without losing much if anything. The crawler is likely to group
> > > the files slightly differently in some cases than the human would, but
> > > one could probably discover this pretty quickly and steer the crawler
> > > if necessary.
> >
> > ok, this seems to be similar to the "collections" vs "datasets" issue
> > above. I think i need to hear Steve's tech presentation before I can
> > understand this any deeper.
> >
> > >>Generating "inventories of granules in data sets" makes sense in the
> > >>context of an agg server, but is there also meaning to it in the
> > >>context of a normal DODS server?
> > >>
> > >
> > > Not sure exactly what you mean here.
> > > We have file servers which are
> > > inventories of granules in data sets. Actually the terminology is a
> > > bit loose here also. The server in this case is a DODS FreeForm server.
> > > It serves a table that contains a list of URLs with the characteristic(s)
> > > that differentiate one URI from another, time in the case of our satellite
> > > archives.
> >
> > i think some of the problem is that i think of DODS narrowly as a specific
> > client/server protocol, and you include services and extensions that have
> > been built with or use that protocol.
>
> Yes! The DODS DAP is the thing that defines the low level data access
> protocol. To use it effectively one needs to add higher level constructs
> such as the file server.
>
> Peter
> --
> Peter Cornillon
> Graduate School of Oceanography - Telephone: (401) 874-6283
> University of Rhode Island      - FAX:       (401) 874-6728
> Narragansett RI 02882 USA       - Internet:  address@hidden
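
The vocabulary Peter uses in this thread (a directory lists datasets, an
inventory lists the granules inside one dataset, and a file server is an
inventory keyed on time) can be sketched as a small data model. This is only
an illustration of the terms as used above, not code from DODS or THREDDS.
The class names, the helper dap_urls, and the example.org URLs are all
hypothetical. The one protocol detail assumed is the usual DODS convention of
appending ".dds" and ".das" to a dataset URL to request those two responses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Granule:
    """One inventory row: a granule URL plus the value (here, time)
    that differentiates it from the other granules in the dataset."""
    time: datetime
    url: str  # hypothetical URL, for illustration only


@dataclass
class Dataset:
    """A logical grouping of data. The inventory may be empty, since a
    dataset can exist without one."""
    name: str
    inventory: List[Granule] = field(default_factory=list)

    def granules_between(self, start: datetime, end: datetime) -> List[Granule]:
        """Select granules by time, the way a file-server table would."""
        return [g for g in self.inventory if start <= g.time <= end]


@dataclass
class Directory:
    """A listing of datasets (not a directory on a computer)."""
    datasets: List[Dataset] = field(default_factory=list)

    def names(self) -> List[str]:
        return [d.name for d in self.datasets]


def dap_urls(granule_url: str) -> Dict[str, str]:
    """Build the DDS and DAS request URLs for one served URL
    (assumes the standard DODS suffix convention)."""
    return {"dds": granule_url + ".dds", "das": granule_url + ".das"}


# Hypothetical example: two SST datasets, only one of which has an inventory.
hatteras = Dataset(
    name="AVHRR SST, Cape Hatteras",
    inventory=[
        Granule(datetime(1986, 12, 30, 1, 47), "http://example.org/dods/hat/pass_a"),
        Granule(datetime(1987, 1, 2, 14, 5), "http://example.org/dods/hat/pass_b"),
    ],
)
great_lakes = Dataset(name="AVHRR SST, Great Lakes")  # no inventory yet
directory = Directory(datasets=[hatteras, great_lakes])

december = hatteras.granules_between(datetime(1986, 12, 1), datetime(1986, 12, 31, 23, 59))
```

Keeping the inventory optional mirrors the point that a dataset is a logical
grouping first; the time-keyed selection mirrors the file-server table.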