We expect that data holdings can be divided into two categories. 1) sites in
which the monitoring (e.g. crawling) can be done occasionally (once a day, once an
hour, once a week?), and the impact of the crawling is therefore minimal. 2)
real-time sites that have constantly changing data. For these, we probably need
a different strategy, and we are considering instrumenting the LDM as one
possible solution.
But in sites that are being continuously updated, it seems to me
that you need a local inventory: a file or some other way of
keeping track of the contents of a data set. This is our notion
of a file server or your configuration file in the Aggregation
Server. This is the thing that you want to discover when searching
for data sets, not all of the files (or granules or whatever) in
the data set. This is what we are wrestling with in the crawler that
we are looking at. In particular, I have asked Steve to look at
ways of having the crawler group files into data sets automatically
and then reference the inventory for the data set rather than
the entire data set, and of making the crawler capable of updating
the inventory.
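
To make the grouping idea concrete, here is a minimal sketch of one way a crawler might collect files into data sets. It is not what our crawler does; the rule (same directory, same file name once runs of digits are masked) and the names dataset_key and group_into_datasets are assumptions made up for illustration only.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def dataset_key(path):
    """Hypothetical grouping rule: files belong to the same data set if they
    share a directory and have the same name once digit runs are masked."""
    p = PurePosixPath(path)
    pattern = re.sub(r"\d+", "#", p.name)  # e.g. sst_19990101.nc -> sst_#.nc
    return f"{p.parent}/{pattern}"


def group_into_datasets(paths):
    """Return a mapping from data set key to the sorted list of its granules."""
    datasets = defaultdict(list)
    for path in paths:
        datasets[dataset_key(path)].append(path)
    return {key: sorted(files) for key, files in datasets.items()}
```

A search would then see one entry per key (the data set) and follow it to the granule list, rather than seeing every file.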
Our hope is that the crawler would work locally,
building the inventory on site, and could be made to run as often
as you like. However, the inventory need not reside at the site
containing the actual data, and the crawler could be run from a
remote site, as our prototype does. The point here is that there
are two types of crawlers generating two types of lists: one
that generates inventories of granules in data sets (generally
run locally, and as often as you like) and the other generating
inventories of data sets, i.e. directories (generally run remotely
and less often). Finally, I note that the inventory could be generated
in other ways; for example, every time a granule is added to a
data set, the inventory could be updated automatically. I really
see the inventory issue as a local process. What is strange is
the number of data sets we encounter that do not have a
formal inventory, and this is what gives rise to this problem.
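
As a rough illustration of updating the inventory whenever a granule is added, the sketch below keeps the inventory as a JSON list in a local file. The file format, the function name add_granule and the idea of calling it from whatever process ingests new granules are all assumptions for the sake of the example, not a description of any existing system.

```python
import json
import os
import tempfile


def add_granule(inventory_path, granule_name):
    """Record one new granule in a local inventory file (a JSON list).

    Hypothetical helper: meant to be called by the ingest process each time
    a granule arrives, so no crawl is needed to keep the inventory current.
    """
    if os.path.exists(inventory_path):
        with open(inventory_path) as f:
            granules = json.load(f)
    else:
        granules = []  # first granule: start a new inventory

    if granule_name not in granules:
        granules.append(granule_name)
        granules.sort()

    # Write to a temporary file and rename it into place, so a reader
    # (for example a remote crawler) never sees a half-written inventory.
    directory = os.path.dirname(os.path.abspath(inventory_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(granules, f, indent=2)
    os.replace(tmp_path, inventory_path)
    return granules
```

The remote crawler would then only need to fetch this one file per data set.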