We have been having a lot of problems with THREDDS crashing lately, and we're hoping you can look into the problem or give us advice. We recently updated to THREDDS Data Server version 3.16.22 (build date 2007-10-25 19:10:28).
The general scenario is:

1) One or more invalid .nc files are added to the data directories. Obviously, this isn't intentional. We just found the source of many of these: a bad NIC card. But given the huge number of files involved, bad files have always shown up periodically. Often, the bad file has 0 bytes.

2) During aggregation, THREDDS notices the bad file and reports it to catalina.out. For example:
2007-11-15T09:17:45.277 -0800 [ 1446340][ 3538] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.io.IOException: Cant read /u00/satellite/AK/sstd/hday/AK2007272_014200h_sstd.nc: not a valid NetCDF file.
  at ucar.nc2.NetcdfFile.open(NetcdfFile.java:459)
  at ucar.nc2.NetcdfFile.open(NetcdfFile.java:261)
  at ucar.nc2.dataset.NetcdfDataset.openFile(NetcdfDataset.java:341)
  at ucar.nc2.ncml.Aggregation$Dataset$PolymorphicReader.open(Aggregation.java:1286)
  at ucar.nc2.NetcdfFileCache.acquire(NetcdfFileCache.java:188)
  at ucar.nc2.ncml.Aggregation$Dataset._acquireFile(Aggregation.java:1186)
  at ucar.nc2.ncml.Aggregation$Dataset.acquireFile(Aggregation.java:1172)
  at ucar.nc2.ncml.Aggregation$Dataset.getNcoords(Aggregation.java:1100)
  at ucar.nc2.ncml.Aggregation$Dataset.setStartEnd(Aggregation.java:1121)
  at ucar.nc2.ncml.Aggregation$Dataset.access$100(Aggregation.java:1002)
  at ucar.nc2.ncml.Aggregation.buildCoords(Aggregation.java:315)
  at ucar.nc2.ncml.AggregationExisting.buildDataset(AggregationExisting.java:61)
  at ucar.nc2.ncml.Aggregation.finish(Aggregation.java:293)
  at ucar.nc2.ncml.NcMLReader.readNetcdf(NcMLReader.java:397)
3) Some time afterwards (not immediately, but it seems to happen sooner when there are more bad files), THREDDS crashes. For example:
2007-11-15T08:17:57.883 -0800 [ 1024030][ 6343] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.lang.OutOfMemoryError: Java heap space
DODServlet ERROR (anyExceptionHandler): java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
ReqState:
  serverClassName: 'thredds.server.opendap.NcDODSServlet'
  dataSet: 'satellite/CM/usfc/hday'
  requestSuffix: 'dods'
  CE: 'altitude,lat,lon,time'
  compressOK: false
  InitParameters:
    DebugOn: ''
    maxNetcdfFilesCached: '10'
java.lang.OutOfMemoryError: Java heap space

There may be a couple of these errors (referring to different datasets) in rapid succession, then:

2007-11-15T08:18:18.048 -0800 [ 1044195][ 6487] ERROR - org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/thredds].[NetcdfDODS] - Servlet.service() for servlet NetcdfDODS threw exception
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:12 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor processChildren
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:53 AM unknown unknown
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space
So we aren't certain that the bad files are the cause of the OutOfMemoryErrors, but the correlation is strong: with few or no bad files, THREDDS seems to stay up longer; with more bad files, it seems more likely to crash.
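For what it's worth, here is roughly the kind of check we could run on our side before THREDDS aggregates a directory. It is only a sketch: the directory is the one from the log above, the "bad" quarantine subdirectory is hypothetical, and none of this is THREDDS code.

    // Sketch only -- a possible workaround on our side, not THREDDS code.
    // Assumes ucar.nc2.NetcdfFile and java.io.* are imported.
    File dir = new File("/u00/satellite/AK/sstd/hday");   // example directory from the log above
    File badDir = new File(dir, "bad");                   // hypothetical quarantine directory
    File[] files = dir.listFiles();
    if (files != null) {
      for (File f : files) {
        if (!f.getName().endsWith(".nc"))
          continue;
        boolean ok = f.length() > 0;                      // the common failure is a 0-byte file
        if (ok) {
          NetcdfFile ncfile = null;
          try {
            ncfile = NetcdfFile.open(f.getPath());        // throws IOException if not a valid NetCDF file
          } catch (IOException e) {
            ok = false;
          } finally {
            if (ncfile != null) {
              try { ncfile.close(); } catch (IOException e) { /* ignore */ }
            }
          }
        }
        if (!ok) {
          badDir.mkdirs();
          f.renameTo(new File(badDir, f.getName()));      // move it aside so aggregation never sees it
        }
      }
    }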
As a possible cause, I'll mention the previous problem with .nc files not being automatically closed by an overridden finalize method. (See Original Message below.) That caused a memory and resource leak if the program didn't explicitly close every file (even when an Exception or Error was thrown during processing). A similar problem could certainly be the cause of the problem here.
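To illustrate the pattern I mean, here is a made-up fragment (not actual THREDDS code; it assumes ucar.nc2.NetcdfFile, ucar.nc2.Variable, and ucar.ma2.Array are imported, and "sst" is just an example variable name):

    NetcdfFile ncfile = NetcdfFile.open(fileName);  // deliberately no try/finally
    Variable v = ncfile.findVariable("sst");        // assume the variable exists
    Array data = v.read();                          // if this throws, the next line never runs...
    ncfile.close();                                 // ...so the file and its buffers stay open

If code like this runs inside a long-lived server, every failed read leaves one more file open until something else cleans it up.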
And I'll mention our previous request to have THREDDS simply ignore bad files but keep serving the rest of the dataset. (See Original Message below.) You said you were going to work on that; I don't know whether you made changes or what changes you made. I'll just point out that we only recently upgraded to a version of THREDDS that might have these changes in it. We had previously been using a somewhat older version because of other problems with the intervening versions of THREDDS.
I hope that is helpful.
Thanks for looking into this.
Sincerely,
Bob Simons
Satellite Data Product Manager
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
(831)658-3205
address@hidden
The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><
-------- Original Message --------
Subject: finalize
Date: Fri, 18 Aug 2006 12:10:57 -0700
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>
I had a bug in a new procedure I was writing (my fault, I admit), wherein the close() method of many NetcdfFile objects was never called.

On Windows, this code ran fine. The Java implementation on Windows (specifically the garbage collector) must have been smart enough to recover all the allocated memory and close all the files when the NetcdfFile objects were no longer referenced.

But on Linux, the procedure failed after many calls, and I got an error message (from Java? from Tomcat? it doesn't matter) saying "too many files open".

Fine. I fixed the problem in my code. But I have two requests for you:
1) In the JavaDocs for the NetcdfFile class (up at the top), please include the same suggestion as you already have in NetcdfFileCache's JavaDocs regarding the use of "finally" to ensure that close() is called. E.g.,
    NetcdfFile ncfile = null;
    try {
        ncfile = NetcdfFile.open(fileName);
        ...
    } finally {
        if (ncfile != null) ncfile.close();  // open() may have thrown, leaving ncfile null
    }
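(The null test in the finally block matters because open() itself can throw; in that case ncfile is still null, and calling close() on it would only replace the real exception with a NullPointerException.)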
2) I think it would be a huge improvement to NetcdfFile (and presumably NetcdfFileCache, NetcdfDataset, NetcdfDatasetCache, DODSNetcdfFile (which isn't in the latest copy of the documentation I have), and any similar object) to include the following method (or something like it):
    @Override
    protected void finalize() throws Throwable {
        try { close(); } finally { super.finalize(); }  // last-chance cleanup, then chain to Object.finalize()
    }
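(The try/finally there is just the usual defensive pattern when overriding finalize(): it makes sure super.finalize() still runs even if close() throws.)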
As you probably know, the finalize() method is inherited from java.lang.Object and is called by the garbage collector when it finds that an object is no longer reachable. In this case, overriding finalize() lets you ensure that close() eventually gets called.

In real life, it isn't always possible to use try/finally, and that opens up the possibility of not calling close() if an error occurs. It is dangerous to have that resource leakage be a possibility. Using finalize() this way prevents this type of resource leakage (which is as dangerous as a memory leak in other languages) from occurring.
I admit again that it was my coding error that led to this issue coming up. But I think it is good defensive programming to have a finalize() method for classes like these.
Thank you.
***end -------- Original Message --------
-------- Original Message --------
Subject: Request for enhancement
Date: Wed, 03 Jan 2007 15:23:39 -0800
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>
CC: Roy Mendelssohn <address@hidden>, Dave Foley
<address@hidden>
The Problem:

We often use THREDDS to aggregate near-real-time data. We have one directory per data set and frequently (hourly or daily) add an additional data file to each data set's directory. Sometimes things go wrong (e.g., power outages) and a partial and/or damaged file is left in a data set's directory. We try to deal with this, but our methods aren't perfect.

Currently, when THREDDS is aggregating a data set, if THREDDS finds an invalid .nc data file in with all the .nc data files for a given data set, it throws an Exception ("java.io.IOException: Cant read [fileName]: not a valid NetCDF file.") and rejects the entire dataset. This is a serious problem for us because one bad file makes an entire dataset disappear from THREDDS. The only way for us to find out if this has happened is to go through the <tomcat>/logs/catalina.out log file (which we do periodically) and look for the Exception messages.
The Request:

It would be nice if THREDDS simply rejected the bad files (throwing an Exception as it does now) but kept going with the aggregation of the data set.

(I realize that the proposed behavior is good for our situation and that you may not be comfortable making it the standard behavior for THREDDS.)
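To make the request concrete, here is a rough sketch of the behavior we are imagining. It is only an illustration: the names (candidateFiles, usable) are made up, and we realize the real Aggregation code is organized quite differently.

    // Sketch only -- not actual THREDDS code. Assumes ucar.nc2.NetcdfFile,
    // java.io.IOException, and java.util.* are imported; candidateFiles is a
    // hypothetical List<String> of the .nc paths found for the data set.
    List<String> usable = new ArrayList<String>();
    for (String fileName : candidateFiles) {
        NetcdfFile ncfile = null;
        try {
            ncfile = NetcdfFile.open(fileName);  // throws IOException for a 0-byte or corrupt file
            usable.add(fileName);                // good file: include it in the aggregation
        } catch (IOException e) {
            // bad file: report it (as THREDDS does now) but keep going with the rest
            System.err.println("Skipping invalid file " + fileName + ": " + e.getMessage());
        } finally {
            if (ncfile != null) {
                try { ncfile.close(); } catch (IOException e) { /* ignore */ }
            }
        }
    }
    // ...then aggregate only the files in 'usable'.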
Thank you.
***end -------- Original Message --------