20051208: LDM on ensemble.ecmwf.int (cont.)
- Subject: 20051208: LDM on ensemble.ecmwf.int (cont.)
- Date: Thu, 08 Dec 2005 11:03:26 -0700
>From: David Ian Brown <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200512081747.jB8HlM7s017330
Dave, et al.,
>I tried reconfiguring the ldmd.conf file on dataportal with
>the request line:
>request SPARE ".*" ensemble.ecmwf.int primary
>in place of
>request SPARE ".*" teaccess.ecmwf.int primary
>
>but the log now has many lines similar to the following:
>
>Dec 08 17:30:45 dataportal ensemble[28056] ERROR: Terminating due to
>LDM failure; Couldn't get IP address of host ensemble.ecmwf.int
> -dave
This appears to be a DNS issue.
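One quick way to confirm this from dataportal is to ask the local resolver
directly. The following is only a minimal sketch: the helper name
can_resolve is made up for illustration, and it checks name resolution
only, not whether an LDM is actually reachable on the remote host.

```python
import socket


def can_resolve(host):
    """Return True if the local resolver maps host to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved; this is the same class of
        # failure as "Couldn't get IP address of host ..." in the LDM log.
        return False


if __name__ == "__main__":
    # Host names taken from the request lines quoted above.
    for name in ("ensemble.ecmwf.int", "teaccess.ecmwf.int"):
        state = "resolves" if can_resolve(name) else "does NOT resolve"
        print(name, state)
```

If the new name does not resolve while the old one does, the problem is on
the DNS side and not in the ldmd.conf entry itself.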
>Also I'd like a bit more advice on how to proceed with testing.
>So far I have not actually saved any data. I assume the queue just
>overwrites the oldest data,
Yes, the LDM queue module will delete the oldest products in the queue
to make room for new ones. The age of the oldest queue product can be
seen using pqmon:
<as 'ldm'>
pqmon -l-
Here is an example from one of our NOAAPORT ingest machines:
Dec 08 17:52:50 pqmon NOTE: Starting Up (14912)
Dec 08 17:52:50 pqmon NOTE: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
Dec 08 17:52:50 pqmon NOTE: 108726 1 135413 999632480 165656 6 78483 369056 6220
Dec 08 17:52:50 pqmon NOTE: Exiting
The last listed value is the age of the oldest product in the queue in
seconds.
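If you want to watch the age over time, the pqmon output can be picked
apart with a few lines of script. This is just a sketch: the function name
parse_pqmon is hypothetical, and it assumes the header and the values each
arrive on a single line after "NOTE:", as in the listing above once mail
wrapping is undone.

```python
def parse_pqmon(lines):
    """Return a dict mapping pqmon column names to integer values, or None."""
    header = None
    for line in lines:
        # Drop the "date host pqmon NOTE:" prefix; skip lines without it.
        _, sep, payload = line.partition("NOTE:")
        if not sep:
            continue
        fields = payload.split()
        if fields and fields[0] == "nprods":
            header = fields  # the column-name line
        elif (header and len(fields) == len(header)
              and all(f.isdigit() for f in fields)):
            return dict(zip(header, (int(f) for f in fields)))
    return None


# Sample taken from the NOAAPORT ingest example in this message.
sample = [
    "Dec 08 17:52:50 pqmon NOTE: Starting Up (14912)",
    "Dec 08 17:52:50 pqmon NOTE: nprods nfree nempty nbytes maxprods "
    "maxfree minempty maxext age",
    "Dec 08 17:52:50 pqmon NOTE: 108726 1 135413 999632480 165656 6 "
    "78483 369056 6220",
    "Dec 08 17:52:50 pqmon NOTE: Exiting",
]

stats = parse_pqmon(sample)
print("oldest product age: %d s (%.1f min)" % (stats["age"], stats["age"] / 60.0))
```

For the sample above the age is 6220 seconds, i.e., the queue is holding a
bit more than an hour and a half of data.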
>and that for now we are just testing to see
>if the data can be transferred quickly enough between sites.
Yes. This is the first step. One word of advice. We may find that we
need to split the data requests into several, disjoint ones. This
technique helps mitigate the backoff feature of the current
implementations of TCP. If/when TCP gets updated with fast TCP, this
feed splitting should no longer be necessary.
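To illustrate what I mean by disjoint requests, here is a rough sketch that
generates a set of request lines whose patterns cannot overlap. It ASSUMES
that the product IDs end in a digit (e.g., a sequence number); I do not
know that this is true for the ECMWF products, so the patterns would have
to be adapted to whatever the real product IDs look like. The function name
split_requests is made up, and the request-line layout simply mirrors the
one quoted above.

```python
import re


def split_requests(feed, host, n_groups=5, suffix="primary"):
    """Build n_groups request lines keyed on the final digit of the product ID.

    ASSUMPTION: every product ID ends in a digit 0-9. The digits are dealt
    round-robin into groups, so each digit lands in exactly one pattern.
    """
    digits = "0123456789"
    groups = [digits[i::n_groups] for i in range(n_groups)]
    patterns = ["[%s]$" % g for g in groups]
    lines = ['request %s "%s" %s %s' % (feed, p, host, suffix)
             for p in patterns]
    return patterns, lines


patterns, lines = split_requests("SPARE", "ensemble.ecmwf.int")

# Sanity check: an ID ending in any digit matches exactly one pattern.
for d in "0123456789":
    hits = sum(1 for p in patterns if re.search(p, "hypothetical_product_" + d))
    assert hits == 1

for line in lines:
    print(line)
```

The idea is that each request is carried over its own connection, so a
slowdown on one connection does not hold up the products requested on the
others.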
>I see the statistics for dataportal are now visible at
>http://www.unidata.ucar.edu/software/idd/rtstats/siteindex.php?dataportal.ucar.edu
Yes.
>The graphs seem to indicate that data transfer to dataportal stopped
>yesterday
>at around 0600 hours. Did something happen externally or has something
>gone
>wrong with the dataportal ldm that I need to attend to?
All: there have been multiple reports to Unidata User Support of LDMs
stopping yesterday at 5Z. I experienced this also on a dual Xeon
EM64T, Fedora Core 4 64-bit machine running LDM-6.4.2 in my office. I
was able to restart the LDM after deleting and remaking the LDM queue
twice. My gut feeling is that the assertion failure that was
reported:
Dec 08 05:24:22 ldm thelma[8869]: assertion "n > 0" failed: file "pq.c", line 2187
was somehow related to the time (!?). If this hunch is true, it seems
to me that one should be able to restart the LDM without deleting and
remaking the queue. Anyone who sees this problem listed in their LDM
log file: please report the failure to Unidata User Support
<address@hidden>. Thanks!
Our LDM developer, Steve Emmerson, will be looking at this failure when
he returns from the AGU meeting.
Just so you know, this is the first time we have seen this failure
in any LDM-6 installation.
Cheers,
Tom
--
+-----------------------------------------------------------------------------+
* Tom Yoksas UCAR Unidata Program *
* (303) 497-8642 (last resort) P.O. Box 3000 *
* address@hidden Boulder, CO 80307 *
* Unidata WWW Service http://www.unidata.ucar.edu/*
+-----------------------------------------------------------------------------+