Re: aeolus problems - LDM dying
- Subject: Re: aeolus problems - LDM dying
- Date: Wed, 06 Feb 2002 11:31:30 -0700
Russ Rew wrote:
>
>
> After thinking about it more, it turns out if the signature (an array
> of 4 ints) was all zeros, n would be zero and the assertion "n>0" would
> be violated, even though n was an unsigned int.
>
> Although it's not supposed to be possible to get an all-zero signature
> (it's the result of an MD5 digest of a product), it also seems likely
> that a memory failure might be manifested as reading all zeros for a
> memory fetch, or that a disk corruption might have the symptom of
> zeroing out some bytes on the disk where signatures were stored.
>
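To make the all-zero case concrete, here is a minimal sketch in Python rather than the LDM's own C. It treats an MD5 digest as four unsigned 32-bit ints and tests for the all-zero condition Russ describes. How the LDM actually derives n from the signature is not shown in this thread, so the function names and the byte order below are assumptions for illustration only.

```python
import hashlib
import struct


def signature_words(data: bytes):
    """Return the MD5 digest of a product as four unsigned 32-bit ints.

    The big-endian unpacking is an assumption made for this sketch.
    """
    digest = hashlib.md5(data).digest()  # an MD5 digest is always 16 bytes
    return struct.unpack(">4I", digest)


def is_zero_signature(words) -> bool:
    """True if every word is zero, the case that would leave n at zero."""
    return all(w == 0 for w in words)


# A real digest is not expected to be all zeros; zeroed memory or disk would be.
corrupted = struct.unpack(">4I", bytes(16))
```

A check like this before the assertion would at least let the all-zero case be logged instead of aborting the process.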
So, this leaves us in an unresolved state. From the system logs we saw
that aeolus had a CPU panic and rebooted itself at 07:56 local time, and
an hour later it corrected a memory error. But the assertion violations
reported in the LDM logs, which caused the crashes, did not occur until
hours after that.
I also can't explain the bad latencies that were logged for only a few
products:
ldmd.log.3:Feb 05 17:41:04 aeolus motherlode[1329]: skipped: 20020205160304.032 (2280.714 seconds)
ldmd.log.3:Feb 05 18:03:47 aeolus motherlode[1329]: skipped: 20020205164524.036 (1102.943 seconds)
ldmd.log.2:Feb 05 22:23:24 aeolus motherlode[3932]: skipped: 20020205211554.685 (449.618 seconds)
ldmd.log.1:Feb 05 22:59:31 aeolus motherlode[4249]: skipped: 20020205215159.267 (451.825 seconds)
In two of the four crashes that I am aware of, these skipped products
occurred immediately before the assertion failure. In a third crash, two
products were skipped well before the crash, and in the fourth crash
there were no such skips. I guess the bad latencies are unrelated to the
crashes and just reflect some problem in the connection during that
5+ hour time period, although it seems odd that only a few products
would have such bad latencies.
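For anyone who wants to line up the skips against the crash times, here is a minimal sketch in Python that pulls the "skipped:" entries out of log text. The pattern is inferred only from the four entries quoted above and assumes each entry sits on a single line in the log file itself. The names SKIPPED_RE and find_skipped are made up for this example.

```python
import re

# Field layout inferred from the four "skipped:" entries quoted in this message.
SKIPPED_RE = re.compile(
    r"(?P<logtime>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<ident>\S+)\[(?P<pid>\d+)\]: skipped: "
    r"(?P<product_time>\d{14}\.\d+) \((?P<latency>\d+(?:\.\d+)?) seconds\)"
)


def find_skipped(lines):
    """Return (log time, pid, product time, latency in seconds) per skipped entry."""
    results = []
    for line in lines:
        m = SKIPPED_RE.search(line)
        if m:
            results.append(
                (m["logtime"], int(m["pid"]), m["product_time"], float(m["latency"]))
            )
    return results


SAMPLE = [
    "Feb 05 17:41:04 aeolus motherlode[1329]: skipped: 20020205160304.032 (2280.714 seconds)",
    "Feb 05 22:59:31 aeolus motherlode[4249]: skipped: 20020205215159.267 (451.825 seconds)",
    "Feb 05 23:00:00 aeolus motherlode[4249]: some unrelated message",
]

for entry in find_skipped(SAMPLE):
    print(entry)
```

Sorting the result by log time next to the timestamps of the assertion failures should show quickly whether the skips cluster around the crashes.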
So, we can't say for sure what went wrong. I suggest we watch aeolus
for the rest of the day and if it behaves properly then send out a note
to the effect that downstream sites could reconnect, although perhaps
with a caveat...
Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************