incident with LDM on shemp, Friday, May 12, 21:49:00 (fwd)
- Subject: incident with LDM on shemp, Friday, May 12, 21:49:00 (fwd)
- Date: Mon, 15 May 2000 15:01:40 -0600 (MDT)
===============================================================================
Robb Kambic Unidata Program Center
Software Engineer III Univ. Corp for Atmospheric Research
address@hidden WWW: http://www.unidata.ucar.edu/
===============================================================================
---------- Forwarded message ----------
Date: Mon, 15 May 2000 12:56:41 -0600
From: Russ Rew <address@hidden>
To: address@hidden
address@hidden
Subject: incident with LDM on shemp, Friday, May 12, 21:49:00
Chiz,
We restarted the LDM on shemp this morning by shutting it down
normally and then rebooting shemp, because someone noticed and
reported that an ADDE server on shemp wasn't responding. I copied all
the log files and Mike captured output from top and ps before
rebooting; we also copied the old product queue just in case it would
be useful. All this is available on shemp in the directory
/local/ldm/logs/incident/.
At this point I'm not studying this too closely, because it looks like
it may just be fallout from a router problem that occurred on
Friday and took a while to fix. I'm afraid I went up to the "Spring
Fling 2000" on the Mesa and missed most of this, but I probably should
have checked shemp's LDM over the weekend.
Apparently, before we shut down shemp's LDM this morning, there were
301 rpc.ldmd processes running (see ps.out), lots of other associated
processes, and a load average of about 272 (see top.out).
pqmon.log.1 shows that products stopped going into the queue between
21:48:45 and 21:49:01 on Friday:
May 12 21:48:45 pqmon: 116649 128 43223 1875448600 159996 3832 2 9747720 6891
May 12 21:49:01 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6906
May 12 21:49:16 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6921
May 12 21:49:31 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6936
...
and ldmd.log.2 shows symptoms of network problems during the hour
before that (lots of "pq_sequence failed: I/O error (errno = 5)" and
"nullproc(<hostname>): RPC: Unable to receive" and RECLASS messages).
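For anyone who wants to find the stall point in a longer pqmon log without reading it by eye, here is a minimal Python sketch. It assumes each sample sits on one line in the form shown above, and that the last column is an age-like counter that keeps ticking even when nothing is inserted (it advances by 15 in the excerpt), so it is ignored in the comparison. The function names are made up for illustration and are not part of the LDM.

```python
import re

# Matches e.g. "May 12 21:48:45 pqmon: 116649 128 ..."
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d) pqmon: (.*)$")

SAMPLE = """\
May 12 21:48:45 pqmon: 116649 128 43223 1875448600 159996 3832 2 9747720 6891
May 12 21:49:01 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6906
May 12 21:49:16 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6921
May 12 21:49:31 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6936
"""

def parse(lines):
    """Return a list of (timestamp, [int fields]) for every pqmon line."""
    samples = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            samples.append((m.group(1), [int(x) for x in m.group(2).split()]))
    return samples

def first_frozen_pair(samples):
    """Return the timestamps of the first two consecutive samples whose
    counters are identical, ignoring the last (assumed age) column."""
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        if f0[:-1] == f1[:-1]:
            return t0, t1
    return None

print(first_frozen_pair(parse(SAMPLE.splitlines())))
# prints ('May 12 21:49:01', 'May 12 21:49:16')
```

The first frozen pair starts at the 21:49:01 sample, which is the last one that differs from its predecessor, so the final insertion happened somewhere between 21:48:45 and 21:49:01.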
Robb says it's expected that the LDM starts up a lot of extra
processes when the network is flaky as it was on Friday, so unless you
see something else funny about this, I'm assuming it has nothing to do
with the new pq code.
--Russ
P.S. There's a problem with starting up pqmon from an exec line in
ldmd.conf. Instead of sleeping for 30-second intervals between
status lines, as requested with
exec "pqmon -i 30 -l /usr/local/ldm/logs/pqmon.log"
it outputs the product queue status every time a product comes in,
because it gets woken up by the process group signal that a new
product is available. I'll have to fix this, but for now I just
killed the pqmon started in the LDM's process group and started up a
new one outside the rpc.ldmd's process group appending to pqmon.log ...
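As a rough illustration of the workaround, the Python sketch below starts a command in its own session, and therefore its own process group, so that signals sent to the launching process group do not reach it. This is only a sketch of the general POSIX technique, not how the LDM itself launches anything; the helper name start_outside_group is made up for the example.

```python
import os
import subprocess

def start_outside_group(argv):
    """Launch argv in a new session, which also gives it a new process
    group, so signals sent to the caller's process group skip it."""
    return subprocess.Popen(argv, start_new_session=True)

# Hypothetical usage with the command line from the exec entry above:
#   proc = start_outside_group(
#       ["pqmon", "-i", "30", "-l", "/usr/local/ldm/logs/pqmon.log"])
#   print(proc.pid, os.getpgid(proc.pid), os.getpgrp())
```

With start_new_session=True the child calls setsid() before exec, so its process group ID equals its own PID rather than that of the process that launched it.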