
Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing



On Fri, 1 Jun 2001, anne wrote:

> "Arthur A. Person" wrote:
> >
> > >
> > > When you say "it still thrashes", do you mean that products aren't being
> > > received in a timely manner?  Right now products on ldm.meteo appear to
> > > be arriving pretty quickly.  And, 'top' is showing a low load average,
> > > the machine appears to be responsive, and there's a reasonable number of
> > > rpc.ldmds...  Is this all with your 600MB queue?
> >
> > By thrashing, I mean that the disk I/O light is mostly on, only
> > occasionally blinking off, the system responds very slowly, and IDD
> > reception lags at the reclass time limit, yet "top" shows only a few
> > percent of cpu usage.  The IDD seems fine on ldm right now because I restarted it
> > last night and also remade the queue to 600MB.  This doesn't tell us
> > anything about the cause, but I'm beginning to suspect that it has
> > something to do with using a large queue.  I'm going to run it with the
> > queue at 600MB until I leave for vacation next Friday... if it makes it
> > that long without a problem, I'll conclude it's queue size related and we
> > can resume working on this when we both get back from vacation.
> >
> > I still have my wsi data coming in, so if I don't see problems in the next
> > week, I'll probably assume the wsi rpc's are a symptom rather than a
> > cause, although they should still shut down when a connection is lost.
> >
>
> Art,
>
> FYI, Charlie O'Brian at WSI agreed to feed our 7.1 machine temporarily
> starting Monday.  I'll request the WSI data then, and try it with
> various queue sizes.

Okay... that will be another test, although, I'm feeling like the wsi
issue is more a symptom than a cause.

> Also, he said:
>
> > Unless there is a problem (i.e., internet congestion, system crash,
> > client LDM stopping, etc.), our program should never have to reconnect.
> > Our processes check every 5 minutes to make sure the client is
> > connected.  I noticed that we did a lot of restarting thru 5z this
> > morning.  I would hazard a guess that they are fine now.
>
> Yesterday, from the piece of the log I ftp'ed from your site, there were
> 155 connections in about 12 hours.  (And only 106 disconnects, as I
> recall.)  Could connectivity be a factor?  And yet, I'm assuming you had
> no similar problems when you were using navier, is that right?

I've been having on-and-off problems with connectivity to navier from
wsi, but I haven't pushed the issue because navier's been overloaded and I
could never be sure what the real problem might be.  There could be
network delays to wsi via ldm.meteo.psu.edu, but as I mentioned, my
current thinking is that's not the primary problem.

> You could try going back to the 2GB queue and see if the problem
> returns...

I ran the 600MB queue over the weekend (since ~ last Thursday) and have
seen no problems.  I'm going to coast into my vacation period this way and
when I get back, I will try the large queue again... I fully expect it to
fail again as before, for whatever reason... we'll see.  Interesting
problem...

                      Thanks for your help thus far...

                                  Art.

> Anne
> --
> ***************************************************
> Anne Wilson                     UCAR Unidata Program
> address@hidden                  P.O. Box 3000
>                                   Boulder, CO  80307
> ----------------------------------------------------
> Unidata WWW server       http://www.unidata.ucar.edu/
> ****************************************************
>

Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email:  address@hidden, phone:  814-863-1563