Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Subject: Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Date: Thu, 31 May 2001 15:10:56 -0400 (EDT)
On Thu, 31 May 2001, Anne Wilson wrote:
> "Arthur A. Person" wrote:
> >
> > On Thu, 31 May 2001, Anne Wilson wrote:
> >
> > > And, what's the history on the
> > > queue size?
> >
> > I believe I started running the ldm with a queue size of 2GB around May
> > 18, but with only a couple of rpc's in test mode. I then added my
> > downstream sites at the end of last week, and over the weekend (Sunday)
> > the system choked with the thrashing. I came in and power-cycled,
> > re-made the queue at 300MB, and restarted in hopes I would get through
> > the rest of the long weekend okay, and did. At that point, my swap
> > space was a 1.5GB partition, and I began thinking I perhaps needed swap
> > space larger than my queue size if the queue is mapped, so I added a
> > 2GB swap file to the system, re-made the queue at 2GB, and restarted
> > the ldm on Tuesday. This morning I noticed the system was thrashing
> > again; I don't know exactly when it started.
> >
> > > Do you normally run with less than 300MB, and is that what
> > > you're doing now?
> >
> > I'm running with a 2GB queue now, which is what I want to run with.
> > Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> >
>
> Please see below for a comment about this.
>
> > > How many rpc.ldmd processes are currently running? (I hope it's
> > > responsive enough to tell.)
> >
> > Perhaps this is revealing... there's a bunch of rpc's running, I think
> > more than there should be:
>
> Yes, this doesn't look right. Currently you have 78 of these processes
> running. That's five more than what you reported to me earlier, and four
> more than when I first logged in. The number seems to be growing.
>
> [ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
> 79 789 6380
>
> The max number of rpc.ldmds you should have is the number of requests to
> "unique" hosts plus the number of allows to "unique" hosts plus one. (I
> qualify "unique" because, as you know, the LDM will group
> requests/allows to the same hosts unless you trick it by using the IP
> address.) You may have fewer rpc.ldmds if your upstream hosts are
> unavailable or your downstream sites are not connected. Anyway, you
> have way more than you should, based on the entries in your ldmd.conf:
>
> [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
> 5 20 228
> [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
> 36 107 1669
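>
> A rough way to tally the expected count, assuming the host is the last
> field on your request lines and the third field on your allow lines
> (add the two results plus one):
>
> [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | awk '{print $NF}' | sort -u | wc -l
> [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | awk '{print $3}' | sort -u | wc -l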
>
> I'm developing a hypothesis: looking at the PIDs of the running
> rpc.ldmds and comparing them with the PIDs listed in the log, it looks
> like sysu1.wsicorp.com is connecting a lot more often than it's exiting.
> Take a look at this:
>
> [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
> 177 1416 12213
> [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
> 121 726 5203
> [ldm@ldm ~/logs]$ ^sysu1^windfall
> grep windfall ldmd.log | grep -E "Exiting" | wc
> 44 264 2024
> [ldm@ldm ~/logs]$ ^Exiting^Connection from
> grep windfall ldmd.log | grep -E "Connection from" | wc
> 44 352 3564
> [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
> 18 144 1170
> [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
> 18 108 846
>
> The stats for windfall and bob are for comparison. You'll see that for
> those two hosts the number of connects and exits are the same. I'd
> expect them to be the same plus or minus one.
>
> I know WSI has their own version of the LDM, based on a very early
> release. As an experiment, are you able to do without this data for a
> bit? I will conduct my own experiment here on our 7.1 machine, but it
> may take me a little time, as I have to be away for a few hours starting
> soon.
I was starting to suspect the WSI feed as well. I know they've had a lot
of trouble staying connected here, and when I saw all the rpc's, I started
thinking about WSI trying to connect. If their software is old, they
should probably update it, since their feed has not been as reliable as it
used to be, unless it's just a network bandwidth problem. I guess I would
have to take that up with them. But the connect/reconnect thing shouldn't
bring my system down either, should it?
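
To keep an eye on this from my end, I'll probably tally connects versus
exits per host with a quick loop along these lines (just a sketch, using
the hostnames from your log excerpts above):

  for h in sysu1 windfall bob
  do
    c=`grep $h ldmd.log | grep -c "Connection from"`
    e=`grep $h ldmd.log | grep -c "Exiting"`
    echo "$h: $c connects, $e exits"
  done
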
> One other point. With your 2GB queue, you have lots of data. At the
> time I ran the pqmon command below you had over 10 hours worth of data,
> and it was growing (see the 'age' field - it gives the age of the oldest
> product in the queue). The number of products in the queue is also
> going up, so space is not yet being recycled:
>
> [ldm@ldm ~/data]$ pqmon -i3
> May 31 17:49:25 pqmon: Starting Up (17268)
> May 31 17:49:26 pqmon: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
> May 31 17:49:26 pqmon: 185848 64 326088 2051968120 185848 471 326088 1549296 36372
> May 31 17:49:29 pqmon: 185868 64 326068 2052103712 185868 471 326068 1549296 36376
> May 31 17:49:33 pqmon: 185877 64 326059 2052139000 185877 471 326059 1549296 36379
> May 31 17:49:36 pqmon: 185890 64 326046 2052203688 185890 471 326046 1549296 36382
> May 31 17:49:39 pqmon: 185901 64 326035 2052238392 185901 471 326035 1549296 36386
> May 31 17:49:42 pqmon: 185916 64 326020 2052322080 185916 471 326020 1549296 36389
> May 31 17:49:46 pqmon: 185923 63 326014 2052353264 185923 471 326014 1549296 36392
> May 31 17:49:49 pqmon: 185938 63 325999 2052437608 185938 471 325999 1549296 36395
> May 31 17:49:52 pqmon: 185944 63 325993 2052463160 185944 471 325993 1549296 36398
> May 31 17:49:55 pqmon: 185947 63 325990 2052480008 185947 471 325990 1549296 36402
> May 31 17:49:59 pqmon: 185952 63 325985 2052525544 185952 471 325985 1549296 36405
> May 31 17:50:03 pqmon: 185959 63 325978 2052588304 185959 471 325978 1549296 36409
> May 31 17:50:06 pqmon: 185967 62 325971 2052651936 185967 471 325971 1549296 36412
> May 31 17:50:09 pqmon: 185977 62 325961 2052717376 185977 471 325961 1549296 36416
> May 31 17:50:12 pqmon: 185988 62 325950 2052812104 185988 471 325950 1549296 36419
> May 31 17:50:16 pqmon: 185992 62 325946 2052852920 185992 471 325946 1549296 36422
> May 31 17:50:19 pqmon: 186002 62 325936 2052912024 186002 471 325936 1549296 36425
> May 31 17:50:22 pqmon: 186013 62 325925 2053009880 186013 471 325925 1549296 36428
> May 31 17:50:25 pqmon: 186018 61 325921 2053029616 186018 471 325921 1549296 36432
> May 31 17:50:29 pqmon: 186031 61 325908 2053061800 186031 471 325908 1549296 36435
> May 31 17:50:32 pqmon: 186039 61 325900 2053099008 186039 471 325900 1549296 36439
> May 31 17:50:35 pqmon: 186048 61 325891 2053150176 186048 471 325891 1549296 36442
> May 31 17:50:39 pqmon: 186059 61 325880 2053246544 186059 471 325880 1549296 36445
> May 31 17:50:42 pqmon: 186070 61 325869 2053333296 186070 471 325869 1549296 36448
> May 31 17:50:45 pqmon: 186081 61 325858 2053422336 186081 471 325858 1549296 36452
> May 31 17:50:49 pqmon: 186095 61 325844 2053506456 186095 471 325844 1549296 36455
> May 31 17:50:52 pqmon: 186103 61 325836 2053532408 186103 471 325836 1549296 36459
> May 31 17:50:56 pqmon: 186112 61 325827 2053643864 186112 471 325827 1549296 36462
> May 31 17:50:59 pqmon: 186118 61 325821 2053755592 186118 471 325821 1549296 36465
> May 31 17:51:02 pqmon: 186124 61 325815 2053858840 186124 471 325815 1549296 36469
> May 31 17:51:05 pqmon: 186128 61 325811 2053906992 186128 471 325811 1549296 36472
> May 31 17:51:09 pqmon: 186139 61 325800 2054017464 186139 471 325800 1549296 36475
> May 31 17:51:12 pqmon: 186148 61 325791 2054157200 186148 471 325791 1549296 36478
> May 31 17:51:15 pqmon: 186155 61 325784 2054262720 186155 471 325784 1549296 36481
> May 31 17:51:19 pqmon: 186162 60 325778 2054333056 186162 471 325778 1549296 36485
> May 31 17:51:22 pqmon: 186172 60 325768 2054454576 186172 471 325768 1549296 36488
> May 31 17:51:26 pqmon: 186176 60 325764 2054533992 186176 471 325764 1549296 36492
> May 31 17:51:29 pqmon: 186185 60 325755 2054675840 186185 471 325755 1549296 36495
> May 31 17:51:32 pqmon: 186190 60 325750 2054758024 186190 471 325750 1549296 36498
> May 31 17:51:35 pqmon: 186197 59 325744 2054844960 186197 471 325744 1549296 36501
> May 31 17:51:36 pqmon: Interrupt
> May 31 17:51:36 pqmon: Exiting
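>
> (For example, the last 'age' value above, 36501 seconds, works out to
> roughly 36501/3600, or a bit over 10 hours of data in the queue.)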
>
> Do you really want to keep that much data? If you have the space and
> everything's working fine, I guess there's no reason not to... This is
> just an FYI.
Yeah, I know it will hold a lot, but I like lots of data. :) As I said, if
I could make the queue even bigger, I would. Space is cheap these days,
and I figure that, as a relay, if someone downstream is down for a bunch
of hours, they can still catch up on the data.
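
For reference, the swap-file-plus-queue change on my end was along these
lines; the swap-file path is just an example, and the queue size used by
"ldmadmin mkqueue" is whatever the ldmadmin configuration specifies:

  # as root: create and enable a 2GB swap file
  dd if=/dev/zero of=/swapfile2 bs=1024k count=2048
  mkswap /swapfile2
  swapon /swapfile2

  # as ldm: rebuild the queue and restart
  ldmadmin stop
  ldmadmin delqueue
  ldmadmin mkqueue
  ldmadmin start
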
> Please let me know what you think about the WSI feed. I will be leaving
> here in about 15 minutes, but will give my own test a try later this
> afternoon when I return.
Maybe I'll try killing off some rpc.ldmd processes and see if things
improve, assuming I don't jam the system.
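
Before I kill anything blindly, I'll list the rpc.ldmds with their elapsed
times and match the older PIDs against the log, something like this (just
a sketch; the "[r]pc" keeps grep from matching itself):

  ps -eo pid,etime,args | grep '[r]pc.ldmd'

and then kill the stale ones by PID, or stop and start the whole LDM with
ldmadmin if it gets out of hand.
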
Thanks.
Art.
> Anne
> --
> ***************************************************
> Anne Wilson UCAR Unidata Program
> address@hidden P.O. Box 3000
> Boulder, CO 80307
> ----------------------------------------------------
> Unidata WWW server http://www.unidata.ucar.edu/
> ****************************************************
>
Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: address@hidden, phone: 814-863-1563