Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Subject: Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Date: Thu, 31 May 2001 15:10:56 -0400 (EDT)
On Thu, 31 May 2001, Anne Wilson wrote:
> "Arthur A. Person" wrote:
> >
> > On Thu, 31 May 2001, Anne Wilson wrote:
> >
> > > And, what's the history on the
> > > queue size?
> >
> > I believe I started running the ldm with a queue size of 2GB around May
> > 18, but with only a couple of rpc's in test mode. I then added my
> > downstream sites at the end of last week, and over the weekend (Sunday)
> > the system choked with the thrashing. I came in and power-cycled,
> > re-made the queue at 300MB, and restarted in hopes I would get through
> > the rest of the long weekend okay, and did. At that point, my swap
> > space was a 1.5GB partition, and I began thinking I perhaps needed swap
> > space larger than my queue size if the queue is mapped, so I added a
> > 2GB swap file to the system, re-made the queue at 2GB, and restarted
> > the ldm on Tuesday. This morning I noticed the system was thrashing
> > again; I don't know exactly when it started.
> >
> > > Do you normally run with less than 300MB, and is that what
> > > you're doing now?
> >
> > I'm running with a 2GB queue now, which is what I want to run with.
> > Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> >
>
> Please see below for a comment about this.
>
> > > How many rpc.ldmd processes are currently running? (I hope it's
> > > responsive enough to tell.)
> >
> > Perhaps this is revealing... there's a bunch of rpc's running, I think
> > more than there should be:
>
> Yes, this doesn't look right. Currently you have 78 of these processes
> running. That's five more than what you reported to me earlier, and four
> more than when I first logged in. The number seems to be growing.
>
> [ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
> 79 789 6380
>
> The max number of rpc.ldmds you should have is the number of requests to
> "unique" hosts plus the number of allows to "unique" hosts plus one. (I
> qualify "unique" because, as you know, the LDM will group
> requests/allows to the same hosts unless you trick it by using the IP
> address.) You may have fewer rpc.ldmds if your upstream hosts are
> unavailable or your downstream sites are not connected. Anyway, you
> have way more than you should, based on the entries in your ldmd.conf:
>
> [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
> 5 20 228
> [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
> 36 107 1669
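>
> A rough way to tally the expected count, assuming the host is the last
> field on your request lines and the third field on your allow lines
> (add the two results plus one):
>
> [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | awk '{print $NF}' | sort -u | wc -l
> [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | awk '{print $3}' | sort -u | wc -l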
>
> I'm developing a hypothesis: looking at the PIDs of the running
> rpc.ldmds and comparing them with the PIDs listed in the log, it looks
> like sysu1.wsicorp.com is connecting a lot more often than it's exiting.
> Take a look at this:
>
> [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
> 177 1416 12213
> [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
> 121 726 5203
> [ldm@ldm ~/logs]$ ^sysu1^windfall
> grep windfall ldmd.log | grep -E "Exiting" | wc
> 44 264 2024
> [ldm@ldm ~/logs]$ ^Exiting^Connection from
> grep windfall ldmd.log | grep -E "Connection from" | wc
> 44 352 3564
> [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
> 18 144 1170
> [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
> 18 108 846
>
> The stats for windfall and bob are for comparison. You'll see that for
> those two hosts the number of connects and exits are the same. I'd
> expect them to be the same plus or minus one.
>
> I know WSI has their own version of the LDM, based on a very early
> release. As an experiment, are you able to do without this data for a
> bit? I will conduct my own experiment here on our 7.1 machine, but it
> may take me a little time, as I have to be away for a few hours starting
> soon.
I was starting to suspect the WSI feed as well. I know they've had a lot
of trouble staying connected here, and when I saw all the rpc's, I started
thinking about WSI trying to connect. If their software is old, they
should probably update it, since their feed has not been as reliable as it
used to be, unless it's just a network bandwidth problem. I guess I would
have to take that up with them. But the connect/reconnect thing shouldn't
bring my system down either, should it?
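
To keep an eye on this from my end, I'll probably tally connects versus
exits per host with a quick loop along these lines (just a sketch, using
the hostnames from your log excerpts above):

  for h in sysu1 windfall bob
  do
    c=`grep $h ldmd.log | grep -c "Connection from"`
    e=`grep $h ldmd.log | grep -c "Exiting"`
    echo "$h: $c connects, $e exits"
  done
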
> One other point. With your 2GB queue, you have lots of data. At the
> time I ran the pqmon command below you had over 10 hours worth of data,
> and it was growing (see the 'age' field - it gives the age of the oldest
> product in the queue). The number of products in the queue is also
> going up, so space is not yet being recycled:
>
> [ldm@ldm ~/data]$ pqmon -i3
> May 31 17:49:25 pqmon: Starting Up (17268)
> May 31 17:49:26 pqmon: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
> May 31 17:49:26 pqmon: 185848 64 326088 2051968120 185848 471 326088 1549296 36372
> May 31 17:49:29 pqmon: 185868 64 326068 2052103712 185868 471 326068 1549296 36376
> May 31 17:49:33 pqmon: 185877 64 326059 2052139000 185877 471 326059 1549296 36379
> May 31 17:49:36 pqmon: 185890 64 326046 2052203688 185890 471 326046 1549296 36382
> May 31 17:49:39 pqmon: 185901 64 326035 2052238392 185901 471 326035 1549296 36386
> May 31 17:49:42 pqmon: 185916 64 326020 2052322080 185916 471 326020 1549296 36389
> May 31 17:49:46 pqmon: 185923 63 326014 2052353264 185923 471 326014 1549296 36392
> May 31 17:49:49 pqmon: 185938 63 325999 2052437608 185938 471 325999 1549296 36395
> May 31 17:49:52 pqmon: 185944 63 325993 2052463160 185944 471 325993 1549296 36398
> May 31 17:49:55 pqmon: 185947 63 325990 2052480008 185947 471 325990 1549296 36402
> May 31 17:49:59 pqmon: 185952 63 325985 2052525544 185952 471 325985 1549296 36405
> May 31 17:50:03 pqmon: 185959 63 325978 2052588304 185959 471 325978 1549296 36409
> May 31 17:50:06 pqmon: 185967 62 325971 2052651936 185967 471 325971 1549296 36412
> May 31 17:50:09 pqmon: 185977 62 325961 2052717376 185977 471 325961 1549296 36416
> May 31 17:50:12 pqmon: 185988 62 325950 2052812104 185988 471 325950 1549296 36419
> May 31 17:50:16 pqmon: 185992 62 325946 2052852920 185992 471 325946 1549296 36422
> May 31 17:50:19 pqmon: 186002 62 325936 2052912024 186002 471 325936 1549296 36425
> May 31 17:50:22 pqmon: 186013 62 325925 2053009880 186013 471 325925 1549296 36428
> May 31 17:50:25 pqmon: 186018 61 325921 2053029616 186018 471 325921 1549296 36432
> May 31 17:50:29 pqmon: 186031 61 325908 2053061800 186031 471 325908 1549296 36435
> May 31 17:50:32 pqmon: 186039 61 325900 2053099008 186039 471 325900 1549296 36439
> May 31 17:50:35 pqmon: 186048 61 325891 2053150176 186048 471 325891 1549296 36442
> May 31 17:50:39 pqmon: 186059 61 325880 2053246544 186059 471 325880 1549296 36445
> May 31 17:50:42 pqmon: 186070 61 325869 2053333296 186070 471 325869 1549296 36448
> May 31 17:50:45 pqmon: 186081 61 325858 2053422336 186081 471 325858 1549296 36452
> May 31 17:50:49 pqmon: 186095 61 325844 2053506456 186095 471 325844 1549296 36455
> May 31 17:50:52 pqmon: 186103 61 325836 2053532408 186103 471 325836 1549296 36459
> May 31 17:50:56 pqmon: 186112 61 325827 2053643864 186112 471 325827 1549296 36462
> May 31 17:50:59 pqmon: 186118 61 325821 2053755592 186118 471 325821 1549296 36465
> May 31 17:51:02 pqmon: 186124 61 325815 2053858840 186124 471 325815 1549296 36469
> May 31 17:51:05 pqmon: 186128 61 325811 2053906992 186128 471 325811 1549296 36472
> May 31 17:51:09 pqmon: 186139 61 325800 2054017464 186139 471 325800 1549296 36475
> May 31 17:51:12 pqmon: 186148 61 325791 2054157200 186148 471 325791 1549296 36478
> May 31 17:51:15 pqmon: 186155 61 325784 2054262720 186155 471 325784 1549296 36481
> May 31 17:51:19 pqmon: 186162 60 325778 2054333056 186162 471 325778 1549296 36485
> May 31 17:51:22 pqmon: 186172 60 325768 2054454576 186172 471 325768 1549296 36488
> May 31 17:51:26 pqmon: 186176 60 325764 2054533992 186176 471 325764 1549296 36492
> May 31 17:51:29 pqmon: 186185 60 325755 2054675840 186185 471 325755 1549296 36495
> May 31 17:51:32 pqmon: 186190 60 325750 2054758024 186190 471 325750 1549296 36498
> May 31 17:51:35 pqmon: 186197 59 325744 2054844960 186197 471 325744 1549296 36501
> May 31 17:51:36 pqmon: Interrupt
> May 31 17:51:36 pqmon: Exiting
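>
> (For example, the last 'age' value above, 36501 seconds, works out to
> roughly 36501/3600, or a bit over 10 hours of data in the queue.)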
>
> Do you really want to keep that much data? If you have the space and
> everything's working fine, I guess there's no reason not to... This is
> just an FYI.
Yeah, I know it will hold a lot, but I like lots of data. :) As I said, if
I could make the queue even bigger, I would. Space is cheap these days,
and I figure that, as a relay, if someone downstream is down for a bunch
of hours, they can still catch up on the data.
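
For reference, the swap-file-plus-queue change on my end was along these
lines; the swap-file path is just an example, and the queue size used by
"ldmadmin mkqueue" is whatever the ldmadmin configuration specifies:

  # as root: create and enable a 2GB swap file
  dd if=/dev/zero of=/swapfile2 bs=1024k count=2048
  mkswap /swapfile2
  swapon /swapfile2

  # as ldm: rebuild the queue and restart
  ldmadmin stop
  ldmadmin delqueue
  ldmadmin mkqueue
  ldmadmin start
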
> Please let me know what you think about the WSI feed. I will be leaving
> here in about 15 minutes, but will give my own test a try later this
> afternoon when I return.
Maybe I'll try killing off some rpc.ldmd processes and see if things
improve, assuming I don't jam the system.
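
Before I kill anything blindly, I'll list the rpc.ldmds with their elapsed
times and match the older PIDs against the log, something like this (just
a sketch; the "[r]pc" keeps grep from matching itself):

  ps -eo pid,etime,args | grep '[r]pc.ldmd'

and then kill the stale ones by PID, or stop and start the whole LDM with
ldmadmin if it gets out of hand.
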
Thanks.
Art.
> Anne
> --
> ***************************************************
> Anne Wilson UCAR Unidata Program
> address@hidden P.O. Box 3000
> Boulder, CO 80307
> ----------------------------------------------------
> Unidata WWW server http://www.unidata.ucar.edu/
> ****************************************************
>
Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: address@hidden, phone: 814-863-1563