- Subject: Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Date: Thu, 31 May 2001 12:55:42 -0600
"Arthur A. Person" wrote:
>
> On Thu, 31 May 2001, Anne Wilson wrote:
>
> > And, what's the history on the
> > queue size?
>
> I believe I started running the LDM with a queue size of 2GB around May
> 18, but with only a couple of rpc.ldmds in test mode. I then added my
> downstream sites at the end of last week, and over the weekend (Sunday)
> the system choked with the thrashing. I came in, power-cycled, re-made
> the queue at 300MB, and restarted in hopes of getting through the rest
> of the long weekend okay, which I did. At that point my swap space was
> a 1.5GB partition, and I began thinking I perhaps needed swap space
> larger than my queue size if the queue is memory-mapped, so I added a
> 2GB swap file to the system and restarted the LDM on Tuesday with a
> re-made 2GB queue. This morning I noticed the system was thrashing
> again; I don't know exactly when it started.
>
> > Do you normally run with less than 300MB, and is that what
> > you're doing now?
>
> I'm running with a 2GB queue now, which is what I want to run with.
> Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
>
Please see below for a comment about this.
> > How many rpc.ldmd processes are currently running? (I hope it's
> > responsive enough to tell.)
>
> Perhaps this is revealing... there's a bunch of rpc.ldmds running, I
> think more than there should be:
Yes, this doesn't look right. Currently you have 78 of these processes
running. That's five more than what you reported to me earlier, and four
more than when I first logged in. The number seems to be growing.
[ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
79 789 6380
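A small aside: the line count of 79 presumably includes the grep
itself, which is why it reads one higher than the 78 rpc.ldmds actually
running. One common trick to exclude the self-match is to bracket a
character in the pattern, so the grep's own command line no longer
matches:
[ldm@ldm ~/etc]$ ps -ef | grep "[r]pc.ldmd" | wc -l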
The max number of rpc.ldmds you should have is the number of requests to
"unique" hosts plus the number of allows to "unique" hosts plus one. (I
qualify "unique" because, as you know, the LDM will group
requests/allows to the same hosts unless you trick it by using the IP
address.) You may have fewer rpc.ldmds if your upstream hosts are
unavailable or your downstream sites are not connected. Anyway, you
have way more than you should, based on the entries in your ldmd.conf
(a way to count unique hosts is sketched after these counts):
[ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
5 20 228
[ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
36 107 1669
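By that rule your ceiling here would be at most 5 + 36 + 1 = 42, and
fewer still if any of those lines name the same host - versus the 78
rpc.ldmds running. If you want to count unique hosts directly,
something like the following should work; this is just a sketch that
assumes the host name is the fourth field of a request line and the
third field of an allow line, so adjust the field numbers if your
ldmd.conf is laid out differently:
[ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | awk '{print $4}' | sort -u | wc -l
[ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | awk '{print $3}' | sort -u | wc -l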
I'm developing a hypothesis: In looking at the PIDs of the running
rpc.ldmds and comparing those with the PIDs listed in the log, it looks
like sysu1.wsicorp.com is connecting a lot more than it's exiting. Take
a look at this:
[ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
177 1416 12213
[ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
121 726 5203
[ldm@ldm ~/logs]$ ^sysu1^windfall
grep windfall ldmd.log | grep -E "Exiting" | wc
44 264 2024
[ldm@ldm ~/logs]$ ^Exiting^Connection from
grep windfall ldmd.log | grep -E "Connection from" | wc
44 352 3564
[ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
18 144 1170
[ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
18 108 846
The stats for windfall and bob are for comparison. You'll see that for
those two hosts the numbers of connects and exits match; I'd expect
them to be the same, plus or minus one. (A loop that checks all the
hosts at once is sketched below.)
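Something like this should do it - a sketch, with the host list and
the log file name as placeholders to adjust for your site:
[ldm@ldm ~/logs]$ for h in sysu1 windfall bob; do
>   c=`grep $h ldmd.log | grep -c "Connection from"`
>   e=`grep $h ldmd.log | grep -c "Exiting"`
>   echo "$h: $c connects, $e exits"
> done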
I know WSI runs their own version of the LDM, based on a very early
release. As an experiment, are you able to do without this data for a
bit? I will conduct my own experiment here on our 7.1 machine, but it
may take me a little time, as I have to be away for a few hours starting
soon.
One other point. With your 2GB queue, you have lots of data. At the
time I ran the pqmon command below you had over 10 hours' worth, and it
was growing (see the 'age' field - it gives the age, in seconds, of the
oldest product in the queue). The number of products in the queue is
also going up, so space is not yet being recycled (a quick way to watch
this is sketched after the output):
[ldm@ldm ~/data]$ pqmon -i3
May 31 17:49:25 pqmon: Starting Up (17268)
May 31 17:49:26 pqmon: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
May 31 17:49:26 pqmon: 185848 64 326088 2051968120 185848 471 326088 1549296 36372
May 31 17:49:29 pqmon: 185868 64 326068 2052103712 185868 471 326068 1549296 36376
May 31 17:49:33 pqmon: 185877 64 326059 2052139000 185877 471 326059 1549296 36379
May 31 17:49:36 pqmon: 185890 64 326046 2052203688 185890 471 326046 1549296 36382
May 31 17:49:39 pqmon: 185901 64 326035 2052238392 185901 471 326035 1549296 36386
May 31 17:49:42 pqmon: 185916 64 326020 2052322080 185916 471 326020 1549296 36389
May 31 17:49:46 pqmon: 185923 63 326014 2052353264 185923 471 326014 1549296 36392
May 31 17:49:49 pqmon: 185938 63 325999 2052437608 185938 471 325999 1549296 36395
May 31 17:49:52 pqmon: 185944 63 325993 2052463160 185944 471 325993 1549296 36398
May 31 17:49:55 pqmon: 185947 63 325990 2052480008 185947 471 325990 1549296 36402
May 31 17:49:59 pqmon: 185952 63 325985 2052525544 185952 471 325985 1549296 36405
May 31 17:50:03 pqmon: 185959 63 325978 2052588304 185959 471 325978 1549296 36409
May 31 17:50:06 pqmon: 185967 62 325971 2052651936 185967 471 325971 1549296 36412
May 31 17:50:09 pqmon: 185977 62 325961 2052717376 185977 471 325961 1549296 36416
May 31 17:50:12 pqmon: 185988 62 325950 2052812104 185988 471 325950 1549296 36419
May 31 17:50:16 pqmon: 185992 62 325946 2052852920 185992 471 325946 1549296 36422
May 31 17:50:19 pqmon: 186002 62 325936 2052912024 186002 471 325936 1549296 36425
May 31 17:50:22 pqmon: 186013 62 325925 2053009880 186013 471 325925 1549296 36428
May 31 17:50:25 pqmon: 186018 61 325921 2053029616 186018 471 325921 1549296 36432
May 31 17:50:29 pqmon: 186031 61 325908 2053061800 186031 471 325908 1549296 36435
May 31 17:50:32 pqmon: 186039 61 325900 2053099008 186039 471 325900 1549296 36439
May 31 17:50:35 pqmon: 186048 61 325891 2053150176 186048 471 325891 1549296 36442
May 31 17:50:39 pqmon: 186059 61 325880 2053246544 186059 471 325880 1549296 36445
May 31 17:50:42 pqmon: 186070 61 325869 2053333296 186070 471 325869 1549296 36448
May 31 17:50:45 pqmon: 186081 61 325858 2053422336 186081 471 325858 1549296 36452
May 31 17:50:49 pqmon: 186095 61 325844 2053506456 186095 471 325844 1549296 36455
May 31 17:50:52 pqmon: 186103 61 325836 2053532408 186103 471 325836 1549296 36459
May 31 17:50:56 pqmon: 186112 61 325827 2053643864 186112 471 325827 1549296 36462
May 31 17:50:59 pqmon: 186118 61 325821 2053755592 186118 471 325821 1549296 36465
May 31 17:51:02 pqmon: 186124 61 325815 2053858840 186124 471 325815 1549296 36469
May 31 17:51:05 pqmon: 186128 61 325811 2053906992 186128 471 325811 1549296 36472
May 31 17:51:09 pqmon: 186139 61 325800 2054017464 186139 471 325800 1549296 36475
May 31 17:51:12 pqmon: 186148 61 325791 2054157200 186148 471 325791 1549296 36478
May 31 17:51:15 pqmon: 186155 61 325784 2054262720 186155 471 325784 1549296 36481
May 31 17:51:19 pqmon: 186162 60 325778 2054333056 186162 471 325778 1549296 36485
May 31 17:51:22 pqmon: 186172 60 325768 2054454576 186172 471 325768 1549296 36488
May 31 17:51:26 pqmon: 186176 60 325764 2054533992 186176 471 325764 1549296 36492
May 31 17:51:29 pqmon: 186185 60 325755 2054675840 186185 471 325755 1549296 36495
May 31 17:51:32 pqmon: 186190 60 325750 2054758024 186190 471 325750 1549296 36498
May 31 17:51:35 pqmon: 186197 59 325744 2054844960 186197 471 325744 1549296 36501
May 31 17:51:36 pqmon: Interrupt
May 31 17:51:36 pqmon: Exiting
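If you just want to keep an eye on the growth, a little awk on the
pqmon output will pull out the product count and convert the age to
hours. This is a sketch that assumes the one-line-per-sample layout
shown above (nprods in the fifth field, age in the last):
[ldm@ldm ~/data]$ pqmon -i3 2>&1 | awk '/pqmon: [0-9]/ {printf "%s products, %.1f hours\n", $5, $NF/3600}'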
Do you really want to keep that much data? If you have the space and
everything's working fine, I guess there's no reason not to... This is
just an FYI.
Please let me know what you think about the WSI feed. I will be leaving
here in about 15 minutes, but will give my own test a try later this
afternoon when I return.
Anne
--
***************************************************
Anne Wilson UCAR Unidata Program
address@hidden P.O. Box 3000
Boulder, CO 80307
----------------------------------------------------
Unidata WWW server http://www.unidata.ucar.edu/
****************************************************