Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Subject: Re: 20010531: ldm 5.1.3 with RH 7.1 thrashing
- Date: Thu, 31 May 2001 15:33:06 -0400 (EDT)
On Thu, 31 May 2001, Arthur A. Person wrote:
> On Thu, 31 May 2001, Anne Wilson wrote:
>
> > "Arthur A. Person" wrote:
> > >
> > > On Thu, 31 May 2001, Anne Wilson wrote:
> > >
> > > > And, what's the history on the
> > > > queue size?
> > >
> > > I believe I started running the ldm with a queue size of 2GB around May
> > > 18, but with only a couple of rpc's in test mode. I then added my
> > > downstream sites at the end of last week, and over the weekend (Sunday)
> > > the system choked with the thrashing. I came in and power-cycled,
> > > re-made the queue at 300MB, and restarted in hopes I would get through
> > > the rest of the long weekend okay, and did. At that point my swap space
> > > was a 1.5GB partition, and I began thinking I perhaps needed swap space
> > > larger than my queue size if the queue is memory-mapped, so I added a
> > > 2GB swap file to the system, re-made a 2GB queue, and restarted the ldm
> > > again on Tuesday. This morning I noticed the system was thrashing
> > > again; I don't know exactly when it started.
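> > >
> > > (For reference, adding the swap went roughly like the following -- the
> > > file name is just illustrative, and since the queue size is set in my
> > > ldmadmin setup, take the last line as a sketch rather than the exact
> > > commands:)
> > >
> > >   dd if=/dev/zero of=/swapfile2 bs=1M count=2048   # 2GB of zeros
> > >   mkswap /swapfile2                                # format it as swap
> > >   swapon /swapfile2                                # enable it
> > >   ldmadmin delqueue && ldmadmin mkqueue            # re-make the 2GB queue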
> > >
> > > > Do you normally run with less than 300MB, and is that what
> > > > you're doing now?
> > >
> > > I'm running with a 2GB queue now, which is what I want to run with.
> > > Actually, I'd make it bigger, but Linux (ldm?) doesn't seem to support it.
> > >
> >
> > Please see below for a comment about this.
> >
> > > > How many rpc.ldmd processes are currently running? (I hope it's
> > > > responsive enough to tell.)
> > >
> > > Perhaps this is revealing... there's a bunch of rpc's running, I think
> > > more than there should be:
> >
> > Yes, this doesn't look right. Currently you have 78 of these processes
> > running. That's five more than you reported to me earlier, and four
> > more than when I first logged in. The number seems to be growing.
> >
> > [ldm@ldm ~/etc]$ ps -ef | grep rpc.ldmd | wc
> > 79 789 6380
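> >
> > (Aside: bracketing one character in the pattern keeps grep from matching
> > its own command line, so the count leaves out the grep itself and should
> > land right around the 78 I mentioned:)
> >
> > [ldm@ldm ~/etc]$ ps -ef | grep '[r]pc.ldmd' | wc -l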
> >
> > The max number of rpc.ldmds you should have is the number of requests to
> > "unique" hosts plus the number of allows to "unique" hosts plus one. (I
> > qualify "unique" because, as you know, the LDM will group
> > requests/allows to the same hosts unless you trick it by using the IP
> > address.) You may have fewer rpc.ldmds if your upstream hosts are
> > unavailable or your downstream sites are not connected. Anyway, you
> > have way more than you should, based on the entries in your ldmd.conf:
> >
> > [ldm@ldm ~/etc]$ grep -E "^request" ldmd.conf | wc
> > 5 20 228
> > [ldm@ldm ~/etc]$ grep -E "^allow" ldmd.conf | wc
> > 36 107 1669
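> >
> > By that rule the ceiling here should be about 5 + 36 + 1 = 42. As a
> > convenience, the arithmetic can be done in one line (just a sketch):
> >
> > [ldm@ldm ~/etc]$ echo $(( $(grep -cE "^request" ldmd.conf) + $(grep -cE "^allow" ldmd.conf) + 1 ))
> > 42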
> >
> > I'm developing a hypothesis: In looking at the PIDs of the running
> > rpc.ldmds and comparing those with the PIDs listed in the log, it looks
> > like sysu1.wsicorp.com is connecting a lot more than it's exiting. Take
> > a look at this:
> >
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Connection from" | wc
> > 177 1416 12213
> > [ldm@ldm ~/logs]$ grep sysu1 ldmd.log | grep -E "Exiting" | wc
> > 121 726 5203
> > [ldm@ldm ~/logs]$ ^sysu1^windfall
> > grep windfall ldmd.log | grep -E "Exiting" | wc
> > 44 264 2024
> > [ldm@ldm ~/logs]$ ^Exiting^Connection from
> > grep windfall ldmd.log | grep -E "Connection from" | wc
> > 44 352 3564
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Connection from" | wc
> > 18 144 1170
> > [ldm@ldm ~/logs]$ grep bob ldmd.log | grep -E "Exiting" | wc
> > 18 108 846
> >
> > The stats for windfall and bob are for comparison. You'll see that for
> > those two hosts the numbers of connects and exits are the same; I'd
> > expect them to match, plus or minus one.
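> >
> > (If you want to run the same connect/exit comparison across several
> > hosts at once, a rough one-liner along these lines works; the host list
> > is just an example:)
> >
> > [ldm@ldm ~/logs]$ for h in sysu1 windfall bob; do c=`grep $h ldmd.log | grep -c "Connection from"`; e=`grep $h ldmd.log | grep -c Exiting`; echo $h: $c connects, $e exits; done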
> >
> > I know WSI has their own unique version of the LDM based on a very early
> > version. As an experiment, are you able to do without this data for a
> > bit? I will conduct my own experiment here on our 7.1 machine, but it
> > may take me a little time, as I have to be away for a few hours starting
> > soon.
>
> I was starting to suspect the WSI feed as well. I know they've had a lot
> of trouble staying connected here, and when I saw all the rpc's, I started
> thinking about WSI trying to connect. If their software is old, they
> should probably update, since their feed has not been as reliable as it
> used to be, unless it's just a network bandwidth problem. I guess I
> would have to take that up with them. But the connect/reconnect thing
> shouldn't drag my system down either, should it?
>
> > One other point. With your 2GB queue, you have lots of data. At the
> > time I ran the pqmon command below, you had over 10 hours worth of data,
> > and it was growing (see the 'age' field - it gives the age of the oldest
> > product in the queue). The number of products in the queue is also
> > going up, so space is not yet being recycled:
> >
> > [ldm@ldm ~/data]$ pqmon -i3
> > May 31 17:49:25 pqmon: Starting Up (17268)
> > May 31 17:49:26 pqmon: nprods nfree nempty nbytes maxprods maxfree minempty maxext age
> > May 31 17:49:26 pqmon: 185848 64 326088 2051968120 185848 471 326088 1549296 36372
> > May 31 17:49:29 pqmon: 185868 64 326068 2052103712 185868 471 326068 1549296 36376
> > May 31 17:49:33 pqmon: 185877 64 326059 2052139000 185877 471 326059 1549296 36379
> > May 31 17:49:36 pqmon: 185890 64 326046 2052203688 185890 471 326046 1549296 36382
> > May 31 17:49:39 pqmon: 185901 64 326035 2052238392 185901 471 326035 1549296 36386
> > May 31 17:49:42 pqmon: 185916 64 326020 2052322080 185916 471 326020 1549296 36389
> > May 31 17:49:46 pqmon: 185923 63 326014 2052353264 185923 471 326014 1549296 36392
> > May 31 17:49:49 pqmon: 185938 63 325999 2052437608 185938 471 325999 1549296 36395
> > May 31 17:49:52 pqmon: 185944 63 325993 2052463160 185944 471 325993 1549296 36398
> > May 31 17:49:55 pqmon: 185947 63 325990 2052480008 185947 471 325990 1549296 36402
> > May 31 17:49:59 pqmon: 185952 63 325985 2052525544 185952 471 325985 1549296 36405
> > May 31 17:50:03 pqmon: 185959 63 325978 2052588304 185959 471 325978 1549296 36409
> > May 31 17:50:06 pqmon: 185967 62 325971 2052651936 185967 471 325971 1549296 36412
> > May 31 17:50:09 pqmon: 185977 62 325961 2052717376 185977 471 325961 1549296 36416
> > May 31 17:50:12 pqmon: 185988 62 325950 2052812104 185988 471 325950 1549296 36419
> > May 31 17:50:16 pqmon: 185992 62 325946 2052852920 185992 471 325946 1549296 36422
> > May 31 17:50:19 pqmon: 186002 62 325936 2052912024 186002 471 325936 1549296 36425
> > May 31 17:50:22 pqmon: 186013 62 325925 2053009880 186013 471 325925 1549296 36428
> > May 31 17:50:25 pqmon: 186018 61 325921 2053029616 186018 471 325921 1549296 36432
> > May 31 17:50:29 pqmon: 186031 61 325908 2053061800 186031 471 325908 1549296 36435
> > May 31 17:50:32 pqmon: 186039 61 325900 2053099008 186039 471 325900 1549296 36439
> > May 31 17:50:35 pqmon: 186048 61 325891 2053150176 186048 471 325891 1549296 36442
> > May 31 17:50:39 pqmon: 186059 61 325880 2053246544 186059 471 325880 1549296 36445
> > May 31 17:50:42 pqmon: 186070 61 325869 2053333296 186070 471 325869 1549296 36448
> > May 31 17:50:45 pqmon: 186081 61 325858 2053422336 186081 471 325858 1549296 36452
> > May 31 17:50:49 pqmon: 186095 61 325844 2053506456 186095 471 325844 1549296 36455
> > May 31 17:50:52 pqmon: 186103 61 325836 2053532408 186103 471 325836 1549296 36459
> > May 31 17:50:56 pqmon: 186112 61 325827 2053643864 186112 471 325827 1549296 36462
> > May 31 17:50:59 pqmon: 186118 61 325821 2053755592 186118 471 325821 1549296 36465
> > May 31 17:51:02 pqmon: 186124 61 325815 2053858840 186124 471 325815 1549296 36469
> > May 31 17:51:05 pqmon: 186128 61 325811 2053906992 186128 471 325811 1549296 36472
> > May 31 17:51:09 pqmon: 186139 61 325800 2054017464 186139 471 325800 1549296 36475
> > May 31 17:51:12 pqmon: 186148 61 325791 2054157200 186148 471 325791 1549296 36478
> > May 31 17:51:15 pqmon: 186155 61 325784 2054262720 186155 471 325784 1549296 36481
> > May 31 17:51:19 pqmon: 186162 60 325778 2054333056 186162 471 325778 1549296 36485
> > May 31 17:51:22 pqmon: 186172 60 325768 2054454576 186172 471 325768 1549296 36488
> > May 31 17:51:26 pqmon: 186176 60 325764 2054533992 186176 471 325764 1549296 36492
> > May 31 17:51:29 pqmon: 186185 60 325755 2054675840 186185 471 325755 1549296 36495
> > May 31 17:51:32 pqmon: 186190 60 325750 2054758024 186190 471 325750 1549296 36498
> > May 31 17:51:35 pqmon: 186197 59 325744 2054844960 186197 471 325744 1549296 36501
> > May 31 17:51:36 pqmon: Interrupt
> > May 31 17:51:36 pqmon: Exiting
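> >
> > (The 'age' column is in seconds, so the oldest product above works out
> > to roughly 36372/3600, i.e. a little over 10 hours -- for example:)
> >
> > [ldm@ldm ~/data]$ echo "scale=1; 36372/3600" | bc
> > 10.1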
> >
> > Do you really want to keep that much data? If you have the space and
> > everything's working fine, I guess there's no reason not to... This is
> > just an FYI.
>
> Yeah, I know it will hold a lot, but I like lots of data :) As I said, if
> I could make the queue even bigger, I would. Space is cheap these days,
> and I figure that, as a relay, if someone downstream is down for a bunch
> of hours, they can still catch up on the data.
>
> > Please let me know what you think about the WSI feed. I will be leaving
> > here in about 15 minutes, but will give my own test a try later this
> > afternoon when I return.
>
> Maybe I'll try killing off some rpc.ldmd processes and see if things
> improve, assuming I don't jam the system.
Okay, I killed off most or all of the hung rpc's... not sure it helped
much, though. The disk may be a little less busy, but it's still pretty
busy. Maybe this is still some sort of queue issue...?
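(For the record, the clean-up was roughly along these lines -- the PIDs
shown are just placeholders:)

  ps -ef | grep '[r]pc.ldmd'    # list the rpc.ldmd children and their PIDs
  kill 12345 12346              # hypothetical PIDs of the hung ones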
Art.
Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: address@hidden, phone: 814-863-1563