Tom McDermott: Re: 20001214: LDM: out of per user processes
- Subject: Tom McDermott: Re: 20001214: LDM: out of per user processes
- Date: Wed, 20 Dec 2000 10:54:21 -0700
Hi,
In case you're interested in the mystery of the process table filling
up on Tom McDermott's LDM host, there are a few more possible clues in
this note, but it hasn't been solved yet ...
--Russ
------- Forwarded Message
Date: Wed, 20 Dec 2000 12:35:25 -0500
From: Tom McDermott <address@hidden>
To: Russ Rew <address@hidden>
Subject: Re: 20001214: LDM: out of per user processes
On Fri, 15 Dec 2000, Russ Rew wrote:
> In the above message, you wrote:
>
> > Dec 14 06:31:03 vortex unix: NOTICE: out of per-user processes for uid 214
> > Dec 14 06:32:25 vortex last message repeated 23 times
> ...
> > Now uid 214 is the ldm, so it is the likely culprit. This happened
> > once before several months ago.
I think I should have phrased this as 'ldm is a possible culprit'.
> I've been unable to identify any certain LDM-related cause for this,
> but I can offer a couple of theories that you can evaluate.
>
> The only time we have seen anything like this here was in August when
> an LDM host had its load average climb to 2000 (!), and we determined
> that this was caused by a different LDM host running on an HP-UX 11.0
> system hammering it with FEEDME requests. We have never successfully
> gotten around the RPC library problems on the HP-UX 11.0 platform, so
> we distribute HP-UX 10.20 binaries for it and recommend people build
> the LDM using the HP-UX 10 compatibility mode for HP-UX 11 platforms.
>
> So one possibility is that some downstream site built the LDM for
> HP-UX 11 and then requested data from your site many times per second,
> causing an LDM sender process to be launched for each such request.
> The only sites we see feeding from your vortex host are
> blizzard.weather.brockport.edu and catwoman.cs.moravian.edu, but we
> don't have a record of whether either of these is an HP-UX platform.
> Do you happen to know? We've just gotten a new HP-UX 11 platform in,
> so we hope to be able to fix or find a workaround for this problem in
> the near future.
Well, blizzard is another Sun (Ultra-10) running Solaris 7. Catwoman at
Moravian College is a Sun Ultra 450 server running Solaris 8, so I think
we can definitely exclude the HP possibility.
> Another possible cause Anne had seen was upgrading to 5.1.2 without
> remaking the queue, but I have been unable to duplicate this problem
> here and can't understand how that could cause any additional processes
> to be spawned. When I tried it here, the LDM just reported the problem
> and exited, as it is supposed to do:
>
> Dec 14 20:09:06 rpc.ldmd[25256]: ldm.pq: Not a product queue
> Dec 14 20:09:06 rpc.ldmd[25256]: pq_open failed: ldm.pq: Invalid argument
> Dec 14 20:09:06 rpc.ldmd[25256]: Exiting
> Dec 14 20:09:06 rpc.ldmd[25256]: Terminating process group
I definitely remade the queue when upgrading to 5.1.2 at the end of August
(actually even before that because I ran the beta release), and have
remade the queue quite a few times since in order to change its size (you
recall the pq_del_oldest problem), so I think we can positively eliminate
this possibility as well.
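(For the record, by 'remaking the queue' I mean stopping the LDM, deleting
the old queue, and creating a fresh one before restarting, something along
these lines, assuming the usual ldmadmin wrapper:

    ldmadmin stop        # stop the server and decoders
    ldmadmin delqueue    # remove the old product queue
    ldmadmin mkqueue     # create a fresh queue at the configured size
    ldmadmin start       # restart

so the queue in use has always matched the running version.)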
> A final possibility is that the problem is caused by some decoder or
> other program or script launched by pqact. It is relatively easy to
> write a recursive shell script that quickly consumes the process table
> if there are no per-user limits set for a user who tries to debug and
> run such a script (I've done it!). If you have other users on the
> machine, one of their programs could have spawned processes
> recursively or in a loop and used up all the process table entries, so
> when the LDM tried to spawn a decoder process, it hit the limit and
> produced the message.
This I think is most likely. The only problem I see with the theory that
a different, non-ldm user executed a runaway program is that the syslog
message said 'out of per-user processes for uid 214', implying that it was
the user ldm that had run out of processes, not necessarily the system as
a whole. I don't know whether, if a non-ldm user had filled up the
(global) process table, a different user (ldm) trying to fork a new
process would get that per-user message, since it would have been someone
else's processes that ran amok.
So it would seem that focussing on the processes ldm spawns, such as a
decoder or script, might be the first line of attack. I don't recall any
recent changes to these. I last compiled gempak programs on 7/28/00, but
there were almost no changes to the binaries. The only other new decoder
is the pnga2area program for the new MCIDAS compression format, but I have
no reason to suspect this program. I will make a closer inspection of
pqact.conf when I get a chance to make sure nothing has escaped my memory
on this point.
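(Just to illustrate how little it takes to do what Russ describes, a
hypothetical runaway script -- not anything from my pqact.conf -- can
exhaust a per-user process limit in seconds:

    #!/bin/sh
    # runaway: each invocation launches two more copies of itself
    # in the background, so the process count doubles every round
    $0 &
    $0 &

If a decoder, or a script a decoder calls, ever re-invokes itself like
this, even accidentally, the process table fills almost immediately.)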
> Here are a couple of suggestions that might help diagnose the problem.
> First, take ps snapshots (or use top) to see all the ldm processes
> running and try to account for each one from log file entries, to make
> sure there aren't any extra processes being created. The "pgrep"
> command on Solaris 2.7 and later is useful for this, for example
>
> pgrep -fl -u ldm
>
> shows all processes owned by user "ldm", and piping this into "wc -l"
> would give you a quick count of ldm processes, and would let you
> monitor if ldm processes were climbing slowly. But this would be of
> no help if something triggers spawning a bunch of processes quickly.
> If that happens, it would help to have ps or pgrep output, but to
> catch it you might have to run a cron job that dumped the output of
> pgrep to a file every minute (overwriting the previous file or
> ping-ponging between two files), so that if your process table filled,
> you would have a record of what things looked like within the previous
> minute.
OK, I will try the top and ps file output suggestion. I guess the crontab
entry would be all stars (* * * * *) to indicate it should be run every
minute. But it may be a while before I get another occurrence. It was
around 2 months between the 2 incidents. Also, I will be upgrading to
gempak 5.6 sometime in the next couple of weeks, so that will change the
decoder environment. And while I'm doing that, I'll be taking a closer
look at pqact.conf to see if there's anything amiss there.
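Something like the following is what I have in mind, with hypothetical
paths (assuming ldm's home is /usr/local/ldm). The crontab entry:

    * * * * * /usr/local/ldm/util/psdump.sh

and a small psdump.sh that keeps the last two snapshots by rotating:

    #!/bin/sh
    # snapshot ldm's processes once a minute, keeping the previous
    # minute's snapshot so a sudden blowup leaves a usable record
    log=/usr/local/ldm/logs/psdump
    mv $log.1 $log.2 2>/dev/null
    pgrep -fl -u ldm > $log.1
    # append a quick count for eyeballing slow climbs
    pgrep -u ldm | wc -l >> $log.1

That should leave a record of what ldm's processes looked like within a
minute of the table filling.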
The only other things that come to mind are these. A much earlier version
of the NCSA httpd server (v. 1.3.4, I think) used to, under some
conditions, start forking hundreds of copies of the daemon on our system.
Upgrading to v 1.5.2 about 4 years ago solved that, and I have no special
reason to suspect httpd now beyond its having caused a similar problem in
the past. I suppose I should have upgraded to the
apache server before now. The other possible problem is a script (run out
of ldm's crontab) that downloads some radar graphics from a nws http
server. Under some conditions, which arise fairly frequently, the script
and its child processes don't exit and remain on the system. This results
in 3 additional processes for each hour that it happens, but they aren't
forking more processes, which I think would be required to fill up the
process table.
Something I thought of too late is that I'm running that syscheck script
distributed with ldm. Since I almost never look at the output, I forgot
last week that I was even running it. I didn't think of it until Monday,
and by then the log had been rotated out of existence. (I will rotate 7
logs from now on.) But I'm not sure that script gives the sort of info
that would be useful here anyway. I will have to go back to my tapes and
see what's in
the logs for the first incident in Oct. I will let you know if I find
anything interesting.
> I'm still very interested in resolving whether this is a symptom of an
> LDM bug, so if you find out anything else, please let me know.
> Thanks!
Really, the ldm on sparc Solaris has been extremely stable, I might say
awesomely stable, in recent years. However, I guess no program, no matter
how stable, is without bugs. They just become more obscure and are
triggered by rarer combinations of conditions.
Tom
------------------------------------------------------------------------------
Tom McDermott                 Email: address@hidden
System Administrator          Phone: (716) 395-5718
Earth Sciences Dept.          Fax:   (716) 395-2416
SUNY College at Brockport
------- End of Forwarded Message