This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Hi,

In case you're interested in the mystery of the process table filling up on Tom McDermott's LDM host, there are a few more possible clues in this note, but it hasn't been solved yet ...

--Russ

------- Forwarded Message

Date: Wed, 20 Dec 2000 12:35:25 -0500
From: Tom McDermott <address@hidden>
To: Russ Rew <address@hidden>
Subject: Re: 20001214: LDM: out of per user processes

On Fri, 15 Dec 2000, Russ Rew wrote:

> In the above message, you wrote:
>
> > Dec 14 06:31:03 vortex unix: NOTICE: out of per-user processes for uid 214
> > Dec 14 06:32:25 vortex last message repeated 23 times
> ...
> > Now uid 214 is the ldm, so it is the likely culprit. This happened
> > once before several months ago.

I think I should have phrased this as 'ldm is a possible culprit'.

> I've been unable to identify any certain LDM-related cause for this,
> but I can offer a couple of theories that you can evaluate.
>
> The only time we have seen anything like this here was in August when
> an LDM host had its load average climb to 2000 (!), and we determined
> that this was caused by a different LDM host running on an HP-UX 11.0
> system hammering it with FEEDME requests. We have never successfully
> gotten around the RPC library problems on the HP-UX 11.0 platform, so
> we distribute HP-UX 10.20 binaries for it and recommend people build
> the LDM using the HP-UX 10 compatibility mode for HP-UX 11 platforms.
>
> So one possibility is that some downstream site built the LDM for
> HP-UX 11 and then requested data from your site many times per second,
> causing an LDM sender process to be launched for each such request.
> The only sites we see feeding from your vortex host are
> blizzard.weather.brockport.edu and catwoman.cs.moravian.edu, but we
> don't have a record of whether either of these is an HP-UX platform.
> Do you happen to know? We've just gotten a new HP-UX 11 platform in,
> so we hope to be able to fix or find a workaround for this problem in
> the near future.

Well, blizzard is another Sun (Ultra-10) running Solaris 7. Catwoman at
Moravian College is a Sun Ultra 450 server running Solaris 8, so I think
we can definitely exclude the HP possibility.

> Another possible cause Anne had seen was upgrading to 5.1.2 without
> remaking the queue, but I have been unable to duplicate this problem
> here and can't understand how that could cause spawning any additional
> processes. When I tried it here, the LDM just reported the problem
> and exited, as it is supposed to do:
>
> Dec 14 20:09:06 rpc.ldmd[25256]: ldm.pq: Not a product queue
> Dec 14 20:09:06 rpc.ldmd[25256]: pq_open failed: ldm.pq: Invalid argument
> Dec 14 20:09:06 rpc.ldmd[25256]: Exiting
> Dec 14 20:09:06 rpc.ldmd[25256]: Terminating process group

I definitely remade the queue when upgrading to 5.1.2 at the end of August
(actually even before that, because I ran the beta release), and have
remade the queue quite a few times since in order to change its size (you
recall the pq_del_oldest problem), so I think we can positively eliminate
this possibility as well.
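
The queue remake described here is just a stop/delete/recreate/restart
cycle. With the ldmadmin script of that era it would look roughly like the
following; the exact options and where the queue size gets set vary by LDM
version, so treat this as a sketch rather than the exact commands:

    ldmadmin stop        # shut the LDM down cleanly first
    ldmadmin delqueue    # remove the existing product queue
    ldmadmin mkqueue     # create a fresh queue (size taken from ldmadmin's configuration)
    ldmadmin start       # bring the LDM back up against the new queue
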
> A final possibility is that the problem is caused by some decoder or
> other program or script launched by pqact. It is relatively easy to
> write a recursive shell script that quickly consumes the process table
> if there are no per-user limits set for a user who tries to debug and
> run such a script (I've done it!). If you have other users on the
> machine, one of their programs could have spawned processes
> recursively or in a loop and used up all the process table entries, so
> when the LDM tried to spawn a decoder process, it hit the limit and
> produced the message.

This I think is most likely. The only problem I see with a different
non-ldm user executing a runaway program is that the syslog message stated
that it was 'out of per-user processes for uid 214', implying that it was
the user ldm which had run out of processes, not necessarily the system as
a whole. I don't know whether, if a non-ldm user filled up the (global)
process table, a different user (ldm) trying to fork a new process would
get that per-user message when it was someone else's process that had run
amok. So it would seem that focusing on the processes ldm spawns, such as
decoders or scripts, might be the first line of attack. I don't recall any
recent changes to these. I last compiled the GEMPAK programs on 7/28/00,
but there were almost no changes to the binaries. The only other new
decoder is the pnga2area program for the new McIDAS compression format,
but I have no reason to suspect this program. I will make a closer
inspection of pqact.conf when I get a chance, to make sure nothing has
escaped my memory on this point.

> Here's a couple of suggestions that might help diagnose the problem.
> First, take ps snapshots (or use top) to see all the ldm processes
> running and try to account for each one from log file entries, to make
> sure there aren't any extra processes being created. The "pgrep"
> command on Solaris 2.7 and later is useful for this, for example
>
>     pgrep -fl -u ldm
>
> shows all processes owned by user "ldm", and piping this into "wc -l"
> would give you a quick count of ldm processes, and would let you
> monitor if ldm processes were climbing slowly. But this would be of
> no help if something triggers spawning a bunch of processes quickly.
> If that happens, it would help to have ps or pgrep output, but to
> catch it you might have to run a cron job that dumped the output of
> pgrep to a file every minute (overwriting the previous file or
> ping-ponging between two files), so that if your process table filled,
> you would have a record of what things looked like within the previous
> minute.

OK, I will try the top and ps file output suggestion. I guess the crontab
entry would be all stars (* * * * *) to indicate it should be run every
minute. But it may be a while before I get another occurrence; it was
around 2 months between the 2 incidents.
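
A minimal sketch of that once-a-minute snapshot job, ping-ponging between
two output files so the last two snapshots survive a blow-up (the script
name, the /usr/local/ldm paths, and the file names below are made up for
illustration):

    #!/bin/sh
    # procwatch.sh -- dump a snapshot of the ldm user's processes,
    # alternating between two files so the previous minute is preserved.
    DIR=/usr/local/ldm/logs              # assumed location; adjust as needed
    if [ -f $DIR/procwatch.flip ]; then
        OUT=$DIR/procsnap.A
        rm -f $DIR/procwatch.flip
    else
        OUT=$DIR/procsnap.B
        touch $DIR/procwatch.flip
    fi
    date                  >  $OUT
    pgrep -fl -u ldm      >> $OUT        # full command lines of every ldm process
    pgrep -u ldm | wc -l  >> $OUT        # quick count, to watch for a slow climb

and the crontab entry for the ldm user would indeed be all stars, so the
script runs once a minute:

    * * * * * /usr/local/ldm/bin/procwatch.sh
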
Also, I will be upgrading to GEMPAK 5.6 sometime in the next couple of
weeks, so that will change the decoder environment. And while I'm doing
that, I'll be taking a closer look at pqact.conf to see if there's
anything amiss there.

The only other things that come to mind are these. A much earlier version
of the NCSA httpd server (v. 1.3.4, I think) used, under some conditions,
to start forking hundreds of copies of the daemon on our system. Upgrading
to v. 1.5.2 about 4 years ago solved that problem, and I have no special
reason to believe the server caused this one, other than that it caused a
similar problem in the past. I suppose I should have upgraded to the
Apache server before now. The other possible problem is a script (run out
of ldm's crontab) that downloads some radar graphics from an NWS http
server. Under some conditions, which arise fairly frequently, the script
and its child processes don't exit and remain on the system. This results
in 3 additional processes for each hour that it happens, but they aren't
forking more processes, which I think would be required to fill up the
process table.

Something I thought of too late is that I'm running the syscheck script
distributed with the LDM. Since I almost never look at the output, I
forgot last week that I was running it. I didn't think of this until
Monday, and the output had been rotated out of existence by then. (I will
rotate 7 logs from now on.) But I'm not sure that script gives the sort of
info that might be useful here anyway. I will have to go back to my tapes
and see what's in the logs for the first incident in October. I will let
you know if I find anything interesting.

> I'm still very interested in resolving whether this is a symptom of an
> LDM bug, so if you find out anything else, please let me know.
> Thanks!

Really, the LDM on SPARC Solaris has been an extremely stable, I might say
awesomely stable, platform in recent years. However, I guess no program,
no matter how stable, is without bugs. They just become more obscure and
are triggered under infrequent sequences of conditions.

Tom

------------------------------------------------------------------------------
Tom McDermott                           Email: address@hidden
System Administrator                    Phone: (716) 395-5718
Earth Sciences Dept.                    Fax: (716) 395-2416
SUNY College at Brockport

------- End of Forwarded Message