This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Tom McDermott wrote:
>
> On Tue, 4 Dec 2001, Anne Wilson wrote:
> >
> > I believe pqbinstats reads the queue, so that might explain the queue
> > corruption. It is not uncommon to see the runaway rpc.ldmd processes
> > once the ldm gets in such a confused state. At that point, killing them
> > by hand like you did may be the only option.
>
> No, the queue corruption was the result of the system crashing. The
> system was ingesting normally (I have a watch window on my workstation so
> I know) at the time of the crash. The log said it couldn't sync the disc,
> hence the corruption. It has been a number of years since I've had this
> queue corruption problem. But as I recall, the pqact processes at least
> terminated, nor were dozens of rpc.ldmd processes spawned as a result of
> connection attempts. Perhaps this behavior has changed now. Ideally,
> once it determines that the pq is corrupt, the entire system should shut
> down, since there is no point in continuing. How difficult that would be
> to detect from the programmer's point of view, I don't know.
>
> So my message was really directed more toward the post-crash behavior of
> ldm than toward the cause of the crash.
>

Tom,

Yes, you're right - pqbinstats would not have corrupted the queue. That
was a half baked thought on my part.

In looking back over the archives and searching my memory, I know of
only a few cases of runaway processes within the past few years. One
was yours, almost a year ago, 12/14/2000, where ldm processes overran
your machine. In that case it was not clear that they were all rpc.ldmd
processes. Another was Gilbert's machine, and seemed to correlate with
a particular site trying to connect. Indeed, one of our own machines
had this problem when a remote host running an unsupported OS was trying
to connect. Another instance was a site that had upgraded to a version
that required a new queue, but had not upgraded the queue.

This last one was the only one that clearly involved the queue, indeed,
it seems like the other two didn't. And, there have been instances
where the queue was corrupted where the ldm didn't spawn processes like
this.

The only really major changes to the ldm recently were the queue
algorithms that were improved about a year ago. I don't think that
would affect the spawning of children, but I could be wrong about that.

I'm not sure how hard it would be to detect a corrupted queue from
within the code. Since the problem doesn't appear to occur very often,
I will leave it at making a note of it. If it reoccurs, I'll reexamine
this position.

Anne

> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                 Email: address@hidden
> Systems Administrator         Phone: (716) 395-5718
> Earth Sciences Dept.          Fax: (716) 395-2416
> SUNY College at Brockport

--
***************************************************
Anne Wilson                     UCAR Unidata Program
address@hidden                  P.O. Box 3000
                                Boulder, CO 80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************