This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
>Date: Wed, 20 Dec 2000 12:35:25 -0500
>From: Tom McDermott <address@hidden>
>To: Russ Rew <address@hidden>
>Subject: Re: 20001214: LDM: out of per user processes

Tom,

Regarding the problem you reported with too many processes on an LDM host, we recently had a similar occurrence of too many LDM processes. The circumstances were a power outage, followed by the LDM being automatically restarted on reboot without remaking the product queue when the power came back on. The power outage may have resulted in a corrupted queue, and we now have a theory for how a corrupt queue can cause extra processes to be spawned.

After the LDM is restarted with a corrupted queue, a downstream site requests a feed. The LDM spawns a sender process to provide the feed, and the sender process starts an hour back in the queue to send any missed products. The sender process gets stuck in a loop accessing products by insertion time in the corrupted queue, perhaps because the "next product by insertion time" function returns a previous product. The downstream site notices it isn't getting any response to its FEEDME request, so it sends another FEEDME request. The LDM assumes the unresponsive sender process has died, so it starts up another one.

It may be more complicated than this, with a different process getting a lock on a region of the queue and never giving the lock up due to looping, deadlocking other processes including a sender process.

If this is what is really happening, a possible fix would involve having each queue scanner process notice when it isn't making any progress through the queue and return an error indication before exiting.

This may be a difficult bug to reproduce, because we need to get a queue into an inconsistent state that will cause another process to loop accessing its products, but we'll see if we can reproduce it.

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                            http://www.unidata.ucar.edu
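
The possible fix described above (a queue scanner that notices it is not advancing and stops with an error) can be illustrated with a minimal sketch. This is not LDM code: the names `scan_queue`, `NoProgressError`, `next_after`, and `healthy_lookup` are hypothetical, and the product queue is modelled as a plain list of `(insertion_time, name)` pairs. The only idea it shows is the check that each product returned by "next product by insertion time" is strictly later than the previous one.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a queue entry: (insertion_time, product_name).
Product = Tuple[float, str]


class NoProgressError(RuntimeError):
    """Raised when the scan stops advancing through the queue."""


def scan_queue(
    next_after: Callable[[float], Optional[Product]],
    start: float,
    send: Callable[[Product], None],
) -> int:
    """Send every product inserted after `start`; return how many were sent.

    `next_after(t)` is assumed to return the first product whose insertion
    time is later than `t`, or None at the end of the queue. If it ever
    returns a product that is not later than the cursor, the queue is
    treated as inconsistent and the scan aborts instead of looping.
    """
    cursor = start
    count = 0
    while True:
        product = next_after(cursor)
        if product is None:
            return count
        inserted_at = product[0]
        if inserted_at <= cursor:
            # No forward progress: report an error rather than spin forever.
            raise NoProgressError(
                f"lookup after {cursor} returned insertion time {inserted_at}"
            )
        send(product)
        cursor = inserted_at
        count += 1


def healthy_lookup(queue: List[Product]) -> Callable[[float], Optional[Product]]:
    """Build a well-behaved `next_after` over an in-memory list (demo only)."""
    def next_after(t: float) -> Optional[Product]:
        later = [p for p in queue if p[0] > t]
        return min(later) if later else None
    return next_after
```

With a consistent queue the scan simply runs to the end. With a lookup that keeps returning the same earlier product, the scan raises `NoProgressError` on the second call instead of looping, which would let the parent process see a real exit status rather than an unresponsive child.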