This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden                             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Mon, 15 May 2000 12:56:41 -0600
From: Russ Rew <address@hidden>
To: address@hidden address@hidden
Subject: incident with LDM on shemp, Friday, May 12, 21:49:00

Chiz,

We restarted the LDM on shemp this morning by shutting it down normally
and then rebooting shemp, because someone noticed and reported that an
ADDE server on shemp wasn't responding.

I copied all the log files and Mike captured output from top and ps
before rebooting; we also copied the old product queue just in case it
would be useful. All this is available on shemp in the directory
/local/ldm/logs/incident/.

At this point I'm not studying this too closely, because it looks like
it may be just problems caused by a router problem that occurred on
Friday and took a while to fix. I'm afraid I went up to the "Spring
Fling 2000" on the Mesa and missed most of this, but I probably should
have checked shemp's LDM over the weekend.

Apparently this morning before shutting down shemp's LDM there were 301
rpc.ldmd processes running (see ps.out), lots of other associated
processes, and the load average was about 272 (see top.out). The
pqmon.log.1 showed that products stopped going into the queue between
21:48:45 and 21:49:01 on Friday:

  May 12 21:48:45 pqmon: 116649 128 43223 1875448600 159996 3832 2 9747720 6891
  May 12 21:49:01 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6906
  May 12 21:49:16 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6921
  May 12 21:49:31 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6936
  ...

and ldmd.log.2 shows symptoms of network problems during the hour before
that (lots of "pq_sequence failed: I/O error (errno = 5)" and
"nullproc(<hostname>): RPC: Unable to receive" and RECLASS messages).

Robb says it's expected that the LDM starts up a lot of extra processes
when the network is flaky as it was on Friday, so unless you see
something else funny about this, I'm assuming it has nothing to do with
the new pq code.

--Russ

P.S. There's a problem with starting up pqmon from an exec line in
ldmd.conf. Instead of sleeping for 30 second intervals between
outputting status lines with

  exec "pqmon -i 30 -l /usr/local/ldm/logs/pqmon.log"

it outputs the product queue status every time a product comes in,
because it gets woken up by the product group signal that a new product
is available. I'll have to fix this, but for now I just killed the pqmon
started in the LDM's product group and started up a new one outside the
rpc.ldmd's product group appending to pqmon.log ...
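
For anyone looking at a similar pqmon log, the point where products
stopped going into the queue can be located with a short script. The
following is only a sketch in Python, not part of the LDM: the helper
names parse_pqmon and find_stall are made up for illustration, and the
message does not say what each pqmon column means. The script assumes,
based only on the excerpt (where the last column rises by 15 on each
line while the others stay fixed), that the last column is
time-dependent and that every other column stays constant once the
queue stops changing.

```python
import re

# Excerpt copied from the pqmon.log.1 lines quoted in the message.
PQMON_EXCERPT = """\
May 12 21:48:45 pqmon: 116649 128 43223 1875448600 159996 3832 2 9747720 6891
May 12 21:49:01 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6906
May 12 21:49:16 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6921
May 12 21:49:31 pqmon: 116705 128 43167 1876524032 159996 3832 2 9747720 6936
"""

# Timestamp, the literal "pqmon:", then one or more integer columns.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+pqmon:\s+((?:\d+\s*)+)$"
)


def parse_pqmon(text):
    """Return a list of (timestamp, [int, ...]) for each pqmon status line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a pqmon status line
        fields = [int(tok) for tok in match.group(2).split()]
        rows.append((match.group(1), fields))
    return rows


def find_stall(rows, min_repeats=2):
    """Find where every column except the last stops changing.

    Returns (timestamp of last changing line, timestamp of first static
    line), or None if no run of min_repeats + 1 identical lines is found.
    ASSUMPTION: the last column is time-dependent and is ignored.
    """
    static = [fields[:-1] for _, fields in rows]
    for i in range(1, len(rows) - min_repeats):
        window = static[i:i + min_repeats + 1]
        changed_before = static[i - 1] != static[i]
        if changed_before and all(w == static[i] for w in window):
            return rows[i - 1][0], rows[i][0]
    return None


if __name__ == "__main__":
    print(find_stall(parse_pqmon(PQMON_EXCERPT)))
```

On the excerpt above, find_stall returns the pair
("May 12 21:48:45", "May 12 21:49:01"), which matches the interval
given in the message. The min_repeats parameter is there so that a
single quiet interval with no new products is not reported as a stall.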