This archive contains answers to questions sent to Unidata support through mid-2025. Note that the archive is no longer being updated. We provide the archive for reference; many of the answers presented here remain technically correct, even if somewhat outdated. For the most up-to-date information on the use of NSF Unidata software and data services, please consult the Software Documentation first.
Kevin,

> To: Unidata Support <address@hidden>
> cc: David Knight <address@hidden>,
> cc: address@hidden,
> cc: address@hidden,
> cc: Tom McDermott <address@hidden>
> From: "Kevin R. Tyle" <address@hidden>
> Subject: Re: 20030519: NLDN inject machine problems after upgrading to LDM-6?
> Organization: UCAR/Unidata
> Keywords: 200305191919.h4JJJgLd028120

The above message contained the following:

> This is the 3rd time we've had to restart the ldm since upgrading
> to LDM6.

I'm surprised and a bit alarmed. We haven't seen this behavior
anyplace else -- but Striker's situation is somewhat unique. I'd like
to get to the bottom of the problem as quickly as possible.

...

> We're still collecting data to diagnose the problem. The first
> two times, we saw messages saying "Too many open files" in the
> logs.

To how many downstream LDM-s does Striker send the NLDN data? Do they
request the data or does Striker initiate the connection?

> This time, such messages did not appear. However, we did
> see these messages in the ldm log file, beginning about 9
> hours prior to the outage:
>
> May 17 08:21:57 striker rpc.ldmd[29795]: fork: Not enough space

The above message means that the LDM couldn't fork itself (in order to
respond to an incoming request to the LDM server or to request data
from an upstream LDM). This was probably due to a shortage of swap
space. How much swap space does Striker have?

> The messages resumed again at 17Z
>
> May 17 17:16:45 striker rpc.ldmd[29795]: fork: Not enough space
>
> Shortly after that time, the # of active rpc.ldmd processes dropped
> from 56 to under 20. These messages popped up:
>
> May 17 19:31:52 striker updraft(feed)[10782]: h_clnt_call: updraft.db.erau.edu: BLKDATA: time elapsed 22.311053

The above message means that Striker took over 22 seconds to send a
data packet to the requesting LDM-5 on Updraft. It's not a serious
error, but does indicate a problem with the connection.

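To see how the fork failures quoted above cluster in time, the log can be tallied by hour. The following is only a sketch: the function name `fork_failures_by_hour` and any log path passed to it are placeholders, and it assumes the syslog-style layout shown in the excerpts (month, day, and HH:MM:SS as the first three fields).

```sh
# Tally "fork: Not enough space" messages per hour from a syslog-style log.
# Usage: fork_failures_by_hour LOGFILE   (the log location is site-specific)
# Note: syslog's "last message repeated N time" lines are not counted,
# so the totals are lower bounds.
fork_failures_by_hour() {
    awk '/fork: Not enough space/ {
             split($3, t, ":")                  # t[1] is the hour, e.g. "08"
             count[$1 " " $2 " " t[1]]++        # key: "May 17 08"
         }
         END { for (k in count) print k, count[k] }' "$1" | sort
}
```

Each output line has the form `May 17 20 1` (month, day, hour, count), sorted lexically, which makes it easy to line the failures up against the memory and open-file measurements.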
> Some more could not fork messages appeared,

Those are very bad.

> and then a bunch of gethostbyaddr failures appear.
>
> May 17 20:10:50 striker rpc.ldmd[29795]: fork: Not enough space
> May 17 20:10:51 striker last message repeated 1 time
> May 17 20:10:52 striker rpc.ldmd[29795]: gethostbyaddr: failed for 128.174.80.47
> May 17 20:10:52 striker rpc.ldmd[29795]: Denying connection from 128.174.80.47
> May 17 20:10:53 striker rpc.ldmd[29795]: gethostbyaddr: failed for 128.118.28.12
> May 17 20:10:53 striker rpc.ldmd[29795]: Denying connection from 128.118.28.12
> May 17 20:10:54 striker rpc.ldmd[29795]: gethostbyaddr: failed for 130.253.215.243
> May 17 20:10:54 striker rpc.ldmd[29795]: Denying connection from 130.253.215.243
> May 17 20:10:54 striker rpc.ldmd[29795]: gethostbyaddr: failed for 155.42.21.33
> May 17 20:10:54 striker rpc.ldmd[29795]: Denying connection from 155.42.21.33
> May 17 20:10:56 striker rpc.ldmd[29795]: gethostbyaddr: failed for 128.196.30.175
> May 17 20:10:56 striker rpc.ldmd[29795]: Denying connection from 128.196.30.175
> May 17 20:10:56 striker rpc.ldmd[29795]: gethostbyaddr: failed for 129.93.52.150
> May 17 20:10:56 striker rpc.ldmd[29795]: Denying connection from 129.93.52.150
> May 17 20:10:57 striker rpc.ldmd[29795]: gethostbyaddr: failed for 166.66.44.84
> May 17 20:10:57 striker rpc.ldmd[29795]: Denying connection from 166.66.44.84

Striker's inability to resolve the above IP addresses means that the
hosts' names won't appear in log messages. This is unfortunate but
not fatal.

> We've seen the gethostbyaddr failures and fork errors each
> time the LDM had problems.

The fork(2) errors are very bad.

> We're now logging the CPU load, real and virtual memory use, and
> # of open files every 5 minutes.

Excellent!

> Our ingest program was modified to reflect changes in pqinsert.c
> from version 5 to version 6--it is indeed possible this is the
> source of the problem.

I'm going to take a closer look at this. You might also try
rebuilding the LDM package with debugging turned on (set environment
variable CFLAGS to "-g" before running the configure script and after
doing a "make distclean"). If you then set the corefile size to
"unlimited" before running the LDM, this may provide a core file that
might give us some useful information.

> We have not made any changes to the open file limits at the system
> level.
>
> More info to follow . . .
>
> --Kevin

Regards,
Steve Emmerson
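
The debug rebuild suggested above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure for any particular LDM release. The function names and the source-directory argument are hypothetical. The `make` and `make install` steps are assumed, since the message only mentions "make distclean", CFLAGS, and the configure script. Starting the LDM with `ldmadmin start` is likewise an assumption about how the LDM is launched at the site.

```sh
# Rebuild the LDM with debugging symbols.
# Usage: rebuild_ldm_debug SRC_DIR   (SRC_DIR: hypothetical path to the LDM source tree)
# The body runs in a subshell, so the cd and the exported CFLAGS do not
# leak into the calling shell.
rebuild_ldm_debug() (
    cd "$1" &&
    make distclean &&                  # remove the previous, non-debug build
    CFLAGS=-g && export CFLAGS &&      # set after distclean, before configure
    ./configure &&
    make &&                            # assumed build step
    make install                       # assumed install step
)

# Start the LDM with core dumps enabled. The limit set by ulimit applies to
# the current shell and is inherited by the processes it starts, so both
# commands run in the same subshell.
start_ldm_with_cores() (
    ulimit -c unlimited &&
    ldmadmin start                     # assumed way of launching the LDM
)
```

If the LDM is started some other way (for example from a boot script), the core-file limit would have to be raised in that script instead, because the limit is inherited from whichever shell launches the process.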